# Shallow Neural Networks

* Create a Neural Network with One Hidden Layer using nn.Module.
* Create a Neural Network with One Hidden Layer using nn.Sequential.
* Train the Neural Network model.
* Explain how adding more neurons in the hidden layer can improve a model.
* Construct Networks with multi-dimensional input in PyTorch.
* Explain what Overfitting and Underfitting are.
* Implement Multi-Class Neural Networks in PyTorch.
* Describe what backpropagation and the vanishing gradient problem are.
* Implement Sigmoid, Tanh and ReLU activation functions in PyTorch.

## What is a neural network?
A neural network is a parameterized function that can approximate a wide range of functions. Training adjusts its parameters (the weights and biases) so that the network's output fits the data.
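
As a concrete instance, the one-hidden-layer network defined by the `Net` class in the next section (sigmoid activations on both layers) can be written as a single formula. The symbols $W_1, b_1, W_2, b_2$ are notation assumed here for the weights and biases of `linear1` and `linear2`; they do not appear in the code itself.

$$
\hat{y} = \sigma\left(W_2\,\sigma\left(W_1 x + b_1\right) + b_2\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$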

## Single Layer Neural Network to classify non-linearly separable 1D data

import torch
import torch.nn as nn
from torch import sigmoid
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

#manually seed the random num generator
torch.manual_seed(0)


#function for plotting the model
def plotStuff(X,Y, model, epoch, leg=True):
    plt.plot(X.numpy(), model(X).detach().numpy(), label=('epoch'+str(epoch)))
    plt.plot(X.numpy(), Y.numpy(), 'r')
    plt.xlabel('X')
    if leg:
        plt.legend()

#define the model
class Net(nn.Module):
    def __init__(self, D_in, H, D_out):
        super(Net, self).__init__()
        self.linear1= nn.Linear(D_in,H)
        self.linear2= nn.Linear(H,D_out)
        #to facilitate visualization and inspection of the network during training
        self.a1= None
        self.l1= None
        self.l2= None
        
    def forward(self,x):
        self.l1= self.linear1(x)
        self.a1= sigmoid(self.l1)
        self.l2= self.linear2(self.a1)
        yhat= sigmoid(self.l2)
        return yhat


#define training function
def train(Y, X, model, optimizer, criterion, epochs=1000):
    cost= []
    for epoch in range(epochs):
        total=0
        for y,x in zip(Y,X):
            yhat= model(x)
            loss= criterion(yhat,y)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            #cumulative loss over the epoch
            total+= loss.item()
        cost.append(total)
        
        #visualization and model inspections at certain intervals
        """
        if epoch%300 == 0:
            plotStuff(X,Y, model, epoch, leg=True)
            plt.show()
            model(X)
            #The detach() method applied on self.a1 creates a new tensor that 
            #is detached from the computational graph, ensuring that 
            #it won't be part of any further computation or gradient calculation.
            plt.scatter(model.a1.detach().numpy()[:,0], model.a1.detach().numpy()[:,1], c=Y.numpy().reshape(-1))
            plt.title('activations')
            plt.show()
        """
    return cost
    

#Make some data
X= torch.arange(-20,20,1).view(-1,1).type(torch.FloatTensor)
Y= torch.zeros(X.shape[0])
Y[(X[:,0] > -4) & (X[:,0] < 4)]= 1.0


#criterion: binary cross-entropy loss
def criterion_cross(outputs, labels):
    out= -1 * torch.mean(labels * torch.log(outputs) + (1 - labels) * torch.log(1 - outputs))
    return out



#size of input
D_in= 1
#size of hidden layer
H= 2
#number of outputs
D_out=1
#learning rate
learning_rate = 0.1
#create the model
model= Net(D_in, H, D_out)
#optimizer
optimizer= torch.optim.SGD(model.parameters(), lr= learning_rate)


#train the model
cost_cross= train(Y, X, model, optimizer, criterion_cross, epochs=1000)

#plot the loss
plt.plot(cost_cross)
plt.xlabel('epoch')
plt.title('cross entropy loss')

## Neural Network with one hidden layer with two neurons using nn.Sequential
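
A minimal sketch of the same 1-2-1 architecture written with `nn.Sequential`. It is not the original notebook's code: the name `seq_model`, the use of `nn.BCELoss` in place of the hand-written `criterion_cross`, full-batch updates instead of per-sample updates, and the 200-epoch budget are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same layout as Net(D_in=1, H=2, D_out=1): Linear -> Sigmoid -> Linear -> Sigmoid
seq_model = nn.Sequential(
    nn.Linear(1, 2),
    nn.Sigmoid(),
    nn.Linear(2, 1),
    nn.Sigmoid(),
)

# Same toy data as above: label 1 for -4 < x < 4, otherwise 0
X = torch.arange(-20, 20, 1).view(-1, 1).float()
Y = torch.zeros(X.shape[0], 1)  # shape (40, 1) to match the model output
Y[(X[:, 0] > -4) & (X[:, 0] < 4)] = 1.0

# Assumption: built-in binary cross entropy and full-batch gradient descent
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(seq_model.parameters(), lr=0.1)

losses = []
for epoch in range(200):
    optimizer.zero_grad()
    yhat = seq_model(X)
    loss = criterion(yhat, Y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"final loss: {losses[-1]:.4f}")
```

`nn.Sequential` removes the need to write `forward` by hand, at the cost of not exposing intermediate values such as `a1` unless the layers are indexed directly (for example `seq_model[0]`).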

## Backpropagation
Backpropagation reduces the number of computations required to calculate gradients. It applies the chain rule backward from the output: the gradients of the output layer are reused to calculate the gradients of the hidden layers, so shared terms are not recomputed.
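
To make this concrete, here is a small sketch that computes the gradients of a 1-2-1 sigmoid network by hand, reusing the output-layer gradient for the hidden layer, and checks the result against autograd. The weights, the input/label pair, and names such as `delta1` and `delta2` are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)

# Tiny 1-2-1 network with sigmoid activations (illustrative random weights)
x = torch.tensor([[1.5]])
y = torch.tensor([[1.0]])
W1 = torch.randn(1, 2, requires_grad=True)
b1 = torch.zeros(2, requires_grad=True)
W2 = torch.randn(2, 1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)

# Forward pass
z1 = x @ W1 + b1           # shape (1, 2)
a1 = torch.sigmoid(z1)     # hidden activations
z2 = a1 @ W2 + b2          # shape (1, 1)
yhat = torch.sigmoid(z2)
loss = -(y * torch.log(yhat) + (1 - y) * torch.log(1 - yhat)).mean()

# Gradients from autograd
loss.backward()

# Manual backward pass using the chain rule
with torch.no_grad():
    delta2 = yhat - y                          # dL/dz2 for sigmoid + cross entropy
    grad_W2 = a1.T @ delta2                    # output-layer weight gradient
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)   # reuse delta2 for the hidden layer
    grad_W1 = x.T @ delta1                     # hidden-layer weight gradient

print(torch.allclose(grad_W2, W2.grad, atol=1e-6))
print(torch.allclose(grad_W1, W1.grad, atol=1e-6))
```

The hidden-layer term `delta1` is built from `delta2` rather than derived from scratch, which is the saving that backpropagation provides.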

## Problem of using the Sigmoid function in deep neural networks
The problem is the vanishing gradient. The derivative of the sigmoid function is at most 0.25, and backpropagation multiplies in one such factor per layer, so the gradient that reaches the layers farthest from the output becomes very small.
To reduce this effect, change the activation function (for example, to ReLU) or use optimization techniques.
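
The shrinking effect can be seen in a deliberately simplified sketch: a chain of scalar sigmoids with no weights at all, so only the activation derivatives contribute. The helper `chained_sigmoid_grad` and the chosen depths are hypothetical and exist only for this illustration; a real network also multiplies in the weights at each layer.

```python
import torch

def chained_sigmoid_grad(depth, x0=0.0):
    """Gradient of sigmoid(sigmoid(...sigmoid(x))) with respect to x."""
    x = torch.tensor(x0, requires_grad=True)
    a = x
    for _ in range(depth):
        a = torch.sigmoid(a)
    a.backward()
    return x.grad.item()

# Each extra sigmoid multiplies the gradient by a factor of at most 0.25
for depth in (1, 2, 5, 10):
    print(depth, chained_sigmoid_grad(depth))
```

Because every factor is at most 0.25, the gradient after `depth` sigmoids cannot exceed `0.25 ** depth`.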

## Commonly Used Activation Functions in neural networks

* ReLU generally performs better than the tanh and sigmoid functions when the network has more hidden layers.

* What is the problem with the tanh and sigmoid activation functions?
  Their derivatives are near zero in many regions (wherever the input is large in magnitude). This causes the vanishing gradient problem.
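
The points above can be checked with a short sketch that uses autograd to evaluate the derivative of each activation at a few inputs. The helper `derivative` and the sample inputs are assumptions made for illustration; `torch.sigmoid`, `torch.tanh` and `torch.relu` are the standard PyTorch functions.

```python
import torch

# Sample inputs, including large-magnitude values where sigmoid and tanh saturate
x = torch.tensor([-6.0, -2.0, 0.5, 2.0, 6.0])

def derivative(fn, inputs):
    """Element-wise derivative of fn at inputs, computed with autograd."""
    inputs = inputs.detach().clone().requires_grad_(True)
    fn(inputs).sum().backward()
    return inputs.grad

d_sigmoid = derivative(torch.sigmoid, x)
d_tanh = derivative(torch.tanh, x)
d_relu = derivative(torch.relu, x)

print("sigmoid'", d_sigmoid)
print("tanh'   ", d_tanh)
print("relu'   ", d_relu)
```

The sigmoid and tanh derivatives shrink toward zero at both ends of the input range, while the ReLU derivative is exactly 1 for every positive input and 0 for every negative one.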