In [None]:
import matplotlib.pyplot as plt
from IPython.display import Image
import numpy as np
%matplotlib inline

### Introduction


### Components of NN
    1. Neuron
    2. Weights/connections
    3. Input layer 
    4. Hidden Layer 
    5. Output layer
    6. Activation function
    7. Weight Initialization
    8. Batch size
    9. Batch Normalization
    10. Dropout
    11. Back propagation
    12. Optimizers

#### 1. Neurons
The building blocks of neural networks are (artificial) neurons. These are simple computational
units that take weighted input signals and produce an output signal using an activation function.
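
As a rough illustration of the description above, a single neuron can be sketched as a weighted sum followed by an activation. The function name `neuron`, the sigmoid activation and the input values are assumptions chosen for this example only.

```python
import numpy as np

def neuron(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = np.dot(w, x) + b              # weighted input signal
    return 1.0 / (1.0 + np.exp(-z))   # activation produces the output signal

x = np.array([0.5, -1.0, 2.0])   # three illustrative input signals
w = np.array([0.1, 0.2, 0.3])    # one weight per input
b = 0.0                          # bias
out = neuron(x, w, b)            # a single number between 0 and 1
```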

#### 2. Weights and bias
Weights and bias play a role similar to the coefficients and intercept in linear regression. Weights are often initialized to small random values, such as values in the range 0 to 0.3, although more complex initialization schemes can be used. As in linear regression, larger weights indicate increased complexity and fragility of the model. It is desirable to keep the weights in the network small, and regularization techniques can be used to do so.

#### 3. Input layer 
Every layer in a neural network is made up of neurons and is connected to the next layer by weights. The input layer is the first layer of a neural network, i.e. all the X's (features) are fed to this layer.

#### 4. Hidden layer

Hidden layers are the intermediate layers between the input and output layers. These layers are the core of any neural network, responsible for learning representations at different levels of detail.

#### 5. Output layer

The output layer is usually the last layer of a neural network, and its neurons hold the predicted values. For instance, in multi-class classification the size of the output layer is simply the number of classes; for binary classification it is 2 (or a single sigmoid neuron).


In [None]:
Image('StructureOfNN.PNG')

***Can a neural network with a single (hidden) layer be considered a linear/logistic regression? If not, can we derive any relationship between a neural net and regression?***


#### 6. Activation function
An activation function is a simple mapping of summed weighted input to the output of the neuron. It is called an activation function because it governs the threshold at which the neuron is activated and the strength of the output signal.

So, the role of the activation function in a neural network is to produce a non-linear decision boundary by applying a non-linear transformation to the linear combination of the weighted inputs.

#### Why?
If we do not apply an activation function, the output signal is simply a linear function of the input. A linear function is just a polynomial of degree one. Linear equations are easy to solve, but they are limited in their complexity and have less power to learn complex functional mappings from data. A neural network without an activation function would simply be a linear regression model, which has limited power and does not perform well most of the time.
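
The collapse into a linear model can be checked numerically. The sketch below stacks two layers with no activation in between and compares them with one equivalent linear layer; the layer sizes and random values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x = rng.normal(size=(3, 1))

# Two layers with no activation function in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

collapsed = np.allclose(two_layers, one_layer)
print(collapsed)
```

However many such layers are stacked, the result stays a single `W x + b`, which is why the non-linearity is needed.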



##### Sigmoid
The main reason we use the sigmoid function is that its output lies between 0 and 1. It is therefore especially useful for models that have to predict a probability as the output. Since a probability only takes values between 0 and 1, sigmoid is a natural choice.

In [None]:
#### activation function
def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

# integer inputs from -10 to 10, used for the plots below
x = np.arange(-10, 11)


##### tanh

In [None]:
def tanh(z):
    return(((2.0/(1.0+np.exp(-2*z))) -1))

In [None]:
plt.plot(x, sigmoid(x))
plt.plot(x, np.tanh(x), 'r')

One point to mention is that the gradient is stronger for tanh than for sigmoid (its derivative is steeper). Deciding between sigmoid and tanh will depend on your requirement for gradient strength. There are many other activation functions available; the following are the most commonly used.
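
To put numbers on the gradient comparison, the sketch below evaluates both derivatives on a grid. The helper names `sigmoid_grad` and `tanh_grad` are made up for this example; the derivative of sigmoid peaks at 0.25 and that of tanh at 1, both at z = 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # d/dz sigmoid(z)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2    # d/dz tanh(z)

z = np.linspace(-5, 5, 101)
max_sigmoid_grad = sigmoid_grad(z).max()   # largest slope of sigmoid on the grid
max_tanh_grad = tanh_grad(z).max()         # largest slope of tanh on the grid
```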

In [None]:
Image('activation_function_cheatsheet2.png')


#### 7. Weight Initialization
The weights for the first step of training are mostly randomly initialized. They are then updated on every step of the optimizer. A step is defined as processing one batch of datapoints through feedforward and backpropagation.


In [None]:
def initializer(sizes):
    num_layers = len(sizes)
    biases = [np.random.randn(y, 1) for y in sizes[1:]]
    weights = [np.random.randn(y, x)
                    for x, y in zip(sizes[:-1], sizes[1:])]
    return(num_layers, biases, weights)

This function initializes the weights and biases from the normal distribution with mean 0 and variance 1. Let's try an example of 3 layers with sizes [784, 30, 10].

In [None]:
num_layers, biases, weights = initializer([784, 30, 10])

In [None]:
weights[0].shape

In [None]:
plt.plot(biases[0])

**Note**: As mentioned, most of the time we initialize the weights randomly, but in recent years pre-trained weights have been published for many deep learning architectures. We can also use those weights as our initial weights and train the network on the specific data we want to model.


#### 8. Batch size 
The weights are updated on every step of a learning algorithm (optimizer). Computing an update from a large number of datapoints at once requires a lot of memory, and the batch size helps to handle this issue. Since each step trains the network on a smaller number of samples, the overall training procedure requires less memory. This is especially important if you are not able to fit the dataset in memory.
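
As a small sketch of what a batch size means in practice, the helper below (a hypothetical `make_batches`, with a made-up dataset of 1000 items) cuts a list into consecutive chunks. Only one chunk has to be processed at a time.

```python
import math

def make_batches(data, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [data[k:k + batch_size] for k in range(0, len(data), batch_size)]

samples = list(range(1000))            # stand-in for 1000 training examples
batches = make_batches(samples, 64)
n_batches = len(batches)               # equals ceil(1000 / 64)
last_batch_size = len(batches[-1])     # the final batch may be smaller
```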

#### 9. Batch Normalization
Normalizes the activations of the previous layer over each batch so that they have roughly zero mean and unit variance. It allows higher learning rates and faster training, acts as a regularizer, and simplifies the creation of deeper networks.
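
A minimal sketch of the training-time normalization step, assuming activations are stored with one row per sample. The names `gamma`, `beta` and `eps` (learned scale, learned shift, small constant for numerical stability) are assumptions. The running statistics that are typically kept for use at prediction time are left out.

```python
import numpy as np

def batch_norm(a, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each unit (column) over the mini-batch, then scale and shift."""
    mean = a.mean(axis=0)                     # per-unit mean over the batch
    var = a.var(axis=0)                       # per-unit variance over the batch
    a_hat = (a - mean) / np.sqrt(var + eps)   # roughly zero mean, unit variance
    return gamma * a_hat + beta

rng = np.random.default_rng(0)
acts = rng.normal(loc=5.0, scale=3.0, size=(64, 10))  # 64 samples, 10 units
normed = batch_norm(acts)
```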

#### 10. Dropout
Drops a random fraction of neurons on each iteration, which acts as a regularizer.
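
One common way to implement this is "inverted" dropout, sketched below under the assumption that it is applied only during training. The function name and the `rate` parameter are illustrative. The surviving activations are rescaled so that their expected value is unchanged.

```python
import numpy as np

def dropout(a, rate, rng):
    """Zero a random fraction `rate` of activations and rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(a.shape) < keep_prob   # True where a neuron is kept
    return a * mask / keep_prob

rng = np.random.default_rng(0)
acts = np.ones((1000, 1))                    # dummy activations
dropped = dropout(acts, rate=0.5, rng=rng)   # entries are either 0 or 2
```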



#### 11. Back propagation
Backpropagation is about understanding how changing the weights and biases in a network changes the cost function.


In [None]:
Image('backprop.png')

In [None]:
def backprop(x, y):
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]
    # feedforward
    activation = x
    activations = [x] # list to store all the activations, layer by layer
    zs = [] # list to store all the z vectors, layer by layer (z = Wx+b)
    for b, w in zip(biases, weights):
        z = np.dot(w, activation)+b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # backward pass
    delta = cost_function(activations[-1], y) * \
        sigmoid_prime(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    # Note that the variable l in the loop below is used a little
    # differently to the notation in Chapter 2 of the book.  Here,
    # l = 1 means the last layer of neurons, l = 2 is the
    # second-last layer, and so on.  It's a renumbering of the
    # scheme in the book, used here to take advantage of the fact
    # that Python can use negative indices in lists.
    for l in range(2, num_layers):
        z = zs[-l]
        sp = sigmoid_prime(z)
        delta = np.dot(weights[-l+1].transpose(), delta) * sp
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
    return(nabla_b, nabla_w)

#### 12. Optimizers

The process by which we minimize the cost function, i.e. the difference between the ground truth and the prediction.

Can you suggest some optimizers?


##### Algorithm

A neural network can be described in the following steps:

    1. Initialize weights (W) and bias (b) for inputs x1, x2, x3, ..., xp

    2. Feed forward through the initialized/updated weights
         2a. Obtain Wx+b
         2b. Apply the activation function

    3. Back propagate the error (compute gradients, update the weights)
         3a. Calculate the error
         3b. Compute the gradients deltaW and deltab (the rate of change of the error with respect to the weights and bias)
         3c. Update the weights and bias by stepping against the gradient: W = W - eta*deltaW ; b = b - eta*deltab (eta is the learning rate)

    4. Repeat steps 2 and 3 for n epochs

Note: the feed forward pass goes through all l layers (1 to l); the errors are then calculated for each layer as backpropagation runs back through all the layers in reverse (l to 1).
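
The steps above can be sketched end to end for the smallest possible case: a single sigmoid neuron trained on a toy problem. The data (logical OR), the learning rate and the epoch count are assumptions made for illustration. The error term follows the quadratic-cost form (output minus target, times the sigmoid derivative).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: the logical OR of two inputs (each column is one sample)
X = np.array([[0., 0., 1., 1.],
              [0., 1., 0., 1.]])
Y = np.array([[0., 1., 1., 1.]])

rng = np.random.default_rng(0)
W = rng.normal(size=(1, 2))          # step 1: initialize weights
b = np.zeros((1, 1))                 #         and bias
eta = 1.0                            # learning rate (illustrative value)

for epoch in range(2000):            # step 4: repeat for n epochs
    A = sigmoid(W @ X + b)           # step 2: Wx + b, then activation
    error = A - Y                    # step 3a: error at the output
    delta = error * A * (1 - A)      # chain rule through the sigmoid
    dW = delta @ X.T / X.shape[1]    # step 3b: gradient w.r.t. the weights
    db = delta.mean(axis=1, keepdims=True)   # and w.r.t. the bias
    W -= eta * dW                    # step 3c: step against the gradient
    b -= eta * db

predictions = (sigmoid(W @ X + b) > 0.5).astype(int)
```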

### 2. Derivative and cost function

From the algorithm, step 3 describes the update through backpropagation. To accomplish this we need to find the derivatives of the error (cost function). The cost is a function of the targets and of the network's output, which is produced by the activation function (sigmoid here), as defined below. To find the gradient (which tells us in which direction to move towards the optimum), we have to calculate the derivative. The derivative of sigmoid is given in the cell below; write the derivative functions for tanh and ReLU.
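
One way to check a hand-written derivative is to compare it with a finite-difference approximation. The sketch below does this for sigmoid; `numerical_derivative` and the step size `h` are assumptions of this example. The same check could be pointed at a tanh or ReLU derivative (ReLU is not differentiable at exactly 0, so that point is best left out of the grid).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def numerical_derivative(f, z, h=1e-5):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

z = np.linspace(-4, 4, 9)
# Largest disagreement between the analytic and the numerical derivative
max_gap = np.max(np.abs(sigmoid_prime(z) - numerical_derivative(sigmoid, z)))
```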




In [None]:

#### derivatives
def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

#### cost function
def cost_function(output_activations, y):
    r"""Return the vector of partial derivatives \partial C_x /
    \partial a for the output activations (the derivative of the
    quadratic cost, not the cost itself)."""
    return (output_activations-y)


### 3. Feed forward



In [None]:
def feedforward(a):
    """Return the output of the network if "a" is input."""
    for b, w in zip(biases, weights):
        a = sigmoid(np.dot(w, a)+b)
    return a

### 5. Weight update

In [None]:
def update_mini_batch(mini_batch, eta, biases, weights):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    weights = [w-(eta/len(mini_batch))*nw
                    for w, nw in zip(weights, nabla_w)]
    biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(biases, nabla_b)]
    return(weights, biases)

In [None]:
import random
from pandas import get_dummies
from sklearn.datasets import load_digits

In [None]:
n_class = 10
mdata = load_digits(n_class = n_class)
n_feats = len(mdata.data[0])

In [None]:
print(n_feats)

In [None]:
len(mdata.target)

In [None]:
# one-hot encode the labels once, instead of once per sample
one_hot = get_dummies(mdata.target).astype(int).values
training_data = [(np.reshape(mdata.data[i], (n_feats, 1)),
                  np.reshape(one_hot[i], (n_class, 1)))
                 for i in range(len(mdata.target))]

In [None]:
training_data[0][1]

In [None]:
num_layers, biases, weights = initializer([n_feats, 20, 15, 10,  n_class])
epochs = 300
mini_batch_size = 64
eta = .03  # learning rate

In [None]:
ib, iw = biases, weights # save the initial weights

In [None]:
mini_batch_size

In [None]:
n = len(training_data)  # number of training examples
mini_batches = [
    training_data[k:k+mini_batch_size]
    for k in range(0, n, mini_batch_size)]

### 6. Optimizer

In [None]:
"""Train the neural network using mini-batch stochastic
gradient descent.  The ``training_data`` is a list of tuples
``(x, y)`` representing the training inputs and the desired
outputs.  The other non-optional parameters are
self-explanatory.  If ``test_data`` is provided then the
network will be evaluated against the test data after each
epoch, and partial progress printed out.  This is useful for
tracking progress, but slows things down substantially."""
n = len(training_data)
for j in range(epochs):
    # shuffing the dataset 
    random.shuffle(training_data)
    # batches - #batches = #datapoints/batch_size
    mini_batches = [
        training_data[k:k+mini_batch_size]
        for k in range(0, n, mini_batch_size)]
    for mini_batch in mini_batches:
        weights, biases = update_mini_batch(mini_batch, eta, biases, weights)
        #print weights[0]
    if j%50 == 0:
        print("Epoch {0} complete".format(j))


# Prediction
Given a new point, how would prediction happen?

In [None]:
feedforward(training_data[89][0])

In [None]:
predicted = [np.argmax(feedforward(training_data[i][0])) for i in range(len(training_data))]
actual = [np.argmax(training_data[i][1]) for i in range(len(training_data))]
print(predicted[:5], actual[:5])

# Wow, we got 20% accuracy, might not be true, calculate yourself :P
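
For the accuracy itself, one possible helper is sketched below. The function name `accuracy` and the example labels are made up; in the notebook it would be called on the `predicted` and `actual` lists from the previous cell. Note that those are computed on the training data, so the result says nothing about unseen points.

```python
import numpy as np

def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted == actual))

# Made-up labels, only to show the call
example_predicted = [3, 1, 4, 1, 5]
example_actual = [3, 1, 4, 0, 5]
example_accuracy = accuracy(example_predicted, example_actual)   # 4 of 5 match
```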

### Post model activities (optional)

For some reason, we have saved the initial weights. Let's play with them.

1. Take the difference between the initial and final weights for all the layers, obtain descriptive statistics and have a look. You might find something interesting; if so, post it in the forum.

2. Plot the trained activations of each layer and enjoy seeing your pattern being captured, if it happens; otherwise play with the architecture till you're convinced ;P.

3. Record the trained weights at every 10th epoch while training, and make distribution plots for each of them. Why? That's how it is usually done; the reason should become clear in time.

4. Read through the 3 activities above and ask for clarification if any sentence is unclear. If you have any thoughts about these, reply to the contributors/forum.
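
For the first activity, a possible starting point is sketched below. The helper `weight_change_summary` and the stand-in arrays are hypothetical; in the notebook the saved `iw` and the trained `weights` would be passed instead. The sign convention here is final minus initial.

```python
import numpy as np

def weight_change_summary(initial, final):
    """Per-layer summary statistics of (final - initial) weights."""
    summary = []
    for layer, (w0, w1) in enumerate(zip(initial, final)):
        diff = w1 - w0
        summary.append({
            "layer": layer,
            "mean": float(diff.mean()),
            "std": float(diff.std()),
            "max_abs": float(np.abs(diff).max()),
        })
    return summary

# Hypothetical stand-ins for the saved initial and the trained weights
rng = np.random.default_rng(0)
initial = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
final = [w + 0.1 for w in initial]     # pretend every weight moved by 0.1
summary = weight_change_summary(initial, final)
```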

In [None]:
iw[0]- weights[0]

In [None]:
iw[1] - weights[1]