<b>Neural networks vaguely mimic how the brain works, with neurons that fire bits of information.</b>

Let's take a classification example where the data has two class labels and can be separated linearly.

Linear Boundaries

        Prediction: 1 if Wx + b >= 0
                    0 if Wx + b < 0
             
In 2-dimensional data, the classification boundary is a line<br>
In 3-dimensional data, the classification boundary is a plane<br>
In n-dimensional data, the classification boundary is an (n-1)-dimensional hyperplane<br>

#### Perceptron

It is the building block of a neural network.

In [10]:
## Example of neuron code
inputs = [1, 3, 5]
weights = [0.4, 0.7, 0.6]
bias = 2

output = inputs[0]*weights[0] + inputs[1]*weights[1] + inputs[2]*weights[2] + bias
output

7.5
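
The weighted sum above can also be written as a dot product. The sketch below assumes NumPy and adds a step activation (a hypothetical helper named `step`, not part of the cell above) to turn the sum into the 0/1 prediction described under Linear Boundaries.

```python
import numpy as np

inputs = np.array([1, 3, 5])
weights = np.array([0.4, 0.7, 0.6])
bias = 2

# Weighted sum of the inputs plus the bias (same value as the explicit sum)
weighted_sum = np.dot(inputs, weights) + bias

def step(z):
    """Step activation: 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0

neuron_output = step(weighted_sum)
```

Since the weighted sum is positive, the neuron outputs 1 for this input.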

#### Perceptron Algorithm

+ Start with random weights and biases
+ For every misclassified point xi
       if the prediction is 0 (the true label is 1)
           then we update the weights
            Change W = W + lr*xi
            Change b = b + lr
       if the prediction is 1 (the true label is 0)
           then we update the weights
            Change W = W - lr*xi
            Change b = b - lr

In [110]:
import numpy as np

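def prediction(X, W, b):
    # Step function: predict 1 when the linear score Wx + b is non-negative
    if np.matmul(X, W) + b >= 0:
        return 1
    return 0

def perceptron(X, y, W, b, lr):
    # One pass over the data, adjusting W and b for every misclassified point
    for i in range(len(X)):
        y_hat = prediction(X[i], W, b)
        if y[i] - y_hat == 1:
            # Label is 1 but prediction is 0: move the boundary toward the point
            W[0] += lr*X[i][0]
            W[1] += lr*X[i][1]
            b += lr
        elif y[i] - y_hat == -1:
            # Label is 0 but prediction is 1: move the boundary the other way
            W[0] -= lr*X[i][0]
            W[1] -= lr*X[i][1]
            b -= lr

    return W, b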

In [111]:
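# Truth table of the logical AND: the output is 1 only for input (1, 1)
X = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])
y = np.array([0, 0, 0, 1])
W = np.random.rand(2, 1)
b = 0.1
# Repeat the perceptron pass 100 times with learning rate 0.1
for i in range(100):
    W, b = perceptron(X, y, W, b, 0.1)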

In [114]:
for i in range(len(X)):
    print("prediction value for input {} is {}".format(X[i],prediction(X[i], W, b)))

prediction value for input [0 0] is 0
prediction value for input [0 1] is 0
prediction value for input [1 0] is 0
prediction value for input [1 1] is 1


#### Error functions

The error tells us how badly we are doing at the moment and how far we are from an ideal solution. If we keep taking steps that decrease the error, we will eventually solve our problem.

In order to do gradient descent, our error function cannot be discrete. It should be <b>continuous</b> and it needs to be <b>differentiable</b>.

To perform gradient descent our predictions should also be continuous. The way to move from discrete to continuous predictions is simply to change the activation function from the step function to the sigmoid.

#### Sigmoid function

The sigmoid function is defined as sigmoid(x) = 1/(1 + e^(-x)).

It returns a value between 0 and 1, which can be read as a probability.
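
A minimal sketch of the sigmoid, assuming NumPy; the example scores are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Made-up linear scores Wx + b for three points
scores = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(scores)
```

A score of 0 (a point exactly on the boundary) maps to 0.5, while large negative and large positive scores map close to 0 and 1 respectively.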

#### Softmax Function

Let's say we have n classes and a linear model that gives us the following scores:

Linear function scores: z1, z2, ..., zn

p(class i) = e^zi / (e^z1 + ... + e^zn)

In [115]:
def softmax(scores):
    probs = []
    sum_exp = sum(np.exp(scores))
    for i in range(len(scores)):
        probs.append(np.exp(scores[i])/(sum_exp))
    return probs

In [125]:
softmax([1, 2, 0])

[0.24472847105479764, 0.6652409557748219, 0.09003057317038046]
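
For very large scores, `np.exp` can overflow. A common remedy, which is not part of the notes above, is to subtract the largest score before exponentiating; this leaves the probabilities unchanged. The sketch below assumes that approach, and `stable_softmax` is a hypothetical name.

```python
import numpy as np

def stable_softmax(scores):
    """Softmax that subtracts the largest score before exponentiating."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)  # largest entry becomes 0, so exp() cannot overflow
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

probs = stable_softmax([1, 2, 0])
large_probs = stable_softmax([1000, 1001, 999])
```

Both calls give the same probabilities, because the second list is just the first one shifted by 999.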

#### One Hot Encoding

All our algorithms are numerical, which means the inputs must be numbers, but input data is not always numeric. We use one-hot encoding to represent categorical variables numerically: each category gets its own 0/1 column.
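
A small sketch of one-hot encoding using only NumPy. The helper `one_hot` and the animal labels are made up for illustration.

```python
import numpy as np

def one_hot(labels):
    """Return the sorted category names and one 0/1 row per label."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(labels), len(categories)), dtype=int)
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1  # mark the column of this label's category
    return categories, encoded

categories, encoded = one_hot(["duck", "beaver", "walrus", "duck"])
```

Each row contains a single 1, in the column that belongs to that row's category.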

#### Maximum Likelihood

We pick the model that gives the existing labels the highest probability. In other words, we choose the model by maximizing that probability.

Our prediction is y = g(Wx+b)<br>
Probability of the four points under model 1:<br>
    p(r1)*p(r2)*p(b1)*p(b2) = 0.1*0.7*0.6*0.2 = 0.0084

Probability of the four points under model 2:<br>
    p(r1)*p(r2)*p(b1)*p(b2) = 0.8*0.6*0.7*0.9 = 0.3024
    
The model that classifies most points correctly has the higher p(all), so p(all) indicates how accurate the model is.

Instead of multiplying probabilities, we add their logarithms, using ln(0.1*0.7*0.6*0.2) = ln(0.1)+ln(0.7)+ln(0.6)+ln(0.2)

The logarithm of a probability is negative, so we take the negative of the sum:

cross entropy of model 1 = -(ln(0.1)+ln(0.7)+ln(0.6)+ln(0.2)) = 4.8

cross entropy of model 2 = -(ln(0.8)+ln(0.6)+ln(0.7)+ln(0.9)) = 1.2

A good model is one with low cross entropy.
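
The arithmetic above can be checked with a few lines of NumPy. This is only a sketch; the variable names are made up, and the probabilities are the ones listed for the two models.

```python
import numpy as np

# Probabilities each model assigns to the true labels of the four points
model1 = np.array([0.1, 0.7, 0.6, 0.2])
model2 = np.array([0.8, 0.6, 0.7, 0.9])

likelihood1 = np.prod(model1)  # product of the probabilities
likelihood2 = np.prod(model2)

cross_entropy1 = -np.sum(np.log(model1))  # negative sum of the logarithms
cross_entropy2 = -np.sum(np.log(model2))
```

The model with the larger product of probabilities is also the one with the smaller cross entropy, since the cross entropy is the negative logarithm of that product.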

<b>Minimizing cross entropy</b>

Cross entropy really says the following: if I have a bunch of events and a bunch of probabilities, how likely is it that those events happen based on the probabilities?<br>
If it's very likely, we have a small cross entropy<br>
If it's unlikely, we have a large cross entropy

#### Cross entropy formula

CE = -sum(i) [yi*ln(pi) + (1-yi)*ln(1-pi)]

In [188]:
def cross_entropy(y, p):
    y = np.array(y, dtype='float')
    p = np.array(p, dtype='float')
    assert len(p) == len(y)
    
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

In [189]:
cross_entropy([0,0,1], [0.8, 0.7, 0.1])

5.115995809754082

In [190]:
cross_entropy([1,1,0], [0.8, 0.7, 0.1])

0.6851790109107685

#### Multi-class Cross entropy formula

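CE = -sum(i) sum(j) yij*ln(pij)

Here yij is 1 if point i belongs to class j and 0 otherwise, and pij is the predicted probability that point i belongs to class j.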
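
A sketch of the multi-class formula, assuming the labels are one-hot encoded rows and each row of probabilities sums to 1. The function name and the sample numbers are made up.

```python
import numpy as np

def multiclass_cross_entropy(y, p):
    """CE = -sum_i sum_j y[i, j] * ln(p[i, j]) for one-hot rows y."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    assert y.shape == p.shape
    return -np.sum(y * np.log(p))

# Two points, three classes: point 0 is class 0, point 1 is class 1
y = [[1, 0, 0], [0, 1, 0]]
p = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
ce = multiclass_cross_entropy(y, p)
```

Only the probability assigned to the true class of each point contributes, so here the result equals -(ln(0.7) + ln(0.8)).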

#### Non Linear data 

We can combine linear models to create a non-linear model.
We calculate the probability from one model and the probability from the other, add them, and then apply the sigmoid function to the sum.
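
A rough sketch of this combination. All weights and biases below are hypothetical values chosen for illustration, and weighting the two probabilities before adding them is an assumption that goes slightly beyond a plain sum.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_model(x, w, b):
    """Probability from one linear model followed by a sigmoid."""
    return sigmoid(np.dot(w, x) + b)

def combined_model(x):
    # Hypothetical parameters of the two linear models
    p1 = linear_model(x, np.array([5.0, -2.0]), -8.0)
    p2 = linear_model(x, np.array([7.0, -3.0]), 1.0)
    # Add the two probabilities (weighted by 7 and 5, with bias -6),
    # then squash the sum back into (0, 1)
    return sigmoid(7.0 * p1 + 5.0 * p2 - 6.0)

p = combined_model(np.array([0.4, 0.6]))
```

The result is again a probability, but the boundary it describes is no longer a straight line.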


#### Neural Network Architecture

A neural network contains the following layers, and each layer contains a number of nodes:
+ Input Layer
+ Hidden Layer
+ Output Layer


When a network has more hidden layers, it is called a deep neural network.

<img src='images/nn1.png' width=600>


<img src='images/nn2.png' width=600>


<img src='images/nn3.png' width=600>


<b> Binary Classification </b>
<img src='images/binary_classification.png' width=600>

<b> Multi Class Classification </b>
<img src='images/multiclass_classification.png' width=600>

#### Feed Forward

+ Deep feedforward networks are also called feedforward neural networks or multilayer perceptrons (MLPs).

+ The goal of a feedforward network is to approximate some function f*.
    For example, for a classifier, y = f(x) maps an input x to a category y.

+ A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.

+ These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks.

+ Feedforward neural networks are called networks because they are typically represented by composing together many different functions. For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on.

<img src='images/feedforward.png' width=800>
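
The chain of three layer functions described above can be sketched as plain function composition. The layer sizes, the random weights, and the choice of sigmoid activations are assumptions made only for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_layer(W, b):
    """Return a function computing sigmoid(W x + b)."""
    return lambda x: sigmoid(W @ x + b)

rng = np.random.default_rng(0)
# Hypothetical layer sizes: 2 inputs -> 3 hidden -> 3 hidden -> 1 output
f1 = make_layer(rng.normal(size=(3, 2)), np.zeros(3))
f2 = make_layer(rng.normal(size=(3, 3)), np.zeros(3))
f3 = make_layer(rng.normal(size=(1, 3)), np.zeros(1))

def f(x):
    """Information flows one way: x -> f1 -> f2 -> f3 -> y."""
    return f3(f2(f1(x)))

y = f(np.array([1.0, 0.0]))
```

Nothing is fed back into an earlier layer, which is what makes this a feedforward computation.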

#### Back Propagation

Now, we're ready to get our hands into training a neural network. For this, we'll use the method known as backpropagation. In a nutshell, backpropagation will consist of:

+ Doing a feedforward operation.
+ Comparing the output of the model with the desired output.
+ Calculating the error.
+ Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
+ Using this to update the weights and get a better model.
+ Continuing this until we have a model that is good.
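
The steps above can be sketched for a tiny network with one hidden layer. This is an illustration under several assumptions that are not in the notes: sigmoid activations, the mean binary cross entropy as the error, XOR as the toy data, and made-up names, layer sizes, and hyperparameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Feedforward pass: returns hidden activations and output probabilities."""
    W1, b1, W2, b2 = params
    h = sigmoid(X @ W1 + b1)  # shape (n, hidden)
    p = sigmoid(h @ W2 + b2)  # shape (n, 1)
    return h, p

def loss(p, y):
    """Mean binary cross entropy between predictions p and labels y."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def backward(X, y, params):
    """Spread the output error back to every weight using the chain rule."""
    W1, b1, W2, b2 = params
    h, p = forward(X, params)
    d_out = (p - y) / len(X)                 # error at the output pre-activation
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * h * (1 - h)  # error passed back through the hidden sigmoid
    dW1 = X.T @ d_hidden
    db1 = d_hidden.sum(axis=0)
    return [dW1, db1, dW2, db2]

def train(X, y, params, lr=0.1, epochs=500):
    for _ in range(epochs):
        grads = backward(X, y, params)
        params = [w - lr * g for w, g in zip(params, grads)]  # gradient descent step
    return params

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
params = [rng.normal(size=(2, 3)), np.zeros(3), rng.normal(size=(3, 1)), np.zeros(1)]

loss_before = loss(forward(X, params)[1], y)
params = train(X, y, params)
loss_after = loss(forward(X, params)[1], y)
```

Comparing `loss_before` with `loss_after` shows whether the updates are reducing the error. Reaching a good model may need more epochs or a different learning rate.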

#### Overfitting and Underfitting

<img src="images/model_errors.png" width=800>


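Overfitting means the model fits the training data too closely and will not generalize well to new data.

Underfitting means the model is too simple to fit even the training data well.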

Two solutions for the overfitting problem:
  + Early Stopping
  + Regularization

#### Early Stopping

We run gradient descent until the testing error stops decreasing and starts to increase. At that moment we stop. This algorithm is called early stopping.

The drawback is that our training set becomes smaller, because we need to hold out some of the data as a validation set.
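
A sketch of the stopping rule. The function name, the `patience` parameter (how many non-improving epochs to tolerate), and the list of errors are all made up for illustration.

```python
def early_stopping_epoch(validation_errors, patience=1):
    """Return the index of the epoch whose weights should be kept.

    Training stops once the validation error has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch = 0
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # the error has started to rise, so stop training
    return best_epoch

# Made-up validation errors: they fall, then rise as the model overfits
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
stop_at = early_stopping_epoch(errors, patience=2)
```

With these numbers the lowest error is at epoch 3, so the weights from that epoch would be kept.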

<img src="images/early_stopping.png" width=600>

                        "The Whole problem with the
               Artificial Intelligence is that bad models 
               are so certain of themselves, and good models 
                           so full of doubts"
                                           BertrAIND Russell

#### Regularization

When the coefficients are large, the model tends to overfit.<br>
Imposing a constraint, such as penalizing large weights, can reduce overfitting.

There are two types:<br>
L1 - Lasso (add the sum of the absolute values of the coefficients, multiplied by lambda, to the error)<br>
       L1 is good for feature selection<br>
       L1 produces sparse values like (1, 0)<br>
L2 - Ridge (add the sum of the squares of the coefficients, multiplied by lambda, to the error)<br>
       L2 is better for training models<br>
       L2 produces evenly spread values like (0.5, 0.5)

The sum of absolute values is the same for (1, 0) and (0.5, 0.5), but the sum of squares is lower for (0.5, 0.5) (0.25 + 0.25 = 0.5) than for (1, 0) (1 + 0 = 1). L2 therefore prefers (0.5, 0.5), because it gives a smaller sum of squares and in turn a smaller error.
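
The comparison between (1, 0) and (0.5, 0.5) can be checked numerically. The function names are made up and lambda is set to 1 only for illustration.

```python
import numpy as np

def l1_penalty(weights, lam):
    """lambda times the sum of the absolute values of the weights."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """lambda times the sum of the squared weights."""
    return lam * np.sum(np.square(weights))

sparse = np.array([1.0, 0.0])
spread = np.array([0.5, 0.5])

l1_sparse, l1_spread = l1_penalty(sparse, 1.0), l1_penalty(spread, 1.0)
l2_sparse, l2_spread = l2_penalty(sparse, 1.0), l2_penalty(spread, 1.0)
```

The L1 penalty cannot tell the two weight vectors apart, while the L2 penalty is half as large for the evenly spread one.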

#### Dropout

Dropout is another way to overcome overfitting.

When we train neural networks, sometimes one part of the network has very large weights and ends up dominating the training, while the other part of the network does not get trained.

So during training, in each epoch we randomly turn off some of the nodes in the network. This gives all the nodes an opportunity to learn.
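
A sketch of turning nodes off at random. It assumes the commonly used "inverted dropout" variant, which also rescales the surviving activations; that rescaling, the function name, and the drop probability are not taken from the notes.

```python
import numpy as np

def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob (training only)."""
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob  # True for nodes that stay on
    # Scale the survivors so the expected activation stays the same
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 10))
dropped = dropout(hidden, drop_prob=0.2, rng=rng)
```

Roughly 20% of the entries become 0 and the rest are scaled up to 1.25, so the average activation stays close to 1.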

#### Local Minima
<img src="images/local_minima.png" width=600>

#### Random Restart

Sometimes the error can get stuck at a local minimum. To overcome this, we can randomly initialize the weights multiple times and keep the run that reaches the lowest error.
This increases the probability that we will reach the global minimum, or at least a pretty good local minimum.
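
A one-dimensional sketch of random restarts. The error curve, learning rate, and number of restarts are made up; the curve was picked because it has one local and one global minimum.

```python
import numpy as np

def error(x):
    """Toy error curve: local minimum near x = 1.13, global minimum near x = -1.30."""
    return x**4 - 3 * x**2 + x

def error_gradient(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x = x - lr * error_gradient(x)
    return x

rng = np.random.default_rng(0)
starts = rng.uniform(-2.0, 2.0, size=10)           # random restarts
finals = [gradient_descent(x0) for x0 in starts]   # each run ends in some minimum
best_x = min(finals, key=error)                    # keep the lowest-error result
```

Runs that start on the right side of the curve settle in the local minimum, so a single run may or may not find the global one. Keeping the best of several runs makes that much more likely.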

<img src="images/random_restart.png" width=600>

#### Vanishing Gradient

Gradients are calculated by repeatedly multiplying derivatives. The derivative of the sigmoid activation function is small at the far left and far right, so the product becomes tiny and the gradient practically vanishes.

<img src="images/sigmoid.png" width=600>


<img src="images/vanishing_gradient.png" width=600>



<img src="images/tanh.png" width=600>
<b>Tanh is better than sigmoid, but both activation functions can suffer from vanishing gradients because their derivatives are almost zero at the far left and far right. When the derivative is close to zero, there is no clear direction to move in.</b>

<b>The ReLU activation function comes to the rescue here: its derivative is 1 for every positive input (and 0 for negative inputs), so it does not shrink the product of derivatives.</b>
<img src="images/relu.png" width=600>
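
The derivatives of the three activation functions can be compared numerically. This is a sketch; the sample inputs and the depth of 10 layers are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)              # at most 0.25, reached at x = 0

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2      # at most 1, but near 0 for large |x|

def relu_derivative(x):
    return np.where(x > 0, 1.0, 0.0)  # exactly 1 for every positive input

x = np.array([-5.0, 0.0, 5.0])
sigmoid_at_x = sigmoid_derivative(x)
tanh_at_x = tanh_derivative(x)

layers = 10  # hypothetical network depth
# Best case for sigmoid: every layer contributes its maximum derivative
sigmoid_chain = sigmoid_derivative(0.0) ** layers
relu_chain = relu_derivative(np.array(5.0)) ** layers
```

Even in its best case the sigmoid product over 10 layers is below one millionth, while the ReLU product for positive inputs stays at 1.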

#### Exploding Gradient

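Exploding gradients are the opposite problem: the gradient grows very large. We use gradient clipping to deal with them, which caps or rescales the gradient when it gets too large.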
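
One common form of gradient clipping rescales the gradient when its norm exceeds a threshold. The sketch below assumes that variant; the function name and the threshold of 5 are made up.

```python
import numpy as np

def clip_by_norm(gradient, max_norm):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(gradient)
    if norm > max_norm:
        return gradient * (max_norm / norm)  # same direction, smaller length
    return gradient

exploding = np.array([30.0, -40.0])  # L2 norm is 50
clipped = clip_by_norm(exploding, max_norm=5.0)
small = clip_by_norm(np.array([0.3, 0.4]), max_norm=5.0)
```

The large gradient keeps its direction but is shrunk to length 5, while the small one passes through unchanged.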

#### Batch and Stochastic Gradient Descent

<b>Batch Gradient Descent</b>:
In batch gradient descent we update the weights at each step by running through the entire data set. This takes a huge amount of computation, because all the data must be processed on every step.

<b>SGD</b>:
It takes small subsets of the data, runs them through the neural network, calculates the gradient of the error function based on those points, and then moves one step in that direction.
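
The difference can be sketched on a toy linear regression problem. Everything here is an assumption made for illustration: the noise-free data, the mean squared error, the batch size of 8, and the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=64)
y = 2.0 * X + 1.0  # noise-free toy data: true w = 2, b = 1

def gradients(w, b, xb, yb):
    """Gradient of the mean squared error on the points (xb, yb)."""
    err = w * xb + b - yb
    return 2.0 * np.mean(err * xb), 2.0 * np.mean(err)

def train(batch_size, lr=0.1, epochs=200):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        order = rng.permutation(len(X))          # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            dw, db = gradients(w, b, X[idx], y[idx])
            w, b = w - lr * dw, b - lr * db      # one step per batch
    return w, b

w_batch, b_batch = train(batch_size=len(X))  # batch GD: 1 step per epoch
w_sgd, b_sgd = train(batch_size=8)           # small subsets: 8 steps per epoch
```

Both runs see the same data in each epoch, but the run with small subsets takes eight steps per epoch instead of one, and each step uses only eight points.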

#### Learning Rate Decay
<img src="images/learning_rate.png" width=600>


<img src="images/learning_rate_decay.png" width=600>

#### Momentum

<img src="images/momentum.png" width=600>