# **Neural Networks**

**What will you learn?**
1. **Introduction :** Intro to ANNs
2. **Why do we need NN?**
3. **Example with Linear Boundaries** : Negation, AND, OR
4. **Example with Non-Linear Boundaries** : XOR
5. **Terminology**
6. **Propagation**
7. **Cost Function**
8. **Multiclass Classification** : One Hot Encoding
9. **Sklearn Implementation** : MLPClassifier

##**Introduction**


Artificial Neural Networks are one of the main tools used in machine learning. As the “neural” part of their name suggests, they are brain-inspired systems which are intended to replicate the way that we humans learn. Neural networks consist of input and output layers, as well as (in most cases) a hidden layer consisting of units that transform the input into something that the output layer can use. They are excellent tools for finding patterns which are far too complex or numerous for a human programmer to extract and teach the machine to recognize.

For a basic idea of how a deep learning neural network learns, imagine a factory line. After the raw materials (the data set) are input, they are passed down the conveyor belt, with each subsequent stop or layer extracting a different, progressively higher-level set of features. If the network is intended to recognize an object, the first layer might analyze the brightness of its pixels.

The next layer could then identify any edges in the image, based on lines of similar pixels. After this, another layer may recognize textures and shapes, and so on. By the time the fourth or fifth layer is reached, the deep learning net will have created complex feature detectors. It can figure out that certain image elements (such as a pair of eyes, a nose, and a mouth) are commonly found together.

Once this is done, the researchers who have trained the network can give labels to the output, and then use backpropagation to correct any mistakes which have been made. After a while, the network can carry out its own classification tasks without needing humans to help every time.

<img src = "https://files.codingninjas.in/3nn-7659.gif" width = 800>



##**Why do we need NN?**

Neural Networks have been around since before machine learning gained pace, but they were long thought to be too computationally heavy and were therefore brushed aside.

A problem we faced with Logistic Regression was that, even though the decision function (Sigmoid) is non-linear, we got a linear decision boundary. We fixed this problem by adding dummy features with higher powers.

To do that, we had to experiment and decide the degree of the features we needed to add. Ideally, the model should find a suitable non-linear decision boundary on its own.

Logistic regression had the following structure:

<img src = "https://files.codingninjas.in/nn1-7661.jpg" width = 500>

The intuition behind Neural Networks is as follows:

<img src = "https://files.codingninjas.in/nn2-7662.jpg" width = 500>

So, here the final output will not be linear with respect to $x_1, x_2, x_0$.
The functions $f_1, f_2$ need not necessarily be Sigmoid. We can choose any function. Using this method we can create quite interesting decision boundaries without applying the dummy feature method.

##**Example with Linear Decision Boundaries**

To understand how to reach these boundaries, let's take a few simple examples.

###**Example 1 : Negation**

x | y | 
:---:|:---:|
1|0|
0|1|

<img src = "https://files.codingninjas.in/negation-7688.jpg" width = 450>

Here, the function used is 
$$\frac{1}{1 + e^{-z}}$$

We want to pick values of $w_0$ and $w_1$ that give the correct answer.

**Case 1** : When $x = 0$, we want $y = 1$. So, we want $z\geq0$.
$$w_0 + w_1x \geq 0$$
$$w_0 \geq 0$$

The larger $w_0$ is, the closer the sigmoid output gets to 1. Let's take $w_0 = 50$.

Hence, $z = 50$, which is what we wanted.

**Case 2** : When $x = 1$, we want $y = 0$. So, we want $z\leq0$.
$$w_0 + w_1x \leq 0$$
$$w_0 + w_1 \leq 0$$

Let's take $w_1 = -100$.

Hence, $z = -50$, which is what we wanted.
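
The two cases above can be checked numerically. The sketch below is only an illustration: the function names `sigmoid` and `negation` are our own, and the weights are the ones chosen in the text.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Weights chosen in the text
w0, w1 = 50, -100

def negation(x):
    """Output of a single sigmoid neuron with z = w0 + w1 * x."""
    z = w0 + w1 * x
    return sigmoid(z)

for x in (0, 1):
    # Rounding the sigmoid output gives the predicted bit
    print(x, round(negation(x)))
```

Because $|z| = 50$ in both cases, the sigmoid output is extremely close to 1 or 0, so rounding recovers the negation table exactly.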

###**Example 2 : OR**

<img src = "https://files.codingninjas.in/or-7690.jpg" width = 500>

$x_1$ | $x_2$ | $y$ 
:---:|:---:|:---:|
0|0|0
1|0|1
0|1|1
1|1|1

Here, the function used is 
$$\frac{1}{1 + e^{-z}}$$

We want to pick values of $w_0$, $w_1$ and $w_2$ that give the correct answer.

**Case 1** : When $x_1 = 0$ and $x_2 = 0$, we want $y = 0$.

Therefore,
$$ w_1x_1 + w_2x_2 + w_0(1) < 0 $$
$$ w_1(0) + w_2(0) + w_0 < 0 $$
$$ w_0 < 0 $$

Let's take $w_0 = -50$.

**Case 2** : When $x_1 = 1$ and $x_2 = 0$, we want $y = 1$.

Therefore,
$$ w_1x_1 + w_2x_2 + w_0(1) > 0 $$
$$ w_1(1) + w_2(0) + w_0 > 0 $$
$$ w_0 + w_1 > 0 $$

Let's take $w_1 = 100$.

**Case 3** : When $x_1 = 0$ and $x_2 = 1$, we want $y = 1$.

Therefore,
$$ w_1x_1 + w_2x_2 + w_0(1) > 0 $$
$$ w_1(0) + w_2(1) + w_0 > 0 $$
$$ w_0 + w_2 > 0 $$

Let's take $w_2 = 100$.

So, we can draw the table as :

$x_1$ | $x_2$ | $y$ | $z$ | $y_p$
:---:|:---:|:---:|:---:|:---:|
0|0|0|-50|0
1|0|1|50|1
0|1|1|50|1
1|1|1|150|1
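
The table above can be reproduced with a few lines of Python. This is a minimal sketch using the weights chosen in the text; the helper names `sigmoid` and `or_gate` are our own.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Weights chosen in the text
w0, w1, w2 = -50, 100, 100

def or_gate(x1, x2):
    """Return (z, y_p) for a single sigmoid neuron."""
    z = w0 + w1 * x1 + w2 * x2
    return z, round(sigmoid(z))

inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]
rows = [(x1, x2, *or_gate(x1, x2)) for x1, x2 in inputs]

for row in rows:
    print(row)  # (x1, x2, z, y_p)
```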

Try doing the same calculations for the **AND** gate:

$x_1$ | $x_2$ | $y$ 
:---:|:---:|:---:|
0|0|0
1|0|0
0|1|0
1|1|1

##**Example with Non Linear Decision Boundaries**

###**Example 1 : XOR**

$x_1$ | $x_2$ | XOR | AND | NOR
:---:|:---:|:---:|:---:|:---:|
0|0|0|0|1
1|0|1|0|0
0|1|1|0|0
1|1|0|1|0

If we look at the table closely, XOR is 1 exactly when the outputs of both AND and NOR are 0.
If either AND or NOR is 1, the output of XOR is 0.

So we can combine AND and NOR to reach XOR.
Taking AND to be $f_1$ and NOR to be $f_2$, we can say that NOR($f_1$, $f_2$) will give the desired output.

<img src = "https://files.codingninjas.in/xor-7691.jpg">

Verify the results with your own calculations.
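
One way to check your calculations is the small two-layer sketch below. The weights for AND and NOR are an assumption: they are one valid choice in the same spirit as the earlier examples, not necessarily the values shown in the figure. The helper names `neuron` and `xor` are also our own.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w0, w1, w2, x1, x2):
    """Single sigmoid neuron with bias w0 and weights w1, w2."""
    return sigmoid(w0 + w1 * x1 + w2 * x2)

# Assumed weights (w0, w1, w2); other choices also work
AND_W = (-150, 100, 100)   # positive z only when both inputs are 1
NOR_W = (50, -100, -100)   # positive z only when both inputs are 0

def xor(x1, x2):
    f1 = neuron(*AND_W, x1, x2)            # hidden unit 1: AND
    f2 = neuron(*NOR_W, x1, x2)            # hidden unit 2: NOR
    return round(neuron(*NOR_W, f1, f2))   # output unit: NOR(f1, f2)

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, xor(x1, x2))
```

No single neuron can produce this table, because XOR is not linearly separable; the hidden layer is what makes it possible.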

##**Terminology**

**Neuron** : A single unit in any layer is called a neuron.

**Input Layer** : The input layer communicates with the external environment that presents a pattern to the neural network. Its only job is to receive the inputs. The input layer should represent the condition for which we are training the neural network, and every input neuron should represent an independent variable that influences the output of the network.

**Hidden Layer** : A hidden layer is a collection of neurons with an activation function applied to them, placed between the input layer and the output layer. Its job is to process the outputs of the previous layer, so it is the layer responsible for extracting the required features from the input data.


**Output Layer** : The output layer collects the information from the last hidden layer and presents it in the form the network was designed to give. The pattern presented by the output layer can be traced back to the input layer. The number of neurons in the output layer should match the task the network performs.


The weights for each neuron will be found by a training algorithm. What we need to decide is:
1. How many hidden layers do we want?
2. How many neurons should each layer have?
3. Which function should be applied in the hidden and output layers?


<img src = "https://files.codingninjas.in/network-7689.png">

##**Propagation**


###**Forward Propagation**

The input $X$ provides the initial information, which then propagates through the hidden units of each layer and finally produces the output $\hat{y}$. Defining the architecture of the network means choosing its depth, its width, and the activation function used in each layer. Depth is the number of hidden layers. Width is the number of units (nodes) in each hidden layer; the dimensions of the input and output layers are fixed by the data and the task, so we do not choose them. Common activation functions include the Rectified Linear Unit (ReLU), Sigmoid, and Hyperbolic Tangent. In practice, deeper networks often outperform shallower networks that have more hidden units per layer, so adding depth is usually worth trying, although with diminishing returns.
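
As a rough illustration of forward propagation, the sketch below pushes one input through a small network with NumPy. The architecture (2 inputs, 3 hidden units, 1 output), the random weights, and the use of Sigmoid in every layer are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate x layer by layer: a = sigmoid(W @ a_prev + b)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)

# Hypothetical architecture: 2 inputs -> 3 hidden units -> 1 output
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros(3), np.zeros(1)]

y_hat = forward(np.array([1.0, 0.0]), weights, biases)
print(y_hat)  # a single value between 0 and 1
```

Here the depth is 1 (one hidden layer) and the width is 3; adding another `(W, b)` pair to the lists adds another layer.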

###**Backward Propagation**

Backpropagation refers to the method of calculating the gradient of neural network parameters. In short, the method traverses the network in reverse order, from the output to the input layer, according to the chain rule from calculus. The algorithm stores any intermediate variables (partial derivatives) required while calculating the gradient with respect to some parameters.
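
The sketch below illustrates the idea for a tiny network with one hidden layer. It is a simplified example under stated assumptions: Sigmoid activations, squared error on a single data point, no bias terms, and no regularisation. The function name `loss_and_grads` is our own. The last lines compare one backpropagated gradient entry with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, y, W1, W2):
    # Forward pass: keep the intermediate activations for reuse
    a1 = sigmoid(W1 @ x)          # hidden activations, shape (3,)
    y_hat = sigmoid(W2 @ a1)      # network output, shape (1,)
    loss = np.sum((y - y_hat) ** 2)

    # Backward pass: apply the chain rule from the output to the input
    delta2 = -2 * (y - y_hat) * y_hat * (1 - y_hat)   # dLoss/dz2
    grad_W2 = np.outer(delta2, a1)                    # dLoss/dW2
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)          # dLoss/dz1
    grad_W1 = np.outer(delta1, x)                     # dLoss/dW1
    return loss, grad_W1, grad_W2

rng = np.random.default_rng(0)
x, y = np.array([1.0, 0.0]), np.array([1.0])
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))

loss, grad_W1, grad_W2 = loss_and_grads(x, y, W1, W2)

# Finite-difference check of one entry of grad_W1
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 0] += eps
numeric = (loss_and_grads(x, y, W1_shift, W2)[0] - loss) / eps
print(grad_W1[0, 0], numeric)  # the two values should nearly agree
```

Note how `delta2`, computed for the output layer, is reused when computing `delta1`; this reuse of stored intermediate values is what makes backpropagation efficient.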

##**Cost Function**

For neural networks, as for most other learning algorithms, the cost function has the same general form:

$$ Cost = Error + \lambda Regularisation$$

For regularisation, we will use $\sum w_j^2$, which is the sum of the squares of all the weights.

The error for a single data point can be the squared error:

$$ (y_t - y_{pred})^2 $$

Therefore, 

$$ Cost = \frac{1}{m}\sum(y_t - y_{pred})^2 + \frac{\lambda}{2m} \sum w_j^2$$
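
The formula above translates almost directly into NumPy. The sketch below is illustrative only: the function name `cost`, the sample values, and $\lambda = 0.1$ are made up for the example.

```python
import numpy as np

def cost(y_true, y_pred, weights, lam):
    """Mean squared error plus L2 regularisation over all weight matrices."""
    m = len(y_true)
    error = np.sum((y_true - y_pred) ** 2) / m
    regularisation = lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)
    return error + regularisation

# Made-up values for illustration
y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8, 0.6])
weights = [np.array([[1.0, -2.0]]), np.array([[0.5]])]

value = cost(y_true, y_pred, weights, lam=0.1)
print(value)  # 0.0625 (error) + 0.065625 (regularisation)
```

A larger $\lambda$ increases the penalty on large weights, which pushes the training algorithm towards simpler decision boundaries.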

##**How to handle Multiclass Classification**



To handle multiclass classification, the basic idea is to add more weights. Everything else stays the same, but the output layer now has multiple units, one per class.

Let's assume that there are 3 classes that can be predicted.

<img src = "https://files.codingninjas.in/multiclass-7687.png">

Let's say the predicted values are:
$y_1 = 0.1$, $y_2 = 0.15$ and  $y_3 = 0.99$.

We assign the data point to the class with the maximum value, in this case Class 3. The true value of the output will be a vector such as [0, 0, 1].

So, for the above model, the true label of each data point must also be a vector, just like the output.

If data point $x^1$ belongs to the 1st class, then its target vector is [1, 0, 0].


Similarly, if $x^2$ belongs to the 3rd class, then its target vector is [0, 0, 1].

Such a target is said to be **One Hot Encoded**.
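
A minimal sketch of one hot encoding with NumPy is shown below. The helper name `one_hot` is our own, and class labels are assumed to be zero-indexed (class 1 is label 0, class 3 is label 2).

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer labels to rows of the identity matrix."""
    return np.eye(num_classes, dtype=int)[labels]

# x^1 belongs to the 1st class, x^2 to the 3rd class (zero-indexed labels)
targets = one_hot(np.array([0, 2]), num_classes=3)
print(targets)

# Predicted values from the example: pick the unit with the maximum value
predicted = np.array([0.1, 0.15, 0.99])
predicted_class = int(np.argmax(predicted)) + 1  # back to 1-based class number
print(predicted_class)
```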


Cost function changes to :

$$Cost = \sum^m_{i = 1} \sum^k_{j = 1} f(y_{i}^j(pred),\enspace y_i^j(true)) + \frac{\lambda}{2m} \sum w_j^2 $$

The extra summation $\sum^k_{j = 1}$ penalises every output unit whose value differs from its one hot encoded target. Hence, the error is calculated not only for the unit of the correct class, but also for the other units, which should have predicted 0.
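
To make the double summation concrete, the sketch below evaluates the cost for the 3-class example. It assumes that $f$ is the squared error, which is only one possible choice, and the function name `multiclass_cost` is our own.

```python
import numpy as np

def multiclass_cost(Y_true, Y_pred, weights, lam):
    """Sum f over all m data points and k output units, plus L2 regularisation.

    Assumption: f is the squared error between prediction and target.
    """
    m = Y_true.shape[0]
    error = np.sum((Y_pred - Y_true) ** 2)   # sums over both i and j
    regularisation = lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)
    return error + regularisation

# One data point of class 3 with the predictions from the example
Y_true = np.array([[0, 0, 1]])
Y_pred = np.array([[0.1, 0.15, 0.99]])

value = multiclass_cost(Y_true, Y_pred, weights=[], lam=0.0)
print(value)  # 0.1^2 + 0.15^2 + 0.01^2
```

The first two terms come from the units that should have output 0, which shows how incorrect activations of the other units contribute to the cost.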

##**MLP Classifier in Sklearn**

Sklearn's MLP classifier is not a very efficient implementation. It is not advisable to use it for neural networks on large datasets or in an actual product.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

In [None]:
clf = MLPClassifier()  # Create the classifier with default settings
iris = datasets.load_iris()  # Load the Iris dataset
X = iris.data
Y = iris.target
xtrain, xtest, ytrain, ytest = train_test_split(X, Y)  # Default split: 75% train, 25% test
clf.fit(xtrain, ytrain)  # Train the neural network



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [None]:
clf.score(xtest,ytest) # Obtaining score 

0.9210526315789473

In [None]:
clf.predict(xtest)   # results

array([1, 2, 1, 2, 0, 1, 2, 0, 2, 2, 2, 1, 1, 2, 2, 2, 0, 2, 2, 1, 0, 0,
       0, 1, 1, 2, 0, 2, 2, 1, 0, 0, 2, 2, 2, 0, 1, 2])

###**Important Parameters**
**hidden_layer_sizes : tuple, length = n_layers - 2, default=(100,)**
The ith element represents the number of neurons in the ith hidden layer.

**activation : {‘identity’, ‘logistic’, ‘tanh’, ‘relu’}, default=’relu’**
Activation function for the hidden layer.

‘identity’, no-op activation, useful to implement linear bottleneck, returns f(x) = x

‘logistic’, the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)).

‘tanh’, the hyperbolic tan function, returns f(x) = tanh(x).

‘relu’, the rectified linear unit function, returns f(x) = max(0, x)

**batch_size : int, default=’auto’**
Size of minibatches for stochastic optimizers. If the solver is ‘lbfgs’, the classifier will not use minibatches. When set to ‘auto’, batch_size = min(200, n_samples).

###**Important Attributes**
**coefs_ : list of shape (n_layers - 1,)**
The ith element in the list represents the weight matrix corresponding to layer i.

**intercepts_ : list of shape (n_layers - 1,)**
The ith element in the list represents the bias vector corresponding to layer i + 1.
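
The parameters and attributes above can be explored with a short experiment. The values used here (`hidden_layer_sizes=(10, 5)`, `activation='tanh'`, `max_iter=2000`, `random_state=0`) are arbitrary choices for illustration, not recommended settings.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

iris = datasets.load_iris()
xtrain, xtest, ytrain, ytest = train_test_split(
    iris.data, iris.target, random_state=0
)

# Two hidden layers with 10 and 5 neurons, tanh activation (arbitrary choices)
clf = MLPClassifier(
    hidden_layer_sizes=(10, 5),
    activation='tanh',
    max_iter=2000,
    random_state=0,
)
clf.fit(xtrain, ytrain)

# One weight matrix and one bias vector per pair of consecutive layers
coef_shapes = [W.shape for W in clf.coefs_]
intercept_shapes = [b.shape for b in clf.intercepts_]
print(coef_shapes)
print(intercept_shapes)
```

Iris has 4 features and 3 classes, so the weight matrices connect layers of sizes 4, 10, 5 and 3, and each bias vector has one entry per neuron in the layer it feeds.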