# Neural Network
## <a href="#I">I Perceptrons</a>
## <a href="#II">II Artificial Neural Network (Multilayer Perceptron)</a>
### <a href="#II.1">II.1 Feed-Forward</a>
### <a href="#II.2">II.2 Back Propagation</a>
## <a href="#III">III Implementing Neural Network with Scikit-Learn</a>
### <a href="#III.1">III.1 Preparing Data</a>
#### <a href="#III.1.1">III.1.1 Train Test Split</a>
#### <a href="#III.1.2">III.1.2 Feature Scaling</a>
### <a href="#III.2">III.2 Training and Predictions</a>
### <a href="#III.3">III.3 Evaluating the Algorithm</a>

# Neural Network
__Neural networks__ are a set of algorithms that are designed to recognize patterns. <br>
They interpret sensory data through a kind of machine perception, labeling or clustering raw input. 
The patterns they recognize are numerical, contained in vectors, into which all real-world data (images, sound, text or time series) must be translated.

Neural networks help us cluster and classify.<br>
You can think of them as a clustering and classification layer on top of the data you store and manage. 
They help to group unlabeled data according to similarities among the example inputs (unsupervised learning), and they classify data when they have a labeled dataset to train on (supervised learning). 

Neural networks can also extract features that are fed to other algorithms for clustering and classification; so you can think of neural networks as components of larger machine-learning applications involving algorithms for reinforcement learning, classification and regression.

<a id="I"></a>
## I Perceptrons
Artificial neural networks are inspired by the human neural network architecture. 
The simplest neural network consists of only one neuron and is called a __perceptron__, as shown in the figure below:
<img src="nbimages/perceptron.png" alt="A perceptron" title="A perceptron" width=400 height=400 />
A perceptron has one input layer and one neuron. The input layer acts like the dendrites in the human nervous system and is responsible for receiving the inputs. <br>
The number of nodes in the input layer is equal to the number of features in the input dataset.<br> Each input is multiplied by a __weight__ (which is typically initialized with a random value) and the results are added together. <br>
The sum is then passed through an __activation function__. <br>
The activation function of a perceptron resembles the nucleus of a neuron in the human nervous system. It processes the information and yields an output. In the case of a perceptron, this output is the final outcome. However, in the case of __multilayer perceptrons__, the output from the neurons in one layer serves as the input to the neurons of the next layer.<br>
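
The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the function names `step` and `perceptron_output`, the step activation, the bias term and the hand-picked weights are all assumptions chosen so that the perceptron behaves like a logical AND.

```python
def step(z):
    """Step activation: output 1 when the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0


def perceptron_output(inputs, weights, bias=0.0):
    """Multiply each input by its weight, add the results, then apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(weighted_sum)


# Hypothetical hand-picked parameters (not learned) that implement a logical AND.
and_weights = [1.0, 1.0]
and_bias = -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron_output([a, b], and_weights, and_bias))
```

Only the input pair `(1, 1)` pushes the weighted sum above zero, so it is the only case where the perceptron outputs 1.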

<a id="II"></a>
## II Artificial Neural Network (Multilayer Perceptron)

Now that we know what a single-layer perceptron is, we can extend this discussion to __multilayer perceptrons__, more commonly known as __artificial neural networks__ (__ANN__).<br> 
A single-layer perceptron can solve simple problems where the data is linearly separable in 'n' dimensions, where 'n' is the number of features in the dataset. However, when the data is not linearly separable, the accuracy of a single-layer perceptron decreases significantly. Multilayer perceptrons, on the other hand, can work efficiently with non-linearly separable data.

Multilayer perceptrons are a combination of multiple neurons connected in the form of a network. 
An artificial neural network has an input layer, one or more hidden layers, and an output layer. This is shown in the image below:
<img src="nbimages/ann.png" alt="An artificial neural network" title="An artificial neural network" width=400 height=400 />

A neural network executes in two phases: __Feed-Forward__ and __Back-Propagation__.

<a id="II.1"></a>
### II.1 Feed-Forward

The following steps are performed during the feed-forward phase:

1. The values received in the input layer are multiplied by the weights. A bias is added to the weighted sum of the inputs so that a neuron can still produce a non-zero value when all of its inputs are zero.
2. Each neuron in the first hidden layer receives a different value from the input layer, depending on its weights and bias. Neurons have an activation function that operates upon the value received from the input layer. The activation function can be of many types, like a step function, sigmoid function, relu function, or tanh function. As a rule of thumb, the relu function is used in the hidden-layer neurons and the sigmoid function is used for the output-layer neuron (softmax is the usual choice when there are more than two classes).
3. The outputs from the first hidden layer neurons are multiplied by the weights of the second hidden layer; the results are summed together and passed to the neurons of the next layer. This process continues until the output layer is reached. The values calculated at the output layer are the actual outputs of the algorithm. In the output layer, each neuron corresponds to a possible class.

The feed-forward phase consists of these three steps. However, the predicted output is not necessarily correct, and when it is wrong we need to correct it. The purpose of a learning algorithm is to make predictions that are as accurate as possible. To improve the predicted results, a neural network then goes through a __back-propagation__ phase. 
During back propagation, the weights of the different neurons are updated so that the difference between the desired and predicted output is as small as possible.
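
The three feed-forward steps can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the helper `feed_forward`, the layer sizes and the random weights are hypothetical, and the network is untrained, so the output values carry no meaning beyond showing how data flows through the layers.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def feed_forward(x, layers):
    """Propagate x through a list of (weights, bias) pairs.

    relu is applied in the hidden layers and sigmoid in the output layer,
    following the rule of thumb mentioned in step 2.
    """
    activation = x
    for i, (W, b) in enumerate(layers):
        z = activation @ W + b  # weighted sum plus bias
        is_output_layer = i == len(layers) - 1
        activation = sigmoid(z) if is_output_layer else relu(z)
    return activation


rng = np.random.default_rng(0)

# Assumed architecture: 4 input features, two hidden layers of 5 neurons, 3 outputs.
sizes = [4, 5, 5, 3]
layers = [
    (rng.normal(size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = np.array([5.1, 3.5, 1.4, 0.2])  # one sample with four features
output = feed_forward(x, layers)
print(output.shape)  # one value per output neuron
```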

<a id="II.2"></a>
### II.2 Back Propagation

The back-propagation phase consists of the following steps:

1. The error is calculated by quantifying the difference between the predicted output and the desired output. This difference is called the "loss" and the function used to calculate it is called the "loss function". Loss functions can be of different types, e.g. mean squared error or cross entropy. Remember that this training procedure is supervised: it needs the desired outputs for a given set of inputs, which is what allows the network to learn from the data.

2. Once the error is calculated, the next step is to minimize it. To do so, the partial derivative of the error function is calculated with respect to every weight and bias. These derivatives give the slope of the error function, and using them to adjust the weights is called gradient descent. If the slope is positive, the value of the weight is reduced; if the slope is negative, the value of the weight is increased. This reduces the overall error. The function that is used to reduce this error is called the __optimization function__.

One cycle of feed-forward and back propagation over the training data is called one "epoch".<br>
This process continues until a reasonable accuracy is achieved. There is no standard for reasonable accuracy. Ideally you'd strive for 100% accuracy, but this is extremely difficult to achieve for any non-trivial dataset. In many cases 90%+ accuracy is considered acceptable, but it really depends on your use case.
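
To make the loop of loss, derivatives and weight updates concrete, here is a minimal sketch that trains a single sigmoid neuron with plain gradient descent. The toy data, the learning rate of 0.1 and the 200 epochs are made-up assumptions, and a full network would apply the same idea layer by layer.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Made-up toy data: one feature, label 1 when the feature is large.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w = np.zeros(1)
b = 0.0
learning_rate = 0.1  # assumed step size

losses = []
for epoch in range(200):
    # Feed-forward: predicted probability for every sample.
    p = sigmoid(X @ w + b)

    # Loss: mean cross entropy between predictions and desired outputs.
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    losses.append(loss)

    # Partial derivatives of the loss with respect to the weight and the bias.
    error = p - y
    grad_w = X.T @ error / len(y)
    grad_b = error.mean()

    # Gradient descent: move each parameter against its slope.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The loss recorded in the last epoch is lower than in the first, which is the behaviour the update rule is meant to produce.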

<a id="III"></a>
## III Implementing Neural Network with Scikit-Learn

Now we will try to build a simple neural network that predicts the class that a given iris plant belongs to.<br> 
We will use Python's Scikit-Learn library to create our neural network that performs this classification task.<br>
The dataset that we are going to use for this tutorial is the popular Iris dataset.




```python
import pandas as pd

# Location of dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
col_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read dataset into a pandas DataFrame
irisDF = pd.read_csv(url, names=col_names)
irisDF.head()
```

|   | sepal-length | sepal-width | petal-length | petal-width | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
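
If the URL is not reachable, one possible fallback is the copy of the Iris data bundled with Scikit-Learn. This is a sketch under assumptions: `as_frame=True` requires a reasonably recent Scikit-Learn release, the variable name `bundled_df` is hypothetical, and the bundled copy uses different column names and integer targets instead of the `Class` strings used in this tutorial.

```python
from sklearn.datasets import load_iris

# Load the bundled dataset as pandas objects (assumes a Scikit-Learn version with as_frame).
iris = load_iris(as_frame=True)

# Four feature columns plus an integer 'target' column.
bundled_df = iris.frame

print(bundled_df.shape)
print(list(iris.target_names))  # class names in the order of the integer targets
```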


<a id="III.1"></a>
### III.1 Preparing Data

```python
from sklearn import preprocessing

# Assign data from the first four columns to X
X = irisDF.iloc[:, 0:4]

# Assign the fifth column (Class) to y
y = irisDF.Class  # y is a Series; call to_frame() if a DataFrame is needed
y.head()
y.unique()  # the different classes (labels)
# We have three unique classes: 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica'.

# Convert these categorical values to numerical values
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
y  # y is now a NumPy array
```

```
0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Class, dtype: object

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
```
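
Since the network will predict integers, it helps to know how to map them back to class names. The sketch below uses a small hypothetical list of labels and a separate encoder named `encoder` so that it can run on its own; `classes_` and `inverse_transform` are part of `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels, only for illustration.
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoded)                             # integer code for each label
print(encoder.classes_)                    # position i holds the name for code i
print(encoder.inverse_transform(encoded))  # back to the original strings
```

The codes follow the sorted order of the class names, which is why 'Iris-setosa' maps to 0 in the output above.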

<a id="III.1.1"></a>
#### III.1.1 Train Test Split
To detect __over-fitting__, we will divide our dataset into training and test splits. <br>
The training data will be used to train the neural network and the test data will be used to evaluate the performance of the neural network. <br>
This exposes over-fitting because we're evaluating our neural network on data that it has not seen (i.e. been trained on) before.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
```
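
The split above is random, so it changes from run to run. If a reproducible split with equal class proportions is wanted, `train_test_split` accepts `random_state` and `stratify`. The sketch below is a variation, not what this tutorial runs: it assumes the bundled copy of the dataset, hypothetical variable names, and an arbitrary seed of 42.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled copy of the Iris data, used so this sketch runs on its own.
X_demo, y_demo = load_iris(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.20,
    random_state=42,   # arbitrary seed: same split on every run
    stratify=y_demo,   # keep class proportions equal in both splits
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))  # number of test samples per class
```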

<a id="III.1.2"></a>
#### III.1.2 Feature Scaling

Before training, it is good practice to scale the features so that they all contribute on a comparable scale.<br>
With the help of StandardScaler we learn the mean and standard deviation of each feature in the training set, and then:
1. Standardize the training set using the training set means and standard deviations.
2. Standardize any test set using the training set means and standard deviations.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

```
StandardScaler(copy=True, with_mean=True, with_std=True)
```
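
The two standardization steps can be checked by hand on a tiny example. The arrays `train` and `test` below are made-up values, and `demo_scaler` is a hypothetical name; the point is that both sets are shifted and scaled with statistics taken from the training set only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: three training rows and one test row, two features each.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

demo_scaler = StandardScaler().fit(train)

# Step 1: the training set is standardized with its own mean and standard deviation.
manual_train = (train - train.mean(axis=0)) / train.std(axis=0)
assert np.allclose(demo_scaler.transform(train), manual_train)

# Step 2: the test set is standardized with the *training* mean and standard deviation.
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)
assert np.allclose(demo_scaler.transform(test), manual_test)

print(demo_scaler.mean_)
print(demo_scaler.transform(test))
```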

<a id="III.2"></a>
### III.2 Training and Predictions

Now it's time to train a neural network that can actually make predictions.<br>
The __MLPClassifier__ instance is initialized with two parameters.

1. __hidden_layer_sizes__ is used to set the size of the hidden layers. In our script we will create three hidden layers of 10 nodes each. There is no standard formula for choosing the number of layers and nodes for a neural network and it varies quite a bit depending on the problem at hand. The best way is to try different combinations and see what works best.

2. __max_iter__ specifies the maximum number of iterations, or epochs, that you want your neural network to execute; training can stop earlier if the loss stops improving. Remember, one epoch is one cycle of feed-forward and back propagation.

By default the '__relu__' activation function is used with the '__adam__' optimizer. However, you can change these using the __activation__ and __solver__ parameters, respectively.

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000)
mlp.fit(X_train, y_train.ravel())
predictions = mlp.predict(X_test)
```

```
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(10, 10, 10), learning_rate='constant',
              learning_rate_init=0.001, max_iter=1000, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=None, shuffle=True, solver='adam', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)
```
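
As an illustration of the __activation__ and __solver__ parameters, the sketch below trains the same architecture with 'tanh' and 'lbfgs'. It is a variation, not the configuration used in this tutorial: it assumes the bundled copy of the Iris data, hypothetical variable names, and an arbitrary `random_state=0` so that it can run on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Self-contained data preparation (bundled dataset, arbitrary seed).
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)
demo_scaler = StandardScaler().fit(X_tr)
X_tr, X_te = demo_scaler.transform(X_tr), demo_scaler.transform(X_te)

# Same three hidden layers, but with a different activation function and solver.
tanh_mlp = MLPClassifier(
    hidden_layer_sizes=(10, 10, 10),
    activation='tanh',
    solver='lbfgs',
    max_iter=1000,
    random_state=0,
)
tanh_mlp.fit(X_tr, y_tr)

print(tanh_mlp.n_layers_)        # input layer + three hidden layers + output layer
print(tanh_mlp.out_activation_)  # output activation chosen by Scikit-Learn
print(tanh_mlp.score(X_te, y_te))
```

For a problem with three classes, `out_activation_` reports 'softmax' rather than sigmoid.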

<a id="III.3"></a>
### III.3 Evaluating the Algorithm

Now it's time to evaluate how well the algorithm performs. 

```python
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
```

```
[[ 7  0  0]
 [ 0 13  1]
 [ 0  2  7]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       0.87      0.93      0.90        14
           2       0.88      0.78      0.82         9

    accuracy                           0.90        30
   macro avg       0.91      0.90      0.91        30
weighted avg       0.90      0.90      0.90        30
```



Your results may differ slightly from these because __train_test_split()__ randomly splits the data into training and test sets and the network weights are randomly initialized, so our networks may not have been trained and tested on the same data. <br>
But overall, the accuracy should be around 90% on your run as well.
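
The figures in the classification report can be recomputed directly from the confusion matrix, which may help when reading it. The sketch below copies the matrix printed above into a hypothetical variable `cm`; rows are the true classes and columns are the predicted classes.

```python
import numpy as np

# Confusion matrix from the run above (rows: true class, columns: predicted class).
cm = np.array([[7, 0, 0],
               [0, 13, 1],
               [0, 2, 7]])

correct = np.diag(cm)                 # correctly classified samples per class
accuracy = correct.sum() / cm.sum()   # 27 correct out of 30
precision = correct / cm.sum(axis=0)  # column sums: how often each class was predicted
recall = correct / cm.sum(axis=1)     # row sums: how many samples each class really has

print(round(accuracy, 2))
print(np.round(precision, 2))
print(np.round(recall, 2))
```

The row sums are the `support` column of the report, and the rounded precision and recall values match the ones it prints.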