# Deep Learning

A set of algorithms, loosely inspired by the human brain, that learn from data through layers of artificial neurons.

## Libraries for Deep Learning

- PyTorch
- Keras
- TensorFlow

## How a neural network works

#### Forward propagation

- All features are passed to the input layer of neurons, one feature per neuron.
- Each input is multiplied by a weight on its way to the hidden layer. A hidden layer can have one or more neurons, and each input neuron passes its feature to every neuron of the hidden layer.
- Each neuron in the hidden layer computes the weighted sum of the features it receives, adds a bias, and applies an activation function to the result.
- The output of the hidden layer is passed to the next hidden layer or to the output layer.

`y = activation(w1*x1 + w2*x2 + w3*x3 + bias)`
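
The formula above can be sketched for a single neuron in plain Python. The input values, weights and the choice of sigmoid as the activation are assumptions made only for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical values chosen only for illustration.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0

# Weighted sum is 0.5*1 - 0.25*2 + 0*3 + 0 = 0, and sigmoid(0) = 0.5
y = neuron_output(x, w, b, sigmoid)
print(y)
```

A full layer repeats this computation once per neuron, each with its own weights and bias.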

#### Back propagation
- Initially, random weights are taken. Let's say the output is 5 and we wanted 1. The loss is then (5 - 1)^2 = 16 (assuming squared error).
- We now go backwards and adjust the weights so that the loss function is minimised.
- We calculate d(loss)/d(weight), multiply it by the learning rate, and adjust the weights of each neuron as we propagate backwards: `weight(new) = weight(old) - learning rate * d(loss)/d(weight)`.
- We use the chain rule to calculate d(loss)/d(weight): instead of calculating it directly, we split it into `d(loss)/d(output) * d(output)/d(weight)`.
- One epoch is one full pass of forward and back propagation over the training data, so the number of epochs is the number of times this cycle is repeated over the whole dataset.
- If the weights are not initialised properly (for example, too large), the gradients can grow at every layer and the loss may fail to converge, oscillating instead of settling at a minimum. This is called the exploding gradient problem.
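
A single weight update can be traced by hand. The sketch below assumes a toy model `output = weight * x` with one input and a squared-error loss; the learning rate of 0.1 is an arbitrary choice for illustration.

```python
# Toy model (an assumption for illustration): output = weight * x
x = 1.0
target = 1.0
weight_old = 5.0
learning_rate = 0.1

# Forward pass
output = weight_old * x                  # 5.0
loss = (output - target) ** 2            # (5 - 1)^2 = 16.0

# Backward pass using the chain rule
d_loss_d_output = 2 * (output - target)  # 8.0
d_output_d_weight = x                    # 1.0
gradient = d_loss_d_output * d_output_d_weight

# Weight update
weight_new = weight_old - learning_rate * gradient

# The loss after the update is smaller than before
new_loss = (weight_new * x - target) ** 2
print(loss, weight_new, new_loss)
```

Repeating the same three steps (forward pass, gradient, update) moves the weight closer to 1, where the loss is zero.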



## Activation functions

#### ReLU
Outputs x when x is greater than 0, and 0 otherwise: `max(0, x)`.

#### Sigmoid (`1/(1+e**-x)`)
- The output ranges between 0 and 1. For binary classification, an output below 0.5 is read as class 0 and an output above 0.5 as class 1.
- Historically, sigmoid was the default choice. The derivative of the sigmoid function ranges between 0 and 0.25. In a deep network, the chain rule multiplies many such derivatives together, so the gradient becomes very small and the weights of the early layers barely get updated. This is called the vanishing gradient problem.
- ReLU was adopted to address this problem. The derivative of the ReLU activation function is either 0 or 1, so the gradient does not vanish for positive inputs. Instead there is a dead ReLU problem: when the derivative is 0, the weight is not updated (new weight = old weight).
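
The effect of stacking layers can be checked numerically. This sketch assumes a 10-layer chain in which every layer contributes its best-case derivative; the depth is an arbitrary choice for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid: s * (1 - s), which peaks at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_derivative(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


peak = sigmoid_derivative(0.0)  # largest value the sigmoid derivative can take

layers = 10  # hypothetical depth
sigmoid_factor = peak ** layers                 # best case for sigmoid
relu_factor = relu_derivative(2.0) ** layers    # active ReLU units

print(peak, sigmoid_factor, relu_factor)
```

Even in the best case the sigmoid chain shrinks the gradient by about a factor of a million over 10 layers, while a chain of active ReLU units leaves it unchanged.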

#### Hyperbolic tangent (tanh) (`(1-e**-2x)/(1+e**-2x)`)

- The output ranges between -1 and +1: large negative inputs approach -1 and large positive inputs approach +1.
- The derivative ranges between 0 and 1. The vanishing gradient problem remains: in a deep neural network the gradient is multiplied by a number less than 1 at every layer, so the updated weights end up almost the same as the old weights.
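
As a quick check, the formula in the heading can be compared with Python's built-in `math.tanh`. The helper names below are hypothetical.

```python
import math


def tanh_from_formula(x):
    """tanh written out as (1 - e^(-2x)) / (1 + e^(-2x))."""
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))


def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x)^2, which lies between 0 and 1."""
    return 1.0 - math.tanh(x) ** 2


for value in (-3.0, 0.0, 0.7, 3.0):
    print(value, tanh_from_formula(value), math.tanh(value), tanh_derivative(value))
```

The derivative equals 1 only at x = 0 and falls quickly towards 0 as the input moves away from it, which is why deep tanh networks can still suffer from vanishing gradients.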

## Loss functions (Cost function)
#### Mean squared error
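
Mean squared error averages the squared difference between the target and the prediction over all observations. A minimal sketch, with made-up targets and predictions:

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions, for illustration only.
loss = mean_squared_error([1.0, 0.0, 1.0], [0.5, 0.5, 1.0])
print(loss)
```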

## Optimizers

#### Gradient Descent
- We use all data points to calculate the loss function, then move the weights a small step in the direction that reduces the loss. Repeating this walks down the curve of loss against weights towards the weights where the loss is minimum.
- How do we reach the global minimum and not a local minimum:

#### Stochastic Gradient Descent (and mini batch Stochastic Gradient Descent)

- Stochastic gradient descent: instead of passing all the data to calculate the loss function, we pass a single observation per update to save time and compute. If we pass a small batch instead, it's called mini-batch stochastic gradient descent.
- Why mini-batch stochastic gradient descent is used: it needs far less computation per update than using all the data, while giving a less noisy gradient than a single observation.
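
The three variants differ only in how many observations feed each update. The sketch below assumes a one-weight model `y = w * x` fitted to noiseless data generated with `w = 2`; the function name `train`, the learning rate and the epoch count are illustrative choices.

```python
import random


def train(data, batch_size, learning_rate=0.05, epochs=50, seed=0):
    """Fit y = w * x by gradient descent on squared error, one batch per update."""
    rng = random.Random(seed)
    samples = list(data)  # copy so the caller's list is not shuffled
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Average gradient of (w*x - y)^2 over the batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w


# Toy data generated from y = 2x
data = [(i / 10, 2.0 * i / 10) for i in range(1, 21)]

w_full = train(data, batch_size=len(data))  # gradient descent: all data per update
w_mini = train(data, batch_size=5)          # mini-batch SGD
w_single = train(data, batch_size=1)        # SGD: one observation per update

print(w_full, w_mini, w_single)
```

All three runs approach `w = 2`. The smaller the batch, the more updates happen per epoch and the cheaper each update is.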

#### Stochastic Gradient Descent with momentum
- SGD updates are noisy because each one is computed from only part of the data. Momentum smooths them using a smoothing factor, which keeps an exponentially weighted average of past gradients.
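
The smoothing can be sketched as an exponentially weighted average. The factor `beta = 0.9` and the list of noisy gradients are made-up values for illustration.

```python
def smooth(gradients, beta=0.9):
    """Exponentially weighted average of the gradients seen so far."""
    velocity = 0.0
    smoothed = []
    for g in gradients:
        velocity = beta * velocity + (1.0 - beta) * g
        smoothed.append(velocity)
    return smoothed


# Hypothetical noisy gradients that flip sign from step to step
noisy = [1.0, -0.8, 1.2, -0.6, 1.1, -0.9, 1.3, -0.7]
smoothed = smooth(noisy)

spread_noisy = max(noisy) - min(noisy)
spread_smoothed = max(smoothed) - min(smoothed)
print(spread_noisy, spread_smoothed)
```

With momentum, the weight update uses the smoothed value instead of the raw gradient, so the step-to-step swings are much smaller.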

#### Adaptive gradient optimizer (Adagrad)
- Gradient descent combined with a learning rate that adapts for each weight during training.
- As we move towards the global minimum, the learning rate keeps decreasing.
- The disadvantage is that it can decrease too much as we get closer to the global minimum, so learning nearly stops.
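
A minimal sketch of why the rate keeps shrinking: Adagrad divides the base learning rate by the square root of the sum of all squared gradients seen so far. The base rate and the constant gradients are assumptions for illustration.

```python
import math


def adagrad_effective_rates(gradients, base_lr=0.1, eps=1e-8):
    """Effective learning rate after each step for a single weight."""
    accumulated = 0.0
    rates = []
    for g in gradients:
        accumulated += g * g  # the sum only ever grows
        rates.append(base_lr / (math.sqrt(accumulated) + eps))
    return rates


rates = adagrad_effective_rates([1.0] * 5)
print(rates)
```

Because the accumulated sum never decreases, the effective rate can only go down, which is the deficiency described above.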

#### Adadelta and RMSProp
- Used to remove the deficiency of Adagrad.
- The learning rate becomes small, but not vanishingly small, as we move closer to the global minimum.
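
The difference can be sketched by comparing the two accumulators side by side. RMSProp replaces Adagrad's running sum with a decaying average of squared gradients. The decay of 0.9 and the constant gradients are assumptions for illustration.

```python
import math


def adagrad_rates(gradients, base_lr=0.1, eps=1e-8):
    """Adagrad: divide by the root of the running sum of squared gradients."""
    total = 0.0
    rates = []
    for g in gradients:
        total += g * g
        rates.append(base_lr / (math.sqrt(total) + eps))
    return rates


def rmsprop_rates(gradients, base_lr=0.1, decay=0.9, eps=1e-8):
    """RMSProp: divide by the root of a decaying average of squared gradients."""
    average = 0.0
    rates = []
    for g in gradients:
        average = decay * average + (1.0 - decay) * g * g
        rates.append(base_lr / (math.sqrt(average) + eps))
    return rates


gradients = [1.0] * 1000
adagrad_final = adagrad_rates(gradients)[-1]
rmsprop_final = rmsprop_rates(gradients)[-1]
print(adagrad_final, rmsprop_final)
```

After 1000 steps the Adagrad rate has collapsed to a few thousandths, while the RMSProp rate has levelled off near the base rate.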



#### Adam optimizer




## Weight initialisation
- Should be small (but not too small, to avoid the vanishing gradient problem)
- Should not all be the same (otherwise all neurons will behave in the same manner)
- Should have good variance
- Technique #1: Uniform distribution between -1/sqrt(number of inputs) and +1/sqrt(number of inputs) (works well with sigmoid)
- Technique #2: Xavier/Glorot (works well with sigmoid)
- Technique #3: He uniform or He normal (works well with ReLU)
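
The uniform variants of techniques #2 and #3 can be sketched in plain Python. The limits used here, `sqrt(6 / (inputs + outputs))` for Glorot uniform and `sqrt(6 / inputs)` for He uniform, are the usual definitions. The layer size of 11 inputs and 6 outputs is only an example.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform: limit depends on both inputs and outputs."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def he_uniform(fan_in, fan_out, rng):
    """He uniform: limit depends only on the number of inputs."""
    limit = math.sqrt(6.0 / fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)  # fixed seed so the example is repeatable
w_glorot = glorot_uniform(11, 6, rng)
w_he = he_uniform(11, 6, rng)

print(w_glorot[0])
print(w_he[0])
```

Both functions return small, distinct weights, which satisfies the first two points of the list above.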



## Underfitting and overfitting

- Underfitting is usually less of a concern in a neural network because we can add layers or neurons to increase capacity.
- For overfitting, we have 2 options: regularisation and dropout.
- In dropout, we randomly deactivate a fraction p of the neurons (the dropout ratio) in each layer on every training step. This is like training many smaller networks and averaging them.
- The p value (between 0 and 1) should be higher if the model has a large number of layers. Use hyperparameter optimisation to find the right dropout ratio.
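
Dropout on one layer's output can be sketched as follows. This assumes the common "inverted dropout" variant, where the surviving activations are scaled up by `1 / (1 - p)` during training so that nothing needs to change at prediction time. The function name and values are illustrative.

```python
import random


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return list(activations)  # at prediction time every neuron stays active
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)
layer_output = [1.0] * 10000  # hypothetical activations of one layer

dropped = dropout(layer_output, p=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print(zero_fraction)
```

Roughly half of the activations are zeroed on each call, and a different random half is chosen every time.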



## ANN (Artificial Neural Network)

## CNN (Convolutional Neural Network)

## RNN (Recurrent Neural Network)



#### ANN code

- Import the Keras **Sequential** class to build the model (be it ANN, RNN or CNN)
- Import the Keras **Dense** layer for the hidden layers
- Import the Keras **Dropout** layer for regularisation

- All hidden layers, including the first one that receives the inputs, should use the ReLU activation function
- For binary classification, the final output layer should use a sigmoid activation function


```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the churn dataset: columns 3 to 12 are the features, column 13 is the label
dataset = pd.read_csv('./Data/Churn_Modelling.csv')
X = dataset.iloc[:, 3:13]
y = dataset.iloc[:, 13]

print("initially\n")
print(X.head())
print(y.head())

# One-hot encode the categorical columns, dropping the first level of each
geography = pd.get_dummies(X["Geography"], drop_first=True, dtype=int)
gender = pd.get_dummies(X["Gender"], drop_first=True, dtype=int)

X = pd.concat([X, geography, gender], axis=1)

# Drop the original categorical columns now that they are encoded
X = X.drop(['Geography', 'Gender'], axis=1)

print("\n\nAfter changing geography and gender\n\n")
print(X.head())
print(y.head())

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling: fit on the training set only, then apply to both sets
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

print("\n\nAfter scalar transformation\n\n")
print(X_train)
```

```python
# Importing the Keras classes and the plotting/metric helpers
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input  # Dropout is available for regularisation; unused in this small model
from sklearn.metrics import confusion_matrix, accuracy_score

# Initialising the ANN
classifier = Sequential()

# Input layer: one value per feature (11 after one-hot encoding)
classifier.add(Input(shape=(X_train.shape[1],)))

# Adding the first hidden layer
classifier.add(Dense(units=6, kernel_initializer='he_uniform', activation='relu'))

# Adding the second hidden layer
classifier.add(Dense(units=6, kernel_initializer='he_uniform', activation='relu'))

# Adding the output layer
classifier.add(Dense(units=1, kernel_initializer='glorot_uniform', activation='sigmoid'))

# Compiling the ANN
classifier.compile(optimizer='adamax', loss='binary_crossentropy', metrics=['accuracy'])

# Fitting the ANN to the Training set, holding out 33% of it for validation
model_history = classifier.fit(X_train, y_train, validation_split=0.33, batch_size=10, epochs=5)

# List all data in history
print(model_history.history.keys())

# Summarize history for accuracy
# (older Keras versions name these keys 'acc' and 'val_acc')
plt.plot(model_history.history['accuracy'])
plt.plot(model_history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

# Summarize history for loss
plt.plot(model_history.history['loss'])
plt.plot(model_history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

# Part 3 - Making the predictions and evaluating the model

# Predicting the Test set results: probabilities above 0.5 become class 1
y_pred = (classifier.predict(X_test) > 0.5).astype(int).ravel()

# Making the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Calculate the Accuracy
score = accuracy_score(y_test, y_pred)
print(score)
```