<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Deep-Learning" data-toc-modified-id="Deep-Learning-1">Deep Learning</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Neural-Networks" data-toc-modified-id="Neural-Networks-1.0.1">Neural Networks</a></span></li><li><span><a href="#Forward-Propagation" data-toc-modified-id="Forward-Propagation-1.0.2">Forward Propagation</a></span></li></ul></li><li><span><a href="#Gradient-Descent" data-toc-modified-id="Gradient-Descent-1.1">Gradient Descent</a></span></li><li><span><a href="#Backpropagation" data-toc-modified-id="Backpropagation-1.2">Backpropagation</a></span></li><li><span><a href="#Creating-a-Keras-Model" data-toc-modified-id="Creating-a-Keras-Model-1.3">Creating a Keras Model</a></span><ul class="toc-item"><li><span><a href="#Model-Specification" data-toc-modified-id="Model-Specification-1.3.1">Model Specification</a></span></li></ul></li></ul></li></ul></div>

# Deep Learning


### Neural Networks
Deep learning uses especially powerful **neural networks**.

A neural network consists of three kinds of layers:
1. *Input layer*, which contains the features.
2. *Output layer*, which contains the predicted variable (target).
3. *Hidden layers*, which model interactions between the features (inputs) and pass the result on to the output layer. The more nodes a hidden layer contains, the more interactions it can capture, and the more computation it requires.

<img src= "diagram.png" height="300" width="300">

### Forward Propagation

<img src= "feed-forward.png" height="400" width="400">

As the diagram above shows, the outputs of the input layer are multiplied by weights that the model has to learn. Summing these products gives the value of each node in the hidden layer. The same procedure is then applied between the hidden layer and the output layer. Moving from the input layer through the hidden layer to the output layer is called **forward propagation**.

To capture non-linear relationships among the features, and between the features and the target variable, an **activation function** is applied in the hidden layer. It transforms the value arriving at each node into the output that is passed on to the next layer. One example is the **Rectified Linear Unit (ReLU)** activation function, which is defined as follows:

$$ \mathrm{ReLU}(x) = \left\{
\begin{array}{ll}
      0 & x < 0 \\
      x & x \geq 0 \\
\end{array}
\right. $$

In Python you can define the function as follows:

```python
def relu(x):
    '''Return the ReLU activation of x: 0 for negative inputs, x otherwise.'''
    # The parameter is named x to avoid shadowing the built-in input()
    return max(x, 0)
```
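
The forward pass described above can be sketched in NumPy. The input values and weights below are made up for illustration (they are not read off the diagram), and the network is assumed to have two inputs, two hidden nodes with ReLU activation, and one output node.

```python
import numpy as np

def relu(x):
    """Rectified linear activation: 0 for negative inputs, x otherwise."""
    return np.maximum(x, 0)

# Hypothetical input: one observation with two features.
input_data = np.array([3.0, 5.0])

# Illustrative weights (in practice they are learned during training).
hidden_weights = np.array([[2.0, 4.0],    # weights into hidden node 0
                           [4.0, -5.0]])  # weights into hidden node 1
output_weights = np.array([2.0, 7.0])     # weights into the output node

# Hidden layer: weighted sum of the inputs, then the activation function.
hidden_input = hidden_weights @ input_data
hidden_output = relu(hidden_input)

# Output layer: weighted sum of the hidden-layer outputs.
prediction = output_weights @ hidden_output
print(hidden_input, hidden_output, prediction)
```

The second hidden node receives a negative weighted sum, so ReLU sets it to zero and it contributes nothing to the prediction.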

To improve predictions, more hidden layers can be added. Additional layers let the model capture more complex patterns, at the cost of a more complex model that is more expensive to train. Deep networks internally build representations of patterns in the data, which partially replaces the need for feature engineering. Deeper layers hold more sophisticated representations of the data.

## Gradient Descent
The goal is to minimize the loss function (commonly the mean squared error for regression). Gradient descent does this by moving along the loss function, evaluating it for successive sets of weight values. The slope of the tangent line to the loss function determines the direction to move: you take a step in the direction that decreases the loss, measure the slope again, and repeat until the slope is close to zero. The weights at that point, where the loss is at a minimum, are the ones you keep. The step size must be controlled: steps that are too small make training slow and computationally expensive, while steps that are too large can overshoot the minimum. We therefore multiply the current slope by a **learning rate** and subtract the result from the current weights to get the new weights.
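
The update rule above (new weights = current weights minus learning rate times slope) can be sketched for a simple linear model with a mean squared error loss. The toy data, the learning rate of 0.05, and the helper names `mse` and `mse_slope` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy data: the target is exactly 2 * x1 + 3 * x2.
inputs = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [0.0, 1.0]])
targets = inputs @ np.array([2.0, 3.0])

def mse(weights):
    """Mean squared error of a linear model with the given weights."""
    errors = inputs @ weights - targets
    return np.mean(errors ** 2)

def mse_slope(weights):
    """Slope (gradient) of the MSE with respect to each weight."""
    errors = inputs @ weights - targets
    return 2 * inputs.T @ errors / len(targets)

weights = np.zeros(2)
learning_rate = 0.05
loss_history = [mse(weights)]

for _ in range(500):
    # Step against the slope, scaled by the learning rate.
    weights = weights - learning_rate * mse_slope(weights)
    loss_history.append(mse(weights))

print(weights, loss_history[0], loss_history[-1])
```

With this learning rate the loss shrinks at every step and the weights approach the values used to generate the data. A much larger learning rate would make the updates overshoot and diverge.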

## Backpropagation

Backpropagation propagates the prediction error from the output layer back through the hidden layers toward the input layer. This lets gradient descent update all the weights in the neural network, because it yields the slope of the loss with respect to every weight. Forward propagation must run first: each time you generate predictions with forward propagation, you then update the weights with backpropagation.

The slope for a weight is the product of:
1. The value of the node feeding into that weight
2. The slope of the activation function at the node the weight feeds into
3. The slope of the loss function with respect to the output node
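
The three-factor product can be checked on the smallest possible case: a single weight feeding one node value into a ReLU output node, with a squared-error loss. The numbers and the helper names (`relu_slope`, `loss`) are assumptions for illustration, and the result is compared against a finite-difference estimate of the slope.

```python
def relu(x):
    """ReLU activation."""
    return max(x, 0.0)

def relu_slope(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def loss(weight, node_value, target):
    """Squared error of a one-weight network with a ReLU output node."""
    prediction = relu(weight * node_value)
    return (prediction - target) ** 2

# Hypothetical values for the node, the weight, and the target.
node_value, weight, target = 3.0, 2.0, 10.0
prediction = relu(weight * node_value)

slope = (node_value                         # 1. node value feeding into the weight
         * relu_slope(weight * node_value)  # 2. slope of the activation function
         * 2 * (prediction - target))       # 3. slope of the squared-error loss

# Numerical check: central finite difference of the loss around the weight.
eps = 1e-6
numeric_slope = (loss(weight + eps, node_value, target)
                 - loss(weight - eps, node_value, target)) / (2 * eps)
print(slope, numeric_slope)
```

The slope is negative here, so gradient descent would increase the weight, which moves the prediction closer to the target.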

It is common to calculate slopes on only a subset of the data (a 'batch'), then use a different batch to calculate the next update, and so on until all the data has been used. This is called **stochastic gradient descent**.
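
A minimal sketch of the batching idea, assuming a linear model with a mean squared error loss, randomly generated toy data, a batch size of 20, and a learning rate of 0.1 (all chosen for illustration). Each pass over all batches is one epoch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: the target is exactly 2 * x1 + 3 * x2.
inputs = rng.normal(size=(200, 2))
targets = inputs @ np.array([2.0, 3.0])

weights = np.zeros(2)
learning_rate = 0.1
batch_size = 20

for epoch in range(20):
    # Shuffle the rows so that each epoch sees different batches.
    order = rng.permutation(len(inputs))
    for start in range(0, len(inputs), batch_size):
        batch = order[start:start + batch_size]
        # Slope of the MSE computed on this batch only.
        errors = inputs[batch] @ weights - targets[batch]
        slope = 2 * inputs[batch].T @ errors / len(batch)
        # One weight update per batch.
        weights -= learning_rate * slope

print(weights)
```

Because every batch triggers an update, the weights are adjusted ten times per epoch here instead of once.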

-------------

In [1]:
# to ignore some warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
# import libraries needed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#set plotting style to 'ggplot'
plt.style.use('ggplot')

## Setting Up Your Keras and TensorFlow Libraries in Jupyter Notebook/Lab

1. Create an environment named `tf` and activate it:
> conda create -n tf tensorflow

> conda activate tf
 
2. In this environment only TensorFlow is installed, so you need to install all the other libraries, such as:
pandas, numpy, scikit-learn, matplotlib, seaborn, keras, ...
> pip install ---

3. Re-install jupyter notebook 
> pip install jupyter notebook

4. Re-install jupyter lab (if you are using it)
> python -m pip install jupyterlab


## Creating a Keras Model

1. Specify the model architecture
2. Compile the model
3. Fit the model
4. Make predictions

### Model Specification

* **`Sequential()`**: each layer has nodes connected only to the next layer
* **`Dense()`**: all nodes in the previous layer are connected to all nodes in the current layer

In the next example you'll predict workers' wages based on characteristics such as their industry, education, and level of experience.

In [7]:
df = pd.read_csv('datasets/wages.csv')
df.head()

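| | wage_per_hour | union | education_yrs | experience_yrs | age | female | marr | south | manufacturing | construction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 0 | 8 | 21 | 35 | 1 | 1 | 0 | 1 | 0 |
| 1 | 4.95 | 0 | 9 | 42 | 57 | 1 | 1 | 0 | 1 | 0 |
| 2 | 6.67 | 0 | 12 | 1 | 19 | 0 | 0 | 0 | 1 | 0 |
| 3 | 4.0 | 0 | 12 | 4 | 22 | 0 | 0 | 0 | 0 | 0 |
| 4 | 7.5 | 0 | 12 | 17 | 35 | 0 | 1 | 0 | 0 | 0 |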


In [11]:
predictors = df.drop('wage_per_hour',axis=1).values
target = df['wage_per_hour'].values

In [12]:
# Import necessary modules
import keras
from keras.layers import Dense
from keras.models import Sequential

# Save the number of columns in predictors: n_cols
n_cols = predictors.shape[1]

# Set up the model: model
model = Sequential()

# Add the first hidden layer; input_shape tells it how many input features to expect
model.add(Dense(50, activation='relu', input_shape=(n_cols,)))

# Add the second hidden layer
model.add(Dense(32, activation='relu'))

# Add the output layer
model.add(Dense(1))

### Compiling and fitting a model

* Specify the optimizer
    * Control the learning rate
    * "Adam" is usually a good choice
    
    
* Specify the loss function
    * "mean_squared_error" is a common choice for regression
    

* Fitting the model
    * Apply backpropagation and gradient descent with your data to update weights


In [14]:
# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Fit the model
model.fit(predictors,target)

Epoch 1/1


<keras.callbacks.History at 0x2d1b459aac8>

## Classification Models

* The loss function is log loss: `'categorical_crossentropy'`
* Add `metrics=['accuracy']` to print the accuracy at each epoch, which makes the model easier to debug
* The output layer does not consist of a single node here; it has a separate node for each possible outcome
* Change the activation function of the output layer to `'softmax'`, which ensures the predictions sum to 1 so that they can be interpreted as probabilities
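
To see what softmax and log loss compute, here is a NumPy sketch. It is a hand-written illustration, not Keras' own implementation, and the raw output scores and targets are made-up values.

```python
import numpy as np

def softmax(scores):
    """Turn each row of raw scores into probabilities that sum to 1."""
    # Subtracting the row maximum avoids overflow without changing the result.
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def categorical_crossentropy(one_hot_targets, probabilities):
    """Average log loss: minus the log of the probability given to the true class."""
    return -np.mean(np.sum(one_hot_targets * np.log(probabilities), axis=1))

# Hypothetical raw outputs of a two-node output layer for two observations.
scores = np.array([[2.0, 0.0],
                   [0.5, 0.5]])
probabilities = softmax(scores)

# One-hot targets: first observation is class 0, second is class 1.
one_hot_targets = np.array([[1.0, 0.0],
                            [0.0, 1.0]])
loss = categorical_crossentropy(one_hot_targets, probabilities)
print(probabilities, loss)
```

The loss is small when the model assigns a high probability to the true class and grows as that probability approaches zero.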



In the next example we will study the Titanic dataset and predict whether a passenger survives or not.

First we will convert the target variable to categories and perform one-hot encoding on it using:
```python
from keras.utils import to_categorical
```
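
For integer class labels, one-hot encoding produces one column per class with a 1 in the column of the true class. The NumPy sketch below illustrates that transformation. The `one_hot` helper and the sample labels are hypothetical, and this is not how Keras implements `to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an array with one row per label and a 1 in the label's column."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# Hypothetical values of a binary target such as `survived`.
survived = [0, 1, 1, 0]
target = one_hot(survived, num_classes=2)
print(target)
```

The second column of the result corresponds to class 1, which is why the probability of survival is later read from column 1 of the predictions.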

In [15]:
df = pd.read_csv('datasets/titanic.csv')
df.head()

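| | survived | pclass | age | sibsp | parch | fare | male | age_was_missing | embarked_from_cherbourg | embarked_from_queenstown | embarked_from_southampton |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 1 | False | 0 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | False | 1 | 0 | 0 |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 0 | False | 0 | 0 | 1 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | False | 0 | 0 | 1 |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 1 | False | 0 | 0 | 1 |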


In [16]:
# EDA
df.describe()

| | survived | pclass | age | sibsp | parch | fare | male | embarked_from_cherbourg | embarked_from_queenstown | embarked_from_southampton |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 | 0.647587 | 0.188552 | 0.08642 | 0.722783 |
| std | 0.486592 | 0.836071 | 13.002015 | 1.102743 | 0.806057 | 49.693429 | 0.47799 | 0.391372 | 0.281141 | 0.447876 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 22.0 | 0.0 | 0.0 | 7.9104 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 3.0 | 29.699118 | 0.0 | 0.0 | 14.4542 | 1.0 | 0.0 | 0.0 | 1.0 |
| 75% | 1.0 | 3.0 | 35.0 | 1.0 | 0.0 | 31.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 | 1.0 | 1.0 | 1.0 | 1.0 |


In [21]:
# Import necessary modules
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical

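# Define the predictors (as_matrix() was removed in pandas 1.0; to_numpy() replaces it).
# Casting to float32 also turns the boolean age_was_missing column into 0/1.
predictors = df.drop(['survived'],axis=1).to_numpy(dtype='float32')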

# Convert the target to categorical: target
target = to_categorical(df.survived)

# Save number of columns
n_cols = predictors.shape[1]

# Set up the model
model = Sequential()

# Add the first layer
model.add(Dense(32,activation='relu',input_shape=(n_cols,)))

# Add the output layer
model.add(Dense(2,activation='softmax'))

# Compile the model
model.compile(optimizer='sgd',loss='categorical_crossentropy',
              metrics=['accuracy'])
# 'sgd' stochastic gradient descent

# Fit the model
model.fit(predictors,target)

Epoch 1/1


<keras.callbacks.History at 0x2d1b6012860>

After a single epoch, this simple model reached an accuracy of about 0.63 (63%).

### Using Your Model
Once the model is trained, you can:
* Save it (with the extension `.h5`)
* Reload it
* Make predictions
* Verify its structure

```python
from keras.models import load_model

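model.save('model_file.h5')
my_model = load_model('model_file.h5')  # load the same file that was saved
predictions = my_model.predict(data_to_predict_with)  # data_to_predict_with: your new input array
probability_true = predictions[:,1]  # second column: probability of the positive class
my_model.summary()  # verify the model's structure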
```


--------------
## Model Optimization

If your model doesn't show much improvement try:
* Changing the learning rate
* Changing the activation function



In [None]:
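# Import the SGD optimizer
from keras.optimizers import SGD

# get_new_model() is not defined in this notebook. A minimal version,
# assuming the same architecture as the Titanic model above:
def get_new_model():
    model = Sequential()
    model.add(Dense(32, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(2, activation='softmax'))
    return model

# Create list of learning rates: lr_to_test
lr_to_test = [0.000001,0.01,1]

# Loop over learning rates
for lr in lr_to_test:
    print('\n\nTesting model with learning rate: %f\n'%lr )
    
    # Build new model to test, unaffected by previous models
    model = get_new_model()
    
    # Create SGD optimizer with specified learning rate: my_optimizer
    # (recent Keras versions use learning_rate; older ones used lr)
    my_optimizer = SGD(learning_rate=lr)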
    
    # Compile the model
    model.compile(optimizer=my_optimizer,loss='categorical_crossentropy')
    
    # Fit the model
    model.fit(predictors,target)