## Interactions

- Neural networks account for interactions really well
- Deep learning uses especially powerful neural networks
	- Text
	- Images
	- Videos
	- Audio
	- Source code

## Forward propagation

- Multiply-add process
- Dot product


~~~
import numpy as np

input_data = np.array([2, 3])

weights = {'node_0': np.array([1, 1]), 'node_1': np.array([-1, 1]), 'output': np.array([2, -1])}

node_0_value = (input_data * weights['node_0']).sum()
node_1_value = (input_data * weights['node_1']).sum()

hidden_layer_values = np.array([node_0_value, node_1_value])

output = (hidden_layer_values * weights['output']).sum()

print(output)
~~~

~~~
9
~~~
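
Each multiply-add step above is a dot product, so a whole layer can be computed as one matrix-vector product. A minimal sketch using the same inputs and weights; the names `hidden_weights` and `output_weights` are introduced here for illustration only.

```python
import numpy as np

input_data = np.array([2, 3])

# Stack the per-node weight vectors into one matrix (one row per hidden node)
hidden_weights = np.array([[1, 1],
                           [-1, 1]])
output_weights = np.array([2, -1])

# One matrix-vector product replaces the two separate multiply-add steps
hidden_layer_values = hidden_weights @ input_data
output = output_weights @ hidden_layer_values

print(hidden_layer_values)  # [5 1]
print(output)               # 9
```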


## Activation functions

- Applied to node inputs to produce node output
- ReLU (Rectified Linear Unit)

$$
\textrm{ReLU}(x) =
\begin{cases}
0, & x < 0 \\
x, & x \geq 0
\end{cases}
= \textrm{max}(0, x)
$$

~~~
def relu(x):
    '''Return x for positive inputs and 0 otherwise'''
    # Calculate the value for the output of the relu function: output
    output = max(0, x)

    # Return the value just calculated
    return output

# Calculate node 0 value: node_0_output
node_0_input = (input_data * weights['node_0']).sum()
node_0_output = relu(node_0_input)

# Calculate node 1 value: node_1_output
node_1_input = (input_data * weights['node_1']).sum()
node_1_output = relu(node_1_input)

# Put node values into array: hidden_layer_outputs
hidden_layer_outputs = np.array([node_0_output, node_1_output])

# Calculate model output (do not apply relu)
model_output = (hidden_layer_outputs * weights['output']).sum()

# Print model output
print(model_output)
~~~
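
ReLU can also be applied to a whole layer at once. A sketch using `np.maximum`, with the same weights as above but a hypothetical input chosen so that one node receives a negative value; `relu_layer` is an illustrative name, not part of the original exercise.

```python
import numpy as np

def relu_layer(values):
    """Apply ReLU element-wise to an array of node inputs."""
    return np.maximum(0, values)

# Hypothetical input chosen so that the second node receives a negative input
input_data = np.array([3, 1])
hidden_weights = np.array([[1, 1], [-1, 1]])
output_weights = np.array([2, -1])

hidden_inputs = hidden_weights @ input_data      # [4, -2]
hidden_outputs = relu_layer(hidden_inputs)       # [4, 0]
model_output = output_weights @ hidden_outputs   # 8

print(hidden_inputs, hidden_outputs, model_output)
```

Without the activation the output would be `2 * 4 - 1 * (-2) = 10`; ReLU zeroes out the negative node, which changes the prediction to 8.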

## Representation learning

- Deep networks internally build representations of patterns in the data
- Partially replace the need for feature engineering
- Subsequent layers build increasingly sophisticated representations of raw data

## Deep learning

- Modeler doesn't need to specify the interactions
- When you train the model, the neural network gets weights that find the relevant patterns to make better predictions

## Loss function

- Aggregate errors in predictions from many data points into single number
- Measure of model's predictive performance

- Lower loss function value means a better model
- Goal: Find the weights that give the lowest value for the loss function
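
As one concrete loss function, mean squared error averages the squared prediction errors. A small sketch with made-up targets and predictions, comparing two hypothetical models:

```python
import numpy as np

def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean((predictions - targets) ** 2)

targets = [3.0, 5.0, 2.0]   # made-up values
preds_a = [2.0, 5.0, 4.0]   # errors: -1, 0, 2
preds_b = [3.0, 4.0, 2.0]   # errors: 0, -1, 0

loss_a = mean_squared_error(preds_a, targets)  # (1 + 0 + 4) / 3
loss_b = mean_squared_error(preds_b, targets)  # (0 + 1 + 0) / 3

print(loss_a, loss_b)  # the second set of predictions has the lower loss
```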

## Gradient descent

- Start at random point
- Until you are somewhere flat:
	- Find the slope
	- Take a step downhill

- If the slope is positive:
	- Going opposite the slope means moving to lower numbers
	- Subtract the slope from the current value
	- Too big a step might lead us astray

- Solution: learning rate
	- Update each weight by subtracting `learning_rate * slope`
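
The update rule can be sketched on a toy one-weight problem. The loss $(w - 3)^2$ and the two learning rates below are assumptions chosen for illustration: the small rate converges, the large one overshoots on every step.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def slope(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, n_steps):
    """Repeatedly step against the slope, scaled by the learning rate."""
    for _ in range(n_steps):
        w = w - learning_rate * slope(w)
    return w

w_small_steps = gradient_descent(w=0.0, learning_rate=0.1, n_steps=50)
w_big_steps = gradient_descent(w=0.0, learning_rate=1.1, n_steps=50)

print(w_small_steps)  # close to 3
print(w_big_steps)    # far from 3: each step overshoots the minimum
```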

## Slope calculation example

- To calculate the slope for a weight, need to multiply:
	- Slope of the loss function with respect to (w.r.t.) value at the node we feed into
	- The value of the node that feeds into our weight
	- Slope of the activation function w.r.t. value we feed into

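~~~
import numpy as np

# Example values, assumed here for illustration
input_data = np.array([1, 2, 3])
weights = np.array([0, 2, 1])
target = 0

# Set the learning rate: learning_rate
learning_rate = 0.01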

# Calculate the predictions: preds
preds = (weights * input_data).sum()

# Calculate the error: error
error = preds - target

# Calculate the slope: slope
slope = 2 * input_data * error

# Update the weights: weights_updated
weights_updated = weights - learning_rate * slope

# Get updated predictions: preds_updated
preds_updated = (input_data * weights_updated).sum()

# Calculate updated error: error_updated
error_updated = preds_updated - target

# Print the original error
print(error)

# Print the updated error
print(error_updated)
~~~
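
A single update only nudges the weights, so in practice the step is repeated. A sketch of that loop for one data point, using hypothetical example values and recording the squared error before each update:

```python
import numpy as np

# Hypothetical example values
input_data = np.array([1.0, 2.0, 3.0])
weights = np.array([0.0, 2.0, 1.0])
target = 0.0
learning_rate = 0.01

squared_errors = []
for _ in range(20):
    preds = (weights * input_data).sum()
    error = preds - target
    squared_errors.append(error ** 2)

    # Slope of the squared error w.r.t. each weight: 2 * input * error
    slope = 2 * input_data * error
    weights = weights - learning_rate * slope

print(squared_errors[0], squared_errors[-1])  # the error shrinks with every update
```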

## Backpropagation

- Allows gradient descent to update all weights in the neural network (by getting gradients for all weights)
- Comes from the chain rule of calculus

## Backpropagation process

- Trying to estimate the slope of the loss function w.r.t. each weight
- Go back one layer at a time
- The gradient for a weight is the product of:
	1. Node value feeding into that weight
	2. Slope of loss function w.r.t. node it feeds into
	3. Slope of activation function at the node it feeds into

- Also need to keep track of the slopes of the loss function w.r.t. node values
- The slope for a node value is the sum of the slopes for all weights that come out of it
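
The three-factor product can be sketched for a tiny network with two inputs, two ReLU hidden nodes and one linear output. All values and function names below are hypothetical, the loss is assumed to be the squared error, and one gradient is checked against a finite-difference estimate.

```python
import numpy as np

def forward(x, w_hidden, w_out):
    """Return hidden inputs, hidden outputs (ReLU) and the prediction."""
    hidden_in = w_hidden @ x
    hidden_out = np.maximum(0, hidden_in)
    return hidden_in, hidden_out, w_out @ hidden_out

def backprop(x, target, w_hidden, w_out):
    """Gradients of the squared error w.r.t. both weight arrays."""
    hidden_in, hidden_out, pred = forward(x, w_hidden, w_out)
    d_pred = 2 * (pred - target)                   # slope of loss w.r.t. the prediction
    grad_w_out = hidden_out * d_pred               # node value * slope (linear output, slope 1)
    d_hidden_out = w_out * d_pred                  # slope of loss w.r.t. hidden node values
    d_hidden_in = d_hidden_out * (hidden_in > 0)   # times the ReLU slope (1 or 0)
    grad_w_hidden = np.outer(d_hidden_in, x)       # times the input feeding each weight
    return grad_w_hidden, grad_w_out

def loss(x, target, w_hidden, w_out):
    """Squared error of the prediction."""
    return (forward(x, w_hidden, w_out)[2] - target) ** 2

# Hypothetical values
x = np.array([2.0, 3.0])
w_hidden = np.array([[1.0, 1.0], [-1.0, 1.0]])
w_out = np.array([2.0, -1.0])
target = 5.0

grad_w_hidden, grad_w_out = backprop(x, target, w_hidden, w_out)

# Finite-difference check on a single hidden weight
eps = 1e-6
bumped = w_hidden.copy()
bumped[0, 0] += eps
numeric = (loss(x, target, bumped, w_out) - loss(x, target, w_hidden, w_out)) / eps

print(grad_w_hidden)
print(grad_w_out)
print(grad_w_hidden[0, 0], numeric)  # the two estimates should nearly agree
```

Each hidden node here has a single outgoing weight, so the sum over outgoing weights has only one term.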

## Stochastic Gradient Descent

- It is common to calculate slopes on only a subset of the data ('batch')
- Use a different batch of data to calculate the next update
- Start over from the beginning once all data is used
- Each time through the training data is called an epoch
- When slopes are calculated on one batch at a time: stochastic gradient descent
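
A sketch of the batch-and-epoch loop on a synthetic linear problem. The data, batch size and learning rate are assumptions; reshuffling the rows at the start of every epoch is a common choice that the notes above do not prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: target = 2 * x0 - 1 * x1 (no noise)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0])

weights = np.zeros(2)
learning_rate = 0.05
batch_size = 20
n_epochs = 30

for epoch in range(n_epochs):
    order = rng.permutation(len(X))            # one pass over all rows = one epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        error = X[batch] @ weights - y[batch]
        # Slope of the mean squared error on this batch only
        slope = 2 * X[batch].T @ error / len(batch)
        weights = weights - learning_rate * slope

print(weights)  # should end up close to [2, -1]
```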


## `keras` model building steps

1. Specify architecture
2. Compile
3. Fit
4. Predict

## Model specification

~~~
import numpy as np

from keras.layers import Dense
from keras.models import Sequential

predictors = np.loadtxt('predictors_data.csv', delimiter=',')

n_cols = predictors.shape[1]

model = Sequential()
model.add(Dense(100, activation='relu', input_shape=(n_cols,)))
model.add(Dense(100, activation='relu'))
model.add(Dense(1))
~~~

## Why you need to compile your model

- Specify the optimizer
	- Controls the learning rate
	- 'Adam' is usually a good choice
- Loss function
	- `mean_squared_error`: common for regression

~~~
model.compile(optimizer='adam', loss='mean_squared_error')
~~~

## Fitting a model

- Applying backpropagation and gradient descent with your data to update the weights
- Scaling data before fitting can ease optimization

~~~
model.fit(predictors, target)
~~~
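
One common way to scale predictors is to standardize each column. A sketch with hypothetical predictors whose columns sit on very different scales:

```python
import numpy as np

# Hypothetical predictors with columns on very different scales
predictors = np.array([[1.0, 1000.0],
                       [2.0, 3000.0],
                       [3.0, 5000.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
col_means = predictors.mean(axis=0)
col_stds = predictors.std(axis=0)
predictors_scaled = (predictors - col_means) / col_stds

print(predictors_scaled.mean(axis=0))  # approximately [0, 0]
print(predictors_scaled.std(axis=0))   # approximately [1, 1]
```

The same `col_means` and `col_stds` would typically be reused when scaling new data for prediction.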

## Classification

- `categorical_crossentropy` loss function
	- Similar to log loss: lower is better
- Add `metrics=['accuracy']` to compile step for easy-to-understand diagnostics
- Output layer has separate node for each possible outcome and uses 'softmax' activation

~~~
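import pandas as pd
from keras.utils import to_categorical

data = pd.read_csv('basketball_shot_log.csv')

# DataFrame.as_matrix() was removed from pandas; to_numpy() is the replacement
predictors = data.drop(['shot_result'], axis=1).to_numpy()
n_cols = predictors.shape[1]

target = to_categorical(data.shot_result)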

model = Sequential()
model.add(Dense(100, activation='relu', input_shape=(n_cols,)))
model.add(Dense(100, activation = 'relu'))
model.add(Dense(100, activation = 'relu'))
model.add(Dense(2, activation = 'softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(predictors, target)
~~~
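
To see what the classification pieces compute, here is a NumPy-only sketch of one-hot encoding, softmax and categorical cross-entropy. The helper functions are illustrative re-implementations, not the Keras internals, and the labels and scores are made up.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Turn integer class labels into rows of 0s with a single 1 per label."""
    encoded = np.zeros((len(labels), n_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def softmax(scores):
    """Convert each row of raw scores into probabilities that sum to 1."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob):
    """Mean negative log-probability assigned to the true class."""
    return -np.mean(np.sum(y_true * np.log(y_prob), axis=1))

labels = np.array([0, 1, 1])                  # made-up class labels
scores = np.array([[2.0, 0.0],
                   [0.0, 2.0],
                   [1.0, 1.0]])               # made-up output-layer scores

y_true = one_hot(labels, n_classes=2)
y_prob = softmax(scores)
loss = categorical_crossentropy(y_true, y_prob)

print(y_true)
print(y_prob.sum(axis=1))  # every row sums to 1
print(loss)                # lower is better
```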

## Using models

- Save
- Reload
- Make predictions

~~~
from keras.models import load_model

model.save('model_file.h5')

my_model = load_model('model_file.h5')

predictions = my_model.predict(data_to_predict_with)

probability_true = predictions[:,1]
~~~


## Stochastic gradient descent

~~~
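from keras.optimizers import SGD

# n_cols is the number of predictor columns, as defined earlier
input_shape = (n_cols,)

def get_new_model(input_shape=input_shape):
    model = Sequential()

    model.add(Dense(100, activation='relu', input_shape=input_shape))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(2, activation='softmax'))

    return model

lr_to_test = [.000001, 0.01, 1]

for lr in lr_to_test:
    model = get_new_model()

    # Recent Keras versions name this argument learning_rate (older ones used lr)
    my_optimizer = SGD(learning_rate=lr)

    model.compile(optimizer=my_optimizer, loss='categorical_crossentropy')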
~~~

## The dying neuron problem

- Once a node starts always getting negative inputs
	- It may continue only getting negative inputs
- Contributes nothing to the model
	- 'Dead' neuron

## Vanishing gradients

- Occurs when many layers have very small slopes (e.g. due to being on flat part of tanh curve)
- In deep networks, the weight updates from backprop can be close to $0$
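
A rough numeric illustration, assuming every layer's node input sits at 3.0 on the flat part of the tanh curve and ignoring the weights: backprop multiplies one activation slope per layer, so the product shrinks quickly with depth.

```python
import numpy as np

def tanh_slope(x):
    """Derivative of tanh: 1 - tanh(x)**2."""
    return 1.0 - np.tanh(x) ** 2

# Assumed node input on the flat part of the tanh curve
slope_per_layer = tanh_slope(3.0)

for n_layers in (1, 5, 10):
    # One activation slope is multiplied in per layer
    print(n_layers, slope_per_layer ** n_layers)
```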

## Validation in deep learning

- Commonly use validation split rather than cross-validation
- Deep learning widely used on large datasets
- A single validation score is based on a large amount of data, so it is reliable
- Repeated training from cross-validation would take a long time

~~~
model.fit(predictors, target, validation_split=0.3)
~~~
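
What a 30% validation split amounts to can be sketched by hand. This version holds out the last rows without shuffling; `holdout_split` is a hypothetical helper, not a Keras function, and the data is made up.

```python
import numpy as np

def holdout_split(predictors, target, validation_split):
    """Hold out the last fraction of the rows for validation."""
    n_val = round(len(predictors) * validation_split)
    n_train = len(predictors) - n_val
    return (predictors[:n_train], target[:n_train],
            predictors[n_train:], target[n_train:])

# Made-up data: 2000 rows, 2 predictor columns
predictors = np.arange(4000).reshape(2000, 2)
target = np.arange(2000)

X_train, y_train, X_val, y_val = holdout_split(predictors, target, validation_split=0.3)

print(len(X_train), len(X_val))  # 1400 600
```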

## Early stopping

~~~
from keras.callbacks import EarlyStopping

early_stopping_monitor = EarlyStopping(patience=2)
# patience: num. of epochs without improvement

model.fit(predictors, target, validation_split=0.3, epochs=20, callbacks = [early_stopping_monitor])
~~~
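
The patience logic can be sketched in plain Python. This is an illustrative re-implementation rather than the Keras code, run on made-up validation losses.

```python
def epochs_run_with_early_stopping(val_losses, patience):
    """Return how many epochs run before training stops."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Made-up validation losses: improvement stops after the third epoch
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.59, 0.58]

print(epochs_run_with_early_stopping(val_losses, patience=2))  # 5
```

With `patience=2`, training stops after epoch 5 even though later epochs would have improved; a larger patience trades extra training time for a lower risk of stopping too early.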

## Workflow for optimizing model capacity

- Start with a small network
- Get the validation score
- Keep increasing capacity until validation score is no longer improving

<img src='sequential-experiments.PNG'>

## Recognizing handwritten digits

- MNIST dataset
- $28 \times 28$ grid flattened to $784$ values for each image
- Value in each part of array denotes darkness of that pixel
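
The flattening step can be sketched with NumPy. The image below is a hypothetical blank grid with a single marked pixel, not an actual MNIST digit.

```python
import numpy as np

# Hypothetical 28 x 28 image: all zeros with one marked pixel
image = np.zeros((28, 28), dtype=np.uint8)
image[3, 5] = 255

flat = image.reshape(-1)      # row-major flattening to 784 values
print(flat.shape)             # (784,)
print(flat[3 * 28 + 5])       # 255: row 3, column 5 lands at index 3 * 28 + 5

restored = flat.reshape(28, 28)  # the grid can be recovered from the flat vector
```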

~~~
import pandas as pd

data = pd.read_csv('mnist.csv')

y = data.iloc[:,0]
X = data.iloc[:,1:]
~~~

~~~
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping

model = Sequential()

# Add the first hidden layer
model.add(Dense(50, activation='relu', input_shape=(784,)))

# Add the second hidden layer
model.add(Dense(50, activation='relu'))

# Add the output layer
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit the model
model.fit(X, to_categorical(y), epochs=10, validation_split=0.3, callbacks=[EarlyStopping(patience=2)])
~~~

~~~
Train on 1400 samples, validate on 600 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
~~~