# Deep Learning in Python
- neural networks
- deep learning models
- using Keras 2.0
- with Dan Becker from Kaggle, Keras and TensorFlow libraries

Applications: robotics, NLP, image recognition, AI

# 1. Basics of Deep Learning and Neural Networks
Interactions
- Neural Networks account for interactions really well
- Deep learning uses powerful neural networks
    - text, images, videos, audio, source code
- Deep learning models capture interactions
    - can take 2 features and calculate their interaction to predict the outcome
    - reality is that there are a lot of interactions
- Structure
    - Input Layer - features
    - Output Layer - predicted
    - Hidden Layer - everything in between input and output
        - each component is a NODE
        - we don't have data and can't observe this layer directly
        - the more nodes, the more interactions we capture
        

### 1.1 Forward propagation = using data to make predictions
Example: bank transactions
- make predictions based on:
    - number of children
    - number of existing accounts
- customer with 2 children, 3 accounts
    - line from input to each node
    - each line has a weight for how strongly the input affects the hidden node 
    - first set of weights - parameters we train/change when NN fit to data
- Dot product
    - hidden node prediction = sum(input*weight)
    - output node - repeat the same multiply-add process (sum(node*weight))

Forward propagation = input to hidden node to output
- dot product (multiply-add process)
- forward propagation for one data point at a time
- output is the prediction for that data point


In [1]:
# Forward propagation code

import numpy as np
# recall 2 children, 3 existing accounts
input_data = np.array([2,3])
# 2 hidden nodes
weights = {'node_0': np.array([1,1]),
           'node_1': np.array([-1,1]),
           'output': np.array([2,-1])}
node_0_value = (input_data * weights['node_0']).sum()
node_1_value = (input_data * weights['node_1']).sum()

hidden_layer_values = np.array([node_0_value, node_1_value])
print(hidden_layer_values)
# [5,1]

# get output
output = (hidden_layer_values * weights['output']).sum()
print(output)
# 9


### Example: Coding the forward propagation algorithm
In this exercise, you'll write code to do forward propagation (prediction) for your first neural network:

<img src="image1.png" width="500" />
Reference: https://s3.amazonaws.com/assets.datacamp.com/production/course_3524/datasets/1_4.png

Each data point is a customer. The first input is how many accounts they have, and the second input is how many children they have. The model will predict how many transactions the user makes in the next year. You will use this data throughout the first 2 chapters of this course.

The input data has been pre-loaded as input_data, and the weights are available in a dictionary called weights. The array of weights for the first node in the hidden layer are in weights['node_0'], and the array of weights for the second node in the hidden layer are in weights['node_1'].

The weights feeding into the output node are available in weights['output'].

NumPy will be pre-imported for you as np in all exercises.

In [None]:
# Calculate node 0 value: node_0_value
node_0_value = (input_data * weights['node_0']).sum()

# Calculate node 1 value: node_1_value
node_1_value = (input_data * weights['node_1']).sum()

# Put node values into array: hidden_layer_outputs
hidden_layer_outputs = np.array([node_0_value, node_1_value])

# Calculate output: output
output = (hidden_layer_outputs * weights['output']).sum()

# Print output
print(output)

# -39

### 1.2 Activation functions
- to achieve max predictive power, we need activation function in hidden layer
- allows model to capture non-linearities
- applied to node inputs to produce node output

Improving our neural network
- used to use tanh
- now ReLU is the standard

ReLU (Rectified Linear Activation)


In [None]:
# activation functions with tanh

import numpy as np
input_data = np.array([-1, 2])
weights = {'node_0': np.array([3,3]),
           'node_1': np.array([1,5]),
           'output': np.array([2,-1])}

# note distinguishing input and output
node_0_input = (input_data * weights['node_0']).sum()
node_0_output = np.tanh(node_0_input)
node_1_input = (input_data * weights['node_1']).sum()
node_1_output = np.tanh(node_1_input)

hidden_layer_outputs = np.array([node_0_output, node_1_output])
output = (hidden_layer_outputs * weights['output']).sum()
print(output)

### 1.2.a ReLU - The Rectified Linear Activation Function
As Dan explained to you in the video, an "activation function" is a function applied at each node. It converts the node's input into some output.

The rectified linear activation function (called ReLU) has been shown to lead to very high-performance networks. This function takes a single number as an input, returning 0 if the input is negative, and the input if the input is positive.

Here are some examples:
- relu(3) = 3 
- relu(-3) = 0 

In [None]:
# ReLU

def relu(input):
    '''Define your relu activation function here'''
    # Calculate the value for the output of the relu function: output
    # return 0 if input is negative, else return input
    output = max(input, 0)
    
    # Return the value just calculated
    return(output)

# Calculate node 0 value: node_0_output
node_0_input = (input_data * weights['node_0']).sum()
node_0_output = relu(node_0_input)

# Calculate node 1 value: node_1_output
node_1_input = (input_data * weights['node_1']).sum()
node_1_output = relu(node_1_input)

# Put node values into array: hidden_layer_outputs
hidden_layer_outputs = np.array([node_0_output, node_1_output])

# Calculate model output (do not apply relu)
model_output = (hidden_layer_outputs * weights['output']).sum()

# Print model output
print(model_output)

# 52
# we predicted 52 transactions
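
The `relu()` above works on one number at a time. As a minimal sketch, assuming you want to apply the activation to a whole array of node inputs at once, NumPy's `np.maximum` does the same thing element-wise. The name `relu_array` and the sample inputs are illustrative, not part of the course code.

```python
import numpy as np

def relu_array(values):
    """Element-wise ReLU: negative entries become 0, others are unchanged."""
    return np.maximum(values, 0)

# Hypothetical node inputs, chosen only for illustration
node_inputs = np.array([-3.0, 0.0, 3.0])

relu_outputs = relu_array(node_inputs)
tanh_outputs = np.tanh(node_inputs)  # tanh squashes values into (-1, 1)

print(relu_outputs)  # [0. 0. 3.]
print(tanh_outputs)
```

ReLU leaves positive inputs untouched, while tanh compresses every input into the range between -1 and 1.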

### 1.2.b Applying the network to many observations/rows of data
You'll now define a function called predict_with_network() which will generate predictions for multiple data observations, which are pre-loaded as input_data. As before, weights are also pre-loaded. In addition, the relu() function you defined in the previous exercise has been pre-loaded.

In [None]:
# Define predict_with_network()
def predict_with_network(input_data_row, weights):
    """accepts two arguments - input_data_row and weights - and 
    returns a prediction from the network as the output."""
    # Calculate node 0 value
    node_0_input = (input_data_row * weights['node_0']).sum()
    node_0_output = relu(node_0_input)

    # Calculate node 1 value
    node_1_input = (input_data_row * weights['node_1']).sum()
    node_1_output = relu(node_1_input)

    # Put node values into array: hidden_layer_outputs
    hidden_layer_outputs = np.array([node_0_output, node_1_output])
    
    # Calculate model output
    input_to_final_layer = (hidden_layer_outputs * weights['output']).sum()
    model_output = relu(input_to_final_layer)
    
    # Return model output
    return(model_output)


# Create empty list to store prediction results
results = []
for input_data_row in input_data:
    # Append prediction to results
    results.append(predict_with_network(input_data_row, weights))

# Print results
print(results)

# [52, 63, 0, 148]
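
The row-by-row loop can also be written as matrix multiplication over all rows at once. This is a sketch under assumptions: the weights and rows below are made up for illustration (they are not the pre-loaded exercise data), and `predict_batch` is a hypothetical helper that mirrors `predict_with_network()`, including the ReLU on the final output.

```python
import numpy as np

# Hypothetical weights and data rows, chosen only for illustration
hidden_weights = np.array([[1, 1],    # weights feeding hidden node 0
                           [-1, 1]])  # weights feeding hidden node 1
output_weights = np.array([2, -1])    # weights feeding the output node
rows = np.array([[2, 3],
                 [0, 1],
                 [3, 0]])

def predict_batch(rows, hidden_weights, output_weights):
    """Forward propagation for every row at once."""
    # One multiply-add per (row, hidden node) pair
    hidden_inputs = rows @ hidden_weights.T
    hidden_outputs = np.maximum(hidden_inputs, 0)   # ReLU
    final_inputs = hidden_outputs @ output_weights
    return np.maximum(final_inputs, 0)              # ReLU on the output

batch_predictions = predict_batch(rows, hidden_weights, output_weights)
print(batch_predictions)  # [9 1 6]
```

Each entry of `batch_predictions` matches what the loop version would return for the corresponding row.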

### 1.3 Deeper networks
Multiple hidden layers
- can scale to even 1000 layers
- same propagation process
- assume all layers use ReLU activation function

Representation learning (aka deep learning)
- deep networks internally build representations of patterns in the data
- partially replace the need for feature engineering
- called representation learning b/c subsequent layers build increasingly sophisticated representations of raw data until we get to a stage to make predictions
- example in images:
    - early nodes: diagonal line
    - later node: face node
    - next node: cat node    
- So, more complex or "higher level" interactions are in the last layers of the model

Deep learning
- pro: modeler doesn't need to specify the interactions
- when you train the model, the neural network gets weights that find the relevant patterns to make better predictions


### 1.3.a Multi-layer neural networks
In this exercise, you'll write code to do forward propagation for a neural network with 2 hidden layers. Each hidden layer has two nodes. The input data has been preloaded as input_data. The nodes in the first hidden layer are called node_0_0 and node_0_1. Their weights are pre-loaded as weights['node_0_0'] and weights['node_0_1'] respectively.

The nodes in the second hidden layer are called node_1_0 and node_1_1. Their weights are pre-loaded as weights['node_1_0'] and weights['node_1_1'] respectively.

We then create a model output from the hidden nodes using weights pre-loaded as weights['output'].

<img src="image2.png" width="500" />
Reference: https://s3.amazonaws.com/assets.datacamp.com/production/course_3524/datasets/ch1ex10.png

In [None]:
# forward propagation with 2 hidden layers

def predict_with_network(input_data):
    # Calculate node 0 in the first hidden layer
    node_0_0_input = (input_data * weights['node_0_0']).sum()
    node_0_0_output = relu(node_0_0_input)

    # Calculate node 1 in the first hidden layer
    node_0_1_input = (input_data * weights['node_0_1']).sum()
    node_0_1_output = relu(node_0_1_input)

    # Put node values into array: hidden_0_outputs
    hidden_0_outputs = np.array([node_0_0_output, node_0_1_output])
    
    # Calculate node 0 in the second hidden layer
    node_1_0_input = (hidden_0_outputs * weights['node_1_0']).sum()
    node_1_0_output = relu(node_1_0_input)

    # Calculate node 1 in the second hidden layer
    node_1_1_input = (hidden_0_outputs * weights['node_1_1']).sum()
    node_1_1_output = relu(node_1_1_input)

    # Put node values into array: hidden_1_outputs
    hidden_1_outputs = np.array([node_1_0_output, node_1_1_output])

    # Calculate model output: model_output
    model_output = (hidden_1_outputs * weights['output']).sum()
    
    # Return model_output
    return(model_output)

output = predict_with_network(input_data)
print(output)

# 182
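
Because every hidden layer repeats the same multiply-add-then-ReLU step, the forward pass can be sketched as a loop over layers. The function name `forward_pass` and all weights below are hypothetical; they are not the exercise's pre-loaded values.

```python
import numpy as np

def forward_pass(input_row, hidden_layers, output_weights):
    """Forward propagation through any number of hidden layers.

    hidden_layers is a list of 2-D arrays; row i of each array holds
    the weights feeding node i of that layer.
    """
    activations = np.asarray(input_row, dtype=float)
    for layer_weights in hidden_layers:
        # Multiply-add for every node in the layer, then ReLU
        activations = np.maximum(layer_weights @ activations, 0)
    # No activation on the output node
    return float(output_weights @ activations)

# Hypothetical weights for a network with 2 hidden layers of 2 nodes each
hidden_layers = [np.array([[1.0, 1.0], [1.0, -1.0]]),
                 np.array([[2.0, 0.0], [1.0, 1.0]])]
output_weights = np.array([1.0, -1.0])

prediction = forward_pass([1, 2], hidden_layers, output_weights)
print(prediction)  # 3.0
```

Adding a third hidden layer only means appending one more weight array to `hidden_layers`.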


# 2. Optimize a Neural Network with backward propagation
Consider a baseline neural network
- uses an identity function (returns the input)
- given input and true value of target
- use forward propagation to fill in values for hidden layer node values
- use hidden layer node values to make a prediction on output, then compute error compared to actual value of target
- changing the weights of the hidden layers improves accuracy of the predicted target

Predictions with multiple points
- making accurate predictions gets harder with more points
- at any set of weights, there are many values of the error
    - ...corresponding to the many points we make predictions for
    
Loss function
- aggregates errors in predictions from many data points into a single number
- measure of model's predictive performance
- example: a common loss for regression is the mean squared error loss function
- Lower loss function value means a better model
- Goal: Find the weights that give the lowest value for the loss function
- How? use Gradient descent algorithm
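
As a small illustration of how a loss function collapses many errors into one number, here is mean squared error computed by hand. The predictions and actual values are made up for the example.

```python
import numpy as np

# Hypothetical predictions and true targets for three data points
predictions = np.array([3.0, 5.0, 8.0])
actuals = np.array([2.0, 5.0, 10.0])

errors = predictions - actuals   # one error per data point: [1, 0, -2]
mse = np.mean(errors ** 2)       # (1 + 0 + 4) / 3

print(errors)
print(mse)
```

Squaring keeps positive and negative errors from cancelling out, so a lower value always means predictions are closer to the targets on average.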

Gradient descent
- analogy: imagine you are in a pitch dark field
    - want to find the lowest point
    - feel the ground to see how it slopes
    - take a small step downhill
    - repeat until it is uphill in every direction
- steps:
    - start at random point
    - until you are somewhere flat:
        - find the slope (the derivative)
        - take a step downhill

Optimizing a model with a single weight
- if slope is positive, to find the low point go in the opposite direction towards lower numbers
- eventually get to the minimum value
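
The single-weight case can be sketched with a toy loss whose minimum is known in advance. Everything here is an assumption made for illustration: the loss `(w - 3)^2`, the starting point, and the step size of 0.1.

```python
def toy_loss(w):
    """A made-up loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def toy_slope(w):
    """Derivative of toy_loss with respect to w."""
    return 2.0 * (w - 3.0)

weight = 0.0      # arbitrary starting point
step_size = 0.1   # how far to move against the slope each time

for step in range(100):
    # Positive slope -> move to lower numbers; negative slope -> higher
    weight = weight - step_size * toy_slope(weight)

print(weight)
print(toy_loss(weight))
```

After enough steps the slope is close to zero and the weight sits near 3, the flat bottom of this loss.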


### Example 1: Calculating model errors
For the exercises in this chapter, you'll continue working with the network to predict transactions for a bank.

What is the error (predicted - actual) for the following network when the input data is [3, 2] and the actual value of the target (what you are trying to predict) is 5? It may be helpful to get out a pen and piece of paper to calculate these values.

![example](https://s3.amazonaws.com/assets.datacamp.com/production/course_3524/datasets/ch2_ex2_3.png)

Answer:
output = 16
actual value = 5
error = 16 - 5 = 11

### Example 2: Understanding how weights change model accuracy
Imagine you have to make a prediction for a single data point. The actual value of the target is 7. The weight going from node_0 to the output is 2, as shown below. If you increased it slightly, changing it to 2.01, would the predictions become more accurate, less accurate, or stay the same?
<img src="image3.png" width="500" />
Reference: https://s3.amazonaws.com/assets.datacamp.com/production/course_3524/datasets/ch2_ex2_3.png

Answer: less accurate
Increasing weight from 2 to 2.01 would increase resulting error from 9 to 9.08. So less accurate

### Example 3: Coding how weight changes affect accuracy
Now you'll get to change weights in a real network and see how they affect model accuracy!

Have a look at the following neural network: 
<img src="image4.png" width="500" />
Reference: https://s3.amazonaws.com/assets.datacamp.com/production/course_3524/datasets/ch2ex4.png

Its weights have been pre-loaded as weights_0. Your task in this exercise is to update a single weight in weights_0 to create weights_1, which gives a perfect prediction (in which the predicted value is equal to target_actual: 3).

Use a pen and paper if necessary to experiment with different combinations. You'll use the predict_with_network() function, which takes an array of data as the first argument, and weights as the second argument.

In [None]:
# The data point you will make a prediction for
input_data = np.array([0, 3])

# Sample weights
weights_0 = {'node_0': [2, 1],
             'node_1': [1, 2],
             'output': [1, 1]
            }

# The actual target value, used to calculate the error
target_actual = 3

# Make prediction using original weights
model_output_0 = predict_with_network(input_data, weights_0)

# Calculate error: error_0
error_0 = model_output_0 - target_actual

# Create weights that cause the network to make 
#  perfect prediction (3): weights_1
weights_1 = {'node_0': [2, 1],
             'node_1': [1, 0], # changed 2nd weight to 0
             'output': [1, 1]
            }

# Make prediction using new weights: model_output_1
model_output_1 = predict_with_network(input_data, weights_1)

# Calculate error: error_1
error_1 = model_output_1 - target_actual

# Print error_0 and error_1
print(error_0)
print(error_1)


### Example 4: Scaling up to multiple data points
You've seen how different weights will have different accuracies on a single prediction. But usually, you'll want to measure model accuracy on many points. You'll now write code to compare model accuracies for two different sets of weights, which have been stored as weights_0 and weights_1.

input_data is a list of arrays. Each item in that list contains the data to make a single prediction. target_actuals is a list of numbers. Each item in that list is the actual value we are trying to predict.

In this exercise, you'll use the mean_squared_error() function from sklearn.metrics. It takes the true values and the predicted values as arguments.

You'll also use the preloaded predict_with_network() function, which takes an array of data as the first argument, and weights as the second argument.

In [None]:
from sklearn.metrics import mean_squared_error

# Create model_output_0 
model_output_0 = []
# Create model_output_1
model_output_1 = []

# Loop over input_data
for row in input_data:
    # Append prediction to model_output_0
    model_output_0.append(predict_with_network(row, weights_0))
    
    # Append prediction to model_output_1
    model_output_1.append(predict_with_network(row, weights_1))

# Calculate the mean squared error for model_output_0: mse_0
mse_0 = mean_squared_error(target_actuals, model_output_0)

# Calculate the mean squared error for model_output_1: mse_1
mse_1 = mean_squared_error(target_actuals, model_output_1)

# Print mse_0 and mse_1
print("Mean squared error with weights_0: %f" %mse_0)
print("Mean squared error with weights_1: %f" %mse_1)

# Mean squared error with weights_0: 294.000000
# Mean squared error with weights_1: 395.062500

# model_output_1 has a higher mean squared error

### 2.1 Gradient descent
- if slope is positive:
    - going opposite the slope means moving to lower numbers
    - subtract the slope from the current value
    - problem: too big a step might lead us astray
- Solution: learning rate
    - update each weight by subtracting (learning rate * slope)
    - learning rates frequently around 0.01
    - this ensures small steps toward the optimal weights

Slope calculation example
- to calculate the slope of a weight, need to multiply:
    - slope of the loss function w.r.t. prediction (value at the node we feed into)
        - = 2 * (Predicted value - Actual value) = 2* Error
        - example: 2 * (6-10) = -8
    - the value of the node that feeds into our weight
        - input = 3
    - slope of the activation function w.r.t. value we feed into
        - this example does not have activation function, so skip
    - slope of the mean squared loss function w.r.t. the weight = 2 * -4 * 3 = -24
- if learning rate = 0.01, new weight updated to 2-0.01(-24) = 2.24
- for multiple weights, repeat process for each weight separately, then update both weights simultaneously using their derivative
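
The worked example above (weight 2, input 3, prediction 6, target 10, learning rate 0.01) can be checked in a few lines. The variable names are illustrative.

```python
weight = 2.0
input_value = 3.0
target = 10.0
learning_rate = 0.01

prediction = weight * input_value                        # 6.0
slope_loss_wrt_pred = 2 * (prediction - target)          # 2 * (6 - 10) = -8.0
slope_wrt_weight = slope_loss_wrt_pred * input_value     # -8 * 3 = -24.0

new_weight = weight - learning_rate * slope_wrt_weight   # 2 - 0.01 * (-24)
new_prediction = new_weight * input_value

print(slope_wrt_weight)
print(new_weight)
print(new_prediction)
```

The updated prediction moves from 6 toward the target of 10, which is the direction gradient descent is supposed to push it.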


#### Code to calculate slope and update weights

In [None]:
import numpy as np
weights = np.array([1,2])
input_data = np.array([3, 4])
target = 6
learning_rate = 0.01
preds = (weights * input_data).sum()
error = preds - target
print(error)
# 5

# slope calculation
gradient = 2 * input_data * error
gradient
# array([30, 40])

# update weights by small step in that direction
weights_updated = weights - learning_rate * gradient
preds_updated = (weights_updated * input_data).sum()
error_updated = preds_updated - target
print(error_updated)
# 2.5 (approximately)
# the error shrank from 5 to 2.5; repeating this would result in more improvements

### Example 1: Calculating slopes
You're now going to practice calculating slopes. 
When plotting the mean-squared error loss function against predictions,
- the slope is 2*x*(xb-y),
- or 2*(input_data)*(error)

Note that x and b may have multiple numbers (x is a vector for each data point, and b is a vector). In this case, the output will also be a vector, which is exactly what you want.

You're ready to write the code to calculate this slope while using a single data point. You'll use pre-defined weights called weights as well as data for a single point called input_data. The actual value of the target you want to predict is stored in target.

In [None]:
# Calculate the predictions: preds
preds = (input_data * weights).sum()

# Calculate the error: error
# This error corresponds to xb-y in the gradient expression
error = preds - target

# Calculate the slope: slope
slope = 2 * input_data * error

# Print the slope
print(slope)

# [14 28 42]

### Example 2: Improving model weights (using the slope)
Hurray! You've just calculated the slopes you need. Now it's time to use those slopes to improve your model. If you subtract the slopes from your weights, you will move in the right direction. However, it's possible to move too far in that direction. So you will want to take a small step in that direction first, using a low learning rate, and verify that the model is improving.

The weights have been pre-loaded as weights, the actual value of the target as target, and the input data as input_data. The predictions from the initial weights are stored as preds.

In [None]:
# Set the learning rate: learning_rate
learning_rate = 0.01

# Calculate the predictions: preds
preds = (weights * input_data).sum()

# Calculate the error: error
error = preds - target

# Calculate the slope: slope
slope = 2 * input_data * error

# Update the weights: weights_updated
weights_updated = weights - (learning_rate * slope)

# Get updated predictions: preds_updated
preds_updated = (weights_updated * input_data).sum()

# Calculate updated error: error_updated
error_updated = preds_updated - target

# Print the original error
print(error)

# Print the updated error
print(error_updated)

# 7
# 5.04

# Updating the model weights decreased the error.

### Example 3: Making multiple updates to weights
You're now going to make multiple updates so you can dramatically improve your model weights, and see how the predictions improve with each update.

To keep your code clean, there is a pre-loaded get_slope() function that takes input_data, target, and weights as arguments. There is also a get_mse() function that takes the same arguments. The input_data, target, and weights have been pre-loaded.

This network does not have any hidden layers, and it goes directly from the input (with 3 nodes) to an output node. Note that weights is a single array.

We have also pre-loaded matplotlib.pyplot, and the error history will be plotted after you have done your gradient descent steps.

In [None]:
n_updates = 20
mse_hist = []

# Iterate over the number of updates
for i in range(n_updates):
    # Calculate the slope: slope
    # get_slope() is a pre-defined function
    slope = get_slope(input_data, target, weights)
    
    # Update the weights: weights
    # Learning rate = 0.01
    weights = weights - 0.01 * slope
    
    # Calculate mse with new weights: mse
    mse = get_mse(input_data, target, weights)
    
    # Append the mse to mse_hist
    mse_hist.append(mse)

# Plot the mse history
plt.plot(mse_hist)
plt.xlabel('Iterations')
plt.ylabel('Mean Squared Error')
plt.show()

# In the plot, you can see the mean squared error decreases as the 
# number of iterations goes up.

<img src="image5.png" width="500" />

### 2.2 Backpropagation
- use backpropagation to calculate the slopes you need to optimize more complex deep learning models
- Backpropagation - takes error from output layer and propagates it through hidden layers to input
- Allows gradient descent to update all weights in neural network (by getting gradients for all weights)
- Important to understand the process, but you will generally use a library that implements this.

Backpropagation process
- trying to estimate the slope of the loss function w.r.t. each weight
- Do forward propagation to calculate predictions and errors, before you do backpropagation
- Go back one layer at a time
- The gradient for a weight (its slope) is the product of:
    1. Input node value - Node value feeding into that weight
        - input node is given from data
        - hidden nodes are calculated from forward propagation
    2. Slope of loss function w.r.t. output node it feeds into
    3. Slope of activation function at the node it feeds into
- Need to also keep track of the slopes of the loss function w.r.t. node values
- The slope for a node value is the sum of the slopes for all weights that come out of it.

ReLU activation function
- slope = 0 if value is negative
- slope = 1 if value positive
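
To make the three-factor product concrete, here is a minimal sketch of one round of backpropagation through a network with one hidden layer of two ReLU nodes. All weights, inputs, and the target are hypothetical, and the loss is the squared error of a single prediction.

```python
import numpy as np

# Hypothetical data point, weights, and target
x = np.array([1.0, 2.0])
w_hidden = np.array([[1.0, 1.0],     # weights feeding hidden node 0
                     [-1.0, 1.0]])   # weights feeding hidden node 1
w_out = np.array([2.0, -1.0])        # weights feeding the output node
target = 4.0

# Forward propagation first: node values are needed for the slopes
hidden_in = w_hidden @ x                 # [3, 1]
hidden_out = np.maximum(hidden_in, 0)    # ReLU -> [3, 1]
pred = w_out @ hidden_out                # 2*3 - 1*1 = 5

# Slope of the squared-error loss w.r.t. the prediction
slope_pred = 2 * (pred - target)         # 2

# Output weights: (node value feeding the weight) * (slope at output)
grad_w_out = hidden_out * slope_pred     # [6, 2]

# Go back one layer: slope of the loss w.r.t. each hidden node value
slope_hidden_out = w_out * slope_pred    # [4, -2]

# ReLU slope is 1 where the node input was positive, else 0
relu_slope = (hidden_in > 0).astype(float)
slope_hidden_in = slope_hidden_out * relu_slope

# Hidden weights: (input value) * (slope at the node it feeds into)
grad_w_hidden = np.outer(slope_hidden_in, x)   # [[4, 8], [-2, -4]]

print(grad_w_out)
print(grad_w_hidden)
```

In practice a library such as Keras computes these gradients for you; the sketch only shows where each factor in the product comes from.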


### 2.2.a The relationship b/n forward and backward propagation
If you have gone through 4 iterations of calculating slopes (using backward propagation) and then updated weights, how many times must you have done forward propagation?

4

Each time you generate predictions using forward propagation, you update the weights using backward propagation.

### 2.2.b Thinking about backward propagation
If your predictions were all exactly right, and your errors were all exactly 0, the slope of the loss function with respect to your predictions would also be 0. In that circumstance, which of the following statements would be correct?

The updates to all weights in the network would also be 0.

### 2.3 Backpropagation in practice
Example:
- node 0 value = 1
- node 1 value = 3
- weight 0 = 1
- weight 1 = 2
- output = 7, actual target value = 4
    - error = 3
    - relevant slope of output node = 2x error, 2*3 = 6
- Top weight slope = 1*6 = 6
- Bottom weight slope = 3*6 = 18
- Calculate one layer back
    - top node = 6
    - bottom node = 18

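The numbers in this example can be reproduced directly. The variable names are illustrative; the values are the ones listed above.

```python
import numpy as np

node_values = np.array([1, 3])   # node 0 and node 1
weights = np.array([1, 2])       # weight 0 and weight 1
target = 4

output = (node_values * weights).sum()       # 1*1 + 3*2 = 7
error = output - target                      # 3
slope_output = 2 * error                     # 6

# Slope for each weight = value of the node feeding it * slope at output
weight_slopes = node_values * slope_output   # [6, 18]

print(output, error, slope_output)
print(weight_slopes)
```

There is no activation function on the output node here, so the third factor in the product is effectively 1.
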
Recall: Calculating slopes associated with any weight
- Recall that the gradient for a weight is the product of:
    1. Input node value - Node value feeding into that weight
        - input node is given from data
        - hidden nodes are calculated from forward propagation
    2. Slope of loss function w.r.t. output node it feeds into
    3. Slope of activation function at the node it feeds into

Backpropagation 
- Current weight value, Gradient
    - 0, 0
    - 1, 6
    - 2, 0
    - 3, 18
    
Backpropagation recap:
- start at some random set of weights
- use forward propagation to make a prediction
- use backward propagation to calculate the slope of the loss function w.r.t. each weight
- multiply that slope by the learning rate, and subtract from the current weights
- Repeat cycle until we get to a flat part

Stochastic gradient descent
- it's common to calculate slopes on only a subset of the data ('batch')
- use a different batch of data to calculate the next update
- once you've used all data, start over from the beginning
- each time through the training data is called an epoch
- when slopes are calculated on one batch at a time = stochastic gradient descent
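
The batch-and-epoch loop can be sketched for a network with no hidden layers (a single array of weights). The data is synthetic and every setting (true weights, batch size, learning rate, number of epochs) is an assumption chosen so the example converges quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: targets come from hypothetical "true" weights
true_weights = np.array([1.5, -2.0, 0.5])
features = rng.normal(size=(200, 3))
targets = features @ true_weights

weights = np.zeros(3)    # starting weights
learning_rate = 0.05
batch_size = 20
n_epochs = 50

for epoch in range(n_epochs):
    # One epoch = one pass through all of the training data
    for start in range(0, len(features), batch_size):
        batch_x = features[start:start + batch_size]
        batch_y = targets[start:start + batch_size]

        # Slope of the mean squared error on this batch only
        errors = batch_x @ weights - batch_y
        slope = 2 * batch_x.T @ errors / len(batch_x)

        # Update the weights before moving on to the next batch
        weights = weights - learning_rate * slope

print(weights)
```

Each update uses only 20 rows, so the weights are adjusted 10 times per epoch instead of once.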


### A round of backpropagation
In the network shown below, we have done forward propagation, and node values calculated as part of forward propagation are shown in white. The weights are shown in black. Layers after the question mark show the slopes calculated as part of back-prop, rather than the forward-prop values. Those slope values are shown in purple.

This network again uses the ReLU activation function, so the slope of the activation function is 1 for any node receiving a positive value as input. Assume the node being examined had a positive value (so the activation function's slope is 1).

<img src="image6.png" width="500" />
Reference: https://s3.amazonaws.com/assets.datacamp.com/production/course_3524/datasets/ch2ex14_1.png

Recall: The gradient for a weight (its slope) is the product of:
- Input node value
    - 2
- Slope of loss function w.r.t. output node it feeds into
    - 3
- Slope of activation function at the node it feeds into
    - ReLU slope = 1

What is the slope needed to update the weight with the question mark?

<img src="image7.png" width="500" />
Reference: https://s3.amazonaws.com/assets.datacamp.com/production/course_3524/datasets/ch2ex14_2.png

Answer: 2 x 3 x 1 = 6, which is the slope used to update that weight

# 3. Building Deep Learning models with keras
Model building steps
1. specify architecture
    - how many layers
    - how many nodes in each layer
    - what activation function at each layer
2. compile model
    - specify loss functions
    - details on optimization
3. Fit
    - cycle of backpropagation and optimization of model weights
4. Predict


### 3.1 Step 1: Model specification

In [None]:
import numpy as np
from keras.layers import Dense
from keras.models import Sequential

# read data
# n_cols = number of nodes in input layer
predictors = np.loadtxt('predictors_data.csv', delimiter=',')
n_cols = predictors.shape[1]

# build model - Sequential is the easier of 2 ways to build a model
# Sequential requires each layer to connect only to the 1 layer directly after it
model = Sequential()
# .add method adds layers
# Called Dense b/c all nodes from previous layer connect to current layer
# You may use layers that are not Dense
# Dense(number of nodes, activation=activation function)
# In first layer, you need to specify input_shape
# Input_shape(n_cols columns, nothing here specifies any number of rows)
# The last layer has 1 node b/c it's the output layer
# This model has 2 hidden layers
# Common to use 100s or 1000s of nodes in a layer
model.add(Dense(100, activation='relu', input_shape = (n_cols,)))
model.add(Dense(100, activation='relu'))
model.add(Dense(1))

In [None]:
# Model specification - code without notes
import numpy as np
from keras.layers import Dense
from keras.models import Sequential

# read data
predictors = np.loadtxt('predictors_data.csv', delimiter=',')
n_cols = predictors.shape[1]

# build model
model = Sequential()
# add layers
model.add(Dense(100, activation='relu', input_shape = (n_cols,)))
model.add(Dense(100, activation='relu'))
model.add(Dense(1))

#### Example: Understanding your data
You will soon start building models in Keras to predict wages based on various professional and demographic factors. Before you start building a model, it's good to understand your data by performing some exploratory analysis.

The data is pre-loaded into a pandas DataFrame called df. Use the .head() and .describe() methods in the IPython Shell for a quick overview of the DataFrame.

The target variable you'll be predicting is wage_per_hour. Some of the predictor variables are binary indicators, where a value of 1 represents True, and 0 represents False.

Of the 9 predictor variables in the DataFrame, how many are binary indicators? The min and max values as shown by .describe() will be informative here. How many binary indicator predictors are there?

Answer: 6

In [None]:
In [1]: df.head()
Out[1]: 
   wage_per_hour  union  education_yrs  experience_yrs  age  female  marr  \
0           5.10      0              8              21   35       1     1   
1           4.95      0              9              42   57       1     1   
2           6.67      0             12               1   19       0     0   
3           4.00      0             12               4   22       0     0   
4           7.50      0             12              17   35       0     1   

   south  manufacturing  construction  
0      0              1             0  
1      0              1             0  
2      0              1             0  
3      0              0             0  
4      0              0             0

In [2]: df.describe()
Out[2]: 
       wage_per_hour       union  education_yrs  experience_yrs         age  \
count     534.000000  534.000000     534.000000      534.000000  534.000000   
mean        9.024064    0.179775      13.018727       17.822097   36.833333   
std         5.139097    0.384360       2.615373       12.379710   11.726573   
min         1.000000    0.000000       2.000000        0.000000   18.000000   
25%         5.250000    0.000000      12.000000        8.000000   28.000000   
50%         7.780000    0.000000      12.000000       15.000000   35.000000   
75%        11.250000    0.000000      15.000000       26.000000   44.000000   
max        44.500000    1.000000      18.000000       55.000000   64.000000   

           female        marr       south  manufacturing  construction  
count  534.000000  534.000000  534.000000     534.000000    534.000000  
mean     0.458801    0.655431    0.292135       0.185393      0.044944  
std      0.498767    0.475673    0.455170       0.388981      0.207375  
min      0.000000    0.000000    0.000000       0.000000      0.000000  
25%      0.000000    0.000000    0.000000       0.000000      0.000000  
50%      0.000000    1.000000    0.000000       0.000000      0.000000  
75%      1.000000    1.000000    1.000000       0.000000      0.000000  
max      1.000000    1.000000    1.000000       1.000000      1.000000
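
One way to check the answer is to count the predictors whose minimum is 0 and maximum is 1. The sketch below copies the min and max rows from the `df.describe()` output above into a plain dictionary; a 0-to-1 range suggests a binary indicator but does not strictly prove it, so treat this as a quick heuristic.

```python
# (min, max) for each predictor, copied from the df.describe() output
predictor_ranges = {
    'union': (0, 1),
    'education_yrs': (2, 18),
    'experience_yrs': (0, 55),
    'age': (18, 64),
    'female': (0, 1),
    'marr': (0, 1),
    'south': (0, 1),
    'manufacturing': (0, 1),
    'construction': (0, 1),
}

# Keep the predictors whose values range exactly from 0 to 1
binary_candidates = [name for name, (low, high) in predictor_ranges.items()
                     if low == 0 and high == 1]

print(binary_candidates)
print(len(binary_candidates))  # 6
```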

#### Example: Specifying a model
Now you'll get to work with your first model in Keras, and will immediately be able to run more complex neural network models on larger datasets compared to the first two chapters.

To start, you'll take the skeleton of a neural network and add a hidden layer and an output layer. You'll then fit that model and see Keras do the optimization so your model continually gets better.

As a start, you'll predict workers' wages based on characteristics like their industry, education and level of experience. You can find the dataset in a pandas dataframe called df. For convenience, everything in df except for the target has been converted to a NumPy matrix called predictors. The target, wage_per_hour, is available as a NumPy matrix called target.

For all exercises in this chapter, we've imported the Sequential model constructor, the Dense layer constructor, and pandas.

In [None]:
# Import necessary modules
import keras
from keras.layers import Dense
from keras.models import Sequential

# Save the number of columns in predictors: n_cols
n_cols = predictors.shape[1]

# Set up the model: model
model = Sequential()

# Add the first layer
# input_shape=(n_cols,) means each row of data has n_cols items,
# and any number of rows of data is acceptable as input
model.add(Dense(50, activation='relu', input_shape=(n_cols,)))

# Add the second layer
model.add(Dense(32, activation='relu'))

# Add the output layer
model.add(Dense(1))
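
A rough NumPy sketch of what the three Dense layers above compute during forward propagation. The weights are random (untrained), and `n_rows`, `n_cols` and `predictors_demo` are made-up stand-ins, not the real wage data. It only illustrates the shapes flowing through the network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for predictors: 4 rows, 9 feature columns
n_rows, n_cols = 4, 9
predictors_demo = rng.normal(size=(n_rows, n_cols))

def dense(inputs, n_units, activation=None):
    """One fully connected layer: inputs @ weights + bias, then optional ReLU."""
    weights = rng.normal(scale=0.1, size=(inputs.shape[1], n_units))
    bias = np.zeros(n_units)
    outputs = inputs @ weights + bias
    if activation == 'relu':
        outputs = np.maximum(0, outputs)
    return outputs

hidden_1 = dense(predictors_demo, 50, activation='relu')  # like Dense(50, activation='relu')
hidden_2 = dense(hidden_1, 32, activation='relu')         # like Dense(32, activation='relu')
predictions = dense(hidden_2, 1)                          # like Dense(1): one value per row

print(hidden_1.shape, hidden_2.shape, predictions.shape)
```

Each row of input ends up as a single predicted value, which is why the output layer has one node for this regression problem.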


### 3.2 Step 2 and 3: Compiling and fitting a model
Why you need to compile a model - 2 args
- Specify the optimizer
    - Controls the learning rate
    - many options and mathematically complex
        - pragmatic solution: use a versatile algorithm for most problems
        - "Adam" is usually a good choice as go-to optimizer
            - adjusts learning rate as it does gradient descent to ensure reasonable values throughout weight optimization process
- Loss function
    - "mean_squared_error" common for regression
    - use a different method for classification (see below)
    
Compiling a model - after specifying layers
- model.compile(optimizer='adam', loss='mean_squared_error')
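
To make the loss concrete, here is a minimal NumPy sketch of mean squared error. The wage values are invented for illustration, and Keras computes this internally when loss='mean_squared_error' is used.

```python
import numpy as np

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

# Hypothetical hourly wages and model predictions
actual_wages = [10.0, 7.5, 12.0]
predicted_wages = [9.0, 8.5, 12.0]

mse = mean_squared_error(actual_wages, predicted_wages)
print(mse)  # errors of 1, -1 and 0 give (1 + 1 + 0) / 3
```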

Fitting a model
- applying backpropagation and gradient descent with your data to update the weights
- similar to .fit() in sklearn but has more options
- Note: Scaling data before fitting can ease optimization
    - common approach: subtract the feature mean from each feature, then divide by the feature's standard deviation
- model.fit(predictors, target)

In [None]:
# compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# fit the model
model.fit(predictors, target)
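
The scaling note above can be sketched in NumPy as follows. The `standardize` helper and the small `demo_predictors` array are hypothetical, added only for illustration. In practice the mean and standard deviation from the training data would be reused on any new data.

```python
import numpy as np

def standardize(features):
    """Subtract each column's mean and divide by its standard deviation."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (features - mean) / std, mean, std

# Hypothetical predictors with two features on very different scales
demo_predictors = np.array([[1.0, 200.0],
                            [2.0, 400.0],
                            [3.0, 600.0]])

scaled_predictors, feature_mean, feature_std = standardize(demo_predictors)
print(scaled_predictors)
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the weight updates just because of its units.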

#### Example: Compiling the model
You're now going to compile the model you specified earlier. To compile the model, you need to specify the optimizer and loss function to use. In the video, Dan mentioned that the Adam optimizer is an excellent choice. You can read more about it as well as other keras optimizers here (https://keras.io/optimizers/#adam), and if you are really curious to learn more, you can read the original paper that introduced the Adam optimizer (https://arxiv.org/abs/1412.6980v8).

In this exercise, you'll use the Adam optimizer and the mean squared error loss function. Go for it! 

In [None]:
# Import necessary modules
import keras
from keras.layers import Dense
from keras.models import Sequential

# Specify the model
n_cols = predictors.shape[1]
model = Sequential()
model.add(Dense(50, activation='relu', input_shape = (n_cols,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Verify that model contains information from compiling
print("Loss function: " + model.loss)
# Loss function: mean_squared_error

#### Example: Fitting the model
You're at the most fun part. You'll now fit the model. Recall that the data to be used as predictive features is loaded in a NumPy matrix called predictors and the data to be predicted is stored in a NumPy matrix called target. Your model is pre-written and it has been compiled with the code from the previous exercise.

In [None]:
# Import necessary modules
import keras
from keras.layers import Dense
from keras.models import Sequential

# Specify the model
n_cols = predictors.shape[1]
model = Sequential()
model.add(Dense(50, activation='relu', input_shape = (n_cols,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Fit the model
model.fit(predictors, target)


### 3.3 Classification models - with Deep Learning
Classification - note biggest changes
- loss function: 'categorical_crossentropy'
    - instead of 'mean_squared_error' in above example
    - there are other options, but it's the most common
    - similar to log loss: Lower score is better
    - add metrics=['accuracy'] to compile step for easy to understand diagnostics
- Output layer has a separate node for each possible outcome and uses the 'softmax' activation function
    - 'softmax' ensures predictions sum to 1 so they can be interpreted as probabilities
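
A small NumPy sketch of the two pieces mentioned above: softmax turning raw output-node values into probabilities, and categorical cross-entropy scoring them. The `raw_scores` and targets are invented. This is an illustration of the idea, not Keras's internal code.

```python
import numpy as np

def softmax(scores):
    """Convert each row of raw scores into probabilities that sum to 1."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def categorical_crossentropy(one_hot_targets, probabilities, eps=1e-12):
    """Mean negative log-probability assigned to the true class (lower is better)."""
    clipped = np.clip(probabilities, eps, 1.0)
    return float(-(one_hot_targets * np.log(clipped)).sum(axis=1).mean())

# Hypothetical output-node values for 2 rows and 2 classes
raw_scores = np.array([[2.0, 0.5],
                       [0.1, 1.5]])
one_hot_targets = np.array([[1.0, 0.0],
                            [0.0, 1.0]])

probabilities = softmax(raw_scores)
loss = categorical_crossentropy(one_hot_targets, probabilities)
print(probabilities.sum(axis=1))  # each row sums to 1
print(loss)
```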
    
Quick look at the data - binary classification problem
- target (0 or 1): shot_result
    - normally use separate column for each output - use keras function
- Transforming to categorical using one-hot encoding
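
For intuition, one-hot encoding of a 0/1 target can be sketched in plain NumPy. The `one_hot` helper and the `shot_result` values below are hypothetical. In the notes, the Keras function to_categorical does this job.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return one row per label with a 1 in the column of that label's class."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Hypothetical shot results: 0 = missed, 1 = made
shot_result = np.array([0, 1, 1, 0])
target_demo = one_hot(shot_result, num_classes=2)
print(target_demo)
```

Each outcome gets its own column, which matches the two output nodes of the classification model.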


In [None]:
# keras Classification
import pandas as pd
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical

data = pd.read_csv('basketball_shot_log.csv')
# .values replaces .as_matrix(), which was removed in pandas 1.0
predictors = data.drop(['shot_result'], axis=1).values
n_cols = predictors.shape[1]
# one-hot encoding of the target variable
target = to_categorical(data.shot_result)

model = Sequential()
model.add(Dense(100, activation='relu', input_shape = (n_cols,)))
model.add(Dense(100, activation='relu'))
# note: 2 nodes for 2 outcomes, and 'softmax' activation function
model.add(Dense(2, activation='softmax'))

# Compile the model using categorical_crossentropy loss function
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Fit the model
model.fit(predictors, target)

#### 3.3.a  Understanding your classification data
Now you will start modeling with a new dataset for a classification problem. This data includes information about passengers on the Titanic. You will use predictors such as age, fare and where each passenger embarked from to predict who will survive. This data is from a tutorial on data science competitions (https://www.kaggle.com/c/titanic). Look here (https://www.kaggle.com/c/titanic/data) for descriptions of the features.

The data is pre-loaded in a pandas DataFrame called df.

It's smart to review the maximum and minimum values of each variable to ensure the data isn't misformatted or corrupted. 

What was the maximum age of passengers on the Titanic? 
- Use the .describe() method in the IPython Shell to answer this question.

In [None]:
In [1]: df.describe()
Out[1]: 
         survived      pclass         age       sibsp       parch        fare  \
count  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000   
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208   
std      0.486592    0.836071   13.002015    1.102743    0.806057   49.693429   
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000   
25%      0.000000    2.000000   22.000000    0.000000    0.000000    7.910400   
50%      0.000000    3.000000   29.699118    0.000000    0.000000   14.454200   
75%      1.000000    3.000000   35.000000    1.000000    0.000000   31.000000   
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200   

             male  embarked_from_cherbourg  embarked_from_queenstown  \
count  891.000000               891.000000                891.000000   
mean     0.647587                 0.188552                  0.086420   
std      0.477990                 0.391372                  0.281141   
min      0.000000                 0.000000                  0.000000   
25%      0.000000                 0.000000                  0.000000   
50%      1.000000                 0.000000                  0.000000   
75%      1.000000                 0.000000                  0.000000   
max      1.000000                 1.000000                  1.000000   

       embarked_from_southampton  
count                 891.000000  
mean                    0.722783  
std                     0.447876  
min                     0.000000  
25%                     0.000000  
50%                     1.000000  
75%                     1.000000  
max                     1.000000

Answer: 80

#### 3.3.b  Last steps in classification models
You'll now create a classification model using the titanic dataset, which has been pre-loaded into a DataFrame called df. You'll take information about the passengers and predict which ones survived.

The predictive variables are stored in a NumPy array predictors. The target to predict is in df.survived, though you'll have to manipulate it for keras. The number of predictive features is stored in n_cols.

Here, you'll use the 'sgd' optimizer, which stands for Stochastic Gradient Descent (https://en.wikipedia.org/wiki/Stochastic_gradient_descent). You'll learn more about this in the next chapter!

In [None]:
# Import necessary modules
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical

# Convert the target to categorical: target
target = to_categorical(df.survived)

# Set up the model
model = Sequential()

# Add the first layer
model.add(Dense(32,activation='relu', input_shape=(n_cols,)))

# Add the output layer
model.add(Dense(2,activation='softmax'))

# Compile the model
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Fit the model
model.fit(predictors, target)


In [None]:
# output
Epoch 1/10

 32/891 [>.............................] - ETA: 0s - loss: 8.3484 - acc: 0.4062
736/891 [=======================>......] - ETA: 0s - loss: 2.5633 - acc: 0.5516
891/891 [==============================] - 0s - loss: 2.2398 - acc: 0.5859     
Epoch 2/10

 32/891 [>.............................] - ETA: 0s - loss: 1.3865 - acc: 0.6250
736/891 [=======================>......] - ETA: 0s - loss: 0.9423 - acc: 0.6196
891/891 [==============================] - 0s - loss: 0.9276 - acc: 0.6150     
Epoch 3/10

 32/891 [>.............................] - ETA: 0s - loss: 0.9685 - acc: 0.6562
736/891 [=======================>......] - ETA: 0s - loss: 0.6606 - acc: 0.6658
891/891 [==============================] - 0s - loss: 0.6770 - acc: 0.6465     
Epoch 4/10

 32/891 [>.............................] - ETA: 0s - loss: 0.6190 - acc: 0.5938
736/891 [=======================>......] - ETA: 0s - loss: 0.6970 - acc: 0.6549
891/891 [==============================] - 0s - loss: 0.6720 - acc: 0.6712     
Epoch 5/10

 32/891 [>.............................] - ETA: 0s - loss: 0.8023 - acc: 0.5625
736/891 [=======================>......] - ETA: 0s - loss: 0.6204 - acc: 0.6739
891/891 [==============================] - 0s - loss: 0.6173 - acc: 0.6756     
Epoch 6/10

 32/891 [>.............................] - ETA: 0s - loss: 0.6783 - acc: 0.5625
736/891 [=======================>......] - ETA: 0s - loss: 0.6219 - acc: 0.6766
891/891 [==============================] - 0s - loss: 0.6169 - acc: 0.6902     
Epoch 7/10

 32/891 [>.............................] - ETA: 0s - loss: 0.4871 - acc: 0.8125
736/891 [=======================>......] - ETA: 0s - loss: 0.6215 - acc: 0.6590
891/891 [==============================] - 0s - loss: 0.6146 - acc: 0.6712     
Epoch 8/10

 32/891 [>.............................] - ETA: 0s - loss: 0.6327 - acc: 0.6250
736/891 [=======================>......] - ETA: 0s - loss: 0.5996 - acc: 0.6970
891/891 [==============================] - 0s - loss: 0.6023 - acc: 0.6902     
Epoch 9/10

 32/891 [>.............................] - ETA: 0s - loss: 0.4249 - acc: 0.8438
736/891 [=======================>......] - ETA: 0s - loss: 0.6173 - acc: 0.6861
891/891 [==============================] - 0s - loss: 0.6157 - acc: 0.6835     
Epoch 10/10

 32/891 [>.............................] - ETA: 0s - loss: 0.5451 - acc: 0.6875
736/891 [=======================>......] - ETA: 0s - loss: 0.6051 - acc: 0.6902
891/891 [==============================] - 0s - loss: 0.6069 - acc: 0.6925
Out[1]: <keras.callbacks.History at 0x7f88fc07f0f0>


# This simple model reaches an accuracy of roughly 69% (final epoch acc: 0.6925)

### 3.4 Using models
- SAVE model after you train it
- RELOAD model
- Make PREDICTIONS with model


In [None]:
# saving, reloading, and using your model
from keras.models import load_model
model.save('model_file.h5')
# reload from the same file the model was saved to
my_model = load_model('model_file.h5')
predictions = my_model.predict(data_to_predict_with)
# using NumPy indexing, only get probability that the shot was made
probability_true = predictions[:,1]

### 3.5 Verifying model structure

In [None]:
my_model.summary()

### 3.6 Making predictions
The trained network from your previous coding exercise is now stored as model. New data to make predictions is stored in a NumPy array as pred_data. Use model to make predictions on your new data.

In this exercise, your predictions will be probabilities, which is the most common way for data scientists to communicate their predictions to colleagues.

In [None]:
# Specify, compile, and fit the model
model = Sequential()
model.add(Dense(32, activation='relu', input_shape = (n_cols,)))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='sgd', 
              loss='categorical_crossentropy', 
              metrics=['accuracy'])
model.fit(predictors, target)

# Calculate predictions: predictions
predictions = model.predict(pred_data)

# Calculate predicted probability of survival: predicted_prob_true
predicted_prob_true = predictions[:,1]

# print predicted_prob_true
print(predicted_prob_true)


In [None]:
# output

Epoch 1/10

 32/800 [>.............................] - ETA: 0s - loss: 9.4068 - acc: 0.3438
736/800 [==========================>...] - ETA: 0s - loss: 2.1078 - acc: 0.5924
800/800 [==============================] - 0s - loss: 2.0548 - acc: 0.6013     
Epoch 2/10

 32/800 [>.............................] - ETA: 0s - loss: 0.9343 - acc: 0.5938
736/800 [==========================>...] - ETA: 0s - loss: 0.9755 - acc: 0.6372
800/800 [==============================] - 0s - loss: 0.9881 - acc: 0.6362     
Epoch 3/10

 32/800 [>.............................] - ETA: 0s - loss: 3.1622 - acc: 0.2812
736/800 [==========================>...] - ETA: 0s - loss: 0.9660 - acc: 0.6576
800/800 [==============================] - 0s - loss: 0.9560 - acc: 0.6475     
Epoch 4/10

 32/800 [>.............................] - ETA: 0s - loss: 0.7433 - acc: 0.7812
736/800 [==========================>...] - ETA: 0s - loss: 0.7070 - acc: 0.6739
800/800 [==============================] - 0s - loss: 0.7000 - acc: 0.6737     
Epoch 5/10

 32/800 [>.............................] - ETA: 0s - loss: 0.7013 - acc: 0.7188
736/800 [==========================>...] - ETA: 0s - loss: 0.6351 - acc: 0.6522
800/800 [==============================] - 0s - loss: 0.6370 - acc: 0.6562     
Epoch 6/10

 32/800 [>.............................] - ETA: 0s - loss: 0.7438 - acc: 0.5938
736/800 [==========================>...] - ETA: 0s - loss: 0.6490 - acc: 0.6332
800/800 [==============================] - 0s - loss: 0.6406 - acc: 0.6475     
Epoch 7/10

 32/800 [>.............................] - ETA: 0s - loss: 0.5850 - acc: 0.6562
736/800 [==========================>...] - ETA: 0s - loss: 0.6513 - acc: 0.6454
800/800 [==============================] - 0s - loss: 0.6484 - acc: 0.6450     
Epoch 8/10

 32/800 [>.............................] - ETA: 0s - loss: 0.6137 - acc: 0.6562
736/800 [==========================>...] - ETA: 0s - loss: 0.6187 - acc: 0.6658
800/800 [==============================] - 0s - loss: 0.6316 - acc: 0.6487     
Epoch 9/10

 32/800 [>.............................] - ETA: 0s - loss: 0.6688 - acc: 0.6250
736/800 [==========================>...] - ETA: 0s - loss: 0.6231 - acc: 0.6576
800/800 [==============================] - 0s - loss: 0.6204 - acc: 0.6588     
Epoch 10/10

 32/800 [>.............................] - ETA: 0s - loss: 0.6003 - acc: 0.6875
736/800 [==========================>...] - ETA: 0s - loss: 0.6177 - acc: 0.6753
800/800 [==============================] - 0s - loss: 0.6136 - acc: 0.6812     
[0.21993461 0.3318482  0.24766211 0.51462483 0.16771315 0.14743942
 0.07266742 0.2729067  0.17536648 0.4474381  0.18807179 0.25605577
 0.1751357  0.34066525 0.15425213 0.11837941 0.22323096 0.4354514
 0.07228678 0.42617893 0.42058256 0.19022378 0.07656103 0.2302739
 0.2898574  0.17831704 0.4497322  0.39488176 0.18922023 0.44863054
 0.4136607  0.39599928 0.15403347 0.21477048 0.27403453 0.44038844
 0.2478449  0.16212454 0.44724074 0.4080968  0.2458698  0.36913165
 0.41874805 0.11317883 0.29375517 0.08350059 0.40117604 0.12408093
 0.4339563  0.5006711  0.318091   0.01506576 0.47888747 0.46149215
 0.25830168 0.27560487 0.40160352 0.22523038 0.34403372 0.15403347
 0.17595078 0.33919814 0.24056132 0.40442795 0.31600302 0.15790847
 0.3180895  0.4518196  0.18083148 0.45688182 0.18818697 0.3890984
 0.14759547 0.07269966 0.36255804 0.29960537 0.28081793 0.25597137
 0.16022985 0.4513318  0.40332136 0.13509353 0.2791521  0.23960653
 0.18508025 0.26095796 0.2958366  0.46594405 0.30380365 0.4570568
 0.16814701]
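
If a yes/no prediction is needed rather than a probability, one possible approach is to apply a cutoff. The 0.5 threshold below is an assumption for illustration, not something the exercise prescribes. The array holds only the first four probabilities from the output above.

```python
import numpy as np

# First four predicted survival probabilities, copied from the output above
predicted_prob_true = np.array([0.21993461, 0.3318482, 0.24766211, 0.51462483])

# Assumed cutoff: predict "survived" when the probability exceeds 0.5
threshold = 0.5
predicted_class = (predicted_prob_true > threshold).astype(int)
print(predicted_class)  # only the fourth value is above 0.5
```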

# 4. Fine-tuning keras models
Objective: How to choose model architecture and model optimization arguments

Model optimization is hard
- simultaneously optimizing 1000s of parameters with complex relationships
- updates may not improve model meaningfully
- updates too small (if learning rate is low) or too large (if learning rate is high)
    - smart optimizer like adam works, but optimization problems can still occur 

### 4.1 Easiest way to see effect of different learning rates is using SGD
- Stochastic Gradient Descent (SGD) - simplest optimizer
    - learning rate = 0.01 is common


In [None]:
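# Stochastic Gradient Descent (SGD) - simplest optimizer
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD

# one entry per predictor column, as in the earlier examples
input_shape = (predictors.shape[1],)

def get_new_model(input_shape = input_shape):
    model = Sequential()
    model.add(Dense(100, activation='relu', input_shape = input_shape))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    return model

# test learning rates
lr_to_test = [.000001, 0.01, 1]
for lr in lr_to_test:
    # build a fresh model so each learning rate starts from scratch
    model = get_new_model()
    # newer Keras releases name this argument learning_rate
    my_optimizer = SGD(lr=lr)
    model.compile(optimizer=my_optimizer, loss='categorical_crossentropy')
    model.fit(predictors, target)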
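
The effect of the learning rate can also be seen on a toy problem without Keras. This sketch runs plain gradient descent on loss(w) = w**2 with the same three learning rates. The function name, starting weight and step count are arbitrary choices for illustration.

```python
def gradient_descent_loss(lr, steps=100, w=5.0):
    """Minimize loss(w) = w**2 with plain gradient descent and return the final loss."""
    for _ in range(steps):
        grad = 2 * w       # derivative of w**2 with respect to w
        w = w - lr * grad  # gradient descent update
    return w ** 2

for lr in [0.000001, 0.01, 1]:
    print('learning rate %f -> final loss %f' % (lr, gradient_descent_loss(lr)))
```

With the tiny learning rate the loss barely moves from its starting value of 25, with 0.01 it drops steadily, and with 1 each update overshoots the minimum by exactly as much as it corrects, so the loss never improves.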

### 4.2 The dying neuron problem
- Even if your learning rate is well tuned, you can have the dying neuron problem.
- Dying neuron problem =
    - when a neuron takes a value < 0 (is negative) for all rows of data
    - recall ReLU: negative input produces output 0 and slope 0 
        - if slope 0, slopes of weights into the node are 0, so those weights don't get updated 
    - once a node starts always getting negative inputs
        - it may continue only getting negative inputs
    - contributes nothing to the model
        - "Dead" neuron
    - Attempted solutions
        - using an activation function whose slope is never exactly 0, like tanh
            - but this can lead to the vanishing gradients problem
                - occurs when many layers have very small slopes (e.g. due to being on the flat part of the tanh curve, at either end)
                - may work in a model with a few hidden layers
                - but in deep networks, the weight updates from backprop end up close to 0
- If you are having problems, a potential solution may be changing the activation function
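
A short NumPy sketch of the slopes discussed above. The `node_inputs` values are invented to represent a node that receives negative input for every row of data.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_slope(x):
    # slope is 1 for positive inputs and 0 otherwise
    return (x > 0).astype(float)

def tanh_slope(x):
    # derivative of tanh(x)
    return 1 - np.tanh(x) ** 2

# Hypothetical inputs to one hidden node, negative for every row of data
node_inputs = np.array([-3.0, -0.5, -1.2, -4.0])

print(relu(node_inputs))        # the node outputs 0 for every row
print(relu_slope(node_inputs))  # slope 0 everywhere, so its weights get no update

# tanh never has slope exactly 0, but the slope shrinks quickly away from 0
print(tanh_slope(np.array([0.0, 2.0, 5.0])))
```

Multiplying many such small tanh slopes together across layers is what makes the updates in a deep network vanish.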

### 4.3 Diagnosing optimization problems
Which of the following could prevent a model from showing an improved loss in its first few epochs?

Possible Answers:
Learning rate too low.
Learning rate too high.
Poor choice of activation function.
All of the above.

Answer: All of the above

### 4.4 Changing optimization parameters
It's time to get your hands dirty with optimization. You'll now try optimizing a model at a very low learning rate, a very high learning rate, and a "just right" learning rate. 

You'll want to look at the results after running this exercise, remembering that a LOW value for the LOSS FUNCTION is GOOD.

For these exercises, we've pre-loaded the predictors and target values from your previous classification models (predicting who would survive on the Titanic). You'll want the optimization to start from scratch every time you change the learning rate, to give a fair comparison of how each learning rate did in your results. So we have created a function get_new_model() that creates an unoptimized model to optimize.

In [None]:
# Import the SGD optimizer
from keras.optimizers import SGD

# Create list of learning rates: lr_to_test
lr_to_test = [.000001, 0.01, 1]

# Loop over learning rates
for lr in lr_to_test:
    print('\n\nTesting model with learning rate: %f\n'%lr )
    
    # Build new model to test, unaffected by previous models
    model = get_new_model()
    
    # Create SGD optimizer with specified learning rate: my_optimizer
    my_optimizer = SGD(lr=lr)
    
    # Compile the model
    model.compile(optimizer=my_optimizer, loss='categorical_crossentropy')
    
    # Fit the model
    model.fit(predictors, target)
    

In [None]:
Testing model with learning rate: 0.000001

Epoch 1/10

 32/891 [>.............................] - ETA: 1s - loss: 3.6053
416/891 [=============>................] - ETA: 0s - loss: 3.7014
891/891 [==============================] - 0s - loss: 3.6057     
Epoch 2/10

 32/891 [>.............................] - ETA: 0s - loss: 3.5751
576/891 [==================>...........] - ETA: 0s - loss: 3.4996
891/891 [==============================] - 0s - loss: 3.5656     
Epoch 3/10

 32/891 [>.............................] - ETA: 0s - loss: 2.6692
480/891 [===============>..............] - ETA: 0s - loss: 3.4125
891/891 [==============================] - 0s - loss: 3.5255     
Epoch 4/10

 32/891 [>.............................] - ETA: 0s - loss: 3.0058
576/891 [==================>...........] - ETA: 0s - loss: 3.4704
891/891 [==============================] - 0s - loss: 3.4854     
Epoch 5/10

 32/891 [>.............................] - ETA: 0s - loss: 2.5452
480/891 [===============>..............] - ETA: 0s - loss: 3.4640
891/891 [==============================] - 0s - loss: 3.4454     
Epoch 6/10

 32/891 [>.............................] - ETA: 0s - loss: 3.4446
480/891 [===============>..............] - ETA: 0s - loss: 3.4553
891/891 [==============================] - 0s - loss: 3.4056     
Epoch 7/10

 32/891 [>.............................] - ETA: 0s - loss: 4.1073
512/891 [================>.............] - ETA: 0s - loss: 3.4992
891/891 [==============================] - 0s - loss: 3.3659     
Epoch 8/10

 32/891 [>.............................] - ETA: 0s - loss: 3.0972
608/891 [===================>..........] - ETA: 0s - loss: 3.2938
891/891 [==============================] - 0s - loss: 3.3263     
Epoch 9/10

 32/891 [>.............................] - ETA: 0s - loss: 3.7464
544/891 [=================>............] - ETA: 0s - loss: 3.2872
891/891 [==============================] - 0s - loss: 3.2867     
Epoch 10/10

 32/891 [>.............................] - ETA: 0s - loss: 3.3862
608/891 [===================>..........] - ETA: 0s - loss: 3.1217
891/891 [==============================] - 0s - loss: 3.2473     


Testing model with learning rate: 0.010000

Epoch 1/10

 32/891 [>.............................] - ETA: 1s - loss: 1.0910
544/891 [=================>............] - ETA: 0s - loss: 1.8761
891/891 [==============================] - 0s - loss: 1.4069     
Epoch 2/10

 32/891 [>.............................] - ETA: 0s - loss: 2.1146
576/891 [==================>...........] - ETA: 0s - loss: 0.7418
891/891 [==============================] - 0s - loss: 0.7022     
Epoch 3/10

 32/891 [>.............................] - ETA: 0s - loss: 0.5700
640/891 [====================>.........] - ETA: 0s - loss: 0.6680
891/891 [==============================] - 0s - loss: 0.6472     
Epoch 4/10

 32/891 [>.............................] - ETA: 0s - loss: 0.6192
640/891 [====================>.........] - ETA: 0s - loss: 0.6432
891/891 [==============================] - 0s - loss: 0.6228     
Epoch 5/10

 32/891 [>.............................] - ETA: 0s - loss: 0.4959
640/891 [====================>.........] - ETA: 0s - loss: 0.6158
891/891 [==============================] - 0s - loss: 0.6199     
Epoch 6/10

 32/891 [>.............................] - ETA: 0s - loss: 0.6682
640/891 [====================>.........] - ETA: 0s - loss: 0.6064
891/891 [==============================] - 0s - loss: 0.6010     
Epoch 7/10

 32/891 [>.............................] - ETA: 0s - loss: 0.6180
640/891 [====================>.........] - ETA: 0s - loss: 0.6076
891/891 [==============================] - 0s - loss: 0.5996     
Epoch 8/10

 32/891 [>.............................] - ETA: 0s - loss: 0.6059
672/891 [=====================>........] - ETA: 0s - loss: 0.5897
891/891 [==============================] - 0s - loss: 0.6055     
Epoch 9/10

 32/891 [>.............................] - ETA: 0s - loss: 0.6522
672/891 [=====================>........] - ETA: 0s - loss: 0.5991
891/891 [==============================] - 0s - loss: 0.5948     
Epoch 10/10

 32/891 [>.............................] - ETA: 0s - loss: 0.6478
672/891 [=====================>........] - ETA: 0s - loss: 0.5789
891/891 [==============================] - 0s - loss: 0.5827     


Testing model with learning rate: 1.000000

Epoch 1/10

 32/891 [>.............................] - ETA: 1s - loss: 1.0273
512/891 [================>.............] - ETA: 0s - loss: 5.4474
891/891 [==============================] - 0s - loss: 5.9885     
Epoch 2/10

 32/891 [>.............................] - ETA: 0s - loss: 4.5332
544/891 [=================>............] - ETA: 0s - loss: 6.1628
891/891 [==============================] - 0s - loss: 6.1867     
Epoch 3/10

 32/891 [>.............................] - ETA: 0s - loss: 7.0517
352/891 [==========>...................] - ETA: 0s - loss: 5.8611
736/891 [=======================>......] - ETA: 0s - loss: 6.2414
891/891 [==============================] - 0s - loss: 6.1867     
Epoch 4/10

 32/891 [>.............................] - ETA: 0s - loss: 6.0443
416/891 [=============>................] - ETA: 0s - loss: 6.3155
800/891 [=========================>....] - ETA: 0s - loss: 6.1652
891/891 [==============================] - 0s - loss: 6.1867     
Epoch 5/10

 32/891 [>.............................] - ETA: 0s - loss: 9.0664
384/891 [===========>..................] - ETA: 0s - loss: 6.5060
891/891 [==============================] - 0s - loss: 6.1867     
Epoch 6/10

 32/891 [>.............................] - ETA: 0s - loss: 6.0443
672/891 [=====================>........] - ETA: 0s - loss: 6.2122
891/891 [==============================] - 0s - loss: 6.1867     
Epoch 7/10

 32/891 [>.............................] - ETA: 0s - loss: 5.0369
672/891 [=====================>........] - ETA: 0s - loss: 6.2841
891/891 [==============================] - 0s - loss: 6.1867     
Epoch 8/10

 32/891 [>.............................] - ETA: 0s - loss: 5.0369
672/891 [=====================>........] - ETA: 0s - loss: 6.1402
891/891 [==============================] - 0s - loss: 6.1867     
Epoch 9/10

 32/891 [>.............................] - ETA: 0s - loss: 5.5406
672/891 [=====================>........] - ETA: 0s - loss: 6.0923
891/891 [==============================] - 0s - loss: 6.1867     
Epoch 10/10

 32/891 [>.............................] - ETA: 0s - loss: 5.5406
672/891 [=====================>........] - ETA: 0s - loss: 6.2362
891/891 [==============================] - 0s - loss: 6.1867
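
With a learning rate of 1.0 the loss in the log above settles at 6.1867 and never recovers, far above the roughly 0.58 reached by the preceding run. A minimal sketch of why an oversized step can stall or diverge, using plain gradient descent on the toy loss f(w) = w**2 (the function name, starting point, and step count are illustrative assumptions, not part of the course code):

```python
def gradient_descent(learning_rate, steps=50, w=1.0):
    """Minimise the toy loss f(w) = w**2 and return the final weight."""
    for _ in range(steps):
        gradient = 2 * w                  # derivative of w**2
        w = w - learning_rate * gradient  # standard gradient descent update
    return w

small_step = gradient_descent(learning_rate=0.1)  # shrinks towards 0
large_step = gradient_descent(learning_rate=1.0)  # bounces between +1 and -1
too_large = gradient_descent(learning_rate=1.5)   # grows on every step

print(abs(small_step), abs(large_step), abs(too_large))
```

The real network's loss surface is far more complicated, but the same overshooting effect is a plausible explanation for a loss that stops moving at a poor value.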

### 4.5 Model Validation
Validation in deep learning
- commonly use a validation split rather than cross-validation
- Deep learning is widely used on large datasets
    - so k-fold cross-validation is not commonly used
- A single validation score is based on a large amount of data, so it is reliable
    - Repeated training for cross-validation would take a long time
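
A rough sketch of what a validation split does to the row counts, assuming the hold-out is simply the last fraction of the rows (which is how the Keras documentation describes `validation_split`). The helper name `split_sizes` is made up for illustration:

```python
def split_sizes(n_samples, validation_split):
    """Return (n_train, n_validation) when the last fraction of rows is held out."""
    n_train = int(n_samples * (1 - validation_split))
    return n_train, n_samples - n_train

# 891 rows with validation_split=0.3, as in the exercises in this section
n_train, n_val = split_sizes(891, 0.3)
print(n_train, n_val)  # 623 268
```

These numbers agree with the "Train on 623 samples, validate on 268 samples" line in the training logs in this section.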

Early Stopping - stop training when validation score is not improving

Experimentation
- experiment with different architectures
- more layers
- fewer layers
- layers with more nodes
- layers with fewer nodes
- creating a great model requires experimentation

#### 4.5.a Model validation

In [None]:
# metrics=['accuracy'] reports accuracy, which suits a classification problem
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# validation_split=0.3 holds out 30% of the data for validation
model.fit(predictors, target, validation_split=0.3)

#### 4.5.b Early Stopping - stop early if the validation score is not improving
- check output

In [None]:
from keras.callbacks import EarlyStopping
# patience = number of epochs without improvement before stopping; 2 or 3 is usually enough
early_stopping_monitor = EarlyStopping(patience=2)
# note: the callbacks parameter takes a list
# Advanced: you can pass several callbacks
# Keras default in the version used here: 10 epochs
model.fit(predictors, target, validation_split=0.3, epochs=20,
          callbacks=[early_stopping_monitor])
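
The patience rule can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the function name and the loss values are hypothetical, and the exact counting convention may differ between Keras versions.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, one per epoch
illustrative_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.64]
stop = early_stopping_epoch(illustrative_losses, patience=2)
print(stop)  # 5: epochs 4 and 5 both fail to beat 0.65
```

With `patience=3` the same sequence would run to the end, because epoch 6 improves on the best loss before the counter reaches 3.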

#### 4.5.c Evaluating model accuracy on validation dataset
Now it's your turn to monitor model accuracy with a validation data set. A model definition has been provided as model. Your job is to add the code to compile it and then fit it. You'll check the validation score in each epoch. 

In [None]:
# Save the number of columns in predictors: n_cols
n_cols = predictors.shape[1]
input_shape = (n_cols,)

# Specify the model
model = Sequential()
model.add(Dense(100, activation='relu', input_shape = input_shape))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Fit the model
hist = model.fit(predictors, target, validation_split=0.3)


In [None]:
Train on 623 samples, validate on 268 samples
Epoch 1/10

 32/623 [>.............................] - ETA: 0s - loss: 3.3028 - acc: 0.4062
608/623 [============================>.] - ETA: 0s - loss: 1.3342 - acc: 0.5970
623/623 [==============================] - 0s - loss: 1.3117 - acc: 0.6035 - val_loss: 0.6835 - val_acc: 0.7201
Epoch 2/10

 32/623 [>.............................] - ETA: 0s - loss: 0.6891 - acc: 0.7188
623/623 [==============================] - 0s - loss: 0.8822 - acc: 0.5714 - val_loss: 1.1022 - val_acc: 0.6418
Epoch 3/10

 32/623 [>.............................] - ETA: 0s - loss: 1.0317 - acc: 0.5938
623/623 [==============================] - 0s - loss: 0.7970 - acc: 0.6228 - val_loss: 0.8587 - val_acc: 0.6343
Epoch 4/10

 32/623 [>.............................] - ETA: 0s - loss: 0.6470 - acc: 0.6875
608/623 [============================>.] - ETA: 0s - loss: 0.7579 - acc: 0.6497
623/623 [==============================] - 0s - loss: 0.7538 - acc: 0.6485 - val_loss: 0.6898 - val_acc: 0.7052
Epoch 5/10

 32/623 [>.............................] - ETA: 0s - loss: 0.6848 - acc: 0.6562
623/623 [==============================] - 0s - loss: 0.6774 - acc: 0.6453 - val_loss: 0.5892 - val_acc: 0.7164
Epoch 6/10

 32/623 [>.............................] - ETA: 0s - loss: 0.5687 - acc: 0.6875
448/623 [====================>.........] - ETA: 0s - loss: 0.6688 - acc: 0.6518
623/623 [==============================] - 0s - loss: 0.6599 - acc: 0.6517 - val_loss: 0.5295 - val_acc: 0.7500
Epoch 7/10

 32/623 [>.............................] - ETA: 0s - loss: 0.5598 - acc: 0.7500
576/623 [==========================>...] - ETA: 0s - loss: 0.6072 - acc: 0.6771
623/623 [==============================] - 0s - loss: 0.6013 - acc: 0.6806 - val_loss: 0.5124 - val_acc: 0.7164
Epoch 8/10

 32/623 [>.............................] - ETA: 0s - loss: 0.5911 - acc: 0.7500
608/623 [============================>.] - ETA: 0s - loss: 0.5914 - acc: 0.6908
623/623 [==============================] - 0s - loss: 0.5911 - acc: 0.6902 - val_loss: 0.5254 - val_acc: 0.7649
Epoch 9/10

 32/623 [>.............................] - ETA: 0s - loss: 0.5627 - acc: 0.7500
623/623 [==============================] - 0s - loss: 0.6727 - acc: 0.6597 - val_loss: 0.5689 - val_acc: 0.7052
Epoch 10/10

 32/623 [>.............................] - ETA: 0s - loss: 0.4831 - acc: 0.8125
623/623 [==============================] - 0s - loss: 0.6196 - acc: 0.6886 - val_loss: 0.5382 - val_acc: 0.7425
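
The validation metrics do not improve monotonically, so it helps to find the best epoch rather than read off the last one. A small sketch using the end-of-epoch values copied from the log above (in a live session the same lists would come from `hist.history['val_loss']` and, assuming this Keras version's key naming, `hist.history['val_acc']`):

```python
# End-of-epoch validation metrics copied from the log above
val_loss = [0.6835, 1.1022, 0.8587, 0.6898, 0.5892,
            0.5295, 0.5124, 0.5254, 0.5689, 0.5382]
val_acc = [0.7201, 0.6418, 0.6343, 0.7052, 0.7164,
           0.7500, 0.7164, 0.7649, 0.7052, 0.7425]

# Epochs are 1-based in the log, list indices are 0-based
best_loss_epoch = val_loss.index(min(val_loss)) + 1
best_acc_epoch = val_acc.index(max(val_acc)) + 1

print(best_loss_epoch, min(val_loss))  # 7 0.5124
print(best_acc_epoch, max(val_acc))    # 8 0.7649
```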

#### 4.5.d Early stopping: Optimizing the optimization
Now that you know how to monitor your model performance throughout optimization, you can use early stopping to stop optimization when it isn't helping any more. Since the optimization stops automatically when it isn't helping, you can also set a high value for epochs in your call to .fit(), as Dan showed in the video.

The model you'll optimize has been specified as model. As before, the data is pre-loaded as predictors and target. 

In [None]:
# Import EarlyStopping
from keras.callbacks import EarlyStopping

# Save the number of columns in predictors: n_cols
n_cols = predictors.shape[1]
input_shape = (n_cols,)

# Specify the model
model = Sequential()
model.add(Dense(100, activation='relu', input_shape = input_shape))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Fit the model
model.fit(predictors, target, validation_split=0.3,
          epochs=30, callbacks=[early_stopping_monitor])


In [None]:
Train on 623 samples, validate on 268 samples
    Epoch 1/30
    
 32/623 [>.............................] - ETA: 0s - loss: 5.6563 - acc: 0.4688
608/623 [============================>.] - ETA: 0s - loss: 1.6479 - acc: 0.5625
623/623 [==============================] - 0s - loss: 1.6352 - acc: 0.5666 - val_loss: 1.0820 - val_acc: 0.6530
    Epoch 2/30
    
 32/623 [>.............................] - ETA: 0s - loss: 1.8316 - acc: 0.4688
608/623 [============================>.] - ETA: 0s - loss: 0.8358 - acc: 0.6086
623/623 [==============================] - 0s - loss: 0.8304 - acc: 0.6100 - val_loss: 0.5692 - val_acc: 0.7276
    Epoch 3/30
    
 32/623 [>.............................] - ETA: 0s - loss: 0.8393 - acc: 0.6250
623/623 [==============================] - 0s - loss: 0.7130 - acc: 0.6549 - val_loss: 0.5265 - val_acc: 0.7537
    Epoch 4/30
    
 32/623 [>.............................] - ETA: 0s - loss: 0.9949 - acc: 0.6250
608/623 [============================>.] - ETA: 0s - loss: 0.6702 - acc: 0.6760
623/623 [==============================] - 0s - loss: 0.6734 - acc: 0.6758 - val_loss: 0.5196 - val_acc: 0.7351
    Epoch 5/30
    
 32/623 [>.............................] - ETA: 0s - loss: 0.5458 - acc: 0.7812
608/623 [============================>.] - ETA: 0s - loss: 0.6727 - acc: 0.6480
623/623 [==============================] - 0s - loss: 0.6808 - acc: 0.6469 - val_loss: 0.6424 - val_acc: 0.6903
    Epoch 6/30
    
 32/623 [>.............................] - ETA: 0s - loss: 0.4596 - acc: 0.8438
623/623 [==============================] - 0s - loss: 0.6279 - acc: 0.7127 - val_loss: 0.5721 - val_acc: 0.7239
    Epoch 7/30
    
 32/623 [>.............................] - ETA: 0s - loss: 0.6590 - acc: 0.6562
608/623 [============================>.] - ETA: 0s - loss: 0.6533 - acc: 0.7007
623/623 [==============================] - 0s - loss: 0.6529 - acc: 0.6998 - val_loss: 0.6481 - val_acc: 0.6679

### 4.6 Experimenting with wider networks
Now you know everything you need to begin experimenting with different models!

A model called model_1 has been pre-loaded. You can see a summary of this model printed in the IPython Shell. This is a relatively small network, with only 10 units in each hidden layer.

In this exercise you'll create a new model called model_2 which is similar to model_1, except it has 100 units in each hidden layer.

After you create model_2, both models will be fitted, and a graph showing both models' loss scores at each epoch will be shown. We added the argument verbose=False in the fitting commands to print out fewer updates, since you will look at these graphically instead of as text.

Because you are fitting two models, it will take a moment to see the outputs after you hit run, so be patient.

In [None]:
# summary of model 1
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 10)                110       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                110       
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 22        
=================================================================
Total params: 242.0
Trainable params: 242
Non-trainable params: 0.0
_________________________________________________________________
None
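
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch (the helper name `dense_param_count` is made up for illustration; the input width of 10 is inferred from the first layer, since 110 = 10 * 10 + 10):

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += n_inputs * units + units  # weights plus one bias per unit
        n_inputs = units                   # this layer feeds the next one
    return total

n_inputs = 10  # inferred from the summary above
model_1_params = dense_param_count(n_inputs, [10, 10, 2])
model_2_params = dense_param_count(n_inputs, [100, 100, 2])

print(model_1_params)  # 242, matching "Total params" in the summary
print(model_2_params)  # 11402 for the wider network with 100 units per hidden layer
```

Widening the hidden layers from 10 to 100 units therefore increases the number of trainable parameters by a factor of roughly 47.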


In [None]:
# create model 2 which has 100 units in each hidden layer
import matplotlib.pyplot as plt  # needed for the plot at the end of this cell

# Define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Create the new model: model_2
model_2 = Sequential()

# Add the first and second layers
model_2.add(Dense(100, activation='relu', input_shape=input_shape))
model_2.add(Dense(100, activation='relu'))

# Add the output layer
model_2.add(Dense(2,activation='softmax'))

# Compile model_2
model_2.compile(optimizer='adam',loss='categorical_crossentropy', metrics=['accuracy'])

# Fit model_1
model_1_training = model_1.fit(predictors, target, epochs=15, validation_split=0.2, callbacks=[early_stopping_monitor], verbose=False)

# Fit model_2
model_2_training = model_2.fit(predictors, target, epochs=15, validation_split=0.2, callbacks=[early_stopping_monitor], verbose=False)

# Create the plot
plt.plot(model_1_training.history['val_loss'], 'r', model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.show()


Model 2 reaches a lower validation loss, so it is the better of the two models.

<img src="image8.png" width="500" />


### 4.7 Adding layers to a network - Experiment with Deeper Network
You've seen how to experiment with wider networks. In this exercise, you'll try a deeper network (more hidden layers).

Once again, you have a baseline model called model_1 as a starting point. It has 1 hidden layer, with 50 units. You can see a summary of that model's structure printed out. You will create a similar network with 3 hidden layers (still keeping 50 units in each layer).

This will again take a moment to fit both models, so you'll need to wait a few seconds to see the results after you run your code.

In [None]:
# model 1
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 50)                550       
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 102       
=================================================================
Total params: 652.0
Trainable params: 652
Non-trainable params: 0.0
_________________________________________________________________
None

In [None]:
# The input shape to use in the first hidden layer
input_shape = (n_cols,)

# Create the new model: model_2
model_2 = Sequential()

# Add the first, second, and third hidden layers
model_2.add(Dense(50,activation='relu',input_shape=input_shape))
model_2.add(Dense(50,activation='relu'))
model_2.add(Dense(50,activation='relu'))

# Add the output layer
model_2.add(Dense(2,activation='softmax'))

# Compile model_2
model_2.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

# Fit model 1
model_1_training = model_1.fit(predictors, target, epochs=20, validation_split=0.4, callbacks=[early_stopping_monitor], verbose=False)

# Fit model 2
model_2_training = model_2.fit(predictors, target, epochs=20, validation_split=0.4, callbacks=[early_stopping_monitor], verbose=False)

# Create the plot
plt.plot(model_1_training.history['val_loss'], 'r', model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.show()


For both models, you should look for the best val_loss and val_acc, which won't be the last epoch for that model.

The blue model is Model 2 and the red is the original Model 1. The model with the lower loss value is the better model.

<img src="image9.png" width="500" />

### 4.8 Thinking about Model Capacity (or Network Capacity)
- It takes some practice to build the intuition for which experiments and architectures to try.
    - Thinking about model capacity can guide us

Model Capacity
- this term is closely associated with Overfitting and Underfitting
- Model Capacity = model's ability to capture predictive patterns in your data
    - more capacity moves the model towards low bias and high variance
    - Increase capacity by:
        - Increase nodes/neurons in hidden layer
        - Increase layers

Workflow for optimizing model capacity
- Start with a small network
- Get the validation score
- Keep increasing capacity until validation score is no longer improving
- Example: sequential experiments - increase nodes, then layers, then adjust
    - Hidden layers: 1, nodes per layer: 100, MSE: 5.4, next step: increase capacity
    - Hidden layers: 1, nodes per layer: 250, MSE: 4.8, next step: increase capacity
    - Hidden layers: 2, nodes per layer: 250, MSE: 4.4, next step: increase capacity
    - Hidden layers: 3, nodes per layer: 250, MSE: 4.5, next step: decrease capacity
    - Hidden layers: 3, nodes per layer: 200, MSE: 4.3, next step: done
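
The bookkeeping for this workflow can be sketched in a few lines. The experiment records below restate the example above; the dictionary keys and the helper `first_non_improving` are hypothetical names chosen for illustration:

```python
# The five experiments from the example above
experiments = [
    {"hidden_layers": 1, "nodes_per_layer": 100, "mse": 5.4},
    {"hidden_layers": 1, "nodes_per_layer": 250, "mse": 4.8},
    {"hidden_layers": 2, "nodes_per_layer": 250, "mse": 4.4},
    {"hidden_layers": 3, "nodes_per_layer": 250, "mse": 4.5},
    {"hidden_layers": 3, "nodes_per_layer": 200, "mse": 4.3},
]

def first_non_improving(experiments):
    """Return the index of the first experiment that fails to beat the best score so far."""
    best = float("inf")
    for i, exp in enumerate(experiments):
        if exp["mse"] >= best:
            return i
        best = exp["mse"]
    return None

best = min(experiments, key=lambda exp: exp["mse"])
stalled_at = first_non_improving(experiments)

print(best)        # the 3-layer, 200-node network with MSE 4.3
print(stalled_at)  # 3: the fourth experiment is where adding capacity stopped helping
```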

### 4.9 Experimenting with model structures
You've just run an experiment where you compared two networks that were identical except that the 2nd network had an extra hidden layer. You see that this 2nd network (the deeper network) had better performance. Given that, which of the following would be a good experiment to run next for even better performance?

Possible Answers:
- Try a new network with fewer layers than anything you have tried yet.
- Use more units in each hidden layer.
- Use fewer units in each hidden layer.

Answer: 
Use more units in each hidden layer.


### 4.10 Stepping up to Images
- MNIST dataset
    - images of handwritten digits
    - 28x28 grid flattened to 784 values for each image (784x1 array)
    - value in each part of array denotes darkness of that pixel
        - 0 is darkest
        - 255 is lightest
    - digits 0-9
- deep learning model to predict the digit
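
A small NumPy sketch of the data layout described above. The image content and labels are made up, and the one-hot step is an assumption about how the target is prepared for a 10-class softmax output:

```python
import numpy as np

# A made-up 28x28 image: all zeros with a short vertical stroke of value 255
image = np.zeros((28, 28), dtype=np.uint8)
image[10:18, 14] = 255

flat = image.reshape(-1)      # 784 values, laid out row by row

labels = np.array([3, 0, 9])  # hypothetical digit labels
one_hot = np.eye(10)[labels]  # one row per label, one column per digit

print(flat.shape, one_hot.shape)  # (784,) (3, 10)
```

Pixel (row 10, column 14) ends up at position 10 * 28 + 14 = 294 of the flattened array.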

### 4.11 Building your own digit recognition model
You've reached the final exercise of the course - you now know everything you need to build an accurate model to recognize handwritten digits!

We've already done the basic manipulation of the MNIST dataset shown in the video, so you have X and y loaded and ready to model with. Sequential and Dense from keras are also pre-imported.

To add an extra challenge, we've loaded only 2500 images, rather than 60000 which you will see in some published results. Deep learning models perform better with more data, however, they also take longer to train, especially when they start becoming more complex.

If you have a computer with a CUDA compatible GPU, you can take advantage of it to improve computation time. If you don't have a GPU, no problem! You can set up a deep learning environment in the cloud that can run your models on a GPU. Here is a blog post by Dan that explains how to do this - check it out after completing this exercise! https://www.datacamp.com/community/tutorials/deep-learning-jupyter-aws

It is a great next step as you continue your deep learning journey.

In [None]:
# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(50, activation='relu', input_shape=(784,)))

# Add the second hidden layer
model.add(Dense(50, activation='relu'))

# Add the output layer
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy', 
              metrics=['accuracy'])

# Fit the model
model.fit(X, y, validation_split=0.3)


In [None]:
    Train on 1750 samples, validate on 750 samples
    Epoch 1/10
    
  32/1750 [..............................] - ETA: 3s - loss: 2.3008 - acc: 0.1250
 320/1750 [====>.........................] - ETA: 0s - loss: 2.1986 - acc: 0.2125
 672/1750 [==========>...................] - ETA: 0s - loss: 2.0420 - acc: 0.3229
1056/1750 [=================>............] - ETA: 0s - loss: 1.8622 - acc: 0.4167
1344/1750 [======================>.......] - ETA: 0s - loss: 1.7421 - acc: 0.4680
1696/1750 [============================>.] - ETA: 0s - loss: 1.6015 - acc: 0.5195
1750/1750 [==============================] - 0s - loss: 1.5804 - acc: 0.5280 - val_loss: 0.8387 - val_acc: 0.7933
    Epoch 2/10
    
  32/1750 [..............................] - ETA: 0s - loss: 0.7431 - acc: 0.9375
 480/1750 [=======>......................] - ETA: 0s - loss: 0.7692 - acc: 0.8187
 864/1750 [=============>................] - ETA: 0s - loss: 0.6975 - acc: 0.8241
1152/1750 [==================>...........] - ETA: 0s - loss: 0.6736 - acc: 0.8247
1280/1750 [====================>.........] - ETA: 0s - loss: 0.6563 - acc: 0.8289
1472/1750 [========================>.....] - ETA: 0s - loss: 0.6378 - acc: 0.8308
1632/1750 [==========================>...] - ETA: 0s - loss: 0.6333 - acc: 0.8333
1750/1750 [==============================] - 0s - loss: 0.6268 - acc: 0.8354 - val_loss: 0.5000 - val_acc: 0.8720
    Epoch 3/10
    
  32/1750 [..............................] - ETA: 0s - loss: 0.6002 - acc: 0.7812
 448/1750 [======>.......................] - ETA: 0s - loss: 0.4587 - acc: 0.8750
 928/1750 [==============>...............] - ETA: 0s - loss: 0.4409 - acc: 0.8858
1408/1750 [=======================>......] - ETA: 0s - loss: 0.4191 - acc: 0.8878
1750/1750 [==============================] - 0s - loss: 0.4096 - acc: 0.8903 - val_loss: 0.4262 - val_acc: 0.8693
    Epoch 4/10
    
  32/1750 [..............................] - ETA: 0s - loss: 0.4248 - acc: 0.8438
 512/1750 [=======>......................] - ETA: 0s - loss: 0.3218 - acc: 0.9180
 864/1750 [=============>................] - ETA: 0s - loss: 0.3151 - acc: 0.9190
1152/1750 [==================>...........] - ETA: 0s - loss: 0.3197 - acc: 0.9097
1440/1750 [=======================>......] - ETA: 0s - loss: 0.3124 - acc: 0.9125
1728/1750 [============================>.] - ETA: 0s - loss: 0.3146 - acc: 0.9120
1750/1750 [==============================] - 0s - loss: 0.3176 - acc: 0.9109 - val_loss: 0.3806 - val_acc: 0.8733
    Epoch 5/10
    
  32/1750 [..............................] - ETA: 0s - loss: 0.1267 - acc: 1.0000
 512/1750 [=======>......................] - ETA: 0s - loss: 0.2501 - acc: 0.9414
 992/1750 [================>.............] - ETA: 0s - loss: 0.2473 - acc: 0.9365
1472/1750 [========================>.....] - ETA: 0s - loss: 0.2582 - acc: 0.9334
1750/1750 [==============================] - 0s - loss: 0.2529 - acc: 0.9354 - val_loss: 0.3595 - val_acc: 0.8947
    Epoch 6/10
    
  32/1750 [..............................] - ETA: 0s - loss: 0.1885 - acc: 0.9688
 512/1750 [=======>......................] - ETA: 0s - loss: 0.2542 - acc: 0.9395
 992/1750 [================>.............] - ETA: 0s - loss: 0.2131 - acc: 0.9526
1312/1750 [=====================>........] - ETA: 0s - loss: 0.2069 - acc: 0.9520
1696/1750 [============================>.] - ETA: 0s - loss: 0.2126 - acc: 0.9475
1750/1750 [==============================] - 0s - loss: 0.2126 - acc: 0.9474 - val_loss: 0.3391 - val_acc: 0.8920
    Epoch 7/10
    
  32/1750 [..............................] - ETA: 0s - loss: 0.2002 - acc: 0.9688
 480/1750 [=======>......................] - ETA: 0s - loss: 0.1501 - acc: 0.9750
 864/1750 [=============>................] - ETA: 0s - loss: 0.1551 - acc: 0.9676
1280/1750 [====================>.........] - ETA: 0s - loss: 0.1728 - acc: 0.9563
1632/1750 [==========================>...] - ETA: 0s - loss: 0.1744 - acc: 0.9559
1750/1750 [==============================] - 0s - loss: 0.1744 - acc: 0.9554 - val_loss: 0.3268 - val_acc: 0.8960
    Epoch 8/10
    
  32/1750 [..............................] - ETA: 0s - loss: 0.2347 - acc: 0.9688
 416/1750 [======>.......................] - ETA: 0s - loss: 0.1399 - acc: 0.9712
 800/1750 [============>.................] - ETA: 0s - loss: 0.1431 - acc: 0.9725
1216/1750 [===================>..........] - ETA: 0s - loss: 0.1594 - acc: 0.9663
1664/1750 [===========================>..] - ETA: 0s - loss: 0.1540 - acc: 0.9669
1750/1750 [==============================] - 0s - loss: 0.1509 - acc: 0.9674 - val_loss: 0.3195 - val_acc: 0.8960
    Epoch 9/10
    
  32/1750 [..............................] - ETA: 0s - loss: 0.1214 - acc: 0.9375
 512/1750 [=======>......................] - ETA: 0s - loss: 0.1125 - acc: 0.9766
 992/1750 [================>.............] - ETA: 0s - loss: 0.1264 - acc: 0.9738
1440/1750 [=======================>......] - ETA: 0s - loss: 0.1226 - acc: 0.9771
1750/1750 [==============================] - 0s - loss: 0.1191 - acc: 0.9777 - val_loss: 0.3337 - val_acc: 0.9040
    Epoch 10/10
    
  32/1750 [..............................] - ETA: 0s - loss: 0.1376 - acc: 1.0000
 512/1750 [=======>......................] - ETA: 0s - loss: 0.1001 - acc: 0.9844
 992/1750 [================>.............] - ETA: 0s - loss: 0.1058 - acc: 0.9819
1472/1750 [========================>.....] - ETA: 0s - loss: 0.1043 - acc: 0.9810
1750/1750 [==============================] - 0s - loss: 0.1015 - acc: 0.9806 - val_loss: 0.3343 - val_acc: 0.9013

You should see better than 90% accuracy recognizing handwritten digits, even while using a small training set of only 1750 images!
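
For intuition on the `acc` and `val_acc` numbers: with a softmax output, the predicted class is the one with the highest probability, and accuracy is the fraction of predictions that match the true class. A sketch with hypothetical probabilities (three classes instead of ten, to keep it readable):

```python
import numpy as np

# Hypothetical softmax outputs: one row per image, one column per class
probabilities = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
])
true_labels = np.array([1, 0, 2, 2])  # hypothetical true classes

predicted = probabilities.argmax(axis=1)       # most probable class per image
accuracy = (predicted == true_labels).mean()   # share of correct predictions

print(predicted, accuracy)  # [1 0 2 1] 0.75
```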

### 4.12 Final Thoughts
Next steps
- Start with standard prediction problems on tables of numbers
- Images (with convolutional neural networks) are a common next step
- Other data types: text, sound, etc.
- Kaggle has great datasets and forums to learn from
- Check out the Wikipedia page: List of datasets for machine learning research
- keras.io has excellent documentation
- The Keras and TensorFlow GitHub repos are good resources
- A graphics processing unit (GPU) provides dramatic speedups in model training times
    - you need a CUDA-compatible GPU
    - most NVIDIA GPUs are compatible
- For a guide to training on GPUs in the cloud, look here:
    - http://bit.ly/2mYQXQb