# Introductions

The Deep Learning Foundation Program is divided into **six chapters**, some of which include a project to complete:


1.   Introductions
2.   Neural Networks (**P1: Your first neural network**)
3.   Convolutional Neural Networks (**P2: Dog Breed Classifier**)
4.   Recurrent Neural Networks (**P3: Generate TV Scripts**)
5.   Generative Adversarial Networks (**P4: Generate Faces**)
6.   Deep Reinforcement Learning (**P5: Teach a Quadcopter to Fly**)

**Machine Learning:**
* Supervised (Feedback)
* Unsupervised (No labels)
* Reinforcement Learning (Feedback at the end)

---


## Anaconda
Conda is a package manager that organizes dependencies into isolated environments.



```
conda create -n name python=3         # Creates a new environment with Python 3
conda env list                        # Lists all environments
source activate name                  # Activates an environment (newer conda versions: conda activate name)
  conda install numpy pandas ...      # Installs packages inside the environment
  conda list                          # Lists all packages installed in the environment
  conda env export > environment.yaml # Exports all dependencies into a yaml file
conda env create -f environment.yaml  # Creates a new environment from the yaml file
conda env remove -n name              # Removes an environment
```


## Jupyter
A notebook is a web application that allows you to combine explanatory text, math equations, code and visualizations in one easily shareable document.
It is an example of [literate programming](http://www.literateprogramming.com/).

### How Jupyter works:

![explanation of how jupyter notebooks work](https://jupyter.readthedocs.io/en/latest/_images/notebook_components.png)

## Matrix Math and Numpy Refresher
### Data Dimensions
* **Scalars:** Simplest shape | 0 dimensions
* **Vectors:** Row or column vectors | 1 dimension = length
* **Matrices:** 2D grid of values | 2 dimensions = rows and columns
* **Tensors:** Any n-dimensional collection of values

A_xy refers to the element in row x and column y of matrix A.

### Numpy
NumPy's core is written in C, so its mathematical operations are fast.

In [3]:
import numpy as np

s = np.array(5)        # Scalar
s.shape                # () since it is a scalar

v = np.array([1,2,3])  # Vector
v.shape                # (3,)
v[1]                   # 2
v[1:]                  # 2,3 | Access elements from the second one on

m = np.array([[1,2,3], [4,5,6], [7,8,9]])
m.shape                # (3,3)

t = np.array([[[[1],[2]],[[3],[4]],[[5],[6]]],[[[7],[8]],[[9],[10]],[[11],[12]]],[[[13],[14]],[[15],[16]],[[17],[18]]]])
t.shape                # (3,3,2,1)

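v.reshape(1,3)         # Returns the data with shape (1,3); v itself is unchanged
v = v[:, None]         # Add a new axis by slicing: shape (3,) -> (3,1)
v = v[None,:]          # Add another axis in front: shape (3,1) -> (1,3,1)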
v

array([[[1],
        [2],
        [3]]])

### Element-wise Matrix Operations
Treat the items in the matrix individually and perform the same operation on each one.
Matrices have to have the same shape.

In [3]:
values = [1,2,3,4,5]
values = np.array(values) + 5
print(values)            # [6,7,8,9,10]

values = np.multiply(values, 5)
print(values)            # [30,35,40,45,50]

[ 6  7  8  9 10]
[30 35 40 45 50]


### Matrix Multiplication
Matrices don't have to have the same shape.
Each element of the result is the dot product of a row of the first matrix and a column of the second matrix, so the matrix product takes the dot product multiple times.

#### Important Reminders about Matrix Multiplication
* The number of columns in the left matrix must equal the number of rows in the right matrix, e.g. 2x3 and 3x2 (rows x columns).
* The result has the same number of rows as the left matrix and the same number of columns as the right matrix.
* Order matters: in general A*B != B*A.
* The data in the left matrix should be arranged as rows and the data in the right matrix as columns.

#### Dot Product
Multiply the corresponding elements of each vector. Then we add up all the results.
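
As a rough illustration of how the matrix product is built from repeated dot products, here is a minimal sketch. The helper name `matmul_by_dot` and the sample matrices are assumptions made for this example; in practice you would simply call `np.matmul`.

```python
import numpy as np

def matmul_by_dot(left, right):
    """Build the matrix product entry by entry from dot products (illustrative helper)."""
    rows, inner = left.shape
    inner_right, cols = right.shape
    if inner != inner_right:
        raise ValueError("columns of the left matrix must equal rows of the right matrix")
    result = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            # Entry (r, c) is the dot product of row r of `left` and column c of `right`
            result[r, c] = np.dot(left[r, :], right[:, c])
    return result

left = np.array([[1, 2, 3], [4, 5, 6]])      # shape (2, 3)
right = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)
product = matmul_by_dot(left, right)         # shape (2, 2)

print(product)
print(np.allclose(product, np.matmul(left, right)))  # True
```

The result has as many rows as the left matrix and as many columns as the right matrix, as stated in the reminders above.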

### NumPy Matrix Multiplication

**Element-wise** multiplication: use the `*` operator or the `multiply` function:

In [3]:
import numpy as np

multiple = np.array([[1,2,3],[4,5,6]])
multiple

new = multiple * 0.25
new

new * multiple

np.multiply(multiple, new)


array([[0.25, 1.  , 2.25],
       [4.  , 6.25, 9.  ]])

Finding the **Matrix Product** using NumPy's matmul function:

In [28]:
a = np.array([[1,2,3,4],[5,6,7,8]])
a

a.shape # (2,4)

b = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
b

b.shape # (4,3)

c = np.matmul(a,b)
c

c.shape # 2,3

(2, 3)

NumPy's **dot function** gives the same result as matmul if both matrices are 2-dimensional

In [31]:
a = np.array([[1,2],[3,4]])

np.dot(a,a)
a.dot(a)
np.matmul(a,a)

array([[ 7, 10],
       [15, 22]])

### Transpose

The transpose swaps rows and columns, so if the original matrix was **not** square, its dimensions are swapped (a 3x4 matrix becomes 4x3).
Each feature is either in a row or in a column.

Whether you should transpose depends on the situation(!)

**You can only safely use a transpose in a matrix multiplication if the data in both original matrices is arranged as rows.**

In [33]:
m = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
m

m.T # Transpose of m

array([[ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11],
       [ 4,  8, 12]])

# Neural Networks

## Lesson 1: Introduction to Neural Networks

The first idea is to compare neural networks to linear regression.

* W = Weights
* b = Bias
* x = Input
* y = Label
* ŷ = Prediction

### Perceptron

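**Comparison:** A perceptron evaluates a linear equation, for example 2\*Test + 1\*Grades - 18, and outputs 1 if the result is >= 0 and 0 otherwise. In the graph representation the smaller nodes are the inputs and the arrows carry the weights and the bias.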

#### Perceptron as Logical Operators

**AND:** Only true if both inputs are 1

**OR:** True if any of its inputs is 1

**NOT:** Inverts a single input (any other input is ignored)

**NAND:** NOT applied to the output of AND
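
To make the logical operators concrete, here is a small sketch of step-function perceptrons that reproduce them. The weights and biases are hand-picked assumptions (many other values work equally well), and the function names are chosen only for this example.

```python
import numpy as np

def perceptron(inputs, weights, bias):
    """Return 1 if the weighted sum of the inputs plus the bias is >= 0, else 0."""
    return int(np.dot(inputs, weights) + bias >= 0)

# Hand-picked (hypothetical) weights and biases for each operator
def AND(a, b):
    return perceptron([a, b], [1.0, 1.0], -1.5)

def OR(a, b):
    return perceptron([a, b], [1.0, 1.0], -0.5)

def NOT(a):
    return perceptron([a], [-1.0], 0.5)

def NAND(a, b):
    # NAND is NOT applied to the output of AND
    return NOT(AND(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "NAND:", NAND(a, b))
```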

The perceptron step (only works for linearly separable data):


In [0]:
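import numpy as np

# Helper functions assumed by perceptronStep (not shown in the original notes)
def stepFunction(t):
    return 1 if t >= 0 else 0

def prediction(x, W, b):
    # np.ravel handles both a scalar result and a one-element array
    return stepFunction(np.ravel(np.matmul(x, W) + b)[0])

def perceptronStep(X, y, W, b, learn_rate = 0.01):
    # One pass over the data: move the line towards every misclassified point
    for i in range(len(X)):
        pred = prediction(X[i], W, b)
        if y[i] - pred == 1:
            # Predicted 0 but the label is 1: add the scaled point to the weights
            W[0] += X[i][0] * learn_rate
            W[1] += X[i][1] * learn_rate
            b += learn_rate
        elif y[i] - pred == -1:
            # Predicted 1 but the label is 0: subtract the scaled point from the weights
            W[0] -= X[i][0] * learn_rate
            W[1] -= X[i][1] * learn_rate
            b -= learn_rate
    return W, b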

### Error Functions

* Minimize the error function with gradient descent without getting stuck in a local minimum.
* The error function needs to be differentiable and continuous.

**Goal:** Find the fastest way to minimize the loss. The challenge is not to get stuck in a local minimum.
Correctly classified points should contribute a small error. Misclassified points should carry a high penalty.

**Prediction:** A discrete prediction would be yes/no. A continuous prediction would be a 0-100% likelihood of something:
* **Sigmoid:** Change the activation function from the step function to the sigmoid: $\sigma(x) = 1/(1+e^{-x})$
* **Softmax:** For classification problems with several classes. The probabilities across all options have to add up to 1: $P(\text{class } i) = e^{z_i} / (e^{z_1} + \dots + e^{z_n})$
* **One-Hot Encoding:** Used for categorical input variables. There is one variable (column) per class that is either 0 or 1, so an element of the third of three classes is encoded as [0,0,1]
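
The three ideas above can be sketched in a few lines of NumPy. This is only an illustration: the function names and the example scores are assumptions, and subtracting the maximum inside `softmax` is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def sigmoid(x):
    """Continuous replacement for the step function: maps any score to (0, 1)."""
    return 1 / (1 + np.exp(-x))

def softmax(scores):
    """Turn a vector of scores into probabilities that add up to 1."""
    # Subtracting the maximum avoids overflow in exp and leaves the result unchanged
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / np.sum(exp_scores)

def one_hot(class_index, n_classes):
    """Encode a class as a vector with a single 1 at position class_index."""
    encoded = np.zeros(n_classes)
    encoded[class_index] = 1
    return encoded

scores = np.array([2.0, 1.0, 0.0])   # hypothetical scores for three classes
probs = softmax(scores)

print(sigmoid(0.0))          # 0.5
print(probs, probs.sum())    # the probabilities add up to 1
print(one_hot(2, 3))         # [0. 0. 1.]
```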

### Maximum Probability

Multiply the probabilities of all points. The product is the probability that all points have their respective colors. The goal then becomes to maximize this likelihood.
* Increasing the probability => Decreasing the error
* Going from products to sums: take the logarithm, because log(ab) = log(a) + log(b). Since probabilities lie between 0 and 1, their logarithms are negative, so we use -ln(x) instead.

### Cross-Entropy

Cross-entropy is the sum of the negative logarithms of the probabilities: -ln(p1) - ln(p2) - ...

A good model gives a small cross-entropy, a bad model gives a high cross-entropy.

Given a bunch of events and a bunch of probabilities, cross-entropy says how likely it is that those events happen based on the probabilities:

* Very likely: Small CE
* Very unlikely: Big CE

For each point, the probabilities of all possible outcomes add up to 1.

**CE:** Measures how similar two vectors are (the labels and the predicted probabilities).

In [1]:
import numpy as np

L = [1,0,1,1]
P = [0.4,0.6,0.1,0.5]

# Write a function that takes as input two lists Y, P,
# and returns the float corresponding to their cross-entropy.
def cross_entropy(Y, P):
    
    result = 0
    
    for y,p in zip(Y,P):
        if y == 0:
            result += np.log(1 - p)
        else:
            result += np.log(p)
    
    return result * (-1)
  
print(cross_entropy(L,P))

4.828313737302301


### Multi-Class Cross Entropy

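**CE:** $-\sum_{i}\sum_{j} y_{ij}\ln(p_{ij})$, where the sums run over all points and all classes, $y_{ij}$ is the one-hot label and $p_{ij}$ is the predicted probability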

A higher CE correlates to a lower Probability of this event happening (and vice versa)
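
A minimal sketch of the multi-class formula, assuming one-hot labels and strictly positive predicted probabilities. The function name and the example values are made up for illustration.

```python
import numpy as np

def multi_class_cross_entropy(Y, P):
    """Y: one-hot labels, P: predicted probabilities, both shaped (n_points, n_classes).

    Assumes every probability in P is strictly greater than 0.
    """
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    # Only the probability assigned to the true class contributes for each point
    return -np.sum(Y * np.log(P))

Y = [[1, 0, 0], [0, 1, 0]]                     # two points, three classes
P_good = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]    # puts most mass on the true class
P_bad = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]     # puts most mass on a wrong class

ce_good = multi_class_cross_entropy(Y, P_good)
ce_bad = multi_class_cross_entropy(Y, P_bad)
print(ce_good, ce_bad)   # the better model has the smaller cross-entropy
```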

### Gradient Descent (Step)

Starting from the current value of the error function, gradient descent searches for weights that give a smaller error.

The negative of the gradient of the error function points in the direction of the fastest way *"down"*. We take a step in that direction, scaled by a small learning rate.

The **Derivative of the Sigmoid function** is: s(x) * (1-s(x))

The **Gradient of the Error function** for a single point is: -(y - ŷ)(x1,...,xn,1)

#### Step

w'i <- wi + alpha(y-ŷ)xi (where alpha is the learning rate; the factor 1/m from averaging over the m points is absorbed into alpha)

The update of the bias is similar: b' <- b + alpha(y-ŷ)


#### Pseudo Code

1. Start with random weights
2. For every point

  2.1 For i = 1...n
  
    2.1.1 Update w'i <- wi - alpha(ŷ-y)xi
    
  2.2 Update b' <- b - alpha(ŷ-y)
3. Repeat until the error is small

(The number of repetitions is the number of epochs)

In [0]:
# Activation (sigmoid) function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def output_formula(features, weights, bias):
    return sigmoid(np.dot(features, weights) + bias)

def error_formula(y, output):
    return - y*np.log(output) - (1 - y) * np.log(1-output)

def update_weights(x, y, weights, bias, learnrate):
    output = output_formula(x, weights, bias)
    d_error = -(y - output)
    weights -= learnrate * d_error * x
    bias -= learnrate * d_error
    return weights, bias
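
To show how these functions fit into the pseudo code, here is a small self-contained training loop. It is a sketch under assumptions: the toy data (logical AND), the zero initialization, the learning rate and the number of epochs are all chosen for illustration, and `update_weights` is rewritten to return new arrays instead of modifying them in place.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def output_formula(features, weights, bias):
    return sigmoid(np.dot(features, weights) + bias)

def error_formula(y, output):
    return -y * np.log(output) - (1 - y) * np.log(1 - output)

def update_weights(x, y, weights, bias, learnrate):
    output = output_formula(x, weights, bias)
    d_error = -(y - output)
    # One gradient descent step for a single point
    return weights - learnrate * d_error * x, bias - learnrate * d_error

# Hypothetical toy data: the logical AND of two inputs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([0.0, 0.0, 0.0, 1.0])

weights, bias = np.zeros(2), 0.0
initial_error = np.mean(error_formula(Y, output_formula(X, weights, bias)))

for epoch in range(500):          # step 3: repeat for a number of epochs
    for x, y in zip(X, Y):        # step 2: for every point
        weights, bias = update_weights(x, y, weights, bias, learnrate=0.5)

final_error = np.mean(error_formula(Y, output_formula(X, weights, bias)))
print("error before:", initial_error, "error after:", final_error)
```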

### Neural Network Architecture

**Non-linear Models:** The boundary separating two groups of points is no longer a line. This is where neural networks come into play.

=> Combine two linear models: each gives a probability via the sigmoid function, and a weighted sum of the two (passed through the sigmoid again) gives a third, non-linear model.

* Input Layer (contains the inputs x1, x2,..., xn)
* Hidden Layer
* Output Layer (non-linear space)

**Multi-class problems** have more nodes in the output layer, one per class

**More Hidden Layers:** Deep(er) Neural Network.



### Feedforward

Feedforward is the process neural networks use to turn the input into an output.
The weights determine how strongly each input is emphasized.

Input vector -> Apply a sequence of linear functions and sigmoid functions to get a highly non-linear output.
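
The sequence of linear functions and sigmoids can be sketched as a loop over layers. The layer sizes, the random weights and the helper name `feedforward` are assumptions for this example only.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def feedforward(x, layers):
    """Pass x through a list of (weights, bias) pairs, applying sigmoid after each linear step."""
    activation = x
    for weights, bias in layers:
        activation = sigmoid(np.dot(activation, weights) + bias)
    return activation

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(3, 4)), np.zeros(4)),   # input layer (3 values) -> hidden layer (4 nodes)
    (rng.normal(size=(4, 1)), np.zeros(1)),   # hidden layer (4 nodes) -> output layer (1 node)
]

x = np.array([0.5, -0.2, 0.1])
y_hat = feedforward(x, layers)
print(y_hat)   # a single probability between 0 and 1
```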

### Backpropagation

$E(W) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \ln(\hat{y}_i) + (1 - y_i)\ln(1 - \hat{y}_i)\right]$

* Doing a feedforward operation
* Comparing the output to the desired output (y - ŷ)
* Calculating the Error
* Running the feedforward operation backwards to spread the error to each of the weights
* Use this to update the weights and get a better model
* Repeat this process



## Lesson 2: Implementing Gradient Descent

In the last lesson we used the log-loss function. There are many other error functions used for neural networks. One is the mean squared error: the mean of the squares of the differences between the predictions and the labels. Here we use the closely related sum of squared errors (SSE).

**SSE:** $E = \frac{1}{2}\sum_{\mu}\sum_{j}(y_j^{\mu} - \hat{y}_j^{\mu})^2$, summing over all data points $\mu$ and all output units $j$

**Caveats:** We can end up in a local minimum, depending on how the weights are initialized.

The **derivative of the Error** with respect to $w_i$ is: $-(y - \hat{y})f'(h)x_i$, where $h = \sum_i w_i x_i$

In [2]:
import numpy as np

# Defining the sigmoid function for activations
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Input data
x = np.array([0.1, 0.3])
# Target
y = 0.2
# Input to output weights
weights = np.array([-0.8, 0.5])

# The learning rate, eta in the weight step equation
learnrate = 0.5

# the linear combination performed by the node (h in f(h) and f'(h))
h = x[0]*weights[0] + x[1]*weights[1]
# or h = np.dot(x, weights)

# The neural network output (y-hat)
nn_output = sigmoid(h)

# output error (y - y-hat)
error = y - nn_output

# output gradient (f'(h))
output_grad = sigmoid_prime(h)

# error term (lowercase delta)
error_term = error * output_grad

# Gradient descent step 
del_w = [ learnrate * error_term * x[0],
          learnrate * error_term * x[1]]
# or del_w = learnrate * error_term * x

print(del_w)

[-0.003963803079006883, -0.011891409237020648]


Here's the general algorithm for updating the weights with gradient descent:

Set the weight step to zero: $\Delta w_i = 0$

**For each record in the training data:**

* Make a forward pass through the network, calculating the output $\hat{y} = f(\sum_i w_i x_i)$
* Calculate the error term for the output unit, $\delta = (y - \hat{y}) f'(\sum_i w_i x_i)$
* Update the weight step $\Delta w_i = \Delta w_i + \delta x_i$

After all records have been processed, update the weights $w_i = w_i + \eta \Delta w_i / m$, where $\eta$ is the learning rate and $m$ is the number of records. Here we're averaging the weight steps to help reduce any large variations in the training data.

Repeat for e epochs.

**Example with Epochs:**


In [0]:
import numpy as np
from data_prep import features, targets, features_test, targets_test


def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Note: no separate sigmoid_prime function is needed here. Since
#       output = sigmoid(h), the derivative can be computed more
#       efficiently as output * (1 - output).

# Use the same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target

        # Note: We haven't included the h variable from the previous
        #       lesson. You can add it if you want, or you can calculate
        #       the h together with the output

        # TODO: Calculate the output
        h = np.dot(x, weights)
        
        output = sigmoid(h)

        # TODO: Calculate the error
        error = y - output

        # TODO: Calculate the error term
        #output_grad = output * (1 - output)
        
        error_term = error * output * (1 - output)

        # TODO: Calculate the change in weights for this sample
        #       and add it to the total weight change
        del_w += error_term * x

    # TODO: Update weights using the learning rate and the average change in weights
    weights += learnrate * del_w / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss


# Calculate accuracy on test data
tes_out = sigmoid(np.dot(features_test, weights))
predictions = tes_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

### Multilayer Perceptrons

With hidden units, the weights between the layers require two indices: wij, where i denotes the input unit and j the hidden unit.

#### Making a column vector

You can get the transpose of an array with `arr.T`, but for a 1D array the transpose is still a 1D array, so it does not give you a column vector. Instead, use `arr[:, None]` to create a column vector:

In [6]:
features = np.array([0.49671415, -0.1382643 , 0.64768854])

print(features)
# array([ 0.49671415, -0.1382643 ,  0.64768854])

print(features.T)
# array([ 0.49671415, -0.1382643 ,  0.64768854])

print(features[:, None])
# array([[ 0.49671415],
#       [-0.1382643 ],
#       [ 0.64768854]])

[ 0.49671415 -0.1382643   0.64768854]
[ 0.49671415 -0.1382643   0.64768854]
[[ 0.49671415]
 [-0.1382643 ]
 [ 0.64768854]]


#### Backpropagation

Using the chain rule to find the error with respect to the weights connecting the input layer to the hidden layer (for a two layer network).

In [1]:
import numpy as np


def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])

weights_hidden_output = np.array([0.1, -0.3])

## Forward pass
hidden_layer_input = np.dot(x, weights_input_hidden)
hidden_layer_output = sigmoid(hidden_layer_input)

output_layer_in = np.dot(hidden_layer_output, weights_hidden_output)
output = sigmoid(output_layer_in)

## Backwards pass
## TODO: Calculate output error
error = target - output

# TODO: Calculate error term for output layer
output_error_term = error * output * (1 - output)

# TODO: Calculate error term for hidden layer
hidden_error_term = np.dot(output_error_term, weights_hidden_output) * hidden_layer_output * (1 - hidden_layer_output)

# TODO: Calculate change in weights for hidden layer to output layer
delta_w_h_o = learnrate * output_error_term * hidden_layer_output

# TODO: Calculate change in weights for input layer to hidden layer
delta_w_i_h = learnrate * hidden_error_term * x[:, None]

print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)


Change in weights for hidden layer to output layer:
[0.00804047 0.00555918]
Change in weights for input layer to hidden layer:
[[ 1.77005547e-04 -5.11178506e-04]
 [ 3.54011093e-05 -1.02235701e-04]
 [-7.08022187e-05  2.04471402e-04]]


#### Implementing Backpropagation

In [9]:
import numpy as np
from data_prep import features, targets, features_test, targets_test

np.random.seed(21)

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


# Hyperparameters
n_hidden = 2  # number of hidden units
epochs = 900
learnrate = 0.005

n_records, n_features = features.shape
last_loss = None
# Initialize weights
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                         size=n_hidden)

for e in range(epochs):
    del_w_input_hidden = np.zeros(weights_input_hidden.shape)
    del_w_hidden_output = np.zeros(weights_hidden_output.shape)
    for x, y in zip(features.values, targets):
        ## Forward pass ##
        # TODO: Calculate the output
        hidden_input = np.dot(x, weights_input_hidden)
        hidden_output = sigmoid(hidden_input)
        
        input_for_output = np.dot(hidden_output, weights_hidden_output)
        
        output = sigmoid(input_for_output)

        ## Backward pass ##
        # TODO: Calculate the network's prediction error
        error = y - output

        # TODO: Calculate error term for the output unit
        output_error_term = error * output * (1 - output)

        ## propagate errors to hidden layer

        # TODO: Calculate the hidden layer's contribution to the error
        hidden_error = np.dot(output_error_term, weights_hidden_output)
        
        # TODO: Calculate the error term for the hidden layer
        hidden_error_term = hidden_error * hidden_output * (1 - hidden_output)
        
        # TODO: Update the change in weights
        del_w_hidden_output += learnrate * output_error_term * hidden_output
        del_w_input_hidden += learnrate * hidden_error_term * x[:, None]

    # TODO: Update weights
    weights_input_hidden += del_w_input_hidden
    weights_hidden_output += del_w_hidden_output

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        # Use the full training set (not just the last record x) to compute the loss
        hidden_output = sigmoid(np.dot(features, weights_input_hidden))
        out = sigmoid(np.dot(hidden_output,
                             weights_hidden_output))
        loss = np.mean((out - targets) ** 2)

        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
hidden = sigmoid(np.dot(features_test, weights_input_hidden))
out = sigmoid(np.dot(hidden, weights_hidden_output))
predictions = out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

## Lesson 3: Training Neural Networks

**Training set:** The data the model is trained on, without ever looking at the testing set.

**Testing set:** Held-out data that is only reintroduced at the end to test the model.

---

**Overfitting:** The model fits the training data well, but it cannot generalize. Error due to variance.

**Underfitting:** Trying to kill Godzilla with a flyswatter: the model is too simple for the data. Error due to bias.

Underfitting tends to come from a neural network architecture that is too simple, overfitting from one that is too complicated.
We err on the side of an overly complicated model and then apply techniques to prevent overfitting.

#### Early Stopping

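* Underfitting: Training and testing errors are both large
* Overfitting: Training error is tiny and testing error is large

--> Model Complexity Graph (error plotted against the number of epochs)
* The training error keeps decreasing with more epochs
* The testing error decreases at first and then increases again once the model starts to overfit
* We stop training at the point where the **testing error starts to increase again** (Early Stopping)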
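
One way to turn this rule into code is to track the testing error per epoch and stop once it has not improved for a few epochs. The sketch below assumes a hypothetical `patience` parameter and made-up error values; it is not the course's implementation.

```python
def early_stopping_epoch(testing_errors, patience=2):
    """Return the epoch with the lowest testing error seen before the error
    failed to improve for `patience` consecutive epochs."""
    best_epoch = 0
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(testing_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # the testing error is increasing again: stop training
    return best_epoch

# Hypothetical testing errors: decreasing at first, then increasing again
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(errors))   # 3
```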

#### Regularization

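Large coefficients make the sigmoid much steeper: compare sigmoid(10x1 + 10x2) with sigmoid(x1 + x2). The steep version is too certain and its derivative is close to 0 almost everywhere => Gradient Descent becomes harder.

##### Punish large coefficients:
**L1**: MSE + lambda * (|w1| + ... + |wn|) (absolute values)
* Good for feature selection
* Produces sparse vectors: small weights tend to go to 0, leaving a small number of non-zero weights.

**L2**: MSE + lambda * (w1^2 + ... + wn^2) (squared)
* Normally better for training models
* Does not favor sparse vectors; it tries to keep all weights homogeneously small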
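
The two penalty terms can be computed as follows. The weights, the value of lambda and the base error are hypothetical numbers picked for this sketch.

```python
import numpy as np

def l1_penalty(weights, lam):
    """lambda times the sum of the absolute values of the weights."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """lambda times the sum of the squared weights."""
    return lam * np.sum(weights ** 2)

weights = np.array([0.5, -2.0, 0.0, 1.5])   # hypothetical weights
lam = 0.1                                   # hypothetical regularization strength
base_error = 0.25                           # hypothetical error without regularization

print("L1 regularized error:", base_error + l1_penalty(weights, lam))
print("L2 regularized error:", base_error + l2_penalty(weights, lam))
```

Note how the large weight -2.0 contributes 4.0 to the L2 sum but only 2.0 to the L1 sum, so L2 punishes large individual weights more strongly.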

### Dropout

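Sometimes one part of the network has very large weights and dominates the training, so the other parts barely get trained.
**To counter this, some nodes are randomly turned off during training.**
The dropout parameter is the probability that each node is dropped in a given training pass.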
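
A minimal sketch of dropout applied to one layer's activations. Scaling the surviving activations by 1 / (1 - drop_prob) is the common "inverted dropout" convention and is an assumption here, as are the function and variable names.

```python
import numpy as np

def dropout(activations, drop_prob, rng):
    """Turn off each node with probability drop_prob (used during training only).

    The survivors are scaled by 1 / (1 - drop_prob) so the expected
    activation stays the same (inverted dropout, an assumption of this sketch).
    """
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask / (1.0 - drop_prob)

rng = np.random.default_rng(0)
hidden_activations = np.ones(1000)            # hypothetical layer output
dropped = dropout(hidden_activations, 0.5, rng)

print("fraction of nodes turned off:", np.mean(dropped == 0))
```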

### Vanishing Gradient

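The derivative of the sigmoid is at most 0.25, so the product of many such derivatives during backpropagation becomes tiny and the weights barely change. Other activation functions reduce this problem:

**Hyperbolic Tangent Function:** (e^x - e^-x) / (e^x + e^-x)

**Rectified Linear Unit (ReLU):** If positive return x. If negative return 0.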
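
Both activation functions are short to write down. The sketch below also compares the largest derivative of the sigmoid with that of tanh; the helper names and sample inputs are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def tanh_prime(x):
    return 1 - tanh(x) ** 2

def relu(x):
    # Returns x where x is positive and 0 elsewhere
    return np.maximum(0, x)

x_values = np.array([-2.0, 0.0, 2.0])
print(tanh(x_values))
print(relu(x_values))
# Both derivatives peak at x = 0: 0.25 for sigmoid versus 1.0 for tanh
print(sigmoid_prime(0.0), tanh_prime(0.0))
```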

### Batch vs. Stochastic Gradient Descent

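In batch gradient descent the whole dataset is used for every step, so the number of steps = number of epochs.
An epoch is one whole forward and backward pass over all the data.

**Stochastic Gradient Descent:** Split the data into batches and take one step per batch, using only that small subset of the data to compute the gradient.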
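
Splitting the data into batches could look like the sketch below. The generator name, the batch size and the toy arrays are assumptions; each yielded batch would be used for one gradient descent step.

```python
import numpy as np

def iterate_minibatches(features, targets, batch_size, rng):
    """Yield shuffled (features, targets) batches that together cover one epoch."""
    order = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        batch = order[start:start + batch_size]
        yield features[batch], targets[batch]

rng = np.random.default_rng(0)
features = np.arange(20, dtype=float).reshape(10, 2)   # 10 hypothetical records
targets = np.arange(10)

batch_sizes = [len(t) for _, t in iterate_minibatches(features, targets, 4, rng)]
print(batch_sizes)   # [4, 4, 2]
```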

### Learning Rate Decay

Best: If the error surface is steep: long steps. If it is flat: small steps.

**Random Restart**: Run gradient descent from several different starting locations.

**Momentum**: Weighted average of the previous steps, with a constant beta between 0 and 1: Step(n) + beta*Step(n-1) + beta^2*Step(n-2) + ...
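
The momentum sum can be computed incrementally by keeping a running "velocity". This is a sketch with assumed names and a made-up gradient sequence; the sign of the update and how the velocity is applied to the weights are left out.

```python
def momentum_steps(gradients, learnrate=0.1, beta=0.9):
    """Return the step taken at each iteration when using momentum.

    velocity(n) = learnrate * gradient(n) + beta * velocity(n-1), which expands to
    Step(n) + beta*Step(n-1) + beta^2*Step(n-2) + ...
    """
    velocity = 0.0
    steps = []
    for gradient in gradients:
        velocity = beta * velocity + learnrate * gradient
        steps.append(velocity)
    return steps

# Hypothetical constant gradient: the steps grow as previous steps accumulate
print(momentum_steps([1.0, 1.0, 1.0], learnrate=1.0, beta=0.5))   # [1.0, 1.5, 1.75]
```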

## Lesson 6: Sentiment Analysis



In [0]:
import time
import sys
import numpy as np

# Encapsulate our neural network in a class
class SentimentNetwork:
    def __init__(self, reviews, labels, hidden_nodes=10, learning_rate=0.1):
        """Create a SentimentNetwork with the given settings
        Args:
            reviews(list) - List of reviews used for training
            labels(list) - List of POSITIVE/NEGATIVE labels associated with the given reviews
            hidden_nodes(int) - Number of nodes to create in the hidden layer
            learning_rate(float) - Learning rate to use while training
        
        """
        # Assign a seed to our random number generator to ensure we get
        # reproducible results during development
        np.random.seed(1)

        # process the reviews and their associated labels so that everything
        # is ready for training
        self.pre_process_data(reviews, labels)
        
        # Build the network to have the number of hidden nodes and the learning rate that
        # were passed into this initializer. Make the same number of input nodes as
        # there are vocabulary words and create a single output node.
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels):
        
        # populate review_vocab with all of the words in the given reviews
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                review_vocab.add(word)

        # Convert the vocabulary set to a list so we can access words via indices
        self.review_vocab = list(review_vocab)
        
        # populate label_vocab with all of the words in the given labels.
        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        
        # Convert the label vocabulary set to a list so we can access labels via indices
        self.label_vocab = list(label_vocab)
        
        # Store the sizes of the review and label vocabularies.
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        # Create a dictionary of words in the vocabulary mapped to index positions
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        
        # Create a dictionary of labels mapped to index positions
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i
        
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Store the learning rate
        self.learning_rate = learning_rate

        # Initialize weights

        # These are the weights between the input layer and the hidden layer.
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
    
        # These are the weights between the hidden layer and the output layer.
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, 
                                                (self.hidden_nodes, self.output_nodes))
        
        # The input layer, a two-dimensional matrix with shape 1 x input_nodes
        self.layer_0 = np.zeros((1,input_nodes))
    
    def update_input_layer(self,review):

        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        
        for word in review.split(" "):
            # NOTE: This check ensures the word is actually a key in word2index
            #       before accessing it, which is important because accessing an
            #       invalid key will raise an exception in Python. This allows us
            #       to ignore unknown words encountered in new reviews.
            if word in self.word2index:
                self.layer_0[0][self.word2index[word]] += 1
                
    def get_target_for_label(self,label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0
        
    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))
    
    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)
    
    def train(self, training_reviews, training_labels):
        
        # make sure we have a matching number of reviews and labels
        assert(len(training_reviews) == len(training_labels))
        
        # Keep track of correct predictions to display accuracy during training 
        correct_so_far = 0

        # Remember when we started for printing time statistics
        start = time.time()
        
        # loop through all the given reviews and run a forward and backward pass,
        # updating weights for every item
        for i in range(len(training_reviews)):
            
            # Get the next review and its correct label
            review = training_reviews[i]
            label = training_labels[i]
            
            #### Implement the forward pass here ####
            ### Forward pass ###

            # Input Layer
            self.update_input_layer(review)

            # Hidden layer
            layer_1 = self.layer_0.dot(self.weights_0_1)

            # Output layer
            layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
            
            #### Implement the backward pass here ####
            ### Backward pass ###

            # Output error
            layer_2_error = layer_2 - self.get_target_for_label(label) # Output layer error is the difference between desired target and actual output.
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # Backpropagated error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # Update the weights
            self.weights_1_2 -= layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
            self.weights_0_1 -= self.layer_0.T.dot(layer_1_delta) * self.learning_rate # update input-to-hidden weights with gradient descent step

            # Keep track of correct predictions.
            if(layer_2 >= 0.5 and label == 'POSITIVE'):
                correct_so_far += 1
            elif(layer_2 < 0.5 and label == 'NEGATIVE'):
                correct_so_far += 1
            
            # For debug purposes, print out our prediction accuracy and speed 
            # throughout the training process. 
            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")
    
    def test(self, testing_reviews, testing_labels):
        """
        Attempts to predict the labels for the given testing_reviews,
        and uses the test_labels to calculate the accuracy of those predictions.
        """
        
        # keep track of how many correct predictions we make
        correct = 0

        # we'll time how many predictions per second we make
        start = time.time()

        # Loop through each of the given reviews and call run to predict
        # its label. 
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
            # For debug purposes, print out our prediction accuracy and speed 
            # throughout the prediction process. 

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    def run(self, review):
        """
        Returns a POSITIVE or NEGATIVE prediction for the given review.
        """
        # Run a forward pass through the network, like in the "train" function.
        
        # Input Layer
        self.update_input_layer(review.lower())

        # Hidden layer
        layer_1 = self.layer_0.dot(self.weights_0_1)

        # Output layer
        layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
        
        # Return POSITIVE for values greater than or equal to 0.5 in the output layer;
        # return NEGATIVE for other values
        if(layer_2[0] >= 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"
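
To make the forward pass in `run` concrete, here is a minimal pure-Python sketch of the same three steps: a linear hidden layer, a sigmoid output, and the 0.5 threshold. The 3-word vocabulary, the weight values and the function name `predict` are made up for illustration and are not part of the class above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(layer_0, weights_0_1, weights_1_2):
    """Forward pass: input -> linear hidden layer -> sigmoid output -> label."""
    n_hidden = len(weights_0_1[0])
    # Hidden layer: plain weighted sum, no activation function
    layer_1 = [
        sum(layer_0[i] * weights_0_1[i][j] for i in range(len(layer_0)))
        for j in range(n_hidden)
    ]
    # Output layer: weighted sum squashed into (0, 1) by the sigmoid
    layer_2 = sigmoid(sum(layer_1[j] * weights_1_2[j] for j in range(n_hidden)))
    label = "POSITIVE" if layer_2 >= 0.5 else "NEGATIVE"
    return label, layer_2

# Hypothetical network: 3 input words, 2 hidden nodes, 1 output node
layer_0 = [1, 0, 1]                                    # words 0 and 2 appear in the review
weights_0_1 = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]   # shape (3, 2)
weights_1_2 = [0.7, 0.9]                               # shape (2,)

label, probability = predict(layer_0, weights_0_1, weights_1_2)
print(label, round(probability, 3))
```

Because the hidden layer has no activation, only the final sigmoid introduces a non-linearity; the threshold then turns the probability into a label.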

## Lesson 7: Keras

Examples of packages for deep learning:

* Keras
* TensorFlow
* Caffe
* Theano
* Scikit-learn
* ...

### Creating a Sequential Model in Keras

In [22]:
import numpy as np

from keras.models import Sequential
from keras.layers import Dense, Activation

# Layers
## Fully connected layers
## max pool layers
## activation layers
## and more

# X has shape (num_rows, num_cols) where the training data is stored as row vectors
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)

# y must have an output vector for each input vector
y = np.array([[0], [0], [0], [1]], dtype=np.float32)

# Create the Sequential model
model = Sequential()
# The keras.models.Sequential class is a wrapper for the neural network model
# that treats the network as a sequence of layers

# 1st layer - Add a fully connected layer of 32 nodes that takes inputs with the same number of features as X
model.add(Dense(32, input_dim=X.shape[1]))

# Add a softmax activation layer
model.add(Activation('softmax'))

# 2nd layer - Add a fully connected output layer
model.add(Dense(1))

#Add a sigmoid activation layer
model.add(Activation('sigmoid'))

X

array([[0., 0.],
       [0., 1.],
       [1., 0.],
       [1., 1.]], dtype=float32)

* Keras requires the input shape to be specified for the first layer. It will automatically infer the shape of all other layers
* Activation is added after the layers
* Compiling the Keras model calls the backend (TensorFlow, Theano, etc.) and binds the optimizer, loss function and other parameters required before the model can be run on any input data.
* We specify the loss function to be *binary_crossentropy*, since the network has a single sigmoid output (*categorical_crossentropy* expects one output unit per class)
* and specify *adam* as the optimizer
* also we specify what metrics we want to evaluate the model with

In [20]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# See the resulting model architecture with the following command
model.summary()

# Train with the fit() method

model.fit(X, y, epochs=1000, verbose=0)

# Evaluate the model (on the training data, since this toy example has no separate test set)

model.evaluate(X, y, verbose=0)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_13 (Dense)             (None, 32)                96        
_________________________________________________________________
activation_13 (Activation)   (None, 32)                0         
_________________________________________________________________
dense_14 (Dense)             (None, 1)                 33        
_________________________________________________________________
activation_14 (Activation)   (None, 1)                 0         
Total params: 129
Trainable params: 129
Non-trainable params: 0
_________________________________________________________________
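
The `Param #` column in the summary can be checked by hand: a fully connected layer has one weight per (input, unit) pair plus one bias per unit, and activation layers add no parameters. A small sketch of that arithmetic follows; the helper name `dense_params` is hypothetical and not part of Keras.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

# Layer widths of the model above: 2 inputs -> 32 nodes -> 1 output
layer_sizes = [2, 32, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # parameters per Dense layer
print(sum(counts))  # total parameters
```

This gives 96 and 33 for the two Dense layers, 129 in total, which matches the summary.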

### Keras Optimizers

**SGD - Stochastic Gradient Descent:** It uses the following parameters:
* Learning rate.
* Momentum
* Nesterov Momentum (This slows down the gradient when it's close to the solution)

**Adam:** Adaptive Moment Estimation uses a more complicated exponential decay that consists of not just considering the average (first moment), but also the variance (second moment) of the previous steps.

**RMSProp:** RMSProp (RMS stands for Root Mean Square) decreases the learning rate by dividing it by the square root of an exponentially decaying average of squared gradients.

[Further explanation of Keras Optimizers](http://ruder.io/optimizing-gradient-descent/index.html#rmsprop)

[Keras Documentation about Optimizers](https://keras.io/optimizers/)
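
The three update rules described above can be compared on a toy problem. The sketch below applies the textbook form of each rule to the loss f(w) = w², whose gradient is 2w. It is not Keras's internal implementation, and the starting point, step counts and hyperparameter values are assumptions chosen for illustration.

```python
import math

def grad(w):
    # Gradient of the toy loss f(w) = w**2
    return 2.0 * w

def sgd(w, steps=500, lr=0.1, momentum=0.9, nesterov=False):
    v = 0.0  # velocity: decaying sum of past update steps
    for _ in range(steps):
        g = grad(w)
        v = momentum * v - lr * g
        # Nesterov applies the momentum term once more on top of the plain step
        w += momentum * v - lr * g if nesterov else v
    return w

def rmsprop(w, steps=500, lr=0.01, rho=0.9, eps=1e-8):
    avg_sq = 0.0  # exponentially decaying average of squared gradients
    for _ in range(steps):
        g = grad(w)
        avg_sq = rho * avg_sq + (1 - rho) * g * g
        w -= lr * g / (math.sqrt(avg_sq) + eps)
    return w

def adam(w, steps=500, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m = 0.0  # first moment: decaying average of gradients
    v = 0.0  # second moment: decaying average of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w0 = 1.0
results = {
    "sgd+momentum": sgd(w0),
    "sgd+nesterov": sgd(w0, nesterov=True),
    "rmsprop": rmsprop(w0),
    "adam": adam(w0),
}
for name, w in results.items():
    print(name, w)
```

All four variants should end close to the minimum at w = 0; they differ in how the step size is scaled along the way.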

### IMDB Data in Keras

In [23]:
# Imports
import numpy as np
import keras
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

# Loading the data
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=1000)

print(x_train.shape)
print(x_test.shape)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
(25000,)
(25000,)


In [24]:
print(x_train[0])
print(x_test[0])

[1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 2, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 2, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 2, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 2, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 2, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 2, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]
[1, 591, 202, 14, 31, 6, 717, 10, 10, 2, 2, 5, 4, 360, 7, 4, 177, 2, 394, 354, 4, 123, 9, 2, 2, 2, 10, 10, 13, 92, 124, 89, 488, 

In [25]:
# Encoding each input review as a binary vector of length 1000 (entry j is 1 if word index j appears in the review)
tokenizer = Tokenizer(num_words=1000)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
print(x_train[0])

[0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0.
 0. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0.
 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0.
 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
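
What `sequences_to_matrix(..., mode='binary')` does to each review can be sketched in a few lines of plain Python: every review becomes a fixed-length vector with a 1 at each word index that occurs in it, regardless of how often or where. The function name `sequences_to_binary_matrix` and the tiny sample reviews are hypothetical, and the sketch assumes indices outside the vocabulary are simply skipped.

```python
def sequences_to_binary_matrix(sequences, num_words):
    """Turn lists of word indices into fixed-length 0/1 vectors."""
    matrix = []
    for seq in sequences:
        row = [0.0] * num_words
        for index in seq:
            if index < num_words:  # assumption: skip indices outside the vocabulary
                row[index] = 1.0
        matrix.append(row)
    return matrix

# Two made-up reviews over a vocabulary of 25 word indices
reviews = [[1, 14, 22, 2, 2], [1, 5, 4]]
encoded = sequences_to_binary_matrix(reviews, num_words=25)

print(encoded[0])
print(encoded[1])
```

Word order and repetition are lost in this encoding, which is why every review ends up with the same length no matter how long the original text was.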

In [26]:
# One-hot encoding the output
num_classes = 2
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(y_train.shape)
print(y_test.shape)

(25000, 2)
(25000, 2)
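
The shape change from `(25000,)` to `(25000, 2)` comes from replacing each integer label with a vector that has a 1 in the position of its class. A minimal sketch of that idea is shown below; `to_one_hot` is a hypothetical stand-in for illustration, not the Keras function itself.

```python
def to_one_hot(labels, num_classes):
    """Replace each integer label with a vector that is 1 at the label's index."""
    return [
        [1.0 if c == label else 0.0 for c in range(num_classes)]
        for label in labels
    ]

labels = [1, 0, 0, 1]  # made-up labels: 1 = positive, 0 = negative
one_hot = to_one_hot(labels, num_classes=2)

print(one_hot)
```

With two classes, a positive label 1 becomes `[0.0, 1.0]` and a negative label 0 becomes `[1.0, 0.0]`, so the network needs two output nodes.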


In [30]:
# Build the model architecture
model = Sequential()

model.add(Dense(265, input_dim=x_train.shape[1]))
model.add(Activation('sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dropout(0.1))
model.add(Dense(2, activation='softmax'))  # softmax turns the two outputs into class probabilities
# Compile the model using a loss function and an optimizer.
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_22 (Dense)             (None, 265)               265265    
_________________________________________________________________
activation_21 (Activation)   (None, 265)               0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 265)               0         
_________________________________________________________________
dense_23 (Dense)             (None, 128)               34048     
_________________________________________________________________
activation_22 (Activation)   (None, 128)               0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_24 (Dense)             (None, 2)                 258       
Total para

In [31]:
model.fit(x_train, y_train, epochs=1000, verbose=0)

KeyboardInterrupt: ignored

In [0]:
# Evaluating the model

score = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: ", score[1])