<a href="https://colab.research.google.com/github/mkmritunjay/machineLearning/blob/master/grokkingDL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 3:

### Introduction to neural prediction: Forward Propagation

In [None]:
# simplest neural network

weight =0.1

def neural_network(input, weight):
  prediction = input * weight
  return prediction

number_of_toes = [8.5, 9.5, 10, 9]
input = number_of_toes[0]

pred = neural_network(input, weight)
print(pred)

0.8500000000000001


#### What does this neural network do?
It multiplies the input by a weight. It "scales" the input by a certain amount.

The interface for a neural network is simple. It accepts an input variable as information and a weight variable as knowledge and outputs a prediction.

It uses the knowledge in the weights to interpret the information in the input data.

### Making a prediction with multiple inputs

In [None]:
weights = [0.1, 0.2, 0]

def w_sum(a, b):
  assert(len(a) == len(b))
  output = 0

  for i in range(len(a)):
    output += a[i] * b[i]
  return output

def neural_network(inputs, weights):
  prediction = w_sum(inputs, weights)
  return prediction

toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

inputs = [toes[0], wlrec[0], nfans[0]]

pred = neural_network(inputs, weights)
print(pred)

0.9800000000000001


#### Multiple inputs: What does this neural network do?

It multiplies three inputs by three weights and take their sum. This is a weighted sum (dot product).

This new neural network can accept multiple inputs at a time per prediction. Here we take each input and multiply it with its own weight to find local prediction and in the end we sum all the local predictions to get final prediction.

The intuition behind how and why a dot product works is one of the most important parts of truly understanding how neural networks make predictions.

#### A dot product gives you a notion of similarity between two vectors. Consider below examples:

In [None]:
a = [0, 1, 0, 1]
b = [1, 0, 1, 0]
c = [0, 1, 1, 0]
d = [.5, 0, .5, 0]
e = [0, 1, -1, 0]

print('a * b: {}'.format(w_sum(a,b)))
print('b * c: {}'.format(w_sum(b,c)))
print('b * d: {}'.format(w_sum(b,d)))
print('c * c: {}'.format(w_sum(c,c)))
print('d * d: {}'.format(w_sum(d,d)))
print('c * e: {}'.format(w_sum(c,e)))
print('e * e: {}'.format(w_sum(e,e)))


a * b: 0
b * c: 1
b * d: 1.0
c * c: 2
d * d: 0.5
c * e: 0
e * e: 2


The highest weighted sum (sum(c,c)) is between vectors that are exactly identical. 

In contrast, because a and b have no overlapping weight, their dot product is zero. 

Most interesting weighted sum is between c and e, because e has a negative weight. This negative weight canceled out the positive similarity between them.

But a dot product between e and itself would yield the number 2 (-ve * -ve turns +ve).

### Multiple inputs: Complete runnable code (numpy)

In [None]:
import numpy as np

weights  = np.array([0.1, 0.2, 0])

toes = np.array([8.5, 9.5, 9.9, 9.0])
wlrec = np.array([0.65, 0.8, 0.8, 0.9])
nfans = np.array([1.2, 1.3, 0.5, 1.0])

inputs = np.array([toes[0], wlrec[0], nfans[0]])

def neural_network(inputs, weights):
  prediction = np.dot(inputs, weights)
  return prediction

pred = neural_network(inputs,weights)
print(pred)

0.9800000000000001


### Making a prediction with multiple outputs

In [None]:
weights = [0.3, 0.2, 0.9]

def ele_mul(input, weights):
  output = [0,0,0]
  assert(len(output) == len(weights))
  for i in range(len(weights)):
    output[i] = input * weights[i]
  return output

def neural_network(input, weights):
  pred = ele_mul(input, weights)
  return pred

wlrec = [0.65, 0.8, 0.8, 0.9]
input = wlrec[0]

pred = neural_network(input, weights)
print(pred)

[0.195, 0.13, 0.5850000000000001]


### Multiple inputs and outputs: How does it work?

It performs three independent weighted sums of the input to make three predictions. (3 weights going into each output node)

In [None]:
weights = np.array([[0.1, 0.1, -0.3],
          [0.1, 0.2, 0.0],
          [0.0, 1.3, 0.1]])

def w_sum(a, b):
  assert(len(a) == len(b))
  output = 0

  for i in range(len(a)):
    output += a[i] * b[i]
  return output

def vect_mat_mul(input, weights):
  assert(len(input) == len(weights))
  output = [0, 0, 0]
  for i in range(len(input)):
    output[i] = w_sum(input, weights[i])
  return output

def neural_network(input, weights):
  pred = vect_mat_mul(input, weights)
  return pred

toes = np.array([8.5, 9.5, 9.9, 9.0]) 
wlrec = np.array([0.65, 0.8, 0.8, 0.9]) 
nfans = np.array([1.2, 1.3, 0.5, 1.0])

input = [toes[0], wlrec[0], nfans[0]]

pred = neural_network(input, weights)
print(pred)

[0.555, 0.9800000000000001, 0.9650000000000001]


### Predicting on predictions

We can also take the output of one network and feed it as input to another network. This results in two consecutive vector-matrix multiplications.

In [None]:
ih_wgt = [[0.1,0.2,-0.1],
                 [-0.1,0.1,0.9],
                 [0.1,0.4,0.1]]

hp_wgt = [[0.3,1.1,-0.3],
                 [0.1,0.2,0.0],
                 [0.0,1.3,0.1]]

weights = [ih_wgt, hp_wgt]

toes = np.array([8.5, 9.5, 9.9, 9.0]) 
wlrec = np.array([0.65, 0.8, 0.8, 0.9]) 
nfans = np.array([1.2, 1.3, 0.5, 1.0])

input = [toes[0], wlrec[0], nfans[0]]

def w_sum(a, b):
  assert(len(a) == len(b))
  output = 0

  for i in range(len(a)):
    output += a[i] * b[i]
  return output

def vect_mat_mul(input, weights):
  assert(len(input) == len(weights))
  output = [0, 0, 0]
  for i in range(len(input)):
    output[i] = w_sum(input, weights[i])
  return output

def neural_network(input, weights):
  hid = vect_mat_mul(input, weights[0])
  pred = vect_mat_mul(hid, weights[1])
  return pred

prediction = neural_network(input, weights)
print(prediction)

[0.21350000000000002, 0.14500000000000002, 0.5065]


### numpy version

In [None]:
ih_wgt = np.array([[0.1,0.2,-0.1],
                 [-0.1,0.1,0.9],
                 [0.1,0.4,0.1]])

hp_wgt = np.array([[0.3,1.1,-0.3],
                 [0.1,0.2,0.0],
                 [0.0,1.3,0.1]])

weights = np.array([ih_wgt, hp_wgt])

toes = np.array([8.5, 9.5, 9.9, 9.0]) 
wlrec = np.array([0.65, 0.8, 0.8, 0.9]) 
nfans = np.array([1.2, 1.3, 0.5, 1.0])

input = np.array([toes[0], wlrec[0], nfans[0]])


def neural_network(input, weights):
  hid = np.dot(weights[0], input) # pay special attention here for position of vector and matrix during dot product
  pred = np.dot(weights[1], hid)
  return pred

prediction = neural_network(input, weights)
print(prediction)

[0.2135 0.145  0.5065]


## Chapter 4

### Introduction to neural learning: Gradient Descent

### Predict, Compare and learn

Predict: How to use a neural network to make a prediction.

Compare: Comparing gives a measurement of how much a prediction "missed" by.

Learn: Learning tells each weight how it can change to reduce the error.

### Simplest form of neural learning: Hot and Cold method.
Hot and cold means wiggling the weights to see which direction reduces the error the most, moving the weights in that direction, and repeating until the error gets to 0.

In [None]:
input = 0.5
weight = 0.5
goal_prediction = 0.8

step_amount = 0.001

for iteration in range(1101):
  prediction = input * weight
  error = (prediction - goal_prediction) ** 2
  print("error: {}, prediction: {}".format(error, prediction))
  up_prediction = input * (weight + step_amount)
  up_error = (up_prediction - goal_prediction) ** 2
  down_prediction = input * (weight - step_amount)
  down_error = (down_prediction - goal_prediction) ** 2
  if(up_error > down_error):
    weight = weight - step_amount
  if(up_error < down_error):
    weight = weight + step_amount


error: 0.30250000000000005, prediction: 0.25
error: 0.3019502500000001, prediction: 0.2505
error: 0.30140100000000003, prediction: 0.251
error: 0.30085225, prediction: 0.2515
error: 0.30030400000000007, prediction: 0.252
error: 0.2997562500000001, prediction: 0.2525
error: 0.29920900000000006, prediction: 0.253
error: 0.29866224999999996, prediction: 0.2535
error: 0.29811600000000005, prediction: 0.254
error: 0.2975702500000001, prediction: 0.2545
error: 0.29702500000000004, prediction: 0.255
error: 0.29648025, prediction: 0.2555
error: 0.29593600000000003, prediction: 0.256
error: 0.2953922500000001, prediction: 0.2565
error: 0.294849, prediction: 0.257
error: 0.29430625, prediction: 0.2575
error: 0.293764, prediction: 0.258
error: 0.2932222500000001, prediction: 0.2585
error: 0.292681, prediction: 0.259
error: 0.29214025, prediction: 0.2595
error: 0.2916, prediction: 0.26
error: 0.2910602500000001, prediction: 0.2605
error: 0.29052100000000003, prediction: 0.261
error: 0.28998225, pr

#### Problems with this method:
1. It's inefficient.
2. Sometimes it's impossible to predict the exact goal prediction.

#### Calculating both direction and amount from error:

In [None]:
weight = 0.5
goal_pred = 0.8
input = 0.5

for iteration in range(20):
  pred = input * weight
  error = (pred - goal_pred) ** 2
  direction_and_amount = (pred - goal_pred) * input # gradient descent
  weight = weight - direction_and_amount
  print("Error: {} Prediction: {}".format(error, pred))

Error: 0.30250000000000005 Prediction: 0.25
Error: 0.17015625000000004 Prediction: 0.3875
Error: 0.095712890625 Prediction: 0.49062500000000003
Error: 0.05383850097656251 Prediction: 0.56796875
Error: 0.03028415679931642 Prediction: 0.6259765625
Error: 0.0170348381996155 Prediction: 0.669482421875
Error: 0.00958209648728372 Prediction: 0.70211181640625
Error: 0.005389929274097089 Prediction: 0.7265838623046875
Error: 0.0030318352166796153 Prediction: 0.7449378967285156
Error: 0.0017054073093822882 Prediction: 0.7587034225463867
Error: 0.0009592916115275371 Prediction: 0.76902756690979
Error: 0.0005396015314842384 Prediction: 0.7767706751823426
Error: 0.000303525861459885 Prediction: 0.7825780063867569
Error: 0.00017073329707118678 Prediction: 0.7869335047900676
Error: 9.603747960254256e-05 Prediction: 0.7902001285925507
Error: 5.402108227642978e-05 Prediction: 0.7926500964444131
Error: 3.038685878049206e-05 Prediction: 0.7944875723333098
Error: 1.7092608064027242e-05 Prediction: 0.7958

#### What is direction and amount:

It represents how we want to change weight. The first part is "pure error" and second part is "scaling, negative reversal and stopping"

#### What is pure error?

(pred - goal_pred) is pure error in the code. It indicates raw amount and direction we missed. If this is a positive number, you predicted too high, and vice versa. If it's a big number we missed by a big amount.

#### What are scaling, negative reversal, and stopping?
**Stopping:** Stopping is the first effect on pure error caused by multiplying it by input. If input is 0, then it will force "direction_and_amount" to also be 0.

**Negative Reversal:** Normally when input is positive, moving weight upwards makes the prediction move upward. But if input is negative, moving weight up makes the prediction go down. It's reversed. To address this we multiply the pure error with input so that direction of weight is correct even if input is negative.

**Scaling:** If input is big weight update should also be big. (side effects of it will be fixed by including "alpha" later).




### One iteration of Gradient Descent

In [None]:
weight = 0.0
goal_pred = 0.8
input = 0.5
alpha = 0.1

pred = input * weight # these lines have secrets
error = (pred - goal_pred) ** 2 # these lines have secrets
delta = (pred - goal_pred)
weight_delta = delta * input
weight -= weight_delta * alpha

#we can write those lines like:
error = ((input * weight) - goal_pred) ** 2
# input and goal_pred are constants here, and error and weights are only variables
# Which means for any input and goal_pred, an exact relationship is defined between error and weight.

### Derivatives

**Take one:** 

Assume we have an equation like below

superman_strength = spiderman_strength * 2

We can say "superman_strength" is a function of "spiderman_strength" and the definition of this formula could be "When I increase spiderman_strength how much superman_strength increases? It's called derivative."

So "2" here in the above formula is derivative. (superman_strength / spiderman_strength)

**Take Two:** 

The other way to define a derivative is "Derivative is the slope at a point on a line or curve."

If you plot "error = ((input * weight) - goal_pred) ** 2" this function, the plot will look like a big U-shaped curve and there will be a point in the middle where error will be 0. Also right of that point, the slope of the line will be positive and to the left of that point, the slope of the line will be negative. And farther away from the "goal weight" you move, the steeper the slope gets. 

These properties are very useful. The slope's sign gives you direction and the slope's steepness gives you amount and we can use both of these to helo find the goal weight.

### How to use derivative to learn
weight_delta is our derivative.

**What is the difference between error and derivative of error and weight?**
error is a measure of how much we missed. The derivative defines a relationship between error and weight. It tells us how much we missed for a specific weight or how much changing the weight contributed to error.

**The slope of a line or curve always point in the opposite direction of the lowest point of the line or curve. So if we have negative slope, we increase wight to find minimum of error.**

So how do we use derivative to find the error minimum? We move the opposite direction of the slope - the opposite direction of the derivative. We can take each weight value, calculate it's derivative w.r.t. to error, and then change weight in the opposite direction of that slope and that will move us to the minimum.

**A derivative gives us the relationship between any two variables in a function. We use the derivative to find the relationship between any weight and error. And then we move the weight in the opposite direction of the derivative to find the lowest error.** 

This method of learning is called **gradient descent**. We move the weight value opposite to the gradient value, which reduces error to 0. We increase the weight when you have a negative gradient and vice versa.

### Gradient Descent Code

In [None]:
weight = 0.5
goal_pred = 0.8
input = 0.5

for iteration in range(20):
  pred = input * weight
  error = (pred - goal_pred) ** 2
  delta = pred - goal_pred
  weight_delta = delta * input # 'input' is derivative. rate of change in weight (weight_delta) w.r.t. error change(delta).
  weight -= weight_delta
  print("Error: {} Prediction: {}".format(error, pred))

Error: 0.30250000000000005 Prediction: 0.25
Error: 0.17015625000000004 Prediction: 0.3875
Error: 0.095712890625 Prediction: 0.49062500000000003
Error: 0.05383850097656251 Prediction: 0.56796875
Error: 0.03028415679931642 Prediction: 0.6259765625
Error: 0.0170348381996155 Prediction: 0.669482421875
Error: 0.00958209648728372 Prediction: 0.70211181640625
Error: 0.005389929274097089 Prediction: 0.7265838623046875
Error: 0.0030318352166796153 Prediction: 0.7449378967285156
Error: 0.0017054073093822882 Prediction: 0.7587034225463867
Error: 0.0009592916115275371 Prediction: 0.76902756690979
Error: 0.0005396015314842384 Prediction: 0.7767706751823426
Error: 0.000303525861459885 Prediction: 0.7825780063867569
Error: 0.00017073329707118678 Prediction: 0.7869335047900676
Error: 9.603747960254256e-05 Prediction: 0.7902001285925507
Error: 5.402108227642978e-05 Prediction: 0.7926500964444131
Error: 3.038685878049206e-05 Prediction: 0.7944875723333098
Error: 1.7092608064027242e-05 Prediction: 0.7958

### Divergence
Sometimes neural networks explode in value.

In [None]:
# change the input to 2 from 0.5
weight = 0.5
goal_pred = 0.8
input = 2

for iteration in range(20):
  pred = input * weight
  error = (pred - goal_pred) ** 2
  delta = pred - goal_pred
  weight_delta = delta * input
  weight -= weight_delta
  print("Error: {} Prediction: {}".format(error, pred))

Error: 0.03999999999999998 Prediction: 1.0
Error: 0.3599999999999998 Prediction: 0.20000000000000018
Error: 3.2399999999999984 Prediction: 2.5999999999999996
Error: 29.159999999999986 Prediction: -4.599999999999999
Error: 262.4399999999999 Prediction: 16.999999999999996
Error: 2361.959999999998 Prediction: -47.79999999999998
Error: 21257.639999999978 Prediction: 146.59999999999994
Error: 191318.75999999983 Prediction: -436.5999999999998
Error: 1721868.839999999 Prediction: 1312.9999999999995
Error: 15496819.559999991 Prediction: -3935.799999999999
Error: 139471376.03999993 Prediction: 11810.599999999997
Error: 1255242384.3599997 Prediction: -35428.59999999999
Error: 11297181459.239996 Prediction: 106288.99999999999
Error: 101674633133.15994 Prediction: -318863.79999999993
Error: 915071698198.4395 Prediction: 956594.5999999997
Error: 8235645283785.954 Prediction: -2869780.599999999
Error: 74120807554073.56 Prediction: 8609344.999999996
Error: 667087267986662.1 Prediction: -25828031.7999

The explosion in the error was caused by the fact that we made the input larger.

What happens when we have large weight update and a small error? The network overcorrects. If the new error is even bigger, the network overcorrects even more. This causes the phenomenon called divergence.

How do we predict? By multilying input with weights. What if input is big? small change in weight will cause changes in prediction. The error is very sensitive to weight or derivative (input) is really big.

### Introducing alpha
Simplest way to prevent overcorrecting weight updates.

In [None]:
weight = 0.5
goal_pred = 0.8
input = 2
alpha = 0.1

for iteration in range(20):
  pred = input * weight
  error = (pred - goal_pred) ** 2
  delta = pred - goal_pred
  weight_delta = delta * input
  weight -= weight_delta * alpha
  print("Error: {} Prediction: {}".format(error, pred))

Error: 0.03999999999999998 Prediction: 1.0
Error: 0.0144 Prediction: 0.92
Error: 0.005183999999999993 Prediction: 0.872
Error: 0.0018662400000000014 Prediction: 0.8432000000000001
Error: 0.0006718464000000028 Prediction: 0.8259200000000001
Error: 0.00024186470400000033 Prediction: 0.815552
Error: 8.70712934399997e-05 Prediction: 0.8093312
Error: 3.134566563839939e-05 Prediction: 0.80559872
Error: 1.1284439629823931e-05 Prediction: 0.803359232
Error: 4.062398266736526e-06 Prediction: 0.8020155392
Error: 1.4624633760252567e-06 Prediction: 0.8012093235200001
Error: 5.264868153690924e-07 Prediction: 0.8007255941120001
Error: 1.8953525353291194e-07 Prediction: 0.8004353564672001
Error: 6.82326912718715e-08 Prediction: 0.8002612138803201
Error: 2.456376885786678e-08 Prediction: 0.8001567283281921
Error: 8.842956788836216e-09 Prediction: 0.8000940369969153
Error: 3.1834644439835434e-09 Prediction: 0.8000564221981492
Error: 1.1460471998340758e-09 Prediction: 0.8000338533188895
Error: 4.1257699

### Chapter 5
### Learning multiple weights at a time: Generalizing gradient descent

#### Gradient descent learning with multiple inputs

In [None]:
def w_sum(a, b):
  assert(len(a) == len(b))
  output = 0
  for i in range(len(a)):
    output += (a[i] * b[i])
  return output

def neural_network(input, weights):
  pred = w_sum(input, weights)
  return pred

def ele_mul(number, vector):
  output = [0, 0, 0]
  assert(len(output) == len(vector))
  for i in range(len(vector)):
    output[i] = number * vector[i]
  return output

toes = np.array([8.5, 9.5, 9.9, 9.0]) 
wlrec = np.array([0.65, 0.8, 0.8, 0.9]) 
nfans = np.array([1.2, 1.3, 0.5, 1.0])

win_or_lose_binary = [1, 1, 0, 1]
true = win_or_lose_binary[0]
input = [toes[0], wlrec[0], nfans[0]]

alpha = 0.01
weights = [0.1, 0.2, -.1]

for i in range(3):
  pred = neural_network(input, weights)
  error = (pred - true) ** 2
  delta = pred - true
  weight_deltas = ele_mul(delta, input)
  print("Iteration: {}".format(i+1))
  print("Pred: {}".format(pred))
  print("Error: {}".format(error))
  print("Delta: {}".format(delta))
  print("Weights: {}".format(weights))
  print("weight Deltas:")
  print(weight_deltas)

  for i in range(len(weights)):
    weights[i] -= (alpha * weight_deltas[i])


Iteration: 1
Pred: 0.8600000000000001
Error: 0.01959999999999997
Delta: -0.1399999999999999
Weights: [0.1, 0.2, -0.1]
weight Deltas:
[-1.189999999999999, -0.09099999999999994, -0.16799999999999987]
Iteration: 2
Pred: 0.9637574999999999
Error: 0.0013135188062500048
Delta: -0.036242500000000066
Weights: [0.1119, 0.20091, -0.09832]
weight Deltas:
[-0.30806125000000056, -0.023557625000000044, -0.04349100000000008]
Iteration: 3
Pred: 0.9906177228125002
Error: 8.802712522307997e-05
Delta: -0.009382277187499843
Weights: [0.11498061250000001, 0.20114557625, -0.09788509000000001]
weight Deltas:
[-0.07974935609374867, -0.006098480171874899, -0.011258732624999811]


#### Gradient descent learning with multiple outputs

In [None]:
weights = [0.3, 0.2, 0.9]

def ele_mul(number, vector):
  output = [0, 0, 0]
  assert(len(output) == len(vector))
  for i in range(len(vector)):
    output[i] = number * vector[i]
  return output

def neural_network(input, weights):
  pred = ele_mul(input, weights)
  return pred


wlrec = [0.65, 1.0, 1.0, 0.9]

hurt = [0.1, 0.0, 0.0, 0.1]
win = [1,1,0,1]
sad = [0.1, 0.0, 0.1, 0.2]

input = wlrec[0]
true = [hurt[0], win[0], sad[0]]

pred = neural_network(input, weights)

error = [0, 0, 0]
delta = [0, 0, 0]

for i in range(len(true)):
  error[i] = (pred[i] - true[i]) ** 2
  delta[i] = pred[i] - true[i]

weight_deltas = ele_mul(input, delta)
alpha = 0.1

for i in range(len(weights)):
  weights[i] -= (weight_deltas[i] * alpha)

print("weights:" + str(weights))
print("Weight deltas:" + str(weight_deltas))

weights:[0.293825, 0.25655, 0.868475]
Weight deltas:[0.061750000000000006, -0.5655, 0.3152500000000001]


#### Gradient descent with multiple inputs and outputs

In [None]:
import numpy as np
weights = np.array([[0.1, 0.1, -0.3],
           [0.1, 0.2, 0.0],
           [0.0, 1.3, 0.1]])

def w_sum(a, b):
  assert(len(a) == len(b))
  output = 0

  for i in range(len(a)):
    output += a[i] * b[i]
  return output

def vect_mat_mul(vect, matrix):
  assert(len(vect) == len(matrix))
  output = [0, 0, 0]
  for i in range(len(vect)):
    output[i] = w_sum(vect, matrix[i])
  return output

def neural_network(input, weights):
  pred = vect_mat_mul(input, weights)
  return pred

def outer_prod(vec_a, vec_b):
  a = np.array(vec_a)
  b = np.array(vec_b)
  c = np.dot(a,b)
  return c

toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65,0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

hurt = [0.1, 0.0, 0.0, 0.1]
win = [1,1,0,1]
sad = [0.1, 0.0, 0.1, 0.2]

alpha = 0.01

input = [toes[0], wlrec[0], nfans[0]]
true = [hurt[0], win[0], sad[0]]
error = [0, 0, 0]
delta = [0, 0, 0]

for i in range(5):
  pred = neural_network(input, weights)

  for i in range(len(true)):
    error[i] = (pred[i] - true[i]) ** 2
    delta = pred[i] - true[i]

  weight_deltas = outer_prod(input, delta)
  weights = weights * alpha
  print(error)

[0.20702500000000007, 0.0003999999999999963, 0.7482250000000001]
[0.008920802500000002, 0.9804960399999999, 0.0081631225]
[0.009988903080250001, 0.999804009604, 0.009980709312250001]
[0.009999889000308026, 0.9999980400009604, 0.009999807000931225]
[0.009999998890000032, 0.9999999804, 0.009999998070000096]


### Chapter 6: 
### Building your first deep neural network: Introduction to backpropagation

In [None]:
# streetlight problem
import numpy as np
weights = np.array([0.5, 0.48, -0.7])
alpha = 0.1

streetlights = np.array([[ 1, 0, 1 ],
                          [ 0, 1, 1 ],
                          [ 0, 0, 1 ],
                          [ 1, 1, 1 ],
                          [ 0, 1, 1 ],
                          [ 1, 0, 1 ]])

walk_vs_stop = np.array([0,1,0,1,1,0])
input = streetlights[0]
goal_prediction = walk_vs_stop[0]

for i in range(20):
  prediction = np.dot(input, weights)
  error = (goal_prediction - prediction) ** 2
  delta = prediction - goal_prediction
  weights = weights - (alpha * (input * delta))
  print("Error: {}, Prediction: {}".format(error,prediction))

Error: 0.03999999999999998, Prediction: -0.19999999999999996
Error: 0.025599999999999973, Prediction: -0.15999999999999992
Error: 0.01638399999999997, Prediction: -0.1279999999999999
Error: 0.010485759999999964, Prediction: -0.10239999999999982
Error: 0.006710886399999962, Prediction: -0.08191999999999977
Error: 0.004294967295999976, Prediction: -0.06553599999999982
Error: 0.002748779069439994, Prediction: -0.05242879999999994
Error: 0.0017592186044416036, Prediction: -0.04194304000000004
Error: 0.0011258999068426293, Prediction: -0.03355443200000008
Error: 0.0007205759403792803, Prediction: -0.02684354560000002
Error: 0.0004611686018427356, Prediction: -0.021474836479999926
Error: 0.0002951479051793508, Prediction: -0.01717986918399994
Error: 0.00018889465931478573, Prediction: -0.013743895347199997
Error: 0.00012089258196146188, Prediction: -0.010995116277759953
Error: 7.737125245533561e-05, Prediction: -0.008796093022207963
Error: 4.951760157141604e-05, Prediction: -0.00703687441776

In [None]:
#learning whole dataset
import numpy as np
weights = np.array([0.5, 0.48, -0.7])
alpha = 0.1

streetlights = np.array([[ 1, 0, 1 ],
                          [ 0, 1, 1 ],
                          [ 0, 0, 1 ],
                          [ 1, 1, 1 ],
                          [ 0, 1, 1 ],
                          [ 1, 0, 1 ]])

walk_vs_stop = np.array([0,1,0,1,1,0])
input = streetlights[0]
goal_prediction = walk_vs_stop[0]

for i in range(40):
  error_for_all_lights = 0
  for row_index in range(len(walk_vs_stop)):
    input = streetlights[row_index]
    goal_prediction = walk_vs_stop[row_index]
    prediction = np.dot(input, weights)
    error = (goal_prediction - prediction) ** 2
    error_for_all_lights += error
    delta = prediction - goal_prediction
    weights = weights - (alpha * (input * delta))
    print("Prediction: {}".format(prediction))
  print("Error: {} \n".format(error_for_all_lights))

Prediction: -0.19999999999999996
Prediction: -0.19999999999999996
Prediction: -0.5599999999999999
Prediction: 0.6160000000000001
Prediction: 0.17279999999999995
Prediction: 0.17552
Error: 2.6561231104 

Prediction: 0.14041599999999999
Prediction: 0.3066464
Prediction: -0.34513824
Prediction: 1.006637344
Prediction: 0.4785034751999999
Prediction: 0.26700416768
Error: 0.9628701776715985 

Prediction: 0.213603334144
Prediction: 0.5347420299776
Prediction: -0.26067345110016
Prediction: 1.1319428845096962
Prediction: 0.6274723921901568
Prediction: 0.25433999330650114
Error: 0.5509165866836797 

Prediction: 0.20347199464520088
Prediction: 0.6561967149569552
Prediction: -0.221948503950995
Prediction: 1.166258650532124
Prediction: 0.7139004922542389
Prediction: 0.21471099528371604
Error: 0.36445836852222424 

Prediction: 0.17176879622697283
Prediction: 0.7324724146523222
Prediction: -0.19966478845083285
Prediction: 1.1697769945341199
Prediction: 0.7719890116601171
Prediction: 0.172979974288593

#### Full, batch, and stochastic gradient descent

Stochastic gradient descent: updates weights one example
at a time.

(Full) gradient descent: updates weights one dataset at a time.

Batch gradient descent: updates weights after n examples.

#### Learning indirect correlation

If your data doesn’t have correlation, create intermediate data
that does. (means create hidden layers)

#### Stacking neural networks:

The output of the first lower network (layer_0 to layer_1) is the input to the second upper neural network (layer_1 to layer_2). 

#### sometimes correlation (non linear)
Turn off the node when the value would be below 0.

By turning off any middle node whenever it would be negative, you allow the network to sometimes subscribe to correlation from various inputs. This is impossible for two-layer neural networks, thus adding power to three-layer nets.

In [None]:
# Backpropagation in code
import numpy as np

np.random.seed(1)

def relu(x):
  return (x > 0) * x

def relu2deriv(output):
  return output>0

alpha = 0.2
hidden_size = 4

streetlights = np.array([[1,0,1],
                        [0,1,1],
                        [0,0,1],
                        [1,1,1]]) # (4 * 3) matrix
walk_vs_stop = np.array([[1,1,0,0]]).T # (1 * 4) matrix

weights_0_1 = 2*np.random.random((3,hidden_size)) - 1 # (3 * 4) matrix
weights_1_2 = 2*np.random.random((hidden_size,1)) - 1 # (4 * 1) matrix

for i in range(60):
  layer_2_error = 0
  for j in range(len(streetlights)):
    layer_0 = streetlights[j:j+1] # (1 * 3) matrix
    layer_1 = relu(np.dot(layer_0, weights_0_1)) # (1 * 4) matrix
    layer_2 = np.dot(layer_1, weights_1_2) # (1 * 1) matrix

    layer_2_error += np.sum((layer_2 - walk_vs_stop[j:j+1]) ** 2) # (4 * 1) matrix

    layer_2_delta = (walk_vs_stop[j:j+1] - layer_2) # (4 * 1) matrix
    layer_1_delta = layer_2_delta.dot(weights_1_2.T)*relu2deriv(layer_1)

    weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
    weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if(i % 10 == 9):
      print("Error: {}".format(layer_2_error))

Error: 0.02317222265722177
Error: 0.37697998088916795
Error: 0.5585446349508787
Error: 0.6342311598444467
Error: 6.583974603752558e-05
Error: 0.18340014797885185
Error: 0.31497070417381795
Error: 0.35838407676317513
Error: 1.2854807174765666e-14
Error: 0.032716692228431515
Error: 0.07944050755997109
Error: 0.0830183113303298
Error: 2.438025114683471e-24
Error: 0.0016192127634249867
Error: 0.006448232304230192
Error: 0.006467054957103705
Error: 0.0
Error: 6.299054636709449e-05
Error: 0.0003292669000750734
Error: 0.0003292669000750734
Error: 0.0
Error: 2.8586146897158e-06
Error: 1.5055622665134859e-05
Error: 1.5055622665134859e-05


### Chapter 8:
#### Learning signal and ignoring noise: introduction to regularization and batching

In [None]:
import sys, numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

images, labels = (x_train[0:1000].reshape(1000, 28*28)/255, y_train[0:1000])
one_hot_labels = np.zeros((len(labels),10))

for i,l in enumerate(labels):
  one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test),28*28)/255
test_labels = np.zeros((len(y_test),10))

for i,l in enumerate(y_test):
  test_labels[i][l] = 1

np.random.seed(1)
relu = lambda x:(x>=0) * x
relu2deriv = lambda x: x>=0
alpha, iterations, hidden_size, pixels_per_image, num_labels = (0.005, 350, 40, 784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image, hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations):
  error, correct_cnt = (0.0, 0)

  for i in range(len(images)):
    layer_0 = images[i:i+1]
    layer_1 = relu(np.dot(layer_0,weights_0_1))
    layer_2 = np.dot(layer_1, weights_1_2)
    error += np.sum((labels[i:i+1] - layer_2) ** 2)
    correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))
    
    layer_2_delta = (labels[i:i+1] - layer_2)
    layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

    weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
    weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

  sys.stdout.write("\r"+ " I:"+str(j)+ " Error:" + str(error/float(len(images)))[0:5] + " Correct:" + str(correct_cnt/float(len(images))))


  if(j%10 == 0 or j == iterations-1):
    error_test, correct_cnt_t = (0.0, 0)

    for i in range(len(test_images)):
      layer_0 = test_images[i:i+1]
      layer_1 = relu(np.dot(layer_0, weights_0_1))
      layer_2 = np.dot(layer_1, weights_1_2)

      error_test += np.sum((test_labels[i:i+1] - layer_2) ** 2)
      correct_cnt_t += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
    sys.stdout.write(" Test-Err:" + str(error_test/float(len(test_images)))[0:5] + " Test-Acc:" + str(correct_cnt_t/float(len(test_images))))

 I:349 Error:0.108 Correct:1.0 Test-Err:0.653 Test-Acc:0.7073

In [None]:
import numpy as np
np.random.seed(1)
streetlights = np.array([[1,0,1],
                         [0,1,1],
                         [0,0,1],
                         [1,1,1]])
walk_vs_stop = np.array([1,1,0,0]).T

relu = lambda x: (x>0) * x
relu2deriv = lambda x: x>0

alpha = 0.2
hidden_size = 4

weights_0_1 = 2*np.random.random((3, hidden_size)) - 1
weights_1_2 = 2*np.random.random((hidden_size, 1)) - 1

for iteration in range(60):
  layer_2_error=0
  for i in range(len(streetlights)):
    layer_0 = streetlights[i:i+1]
    layer_1 = relu(np.dot(layer_0, weights_0_1))
    layer_2 = np.dot(layer_1, weights_1_2)

    layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)
    layer_2_delta = layer_2 - walk_vs_stop[i:i+1]

    layer_1_delta = np.dot(layer_2_delta, weights_1_2.T) * relu2deriv(layer_1)

    weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
    weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)

  if(iteration % 10 == 9):
    print("Error:" + str(layer_2_error))

Error:0.6342311598444467
Error:0.35838407676317513
Error:0.0830183113303298
Error:0.006467054957103705
Error:0.0003292669000750734
Error:1.5055622665134859e-05


#### Chapter 8 : code (without dropout)

In [None]:
# chapter 8 (without dropout)
import sys, numpy as np
from keras.datasets import mnist

In [None]:
# load_data() return Tuple of Numpy arrays: (x_train, y_train), (x_test, y_test)
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [None]:
x_train.shape

(60000, 28, 28)

In [None]:
images, labels = (x_train[0:1000].reshape(1000,28*28)/255, y_train[0:1000])

one_hot_labels = np.zeros((len(labels),10))

one_hot_labels.shape

(1000, 10)

In [None]:
# this part is confusing, labels has numbers from 0-9 for 1000 rows, this part is creating a 1000*10 matrix where for each row it's putting a 1 in place of a 0 in 1000*10 matrix
for i,l in enumerate(labels):
  one_hot_labels[i][l] = 1

print(labels.shape)

(1000,)


In [None]:
print(labels)

[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 9 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1 5 7 1 7 1 1 6 3 0 2 9
 3 1 1 0 4 9 2 0 0 2 0 2 7 1 8 6 4 1 6 3 4 5 9 1 3 3 8 5 4 7 7 4 2 8 5 8 6
 7 3 4 6 1 9 9 6 0 3 7 2 8 2 9 4 4 6 4 9 7 0 9 2 9 5 1 5 9 1 2 3 2 3 5 9 1
 7 6 2 8 2 2 5 0 7 4 9 7 8 3 2 1 1 8 3 6 1 0 3 1 0 0 1 7 2 7 3 0 4 6 5 2 6
 4 7 1 8 9 9 3 0 7 1 0 2 0 3 5 4 6 5 8 6 3 7 5 8 0 9 1 0 3 1 2 2 3 3 6 4 7
 5 0 6 2 7 9 8 5 9 2 1 1 4 4 5 6 4 1 2 5 3 9 3 9 0 5 9 6 5 7 4 1 3 4 0 4 8
 0 4 3 6 8 7 6 0 9 7 5 7 2 1 1 6 8 9 4 1 5 2 2 9 0 3 9 6 7 2 0 3 5 4 3 6 5
 8 9 5 4 7 4 2 7 3 4 8 9 1 9 2 8 7 9 1 8 7 4 1 3 1 1 0 2 3 9 4 9 2 1 6 8 4
 7 7 4 4 9 2 5 7 2 4 4 2 1 9 7 2 8 7 6 9 2 2 3 8 1 6 5 1 1 0 2 6 4 5 8 3 1
 5 1 9 2 7 4 4 4 8 1 5 8 9 5 6 7 9 9 3 7 0 9 0 6 6 2 3 9 0 7 5 4 8 0 9 4 1
 2 8 7 1 2 6 1 0 3 0 1 1 8 2 0 3 9 4 0 5 0 6 1 7 7 8 1 9 2 0 5 1 2 2 7 3 5
 4 9 7 1 8 3 9 6 0 3 1 1 

In [None]:
labels = one_hot_labels
print(labels)

[[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [None]:
test_images, test_labels = (x_test.reshape(len(x_test), 28*28), np.zeros((len(y_test),10)))

In [None]:
for i,l in enumerate(y_test):
  test_labels[i][l] = 1

np.random.seed(1)

relu = lambda x : (x>=0) * x
relu2deriv = lambda x: x>=0

alpha, iterations, hidden_size, pixels_per_image, num_labels = (0.005, 350, 40, 784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image, hidden_size)) - 0.1 # 784*40 size matrix
weights_1_2 = 0.2*np.random.random((hidden_size, num_labels)) - 0.1 # 40*10 size matrix

In [None]:
for j in range(iterations):
  error, correct_cnt = (0.0, 0)

  for i in range(len(images)):
    layer_0 = images[i:i+1]
    layer_1 = relu(layer_0.dot(weights_0_1))
    layer_2 = layer_1.dot(weights_1_2)

    error += np.sum((labels[i:i+1] - layer_2) ** 2)
    correct_cnt += int(np.argmax(labels[i:i+1]) == np.argmax(layer_2))

    layer_2_delta = (labels[i:i+1] - layer_2) # 1*10
    layer_1_delta = (layer_2_delta.dot(weights_1_2.T)) * relu2deriv(layer_1) # (1*10).dot(40*10.T)*(1*40) = 1*40

    weights_1_2 += alpha * layer_1.T.dot(layer_2_delta) # scalar*(1*40.T).dot(1*10) = 40*10
    weights_0_1 += alpha * layer_0.T.dot(layer_1_delta) # scalar*(1*784.T).dot(1*40) = 784*40

  sys.stdout.write("\r"+ " I:"+str(j)+ " Error:" + str(error/float(len(images)))[0:5] + " Correct:" + str(correct_cnt/float(len(images))))
  
  if(j%10 == 0 or j == iterations-1):
    error_test, correct_cnt_t = 0.0, 0

    for i in range(len(test_images)):
      layer_0 = test_images[i:i+1]
      layer_1 = layer_0.dot(weights_0_1)
      layer_2 = layer_1.dot(weights_1_2)

      error_test += np.sum((layer_2 - test_labels[i:i+1])**2)
      correct_cnt_t += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
    sys.stdout.write(" Test-Err:" + str(error_test/float(len(test_images)))[0:5] + " Test-Acc:" + str(correct_cnt_t/float(len(test_images))))


 I:349 Error:0.108 Correct:1.0 Test-Err:15948 Test-Acc:0.3834

#### Dropout in code

In [None]:
import sys, numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

images, labels = (x_train[0:1000].reshape(1000, 28*28)/255, y_train[0:1000])
one_hot_labels = np.zeros((len(labels),10))

for i,l in enumerate(labels):
  one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images, test_labels = (x_test.reshape(len(x_test), 28*28)/255, np.zeros((len(y_test), 10)))

for i,l in enumerate(y_test):
  test_labels[i][l] = 1

np.random.seed(1)

relu = lambda x: (x>=0) * x
relu2deriv = lambda x: x>=0

alpha, iterations, hidden_size = (0.005, 300, 100)
pixels_per_image, num_labels = (784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image, hidden_size)) - 0.1 # 784*100
weights_1_2 = 0.2*np.random.random((hidden_size, num_labels)) - 0.1 # (100*10)

for j in range(iterations):
  error, correct_cnt = (0.0, 0)
  for i in range(len(images)):
    layer_0 = images[i:i+1] # (1*784)
    layer_1 = relu(layer_0.dot(weights_0_1)) # (1*100)
    dropout_mask = np.random.randint(2, size=layer_1.shape)
    layer_1 *= dropout_mask * 2
    layer_2 = layer_1.dot(weights_1_2)

    error += np.sum((layer_2 - labels[i:i+1]) ** 2)
    correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))

    layer_2_delta = (labels[i:i+1] - layer_2) # (1*10) - (1*10)
    layer_1_delta = layer_2_delta.dot(weights_1_2.T)*relu2deriv(layer_1) # (1*10).dot(100*10.T)*relu2deriv(1*100) = (1*100)
    layer_1_delta *= dropout_mask

    weights_1_2 += alpha*(layer_1.T.dot(layer_2_delta)) # scalar * (100*1).dot(1*10) = (100*10)
    weights_0_1 += alpha*(layer_0.T.dot(layer_1_delta)) # scalar*(784*1).dot(1*100) = (784*100)

  if(j%10 == 0):
    test_error = 0.0
    test_correct_cnt = 0

    for i in range(len(test_images)):
      layer_0 = test_images[i:i+1]
      layer_1 = relu(layer_0.dot(weights_0_1)) # why activation function was used for test data?
      layer_2 = layer_1.dot(weights_1_2)

      test_error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
      test_correct_cnt += int(np.argmax(test_labels[i:i+1]) == np.argmax(layer_2))

    sys.stdout.write("\n" +
                     "I:" + str(j) +
                     " Test-Err:"+str(test_error/float(len(test_images)))[0:5] +
                     " Test-Acc:"+str(test_correct_cnt/float(len(test_images))) +
                     " Train-Err:"+str(error/float(len(images)))[0:5] +
                     " Train-Acc:"+str(correct_cnt/float(len(images))))


I:0 Test-Err:0.641 Test-Acc:0.6333 Train-Err:0.891 Train-Acc:0.413
I:10 Test-Err:0.458 Test-Acc:0.787 Train-Err:0.472 Train-Acc:0.764
I:20 Test-Err:0.415 Test-Acc:0.8133 Train-Err:0.430 Train-Acc:0.809
I:30 Test-Err:0.421 Test-Acc:0.8114 Train-Err:0.415 Train-Acc:0.811
I:40 Test-Err:0.419 Test-Acc:0.8112 Train-Err:0.413 Train-Acc:0.827
I:50 Test-Err:0.409 Test-Acc:0.8133 Train-Err:0.392 Train-Acc:0.836
I:60 Test-Err:0.412 Test-Acc:0.8236 Train-Err:0.402 Train-Acc:0.836
I:70 Test-Err:0.412 Test-Acc:0.8033 Train-Err:0.383 Train-Acc:0.857
I:80 Test-Err:0.410 Test-Acc:0.8054 Train-Err:0.386 Train-Acc:0.854
I:90 Test-Err:0.411 Test-Acc:0.8144 Train-Err:0.376 Train-Acc:0.868
I:100 Test-Err:0.411 Test-Acc:0.7903 Train-Err:0.369 Train-Acc:0.864
I:110 Test-Err:0.411 Test-Acc:0.8003 Train-Err:0.371 Train-Acc:0.868
I:120 Test-Err:0.402 Test-Acc:0.8046 Train-Err:0.353 Train-Acc:0.857
I:130 Test-Err:0.408 Test-Acc:0.8091 Train-Err:0.352 Train-Acc:0.867
I:140 Test-Err:0.405 Test-Acc:0.8083 Train-Er

#### Dropout in code (with batch gradient descent)

In [3]:
import sys, numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
images, labels = (x_train[0:1000].reshape(1000, 28*28)/255,y_train[0:1000])

one_hot_labels = np.zeros((len(labels),10))
for i,l in enumerate(labels):
  one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images, test_labels = (x_test.reshape(len(x_test), 28*28)/255, np.zeros((len(y_test),10)))

for i,l in enumerate(y_test):
  test_labels[i][l] = 1

np.random.seed(1)
relu = lambda x: (x>=0) * x
relu2deriv = lambda x: x>=0

batch_size = 100
alpha, iterations = (0.001, 300)
pixel_per_image, num_labels, hidden_size = (784, 10, 100)

weights_0_1 = 0.2*np.random.random((pixels_per_image, hidden_size)) - 0.1 # (784*100)
weights_1_2 = 0.2*np.random.random((hidden_size, num_labels)) - 0.1 # (100*10)

for j in range(iterations):
  error, correct_cnt = (0.0, 0)
  for i in range(int(len(images)/batch_size)):
    batch_start, batch_end = ((i*batch_size), ((i+1)*batch_size))
    layer_0 = images[batch_start:batch_end] # (100*784)
    layer_1 = relu(layer_0.dot(weights_0_1)) # (100*784).dot(784*100) = (100*100)
    dropout_mask = np.random.randint(2, size=layer_1.shape)
    layer_1 *= dropout_mask * 2
    layer_2 = layer_1.dot(weights_1_2) # (100*100).dot(100*10) = (100*10)

    error += np.sum((labels[batch_start:batch_end] - layer_2) ** 2)

    for k in range(batch_size):
      correct_cnt += int(np.argmax(labels[batch_start:batch_end] == np.argmax(layer_2[k:k+1])))
      layer_2_delta = (labels[batch_start:batch_end] - layer_2) / batch_size # ((100*10) - (100*10))/scalar = (100*10)
      layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1) # (100*10).dot(100*10.T) * relu2deriv(100*100) = (100*100)
      layer_1_delta *= dropout_mask
      weights_1_2 += alpha*layer_1.T.dot(layer_2_delta) # scalar*(100*100.T).dot(100*10) = (100*10)
      weights_0_1 += alpha*layer_0.T.dot(layer_1_delta) # scalar*(100*784.T).dot(100*100) = (784*100)

  if(j%10==0):
    test_error, test_correct_cnt = (0.0, 0)
    for i in range(len(test_images)):
      layer_0 = test_images[i:i+1]
      layer_1 = relu(layer_0.dot(weights_0_1))
      layer_2 = layer_1.dot(weights_1_2)
      test_error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
      test_correct_cnt += int(np.argmax(test_labels[i:i+1]) == np.argmax(layer_2))

    sys.stdout.write("\n" +
                     "I:" + str(j) +
                     " Test-Err:"+str(test_error/float(len(test_images)))[0:5] +
                     " Test-Acc:"+str(test_correct_cnt/float(len(test_images))) +
                     " Train-Err:"+str(error/float(len(images)))[0:5] +
                     " Train-Acc:"+str(correct_cnt/float(len(images))))
      


I:0 Test-Err:0.815 Test-Acc:0.3832 Train-Err:1.284 Train-Acc:0.362
I:10 Test-Err:0.568 Test-Acc:0.7173 Train-Err:0.591 Train-Acc:0.53
I:20 Test-Err:0.510 Test-Acc:0.7571 Train-Err:0.532 Train-Acc:0.438
I:30 Test-Err:0.485 Test-Acc:0.7793 Train-Err:0.498 Train-Acc:0.331
I:40 Test-Err:0.468 Test-Acc:0.7877 Train-Err:0.489 Train-Acc:0.309
I:50 Test-Err:0.458 Test-Acc:0.793 Train-Err:0.468 Train-Acc:0.28
I:60 Test-Err:0.452 Test-Acc:0.7995 Train-Err:0.452 Train-Acc:0.305
I:70 Test-Err:0.446 Test-Acc:0.803 Train-Err:0.453 Train-Acc:0.269
I:80 Test-Err:0.451 Test-Acc:0.7968 Train-Err:0.457 Train-Acc:0.314
I:90 Test-Err:0.447 Test-Acc:0.795 Train-Err:0.454 Train-Acc:0.252
I:100 Test-Err:0.448 Test-Acc:0.793 Train-Err:0.447 Train-Acc:0.284
I:110 Test-Err:0.441 Test-Acc:0.7943 Train-Err:0.426 Train-Acc:0.232
I:120 Test-Err:0.442 Test-Acc:0.7966 Train-Err:0.431 Train-Acc:0.246
I:130 Test-Err:0.441 Test-Acc:0.7906 Train-Err:0.434 Train-Acc:0.254
I:140 Test-Err:0.447 Test-Acc:0.7874 Train-Err:0.4

### Chapter 9: Activation functions

#### MNIST with tanh and softmax:

In [1]:
import sys, numpy as np
from keras.datasets import mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()
np.random.seed(1)

images, labels = (train_x[0:1000].reshape(1000, 28*28)/255, train_y[0:1000])
one_hot_labels = np.zeros((len(labels), 10))
for i,l in enumerate(labels):
  one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images, test_labels = (test_x.reshape(len(test_x), 28*28), np.zeros((len(test_y), 10)))
for i,l in enumerate(test_y):
  test_labels[i][l] = 1

tanh = lambda x: np.tanh(x)
tanh2deriv = lambda x: 1 - (x ** 2)

def softmax(x):
  temp = np.exp(x)
  return temp/np.sum(temp, axis=1, keepdims=True)

alpha, iterations, hidden_size = (2, 300, 100)
pixels_per_image, num_labels = (784, 10)
batch_size = 100

weights_0_1 = 0.02*np.random.random((pixels_per_image, hidden_size)) - 0.01 # (784*100)
weights_1_2 = 0.2*np.random.random((hidden_size, num_labels)) - 0.1 # (100*10)

for j in range(iterations):
  correct_cnt = 0
  for i in range(int(len(images)/batch_size)):
    batch_start, batch_end = (i*batch_size,(i+1)*batch_size)
    layer_0 = images[batch_start:batch_end] # (100*784)
    layer_1 = tanh(layer_0.dot(weights_0_1)) # activationFn(100*784.dot(784*100)) = (100*100)
    dropout_mask = np.random.randint(2, size=layer_1.shape)
    layer_1 *= dropout_mask * 2
    layer_2 = softmax(layer_1.dot(weights_1_2)) # activationFn(100*100.dot(100*10)) = (100*10)

    for k in range(batch_size):
      correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1])) # need to understand this line
      layer_2_delta = (labels[batch_start:batch_end] - layer_2[k:k+1])/(batch_size*layer_2.shape[0]) # (100*10) need to understand this line
      layer_1_delta = (layer_2_delta.dot(weights_1_2.T)) * tanh2deriv(layer_1) # (100*10.dot(100*10.T)*derivative(100*100)) = (100*100)
      layer_1_delta *= dropout_mask

      weights_1_2 += alpha*layer_1.T.dot(layer_2_delta) # scalar*(100*100.T).dot(100*10) = (100*10)
      weights_0_1 += alpha*layer_0.T.dot(layer_1_delta) # scalar*(100*784.T).dot(100*100) = (784*100)

  test_correct_cnt = 0
  for i in range(len(test_images)):
    layer_0 = test_images[i:i+1]
    layer_1 = tanh(layer_0.dot(weights_0_1))
    layer_2 = layer_1.dot(weights_1_2)
    test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))

  if(j%10 == 0):
    sys.stdout.write("\n" + "I:"+str(j)+
                     " Test_Acc:" +str(test_correct_cnt/float(len(test_images)))+
                     " Train_Acc:"+str(correct_cnt/float(len(images)))
                     )





I:0 Test_Acc:0.0974 Train_Acc:0.196
I:10 Test_Acc:0.2115 Train_Acc:0.136
I:20 Test_Acc:0.1254 Train_Acc:0.084
I:30 Test_Acc:0.1046 Train_Acc:0.098
I:40 Test_Acc:0.1319 Train_Acc:0.115
I:50 Test_Acc:0.1881 Train_Acc:0.138
I:60 Test_Acc:0.1201 Train_Acc:0.128
I:70 Test_Acc:0.2217 Train_Acc:0.159
I:80 Test_Acc:0.1891 Train_Acc:0.135
I:90 Test_Acc:0.0989 Train_Acc:0.098
I:100 Test_Acc:0.1585 Train_Acc:0.127
I:110 Test_Acc:0.1434 Train_Acc:0.136
I:120 Test_Acc:0.0833 Train_Acc:0.091
I:130 Test_Acc:0.128 Train_Acc:0.136
I:140 Test_Acc:0.056 Train_Acc:0.095
I:150 Test_Acc:0.123 Train_Acc:0.127
I:160 Test_Acc:0.1282 Train_Acc:0.132
I:170 Test_Acc:0.1511 Train_Acc:0.136
I:180 Test_Acc:0.1858 Train_Acc:0.155
I:190 Test_Acc:0.1135 Train_Acc:0.103
I:200 Test_Acc:0.1704 Train_Acc:0.13
I:210 Test_Acc:0.2617 Train_Acc:0.167
I:220 Test_Acc:0.0904 Train_Acc:0.109
I:230 Test_Acc:0.1513 Train_Acc:0.134
I:240 Test_Acc:0.1219 Train_Acc:0.135
I:250 Test_Acc:0.1153 Train_Acc:0.112
I:260 Test_Acc:0.1413 Trai




I:270 Test_Acc:0.098 Train_Acc:0.097
I:280 Test_Acc:0.098 Train_Acc:0.097
I:290 Test_Acc:0.098 Train_Acc:0.097