
# Introduction to Neural Networks with PyTorch

## We will start with the very basics: the Perceptron

---
"system that depends on probabilistic rather than deterministic principles for its operation, gains its reliability from the properties of statistical measurements obtained from a large population of elements" - Frank Rosenblatt (1957)



In [69]:
from IPython.display import Image
from IPython.core.display import HTML
from tqdm import tqdm

from IPython.display import clear_output
Image(url="https://ibin.co/4TyMU8AdpV4J.png", width=500)

In [70]:
Image(url="https://svgshare.com/i/AbJ.svg", width=500)

Optimization Process
====

To learn the weights, $w$, we use an **optimizer** to find the best-fit (optimal) values for $w$ such that the inputs map correctly to the outputs.

Typically, the process performs the following four steps iteratively.

### **Initialization**

 - **Step 1**: Initialize weights vector
 
### **Forward Propagation**

 
 - **Step 2a**: Multiply the weights vector with the inputs, sum the products, i.e. `s`
 - **Step 2b**: Put the sum through the sigmoid, i.e. `f()`
 
### **Back Propagation**
 
 
 - **Step 3a**: Compute the errors, i.e. difference between expected output and predictions
 - **Step 3b**: Multiply the error with the **derivatives** to get the delta
 - **Step 3c**: Multiply the delta vector with the inputs, sum the products
 
### **Optimizer takes a step**

 - **Step 4**: Multiply the learning rate with the output of Step 3c and add the result to the weights.
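The four steps above can be sketched as a single iteration in NumPy. This is only an illustrative sketch: the OR truth table as toy data, the learning rate of 0.03, and the use of a signed error (`Y - prediction`) are assumptions made here so the update has a direction.

```python
import numpy as np

def sigmoid(x):
    # Squashes each value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

# Toy data, assumed for illustration: the OR truth table.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0], [1], [1], [1]])
learning_rate = 0.03  # assumed value

# Step 1: initialize the weights vector.
rng = np.random.default_rng(0)
W = rng.random((2, 1))

# Step 2a: multiply the weights with the inputs and sum the products, s.
s = X @ W
# Step 2b: put the sum through the sigmoid, f(s).
prediction = sigmoid(s)

# Step 3a: compute the (signed) error between expected output and prediction.
error = Y - prediction
# Step 3b: multiply the error with the sigmoid's slope to get the delta.
delta = error * prediction * (1 - prediction)
# Step 3c: multiply the delta with the inputs and sum over the rows.
weight_step = X.T @ delta

# Step 4: scale by the learning rate and add the result to the weights.
W_new = W + learning_rate * weight_step
print(W_new - W)
```

Repeating steps 2 to 4 many times is what the training loops later in this notebook do.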

In [0]:
import math
import numpy as np
np.random.seed(0)

In [0]:
def sigmoid(x): # Squashes each value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

For a derivation of the sigmoid's derivative, see https://math.stackexchange.com/a/1225116
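For convenience, the identity can be derived in a few lines. The symbol $\sigma$ for the sigmoid is a notation introduced here; the notebook calls it `sigmoid`.

$$
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}}
= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
= \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
$$

The derivative is therefore written in terms of the sigmoid's *output*, which is why the function below takes `sx`, meaning $\sigma(x)$, rather than the raw input $x$.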

In [0]:
def sigmoid_derivative(sx):
    # sx is the sigmoid output, sigmoid(x), not the raw input x.
    return sx*(1-sx)

In [74]:
sigmoid(np.array([2.5, 0.32, -1.42]))

array([0.92414182, 0.57932425, 0.19466158])

In [75]:
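# Caution: sigmoid_derivative expects sigmoid outputs, not raw inputs.
# Passing raw values, as here, gives meaningless (even negative) slopes.
sigmoid_derivative(np.array([2.5, 0.32, -1.42]))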

array([-3.75  ,  0.2176, -3.4364])
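Because `sigmoid_derivative` is defined in terms of the sigmoid's output, a true slope only comes out when it is fed `sigmoid(x)`. The sketch below checks this against a central finite difference. The step size `eps` is an arbitrary choice made here.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # sx must already be a sigmoid output.
    return sx * (1 - sx)

x = np.array([2.5, 0.32, -1.42])

# Analytic slope: pass the sigmoid output, not the raw input.
analytic = sigmoid_derivative(sigmoid(x))

# Numerical slope from a central finite difference (eps is an assumption).
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(analytic)
print(np.allclose(analytic, numeric))
```

The slopes all lie between 0 and 0.25, the maximum being reached at `x = 0`.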

In [0]:
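# Let's define our cost function: the absolute error per element.
# Note: np.abs discards the sign of the error, so it measures how far off
# a prediction is but not in which direction.
def cost(predicted, truth):
    return np.abs(truth - predicted)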

In [77]:
#test our cost function
gold = np.array([0.5, 1.2, 9.8])
pred = np.array([0.6, 1.0, 10.0])
cost(pred, gold)

array([0.1, 0.2, 0.2])

In [78]:
gold = np.array([0.5, 1.2, 9.8])
pred = np.array([9.3, 4.0, 99.0])
cost(pred, gold)

array([ 8.8,  2.8, 89.2])

Representing OR Boolean
---

Let's consider the OR Boolean function and apply the perceptron with simple gradient descent.

| x2 | x3 | y | 
|:--:|:--:|:--:|
| 0 | 0 | 0 |
| 0 | 1 | 1 | 
| 1 | 0 | 1 | 
| 1 | 1 | 1 | 

In [0]:
X = or_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = or_output = np.array([[0,1,1,1]]).T  # Transpose into a column vector of shape (4, 1)

In [80]:
or_input

array([[0, 0],
       [0, 1],
       [1, 0],
       [1, 1]])

In [81]:
or_output

array([[0],
       [1],
       [1],
       [1]])

In [0]:
# Define the shape of the weight vector.
num_data, input_dim = 4, 2
# Define the shape of the output vector. 
output_dim = len(or_output.T)

In [83]:
print('Inputs\n======')
print('no. of rows =', num_data) 
print('no. of cols =', input_dim)
print('\n')
print('Outputs\n=======')
print('no. of cols =', output_dim)

Inputs
no. of rows = 4
no. of cols = 2


Outputs
no. of cols = 1


In [84]:
# Initialize weights between the input layers and the perceptron
W = np.random.random((input_dim, output_dim)) # Some random numbers are fine
W

array([[0.5488135 ],
       [0.71518937]])

Step 2a: Multiply the weights vector with the inputs, sum the products
====

To get the output of step 2a, 

 - Iterate through each row of the data, `X`
 - For each column in each row, find the product of the value and the respective weights
 - For each row, compute the sum of the products

In [85]:
# If we write it imperatively:
summation = []
for row in X:
    sum_wx = 0
    for feature, weight in zip(row, W):
        sum_wx += feature * weight
    summation.append(sum_wx)
print(np.array(summation))

[[0.        ]
 [0.71518937]
 [0.5488135 ]
 [1.26400287]]


In [86]:
# The same computation in a single line with NumPy:
np.dot(X, W)

array([[0.        ],
       [0.71518937],
       [0.5488135 ],
       [1.26400287]])

Let's Train the Single-Layer Model
====

In [87]:
num_epochs = 10000 # No. of times to iterate.
learning_rate = 0.03 # How large a step to take per iteration.

# Let's standardize and call our inputs X and outputs Y
X = or_input
Y = or_output

for _ in tqdm(range(num_epochs)):
    layer0 = X

    # Step 2a: Multiply the weights vector with the inputs, sum the products, i.e. s
    # Step 2b: Put the sum through the sigmoid, i.e. f()
    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(X, W))

    # Back propagation.
    # Step 3a: Compute the errors, i.e. difference between expected output and predictions
    # How much did we miss?
    layer1_error = cost(layer1, Y)

    # Step 3b: Multiply the error with the derivatives to get the delta
    # multiply how much we missed by the slope of the sigmoid at the values in layer1
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # Step 3c: Multiply the delta vector with the inputs, sum the product (use np.dot)
    # Step 4: Multiply the learning rate with the output of Step 3c.
    W +=  learning_rate * np.dot(layer0.T, layer1_delta)

100%|██████████| 10000/10000 [00:00<00:00, 42585.23it/s]


In [88]:
layer1

array([[0.5       ],
       [0.95643415],
       [0.95623017],
       [0.99791935]])
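The first row stays at exactly 0.5 because its input is `[0, 0]`: the weighted sum is 0 whatever the weights are, and `sigmoid(0)` is 0.5. The sketch below illustrates this. It also shows one common remedy that is an assumption added here and not part of this notebook's model: appending a constant-1 column so that a third weight acts as a bias. The weight values are hand-picked and hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Without a bias, the all-zero row ignores the weights entirely.
rng = np.random.default_rng(0)
zero_row_outputs = [
    sigmoid(X[:1] @ rng.normal(size=(2, 1))).item() for _ in range(3)
]
print(zero_row_outputs)

# Assumed remedy (not in the notebook): append a constant-1 column so
# the third weight acts as a bias term.
X_with_bias = np.hstack([X, np.ones((4, 1))])
W_with_bias = np.array([[5.0], [5.0], [-2.0]])  # hand-picked, hypothetical
outputs = sigmoid(X_with_bias @ W_with_bias)
predictions = (outputs > 0.5).astype(int).ravel()
print(predictions)
```

With the strict `> 0.5` threshold used below, an output of exactly 0.5 is classified as 0, which is why the OR example still scores all four rows correctly.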

In [89]:
# Expected output.
Y

array([[0],
       [1],
       [1],
       [1]])

In [90]:
# On the training data
# We use a threshold of 0.5.
[[int(prediction > 0.5)] for prediction in layer1]

[[0], [1], [1], [1]]

In [91]:
# Let's try to calculate the cost
predicted = [[int(prediction > 0.5)] for prediction in layer1]
cost(predicted, Y) # In our case there is no cost at all

array([[0],
       [0],
       [0],
       [0]])

Let's try the XOR Boolean
---

Let's consider the XOR Boolean function and apply the perceptron with simple gradient descent.

| x2 | x3 | y | 
|:--:|:--:|:--:|
| 0 | 0 | 0 |
| 0 | 1 | 1 | 
| 1 | 0 | 1 | 
| 1 | 1 | 0 | 

In [0]:
X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

In [93]:
xor_input

array([[0, 0],
       [0, 1],
       [1, 0],
       [1, 1]])

In [94]:
xor_output

array([[0],
       [1],
       [1],
       [0]])

In [95]:
num_epochs = 10000000 # No. of times to iterate.
learning_rate = 0.003 # How large a step to take per iteration.

# Use all four rows of the XOR truth table for training.
X = xor_input
Y = xor_output

# Define the shape of the weight vector.
num_data, input_dim = 4, 2
# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the input layers and the perceptron - Random
W = np.random.random((input_dim, output_dim)) # Some random numbers are fine

for _ in tqdm(range(num_epochs)):
    layer0 = X
    # Forward propagation.
    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(X, W))

    # How much did we miss?
    layer1_error = cost(layer1, Y)

    # Back propagation.
    # multiply how much we missed by the slope of the sigmoid at the values in layer1
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W +=  learning_rate * np.dot(layer0.T, layer1_delta)

100%|██████████| 10000000/10000000 [03:22<00:00, 49496.38it/s]


In [96]:
# Expected output.
Y

array([[0],
       [1],
       [1],
       [0]])

In [97]:
# On the training data 
#For us we are setting a threshold of .5
[[int(prediction > 0.5)] for prediction in layer1]

[[0], [1], [1], [1]]

In [98]:
predicted = [[int(prediction > 0.5)] for prediction in layer1]
cost(predicted, Y) # One is still wrong. Go back to training and try tuning the hyperparameters.

array([[0],
       [0],
       [0],
       [1]])

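You can't represent XOR with a simple perceptron!
====

No matter how you change the hyperparameters or the training data, a single perceptron layer can't represent the XOR function. A single layer can only separate the input space with one straight line, and no single line puts (0,1) and (1,0) on one side and (0,0) and (1,1) on the other.

So there is no way to get the correct output for all four data points of the XOR Boolean operation.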
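One way to get a feel for this is a brute-force search. The sketch below tries every combination of two weights and a bias on a small grid and checks whether a single thresholded unit reproduces the truth table. The grid range, the step size, and the `separable` helper are hypothetical choices made here, and a grid search is an illustration rather than a proof.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {
    "OR": np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

# Hypothetical search grid: weights and bias in [-2, 2] in steps of 0.25.
grid = np.linspace(-2, 2, 17)

def separable(y):
    """Return True if some (w1, w2, b) on the grid classifies every row."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        predictions = (X @ np.array([w1, w2]) + b > 0).astype(int)
        if np.array_equal(predictions, y):
            return True
    return False

results = {name: separable(y) for name, y in targets.items()}
print(results)  # {'OR': True, 'XOR': False}
```

The XOR result does not depend on the grid. The four rows would need `b <= 0`, `w2 + b > 0`, `w1 + b > 0` and `w1 + w2 + b <= 0`. Adding the two middle inequalities gives `w1 + w2 + b > -b >= 0`, which contradicts the last one.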


Solving XOR (Add more layers)
====

In [99]:
from itertools import chain
import numpy as np
np.random.seed(0)

def sigmoid(x): # Squashes each value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def cost(predicted, truth):
    return truth - predicted

xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T
# Use the XOR data as the training inputs X and outputs Y.
X, Y = xor_input, xor_output

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Lets set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layers and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the hidden layers and the output layer.
W2 = np.random.random((hidden_dim, output_dim)) # The hidden layer's output is fed in as the output layer's input.

# Set the training hyperparameters.
num_epochs = 10000
learning_rate = 0.03

for epoch_n in tqdm(range(num_epochs)):
    layer0 = X
    # Forward propagation.
    
    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)
    
    # How much did we miss in the predictions?
    layer2_error = cost(layer2, Y)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)

    
    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)
    
    # update weights
    W2 +=  learning_rate * np.dot(layer1.T, layer2_delta)
    W1 +=  learning_rate * np.dot(layer0.T, layer1_delta)
    ##print(epoch_n, list((layer2)))

100%|██████████| 10000/10000 [00:00<00:00, 25447.34it/s]


In [100]:
# Training input.
X

array([[0, 0],
       [0, 1],
       [1, 0],
       [1, 1]])

In [101]:
# Expected output.
Y

array([[0],
       [1],
       [1],
       [0]])

In [102]:
layer2 # Our output layer

array([[0.31284349],
       [0.6213127 ],
       [0.62323891],
       [0.46427804]])

In [103]:
# On the training data
[int(prediction > 0.5) for prediction in layer2]

[0, 1, 1, 0]
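It can help to follow the array shapes through the two-layer forward and backward pass, since each weight update must have the same shape as the weight matrix it is added to. The sketch below runs one pass and prints the shapes. The seed, the random generator and the variable names `step_W1` and `step_W2` are assumptions made here.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # (4, 2)
Y = np.array([[0, 1, 1, 0]]).T                   # (4, 1)

rng = np.random.default_rng(0)
W1 = rng.random((2, 5))   # input -> hidden
W2 = rng.random((5, 1))   # hidden -> output

# Forward propagation.
layer1 = sigmoid(X @ W1)        # (4, 5)
layer2 = sigmoid(layer1 @ W2)   # (4, 1)

# Back propagation.
layer2_delta = (Y - layer2) * layer2 * (1 - layer2)    # (4, 1)
layer1_error = layer2_delta @ W2.T                     # (4, 5)
layer1_delta = layer1_error * layer1 * (1 - layer1)    # (4, 5)

# Each update has the same shape as the matrix it is added to.
step_W2 = layer1.T @ layer2_delta   # (5, 1)
step_W1 = X.T @ layer1_delta        # (2, 5)

for name, arr in [("layer1", layer1), ("layer2", layer2),
                  ("layer1_error", layer1_error),
                  ("step_W1", step_W1), ("step_W2", step_W2)]:
    print(name, arr.shape)
```

Multiplying by `W2.T` is what spreads the single output delta back across the five hidden units, in proportion to the weight each unit had in the prediction.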

Now try adding another layer
====

Use the same process:
    
  1. Initialize
  2. Forward Propagate
  3. Back Propagate 
  4. Update (aka step)

In [0]:
from itertools import chain
import numpy as np
np.random.seed(0)

def sigmoid(x): # Squashes each value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def cost(predicted, truth):
    return truth - predicted

xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T

In [105]:
#Input
X

array([[0, 0],
       [0, 1],
       [1, 0],
       [1, 1]])

In [106]:
#Output
Y

array([[0],
       [1],
       [1],
       [0]])

In [112]:
# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Lets set the dimensions for the intermediate layer.
layer0to1_hidden_dim = 5
layer1to2_hidden_dim = 5

# Initialize weights between the input layers 0 ->  layer 1
W1 = np.random.random((input_dim, layer0to1_hidden_dim))

# Initialize weights between the layer 1 -> layer 2
W2 = np.random.random((layer0to1_hidden_dim, layer1to2_hidden_dim))

# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the layer 2 -> layer 3
W3 = np.random.random((layer1to2_hidden_dim, output_dim))

# Set the training hyperparameters.
num_epochs = 100000
learning_rate = 0.001

for epoch_n in tqdm(range(num_epochs)):
    layer0 = X
    # Forward propagation.
    
    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))
    layer3 = sigmoid(np.dot(layer2, W3))

    # Back propagation (Y -> layer3)
    # How much did we miss in the predictions?
    layer3_error = cost(layer3, Y)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer3_delta = layer3_error * sigmoid_derivative(layer3)

    # Back propagation (layer3 -> layer2)
    # How much did each layer2 value contribute to the layer3 error (according to the weights)?
    layer2_error = np.dot(layer3_delta, W3.T)
    # Caution: this multiplies by layer3_error, so layer2_error above is never used.
    # Standard backpropagation would use layer2_error here.
    layer2_delta = layer3_error * sigmoid_derivative(layer2)
    
    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)
    
    # update weights
    W3 +=  learning_rate * np.dot(layer2.T, layer3_delta)
    W2 +=  learning_rate * np.dot(layer1.T, layer2_delta)
    W1 +=  learning_rate * np.dot(layer0.T, layer1_delta)

100%|██████████| 100000/100000 [00:05<00:00, 17234.57it/s]


In [113]:
Y

array([[0],
       [1],
       [1],
       [0]])

In [114]:
layer3

array([[0.52868989],
       [0.54217557],
       [0.54040728],
       [0.53694543]])

In [115]:
# On the training data
prediction_output = [int(prediction > 0.5) for prediction in layer3]
prediction_output

[1, 1, 1, 1]
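All four outputs land on the same side of the threshold, so this run did not learn XOR. One thing worth checking is the `layer2_delta` line in the loop above, which multiplies by `layer3_error` rather than the `layer2_error` computed just before it. The sketch below is a variant with that one term changed. The `train_three_layer` helper, its default hyperparameters and the seed are hypothetical choices made here, and no claim is made that this variant solves XOR with these settings.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # sx must already be a sigmoid output.
    return sx * (1 - sx)

def train_three_layer(X, Y, hidden_dim=5, num_epochs=2000,
                      learning_rate=0.03, seed=0):
    """Hypothetical helper: train three sigmoid layers, return (output, losses)."""
    rng = np.random.default_rng(seed)
    W1 = rng.random((X.shape[1], hidden_dim))
    W2 = rng.random((hidden_dim, hidden_dim))
    W3 = rng.random((hidden_dim, Y.shape[1]))
    losses = []
    for _ in range(num_epochs):
        # Forward propagation.
        layer1 = sigmoid(X @ W1)
        layer2 = sigmoid(layer1 @ W2)
        layer3 = sigmoid(layer2 @ W3)

        # Back propagation (Y -> layer3).
        layer3_error = Y - layer3
        layer3_delta = layer3_error * sigmoid_derivative(layer3)
        # Back propagation (layer3 -> layer2), using layer2_error.
        layer2_error = layer3_delta @ W3.T
        layer2_delta = layer2_error * sigmoid_derivative(layer2)
        # Back propagation (layer2 -> layer1).
        layer1_error = layer2_delta @ W2.T
        layer1_delta = layer1_error * sigmoid_derivative(layer1)

        # Update weights.
        W3 += learning_rate * layer2.T @ layer3_delta
        W2 += learning_rate * layer1.T @ layer2_delta
        W1 += learning_rate * X.T @ layer1_delta
        losses.append(float(np.sum(layer3_error ** 2)))
    return layer3, losses

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0, 1, 1, 0]]).T
output, losses = train_three_layer(X, Y)
print(output.ravel(), losses[0], losses[-1])
```

Tracking the summed squared error per epoch, as `losses` does here, makes it easier to see whether a change in the learning rate or the number of epochs is helping.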