## Essentials to implement a Neural Net using PyTorch

From theory, we know that we need:
- The neural net definition, i.e. the input layer, the hidden layers that propagate the input, and the output layer (an RNN takes a little more work, and a CNN additionally needs convolutional layers, pooling and flattening)
- An initialization of the weights or filters
- A loss function and desired outputs

To run our computation, we need to keep repeating these steps:
- Push input after input forward through the net
- Compute a loss
- Propagate backwards to get the gradient for each weight or filter
- Alter the weights in the direction opposite to the gradient to decrease the loss
... until the weights are optimized enough that the loss is within an acceptable range
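
The loop described above can be sketched without any framework. This is only a toy illustration, assuming a single weight `w`, made-up data for `y = 2 * x`, and a hand-derived gradient; the names `samples`, `acceptable_loss` and `mean_squared_loss` are hypothetical and do not come from the notebook.

```python
# Toy data for y = 2 * x; the single weight w should approach 2.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0                     # initial weight
learning_rate = 0.05
acceptable_loss = 1e-6      # stop once the loss is within this range


def mean_squared_loss(weight):
    # Forward pass (weight * x) followed by the mean squared error
    return sum((weight * x - y) ** 2 for x, y in samples) / len(samples)


steps = 0
loss = mean_squared_loss(w)
while loss > acceptable_loss and steps < 1000:
    # Backward pass: d(loss)/dw = mean(2 * (w * x - y) * x)
    grad = sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)
    # Move the weight in the direction opposite to the gradient
    w -= learning_rate * grad
    loss = mean_squared_loss(w)
    steps += 1

print(w)
```

PyTorch automates the gradient step of this loop, which is what the rest of this notebook relies on.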

### Defining the Neural Net Architecture

In [235]:
# Imports
# Torch package
import torch
# Variable Class
from torch.autograd import Variable
# Neural Net Sub-package from Torch
import torch.nn as nn
# Functions from the Neural Net Sub-Package such as RELU
import torch.nn.functional as F

# The number of coordinates per sample sets the input width of the first hidden layer
# Here, we have 3 samples with 8 coordinates each
inputs = Variable(torch.randn(3,8), requires_grad=True)
input_size = inputs.size()

# Implement the net by subclassing torch.nn.Module,
# which has all the useful stuff needed for a Neural Net
class Net(nn.Module):
    
    def __init__(self):
        super(Net, self).__init__()
        # Basic net architecture, i.e. all the layers the net needs
        # How they are chained, and where non-linearities go, is defined in forward()
        # A layer is defined by its number of inputs and outputs
        # There is no explicit input layer: the first Linear layer takes its
        # input width from input_size, computed from the inputs above
        self.hidden = nn.Linear(input_size[1], 5)
        self.output = nn.Linear(5, 2)

    def forward(self, inputs):
        # How layers link to each other is defined here
        # Non-linearities are added between the layers in this chain
        
        # Input 3 samples of size 8 into the hidden layer
        x = self.hidden(inputs)
        x = F.tanh(x)
        x = self.output(x)
        x = F.log_softmax(x)
        return x

    # Can also define any other helpers to be used by this NN here
    
# Instantiate Net architecture first
net = Net()
print 'This net looks like: ', net
# Calling the net on a set of inputs calls forward(), i.e. forward-propagates all of them
# Useful if, say, you have a trained net and just need to classify test data
result = net(inputs)
print 'Classifying the inputs as if it is a trained model gives: ', result

This net looks like:  Net (
  (hidden): Linear (8 -> 5)
  (output): Linear (5 -> 2)
)
Classifying the inputs as if it is a trained model gives:  Variable containing:
-0.5904 -0.8077
-0.8998 -0.5220
-0.6376 -0.7519
[torch.FloatTensor of size 3x2]
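
The cell above is written in the older `Variable` / Python 2 style. As a rough sketch, assuming a recent PyTorch release (where tensors no longer need to be wrapped in `Variable`) and Python 3, the same architecture might look like the following. `SmallNet` and its constructor arguments are illustrative names that are not part of the original notebook, and the layer sizes are passed in explicitly rather than read from a global.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallNet(nn.Module):
    # Hypothetical re-statement of Net: 8 inputs -> 5 hidden units -> 2 outputs
    def __init__(self, n_inputs=8, n_hidden=5, n_outputs=2):
        super(SmallNet, self).__init__()
        self.hidden = nn.Linear(n_inputs, n_hidden)
        self.output = nn.Linear(n_hidden, n_outputs)

    def forward(self, inputs):
        # tanh non-linearity between the two linear layers
        x = torch.tanh(self.hidden(inputs))
        # log-softmax over the class dimension (dim=1), one row per sample
        return F.log_softmax(self.output(x), dim=1)


torch.manual_seed(0)
sample_inputs = torch.randn(3, 8)   # 3 samples with 8 coordinates each
small_net = SmallNet()
log_probs = small_net(sample_inputs)
print(log_probs.shape)
```

Exponentiating each output row gives probabilities that sum to 1, which is a quick sanity check that the softmax is applied along the intended dimension.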



## Training the Neural Net
So far we have just defined the neural net architecture. Since no weights were specified, they were initialized randomly, and we saw how to forward propagate.

To train we still need:
- Initial weights
- The desired output for each input and...
- A loss function to gauge how far we are from those

We need to proceed as follows:
- Do full epochs through the training data
- Calculate the loss and backpropagate to minimize it
- Repeat until the loss is satisfactory

In [236]:
# In one epoch:
#  The net forward propagates on the input
#  Calculates loss using the loss function on outputs obtained and desired_output
#  Optimizes the loss function using the optimizer
#    That is, finds gradient of loss wrt. each weight (Recall the Variable Class that wraps around a Tensor)
#    (SGD is an example seen in theory)
#  Updates the weights based on those gradients

def feed_forward_one_time(net, inputs, desired_outputs):
    # Forward prop
    output = net(inputs)
    
    # Calculate loss
    loss_function = nn.MSELoss()
    loss = loss_function(output, desired_outputs)
    print('Loss is now valued at: ', loss)
    return loss

net = Net()
loss = feed_forward_one_time(net, inputs, Variable(torch.rand(3, 2)))

# Note: Both output and desired output need to be Variables wrapping the same type of tensor
# Cast a tensor with that_tensor_name.float(), .double(), .long(), etc.

('Loss is now valued at: ', Variable containing:
 1.3019
[torch.FloatTensor of size 1]
)


### Backprop and update weights (Optimization) in a fine-grained manner

In [237]:
# Clear all gradient buffers for params to get fresh gradients
net.zero_grad() 
# Backprop and get gradient for each and every param
loss.backward()

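# Loss is a Variable whose grad_fn chain spans all the way back to the inputs
# This allows backpropagation wrt. every parameter
# Each parameter is in the net.parameters() generator
# Note: weights and biases are yielded as separate tensors, so the counter
# below indexes parameter tensors rather than layers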
layer_count = 0
for x in net.parameters():
    print 'layer #', layer_count, 'Parameters'
    print x.data
    print 'layer #', layer_count, 'Gradients'
    print x.grad.data
    layer_count += 1

layer # 0 Parameters

 0.0828 -0.0073  0.1013 -0.2768 -0.2450 -0.0431  0.1213 -0.2040
-0.1974 -0.0602  0.0627 -0.1636 -0.3224  0.2514  0.1594 -0.1397
 0.0430 -0.0236 -0.0016 -0.2356 -0.3284  0.2442 -0.1784  0.0559
-0.3530  0.0957  0.1588 -0.3023 -0.2544  0.0558  0.0886  0.0960
 0.1613  0.3233  0.1959  0.0641 -0.1029 -0.2467  0.2768  0.2288
[torch.FloatTensor of size 5x8]

layer # 0 Gradients

-0.1148 -0.0310  0.0987  0.0060 -0.0584 -0.0197  0.0707 -0.0265
-0.0634 -0.0547  0.2070  0.0677 -0.2211 -0.1692  0.2744  0.0019
 0.0720  0.0282 -0.0797 -0.0147  0.0596  0.0341 -0.0716  0.0174
 0.0032  0.0049 -0.0193 -0.0072  0.0222  0.0178 -0.0276 -0.0011
-0.0013 -0.0002  0.0008 -0.0001 -0.0003  0.0002  0.0003 -0.0003
[torch.FloatTensor of size 5x8]

layer # 1 Parameters

-0.3396
-0.1061
 0.2396
-0.1187
-0.1789
[torch.FloatTensor of size 5]

layer # 1 Gradients

1.00000e-02 *
  5.8034
  1.2014
 -3.0349
  0.0543
  0.0761
[torch.FloatTensor of size 5]

layer # 2 Parameters

 0.1815  0.3632 -0.3937 -
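
The listing above numbers each tensor yielded by `net.parameters()`, so a layer's weight matrix and its bias vector get separate numbers. As a small sketch, assuming a recent PyTorch release, `named_parameters()` makes that pairing explicit. The `layers` container below is a hypothetical stand-in with the same two `Linear` layers as `Net`, not the notebook's own object.

```python
from collections import OrderedDict

import torch.nn as nn

# Stand-in with the same layer sizes as Net: 8 -> 5 -> 2
layers = nn.Sequential(OrderedDict([
    ("hidden", nn.Linear(8, 5)),
    ("output", nn.Linear(5, 2)),
]))

# Each Linear layer contributes two parameter tensors: weight, then bias
shapes = [(name, tuple(param.shape)) for name, param in layers.named_parameters()]
for name, shape in shapes:
    print(name, shape)
```

The two layers therefore yield four parameter tensors, which matches the `5x8` tensor followed by a size-`5` tensor in the output above.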

In [238]:
# Now we can update all of those parameters using the gradients obtained
def update_parameters(net, learning_rate = 0.01):
    for f in net.parameters():
        f.data.sub_(f.grad.data * learning_rate)

print 'weights from input layer to hidden BEFORE update:'
print net.parameters().next().data
update_parameters(net)
print 'weights from input layer to hidden AFTER update:'
print net.parameters().next().data

weights from input layer to hidden BEFORE update:

 0.0828 -0.0073  0.1013 -0.2768 -0.2450 -0.0431  0.1213 -0.2040
-0.1974 -0.0602  0.0627 -0.1636 -0.3224  0.2514  0.1594 -0.1397
 0.0430 -0.0236 -0.0016 -0.2356 -0.3284  0.2442 -0.1784  0.0559
-0.3530  0.0957  0.1588 -0.3023 -0.2544  0.0558  0.0886  0.0960
 0.1613  0.3233  0.1959  0.0641 -0.1029 -0.2467  0.2768  0.2288
[torch.FloatTensor of size 5x8]

weights from input layer to hidden AFTER update:

 0.0839 -0.0069  0.1003 -0.2769 -0.2444 -0.0429  0.1206 -0.2037
-0.1967 -0.0597  0.0607 -0.1643 -0.3202  0.2531  0.1566 -0.1397
 0.0422 -0.0239 -0.0008 -0.2355 -0.3290  0.2439 -0.1777  0.0557
-0.3531  0.0956  0.1590 -0.3022 -0.2546  0.0556  0.0889  0.0961
 0.1613  0.3233  0.1958  0.0641 -0.1029 -0.2467  0.2768  0.2288
[torch.FloatTensor of size 5x8]
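
As a quick arithmetic check of the update rule `new = old - learning_rate * gradient`, the first row of the printed weights can be recomputed by hand. The values below are copied from the outputs above (first row of the BEFORE weights, of the layer # 0 gradients, and of the AFTER weights). They are rounded to four decimals, so only approximate agreement should be expected.

```python
learning_rate = 0.01

# First row of each 5x8 tensor, as printed in the cell outputs
before   = [0.0828, -0.0073, 0.1013, -0.2768, -0.2450, -0.0431, 0.1213, -0.2040]
gradient = [-0.1148, -0.0310, 0.0987, 0.0060, -0.0584, -0.0197, 0.0707, -0.0265]
after    = [0.0839, -0.0069, 0.1003, -0.2769, -0.2444, -0.0429, 0.1206, -0.2037]

# Same rule as update_parameters(): subtract the scaled gradient
recomputed = [w - learning_rate * g for w, g in zip(before, gradient)]

# Largest difference between the hand-computed and the printed values
max_gap = max(abs(r - a) for r, a in zip(recomputed, after))
print(max_gap)
```

For example, the first weight goes from 0.0828 to 0.0828 + 0.01 × 0.1148 ≈ 0.0839, as printed.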



### Easy built-in Optimization (only choose optimizer, loss function and step)
You may want to optimize with SGD or some other method without implementing the update yourself as we did above: just feed forward, calculate the loss, backpropagate, then let the optimizer apply the gradients.

PyTorch provides built-in optimizers, e.g. SGD or Adam.

In [240]:
import torch.optim as optim
net = Net()
# Choose loss function, optimizer
loss_function = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.01)

# Loop through:
loss_value = 1000000
update_count = 0
while loss_value > 1:
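    # Regular feed-forward and loss calculation
    # (note: a fresh random target is drawn on every iteration here;
    # real training would compare against fixed desired outputs)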
    output = net(inputs)
    loss = loss_function(output, Variable(torch.rand(3, 2)))
    loss_value = loss.data[0]
    # Call zero_grad() on the optimizer directly, since it now wraps the net params
    optimizer.zero_grad()
    # Backpropagate
    loss.backward()
    # Update the weights according to this optimizer's update rule
    optimizer.step()
    update_count += 1
    print 'weight from input layer neurone 1 to hidden layer neurone 1 after update #', update_count, ': '
    print net.parameters().next().data[0,0]
    
print 'Loss reached in the end: ', loss_value

weight from input layer neurone 1 to hidden layer neurone 1 after update # 1 : 
-0.118415981531
weight from input layer neurone 1 to hidden layer neurone 1 after update # 2 : 
-0.126368165016
weight from input layer neurone 1 to hidden layer neurone 1 after update # 3 : 
-0.12321575731
weight from input layer neurone 1 to hidden layer neurone 1 after update # 4 : 
-0.118741162121
weight from input layer neurone 1 to hidden layer neurone 1 after update # 5 : 
-0.112103506923
weight from input layer neurone 1 to hidden layer neurone 1 after update # 6 : 
-0.105015315115
weight from input layer neurone 1 to hidden layer neurone 1 after update # 7 : 
-0.0995587557554
Loss reached in the end:  0.896777868271
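
For comparison, here is a hedged sketch of the same optimizer loop, assuming a recent PyTorch release and Python 3. It departs from the cell above in three ways: the targets are drawn once and kept fixed, the loop is bounded by a hypothetical `max_updates` cap instead of a loss threshold, and the net is a plain `nn.Sequential` without the `log_softmax` output, because log-softmax values are never positive and so cannot match targets drawn from [0, 1). The names `features`, `targets`, `model` and `loss_history` are illustrative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
features = torch.randn(3, 8)   # 3 samples with 8 coordinates each
targets = torch.rand(3, 2)     # desired outputs, drawn once and kept fixed

# Same layer sizes as Net, but with a linear output for MSE regression
model = nn.Sequential(nn.Linear(8, 5), nn.Tanh(), nn.Linear(5, 2))
loss_function = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

max_updates = 200              # hypothetical cap so the loop always ends
loss_history = []
for update in range(max_updates):
    optimizer.zero_grad()                       # clear old gradients
    loss = loss_function(model(features), targets)
    loss.backward()                             # backpropagate
    optimizer.step()                            # apply the optimizer's update
    loss_history.append(loss.item())            # loss as a Python float

print(loss_history[0], loss_history[-1])
```

Keeping the targets fixed means the loss measures progress toward one goal, so comparing the first and last entries of `loss_history` is meaningful.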
