# Neural Networks with torch7

Torch has a package for neural networks called 'nn'. It can be imported into torch using:

In [1]:
nn = require 'nn';

Many types of neural networks have been designed over the years. Most of them can be represented as directed acyclic graphs. (As a side note, we'll later see a class called recurrent neural networks that can't be represented in that manner.) The simplest type is the sequential feed-forward neural network. Multilayer perceptrons and convolutional neural networks fall into that category.

To create a sequential network, we use a super-module called 'nn.Sequential'. Other modules can be added to this super-module to create a sequential neural network. 
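To make the idea of stacking modules concrete, here is a minimal sketch of a small multilayer perceptron. The name `mlp` and the layer sizes (3, 4, 2) are made up for illustration only; the rest of this notebook builds its own `model` step by step.

```lua
require 'torch'
require 'nn'

-- Hypothetical layer sizes, chosen only for illustration.
mlp = nn.Sequential()
mlp:add(nn.Linear(3, 4))   -- 3 inputs -> 4 hidden units
mlp:add(nn.Tanh())         -- element-wise nonlinearity
mlp:add(nn.Linear(4, 2))   -- 4 hidden units -> 2 outputs

-- Modules are applied in the order they were added.
out = mlp:forward(torch.randn(3))
print(out:size(1))         -- 2
```

Each call to add appends one module, so the input flows through them from first to last.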

A variant of the sequential network is the parallel module, which runs multiple neural network pipelines in parallel. If you want even more power, for example to design recurrent neural networks, you would use another package called 'nngraph'. But we are getting ahead of ourselves.

## The sequential super-module
Here's how you create a sequential model:

In [2]:
model = nn.Sequential()

That's it! However, our model is empty and can't do anything yet.

## The Linear module
Let's add a linear module to our model.

In [3]:
model:add(nn.Linear(3,4))

nn.Sequential {
  [input -> (1) -> output]
  (1): nn.Linear(3 -> 4)
}
{
  gradInput : DoubleTensor - empty
  modules : 
    {
      1 : 
        nn.Linear(3 -> 4)
        {
          gradBias : DoubleTensor - size: 4
          weight : DoubleTensor - size: 4x3
          gradWeight : DoubleTensor - size: 4x3
          gradInput : DoubleTensor - empty
          bias : DoubleTensor - size: 4
          output : DoubleTensor - empty
        }
    }
  output : DoubleTensor - empty
}


Ouch, that printed a lot of things. Let's ignore most of it for now and concentrate on the 3 and 4: the module takes a vector of size 3 and maps it to a vector of size 4.

The linear module is basically just:
$$ Y = WX + b$$ where X, Y and b are vectors and W is a matrix. Here X has size 3, Y and b have size 4, and W has size 4x3.

If you look at the printed module, there are internal variables for both: weight holds W and bias holds b.

Let's verify if they actually work like that. Consider an initialization for X:

In [4]:
X = torch.randn(3)

In [5]:
X

 0.6086
 0.0635
-0.0813
[torch.DoubleTensor of size 3]



## The forward pass
You can get the output from any torch module by calling <module_name>:forward(input). Let's do that here to get Y.

In [6]:
Y = model:forward(X)

In [7]:
Y

-0.5239
-0.3472
-0.4393
 0.2691
[torch.DoubleTensor of size 4]



Now, let's replicate that as a math calculation.

First let's access the weight of that model.

In [8]:
W = model:get(1).weight

In [9]:
W

-0.3449 -0.5608 -0.0103
-0.1104  0.0342 -0.2782
 0.0876  0.3920  0.2362
 0.0042  0.3910  0.5008
[torch.DoubleTensor of size 4x3]



The model:get(1) command accesses the first module of model which is the Linear module we just defined. Next we access the weight variable of that module by chaining a '.weight'.

Similarly, we can get the bias of that linear module like this:

In [10]:
b = model:get(1).bias

In [11]:
b

-0.2792
-0.3048
-0.4983
 0.2825
[torch.DoubleTensor of size 4]



Now we can actually do some math. The Linear module can be expressed in torch maths as follows:

In [12]:
-- The * operator is overloaded with matrix multiplication
W*X+b

-0.5239
-0.3472
-0.4393
 0.2691
[torch.DoubleTensor of size 4]
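The same product can also be checked by hand. Written per component, and plugging in the rounded values of W, X and b printed earlier (so the last digit is only approximate), the first entry works out as follows:

$$ y_i = \sum_{j=1}^{3} w_{ij} x_j + b_i, \qquad i = 1, \dots, 4 $$

$$ y_1 \approx (-0.3449)(0.6086) + (-0.5608)(0.0635) + (-0.0103)(-0.0813) + (-0.2792) \approx -0.5239 $$

This agrees with the first entry of the result.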



Compare this with the output of the above model. 

In [13]:
model:get(1).output

-0.5239
-0.3472
-0.4393
 0.2691
[torch.DoubleTensor of size 4]



Yes, the modules store their previous outputs in the output variable. This is useful in calculating the gradient later on.
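One practical consequence of this storage is worth a small sketch. As far as I understand the nn source, forward returns the module's own output tensor rather than a copy, so a second forward call overwrites the earlier result. The names `net`, `y1`, `y2` and `saved` below are hypothetical and kept separate from the notebook's `model`.

```lua
require 'torch'
require 'nn'

net = nn.Sequential()
net:add(nn.Linear(3, 4))

y1 = net:forward(torch.randn(3))
saved = y1:clone()                 -- independent copy of the first output
y2 = net:forward(torch.randn(3))   -- reuses the module's output buffer

-- y1 and y2 refer to the same tensor, so y1 now holds the second result.
print((y1 - y2):abs():max())       -- 0
-- The clone still holds the first result, so this is generally nonzero.
print((saved - y2):abs():max())
```

If you need to keep an output across forward calls, clone it first.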

## The backward pass

Just like the forward pass gave us the output, a backward pass through a module will give us its backpropagated gradient.

Before doing anything, we should clear our gradient parameters. What gradient parameters?

Let's print out the internals of model again:

In [14]:
model

nn.Sequential {
  [input -> (1) -> output]
  (1): nn.Linear(3 -> 4)
}
{
  gradInput : DoubleTensor - empty
  modules : 
    {
      1 : 
        nn.Linear(3 -> 4)
        {
          gradBias : DoubleTensor - size: 4
          weight : DoubleTensor - size: 4x3
          gradWeight : DoubleTensor - size: 4x3
          gradInput : DoubleTensor - empty
          bias : DoubleTensor - size: 4
          output : DoubleTensor - size: 4
        }
    }
  output : DoubleTensor - size: 4
}


Do you notice the grad variables? Here's how they map to mathematical expressions, where E is the final Energy/Error of your model:
* gradBias = $\frac{\partial E}{\partial b}$
* gradWeight = $\frac{\partial E}{\partial W}$
* gradInput = $\frac{\partial E}{\partial X}$

Torch's :backward() function allows you to calculate the above values easily. However, we need to perform one operation before doing that, which is:

In [15]:
model:zeroGradParameters()




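This is essential because :backward() adds to gradWeight and gradBias instead of overwriting them, so any leftover values (uninitialized memory, or gradients from an earlier call) would spoil your gradients. Now, let's check out the values of these parameters.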

In [16]:
model:get(1).gradBias

 0
 0
 0
 0
[torch.DoubleTensor of size 4]



In [17]:
model:get(1).gradWeight

 0  0  0
 0  0  0
 0  0  0
 0  0  0
[torch.DoubleTensor of size 4x3]



In [18]:
model:get(1).gradInput

[torch.DoubleTensor with no dimension]



Now, we perform the backward pass:

In [19]:
gradOutput = torch.Tensor(4):fill(2)

In [20]:
-- format --> model:backward(input_passed, error_that_came_from_previous_modules)

model:backward(X, gradOutput)

-0.7272
 0.5126
 0.8971
[torch.DoubleTensor of size 3]



The second parameter of backward is the gradient of the error with respect to this module's output, passed down from the layers above this layer. However, since we don't have any modules above the Linear module, we pass a vector of 2s (just for demonstration). Mathematically, it is:
$$\frac{\partial E}{\partial Y}$$
where E is the final Energy/Error of your model. This value is referred to as gradOutput in torch terminology. The value that backward returns is gradInput.
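In a real model, gradOutput is usually not made up by hand but comes from a loss function. Here is a minimal sketch of that, assuming a mean squared error loss (nn.MSECriterion) and an all-zero target. The target and the names `net`, `criterion`, `dEdY` and `dEdX` are hypothetical and not part of this notebook's running example.

```lua
require 'torch'
require 'nn'

net = nn.Sequential()
net:add(nn.Linear(3, 4))
criterion = nn.MSECriterion()

x = torch.randn(3)
target = torch.zeros(4)                  -- hypothetical target values

y = net:forward(x)
E = criterion:forward(y, target)         -- scalar error E
dEdY = criterion:backward(y, target)     -- plays the role of gradOutput

net:zeroGradParameters()
dEdX = net:backward(x, dEdY)             -- gradient w.r.t. the input

print(dEdY:size(1), dEdX:size(1))        -- 4    3
```

The criterion sits "above" the Linear module, so its backward output is exactly what the Linear module expects as its second argument.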

Now, let's confirm that the values are being calculated as per their respective mathematical expressions.

According to the analytical expression for $\frac{\partial E}{\partial b}$, 
$$ \frac{\partial E}{\partial b} = 
\begin{bmatrix} \frac{\partial E}{\partial y_1}\times \frac{\partial y_1}{\partial b_1} & \frac{\partial E}{\partial y_2}\times \frac{\partial y_2}{\partial b_2} & \frac{\partial E}{\partial y_3}\times \frac{\partial y_3}{\partial b_3} & \frac{\partial E}{\partial y_4}\times \frac{\partial y_4}{\partial b_4}
\end{bmatrix}$$
Now, $ \frac{\partial y_1}{\partial b_1}, \frac{\partial y_2}{\partial b_2}, \frac{\partial y_3}{\partial b_3}, \frac{\partial y_4}{\partial b_4}$ are all 1s and $ \frac{\partial E}{\partial y_1}, \frac{\partial E}{\partial y_2}, \frac{\partial E}{\partial y_3}, \frac{\partial E}{\partial y_4}$ are what we passed in as gradOutput, which is a vector of four 2s. 

Feeding in all these values, you should be able to confirm that the value of gradBias is indeed correct.

In [21]:
model:get(1).gradBias

 2
 2
 2
 2
[torch.DoubleTensor of size 4]
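gradBias equals gradOutput here only because the gradient parameters were zeroed first. The following sketch shows what happens otherwise: calling backward twice without zeroing in between adds the second gradient on top of the first. The names `net`, `x`, `g`, `once` and `twice` are hypothetical and separate from the notebook's `model`.

```lua
require 'torch'
require 'nn'

net = nn.Sequential()
net:add(nn.Linear(3, 4))
x = torch.randn(3)
g = torch.Tensor(4):fill(2)          -- same demo gradOutput as above

net:zeroGradParameters()
net:forward(x)
net:backward(x, g)
once = net:get(1).gradBias:clone()   -- all 2s

-- No zeroGradParameters() here, so the gradients accumulate.
net:forward(x)
net:backward(x, g)
twice = net:get(1).gradBias:clone()  -- all 4s

print(once[1], twice[1])             -- 2    4
```

This accumulation is handy when you want to sum gradients over several samples, but it means you have to zero the parameters yourself before each fresh computation.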



According to the analytical derivation, since $y_i = \sum_j w_{ij} x_j + b_i$, each entry is $\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_i} \times x_j$:
$$\frac{\partial E}{\partial W} = 
\begin{bmatrix}
    \frac{\partial E}{\partial y_1} \times x_{1}       & \frac{\partial E}{\partial y_1} \times x_{2} & \frac{\partial E}{\partial y_1} \times x_{3} \\
    \frac{\partial E}{\partial y_2} \times x_{1}       & \frac{\partial E}{\partial y_2} \times x_{2} & \frac{\partial E}{\partial y_2} \times x_{3} \\
    \frac{\partial E}{\partial y_3} \times x_{1}       & \frac{\partial E}{\partial y_3} \times x_{2} & \frac{\partial E}{\partial y_3} \times x_{3} \\
    \frac{\partial E}{\partial y_4} \times x_{1}       & \frac{\partial E}{\partial y_4} \times x_{2} & \frac{\partial E}{\partial y_4} \times x_{3}
\end{bmatrix}
$$
We compute the three entries of the top row below. Because every entry of gradOutput is 2 here, all four rows come out identical.

In [22]:
gradOutput[1]*X[1]

1.2171901777293


In [23]:
gradOutput[1]*X[2]

0.12706564800652


In [24]:
gradOutput[1]*X[3]

-0.1626833781925	


Match this with:

In [25]:
model:get(1).gradWeight

 1.2172  0.1271 -0.1627
 1.2172  0.1271 -0.1627
 1.2172  0.1271 -0.1627
 1.2172  0.1271 -0.1627
[torch.DoubleTensor of size 4x3]
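With a gradOutput of all 2s every row looks the same, which hides the structure of gradWeight. Here is a small sketch with a non-uniform gradOutput, checking that gradWeight is the outer product of gradOutput and the input, so that entry (i, j) is gradOutput[i] * input[j]. The names `net`, `x`, `g` and `expected` are hypothetical.

```lua
require 'torch'
require 'nn'

net = nn.Sequential()
net:add(nn.Linear(3, 4))
x = torch.randn(3)
g = torch.Tensor{1, 2, 3, 4}         -- non-uniform gradOutput, for illustration

net:zeroGradParameters()
net:forward(x)
net:backward(x, g)

-- torch.ger computes the outer product: expected[i][j] = g[i] * x[j]
expected = torch.ger(g, x)
diff = (net:get(1).gradWeight - expected):abs():max()
print(diff)                          -- 0, up to floating point error
```

Row i of gradWeight is the input scaled by gradOutput[i], so the rows now differ from each other.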



The analytical expression for $\frac{\partial E}{\partial X}$ is:
$$\frac{\partial E}{\partial X} = \begin{bmatrix}
    \frac{\partial E}{\partial y_1}  & \frac{\partial E}{\partial y_2}  & \frac{\partial E}{\partial y_3} & \frac{\partial E}{\partial y_4} 
\end{bmatrix} \times 
\begin{bmatrix}
    w_{11} & w_{12} & w_{13} \\
    w_{21} & w_{22} & w_{23} \\
    w_{31} & w_{32} & w_{33} \\
    w_{41} & w_{42} & w_{43} 
\end{bmatrix}
$$
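For completeness, here is where that product comes from. Each output $y_i$ depends on $x_j$ only through the term $w_{ij} x_j$, so the chain rule gives:

$$ \frac{\partial E}{\partial x_j} = \sum_{i=1}^{4} \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial x_j} = \sum_{i=1}^{4} \frac{\partial E}{\partial y_i} \, w_{ij} $$

As a rough check with the rounded weights printed earlier and a gradOutput of all 2s, the first component is:

$$ \frac{\partial E}{\partial x_1} \approx 2 \times (-0.3449 - 0.1104 + 0.0876 + 0.0042) \approx -0.727 $$

which matches, up to rounding, the first value returned by the backward call.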

To confirm the above expression, here's the weight matrix:

In [29]:
W

-0.3449 -0.5608 -0.0103
-0.1104  0.0342 -0.2782
 0.0876  0.3920  0.2362
 0.0042  0.3910  0.5008
[torch.DoubleTensor of size 4x3]



If we perform the calculation mentioned above, we get:

In [36]:
gradOutput:reshape(1,4)*W

-0.7272  0.5126  0.8971
[torch.DoubleTensor of size 1x3]



Compare this with:

In [35]:
model:get(1).gradInput

-0.7272
 0.5126
 0.8971
[torch.DoubleTensor of size 3]
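The row-vector product above is the same thing as multiplying the transposed weight matrix by gradOutput. A minimal sketch of that check, again with a non-uniform gradOutput and hypothetical names (`net`, `x`, `g`, `dEdX`, `expected`):

```lua
require 'torch'
require 'nn'

net = nn.Sequential()
net:add(nn.Linear(3, 4))
x = torch.randn(3)
g = torch.Tensor{1, 2, 3, 4}         -- non-uniform gradOutput, for illustration

net:zeroGradParameters()
net:forward(x)
dEdX = net:backward(x, g)            -- what the module reports as gradInput

-- W is 4x3, so W:t() is 3x4 and W:t() * g is a vector of size 3
expected = net:get(1).weight:t() * g
diff = (dEdX - expected):abs():max()
print(diff)                          -- 0, up to floating point error
```

Using the transpose keeps the result a plain vector of size 3, so it can be compared with gradInput directly without reshaping.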

