# Goal

Cover the course material up to lecture 10.

Source material: http://cs231n.stanford.edu/syllabus.html

# Activation Functions

## Sigmoid

$\sigma(x) = 1/(1+e^{-x})$, output $\in (0,1)$

Problems:
- Saturation kills gradients: at the tails the gradient is almost 0, so almost no signal flows back through the unit
- Outputs are not zero-centered: later layers receive all-positive inputs, which can lead to zigzagging gradient updates
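To make the saturation problem concrete, here is a minimal sketch that evaluates the sigmoid's derivative, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, at a few points. The helper names `sigmoid` and `sigmoid_grad` are illustrative and not taken from the course material.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks quickly toward both tails.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x = {x:6.1f}  sigmoid = {sigmoid(x):.5f}  gradient = {sigmoid_grad(x):.5f}")
```

The gradient is at most 0.25 (at $x = 0$), so even in the best case each sigmoid layer scales the backpropagated signal down.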

## tanh

$\tanh(x) = 2\sigma(2x) - 1$, output $\in (-1,1)$

Unlike sigmoid, this one is zero-centered, but the saturation problem remains
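The identity between tanh and the sigmoid can be checked numerically. This is a small sketch; `tanh_via_sigmoid` is a hypothetical helper name used only for this comparison.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh written as a scaled, shifted sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    # The last two columns should agree up to floating-point rounding.
    print(x, math.tanh(x), tanh_via_sigmoid(x))
```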

## Rectified Linear Unit (ReLU)

$f(x) = \max(0,x)$

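- Pro: greatly accelerates the convergence of SGD (supposedly because of its linear, non-saturating form)
- Pro: cheap to compute, just thresholds the activations at zero
- Con: units can "die" if the learning rate is too high: a large update can push a unit into a regime where it outputs 0 for every input, so its gradient stays 0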

## Leaky ReLU

$f(x) = \mathbb{1}(x < 0) (\alpha x) + \mathbb{1}(x \geq 0) (x)$ for some small constant $\alpha$

- Attempts to fix the dying ReLU problem; results on whether it actually helps are mixed
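The difference between ReLU and leaky ReLU shows up in their gradients for negative inputs. The sketch below uses scalar inputs. The value `alpha=0.01` is an assumed default for illustration, and the gradient at exactly $x = 0$ is a convention chosen here.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (0 chosen by convention at x == 0).
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x >= 0, alpha * x for x < 0."""
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # The slope on the negative side is alpha instead of 0.
    return 1.0 if x >= 0 else alpha

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

A ReLU unit whose input is negative for every training example receives zero gradient and cannot recover, which is the "dead" state. The leaky variant always passes back at least a gradient of `alpha`.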

## Maxout

$\max(w_1^T x + b_1, w_2^T x + b_2)$

- Generalizes ReLU and leaky ReLU (ReLU is the special case $w_1 = 0$, $b_1 = 0$), but doubles the number of parameters per unit
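The claim that Maxout generalizes ReLU and leaky ReLU can be checked directly for a single unit. The weights, inputs, and `alpha` below are arbitrary values chosen for illustration, and the helper names are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def maxout(x, w1, b1, w2, b2):
    """Maxout unit with two linear pieces: max(w1.x + b1, w2.x + b2)."""
    return max(dot(w1, x) + b1, dot(w2, x) + b2)

def relu_unit(x, w, b):
    """ReLU applied to a linear unit."""
    return max(0.0, dot(w, x) + b)

def leaky_relu_unit(x, w, b, alpha):
    """Leaky ReLU applied to a linear unit."""
    z = dot(w, x) + b
    return z if z >= 0 else alpha * z

w, b, alpha = [0.3, 0.8, -1.0], 0.1, 0.01   # arbitrary example values
zeros = [0.0, 0.0, 0.0]
inputs = [[1.0, -2.0, 0.5], [1.0, 2.0, 0.5]]  # one negative, one positive pre-activation

for x in inputs:
    # w1 = 0, b1 = 0 turns Maxout into a ReLU unit.
    print(maxout(x, zeros, 0.0, w, b), relu_unit(x, w, b))
    # w1 = alpha * w, b1 = alpha * b turns it into a leaky ReLU unit (for 0 < alpha < 1).
    scaled_w = [alpha * wi for wi in w]
    print(maxout(x, scaled_w, alpha * b, w, b), leaky_relu_unit(x, w, b, alpha))
```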

## TLDR

- Try ReLU; if that doesn't work well, try leaky ReLU, then Maxout, then tanh


# PyTorch

## Example from cs231n

```
x  y
 \/
 *
 |    z
 a   /
  \ /
   +
   |
   b
   |
   SUM
   |
   c

```

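```python
import torch
from torch.autograd import Variable

N, D = 3, 4
# Leaf nodes of the graph; you can .cuda() these if compiled with GPU support
x = Variable(torch.randn(N, D), requires_grad=True)
y = Variable(torch.randn(N, D), requires_grad=True)
z = Variable(torch.randn(N, D), requires_grad=True)

# Forward pass builds the computational graph shown above
a = x * y
b = a + z
c = torch.sum(b)

# Backward pass fills in .grad for every leaf that requires gradients
c.backward()

print(x.grad.data)  # dc/dx = y
print(y.grad.data)  # dc/dy = x
print(z.grad.data)  # dc/dz = 1
```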


```
-0.9019 -1.3741  0.5085 -0.0164
 0.6451  0.8173  0.4622  1.7398
-0.9370  0.3392  0.5834  0.1341
[torch.FloatTensor of size 3x4]

 0.0014  2.3757 -0.7652  0.2973
-0.4328  0.5511 -1.1725 -0.3712
-0.0399  0.1815 -0.9876  1.3819
[torch.FloatTensor of size 3x4]

 1  1  1  1
 1  1  1  1
 1  1  1  1
[torch.FloatTensor of size 3x4]
```
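The gradients printed above follow from $c = \sum (x \odot y + z)$: $\partial c / \partial x = y$, $\partial c / \partial y = x$, and $\partial c / \partial z = 1$. As a sanity check that does not rely on autograd, the sketch below estimates the same gradients by centered finite differences in plain Python. The function names, the flattened list layout, and the step size `eps` are assumptions made for this illustration.

```python
import random

def forward(x, y, z):
    """c = sum(x * y + z), computed elementwise over flat lists."""
    return sum(xi * yi + zi for xi, yi, zi in zip(x, y, z))

def numerical_grad(f, v, eps=1e-6):
    """Centered finite-difference estimate of df/dv[i] for every entry of v."""
    grad = []
    for i in range(len(v)):
        old = v[i]
        v[i] = old + eps
        plus = f()
        v[i] = old - eps
        minus = f()
        v[i] = old  # restore the original value
        grad.append((plus - minus) / (2 * eps))
    return grad

random.seed(0)
n = 12  # N * D = 3 * 4 entries, flattened
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]

f = lambda: forward(x, y, z)
dx = numerical_grad(f, x)  # should match y
dy = numerical_grad(f, y)  # should match x
dz = numerical_grad(f, z)  # should be all ones

print(max(abs(a - b) for a, b in zip(dx, y)))
print(max(abs(a - b) for a, b in zip(dy, x)))
print(max(abs(a - 1.0) for a in dz))
```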



## Abstraction Levels

1. Tensor: imperative ndarray that can also run on the GPU
2. Variable: node in a computational graph; stores data and gradient
3. Module: neural network layer; may store state or learnable weights

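```python
import torch
from torch.autograd import Variable

# Two-layer ReLU network trained with manual gradient descent on random data
N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in), requires_grad=False)   # inputs
y = Variable(torch.randn(N, D_out), requires_grad=False)  # targets
w1 = Variable(torch.randn(D_in, H), requires_grad=True)
w2 = Variable(torch.randn(H, D_out), requires_grad=True)

learning_rate = 1e-6
niters = 500
for t in range(niters):
    # Forward pass: linear -> ReLU (clamp at 0) -> linear
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass: computes w1.grad and w2.grad
    loss.backward()

    # Gradient descent step on the raw tensors, outside the graph
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data

    # Gradients accumulate across backward() calls, so reset them
    w1.grad.data.zero_()
    w2.grad.data.zero_()

print(w1)
print(w2)
```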

```
Variable containing:
-2.8240e+00 -1.7631e+00  6.9515e-01  ...   6.3589e-01  1.0783e+00 -4.9382e-01
 6.7872e-01 -3.3975e-02  2.3419e-01  ...   4.0380e-01  4.5888e-01  2.8980e-01
-8.1475e-01  1.1521e+00 -1.2944e+00  ...  -3.9297e-01 -1.5169e-01 -5.4367e-03
                ...                   ⋱                   ...
 5.5354e-01 -6.9006e-01  5.3435e-01  ...  -1.3042e+00 -3.2528e-01 -3.3605e-01
 4.3530e-01  6.1585e-01 -1.0339e+00  ...  -2.9458e-01  4.5648e-01  5.1100e-01
-4.3797e-01  1.0416e+00 -3.5381e-01  ...   1.0808e+00 -2.1105e+00  1.5029e+00
[torch.FloatTensor of size 1000x100]

Variable containing:
 0.3657 -0.0616  0.4886  ...   0.0241  0.0367 -1.2944
 0.0012 -2.3050 -0.5206  ...   0.1069  2.0296  0.3539
 0.1414  1.2347  1.8882  ...   1.8816  0.5254  0.7998
          ...             ⋱             ...
-0.6765 -1.1699  1.2648  ...  -0.3919 -0.6870  0.5645
 0.2284 -0.8989  0.5521  ...  -0.5136  1.2102 -0.5416
 0.1527 -0.0769 -0.2782  ...   0.4135 -0.0459  0.0
```

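```python
import torch
from torch.autograd import Variable
from torch.optim import Adam
from torch.nn import Sequential, Linear, ReLU, MSELoss

N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in), requires_grad=False)
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Same two-layer network, now built from modules that own their weights
model = Sequential(Linear(D_in, H), ReLU(), Linear(H, D_out))
# Sum (rather than average) the squared errors; newer PyTorch releases
# spell this MSELoss(reduction='sum')
loss_fn = MSELoss(size_average=False)

learning_rate = 1e-4
optimizer = Adam(model.parameters(), lr=learning_rate)
niters = 500
for t in range(niters):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)

    # A manual gradient descent step would look like this:
    #
    # model.zero_grad()
    # loss.backward()
    # for p in model.parameters():
    #     p.data -= learning_rate * p.grad.data
    #
    # The optimizer packages the same zero/backward/update pattern
    # (here with Adam's update rule instead of plain gradient descent):

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(list(model.parameters()))
```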

```
[Parameter containing:
-7.1715e-03 -3.1891e-02  1.6240e-02  ...  -3.1562e-02  2.5376e-02  2.8484e-02
-1.4594e-02 -2.0420e-02 -6.8582e-03  ...  -3.0192e-02  3.0163e-02 -8.2229e-03
-2.2499e-02 -1.5067e-02  6.6661e-03  ...   5.0411e-03 -2.2978e-02 -9.7512e-03
                ...                   ⋱                   ...
-3.0491e-03  2.1939e-02  1.8735e-02  ...   1.4093e-02  2.9189e-02  1.3363e-02
 9.0013e-03  1.8239e-02 -8.3933e-03  ...  -1.3175e-03  2.8808e-02 -1.6543e-02
 2.2596e-02  9.0188e-03 -3.6682e-02  ...   1.5825e-02 -8.7305e-03 -3.0213e-02
[torch.FloatTensor of size 100x1000]
, Parameter containing:
1.00000e-02 *
  2.6924
 -1.4046
 -2.0361
 -0.9474
  1.4091
  1.4870
 -0.7614
  1.3763
  1.1454
  0.4435
 -0.4593
  0.7179
 -1.3324
  1.9364
  2.8984
 -1.3482
  2.4812
 -1.9819
  0.5796
  0.1031
  2.1817
  2.8111
  4.3834
  2.5566
 -0.5711
 -0.2461
  1.7617
  1.1214
  3.6693
  3.1884
 -1.0156
  0.2164
  2.9482
  4.3420
  2.2797
  2.6839
  2.8039
  1.4191
  0.7350
  0.8
```

# Recurrent Neural Networks

$h_t = f_W (h_{t-1}, x_t)$

- one-to-one: vanilla feed-forward network, no recurrence
- one-to-many (e.g. image captioning)
- many-to-one (e.g. sentiment classification)
- many-to-many (e.g. machine translation)
- many-to-many with one output per input step (e.g. video frame labeling)
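The recurrence $h_t = f_W(h_{t-1}, x_t)$ can be sketched in a few lines. The code below assumes the common vanilla choice $f_W(h, x) = \tanh(W_{hh} h + W_{xh} x)$, which the notes above do not spell out. The sizes, weights, and inputs are made-up values for illustration only.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def rnn_step(h_prev, x, W_hh, W_xh):
    """One vanilla RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    a = matvec(W_hh, h_prev)
    b = matvec(W_xh, x)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

# Hypothetical sizes and weights: hidden size 2, input size 3.
W_hh = [[0.5, -0.3], [0.1, 0.8]]
W_xh = [[0.2, 0.0, -0.4], [0.7, -0.1, 0.3]]

xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # length-3 input sequence
h = [0.0, 0.0]                                            # initial hidden state
for x in xs:
    h = rnn_step(h, x, W_hh, W_xh)  # the same weights are reused at every step
    print(h)
```

In a many-to-one setup only the final `h` would feed the output layer. In the per-step many-to-many setup every intermediate `h` would.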

## Backpropagation through time

Run the forward pass through the entire sequence to compute the loss, then backpropagate through the entire sequence to compute the gradient

### Truncated version

Run forward and backward through chunks of the sequence instead of the whole sequence. Hidden states are carried forward from one chunk to the next, but gradients are only backpropagated within the current chunk

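Problem: backpropagating through many steps multiplies the gradient repeatedly by the recurrent weight matrix
- if its largest singular value is > 1, gradients explode, so we clip them
- if its largest singular value is < 1, gradients vanish; changing the architecture (e.g. to an LSTM) helps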
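Both effects are easy to see in a scalar toy case, and clipping is a one-line rescale. The sketch below assumes a linear scalar recurrence $h_t = w\,h_{t-1}$, for which $\partial h_T / \partial h_0 = w^T$. The function `clip_by_norm` and the threshold value are hypothetical choices for this illustration, not a specific library API.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)

# Scalar linear recurrence h_t = w * h_{t-1}: the gradient through T steps is w ** T.
T = 50
for w in (0.9, 1.1):
    print(w, w ** T)  # shrinks toward 0 for w < 1, blows up for w > 1

big = [30.0, -40.0]                      # L2 norm 50, above the threshold
clipped = clip_by_norm(big, threshold=5.0)
print(clipped, math.sqrt(sum(g * g for g in clipped)))

small = [0.3, 0.4]                       # L2 norm 0.5, left untouched
print(clip_by_norm(small, threshold=5.0))
```

Clipping keeps the gradient's direction and only limits its length, which is why it helps with exploding gradients but does nothing for vanishing ones.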

# Long Short Term Memory (LSTM)

Four gates are computed from $h_{t-1}$ and $x_t$ at each step:

- f, forget gate: whether to erase the cell, sigmoid
- i, input gate: whether to write to the cell, sigmoid
- g, candidate values (sometimes called the "gate gate"): how much to write to the cell, tanh
- o, output gate: how much of the cell to reveal, sigmoid
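
For reference, the standard LSTM update that these gates feed into can be written as follows. The notation is assumed here rather than taken from the notes above: $W$ is the stacked weight matrix, $c_t$ is the cell state, and $\odot$ is the elementwise product.

$$
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix}
=
\begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}
$$

$$
c_t = f \odot c_{t-1} + i \odot g, \qquad h_t = o \odot \tanh(c_t)
$$

The cell update is additive, so when $f$ is close to 1 the gradient flows from $c_t$ back to $c_{t-1}$ without repeated multiplication by the recurrent weight matrix. This is what eases the vanishing-gradient problem.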