In [1]:
! pip install torch

Collecting torch
[?25l  Downloading https://files.pythonhosted.org/packages/49/0e/e382bcf1a6ae8225f50b99cc26effa2d4cc6d66975ccf3fa9590efcbedce/torch-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (519.5MB)
[K    100% |████████████████████████████████| 519.5MB 31kB/s 
tcmalloc: large alloc 1073750016 bytes == 0x59076000 @  0x7fc9923e91c4 0x46d6a4 0x5fcbcc 0x4c494d 0x54f3c4 0x553aaf 0x54e4c8 0x54f4f6 0x553aaf 0x54efc1 0x54f24d 0x553aaf 0x54efc1 0x54f24d 0x553aaf 0x54efc1 0x54f24d 0x551ee0 0x54e4c8 0x54f4f6 0x553aaf 0x54efc1 0x54f24d 0x551ee0 0x54efc1 0x54f24d 0x551ee0 0x54e4c8 0x54f4f6 0x553aaf 0x54e4c8
[?25hInstalling collected packages: torch
Successfully installed torch-0.4.1


## Deep Learning Building Blocks - Affine Maps , Non-Linearities  and Objectives

Deep Learning - Composing Linear and Non-Linearities 

### Affine Maps

Core workhorses of deep learning . It is a function $f(x)$ $$ f(x) = A(x) +b $$ 

where for a matrix A and vectors x,b. The parameters to be learned are 'A' and 'b' , Often b is refered as bias term

Map the rows of input instead of the columns => i'th row of output <-> maps to <-> i'th row of input under A + bias term

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [4]:
# To ensure the random numbers generated are same over time , seeding manually
torch.manual_seed(1)

<torch._C.Generator at 0x7f153c0a6090>

In [5]:
# linear layer which maps from R^5 -> R^3 
# contains parameters A and b 
lin = nn.Linear(5,3)
# input data
data = torch.randn(2,5)

print(lin(data))

tensor([[ 0.1755, -0.3268, -0.5069],
        [-0.6602,  0.2260,  0.1089]], grad_fn=<ThAddmmBackward>)


### Non-Linearities

Suppose there are two affine maps $f(x)=Ax+b$ and $g(x) = Cx+d$  find $f(g(x))$ 

$$ f(g(x))=A(Cx+d)+b = ACx+(Ad+b)$$

AC is a matrix and Ad+b is a vector => **composing affine maps gives affine maps **

Thus if we want a neural network to be long chains of affine compositions, There is **no new power added to the network** ie the final output can also be achieved by doing a single affine map. 

'without a non-linear activation function in the network, a NN, no matter how many layers it had, would behave just like a single-layer perceptron, because summing these layers would give you just another linear function'

**By Introducing non-linearities between the affine layers , we can build much more powerful models **

Some common non-linearites include $tanh(x) , \sigma(x), ReLU(x)$ as their gradients are easy to compute  eg. 

$$\frac {d\sigma}{dx} = \sigma(x)(1-\sigma(x))$$

*Note*-  **$\sigma(x)$ gradient vanishes very quickly** use **tanh or ReLU instead**


In [6]:
# In pytorch most nonLinearitieas are in  torch.functional 
data = torch.randn(2,2)
print(data)
print(F.relu(data))


tensor([[-0.5404, -2.2102],
        [ 2.1130, -0.0040]])
tensor([[0.0000, 0.0000],
        [2.1130, 0.0000]])


### Softmax and Probabilities

** $Softmax(x)$ ** also a non-linearity usually used as the last operation done in the network . **Takes a vector of real numbers and returns a probability distribution **  Let x be a vector of real numbers (-/+) , then i'th component of $Softmax(x)$ is 

$$ \frac {\exp(x_i)} {\sum_j\exp(x_j) }$$

The output is probability distribution - all elements >=0, summing to 1


In [9]:
# Softmax also in nn.functional

data = torch.randn(5)
print(data)
print(F.softmax(data,dim=0))
# sum =1
print(F.softmax(data,dim=0).sum())
# log softmax - log(softmax) ie log(exp_i/exp(x).sum())
print(F.log_softmax(data,dim=0))

tensor([ 2.2820, -1.2080,  1.1120,  2.2174, -0.4269])
tensor([0.4264, 0.0130, 0.1324, 0.3998, 0.0284])
tensor(1.)
tensor([-0.8523, -4.3423, -2.0222, -0.9168, -3.5611])


### Objective Function 

The function that the network is being trained to minimize ( Loss function/ Cost function )

Training instance --> Neural Network --> Loss of output --> update model by taking derivative of loss 

Eg **Negetive Log Likelihood Loss ** - multi class classification . Train  the network to minimize the negetive log probability of the correct output ( or maximize the log probability of correct output)

## Optimization and Training

With **requires_grad = True** , a Tensor remembers the operation used to create it , eg - z = x+y , then z.grad_fn contains the info that it was a sum of x&y.
Using this info the tensor can compute gradients w.r.t the things that were used to compute it.

Final Loss - after loss function is also a tensor , we can compute gradients w.r.t all parameters used to compute it .And then can perform standard gradient update
Let $\theta$ be the parameters ,$L(\theta)$ be the loss function and $\eta$ learning rate then :

$$ \theta^{(t+1)} = \theta^{(t)}  - \eta\nabla_\theta L(\theta)$$

Where $\nabla$ (Del or nabla ) is a vector diffenrtial operator

Torch has torch.optim package that can handle all the calculations . Optimization - different learning ratesm different update algo's like replacing SGD with Adam or RMSProp can lead to better performance.


## Creating Network Components in Pytorch

Creating a network that takes in a sparse bag-of-words represenations and outputs a probability distribution over two labels "English" and "Spanish" (Logistic Regression)

** Logistic Bag-Of-words Classifier **

map a sparse BOW representation to log probabilities over labels . Assign each word in vocab an index . Eg only two  words - 'Hello' and 'World' -> assign index 0,1

Thus Hello -> [1,0]  , Hello Hello  World -> [2,1] 

In general  $$ [Count(Hello),Count(World)]$$

Denote this BOW vector as x , then output of the network is :
$$ log Softmax(Ax+b)$$


In [0]:
# data =[("me gusta comer en la cafeteria".split(),"Spanish"),
#       ("Give me the location".split(),"English"),
#       ("Donde esta la biblioteca".split(),"Spanish"),
#       ("Where is the train station".split(),"English")]

# test_data = [("Yo creo que si".split(),"Spanish"),
#             ("I cannot find the bottle".split(),"English")]

data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]




In [27]:
# Word to index mapping , map each word in vocab to a unique integer
# which will be its index into the bag-of-words vector

word_to_ix={}

for sentence, _ in data + test_data:
  for word in sentence:
    if word not in word_to_ix:
      # word:index (current length)
      word_to_ix[word] = len(word_to_ix)
      
print(word_to_ix)

{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}


In [0]:
VOCAB_SIZE = len(word_to_ix)
NUM_LABELS =2

In [0]:
# The neural network to classify the text 

class BOWClassifier(nn.Module):
  
  def __init__(self,num_labels,vocab_size):
    
    super(BOWClassifier,self).__init__()
    # Defining the linear layer that takes an input of size vocab size 
    # and outputs maps to the number of labels
    # Need the parameters A and b of affine mapping
    # Using nn.Linear that provides the affine map
    self.linear = nn.Linear(vocab_size,num_labels)
    
  def forward(self,bow_vec):
    # Pass the input through the linear layer 
    # apply log softmax at the end
    return F.log_softmax(self.linear(bow_vec),dim=1)
  
  
#Helper functions to map input and output
  
def make_bow_vector(sentence,word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
      vec[word_to_ix[word]] += 1
    return vec.view(1,-1)
  
def make_target(label,label_to_ix):
    return torch.LongTensor([label_to_ix[label]])
    
    

In [0]:
model = BOWClassifier(NUM_LABELS,VOCAB_SIZE)



In [31]:
# First param is A , second is b
for param in model.parameters():
  print(param)

Parameter containing:
tensor([[ 0.1130,  0.1821, -0.1218,  0.0426,  0.1692,  0.1300,  0.1222,  0.1394,
          0.1240,  0.0507, -0.1341, -0.1647, -0.0899, -0.0228, -0.1202,  0.0717,
          0.0607, -0.0444,  0.0754,  0.0634,  0.1197,  0.1321, -0.0664,  0.1916,
         -0.0227, -0.0067],
        [-0.1851, -0.1262, -0.1146, -0.0839,  0.1394, -0.0641, -0.1466,  0.0755,
          0.0628,  0.1270, -0.1015,  0.0425, -0.0714, -0.0441, -0.1563, -0.0894,
         -0.0601,  0.0839,  0.0358,  0.0484,  0.1957,  0.1911,  0.1338,  0.0062,
         -0.1357,  0.1533]], requires_grad=True)
Parameter containing:
tensor([-0.0490, -0.0159], requires_grad=True)


In [32]:
# Running a sample run thus no grad
with torch.no_grad():
  sample = data[0]
  print(sample)
  bow_vector = make_bow_vector(sample[0],word_to_ix)
  print(bow_vector)
  log_probs = model(bow_vector)
  print(log_probs)

(['me', 'gusta', 'comer', 'en', 'la', 'cafeteria'], 'SPANISH')
tensor([[1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0.]])
tensor([[-0.3365, -1.2527]])


#### Training the model

pass instances through model to get log_probabilites -> compute loss -> compute gradient of loss -> update params with gradient step

Loss function - nn.NLLLoss() , optimizer - optim.SGD

input to NLLLoss is a vector of log_probabilities and a target label , it doesn't compute the log probabilities . 
In case of 'nn.CrossEntropyLoss()' ( similar to NLLLoss) the log_softmax step is done 

In [35]:
# label indexes
label_to_ix ={"SPANISH":0,"ENGLISH":0}

# Running on test data to compare before and after 
with  torch.no_grad():
  print('before training')
  for statement, label in test_data:
    bow_vec = make_bow_vector(statement,word_to_ix)
    log_probs = model(bow_vec)
    print(log_probs)

# printing the matrix column corresponding to "creo"
print('Accuracy ->',next(model.parameters())[:,word_to_ix["creo"]])

before training
tensor([[-0.6806, -0.7059]])
tensor([[-0.5845, -0.8150]])
Accuracy -> tensor([-0.1341, -0.1015], grad_fn=<SelectBackward>)


In [36]:
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(),lr=0.1)

# large epoch as small dataset
for epoch in range(100):
  for statement,label in data:
    # Need to clear the gradients as they get accumulated
    model.zero_grad()
    
    # Making the input and output as tensors
    bow_vec = make_bow_vector(statement,word_to_ix)
    target = make_target(label,label_to_ix)
    
    # Forward Pass
    log_probs = model(bow_vector)
    
    # Calculate loss , gradient and backpropagate error , update params
    loss = loss_function(log_probs,target)
    loss.backward()
    optimizer.step()
    
    
# after training running the test set
with torch.no_grad():
  print('After training')
  for statement,label in test_data:
    bow_vec = make_bow_vector(statement,word_to_ix)
    log_probs = model(bow_vec)
    print(log_probs)
    
print('Accuracy ->',next(model.parameters())[:,word_to_ix["creo"]])

After training
tensor([[-0.3715, -1.1702]])
tensor([[-0.1563, -1.9333]])
Accuracy -> tensor([-0.1341, -0.1015], grad_fn=<SelectBackward>)
