### ----  

requirements (if run locally) : 
- `conda create -n td4 python=3.6`
- `source activate td4`
- `pip install jupyter`
- `pip install torch torchvision`
- `conda install -c conda-forge spacy `
- `python -m spacy download en_core_web_sm`
- `cd ./td4`
- `jupyter notebook`

### ----  



# Machine Learning for NLP : TD 4 
## _Description_

### Course takeaways

### TD outline 

1. Introduction to pytorch
2. Sequence Labelling with pytorch


### Resources : 

https://pytorch.org/tutorials/  
https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html    
https://stats.stackexchange.com/questions/268202/backpropagation-algorithm-nn-with-rectified-linear-unit-relu-activation   
https://ruder.io/optimizing-gradient-descent/

## Pytorch
PyTorch is a Python based library for scientific computing that provides three main features:
- An n-dimensional Tensor, which is similar to numpy but can run on GPUs
- Easily build big computational graphs for deep learning
- Automatic differentiation for computing gradients 

Usages : 
- It’s a Python-based scientific computing package targeted at two sets of audiences:
    - A replacement for NumPy to use the power of GPUs
    - a deep learning research platform that provides maximum flexibility and speed


## Pytorch basics

**NB** : Tensor are the basics block of pytorch. Tensor allows to store data (input data or target data) as well as the parameters (also called weights, neurons,...) of your neural network.


- tensor creation 
- tensor types 
- basic operations between tensors
- from and to numpy 
- about GPU 

In [0]:
%load_ext autoreload
%autoreload 2

import torch
import torch.nn as nn
import torch.nn.functional as F

import numpy as np

### Tensors


**What is a pytorch tensor ?** : A torch.Tensor is a multi-dimensional matrix containing elements of a single data type.

Tensors are similar to NumPy’s ndarrays, with the addition being that Tensors can also be used on a GPU to accelerate computing.

**How to define a pytoch tensor ?**
- using existing constructors : _torch.ones_ , _torch.zeros_ _torch.rand_
- based on existing object
    - from another tensor (or only using the shape of the other tensor)
    - from a python list 
    - from a numpy array

In [0]:
# define 
ones = torch.ones(3,2)
# a tensor can be printed
print(ones)

tensor([[1., 1.],
        [1., 1.],
        [1., 1.]])


In [0]:
# other basic definition 
print(torch.zeros(5,3), "\n", 
      torch.rand(2,3), "\n", 
      torch.empty(2,2))

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]) 
 tensor([[0.4385, 0.6439, 0.6972],
        [0.5278, 0.9691, 0.7974]]) 
 tensor([[2.3786e-35, 0.0000e+00],
        [0.0000e+00, 0.0000e+00]])


In [0]:
# from a python list 
ls = [[[1,3,5,6],[-1,4,4,4]],[[-1,-3,-5,-6],[10,-4,-4,-4]]]
tensor = torch.Tensor(ls)
print(tensor)
# from a numpy array : 
array = np.array([0,1])
#array
tensor = torch.from_numpy(array)
print(tensor)
# symetrically  tensor.numpy()

tensor([[[ 1.,  3.,  5.,  6.],
         [-1.,  4.,  4.,  4.]],

        [[-1., -3., -5., -6.],
         [10., -4., -4., -4.]]])
tensor([0, 1])


In [0]:
# list must be in a proper matrix shape
ls = [[[1,3,5,6],[-1,4,4,4]],[[-1,-3,-5,-6],[10,-4,-4]]]
torch.Tensor(ls)

ValueError: ignored

**Basic manipulations**
- access type / change data types 
- access elements 
- reshape 
- maths opertions : add, multiply , ..
- differentiate / derive
- set to a specific _device_ : GPU , GPU:0, GPU:1 , CPU ...

In [0]:
# get type 
print(tensor,tensor.dtype)
# change type 
tensor = tensor.float()

tensor([0, 1]) torch.int64


**NB** : types are important in Deep Learning  because : 
- some types are more memory consumming than others : e.g : float16 vs float32
- some operations require specific type (cf. Embedding layer ...)

In [0]:
tensor = torch.rand(5,2,2)
print(tensor)
# access one element
print(tensor[0,1,1])
# access several element
print(tensor[:3,0,:2])

tensor([[[0.7511, 0.0401],
         [0.3000, 0.1814]],

        [[0.3512, 0.0374],
         [0.7119, 0.9846]],

        [[0.9080, 0.3035],
         [0.0668, 0.7956]],

        [[0.1612, 0.3080],
         [0.0748, 0.5270]],

        [[0.4311, 0.0938],
         [0.5968, 0.4802]]])
tensor(0.1814)
tensor([[0.7511, 0.0401],
        [0.3512, 0.0374],
        [0.9080, 0.3035]])


**NB** : pytorch tensor indexing exactly match numpy indexing

In [0]:
# get the shape of a tensor
tensor.size()
# reshape it 
print(tensor, "\n",
      tensor.view(2,2,5))

tensor([[[0.7511, 0.0401],
         [0.3000, 0.1814]],

        [[0.3512, 0.0374],
         [0.7119, 0.9846]],

        [[0.9080, 0.3035],
         [0.0668, 0.7956]],

        [[0.1612, 0.3080],
         [0.0748, 0.5270]],

        [[0.4311, 0.0938],
         [0.5968, 0.4802]]]) 
 tensor([[[0.7511, 0.0401, 0.3000, 0.1814, 0.3512],
         [0.0374, 0.7119, 0.9846, 0.9080, 0.3035]],

        [[0.0668, 0.7956, 0.1612, 0.3080, 0.0748],
         [0.5270, 0.4311, 0.0938, 0.5968, 0.4802]]])


In [0]:
intTensor = torch.ones(3,2, dtype=torch.float32)
print(intTensor, intTensor.dtype)
intTensor.int()

tensor([[1., 1.],
        [1., 1.],
        [1., 1.]]) torch.float32


tensor([[1, 1],
        [1, 1],
        [1, 1]], dtype=torch.int32)

### All operations on tensors 
- all reshape 
- squeeze 
- sum , prod 
- max, norm ...

## Automatic Differentiation 

The core component of any modern deep learning library is _Automatic Differentiation_. 


**Recall**
- Training any deep learning model requires backpropagatation 
- Backpropagation is an algorithm that efficiently computes the gradient of a neural network's output based on its input and with regard to all its parameters (or also named weights)

_Automatic Differentiation_ provides a way of automatically computing gradients of any function. In other words, _automatic differentiation_ gives you the possibility to build complex neural network without caring about computing the gradients by yourself. 


**NB** 

Having access to an open source library that performs Automatic Differentation (tensorflow/pytorch and before Dynet or Theano..) is one of the reasons for the popularity and sucess of Deep Learning today.

### Automatic Differentiation in a nutshell


**Definition**
Automatic differentiation refers to a general way of taking a program which computes a value, and automatically constructing a procedure for computing derivatives of that value.

Automatic Differentation requires 3 steps 

1. Building a computation Graph 
2. propagating inputs throughout the graph (forward pass)
3. Computing gradient of each of the node in the graph (backward pass)

In [0]:
x = torch.ones(2, 2, requires_grad=True)
# double checking if gradient 
print("Checking gradient is set to {}. Its gradient is still {} ".format(x.requires_grad, x.grad))

Checking gradient is set to True. Its gradient is still None 


In [0]:
# let us define a basic operation
y = x+1
print(y)

tensor([[2., 2.],
        [2., 2.]], grad_fn=<AddBackward0>)


In [0]:
# y has now a gradient attribute , grad is none
y.grad_fn, y.grad

(<AddBackward0 at 0x7f37e11a7ba8>, None)

In [0]:
z = y * y * 3
out = z.mean()
print(z, out, z.grad)

tensor([[12., 12.],
        [12., 12.]], grad_fn=<MulBackward0>) tensor(12., grad_fn=<MeanBackward0>) None


In [0]:
out.backward()
# Let's inspect the gradient at each previous variable' gradients now
print("Gradients with regard to intermediate nodes:", out.grad, z.grad, y.grad)
print("Gradients with regard to the input node that we considered to be the parameter:", x.grad)

Gradients with regard to intermediate nodes: None None None
Gradients with regard to the input node that we considered to be the parameter: tensor([[3., 3.],
        [3., 3.]])


### Questions:
- Find the function that is being differentiated with regard to x_ij.
- Try to manually retrieve the same gradient with the function you found for x=[[1, 1], [1, 1]].

In [0]:
# to manipulate a tensor without its gradient 
out.detach()

tensor(12.)

## Pytorch Model


Our goal is to define a deep learning model, train it, make prediction with it and evaluate it. 

With pytorch this means doing the three following "scripts" : 
1. Defining the model 
2. Implementing the prediction 
3. Implementing the training loop 
    - Defining a loss
    - Defining an optimizer
    - Loop :
        - forward pass 
        - backward pass
        - applying optimization update rule
4. Evaluating the model / playing with it 
    - You can use the training criteria (loss) as your evaluation score
    - You can use another score : accuracy, F1 , ...

### 1. Defining the model 
Pytorch models always follow the same template : 

- a class
- defining all layers (or parameters) in _init_()
- defining the forward pass in foward()

Let's see what it looks like with a simple 2 layers model.

All trivial Neural Network layers can generally be found in [torch.nn](https://pytorch.org/docs/stable/nn.html).

**Warning**: All your parametrized modules (Layers or any trainable vectors) must be defined as *direct* attributes to your ```nn.Module``` class so that the call to ```.backward()``` can properly propagate the gradients through everything. To define layers in list attribute, (resp. dictionary attributes) use ```ModuleList``` (resp. ```ModuleDict```).


<img src="./imgs/nn.png">



In [0]:
# defining the model 
class MyModel(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(MyModel, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H, bias=True)
        self.linear2 = torch.nn.Linear(H, D_out, bias=True)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = torch.relu(self.linear1(x))
        y_pred = self.linear2(h_relu)
        return y_pred

### 2. Forward pass 
1. instanciating the model
2. getting input data 
3. computing the foward pass

In [0]:
# instanciating the model 
N, D_in, H, D_out = 2, 10, 10, 2

# Construct our model by instantiating the class defined above 
# Note: all the parameters are initialized here 
model = MyModel(D_in, H, D_out)
# You can look up into the model 
model

MyModel(
  (linear1): Linear(in_features=10, out_features=10, bias=True)
  (linear2): Linear(in_features=10, out_features=2, bias=True)
)

In [0]:
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

In [0]:
#model = MyModel(D_in, H, D_out)
# forward pass / predict x 
y_pred = model(x) # almost equivalent to model.forward(x)
# y_pred
y_pred

tensor([[-0.1292, -0.0250],
        [-0.1046, -0.0991]], grad_fn=<AddmmBackward>)

### Questions 
- Why do the prediction change if the model is re-instanciated ? 
- Can this be a problem ? 
- How to avoid it ? 

### 3. Training loop 

- Criterion : 

a model is trained with regard to a _training criterion_ or a _loss_.   
Pytorch provides many different pre-coded losses : 
    - Mean-Square Error 
    - Categorical Cross-Entropy , ...

Most of them can be found in [torch.nn](https://pytorch.org/docs/stable/nn.html) 
- Optimizer 

In pytorch as in any deep learning framwork, models are trained with backpropagation. Backpropagation consists in applying Stochastic Gradient Descent (SGD) to a neural network. There is a broad range of variants around the simple form of SGD. 

Pytorch provides pre-defined objects for many different forms of Gradient Descent algorithm in [torch.optim](https://pytorch.org/docs/stable/optim.html):
- SGD 
- Adadelta 
- Adam 

Your optimizer will be instanciated with it's configuration(*e.g.* the *step_size* or *learning_rate* for SGD), and the network's parameters.

Overview of all the Gradient Descent based algorithms : https://ruder.io/optimizing-gradient-descent/ 


- Training Loop :
    - forward pass to get prediction and the loss value 
    - zero_grad : Resetting the gradient value to zero for all parameters before adding their newly backpropagated values) 
    - compute the gradients' value with loss.backward()
    - update all the parameters of the model with optimizer.step()



In [0]:
# instanciate the model 
# Note: all the model parameters are intialized at this step
model = MyModel(D_in, H, D_out)

criterion = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

# t is normally also an index over the samples (or batches) in your dataset,
# but we will just consider it to be a time-step here
for t in range(10000):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 0:
        print("Step:{} Loss:{} ".format(t, loss.item()))

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


Step:0 Loss:0.21324831247329712 
Step:100 Loss:0.20180049538612366 
Step:200 Loss:0.19123151898384094 
Step:300 Loss:0.18145005404949188 
Step:400 Loss:0.1723765879869461 
Step:500 Loss:0.1639418601989746 
Step:600 Loss:0.15608489513397217 
Step:700 Loss:0.14875219762325287 
Step:800 Loss:0.14189618825912476 
Step:900 Loss:0.13547523319721222 
Step:1000 Loss:0.12945178151130676 
Step:1100 Loss:0.12379277497529984 
Step:1200 Loss:0.11846837401390076 
Step:1300 Loss:0.1134520173072815 
Step:1400 Loss:0.10871973633766174 
Step:1500 Loss:0.10425002872943878 
Step:1600 Loss:0.10002334415912628 
Step:1700 Loss:0.0960220992565155 
Step:1800 Loss:0.09223027527332306 
Step:1900 Loss:0.08863332122564316 
Step:2000 Loss:0.08521801233291626 
Step:2100 Loss:0.08197227865457535 
Step:2200 Loss:0.078885018825531 
Step:2300 Loss:0.07594609260559082 
Step:2400 Loss:0.07314622402191162 
Step:2500 Loss:0.07047688215970993 
Step:2600 Loss:0.06793016195297241 
Step:2700 Loss:0.06549879163503647 
Step:2800 

If the Loss decrease it means the gradient descent is working !! 

**Note:** Don't forget the zero_grad()
- we are doing gradient backpropagation at each step 
- gradients are computed with the loss.backward 
- after each update we must set to zero all the gradients values otherwise they get accumulated (hence zero_grad())

### Questions : 

- plot the loss values that you would record while going through the above loop

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
from torch.optim import SGD, Adam

def plot_loss(lr, N, optim_alg, target=None, max_steps=None):
    """
    This function takes as arguments:
    lr: the learning rate
    N: the number of samples
    optim_alg: the optimisation algorithm
    target: a target value for the loss ( a message should printed when this 
            target is reached)
    max_steps: the number of gradient steps for the optimisation process
    """
    pass


plt.title('lr: {}, N:{}, optim_alg:{}'.format(1e-4, 2, 'SGD'))
plot_loss(lr=1e-4, N=2, optim_alg=SGD)
plt.show()


- how many steps do you need for the loss to reach 1e-1 ? 


- now same question if
        - lr=1e-5,1-3 
        - N = 10, 100, 1000 (NB : you can )
        - optim.Adam 


- What can you conclude on the capacity of a neural network ? 


## Sequence Labelling with pytorch

Now that we have seen how to build a simple neural network, let's build a model for a task more useful in NLP : _Sequence Labelling_ 

Recall : Sequence labelling is the task of predicting a label to a sequence among a fixed range of possibilities 

e.g : Sentiment Analysis 


<img src="./imgs/sentiment_analysis.png">



### 1. Define the model
Define a neural network that uses an LSTM ([nn.LSTM](https://pytorch.org/docs/stable/nn.html)) to classify the elements of a sequence.

 The input sequence will be given to the neural network as a series of indexes. Each index (corresponding to a token in the source vocabulary) will be transformed to a corresponding trainable embedding ([nn.Embedding](https://pytorch.org/docs/stable/nn.html)) before entering the LSTM.   

In [0]:
class SequenceLabeller(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, num_classes, sequence_model="LSTM"):
        super(SequenceLabeller, self).__init__()

        self.hidden_dim = hidden_dim
        # self.word_embeddings = ??
        
        if sequence_model == "LSTM":
            pass
            # The LSTM takes word embeddings as inputs, and outputs hidden states
            # with dimensionality hidden_dim.
            # self.seq = ??
        else:
            raise(Exception("Sequence model {} not supported".format(sequence_model)))

        # The linear layer that maps from hidden state space to class space
        # self.hidden2tag = ??

    def forward(self, sentence):
          pass

### 2.1 Prepare the data
First, put the three files given [here](https://drive.google.com/drive/folders/19fFgwB0Vk9mfGcA2TNhIeViBHYR4zATX?usp=sharing) in your working folder. Then, choose a data file to perform your sequence labelling and use the given loop to read it.
*optional* 
- inspect the data files
- try to guess how the data is being parsed by the loop
- inspect the ``re`` package [documentation](https://docs.python.org/3/library/re.html) and see whether you were right.

In [0]:
import re
import spacy
tokenizer = spacy.load("en_core_web_sm")

In [0]:

def get_data(path):
  data = []
  no_match = 0
  with open(path, "r") as f:
      for line in f:
          match  = re.search("(.*)\s\s([0-1]+).*", line)
          if match is not None:
              tokenized = tokenizer(match.group(1).strip())
              sent = [token.text for token in  tokenized]
              score = match.group(2)
              data.append((sent,int(score)))
          else:
            match  = re.search("(.*),([0-1]+).*", line)
            if match is not None:
              tokenized = tokenizer(match.group(1).strip())
              sent = [token.text for token in  tokenized]
              score = match.group(2)
              data.append((sent,int(score)))
            else:
              no_match += 1
  return data, no_match

data, no_match = get_data("./imdb_labelled.csv") # fill in the path to a 
                                                 # file of your choosing
training_data = data[:int(len(data)*4/5)]
test_data = data[int(len(data)*4/5):]
print("Got {} training examples, {} test examples, and failed to capture "
"{} examples.".format(len(training_data), len(test_data), no_match))



Prepare:
- a structure that maps each token in your source vocabulary to a unique index.
- a structure that matches each index to it's corresponding token in the source vocabulary
- a structure that maps each label to an index
- a function that turns a token sequence to the corresponding index sequence

The labels in this case are indexes themselves(`0`, `1`) but they can be otherwise (_e.g._ `positive`, `negative`, `amusing`, `anxious` ...)




Inspect your data 

### 2.2 Forward pass
Instanciate the model and perform a forward pass on the first sentence in your data.  
See what the scores are before training.  
Note that element i,j of the output is the score for tag j for word i.  
Here we don't need to train, so the code is wrapped in `torch.no_grad()`

In [0]:

with torch.no_grad():
  pass

## Training 
### 3.1  Optimizer and loss
Instanciate an optimizer and a loss function for your network from pytorch. 

In [0]:
# loss_function = ??
# optimizer = ??

### 3.2 Training loop 
Write a loop that goes through the data `n_epochs=40` times, and trains on it.  
The network should train all the word tags in a sentence to produce the entire sentence's label. This fuzzy kind of supervision is called weak-supervision (weakly-supervised learning). 
tip: You should transform your target tags with [one_hot](https://pytorch.org/docs/stable/nn.functional.html#one-hot) before giving them to NLLLoss.

In [0]:
from torch.nn.functional import one_hot
n_epochs = 40
for epoch in range(n_epochs):  # again, normally you would NOT do 300 epochs, it is toy data
    loss_mean_ep = 0
    n_sample = 0
    for sentence, tags in training_data:
        if len(sentence) < 2: continue
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        
        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        #sentence_in = ??
        
        #targets = ??
        #one_hot_targets = ??
        
        # Step 3. Run our forward pass.
        #tag_scores = ??
        
        # Step 4. Compute the loss
        # loss = ??
        loss_mean_ep += loss
        n_sample += 1

        # Backpropagate the loss and perform an optimisation step
        # ??
        # ??

    print("Epoch {} loss {:0.4f} ".format(epoch, loss/n_sample))

### 4. Evaluating model 


#### 1 - Evaluate your data qualitatively by inspecting 3 predictions for positive examples and 3 for negative ones on the test data

In [0]:
with torch.no_grad():
  n_samples = 3
  for sentence, tag in test_data:
    pass


#### 2 - Evaluate your data quantitatively by measuring the [roc_auc](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html) on the test set, and generating a [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

In [0]:
from sklearn.metrics import roc_auc_score, classification_report

ground_truth, scores = [], []
with torch.no_grad():
  for sentence, tag in test_data:
    pass

print("The AUC is {}")
print("Classification report:\n ")


#### 3 - Evaluate how well your model behaves out of it's training domain (test it on one of the other given files)

In [0]:

out_of_domain1, no_match = get_data("./amazon_cells_labelled.csv")
print("Got {} examples, and failed to capture "
"{} examples.".format(len(out_of_domain1), no_match))
out_of_domain2, no_match = get_data("./yelp_labelled.csv")
print("Got {} examples, and failed to capture "
"{} examples.".format(len(out_of_domain1), no_match))

Got 1000 examples, and failed to capture 0 examples.
Got 1000 examples, and failed to capture 8 examples.


In [0]:

ground_truth, scores = [], []
with torch.no_grad():
  for sentence, tag in out_of_domain1:
    pass

print("The AUC for the first out of domain corpus is {}")
print("Classification report:\n ")

ground_truth, scores = [], []
with torch.no_grad():
  for sentence, tag in out_of_domain2:
    pass

print("The AUC for the second out of domain corpus is {}")
print("Classification report:\n ")

The AUC for the first out of domain corpus is {}
The AUC for the second out of domain corpus is {}


### 4. Analysing token-wise tags
Go on and try to input some sentences of your own making, and to see how the score varies throughout the sentence.

In [0]:
import seaborn as sns

def sentiment_heatmap(sentence):
  tokens = [str(w.text) for w in tokenizer(sentence)]
  sentence_in = prepare_sequence(tokens, word_to_ix)
  tag_scores = model(sentence_in)
  token_sentiments = torch.exp(tag_scores)[:, :, 1].detach().numpy()
  sns.heatmap(token_sentiments, xticklabels=tokens)
  plt.title("sentence score: "
  "{}".format(torch.mean(torch.exp(tag_scores)[:, :, 1]).item()))
  plt.show()
  


In [0]:
sample_sentences = [
                    "Awful acting throughout the movie",
                    "I'd like to say that the actor was awful",
                    "Delightful scenery at the opening act of the movie",
                    "Bad movie, but pretty actress"
]
for sentence in sample_sentences:
  sentiment_heatmap(sentence)

**Question**
- What would you say about the interpretability of the results ? to what can we blame for this ?
- Do you think averaging the token scores was a good decision ? How could we do better ?