Warmup

---



In [0]:
# http://pytorch.org/
from os.path import exists
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.4.1-{platform}-linux_x86_64.whl torchvision

In [1]:
#Loading packages

import torch
from torch.autograd import Variable
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from tqdm import tqdm 
import codecs
import random


#we fix the seeds to get consistent results

SEED = 234
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)


# Text classification: Sentiment analysis


In this notebook we are going to build state-of-the-art models for text classification, using sentiment analysis as an example. To be more precise, we will build a feed-forward neural network (FFNN) and a convolutional neural network (CNN). We will look into the details of data preparation, how each model works, and how the performance of these networks can be measured efficiently. We will start with a toy corpus. Afterwards you can extend your knowledge by using a larger dataset.

Again we are using [pytorch](https://pytorch.org/), an open source deep learning platform, as our backbone library in the course. 

Here are our toy training and validation sets. It is good practice to use a validation set (a set representative of the test data). This set is used to tune hyperparameters and choose a configuration for your model to ensure the best performance. 

Our toy sets are already tokenized and lowercased.




In [2]:
 #our toy sentiment analysis corpus

train = ['i like his paper !',
                'what a well-writen essay !',
                'i do not agree with the criticism on this paper',
                'well done ! it was an enjoyable reading',
                'it was very good . send me a copy please .',
                'the argumenation in the paper is very weak',
                'poor effort !',
                'the methodology could have been more detailed',
                'i am not impressed',
                'could have done better .']

train_labels = [1,1,1,1,1,0,0,0,0,0]


valid = ['i like your paper', 
             'i agree with your results', 
             'what a success ! a well-writen paper', 
             'not enough details . very poor', 
             'i support the criticism',
             'could be better']

valid_labels = [1,1,1,0,0,0]

# Pre-processing

Using the material from the previous lab session, define here a method to get a tokenized corpus.

In [0]:
def get_tokenized_corpus(corpus):

  ...
 
  return tokenized_corpus 
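
Since the toy sets are already tokenized and lowercased, one possible implementation simply splits each sentence on whitespace. The following is a minimal sketch under that assumption, not the tokenizer from the previous lab; `simple_tokenize_corpus` is a hypothetical name used only for illustration.

```python
def simple_tokenize_corpus(corpus):
    # Assumption: tokens are already separated by spaces,
    # so splitting on whitespace is enough for the toy data.
    return [sentence.split() for sentence in corpus]


example = ['i like his paper !', 'poor effort !']
tokenized_example = simple_tokenize_corpus(example)
print(tokenized_example)
```

For raw, untokenized text a proper tokenizer would be needed instead of `str.split`.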

# Word2index dictionary

Similar to the way it was done in the previous lab, we define here a method that returns a word-to-index dictionary. Note that we reserve the 0 index for the *pad* token.

In [3]:
def get_word2idx(tokenized_corpus):
  vocabulary = []
  for sentence in tokenized_corpus:
    for token in sentence:
        if token not in vocabulary:
            vocabulary.append(token)
  
  
  word2idx = {w: idx+1 for (idx, w) in enumerate(vocabulary)}
  # we reserve the 0 index for the placeholder token
  word2idx['<pad>'] = 0
 
  return word2idx

# Preparation of inputs

The first layer of our FFNN will be an embedding (look-up) layer that takes the indexes of tokens as input (we do not need to one-hot encode our vectors).
 
 Q. Why do we need to fix the length of our input vectors (we take the maximum sentence length here) ? This process is referred to as padding. Print the padded input corpus.


In [0]:
def get_model_inputs(tokenized_corpus, word2idx, labels, max_len):

  # we index our sentences
  vectorized_sents = [[word2idx[tok] for tok in sent if tok in word2idx] for sent in tokenized_corpus]
  print(vectorized_sents)
  
  # we create a tensor of a fixed size filled with zeroes for padding

  sent_tensor = torch.zeros((len(vectorized_sents), max_len), dtype=torch.long)
  
  sent_lengths = [len(sent) for sent in vectorized_sents]
  
  # we fill it with our vectorized sentences 
  
  for idx, (sent, sentlen) in enumerate(zip(vectorized_sents, sent_lengths)):

    sent_tensor[idx, :sentlen] = torch.LongTensor(sent)

  label_tensor = torch.FloatTensor(labels)
  
  return sent_tensor, label_tensor


tokenized_corpus = get_tokenized_corpus(train)

sent_lengths = [len(sent) for sent in tokenized_corpus]
max_len = np.max(np.array(sent_lengths))

word2idx = get_word2idx(tokenized_corpus)

train_sent_tensor, train_label_tensor = get_model_inputs(tokenized_corpus, word2idx, train_labels, max_len)
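
To see what padding does in isolation, here is a minimal sketch on hand-made index lists. The helper name `pad_indexed_sentences` and the truncation of over-long sentences are assumptions of this sketch and are not part of the method above.

```python
import torch


def pad_indexed_sentences(indexed_sents, max_len, pad_idx=0):
    # Start from a (number of sentences, max_len) tensor filled with the pad index.
    padded = torch.full((len(indexed_sents), max_len), pad_idx, dtype=torch.long)
    for row, sent in enumerate(indexed_sents):
        # Truncate sentences longer than max_len (an assumption of this sketch).
        sent = sent[:max_len]
        padded[row, :len(sent)] = torch.tensor(sent, dtype=torch.long)
    return padded


toy_indexed = [[1, 2, 3], [4, 5], [6]]
toy_padded = pad_indexed_sentences(toy_indexed, max_len=4)
print(toy_padded)
```

Every row now has the same length, so the whole corpus fits into a single tensor that the embedding layer can process at once.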


# Building the Feed-Forward Neural Network

We will start by building a very simple feed-forward neural network (FFNN).  

Note that our FFNN class is a sub-class of `nn.Module`. Within `__init__` we define the layers of the module. Our first layer is an embedding layer (look-up layer). The layer can be initialized with pre-trained embeddings (as we will see at the end of this lab) or trained together with the other layers. Then we average the embeddings for each sentence (e.g., [Iyyer et al., 2015](http://www.aclweb.org/anthology/P15-1162)). The next layer is a fully connected layer with a *ReLU* activation. The output layer uses no activation.

The `forward` method is called when we feed data into our model. Please note that the output dimension of each layer is the input dimension of the next one. 


Q. Recall from the previous lab the functioning of a lookup layer. How does the mapping to the dense representation happen?

Q. Implement the averaging of embeddings. Note that when averaging, padding is not strictly necessary. Why? Think about other ways to get a sentence representation where padding would be necessary.


In [0]:

class FFNN(nn.Module):
    
    def __init__(self, embedding_dim, hidden_dim, vocab_size, max_len, num_classes):
        
        super(FFNN, self).__init__()
        
        #embedding (lookup layer) layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        #hidden layer
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        
        #activation
        self.relu1 = nn.ReLU()
        
        #output layer
        self.fc2 = nn.Linear(hidden_dim, num_classes)  
    
    def forward(self, x):
        
        embedded = self.embedding(x)
        
        # we average the embeddings of words in a sentence
        
        # Q. How to average the embeddings here?
        
        # averaged = ?
        
        # (batch size, max sent length, embedding dim) to (batch size, embedding dim)

        out = self.fc1(averaged)
        out = self.relu1(out)
        out = self.fc2(out)
        return out
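
As a hedged illustration of the averaging step, the sketch below shows two possible variants on a tiny stand-alone embedding layer: a plain mean over the sentence dimension, and a masked mean that ignores positions holding the pad index 0. The sizes and variable names are assumptions chosen for the example; either variant could be one way to fill in `averaged` above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=6, embedding_dim=4)

# two sentences; the second one is padded with index 0 to length 3
x = torch.tensor([[1, 2, 3], [4, 5, 0]])
embedded = embedding(x)                   # (batch size, max sent length, embedding dim)

# Option 1: plain mean over the sentence dimension (pad vectors are included)
averaged = embedded.mean(dim=1)           # (batch size, embedding dim)

# Option 2: masked mean that ignores positions holding the pad index 0
mask = (x != 0).unsqueeze(2).float()      # (batch size, max sent length, 1)
lengths = mask.sum(dim=1).clamp(min=1)    # (batch size, 1), avoids division by zero
masked_averaged = (embedded * mask).sum(dim=1) / lengths

print(averaged.shape, masked_averaged.shape)
```

For a sentence without padding both options give the same vector; they differ only for padded sentences.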



# Training the model

In this section we will define the hyperparameters of our model, the loss function and the optimizer, and run a number of epochs of training over our mini training data. 

We will use the **stochastic gradient descent (SGD)** optimizer. Note that the **learning rate** hyperparameter controls how much the weights are adjusted with respect to the loss gradient. The lower the value, the smaller (more fine-grained) the weight updates.

**Note**: It is common practice to perform training using mini-batches (subsets of training instances seen by the model at each weight update step). In this case the epoch loss is defined as the loss averaged across the mini-batches. Here we use a very small dataset and do not define mini-batches.

Q. Why is the number of output classes equal to 1 for binary classification?

Q. Try to modify the learning rate of the optimizer in the range from 0.0001 up to 0.5. How will the loss react to these changes?


**Note** Learning rate is initially set to 0.5
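
To make the role of the learning rate concrete, the following minimal sketch compares one `optim.SGD` step with the update computed by hand as `w - lr * gradient`. The toy weights and the quadratic loss are assumptions made only for illustration.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.5)

# a toy loss: the sum of squared weights, whose gradient is 2 * w
loss = (w ** 2).sum()
loss.backward()

# the plain SGD update computed by hand: w - lr * gradient
expected = w.detach() - 0.5 * w.grad

# the same update performed by the optimizer (modifies w in place)
optimizer.step()

print(w.detach(), expected)
```

With a smaller `lr` the same gradient moves the weights by a proportionally smaller amount.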

In [0]:
# we will train for N epochs (N times the model will see all the data)
epochs=20

# the input dimension is the vocabulary size
INPUT_DIM = len(word2idx)

# we define our embedding dimension (dimensionality of the output of the first layer)
EMBEDDING_DIM = 100

# dimensionality of the output of the hidden layer (the second layer)
HIDDEN_DIM = 50

# the output dimension is the number of classes, 1 for binary classification
OUTPUT_DIM = 1


# recall input parameters to our model
#embedding_dim, hidden_dim, vocab_size, max_len, num_classes
# max_len is the maximum length of the input sentences as we defined during padding

model = FFNN(EMBEDDING_DIM, HIDDEN_DIM, len(word2idx), max_len, OUTPUT_DIM)

# we use the stochastic gradient descent (SGD) optimizer
optimizer = optim.SGD(model.parameters(), lr=0.5)

# we use the binary cross-entropy loss with sigmoid (applied to logits)
# recall that we did not apply any activation to our output layer, so we need to make our outputs look like probabilities
loss_fn = nn.BCEWithLogitsLoss()

feature = train_sent_tensor
target = train_label_tensor

for epoch in range(1, epochs+1):
  
  # to ensure the dropout (explained later) is "turned on" while training
  # good practice to include even if we do not use it here
  model.train()
  
  #we zero the gradients as they are not removed automatically
  optimizer.zero_grad()
  
  # squeeze is needed as the predictions initially have size (batch size, 1) and we need to remove the dimension of size 1
  predictions = model(feature).squeeze(1)
  loss = loss_fn(predictions, target)
  #calculate the gradient of each parameter
  loss.backward()
  #update the parameters using the gradients and optimizer algorithm 
  optimizer.step()
  
  epoch_loss = loss.item()
  
  print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:.3f}')



# Accuracy

In addition to measuring the loss, we can also evaluate the performance of our model. Implement a method to compute accuracy, and display the accuracy after each epoch of training in the previous example.

**Note** In the case of training with mini-batches the epoch accuracy is defined as the accuracy averaged across the mini-batches. 

 

In [0]:
def accuracy(output, target):
    
    ...
 
    return acc
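
One possible way to compute accuracy from raw logits is sketched below: apply the sigmoid, round to 0/1 and compare with the gold labels. The name `accuracy_sketch` and the choice to return a Python float are assumptions of this example.

```python
import torch


def accuracy_sketch(output, target):
    # output holds raw logits: sigmoid maps them to (0, 1), rounding gives 0/1 labels
    predicted = torch.round(torch.sigmoid(output))
    correct = (predicted == target).float()
    # fraction of correct predictions as a Python float
    return correct.mean().item()


logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
gold = torch.tensor([1.0, 0.0, 0.0, 0.0])
acc_example = accuracy_sketch(logits, gold)
print(acc_example)
```

Rounding the sigmoid output corresponds to a decision threshold of 0.5 on the predicted probability.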

# Tuning on the validation set

Apply the pre-processing and data preparation procedures to this validation set.

Q: Should we re-use the word to index dictionary we created before ? Why?

We will now estimate the loss and the accuracy over the validation set at the end of each epoch.
 
Q. Try to modify the learning rate and the number of epochs now. How will the validation loss and accuracy react to those changes? 

Typically, training can be stopped as soon as we no longer observe any improvement in the evaluation results (accuracy in our case) on the validation set. When the training accuracy is close to 100% and the validation accuracy starts to go down, we are observing overfitting. 


**Note** Learning rate is initially set to 0.5
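
The sketch below illustrates, on a hand-made miniature corpus, what happens when validation sentences are indexed with a dictionary built from the training data only. The toy sentences and the name `toy_word2idx` are assumptions for illustration; the filtering mirrors the `if tok in word2idx` condition used in `get_model_inputs`.

```python
train_tokens = [['i', 'like', 'his', 'paper', '!'], ['poor', 'effort', '!']]
valid_tokens = [['i', 'like', 'your', 'paper'], ['very', 'poor']]

# dictionary built from the training data only; index 0 is reserved for <pad>
toy_word2idx = {'<pad>': 0}
for sent in train_tokens:
    for tok in sent:
        if tok not in toy_word2idx:
            toy_word2idx[tok] = len(toy_word2idx)

# validation tokens missing from the training vocabulary are skipped
valid_indexed = [[toy_word2idx[tok] for tok in sent if tok in toy_word2idx]
                 for sent in valid_tokens]
print(valid_indexed)
```

Words such as `your` and `very` have no row in the embedding table learned on the training set, which is why they are dropped here.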



In [0]:
# we will train for N epochs (N times the model will see all the data)
epochs=20

# the input dimension is the vocabulary size
INPUT_DIM = len(word2idx)

# we define our embedding dimension (dimensionality of the output of the first layer)
EMBEDDING_DIM = 100

# dimensionality of the output of the hidden layer (the second layer)
HIDDEN_DIM = 50

# the output dimension is the number of classes, 1 for binary classification
OUTPUT_DIM = 1

# recall input parameters to our model
#embedding_dim, hidden_dim, vocab_size, max_len, num_classes
# max_len is the maximum length of the input sentences as we defined during padding
 
model = FFNN(EMBEDDING_DIM, HIDDEN_DIM, len(word2idx), max_len, OUTPUT_DIM)

# we use the stochastic gradient descent (SGD) optimizer
optimizer = optim.SGD(model.parameters(), lr=0.5)

# we use the binary cross-entropy loss with sigmoid (applied to logits)
# recall that we did not apply any activation to our output layer, so we need to make our outputs look like probabilities
loss_fn = nn.BCEWithLogitsLoss()

feature_train = train_sent_tensor
target_train = train_label_tensor

feature_valid = valid_sent_tensor
target_valid = valid_label_tensor

for epoch in range(1, epochs+1):
  
  # to ensure the dropout (explained later) is "turned on" while training
  # good practice to include even if we do not use it here
  model.train()
 
  #we zero the gradients as they are not removed automatically
  optimizer.zero_grad()
  
  # squeeze is needed as the predictions initially have size (batch size, 1) and we need to remove the dimension of size 1
  predictions = model(feature_train).squeeze(1)
  loss = loss_fn(predictions, target_train)
  acc = accuracy(predictions, target_train)
  #calculate the gradient of each parameter
  loss.backward()
  #update the parameters using the gradients and optimizer algorithm 
  optimizer.step()
  
  epoch_loss = loss.item()
  epoch_acc = acc
  
  
  # this puts the model in "evaluation mode" (turns off dropout and makes batch normalization use its running statistics)
  # good practice to include even if we do not use these layers
  model.eval()
  # we do not compute gradients within this block, we do no training here
  with torch.no_grad():
 
    predictions_valid = model(feature_valid).squeeze(1)
    loss = loss_fn(predictions_valid, target_valid)
    acc = accuracy(predictions_valid, target_valid)
    valid_loss = loss.item()
    valid_acc = acc
  
  print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:.3f} | Train Acc: {epoch_acc*100:.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}% |')

# Testing the model


Now let us test our model. We define a small test set. Apply the data preparation procedures to this test set as you did for the validation set.



In [0]:
test = ['i really like your paper', 
             'well-done', 
             'good results for a paper !', 
             'your effort is poor !', 
             'not impressed' , 
             'weak argumentation']

test_labels =[1,1,1,0,0,0]



We can now test our model. Write a method for the computation of F-measure. Compute both F-measure and accuracy for the test set.

Q. Are the resulting evaluations different ? How do you interpret those differences? Output predictions.

In [0]:
def f_measure(output, gold):
  
  ...
  
  print("Test: Recall: %.2f, Precision: %.2f, F-measure: %.2f\n" % (recall, precision, fscore))
  


# this puts the model in "evaluation mode" (turns off dropout and makes batch normalization use its running statistics)
# good practice to include even if we do not use these layers
model.eval()

feature = test_sent_tensor
target = test_label_tensor

# we do not compute gradients within this block
with torch.no_grad():
     
    ...
    print(f'| Test Loss: {loss:.3f} | Test Acc: {acc*100:.2f}%')
    f_measure(predictions, test_labels)
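
As a reference point for the F-measure exercise, here is a minimal sketch that computes precision, recall and F1 for the positive class from lists of 0/1 labels. It assumes the logits have already been converted to 0/1 predictions; the name `precision_recall_f1` and the toy labels are hypothetical.

```python
def precision_recall_f1(predicted_labels, gold_labels):
    pairs = list(zip(predicted_labels, gold_labels))
    # counts for the positive class (label 1)
    tp = sum(1 for p, g in pairs if p == 1 and g == 1)
    fp = sum(1 for p, g in pairs if p == 1 and g == 0)
    fn = sum(1 for p, g in pairs if p == 0 and g == 1)
    # guard against division by zero when a count is empty
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    fscore = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
    return precision, recall, fscore


toy_predicted = [1, 1, 0, 1, 0, 0]
toy_gold = [1, 1, 1, 0, 0, 0]
precision, recall, fscore = precision_recall_f1(toy_predicted, toy_gold)
print("Recall: %.2f, Precision: %.2f, F-measure: %.2f" % (recall, precision, fscore))
```

The F-measure here is the harmonic mean of precision and recall, so it drops sharply when either of the two is low.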
    
    

# Building the Convolutional Neural Network (CNN)

We will implement a model inspired by the state-of-the-art CNN model described in 
[Convolutional Neural Networks for Sentence Classification (Kim, 2014)](https://arxiv.org/abs/1408.5882). 

As for the FFNN model, we start with a look-up **embedding layer**. We implement the **convolutional layer** with the help of `nn.Conv2d` and use the *ReLU* activation after it. In the lecture we have seen an example of a 1-dimensional convolution applied to text. The above-mentioned paper, inspired by convolutions for images, applies a 2-dimensional convolution: a (window size, embedding dimension) filter. It covers *n* sequential words, taking the embedding dimension as the width. We then pass the tensors through a max pooling layer. 

The **max pooling layer** is typically followed by a **dropout layer**. The latter sets a random subset of the activations coming from the max pooling layer to zero. This prevents the network from learning to rely on specific weights and helps to prevent overfitting. Note that the dropout layer is only active during training, not at test time.
 
 Q. Study the shapes of outputs coming from convolution and max pooling layers. What is the shape of the max pooling layer output?

In [0]:
class CNN(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, out_channels, window_size, output_dim, dropout):
        
        super(CNN, self).__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        #in_channels -- 1 text channel
        #out_channels -- the number of output channels
        #kernel_size is (window size x embedding dim)
        
        self.conv = nn.Conv2d(in_channels=1, out_channels=out_channels, kernel_size=(window_size,embedding_dim))
        
        #the dropout layer
        self.dropout = nn.Dropout(dropout)
    
        #the output layer
        self.fc = nn.Linear(out_channels, output_dim)
        
        
        
    def forward(self, x):
                
        #(batch size, max sent length)
        
        embedded = self.embedding(x)
                
        #(batch size, max sent length, embedding dim)
        
        #images have 3 RGB channels 
        #for the text we add 1 channel
        embedded = embedded.unsqueeze(1)
        
        #(batch size, 1, max sent length, embedding dim)
        
        feature_maps = self.conv(embedded)
        
        #Q. what is the shape of the convolution output ?
        
        
        
        feature_maps = feature_maps.squeeze(3)
        
        #Q. why do we remove one dimension here ?
                
        feature_maps = F.relu(feature_maps)
        
  
        #the max pooling layer
        pooled = F.max_pool1d(feature_maps, feature_maps.shape[2])
        
        pooled = pooled.squeeze(2)
  
        #Q. what is the shape of the pooling output ?
         
        
        dropped = self.dropout(pooled)
 
        preds = self.fc(dropped)
        
        return preds
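
To study the shapes mentioned in the questions, the following sketch pushes a random tensor through the same convolution and pooling operations and records the resulting shapes. All sizes are arbitrary assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, max_sent_len, embedding_dim = 2, 7, 10
out_channels, window_size = 5, 3

# stand-in for the embedded input after unsqueeze(1)
embedded = torch.randn(batch_size, 1, max_sent_len, embedding_dim)
conv = nn.Conv2d(in_channels=1, out_channels=out_channels,
                 kernel_size=(window_size, embedding_dim))

feature_maps = conv(embedded)
# (batch size, out_channels, max_sent_len - window_size + 1, 1)
conv_shape = tuple(feature_maps.shape)

feature_maps = F.relu(feature_maps.squeeze(3))
# pooling over the whole remaining length keeps one value per filter
pooled = F.max_pool1d(feature_maps, feature_maps.shape[2]).squeeze(2)
# (batch size, out_channels)
pooled_shape = tuple(pooled.shape)

print(conv_shape, pooled_shape)
```

Because the filter spans the full embedding dimension, the last dimension of the convolution output has size 1, which is why it can be squeezed away.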


# Training and testing the CNN

Here we will define the CNN-specific hyperparameters and perform the network training and testing.
 
Q. Is the performance of CNN different from the performance of FFNN? Output predictions.

Q. Is padding necessary for CNN inputs? What is the role of the window size?


**Note** Learning rate is initially set to 0.01

In [0]:
epochs=20

INPUT_DIM = len(word2idx)
EMBEDDING_DIM = 100
OUTPUT_DIM = 1

#the hyperparameters specific to CNN

# we define the number of filters
N_OUT_CHANNELS = 100

# we define the window size
WINDOW_SIZE = 1

# we apply the dropout with the probability 0.5
DROPOUT = 0.5

model = CNN(INPUT_DIM, EMBEDDING_DIM, N_OUT_CHANNELS, WINDOW_SIZE, OUTPUT_DIM, DROPOUT)

optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

feature_train = train_sent_tensor
target_train = train_label_tensor

feature_valid = valid_sent_tensor
target_valid = valid_label_tensor


for epoch in range(1, epochs+1):
   
  model.train()
  
  optimizer.zero_grad()
  
  predictions = model(feature_train).squeeze(1)
  loss = loss_fn(predictions, target_train)
  acc = accuracy(predictions, target_train)
  loss.backward()
  optimizer.step()
  
  epoch_loss = loss.item()
  epoch_acc = acc
  
  model.eval()
  
  with torch.no_grad():
 
    predictions_valid = model(feature_valid).squeeze(1)
    loss = loss_fn(predictions_valid, target_valid)
    acc = accuracy(predictions_valid, target_valid)
    valid_loss = loss.item()
    valid_acc = acc
  
  print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:.3f} | Train Acc: {epoch_acc*100:.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}% |')
  
model.eval()

feature = test_sent_tensor
target = test_label_tensor

with torch.no_grad():
 
    predictions = model(feature).squeeze(1)
    loss = loss_fn(predictions, target)
    acc = accuracy(predictions, target)
    print(f'| Test Loss: {loss:.3f} | Test Acc: {acc*100:.2f}%')
    f_measure(predictions, test_labels)


# Initializing CNN with pre-trained representations

The work [Convolutional Neural Networks for Sentence Classification (Kim, 2014)](https://arxiv.org/abs/1408.5882) also investigates the exploitation of pre-trained embeddings and demonstrates the efficiency of using them.

Try to initialize the CNN embedding layer with the pre-trained GloVe embeddings you used in the previous lab. We encourage you to modify the respective method from the previous lab. Pay particular attention to keeping the correct indexes from `word2idx` for the lookup table.

Q. What will the embedding for the *pad* token be?
 

In [0]:

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
 


In [0]:
wvecs = np.zeros((len(word2idx), 100))

with codecs.open('glove.6B.100d.txt', 'r','utf-8') as f: 
 ...
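
A possible way to fill the embedding matrix is sketched below. It assumes the usual GloVe text format, where each line holds a word followed by its space-separated vector values, and it uses a tiny in-memory stand-in instead of the real file. The helper name `load_pretrained_vectors` and the toy data are hypothetical.

```python
import io
import numpy as np


def load_pretrained_vectors(lines, word2idx, dim):
    # rows of words missing from the file (and the <pad> row) stay all-zero
    vectors = np.zeros((len(word2idx), dim))
    for line in lines:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        if word in word2idx and len(values) == dim:
            # keep the row index given by word2idx so the lookup stays consistent
            vectors[word2idx[word]] = [float(v) for v in values]
    return vectors


# a tiny in-memory stand-in for glove.6B.100d.txt (3 dimensions instead of 100)
fake_glove = io.StringIO("paper 0.1 0.2 0.3\npoor -0.5 0.0 0.5\nmovie 1.0 1.0 1.0\n")
toy_word2idx = {'<pad>': 0, 'paper': 1, 'poor': 2, 'unseen': 3}
toy_wvecs = load_pretrained_vectors(fake_glove, toy_word2idx, dim=3)
print(toy_wvecs)
```

With the real file, the open file handle can be passed as `lines` and `dim` set to 100.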


You can initialize the CNN model embedding layer as follows:

In [0]:
model.embedding.weight.data.copy_(torch.from_numpy(wvecs))

Q. What is the impact of using those pre-trained embeddings on the model performance?

# Advanced: Experimenting with larger corpora

For advanced experiments with a larger dataset, we suggest using the [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/) of movie reviews available from [`torchtext.datasets`](https://torchtext.readthedocs.io/en/latest/data.html). This module also provides a range of useful functionalities for data preparation: defining a preprocessing pipeline, splitting, batching, padding, iterating through data, loading pre-trained embeddings, building vocabulary, etc. Below we provide an example using the tokenizer provided by the [spaCy](https://spacy.io/) toolkit.
 
With the batch size provided, `BucketIterator` defines mini-batches by grouping sequences with similar original lengths, so that there is minimal need for padding.  For this bigger dataset, use `.cuda()` on any input batches/tensors, network modules and loss functions to place computations on the GPU.

You can start by applying the provided CNN model to this dataset. You may find it necessary to permute the inputs so that the batch size is the first dimension. 

In [0]:
# Additional packages
 
from torchtext import data, datasets
from torch.utils.data import DataLoader
import spacy

#Fix GPU seeds

torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [0]:
#define our batch size

BATCH_SIZE = 64

#define types of data and their preprocessing

text_field = data.Field(tokenize='spacy',lower=True)
label_field = data.LabelField(dtype=torch.float)

#get pre-defined split
train, test_init = datasets.IMDB.splits(text_field, label_field)

#define our own validation and test set (initial test set is too large)
# random.seed() returns None, so we seed explicitly and pass the generator state
random.seed(SEED)
train, valid_test = train.split(split_ratio=0.9, random_state=random.getstate())
valid, test = valid_test.split(split_ratio=0.5, random_state=random.getstate())

print(f'Train size: {len(train)}')
print(f'Validation size: {len(valid)}')
print(f'Test size: {len(test)}')

#build vocabulary with maximum size (less frequent words are not considered)
# load the pre-trained word embeddings.
text_field.build_vocab(train, max_size=25000, vectors="glove.6B.100d")
label_field.build_vocab(train)

# get iterators over the data
# only train is split in mini-batches
# validation and test sets are not split (we provide their lengths as the batch size)
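# place iterators on the GPU if one is available
# `accelerator` holds the tag used to pick the pip wheel, not a device, so we build a torch.device instead

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits((train, valid, test),
                batch_sizes=(BATCH_SIZE, len(valid), len(test)), device=device)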

Q. The paper [Convolutional Neural Networks for Sentence Classification (Kim, 2014)](https://arxiv.org/abs/1408.5882) applies 3 convolutional layers in parallel with window sizes [3, 4, 5]. Try to extend our CNN model with 2 more convolution layers and apply these window sizes. Hint: you can use the `nn.ModuleList` container. The outputs of the pooling layers are concatenated. What will be the effect on the model performance? 
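
One way such an extension could look is sketched below. It is not the exact architecture from the paper: the class name `MultiWindowCNN`, the toy sizes and the random input are assumptions for illustration. Input sentences must be at least as long as the largest window size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiWindowCNN(nn.Module):

    def __init__(self, vocab_size, embedding_dim, out_channels, window_sizes, output_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # one convolution per window size, registered through nn.ModuleList
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels=1, out_channels=out_channels,
                      kernel_size=(ws, embedding_dim))
            for ws in window_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        # the output layer sees the concatenation of all pooled outputs
        self.fc = nn.Linear(len(window_sizes) * out_channels, output_dim)

    def forward(self, x):
        # x: (batch size, max sent length)
        embedded = self.embedding(x).unsqueeze(1)
        pooled = []
        for conv in self.convs:
            feature_maps = F.relu(conv(embedded).squeeze(3))
            pooled.append(F.max_pool1d(feature_maps, feature_maps.shape[2]).squeeze(2))
        # (batch size, number of window sizes * out_channels)
        concatenated = torch.cat(pooled, dim=1)
        return self.fc(self.dropout(concatenated))


toy_model = MultiWindowCNN(vocab_size=20, embedding_dim=8, out_channels=4,
                           window_sizes=[3, 4, 5], output_dim=1, dropout=0.5)
toy_batch = torch.randint(0, 20, (2, 10))
toy_output = toy_model(toy_batch)
print(toy_output.shape)
```

Using `nn.ModuleList` rather than a plain Python list ensures that the parameters of all convolutions are registered with the model and seen by the optimizer.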
 
 Below we provide an example of a train method employing mini-batches.

In [0]:
def train_model(train_iter, dev_iter, model):
        
     
    for epoch in range(1, epochs+1):
        
        epoch_loss = 0
        epoch_acc = 0
        model.train()
       
        #iterate over batches
        for batch in train_iter:
          
          
            #place on the GPU          
            feature, target = batch.text.cuda(), batch.label.cuda()
 
            optimizer.zero_grad()
            predictions = model(feature).squeeze(1)
             
            loss = loss_fn(predictions, target)
            acc = binary_accuracy(predictions, target)
            
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
            epoch_acc += acc.item()

        # `eval` must be your own evaluation function (it shadows Python's built-in eval)
        valid_loss, valid_acc = eval(dev_iter, model)
        #average train loss and accuracy over the mini-batches
        epoch_loss, epoch_acc = epoch_loss / len(train_iter), epoch_acc / len(train_iter)
        print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:.3f} | Train Acc: {epoch_acc*100:.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}% |')
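
The train method above relies on an evaluation function and on `binary_accuracy`, neither of which is defined in this notebook. A minimal sketch of both is given below, under a few assumptions: the function is called `evaluate` to avoid shadowing the built-in `eval`, the loss function is passed in explicitly, tensors stay on the CPU, and the batches are simple stand-ins exposing `.text` and `.label` like the iterator batches.

```python
import torch
import torch.nn as nn
from types import SimpleNamespace


def binary_accuracy(predictions, target):
    # logits -> probabilities -> 0/1 labels; the result is a 0-dim tensor
    rounded = torch.round(torch.sigmoid(predictions))
    return (rounded == target).float().mean()


def evaluate(data_iter, model, loss_fn):
    total_loss, total_acc, n_batches = 0.0, 0.0, 0
    model.eval()
    # no gradients are needed for evaluation
    with torch.no_grad():
        for batch in data_iter:
            predictions = model(batch.text).squeeze(1)
            total_loss += loss_fn(predictions, batch.label).item()
            total_acc += binary_accuracy(predictions, batch.label).item()
            n_batches += 1
    # loss and accuracy averaged over the mini-batches
    return total_loss / n_batches, total_acc / n_batches


# toy stand-ins: a linear "model" and two hand-made batches
toy_model = nn.Linear(3, 1)
toy_batches = [SimpleNamespace(text=torch.randn(4, 3),
                               label=torch.tensor([1., 0., 1., 0.]))
               for _ in range(2)]
toy_loss, toy_acc = evaluate(toy_batches, toy_model, nn.BCEWithLogitsLoss())
print(toy_loss, toy_acc)
```

On the GPU, the batch tensors would additionally need to be moved with `.cuda()`, as in the train method.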

Q. Pre-processing: experiment with filtering out stop words from input data. What will be the effect on the performance? 

You may choose to use spaCy to get a list of stop words.

In [0]:
spacy_nlp = spacy.load('en_core_web_sm')
spacy_stop_words = spacy.lang.en.stop_words.STOP_WORDS
print(spacy_stop_words)
text_field = data.Field(tokenize='spacy',lower=True,stop_words=spacy_stop_words)


Q. Apply a Naive Bayes classifier to the problem. How would it perform for this task? You can use the `sklearn.naive_bayes.MultinomialNB` implementation from the popular [scikit-learn](https://scikit-learn.org) toolkit. Extraction of the data for this purpose could be performed as follows:

In [0]:
for example in train:
    # each example is a movie review: example.text holds its tokens
    review = example.text
    # map the string labels to integers
    label = 1 if example.label == 'pos' else 0
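
As a starting point for the Naive Bayes question, the sketch below trains `MultinomialNB` on bag-of-words counts from a few toy sentences. The toy texts and labels are assumptions for illustration; for the IMDB data the texts and labels would come from a loop like the one above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_texts = ['i like his paper !',
             'well done ! it was an enjoyable reading',
             'poor effort !',
             'i am not impressed']
toy_labels = [1, 1, 0, 0]

# bag-of-words counts; the data is pre-tokenized, so we split on whitespace
vectorizer = CountVectorizer(analyzer=str.split)
X_train = vectorizer.fit_transform(toy_texts)

classifier = MultinomialNB()
classifier.fit(X_train, toy_labels)

# words unseen during fitting are ignored by the vectorizer
X_new = vectorizer.transform(['i like your paper', 'poor paper'])
nb_predictions = classifier.predict(X_new)
print(nb_predictions)
```

Unlike the neural models, this baseline ignores word order entirely and needs no padding.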
