
Exercise completed by Jules Cauzinille

* I reached an accuracy of 0.87 on the dev set with the smallest embeddings (size 50)
* I also added an early stopping criterion and plotted the accuracy evolution.

Goal
==

We are about to design and train a neural system to perform sentiment analysis on film reviews. More precisely, the network will have to output the probability that the input review expresses a positive opinion (overall).

The system will be a bag-of-words model using GloVe embeddings. It will have to first average the embeddings of the words of the input review, and then send the result through a simple network that should output a probability.
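
As a rough illustration of the pipeline described above, the sketch below averages toy 3-dimensional vectors and sends the result through a single logistic unit. The vectors, the weights, and the helper names `average_embedding` and `positive_probability` are made up for the example; they are not GloVe values and not the network trained later in this notebook.

```python
import math

# Toy 3-dimensional "embeddings" (hypothetical values, not real GloVe vectors).
toy_embeddings = {
    "great": [0.9, 0.1, 0.3],
    "boring": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.5, 0.5],
}

def average_embedding(tokens, embeddings):
    """Averages the vectors of the known tokens (bag of words: word order is ignored)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vectors:  # No known token: returns a zero vector.
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def positive_probability(vector, weights, bias):
    """A single logistic unit: sigmoid(w . x + b)."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-score))

weights, bias = [2.0, 0.0, 0.0], 0.0  # Hypothetical parameters, chosen by hand.
p_pos = positive_probability(average_embedding(["great", "movie"], toy_embeddings), weights, bias)
p_neg = positive_probability(average_embedding(["boring", "movie"], toy_embeddings), weights, bias)
print(p_pos, p_neg)
```

Because the representation is an average, shuffling the tokens of a review does not change the output.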

The first 5 parts were already implemented. 

Loading PyTorch
==

In [None]:
# Imports PyTorch.
import torch

Downloading the dataset
==
The dataset we are going to use is the Large Movie Review Dataset (https://ai.stanford.edu/~amaas/data/sentiment/).

Downloading the dataset and pre-processing it might take several minutes, so ask Colab to execute all cells while you are reading the code.

In [None]:
# Downloads the dataset.
import urllib.request # 'import urllib' alone does not guarantee that the 'request' submodule is loaded.

tmp = urllib.request.urlretrieve("https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz")
filename = tmp[0]

In [None]:
filename

In [None]:
# Extracts the dataset.
import tarfile
tar = tarfile.open(filename)
tar.extractall()
tar.close()

In [None]:
import os # Useful library to read files and inspect directories.

In [None]:
# Shows which files and directories are present at the root of the file system.
for filename in os.listdir("."):
  print(filename)

In [None]:
dataset_root = "aclImdb"
# Shows which files and directories are present at the root of the dataset directory.
for filename in os.listdir(dataset_root):
  print(filename)

In [None]:
# Shows several reviews.
dirname = os.path.join(dataset_root, "train", "neg") # "aclImdb/{train|test}/{neg|pos}"
for idx, filename in enumerate(os.listdir(dirname)):
  if(idx >= 5): break # Stops after the 5th file.
  
  print(filename)
  with open(os.path.join(dirname, filename)) as f:
    review = f.read()
    print(review)
  print()

Preprocessing the dataset
==

In [None]:
import nltk # Imports NLTK, an NLP library.
nltk.download('punkt') # Loads a module required for tokenization.
import collections # This library defines useful data structures. 

In [None]:
newline = "<br />" # The reviews sometimes contain this HTML tag to indicate a line break.
def preprocess(text):
  text = text.replace(newline, " ") # Replaces the newline HTML tag with a space.
  tokens = nltk.word_tokenize(text) # Converts the text to a list of tokens (strings).
  tokens = [token.lower() for token in tokens] # Lowercases all tokens.
  
  return tokens

# Reads and pre-processes the reviews.
dataset = {"train": [], "test": []}
binary_classes = {"neg": 0, "pos": 1}
for part_name, l in dataset.items():
  for class_name, value in binary_classes.items():
    path = os.path.join(dataset_root, part_name, class_name)
    print("Processing %s..." % path, end='')
    for filename in os.listdir(path):
        with open(os.path.join(path, filename)) as f:
          review_text = f.read()
          review_tokens = preprocess(review_text)
          
          l.append((review_tokens, value))
    print(" done")

In [None]:
# Splits the train set into a proper train set and a development/validation set.
# 'dataset["train"]' happens to be a list composed of a certain number of negative examples followed by the same number of positive examples.
# We are going to use 3/4 of the original train set as our actual train set, and 1/4 as our development set.
# We want to keep balanced train and development sets, i.e. for both, half of the reviews should be positive and half should be negative.
if("dev" in dataset): print("This should only be run once.")
else:
  dev_set_half_size = int((len(dataset["train"]) / 4) / 2) # Half of a quarter of the training set size.
  dataset["dev"] = dataset["train"][:dev_set_half_size] + dataset["train"][-dev_set_half_size:] # Takes some negative examples at the beginning and some positive ones at the end.
  dataset["train"] = dataset["train"][dev_set_half_size:-dev_set_half_size] # Removes the examples used for the development set.

  for (part, data) in dataset.items():
    class_counts = collections.defaultdict(int)
    for (_, p) in data: class_counts[p] += 1
    print(f"{part}: {class_counts}")
  print("Train set split into train/dev.")
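
The split above can be checked on toy data. The sketch below is a standalone version of the same slicing idea; the function name `split_balanced` and the toy reviews are assumptions made for the example, not part of the notebook.

```python
def split_balanced(examples, dev_fraction=0.25):
    """Splits a list made of negatives followed by as many positives into (train, dev).
    The dev examples are taken in equal numbers from both ends of the list."""
    half = int(len(examples) * dev_fraction / 2)
    assert half > 0, "The list is too small for this dev fraction."
    dev = examples[:half] + examples[-half:]
    train = examples[half:-half]
    return train, dev

# Toy data: 8 negative (0) examples followed by 8 positive (1) examples.
toy = [(f"review {i}", 0) for i in range(8)] + [(f"review {i}", 1) for i in range(8, 16)]
train, dev = split_balanced(toy)
print(len(train), len(dev))  # 12 4
```

Both parts stay balanced because the same number of examples is removed from the negative end and from the positive end.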

Loading the word embeddings
==
We are going to use GloVe embeddings.

All word forms with a frequency below a given threshold are going to be considered unknown forms.

In [None]:
# Computes the frequency of all word forms in the train set.
word_counts = collections.defaultdict(int)
for tokens, _ in dataset["train"]:
  for token in tokens: word_counts[token] += 1

print(word_counts)

In [None]:
# Builds a vocabulary containing only those words present in the train set with a frequency above a given threshold.
count_threshold = 4
vocabulary = set()
for word, count in word_counts.items():
    if(count > count_threshold): vocabulary.add(word)

print(vocabulary)
print(len(vocabulary))

In [None]:
import zipfile
import numpy as np

In [None]:
# Returns a dictionary {word[String]: id[Integer]} and a list of Numpy arrays.
# `data_path` is the path of the directory containing the GloVe files (if None, 'glove.6B' is used)
# `max_size` is the number of word embeddings read (starting from the most frequent; in the GloVe files, the words are sorted)
# If `vocabulary` is specified, the output vocabulary contains the intersection 
#   of `vocabulary` and the words with a defined embedding. Otherwise, all words with a defined embedding are used.
def get_glove(dim=50, vocabulary=None, max_size=-1, data_path=None):
  dimensions = set([50, 100, 200, 300]) # Available dimensions for GloVe 6B
  fallback_url = 'http://nlp.stanford.edu/data/glove.6B.zip' # (Remember that in GloVe 6B, words are lowercased.)

  assert (dim in dimensions), (f'Unavailable GloVe 6B dimension: {dim}.')

  if(data_path is None): data_path = 'glove.6B'

  # Checks that the data is here, otherwise downloads it.
  if(not os.path.isdir(data_path)):
    #print('Directory "%s" does not exist. Creation.' % data_path)
    os.makedirs(data_path)
  
  glove_weights_file_path = os.path.join(data_path, f'glove.6B.{dim}d.txt')
  
  if(not os.path.isfile(glove_weights_file_path)):
    local_zip_file_path = os.path.join(data_path, os.path.basename(fallback_url))
  
    if(not os.path.isfile(local_zip_file_path)):
      print(f'Retrieving GloVe embeddings from {fallback_url}.')
      urllib.request.urlretrieve(fallback_url, local_zip_file_path)
    
    with zipfile.ZipFile(local_zip_file_path, 'r') as z:
      print(f'Extracting GloVe embeddings from {local_zip_file_path}.')
      z.extractall(path=data_path)
  
  assert os.path.isfile(glove_weights_file_path), (f"GloVe file {glove_weights_file_path} not found.")

  # Reads GloVe data.
  print('Reading GloVe embeddings.')
  new_vocabulary = {} # A dictionary {word[String]: id[Integer]}
  embeddings = [] # The list of embeddings (Numpy arrays)
  with open(glove_weights_file_path, 'r') as f:
    for line in f: # Each line consists of the word followed by a space and all of the coefficients of the vector separated by spaces.
      values = line.split()

      # Here, I'm trying to detect where on the line the word ends and where the vector begins.
      #   As in some version(s) of GloVe words can contain spaces, this is not entirely trivial.
      vector_part = ' '.join(values[-dim:])
      x = line.find(vector_part)
      word = line[:(x - 1)]

      if((vocabulary is not None) and (not word in vocabulary)): # If a vocabulary was specified and if the word is not in it…
        continue # …this word is skipped.

      new_vocabulary[word] = len(new_vocabulary)
      embedding = np.asarray(values[-dim:], dtype=np.float32)
      embeddings.append(embedding)

      if(len(new_vocabulary) == max_size): break
  print('(GloVe embeddings loaded.)')
  print()

  return (new_vocabulary, embeddings)
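
The word/vector separation done in `get_glove` can also be written with `str.rsplit`. The sketch below is an alternative formulation, not the code used above; `parse_glove_line` is a hypothetical helper and the two lines it parses are toy inputs with made-up coefficients.

```python
def parse_glove_line(line, dim):
    """Splits one line of a GloVe text file into (word, vector).
    The last `dim` space-separated fields are the coefficients;
    everything before them is the word, which may itself contain spaces."""
    fields = line.rstrip("\n").rsplit(" ", dim)
    word = fields[0]
    vector = [float(x) for x in fields[1:]]
    return word, vector

word, vector = parse_glove_line("the 0.418 0.24968 -0.41242\n", dim=3)
print(word, vector)

# A word containing a space is kept whole.
spaced_word, spaced_vector = parse_glove_line("new york 0.1 0.2 0.3\n", dim=3)
print(spaced_word, spaced_vector)
```

Splitting from the right is what makes the approach robust to spaces inside the word, since the number of coefficients is known in advance.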

In [None]:
(new_vocabulary, embeddings) = get_glove(dim=50, vocabulary=vocabulary)

In [None]:
print(len(new_vocabulary)) # Shows the size of the vocabulary.
print(new_vocabulary) # Shows each word and its id.
# Prints the embedding of the word "and":
# print(type(embeddings))
# print(new_vocabulary["and"])
# print(embeddings[new_vocabulary["and"]])

# unk = np.average(embeddings, axis=0)
# print(unk)
# pad = np.zeros_like(unk)
# print(pad)

embeddings_tensor = torch.FloatTensor(embeddings)
print(embeddings_tensor.size())

Batch generator
==

In [None]:
# Defines a class of objects that produce batches from the dataset.
class BatchGenerator:
  def __init__(self, dataset, vocabulary):
    self.dataset = dataset
    for part in self.dataset.values(): # Shuffles the dataset so that positive and negative examples are mixed.
      np.random.shuffle(part)

    self.vocabulary = vocabulary # Dictionary {word[String]: id[Integer]}
    self.unknown_word_id = len(vocabulary) # Id for unknown forms
    self.padding_idx = len(vocabulary) + 1 # Not all reviews of a given batch will have the same length.
    #   We will "pad" shorter reviews with a special token id so that the batch can be represented by a matrix.
  
  def length(self, data_type='train'):
    return len(self.dataset[data_type])

  # Returns a random batch.
  # Batches are output as triples (word_ids, polarity, texts).
  # If `subset` is an integer, only a subset of the corpus is used. This can be useful to debug the system.
  def get_batch(self, batch_size, data_type, subset=None):
    data = self.dataset[data_type] # selects the relevant portion of the dataset.
    
    max_i = len(data) if(subset is None) else min(subset, len(data))
    instance_ids = np.random.randint(max_i, size=batch_size) # Randomly picks some instance ids.

    return self._ids_to_batch(data, instance_ids)

  def _ids_to_batch(self, data, instance_ids):
    word_ids = [] # Will be a list of lists of word ids (Integer)
    polarity = [] # Will be a list of review polarities (Boolean)
    texts = [] # Will be a list of lists of words (String)
    for instance_id in instance_ids:
      text, p = data[instance_id]

      word_ids.append([self.vocabulary.get(w, self.unknown_word_id) for w in text])
      polarity.append(p)
      texts.append(text)
    
    # Padding
    self.pad(word_ids)

    word_ids = torch.tensor(word_ids, dtype=torch.long) # Conversion to a tensor
    polarity = torch.tensor(polarity, dtype=torch.bool) # Conversion to a tensor

    return (word_ids, polarity, texts) # We don't really need `texts` but it might be useful to debug the system.
  
  # Pads a list of lists (i.e. adds fake word ids so that all sequences in the batch have the same length, so that we can use a matrix to represent them).
  # In place
  def pad(self, word_ids):
    max_length = max([len(s) for s in word_ids])
    for s in word_ids: s.extend([self.padding_idx] * (max_length - len(s)))
  
  # Returns a generator of batches for a full epoch.
  # If `subset` is an integer, only a subset of the corpus is used. This can be useful to debug the system.
  def all_batches(self, batch_size, data_type="train", subset=None):
    data = self.dataset[data_type]
    
    max_i = len(data) if(subset is None) else min(subset, len(data))

    # Loop that generates all full batches (batches of size 'batch_size')
    i = 0
    while((i + batch_size) <= max_i):
      instance_ids = np.arange(i, (i + batch_size))
      yield self._ids_to_batch(data, instance_ids)
      i += batch_size
    
    # Possibly generates the last (not full) batch.
    if(i < max_i):
      instance_ids = np.arange(i, max_i)
      yield self._ids_to_batch(data, instance_ids)
  
  # Turns a list of arbitrary pre-processed texts into a batch.
  # This function will be used to infer the polarity of an unannotated review.
  def turn_into_batch(self, texts):
    word_ids = [[self.vocabulary.get(w, self.unknown_word_id) for w in text] for text in texts]
    self.pad(word_ids)
    return torch.tensor(word_ids, dtype=torch.long)

batch_generator = BatchGenerator(dataset=dataset, vocabulary=new_vocabulary)
print(batch_generator.length('train')) # Prints the number of instances in the train set.

In [None]:
tmp = batch_generator.get_batch(3, data_type="train")
print(tmp[0]) # Prints the matrix of token ids.
print(tmp[1]) # Prints the vector of polarities.
print(tmp[2]) # Prints the list of reviews.

print(type(batch_generator))

In [None]:
len(list(batch_generator.all_batches(batch_size=3, data_type="train"))) # Number of batches in the training set for batches of size 3
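
To make the padding and the batch count concrete, here is a minimal standalone sketch. The helpers `pad` (which returns a copy, unlike the in-place method of `BatchGenerator`) and `number_of_batches` are written for this example only, and the ids are arbitrary.

```python
import math

def pad(sequences, padding_id):
    """Returns a padded copy of `sequences` so that all of them have the same length."""
    max_length = max(len(s) for s in sequences)
    return [s + [padding_id] * (max_length - len(s)) for s in sequences]

def number_of_batches(nb_instances, batch_size):
    """Number of full batches, plus one smaller final batch if there is a remainder."""
    return math.ceil(nb_instances / batch_size)

padded = pad([[4, 8, 15], [16], [23, 42]], padding_id=99)
print(padded)                    # [[4, 8, 15], [16, 99, 99], [23, 42, 99]]
print(number_of_batches(10, 3))  # 4
```

The second function matches what `all_batches` yields: every full batch, then a last incomplete one when the dataset size is not a multiple of the batch size.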

The model
==

In [None]:
class SentimentClassifier(torch.nn.Module):
  # embeddings: list of Numpy arrays
  # hidden_sizes: list of the size (Integer) of each hidden layer; there may be 0 or more hidden layers
  def __init__(self, embeddings, hidden_sizes, freeze_embeddings=True, device='cpu'):
    embeddings = list(embeddings) # Creates a copy of the list of embeddings, so we can add or remove entries without affecting the original list.
    super().__init__() # Calls the constructor of the parent class. Usually, this is necessary when creating a custom module.

    self.padding_idx = len(embeddings) + 1 # len(embeddings) will be the id of the embedding of the unknown word

    # Here you have to (i) define a vector for unknown forms (the average of actual word embeddings)
    # and a vector for the padding token (full of 0s) and (ii) define an embedding layer 'self.embeddings' using torch.nn.Embedding.from_pretrained
    # and without forgetting to use the 'freeze' and 'padding_idx' arguments.

    #i

    self.unk = np.average(embeddings, axis=0) # Vector for unknown forms: the average of all embeddings
    self.padding = np.zeros_like(self.unk) # Zero vector of the same size as the embeddings
 
    #ii
    embeddings.append(self.unk)
    embeddings.append(self.padding)
    embeddings_tensor = torch.FloatTensor(embeddings)
    self.embeddings = torch.nn.Embedding.from_pretrained(embeddings_tensor, freeze=freeze_embeddings, padding_idx=self.padding_idx)

    self.embeddings = self.embeddings.to(device) # Sends the word embeddings to 'device', which is potentially a GPU.

    # Here you have to define self.main_part, the network that computes a probability for any review given as input
    # (represented as the average of the embeddings of the tokens).
    # The number of hidden layers is determined by 'hidden_sizes', which is a list of integers describing the (output) size of each of them.
    # Use torch.nn.Linear to build linear layers.
    # torch.nn.Sequential takes one argument per module and not a list of modules as argument, but if 'modules' is a list of modules,
    # 'torch.nn.Sequential(*modules)' (with the star notation) works.
    
    modules = []

    # Builds the hidden layers: each one is a linear layer followed by an activation function.
    # If 'hidden_sizes' is empty, the network reduces to a single linear layer followed by a sigmoid.
    input_size = len(embeddings[0]) # Size of a word embedding
    for hidden_size in hidden_sizes:
        modules.append(torch.nn.Linear(input_size, hidden_size))
        modules.append(torch.nn.Tanh())
        input_size = hidden_size

    # Output layer and output function
    modules.append(torch.nn.Linear(input_size, 1))
    modules.append(torch.nn.Sigmoid())

    #print(modules)

    # Builds the network from the list of layers 'modules'.
    self.main_part = torch.nn.Sequential(*modules)
    
    self.main_part = self.main_part.to(device) # Sends the network to 'device', which is potentially a GPU.

    self.device = device

  # 'batch' is a matrix of word ids (Integer).

  def forward(self, batch):
    # Here you have to (i) turn 'batch' into a matrix of embeddings (i.e. a tensor of rank 3), (ii) average all embeddings for a given review
    # while being careful not to take into account padding vectors, (iii) send these bag-of-words representations to the network.
    # Return a tensor of shape (batch size) instead of (batch size, 1).
    # Once you think the function works, check that the presence of padding ids do NOT impact the result in any way
    # (i.e. the same probability should be computed for a given review independently of the number of padding ids).
    #################

    #i turn 'batch' into a matrix of embeddings (tensor of rank 3)

    embedded_batch = self.embeddings(batch)
    
    #ii average embeddings without padding vectors

    # 'lengths' is a vector that contains, for each review, its number of actual (non-padding) tokens.
    lengths = (batch != self.padding_idx).sum(dim=1)

    # 'mean' is the matrix containing the averaged embeddings of all the words of each review (without the padding).
    # Since the padding embedding is a zero vector, it does not contribute to the sum of the embeddings of a review.
    # Dividing this sum by the number of non-padding tokens gives the average without padding.
    mean = torch.sum(embedded_batch, 1) / lengths.unsqueeze(1)

    # #iii send the bow to the network
    
    return self.main_part(mean).squeeze(1)

    #################
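
The padding-invariance check requested in the comments of `forward` can be reasoned about on plain lists. The sketch below is a pure-Python analogue of the masked average, not the PyTorch code itself; `masked_mean` and the 2-dimensional table are assumptions made for the example.

```python
def masked_mean(word_ids, embedding_table, padding_id):
    """Averages the vectors of one review, ignoring padding ids."""
    real_ids = [i for i in word_ids if i != padding_id]
    dim = len(embedding_table[0])
    if not real_ids:  # Only padding: returns a zero vector instead of dividing by zero.
        return [0.0] * dim
    return [sum(embedding_table[i][d] for i in real_ids) / len(real_ids) for d in range(dim)]

# Hypothetical table: ids 0 and 1 are words, id 2 is the padding id (zero vector).
table = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
short = masked_mean([0, 1], table, padding_id=2)
padded = masked_mean([0, 1, 2, 2, 2], table, padding_id=2)
print(short, padded)  # [2.0, 3.0] [2.0, 3.0]
```

The same test can be run on the real model by feeding it one review twice, once with extra padding ids appended, and comparing the two probabilities.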

In [None]:
model = SentimentClassifier(embeddings, hidden_sizes=[100], freeze_embeddings=True)
batch = batch_generator.get_batch(3, data_type="train")
print("output:", model(batch[0]))


In [None]:
# Function that computes the accuracy of the model on a given part of the dataset.
evaluation_batch_size = 256
def evaluation(data_type, subset=None):
  nb_correct = 0
  total = 0
  for batch in batch_generator.all_batches(evaluation_batch_size, data_type=data_type, subset=subset):
    prob = model(batch[0].to(model.device)) # Forward pass
    answer = (prob > 0.5) # Shape: (batch_size), since 'forward' squeezes the last dimension
    nb_correct += (answer == batch[1].to(model.device)).sum().item()
    total += batch[0].shape[0]
      
  accuracy = (nb_correct / total)
  return accuracy
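
The accuracy computed by `evaluation` boils down to thresholding the probabilities at 0.5 and counting matches with the gold labels. A minimal sketch on made-up numbers (the function name `accuracy_from_probabilities` is hypothetical):

```python
def accuracy_from_probabilities(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the gold label."""
    predictions = [p > threshold for p in probabilities]
    correct = sum(pred == bool(label) for pred, label in zip(predictions, labels))
    return correct / len(labels)

# Four toy reviews: three are classified correctly, the third one is not.
result = accuracy_from_probabilities([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
print(result)  # 0.75
```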

Accuracy evolution plot
==
Function to plot the evolution of train and dev accuracy after training of the model

In [None]:
import matplotlib.pyplot as plt

def plot(train_accuracies, dev_accuracies) :
  plt.plot(train_accuracies, label='Train Accuracy')
  plt.plot(dev_accuracies, label='Dev Accuracy')
  plt.title('Training and Validation accuracy evolution')
  plt.xlabel('Epochs')
  plt.ylabel('Accuracy')
  plt.legend()
  plt.show()

Training
==
Once everything works, try to find better hyperparameters.
The goal is to maximise the accuracy on the development set.
Also feel free to improve the model or the training process.

In [None]:
# The parts surrounded by ########## correspond to the early stopping.

model = SentimentClassifier(embeddings, hidden_sizes=[100, 200, 20], freeze_embeddings=False, device='cuda' if torch.cuda.is_available() else 'cpu') # Uses the GPU if one is available.

# Tests the model on a couple of instance before training.
model.eval() # Tells PyTorch we are in evaluation/inference mode (can be useful if dropout is used, for instance).
#print(model(batch_generator.turn_into_batch([preprocess(text) for text in ["This movie was terrible!!", "Pure gold!"]]).to(model.device)))

# Training procedure
learning_rate = 0.004
l2_reg = 0.0001
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.99, weight_decay=l2_reg) # Once the backward propagation has been done,
# just call the 'step' method (with no argument) of this object to update the parameters.
batch_size = 64
subset = None # Use an integer to train on a smaller portion of the training set, otherwise use None.
epoch_size = batch_generator.length("train") if(subset is None) else subset # In number of instances

nb_epoch = 40
epoch_id = 0 # Id of the current epoch
instances_processed = 0 # Number of instances trained on in the current epoch
epoch_loss = [] # Will contain the loss for each batch of the current epoch

##########
earlyStopping = True #Condition to early stop the training according to the following parameter :
patience = 2
#variables used for early stopping :
last_accuracy = 0
trigger_times = 0
##########

train_accuracies = []
dev_accuracies = []

while(epoch_id < nb_epoch):
  model.train() # Tells PyTorch that we are in training mode (can be useful if dropout is used, for instance).
  model.zero_grad() # Makes sure the gradient is reinitialised to zero.
  batch = batch_generator.get_batch(batch_size, data_type="train", subset=subset)

  # (i) compute the prediction of the model (you might want to use ".to(model.device)" on the input of the model),
  outputs = model(batch[0].to(model.device))

  # (ii) compute the loss (use an average over the batch),
  loss_function = torch.nn.BCELoss(reduction="mean")
  target = batch[1].float()
  loss = loss_function(outputs, target.to(model.device))

  # (iii) call "backward" on the loss and
  loss.backward()

  # (iv) store the loss in "epoch_loss".
  epoch_loss.append(loss.item()) # Stores a plain float so that the computation graph of each batch can be freed.
  optimizer.step() # Updates the parameters.

  instances_processed += batch_size
  if(instances_processed > epoch_size):
    print(f"-- END OF EPOCH {epoch_id}.")
    if (len(epoch_loss) != 0) :
      print(f"Average loss: {sum(epoch_loss) / len(epoch_loss)}.")

    # Evaluation
    model.eval() # Tells PyTorch we are in evaluation/inference mode (can be useful if dropout is used, for instance).
    with torch.no_grad(): # Deactivates Autograd (it is computationally expensive and we don't need it here).
      accuracy = evaluation("train")
      print(f"Accuracy on the train set: {accuracy}.")
      train_accuracies.append(accuracy)

      accuracy = evaluation("dev")
      print(f"Accuracy on the dev set: {accuracy}.")
      dev_accuracies.append(accuracy)

    epoch_id += 1
    instances_processed -= epoch_size
    epoch_loss = []

    ##########
    # Saves a copy of the parameters when the current epoch is the best one so far on the dev set.
    # ('accuracy' has already been appended to 'dev_accuracies', so we check whether the maximum was first reached at the last position.)
    if dev_accuracies.index(max(dev_accuracies)) == len(dev_accuracies) - 1:
      best_param = {name: tensor.detach().clone() for name, tensor in model.state_dict().items()}

    # Early stopping on the validation accuracy
    if earlyStopping and epoch_id > 10: # Didn't find a better way to avoid an excessively early stopping (especially for low patience);
                                        # usually 10 epochs is the bare minimum to train the model correctly.
      if accuracy < last_accuracy:
        #print(accuracy, last_accuracy, "Trigger !")
        trigger_times += 1
      else:
        #print("untrigger")
        trigger_times = 0

      if trigger_times >= patience:
        print(f"Dev accuracy went down {trigger_times} times in a row.\nTraining stopped early.")
        break

      last_accuracy = accuracy
    ##########

#resetting the model on the best parameters :
model.load_state_dict(best_param)
print(f"Best epoch is #{dev_accuracies.index(max(dev_accuracies))+1} for a dev accuracy of {max(dev_accuracies)}")
#(using this weird index method because saving the best epoch_id doesn't seem to be working...)
plot(train_accuracies, dev_accuracies)
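
The patience mechanism can be tested in isolation on a list of dev accuracies. The sketch below is a simplified standalone version, not the exact logic of the training loop: `early_stopping_epoch` and its `min_epochs` parameter are hypothetical, and the accuracies are made up.

```python
def early_stopping_epoch(dev_accuracies, patience=2, min_epochs=0):
    """Returns (stop_epoch, best_epoch), both 0-based.
    Training stops once the dev accuracy has decreased `patience` times in a row
    (decreases are only counted from epoch `min_epochs` on); otherwise it runs to the end."""
    best_epoch, decreases = 0, 0
    for epoch, acc in enumerate(dev_accuracies):
        if acc > dev_accuracies[best_epoch]:
            best_epoch = epoch
        if epoch >= min_epochs and epoch > 0:
            # A decrease increments the counter; anything else resets it.
            decreases = decreases + 1 if acc < dev_accuracies[epoch - 1] else 0
            if decreases >= patience:
                return epoch, best_epoch
    return len(dev_accuracies) - 1, best_epoch

stopped = early_stopping_epoch([0.70, 0.80, 0.85, 0.84, 0.83, 0.90])
print(stopped)  # (4, 2)
```

In this toy run, training stops at epoch 4 and never sees the later value of 0.90, which shows the usual trade-off of a low patience: it saves time but can stop before a late improvement.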


In [None]:
model.eval() # Tells PyTorch we are in evaluation/inference mode (can be useful if dropout is used, for instance).
test_reviews = ["This movie was terrible!!", "Pure gold!", "Bad.", "Not bad!", "I loved it", "it's alright", "Useless movie, I wish I could unsee it"]
test = model(batch_generator.turn_into_batch([preprocess(text) for text in test_reviews]).to(model.device))
bool_test = test>0.5

#pretty print of inputs / class
for i in range(len(test_reviews)) :
  out = "+" if bool_test[i].item() else "-"
  print(test_reviews[i], ":", out)
