# Sentiment with Deep Neural Networks and Pretrained Embeddings


In this assignment, you will explore **sentiment** analysis using deep neural networks and **pretrained word embeddings**. \\
We will use a dataset of movie reviews from the IMDb (Internet Movie Database), which contains the text of the reviews and a binary label for their sentiment.

These are the main steps:

- [**1)  Loading the data**](#1)
  (load the csv files, extract inputs and outputs, clean and tokenize the text)
- [**2) Building the Vocabulary**](#2)
  (enumerate all tokens in the train and create a Vocabulary)
- [**3) Numericalizing and Padding the texts**](#3)
  (from texts of different lengths to equal-sized lists of integers)
- [**4)  Preparing the Embeddings**](#4)
  (download pretrained embeddings, arrange them to match the order of the words in the Vocabulary)
- [**5)  Setup the Model**](#5)
  (define the Model, setup the data loaders)
- [**6)  Training and Testing**](#6)
  (define the training / testing functions, execute training / testing)


The following is a rough schema of the Model we are going to use. It is made of:
- an Embedding Layer (pretrained GloVe embeddings) \\
- an averaging operation (to compute the mean embedding of the sentence) \\
- a Linear Layer (which projects the "sentence embedding" to the binary output space) \\

\\
$$
\small{(\textit{batch_size} \times \textit{max_seq_len})}\\
\boxed{\ \small{\textbf{Embedding Layer}}\ } \\
\small{(\textit{batch_size} \times \textit{max_seq_len} \times \textit{embedding_size})}\\
\boxed{\ \small{\textbf{Mean}}\ } \\
\small{(\textit{batch_size} \times \textit{embedding_size})}\\
\boxed{\ \small{\textbf{Linear Layer}}\ } \\
\small{(\textit{batch_size} \times \textit{2})}\\
$$
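
The shape changes in this schema can be traced on a toy example. The following is a minimal pure-Python sketch, not the model used in this assignment: the tiny embedding table, the weights, and the helper names (`embed`, `mean_over_tokens`, `linear`) are made up for illustration. In this sketch the embedding lookup returns one vector per token, the mean collapses the token dimension, and the linear step maps each sentence vector to two raw outputs.

```python
# Toy embedding table: 4 token ids, embedding_size = 3 (values are made up)
EMBEDDINGS = [
    [0.0, 0.0, 0.0],   # id 0
    [1.0, 2.0, 3.0],   # id 1
    [3.0, 2.0, 1.0],   # id 2
    [2.0, 2.0, 2.0],   # id 3
]
# Toy linear layer: 2 output rows with embedding_size weights each, plus a bias
WEIGHTS = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
BIAS = [0.0, 1.0]

def embed(batch_ids):
    """(batch_size, max_seq_len) -> (batch_size, max_seq_len, embedding_size)"""
    return [[EMBEDDINGS[i] for i in ids] for ids in batch_ids]

def mean_over_tokens(batch_emb):
    """(batch_size, max_seq_len, embedding_size) -> (batch_size, embedding_size)"""
    return [[sum(col) / len(seq) for col in zip(*seq)] for seq in batch_emb]

def linear(batch_vec):
    """(batch_size, embedding_size) -> (batch_size, 2)"""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(WEIGHTS, BIAS)]
        for vec in batch_vec
    ]

batch_ids = [[1, 2], [3, 3]]            # batch_size = 2, max_seq_len = 2
word_emb = embed(batch_ids)             # one vector per token
sent_emb = mean_over_tokens(word_emb)   # one vector per sentence
raw_out = linear(sent_emb)              # two raw outputs per sentence

print(sent_emb)
print(raw_out)
```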


<a name="1"></a>
# 1)  Importing the data

The next cell will download the GloVe embeddings. It will take about 20 minutes, but it will run in the background, so you can keep using the other cells in the meantime. \\
Once the download has finished, it will create a file named "DONE" (you can open the "File" section on the left to check whether that file exists). When you need the embeddings, another cell will check whether the download has finished.

In [1]:
import os
import threading
class Downloader(object):
    def __init__(self):
        pass
    def start(self):
        if not os.path.exists("glove.6B.300d.txt"):
            print("Downloading embeddings")
            ! wget -O glove.6B.300d.txt http://ailab.uniud.it/wp-content/uploads/2019/05/glove.6B.300d.txt 2> progress.txt
            with open("DONE", "w"):
                pass
        else:
            print("Embeddings already downloaded!")
downloader = Downloader()
t = threading.Thread(target=downloader.start)
t.start()

Embeddings already downloaded!


In this first part you will
load two csv datasets using pandas (train and test data), 
extract the inputs and the outputs from the datasets and
tokenize the sentences. \\
Let's start by downloading the two datasets.

In [2]:
import os

if not os.path.exists("Train_Movie_Data.csv"):
    print("Downloading Train set")
    ! wget -O Train_Movie_Data.csv http://ailab.uniud.it/wp-content/uploads/2019/05/Train_Movie_Data.csv
else:
    print("Train set already downloaded!")

if not os.path.exists("Test_Movie_Data.csv"):
    print("Downloading Test set")
    ! wget -O Test_Movie_Data.csv http://ailab.uniud.it/wp-content/uploads/2019/05/Test_Movie_Data.csv
else:
    print("Test set already downloaded!")

Train set already downloaded!
Test set already downloaded!


We will use a dataset of movie reviews from the IMDb (Internet Movie Database). This dataset contains the **text of the reviews**, together with a **label** that indicates whether a review is **"positive" or "negative"**. \\
Let's load the Train dataset and visualize it using `pandas`!

In [3]:
import pandas as pd

train = pd.read_csv('Train_Movie_Data.csv')
print("First Elements of train set:\n {}".format(train.head(20)))
print()
print("Number of Train samples:\n", train.shape[0])

First Elements of train set:
                                                review  sentiment
0   In 1974, the teenager Martha Moxley (Maggie Gr...          1
1   OK... so... I really like Kris Kristofferson a...          0
2   ***SPOILER*** Do not read this, if you think a...          0
3   hi for all the people who have seen this wonde...          1
4   I recently bought the DVD, forgetting just how...          0
5   Leave it to Braik to put on a good show. Final...          1
6   Nathan Detroit (Frank Sinatra) is the manager ...          1
7   To understand "Crash Course" in the right cont...          1
8   I've been impressed with Chavez's stance again...          1
9   This movie is directed by Renny Harlin the fin...          1
10  I once lived in the u.p and let me tell you wh...          0
11  Hidden Frontier is notable for being the longe...          1
12  It's a while ago, that I have seen Sleuth (197...          0
13  What is it about the French? First, they (appa...       

Let's save our inputs (reviews) and outputs (sentiments) in two arrays: `X_train` and `y_train`.

In [4]:
X_train = train['review'].values
y_train = train['sentiment'].values

print("X_train:\n", X_train)
print("\ny_train:\n", y_train)

X_train:
 ['In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and

Okay, let's repeat the same process for the Test data: load them and save only the needed data in `X_test` and `y_test`.
Check how many samples there are in the Test dataset.

In [5]:
test = pd.read_csv('Test_Movie_Data.csv')  # BLANK
print("First Elements of test set:\n {}".format(test.head(3)))
print()
print("Number of Test samples:\n", test.shape[0])  # BLANK
X_test = test['review'].values  # BLANK
y_test = test['sentiment'].values  # BLANK

print()
print("X_test:\n", X_test)
print("\ny_test:\n", y_test)

First Elements of test set:
                                               review  sentiment
0  I have seen several comments here about Brando...          1
1  I liked this film very much. The story jumps b...          1
2  There's a part of me that would like to give t...          0

Number of Test samples:
 25002

X_test:
 ['I have seen several comments here about Brando using a Southern accent, some of which felt it was a mistake. When this movie was made, racism and discrimination were very strong in the South. The Jim Crow laws were still in effect. Civil Rights was in it\'s infancy. Could this have possibly been a subtle social commentary, a Southern man in love with a woman of another race? The same way MASH was a subtle criticism of the Viet Nam war? Any thoughts?<br /><br />Another comment was made about Myoshi Umeki appearing "cold". Anyone who has been in Japan would understand. The Japanese people, at least in my experience, did not tend to show emotion in front of stranger

Now create a function that cleans and tokenizes the texts and try it out.
- `tokenize_text` removes unwanted characters, transforms the text to lowercase...
- It also returns a list of words (it tokenizes the original string).

In [6]:
import string
import nltk
nltk.download('punkt')
from nltk import word_tokenize

def tokenize_text(text):

    text = text.replace("`", "'")
    # separate punctuation from words
    for k in string.punctuation:
        if k != "'":
            text = text.replace(k, " "+k+" ")
    
    text_tokens = word_tokenize(text.lower())
    text_clean = text_tokens
    return text_clean

# Try out function that processes texts
print("Original text")
text = "This is a sample text, with some UPPER and lower case words... and punctuation! It's cool."
display(text)
print()
print("Text after tokenization:")
display(tokenize_text(text))

Original text


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mattiadurso/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


"This is a sample text, with some UPPER and lower case words... and punctuation! It's cool."


Text after tokenization:


['this',
 'is',
 'a',
 'sample',
 'text',
 ',',
 'with',
 'some',
 'upper',
 'and',
 'lower',
 'case',
 'words',
 '.',
 '.',
 '.',
 'and',
 'punctuation',
 '!',
 'it',
 "'s",
 'cool',
 '.']

Now that we saw that the function works, let's clean and tokenize all train/test samples.

In [7]:
from tqdm.auto import tqdm

tok_train = []

for text in tqdm(X_train):
    tokenized = tokenize_text(text)  # BLANK
    tok_train.append(tokenized)  # BLANK

tok_test = []

for text in tqdm(X_test): 
    tokenized = tokenize_text(text)  # BLANK
    tok_test.append(tokenized)  # BLANK

print()
print(tok_train[0])

['in', '1974', ',', 'the', 'teenager', 'martha', 'moxley', '(', 'maggie', 'grace', ')', 'moves', 'to', 'the', 'high', '-', 'class', 'area', 'of', 'belle', 'haven', ',', 'greenwich', ',', 'connecticut', '.', 'on', 'the', 'mischief', 'night', ',', 'eve', 'of', 'halloween', ',', 'she', 'was', 'murdered', 'in', 'the', 'backyard', 'of', 'her', 'house', 'and', 'her', 'murder', 'remained', 'unsolved', '.', 'twenty', '-', 'two', 'years', 'later', ',', 'the', 'writer', 'mark', 'fuhrman', '(', 'christopher', 'meloni', ')', ',', 'who', 'is', 'a', 'former', 'la', 'detective', 'that', 'has', 'fallen', 'in', 'disgrace', 'for', 'perjury', 'in', 'o', '.', 'j', '.', 'simpson', 'trial', 'and', 'moved', 'to', 'idaho', ',', 'decides', 'to', 'investigate', 'the', 'case', 'with', 'his', 'partner', 'stephen', 'weeks', '(', 'andrew', 'mitchell', ')', 'with', 'the', 'purpose', 'of', 'writing', 'a', 'book', '.', 'the', 'locals', 'squirm', 'and', 'do', 'not', 'welcome', 'them', ',', 'but', 'with', 'the', 'supp

<a name="2"></a>
# 2)  Building the Vocabulary

Now let's build the Vocabulary.
- Map each word in each text to an integer (an "index"). 
- Note that you will build the vocabulary based on the **training data**. 
- To do so, you will assign an index to every word by iterating over your training set.

The vocabulary will also include some special tokens
- `__PAD__`: padding, which we will use to make all sentences of the same length
- `__UNK__`: a token representing any word that is not in the vocabulary.

In [8]:
# Include special tokens 
# started with pad and unk tokens
Vocab = {'__PAD__': 0, '__UNK__': 1} 

# Note that we build vocab using training data
for tok_text in tok_train:
    for word in tok_text:  # BLANK
        if word not in Vocab:  # BLANK 
            Vocab[word] = len(Vocab)
    
print("Total words in vocab are:", len(Vocab))
# print first 6 elements of Vocab
print("{")
for k,v in list(Vocab.items())[:6]:
    print(f"  '{k}': {v},")
print("  ...\n}")

Total words in vocab are: 80618
{
  '__PAD__': 0,
  '__UNK__': 1,
  'in': 2,
  '1974': 3,
  ',': 4,
  'the': 5,
  ...
}


The dictionary `Vocab` will look like this:
```python
{'__PAD__': 0, 
 '__UNK__': 1,
 'in': 2,
 '1974': 3,
 ',': 4,
 'the': 5,
 ...
```

- Each unique word has a unique integer associated with it.
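- The total number of words in `Vocab` is 80618 in the run shown above (the expected outputs later in this notebook come from a reference run with 80727 words).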
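
Because each word maps to exactly one integer, the mapping can be inverted to turn ids back into words, which is handy for spot-checking numericalized texts later on. This is a minimal sketch on a toy vocabulary; `toy_vocab`, `id_to_word`, and `ids_to_tokens` are illustrative names, not part of the assignment.

```python
toy_vocab = {'__PAD__': 0, '__UNK__': 1, 'in': 2, '1974': 3, ',': 4, 'the': 5}

# Invert the mapping: integer id -> word
id_to_word = {idx: word for word, idx in toy_vocab.items()}

def ids_to_tokens(ids, id_to_word):
    """Map a list of integer ids back to their tokens."""
    return [id_to_word[i] for i in ids]

# The inversion only works if no two words share an id
assert len(id_to_word) == len(toy_vocab)

print(ids_to_tokens([2, 3, 4, 5, 1, 0], id_to_word))
```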

<a name="3"></a>
# 3)  Numericalization and Padding

Write a function that will convert each text to a list of text ids (a list of unique integer IDs representing the processed text).
For words in the text that are not in the vocabulary, set them to the unique ID for the token `__UNK__`.

**Example**

Input a text:
```
'Let the hypotenuse of the triangle be x'
```

`tokenize_text` will first convert the text into a list of lowercase tokens
```
['let', 'the', 'hypotenuse', 'of', 'the', 'triangle', 'be', 'x']
```

Then it will convert each word into its unique integer

```
[807, 5, 1, 19, 5, 17094, 288, 5768]
```
- Notice that the word "hypotenuse" is not in the vocabulary, so it is assigned the unique integer associated with the `__UNK__` token, because it is considered "unknown."



In [9]:
text = "Let the hypotenuse of the triangle be x"
toks = tokenize_text(text)
for tok in toks:
    print(Vocab.get(tok), "\t", tok)

807 	 let
5 	 the
None 	 hypotenuse
19 	 of
5 	 the
17094 	 triangle
288 	 be
5768 	 x


Let's write a function `text_to_ids` that takes in a text and converts it to an array of numbers. Use the `Vocab` dictionary you have just created to numericalize the texts. \\
Use the `vocab_dict` parameter and not a global variable.


In [10]:
def text_to_ids(tokenized_text, vocab_dict, unk_token='__UNK__'):
        
    # Initialize the list that will contain the unique integer IDs of each word
    text_id_list = []
    
    # Get the unique integer ID of the unknown token using the vocab_dict
    # (use the function parameters, not the global Vocab)
    unk_ID = vocab_dict[unk_token] # BLANK
        
    # for each word in the list:
    for word in tokenized_text:
        
        # Get the unique integer ID.
        # If the word doesn't exist in the vocab dictionary,
        # use the unique ID for __UNK__ instead.
        if word in vocab_dict:
            word_ID = vocab_dict[word]
        else:
            word_ID = unk_ID
        
        # Append the unique integer ID to the text_id_list.
        text_id_list.append(word_ID)
    
    return text_id_list

print("Actual text is\n", X_test[1])
print("Processed text is\n", tok_test[1])
print("\nText_ids of text:\n", text_to_ids(tok_test[1], vocab_dict=Vocab))

Actual text is
 I liked this film very much. The story jumps back and forth quite a bit and is not easy to follow. There is no resolution to the story whatsoever, and you are left to wonder what really happened. Since I like that sort of film I enjoyed this. I especially like the "dating" scenes between the boys and I was drawn into their lives. And of course any film with a naked Staphane Rideau will get a couple of extra points. ;-)
Processed text is
 ['i', 'liked', 'this', 'film', 'very', 'much', '.', 'the', 'story', 'jumps', 'back', 'and', 'forth', 'quite', 'a', 'bit', 'and', 'is', 'not', 'easy', 'to', 'follow', '.', 'there', 'is', 'no', 'resolution', 'to', 'the', 'story', 'whatsoever', ',', 'and', 'you', 'are', 'left', 'to', 'wonder', 'what', 'really', 'happened', '.', 'since', 'i', 'like', 'that', 'sort', 'of', 'film', 'i', 'enjoyed', 'this', '.', 'i', 'especially', 'like', 'the', '``', 'dating', '``', 'scenes', 'between', 'the', 'boys', 'and', 'i', 'was', 'drawn', 'into', 'their

**Expected output**

```
Actual text is
 I liked this film very much. The story jumps back and forth quite a bit and is not easy to follow. There is no resolution to the story whatsoever, and you are left to wonder what really happened. Since I like that sort of film I enjoyed this. I especially like the "dating" scenes between the boys and I was drawn into their lives. And of course any film with a naked Staphane Rideau will get a couple of extra points. ;-)
Processed text is
 ['i', 'liked', 'this', 'film', 'very', 'much', '.', 'the', 'story', 'jumps', 'back', 'and', 'forth', 'quite', 'a', 'bit', 'and', 'is', 'not', 'easy', 'to', 'follow', '.', 'there', 'is', 'no', 'resolution', 'to', 'the', 'story', 'whatsoever', ',', 'and', 'you', 'are', 'left', 'to', 'wonder', 'what', 'really', 'happened', '.', 'since', 'i', 'like', 'that', 'sort', 'of', 'film', 'i', 'enjoyed', 'this', '.', 'i', 'especially', 'like', 'the', '``', 'dating', '``', 'scenes', 'between', 'the', 'boys', 'and', 'i', 'was', 'drawn', 'into', 'their', 'lives', '.', 'and', 'of', 'course', 'any', 'film', 'with', 'a', 'naked', 'staphane', 'rideau', 'will', 'get', 'a', 'couple', 'of', 'extra', 'points', '.', ';', '-', ')']

Text_ids of text:
 [158, 418, 186, 472, 377, 403, 24, 5, 113, 6964, 254, 36, 5213, 369, 51, 474, 36, 50, 83, 164, 14, 1624, 24, 145, 50, 506, 6490, 14, 5, 113, 966, 4, 36, 277, 295, 2106, 14, 1106, 219, 159, 387, 24, 871, 158, 160, 55, 1988, 19, 472, 158, 470, 186, 24, 158, 357, 160, 5, 108, 6593, 108, 1117, 576, 5, 990, 36, 158, 31, 2940, 739, 127, 483, 24, 36, 19, 656, 292, 472, 70, 51, 4073, 1, 19879, 178, 764, 51, 1273, 19, 6882, 2934, 24, 840, 16, 12]
 ```
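
A quick way to judge how well the training vocabulary covers unseen text is to count how many ids equal the `__UNK__` id. The helper below, `unk_rate`, is a hypothetical addition shown on made-up id lists; on the real data it could be applied to the lists returned by `text_to_ids`.

```python
def unk_rate(id_lists, unk_id):
    """Fraction of all token ids that are equal to unk_id."""
    total = sum(len(ids) for ids in id_lists)
    if total == 0:
        return 0.0
    unknown = sum(ids.count(unk_id) for ids in id_lists)
    return unknown / total

# Made-up numericalized texts; id 1 plays the role of __UNK__
toy_ids = [[158, 418, 1, 472], [5, 1, 1, 19, 5, 24]]
print(unk_rate(toy_ids, unk_id=1))  # 3 unknown ids out of 10
```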

All samples which will be fed to our simple neural network should be truncated or padded to the same length. 
Let's set a `MAX_SEQ_LEN` of 320 tokens and pad (lengthen) or truncate (cut) all sequences to this length.

Padding is done by using the `__PAD__` token in `Vocab`.

In [11]:
MAX_SEQ_LEN = 320

def pad_sequence(ids, max_len, pad_id):
    
    if max_len <= MAX_SEQ_LEN:
        
        n_pad = max_len - len(ids)  # BLANK

        # If the sequence is too short: pad it
        if n_pad >= 0: 
            # Generate a list of pad_id, with length n_pad
            padding = [pad_id for x in range(n_pad)]  # BLANK
            # concatenate the list of ids and the list of padding ids
            out = ids + padding  # BLANK

        # If the sequence is too long: cut it
        elif n_pad < 0: 
            out = ids[:max_len]  # BLANK
    
        return out
    else:
        # max_len exceeds the global limit: fall back to MAX_SEQ_LEN
        return pad_sequence(ids, MAX_SEQ_LEN, pad_id)

example_sentence_ids = [1,2,3,4,5,6,7,8,9]
print("This is an example of sentence ids (length 9):")
print(example_sentence_ids, "\n")

print("This is the same sequence padded to 13:")
print("|<----- it should be this long ------>|")
print(pad_sequence(example_sentence_ids, 13, 0), "\n")

print("This is the same sequence padded to 5:")
print("|<----------->|")
print(pad_sequence(example_sentence_ids, 5, 0))

This is an example of sentence ids (length 9):
[1, 2, 3, 4, 5, 6, 7, 8, 9] 

This is the same sequence padded to 13:
|<----- it should be this long ------>|
[1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0] 

This is the same sequence padded to 5:
|<----------->|
[1, 2, 3, 4, 5]


If the test above turned out right, we can numericalize and pad all the samples of the two datasets.
This will turn our data (strings of various lengths) into inputs which are understandable for the neural network (**equally sized sequences of integers**).
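
The choice of `MAX_SEQ_LEN` trades memory against lost text: every sequence longer than the limit is cut. One way to inspect that trade-off is to measure the share of sequences that would be truncated for a few candidate lengths. The sketch below uses made-up lengths, and `truncated_share` is a hypothetical helper; on the real data the lengths could be taken from the numericalized training texts.

```python
def truncated_share(lengths, max_len):
    """Share of sequences that are longer than max_len (and would be cut)."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > max_len) / len(lengths)

# Made-up sequence lengths, for illustration only
toy_lengths = [40, 95, 120, 180, 250, 310, 320, 450, 700, 1200]

for candidate in (128, 320, 512):
    print(candidate, truncated_share(toy_lengths, candidate))
```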

In [12]:
ids_train = [text_to_ids(text, Vocab) for text in tok_train]
pad_ids_train = [pad_sequence(ids, MAX_SEQ_LEN, Vocab["__PAD__"]) for ids in ids_train]

print(len(pad_ids_train))
# check if all sequences have the same length MAX_SEQ_LEN
print(all([len(seq) == MAX_SEQ_LEN for seq in pad_ids_train]))

24998
True


Let's do the same for the test samples now

In [13]:
ids_test = [text_to_ids(text, Vocab) for text in tok_test]  # BLANK
pad_ids_test = [pad_sequence(ids, MAX_SEQ_LEN, Vocab["__PAD__"]) for ids in ids_test]  # BLANK

print(len(pad_ids_test))
# check if all sequences have the same length MAX_SEQ_LEN
print(all([len(seq) == MAX_SEQ_LEN for seq in pad_ids_test]))

25002
True


<a name="4"></a>
# 4) Preparing the Embeddings

Now we have a sequence of integers for each text (`pad_ids_train` and `pad_ids_test`). Integers are not very expressive, but you saw in previous lectures that we can use **word embeddings** to create a richer numerical representation of words. \\
We will use pretrained word embeddings, in this case GloVe (https://nlp.stanford.edu/projects/glove/).

We will use them to prepare a `weights_matrix`, which has a row for each word in the `Vocab`. \\
The two structures need to be aligned to work together:
**the n-th row of `weights_matrix` will contain the embedding of the n-th word in `Vocab`**.

First of all let's check if the download that we started at the beginning is DONE. \\
If for any reason the download is not proceeding, paste the following command in an empty cell to restart the download. \\
**(restart the download only if necessary: downloading from zero will take approx 20 minutes)**

```! wget -O glove.6B.300d.txt http://ailab.uniud.it/wp-content/uploads/2019/05/glove.6B.300d.txt```

In [14]:
# Wait for the embeddings to finish downloading
import time
from IPython.display import clear_output

# this will show you the output of the previous wget command while the download is in progress
while not os.path.exists("DONE"):
    #print("Wait for the embeddings to be downloaded. Here is the progress")
    ! tail -n 2 progress.txt | head -n 1
    clear_output(wait=True)
    time.sleep(1)  # poll once per second instead of busy-waiting

print("Embeddings downloaded successfully!")

Embeddings downloaded successfully!


The embeddings you just downloaded in `glove.6B.300d.txt` are given in the following format:

```
word_1 <space> f1 <space> f2 <space> ... <space> f299 <space> f300 <\n>
word_2 <space> f1 <space> f2 <space> ... <space> f299 <space> f300 <\n>
...
word_N <space> f1 <space> f2 <space> ... <space> f299 <space> f300 <\n>
```

For each word we need to recover the vector `[f1, f2, ..., f300]`, which is a 300-dimensional GloVe embedding.

We are going to read the file and save:
`glove.6B.300d_DICT.pkl`: a word-vector dictionary with an entry for each word in `glove.6B.300d.txt` 
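
Each line of this format can be parsed by splitting on spaces: the first field is the word and the remaining fields are the vector components. The sketch below shows this on two made-up lines with 4 components instead of 300; `parse_glove_line` and `load_vectors` are illustrative names. It iterates over the lines one at a time, which is one way to avoid holding the whole text file in memory.

```python
import io

def parse_glove_line(line):
    """Split 'word f1 f2 ... fN' into (word, [f1, ..., fN])."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(x) for x in parts[1:]]

def load_vectors(lines):
    """Build a word -> vector dictionary from an iterable of text lines."""
    vectors = {}
    for line in lines:
        if line.strip():                 # skip empty lines
            word, vect = parse_glove_line(line)
            vectors[word] = vect
    return vectors

# Made-up sample in the same layout as the GloVe file (4 dimensions, not 300)
sample = io.StringIO("the 0.1 -0.2 0.3 0.4\nfilm 1.5 0.0 -2.5 0.25\n")
toy_glove = load_vectors(sample)
print(toy_glove["film"])
```

With a real file, `lines` could be a file handle opened in text mode, since iterating over it also yields one line at a time.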

In [15]:
import pickle
import numpy as np

# This will take approx. 2 minutes

if not os.path.exists("glove.6B.300d_DICT.pkl"):
    print("Extracting embeddings from glove.6B.300d")
    glove = {}
    with open(f'glove.6B.300d.txt', 'rb') as f:
        lines = f.readlines()
        for l in tqdm(lines):
            line = l.decode().split()
            word = line[0]
            vect = np.array(line[1:]).astype(np.float64)  # the np.float alias was removed in NumPy 1.24
            glove[word] = vect
    
    pickle.dump(glove, open(f'glove.6B.300d_DICT.pkl', 'wb'))
    print()

print("Loading embeddings")
glove = pickle.load(open(f'glove.6B.300d_DICT.pkl', 'rb'))

print("Example: embedding of the word 'the'")
print("the", glove["the"])

Loading embeddings
Example: embedding of the word 'the'
the [ 4.6560e-02  2.1318e-01 -7.4364e-03 -4.5854e-01 -3.5639e-02  2.3643e-01
 -2.8836e-01  2.1521e-01 -1.3486e-01 -1.6413e+00 -2.6091e-01  3.2434e-02
  5.6621e-02 -4.3296e-02 -2.1672e-02  2.2476e-01 -7.5129e-02 -6.7018e-02
 -1.4247e-01  3.8825e-02 -1.8951e-01  2.9977e-01  3.9305e-01  1.7887e-01
 -1.7343e-01 -2.1178e-01  2.3617e-01 -6.3681e-02 -4.2318e-01 -1.1661e-01
  9.3754e-02  1.7296e-01 -3.3073e-01  4.9112e-01 -6.8995e-01 -9.2462e-02
  2.4742e-01 -1.7991e-01  9.7908e-02  8.3118e-02  1.5299e-01 -2.7276e-01
 -3.8934e-02  5.4453e-01  5.3737e-01  2.9105e-01 -7.3514e-03  4.7880e-02
 -4.0760e-01 -2.6759e-02  1.7919e-01  1.0977e-02 -1.0963e-01 -2.6395e-01
  7.3990e-02  2.6236e-01 -1.5080e-01  3.4623e-01  2.5758e-01  1.1971e-01
 -3.7135e-02 -7.1593e-02  4.3898e-01 -4.0764e-02  1.6425e-02 -4.4640e-01
  1.7197e-01  4.6246e-02  5.8639e-02  4.1499e-02  5.3948e-01  5.2495e-01
  1.1361e-01 -4.8315e-02 -3.6385e-01  1.8704e-01  9.2761e-02 -1.

We can now use the dictionary `glove` to get the pretrained GloVe embeddings of many words. \\
Let's create a `weights_matrix` which has a row for each word in the `Vocab` of our dataset and fill it with the pretrained embeddings when possible.

**If a word does not appear in the `glove` dictionary** we will initialize its embedding using a **random vector** drawn from a normal distribution `np.random.normal(scale=0.6, size=(300, ))`. Same goes for the two special tokens `__PAD__` and `__UNK__`.

In [16]:
import torch
import random

torch.manual_seed(123)
torch.cuda.manual_seed(123)
np.random.seed(123)
random.seed(123)

matrix_len = len(Vocab)
weights_matrix = np.zeros((matrix_len, 300))

# Initialize embeddings for the special tokens __PAD__ and __UNK__
weights_matrix[0] = np.random.normal(scale=0.6, size=(300, ))
weights_matrix[1] = np.random.normal(scale=0.6, size=(300, ))  

cnt_words_found = 0
cnt_oov = 0

words_not_found = []

for word, idx in tqdm(Vocab.items()):
    # skip special tokens: they were already initialized
    if word in ["__PAD__","__UNK__"]: 
        continue
    
    try: 
        weights_matrix[idx] = glove[word]  # BLANK
        cnt_words_found += 1
    except KeyError:
        cnt_oov+=1
        words_not_found.append(word)
        weights_matrix[idx] = np.random.normal(scale=0.6, size=(300, ))  # BLANK

print("out of vocab words: ", cnt_oov)
print("words found:        ", cnt_words_found)
print("total words:        ", cnt_oov + cnt_words_found)

out of vocab words:  18378
words found:         62238
total words:         80616


**Expected output**
```
out of vocab words:  18486
words found:         62239
total words:         80725
```

Let's take a look at the contents of the `weights_matrix`:

In [17]:
display(weights_matrix)
print(weights_matrix.shape)
print(len(Vocab))

array([[-0.65137836,  0.59840727,  0.1697871 , ...,  0.24941672,
         0.09632665,  0.49185637],
       [ 0.45903291, -0.4973933 , -0.39549079, ..., -0.83662507,
         0.54761276, -0.76414214],
       [-0.44399   ,  0.12817   , -0.25247   , ..., -0.20043   ,
        -0.082191  , -0.06255   ],
       ...,
       [ 0.45443   , -0.19099   ,  0.012089  , ...,  0.77892   ,
         0.35793   , -0.2189    ],
       [ 0.505486  ,  0.24055355,  0.11938975, ...,  0.45069723,
         1.05371616,  0.42042734],
       [-0.15609478, -0.57761698, -0.3779698 , ...,  0.33017153,
        -0.18328693,  0.32758707]])

(80618, 300)
80618


**Expected output** (the first two rows might be different because they are initialized randomly).

```
array([[-0.65137836,  0.59840727,  0.1697871 , ...,  0.24941672,
         0.09632665,  0.49185637],
       [ 0.45903291, -0.4973933 , -0.39549079, ..., -0.83662507,
         0.54761276, -0.76414214],
       [-0.44399   ,  0.12817   , -0.25247   , ..., -0.20043   ,
        -0.082191  , -0.06255   ],
       ...,
       [ 0.45443   , -0.19099   ,  0.012089  , ...,  0.77892   ,
         0.35793   , -0.2189    ],
       [ 0.25495337, -0.14032335,  0.13888993, ...,  0.56438958,
         0.89429585,  1.09692969],
       [-0.81127196,  0.72867943, -1.13618151, ...,  0.95413854,
         1.79100622,  0.85883073]])
(80727, 300)
80727
```

Now the `weights_matrix` is ready to be used later in our final model: it will translate any `text_id` to a 300-dimensional embedding.
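
Since everything downstream relies on row `n` of the matrix belonging to word `n` of the vocabulary, the alignment is worth verifying. The sketch below rebuilds the same construction on a toy scale with plain lists and then checks it. `toy_vocab`, `toy_glove`, `build_matrix`, and the 3-dimensional vectors are all made up for illustration, and Python's `random.gauss` stands in for `np.random.normal`.

```python
import random

random.seed(123)
DIM = 3  # toy embedding size (the assignment uses 300)

toy_vocab = {'__PAD__': 0, '__UNK__': 1, 'good': 2, 'film': 3, 'zzyzx': 4}
toy_glove = {'good': [0.1, 0.2, 0.3], 'film': [0.4, 0.5, 0.6]}

def build_matrix(vocab, pretrained, dim):
    """Row idx holds the pretrained vector of the word with id idx, else a random one."""
    matrix = [None] * len(vocab)
    for word, idx in vocab.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
        else:
            # special tokens and out-of-vocabulary words get a random vector
            matrix[idx] = [random.gauss(0.0, 0.6) for _ in range(dim)]
    return matrix

toy_matrix = build_matrix(toy_vocab, toy_glove, DIM)

# Alignment check: every word found in the pretrained dictionary sits in its own row
aligned = all(
    toy_matrix[idx] == toy_glove[word]
    for word, idx in toy_vocab.items()
    if word in toy_glove
)
print(aligned, len(toy_matrix), len(toy_matrix[0]))
```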

<a name="5"></a>
# 5) Setup the Model

Now you will implement a classifier using neural networks.

For the model implementation, you will use some PyTorch layers from `torch.nn` (`nn.Embedding`, `nn.Linear`, `nn.LogSoftmax`). To learn more about their inputs and outputs, type a command like `help(torch.nn.Embedding)` in an empty cell or look up the official documentation online. \\
This is a more detailed schema of the model we will create:

\\
$$
\small{\textrm{input ids}}: 
\small{(\textit{batch_size} \times \textit{max_seq_len})}\\
\boxed{\ \small{\textbf{Embedding Layer}\\ \texttt{nn.Embedding}}\ } \\
%
\small{\textrm{word embeddings}}:
\small{(\textit{batch_size} \times \textit{max_seq_len} \times \textit{embedding_size})}\\
\boxed{\ \small{\textbf{Mean}\\ \texttt{torch.mean}}\ } \\
%
\small{\textrm{sentence embeddings}}:
\small{(\textit{batch_size} \times \textit{embedding_size})}\\
\boxed{\ \small{\textbf{Linear Layer}\\ \texttt{nn.Linear}}\ } \\
%
\small{\textrm{raw outputs}}:
\small{(\textit{batch_size} \times \textit{2})}\\
\boxed{\ \small{\textbf{Log Softmax}\\ \texttt{nn.LogSoftmax}}\ } \\
%
\small{\textrm{output log-probabilities}}:
\small{(\textit{batch_size} \times \textit{2})}\\
$$

It will:
- take a sequence of word input ids
- get the embedding of each word using an `nn.Embedding` layer initialized with GloVe embeddings
- calculate the mean embedding of the input using `torch.mean`
- pass the mean embedding to a `nn.Linear` layer which will return a tensor of two values (one for each possible output label)
- use an `nn.LogSoftmax` layer to normalize the raw values output by the linear layer into log-probabilities
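
To make the last step concrete: `nn.LogSoftmax` returns the logarithm of the softmax probabilities, so exponentiating its output gives values that sum to 1. The sketch below reproduces that computation with the standard library on made-up raw outputs; `log_softmax` here is a hand-written illustration, not the PyTorch implementation.

```python
import math

def log_softmax(raw):
    """Log of the softmax of a list of raw scores (numerically stabilized)."""
    m = max(raw)
    log_sum = m + math.log(sum(math.exp(x - m) for x in raw))
    return [x - log_sum for x in raw]

raw_output = [2.0, 0.0]                 # made-up output of the linear layer
log_probs = log_softmax(raw_output)
probs = [math.exp(lp) for lp in log_probs]

print(log_probs)                        # both values are <= 0
print(probs, sum(probs))                # probabilities sum to 1 (up to rounding)

# The predicted label is the index of the largest log-probability
predicted_label = log_probs.index(max(log_probs))
print(predicted_label)
```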

About `torch.mean`: you need to specify the **axis** on which to perform the operation. In this case, please choose axis = 1 to get an average embedding vector (an embedding vector that is an average of all words in the sentence). Here is a demo.

In [18]:
import numpy as np

# Pretend that the embeddings use 3 values to embed the meaning of a word
# the sentence has a length of 2
# and we have a batch size of 4
# So the input has a shape (4,2,3)

my_inputs = np.array(
    [
     [ [1,2,3], [4,5,6] ],
     [ [1,2,3], [4,5,6] ],
     [ [1,2,3], [4,5,6] ],
     [ [1,2,3], [4,5,6] ],
    ])

print("The mean along axis 1 creates vectors whose length equals the number of elements in a word embedding")
display(np.mean(my_inputs,axis=1))

print("The mean along axis 2 creates vectors whose length equals the length of the sentence")
display(np.mean(my_inputs,axis=2))

The mean along axis 1 creates vectors whose length equals the number of elements in a word embedding


array([[2.5, 3.5, 4.5],
       [2.5, 3.5, 4.5],
       [2.5, 3.5, 4.5],
       [2.5, 3.5, 4.5]])

The mean along axis 2 creates vectors whose length equals the length of the sentence


array([[2., 5.],
       [2., 5.],
       [2., 5.],
       [2., 5.]])

## Model Definition

Let's implement the neural network we described above in the class `MyModel`!

In [19]:
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, embedding_matrix, output_size=2):
        super().__init__()

        num_embeddings, embedding_dim = embedding_matrix.shape        
        # size the layer from the matrix passed in, not from a global variable
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)  # BLANK
        self.embedding.load_state_dict({'weight': torch.tensor(embedding_matrix)})
        # We choose not to train the embeddings further, as they are pretrained
        # so we set requires_grad to False
        self.embedding.weight.requires_grad = False
        
        self.linear = nn.Linear(embedding_dim, output_size, bias=True)  # BLANK
        self.softmax = nn.LogSoftmax(dim=-1)
        
    def forward(self, input_ids):

        # input_ids -> embeddings -> mean_embedding -> raw_output -> log-probabilities
        # (the variable is named `logits`, but LogSoftmax outputs log-probabilities)
        x = self.embedding(input_ids)
        y = torch.mean(x, axis = 1)
        z = self.linear(y)
        logits = self.softmax(z)
                          
        return logits

tmp_model = MyModel(weights_matrix)
display(tmp_model)
next(tmp_model.parameters()).device

MyModel(
  (embedding): Embedding(80618, 300)
  (linear): Linear(in_features=300, out_features=2, bias=True)
  (softmax): LogSoftmax()
)

device(type='cpu')

**Expected output**

```
MyModel(
  (embedding): Embedding(80727, 300)
  (linear): Linear(in_features=300, out_features=2, bias=True)
  (softmax): LogSoftmax(dim=-1)
)
device(type='cpu')
```

## From numerical inputs to TensorDataset

Remember that our train/test data right now are saved in the following variables:
- `pad_ids_train`, `y_train`
- `pad_ids_test`, `y_test`

In [20]:
print("First elements of pad_ids_train:")
print(pad_ids_train[0:2])
print("pad_ids_train of type", type(pad_ids_train))
print()
print("First elements of y_train:")
print(y_train[0:2])
print("y_train of type", type(y_train))

First elements of pad_ids_train:
[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 5, 15, 16, 17, 18, 19, 20, 21, 4, 22, 4, 23, 24, 25, 5, 26, 27, 4, 28, 19, 29, 4, 30, 31, 32, 2, 5, 33, 19, 34, 35, 36, 34, 37, 38, 39, 24, 40, 16, 41, 42, 43, 4, 5, 44, 45, 46, 9, 47, 48, 12, 4, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2, 58, 59, 60, 2, 61, 24, 62, 24, 63, 64, 36, 65, 14, 66, 4, 67, 14, 68, 5, 69, 70, 71, 72, 73, 74, 9, 75, 76, 12, 70, 5, 77, 19, 78, 51, 79, 24, 5, 80, 81, 36, 82, 83, 84, 85, 4, 86, 70, 5, 87, 19, 5, 88, 54, 89, 90, 9, 91, 92, 12, 55, 31, 2, 93, 19, 5, 94, 2, 5, 95, 96, 4, 97, 98, 5, 99, 36, 51, 100, 19, 101, 36, 102, 14, 103, 5, 37, 24, 104, 105, 106, 107, 104, 105, 106, 107, 108, 37, 2, 22, 108, 50, 51, 109, 110, 111, 4, 70, 5, 112, 113, 19, 51, 37, 19, 51, 114, 42, 115, 116, 55, 31, 117, 118, 51, 119, 6, 120, 121, 31, 51, 122, 24, 5, 123, 36, 124, 125, 126, 127, 128, 14, 103, 5, 37, 59, 129, 130, 40, 42, 24, 131, 4, 51, 132, 54, 36, 133, 134, 2, 58, 31, 135, 14, 136, 137, 5, 

**Expected output**

```
First elements of pad_ids_train:
[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 5, 15, ...
pad_ids_train of type <class 'list'>

First elements of y_train:
[1 0]
y_train of type <class 'numpy.ndarray'>
```

As you can see, neither of them is a PyTorch tensor, so we need to **convert** them to a suitable format.

In addition to just converting them to `LongTensors` (tensors of integers), we will use the `TensorDataset` class provided by `pytorch`, which makes it easy to iterate over collections of tensors.

We will:
- convert `pad_ids_<set>` (inputs) and `y_<set>` (outputs) to `LongTensors`
- create `<set>_dataset`, a `TensorDataset` with two fields: the inputs and the outputs


In [21]:
import torch
from torch.utils.data import TensorDataset

tensor_ids_train = torch.LongTensor(pad_ids_train) # BLANK
labels_train = torch.LongTensor(y_train)     # BLANK
train_dataset = TensorDataset(tensor_ids_train, labels_train)  # BLANK

tensor_ids_test = torch.LongTensor(pad_ids_test)  # BLANK
labels_test = torch.LongTensor(y_test)  # BLANK
test_dataset = TensorDataset(tensor_ids_test, labels_test)  # BLANK

display(tensor_ids_train)
display(labels_train)

tensor([[    2,     3,     4,  ...,     0,     0,     0],
        [  156,    24,    24,  ...,     0,     0,     0],
        [  281,   281,   281,  ...,    36,   108,   390],
        ...,
        [  186,  1445,  1498,  ...,   312,   109,   312],
        [   51,  2484,    55,  ...,  1561,   525, 11413],
        [  158,   253,    14,  ...,   105,   106,   107]])

tensor([1, 0, 0,  ..., 1, 1, 0])

**Expected output**
```
tensor([[    2,     3,     4,  ...,     0,     0,     0],
        [  156,    24,    24,  ...,     0,     0,     0],
        [  281,   281,   281,  ...,    36,   108,   390],
        ...,
        [  186,  1445,  1498,  ...,   312,   109,   312],
        [   51,  2484,    55,  ...,  1561,   525, 11415],
        [  158,   253,    14,  ...,   105,   106,   107]])
tensor([1, 0, 0,  ..., 1, 1, 0])
```
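
To make the structure of a `TensorDataset` concrete, here is a minimal sketch on toy data. The names `toy_ids`, `toy_labels` and `toy_dataset` are hypothetical and are not part of the assignment. It only illustrates that indexing the dataset returns one entry per tensor passed to the constructor.

```python
import torch
from torch.utils.data import TensorDataset

# Toy data (hypothetical): 3 padded "texts" of length 4 and their labels
toy_ids = torch.LongTensor([[2, 3, 4, 0], [5, 6, 0, 0], [7, 8, 9, 10]])
toy_labels = torch.LongTensor([1, 0, 1])

# Both tensors must have the same size along the first dimension
toy_dataset = TensorDataset(toy_ids, toy_labels)

# Indexing returns a tuple with one entry per tensor given to TensorDataset
first_ids, first_label = toy_dataset[0]
print(len(toy_dataset))        # number of samples
print(first_ids, first_label)  # inputs and label of the first sample
```

The same pattern applies to `train_dataset` and `test_dataset`: element `i` is the pair (padded ids of text `i`, label of text `i`).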


## Creating a batch generator (DataLoader)

Most of the time in Natural Language Processing, and in AI in general, we train our models on **batches** of examples.
- If, instead of training with batches of examples, you were to train a model with one example at a time, training would take a very long time.
- We will now use the `torch.utils.data.DataLoader` class to get batches of samples and target labels from the `TensorDataset`s we created before.

Once you create the generator, you can use it in a for loop:

```python
for batch_inputs, batch_targets in dataloader:
    ...
```

You can also get a single batch like this:

```python
batch_inputs, batch_targets = next(iter(dataloader))
```
Each iteration step returns the next batch. \\
Each batch is a pair containing **the same fields we put in the `TensorDataset`**.

**Training batches will be sampled randomly** using a `RandomSampler`, in order to break possible patterns in the data.

**Test batches can be sampled sequentially** with a `SequentialSampler` (not randomized or shuffled), to make it easier to compare the results with the real labels later.
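
As a rough illustration of sequential sampling, the sketch below builds a `DataLoader` on a toy dataset (`toy_dataset` and `toy_loader` are hypothetical names, not part of the assignment). It shows that a `DataLoader` is an iterable, so it has to be wrapped in `iter()` before calling `next()`. It also shows that with a `SequentialSampler` the labels come out in their original order.

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

# Toy dataset (hypothetical): 5 samples with 2 features each
toy_dataset = TensorDataset(
    torch.arange(10).view(5, 2),
    torch.LongTensor([0, 1, 0, 1, 1]),
)
toy_loader = DataLoader(
    toy_dataset,
    sampler=SequentialSampler(toy_dataset),  # keep the original order
    batch_size=2,
)

# A DataLoader is an iterable, not an iterator: wrap it in iter() first
first_inputs, first_targets = next(iter(toy_loader))

# Concatenating the label batches recovers the labels in dataset order
all_targets = torch.cat([targets for _, targets in toy_loader])
print(first_inputs, first_targets)
print(all_targets)
```

Because the order is preserved, predictions collected batch by batch can be compared directly with `y_test`.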

In [22]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# The DataLoader needs to know our batch size for training
batch_size = 16

# Create the DataLoaders for our train and test sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size
        )

# Create the test_dataloader on the test_dataset
# Use a SequentialSampler instead of a RandomSampler to pull out batches sequentially
test_dataloader = DataLoader(
            test_dataset,               # BLANK
            sampler = SequentialSampler(test_dataset),     # BLANK
            batch_size = batch_size,  # BLANK
        )
list(train_dataloader)[0]

[tensor([[  284,   158,    31,  ...,     0,     0,     0],
         [  158,    31,   377,  ...,     0,     0,     0],
         [ 1665, 11643,  5500,  ...,     0,     0,     0],
         ...,
         [  158,   982,   705,  ...,     0,     0,     0],
         [  462,   277,  3438,  ...,     0,     0,     0],
         [ 3828,     4,   145,  ...,     0,     0,     0]]),
 tensor([1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0])]

Now that you have your train/test DataLoaders, you can iterate over them: each batch contains the tensor of your texts as its first element and the corresponding labels as its second element.

Now everything is set up and we are ready for training.

<a name="6"></a>
# 6)  Training and Testing

We will now define the two functions which will:
- train the neural network for one epoch on a given train dataset
- test the trained model on a given test dataset


Let's start with `train_one_epoch`: it takes as input the model, its optimizer (which will take care of updating the weights of the neural network) and the data loader for the train dataset.

Its objective is to train the model on all the batches contained in the dataloader. For each batch of input data:
- the model produces a series of predictions
- the loss function `nn.NLLLoss` calculates how much the predictions differ from the real labels
- the loss is backpropagated through the network, and the weights of the layers are updated

At the same time, we also calculate the accuracy of the predictions, and we report loss and accuracy at the end of the epoch.
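
To see what `nn.NLLLoss` computes on log-probabilities, here is a small sketch on made-up scores (`scores` and `targets` are hypothetical values, not taken from the dataset). It checks that the loss is the negative log-probability of the correct class, averaged over the batch. It also checks that `LogSoftmax` followed by `NLLLoss` gives the same value as `nn.CrossEntropyLoss` applied to the raw scores.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
scores = torch.randn(4, 2)             # hypothetical raw outputs of a linear layer
targets = torch.tensor([1, 0, 0, 1])   # hypothetical true labels

# Same combination used by the model: LogSoftmax followed by NLLLoss
log_probs = nn.LogSoftmax(dim=-1)(scores)
nll = nn.NLLLoss()(log_probs, targets)

# Manual version: pick the log-probability of the true class and average
manual = -log_probs[torch.arange(4), targets].mean()

# CrossEntropyLoss applies log-softmax internally, so it takes raw scores
ce = nn.CrossEntropyLoss()(scores, targets)

# Accuracy: fraction of samples whose most likely class is the true one
preds = torch.argmax(log_probs, dim=1)
accuracy = (preds == targets).float().mean().item()
print(nll.item(), manual.item(), ce.item(), accuracy)
```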

In [1]:
def train_one_epoch(model, optimizer, data_loader):
    running_loss = 0.0
    running_correct = 0
    num_samples = 0
    
    for batch_idx , batch in enumerate(data_loader):

        text, target = batch
        num_samples += target.shape[0]
        
        device = next(model.parameters()).device
        text, target = text.to(device), target.to(device)
        
        # zero all the gradients
        optimizer.zero_grad()

        # set the model to "train" mode
        # and get its outputs
        model.train()  # BLANK
        output = model(text)  # BLANK
        
        loss_funct = nn.NLLLoss()
        loss = loss_funct(output, target)
        
        running_loss += loss.item()
        preds = torch.argmax(output, axis=1)
        running_correct += preds.eq(target.data.view_as(preds)).cpu().sum()
        
        # perform backward pass on the loss
        # and update the parameters using the optimizer
        loss.backward()  # BLANK
        optimizer.step()  # BLANK
        
    loss = running_loss/num_samples
    accuracy = running_correct.cpu().numpy()/num_samples
    
    print('   loss {:<6} | acc {:<6}'.format(round(loss,4), round(accuracy,4)))    
    return loss, accuracy
    

The `test` function takes as input arguments the model and the dataloader for the test set and uses the model for inference. This means that the weights don't need any updates: this is why we don't need to pass the optimizer to the function.

We also need to be careful to set the model in evaluation mode (`model.eval()`), and we can stop tracking the operations on the tensors with `torch.no_grad()`.

Just like the training function, the test function returns the loss and the accuracy on the test set.
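
The effect of `model.eval()` and `torch.no_grad()` can be checked on a tiny stand-in model (`toy_model` below is a hypothetical example, not the `MyModel` class). `eval()` switches the `training` flag, which matters for layers such as dropout or batch normalization. `no_grad()` prevents PyTorch from building the graph needed for backpropagation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real model
toy_model = nn.Sequential(nn.Linear(3, 2), nn.LogSoftmax(dim=-1))
x = torch.randn(4, 3)

toy_model.train()
out_train = toy_model(x)       # operations are tracked for backpropagation

toy_model.eval()               # switch the model to evaluation mode
with torch.no_grad():          # do not track operations inside this block
    out_eval = toy_model(x)

print(out_train.requires_grad, out_eval.requires_grad)
print(toy_model.training)
```

Outputs computed under `no_grad()` cannot be backpropagated through, which is fine for testing and saves memory.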

In [24]:
def test(model, data_loader):
    running_loss = 0.0
    running_correct = 0
    num_samples = 0

    for batch_idx , batch in enumerate(data_loader):

        text, target = batch
        num_samples += target.shape[0]
        
        device = next(model.parameters()).device
        text, target = text.to(device), target.to(device)
        
        # set model in "eval" mode
        model.eval()  # BLANK
        with torch.no_grad():
            output = model(text)
        
        loss_funct = nn.NLLLoss()
        loss = loss_funct(output, target)  # BLANK
        
        running_loss += loss.item()
        preds = torch.argmax(output, axis=1)
        running_correct += preds.eq(target.data.view_as(preds)).cpu().sum()
        
    loss = running_loss/num_samples
    accuracy = running_correct.cpu().numpy()/num_samples
    
    print('   loss {:<6} | acc {:<6}'.format(round(loss,4), round(accuracy,4)))
    
    return loss, accuracy

Let's finally create a model and its optimizer and run the training.

We will train the model for 35 epochs and perform a test every 5 epochs.
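
The optimizer in the next cell only receives the parameters with `requires_grad = True`. The sketch below illustrates that filtering on a toy model, assuming the embedding layer was created with frozen weights (here via `nn.Embedding.from_pretrained(..., freeze=True)`). The real `MyModel` may freeze its embeddings differently, and `toy_weights` and `toy_model` are hypothetical names.

```python
import torch
import torch.nn as nn

# Hypothetical "pretrained" matrix: 10 words, 4 dimensions
toy_weights = torch.randn(10, 4)

# freeze=True sets requires_grad=False on the embedding weights
embedding = nn.Embedding.from_pretrained(toy_weights, freeze=True)
linear = nn.Linear(4, 2)
toy_model = nn.Sequential(embedding, linear)

# Keep only the parameters that should be updated during training
trainable = [p for p in toy_model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-2)

n_total = sum(p.numel() for p in toy_model.parameters())
n_trainable = sum(p.numel() for p in trainable)
print(n_total, n_trainable)  # all parameters vs. trainable ones only
```

Only the weights and bias of the linear layer remain trainable, so the pretrained vectors stay unchanged during training.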

In [25]:
model = MyModel(weights_matrix)
#model.cuda()
# The optimizer needs to know which parameters to update
# We generate the list of model.parameters s.t. requires_grad = True
# This will exclude the pretrained Embeddings we loaded before
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-2)

print("training")
for i in range(35):
    print(f"epoch {i+1:<3}", end="")
    train_one_epoch(model, optimizer, train_dataloader)  # BLANK
    if (i+1)%5 == 0:
        print("testing")
        test(model, test_dataloader)

training
epoch 1     loss 0.0384 | acc 0.6659
epoch 2     loss 0.0341 | acc 0.7299
epoch 3     loss 0.0316 | acc 0.7543
epoch 4     loss 0.0305 | acc 0.7658
epoch 5     loss 0.0299 | acc 0.7708
testing
   loss 0.0283 | acc 0.8036
epoch 6     loss 0.0296 | acc 0.7743
epoch 7     loss 0.0293 | acc 0.7781
epoch 8     loss 0.0289 | acc 0.7846
epoch 9     loss 0.0285 | acc 0.7867
epoch 10    loss 0.0284 | acc 0.7866
testing
   loss 0.0278 | acc 0.7953
epoch 11    loss 0.0282 | acc 0.7904
epoch 12    loss 0.0281 | acc 0.7916
epoch 13    loss 0.0282 | acc 0.7913
epoch 14    loss 0.0281 | acc 0.792 
epoch 15    loss 0.0281 | acc 0.7913
testing
   loss 0.0265 | acc 0.8167
epoch 16    loss 0.0277 | acc 0.7953
epoch 17    loss 0.0277 | acc 0.7965
epoch 18    loss 0.0275 | acc 0.7956
epoch 19    loss 0.0274 | acc 0.7984
epoch 20    loss 0.0274 | acc 0.7963
testing
   loss 0.0268 | acc 0.8067
epoch 21    loss 0.0278 | acc 0.7936
epoch 22    loss 0.0272 | acc 0.8003
epoch 23    loss 0.0274 | acc 0.7

In [26]:
print("testing")
test_loss, test_acc = test(model, test_dataloader)

testing
   loss 0.0277 | acc 0.7963


**Expected output** (may differ slightly because of randomness)

```
testing
   loss 0.0261 | acc 0.818
```
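
A note on the scale of the printed loss: `nn.NLLLoss()` already averages over each batch, and the functions above then divide the sum of these batch means by the number of samples. The printed value is therefore roughly the mean per-sample loss divided by the batch size. The sketch below reproduces this on made-up values and shows one possible alternative, `reduction="sum"`, that yields a per-sample average. This is only an illustration and not a required change to the assignment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical log-probabilities and labels for 8 samples
log_probs = torch.log_softmax(torch.randn(8, 2), dim=-1)
targets = torch.randint(0, 2, (8,))
batch_size = 4

sum_of_batch_means = 0.0
sum_of_batch_sums = 0.0
for start in range(0, 8, batch_size):
    lp = log_probs[start:start + batch_size]
    t = targets[start:start + batch_size]
    sum_of_batch_means += nn.NLLLoss()(lp, t).item()                # as in the notebook
    sum_of_batch_sums += nn.NLLLoss(reduction="sum")(lp, t).item()  # alternative

reported = sum_of_batch_means / 8   # what the notebook functions print
per_sample = sum_of_batch_sums / 8  # average loss per sample
print(reported, per_sample)
```

With full batches, `per_sample` equals `reported * batch_size`. The accuracy values are unaffected by this scaling.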

