### Xinhao Lan 1082620

# Lab 4: Recurrent models

This lab gives you practice with embeddings (word vectors, in this case) and recurrent neural models in NLP. The first part focuses on embeddings as input to a recurrent model, and the second part focuses on embeddings derived from recurrent models, applying them to the task of word sense disambuiguation following the approach from the original paper by Peters et al.

Everybody's machine is different and the neural computations required for this lab are more demanding than in the other assignments in this course. For this reason, it is advisable to use Google CoLab which guarantees a minimum level of performance. It is also recommended to use GPU acceleration; on CoLab, it can be turned on via <code>Runtime>Change Runtime type>GPU</code>.

## Part 1 (45 points)

In the first part of lab 4, we will play with training a recurrent model for part of speech tagging. As an easy exercise, you will observe what happens when you plug in pretrained word embeddings into a neural NLP model and experiment with different sizes of training data.

If you use Google Colab (we recommend so), it may be easiest to place this notebook and <code>lstm_tutorial.py</code> in <code>/Colab Notebooks</code> directory of your Google Drive. Run the code in the cell just below to enable Colab to access the files on Google drive. This will open a pop-up window where you can allow Colab to access your google drive.

In [1]:
#RUN THIS CELL IF USING COLAB TO USE GOOGLE DRIVE FOR STORING lstm_tutorial.py AND/OR DATA FILES
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Set the global variable for your environment to the directory where lstm_tutorial.py is located
# This is important for portability of the notebook and grading.   
WORKING_DIR = '/content/drive/My Drive/Colab Notebooks/NLP' #Feel free to change this
%cd $WORKING_DIR

/content/drive/My Drive/Colab Notebooks/NLP


The neural network solutions in this lab rely on AllenNLP library version 0.9.0 (the other code of this lab assignment may work incorrectly with more recent versions); <code>overrides</code> is required for compatibility. Linguistic resources are from NLTK version 3.6.2 and might work incorrectly in other versions. Install these before proceeding; installation process may vary depending on your system. On CoLab, this can be done via the following command:

In [3]:
# IF USING COLAB, INSTALL allennlp AND nltk AS FOLLOWS
!pip install -U overrides==3.1.0 nltk==3.6.2 allennlp==0.9.0 
# This might require restart of the runtime (Runtime>restart runtime)
# After restart no need to run this cell again

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting overrides==3.1.0
  Downloading overrides-3.1.0.tar.gz (11 kB)
Collecting nltk==3.6.2
  Downloading nltk-3.6.2-py3-none-any.whl (1.5 MB)
[K     |████████████████████████████████| 1.5 MB 7.6 MB/s 
[?25hCollecting allennlp==0.9.0
  Downloading allennlp-0.9.0-py3-none-any.whl (7.6 MB)
[K     |████████████████████████████████| 7.6 MB 53.7 MB/s 
Collecting pytorch-pretrained-bert>=0.6.0
  Downloading pytorch_pretrained_bert-0.6.2-py3-none-any.whl (123 kB)
[K     |████████████████████████████████| 123 kB 75.1 MB/s 
[?25hCollecting parsimonious>=0.8.0
  Downloading parsimonious-0.9.0.tar.gz (48 kB)
[K     |████████████████████████████████| 48 kB 6.7 MB/s 
Collecting flask-cors>=3.0.7
  Downloading Flask_Cors-3.0.10-py2.py3-none-any.whl (14 kB)
Collecting gevent>=1.3.6
  Downloading gevent-21.12.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (5.8 MB)
[K     |████████

**Before you start**,  import required modules:

In [4]:
import random
import nltk
import allennlp

In [5]:
print(f"NLTK version: {nltk.__version__}")
print(f"AllenNLP version: {allennlp.__version__}")

NLTK version: 3.6.2
AllenNLP version: 0.9.0


If you run this for the first time, you may need to download various data using NLTK:

In [6]:
nltk.download('brown')
nltk.download('semcor')
nltk.download('wordnet')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package semcor to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

## Exercise 1: prepare the data (5 points)

Linguistic data come in a variety of formats. You already had a chance to play with POS-annotated corpus data in Lab 1.

In the first exercise, you will access POS-annotated data in one format (NLTK) and save it on the disk in a text format. Start with the tagged sentences from the Brown corpus, which can be retrieved as below:

In [7]:
nltk.corpus.brown.tagged_sents()

[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlant

Now randomize the order of all sentences in the corpus using <code>random.shuffle()</code> function with a seed `42` (for some determinism in the code behaviour) and select the first 50K sentences for training and the next 5K for validation.

In [8]:
# #YOUR CODE HERE
# It is important to keep intended values in the following vars
# DON'T CHANGE VARIABLE NAMES
sentences = [sentence for sentence in nltk.corpus.brown.tagged_sents()]
random.seed(42)
random.shuffle(sentences)

training_brown = sentences[:50000]
validation_brown = sentences[50000:55000]
testing_brown = sentences[55000:]

Define a function for saving your datasets to a text file in the following format:
* one sentence per line
* tokens separated by spaces
* POS tag separated from the token by "###", for example <code>said###VBD</code>.

In [9]:
def write_posdata(sentences, outfile):
    #YOUR CODE HERE
    storage = ""
    for sent in sentences:
        phrase = ""
        for i in sent:
            phrase = phrase + (i[0]+"###"+i[1]+" ")
        phrase = phrase +"\n"
        storage = storage + phrase
    file = open(outfile,"w") 
    file.write(storage)
    file.close() 

Now save your data partitions in different sizes. We will start with small data samples since training on a large dataset may be very slow depending on your machine. We won't use the full 50K sentence training set in this lab since this might take too long.

In [10]:
write_posdata(training_brown,"train_brown.txt")
write_posdata(validation_brown,"validation_brown.txt")
write_posdata(training_brown[:50],"train_brown_50.txt")
write_posdata(validation_brown[:50],"validation_brown_50.txt")
write_posdata(training_brown[:500],"train_brown_500.txt")
write_posdata(validation_brown[:500],"validation_brown_500.txt")
write_posdata(training_brown[:5000],"train_brown_5000.txt")
write_posdata(validation_brown[:5000],"validation_brown_5000.txt")

Congratulations, you have now saved the POS tagged data for model training purposes!

## Exercise 2: train neural POS tagger models (15 points)

We will now play with a neural model. You have installed <code>allennlp</code> which contains all necessary components for this and the training code for an LSTM model, which follows an old AllenNLP tutorial, is contained in <code>lstm_tutorial.py</code>. PLace the latter in the same directory as this notebook. Let us start by loading the model code and data, starting with a tiny sample for demonstration purposes. 

In [11]:
from lstm_tutorial import *

train_dataset_tiny = reader.read("train_brown_50.txt")
validation_dataset_tiny = reader.read("validation_brown_50.txt")

50it [00:00, 10037.58it/s]
50it [00:00, 9897.83it/s]


Fist of all we need to initialize the vocabulary and define an embedding (vector) for each token. We set the embedding size at 300, common in realistic applications. By default, the embeddings are initialized randomly and updated during trining (this can be changed but we start with a standard configuration). We also need to specify the <code>HIDDEN_DIM</code> parameter: the dimensionality of the hidden vector representations in the LSTM cell.

In [12]:
vocab_tiny = Vocabulary.from_instances(train_dataset_tiny + validation_dataset_tiny)

EMBEDDING_DIM = 300
HIDDEN_DIM = 20

token_embedding_tiny = Embedding(num_embeddings=vocab_tiny.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)

100%|██████████| 100/100 [00:00<00:00, 24433.79it/s]


In [13]:
token_embedding_tiny

Embedding()

Download the smallest pretrained word vector model from https://nlp.stanford.edu/projects/glove/, unzip it, and extract the relevant file <code>'glove.6B.300d.txt'</code> in your working directory. The size of the file is 1GB; if using Google Drive with Colab, make sure you have sufficient space. Downloading and uploading the file might take a few minutes. You can <b>either</b> upload the relevant file from your personal machine `or` use the code below directly from CoLab:

In [14]:
#THIS CELL IS OPTIONAL, TO BE USED ON COLAB. YOU CAN USE wget AS BELOW OR ALTERNATIVELY UPLOAD GloVe EMBEDDINGS TO GOOGLE DRIVE FROM YOUR MACHINE
# download the file
!wget http://nlp.stanford.edu/data/glove.6B.zip
# unzip the file
!unzip -d . 'glove.6B.zip'
# remove useless contents
!rm 'glove.6B.200d.txt' 'glove.6B.100d.txt' 'glove.6B.50d.txt'

--2022-06-24 16:12:00--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2022-06-24 16:12:00--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-06-24 16:12:00--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... ^C
Archive:  glove.6B.zip
  inflating: ./glove.6B.50d.txt      
  inflating: ./glove.6B.100d.txt     
  inflating: ./glove.6B.200d.t

Initialize token embeddings with values from pretrained GloVe model:


In [15]:
glove_token_embedding_tiny = Embedding.from_params(vocab=vocab_tiny,
                            params=Params({'pretrained_file':'glove.6B.300d.txt',
                                           'embedding_dim' : EMBEDDING_DIM}))

400000it [00:02, 161446.31it/s]


Now from embedding a single word with <code>token_embedding_tiny</code> we can proceed to mapping a word sequence into a sequence of vectors:

In [16]:
word_embeddings_tiny = BasicTextFieldEmbedder({"tokens": token_embedding_tiny})

The following initializes parameters of an LSTM model using <code>word_embeddings_tiny</code> input encoding

In [17]:
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

model_tiny = LstmTagger(word_embeddings_tiny, lstm, vocab_tiny)

Now define an LSTM model called <code>glove_model_tiny</code> that uses <code>glove_token_embedding_tiny</code>:

In [18]:
#YOUR CODE HERE
glove_word_embeddings_tiny = BasicTextFieldEmbedder({"tokens": glove_token_embedding_tiny})
glove_model_tiny = LstmTagger(glove_word_embeddings_tiny, lstm, vocab_tiny)

Train the **basic model** for the tiny dataset. **<font color="red">Do not clear the output of this cell in the submitted version.</font>**

In [None]:
basic_trainer_tiny=initialize_trainer(model_tiny,vocab_tiny,train_dataset_tiny,validation_dataset_tiny,batch_size=50)
basic_trainer_tiny.train()

accuracy: 0.0042, loss: 4.4558 ||: 100%|██████████| 1/1 [00:00<00:00,  3.87it/s]
accuracy: 0.0029, loss: 4.4593 ||: 100%|██████████| 1/1 [00:00<00:00, 69.47it/s]
accuracy: 0.0042, loss: 4.4498 ||: 100%|██████████| 1/1 [00:00<00:00, 55.60it/s]
accuracy: 0.0029, loss: 4.4541 ||: 100%|██████████| 1/1 [00:00<00:00, 106.54it/s]
accuracy: 0.0042, loss: 4.4438 ||: 100%|██████████| 1/1 [00:00<00:00, 51.49it/s]
accuracy: 0.0049, loss: 4.4489 ||: 100%|██████████| 1/1 [00:00<00:00, 118.31it/s]
accuracy: 0.0052, loss: 4.4379 ||: 100%|██████████| 1/1 [00:00<00:00, 52.86it/s]
accuracy: 0.0236, loss: 4.4438 ||: 100%|██████████| 1/1 [00:00<00:00, 77.67it/s]
accuracy: 0.0220, loss: 4.4319 ||: 100%|██████████| 1/1 [00:00<00:00, 44.86it/s]
accuracy: 0.0619, loss: 4.4386 ||: 100%|██████████| 1/1 [00:00<00:00, 78.69it/s]
accuracy: 0.0639, loss: 4.4259 ||: 100%|██████████| 1/1 [00:00<00:00, 54.80it/s]
accuracy: 0.1013, loss: 4.4335 ||: 100%|██████████| 1/1 [00:00<00:00, 82.94it/s]
accuracy: 0.1122, loss: 4.

{'best_epoch': 999,
 'best_validation_accuracy': 0.4631268436578171,
 'best_validation_loss': 2.7383158206939697,
 'epoch': 999,
 'peak_cpu_memory_MB': 4182.676,
 'peak_gpu_0_memory_MB': 1312,
 'training_accuracy': 0.5041928721174004,
 'training_cpu_memory_MB': 4182.676,
 'training_duration': '0:02:44.652894',
 'training_epochs': 999,
 'training_gpu_0_memory_MB': 1312,
 'training_loss': 2.262674570083618,
 'training_start_epoch': 0,
 'validation_accuracy': 0.4631268436578171,
 'validation_loss': 2.7383158206939697}

You have trained an LSTM POS tagger for the basic model. Now train the <code>glove_model_tiny</code>. **<font color="red">Do not clear the output of this cell in the submitted version.</font>**

In [None]:
#Write your code here
glove_trainer_tiny=initialize_trainer(glove_model_tiny,vocab_tiny,train_dataset_tiny,validation_dataset_tiny,batch_size=50)
glove_trainer_tiny.train()

accuracy: 0.0010, loss: 4.4993 ||: 100%|██████████| 1/1 [00:00<00:00, 54.18it/s]
accuracy: 0.0029, loss: 4.4762 ||: 100%|██████████| 1/1 [00:00<00:00, 77.14it/s]
accuracy: 0.0021, loss: 4.4725 ||: 100%|██████████| 1/1 [00:00<00:00, 55.35it/s]
accuracy: 0.0039, loss: 4.4526 ||: 100%|██████████| 1/1 [00:00<00:00, 80.83it/s]
accuracy: 0.0021, loss: 4.4463 ||: 100%|██████████| 1/1 [00:00<00:00, 51.58it/s]
accuracy: 0.0069, loss: 4.4296 ||: 100%|██████████| 1/1 [00:00<00:00, 90.43it/s]
accuracy: 0.0052, loss: 4.4207 ||: 100%|██████████| 1/1 [00:00<00:00, 49.06it/s]
accuracy: 0.0098, loss: 4.4070 ||: 100%|██████████| 1/1 [00:00<00:00, 74.61it/s]
accuracy: 0.0126, loss: 4.3954 ||: 100%|██████████| 1/1 [00:00<00:00, 53.39it/s]
accuracy: 0.0216, loss: 4.3847 ||: 100%|██████████| 1/1 [00:00<00:00, 138.60it/s]
accuracy: 0.0210, loss: 4.3704 ||: 100%|██████████| 1/1 [00:00<00:00, 48.24it/s]
accuracy: 0.0472, loss: 4.3627 ||: 100%|██████████| 1/1 [00:00<00:00, 126.38it/s]
accuracy: 0.0409, loss: 4.

{'best_epoch': 999,
 'best_validation_accuracy': 0.5447394296951819,
 'best_validation_loss': 2.32021164894104,
 'epoch': 999,
 'peak_cpu_memory_MB': 4182.864,
 'peak_gpu_0_memory_MB': 1312,
 'training_accuracy': 0.7348008385744235,
 'training_cpu_memory_MB': 4182.864,
 'training_duration': '0:02:43.881415',
 'training_epochs': 999,
 'training_gpu_0_memory_MB': 1312,
 'training_loss': 1.3145437240600586,
 'training_start_epoch': 0,
 'validation_accuracy': 0.5447394296951819,
 'validation_loss': 2.32021164894104}

## Exercise 3: Explore training parameters (25 points)

Create separate models on the basis of bigger datasets: the 500 sentence training and 500 sentence validation and 5000 sentence training and 5000 sentence validation. Using the full training set (50K sentences) is optional (your machine might be too slow). Initialize and train the **basic model** on 500 sentence training and 500 sentence validation data. **<font color="red">Do not clear the output of this cell in the submitted version.</font>**

In [None]:
#train the basic model on 500 sentences
#YOUR CODE HERE
train_dataset_regular= reader.read("train_brown_500.txt")
validation_dataset_regular = reader.read("validation_brown_500.txt")
vocab_regular = Vocabulary.from_instances(train_dataset_regular + validation_dataset_regular)
token_embedding_regular = Embedding(num_embeddings=vocab_regular.get_vocab_size('tokens'),embedding_dim=EMBEDDING_DIM)
word_embeddings_regular = BasicTextFieldEmbedder({"tokens": token_embedding_regular})
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
model_regular = LstmTagger(word_embeddings_regular, lstm, vocab_regular)
basic_trainer_regular=initialize_trainer(model_regular,vocab_regular,train_dataset_regular,validation_dataset_regular,batch_size=50)
basic_trainer_regular.train()

500it [00:00, 21146.40it/s]
500it [00:00, 18516.26it/s]
100%|██████████| 1000/1000 [00:00<00:00, 52739.97it/s]
accuracy: 0.0459, loss: 5.0935 ||: 100%|██████████| 10/10 [00:00<00:00, 77.69it/s]
accuracy: 0.0510, loss: 5.0653 ||: 100%|██████████| 10/10 [00:00<00:00, 128.75it/s]
accuracy: 0.0601, loss: 5.0395 ||: 100%|██████████| 10/10 [00:00<00:00, 103.96it/s]
accuracy: 0.0831, loss: 5.0105 ||: 100%|██████████| 10/10 [00:00<00:00, 178.65it/s]
accuracy: 0.0983, loss: 4.9843 ||: 100%|██████████| 10/10 [00:00<00:00, 110.04it/s]
accuracy: 0.1060, loss: 4.9540 ||: 100%|██████████| 10/10 [00:00<00:00, 164.75it/s]
accuracy: 0.1094, loss: 4.9270 ||: 100%|██████████| 10/10 [00:00<00:00, 100.35it/s]
accuracy: 0.1325, loss: 4.8949 ||: 100%|██████████| 10/10 [00:00<00:00, 151.28it/s]
accuracy: 0.1376, loss: 4.8667 ||: 100%|██████████| 10/10 [00:00<00:00, 106.19it/s]
accuracy: 0.1384, loss: 4.8322 ||: 100%|██████████| 10/10 [00:00<00:00, 172.07it/s]
accuracy: 0.1291, loss: 4.8024 ||: 100%|██████████

{'best_epoch': 999,
 'best_validation_accuracy': 0.7268518518518519,
 'best_validation_loss': 1.4180195212364197,
 'epoch': 999,
 'peak_cpu_memory_MB': 4197.32,
 'peak_gpu_0_memory_MB': 1352,
 'training_accuracy': 0.9205324327009039,
 'training_cpu_memory_MB': 4197.32,
 'training_duration': '0:04:54.993714',
 'training_epochs': 999,
 'training_gpu_0_memory_MB': 1352,
 'training_loss': 0.48127667009830477,
 'training_start_epoch': 0,
 'validation_accuracy': 0.7268518518518519,
 'validation_loss': 1.4180195212364197}

Now do the same training (500 sentence training and 500 sentence validation sets) with GloVE embeddings. **<font color="red">Do not clear the output of this cell in the submitted version.</font>**

In [None]:
#YOUR CODE HERE
glove_token_embedding_regular = Embedding.from_params(vocab=vocab_regular,params=Params({'pretrained_file':'glove.6B.300d.txt','embedding_dim' : EMBEDDING_DIM}))
glove_word_embeddings_regular = BasicTextFieldEmbedder({"tokens": glove_token_embedding_regular})
glove_model_regular = LstmTagger(glove_word_embeddings_regular, lstm, vocab_regular)
trainer_glove_model_regular=initialize_trainer(glove_model_regular,vocab_regular,train_dataset_regular,validation_dataset_regular,batch_size=50)
trainer_glove_model_regular.train()

400000it [00:02, 153750.02it/s]
accuracy: 0.0176, loss: 4.9951 ||: 100%|██████████| 10/10 [00:00<00:00, 92.68it/s]
accuracy: 0.0891, loss: 4.8015 ||: 100%|██████████| 10/10 [00:00<00:00, 153.18it/s]
accuracy: 0.1107, loss: 4.6577 ||: 100%|██████████| 10/10 [00:00<00:00, 102.93it/s]
accuracy: 0.1391, loss: 4.4913 ||: 100%|██████████| 10/10 [00:00<00:00, 158.87it/s]
accuracy: 0.1334, loss: 4.3756 ||: 100%|██████████| 10/10 [00:00<00:00, 106.06it/s]
accuracy: 0.1470, loss: 4.2459 ||: 100%|██████████| 10/10 [00:00<00:00, 165.64it/s]
accuracy: 0.1399, loss: 4.1608 ||: 100%|██████████| 10/10 [00:00<00:00, 108.66it/s]
accuracy: 0.1824, loss: 4.0699 ||: 100%|██████████| 10/10 [00:00<00:00, 154.56it/s]
accuracy: 0.1734, loss: 4.0076 ||: 100%|██████████| 10/10 [00:00<00:00, 104.98it/s]
accuracy: 0.1865, loss: 3.9464 ||: 100%|██████████| 10/10 [00:00<00:00, 175.43it/s]
accuracy: 0.1793, loss: 3.8971 ||: 100%|██████████| 10/10 [00:00<00:00, 107.09it/s]
accuracy: 0.2116, loss: 3.8542 ||: 100%|█████

{'best_epoch': 998,
 'best_validation_accuracy': 0.763416477702192,
 'best_validation_loss': 1.1940969467163085,
 'epoch': 999,
 'peak_cpu_memory_MB': 4210.396,
 'peak_gpu_0_memory_MB': 1352,
 'training_accuracy': 0.8917254395549816,
 'training_cpu_memory_MB': 4210.396,
 'training_duration': '0:04:56.095750',
 'training_epochs': 999,
 'training_gpu_0_memory_MB': 1352,
 'training_loss': 0.4417404025793076,
 'training_start_epoch': 0,
 'validation_accuracy': 0.7633219954648526,
 'validation_loss': 1.1941138863563538}

Use a bigger training set now with 5K sentence training and 5K sentence validation sets and random initial embeddings. **<font color="red">Do not clear the output of this cell in the submitted version.</font>**

In [19]:
#YOUR CODE HERE
train_dataset_big= reader.read("train_brown_5000.txt")
validation_dataset_big = reader.read("validation_brown_5000.txt")
vocab_big = Vocabulary.from_instances(train_dataset_big + validation_dataset_big)
token_embedding_big = Embedding(num_embeddings=vocab_big.get_vocab_size('tokens'),embedding_dim=EMBEDDING_DIM)
word_embeddings_big = BasicTextFieldEmbedder({"tokens": token_embedding_big})
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
model_big = LstmTagger(word_embeddings_big, lstm, vocab_big)
basic_trainer_big=initialize_trainer(model_big,vocab_big,train_dataset_big,validation_dataset_big,batch_size=50)
basic_trainer_big.train()

5000it [00:00, 12255.96it/s]
5000it [00:00, 7831.76it/s]
100%|██████████| 10000/10000 [00:00<00:00, 51752.91it/s]
accuracy: 0.1139, loss: 5.3624 ||: 100%|██████████| 100/100 [00:01<00:00, 75.98it/s]
accuracy: 0.1329, loss: 4.8646 ||: 100%|██████████| 100/100 [00:00<00:00, 132.59it/s]
accuracy: 0.1307, loss: 4.3814 ||: 100%|██████████| 100/100 [00:00<00:00, 109.83it/s]
accuracy: 0.1329, loss: 4.0625 ||: 100%|██████████| 100/100 [00:00<00:00, 141.98it/s]
accuracy: 0.1325, loss: 3.9715 ||: 100%|██████████| 100/100 [00:01<00:00, 89.65it/s]
accuracy: 0.1329, loss: 3.8635 ||: 100%|██████████| 100/100 [00:00<00:00, 164.60it/s]
accuracy: 0.1326, loss: 3.8380 ||: 100%|██████████| 100/100 [00:00<00:00, 109.56it/s]
accuracy: 0.1329, loss: 3.7656 ||: 100%|██████████| 100/100 [00:00<00:00, 166.93it/s]
accuracy: 0.1367, loss: 3.7570 ||: 100%|██████████| 100/100 [00:00<00:00, 110.25it/s]
accuracy: 0.1329, loss: 3.6966 ||: 100%|██████████| 100/100 [00:00<00:00, 167.71it/s]
accuracy: 0.1629, loss: 3.68

{'best_epoch': 307,
 'best_validation_accuracy': 0.856992193660306,
 'best_validation_loss': 0.7320199972391128,
 'epoch': 316,
 'peak_cpu_memory_MB': 4292.144,
 'peak_gpu_0_memory_MB': 1398,
 'training_accuracy': 0.9412591202416676,
 'training_cpu_memory_MB': 4292.144,
 'training_duration': '0:08:50.563956',
 'training_epochs': 316,
 'training_gpu_0_memory_MB': 1398,
 'training_loss': 0.3034310123324394,
 'training_start_epoch': 0,
 'validation_accuracy': 0.8585889449613625,
 'validation_loss': 0.738064848780632}

Now do the same training (5K sentence training and 5K sentence validation sets) with GloVE embeddings. **<font color="red">Do not clear the output of this cell in the submitted version.</font>**

In [20]:
#YOUR CODE HERE
glove_token_embedding_big = Embedding.from_params(vocab=vocab_big,params=Params({'pretrained_file':'glove.6B.300d.txt','embedding_dim' : EMBEDDING_DIM}))
glove_word_embeddings_big = BasicTextFieldEmbedder({"tokens": glove_token_embedding_big})
glove_model_big = LstmTagger(glove_word_embeddings_big, lstm, vocab_big)
trainer_glove_model_big=initialize_trainer(glove_model_big,vocab_big,train_dataset_big,validation_dataset_big,batch_size=50)
trainer_glove_model_big.train()

400000it [00:03, 125892.03it/s]
accuracy: 0.1739, loss: 4.5896 ||: 100%|██████████| 100/100 [00:00<00:00, 112.34it/s]
accuracy: 0.2868, loss: 3.8493 ||: 100%|██████████| 100/100 [00:00<00:00, 174.26it/s]
accuracy: 0.2917, loss: 3.6071 ||: 100%|██████████| 100/100 [00:00<00:00, 111.37it/s]
accuracy: 0.3272, loss: 3.3632 ||: 100%|██████████| 100/100 [00:00<00:00, 173.76it/s]
accuracy: 0.3413, loss: 3.2532 ||: 100%|██████████| 100/100 [00:00<00:00, 111.94it/s]
accuracy: 0.3660, loss: 3.0895 ||: 100%|██████████| 100/100 [00:00<00:00, 171.55it/s]
accuracy: 0.3790, loss: 3.0270 ||: 100%|██████████| 100/100 [00:00<00:00, 109.97it/s]
accuracy: 0.3966, loss: 2.9014 ||: 100%|██████████| 100/100 [00:00<00:00, 170.48it/s]
accuracy: 0.4032, loss: 2.8656 ||: 100%|██████████| 100/100 [00:00<00:00, 108.09it/s]
accuracy: 0.4133, loss: 2.7576 ||: 100%|██████████| 100/100 [00:00<00:00, 172.39it/s]
accuracy: 0.4191, loss: 2.7332 ||: 100%|██████████| 100/100 [00:00<00:00, 110.03it/s]
accuracy: 0.4323, loss

{'best_epoch': 494,
 'best_validation_accuracy': 0.8708799873836934,
 'best_validation_loss': 0.656370515525341,
 'epoch': 503,
 'peak_cpu_memory_MB': 4340.976,
 'peak_gpu_0_memory_MB': 1454,
 'training_accuracy': 0.9199425939121446,
 'training_cpu_memory_MB': 4340.976,
 'training_duration': '0:14:00.494951',
 'training_epochs': 503,
 'training_gpu_0_memory_MB': 1454,
 'training_loss': 0.31916643247008325,
 'training_start_epoch': 0,
 'validation_accuracy': 0.8716487935656837,
 'validation_loss': 0.6572038769721985}

For each trained model, record validation accuracy and training duration (they are returned along with other training stats after training a model) and accuracy on the training set. Fill in the numbers in the table below:

| model | validation accuracy | training accuracy | training duration|
|-------|---------------------|---------------|-------------------------------------------
| basic model on 50 sentences| 0.4631268437 | 0.5041928721 | 0:02:44:652894 |
| glove model on 50 sentences| 0.5447394297 | 0.7348008386 | 0:02:43:881415 |
| basic model on 500 sentences| 0.7268518519 | 0.9205324327 | 0:04:54:993714 |
| glove model on 500 sentences| 0.7633219955 | 0.8917254396 | 0:04:56:095750 |
| basic model on 5000 sentences| 0.8585889450 | 0.9412591202 | 0:08:50:563956 |
| glove model on 5000 sentences| 0.8716487936 | 0.9199425939 | 0:14:00:494951 |

**Question.** What do you conclude from these comparisons? when can it be especially beneficial to initialize a model with pretrained embeddings?

**Answer.** In any condition(50, 500, 5000 sentences), validation accuracy with glove models is always higher than the basic model. However, the larger the dataset is, the differences between two conditions are not more clear at all. When the size of training set is limited, pre-trained embeddings will especially benefit the training model. When the training set is vary large, difference of the accuracy will not be that obvious while still exist.

## Comment 
In this lab we used pretrained GloVe embeddings in a model for part of speech tagging. GloVe in its turn is also a neural word embedding model, but it had been trained on a completely different objective. GloVe vectors had been optimised on word cooccurrence matrix decomposition, i.e. on the task of predicting which words tend to occur with which other words. Part of speech certainly plays a role in determining statistical cooccurrence of words, but this role is indirect, and explicit part of speech information has not been used in training GloVe.

This makes our application an example of **transfer learning**, whereby a learned model trained on one objective (e.g. word cooccurrence) can benefit a different application (e.g. POS tagging), because some information is shared between them. 

## Part 2 - ELMo vectors (55 points)

> Indented block



In the second part of this lab we will reproduce the word sense disambiguation strategy that the authors of the ELMo vectors explored. The strategy consists in the following:

- create ELMo embeddings for all tokens in a sense-annotated corpus
- calculate mean sense vectors for each word sense in the training partition of the corpus
- for each sense-annotated token in the test partition of the corpus, assign it to the sense of the word to which its ELMo vector is the closest according to the cosine distance metric
- as a backup strategy, use the 1st sense of the word by default.

As a sense annotated corpus, we can use SemCor, conveniently available within NLTK. <code>semcor.sents()</code> iterates over all sentences represented as lists of tokens, while <code>semcor.tagged_sents()</code> iterates over the same sentences with additional annotation including WordNet lemma identifiers (lemmas in WordNet stand for a word taken in a specific sense).

In [None]:
from nltk.corpus import wordnet as wn  
from nltk.corpus import semcor
semcor.sents()

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', 'Atlanta', "'s", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term', 'end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [None]:
semcor.tagged_sents(tag="sem")

[[['The'], Tree(Lemma('group.n.01.group'), [Tree('NE', ['Fulton', 'County', 'Grand', 'Jury'])]), Tree(Lemma('state.v.01.say'), ['said']), Tree(Lemma('friday.n.01.Friday'), ['Friday']), ['an'], Tree(Lemma('probe.n.01.investigation'), ['investigation']), ['of'], Tree(Lemma('atlanta.n.01.Atlanta'), ['Atlanta']), ["'s"], Tree(Lemma('late.s.03.recent'), ['recent']), Tree(Lemma('primary.n.01.primary_election'), ['primary', 'election']), Tree(Lemma('produce.v.04.produce'), ['produced']), ['``'], ['no'], Tree(Lemma('evidence.n.01.evidence'), ['evidence']), ["''"], ['that'], ['any'], Tree(Lemma('abnormality.n.04.irregularity'), ['irregularities']), Tree(Lemma('happen.v.01.take_place'), ['took', 'place']), ['.']], [['The'], Tree(Lemma('jury.n.01.jury'), ['jury']), Tree(Lemma('far.r.02.far'), ['further']), Tree(Lemma('state.v.01.say'), ['said']), ['in'], Tree(Lemma('term.n.02.term'), ['term']), Tree(Lemma('end.n.02.end'), ['end']), Tree(Lemma('presentment.n.01.presentment'), ['presentments']), ['

## Exercise 1. Extract relevant data from SemCor (5 points)

Let's prepare SemCor data for the disambiguation task. Since this is just an educational exercise and we don't aim at replicating the full results, we can use only a subset of SemCor. Take the first 10K sentences of SemCor and split them **randomly** (with a seed=`42`) into 90% training and 10% testing partitions:

In [None]:
#YOUR CODE HERE
# Don't change variable names!
semcor_sents_10k=list(semcor.sents()[:10000])
semcor_tag_10k=list(semcor.tagged_sents(tag="sem")[:10000])

random.seed(42)
random.shuffle(semcor_sents_10k)
random.shuffle(semcor_tag_10k)

semcor_train=semcor_sents_10k[:9000]
semcor_test=semcor_sents_10k[9000:]
semcor_tag_train=semcor_tag_10k[:9000]
semcor_tag_test=semcor_tag_10k[9000:]

Create a function that takes as input a sentence from SemCor and extracts a list which contains, for each token of the sentence, either the corresponding WordNet Lemma (e.g. <code>Lemma('friday.n.01.Friday')</code>) or <code>None</code>. <code>None</code> corresponds to tokens that are either 1) not annotated for word senses (e.g. articles); 2) are marked up as (part of) a named entity (e.g. "City of Atlanta" or placename "Fulton" annotated as  <code>Tree(Lemma('location.n.01.location'), [Tree('NE', ['Fulton'])])</code>).

In [None]:
def get_lemmas(semcor_sentence):
    #YOUR CODE HERE
    extracted_list=[]
    for token in semcor_sentence:
        if type(token)==list:
            for t in token:
                extracted_list.append(None)
        elif type(token[0])==nltk.tree.Tree:
            for i in range(len(token[0])):
                extracted_list.append(None)
        else:
            for i in range(len(token)):
                extracted_list.append(token.label())

    return extracted_list

In [None]:
# TEST
get_lemmas(semcor.tagged_sents(tag='sem')[0])

[None,
 None,
 None,
 None,
 None,
 Lemma('state.v.01.say'),
 Lemma('friday.n.01.Friday'),
 None,
 Lemma('probe.n.01.investigation'),
 None,
 Lemma('atlanta.n.01.Atlanta'),
 None,
 Lemma('late.s.03.recent'),
 Lemma('primary.n.01.primary_election'),
 Lemma('primary.n.01.primary_election'),
 Lemma('produce.v.04.produce'),
 None,
 None,
 Lemma('evidence.n.01.evidence'),
 None,
 None,
 None,
 Lemma('abnormality.n.04.irregularity'),
 Lemma('happen.v.01.take_place'),
 Lemma('happen.v.01.take_place'),
 None]

You are now able to extract word senses (instantiated by WordNet lemmas) from the corpus. The next step is to associate senses with ELMo vectors. Create a dictionary of contextualized token embeddings from the training corpus grouped by the WordNet sense:

In [None]:
from collections import defaultdict

# DON'T CHANGE THE VARIABLE NAME
Train_embeddings = defaultdict(list)

Now let's create contextualized ELMo word embeddings for the tokens in this corpus. We can load the pretrained ELMo model and define a function <code>sentences_to_elmo()</code> that receives a list of tokenized sentences as input and produces their ELMo vectors.

In [None]:
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"
elmo = Elmo(options_file, weight_file, 1, dropout=0)

def sentences_to_elmo(sentences):
    character_ids = batch_to_ids(sentences)
    embeddings = elmo(character_ids)
    return embeddings

100%|██████████| 336/336 [00:00<00:00, 184365.01B/s]
100%|██████████| 374434792/374434792 [00:12<00:00, 29258704.36B/s]


Now you can process the corpus sentences and produce their ELMo vectors. It is recommended to pass the input to ELMo encoder in batches. A suggested batch size is 50 sentences. For example, the code below processes the first 50 sentences from the corpus:

In [None]:
sentences=semcor.sents()[:50]
embeddings=sentences_to_elmo(sentences)

The <code>embeddings</code> that we obtained is a dictionary that contains a list of ELMo embeddings and a list of masks. The mask tells us which embeddings correspond to tokens in the original input sentences and which correspond to the padding (introduced to give all sentences in the batch the same length).
In principle all embeddings are stored in PyTorch tensors so that they can be used in bigger neural models, but we are not going to do it now. Note that PyTorch tensors can be converted to numpy arrays with `pyTorch_tensor.detach().numpy()`. 

In [None]:
embeddings['elmo_representations'][0]

tensor([[[-6.4618e-03,  6.0215e-03, -3.5598e-01,  ..., -1.1715e-02,
           7.0427e-02, -4.1873e-01],
         [-3.7781e-01,  2.8141e-01, -2.5836e-01,  ..., -4.8547e-01,
           2.5508e-01,  3.6381e-02],
         [ 9.1191e-01,  1.1779e+00, -8.4833e-01,  ...,  9.8472e-01,
           3.3675e-01,  1.6172e-01],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]],

        [[-6.4618e-03,  6.0215e-03, -3.5598e-01,  ..., -4.4876e-02,
           1.1313e-01, -9.9628e-02],
         [ 1.3721e-01, -2.0003e-01, -1.3074e-01,  ...,  5.9482e-01,
           9.3387e-01, -2.6757e-01],
         [ 1.7280e-01,  1.0801e+00, -5.4539e-01,  ...,  3.1966e-01,
          -5.6408e-01,  3.2461e-01],
         ...,
         [ 0.0000e+00,  0

We can check the size of the embeddings we got. It has three dimensions: 1) the number of sentences 2) the number of tokens (corresponds to the tokens in the longest original sentence of the batch; shorter ones were padded) and 3) the dimensionality of the Elmo vector (1024).

In [None]:
embeddings['elmo_representations'][0].detach().size()

torch.Size([50, 59, 1024])

Another thing contained in the <code>embeddings</code> is the mask, a tensor encoding which token vectors correspond to original tokens and which are paddings. It has two dimensions, one corresponding to the sentences in the batch (50) and one corresponding to the token positions:

In [None]:
print(embeddings['mask'].size())
embeddings['mask']

torch.Size([50, 59])


tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])

## Exercise 2. Extract ELMo encoding of sentences using a mask (5 points)  

Now define a function <code>get_masked_vectors(embeddings)</code> that takes embeddings as input and returns a list of ELMo sentence encodings to which the mask has been applied.  The output should be a list of Torch tensors, where the padding vectors have been removed so each sentence is represented by an $n \times 1024$ tensor where $n$ is sentence length.

In [None]:
def get_masked_vectors(embeddings):
    #YOUR CODE HERE
    elmo = embeddings['elmo_representations'][0]
    masks = embeddings['mask'].unsqueeze(2)
    outVec = elmo * masks
    
    outVec = list(torch.chunk(outVec,outVec.size()[0],dim=0))
    for i, out in enumerate(outVec):
        out = torch.squeeze(out,dim=0)
        out = list(torch.chunk(out,out.size()[0],dim=0))
        out = [token for token in out if torch.sum(token)!=0]
        out = torch.cat(out,dim=0)
        outVec[i] = out

    return outVec

## Exercise 3. Collect ELMo vectors from the training corpus (20 points)

Process the corpus updating your train word sense vectors in the dictionary. Iterate over the all the train sentences in the corpus, and retrieve for each lemma-annotated token (where lemma is not <code>None</code>) the corresponding ELMo vector. Store the ELMo sense embeddings that correspond to each lemma in the dictionary <code>Train_embeddings</code>. This step of processing the training corpus with ELMo is the most time consuming part of this assignment. However, it should not take forever. If this computation takes more than an hour, you may want to optimize your code or make sure you are using GPU acceleration. For the purposes of developing and debugging your solution, you may start by use a sample of 100 sentences, but then switch to the full 9K sentence training set. 

In [None]:
# might take ~25min on Colab's GPU
import torch
#YOUR CODE HERE
#Don't forget to populate Train_embeddings 
def batch_get_lemmas(embeddings_list, sentences_tag, batch_size):
    for i in range(batch_size):
        lemmas_list=get_lemmas(sentences_tag[i])
        for k in range(len(lemmas_list)):
            if not lemmas_list[k]:
                continue
            Train_embeddings[lemmas_list[k]].append(embeddings_list[i][k])

batch_size=50
#train half of the data due to the limitation of RAM on google colab
for bat in range(int(len(semcor_train)/batch_size/2)):
    embeddings=sentences_to_elmo(semcor_train[bat*batch_size:(bat+1)*batch_size])
    print("###batch No. ",bat+1,"###")
    for i in range(batch_size):
        lemmas_list = get_lemmas(semcor_tag_train[bat*batch_size:(bat+1)*batch_size][i])
        for j in range(len(lemmas_list)):
            if not lemmas_list[j]:
                continue
            Train_embeddings[lemmas_list[j]].append(embeddings['elmo_representations'][0][i][j])

###batch No.  1 ###
###batch No.  2 ###
###batch No.  3 ###
###batch No.  4 ###
###batch No.  5 ###
###batch No.  6 ###
###batch No.  7 ###
###batch No.  8 ###
###batch No.  9 ###
###batch No.  10 ###
###batch No.  11 ###
###batch No.  12 ###
###batch No.  13 ###
###batch No.  14 ###
###batch No.  15 ###
###batch No.  16 ###
###batch No.  17 ###
###batch No.  18 ###
###batch No.  19 ###
###batch No.  20 ###
###batch No.  21 ###
###batch No.  22 ###
###batch No.  23 ###
###batch No.  24 ###
###batch No.  25 ###
###batch No.  26 ###
###batch No.  27 ###
###batch No.  28 ###
###batch No.  29 ###
###batch No.  30 ###
###batch No.  31 ###
###batch No.  32 ###
###batch No.  33 ###
###batch No.  34 ###
###batch No.  35 ###
###batch No.  36 ###
###batch No.  37 ###
###batch No.  38 ###
###batch No.  39 ###
###batch No.  40 ###
###batch No.  41 ###
###batch No.  42 ###
###batch No.  43 ###
###batch No.  44 ###
###batch No.  45 ###
###batch No.  46 ###
###batch No.  47 ###
###batch No.  48 ###
#

How many senses does your Train_embeddings contain? **<font color="red">Do not clear the output of this cell in the submitted version.</font>**

In [None]:
print(len(Train_embeddings))

10051


## Exercise 4. Vector averaging (5 points)

Your <code>Train_embeddings</code> now is a list of all vectors for a given word sense in the training corpus. For our purposes, we do not need the full list but the mean vector for each sense. For each sense in <code>Train_embeddings</code>, substitute the list by the average ELMo vector on the list. One efficient way to do this is to convert the list to a tensor via <code>stack</code> function and use Torch's <code>mean</code> function. Below is an example of how an average of two (random) vectors stored in a tensor can be computed in PyTorch: 

In [None]:
 randtensor = torch.randn(2, 4)
 print("Tensor storing two 4-dimensional vectors:\n",randtensor)
 print("Average vector: \n",randtensor.mean(dim=0))

Tensor storing two 4-dimensional vectors:
 tensor([[ 0.7262,  0.5241, -2.0477,  1.1439],
        [ 0.3781,  0.5274,  0.7336,  0.6302]])
Average vector: 
 tensor([ 0.5522,  0.5257, -0.6570,  0.8870])


Now you are ready to update your <code>Train_embeddings</code> so that it maps lemmas not to lists but to averaged vectors.

In [None]:
#YOUR CODE HERE
for key, value in Train_embeddings.items():
    vec = torch.stack([value,value],dim=0)
    vec = torch.mean(vec,dim=0)
    assert (vec.size()==torch.Size([1024]))
    Train_embeddings[key] = vec

## Exercise 5. Testing the sense vectors (20 points)

Test your sense embeddings on your test data, which is a subset of the SemCor corpus. Use the strategy outlined above, with 1st WordNet sense as a fallback: 

- rely on mean sense vectors for each word sense in the training partition of the corpus, as stored in <code>Train_embeddings</code>
- for each sense-annotated token <i>t</i> (e.g. the verb "run") in the test partition of the corpus, assign to it the sense of the word "Lemma('X.v.n.run')" to which the ELMo vector <i>t</i> is the closest according to the cosine distance metric
- as a backup strategy, use the 1st sense of the word (e.g. <code>Lemma('run.n.01.run')</code>) from WordNet. You can look it up using a built-in function from NLTK (e.g. <code>wn.lemmas('run')</code>). More on usage of WordNet with NLTK [here](https://www.nltk.org/howto/wordnet.html).

Calculate WSD accuracy in percentage points on your test data. Report three numbers
- overall accuracy (proportion of times the ELMo method+WordNet backup results in the correct sense annotation)
- WordNet baseline accuracy: what if you always select the first WordNet sense, ignoring the ELMo embedding?
- accuracy of the ELMo method just for the instances in which ELMo strategy is applicable
- accuracy of the WordNet baseline just for the instances in which ELMo strategy is applicable

For the purpose of testing the model, it is important to implement comparison of predicted and ground truth synsets correctly. To do this, use a string conversion, because ```==``` applied to WordNet lemmas only compares the words that express the two lemmas, ignoring the synsets. See the code below:

In [None]:
word="toy"
toy1 = wn.lemmas(word)[0]
toy2 = wn.lemmas(word)[6]
print("toy1 (noun):", toy1)
print("toy2 (verb):", toy2)
print("direct equality comparison of toy 1 and toy2: toy1==toy2",toy1==toy2)
print("string based comparison of toy 1 and toy2: str(wordnet.lemmas(word)[0])==str(wordnet.lemmas(word)[1])",
      str(wn.lemmas(word)[0])==str(wn.lemmas(word)[1]))

toy1 (noun): Lemma('plaything.n.01.toy')
toy2 (verb): Lemma('toy.v.02.toy')
direct equality comparison of toy 1 and toy2: toy1==toy2 True
string based comparison of toy 1 and toy2: str(wordnet.lemmas(word)[0])==str(wordnet.lemmas(word)[1]) False


In [None]:
from torch.nn.functional import cosine_similarity

all_outcomes = []
#YOUR CODE HERE
batch_size=50
all_outcomes = all_outcomes + [0, 0, 0, 0]
correct = [0,0,0,0]
num_all_lemmas = 0
vecs = []
for val in Train_embeddings.values():
    vecs.append(val) 
vecs = torch.stack(vecs, dim = 0)

for bat in range(int(len(semcor_test)/batch_size)):
    embeddings=sentences_to_elmo(semcor_test[bat*batch_size:(bat+1)*batch_size])
    print("###batch No. ",bat+1,"###")
    for i in range(batch_size):
        lemmas_list=get_lemmas(semcor_tag_test[bat*batch_size:(bat+1)*batch_size][i])
        for k in range(len(lemmas_list)):
            if type(lemmas_list[k])!=nltk.corpus.reader.wordnet.Lemma:
                continue
            num_all_lemmas += 1
            if k>=len(semcor_test[bat*batch_size:(bat+1)*batch_size][i]):
                break
            if len(wn.lemmas(semcor_test[bat*batch_size:(bat+1)*batch_size][i][k]))>0:
                wn_lemma=wn.lemmas(semcor_test[bat*batch_size:(bat+1)*batch_size][i][k])[0]
            else:
                wn_lemma=None
            similarities = torch.nn.functional.cosine_similarity(embeddings['elmo_representations'][0][i][k].expand(vecs.size()), vecs)
            elmo_lemma=list(Train_embeddings.keys())[similarities.argmax()]
            if type(elmo_lemma)!=nltk.corpus.reader.wordnet.Lemma or (not wn_lemma):
                continue
            if elmo_lemma==lemmas_list[k] or wn_lemma==lemmas_list[k]:
                all_outcomes[0]+=1
            if wn_lemma==lemmas_list[k]:
                all_outcomes[1]+=1
            if elmo_lemma==lemmas_list[k]:
                all_outcomes[2]+=1
            if wn_lemma==lemmas_list[k] and elmo_lemma==lemmas_list[k]:
                all_outcomes[3]+=1

accuracy = all_outcomes[0]/num_all_lemmas
baseline_accuracy = all_outcomes[1]/num_all_lemmas
elmo_accuracy = all_outcomes[2]/num_all_lemmas
baseline_on_elmo_data_accuracy = all_outcomes[3]/num_all_lemmas

print("correct:", all_outcomes)
print("num_all_lemmas:", num_all_lemmas)
print("Overall accuracy:", accuracy) 
print("WordNet baseline:", baseline_accuracy)
print("Accuracy in cases where ELMo method is used", elmo_accuracy)
print("Accuracy of the baseline in cases where ELMo method is applicable", baseline_on_elmo_data_accuracy)

###batch No.  1 ###
###batch No.  2 ###
###batch No.  3 ###
###batch No.  4 ###
###batch No.  5 ###
###batch No.  6 ###
###batch No.  7 ###
###batch No.  8 ###
###batch No.  9 ###
###batch No.  10 ###
###batch No.  11 ###
###batch No.  12 ###
###batch No.  13 ###
###batch No.  14 ###
###batch No.  15 ###
###batch No.  16 ###
###batch No.  17 ###
###batch No.  18 ###
###batch No.  19 ###
###batch No.  20 ###
correct: [6681, 5704, 5701, 4724]
num_all_lemmas: 9709
Overall accuracy: 0.6881244206406427
WordNet baseline: 0.5874961376042847
Accuracy in cases where ELMo method is used 0.5871871459470595
Accuracy of the baseline in cases where ELMo method is applicable 0.4865588629107014


Make sure you have the following variables defined so that this cell runs smoothly.  
**<font color="red">Do not delete the output of this cell in the submitted version.</font>**

In [None]:
# Don't round the numbers
print("Overall accuracy:", accuracy) 
print("WordNet baseline:", baseline_accuracy)
print("Accuracy in cases where ELMo method is used", elmo_accuracy)
print("Accuracy of the baseline in cases where ELMo method is applicable", baseline_on_elmo_data_accuracy)

Overall accuracy: 0.6881244206406427
WordNet baseline: 0.5874961376042847
Accuracy in cases where ELMo method is used 0.5871871459470595
Accuracy of the baseline in cases where ELMo method is applicable 0.4865588629107014


If you reached this point, you were able to evaluate ELMo as a model of contextual semantic similarity of word usages. The idea behind the vector averaging is that a word when used in the same sense should have similar vector representations, while usages in distinct senses should have different vector representations.

Analyze the numbers above. What do they tell you?

We only use half of the data to train in the colab due to the limitation of RAM. However, in this situation, the accuracy is still higher than the accuracy of baseline. So, in conclusion, using ELMo can generate more representative embeddings with higher accuracy.


## The end
Congratulations! this is the end of Lab 4.

**Acknowledgements**: Tejaswini Deoskar has given valuable comments that helped improve this lab assignment. Timothee Mickus helped to test this assignment and gave extensive feedback on the instructions. Many thanks to both.