# Word Sense Disambiguation using Neural Networks
Adam Ek

----------------------------------

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on [how to work on group assignments](https://github.com/sdobnik/computational-semantics/blob/master/README.md).

Write all your answers and the code in the appropriate boxes below.

----------------------------------

A problem with static distributional vectors is the difficulty of distinguishing between different *word senses*. We will continue our exploration of word vectors by considering *trainable vectors* or *word embeddings* for Word Sense Disambiguation (WSD).

The goal of word sense disambiguation is to train a model to find the sense of a word (homonyms of a word-form). For example, the word "bank" can mean "sloping land" or "financial institution". 

(a) "I deposited my money in the **bank**" (financial institution)

(b) "I swam from the river **bank**" (sloping land)

In cases (a) and (b) we can determine the meaning of "bank" from the *context*. To utilize context in a semantic model we use *contextualized word representations*. Previously we worked with *static word representations*, i.e. representations that do not depend on the context. To illustrate, consider sentences (a) and (b): the word **bank** would have the same static representation in both sentences, which makes it difficult to predict its sense. What we want is to create representations that depend on the context, i.e. *contextualized embeddings*. 

We will create contextualized embeddings with Recurrent Neural Networks. You can read more about recurrent neural networks [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). Your overall task in this lab is to create a neural network model that can disambiguate the word sense of 30 different words. 

Name: **MAX BOHOLM**

*Second attempt.*

In [1]:
# first we import some packages that we need
import torch
import torch.nn as nn
import torchtext
import torch.nn.functional as F

# and define our device (fall back to the CPU when no GPU is available)
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

print(f"PyTorch Version: {torch.__version__}")

PyTorch Version: 1.7.0+cu101


In [2]:
# our hyperparameters for Part A (add more when/if you need them)
a_batch_size = 2
a_learning_rate = 0.001
a_epochs = 3
a_hidden = 2

# 1. Working with data

A central part of any machine learning system is the data we're working with. In this section we will split the data (the dataset is located here: ``wsd-data/wsd_data.txt``) into a training set and a test set. We will also create a baseline to compare our model against. Finally, we will use TorchText to transform our data (raw text) into a convenient format that our neural network can work with.

## Data

The dataset we will use contains different word senses for 30 different words. The data is organized as follows (values separated by tabs): 
- Column 1: word-sense
- Column 2: word-form
- Column 3: index of word
- Column 4: white-space tokenized context

### Splitting the data

Your first task is to separate the data into a *training set* and a *test set*. The training set should contain 80% of the examples and the test set the remaining 20%. The examples for the test/training set should be selected **randomly**. Save each dataset into a .csv file for loading later. **[2 marks]**

AE: Looks good! **2 marks**

In [3]:
import random
def data_split(path_to_dataset, directory_for_output="wsd_data", train_frac=0.8):

    with open(path_to_dataset, mode="r") as f:
        data=[example for example in f.read().split("\n") if len(example.split("\t")) == 4]
        
    #data=data[:20000] #OBS!

    random.shuffle(data)
    n_train = int(len(data)*train_frac)
    train=data[:n_train]
    test=data[n_train:]
    
    with open(f"{directory_for_output}/train.csv", mode="w") as f:
        f.write("\n".join(train))
    with open(f"{directory_for_output}/test.csv", mode="w") as f:
        f.write("\n".join(test))

data_split("wsd_data/wsd_data.txt")

### Creating a baseline

Your second task is to create a *baseline* for the task. A baseline is a "reality check" for a model: given a very simple heuristic/algorithmic/model solution to the problem, can our neural network perform better than this?
The baseline you are to create is the "most common sense" (MCS) baseline. For each word form, find the sense most commonly assigned to it, and label every occurrence of the word with that sense. **[2 marks]**

E.g. in a fictional dataset, "bank" has two senses: "financial institution", which occurs 5 times, and "side of river", which occurs 3 times. Thus, all 8 occurrences of "bank" are labeled "financial institution", which yields an MCS accuracy of 5/8 = 62.5%. If a model obtains a higher score than this, we can conclude that the model is *at least* better than selecting the most frequent word sense.
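
The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch on the fictional counts; `toy_data` and `mcs_accuracy` are illustrative names and not part of the lab code.

```python
from collections import Counter

# Fictional data from the example: (sense, word-form) pairs.
toy_data = [("financial institution", "bank")] * 5 + [("side of river", "bank")] * 3

def mcs_accuracy(pairs):
    """Return {word_form: (most_common_sense, accuracy)} for (sense, form) pairs."""
    counts = {}
    for sense, form in pairs:
        counts.setdefault(form, Counter())[sense] += 1
    result = {}
    for form, sense_counts in counts.items():
        top_sense, top_count = sense_counts.most_common(1)[0]
        result[form] = (top_sense, top_count / sum(sense_counts.values()))
    return result

print(mcs_accuracy(toy_data))  # {'bank': ('financial institution', 0.625)}
```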

AE: Looks good! **2 marks**

In [4]:
def mcs_baseline(path_to_data="wsd_data/wsd_data.txt"): #baseline on the test data alone?
    with open(path_to_data, mode="r") as f:
        data=[tuple(line.split("\t")[:2]) for line in f.read().split("\n") if line != ""]
   
    counts={lemma:{} for lemma in [lemma for sense, lemma in data]}
    for sense, lemma in data:
        if sense in counts[lemma]:
            counts[lemma][sense]+=1
        else:
            counts[lemma][sense]=1  
    
    baseline={lemma:{} for lemma in counts.keys()}
    for lemma in counts.keys():
        my_top_sense = list(counts[lemma].keys())[0]
        for sense in counts[lemma].keys():
            if counts[lemma][sense] > counts[lemma][my_top_sense]:
                my_top_sense = sense
        total=sum(counts[lemma].values())
        baseline[lemma]["sense"]=my_top_sense
        baseline[lemma]["accuracy"]=counts[lemma][my_top_sense] / total
        baseline[lemma]["no_of_senses"]=len(counts[lemma].keys())
    
    return baseline

my_baseline = mcs_baseline()
#print(my_baseline)

### Creating data iterators

To train a neural network, we first need to prepare the data. This involves converting words (and labels) to a number, and organizing the data into batches. We also want the ability to shuffle the examples such that they appear in a random order.  

To do all of this we will use the torchtext library (https://torchtext.readthedocs.io/en/latest/index.html). In addition to converting our data into numerical form and creating batches, it will generate a word and label vocabulary, and data iterators that can sort and shuffle the examples. 
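
To make "converting words to numbers" and "fixing paddings" concrete, here is a rough plain-Python sketch of the idea. It is not how torchtext is implemented (its vocabulary ordering and options may well differ); `build_vocab` and `numericalize` are hypothetical helper names used only for illustration.

```python
def build_vocab(tokenized_sentences, specials=("<unk>", "<pad>")):
    """Map each token to an integer id; special tokens get the lowest ids."""
    stoi = {token: i for i, token in enumerate(specials)}
    for sentence in tokenized_sentences:
        for token in sentence:
            if token not in stoi:
                stoi[token] = len(stoi)
    return stoi

def numericalize(batch, stoi):
    """Convert a batch of token lists to equal-length lists of ids."""
    max_len = max(len(sentence) for sentence in batch)
    unk, pad = stoi["<unk>"], stoi["<pad>"]
    rows = []
    for sentence in batch:
        ids = [stoi.get(token, unk) for token in sentence]
        rows.append(ids + [pad] * (max_len - len(ids)))  # pad on the right
    return rows

sentences = [s.split(" ") for s in ["i swam from the river bank", "the bank closed"]]
stoi = build_vocab(sentences)
print(numericalize(sentences, stoi))
```

The shorter sentence is filled up with the id of `<pad>` so that both rows have the same length and can be stacked into one tensor.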

Your task is to create a dataloader for the training and test set you created previously. So, how do we go about doing this?

1) First we create a ``Field`` for each of our columns. A field is an object that tokenizes the input, keeps a dictionary of word-to-numbers, and fixes paddings. So, we need four fields, one for the word-sense, one for the position, one for the lemma and one for the context. 

2) After we have our fields, we need to process the data. For this we use the ``TabularDataset`` class. We pass the name and path of the training and test files we created previously, then we assign which field to use in each column. The result is that each column will be processed by the field indicated. So, the context column will be tokenized and processed by the context field and so on. 

3) After we have processed the dataset we need to build the vocabulary, for this we call the function ``build_vocab()`` on the different ``Fields`` with the output from ``TabularDataset`` as input. This looks at our dataset and creates the necessary vocabularies (word-to-number mappings). 

4) Finally, the last step. In the last step we load the data objects given by the ``TabularDataset`` and pass them to the ``BucketIterator`` class. This class will organize our examples into batches and shuffle them around (such that for each epoch the model observes the examples in a different order). When we are done with this we can let our function return the data iterators and vocabularies, then we are ready to train and test our model!

Implement the dataloader. [**2 marks**]

*hint: for TabularDataset and BucketIterator use the class function splits()* 

AE: Looks good! **2 marks**

In [5]:
#from torchtext.legacy.data import Field, BucketIterator, Iterator, TabularDataset # Needed for running this on my laptop
from torchtext.data import Field, BucketIterator, Iterator, TabularDataset

def dataloader(directory="wsd_data",
               train_file="train.csv",
               test_file="test.csv",
               batch=a_batch_size):
    
    whitespacer = lambda x: x.split(' ') #from: https://canvas.gu.se/files/4597768/download?download_frd=1
    to_int      = lambda x: [int(x[0])]
    
    SENSE = Field(batch_first = True)

    LEMMA = Field(batch_first = True) 
    
    INDEX = Field(batch_first   = True,
                  use_vocab     = False,
                  preprocessing = to_int
                 ) 
    
    CONTEXT = Field(tokenize    = whitespacer,
                    lower       = True,
                    batch_first = True,
                    init_token  = "<start>", 
                    eos_token   = "<end>"
                   ) 
    
    my_fields = [("sense", SENSE),
                 ("lemma", LEMMA),
                 ("index", INDEX),
                 ("context", CONTEXT)]
    
    train, test = TabularDataset.splits(path   = directory,
                                        train  = train_file,
                                        test   = test_file,
                                        format = 'csv',
                                        fields = my_fields,
                                        csv_reader_params = {'delimiter':'\t',
                                                             'quotechar':'¤'}) 
                                        #"¤" not in data
    SENSE.build_vocab(train) #labels
    LEMMA.build_vocab(train) #lemmas 
    CONTEXT.build_vocab(train) #Vocabulary

    train_iter, test_iter = BucketIterator.splits((train, test),
                                                  batch_size        = batch,
                                                  sort_within_batch = True,
                                                  sort_key          = lambda x: len(x.context),
                                                  shuffle           = True,
                                                  device            = device)

    return train_iter, test_iter, CONTEXT.vocab, SENSE.vocab, LEMMA.vocab  
    

# 2.1 Creating and running a Neural Network for WSD

In this section we will create and run a neural network to predict word senses based on *contextualized representations*.

### Model

We will use a bidirectional Long Short-Term Memory (LSTM) network to create a representation for the sentences and a Linear classifier to predict the sense of each word.

When we initialize the model, we need a few things:

    1) An embedding layer: a dictionary from which we can obtain word embeddings
    2) An LSTM module to obtain contextual representations
    3) A classifier that computes scores for each word-sense given *some* input


The general procedure is the following:

    1) For each word in the sentence, obtain word embeddings
    2) Run the embedded sentences through the RNN
    3) Select the appropriate hidden state
    4) Predict the word-sense 

**Suggestion for efficiency:**  *Use a low dimensionality (32) for word embeddings and the LSTM when developing and testing the code, then scale up when running the full training/tests*
    
Your tasks will be to create two different models (both follow the two outlines described above), described below:

#### MODEL 1: Ambiguous Word Approach

In the first approach to WSD, you are to select the index of our target word (column 3 in the dataset) and predict the word sense. **[5 marks]**


AE: This needs some work. A problem I see here is that you predict the sense for every word, then select the prediction for the ambiguous word. This is problematic because, when backpropagating, the classifier will update its predictions for ALL words in the sentence (such that it becomes better for all words), not only for the ambiguous word, which is what we want. 

So, you should change `classifications = self.classifier(contextualized_embedding)` to something like `classifications = self.classifier(contextualized_embedding[SELECT_AMBIGUOUS_WORD])`. You sorta have the key to this already when you're selecting the *predictions* of the ambiguous word.
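
One possible way to do this selection without a Python loop is advanced indexing with one row index and one position index per sentence. The following is a hedged sketch, not the required solution: `select_target_states` is a hypothetical helper, the toy tensor stands in for the LSTM output, and any offset for a `<start>` token is assumed to have been added to `index` by the caller.

```python
import torch

def select_target_states(states, index):
    """Pick one hidden state per sentence.

    states: (batch, seq_len, features), e.g. the output of a batch_first LSTM
    index:  (batch,) or (batch, 1) positions of the ambiguous word
    """
    rows = torch.arange(states.size(0), device=states.device)
    return states[rows, index.view(-1)]

states = torch.arange(24, dtype=torch.float).view(2, 3, 4)  # toy "LSTM output"
index = torch.tensor([[2], [0]])                            # target positions
selected = select_target_states(states, index)
print(selected.shape)  # one vector of size 4 per sentence
```

Passing `selected` (rather than all token states) to the classifier means that only the representation of the ambiguous word is scored.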

**2 marks**

**OLD CODE (MODEL 1)**

```
class WSDModel_approach1(nn.Module):
    def __init__(self, voc_size, hidden, n_labels):  
        super(WSDModel_approach1, self).__init__()
        self.embeddings = nn.Embedding(voc_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(hidden*2, n_labels) 
  
    def forward(self, batch, index):
        embeddings = self.embeddings(batch)
        contextualized_embedding, *_ = self.rnn(embeddings)
        classifications = self.classifier(contextualized_embedding)
        
        ### NOTES ON TENSOR TRANSFORMATIONS ###
        # 1. Add 1 to the index input since we have a start-token of the sequence
        # 2. We make a vector (1D tensor) of the batch * index tensor
        # 3. We take the first batch example to "build upon" `predictions = classifications[0, index_mod[0], :].unsqueeze(0)`
        # 4. We iterate over the remaining batch examples to build the full output
        #######################################
        
        index_mod = torch.add(index.squeeze(), 1)
        predictions = classifications[0, index_mod[0], :].unsqueeze(0)
        print(predictions.size())
        for counter, index_at_count in enumerate(index_mod[1:], start=1):
            to_add = classifications[counter, index_at_count, :].unsqueeze(0)
            print(to_add.size())
            predictions = torch.cat((predictions, to_add)) #dim=0 by default
            
        print(predictions.size())
       
        return predictions
```

In [6]:
# RE-WORK SUMMER 2021

class WSDModel_approach1(nn.Module):
    def __init__(self, voc_size, hidden, n_labels):  
        super(WSDModel_approach1, self).__init__()
        self.embeddings = nn.Embedding(voc_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(hidden*2, n_labels) 
  
    def forward(self, batch, index):
        embeddings = self.embeddings(batch)
        contextualized_embedding, *_ = self.rnn(embeddings)
        
        ### HERE IS MY NEW IDEA ###
        # First, we rebuild a batch of vectors representing the ambiguous word
        # --- Add 1 to the index input since we have a start-token of the sequence
        index_mod = torch.add(index.squeeze(), 1)
        # --- We take the first batch example to "build upon"
        projection = contextualized_embedding[0, index_mod[0], :].unsqueeze(0)
        # --- We iterate over the remaining batch examples to build the full output
        for counter, index_at_count in enumerate(index_mod[1:], start=1):
            to_add = contextualized_embedding[counter, index_at_count, :].unsqueeze(0)
            #print(to_add.size())
            projection = torch.cat((projection, to_add)) #dim=0 by default

        # Second, we predict labels 
        print(projection.size())
        predictions = self.classifier(projection)
        
        #print(predictions.size())
       
        return predictions        
            


#### MODEL 2: Sentence Approach

In the second approach to WSD, you are to predict the word sense based on the final hidden state given by the RNN. **[5 marks]**

AE: Same comment as for the previous approach, also note that the LSTM gives you three outputs, `TOKEN_REPRESENTATIONS, (final_hidden, final_cell)`. You can use the `final_hidden` here instead of the `<end>` token. **2 marks**

**OLD CODE (MODEL 2)**
```
class WSDModel_approach2(nn.Module):
    def __init__(self, voc_size, hidden, n_labels):  
        super(WSDModel_approach2, self).__init__()
        self.embeddings = nn.Embedding(voc_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True) #bidirectional?
        self.classifier = nn.Linear(hidden*2, n_labels) 
  
    def forward(self, batch, index): #index is dummy in model2 
   
        embeddings = self.embeddings(batch)
      
        contextualized_embedding, *_ = self.rnn(embeddings)
     
        classifications = self.classifier(contextualized_embedding)
        
        ### NOTES ON TENSOR TRANSFORMATIONS ###
        # 1. Identify index of <end> i.e. key 3
        # 2. Use this list of indices in the same way as "index" of the dataset (above)
        #######################################
        
        end=torch.tensor(3, device=device)
        end_index=(end == batch).nonzero(as_tuple=True)[1]
        
        predictions = classifications[0, end_index[0], :].unsqueeze(0)
        
        for counter, index_at_count in enumerate(end_index[1:], start=1):
            to_add = classifications[counter, index_at_count, :].unsqueeze(0)
            predictions = torch.cat((predictions, to_add)) #dim=0 by default
       
        return predictions
```

In [7]:
# RE-WORK SUMMER 2021

class WSDModel_approach2(nn.Module):
    def __init__(self, voc_size, hidden, n_labels):  
        super(WSDModel_approach2, self).__init__()
        self.embeddings = nn.Embedding(voc_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True) #bidirectional?
        self.classifier = nn.Linear(hidden*2, n_labels) 
  
    def forward(self, batch, index): #index is dummy in model2 
   
        embeddings = self.embeddings(batch)
    
        # HERE IS THE NEW IDEA
      
        contextualized_embedding, (hidden_final, cell_final) = self.rnn(embeddings)
        
        # Structure of hidden_final: (D∗num_layers,N,Hout), where D=2, if bidirectional=True, as it is in our case (see https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html)
        # We want to concatenate D1 and D2, resulting in a tensor shaped [N, Hout*2]
        
        print(contextualized_embedding.shape)
        print(contextualized_embedding)
        mu=contextualized_embedding[:,-1,:].squeeze()
        print(mu.shape)
        print(hidden_final.shape)
        print(hidden_final)
        
        #projection = hidden ... 
        
        
        predictions = self.classifier(projection)
        
        return predictions
        

In [30]:
myt=torch.tensor([[[1,2,3], [4,5,6]], [[7,8,9], [0.1,0.2,0.3]]])

def bi_cat(my_tensor):
    D, batch, hidden = my_tensor.shape
    
    #print(D, batch, hidden)
    
    base = torch.cat((my_tensor[0, 0, :], my_tensor[1, 0, :]), 0)
    

    for x in range(batch):
        print(x)
        print(torch.cat((my_tensor[0, x, :], my_tensor[1, x, :]), 0))
        torch.cat((base, torch.cat((my_tensor[0, x, :], my_tensor[1, x, :]), 0)), 1)
        
    print(base)
    
bi_cat(myt)




0
tensor([1., 2., 3., 7., 8., 9.])


IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
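
The `IndexError` comes from calling `torch.cat(..., 1)` on 1-D tensors, which only have dimension 0 (the result of that call is also never assigned). A hedged sketch of an alternative, assuming a single-layer bidirectional LSTM whose final hidden state has shape `(2, batch, hidden)`: slice out the two directions as 2-D tensors and concatenate them along the feature dimension. `concat_directions` is an illustrative name.

```python
import torch

def concat_directions(hidden_final):
    """Join forward and backward final states of a 1-layer bidirectional LSTM.

    hidden_final: (2, batch, hidden) -> returns (batch, 2 * hidden)
    """
    return torch.cat((hidden_final[0], hidden_final[1]), dim=1)

myt = torch.tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [0.1, 0.2, 0.3]]])
joined = concat_directions(myt)
print(joined.shape)
```

The result has one row per batch element, which is the input shape a classifier defined as `nn.Linear(hidden*2, n_labels)` expects.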

### Training and testing the model

Now we are ready to train and test our model. What we need now is a loss function, an optimizer, and our data. 

- First, create the loss function and the optimizer.
- Next, we iterate over the number of epochs (i.e. how many times we let the model see our data). 
- For each epoch, iterate over the dataset (``train_iter``) to obtain batches. Use the batch as input to the model, and let the model output scores for the different word senses.
- For each model output, calculate the loss (and print the loss) on the output and update the model parameters.
- Reset the gradients and repeat.
- After all epochs are done, test your trained model on the test set (``test_iter``) and calculate the total and per-word-form accuracy of your model.

Implement the training and testing of the model **[4 marks]**

**Suggestion for efficiency:** *when developing your model, try training and testing the model on one or two batches (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*

AE: Looks good! **4 marks**

In [8]:
#Note: I have split training and testing into separate cells

import torch.optim as optim

train_iter, test_iter, vocab, labels, lemmas = dataloader()

model = WSDModel_approach2(voc_size = len(vocab),
                           hidden   = a_hidden, 
                           n_labels = len(labels))

model.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=a_learning_rate)


for e in range(a_epochs):
    total_loss = 0
    for i, batch in enumerate(train_iter):
        
        sentence = batch.context
        index = batch.index
        label = batch.sense
      
        output_from_model = model(sentence, index)
        #assert False
        
        loss = loss_function(output_from_model, label.squeeze()) # "output" from model is "input" to CEL
        
        #Note: code below adopted from previous assignment
        total_loss += loss.item()
        print(i, total_loss/(i+1), end='\r') 
        loss.backward() # compute gradients
        optimizer.step() # update parameters
        optimizer.zero_grad() # reset gradients (the call needs parentheses to have any effect)
        
        break
    print()

torch.Size([2, 93, 4])
tensor([[[ 3.2667e-02, -4.1260e-02,  1.7637e-02,  1.2551e-01],
         [-1.7908e-02, -6.2767e-02,  2.7954e-02,  1.6238e-01],
         [ 7.6297e-02, -9.4297e-02, -6.4729e-02,  1.8105e-01],
         [ 1.1905e-01,  1.6619e-02, -8.8846e-02,  2.3299e-01],
         [ 1.2165e-01,  6.7937e-02, -7.5555e-02,  2.2017e-01],
         [ 1.3479e-01,  2.1069e-01, -7.4591e-02,  1.9807e-01],
         [ 1.0569e-01,  4.5848e-02, -2.0943e-02,  2.9699e-02],
         [ 1.5791e-01,  2.4456e-01, -9.0909e-02,  9.4615e-02],
         [ 1.3892e-01,  9.2879e-02, -4.2629e-02,  7.2185e-02],
         [ 1.2922e-01,  2.1007e-02, -4.2279e-02,  6.0528e-02],
         [ 1.3890e-01,  7.9341e-02, -3.2740e-02,  2.4970e-01],
         [ 7.2489e-02,  1.6789e-02, -6.1292e-03,  1.6479e-01],
         [ 1.2246e-01,  5.9497e-02, -7.8013e-02,  8.1131e-02],
         [ 1.3257e-01,  8.4945e-02, -5.1027e-02,  7.1260e-02],
         [ 1.2553e-01,  1.8340e-02, -5.6160e-03, -2.2648e-02],
         [ 6.1475e-02, -3.5050e-

NameError: name 'projection' is not defined

In [None]:
# evaluate model after all epochs are completed
import numpy as np

def select(vector):
    """Selects the index of the top value in a vector."""
    top_value=0
    no_one=0 #index of top value
    for index, value in enumerate(vector):
        if value > top_value:
            top_value=value
            no_one=index
    return no_one

correct_set = []
correct_per_word = {lemma:[] for lemma in [lemmas.itos[x] for x in range(len(lemmas))]}
model.eval() #evaluation mode

for i, batch in enumerate(test_iter):
    print(f"{round((i/len(test_iter))*100, 3)} %", end="\r")
    sentence = batch.context
    index = batch.index
    label = batch.sense
    lemma = batch.lemma
    
    output = model(sentence, index)
    
    my_probs = F.softmax(output, dim=1)
    index_of_top_prob = [select(x) for x in my_probs]
    predicted_label = [labels.itos[x] for x in index_of_top_prob]

    for i in range(label.shape[0]):
        true_label = labels.itos[label[i][0]]
        this_lemma = lemmas.itos[lemma[i][0]]
        if true_label == predicted_label[i]:
            correct_set.append(1)
            correct_per_word[this_lemma].append(1)
        else:
            correct_set.append(0)
            correct_per_word[this_lemma].append(0)

accuracy = sum(correct_set) / len(correct_set)

accuracy_per_word = {lemma:0 for lemma in correct_per_word.keys()}
for lemma in correct_per_word.keys():
    if len(correct_per_word[lemma]) == 0:
        accuracy_per_word[lemma] = "NA"
    else:
        mean = sum(correct_per_word[lemma]) / len(correct_per_word[lemma])
        accuracy_per_word[lemma] = mean
    
print("="*40)
print("EVALUATION")
print(f"Overall accuracy: {round(accuracy, 3)}.")
print("Lemma{}\tAcc.\tBaseL.\tGood?\tNo. senses".format(" "*9))

##########################################
# For interpretation of model performance, 
# I here collect variables for correlation
def pearson(v1, v2):
    calculation = np.corrcoef(v1, v2)
    r = round(calculation[0][1], 3)
    return r
v_acc=[]
v_bl=[]
v_nsen=[]
##########################################

for lemma in accuracy_per_word.keys():
    if lemma not in ["<unk>", "<pad>"]:
        acc = round(accuracy_per_word[lemma], 2)
        bl = round(my_baseline[lemma]["accuracy"], 2)
        n_sense = my_baseline[lemma]["no_of_senses"]
        is_it_good = "Yes"
        if bl > acc:
            is_it_good = "No"
        
        print("{}\t{}\t{}\t{}\t{}".format(lemma+" "*(14-len(lemma)), acc, bl, is_it_good, n_sense))
        
        v_acc.append(acc)
        v_bl.append(bl)
        v_nsen.append(n_sense)
print("="*40)
print()
print("Correlation of Accuracy and Baseline: {}".format(pearson(v_acc, v_bl)))
print("Correlation of Accuracy and No. of senses: {}".format(pearson(v_acc, v_nsen)))

# 2.2 Running a transformer for WSD

In this section of the lab you'll try out the transformer, specifically the BERT model. For this we'll use the huggingface library (https://huggingface.co/).

You can find the documentation for the BERT model here (https://huggingface.co/transformers/model_doc/bert.html) and a general usage guide here (https://huggingface.co/transformers/quickstart.html).

What we're going to do is *fine-tune* the BERT model, i.e. update the weights of a pre-trained model. That is, we have a model that is trained on language modeling, but now we apply it to word sense disambiguation with the word representations it learnt from language modeling.

We'll use the same data splits for training and testing as before, but this time you'll not use a torchtext dataloader. Rather now you create an iterator that collects N sentences (where N is the batch size) then use the BertTokenizer to transform the sentence into integers. For your dataloader, remember to:
* Shuffle the data in each batch
* Make sure you get a new iterator for each *epoch*
* Create a vocabulary of *sense-labels* so you can calculate accuracy 

We then pass this batch into the BERT model and train as before. The BERT model will encode the sentence, then we send this encoded sentence into a prediction layer (you can either use the sentence representation from BERT or the representation of the ambiguous word) like before and collect sense predictions.

About the hyperparameters and training:
* For BERT, usually a lower learning rate works best, between 0.0001-0.000001.
* BERT takes a lot of resources, running it on CPU will take ages, utilize the GPUs :)
* Since BERT takes a lot of resources, use a small batch size (4-8)
* When computing the BERT representation, make sure you pass the attention mask
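
To illustrate what "the mask" refers to: sentences in a batch are padded to the same length, and the mask marks which positions are real tokens (1) and which are padding (0). The sketch below builds such a mask by hand for made-up token ids; in practice the tokenizer is expected to produce both the padded ids and the mask, so `pad_with_mask` is purely a hypothetical illustration.

```python
def pad_with_mask(id_sequences, pad_id=0):
    """Pad id lists to equal length and return (padded_ids, attention_mask)."""
    max_len = max(len(ids) for ids in id_sequences)
    padded, mask = [], []
    for ids in id_sequences:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask

# Made-up ids for two sentences of different length.
padded, mask = pad_with_mask([[101, 7, 8, 102], [101, 9, 102]])
print(padded)
print(mask)
```

Without the mask, the model would treat the padding positions as if they were part of the sentence.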

**[10 marks]**

AE: **0 marks**

For the dataloading, I'd suggest implementing something simpler along the lines of: 

```
for BATCH, BATCH_LABELS in dataset:
    input = tokenizer.batch_encode_plus(...)
    labels = BATCH_LABELS
    yield input, labels
```

This should take less memory, since you don't need to hold everything in memory at once.
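
A slightly fuller sketch of such a generator, under stated assumptions: the examples are `(sense, lemma, index, sentence)` tuples as in the dataset, `encode_batch` is a hypothetical stand-in for the tokenizer call (here a plain whitespace splitter so the sketch runs without `transformers`), and the sense labels in `toy` are made up.

```python
import random

def batches(examples, batch_size, encode_batch, label_to_id, shuffle=True):
    """Yield (encoded_sentences, label_ids) one batch at a time."""
    order = list(range(len(examples)))
    if shuffle:
        random.shuffle(order)  # a new order each time the generator is created
    for start in range(0, len(order), batch_size):
        chunk = [examples[i] for i in order[start:start + batch_size]]
        sentences = [sentence for _, _, _, sentence in chunk]
        label_ids = [label_to_id[sense] for sense, _, _, _ in chunk]
        yield encode_batch(sentences), label_ids

# Made-up examples and labels, only for illustration.
toy = [("bank%1", "bank", "5", "i swam from the river bank"),
       ("bank%2", "bank", "6", "i deposited my money in the bank"),
       ("bank%1", "bank", "2", "the river bank flooded")]
label_to_id = {"bank%1": 0, "bank%2": 1}

def whitespace_encode(sentences):
    """Stand-in for a tokenizer: split each sentence on spaces."""
    return [s.split(" ") for s in sentences]

for epoch in range(2):  # calling batches() again gives a fresh iterator per epoch
    for inputs, labels in batches(toy, 2, whitespace_encode, label_to_id):
        print(epoch, len(inputs), labels)
```

Only one batch is encoded at a time, and the label vocabulary (`label_to_id`) makes it possible to compare predictions with gold senses when computing accuracy.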

To upgrade the transformers library on MLTGPU, you can do this (copy/pasted from Discord):

In case other people run into the problems with transformers, we did a short guide to use the most up-to-date version 4.6.0 of the transformers, instead of the 2.2.0 version on mltgpu:

1. Create an empty directory and use it to create a virtual environment:
python -m venv </path/to/new/virtual/environment>

2. Activate created environment:
source <pathofenv>/bin/activate

3. To run jupyter, install jupyter in venv and run this command (also in venv) (you only need to change <nameofenv> to your environment name)
pip install jupyter
python -m ipykernel install --user --name=<nameofenv>

4. Install all necessary dependencies, we used:
 pip install torch
 pip install transformers
 pip install -Iv torchtext==0.4.0

5. Run the notebook in the venv and open your notebook, change the kernel from the browser: Kernel>Change kernel><nameofenv>

NOTE: we had to log out of mltgpu and change the port for the notebook to open.



In [None]:
# BERT stuff ...
from transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
print(tokenizer)
BERT = BertModel.from_pretrained('bert-base-cased')

In [None]:
import transformers
print("Version: ", transformers.__version__)

**TO BE** *RE-WORKED SUMMER 2021*

### Note by M.B. 
I have experienced problems when implementing the BertTokenizer. The `transformers` library on the MLTGPU server is version 2.2, which makes implementation a struggle. This has consequences for any further implementation, training and evaluation of the model that requires this fundamental step. 

1. The literature on how to work with the `transformers` library is based on later versions than 2.2. I cannot get the procedures described and exemplified there to work in v. 2.2 (consider: https://huggingface.co/transformers/training.html). For example, calling the tokenizer yields the following error: `TypeError: 'BertTokenizer' object is not callable`, which is a known problem for versions prior to v3 (https://github.com/huggingface/transformers/issues/5580). 
2. There are no `docs` for the 2.2 version of BERT on huggingface (https://huggingface.co/transformers/v2.2.0/model_doc/bert.html). 

On my laptop I have reached the stage where I can preprocess the data quite well (code below), but I cannot start experimenting with the training part, since this code does not run on MLTGPU. 

I have considered building a BERT-ish tokenizer with the functionality of the `transformers` v 2.2 library, but there is not enough time.

In [None]:
#BERT hyperp
b_batch_size = 3
b_learning_rate = 0.0001
b_epochs = 3

In [None]:
# RE-WORK ...

In [None]:
# From file to Python
import random

def read_and_split(path_to_dataset = "wsd-data/wsd_data.txt", train_frac = 0.8):
    with open(path_to_dataset, mode="r") as f:
        data=[tuple(example.split("\t")) for example in f.read().split("\n") if len(example.split("\t")) == 4]
    
    random.shuffle(data)
    n_train = int(len(data)*train_frac)
    train=data[:n_train]
    test=data[n_train:]
    
    labels = []
    lemmas = []
    for label, lemma, x, y in data:
        if label not in labels:
            labels.append(label)
        if lemma not in lemmas:
            lemmas.append(lemma)
    
    return train, test, labels, lemmas
    
my_train, my_test, my_labels, my_lemmas = read_and_split()
#print(my_train[:10])

In [None]:
# NEW TOKENIZER

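def batch_iterator(dataset):
    # Wrapped in a function, since `yield` is only valid inside one.
    for BATCH, BATCH_LABELS in dataset:
        input = tokenizer.batch_encode_plus(...)  # arguments still to be filled in
        labels = BATCH_LABELS
        yield input, labels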
    
    
    


In [None]:
# Defining a dataloader
import random

class Xemplar(): 
    def __init__(self, label, lemma, position, sentence, set_labels, set_lemmas):
        self.label    = set_labels.index(label) 
        self.lemma    = set_lemmas.index(lemma) 
        self.position = int(position)
        self.sentence = sentence.split(" ")

class Batch():
    def __init__(self, chunk): #gets a list of len=batch_size of Xemplars 
        self.label    = torch.tensor([X.label for X in chunk], dtype=torch.long, device=device)
        self.lemma    = torch.tensor([X.lemma for X in chunk], dtype=torch.long, device=device)
        self.position = torch.tensor([X.position for X in chunk], dtype=torch.long, device=device)
        self.sentence = tokenizer([X.sentence for X in chunk], 
                                     is_split_into_words=True, 
                                     padding=True, 
                                     truncation=True,
                                     return_tensors="pt"  # tensors, so the encoding can be passed to BERT
                                          ).to(device)

def data_to_data(data, labels_set, lemmas_set):
    container = []
    for label, lemma, index, sentence in data:
        X = Xemplar(label, lemma, index, sentence, labels_set, lemmas_set)
        container.append(X)    
    return container

def batcher(my_list, size):
    # Split the list into chunks of `size` and wrap each chunk in a Batch.
    # Solution found here: https://www.delftstack.com/howto/python/python-split-list-into-chunks/
    return [Batch(my_list[i : i+size]) for i in range(0, len(my_list), size)]
        
class MyDataLoader():
    def __init__(self, data, labels_set, lemmas_set, batch_size=1, shuffle=True):
        self.data       = data_to_data(data, labels_set, lemmas_set)
        self.batch_size = batch_size
        self.shuffle    = shuffle
    
    def __iter__(self):
        if self.shuffle:
            random.shuffle(self.data)
        # Always yield Batch objects (also for batch_size=1), so that training
        # and evaluation see the same tensor attributes.
        return iter(batcher(self.data, size=self.batch_size))

# So why do I not use the PyTorch DataLoader, you might wonder ... That class seems to be restricted to two 
# aspects of the dataset (input and label), but I want four (label=sense, lemma, index of ambiguous word,
# and the sentence [or context]). Perhaps there are ways to use DataLoader in a less restricted way, but I
# gave up looking for that functionality for now. 
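
On the question raised in the comment above: PyTorch's `DataLoader` takes a `collate_fn` argument that decides how a list of examples becomes a batch, so a batch can carry any number of fields. Below is a minimal sketch under stated assumptions, not a replacement for the classes above. The toy `examples`, the function name `collate_four_fields` and the padding id `0` are all hypothetical, and the sentences are assumed to be already converted to token ids.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical toy data: (label id, lemma id, target position, token ids)
examples = [
    (0, 0, 1, [5, 6, 7]),
    (1, 0, 0, [8, 9]),
    (2, 1, 2, [4, 3, 2, 1]),
]

def collate_four_fields(chunk, pad_id=0):
    """Bundle a list of examples into one dictionary of tensors."""
    labels, lemmas, positions, sentences = zip(*chunk)
    # Pad every sentence in the batch to the length of the longest one
    max_len = max(len(s) for s in sentences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sentences]
    return {
        "label": torch.tensor(labels, dtype=torch.long),
        "lemma": torch.tensor(lemmas, dtype=torch.long),
        "position": torch.tensor(positions, dtype=torch.long),
        "sentence": torch.tensor(padded, dtype=torch.long),
    }

# A plain list works as a dataset because it supports indexing and len()
loader = DataLoader(examples, batch_size=2, shuffle=False,
                    collate_fn=collate_four_fields)
batches = list(loader)

for batch in batches:
    print(batch["label"].shape, batch["sentence"].shape)
```

Each batch is then a dictionary with all four fields, so nothing has to be squeezed into an (input, label) pair.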

In [None]:
# From data to dataloaders
#my_train = my_train[:100]
#my_test  = my_test[:100]

b_train_iter = MyDataLoader(my_train, my_labels, my_lemmas, batch_size=b_batch_size)
b_test_iter  = MyDataLoader(my_test, my_labels, my_lemmas)

In [None]:
# Experimental phase, so far ... 

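class BERT_WSD(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BERT  # pretrained BERT model, assumed to be loaded in an earlier cell
        # Use BERT's own hidden size as the classifier input size (instead of a dummy value)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels) #sentence repr or ambiguous word repr ---> labels
    
    def forward(self, batch): #shall we use index?
        output = self.bert(**batch) #first element: hidden states of shape (batch, seq_len, hidden)
        # Assumption: take the vector at the first position ([CLS]) as the sentence representation
        sentence_repr = output[0][:, 0, :]
        predictions = self.classifier(sentence_repr) 
        
        return predictions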
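
The open question in `forward` (whether to use the index) could be answered by classifying from the hidden state of the ambiguous word instead of a sentence vector. The sketch below only illustrates the indexing step. A random tensor stands in for BERT's hidden states, and the sizes, the `positions` values and the four output labels are hypothetical.

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, hidden = 3, 5, 8

# Stand-in for the final hidden states of an encoder: (batch, seq_len, hidden)
hidden_states = torch.randn(batch_size, seq_len, hidden)
# Hypothetical index of the ambiguous token in each sentence
positions = torch.tensor([0, 2, 4])

# Pick one vector per sentence: row i, time step positions[i]
rows = torch.arange(batch_size)
target_vectors = hidden_states[rows, positions]  # (batch, hidden)

# A linear layer then maps each target vector to label scores
classifier = torch.nn.Linear(hidden, 4)
logits = classifier(target_vectors)  # (batch, 4)
print(target_vectors.shape, logits.shape)
```

With a real BERT tokenizer, the word-level `position` from the dataset would first have to be mapped to the matching subword position, since the tokenizer adds special tokens and may split a word into several pieces. The indices here are only placeholders.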

In [None]:
# Training the model
    
import torch.optim as optim

model = BERT_WSD(len(my_labels)).to(device)
model.train() # training mode

loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=b_learning_rate)

for e in range(b_epochs):
    total_loss = 0 # running loss, reset at the start of every epoch
    for i, batch in enumerate(b_train_iter):
        
        sentence = batch.sentence
        index = batch.position
        label = batch.label
        
        output_from_model = model(sentence) #index?
        
        # "output" from model is "input" to CEL; label already has shape
        # (batch_size,), so no squeeze is needed
        loss = loss_function(output_from_model, label)
        
        #Note: code below adopted from previous assignment
        total_loss += loss.item()
        print(total_loss/(i+1), end='\r') 
        loss.backward() # compute gradients
        optimizer.step() # update parameters
        optimizer.zero_grad() # reset gradients
        
        #break
    print()    

In [None]:
# test model after all epochs are completed
correct_set = []
correct_per_word = {lemma:[] for lemma in my_lemmas}
model.eval() #evaluation mode

with torch.no_grad(): # no gradients are needed for evaluation
    for batch in b_test_iter:
        #Attributes have different names for bert_wsd part, than in first part ...
        sentence = batch.sentence
        label    = batch.label
        lemma    = batch.lemma
        
        output = model(sentence)
        
        # Index of the highest score per example; softmax would not change which index is largest
        index_of_top_prob = torch.argmax(output, dim=1)

        for j in range(label.shape[0]):
            true_label      = my_labels[label[j].item()]
            predicted_label = my_labels[index_of_top_prob[j].item()]
            this_lemma      = my_lemmas[lemma[j].item()]
            hit = int(true_label == predicted_label)
            correct_set.append(hit)
            correct_per_word[this_lemma].append(hit)

accuracy = sum(correct_set) / len(correct_set)

accuracy_per_word = {lemma:0 for lemma in correct_per_word.keys()}
for lemma in correct_per_word.keys():
    if len(correct_per_word[lemma]) == 0:
        accuracy_per_word[lemma] = "NA"
    else:
        mean = sum(correct_per_word[lemma]) / len(correct_per_word[lemma])
        accuracy_per_word[lemma] = mean
    
print("="*40)
print("EVALUATION")
print(f"Overall accuracy: {round(accuracy, 3)}.")
print("Lemma{}\tAcc.\tBaseL.\tGood?".format(" "*9))
for lemma in accuracy_per_word.keys():
    if lemma in ["<unk>", "<pad>"]:
        continue
    bl = round(my_baseline[lemma]["accuracy"], 2)
    if accuracy_per_word[lemma] == "NA": # lemma does not occur in the test set
        acc, is_it_good = "NA", "NA"
    else:
        acc = round(accuracy_per_word[lemma], 2)
        is_it_good = "No" if bl > acc else "Yes"
    
    print("{:<14}\t{}\t{}\t{}".format(lemma, acc, bl, is_it_good))


# 3. Evaluation

Explain the difference between the first and second approach. What kind of representations are the different approaches using to predict word-senses? **[4 marks]**

AE: Yeah! **4 marks**

**Answer:** The first approach attempts to classify meaning from the representation of *the ambiguous word, as it appears in a sequence*. The second approach attempts to classify the meaning of the ambiguous word based on the representation of *the sentence* (in which the ambiguous word appears). 

Evaluate your model with per-word-form *accuracy* and comment on the results you get, how does the model perform in comparison to the baseline, and how do the models compare to each other? 

Expand on the evaluation by sorting the word-forms by the number of senses they have. Are word-forms with fewer senses easier to predict? Give a short explanation of the results you get based on the number of senses per word.

**[6 marks]**

AE: Good analysis, and I agree with all your points! **6 marks**

### Model

    model = WSDModel_approach1()
    a_batch_size = 16
    a_learning_rate = 0.001
    a_epochs = 8
    a_hidden = 256

### Results
**Overall accuracy: 0.484.**

**Table: Accuracy, Baseline, Improvement from Baseline, and No. of senses.**

|Lemma         |Acc.|BaseL.|Good?|No. senses|
|--------------|----|----|-------|----------|
|see.v         |0.61|0.63|No|11|
|line.n        |0.92|0.85|Yes|11|
|keep.v        |0.54|0.39|Yes|11|
|follow.v      |0.46|0.15|Yes|11|
|hold.v        |0.33|0.15|Yes|11|
|serve.v       |0.38|0.16|Yes|9|
|force.n       |0.61|0.16|Yes|8|
|lead.v        |0.33|0.18|Yes|8|
|build.v       |0.28|0.21|Yes|10|
|bring.v       |0.29|0.21|Yes|8|
|extend.v      |0.35|0.18|Yes|7|
|find.v        |0.43|0.23|Yes|10|
|case.n        |0.34|0.20|Yes|8|
|position.n    |0.27|0.20|Yes|6|
|national.a    |0.41|0.20|Yes|6|
|security.n    |0.57|0.20|Yes|7|
|life.n        |0.51|0.22|Yes|9|
|time.n        |0.50|0.28|Yes|5|
|professional.a|0.57|0.22|Yes|5|
|order.n       |0.52|0.22|Yes|5|
|regular.a     |0.39|0.22|Yes|8|
|point.n       |0.44|0.36|Yes|8|
|place.n       |0.48|0.24|Yes|7|
|physical.a    |0.32|0.24|Yes|6|
|common.a      |0.39|0.25|Yes|4|
|bad.a         |0.68|0.61|Yes|4|
|critical.a    |0.45|0.27|Yes|5|
|major.a       |0.42|0.30|Yes|4|
|active.a      |0.44|0.32|Yes|5|
|positive.a    |0.49|0.35|Yes|5|
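
To look at the table by number of senses, the rows can be sorted and the accuracies averaged per sense count. The following is a small sketch of that step, using only a handful of rows copied from the table. The variable names are my own, and the subset is for illustration only, so its averages say nothing about the full data.

```python
from collections import defaultdict

# A few (lemma, accuracy, number of senses) rows copied from the table above
rows = [
    ("see.v", 0.61, 11),
    ("line.n", 0.92, 11),
    ("force.n", 0.61, 8),
    ("lead.v", 0.33, 8),
    ("common.a", 0.39, 4),
    ("bad.a", 0.68, 4),
    ("major.a", 0.42, 4),
]

# Sort by number of senses (fewest first), then alphabetically by lemma
sorted_rows = sorted(rows, key=lambda row: (row[2], row[0]))

# Collect the accuracies of all lemmas that share a sense count
by_senses = defaultdict(list)
for lemma, acc, n_senses in rows:
    by_senses[n_senses].append(acc)

# Mean accuracy per sense count, in ascending order of sense count
mean_accuracy = {n: sum(accs) / len(accs) for n, accs in sorted(by_senses.items())}

for n, mean in mean_accuracy.items():
    print(f"{n} senses: mean accuracy {mean:.2f} over {len(by_senses[n])} lemmas")
```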

**Correlation of Accuracy and Baseline: 0.753**

**Correlation of Accuracy and No. of senses: 0.091**
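
For reference, Pearson's correlation coefficient can be computed as in the sketch below. This is a plain-Python illustration and not necessarily how the figures above were produced. The function name `pearson` is my own, and the five value pairs are just the first rows of the table, so the result will not reproduce the reported coefficients.

```python
import math

def pearson(xs, ys):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five (accuracy, baseline) pairs copied from the table above
accuracy = [0.61, 0.92, 0.54, 0.46, 0.33]
baseline = [0.63, 0.85, 0.39, 0.15, 0.15]

r = pearson(accuracy, baseline)
print(f"r = {r:.3f}")
```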

### Conclusion & Discussion
From this data we can draw the following conclusions:

*   Overall accuracy is 48.4%, which is not strong but far from worthless, given the complexity of the task at hand. Kågebäck & Salomonsson reported substantially higher scores (66.9 for SE2 and 73.4 for SE3), which still leave room for improvement. 
*   The neural model is better than the baseline for every lemma except one (*see*). We might speculate that *see* causes special difficulty for the model given its subtle (metaphoric) variation of meaning. The senses of *see* might be hard to distinguish clearly for humans as well.   
*   Predictions do not get better with fewer senses per lemma, or worse with more senses. Pearson's correlation coefficient between Accuracy and No. of senses is close to 0. (Of course, there would be a lower limit to this dissociation, as No. of senses = 1 would yield 100% accuracy.) 
*   There is, however, a strong correlation between Accuracy and Baseline. This suggests that the model is better at predicting *dominant* senses. As the baseline is determined by the most common sense of a word, words with high baselines are words where a large proportion of the tokens encode the "baseline sense". The model is especially good at predicting such common senses, seemingly independent of the number of other senses the word has.  
*   There are some interesting exceptions to the previous point. For *force*, the model makes quite good predictions, although this is a lemma without a clear dominant sense. Similarly, the model disambiguates *security* quite well despite its lack of a dominant sense. We might hypothesize that these words have fairly distinct senses that appear in quite different contexts, making them easier for the model to distinguish and recognize. 
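
The baseline discussed in the points above (always predict the most common sense of a lemma) can be sketched as follows. The data, the sense names and the function names are hypothetical and only serve to show why a lemma with a dominant sense gets a high baseline while a lemma with evenly spread senses does not.

```python
from collections import Counter, defaultdict

# Hypothetical (lemma, sense) pairs
train = [
    ("bank.n", "finance"), ("bank.n", "finance"), ("bank.n", "finance"),
    ("bank.n", "river"),
    ("force.n", "a"), ("force.n", "b"), ("force.n", "c"),
]
test = [
    ("bank.n", "finance"), ("bank.n", "finance"), ("bank.n", "river"),
    ("bank.n", "finance"),
    ("force.n", "a"), ("force.n", "b"), ("force.n", "c"),
]

def most_frequent_sense(pairs):
    """Map every lemma to the sense it occurs with most often."""
    counts = defaultdict(Counter)
    for lemma, sense in pairs:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

def baseline_accuracy(mfs, pairs):
    """Per-lemma accuracy of always predicting the most frequent sense."""
    hits = defaultdict(list)
    for lemma, sense in pairs:
        hits[lemma].append(int(mfs.get(lemma) == sense))
    return {lemma: sum(h) / len(h) for lemma, h in hits.items()}

mfs = most_frequent_sense(train)
acc = baseline_accuracy(mfs, test)
for lemma, value in acc.items():
    print(f"{lemma}: baseline accuracy {value:.2f}")
```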

How do the LSTMs perform in comparison to BERT? What's the difference between representations obtained by the LSTMs and BERT? **[2 marks]**

AE: **0 marks**

*RE-WORK SUMMER 2021*

**Answer:** Unfortunately, I have not been able to build and train (fine-tune) the `BERT_WSD` model (see comment above and the `readme.md` file). Therefore, there is no comparison to make. However, something can be said about what the models are supposed to represent. The LSTM model represents tokens relative to a sequence: it processes the sentence step by step, so that representations of previously processed elements are "remembered" in the current state. BERT represents something different. It represents words and sentences at several levels (its stacked layers), and each token's representation is computed with attention over all the other tokens in the sentence. When fine-tuned, the general "knowledge" of the pretrained BERT is calibrated to a particular task; here: WSD. 

What could we do to improve our LSTM word sense disambiguation models and our BERT model? **[4 marks]**

AE: Good suggestions, it would be helpful if BERT worked indeed :D **2 marks**

**Answer**

For the LSTM model, we can, like Kågebäck & Salomonsson, try to:
*   parameterize sense by word type (not training every sense for every word, as above)
*   use the dropword technique
*   use dropout on the layers
*   use pretrained word embeddings (e.g. GloVe)
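
As far as I understand it, dropword means randomly replacing words in the input with a `<dropped>` tag during training, so that the model cannot rely too much on any single context word. Below is a minimal sketch of that idea. The function signature, the `keep_index` option for protecting the ambiguous word and the example sentence are my own assumptions and are not taken from the paper.

```python
import random

def dropword(tokens, drop_prob, rng, keep_index=None, dropped_token="<dropped>"):
    """Replace each token with dropped_token with probability drop_prob.

    The token at keep_index (for example the ambiguous word) is never replaced.
    """
    return [
        tok if i == keep_index or rng.random() >= drop_prob else dropped_token
        for i, tok in enumerate(tokens)
    ]

rng = random.Random(1)  # seeded generator, so the example is repeatable
sentence = "I deposited my money in the bank".split(" ")

# Drop roughly 30% of the context words, but keep the target word "bank"
noisy = dropword(sentence, drop_prob=0.3, rng=rng, keep_index=6)
print(noisy)
```

Something like this would only be applied to training batches. At test time the sentences are left intact.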

For BERT, make it work :(

# Readings:

[1] Kågebäck, M., & Salomonsson, H. (2016). Word Sense Disambiguation using a Bidirectional LSTM. arXiv preprint arXiv:1606.03568.

[2] https://cl.lingfil.uu.se/~nivre/master/NLP-LexSem.pdf

Total marks: 26, which unfortunately is not enough to pass. Your analysis and general code are good, but there are some problems in the model implementations. There will be a deadline in September for resubmission (date will be announced soon-ish). If you have any questions regarding the code or so, feel free to hit me up on e-mail or Discord!