# Lab 3: Word Embeddings and Language Modelling

Adam Ek

In this lab we'll explore constructing *static* word embeddings (i.e. word2vec) and building language models. We'll also evaluate these systems on intermediate tasks, namely word similarity and identifying "good" and "bad" sentences.

* For this we'll use pytorch. Some basic operations that will be useful can be found here: https://jhui.github.io/2018/02/09/PyTorch-Basic-operations
* In general: we are not interested in getting state-of-the-art performance :) focus on the implementation and not results of your model. For this reason, you can use a subset of the dataset: the first 5000-10 000 sentences or so, on linux/mac: ```head -n 10000 inputfile > outputfile```. 
* If possible, use the MLTGpu, it will make everything faster :)

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# for gpu, replace "cpu" with "cuda:n" where n is the index of the GPU
#device = torch.device('cpu')
device = torch.device('cuda')

### Note
There has been space issues on MLTGPU while working with this assignment. For example, 

    RuntimeError: CUDA out of memory

Therefore it has been hard to train models with large datasets and batch sizes. The cell below makes it possible to a subset of sentences and a batch size at one place.

Presently (May 14, before noon) the sitaution seems to have appoved. I have been able to train the model with the complete 50k dataset.

In [2]:
my_restriction = False # Set to False for no restriction of data; set to True for using n_subsamle as restriction
n_subsample = 10000 #Sub-sample
n_batch = 16 #Batch Size
name_of_model = f"model{str(n_subsample)[:-3]}k{n_batch}b"
print(name_of_model)

model10k16b


# Word2Vec embeddings

In this first part we'll construct a word2vec model which will give us *static* word embeddings (that is, they are fixed after training).

After we've trained our model we will evaluate the embeddings obtained on a word similarity task.

## Formatting data


First we need to load some data, you can download the file on canvas under files/03-lab-data/wiki-corpus.txt. The file contains 50 000 sentences randomly selected from the complete wikipedia. Each line in the file contains one sentence. The sentences are whitespace tokenized.

Your first task is to create a dataset suitable for word2vec. That is, we define some ```window_size``` then iterate over all sentences in the dataset, putting the center word in one field and the context words in another (separate the fields with ```tab```).

For example, the sentece "this is a lab" with ```window size = 4``` will be formatted as:
```
center, context
---------------------
this    is a lab
is      this a lab
a       this is lab
lab     this is a
```

this will be our training examples when training the word2vec model.

[3 marks]

In [3]:
data_path = 'data/wiki-corpus.txt'
WINDOW_SIZE = 4 #on each side!
def corpus_reader(data_path, output="data/my_data.csv", k=WINDOW_SIZE, restriction=my_restriction):
    
    with open(data_path, mode="r") as f:
        my_data=[line.split(" ") for line in f.read().split("\n")]
    
    if restriction == True:
        my_data=my_data[:n_subsample] #OBS!
    
    my_string=""
    for sentence in my_data:
        if len(sentence)>2: 
            # Only at least two-word contexts (minimal context). 
            # The simple reason for this is to solve downstream problems otherwise
            # encounter with training related to the CrossEntropyLoss function 
            my_context=["<x>"]*k + sentence + ["<x>"]*k
            i=k
            for w in sentence:
                my_string+=w+"\t"
                j=i+1
                left=my_context[i-k:i]
                right=my_context[j:j+k]
                context_words=[c for c in left+right if c not in["<x>", "(", ")", ".", "!", ",", '"', "'", "?"]]
                my_string+=" ".join(context_words)+"\n"
                i+=1
    
    with open(output, mode="w") as f:
        f.write(my_string[:-1]) #note to self: we do not want empty rows

corpus_reader(data_path)

We sampled 50 000 senteces completely random from the *whole* wikipedia for our training data. Give some reasons why this is good, and why it might be bad. (*note*: We'll have a few questions like these, one or two reasons for and against is sufficient)

[2 marks]

**ANSWER**

Here are some ideas for why it might not always be a good idea:
1.    Wikipedia illutrates a restricted register of language use and its representativity can be discussed (as always in corpus linguistics).
2.    The diversified collection might perhaps result in a high Type Token Ratio (TTR), i.e. number of types / number of tokens. For our present purposes, I suppose that a high TTR can mean that we have many word types, but a limited context to learn (generalize) their embeddings from.

The main good reasons derive from availibility:

*    Wikipedia has a lot of text
*    Wikipedia pages and their URLs are stricly standardized, which makes Web crawling and text extraction easy

Also, Wikipedia is commonly used as data in NLP tasks, which makes improves possibilities for comparisons 

### Loading the data

We now need to load the data in an appropriate format for torchtext (https://torchtext.readthedocs.io/en/latest/). We'll use PyText for this and it'll follow the same structure as I showed you in the lecture (remember to lower-case all tokens). Create a function which returns a (bucket)iterator of the training data, and the vocabulary object (```Field```). 

(*hint1*: you can format the data such that the center word always is first, then you only need to use one field)

(*hint2*: the code I showed you during the leture is available in /files/pytorch_tutorial/ on canvas)

[4 marks]

In [33]:
#from torchtext.legacy.data import Field, BucketIterator, Iterator, TabularDataset # Needed for running this on my laptop
from torchtext.data import Field, BucketIterator, Iterator, TabularDataset

def get_data(
    my_path = "data/my_data.csv",
    batch_size = 3
    ):
    
    whitespacer = lambda x: x.split(' ') #from: files/partofspeech-tagging_main.py on Canvas
   
    MY_FIELD = Field(
        tokenize = whitespacer,
        lower=True,
        batch_first = True
        )
        
    my_fields = [("target_word", MY_FIELD), ("context_words", MY_FIELD)]   
    
    train = TabularDataset(
        path   = my_path,
        format = 'csv',
        fields = my_fields,
        #skip_header = True,
        csv_reader_params = {'delimiter':'\t', 'quotechar':'{'} #Note "{" is used as quotechar
        )
    
    #Note on quote character selection: 
    #By trial and error I found that "{" is not a character in the corpus.
    #Therefore it can be used as quote character. (Swedish "å", "ä" and "ö" are in 
    #the corpus. So is "½".)

    MY_FIELD.build_vocab(train, min_freq=3)
    
    my_bucket = BucketIterator(
        train,      
        batch_size        = batch_size,
        sort_within_batch = True,
        sort_key          = lambda x: len(x.context_words),
        shuffle           = True,
        device            = device)    
    
    return my_bucket, MY_FIELD.vocab


### FOR DEVELOPMENT; PLEASE IGNORE!
    b, v = get_data()
    print(b)
    print(v)
    print(len(v))

### FOR DEVELOPMENT; PLEASE IGNORE!
    t=0
    c=0

    for i, x in enumerate(b):
        if x.target_word.shape[0] != 3:
            t+=1
            print(x.target_word)
        if x.context_words.shape[0] != 3:
            c+=1
            print(x.context_words)

    print("t ", t)
    print("c ", c)

We lower-cased all tokens above; give some reasons why this is a good idea, and why it may be harmful to our embeddings.

[2 marks]

**ANSWER:** 

Reasons *against* lower-casing: when lower-casing, at least two types of information that are *lost*:
*    Named entities (e.g. *Shell* vs *shell*)
*    Sentence bounary information (e.g. *The* is a better predicition after *.* than *the*).

In as far as such information is important for the task we are trying to solve with the embeddings, lower-casing can be "harmfull". Consider, for example, named entitiy recognition and language modelling aiming to predict sequentiality of language.

On the other hand, upper case does not in general carry much relevant information. Mostly, the upper vs. lower case is *semantically* redundant, but *expressively* expensive. Potentially our vocabulary could be twice as long. As such, there is good reason for lower-casing. 

## Word Embeddings Model

We will implement the CBOW model for constructing word embedding models.

In [5]:
import torch.optim as optim

In the CBOW model we try to predict the center word based on the context. That is, we take as input ```n``` context words, encode them as vectors, then combine them by summation. This will give us one embedding. We then use this embedding to predict *which* word in our vocabuary is the most likely center word. 

Implement this model 

[7 marks]

In [6]:
class CBOWModel(nn.Module):
    def __init__(self, voc_size, hidden_d):
        super(CBOWModel, self).__init__()
        self.embeddings = nn.Embedding(voc_size, hidden_d) #vocabulary size * hidden size
        self.prediction = nn.Linear(hidden_d, voc_size)
        #self.vsize=voc_size # Please ignore. I used this for my first projection function.
        
        # NOTE: Softmax is part of the loss function implemented below
    
    def forward(self, context):
        embedded_context = self.embeddings(context)
        projection = self.projection_function2(embedded_context)
        predictions = self.prediction(projection)
       
        return predictions
    
    def projection_function2(self, xs):
        """
        This function will take as input a tensor of size (B, S, D)
        where B is the batch_size, S the window size, and D the dimensionality of embeddings
        this function should compute the sum over the embedding dimensions of the input, 
        that is, we transform (B, S, D) to (B, 1, D) or (B, D) 
        """
        #Note: Implemented as suggested in class.
        
        xs_sum = torch.sum(xs, dim=1) # helpful reference: https://towardsdatascience.com/understanding-dimensions-in-pytorch-6edf9972d3be
        
        return xs_sum   
    

### PLEASE IGNORE (my first attempt at a projection function)
    def projection_function(self, xs):
        """

        """
        
        print("input projection ", xs.shape)

        #Collapsed one-hot vector representation... 
        b=xs.shape[0] #The number of "lines" (the batch size)

        xs_sum=torch.zeros(b, self.vsize, dtype=torch.long, device=device) 
        for i in range(b):
            ctxt=xs[i] #context of ith batch
            for j in ctxt:
                xs_sum[i][j]+=1
        
        print("output projection ", xs_sum.shape)
        
        return xs_sum

Now we need to train the models. First we define which hyperparameters to use. (You can change these, for example when *developing* your model you can use a batch size of 2 and a very low dimensionality (say 10), just to speed things up). When actually training your model *fo real*, you can use a batch size of [8,16,32,64], and embedding dimensionality of [128,256].

In [7]:
# you can change these numbers to suit your needs :)
word_embeddings_hyperparameters = {'epochs':3,
                                   'batch_size':n_batch, 
                                   'embedding_size':128,
                                   'learning_rate':0.001,
                                   'embedding_dim':128}

Train your model. Iterate over the dataset, get outputs from your model, calculate loss and backpropagate.

We mentioned in the lecture that we use Negative Log Likelihood (https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) loss to train Word2Vec model. In this lab we'll take a shortcut when *training* and use Cross Entropy Loss (https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), basically it combines ```log_softmax``` and ```NLLLoss```. So what your model should output is a *score* for each word in our vocabulary. The ```CrossEntropyLoss``` will then assign probabilities and calculate the negative log likelihood loss.

[3 marks]

In [34]:
# load data

dataset, vocab = get_data(batch_size = word_embeddings_hyperparameters['batch_size'])

print("Length of vocabulary: ", len(vocab))

# build model and construct loss/optimizer
cbow_model = CBOWModel(len(vocab), word_embeddings_hyperparameters['embedding_dim'])
cbow_model.to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(cbow_model.parameters(), lr=word_embeddings_hyperparameters['learning_rate'])

# start training loop
cbow_model.train()
total_loss = 0
for epoch in range(word_embeddings_hyperparameters['epochs']):
    for i, batch in enumerate(dataset):
        
        context = batch.context_words
        target_word = batch.target_word
        
        # send your batch of sentences to the model
        output = cbow_model(context)
        
        # compute the loss, you'll need to reshape the input
        # you can read more about this is the documentation for
        # CrossEntropyLoss
        
        print("output", output.shape)
        print("target", target_word.shape)
        print("target, squeezed", target_word.squeeze())

        loss = loss_fn(output, target_word.squeeze())
        total_loss += loss.item()
        
        # print average loss for the epoch
        print(total_loss/(i+1), end='\r') 
        
        # compute gradients
        loss.backward()
        
        # update parameters
        optimizer.step()
        
        # reset gradients
        optimizer.zero_grad
        
    print()
        

Length of vocabulary:  80707
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([20349,  1570,     5,  1938,  1197,  8303,    44,  1951,  3216, 16017,
            3,   250,    57,  1672,     3, 49206], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([18662,   301,   203,    73,   788,    12,    21,   101,   949, 69354,
        17484,    59,   138,    21, 75051,   346], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([ 3824,     9,    12,   408,   978,  1071,  1212,     8,     3,   171,
            2,   767,  9495,  1181,  1517, 42148], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([ 2780,    28,    37, 10739,   127,     3, 24387,     2,    41,   133,
            2,  3641, 10270,     9,     5,   123], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([  48

output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([   96,  2199, 11089,  1023,     2,    28,   851,   223,    24,    20,
           82,     9,    13,    13,  1859,   226], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([    5,  1391,   114,   138,     6,   114, 25221, 61217,     2,     2,
         3487,    27,    25,    13,  2131,     7], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([   13,     3,     6,     3,   424,   294, 64224,    54,  2960, 44346,
            2, 50008,     6,     4,     4,   490], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([ 4222,    19,   231,  1474,    25, 61558, 36442,  4689,    36,    13,
          307,    36,     5,     8,  2400,     3], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([ 1981,     4,     3,   178,  3931

output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([1552,  142,    3,    7, 1738, 1304,    7,   22,    3,  640,    4,   13,
         180, 6967,    6, 1664], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([  65,  705, 3287,    7, 1574,   21,   21,   21,  349,   77, 1013,   21,
         366, 2515, 1970,  717], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([   27,    95,  2772,     7, 12679,   501,     3,     3,    56,  6150,
          340,  1887,   172,   721, 11369,  2248], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([    2,    13,    23,   304,   110,     2,     3,  2317, 76102,    94,
          506,    26,   197,  1503,   255,    19], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([ 1170,   145, 15985,     2,     7,     2,     2,     2,  1471, 19

output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([  759,  9145, 26210,  1215,   666,    13,   763,   238,     2,    25,
           77,    83,     2,    24,  5934, 35915], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([    6,     5, 30534,     3,    19,  2824,    13,    13,     8,   650,
         1860,   516,  4613,   709,  3278,    13], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([  239,    18,    27, 11400,     2,   474,     8,     4,    28,   473,
            8,   259,  2628, 42338,     2,   153], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([    4,   264,   928,    45,    13,  1509, 17726, 25098,    13,     2,
            5,    13,    14,     2,   232,    16], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([  34,   21,   21,   99,   34,  22

output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([   11,     5,    21,   125,    21,   150,  2961,    21,   141, 33958,
          130, 76833,     5,   728,    21,  2462], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([1059, 3725, 2845,    2,    4,  130,   32,   86,    2,   13, 1883,   47,
          59,   31,    3,    6], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([ 5426,    34,    21,   449,    13,   800,    21,   124,  1706,  1272,
         7327,    21,    60,    11,   111, 11777], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([   21,    21, 76439,    21,    21,   765, 22968, 11975,  7792,    21,
         7069,  1189,  2409,    21,    21,    21], device='cuda:0')
output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([   22,  6640,  8384,  1753,    21,     2,    21, 

14.816834659111208output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([10434,  6606,    20,    77,    17,     6, 53546,    17,    10,   341,
         1054,  2422,    77,   134,   599,  5611], device='cuda:0')
14.850646632885644output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([   13,  1066,   267,     2,     2, 31190,     2,    13,  9607,  7137,
            4,     7,     3,   215,    93,  2318], device='cuda:0')
14.889818222292009output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([41330,  2210,    13,     5,   286,    26,    69,    15,    13,    10,
           16,   278,   181,    13,    12,    37], device='cuda:0')
14.921323247702725output torch.Size([16, 80707])
target torch.Size([16, 1])
target, squeezed tensor([   13,    16,    31,  1116,   441,     3,  2737, 29037,   246,  2126,
           19,     7,  6490,    87,  1019,    11], device='cuda:0')
14.949041213989258output torch.Size([16

KeyboardInterrupt: 

In [9]:
#SAVING THE MODEL
PATH = f"models/{name_of_model}.pt"
torch.save(cbow_model, PATH)

### ACTIVATE AS CODE AND REMOVE INDENT, FOR LOADING THE MODEL
    model = torch.load(PATH)
    model.eval()

## Evaluating the model

We will evaluate the model on a dataset of word similarities, WordSim353 (http://alfonseca.org/eng/research/wordsim353.html , also avalable in vanvas under files/03-l). The first thing we need to do is read the dataset and translate it to integers. What we'll do is to reuse the ```Field``` that records word indexes (the second output of ```get_data()```) and use it to parse the file.

The wordsim data is structured as follows:

```
word1 word2 score
...
```


The ```Field``` we got from ```read_data()``` has two built-in functions, ```stoi``` which maps a string to an integer and ```itos``` which maps an integer to a string. 

What our datareader needs to do is: 

```
for line in file:
    word1, word2, score = file.split()
    # encode word1 and word2 as integers
    word1_idx = vocab.vocab.stoi[word1]
    word2_idx = vocab.vocab.stoi[word2]
```

when we have the integers for ```word_1``` and ```word2``` we'll compute the similarity between their word embeddings with *cosine simlarity*. We can obtain the embeddings by querying the embedding layer of the model.

We calculate the cosine similarity for each word pair in the dataset, then compute the pearson correlation between the similarities we obtained with the scores given in the dataset. 

[4 marks]

In [20]:
# your code goes here

def read_wordsim(path, 
                 vocabulary=vocab.stoi, 
                 checker=vocab.itos,
                 embeddings=cbow_model.embeddings.weight):
    #https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
    
    dataset_sims = []
    model_sims = []
    summary="Pair\tPsychology\tModel\tDf.\n"
    
    with open(path) as f:
        for line in f.read().split("\n"):
            if len(line) != 0: #We do not want empty lines.
                word1, word2, score = line.split() #splits by space by default

            # get the index for the word
            word1_idx = vocabulary[word1]
            word2_idx = vocabulary[word2]            
            
            if (checker[word1_idx]=='<unk>', checker[word2_idx]=='<unk>') == (False, False):
            #If two words are unknown to the model they will both be assigned "<unk>"
            #and associated with the same embedding (u). The cosine similarity of u and u
            #will be 1. I suggest that such scores should not be part of our evaluation.

                score = float(score)
                dataset_sims.append(score)

                # get the embedding of the word
                # the hidden layer will be a matrix of weights;
                # use the index to identify the right row/column (?)
                word1_emb = embeddings[word1_idx]
                word2_emb = embeddings[word2_idx]

                # compute cosine similarity, we'll use the version included in pytorch functional
                # https://pytorch.org/docs/master/generated/torch.nn.functional.cosine_similarity.html
                cosine_similarity = F.cosine_similarity(word1_emb, word2_emb, dim=0)
                
                #In order to identify "best" and "worst performing word pairs"; see below.
                psy=round(score/10, 3)
                mod=round(cosine_similarity.item(), 3)
                dif=round(abs(psy-mod), 3)
                summary+=f"{word1} -- {word2}\t{psy}\t{mod}\t{dif}\n"

                model_sims.append(cosine_similarity.item())
    
    return dataset_sims, model_sims, summary

path = 'eval_data/wordsim_similarity_goldstandard.txt'
data, model, summary = read_wordsim(path)
pearson_correlation = np.corrcoef(data, model)
r = round(pearson_correlation[0][1], 3)
            
# the non-diagonals give the pearson correlation

print("\nRESULTS:")
print("Pearson correlation: ", r)
print("\nSUMMARY:")
print(summary)

with open(f"evaluations/{name_of_model}.txt", mode="w") as f:
    f.write(f"Pearson's correlation coefficient: {r}.\n")
    f.write(f"\nSummary of performance for pairs:\n")
    f.write(summary)


RESULTS:
Pearson correlation:  0.034

SUMMARY:
Pair	Psychology	Model	Df.
tiger -- cat	0.735	-0.052	0.787
tiger -- tiger	1.0	1.0	0.0
plane -- car	0.577	0.004	0.573
train -- car	0.631	0.075	0.556
television -- radio	0.677	0.145	0.532
media -- radio	0.742	0.126	0.616
bread -- butter	0.619	0.064	0.555
cucumber -- potato	0.592	0.14	0.452
doctor -- nurse	0.7	0.078	0.622
professor -- doctor	0.662	0.205	0.457
student -- professor	0.681	0.167	0.514
smart -- stupid	0.581	0.193	0.388
wood -- forest	0.773	0.042	0.731
money -- cash	0.915	0.007	0.908
king -- queen	0.858	0.217	0.641
bishop -- rabbi	0.669	0.11	0.559
fuck -- sex	0.944	0.088	0.856
football -- soccer	0.903	0.108	0.795
football -- basketball	0.681	0.171	0.51
football -- tennis	0.663	0.08	0.583
physics -- chemistry	0.735	0.184	0.551
vodka -- gin	0.846	0.37	0.476
vodka -- brandy	0.813	0.225	0.588
drink -- eat	0.687	0.135	0.552
car -- automobile	0.894	0.003	0.891
gem -- jewel	0.896	0.089	0.807
journey -- voyage	0.929	0.353	0.576
boy -- lad	

Do you think the model performs good or bad? Why?

[3 marks]

**ANSWER**

My model is terrible! **Pearson correlation = 0.034** There is no correlation between the psychologial assessments of word similarity and the model's estimate of similarity. This means that my model does do not assign high scores to "truly" similar pairs, and low scores to "truly" dissimilar ones. 

But it could be worse: there could have been a negative correlation (which was the case for some of occasions of training the model with other parameters).  

Select the 10 best and 10 worst performing word pairs, can you see any patterns that explain why *these* are the best and worst word pairs?

[3 marks]

**ANSWER**

In general, my model gives lower scores to pairs then these are given in psychological estimates. The table below show the Maximum, Minimum and Range for the model and the pyschological data (ignoring similariy estimates of the same word).

Stat |Psych|Model|
-----|-----|----- 
MIN  |0.023|-0.11|
MAX  |0.944|0.281|
RANGE|0.921|0.391|

A consequence of this is that the model performs better for pairs given low scores in the psychologial data. It will perform worse, when the psychologicla score is high. This pattern becomes clear from the following table which is sorted by Difference (decending order), but as we see this sorting also closely maps the psychological scores. To conclude, the model does not represent similar words as similar. 


Pair|PsyScore|ModScore|Difference
----|--------|--------|----------
seafood -- food|0.834|-0.073|0.907
type -- kind|0.897|0.002|0.895
life -- death|0.788|-0.071|0.859
man -- woman|0.83|-0.029|0.859
planet -- star|0.845|0.027|0.818
wood -- forest|0.773|-0.033|0.806
planet -- sun|0.802|0.016|0.786
century -- year|0.759|-0.024|0.783
bird -- cock|0.71|-0.061|0.771
physics -- chemistry|0.735|-0.016|0.751
psychology -- science|0.671|-0.063|0.734
opera -- performance|0.688|-0.006|0.694
student -- professor|0.681|0.037|0.644
journal -- association|0.497|-0.13|0.627
planet -- moon|0.808|0.183|0.625
food -- fruit|0.752|0.132|0.62
man -- governor|0.525|-0.095|0.62
plane -- car|0.577|-0.033|0.61
doctor -- personnel|0.5|-0.059|0.559
computer -- news|0.447|-0.082|0.529
consumer -- confidence|0.413|-0.111|0.524
image -- surface|0.456|-0.058|0.514
psychology -- discipline|0.558|0.047|0.511
coast -- hill|0.438|-0.068|0.506
professor -- doctor|0.662|0.159|0.503
five -- month|0.338|-0.163|0.501
cup -- food|0.5|0.001|0.499
cup -- object|0.369|-0.119|0.488
report -- gain|0.363|-0.125|0.488
space -- chemistry|0.488|0.006|0.482
consumer -- energy|0.475|0.003|0.472
train -- car|0.631|0.16|0.471
attempt -- peace|0.425|-0.029|0.454
situation -- conclusion|0.481|0.035|0.446
architecture -- century|0.378|-0.055|0.433
coast -- forest|0.315|-0.114|0.429
life -- term|0.45|0.036|0.414
investigation -- effort|0.459|0.048|0.411
car -- flight|0.494|0.084|0.41
peace -- plan|0.475|0.065|0.41
opera -- industry|0.263|-0.138|0.401
travel -- activity|0.5|0.103|0.397
experience -- music|0.347|-0.039|0.386
media -- trading|0.388|0.012|0.376
situation -- isolation|0.388|0.017|0.371
development -- issue|0.397|0.029|0.368
peace -- atmosphere|0.369|0.006|0.363
media -- gain|0.288|-0.071|0.359
stock -- live|0.373|0.026|0.347
atmosphere -- landscape|0.369|0.026|0.343
population -- development|0.375|0.047|0.328
hospital -- infrastructure|0.463|0.14|0.323
money -- operation|0.331|0.038|0.293
seven -- series|0.356|0.072|0.284
century -- nation|0.316|0.036|0.28
problem -- airport|0.238|-0.036|0.274
stock -- life|0.092|-0.18|0.272
school -- center|0.344|0.098|0.246
cup -- substance|0.192|-0.015|0.207
minority -- peace|0.369|0.175|0.194
month -- hotel|0.181|-0.001|0.182
music -- project|0.363|0.242|0.121
cup -- entity|0.215|0.102|0.113
production -- hike|0.175|0.069|0.106
possibility -- girl|0.194|0.161|0.033


Suggest some ways of improving the model we apply to WordSim353.

[3 marks]

**ANSWER**
The most obvious suggestion for improvement would be to test the model for varying (larger) data set, epochs and batch sizes. (Perhaps something should be done to avoid overfitting?)

If we consider a scenario where we use these embeddings in a downstream task, for example sentiment analysis (roughly: determining whether a sentence is positive or negative). 

Give some examples why the sentiment analysis model would benefit from our embeddnings and one examples why our embeddings could hur the performance of the sentiment model.

[3 marks]

**ANSWER**

My model is a poor representation of language (meaning). There would be no value of such model in any downstream application. 

However, the general rationale for using word embeddings in sentiment analysis is that it improves performance. Since embeddings carry "meaning", but e.g. one-hot vectors do not, using these meaning representations (embeddings) can be useful in a task where we want to extract the attitudinal meaning expressed in text. Using embeddings, we will have a more informative data (input) for our sentiment classifcation task. 

A possible problem of using embeddings would perhaps be that they might carry a bias with respect to the sentiment classifcation task at hand. Consider an extrme example where we trained our embeedings on a standup comedy roast (https://en.wikipedia.org/wiki/Roast_(comedy)) full of ironi and slurs and then implement these embeddings in a sentiment analysis of *Guide Michelin*. For this task, perhaps a "meaningless" input would be better than the trained one.

# Language modeling

In this second part we'll build a simple LSTM language model. Your task is to construct a model which takes a sentence as input and predict the next word for each word in the sentence. For this you'll use the ```LSTM``` class provided by PyTorch (https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html). You can read more about the LSTM here: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

NOTE!!!: Use the same dataset (wiki-cropus.txt) as before.

Our setup is similar to before, we first encode the words as distributed representations then pass these to the LSTM and for each output we predict the next word.

For this we'll build a new dataloader with torchtext, the file we pass to the dataloader should contain one sentence per line, with words separated by whitespace.

```
word_1, ..., word_n
word_1, ..., word_k
...
```

in this dataloader you want to make sure that each sentence begins with a ```<start>``` token and ends with a ```<end>``` token, there is a keyword argument in ```Field``` for this :). But other than that, as before you read the dataset and output a iterator over the dataset and a vocabulary. 

Implement the dataloader, language model and the training loop (the training loop will basically be the same as for word2vec).

[12 marks]

In [36]:
#SETTING SOME PARAMETERS FOR TRAINING AND SAVING MODELS
restrict_data = True
n_lines = 1000
n_lstm_batch = 3

def shorty(txt_in, txt_out, max_lines=n_lines):
    with open(txt_in, mode="r") as f:
        my_data = f.read().split("\n")
    with open(txt_out, mode="w") as f:
        output="\n".join(my_data[:max_lines])
        f.write(output)

if restrict_data==True:
    shorty('data/wiki-corpus.txt', 'data/wiki-corpus-short.txt')
    data_path='data/wiki-corpus-short.txt'
    name_LSTM_model = f"LSTMmodel{str(n_lines)[:-3]}k{n_lstm_batch}b"
else:
    data_path = 'data/wiki-corpus.txt'
    name_LSTM_model = f"LSTMmodel_full{n_lstm_batch}b"

print(name_LSTM_model)

LSTMmodel1k3b


In [37]:
# you can change these numbers to suit your needs as before :)
lm_hyperparameters = {'epochs':3,
                      'batch_size':n_lstm_batch,
                      'learning_rate':0.001,
                      'embedding_dim':128,
                      'output_dim':128}

In [38]:
#data_path = 'data/wiki-corpus.txt'
#data_path = 'data/wiki-corpus-short.txt' ##DEFINED ABOVE

#from torchtext.legacy.data import Field, BucketIterator, Iterator, TabularDataset
from torchtext.data import Field, BucketIterator, Iterator, TabularDataset

def get_data2(
    my_path = data_path,
    batch_size = 3
    ):
    
    whitespacer = lambda x: x.split(' ') #from: files/partofspeech-tagging_main.py on Canvas
    
    MY_FIELD = Field(
        tokenize = whitespacer,
        lower=True,
        batch_first = True,
        init_token="<start>", 
        eos_token="<end>"
    )

    train = TabularDataset(
        path   = my_path,
        #train  = file,
        format = 'csv',
        fields = [("sentence", MY_FIELD)],
        #skip_header = True,
        #csv_reader_params = {'delimiter':'\t', 'quotechar':'}'} 
    )

    MY_FIELD.build_vocab(train)
    
    my_bucket = BucketIterator(
        train,      
        batch_size        = batch_size,
        sort_within_batch = True,
        sort_key          = lambda x: len(x.sentence),
        shuffle           = True,
        device            = device)    
    
    return my_bucket, MY_FIELD.vocab


In [39]:
class LM_withLSTM(nn.Module):
    def __init__(self, n_words, emb_dim, outp_dim):
        super(LM_withLSTM, self).__init__()
        self.embeddings = nn.Embedding(n_words, emb_dim)
        self.LSTM = nn.LSTM(emb_dim, outp_dim, batch_first=True)
        self.predict_word = nn.Linear(outp_dim, n_words)
    
    def forward(self, seq):
        embedded_seq = self.embeddings(seq)
        timestep_reprentation, *_ = self.LSTM(embedded_seq)
        predicted_words = self.predict_word(timestep_reprentation)
        
        return predicted_words

In [55]:
import torch.optim as optim
# load data
dataset, vocab = get_data2(batch_size = lm_hyperparameters["batch_size"])
#dataset, vocab = get_data()

# build model and construct loss/optimizer
lm_model = LM_withLSTM(len(vocab), 
                       lm_hyperparameters['embedding_dim'],
                       lm_hyperparameters['output_dim'])
lm_model.to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(lm_model.parameters(), lr=lm_hyperparameters['learning_rate'])
#here cbow model was provided . i changed to lm model ----

# start training loop
total_loss = 0
for epoch in range(lm_hyperparameters['epochs']):
    for i, batch in enumerate(dataset):
        
        # the strucure for each BATCH is:
        # <start>, w0, ..., wn, <end>
        sentence = batch.sentence
     
        # when training the model, at each input we predict the *NEXT* token
        # consequently there is nothing to predict when we give the model 
        # <end> as input. 
        # thus, we do not want to give <end> as input to the model, select 
        # from each batch all tokens except the last. 
        # tip: use pytorch indexing/slicing (same as numpy) 
        # (https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html#operations-on-tensors)
        # (https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/)
        input_sentence = sentence[:, :-1]
        
        print("input", input_sentence.shape)
        
        # send your batch of sentences to the model
        output = lm_model(input_sentence)
        
        print("output", output.shape)
        
        # for each output, the model predict the NEXT token, so we have to reshape 
        # our dataset again. On timestep t, we evaluate on token t+1. That is,
        # we never predict the <start> token ;) so this time, we select all but the first 
        # token from sentences (that is, all the tokens that we predict)
        
        gold_data = sentence[:, 1:]
        
        print("gold", gold_data)
        
        ###### BUILDING ONE-HOT VECTORS #####################################################
        #hot_gold=torch.zeros(gold_data.shape[0], len(vocab), dtype=torch.long, device=device)
        #for i in range(gold_data.shape[0]):
        #    for j in gold_data[i]:
        #        hot_gold[i][j]+=1
        ######################################################################################
        
        # the shape of the output and sentence variable need to be changed,
        # for the loss function. Details are in the documentation.
        # You can use .view(...,...) to reshape the tensors  

        #loss = loss_fn(output, hot_gold)
        
        input_cel = output.view(output.shape[0]*output.shape[1], output.shape[2])
        target_cel = gold_data.view(1, gold_data.shape[0] * gold_data.shape[1]).squeeze()
        
               
        print("input to CEL", input_cel.shape)
        print("input to CEL", target_cel.shape)
        
        loss = loss_fn(input_cel, target_cel)
        
        total_loss += loss.item()
        
        print(total_loss/(i+1), end='\r') 
        
        # print average loss for the epoch
        #print(total_loss/(i+1), end='\r') 
      
        # compute gradients
        loss.backward()
        
        # update parameters
        optimizer.step()
        
        # reset gradients
        optimizer.zero_grad
        
        
        break
        
    print()

input torch.Size([3, 5])
output torch.Size([3, 5, 4007])
gold tensor([[  36,   26,   76,    5,    3],
        [1420, 2800,    5,    3,    1],
        [  55, 1595,    5,    3,    1]], device='cuda:0')


RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

In [None]:
#SAVING THE MODEL
PATH = f"models/{name_LSTM_model}.pt"
torch.save(lm_model, PATH)

### ACTIVATE AS CODE AND REMOVE INDENT, FOR LOADING THE MODEL
    my_lm_model = torch.load(PATH)
    #my_lm_model.eval()

### Evaluating the language model

We'll evaluate our model using the BLiMP dataset (https://github.com/alexwarstadt/blimp). The BLiMP dataset contains sets of linguistic minimal pairs for various syntactic and semantic phenomena, We'll evaluate our model on *existential quantifiers* (link: https://github.com/alexwarstadt/blimp/blob/master/data/existential_there_quantifiers_1.jsonl). This data, as the name suggests, investigate whether language models assign higher probability to *correct* usage of there-quantifiers. 

An example entry in the dataset is: 

```
{"sentence_good": "There was a documentary about music irritating Allison.", "sentence_bad": "There was each documentary about music irritating Allison.", "field": "semantics", "linguistics_term": "quantifiers", "UID": "existential_there_quantifiers_1", "simple_LM_method": true, "one_prefix_method": false, "two_prefix_method": false, "lexically_identical": false, "pairID": "0"}
```

Download the dataset and build a datareader (similar to what you did for word2vec). The dataset structure you should aim for is (you don't need to worry about the other keys for this assignment):

```
good_sentence_1, bad_sentence_1
...
```

your task now is to compare the probability assigned to the good sentence with to the probability assigned to the bad sentence. To compute a probability for a sentence we consider the product of the probabilities assigned to the *gold* tokens, remember, at timestep ```t``` we're predicting which token comes *next* e.g. ```t+1``` (basically, you do the same thing as you did when training).

In rough pseudo code what your code should do is:

```
accuracy = []
for good_sentence, bad_sentence in dataset:
    gs_lm_output = LanguageModel(good_sentence)
    gs_token_probabilities = softmax(gs_lm_output)
    gs_sentence_probability = product(gs_token_probabilities[GOLD_TOKENS])

    bs_lm_output = LanguageModel(bad_sentence)
    bs_token_probabilities = softmax(bs_lm_output)
    bs_sentence_probability = product(bs_token_probabilities[GOLD_TOKENS])

    # int(True) = 1 and int(False) = 0
    is_correct = int(gs_sentence_probability > bs_sentence_probability)
    accuracy.append(is_correct)

print(numpy.mean(accuracy))
    
```

[6 marks]

In [53]:
# your code goes here

import json

def evaluate_model(path, vocab, model):
    
    accuracy = []
    with open(path) as f:
        # iterate over one pair of sentences at a time
        for line in f:
            # load the data
            data = json.loads(line)
            good_s = data['sentence_good']
            bad_s = data['sentence_bad']
            
            # the data is tokenized as whitespace
            tok_good_s = [token.lower().replace(".", "") for token in good_s.split()]
            tok_bad_s = [token.lower().replace(".", "") for token in bad_s.split()]
            #THERE IS A FULL STOP (.) ON THE LAST TOKEN WHICH WE NEED TO REMOVE
        
            
            # encode your words as integers using the vocab from the dataloader, size is (S)
            # we use unsqueeze to create the batch dimension 
            # in this case our input is only ONE batch, so the size of the tensor becomes: 
            # (S) -> (1, S) as the model expects batches
            enc_good_s = torch.tensor([vocab.stoi[x] for x in tok_good_s], device=device).unsqueeze(0)
            enc_bad_s = torch.tensor([vocab.stoi[x] for x in tok_bad_s], device=device).unsqueeze(0)
            
            # pass your encoded sentences to the model and predict the next tokens
            good_s_pred = model(enc_good_s)
            bad_s_pred = model(enc_bad_s)
            
            # get probabilities with softmax
            gs_probs = F.softmax(good_s_pred[0], dim=0)
            bs_probs = F.softmax(bad_s_pred[0], dim=0)
            
            # select the probability of the gold tokens
            gs_sent_prob = find_token_probs(gs_probs, enc_good_s)
            bs_sent_prob = find_token_probs(bs_probs, enc_bad_s)
            
            accuracy.append(int(gs_sent_prob>bs_sent_prob))
            
    return accuracy
            
def find_token_probs(model_probs, encoded_sentence, escape_zero=True):
    prob=1
    for counter, idx in enumerate(encoded_sentence[0]):
        token_prob=model_probs[counter][idx] #There is a softmax calculation for every word; pick the "word-in-n-th-order", then pick the probability for the nth word
        if escape_zero == True: #Obs! Very shady condition, but there seems to be some zeroing out here. Without this condition, accuracy = 0.0
            if token_prob != 0: 
                prob *= token_prob
        else:
            prob *= token_prob
    return prob     
    
path = 'eval_data/existential_there_quantifiers_1.jsonl'
accuracy = evaluate_model(path, vocab, model=lm_model) #Note: provide your model

print('Final accuracy:')
print(np.round(np.mean(accuracy), 3))


Final accuracy:
0.085


### Analysis

Our model get some score, say, 55% correct predictions. Is this good? Suggest some *baseline* (i.e. a stupid "model" we hope ours is better than) we can compare the model against.

[3 marks]

**ANSWER**

By chance alone, the probability of sentece S = {w1, w2, ..., wn}, given vocabulary size V, would be: 

*ReallyStupidModel1:*   P(S) = count(w1)/V * ... * count(wn)/V

The model we build should at least perform better than this ReallyStupidModel. However, as the sentence pairs of the training data are "minimal pairs" differentiated by the quantifiers in them, the differntiation of sentence probabilities based on ReallyStupidModel whould come down to the frequency of the quantifiers in the corpus. 

A more sophisticated model for comparison should at least consider conditional proabilities of words given previous words, i.e. N-gram models:

*NgramModel:*   P(S) = p(w1|w1-1, ... w1-N) * ... * p(wn|wn-1,... wn-N)

Deciding on some value of *N*, which is problematic for Ngram models due to the "curse of dimensionality" (Bengio et al. 2003), our model should perform better than a *NgramModel*.

Suggest some improvements you could make to your language model.

[3 marks]

**ANSWER**

Again, as noted above, I have had problems with training the model with larger dataset and batch sizes. Supposedly, larger data sets and batch sizes would improve the model. 

Suggest some other metrics we can use to evaluate our system

[2 marks]

**ANSWER**

A metric for internal evaluation to consider is perplexity.

# Literature


Neural architectures:
* Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. (Links to an external site.) Journal of Machine Learning Research, 3(6):1137–1155, 2003. (Sections 3 and 4 are less relevant today and hence you can glance through them quickly. Instead, look at the Mikolov papers where they describe training word embeddings with the current neural network architectures.)
* T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
* T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
    


Total marks: 63