# Lab session 3: Word embeddings

This lab covers distributed word represenations (`word embedding') as seen in the theory lectures.

General instructions:
- Complete the code where needed
- Provide answers to questions only in the cell where indicated
- **Do not alter the evaluation cells** (`## evaluation`) in any way as they are needed for the partly automated evaluation process

## **Embeddings; the Steroids for NLP!**

Pre-trained embeddings have brought NLP a long way. Most of the recent methods include word embeddings into their pipeline to obtain state-of-the-art performance (these days in particular for cases where lightweight models are preferred over heavier transformer-based methods that have the additional advantage of context-dependent representations). `Word2vec` is among the most famous methods to efficiently create word embeddings and has been around since 2013. Word2Vec has two different model architectures, namely `Skip-gram` and `CBOW`. `Skip-gram` was explained in more detail in the theory lecture, and today we will play with `CBOW`. We will train our own little embeddings, and use them to visualize text corpora. In the last part, we will download and utilize other pretrained embeddings to build a Part-of-Speech tagging (PoS) model.

<img src="http://3g1o5q2sqh3w32ohtj4dwggw.wpengine.netdna-cdn.com/wp-content/uploads/2012/08/steroids-before-and-after-480x321.jpg" alt="img" width="512px"/>



In [3]:
# import necessary packages
import random
import math
import numpy as np

from random import shuffle
from collections import Counter

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


In [4]:
# for reproducibility

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

## 1. Data preparation

As always, let's first prepare the data. We shall use the `text8` dataset, which offers cleaned English Wikipedia text. The data is clean UTF-8 and all characters are lower-cased with valid encodings.

In [5]:
!wget "http://mattmahoney.net/dc/text8.zip" -O text8.zip
!unzip -o text8.zip
!rm text8.zip
!head -c 100 text8 # print first bytes of text8 data

--2022-12-12 21:04:23--  http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net (mattmahoney.net)... 67.195.197.24
Connecting to mattmahoney.net (mattmahoney.net)|67.195.197.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.zip’


2022-12-12 21:04:36 (2.32 MB/s) - ‘text8.zip’ saved [31344016/31344016]

Archive:  text8.zip
  inflating: text8                   
 anarchism originated as a term of abuse first used against early working class radicals including t

In [6]:
# read text8
with open('text8', 'r') as input_file:
    text = input_file.read()

### Tokenization
We first chop our text into pieces using NLTK's `WordPuncTokenizer`:

In [7]:
from nltk.tokenize import WordPunctTokenizer

tknzr = WordPunctTokenizer()
tokenized_text = tknzr.tokenize(text)

print(tokenized_text[0:20])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english']


### Build dictionary
In this step, we convert each word to a unique id. We can define our vocabulary trimming rules, which specify whether certain words should remain in the vocabulary, be trimmed away, or handled differently. In following, we limit our vocabulary size to `vocab_size` words and replace the remaining tokens with `UNK`:

In [8]:
def get_data(text, vocab_size = None):
    
    word_counts = Counter(text)
    
    sorted_tokens = sorted(word_counts, key=word_counts.get, reverse=True) # sort by frequency
    
    if vocab_size: # keep most frequent words
        sorted_tokens = sorted_tokens[:vocab_size-1] 
    
    sorted_tokens.insert(0, 'UNK') # reserve 0 for UNK
    
    id_to_token = {k: w for k, w in enumerate(sorted_tokens)}
    token_to_id = {w: k for k, w in id_to_token.items()}
    
    # tokenize words in vocab and replace rest with UNK
    tokenized_ids = [token_to_id[w] if w in token_to_id else 0 for w in text]
    
    return tokenized_ids, id_to_token, token_to_id

In [9]:
tokenized_ids, id_to_token, token_to_id = get_data(tokenized_text)
print('-' * 50)
print('Number of uniqe tokens: {}'.format(len(id_to_token)))
print('-' * 50)
print("tokenized text: {}".format(tokenized_text[0:5]))
print('-' * 50)
print("tokenized ids: {}".format(tokenized_ids[0:5]))

--------------------------------------------------
Number of uniqe tokens: 253855
--------------------------------------------------
tokenized text: ['anarchism', 'originated', 'as', 'a', 'term']
--------------------------------------------------
tokenized ids: [5234, 3081, 12, 6, 195]


### Generate samples
 
The `CBOW` model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words). The training data thus comprises pairs of `(context_window, target_word)`, for which the model should predict the `target_word` based on the `context_window` words.

Considering a simple sentence, __the quick brown fox jumps over the lazy dog__, with a `context_window` of size 1, we have examples like __([quick, fox], brown)__, __([the, brown], quick)__, __([the, dog], lazy)__ and so on. 

<img src="
https://cdn-images-1.medium.com/max/800/1*UVe8b6CWYykcxbBOR6uCfg.png" alt="img" width="400px"/>



Now let us convert our tokenized text from `tokenized_ids` into `(context_window, target_word)` pairs.

You should loop over the `tokenized_ids` and build a __generator__ which yields a target word of length 1 and surrounding context of length (2 $\times$ `window_size`) where we take `window_size` words before and after the target word in our corpus. Remember to pad context words with zeroes to a fixed length if needed.

In [10]:
def generate_sample(tknzd_ids, window_size = 5):
    # for index, target in enumerate(tknzd_ids):
    # ############### for student ################
    #     print(index,target)
    #     tknzd_ids_pad = window_size*[0] + tknzd_ids + window_size*[0]
    #     # print(tknzd_ids_pad)
    #     if index <= window_size:
    #         context_window = (window_size-index)*[0] + tknzd_ids[0:index] + tknzd_ids[index+1:index+window_size+1]
    #     elif len(tknzd_ids)-(index+1) <= window_size:
    #         context_window = tknzd_ids[index-window_size:index] + tknzd_ids[index+1:] + (window_size-len(tknzd_ids)+index+1)*[0]
    #     else:
    #          context_window =  tknzd_ids[index-window_size:index] + tknzd_ids[index+1:index+window_size+1]                                                               
    #     print(context_window)
    tknzd_ids_pad = window_size*[0] + tknzd_ids + window_size*[0]
    for index, target in enumerate(tknzd_ids_pad):
        if index>=window_size and index < len(tknzd_ids_pad)-window_size:
            context_window = tknzd_ids_pad[index-window_size:index] + tknzd_ids_pad[index+1:index+window_size+1]
            

        
    ############################################
            yield context_window, target



In [11]:
dummy_gen = generate_sample([11, 12, 13, 14, 15], 5)
print(dummy_gen)
dummy_ex_list = list(dummy_gen)
print(dummy_ex_list)

<generator object generate_sample at 0x7f1bb1b96d60>
[([0, 0, 0, 0, 0, 12, 13, 14, 15, 0], 11), ([0, 0, 0, 0, 11, 13, 14, 15, 0, 0], 12), ([0, 0, 0, 11, 12, 14, 15, 0, 0, 0], 13), ([0, 0, 11, 12, 13, 15, 0, 0, 0, 0], 14), ([0, 11, 12, 13, 14, 0, 0, 0, 0, 0], 15)]


In [12]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_gen = generate_sample([11, 12, 13, 14, 15], 2)
dummy_ex_list = list(dummy_gen)
dummy_ex1 = dummy_ex_list[0]
assert isinstance(dummy_ex1, tuple), "Is it a pair?"

dummy_ex1_input = dummy_ex1[0]
dummy_ex1_target = dummy_ex1[1]
assert len(dummy_ex1_input) == 4, "Context length should be 2 * window_size"
assert len(dummy_ex_list[0]) == len(dummy_ex_list[-1]), "Length of all instances should be the same due to the padding"

assert dummy_ex1_target == 11, "Did you return the correct target word?"
assert dummy_ex1_input[0] == dummy_ex1_input[1] == 0, "Did you add 0 pads where needed?"
assert dummy_ex1_input == [0, 0, 12, 13], "Did you consider contexts before and after the target word?" 

print('Well done!')

Well done!


To train our model faster, it is good idea to batchify our data. For your convenience, we implemented it for you: 

In [13]:
def batch_gen(tknzd_ids, batch_size = 4,  window_size = 5):
    #if the data is divided into sentences, you should shuffle them at every epoch
    #here we won't bother because a single epoch takes long enough already.
    
    single_gen = generate_sample(tknzd_ids, window_size) # get sample generator
    
    while True:
        try: 
            # The end of iterations is indicated by an exception 
            context_batch = np.zeros([batch_size, window_size * 2], dtype=np.int32)
            target_batch = np.zeros([batch_size], dtype=np.int32)
            for index in range(batch_size):
                context_batch[index], target_batch[index] = next(single_gen)
            yield context_batch, target_batch
        except StopIteration:
            break

In [14]:
dummy_batches = batch_gen([11, 12, 13, 14, 15, 16, 17, 18], batch_size=4, window_size=2)
for i, (c, t) in enumerate(dummy_batches):
  print(i, c)
  print(i, t)
# print("First batch:\n", next(dummy_batches))
# print('-' * 50)
# print("Second batch:\n", next(dummy_batches))


0 [[ 0  0 12 13]
 [ 0 11 13 14]
 [11 12 14 15]
 [12 13 15 16]]
0 [11 12 13 14]
1 [[13 14 16 17]
 [14 15 17 18]
 [15 16 18  0]
 [16 17  0  0]]
1 [15 16 17 18]


## 2. CBOW Model

We now leverage pytorch to build our CBOW model. For this, our inputs will be our context words which are first converted into one-hot vectors, and next projected into a word vector. Word-vectors will be obtained from an embedding matrix ($W$) which represents the distributed feature vectors associated with each word in the vocabulary. This embedding-matrix is initialized with a normal distribution.

Next, the projected words are averaged out (hence we don’t really consider the order or sequence in the context words when averaged) and then we multiply this averaged vector with another embedding matrix ($W'$), which defines so-called context embeddings to project the CBOW representation back to the one-hot space to match with the target word. (Note: in the theory lectures, this is introduced as the linear output layer, with dimensions equal to the transposed embedding matrix.)  We apply a log-softmax on the resulting context vectors, to predict the most probable target word given the input context.

We match the predicted word with the actual target word, compute the loss by leveraging the cross entropy loss and perform back-propagation in each iteration to update the embedding-matrix in the process.

<img src="https://cdn-images-1.medium.com/freeze/max/1000/1*uATTt40gbJ1HJQgIqE-VPA.png?q=20" alt="img" width="512px"/>



### Question-1

- How could we modify the `CBOW` architecture to consider the order and position of the context words?  

**<font color=blue><<< For the interior of windows, n-gram can be used instead of single tokens. >>></font>**

Now, complete the CBOW class below, following the instructions in the comments.

In [15]:
from torch._C import dtype
class CBOW(nn.Module):

    def __init__(self, embedding_dim=100, vocab_size=10000):
        super(CBOW, self).__init__()
        
        self.vocab_size = vocab_size
        
        # use nn.Parameter to define the two matrices W and W' from above, 
        # thus one for word (W) and one for context (W') embeddings:
        # self.embed_in = ...  # word embedding   W
        # self.embed_out = ... # context embedding  W'
        ############### for student ################
        self.embedding_dim = embedding_dim   ####embedding_dim可以理解成hidden_size嘛？
        self.embed_in = nn.Parameter(torch.zeros(self.embedding_dim, self.vocab_size))
        self.embed_out = nn.Parameter(torch.zeros(self.vocab_size,self.embedding_dim))



        ############################################
        
        self.reset_parameters()
            
    
    def reset_parameters(self):
        # Initialize parameters
        nn.init.kaiming_uniform_(self.embed_in, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.embed_out, a=math.sqrt(5))
    
    def get_word_embedding(self):
        return self.embed_in
    
    def get_context_embedding(self):
        return self.embed_out
    
    
    def forward(self, inps):
        """
        Convert given indices to log-probablities. 
        Follow these steps:
        1) convert the inputs' word indices to one-hot vectors (via F.one_hot)
        2) project the one-hot vectors to their embedding (use F.linear, do *NOT* use nn.Embedding)
        3) calculate the mean of the embedded vectors
        4) project back with the context embedding matrix 
        5) calculate the log-probability (with F.log_softmax)
                
        :argument:
            inps (list): List of indecies
        
        :return:
            log-probablity of words
        """
        ############### for student ################
        hidden_sub = 0
        list_hidden = []
        one_hot = F.one_hot(inps,num_classes=self.vocab_size).float()
        # print(one_hot)
        for i in one_hot:
            for s in i:
                # print(s.dtype)
                hidden_s = F.linear(s,self.embed_in)
                # print(hidden_s)
                hidden_sub += hidden_s
                hidden_mean = hidden_sub/self.embedding_dim
            
            list_hidden.append(hidden_sub.detach().numpy().tolist())
        # print(list_hidden)
        out1 = torch.Tensor(list_hidden)
        # print(out1)
        out2 = F.linear(out1,self.embed_out)
        log_probs = F.log_softmax(out2)
        ############################################
        return log_probs

In [16]:
dummy_model = CBOW(20, 10)
# print(dummy_model)
dummy_inps2 = torch.tensor([[6, 7, 9, 0], [1, 2, 3, 4]], dtype=torch.long)
# print(dummy_inps1.dtype)
dummy_pred2 = dummy_model(dummy_inps2)
print(dummy_pred2.shape)

torch.Size([2, 10])


  log_probs = F.log_softmax(out2)


In [17]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_model = CBOW(20, 10)
dummy_inps1 = torch.tensor([[6, 7, 9, 0]], dtype=torch.long)
dummy_inps2 = torch.tensor([[6, 7, 9, 0], [1, 2, 3, 4]], dtype=torch.long)
dummy_pred1 = dummy_model(dummy_inps1)

dummy_pred2 = dummy_model(dummy_inps2)


assert isinstance(dummy_model.embed_in, nn.Parameter), "Use nn.Parameter for embed_in"
assert isinstance(dummy_model.embed_out, nn.Parameter), "Use nn.Parameter for embed_out"
assert dummy_model.embed_in.shape == torch.Size([20, 10]), "param_in shape is not correct"
assert dummy_model.embed_out.shape == torch.Size([10, 20]), "param_out shape is not correct"
assert dummy_pred1.shape == torch.Size([1,10]), "Prediction shape is not correct"
assert dummy_pred2.shape == torch.Size([2,10]), "Prediction shape is not correct"
assert dummy_pred1.grad_fn.__class__.__name__ in ['LogSoftmaxBackward', 'LogSoftmaxBackward0'], "softmax layer?"

print('Well done!')

Well done!


  log_probs = F.log_softmax(out2)


### Train Model

Before jumping into the training part, we need to define some hyper-parameters:

In [18]:
# embedding hyper-parameters

EMBED_DIM = 64
WINDOW_SIZE = 5
BATCH_SIZE = 128
VOCAB_SIZE = 5000

EPOCHS = 1 # to make things faster in this basic setup
interval = 500

In [19]:
# get data

tokenized_ids, id_to_token, _ = get_data(tokenized_text, VOCAB_SIZE)
print(len(tokenized_ids))

17005207


Now we define our main training loop. Please implement the typical steps for training:
- Reset all gradients
- Compute output and loss value
- Perform back-propagation
- Update the network’s parameters

In [None]:
model = CBOW(EMBED_DIM, VOCAB_SIZE)
model = model.to(device)

criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters())

loss_history = []

for e in range(EPOCHS):
    
    batches = batch_gen(tokenized_ids, batch_size=BATCH_SIZE, window_size=WINDOW_SIZE)
    print(batches)
    total_loss = 0.0
    
    for iteration, (context, target) in enumerate(batches):
        
        # Step 1. Prepare the inputs to be passed to the model (wrap integer indices in tensors)
        # Step 2. Zero out the gradients
        # Step 3. Run the forward pass, getting predicted target word log probabilities
        # Step 4. Compute your loss function. 
        # Step 5. Do the backward pass and update the gradient
        print(iteration)
        # print("context shape: ",context.shape)
        # print("target shape: ",target.shape)

        # ############### for student ################
        context_tensor = torch.tensor(context, dtype = torch.long)
        # print("y^: ", context_tensor.shape)
        # for c in context_tensor:
          # print("c: ", c.shape)

        target_tensor=torch.from_numpy(target).long()
        # for t in target_tensor:
          # print("t: ", t.shape)
        # print("y: ", target_tensor.shape)
        optimizer.zero_grad()
        log_prob = model(context_tensor)
        loss = criterion(log_prob,target_tensor)
        print(loss)
        loss.backward()
        optimizer.step()
        # ############################################
        total_loss += loss.item()
        
        if iteration % interval == 499:
            print('Epoch:{}/{},\tIteration:{},\tLoss:{}'.format(e, EPOCHS, iteration, total_loss / interval))
            loss_history.append(total_loss / interval)
            total_loss = 0.0

In [None]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

assert loss_history[-1] < 6.5  #after an entire epoch, the loss would rather be around 5

print('Well done!')

### Nearest words

So far, we trained the __CBOW__ successfully, now it is time to explore it more. In this part, we want to find the $k$ nearest words to a given word, i.e., nearby in the vector space.

<img src="https://i0.wp.com/i.imgur.com/IeZt839.png" alt="img" width="480px"/>



Define a helper function to retrieve the corresponding vector for a given word:

In [None]:
def get_vector(embedding, word):
    """
    # be sure jupyter session is not terminated!
    # use token_to_id to retrieve the index, e.g., token_to_id[word]

    :argument:
        embedding (matrix): embedding matrix 
        word (str): The given input
    :return:
        word-vector for a given word
    """
    ############### for student ################
    word_id = torch.tensor([token_to_id[word]])
    one_hot_word = F.one_hot(word_id,num_classes=embedding.shape[1]).float()
    word_vector = (F.linear(one_hot_word, embedding)).t()   
    return word_vector
    ############################################

In [None]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

embedding = model.embed_in.data
assert get_vector(embedding, 'school').shape == torch.Size([64, 1]), "vector size should be (embed_dim, 1)"
assert np.allclose(embedding[:,(0,)].data.cpu().numpy(), get_vector(embedding, 'UNK').data.cpu().numpy()), "Do you retrieve correct vector?"
print('Well done!')

Well done!


Define a function to return the list of $k$ most similar words, e.g., based on `cosine-similarity`, to a given word:

In [None]:
def most_similar_words(embedding, word, k=1):
    """
    return k similar words based on cosine similarity (F.cosine_similarity)
    :argument:
        embedding (matrix): embedding matrix 
        word (str): The given input
        k (int): The number of similar items    
    :return:
        list of k similar items
    """
    most_similar = []
    dict_similar = {}
    x = get_vector(embedding, word) # 300, 1
    # ...
    # most_similar = ...
    ############### for student ################
    # print(list(token_to_id.values())[:5000])
    for word in list(token_to_id.keys())[:5000]:
        y = get_vector(embedding, word)
        # print("x:", x)
        # print("w:", type(w))
        similarity = F.cosine_similarity(x, y, dim = 0)
        # print(similarity.item())
        dict_similar[word] = similarity.item()
    dict_similar_list = sorted(dict_similar.items(), key=lambda x:x[1],reverse=True)
    # print(dict_similar)
    most_similar += [t[0] for t in dict_similar_list[1:k+1]]
    ############################################
    return most_similar

In [None]:
embedding = model.embed_in.data
print(most_similar_words(embedding, "manual", 3))

['academic', 'persian', 'decimal']


In [None]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

embedding = model.embed_in.data

dummy_list = most_similar_words(embedding, "manual", 3)
s1 = F.cosine_similarity(get_vector(embedding, dummy_list[0]).T, get_vector(embedding, "manual").T)
s2 = F.cosine_similarity(get_vector(embedding, dummy_list[1]).T, get_vector(embedding, "manual").T)
s3 = F.cosine_similarity(get_vector(embedding, dummy_list[2]).T, get_vector(embedding, "manual").T)

assert len(dummy_list) == 3, "return k nearest words"
assert s1.data.cpu().numpy()[0] >= s2.data.cpu().numpy()[0], "first item should have higher probablity to the given word"
assert s2.data.cpu().numpy()[0] >= s3.data.cpu().numpy()[0], "second item should have higher probability"
assert s1.data.cpu().numpy()[0] != 1 , "Similarity score of one means you return the word itself"

print('Well done!')

Well done!


### Linear projection


Now let's visualize the trained word embeddings in low-dimensional space. 
The simplest linear dimensionality reduction method is __P__rincipial __C__omponent __A__nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.


<img src="https://hackernoon.com/hn-images/1*ZFqnPuxa1PtUece-OHBoTA.png" alt="img" width="512px"/>

Under the hood, it attempts to decompose an object-feature matrix $X$ into two smaller matrices: $W$ and $\hat W$ minimizing the *mean squared error*:

$$\min_{W, \hat{W}} \ \ \|(X W) \hat{W} - X\|^2_2 $$

with
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;


In [None]:
from sklearn.decomposition import PCA

# Map word vectors onto a 2D plane with PCA. Use the good old sklearn API (fit, transform).
# Finally, normalize the mapped vectors, to make sure they have zero mean and unit variance 

# word_vectors = ...
# ...
# word_vectors_pca = ...  # normalized vectors

############### for student ################
embedding = model.embed_in.data
word_vectors = np.array([get_vector(embedding, word).squeeze(1).numpy() for word in list(token_to_id.keys())[:5000]])
print(word_vectors.shape)

    # print(y.tolist())
#     list_array.append(.tolist())
# np_array = np.array(list_array)
# print(np_array.shape)


pca = PCA(n_components=2)
word_vectors_pca = pca.fit_transform(word_vectors)
print(word_vectors_pca)
from sklearn.preprocessing import StandardScaler
word_vectors_pca = StandardScaler().fit_transform(word_vectors_pca)
print(word_vectors_pca)



############################################

(5000, 64)
[[-0.00929424  0.00385346]
 [-0.01046508 -0.00508939]
 [ 0.01162651  0.00947326]
 ...
 [-0.00745938 -0.00775107]
 [-0.00098803 -0.00318186]
 [ 0.00109768 -0.01138258]]
[[-1.0232749   0.43438944]
 [-1.1521811  -0.57371116]
 [ 1.2800524   1.0678918 ]
 ...
 [-0.8212604  -0.8737551 ]
 [-0.10877965 -0.35868105]
 [ 0.12085206 -1.2831243 ]]


In [None]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

assert word_vectors_pca.shape == (len(word_vectors), 2), "there must be a 2D vector for each word"
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1.0 - word_vectors_pca.std(0))) < 1e-2, "points must have unit variance"

print('Well done')

Well done


In [None]:
!pip install bokeh



In [None]:
# !pip install bokeh

import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [None]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=list(id_to_token.values()))

### Visualizing neighbors with t-SNE
PCA is nice but it is strictly linear and thus only able to capture the coarse high level structure of the data.

If we instead want to focus on keeping neighboring points near, we could use t-SNE, which is itself an embedding method. Here you can read __[more on TSNE](https://distill.pub/2016/misread-tsne/)__.

In [None]:
from sklearn.manifold import TSNE

# Map word vectors onto a 2D plane with TSNE. (Hint: use verbose=100 to see what it's doing.)
# Normalize them just like with PCE into word_tsne

# ...
# word_tsne = ...

############### for student ################
word_tsne = TSNE().fit_transform(word_vectors)


############################################



In [None]:
draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=list(id_to_token.values()))

## 3. POS tagging task

The embeddings by themselves are nice to have, but the main objective of course is to solve a particular (NLP) task. Further, so far we have trained our own embedding from a given corpus, but often it is beneficial to use existing word embeddings.

Now, let's use embeddings to train a simple Part of Speech (PoS) tagging model, using pretrained word embeddings. We shall use [50d glove word vectors](https://nlp.stanford.edu/projects/glove/) for the rest of this section.

Before jumping into our neural POS tagger, it is better to set up a baseline to give us an intuition how the neural model performs compared to other models. The baseline model is the [Conditional-Random-Field (CRF)](https://en.wikipedia.org/wiki/Conditional_random_field), also discussed in the theory lectures, which is a discriminative sequence labelling model. The evaluation is done on a 10\% sample of the Penn Treebank (which is offered through NLTK).

Download data from `nltk` repository and split it into test (20%) and training (80%) sets:

In [None]:
import nltk

# download necessary packages from nltk
nltk.download('treebank')
nltk.download('universal_tagset')

tagged_sentence = nltk.corpus.treebank.tagged_sents(tagset='universal')
print("Number of Tagged Sentences ", len(tagged_sentence))
print(tagged_sentence[0:5])

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
Number of Tagged Sentences  3914
[[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')], [('Mr.', 'NOUN'), ('Vinken', 'NOUN'), ('is', 'VERB'), ('chairman', 'NOUN'), ('of', 'ADP'), ('Elsevier', 'NOUN'), ('N.V.', 'NOUN'), (',', '.'), ('the', 'DET'), ('Dutch', 'NOUN'), ('publishing', 'VERB'), ('group', 'NOUN'), ('.', '.')], [('Rudolph', 'NOUN'), ('Agnew', 'NOUN'), (',', '.'), ('55', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), ('and', 'CONJ'), ('former', 'ADJ'), ('chairman', 'NOUN'), ('of', 'ADP'), ('Consolidated

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(tagged_sentence, test_size=0.20, random_state=42)

print("Train size: {}".format(len(train)))
print("Test size: {}".format(len(test)))

Train size: 3131
Test size: 783


### Setup a baseline

In [None]:
def features(sentence, index):
    """
    Return hand designed features for a given word
    :argument:
        sentence: tokenized sentence [w1, w2, ...] 
        index: index of the word    
    :return:
        a feature set for given word
    """

    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        ############### for student ################
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
        ############################################
    }

##### Question-2

Suggest at least 6 more features that you could improve the above feature-set and add them to the code above (e.g., to take into account typical word prefixes or endings). After running the model with these features: which features worked best, and how much did your new features help in improving the model?  To get an idea, a test score of 0.95 should be feasible with a few well-chosen additional features.

**<font color=blue><<< 'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]; suffix of the sentence can produce the best performances, and make the score over 0.95. The combination of more features can result in higher scores.  >>></font>**

1. original score: 0.9376125358898243
2. 'prefix-1': sentence[index][0]  0.946177429558616
3. 'prefix-2': sentence[index][:2] 0.9465667429071974
4. 'prefix-3': sentence[index][:3] 0.9478320112900871
5. 'suffix-1': sentence[index][-1] 0.9508978539101659
6. 'suffix-2': sentence[index][-2:] 0.9537690398559541
7. 'suffix-3': sentence[index][-3:] 0.9561535841160154
8. 'has_hyphen': '-' in sentence[index] 0.9417003260499295
9. 'is_numeric': sentence[index].isdigit() 0.9409216993527666
10.'capitals_inside': sentence[index][1:].lower() != sentence[index][1:] 0.9379045209012604

11. combine 5,6,7: 0.9600953817704024
12. combine 2,3,4: 0.9518711372816195
13. combine 2,3,4,5,6,7: 0.9672490145505864
14. combine 2,3,4,5,6,7,8,9: 0.9693415737992116
15. combine 2,3,4,5,6,7,8,9,10: 0.9695362304735023

In [None]:
def transform2feature_label(tagged_sentence):
    X, y = [], []
 
    for tagged in tagged_sentence:
        X.append([features([w for w, t in tagged], i) for i in range(len(tagged))])
        y.append([tagged[i][1] for i in range(len(tagged))])
    
    return X,y

In [None]:
X_train, y_train = transform2feature_label(train)
X_test, y_test = transform2feature_label(test)

In [None]:
X_train[0][0]

{'capitals_inside': False,
 'has_hyphen': False,
 'is_capitalized': True,
 'is_first': True,
 'is_last': False,
 'is_numeric': False,
 'next_word': 'Vinken',
 'prefix-1': 'P',
 'prefix-2': 'Pi',
 'prefix-3': 'Pie',
 'prev_word': '',
 'suffix-1': 'e',
 'suffix-2': 're',
 'suffix-3': 'rre',
 'word': 'Pierre'}

In [None]:
# install crf-classifier

!pip install sklearn-crfsuite



In [None]:
import sklearn_crfsuite
from sklearn_crfsuite import CRF

# fit crfsuite classifier on train data
# use lbfgs for optimization and set max number of iterations to 100
############### for student ################
crf = CRF()
crf.fit(X_train, y_train)

############################################

print ("Accuracy:", crf.score(X_test, y_test))

Accuracy: 0.9695362304735023


### Build a neural model 

Now it is time to build our Neural PoS-tagger. The model we want to play with is a bi-directional LSTM on top of pretrained word embeddings. First, we prepare the embedding part and then go into the model itself:

In [None]:
# download glove 50d

!wget "https://www.dropbox.com/s/lc3yjhmovq7nyp5/glove6b50dtxt.zip?dl=1" -O glove6b50dtxt.zip
!unzip -o glove6b50dtxt.zip
!rm glove6b50dtxt.zip

--2022-05-06 10:00:30--  https://www.dropbox.com/s/lc3yjhmovq7nyp5/glove6b50dtxt.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601b:18::a27d:812
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/lc3yjhmovq7nyp5/glove6b50dtxt.zip [following]
--2022-05-06 10:00:30--  https://www.dropbox.com/s/dl/lc3yjhmovq7nyp5/glove6b50dtxt.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc38df2578b7d0da51438b337c87.dl.dropboxusercontent.com/cd/0/get/Bks56UR5i-5fff_vFnZodaV82kzKIerJlg-K1yrEeWAOMUzcvoNq-kNLupxlm3YURDhcSUQERHA2mnGnPQBcANc201z1f25FZBUKDhgPLPNinj3QO5imoZFEHsEBgpEU-3Qfo68jS72TouxaJymipH7wj1Wc8qCj8TNJxfcAYtH-1Q/file?dl=1# [following]
--2022-05-06 10:00:31--  https://uc38df2578b7d0da51438b337c87.dl.dropboxusercontent.com/cd/0/get/Bks56UR5i-5fff_vFnZodaV82kzKIerJlg-K1yrEeWAOMUzcvoN

In [None]:
GLOVE_PATH = 'glove.6B.50d.txt'

We build two dictionaries for mapping words and tags to uniqe IDs, which we need later on:

In [None]:
word_to_id = {}
tag_to_id = {}

for sentence in tagged_sentence:
    for word, pos_tag in sentence:
        if word not in word_to_id.keys():
            word_to_id[word] = len(word_to_id)
        if pos_tag not in tag_to_id.keys():
            tag_to_id[pos_tag] = len(tag_to_id)
            
word_vocab_size = len(word_to_id)
tag_vocab_size = len(tag_to_id)

print("Unique words: {}".format(word_vocab_size))
print("Unique tags: {}".format(tag_vocab_size))

Unique words: 12408
Unique tags: 12


We created a wrapper for the embedding module to encapsulate it from the other parts. This module aims to load word vectors from file and assign the weights into the corresponding embedding.

Create an embedding layer (this time use `nn.Embedding`), and copy the pretrained embeddings to its `weight` field. In this exercise, you can continue to finetune the embeddings while training the end task; no need to freeze them: this means the pre-trained embeddings serve as a smart initialization of the embedding layer.

In [None]:
class PretrainedEmbeddings(nn.Module):
    def __init__(self, filename, word_to_id, dim_embedding):
        super(PretrainedEmbeddings, self).__init__()
        
        wordvectors = self.load_word_vectors(filename, word_to_id, dim_embedding)
        # self.embed = ...
        # self.embe.weight ... 
        ############### for student ################
        self.embed = nn.Embedding(len(word_to_id),dim_embedding)
        self.embed.weight = nn.Parameter(torch.zeros(len(word_to_id), dim_embedding))
        ############################################

    def forward(self, inputs):
        return self.embed(inputs)
    
    def load_word_vectors(self, filename, word_to_id, dim_embedding):
        wordvectors = torch.zeros(len(word_to_id), dim_embedding)
        with open(filename, 'r') as file:
            for line in file.readlines():
                data = line.split(' ')
                word = data[0]
                vector = data[1:]
                if word in word_to_id.keys():
                    wordvectors[word_to_id[word],:] = torch.Tensor([float(x) for x in vector])
        
        return wordvectors

In [None]:
dummy_model = PretrainedEmbeddings(GLOVE_PATH, word_to_id, 50)
dummy_inps = torch.tensor([0, 4, 3, 5, 9], dtype=torch.long)
dummy_model.embed.weight.shape 

torch.Size([12408, 50])

In [None]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_model = PretrainedEmbeddings(GLOVE_PATH, word_to_id, 50)
dummy_inps = torch.tensor([0, 4, 3, 5, 9], dtype=torch.long)

assert dummy_model.embed.weight.shape == torch.Size([word_vocab_size, 50]), "embedding shape is not correct"
assert dummy_model(dummy_inps).shape == torch.Size([5, 50]), "word embedding shape is not correct"
assert np.allclose(dummy_model.embed.weight.detach().numpy()[0], [0] * 50), "Load weights from glove?"
assert np.allclose(dummy_model.embed.weight.detach().numpy()[714], [0] * 50), "Are you sure you load from glove correctly?"

print('Well done')

Well done


Let’s now define the model. Here’s what we need:

- We’ll need an embedding layer that computes a word vector for each word in a given sentence
- We’ll need a bidirectional-LSTM layer to incorporate context from both directions  (reshape the embedding since `nn.LSTM` needs 3-dimensional inputs)
- After the LSTM Layer we need a Linear layer that picks the appropriate POS tag (note that this layer is applied to each element of the sequence).
- Apply the LogSoftmax to calculate the log probabilities from the resulting scores.

Complete the forward path of the POSTagger model: 

In [None]:
class POSTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, word_to_id, tag_to_id, embedding_file_path):
        super(POSTagger, self).__init__()
        
        self.embed = PretrainedEmbeddings(embedding_file_path, word_to_id, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)
        self.hidden2tag = nn.Linear(hidden_dim * 2, len(tag_to_id))
        
    def forward(self, sentence):
        ############### for student ################
        embedding_sen = self.embed(sentence).unsqueeze(0)
        # print(embedding_sen.shape)
        output, x = self.lstm(embedding_sen)
        # print(output)
        prediction = self.hidden2tag(output).squeeze(0)
        m = nn.LogSoftmax()
        tag_scores = m(prediction)
        ############################################
        return tag_scores

In [None]:
dummy_model = POSTagger(50, 50, word_to_id, tag_to_id, GLOVE_PATH)
dummy_inps = torch.tensor([0, 4, 3, 5, 9], dtype=torch.long)
print(dummy_model)
print(dummy_inps)
dummy_model(dummy_inps).shape

POSTagger(
  (embed): PretrainedEmbeddings(
    (embed): Embedding(12408, 50)
  )
  (lstm): LSTM(50, 50, bidirectional=True)
  (hidden2tag): Linear(in_features=100, out_features=12, bias=True)
)
tensor([0, 4, 3, 5, 9])




torch.Size([5, 12])

In [None]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_model = POSTagger(50, 50, word_to_id, tag_to_id, GLOVE_PATH)
dummy_inps = torch.tensor([0, 4, 3, 5, 9], dtype=torch.long)

assert dummy_model(dummy_inps).grad_fn.__class__.__name__ in ['LogSoftmaxBackward', 'LogSoftmaxBackward0'], "softmax layer?"
assert dummy_model(dummy_inps).shape == torch.Size([5, len(tag_to_id)]), "The output has wrong shape! Probably you need some reshaping!"

print("Well done!")

Well done!




Perfect! Now train your model:

In [None]:
# Training start

model = POSTagger(50, 64, word_to_id, tag_to_id, GLOVE_PATH)
model = model.to(device)
criterion = nn.NLLLoss()
optimizer = optim.AdamW(model.parameters())

accuracy_list = []
loss_list = []

interval = round(len(train) / 100.)
EPOCHS = 6
e_interval = round(EPOCHS / 10.)
print(len(train))
for e in range(EPOCHS):
    acc = 0 
    loss = 0
    
    model.train()
    
    for i, sentence_tag in enumerate(train):
        
        sentence = [word_to_id[s[0]] for s in sentence_tag]
        sentence = torch.tensor(sentence, dtype=torch.long)
        sentence = sentence.to(device)
        targets = [tag_to_id[s[1]] for s in sentence_tag]
        targets = torch.tensor(targets, dtype=torch.long)
        targets = targets.to(device)
        
        model.zero_grad()
        
        tag_scores = model(sentence)
        
        loss = criterion(tag_scores, targets)
        
        loss.backward()
        
        optimizer.step()
        
        loss += loss.item()
        
        _, indices = torch.max(tag_scores, 1)

        acc += torch.mean((targets == indices).float())
        
        if i % interval == 0:
            print("Epoch {} Running;\t{}% Complete".format(e + 1, i / interval), end = "\r", flush = True)
    
    loss = loss / len(train)
    acc = acc / len(train)
    loss_list.append(float(loss))
    accuracy_list.append(float(acc))
    
    if (e + 1) % e_interval == 0:
        print("Epoch {} Completed,\tLoss {}\tAccuracy: {}".format(e + 1, np.mean(loss_list[-e_interval:]), np.mean(accuracy_list[-e_interval:])))

3131
Epoch 1 Running;	0.0% Complete



Epoch 1 Completed,	Loss 1.5070617337187286e-05	Accuracy: 0.8535513281822205
Epoch 2 Completed,	Loss 1.4119958905212115e-05	Accuracy: 0.9619249105453491
Epoch 3 Completed,	Loss 1.644892108743079e-05	Accuracy: 0.9679481983184814
Epoch 4 Completed,	Loss 1.907019213831518e-05	Accuracy: 0.9689435958862305
Epoch 5 Completed,	Loss 2.061753002635669e-05	Accuracy: 0.9698601961135864
Epoch 6 Completed,	Loss 1.9404629711061716e-05	Accuracy: 0.9707337617874146


So far, so good! It's time to test our classifier. Complete the evaluation part, i.e., compute the accuracy on the test data:

In [None]:
def evaluate(model, data):

    model.eval()
    
    acc = 0.0
    
    # calculate accuracy based on predictions
    ############### for student ################
    with torch.no_grad():
        for i, sentence_tag in enumerate(data):
            
            sentence = [word_to_id[s[0]] for s in sentence_tag]
            sentence = torch.tensor(sentence, dtype=torch.long)
            targets = [tag_to_id[s[1]] for s in sentence_tag]
            targets = torch.tensor(targets, dtype=torch.long)
            
            prediction = model(sentence)
            # print(tag_scores)
            _, indices = torch.max(prediction, 1)
            # print(tensor_score)
            # print(torch.mean((targets == indices).float()))
            # print(torch.max(tag_scores, 1))

            acc += torch.mean((targets == indices).float())
        # print(len(data))
        score = acc / len(data)
        # print(score)
      ############################################
    
    return score
    
        

In [None]:
score = evaluate(model, train)



In [None]:
score = evaluate(model, test)
print("Accuracy:", score)

assert score > 0.96, "accuracy should be above 96%"
assert score < 1.00, "accuracy should be less than 100!%"

print('Well done!')



Accuracy: tensor(0.9292)


AssertionError: ignored

### Question-3

- Whether or not to fine-tune the pre-trained embeddings, the number of epochs you need (whether or not to use 'early stopping'), to apply regularization... are hyperparameters that should be properly tuned on a validation set. We did not do this here. It is therefore hard to make strong claims about the model at this point. However, as a quick test, please train the POS model with the same settings, but with a standard randomly initialized embedding layer instead of the pretrained embeddings. What do you observe compared to the CRF baseline / compared to the GloVe initialization? (Note: for your final code in `POSTagger`, please make sure it again loads the pretrained embeddings).

**<font color=blue><<< INSERT ANSWER HERE >>></font>**

### Acknowledgment

If you received help or feedback from fellow students, please acknowledge that here. We count on your academic honesty:

**<font color=blue><<< LIST POTENTIAL COLLABORATORS HERE >>></font>**