In [None]:
!pip install nltk

# Semantic word representations


Recall that *semantic word representations* are learned representations that capture the 'meaning' of a word. These are low-dimensional vectors that encode some semantic properties. In this notebook we are going to build semantic word representations using the **word2vec** modelling approach. We will also use these vectors in some tasks to understand the utility of these representations.

We begin by loading some of the libraries that are necessary for building our model. We are using [pytorch](https://pytorch.org/), an open source deep learning platform, as our backbone library in the course. 



In [None]:
#@title Loading packages

import torch
from torch.autograd import Variable
import numpy as np
import torch.nn.functional as F
from scipy.spatial.distance import euclidean, cosine
from tqdm import tqdm 
import codecs
from sklearn.metrics.pairwise import cosine_distances

In [None]:
#@title Sample corpora

corpus = [
    'he is a king',
    'she is a queen',
    'he is a Man',
    'she Is a woman',
    'london is, the capital of England',
    'Berlin is ... the capital of germany',
    'paris is the capital of france.',
    'He will eat cake, pie, and/or brownies',
    "she didn't like the brownies"
]

# Tokenization

Q: What is a token and why do we need to tokenize?  
A: A token is a string of contiguous characters between two spaces, or between a space and a punctuation mark. We tokenize to segment raw text into units that a model can process.

Q: Print the tokenized corpus above. What mistakes do you find in the code below?  
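A: `print([s.split() for s in corpus])`. Capitalization is not normalized (e.g. 'Is' vs 'is') and punctuation stays attached to words (e.g. 'is,'), both of which inflate the vocabulary and make the dataset sparser.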

Q: What could be a nice way of fixing these mistakes?  
A: Process of normalization (removing capitalization, etc.)
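
One possible normalization is sketched below using only the standard library: lowercase each sentence, then use a regular expression to pull out runs of letters (optionally with an internal apostrophe). The pattern and the `simple_tokenize` name are illustrative choices, not the tokenizer used in the rest of this notebook.

```python
import re

# Illustrative pattern: a run of letters, optionally followed by an apostrophe and more letters.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")

def simple_tokenize(sentence):
    """Lowercase a sentence and return its word tokens, dropping punctuation."""
    return TOKEN_PATTERN.findall(sentence.lower())

sample_sentences = [
    'london is, the capital of England',
    'Berlin is ... the capital of germany',
    "she didn't like the brownies",
]
for sentence in sample_sentences:
    print(simple_tokenize(sentence))
```

Unlike a plain `split(' ')`, this keeps "didn't" as a single token and never produces tokens such as "is," or "...".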

##### 10 mins 

In [None]:
tokenized_corpus = [] # Let us put the tokenized corpus in a list
for sentence in corpus:
  tokenized_sentence = []
  for token in sentence.split(' '): # simplest split is on single spaces
    # Q3
    token = token.lower()
    tokenized_sentence.append(token)
  tokenized_corpus.append(tokenized_sentence)

print(tokenized_corpus)

# Pre-processing

Tokenization is a crucial pre-processing step in the NLP domain`*`. However, other pre-processing techniques also exist, many of which were extensively employed in rule-based and statistical NLP. While we don't utilise these pre-processing techniques in neural-based NLP anymore, they are still worth a recap. Typically, **stop word** and **punctuation** removal are employed, along with *either* **stemming** or **lemmatization**. However, in the following code, we will demonstrate each of the techniques separately (mainly because our corpus is so small).

### Stop Word removal
Stop words are generally the most common words in a language, whose meaning in a sequence is ambiguous. Some examples of stop words are: the, a, an, that.

### Punctuation removal
Old-school NLP techniques (and some modern-day ones) struggle to understand the semantics of punctuation. Thus, punctuation was also typically removed.

## Stemming and Lemmatization
Stemming and Lemmatization are two distinct word normalization techniques. Essentially this means that, given our corpora, we wish to have variants of a word in a 'normal' form. For example, [playing, plays, played] may be normalised to "Play". The sentence "the boy's cars are different colours" may be normalised to "the boy car be differ colour"

### Stemming
In the case of stemming, we want to normalise all words to their stem (or root). The stem is the part of the word to which affixes (suffixes or prefixes) are assigned. Stemming a word may result in the word not actually being a word. For example, some stemming algorithms may stem [trouble, troubling, troubled] as "troubl".
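
To see why a stem need not be a real word, here is a deliberately naive suffix stripper. It is a toy written for this explanation (the suffix list and the `toy_stem` name are assumptions), not the Porter algorithm used later in this notebook.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s", "e")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least `min_stem_length` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for word in ["trouble", "troubling", "troubled", "plays", "playing", "played"]:
    print(word, "->", toy_stem(word))
```

All three forms of "trouble" collapse to the non-word "troubl", while the forms of "play" happen to collapse to a real word.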

### Lemmatization
Lemmatization attempts to properly reduce unnormalized tokens to a word that belongs to the language. The root word is called a **lemma**, and is the canonical form of a set of words. For example, [runs, running, ran] are all forms of the word "run".



Q. Think of two or three other stop words, and add them to the list of stop words below.  
A. ["is", "and", "or"].  
**N.B** when we run the punctuation removal below, we see that some of the words with apostrophes are split (e.g. "didn't" => "didn", "t"). Some stop word lists also add the "t" to their list. 

Q. Write some code which both removes stop words and punctuation from our corpus.  
A. See code below

Q. The examples of stemming and lemmatization below are on words/sequences not in our corpus. Extend the code so it works on our corpus.  
A. See code below

##### 10 mins 

N.B. We are not going to use these techniques in this file after this section, so we will demonstrate how to perform these techniques distinctly on our toy corpus.

`*`Recently there have been newer approaches to "tokenization" which go further than one token being one word. One example is [SentencePiece](https://github.com/google/sentencepiece). These approaches are out of scope for this lab session, but may appear in future sessions.

In [None]:
import nltk
nltk.download('punkt') # Download the tokenizer model
nltk.download('wordnet') # Download the wordnet corpora

In [None]:
# STOP WORD REMOVAL
stop_words_list = ["the", "a", "an", "that"]
# Q1
stop_words_list.extend(["is", "and", "or"])

# SWR = stop words removed
tokenized_corpus_SWR = []
for sentence in corpus:
    tokenized_sentence_SWR = []
    
    for token in sentence.split(" "):
        if token.lower() not in stop_words_list: # lowercase so that e.g. "Is" matches "is"
            tokenized_sentence_SWR.append(token)

    if tokenized_sentence_SWR: # Only append to corpus if tokenized_sentence_SWR isn't empty
        tokenized_corpus_SWR.append(tokenized_sentence_SWR)
        
print(tokenized_corpus_SWR)

In [None]:
# PUNCTUATION REMOVAL
import re # regex

re_punctuation_string = r"[\s,/.']" # raw string, so that \s is not treated as a Python string escape

# PR = punctuation removed
tokenized_corpus_PR = []
for sentence in corpus:
    tokenized_sentence_PR = re.split(re_punctuation_string, sentence) # in python's regex, [...] is an alternative to writing .|.|.
    tokenized_sentence_PR = list(filter(None, tokenized_sentence_PR)) # remove empty strings from list 
    tokenized_corpus_PR.append(tokenized_sentence_PR)
        
print(tokenized_corpus_PR)

In [None]:
# ANSWER Q2 HERE
stop_words_list = stop_words_list # just for clarity ;)
tokenized_corpus_SwPR = [] # SwPR = Stop Words and Punctuation Removal
for sentence in corpus:
    tokenized_sentence_PR = re.split(re_punctuation_string, sentence)
    tokenized_sentence_PR = list(filter(None, tokenized_sentence_PR))
    
    # Now that punctuation has been removed, let's remove the stop words
    tokenized_sentence_SwPR = []
    for token in tokenized_sentence_PR:
        if token.lower() not in stop_words_list: # lowercase so that e.g. "Is" matches "is"
            tokenized_sentence_SwPR.append(token)
    
    if tokenized_sentence_SwPR:
        tokenized_corpus_SwPR.append(tokenized_sentence_SwPR)
        
print(tokenized_corpus_SwPR)

In [None]:
# STEMMING
from nltk.stem import PorterStemmer

porter = PorterStemmer()

stemming_word_list = ["friend", "friendship", "friends", "friendships","stabil","destabilize","misunderstanding","railroad","moonlight","football"]
print("{0:20}{1:20}".format("Word","Stemmed variant"))
print()

for word in stemming_word_list:
      print("{0:20}{1:20}".format(word,porter.stem(word)))

In [None]:
# Q3 - STEMMING
# We'll use the tokenized corpus from the "Tokenization" section
tokenized_corpus_stemmed = []
for t_sentence in tokenized_corpus:
    sentence_stemmed = []
    for token in t_sentence:
        sentence_stemmed.append(porter.stem(token))
    tokenized_corpus_stemmed.append(sentence_stemmed)
    
print(tokenized_corpus_stemmed)

In [None]:
# LEMMATIZATION
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

to_lemmatize_sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

# lemmatization requires punctuation removal
to_lemmatize_sentence = re.split(re_punctuation_string, to_lemmatize_sentence)
to_lemmatize_sentence = list(filter(None, to_lemmatize_sentence))

print("{0:20}{1:20}".format("Word","Lemma"))
print()

for word in to_lemmatize_sentence:
    print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word)))

In [None]:
# Why didn't the above do anything?
# It's because the lemmatizer requires parts of speech (POS) context about the word it is currently parsing.
# We would need to use a POS model to identify what the POS for a token in its context is.
# In the above example (and for Q3), we'll just pass in the VERB context for every token

print("{0:20}{1:20}".format("Word","Lemma"))
print()

for word in to_lemmatize_sentence:
    print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word, pos="v")))

In [None]:
# Q3 - Lemmatization
re_punctuation_string = r"[\s,/.']" # raw string, so that \s is not treated as a Python string escape

tokenized_corpus_to_lemmatize = []
for sentence in corpus:
    to_lemmatize_sentence = re.split(re_punctuation_string, sentence)
    to_lemmatize_sentence = list(filter(None, to_lemmatize_sentence))
    tokenized_corpus_to_lemmatize.append(to_lemmatize_sentence)

tokenized_corpus_lemmatized = []
for t_sentence in tokenized_corpus_to_lemmatize:
    lemmatized_sentence = []
    for token in t_sentence:
        lemmatized_token = wordnet_lemmatizer.lemmatize(token, pos="v")
        lemmatized_sentence.append(lemmatized_token)
    tokenized_corpus_lemmatized.append(lemmatized_sentence)
    
print(tokenized_corpus_lemmatized)

# Vocabulary

The code below obtains the vocabulary of the corpus. 

Q. Print the size of the vocabulary.  
A. `print(len(vocabulary))`

Q. A programmatically cleaner (and shorter) way of writing the code below is to use a set instead of a list. Can you implement the code below using a set?  
A. See code below

##### 3 mins

In [None]:
vocabulary = [] # Let us put all the tokens (mostly words) 
                # appearing in the vocabulary in a list
  
for sentence in tokenized_corpus:
    for token in sentence:
        if token not in vocabulary:
            vocabulary.append(token)


# Q. what is the size of the vocabulary?
# A. uncomment and fill below
vocabulary_size = len(vocabulary)
print("Vocabulary size:", vocabulary_size)


## USING A SET
vocab_set = set()
for sentence in tokenized_corpus:
    vocab_set.update(sentence)

# Sanity check to ensure that the set size is the same as the list size
print("Vocabulary (set) size:", len(vocab_set))
assert len(vocab_set) == len(vocabulary)
##

# Helper functions 

* These are some of the common helper functions that are used for NLP models:

    * `word2idx`:  Maintains a dictionary of word and the corresponding index
    
    * `idx2word`: Maintains a mapping from index to word 
    
    
* Print the word2idx and idx2word, we will be using these in future exercises. 

##### 3 mins

In [None]:

word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}

print(word2idx)
print(idx2word)



# Look-up table 

* This is a table that maps from an index to a one hot vector. 

Q. Print one-hot vectors corresponding to the words 'the', 'he' and 'england'  
A. See code below

##### 3 mins

In [None]:

def look_up_table(word_idx):
    x = torch.zeros(vocabulary_size).float()
    x[word_idx] = 1.0
    return x
  
# This is a one hot representation

# Q. try printing it for word_idx = 1


word_idx = word2idx['he']
print(look_up_table(word_idx))



In [None]:
word_vectors_for = ["the", "he", "england"]

for word in word_vectors_for:
    print("Word vector for word _{}_: \n {}".format(word, look_up_table(word2idx[word])))

# Extracting contexts and the focus word


Recall that we are building the skip-gram model. 

**We first begin by obtaining the set of contexts and focus words.**
* Let's say we have a sentence (represented as vocabulary indices): `[0, 2, 3, 6, 7]`.
* For every word in the sentence, we want to get the words which are `window_size` around it.
* So if `window_size==2`, for the word '0', we obtain: `[[0, 2], [0, 3]]`
* For the word '2', we obtain: `[[2, 0], [2, 3], [2, 6]]`
* For the word '3', we obtain: `[[3, 0], [3, 2], [3, 6], [3, 7]]`
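
The windowing rule in the bullets above can be checked with a small standalone function. This is a sketch in plain Python; `extract_pairs` is a name introduced here for illustration, and it clips the window at the sentence boundaries rather than skipping out-of-range positions one by one.

```python
def extract_pairs(indices, window_size):
    """Return (center, context) index pairs for one sentence."""
    pairs = []
    for center_pos, center in enumerate(indices):
        # Clip the window so it never runs past either end of the sentence.
        start = max(0, center_pos - window_size)
        end = min(len(indices), center_pos + window_size + 1)
        for context_pos in range(start, end):
            if context_pos != center_pos:
                pairs.append((center, indices[context_pos]))
    return pairs

sentence = [0, 2, 3, 6, 7]
pairs = extract_pairs(sentence, window_size=2)
print([p for p in pairs if p[0] == 0])
print([p for p in pairs if p[0] == 2])
print([p for p in pairs if p[0] == 3])
```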

Q. Print some of the index pairs and trace them back to their words.  
A. See code below

##### 10 mins

In [None]:
window_size = 2

idx_pairs = []

# variables of interest: 
#   center_word_pos: center word position
#   context_word_pos: context_word_position
#   add sentence length as a constraint

for sentence in tokenized_corpus:
    indices = [word2idx[word] for word in sentence]
    
    for center_word_pos in range(len(indices)):
        
        for w in range(-window_size, window_size + 1):
            context_word_pos = center_word_pos + w
            
            if context_word_pos < 0 or context_word_pos >= len(indices) or center_word_pos == context_word_pos:
                continue
                
            context_word_idx = indices[context_word_pos]
            idx_pairs.append((indices[center_word_pos], context_word_idx))

idx_pairs = np.array(idx_pairs) # it will be useful to have this as numpy array

print(idx_pairs)

In [None]:
# We'll sample 5 elements at random and trace these back to their word pairs
from random import sample

# random.sample expects a sequence, so we cast the NumPy array idx_pairs back to a list
# (of row arrays) before sampling. Each row can then be traced back to its words via idx2word.
random_pairs = sample(list(idx_pairs), 5)
print(random_pairs)

tokens_from_idx_pairs = []
for random_pair in random_pairs:
    focus_word_idx, context_word_idx = random_pair[0], random_pair[1]
    focus_word = idx2word[focus_word_idx]
    context_word = idx2word[context_word_idx]
    tokens_from_idx_pairs.append([focus_word, context_word])

print()
print(tokens_from_idx_pairs)

# Parameters and hyperparameters 

* For our toy task, let us set the embedding dimensions to 5
* Let us run the algorithm for 100 epochs (an epoch is one pass of the training algorithm over the corpus/training data)
* Let us choose the learning rate as 0.001

We have two parameter matrices $W_1$ and $W_2$ - the embedding matrix and the weight matrix. 

Q. What are the dimensionalities of $W_1$ and $W_2$?
A.  
```
shape(W1) = [embedding_dims x vocabulary_size]
shape(W2) = [vocabulary_size x embedding_dims]
```
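
These shapes mean that multiplying $W_1$ by a one-hot vector simply selects one column of $W_1$, i.e. that word's embedding. The sketch below checks this in plain Python with tiny made-up numbers; `toy_W1`, `one_hot` and `matvec` are illustrative names, not part of the notebook's code.

```python
# Tiny made-up sizes so the result can be checked by hand.
toy_vocab_size = 4
toy_embedding_dims = 2

# toy_W1 has shape [embedding_dims x vocabulary_size]: one column per word.
toy_W1 = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
]

def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

word_idx = 2
z1 = matvec(toy_W1, one_hot(word_idx, toy_vocab_size))
column = [row[word_idx] for row in toy_W1]
print(z1, column)
```

Multiplying the resulting vector by a `[vocabulary_size x embedding_dims]` matrix then yields one score per vocabulary word.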


##### 3 mins

In [None]:
# Hyperparameters:
embedding_dims = 5
num_epochs = 100
learning_rate = 0.001

# The two weight matrices:
W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)
W2 = torch.randn(vocabulary_size, embedding_dims, requires_grad=True)


# Training the model

(Refer to Lecture 2 slides 30-31)

In the code below, we are going to compute the log probability of the correct context (target) given the word. 

Before running the code, answer the question commented in the code -> fill `y_true`.

Print the loss and see if the loss goes down.
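
As a reference for what the loss measures, here is a plain-Python sketch of log-softmax followed by negative log-likelihood on made-up scores. The helper names and the toy numbers are assumptions for illustration; the notebook itself uses `F.log_softmax` and `F.nll_loss`.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return [s - log_norm for s in scores]

def nll_loss(log_probs, target_idx):
    """Negative log-likelihood of the target index."""
    return -log_probs[target_idx]

toy_scores = [2.0, 1.0, 0.1]   # made-up scores (z2) for a 3-word vocabulary
toy_target = 0                 # index of the true context word
log_probs = log_softmax(toy_scores)
loss = nll_loss(log_probs, toy_target)
print(loss)
```

The loss is small when the model puts most of its probability mass on the true context word, and equals $\log |V|$ when the scores are uniform.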

###### 10 mins



In [None]:

for epoch in tqdm(range(num_epochs)):
  
    loss_val = 0
    
    for data, target in idx_pairs:
      
        x = look_up_table(data) # x is a one-hot tensor

        # Q. what would y_true be? 
        y_true = torch.Tensor([target]).long()

        # A. [index] of the target word
        

        # 
        z1 = torch.matmul(W1, x) 
        # Q. what is z1? 
        
        z2 = torch.matmul(W2, z1)
        # Q. what is the above operation? 
    
        # Let us obtain prediction over the vocabulary
        log_softmax = F.log_softmax(z2, dim=0)
        
        
        # Our loss is a negative log-likelihood loss 
        # (what does this mean?)
        
        loss = F.nll_loss(log_softmax.view(1,-1), y_true)
        
        loss_val += loss.item()
        
        # propagate the error
        loss.backward()
        
        # gradient descent
        W1.data -= learning_rate * W1.grad.data
        W2.data -= learning_rate * W2.grad.data

        # zero out gradient accumulation
        W1.grad.data.zero_()
        W2.grad.data.zero_()

print(f'\nFinal epoch loss: {loss_val/len(idx_pairs)}')        

Q. Given that we are interested in distributed representations, what is the major bottleneck in our setup? Is it the dimensionality of the representations? Is it the learning rate? Is it the corpus?  
A. It is both the dimensionality and the corpus. We have only a few words and contexts, so it is difficult to capture distributional information. As we add words or expand the corpus, we would also have to increase the number of dimensions. 

Q. What hyperparameters would you tune to improve the representations?  
A. Decreasing the learning rate (this will be discussed more in further lectures)

Q. Train the algorithm with a bigger corpus. 

(You can either copy and paste the corpus and bring it to the same format as the corpus above or use the hint below)

###### 10 mins

In [None]:
# Example code for getting corpora from the internet
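import urllib.request  # `import urllib` alone does not guarantee that the `request` submodule is loaded
url = 'https://raw.githubusercontent.com/luonglearnstocode/Seinfeld-text-corpus/master/corpus.txt'
# urlopen yields bytes; decoding as UTF-8 is an assumption about this file's encoding
txt = [line.decode('utf-8', errors='ignore').strip() for line in urllib.request.urlopen(url).readlines()]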


# Using word embeddings

One of the simplest ways of exploiting word representations is to find similar words. There are many ways of measuring the semantic similarity between two words. As we are using word representations which are vectors in Euclidean space, distance metrics defined in Euclidean space are the most popular choice. This is because words that share common contexts in the corpus are located in close proximity to one another in that space. One such metric is the Euclidean distance.

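Q. What is the Euclidean distance between 'the' and 'a' (in the sample corpus and the new corpus)?  
A. Between their one-hot vectors it is always $\sqrt{2} \approx 1.414$, as for any two distinct words. Between their learned embeddings it depends on the random initialisation and on training (see the code below).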

Q. What other distance metrics can we use for two vectors?  
A. Cosine distance is the most frequently used distance metric for high-dimensional space. Many other distance metrics exist and high-level overview can be found in this [blog post](https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d) 
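
The two metrics can be compared side by side with a plain-Python sketch. The function names are chosen here for illustration; the notebook imports `euclidean` and `cosine` from `scipy.spatial.distance` for the same purpose.

```python
import math

def euclidean_dist(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_dist(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

one_hot_a = [1.0, 0.0, 0.0]
one_hot_b = [0.0, 1.0, 0.0]
print(euclidean_dist(one_hot_a, one_hot_b))      # two distinct one-hot vectors
print(cosine_dist(one_hot_a, one_hot_b))         # orthogonal vectors
print(cosine_dist([1.0, 2.0], [2.0, 4.0]))       # same direction, different length
```

Cosine distance ignores vector length and only compares direction, which is why two vectors pointing the same way are at (approximately) zero cosine distance even though their Euclidean distance is not zero.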


###### 10 mins


In [None]:
# Let us get two vectors from the trained model

x = torch.Tensor(look_up_table(0))
x_emb = torch.matmul(W1, x).detach().numpy()
y = torch.Tensor(look_up_table(1))
y_emb = torch.matmul(W1, y).detach().numpy()

# let us print the euclidean distance
print(euclidean(x_emb, y_emb))

In [None]:
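# One-hot vectors of two distinct words are always sqrt(2) ≈ 1.414 apart, regardless of training
vector_the = look_up_table(word2idx["the"])
vector_a = look_up_table(word2idx["a"])
print(euclidean(vector_the, vector_a))
# Distance between the learned embeddings (columns of W1) instead
print(euclidean(torch.matmul(W1, vector_the).detach().numpy(), torch.matmul(W1, vector_a).detach().numpy()))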

# ADVANCED: Training with negative sampling 
 
 
 

Refer to skipgram models in the slides. 

Q. What happens when we have a very large vocabulary?  
A. Computationally challenging because of the normalization factor.
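
To make the bottleneck concrete, write $v_w$ for the embedding of the focus word $w$ and $u_c$ for the output vector of a context word $c$ (symbols introduced here for illustration). The full-softmax probability is

$$p(c \mid w) = \frac{\exp(u_c^\top v_w)}{\sum_{c' \in V} \exp(u_{c'}^\top v_w)}$$

so every update has to sum over the whole vocabulary $V$. Negative sampling replaces this with a binary objective over the observed pair and $k$ sampled noise words $n_1, \dots, n_k$:

$$\mathcal{L} = -\log \sigma(u_c^\top v_w) - \sum_{i=1}^{k} \log \sigma(-u_{n_i}^\top v_w)$$

With $k = 1$, the two terms correspond to `pos_loss` and `neg_loss` in the training loop of this section.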

Q. What is a negative sample? 


Below is the code for training the model with negative sampling. 


##### 10 mins 


In [None]:
# The two weight matrices:
W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)
W2 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)

for epoch in range(num_epochs):
    epoch_loss = 0
    for data, target in idx_pairs:
        x_var = Variable(look_up_table(data)).float() 
        
        y_pos = Variable(torch.from_numpy(np.array([target])).long())
        y_pos_var = Variable(look_up_table(target)).float()
        
        neg_sample = np.random.choice(list(range(vocabulary_size)),size=(1))[0]
        y_neg = Variable(torch.from_numpy(np.array([neg_sample])))
        y_neg_var = Variable(look_up_table(neg_sample)).float()
         
        x_emb = torch.matmul(W1, x_var) 
        y_pos_emb = torch.matmul(W2, y_pos_var)
        y_neg_emb = torch.matmul(W2, y_neg_var)
        
        # get positive sample score
        pos_loss = F.logsigmoid(torch.matmul(x_emb, y_pos_emb))
        
        # get negsample score
        neg_loss = F.logsigmoid(-1 * torch.matmul(x_emb, y_neg_emb))
        
        loss = - (pos_loss + neg_loss)
        epoch_loss += loss.item()
        
        # propagate the error
        loss.backward()
        
        # gradient descent
        W1.data -= learning_rate * W1.grad.data
        W2.data -= learning_rate * W2.grad.data

        # zero out gradient accumulation
        W1.grad.data.zero_()
        W2.grad.data.zero_()
        
    if epoch % 10 == 0:    
        print(f'Loss at epoch {epoch}: {epoch_loss/len(idx_pairs)}')

* In the current setup, we are only exploiting a very small sample of negative examples. This is suboptimal. 

* Given a sufficiently large vocabulary, we would ideally sample the negative samples from a noise distribution whose probabilities match the frequency of vocabulary.
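
A frequency-based noise distribution could be sketched as follows. The exponent 0.75 is the smoothing used in the word2vec paper; the function names, the toy corpus and the fixed seed are assumptions made for this example.

```python
import random
from collections import Counter

def build_noise_distribution(tokenized_sentences, power=0.75):
    """Return (words, probabilities) with p(w) proportional to count(w) ** power."""
    counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    words = sorted(counts)
    weights = [counts[w] ** power for w in words]
    total = sum(weights)
    return words, [w / total for w in weights]

def sample_negatives(words, probabilities, k, rng):
    """Draw k noise words (with replacement) from the noise distribution."""
    return rng.choices(words, weights=probabilities, k=k)

toy_corpus = [["he", "is", "a", "king"], ["she", "is", "a", "queen"]]
words, probs = build_noise_distribution(toy_corpus)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
negatives = sample_negatives(words, probs, k=5, rng=rng)
print(negatives)
```

Note that this sketch, like the loop above, does not check whether a sampled noise word happens to be the true context word; on a large vocabulary such collisions are rare.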


Q. Using this code as a basis, build an object-oriented negative-sampling model and train it on a fairly large corpus. 



In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

# Pre-trained representations

We have seen above that word embeddings are learned in an unsupervised manner, i.e., we don't have any labelled data. These representations can be used to 'bootstrap' models in NLP. There are many algorithms for inducing word representations: [word2vec](https://arxiv.org/abs/1301.3781), [GloVe](https://nlp.stanford.edu/pubs/glove.pdf) and [Fasttext](https://arxiv.org/abs/1607.04606) are some of the popular choices. The algorithms differ, but they are all based on the distributional hypothesis. 

We will now use one of these pre-trained representations: GloVe. 

Q. What is the dimensionality of the representations below? 

##### 2 mins

In [None]:
w2i = [] # word2index
i2w = [] # index2word
wvecs = [] # word vectors

# this is a large file, it will take a while to load in the memory!
with codecs.open('glove.6B.50d.txt', 'r','utf-8') as f: 
  index = 0
  for line in tqdm(f.readlines()):
    # Ignore the first line - first line typically contains vocab, dimensionality
    if len(line.strip().split()) > 3:
      
      (word, vec) = (line.strip().split()[0], 
                     list(map(float,line.strip().split()[1:]))) 
      
      wvecs.append(vec)
      w2i.append((word, index))
      i2w.append((index, word))
      index += 1

w2i = dict(w2i)
i2w = dict(i2w)
wvecs = np.array(wvecs)

For the following experiments, we recommend  using `wvecs` - the pretrained representations. 

# Evaluating word representation models

## Intrinsic Evaluation

* Intrinsic evaluation of word representations involves evaluating a set of word vectors generated by an embedding technique on specific subtasks that are in some way directly related to the distributional hypothesis. These subtasks are typically simple and fast to compute, and thereby help us understand representation learning algorithms.

* An intrinsic evaluation should typically return to us a scalar quantity that measures the performance of those word vectors on the evaluation subtask.



## Word Similarity

The first task we consider is evaluating whether the representations are good at judging if two words are similar. In this task, you will use both Euclidean distance and cosine distance as (dis)similarity measures. 

* Print similarity scores for word pairs in https://github.com/iraleviant/eval-multilingual-simlex/blob/master/evaluation/ws-353/wordsim353-english-sim.txt

     (Format of the file: two words and the corresponding human score for the two words)

* Obtain Pearson's correlation between the predicted scores and the human-generated scores. 
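
For the correlation step, a plain-Python sketch of Pearson's $r$ is shown below. The function name is introduced here, and the scores are made-up numbers for illustration only; they are not taken from the WordSim-353 file.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson's r between two equally long sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers purely for illustration; they are not taken from WordSim-353.
toy_human_scores = [9.0, 7.5, 3.0, 1.0]
toy_model_scores = [0.82, 0.64, 0.30, 0.05]
toy_r = pearson_correlation(toy_human_scores, toy_model_scores)
print(toy_r)
```

In practice `scipy.stats.pearsonr` computes the same quantity. When the model score is a distance rather than a similarity, a good model should yield a negative correlation with the human similarity scores.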


##### 15 mins



##  Exploring Analogies

The second task we consider is **completing analogies**. We are given an incomplete analogy of the form: 


* $a : b : : c :~?$


We would then identify the word vector which maximizes the cosine similarity. 
This metric has an intuitive interpretation. Ideally, we want $\phi(b) - \phi(a) = \phi(d) - \phi(c)$ where $\phi(.)$ is the word vector. 
For instance, 

* *london $-$ england = paris $-$ france* .

Thus we identify the vector $\phi(d)$ which maximizes the normalized dot-product between the two word
vectors (i.e. cosine similarity).
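
The offset idea can be traced by hand on a toy example. The 2-D vectors below are hand-made for illustration (they are not GloVe vectors), and `toy_analogy` is a hypothetical helper that picks the word $d$ whose offset $\phi(d) - \phi(c)$ is best aligned with $\phi(b) - \phi(a)$.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

# Hand-made 2-D vectors: first axis ~ "which country", second axis ~ "is a capital".
toy_vectors = {
    "england": [1.0, 0.0], "london": [1.0, 1.0],
    "france": [3.0, 0.0], "paris": [3.0, 1.0],
    "germany": [5.0, 0.0], "berlin": [5.0, 1.0],
}

def toy_analogy(a, b, c, vectors):
    """Return the word d whose offset d - c is most aligned with b - a."""
    offset = subtract(vectors[b], vectors[a])
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_sim(offset, subtract(vectors[w], vectors[c])))

print(toy_analogy("england", "london", "france", toy_vectors))
```

Real embeddings are far noisier than this toy space, which is one reason the query words themselves are excluded from the candidates.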



* You can either use your own method to compute the correct word or use the code below. 

* Use original analogies dataset https://github.com/svn2github/word2vec/blob/master/questions-words.txt 

Q. When does it fail? 

Q. What are the possible reasons for failure?

##### 15mins



In [None]:
def cosine_similarity(u, v):
    # Cosine similarity: dot product divided by the product of the two vector norms
    # (higher means more similar, so this is a similarity, not a distance)
    dot = np.dot(u, v)
    norm_u = np.sqrt(np.sum(u**2))
    norm_v = np.sqrt(np.sum(v**2))
    return dot / (norm_u * norm_v)
  
 
def find_analogy(word_a, word_b, word_c, word_vectors, word2index):
    word_a = word_a.lower()
    word_b = word_b.lower()
    word_c = word_c.lower()
    
    (e_a, e_b, e_c) = (word_vectors[word2index[word_a]], 
                       word_vectors[word2index[word_b]], 
                       word_vectors[word2index[word_c]])
    
    
    max_cosine_sim = -999
    best_word = None
    
    for (w, i) in word2index.items():
        # Skip the query words themselves
        if w in [word_a, word_b, word_c]:
            continue
        cosine_sim = cosine_similarity(e_b - e_a, word_vectors[i] - e_c)
        
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
            
    return best_word
  
# find_analogy('france', 'paris', 'england', wvecs, w2i)

# Advanced: Compositionality 

* Given access to only word representations, how can we build representations for phrases and sentences? 

  (Hint: algebraic operation is one way) 


* Compute the similarity score between two sentences on the STS.input.MSRpar.txt dataset from https://github.com/alvations/stasis/tree/master/STS-data/STS2012-train 

  (Please use 00-readme.txt in the corpus for details on the format)
  
* Measure the Pearson correlation with the human scores in STS.gs.MSRpar.txt
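
One simple algebraic composition is to average the word vectors of a sentence. The sketch below uses hand-made 2-D vectors (not GloVe) and hypothetical helper names; it is one baseline among many, not a prescribed solution.

```python
import math

def average_vectors(tokens, vectors):
    """Mean of the vectors of the known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dims)]

def cosine_sim(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hand-made vectors for illustration; "the" and "a" are deliberately missing.
toy_vectors = {
    "cat": [1.0, 0.0], "dog": [0.9, 0.1],
    "sat": [0.0, 1.0], "slept": [0.1, 0.9],
}
s1 = average_vectors("the cat sat".split(), toy_vectors)
s2 = average_vectors("a dog slept".split(), toy_vectors)
print(cosine_sim(s1, s2))
```

Averaging discards word order ("dog bites man" and "man bites dog" get the same vector) and needs a policy for out-of-vocabulary tokens, which this sketch simply skips.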

Q. What problems did you encounter when computing the scores? 

Q. What are alternative ways of computing the scores? 

Q. Using your composition method, compute representations for the following expressions and also list the top-5 most similar words: 

* New York 
* kick the bucket
* post office

  Does it work? What are the possible reasons? 






# References


* [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf): word2vec reference

* [Elucidating the properties of semantic word representations](http://www.offconvex.org/2016/02/14/word-embeddings-1/): A global perspective

* [Understanding the algebraic notions of semantic word representations](http://www.offconvex.org/2015/12/12/word-embeddings-1/): Why does the word-analogies task work with simple algebraic manipulations?

* [Stemming And Lemmatization](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python)