# HW 3 Word Embeddings, Feedforward Networks, and Semantic Similarity 

In this assignment, we'll be getting familiar with the basics of PyTorch, a popular library for implementing neural networks, and exploring word embeddings through the neural trigram model from Bengio et al. (2003) and using the embeddings we get from that model to compute semantic similarities between potential completions of the stimuli from Roland et al (2011). 

Note that we haven't read Roland et al. 2011 for the class (but you have completed the norming study!), so it might be fun to give the article a read. Unfortunately, we won't be able to replicate their main result (we don't have the reading time data), be we will be able to play around with a rough version of their model and see it's predictions.

Also, you'll note that we're using Google's CoLab for this project - that's because we'll be having you train a small neural network, and the extra compute power (read: a free GPU!) CoLab offers will make sure that hardware doesn't become a bottleneck. 



# 1.  Data prep

We're going to be working with the Brown Corpus for this assignment, which is a popular multi-genre corpus of a smidge over 1 million words, which is sizeable, but not enormous. Luckily, nltk can handle downloading and loading the dataset for us.

In [1]:
import nltk
nltk.download("brown")

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [2]:
brown = nltk.corpus.brown.words()

# This allows us to walk through the corpus word by word
print(brown)

# This behaves just like any other list
print(len(brown))
print(brown[0])
print(brown[:10])

brown = [word.lower() for word in brown]
print(brown[:20])

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
1161192
The
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
['the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of', "atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that']


Now that we have our data, we need to split our data into 3 pieces: train data, which we'll use to fit the parameters of our model; development/validation data, which we'll use to tune hyperparameters (parameters that affect the structure and training of the model on a higher level), and test data, which we'll use **only** as a final evaluation of how well our model performs at the end of the training process.

Following Bengio et al. (2003), we'll keep the first 800,000 words as training data, the next 200,000 as development data, and the rest as test data.

1. ** Fill in the cell below to assign the given variables their appropriate parts of the brown dataset. **

In [3]:
# Divide the data into training, evaluation and testing
train_words = brown[:800001]
dev_words = brown[800001:1000001]
test_words = brown[1000001:]

Now, there's a special bit of pre-processing that we need to do when we work with neural nets: We need to integerize (integer-ify?) every word we work with. Why? Well, mostly because it's what the embeddings layer of our neural net will expect as input - these integers will be indices in to an embeddings matrix that the layer maintains, which will then get us our embedding for that particular word. You can also think of these as *one-hot* vector representations of our words, as you saw in the Goldberg reading. 

Regardless, it's a thing we'll have to do! The easiest way of doing this is to first determine our vocabulary (e.g., all of the words we want to have a unique representation), add in an UNK/OOV entry, and then iterate through them in any arbitrary order, assigning each an integer in sequence. We'll want to build 2 data structures out of this - one mapping words to integers, and the other a list of the vocab words + UNK/OOV in order (which is equivalent to a dictionary mapping integers to words!)

We could, in theory, have you just keep every single word in the training data in the vocabulary. However, we should keep in mind that the more words in our vocabulary, the more word embeddings we'll have to train (and the more words between which we have to decide when predicting - we're still doing language modeling here!). To alleviate these concerns somewhat, one method people use is to only consider words that appear more than k times in the training set to be in the vocabulary, for some k. Everything we don't consider as being in the vocab will be treated as an UNK. 

2. ** Fill in the function build_vocab below to do this! **

*Hint: You'll have to (1) get counts of all of the words in the tokens sequence, (2) construct a list of unique words with count > k plus an "UNK", and (3) build a dictionary mapping each of those words to it's index in that list.*

** GOLDEN PIECE OF ADVICE FOR EVERYTHING BUT THIS ASSIGNMENT ESPECIALLY : Build small test cases before running your code on the (rather large) real dataset. It'll save you time and sanity in the long run, and solidify your understanding of everything that's going on. ** 

In [4]:
UNK = "<UNK>"
def build_indices(tokens, k):
  vocab = []
  word2idx = {}
  count = {}
  for i in tokens:
    if not (i in vocab):
      vocab.append(i)
      count[i] = 1
    else:
      count[i]+=1
    
  for j in list(count):
    if (count[j] <= k):
      del(count[j])
     
  vocab = list(count)
  vocab.append(UNK)
  for n in vocab:
    word2idx[n] = vocab.index(n)
  
  # Fill in your implementation here
  
  return word2idx, vocab

word2idx, vocab = build_indices(train_words, 3)

Now a quick one - lets convert all of our words in our dataset to these indexes! Don't forget about handling UNKs!

In [5]:
#Indexing all the dataset
train_indices = [word2idx[word] if word in vocab else word2idx[UNK] for word in train_words]
dev_indices = [word2idx[word] if word in vocab else word2idx[UNK] for word in dev_words]
test_indices = [word2idx[word] if word in vocab else word2idx[UNK] for word in test_words]

In [6]:
print(int(len(vocab)*1.1))

14507


So far, so good. Now time to revisit an old friend...

## 2. Ngrams 2: Electric Boogaloo

Yep. They're back, and better than ever! You've seen these bad boys before, so doing it all again will be a piece of cake, right?

*Warning: don't forget about the unks when thinking through your implementations - you might avoid a bit of extra work.*

3. ** Get unigram, bigram, and trigram counts from train_indices (yes, the one with the numbers, not the words) **

In [7]:
# Have all of these be dictionaries from tuples (yes, even unigrams) to counts.
# I better not see any nltk.prob.FreqDists running around in here. 

unigrams = {}
bigrams = {}
trigrams = {}

for word in train_indices:
  t = tuple((word,))
  if not(word in unigrams):
    unigrams[word] = train_indices.count(word)
    
trigram_set = [train_indices[i:i+3] for i in range(len(train_indices)-2)]
for word in trigram_set:
  t = tuple(word)
  trigrams[t] =  trigrams.get(t, 0) + 1
      
bigram_set = [train_indices[i:i+2] for i in range(len(train_indices)-1)]
for word in bigram_set:
  t = tuple(word)
  bigrams[t] =  bigrams.get(t, 0) + 1


4. ** Fill in the interpolated_trigram function below to return a log-probability corresponding to the trigram passed in **

*Note: this function has a bit of a weird structure - there's a function inside a function, with the outer function returning the inner function. This allows us to explicitly announce what the parameters (and hyperparameters) of the model are (the arguments to the outer function), while still having our final function only take in the trigram to be predicted. You can use arguments from both the outer and inner function within the inner function - don't worry too much about why or how this works - closures are cool, but that's CS, and believe it or not this is a CogSci class. *

In [8]:
import numpy as np

def get_interpolated_trigram(unigrams, bigrams, trigrams, weights):
  # {uni, bi, tri}grams are the counts of the relevant ngram
  # weights is a 3-tuple containing the weight of the unigram model, 
  #   bigram model, and trigram model respectively, all in raw probabilities
  #   guaranteed to sum to 1
  
  # Your implementation can be here too
  
  def interpolated_trigram(trigram):
    # trigram is a 3-tuple containing the 3 words of the trigram in order
    if (trigrams.get(trigram,0)==0 and bigrams.get((trigram[0],trigram[1]),0) ==0):
      t3 =np.log(0)
    else:
      t3 = np.log(weights[2]) + np.log(trigrams.get(trigram,0)) - np.log(bigrams.get((trigram[0],trigram[1]),0))
    t2 = np.log(weights[1]) + np.log(bigrams.get((trigram[1],trigram[2]),0)) - np.log(unigrams.get(trigram[1],0))
    t1= np.log(weights[0]) + np.log(unigrams.get(trigram[2],0)) - np.log(len(unigrams))
    prob = np.logaddexp(t1, t2)
    prob2 = np.logaddexp(prob,t3)
    log_prob = prob2
      
    # Your implementation here
    
    return log_prob
  
  return interpolated_trigram

interp_trigram = get_interpolated_trigram(unigrams, bigrams, 
                                          trigrams, (0.1, 0.3, 0.6))

interp_trigram((1, 2, 3))

-2.2781149824005364

5. ** Fill in the perplexity function to compute the perplexity of your model over some given sequence of indices **

In [9]:
def perplexity(model, indices):
  perp = None
  
  perplexity = 0 #np.log(1)
  for i in range(len(indices)-2):
    prob = model((indices[i],indices[i+1],indices[i+2]))
    perplexity = perplexity + (1-prob)
  N = len(indices)-2
  perp = 1/N * perplexity
  
  return perp
  
  
perplexity(interp_trigram, dev_indices)

  
  app.launch_new_instance()


4.9140115663064945

This provides us a baseline comparison for any neural ngram replacement we build - this is how good any new model we build must be to be seen as an improvement!

Well, since this is on dev data, this is the number that we'd compare to if we want to change any hyperparameters within this model - here, the weights on the smaller MLE models we're interpolating over! I've given you 0.1, 0.3, and 0.6 for uni-, bi-, and trigrams respectively, but those are just arbitrary numbers! We can try out the model with various weights on the development data, and use that to pick the best options (we can even try and write algorithms that will fit these parameters for us - this is called hyperparameter search!). Once we've picked our favorite model, with both the parameters and hyperparameters locked-in, we can run that model on the test data and get a final number to compare to.

We won't make you do that here, but you're welcome to play around with these ideas if you'd like.

# 3. Training a "Neural Probabilistic Language Model" (a neural ngram model!)

Now we get to the cool stuff

Before we start, go to Edit -> Notebook Settings -> Hardware Accelerator and choose GPU. This will allow you to leverage the power of GPU matrix multiplication powers to let your code run in a reasonable amount of time. 

That aside, let's get on to implementing the model that we've seen in class - a model that has 3 parts - a word embedding layer, a single hidden layer, and an output layer with softmax. We take the 2 context words, get vector embeddings for each of them, concatenate them, feed them through the single, fully connected layer with a tanh nonlinearity, then feed them into a layer that changes the dimension of it's output to match the size of the vocabulary. We then softmax the layer so we have a probability distribution over the vocab, and we're done!

Now, having internalized that description, look at the implementation of the model in pytorch below, and compare the description (and what you've learned in class ,etc) to the pytorch code - not too much different, right?

6. **Fill in the 2 blanks in the code below**

No tricks, this should be fairly simple.

In [10]:
import torch
from torch import nn

# set random seed so that this code is deterministic
torch.manual_seed(360)

if torch.cuda.is_available():
  torch.cuda.manual_seed_all(360)

In [11]:

class NeuralTrigram(nn.Module):
  # Note that in python classes, the first argument of every member function
  # is self - you can pay it no real mind when calling functions, since you 
  # don't need to include it as an argument.
  
  def __init__(self, vocab_size, embed_size, hidden_size):
    # This function is called when we create a new NeuralTrigram
    
    super(NeuralTrigram, self).__init__()
    
    # These represent all of the pieces of the model:
    
    # The embedding layer takes in an integer from 0 to vocab_size - 1 and 
    # outputs a vector of length embed_size
    self.embed = nn.Embedding(vocab_size, embed_size) 
    
    # This linear layer takes a vector of length 2 * embed_size and applies an
    # Affine (linear transform plus the bias) transformation to it, returning 
    # a vector of length hidden_size
    self.linear = nn.Linear(2 * embed_size, hidden_size)
    
    # This layer also applies an affine transformation, but now outputs a vector
    # of length...
    # TODO: what length?
    self.out = nn.Linear(hidden_size, vocab_size)
    
    # And this piece just applies a softmax over the 1st dimension of the input
    # Note that this is 0-indexed: there is a 0th dimension
    self.softmax = nn.LogSoftmax(dim=1)
    
    
  # This is our implementation of a forward pass through the neural net - pytorch
  # will handily make the backward pass code for us!
  def forward(self, input):
    # the input is a torch.Tensor, a pytorch data-type you can think of as a 
    #   fancy nested list. 
    
    # the input is of dimension (batch_size, context_size), where
    #   batch size tells us how many examples to train on in parallel and 
    #   context size tells us how many words are in the context (in this case, 2)
    #   
    batch_size, n_words = input.shape
    
    # pass the input through the embedding layer
    # embeds is now (batch_size, context_pos, embed_size)
    embeds = self.embed(input)
    
    # view reshapes the tensor such that we concatenate the context vectors
    # context is now of shape (batch_size, 2 * embed_size)
    context = embeds.view((batch_size, -1))
    
    # Here we put context through our fully connected layer with a tanh 
    # nonlinearity/activation function
    hidden = torch.tanh(self.linear(context))
    
    # Now we pass it through the out layer, gettings a "score" from the neural 
    # network for each possible next word
    
    # TODO: what should the rhs of this look like? (Hint: compare to the previous
    #       lines)
    scores = self.out(hidden)
    
    # Then toss that into a softmax to make those scores into a probability 
    # distribution
    probs = self.softmax(scores)
    
    # then return that distribution
    return probs

Now let's write some evaluation code - to run our model forward, see how good it did, rinse and repeat for each batch, and then report back to us.

Now, you may see *cuda* littered all over the code. That's because CUDA is the platform that allows us to leverage the GPU to do all of our neural net's computations. if the argument cuda is set to true, we just move our model and data over to the gpu whenever we're doing any computations. Nothing else going on there.

In [12]:
def evaluate(model, indices, batch_size=32, cuda=False):
  model.eval()
  
  # This means that pytorch won't automatically keep track of gradients when we 
  # do computations within this block/indentation level
  with torch.no_grad():
    
    # Move to the model to GPU if we want to use CUDA
    if cuda:
      model = model.cuda()
      
    # This will keep track of the total (summed) loss over all of our batches
    total_loss = 0.0
    
    # This is a really convoluted way to iterate over our indices, skipping over
    # batch_size of them each time
    for i in range(2, len(indices)-batch_size+1, batch_size):
      # thus in this loop, our batch is indices[i:i+batch_size]
      
      
      # This convoluted piece of code builds a tensor of dimension
      # (batch_size, 2). i.e., 1 row for each bigram preceding a word in our batch
      input = torch.tensor([indices[j-2:j] for j in range(i, i + batch_size)], 
                           dtype=torch.int64)
      
      if cuda:
        input = input.cuda()
      
      # Now this is just a tensor of the words in our batch - each row of indices
      # in input corresponds to the preceding bigram for that row in target
      target = torch.tensor(indices[i:i+batch_size], dtype=torch.int64)
      
      # Clear the gradients from the last iteration
      model.zero_grad()
      
      # run the model forward, getting a probability distribution as output
      output = model.forward(input)
      
      if cuda:
        output = output.cpu()
        
      # Compute the loss, this time the Negative Log Likelihood Loss
      loss = nn.NLLLoss()(output, target)
      
      # And add it to the running total
      total_loss += loss.item()
      
    # Now we just average the loss over all of the batches
    num_batches = (len(indices)-2)//batch_size
    avg_loss = total_loss/num_batches
    
    # and return it
    return avg_loss

But that's the model code, and the evaluation code - now we need training code! This is going to look a little messier, but the concept is simple. 

We have 3 pieces - the model, an optimizer (the algorithm we use to train, like Stochastic Gradient Descent), and a loss function (here, the negative log-likelihood). for each batch of training data, we do a forward pass, compute the loss with respect to the target, and then tell the optimizer to update all of our parameters in the right direction (e..g, in the opposite direction of the gradient!). We then just rinse and repeat.

This is going to look a LOT like the eval code, with a couple extra pieces (like the optimizer, and running over the data multiple times (epochs!)). 

7. ** Use that knowledge to fill in the first blank **

But there's another thing to think about here - we have yet to write code to get our model's perplexity on a dataset - eval just gives the negative log-likelihood, averaged over all of the words in the data... but wait a second! There's a simple connection between the by-word-average of the negative log likelihood and the perplexity... the likelihood is

$ P(w_1, ..., w_n) $

The averaged negative log likelihood is

$ -\frac{\log P(w_1,..., w_n)}{n} $

And perplexity is... uhh.., if we do the math from HW1, ...

$ e ^{-\frac{\log P(w_1,..., w_n)}{n}}$

So...


8. ** Fill in the second blank and get the perplexity reporting up and running**

In [13]:
import time
def train(model, indices, dev_indices, batch_size=32, n_epochs=1, cuda=False):
  
  # Set up the optimizer, here Stochastic Gradient Descent
  optimizer = torch.optim.SGD(model.parameters(), lr = 0.1, weight_decay=0.00001)
  if cuda:
    model = model.cuda()
  
  losses = []
  
  for epoch in range(n_epochs):
    start_time = time.time()
    model.train()
    
    total_loss = 0.0
    
    for i in range(2, len(indices)-batch_size+1, batch_size):
      
      input = torch.tensor([indices[j-2:j] for j in range(i, i + batch_size)], 
                           dtype=torch.int64)
      
      target = torch.tensor(indices[i:i+batch_size], dtype=torch.int64)

      if cuda:
        input = input.cuda()
        
      optimizer.zero_grad()

      # TODO: Do some stuff here to eventually compute the loss
      # It should not be very much stuff!
      log_prob = model(input)
      if cuda:
        log_prob = log_prob.cpu()
      loss = nn.NLLLoss()(log_prob, target)
      
      # Do a backward pass / Compute the gradients!
      loss.backward()
      total_loss += loss.item()
      
      # Update the parameters!
      optimizer.step()
      
    num_batches = (len(indices)-2)//batch_size
    avg_loss = total_loss/num_batches
    losses.append(avg_loss)
    
    # check what our loss on the dev set is
    dev_loss = evaluate(model, dev_indices, cuda=cuda)
    
    # TODO: Get the perplexity
    perp = pow(np.e,avg_loss)
      
    # Print out some info
    print("epoch {} | time {:.2} | train loss {:.5} | dev loss {:.5} | perp {:4.2}".format(
            epoch + 1, time.time() - start_time, avg_loss, dev_loss, perp))
    
  # return a list of the avg loss at each epoch, just in case we want it
  return losses

Now let's train our model! Be aware if you're running this on the full training data, it will take a WHILE to get through a reasonable number of epochs (Bengio et al. claimed it took ~50 for them to reach convergence. I'll make you run 10, because that takes long enough for a homework of this scale). That being said, this should teach you about how long it might take to train a much larger, more sophisticated model!

** Train on small amounts data at first please. Save yourself some time and sanity. Just make a subset of the brown corpus training **

In [25]:
print("----Beginning Training----")
ntrigram = NeuralTrigram(vocab_size=len(vocab), embed_size=30, hidden_size=50)
train(ntrigram, train_indices,dev_indices, n_epochs=50, cuda=True)
#evaluate(ntrigram,train_indices)

----Beginning Training----
epoch 1 | time 7.3e+01 | train loss 6.1819 | dev loss 5.8861 | perp 4.8e+02
epoch 2 | time 7.3e+01 | train loss 5.9043 | dev loss 5.8003 | perp 3.7e+02
epoch 3 | time 7.2e+01 | train loss 5.8164 | dev loss 5.7524 | perp 3.4e+02
epoch 4 | time 7.2e+01 | train loss 5.7569 | dev loss 5.719 | perp 3.2e+02
epoch 5 | time 7.3e+01 | train loss 5.7108 | dev loss 5.6955 | perp 3e+02
epoch 6 | time 7.3e+01 | train loss 5.6728 | dev loss 5.6779 | perp 2.9e+02
epoch 7 | time 7.3e+01 | train loss 5.64 | dev loss 5.6638 | perp 2.8e+02
epoch 8 | time 7.3e+01 | train loss 5.6111 | dev loss 5.6515 | perp 2.7e+02
epoch 9 | time 7.3e+01 | train loss 5.5849 | dev loss 5.6408 | perp 2.7e+02
epoch 10 | time 7.2e+01 | train loss 5.5611 | dev loss 5.631 | perp 2.6e+02
epoch 11 | time 7.2e+01 | train loss 5.5392 | dev loss 5.6229 | perp 2.5e+02
epoch 12 | time 7.3e+01 | train loss 5.5189 | dev loss 5.6157 | perp 2.5e+02
epoch 13 | time 7.3e+01 | train loss 5.4999 | dev loss 5.6095 | 

[6.181886506236083,
 5.904261759125837,
 5.816360317491008,
 5.756903988718295,
 5.710794441308673,
 5.672768687967291,
 5.640049056533032,
 5.611075291257271,
 5.5849338127539685,
 5.56111681362625,
 5.539207898891899,
 5.518899999030509,
 5.49988712497299,
 5.482025837821004,
 5.46526673101531,
 5.449391979003669,
 5.434121776454005,
 5.419529052652203,
 5.405555422293949,
 5.392204494570545,
 5.379444877378149,
 5.367182765847621,
 5.3555122909414665,
 5.344171772604394,
 5.333222377004135,
 5.322698616057359,
 5.312556414552877,
 5.302605006069063,
 5.292649245308878,
 5.282866351141625,
 5.273270430405229,
 5.263783462486571,
 5.254319163197359,
 5.245258676533203,
 5.236577014321494,
 5.228102878015648,
 5.219767691054398,
 5.211478114190104,
 5.203326189182745,
 5.195158592829385,
 5.187110235807519,
 5.179159477456216,
 5.171321171980638,
 5.163643294608356,
 5.156154247564174,
 5.148817668185624,
 5.141676125733765,
 5.134643303021854,
 5.127685457044823,
 5.120748475349552]

9. **Now run both models (ngram and neural ngram) on the test data. Which does better (on perplexity)?**

In [15]:
evaluate(ntrigram,test_indices,cuda=True)


5.666408392321337

In [16]:
perplexity(interp_trigram,test_indices)

  app.launch_new_instance()
  


4.87501882462981

The perplexity of trigram is better than neuralgram, since it is smaller. This may be due to the small number of epoches during training. I haven't tested (as I don't want to wait forever for the training to finish) but perhaps with 20 or 30 epoches instead of 10 the perplexity of the neuralgram may be smaller, and the neural model performs better.

# 4. Embeddings for Semantics

Finally, something psycholinguistic-y! When we have vector representations of words, it's natural to try to see how properties of those words (like "semantic" similarity) map on to properties of the vector space they reside in (like cosine similarity).

So let's explore Cosine Similarity - that is, the cosine of the angle between two vectors. This value will range between -1 (the least similar/the negation of the vector) and 1 (the most similar/the same), which presents us a handy scale. Another benefit of this measure is how easy it is to compute - note that

$ cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}||\vec{b}|} $

So we just need to take a dot product with a bit of normalization and bam, cosine similarity. 

10. **Let's fill in quick implementation of this function**

*Hint: check out np.dot and np.linalg.norm because there's no need to implement those by hand*

In [17]:
def cosine_sim(a, b):
  similarity = 0
  similarity = np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))  
  return similarity

Now let's compare similarities with the embeddings in our model!

In [18]:
# This long sequence of calls gets us a matrix representing the embeddings we trained
embed_matrix = ntrigram.cpu().embed.weight.data.numpy()

a = embed_matrix[word2idx["car"]]
b = embed_matrix[word2idx["dog"]]
c = embed_matrix[word2idx["cat"]]

cosine_sim(a, b), cosine_sim(b, c)

(-0.041289706, 0.1052716)

So far so good - we should find that "dog" and "cat" are more similar than "dog" and "car." That's some semantic learning, right? 

How about we test out an example from Roland?
 11. **See if *spear* is more similar to *rock* or *sword***

In [19]:
spear, sword, rock = "spear", "sword", "rock"
cosine_sim(embed_matrix[word2idx['spear']],embed_matrix[word2idx['rock']])
cosine_sim(embed_matrix[word2idx['spear']],embed_matrix[word2idx['sword']])


0.24888812

-- apparently more similar to rock than to sword

Unfortunately with a lack of reading time data for the setences in Roland, we can't quite replicate their results. We can, however, continue to see whether our model captures some semantic relations we might care about. 

12. **Use the cell below to run comparisons between a couple of words in our vocab, and use the text cell below to describe what these comparisons tell you about the representations learned by our model. Maybe increasing the number of hidden dimensions in our model, or the number of epoches during training might make these representations better? Discuss!**

In [20]:
#cosine_sim(embed_matrix[word2idx['city']],embed_matrix[word2idx['place']])
cosine_sim(embed_matrix[word2idx['city']],embed_matrix[word2idx['atlanta']])
cosine_sim(embed_matrix[word2idx['place']],embed_matrix[word2idx['atlanta']])



0.55838114

In [21]:
cosine_sim(embed_matrix[word2idx['inadequate']],embed_matrix[word2idx['ambiguous']])
cosine_sim(embed_matrix[word2idx['inadequate']],embed_matrix[word2idx['experienced']])



0.20188572

In [22]:
cosine_sim(embed_matrix[word2idx['husband']],embed_matrix[word2idx['son']])
cosine_sim(embed_matrix[word2idx['wife']],embed_matrix[word2idx['daughter']])


-0.117492765

Coming back to the discussion about the number of epoches, I do think that increasing the number will make the model perform better. The result here is not bad, but it is not also very good, for example:

1. It's interesting how "Place" and "Atlanta" is more similar than "City" and "Atlanta" - negative result for City-Atlanta and positive for Place-Atlanta. Atlanta is, of course, a place, but it is also a city,  and since "city" is a sub-category of "place", I personally would associate Atlanta to 'city' more than to 'place'.

2. "Inadequate" is more similar to "experience" than to "ambiguous", which is a bit strange if we think of the first and the third adjectives as having a negative connotation while the second as a positive adj. 
 
3. The last one is interesting. "Husband" to "son" is a positive value (0.03) while "wife" to "daughter" is a  negative value (-0.117). Although the difference between them is not too large, this example shows that this model may not do well in an analogies test.
 
 