# Word Embeddings

Word embeddings, or word vectors, provide a way of mapping words from a vocabulary into a low-dimensional space, where words with similar meanings are close together. Let's play around with a set of pre-trained word vectors, to get used to their properties. There exist many sets of pretrained word embeddings; here, we use ConceptNet Numberbatch, which provides a relatively small download in an easy-to-work-with format (h5).

In [6]:
from urllib.request import urlretrieve
import os
os.makedirs('datasets', exist_ok=True)  # make sure the target directory exists
if not os.path.isfile('datasets/mini.h5'):
    print("Downloading ConceptNet Numberbatch word embeddings...")
    conceptnet_url = 'http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/17.06/mini.h5'
    urlretrieve(conceptnet_url, 'datasets/mini.h5')

Below, we use the h5py package to open the mini.h5 file we just downloaded. We extract from the file a list of UTF-8-encoded words, as well as their 300-d vectors.

In [7]:
import h5py
with h5py.File('datasets/mini.h5', 'r') as F:
    all_words = [word.decode('utf-8') for word in F['mat']['axis1'][:]]
    all_embeddings = F['mat']['block0_values'][:]
print("all_words dimension : {}".format(len(all_words)))
print("all_embeddings dimensions : {}".format(all_embeddings.shape))
print("Random example words : {}".format(all_words[184576]))

all_words dimension : 362891
all_embeddings dimensions : (362891, 300)
Random example words : /c/fr/abstrait


We are interested only in the English words. We use Python list comprehensions to pull out the indices of the English words, then extract just the English words (stripping the six-character /c/en/ prefix) and their embeddings.

In [8]:
english_words= [word[6:] for word in all_words if word.startswith('/c/en/')]
english_words_indices = [i for i, word in enumerate(all_words) if word.startswith('/c/en/')]
english_embeddings = all_embeddings[english_words_indices]
print("Number of english words in all_words : {0}".format(len(english_words)))
print("Embeddings of english words in all words : {0}".format(english_embeddings.shape))
print("Random example english words : {}".format(english_words[25884]))

Number of english words in all_words : 150875
Embeddings of english words in all words : (150875, 300)
Random example english words : coaching


The magnitude of a word vector is less important than its direction; the magnitude can be thought of as roughly reflecting frequency of use, independent of the semantics of the word. Here, we are interested in semantics, so we normalize our vectors, dividing each by its length. The result is that all of our word vectors have length 1 and, as such, lie on the unit hypersphere. For unit vectors, the dot product equals the cosine of the angle between them, which provides a measure of similarity (the bigger the cosine, the smaller the angle).
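
As a quick sanity check of the normalization idea, here is a minimal sketch on made-up 3-d vectors (the values are illustrative assumptions, not taken from Numberbatch). It shows that after dividing by the length, two vectors pointing the same way score 1 and orthogonal vectors score 0, regardless of their original magnitudes.

```python
import numpy as np

# Toy 3-d "embeddings" (made-up values, not from Numberbatch).
toy_vectors = np.array([
    [3.0, 4.0, 0.0],
    [6.0, 8.0, 0.0],   # same direction as the first row, twice the length
    [0.0, 0.0, 5.0],   # orthogonal to the first two rows
])

# Divide each row by its Euclidean length so that every row has norm 1.
toy_unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)

# For unit vectors, the matrix of dot products is the matrix of cosines.
toy_cosines = toy_unit @ toy_unit.T
print(np.round(toy_cosines, 3))
```

The first two rows differ only in magnitude, so their cosine is 1; the magnitude information is exactly what normalization discards.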

In [9]:
import numpy as np

norms = np.linalg.norm(english_embeddings, axis=1)
normalized_embeddings = english_embeddings.astype('float32') / norms.astype('float32').reshape([-1, 1])

We want to look up words easily, so we create a dictionary that maps us from a word to its index in the word embeddings matrix.

In [10]:
index= {word : i for i, word in enumerate(english_words)}

Now we are ready to measure the similarity between pairs of words. We use numpy to take dot products.

In [11]:
def similarity_score(w1, w2):
    score = np.dot(normalized_embeddings[index[w1], :], normalized_embeddings[index[w2], :])
    return score

# Related words get relatively high scores:
print('cat\trat\t', similarity_score('cat', 'rat'))
print('cat\tmouse\t', similarity_score('cat', 'mouse'))

# Unrelated words, not so much:
print('cat\tman\t', similarity_score('cat', 'man'))
print('cat\tmoo\t', similarity_score('cat', 'moo'))
print('cat\tfreeze\t', similarity_score('cat', 'freeze'))

# Related concepts score well, and a word is as similar
# to itself as possible (a score of about 1):
print('antonym\topposite\t', similarity_score('antonym', 'opposite'))
print('antonym\tantonym\t', similarity_score('antonym', 'antonym'))

cat	rat	 0.3766275
cat	mouse	 0.31482112
cat	man	 -0.004878208
cat	moo	 0.0039538294
cat	freeze	 -0.030225191
antonym	opposite	 0.39410648
antonym	antonym	 0.9999999


In [12]:
def closest_to_vector(v, n):
    all_scores = np.dot(normalized_embeddings, v)
    best_words= list(map(lambda i: english_words[i], reversed(np.argsort(all_scores))))
    return best_words[:n]
def most_similar(w, n):
    return closest_to_vector(normalized_embeddings[index[w], :], n)

In [13]:
print(most_similar('python', 10))
print(most_similar('friend', 10))
print(most_similar('spiderman', 10))

['python', 'reticulated_python', 'pythons', 'pythonesque', 'scripting_language', 'ecmascript', 'php', 'objective_c', 'boa_constrictor', 'egyptian_cobra']
['friend', 'schoolfriend', 'buddy', 'pal', 'colleague', 'friends', 'friended', 'best_friend', 'girlfriend', 'boyfriend']
['spiderman', 'superhero', 'superman', 'ghost_rider', 'supermans', 'spider_man', 'captain_america', 'batman', 'superheroes', 'superheroine']


We can also use closest_to_vector to find words "nearby" vectors that we create ourselves. This allows us to solve analogies. For example, in order to solve the analogy "man : brother :: woman : ?", we can compute a new vector brother - man + woman: the meaning of brother, minus the meaning of man, plus the meaning of woman. We can then ask which words are closest, in the embedding space, to that new vector.

In [14]:
def solve_analogy(a1, b1, a2):
    b2 = normalized_embeddings[index[b1], :] - normalized_embeddings[index[a1], :] + normalized_embeddings[index[a2], :]
    return closest_to_vector(b2, 1)

print(solve_analogy("man", "brother", "woman"))
print(solve_analogy("man", "husband", "woman"))
print(solve_analogy("juventus", "barcelona", "madrid"))

['sister']
['wife']
['barcelona']
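
Note that the last analogy returned one of its own query words. A common variant of this technique removes the three query words from the candidate list before picking the nearest neighbour. The sketch below illustrates that variant on a hand-made toy vocabulary; the vectors, the `toy_` names, and the `exclude_inputs` flag are assumptions for illustration, and it makes no claim about what the Numberbatch vectors would return.

```python
import numpy as np

# Hypothetical hand-made embeddings with axes [male, female, sibling, animal].
toy_words = ["man", "woman", "brother", "sister", "cat"]
toy_vectors = np.array([
    [1.0, 0.0, 0.0, 0.0],   # man
    [0.0, 1.0, 0.0, 0.0],   # woman
    [1.0, 0.0, 1.0, 0.0],   # brother
    [0.0, 1.0, 1.0, 0.0],   # sister
    [0.0, 0.0, 0.0, 1.0],   # cat
])
toy_vectors = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
toy_index = {w: i for i, w in enumerate(toy_words)}

def toy_solve_analogy(a1, b1, a2, n=1, exclude_inputs=True):
    # Same arithmetic as solve_analogy: b1 - a1 + a2.
    target = (toy_vectors[toy_index[b1]]
              - toy_vectors[toy_index[a1]]
              + toy_vectors[toy_index[a2]])
    scores = toy_vectors @ target
    banned = {a1, b1, a2} if exclude_inputs else set()
    # Rank all words by descending score, skipping any banned query words.
    ranked = [toy_words[i] for i in np.argsort(-scores)
              if toy_words[i] not in banned]
    return ranked[:n]

print(toy_solve_analogy("man", "brother", "woman", n=2, exclude_inputs=False))
print(toy_solve_analogy("man", "brother", "woman", n=2))
```

Without the exclusion, the query word "woman" appears among the top two results; with it, only words outside the query are returned.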


# Using word embeddings in deep models

Word embeddings are fun to play around with, but their primary use is that they allow us to think of words as existing in a continuous, Euclidean space; we can then use an existing arsenal of techniques for machine learning with continuous numerical data (like logistic regression or neural networks) to process text. Let's take a look at an especially simple version of this. We'll perform sentiment analysis on a set of movie reviews: in particular, we will attempt to classify a movie review as positive or negative based on its text.

In [15]:
import string
remove_punct=str.maketrans('','',string.punctuation)

# This function converts a line of our data file into
# a tuple (x, y), where x is a 300-dimensional representation
# of the words in a review, and y is its label.
def convert_line_to_example(line):
    # Pull out the first character: that's our label (0 or 1)
    y = int(line[0])
    
    # Split the line into words using Python's split() function
    words = line[2:].translate(remove_punct).lower().split()
    
    # Look up the embeddings of each word, ignoring words not
    # in our pretrained vocabulary.
    embeddings = [normalized_embeddings[index[w]] for w in words
                  if w in index]
    
    # Take the mean of the embeddings
    x = np.mean(np.vstack(embeddings), axis=0)
    return x, y

# Apply the function to each line in the file.
xs = []
ys = []
with open("C:/Users/Itika/Desktop/movie-simple.txt", "r", encoding='utf-8', errors='ignore') as f:
    for l in f.readlines():
        x, y = convert_line_to_example(l)
        xs.append(x)
        ys.append(y)

# Concatenate all examples into a numpy array
xs = np.vstack(xs)
ys = np.vstack(ys)

In [16]:
print("Shape of inputs: {}".format(xs.shape))
print("Shape of labels: {}".format(ys.shape))

num_examples = xs.shape[0]

Shape of inputs: (1411, 300)
Shape of labels: (1411, 1)


In [17]:
print("First 20 labels before shuffling: {0}".format(ys[:20, 0]))

shuffle_idx = np.random.permutation(num_examples)
xs = xs[shuffle_idx, :]
ys = ys[shuffle_idx, :]

print("First 20 labels after shuffling: {0}".format(ys[:20, 0]))

First 20 labels before shuffling: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
First 20 labels after shuffling: [0 0 1 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1 1 0]


In [18]:
import torch
num_train = 4*num_examples //5
x_train = torch.tensor(xs[:num_train])
y_train = torch.tensor(ys[:num_train], dtype=torch.float32)

x_test = torch.tensor(xs[num_train:])
y_test = torch.tensor(ys[num_train:], dtype=torch.float32)

We could format each batch individually as we feed it into the model, but to make it easier on ourselves, let's create a TensorDataset and DataLoader as we've used in the past for MNIST.

In [19]:
reviews_train = torch.utils.data.TensorDataset(x_train, y_train)
reviews_test = torch.utils.data.TensorDataset(x_test, y_test)

train_loader = torch.utils.data.DataLoader(reviews_train, batch_size=100, shuffle=True)
test_loader = torch.utils.data.DataLoader(reviews_test, batch_size=100, shuffle=False)
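
To see what a DataLoader hands back on each iteration, here is a small sketch on random stand-in data (the 250 examples and the `toy_` names are assumptions, not the review dataset). It only prints the shapes of the minibatches.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Made-up data: 250 examples with 300 features each and a 0/1 label.
toy_x = torch.randn(250, 300)
toy_y = torch.randint(0, 2, (250, 1)).float()

toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=100, shuffle=True)

# Collect (inputs shape, labels shape) for every minibatch in one epoch.
batch_shapes = [(tuple(inputs.shape), tuple(labels.shape))
                for inputs, labels in toy_loader]
for shapes in batch_shapes:
    print(shapes)
```

The last minibatch is smaller because 250 is not a multiple of the batch size; this is why the training loop counts examples with `len(inputs)` rather than assuming 100 per batch.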

In [20]:
import torch.nn as nn
import torch.nn.functional as F

First we build the model, organized as an nn.Module. We could make the number of outputs for our MLP the number of classes for this dataset (i.e. 2). However, since we only have two output classes here ("positive" vs. "negative"), we can instead produce a single output value, calling everything greater than $0$ "positive" and everything less than $0$ "negative". If we pass this output through a sigmoid operation, then values are mapped to $(0, 1)$, with $0.5$ being the classification threshold.

In [21]:
class SWEM(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(300, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return x
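
The equivalence between thresholding the raw output at 0 and thresholding the sigmoid at 0.5 can be checked directly. This is a minimal sketch on a few hand-picked values (the `toy_` names are illustrative).

```python
import torch

toy_logits = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
toy_probs = torch.sigmoid(toy_logits)

# Classify once on the raw outputs and once on the sigmoid outputs.
positive_by_logit = toy_logits > 0
positive_by_prob = toy_probs > 0.5

print(toy_probs)
print(positive_by_logit)
print(positive_by_prob)
```

Because the sigmoid is monotonically increasing and maps 0 to 0.5, both rules select the same examples, which is why the model itself can return the raw value and leave the sigmoid to the loss or to the prediction step.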

To train the model, we first instantiate it, along with a loss function and an optimizer. Notice that since we are only doing binary classification, we use the binary cross-entropy (BCE) loss instead of the cross-entropy loss we've seen before. We use the "with logits" version for numerical stability: it takes the raw model output and applies the sigmoid internally.
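
As a rough illustration of what the "with logits" version computes, the sketch below compares it with applying a sigmoid followed by the plain BCE loss on a few made-up values. For moderate logits like these, the two should agree up to floating-point error.

```python
import torch
import torch.nn as nn

# Made-up raw model outputs (logits) and 0/1 targets.
toy_logits = torch.tensor([[2.0], [-1.0], [0.5]])
toy_labels = torch.tensor([[1.0], [0.0], [0.0]])

# One fused step: sigmoid and BCE computed together.
loss_with_logits = nn.BCEWithLogitsLoss()(toy_logits, toy_labels)

# Two separate steps: explicit sigmoid, then plain BCE on probabilities.
loss_two_step = nn.BCELoss()(torch.sigmoid(toy_logits), toy_labels)

print(loss_with_logits.item(), loss_two_step.item())
```

The fused version is generally preferred because it avoids taking the log of a probability that has already been rounded to 0 or 1.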

In [22]:
## Training
# Instantiate model
model = SWEM()

# Binary cross-entropy (BCE) Loss and Adam Optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Iterate through train set minibatches
for epoch in range(250):
    correct = 0
    num_examples = 0
    for inputs, labels in train_loader:
        # Zero out the gradients
        optimizer.zero_grad()
        
        # Forward pass
        y = model(inputs)
        loss = criterion(y, labels)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        predictions = torch.round(torch.sigmoid(y))
        correct += torch.sum((predictions == labels).float())
        num_examples += len(inputs)
    
    # Print training progress
    if epoch % 25 == 0:
        acc = correct/num_examples
        print("Epoch: {0} \t Train Loss: {1} \t Train Acc: {2}".format(epoch, loss, acc))

## Testing
correct = 0
num_test = 0

with torch.no_grad():
    # Iterate through test set minibatches
    for inputs, labels in test_loader:
        # Forward pass
        y = model(inputs)
        
        predictions = torch.round(torch.sigmoid(y))
        correct += torch.sum((predictions == labels).float())
        num_test += len(inputs)
    
print('Test accuracy: {}'.format(correct/num_test))

Epoch: 0 	 Train Loss: 0.6818224787712097 	 Train Acc: 0.5620567202568054
Epoch: 25 	 Train Loss: 0.18264468014240265 	 Train Acc: 0.9539006948471069
Epoch: 50 	 Train Loss: 0.07849441468715668 	 Train Acc: 0.972517728805542
Epoch: 75 	 Train Loss: 0.08244629204273224 	 Train Acc: 0.9760638475418091
Epoch: 100 	 Train Loss: 0.0745677500963211 	 Train Acc: 0.9822695255279541
Epoch: 125 	 Train Loss: 0.02149335853755474 	 Train Acc: 0.9831560254096985
Epoch: 150 	 Train Loss: 0.08886618167161942 	 Train Acc: 0.9858155846595764
Epoch: 175 	 Train Loss: 0.05673982948064804 	 Train Acc: 0.9893617033958435
Epoch: 200 	 Train Loss: 0.018002599477767944 	 Train Acc: 0.9955673813819885
Epoch: 225 	 Train Loss: 0.015156297944486141 	 Train Acc: 0.9982269406318665
Test accuracy: 0.9575971961021423


We can now examine what our model has learned, seeing how it responds to word vectors for different words:



In [23]:
words_to_test= ["boring", "encouraging", "entertaining"]
for word in words_to_test:
    x = torch.tensor(normalized_embeddings[index[word]].reshape(1,300))
    print("Sentiment of word {0} : {1}".format(word, torch.sigmoid(model(x))))

Sentiment of word boring : tensor([[2.2327e-18]], grad_fn=<SigmoidBackward>)
Sentiment of word encouraging : tensor([[4.3608e-08]], grad_fn=<SigmoidBackward>)
Sentiment of word entertaining : tensor([[1.0000]], grad_fn=<SigmoidBackward>)


# Learning word embeddings

How do we learn word embeddings? To do so, we need to make them a part of our model, rather than a part of loading the data. In PyTorch, the preferred way to do so is with nn.Embedding. Like the other nn layers we've seen (e.g. nn.Linear), nn.Embedding must be instantiated first. The two required arguments for instantiation are the number of embeddings (i.e. the vocabulary size $V$) and the dimension of the word embeddings (300, in our previous example).

In [24]:
vocab_size = 500
embed_dimension= 10
embedding = nn.Embedding(vocab_size , embed_dimension)
print(embedding)

Embedding(500, 10)


In [25]:
embedding.weight.size()

torch.Size([500, 10])

Notice that this matrix is basically a 10-dimensional word embedding for each of the 500 words, stacked on top of each other. Looking up a word embedding in this embedding matrix is simply selecting a specific row of this matrix, corresponding to the word.
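
The row-selection view can be verified with a short sketch. The word indices below are arbitrary assumptions; in practice they would come from a word-to-index dictionary like the one built earlier.

```python
import torch
import torch.nn as nn

toy_embedding = nn.Embedding(500, 10)

# nn.Embedding expects integer (long) indices, one per word.
toy_word_ids = torch.tensor([3, 42, 3])
toy_looked_up = toy_embedding(toy_word_ids)
print(toy_looked_up.shape)

# The lookup returns exactly the corresponding rows of the weight matrix.
same_as_rows = torch.equal(toy_looked_up, toy_embedding.weight[toy_word_ids])
print(same_as_rows)
```

Repeated indices return the same row, so the first and third results here are identical.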

When word embeddings are learned, nn.Embedding look-up is often one of the first operations in a model module. For example, if we were to learn the word embeddings for our previous SWEM model, the model might instead look like this:

In [26]:
class SWEMwithEmbeddings(nn.Module):
    def __init__(self, vocab_size, embed_dimension, hidden_dimension, num_output):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size , embed_dimension)
        self.fc1 = nn.Linear(embed_dimension, hidden_dimension)
        self.fc2 = nn.Linear(hidden_dimension, num_output)
    def forward(self, x):
        x = self.embedding(x)
        x = torch.mean(x, dim=0)  # average the word embeddings over the sequence
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return x

Here we've abstracted the size of the various layers of the model as constructor arguments, so we need to specify those hyperparameters at initialization.

In [27]:
model = SWEMwithEmbeddings(
        vocab_size = 4000,
        embed_dimension = 500,
        hidden_dimension = 48,
        num_output = 4
)
print(model)

SWEMwithEmbeddings(
  (embedding): Embedding(4000, 500)
  (fc1): Linear(in_features=500, out_features=48, bias=True)
  (fc2): Linear(in_features=48, out_features=4, bias=True)
)


# Recurrent Neural Networks (RNNs)

The key difference between sequential models and the previous models we've seen is the presence of a "time" dimension: words in a sentence (or paragraph, or document) have an ordering to them that conveys meaning:

[Figure: basic_RNN]

In the example sequence above, the word "Recurrent" is the $t=1$ word, which we denote $w_1$; similarly, "neural" is $w_2$, and so on. As the preceding sections have hopefully impressed upon you, it is often more advantageous to model words as embedding vectors $x_1, \ldots, x_T$, rather than one-hot vectors (which tokens $w_1, \ldots, w_T$ correspond to), so our first step is often to do an embedding table look-up for each input word. Let's assume 300-dimensional word embeddings and, for simplicity, a minibatch of size 1.

In [29]:
mb = 1
x_dim = 300 
sentence = ["recurrent", "neural", "networks", "are", "great"]

xs = []
for word in sentence:
    xs.append(torch.tensor(normalized_embeddings[index[word]]).view(1, x_dim))
    
xs = torch.stack(xs, dim=0)
print("xs shape: {}".format(xs.shape))

xs shape: torch.Size([5, 1, 300])


## RNNs in PyTorch

How would we implement an RNN in PyTorch? There are quite a few ways, but let's build the Elman RNN from scratch first, using the input sequence "recurrent neural networks are great".

In [30]:
import numpy as np
import torch

In an RNN, we project both the input $x_t$ and the previous hidden state $h_{t-1}$ to some hidden dimension, which we're going to choose to be 300. To perform these operations, we define some variables that we're going to learn.

In [39]:
h_dim= 300

# For projecting the input
Wx= torch.randn(x_dim, h_dim)/np.sqrt(x_dim)
Wx.requires_grad_()
bx = torch.randn(h_dim, requires_grad=True)

# For projecting the previous state (hidden-to-hidden, so h_dim x h_dim)
Wh = torch.randn(h_dim, h_dim)/np.sqrt(h_dim)
Wh.requires_grad_()
bh= torch.randn(h_dim, requires_grad=True)

print(Wx.shape, bx.shape, Wh.shape, bh.shape)

torch.Size([300, 300]) torch.Size([300]) torch.Size([300, 300]) torch.Size([300])


For convenience, we define a function for one time step of the RNN. This function takes the current input $x_t$ and previous hidden state $h_{t-1}$, performs the linear transformations $x W_x + b_x$ and $h W_h + b_h$, adds them, and then applies a hyperbolic tangent nonlinearity.

In [40]:
def RNN_step(x, h):
    h_next = torch.tanh((torch.matmul(x, Wx) + bx) + (torch.matmul(h, Wh) + bh))

    return h_next

In [41]:
#Word embedding for the first word
x1= xs[0, :, :]
#initialize hidden state to 0
h0 = torch.zeros([mb, h_dim])

To take one time step of the RNN, we call the function we wrote, passing in $x_1$ and $h_0$.

In [42]:
# Forward pass of one RNN step for time step t=1
h1 = RNN_step(x1, h0)

print("Hidden state h1 dimensions: {0}".format(h1.shape))

Hidden state h1 dimensions: torch.Size([1, 300])


In [47]:
#Word embedding for 2nd word
x2 = xs[1, :, :]
h2 = RNN_step(x2, h1)
print("Hidden state h2 dimension : {0}".format(h2.shape))

Hidden state h2 dimension : torch.Size([1, 300])
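
Rather than writing out each step by hand, the same recurrence can be unrolled over the whole sequence with a loop. The sketch below is self-contained and uses random stand-in inputs and weights (all `toy_` names and the hidden size of 128 are assumptions chosen so the input and hidden dimensions are easy to tell apart).

```python
import torch

torch.manual_seed(0)
seq_len, batch_size, in_dim, hid_dim = 5, 1, 300, 128

# Random stand-ins for the word embeddings and the learned parameters.
toy_xs = torch.randn(seq_len, batch_size, in_dim)
toy_Wx = torch.randn(in_dim, hid_dim) / in_dim ** 0.5     # input-to-hidden
toy_Wh = torch.randn(hid_dim, hid_dim) / hid_dim ** 0.5   # hidden-to-hidden
toy_bx = torch.zeros(hid_dim)
toy_bh = torch.zeros(hid_dim)

def toy_rnn_step(x, h):
    # One Elman RNN step: tanh of the two summed linear projections.
    return torch.tanh(x @ toy_Wx + toy_bx + h @ toy_Wh + toy_bh)

# Start from a zero hidden state and feed the sequence one step at a time.
h = torch.zeros(batch_size, hid_dim)
hidden_states = []
for t in range(seq_len):
    h = toy_rnn_step(toy_xs[t], h)
    hidden_states.append(h)

all_h = torch.stack(hidden_states, dim=0)
print(all_h.shape)
```

Each hidden state depends on all earlier inputs through the previous hidden state, which is what lets the model take word order into account.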


## Using torch.nn

In [48]:
import torch.nn as nn

rnn = nn.RNN(x_dim, h_dim)
print("RNN parameter shapes: {}".format([p.shape for p in rnn.parameters()]))

RNN parameter shapes: [torch.Size([300, 300]), torch.Size([300, 300]), torch.Size([300]), torch.Size([300])]


To perform a forward pass with an RNN, we pass the entire input sequence to the forward() function, which returns the hidden states at every time step (hs) and the final hidden state (h_T).

In [49]:
hs, h_T = rnn(xs)
print("Hidden states shape : {}".format(hs.shape))
print("Final hidden states shape : {}".format(h_T.shape))

Hidden states shape : torch.Size([5, 1, 300])
Final hidden states shape : torch.Size([1, 1, 300])
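
For a single-layer, unidirectional nn.RNN, the final hidden state should simply be the last entry of the per-step hidden states. The sketch below checks this on random input (the `toy_` names and the hidden size of 64 are illustrative assumptions).

```python
import torch
import torch.nn as nn

toy_rnn = nn.RNN(input_size=300, hidden_size=64)
toy_xs = torch.randn(5, 1, 300)   # (sequence length, batch, input size)

with torch.no_grad():
    toy_hs, toy_h_T = toy_rnn(toy_xs)

print(toy_hs.shape, toy_h_T.shape)

# The last time step of toy_hs matches the returned final hidden state.
last_matches = torch.allclose(toy_hs[-1], toy_h_T[0])
print(last_matches)
```

The extra leading dimension of the final hidden state indexes layers and directions, which is why it has size 1 here.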


## Gated RNN

While the RNNs we've just explored can successfully model simple sequential data, they tend to struggle with longer sequences, with vanishing gradients an especially big problem. A number of RNN variants have been proposed over the years to mitigate this issue and have been shown empirically to be more effective. In particular, the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) have seen wide use recently in deep learning. We're not going to go into detail here about what structural differences they have from vanilla RNNs; a fantastic summary can be found here. Note that "RNN" as a name is somewhat overloaded: it can refer either to the basic recurrent model we went over previously or to recurrent models in general (including LSTMs and GRUs).

In [50]:
lstm = nn.LSTM(x_dim, h_dim)
print("LSTM parameters: {}".format([p.shape for p in lstm.parameters()]))

gru = nn.GRU(x_dim, h_dim)
print("GRU parameters: {}".format([p.shape for p in gru.parameters()]))

LSTM parameters: [torch.Size([1200, 300]), torch.Size([1200, 300]), torch.Size([1200]), torch.Size([1200])]
GRU parameters: [torch.Size([900, 300]), torch.Size([900, 300]), torch.Size([900]), torch.Size([900])]
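
The leading dimensions of 1200 and 900 in these shapes are 4 and 3 times the hidden size: PyTorch stacks the weights of an LSTM's four gate-like blocks, and a GRU's three, into single matrices. The sketch below makes that multiple explicit and counts parameters; the helper dictionaries are illustrative names.

```python
import torch.nn as nn

in_dim, hid_dim = 300, 300

gate_blocks = {}
param_counts = {}
for name, layer_cls in [("RNN", nn.RNN), ("LSTM", nn.LSTM), ("GRU", nn.GRU)]:
    layer = layer_cls(in_dim, hid_dim)
    # Rows of the input-to-hidden weight, as a multiple of the hidden size.
    gate_blocks[name] = layer.weight_ih_l0.shape[0] // hid_dim
    param_counts[name] = sum(p.numel() for p in layer.parameters())

print(gate_blocks)
print(param_counts)
```

With the same input and hidden sizes, an LSTM therefore has four times as many parameters as the basic RNN, and a GRU three times as many.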


# Torchtext

Much like PyTorch has Torchvision for computer vision, it also has Torchtext for natural language processing. As with Torchvision, Torchtext has a number of popular NLP benchmark datasets, across a wide range of tasks (e.g. sentiment analysis, language modeling, machine translation). It also has a few pre-trained word embeddings available, including the popular Global Vectors for Word Representation (GloVe). If you need to load your own dataset, Torchtext has a number of useful containers that can make the data pipeline easier.