## Word Embeddings - Encoding Lexical Semantics


**Word Embeddings ** are dense vectors of real numbers , one per word in the vocabulary. 

Representing text , words is a challenge in NLP as need to store it's meaning as well.  

Also given input is $ |V|$ dimensional ( where V is our vocabulary ) we want an output with only a few dimensions .( Need to move from higer dimension to lower one)

** One Hot Encoding ** we can represent a word by 
$$ \overbrace {\left [0,0,\dots,1,\dots,0,0\right]} ^\text {|V| elements} $$

where 1 is in a location unique to word w . Every word will have a 1 in a unique location and zeros everywhere else

**Problem** this representation treats all words as **independent entities ** with no relation to each other . We need a notion of **Similarity** between words.
i.e **Semantic Similarity ** and not just orthographic representations. Using this technique we can combat the **sparsity ** of linguistic data by connecting the dots of what we have seen and what we haven't. Eg

* Mathematican ran to the store. (Train)
* Physicist ran to the store.(Train)
* Mathematician solved the open problem.(Train)

* Physicist solved the open problem. (Test)

Using Symantic similarity the n/w can genralize this sentence . The eg relies on a fundamental linguistic assumprion; words appering in similar contexts are related to each other. - **Distributional Hypothesis**

### Getting Dense Word Embeddings

Encoding Semantic Similarity in words -  Make up some semantic attributes and give scores to them , common words can have similar score .

$$ q_\text{mathematician} = \left [\overbrace{2.3}^\text{can run},\overbrace{9.1}^\text{likes coffee},\overbrace{-5.5}^\text{majored in Physics},\dots\right]$$


$$ q_\text{physicist} = \left [\overbrace{2.3}^\text{can run},\overbrace{9.1}^\text{likes coffee},\overbrace{9.1}^\text{majored in Physics},\dots\right]$$

Then the measure of similarity can be  = $ q_\text{mathematician} \cdot q_\text{physicist} $

After normalization Similarity ( physicist,mathematician  ) = $$ \frac {q_\text{mathematician} \cdot q_\text{physicist}} {||q_\text{mathematician} || || q_\text{physicist}||} = cos(\phi) $$


Where $\phi$ is the angle between the two vectors . Extrmely similar words ( where embeddings point in the same direction) will have similarity =1 , Dissimilar words will have similarity =-1

One Hot Encoding - special case where each word basically has similarity =0 , and each word is given a unique semantic attribute. Whereas the one hot encoded vectors are sparse , these vectors are dense with entries typically non-zero.


Using the neural network to learn the representations i.e ** keeping the word embedddings as parameters in the model and learning and updating them during training ** Note - The learnt word embeddings will not be interpretable .

** Word Embeddings are a representation of the SEMANTICS of a word -> efficiently encoding semantic information that might be relevant to the task at hand **


### Word Embeddings in Pytorch 

Need to define an indexx for each word when using embeddings 

Embeddings are stored as $ |V| \times D $  matrix , where D is the dimensionality of the embeddings, such that the word assigned index i has it's embeddings stored in the i'th row of the matrix 

** torch.nn.Embedding - takes two  arguments : vocab size , dimensionallity of the embeddings **

In [1]:
!pip install torch

Collecting torch
[?25l  Downloading https://files.pythonhosted.org/packages/49/0e/e382bcf1a6ae8225f50b99cc26effa2d4cc6d66975ccf3fa9590efcbedce/torch-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (519.5MB)
[K    100% |████████████████████████████████| 519.5MB 32kB/s 
tcmalloc: large alloc 1073750016 bytes == 0x592c6000 @  0x7f705c0071c4 0x46d6a4 0x5fcbcc 0x4c494d 0x54f3c4 0x553aaf 0x54e4c8 0x54f4f6 0x553aaf 0x54efc1 0x54f24d 0x553aaf 0x54efc1 0x54f24d 0x553aaf 0x54efc1 0x54f24d 0x551ee0 0x54e4c8 0x54f4f6 0x553aaf 0x54efc1 0x54f24d 0x551ee0 0x54efc1 0x54f24d 0x551ee0 0x54e4c8 0x54f4f6 0x553aaf 0x54e4c8
[?25hInstalling collected packages: torch
Successfully installed torch-0.4.1


In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [3]:
torch.manual_seed(1)

<torch._C.Generator at 0x7fbfe90bd190>

In [4]:
# indexing the words 
words_to_ix = {"hello":0,"world":1}
# 2 words in the vocab - hello , world
# dimensionality of embeddings = 5
embeddings = nn.Embedding(2,5)
# getting the index of word hello and converting to a long tensor
lookup_tensor = torch.tensor([words_to_ix["hello"]], dtype=torch.long)
# getting the embeddding at index 0 -> hello 
hello_embedding = embeddings(lookup_tensor)
print(lookup_tensor)
print(hello_embedding)

tensor([0])
tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]], grad_fn=<EmbeddingBackward>)


### N Gram Modelling

Given a sequence of words w , compute

$$ P(w_i | w_{i-1}, w_{i-2},\dots,w_{i-n+1}) $$

where $ w_i $ is the i'th word of the sequence 

Computing the loss function and updating params with backprop 

In [5]:
# Context size is the size of window to look for context
# 2 words to left - asymmetric window
CONTEXT_SIZE =2 
EMBEDDING_DIM = 10

test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold."""

test_sentence = test_sentence.split()

# Tokenizing the input 
# Building a list of tuples , Each tuple is ([word_i-2,word_i-1],target word)

trigrams = [(
            [test_sentence[i],test_sentence[i+1]],
            test_sentence[i+2]) 
            for i in range(len(test_sentence)-2) ]

print(trigrams[:3])


[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]


In [0]:
# set removing duplicates
vocab = set(test_sentence)

word_to_ix = {word:i for i,word in enumerate(vocab)}



In [0]:
class NGramLangClassifier(nn.Module):
  def __init__(self,vocab_size,dimensions,context_size):
    super(NGramLangClassifier,self).__init__()
    # The Embedding Layer
    self.embeddings =  nn.Embedding(vocab_size,dimensions)
    # Linear layer
    self.linear1 = nn.Linear(context_size*dimensions,128)
    # second layer/output layer will give probability for each word
    self.linear2 = nn.Linear(128,vocab_size)
    
  def forward(self,inputs):
    embedding = self.embeddings(inputs).view((1,-1))
    out1 = F.relu(self.linear1(embedding))
    out2 = self.linear2(out1)
    # returning the log_probabilities of each word
    return F.log_softmax(out2,dim=1)

In [0]:
# loss function , optimizer and model
losses =[]

loss_function = nn.NLLLoss()

model = NGramLangClassifier(len(vocab),EMBEDDING_DIM,CONTEXT_SIZE)

optimizer = optim.SGD(model.parameters(),lr=0.001)

In [12]:
# Training Loop
for epoch in range(10):
  total_loss =0
  for context,target in trigrams:
    # 1. Get the word indexes for the trigrams 
    context_idx = torch.tensor([word_to_ix[w] for w in context],dtype=torch.long)
    target_idx = torch.tensor([word_to_ix[target]],dtype=torch.long)
    
    #2. Zero the gradients being accumulated
    model.zero_grad()
    
    # 3. Forward Pass
    log_probs = model(context_idx)
    
    # 4. Compute loss
    loss = loss_function(log_probs,target_idx)
    
    # 5. Backprop and updating gradients
    loss.backward()
    optimizer.step()
    
    # appending the losses
    total_loss +=loss.item()
  losses.append(total_loss)

print(losses)

[524.9456593990326, 522.3348758220673, 519.7448451519012, 517.1742217540741, 514.6207783222198, 512.0832452774048, 509.56009459495544, 507.05187797546387, 504.55892968177795, 502.07942605018616]


### Computing Word Embeddings : Continuous Bag-Of-Words

Continuous Bag-Of-Words (CBOW) is frequently used in NLP

Model tries to predict words given the context of a few words before and and few words after the target word. 

Distinct from language modelling -> ** CBOW is not sequential and does not have to be probabilistic **

CBOW used to quickly train word embeddings --> these embeddings used to initialize the embeddings of some more complicated model ** Pretrained Embedding **

** CBOW Model ** given a target word $w_i$ and an $ N $ context window on each size - $ w_{i-1},\dots,w_{i-N} and w_{i+1},\dots,w_{i+N}  $ referring to all context words as $ C $ , CBOW tries to minimize 

$$ -log p(w_i | C ) = -log Softmax (A (\sum_{w \in C} q_w) +b ) $$


where $ q_w $ is the embedding for the word w 

In [18]:
# Context size - symmetrical 2 to left , 2 to right
CONTEXT_SIZE = 2 

text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""

text = text.split()

# removing duplicates
vocab = set(text)
vocab_size = len(vocab)

word_to_ix =  {word:i for i,word in enumerate(vocab)}

# creating the sets of context , target
data =[]

for i in range(2, len(text) - 2):
  context =[text[i - 2],text[i - 1],text[i+1],text[i+2]]
  target = text[i]
  data.append((context,target))
  

print(data[:4])

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the')]


In [19]:
# function to return indexes for a given context 
def make_context_vector(context,word_to_ix):
  idxs = [word_to_ix[w] for w in context]
  return torch.tensor(idxs,dtype=torch.long)

make_context_vector(data[0][0],word_to_ix)

tensor([17, 21, 16, 43])

In [1]:
# CBOW Model
# class CBOW(nn.Module):
#   def __init__(self,vocab_size,embedding_dim,context_size):
#     super(CBOW,self).__init__()
#     # Embedding Layer
#     self.embedding = nn.Embedding(vocab_size,embedding_dim)
#     # layer 1
#     self.l1 = nn.Linear(embedding_dim * context_size,128)
#     # layer 2
#     self.l2 = nn.Linear(128,vocab_size)
    
#   def forward(self,inputs):
#     embed = self.embedding(inputs).view(1,-1)
#     o1 = F.relu(self.l1(embed))
#     o2 = self.l2(o1)
#     return F.log_softmax(o2,dim=1)



class CBOW2(nn.Module):
  def __init__(self,vocab_size,embedding_dim,context_size):
    super(CBOW2,self).__init__()
    self.embedding = nn.Embedding(vocab_size,embedding_dim)
    self.l1 = nn.Linear(embedding_dim,vocab_size)

    
  def forward(self,inputs):
    embed = self.embedding(inputs).sum(dim=0).view((1,-1))
    out = self.l1(embed)
    return F.log_softmax(out,dim=1)
    
  
  

NameError: ignored

In [0]:
losses=[]

loss_function = nn.NLLLoss()

model = CBOW(vocab_size,EMBEDDING_DIM,2*CONTEXT_SIZE)

optimizer = optim.SGD(model.parameters(),lr=0.001)

In [29]:
# Training Loop
for epoch in range(10):
  total_loss=0
  for context, target in data:
    #1. Calculate the indexs
    context_idx = make_context_vector(context,word_to_ix)
    target_idx = make_context_vector([target],word_to_ix)
    
    #2. Zero the gradients
    model.zero_grad()
    
    #3. forward pass
    log_prob = model(context_idx)
    
    #4. loss calculation
    loss = loss_function(log_prob,target_idx)
    
    #5. Backward pass
    loss.backward()
    optimizer.step()
    
    total_loss += loss.item()
  
  losses.append(total_loss)
  
print(losses)

[227.04092955589294, 225.45482993125916, 223.88139843940735, 222.3219611644745, 220.77350521087646, 219.2351851463318, 217.70639371871948, 216.18591237068176, 214.67090392112732, 213.1623682975769]
