# Introduction 

Word embedding is just a fancy way of saying numerical representation of words. 

Word Embedding is a language modeling technique used for mapping words to vectors of real numbers. It represents words or phrases in "vector space" (arrays) with several dimensions. Word embeddings can be generated using fancy methods like neural networks, co-occurrence matrix, probabilistic models, etc.

The program "Word2Vec" consists of models for generating word embedding. These models are shallow two layer neural networks having one input layer, one hidden layer and one output layer. Word2Vec utilizes two types of training sets :

(1) __CBOW (Continuous Bag of Words)__

> CBOW model predicts the current word given context words within specific window. The input layer contains the context words and the output layer contains the current word. The hidden layer contains the number of dimensions in which we want to represent current word present at the output layer.

<p style="text-align: center;"><img src="images/intro-cbow.png"></p>

(2) __Skip Gram__

> Skip gram predicts the surrounding context words within specific window given current word. The input layer contains the current word and the output layer contains the context words. The hidden layer contains the number of dimensions in which we want to represent current word present at the input layer.

<p style="text-align: center;"><img src="images/intro-skipgram.png"></p>

Before we talk about these 2 Word2Vec techniques, let's go back to the basics with the samplest "words embedding" using what has been called "One-hot encoding" of words.

# One-hot Encoding

The simplest word embedding you can have is using one-hot vectors. If you have 10,000 words in your vocabulary, then you can represent each word as a 1x10,000 vector.

For a simple example, if we have 4 words — mango, strawberry, city, Delhi — in our vocabulary then we can represent them as following:
* nango [1, 0, 0, 0]
* strawberry [0, 1, 0, 0]
* city [0, 0, 1, 0]
* delhi [0, 0, 0, 1]

There are a few problems with the above approach, firstly, our size of vectors depends on the size of our vocabulary(which can be huge). This is a wastage of space and increases algorithm complexity exponentially resulting in the curse of dimensionality.

Secondly, these embedding will be closely coupled to their applications, making transfer-learning to a model using a different vocabulary of the same size, adding/removing words from vocabulary would almost impossible as it would require to re-train the whole model again.

Lastly, the entire purpose of creating embedding is to capture the contextual meaning of the words, which this representation fails to do. There is no co-relation between words that have similar meaning or usage.


In [1]:
import numpy as np
import pandas as pd

In [2]:
def Similarity(a1, a2):
    return (a1 - a2).sum()

In [4]:
mango = np.array((1, 0, 0, 0))
strawberry = np.array((0, 1, 0, 0))
city = np.array((0, 0, 1, 0))
delhi = np.array((0, 0, 0, 1))

In [5]:
Similarity(mango, strawberry)

0

In [6]:
Similarity(mango, city)

0

# CBOW vs Skip-gram

Continuous Bag of Words Model (CBOW) and Skip-grams are techniques to represent words for training machine learning algorithms.

In the CBOW model, the distributed representations of context (or surrounding words) are combined to predict the word in the middle. While in the Skip-gram model, the distributed representation of the input word is used to predict the context. This digram illustrates this difference:

<img src="images/cbow_skipgram.png">

Remember that both of these are simply ways to create labeled training data.  How do you a train a neural network to predict word embedding when you don’t have any labeled data i.e words and their corresponding word embedding?

# Skip-gram

We’ll do so by creating a “fake” task for the machine learning algorithm to train on.

The fake task for Skip-gram model would be, given a word, we’ll try to predict its neighboring words. We’ll define a neighboring word by the window length — a fancy term called "hyper-parameter"!

Here is an example with window length of 2:

<img src="images/skipgram.png">

Given the sentence:

<p style="text-align: center; font-style: italic;">“I will have orange juice and eggs for breakfast.”</p>

and a window length of 2, if the target word is juice, its neighboring words will be ( have, orange, and, eggs). Our input and target word pair would be (juice, have), (juice, orange), (juice, and), (juice, eggs).

Also note that within the sample window, proximity of the words to the source word plays no role. So have, orange, and, and eggs will be treated the same while training.

# Word2Vec

In [25]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

import nltk    
from nltk.tokenize import sent_tokenize, word_tokenize 
# nltk.download('punkt') 
import warnings 

import gensim 
from gensim.models import Word2Vec 

In [19]:
file_handle=open("data/alice.txt", "r")
text=file_handle.read()

In [20]:
full_text=text.replace("\n", "");

In [21]:
data = []

In [24]:
# iterate through each sentence in the file, and lower-casing all tokens
for i in sent_tokenize(full_text): 
    temp = [] 
      
    # tokenize the sentence into words 
    for j in word_tokenize(i): 
        temp.append(j.lower()) 
  
    data.append(temp) 

In [None]:
# only interested in alphanumeric characters
# import re
# text = re.sub(r'\W+', ' ', text)

# CBOW Model

In [27]:
# Create CBOW model 
CBOW_model = gensim.models.Word2Vec(data, min_count = 1,  
                              size = 100, window = 5) 

In [34]:
print("Cosine similarity between 'alice' and 'wonderland' - CBOW : ", 
    CBOW_model.wv.similarity('alice', 'wonderland')) 

Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.9256836


In [35]:
print("Cosine similarity between 'alice' and 'machines' - CBOW : ", 
    CBOW_model.wv.similarity('alice', 'machines'))

Cosine similarity between 'alice' and 'machines' - CBOW :  0.9809821


# Skipgram Model

In [30]:
# Create Skip Gram model 
Skipgram_model = gensim.models.Word2Vec(data, min_count = 1, size = 100, 
                                             window = 5, sg = 1) 

In [32]:
print("Cosine similarity between 'alice' and 'wonderland' - Skip Gram : ", 
    Skipgram_model.wv.similarity('alice', 'wonderland')) 

Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.61861473


In [33]:
print("Cosine similarity between 'alice' and 'machines' - Skip Gram : ", 
      Skipgram_model.wv.similarity('alice', 'machines'))

Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.84913594


Output indicates the cosine similarities between word vectors ‘alice’, ‘wonderland’ and ‘machines’ for different models. One interesting task might be to change the parameter values of ‘size’ and ‘window’ to observe the variations in the cosine similarities.