## CBOW vs BoW

In my previous post I showed how to predict the next word with a bi-gram model by computing bi-gram probabilities:

https://nbviewer.org/github/kyramichel/NLP/blob/master/N-gram%20Probabilities.ipynb
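As a quick reminder of the idea, here is a minimal sketch of a maximum-likelihood bi-gram estimate on the toy sentences used in this post. The helper name `bigram_prob` is my own for illustration and is not necessarily how the linked notebook does it.

```python
from collections import Counter

# Toy sentences (the same ones used in this post)
sents = ["great beach", "awesome beach", "great beach and weather", "this is a nice beach"]

unigram_counts = Counter()
bigram_counts = Counter()
for s in sents:
    words = s.split()
    unigram_counts.update(words)
    # Pair every word with the word that follows it in the same sentence
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, nxt):
    """Maximum-likelihood estimate P(nxt | prev) = count(prev, nxt) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, nxt)] / unigram_counts[prev]

print(bigram_prob("great", "beach"))  # 1.0
print(bigram_prob("beach", "and"))    # 0.25
```

The estimate only uses raw counts, so it knows nothing about how similar two words are.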

 
A CBOW model instead predicts a word from its context, i.e. it takes the similarity between words into account when predicting the next word.

In [2]:
import numpy as np
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

from sklearn.preprocessing import OneHotEncoder


import gensim
from gensim.models import Word2Vec


import warnings
warnings.filterwarnings(action = 'ignore')
 

In [3]:
sents= ["great beach", " awesome beach", "great beach and weather", "this is a nice beach ",]

In [4]:
tokens=[]
for i in range(len(sents)):
    for w in sents[i].split():
        tokens.append(w)   
print(tokens)

['great', 'beach', 'awesome', 'beach', 'great', 'beach', 'and', 'weather', 'this', 'is', 'a', 'nice', 'beach']


Integer (numeric) representation

In [5]:
#Integer encoding with LabelEncoder

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
integer_encoded = le.fit_transform(tokens)
print(integer_encoded)

[4 3 2 3 4 3 1 8 7 5 0 6 3]
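LabelEncoder numbers the classes in sorted order, which is why 'a' maps to 0 and 'weather' to 8. A rough pure-Python equivalent, as a sketch (the names `classes`, `word_to_id` and `encoded` are mine, not scikit-learn's):

```python
tokens = ['great', 'beach', 'awesome', 'beach', 'great', 'beach', 'and',
          'weather', 'this', 'is', 'a', 'nice', 'beach']

# Sorted unique tokens play the role of the encoder's classes
classes = sorted(set(tokens))
word_to_id = {w: i for i, w in enumerate(classes)}

encoded = [word_to_id[w] for w in tokens]
print(encoded)
# [4, 3, 2, 3, 4, 3, 1, 8, 7, 5, 0, 6, 3]
```

The integers are only identifiers: 'great' (4) is not "closer" to 'beach' (3) than to 'a' (0) in any meaningful sense.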


Binary representation

In [6]:
# Binary encoding using OneHotEncoder

integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)

print(onehot_encoded)

[[0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0.]]


Limitation of BoW or one-hot encoding: every word gets its own axis, so when trying to predict a next word like "beach", "great" and "awesome" are treated as completely independent, which is not true, since they mean nearly the same thing.
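To make the limitation concrete, here is a small sketch (the helpers `one_hot` and `dot` are hypothetical, written just for this illustration). The dot product of two one-hot vectors is 0 for any pair of different words, no matter how close their meanings are:

```python
vocab = ['a', 'and', 'awesome', 'beach', 'great', 'is', 'nice', 'this', 'weather']

def one_hot(word):
    """One-hot vector for `word` over the toy vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocab]

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

print(dot(one_hot('great'), one_hot('awesome')))  # 0.0
print(dot(one_hot('great'), one_hot('weather')))  # 0.0
print(dot(one_hot('great'), one_hot('great')))    # 1.0
```

"great" is exactly as far from "awesome" as it is from "weather".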

A CBOW model predicts the next word from context; in this case, it can take into account the similarity between "great" and "awesome" when trying to predict "beach".

Word embedding: a dense vector representation of a word that captures its context in a document, its semantic and syntactic similarity to other words, etc.




The Word2Vec algorithm, implemented in Python by gensim, makes it easy to construct word embeddings via 2 different methods (both use a neural network trained with back-propagation):

1. CBOW (Continuous Bag Of Words): takes the context of each word and predicts the word corresponding to that context.

The input (each context word) is a one-hot encoded vector of size V. The hidden layer contains n neurons and takes the average over all the context word inputs after they are projected by the input weights. The output is a vector of size V whose elements are the softmax values.
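The forward pass described above can be sketched in plain Python. This is only an illustration under my own assumptions: the weights are random and untrained, the hidden size `n = 4` is arbitrary, and the names `W_in`, `W_out` and `cbow_forward` are hypothetical (gensim's internals are organized differently).

```python
import math
import random

random.seed(0)

vocab = ['a', 'and', 'awesome', 'beach', 'great', 'is', 'nice', 'this', 'weather']
V = len(vocab)   # vocabulary size
n = 4            # hidden-layer size (arbitrary, kept small for readability)
word_to_id = {w: i for i, w in enumerate(vocab)}

# Randomly initialised weights: W_in is V x n, W_out is n x V
W_in = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(n)]

def cbow_forward(context_words):
    """Return a probability distribution over the vocabulary for the center word."""
    # A one-hot vector times W_in just selects one row, so we index directly
    rows = [W_in[word_to_id[w]] for w in context_words]
    # Hidden layer: element-wise average of the selected rows
    hidden = [sum(col) / len(rows) for col in zip(*rows)]
    # Output scores: hidden (1 x n) times W_out (n x V)
    scores = [sum(hidden[k] * W_out[k][j] for k in range(n)) for j in range(V)]
    # Softmax, shifted by the max score for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Context around "beach" in "great beach and weather"
probs = cbow_forward(['great', 'and'])
print(len(probs), round(sum(probs), 6))
```

Training would adjust `W_in` and `W_out` by back-propagation so that the probability of the true center word goes up; the rows of `W_in` are then used as the word embeddings.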



2. Skip-Gram

The input is a single (center) word; for each context position, the model outputs a probability distribution over the vocabulary.
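One way to picture Skip-Gram is through its training examples: each center word is paired with every word inside its window. A minimal sketch, assuming a fixed window size (the function name `skipgram_pairs` is hypothetical):

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs for every position in `tokens`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(['great', 'beach', 'and', 'weather'], window=1)
print(pairs)
# [('great', 'beach'), ('beach', 'great'), ('beach', 'and'),
#  ('and', 'beach'), ('and', 'weather'), ('weather', 'and')]
```

CBOW uses the same windows in the opposite direction: the context words are the input and the center word is the target.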

Similarity: words are said to have similar contexts if their vectors occupy close spatial positions. Mathematically, the cosine of the angle between two vector representations is close to 1 when the angle is close to 0. More at:

https://arxiv.org/pdf/1310.4546.pdf

https://arxiv.org/pdf/1411.2738.pdf
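For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch with made-up 2-d vectors (not real embeddings):

```python
import math

def cosine_similarity(u, v):
    """cos(angle) between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-d vectors chosen so the angles are obvious
a = [1.0, 0.0]
b = [2.0, 0.0]    # same direction as a (angle 0)
c = [0.0, 3.0]    # orthogonal to a (angle 90 degrees)
d = [-1.0, 0.0]   # opposite to a (angle 180 degrees)

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
print(cosine_similarity(a, d))  # -1.0
```

Only the direction matters, not the length: `b` is twice as long as `a` and the similarity is still 1.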



In [7]:
data = []
 
# tokenize and lowercase each sentence in the list
for s in sents:
    temp = []
    for w in word_tokenize(s):
        temp.append(w.lower())
             
    data.append(temp)
print(data)

[['great', 'beach'], ['awesome', 'beach'], ['great', 'beach', 'and', 'weather'], ['this', 'is', 'a', 'nice', 'beach']]


In [8]:
# Create CBOW model using Word2Vec algorithm

model1= gensim.models.Word2Vec(data, min_count = 1,
                              vector_size = 100, window = 5)
print("Using CBOW, similarity between 'great' " +
               "and 'awesome' : ",model1.wv.similarity('great', 'awesome'))

Using CBOW, similarity between 'great' and 'awesome' :  0.033640567


In [9]:
# Create Skip Gram model using Word2Vec algorithm

model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 100,
                                             window = 5, sg = 1)

print("Using Skip-Gram method, similarity between 'great' " +
               "and 'awesome' is : ", model2.wv.similarity('great', 'awesome'))

Using Skip-Gram method, similarity between 'great' and 'awesome' is :  0.033640567

Both models return the same score, and it is close to 0. This suggests that on a corpus of only 13 tokens the vectors barely move from their random initialization; a much larger corpus is needed before the similarity reflects meaning.
