#### Introduction to Word Embedding and Word2Vec

##### How does Word2Vec work?

Word2Vec is a method for constructing such an embedding. It can be trained with either of two architectures, both of which use shallow neural networks:
- Skip-Gram
- Continuous Bag of Words (CBOW)

##### CBOW Model:
This method takes the context of each word as the input and tries to predict the word corresponding to that context.
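To make the CBOW setup concrete, here is a minimal sketch of how (context, target) training pairs could be built from a tokenized sentence. The function name `cbow_pairs`, the `window` parameter and the example sentence are illustrative assumptions, not part of any library.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target word
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


# Hypothetical example sentence
sentence = "the quick brown fox jumps".split()
for context, target in cbow_pairs(sentence, window=1):
    print(context, "->", target)
```

Each pair is one training example: the network sees the context words and is asked to predict the target word.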

##### Skip-Gram model:
This looks like the multiple-context CBOW model flipped around, and to some extent that is true. We feed the target word into the network, and the model tries to predict the context words that surround it.
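A minimal sketch of the Skip-Gram training pairs, assuming a symmetric context window. The name `skipgram_pairs` and the `window` parameter are illustrative, not taken from a library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs for a Skip-Gram-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical example sentence
sentence = "the quick brown fox jumps".split()
for target, context in skipgram_pairs(sentence, window=1):
    print(target, "->", context)
```

Here the target word is the input and every neighbouring word inside the window becomes a separate prediction task.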

In both cases, the network uses back-propagation to learn.

##### Who wins?
Both have their own advantages and disadvantages. According to Mikolov,
- Skip Gram works well with small amounts of data and is found to represent rare words well.
- On the other hand, CBOW is faster and has better representations for more frequent words.

#### Word2Vec Detailed Explanation and Train your custom Word2Vec Model using gensim in Python - #NLProc tutorial.

##### Motivation
- Audio: Audio spectrogram is Dense
- Image: Image pixels are Dense
- Text: Word, content, or document vectors are Sparse
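As a small illustration of why sparse text vectors are a problem, the sketch below builds one-hot vectors over a tiny, made-up vocabulary. Each vector has a single non-zero entry, and any two different words have a dot product of zero, so the representation carries no notion of similarity.

```python
# Hypothetical four-word vocabulary, for illustration only
vocab = ["phone", "charger", "case", "battery"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return the sparse one-hot vector of `word` over `vocab`."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


print(one_hot("charger"))                        # [0, 1, 0, 0]
print(dot(one_hot("phone"), one_hot("charger"))) # 0
```

Dense embeddings such as Word2Vec replace these vectors with short real-valued vectors in which related words can end up close together.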

Word2Vec is an efficient way to estimate dense word representations in vector space.

##### Word2Vec Hyperparameters

1. Number of negative samples
 - The original paper suggests 5-20 negative samples for small training datasets.
 - It also states that 2-5 seems to be enough when you have a large enough dataset.
2. Window size
 - Smaller window sizes (2-15) lead to embeddings where a high similarity score between two embeddings indicates that the words are interchangeable.
 - Larger window sizes (15-50, or even more) lead to embeddings where similarity is more indicative of the relatedness of the words.
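To show what the "number of negative samples" controls, here is a rough sketch of drawing negatives for one training pair. It follows the original paper in sampling from the unigram distribution raised to the 3/4 power, but the function names, the `seed` argument and the toy corpus are assumptions made for illustration. This is not how gensim implements it internally.

```python
import random
from collections import Counter


def negative_sampler(tokens, power=0.75, seed=0):
    """Build a sampler that draws negative words with probability ~ count ** power."""
    counts = Counter(tokens)
    words = list(counts)
    weights = [counts[w] ** power for w in words]
    rng = random.Random(seed)

    def sample(k, exclude):
        # Assumes the vocabulary contains at least one word other than `exclude`
        negatives = []
        while len(negatives) < k:
            w = rng.choices(words, weights=weights, k=1)[0]
            if w != exclude:
                negatives.append(w)
        return negatives

    return sample


# Hypothetical toy corpus
corpus = "the phone case fits the phone and the charger".split()
sample = negative_sampler(corpus)
print(sample(5, exclude="phone"))
```

With `k=5`, each positive (target, context) pair is contrasted against five randomly drawn words, which is what the `negative` hyperparameter sets.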

In [2]:
pip install --upgrade gensim

Note: you may need to restart the kernel to use updated packages.


In [7]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')


In [6]:
for i, word in enumerate(wv.index_to_key):  # gensim 4.x; use wv.vocab on gensim 3.x
    if i == 10:
        break
    print(word)


In [None]:
vec_king = wv['king']
print(vec_king)

In [None]:
pairs = [
    ('car', 'minivan'), # a minivan is a kind of car
    ('car', 'bicycle'), # still a wheeled vehicle
    ('car', 'airplane'), # no wheels but still a vehicle
    ('car', 'cereal'),   # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))
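The score returned by `wv.similarity` is the cosine similarity between the two word vectors. A minimal pure-Python sketch of that measure, using short made-up vectors instead of the real 300-dimensional ones:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-dimensional vectors
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Values close to 1 mean the vectors point in nearly the same direction, and values near 0 mean the words are unrelated in the embedding space.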

In [None]:
print(wv.most_similar(positive=['car', 'minivan'], topn = 5))

In [None]:
print(wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))

In [8]:
import json
import pandas as pd
import string
import time

The dataset is from Amazon Review Data (2018). Here, we use the Cell Phones and Accessories reviews from the "Small" subsets for experimentation.

In [None]:
data = []

# Each line of the file holds one JSON-encoded review
with open('C:\\AmazonReviewsCellPhones\\Cell_Phones_and_Accessories') as f:
    for line in f:
        data.append(json.loads(line))

In [None]:
print(data[0])
df = pd.DataFrame(data)
print(len(data))

In [None]:
df.head(10)
df = df.drop(columns = ['reviewerName', 'vote', 'image', 'style'])
# rename() returns a new DataFrame; with inplace=True it would return None
df1 = df.rename(columns = {'overall': 'rating', 'asin': 'productID'})

In [None]:
df1.dropna(axis = 0, how = 'any', inplace = True)
df1.drop_duplicates(subset = ['rating', 'reviewText'], keep = 'first', inplace = True)

In [None]:
def clean_text(text):
    # Map every punctuation character to a space
    delete_dict = {sp_character: ' ' for sp_character in string.punctuation}
    table = str.maketrans(delete_dict)
    text1 = text.translate(table)
    textArr = text1.split()
    # Drop purely numeric tokens
    text2 = ' '.join([w for w in textArr if not w.isdigit()])

    # Return the review as a list of lowercase tokens
    return text2.lower().split()

In [None]:
df2 = df1.sample(n=200000)
df2['reviewText']= df2['reviewText'].apply(clean_text)

In [None]:
sentences = df2['reviewText'].tolist()

In [None]:
print(len(sentences))
print(sentences[1])
print(sentences[200])

In [None]:
import gensim

In [None]:
from gensim.models.callbacks import CallbackAny2Vec
from gensim.models import Word2Vec

# init callback class
class callback(CallbackAny2Vec):
    """
    Callback to print loss after each epoch
    """
    def __init__(self):
        self.epoch = 0
        self.loss_previous_step = 0

    def on_epoch_end(self, model):
        # Loss reported by gensim for the current train() call so far
        loss = model.get_latest_training_loss()

        if self.epoch == 0:
            print('Loss after epoch {}: {}'.format(self.epoch, loss))
        elif self.epoch % 100 == 0:
            # Subtract the previous value to get the loss of this epoch alone
            print('Loss after epoch {}: {}'.format(self.epoch, loss - self.loss_previous_step))

        self.epoch += 1
        self.loss_previous_step = loss
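The callback subtracts the previous loss value because, as far as we can tell, `get_latest_training_loss()` reports a running total for the current `train()` call rather than a per-epoch value. The sketch below shows the same bookkeeping on a plain list of made-up cumulative losses. The helper name `per_epoch_losses` is hypothetical.

```python
def per_epoch_losses(cumulative_losses):
    """Convert running loss totals into the loss contributed by each epoch."""
    losses = []
    previous = 0.0
    for total in cumulative_losses:
        losses.append(total - previous)
        previous = total
    return losses


# Made-up running totals after epochs 0, 1 and 2
print(per_epoch_losses([100.0, 180.0, 240.0]))  # [100.0, 80.0, 60.0]
```

A per-epoch loss that keeps shrinking is the signal to look for. The raw running total always grows and is hard to read on its own.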

In [None]:
# init word2vec class
# vector_size is the gensim 4.x name of this argument; gensim 3.x calls it size
w2v_model = Word2Vec(vector_size = 300,
                     window = 15,
                     min_count = 2,
                     workers = 20,
                     sg = 1,
                     negative = 5,
                     sample = 1e-5)
# build vocab

w2v_model.build_vocab(sentences)

# train the w2v model
start = time.time()
w2v_model.train(sentences,
                total_examples=w2v_model.corpus_count,
                epochs=1001,
                report_delay=1,
                compute_loss = True, # set compute_loss = True
                callbacks=[callback()]) # add the callback class
end = time.time()

print("elapsed time in seconds: " + str(end - start))
# save the word2vec model
w2v_model.save('C:\\AmazonReviewsCellPhones\\word2vec.model')

Let us reload our word2vec model and perform operations using it.

In [None]:
reloaded_w2v_model = Word2Vec.load('C:\\AmazonReviewsCellPhones\\word2vec.model')
words = list(reloaded_w2v_model.wv.index_to_key)  # gensim 4.x; use wv.vocab on gensim 3.x
print('Vocab size: '+str(len(words)))
w1 = 'cancellation'
print("Top 3 words similar to cancellation:",\
      reloaded_w2v_model.wv.most_similar(positive = w1, topn =3))
w1 = 'poor'
print("Top 3 words similar to poor:",\
      reloaded_w2v_model.wv.most_similar(positive = w1, topn =3))
print("Similarity between earphones and headphones:"+\
      str(reloaded_w2v_model.wv.similarity(w1="earphones",w2="headphones")))
print("Similarity between charger and charge:"+\
      str(reloaded_w2v_model.wv.similarity(w1="charger",w2="charge")))

In [None]:
from sklearn.manifold import TSNE              # final reduction
import numpy as np                             # array handling

def reduce_dimensions(model):
    num_dimensions = 2  # final number of dimensions (2D, 3D, etc.)

    vectors = []  # positions in vector space
    labels = []   # keep track of words to label our data again later
    for word in model.wv.index_to_key:  # gensim 4.x; use model.wv.vocab on gensim 3.x
        vectors.append(model.wv[word])
        labels.append(word)

    # convert the list of vectors into a numpy array for reduction
    vectors = np.asarray(vectors)

    # reduce using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels

x_vals, y_vals, labels = reduce_dimensions(reloaded_w2v_model)

Let us visualize our word2vec model.

In [None]:
def plot_with_matplotlib(x_vals, y_vals, labels):
    import matplotlib.pyplot as plt

    plt.figure(figsize=(12, 12))
    plt.scatter(x_vals, y_vals)

    # Label a hand-picked set of words instead of a random sample
    selected_words = ["cell", "phone", "noise", "cancellation",
                      "charger", "charge", "poor", "bad"]
    selected_indices = [labels.index(word) for word in selected_words]

    for i in selected_indices:
        plt.annotate(labels[i], (x_vals[i], y_vals[i]))

    plt.show()

        
plot_function = plot_with_matplotlib

plot_function(x_vals, y_vals, labels)