
In [None]:
# gensim is an NLP library for Python
!pip install gensim
!pip install python-Levenshtein

In [3]:
import gensim
import pandas as pd

## **Reading and Exploring the Dataset**

The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a JSON file and can be read using pandas.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

In [4]:
df = pd.read_json('/content/drive/MyDrive/Machine learning projects/datasets/Cell_Phones_and_Accessories_5.json', lines = True)
df

index,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"
...,...,...,...,...,...,...,...,...,...
194434,A1YMNTFLNDYQ1F,B00LORXVUE,eyeused2loveher,"[0, 0]",Works great just like my original one. I reall...,5,This works just perfect!,1405900800,"07 21, 2014"
194435,A15TX8B2L8B20S,B00LORXVUE,Jon Davidson,"[0, 0]",Great product. Great packaging. High quality a...,5,Great replacement cable. Apple certified,1405900800,"07 21, 2014"
194436,A3JI7QRZO1QG8X,B00LORXVUE,Joyce M. Davidson,"[0, 0]","This is a great cable, just as good as the mor...",5,Real quality,1405900800,"07 21, 2014"
194437,A1NHB2VC68YQNM,B00LORXVUE,Nurse Farrugia,"[0, 0]",I really like it becasue it works well with my...,5,I really like it becasue it works well with my...,1405814400,"07 20, 2014"
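
The notebook reads the file with `lines = True`, which means each line holds one JSON object, and the download link points to a gzip-compressed copy. As a minimal sketch, such a file could also be read with the standard library alone. The helper name `iter_json_lines`, the file name `demo_reviews.json.gz`, and the two demo records are made up for illustration and are not part of the dataset.

```python
import gzip
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Made-up demo records, written to a temporary file so the sketch runs anywhere.
demo_records = [
    {"reviewText": "Great case, fits well.", "overall": 5},
    {"reviewText": "Stopped charging after a week.", "overall": 2},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_reviews.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        for record in demo_records:
            handle.write(json.dumps(record) + "\n")
    loaded = list(iter_json_lines(demo_path))

# Only the review text is needed for Word2Vec training.
review_texts = [record["reviewText"] for record in loaded]
print(review_texts)
```

Reading line by line keeps memory use low, which can matter when the full file does not fit comfortably into a DataFrame.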


## We are going to train the Word2Vec model using only the reviewText column

so this is the only column we are interested in

In [5]:
df.shape

(194439, 9)

In [6]:
df.reviewText[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

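**The first step in training a Word2Vec model is pre-processing**

The raw text contains stopwords such as 'a' and 'it', mixed case, stray whitespace, and punctuation marks.

We want to convert the text to lowercase so that tokens are comparable, and to strip the extra whitespace and the punctuation marks.

Most of this can be done with one gensim function, gensim.utils.simple_preprocess: it lowercases the text, splits it into tokens, drops punctuation, and discards very short tokens. It does not remove stopwords in general. In the output below, 'I' and 'a' disappear only because they are a single character long, while 'and', 'it', and 'the' remain.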

In [9]:
gensim.utils.simple_preprocess('They look good and stick good! I just dont like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just wont buy a product like this again')
# simple_preprocess lowercases the sentence and splits it into tokens

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'dont',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'wont',
 'buy',
 'product',
 'like',
 'this',
 'again']
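
To make the behaviour above concrete, here is a rough pure-Python approximation of what `simple_preprocess` appears to do: lowercase the text, keep runs of letters, and drop tokens that are too short or too long. The function name `approx_preprocess` is hypothetical, and the length limits of 2 and 15 are assumed to match gensim's defaults; the real function has more options and may differ in edge cases.

```python
import re

# Runs of letters only; digits, punctuation and underscores act as separators.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def approx_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, drop very short or long tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]


sample = "They look good and stick good! I just dont like a product like this"
print(approx_preprocess(sample))
```

The single-character words 'I' and 'a' are filtered out by the length check, while longer stopwords such as 'and' pass through unchanged.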

In [13]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object
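
The tokenized reviews still contain stopwords such as 'and' and 'the'. The notebook trains on them as they are, but if you want them removed, one option is a simple filter over each token list. This is a sketch only: `drop_stopwords` is a hypothetical helper and the stopword set is deliberately tiny and made up for the example.

```python
# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = frozenset({"a", "an", "and", "it", "the", "this", "was", "they"})


def drop_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set, preserving order."""
    return [tok for tok in tokens if tok not in stopwords]


tokens = ["they", "look", "good", "and", "stick", "good",
          "just", "dont", "like", "the", "rounded", "shape"]
print(drop_stopwords(tokens))
```

On the pandas Series above, the same helper could be applied with `review_text.apply(drop_stopwords)`.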

In [14]:
model = gensim.models.Word2Vec(
         window = 10,     # maximum distance between the target word and a context word: up to 10 words on each side
         min_count = 2,   # ignore words that occur fewer than 2 times in the whole corpus
         workers =  4     # number of CPU threads used to train the model
)
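
To illustrate what the `window` parameter controls, the sketch below lists the (target, context) pairs that a fixed window produces for one tokenized sentence. The function name `context_pairs` is hypothetical, and this is a simplification: actual Word2Vec implementations typically also shrink the window at random for each target word, which the sketch ignores.

```python
def context_pairs(tokens, window):
    """Yield (target, context) pairs for every token within `window` positions."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                yield target, tokens[j]


sentence = ["item", "arrived", "in", "great", "time"]
pairs = list(context_pairs(sentence, window=2))

# Context words seen for the target word "in" with a window of 2.
print([ctx for tgt, ctx in pairs if tgt == "in"])
```

A larger window pulls in more distant words, which tends to capture topical similarity, while a smaller one focuses on immediate neighbours.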

**We need to build a vocabulary**

**Building a vocabulary means collecting the unique words in the corpus**

In [15]:
model.build_vocab(review_text, progress_per = 1000)
# progress_per controls how often progress is reported while the corpus is scanned
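
Conceptually, building the vocabulary amounts to counting every token in the corpus and keeping the ones that meet the `min_count` threshold. A minimal sketch of that idea with a made-up two-review corpus follows; `build_vocab_sketch` is a hypothetical name, and gensim's own implementation does considerably more (for example, preparing frequency-based sampling tables).

```python
from collections import Counter


def build_vocab_sketch(sentences, min_count):
    """Count every token and keep those seen at least `min_count` times."""
    counts = Counter(tok for sentence in sentences for tok in sentence)
    return {tok: n for tok, n in counts.items() if n >= min_count}


# Made-up toy corpus of two tokenized reviews.
toy_corpus = [
    ["great", "product", "great", "packaging"],
    ["great", "cable", "good", "product"],
]
vocab = build_vocab_sketch(toy_corpus, min_count=2)
print(vocab)
```

Words that appear only once ('packaging', 'cable', 'good') are dropped, which is the effect `min_count = 2` has on rare words and typos in the real corpus.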

In [16]:
model.epochs
# the number of epochs defaults to 5

5

In [17]:
model.corpus_count

194439

In [18]:
model.train(review_text, total_examples = model.corpus_count, epochs = model.epochs)

(61505839, 83868975)
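
The call returns a pair of counts. To the best of our understanding, gensim reports the number of effective words actually trained on, followed by the number of raw words seen, both summed over all epochs. The effective count is lower because words below `min_count` are dropped and very frequent words are randomly downsampled. Under that assumption, a little arithmetic on the numbers above gives the corpus size per epoch and the share of words that were kept:

```python
# Values copied from the output of model.train(...) above.
effective_words, raw_words = 61505839, 83868975
epochs = 5  # model.epochs, as printed earlier

# Raw tokens in one pass over the corpus.
raw_words_per_epoch = raw_words // epochs

# Share of raw tokens that actually contributed to training.
kept_fraction = effective_words / raw_words

print(raw_words_per_epoch)
print(round(kept_fraction, 3))
```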

In [19]:
model.save('./word2vec-amazon-cell-accessories-reviews-short.model')

In [20]:
# let's see which words are most similar to 'bad'
model.wv.most_similar('bad')
# similarity score of 'shabby' is the highest

[('shabby', 0.6732350587844849),
 ('terrible', 0.6712656617164612),
 ('horrible', 0.6044690608978271),
 ('good', 0.5891343355178833),
 ('crappy', 0.5438897609710693),
 ('okay', 0.5306547284126282),
 ('poor', 0.5262861251831055),
 ('cheap', 0.5195913314819336),
 ('lame', 0.5148704648017883),
 ('legit', 0.512425422668457)]
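
The scores above are cosine similarities between word vectors. As a sketch of the ranking idea, the code below scores every other word against a query word and sorts by similarity. The 2-dimensional vectors are invented for illustration (real Word2Vec vectors have many more dimensions), and `most_similar_sketch` is a hypothetical helper, not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_sketch(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-dimensional vectors, for illustration only.
toy_vectors = {
    "bad": [1.0, 0.0],
    "terrible": [0.9, 0.1],
    "good": [0.6, 0.8],
    "cable": [0.0, 1.0],
}
print(most_similar_sketch("bad", toy_vectors))
```

Note that 'good' ranks fourth for 'bad' in the real output: a high score reflects words that occur in similar contexts, which includes antonyms as well as synonyms.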

In [21]:
# let's check the similarity score between two words
model.wv.similarity(w1 = 'cheap', w2 = 'inexpensive')

0.54649466

In [22]:
model.wv.similarity(w1 = 'great', w2 = 'good')

0.78419983

In [23]:
model.wv.similarity(w1 = 'work', w2 = 'product')
# 'work' and 'product' are almost unrelated: the score is close to 0

0.0015640492