# Training Word2Vec Embeddings with the Gensim NLP Library

Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

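**Gensim** is an open-source Python library for NLP. Here it is used to train Word2Vec word embeddings on Amazon cell phone and accessory reviews.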

In [1]:
pip install gensim



In [2]:
pip install python-Levenshtein

Collecting python-Levenshtein
  Downloading python_Levenshtein-0.21.1-py3-none-any.whl (9.4 kB)
Collecting Levenshtein==0.21.1 (from python-Levenshtein)
  Downloading Levenshtein-0.21.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (172 kB)
Collecting rapidfuzz<4.0.0,>=2.3.0 (from Levenshtein==0.21.1->python-Levenshtein)
  Downloading rapidfuzz-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: rapidfuzz, Levenshtein, python-Levenshtein
Successfully installed Levenshtein-0.21.1 python-Levenshtein-0.21.1 rapidfuzz-3.2.0


In [3]:
import gensim
import pandas as pd

In [4]:
path = '/content/Cell_Phones_and_Accessories_5.json'


# Reading the Dataset

In [5]:
df = pd.read_json(path, lines=True)
df.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | They look good and stick good! I just don't li... | 4 | Looks Good | 1400630400 | 05 21, 2014 |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | These stickers work like the review says they ... | 5 | Really great product. | 1389657600 | 01 14, 2014 |
| 2 | A2TMXE2AFO7ONB | 120401325X | Erica | [0, 0] | These are awesome and make my phone look so st... | 5 | LOVE LOVE LOVE | 1403740800 | 06 26, 2014 |
| 3 | AWJ0WZQYMYFQ4 | 120401325X | JM | [4, 4] | Item arrived in great time and was in perfect ... | 4 | Cute! | 1382313600 | 10 21, 2013 |
| 4 | ATX7CZYFXI1KW | 120401325X | patrice m rogoza | [2, 3] | awesome! stays on, and looks great. can be use... | 5 | leopard home button sticker for iphone 4s | 1359849600 | 02 3, 2013 |


In [6]:
df.shape

(194439, 9)

## Target column is `reviewText`

In [7]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [9]:
review_text.loc[1]

['these',
 'stickers',
 'work',
 'like',
 'the',
 'review',
 'says',
 'they',
 'do',
 'they',
 'stick',
 'on',
 'great',
 'and',
 'they',
 'stay',
 'on',
 'the',
 'phone',
 'they',
 'are',
 'super',
 'stylish',
 'and',
 'can',
 'share',
 'them',
 'with',
 'my',
 'sister']

In [10]:
df.reviewText.loc[1]

'These stickers work like the review says they do. They stick on great and they stay on the phone. They are super stylish and I can share them with my sister. :)'
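
Comparing the raw review with its token list shows what `simple_preprocess` did: it lowercased the text, dropped the punctuation and the `:)` emoticon, and discarded the one-letter token `I`. The sketch below approximates that behaviour with the standard library only. `approx_simple_preprocess` is a hypothetical helper, not Gensim's implementation, and it assumes Gensim's default token length limits of 2 to 15 characters.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (illustration only)."""
    # Lowercase, then keep runs of word characters that are not digits.
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop tokens that are too short or too long, and tokens starting with "_".
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

review = ("These stickers work like the review says they do. "
          "They stick on great and they stay on the phone. "
          "They are super stylish and I can share them with my sister. :)")
tokens = approx_simple_preprocess(review)
print(tokens[:5])
print(len(tokens))
```

On this review the sketch yields the same 30 tokens as the Gensim output above; other inputs (for example accented text) may differ.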

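Converting all texts into vectors: Word2Vec learns one vector per word. `window=10` is the maximum distance between a target word and the context words used to predict it, `min_count=2` drops words that appear fewer than twice in the corpus, and `workers=4` trains on four threads.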

In [11]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)
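
To make `window` concrete, the sketch below lists the (target, context) pairs that a fixed window produces for a short token list. It is a simplified illustration only: `context_pairs` is a hypothetical helper, and Gensim's real training loop differs in details such as how the effective window is chosen for each target word.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # leftmost context position
        hi = min(len(tokens), i + window + 1)  # one past the rightmost position
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["these", "stickers", "work", "like", "the", "review", "says"]
for pair in context_pairs(tokens, window=2)[:6]:
    print(pair)
```

With `window=10`, most reviews of a sentence or two fall almost entirely inside a single window, so nearly every word in a short review becomes context for every other word.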

In [12]:
model.build_vocab(review_text, progress_per=1000)
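
`build_vocab` scans the corpus once, counts every token, and keeps only the tokens that meet `min_count`. A minimal standard-library sketch of that counting step follows, using a hypothetical `build_vocab_sketch` helper and a made-up three-review corpus (not rows from the dataset):

```python
from collections import Counter

def build_vocab_sketch(sentences, min_count=2):
    """Count tokens across all sentences and keep those seen >= min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}

# Made-up corpus for illustration only.
corpus = [
    ["great", "case", "fits", "well"],
    ["great", "charger", "works", "well"],
    ["case", "broke", "fast"],
]
vocab = build_vocab_sketch(corpus, min_count=2)
print(sorted(vocab))  # ['case', 'great', 'well']
```

Words seen only once, such as `charger` here, get no vector, which is why looking up a rare word in the trained model can raise a `KeyError`.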

In [15]:
model.epochs

5

## Training Model

In [16]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)



(61505105, 83868975)
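
Assuming Gensim's documented return value for `train`, the tuple above is (effective words trained on, raw words seen), both summed over all epochs. The effective count is lower because words outside the vocabulary are skipped and, with default settings, very frequent words are downsampled. Some quick arithmetic on those two numbers:

```python
trained_words, raw_words = 61505105, 83868975  # values printed by model.train above
epochs = 5                                     # value of model.epochs above

raw_per_epoch = raw_words // epochs            # tokens in one pass over the corpus
kept_fraction = trained_words / raw_words      # share of tokens actually used

print(f"raw words per epoch: {raw_per_epoch:,}")
print(f"fraction of words trained on: {kept_fraction:.3f}")
```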

## Saving Model

In [17]:
model.save('/content/Cell_Phones_and_Accessories_5.h5')

## Testing the Similarity

In [19]:
model.wv.most_similar("nice")

[('cool', 0.7595975399017334),
 ('neat', 0.7030728459358215),
 ('good', 0.7016851902008057),
 ('great', 0.6816050410270691),
 ('snazzy', 0.6732475757598877),
 ('lovely', 0.6504204273223877),
 ('attractive', 0.6469370722770691),
 ('fantastic', 0.6327206492424011),
 ('classy', 0.6324760317802429),
 ('distinctive', 0.6302198171615601)]

## Look below! The two words in each pair mean almost the same thing.

### The vectors generated here tell the same story!

So, our code's working! Hurray!

In [20]:
model.wv.similarity(w1="cheap", w2="inexpensive")

0.5382275

In [21]:
model.wv.similarity(w1="great", w2="awesome")

0.7366418
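
The scores from `similarity` and `most_similar` are cosine similarities between word vectors: 1.0 means two vectors point in the same direction, and values near 0 mean the words are unrelated. Below is a minimal sketch with made-up 3-dimensional vectors (the model trained above uses Gensim's default of 100 dimensions). `cosine_similarity`, `most_similar_sketch`, and `toy_vectors` are illustrative names, not Gensim objects.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors for illustration only; real ones come from model.wv.
toy_vectors = {
    "great":   [0.9, 0.8, 0.1],
    "awesome": [0.8, 0.9, 0.2],
    "cable":   [0.1, 0.0, 0.9],
}
ranking = most_similar_sketch("great", toy_vectors)
print(ranking)
```

In this toy setup `awesome` ranks above `cable` for `great`, mirroring how the trained model scores `great`/`awesome` (0.74) higher than the looser pair `cheap`/`inexpensive` (0.54).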