<a href="https://colab.research.google.com/github/mdrahitazim/Machine-Learning/blob/main/Word_to_Vector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word to Vector

In [2]:
#!pip install gensim

In [3]:
#!pip install python-Levenshtein

**Importing Libraries**

In [4]:
import gensim
import pandas as pd

**Loading Data**

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
%cd "/content/drive/My Drive/"

/content/drive/My Drive


In [7]:
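import ast
import gzip

import pandas as pd

def parse(path):
  # Each line holds one review as a dict literal; literal_eval parses it
  # without executing arbitrary code the way eval would.
  with gzip.open(path, 'rt', encoding='utf-8') as g:
    for l in g:
      yield ast.literal_eval(l)

def getDF(path):
  # Key each parsed review by its row number, then build one DataFrame row per review.
  df = {}
  for i, d in enumerate(parse(path)):
    df[i] = d
  return pd.DataFrame.from_dict(df, orient='index')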

df = getDF('data/reviews_Cell_Phones_and_Accessories_5.json.gz')

In [8]:
df.head()

| Unnamed: 0 | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | They look good and stick good! I just don't li... | 4.0 | Looks Good | 1400630400 | 05 21, 2014 |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | These stickers work like the review says they ... | 5.0 | Really great product. | 1389657600 | 01 14, 2014 |
| 2 | A2TMXE2AFO7ONB | 120401325X | Erica | [0, 0] | These are awesome and make my phone look so st... | 5.0 | LOVE LOVE LOVE | 1403740800 | 06 26, 2014 |
| 3 | AWJ0WZQYMYFQ4 | 120401325X | JM | [4, 4] | Item arrived in great time and was in perfect ... | 4.0 | Cute! | 1382313600 | 10 21, 2013 |
| 4 | ATX7CZYFXI1KW | 120401325X | patrice m rogoza | [2, 3] | awesome! stays on, and looks great. can be use... | 5.0 | leopard home button sticker for iphone 4s | 1359849600 | 02 3, 2013 |


# Preprocessing and Tokenization

The first step in any data science task is to clean the data. For NLP this usually means converting all words to lower case, trimming whitespace, and removing punctuation. We do the same here with gensim.utils.simple_preprocess, which lowercases the text, splits it into tokens, and by default drops tokens shorter than 2 or longer than 15 characters.

Additionally, we can remove stop words such as 'and', 'or', 'is', 'the', 'a', 'an' and reduce words to their root forms, for example 'running' to 'run'. Neither step is applied in this notebook.
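As a rough illustration of the two optional steps, the sketch below filters stop words and maps words to root forms. The stop-word set and the `ROOT_FORMS` lookup table are tiny hypothetical examples made up for this illustration; a real pipeline would normally rely on a library's stop-word list and a proper stemmer or lemmatizer.

```python
# Toy stop-word set (hypothetical; real lists are much longer).
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}

# Toy lookup table of root forms (hypothetical; a real stemmer or
# lemmatizer derives these instead of listing them by hand).
ROOT_FORMS = {"running": "run", "looks": "look", "stickers": "sticker"}

def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]

def to_root(token):
    """Return the root form if the table knows one, else the token unchanged."""
    return ROOT_FORMS.get(token, token)

tokens = ["the", "dog", "is", "running", "and", "looks", "happy"]
filtered = remove_stop_words(tokens)
roots = [to_root(t) for t in filtered]
print(filtered)
print(roots)
```

Removing stop words shrinks the vocabulary, but it also removes context words that Word2Vec could otherwise learn from, so whether it helps depends on the task.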

In [9]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

In [10]:
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [11]:
review_text[0]

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']

In [12]:
df.reviewText[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"
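Comparing the raw review with its token list shows what the preprocessing did: everything is lowercased, punctuation is gone, and one-letter tokens such as "I", "a" and the "t" left over from "don't" have been dropped. The function below is only an approximation of that behaviour written with the standard library, not gensim's actual implementation; the name `approx_simple_preprocess` and the regular expression are assumptions made for illustration.

```python
import re

# Runs of word characters that contain no digits (an approximation of
# an alphabetic tokenizer).
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+", re.UNICODE)

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase the text, split it into tokens, and keep tokens whose
    length is between min_len and max_len characters."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "They look good and stick good! I just don't like the rounded shape"
print(approx_simple_preprocess(sample))
```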

# Training the Word2Vec Model

Train the model on the reviews. Use a window of size 10, i.e. up to 10 words before the current word and up to 10 words after it. The min_count parameter sets the minimum number of times a word must appear in the corpus to be kept in the vocabulary; with min_count = 2, words that occur only once are ignored.

The workers parameter sets how many CPU threads are used for training.
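To make the window parameter concrete, here is a small sketch that lists the context words around one position in a tokenized review. The helper `context_words` is a hypothetical name used only for illustration. It always takes the full window, whereas in gensim the window is the maximum distance between the current word and a context word.

```python
def context_words(tokens, position, window):
    """Return the tokens at most `window` positions away from
    tokens[position], excluding the word at `position` itself."""
    start = max(0, position - window)
    before = tokens[start:position]
    after = tokens[position + 1 : position + 1 + window]
    return before + after

tokens = ["they", "look", "good", "and", "stick", "good"]

# Context of the first "good" (index 2) with a window of 2.
print(context_words(tokens, 2, window=2))
```

Near the start or end of a review the context is simply shorter, because there are fewer neighbouring words to take.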

**Initialize the model**

In [13]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4
)

**Build Vocabulary**

Build the list of unique words in the corpus.
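Conceptually, building the vocabulary means counting every word in the corpus and keeping those that occur at least min_count times. The sketch below shows that idea with the standard library only. The function `toy_build_vocab` and the three sample sentences are hypothetical and are not how gensim implements this step internally.

```python
from collections import Counter

def toy_build_vocab(sentences, min_count=2):
    """Count words across all sentences and map each word that occurs
    at least min_count times to an integer index, most frequent first."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = [word for word, count in counts.most_common() if count >= min_count]
    return {word: index for index, word in enumerate(kept)}

sentences = [
    ["great", "phone", "case"],
    ["great", "case"],
    ["bad", "cable"],
]
vocab = toy_build_vocab(sentences)
print(vocab)
```

Words below the threshold ("phone", "bad", "cable" here) get no index, so they are never assigned a vector.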

In [14]:
model.build_vocab(review_text, progress_per=1000)

In [15]:
model.corpus_count

194439

**Training Word2Vec Model**

In [16]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(61504052, 83868975)

# Save the Model

Save the model so it can be reused later.

In [19]:
model.save('./embedding.model')

# Finding Similar Words and Similarity Between Words

In [21]:
model.wv.most_similar("good")

[('decent', 0.8165975213050842),
 ('great', 0.7840939164161682),
 ('nice', 0.7050759792327881),
 ('fantastic', 0.7048753499984741),
 ('outstanding', 0.6251596212387085),
 ('excellent', 0.6247411966323853),
 ('superb', 0.6155173778533936),
 ('exceptional', 0.60003662109375),
 ('amazing', 0.5846076607704163),
 ('bad', 0.5812669396400452)]

In [22]:
model.wv.most_similar("siri")

[('commands', 0.7871379256248474),
 ('voice', 0.7368420958518982),
 ('command', 0.7343847751617432),
 ('dialing', 0.7113860845565796),
 ('dialer', 0.6991751790046692),
 ('voicemail', 0.6787272691726685),
 ('dial', 0.6742653250694275),
 ('speech', 0.6703729629516602),
 ('recognition', 0.6671193838119507),
 ('motospeak', 0.6594144701957703)]

In [23]:
model.wv.similarity("bad","siri")

-0.07269509

Low similarity between bad and siri

In [24]:
model.wv.similarity("good","decent")

0.8165975

High similarity between good and decent
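The scores above are cosine similarities between word vectors, and most_similar ranks the vocabulary by that same score. The sketch below reproduces the idea on three made-up 3-dimensional vectors. The numbers in `toy_vectors` are hypothetical and are not taken from the trained model, and `toy_most_similar` is an illustrative helper, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen only to illustrate the ranking.
toy_vectors = {
    "good": [1.0, 0.9, 0.0],
    "decent": [0.9, 1.0, 0.1],
    "siri": [0.0, 0.1, 1.0],
}

def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(toy_most_similar("good", toy_vectors))
```

Cosine similarity ranges from -1 to 1: values near 1 mean the vectors point in almost the same direction, and values near 0, like the score for bad and siri, mean the words are used in largely unrelated contexts.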