In [1]:
# !pip install gensim
# !pip install python-Levenshtein

In [2]:
import gensim
import pandas as pd

### Reading and Exploring the Dataset
The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a JSON file and can be read using pandas.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

In [3]:
df = pd.read_json("../data-large/Cell_Phones_and_Accessories_5.json", lines=True)
df.head(2)

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | They look good and stick good! I just don't li... | 4 | Looks Good | 1400630400 | 05 21, 2014 |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | These stickers work like the review says they ... | 5 | Really great product. | 1389657600 | 01 14, 2014 |
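The `lines=True` argument in the cell above tells pandas that the file holds one JSON object per line rather than a single JSON array. As a rough illustration of that layout, the sketch below parses such a file with only the standard library. The helper name `read_json_lines`, the sample records, and the temporary file name are made up for this example and are not part of the dataset or of pandas.

```python
import gzip
import json
import os
import tempfile


def read_json_lines(path):
    """Parse a file that stores one JSON object per line (gzip optional)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Two made-up records standing in for real reviews.
records = [
    {"reviewText": "Looks good", "overall": 4},
    {"reviewText": "Really great product.", "overall": 5},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_reviews.json.gz")
    # Write the records in line-delimited form, one object per line.
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    loaded = read_json_lines(sample_path)

print(len(loaded), loaded[0]["reviewText"])
```

For the full dataset, `pd.read_json(..., lines=True)` as used above is the more convenient route, since it returns a DataFrame directly.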


In [4]:
df.shape

(194439, 9)

In [5]:
df.reviewText[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

In [6]:
gensim.utils.simple_preprocess(df.reviewText[0])

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']

### Simple Preprocessing & Tokenization
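The first step in most data science tasks is to clean the data.
For NLP, this typically means converting all words to lowercase, trimming whitespace, and removing punctuation.
We do the same here with `gensim.utils.simple_preprocess`, which lowercases the text, strips punctuation, and drops very short tokens (this is why 'I' and the 't' of "don't" are missing from the output above).

Additionally, we could remove stop words such as 'and', 'or', 'is', 'the', 'a', 'an', and convert words to their root forms, for example 'running' to 'run'. These extra steps are not applied in this notebook.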
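To make the preprocessing steps concrete, here is a minimal sketch using only the standard library. It is an approximation for illustration, not gensim's implementation: the regular expression, the `tokenize` and `remove_stop_words` helpers, and the tiny `STOP_WORDS` set are all assumptions made for this example. Converting words to root forms is not shown.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}


def tokenize(text, min_len=2, max_len=15):
    """Lowercase the text and keep alphabetic tokens within a length range."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


sample = "They look good and stick good! I just don't like the rounded shape"
tokens = tokenize(sample)
filtered = remove_stop_words(tokens)

print(tokens)
print(filtered)
```

On this sample the first list matches the start of the `simple_preprocess` output shown earlier, and the second list additionally drops 'and' and 'the'.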

In [6]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

In [31]:
review_text
review_text.shape

(194439,)

In [8]:
review_text.loc[0]

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']

In [9]:
df.reviewText.loc[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

### Training the Word2Vec Model

Train the model on the reviews. Use a window of size 10, i.e. up to 10 words before the current word and up to 10 words after it. Only words that occur at least 2 times in the whole corpus are kept; configure this with the `min_count` parameter.

The `workers` parameter defines how many CPU threads are used for training.
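The window can be pictured as a sliding neighbourhood around each word. The sketch below lists the (center, context) pairs that a fixed window would produce for one tokenized review. The `context_pairs` helper is hypothetical and only illustrates the idea; gensim's actual training adds refinements that are omitted here.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` positions on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["they", "look", "good", "and", "stick"]
pairs = context_pairs(tokens, window=1)

for center, context in pairs:
    print(center, "->", context)
```

With `window=1` each interior word is paired with its two direct neighbours; the notebook's `window=10` widens that neighbourhood to ten words on each side.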

#### Initialize the model

In [10]:
model = gensim.models.Word2Vec(
    window = 10, # The maximum distance between the current and predicted word within a sentence. 10 words to the left and 10 words to the right
    min_count=2, # Ignores all words with total frequency lower than this. 
    workers = 4, # Use these many worker threads to train the model
) 

#### Build Vocabulary

In [11]:
model.build_vocab(
      review_text,
      progress_per=1000 # Report progress every 1000 documents
    )
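Conceptually, building the vocabulary means counting how often each word occurs across all reviews and discarding the words that fall below `min_count`. The following sketch shows that idea on a made-up corpus. The function `build_vocab_sketch` and the alphabetical tie-break are assumptions for illustration; this is not how gensim stores its vocabulary internally.

```python
from collections import Counter


def build_vocab_sketch(sentences, min_count=2):
    """Count tokens over all sentences and keep those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = [(word, count) for word, count in counts.items() if count >= min_count]
    # Most frequent first; alphabetical tie-break is an arbitrary choice for determinism.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in kept]


# Made-up tokenized "reviews".
sentences = [
    ["good", "case"],
    ["good", "phone", "case"],
    ["bad", "charger"],
]
vocab = build_vocab_sketch(sentences, min_count=2)
print(vocab)
```

Words such as 'phone', 'bad', and 'charger' occur only once in this toy corpus, so they are dropped when `min_count=2`.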

In [33]:
model.epochs
# Number of sentences (here, reviews) seen while building the vocabulary
model.corpus_count

194439

#### Train the Word2Vec Model

In [13]:
model.train(
            review_text, 
            total_examples = model.corpus_count, # Count of sentences
            epochs = model.epochs # Number of iterations (epochs) over the corpus. Default is 5.
        )

(61505066, 83868975)
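According to the gensim documentation, `train` returns a pair of word counts: the effective number of words trained on (after unknown words are ignored) and the raw total number of words processed. Assuming that reading, and assuming the default of 5 epochs mentioned in the training cell, the short calculation below interprets the output above.

```python
# Values copied from the training output above.
effective_words, raw_words = 61505066, 83868975
epochs = 5  # assumption: the gensim default mentioned in the training cell

# The raw count accumulates over all epochs, so divide to get one pass.
raw_words_per_epoch = raw_words // epochs
kept_fraction = effective_words / raw_words

print(f"raw words per epoch: {raw_words_per_epoch}")
print(f"fraction of words actually trained on: {kept_fraction:.3f}")
```

Under these assumptions, roughly a quarter of the raw tokens are skipped during training, for example because they are rarer than `min_count`.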

### Save the Model

Save the model so that it can be reused in other applications.

In [14]:
model.save("./word2vec-amazon-cell-accessories-reviews-short.model")

### Finding Similar Words and Similarity Between Words
https://radimrehurek.com/gensim/models/word2vec.html

In [18]:
# Find the most similar words
model.wv.most_similar("bad")
# model.wv.most_similar("sex")

[('amazonif', 0.7045546174049377),
 ('cia', 0.7022926211357117),
 ('chilling', 0.7008479237556458),
 ('hapless', 0.6990648508071899),
 ('iono', 0.697027862071991),
 ('mosquito', 0.6968753337860107),
 ('jabber', 0.6964439153671265),
 ('ocassions', 0.6925169825553894),
 ('endocrine', 0.6909171938896179),
 ('hoister', 0.6897976398468018)]

In [19]:
model.wv.similarity(w1="cheap", w2="inexpensive")

0.51563835

In [20]:
model.wv.similarity(w1="cheap", w2="cheap")

1.0

In [21]:
model.wv.similarity(w1="great", w2="good")

0.7778091
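The scores returned by `model.wv.similarity` are cosine similarities between the two word vectors, which is why comparing a word with itself gives 1.0. The sketch below computes the same measure by hand on small made-up vectors; the `cosine_similarity` helper and the example vectors are for illustration only and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, different length
c = [-3.0, 0.0, 1.0]  # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0: unrelated directions
```

Because only the direction of the vectors matters, a score near 1 means two words appear in very similar contexts, while a score near 0 means their contexts have little in common.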

In [38]:
model.wv.vector_size
model.wv.vectors[0].shape

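print(review_text.loc[0][0])
# model.wv.vectors[0] is the vector stored at vocabulary index 0. By default gensim
# sorts the vocabulary by descending frequency, so index 0 is the most frequent word
# (see model.wv.index_to_key[0] in gensim 4.x), not necessarily "they".
# To look up a specific word, use model.wv["they"].
model.wv.vectors[0]

# model.wv.vectors.shape
# model.wv.vectors holds all word vectors of the model, one row per vocabulary word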

they


array([ 0.50364834, -1.4285587 ,  1.4168093 ,  1.9422042 ,  0.8653465 ,
        0.75955147,  0.2747358 , -1.4190488 , -2.9617536 ,  1.2181287 ,
        1.10275   ,  0.3177622 ,  1.4322487 , -0.5661708 , -2.697159  ,
        0.4741032 ,  0.63043934,  0.9372659 ,  1.4779495 , -2.4965205 ,
       -1.4740303 ,  0.69636965, -0.21480872, -0.46300757, -0.12567146,
       -1.7558631 ,  0.6577201 ,  1.3229464 ,  1.8606657 , -2.117915  ,
        1.6325997 , -0.68104964,  0.58867455,  1.2656505 , -2.2630498 ,
       -0.5371549 , -0.07377188,  1.127902  , -0.6312002 , -1.7259263 ,
       -4.3799133 ,  0.10753926, -1.3706571 ,  0.41079774, -0.6065559 ,
       -1.0834429 , -0.6081674 ,  0.5145841 , -0.6611412 , -0.26599085,
        1.0037822 ,  1.8968215 , -1.0369328 , -0.46508506,  0.40745157,
       -0.83666784,  0.4777559 , -1.4246547 , -1.1692537 , -0.7285985 ,
        0.02681392,  0.29209062,  0.10349356,  0.5412399 ,  0.4328139 ,
       -0.2797536 ,  1.0206255 ,  0.4661867 , -0.05839719,  0.64

### Further Reading

You can read more about gensim's Word2Vec implementation at https://radimrehurek.com/gensim/models/word2vec.html

Explore other Datasets related to Amazon Reviews: http://jmcauley.ucsd.edu/data/amazon/

## Exercise

Train a word2vec model on the [Sports & Outdoors Reviews Dataset](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Sports_and_Outdoors_5.json.gz).
Once the model is trained, find the words most similar to 'awful' and compute the similarity for each of the following word pairs: ('good', 'great'), ('slow', 'steady').

Click here for [solution](https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/42_word2vec_gensim/42_word2vec_gensim_exercise_solution.ipynb).