## Word2Vec
Word2Vec is a technique that lets you do arithmetic with words. For example, if you give the computer the expression "King - man + woman", it will answer "Queen". How can a computer do this?

A computer doesn't understand text; it understands numbers. So we need to represent the word 'King' as numbers in a way that captures its meaning. A single number can't do that, so we use a list of numbers, which in mathematics is called a vector.

How do we turn 'King' into a vector that represents its meaning accurately? We can describe 'King' by a set of properties. A king has authority, a king is rich, and a king is male. Does a king have a population? No, countries and cities have populations, so that property is false for 'King'. The vector for 'King' is therefore [authority = 1, rich = 1, population = 0, gender = 1], or simply [1 1 0 1].

We build vectors for the other words in the same way. Once we have such vectors for all the words in the vocabulary, doing math with them is easy. See the example image below:


<img src = "img.png" width = "800px" height = "400px"></img>
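
The arithmetic described above can be tried with a few hand-made vectors. This is only a sketch: the features follow the [authority, rich, population, gender] order used for 'King', and every vector except the one for 'King' is a made-up value chosen for illustration. The `nearest_word` helper is hypothetical and uses plain Euclidean distance.

```python
import numpy as np

# Feature order: [authority, rich, population, gender] (gender: 1 = male, 0 = female).
# Only the 'king' vector comes from the text; the others are illustrative assumptions.
vectors = {
    "king": np.array([1, 1, 0, 1]),
    "queen": np.array([1, 1, 0, 0]),
    "man": np.array([0, 0, 0, 1]),
    "woman": np.array([0, 0, 0, 0]),
    "country": np.array([0, 1, 1, 0]),
}

# King - man + woman
result = vectors["king"] - vectors["man"] + vectors["woman"]


def nearest_word(vector, exclude=()):
    """Return the word whose vector is closest (Euclidean distance) to `vector`."""
    candidates = {word: vec for word, vec in vectors.items() if word not in exclude}
    return min(candidates, key=lambda word: np.linalg.norm(candidates[word] - vector))


# The words used in the expression are excluded from the search.
answer = nearest_word(result, exclude=("king", "man", "woman"))
print(result, answer)
```

With real embeddings the result never lands exactly on another word, so the answer is the closest vector rather than an exact match.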

* In the image above, 'authority', 'event', 'has tail', 'rich' and 'gender' are called features. The vectors generated for King, Man, Woman, Queen, Battle, and Horse are called feature vectors.
* Generating feature vectors by hand is very difficult, so we use a neural network, which learns these features automatically.
* The learned feature vectors are also called word embeddings.
* One notable thing about word embeddings is that the learned features are not labeled: we don't know that the first 1 in the 'King' vector stands for 'authority', yet the vectors still work.
* **To learn them, the network is trained on a 'fake' problem: predicting a missing word in a sentence, as shown in the following images:**

<img src = "img1.png" width = "800px" height = "400px"></img>

<img src = "img2.png" width = "800px" height = "400px"></img>

* As shown in the following images, take a window of three words. Given the words 'lived' and 'a', we can predict the word 'there'. Here we take the 2nd and 3rd words of the window and predict the 1st word.
* These are our training samples. We move the three-word window over the paragraph and generate all the samples; together they form the training set for the neural network:

<img src = "img3.png" width = "800px" height = "400px"></img>

<img src = "img4.png" width = "800px" height = "400px"></img>

<img src = "img5.png" width = "800px" height = "400px"></img>

* In the training samples, the words on the left side are X and the word on the right side is Y. You feed in the words X and predict the word Y.
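
Sliding the window can be sketched in a few lines. This follows the example where 'lived' and 'a' are used to predict 'there', so the first word of each window is treated as Y and the remaining words as X. The function name `cbow_samples` and the short token list are assumptions made for illustration.

```python
def cbow_samples(tokens, window=3):
    """Slide a fixed-size window over the tokens and build (X, Y) samples."""
    samples = []
    for start in range(len(tokens) - window + 1):
        chunk = tokens[start:start + window]
        target = chunk[0]           # Y: the first word in the window
        context = tuple(chunk[1:])  # X: the remaining words in the window
        samples.append((context, target))
    return samples


tokens = "there lived a king".split()
samples = cbow_samples(tokens)
for context, target in samples:
    print(context, "->", target)
```

Other choices of target position (for example the middle word of the window) work the same way; only the indexing changes.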

* To train the neural network on one of these samples, suppose we have the words 'ordered' and 'his', and we want to predict the word 'king' from them. The input layer takes one-hot encoded vectors. If the vocabulary has 5000 words, each input vector has size 5000, with a 1 at the position of the word and 0 everywhere else. For example, for the word 'ordered', the entry for 'ordered' is 1 and all the other entries are 0. The same goes for 'his'.
* The vocabulary is the set of unique words in your text corpus.
* The hidden layer has 4 neurons, and this number is the size of the embedding vectors. The size can be different; there is no fixed rule, it is chosen by trial and error.
* The output layer produces a vector of size 5000. The edges start with random weights, so when we feed in a training sample, the first output will most likely be wrong. We compare the actual output Y with the predicted output Y', compute a loss that measures the difference between them, and backpropagate, which means adjusting the weights. Then we take the second sample, the third sample, and so on through all the samples (5000, 100000, however many there are), training the model so that the input 'ordered' and 'his' produces the output 'King'.
* **The process is shown in the images below:**

<img src = "img6.png" width = "800px" height = "400px"></img>

<img src = "img7.png" width = "800px" height = "400px"></img>

* **The second sample (again a window of size 3):**

<img src = "img8.png" width = "800px" height = "400px"></img>

* **The 3rd sample:**

<img src = "img9.png" width = "800px" height = "400px"></img>
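
The training loop described above (one-hot input, a hidden layer of 4 neurons, an output over the vocabulary, a loss, and backpropagation) can be sketched with NumPy. This is a toy version under several assumptions: a six-word vocabulary instead of 5000, the two context vectors averaged into one input, softmax with cross-entropy as the loss, and plain gradient descent. All names here are hypothetical and it is not how gensim implements training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary (assumption); a real one would have thousands of words.
vocab = ["the", "king", "ordered", "his", "ministers", "emperor"]
word_to_index = {word: i for i, word in enumerate(vocab)}
vocab_size, embedding_size = len(vocab), 4


def one_hot(word):
    """Vector of zeros with a single 1 at the word's position."""
    vector = np.zeros(vocab_size)
    vector[word_to_index[word]] = 1.0
    return vector


# The weights start out random, as described in the text.
w_in = rng.normal(scale=0.5, size=(vocab_size, embedding_size))   # input -> hidden
w_out = rng.normal(scale=0.5, size=(embedding_size, vocab_size))  # hidden -> output


def train_step(w_in, w_out, context, target, learning_rate=0.1):
    """One forward and backward pass; updates both weight matrices in place."""
    x = np.mean([one_hot(word) for word in context], axis=0)  # averaged one-hot input
    y = one_hot(target)

    # Forward pass
    hidden = x @ w_in
    scores = hidden @ w_out
    exp_scores = np.exp(scores - scores.max())
    predicted = exp_scores / exp_scores.sum()                 # softmax
    loss = -np.log(predicted[word_to_index[target]])          # cross-entropy

    # Backward pass (gradients of the loss with respect to the weights)
    grad_scores = predicted - y
    grad_w_out = np.outer(hidden, grad_scores)
    grad_hidden = w_out @ grad_scores
    grad_w_in = np.outer(x, grad_hidden)

    # Adjust the weights
    w_out -= learning_rate * grad_w_out
    w_in -= learning_rate * grad_w_in
    return float(loss)


losses = [train_step(w_in, w_out, ("ordered", "his"), "king") for _ in range(200)]
print(losses[0], losses[-1])
```

Repeating the step on one sample only shows the loss going down; real training cycles through all the samples for several epochs.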

* After feeding in, say, 1 million samples for 100 epochs, the neural network is trained. At that point the word vector for 'King' is the set of weights shown in the following image:

<img src = "img10.png" width = "800px" height = "400px"></img>

* Those weights are the trained word vector. The vector for 'King' will be very similar to the vector for 'emperor', because both words are trained with the same context words. See the image below:

<img src = "img11.png" width = "800px" height = "400px"></img>

* The approach above is called **Continuous Bag of Words (CBOW).**
* In CBOW we predict the target word from its context.
* There is a second approach called **Skip Gram**. In Skip Gram we do the reverse: we start from a target word and predict its context words.

<img src = "img12.png" width = "800px" height = "400px"></img>

* To **summarize,** Word2Vec is not a single method; it learns word embeddings using one of two techniques, **CBOW** or **Skip Gram**.
* Put simply, **Word2Vec** means converting words into vectors.
* The working of **Skip Gram** is shown in the images below:

<img src = "img13.png" width = "800px" height = "400px"></img>

<img src = "img14.png" width = "800px" height = "400px"></img>

<img src = "img15.png" width = "800px" height = "400px"></img>
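
For Skip Gram the training samples are reversed: each sample pairs a target word (the input) with one of its context words (the output). The sketch below assumes a symmetric window of one word on each side, which may differ from the window drawn in the images; the function name `skip_gram_pairs` is hypothetical.

```python
def skip_gram_pairs(tokens, window=1):
    """Build (target, context) pairs: the target word is X, each neighbour is a Y."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "king ordered his ministers".split()
pairs = skip_gram_pairs(tokens)
for target, context in pairs:
    print(target, "->", context)
```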

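* In these diagrams, a word's embedding is read from the weights attached to that word. With **Skip Gram** the target word is the input, so its embedding is the set of weights between the input layer and the hidden layer. With **CBOW** as drawn above the target word is the output, so its embedding is the set of weights between the hidden layer and the output layer. (Libraries such as gensim typically expose the input-side weights as the word vectors for both variants.)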

<img src = "img16.png" width = "800px" height = "400px"></img>

<img src = "img17.png" width = "800px" height = "400px"></img>

### Word2Vec Implementation

In [8]:
# First install the 'gensim' library. If you use Anaconda, it may already be installed on your computer.
# Next install another package called 'python-Levenshtein', which does not come with Anaconda.
# !pip install gensim
# !pip install python-Levenshtein

In [7]:
# Let's first import gensim and pandas.
import gensim
import pandas as pd

#### Reading and Exploring the Dataset
The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a JSON file and can be read using pandas.
* Here we use gensim, an NLP library for Python. Its API is simpler than TensorFlow's, which is why we use it to build a **Word2Vec model.**

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

In [12]:
# Let's read the dataset. It stores one JSON review per line, and pandas can read such files.
# 'lines=True' tells pandas to read the file as one JSON object per line.
df = pd.read_json("Cell_Phones_and_Accessories_5.json", lines=True)
df.sample(5)

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 28238 | A1H9ZV1NWGJNG3 | B0042UQLM0 | Mike | [0, 0] | I had to get this use my headphones on my xbox... | 5 | Great | 1390089600 | 01 19, 2014 |
| 143844 | A18P98QS2M3K6Z | B00AR4M04S | Jim Besso | [0, 0] | These worked great -- at first. After a coupl... | 3 | Not up to par | 1403654400 | 06 25, 2014 |
| 28173 | A1RP5VRESSVIHJ | B0042TY68C | reazon i am needing help | [0, 0] | i ordered two of these. they do NOT provide mo... | 1 | i wasted my money | 1373068800 | 07 6, 2013 |
| 158817 | A2LE5IS9W9OWXC | B00CB6X6Y8 | Imad Ali Syed | [0, 0] | This product works perfectly for me in my Niss... | 5 | Great Service | 1397433600 | 04 14, 2014 |
| 6699 | A3919EID796R4N | B001DDG2FA | Sherry N. Efferson | [0, 0] | Much better price than cell service carrier - ... | 5 | Cell phone batteries | 1225670400 | 11 3, 2008 |
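
To see what `lines=True` does without downloading the dataset, here is a small sketch that reads two made-up reviews from an in-memory string. The records and the name `toy_df` are invented for illustration; only the column names `reviewText` and `overall` are borrowed from the dataset.

```python
import io

import pandas as pd

# Two JSON objects, one per line (made-up records in the same style as the dataset).
json_lines = (
    '{"reviewText": "Works great", "overall": 5}\n'
    '{"reviewText": "Stopped charging", "overall": 2}\n'
)

# lines=True: treat every line as a separate JSON object, i.e. one row per line.
toy_df = pd.read_json(io.StringIO(json_lines), lines=True)
print(toy_df.shape)
```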


In [13]:
# We want to train a Word2Vec model using only the 'reviewText' column.
# To see how many records the dataset has, check its shape:
df.shape

(194439, 9)

In [14]:
# To check a single review of 'reviewText' column:
df.reviewText[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

### Simple Preprocessing & Tokenization
* The first step in any data science task is to clean the data. For NLP, we apply preprocessing such as converting all words to lower case, trimming spaces, and removing punctuation. We will do that here too.

* Additionally, we can also remove stop words like 'and', 'or', 'is', 'the', 'a', 'an' and convert words to their root forms like 'running' to 'run'.

In [15]:
# The first step in training Word2Vec is preprocessing: we convert the words to lower case, drop extra
# spaces, remove punctuation marks and so on.
# gensim provides a function for this, 'gensim.utils.simple_preprocess', which tokenizes and
# normalizes the text.
# Let's pass a single review to the function:
gensim.utils.simple_preprocess("They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again")

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']
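
The output above can be understood with a rough pure-Python imitation of the preprocessing step. This is not gensim's implementation, only an approximation under simple assumptions: lower-case the text, keep runs of ASCII letters as tokens, and drop tokens that are very short or very long. The function name `rough_preprocess` is hypothetical.

```python
import re


def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate tokenization: lower-case, split on non-letters, filter by length."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if min_len <= len(token) <= max_len]


tokens = rough_preprocess("I just don't like it!")
print(tokens)
```

This shows why "don't" becomes 'don' and why 'I' disappears: the apostrophe splits the word, and the one-letter pieces are dropped by the length filter.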

In [16]:
# As we can see, the function tokenizes the review, converts capital letters to lower case and removes
# punctuation marks. It also drops very short tokens such as 'I' (by default, tokens shorter than
# 2 characters); stop words such as 'and' and 'the' are kept.
# Now we apply the function to the entire column with 'apply'. The result is a pandas Series.
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
review_text.sample(5)

130150    [this, is, nice, case, for, the, price, you, g...
74315     [ordered, this, as, part, of, our, back, up, s...
57341     [this, battery, works, well, for, me, charges,...
164356    [it, great, to, keep, your, phone, covered, bu...
171456    [was, really, blurry, so, was, really, disappo...
Name: reviewText, dtype: object

#### Training the Word2Vec Model
* Train the model on the reviews. Use a window of size 10, i.e. 10 words before the current word and 10 words after it. Words that appear fewer than 2 times in the whole corpus are ignored; this is configured with the min_count parameter.
* workers defines how many CPU threads are used for training.
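
The effect of `min_count` can be illustrated with a plain word count. The sketch below is not gensim code; it only mimics the frequency filter on a tiny made-up corpus, assuming that a word is kept when it occurs at least `min_count` times across all sentences.

```python
from collections import Counter

# Tiny made-up corpus of tokenized sentences.
sentences = [
    ["great", "case", "for", "the", "price"],
    ["great", "battery"],
    ["case", "broke"],
]
min_count = 2

# Count every word over the whole corpus, then keep the frequent ones.
counts = Counter(word for sentence in sentences for word in sentence)
kept = sorted(word for word, count in counts.items() if count >= min_count)
print(kept)
```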

#### Initializing the model

In [17]:
# The previous step created a new pandas Series in which each element is a list of tokenized words.
# Next we initialize the gensim model. The Word2Vec class comes with gensim and takes several arguments.
# 'window' is the window size (10 means 10 words before the target word and 10 words after it).
# 'min_count' is a frequency threshold: words that occur fewer than min_count times in the whole
# corpus are ignored during training.
# 'workers' is the number of CPU threads used to train the model.
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)

In [18]:
# Next we build the vocabulary, i.e. the list of unique words in the corpus.
# For that we call 'build_vocab()' and pass it the tokenized text.
model.build_vocab(review_text, progress_per=1000)

In [19]:
# By default the number of epochs is set to 5:
model.epochs

5

In [20]:
# Next we run the actual training with 'train()'. The first argument is the review text, the second is
# total_examples, the total number of reviews in the corpus, and the third is the number of epochs.
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(61503356, 83868975)

In [21]:
model.corpus_count

194439

#### Save the Model
* Save the model so that it can be reused in other applications

In [22]:
# The model is now trained. We save it to a file first, because the usual workflow is to train a model
# once, save it, and then reuse the pre-trained model.
# Once saved, the file can be stored (for example in the cloud) and loaded whenever an NLP task needs it.
model.save("./word2vec-amazon-cell-accessories-reviews-short.model")

#### Finding Similar Words and Similarity between words

In [23]:
# The next step is to experiment with the model. The trained word vectors are available as 'model.wv',
# which has the function 'most_similar()'.
# If we pass a word to this function, it returns the words most similar to it (the top 10 by default),
# together with their similarity scores.
model.wv.most_similar("bad")

[('terrible', 0.6764117479324341),
 ('shabby', 0.6534987092018127),
 ('horrible', 0.6188998222351074),
 ('good', 0.576382040977478),
 ('awful', 0.5688327550888062),
 ('okay', 0.5432535409927368),
 ('mad', 0.5354875922203064),
 ('cheap', 0.5241401791572571),
 ('disappointing', 0.5223191380500793),
 ('keen', 0.520203709602356)]

In [24]:
# Next we can print the cosine similarity scores of the model between two words:
model.wv.similarity(w1="cheap", w2="inexpensive")

0.50951564

In [25]:
# Let's try two other words:
model.wv.similarity(w1="great", w2="good")

0.7796989

In [26]:
model.wv.similarity(w1="good", w2="good")

1.0
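
The scores above are cosine similarities: the cosine of the angle between two word vectors. A minimal sketch of the computation follows; the three-dimensional vectors are made-up stand-ins, since real Word2Vec vectors are much longer.

```python
import numpy as np


def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors standing in for the embeddings of two words.
good = np.array([0.9, 0.1, 0.3])
great = np.array([0.8, 0.2, 0.4])

same = cosine_similarity(good, good)    # a vector compared with itself
close = cosine_similarity(good, great)  # two vectors pointing in similar directions
print(same, close)
```

A vector compared with itself always gives 1 (up to floating-point rounding), which is why the last cell above prints 1.0.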

#### Further Reading
* You can read about gensim more at https://radimrehurek.com/gensim/models/word2vec.html
* Explore other Datasets related to Amazon Reviews: http://jmcauley.ucsd.edu/data/amazon/

### Exercise
Train a word2vec model on the **Sports & Outdoors Reviews Dataset**. Once you have trained the model, find the words most similar to 'awful' and find the similarities between the following word pairs: ('good', 'great'), ('slow', 'steady').

In [27]:
# Let's first load the dataset:
dfe = pd.read_json("Sports_and_Outdoors_5.json", lines=True)
dfe.sample(5)

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 34796 | A1KHNJ93BP4ELV | B000ENMNCQ | Nikola Edison | [0, 0] | I wear a swim cap over ear plugs as I am intol... | 4 | Better than average. | 1371254400 | 06 15, 2013 |
| 84954 | A21J6AKQZ0V6C5 | B001298HEE | Kinsykins | [0, 0] | Just spray it on and the gunk literally runs o... | 5 | It Works | 1371600000 | 06 19, 2013 |
| 167904 | A1BUDU8SNJ0I9W | B002TUSJWA | J. Wu "JackandBlood" | [1, 1] | For the price this bi-pod does very well. Mat... | 5 | Cant go wrong for the price | 1291939200 | 12 10, 2010 |
| 26145 | A66DME0HFC10K | B000AXAO3U | An American Soldier "Six" | [3, 3] | Absolutely all it's said to be. Keeps weapons... | 5 | The Real Deal in Weapon Protection | 1259884800 | 12 4, 2009 |
| 28926 | AFECGAER92D8C | B000BRQVH8 | Rational | [1, 3] | These seem well made, although, having had the... | 3 | Nicely made | 1337731200 | 05 23, 2012 |


In [28]:
# Let's see the shape of dataset:
dfe.shape

(296337, 9)

In [29]:
# The next step is to pre-process the text:
review_text_ex = dfe.reviewText.apply(gensim.utils.simple_preprocess)
review_text_ex.sample(5)

243119    [it, works, great, no, directions, so, it, was...
111172    [these, small, projectiles, are, perfect, trai...
73785     [use, this, in, conjunction, with, my, battle,...
180745    [these, things, are, nice, and, tight, but, no...
272910    [compact, easy, to, use, and, relatively, infi...
Name: reviewText, dtype: object

In [30]:
# Next we initialize the model:
model_ex = gensim.models.Word2Vec(
    window=8,
    min_count=2,
    workers=4,
)

In [31]:
# The next step is to build a vocabulary:
model_ex.build_vocab(review_text_ex, progress_per=1000)

In [32]:
# The next step is to perform the actual training:
model_ex.train(review_text_ex, total_examples=model_ex.corpus_count, epochs=model_ex.epochs)

(91341518, 121496535)

In [34]:
# Now the model is trained, so we store it in a file:
model_ex.save("./word2vec-Sports_and_Outdoors_5.model")

In [36]:
# Now we can experiment with the model:
model_ex.wv.most_similar("excellent")

[('outstanding', 0.9000451564788818),
 ('exceptional', 0.8672378063201904),
 ('incredible', 0.7825648188591003),
 ('excellant', 0.7681642770767212),
 ('awesome', 0.7564525604248047),
 ('excelent', 0.7422342896461487),
 ('superb', 0.7302507758140564),
 ('fantastic', 0.7212028503417969),
 ('amazing', 0.7166146039962769),
 ('unbeatable', 0.7148966193199158)]

In [37]:
# Next we can print the cosine similarity scores of the model between two words:
model_ex.wv.similarity(w1="cheap", w2="inexpensive")

0.530342

In [40]:
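# Note: this cell queries 'model' (the Cell Phones & Accessories model trained earlier), not 'model_ex',
# so the score below comes from that model. Use 'model_ex.wv.similarity(...)' to query the
# Sports & Outdoors model.
model.wv.similarity(w1="superb", w2="excellent")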

0.80471194

In [42]:
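# This cell also queries 'model' rather than 'model_ex'; a word compared with itself gives 1.0 in any model.
model.wv.similarity(w1="nice", w2="nice")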

1.0

* **That's all for this notebook.**