# Word2vec Model

### Let's understand what word2vec is
word2vec is a `word embedding` technique in which the similarity between words comes from their `neighbourhood` (context) words.
### But what is an embedding?
<img src='img\embedding.jpg'>

![SegmentLocal](img\w2v.gif "segment")

## What are the different components of a `word2vec model`?
- `Input layer` : Accepts one-hot encoded words
- `Hidden layer` : Has fewer neurons than the input layer
- `Output layer` -
    - Has exactly the same number of neurons as the input layer
    - Uses the `softmax function` as its activation function
    - `Cross-entropy loss` is used to measure the error at the softmax layer
    - Once we have the error, we use a gradient descent optimiser to move towards a `local/global minimum`
    - After several rounds of optimisation, we extract the weight matrix that holds the vector corresponding to each word
<img src= "img\model_structure_w2v.png" style="width: 620px;"/>
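The layers listed above can be sketched in a few lines of NumPy. This is only a minimal illustration with a toy vocabulary: the names `W_in` and `W_out`, the vocabulary and the layer sizes are assumptions made for the example, not something taken from H2O.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["kite", "flies", "high", "in", "the", "sky"]  # toy vocabulary (assumed)
vocab_size, hidden_size = len(vocab), 3                # hidden layer is smaller than the input

# Illustrative weight matrices (names are assumptions)
W_in = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # input -> hidden
W_out = rng.normal(scale=0.1, size=(hidden_size, vocab_size))  # hidden -> output


def one_hot(index, size):
    """Return a vector of zeros with a single 1 at `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exp = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exp / exp.sum()


x = one_hot(vocab.index("kite"), vocab_size)  # input layer: one-hot encoded word
hidden = x @ W_in                             # hidden layer
probs = softmax(hidden @ W_out)               # output layer: one probability per word

target = vocab.index("flies")                 # a neighbourhood word of "kite"
loss = -np.log(probs[target])                 # cross-entropy loss for this pair

print(f"loss = {loss:.4f}")
```

The output layer has one neuron per vocabulary word, so `probs` has the same length as the one-hot input.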

## Now let's understand what happens inside a **word2vec** model
**Before getting into how a word2vec model works, let's understand what an n-gram is.**
- `n-gram :` In simple terms, an n-gram is a sequence of n consecutive words taken from a text. For example, consider the sentence *Kite flies high in the sky*.
<br>

`2-grams` can be represented as below :-

In [1]:
from nltk.util import skipgrams, ngrams
sent = "Kite flies high in the sky".split()
list(ngrams(sent, 2))

[('Kite', 'flies'),
 ('flies', 'high'),
 ('high', 'in'),
 ('in', 'the'),
 ('the', 'sky')]

`3-grams` can be represented as below :-

In [2]:
sent = "Kite flies high in the sky".split()
list(ngrams(sent, 3))

[('Kite', 'flies', 'high'),
 ('flies', 'high', 'in'),
 ('high', 'in', 'the'),
 ('in', 'the', 'sky')]

### Similarly we can make n-grams for different values of n

In [3]:
sent = "I am making it work with hardwork".split()
list(skipgrams(sent, 3, 4))[:20]

[('I', 'am', 'making'),
 ('I', 'am', 'it'),
 ('I', 'am', 'work'),
 ('I', 'am', 'with'),
 ('I', 'am', 'hardwork'),
 ('I', 'making', 'it'),
 ('I', 'making', 'work'),
 ('I', 'making', 'with'),
 ('I', 'making', 'hardwork'),
 ('I', 'it', 'work'),
 ('I', 'it', 'with'),
 ('I', 'it', 'hardwork'),
 ('I', 'work', 'with'),
 ('I', 'work', 'hardwork'),
 ('I', 'with', 'hardwork'),
 ('am', 'making', 'it'),
 ('am', 'making', 'work'),
 ('am', 'making', 'with'),
 ('am', 'making', 'hardwork'),
 ('am', 'it', 'work')]

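### Now let's understand what k-skip n-gram means
The `skipgrams(sent, 3, 4)` call above builds **4-skip 3-grams**: 3-word combinations in which up to 4 words in total may be skipped.
Here we can see how many words are skipped in the first few grams: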
- 0 skip gram : ('I', 'am', 'making')
- 1 skip gram : ('I', 'am', 'it')
- 2 skip gram : ('I', 'am', 'work')
- 3 skip gram : ('I', 'am', 'with')
- 4 skip gram : ('I', 'am', 'hardwork')
<br>
<br>

In the 4-skip gram, for example, no word is skipped between **I** and **am**, since **am** occurs right after **I**. Four words (**making**, **it**, **work** and **with**) are then skipped between **am** and **hardwork**.

#### Similarly, it now skips the 2nd word (**am**), and the counts below are the extra words skipped after **making** :-

<br>

- 0 skip gram : ('I', 'making', 'it')
- 1 skip gram : ('I', 'making', 'work')
- 2 skip gram : ('I', 'making', 'with')
- 3 skip gram : ('I', 'making', 'hardwork')
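To make the skip counting concrete, here is a small pure-Python sketch that builds k-skip n-grams and reports the total number of skipped words for each one. The function name `k_skip_n_grams` is hypothetical, and this is not NLTK's implementation, just an illustration that yields the same grams for this sentence.

```python
from itertools import combinations


def k_skip_n_grams(tokens, n, k):
    """Yield (gram, skipped) pairs, where `skipped` is the total number of
    words skipped inside the gram (never more than k)."""
    for start in range(len(tokens)):
        # The remaining n - 1 words are picked from the next n + k - 1 positions
        window_end = min(start + n + k, len(tokens))
        for rest in combinations(range(start + 1, window_end), n - 1):
            positions = (start,) + rest
            skipped = positions[-1] - positions[0] - (n - 1)
            yield tuple(tokens[i] for i in positions), skipped


sent = "I am making it work with hardwork".split()
grams = list(k_skip_n_grams(sent, 3, 4))

for gram, skipped in grams[:6]:
    print(skipped, gram)
```

Note that `('I', 'making', 'it')` is reported with 1 skipped word in total, because **am** is skipped between **I** and **making**.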

### How does a word2vec model work?

Once we have built the k-skip n-grams, we have to use them in the word2vec model.
<br>
But how are we going to do that? That is an important question.
<br>
**But first there comes another question.**
<br>
Why did we build k-skip n-grams?
<br>
Because we wanted to find the neighbourhood (context) words corresponding to every word, and the k-skip n-grams give us exactly that. The value of n defines how many words of context we want to give to our model. So if you are building 3-grams, you are giving the model a context of 3 neighbourhood words.
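As a rough illustration of "neighbourhood words for every word", the sketch below pairs each word with the words that fall inside a fixed window around it. The helper `context_pairs` and the `window` parameter are hypothetical names, and a sliding window is a common simplification that we assume here; it is not necessarily how H2O builds its training pairs internally.

```python
def context_pairs(tokens, window):
    """Return (centre, neighbour) pairs for all words within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        first = max(0, i - window)
        last = min(len(tokens), i + window + 1)
        for j in range(first, last):
            if j != i:  # a word is not its own neighbour
                pairs.append((centre, tokens[j]))
    return pairs


sent = "Kite flies high in the sky".split()
pairs = context_pairs(sent, window=1)

for centre, neighbour in pairs[:4]:
    print(centre, "->", neighbour)
```

A larger `window` gives each word more neighbours, which plays the same role as building longer grams.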
#### OK, now we know which words are neighbours. Now what?
Once you have built the k-skip n-grams, you want to increase the probability of a given word occurring with its neighbourhood words. In other words, out of all the words in the output layer, the model should give a high probability to (predict) the neighbourhood words. **But how will you do that?**
To do that, you adjust the weights of the hidden layer in such a way that the probability of the neighbourhood word increases in the output layer. This is done with an optimisation algorithm; in this case **gradient descent** is used to adjust the weights of the hidden layer.
<br>
So now you have a model that gives high probabilities to the neighbourhood words, but that is not the only purpose of a word2vec model.
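The weight adjustment described above can be sketched with plain NumPy for a single (word, neighbour) pair. Everything here is an assumption made for illustration: the toy sizes, the learning rate, and the names `W_in`, `W_out` and `train_step`. It uses a full softmax, which may differ from what H2O does internally.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, hidden_size = 6, 3  # toy sizes (assumed)
W_in = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # input -> hidden
W_out = rng.normal(scale=0.1, size=(hidden_size, vocab_size))  # hidden -> output


def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()


def neighbour_probability(W_in, W_out, centre, neighbour):
    """Probability the model gives to `neighbour` when `centre` is the input."""
    return softmax(W_in[centre] @ W_out)[neighbour]


def train_step(W_in, W_out, centre, neighbour, lr=0.1):
    """One gradient-descent step on the cross-entropy loss (updates in place)."""
    hidden = W_in[centre].copy()                 # same as one_hot(centre) @ W_in
    probs = softmax(hidden @ W_out)
    grad_scores = probs.copy()
    grad_scores[neighbour] -= 1.0                # gradient of the loss w.r.t. the scores
    grad_W_out = np.outer(hidden, grad_scores)   # gradient w.r.t. the output weights
    grad_hidden = W_out @ grad_scores            # gradient w.r.t. the hidden vector
    W_out -= lr * grad_W_out
    W_in[centre] -= lr * grad_hidden


centre, neighbour = 0, 1  # indices of a word and one of its neighbourhood words
before = neighbour_probability(W_in, W_out, centre, neighbour)
for _ in range(200):
    train_step(W_in, W_out, centre, neighbour)
after = neighbour_probability(W_in, W_out, centre, neighbour)

print(f"before: {before:.3f}, after: {after:.3f}")
```

After repeated steps, the probability of the neighbourhood word is higher than it was at the start, which is the behaviour we want from training.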
<br>

The hidden layer's weight matrix holds a vector for every word. To get the vector for a particular word, we multiply the one-hot encoding of that word with the hidden layer matrix, which simply picks out that word's vector.
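A tiny NumPy check shows why this multiplication is just a lookup. The vocabulary and the matrix values are made up for the example, and we assume one row of the matrix per word.

```python
import numpy as np

vocab = ["kite", "flies", "high", "in", "the", "sky"]  # toy vocabulary (assumed)
# Toy hidden layer matrix: one row of 3 values per word
hidden_weights = np.arange(len(vocab) * 3, dtype=float).reshape(len(vocab), 3)

index = vocab.index("high")
one_hot = np.zeros(len(vocab))
one_hot[index] = 1.0

vector_by_product = one_hot @ hidden_weights  # multiply one-hot with the matrix
vector_by_lookup = hidden_weights[index]      # or simply read the row directly

print(np.array_equal(vector_by_product, vector_by_lookup))
```

Because of this, the row can be read directly instead of performing the full multiplication.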

<br>

**Conclusion :** Using word2vec, we can reduce very high-dimensional textual data (one dimension per word in the vocabulary) to 100, 200 or 300 dimensions (as per the model requirement).

# Let's implement word2vec with H2O 

## Importing libraries
**Importing the required libraries, such as `h2o` and `pandas`**

In [4]:
import re

import h2o
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from h2o.estimators.word2vec import H2OWord2vecEstimator
from IPython.display import Image, display  # public import path; IPython.core.display is deprecated for these
from sklearn import metrics

# Start (or connect to) a local H2O cluster
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
  Starting server from C:\Users\paras.mal\AppData\Local\Continuum\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\paras.mal\AppData\Local\Temp\tmpfw6mvw75
  JVM stdout: C:\Users\paras.mal\AppData\Local\Temp\tmpfw6mvw75\h2o_paras_mal_started_from_python.out
  JVM stderr: C:\Users\paras.mal\AppData\Local\Temp\tmpfw6mvw75\h2o_paras_mal_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


0,1
H2O cluster uptime:,06 secs
H2O cluster timezone:,Asia/Kolkata
H2O data parsing timezone:,UTC
H2O cluster version:,3.18.0.8
H2O cluster version age:,10 months and 26 days !!!
H2O cluster name:,H2O_from_python_paras_mal_jtmem5
H2O cluster total nodes:,1
H2O cluster free memory:,3.531 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


## Data `preprocessing` step :
- Reading the CSV file
- Removing the byte-string wrapper (`b'...'`) from each tweet

In [5]:
df = pd.read_csv(r"elonmusk_tweets.csv") 
df.head()

Unnamed: 0,id,created_at,text
0,849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https:...
1,848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exa..."
2,848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
3,848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...'
4,848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat ..."


In [6]:
df['text'] = df['text'].apply(lambda x : str(x)[2:-1])  # drop the leading b' (or b") and the closing quote
df.head()

Unnamed: 0,id,created_at,text
0,849636868052275200,2017-04-05 14:56:29,And so the robots spared humanity ... https://...
1,848988730585096192,2017-04-03 20:01:01,@ForIn2020 @waltmossberg @mims @defcon_5 Exact...
2,848943072423497728,2017-04-03 16:59:35,"@waltmossberg @mims @defcon_5 Et tu, Walt?"
3,848935705057280001,2017-04-03 16:30:19,Stormy weather in Shortville ...
4,848416049573658624,2017-04-02 06:05:23,@DaveLeeBBC @verge Coal is dying due to nat ga...


## Converting dataframe to `h2o` frame 
- The column type of the text column is set to `string`
- This is because H2O `Word2Vec` accepts string columns

In [7]:
temp = h2o.H2OFrame(df[['text']], column_types=['string'])


Parse progress: |█████████████████████████████████████████████████████████| 100%


In [8]:
temp.shape

(2819, 1)

## Text Cleaning process 
- Tokenize each sentence into words
- Convert all words to lower case
- Filter out the words that have only a single character
- Remove tokens containing `numbers` (digits) from the tweets
- Remove `stop-words` from the tweets


In [9]:
STOP_WORDS = ["ax","i","you","edu","s","t","m","subject","can","lines","re","what","hi",
               "there","all","we","one","the","a","an","of","or","in","for","by","on",
               "but","is","in","a","not","with","as","was","if","they","are","this","and","it","have",
               "from","at","my","be","by","not","that","to","from","com","org","like","likes","so"]

In [10]:
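def tokenize(sentences, stop_words = STOP_WORDS):
    # Split on runs of non-word characters; sentences are separated by NA rows
    tokenized = sentences.tokenize("\\W+")
    tokenized_lower = tokenized.tolower()
    # Keep words with at least 2 characters (and the NA separators)
    tokenized_filtered = tokenized_lower[(tokenized_lower.nchar() >= 2) | (tokenized_lower.isna()),:]
    # Drop tokens that contain a digit
    tokenized_words = tokenized_filtered[tokenized_filtered.grep("[0-9]",invert=True,output_logical=True),:]
    # Drop stop words, using the argument rather than the global list
    tokenized_words = tokenized_words[(tokenized_words.isna()) | (~ tokenized_words.isin(stop_words)),:]
    return tokenized_words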
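
For readers without an H2O cluster at hand, the same cleaning steps can be approximated in plain Python on a single string. This is only a sketch: `clean_tokens` and `STOP_WORDS_SUBSET` are hypothetical names, the sample tweet is made up, and Python's regex engine may treat some characters differently from H2O's.

```python
import re

# Small illustrative subset of the stop-word list (assumed for the example)
STOP_WORDS_SUBSET = {"and", "so", "the", "is", "in"}


def clean_tokens(text, stop_words=STOP_WORDS_SUBSET):
    """Apply the cleaning steps from the list above to one string."""
    tokens = re.split(r"\W+", text)                             # tokenize on non-word characters
    tokens = [t.lower() for t in tokens]                        # lower case
    tokens = [t for t in tokens if len(t) >= 2]                 # drop single characters and empties
    tokens = [t for t in tokens if not re.search(r"[0-9]", t)]  # drop tokens containing digits
    tokens = [t for t in tokens if t not in stop_words]         # drop stop words
    return tokens


sample = "And so the robots spared humanity ... https://t.co/abc123"  # hypothetical tweet
print(clean_tokens(sample))
```

Unlike the H2O version, this sketch does not insert a separator between sentences, since it works on one string at a time.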

In [11]:
print("Break tweets into a sequence of words")
words = tokenize(temp['text'])

Break tweets into a sequence of words


In [12]:
words

C1
robots
spared
humanity
https
co
waltmossberg
mims
exactly
tesla




In [13]:
words.shape

(33368, 1)

## Training h2o `word2vec`
- 3 hyperparameters are set here.
- `sent_sample_rate` is used to downsample high-frequency words
- `epochs` specifies how many training iterations should be run
- `vec_size` specifies the number of dimensions of the vector corresponding to each word
- Finally, the model is saved to the desired location

In [14]:
print("Build Word2vec Model")
w2v_model = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 15,vec_size = 100)
w2v_model.train(training_frame=words['C1'])
h2o.save_model(model=w2v_model, path="Model")

Build Word2vec Model
word2vec Model Build progress: |██████████████████████████████████████████| 100%


'D:\\Personal Project\\word2vec\\Model\\Word2Vec_model_python_1552847583187_1'

> Using `find_synonyms` we can find the words that lie closest to a given word in the embedding space
- the result is an ordered dictionary, sorted from the most similar word to the least similar one

In [15]:
w2v_model.find_synonyms("humanity", count = 5)

OrderedDict([('life', 0.7273121476173401),
             ('survival', 0.6959032416343689),
             ('exciting', 0.677527129650116),
             ('them', 0.6674196720123291),
             ('future', 0.662167489528656)])
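
To get a feel for what "close" means here, the sketch below ranks words by cosine similarity on made-up 2-D vectors. We assume that the scores above are cosine similarities, which this notebook does not confirm. The helper `nearest_words` and the toy vectors are hypothetical.

```python
import numpy as np


def nearest_words(word, vectors, count):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = {}
    for other, vec in vectors.items():
        if other == word:
            continue
        cosine = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        scores[other] = float(cosine)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:count]


# Made-up 2-D vectors, only for illustration
toy_vectors = {
    "humanity": np.array([1.0, 0.0]),
    "life": np.array([0.9, 0.1]),
    "future": np.array([0.5, 0.5]),
    "rocket": np.array([0.0, 1.0]),
}

closest = nearest_words("humanity", toy_vectors, count=2)
print(closest)
```

Words whose vectors point in nearly the same direction get a score close to 1, and unrelated directions get a score close to 0.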

**Now we will use the word2vec model to create a vector in 100-D space for each tweet**
- `aggregate_method`: Specifies how to aggregate sequences of words. If the method is NONE, then no aggregation is performed, and each input word is mapped to a single word-vector. With AVERAGE, as used below, the word-vectors of each sequence (here, each tweet) are averaged into a single vector.

In [16]:
vec = w2v_model.transform(words, aggregate_method = "AVERAGE")
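
The idea behind the AVERAGE aggregation can be sketched in NumPy as follows. This is an illustration, not H2O's code: `average_sequences` is a hypothetical helper, `None` stands in for the separator between tweets, the 2-D vectors are made up, and skipping unknown words is an assumption of the sketch.

```python
import numpy as np


def average_sequences(words, vectors, size):
    """Average the word vectors of each sequence; `None` ends a sequence."""
    results, current = [], []
    for word in words:
        if word is None:
            # End of a sequence: emit its average (NaN if it had no known words)
            results.append(np.mean(current, axis=0) if current else np.full(size, np.nan))
            current = []
        elif word in vectors:  # unknown words are skipped (assumption)
            current.append(vectors[word])
    if current:  # last sequence without a trailing separator
        results.append(np.mean(current, axis=0))
    return np.vstack(results)


toy_vectors = {
    "robots": np.array([1.0, 0.0]),
    "spared": np.array([0.0, 1.0]),
    "humanity": np.array([1.0, 1.0]),
    "stormy": np.array([2.0, 2.0]),
    "weather": np.array([4.0, 0.0]),
}
toy_words = ["robots", "spared", "humanity", None, "stormy", "weather", None]

tweet_vectors = average_sequences(toy_words, toy_vectors, size=2)
print(tweet_vectors.shape)
```

Each row of the result corresponds to one tweet, which is why the transformed frame has as many rows as there are tweets.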

### Below are the 100-D vectors for the first two tweets

In [17]:
vec.head(2)

C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26,C27,C28,C29,C30,C31,C32,C33,C34,C35,C36,C37,C38,C39,C40,C41,C42,C43,C44,C45,C46,C47,C48,C49,C50,C51,C52,C53,C54,C55,C56,C57,C58,C59,C60,C61,C62,C63,C64,C65,C66,C67,C68,C69,C70,C71,C72,C73,C74,C75,C76,C77,C78,C79,C80,C81,C82,C83,C84,C85,C86,C87,C88,C89,C90,C91,C92,C93,C94,C95,C96,C97,C98,C99,C100
-0.140124,0.115578,0.137378,-0.192654,-0.142361,-0.222518,-0.212257,0.0944436,-0.0500358,0.0408906,0.125791,0.0715994,-0.24141,-0.0363606,0.0747646,-0.087048,0.0961847,0.318623,0.127406,-0.00814109,0.183734,-0.190772,-0.197582,0.0558142,0.268359,0.000782574,-0.154784,0.0138605,0.199373,0.0324799,-0.00671948,-0.0164448,-0.0166572,-0.0428746,-0.0825788,0.104213,-0.0518572,0.109531,0.198774,-0.118415,0.00451641,0.131714,0.0301582,-0.17077,-0.232667,0.0119375,0.107875,0.0990394,0.172051,0.00331667,0.259763,-0.166099,0.0267812,-0.234313,0.220877,-0.203211,0.158323,0.137339,0.116616,0.193931,-0.109274,0.10889,0.0208958,0.197201,0.0186227,0.0834158,-0.18026,0.152136,0.270947,-0.0115342,0.10143,0.187187,-0.0813169,-0.137847,-0.0931884,-0.0441091,0.185116,-0.105353,0.0485673,-0.177656,0.109172,0.179267,0.19912,-0.130298,0.0919077,-0.245947,0.165589,0.1208,0.090146,0.0141273,-0.0353326,-0.292456,0.107669,-0.0418254,0.226459,0.198535,0.161463,-0.0135548,0.0811656,-0.304476
-0.16967,0.206854,0.0421439,-0.00396038,-0.0131857,-0.115076,-0.190208,0.0758914,-0.11229,0.14289,0.0355645,0.0902681,-0.178689,0.0229995,0.0934682,0.00935052,0.129002,0.0558293,0.0428345,0.0288177,0.0267488,-0.0828091,-0.0788152,0.0619851,0.146143,0.0478584,-0.179795,-0.0174663,0.0743986,0.0233767,-0.0126196,-0.0670774,-0.0221247,-0.0282587,0.00912785,0.0235637,0.0228713,-0.0478296,0.144907,0.0681545,0.0351881,0.0359154,0.0262917,-0.064895,-0.178923,-0.148187,0.0323386,0.084935,0.0677829,-0.0490558,0.0490368,-0.145193,0.0257258,-0.140599,0.111788,-0.17087,0.0740502,0.0879119,0.00678276,0.140466,-0.129329,0.0556249,0.0719198,0.0881176,0.0513157,0.0352245,-0.0678333,0.131806,0.11959,0.0148897,0.00791822,0.182454,-0.0761149,-0.0855969,-0.0708252,-0.0503814,0.241522,-0.100864,-0.0274989,-0.171782,0.0719592,0.135548,0.0789291,-0.0341286,0.0393357,-0.0596421,0.152296,-0.00441801,0.016978,-0.0667887,-0.0230533,-0.138561,0.137034,-0.0305867,0.147919,0.195176,0.28544,0.0863056,0.159609,-0.14301




In [18]:
vec.shape

(2819, 100)

## Saving the word2vec vectors and the corresponding tweets for use in other models

In [19]:
h2o.export_file(vec,'w2v_vectors_lat.csv')
h2o.export_file(temp,'w2v_words_lat.csv')

Export File progress: |███████████████████████████████████████████████████| 100%
Export File progress: |███████████████████████████████████████████████████| 100%


# To be continued ....