<p style="color:#153462; 
          font-weight: bold; 
          font-size: 30px; 
          font-family: Gill Sans, sans-serif; 
          text-align: center;">
          Word2Vec</p>

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       Word2Vec is a shallow, two-layer neural network that takes a text corpus as input and returns a set
       of vectors (also known as embeddings). Each vector is a numeric representation of a given word. It can
       capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.
       The embedding can be trained with either of two architectures (both neural networks):
       Skip-Gram and Continuous Bag of Words (CBOW).
   </font>
</p>
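
To make the difference between the two architectures concrete, here is a minimal sketch of the training examples each one is built from. The helper names `skipgram_pairs` and `cbow_pairs` are made up for this illustration and are not part of gensim. Real Word2Vec training adds further steps (such as subsampling and negative sampling) that are left out here.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs: each word predicts its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_pairs(tokens, window):
    """Return (context_words, target) pairs: the neighbours predict the word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


sentence = ["good", "is", "better", "than", "nothing"]
print(skipgram_pairs(sentence, window=1))
print(cbow_pairs(sentence, window=1))
```

Skip-Gram produces one example per (word, neighbour) pair, while CBOW produces one example per word with all of its neighbours grouped together.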

### Importing Required Modules

In [42]:
import gensim.downloader as api
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

In [2]:
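# Load pretrained word vectors using gensim.
# Note: "glove-wiki-gigaword-100" holds GloVe vectors (100 dimensions), not Word2Vec ones,
# but they are exposed through the same KeyedVectors interface.
# Other pretrained models are listed at the link below:
# https://github.com/RaRe-Technologies/gensim-data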
wiki_embeddings = api.load("glove-wiki-gigaword-100")

In [3]:
# Inspect the vector stored for the word "king"
wiki_embeddings["king"]

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [4]:
# Find the words most similar to "king" according to the pretrained vectors
wiki_embeddings.most_similar("king")

[('prince', 0.7682328820228577),
 ('queen', 0.7507690787315369),
 ('son', 0.7020888328552246),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.6919989585876465),
 ('kingdom', 0.6811409592628479),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712858080863953),
 ('ii', 0.6676074266433716)]
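
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that ranking on a few hypothetical 3-dimensional vectors; the numbers in `toy_vectors` are invented for illustration and do not come from the GloVe model. The `most_similar` function here is a simplified stand-in, not gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors, made up for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(most_similar("king", toy_vectors))
```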

### Training our own vectorizing model

#### Understanding with simple examples

In [19]:
X_dummy_data = pd.Series([["something", "is", "better", "than", "nothing"], 
                          ["good", "is", "better", "than", "nothing"],
                          ["Love", "is", "good", "but", "break","up", "is","better"]])

In [20]:
X_dummy_data

0          [something, is, better, than, nothing]
1               [good, is, better, than, nothing]
2    [Love, is, good, but, break, up, is, better]
dtype: object

In [21]:
w2v_model = gensim.models.Word2Vec(X_dummy_data,
                                   vector_size=100, # dimensionality of each word vector;
                                                    # consider increasing it for larger corpora
                                   window=3, # `window` is the maximum distance between the current and predicted word
                                             # within a sentence
                                   min_count=2 # ignore all words with total frequency lower than this
                                  )

In [64]:
w2v_model.wv.vectors[0]

array([-2.28168786e-01,  2.94776231e-01,  5.42313941e-02,  3.35366116e-04,
        1.37255648e-02, -6.58287704e-01,  1.82953089e-01,  8.47477376e-01,
       -2.45199174e-01, -2.15501830e-01, -2.35695377e-01, -5.95315695e-01,
       -1.27114728e-01,  2.03529745e-01,  1.32459402e-01, -3.54928523e-01,
        5.92925549e-02, -5.11806726e-01, -1.64134875e-02, -7.97453165e-01,
        2.51370192e-01,  2.02712864e-01, -6.05017506e-03, -1.01823099e-01,
       -5.40905446e-02,  1.32755460e-02, -3.62832636e-01, -3.20576549e-01,
       -3.55743676e-01,  4.59524542e-02,  4.24092948e-01,  7.39516020e-02,
        5.42852655e-02, -1.32181421e-01, -9.67487916e-02,  4.10268098e-01,
        3.39242704e-02, -3.58483493e-01, -2.31161505e-01, -7.17128217e-01,
       -3.46506536e-02, -2.92194545e-01, -1.77535862e-01,  1.11085422e-01,
        3.74869108e-01, -1.08029746e-01, -2.63971657e-01,  1.52991165e-03,
        1.60804272e-01,  1.98200151e-01,  1.29610568e-01, -2.82718152e-01,
       -1.12879865e-01, -

In [22]:
w2v_model.wv.key_to_index

{'is': 0, 'better': 1, 'good': 2, 'nothing': 3, 'than': 4}

In [23]:
w2v_model.wv.vectors.shape

(5, 100)

In [24]:
w2v_model.wv.index_to_key

['is', 'better', 'good', 'nothing', 'than']
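
Only five of the ten distinct words survive because of `min_count=2`. The following sketch reproduces that filtering with a plain word count. `build_vocab` is a hypothetical helper, not a gensim function, and words with equal counts may come out in a different order than in gensim.

```python
from collections import Counter

sentences = [
    ["something", "is", "better", "than", "nothing"],
    ["good", "is", "better", "than", "nothing"],
    ["Love", "is", "good", "but", "break", "up", "is", "better"],
]


def build_vocab(sentences, min_count):
    """Keep words seen at least `min_count` times, most frequent first."""
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = [(word, n) for word, n in counts.most_common() if n >= min_count]
    return [word for word, _ in kept]


vocab = build_vocab(sentences, min_count=2)
print(vocab)
print(len(vocab))
```

Words such as "something" or "Love" appear only once, so they are dropped and get no vector.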

#### Applying the vectorizing model to our spam data

In [25]:
messages_df = pd.read_csv("data/spam.csv", encoding="latin-1")
messages_df = messages_df.drop(labels=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
messages_df.columns = ["label", "text"]
messages_df.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [26]:
# Lowercase each message and remove stop words.
messages_df["text"] = messages_df["text"].apply(lambda x: gensim.parsing.preprocessing.remove_stopwords(x.lower()))
# Convert each message into a list of tokens. simple_preprocess lowercases and tokenizes
# (de-accenting is optional); the resulting tokens are unicode strings that need no further processing.
messages_df["text_clean"] = messages_df["text"].apply(lambda x: gensim.utils.simple_preprocess(x))
messages_df.head()

|   | label | text | text_clean |
|---|-------|------|------------|
| 0 | ham | jurong point, crazy.. available bugis n great ... | [jurong, point, crazy, available, bugis, great... |
| 1 | ham | ok lar... joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | free entry 2 wkly comp win fa cup final tkts 2... | [free, entry, wkly, comp, win, fa, cup, final,... |
| 3 | ham | u dun early hor... u c say... | [dun, early, hor, say] |
| 4 | ham | nah don't think goes usf, lives | [nah, don, think, goes, usf, lives] |
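
The `text_clean` column shows that single-letter tokens such as "u" and digits such as "2" disappear. A rough standard-library approximation of that tokenization is sketched below. `rough_preprocess` is a hypothetical helper, and the length limits of 2 and 15 are assumed to mirror the defaults of `simple_preprocess`; the real function may differ in edge cases.

```python
import re


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep alphabetic runs only, drop very short or very long tokens."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]


print(rough_preprocess("Ok lar... Joking wif u oni..."))
print(rough_preprocess("nah don't think goes usf, lives"))
```

The apostrophe in "don't" splits the word into "don" and "t", and "t" is then removed by the minimum-length rule.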


In [75]:
X_train, X_test, y_train, y_test = train_test_split(messages_df["text_clean"], 
                                                    messages_df["label"],
                                                    test_size=0.20)

In [105]:
# Documentation: https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.html
w2v_model = gensim.models.Word2Vec(X_train,
                                   vector_size=100, # dimensionality of each word vector
                                   window=3,        # maximum distance between the current and predicted word
                                                    # within a sentence
                                   min_count=2      # ignore all words with total frequency lower than this
                                  )

In [104]:
# Shape is (vocabulary size, vector size): one 100-element vector per retained word
w2v_model.wv.vectors.shape

(3327, 100)

In [77]:
w2v_model.wv["king"]

array([-0.01540353,  0.02704115,  0.00047133,  0.00184532,  0.00344139,
       -0.04279326,  0.00590101,  0.05406338, -0.0152276 , -0.01037676,
       -0.00672873, -0.03237762, -0.00977162,  0.01117991,  0.00303108,
       -0.01064968,  0.00362202, -0.03314536, -0.00059913, -0.05471944,
        0.0086899 ,  0.02047253, -0.00240871, -0.00359745, -0.01171398,
        0.01361787, -0.03178739, -0.02237112, -0.02020619,  0.00915507,
        0.01671531,  0.00733537,  0.00442236, -0.00340522,  0.00493995,
        0.0342346 ,  0.00730813, -0.02075255, -0.00889054, -0.05365228,
        0.00112327, -0.02593256, -0.00352555,  0.00921736,  0.01512694,
       -0.02015343, -0.01613283,  0.00396995,  0.01825234,  0.01231018,
        0.01598307, -0.02147548, -0.00420659, -0.01027708, -0.01967317,
        0.02323181,  0.00934186, -0.0069949 , -0.02478185, -0.00246385,
       -0.00148238,  0.01543016,  0.00652061,  0.01073922, -0.02576259,
        0.02429629,  0.01407668,  0.02275086, -0.03823524,  0.03

In [78]:
# Of course, it learned far worse than the pretrained glove-wiki-gigaword-100 model
w2v_model.wv.most_similar("king",
                          topn=5, # number of similar words to return
                         )

[('lol', 0.9684240818023682),
 ('dis', 0.9677614569664001),
 ('maybe', 0.9675440788269043),
 ('told', 0.9673854112625122),
 ('minutes', 0.9673014283180237)]

In [79]:
# The model kept the 2236 words that occur at least twice (min_count=2) and created a 100-element vector for each of them.
w2v_model.wv.vectors.shape

(2236, 100)

In [73]:
# Collect the word vectors of every in-vocabulary word in each test message
# (words the model has not seen are skipped)
word_vec = [np.array([w2v_model.wv[i] for i in ls if i in w2v_model.wv.key_to_index])
          for ls in X_test]

In [83]:
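# Compare the number of tokens in each message with the number of word vectors found for it.
# The vector count can be smaller, because tokens missing from the model vocabulary are skipped.
# A vector count larger than the token count means word_vec was built from a different X_test split
# and should be regenerated.
for i, v in enumerate(word_vec[:10]):
    print(len(X_test.iloc[i]), len(v))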

10 6
12 3
34 11
4 18
5 3
15 7
4 1
6 2
6 3
6 14


In [108]:
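# Average the word vectors of each message into one sentence vector.
# Messages with no in-vocabulary words get a zero vector of the same size.
w2v_vect_avg = []
for vect in word_vec:
    if len(vect) != 0:
        w2v_vect_avg.append(vect.mean(axis=0))
    else:
        w2v_vect_avg.append(np.zeros(100))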

In [109]:
# Every message is now represented by a single 100-element vector, whatever its token count.
for i, v in enumerate(w2v_vect_avg[:10]):
    print(len(X_test.iloc[i]), len(v))

10 100
12 100
34 100
4 100
5 100
15 100
4 100
6 100
6 100
6 100
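
The lookup and averaging steps can also be combined into one reusable function. The sketch below uses plain lists and a dict in place of `w2v_model.wv` so that it runs without any dependencies. `average_vectors` and `toy_lookup` are hypothetical names, and the 2-dimensional vectors are invented for illustration.

```python
def average_vectors(tokens, lookup, dim):
    """Mean of the vectors of known tokens; a zero vector if none is known."""
    known = [lookup[token] for token in tokens if token in lookup]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical 2-dimensional vocabulary, made up for illustration only.
toy_lookup = {"free": [1.0, 0.0], "win": [0.0, 1.0]}

print(average_vectors(["free", "win", "unseen"], toy_lookup, dim=2))
print(average_vectors(["unseen"], toy_lookup, dim=2))
```

Returning a zero vector for messages without any known word keeps every row the same length, which downstream classifiers generally require.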


### Resources
1. https://www.tensorflow.org/tutorials/text/word2vec