<p style="color:#153462; 
          font-weight: bold; 
          font-size: 30px; 
          font-family: Gill Sans, sans-serif; 
          text-align: center;">
          Word2Vec</p>

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       Word2Vec is a shallow, two-layer neural network that takes a text corpus as input and returns a set
       of vectors (also known as embeddings). Each vector is a numeric representation of a given word. These vectors
       can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.
       Word2Vec can be trained using two architectures (both involving
       neural networks): Skip-gram and Continuous Bag of Words (CBOW).
   </font>
</p>
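
To make the difference between the two architectures concrete, here is a minimal pure-Python sketch of the training examples each one would see for a single sentence. The helper name `training_examples` and the fixed `window` are illustrative assumptions, not part of gensim, and real implementations add details (such as negative sampling) that are omitted here. Skip-gram predicts each context word from the center word, while CBOW predicts the center word from all of its context words.

```python
def training_examples(tokens, window=2):
    """Return (skip_gram_pairs, cbow_examples) for one tokenized sentence.

    Hypothetical helper for illustration only; it is not gensim's API.
    """
    skip_gram_pairs = []  # (center word, one context word)
    cbow_examples = []    # (list of context words, center word)
    for i, center in enumerate(tokens):
        # Context = up to `window` words on each side of the center word.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if not context:
            continue
        # Skip-gram: one training pair per (center, context word).
        skip_gram_pairs.extend((center, ctx) for ctx in context)
        # CBOW: one training example per center word, using the whole context.
        cbow_examples.append((context, center))
    return skip_gram_pairs, cbow_examples


sentence = ["something", "is", "better", "than", "nothing"]
pairs, cbow = training_examples(sentence, window=1)
print(pairs[:3])
print(cbow[:2])
```

With `window=1`, each word is paired only with its immediate neighbours; a larger window yields more pairs per word.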

### Importing Required Modules

In [1]:
import gensim.downloader as api
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split

In [4]:
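# Loading pretrained word vectors using gensim
# (note: "glove-wiki-gigaword-100" contains 100-dimensional GloVe vectors, not Word2Vec vectors)
# Other pretrained models can be found at the link below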
# https://github.com/RaRe-Technologies/gensim-data
wiki_embeddings = api.load("glove-wiki-gigaword-100")

In [5]:
# Exploring the word vector for "king"
wiki_embeddings["king"]

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [6]:
# Finding the words most similar to "king" based on the pretrained word vectors
wiki_embeddings.most_similar("king")

[('prince', 0.7682328820228577),
 ('queen', 0.7507690787315369),
 ('son', 0.7020888328552246),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.6919989585876465),
 ('kingdom', 0.6811409592628479),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712858080863953),
 ('ii', 0.6676074266433716)]
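
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that calculation in plain Python on made-up 3-dimensional vectors; the values in `toy_vectors` and the helper `rank_by_similarity` are assumptions for illustration and are not taken from the GloVe model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (not real model output).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def rank_by_similarity(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


ranking = rank_by_similarity("king", toy_vectors)
print(ranking)
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0.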

### Training our own vectorizing model

#### Understanding with simple examples

In [96]:
X_dummy_data = pd.Series([["something", "is", "better", "than", "nothing"], 
                          ["good", "is", "better", "than", "nothing"],
                          ["Love", "is", "good", "but", "break","up", "is","better"]])

In [97]:
X_dummy_data

0          [something, is, better, than, nothing]
1               [good, is, better, than, nothing]
2    [Love, is, good, but, break, up, is, better]
dtype: object

In [98]:
w2v_model = gensim.models.Word2Vec(X_dummy_data,
                                   vector_size=100, # dimensionality of each word vector;
                                                    # if your corpus has a larger vocabulary, consider increasing it
                                   window=3, #`window` is the maximum distance between the current and predicted word 
                                             # within a sentence.
                                   min_count=2 # ignore all words with total frequency lower than this
                                  )

In [99]:
w2v_model.wv.key_to_index

{'is': 0, 'better': 1, 'good': 2, 'nothing': 3, 'than': 4}

In [101]:
w2v_model.wv.vectors.shape

(5, 100)
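
Only five of the ten distinct tokens made it into the vocabulary because of `min_count=2`. The sketch below reproduces that filtering step with a plain `Counter`; it only illustrates the frequency cutoff and is not how gensim builds its vocabulary internally.

```python
from collections import Counter

sentences = [
    ["something", "is", "better", "than", "nothing"],
    ["good", "is", "better", "than", "nothing"],
    ["Love", "is", "good", "but", "break", "up", "is", "better"],
]
min_count = 2  # same threshold as in the cell above

# Count how often each token occurs across all sentences.
counts = Counter(token for sentence in sentences for token in sentence)

# Keep only the tokens that occur at least `min_count` times.
vocabulary = {word for word, n in counts.items() if n >= min_count}
print(sorted(vocabulary))
```

Words such as "something" and "Love" occur only once, so they are dropped and get no vector.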

#### Applying the vectorizing model to our spam data

In [106]:
messages_df = pd.read_csv("data/spam.csv", encoding="latin-1")
messages_df = messages_df.drop(labels=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
messages_df.columns = ["label", "text"]
messages_df.head()

,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [108]:
# remove_stopwords drops common English stop words from the lowercased text.
# simple_preprocess converts a document into a list of tokens: it lowercases, tokenizes,
# and optionally de-accents. The output tokens are final unicode strings that are not processed any further.
messages_df["text"] = messages_df["text"].apply(lambda x: gensim.parsing.preprocessing.remove_stopwords(x.lower()))
messages_df["text_clean"] = messages_df["text"].apply(lambda x: gensim.utils.simple_preprocess(x))
messages_df.head()

,label,text,text_clean
0,ham,"jurong point, crazy.. available bugis n great ...","[jurong, point, crazy, available, bugis, great..."
1,ham,ok lar... joking wif u oni...,"[ok, lar, joking, wif, oni]"
2,spam,free entry 2 wkly comp win fa cup final tkts 2...,"[free, entry, wkly, comp, win, fa, cup, final,..."
3,ham,u dun early hor... u c say...,"[dun, early, hor, say]"
4,ham,"nah don't think goes usf, lives","[nah, don, think, goes, usf, lives]"
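
The output above shows that digits, punctuation, and one-letter tokens (such as "u" or "2") disappear during tokenization. The following is a rough approximation of that behaviour using a regular expression; `simple_tokens` is a hypothetical helper, and the length limits of 2 and 15 mirror what I understand to be the defaults of `simple_preprocess`. It is a sketch, not a replacement for the gensim function.

```python
import re


def simple_tokens(text, min_len=2, max_len=15):
    """Approximate tokenizer: lowercase, keep runs of letters, filter by length.

    Hypothetical helper for illustration; gensim's own tokenizer may differ
    in edge cases (for example underscores or accented characters).
    """
    # [^\W\d_]+ matches runs of letters only (no digits, no underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if min_len <= len(w) <= max_len]


print(simple_tokens("ok lar... joking wif u oni..."))
print(simple_tokens("nah don't think goes usf, lives"))
```

Note how "don't" is split at the apostrophe and the leftover "t" is dropped for being too short, which matches the `don` token in the table above.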


In [109]:
X_train, X_test, y_train, y_test = train_test_split(messages_df["text_clean"], 
                                                    messages_df["label"],
                                                    test_size=0.20)

In [119]:
# Documentation: https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.html
w2v_model = gensim.models.Word2Vec(X_train,
                                   vector_size=100, # 'vector_size' is the dimensionality of each word vector;
                                                    # if your corpus has a larger vocabulary, consider increasing it
                                   window=3,        # 'window' is the maximum distance between the current and predicted word
                                                    # within a sentence.
                                   min_count=3      # 'min_count' ignores all words with total frequency lower than this
                                  )

In [120]:
w2v_model.wv["king"]

array([-1.36129437e-02,  1.76094379e-02,  1.56155077e-03, -3.53098987e-03,
        1.32325087e-02, -2.83463877e-02,  1.37087665e-02,  3.64526212e-02,
       -1.69212036e-02, -1.99795049e-02, -2.38756370e-03, -3.16845737e-02,
        1.64627004e-03,  9.85298213e-03,  4.04103484e-05, -1.41845420e-02,
        3.15313693e-03, -2.30012406e-02, -8.58663209e-03, -5.16244248e-02,
        1.34838400e-02, -1.09210867e-03, -3.39793344e-03, -3.59541550e-03,
       -9.28980298e-03, -6.90292520e-03, -1.27994083e-02, -2.62981653e-02,
       -2.69380827e-02,  8.98756739e-03,  3.11183818e-02, -1.76429551e-03,
        3.53080616e-03, -1.62198786e-02, -6.05623564e-03,  2.85396427e-02,
        5.89518622e-03, -1.42099345e-02, -1.56633798e-02, -4.96917777e-02,
        1.86526054e-03, -2.25843303e-02,  4.43299767e-03,  6.35256385e-03,
        1.95551198e-02, -9.88571439e-03, -1.19930981e-02, -4.99160355e-03,
        5.92284417e-03,  1.60462763e-02, -5.70295798e-03, -1.29251266e-02,
       -1.97508442e-03, -

In [114]:
# Of course, trained only on this small spam corpus, it learned much poorer similarities
# than the pretrained glove-wiki-gigaword-100 model
w2v_model.wv.most_similar("king", 
                          topn=5, # number of most similar words to return
                         )

[('goin', 0.9480611681938171),
 ('am', 0.9469554424285889),
 ('call', 0.9462488293647766),
 ('takes', 0.9462072849273682),
 ('muz', 0.9460297226905823)]

In [117]:
# The model kept 2232 words that occur at least 3 times in the training data and created a 100-element vector for each word.
w2v_model.wv.vectors.shape

(2232, 100)

### Resources
1. https://www.tensorflow.org/tutorials/text/word2vec