<p style="color:#153462; 
          font-weight: bold; 
          font-size: 30px; 
          font-family: Gill Sans, sans-serif; 
          text-align: center;">
          Word2Vec</p>

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       Word2Vec is a shallow, two-layer neural network that takes a text corpus as input and returns a set
       of vectors (also known as embeddings). Each vector is a numeric representation of a given word. These vectors
       can capture a word's context in a document, its semantic and syntactic similarity to other words, and its
       relations with other words. The network can be trained with one of two architectures:
       Skip-gram and Continuous Bag of Words (CBOW).
   </font>
</p>
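The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The following is a minimal pure-Python sketch, not gensim's implementation: the function names `skipgram_pairs` and `cbow_pairs` and the toy sentence are made up for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """CBOW: predict the center word from all of its context words."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


# Hypothetical toy sentence, used only to show the shape of the pairs
sentence = ["the", "king", "rules", "the", "kingdom"]
print(skipgram_pairs(sentence, window=1)[:3])
# [('the', 'king'), ('king', 'the'), ('king', 'rules')]
print(cbow_pairs(sentence, window=1)[:2])
# [(['king'], 'the'), (['the', 'rules'], 'king')]
```

In gensim, the `sg` argument of `Word2Vec` selects between the two: `sg=0` (the default) trains CBOW and `sg=1` trains skip-gram.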

### Importing Required Modules

In [9]:
import gensim.downloader as api
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split

In [3]:
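# Load pretrained word vectors through the gensim downloader.
# Note: "glove-wiki-gigaword-100" holds 100-dimensional GloVe vectors, not vectors trained with Word2Vec.
# Other pretrained models are listed at the link below: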
# https://github.com/RaRe-Technologies/gensim-data
wiki_embeddings = api.load("glove-wiki-gigaword-100")



In [4]:
# Inspect the word vector for "king"
wiki_embeddings["king"]

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [5]:
# Find the words most similar to "king" according to the pretrained word vectors
wiki_embeddings.most_similar("king")

[('prince', 0.7682328820228577),
 ('queen', 0.7507690787315369),
 ('son', 0.7020888328552246),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.6919989585876465),
 ('kingdom', 0.6811409592628479),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712858080863953),
 ('ii', 0.6676074266433716)]
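The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces the idea in plain Python on hand-made 3-dimensional vectors. The vectors in `toy` and the helper `toy_most_similar` are hypothetical and only illustrate the ranking; they are not real embeddings or gensim code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen by hand for illustration
toy = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.0, 0.9],
}


def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


for other, score in toy_most_similar("king", toy):
    print(other, round(score, 3))
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are close to orthogonal.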

### Training our own model

In [6]:
messages_df = pd.read_csv("data/spam.csv", encoding="latin-1")
messages_df = messages_df.drop(labels=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
messages_df.columns = ["label", "text"]
messages_df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [37]:
# First lowercase each message and remove stopwords, then convert it into a list of tokens.
# simple_preprocess lowercases, tokenizes, and optionally de-accents the text; the output
# tokens are final unicode strings that won't be processed any further.
messages_df["text"] = messages_df["text"].apply(lambda x: gensim.parsing.preprocessing.remove_stopwords(x.lower()))
messages_df["text_clean"] = messages_df["text"].apply(lambda x: gensim.utils.simple_preprocess(x))
messages_df.head()

Unnamed: 0,label,text,text_clean
0,ham,"jurong point, crazy.. available bugis n great ...","[jurong, point, crazy, available, bugis, great..."
1,ham,ok lar... joking wif u oni...,"[ok, lar, joking, wif, oni]"
2,spam,free entry 2 wkly comp win fa cup final tkts 2...,"[free, entry, wkly, comp, win, fa, cup, final,..."
3,ham,u dun early hor... u c say...,"[dun, early, hor, say]"
4,ham,"nah don't think goes usf, lives","[nah, don, think, goes, usf, lives]"
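The `text_clean` column shows that single-character tokens such as `u` and digits such as `2` are dropped, and that `don't` is split at the apostrophe. The following is a rough pure-Python approximation of that behaviour, assuming the default length limits of `simple_preprocess` (2 to 15 characters). The name `approx_simple_preprocess` and the regular expression are assumptions made for illustration, not gensim's actual code.

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep runs of letters, drop very short or very long tokens."""
    # [^\W\d_]+ matches runs of letters only (no digits, underscores, or punctuation)
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(approx_simple_preprocess("nah don't think goes usf, lives"))
# ['nah', 'don', 'think', 'goes', 'usf', 'lives']
print(approx_simple_preprocess("ok lar... joking wif u oni..."))
# ['ok', 'lar', 'joking', 'wif', 'oni']
```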


In [38]:
X_train, X_test, y_train, y_test = train_test_split(messages_df["text_clean"], 
                                                    messages_df["label"],
                                                    test_size=0.20)
# Documentation: https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.html
w2v_model = gensim.models.Word2Vec(X_train,
                                   vector_size=100,  # dimensionality of each word vector
                                   window=3,         # maximum distance between the current and predicted word within a sentence
                                   min_count=3       # ignore all words with total frequency lower than this
                                  )
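To make the effect of `min_count` concrete, here is a small sketch of frequency-based vocabulary filtering on a made-up corpus. The helper `build_vocab` and the data in `toy_corpus` are hypothetical; gensim builds its vocabulary internally, and this only mirrors the rule described in the comment above.

```python
from collections import Counter


def build_vocab(sentences, min_count=3):
    """Count tokens over all sentences and keep those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical tokenized messages, for illustration only
toy_corpus = [
    ["free", "entry", "win"],
    ["win", "free", "prize"],
    ["free", "win", "now"],
    ["ok", "lar"],
]

vocab = build_vocab(toy_corpus, min_count=3)
print(vocab)  # {'free': 3, 'win': 3}
```

A word filtered out this way gets no vector, so looking it up in `w2v_model.wv` raises a `KeyError`.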

In [39]:
w2v_model.wv["king"]

array([-0.01038788,  0.01430211, -0.00174529,  0.00702417,  0.01507522,
       -0.04153308,  0.01270943,  0.0454486 , -0.01883297, -0.0072866 ,
        0.00134877, -0.03388841, -0.01434648,  0.00882359,  0.00890319,
       -0.02159827,  0.00762409, -0.01556034, -0.00138938, -0.03971127,
        0.01525787,  0.00120921,  0.01363898, -0.01067892,  0.00040899,
        0.00362042, -0.01863683, -0.01215883, -0.02106537, -0.00254103,
        0.02084519,  0.00406907,  0.0015666 , -0.00930545,  0.00360219,
        0.02288773,  0.00624533, -0.01466221, -0.00341158, -0.02754484,
       -0.00296511, -0.02049479, -0.00642643,  0.00969344,  0.01968034,
       -0.01694279, -0.01895383, -0.01204172,  0.00993854,  0.02615556,
       -0.00333178, -0.00763972, -0.01172165,  0.00725493, -0.00563524,
        0.01652205,  0.00522912, -0.00728336, -0.01672616,  0.00798486,
        0.01129118,  0.00052489, -0.00071299, -0.00167633, -0.01569651,
        0.01471573,  0.00684515,  0.00674684, -0.02761444,  0.01

In [44]:
# Of course, it learned very poorly: every similarity score below is close to 1
w2v_model.wv.most_similar("come",
                          topn=10,  # number of similar words to return
                         )

[('ur', 0.9992110729217529),
 ('txt', 0.9991639852523804),
 ('like', 0.9991474747657776),
 ('me', 0.99908846616745),
 ('now', 0.9990391135215759),
 ('said', 0.9990137219429016),
 ('reply', 0.998992383480072),
 ('new', 0.9989880323410034),
 ('want', 0.998987078666687),
 ('know', 0.9989732503890991)]

### Resources
1. https://www.tensorflow.org/tutorials/text/word2vec