### word2vec

- A two-layer **neural network** that accepts a text corpus as input

- Returns a set of **vectors** (embeddings)

- Embeddings: words that are closer in the vector space are expected to be similar in meaning

- One **numeric** vector for each word
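
To make "closer in the vector space" concrete, the sketch below measures closeness with cosine similarity, a common choice for word embeddings. The three 3-dimensional vectors are made-up values for illustration only, not real embeddings, and `cosine_similarity` is a helper defined here rather than a library function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors, chosen by hand for illustration only.
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.8, 0.9, 0.2])
banana = np.array([0.1, 0.0, 0.9])

sim_king_queen = cosine_similarity(king, queen)
sim_king_banana = cosine_similarity(king, banana)

# Related words should score higher than unrelated ones.
print(sim_king_queen > sim_king_banana)
```

With real embeddings the vectors have many more dimensions (100 in the examples that follow), but the comparison works the same way.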

e.g. the sentence "My Name is Kirankumar" is passed to the two-layer neural network.

The neural network returns a numeric vector for each word (My, Name, is, Kirankumar).

These numeric vectors capture what each word represents.

The model learns what a word **means** by looking at its surrounding words.
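
As a rough illustration of "looking at surrounding words", the sketch below lists the (target, context) pairs that a sliding window would produce for the example sentence. `context_pairs` and its `window` argument are illustrative names, not part of gensim, and real word2vec training involves further details that are left out here.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["my", "name", "is", "kirankumar"]
pairs = context_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

With `window=1` each word is paired only with its immediate neighbours, so "name" is paired with "my" and "is" but not with "kirankumar".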

### Explore Pre Trained Embeddings

- **glove-twitter**

- **glove-wiki-gigaword-100** : each vector has length 100

- **word2vec-google-news**

In [1]:
import gensim.downloader as api

wikipedia_embeddings = api.load('glove-wiki-gigaword-100')

Explore the Word Vector for **King**

In [2]:
wikipedia_embeddings['king']

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

Find the Words Most Similar to King based on the **Pre-Trained** Word Vectors

In [3]:
wikipedia_embeddings.most_similar('king')

[('prince', 0.7682329416275024),
 ('queen', 0.7507690191268921),
 ('son', 0.7020887136459351),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.6919990181922913),
 ('kingdom', 0.6811410188674927),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712857484817505),
 ('ii', 0.6676074266433716)]

**Train** Our Own Model

In [4]:
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth',150)

msg = pd.read_csv('../Data/Spam.csv', encoding='latin-1')
msg.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'], inplace=True)
msg.rename(columns={'v1':'Label', 'v2':'Text'}, inplace=True)
msg.head()

| | Label | Text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 0845281007... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


Clean the Data using Gensim's Built-in **Cleaner**

In [5]:
msg['Clean Text'] = msg['Text'].apply(gensim.utils.simple_preprocess)
msg.head()

| | Label | Text | Clean Text |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, there, got, amore, wat] |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 0845281007... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive, entry, question, std, txt, rate, apply, over] |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
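
The cleaned column shows that the text was lowercased, punctuation and digits were dropped, and one-letter tokens such as "u" and "n" were removed. The sketch below approximates that behaviour with a regular expression. `rough_preprocess` is a hypothetical stand-in written for illustration, not gensim's actual implementation; its `min_len=2` and `max_len=15` defaults mirror the defaults of `simple_preprocess`.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep alphabetic tokens, drop very short or very long ones."""
    # [^\W\d]+ matches runs of word characters that are not digits.
    tokens = re.findall(r"[^\W\d]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(rough_preprocess("Ok lar... Joking wif u oni..."))
# ['ok', 'lar', 'joking', 'wif', 'oni']
```

This also explains why "don't" becomes "don" in the table: the apostrophe splits the word, and the leftover "t" is too short to keep.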


Split the Dataset into **Train** Set and **Test** Set

In [6]:
X_train, X_test, y_train, y_test = train_test_split(msg['Clean Text'], 
                                                    msg['Label'],
                                                    test_size=0.2, 
                                                    random_state=42)

Train the **word2vec** Model

In [7]:
w2v = gensim.models.Word2Vec(X_train,
                             size=100,     # Length of each word vector (this argument is named `vector_size` in gensim 4.0+)
                             window=5,     # Number of words to look at before and after the target word
                             min_count=2)  # Minimum number of times a word must appear in the corpus to be kept
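
The effect of `min_count` can be sketched with a plain word count: words seen fewer than `min_count` times never enter the vocabulary, so the model has no vector for them. The `build_vocab` helper and the three toy sentences below are assumptions made for illustration; gensim builds its vocabulary internally.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep only the words that occur at least `min_count` times overall."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Hypothetical tokenized messages, for illustration only.
sentences = [
    ["free", "entry", "to", "win"],
    ["ok", "lar"],
    ["free", "txt", "to", "win"],
]
vocab = build_vocab(sentences, min_count=2)
print(sorted(vocab))
# ['free', 'to', 'win']
```

Rare words such as "entry" or "lar" are dropped here, which is the reason a message can later map to fewer vectors than it has words.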

Explore Word Vector for King based on Trained Model

In [8]:
w2v.wv['king']

array([ 0.02112135, -0.04699536, -0.00183464,  0.06589616, -0.00281708,
       -0.01595385, -0.03129307, -0.0271321 , -0.02606255, -0.04593177,
       -0.00301043,  0.01025276, -0.01764861, -0.01973836,  0.01935757,
       -0.03919063,  0.02259323, -0.05420363, -0.01302793,  0.04470475,
       -0.00982729, -0.0534529 , -0.03276915, -0.05482113, -0.01965195,
       -0.02171403,  0.01809428,  0.0040285 , -0.03911978, -0.03532439,
       -0.00972071,  0.02560821,  0.07207319, -0.05749268,  0.04741387,
        0.0217126 ,  0.04189188, -0.084478  ,  0.01729113, -0.01384825,
       -0.05672732,  0.09623412,  0.03715248,  0.02482585,  0.06136093,
       -0.04866209, -0.03569631,  0.00607663, -0.02420629,  0.03659953,
       -0.04459782,  0.04636875, -0.0125243 ,  0.06573644,  0.0630407 ,
        0.00358475, -0.04340429,  0.09723521, -0.03011098,  0.03611866,
        0.07209954,  0.04425444, -0.04550089, -0.0103822 , -0.03265659,
        0.0063421 , -0.00324013,  0.05630175,  0.01905873,  0.00

Most Similar Words to King based on Word Vectors from Trained Model

In [9]:
w2v.wv.most_similar('king')

[('other', 0.9973921179771423),
 ('were', 0.9973630905151367),
 ('change', 0.9973626136779785),
 ('great', 0.9973187446594238),
 ('care', 0.9972983002662659),
 ('wkly', 0.9972978234291077),
 ('missing', 0.9972972869873047),
 ('two', 0.9972882866859436),
 ('man', 0.9972831010818481),
 ('cheap', 0.9972789287567139)]

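The pre-trained **Wikipedia** embeddings are more meaningful than the ones from our own model: the neighbours returned for 'king' above are unrelated words, all with nearly identical similarity scores (about 0.997), which suggests the small SMS corpus was not enough to learn distinct word meanings.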

**Prep** Word Vector

Generate a **List** of All the Words for which the word2vec Model Learned a Word Vector

In [10]:
w2v.wv.index2word

['to',
 'you',
 'the',
 'and',
 'in',
 'is',
 'me',
 'my',
 'it',
 'for',
 'your',
 'of',
 'call',
 'have',
 'that',
 'on',
 'now',
 'are',
 'so',
 'can',
 'not',
 'but',
 'or',
 'at',
 'we',
 'do',
 'ur',
 'get',
 'if',
 'will',
 'just',
 'be',
 'with',
 'no',
 'lt',
 'gt',
 'this',
 'up',
 'how',
 'what',
 'go',
 'when',
 'ok',
 'free',
 'from',
 'all',
 'll',
 'out',
 'know',
 'got',
 'am',
 'good',
 'come',
 'then',
 'like',
 'was',
 'day',
 'its',
 'there',
 'time',
 'he',
 'only',
 'love',
 'send',
 'txt',
 'text',
 'want',
 'going',
 'one',
 'home',
 'stop',
 'by',
 'need',
 'as',
 'lor',
 'she',
 'still',
 'don',
 'sorry',
 'see',
 'about',
 'today',
 'back',
 'da',
 'reply',
 'hi',
 'mobile',
 'dont',
 'pls',
 'our',
 'please',
 'tell',
 'new',
 'they',
 'some',
 'did',
 'been',
 'think',
 'ì_',
 'any',
 'her',
 'phone',
 'later',
 'week',
 'dear',
 'msg',
 'take',
 'where',
 'here',
 'claim',
 'well',
 'him',
 'an',
 'hey',
 'night',
 'more',
 'way',
 're',
 'too',
 'hope',
 

Generate **Aggregated** Sentence Vectors

In [11]:
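vocab = set(w2v.wv.index2word)  # set lookup is much faster than scanning the list for every word
w2v_vector = np.array([np.array([w2v.wv[i] for i in ls if i in vocab]) for ls in X_test], dtype=object)  # rows differ in length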


In [12]:
for index,vector in enumerate(w2v_vector):
    print(len(X_test.iloc[index]), len(vector))

20 10
38 31
17 16
29 29
23 23
9 8
7 7
6 6
21 20
5 4
15 14
20 19
6 5
7 7
9 8
9 9
4 3
14 14
5 5
8 8
12 11
23 8
8 5
11 10
25 25
23 22
15 14
9 8
6 6
4 4
18 18
10 9
24 20
11 9
19 18
24 23
4 4
26 25
5 5
17 13
50 44
9 8
25 25
8 7
16 16
8 6
14 12
6 5
6 5
20 19
26 24
5 1
20 20
14 13
7 7
11 11
14 13
4 3
7 2
12 12
7 7
24 20
8 8
9 9
27 25
8 8
21 19
6 6
20 18
18 18
4 4
3 2
11 9
26 26
18 15
22 20
46 40
10 9
7 7
101 101
5 5
4 2
1 1
13 3
7 5
6 6
28 18
9 9
14 14
24 23
16 13
8 6
12 11
2 2
4 4
17 13
12 12
5 5
5 5
8 5
4 4
6 6
2 2
11 11
10 10
7 6
28 26
14 14
7 7
6 5
4 4
5 4
12 11
8 8
4 4
3 3
7 6
30 25
14 13
15 15
3 3
12 10
20 18
22 18
58 53
5 4
28 26
31 29
29 29
19 19
13 13
5 5
25 23
26 26
10 10
14 14
31 30
11 11
18 17
6 5
5 5
6 6
10 10
25 25
6 6
10 7
37 32
19 17
26 24
25 22
10 10
4 4
16 16
4 4
10 10
10 10
9 5
25 22
8 8
7 6
10 8
27 6
14 11
22 20
6 6
20 19
21 21
20 19
14 13
22 19
9 8
18 15
10 9
40 36
5 5
6 6
20 17
11 10
7 7
3 3
17 16
8 6
21 21
1 1
11 11
6 3
21 21
4 4
12 10
7 6
13 13
6 5
8 8
5 5
1 1
16 14
30

Computing Sentence Vectors by Averaging the Word Vectors

In [13]:
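w2v_Avg_Vector = []
for vector in w2v_vector:
    if len(vector) != 0:
        # Average the word vectors element-wise to get one vector per message
        w2v_Avg_Vector.append(vector.mean(axis=0))
    else:
        # No word of the message is in the vocabulary: fall back to a zero vector of length 100
        w2v_Avg_Vector.append(np.zeros(100))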
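
The same idea on a toy scale may make the averaging easier to follow. In the sketch below, `toy_vectors` is a hypothetical 3-dimensional stand-in for a trained model and `sentence_vector` is an illustrative helper; unknown words are skipped, and a message with no known words falls back to a zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained model.
toy_vectors = {
    "free": np.array([1.0, 0.0, 2.0]),
    "entry": np.array([3.0, 2.0, 0.0]),
}

def sentence_vector(tokens, vectors, dim=3):
    """Average the vectors of the known words; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

avg = sentence_vector(["free", "entry", "wkly"], toy_vectors)
empty = sentence_vector(["wkly"], toy_vectors)
print(avg)    # element-wise mean of the "free" and "entry" vectors
print(empty)  # zero vector, since "wkly" has no vector here
```

Whatever the message length, the result always has the same number of dimensions, which is what a downstream classifier needs.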

In [14]:
for index,vector in enumerate(w2v_Avg_Vector):
    print(len(X_test.iloc[index]), len(vector))

20 100
38 100
17 100
29 100
23 100
9 100
7 100
6 100
21 100
5 100
15 100
20 100
6 100
7 100
9 100
9 100
4 100
14 100
5 100
8 100
12 100
23 100
8 100
11 100
25 100
23 100
15 100
9 100
6 100
4 100
18 100
10 100
24 100
11 100
19 100
24 100
4 100
26 100
5 100
17 100
50 100
9 100
25 100
8 100
16 100
8 100
14 100
6 100
6 100
20 100
26 100
5 100
20 100
14 100
7 100
11 100
14 100
4 100
7 100
12 100
7 100
24 100
8 100
9 100
27 100
8 100
21 100
6 100
20 100
18 100
4 100
3 100
11 100
26 100
18 100
22 100
46 100
10 100
7 100
101 100
5 100
4 100
1 100
13 100
7 100
6 100
28 100
9 100
14 100
24 100
16 100
8 100
12 100
2 100
4 100
17 100
12 100
5 100
5 100
8 100
4 100
6 100
2 100
11 100
10 100
7 100
28 100
14 100
7 100
6 100
4 100
5 100
12 100
8 100
4 100
3 100
7 100
30 100
14 100
15 100
3 100
12 100
20 100
22 100
58 100
5 100
28 100
31 100
29 100
19 100
13 100
5 100
25 100
26 100
10 100
14 100
31 100
11 100
18 100
6 100
5 100
6 100
10 100
25 100
6 100
10 100
37 100
19 100
26 100
25 100
10 100
4 100
1