### doc2vec

- Shallow (Two-Layer) **Neural Network** that accepts a Text Corpus as Input

- Returns a Set of **Vectors** (Embeddings)

- One **Numeric** Vector for each **Sentence** (or, more generally, each Document)

- Each doc2vec Vector is kept inside a **List** in this Notebook, unlike the bare **Array** used with word2vec

e.g. the Sentence "My Name is Kirankumar" is passed to the Two-Layer Neural Network.

The Neural Network returns a single Numeric Vector for the entire Sentence ("My Name is Kirankumar"), not one Vector per Word.


**Train** Our Own Model

In [1]:
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth',150)

# Load the spam dataset, drop the unused columns, and rename the rest
msg = pd.read_csv('../Data/Spam.csv', encoding='latin-1')
msg.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'], inplace=True)
msg.rename(columns={'v1':'Label', 'v2':'Text'}, inplace=True)
msg.head()

| | Label | Text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 0845281007... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


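Clean the Data using Gensim's Built-in **Cleaner** (`simple_preprocess` lowercases the Text, splits it into alphabetic Tokens, and by default drops Tokens shorter than 2 or longer than 15 Characters)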

In [2]:
msg['Clean Text'] = msg['Text'].apply(gensim.utils.simple_preprocess)
msg.head()

| | Label | Text | Clean Text |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, there, got, amore, wat] |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 0845281007... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive, entry, question, std, txt, rate, apply, over] |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
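
The effect of `simple_preprocess` on the rows above can be approximated in plain Python. The sketch below is not Gensim's implementation: `rough_preprocess` is a hypothetical helper that assumes the default length limits (2 to 15 characters) and treats only runs of non-digit word characters as tokens.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate gensim.utils.simple_preprocess (hypothetical helper, not Gensim code)."""
    # Lowercase, then keep runs of word characters that are not digits.
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop tokens outside the length limits and tokens starting with "_".
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]

print(rough_preprocess("Ok lar... Joking wif u oni..."))
# ['ok', 'lar', 'joking', 'wif', 'oni']
```

This is why single letters such as `u` and numbers such as `2005` do not appear in the Clean Text column.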


Split the Dataset into **Train** Set and **Test** Set

In [3]:
X_train, X_test, y_train, y_test = train_test_split(msg['Clean Text'], 
                                                    msg['Label'],
                                                    test_size=0.2, 
                                                    random_state=42)

Create Tagged Document Objects to Train the **doc2vec** Model

In [4]:
# Pair each tokenized message with a unique integer tag (its position in X_train)
tagged_doc = [gensim.models.doc2vec.TaggedDocument(words, [i]) for i, words in enumerate(X_train)]

tagged_doc[0]

TaggedDocument(words=['no', 'in', 'the', 'same', 'boat', 'still', 'here', 'at', 'my', 'moms', 'check', 'me', 'out', 'on', 'yo', 'half', 'naked'], tags=[0])
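
A `TaggedDocument` is a named tuple with two fields, `words` and `tags`. The sketch below uses a stand-in `TaggedDoc` named tuple (hypothetical, so it runs without Gensim) and made-up token lists to show what `enumerate` produces: the tags are positions 0, 1, 2, ... within the training set, not the original row labels of the shuffled DataFrame.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument (illustrative only).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Made-up token lists playing the role of X_train.
docs = [["ok", "lar", "joking"], ["free", "entry", "to", "win"]]

# Each document gets a one-element tag list holding its position.
tagged = [TaggedDoc(words, [i]) for i, words in enumerate(docs)]

print(tagged[1])
# TaggedDoc(words=['free', 'entry', 'to', 'win'], tags=[1])
```

This explains why `tagged_doc[0]` above carries `tags=[0]` even though its text is not the first row of the original data.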

Train Basic Model

In [5]:
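d2v = gensim.models.Doc2Vec(tagged_doc, 
                            vector_size=100,   # length of each document vector
                            window=5,          # max distance between current and predicted word
                            min_count=2)       # ignore words seen fewer than 2 times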

Pass a List of Words to the Model to Infer a Vector for it

In [6]:
d2v.infer_vector(['i','am','learning','nlp'])

array([ 3.82968388e-03,  4.13799053e-03, -6.63741725e-04,  2.40805373e-03,
        7.69969629e-05,  7.21669756e-03,  6.78393478e-03,  1.06860707e-02,
        1.09892734e-03, -7.62522686e-03, -2.73938896e-03,  1.57334656e-02,
        6.02932228e-03, -3.71735357e-03, -1.19462544e-02, -3.49768251e-03,
        1.73992815e-03, -2.83530238e-03, -1.82010245e-03,  1.06529947e-02,
       -3.43838055e-03,  6.64628018e-03, -1.19453052e-03,  2.39211321e-03,
       -5.88744367e-03, -1.15917539e-02,  2.04924308e-03, -1.85354752e-03,
        1.45837956e-03, -8.04794487e-03,  1.46116065e-02, -8.18478409e-04,
       -1.07960682e-02, -8.38901382e-03,  1.96016319e-02, -4.47415514e-03,
       -2.09215493e-03, -9.65477549e-04,  6.97840098e-03, -2.13088561e-02,
        8.44003749e-04,  1.72075015e-02,  5.17110620e-03,  7.43819447e-03,
        5.57388738e-03,  1.15229702e-02,  5.20422123e-03, -1.65762659e-02,
        1.66290104e-02,  3.58728948e-03,  1.84620603e-03, -4.49036481e-03,
        2.92143726e-04, -

Unlike Word Vectors, few **Pretrained** Document Vectors and **APIs** are available

Prepare the Vectors to be used in a **Machine Learning Model**
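
Most scikit-learn estimators expect a 2-D array with one row per document. A minimal sketch of that reshaping step, assuming each inferred vector is wrapped in a one-element list as in the cell that follows; random numbers stand in for real doc2vec output, and `features` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the notebook's `vectors`: one [array] per document, 100 values each.
vectors = [[rng.normal(size=100)] for _ in range(3)]

# Unwrap the inner list and stack the arrays row by row.
features = np.vstack([v[0] for v in vectors])

print(features.shape)
# (3, 100)
```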

In [7]:
# Infer one vector per test message; each array is kept in a one-element list
vectors = [[d2v.infer_vector(words)] for words in X_test]
vectors[0]

[array([ 0.00739162,  0.04333074,  0.02113032, -0.00489163,  0.00191754,
         0.02435322,  0.00541513,  0.02438186, -0.00044579, -0.01962186,
        -0.01884066,  0.06329914,  0.05002325, -0.03069448, -0.03426618,
        -0.01110566, -0.0113846 ,  0.00360292, -0.01247877,  0.04154537,
        -0.02077973,  0.02628371, -0.02138242,  0.01969699, -0.00769653,
        -0.04228035,  0.00259427, -0.02075599, -0.00118174, -0.02084192,
         0.05354832, -0.00400604, -0.0310108 , -0.02013468,  0.0898321 ,
        -0.03106854,  0.00561934, -0.0274569 ,  0.04312494, -0.0816195 ,
         0.00027242,  0.07861911,  0.01129802,  0.03486763,  0.042072  ,
         0.03331375,  0.03659694, -0.05415655,  0.07201695,  0.00617635,
        -0.0055012 , -0.01186855, -0.00340939, -0.02485179, -0.04848693,
         0.0209582 ,  0.03250092,  0.00464377, -0.04998276,  0.02563726,
        -0.04112181, -0.00269202,  0.00824046, -0.06723288, -0.00888634,
         0.01006795, -0.00038465,  0.04945185,  0.0