<p style="color:#153462; 
          font-weight: bold; 
          font-size: 30px; 
          font-family: Gill Sans, sans-serif; 
          text-align: center;">
          Doc2Vec</p>

<p style="text-align: justify; text-justify: inter-word;">
    <font size=3>
        Doc2Vec is a shallow, two-layer neural network that takes a text corpus as input
        and returns a set of vectors (also known as embeddings); each vector is a numeric representation
        of a given sentence, paragraph or document.
    </font>
</p>

### Importing Required Modules

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import gensim

### Cleaning and Reading the Data

In [2]:
# This process is similar to the word2vec workflow
messages_df = pd.read_csv("data/spam.csv", encoding="latin-1")
messages_df = messages_df.drop(labels=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
messages_df.columns = ["label", "text"]
messages_df["text"] = messages_df["text"].apply(lambda x: gensim.parsing.preprocessing.remove_stopwords(x.lower()))
messages_df["text_clean"] = messages_df["text"].apply(lambda x: gensim.utils.simple_preprocess(x))
messages_df.head()

| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | jurong point, crazy.. available bugis n great ... | [jurong, point, crazy, available, bugis, great... |
| 1 | ham | ok lar... joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | free entry 2 wkly comp win fa cup final tkts 2... | [free, entry, wkly, comp, win, fa, cup, final,... |
| 3 | ham | u dun early hor... u c say... | [dun, early, hor, say] |
| 4 | ham | nah don't think goes usf, lives | [nah, don, think, goes, usf, lives] |


In [3]:
X_train, X_test, y_train, y_test = train_test_split(messages_df["text_clean"], 
                                                    messages_df["label"],
                                                    test_size=0.20)

<p style="text-align: justify; text-justify: inter-word;">
    <font size=3>
        One difference between word2vec and doc2vec is that doc2vec requires you to create a tagged document for each document: the token list paired with a unique tag.
    </font>
</p>

In [15]:
tagged_docs = [gensim.models.doc2vec.TaggedDocument(message, [tag])
               for tag, message in enumerate(X_train)]

In [24]:
# Check what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['ì_', 'send', 'copy', 'da', 'report'], tags=[0])

In [26]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                  vector_size=100,
                                  window=5,
                                  min_count=2)

In [28]:
# Passing a single string instead of a list of tokens raises an error
d2v_model.infer_vector("text")

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).
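
The error above arises because `infer_vector` expects a list of tokens rather than a raw string. A minimal sketch of a guard that converts a string into tokens first is shown below. `ensure_tokens` is a hypothetical helper, not part of gensim, and its regex tokenizer is only a simple stand-in for `gensim.utils.simple_preprocess`, which the cleaning step in this notebook uses.

```python
import re


def ensure_tokens(doc):
    """Return a list of lowercase word tokens for a raw string or a token list.

    Hypothetical helper for illustration; a plain regex stands in for
    gensim.utils.simple_preprocess here.
    """
    if isinstance(doc, str):
        # A bare string would be rejected by infer_vector, so split it into words.
        return re.findall(r"[a-z]+", doc.lower())
    # Already an iterable of tokens: normalise it to a list of strings.
    return [str(token) for token in doc]


tokens = ensure_tokens("I am learning NLP")
print(tokens)
# With the trained model from this notebook, one could then call:
# d2v_model.infer_vector(tokens)
```

In practice it is probably safer to reuse the same preprocessing that produced the training tokens, so that inferred vectors are based on the same vocabulary.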

In [30]:
# This sentence may not appear in the text messages, but the model can still infer a vector for it.
d2v_model.infer_vector(["i", "am", "learning", "nlp"])

array([-0.01670334,  0.01506151,  0.01069222, -0.00099723,  0.01174583,
       -0.02641355,  0.01006052,  0.03633425, -0.00801944, -0.01884378,
       -0.01059441, -0.03392114,  0.0077312 ,  0.00724932,  0.00435393,
       -0.01755195, -0.00129795, -0.03394359,  0.00842868, -0.0376656 ,
        0.02096071,  0.01253987,  0.01408338, -0.0073376 , -0.00260323,
        0.00035145, -0.013527  , -0.01244347, -0.01531816,  0.00457848,
        0.02517725,  0.00314163,  0.00578883, -0.00276743, -0.0051368 ,
        0.02993125, -0.0006773 , -0.00904137, -0.00885883, -0.03443177,
        0.00085865, -0.02342424, -0.00166096,  0.00544285,  0.01690786,
       -0.00842708, -0.00368595, -0.00519634,  0.00580165,  0.01186061,
        0.01635859, -0.00939585, -0.00151957, -0.00035397, -0.02264286,
        0.00785382,  0.01268574, -0.00524943, -0.01497518,  0.01326055,
        0.01511205,  0.00219207, -0.0104853 ,  0.00257377, -0.0219974 ,
        0.02341004,  0.00325897, -0.00109748, -0.02742057,  0.02

In [31]:
# Prepare the data for the machine learning model: infer a vector for each test message
vectors = [[d2v_model.infer_vector(words)] for words in X_test]
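
Because each inferred vector is wrapped in its own list, `vectors` is nested one level deeper than the two-dimensional layout (one row per message, one column per dimension) that scikit-learn estimators generally expect. The sketch below shows one way to unwrap it. `unwrap_vectors` is a hypothetical helper, and short plain lists stand in for the 100-dimensional NumPy arrays that `infer_vector` returns.

```python
def unwrap_vectors(nested):
    """Turn [[vec], [vec], ...] into [vec, vec, ...] (one row per document)."""
    rows = []
    for item in nested:
        if len(item) != 1:
            raise ValueError("expected exactly one vector per document")
        rows.append(list(item[0]))
    return rows


# Toy stand-in for `vectors`: three documents, three dimensions each.
nested_vectors = [[[0.1, 0.2, 0.3]], [[0.4, 0.5, 0.6]], [[0.7, 0.8, 0.9]]]
features = unwrap_vectors(nested_vectors)
print(len(features), len(features[0]))
```

Alternatively, dropping the inner brackets in the list comprehension would avoid the nesting at the source.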

In [33]:
vectors[0]

[array([-0.01600209,  0.01305018,  0.00265853, -0.00142869,  0.00875799,
        -0.02439217,  0.00783628,  0.02846464, -0.01510691, -0.00980889,
        -0.00184414, -0.02052439,  0.00045317,  0.01417863,  0.00763485,
        -0.00900592, -0.00113399, -0.02850657,  0.00791143, -0.02818114,
         0.00818712,  0.00916575,  0.01372023, -0.00880946, -0.00138504,
         0.00513204, -0.02041306, -0.0066932 , -0.01267459,  0.00080392,
         0.01497518, -0.00087495,  0.00384755, -0.00099499, -0.00689779,
         0.02301539, -0.0040477 , -0.00924474, -0.00441097, -0.03417779,
        -0.00123487, -0.0154581 , -0.00444547, -0.00095956,  0.01552329,
        -0.010543  , -0.00792183, -0.00501963,  0.01103252,  0.0084653 ,
         0.00677812, -0.00182994, -0.00559629, -0.00368985, -0.01173695,
        -0.00034725,  0.00565822, -0.00479573, -0.0108868 ,  0.00696854,
         0.00535064,  0.00439968, -0.00579093, -0.00245765, -0.02306881,
         0.01168393,  0.00489581,  0.00114061, -0.0