# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [1]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('spam.csv', encoding='latin-1')
messages = messages.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [2]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [6]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['this', 'will', 'increase', 'the', 'chance', 'of', 'winning'], tags=[0])
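
In gensim, `TaggedDocument` is a named tuple with `words` and `tags` fields, so each field can be read by name or by position. The sketch below uses a locally defined stand-in (an assumption, so the snippet runs without gensim installed) to show the same structure. The sample token lists are made up for illustration.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (illustration only)
TaggedDocument = namedtuple('TaggedDocument', 'words tags')

# Hypothetical tokenized messages standing in for X_train
sample_tokens = [['free', 'prize', 'call', 'now'], ['see', 'you', 'at', 'lunch']]

# Tag each document with its position, as in the cell above
sample_tagged = [TaggedDocument(words, [i]) for i, words in enumerate(sample_tokens)]

first = sample_tagged[0]
print(first.words)   # tokens of the first document
print(first.tags)    # list of tags attached to it
words, tags = first  # also unpacks like an ordinary tuple
```

Integer positions are used as tags here to mirror the notebook. The tag is what identifies each training document's vector after training.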

In [8]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                 vector_size=100,
                                 window=5,
                                 min_count=2)

In [9]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).

In [10]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i','am','learning','nlp'])

array([-0.00136442, -0.00363975,  0.00437213, -0.00221957, -0.00249227,
       -0.00306828, -0.00635214, -0.00074465, -0.0019828 ,  0.00747749,
       -0.0025201 , -0.00263584, -0.00051234, -0.00890664,  0.00687952,
        0.00980445,  0.00832667,  0.00318225, -0.00638723,  0.0020399 ,
        0.00953929, -0.00388563,  0.00797023,  0.00279042, -0.00374689,
        0.00523655, -0.01341463, -0.0077482 ,  0.00393441,  0.01903278,
       -0.0026451 , -0.00403799,  0.00176416,  0.00341318,  0.00357593,
        0.00618204, -0.00787507, -0.0040524 , -0.01807948,  0.00222847,
       -0.00283368,  0.01068965,  0.00816391, -0.01214261,  0.00355402,
        0.00186048,  0.01092862, -0.0067454 , -0.00016078,  0.00839845,
        0.01071907,  0.00283719,  0.00269074,  0.00264713,  0.00435193,
        0.00306998,  0.01919701,  0.01123271,  0.00177054,  0.02013707,
       -0.00247387, -0.00571772,  0.00916544, -0.00236018,  0.0011739 ,
        0.00839449,  0.00474824,  0.00523856,  0.01090014, ...
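
Inferred vectors are usually compared with one another rather than read directly. A minimal sketch of cosine similarity follows, written with the standard library only. The short vectors are made-up stand-ins for `infer_vector` output, and `cosine_similarity` is a hypothetical helper, not a gensim function. Note that `infer_vector` involves random initialization, so repeated calls on the same tokens can return slightly different vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for inferred document vectors
vec_a = [0.01, -0.02, 0.03]
vec_b = [0.02, -0.04, 0.06]    # same direction as vec_a
vec_c = [-0.01, 0.02, -0.03]   # opposite direction to vec_a

sim_same = cosine_similarity(vec_a, vec_b)
sim_opposite = cosine_similarity(vec_a, vec_c)
print(round(sim_same, 6), round(sim_opposite, 6))
```

The helper only iterates over its arguments, so the arrays returned by `d2v_model.infer_vector(...)` can be passed in place of the lists.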

### What About Pre-trained Document Vectors?

Far fewer pre-trained options exist for document vectors than for word vectors. There is also no easy API for reading them in, as there is for `word2vec`, so using them takes more time.

Pre-trained vectors trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!
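
If a downloaded model was saved with gensim's own `save` method, `Doc2Vec.load` can read it back. The sketch below is only an illustration: `load_pretrained_doc2vec` and the file path are hypothetical, and models from the repository above may need the specific gensim version that the repository names, so loading them with a current release is not guaranteed to work.

```python
import os

def load_pretrained_doc2vec(model_path):
    """Load a saved Doc2Vec model from disk (hypothetical helper)."""
    if not os.path.isfile(model_path):
        raise FileNotFoundError(f"No model file at {model_path!r}")
    # Imported here so the path check runs even where gensim is not installed
    from gensim.models.doc2vec import Doc2Vec
    return Doc2Vec.load(model_path)

# Example usage with a placeholder path (replace with the downloaded file):
# pretrained = load_pretrained_doc2vec('path/to/pretrained_model.bin')
# vector = pretrained.infer_vector(['i', 'am', 'learning', 'nlp'])
```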