# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [2]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('./data/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [5]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [6]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['gr', 'so', 'how', 'do', 'you', 'handle', 'the', 'victoria', 'island', 'traffic', 'plus', 'when', 'the', 'album', 'due'], tags=[0])

In [8]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                  vector_size=100,
                                  window=5,
                                  min_count=2)

In [9]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).

In [10]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([ 7.26529630e-03, -2.31050048e-03, -9.41816159e-03,  9.80881322e-03,
       -8.79096985e-03,  2.73494376e-03, -8.65102978e-04, -6.75851200e-03,
        8.65330733e-03,  1.02958670e-02, -1.21573377e-02,  3.10066808e-03,
       -2.85203359e-03,  7.62374839e-03,  4.81973635e-04, -1.34307453e-02,
        3.40634678e-03, -1.34199457e-02,  1.69804562e-02,  9.52691864e-03,
        6.89091103e-05,  4.07285383e-03,  9.58214700e-03,  2.77136592e-03,
        2.30789301e-03, -2.37510982e-03, -2.63911095e-02, -9.30130109e-03,
       -1.50075147e-03, -1.76140983e-02,  1.27865374e-02, -4.23173141e-03,
        2.84788851e-03, -2.16237605e-02, -9.71438456e-03,  1.13085238e-02,
        3.06288758e-03,  5.37598599e-03,  1.10779600e-02,  4.01287246e-03,
       -7.84148648e-03, -8.04968271e-03, -2.69622519e-03, -3.26160900e-03,
       -7.08591333e-03,  4.50088782e-03,  1.22238593e-02, -2.52124225e-03,
        9.86202527e-03, -1.49864350e-02,  1.55974319e-02,  4.96885553e-03,
        7.37729669e-03, -

### What About Pre-trained Document Vectors?

Far fewer pre-trained options exist for document vectors than for word vectors. There is also no convenient API for loading them, as there is for `word2vec`, so using them takes more effort.

Pre-trained vectors from training on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!