# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [1]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('data/spam.csv', encoding='latin-1')
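# Drop the empty trailing columns, then rename the two that remain
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
# Lowercase and tokenize each message into a list of words
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)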

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)


In [2]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [3]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['nothing', 'lor', 'bit', 'bored', 'too', 'then', 'dun', 'go', 'home', 'early', 'sleep', 'today'], tags=[0])

In [4]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                  vector_size=100,
                                  window=5,
                                  min_count=2)

In [6]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector(['text'])

array([-1.80431865e-02,  8.37543234e-03,  6.17241487e-03, -3.44395498e-03,
        3.45290522e-03, -3.62966582e-02, -1.96596906e-02,  6.09759130e-02,
       -1.84559748e-02, -1.50382435e-02, -2.08117813e-02, -2.26940978e-02,
        4.50779218e-03,  1.01694716e-02, -2.42809877e-02, -2.70336308e-02,
        1.17012970e-02, -2.67291870e-02,  9.98055842e-03, -3.83110009e-02,
        1.36352202e-03,  1.08455289e-02,  1.69306882e-02, -2.66350806e-03,
       -6.54482422e-03, -1.18508879e-02, -6.84094196e-03, -1.51186222e-02,
       -2.46240404e-02,  1.56936236e-02,  1.41401635e-02,  4.86529293e-03,
        1.04783503e-02, -1.29865911e-02, -6.47423230e-03,  3.19251306e-02,
        1.51302796e-02, -3.22629325e-02, -1.50373839e-02, -3.10964063e-02,
       -1.05075510e-02, -1.04291718e-02, -1.17229857e-02, -2.22441088e-03,
        1.10154385e-02, -8.57207738e-03, -1.49089685e-02, -1.18329341e-03,
       -7.42972689e-03,  8.14653561e-03,  4.24240530e-03, -1.44670606e-02,
       -5.12884278e-03, ...  (output truncated)

In [7]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([-0.0099374 ,  0.0154734 ,  0.00442767,  0.00242123,  0.00422297,
       -0.0320076 , -0.00012823,  0.0404783 , -0.02869176, -0.01635095,
       -0.01653304, -0.02202714, -0.00741267,  0.01072358, -0.0090397 ,
       -0.02561929,  0.00866255, -0.01565938, -0.00121497, -0.03019524,
        0.01808571,  0.01705728,  0.02437944, -0.00096959, -0.00965553,
       -0.00222158, -0.012289  , -0.01598782, -0.01431445,  0.00868111,
        0.01703324,  0.00358201,  0.01063224, -0.00840601, -0.00804244,
        0.02636494,  0.00330802, -0.01470087, -0.01007254, -0.03698661,
       -0.00062674, -0.01730937, -0.00135942,  0.0029403 ,  0.00298335,
       -0.01465081, -0.00550584, -0.00788276,  0.00659547,  0.00924875,
       -0.00017958, -0.02128269,  0.01105198, -0.0039272 , -0.01228422,
        0.00702484,  0.01346784, -0.00047194, -0.0266333 ,  0.0162513 ,
        0.01688452, -0.00418266, -0.00035278,  0.00452123, -0.02205559,
        0.02651102,  0.01764529,  0.00787011, -0.02420409,  0.03...  (output truncated)
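Raw vectors like the ones above are hard to compare by eye. One common way to compare two document vectors is cosine similarity. The sketch below is a minimal NumPy version; `cosine_similarity` is a helper defined here for illustration, and the three short vectors are placeholders standing in for `infer_vector` output, not real model results.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # The cosine is undefined for a zero vector; return 0.0 in that case
    return float(a @ b / denom) if denom else 0.0

# Placeholder vectors standing in for infer_vector results
vec_a = np.array([0.02, -0.01, 0.03])
vec_b = np.array([0.04, -0.02, 0.06])    # same direction as vec_a, twice as long
vec_c = np.array([-0.02, 0.01, -0.03])   # opposite direction to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

With the trained model, the same helper could be applied to two inferred vectors, for example `cosine_similarity(d2v_model.infer_vector(tokens_1), d2v_model.infer_vector(tokens_2))`, where `tokens_1` and `tokens_2` are lists of words.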

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no convenient API for loading them the way there is for `word2vec`, so using them takes more work.

Pre-trained document vectors trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!