# Doc2Vec Train and Test
The ultimate goal is a module that takes the data frame and returns a vector for each paragraph (data input) in place of its body text.

In [1]:
import gensim
import pandas as pd
import numpy as np

## Instantiate and Train Model

## Get train and test data

* Data (train and test): [Cleaned reddit dataset](../../data/ad_hominem/ad_hominems_cleaned.csv). The data is shuffled and split into train and test sets in a 70-30 ratio.

In [2]:
# Load the cleaned data, shuffle the rows, and split them 70/30 into train and test sets
data = pd.read_csv("../../data/ad_hominem/ad_hominems_cleaned.csv")
train_data, test_data = np.split(data.sample(frac=1), [int(.7*len(data))])
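
The split above draws a different shuffle on every run. A minimal sketch of a reproducible variant follows. The toy data frame and the seed value `42` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the Reddit data frame; the column name matches the notebook.
data = pd.DataFrame({"reddit_ad_hominem.body": [f"comment {i}" for i in range(10)]})

# Shuffle all rows with a fixed seed (the value 42 is an arbitrary assumption).
shuffled = data.sample(frac=1, random_state=42)

# The first 70% of the shuffled rows form the training set, the rest the test set.
cut = int(0.7 * len(shuffled))
train_data = shuffled.iloc[:cut]
test_data = shuffled.iloc[cut:]

print(len(train_data), len(test_data))
```

Fixing the seed makes the row tags in later cells stable across runs, which helps when comparing models.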

## Define a Function to Read and Preprocess Text

* Define a function that iterates over the rows of the train/test data frame
* Pre-process each row's body text using a simple gensim pre-processing tool (i.e., tokenize the text into individual words, remove punctuation, set to lowercase, etc.)
* Return a list of words (or, for training data, a tagged document).

Note that each row of the data frame (corpus) constitutes a single document, and the length of each row entry (i.e., document) can vary. Also, to train the model, we need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the row's index label in the data frame.

In [3]:
def read_corpus(df, tokens_only=False):
    """Yield one tokenized document per data frame row."""
    for i, line in df.iterrows():
        # Tokenize and lowercase the body text; str() guards against non-string values
        tokens = gensim.utils.simple_preprocess(str(line["reddit_ad_hominem.body"]))
        if tokens_only:
            yield tokens
        else:
            # For training data, tag each document with its row index
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

In [4]:
train_corpus = list(read_corpus(train_data))
test_corpus = list(read_corpus(test_data, tokens_only=True))

Let's look at the training data both as it appears in the data frame and as the generated corpus, to see the differences:

In [5]:
pd.set_option('display.max_colwidth', 0)
pd.DataFrame(train_data.loc[:, "reddit_ad_hominem.body"])[:2]

Unnamed: 0,reddit_ad_hominem.body
50255,
99976,


In [6]:
train_corpus[:2]

[TaggedDocument(words=['nan'], tags=[50255]),
 TaggedDocument(words=['nan'], tags=[99976])]

And the testing corpus looks like this:

In [7]:
pd.DataFrame(test_data.loc[:, "reddit_ad_hominem.body"])[:2]

Unnamed: 0,reddit_ad_hominem.body
79142,
69811,


In [8]:
print(test_corpus[:2])

[['nan'], ['nan']]


Notice that the testing corpus is just a list of lists and does not contain any tags.
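
The `['nan']` tokens in the samples above suggest that these rows have no body text: `str()` turns a missing value (`NaN`) into the string `"nan"`, which then passes through tokenization as an ordinary word. A minimal sketch of filtering such rows before building the corpus is shown here. The two-row frame and its example text are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with one missing body; the column name matches the notebook.
df = pd.DataFrame(
    {"reddit_ad_hominem.body": ["You are wrong and here is why.", float("nan")]}
)

# str() turns a missing value into the literal string "nan".
print(str(float("nan")))

# Dropping rows with a missing body keeps placeholder documents out of the corpus.
cleaned = df.dropna(subset=["reddit_ad_hominem.body"])
print(len(df), len(cleaned))
```

Whether to drop such rows or keep them depends on how the vectors are joined back to the data frame later.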

## Training the Model
### Instantiate a Doc2Vec Object
Doc2Vec model with:
* Vector size of 500 dimensions
* 10 iterations (epochs) over the training corpus (more iterations take more time and eventually reach a point of diminishing returns)
* Minimum word count set to 10 (discard words with very few occurrences)

Note: retaining infrequent words can often make a model worse.

In [9]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=10, epochs=10)

### Build a Vocabulary

In [10]:
model.build_vocab(train_corpus)

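Essentially, the vocabulary is a dictionary (accessible via `model.wv.vocab`) of all the unique words extracted from the training corpus, along with their counts (e.g., `model.wv.vocab['penalty'].count` gives the count for the word `penalty`). In gensim 4.0 and later, `model.wv.vocab` was replaced by `model.wv.key_to_index`, and the count is read with `model.wv.get_vecattr('penalty', 'count')`.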

### Time to Train

In [11]:
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 3min 29s, sys: 4.86 s, total: 3min 34s
Wall time: 1min 39s


### Inferring a Vector

`infer_vector()` computes a vector for a new document, given as a list of tokens. This vector can then be compared with other vectors via cosine similarity.

In [12]:
model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])

array([ 0.03302052, -0.00475105,  0.01161314,  0.0236905 , -0.0537018 ,
       -0.00255536,  0.01054048,  0.02972727,  0.0154462 ,  0.02121845,
        0.00277529, -0.0707717 , -0.01914096, -0.04099298,  0.01934998,
        0.02442866,  0.03670777, -0.00923106, -0.01774863, -0.02243855,
       -0.04781026, -0.09083778,  0.04270164, -0.01088333, -0.05486543,
       -0.02477331,  0.00861344,  0.00278031, -0.00727144, -0.02567455,
       -0.0065259 , -0.04000838,  0.00248543, -0.06729561, -0.04923471,
       -0.03039499, -0.01674587, -0.02626782, -0.06380375,  0.02066823,
       -0.03653348, -0.00031122,  0.0093143 , -0.06327739, -0.03727213,
        0.01684709,  0.01302233, -0.00377187, -0.02742417,  0.00583922,
        0.03568843,  0.00695386, -0.00476701,  0.00950964,  0.01364512,
        0.01799491, -0.02467484, -0.03712162, -0.0182891 , -0.01591285,
       -0.02610657,  0.0193175 , -0.00923604,  0.00321477,  0.01475343,
        0.00436636,  0.02516589,  0.01650405,  0.00501648,  0.00

* `infer_vector()` takes a list of *string tokens*
* Input should be tokenized prior to inference
    * Here the test set is already tokenized (in `test_corpus = list(read_corpus(test_data, tokens_only=True))`)
    
Note: the inference algorithm uses internal randomization, so repeated inferences of the same text will return slightly different vectors.
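
The cosine similarity comparison mentioned earlier in this section can be sketched with NumPy as follows. The helper name `cosine_similarity`, the zero-vector convention, and the short example vectors are assumptions for illustration. In the notebook, the inputs would be the 500-dimensional arrays returned by `infer_vector()`.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return float(np.dot(a, b) / denom)

# Made-up 3-dimensional vectors standing in for infer_vector() output.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]     # same direction as u
w = [-1.0, -2.0, -3.0]  # opposite direction to u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values close to -1 indicate opposite directions.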

## Saving the model

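Save the trained model to disk. The commented-out call would additionally delete temporary training data from memory; it is left disabled here.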

In [13]:
model.save("reddit-doc2vec.model")
#model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

## To load the model:

`model = gensim.models.doc2vec.Doc2Vec.load("reddit-doc2vec.model")`

## To use model for inference:

`vector = model.infer_vector(["tokenized", "input", "string"])`

## Test the model
* Reloading the model from disk is not required here, because the trained model is still in memory.
* Inference uses `model.infer_vector(...)` on a tokenized document.
* The test set was already tokenized earlier with `list(read_corpus(test_data, tokens_only=True))`.

In [14]:
vector_sample = model.infer_vector(test_corpus[1])
print("Tokenized test sample:")
print(test_corpus[1])
print("\nInferred vector:")
print(vector_sample)

Tokenized test sample:
['nan']

Inferred vector:
[ 1.60084721e-02  1.02460361e-03 -1.89598370e-02  7.23619899e-03
 -1.24632930e-02 -5.63005311e-03  1.60672376e-03  1.42009286e-02
  1.64097082e-02  1.69727746e-02  5.79811260e-03 -1.75342038e-02
  3.11491662e-03 -1.05276387e-02  2.83717224e-03 -7.69761810e-03
  4.51505510e-03  3.55933816e-03  1.06733304e-03 -2.52480921e-03
 -1.45907858e-02 -1.14028715e-02 -3.49434535e-03  8.23463965e-03
 -8.31985194e-03 -5.32701332e-03  1.74424239e-02  1.08821699e-02
  9.20679234e-03  1.74813031e-03  3.04745976e-03 -1.13345059e-02
 -1.58613976e-02  2.52946839e-03 -1.96615309e-02 -8.69426318e-03
  1.80128578e-03 -1.79972837e-03 -3.56873032e-03 -6.59934292e-03
 -2.12082453e-02 -1.64342839e-02 -2.37535196e-03  1.06621720e-03
 -1.69275738e-02  2.13032681e-02 -5.35039324e-03  7.81056518e-03
 -2.80448049e-03 -9.15284827e-03  5.72796864e-03  5.66163892e-03
  5.99653134e-03 -9.80296265e-03  8.81322194e-03 -7.95341213e-04
 -4.51059034e-03 -9.75381769e-03 -6.96850
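
The goal stated at the top, a module that returns a vector per row in place of the body text, could be sketched as follows. The function name `vectorize_frame`, the output column name `vector`, and the stub model are hypothetical, introduced only so the sketch runs without a trained model. Any object with an `infer_vector(tokens)` method and any tokenizer callable can be passed in.

```python
import pandas as pd

def vectorize_frame(df, model, tokenize, column="reddit_ad_hominem.body"):
    """Return a copy of df with the text column replaced by a 'vector' column.

    model    -- any object with an infer_vector(list_of_tokens) method
    tokenize -- any callable mapping a string to a list of tokens
    """
    vectors = [model.infer_vector(tokenize(str(text))) for text in df[column]]
    out = df.drop(columns=[column])
    out["vector"] = pd.Series(vectors, index=df.index)
    return out

class StubModel:
    """Stand-in for a trained Doc2Vec model, used only to make the sketch runnable."""
    def infer_vector(self, tokens):
        # Not a real embedding: encodes only the token count.
        return [float(len(tokens)), 0.0]

frame = pd.DataFrame(
    {"id": [1, 2], "reddit_ad_hominem.body": ["Hello there", "One two three"]}
)
result = vectorize_frame(frame, StubModel(), str.split)
print(result)
```

In the notebook's setting, the trained `model` and `gensim.utils.simple_preprocess` would take the place of the stub model and `str.split`.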

## For more:
* [Yaron Vazana](http://yaronvazana.com/2018/01/20/training-doc2vec-model-with-gensim/)
* [Rare Technologies](https://rare-technologies.com/doc2vec-tutorial/)
* [Gensim Documentation](https://radimrehurek.com/gensim/models/doc2vec.html)
* [Gensim Doc2Vec Tutorial on the IMDB Sentiment Dataset](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb)
* [Doc2Vec Tutorial on the Lee Dataset](https://markroxor.github.io/gensim/static/notebooks/doc2vec-lee.html)