# Doc2Vec Train and Test
The ultimate goal is a module that takes the data frame and returns, in place of the body text, a vector for each input paragraph (document).

In [1]:
import gensim
import pandas as pd
import numpy as np

## Instantiate and Train Model

## Get train and test data

* Data (train and test): [Cleaned reddit dataset](../../data/ad_hominem/ad_hominems_cleaned.csv). The rows are shuffled and then split into train and test sets in a 70-30 ratio.

In [2]:
# Load the cleaned data, shuffle the rows, and split them 70/30 into train and test sets
data = pd.read_csv("../../data/ad_hominem/ad_hominems_cleaned.csv")
train_data, test_data = np.split(data.sample(frac=1), [int(.7*len(data))])
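
The split above shuffles the rows differently on every run. A minimal sketch of a repeatable 70-30 split follows, assuming a fixed seed is acceptable. The toy data frame, the seed value `42`, and the integer-arithmetic cut point are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real data frame (hypothetical contents).
toy = pd.DataFrame({"reddit_ad_hominem.body": [f"comment {i}" for i in range(10)]})

# Shuffle all rows with a fixed seed so the split is repeatable.
shuffled = toy.sample(frac=1, random_state=42)

# The first 70% of the shuffled rows form the training set, the rest the test set.
# Integer arithmetic avoids floating-point rounding at the cut point.
cut = len(shuffled) * 7 // 10
toy_train, toy_test = shuffled.iloc[:cut], shuffled.iloc[cut:]

print(len(toy_train), len(toy_test))
```

Because the original index labels survive the shuffle, they can still serve as document tags later on.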

## Define a Function to Read and Preprocess Text

* Define a function that iterates over the rows of the train/test data frame
* Pre-process each row's body text using a simple gensim pre-processing tool (i.e., tokenize text into individual words, remove punctuation, set to lowercase, etc.)
* Yield the resulting list of words (wrapped in a `TaggedDocument` for training data)

Note that, in the data frame (corpus), each row constitutes a single document, and the length of each row entry (i.e., document) can vary. Also, to train the model, we need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the data frame index (row label).
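
To make the pre-processing step described above more concrete, here is a rough standard-library approximation of what `gensim.utils.simple_preprocess` does with its default settings (lowercasing, keeping alphabetic tokens, and discarding tokens shorter than 2 or longer than 15 characters). The function `rough_preprocess` is a hypothetical helper for illustration only; it is not gensim's implementation and may differ on edge cases such as accents or underscores.

```python
import re


def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate simple_preprocess: lowercase, letters only, length filter."""
    # Runs of letters only; digits, underscores, and punctuation act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop very short and very long tokens.
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = rough_preprocess("no nothing. What you are seeing is a lot of individuals")
print(tokens)
```

The length filter is why single-letter words such as "a" and "I" do not appear in the tokenized corpus.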

In [3]:
def read_corpus(df, tokens_only=False):
    # Each row of the data frame is one document; the row index becomes its tag.
    for i, line in df.iterrows():
        if tokens_only:
            yield gensim.utils.simple_preprocess(str(line["reddit_ad_hominem.body"]))
        else:
            # For training data, add tags
            yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(str(line["reddit_ad_hominem.body"])), [i])

In [4]:
train_corpus = list(read_corpus(train_data))
test_corpus = list(read_corpus(test_data, tokens_only=True))

Let's take a look at the training corpus (both in the data frame and the generated corpus to see the differences):

In [5]:
pd.set_option('display.max_colwidth', 0)
pd.DataFrame(train_data.loc[:, "reddit_ad_hominem.body"])[:2]

Unnamed: 0,reddit_ad_hominem.body
31528,
7202,"Nobody can force you to not contradict yourself, I guess. If you want to hold onto the social contract while not accepting the conclusions that follow from it, you are of course free to be irrational and undermine your own position on morality."


In [6]:
train_corpus[:2]

[TaggedDocument(words=['nan'], tags=[31528]),
 TaggedDocument(words=['nobody', 'can', 'force', 'you', 'to', 'not', 'contradict', 'yourself', 'guess', 'if', 'you', 'want', 'to', 'hold', 'onto', 'the', 'social', 'contract', 'while', 'not', 'accepting', 'the', 'conclusions', 'that', 'follow', 'from', 'it', 'you', 'are', 'of', 'course', 'free', 'to', 'be', 'irrational', 'and', 'undermine', 'your', 'own', 'position', 'on', 'morality'], tags=[7202])]
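
The first tagged document above contains the single token `'nan'`: the body of that row is missing, and `str()` turns a missing value into the literal text "nan". A minimal sketch of one way to avoid this follows, assuming rows without a body can simply be dropped before the corpus is built. The toy data frame is illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the two rows shown above: one missing body, one real body.
toy = pd.DataFrame(
    {"reddit_ad_hominem.body": [np.nan, "Nobody can force you to not contradict yourself"]},
    index=[31528, 7202],
)

# str() of a missing value is the text "nan", which then becomes a token.
missing_as_text = str(toy.loc[31528, "reddit_ad_hominem.body"])
print(missing_as_text)

# Drop rows whose body is missing so they never reach the tokenizer.
toy_clean = toy.dropna(subset=["reddit_ad_hominem.body"])
print(list(toy_clean.index))
```

Whether dropping is appropriate depends on the downstream task; if every row must keep a vector, an empty token list may be a better placeholder.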

And the testing corpus looks like this:

In [7]:
pd.DataFrame(test_data.loc[:, "reddit_ad_hominem.body"])[:2]

Unnamed: 0,reddit_ad_hominem.body
41508,no nothing. What you are seeing is a lot of individuals
36957,I also understand that for other cultures


In [8]:
print(test_corpus[:2])

[['no', 'nothing', 'what', 'you', 'are', 'seeing', 'is', 'lot', 'of', 'individuals'], ['also', 'understand', 'that', 'for', 'other', 'cultures']]


Notice that the testing corpus is just a list of lists and does not contain any tags.

## Training the Model
### Instantiate a Doc2Vec Object 
Create a Doc2Vec model with:
* A vector size of 500 dimensions
* 10 iterations (epochs) over the training corpus (more iterations take more time and eventually reach a point of diminishing returns)
* A minimum word count of 10 (words with fewer occurrences are discarded)

Note: retaining infrequent words can often make a model worse.

In [9]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=10, epochs=10)

### Build a Vocabulary

In [10]:
model.build_vocab(train_corpus)

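Essentially, the vocabulary is a dictionary of all of the unique words extracted from the training corpus that meet the minimum count, along with their counts. In gensim 3.x it is accessible via `model.wv.vocab` (e.g., `model.wv.vocab['penalty'].count` for the count of the word `penalty`). In gensim 4.x, `model.wv.vocab` was removed; use `model.wv.key_to_index` and `model.wv.get_vecattr('penalty', 'count')` instead.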

### Time to Train

In [11]:
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 3min 29s, sys: 4.51 s, total: 3min 33s
Wall time: 1min 31s


### Inferring a Vector

`infer_vector()` computes a vector for a new document given as a list of tokens. This vector can then be compared with other vectors via cosine similarity.
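
As a sketch of the comparison mentioned above, cosine similarity can be computed directly with NumPy. The helper name `cosine_similarity` and the example vectors are illustrative; in practice the inputs would be vectors returned by `infer_vector()`.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, different length
c = np.array([3.0, 0.0, -1.0])  # orthogonal to a: 3 + 0 - 3 = 0

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

Cosine similarity ignores vector length, so documents of different sizes can be compared on direction alone.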

In [12]:
model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])

array([-0.01656617, -0.01430459, -0.05245181, -0.00066138,  0.00586664,
        0.01258109, -0.00204146, -0.02078112,  0.01407013, -0.01110554,
       -0.01981455,  0.02212978,  0.02192996,  0.02683485, -0.00126907,
        0.0400483 , -0.02251391,  0.01554183, -0.01737957,  0.05833966,
       -0.0238959 ,  0.06652302, -0.01765297,  0.01790353,  0.01940756,
       -0.02121411,  0.00521518, -0.01025048,  0.02276885,  0.04519996,
        0.01079008,  0.01254493,  0.01227546,  0.02644348,  0.03949171,
        0.0028408 , -0.02339171, -0.00848808,  0.01235549, -0.00383482,
        0.02303251,  0.02827961, -0.03533961, -0.04026923, -0.00180538,
        0.03984473,  0.05694033,  0.02689684,  0.01986732,  0.01301446,
       -0.01341978,  0.05513496,  0.00897647, -0.03526007,  0.03911593,
       -0.01551447, -0.02599545, -0.03524369,  0.03231908,  0.0451032 ,
        0.04861251,  0.03795578, -0.00419231,  0.00696138, -0.00077949,
       -0.03132649, -0.04002546, -0.04368919,  0.02382658, -0.01

* `infer_vector()` takes a list of *string tokens*
* Input should be tokenized prior to inference
    * Here the test set is already tokenized (in `test_corpus = list(read_corpus(test_data, tokens_only=True))`)
    
Note: the underlying algorithms use internal randomization, so repeated inferences of the same text will return slightly different vectors.

## Saving the model

Save the trained model to disk so that it can be reloaded later without retraining.

In [13]:
model.save("reddit-doc2vec.model")

## To load the model:

`model = gensim.models.doc2vec.Doc2Vec.load(fname)`

## To use model for inference:

`vector = model.infer_vector(["tokenized", "input", "string"])`

## Test the model
### To load the model:
* `model = gensim.models.doc2vec.Doc2Vec.load(fname)` (not required here, since the model is still in memory)
### To use model for inference:
* `vector = model.infer_vector(["tokenized", "input", "string"])`
#### To tokenize:
* `list(read_corpus(df, tokens_only=True))` (used earlier to build `test_corpus`)

In [14]:
vector_sample = model.infer_vector(test_corpus[1])
print("Tokenized test sample:")
print(test_corpus[1])
print("\nInferred vector:")
print(vector_sample)

Tokenized test sample:
['also', 'understand', 'that', 'for', 'other', 'cultures']

Inferred vector:
[-3.50757921e-03  1.90081634e-03 -3.46079506e-02  3.77507391e-03
  1.34055242e-02 -7.74921477e-03 -2.19890289e-02 -3.07108499e-02
  3.56810503e-02  5.54005615e-02 -4.47838940e-02  3.77739333e-02
 -2.31937189e-02  1.90163720e-02 -2.43136268e-02  9.67435092e-02
  3.04232612e-02  1.01761976e-02  1.53096411e-02  8.95579010e-02
 -2.06847843e-02  7.23380819e-02 -3.11913360e-02  4.82664816e-02
  3.21813077e-02 -4.79818974e-03  3.61051410e-02 -4.30038273e-02
  1.44684920e-02  5.99387325e-02  6.93687983e-03  5.27522415e-02
 -3.82570201e-03 -1.77666359e-02  5.55941910e-02  8.42079613e-03
 -3.17627452e-02  2.78663840e-02 -2.17901040e-02 -2.85427496e-02
  5.90239326e-03  3.40274535e-02 -1.83005985e-02 -5.19166663e-02
  1.77615825e-02  1.83763709e-02  4.02364284e-02  3.56108323e-02
  8.76685418e-03  3.27675901e-02 -2.77509168e-02  5.70554771e-02
  4.32702154e-03 -4.53536026e-02  6.18866310e-02 -1.557

## For more:
* [Yaron Vazana](http://yaronvazana.com/2018/01/20/training-doc2vec-model-with-gensim/)
* [Rare Technologies](https://rare-technologies.com/doc2vec-tutorial/)
* [Gensim Documentation](https://radimrehurek.com/gensim/models/doc2vec.html)
* [Gensim Doc2Vec Tutorial on the IMDB Sentiment Dataset](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb)
* [Doc2Vec Tutorial on the Lee Dataset](https://markroxor.github.io/gensim/static/notebooks/doc2vec-lee.html)