# Doc2Vec Train and Test
The goal is to build a module that takes the data frame and, instead of the body text, returns a vector for each input paragraph (i.e., each document).

In [1]:
import gensim
import os
import collections
import smart_open
import random
import pandas as pd
import numpy as np

## Instantiate and Train Model

## Get train and test data

* Data (train and test): [Cleaned reddit dataset](../../data/ad_hominem/ad_hominems_cleaned_Murilo.csv). The rows are shuffled and then split into train and test sets in a 70-30 ratio.

In [2]:
# Load the cleaned data, shuffle the rows, and split them 70-30 into train and test sets
data = pd.read_csv("../../data/ad_hominem/ad_hominems_cleaned_Murilo.csv")
train_data, test_data = np.split(data.sample(frac=1), [int(.7*len(data))])
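
The split above shuffles the rows without a fixed seed, so every run produces a different partition. The following is a minimal sketch of a reproducible 70-30 split over row positions, using only the standard library. The helper name `split_indices` and the seed value are assumptions made for illustration; with pandas, a similar effect should be achievable by passing `random_state` to `DataFrame.sample`.

```python
import random


def split_indices(n_rows, train_frac=0.7, seed=42):
    """Shuffle row positions with a fixed seed and cut them at train_frac."""
    positions = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle independent of global state
    random.Random(seed).shuffle(positions)
    cut = int(train_frac * n_rows)
    return positions[:cut], positions[cut:]


train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))
```

The returned positions could then be used to select rows (for example with `data.iloc[train_idx]`), and rerunning with the same seed yields the same partition.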

## Define a Function to Read and Preprocess Text

* Define a function that iterates over the rows of the train/test data frame
* Read the `reddit_ad_hominem.body` column row by row
* Pre-process each row using a simple gensim pre-processing tool (i.e., tokenize text into individual words, remove punctuation, set to lowercase, etc.)
* Yield a list of words (wrapped in a `TaggedDocument` for training data).

Note that, for the data frame (corpus), each row constitutes a single document, and the length of each row entry (i.e., document) can vary. Also, to train the model, we need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the data frame index of the row.
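
To make the pre-processing and tagging steps concrete, here is a rough sketch that uses only the standard library. `approx_simple_preprocess` is a hypothetical stand-in that only approximates `gensim.utils.simple_preprocess` (lowercasing, keeping runs of letters, dropping tokens shorter than 2 or longer than 15 characters); it is not gensim's implementation. `tag_documents` is likewise an illustrative helper.

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough approximation of gensim.utils.simple_preprocess (assumption)."""
    # Lowercase, then keep runs of letters; digits and punctuation act as separators
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop very short and very long tokens
    return [t for t in tokens if min_len <= len(t) <= max_len]


def tag_documents(texts_by_index):
    """Pair each token list with its row index, mirroring the tagging step."""
    for idx, text in texts_by_index.items():
        yield approx_simple_preprocess(text), [idx]


sample = {26164: "and 25.5 percent black."}
print(list(tag_documents(sample)))
```

Note how the number `25.5` disappears entirely: only alphabetic tokens survive this kind of pre-processing.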

In [3]:
def read_corpus(df, tokens_only=False):
    """Yield one pre-processed document per data frame row."""
    for i, row in df.iterrows():
        # Tokenize, lowercase, and strip punctuation from the comment body
        tokens = gensim.utils.simple_preprocess(str(row["reddit_ad_hominem.body"]))
        if tokens_only:
            # Test data: plain list of tokens
            yield tokens
        else:
            # Training data: attach the data frame index as the document tag
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

In [4]:
train_corpus = list(read_corpus(train_data))
test_corpus = list(read_corpus(test_data, tokens_only=True))

Let's take a look at the training corpus (both in the data frame and the generated corpus to see the differences):

In [5]:
pd.set_option('display.max_colwidth', 0)
pd.DataFrame(train_data.loc[:, "reddit_ad_hominem.body"])[:2]

Unnamed: 0,reddit_ad_hominem.body
26164,and 25.5 percent black.
18612,"First of all, I'm not talking about veganism. There is a big difference between not eating red meat and being a vegan. Second of all, that thing about meat being a necessity for athletes is simply false, 100 %. About enjoying meat being a rational argument, you are right. I addressed that in one of my posts. That was a poor choice of words by me."


In [6]:
train_corpus[:2]

[TaggedDocument(words=['and', 'percent', 'black'], tags=[26164]),
 TaggedDocument(words=['first', 'of', 'all', 'not', 'talking', 'about', 'veganism', 'there', 'is', 'big', 'difference', 'between', 'not', 'eating', 'red', 'meat', 'and', 'being', 'vegan', 'second', 'of', 'all', 'that', 'thing', 'about', 'meat', 'being', 'necessity', 'for', 'athletes', 'is', 'simply', 'false', 'about', 'enjoying', 'meat', 'being', 'rational', 'argument', 'you', 'are', 'right', 'addressed', 'that', 'in', 'one', 'of', 'my', 'posts', 'that', 'was', 'poor', 'choice', 'of', 'words', 'by', 'me'], tags=[18612])]

And the testing corpus looks like this:

In [7]:
pd.DataFrame(test_data.loc[:, "reddit_ad_hominem.body"])[:2]

Unnamed: 0,reddit_ad_hominem.body
20961,which is what we're talking about. That's why your air analogy is garbage
24385,”


In [8]:
print(test_corpus[:2])

[['which', 'is', 'what', 'we', 're', 'talking', 'about', 'that', 'why', 'your', 'air', 'analogy', 'is', 'garbage'], []]


Notice that the testing corpus is just a list of lists and does not contain any tags.

## Training the Model
### Instantiate a Doc2Vec Object 
Doc2Vec model with:
* Vector size of 500 dimensions
* Iterating over the training corpus 10 times (more iterations take more time and eventually reach a point of diminishing returns)
* Minimum word count set to 10 (discard words with very few occurrences)

Note: retaining infrequent words can often make a model worse.
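
As an illustration of what a minimum word count threshold does, the sketch below counts tokens across a toy corpus and keeps only those that occur at least `min_count` times. It is a standard-library illustration of the idea, not gensim's internal vocabulary code; `filter_vocab` and the toy documents are assumptions made for the example.

```python
from collections import Counter


def filter_vocab(tokenized_docs, min_count):
    """Count every token and keep those seen at least min_count times."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {word: n for word, n in counts.items() if n >= min_count}


docs = [
    ["meat", "is", "meat"],
    ["vegan", "is", "fine"],
    ["is", "meat", "fine"],
]
vocab = filter_vocab(docs, min_count=2)
print(vocab)  # 'vegan' occurs only once, so it is dropped
```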

In [9]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=10, epochs=10)

### Build a Vocabulary

In [10]:
model.build_vocab(train_corpus)

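Essentially, the vocabulary is a dictionary (accessible via `model.wv.vocab` in gensim 3.x) of the unique words retained from the training corpus along with their counts (e.g., `model.wv.vocab['penalty'].count` gives the count for the word `penalty`). In gensim 4.x the `vocab` attribute was removed; `model.wv.key_to_index` and `model.wv.get_vecattr('penalty', 'count')` serve the same purpose.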

### Time to Train

In [11]:
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 2min 42s, sys: 1.45 s, total: 2min 43s
Wall time: 1min 7s


### Inferring a Vector

`infer_vector()` computes a vector for a new, unseen document. This vector can then be compared with other vectors via cosine similarity.

In [12]:
model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])

array([ 0.03143369,  0.0407154 ,  0.03831863,  0.05787762,  0.003495  ,
       -0.01959528, -0.1392654 ,  0.02381854,  0.05999694,  0.00511141,
        0.0614543 , -0.07074674, -0.03040167,  0.05974266, -0.05069724,
       -0.0095914 ,  0.00403029,  0.01088104, -0.05114654, -0.01545902,
       -0.00697675, -0.0191806 ,  0.01606508, -0.04181037,  0.04848306,
       -0.00219237, -0.00888258, -0.01794555, -0.00313175, -0.00407065,
        0.00936912, -0.02379045,  0.01550802,  0.05234915, -0.01517074,
        0.02088564,  0.02866564,  0.00668709,  0.0375262 ,  0.00101556,
       -0.02678772, -0.01674695, -0.03478102, -0.06777813,  0.03225682,
       -0.03425853, -0.02280523, -0.01904884,  0.03232688, -0.01424852,
       -0.01720729, -0.02564723,  0.00844694, -0.0238448 , -0.02616653,
       -0.02288621,  0.00434677, -0.00772536,  0.02465179,  0.01039768,
        0.00927351,  0.01873142,  0.0036584 ,  0.05478862,  0.0434427 ,
        0.00309651,  0.01977954, -0.06521635, -0.03103744,  0.02

* `infer_vector()` takes a list of *string tokens*
* Input should be tokenized prior to inference
    * Here the test set is already tokenized (in `test_corpus = list(read_corpus(test_data, tokens_only=True))`)
    
Note: algorithms use internal randomization, so repeated inferences of the same text will return slightly different vectors.
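
The cosine similarity mentioned in this section can be sketched as follows. This is a minimal standard-library version for plain sequences of floats; the function name `cosine_similarity` and the choice to return `0.0` for zero-length vectors are assumptions of this sketch. An inferred vector can be passed in directly, since it is iterable.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Similarity is undefined for a zero vector; this sketch returns 0.0
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and values near -1 indicate opposite directions.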

## Saving the model

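Save the trained model to disk, then delete the temporary training data from memory (keeping the document-tag vectors and the ability to infer new vectors).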

In [13]:
model.save("reddit-doc2vec.model")
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

## To load the model:

`model = Doc2Vec.load(fname)`

## To use model for inference:

`vector = model.infer_vector(["tokenized", "input", "string"])`

## Test the model
### To load the model:
* `model = Doc2Vec.load(fname)` (not required here, since the model is still in memory)
### To use the model for inference:
* `vector = model.infer_vector(["tokenized", "input", "string"])`
* To tokenize the input: `list(read_corpus(df, tokens_only=True))` (used earlier to build `test_corpus`)

In [14]:
vector_sample = model.infer_vector(test_corpus[1])
print("Tokenized test sample:")
print(test_corpus[1])
print("\nInferred vector:")
print(vector_sample)

Tokenized test sample:
[]

Inferred vector:
[ 9.76270094e-05  4.30378743e-04  2.05526754e-04  8.97663631e-05
 -1.52690394e-04  2.91788223e-04 -1.24825572e-04  7.83546013e-04
  9.27325513e-04 -2.33116967e-04  5.83450077e-04  5.77898390e-05
  1.36089118e-04  8.51193268e-04 -8.57927895e-04 -8.25741387e-04
 -9.59563185e-04  6.65239699e-04  5.56313491e-04  7.40024319e-04
  9.57236683e-04  5.98317129e-04 -7.70412735e-05  5.61058347e-04
 -7.63451157e-04  2.79842032e-04 -7.13293441e-04  8.89337854e-04
  4.36966438e-05 -1.70676125e-04 -4.70888772e-04  5.48467389e-04
 -8.76993363e-05  1.36867893e-04 -9.62420425e-04  2.35271000e-04
  2.24191448e-04  2.33867991e-04  8.87496164e-04  3.63640604e-04
 -2.80984212e-04 -1.25936087e-04  3.95262381e-04 -8.79549072e-04
  3.33533419e-04  3.41275736e-04 -5.79234853e-04 -7.42147386e-04
 -3.69143294e-04 -2.72578473e-04  1.40393546e-04 -1.22796977e-04
  9.76747717e-04 -7.95910368e-04 -5.82246459e-04 -6.77380944e-04
  3.06216651e-04 -4.93416796e-04 -6.73784525e-
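
In the run above, `test_corpus[1]` is an empty token list, so the inferred vector does not reflect any text. Its values are all on the order of 1e-4, which suggests (an interpretation, not something the notebook states) that it is little more than a random initialization. A minimal sketch for picking a non-empty sample instead, where `first_non_empty` is a hypothetical helper and the toy corpus stands in for `test_corpus`:

```python
def first_non_empty(tokenized_docs):
    """Return (position, tokens) for the first document with any tokens, else None."""
    for pos, tokens in enumerate(tokenized_docs):
        if tokens:
            return pos, tokens
    return None


corpus = [[], ["which", "is", "what"], []]
print(first_non_empty(corpus))
```

The returned token list could then be passed to `model.infer_vector(...)` in place of the empty document.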

## For more information:
* [Yaron Vazana](http://yaronvazana.com/2018/01/20/training-doc2vec-model-with-gensim/)
* [Rare Technologies](https://rare-technologies.com/doc2vec-tutorial/)
* [Gensim Documentation](https://radimrehurek.com/gensim/models/doc2vec.html)
* [Gensim Doc2Vec Tutorial on the IMDB Sentiment Dataset](./doc2vec-IMDB.ipynb)
* [Doc2Vec Tutorial on the Lee Dataset](./Doc2Vec_Tutorial_Lee_mod.ipynb)
* [Doc2Vec Testing Trained Model](./Doc2Vec_Test_Lee.ipynb)