# Doc2Vec Train and Test
The ultimate goal is a module that takes the data frame and, instead of the body text, returns a vector for each paragraph (data input).

In [1]:
import gensim
import pandas as pd
import numpy as np

## Get train and test data

* Data (train and test): [Cleaned reddit dataset](../../data/ad_hominem/ad_hominems_cleaned.csv). The rows are shuffled and then split into train and test sets in a 70-30 ratio.

In [2]:
# Load the cleaned data, shuffle the rows, and split them 70-30 into train and test sets
data = pd.read_csv("../../data/ad_hominem/ad_hominems_cleaned.csv")
train_data, test_data = np.split(data.sample(frac=1), [int(.7*len(data))])
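
The shuffle-and-split above draws a different split on every run, because `data.sample(frac=1)` is not seeded. As a minimal, library-free sketch of the same logic with a fixed seed (the helper name `shuffle_split` and the seed value are assumptions for illustration, not part of the notebook):

```python
import random


def shuffle_split(rows, train_frac=0.7, seed=42):
    """Shuffle rows with a fixed seed, then cut them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Same cut-point formula as the notebook cell: int(.7 * len(data))
    cut = int(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))
train_rows, test_rows = shuffle_split(rows)
print(len(train_rows), len(test_rows))
```

With pandas, the comparable step would be passing `random_state` to `data.sample(frac=1, random_state=...)`.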

## Define a Function to Read and Preprocess Text

* Define a function that iterates over the rows of the train/test data frame
* Pre-process each row's body text using a simple gensim pre-processing tool (i.e., tokenize text into individual words, remove punctuation, set to lowercase, etc.)
* Yield a list of words for test data, or a `TaggedDocument` (words plus tag) for training data.

Note that, for the data frame (corpus), each row constitutes a single document, and the length of each row entry (i.e., document) can vary. Also, to train the model, we'll need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the index of the data frame (the original row number).

In [15]:
def read_corpus(df, tokens_only=False):
    """Yield one pre-processed document per data frame row."""
    for i, row in df.iterrows():
        tokens = gensim.utils.simple_preprocess(str(row["reddit_ad_hominem.body"]))
        if tokens_only:
            # For test data, yield the bare token list
            yield tokens
        else:
            # For training data, tag each document with its data frame index
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

In [4]:
train_corpus = list(read_corpus(train_data))
test_corpus = list(read_corpus(test_data, tokens_only=True))
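
To make the pre-processing step more concrete, here is a rough standard-library approximation of what `gensim.utils.simple_preprocess` does with its default settings: lowercase the text, keep runs of alphabetic characters, and drop tokens shorter than 2 or longer than 15 characters. The function name `approx_simple_preprocess` and the regular expression are assumptions for illustration; gensim's own tokenizer may differ on edge cases.

```python
import re

# Assumption: runs of word characters excluding digits, roughly like gensim's alphabetic pattern.
TOKEN_RE = re.compile(r"[^\W\d]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase text, split it into alphabetic tokens, and drop very short or long ones."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]


print(approx_simple_preprocess("There is no such equivalent."))
# ['there', 'is', 'no', 'such', 'equivalent']
```

This also explains why single-letter words such as "I" and numbers such as "1500" never appear in the generated corpus.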

Let's take a look at the training corpus (both in the data frame and the generated corpus to see the differences):

In [5]:
pd.set_option('display.max_colwidth', 0)
pd.DataFrame(train_data.loc[:, "reddit_ad_hominem.body"])[:2]

Unnamed: 0,reddit_ad_hominem.body
23552,be cute. Indulge in yourself. In the society we exist in
17342,There is no such equivalent.


In [6]:
train_corpus[:2]

[TaggedDocument(words=['be', 'cute', 'indulge', 'in', 'yourself', 'in', 'the', 'society', 'we', 'exist', 'in'], tags=[23552]),
 TaggedDocument(words=['there', 'is', 'no', 'such', 'equivalent'], tags=[17342])]

And the testing corpus looks like this:

In [33]:
pd.DataFrame(test_data.loc[:, "reddit_ad_hominem.body"])[:2]

Unnamed: 0,reddit_ad_hominem.body
8806,"Everyone talks about how much of a joke Donald Trump is. Just today there was an askreddit thread about how people are sure he's just running as a joke. I don't get it though. In my opinion, he's my favorite Republican in years.Here are some views he has that are more progressive than the rest of his party.1. He is not strong against gay marriage/abortion. In fact, up until recently, he was publicly in favor of these things. Pretty obvious these points are not a big part of his campaign.2. He is anti Super-PAC. Just like Bernie, he is not taking outside donations from Wall Street or big name corporations. He is willing to address the issue that corporations have too much power over politicians. This is HUGE in my opinion.3. His foreign policy is very sound. Let the Germans deal with Crimea. Let the Russians deal with ISIS. We don't need to be the police of the world anymore. A lot of Republicans seem to act like we're still in the Cold War and at war with Russia. I honestly believe that Trump and Putin could get along. China is the biggest economic enemy at this point. Trump acknowledges this. 4. US-China trade reform. China does seem to be breaking an incredible amount of WTO rules. I'm not sure that he will be able to bring them to the table to negotiate. However, at least he has a cohesive plan.Conclusion: I will not be voting for Donald Trump. Issues like global warming are too important to me for me to ever vote for a Republican candidate. However, the general idea is that Trump is a joke. This is absurd. If anything he's a progressive figure for the Republican party. He's infinitely better than Carson in my opinion. People say Trump scares them... Carson scares me much more."
10448,"What ones am I missing?????? I acknowledged the main ones you keep bringing up, medical, insurance, and tax. I literally mentioned these points in my FLIPPIN POST. I'm asking you to bring up a better benefit than what I already flippin mentioned. Those 1500+ laws you keep talking about mostly flippin apply to the main points you keep bringing up. Property, children, medical, insurance, tax, alimony, etc. Jesus."


In [8]:
print(test_corpus[:2])

[['everyone', 'talks', 'about', 'how', 'much', 'of', 'joke', 'donald', 'trump', 'is', 'just', 'today', 'there', 'was', 'an', 'askreddit', 'thread', 'about', 'how', 'people', 'are', 'sure', 'he', 'just', 'running', 'as', 'joke', 'don', 'get', 'it', 'though', 'in', 'my', 'opinion', 'he', 'my', 'favorite', 'republican', 'in', 'years', 'here', 'are', 'some', 'views', 'he', 'has', 'that', 'are', 'more', 'progressive', 'than', 'the', 'rest', 'of', 'his', 'party', 'he', 'is', 'not', 'strong', 'against', 'gay', 'marriage', 'abortion', 'in', 'fact', 'up', 'until', 'recently', 'he', 'was', 'publicly', 'in', 'favor', 'of', 'these', 'things', 'pretty', 'obvious', 'these', 'points', 'are', 'not', 'big', 'part', 'of', 'his', 'campaign', 'he', 'is', 'anti', 'super', 'pac', 'just', 'like', 'bernie', 'he', 'is', 'not', 'taking', 'outside', 'donations', 'from', 'wall', 'street', 'or', 'big', 'name', 'corporations', 'he', 'is', 'willing', 'to', 'address', 'the', 'issue', 'that', 'corporations', 'have', '

Notice that the testing corpus is just a list of lists and does not contain any tags.

## Training the Model
### Instantiate a Doc2Vec Object 
Doc2Vec model with:
* Vector size of 500 dimensions
* Iterating over the training corpus 10 times (more iterations take more time and eventually reach a point of diminishing returns)
* Minimum word count set to 10 (discard words with very few occurrences)

Note: retaining infrequent words can often make a model worse.

In [9]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=10, epochs=10)

### Build a Vocabulary

In [10]:
model.build_vocab(train_corpus)

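Essentially, the vocabulary is a dictionary (accessible via `model.wv.vocab`) of all of the unique words extracted from the training corpus that meet the `min_count` threshold, along with their counts (e.g., `model.wv.vocab['penalty'].count` gives the count for the word `penalty`). Note: in gensim 4.0 and later, `model.wv.vocab` was replaced by `model.wv.key_to_index`, and a word's count is read with `model.wv.get_vecattr('penalty', 'count')`.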
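
As a simplified illustration of the idea (not gensim's actual implementation), a vocabulary with counts and a minimum-count filter can be sketched with the standard library. The helper name `build_vocab_counts` and the toy documents are assumptions for illustration only.

```python
from collections import Counter


def build_vocab_counts(tokenized_docs, min_count=10):
    """Count every token across all documents and keep those seen at least min_count times."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}


docs = [["death", "penalty"], ["penalty", "kick"], ["penalty"]]
vocab = build_vocab_counts(docs, min_count=2)
print(vocab)
# {'penalty': 3}
```

Words below the threshold ("death" and "kick" here) are dropped entirely, so they contribute nothing to training or inference.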

### Time to Train

In [11]:
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 2min 44s, sys: 1.79 s, total: 2min 46s
Wall time: 1min 11s


### Inferring a Vector

Once trained, the model can infer a vector for any new tokenized document. This vector can then be compared with other vectors via cosine similarity.

In [12]:
model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])

array([-1.00985859e-02, -1.66647714e-02, -1.78537089e-02, -1.02618933e-02,
       -7.61404494e-03,  5.06948819e-03,  5.99262584e-03, -3.03212907e-02,
       -1.11470679e-02,  3.84169258e-02, -3.12466584e-02, -2.09496589e-03,
        4.69029834e-03,  9.80205461e-03, -6.78870920e-03,  1.12740705e-02,
       -1.09866410e-02,  2.90585845e-03, -1.07265841e-02,  1.42388362e-02,
       -3.97526985e-03,  8.75213835e-03,  2.64565833e-03, -1.09908134e-02,
        1.18182003e-02, -1.23832384e-02, -2.68010106e-02,  7.90684298e-03,
       -4.14244318e-03, -1.87092591e-02, -2.19194614e-03,  1.81097165e-02,
        3.81297283e-02,  1.96947437e-02, -8.69639218e-03, -3.35553102e-02,
        2.84875222e-02, -4.21259105e-02,  1.28740892e-02, -1.71385948e-02,
       -1.82706583e-02,  3.29706334e-02,  3.66269425e-02,  1.12076895e-02,
        5.22379158e-03, -2.38496549e-02, -5.06996643e-03,  1.18713723e-02,
        3.02123148e-02, -3.39984633e-02,  1.21895531e-02, -1.39129348e-02,
       -1.12414807e-02, -

* `infer_vector()` takes a list of *string tokens*
* Input should be tokenized prior to inference
    * Here the test set is already tokenized (in `test_corpus = list(read_corpus(test_data, tokens_only=True))`)
    
Note: algorithms use internal randomization, so repeated inferences of the same text will return slightly different vectors.
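
The cosine similarity comparison mentioned above can be sketched as follows. This is a minimal, library-free version that accepts any two equal-length sequences of floats, which includes the arrays returned by `infer_vector()`; the function name `cosine_similarity` is chosen here for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```

Values near 1 indicate documents the model considers similar. Because of the randomization noted above, two inferences of the same text will typically score close to, but not exactly, 1.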

## Saving the model

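Save the trained model to disk, then delete the temporary training data from memory (keeping the document-tag vectors and the state needed for inference).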

In [13]:
model.save("reddit-doc2vec.model")
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

## Test the model
### To load the model:
* `model = gensim.models.doc2vec.Doc2Vec.load(fname)` (not required here)
### To use model for inference:
* `vector = model.infer_vector(["tokenized", "input", "string"])`
* To tokenize: `list(read_corpus(df, tokens_only=True))` (used earlier)

In [14]:
vector_sample = model.infer_vector(test_corpus[1])
print("Tokenized test sample:")
print(test_corpus[1])
print("\nInferred vector:")
print(vector_sample)

Tokenized test sample:
['what', 'ones', 'am', 'missing', 'acknowledged', 'the', 'main', 'ones', 'you', 'keep', 'bringing', 'up', 'medical', 'insurance', 'and', 'tax', 'literally', 'mentioned', 'these', 'points', 'in', 'my', 'flippin', 'post', 'asking', 'you', 'to', 'bring', 'up', 'better', 'benefit', 'than', 'what', 'already', 'flippin', 'mentioned', 'those', 'laws', 'you', 'keep', 'talking', 'about', 'mostly', 'flippin', 'apply', 'to', 'the', 'main', 'points', 'you', 'keep', 'bringing', 'up', 'property', 'children', 'medical', 'insurance', 'tax', 'alimony', 'etc', 'jesus']

Inferred vector:
[-1.46771912e-02 -3.45688686e-02  4.79957797e-02  8.53247121e-02
 -2.08942089e-04 -2.52568126e-02  3.83563042e-02  3.92340496e-02
 -2.71544382e-02 -4.67369370e-02 -6.32333755e-02  1.71506196e-01
  1.35483861e-01  3.58873457e-02  2.82059424e-03  1.20748810e-01
 -9.56936702e-02 -8.10178593e-02  3.72729376e-02 -3.99550535e-02
 -2.03975216e-02  1.45050185e-02  1.30063161e-01  2.66129300e-02
  3.6746665
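
Building on the inference call above, the module described in the introduction (text in, one vector per document out) could look roughly like the sketch below. Everything here is hypothetical: `vectorize_texts`, `whitespace_tokenize`, and `StubModel` are illustrative names, and the stub only mimics the `infer_vector()` call so the example runs without a trained model. In practice the trained Doc2Vec model and `gensim.utils.simple_preprocess` would take their place.

```python
def whitespace_tokenize(text):
    """Placeholder tokenizer; the notebook uses gensim.utils.simple_preprocess instead."""
    return text.lower().split()


def vectorize_texts(texts, model, tokenize=whitespace_tokenize):
    """Return one inferred vector per input text, in input order."""
    return [model.infer_vector(tokenize(str(text))) for text in texts]


class StubModel:
    """Stand-in for a trained model; only mimics the infer_vector() call."""

    def infer_vector(self, tokens):
        # Hypothetical 2-dimensional "vector": token count and total characters.
        return [float(len(tokens)), float(sum(len(t) for t in tokens))]


vectors = vectorize_texts(["There is no such equivalent.", "Be cute"], StubModel())
print(vectors)
# [[5.0, 24.0], [2.0, 6.0]]
```

With a data frame, the `texts` argument would be the body column (for example `df["reddit_ad_hominem.body"]`), and the returned list lines up with the rows in order.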

## For more:
* [Yaron Vazana](http://yaronvazana.com/2018/01/20/training-doc2vec-model-with-gensim/)
* [Rare Technologies](https://rare-technologies.com/doc2vec-tutorial/)
* [Gensim Documentation](https://radimrehurek.com/gensim/models/doc2vec.html)
* [Gensim Doc2Vec Tutorial on the IMDB Sentiment Dataset](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb)
* [Doc2Vec Tutorial on the Lee Dataset](https://markroxor.github.io/gensim/static/notebooks/doc2vec-lee.html)