# Doc2Vec Train and Test
The goal is to ultimately build a module that takes the data frame and, instead of the body text, returns a vector for each input paragraph (document).

In [1]:
import gensim
import collections
import smart_open
import random
import pandas as pd
import numpy as np

## Instantiate and Train Model

## Get train and test data

* Data (train and test): [Cleaned reddit dataset](../../data/ad_hominem/ad_hominems_cleaned.csv). The rows are shuffled and then split into train and test sets in a 70-30 ratio.

In [2]:
# Load the cleaned dataset, shuffle the rows, and split them 70-30 into train and test
data = pd.read_csv("../../data/ad_hominem/ad_hominems_cleaned.csv")
train_data, test_data = np.split(data.sample(frac=1), [int(.7*len(data))])
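
Because `data.sample(frac=1)` shuffles with a fresh random state on every run, the split above changes between runs. Below is a minimal sketch of a reproducible 70-30 split over row positions, using only the standard library. The helper name `split_indices` and the seed value 42 are assumptions made for illustration and are not part of the original notebook.

```python
import random


def split_indices(n_rows, train_frac=0.7, seed=42):
    """Shuffle row positions 0..n_rows-1 and cut them into train/test lists."""
    positions = list(range(n_rows))
    # A seeded generator makes the shuffled order repeatable across runs.
    random.Random(seed).shuffle(positions)
    # Same cut point as int(.7 * len(data)) in the cell above.
    cut = int(train_frac * n_rows)
    return positions[:cut], positions[cut:]


train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))
```

With pandas, the same effect can likely be had by passing a fixed `random_state` to `data.sample(frac=1, random_state=...)` before splitting.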

## Define a Function to Read and Preprocess Text

* Define a function that iterates over the rows of the train/test data frame
* Read the body text of each row (the `reddit_ad_hominem.body` column)
* Pre-process each text using a simple gensim pre-processing tool (i.e., tokenize text into individual words, remove punctuation, set to lowercase, etc.)
* Return a list of words (wrapped in a tagged document for training data).

Note that, in the data frame (corpus), each row constitutes a single document, and the length of each row entry (i.e., document) can vary. Also, to train the model, we need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the row's index in the data frame.
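
To make the pre-processing step concrete, here is a rough standard-library stand-in for `gensim.utils.simple_preprocess`. It is an approximation written for illustration, not gensim's actual code: the function name `approx_simple_preprocess` is hypothetical, and it only mimics the default behaviour of lowercasing, splitting on non-letters, and dropping tokens shorter than 2 or longer than 15 characters.

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (approximation only)."""
    # Lowercase, then keep runs of letters; digits and punctuation act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop very short and very long tokens, mirroring simple_preprocess defaults.
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(approx_simple_preprocess("I've never used 4Chan and from the little things I've seen"))
```

This is why a phrase such as "I've" ends up as the token `ve` and "4Chan" as `chan` in the generated corpus.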

In [3]:
def read_corpus(df, tokens_only=False):
    """Yield one tokenized document per data frame row (tagged unless tokens_only)."""
    for i, line in df.iterrows():
        # Tokenize, lowercase, and strip punctuation from the body text
        tokens = gensim.utils.simple_preprocess(str(line["reddit_ad_hominem.body"]))
        if tokens_only:
            yield tokens
        else:
            # For training data, tag each document with its data frame index
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

In [4]:
train_corpus = list(read_corpus(train_data))
test_corpus = list(read_corpus(test_data, tokens_only=True))

Let's take a look at the training corpus (both in the data frame and the generated corpus to see the differences):

In [5]:
pd.set_option('display.max_colwidth', 0)
pd.DataFrame(train_data.loc[:, "reddit_ad_hominem.body"])[:2] #4323660

Unnamed: 0,reddit_ad_hominem.body
8182,"I've never used 4Chan and from the little things I've seen from there over the years, I never will.From my understanding, 4Chan is the place for the undesirable anomalies of society to group together anonymously, mock the socially successful (or even unsuccessful), make jokes nobody finds funny for the sake of having inside jokes and say offensive things for the sake of them being offensive. All while being on a website created after 2Chan, a less so but still ridiculous Japanese version.The website looks like something from 2002 and is like a more confusing and pointless version of Reddit. I hate when people refer to the place and part of me thinks it's full of Neo-Nazi types, pedophiles and weebs. Like all these people desperately want to fit in somewhere but the rest of society hates them."
24828,it is a very real threat. There are parties actively pushing for more restrictions on gun ownership


In [6]:
train_corpus[:2]

[TaggedDocument(words=['ve', 'never', 'used', 'chan', 'and', 'from', 'the', 'little', 'things', 've', 'seen', 'from', 'there', 'over', 'the', 'years', 'never', 'will', 'from', 'my', 'understanding', 'chan', 'is', 'the', 'place', 'for', 'the', 'undesirable', 'anomalies', 'of', 'society', 'to', 'group', 'together', 'anonymously', 'mock', 'the', 'socially', 'successful', 'or', 'even', 'unsuccessful', 'make', 'jokes', 'nobody', 'finds', 'funny', 'for', 'the', 'sake', 'of', 'having', 'inside', 'jokes', 'and', 'say', 'offensive', 'things', 'for', 'the', 'sake', 'of', 'them', 'being', 'offensive', 'all', 'while', 'being', 'on', 'website', 'created', 'after', 'chan', 'less', 'so', 'but', 'still', 'ridiculous', 'japanese', 'version', 'the', 'website', 'looks', 'like', 'something', 'from', 'and', 'is', 'like', 'more', 'confusing', 'and', 'pointless', 'version', 'of', 'reddit', 'hate', 'when', 'people', 'refer', 'to', 'the', 'place', 'and', 'part', 'of', 'me', 'thinks', 'it', 'full', 'of', 'neo',

And the testing corpus looks like this:

In [7]:
pd.DataFrame(test_data.loc[:, "reddit_ad_hominem.body"])[:2]

Unnamed: 0,reddit_ad_hominem.body
22538,and more having to with the fact that transgender people are a very small minority in a society not designed for them in general.
5969,"""why don't you give that project to Veronica. She makes more money than I do, let her handle it."""


In [8]:
print(test_corpus[:2])

[['and', 'more', 'having', 'to', 'with', 'the', 'fact', 'that', 'transgender', 'people', 'are', 'very', 'small', 'minority', 'in', 'society', 'not', 'designed', 'for', 'them', 'in', 'general'], ['why', 'don', 'you', 'give', 'that', 'project', 'to', 'veronica', 'she', 'makes', 'more', 'money', 'than', 'do', 'let', 'her', 'handle', 'it']]


Notice that the testing corpus is just a list of lists and does not contain any tags.

## Training the Model
### Instantiate a Doc2Vec Object 
Doc2Vec model with:
* Vector size of 500 dimensions
* 10 iterations (epochs) over the training corpus (more iterations take more time and eventually reach a point of diminishing returns)
* Minimum word count set to 10 (discard words that occur fewer than 10 times)

Note: retaining infrequent words can often make a model worse.

In [9]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=10, epochs=10)

### Build a Vocabulary

In [10]:
model.build_vocab(train_corpus)

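Essentially, the vocabulary is a dictionary (accessible via `model.wv.vocab`) of all the unique words in the training corpus that occur at least `min_count` times, along with their counts (e.g., `model.wv.vocab['penalty'].count` gives the count for the word `penalty`). In gensim 4.0 and later, `wv.vocab` was replaced by `model.wv.key_to_index`, and counts are read with `model.wv.get_vecattr('penalty', 'count')`.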
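
The idea of a count-based vocabulary with a minimum word count can be sketched in plain Python. This is only an illustration of the concept, not how gensim builds its vocabulary internally. The helper name `build_toy_vocab` and the toy documents are made up for the example.

```python
from collections import Counter


def build_toy_vocab(token_lists, min_count=10):
    """Count every token, then keep only those seen at least min_count times."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_docs = [
    ["death", "penalty", "debate"],
    ["penalty", "kick"],
    ["penalty", "debate"],
]
vocab = build_toy_vocab(toy_docs, min_count=2)
print(vocab)
```

Here `death` and `kick` occur only once, so they are discarded, in the same spirit as `min_count=10` on the real corpus.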

### Time to Train

In [11]:
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 2min 44s, sys: 1.81 s, total: 2min 45s
Wall time: 1min 14s


### Inferring a Vector

`infer_vector()` computes a vector for a new, possibly unseen document. This vector can then be compared with other vectors via cosine similarity.

In [12]:
model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])

array([ 6.10099314e-03,  1.96408131e-03,  4.54725325e-03,  9.16318968e-03,
        1.16502456e-02,  2.16513555e-02,  4.69636125e-03,  3.01834126e-03,
       -1.73208881e-02,  1.19826598e-02, -9.12664831e-03, -1.61856525e-02,
        1.79378670e-02, -5.63165639e-03, -8.99874067e-05, -2.20483784e-02,
        1.42881665e-02,  3.41686308e-02,  2.00273059e-02, -1.02976719e-02,
        2.52445228e-02,  2.67276932e-02, -2.76128724e-02,  1.86041202e-02,
       -4.02392671e-02,  1.81042142e-02,  5.15680127e-02, -1.04470998e-02,
       -4.18031663e-02,  2.97561195e-03,  2.16589738e-02,  8.55786074e-03,
        1.16063291e-02,  1.87184606e-02, -2.73608211e-02, -1.20737310e-02,
       -3.78828496e-02, -1.63631998e-02, -5.23256622e-02,  4.22422169e-03,
        2.63618119e-02, -1.08182644e-02,  7.80457119e-03,  2.88221240e-02,
       -4.92784707e-03, -2.00535599e-02, -3.76831442e-02,  4.88839857e-02,
       -2.74876226e-02, -2.54667606e-02,  1.58868032e-03, -8.84599425e-03,
       -3.15850275e-03, -

* `infer_vector()` takes a list of *string tokens*
* Input should be tokenized prior to inference
    * Here the test set is already tokenized (in `test_corpus = list(read_corpus(test_data, tokens_only=True))`)
    
Note: inference uses internal randomization, so repeated inferences of the same text will return slightly different vectors.
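
The cosine similarity comparison mentioned above can be sketched as follows. This is a minimal standard-library version, and the function name `cosine_similarity` is chosen here for illustration. It assumes both vectors have the same length and are non-zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))             # orthogonal
```

The same function accepts the NumPy arrays returned by `model.infer_vector(...)`, since they iterate element by element. Because of the randomization noted above, two inferences of the same text should score close to, but not necessarily exactly, 1.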

## Saving the model

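Save the trained model to disk, then delete temporary training data from memory. Note that `delete_temporary_training_data` belongs to the gensim 3.x API; it was removed in gensim 4.0.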

In [23]:
model.save("reddit-doc2vec.model")
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

## To load the model:

`model = gensim.models.doc2vec.Doc2Vec.load(fname)`

## To use model for inference:

`vector = model.infer_vector(["tokenized", "input", "string"])`

## Test the model
The trained model is still in memory, so reloading it with `Doc2Vec.load(fname)` is not required here. To infer a vector for a test document:
* Tokenize the text with `list(read_corpus(df, tokens_only=True))` (used earlier to build `test_corpus`)
* Pass the resulting tokens to `model.infer_vector(...)`

In [14]:
vector_sample = model.infer_vector(test_corpus[1])
print("Tokenized test sample:")
print(test_corpus[1])
print("\nInferred vector:")
print(vector_sample)

Tokenized test sample:
['why', 'don', 'you', 'give', 'that', 'project', 'to', 'veronica', 'she', 'makes', 'more', 'money', 'than', 'do', 'let', 'her', 'handle', 'it']

Inferred vector:
[-0.01985193  0.01519804  0.0105221  -0.02903783  0.04584349  0.00821724
  0.01950787 -0.04702343 -0.04687892 -0.03594166 -0.01066448 -0.0942203
 -0.01465426 -0.01166661 -0.0058945   0.014545   -0.04101643  0.0406091
 -0.00967981 -0.00633664 -0.00180249 -0.05288441 -0.00942213  0.01729073
  0.01860647  0.02358845  0.01816625 -0.02612522 -0.03782067 -0.07178894
  0.07045363 -0.05475719 -0.04246257  0.00980603 -0.11037128  0.04370284
  0.00813757  0.00931698  0.00326506  0.00490022  0.02962516 -0.01799162
  0.01522217  0.00149615  0.00011974 -0.03393247  0.01070947 -0.00145847
 -0.03797978  0.00212131  0.01929183  0.03622057  0.0517694   0.04490124
 -0.02248931  0.0564543   0.03109168  0.03142186 -0.06635758 -0.01958945
 -0.01405963  0.05588077 -0.02224294  0.01246074  0.08489747  0.02165684
 -0.02622388 -

## For more:
* [Yaron Vazana](http://yaronvazana.com/2018/01/20/training-doc2vec-model-with-gensim/)
* [Rare Technologies](https://rare-technologies.com/doc2vec-tutorial/)
* [Gensim Documentation](https://radimrehurek.com/gensim/models/doc2vec.html)
* [Gensim Doc2Vec Tutorial on the IMDB Sentiment Dataset](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb)
* [Doc2Vec Tutorial on the Lee Dataset](https://markroxor.github.io/gensim/static/notebooks/doc2vec-lee.html)