# Doc2Vec Testing Trained Model

In [1]:
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import tokenize
import pandas as pd

## Load the data:
### Doc2Vec model
* This is a test model, trained from [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf) included in gensim.
* Vector size was 50 words and iterating over the training corpus 40 times (40 epochs).
* Minimum word count set to 2 in (to discard words with very few occurrences). 
* Doc2Vec models should use way more documents in the training set and less epochs. In literature ([Doc2Vec Paper](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)) 10k-1M of docs are used, with 10-20 epochs.
* Model is most likely overfit.
* Notebook used for testing: [Doc2Vec Tutorial on the Lee Dataset](doc2vec-train.ipynb)

However, this is a very very small dataset (300 documents) with shortish documents (a few hundred words). Adding training passes can sometimes help with such small datasets.
### Dataframe
* Fallacies from reddit database (cleaned).

In [2]:
test_model = Doc2Vec.load("./lee-doc2vec.model")
fallacies = pd.read_csv("./ad_hominems_cleaned_Murilo.csv")

## Prepare example to test:
* Load dataset
* Grab example
* Tokenize example

See example below

In [3]:
sample_test = fallacies.loc[1,"reddit_ad_hominem.body"]
tokenized_sample = tokenize(sample_test)
print(sample_test)

I'm sorry if your smugness gets in the way. Like I said elsewhere in this thread. Somolia is not close to anything I advocate for so why on earth would I move there? Any time the Somolia "argument" is brought up, I instantly know I'm dealing with someone who refuses to learn the difference between a voluntary society and a third world country ravaged by warlords and foreign policies of other countries. If you want a thoughtful response to an argument, make sure you're not comparing Antarctica to the Bahamas. Otherwise, take your circlejerk, "arguments" elsewhere. You have contributed absolutely nothing to this thread but ad hominem Attacks and the typical liberal/conservative talking points and almost everyone in here knows it. 


## Infer the vector:
From data input, infer the vector, given the loaded model. See output below

In [6]:
vector = test_model.infer_vector(tokenized_sample)
vector

array([ 0.00097627,  0.00430379,  0.00205527,  0.00089766, -0.0015269 ,
        0.00291788, -0.00124826,  0.00783546,  0.00927326, -0.00233117,
        0.0058345 ,  0.0005779 ,  0.00136089,  0.00851193, -0.00857928,
       -0.00825741, -0.00959563,  0.0066524 ,  0.00556313,  0.00740024,
        0.00957237,  0.00598317, -0.00077041,  0.00561058, -0.00763451,
        0.00279842, -0.00713293,  0.00889338,  0.00043697, -0.00170676,
       -0.00470889,  0.00548467, -0.00087699,  0.00136868, -0.0096242 ,
        0.00235271,  0.00224191,  0.00233868,  0.00887496,  0.00363641,
       -0.00280984, -0.00125936,  0.00395262, -0.00879549,  0.00333533,
        0.00341276, -0.00579235, -0.00742147, -0.00369143, -0.00272578],
      dtype=float32)