# Training

## Imports

In [1]:
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.parsing.preprocessing import strip_punctuation, strip_numeric
import random

## Data
Data comes from [this academic source](http://fakenews.research.sfu.ca/).

In [2]:
df = pd.read_csv("data/snopes_phase2_clean_2018_7_3.csv")

In [3]:
raw_texts = list(df['original_article_text_phase2'])
labels = list(df['fact_rating_phase1'])

In [4]:
print('We have '+str(len(raw_texts))+' total texts in our dataset.')

We have 15804 total texts in our dataset.


In [5]:
def clean(doc):
    return strip_punctuation(doc).lower().split()

In [6]:
texts = [clean(doc) for doc in raw_texts]

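The following creates a TaggedDocument object for each text in the dataset, tagging each text with its fact rating (label), e.g. "true" or "false". Because every text with the same rating shares a tag, the model learns one document vector per rating rather than one per text.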

In [7]:
documents = [TaggedDocument(doc, [label]) for doc,label in zip(texts,labels)]
random.shuffle(documents)
n = len(documents)
split = n*7//10
train_corpus = documents[:split]
test_corpus = documents[split:]
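
Because `random.shuffle` draws from the global random state, the split above changes on every run. A minimal sketch of a reproducible 70/30 split follows, assuming a fixed seed is acceptable for this experiment. The helper name `shuffle_split` and the seed value are illustrative and not part of the original notebook, and integers stand in for the TaggedDocument objects.

```python
import random

def shuffle_split(items, seed=0):
    """Return (train, test) lists: a seeded shuffle followed by a 70/30 split."""
    shuffled = list(items)                 # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # private generator, independent of global state
    split = len(shuffled) * 7 // 10        # same integer arithmetic as the cell above
    return shuffled[:split], shuffled[split:]

# Stand-in data: integers instead of TaggedDocument objects.
train_demo, test_demo = shuffle_split(range(10))
print(len(train_demo), len(test_demo))
```

Calling `shuffle_split` twice with the same seed returns the same partition, which makes later evaluation numbers comparable across runs.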

## Model
The model is trained on the training documents with a vector size of 100 (used for both document and word vectors) and a window of 10 (the maximum distance between the current word and a predicted word). `min_count=2` discards any word that appears fewer than twice in the training corpus, and `epochs=100` sets the number of passes over the corpus.

In [8]:
model = Doc2Vec(vector_size=100, window=10, min_count=2, epochs=100)
model.build_vocab(train_corpus)

Train

In [9]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Save the model, then reload it from disk.

In [10]:
model.save("models/my_doc2vec_model")
model = Doc2Vec.load("models/my_doc2vec_model")

# Evaluating

## Example

In [11]:
new_doc = 'hillary clinton won the presidential election'.split()
vector = model.infer_vector(new_doc)
print(vector)

[ 0.77026314  0.3245291   0.46236303  0.6530718   0.252805    1.1175214
 -0.82519925 -0.42763317  0.2310017  -0.6000125   0.05518538 -0.5051248
 -0.11029471 -0.2755155   0.02550256  0.55742896  0.47391373  0.47023472
  0.15100312  0.532407    0.24265862  0.28292423 -0.6275333  -0.1948347
  0.21949245  0.11276026 -0.1754544  -1.1636639  -0.05264677 -0.11232027
 -1.316966   -0.35494038  0.06453314  0.3031587  -0.6171715   0.8219609
 -0.39230192 -0.63734394  0.07676978 -0.78597903 -0.27953026 -0.7963541
 -0.1812085   0.64095634 -0.58655846 -0.63447726  0.4350234  -0.23559555
  0.8868153   0.696059    0.01137678 -0.73044     0.720216   -0.45125455
  0.46951807 -0.18292809 -0.8259651   0.35229456  0.04092822 -0.26731676
 -0.10490963 -0.37612042 -0.2486074   0.03884813  0.6809351  -0.17610945
  0.36686182  0.65326095 -0.16514468  0.57372344 -0.0118232  -0.61706966
  0.16035254  0.15476711 -0.0227529  -0.8314403   0.6950987  -0.02919235
 -0.43318322  0.57178324 -0.50574094  0.35658738 -0.0871

## Assessment
We do the following to check that the model behaves sensibly. For each document in the training corpus, we infer a new vector from the model, rank all tag vectors stored in the model by their similarity to it, and record where the document's own tag (its fact rating) falls in that ranking. ***rank*** stores the index of the correct tag in the similarity list, so a rank of 0 means the inferred vector is closest to the vector of its own rating. We should see most documents ranked 0.

In [12]:
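ranks = []          # position of each document's own tag in its similarity list
second_ranks = []   # the second most similar (tag, similarity) pair per document
for doc_id in range(len(train_corpus)):
    # Infer a fresh vector from the document's words alone
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # Rank every stored tag vector by similarity to the inferred vector
    # (in gensim 4.x, model.docvecs is a deprecated alias of model.dv)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(train_corpus[doc_id].tags[0])
    ranks.append(rank)
    second_ranks.append(sims[1])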
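
To our understanding, `most_similar` orders the stored tag vectors by cosine similarity to the query vector. The sketch below reproduces that ranking on toy 2-dimensional vectors using only the standard library. The tag names, vector values, and the helper names `cosine` and `rank_of_tag` are invented for illustration and are not taken from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_of_tag(query, tag_vectors, true_tag):
    """Position of true_tag when tags are sorted by similarity to query (0 = best)."""
    ordered = sorted(tag_vectors,
                     key=lambda tag: cosine(query, tag_vectors[tag]),
                     reverse=True)
    return ordered.index(true_tag)

# Hypothetical tag vectors with made-up values.
toy_vectors = {"true": [1.0, 0.0], "false": [0.0, 1.0], "mixture": [1.0, 1.0]}
query = [0.9, 0.1]  # stands in for an inferred document vector

print(rank_of_tag(query, toy_vectors, "true"))
```

Here the query points almost exactly along the "true" vector, so "true" is ranked first, which corresponds to a rank of 0 in the assessment loop.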

In [13]:
import collections

counter = collections.Counter(ranks)
print(counter)

Counter({0: 8662, 1: 1245, 2: 554, 3: 259, 4: 154, 5: 77, 6: 42, 7: 25, 8: 22, 9: 18, 10: 4})
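
The raw counts are easier to interpret as fractions. A small sketch, using the counts copied from the output above, computes the share of training documents whose own rating is ranked first and the share ranked within the top three. The variable names are illustrative only.

```python
from collections import Counter

# Counts copied from the printed Counter: rank -> number of training documents.
rank_counts = Counter({0: 8662, 1: 1245, 2: 554, 3: 259, 4: 154, 5: 77,
                       6: 42, 7: 25, 8: 22, 9: 18, 10: 4})

total = sum(rank_counts.values())                          # size of the training corpus
top1 = rank_counts[0] / total                              # own rating ranked first
top3 = sum(rank_counts[r] for r in range(3)) / total       # own rating in the top three

print(f"{total} documents, top-1: {top1:.1%}, top-3: {top3:.1%}")
```

The total equals 70% of the 15804 texts, which matches the size of `train_corpus`, so these figures describe the training data only and say nothing yet about `test_corpus`.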
