## Project Overview



In [None]:
# !pip install pandas
# !pip install gensim
# !pip install nltk
# !pip install scikit-learn
# !pip install numpy
# !pip install openpyxl

In [1]:
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models.word2vec import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.cluster import KMeans

RANDOM_SEED = 42

## Part 1a: Political Bias Modeling

First we want to build a model of political bias using features that will be available in our primary dataset. We'll import the Spinde political bias dataset and select the article text and bias rating columns. Then, we'll vectorize the article text and train the model.

In [2]:
#First, we'll import the political bias dataset. We'll only keep the article body text ('article') and bias type ('type').
pb_spinde = pd.read_excel(r'assets\pb_spinde.xlsx')
pb_reduced = pb_spinde[['article','type']]
pb_reduced = pb_reduced.dropna().reset_index(drop=True)



In [4]:
#We'll replace the text labels with numbers.
pb_reduced['type'] = pb_reduced.type.replace({'center':0,'left':-1,'right':1})

In [5]:
#Here we'll do a simple nltk word tokenization on the article text.
pb_reduced['tokens'] = pb_reduced.article.apply(lambda y: [str.lower(x) for x in word_tokenize(y)])

In [6]:
#Now we'll train the Word2Vec model on our text tokens.
wv_mod = Word2Vec(pb_reduced['tokens'], seed = RANDOM_SEED)

In [7]:
#We'll extract the vectors from the model...
vectors = wv_mod.wv
#...and since each word is a vector of 100 numbers, we'll take the mean of all word vectors in a given article 
#to represent the article as a whole
vec_frame = pd.DataFrame([vectors.get_mean_vector(x) for x in pb_reduced.tokens])
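
To make the averaging step concrete, here is a minimal sketch in plain NumPy. The three-word vocabulary, the 4-dimensional vectors, and the names `toy_vectors` and `mean_vector` are all made up for illustration and are not part of gensim. The sketch skips out-of-vocabulary tokens and unit-normalizes each word vector before averaging, which is my understanding of the default behavior of `get_mean_vector`; check the gensim documentation for your version.

```python
import numpy as np

# Hypothetical toy embeddings: 3 words, 4 dimensions each (not real Word2Vec output).
toy_vectors = {
    "tax":    np.array([1.0, 0.0, 0.0, 0.0]),
    "reform": np.array([0.0, 2.0, 0.0, 0.0]),
    "senate": np.array([0.0, 0.0, 3.0, 0.0]),
}

def mean_vector(tokens, vectors, dim=4):
    """Average the unit-normalized vectors of the tokens found in `vectors`."""
    known = [vectors[t] / np.linalg.norm(vectors[t]) for t in tokens if t in vectors]
    if not known:
        # No known tokens at all: fall back to a zero vector.
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "unknownword" is not in the vocabulary, so it is ignored.
doc = ["tax", "reform", "unknownword"]
doc_vec = mean_vector(doc, toy_vectors)
print(doc_vec)
```

Every article ends up as one fixed-length vector regardless of its word count, which is what lets us feed articles of different lengths to the classifier.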

In [8]:
#Finally, we'll train a Random Forest classifier on the vectorized text to predict article bias.
X_train, X_test, y_train, y_test = train_test_split(vec_frame, pb_reduced.type, test_size=0.2, random_state=RANDOM_SEED)

In [9]:
clf = RandomForestClassifier(random_state=RANDOM_SEED)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.85

## Part 1b: Applying the Model

Now, we want to predict the political bias of the target fake news dataset. We'll save these predictions as probabilities, which we'll use as additional features for clustering and trustworthiness prediction.

In [10]:
#Import the dataset and drop any rows with NA values
fn_kag = pd.read_csv(r'assets\fn_kagg_train.csv')
fn_kag = fn_kag.dropna().reset_index(drop=True)

In [11]:
#Tokenize article body text
fn_kag_tok = fn_kag.copy()
fn_kag_tok['text_tokens'] = fn_kag_tok.text.apply(lambda y: [str.lower(x) for x in word_tokenize(y)])

In [12]:
#Some articles have very few words, so we'll keep only rows with more than 30 tokens.
fn_kag_tok['tmp'] = fn_kag_tok['text_tokens'].apply(len)
fn_kag_tok = fn_kag_tok[fn_kag_tok['tmp']>30]
fn_kag_tok = fn_kag_tok.drop(columns='tmp')

In [13]:
#Now we'll apply the Word2Vec model we generated above to our tokens to vectorize the text.
vec_frame = pd.DataFrame([vectors.get_mean_vector(x) for x in fn_kag_tok.text_tokens])

In [14]:
#Now we apply the Random Forest classifier to our vectorized text and save out the predicted probabilities.
preds = pd.DataFrame(clf.predict_proba(vec_frame), columns=['dem_bias','neutral','rep_bias'])
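
The column names above rely on `predict_proba` returning its columns in the order of `clf.classes_`, which scikit-learn keeps sorted. With the labels -1, 0 and 1 that gives left, center, right. Below is a small self-contained check on synthetic data; the toy features and the `label_names` mapping are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in features and labels; labels are given in unsorted order on purpose.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(60, 5))
y_toy = np.tile([1, -1, 0], 20)

toy_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# Derive the column names from classes_ instead of hard-coding their order.
label_names = {-1: "dem_bias", 0: "neutral", 1: "rep_bias"}
proba_cols = [label_names[int(c)] for c in toy_clf.classes_]
toy_preds = pd.DataFrame(toy_clf.predict_proba(X_toy), columns=proba_cols)
print(proba_cols)
```

Deriving the names from `classes_` keeps the labels correct even if the numeric encoding changes later.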

In [15]:
#Finally, we'll rejoin the predictions to the original dataset.
fn_kag_reduced = fn_kag_tok.copy().reset_index(drop=True)
fn_kag_reduced = fn_kag_reduced.join(preds)
fn_kag_reduced

Unnamed: 0,id,title,author,text,label,text_tokens,dem_bias,neutral,rep_bias
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1,"[house, dem, aide, :, we, didn, ’, t, even, se...",0.42,0.07,0.51
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0,"[ever, get, the, feeling, your, life, circles,...",0.33,0.02,0.65
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1,"[why, the, truth, might, get, you, fired, octo...",0.52,0.03,0.45
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1,"[videos, 15, civilians, killed, in, single, us...",0.24,0.31,0.45
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1,"[print, an, iranian, woman, has, been, sentenc...",0.46,0.25,0.29
...,...,...,...,...,...,...,...,...,...
17939,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...,0,"[rapper, t., i., unloaded, on, black, celebrit...",0.23,0.03,0.74
17940,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...,0,"[when, the, green, bay, packers, lost, to, the...",0.27,0.35,0.38
17941,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...,0,"[the, macy, ’, s, of, today, grew, from, the, ...",0.51,0.17,0.32
17942,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal...",1,"[nato, ,, russia, to, hold, parallel, exercise...",0.30,0.31,0.39


Now we have bias predictions for each article in our fake news dataset. We could follow a similar procedure to add further features (e.g. sentiment scores).

## Part 2: Clustering

Once we have all the features we want, we'll do unsupervised clustering. Ideally we'd evaluate several cluster counts to find a good one, but for now we'll just go with 4.
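
One common way to do that kind of evaluation is to compare silhouette scores across candidate cluster counts. The sketch below runs on synthetic blobs rather than the article features, and the candidate range of 2 to 6 is an arbitrary assumption; on the real data, the combined feature matrix would take the place of `X_demo`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: 4 well-separated groups at the corners of a square.
corners = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X_demo, _ = make_blobs(n_samples=400, centers=corners, cluster_std=0.6, random_state=42)

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest silhouette score is the candidate to prefer.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Silhouette scores range from -1 to 1, with higher values meaning tighter, better-separated clusters. On real text features the peak is usually much less clear-cut than on this toy data.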

We'll need to re-vectorize the text with a Word2Vec model trained on the fake news articles themselves, as the vectors from the political bias model won't work here. Also, we'd probably want to vectorize *both* headline and article body, but for now I'll just vectorize the article body.

In [16]:
#Since we already have the tokenized text from above, we can just go ahead and train the new Word2Vec model on those tokens.
wv_mod = Word2Vec(fn_kag_reduced['text_tokens'], seed = RANDOM_SEED)

In [17]:
#Again we'll extract and average the word vectors.
vectors = wv_mod.wv
vec_frame = pd.DataFrame([vectors.get_mean_vector(x) for x in fn_kag_reduced.text_tokens])

In [18]:
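#We'll join the new word vectors with the bias estimates we generated above.
all_feat_df = vec_frame.join(fn_kag_reduced).drop(columns=['id','title','author','text','label','text_tokens'])
#Recent scikit-learn versions require all column names to be strings, so we convert the integer vector column names.
all_feat_df.columns = all_feat_df.columns.astype(str)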

In [19]:
#Finally we'll build our clustering model...
cls = KMeans(4, random_state=RANDOM_SEED).fit(all_feat_df)



In [20]:
#...and add the predicted clusters back into the vector dataframe.
all_feat_df['cluster'] = cls.predict(all_feat_df)
all_feat_df



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,94,95,96,97,98,99,dem_bias,neutral,rep_bias,cluster
0,0.047740,-0.062067,0.010142,0.023733,0.027136,0.011726,-0.017329,-0.025544,0.027321,-0.041113,...,0.049923,0.000369,-0.057304,-0.037434,0.021349,0.011955,0.42,0.07,0.51,0
1,0.038732,-0.062580,-0.000968,0.032428,0.034728,0.006141,-0.037084,-0.027762,0.024189,-0.042112,...,0.050013,0.020645,-0.053560,-0.046181,0.028000,0.009854,0.33,0.02,0.65,0
2,0.043878,-0.084607,0.001290,0.029689,0.024617,0.012073,-0.043939,-0.027483,0.038452,-0.054688,...,0.050848,-0.004226,-0.061635,-0.054407,0.019642,0.001657,0.52,0.03,0.45,2
3,0.018492,-0.052355,0.016414,-0.019925,0.036960,-0.009948,-0.020988,-0.009709,0.013210,-0.092974,...,0.062181,0.004864,-0.071717,-0.040037,0.029040,0.008478,0.24,0.31,0.45,1
4,0.045002,-0.040723,-0.001672,-0.007292,0.032992,0.001198,-0.038173,-0.011006,0.034885,-0.060621,...,0.027709,0.008914,-0.074335,-0.041465,0.038758,-0.006983,0.46,0.25,0.29,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17939,0.042028,-0.041074,-0.007991,0.023312,0.040681,0.004353,-0.045735,-0.018476,0.024066,-0.023719,...,0.059906,0.010804,-0.047868,-0.021401,0.015917,0.019536,0.23,0.03,0.74,0
17940,0.021898,-0.026946,0.032306,-0.005613,0.034018,-0.001052,-0.012961,-0.041187,-0.004921,-0.088816,...,0.064928,0.010397,-0.034544,-0.019263,0.019563,0.041904,0.27,0.35,0.38,1
17941,0.011924,-0.052399,0.033022,-0.000930,0.010965,0.003568,-0.028792,-0.027073,0.023143,-0.056647,...,0.039514,0.020995,-0.045691,-0.024739,0.013343,0.008033,0.51,0.17,0.32,2
17942,0.039743,-0.046810,0.016195,-0.026166,0.025034,-0.005241,-0.046477,-0.026727,0.009661,-0.106481,...,0.067306,-0.031781,-0.070297,-0.045169,0.007362,-0.013425,0.30,0.31,0.39,1


## Part 3: Supervised Learning

Now that we have all of our features and clusters, and article body text is already vectorized, we can train a classifier to predict whether a given article is misinformation or not.

In [21]:
#We've already done most of the work above, so we'll just split up the dataset and build the model.
X_train, X_test, y_train, y_test = train_test_split(all_feat_df, fn_kag_reduced.label, test_size=0.2, random_state=RANDOM_SEED)

In [22]:
clf = RandomForestClassifier(random_state=RANDOM_SEED)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)



0.9127890777375314

This is a surprisingly good score, which makes me suspect I did something wrong. I probably should have done the train-test split *before* applying the political bias model and doing the clustering, since running those steps on the full dataset probably caused some data leakage. Still, this is probably good enough for me to get started on the dashboard.
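
A minimal sketch of the split-first ordering, on synthetic data standing in for the article features: the split happens before any fitting, and the clustering step is fit on the training rows only and then applied to the test rows. The variable names and the synthetic dataset are assumptions for illustration; in this notebook, the second Word2Vec model would need the same treatment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

# Synthetic stand-in for the vectorized articles and their labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=RANDOM_SEED)

# 1. Split first, so no fitting step ever sees the test rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)

# 2. Fit the clustering on the training rows only.
km = KMeans(n_clusters=4, n_init=10, random_state=RANDOM_SEED).fit(X_tr)

# 3. Append the cluster id as a feature; test rows are assigned to the existing clusters.
X_tr_feat = np.column_stack([X_tr, km.predict(X_tr)])
X_te_feat = np.column_stack([X_te, km.predict(X_te)])

# 4. Train on the training features and score on the held-out features.
rf = RandomForestClassifier(random_state=RANDOM_SEED).fit(X_tr_feat, y_tr)
test_accuracy = rf.score(X_te_feat, y_te)
print(test_accuracy)
```

If the score on the real data drops noticeably under this ordering, that would suggest the original number was inflated by leakage.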