Below is a model for fake news detection (TF-IDF features fed to a logistic regression classifier) over the Buzzfeed-Webis Fake News Corpus 2016.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import xml.etree.ElementTree as ET
import os
from gensim.models import Word2Vec, KeyedVectors

The first step is to download the Buzzfeed-Webis corpus, which is distributed as XML files. read_files reads each XML file in data/train/ and parses its tree, yielding a tuple of the article body ('mainText') and the veracity label ('veracity').

In [3]:
def read_files():
    """
    For each xml file return the main text and the veracity label
    """
    path = 'data/train/'
    for filename in os.listdir(path):
        if not filename.endswith('.xml'): continue
        xmlfile = os.path.join(path, filename)
        tree = ET.parse(xmlfile)
        yield (tree.find('mainText').text, tree.find('veracity').text)

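def tokenize(doc):
    """Lowercase the text and split it on any run of whitespace."""
    # split() without arguments also handles newlines and repeated spaces,
    # which split(" ") would turn into empty or merged tokens
    return doc.lower().split()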
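
As a rough illustration of what read_files extracts from each file, the sketch below parses a made-up article held in memory. Only the 'mainText' and 'veracity' element names come from the code above; the root element name and the text are invented for the example, and the real corpus files will contain more fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical article: only the 'mainText' and 'veracity' tags are taken from
# read_files; the root tag and the contents are invented for this example.
sample_xml = """<article>
  <mainText>Example body text of the article.</mainText>
  <veracity>mostly true</veracity>
</article>"""

root = ET.fromstring(sample_xml)

# find() looks up a direct child by tag name; .text holds its string content
main_text = root.find('mainText').text
veracity = root.find('veracity').text

print((main_text, veracity))
```

An element with no text content has .text set to None, which is why the notebook filters out articles whose main text is None.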

We call this function to get a list of the main text of each article ('documents') as well as a matching list of the labels ('predictions'), encoded as integer indices into 'possibilities'. Articles with no main text are skipped.

In [4]:
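possibilities = ['mixture of true and false', 'mostly false', 'no factual content', 'mostly true']
# Read the corpus once, skipping articles that have no main text
articles = [(text, label) for text, label in read_files() if text is not None]
documents = [text for text, _ in articles]
# Encode each veracity label as its index in `possibilities`
predictions = [possibilities.index(label) for _, label in articles]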

Now we load the Google News pre-trained word embeddings for use in our model. These are 300-dimensional word2vec vectors (word2vec can be trained with either the CBOW or the skip-gram architecture), trained on a Google News corpus of roughly 100 billion words.

In [5]:
file = 'data/GoogleNews-vectors-negative300.bin'
embeddings = KeyedVectors.load_word2vec_format(file, binary=True)

To represent entire articles using the Google News word embeddings, we replace each word with its matching embedding and then take the elementwise mean over the entire document. This turns a document of N words from N separate vectors into a single vector with the same dimensionality as the embeddings.
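
As a minimal sketch of this averaging, the snippet below uses a plain dictionary of toy 2-dimensional vectors in place of the real embeddings. The helper name mean_vector, its dim parameter, and the choice to skip out-of-vocabulary words are assumptions made for illustration; the notebook does not specify how unknown words should be handled.

```python
def mean_vector(tokens, embeddings, dim):
    """Average the vectors of all tokens found in `embeddings` (hypothetical helper)."""
    # Skip out-of-vocabulary tokens (an assumption, see the note above)
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        # No known tokens: fall back to a zero vector of the embedding size
        return [0.0] * dim
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Toy 2-dimensional embeddings standing in for the 300-dimensional vectors
toy_embeddings = {"fake": [1.0, 0.0], "news": [0.0, 1.0]}

doc_vector = mean_vector(["fake", "news", "unknownword"], toy_embeddings, dim=2)
print(doc_vector)  # → [0.5, 0.5]
```

A gensim KeyedVectors object supports the same `token in embeddings` and `embeddings[token]` lookups, so the same logic should carry over; with NumPy arrays, `numpy.mean(vectors, axis=0)` is the more usual way to take the elementwise mean.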

In [None]:
def document_to_vector(docText,embeddings):
    """
    This function converts the text of a document (input as a list of strings) to word embeddings, then
    takes the elementwise average of the embeddings to return a single vector.
    """
    
    

In [None]:
# Calculate TF-IDF over the main text of each article, creating vector representations of them
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
sklearn_tfidf = TfidfVectorizer(norm='l2', min_df=1, use_idf=True, smooth_idf=False,
                                sublinear_tf=True, tokenizer=tokenize)
sklearn_representation = sklearn_tfidf.fit_transform(documents)
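
To make the vectorizer settings concrete, here is a pure-Python sketch of the weighting they correspond to in the scikit-learn documentation: sublinear_tf replaces a raw count tf with 1 + ln(tf), smooth_idf=False gives idf = ln(n / df) + 1, and norm='l2' scales each document vector to unit length. The function name tfidf_rows and the two toy documents are made up for the example; this is an illustration of the formula, not the library's implementation.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = {
            # sublinear tf (1 + ln tf) times unsmoothed idf (ln(n / df) + 1)
            term: (1 + math.log(count)) * (math.log(n_docs / df[term]) + 1)
            for term, count in counts.items()
        }
        # L2 normalisation; `or 1.0` avoids dividing by zero for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


toy_docs = [text.lower().split() for text in
            ["Fake news spreads fast", "Real news spreads slowly"]]
rows = tfidf_rows(toy_docs)
print(rows[0])
```

In the first toy document, "fake" appears in only one of the two documents and so receives a higher weight than "news", which appears in both.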

In [None]:
# Split the data into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(sklearn_representation, predictions, test_size=.3, random_state=25)
# Fit a logistic regression classifier and predict labels for the test set
LogReg = LogisticRegression()
LogReg.fit(X_train, y_train)
y_pred = LogReg.predict(X_test)
print(y_pred)
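
The notebook imports classification_report and confusion_matrix but stops at printing y_pred. As a sketch of what a confusion matrix over the four classes tabulates (rows are true labels and columns are predicted labels, the convention scikit-learn's confusion_matrix uses), the pure-Python version below counts label pairs. The helper name confusion_counts and the label lists are invented for illustration; they are not results from the model.

```python
def confusion_counts(y_true, y_hat, n_classes):
    """Return an n_classes x n_classes table: rows are true labels, columns are predictions."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_hat):
        matrix[true_label][predicted_label] += 1
    return matrix


# Invented labels for illustration only (indices into `possibilities`)
toy_true = [3, 3, 1, 0, 2, 3]
toy_pred = [3, 3, 3, 0, 2, 1]

matrix = confusion_counts(toy_true, toy_pred, n_classes=4)
for row in matrix:
    print(row)

# Accuracy is the diagonal (correct predictions) divided by the total
accuracy = sum(matrix[i][i] for i in range(4)) / len(toy_true)
print(accuracy)
```

On the real data, confusion_matrix(y_test, y_pred) should produce the same kind of table, and classification_report(y_test, y_pred) adds per-class precision, recall and F1.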
