# <span style="color:darkblue">News Classification - by Paul Jacques Mignault & Jonathan Serrano Barbosa</span> 

In this notebook [Jonathan Serrano Barbosa](https://www.linkedin.com/in/jonathan-serrano-barbosa-0723b762/) and [Paul Jacques Mignault](https://www.linkedin.com/in/paul-jacques-mignault/) developed a binary text classifier able to distinguish between fake news and real news. The team tried different approaches to data pre-processing, modeling and feature extraction.

***
#### Summary of the process:
***

1. Importing Libraries

2. Reading CSV

3. Creating a Baseline

4. Data Pre-Processing and Models Testing

5. Feature Extraction

6. Predicting on holdout/test set
    

***    
#### Main Findings include:
***

- **Data to be used**: The best models used both text and title.


- **Pre-Processing**: Extensive pre-processing led to decreased accuracy. Hence, we applied only minimal cleaning to the text.


- **Vectorizer**: TF-IDF vectorizer performed best.


- **Models**: Passive Aggressive Classifier was the best performing algorithm.


- **POS**: Isolating different parts of speech (POS) led to decent accuracy but did not beat our best model using the full text.



***
#### Key Takeaways:
***

- **Baseline**. Starting with a baseline model creates a benchmark for model performance without pre-processing or complex modeling. It then serves to evaluate the impact of feature engineering (in this case feature extraction) and the performance of models developed later.


- **Pre-Processing**. It is best to work on pre-processing iteratively, studying the impact of each function on the model, rather than going for deep cleaning from the beginning. During the exercise, we made the mistake of starting with extensive cleaning early on, and model performance kept decreasing. It took us days to understand that we had to work backwards and identify which steps decreased model performance until we found the optimal pre-processing.


- **Modeling**. Rather than going for a complex model, it is best to 1) try different models selected rationally, in this case models frequently used for text classification, and 2) run smaller grid searches on the best-performing models rather than using brute force. Simpler models might work best.


- **Feature Extraction**. Feature extraction is useful for identifying how the different POS might help in modeling. Looking back at the exercise, starting with EDA on the features, rather than focusing on feature extraction for modeling, would have helped us understand the structure of fake news vs real news and thereby inform our data pre-processing steps.


# <span style="color:darkblue">I. Importing Libraries</span> 

The main libraries used for the exercise are [nltk](https://www.nltk.org/) for text cleaning and [sklearn](https://scikit-learn.org/stable/) for the machine learning aspects of the task.

In [1]:
# Importing Libraries
import nltk
import pandas as pd
import numpy as np
import string
import re
import xgboost
import requests
import io
import itertools 

from IPython.display import Image
from IPython.core.display import HTML 

from string import punctuation

from nltk import word_tokenize, pos_tag, pos_tag_sents, DefaultTagger, UnigramTagger
from nltk.util import ngrams
from nltk.stem import PorterStemmer, WordNetLemmatizer, SnowballStemmer
from nltk.corpus import stopwords
from nltk.classify import MaxentClassifier, maxent
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist
from nltk.tokenize.treebank import TreebankWordDetokenizer

from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier, SGDClassifier, RidgeClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import confusion_matrix

from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

# <span style="color:darkblue">II. Reading the CSV</span> 

### Reading the file from github

So that anyone can run this notebook without changing any directory path, the CSV file was uploaded to GitHub and is read directly from a URL.

In [2]:
# Reading the Training dataset
url = "https://raw.githubusercontent.com/paul-jm/Fake_News_Detection_NLP/master/fake_or_real_news_training.csv"
s = requests.get(url).content
raw_data = pd.read_csv(io.StringIO(s.decode('utf-8')), sep=',')
raw_data.head()

Unnamed: 0,ID,title,text,label,X1,X2
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,,
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,,
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,,
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,,
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,,


### Fixing the dataframe

When reading the CSV, the team noticed that text from certain rows is shifted into the columns X1 and X2. Similarly, labels are in some instances shifted to other cells. Therefore, the team created helper functions to correct these imperfections after reading the file.
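
To illustrate how such a shift can arise, here is a minimal sketch using only the standard `csv` module. It assumes the cause is an unquoted comma inside the text field, which the notebook does not confirm; the two sample rows and the helper name `repair_row` are invented for illustration. Unlike `fix_df`, the sketch re-inserts the lost comma when joining the pieces.

```python
import csv
import io

LABELS = {"REAL", "FAKE"}

def repair_row(fields):
    """Rebuild (id, title, text, label) from a row whose text spilled into extra cells."""
    cells = list(fields)
    # Drop the empty trailing cells (the unused X1 / X2 positions).
    while cells and cells[-1] == "":
        cells.pop()
    if len(cells) < 4 or cells[-1] not in LABELS:
        raise ValueError("row does not end with a REAL/FAKE label")
    row_id, title, label = cells[0], cells[1], cells[-1]
    # Everything between the title and the label belongs to the text.
    text = ",".join(cells[2:-1])
    return row_id, title, text, label

# Hypothetical data: the first row has an unquoted comma in its text field.
raw = (
    "ID,title,text,label,X1,X2\n"
    "1,Some title,First part, second part,REAL,\n"
    "2,Other title,Clean text,FAKE,,\n"
)
rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]
repaired = [repair_row(r) for r in data]
for row in repaired:
    print(row)
```

In the first sample row the label lands in the X1 position, which is the situation `retrieve_label` handles in the notebook.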

In [3]:
#These 3 functions retrieve labels and re-integrate text from columns X1 and X2
def isNaN(num):
    return num != num

def retrieve_label(df):
    'This function retrieves the label should it have been moved to another cell upon reading the CSV'
    for row in range(df.shape[0]):
        if df.loc[row, 'X2'] == 'REAL' or df.loc[row, 'X2'] == 'FAKE':
            df.loc[row,'news_label'] = df.loc[row,'X2']
        elif df.loc[row, 'X1'] == 'REAL' or df.loc[row, 'X1'] == 'FAKE':
            df.loc[row,'news_label'] = df.loc[row,'X1']
        else:
            df.loc[row,'news_label'] = df.loc[row,'label']
    return df

def fix_df(df):
    'This function retrieves the text that has been shifted to cells X1 and X2 upon reading the CSV'
    for row in range(df.shape[0]):
        if isNaN(df.loc[row, 'label']) == False and df.loc[row, 'label'] != 'REAL' and df.loc[row, 'label'] != 'FAKE':
            df.loc[row, 'text'] = df.loc[row, 'text'] + df.loc[row, 'label']
        elif isNaN(df.loc[row, 'X1']) == False and df.loc[row, 'X1'] != 'REAL' and df.loc[row, 'X1'] != 'FAKE':
            df.loc[row, 'text'] = df.loc[row, 'text'] + df.loc[row, 'X1']
        elif isNaN(df.loc[row, 'X2']) == False and df.loc[row, 'X2'] != 'REAL' and df.loc[row, 'X2'] != 'FAKE':
            df.loc[row, 'text'] = df.loc[row, 'text'] + df.loc[row, 'X2']
    df = df.drop(columns = ['label', 'X1', 'X2'])
    return df 

In [4]:
#Applying the functions created above

df = retrieve_label(raw_data) 
df = fix_df(df) 
df.head() 

Unnamed: 0,ID,title,text,news_label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


# <span style="color:darkblue">III. Creating a Baseline</span> 

The team began the exercise by creating a baseline: a basic model without any pre-processing. Three simple models (Logistic Regression, Multinomial Naive Bayes and Linear Support Vector Machine) were tested with both a Count Vectorizer and a TF-IDF Vectorizer. The best result was obtained with the TF-IDF Vectorizer.

In [5]:
# Setting the ID as the index
df = df.set_index('ID')

# Creating the target column
y = df.news_label 

In [6]:
df.head()

ID,title,text,news_label
8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


We chose a 70/30 split in order to leave the algorithm sufficient data for training. Furthermore, we set a seed (random_state) so that the split is identical when re-running the notebook.
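
The effect of the seed can be sketched in plain Python. This is not how scikit-learn implements `train_test_split`; it is a simplified illustration with a hypothetical helper `seeded_split` and invented sample items, showing only that a fixed seed yields the same partition on every run.

```python
import random

def seeded_split(items, test_size=0.3, seed=666):
    """Shuffle indices with a fixed seed and hold out a fraction as the test set."""
    rng = random.Random(seed)          # private generator, so the result is repeatable
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx = set(indices[:n_test])
    train = [item for i, item in enumerate(items) if i not in test_idx]
    test = [item for i, item in enumerate(items) if i in test_idx]
    return train, test

articles = [f"article {i}" for i in range(10)]
train_a, test_a = seeded_split(articles)
train_b, test_b = seeded_split(articles)
print(len(train_a), len(test_a))       # sizes of the two partitions
print(test_a == test_b)                # same seed, same held-out items
```

Changing the `seed` argument will generally produce a different partition, which is why the notebook keeps it fixed when comparing models.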

In [7]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.3, random_state=666)

Then we create the Count Vectorizer and the TF-IDF Vectorizer and fit them on the training data created above.
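
To make the difference between the two vectorizers concrete, the following is a minimal sketch in plain Python, not scikit-learn's implementation. It assumes whitespace tokenization and uses the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which the scikit-learn documentation describes for the default `TfidfVectorizer` settings. The three toy documents are invented for illustration.

```python
import math
from collections import Counter

docs = [
    "fake news spreads fast",
    "real news reports facts",
    "fake claims spread",
]

# Count representation: raw term frequencies per document.
counts = [Counter(doc.split()) for doc in docs]

# Document frequency: in how many documents each term appears.
n_docs = len(docs)
doc_freq = Counter(term for c in counts for term in c)

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in doc_freq}

def tfidf_vector(term_counts):
    """Weight counts by IDF, then scale the vector to unit L2 norm."""
    weights = {t: tf * idf[t] for t, tf in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(c) for c in counts]
print(counts[0])
print(vectors[0])
```

In the first document, "news" and "spreads" have the same count, but "spreads" receives the larger TF-IDF weight because it occurs in fewer documents.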

In [8]:
# Simple count vectorizer
count_vectorizer = CountVectorizer()
# Fit and transform the training data 
count_train = count_vectorizer.fit_transform(X_train)
# Transform the test set 
count_test = count_vectorizer.transform(X_test)

# Initialize the `tfidf_vectorizer` 
tfidf_vectorizer = TfidfVectorizer() 
# Fit and transform the training data 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
# Transform the test set 
tfidf_test = tfidf_vectorizer.transform(X_test)

### Count vectorizer

The team wrote a loop to compare different models and display the results in a table ranked by accuracy, making it easy to compare models evaluated through cross-validation. We used cross-validation to obtain a more reliable accuracy estimate and to limit the risk of selecting a model that overfits a single split.
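
The idea behind the mean cross-validation score can be sketched without any library. The example below is a simplified illustration under stated assumptions: it uses contiguous folds (for classifiers, scikit-learn uses stratified folds instead), a trivial majority-class "model", and an invented list of labels. The helper names are hypothetical.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of (train, test) index lists."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

def majority_label(labels):
    """Stand-in for model fitting: always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]

labels = ["FAKE", "REAL", "FAKE", "FAKE", "REAL",
          "FAKE", "FAKE", "FAKE", "REAL", "FAKE"]

scores = []
for train_idx, test_idx in k_fold_indices(len(labels), k=5):
    guess = majority_label([labels[i] for i in train_idx])
    hits = sum(labels[i] == guess for i in test_idx)
    scores.append(hits / len(test_idx))

mean_score = round(sum(scores) / len(scores), 3)
print(scores, mean_score)
```

Each sample is used for testing exactly once, so the mean summarizes performance over the whole training set rather than over one arbitrary split.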

In [9]:
models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Multinomial NB', MultinomialNB()))
models.append(('Linear SVM', LinearSVC()))

results = []
names = []
scoring = 'accuracy'

#with joblib.parallel_backend('dask'):
for name, model in models:
    cv_scores = cross_val_score(model, count_train, y_train, cv=10, n_jobs=-1)
    mean_score = round(np.mean(cv_scores), 3)
    results.append(mean_score)
    names.append(name)

Models_comparison = pd.DataFrame(np.column_stack([names,results]), 
                                       columns=['Model','Accuracy'])

Models_comparison = Models_comparison.sort_values('Accuracy', ascending = False)

print(Models_comparison)

                 Model Accuracy
0  Logistic Regression    0.902
1       Multinomial NB    0.881
2           Linear SVM    0.863


### TF-IDF Vectorizer

In [10]:
models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Multinomial NB', MultinomialNB()))
models.append(('Linear SVM', LinearSVC()))

results = []
names = []
scoring = 'accuracy'

for name, model in models:
    cv_scores = cross_val_score(model, tfidf_train,y_train,cv=10,n_jobs=-1)
    mean_score = round(np.mean(cv_scores), 3)
    results.append(mean_score)
    names.append(name)

Models_comparison = pd.DataFrame(np.column_stack([names,results]), 
                               columns=['Model','Accuracy'])

Models_comparison = Models_comparison.sort_values('Accuracy', ascending = False)

print(Models_comparison)

                 Model Accuracy
2           Linear SVM    0.924
0  Logistic Regression    0.897
1       Multinomial NB     0.75


# <span style="color:darkblue">IV. Building Model: Data Pre-Processing and Models Testing</span> 

# <span style="color:lightslategray">Creating Functions for Pre-Processing </span> 

Different pre-processing steps were tried, and not all functions developed by the team were kept. The team noticed that extensive pre-processing decreased model performance. Hence, we had to work backwards and iteratively discard pre-processing functions that did not improve the model (see the lines commented out with # and the section Functions discarded). We also created the normalization function with optional steps in order to easily try different combinations of cleaning later on.

### Functions Used

In [11]:
stops = stopwords.words('english')

def basic_clean(dataframe):
    'Function that takes the dataframe as an input and cleans it by removing punctuation (digit and whitespace cleaning are commented out below)'
    #Removing punctuation
    dataframe.title = dataframe.title.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
    dataframe.text = dataframe.text.apply(lambda x: x.translate(str.maketrans('','', string.punctuation)))
    #Replacing any remaining non-word, non-space characters (e.g. curly quotes) with a space
    #regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
    dataframe['title'] = dataframe['title'].str.replace(r'[^\w\s]', ' ', regex=True)
    dataframe['text'] = dataframe['text'].str.replace(r'[^\w\s]', ' ', regex=True)
    #Removing digits
    #dataframe.title = dataframe.title.apply(lambda x: x.translate(str.maketrans('', '', string.digits)))
    #dataframe.text = dataframe.text.apply(lambda x: x.translate(str.maketrans('','', string.digits)))
    #Removing double spaces
    #dataframe['title'] = dataframe['title'].str.replace('  ',' ')
    #dataframe['text'] = dataframe['text'].str.replace('  ',' ')
    #Removing strips
    #dataframe['title'] = dataframe['title'].replace(r'\s+|\\n', ' ', regex = True, inplace = False)
    #dataframe['text'] = dataframe['text'].replace(r'\s+|\\n', ' ', regex = True, inplace = False)
    return dataframe

def normalization(text, lowercase = False, remove_stops = False, prt_stemming = False, snb_stemming = False, lemmatization = False):
    'Flexible function to try effect of removing stopwords and the different stemming techniques'
    txt = str(text)
    if lowercase:
        txt = " ".join([w.lower() for w in txt.split()])
    if remove_stops:
        txt = " ".join([w for w in txt.split() if w not in stops])
    if prt_stemming:
        st = PorterStemmer()
        txt = " ".join([st.stem(w) for w in txt.split()])
    if snb_stemming:
        snb = SnowballStemmer('english')
        txt = " ".join([snb.stem(w) for w in txt.split()])
    if lemmatization:
        wordnet_lemmatizer = WordNetLemmatizer()
        txt = " ".join([wordnet_lemmatizer.lemmatize(w, pos = 's') for w in txt.split()])
    return txt

### Functions discarded

In [12]:
def advanced_clean(dataframe, column):
    'Function that takes the dataframe as input and fixes most common contractions and regular expressions'
    dataframe[column] = dataframe[column].str.replace("isn't", "is not")
    dataframe[column] = dataframe[column].str.replace("aren't", "are not")
    dataframe[column] = dataframe[column].str.replace("ain't", "am not")
    dataframe[column] = dataframe[column].str.replace("won't", "will not")
    dataframe[column] = dataframe[column].str.replace("didn't", "did not")
    dataframe[column] = dataframe[column].str.replace("shan't", "shall not")
    dataframe[column] = dataframe[column].str.replace("haven't", "have not")
    dataframe[column] = dataframe[column].str.replace("hadn't", "had not")
    dataframe[column] = dataframe[column].str.replace("hasn't", "has not")
    dataframe[column] = dataframe[column].str.replace("don't", "do not")
    dataframe[column] = dataframe[column].str.replace("wasn't", "was not")
    dataframe[column] = dataframe[column].str.replace("weren't", "were not")
    dataframe[column] = dataframe[column].str.replace("doesn't", "does not")
    dataframe[column] = dataframe[column].str.replace("'s", " is")
    dataframe[column] = dataframe[column].str.replace("'re", " are")
    dataframe[column] = dataframe[column].str.replace("'m", " am")
    dataframe[column] = dataframe[column].str.replace("'d", " would")
    dataframe[column] = dataframe[column].str.replace("'ll", " will")
    dataframe[column] = dataframe[column].str.replace("can't","cannot")
    dataframe[column] = dataframe[column].str.replace("'cause","because")
    dataframe[column] = dataframe[column].str.replace("could've","could have")
    dataframe[column] = dataframe[column].str.replace("couldn't", "could not")
    dataframe[column] = dataframe[column].str.replace("he's", "he is")
    dataframe[column] = dataframe[column].str.replace("how'd", "how did")
    dataframe[column] = dataframe[column].str.replace("I'd've", "I would have")
    dataframe[column] = dataframe[column].str.replace("I've", "I have")
    return dataframe

def remove_url(text):
    'Function that takes a text string as input and removes URLs found at the start of a line'
    txt = str(text)
    txt = re.sub(r'^https?:\/\/.*[\r\n]*', ' ', txt, flags=re.MULTILINE)
    return txt
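
Because the pattern in `remove_url` is anchored with `^`, it only matches URLs that start a line. The sketch below contrasts it with an unanchored pattern. The second pattern, the helper name `strip_urls` and the sample strings are assumptions introduced for illustration and were not part of the team's experiments.

```python
import re

# Pattern from remove_url: matches only when the URL begins a line.
ANCHORED = re.compile(r'^https?:\/\/.*[\r\n]*', flags=re.MULTILINE)
# Hypothetical alternative: matches a URL anywhere, up to the next whitespace.
ANYWHERE = re.compile(r'https?://\S+')

def strip_urls(text, pattern):
    """Replace matches with a space, then collapse repeated whitespace."""
    cleaned = pattern.sub(' ', str(text))
    return ' '.join(cleaned.split())

inline = "Read the report at https://example.com/report today"
leading = "https://example.com/a\nBody text"

print(strip_urls(inline, ANCHORED))   # URL is kept: it is not at the start of a line
print(strip_urls(inline, ANYWHERE))   # URL is removed
print(strip_urls(leading, ANCHORED))  # URL line is removed
```

Note that the unanchored pattern also swallows punctuation attached to the end of a URL, so it is a starting point rather than a complete solution.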

# <span style="color:lightslategray">Step 1: Selecting data to use, vectorizer and narrowing down potential models </span> 

The objective of the first step was to select the data to be used moving forward (text, title, or text and title), to decide on the best vectorizer and, lastly, to test models that are generally used for text classification. In general, text alone produced higher accuracy than title alone, but lower accuracy than text and title combined, which was a first interesting finding at this stage. This step also showed that the best-performing models scored higher with the TF-IDF vectorizer than with the count vectorizer.

Main Findings:


- Using title and text is best
- TF-IDF beats Count Vectorizer
- Passive Aggressive Classifier*, Ridge Classifier, and Linear SVM are the top 3 performing algorithms


*In a passive aggressive classifier, the model is left unchanged when an instance is classified correctly with a sufficient margin (passive) and is updated when an instance is misclassified or falls inside the margin (aggressive). The model therefore keeps adjusting to its errors. It appears that our particular dataset provides the algorithm with enough information for these updates to be effective.*
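
The passive and aggressive behaviours can be sketched with a single update step. This is an illustration of the textbook passive-aggressive update with hinge loss and a step-size cap `C`, not the scikit-learn code. It assumes labels are encoded as +1 and -1 and features are plain lists of floats; the function name `pa_update` is hypothetical.

```python
def pa_update(weights, x, y, C=1.0):
    """One passive-aggressive step for a single example (x, y), with y in {+1, -1}."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)      # hinge loss
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return weights                 # passive: the example is already handled well
    tau = min(C, loss / sq_norm)       # aggressive: step just large enough, capped by C
    return [w + tau * y * xi for w, xi in zip(weights, x)]

w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 0.0], +1)     # margin violated, weights move
w2 = pa_update(w1, [1.0, 0.0], +1)     # margin satisfied, weights stay
w3 = pa_update(w2, [1.0, 0.0], -1)     # now misclassified, weights move back
print(w1, w2, w3)
```

The second call leaves the weights untouched because the example is already classified with a margin of at least 1, which is the passive case.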

In [13]:
df0 = df.copy()
df0['text'] = df0['text'].map(lambda x: normalization(x, lowercase=True, remove_stops=False, prt_stemming=True, snb_stemming = False, lemmatization = False))
df0['title'] = df0['title'].map(lambda x: normalization(x, lowercase=True, remove_stops=False, prt_stemming=True, snb_stemming = False, lemmatization = False))

### Using Text Only

In [14]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df0['text'], y, test_size=0.3, random_state=666)

# Creating the count vectorizer
vect = CountVectorizer()
count_train = vect.fit_transform(X_train)
count_test = vect.transform(X_test)

# Trying the different models
models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Multinomial NB', MultinomialNB()))
models.append(('Linear SVM', LinearSVC()))
models.append(('Random Forest', RandomForestClassifier()))
models.append(('XGBoost', XGBClassifier()))
models.append(('Passive Agressive',PassiveAggressiveClassifier()))
models.append(('Tree Classifier',AdaBoostClassifier()))
models.append(('Ridge Classifier',RidgeClassifier(solver='lsqr')))

results = []
names = []
scoring = 'accuracy'

for name, model in models:
    cv_scores = cross_val_score(model, count_train,y_train,cv=5,n_jobs=-1)
    mean_score = round(np.mean(cv_scores), 3)
    results.append(mean_score)
    names.append(name)

Models_comparison = pd.DataFrame(np.column_stack([names,results]), columns=['Model','Accuracy'])

Models_comparison = Models_comparison.sort_values('Accuracy', ascending = False)

print(Models_comparison)

                 Model Accuracy
0  Logistic Regression    0.895
7     Ridge Classifier    0.894
4              XGBoost    0.886
1       Multinomial NB    0.877
6      Tree Classifier    0.865
5    Passive Agressive    0.856
2           Linear SVM    0.853
3        Random Forest    0.816


In [15]:
# Creating the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Trying the different models
models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Multinomial NB', MultinomialNB()))
models.append(('Linear SVM', LinearSVC()))
models.append(('Random Forest', RandomForestClassifier()))
models.append(('XGBoost', XGBClassifier()))
models.append(('Passive Agressive',PassiveAggressiveClassifier()))
models.append(('Tree Classifier',AdaBoostClassifier()))
models.append(('Ridge Classifier',RidgeClassifier(solver='lsqr')))
              
results = []
names = []
scoring = 'accuracy'

for name, model in models:
    cv_scores = cross_val_score(model, tfidf_train,y_train,cv=10,n_jobs=-1)
    mean_score = round(np.mean(cv_scores), 3)
    results.append(mean_score)
    names.append(name)

Models_comparison = pd.DataFrame(np.column_stack([names,results]), columns=['Model','Accuracy'])

Models_comparison = Models_comparison.sort_values('Accuracy', ascending = False)

print(Models_comparison)

                 Model Accuracy
5    Passive Agressive    0.922
2           Linear SVM    0.919
7     Ridge Classifier    0.916
4              XGBoost    0.902
0  Logistic Regression    0.896
6      Tree Classifier    0.876
3        Random Forest     0.81
1       Multinomial NB    0.749


### Using Title Only

In [16]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df0['title'], y, test_size=0.3, random_state=666)

# Creating the count vectorizer
vect = CountVectorizer()
count_train = vect.fit_transform(X_train)
count_test = vect.transform(X_test)

# Trying the different models
models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Multinomial NB', MultinomialNB()))
models.append(('Linear SVM', LinearSVC()))
models.append(('Random Forest', RandomForestClassifier()))
models.append(('XGBoost', XGBClassifier()))
models.append(('Passive Agressive',PassiveAggressiveClassifier()))
models.append(('Tree Classifier',AdaBoostClassifier()))
models.append(('Ridge Classifier',RidgeClassifier(solver='lsqr')))

results = []
names = []
scoring = 'accuracy'

for name, model in models:
    cv_scores = cross_val_score(model, count_train,y_train,cv=5,n_jobs=-1)
    mean_score = round(np.mean(cv_scores), 3)
    results.append(mean_score)
    names.append(name)

Models_comparison = pd.DataFrame(np.column_stack([names,results]), columns=['Model','Accuracy'])

Models_comparison = Models_comparison.sort_values('Accuracy', ascending = False)

print(Models_comparison)

                 Model Accuracy
1       Multinomial NB     0.79
0  Logistic Regression    0.789
5    Passive Agressive    0.764
2           Linear SVM    0.763
7     Ridge Classifier    0.761
3        Random Forest    0.736
4              XGBoost    0.734
6      Tree Classifier    0.723


In [17]:
# Creating the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Trying the different models
models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Multinomial NB', MultinomialNB()))
models.append(('Linear SVM', LinearSVC()))
models.append(('Random Forest', RandomForestClassifier()))
models.append(('XGBoost', XGBClassifier()))
models.append(('Passive Agressive',PassiveAggressiveClassifier()))
models.append(('Tree Classifier',AdaBoostClassifier()))
models.append(('Ridge Classifier',RidgeClassifier(solver='lsqr')))
              
results = []
names = []
scoring = 'accuracy'

for name, model in models:
    cv_scores = cross_val_score(model, tfidf_train,y_train,cv=10,n_jobs=-1)
    mean_score = round(np.mean(cv_scores), 3)
    results.append(mean_score)
    names.append(name)

Models_comparison = pd.DataFrame(np.column_stack([names,results]), columns=['Model','Accuracy'])

Models_comparison = Models_comparison.sort_values('Accuracy', ascending = False)

print(Models_comparison)

                 Model Accuracy
7     Ridge Classifier    0.803
0  Logistic Regression    0.801
2           Linear SVM    0.798
1       Multinomial NB    0.791
5    Passive Agressive    0.779
3        Random Forest    0.737
4              XGBoost    0.731
6      Tree Classifier     0.73


### Using Title and Text

In [18]:
# Merging Title and Text
df0['titles_text'] = df0[['title', 'text']].apply(lambda x: ' '.join(x), axis=1)

# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df0['titles_text'], y, test_size=0.3, random_state=666)

# Creating the count vectorizer
vect = CountVectorizer()
count_train = vect.fit_transform(X_train)
count_test = vect.transform(X_test)

# Trying the different models
models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Multinomial NB', MultinomialNB()))
models.append(('Linear SVM', LinearSVC()))
models.append(('Random Forest', RandomForestClassifier()))
models.append(('XGBoost', XGBClassifier()))
models.append(('Passive Agressive',PassiveAggressiveClassifier()))
models.append(('Tree Classifier',AdaBoostClassifier()))
models.append(('Ridge Classifier',RidgeClassifier(solver='lsqr')))

results = []
names = []
scoring = 'accuracy'

for name, model in models:
    cv_scores = cross_val_score(model, count_train,y_train,cv=5,n_jobs=-1)
    mean_score = round(np.mean(cv_scores), 3)
    results.append(mean_score)
    names.append(name)

Models_comparison = pd.DataFrame(np.column_stack([names,results]), columns=['Model','Accuracy'])

Models_comparison = Models_comparison.sort_values('Accuracy', ascending = False)

print(Models_comparison)

                 Model Accuracy
0  Logistic Regression      0.9
7     Ridge Classifier      0.9
4              XGBoost     0.89
1       Multinomial NB    0.884
2           Linear SVM    0.874
6      Tree Classifier    0.873
5    Passive Agressive    0.872
3        Random Forest    0.799


In [19]:
# Creating the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Trying the different models
models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Multinomial NB', MultinomialNB()))
models.append(('Linear SVM', LinearSVC()))
models.append(('Random Forest', RandomForestClassifier()))
models.append(('XGBoost', XGBClassifier()))
models.append(('Passive Agressive',PassiveAggressiveClassifier()))
models.append(('Tree Classifier',AdaBoostClassifier()))
models.append(('Ridge Classifier',RidgeClassifier(solver='lsqr')))
              
results = []
names = []
scoring = 'accuracy'

for name, model in models:
    cv_scores = cross_val_score(model, tfidf_train,y_train,cv=10,n_jobs=-1)
    mean_score = round(np.mean(cv_scores), 3)
    results.append(mean_score)
    names.append(name)

Models_comparison = pd.DataFrame(np.column_stack([names,results]), columns=['Model','Accuracy'])

Models_comparison = Models_comparison.sort_values('Accuracy', ascending = False)

print(Models_comparison)

                 Model Accuracy
2           Linear SVM    0.924
5    Passive Agressive    0.922
7     Ridge Classifier    0.919
4              XGBoost    0.903
0  Logistic Regression    0.897
6      Tree Classifier    0.885
3        Random Forest    0.808
1       Multinomial NB    0.755


# <span style="color:lightslategray">Step 2: Selecting Best Normalization Method with Top Performing Algorithms</span> 

Based on the findings in step 1, the team moved forward with the three top-performing models on text and title combined, using the TF-IDF Vectorizer. The objective was to analyze the impact of each pre-processing step on model performance. To do so, the team fixed a seed and tried the different combinations of pre-processing, toggling the options of the *normalization* function and commenting out (with #) parts of the basic_clean function. Furthermore, when lemmatizing, the team studied the impact of the pos argument and found that pos = 's', which stands for satellite adjectives (dependent adjectives, i.e. adjectives in a given context, or epithets), gave the best results.

Main Findings:

- Model performs best when not removing stop words
- Model performs best with lemmatizing on satellite adjectives (vs stemming)
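
Trying the combinations of options can be organised systematically. The sketch below enumerates keyword-argument sets for the `normalization` function defined earlier. It assumes that at most one of the stemming or lemmatization options is switched on at a time, which is a simplification introduced here; the helper `normalization_configs` is hypothetical and does not call nltk.

```python
from itertools import product

# None means "no stemming or lemmatization"; the others are normalization() flags.
STEMMERS = (None, "prt_stemming", "snb_stemming", "lemmatization")

def normalization_configs():
    """Return one kwargs dict per combination of lowercase, stop-word and stemmer choice."""
    configs = []
    for lowercase, remove_stops, stemmer in product([False, True], [False, True], STEMMERS):
        kwargs = {
            "lowercase": lowercase,
            "remove_stops": remove_stops,
            "prt_stemming": False,
            "snb_stemming": False,
            "lemmatization": False,
        }
        if stemmer is not None:
            kwargs[stemmer] = True
        configs.append(kwargs)
    return configs

configs = normalization_configs()
print(len(configs), "configurations")
print(configs[3])
```

Each dictionary could then be passed as `normalization(x, **kwargs)` inside the existing `map` calls, with the cross-validation score recorded per configuration.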

In [20]:
df1 = df.copy()
basic_clean(df1)

# Further cleaning
df1['text'] = df1['text'].map(lambda x: re.sub(r'\W+', ' ', x))
df1['title'] = df1['title'].map(lambda x: re.sub(r'\W+', ' ', x))

# Lemmatizing on satellite adjectives (lowercase is left off here)
df1['text'] = df1['text'].map(lambda x: normalization(x, lowercase=False, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = True))
df1['title'] = df1['title'].map(lambda x: normalization(x, lowercase=False, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = True))

In [21]:
df1['titles_text'] = df1[['title', 'text']].apply(lambda x: ' '.join(x), axis=1)

In [22]:
df1.head()

ID,title,text,news_label,titles_text
8476,You Can Smell Hillary s Fear,Daniel Greenfield a Shillman Journalism Fellow...,FAKE,You Can Smell Hillary s Fear Daniel Greenfield...
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,Watch The Exact Moment Paul Ryan Committed Pol...
3608,Kerry to go to Paris in gesture of sympathy,US Secretary of State John F Kerry said Monday...,REAL,Kerry to go to Paris in gesture of sympathy US...
10142,Bernie supporters on Twitter erupt in anger ag...,Kaydee King KaydeeKing November 9 2016 The les...,FAKE,Bernie supporters on Twitter erupt in anger ag...
875,The Battle of New York Why This Primary Matters,Its primary day in New York and frontrunners H...,REAL,The Battle of New York Why This Primary Matter...


In [23]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df1['titles_text'], y, test_size=0.2, random_state=666)

In [24]:
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

models = []
models.append(('Linear SVM', LinearSVC(random_state=69)))
models.append(('Passive Agressive',PassiveAggressiveClassifier(random_state=69)))
models.append(('Ridge Classifier',RidgeClassifier(solver='lsqr',random_state=69)))

results = []
names = []
scoring = 'accuracy'

for name, model in models:
    cv_scores = cross_val_score(model, tfidf_train,y_train,cv=10,n_jobs=-1)
    mean_score = round(np.mean(cv_scores), 3)
    results.append(mean_score)
    names.append(name)

Models_comparison = pd.DataFrame(np.column_stack([names,results]), columns=['Model','Accuracy'])

Models_comparison = Models_comparison.sort_values('Accuracy', ascending = False)

print(Models_comparison)

               Model Accuracy
1  Passive Agressive    0.938
0         Linear SVM    0.934
2   Ridge Classifier    0.931


# <span style="color:lightslategray">Step 3: Tuning Hyperparameters from Best Performing Model</span> 

In this step, the team ran a Grid Search to identify the best set of hyperparameters both for the TF-IDF Vectorizer and for the Passive Aggressive Classifier. The challenge in this step was to select the right parameters to train our grid search on as well as the right ranges.

Main Findings:

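- Parameters retained for the Passive Aggressive Classifier: loss = 'hinge'
- Parameters retained for the TF-IDF Vectorizer: max_df = 1.0, ngram_range = (1,2), norm = 'l2'
- Note that the grid-search output recorded in this notebook reports loss = 'squared_hinge' and max_df = 0.75, which differs from the values retained above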

In [131]:
pipe = Pipeline([('tfidf', TfidfVectorizer()),('PAC', PassiveAggressiveClassifier(random_state=69))])

pipe_params = {'tfidf__ngram_range': [(1,2), (2,2), (1,3)],'tfidf__max_df':[0.5, 0.75, 1.0],'tfidf__norm':['l1','l2'],'PAC__loss':['hinge', 'squared_hinge']}

gs_pac = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs_pac.fit(X_train, y_train);
print("Best score:", gs_pac.best_score_)
print("Train score", gs_pac.score(X_train, y_train))
print("Test score", round(gs_pac.score(X_test, y_test),3))
pac_best = gs_pac.best_estimator_
gs_pac.best_params_

Best score: 0.9265395436073773
Train score 1.0
Test score 0.924


{'PAC__loss': 'squared_hinge',
 'tfidf__max_df': 0.75,
 'tfidf__ngram_range': (1, 2),
 'tfidf__norm': 'l2'}

In [26]:
tfidf_vectorizer = TfidfVectorizer(max_df = 1.0, ngram_range = (1,2), norm = 'l2') 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

cv_scores = cross_val_score(PassiveAggressiveClassifier(random_state=69, loss = 'hinge'), tfidf_train,y_train,cv=10,n_jobs=-1)
mean_score = round(np.mean(cv_scores), 3)
print("Mean Cross-Validation score: ",mean_score)

Mean Cross-Validation score:  0.936


# <span style="color:lightslategray">Step 4: Comparing Individual Performance vs Stacked Model </span> 

In this step, the team combined the top 3 performing models, i.e. Linear SVC, Ridge Classifier and Passive Aggressive Classifier, in a hard-voting ensemble and compared its performance with that of the Passive Aggressive Classifier alone.

Main Findings:

- Passive Aggressive Classifier beats Ensemble (even with hyperparameter tuning)
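As a reminder of what hard voting does, here is a minimal sketch in plain Python. The predictions are hypothetical and `hard_vote` is an illustrative helper, not the scikit-learn implementation; it only shows the majority rule the ensemble applies per article.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the majority label for each sample.

    predictions_per_model: list of equal-length label lists, one per model.
    """
    voted = []
    for sample_preds in zip(*predictions_per_model):
        label, _count = Counter(sample_preds).most_common(1)[0]
        voted.append(label)
    return voted

# Hypothetical predictions from three models on four articles.
svm_preds = ['FAKE', 'REAL', 'REAL', 'FAKE']
ridge_preds = ['FAKE', 'REAL', 'FAKE', 'REAL']
pac_preds = ['REAL', 'REAL', 'FAKE', 'FAKE']

ensemble_preds = hard_vote([svm_preds, ridge_preds, pac_preds])
print(ensemble_preds)
```

With three voters and two classes there are no ties. One plausible reason the ensemble did not help here is that all three models are linear classifiers trained on the same TF-IDF features, so they may tend to make the same mistakes, leaving the vote little to correct.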

In [27]:
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

SVM = LinearSVC(random_state = 69)
Ridge = RidgeClassifier(solver='lsqr', random_state = 69)
PAC = PassiveAggressiveClassifier(random_state = 69)

In [28]:
ensemble = VotingClassifier(estimators=[('Linear SVC', SVM), ('Ridge Regression', Ridge), ('Passive Aggressive Classifier', PAC)], voting='hard')

In [29]:
cv_scores = cross_val_score(ensemble, tfidf_train,y_train,cv=10,n_jobs=-1)
mean_score = round(np.mean(cv_scores), 3)
print("Mean Cross-Validation score: ",mean_score)

Mean Cross-Validation score:  0.934


In [136]:
pipe = Pipeline([('tfidf', TfidfVectorizer()),('PAC', VotingClassifier(estimators=[('Linear SVC', SVM), ('Ridge Regression', Ridge), ('Passive Aggressive Classifier', PAC)], voting='hard'))])
pipe_params = {'tfidf__ngram_range': [(1,1), (1,2), (2,2), (1,3)],'tfidf__max_df':[0.5, 0.75, 1.0],'tfidf__norm':['l1','l2']}

gs_pac = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
gs_pac.fit(X_train, y_train);
print("Best score:", gs_pac.best_score_)
print("Train score", gs_pac.score(X_train, y_train))
print("Test score", round(gs_pac.score(X_test, y_test),3))
pac_best = gs_pac.best_estimator_
gs_pac.best_params_

Best score: 0.9306033135354799
Train score 0.9996874023132228
Test score 0.924


{'tfidf__max_df': 0.75, 'tfidf__ngram_range': (1, 1), 'tfidf__norm': 'l2'}

# <span style="color:lightslategray">Discarded Option: Spelling Correction</span> 

The team noticed that the text sometimes contained spelling mistakes, so we wrote a function that uses the textblob library to correct them. Because the function is computationally expensive (more than 7 hours on a laptop with 32 GB of RAM), we ran it once and exported the results to a CSV file that we uploaded to GitHub, so there is no need to run the function when reading this notebook. If you would still like to do so, remove the # from the cells below.

The function was discarded mainly because its corrections were sometimes inaccurate and changed model performance very little.
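The "run once, then reload from a file" pattern described above can be sketched as follows. This is a minimal illustration using only the standard library; `load_or_compute` and `slow_correct` are hypothetical names, and `slow_correct` is a trivial stand-in for a slow spelling corrector, not the textblob function.

```python
import csv
import os
import tempfile

def load_or_compute(path, rows, expensive_fn):
    """Apply expensive_fn to each row, caching the results as a one-column CSV."""
    if os.path.exists(path):
        # Cache hit: read the stored results instead of recomputing.
        with open(path, newline='', encoding='utf-8') as fh:
            return [row[0] for row in csv.reader(fh)]
    results = [expensive_fn(row) for row in rows]
    with open(path, 'w', newline='', encoding='utf-8') as fh:
        csv.writer(fh).writerows([r] for r in results)
    return results

calls = []

def slow_correct(text):
    # Hypothetical stand-in for a slow spelling corrector.
    calls.append(text)
    return text.replace('teh', 'the')

cache_path = os.path.join(tempfile.mkdtemp(), 'corrected.csv')
texts = ['teh news', 'teh title']
first = load_or_compute(cache_path, texts, slow_correct)
second = load_or_compute(cache_path, texts, slow_correct)  # served from the CSV
```

The second call never reaches `slow_correct`, which is the same effect the team obtained by exporting the corrected dataset once and reading it back from GitHub.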

In [31]:
#!pip install -U textblob
#from textblob import TextBlob

#def correct_spelling(dataframe):
    #'Function that takes the dataframe as input and corrects spelling mistakes in text and title'
    #dataframe['title'] = dataframe['title'].apply(lambda x: str(TextBlob(x).correct()))
    #dataframe['text'] = dataframe['text'].apply(lambda x: str(TextBlob(x).correct()))
    #return dataframe

#correct_spelling(df)

#df5 = df

In [32]:
# Reading the new Training dataset
#url = "https://raw.githubusercontent.com/jonathanserrano1993/Fake_news_detection_NLP/master/new_training.csv"
#s = requests.get(url).content
#df5 = pd.read_csv(io.StringIO(s.decode('utf-8')), sep=',')
#df5.head()

In [33]:
#df5['text'] = df5['text'].map(lambda x: cleanData(x, lowercase=True, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = True))
#df5['title'] = df5['title'].map(lambda x: cleanData(x, lowercase=True, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = True))

In [34]:
#df5['titles_text'] = df5[['title', 'text']].apply(lambda x: ' '.join(x), axis=1)

In [35]:
# Further tuning our best models

#pipe = Pipeline([('tfidf', TfidfVectorizer()),('PAC', PassiveAggressiveClassifier(random_state = 11))])

#pipe_params = {'tfidf__ngram_range': [(1,1), (2,2), (1,3)],'tfidf__max_df':[0.1, 0.5, 1.0],'tfidf__norm':['l1','l2']}

#gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
#gs.fit(X_train, y_train);
#print("Best score:", gs.best_score_)
#print("Train score", gs.score(X_train, y_train))
#print("Test score", gs.score(X_test, y_test))
#gs.best_params_

# <span style="color:darkblue">V. Text Feature Extraction</span> 

The section below extracts features from the articles in the train set, in order to quantify some of their attributes. For each article, the following elements were extracted:

1/ The number of dates in each article. Fake news tend to appeal to emotions rather than state facts; the number of dates was extracted for every article through a regular expression. No text normalization function was applied. 

Following the removal of stopwords, the following features were extracted. The objective is to uncover whether fake news typically have more or fewer words, nouns, etc. Such a pattern would ideally ease their identification. All cases were left as they were when reading the .csv file, in order to best identify the proper nouns. 

2/ The proportion of tokens that had the POS tag starting with 'VB', i.e.  the number of verbs out of all the tokens in the text. 

3/ The proportion of tokens that had the POS tag starting with 'NN', i.e.  the number of nouns out of all the tokens in the text. 

4/ The proportion of tokens that had the POS tag starting with 'JJ', i.e.  the number of adjectives out of all the tokens in the text. 

5/ The proportion of tokens that had the POS tag starting with 'RB', i.e.  the number of adverbs out of all the tokens in the text. 

6/ The proportion of tokens that had the POS tag starting with 'JJS', i.e.  the number of superlative adjectives out of all the tokens in the text. 

7/ The proportion of tokens that had the POS tag starting with 'JJR', i.e.  the number of comparative adjectives out of all the tokens in the text. 

8/ The proportion of tokens that had the POS tag starting with 'NNP', i.e.  the number of proper nouns out of all the tokens in the text. 

After lowercasing the text and lemmatizing tokens, the following features were extracted from the combined title and text columns:

9/ The number of tokens in the text, to determine if fake news tend to be shorter or longer than real news stories.  

10/ The number of types out of the total number of tokens; this feature tracks the lexical diversity of the article. Real news stories typically have a richer vocabulary than fake stories, though this ratio is sensitive to length: longer articles repeat more words, so their ratio tends to be lower (count_words and vocab_diversity have a correlation of -0.82 in the table below). 

As all 10 new features were numerical, they were scaled and their skewness was fixed where needed. Little correlation was observed between the numerical features and the target variable; the number of dates presented the highest correlation coefficient in absolute value of 0.39. 

The vectorized 'titles_text' column, containing tokens from both the title and text columns, was converted to a dataframe and concatenated with the numerical features. To contain memory usage, the 'shuffle' option of 'train_test_split' was set to False, so the split depends on the original order of the rows and the resulting scores are not directly comparable with the other models in this notebook. Even so, including any subset of the 10 numerical features decreased accuracy by 0.5 to 1 percentage points. 
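The POS proportions in features 2 to 8 all follow the same recipe: count the tokens whose tag starts with a given prefix and divide by the number of tokens. The sketch below shows that recipe on a hand-tagged sentence. The tags are Penn Treebank tags typed in by hand for illustration (in the notebook they come from `nltk.pos_tag`), and `tag_proportion` is a hypothetical helper.

```python
def tag_proportion(tagged_tokens, prefix):
    """Share of tokens whose POS tag starts with `prefix`."""
    if not tagged_tokens:
        return 0.0
    hits = sum(1 for _word, tag in tagged_tokens if tag.startswith(prefix))
    return hits / len(tagged_tokens)

# Hand-tagged example (an assumption for illustration, stopwords already removed).
tagged = [
    ('Kerry', 'NNP'), ('made', 'VBD'), ('biggest', 'JJS'),
    ('gesture', 'NN'), ('quickly', 'RB'), ('Paris', 'NNP'),
]

prefixes = ('VB', 'NN', 'JJ', 'RB', 'JJS', 'JJR', 'NNP')
features = {p: tag_proportion(tagged, p) for p in prefixes}
print(features)
```

Because matching is done on prefixes, the 'NN' share includes proper nouns ('NNP') and the 'JJ' share includes superlatives and comparatives. The features therefore overlap by construction, which is consistent with the 0.77 correlation between count_nouns and count_proper_nouns in the correlation table further down.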

# <span style="color:lightslategray">Text Feature Extraction on Raw Data</span> 

In [36]:
df_features_raw = df.copy()
df_features_raw['titles_text'] = df_features_raw[['title', 'text']].apply(lambda x: ' '.join(x), axis=1)

In [37]:
def count_dates(df):
    count_dates = []
    for row in range(df.shape[0]):
        string = df['titles_text'].iloc[row]
        result = len(re.findall(r'[A-Z]\w+\s\d\d\d\d|in\s\d\d\d\d|In\s\d\d\d\d|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday', string))
        count_dates.append(result)
    return count_dates

In [38]:
count_dates = count_dates(df_features_raw)
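To see what the date pattern actually captures, the sketch below applies the same regular expression to a made-up sentence. The sample text is hypothetical and `find_dates` is an illustrative wrapper; the pattern itself is copied from `count_dates`.

```python
import re

# Pattern copied from count_dates above.
DATE_PATTERN = re.compile(
    r'[A-Z]\w+\s\d\d\d\d|in\s\d\d\d\d|In\s\d\d\d\d|'
    r'Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday'
)

def find_dates(text):
    """Return every substring the date pattern matches."""
    return DATE_PATTERN.findall(text)

sample = "On Monday, November 2016 results were disputed; in 2015 Flight 1234 landed."
matches = find_dates(sample)
print(matches)
# ['Monday', 'November 2016', 'in 2015', 'Flight 1234']

# Numeric date formats are not covered by the pattern.
print(find_dates("published 2016-11-09"))
# []
```

The pattern is a heuristic: any capitalized word followed by four digits counts as a date ('Flight 1234'), while numeric formats such as '2016-11-09' are missed. This is an observation about the pattern only, not a measurement of how often either case occurs in the dataset.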

# <span style="color:lightslategray">Text Feature Extraction on Pre-processed Data</span> 

In [39]:
df_features_proc = df.copy()
df_features_proc['text'] = df_features_proc['text'].map(lambda x: normalization(x, lowercase=False, remove_stops=True, prt_stemming=False, snb_stemming = False, lemmatization = False))
df_features_proc['title'] = df_features_proc['title'].map(lambda x: normalization(x, lowercase=False, remove_stops=True, prt_stemming=False, snb_stemming = False, lemmatization = False))
df_features_proc['titles_text'] = df_features_proc[['title', 'text']].apply(lambda x: ' '.join(x), axis=1)

In [40]:
def count_verbs(df):
    count_verbs = []
    for row in range(df.shape[0]):
        text = nltk.word_tokenize(df['titles_text'].iloc[row])
        tag = nltk.pos_tag(text)
        a = pd.Series(tag)
        result = a.map(lambda x: 1 if re.match(r'VB(.*?)', x[1]) else 0).sum()/a.map(lambda x: 1).sum()
        count_verbs.append(result)
    return count_verbs

def count_nouns(df):
    count_nouns = []
    for row in range(df.shape[0]):
        text = nltk.word_tokenize(df['titles_text'].iloc[row])
        tag = nltk.pos_tag(text)
        a = pd.Series(tag)
        result = a.map(lambda x: 1 if re.match(r'NN(.*?)', x[1]) else 0).sum()/a.map(lambda x: 1).sum()
        count_nouns.append(result)
    return count_nouns

def count_adjectives(df):
    count_adjectives = []
    for row in range(df.shape[0]):
        text = nltk.word_tokenize(df['titles_text'].iloc[row])
        tag = nltk.pos_tag(text)
        a = pd.Series(tag)
        result = a.map(lambda x: 1 if re.match(r'JJ(.*?)', x[1]) else 0).sum()/a.map(lambda x: 1).sum()
        count_adjectives.append(result)
    return count_adjectives

def count_adverbs(df):
    count_adverbs = []
    for row in range(df.shape[0]):
        text = nltk.word_tokenize(df['titles_text'].iloc[row])
        tag = nltk.pos_tag(text)
        a = pd.Series(tag)
        result = a.map(lambda x: 1 if re.match(r'RB(.*?)', x[1]) else 0).sum()/a.map(lambda x: 1).sum()
        count_adverbs.append(result)
    return count_adverbs

def count_superlatives(df):
    count_superlatives = []
    for row in range(df.shape[0]):
        text = nltk.word_tokenize(df['titles_text'].iloc[row])
        tag = nltk.pos_tag(text)
        a = pd.Series(tag)
        result = a.map(lambda x: 1 if re.match(r'JJS', x[1]) else 0).sum()/a.map(lambda x: 1).sum()
        count_superlatives.append(result)
    return count_superlatives

def count_comparatives(df):
    count_comparatives = []
    for row in range(df.shape[0]):
        text = nltk.word_tokenize(df['titles_text'].iloc[row])
        tag = nltk.pos_tag(text)
        a = pd.Series(tag)
        result = a.map(lambda x: 1 if re.match(r'JJR', x[1]) else 0).sum()/a.map(lambda x: 1).sum()
        count_comparatives.append(result)
    return count_comparatives

def count_proper_nouns(df):
    count_proper_nouns = []
    for row in range(df.shape[0]):
        text = nltk.word_tokenize(df['titles_text'].iloc[row])
        tag = nltk.pos_tag(text)
        a = pd.Series(tag)
        result = a.map(lambda x: 1 if re.match(r'NNP', x[1]) else 0).sum()/a.map(lambda x: 1).sum()
        count_proper_nouns.append(result)
    return count_proper_nouns

In [41]:
count_verbs = count_verbs(df_features_proc)
count_nouns = count_nouns(df_features_proc)
count_adjectives = count_adjectives(df_features_proc)
count_superlatives = count_superlatives(df_features_proc)
count_comparatives = count_comparatives(df_features_proc)
count_proper_nouns = count_proper_nouns(df_features_proc)

# <span style="color:lightslategray">Text Feature Extraction on Fully Clean Data</span> 

In [42]:
df_features_clean = df.copy()
df_features_clean['text'] = df_features_clean['text'].map(lambda x: normalization(x, lowercase=True, remove_stops=True, prt_stemming=False, snb_stemming = False, lemmatization = True))
df_features_clean['title'] = df_features_clean['title'].map(lambda x: normalization(x, lowercase=True, remove_stops=True, prt_stemming=False, snb_stemming = False, lemmatization = True))
df_features_clean['titles_text'] = df_features_clean[['title', 'text']].apply(lambda x: ' '.join(x), axis=1)

In [43]:
def count_words(df):
    count_words = []
    for row in range(df.shape[0]):
        text = nltk.word_tokenize(df['titles_text'].iloc[row])  # use the dataframe passed in, not the global one
        result = len(text)
        count_words.append(result)
    return count_words

def vocab_diversity(df):
    vocab_diversity = []
    for row in range(df.shape[0]):
        text = nltk.word_tokenize(df['titles_text'].iloc[row])
        result = len(set(text))/len(text)
        vocab_diversity.append(result)
    return vocab_diversity

In [44]:
count_words = count_words(df_features_clean)
vocab_diversity = vocab_diversity(df_features_clean)
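The vocabulary diversity feature is a type-token ratio, and it is worth seeing how strongly it depends on length. The sketch below uses whitespace splitting instead of `nltk.word_tokenize` to stay dependency-free; the two articles are made up and `type_token_ratio` is a hypothetical helper.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Same vocabulary, but the second article is five times longer.
short_article = "senate vote delay budget".split()
long_article = ("senate vote delay budget " * 5).split()

short_ratio = type_token_ratio(short_article)
long_ratio = type_token_ratio(long_article)
print(short_ratio, long_ratio)
# 1.0 0.2
```

Both articles use exactly the same four words, yet the longer one scores far lower. When comparing articles of very different lengths, part of the signal in this feature is therefore just article length.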

# <span style="color:lightslategray">Text Feature Extraction EDA</span> 

In [45]:
features_list = ['count_dates',
                'count_verbs',
                'count_nouns',
                'count_adjectives',
                'count_superlatives',
                'count_comparatives',
                'count_proper_nouns',
                'count_words',
                'vocab_diversity']

data_tuples = list(zip(count_dates,
                       count_verbs,
                       count_nouns,
                       count_adjectives,
                       count_superlatives,
                       count_comparatives,
                       count_proper_nouns,
                       count_words,
                       vocab_diversity))

df_features = pd.DataFrame(data_tuples, columns = features_list)

In [46]:
scaler = preprocessing.MinMaxScaler()

columns = list(df_features.columns)

df_features[columns] = scaler.fit_transform(df_features[columns]) 
df_features.skew(axis = 0, skipna = True) # Inspect skewness before any transformation

count_dates            6.847714
count_verbs           -0.738484
count_nouns            1.260269
count_adjectives      -0.125030
count_superlatives    22.565608
count_comparatives     6.603347
count_proper_nouns     2.087801
count_words            4.082407
vocab_diversity        1.286070
dtype: float64

In [47]:
skewed_columns = ['count_dates','count_nouns','count_superlatives','count_comparatives',
                  'count_proper_nouns','count_words']

df_features[skewed_columns] = np.cbrt(df_features[skewed_columns]) # Take the cube root of the significantly skewed columns
df_features.skew(axis = 0, skipna = True) # We are good to go!

count_dates          -0.006521
count_verbs          -0.738484
count_nouns          -0.246259
count_adjectives     -0.125030
count_superlatives    0.176484
count_comparatives    0.299286
count_proper_nouns    0.287063
count_words           0.174797
vocab_diversity       1.286070
dtype: float64
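The two transformations above behave very differently with respect to skewness, which the following sketch illustrates on made-up numbers. It is written in plain Python; `skewness` computes the population skewness without bias correction, so its values differ slightly from pandas' bias-corrected `.skew()`, and all helper names are hypothetical.

```python
import math

def skewness(values):
    """Population skewness (third standardized moment), no bias correction."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def min_max(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def cube_root(values):
    """Sign-preserving cube root, the same transform np.cbrt applies."""
    return [math.copysign(abs(v) ** (1 / 3), v) for v in values]

raw = [1, 1, 1, 8, 8, 27, 64, 125, 1000]  # hypothetical right-skewed counts
scaled = min_max(raw)

skew_raw = skewness(raw)
skew_scaled = skewness(scaled)
skew_cbrt = skewness(cube_root(scaled))
print(round(skew_raw, 2), round(skew_scaled, 2), round(skew_cbrt, 2))
```

Min-max scaling is a linear rescaling, so it leaves skewness unchanged; only the cube root, which compresses large values more than small ones, pulls the long right tail in.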

In [48]:
df_features['news_label'] = list(df['news_label'])

In [49]:
df_dummies = pd.get_dummies(df_features['news_label'])
del df_dummies[df_dummies.columns[-1]]
df_new = pd.concat([df_features, df_dummies], axis=1)
del df_new['news_label']

corr = df_new.corr()
corr.style.background_gradient(cmap='coolwarm').format(precision=2)  # Styler.set_precision was removed in pandas 2.0; format(precision=...) needs pandas >= 1.3

Unnamed: 0,count_dates,count_verbs,count_nouns,count_adjectives,count_superlatives,count_comparatives,count_proper_nouns,count_words,vocab_diversity,FAKE
count_dates,1.0,0.13,-0.084,0.0059,0.19,0.17,-0.058,0.56,-0.47,-0.39
count_verbs,0.13,1.0,-0.41,0.084,0.062,0.077,-0.4,0.23,-0.23,-0.16
count_nouns,-0.084,-0.41,1.0,-0.35,-0.22,-0.26,0.77,-0.34,0.34,0.17
count_adjectives,0.0059,0.084,-0.35,1.0,0.23,0.28,-0.47,0.27,-0.16,-0.022
count_superlatives,0.19,0.062,-0.22,0.23,1.0,0.19,-0.22,0.37,-0.32,-0.072
count_comparatives,0.17,0.077,-0.26,0.28,0.19,1.0,-0.31,0.38,-0.3,-0.13
count_proper_nouns,-0.058,-0.4,0.77,-0.47,-0.22,-0.31,1.0,-0.38,0.3,0.21
count_words,0.56,0.23,-0.34,0.27,0.37,0.38,-0.38,1.0,-0.82,-0.18
vocab_diversity,-0.47,-0.23,0.34,-0.16,-0.32,-0.3,0.3,-0.82,1.0,0.098
FAKE,-0.39,-0.16,0.17,-0.022,-0.072,-0.13,0.21,-0.18,0.098,1.0


In [50]:
del df_features['news_label']

# <span style="color:lightslategray">Text Classification with Features - Tentative</span> 

In [51]:
df_FE = df.copy()
df_FE['text'] = df_FE['text'].map(lambda x: normalization(x, lowercase=True, remove_stops=False, prt_stemming=True, snb_stemming = False, lemmatization = False))
df_FE['title'] = df_FE['title'].map(lambda x: normalization(x, lowercase=True, remove_stops=False, prt_stemming=True, snb_stemming = False, lemmatization = False))

In [52]:
df_FE['titles_text'] = df_FE[['title', 'text']].apply(lambda x: ' '.join(x), axis=1)

In [53]:
# Initialize the `tfidf_vectorizer` 
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, ngram_range=(1, 3), norm = 'l2') 

In [71]:
# Fit and transform the training data 
df_text = pd.DataFrame(tfidf_vectorizer.fit_transform(df_FE['titles_text']).toarray()) 

In [None]:
#df_text['count_dates'] = df_features['count_dates']

#X_train, X_test, y_train, y_test = train_test_split(df_text, y, test_size=0.2, random_state=666, shuffle = False)

#pipe = Pipeline([('PAC', PassiveAggressiveClassifier(random_state = 69))])

#pipe_params = {'PAC__max_iter': [100, 500]}

#gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
#gs.fit(X_train, y_train);
#print("Best score:", gs.best_score_)
#print("Train score", gs.score(X_train, y_train))
#print("Test score", gs.score(X_test, y_test))
#gs.best_params_

# <span style="color:lightslategray">Isolating Words Based on POS</span> 

Another attempt was made to extract features from the news dataset. Instead of running a machine learning algorithm on all tokens of the 'titles_text' column, the algorithm was trained only on words with a specific POS tag. The POS tags tested were the following:

**1/ POS tag beginning with NN**: Nouns are the most significant tokens among the articles. 

**2/ POS tag beginning with VB**: Verbs in all their forms were extracted. 

**3/ POS tag beginning with JJ**: The POS tags beginning with JJ also include comparatives and superlatives. 

**4/ POS tag beginning with RB**: Adverbs in all their forms were extracted. Since they often end with the suffix 'ly', the text had to be left unlemmatized and unstemmed for the POS tags to be extracted.  

**5/ POS tag FW**: The 'foreign word' POS tag, typically indicative of a rich vocabulary and culture from real news, was also attributed to misspelled words from fake news, and performed poorly on the algorithm. 

**6/ POS tag MD**: The 'modal auxiliary' POS tag is very common among all articles, and performed poorly. 

**7/ POS tag UH**: The 'interjection' POS tag was present in very few articles, which impeded our model's ability to train on those features. 

**8/ POS tag LS**: The 'list marker' POS tag was present in very few articles, which impeded our model's ability to train on those features. 

After running the algorithm on the above POS tags individually and in various combinations, the **highest accuracy score was 93.5%** and included all tokens with POS tags starting with 'NN', 'VB', and 'RB'; i.e. all nouns, verbs, and adverbs. The worst scores were obtained by including the 'MD' and 'UH' POS tags. 'MD' tags were present in nearly all articles, real or fake, and 'UH' tags were very scarce across the dataset. 

Though the accuracy score obtained using nouns, verbs, and adverbs was relatively high, it still did not perform as well as the optimal cross-validation score of 93.9% by including all tokens. It does however demonstrate that the most relevant tokens for this text classification algorithm are nouns, verbs, and adverbs. Such tokens are key in differentiating fake news from real news. 
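The idea of training on selected POS only can be sketched with a single helper that keeps the words whose tag starts with any of the requested prefixes. The tagged sentence is typed in by hand for illustration (in the notebook the tags come from `nltk.pos_tag`), and `keep_tags` is a hypothetical helper rather than the notebook's own functions.

```python
def keep_tags(tagged_tokens, prefixes):
    """Join the words whose POS tag starts with any of `prefixes`."""
    wanted = tuple(prefixes)
    return " ".join(word for word, tag in tagged_tokens if tag.startswith(wanted))

# Hand-tagged example using Penn Treebank tags (an assumption for illustration).
tagged = [
    ('Supporters', 'NNS'), ('angrily', 'RB'), ('erupt', 'VBP'),
    ('loud', 'JJ'), ('Twitter', 'NNP'), ('oh', 'UH'),
]

nouns_only = keep_tags(tagged, ['NN'])
nouns_verbs_adverbs = keep_tags(tagged, ['NN', 'VB', 'RB'])
interjections = keep_tags(tagged, ['UH'])
print(nouns_only)
print(nouns_verbs_adverbs)
print(interjections)
```

The filtered strings are then vectorized and classified exactly like the full text. Passing the tag prefixes as an argument keeps every combination (nouns only, nouns plus verbs plus adverbs, and so on) on the same code path, so the tag names live in one place.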

In [72]:
df_pos = df.copy()
df_pos['text'] = df_pos['text'].map(lambda x: normalization(x, lowercase=False, remove_stops=True, prt_stemming=False, snb_stemming = False, lemmatization = False))
df_pos['title'] = df_pos['title'].map(lambda x: normalization(x, lowercase=False, remove_stops=True, prt_stemming=False, snb_stemming = False, lemmatization = False))
df_pos['titles_text'] = df_pos[['title', 'text']].apply(lambda x: ' '.join(x), axis=1)

In [73]:
def extract_nouns(df):
    extract_nouns = []
    for row in range(df.shape[0]):
        txt = df['titles_text'].iloc[row]
        nouns = [word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if re.match(r'NN(.*?)', pos)]
        nouns = " ".join(nouns)
        extract_nouns.append(nouns)
    return extract_nouns

def extract_verbs(df):
    extract_verbs = []
    for row in range(df.shape[0]):
        txt = df['titles_text'].iloc[row]
        verbs = [word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if re.match(r'VB(.*?)', pos)]
        verbs = " ".join(verbs)
        extract_verbs.append(verbs)
    return extract_verbs

def extract_adj(df):
    extract_adj = []
    for row in range(df.shape[0]):
        txt = df['titles_text'].iloc[row]
        adj = [word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if re.match(r'JJ(.*?)', pos)]
        adj = " ".join(adj)
        extract_adj.append(adj)
    return extract_adj

def extract_adv(df):
    extract_adv = []
    for row in range(df.shape[0]):
        txt = df['titles_text'].iloc[row]
        adv = [word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if re.match(r'RB(.*?)', pos)]
        adv = " ".join(adv)
        extract_adv.append(adv)
    return extract_adv

def extract_interjections(df):
    extract_uh = []
    for row in range(df.shape[0]):
        txt = df['titles_text'].iloc[row]
        uh = [word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if re.match(r'UH(.*?)', pos)]  # 'UH' is the interjection tag
        uh = " ".join(uh)
        extract_uh.append(uh)
    return extract_uh

def extract_foreign_words(df):
    extract_fw = []
    for row in range(df.shape[0]):
        txt = df['titles_text'].iloc[row]
        fw = [word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if re.match(r'FW(.*?)', pos)]
        fw = " ".join(fw)
        extract_fw.append(fw)
    return extract_fw

def extract_modal(df):
    extract_md = []
    for row in range(df.shape[0]):
        txt = df['titles_text'].iloc[row]
        md = [word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if re.match(r'MD(.*?)', pos)]
        md = " ".join(md)
        extract_md.append(md)
    return extract_md

def extract_list_marker(df):
    extract_ls = []
    for row in range(df.shape[0]):
        txt = df['titles_text'].iloc[row]
        ls = [word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if re.match(r'LS(.*?)', pos)]
        ls = " ".join(ls)
        extract_ls.append(ls)
    return extract_ls

In [74]:
extract_nouns = extract_nouns(df_pos)
extract_verbs = extract_verbs(df_pos)
extract_adj = extract_adj(df_pos)
extract_adv = extract_adv(df_pos)
extract_uh = extract_interjections(df_pos)
extract_fw = extract_foreign_words(df_pos)
extract_md = extract_modal(df_pos)
extract_ls = extract_list_marker(df_pos)

In [75]:
df_pos['nouns'] = extract_nouns
df_pos['verbs'] = extract_verbs
df_pos['adjectives'] = extract_adj
df_pos['adverbs'] = extract_adv
df_pos['foreign words'] = extract_fw
df_pos['interjections'] = extract_uh
df_pos['modal'] = extract_md
df_pos['list marker'] = extract_ls

# <span style="color:lightslategray">Modelling on isolated POS</span> 

In this section, the team isolated the effect of each POS and also tried combinations of them. Nouns alone carry most of the signal (cross-validation score of 0.928). However, combining different POS leads to better performance than using only one. In fact, the best score in this section was obtained when combining nouns, verbs and adverbs.

In [76]:
def basic_clean2(dataframe):
    'Function that takes the dataframe as an input and cleans it by removing punctuation and digits and collapsing whitespace'
    #Removing punctuation
    dataframe.title = dataframe.title.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
    dataframe.text = dataframe.text.apply(lambda x: x.translate(str.maketrans('','', string.punctuation)))
    #Replacing any remaining non-word, non-space characters with a space (regex=True is required in recent pandas)
    dataframe['title'] = dataframe['title'].str.replace(r'[^\w\s]', ' ', regex=True)
    dataframe['text'] = dataframe['text'].str.replace(r'[^\w\s]', ' ', regex=True)
    #Removing digits
    dataframe.title = dataframe.title.apply(lambda x: x.translate(str.maketrans('', '', string.digits)))
    dataframe.text = dataframe.text.apply(lambda x: x.translate(str.maketrans('','', string.digits)))
    #Removing double spaces
    dataframe['title'] = dataframe['title'].str.replace('  ',' ')
    dataframe['text'] = dataframe['text'].str.replace('  ',' ')
    #Collapsing whitespace runs and literal \n sequences into a single space
    dataframe['title'] = dataframe['title'].replace(r'\s+|\\n', ' ', regex = True, inplace = False)
    dataframe['text'] = dataframe['text'].replace(r'\s+|\\n', ' ', regex = True, inplace = False)
    return dataframe

def normalization2(text, lowercase = False, remove_stops = False, prt_stemming = False, snb_stemming = False, lemmatization = False):
    'Flexible function to try effect of removing stopwords and the different stemming techniques'
    txt = str(text)
    if lowercase:
        txt = " ".join([w.lower() for w in txt.split()])
    if remove_stops:
        txt = " ".join([w for w in txt.split() if w not in stops])
    if prt_stemming:
        st = PorterStemmer()
        txt = " ".join([st.stem(w) for w in txt.split()])
    if snb_stemming:
        snb = SnowballStemmer('english')
        txt = " ".join([snb.stem(w) for w in txt.split()])
    if lemmatization:
        wordnet_lemmatizer = WordNetLemmatizer()
        txt = " ".join([wordnet_lemmatizer.lemmatize(w, pos = 's') for w in txt.split()])
    return txt

### Adverbs

In [77]:
df_pos1 = df_pos.copy()
basic_clean(df_pos1)
df_pos1['adverbs'] = df_pos1['adverbs'].map(lambda x: normalization2(x, lowercase=True, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = False))

In [78]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df_pos1['adverbs'], y, test_size=0.2, random_state=666)

In [79]:
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

cv_scores = cross_val_score(PassiveAggressiveClassifier(), tfidf_train,y_train,cv=10,n_jobs=-1)
mean_score = round(np.mean(cv_scores), 3)
print("Mean Cross-Validation score: ",mean_score)

Mean Cross-Validation score:  0.673


### Nouns

In [80]:
df_pos1['nouns'] = df_pos1['nouns'].map(lambda x: normalization2(x, lowercase=True, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = False))

In [81]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df_pos1['nouns'], y, test_size=0.2, random_state=666)

In [82]:
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

cv_scores = cross_val_score(PassiveAggressiveClassifier(), tfidf_train,y_train,cv=10,n_jobs=-1)
mean_score = round(np.mean(cv_scores), 3)
print("Mean Cross-Validation score: ",mean_score)

Mean Cross-Validation score:  0.928


### Verbs

In [83]:
df_pos1['verbs'] = df_pos1['verbs'].map(lambda x: normalization2(x, lowercase=True, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = False))

In [84]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df_pos1['verbs'], y, test_size=0.2, random_state=666)

In [85]:
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

cv_scores = cross_val_score(PassiveAggressiveClassifier(), tfidf_train,y_train,cv=10,n_jobs=-1)
mean_score = round(np.mean(cv_scores), 3)
print("Mean Cross-Validation score: ",mean_score)

Mean Cross-Validation score:  0.816


### Adjectives

In [86]:
df_pos1['adjectives'] = df_pos1['adjectives'].map(lambda x: normalization2(x, lowercase=True, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = False))

In [87]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df_pos1['adjectives'], y, test_size=0.2, random_state=666)

In [88]:
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

cv_scores = cross_val_score(PassiveAggressiveClassifier(), tfidf_train,y_train,cv=10,n_jobs=-1)
mean_score = round(np.mean(cv_scores), 3)
print("Mean Cross-Validation score: ",mean_score)

Mean Cross-Validation score:  0.807


In [89]:
df_pos1['foreign words'] = df_pos1['foreign words'].map(lambda x: normalization2(x, lowercase=True, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = False))

### Verbs and Nouns

In [132]:
# Merging verbs and nouns
df_pos1['verbs_nouns'] = df_pos1[['verbs', 'nouns']].apply(lambda x: ' '.join(x), axis=1)

In [133]:
df_pos1['verbs_nouns'] = df_pos1['verbs_nouns'].map(lambda x: normalization2(x, lowercase=True, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = True))

In [134]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df_pos1['verbs_nouns'], y, test_size=0.2, random_state=666)

In [135]:
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

cv_scores = cross_val_score(PassiveAggressiveClassifier(), tfidf_train,y_train,cv=10,n_jobs=-1)
mean_score = round(np.mean(cv_scores), 3)
print("Mean Cross-Validation score: ",mean_score)

Mean Cross-Validation score:  0.931


### Adjectives and Nouns

In [94]:
# Merging adjectives and nouns
df_pos1['adj_nouns'] = df_pos1[['adjectives', 'nouns']].apply(lambda x: ' '.join(x), axis=1)

In [95]:
df_pos1['adj_nouns'] = df_pos1['adj_nouns'].map(lambda x: normalization2(x, lowercase=True, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = True))

In [96]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df_pos1['adj_nouns'], y, test_size=0.2, random_state=666)

In [97]:
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

cv_scores = cross_val_score(PassiveAggressiveClassifier(), tfidf_train,y_train,cv=10,n_jobs=-1)
mean_score = round(np.mean(cv_scores), 3)
print("Mean Cross-Validation score: ",mean_score)

Mean Cross-Validation score:  0.926


### Verbs and Adjectives

In [98]:
# Merging adjectives and verbs
df_pos1['adj_verbs'] = df_pos1[['adjectives', 'verbs']].apply(lambda x: ' '.join(x), axis=1)

In [99]:
df_pos1['adj_verbs'] = df_pos1['adj_verbs'].map(lambda x: normalization2(x, lowercase=True, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = False))

In [100]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df_pos1['adj_verbs'], y, test_size=0.2, random_state=666)

In [101]:
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

cv_scores = cross_val_score(PassiveAggressiveClassifier(), tfidf_train,y_train,cv=10,n_jobs=-1)
mean_score = round(np.mean(cv_scores), 3)
print("Mean Cross-Validation score: ",mean_score)

Mean Cross-Validation score:  0.863


### Verbs, Adjectives and Nouns

In [102]:
# Merging the adjectives, verbs and nouns columns into a single text column
df_pos1['all'] = df_pos1[['adjectives', 'verbs', 'nouns']].apply(lambda x: ' '.join(x), axis=1)

In [103]:
basic_clean2(df_pos1)
df_pos1['all'] = df_pos1['all'].map(lambda x: normalization2(x, lowercase=True, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = False))

In [104]:
# Further cleaning
df_pos1['all'] = df_pos1['all'].map(lambda x: re.sub(r'\W+', ' ', x))
df_pos1['all'] = df_pos1['all'].str.replace(r'[0-9]', '', regex=True)  # regex=True so the digit pattern is not treated as a literal string

In [105]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df_pos1['all'], y, test_size=0.2, random_state=666)

In [106]:
tfidf_vectorizer = TfidfVectorizer(max_df = 0.75) 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

cv_scores = cross_val_score(PassiveAggressiveClassifier(random_state=11), tfidf_train,y_train,cv=10,n_jobs=-1)
mean_score = round(np.mean(cv_scores), 3)
print("Mean Cross-Validation score: ",mean_score)

Mean Cross-Validation score:  0.934


### Nouns, Verbs and Adverbs

In [107]:
# Merging the nouns, verbs and adverbs columns into a single text column
df_pos1['all2'] = df_pos1[['nouns', 'verbs', 'adverbs']].apply(lambda x: ' '.join(x), axis=1)

In [108]:
basic_clean2(df_pos1)
df_pos1['all2'] = df_pos1['all2'].map(lambda x: normalization2(x, lowercase=True, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = False))

In [109]:
# Further cleaning
df_pos1['all2'] = df_pos1['all2'].map(lambda x: re.sub(r'\W+', ' ', x))
df_pos1['all2'] = df_pos1['all2'].str.replace(r'[0-9]', '', regex=True)  # regex=True so the digit pattern is not treated as a literal string

In [110]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df_pos1['all2'], y, test_size=0.2, random_state=666)

In [111]:
tfidf_vectorizer = TfidfVectorizer() 
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

cv_scores = cross_val_score(PassiveAggressiveClassifier(random_state=11), tfidf_train,y_train,cv=10,n_jobs=-1)
mean_score = round(np.mean(cv_scores), 3)
print("Mean Cross-Validation score: ",mean_score)

Mean Cross-Validation score:  0.935


# <span style="color:darkblue">VI. Predicting on Test Set</span> 

Overall, the optimal data cleaning processes and hyperparameters for this text classification exercise are the following:

1/ Removing matches of the regular expression '^\w\s', which represents a one-letter word (a single word character followed by whitespace) at the beginning of the article. For instance, it removes the 'A' from an article beginning with 'A thing of the past ...'.

2/ Punctuation was also removed from both titles and texts.

3/ Satellite adjectives were also lemmatized. 
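
As a rough illustration of the first two cleaning steps, the sketch below applies the two patterns with `re.sub` to a made-up sentence. The sample string and variable names are hypothetical, and the notebook's own `basic_clean` helper may apply the patterns differently.

```python
import re

# Hypothetical sample text, not taken from the dataset
sample = "A thing of the past: prices, wages & rents!"

# '^\w\s' matches one word character followed by one whitespace character,
# anchored at the very start of the string (e.g. a leading 'A ').
no_leading_letter = re.sub(r'^\w\s', '', sample)

# '\W+' matches runs of non-word characters (punctuation, spaces, symbols);
# replacing each run with a single space strips the punctuation.
no_punctuation = re.sub(r'\W+', ' ', no_leading_letter).strip()

print(no_leading_letter)
print(no_punctuation)
```

Note that the first pattern only acts on the very first character of the string, so one-letter words elsewhere in the article are left untouched.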

The TF-IDF vectorizer was optimized with the following hyperparameters:

1/ 'max_df' was set at 1.0: this implied no token was removed from the dataset because it appeared too frequently. 

2/ 'ngram_range' was set at (1, 2): 1 is the lower bound and 2 is the upper bound on the length of the n-grams extracted from the text, so both single words (unigrams) and pairs of consecutive words (bigrams) are used as features.

3/ 'norm' was set to 'l2': This is the default setting, under which each document vector is scaled so that the sum of the squares of its elements is equal to 1. 
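
To make the last two settings concrete, here is a minimal pure-Python sketch of what an n-gram range of (1, 2) and L2 normalization mean. The helpers `word_ngrams` and `l2_normalize` are hypothetical names written for this illustration; they are not scikit-learn functions and skip the tokenization and IDF weighting that `TfidfVectorizer` performs.

```python
import math

def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous n-grams with n between the two bounds, inclusive."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

def l2_normalize(vector):
    """Scale a vector so that the sum of its squared elements equals 1."""
    length = math.sqrt(sum(v * v for v in vector))
    return [v / length for v in vector] if length else list(vector)

# Unigrams first, then bigrams
features = word_ngrams(['fake', 'news', 'spreads'])
print(features)

# A toy weight vector before and after normalization
unit = l2_normalize([3.0, 4.0])
print(unit, sum(v * v for v in unit))
```

Adding bigrams lets the model pick up short phrases as well as single words, at the cost of a much larger vocabulary.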

The algorithm with the highest overall performance was the passive aggressive classifier. It is passive when a training example is classified correctly with a sufficient margin, in which case the model is left unchanged, and aggressive when an example is misclassified or falls inside the margin, in which case the weights are updated just enough to correct it. The passive aggressive classifier was optimized with the following hyperparameter:

1/ The loss argument was set at 'hinge': For a label y in {-1, +1} and a classifier score w·x, the hinge loss is max(0, 1 - y (w·x)). It is zero when the prediction is correct with a margin of at least 1 and grows linearly with the size of the violation otherwise (the alternative setting, 'squared_hinge', squares this quantity). This differs from the 0-1 loss function, which would impose a penalty of 1 for every mistaken prediction and no penalty for a correct prediction: the hinge loss penalizes larger mistakes more heavily, which drives the classifier to reduce the loss quickly.
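
The sketch below illustrates the hinge loss, the 0-1 loss, and a single passive aggressive update step in plain Python. It assumes labels encoded as -1 and +1 (the notebook uses 'FAKE' and 'REAL') and follows the PA-I rule from the passive aggressive literature, which is what the 'hinge' setting is understood to correspond to. The function names are hypothetical and this is not scikit-learn's implementation.

```python
def dot(w, x):
    """Dot product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))

def hinge_loss(w, x, y):
    """max(0, 1 - y * (w . x)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * dot(w, x))

def zero_one_loss(w, x, y):
    """1 when the score does not have the sign of the label, 0 otherwise."""
    return 0.0 if y * dot(w, x) > 0 else 1.0

def pa_update(w, x, y, C=1.0):
    """One PA-I step: unchanged weights when the hinge loss is zero,
    otherwise a move along y * x with a step size capped at C."""
    loss = hinge_loss(w, x, y)
    if loss == 0.0:
        return list(w)                      # passive: margin already satisfied
    tau = min(C, loss / dot(x, x))          # aggressive: step size from the loss
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

# Toy example: zero weights, one positive training point
w = [0.0, 0.0]
x, y = [1.0, 2.0], 1
w_new = pa_update(w, x, y)
print(hinge_loss(w, x, y), zero_one_loss(w, x, y))
print(w_new, hinge_loss(w_new, x, y))
```

In this toy case the step size stays below the cap `C`, so one update is enough to bring the hinge loss on that example down to zero.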

### Reading the CSV

In [121]:
# Reading the test (holdout) dataset
url = "https://raw.githubusercontent.com/jonathanserrano1993/Fake_news_detection_NLP/master/fake_or_real_news_test.csv"
s = requests.get(url).content
df_test = pd.read_csv(io.StringIO(s.decode('utf-8')), sep=',')
df_test.head()

Unnamed: 0,ID,title,text
0,10498,September New Homes Sales Rise——-Back To 1992 ...,September New Homes Sales Rise Back To 1992 Le...
1,2439,Why The Obamacare Doomsday Cult Can't Admit It...,But when Congress debated and passed the Patie...
2,864,"Sanders, Cruz resist pressure after NY losses,...",The Bernie Sanders and Ted Cruz campaigns vowe...
3,4128,Surviving escaped prisoner likely fatigued and...,Police searching for the second of two escaped...
4,662,Clinton and Sanders neck and neck in Californi...,No matter who wins California's 475 delegates ...


### Pre-Processing on Test set

In [122]:
test = df_test.copy()
basic_clean(test)

# Further cleaning
test['text'] = test['text'].map(lambda x: re.sub(r'\W+', ' ', x))
test['title'] = test['title'].map(lambda x: re.sub(r'\W+', ' ', x))

# Lemmatizing on satellite adjectives (lowercasing is disabled here)
test['text'] = test['text'].map(lambda x: normalization(x, lowercase=False, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = True))
test['title'] = test['title'].map(lambda x: normalization(x, lowercase=False, remove_stops=False, prt_stemming=False, snb_stemming = False, lemmatization = True))

In [123]:
test['titles_text'] = test[['title', 'text']].apply(lambda x: ' '.join(x), axis=1)

### Applying Best Performing Model

In [124]:
# Set the model to the Passive Aggressive Classifier
model = PassiveAggressiveClassifier(loss = 'hinge', random_state = 69)

tfidf_vectorizer = TfidfVectorizer(max_df = 1.0, norm = 'l2', ngram_range = (1,2)) 
tfidf_train = tfidf_vectorizer.fit_transform(df1['titles_text']) 
tfidf_test = tfidf_vectorizer.transform(test['titles_text'])

# Set the training features, the test features, and the target labels
train_data = tfidf_train
test_data = tfidf_test
target = y

# Fitting the model on the training features and the target labels
model.fit(train_data, target)

model.predict(test_data)

array(['FAKE', 'REAL', 'REAL', ..., 'FAKE', 'REAL', 'REAL'], dtype='<U4')

### Exporting the results

In [125]:
test['prediction'] = list(model.predict(test_data))

In [126]:
test = test[['ID','prediction']]

In [127]:
test.to_csv('NLP_paul_jonathan.csv', encoding='utf-8', index=False)

# <span style="color:darkblue">Thank you!</span> 

In [130]:
Image(url= "https://i.ibb.co/hdsP0W4/tumblr-otxq0p-WDSh1s2791bo1-1280.gif")