# NLP Assignment 1 - Federico García e Ignacio Rupérez. MBD O1. 

**Madrid. 28th May 2017.** In the following notebook, we have created a model able to classify fake and real news with **96.06% accuracy**. To do so we have applied several Natural Language Processing techniques iteratively.

In [2]:
# __future__ imports have to come before any other statement
from __future__ import print_function
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import itertools
import collections
import nltk.classify.util
import logging
import warnings
warnings.simplefilter('ignore')
from pprint import pprint
from time import time
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from nltk.classify import MaxentClassifier
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

First of all, we import the CSV file with the labeled (fake or real) pieces of news.

In [3]:
# Import 'fake_or_real_news_training.csv'
df = pd.read_csv("fake_or_real_news_training.csv")
    
# Inspect shape of 'df' 
df.shape

(3999, 6)

We see that the training set has 3999 pieces of news and 6 columns.

In [4]:
# Print first lines of 'df' 
df.head()

,ID,title,text,label,X1,X2
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,,
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,,
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,,
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,,
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,,


The columns are ID, title, text, label, X1 and X2. We can set the ID column to be the index.

In [5]:
# Set index 
df = df.set_index("ID")

# Print first lines of 'df' 
df.head()

ID,title,text,label,X1,X2
8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,,
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,,
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,,
10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,,
875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,,


X1 and X2 columns seem to be empty, but we should check them first.

In [6]:
# Check if all values in X1 and X2 are NaN
len(df)-sum(pd.isnull(df['X1']))

33

In [7]:
len(df)-sum(pd.isnull(df['X2']))

2

We see that 33 rows have a value in column X1 and 2 rows have a value in column X2. Let's sort the dataframe by X1 to see what those values are.

In [8]:
print(df.sort_values(by='X1', ascending=True))

                                                   title  \
ID                                                         
6268   Chart Of The Day: Since 2009—–Recovery For The 5%   
10499  30th Infantry Division: “Work Horse of the Wes...   
6717                    Jim Rogers: It’s Time To Prepare   
8748       WATCH: Mass Shooting Occurs During #TrumpRiot   
5741                                       Why Trump Won   
10138                Inside The Mind Of An FBI Informant   
10492  TOP BRITISH GENERAL WARNS OF NUCLEAR WAR WITH ...   
6404   #BREAKING: SECOND Assassination Attempt On Tru...   
8470   The Amish In America Commit Their Vote To Dona...   
7559   STATE OF GEORGIA FIRES PASTOR BECAUSE OF HIS F...   
9954   Incredible smoke haze seen outside NDTV office...   
10194  Who rode it best? Jesse Jackson mounts up to f...   
9203         Political Correctness for Yuengling Brewery   
9097                    ICE Agent Commits Suicide in NYC   
7375   Shallow 5.4 magnitude earthquake 

It seems that some texts were split in two, so that the second part landed in the label column and the FAKE or REAL label was pushed into X1. For two rows the text was split in three, which pushed the label into X2. Let's move the text fragments back into the text column and the FAKE or REAL values back into the label column.
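
The repair described above can be sketched on a small, made-up frame. The rows and `ID` values are hypothetical; only the column names come from the notebook. The sketch uses boolean masks with `df.loc`, which is one possible way to write the fix and avoids modifying rows while iterating over them.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second one has its text split across 'text' and 'label',
# so the real label has slipped into 'X1'.
toy = pd.DataFrame(
    {
        "text": ["Some complete article", "First half of an article"],
        "label": ["REAL", "second half of the article"],
        "X1": [np.nan, "FAKE"],
    },
    index=pd.Index([1, 2], name="ID"),
)

# Rows whose 'X1' holds a label are the shifted ones
shifted = toy["X1"].isin(["FAKE", "REAL"])

# Glue the two text fragments back together, then restore the label
toy.loc[shifted, "text"] = toy.loc[shifted, "text"] + " " + toy.loc[shifted, "label"]
toy.loc[shifted, "label"] = toy.loc[shifted, "X1"]
toy.loc[shifted, "X1"] = np.nan

print(toy)
```

Rows that were not shifted are left untouched because the mask is False for them.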

In [9]:
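# Rows modified inside iterrows() are not guaranteed to be written back to 'df',
# so we update the dataframe through boolean masks instead.

# Text split in two: second part in 'label', real label in 'X1'
shifted_once = df['X1'].isin(["FAKE", "REAL"])
df.loc[shifted_once, 'text'] = df.loc[shifted_once, 'text'] + " " + df.loc[shifted_once, 'label']
df.loc[shifted_once, 'label'] = df.loc[shifted_once, 'X1']
df.loc[shifted_once, 'X1'] = np.nan

# Text split in three: parts in 'label' and 'X1', real label in 'X2'
shifted_twice = df['X2'].isin(["FAKE", "REAL"])
df.loc[shifted_twice, 'text'] = (df.loc[shifted_twice, 'text'] + " " + df.loc[shifted_twice, 'label']
                                 + " " + df.loc[shifted_twice, 'X1'])
df.loc[shifted_twice, 'label'] = df.loc[shifted_twice, 'X2']
df.loc[shifted_twice, ['X1', 'X2']] = np.nan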
        
len(df)-sum(pd.isnull(df['X1']))

0

In [10]:
len(df)-sum(pd.isnull(df['X2']))

0

Now we don't have any values in X1 and X2 columns.
<br>
<br>
We'll create the variable 'y' with all the labels.

In [11]:
# Set 'y'
y = df.label 

We believe we should merge the 'title' and 'text' columns into a single 'all' column, as the title may contain informative words that help to achieve a better accuracy in the later classification.
<br>
<br>
Columns X1 and X2 can be dropped.

In [12]:
# Paste together title and text into a column "all"
df["all"] = df["title"] + " " + df["text"]
df = df.drop(['X1', 'X2'], axis=1)
df.head()

ID,title,text,label,all
8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,You Can Smell Hillary’s Fear Daniel Greenfield...
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,Watch The Exact Moment Paul Ryan Committed Pol...
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,Kerry to go to Paris in gesture of sympathy U....
10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,Bernie supporters on Twitter erupt in anger ag...
875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,The Battle of New York: Why This Primary Matte...


Now we create the training and test sets with the train_test_split function. We take two thirds for training and one third for testing.

In [13]:
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(df['all'], y, test_size=0.33, random_state=53)
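
As a quick sanity check, the sizes of the two sets can be worked out by hand. The sketch below assumes that the test share is rounded up and the remainder goes to training, which is our understanding of how `train_test_split` handles a fractional `test_size`.

```python
import math

n_samples = 3999   # rows in the training file, from df.shape
test_size = 0.33   # fraction passed to train_test_split

# Assumption: the test share is rounded up and the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)

# With this many test articles, one misclassified article changes the
# accuracy by this many percentage points
step = 100 / n_test
print(round(step, 3))
```

A test set of 1320 articles is consistent with the accuracies reported later in the notebook, for example 0.942424 is 1244/1320 rounded to six decimals.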

In the cell below we define two classes that add stemming (with the Snowball stemmer) to the count and tfidf vectorizers.

In [14]:
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])
    
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])
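
To illustrate how overriding `build_analyzer` changes the vocabulary, here is a minimal sketch that swaps the Snowball stemmer for a toy function. `toy_stem` and `ToyStemmedCountVectorizer` are hypothetical names used only for this illustration, and the sample sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s"
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

class ToyStemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the standard analyzer (preprocessing, tokenizing, n-grams)
        analyzer = super().build_analyzer()
        # ...and post-process every string it returns
        return lambda doc: [toy_stem(tok) for tok in analyzer(doc)]

vec = ToyStemmedCountVectorizer()
X = vec.fit_transform(["votes and voters", "vote voter"])
vocab = sorted(vec.vocabulary_)
print(vocab)
print(X.toarray())

# With n-grams, the analyzer already returns the bigrams as single strings,
# so the stemming function only sees the end of each bigram
bigram_tokens = ToyStemmedCountVectorizer(ngram_range=(1, 2)).build_analyzer()("votes and voters")
print(bigram_tokens)
```

The last print shows a point worth keeping in mind with this design: a stemmer applied after the analyzer receives each n-gram as one string, so the words inside a bigram are not stemmed individually.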

We define a function 'fake_news' with the following parameters:
- ngramrange: the (min, max) range of n-gram sizes to extract
- stop: the stop words to use (None, 'english' or a collection of words)
- vect: the name of the vectorizer ("count", "tfidf", "stemcount" or "stemtfidf")
- classifier: the name of the classifier ("NB", "PA" or "SVC")
- X_train, X_test, y_train and y_test: the training and test news and labels

The function returns the accuracy score of the resulting model.

In [15]:
def fake_news(ngramrange, stop, vect, classifier, X_train = X_train, X_test = X_test, y_train = y_train, y_test = y_test):
        
    if vect == "count": 
        # Initialize the 'count_vectorizer'
        count_vectorizer = CountVectorizer(stop_words = stop, ngram_range = ngramrange)
        # Fit and transform the training data 
        train = count_vectorizer.fit_transform(X_train) 
        # Transform the test set
        test = count_vectorizer.transform(X_test)
        
    elif vect == "tfidf":
        # Initialize the 'tfidf_vectorizer' 
        tfidf_vectorizer = TfidfVectorizer(stop_words = stop, max_df = 0.7, ngram_range = ngramrange) 
        train = tfidf_vectorizer.fit_transform(X_train)  
        test = tfidf_vectorizer.transform(X_test)
        
    elif vect == "stemcount":
        # Initialize the 'stemmed_count_vectorizer'
        stemmed_count_vect = StemmedCountVectorizer(stop_words = stop, ngram_range = ngramrange)
        train = stemmed_count_vect.fit_transform(X_train) 
        test = stemmed_count_vect.transform(X_test)
        
    elif vect == "stemtfidf":
        # Initialize the 'stemmed_tfidf_vectorizer'
        stemmed_tfidf_vect = StemmedTfidfVectorizer(stop_words = stop, max_df = 0.7, ngram_range = ngramrange)
        train = stemmed_tfidf_vect.fit_transform(X_train) 
        test = stemmed_tfidf_vect.transform(X_test)
    
    if classifier == "NB":
        # Instantiate a Multinomial Naive Bayes classifier
        clf = MultinomialNB() 
              
    elif classifier == "PA":
        # Instantiate a Passive-Aggressive classifier
        clf = PassiveAggressiveClassifier(max_iter=50, n_jobs=-1)

    elif classifier == "SVC":
        # Instantiate a SVC Classifier
        clf = LinearSVC()
        
    # Fit the classifier to the training data
    clf.fit(train, y_train)
    # Create the predicted tags
    pred = clf.predict(test)    
        
    # Calculate the accuracy score
    score = metrics.accuracy_score(y_test, pred)
    
    return score

We will now initialize a table to compare all the possible combinations of ngrams, stopwords, vectorizers and classifiers with their accuracies.

In [16]:
col_names =  ['ACCURACY', 'VECTORIZER', 'CLASSIFIER', 'NGRAMS', 'STOPWORDS']
acc_table = pd.DataFrame(columns = col_names)

As stop words we can use scikit-learn's built-in 'english' list, but we also create a second list (stopset): NLTK's English stop words minus some words, such as 'no', 'not', 'more' or 'very', that may help the classification.
We also create lists of ngrams, stopwords, vectorizers and classifiers to be passed to the 'fake_news' function as arguments.
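
The idea behind the custom stop word list is a plain set difference. The sketch below uses a shortened, hypothetical base list so that it runs without downloading the NLTK corpus.

```python
# Hypothetical, shortened base list; the notebook uses NLTK's English stop words
base_stopwords = {"the", "a", "is", "not", "no", "very", "of"}

# Words we want the vectorizer to keep because they may carry signal
keep = {"not", "no", "very"}

custom_stopset = base_stopwords - keep
print(sorted(custom_stopset))
```

Depending on the scikit-learn version, `stop_words` may have to be passed as a list rather than a set, in which case `list(custom_stopset)` would be needed.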

In [17]:
stopset = set(stopwords.words('english')) - set(('over', 'under', 'below', 'more', 'most', 'no', 'not', 'only', 'such', 'few', 'so', 'too', 'very', 'just', 'any', 'once'))
ngrams_list = [(1,1), (1,2), (1,3)]
stopwords_list = [None, "english", stopset]
vectorizers_list = ["count", "tfidf", "stemcount", "stemtfidf"]
classifiers_list = ["NB", "PA", "SVC"]


Now we are ready to call the 'fake_news' function with all the possible combinations of the above lists. To do that, we can use 4 nested _for_ loops. In every iteration a new row with the resulting accuracy score is added to the table for comparison.
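
The same enumeration can also be sketched with `itertools.product`, which yields the combinations in the same order as four nested loops. In this sketch the string "stopset" is only a placeholder for the custom stop word set, and nothing is trained.

```python
import itertools

ngrams_list = [(1, 1), (1, 2), (1, 3)]
stopwords_list = [None, "english", "stopset"]  # "stopset" is a placeholder string
vectorizers_list = ["count", "tfidf", "stemcount", "stemtfidf"]
classifiers_list = ["NB", "PA", "SVC"]

# The last list varies fastest, exactly like the innermost for loop
combos = list(itertools.product(ngrams_list, stopwords_list,
                                vectorizers_list, classifiers_list))

print(len(combos))   # 3 * 3 * 4 * 3 combinations
print(combos[0])
print(combos[1])
```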

In [17]:
i = 0
total = len(ngrams_list) * len(stopwords_list) * len(vectorizers_list) * len(classifiers_list)

for ngramrange in ngrams_list:
    for stop in stopwords_list:
        for vect in vectorizers_list:
            for classifier in classifiers_list:
                i = i + 1
                print("Calculating score", i, "of", total)
                score = fake_news(ngramrange, stop, vect, classifier)
                acc_table.loc[len(acc_table)] = [score, vect, classifier, ngramrange, stop]

acc_table = acc_table.sort_values(by='ACCURACY', ascending=False)

Calculating score 1 of 108
Calculating score 2 of 108
Calculating score 3 of 108
Calculating score 4 of 108
Calculating score 5 of 108
Calculating score 6 of 108
Calculating score 7 of 108
Calculating score 8 of 108
Calculating score 9 of 108
Calculating score 10 of 108
Calculating score 11 of 108
Calculating score 12 of 108
Calculating score 13 of 108
Calculating score 14 of 108
Calculating score 15 of 108
Calculating score 16 of 108
Calculating score 17 of 108
Calculating score 18 of 108
Calculating score 19 of 108
Calculating score 20 of 108
Calculating score 21 of 108
Calculating score 22 of 108
Calculating score 23 of 108
Calculating score 24 of 108
Calculating score 25 of 108
Calculating score 26 of 108
Calculating score 27 of 108
Calculating score 28 of 108
Calculating score 29 of 108
Calculating score 30 of 108
Calculating score 31 of 108
Calculating score 32 of 108
Calculating score 33 of 108
Calculating score 34 of 108
Calculating score 35 of 108
Calculating score 36 of 108
C

In [18]:
acc_table

,ACCURACY,VECTORIZER,CLASSIFIER,NGRAMS,STOPWORDS
70,0.942424,stemtfidf,PA,"(1, 2)","{do, couldn't, haven't, been, ourselves, mustn..."
52,0.941667,tfidf,PA,"(1, 2)",english
64,0.941667,tfidf,PA,"(1, 2)","{do, couldn't, haven't, been, ourselves, mustn..."
23,0.940909,stemtfidf,SVC,"(1, 1)",english
35,0.940152,stemtfidf,SVC,"(1, 1)","{do, couldn't, haven't, been, ourselves, mustn..."
40,0.940152,tfidf,PA,"(1, 2)",
76,0.939394,tfidf,PA,"(1, 3)",
11,0.939394,stemtfidf,SVC,"(1, 1)",
71,0.939394,stemtfidf,SVC,"(1, 2)","{do, couldn't, haven't, been, ourselves, mustn..."
106,0.939394,stemtfidf,PA,"(1, 3)","{do, couldn't, haven't, been, ourselves, mustn..."


We see that the best accuracy score (0.9424) is achieved with the Passive-Aggressive classifier combined with the stemmed Tfidf vectorizer, unigrams plus bigrams (ngram range (1, 2)) and the stopset list of stop words.

We will also try an SVM pipeline model to check its accuracy.

In [18]:
# Support vector machines without taking stop words out 
text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, max_iter=5, random_state=42)),])
text_clf_svm = text_clf_svm.fit(X_train, y_train)

# Performance
pred = text_clf_svm.predict(X_test)
score = metrics.accuracy_score(y_test, pred) 
print("accuracy:   %0.3f" % score)

accuracy:   0.925


In [19]:
# Support vector machines after taking stopwords out
text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, max_iter=5, random_state=42)),])
text_clf_svm = text_clf_svm.fit(X_train, y_train)

# Performance
pred = text_clf_svm.predict(X_test)
score = metrics.accuracy_score(y_test, pred) 
print("accuracy:   %0.3f" % score)

accuracy:   0.919


The SVM pipeline reaches an accuracy of 0.925 when stop words are kept and 0.919 when they are removed, both worse than the best result above.

Let's also try a Maximum Entropy model to check its performance.

In [20]:
# MaxEnt model

realnews = []
fakenews = []
for index, row in df.iterrows():
    if row["label"] == "REAL":
        realnews.append(row["all"])
    else:
        fakenews.append(row["all"])  

def word_split(data):    
    data_new = []
    for word in data:
        word_filter = [i.lower() for i in word.split()]
        data_new.append(word_filter)
    return data_new

def word_feats(words):    
    return dict([(word, True) for word in words])

def evaluate_classifier(featx):
    
    fakefeats = [(featx(f), 'fake') for f in word_split(fakenews)]
    realfeats = [(featx(f), 'real') for f in word_split(realnews)]
        
    fakecutoff = int(len(fakefeats)*3/4)
    realcutoff = int(len(realfeats)*3/4)
 
    trainfeats = fakefeats[:fakecutoff] + realfeats[:realcutoff]
    testfeats = fakefeats[fakecutoff:] + realfeats[realcutoff:]
    
    classifier = MaxentClassifier.train(trainfeats, 'GIS', trace=0, encoding=None, labels=None, gaussian_prior_sigma=0, max_iter = 1)
    
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    
    for i, (feats, label) in enumerate(testfeats):
                refsets[label].add(i)
                observed = classifier.classify(feats)
                testsets[observed].add(i)
 
    accuracy = nltk.classify.util.accuracy(classifier, testfeats)
    
    return accuracy

evaluate_classifier(word_feats)



0.615

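The accuracy is 0.615, much worse than any previous method. Note, however, that this model was trained for a single GIS iteration (max_iter = 1) and evaluated on its own 75/25 split, so the figure is not directly comparable with the others.

The best performance so far is the Passive-Aggressive classifier, with an accuracy of 0.9424.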
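
The features given to the Maximum Entropy classifier are simple word-presence flags. A minimal sketch with a hypothetical headline shows what one feature dictionary looks like:

```python
def word_feats(words):
    # Same idea as in the notebook: every distinct word becomes a feature set to True
    return {word: True for word in words}

# Hypothetical headline, lowercased and split on whitespace as word_split does
tokens = "Breaking news breaking again".lower().split()
feats = word_feats(tokens)
print(feats)
```

Repeated words collapse into a single key, so this representation records only whether a word occurs, not how often.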

We will now run a Grid Search combined with 3-fold Cross Validation in order to tune the parameters of our best model.
<br>
<br>
For the Grid Search, we will use a pipeline with the count vectorizer and the tfidf transformer (which is the same as using the tfidf vectorizer). Ideally we would use the stemmed version of the count vectorizer, but it takes a long time to compute, so we have decided to go for the "non-stemmed" version.
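
The equivalence between the two-step pipeline and the tfidf vectorizer can be checked on a few made-up sentences. This is only a sketch with default settings and hypothetical documents; it does not use the news data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical documents, just to compare the two routes
docs = [
    "the senate passed the bill",
    "the bill was fake news",
    "fake news spread fast",
]

# Route 1: raw counts followed by tf-idf weighting
two_step = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
])

# Route 2: the combined vectorizer with the same settings
one_step = TfidfVectorizer(ngram_range=(1, 2))

a = two_step.fit_transform(docs)
b = one_step.fit_transform(docs)

same = np.allclose(a.toarray(), b.toarray())
print(a.shape, b.shape, same)
```

Keeping the two steps separate is convenient for a grid search because the `vect__` and `tfidf__` parameters can then be tuned independently.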

In [20]:
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range = (1,2))),
    ('tfidf', TfidfTransformer()),
    ('clf', PassiveAggressiveClassifier())])


We have 11 parameters in total to add to the Grid Search. With the values listed below for each parameter (between 2 and 5) there are 38,880 combinations, that is 116,640 fits with 3-fold Cross Validation, which is impossible to run in a short time (approximately 162 hours at about 5 seconds per fit). Therefore, we have decided to run the Grid Search in four rounds of two or three parameters each, keeping the best values found in every round. This assumes that parameters tuned in different rounds do not interact strongly.
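
The size of the full search can be checked with a few lines of arithmetic. The per-parameter counts are read off the grids used in this notebook; the 5 seconds per fit is an assumption, chosen because it reproduces the 162-hour estimate.

```python
import math

# Number of values tried for each of the 11 parameters
n_values = {
    "vect__lowercase": 2,
    "vect__stop_words": 3,
    "vect__max_features": 5,
    "vect__strip_accents": 3,
    "tfidf__norm": 3,
    "tfidf__use_idf": 2,
    "tfidf__sublinear_tf": 2,
    "clf__fit_intercept": 2,
    "clf__max_iter": 3,
    "clf__random_state": 3,
    "clf__warm_start": 2,
}

candidates = math.prod(n_values.values())   # full Cartesian grid
fits = candidates * 3                       # 3-fold cross-validation
hours = fits * 5 / 3600                     # assumed 5 seconds per fit
print(candidates, fits, hours)

# Searching in four separate rounds needs far fewer fits
rounds = [(2, 3, 2), (3, 2, 3), (5, 2, 3), (3, 2)]
fits_in_rounds = sum(math.prod(r) * 3 for r in rounds)
print(fits_in_rounds)
```

The round-by-round total matches the 36, 54, 90 and 18 fits reported in the Grid Search logs.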

#### 1st round

In [21]:
parameters1 = {
    'vect__lowercase': (True, False),
    #'vect__stop_words': (stopset, 'english', None),
    #'vect__max_features': (None, 1000, 5000, 10000, 50000),
    #'vect__strip_accents': ('ascii', 'unicode', None),
    'tfidf__norm': ('l1', 'l2', None),
    #'tfidf__use_idf': (True, False),
    #'tfidf__sublinear_tf': (True, False),
    'clf__fit_intercept': (True, False),
    #'clf__max_iter': (10, 100, 1000),
    #'clf__random_state': (29, 850, 3866),
    #'clf__warm_start': (True, False),
}

if __name__ == "__main__":

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters1, cv=3, n_jobs=-1, verbose=50, return_train_score=False)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters1)
    t0 = time()
    grid_search.fit(X_train, y_train)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters1 = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters1.keys()):
        print("\t%s: %r" % (param_name, best_parameters1[param_name]))

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__fit_intercept': (True, False),
 'tfidf__norm': ('l1', 'l2', None),
 'vect__lowercase': (True, False)}
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   13.8s
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   15.2s
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed:   16.1s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   17.5s
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   27.4s
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed:   28.7s
[Parallel(n_jobs=-1)]: Done   7 tasks      | elapsed:   29.1s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   30.4s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   40.4s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   42.3s
[Parallel(n_jobs=-1)]: Done  11 tasks      | elapsed:   42.7s
[Parallel(n_jobs=-1)]: Done  12 tasks      | elapsed:   43.7s
[Parallel(n_j

#### 2nd round

In [22]:
stopset = set(stopwords.words('english')) - set(('over', 'under', 'below', 'more', 'most', 'no', 'not', 'only', 'such', 'few', 'so', 'too', 'very', 'just', 'any', 'once'))

parameters2 = {
    #'vect__lowercase': (True, False),
    'vect__stop_words': (stopset, 'english', None),
    #'vect__max_features': (None, 1000, 5000, 10000, 50000),
    #'vect__strip_accents': ('ascii', 'unicode', None),
    #'tfidf__norm': ('l1', 'l2', None),
    'tfidf__use_idf': (True, False),
    #'tfidf__sublinear_tf': (True, False),
    #'clf__fit_intercept': (True, False),
    'clf__max_iter': (10, 100, 1000),
    #'clf__random_state': (29, 850, 3866),
    #'clf__warm_start': (True, False),
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters2, cv=3, n_jobs=-1, verbose=50, return_train_score=False)
    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters2)
    t0 = time()
    grid_search.fit(X_train, y_train)
    print("done in %0.3fs" % (time() - t0))
    print()
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters2 = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters2.keys()):
        print("\t%s: %r" % (param_name, best_parameters2[param_name]))

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__max_iter': (10, 100, 1000),
 'tfidf__use_idf': (True, False),
 'vect__stop_words': ({'a',
                       'about',
                       'above',
                       'after',
                       'again',
                       'against',
                       'ain',
                       'all',
                       'am',
                       'an',
                       'and',
                       'are',
                       'aren',
                       "aren't",
                       'as',
                       'at',
                       'be',
                       'because',
                       'been',
                       'before',
                       'being',
                       'between',
                       'both',
                       'but',
                       'by',
                       'can',
                       'couldn',
                      

[Parallel(n_jobs=-1)]: Done  47 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done  49 out of  54 | elapsed:  3.9min remaining:   23.7s
[Parallel(n_jobs=-1)]: Done  51 out of  54 | elapsed:  4.1min remaining:   14.3s
[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:  4.4min finished
done in 277.770s

Best score: 0.923
Best parameters set:
	clf__max_iter: 100
	tfidf__use_idf: True
	vect__stop_words: None


#### 3rd round

In [23]:
parameters3 = {
    #'vect__lowercase': (True, False),
    #'vect__stop_words': (stopset, 'english', None),
    'vect__max_features': (None, 1000, 5000, 10000, 50000),
    #'vect__strip_accents': ('ascii', 'unicode', None),
    #'tfidf__norm': ('l1', 'l2', None),
    #'tfidf__use_idf': (True, False),
    'tfidf__sublinear_tf': (True, False),
    #'clf__fit_intercept': (True, False),
    #'clf__max_iter': (10, 100, 1000),
    'clf__random_state': (29, 850, 3866),
    #'clf__warm_start': (True, False),
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters3, cv=3, n_jobs=-1, verbose=50, return_train_score=False)
    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters3)
    t0 = time()
    grid_search.fit(X_train, y_train)
    print("done in %0.3fs" % (time() - t0))
    print()
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters3 = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters3.keys()):
        print("\t%s: %r" % (param_name, best_parameters3[param_name]))

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__random_state': (29, 850, 3866),
 'tfidf__sublinear_tf': (True, False),
 'vect__max_features': (None, 1000, 5000, 10000, 50000)}
Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   13.6s
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   14.7s
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed:   14.8s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   15.6s
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   24.3s
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed:   25.4s
[Parallel(n_jobs=-1)]: Done   7 tasks      | elapsed:   25.7s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   26.7s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   36.0s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   37.3s
[Parallel(n_jobs=-1)]: Done  11 tasks      | elapsed:   37.6s
[Parallel(n_jobs=-1)]: Done  12 tasks      | elap

#### 4th round

In [24]:
parameters4 = {
    #'vect__lowercase': (True, False),
    #'vect__stop_words': (stopset, 'english', None),
    #'vect__max_features': (None, 1000, 5000, 10000, 50000),
    'vect__strip_accents': ('ascii', 'unicode', None),
    #'tfidf__norm': ('l1', 'l2', None),
    #'tfidf__use_idf': (True, False),
    #'tfidf__sublinear_tf': (True, False),
    #'clf__fit_intercept': (True, False),
    #'clf__max_iter': (10, 100, 1000),
    #'clf__random_state': (29, 850, 3866),
    'clf__warm_start': (True, False),
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters4, cv=3, n_jobs=-1, verbose=50, return_train_score=False)
    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters4)
    t0 = time()
    grid_search.fit(X_train, y_train)
    print("done in %0.3fs" % (time() - t0))
    print()
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters4 = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters4.keys()):
        print("\t%s: %r" % (param_name, best_parameters4[param_name]))

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__warm_start': (True, False),
 'vect__strip_accents': ('ascii', 'unicode', None)}
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   13.9s
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   14.9s
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed:   15.9s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   18.7s
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   28.9s
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed:   29.0s
[Parallel(n_jobs=-1)]: Done   7 tasks      | elapsed:   30.1s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   31.4s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   41.4s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   41.7s
[Parallel(n_jobs=-1)]: Done  11 tasks      | elapsed:   42.8s
[Parallel(n_jobs=-1)]: Done  12 out of  18 | elapsed:   44.1s remaining:   22.0s
[Parallel(n_jobs=

Now we can print the values for all the parameters to use in the model.

In [25]:
for param_name in sorted(parameters1.keys()):
        print("\t%s: %r" % (param_name, best_parameters1[param_name]))
for param_name in sorted(parameters2.keys()):
        print("\t%s: %r" % (param_name, best_parameters2[param_name]))
for param_name in sorted(parameters3.keys()):
        print("\t%s: %r" % (param_name, best_parameters3[param_name]))
for param_name in sorted(parameters4.keys()):
        print("\t%s: %r" % (param_name, best_parameters4[param_name]))

	clf__fit_intercept: True
	tfidf__norm: 'l2'
	vect__lowercase: False
	clf__max_iter: 100
	tfidf__use_idf: True
	vect__stop_words: None
	clf__random_state: 3866
	tfidf__sublinear_tf: True
	vect__max_features: 50000
	clf__warm_start: False
	vect__strip_accents: None


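We can now use the stemmed version of the tfidf vectorizer, as we will run it only once. Note that a few settings in the cell below, such as max_df = 0.5, smooth_idf = False, average = True and warm_start = True, do not come from the Grid Search; in particular, the search selected warm_start = False.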

In [21]:
stemmed_tfidf_vect = StemmedTfidfVectorizer(stop_words = None, max_df = 0.5, min_df=0, ngram_range = (1,2), strip_accents=None, lowercase=False, max_features=50000, norm='l2', use_idf=True, smooth_idf=False, sublinear_tf=True)
train = stemmed_tfidf_vect.fit_transform(X_train) 
test = stemmed_tfidf_vect.transform(X_test)
clf = PassiveAggressiveClassifier(n_jobs=-1, C=1, fit_intercept=True, max_iter=100, tol=0.001, loss='hinge', warm_start=True, average=True, random_state=3866)
clf.fit(train, y_train)
pred = clf.predict(test)    
score = metrics.accuracy_score(y_test, pred)
score

0.9606060606060606

Our final accuracy score is **0.9606** (96.06%)
<br>
<br>
Now we train the model with the full training dataset.

In [42]:
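# pd.concat replaces Series.append, which is no longer available in recent pandas versions
full_train = pd.concat([X_train, X_test])
full_y_train = pd.concat([y_train, y_test])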
train = stemmed_tfidf_vect.fit_transform(full_train)
clf.fit(train, full_y_train)

PassiveAggressiveClassifier(C=1, average=True, class_weight=None,
              fit_intercept=True, loss='hinge', max_iter=100, n_iter=None,
              n_jobs=-1, random_state=3866, shuffle=True, tol=0.001,
              verbose=0, warm_start=True)

We apply the model to the test file to predict the labels.

In [43]:
df2 = pd.read_csv("fake_or_real_news_test.csv")
df2.shape

(2321, 3)

In [44]:
df2.head()

,ID,title,text
0,10498,September New Homes Sales Rise——-Back To 1992 ...,September New Homes Sales Rise Back To 1992 Le...
1,2439,Why The Obamacare Doomsday Cult Can't Admit It...,But when Congress debated and passed the Patie...
2,864,"Sanders, Cruz resist pressure after NY losses,...",The Bernie Sanders and Ted Cruz campaigns vowe...
3,4128,Surviving escaped prisoner likely fatigued and...,Police searching for the second of two escaped...
4,662,Clinton and Sanders neck and neck in Californi...,No matter who wins California's 475 delegates ...


In [45]:
df2 = df2.set_index("ID")
df2["all"] = df2["title"] + " " + df2["text"]
df2.head()

ID,title,text,all
10498,September New Homes Sales Rise——-Back To 1992 ...,September New Homes Sales Rise Back To 1992 Le...,September New Homes Sales Rise——-Back To 1992 ...
2439,Why The Obamacare Doomsday Cult Can't Admit It...,But when Congress debated and passed the Patie...,Why The Obamacare Doomsday Cult Can't Admit It...
864,"Sanders, Cruz resist pressure after NY losses,...",The Bernie Sanders and Ted Cruz campaigns vowe...,"Sanders, Cruz resist pressure after NY losses,..."
4128,Surviving escaped prisoner likely fatigued and...,Police searching for the second of two escaped...,Surviving escaped prisoner likely fatigued and...
662,Clinton and Sanders neck and neck in Californi...,No matter who wins California's 475 delegates ...,Clinton and Sanders neck and neck in Californi...


In [46]:
df2_news = df2['all']
test2 = stemmed_tfidf_vect.transform(df2_news)
pred = clf.predict(test2) 
pred

array(['FAKE', 'REAL', 'REAL', ..., 'FAKE', 'REAL', 'REAL'], dtype='<U4')

In [47]:
df2.insert(loc=3, column='label', value=pred)
df2.head()

ID,title,text,all,label
10498,September New Homes Sales Rise——-Back To 1992 ...,September New Homes Sales Rise Back To 1992 Le...,September New Homes Sales Rise——-Back To 1992 ...,FAKE
2439,Why The Obamacare Doomsday Cult Can't Admit It...,But when Congress debated and passed the Patie...,Why The Obamacare Doomsday Cult Can't Admit It...,REAL
864,"Sanders, Cruz resist pressure after NY losses,...",The Bernie Sanders and Ted Cruz campaigns vowe...,"Sanders, Cruz resist pressure after NY losses,...",REAL
4128,Surviving escaped prisoner likely fatigued and...,Police searching for the second of two escaped...,Surviving escaped prisoner likely fatigued and...,REAL
662,Clinton and Sanders neck and neck in Californi...,No matter who wins California's 475 delegates ...,Clinton and Sanders neck and neck in Californi...,REAL


In [48]:
df2.to_csv('check.csv', index=True)
df2 = df2.drop(['title', 'text', 'all'], axis=1)
df2.head()

ID,label
10498,FAKE
2439,REAL
864,REAL
4128,REAL
662,REAL


In [49]:
df2.to_csv('submission.csv', index=True)

## Conclusions

In order to build our text classification model (with a **0.9606 accuracy score**) we carried out the following steps: 
1. We cleaned the dataset (since some texts had been split and had shifted into the wrong columns).
<br>
2. We created our training and test set in a 67-33 proportion. 
<br>
3. We compared all the possible combinations of ngrams, stopwords, vectorizers (both count and tfidf vectorizers, with and without stemming, using the Snowball Stemmer) and classifiers (Naive Bayes, SVC, Passive-Aggressive) by their accuracies.
<br>
<br>
   As stopwords we used the 'english' list, but we also created another list (stopset) that leaves out some words that may help the classification.
<br>
<br>
   The best result came from the **Passive-Aggressive classifier with the stemmed Tfidf vectorizer, unigrams plus bigrams and the stopset list of stopwords**.
<br>
<br>
4. We also tried an SVM pipeline trained with SGD and a Maximum Entropy model, but both scored worse than the Passive-Aggressive classifier.
<br>
5. Finally, we applied a Grid Search method combined with Cross Validation (3-fold) in order to tune the parameters for our best model.