# Predicting fake news using text classification
Student project, Kari Lehtomaa , lehtomaa.kari@gmail.com
![Fakenews](fn.jpeg)

## Introduction
Today, fake news is widely used in political elections to influence people with false information. Social media makes it very easy to spread false information, and it is not always easy to tell what is true. Many of us do not bother to check information against other sources.

In this project we build a text classifier to detect fake news. The same methods can be used to classify and detect other kinds of false information.

We use a Kaggle dataset for this; it is a binary classification problem.
Data source: https://www.kaggle.com/jruvika/fake-news-detection

This project is also available on GitHub: https://github.com/kleh/ML_Fakenews_LogisticsRegression


## Problem Formulation
Our problem is a classification problem. The features in our data are URLs, Headline and Body.

Labels are 1 for true and 0 for fake. Including the Body text makes operations slower than using the headline alone, but the code below combines all three columns into a single text feature.
For the ML classifier we convert the text to numeric vectors, called a bag of words. To reduce the effect of differing text lengths, and to down-weight words that occur in most documents, we then convert the count vectors to tf-idf (term frequency-inverse document frequency) weights.
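
As a small illustration of the two representations, here is a sketch on two made-up sentences (the sentences are assumptions for the example, not rows from the dataset). It shows the raw counts per token and then checks that each tf-idf row has unit length, which is what puts long and short texts on the same scale.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two made-up documents, used only for illustration.
docs = ["fake news spreads fast", "real news spreads slowly slowly"]

# Bag of words: one column per distinct token, values are raw counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())

# tf-idf: re-weight the counts, then L2-normalize every row (the default).
tfidf = TfidfTransformer().fit_transform(counts)
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(tfidf.toarray().round(2))
print(row_norms)
```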

We classify using two ML methods:
- Multinomial Naive Bayes: this method is suited to word counts and in practice also works with tf-idf weights, see: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB
- SGDClassifier: this method fits regularized linear models with stochastic gradient descent (SGD) learning. The gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing learning rate. See: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier
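
To make the "one sample at a time" idea concrete, here is a minimal NumPy sketch of SGD on the hinge loss with an L2 penalty. It is not scikit-learn's implementation: the function name `sgd_hinge`, the step-size rule `1 / (1 + alpha * t)` and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def sgd_hinge(X, y, alpha=1e-3, epochs=20, seed=42):
    """Train a linear classifier with per-sample SGD on the hinge loss.

    X: (n_samples, n_features) array, y: labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (1.0 + alpha * t)      # assumed decreasing step size
            margin = y[i] * (X[i] @ w + b)
            w *= 1.0 - eta * alpha             # L2 penalty shrinks the weights
            if margin < 1:                     # hinge loss is active for this sample
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b

# Tiny linearly separable toy problem (made-up data).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = sgd_hinge(X, y)
pred = np.sign(X @ w + b)
print(pred)
```

Each update looks at a single sample, so the cost per step does not grow with the size of the dataset, which is why this family of methods suits large sparse text matrices.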

To assess the quality of our models we use the confusion matrix and a report of precision, recall and f1-score. For a given label, precision is the share of items predicted as that label that really have it, and recall is the share of items that really have the label that were predicted as such. The f1-score is the harmonic mean of precision and recall.
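
Written as formulas for one label treated as the positive class, with $TP$, $FP$ and $FN$ standing for the counts of true positives, false positives and false negatives (notation introduced here for convenience):

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$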


Start by loading some libraries and the data.

In [157]:
# https://www.kaggle.com/jruvika/fake-news-detection
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [158]:
df = pd.read_csv("fakenews.csv",sep=",")
df.head()

Unnamed: 0,URLs,Headline,Body,Label
0,http://www.bbc.com/news/world-us-canada-414191...,Four ways Bob Corker skewered Donald Trump,Image copyright Getty Images\nOn Sunday mornin...,1
1,https://www.reuters.com/article/us-filmfestiva...,Linklater's war veteran comedy speaks to moder...,"LONDON (Reuters) - “Last Flag Flying”, a comed...",1
2,https://www.nytimes.com/2017/10/09/us/politics...,Trump’s Fight With Corker Jeopardizes His Legi...,The feud broke into public view last week when...,1
3,https://www.reuters.com/article/us-mexico-oil-...,Egypt's Cheiron wins tie-up with Pemex for Mex...,MEXICO CITY (Reuters) - Egypt’s Cheiron Holdin...,1
4,http://www.cnn.com/videos/cnnmoney/2017/10/08/...,Jason Aldean opens 'SNL' with Vegas tribute,"Country singer Jason Aldean, who was performin...",1


For the model we need URLs, Headline and Body as the features and Label as the label (y).

### Build dataset
Combine the text columns into a new column and drop the original columns.

In [159]:
# Work on a copy so that adding a column does not trigger SettingWithCopyWarning
dataset = df[['URLs', 'Headline', 'Body', 'Label']].copy()
dataset['text'] = dataset['URLs'] + " " + dataset['Headline'] + " " + dataset['Body']
# Keep only the combined text and the label
dataset = dataset.drop(['URLs', 'Headline', 'Body'], axis=1)
dataset.head()

Unnamed: 0,Label,text
0,1,http://www.bbc.com/news/world-us-canada-414191...
1,1,https://www.reuters.com/article/us-filmfestiva...
2,1,https://www.nytimes.com/2017/10/09/us/politics...
3,1,https://www.reuters.com/article/us-mexico-oil-...
4,1,http://www.cnn.com/videos/cnnmoney/2017/10/08/...
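
One thing to watch for when concatenating text columns like this: if any of the columns has a missing value in a row, the combined string for that row is missing too. The sketch below uses two made-up rows (not from the dataset) to show the effect and one possible way around it, filling missing values with an empty string first. Whether that matters for this dataset is not checked here.

```python
import pandas as pd

# Made-up rows; the second one has no Body text.
toy = pd.DataFrame({
    "Headline": ["Title one", "Title two"],
    "Body": ["Some body text", None],
})

# Plain concatenation propagates the missing value.
combined = toy["Headline"] + " " + toy["Body"]
print(combined.isna().tolist())

# Filling missing values first keeps the headline.
filled = toy["Headline"].fillna("") + " " + toy["Body"].fillna("")
print(filled.tolist())
```

Note that passing a missing value through `str()` later on turns it into the literal text `nan`, so such rows would not raise an error but would lose their headline and URL.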


Some statistics about our data: there are 2120 fake news items and 1868 true news items.

In [160]:
dataset.groupby('Label').describe()

Unnamed: 0_level_0,text,text,text,text
Unnamed: 0_level_1,count,unique,top,freq
Label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
0,2120,2120,http://dailybuzzlive.com/256-years-old-man-rev...,1
1,1868,1868,http://www.cnn.com/2017/10/09/us/las-vegas-sho...,1


## Method


Load the needed libraries and transform the features for the ML methods.

Add a helper function to remove unneeded characters from the text.

Then the following operations are done:

1. Use the helper function to remove newlines and punctuation from the text
2. Split the data into train and test datasets
3. Apply CountVectorizer to the train and test feature data, which builds sparse matrices containing the tokenized words and their counts. The vocabulary is fitted on the train data only. See: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
4. Apply TfidfTransformer to the train and test feature data. This re-weights the counts so that tokens occurring in many documents get a lower weight (inverse document frequency).
See: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
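
Steps 3 and 4 can also be done in one object: scikit-learn's `TfidfVectorizer` combines `CountVectorizer` and `TfidfTransformer`. This notebook uses the two-step version; the sketch below, on made-up sentences, only checks that both routes give the same matrices with default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Made-up texts for illustration.
train_docs = ["breaking news about the election", "city council approves new budget"]
test_docs = ["election budget news"]

# Two-step version: fit on the training texts only,
# then reuse the fitted objects for the test texts.
cv = CountVectorizer()
tt = TfidfTransformer()
train_two_step = tt.fit_transform(cv.fit_transform(train_docs))
test_two_step = tt.transform(cv.transform(test_docs))

# One-step version with default settings.
tv = TfidfVectorizer()
train_one_step = tv.fit_transform(train_docs)
test_one_step = tv.transform(test_docs)

same = (np.allclose(train_two_step.toarray(), train_one_step.toarray())
        and np.allclose(test_two_step.toarray(), test_one_step.toarray()))
print(same)
```

In both versions the test texts are only transformed, never fitted, so no information from the test set enters the vocabulary or the idf weights.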


In [161]:
from sklearn.feature_extraction.text import CountVectorizer
import string

In [162]:
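def removeNotNeededChars(mess):
    """Replace newlines with spaces and strip ASCII punctuation."""
    mess = str(mess)  # non-string values (e.g. NaN) become text
    mess = mess.replace("\n"," ")
    xx = [x for x in mess if x not in string.punctuation]
    xx = ''.join(xx)
    return xx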

In [163]:
dataset['text'] = dataset['text'].apply(removeNotNeededChars)
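
To see what this cleaning does in practice, here is a small sketch with an equivalent helper written with `str.translate` (the name `strip_punctuation` and the sample strings are made up for the example). Two side effects are worth knowing: `string.punctuation` covers only ASCII characters, so typographic quotes survive, and a URL collapses into one long token.

```python
import string

def strip_punctuation(text):
    """Same effect as the helper above, written with str.translate."""
    text = str(text).replace("\n", " ")
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "LONDON (Reuters) - “Last Flag Flying”,\na comedy..."
cleaned = strip_punctuation(sample)
print(cleaned)

url = "http://www.bbc.com/news"
cleaned_url = strip_punctuation(url)
print(cleaned_url)
```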

### Split to train and test/validation sets

In [164]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    dataset['text'], dataset['Label'], test_size=0.3, random_state=1)

In [165]:
countsv = CountVectorizer()
X_traincv = countsv.fit_transform(X_train)
X_testcv = countsv.transform(X_test)

Vocabulary size (number of distinct tokens in the training texts):

In [166]:
print(len(countsv.vocabulary_))

49139


In [167]:
from sklearn.feature_extraction.text import TfidfTransformer

In [168]:
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(X_traincv)
X_train_tfidf = tfidf_transformer.transform(X_traincv)
X_test_tfidf = tfidf_transformer.transform(X_testcv)

The X matrices are now sparse matrices containing mostly zeros. In each row, only the columns for words that occur in that text have a non-zero value.
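
How sparse such a matrix is can be measured from its `nnz` attribute, the number of stored non-zero entries. The sketch below does this on three made-up sentences; the same two lines work on `X_train_tfidf`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up texts for illustration.
docs = [
    "fake news spreads fast",
    "real news spreads slowly",
    "markets rise on budget news",
]
X = CountVectorizer().fit_transform(docs)

n_rows, n_cols = X.shape
stored = X.nnz                          # explicitly stored (non-zero) entries
density = stored / (n_rows * n_cols)    # share of cells that are non-zero
print(X.shape, stored, round(density, 3))
```

With a vocabulary of tens of thousands of tokens and texts of a few hundred words, the density is far lower than in this toy case, which is why the sparse format matters.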

In [169]:
X_train_tfidf.shape

(2806, 49139)

## Results


Fit our models and get the results.

### First model: Multinomial Naive Bayes (MultinomialNB), which can be used with sparse matrices

In [170]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)
bayes_predy = clf.predict(X_test_tfidf)

In [171]:
np.mean(bayes_predy == y_test)

0.9260182876142976

In [172]:
metrics.confusion_matrix(y_test, bayes_predy)

array([[579,  58],
       [ 31, 535]])

In [173]:
print(metrics.classification_report(y_test, bayes_predy))

              precision    recall  f1-score   support

           0       0.95      0.91      0.93       637
           1       0.90      0.95      0.92       566

    accuracy                           0.93      1203
   macro avg       0.93      0.93      0.93      1203
weighted avg       0.93      0.93      0.93      1203



For the fake class (label 0), a precision of 0.95 tells us that 95% of the articles predicted as fake really are fake, and a recall of 0.91 tells us that 91% of the fake articles were detected as fake. The f1-score is the harmonic mean of precision and recall.
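
These figures can be recomputed by hand from the confusion matrix above, treating label 0 (fake) as the positive class. In scikit-learn's confusion matrix the rows are the true labels and the columns are the predicted labels.

```python
import numpy as np

# Confusion matrix reported above for Naive Bayes.
cm = np.array([[579, 58],
               [31, 535]])

tp = cm[0, 0]   # fake articles predicted as fake
fn = cm[0, 1]   # fake articles predicted as true
fp = cm[1, 0]   # true articles predicted as fake

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()
print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```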

### Second model: SGDClassifier
In this case we build a Pipeline to make the operations easier. The Pipeline runs CountVectorizer, TfidfTransformer and SGDClassifier, so we pass in the split data and do the transformations and classification in one call.

Hyperparameters for SGDClassifier:
- loss is the default, hinge (which gives a linear SVM)
- alpha = 1e-3: the higher the value, the stronger the regularization
- random_state = 42: fixed seed so that repeated runs give the same result
- max_iter = 20: keep the number of iterations low

In [174]:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

mytestpl = Pipeline([
    ('cv', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('cgf', SGDClassifier(loss='hinge', alpha=1e-3, random_state=42, max_iter=20)),
])

With the pipeline we pass in the raw text, which has not been count-vectorized or tf-idf transformed. Those operations are done inside the pipeline.
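
A practical benefit is that a fitted pipeline can classify new raw strings directly. The sketch below shows this on a handful of made-up training texts; the texts, the labels and the name `toy_pipeline` are assumptions for illustration, and the model trained on them says nothing about real news.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Made-up training texts and labels (1 = true, 0 = fake).
texts = [
    "reuters reports central bank holds rates",
    "officials confirm budget vote on monday",
    "miracle cure doctors do not want you to know",
    "shocking secret celebrity aliens revealed",
]
labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ("cv", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", alpha=1e-3, random_state=42, max_iter=20)),
])
toy_pipeline.fit(texts, labels)

# New raw text goes straight in; the pipeline vectorizes it with the
# vocabulary and idf weights learned during fit.
new_texts = ["central bank confirm rates vote"]
prediction = toy_pipeline.predict(new_texts)
print(prediction)
```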

In [175]:
mytestpl.fit(X_train, y_train)
sgd_predy = mytestpl.predict(X_test)

In [176]:
metrics.confusion_matrix(y_test, sgd_predy)

array([[617,  20],
       [ 15, 551]])

In [177]:
print(metrics.classification_report(y_test, sgd_predy))

              precision    recall  f1-score   support

           0       0.98      0.97      0.97       637
           1       0.96      0.97      0.97       566

    accuracy                           0.97      1203
   macro avg       0.97      0.97      0.97      1203
weighted avg       0.97      0.97      0.97      1203



### Use GridSearchCV to search for optimal hyperparameters
Load the library and list the hyperparameters and the values to test. We reuse the previous pipeline.

In [178]:
from sklearn.model_selection import GridSearchCV
# Keys are '<pipeline step name>__<parameter name>'
parameters = {
    'cv__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'cgf__alpha': (1e-2, 1e-3),
    'cgf__max_iter': (20, 40, 60),
}

In [179]:
gs_cgf = GridSearchCV(mytestpl, parameters, cv=5, n_jobs=-1)

Fit the grid search to find good hyperparameters. This can take a few minutes.
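
The run time follows from the size of the grid: every combination of values is trained once per cross-validation fold. A quick way to count the combinations is scikit-learn's `ParameterGrid`, sketched here with the same grid written out again so the block runs on its own.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    "cv__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "cgf__alpha": [1e-2, 1e-3],
    "cgf__max_iter": [20, 40, 60],
}

n_candidates = len(ParameterGrid(parameters))   # 2 * 2 * 2 * 3 combinations
n_folds = 5                                     # cv=5 in GridSearchCV
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)
```

With the default `refit=True`, GridSearchCV then fits the best combination once more on the whole training set.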

In [180]:
gs_cgf = gs_cgf.fit(X_train, y_train)

After the run, we get the best parameters for our pipeline:

In [181]:
gs_cgf.best_params_

{'cgf__alpha': 0.001,
 'cgf__max_iter': 20,
 'cv__ngram_range': (1, 2),
 'tfidf__use_idf': True}

Predict and get the classification report with these parameters:

In [184]:
sgd_gs_predy = gs_cgf.predict(X_test)

In [185]:
print(metrics.classification_report(y_test, sgd_gs_predy))

              precision    recall  f1-score   support

           0       0.98      0.95      0.97       637
           1       0.95      0.98      0.96       566

    accuracy                           0.97      1203
   macro avg       0.96      0.97      0.97      1203
weighted avg       0.97      0.97      0.97      1203



## Discussion

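Naive Bayes gave fairly good results, with about 93% accuracy on the test set. SGDClassifier did better, at about 97%, and the grid search left the test-set accuracy at about 97%.


It may be that the data is too easy: at least the true news contains a lot of "CITY (Reuters)", which can make classification easy.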
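
One way to check a suspicion like this is to measure how often a marker string occurs in each class. The sketch below does so on a few made-up rows standing in for the dataset (the rows and the variable names are assumptions); on the real data it would have to run on the text before punctuation is removed, because the cleaning step strips the parentheses.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (1 = true, 0 = fake).
toy = pd.DataFrame({
    "text": [
        "MEXICO CITY (Reuters) - Oil firm wins contract",
        "LONDON (Reuters) - New film opens festival",
        "Country singer opens show with tribute",
        "Man claims to be 256 years old",
        "You will not believe this secret",
    ],
    "Label": [1, 1, 1, 0, 0],
})

marker = "(Reuters)"
has_marker = toy["text"].str.contains(marker, regex=False)

# Share of documents per label that contain the marker.
share_by_label = has_marker.groupby(toy["Label"]).mean()
print(share_by_label)
```

If a marker appears in a large share of one class and almost never in the other, the classifier can lean on it instead of on the content, and the test scores will overstate how well it would do on news from other sources.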

## Conclusion

With fairly simple code, it is possible to classify texts quite precisely. The features could be prepared better using NLTK: https://www.nltk.org/