# Predicting fake news with text classification
Student project, Kari Lehtomaa, lehtomaa.kari@gmail.com
![Fakenews](fn.jpeg)

## Introduction
Today, fake news is widely used in political elections to influence people with false information. Social media makes it very easy to spread false information, and it is often hard to tell what is true and what is not. Many of us don't bother to check information against other sources.

In this project we build a text classifier to detect fake news. The same methods can be used to classify and detect other kinds of false information.

We use a Kaggle dataset for this; the task is a binary classification problem.
Data source: https://www.kaggle.com/jruvika/fake-news-detection

This project is also available on GitHub: https://github.com/kleh/ML_Fakenews_LogisticsRegression


## Problem Formulation
In our data the features are text, and the label is 1 for true and 0 for fake.
For the ML classifiers we convert the text to numeric vectors of word counts, called a bag of words. To reduce the effect of differing text lengths, we also convert the count vectors to normalized term frequencies weighted by inverse document frequency (TF-IDF).
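
To make the bag-of-words idea concrete, here is a minimal sketch on two made-up sentences (the toy documents are assumptions, not part of the Kaggle data). It counts words with CountVectorizer and then applies TfidfTransformer with its default settings, which scale every document vector to unit length so that long and short texts become comparable.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two toy documents of different "intensity" (hypothetical examples)
docs = ["fake news spreads fast", "news news news"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)  # sparse matrix of raw word counts

# Vocabulary in column order (alphabetical): fake, fast, news, spreads
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(vocab)
print(counts.toarray())

# Default TfidfTransformer: idf weighting followed by L2 normalisation
tfidf = TfidfTransformer().fit_transform(counts)
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)  # every row has (approximately) unit length
```

The raw counts differ a lot between the two documents, but after the transform both rows have the same length, which is what removes the text-length effect.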

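We classify using two ML methods:
- Multinomial Naive Bayes (MultinomialNB): a probabilistic classifier that assumes the word features are conditionally independent given the class.
- SGDClassifier: a linear classifier trained with stochastic gradient descent; with the hinge loss used here it is a linear support vector machine.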

To assess the quality of our models we use the confusion matrix and a report of precision, recall and F1-score.

A similar task is covered in the scikit-learn tutorial: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Let's start by importing some libraries and loading the data.

In [8]:
# https://www.kaggle.com/jruvika/fake-news-detection
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [9]:
df = pd.read_csv("fakenews.csv",sep=",")
df.head()

| | URLs | Headline | Body | Label |
|---|---|---|---|---|
| 0 | http://www.bbc.com/news/world-us-canada-414191... | Four ways Bob Corker skewered Donald Trump | Image copyright Getty Images\nOn Sunday mornin... | 1 |
| 1 | https://www.reuters.com/article/us-filmfestiva... | Linklater's war veteran comedy speaks to moder... | LONDON (Reuters) - “Last Flag Flying”, a comed... | 1 |
| 2 | https://www.nytimes.com/2017/10/09/us/politics... | Trump’s Fight With Corker Jeopardizes His Legi... | The feud broke into public view last week when... | 1 |
| 3 | https://www.reuters.com/article/us-mexico-oil-... | Egypt's Cheiron wins tie-up with Pemex for Mex... | MEXICO CITY (Reuters) - Egypt’s Cheiron Holdin... | 1 |
| 4 | http://www.cnn.com/videos/cnnmoney/2017/10/08/... | Jason Aldean opens 'SNL' with Vegas tribute | Country singer Jason Aldean, who was performin... | 1 |


For the model we need Headline and Body as the features and Label as the target (y).

In [10]:
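# .copy() gives an independent DataFrame, so adding a column later
# does not trigger pandas' SettingWithCopyWarning
dataset = df[['Headline', 'Body', 'Label']].copy()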

In [11]:
dataset.groupby('Label').describe()

| Label | Headline count | Headline unique | Headline top | Headline freq | Body count | Body unique | Body top | Body freq |
|---|---|---|---|---|---|---|---|---|
| 0 | 2137 | 1226 | 10/5 TRS-PNC Park: Bucs Win in '71, '79; Lose ... | 6 | 2120 | 1193 | A Potato Battery Can Light up a Room for Over ... | 143 |
| 1 | 1872 | 1605 | World Cup 2018: Who needs what to qualify for ... | 5 | 1868 | 1670 | Chat with us in Facebook Messenger. Find out w... | 61 |


## Method
The text features are first transformed into a numeric form that the ML methods can use.

We load the needed libraries and add a helper function that removes unneeded characters from the text.

Then the following operations are done:

1. Combine Headline and Body
2. Use the helper function to remove unneeded characters from the text
3. Split the data into Train and Test datasets
4. Apply CountVectorizer to the Train and Test feature data, which builds sparse matrices containing the words and their counts in the data. The vocabulary is learned from the Train data only.
5. Apply TfidfTransformer to the Train and Test feature data, which turns the word counts into normalized TF-IDF weights (term frequencies scaled by inverse document frequency)
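
Steps 4 and 5 fit the transformers on the Train data and only apply them to the Test data. The following minimal sketch (with made-up sentences, which are assumptions and not taken from the dataset) illustrates one consequence: a word that never appears in training is simply ignored when the test text is transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data; "budget" does not occur in the training documents
train_docs = ["the senate passed the bill", "aliens built the senate"]
test_docs = ["aliens passed the budget"]

cv = CountVectorizer()
train_counts = cv.fit_transform(train_docs)  # learns the vocabulary
test_counts = cv.transform(test_docs)        # reuses it, learns nothing new

print(sorted(cv.vocabulary_))   # vocabulary learned from train_docs only
print(train_counts.shape, test_counts.shape)
print(test_counts.toarray())    # "budget" contributes no count
```

Both matrices have the same number of columns, which is what lets a classifier trained on the Train matrix score the Test matrix.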

In [12]:
# for extracting features
from sklearn.feature_extraction.text import CountVectorizer
import string

In [13]:
dataset['text'] = dataset['Headline'] + " " + dataset['Body']
dataset = dataset.drop(['Headline', 'Body'], axis=1)

In [14]:
def removeNotNeededChars(mess):
    mess = str(mess)
    mess = mess.replace("\n"," ")
    xx = [x for x in mess if x not in string.punctuation]
    xx = ''.join(xx)
    return xx

In [15]:
dataset['text'] = dataset['text'].apply(removeNotNeededChars)
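
As a quick check of what the helper does, here is a small self-contained sketch. The function remove_punctuation is a hypothetical stand-in that should behave like removeNotNeededChars but uses str.translate; the sample strings are made up. It also shows two edge cases that may matter for this data: string.punctuation covers ASCII characters only, so typographic quotes survive, and a missing value is turned into the literal text "nan" by str().

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    """Hypothetical equivalent of removeNotNeededChars using str.translate."""
    text = str(text).replace("\n", " ")
    return text.translate(_PUNCT_TABLE)

sample = "Trump’s fight,\n“Last Flag Flying”!"
cleaned = remove_punctuation(sample)
print(cleaned)  # ASCII ',' and '!' are gone, curly quotes remain

missing = remove_punctuation(float("nan"))
print(repr(missing))  # a missing value becomes the string 'nan'
```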

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    dataset['text'], dataset['Label'], test_size=0.3, random_state=1)

In [17]:
countsv = CountVectorizer()
X_traincv = countsv.fit_transform(X_train)
X_testcv = countsv.transform(X_test)

In [18]:
from sklearn.feature_extraction.text import TfidfTransformer

In [19]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_traincv)
X_test_tfidf = tfidf_transformer.transform(X_testcv)

## Results
Both models are trained on the Train dataset and evaluated on the Test dataset (30% of the data, 1203 articles).

### Our first model: Multinomial Naive Bayes (MultinomialNB), which works with sparse matrices

In [29]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
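clf = MultinomialNB().fit(X_train_tfidf, y_train)
# Predict on the TF-IDF test features; bayes_predy is used in the cells below
bayes_predy = clf.predict(X_test_tfidf)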

In [30]:
np.mean(bayes_predy == y_test)

0.9301745635910225

In [31]:
metrics.confusion_matrix(y_test, bayes_predy)

array([[585,  52],
       [ 32, 534]])

In [32]:
print(metrics.classification_report(y_test, bayes_predy))

              precision    recall  f1-score   support

           0       0.95      0.92      0.93       637
           1       0.91      0.94      0.93       566

    accuracy                           0.93      1203
   macro avg       0.93      0.93      0.93      1203
weighted avg       0.93      0.93      0.93      1203
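
The figures in the report can be reproduced by hand from the confusion matrix above, where rows are true labels and columns are predicted labels. The sketch below is only an arithmetic check using the numbers printed for the Naive Bayes model; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the Naive Bayes model (rows: true, columns: predicted)
cm = np.array([[585, 52],
               [32, 534]])

correct = np.diag(cm)                 # correctly classified per class
precision = correct / cm.sum(axis=0)  # correct / all predicted as that class
recall = correct / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(round(accuracy, 4))
```

Rounded to two decimals these values match the precision, recall, F1-score and accuracy columns of the report.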



### The second model: SGDClassifier
Here we build a Pipeline to make the operations easier. The Pipeline runs CountVectorizer, TfidfTransformer and SGDClassifier, so we pass in the split text data and do the transformations and the classification in one call.

In [33]:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

mytestpl = Pipeline([
    ('cv', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('cgf', SGDClassifier(loss='hinge', alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

In [34]:
mytestpl.fit(X_train, y_train)
sgd_predy = mytestpl.predict(X_test)
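
To illustrate what the Pipeline does internally, the following sketch compares a pipeline with the same three steps run by hand. The tiny training set and its labels are made up for illustration (assumptions, not rows from the Kaggle data), so the predictions themselves mean nothing; the point is only that both routes give identical output.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = true, 0 = fake
docs = ["senate passes budget bill", "court rules on election law",
        "central bank raises rates", "aliens secretly run the senate",
        "miracle potato cures everything", "moon landing was staged"]
labels = [1, 1, 1, 0, 0, 0]
new_docs = ["bank raises budget", "aliens staged the election"]

def make_sgd():
    # Same hyperparameters as in the project pipeline
    return SGDClassifier(loss="hinge", alpha=1e-3, random_state=42,
                         max_iter=5, tol=None)

# Route 1: everything in one Pipeline
pipe = Pipeline([("cv", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", make_sgd())])
pipe_pred = pipe.fit(docs, labels).predict(new_docs)

# Route 2: the same steps by hand (fit on train, transform on new text)
cv, tfidf, clf = CountVectorizer(), TfidfTransformer(), make_sgd()
X_train = tfidf.fit_transform(cv.fit_transform(docs))
clf.fit(X_train, labels)
manual_pred = clf.predict(tfidf.transform(cv.transform(new_docs)))

print(pipe_pred, manual_pred)
```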

In [35]:
metrics.confusion_matrix(y_test, sgd_predy)

array([[622,  15],
       [ 19, 547]])

In [36]:
print(metrics.classification_report(y_test, sgd_predy))

              precision    recall  f1-score   support

           0       0.97      0.98      0.97       637
           1       0.97      0.97      0.97       566

    accuracy                           0.97      1203
   macro avg       0.97      0.97      0.97      1203
weighted avg       0.97      0.97      0.97      1203



### Evaluate the results

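A grid search with 5-fold cross-validation over a few pipeline hyperparameters checks whether tuning improves on the SGDClassifier pipeline above.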

In [30]:
# Parameter tuning
from sklearn.model_selection import GridSearchCV
parameters = {
'cv__ngram_range': [(1, 1), (1, 2)],
'tfidf__use_idf': (True, False),
'cgf__alpha': (1e-2, 1e-3),
}
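
The keys in the parameter grid follow the pattern step name, double underscore, parameter name. As a hedged sanity check, the sketch below rebuilds a pipeline with the same step names (using a default SGDClassifier, which is an assumption made for brevity), verifies that every key is a valid pipeline parameter, and counts how many models a 5-fold search would fit.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

# Pipeline with the same step names as in this project
pipe = Pipeline([("cv", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("cgf", SGDClassifier())])

parameters = {
    "cv__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "cgf__alpha": (1e-2, 1e-3),
}

# Every grid key must appear among the pipeline's parameters
valid_names = pipe.get_params()
unknown = [name for name in parameters if name not in valid_names]

n_candidates = len(ParameterGrid(parameters))  # 2 * 2 * 2 combinations
n_fits = n_candidates * 5                      # one fit per CV fold

print(unknown, n_candidates, n_fits)
```

A misspelled step or parameter name would show up in the unknown list here, before any time is spent on fitting.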

In [31]:
gs_cgf = GridSearchCV(mytestpl, parameters, cv=5, n_jobs=-1)

In [32]:
gs_cgf = gs_cgf.fit(X_train, y_train)

In [33]:
gs_cgf.best_params_

{'cgf__alpha': 0.001, 'cv__ngram_range': (1, 2), 'tfidf__use_idf': True}

In [34]:
sgd_gs_predy = gs_cgf.predict(X_test)

In [35]:
print(metrics.classification_report(y_test, sgd_gs_predy))

              precision    recall  f1-score   support

           0       0.98      0.96      0.97       637
           1       0.96      0.98      0.97       566

    accuracy                           0.97      1203
   macro avg       0.97      0.97      0.97      1203
weighted avg       0.97      0.97      0.97      1203



## Conclusion