## Fake News and Real News Classification
**Kan Zhou and Minhui Ma**

### 1. Import Packages and Load Data

In [1]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
import re
import string
from nltk.corpus import stopwords

from bs4 import BeautifulSoup

In [2]:
# load data
real_data = pd.read_csv('True.csv')
fake_data = pd.read_csv('Fake.csv')

In [3]:
real_data.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [4]:
fake_data.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


The source of the data set is https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset. The data source consists of a real news data set and a fake news data set. Each data set has the title, text, subject and the published date for each news. Since the fake news and the real news are separated in two data tables, we should merge the two tables and give each record a label.

In [5]:
# add column "target"
# 1 for real news, 0 for fake news
real_data['target'] = 1
fake_data['target'] = 0 

# merge the 2 datasets
data = pd.concat([real_data, fake_data], ignore_index=True, sort=False)
data.head()

Unnamed: 0,title,text,subject,date,target
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


### 2. Data Cleaning

#### 2.1. Data Information

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   target   44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.7+ MB


In [7]:
print("The number of real news is %d" %len(real_data))
print("The number of fake news is %d" %len(fake_data))

The number of real news is 21417
The number of fake news is 23481


The dataset has 5 features and 44898 observations with no missing values. 4 of the features are object and 1 feature is integer. 21417 of the observations are real news and 23481 of the observations are fake news.

#### 2.2. Remove URL's, Punctuations, and Stopwords

Since the news are grabbed online, the texts are mess and contain html contents. We can use the BeautifulSoup package to help us clean the html contents. Also, we can remove the English stopwords using the NLTK package.

In [8]:
stop = set(stopwords.words('english'))

# Removing html contents
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()
# Removing the square brackets
def remove_between_square_brackets(text):
    return re.sub('\[[^]]*\]', '', text)
# Removing URL's
def remove_between_square_brackets(text):
    return re.sub(r'http\S+', '', text)
# Removing the punctuations
def remove_punctuations(text):
    return re.sub(r'[^\w\s]', '', text)
# Removing the stopwords from text
def remove_stopwords(text):
    final_text = []
    for i in text.split():
        if i.strip().lower() not in stop:
            final_text.append(i.strip())
    return " ".join(final_text)
# Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    text = remove_punctuations(text)
    text = remove_stopwords(text)
    return text
# Apply function on review column
data['text'] = data['text'].apply(denoise_text)





In [9]:
data.head()

Unnamed: 0,title,text,subject,date,target
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON Reuters head conservative Republica...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON Reuters Transgender people allowed ...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON Reuters special counsel investigati...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON Reuters Trump campaign adviser Geor...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLEWASHINGTON Reuters President Donald Tru...,politicsNews,"December 29, 2017",1


#### 2.3. Merge all the text data into one column

The "subject" feature describe the subject of the news. The data source has 8 different subjects of news. 

In [10]:
data.subject.value_counts()

politicsNews       11272
worldnews          10145
News                9050
politics            6841
left-news           4459
Government News     1570
US_News              783
Middle-east          778
Name: subject, dtype: int64

Title, text, and subject are stored as 3 features in the data set. We can merge these 3 columns into one. Also, since the publish date of the news is unrelated to the purpose of the project, we can directly drop the column.

In [11]:
data['text'] = data['text'] + " " + data['title'] + " " + data['subject']
del data['title']
del data['subject']
del data['date']

In [12]:
data.head()

Unnamed: 0,text,target
0,WASHINGTON Reuters head conservative Republica...,1
1,WASHINGTON Reuters Transgender people allowed ...,1
2,WASHINGTON Reuters special counsel investigati...,1
3,WASHINGTON Reuters Trump campaign adviser Geor...,1
4,SEATTLEWASHINGTON Reuters President Donald Tru...,1


#### 2.4. Lemmatization

In [13]:
data['text'][0]

'WASHINGTON Reuters head conservative Republican faction US Congress voted month huge expansion national debt pay tax cuts called fiscal conservative Sunday urged budget restraint 2018 keeping sharp pivot way among Republicans US Representative Mark Meadows speaking CBS Face Nation drew hard line federal spending lawmakers bracing battle January return holidays Wednesday lawmakers begin trying pass federal budget fight likely linked issues immigration policy even November congressional election campaigns approach Republicans seek keep control Congress President Donald Trump Republicans want big budget increase military spending Democrats also want proportional increases nondefense discretionary spending programs support education scientific research infrastructure public health environmental protection Trump administration already willing say going increase nondefense discretionary spending 7 percent Meadows chairman small influential House Freedom Caucus said program Democrats saying 

The news text still contains numbers. Also, some letters are uppercases. We can remove the numbers from texts and convert all letters to lowercases.

In [14]:
def lowercase_number(text):
    text = text.lower()
    return re.sub(r'[0-9]+', '', text)

data['text'] = data['text'].apply(lowercase_number)
data['text'][0]

'washington reuters head conservative republican faction us congress voted month huge expansion national debt pay tax cuts called fiscal conservative sunday urged budget restraint  keeping sharp pivot way among republicans us representative mark meadows speaking cbs face nation drew hard line federal spending lawmakers bracing battle january return holidays wednesday lawmakers begin trying pass federal budget fight likely linked issues immigration policy even november congressional election campaigns approach republicans seek keep control congress president donald trump republicans want big budget increase military spending democrats also want proportional increases nondefense discretionary spending programs support education scientific research infrastructure public health environmental protection trump administration already willing say going increase nondefense discretionary spending  percent meadows chairman small influential house freedom caucus said program democrats saying thats

Do lemmatization for the texts.

In [15]:
lemma = nltk.WordNetLemmatizer()

def lemmatization(text):
    final_text = []
    for i in text.split():
            final_text.append(lemma.lemmatize(i))
    return " ".join(final_text)

data['text'] = data['text'].apply(lemmatization)
data.head()

Unnamed: 0,text,target
0,washington reuters head conservative republica...,1
1,washington reuters transgender people allowed ...,1
2,washington reuters special counsel investigati...,1
3,washington reuters trump campaign adviser geor...,1
4,seattlewashington reuters president donald tru...,1


### 3. Data Visualization

### 4. Classification

#### 4.1. CountVectorizer

In [23]:
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [17]:
# Tokenization
data['text'] = data.text.apply(word_tokenize)
data.head()

Unnamed: 0,text,target
0,"[washington, reuters, head, conservative, repu...",1
1,"[washington, reuters, transgender, people, all...",1
2,"[washington, reuters, special, counsel, invest...",1
3,"[washington, reuters, trump, campaign, adviser...",1
4,"[seattlewashington, reuters, president, donald...",1


In [None]:
# CountVectorizer
corpus = []
for i in range(len(data)):
    corpus.append(' '.join(data['text'][i]))

cv = CountVectorizer(max_features=5000, ngram_range=(1,3), min_df = 1)
X = cv.fit_transform(corpus).toarray()

In [26]:
# Train test split
x_train, x_test, y_train, y_test = train_test_split(X, data['target'], random_state=15)

#### 4.2. TF-IDF

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text 

In [33]:
vectorizer = TfidfVectorizer(min_df = 1, ngram_range = (1,3))
tfidf = vectorizer.fit_transform(corpus).toarray()

MemoryError: Unable to allocate 3.70 TiB for an array with shape (44898, 11329023) and data type float64

In [None]:
# Train test split
x_train_t, x_test_t, y_train_t, y_test_t = train_test_split(tfidf, data['target'], random_state=15)

In [32]:
tfidf

<bound method _cs_matrix.toarray of <44898x11329023 sparse matrix of type '<class 'numpy.float64'>'
	with 28114669 stored elements in Compressed Sparse Row format>>