# Deliverable 4



## Introduction

The goal of this project is to classify a news article as either "true" or "fake". The input is a body of text.

* [Libraries and Data](#scrollTo=YjJ80zE8LlPl&line=1&uniqifier=1)
* [Data Pre-processing](#scrollTo=NWpFfa61ri-Q&line=3&uniqifier=1)
* [Naive Bayes Modelling](#scrollTo=T8JraQaO8FEq&line=1&uniqifier=1)
* [Random test](#scrollTo=DDoymd_YN6FC&line=1&uniqifier=1)


## 1. Libraries and Data





### 1.1 Importing Libraries


In [None]:
import csv
import random
import pandas as pd
import re
from sklearn.metrics import accuracy_score

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
from nltk import sent_tokenize, word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline


lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### 1.2 Downloading dataset CSVs

In [None]:
#fake csv
!wget https://raw.githubusercontent.com/peterghrong/fake_news_detection/master/dataset/Fake.csv
#true csv
!wget https://raw.githubusercontent.com/peterghrong/fake_news_detection/master/dataset/True.csv

--2020-11-29 22:44:47--  https://raw.githubusercontent.com/peterghrong/fake_news_detection/master/dataset/Fake.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 62789876 (60M) [text/plain]
Saving to: ‘Fake.csv’


2020-11-29 22:44:49 (97.1 MB/s) - ‘Fake.csv’ saved [62789876/62789876]

--2020-11-29 22:44:49--  https://raw.githubusercontent.com/peterghrong/fake_news_detection/master/dataset/True.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53582940 (51M) [text/plain]
Saving to: ‘True.csv’


2020-11-29 22:44:50 (75.0 MB/s) - ‘True.csv’

In [None]:
# sanity check
fake_csv = pd.read_csv('Fake.csv')
true_csv = pd.read_csv('True.csv')

In [None]:
fake_csv.head(5)

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [None]:
true_csv.head(5)

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


## 2. Data Pre-processing




Removing the "Reuters" tag and its location prefix is much faster if we write the cleaned rows straight to a new CSV.

### 2.1 Removing "Reuters" from the True.csv

In [None]:
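# newline='' lets the csv module handle line endings inside quoted fields
with open("TrueClean.csv", mode='w', newline='') as write_file:
  writer = csv.writer(write_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
  with open("True.csv", newline='') as read_file:
    csv_reader = csv.reader(read_file, delimiter=',')
    for row in csv_reader:
      # row = [title, text, subject, date]; strip the "CITY (Reuters) - " prefix from the text
      writer.writerow([row[0], re.sub(r'(.*?)\(Reuters\) - ',"",row[1],count=1) , row[2], row[3]])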
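
To see what the substitution does on a single article, here is a minimal sketch that applies the same pattern to one string. The sample sentence and the helper name `strip_dateline` are made up for illustration.

```python
import re

# Same pattern as the cleaning step: drop everything up to the first "(Reuters) - ".
DATELINE = re.compile(r'(.*?)\(Reuters\) - ')

def strip_dateline(text):
    """Remove a leading 'CITY (Reuters) - ' dateline, if there is one."""
    return DATELINE.sub("", text, count=1)

# Hypothetical sample text, modelled on the rows shown earlier.
sample = "WASHINGTON (Reuters) - The head of a conservative Republican faction spoke."
stripped = strip_dateline(sample)
unchanged = strip_dateline("No dateline in this sentence.")
print(stripped)
```

Text without the tag passes through unchanged, which is why the header row survives the rewrite.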

In [None]:
cleaned_true = pd.read_csv('TrueClean.csv')
cleaned_fake = pd.read_csv('Fake.csv')

We add the binary class label to each DataFrame here: true news is labelled 1 and fake news is labelled 0.

In [None]:
cleaned_true["class"] = 1
cleaned_fake["class"] = 0

In [None]:
cleaned_true.head(5)


Unnamed: 0,title,text,subject,date,class
0,"As U.S. budget fight looms, Republicans flip t...",The head of a conservative Republican faction ...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,Transgender people will be allowed for the fir...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,The special counsel investigation of links bet...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,Trump campaign adviser George Papadopoulos tol...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,President Donald Trump called on the U.S. Post...,politicsNews,"December 29, 2017",1


In [None]:
cleaned_fake.head(5)

Unnamed: 0,title,text,subject,date,class
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data = pd.concat([cleaned_true, cleaned_fake], ignore_index=True)

In [None]:
data.head(5)

Unnamed: 0,title,text,subject,date,class
0,"As U.S. budget fight looms, Republicans flip t...",The head of a conservative Republican faction ...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,Transgender people will be allowed for the fir...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,The special counsel investigation of links bet...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,Trump campaign adviser George Papadopoulos tol...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,President Donald Trump called on the U.S. Post...,politicsNews,"December 29, 2017",1


Some articles have an effectively empty text body (only one character long), so we remove those rows.

In [None]:
index = 0
collection = []
for sentence in data['text']:
  if len(sentence) == 1: 
    collection.append(index)  
  index+=1

In [None]:
# Drop all collected positions in one call; dropping one at a time would shift the later positions
data = data.drop(data.index[collection])
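
Row positions collected in a first pass refer to the original ordering, so removing rows one at a time by position can hit the wrong rows once earlier rows are gone. The toy example below uses a plain Python list as a hypothetical stand-in for the `text` column to illustrate this.

```python
# Hypothetical stand-in for the 'text' column; " " marks an empty body.
texts = ["first article", " ", "second article", " ", "third article"]

# Positions of the blank entries, valid only against the original order.
blank_positions = [i for i, t in enumerate(texts) if len(t) == 1]

# Deleting one position at a time shifts everything after it.
one_by_one = list(texts)
for i in blank_positions:
    del one_by_one[i]

# Filtering in a single pass keeps every non-blank entry.
blank_set = set(blank_positions)
single_pass = [t for i, t in enumerate(texts) if i not in blank_set]

print(one_by_one)
print(single_pass)
```

In this sketch the one-at-a-time version loses a real article and keeps a blank one, while the single-pass version removes exactly the blanks.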

In [None]:
data.shape

(44271, 5)

By prepending the article title to the text body, the model can also use the title, which should help accuracy.

In [None]:
data['text'] = data['title'] + " " + data["text"]
data = data.drop(["title", "subject", "date"],axis=1)
data.head(5)

Unnamed: 0,text,class
0,"As U.S. budget fight looms, Republicans flip t...",1
1,U.S. military to accept transgender recruits o...,1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,1
3,FBI Russia probe helped by Australian diplomat...,1
4,Trump wants Postal Service to charge 'much mor...,1


### 2.2 Implementing Sentence Tokenisation

We tokenise each article sentence by sentence, removing stopwords, punctuation and single-character words, which carry little meaning. Most of these tasks are straightforward with existing libraries, especially NLTK.

Note: Lemmatizing is computationally expensive, so tokenising the whole dataset can take up to 10 minutes.

In [None]:
y = data["class"].values
X = []
stop_words = set(stopwords.words("english"))
tokeniser = nltk.tokenize.RegexpTokenizer(r'\w+')


def sentenece_tokenisation(par): 
  tmp = []
  sentences = nltk.sent_tokenize(par)
  for sent in sentences:
    sent = sent.lower()
    tokens = tokeniser.tokenize(sent)
    filtered_words = [w.strip() for w in tokens if w not in stop_words and len(w) > 1]
    tmp.extend(filtered_words)
    tmp = [lemmatizer.lemmatize(j) for j in tmp]
  X.append(tmp)

def input_tokenisation(par):
  tmp = []
  if "(Reuters) - " in par:
    # Keep everything after the first tag, even if the article contains " - " again later
    par = par.split("(Reuters) - ", 1)[1]
  sentences = nltk.sent_tokenize(par)
  for sent in sentences:
    sent = sent.lower()
    tokens = tokeniser.tokenize(sent)
    filtered_words = [w.strip() for w in tokens if w not in stop_words and len(w) > 1]
    tmp.extend(filtered_words)
    tmp = [lemmatizer.lemmatize(j) for j in tmp]
  tmp = ','.join(tmp)
  return tmp

In [None]:
for par in data["text"]:
  sentenece_tokenisation(par)

In [None]:
for i in range(len(X)):
  X[i] = ','.join(X[i])
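
To show the shape of the strings that end up in `X`, here is a simplified, standard-library-only sketch of the same steps. It assumes a tiny made-up stop list instead of NLTK's and skips lemmatisation, so it illustrates the idea rather than reproducing the functions above.

```python
import re

# Tiny hypothetical stop list; the notebook uses NLTK's English stopwords.
STOP_WORDS = {"the", "a", "of", "to", "is", "on", "in", "and"}

def simple_tokenise(paragraph):
    """Lowercase, split into word tokens, drop stopwords and one-character tokens."""
    tokens = re.findall(r"\w+", paragraph.lower())
    kept = [w for w in tokens if w not in STOP_WORDS and len(w) > 1]
    return ",".join(kept)

# Hypothetical input sentence.
joined = simple_tokenise("The head of a faction urged budget restraint on Friday.")
print(joined)
```

Joining with commas is safe for the later vectorising step because a comma is not a word character, so the default `token_pattern` splits the string back into the same tokens.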

### 2.3 Train Validation Test Split
We split the data into training, validation and test sets of roughly 80%, 10% and 10% respectively, using scikit-learn's `train_test_split`.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.11, random_state=1) # 0.11 x 0.9 ≈ 0.1
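
The arithmetic behind the two-step split can be checked by hand. This sketch assumes the held-out size of each split is rounded up, which reproduces the support totals in the reports further down (4383 validation rows and 4428 test rows); `held_out_size` is a hypothetical helper.

```python
import math

def held_out_size(n_samples, fraction):
    """Rows held out for a fractional split, assuming the size is rounded up."""
    return math.ceil(fraction * n_samples)

n_total = 44271                                 # rows left after cleaning
n_test = held_out_size(n_total, 0.1)            # first split
n_val = held_out_size(n_total - n_test, 0.11)   # second split, on the remainder
n_train = n_total - n_test - n_val

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%})")
```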

## 3. Model

### 3.1 Modelling
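We vectorise the training set with `CountVectorizer`, which builds a vocabulary of the unique terms in the training set and counts how often each one occurs in every document. We then apply a TF-IDF transform, which rescales those counts into normalised term-frequency weights and can optionally down-weight terms that appear in many documents.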
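
As a rough illustration of these two steps, the sketch below computes bag-of-words counts and TF-IDF weights by hand for three made-up documents. It assumes the smoothed IDF formula that scikit-learn documents, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation; it is not the library's implementation.

```python
import math
from collections import Counter

# Three hypothetical comma-joined documents, as produced by the tokenisation step.
docs = ["trump,budget,vote", "trump,tweet,tweet", "budget,senate,vote"]
tokenised = [d.split(",") for d in docs]
vocab = sorted({w for doc in tokenised for w in doc})

# Bag of words: one count per vocabulary term per document.
counts = [[Counter(doc)[w] for w in vocab] for doc in tokenised]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalise(row):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

tf_only = [l2_normalise(row) for row in counts]  # corresponds to use_idf=False
tfidf = [l2_normalise([c * w for c, w in zip(row, idf)]) for row in counts]

print(vocab)
print(counts)
```

With `use_idf=False`, as in the final pipeline below, only the normalised counts in `tf_only` are used.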

In [None]:
# bow_transformer = CountVectorizer().fit(X_train)
# train_bow = bow_transformer.transform(X_train)
# tfidf_transformer = TfidfTransformer().fit(train_bow)
# train_tfidf = tfidf_transformer.transform(train_bow)

In [None]:
# def calculate_tfidf(data_set):
#   bow_result = bow_transformer.transform(data_set)
#   calculated_tfidf = tfidf_transformer.transform(bow_result)
#   return calculated_tfidf

### 3.2 Use Gridsearch to find the best parameters for the Multinomial algorithm

We then put everything into a scikit-learn `Pipeline` object for convenience.

In [None]:
# pipeline = Pipeline([
#     ('bow', CountVectorizer()),  
#     ('tfidf', TfidfTransformer()),  
#     ('classifier', MultinomialNB()),  
# ])
# pipeline.fit(X_train,y_train)

# parameters = {
#     'tfidf__use_idf': (True, False),
#     'bow__ngram_range': [(1, 1), (1, 2), (1, 3)],
#     'classifier__alpha': (1e-2, 1e-3)
# }

# grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
# grid_search_tune.fit(X_train, y_train)

# print("Best parameters set:")
# print(grid_search_tune.best_estimator_.steps)
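
To get a feel for the cost of this search, the sketch below enumerates the same parameter grid with the standard library and counts the cross-validation fits. It only counts combinations and does not train anything.

```python
from itertools import product

# Same grid as in the commented-out search above.
parameters = {
    'tfidf__use_idf': (True, False),
    'bow__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'classifier__alpha': (1e-2, 1e-3),
}
cv_folds = 2

# Every combination of one value per parameter.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]
total_fits = len(candidates) * cv_folds

print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

Each fit with `ngram_range=(1, 3)` builds a much larger vocabulary than the unigram case, which is likely why the search is left commented out.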

### 3.3 Building the final pipeline

We use the best parameters found by the grid search above for the final pipeline.

In [None]:
best_pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer='word', binary=False,
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 3), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)),  
    ('tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=False)),  
    ('classifier', MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True)),  
])
best_pipeline.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('bow',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 3), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=False)),
                ('classifier',
                 MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True))],
         verbose=False)

In [None]:
prediction = best_pipeline.predict(X_test)
print(classification_report(y_test, prediction))  # signature is (y_true, y_pred)

### 3.4 Use the function below to classify a specific sentence or news article

In [None]:
def pred_res(aString):
  foo = input_tokenisation(aString)
  foo = [foo]
  res = best_pipeline.predict(foo)[0]
  if res == 0:
    return "This news is fake"
  else:
    return "This news is real"
  

In [None]:
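# classification_report expects (y_true, y_pred). The reports recorded below were printed with the
# two arguments reversed, so their precision/recall columns are swapped and support counts predicted labels.
print(classification_report(y_val, best_pipeline.predict(X_val)))
print(classification_report(y_test, best_pipeline.predict(X_test)))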

              precision    recall  f1-score   support

           0       0.97      0.98      0.98      2207
           1       0.98      0.97      0.97      2176

    accuracy                           0.97      4383
   macro avg       0.97      0.97      0.97      4383
weighted avg       0.97      0.97      0.97      4383

              precision    recall  f1-score   support

           0       0.98      0.98      0.98      2347
           1       0.97      0.98      0.98      2081

    accuracy                           0.98      4428
   macro avg       0.98      0.98      0.98      4428
weighted avg       0.98      0.98      0.98      4428



In [None]:
confusion_matrix(y_test, best_pipeline.predict(X_test))

array([[2293,   45],
       [  54, 2036]])
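
The per-class metrics can be recomputed directly from this matrix, where rows are true labels and columns are predicted labels. The sketch below uses plain Python; `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
# Confusion matrix from the test set: rows are true labels, columns are predictions.
matrix = [[2293, 45],
          [54, 2036]]

def per_class_metrics(matrix, label):
    """Precision, recall and F1 for one label of a square confusion matrix."""
    true_pos = matrix[label][label]
    predicted = sum(row[label] for row in matrix)   # column sum
    actual = sum(matrix[label])                     # row sum
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total

for label in (0, 1):
    p, r, f = per_class_metrics(matrix, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```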

## 4. A random test

### 4.1 Random test on an onion article
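I copied a satirical article from The Onion (theonion.com), which I imagined would be difficult to label as true or fake. As the output below shows, the model labels this first article as real, so satire written in a news register can still slip through.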

In [None]:
print(pred_res("Bucking centuries of precedent with a decision not to participate in the peaceful transfer of his authority, Donald Trump Jr. has refused to step down from his post as the president’s oldest son, sources confirmed Wednesday. “I, Don Trump Jr.—or, I should now say, Don Biden Jr.—will continue to fulfill my duties as the president’s eldest male offspring regardless of any attempts to unseat me or render my position illegitimate,” the 42-year-old real estate heir told reporters, saying he looked forward to promoting president-elect Joe Biden’s personal brand and to spending holidays with the first family, during which he hoped to bond with his “new siblings” Hunter and Ashley. “In this tumultuous time, the nation needs continuity, and through my proven experience as the president’s first male issue, I can provide that. Let me be clear: I am the commander-in-chief’s very special boy, and I will continue in this role for four more years. I’ve already filed a lawsuit to halt the installation of any other person in my position and, if necessary, will pursue this matter all the way to the Supreme Court.” At press time, the U.S. Supreme Court had issued a summary judgment declaring that the last person to claim the post of president’s oldest son had never officially stepped down, and thus the role still rightfully belonged to George W. Bush."))

This news is real


In [None]:
print(pred_res("Additional intelligence briefings “would be useful,” Biden added, but “we don’t see anything slowing us down.” The measured comments come as Biden prepares to confront dueling national crises that actively threaten the health, safety and economic security of millions of Americans irrespective of the political debate. Coronavirus infections, hospitalizations and deaths are surging, the economy faces the prospect of long-term damage, and the nation’s political and cultural divides may be worsening. Biden is betting that his low-key approach and bipartisan outreach — a sharp reversal from the current president’s style — will help him govern effectively on Day One. But just 71 days before he will be inaugurated, Trump and his allies seemed determined to make Biden’s transition as difficult as possible. From his Twitter account on Tuesday, Trump again raised unsupported claims of “massive ballot counting abuse” and predicted he would ultimately win the race he has already lost. His allies on Capitol Hill, led by Senate Majority Leader Mitch McConnell, have encouraged the president’s baseless accusations. Trump’s tweets were swiftly flagged by the social media network as disputed claims about election fraud."))

This news is real


In [None]:
print(pred_res("Most recently, in response to Trump’s withdrawal from the 2015 nuclear agreement and the administration’s maximum pressure campaign of withering sanctions, Iran has carried out a series of provocations with Suleimani’s fingerprints all over them—including threats to U.S. troops in Iraq. According Chairman of the Joint Chiefs of Staff Mark Milley, Iranian-backed groups have engaged in a sustained campaign of rocket attacks targeting U.S. facilities in Iraq since October 2019. But until Dec. 27, none had drawn American blood. Once one did, the United States swiftly retaliated, launching strikes against Kataib Hezbollah targets on both sides of the Iraq-Syria border. That, in turn, prompted Shiite militia leaders to mobilize a siege of the U.S. Embassy in Baghdad, raising the specter of a Benghazi-like scenario. This was the context for Trump’s decision to kill Suleimani. But whatever underlying sense of justice Americans may feel now that a terrorist mastermind is dead, it should not obscure the very real prospect that his assassination could set in motion events that spiral out of control in ways that put Americans and U.S. interests in deeper danger. Two previous U.S. administrations decided against a direct shot against Suleimani out of concern, widely shared by the Pentagon and the intelligence community, that all-out escalation would likely follow. As recently as this past spring, the Department of Defense warned the White House against designating the IRGC as a foreign terrorist organization, arguing that it could put the lives of U.S. personnel in Iraq and elsewhere in the region at risk. (Trump did so anyway.) And in June, then-Chairman of the Joint Chiefs of Staff Joseph Dunford helped talk Trump out of retaliating on Iranian soil for the Tehran’s downing of a U.S. drone. This cautious tradition has now been overturned."))

This news is real


In [None]:
print(pred_res("WASHINGTON—Following news that the Democratic nominee had officially cleared 270 electoral votes, The Lincoln Project super PAC immediately released a series of ads Friday calling for Joe Biden to be impeached. “It’s time for America to heal, and we can’t do it with this maniac in office,” said the commercial’s narrator, which aired on TV channels across the country alongside an animated ad depicting a mustachioed Biden rapping under the name “Joe Stalin.” “In the 2020 election, we voted for Biden. Now, we understand that was a grave mistake, and it’s our job to turn things around. Joseph Biden must go. The president-elect is out of control. He’s a risk to American democracy, and he’s certainly no conservative.” At press time, The Lincoln Project had unleashed a new campaign tearing into itself for such appalling hypocrisy."))

This news is fake


In [None]:
print(pred_res("Students at New Residence Hall threw a 600 person orgy in celebration of the news that they had achieved herd immunity from chlamydia, sources confirmed on Wednesday. This breakthrough came after McGill modeled its groundbreaking approach to dealing with chlamydia off Stockholm University. Early last week, New Residence Hall’s chlamydia levels were so high that the hall was placed under an orange alert, limiting orgies to six people instead of ten. The residence also implemented a new condom mandate, forcing students to wear condoms at all orgies. This mandate was met by a large anti-condom protest, where hundreds of itchy students gathered to defend their right to get sick and spread the plague. However, not all students have felt the effect of this epidemic. Among the unaffected students were the residents of La Citadelle, being totally clueless as to how STDs were even transmitted. French students were also among the fortunate to not get the STD, as students stringently continued their century-long social distancing campaign from them. Unfortunately for them, they were all checked into the ICU later with a grave respiratory illness: lung cancer. Our reporters reached out to the prestigious McGill Administration to inquire about the situation. When asked about the health and safety of students, they replied \"The important thing is that we can charge them $10,000 a year to live here.\" In other Rez news, a kid from Solin Hall showed early symptoms of the common flu. In a quick clean-up effort initiated by McGill, the building was levelled in a drone strike. Everyone else continued about their lives as normal."))

This news is fake
