## Section 3: Fake News and Naive Bayes (20 pts)

For this section, I am going to apply Naive Bayes to a data set used in a recent [Kaggle competition](https://www.kaggle.com/competitions/fake-news/data). The goal of this project is to build a classifier that can predict whether a news story is fake or true.

The dataset includes two files: (1) a training data file with labels that includes an id, author, title, and text, and (2) a test data file with the same features as the training file but no label.

In [78]:
# Importing all necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage, cut_tree
import seaborn as sns
import re, string #import packages for regex replacement
from nltk.tokenize import TweetTokenizer # import tokenizer from nltk
import nltk # download the list of stopwords from nltk if you have not done this before
from nltk.corpus import stopwords # import stopwords
stopeng = set(stopwords.words('english')) #set language
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import *
from nltk.stem.porter import *
from nltk.stem.snowball import SnowballStemmer
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

In [79]:
# Read in the training data
train_df = pd.read_csv('train.csv', dtype='str')
train_df.head(10)

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1
5,5,Jackie Mason: Hollywood Would Love Trump if He...,Daniel Nussbaum,"In these trying times, Jackie Mason is the Voi...",0
6,6,Life: Life Of Luxury: Elton John’s 6 Favorite ...,,Ever wonder how Britain’s most iconic pop pian...,1
7,7,Benoît Hamon Wins French Socialist Party’s Pre...,Alissa J. Rubin,"PARIS — France chose an idealistic, traditi...",0
8,8,Excerpts From a Draft Script for Donald Trump’...,,Donald J. Trump is scheduled to make a highly ...,0
9,9,"A Back-Channel Plan for Ukraine and Russia, Co...",Megan Twohey and Scott Shane,A week before Michael T. Flynn resigned as nat...,0


#### For the above problem, I think the most relevant features are author, title, and text. The author could be a good predictor because an author (or source) may be known for publishing untrue stories. Additionally, the title and text features could help the model pick up words that are commonly used in untrue stories. The model below uses only the title and author.

In [80]:
# Extract features

print(train_df.isnull().sum())
#train_df['title'].fillna('',inplace=True)
#train_df['author'].fillna('',inplace=True)
train_df = train_df.dropna(axis=0)
print(train_df.isnull().sum())

train_df = train_df[['title', 'author', 'label']]
train_df.head(10)

id           0
title      558
author    1957
text        39
label        0
dtype: int64
id        0
title     0
author    0
text      0
label     0
dtype: int64


Unnamed: 0,title,author,label
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,0
2,Why the Truth Might Get You Fired,Consortiumnews.com,1
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,1
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,1
5,Jackie Mason: Hollywood Would Love Trump if He...,Daniel Nussbaum,0
7,Benoît Hamon Wins French Socialist Party’s Pre...,Alissa J. Rubin,0
9,"A Back-Channel Plan for Ukraine and Russia, Co...",Megan Twohey and Scott Shane,0
10,Obama’s Organizing for Action Partners with So...,Aaron Klein,0
11,"BBC Comedy Sketch ""Real Housewives of ISIS"" Ca...",Chris Tomlinson,0


In [81]:
# Combining title and author to create one input feature for the model 

train_df_updated = pd.DataFrame()
train_df_updated['feature'] = train_df['title'] + " " + train_df['author']
train_df_updated['label'] = train_df['label']

train_df_updated.head(10)

Unnamed: 0,feature,label
0,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",0
2,Why the Truth Might Get You Fired Consortiumne...,1
3,15 Civilians Killed In Single US Airstrike Hav...,1
4,Iranian woman jailed for fictional unpublished...,1
5,Jackie Mason: Hollywood Would Love Trump if He...,0
7,Benoît Hamon Wins French Socialist Party’s Pre...,0
9,"A Back-Channel Plan for Ukraine and Russia, Co...",0
10,Obama’s Organizing for Action Partners with So...,0
11,"BBC Comedy Sketch ""Real Housewives of ISIS"" Ca...",0


In [82]:
# Define a function that lowercases the text and strips punctuation

def clean_text_round(row):
    row = row.lower()
    # keep only word characters and whitespace; this already removes
    # apostrophes, commas, hyphens, question marks, periods, quotes, and '#'
    row = re.sub(r"[^\w\s]", '', row)
    row = re.sub(r"\n", '', row) # remove newline characters
    return row

clean = lambda x: clean_text_round(x)

In [83]:
# apply the function above across each row of the feature column

train_df_updated.loc[:, 'feature'] = train_df_updated['feature'].apply(clean)
train_df_updated.head()

Unnamed: 0,feature,label
0,house dem aide we didnt even see comeys letter...,1
1,flynn hillary clinton big woman on campus bre...,0
2,why the truth might get you fired consortiumne...,1
3,15 civilians killed in single us airstrike hav...,1
4,iranian woman jailed for fictional unpublished...,1


In [84]:
# import tokenizer from nltk
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer() 

# define a function that we can apply over our data

def tweet_tokenize(row):
    row = tweet_tokenizer.tokenize(row)
    return row

tokenized = lambda x: tweet_tokenize(x)

In [85]:
# apply the tweet_tokenize function on the train data

train_df_updated.loc[:, 'feature'] = train_df_updated['feature'].apply(tokenized)
train_df_updated.head()

Unnamed: 0,feature,label
0,"[house, dem, aide, we, didnt, even, see, comey...",1
1,"[flynn, hillary, clinton, big, woman, on, camp...",0
2,"[why, the, truth, might, get, you, fired, cons...",1
3,"[15, civilians, killed, in, single, us, airstr...",1
4,"[iranian, woman, jailed, for, fictional, unpub...",1


In [86]:
import nltk
#nltk.download('stopwords') # download the list of stopwords from nltk if you have not done this before
from nltk.corpus import stopwords # import stopwords

stopeng = set(stopwords.words('english')) #set language

#define a function to remove stopwords
def remove_stopwords(row):
    row = [w for w in row if w not in stopeng]
    return row

no_stopwords = lambda x: remove_stopwords(x)

In [87]:
# apply function to remove stopwords on the train data

train_df_updated.loc[:, 'feature'] = train_df_updated['feature'].apply(no_stopwords)
train_df_updated.head()

Unnamed: 0,feature,label
0,"[house, dem, aide, didnt, even, see, comeys, l...",1
1,"[flynn, hillary, clinton, big, woman, campus, ...",0
2,"[truth, might, get, fired, consortiumnewscom]",1
3,"[15, civilians, killed, single, us, airstrike,...",1
4,"[iranian, woman, jailed, fictional, unpublishe...",1


In [88]:
#nltk.download('wordnet') # you may need to run these depending on your setup
#nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()

# define function to lemmatize
def lemmatize(row):
    row = [lmtzr.lemmatize(token) for token in row]
    row = ' '.join(row) # re-join the tokens into a single string per document
    return row

lemmatized = lambda x: lemmatize(x)

In [89]:
#nltk.download('omw-1.4')

# apply the lemmatization function on the train data

train_df_updated.loc[:, 'feature'] = train_df_updated['feature'].apply(lemmatized)
train_df_updated.head()

Unnamed: 0,feature,label
0,house dem aide didnt even see comeys letter ja...,1
1,flynn hillary clinton big woman campus breitba...,0
2,truth might get fired consortiumnewscom,1
3,15 civilian killed single u airstrike identifi...,1
4,iranian woman jailed fictional unpublished sto...,1


In [90]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stemming(row):
    row = row.split()
    row = [stemmer.stem(token) for token in row]
    row = ' '.join(row) # re-join the stemmed tokens into a single string per document
    return row

stemmed = lambda x: stemming(x)

In [91]:
# apply stemming function on the train data

train_df_updated.loc[:, 'feature'] = train_df_updated['feature'].apply(stemmed)
train_df_updated.head()

Unnamed: 0,feature,label
0,hous dem aid didnt even see comey letter jason...,1
1,flynn hillari clinton big woman campu breitbar...,0
2,truth might get fire consortiumnewscom,1
3,15 civilian kill singl u airstrik identifi jes...,1
4,iranian woman jail fiction unpublish stori wom...,1


In [92]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split # Import train_test_split function

# Create a CountVectorizer object to extract features from the text data
vectorizer = CountVectorizer()

# Extract features from the text data and the labels
X_train = train_df_updated['feature'].values
y_train = train_df_updated['label'].values

print(X_train.shape)

print(y_train.shape)

# fit the vectorizer on the training text and build the document-term count matrix

X_train_counts = vectorizer.fit_transform(X_train)

print(X_train_counts.shape)


(18285,)
(18285,)
(18285, 18207)


In [93]:
# Run the NB classifier using cross validation


#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import KFold, cross_val_score

#Create a Naive Bayes Gaussian Classifier
gnb = GaussianNB()

#Train the model using the training sets
gnb.fit(X_train_counts.toarray(), y_train)

#Predict the response for test dataset generated
#y_pred = gnb.predict(X_test_counts.toarray())

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

# Evaluate the classifier using 10-fold cross-validation
scores = cross_val_score(gnb, X_train_counts.toarray(), y_train, cv=k_fold)

# Report the accuracy
print('Accuracy:', scores.mean()*100)


Accuracy: 91.48475270173104


In [94]:
# Report Accuracy
print('Accuracy:', scores.mean()*100)

Accuracy: 91.48475270173104


In [95]:
# Run the classifier on the test data.

# Load test data
test_df = pd.read_csv('test.csv', dtype='str')
test_df.head(10)

Unnamed: 0,id,title,author,text
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning..."
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different..."
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...
5,20805,Trump is USA's antique hero. Clinton will be n...,,Trump is USA's antique hero. Clinton will be n...
6,20806,Pelosi Calls for FBI Investigation to Find Out...,Pam Key,"Sunday on NBC’s “Meet the Press,” House Minori..."
7,20807,Weekly Featured Profile – Randy Shannon,Trevor Loudon,You are here: Home / *Articles of the Bound* /...
8,20808,Urban Population Booms Will Make Climate Chang...,,Urban Population Booms Will Make Climate Chang...
9,20809,,cognitive dissident,don't we have the receipt?


## Extract Features

In [96]:
# Delete rows with NA values and select only relevant columns

print(test_df.isnull().sum())

test_df = test_df.dropna(axis=0)
print(test_df.isnull().sum())

test_df = test_df[['title', 'author']]
test_df.head(10)

id          0
title     122
author    503
text        7
dtype: int64
id        0
title     0
author    0
text      0
dtype: int64


Unnamed: 0,title,author
0,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld
2,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams
3,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor
4,Keiser Report: Meme Wars (E995),Truth Broadcast Network
6,Pelosi Calls for FBI Investigation to Find Out...,Pam Key
7,Weekly Featured Profile – Randy Shannon,Trevor Loudon
10,184 U.S. generals and admirals endorse Trump f...,Dr. Eowyn
11,“Working Class Hero” by John Brennon,Doug Diamond
12,The Rise of Mandatory Vaccinations Means the E...,Shaun Bradley
13,Communists Terrorize Small Business,Steve Watson


In [97]:
# Combine title and author into one input feature for the test data

test_df_updated = pd.DataFrame()
test_df_updated['feature'] = test_df['title'] + " " + test_df['author']

test_df_updated.head(10)

Unnamed: 0,feature
0,"Specter of Trump Loosens Tongues, if Not Purse..."
2,#NoDAPL: Native American Leaders Vow to Stay A...
3,"Tim Tebow Will Attempt Another Comeback, This ..."
4,Keiser Report: Meme Wars (E995) Truth Broadcas...
6,Pelosi Calls for FBI Investigation to Find Out...
7,Weekly Featured Profile – Randy Shannon Trevor...
10,184 U.S. generals and admirals endorse Trump f...
11,“Working Class Hero” by John Brennon Doug Diamond
12,The Rise of Mandatory Vaccinations Means the E...
13,Communists Terrorize Small Business Steve Watson


## Clean Data

In [98]:
# apply the clean function created above across each row of the feature column in the test data
test_df_updated.loc[:, 'feature'] = test_df_updated['feature'].apply(clean)

# Tokenization of test data
test_df_updated.loc[:, 'feature'] = test_df_updated['feature'].apply(tokenized)

# Remove stopwords from test data
test_df_updated.loc[:, 'feature'] = test_df_updated['feature'].apply(no_stopwords)

# Lemmatize test data
test_df_updated.loc[:, 'feature'] = test_df_updated['feature'].apply(lemmatized)

# Stem test data
test_df_updated.loc[:, 'feature'] = test_df_updated['feature'].apply(stemmed)

test_df_updated.head(10)

Unnamed: 0,feature
0,specter trump loosen tongu purs string silicon...
2,nodapl nativ american leader vow stay winter f...
3,tim tebow attempt anoth comeback time basebal ...
4,keiser report meme war e995 truth broadcast ne...
6,pelosi call fbi investig find russian donald t...
7,weekli featur profil randi shannon trevor loudon
10,184 u gener admir endors trump commanderinchie...
11,work class hero john brennon doug diamond
12,rise mandatori vaccin mean end medic freedom s...
13,communist terror small busi steve watson


In [99]:
# Reuse the CountVectorizer fitted on the training data so the test matrix has the same columns

# Extract the text feature from the test data (there are no labels here)
X_test = test_df_updated['feature'].values

print(X_test.shape)

X_test_counts = vectorizer.transform(X_test)

print(X_test_counts.shape)

#Predict the response for test dataset that we generated
y_pred = gnb.predict(X_test_counts.toarray())

test_df_updated['Predicted_Labels'] = y_pred

test_df_updated.head()

(4575,)
(4575, 18207)


Unnamed: 0,feature,Predicted_Labels
0,specter trump loosen tongu purs string silicon...,1
2,nodapl nativ american leader vow stay winter f...,1
3,tim tebow attempt anoth comeback time basebal ...,0
4,keiser report meme war e995 truth broadcast ne...,1
6,pelosi call fbi investig find russian donald t...,0
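The shapes above show that only the 4,575 test rows with no missing values receive a prediction, because `dropna` removed the rest. If a prediction is needed for every `id` (which I assume a competition submission would require), one option is to fill missing values with empty strings instead, as the commented-out `fillna` lines in the training cell hint at. The sketch below uses a hypothetical three-row frame, `toy_test`, to compare the two approaches; it is an illustration, not what this notebook actually ran.

```python
import pandas as pd

# Hypothetical rows shaped like test.csv, with missing title/author values
toy_test = pd.DataFrame({
    "id": ["20800", "20801", "20809"],
    "title": ["Specter of Trump Loosens Tongues", "Russian warships ready to strike", None],
    "author": ["David Streitfeld", None, "cognitive dissident"],
})

# Approach used in this notebook: rows with any missing value disappear
dropped = toy_test.dropna(axis=0)

# Alternative: keep every row and treat a missing field as empty text
filled = toy_test.fillna("")
filled["feature"] = (filled["title"] + " " + filled["author"]).str.strip()

print(len(dropped), len(filled))
print(filled[["id", "feature"]])
```

The same choice would have to be applied to the training data so that both sets are prepared in the same way.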


## How could you improve your accuracy?
While the standard Naive Bayes method is straightforward and efficient, there are more complex variations, such as Tree-Augmented Naive Bayes and Semi-Naive Bayes, that can improve model performance in some cases.
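As far as I know, scikit-learn does not ship Tree-Augmented or Semi-Naive Bayes, so the sketch below swaps in a simpler comparison: the Naive Bayes variants that scikit-learn does provide for count-like features (`MultinomialNB`, `ComplementNB`, `BernoulliNB`). `GaussianNB`, used above, models each feature as normally distributed, whereas these variants are designed for word counts or word presence. The data here (`toy_docs`, `toy_labels`) is hypothetical; on the real data the inputs would be `train_df_updated['feature']` and `train_df_updated['label']`. Wrapping the vectorizer and classifier in a pipeline also means the vocabulary is learned inside each cross-validation fold, and the matrix can stay sparse.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; string labels mirror the dtype='str' used when reading train.csv
toy_docs = [
    "truth might get fire consortiumnewscom",
    "civilian kill singl airstrik identifi",
    "iranian woman jail fiction unpublish stori",
    "hous dem aid didnt even see comey letter",
    "flynn hillari clinton big woman campu breitbart",
    "jacki mason hollywood would love trump",
    "hamon win french socialist parti",
    "backchannel plan ukrain russia courtesi trump associ",
]
toy_labels = ["1", "1", "1", "1", "0", "0", "0", "0"]

candidates = {
    "MultinomialNB": MultinomialNB(),
    "ComplementNB": ComplementNB(),
    "BernoulliNB": BernoulliNB(),
}

results = {}
for name, model in candidates.items():
    # The vectorizer is refit on the training part of every fold
    pipeline = make_pipeline(CountVectorizer(), model)
    fold_scores = cross_val_score(pipeline, toy_docs, toy_labels, cv=2)
    results[name] = fold_scores.mean()
    print(name, results[name])
```

The accuracies on eight toy documents mean nothing in themselves; the point is only the structure of the comparison.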

The features used as input strongly influence the effectiveness of a Naive Bayes model. Careful feature engineering can noticeably improve model accuracy. This entails picking the most important and useful features and encoding them appropriately. Moreover, it is important to remove redundant features: Naive Bayes assumes that features are conditionally independent given the class, so highly correlated features are effectively counted more than once, which can hurt accuracy.

Another technique to improve the model is to combine Naive Bayes with ensembling techniques such as bagging and boosting, which can help reduce the variance and bias of the model.
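A minimal sketch of the bagging half of this idea, assuming scikit-learn's `BaggingClassifier` wrapped around `MultinomialNB` (a swap from the `GaussianNB` used above, chosen because it accepts sparse count matrices directly). The data and the `n_estimators=10` setting are hypothetical. Naive Bayes is usually considered a fairly stable learner, so the gain from bagging may turn out to be small; this is something to check with cross-validation rather than assume.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned feature column and labels
toy_docs = [
    "truth might get fire consortiumnewscom",
    "civilian kill singl airstrik identifi",
    "iranian woman jail fiction unpublish stori",
    "hous dem aid didnt even see comey letter",
    "flynn hillari clinton big woman campu breitbart",
    "jacki mason hollywood would love trump",
    "hamon win french socialist parti",
    "backchannel plan ukrain russia courtesi trump associ",
]
toy_labels = ["1", "1", "1", "1", "0", "0", "0", "0"]

# Each of the 10 Naive Bayes models is trained on a bootstrap sample of the
# documents; the ensemble combines their predictions.
bagged_nb = make_pipeline(
    CountVectorizer(),
    BaggingClassifier(MultinomialNB(), n_estimators=10, random_state=0),
)
bagged_nb.fit(toy_docs, toy_labels)

bagged_predictions = bagged_nb.predict(toy_docs)
print(list(bagged_predictions))
```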

## Discussion Questions

1. Why is the term "fake news" insufficient and problematic as it is commonly applied to the analysis of misinformation? It may help to distinguish between misinformation, disinformation, and propaganda. It may also be useful to trace back how "fake news" has been used politically in the US. 

As it is commonly applied to the analysis of misinformation, the phrase "fake news" is inadequate and troublesome because it is imprecise. Often, especially in political contexts, true stories are labeled as false and legitimate sources are discredited. Although the term makes it quicker to label content, it gives inaccurate results because of how loosely it is used by the media and the general public.

2. Why is it difficult to classify misinformation? What makes something misinformation in the first place? How do people get around being flagged as misinfo? Is it always as binary as "true" or "false"? 

Misinformation is difficult to categorize because it can be complex and nuanced. People can purposely conceal the truth for their own advantage. Moreover, information spreads very fast, and it is hard to identify its source, especially when it is spread by word of mouth. Many people do not confirm the legitimacy of information before passing it on. Misinformation is also not always a binary of true or false; a claim can be partially true.

3. Beyond using ML, what domain expertise and stakeholder analysis would be useful for identifying and combatting misinformation? 

Detecting and combating disinformation requires domain knowledge and stakeholder analysis from a variety of fields, including journalism, social media, psychology, and law. Domain specialists can help identify the origins and forms of disinformation and develop ways to address it. Analysis of body language, for instance, could be one way of determining whether a person is lying.