# I301D Final Project: Fake News

## Melanie Huynh, Alexander Imhoff, Isabel Yang, Ashley Yude

In our project, we will be looking through datasets containing both fake and real news. We want to find out how the presence of certain words relates to whether or not our model tags a text as REAL or FAKE.

## Data Cleaning

### Looking into our data:
First we will load our datasets into our notebook as pandas DataFrames.

In [126]:
import pandas as pd
pd.options.mode.chained_assignment = None # mutes pandas' SettingWithCopyWarning (chained assignment)


# fake_df only contains fake news
fake_df = pd.read_csv('Fake.csv')

# real_df only contains real news
real_df = pd.read_csv('True.csv')

# text_df has both true and fake news, but is labeled with a String
text_df = pd.read_csv('fake_or_real_news.csv')

# num_df has both true and fake news, but is labeled with an Integer
num_df = pd.read_csv('perez-rosas-fakenews.csv')

In [127]:
fake_df.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [128]:
real_df.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [129]:
text_df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [130]:
num_df.head()

Unnamed: 0.1,Unnamed: 0,title,body,label,category
0,0,"Alex Jones Vindicated in ""Pizzagate"" Controversy","""Alex Jones, purveyor of the independent inves...",1,biz
1,1,THE BIG DATA CONSPIRACY,Government and Silicon Valley are looking to e...,1,biz
2,2,California Surprisingly Lenient on Auto Emissi...,"Setting Up Face-Off With Trump ""California's c...",1,biz
3,3,Mexicans Are Chomping at the Bit to Stop NAFTA...,Mexico has been unfairly gaining from NAFTA as...,1,biz
4,4,Breaking News: Snapchat to purchase Twitter fo...,Yahoo and AOL could be extremely popular over ...,1,biz


### Combining our dataframes:
As seen above, we looked into the data from each source and observed that each one is formatted a little differently. Before we combine our dataframes and use them to train our model, we must clean them up and make sure all the columns match.

We will start with fake_df and real_df and delete columns that are not present in the other two datasets.

In [131]:
fake_cleaned = fake_df[['title','text']]
real_cleaned = real_df[['title','text']]

After cleaning them up, we will then add an Integer label that identifies all fake news as '1' and all real news as '0'.

In [132]:
fake_cleaned = fake_cleaned.assign(label = 1)
real_cleaned = real_cleaned.assign(label = 0)

Now we will clean up the other two data frames by deleting unnecessary columns, changing column names as needed, and making sure the label is an Integer. (Fake = 1, Real = 0)

In [133]:
# Dropping the 'Unnamed: 0' columns and the 'category' column
num_cleaned = num_df[['title','body','label']]
text_cleaned = text_df[['title','text','label']]

# Renaming the 'body' column to be 'text'
num_cleaned.rename(columns={"body":"text"}, inplace = True)
num_cleaned.head()

# Making 'label' in text_cleaned into a column of Integers
text_cleaned["label"] = (text_cleaned["label"] == "FAKE") + 0 ## Use the binary coding for true/false.
text_cleaned.head()

Unnamed: 0,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",1
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,1
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,0
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",1
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,0
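
The comparison `== "FAKE"` produces a column of True/False values, and adding 0 turns those into 1/0. A minimal sketch of the same idea on made-up rows, using `.astype(int)` as a possibly more explicit alternative (the `toy` frame is hypothetical and not part of our data):

```python
import pandas as pd

# Toy frame (made-up rows) with string labels, shaped like text_df
toy = pd.DataFrame({"title": ["a", "b", "c"],
                    "label": ["FAKE", "REAL", "FAKE"]})

# Boolean comparison, then cast: True -> 1 (fake), False -> 0 (real)
toy["label"] = (toy["label"] == "FAKE").astype(int)

encoded = toy["label"].tolist()
print(encoded)
```

Both forms should give the same integer column; which one to use is mostly a matter of readability.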


In [134]:
num_cleaned.head()

Unnamed: 0,title,text,label
0,"Alex Jones Vindicated in ""Pizzagate"" Controversy","""Alex Jones, purveyor of the independent inves...",1
1,THE BIG DATA CONSPIRACY,Government and Silicon Valley are looking to e...,1
2,California Surprisingly Lenient on Auto Emissi...,"Setting Up Face-Off With Trump ""California's c...",1
3,Mexicans Are Chomping at the Bit to Stop NAFTA...,Mexico has been unfairly gaining from NAFTA as...,1
4,Breaking News: Snapchat to purchase Twitter fo...,Yahoo and AOL could be extremely popular over ...,1


All the dataframes in our notebook now have the same columns. This means it is time to combine all of them into one single dataframe.
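
Before concatenating, it can help to confirm that the columns really do match. The sketch below uses a hypothetical helper, `same_columns`, and made-up frames; it also shows `ignore_index=True`, which gives the combined frame a fresh 0..n-1 index (our notebook keeps the original indices instead).

```python
import pandas as pd

def same_columns(frames):
    """Return True if every dataframe has the same column names in the same order."""
    first = list(frames[0].columns)
    return all(list(f.columns) == first for f in frames[1:])

# Made-up frames: a and b match, c still has a 'body' column
a = pd.DataFrame({"title": ["t1"], "text": ["x"], "label": [1]})
b = pd.DataFrame({"title": ["t2"], "text": ["y"], "label": [0]})
c = pd.DataFrame({"title": ["t3"], "body": ["z"], "label": [0]})

ok = same_columns([a, b])
bad = same_columns([a, c])

# ignore_index=True renumbers the rows instead of repeating 0, 0, ...
combined = pd.concat([a, b], ignore_index=True)
print(ok, bad, list(combined.index))
```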

In [135]:
combine_all = [fake_cleaned, real_cleaned, num_cleaned, text_cleaned]
final_df = pd.concat(combine_all)
final_df

Unnamed: 0,title,text,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,1
...,...,...,...
6330,State Department says it can't find emails fro...,The State Department told the Republican Natio...,0
6331,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,1
6332,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,1
6333,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",0


## Creating a Bag of Words

We also need to create a bag of words to act as our training data, but we have to clean and process the actual bodies of text first.

### Preprocessing the text
Techniques we will apply:
* Make every letter of every word lowercase so that words like "Crazy" and "crazy" count as the same word.
* Remove stopwords, since they don't bring any value in modeling. (We used NLTK's list of English stopwords to figure out which words to delete.)
* Remove characters we don't need, such as @mentions, URLs, digits, and extra whitespace.
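
To see what these steps do, here is a minimal sketch that applies them to one made-up string. `clean_text` is a hypothetical helper, not the code we run on the dataframe; it leaves out stopword removal and collapses newlines into a single space rather than deleting them.

```python
import re

def clean_text(raw):
    """Apply the cleaning steps to one string (illustrative only)."""
    text = str(raw).lower()                    # "Crazy" and "crazy" become the same word
    text = re.sub(r"@\S+", "", text)           # drop @mentions
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"[0-9]", "", text)          # drop digits
    text = re.sub(r"\s+", " ", text)           # collapse whitespace, including newlines
    return text.strip()

sample = "Crazy news from @someone:\nsee https://example.com in 2017"
cleaned = clean_text(sample)
print(cleaned)
```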

In [136]:
import re

# make words lowercase
final_df['text'] = [str(x).lower() for x in final_df['text']]

# format words and remove unwanted characters
final_df['text'] = [re.sub(r"@\S+", "", a) for a in final_df['text']]  # removing @mentions
final_df['text'] = [re.sub(r"https?://\S+", "", b) for b in final_df['text']]  # removing urls
final_df['text'] = [re.sub(r"[0-9]", "", c) for c in final_df['text']]  # removing digits
final_df['text'] = [re.sub(r"\n", "", d) for d in final_df['text']]  # removing newlines
# removing text in parentheses or brackets (greedy: from the first opener through the last closer)
final_df['text'] = [re.sub(r"(\(.*\))|(\[.*\])", "", x) for x in final_df['text']]
final_df['text'] = [re.sub(r"\s+", " ", e) for e in final_df['text']]  # collapsing extra whitespace

# keep only the words for which `word not in stopwords` holds (`stopwords` is loaded in In [137])
final_df['text'] = final_df['text'].apply(lambda x: ' '.join([word for word in str(x).split() if word not in (stopwords)]))
print(final_df)

                                                  title  \
0      Donald Trump Sends Out Embarrassing New Year’...   
1      Drunk Bragging Trump Staffer Started Russian ...   
2      Sheriff David Clarke Becomes An Internet Joke...   
3      Trump Is So Obsessed He Even Has Obama’s Name...   
4      Pope Francis Just Called Out Donald Trump Dur...   
...                                                 ...   
6330  State Department says it can't find emails fro...   
6331  The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...   
6332  Anti-Trump Protesters Are Tools of the Oligarc...   
6333  In Ethiopia, Obama seeks progress on peace, se...   
6334  Jeb Bush Is Suddenly Attacking Trump. Here's W...   

                                                   text  label  
0     donald trump just couldn t wish all americans ...      1  
1     house intelligence committee chairman devin nu...      1  
2     on friday, it was revealed that former milwauk...      1  
3     on christmas day, donald 

Note: We tried using nltk, but had a lot of trouble getting Jupyter Notebook to recognize that we had already downloaded the module. Instead, we will be using a list of English stopwords that we found on GitHub (https://gist.github.com/sebleier/554280).
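
One detail worth keeping in mind when filtering with a list loaded through `pd.read_csv`: the `in` operator on a DataFrame tests column labels rather than cell values, so turning the words into a Python set first is a safer way to test membership. The sketch below assumes a file with one word per line and no header, which may not match the downloaded file; `stopword_df` and `stopword_set` are hypothetical names.

```python
import io
import pandas as pd

# Hypothetical stand-in for the stopword file: one word per line, no header
stopword_file = io.StringIO("i\nme\njust\nall\n")
stopword_df = pd.read_csv(stopword_file, header=None, names=["word"])
stopword_set = set(stopword_df["word"])

# `in` on a DataFrame looks at column labels, not at the values in the cells
in_frame = "just" in stopword_df   # False: "just" is not a column name
in_set = "just" in stopword_set    # True: "just" is one of the words

sentence = "donald trump just couldn t wish all americans"
filtered = " ".join(w for w in sentence.split() if w not in stopword_set)
print(filtered)
```

A set also makes each lookup fast, which matters when every word of every article is checked.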

In [137]:
import numpy as np
import re

# load the stopword list saved from the GitHub gist
stopwords = pd.read_csv("NLTK's list of english stopwords")

# keep the processed text in its own column
final_df['clean_text'] = list(final_df.text)
final_df.head()

Unnamed: 0,title,text,label,clean_text
0,Donald Trump Sends Out Embarrassing New Year’...,donald trump just couldn t wish all americans ...,1,donald trump just couldn t wish all americans ...
1,Drunk Bragging Trump Staffer Started Russian ...,house intelligence committee chairman devin nu...,1,house intelligence committee chairman devin nu...
2,Sheriff David Clarke Becomes An Internet Joke...,"on friday, it was revealed that former milwauk...",1,"on friday, it was revealed that former milwauk..."
3,Trump Is So Obsessed He Even Has Obama’s Name...,"on christmas day, donald trump announced that ...",1,"on christmas day, donald trump announced that ..."
4,Pope Francis Just Called Out Donald Trump Dur...,pope francis used his annual christmas day mes...,1,pope francis used his annual christmas day mes...


In [139]:
from sklearn.feature_extraction.text import CountVectorizer

 
vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w+')

text_arr = np.array(final_df['clean_text'])

# final_df.head()

# make a bag of words
bag = vectorizer.fit_transform(text_arr)
# print (bag)

## Machine Learning

Now that all of our data is cleaned up, and we have a bag of words, we can move on to training a machine learning algorithm and later analyzing its accuracy.

To begin, we first define our features X, which we use to predict, and the label Y, which we try to predict.
In this case, we predict whether or not an article is fake news from its cleaned body text.

In [141]:
# We are only keeping 'clean_text' because it is what we are using to predict the label
X = final_df['clean_text']

Y = final_df['label']
print (X.head())

0    donald trump just couldn t wish all americans ...
1    house intelligence committee chairman devin nu...
2    on friday, it was revealed that former milwauk...
3    on christmas day, donald trump announced that ...
4    pope francis used his annual christmas day mes...
Name: clean_text, dtype: object


Here, we followed a tutorial to extract TF-IDF features after tokenizing the words. We put this step in its own function to keep the notebook clean.

In [166]:
def extract_features(df, training_data, testing_data):
    """Fit TF-IDF on the training text, print the top-ranked terms, and transform both splits.

    `df` is accepted for convenience but is not used.
    """
    # TF-IDF BASED FEATURE REPRESENTATION
    tfidf = TfidfVectorizer(use_idf=True, max_df=0.95)
    xx = tfidf.fit_transform(training_data.values)
    terms = tfidf.get_feature_names_out()

    # sum the tf-idf weight of each term across all documents
    sums = xx.sum(axis=0)

    # pair each term with its summed weight, skipping sklearn's English stop words
    data = []
    for col, term in enumerate(terms):
        if term not in stop and term != ' ':
            data.append((term, sums[0, col]))

    ranking = pd.DataFrame(data, columns=['term', 'rank'])
    print(ranking.sort_values('rank', ascending=False))

    train_feature_set = tfidf.transform(training_data.values)
    test_feature_set = tfidf.transform(testing_data.values)

    # return both feature matrices along with the fitted vectorizer
    return train_feature_set, test_feature_set, tfidf

Now we will divide the dataset into training and testing parts. We will use the training set to fit our model so that it can predict, in this case, whether or not an article is fake news. Then we will use the testing set to see how our model performed.

We will also set random_state so that every time the code is run, we end up with the exact same split.
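
A quick way to convince ourselves that `random_state` fixes the split is to split the same made-up list twice and compare the results. This is only a sketch; the list of 20 numbers stands in for our rows.

```python
from sklearn.model_selection import train_test_split

items = list(range(20))  # made-up stand-in for the rows of the dataframe

# Same seed, same split
train_a, test_a = train_test_split(items, random_state=9)
train_b, test_b = train_test_split(items, random_state=9)

same_split = (train_a == train_b) and (test_a == test_b)
sizes = (len(train_a), len(test_a))  # default test size is 25% of the data
print(same_split, sizes)
```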

In [167]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
stop = text.ENGLISH_STOP_WORDS

# GET A TRAIN TEST SPLIT (set seed for consistent results)
train_df, test_df = train_test_split(final_df, random_state=9)
# GET LABELS
training_pred = train_df['label'].values
testing_pred = test_df['label'].values
print(training_pred)
# GET THE CLEANED TEXT FOR THE SAME ROWS
training_data = train_df['clean_text']
testing_data = test_df['clean_text']
print(training_data)

X_train, X_test, feature_transformer = extract_features(final_df, training_data, testing_data)

[0 0 0 ... 1 1 1]
13101    berlin - police raided apartments across germa...
17193    beijing - china will let the market play a dec...
15291    beijing - u.s. secretary of state rex tillerso...
513      in house speaker paul ryan s ayn randish views...
3803     washington - former director of national intel...
                               ...                        
5014     when you think of civil rights activists, it s...
19266    zurich from previous tests. korean peninsula u...
22584    tune in to the alternate current radio network...
501      white house counselor kellyanne conway crawled...
20828    every parent in the united states of america s...
Name: clean_text, Length: 38784, dtype: object
               term         rank
83874         trump  1661.447639
70961          said  1247.743570
63580     president   707.585053
14674       clinton   630.117338
60741        people   538.604733
...             ...          ...
64836       punting     0.008568
14806        clover   

We will be using a Logistic Regression classifier for this model.

In [145]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='saga',random_state=0, C=5, penalty='l2',max_iter=5000)
model = lr.fit(X_train,training_pred)

In [146]:
prediction = lr.predict(X_test)

In [147]:
from sklearn import metrics

# measure the accuracy
print("LogisticRegression Accuracy %.3f" %metrics.accuracy_score(testing_pred, prediction))

LogisticRegression Accuracy 0.960


Wow, our model was 96% accurate on the test set!
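
Accuracy is simply the share of test articles whose predicted label matches the true label. A minimal sketch with made-up labels (the `accuracy` helper is hypothetical; in the notebook this is done by `metrics.accuracy_score`):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Made-up labels: 1 = fake, 0 = real
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

score = accuracy(y_true, y_pred)  # 4 of 5 predictions match
print("%.3f" % score)
```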

In [38]:
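import itertools
import matplotlib.pyplot as plt
import numpy as np

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """Plot a confusion matrix; with normalize=True each row is scaled to sum to 1."""
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print('Normalized confusion matrix')
    else:
        print('Confusion matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # write each cell's value on the plot, in white on dark cells
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        cell = format(cm[i, j], '.2f') if normalize else cm[i, j]
        plt.text(j, i, cell,
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')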
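
The matrix passed to this function can be computed with `sklearn.metrics.confusion_matrix`. The sketch below uses made-up labels to show how the cells are laid out: rows are the true labels and columns are the predicted labels, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = fake, 0 = real
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

# Rows are true labels, columns are predicted labels, ordered as [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
```

With our own variables, this would presumably be `confusion_matrix(testing_pred, prediction)`, passed to `plot_confusion_matrix` together with class names such as `['REAL', 'FAKE']`.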