# 1741433 - Final Year Project (Minimum Viable Product)

The requirements for the minimum viable product (as defined in the official project documentation) are as follows:
 - The MVP should be developed in Python, within Jupyter notebooks, so as to show how the fundamental concept behind this has been built
 - The MVP should utilise pandas and SKLearn to vectorise and classify the fake and real news data
 - Train the algorithm on 75% of the real-fake data
 - Create a passive aggressive classifier and test it on the remaining 25% 
 - Print a confusion matrix on that test
 - Calculate the accuracy of the model through the data in the confusion matrix
 - Calculate K Fold accuracy to see how it compares
 - Test the passive aggressive nature of the model against the other real and fake news datasets
 - Calculate accuracy for these other datasets
 - Report on the accuracies and how the MVP can be changed/improved for the first version
 
This MVP will allow us to see whether the concept is feasible and what we can build upon. I expect a low accuracy for my classifier (anywhere between 40% and 50%). If that is the case, I can proceed to justify why it is so low and suggest ideas for improvement.

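I have begun by importing all the necessary libraries into my Python program. These libraries and their descriptions can be seen below:
- pandas (imported as 'pd') is an open-source Python package built on NumPy that is commonly used within data science and machine learning. Pandas will help us perform statistical analysis and create the dataframes that hold our articles.
- TfidfVectorizer allows us to vectorise our text. We will take the text from our articles and turn it into vectors (numerical values) that the classifier can work with.
- SKLearn (scikit-learn) is the machine learning library that provides the vectoriser, the classifier and the evaluation functions listed below.
- PassiveAggressiveClassifier is the linear classifier that we train to label an article as real or fake.
- train_test_split splits the data into a training set and a test set.
- accuracy_score calculates the fraction of predictions that match the true labels.
- confusion_matrix counts, for each true class, how many articles were predicted as each class.
- cross_val_score evaluates a model with K-fold cross-validation and returns one score per fold.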
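
To make the vectorisation step more concrete, here is a minimal sketch on three made-up sentences. The `toy_docs` list and every `toy_` name are illustrative assumptions and are not part of the project data. It shows each sentence becoming one row of numbers, with one column per word that remains after English stop words are removed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences standing in for article text (illustrative only)
toy_docs = [
    "the senate passed the budget",
    "aliens secretly replaced the senate",
    "the team won the match",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One row per sentence, one column per word kept in the vocabulary
print(toy_matrix.shape)
# Words that survived stop-word removal, listed in column order
print(sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get))
# Dense view of the TF-IDF weights for the first sentence
print(toy_matrix[0].toarray().round(2))
```

A word that appears in fewer sentences receives a larger weight than one that appears in many, which is why TF-IDF is often preferred over raw word counts for text classification.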

In [28]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score

#TFIDFVECTORIZER - Right now all of our data is text; we want to vectorise this data so that we have an array that is usable from a machine learning standpoint

Our pandas instance, 'pd', will be used to create a dataframe, 'df', from the imported CSV. We want to replace the value 0 with 'Real' and 1 with 'Fake' to make the dataframe and our outputs more readable and easier to understand. We have done this below.

In [10]:
#Convert 0 to Real and 1 to Fake
df=pd.read_csv('fake-news/train.csv')
convert_val = {0: 'Real', 1: 'Fake'}
df['label'] = df['label'].replace(convert_val)
df.label.value_counts() #These counts show how balanced the dataset is (wrap the call in print() to display them, as only the last expression in a cell is shown)
df

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,Fake
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,Real
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",Fake
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,Fake
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,Fake
...,...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...,Real
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...,Real
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...,Real
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal...",Fake


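We set the sizes of our training and test sets with train_test_split. The x values hold the article text and the y values hold the labels; 25% of the rows are held out for testing and the rows are shuffled before the split. We also configure the vectoriser to omit any stop words (words that don't add value to a sentence) and, through max_df=0.75, any term that appears in more than 75% of the articles.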

In [11]:
x_train, x_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.25, random_state=7, shuffle=True)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.75)
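
As a quick check of how the split behaves, the sketch below runs the same call on eight placeholder articles. The `toy_` names and their contents are assumptions made for illustration. It shows the 75%/25% proportions and that each text stays paired with its own label after shuffling.

```python
from sklearn.model_selection import train_test_split

# Eight placeholder articles; even-numbered ones are labelled Real (illustrative only)
toy_texts = [f"article {i}" for i in range(8)]
toy_labels = ["Real", "Fake"] * 4

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=7, shuffle=True
)

# 25% of 8 rows go to the test set, the rest to the training set
print(len(toy_x_train), len(toy_x_test))

# Each held-out text keeps the label it started with
for text, label in zip(toy_x_test, toy_y_test):
    print(text, label)
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in the test set on every run.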

The code below vectorises our data and runs it through sklearn's passive aggressive classifier.

In [12]:
vec_train=tfidf_vectorizer.fit_transform(x_train.values.astype('U'))
vec_test=tfidf_vectorizer.transform(x_test.values.astype('U'))

In [14]:
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(vec_train,y_train)

PassiveAggressiveClassifier(max_iter=50)

We can now calculate the accuracy of our algorithm's predictions. After running the algorithm on our test set, we count how many correct guesses were made, divide that by the total number of guesses, and multiply the result by 100. We can see that the PAC accuracy we get is 96.23%. This accuracy is further validated on this dataset by calculating a K-Fold accuracy.

These values are, however, not representative of the algorithm's true accuracy on articles. I have taken more articles (both real and fake) and tested them against my model. These articles are from different sources and have been provided by different people.

I had to change a few fields in the CSV to class the data as real or fake. There is a section of code that allows us to enter an article to see how our algorithm classifies it. I have also printed a confusion matrix to show how many articles of each class were classified correctly and incorrectly.

We can see that when we ran our foreign articles against our model we got accuracies of 69.9% on the fake articles and 71.82% on the real articles. This is much lower than our original accuracy but it is still acceptable.

In [18]:
y_pred=pac.predict(vec_test)
score=accuracy_score(y_test, y_pred)
print(f'PAC Accuracy: {round(score*100,2)}%')

PAC Accuracy: 96.23%


In [19]:
#We can see our accuracy through our confusion matrix (rows are the true labels, columns are the predicted labels)
confusion_matrix(y_test,y_pred,labels=('Real','Fake'))
#Through the confusion matrix we can see that we have correctly predicted an article to be real 2488 times and correctly predicted an article to be fake 2516 times

array([[2488,   98],
       [  98, 2516]])
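
One of the MVP requirements is to calculate the accuracy from the data in the confusion matrix. A minimal sketch of that calculation follows, using the four values printed above. The variable names are my own assumptions. Correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the sum of all cells.

```python
import numpy as np

# Values copied from the confusion matrix output
# (rows are true labels, columns are predicted labels, order: Real, Fake)
cm = np.array([[2488, 98],
               [98, 2516]])

correct = np.trace(cm)  # diagonal cells are the correct predictions
total = cm.sum()        # every article in the test set
accuracy_from_cm = correct / total

print(f'Accuracy from confusion matrix: {round(accuracy_from_cm * 100, 2)}%')
```

This gives 5004 correct predictions out of 5200, which matches the 96.23% reported by accuracy_score.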

In [20]:
X=tfidf_vectorizer.transform(df['text'].values.astype('U'))

In [21]:
scores = cross_val_score(pac, X, df['label'].values, cv=5)
print(f'K Fold Accuracy: {round(scores.mean()*100,2)}%')

K Fold Accuracy: 96.23%
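
In the K Fold cells above, the vectoriser was fitted once on the training split and then applied to the whole dataset, so the rows held out in each fold may already have influenced the vocabulary and the IDF weights. One common way to keep every fold independent is to wrap the vectoriser and the classifier in a scikit-learn pipeline, so that the vectoriser is refitted inside each fold. The sketch below shows the idea on made-up headlines. It is an assumption about how this could be done, not the method used in this notebook, and I have not measured whether it changes the reported score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up headlines standing in for df['text'] and df['label'] (illustrative only)
toy_texts = [
    "senate votes on new budget plan",
    "court rules on trade dispute",
    "minister announces transport funding",
    "council approves housing budget",
    "shocking secret cure doctors hate",
    "aliens control world leaders claims insider",
    "miracle pill melts fat overnight",
    "celebrity clone spotted says insider",
]
toy_labels = ["Real"] * 4 + ["Fake"] * 4

# The vectoriser is refitted on the training part of every fold
toy_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    PassiveAggressiveClassifier(max_iter=50, random_state=7),
)

toy_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels, cv=2)
print(f'K Fold Accuracy (toy data): {round(toy_scores.mean() * 100, 2)}%')
```

On the real data the same pipeline could be passed the raw text column instead of the pre-vectorised matrix, with cv=5 as in the cell above.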


In [22]:
df_true=pd.read_csv('True.csv')
df_true['label']='Real'
#The lines of code below strip the Reuters dateline from the articles. We don't want the publisher's name to skew the algorithm's determination of whether the article is real or fake
for prefix in ('WASHINGTON (Reuters) - ', 'LONDON (Reuters) - ', '(Reuters) - '):
    df_true['text']=df_true['text'].str.replace(prefix, '', regex=False)
df_fake=pd.read_csv('Fake.csv')
df_fake['label']='Fake'
df_final=pd.concat([df_true,df_fake])
df_final=df_final.drop(['subject','date'], axis=1)
df_fake             

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",Fake
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",Fake
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",Fake
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",Fake
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",Fake
...,...,...,...,...,...
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",Fake
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",Fake
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",Fake
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",Fake


In [23]:
#This function will allow us to figure out whether a single article is real or fake using our model
def findlabel(newtext):
    vec_newtest=tfidf_vectorizer.transform([newtext])
    y_pred1=pac.predict(vec_newtest)
    return y_pred1[0]

The line of code below can be altered (both the row index and the dataframe) to test our model on a different article.

In [25]:
#This is a test case to see if our passive aggressive classifier is classifying our articles correctly.
findlabel(df_fake['text'][0])

'Fake'

In [40]:
#This is another test to see how accurately our model is predicting the articles in our new dataset. We can see that our accuracy is much lower
realnewsacc = sum([1 if findlabel((df_true['text'][i]))=='Real' else 0 for i in range(len(df_true['text']))])/df_true['text'].size
print(f'Real News Dataset Classification Accuracy: {round(realnewsacc * 100,2)}%')

Real News Dataset Classification Accuracy: 71.82%


In [38]:
fakenewsacc = sum([1 if findlabel((df_fake['text'][i]))=='Fake' else 0 for i in range(len(df_fake['text']))])/df_fake['text'].size * 100
print(f'Fake News Dataset Classification Accuracy: {round(fakenewsacc,2)}%')

Fake News Dataset Classification Accuracy: 69.9%
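
The two accuracy cells above call findlabel once per article, which vectorises and predicts one row at a time. A possible alternative is to vectorise the whole column in one call and predict in one batch. The sketch below is a hypothetical helper, `batch_accuracy`, which is my own name and not part of the notebook. It is demonstrated on a tiny stand-in model so that it runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier


def batch_accuracy(vectorizer, model, texts, expected_label):
    """Return the share of `texts` that `model` labels as `expected_label`."""
    features = vectorizer.transform(list(texts))  # vectorise every article in one call
    predictions = model.predict(features)         # one prediction per article
    return float(np.mean(predictions == expected_label))


# Tiny stand-in model and data (illustrative only)
toy_texts = [
    "senate votes on budget plan",
    "court rules on trade dispute",
    "miracle pill melts fat overnight",
    "aliens control world leaders",
]
toy_labels = ["Real", "Real", "Fake", "Fake"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = PassiveAggressiveClassifier(max_iter=50, random_state=7)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

toy_acc = batch_accuracy(toy_vectorizer, toy_model, toy_texts[:2], "Real")
print(f'Toy Real accuracy: {round(toy_acc * 100, 2)}%')
```

In this notebook the equivalent call would be `batch_accuracy(tfidf_vectorizer, pac, df_true['text'], 'Real')`, assuming the text column has no missing values. It should give the same figure as the per-article loop, since each article is vectorised independently of the others.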


This MVP was developed to prove the feasibility of classifying these articles through text analysis. I believe that accuracies of 69.9% and 71.82% on foreign datasets show that this model works well. I will work to make this algorithm more accurate through support vector machines and neural networks. The requirements for the first version include research on the defining attributes of fake and real news articles. I will be able to use neural networks and support vector machines to classify the data with these attributes. I can then test whether adding these attributes increases or decreases the accuracy of my model.

In summary, I believe this MVP has succeeded in proving the feasibility of this project as well as the strength of this model, even at such a basic level.