# Natural Language Processing Assignment 1
<h2 align='center'> Detecting Fake news </h2>

**Names :** Vaishali Raja, Ludwig Orsini-Rosenberg, Nicolas Linsenmaier

**Best Score:** 94.7% (2nd: 94.3%)

**Classifier Used:** Passive-Aggressive Classifier (2nd: SGDC with GridSearch)

<align = 'center'> 
The objective of this assignment was to create a classifier that can distinguish fake news from real news using NLP techniques. The main pre-processing steps are removing stop words, converting to lowercase, stemming, handling punctuation and using n-grams. Features are then extracted and transformed with scikit-learn's TF-IDF vectorizer, which combines a Count Vectorizer (Bag of Words) with a TF-IDF Transformer. The extracted features are fed to 3 classifiers, each evaluated with cross-validation: Multinomial Naive Bayes, Stochastic Gradient Descent Classifier and Passive-Aggressive Classifier.

## Index
    1. Import necessary packages
    2. Import Train and Test Data
    3. Clean Train Data
        3.1 Split dataset 
        3.2 Process X2 column
        3.3 Process X1 column
        3.4 Merge and create final train dataset
    4. Base analysis
        4.1 Levels
        4.2 Frequency distribution and Vocab
    5. Tokenizer and Vectorizer
    6. Multinomial Naive Bayes
    7. Stochastic Gradient Descent Classifier
    8. Passive-Aggressive Classifier
    9. Applying classifier on Test Data
    10. Conclusion

## 1. Importing packages

In [1]:
import numpy as np 
import pandas as pd 
import string
from prettytable import PrettyTable

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

In [3]:
import nltk
from nltk.text import TextCollection
from nltk.stem.porter import PorterStemmer
import snowballstemmer

## 2. Import Train and Test Data

In [4]:
train = pd.read_csv("./fake_or_real_news_training.csv", encoding = "UTF-8" )
test = pd.read_csv("./fake_or_real_news_test.csv", encoding = "UTF-8")

In [5]:
print('Train Data: \n')
print("Number of columns: "+ str(train.shape[1]))
print("Number of rows: "+ str(train.shape[0]))

print('Test Data: \n')
print("Number of columns: "+ str(test.shape[1]))
print("Number of rows: "+ str(test.shape[0]))

Train Data: 

Number of columns: 6
Number of rows: 3999
Test Data: 

Number of columns: 3
Number of rows: 2321


## 3. Clean Dataset

In [6]:
#Percentage of missing values in Numerical Variables
data_num_na = (train.isnull().sum() / len(train)) * 100
data_num_na = data_num_na.drop(data_num_na[data_num_na == 0].index).sort_values(ascending=False)[:30]
missing_data_n = pd.DataFrame({'Missing % of NA in Numerical Var' :data_num_na})
print('\n Missing % of NA in Num variable: \n')
print("Table: " + str(missing_data_n))


 Missing % of NA in Num variable: 

Table:     Missing % of NA in Numerical Var
X2                         99.949987
X1                         99.174794


### 3.1 Split dataset

In [7]:
#Splitting the dataset into 3 set; Clean - Data has been parsed correctly
#rows_x1 - Data has been split and moved into x1
#rows_x2 - Data has been split into both X1 and X2 columns

clean = train[train['X1'].isnull()]
rows_x1 = train[train['X1'].notnull()& train['X2'].isnull()]
rows_x2 = train[train['X2'].notnull()]

print('\n Count of values in col X1: \n')
print(str(rows_x1.shape[0]) + "\n" + "\n Count of values in col X2: \n" + str (rows_x2.shape[0]))


 Count of values in col X1: 

31

 Count of values in col X2: 
2


A total of 31 rows spill over into column X1, and 2 rows spill over twice, into both X1 and X2. We will now look at these rows and realign them so that their values fit within the first 4 columns.

### 3.2 Processing X2 Column

In [8]:
rows_x2

Unnamed: 0,ID,title,text,label,X1,X2
2184,9,Planned Parenthood’s lobbying effort,pay raises for federal workers,and the future Fed rates,PLANNED PARENTHOOD’S LOBBYING GETS AGGRESSIVE....,REAL
3537,6268,Chart Of The Day: Since 2009—–Recovery For The 5%,Stagnation for the 95%,Chart Of The Day: Since 2009 Recovery For The 5%,Stagnation for the 95%,FAKE


It looks like part of the title has been shifted into the text (and label) columns, pushing the remaining values into the X1 and X2 columns. We will first investigate the rows that have been shifted twice.

In [9]:
print("Title" + " : " + rows_x2.iloc[0,1])
print("Text" + " : " + rows_x2.iloc[0,2])
print("Label" + " : " + rows_x2.iloc[0,3])
#print("X1" + " : " + rows_x2.iloc[0,4])

Title : Planned Parenthood’s lobbying effort
Text :  pay raises for federal workers
Label :  and the future Fed rates


In [10]:
print("Title" + " : " + rows_x2.iloc[1,1])
print("Text" + " : " + rows_x2.iloc[1,2])
print("Label" + " : " + rows_x2.iloc[1,3])
#print("X1" + " : " + rows_x2.iloc[1,4])

Title : Chart Of The Day: Since 2009—–Recovery For The 5%
Text :  Stagnation for the 95%
Label : Chart Of The Day: Since 2009 Recovery For The 5%


Based on the output above, we are going to drop row 3537 (ID 6268), as its text field only repeats the title and it has no real article text. For row 2184 (ID 9), the title has been split across 3 columns (title, text and label), so we merge those pieces back into the title and move the remaining values 2 columns back.

In [11]:
# Drop the row with index 3537 (its ID is 6268)
rows_x2 = rows_x2.drop(3537)

In [12]:
#Shift the other row 2 columns back
rows_x2.title.iloc[0] = rows_x2.title.iloc[0] + rows_x2.text.iloc[0] + rows_x2.label.iloc[0]
rows_x2.iloc[0,2:] = rows_x2.iloc[0,2:].shift(-2)
rows_x2

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


Unnamed: 0,ID,title,text,label,X1,X2
2184,9,Planned Parenthood’s lobbying effort pay raise...,PLANNED PARENTHOOD’S LOBBYING GETS AGGRESSIVE....,REAL,,


### 3.3 Processing X1 Column

In [13]:
rows_x1

Unnamed: 0,ID,title,text,label,X1,X2
192,599,Election Day: No Legal Pot In Ohio,Democrats Lose In The South,Election Day: No Legal Pot In Ohio; Democrats ...,REAL,
308,10194,Who rode it best? Jesse Jackson mounts up to f...,Leonardo DiCaprio to the rescue?,Who rode it best? Jesse Jackson mounts up to f...,FAKE,
382,356,Black Hawk crashes off Florida,human remains found,(CNN) Thick fog forced authorities to suspend ...,REAL,
660,2786,Afghanistan: 19 die in air attacks on hospital,U.S. investigating,(CNN) Aerial bombardments blew apart a Doctors...,REAL,
889,3622,Al Qaeda rep says group directed Paris magazin...,US issues travel warning,A member of Al Qaeda's branch in Yemen said Fr...,REAL,
911,7375,Shallow 5.4 magnitude earthquake rattles centr...,shakes buildings in Rome,00 UTC © USGS Map of the earthquake's epicent...,FAKE,
1010,9097,ICE Agent Commits Suicide in NYC,Leaves Note Revealing Gov’t Plans to Round-up...,Email Print After writing a lengthy suicide no...,FAKE,
1043,9203,Political Correctness for Yuengling Brewery,What About Our Opioid Epidemic?,We Are Change \n\nIn today’s political climate...,FAKE,
1218,1602,Poll gives Biden edge over Clinton against GOP...,VP meets with Trumka,A new national poll shows Vice President Biden...,REAL,
1438,4562,Russia begins airstrikes in Syria,U.S. warns of new concerns in conflict,Russian warplanes began airstrikes in Syria on...,REAL,


In [14]:
print("Title" + " : " + rows_x1.iloc[12,1])
print("Text" + " : " + rows_x1.iloc[12,2])
#print("Label" + " : " + rows_x1.iloc[12,3])
print("X1" + " : " + rows_x1.iloc[12,4])

Title : STATE OF GEORGIA FIRES PASTOR BECAUSE OF HIS FAITH
Text :  GOVERNMENT DIDN’T “APPROVE” BIBLICAL SERMONS
X1 : FAKE


Looking at a sample of these rows, the title has been split across the title and text columns. We will therefore merge the two pieces and move the remaining values one column back.
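The realignment described above can be sketched on a miniature example. The DataFrame below is hypothetical (two made-up rows, one of which spilled into `X1`), and assigning through `.loc` on an explicit copy is one way to avoid the copy-of-a-slice warning; it is an illustration, not the exact code used in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the problem: the title of the second row contained
# a comma, so its fields spilled one column to the right, into X1.
df = pd.DataFrame({
    "ID": [1, 2],
    "title": ["A clean title", "Part one"],
    "text": ["Body of article one", " part two"],
    "label": ["REAL", "Body of article two"],
    "X1": [None, "FAKE"],
})

shifted = df["X1"].notnull()   # rows whose values spilled into X1
fixed = df.copy()              # explicit copy, so assignments are unambiguous

# Re-join the split title, then move text and label one column to the left.
fixed.loc[shifted, "title"] = df.loc[shifted, "title"] + "," + df.loc[shifted, "text"]
fixed.loc[shifted, "text"] = df.loc[shifted, "label"]
fixed.loc[shifted, "label"] = df.loc[shifted, "X1"]
fixed = fixed.drop(columns="X1")

print(fixed)
```

Rows that were parsed correctly are left untouched because the boolean mask `shifted` only selects the spilled rows.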

In [15]:
#Appending title and text in title column
rows_x1.title = rows_x1.title + ':' + rows_x1.text
#checking sample
print("Title" + " : " + rows_x1.iloc[12,1])
print("Text" + " : " + rows_x1.iloc[12,2])

Title : STATE OF GEORGIA FIRES PASTOR BECAUSE OF HIS FAITH: GOVERNMENT DIDN’T “APPROVE” BIBLICAL SERMONS
Text :  GOVERNMENT DIDN’T “APPROVE” BIBLICAL SERMONS


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [16]:
#Shifting one column back
cols = rows_x1.columns[:-1]
rows_x1 = rows_x1.drop('text', axis=1)
rows_x1.columns = cols

In [17]:
print("Title" + " : " + rows_x1.iloc[12,1])
#print("Text" + " : " + rows_x1.iloc[12,2])
print("Label" + " : " + rows_x1.iloc[12,3])
#print("X1" + " : " + rows_x1.iloc[12,4])

Title : STATE OF GEORGIA FIRES PASTOR BECAUSE OF HIS FAITH: GOVERNMENT DIDN’T “APPROVE” BIBLICAL SERMONS
Label : FAKE


### 3.4 Merge and create final train dataset

In [18]:
#Drop X1 and X2 columns as they are not required further
rows_x1 = rows_x1.drop('X1', axis=1)
rows_x2 = rows_x2.drop(['X1','X2'], axis=1)
clean = clean.drop(['X1','X2'], axis=1)

In [19]:
#Merge cleaned datasets together
train_l = pd.concat([clean,rows_x1,rows_x2])

#Create a new level with both title and text
train_l['both'] = train_l.title + train_l.text

In [20]:
label = train_l.label
train = train_l.drop('label',axis=1)

## 4. Base analysis 

### 4.1 Levels
The original input data has 2 levels: 'Title' and 'Text'. In the final train dataset, a new level, 'both', has been created, which combines the Title and the Text of each news article.

In this section, we perform a base analysis on 2 levels, 'Text' and 'both', using a simple TF-IDF vectorizer with a Multinomial Naive Bayes classifier to decide which level to use for further analysis.

In [21]:
text_pip = Pipeline([('tfidf', TfidfVectorizer()),
                     ('chi2', SelectKBest(chi2, k=20000)),
                     ('Multi_NB', MultinomialNB())])

accuracy = cross_val_score(text_pip, train.text, label, scoring='accuracy', cv=5)
print("Accuracy = " + str(accuracy))
print("Mean Acc = " + str(accuracy.mean()))

Accuracy = [0.815      0.79375    0.8075     0.80225282 0.80225282]
Mean Acc = 0.80415112640801


In [22]:
both_pip = Pipeline([('tfidf', TfidfVectorizer()),
                     ('chi2', SelectKBest(chi2, k=20000)),
                     ('Multi_NB', MultinomialNB())])

accuracy = cross_val_score(both_pip, train.both, label, scoring='accuracy', cv=5)
print("Accuracy = " + str(accuracy))
print("Mean Acc = " + str(accuracy.mean()))

Accuracy = [0.82       0.7975     0.81125    0.81476846 0.80350438]
Mean Acc = 0.8094045682102629


Using both the title and the text is marginally better (mean accuracy 80.9% vs. 80.4%). This makes sense, as features from both levels are then available to the classifier. The 'both' level is used for the rest of the analysis.

### 4.2 Frequency distribution and Vocab

This section gives a basic picture of how the word frequency distribution and the vocabulary differ between the Fake and Real news articles. The process was done step by step, and slight differences could only be observed after stopwords and single-character tokens were removed.

The vocabularies of the Fake and Real datasets do not differ much, but some form of sentiment analysis could be applied to gain more insight into the type of vocabulary used.

In [23]:
Fake = train_l[train_l['label']=='FAKE']
Real = train_l[train_l['label']=='REAL']


In [24]:
str_f = str(Fake['both'])
str_r = str(Real['both'])
type(str_f)

str
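One caveat when reading the tables in this section: `str()` applied to a pandas Series returns its printed representation, which pandas truncates for long Series, so the token `...` can show up as a frequent "word" and the counts cover only part of the articles. A minimal sketch of the difference, using a hypothetical stand-in Series `docs` rather than the real data, with `" ".join(...)` as one possible way to keep every document:

```python
import pandas as pd

# Hypothetical stand-in for Fake['both']: 200 short documents.
docs = pd.Series([f"document number {i} mentions Trump" for i in range(200)])

as_repr = str(docs)        # printed representation, truncated by pandas
as_text = " ".join(docs)   # every document, in full

# The truncated representation contains far fewer occurrences of the word.
print(as_repr.count("Trump"), as_text.count("Trump"))
```

Feeding `as_text` to `nltk.word_tokenize` would count words over the whole corpus, so the resulting frequencies and vocabulary size would differ from those reported below.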

In [25]:
words = nltk.word_tokenize(str_f)

# Remove single-character tokens (mostly punctuation)
words = [word for word in words if len(word) > 1]

# Remove stopwords
default_stopwords = set(nltk.corpus.stopwords.words('english'))
words = [word for word in words if word not in default_stopwords]

# Calculate frequency distribution
fdist = nltk.FreqDist(words)

# Output top 20 words
x = PrettyTable()
x.field_names = ["Word", "Word Count"]

for word, frequency in fdist.most_common(20) :
    x.add_row([word,frequency])

print(x)

words_l = [w.lower() for w in words]
vocab = len(set(words))

print("Vocabulary in Fake News : " + str(vocab))

+----------+------------+
|   Word   | Word Count |
+----------+------------+
|   ...    |     61     |
|  Trump   |     7      |
| Hillary  |     6      |
|   The    |     4      |
|    To    |     4      |
|   FBI    |     4      |
| America  |     3      |
|   Why    |     3      |
|  Woman   |     3      |
|    In    |     3      |
|   This   |     3      |
|    OF    |     3      |
|   News   |     2      |
|   War    |     2      |
| Clinton  |     2      |
|  After   |     2      |
|    24    |     2      |
|  Syrian  |     2      |
| November |     2      |
|   2016   |     2      |
+----------+------------+
Vocabulary in Fake News : 393


In [26]:
words = nltk.word_tokenize(str_r)

# Remove single-character tokens (mostly punctuation)
words = [word for word in words if len(word) > 1]

# Remove stopwords
words = [word for word in words if word not in default_stopwords]

# Calculate frequency distribution
fdist = nltk.FreqDist(words)

# Output top 20 words
x = PrettyTable()
x.field_names = ["Word", "Word Count"]

for word, frequency in fdist.most_common(20) :
    x.add_row([word,frequency])

print(x)

words_l = [w.lower() for w in words]
vocab = len(set(words))

print("Vocabulary in Real News : " + str(vocab))

+---------+------------+
|   Word  | Word Count |
+---------+------------+
|   ...   |     62     |
|  Trump  |     13     |
| Clinton |     9      |
|   GOP   |     6      |
|   The   |     5      |
|  Donald |     5      |
|  Paris  |     4      |
|  Obama  |     4      |
|    's   |     4      |
|   New   |     3      |
|  makes  |     3      |
|   Cruz  |     3      |
|    In   |     3      |
|   Iran  |     2      |
|   With  |     2      |
|  women  |     2      |
|    19   |     2      |
| Hillary |     2      |
|  Senate |     2      |
|   His   |     2      |
+---------+------------+
Vocabulary in Real News : 368


## 5. Defining Tokenizer and Vectorizer

The tokenizer defined here applies stemming and can optionally drop punctuation (that filter is currently commented out). The vectorizer used is the TF-IDF vectorizer, which extracts features with a count vectorizer and transforms them with a TF-IDF transformer.
A number of input parameters can be varied to find the features that give the highest accuracy. The ones looked into here are:

    1) Stop Words
    2) Tokenizing with Stemming
    3) Converting words to lowercase
    4) N-grams of order 2 and 3
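To illustrate the statement that the TF-IDF vectorizer is a count vectorizer followed by a TF-IDF transformer, the sketch below compares both routes on three made-up sentences (the `docs` list is hypothetical and default parameters are assumed on both sides).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy corpus
docs = [
    "the senate passed the bill",
    "you will not believe the bill",
    "the shocking truth",
]

# One step: raw text -> TF-IDF matrix
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: raw text -> term counts -> TF-IDF weights
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Both routes should give the same matrix
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

Using the single `TfidfVectorizer` step keeps the pipelines in the following sections shorter, while options such as `stop_words`, `ngram_range` and `lowercase` act on the counting stage.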

In [27]:
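# Create the stemmer once instead of once per token
stemmer = PorterStemmer()

def tokenize(text):
    # Split the text into tokens, then reduce each token to its Porter stem
    tokens = nltk.word_tokenize(text)
    stems = []
    for item in tokens:
        #if item in string.punctuation: continue
        stems.append(stemmer.stem(item))
    return stems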

In [28]:
# Base vectorizer with stopwords
tfidf_vectorizer_sw = TfidfVectorizer(stop_words='english')

#Base vectorizer with tokenizer
tfidf_vectorizer_tk = TfidfVectorizer(tokenizer = tokenize)

# Base vectorizer with n-gram
tfidf_vectorizer_try = TfidfVectorizer(sublinear_tf=True, min_df=5, 
                                       norm='l2', ngram_range=(1,2),
                                       max_df = 0.7)

# Base vectorizer with n-gram, stopwords
tfidf_vectorizer_swn = TfidfVectorizer(stop_words='english',sublinear_tf=True,
                                       min_df=5, norm='l2', 
                                       ngram_range=(1,2),max_df = 0.7)

# Base vectorizer with n-gram, stopwords and tokenizer
tfidf_vectorizer_swntk = TfidfVectorizer(stop_words='english', sublinear_tf=True, 
                                         min_df=5, norm='l2', tokenizer = tokenize,
                                         ngram_range=(1,2), max_df = 0.7)


## 6. Multinomial Naive Bayes

Multinomial Naive Bayes is one of the most commonly used classifiers for text classification. First, we apply the MNB classifier with each of the vectorizers created above and identify the best-performing one. We then apply a grid search to find parameter settings that increase the accuracy score further.

In [29]:
text_clf_NB_1 = Pipeline([('tfidf', tfidf_vectorizer_sw),
                         ('chi2', SelectKBest(chi2, k=20000)),
                         ('clf-NB', MultinomialNB())])

accuracy = cross_val_score(text_clf_NB_1, train.both, label, scoring='accuracy', cv=5)
print("Accuracy = " + str(accuracy))
print("Mean Acc = " + str(accuracy.mean()))

Accuracy = [0.8425     0.8225     0.84375    0.83604506 0.82728411]
Mean Acc = 0.834415832290363


In [None]:
#text_clf_NB_2 = Pipeline([('tfidf', tfidf_vectorizer_tk),
#                         ('chi2', SelectKBest(chi2, k=20000)),
#                         ('clf-NB', MultinomialNB())])

#accuracy = cross_val_score(text_clf_NB_2, train.both, label, scoring='accuracy', cv=5)
#print("Accuracy = " + str(accuracy))
#print("Mean Acc = " + str(accuracy.mean()))

In [30]:
text_clf_NB_3 = Pipeline([('tfidf', tfidf_vectorizer_try),
                         ('chi2', SelectKBest(chi2, k=20000)),
                         ('clf-NB', MultinomialNB())])

accuracy = cross_val_score(text_clf_NB_3, train.both, label, scoring='accuracy', cv=5)
print("Accuracy = " + str(accuracy))
print("Mean Acc = " + str(accuracy.mean()))

Accuracy = [0.9        0.89       0.90625    0.89486859 0.8873592 ]
Mean Acc = 0.8956955569461827


In [31]:
text_clf_NB_4 = Pipeline([('tfidf', tfidf_vectorizer_swn),
                         ('chi2', SelectKBest(chi2, k=20000)),
                         ('clf-NB', MultinomialNB())])

accuracy = cross_val_score(text_clf_NB_4, train.both, label, scoring='accuracy', cv=5)
print("Accuracy = " + str(accuracy))
print("Mean Acc = " + str(accuracy.mean()))

Accuracy = [0.91       0.9        0.90875    0.9048811  0.88235294]
Mean Acc = 0.9011968085106382


In [None]:
#text_clf_NB_5 = Pipeline([('tfidf', tfidf_vectorizer_swntk),
#                         ('chi2', SelectKBest(chi2, k=20000)),
#                         ('clf-NB', MultinomialNB())])

#accuracy = cross_val_score(text_clf_NB_5, train.both, label, scoring='accuracy', cv=5)
#print("Accuracy = " + str(accuracy))
#print("Mean Acc = " + str(accuracy.mean()))

Based on the results of the tests above, the best base vectorizer to use is **tfidf_vectorizer_swn**, which includes **stopwords and n-grams up to order 2.** This will be used for the further classifiers and will be updated based on the results obtained. 
It is to be noted that stemming did not provide better results here. A possible explanation is that stemming, which strips suffixes to reduce words to their stems, removes differences in verb conjugation and tense that may help separate Fake news from the more structured and formal Real news. 

Note: For the final run, the pipelines that include stemming were not run, as they take a long time. The same applies to the grid searches, but the results of a previous run are included.
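As a small illustration of what stemming does to inflected forms, the sketch below runs NLTK's `PorterStemmer` on a few hypothetical example words. It only shows the mechanism; whether this loss of inflection explains the accuracy results above remains an assumption.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical example words: different inflections of the same verbs and nouns
words = ["voted", "voting", "running", "elections"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

Different tenses of the same verb map to a single stem, so a vectorizer that uses the stemming tokenizer can no longer tell them apart.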

### 6.1 Parameter optimization

In [None]:
# Grid search for MNB
#parameters_NB = {'tfidf__use_idf': (True, False),
#                  'tfidf__lowercase': (True,False),
#                  'clf-NB__alpha': (0.01, 0.05),
#                 }

#gs_clf_NB = GridSearchCV(text_clf_NB_4, cv=5, param_grid=parameters_NB, n_jobs=-1)
#gs_clf_NB = gs_clf_NB.fit(train.both, label)

#print (gs_clf_NB.best_score_)
#print (gs_clf_NB.best_params_)

The results of the Grid Search are as follows: 
   
       0.936468234117
    {'clf-NB__alpha': 0.01, 'tfidf__lowercase': False, 'tfidf__use_idf': True}

Although converting all words to lowercase is the usual normalization practice, it does not help in this case. A plausible explanation is that Fake news tries to elicit heightened emotions through excessive capitalization, which makes letter case a useful signal for classification. Based on this result, we amend **tfidf_vectorizer_swn** to include lowercase = False. Use_idf is not changed, as True is the default setting.
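The effect of `lowercase = False` on the extracted features can be seen on two made-up headlines (the `docs` list is hypothetical): with case preserved, a capitalized word and its lowercase form become separate features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical headlines that differ mainly in capitalization
docs = [
    "BREAKING news SHOCKS the nation",
    "breaking news from the capital",
]

cased = TfidfVectorizer(lowercase=False).fit(docs)
folded = TfidfVectorizer(lowercase=True).fit(docs)

print(sorted(cased.vocabulary_))   # keeps 'BREAKING' and 'breaking' apart
print(sorted(folded.vocabulary_))  # merges them into 'breaking'
```

The case-preserving vocabulary is larger, which gives the classifier the chance to weight all-caps tokens differently from their lowercase counterparts.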

In [36]:
tfidf_vectorizer_swn = TfidfVectorizer(stop_words='english',sublinear_tf=True,
                                       min_df=5, norm='l2', 
                                       ngram_range=(1,2),max_df = 0.7,
                                       lowercase = False)

## 7. Support Vector Machine

One of the most commonly used linear models in text classification is the Support Vector Machine. Two common implementations were tried (linear SVC and SGDC), and SGDC was found to give much better results than linear SVC.

In [37]:
linear_clf_sgdc = SGDClassifier(loss='hinge', penalty='l2',
                    alpha=1e-3, random_state=42,
                    max_iter=5, tol=None)

In [38]:
text_clf_sgdc = Pipeline([('tfidf', tfidf_vectorizer_swn),
                         ('chi2', SelectKBest(chi2, k=20000)),
                         ('clf-svm-sgdc', linear_clf_sgdc)])

accuracy = cross_val_score(text_clf_sgdc, train.both, label, scoring='accuracy', cv=5)
print("Accuracy = " + str(accuracy))
print("Mean Acc = " + str(accuracy.mean()))

Accuracy = [0.9125     0.92875    0.91625    0.93617021 0.91239049]
Mean Acc = 0.921212140175219


The SGDC classifier without grid search has a higher mean accuracy (92.1%) than the Multinomial Naive Bayes classifier without grid search (90.1%). We will now attempt to enhance it further using parameter optimization.

### 7.1 Parameter optimization

**Grid search for SGDC**

    parameters_sgdc = {'tfidf__max_df': (0.5, 0.75, 1.0),
                      'clf-svm-sgdc__alpha': [0.00009, 0.00008],
                      'clf-svm-sgdc__penalty': ('l2', 'elasticnet')
                     }

    gs_clf_sgdc = GridSearchCV(text_clf_sgdc, cv=5, param_grid=parameters_sgdc, n_jobs=-1)
    gs_clf_sgdc = gs_clf_sgdc.fit(train.both, label)

    print (gs_clf_sgdc.best_score_)
    print (gs_clf_sgdc.best_params_)

The results of the Grid Search are as follows: 
   
       0.942971485743
    {'clf-svm-sgdv__alpha': 9e-05, 'clf-svm-sgdv__penalty': 'l2', 'tfidf__max_df': 0.5}
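The parameter names in the grids of this notebook follow scikit-learn's `<step name>__<parameter>` convention (two underscores) for addressing a parameter of a pipeline step. The sketch below shows the convention on a tiny hypothetical corpus; the step names `tfidf` and `clf`, the texts and the grid values are illustrative assumptions, not the settings used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, repeated so that each CV fold contains both classes
texts = [
    "SHOCKING secret the government is hiding",
    "You WON'T believe what happened next",
    "senate passes the annual budget bill",
    "court rules on the appeal this week",
] * 3
labels = ["FAKE", "FAKE", "REAL", "REAL"] * 3

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(loss="hinge", random_state=42)),
])

# Keys are "<step name>__<parameter name>"
grid = {
    "tfidf__lowercase": [True, False],
    "clf__alpha": [1e-4, 1e-3],
}

search = GridSearchCV(pipe, param_grid=grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
```

`best_params_` reports the winning combination under the same double-underscore keys that were used in the grid.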

## 8. Passive-Aggressive Classifier

This classifier belongs to a family of online learning models and is typically used to classify massive streams of data. Here we check what accuracy it achieves on this dataset.
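To make the "online learning" aspect concrete, the sketch below updates a `PassiveAggressiveClassifier` one mini-batch at a time with `partial_fit`. The batches are hypothetical, and the stateless `HashingVectorizer` is an assumption chosen so that no vocabulary has to be fitted in advance; the notebook itself trains on the full dataset in one go.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Stateless vectorizer: needs no fit, so it can be applied to a stream
vectorizer = HashingVectorizer(n_features=2**12)
clf = PassiveAggressiveClassifier(random_state=0)

# Hypothetical stream of small labelled batches
batches = [
    (["SHOCKING truth they are hiding", "senate passes budget bill"],
     ["FAKE", "REAL"]),
    (["you WON'T believe this secret", "court rules on appeal"],
     ["FAKE", "REAL"]),
]

for texts, labels in batches:
    # The full list of classes must be given so the model knows every label
    clf.partial_fit(vectorizer.transform(texts), labels,
                    classes=["FAKE", "REAL"])

pred = clf.predict(vectorizer.transform(["senate budget bill"]))
print(pred)
```

Each call to `partial_fit` adjusts the existing weights instead of retraining from scratch, which is what makes this family of models suitable for data that arrives continuously.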

In [35]:
text_clf_pac = Pipeline([('tfidf', tfidf_vectorizer_swn),
                         ('chi2', SelectKBest(chi2, k=20000)),
                         ('clf-PAS', PassiveAggressiveClassifier(max_iter=50))])

accuracy = cross_val_score(text_clf_pac, train.both, label, scoring='accuracy', cv=5)
print("Accuracy = " + str(accuracy))
print("Mean Acc = " + str(accuracy.mean()))

Accuracy = [0.95       0.94625    0.94       0.95494368 0.94242804]
Mean Acc = 0.9467243429286608


## 9. Test Data

In [39]:
test['both'] = test['title'] + test['text']

In [40]:
# Prediction for test data
text_clf_pac.fit(train.both, label)
pred = text_clf_pac.predict(test.both)

In [41]:
submission = pd.DataFrame({
            "News_id": test["ID"],
            "prediction": pred
        },columns = ['News_id', 'prediction'])

In [42]:
# Store submission dataframe into file
submission.to_csv("Assignment1_VLN.csv", index = False)

## 10. Conclusion

The classifier used for the final prediction was the Passive-Aggressive Classifier, with stopwords removed, case preserved (lowercase = False) and n-grams up to order 2. Stemming was attempted but did not increase accuracy.

Further investigation could include the following: 
1. Text Processing
    - Lemmatization
    - Sentiment Analysis
    - Punctuation Analysis (as fake news tends to use a lot more exclamation marks)
    
2. Classifier
    - Boosting Algorithms
    - Simple Neural Networks
    
3. Gathering insights from the classifiers to investigate how exactly they interpreted the data and arrived at their classifications. 
