| | |
|--|--|
| **Group** | *22* |
| **Group member 1** | *Alfred Schell* |
| **Group member 2** | *Laura Rusu* |


# Main idea:
We will follow five main steps:
1. Loading the data
2. Preprocessing
   - 2.1 Cleaning the datasets (concatenating the title and the text, creating a single dataset from the two)
   - 2.2 Tokenization (we will use the NLTK tokenizer)
   - 2.3 Lemmatization
   - 2.4 Splitting into unigrams and bigrams (depending on the chosen model)
   - 2.5 Removing stopwords (only if our accuracy proves to be insufficient)
3. Vectorizing the data
4. Training a model on the data (several models will be trained so that their accuracies can be compared)
   - 4.1 Logistic regression model
   - 4.2 Naive Bayes
   - 4.3 Support Vector Machines
   - 4.4 Unigram language model
   - 4.5 Bigram language model
5. Checking the accuracy of each variation to decide on the best option for the given problem
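
Steps 4.4 and 5 of the outline can be illustrated on toy data. The block below is only a minimal sketch and not the notebook's implementation: the add-one smoothing, the equal treatment of both classes (no class prior), the function names, and the toy articles are all assumptions made for illustration. The labels follow the convention used later in the notebook (0 = fake, 1 = true).

```python
import math
from collections import Counter

def train_unigram_model(tokenized_articles):
    """Count unigram frequencies over all articles of one class."""
    counts = Counter()
    for tokens in tokenized_articles:
        counts.update(tokens)
    return counts

def log_likelihood(tokens, counts, vocab_size):
    """Log-probability of an article under a unigram model with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[token] + 1) / (total + vocab_size)) for token in tokens)

def classify(tokens, fake_counts, true_counts, vocab_size):
    """Return 0 (fake) if the fake model scores strictly higher, otherwise 1 (true)."""
    fake_score = log_likelihood(tokens, fake_counts, vocab_size)
    true_score = log_likelihood(tokens, true_counts, vocab_size)
    return 0 if fake_score > true_score else 1

# Toy training data (hypothetical)
fake_train = [['shocking', 'trump', 'joke'], ['embarrassing', 'trump', 'joke']]
true_train = [['reuters', 'senate', 'budget'], ['reuters', 'senate', 'vote']]

fake_counts = train_unigram_model(fake_train)
true_counts = train_unigram_model(true_train)
vocab_size = len(set(fake_counts) | set(true_counts))

# Toy held-out articles paired with their labels
test_set = [(['shocking', 'joke'], 0), (['reuters', 'budget', 'vote'], 1)]
predictions = [classify(tokens, fake_counts, true_counts, vocab_size) for tokens, _ in test_set]

# Accuracy = share of held-out articles whose predicted label matches the true label
accuracy = sum(pred == label for pred, (_, label) in zip(predictions, test_set)) / len(test_set)
print(predictions, accuracy)
```

The same accuracy computation applies to any of the models in step 4, which is what makes the comparison in step 5 possible.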


## 1. Loading the data

In [78]:
#Basic libraries
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt 
from matplotlib import rcParams
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 5]
from collections import Counter
#from pandas import series

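#NLP libraries (NLTK and spaCy)
import nltk
import spacy
import string
from itertools import chain
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

#word_tokenize needs NLTK's 'punkt' tokenizer models: run nltk.download('punkt') once if they are missing
#Load the small English spaCy pipeline; its lemmas are available through token.lemma_,
#so the separate (unused) spacy.lemmatizer import is not needed
sp = spacy.load('en_core_web_sm')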

#from sklearn.feature_extraction.text import TfidfVectorizer
#from sklearn.feature_extraction.text import CountVectorizer

# Machine Learning libraries
import sklearn 
#from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
#from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB 
#from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
 
#Metrics libraries
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
#from sklearn.metrics import roc_auc_score
#from sklearn.metrics import roc_curve

#Miscellaneous libraries


#Ignore warnings
#import warnings
#warnings.filterwarnings('ignore')

In [51]:
#Reading the fake and true datasets
fake_news = pd.read_csv('Fake.csv')
true_news = pd.read_csv('True.csv')

#Print the shape of each dataset
print("The shape of the fake news data is (rows, columns): " + str(fake_news.shape)) #(23481, 4) - contains 23481 fake news articles
print("The shape of the true news data is (rows, columns): " + str(true_news.shape)) #(21417, 4) - contains 21417 true news articles

#Both datasets contain a title (the news headline), text (the article body), subject (type of news)
#and date (when the news was published)


The shape of the fake news data is (rows, columns): (23481, 4)
The shape of the true news data is (rows, columns): (21417, 4)
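
The two row counts reported above already fix a useful reference point for the accuracy comparison in step 5: a classifier that always predicts the larger class. The arithmetic below is a small sketch using only those two counts; the variable names are illustrative.

```python
# Row counts reported by the shape output above
n_fake = 23481
n_true = 21417

total = n_fake + n_true
share_fake = n_fake / total
share_true = n_true / total

# Accuracy of a trivial classifier that always predicts the larger class
majority_baseline = max(share_fake, share_true)

print(f"total articles: {total}")
print(f"fake share: {share_fake:.3f}, true share: {share_true:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

The classes are close to balanced, so any model worth keeping should score well above this baseline of roughly 0.52.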


In [52]:
fake_news.head()

| | title | text | subject | date |
|--|--|--|--|--|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |


In [53]:
true_news.head()

| | title | text | subject | date |
|--|--|--|--|--|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |


## 2. Preprocessing

### 2.1 Cleaning the datasets

In [54]:
#Creating target variables (0 = fake, 1 = true)
fake_news['Output'] = 0
true_news['Output'] = 1

#Concatenating title and article text, separated by a space so that
#the last word of the title does not fuse with the first word of the text
fake_news['News'] = fake_news['title'] + ' ' + fake_news['text']
true_news['News'] = true_news['title'] + ' ' + true_news['text']

#Removing the 'title' and 'text' columns
fake_news = fake_news.drop(['title', 'text'], axis=1)
true_news = true_news.drop(['title', 'text'], axis=1)

#Rearranging the columns
fake_news = fake_news[['subject', 'date', 'News','Output']]
true_news = true_news[['subject', 'date', 'News','Output']]

#Stacking both datasets; ignore_index=True gives every row a unique index
both_news = [fake_news, true_news]
news_dataset = pd.concat(both_news, ignore_index=True)
news_dataset.head()

| | subject | date | News | Output |
|--|--|--|--|--|
| 0 | News | December 31, 2017 | Donald Trump Sends Out Embarrassing New Year’... | 0 |
| 1 | News | December 31, 2017 | Drunk Bragging Trump Staffer Started Russian ... | 0 |
| 2 | News | December 30, 2017 | Sheriff David Clarke Becomes An Internet Joke... | 0 |
| 3 | News | December 29, 2017 | Trump Is So Obsessed He Even Has Obama’s Name... | 0 |
| 4 | News | December 25, 2017 | Pope Francis Just Called Out Donald Trump Dur... | 0 |
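
When a title and an article body are joined, the separator matters for the tokenization step that follows. The snippet below is a small illustration on a hypothetical title and text (not rows from the dataset): joining without a separator fuses the last word of the title with the first word of the body, while a single space keeps them apart.

```python
# Hypothetical title and body standing in for one row of the dataset
title = "Pope Francis Calls For Peace"
text = "Pope Francis used his annual message to ask for peace."

fused = title + text          # no separator: "Peace" and "Pope" merge into one word
spaced = title + " " + text   # a single space keeps the two words apart

fused_word = fused.split()[4]
spaced_words = spaced.split()[4:6]

print(fused_word)
print(spaced_words)
```

A fused word such as the one printed first would later be counted as a single, very rare token.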


### 2.2 Tokenization


In [55]:
#We use NLTK's off-the-shelf word tokenizer (word_tokenize)

#Translation table that deletes every ASCII punctuation character
punctuation_table = str.maketrans('', '', string.punctuation)

def tokenize_news(news):
    tokenized_news = []
    for article in news:
        tokens = word_tokenize(article)

        #Remove punctuation marks from each token
        tokens = [token.translate(punctuation_table) for token in tokens]

        #Drop tokens that are empty after punctuation removal
        tokens = [token for token in tokens if token]

        #Transform everything to lowercase
        tokens = [token.lower() for token in tokens]

        #Add start and end tokens
        tokenized_news.append(['<S>'] + tokens + ['<E>'])

    return tokenized_news

In [56]:
news_dataset['Tokenized'] = tokenize_news(news_dataset['News'])
news_dataset.head()

| | subject | date | News | Output | Tokenized |
|--|--|--|--|--|--|
| 0 | News | December 31, 2017 | Donald Trump Sends Out Embarrassing New Year’... | 0 | `[<S>, donald, trump, sends, out, embarrassing,...` |
| 1 | News | December 31, 2017 | Drunk Bragging Trump Staffer Started Russian ... | 0 | `[<S>, drunk, bragging, trump, staffer, started...` |
| 2 | News | December 30, 2017 | Sheriff David Clarke Becomes An Internet Joke... | 0 | `[<S>, sheriff, david, clarke, becomes, an, int...` |
| 3 | News | December 29, 2017 | Trump Is So Obsessed He Even Has Obama’s Name... | 0 | `[<S>, trump, is, so, obsessed, he, even, has, ...` |
| 4 | News | December 25, 2017 | Pope Francis Just Called Out Donald Trump Dur... | 0 | `[<S>, pope, francis, just, called, out, donald...` |
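
The clean-up applied to each token list can be tried in isolation. The sketch below reproduces the punctuation, lowercase, and boundary-marker steps on a hand-written token list that stands in for `word_tokenize` output (the sample tokens and the name `clean_tokens` are assumptions for illustration). It also shows one limitation worth knowing: `string.punctuation` covers ASCII characters only, so the typographic apostrophe `’` that appears in some headlines is not removed.

```python
import string

# Translation table that deletes ASCII punctuation only
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tokens(tokens):
    """Strip ASCII punctuation, drop empty tokens, lowercase, add boundary markers."""
    stripped = (token.translate(PUNCT_TABLE) for token in tokens)
    kept = [token.lower() for token in stripped if token]
    return ['<S>'] + kept + ['<E>']

# Hypothetical pre-split tokens standing in for word_tokenize output
sample = ['Trump', "'s", 'New', 'Year', '’', 's', 'message', '!', 'U.S.']
cleaned = clean_tokens(sample)
print(cleaned)
```

If the surviving `’` tokens turn out to matter, one option would be to extend the translation table with the extra characters; whether that is worthwhile depends on how often they occur in the data.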


### 2.3 Lemmatization
The lemmatized text is kept separate from the tokenized text (in its own 'Lemmatized' column) instead of overwriting it. Some of the methods we will employ, namely the unigram and bigram language models, do not benefit from lemmatized input, so it is safest to keep the two versions apart.

In [74]:
def lemmatize_news(news):
    lemmatized_news = []

    #Run every article through the spaCy pipeline and keep the lemma of each token
    #(token.lemma_ holds the output of spaCy's built-in lemmatizer)
    for doc in sp.pipe(news):
        lemmatized_news.append([token.lemma_ for token in doc])
    return lemmatized_news


In [75]:
news_dataset['Lemmatized'] = lemmatize_news(news_dataset['News'])
news_dataset.head()

### 2.4 Splitting into unigrams and bigrams
First, we prepare the token list for each type of n-gram (unigrams and bigrams) and count how often each n-gram occurs.

In [None]:
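#Flatten the tokenized articles into one list of unigrams
unigrams = list(chain.from_iterable(news_dataset['Tokenized']))
#Build the bigrams article by article so that no bigram spans two different articles
bigrams = list(chain.from_iterable(ngrams(tokens, 2) for tokens in news_dataset['Tokenized']))

def get_frequencies(ngram_list):
    #Counter tallies every n-gram in a single pass and stores the counts as integers;
    #calling list.count() once per n-gram would be quadratic in the corpus size
    return Counter(ngram_list)

unigram_frequencies = get_frequencies(unigrams)
bigram_frequencies = get_frequencies(bigrams)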
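
The counting step can be checked on a small example without NLTK. The sketch below is an illustration under assumptions: the toy articles and the helper name `count_ngrams` are made up, and n-grams are built with `zip` instead of `nltk.ngrams`. Note that unigrams come out as 1-tuples here, unlike the plain strings used in the notebook.

```python
from collections import Counter

def count_ngrams(tokenized_articles, n):
    """Count n-grams article by article so that none spans two articles."""
    counts = Counter()
    for tokens in tokenized_articles:
        # zip over n shifted views of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Toy stand-in for news_dataset['Tokenized'] (hypothetical data)
toy_articles = [
    ['<S>', 'trump', 'said', 'trump', 'won', '<E>'],
    ['<S>', 'trump', 'said', 'no', '<E>'],
]

unigram_counts = count_ngrams(toy_articles, 1)
bigram_counts = count_ngrams(toy_articles, 2)

print(unigram_counts[('trump',)])
print(bigram_counts[('trump', 'said')])
print(bigram_counts[('<E>', '<S>')])
```

Counting per article means the pair `('<E>', '<S>')` never appears, so the end of one article does not leak into the start of the next.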

### 2.5 Removing stopwords
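
Stopword removal on the tokenized articles could look roughly like the sketch below. It is an assumption-laden illustration: the stopword set is a tiny hand-written one so that the example is self-contained (the notebook's imports suggest NLTK's `stopwords.words('english')` list would be used instead), and the function name and toy article are hypothetical. The boundary markers `<S>` and `<E>` are kept so that the language models can still use them.

```python
# Small hypothetical stopword set used only for this illustration
STOPWORDS = {'the', 'a', 'an', 'is', 'of', 'to', 'and', 'in'}
BOUNDARY_TOKENS = {'<S>', '<E>'}

def remove_stopwords(tokenized_articles, stopwords=STOPWORDS):
    """Drop stopwords from each token list while keeping the boundary markers."""
    return [
        [token for token in tokens if token in BOUNDARY_TOKENS or token not in stopwords]
        for tokens in tokenized_articles
    ]

# Toy tokenized article (hypothetical)
toy_articles = [['<S>', 'the', 'pope', 'is', 'in', 'rome', '<E>']]
filtered = remove_stopwords(toy_articles)
print(filtered)
```

Because the function returns new lists, the unfiltered tokens stay available for comparing accuracy with and without stopwords.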