# Deliverable 1 modeling



## 0. Introduction

The goal of this project is to classify a news article as either "true" or "fake". The input is the body text of the article.



## 1. Libraries and Data





### 1.1 Importing Libraries


In [1]:
import csv
import random
import pandas as pd
import re
from string import ascii_letters, digits
from sklearn.metrics import accuracy_score


### 1.2 Downloading dataset CSVs

In [2]:
#fake csv
!wget https://raw.githubusercontent.com/peterghrong/fake_news_detection/master/dataset/Fake.csv

#true csv
!wget https://raw.githubusercontent.com/peterghrong/fake_news_detection/master/dataset/True.csv

--2020-10-16 19:10:36--  https://raw.githubusercontent.com/peterghrong/fake_news_detection/master/dataset/Fake.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 62789876 (60M) [text/plain]
Saving to: ‘Fake.csv’


2020-10-16 19:10:38 (61.5 MB/s) - ‘Fake.csv’ saved [62789876/62789876]

--2020-10-16 19:10:38--  https://raw.githubusercontent.com/peterghrong/fake_news_detection/master/dataset/True.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53582940 (51M) [text/plain]
Saving to: ‘True.csv’


2020-10-16 19:10:40 (54.2 MB/s) - ‘True.csv’

### 1.3 Loading news articles into local memory

In [3]:
## A small helper to preview one article together with its label
## (flag is 1 / True for real news, 0 / False for fake news)
def print_review(review, flag):
  if flag:
    print('----------------Real news-----------------')
  else:
    print('----------------Fake news-----------------')
  print(review)
  print('------------------------------------------')
  print()


In [4]:
fake_csv = pd.read_csv('Fake.csv')
true_csv = pd.read_csv('True.csv')

## we will drop all columns except the content of the news
fake_csv.drop(["title", "subject", "date"], axis=1, inplace=True)
true_csv.drop(["title", "subject", "date"], axis=1, inplace=True)
## label each article: 0 = fake, 1 = true
fake_csv['nature'] = 0
true_csv['nature'] = 1

## DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
dataset = pd.concat([fake_csv, true_csv])
## below returns the shuffled dataset
shuffled_dataset = dataset.sample(frac=1)
print(shuffled_dataset.head(5))
final_data = shuffled_dataset.to_records(index=False)

                                                    text  nature
1900   Larry King has been in the radio and journalis...       0
13371  We live near the city of Detroit, and anyone c...       0
20839  Wait for the Left to blame micro-aggressions b...       0
9902   Like mother, like daughter Brainiac Chelsea Cl...       0
829    According to the results from a recent survey ...       0


In [5]:
print(len(true_csv))
print(len(fake_csv))
print(len(shuffled_dataset))
print(len(final_data))
print(final_data[4])

21417
23481
44898
44898
('According to the results from a recent survey by Public Policy Polling, only 45 percent of Donald Trump voters polled wholeheartedly believe that Donald Trump Jr. met with Russian lawyer Natalia Veselnitskaya, despite the fact that the U.S. President s eldest son has openly admitted doing so.President Trump s alleged collusion with Russia has been one of the main stories doing the rounds since before his inauguration, but more fuel was added to the fire when the New York Times broke the news at the beginning of the month of Donald Trump Jr. s meeting with Veselnitskaya, a Russian lawyer with ties to Russian leader Vladimir Putin, and six others in an effort to obtain damaging information about his father s campaign rival, Hillary Clinton, back in June 2016.Donald Trump Jr. has since admitted that the meeting took place and has even given details of the attendees, including former Trump campaign chairman Paul Manafort and Trump s son-in-law and now-senior advis

## 2. Preprocessing

### 2.1 Dataset cleaning

We lowercase all letters, strip the redundant "(Reuters)" keyword as required by the specification, and remove every character except letters and spaces.

In [6]:
total = len(final_data)
num_training = round(total * 0.8)
num_validation = round((total - num_training)/2)
num_testing = total - num_training - num_validation

In [7]:
def clean(review):
  ## lower case
  review_lower = review.lower()
  ## drop the "(reuters)" tag; this only matches when it has a space on both sides
  review_v2 = review_lower.replace(' (reuters) ', '')

  ## keep only letters and spaces
  pat = re.compile(r'[^a-zA-Z ]+')
  cleaned = re.sub(pat, '', review_v2)

  return cleaned

In [8]:
print(clean(final_data[4][0]))
example = clean(final_data[4][0])
print(len(example.split()))

according to the results from a recent survey by public policy polling only  percent of donald trump voters polled wholeheartedly believe that donald trump jr met with russian lawyer natalia veselnitskaya despite the fact that the us president s eldest son has openly admitted doing sopresident trump s alleged collusion with russia has been one of the main stories doing the rounds since before his inauguration but more fuel was added to the fire when the new york times broke the news at the beginning of the month of donald trump jr s meeting with veselnitskaya a russian lawyer with ties to russian leader vladimir putin and six others in an effort to obtain damaging information about his father s campaign rival hillary clinton back in june donald trump jr has since admitted that the meeting took place and has even given details of the attendees including former trump campaign chairman paul manafort and trump s soninlaw and nowsenior adviser jared kushner however many trump voters aren t 
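The `replace(' (reuters) ', '')` call in `clean` only matches when the tag has a space on both sides, so an article that begins with `(Reuters) -` keeps the word. Below is a minimal sketch of a more tolerant cleaner, assuming the tag is always written as `(Reuters)` in parentheses. `clean_v2` is a hypothetical name and is not used elsewhere in this notebook.

```python
import re

# Hypothetical alternative to clean(); assumes the agency tag is written "(Reuters)".
_REUTERS = re.compile(r'\(reuters\)')
_NON_LETTERS = re.compile(r'[^a-z ]+')

def clean_v2(text):
  lowered = text.lower()
  no_tag = _REUTERS.sub('', lowered)            # drop the tag wherever it occurs
  letters_only = _NON_LETTERS.sub('', no_tag)   # keep letters and spaces only
  return ' '.join(letters_only.split())         # collapse repeated spaces

print(clean_v2("(Reuters) - President-elect Donald Trump is considering"))
print(clean_v2("WASHINGTON (Reuters) - Republican presidential hopeful"))
```

Unlike `clean`, this version also collapses runs of spaces, which makes no difference once the text is split into words.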

In [9]:
X_train_data = []
Y_train_data = []

X_valid_data = []
Y_valid_data = []

X_test_data = []
Y_test_data = []

## this step loads all cleaned strings into their respective array

for i in range(num_training):
  X_train_data.append(clean(final_data[i][0]))
  Y_train_data.append(int(final_data[i][1]))

for j in range(num_validation):
  X_valid_data.append(clean(final_data[j+num_training][0]))
  Y_valid_data.append(int(final_data[j+num_training][1]))

for k in range(num_testing):
  X_test_data.append(clean(final_data[k+num_training+num_validation][0]))
  Y_test_data.append(int(final_data[k+num_training+num_validation][1]))
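The three loops above amount to slicing the shuffled records at two cut points. A minimal sketch of the same 80/10/10 split on toy data follows; `split_dataset` is a hypothetical helper, not part of the notebook.

```python
def split_dataset(records, train_frac=0.8):
  """Split records into train/validation/test; the part left after training is halved."""
  total = len(records)
  n_train = round(total * train_frac)
  n_valid = round((total - n_train) / 2)
  train = records[:n_train]
  valid = records[n_train:n_train + n_valid]
  test = records[n_train + n_valid:]   # whatever remains goes to the test set
  return train, valid, test

toy = [("text %d" % i, i % 2) for i in range(10)]
train, valid, test = split_dataset(toy)
print(len(train), len(valid), len(test))
```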

In [10]:
print(len(X_train_data))
print(X_train_data[50])

print(len(X_valid_data))
print(X_valid_data[50])

print(len(X_test_data))
print(X_test_data[50])

35918
reuters  presidentelect donald trump is considering dallas investor ray washburne as a possible interior secretary cnbc reported on monday citing unnamed sources on trumps transition team washburnes company charter holdings is involved in real estate restaurants and diversified financial investments a top republican fundraiser washburne has served as vice chair of trump victory committee 
4490
washington republican presidential hopeful ted cruz is leaning on new sources of cash as he prepares for a long primary fight against frontrunner donald trump with new campaign finance filings showing the expense of competing against a billionaire adept at grabbing headlines cruzs more traditional campaign has struggled to compete with trump the us senator from texas poured money into advertising staff and calls to voters spending  million more in february than he raised as he tried to outmaneuver trump according to campaign finance records made public on sunday but the effort had a limited

### 2.2 Picking features

This initial implementation uses only a bag-of-words representation; more features will be added later to improve the predictions. The vocabulary size (the number of words kept) can be chosen by cross-validation to find the best value for the current implementation.

In [11]:
def get_vocab(reviews, vocab_size):
  # use a dictionary to count how often each word is used
  counter = {}
  for string in reviews:
    for word in string.split():
      counter[word] = counter.get(word, 0) + 1

  # sort the words by descending count and keep the vocab_size most used ones
  sorted_words = sorted(counter, key=counter.get, reverse=True)
  return sorted_words[:vocab_size]
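The same "most frequent words" selection can also be sketched with the standard library's `collections.Counter`. `get_vocab_counter` is a hypothetical name, and the three toy documents are made up for illustration.

```python
from collections import Counter

def get_vocab_counter(reviews, vocab_size):
  # Count every whitespace-separated token across all documents.
  counts = Counter(word for text in reviews for word in text.split())
  # most_common returns (word, count) pairs sorted by descending count.
  return [word for word, _ in counts.most_common(vocab_size)]

docs = ["the cat sat", "the dog sat", "the bird flew"]
print(get_vocab_counter(docs, 2))  # ['the', 'sat']
```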

In [12]:
num_features = 10000
vocabulary = get_vocab(X_train_data, num_features)
print(len(vocabulary))
print(vocabulary)

10000


In [13]:
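def vectorize(review_string, vocab):
  ## returns a binary vector: entry i is 1 if vocab[i] occurs in the review, else 0
  spl_string = review_string.split()
  size=len(vocab)
  lst = [0]*size
  for word in spl_string:
    ## both the membership test and index() scan the vocab list, which is slow
    if word in vocab:
      index = vocab.index(word)
      lst[index] = 1

  return lst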

### 2.3 Vectorising the datasets 
Because vectorisation is computationally intensive and time-consuming, we vectorise each split once and keep the results in memory.

Note: vectorising the training set can take around 10 minutes.
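Much of that time is likely spent in `word in vocab` and `vocab.index(word)`, which both scan a 10,000-element list for every word. A minimal sketch of a faster variant maps each word to its column once and then uses dictionary lookups. `build_index` and `vectorize_fast` are hypothetical names, and the sketch assumes the vocabulary contains no duplicate words.

```python
def build_index(vocab):
  # Map each word to its column so lookups are O(1) instead of a list scan.
  return {word: i for i, word in enumerate(vocab)}

def vectorize_fast(review_string, index):
  vec = [0] * len(index)
  for word in review_string.split():
    col = index.get(word)
    if col is not None:
      vec[col] = 1   # binary presence, same convention as vectorize
  return vec

vocab = ["the", "trump", "said"]
index = build_index(vocab)
print(vectorize_fast("trump said that trump won", index))  # [0, 1, 1]
```

The index would be built once from `vocabulary` and reused for the training, validation and test sets.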

In [14]:
## this step vectorises the raw training dataset
X_train_vect = []
for string in X_train_data:
  vector = vectorize(string, vocabulary)
  X_train_vect.append(vector)

In [15]:
## this vectorises the validation dataset
X_valid_vect = []
for string in X_valid_data:
  vector = vectorize(string, vocabulary)
  X_valid_vect.append(vector)

In [16]:
## this vectorises the testing vector dataset 
X_test_vect = []
for string in X_test_data:
  vector = vectorize(string, vocabulary)
  X_test_vect.append(vector)

In [17]:
for i in range(5):
  print("review of nature", Y_train_data[i])
  print(X_train_vect[i])

review of nature 0
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## 3. Naive Bayes Implementation

We use Naive Bayes to decide whether a given body of text is fake news or real news. The algorithm outputs a probability in [0, 1] that the article is real, and the article is classified as true if this probability is greater than 0.5.
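For reference, under the Naive Bayes assumption that the binary word features $x_i$ are conditionally independent given the class $y$, the posterior can be written as below. The symbol $\theta_{ic}$ is notation introduced here for $P(x_i = 1 \mid y = c)$, and $n$ is the vocabulary size.

```latex
P(y = 1 \mid x) =
  \frac{P(y = 1) \prod_{i=1}^{n} P(x_i \mid y = 1)}
       {\sum_{c \in \{0, 1\}} P(y = c) \prod_{i=1}^{n} P(x_i \mid y = c)},
\qquad
P(x_i \mid y = c) = \theta_{ic}^{\,x_i} \, (1 - \theta_{ic})^{1 - x_i}
```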

### 3.1 Estimating the probability Distribution

We estimate the class priors and the per-word conditional probabilities from the training set and keep them in memory for the classifier.

In [18]:
### Answer starts here ###

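## count, per class, how many training examples contain each vocabulary word
num_o_rows = len(Y_train_data)
num_o_cols = len(X_train_vect[0])

y_is_1 = 0
y_is_0 = 0

## one counter per feature (column), not per training row
prX1given1 = [0] * num_o_cols
prX1given0 = [0] * num_o_cols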

prX0given1 = []
prX0given0 = []



for row_num in range(num_o_rows):
  instanceofX = X_train_vect[row_num]
  if Y_train_data[row_num] == 1:
    y_is_1 += 1
    for col_num in range(num_o_cols):
      if (instanceofX[col_num] == 1):
        prX1given1[col_num] += 1
  else:
    y_is_0 += 1
    for col_num in range(num_o_cols):
      if (instanceofX[col_num] == 1):
        prX1given0[col_num] += 1

#finds the probability
prY1 = y_is_1/num_o_rows
prY0 = 1- prY1
for i in range(len(prX1given1)): 
  prX1given1[i] /= y_is_1
  prX1given0[i] /= y_is_0
  prX0given1.append(1-prX1given1[i])
  prX0given0.append(1-prX1given0[i])

#P(y = 0)
print(prY0)
#P(y = 1)
print(prY1)

#P(xi=1|y=1)
print(prX1given1)

#P(xi=0|y=1)
print(prX0given1)

#P(xi=1|y=0)
print(prX1given0)

#P(xi=0|y=0)
print(prX0given0)

### Answer ends here ###

0.5197394064257476
0.48026059357425244
[0.9964057971014493, 0.9663768115942029, 0.9529275362318841, 0.9701449275362319, 0.9390144927536231, 0.9522898550724638, 0.8455072463768116, 0.9901449275362318, 0.4446376811594203, 0.820463768115942, 0.7503768115942029, 0.9372753623188406, 0.648463768115942, 0.7034782608695652, 0.7656811594202898, 0.4249855072463768, 0.7172173913043478, 0.7077101449275363, 0.5371014492753623, 0.7354782608695652, 0.6972753623188406, 0.6333333333333333, 0.6468985507246376, 0.6545507246376812, 0.675536231884058, 0.5164637681159421, 0.6404057971014493, 0.5333913043478261, 0.5420869565217391, 0.6657391304347826, 0.452, 0.6278260869565218, 0.3085797101449275, 0.5672463768115942, 0.4121739130434783, 0.5614492753623188, 0.5909565217391304, 0.4252173913043478, 0.5208115942028986, 0.41744927536231885, 0.5327536231884058, 0.13779710144927537, 0.08457971014492753, 0.4871304347826087, 0.4668405797101449, 0.3635942028985507, 0.39704347826086955, 0.4220869565217391, 0.5245217391
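The estimates above are plain relative frequencies, so a word that never appears in one class gets probability exactly 0 for that class, which can zero out the whole product in the classifier. A common remedy is add-one (Laplace) smoothing. The sketch below shows that technique on toy data; it is not what this notebook computes, and `estimate_smoothed` and `alpha` are hypothetical names.

```python
def estimate_smoothed(X, y, alpha=1.0):
  """Return (P(y=1), P(x_i=1|y=1), P(x_i=1|y=0)) with add-alpha smoothing."""
  n_cols = len(X[0])
  n1 = sum(y)
  n0 = len(y) - n1
  ones_given1 = [0] * n_cols
  ones_given0 = [0] * n_cols
  for row, label in zip(X, y):
    target = ones_given1 if label == 1 else ones_given0
    for col, value in enumerate(row):
      target[col] += value
  # Each feature has two possible values (0 or 1), hence 2 * alpha in the denominator.
  p1 = [(c + alpha) / (n1 + 2 * alpha) for c in ones_given1]
  p0 = [(c + alpha) / (n0 + 2 * alpha) for c in ones_given0]
  return n1 / len(y), p1, p0

X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
prior1, p_given1, p_given0 = estimate_smoothed(X, y)
print(prior1, p_given1, p_given0)  # 0.5 [0.75, 0.5] [0.25, 0.5]
```

In the toy data the first word never occurs in class 0, yet its smoothed estimate is 0.25 rather than 0.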

### 3.2 Creating the Naive Bayes Classifier

The function `naive_bayes` outputs the class with the largest posterior probability given the input features.

In [19]:
def naive_bayes(vec):
  ### Answer starts here ###
  ## cond accumulates P(vec | y = 1), cond2 accumulates P(vec | y = 0)
  cond = 1
  cond2 = 1
  
  ##calculate the probability that y = 1 given the vector
  for i in range(len(vec)):
    if vec[i] == 1:
      cond = cond*prX1given1[i]
      cond2 = cond2*prX1given0[i]
    else:
      cond = cond*prX0given1[i]
      cond2 = cond2 *prX0given0[i]
    
  numerator = prY1 * cond
  denom = numerator + prY0 * cond2

  ## both products can underflow to 0 for long vectors; fall back to class 0
  if denom == 0 :
    return 0
  else:
    p_is_1 = numerator/denom
    if p_is_1 > 0.5:
      return 1
    else:
      return 0

  ### Answer ends here ###
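Multiplying 10,000 probabilities that are each below 1 can underflow to 0.0 in floating point, in which case both class scores vanish and the function above falls back to returning 0. A common workaround is to compare sums of log-probabilities instead. The sketch below takes the probability tables as arguments so that it runs on its own. `naive_bayes_log` is a hypothetical name, and the sketch assumes every probability is strictly between 0 and 1 (for example after smoothing), because `log(0)` is undefined.

```python
import math

def naive_bayes_log(vec, prior1, p1_given1, p1_given0):
  """Return 1 if log P(y=1, vec) > log P(y=0, vec), else 0."""
  score1 = math.log(prior1)
  score0 = math.log(1 - prior1)
  for x, p1, p0 in zip(vec, p1_given1, p1_given0):
    if x == 1:
      score1 += math.log(p1)
      score0 += math.log(p0)
    else:
      score1 += math.log(1 - p1)
      score0 += math.log(1 - p0)
  # Comparing the joint log-scores is equivalent to testing posterior > 0.5.
  return 1 if score1 > score0 else 0

# Toy tables with 10,000 features: the plain product underflows, the log-sum does not.
n = 10000
p1_given1 = [0.3] * n
p1_given0 = [0.2] * n
vec = [1] * n
print(0.3 ** n)                                          # 0.0 (underflow)
print(naive_bayes_log(vec, 0.5, p1_given1, p1_given0))   # 1
```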

### 3.3 Measuring performance 

In [20]:
y_test_pred = []
y_valid_pred = []

for j in X_valid_vect:
  y_valid_pred.append(naive_bayes(j))

valid_accu = accuracy_score(Y_valid_data, y_valid_pred)
print("---------------------------------------------------------")
print("Validation accuracy is ", valid_accu)
print()


for i in X_test_vect:
  y_test_pred.append(naive_bayes(i))
test_accu = accuracy_score(Y_test_data, y_test_pred)
print("---------------------------------------------------------")
print("Test accuracy is", test_accu)
print()


---------------------------------------------------------
Validation accuracy is  0.810467706013363

---------------------------------------------------------
Test accuracy is 0.8155902004454343
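Accuracy alone does not show which class the errors fall on. Below is a small sketch that computes confusion-matrix counts, precision and recall for the "true" class (label 1) from two label lists. `confusion_counts` is a hypothetical helper, and the labels are toy values, not results from this notebook.

```python
def confusion_counts(y_true, y_pred):
  """Count outcomes, treating label 1 (true news) as the positive class."""
  tp = fp = tn = fn = 0
  for truth, pred in zip(y_true, y_pred):
    if truth == 1 and pred == 1:
      tp += 1
    elif truth == 0 and pred == 1:
      fp += 1
    elif truth == 0 and pred == 0:
      tn += 1
    else:
      fn += 1
  return tp, fp, tn, fn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(tp, fp, tn, fn)
print(precision, recall)
```

With the notebook's variables, the same helper could be called as `confusion_counts(Y_test_data, y_test_pred)`.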



In [29]:
print(naive_bayes(vectorize("In an effort to help socially conscious subscribers avoid the judgment of their peers, Amazon reportedly began offering a new blank box upcharge Tuesday for progressive members to discreetly receive their Prime orders. “For just $3 per shipment, Amazon users who are outwardly critical of our company can have their packages delivered in a blank cardboard box without any logos or branding so they’ll never get called out for being hypocritical,” read the company’s press release, which explained how the packages would be delivered by a plainclothes employee in a nondescript white van so as to avoid drawing any suspicion. “Our new concealment services are perfect for any activist type who publicly condemns our unethical business practices yet sometimes needs to order some disposable razors on a two-day turnaround. We have already begun rolling out the service in liberal hubs such as Portland, Berkeley, and New York, and will be expanding to other cities in the coming months.” The press release also explained that for an extra $2, each package could be branded with the logo of a struggling, non-profit bookstore that helps disenfranchised communities or a local farming co-op.", vocabulary)))

0
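Note that the call above passes the raw article to `vectorize`, whereas the vocabulary was built from text that had been through `clean`. Capitalised or punctuated tokens such as `Amazon` or `box.` therefore cannot match any vocabulary word. The sketch below illustrates the effect with toy stand-ins so that it runs on its own. `predict`, `toy_clean`, `toy_vectorize` and `toy_classify` are hypothetical names; in the notebook the corresponding pieces would be `clean`, `vectorize`, `naive_bayes` and `vocabulary`.

```python
def predict(text, clean_fn, vectorize_fn, classify_fn, vocab):
  """Apply the training-time cleaning before vectorising and classifying."""
  return classify_fn(vectorize_fn(clean_fn(text), vocab))

# Toy stand-ins for the notebook's functions.
toy_vocab = ["amazon", "box"]

def toy_clean(text):
  return ''.join(ch for ch in text.lower() if ch.isalpha() or ch == ' ')

def toy_vectorize(text, vocab):
  words = text.split()
  return [1 if word in words else 0 for word in vocab]

def toy_classify(vec):
  return int(sum(vec) > 0)

raw = "Amazon Box!"
print(toy_vectorize(raw, toy_vocab))              # [0, 0]: raw tokens miss the vocabulary
print(toy_vectorize(toy_clean(raw), toy_vocab))   # [1, 1]: cleaned tokens match
print(predict(raw, toy_clean, toy_vectorize, toy_classify, toy_vocab))
```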
