# Section 2: Creating a Baseline Model for Fake News Classification


In this notebook we'll be:
1.   Performing Feature Selection
2.   Understanding Performance Metrics



In [None]:
!pip install kora -q
from kora import jupyter
jupyter.start(lab=True)

In [5]:
#@title Run this code to get started
import math
import os
import numpy as np
import pandas as pd

import pickle

import requests, io, zipfile

# Download class resources...

basepath = '/content/gdrive/Colab Notebooks/Fake_news_data'

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix


In [None]:
fake_data = pd.read_csv(os.path.join(basepath, 'Fake.csv'))
true_data = pd.read_csv(os.path.join(basepath, 'True.csv'))

## Exercise 1

Below we introduce some code for taking our training and val data and producing X, y examples that can be fit by a logistic regression model. This code extracts a few basic features from the domain name extension of the website. **Your task: add features testing whether the domain name ends in ".co", ".tv", and ".news", according to the template below.**


In [None]:
def prepare_data(data, featurizer):
    X = []
    y = []
    for datapoint in data:
        url, html, label = datapoint
        # We convert all text in HTML to lowercase, so <p>Hello.</p> is mapped to
        # <p>hello</p>. This will help us later when we extract features from 
        # the HTML, as we will be able to rely on the HTML being lowercase.
        html = html.lower() 
        y.append(label)

        features = featurizer(url, html)

        # Gets the keys of the dictionary as descriptions, gets the values
        # as the numerical features. Don't worry about exactly what zip does!
        feature_descriptions, feature_values = zip(*features.items())

        X.append(feature_values)

    return X, y, feature_descriptions
  
# Returns a dictionary mapping from plaintext feature descriptions to numerical
# features for a (url, html) pair.
def domain_featurizer(url, html):
    features = {}
    
    # Binary features for the domain name extension.
    features['.com domain'] = url.endswith('.com')
    features['.org domain'] = url.endswith('.org')
    features['.net domain'] = url.endswith('.net')
    features['.info domain'] = url.endswith('.info')
    features['.biz domain'] = url.endswith('.biz')
    features['.ru domain'] = url.endswith('.ru')
    features['.co.uk domain'] = url.endswith('.co.uk')
    
    ### YOUR CODE HERE ###
    features['.co domain'] = url.endswith('.co')
    
    ### END CODE HERE ###
    
    return features

## Instructor-Led Discussion: Deciding Inputs to our Model

Make sure you understand what the code above is doing. It produces X, y such that X contains a list of features for each site in the dataset, and y contains the labels in corresponding order. *feature_descriptions* is a list of the names of features (e.g., '.com domain'). This will be important later when we want to know the names of features when interpreting the model. Let's run our code for processing the data on the train and val sets from yesterday.
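To make the `zip(*features.items())` step in *prepare_data* less mysterious, here is a minimal sketch on a made-up feature dictionary. The URL and the two features are hypothetical and are not drawn from the dataset.

```python
# Hypothetical feature dictionary for a single made-up URL (not from the dataset).
url = 'example.org'
features = {
    '.com domain': url.endswith('.com'),
    '.org domain': url.endswith('.org'),
}

# items() yields (description, value) pairs; zip(*...) regroups them into
# one tuple of descriptions and one tuple of values, in matching order.
feature_descriptions, feature_values = zip(*features.items())

print(feature_descriptions)
print(feature_values)
```

Because both tuples come from the same dictionary, the i-th description always names the i-th value, which is what lets us match model weights back to feature names later.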



## Exercise 2


Your task: call *prepare_data* twice, once on *train_data* and *domain_featurizer* and once on *val_data* and *domain_featurizer*. Save the results as *train_X, train_y, feature_descriptions* and *val_X, val_y, feature_descriptions*.

In [None]:
with open(os.path.join(basepath, 'train_val_data.pkl'), 'rb') as f:
  train_data, val_data = pickle.load(f)
  
print('Number of train examples:', len(train_data))
print('Number of val examples:', len(val_data))
  
### YOUR CODE HERE ###

### END CODE HERE ###

print('Number of features per example:', len(train_X[0]))
print('Feature descriptions:')
print(feature_descriptions)

Now to train on our featurized data. We use scikit-learn as in the previous week, because it makes it easy to quickly iterate on different types of models.



## Exercise 3


Another quick exercise: load the LogisticRegression model from scikit-learn with default parameters (no arguments to the constructor). Then fit it on *train_X* and *train_y* (~5 minutes). 

In [None]:
### YOUR CODE HERE ###

### END CODE HERE ###

## Exercise 4 



In [None]:
train_y_pred = baseline_model.predict(train_X)
print('Train accuracy', accuracy_score(train_y, train_y_pred))

We can see that we are not doing very well, but we are doing better than 50%. We can do the same for the val data to see how we are doing on unseen data, which is more valuable for us if we want to make predictions on new websites. Fill in the code below to evaluate val accuracy!

In [None]:
### YOUR CODE HERE ###

### END CODE HERE ###
print('Val accuracy', accuracy_score(val_y, val_y_pred))

We appear to be doing a bit worse on the val data, not much better than chance. To better understand the performance of our binary classification model, we should seek to better understand the mistakes that it is making. Specifically, when our model makes a mistake (about 40% of the time), are these mistakes false negatives or false positives?

## Confusion Matrix 

To answer these questions, we produce and analyze the confusion matrix. The confusion matrix is a matrix that shows the following:

![Confusion Matrix](https://cdn-images-1.medium.com/max/1600/1*Z54JgbS4DUwWSknhDCvNTQ.png)

where the terms mean

* TP (True Positive) = You predicted positive (fake in our case, since fake has a label of 1) and it’s true.
* FP (False Positive) = You predicted positive and it’s false.
* FN (False Negative) = You predicted negative and it’s false.
* TN (True Negative) = You predicted negative and it’s true.
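As a rough illustration of the four definitions above, the sketch below counts each outcome by hand on a small set of made-up labels. The helper name `confusion_counts` and the toy labels are assumptions for this example only.

```python
# Toy labels, made up for illustration: 1 = fake (positive), 0 = real (negative).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print('TP:', tp, 'FP:', fp, 'FN:', fn, 'TN:', tn)
```

Note that scikit-learn's `confusion_matrix(y_true, y_pred)` puts true labels on the rows and predicted labels on the columns, so for labels 0 and 1 it prints the counts as `[[TN, FP], [FN, TP]]`.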


### Common Metrics

From the confusion matrix, we can extract commonly used metrics like precision (TP/(TP + FP)) and recall (TP/(TP + FN)). 

* Precision quantifies how often the things we classify as positive are actually positive. For our task, this measures what fraction of the sites we classify as fake are actually fake. 
* Recall quantifies what fraction of actually positive examples we classify as positive. In our case, this is the fraction of fake news websites that we actually identify as fake.

Finally, a useful score to summarize both precision and recall is the F-1 score. This is just a simple function (the harmonic mean) of precision and recall, shown in the summary below:

<img src="https://datascience103579984.files.wordpress.com/2019/04/capture3-24.png" width="400" height="200"></img>
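The three metrics can be computed directly from the confusion-matrix counts. Below is a minimal sketch; the function name `precision_recall_f1` and the counts passed to it are hypothetical, and returning 0.0 when a denominator is zero is a choice made for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Guard against division by zero when there are no positive predictions
    # or no positive examples; returning 0.0 is a choice made for this sketch.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 30 fake sites caught, 10 real sites wrongly flagged,
# 60 fake sites missed.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=60)
print('Precision:', precision)
print('Recall:', recall)
print('F1:', f1)
```

With these invented counts, precision is high while recall is low, and the F1 score lands between the two but closer to the lower value, which is the typical behavior of a harmonic mean.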

## Exercise 5 | Using the Confusion Matrix

Run the cell below to create the confusion matrix for our own model. 

In [None]:
print('Confusion matrix:')
print(confusion_matrix(val_y, val_y_pred))

A confusion matrix can quickly tell you how well your model is doing. One simple way to summarize it is to calculate the error rate.

The error rate is: (FP + FN) / (TP + FP + FN + TN).

This is just all the false predictions (false positives plus false negatives) divided by the total number of predictions.

Use the Confusion Matrix we just created to calculate the Error Rate for our model. 

In [None]:
### YOUR CODE HERE ###

### END CODE HERE ###
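If you get stuck, here is one possible sketch of the calculation on an invented matrix. The numbers are hypothetical, and the layout `[[TN, FP], [FN, TP]]` is assumed to match what scikit-learn prints for labels 0 and 1; substitute the values from your own matrix.

```python
# Hypothetical 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]];
# the numbers are invented for illustration.
matrix = [[50, 5],
          [35, 10]]

# Unpack the two rows into the four counts.
(tn, fp), (fn, tp) = matrix

total = tp + fp + fn + tn
error_rate = (fp + fn) / total
accuracy = (tp + tn) / total

print('Error rate:', error_rate)
print('Accuracy:', accuracy)
```

The error rate and the accuracy always add up to 1, so either one can be recovered from the other.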

## Exercise 6

In [None]:
print(val_y_pred)

We can see that we have many false negatives, and not as many false positives. Why is this the case? If we print out *val_y_pred*, we can see that our model is mostly predicting 0's (websites are real).

What fraction of predictions in *val_y_pred* are 1's? Hint: you may find *np.mean* useful.

In [None]:
### YOUR CODE HERE ###

### END CODE HERE ###
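The hint works because averaging a list of 0's and 1's gives the fraction of 1's. A tiny sketch on made-up predictions (not the real *val_y_pred*), written in plain Python:

```python
# Made-up predictions: 1 = predicted fake, 0 = predicted real.
preds = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# Summing 0/1 values counts the 1's, so the mean is the fraction of 1's.
fraction_ones = sum(preds) / len(preds)
print(fraction_ones)
```

Calling *np.mean* on the same list should give the same value.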

Why so many 0's? The only information we are giving our model about each website is its domain name extension. It's natural that the model would learn that websites with ".biz" extensions are unlikely to be reliable news websites, but most websites in the dataset (fake and real) have ".com" extensions. Thus, our model will misclassify many fake news websites with ".com" extensions as real. 

In [None]:
prf = precision_recall_fscore_support(val_y, val_y_pred)

print('Precision:', prf[0][1])
print('Recall:', prf[1][1])
print('F-Score:', prf[2][1])

Again, the precision and recall metrics suggest that when we classify a website as fake, we are usually right (reasonably high precision), but we flag too few of the fake websites as fake (low recall).

## Using Keywords for a Stronger Baseline

The key problem with our model in its current state is that it simply does not have enough information. This should not be a surprise: it was pretty unlikely in the first place that domain name extensions would be enough. If you like, feel free to add a few more extensions in the “featurizer” above and re-run all the code for evaluation; you'll find it doesn't make much of a difference.

Where can we get more information about webpages? From the HTML! Remember that the HTML contains all of the text and structure of a webpage. If we cleverly choose features from the HTML to feed into our logistic regression model, we can improve our performance substantially. We saw yesterday that probing hypotheses about the counts of certain words produced interesting results, and we will continue in this direction today to produce a model that leverages these differences in word frequencies.


## Exercise 7: Instructor-Led Discussion on Better Input Features



The below code introduces a better featurizer that counts the number of keywords (normalized using the *log* function) in the HTML. Normalizing the counts is a trick that prevents the featurized values from becoming too extreme. Read the code and make sure you understand what it is doing. Then add "sports" and "finance" as additional keywords to expand our model.

**Run the below code and discuss what it is doing as a class. Add in additional keywords to further expand our model as you see fit.**
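To see why the log keeps values from becoming too extreme, the sketch below applies the same formula, log(1 + count), to made-up HTML strings in which a keyword appears 0, 1, 10, and 1000 times. The strings are invented for illustration and are assumed to be lowercase already, as *prepare_data* guarantees for real pages.

```python
import math

def get_normalized_count(html, phrase):
    # Same formula as the notebook's helper: log(1 + raw count).
    return math.log(1 + html.count(phrase.lower()))

# Made-up HTML snippets in which the keyword appears 0, 1, 10, and 1000 times.
for n in [0, 1, 10, 1000]:
    html = 'obama ' * n
    print(n, round(get_normalized_count(html, 'Obama'), 3))
```

A page that mentions a keyword 1000 times ends up with a feature value below 7 rather than 1000, and a page with no mentions gets exactly 0 because log(1) = 0.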





In [None]:
# Gets the log count of a phrase/keyword in HTML (transforming the phrase/keyword
# to lowercase).
def get_normalized_count(html, phrase):
    return math.log(1 + html.count(phrase.lower()))


# Returns a dictionary mapping from plaintext feature descriptions to numerical
# features for a (url, html) pair.
def keyword_featurizer(url, html):
    features = {}
    
    # Same as before.
    features['.com domain'] = url.endswith('.com')
    features['.org domain'] = url.endswith('.org')
    features['.net domain'] = url.endswith('.net')
    features['.info domain'] = url.endswith('.info')
    features['.biz domain'] = url.endswith('.biz')
    features['.ru domain'] = url.endswith('.ru')
    features['.co.uk domain'] = url.endswith('.co.uk')
    features['.co domain'] = url.endswith('.co')
    features['.tv domain'] = url.endswith('.tv')
    features['.news domain'] = url.endswith('.news')
    
    ### YOUR CODE HERE ###
    keywords = ["hillary", "obama", "sports"] 
    ### END CODE HERE ###

    for keyword in keywords:
        features[keyword + ' keyword'] = get_normalized_count(html, keyword)


    
    return features

## Exercise 8



Let's run and evaluate the above featurizer. Add in code to fit the model, compute train accuracy, val accuracy, val confusion matrix, and val precision, recall, and F1-Score, just as before.

In [None]:
train_X, train_y, feature_descriptions = prepare_data(train_data, keyword_featurizer)
val_X, val_y, feature_descriptions = prepare_data(val_data, keyword_featurizer)

print('Number of features per example:', len(train_X[0]))
print('Feature descriptions:')
print(feature_descriptions)
print()
  
baseline_model = LogisticRegression()

### YOUR CODE HERE ###

### END CODE HERE ###

## Interpreting our Model



### Instructor-Led Discussion: Interpreting Input Variables

As mentioned earlier, a key motivation for using a simpler model is interpretability.

We've learned that the prediction of a logistic regression classifier is just the output of a multiplication with model weights, followed by a non-linear transformation (sigmoid). Because the sigmoid function is always increasing (monotonic) on its domain (see below), we know that if the dot product (or multiplication of vectors) between model weights and input features is large and positive, then the output prediction will be closer to 1. If the dot product is large and negative, then the output prediction will be closer to 0; a dot product near 0 gives a prediction near 0.5.

![Sigmoid](https://cdn-images-1.medium.com/max/2400/1*RqXFpiNGwdiKBWyLJc_E7g.png)

Thus, the weights corresponding to features tell us whether the features are important in the classification. If the weight corresponding to the feature ".net domain" has a large positive value, then websites with ".net" domains are more likely to be classified as fake (since fake has label 1). If it has a large negative value, then these websites are more likely to be classified as real. If it has value close to 0, then the feature may not be useful (at least, it may not be useful given that the other features are present).
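The relationship between weights and predictions can be sketched by hand. The weights and bias below are hypothetical values invented for illustration, not numbers learned by our model, and `predict_proba` here is a hand-written helper rather than the scikit-learn method.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias, invented for illustration (not learned from data).
weights = {'.net domain': 2.0, '.org domain': -1.5, '.com domain': 0.05}
bias = -0.5

def predict_proba(features):
    # Dot product of weights and feature values, plus the bias term.
    z = bias + sum(weights[name] * float(value) for name, value in features.items())
    return sigmoid(z)

net_site = {'.net domain': True, '.org domain': False, '.com domain': False}
org_site = {'.net domain': False, '.org domain': True, '.com domain': False}

print('P(fake) for a .net site:', predict_proba(net_site))
print('P(fake) for a .org site:', predict_proba(org_site))
```

In this toy setup the large positive weight pushes the ".net" site above 0.5 (predicted fake), while the large negative weight pushes the ".org" site below 0.5 (predicted real).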


### Using Feature Descriptions

Let's see what weights our model learned. The code below uses *feature_descriptions* and the weights, or coefficients, of the model and sorts them in ascending order.

In [None]:
sorted(zip(feature_descriptions, baseline_model.coef_[0].tolist()), key=lambda x: x[1])
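To unpack what that one-liner does, here is the same pattern on invented values. The lists `descriptions` and `weights` are hypothetical stand-ins for *feature_descriptions* and `baseline_model.coef_[0].tolist()`.

```python
# Hypothetical descriptions and weights, standing in for feature_descriptions
# and baseline_model.coef_[0].tolist(); the numbers are invented.
descriptions = ['.com domain', '.biz domain', 'obama keyword']
weights = [-0.2, 1.3, -0.8]

# Pair each description with its weight, then sort by the weight (x[1]).
ranked = sorted(zip(descriptions, weights), key=lambda x: x[1])
for name, weight in ranked:
    print(f'{weight:+.2f}  {name}')
```

The most negative weights (pushing toward "real") appear first and the most positive weights (pushing toward "fake") appear last.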

## Exercise 9

Answer the following questions:

* What features have positive weight (most predictive of being fake)? What does that indicate?
* Which ones have negative weight (most predictive of being real)? What does that indicate?
* Which ones have close to 0 weight? 
* Are there any feature weights that surprise you? 
* Try coming up with explanations for why the feature weights are the way they are. Does this help you come up with new feature ideas? (~15 minutes)

In [None]:
'''
- 
'''

## Instructor-Led Discussion: Final Interpretation of Inputs

## Exercise 10 | Final Baseline

Finally, play around with the last few cells, adding more keywords and domain names to see how the results change. Note that "keywords" can be a variety of things: English words, English phrases (spaces are allowed), HTML tags, and any other string present in HTML. Also notice how the weights on different features vary; you may observe some interesting effects. When you are done, run the cell below to run evaluations again!

In [None]:
train_y_pred = baseline_model.predict(train_X)
print('Train accuracy', accuracy_score(train_y, train_y_pred))

val_y_pred = baseline_model.predict(val_X)
print('Val accuracy', accuracy_score(val_y, val_y_pred))

print('Confusion matrix:')
print(confusion_matrix(val_y, val_y_pred))

prf = precision_recall_fscore_support(val_y, val_y_pred)

print('Precision:', prf[0][1])
print('Recall:', prf[1][1])
print('F-Score:', prf[2][1])

Congratulations on completing this notebook. Looking at the results of our final baseline, you may be surprised this approach is working at all. After all, our model is still barely looking at the content of websites. We will further explore the issue of modeling the content of websites tomorrow, but as a result of our efforts today, we now know that we can make progress with a relatively simple approach!