# Day 2: Creating a Baseline Model for Fake News Classification

Yesterday, we investigated several hypotheses for "tells" that could be used to separate out real and fake news websites in our dataset without actually determining the truth value of individual articles. This was a big first step towards our goal of doing coarse-grained fake news classification.

Today, we use the insights we gleaned to build a baseline model using logistic regression.

Why build a baseline? A baseline provides a benchmark for further work on a task. If we can do well with a simple model, this tells us that more sophisticated model architectures may yield diminishing returns. On the other hand, if our baseline does poorly, this may indicate that our task or dataset is malformed, or that more complex architectures are necessary.

In our case, the baseline will show us that the problem of identifying fake news websites is approachable without modeling the truth content of individual articles, a useful insight. Importantly, we will see that using logistic regression in particular gives us a strong foundation for interpreting and improving our model.

Run the cell below to get started!

In [1]:
import math
import os
import numpy as np

import pickle

import requests, io, zipfile
# Download class resources...
r = requests.get("https://www.dropbox.com/s/2pj07qip0ei09xt/inspirit_fake_news_resources.zip?dl=1")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

basepath = '.'

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

## Why Logistic Regression?

We've just spent the last week or so learning about more sophisticated neural network architectures. Why should we begin working on a complicated task like fake news classification using such a simple model? Remember that logistic regression is just linear regression followed by a sigmoid function. See [here](https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc) for a detailed review of logistic regression.

First, as suggested above, using a simpler model tells us how much room we have to improve. 

Second, a simple model makes iteration quick and easy–we'll see that for the project of classifying a website based on its URL and HTML, cleverly extracting features from the URL and HTML will be important for our success. Using a model that trains and evaluates quickly is essential for rapid feature selection. 

Lastly, and perhaps most importantly, logistic regression is *interpretable*. You may have heard in the past that one thing deep neural networks struggle with is interpretability–when you are using these models to make predictions that affect people's wellbeing (e.g., sentencing decisions, predictive policing decisions), it becomes extremely important that you are able to understand why a model is making the predictions it makes. Making deeper neural networks more interpretable is an active area of AI research. This is important for fake news classification as well–as we know in the case of Facebook, poorly filtering out misinformation on social media might even affect elections. 

However, for simpler models like logistic regression, we get interpretability for free! More on this below, but for now know that when engineering features for logistic regression, we will be able to examine which features correspond most with fake news websites, and which features correspond most with real news websites.

## Building Our First Baseline

Our task, then, is to take (URL, HTML) pairs, turn them into a series of numerical features, and then input them into a logistic regression classifier along with training labels. The tricky part here is finding features that are informative for predicting whether a website is fake or not. Luckily, this is what we worked on yesterday!

Yesterday, we found that we could extract some features related to the domain name extension of a website, and they were often informative about whether a website is fake or not. For example, you may have noticed that both fake and real news websites use the ".org" domain name extension, but fake news websites use it more frequently (perhaps contrary to what you'd expect).

Below we introduce some code for taking our training and val data and producing X, y examples that can be fit by a logistic regression model. This code extracts a few basic features from the domain name extension of the website. Your task: add features testing whether the domain name ends in ".co", ".tv", and ".news", according to the template below (~5 minutes).


In [2]:
def prepare_data(data, featurizer):
    X = []
    y = []
    for datapoint in data:
        url, html, label = datapoint
        # We convert all text in HTML to lowercase, so <p>Hello.</p> is mapped to
        # <p>hello.</p>. This will help us later when we extract features from
        # the HTML, as we will be able to rely on the HTML being lowercase.
        html = html.lower() 
        y.append(label)

        features = featurizer(url, html)

        # Gets the keys of the dictionary as descriptions, gets the values
        # as the numerical features. Don't worry about exactly what zip does!
        feature_descriptions, feature_values = zip(*features.items())

        X.append(feature_values)

    return X, y, feature_descriptions
  
# Returns a dictionary mapping from plaintext feature descriptions to numerical
# features for a (url, html) pair.
def domain_featurizer(url, html):
    features = {}
    
    # Binary features for the domain name extension.
    features['.com domain'] = url.endswith('.com')
    features['.org domain'] = url.endswith('.org')
    features['.net domain'] = url.endswith('.net')
    features['.info domain'] = url.endswith('.info')
    features['.biz domain'] = url.endswith('.biz')
    features['.ru domain'] = url.endswith('.ru')
    features['.co.uk domain'] = url.endswith('.co.uk')
    
    ### YOUR CODE HERE ###
    
    features['.co domain'] = url.endswith('.co')
    features['.tv domain'] = url.endswith('.tv')
    features['.news domain'] = url.endswith('.news')
    
    ### END CODE HERE ###
    
    return features
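
One caveat about `url.endswith(...)`: it only works as intended if each `url` is a bare domain name such as `example.org`. If some URLs in your data carry a scheme, a path, or a trailing slash (an assumption, so check your own data), the test would quietly return `False`. Here is a minimal sketch of a more forgiving check, using a hypothetical helper `has_extension` that is not part of the notebook:

```python
from urllib.parse import urlparse

def has_extension(url, extension):
    """Return True if the host part of `url` ends with `extension`.

    Hypothetical helper for illustration; not part of the course code.
    """
    # urlparse only recognises the host when a scheme or '//' is present,
    # so bare domains like 'example.org' get a '//' prefix first.
    if '://' not in url:
        url = '//' + url
    host = urlparse(url).hostname or ''
    return host.endswith(extension)

print(has_extension('example.org', '.org'))
print(has_extension('https://example.org/politics/', '.org'))
print(has_extension('example.co.uk', '.co'))
```

The first two calls should agree with each other, while a plain `'https://example.org/politics/'.endswith('.org')` would not match.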

Make sure you understand what the code above is doing. It produces X, y such that X contains a list of features for each site in the dataset, and y contains the labels in corresponding order. *feature_descriptions* contains the names of the features (e.g., '.com domain'). This will be important later when we want to know the names of features when interpreting the model. Let's run our code for processing the data on the train and val sets from yesterday.

In [3]:
with open(os.path.join(basepath, 'sample_train_val_data.pkl'), 'rb') as f: # TODO change this to actual data
  train_data, val_data = pickle.load(f)
  
print('Number of train examples:', len(train_data))
print('Number of val examples:', len(val_data))
  
train_X, train_y, feature_descriptions = prepare_data(train_data, domain_featurizer)
val_X, val_y, feature_descriptions = prepare_data(val_data, domain_featurizer)

print('Number of features per example:', len(train_X[0]))
print('Feature descriptions:')
print(feature_descriptions)

Number of train examples: 772
Number of val examples: 90
Number of features per example: 10
Feature descriptions:
('.com domain', '.org domain', '.net domain', '.info domain', '.biz domain', '.ru domain', '.co.uk domain', '.co domain', '.tv domain', '.news domain')
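
If the `zip(*features.items())` line in `prepare_data` looks mysterious, the sketch below runs it on a small made-up feature dictionary (the values are illustrative, not taken from the dataset). It splits the dictionary into one tuple of keys and one tuple of values, in matching order.

```python
# A made-up feature dictionary, shaped like the output of domain_featurizer.
features = {'.com domain': True, '.org domain': False, '.net domain': False}

# features.items() yields (description, value) pairs. Unpacking them with *
# and zipping regroups the pairs into one tuple of descriptions and one
# tuple of values.
feature_descriptions, feature_values = zip(*features.items())

print(feature_descriptions)  # ('.com domain', '.org domain', '.net domain')
print(feature_values)        # (True, False, False)
```

Because every datapoint goes through the same featurizer, the descriptions come out in the same order for every row of X.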


Now to train on our featurized data. We use scikit-learn as in the previous week, because it makes it easy to quickly iterate on different types of models.

In [4]:
baseline_model = LogisticRegression()
baseline_model.fit(train_X, train_y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

## Evaluation

We have a very simple baseline, and it would be interesting if just the features we've created were enough to produce a classification accuracy above 50%, since all we're looking at is the domain name extension. One natural way to start evaluating such a simple model is to see how it does on the train data. Given that our model only knows a few basic things about the URL, it's not clear whether it will do better than chance on the training data. scikit-learn makes computing accuracy on training data easy:



In [5]:
train_y_pred = baseline_model.predict(train_X)
print('Train accuracy', accuracy_score(train_y, train_y_pred))

Train accuracy 0.5906735751295337


We can see that we are not doing very well, but we are doing better than 50%. We can do the same for the val data to see how we are doing on unseen data, which is more valuable for us if we want to make predictions on new websites. Fill in the code below to evaluate val accuracy (~4 minutes)!

In [7]:
### YOUR CODE HERE ###
val_y_pred = baseline_model.predict(val_X)
### END CODE HERE ###
print('Val accuracy', accuracy_score(val_y, val_y_pred))

Val accuracy 0.6555555555555556


Val accuracy is in the same range as train accuracy (slightly higher, in fact). To better understand the performance of our binary classification model, we should look at the mistakes it is making. Specifically, when our model makes a mistake (about 34% of the time on the val data), are these mistakes false negatives or false positives?

To answer these questions, we produce and analyze the confusion matrix. The confusion matrix is a matrix that shows the following:

![Confusion Matrix](https://cdn-images-1.medium.com/max/1600/1*Z54JgbS4DUwWSknhDCvNTQ.png)

where the terms mean

* TP (True Positive) = You predicted positive (fake in our case, since fake has a label of 1) and it’s true.
* FP (False Positive) = You predicted positive and it’s false.
* FN (False Negative) = You predicted negative and it’s false.
* TN (True Negative) = You predicted negative and it’s true.

From the confusion matrix, we can extract commonly used metrics like precision (TP/(TP + FP)) and recall (TP/(TP + FN)). Precision quantifies how often the things we classify as positive are actually positive. For our task, this measures what fraction of the sites we classify as fake are actually fake. Recall quantifies what fraction of actually positive examples we classify as positive. In our case, this is the fraction of fake news websites that we actually identify as fake.

Finally, a useful score to summarize both precision and recall is the F1 score. This is the harmonic mean of precision and recall, 2 · precision · recall / (precision + recall), shown in the summary below:

![Metrics](https://image.noelshack.com/fichiers/2018/20/5/1526651367-qcon-rio-machine-learning-for-everyone-51-638-1.jpg)

In [8]:
print('Confusion matrix:')
print(confusion_matrix(val_y, val_y_pred))

Confusion matrix:
[[50  0]
 [31  9]]


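We can see that we have many false negatives (31) and no false positives. (scikit-learn puts true labels in rows and predicted labels in columns, so the matrix above reads [[TN, FP], [FN, TP]].) Why is this the case? If we print out *val_y_pred*, we can see that our model is mostly predicting 0's (websites are real).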

In [9]:
print(val_y_pred)

[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0]


Why so many 0's? The only information we are giving our model about each website is its domain name extension. It's natural that the model would learn that websites with ".biz" extensions are unlikely to be reliable news websites, but it is still the case that most websites in the dataset (fake and real) have ".com" extensions. Thus, our model will misclassify many fake news websites with ".com" extensions as real.

In [10]:
prf = precision_recall_fscore_support(val_y, val_y_pred)

print('Precision:', prf[0][1])
print('Recall:', prf[1][1])
print('F-Score:', prf[2][1])

Precision: 1.0
Recall: 0.225
F-Score: 0.36734693877551017


Again, the precision and recall metrics tell the same story: when we classify a website as fake, we are right (every time, on this val set), but we flag only 22.5% of the fake websites as fake.
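
To connect these numbers back to the definitions, here is a minimal sketch that recomputes them by hand from the val confusion matrix printed above. It assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Val confusion matrix from the domain-only baseline above.
# scikit-learn layout: rows = true label, columns = predicted label.
matrix = [[50, 0],
          [31, 9]]
(tn, fp), (fn, tp) = matrix

precision = tp / (tp + fp)   # of the sites flagged as fake, how many are fake
recall = tp / (tp + fn)      # of the fake sites, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + tn + fp + fn)

print('Precision:', precision)
print('Recall:', recall)
print('F1:', round(f1, 4))
print('Accuracy:', round(accuracy, 4))
```

The results should line up with the values reported by `precision_recall_fscore_support` and `accuracy_score` for the same predictions.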

## Using Keywords for a Stronger Baseline

The key problem with our model in its current state is that it simply does not have enough information. This should not be a surprise–it was pretty unlikely in the first place that domain name extensions would be enough. If you like, feel free to add a few more extensions in the featurizer above and re-run all the code for evaluation–you'll find it doesn't make much of a difference.

Where can we get more information about webpages? From the HTML! Remember that the HTML contains all of the text and structure of a webpage. If we cleverly choose features from the HTML to feed into our logistic regression model, we can drastically improve our performance. We saw yesterday that probing hypotheses related to the counts of certain words produced interesting results, and we will continue in this direction today to produce a model that leverages these differences in word frequencies.

The below code introduces a better featurizer that counts the number of keywords (normalized using the *log* function) in the HTML. Normalizing the counts is a trick that prevents the featurized values from becoming too extreme. Read the code and make sure you understand what it is doing. Then add "sports" and "finance" as additional keywords to expand our model (~3 minutes).



In [11]:
# Gets the log count of a phrase/keyword in HTML (transforming the phrase/keyword
# to lowercase).
def get_normalized_count(html, phrase):
    return math.log(1 + html.count(phrase.lower()))

# Returns a dictionary mapping from plaintext feature descriptions to numerical
# features for a (url, html) pair.
def keyword_featurizer(url, html):
    features = {}
    
    # Same as before.
    features['.com domain'] = url.endswith('.com')
    features['.org domain'] = url.endswith('.org')
    features['.net domain'] = url.endswith('.net')
    features['.info domain'] = url.endswith('.info')
    features['.biz domain'] = url.endswith('.biz')
    features['.ru domain'] = url.endswith('.ru')
    features['.co.uk domain'] = url.endswith('.co.uk')
    features['.co domain'] = url.endswith('.co')
    features['.tv domain'] = url.endswith('.tv')
    features['.news domain'] = url.endswith('.news')
    
    ### YOUR CODE HERE ###
    keywords = ['trump', 'biden', 'clinton', 'sports', 'finance']
    ### END CODE HERE ###
    
    for keyword in keywords:
        features[keyword + ' keyword'] = get_normalized_count(html, keyword)
    
    return features
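
To get a feel for what the log normalization does, the sketch below applies the same `math.log(1 + count)` formula to a few made-up counts and to a tiny made-up HTML string. Note that `str.count` counts substring matches, so a keyword can also match inside longer words.

```python
import math

def get_normalized_count(html, phrase):
    # Same formula as in the cell above.
    return math.log(1 + html.count(phrase.lower()))

# Raw counts spanning three orders of magnitude are squeezed into a narrow range.
for count in [0, 1, 10, 1000]:
    print(count, '->', round(math.log(1 + count), 3))

# Made-up HTML: 'sports' appears once on its own and once inside 'transports'.
html = '<p>sports news</p><p>public transports</p>'
print(html.count('sports'))                             # 2
print(round(get_normalized_count(html, 'Sports'), 3))   # log(3), about 1.099
```

A page that mentions a keyword 1000 times ends up with a feature value of about 6.9 rather than 1000, so a single extreme page has less pull on the learned weights.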

Let's run and evaluate the above featurizer. Add in code to fit the model, compute train accuracy, val accuracy, val confusion matrix, and val precision, recall, and F1-Score, just as before (~8 minutes).

In [13]:
train_X, train_y, feature_descriptions = prepare_data(train_data, keyword_featurizer)
val_X, val_y, feature_descriptions = prepare_data(val_data, keyword_featurizer)

print('Number of features per example:', len(train_X[0]))
print('Feature descriptions:')
print(feature_descriptions)
print()
  
baseline_model = LogisticRegression()

### YOUR CODE HERE ###
baseline_model.fit(train_X, train_y)
print()

train_y_pred = baseline_model.predict(train_X)
print('Train accuracy', accuracy_score(train_y, train_y_pred))

val_y_pred = baseline_model.predict(val_X)
print('Val accuracy', accuracy_score(val_y, val_y_pred))

print('Confusion matrix:')
print(confusion_matrix(val_y, val_y_pred))

prf = precision_recall_fscore_support(val_y, val_y_pred)

print('Precision:', prf[0][1])
print('Recall:', prf[1][1])
print('F-Score:', prf[2][1])

### END CODE HERE ###

Number of features per example: 15
Feature descriptions:
('.com domain', '.org domain', '.net domain', '.info domain', '.biz domain', '.ru domain', '.co.uk domain', '.co domain', '.tv domain', '.news domain', 'trump keyword', 'biden keyword', 'clinton keyword', 'sports keyword', 'finance keyword')


Train accuracy 0.805699481865285
Val accuracy 0.8111111111111111
Confusion matrix:
[[40 10]
 [ 7 33]]
Precision: 0.7674418604651163
Recall: 0.825
F-Score: 0.7951807228915662




We can see that we are doing dramatically better: val accuracy rose from about 0.66 to 0.81, and recall on fake websites from 0.225 to 0.825. The next section addresses how to tell which of the added features made the difference.

## Interpreting our Model

As mentioned earlier, a key motivation for using a simpler model is interpretability.

We've learned that the prediction of a logistic regression classifier is just the output of a multiplication with model weights (plus a bias term), followed by a non-linear transformation (sigmoid). Because the sigmoid function is always increasing (monotonic) on its domain (see below), we know that if the dot product (or multiplication of vectors) between model weights and input features is large and positive, then the output prediction will be closer to 1. If the dot product is large and negative, then the output prediction will be closer to 0.

![Sigmoid](https://cdn-images-1.medium.com/max/2400/1*RqXFpiNGwdiKBWyLJc_E7g.png)
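
As a quick numerical check of this picture, here is a minimal sketch of the sigmoid function evaluated at a few made-up scores (a score here stands for the weighted sum of the features plus the model's intercept term).

```python
import math

def sigmoid(z):
    # Maps any real-valued score to a value strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

for score in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(score, '->', round(sigmoid(score), 3))
```

A score of 0 maps to exactly 0.5, so positive scores lean toward label 1 (fake) and negative scores lean toward label 0 (real).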

Thus, the weights corresponding to features tell us whether the features are important in the classification. If the weight corresponding to the feature ".net domain" has a large positive value, then websites with ".net" domains are more likely to be classified as fake (since fake has label 1). If it has a large negative value, then these websites are more likely to be classified as real. If it has value close to 0, then the feature may not be useful (at least, it may not be useful given that the other features are present).

Let's see what weights our model learned. The code below uses *feature_descriptions* and the weights, or coefficients, of the model and sorts them in ascending order.

In [14]:
sorted(zip(feature_descriptions, baseline_model.coef_[0].tolist()), key=lambda x: x[1])

[('sports keyword', -0.9914383185247577),
 ('.co.uk domain', -0.8873160634555248),
 ('finance keyword', -0.7659863029847725),
 ('biden keyword', -0.37088232295973445),
 ('trump keyword', -0.12615208492554814),
 ('.com domain', -0.0127152664527546),
 ('.biz domain', 0.3742897680425671),
 ('.ru domain', 0.3742897680425671),
 ('.news domain', 0.3742897680425671),
 ('.org domain', 0.4371680521273934),
 ('.info domain', 0.5049555367155103),
 ('.co domain', 0.9743557117179396),
 ('.tv domain', 1.0726727152070907),
 ('clinton keyword', 1.2194725921054308),
 ('.net domain', 1.2923412213596992)]
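
One way to read these weights is to look at how much each feature pushes the score for a single website. The sketch below copies three weights from the output above and multiplies them by feature values for a made-up website (the keyword counts are hypothetical, and the model's intercept and remaining features are left out, so this is only part of the full score).

```python
import math

# Three weights copied from the sorted output above.
weights = {
    '.net domain': 1.2923412213596992,
    'clinton keyword': 1.2194725921054308,
    'sports keyword': -0.9914383185247577,
}

# Hypothetical feature values for one made-up website (not from the dataset):
# a .net site whose HTML mentions "clinton" 4 times and "sports" 20 times.
features = {
    '.net domain': 1.0,
    'clinton keyword': math.log(1 + 4),
    'sports keyword': math.log(1 + 20),
}

# Each feature contributes weight * value to the score fed into the sigmoid.
contributions = {name: weights[name] * features[name] for name in weights}
for name, value in sorted(contributions.items(), key=lambda kv: kv[1]):
    print(f'{name:>16}: {value:+.3f}')
```

Positive contributions push the prediction toward fake and negative ones toward real, which is why the sign and size of each weight are worth inspecting.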

Which features have positive weight (most predictive of being fake)? Which ones have negative weight (most predictive of being real)? Which ones have close to 0 weight? Are there any feature weights that surprise you? Try coming up with explanations for why the feature weights are the way they are. Does this help you come up with new feature ideas?

## Final Baseline

Finally, play around with the last few cells, adding more keywords and domain names to see how the results change. Note that "keywords" can be a variety of things: English words, English phrases (spaces are allowed), HTML tags, and any other string present in HTML. Also notice how the weights on different features vary; you may observe some interesting effects. When you are done, re-run the featurizing and training cell so that your changes take effect, then run the cell below to run evaluations again!

In [15]:
train_y_pred = baseline_model.predict(train_X)
print('Train accuracy', accuracy_score(train_y, train_y_pred))

val_y_pred = baseline_model.predict(val_X)
print('Val accuracy', accuracy_score(val_y, val_y_pred))

print('Confusion matrix:')
print(confusion_matrix(val_y, val_y_pred))

prf = precision_recall_fscore_support(val_y, val_y_pred)

print('Precision:', prf[0][1])
print('Recall:', prf[1][1])
print('F-Score:', prf[2][1])

Train accuracy 0.805699481865285
Val accuracy 0.8111111111111111
Confusion matrix:
[[40 10]
 [ 7 33]]
Precision: 0.7674418604651163
Recall: 0.825
F-Score: 0.7951807228915662


Congratulations on completing this notebook. Looking at the results of our final baseline, you may be surprised this approach is working at all–after all, our model is still barely looking at the content of websites. We will further explore the issue of modeling the content of websites tomorrow, but as a result of our efforts today, we now know that we can make progress with a relatively simple approach!