# Day 1: Introduction to Fake News Classification

In this project, we will learn how to build a classifier that can distinguish fake news websites from established news websites. The fake news problem is important: many argue that the unchecked spread of misinformation was a significant factor in the outcome of the 2016 U.S. election, along with other elections elsewhere. This is also a hard problem, one that we will certainly not completely solve in the next few days: separating truth from fiction remains a difficult research task today.

That said, we will find that fake news websites often have obvious "tells" that suggest they may not be as respectable as established news websites. Without attempting to discern whether a specific news story on a site is true or false, we may be able to get a clue by studying these "tells". We will then use this insight to classify new fake news websites!

Today, we will begin by better understanding how websites work under the hood so that we are better equipped to identify signs that a website might not be reliable. Next, we introduce a dataset of news websites labeled real or fake. Finally, we investigate different "tells" that may suggest a website is unreliable.

Run the cell below to get started!


In [1]:
import os
from bs4 import BeautifulSoup as bs
import pickle
  
import requests
import zipfile
import io

# Download class resources...
r = requests.get("https://www.dropbox.com/s/2pj07qip0ei09xt/inspirit_fake_news_resources.zip?dl=1")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

basepath = '.'

## Anatomy of a (Fake) News Website

Have you ever wondered how websites like *google.com* and *nytimes.com* work under the hood? Using the internet every day, it is easy to forget how magical even the most mundane web browsing experiences are. Consider, for example, this article on the New York Times:

![NYTimes Article](https://karansinghal.com/public/inspirit/nytimes-article.png)


How does the browser know to show the title of the article near the top of the page? How does it know that the word "Opinion" should be left-aligned and colored gray? How does it know where to find the image to display?

All of these questions can be answered by probing through the HTML of a webpage. HTML is a simple markup language that augments text with the structure you'd expect from a webpage. It's the language that provides the structure for every webpage you see. Here's an example of an HTML document for a simple webpage.

![HTML Example](https://karansinghal.com/public/inspirit/html-example.png)

You can [play around with this specific example in an interactive environment](https://www.w3schools.com/html/tryit.asp?filename=tryhtml_default). Also read [this short, basic introduction to HTML](https://www.w3schools.com/html/html_intro.asp).

Every webpage you see on the web has an associated source HTML document, and you can [view this yourself if you like](https://www.computerhope.com/issues/ch000746.htm). For example, the earlier New York Times article has an HTML document that tells the browser how to lay out and format the text you see on the webpage.
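To make the idea of "structure" concrete, here is a minimal sketch that walks a tiny, made-up HTML document with Python's built-in `html.parser` module and records the page title and any link targets. The sample page and the `PageSummary` class are illustrative assumptions, not part of the course resources.

```python
from html.parser import HTMLParser

# A tiny, made-up page used only for illustration.
SAMPLE_HTML = """
<!DOCTYPE html>
<html>
  <head><title>My First Page</title></head>
  <body>
    <h1>Hello</h1>
    <p>Read <a href="https://example.com/story">this story</a>.</p>
  </body>
</html>
"""

class PageSummary(HTMLParser):
    """Collects the page title and every link target while parsing."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self.in_title:
            self.title += data

summary = PageSummary()
summary.feed(SAMPLE_HTML)
print("Title:", summary.title)
print("Links:", summary.links)
```

The browser does something similar at a much larger scale: it reads the tags, works out which text belongs to which element, and then decides how to display each one.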

## Problem Statement

Given the URL of a news website and its HTML, can we classify the news website as either fake or real? Note that this is not a fine-grained task: we are not classifying individual articles, but rather the homepages of news websites. We are not even attempting to determine the truth value of individual stories, and this is a key limitation of our approach. However, we will find that relatively simple models can achieve surprisingly solid results on this task, given some clever feature selection.

## Dataset 

As we've seen, machine learning models are typically trained using ground-truth data split into train, dev, and test data. Where can we get ground-truth real and fake news websites? We use an independent third party, OpenSources. Given a list of websites and their labels, we scrape them to get their HTML, which we will use to identify features that are useful for classification (this has already been done by the teaching team behind the scenes). We scrape each website several times over the course of days, so that we are not overfitting to the news stories of a specific day. Thus, each website provides multiple labeled data points.

Finally, we split the data into train, val, and test by assigning 80% of the websites to train, 10% to val (a.k.a. dev), and 10% to test. The test data is hidden from you for now. Each portion of the data is roughly half real and half fake news websites, so there isn't a large data imbalance. We ensure that different examples (scraped at different times) for the same website are in the same portion of the data. Why is this important for ensuring that our val and test accuracy are representative of predictions on unseen websites?
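As a rough illustration of splitting by website rather than by example, the sketch below assigns whole websites to train, val, or test, so every scrape of a given URL ends up in the same portion. The function name `split_by_website`, the fixed seed, and the toy data are all assumptions made for this example; the teaching team's actual splitting code is not shown in this notebook.

```python
import random

def split_by_website(datapoints, train_frac=0.8, val_frac=0.1, seed=0):
    """Split (url, html, label) tuples so each URL lands in exactly one portion."""
    # Shuffle the unique URLs, not the individual examples.
    urls = sorted({url for url, _, _ in datapoints})
    random.Random(seed).shuffle(urls)

    n_train = int(train_frac * len(urls))
    n_val = int(val_frac * len(urls))
    train_urls = set(urls[:n_train])
    val_urls = set(urls[n_train:n_train + n_val])

    splits = {"train": [], "val": [], "test": []}
    for point in datapoints:
        url = point[0]
        if url in train_urls:
            splits["train"].append(point)
        elif url in val_urls:
            splits["val"].append(point)
        else:
            splits["test"].append(point)
    return splits

# Toy data: 10 made-up websites, each scraped on 3 different days.
toy = [("site%d.example" % i, "<html>day %d</html>" % day, i % 2)
       for i in range(10) for day in range(3)]
splits = split_by_website(toy)

for name, points in splits.items():
    n_sites = len({p[0] for p in points})
    print(name, len(points), "examples from", n_sites, "websites")
```

If we shuffled individual examples instead, near-identical copies of the same homepage could land in both train and val, and the val accuracy would look better than it really is for websites the model has never seen.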

Load the train and val data in the cell below:

In [2]:
with open(os.path.join(basepath, 'sample_train_val_data.pkl'), 'rb') as f:
  train_data, val_data = pickle.load(f)

print('Number of train examples:', len(train_data))
print('Number of val examples:', len(val_data))

# Label 0 means real and label 1 means fake, so these lines count the real examples.
print('Fraction of train examples that are real:', len([datapoint for datapoint in train_data if datapoint[2] == 0]) / float(len(train_data)))
print('Fraction of val examples that are real:', len([datapoint for datapoint in val_data if datapoint[2] == 0]) / float(len(val_data)))

Number of train examples: 772
Number of val examples: 90
Fraction of train examples that are real: 0.533678756476684
Fraction of val examples that are real: 0.5555555555555556


We can see that the number of examples for each portion of the data approximately matches the split above, and each portion has roughly 50% fake news websites. Now to explore what each data point looks like. Spend ~5 minutes browsing through the data by changing example_idx below. You are able to see the URL, label (0 is real, 1 is fake), and part of the HTML for an example.

In [3]:
### YOUR CODE HERE ###
example_idx = 0
### END CODE HERE ###

print('Number of values per data point: %d\n' % len(train_data[0]))

print('URL for chosen example:', train_data[example_idx][0])
print('Label for chosen example:', train_data[example_idx][2])
print('HTML for chosen example (first 5000 chars):\n\n', bs(train_data[example_idx][1]).prettify()[:5000])

Number of values per data point: 3

URL for chosen example: www.motherjones.com
Label for chosen example: 0
HTML for chosen example (first 5000 chars):

 <!DOCTYPE html>
<html class="no-js" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var o=n[t]={exports:{}};e[t][0].call(o.exports,function(n){var o=e[t][1][n];return r(o||n)},o,o.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t.length;o++)r(t[o]);return r}({1:[function(e,n,t){function r(){}function o(e,n,t){return function(){return i(e,[c.now()].concat(u(arguments)),n?null:this,t),n?void 0:this}}var i=e("handle"),a=e(3),u=e(4),f=e("ee").get("tracer"),c=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var p=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],d="api-",l=d+"ixn-";a(p,function(e,n){

Observe that each data point has three values: the URL, the HTML, and the binary (0 or 1) label. A label of "1" indicates that the website is a fake news website, and a label of "0" indicates that the website does not have fake news. See if you can spot some differences between examples with label 0 and examples with label 1, especially in their URLs! The HTML may be a bit difficult to read, since it is much longer, so don't worry about this.

## Probing Hypotheses

Browsing through the examples above, you might have gotten a few ideas for differences between real and fake news websites. For instance, you might have noticed that many fake news websites use domain name extensions other than ".com", whereas this is less common for real news websites.

How do we test this hypothesis in a rigorous way? One way would be to calculate the fraction of fake websites that have ".com" domain extensions, along with the fraction of real websites that have ".com" domain extensions, and compare these numbers (by taking the ratio of the fake fraction to the real fraction).

If the ratio is less than 1, then we have reason to believe that real news websites disproportionately use ".com" extensions, and knowing this would be useful for classification. If the ratio is greater than 1, then we know the same of fake news websites, and this is still useful for separating out real and fake news websites. If the ratio is 1, this means that our hypothesis isn't very useful for separating out real and fake news websites, at least not by itself.

For those of you with some probability background, this ratio is important in updating probabilities using Bayes' theorem, and basically corresponds to how informative something is in telling us whether a website is fake or not. If you're unfamiliar with this, don't worry, it's not important!
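For the curious, here is a small sketch of that update in the odds form of Bayes' theorem: posterior odds equal prior odds times the ratio of the fake fraction to the real fraction. The function name `posterior_fake_probability` and all of the numbers are made up for illustration; they are not taken from the dataset.

```python
def posterior_fake_probability(prior_fake, fake_fraction, real_fraction):
    """Update P(fake) after observing that a hypothesis is true for a website."""
    # Odds in favor of "fake" before looking at the hypothesis.
    prior_odds = prior_fake / (1.0 - prior_fake)
    # The ratio of fake fraction to real fraction acts as the likelihood ratio.
    likelihood_ratio = fake_fraction / real_fraction
    posterior_odds = prior_odds * likelihood_ratio
    # Convert odds back to a probability.
    return posterior_odds / (1.0 + posterior_odds)

# Made-up example: the hypothesis holds for 20% of fake sites and 80% of real sites,
# and we start out believing a site is equally likely to be fake or real.
p = posterior_fake_probability(prior_fake=0.5, fake_fraction=0.2, real_fraction=0.8)
print("P(fake | hypothesis is true):", p)
```

With these made-up numbers the ratio is 0.25, so seeing the hypothesis hold moves an even 50% prior down to a probability of about 20% that the site is fake. A ratio of exactly 1 would leave the prior unchanged.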

We define a function below that returns the real and fake fractions of the training data that satisfy a hypothesis. In our code, our hypotheses will just be simple functions that take in a single data point and return "True" or "False". Make sure you understand what the code is doing.

In [4]:
def get_real_and_fake_fractions(train_data, hypothesis):
    real_true = 0.0
    real_total = 0.0
    fake_true = 0.0
    fake_total = 0.0
    
    for datapoint in train_data:
        # Each datapoint has URL, HTML, label in that order.
        label = datapoint[2]
        hypothesis_truth = int(hypothesis(datapoint))
        if label: # Fake
            fake_total += 1
            fake_true += hypothesis_truth 
        else: # Real
            real_total += 1
            real_true += hypothesis_truth
            
    return real_true / real_total, fake_true / fake_total

Now, play around with this demonstration that asks you for a domain name extension, and prints out the real fraction, the fake fraction, and the ratio of fake fraction to real fraction. Make sure you understand what the code is doing! After running initially, try other values, like ".org", ".co.uk", and ".edu"! The printed values will update automatically. Note that in some cases, the ratio may be "Infinity", if no real websites in the training data have that domain name extension.

In [12]:
#@title Run this cell with your hypothesis domain name extension { run: "auto" }

def domain_extension_hypothesis(datapoint):
  extension = ".com" #@param {type:"string"}
  url = datapoint[0]
  return url.endswith(extension)
  
real_fraction, fake_fraction = get_real_and_fake_fractions(train_data, 
                                                           domain_extension_hypothesis)

print('Real fraction:', real_fraction)
print('Fake fraction:', fake_fraction)

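def prettify_ratio(real_fraction, fake_fraction):
    # Ratio of fake fraction to real fraction, computed from the arguments
    # rather than from global variables.
    if fake_fraction == real_fraction:
        return 1
    if real_fraction == 0:
        return 'Infinity'
    return fake_fraction / real_fraction
  
print('Ratio fraction:', prettify_ratio(real_fraction, fake_fraction))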

Real fraction: 0.9150485436893204
Fake fraction: 0.7944444444444444
Ratio fraction: 0.868199233716475


Can you find a domain name extension that produces ratio fraction Infinity? Can you find one that produces ratio fraction 0 (~3 minutes)? Fill them in below.

In [7]:
### YOUR CODE HERE ###
domain_name_extension_with_ratio_infinity = ''
domain_name_extension_with_ratio_zero = ''
### END CODE HERE ###

Let's try building a more powerful hypothesis that tests things besides domain name extension. Remember, a hypothesis has to be true or false for a specific website, so we are able to calculate the fraction of fake and real websites that satisfy a hypothesis.

One natural idea is checking whether the number of times a word appears in the HTML of a webpage is above a certain threshold. For example, given the word "Clinton" and a threshold of 3, does nytimes.com mention "Clinton" more than 3 times? Does infowars.com? This may tell us something about how useful the word "Clinton" is for telling us whether a website is fake or not.
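One subtlety worth keeping in mind: counting with Python's `str.count` counts substrings, not whole words, so a short word can also match inside longer words. The sketch below compares a substring count with a whole-word count using a regular expression. The sample text is made up, and the whole-word variant is only an optional alternative, not something the notebook requires.

```python
import re

# Made-up snippet of HTML, lowercased for consistent matching.
html = "<p>Trump plays the trumpet. TRUMP again.</p>".lower()

# Substring count: also matches the "trump" inside "trumpet".
substring_count = html.count("trump")

# Whole-word count: \b marks a word boundary on each side.
whole_word_count = len(re.findall(r"\btrump\b", html))

print("Substring count:", substring_count)
print("Whole-word count:", whole_word_count)
```

Either choice can be reasonable; what matters is knowing which one your hypothesis is actually measuring.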

Now, code up the hypothesis function below, which tests whether the count of a provided word is above a threshold, and play with the resulting demo (~15 minutes). We have provided some starter code for you.

In [18]:
#@title Run this cell with a word and a threshold { run: "auto" }

def get_count_from_html(html, hypothesis_word):
    # Transform word to lowercase for consistent results.
    return html.count(hypothesis_word.lower())

def word_threshold_hypothesis(datapoint):
  hypothesis_word = "opinion" #@param {type:"string"}
  threshold = 3 #@param {type:"integer"}
  # Transform HTML to lowercase for consistent results.
  html = datapoint[1].lower() 
    
  ### YOUR CODE HERE ### (Use get_count_from_html!)
  count = get_count_from_html(html, hypothesis_word)
  return count > threshold
  ### END CODE HERE ###
  
real_fraction, fake_fraction = get_real_and_fake_fractions(train_data, 
                                                           word_threshold_hypothesis)

print('Real fraction:', real_fraction)
print('Fake fraction:', fake_fraction)
  
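# Ratio of fake fraction to real fraction, recomputed for this hypothesis.
if fake_fraction == real_fraction:
  ratio = 1
elif real_fraction > 0:
  ratio = fake_fraction / real_fraction
else:
  ratio = 'Infinity'
print('Ratio fraction:', ratio)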

Real fraction: 0.4320388349514563
Fake fraction: 0.09722222222222222
Ratio fraction: 0.2250312109862672


Once you have "Clinton" working with a threshold of 3, try other words, like "Trump", "Obama", "Sports", "Finance", and "Opinion". Why do you see what you see?

Share your most interesting hypothesis word and threshold combinations with the class for discussion!

Now, create your own custom hypotheses! All you should change is the hypothesis function (~15 minutes).

In [0]:
### YOUR CODE HERE ###

### END CODE HERE ###
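If you are unsure where to start, here is one possible shape for a custom hypothesis, run on a handful of made-up data points. The helper `get_fractions` is a compact stand-in for `get_real_and_fake_fractions`, and both the hyphen idea and the toy URLs are assumptions for illustration; nothing here says how informative hyphens are in the real training data.

```python
def get_fractions(data, hypothesis):
    """Fraction of real and of fake examples for which the hypothesis is true."""
    real = [hypothesis(d) for d in data if d[2] == 0]
    fake = [hypothesis(d) for d in data if d[2] == 1]
    return sum(real) / len(real), sum(fake) / len(fake)

def hyphen_in_url_hypothesis(datapoint):
    # Each datapoint is (URL, HTML, label); this hypothesis only looks at the URL.
    url = datapoint[0]
    return "-" in url

# Made-up examples: label 1 is fake, label 0 is real.
toy_data = [
    ("daily-truth-now.example", "<html></html>", 1),
    ("real-patriot-wire.example", "<html></html>", 1),
    ("newsflash.example", "<html></html>", 1),
    ("citytimes.example", "<html></html>", 0),
    ("the-herald.example", "<html></html>", 0),
    ("worldpost.example", "<html></html>", 0),
    ("metrojournal.example", "<html></html>", 0),
]

real_fraction, fake_fraction = get_fractions(toy_data, hyphen_in_url_hypothesis)
print("Real fraction:", real_fraction)
print("Fake fraction:", fake_fraction)
```

Any function that takes a single data point and returns True or False will work the same way, whether it looks at the URL, the HTML, or both.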

Once you are done, share your most interesting hypotheses with the class to discuss!

Congratulations on completing this notebook! Tomorrow, we'll use the insights you just built up to build our baseline model.