# Day 3: Using Word Vectors for Fake News Classification

Yesterday, we built a baseline that performed surprisingly well given that it looked at little of the actual website content beyond a few manually selected keywords. Today, we extend our approach to better model website content in a more hands-off way, where the model learns by itself which keywords are important.

Run the cells below to get started.

In [1]:
import math
import os
import numpy as np
from bs4 import BeautifulSoup as bs
import requests
from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer
from torchtext.vocab import GloVe

import pickle

import requests, io, zipfile
# Download class resources...
r = requests.get("https://www.dropbox.com/s/2pj07qip0ei09xt/inspirit_fake_news_resources.zip?dl=1")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

basepath = '.'

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

with open(os.path.join(basepath, 'sample_train_val_data.pkl'), 'rb') as f: # TODO change this to actual data
  train_data, val_data = pickle.load(f)
  
print('Number of train examples:', len(train_data))
print('Number of val examples:', len(val_data))

Number of train examples: 772
Number of val examples: 90


One potential source of information for websites is their descriptions (often called meta descriptions). These are descriptions embedded in the HTML of a webpage that summarize what the website is about, so that search engines and other crawlers can use them to determine the content of a website. For example, here is the description for google.com, retrieved using the BeautifulSoup Python library for parsing HTML:

In [2]:
def get_description_from_html(html):
  soup = bs(html)
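  # Open Graph descriptions normally use property="og:description"; the
  # remaining lookups are kept as fallbacks.
  description_tag = (
      soup.find('meta', attrs={'property': 'og:description'})
      or soup.find('meta', attrs={'name': 'og:description'})
      or soup.find('meta', attrs={'property': 'description'})
      or soup.find('meta', attrs={'name': 'description'})
  )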
  if description_tag:
    description = description_tag.get('content') or ''
  else: # If there is no description, return empty string.
    description = ''
  return description

response = requests.get('https://google.com', timeout=10)
html = response.text
description = get_description_from_html(html)

print('Description of Google.com:')
print(description)

Description of Google.com:
Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.
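To see what such a tag looks like without fetching a live page, here is a minimal sketch that parses a hand-written HTML snippet. It uses Python's built-in `html.parser` module rather than BeautifulSoup, and both the snippet and the class name `MetaDescriptionParser` are made up for illustration.

```python
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Collects the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = ''

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        attr_dict = dict(attrs)
        if tag == 'meta' and attr_dict.get('name') == 'description':
            if not self.description:
                self.description = attr_dict.get('content') or ''

# Hypothetical page, written by hand for this example.
sample_html = """
<html><head>
<title>Example News</title>
<meta name="description" content="Local news, weather and sports.">
</head><body><p>Hello</p></body></html>
"""

parser = MetaDescriptionParser()
parser.feed(sample_html)
print(parser.description)
```

The description lives in the `content` attribute of the tag, which is the same attribute that `get_description_from_html` reads above.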


## Bag-of-Words Model

It is easy to retrieve the descriptions for the fake and real news websites in our dataset as well. What can we do with these? We can use the approach from yesterday where we extract counts from the description for particular keywords and use these as features, but this would require us to manually select features that we think are important. What if our model automatically collected all of the most important keywords and added their counts for each website description to our feature vector? Our model could then learn feature weights for these words to help us correctly classify news websites.

This approach of automatically featurizing the counts of words in text is called the bag-of-words model. This name comes from the fact that the features do not store the order of the words, rather just their counts.
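As a small illustration of why word order does not matter, the sketch below counts words with Python's `collections.Counter`. The helper `bag_of_words`, the vocabulary, and the two descriptions are hypothetical and chosen only to make the point. This is not the featurizer we use later.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in text, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and descriptions, chosen for illustration.
vocabulary = ['breaking', 'news', 'sports']
desc_a = 'breaking news and sports news'
desc_b = 'news sports and news breaking'

features_a = bag_of_words(desc_a, vocabulary)
features_b = bag_of_words(desc_b, vocabulary)

print(features_a)                # [1, 2, 1]
print(features_a == features_b)  # True: same words, different order
```

Both descriptions contain the same words in a different order, so they map to the same feature vector.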

Let's start by extracting the descriptions for the websites in our dataset. Use the helper function *get_description_from_html* defined above to extract all of the descriptions for the websites in the training data. The return value of the function should be a list of descriptions, in the same order as the sites in *train_data* (~8 minutes).


In [None]:
def get_descriptions_from_data(data):
  # A list of descriptions, in the same order as the sites in data.
  descriptions = []
  for site in tqdm(data):
    ### YOUR CODE HERE ###
    url, html, label = site
    descriptions.append(get_description_from_html(html))
    ### END CODE ###
  return descriptions
  

train_descriptions = get_descriptions_from_data(train_data)
train_urls = [url for (url, html, label) in train_data]

print('\nNYTimes Description:')
print(train_descriptions[train_urls.index('nytimes.com')])



We now have a bunch of descriptions for the websites in our training data. How do we map this to a meaningful feature representation? As suggested above, we use the approach of assigning each feature index a specific word from the descriptions, and the feature value for each site is just the count of that specific word in its description. This is just an automatic version of the keyword-based approach we used yesterday.

How do we choose which words to include as features? We could include all of them, but this would give us a lot of features for words that don't show up often enough to be helpful. Instead, we choose the 300 most frequent words.
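To make the "most frequent words" idea concrete, here is a rough sketch on a toy corpus. The function `most_frequent_words` and the descriptions are assumptions made up for this example. It also splits only on whitespace, which is simpler than the tokenization scikit-learn applies.

```python
from collections import Counter

def most_frequent_words(descriptions, k):
    """Return the k most frequent lowercase words across all descriptions."""
    counts = Counter()
    for description in descriptions:
        counts.update(description.lower().split())
    return [word for word, _ in counts.most_common(k)]

# Hypothetical descriptions, written for illustration.
toy_descriptions = [
    'breaking news and local news',
    'local news',
    'celebrity gossip',
]

top_words = most_frequent_words(toy_descriptions, k=2)
print(top_words)  # ['news', 'local']
```

Words such as "celebrity" and "gossip" appear only once in this toy corpus, so they are dropped when we keep just the top two.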

Below, we use the CountVectorizer class from scikit-learn to do the heavy lifting for us. We fit it using only the train data (so it learns the 300 most frequent words in train only), and then we use it to featurize both the train and val data. Fill in the last line of code, which vectorizes the val data descriptions, using the train version for reference (~4 minutes).

In [70]:
vectorizer = CountVectorizer(max_features=300)

vectorizer.fit(train_descriptions)

def vectorize_data_descriptions(data, vectorizer):
  descriptions = get_descriptions_from_data(data)
  print(descriptions)
  # Use a regular ndarray; newer scikit-learn versions reject np.matrix input.
  X = vectorizer.transform(descriptions).toarray()
  y = [label for (url, html, label) in data]
  return X, y

print('\nPreparing train data...')
bow_train_X, bow_train_y = vectorize_data_descriptions(train_data, vectorizer)

print('\nPreparing val data...')
### YOUR CODE HERE ###
bow_val_X, bow_val_y = vectorize_data_descriptions(val_data, vectorizer)
### END CODE HERE ###




Preparing train data...




['', 'Business news, small business news, business financial news and investment news from FoxBusiness.com.', 'News, Arizona Wildcats sports, breaking news, lifestyle, parenting, business, entertainment, weather, jobs, autos and real estate listings from the Arizona Daily Star', '', '', '', 'PC Magazine is your complete guide to PC computers, peripherals and upgrades. We test and review computer- and Internet-related products and services, report technology news and trends, and provide shopping advice and price comparisons.', '', '', '', '', '', '', '', "The Buffalo News is Western New York's No. 1 news source, providing in-depth, up to the minute news. The Buffalo News brings you breaking news and the latest in local news, sports, business, politics, opinion and entertainment from around Buffalo and Western New York.", 'Your trusted source for breaking news, analysis, exclusive interviews, headlines, and videos at ABCNews.com', "It isn't Islamophobia when they really ARE trying to kil



['', '', 'The Hill is a top US political website, read by the White House and more lawmakers than any other site -- vital for policy, politics and election campaigns.', '', 'View the latest news and breaking news today for U.S., world, weather, entertainment, politics and health at CNN.com.', '', 'View the latest news and breaking news today for U.S., world, weather, entertainment, politics and health at CNN.com.', '', '', 'Indiatimes.com brings you the news, articles, stories and videos on entertainment, latest lifestyle, culture and new technologies emerging worldwide.', 'Go to NBCNews.com for breaking news, videos, and the latest top stories in world news, business, politics, health and pop culture.', 'Examples of certificates, invoices, resumes and various types of templates.', '', '', 'Indiatimes.com brings you the news, articles, stories and videos on entertainment, latest lifestyle, culture and new technologies emerging worldwide.', 'sharing news that matters to you', '', 'Bing 

Now we have all we need to test our bag-of-words featurization. Below, we use logistic regression, as before, combined with the *bow_train_X* produced by CountVectorizer to train our fake news classification model. We also evaluate using our familiar metrics.

Fill in the code below that fits the model on *bow_train_X* and *bow_train_y* and outputs train accuracy, val accuracy, val confusion matrix, and val precision, recall, and F1-Score (~10 minutes).

In [71]:
model = LogisticRegression()

### YOUR CODE HERE ###
model.fit(bow_train_X, bow_train_y)

bow_train_y_pred = model.predict(bow_train_X)
print('Train accuracy', accuracy_score(bow_train_y, bow_train_y_pred))

bow_val_y_pred = model.predict(bow_val_X)
print('Val accuracy', accuracy_score(bow_val_y, bow_val_y_pred))

print('Confusion matrix:')
print(confusion_matrix(bow_val_y, bow_val_y_pred))

prf = precision_recall_fscore_support(bow_val_y, bow_val_y_pred)

print('Precision:', prf[0][1])
print('Recall:', prf[1][1])
print('F-Score:', prf[2][1])
### END CODE HERE ###

Train accuracy 0.8782383419689119
Val accuracy 0.7
Confusion matrix:
[[25 25]
 [ 2 38]]
Precision: 0.6031746031746031
Recall: 0.95
F-Score: 0.7378640776699029
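The metrics printed above follow directly from the confusion matrix. As a worked check, the sketch below recomputes them by hand. It assumes scikit-learn's convention that rows are true labels and columns are predicted labels, and it treats class 1 as the positive class, as the `prf[...][1]` indexing suggests. The names `tn`, `fp`, `fn`, and `tp` are introduced here for illustration.

```python
# Confusion matrix from the validation run above:
# [[25 25]
#  [ 2 38]]
# Rows are true labels, columns are predicted labels.
tn, fp = 25, 25  # true class 0: predicted 0, predicted 1
fn, tp = 2, 38   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the sites predicted as class 1, how many were class 1
recall = tp / (tp + fn)     # of the true class 1 sites, how many we found
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F-Score:', f1)
```

The high recall and lower precision reflect the 25 class 0 sites that were predicted as class 1.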




Solid! You should be getting results roughly similar to what you got yesterday, which may be surprising since we are restricting ourselves to look only at the description of a website. The strength of our approach today is that the featurization is automatic–we didn't put any work into determining the features.

You might be wondering how well a model would do if it combined our bag-of-words approach, our keywords approach, and our domain name extension approach. We will implement combining our featurization approaches tomorrow! For now, let's explore another approach.

## Modeling the Meaning of Websites using Word Vectors

A shortcoming of our bag-of-words approach is that it only looks at the counts of words in the description for each website. What if we had some way of understanding the meaning of words in the description for each website?

The idea of computationally extracting meaning from words is central to word vectors, which have become a cornerstone of modern deep learning on text. Word vectors are a mapping from words to vectors such that words that have similar meaning have similar word vectors. 

For example, the words "good" and "great" have similar word vectors, and the words "good" and "planet" have different word vectors. Thus, word vectors provide us a way to account for the meanings of words with our machine learning models.

Run the cell below to load our word vectors, which come from a model called "GloVe".



In [0]:
VEC_SIZE = 300
glove = GloVe(name='6B', dim=VEC_SIZE)

# Returns the word vector for word if it exists, else returns None.
def get_word_vector(word):
  try:
    return glove.vectors[glove.stoi[word.lower()]].numpy()
  except KeyError:
    return None

We've included a handy helper function which retrieves the word vector for a word. Let's retrieve the word vector for "good" (~3 minutes).

In [73]:
### YOUR CODE HERE ###
good_vector = get_word_vector('good')
### END CODE HERE ###

print('Shape of good vector:', good_vector.shape)
print(good_vector)

Shape of good vector: (300,)
[-1.3602e-01 -1.1594e-01 -1.7078e-02 -2.9256e-01  1.6149e-02  8.6472e-02
  1.5759e-03  3.4395e-01  2.1661e-01 -2.1366e+00  3.5278e-01 -2.3909e-01
 -2.2174e-01  3.6413e-01 -4.5021e-01  1.2104e-01 -1.5596e-01 -3.8906e-02
 -2.9419e-03  1.6009e-02 -1.1620e-01  3.8680e-01  3.5109e-01  9.7426e-02
 -1.2425e-02 -1.7864e-01 -2.3259e-01 -2.6960e-01  4.1083e-02 -7.6194e-02
 -2.3362e-01  2.0919e-01 -2.7264e-01  5.4967e-02 -1.8055e+00  5.6348e-01
 -1.2778e-01  2.3147e-01 -5.8820e-03 -2.6630e-01  4.1187e-01 -3.7162e-01
 -2.0600e-01 -1.9619e-01 -4.3945e-03  1.2513e-01  4.6638e-01  4.5159e-01
 -1.5000e-01  5.9589e-03  5.9070e-02 -4.1440e-01  6.1035e-02 -2.1117e-01
 -4.0988e-01  5.6393e-01  2.3021e-01  2.7240e-01  4.9364e-02  1.4239e-01
  4.1841e-01 -1.3983e-01  3.4826e-01 -1.0745e-01 -2.5002e-01 -3.2554e-01
  3.3343e-01 -3.5617e-01  2.0442e-01  1.4439e-01 -1.2686e-01 -7.7273e-02
 -1.9667e-01  1.0759e-01 -1.1860e-01 -2.5083e-01  1.4205e-02  2.7251e-01
 -2.3707e-01 -2.3545e-

Not too much to see here–each word vector is a vector of 300 numbers, and it's hard to interpret them from looking at the numbers. Remember that the important property of word vectors is that words with similar meaning have similar word vectors. The magic happens when we compare word vectors.

Below, we have set up a demo where we compare the word vectors for two words using a comparison metric known as cosine similarity. Intuitively, cosine similarity measures the extent to which two vectors point in the same direction. You might be familiar with the fact that the cosine similarity between two vectors is the same as the cosine of the angle between the two vectors–ranging between -1 and 1. -1 means that two vectors are facing opposite directions, 0 means that they are perpendicular, and 1 means that they are facing the same direction.
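Before trying real word vectors, it may help to check those three reference values on simple 2D vectors. This is a minimal sketch. The function name `toy_cosine_similarity` and the vectors are made up for illustration, and the formula is the standard dot product divided by the product of the vector lengths.

```python
import numpy as np

def toy_cosine_similarity(vec1, vec2):
    """Cosine of the angle between vec1 and vec2."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

right = np.array([1.0, 0.0])
also_right = np.array([3.0, 0.0])  # same direction, different length
up = np.array([0.0, 2.0])          # perpendicular to right
left = np.array([-1.0, 0.0])       # opposite direction

print(toy_cosine_similarity(right, also_right))  # 1.0
print(toy_cosine_similarity(right, up))          # 0.0
print(toy_cosine_similarity(right, left))        # -1.0
```

Note that `right` and `also_right` have different lengths but still score 1, because cosine similarity depends only on direction.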

Try running the cell below to compare the vectors for "good" and "great", and then try other words, like "planet" (~10 minutes). What do you notice that's expected and unexpected?

In [74]:
#@title Word Similarity { run: "auto", display-mode: "both" }

def cosine_similarity(vec1, vec2):    
  return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

word1 = "good" #@param {type:"string"}
word2 = "great" #@param {type:"string"}

print('Word 1:', word1)
print('Word 2:', word2)

def cosine_similarity_of_words(word1, word2):
  vec1 = get_word_vector(word1)
  vec2 = get_word_vector(word2)
  
  if vec1 is None:
    print(word1, 'is not a valid word. Try another.')
  if vec2 is None:
    print(word2, 'is not a valid word. Try another.')
  if vec1 is None or vec2 is None:
    return None
  
  return cosine_similarity(vec1, vec2)
  

print('\nCosine similarity:', cosine_similarity_of_words(word1, word2))


Word 1: good
Word 2: great

Cosine similarity: 0.6410047


We can see that word embeddings appear to capture the meaning of different words–when two words are similar, the cosine similarity score is higher, and when two words are dissimilar, the cosine similarity score is lower.

Word vectors are created by going over a large body of text (the vectors you are using were trained in part on Wikipedia) and noticing which words tend to occur near each other. If word A tends to co-occur with similar words as word B, then the word vectors for words A and B are mathematically constrained to be similar. If you want to learn more about an algorithm for training word vectors, see this [helpful introduction to word2vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa).
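The "occur near each other" statistic can be sketched on a toy corpus. The code below only counts which words appear within a small window of each word. It is not how GloVe or word2vec is actually trained. The function names, the window size, and the sentences are all assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (word, neighbor) pairs where neighbor is within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, words[j])] += 1
    return counts

def context_of(word, counts):
    """Return the neighbor counts for a single word."""
    return {ctx: n for (w, ctx), n in counts.items() if w == word}

# Hypothetical three-sentence corpus.
toy_corpus = [
    'the food was good',
    'the food was great',
    'the planet orbits the sun',
]
counts = cooccurrence_counts(toy_corpus)

print(context_of('good', counts))
print(context_of('great', counts))
print(context_of('planet', counts))
```

In this toy corpus "good" and "great" share the same neighbors while "planet" does not, which is the kind of signal that training turns into similar or dissimilar vectors.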

Given word vectors that represent the meaning of words, what can we do with them? We can add word vectors to our feature vector, but which do we choose? It turns out that a solid approach is just to average the word vectors for all the words in the description. Averaging word vectors is a natural way to produce vectors for sentences and other collections of words, and this is the approach we will use.
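As a worked example of the averaging step, here is a minimal sketch that uses a tiny hand-made vocabulary of 3-dimensional vectors in place of GloVe. The dictionary `toy_vectors` and the helper `average_vector` are hypothetical, and the numbers are arbitrary.

```python
import numpy as np

# Hypothetical 3-dimensional "word vectors" for two words.
toy_vectors = {
    'breaking': np.array([1.0, 0.0, 2.0]),
    'news': np.array([3.0, 2.0, 0.0]),
}

def average_vector(text, vectors, size):
    """Average the vectors of the words in text that have a vector."""
    total = np.zeros(size)
    found = 0
    for word in text.lower().split():
        if word in vectors:
            total += vectors[word]
            found += 1
    # If no word was found, return the all-zeros vector.
    return total / found if found > 0 else total

avg = average_vector('Breaking news today', toy_vectors, size=3)
print(avg)  # [2. 1. 1.]
```

The word "today" has no vector in this toy vocabulary, so it is skipped and the sum is divided by 2 rather than 3.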

First, let's reload the descriptions of websites.

In [66]:
train_descriptions = get_descriptions_from_data(train_data)
val_descriptions = get_descriptions_from_data(val_data)



Next, we want to write a function that takes a list of descriptions and turns it into an array containing the average GloVe vector for each description (~8 minutes):

In [0]:
def glove_transform_data_descriptions(descriptions):
    X = np.zeros((len(descriptions), VEC_SIZE))
    for i, description in enumerate(descriptions):
        found_words = 0.0
        description = description.strip()
        for word in description.split(): 
            vec = get_word_vector(word)
            if vec is not None:
                ### YOUR CODE HERE ###
                # Increment found_words and add vec to X[i].
                found_words += 1
                X[i] += vec
                ### END CODE HERE ###
        # We divide the sum by the number of words added, so we have the
        # average word vector.
        if found_words > 0:
            X[i] /= found_words
            
    return X
  
glove_train_X = glove_transform_data_descriptions(train_descriptions)
glove_train_y = [label for (url, html, label) in train_data]

glove_val_X = glove_transform_data_descriptions(val_descriptions)
glove_val_y = [label for (url, html, label) in val_data]



Then, we can evaluate our approach as we have in the past. As before, fill in the code for fitting and evaluation (~8 minutes).

In [78]:
model = LogisticRegression()
### YOUR CODE HERE ###
model.fit(glove_train_X, glove_train_y)

glove_train_y_pred = model.predict(glove_train_X)
print('Train accuracy', accuracy_score(glove_train_y, glove_train_y_pred))

glove_val_y_pred = model.predict(glove_val_X)
print('Val accuracy', accuracy_score(glove_val_y, glove_val_y_pred))

print('Confusion matrix:')
print(confusion_matrix(glove_val_y, glove_val_y_pred))

prf = precision_recall_fscore_support(glove_val_y, glove_val_y_pred)

print('Precision:', prf[0][1])
print('Recall:', prf[1][1])
print('F-Score:', prf[2][1])
### END CODE HERE ###

Train accuracy 0.8639896373056994
Val accuracy 0.7111111111111111
Confusion matrix:
[[30 20]
 [ 6 34]]
Precision: 0.6296296296296297
Recall: 0.85
F-Score: 0.723404255319149




We can see that we again get solid results using a different approach. Each approach encodes different information about websites, so we would expect that combining them would produce even better results. Tomorrow, we will combine these approaches and create our final model!