# Week 9 Homework: Fake News Evaluation

For homework, we will build on the class exercises. As you looked at text generated by our language model, you may have noticed that it was difficult to tell how well the model was doing other than by eyeballing the quality of the generated text. It turns out there is a way to quantitatively evaluate the model, using a metric called perplexity. Run the following cell to get started.

In [0]:
import numpy as np
import re
import math
import sys
import random
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.models import load_model
from collections import defaultdict
from tqdm import tqdm

import requests
import zipfile
import io

# Download and extract data.
r = requests.get("http://web.stanford.edu/class/cs21si/resources/unit5_resources.zip")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

In [0]:
#@title Run this cell to load helpers (double-click to read code) { display-mode: "form" }

def get_X_y(text):
	text = text.lower()
	text = simplify_text(text)

	print('Corpus length:', len(text))

	chars = sorted(list(set(text)))
	print('Total chars:', len(chars))
	char_indices = dict((c, i) for i, c in enumerate(chars))
	indices_char = dict((i, c) for i, c in enumerate(chars))

	# cut the text in semi-redundant chunks of maxlen characters
	maxlen = 40
	step = 3
	sentences = []
	next_chars = []
	for i in range(0, len(text) - maxlen, step):
	    sentences.append(text[i: i + maxlen])
	    next_chars.append(text[i + maxlen])
	print('Chunk length:', maxlen)
	print('Number of chunks:', len(sentences))

	# Use the builtin bool: the np.bool alias is unavailable in some NumPy releases.
	x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
	y = np.zeros((len(sentences), len(chars)), dtype=bool)
	for i, sentence in enumerate(sentences):
	    for t, char in enumerate(sentence):
	        x[i, t, char_indices[char]] = 1
	    y[i, char_indices[next_chars[i]]] = 1
	return x, y, char_indices, indices_char

def simplify_text(text):
    counts = defaultdict(int)
    for ch in text:
        counts[ch] += 1
    counts = [(counts[k], k) for k in counts.keys()]
    removed = 0
    for count, ch in counts:
        if count <= 200:
            text = text.replace(ch, '')
            removed += 1
    return text

def sample_from_model(model, text, char_indices, indices_char, chunk_length, number_of_characters, seed=""):
	text = text.lower()
	start_index = random.randint(0, len(text) - chunk_length - 1)
	for diversity in [0.2, 0.5, 0.7]:
	    print('----- diversity:', diversity)

	    generated = ''
	    if not seed:
	    	sentence = text[start_index: start_index + chunk_length]
	    else:
	    	seed = seed.lower()
	    	sentence = seed[:chunk_length]
	    	sentence = ' ' * (chunk_length - len(sentence)) + sentence
	    generated += sentence
	    print('----- Generating with seed: "' + sentence + '"')
	    sys.stdout.write(generated)

	    for i in range(400):
	        x_pred = np.zeros((1, chunk_length, number_of_characters))
	        for t, char in enumerate(sentence):
	            x_pred[0, t, char_indices[char]] = 1.

	        preds = model.predict(x_pred, verbose=0)[0]
	        next_index = sample(preds, diversity)
	        next_char = indices_char[next_index]

	        generated += next_char
	        sentence = sentence[1:] + next_char

	        sys.stdout.write(next_char)
	        sys.stdout.flush()
	    print("\n")

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64') + 1e-8
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


Run the cells below to load our dataset and pretrained model as before.

In [0]:
with io.open('unit5_resources/fake.txt', encoding='utf-8') as f:
    articles_raw = f.read()
    articles_split = re.split("<a>", articles_raw)[1:]
    articles = [a[:-6].strip() for a in articles_split]

X, y, char_indices, indices_char = get_X_y(articles_raw)

print("Shapes:", X.shape, y.shape)

number_of_chunks, chunk_length, number_of_characters = X.shape

In [0]:
pretrained_model = load_model('unit5_resources/pretrained_model.h5')
pretrained_model.summary()

We can see that we are working with a 2-layer LSTM. When evaluating models, we generally prefer to tune our hyperparameters based on quantitative performance metrics, such as test set accuracy. We don't have anything like that at the moment for our language model, so we will use perplexity.

Intuitively, perplexity is a measure of how confused our model is when it looks at our original dataset. For each chunk, we compare our model's predictions to the actual next character: the less probability the model assigns to the character that actually comes next, the higher its perplexity. Thus, we aim to minimize perplexity when building a language model.

More formally, perplexity is the geometric mean of the inverse prediction probabilities for the correct characters:

![Perplexity](http://web.stanford.edu/class/cs21si/resources/assets/perplexity.png)

Note that for us, T is the number of chunks. To calculate perplexity, we evaluate our model on each chunk to produce a y_hat for that chunk. In practice, we often compute log(perplexity) instead, which simplifies to the following:

![log perplexity](http://web.stanford.edu/class/cs21si/resources/assets/log-perplexity.png)
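As a quick sanity check on the relationship between the two forms, here is a minimal sketch on made-up numbers. The array `correct_probs` is hypothetical: it stands for the probability the model assigned to the correct next character in each of T = 4 chunks. The sketch also assumes the natural logarithm, which the equations above may or may not intend.

```python
import numpy as np

# Hypothetical probabilities assigned to the correct character in 4 chunks.
correct_probs = np.array([0.5, 0.25, 0.5, 0.25])
T = len(correct_probs)

# Direct form: T-th root of the product of the inverse probabilities.
perplexity = np.prod(1.0 / correct_probs) ** (1.0 / T)

# Log form: negative mean of the log probabilities (natural log assumed).
log_perplexity = -np.sum(np.log(correct_probs)) / T

# A model that spreads its probability uniformly over 10 characters
# assigns 1/10 to the correct one every time.
uniform_probs = np.full(T, 1.0 / 10)
uniform_perplexity = np.exp(-np.mean(np.log(uniform_probs)))

print(perplexity, np.exp(log_perplexity), uniform_perplexity)
```

Exponentiating the log form recovers the direct form (about 2.83 here), and the uniform case gives a perplexity equal to the number of characters, which is one way to read perplexity as an "effective number of choices". The log form is usually preferred because multiplying many small probabilities together can underflow.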

Note that in both of these equations we have the following term:

![Probability term](http://web.stanford.edu/class/cs21si/resources/assets/prob-term.png)

Does this term look familiar? It is the dot product of our prediction vector and the one-hot label vector. When we compute the dot product with a one-hot vector, we are effectively doing an index lookup at the position represented by the one-hot vector. This sum always simplifies to the predicted probability of the correct character, but it is still simpler to compute it as a dot product in code.
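The equivalence between the dot product and an index lookup can be seen on a tiny example. The vectors `y_hat` and `y_true` below are made up for illustration, using a hypothetical 4-character vocabulary; `y_true` uses a boolean dtype to mirror the label array built by `get_X_y`.

```python
import numpy as np

# Hypothetical prediction over a 4-character vocabulary.
y_hat = np.array([0.1, 0.6, 0.2, 0.1])
# One-hot label: the correct character sits at index 1.
y_true = np.array([False, True, False, False])

# Dot product: every term is zeroed out except the one at the correct index.
via_dot = np.dot(y_hat, y_true)
# Explicit lookup: find the position of the 1 and index into the prediction.
via_index = y_hat[np.argmax(y_true)]

print(via_dot, via_index)
```

Both expressions pick out the same entry of `y_hat`; the dot product simply avoids having to recover the index first.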

We can see that perplexity is a natural measure of how well a language model captures its training data. Although we won't do this, we can also see that it might extend to helping us classify language: because our model was trained on fake news, we would presumably find that, when we ask it to evaluate perplexity on unseen fake news text and real news text, perplexity is higher for real news. This suggests a naive approach for fake news classification!

Now you will finish a function to compute log(perplexity). Note that to compute this, you need the prediction vectors for every chunk (these correspond to the y_hat(t) vectors in the equations above). We have provided the function below to get all of the predictions for our dataset using our pretrained model. The reason this code is more complicated than you'd expect is that Keras expects the input to come as batches of size 128, since this is how we trained our model. Don't worry about the details of this!

In [0]:
def get_pred(model, X):
    number_of_chunks, chunk_length, number_of_characters = X.shape
    pred = np.zeros((number_of_chunks, number_of_characters))
    num_batches = int(math.ceil(number_of_chunks / 128.0))
    for i in tqdm(range(num_batches)):
        curr_pred = model.predict(X[(i) * 128: (i + 1) * 128])
        pred[(i) * 128: (i + 1) * 128] = curr_pred
    return pred

print("Getting predictions... this will take a few minutes.")
pretrained_pred = get_pred(pretrained_model, X)
print("\n\nDone getting predictions.")
print("Shape:", pretrained_pred.shape)
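For readers curious about the batching logic, it can be checked in isolation without a model. The sketch below is an illustration only: `fake_predict` is a hypothetical stand-in for `model.predict`, and `X_toy` is a made-up array whose chunk count is deliberately not a multiple of 128.

```python
import math
import numpy as np

def fake_predict(batch):
    # Hypothetical stand-in for model.predict: one output row per input chunk.
    return batch.sum(axis=1)

X_toy = np.ones((300, 5, 3))  # 300 chunks, chunk length 5, 3 characters
batch_size = 128
out = np.zeros((X_toy.shape[0], X_toy.shape[2]))
num_batches = int(math.ceil(X_toy.shape[0] / float(batch_size)))

batch_lengths = []
for i in range(num_batches):
    # A slice that runs past the end of the array is truncated, not an error,
    # so the final partial batch needs no special handling.
    batch = X_toy[i * batch_size:(i + 1) * batch_size]
    out[i * batch_size:(i + 1) * batch_size] = fake_predict(batch)
    batch_lengths.append(len(batch))

print(num_batches, batch_lengths, out.shape)
```

The last batch holds only the 44 leftover chunks, and every row of `out` is still filled in.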

Now that we have our predictions, we simply need to use them, along with *y*, to compute log(perplexity). You only need to change two lines to finish the function below. Hint: the first line involves computing the predicted probability of the correct character. You may find *np.dot* useful for this.

Note: in the case that the predicted probability of the correct character is exactly 0.0 due to underflow or some other issue, we skip this chunk to avoid log domain errors. This is obviously a hack, but we do it infrequently enough that our metric is still meaningful.
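For comparison, a common alternative to skipping is to clip probabilities to a small positive floor before taking the log. This is not what the function below does; it is shown only as a sketch, and both the `probs` array and the floor `eps` are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical probabilities; the middle one has underflowed to exactly 0.
probs = np.array([0.5, 0.0, 0.25])

eps = 1e-12  # hypothetical floor, not a value taken from the course materials
clipped = np.clip(probs, eps, 1.0)

# Every entry is now strictly positive, so the log is finite everywhere.
log_probs = np.log(clipped)
print(log_probs)
```

The trade-off is that a clipped chunk contributes a large but finite penalty to the total, whereas a skipped chunk contributes nothing.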

In [0]:
def get_log_perplexity(pred, y):
    number_of_chunks, number_of_characters = pred.shape
    total = 0.0
    for t in range(number_of_chunks):
        ### YOUR CODE HERE
        prob = None
        ### END CODE
        # If prob is 0.0, skip to avoid log domain errors
        if prob == 0.0:
            continue
        ### YOUR CODE HERE
        total -= None
        ### END CODE
    return total / number_of_chunks

In [0]:
get_log_perplexity(pretrained_pred, y)

**Expected output:**

1.1104002202573617

## Optional: Train your own model and evaluate it!

You can also train your own model with your own choice of architecture and evaluate it against our new metric! Go into the [*resources* directory](http://web.stanford.edu/class/cs21si/resources/unit5_resources.zip) and edit *lstm_text_generation.py* to your liking. If you only want to change the architecture, you should only need to edit lines 93-97, but feel free to change anything else you like. Simply run ```python lstm_text_generation.py``` to train for up to 300 epochs and save model files each epoch (this will take a really long time; feel free to stop training at any point). The model files will be saved to the *outputs* directory, which will be created if it does not already exist. On CPU, you should expect training to take a few minutes per epoch and around 100 epochs before the model has mostly converged. Once you have a trained model, just insert your model path below and run the cells to evaluate it. See if you can do better than our pretrained model!

In [0]:
### YOUR MODEL PATH HERE
your_model = load_model('resources/outputs/lstm_epochXXX.h5')
### END MODEL PATH
your_model.summary()

In [0]:
your_pred = get_pred(your_model, X)
get_log_perplexity(your_pred, y)