# Week 9 Class Exercises: Fake News Generation

This week, we'll learn how to apply deep learning to natural language processing (NLP) using language models and recurrent neural networks. We'll motivate our discussion by trying to better understand fake news! As we discussed in class, more sophisticated language models introduce the possibility of automated disinformation campaigns, and you'll see in this notebook that even simpler models produce surprisingly reasonable results.

Run the below cell to get started.

In [0]:
import numpy as np
import re
import random
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.models import load_model
from collections import defaultdict
import sys

import requests
import zipfile
import io

# Download and extract data.
r = requests.get("http://web.stanford.edu/class/cs21si/resources/unit5_resources.zip")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

In [0]:
#@title Run this cell to load helpers (double-click to read code) { display-mode: "form" }

def get_X_y(text):
	text = text.lower()
	text = simplify_text(text)

	print('Corpus length:', len(text))

	chars = sorted(list(set(text)))
	print('Total chars:', len(chars))
	char_indices = dict((c, i) for i, c in enumerate(chars))
	indices_char = dict((i, c) for i, c in enumerate(chars))

	# cut the text in semi-redundant chunks of maxlen characters
	maxlen = 40
	step = 3
	sentences = []
	next_chars = []
	for i in range(0, len(text) - maxlen, step):
	    sentences.append(text[i: i + maxlen])
	    next_chars.append(text[i + maxlen])
	print('Chunk length:', maxlen)
	print('Number of chunks:', len(sentences))

	x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
	y = np.zeros((len(sentences), len(chars)), dtype=bool)
	for i, sentence in enumerate(sentences):
	    for t, char in enumerate(sentence):
	        x[i, t, char_indices[char]] = 1
	    y[i, char_indices[next_chars[i]]] = 1
	return x, y, char_indices, indices_char

def simplify_text(text):
    counts = defaultdict(int)
    for ch in text:
        counts[ch] += 1
    counts = [(counts[k], k) for k in counts.keys()]
    removed = 0
    for count, ch in counts:
        if count <= 200:
            text = text.replace(ch, '')
            removed += 1
    return text

def sample_from_model(model, text, char_indices, indices_char, chunk_length, number_of_characters, seed=""):
	text = text.lower()
	start_index = random.randint(0, len(text) - chunk_length - 1)
	for diversity in [0.2, 0.5, 0.7]:
	    print('----- diversity:', diversity)

	    generated = ''
	    if not seed:
	    	sentence = text[start_index: start_index + chunk_length]
	    else:
	    	seed = seed.lower()
	    	sentence = seed[:chunk_length]
	    	sentence = ' ' * (chunk_length - len(sentence)) + sentence
	    generated += sentence
	    print('----- Generating with seed: "' + sentence + '"')
	    sys.stdout.write(generated)

	    for i in range(400):
	        x_pred = np.zeros((1, chunk_length, number_of_characters))
	        for t, char in enumerate(sentence):
	            x_pred[0, t, char_indices[char]] = 1.

	        preds = model.predict(x_pred, verbose=0)[0]
	        next_index = sample(preds, diversity)
	        next_char = indices_char[next_index]

	        generated += next_char
	        sentence = sentence[1:] + next_char

	        sys.stdout.write(next_char)
	        sys.stdout.flush()
	    print("\n")

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64') + 1e-8
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


## Part 1: Neural NLP Warmup

In this part, you won't code up any models, but you'll get a better understanding of the strengths and limitations of our first attempt at modeling text data using neural networks.

First, finish the following function, which computes the number of input features into our model given an *embedding_size* (the size of our word vectors) and *sequence_length* (the number of words we put into our model).

In [0]:
def num_input_neurons(embedding_size, sequence_length):
    ### YOUR CODE HERE
    return None
    ### END CODE

print("Input neurons with 300-dimensional embeddings and a sequence length of 8:", num_input_neurons(300, 8))

**Expected output**

Input neurons with 300-dimensional embeddings and a sequence length of 8: 2400

Assuming our next layer has *hidden1_size* neurons, how many parameters do we need (including both weights and bias) for the first feedforward layer? Finish the below function. You might find your previous function helpful.

In [0]:
def num_parameters_first_layer(embedding_size, sequence_length, hidden1_size):
    ### YOUR CODE HERE
    return None
    ### END CODE
    
print("Number of params for the first layer with 100-dimensional embeddings, sequence length of 5, hidden size of 50:", num_parameters_first_layer(100, 5, 50))

**Expected output**

Number of params for the first layer with 100-dimensional embeddings, sequence length of 5, hidden size of 50: 25050

Now, assume we have a 3-layer network (input layer, hidden layer 1, hidden layer 2, output layer) that does multi-class classification with *vocabulary_size* possible classes. Given the *embedding_size*, *sequence_length*, *hidden1_size*, *hidden2_size*, and *vocabulary_size*, calculate the total number of parameters (weights and biases) in this network. You may find your previous function helpful.

In [0]:
def num_parameters_feedforward(embedding_size, sequence_length, hidden1_size, hidden2_size, vocabulary_size):
    ### YOUR CODE HERE
    return None
    ### END CODE
    
print("Number of total params with 100-dimensional embeddings, sequence length of 5, hidden 1 size of 50, hidden 2 size of 25, vocabulary size of 100:", num_parameters_feedforward(100, 5, 50, 25, 100))

**Expected output**

Number of total params with 100-dimensional embeddings, sequence length of 5, hidden 1 size of 50, hidden 2 size of 25, vocabulary size of 100: 28925
    
This seems like a lot of parameters, but for a decently sized neural network, this isn't much. Notice that the number of parameters scales linearly with hyperparameters like the sequence length (it would be exponential if we had used n-grams!). It seems like our neural NLP model is relatively compact!
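
To make the scaling claim concrete, here is a minimal sketch that counts parameters at a few sequence lengths and compares them with the number of distinct word sequences a count-based n-gram model would have to consider. The helper names `dense_params` and `feedforward_params` are ours (not part of the course code), and the vocabulary of 100 words is just the value used in the exercise above. Since it works through the same arithmetic as the exercises, try those first.

```python
def dense_params(n_in, n_out):
    """Weights plus biases for one fully-connected layer."""
    return n_in * n_out + n_out


def feedforward_params(embedding_size, sequence_length, hidden1_size,
                       hidden2_size, vocabulary_size):
    """Total parameters for input -> hidden 1 -> hidden 2 -> softmax output."""
    n_inputs = embedding_size * sequence_length  # concatenated word vectors
    return (dense_params(n_inputs, hidden1_size)
            + dense_params(hidden1_size, hidden2_size)
            + dense_params(hidden2_size, vocabulary_size))


vocabulary_size = 100
for sequence_length in (5, 10, 20):
    params = feedforward_params(100, sequence_length, 50, 25, vocabulary_size)
    # A count-based n-gram model has one potential entry per distinct sequence.
    ngrams = vocabulary_size ** sequence_length
    print(sequence_length, params, ngrams)
```

Each extra word adds the same fixed number of first-layer weights, while the n-gram count is multiplied by the vocabulary size.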

## Part 2: RNN Warmup

Now we'll follow the same approach to better understand vanilla RNNs. 

Note that when we build an RNN architecture, the hidden state size is a hyperparameter, just like it was for feedforward networks. Given an embedding size and a hidden size, calculate the number of parameters needed to represent W_e.

In [0]:
def num_parameters_W_e(embedding_size, hidden_size):
    ### YOUR CODE HERE
    return None
    ### END CODE
    
print("Number of parameters for W_e with embedding size 100 and hidden size 50:", num_parameters_W_e(100, 50))

**Expected output**

Number of parameters for W_e with embedding size 100 and hidden size 50: 5000
    
Now do the same thing for W_h:

In [0]:
def num_parameters_W_h(embedding_size, hidden_size):
    ### YOUR CODE HERE
    return None
    ### END CODE
print("Number of parameters for W_h with embedding size 100 and hidden size 50:", num_parameters_W_h(100, 50))

**Expected output**

Number of parameters for W_h with embedding size 100 and hidden size 50: 2500
    
Now calculate the number of parameters needed to calculate the next hidden state from the previous one and the current input embedding. This involves W_e, W_h, and b_1. You will find your previous functions helpful.

In [0]:
def num_parameters_hidden(embedding_size, hidden_size):
    ### YOUR CODE HERE
    return None
    ### END CODE
print("Number of hidden params with embedding size 100 and hidden size 50:", num_parameters_hidden(100, 50))

**Expected output**

Number of hidden params with embedding size 100 and hidden size 50: 7550
    
Now calculate the number of parameters needed to calculate the output at the final timestep. Note that the output is O = softmax(U\*h + b_2), where h is the hidden state at the final timestep. You'll need vocabulary size.

In [0]:
def num_parameters_output(embedding_size, hidden_size, vocabulary_size):
    ### YOUR CODE HERE
    return None
    ### END CODE
print("Number of output params with embedding size 100 hidden size 50, vocab size 100:", num_parameters_output(100, 50, 100))

**Expected output**

Number of output params with embedding size 100 hidden size 50, vocab size 100: 5100
    
Now combine everything together to get the number of parameters for our complete RNN:

In [0]:
def num_parameters_RNN(embedding_size, hidden_size, vocabulary_size):
    ### YOUR CODE HERE
    return None
    ### END CODE
print("Number of params with embedding size 100, hidden size 50, vocabulary size 100:", num_parameters_RNN(100, 50, 100))

**Expected output**

Number of params with embedding size 100, hidden size 50, vocabulary size 100: 12650
    
Notice that even though our model has similar hyperparameters to our original feedforward network, we have significantly fewer parameters because of weight-sharing! This gives us more room to build complex models without making computation too expensive. Also notice that sequence length did not factor into our calculations: we can run any length sequence through our RNN to make a prediction, and we don't need extra parameters for this!
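
The sketch below illustrates both points with a toy vanilla RNN written in plain Python. It is an illustration under assumptions rather than the model from class: the `tanh` activation, the small random initial weights, the constant dummy embeddings, and the helper names (`make_rnn`, `rnn_predict`, `count_params`) are all ours. The same W_e, W_h, b_1, U, and b_2 are reused at every timestep, so the parameter count does not depend on how long the input is.

```python
import math
import random


def make_rnn(embedding_size, hidden_size, vocabulary_size, seed=0):
    """Randomly initialise W_e, W_h, b_1, U, b_2 as nested lists."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_e": matrix(hidden_size, embedding_size),
        "W_h": matrix(hidden_size, hidden_size),
        "b_1": [0.0] * hidden_size,
        "U": matrix(vocabulary_size, hidden_size),
        "b_2": [0.0] * vocabulary_size,
    }


def matvec(matrix, vector):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_predict(params, embeddings):
    """Run the same weights over every timestep, then softmax the last state."""
    h = [0.0] * len(params["b_1"])
    for e in embeddings:  # one iteration per timestep, any sequence length
        pre = [a + b + c for a, b, c in
               zip(matvec(params["W_e"], e), matvec(params["W_h"], h), params["b_1"])]
        h = [math.tanh(p) for p in pre]  # assumed activation
    logits = [a + b for a, b in zip(matvec(params["U"], h), params["b_2"])]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [x / total for x in exps]


def count_params(params):
    """Count scalar entries across all weight matrices and bias vectors."""
    total = 0
    for value in params.values():
        if isinstance(value[0], list):
            total += len(value) * len(value[0])
        else:
            total += len(value)
    return total


params = make_rnn(embedding_size=100, hidden_size=50, vocabulary_size=100)
print("Total parameters:", count_params(params))  # 12650

for sequence_length in (3, 30):
    embeddings = [[0.01] * 100 for _ in range(sequence_length)]
    probs = rnn_predict(params, embeddings)
    print(sequence_length, len(probs))
```

Both sequence lengths run through the same `params` dictionary and produce one probability per vocabulary entry.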

## Part 3: Generating Fake News

Now we'll apply our knowledge of RNNs to build a language model for fake news! Besides being a fun way to test our new models against real world data, building a language model for fake news is one way to show how AI can be used for harmful purposes. We'll take our language model and use it to generate new fake news that the world has never seen before.

Let's start by loading up our dataset. Remember that this is a subset of a dataset scraped from websites tagged as fake news providers by OpenSources. The full dataset can be found [here](https://www.kaggle.com/mrisdal/fake-news/version/1#). 

It's just a .txt file containing a bunch of articles concatenated together. Run the below cell to view the first couple of articles.

In [0]:
with io.open('unit5_resources/fake.txt', encoding='utf-8') as f:
    articles_raw = f.read()
    articles_split = re.split("<a>", articles_raw)[1:]
    articles = [a[:-6].strip() for a in articles_split]
print(articles[0][:1000])
print()
print()
print(articles[1][:1000])
print()
print()
print(articles[2][:1000])

The way we will build our language model will be slightly different from the way presented in class. Instead of predicting the next word at each time step, our language model will predict the next character. Each step in the sequence will be a character input. This means we will be generating fake news character by character! Surprisingly, it will still look decently plausible. Note that instead of passing word embeddings at each time step as inputs for our RNN, we will be passing in one-hot vectors representing the current character. 

Also remember that, as mentioned in class, the way we train our model is to cut up our full corpus into manageable chunks and have our model predict the next character after each chunk. We will be using chunks of size 40 characters.

This means that our *X* will have dimensionality (number_of_chunks, 40, number_of_characters). For each chunk, and for each of the 40 characters in that chunk, we will have a one-hot vector of size *number_of_characters* representing the character in that position. Our *y* will have dimensionality (number_of_chunks, number_of_characters). For each chunk, we have a one-hot vector representing the next character after the corresponding chunk.
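
As a rough illustration of these shapes, the sketch below applies the same chunking idea to a tiny string using plain Python lists. The function name `chunk_and_one_hot`, the toy sentence, and the chunk length of 5 are our own choices for readability. The notebook's helper uses chunks of 40 characters with a step of 3, builds NumPy arrays, and also lowercases the text and drops rare characters first.

```python
def chunk_and_one_hot(text, maxlen, step):
    """Cut text into overlapping chunks and one-hot encode chunks and targets."""
    chars = sorted(set(text))
    char_indices = {c: i for i, c in enumerate(chars)}

    chunks, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        chunks.append(text[i:i + maxlen])    # input: maxlen characters
        next_chars.append(text[i + maxlen])  # target: the character that follows

    def one_hot(ch):
        vec = [0] * len(chars)
        vec[char_indices[ch]] = 1
        return vec

    x = [[one_hot(ch) for ch in chunk] for chunk in chunks]
    y = [one_hot(ch) for ch in next_chars]
    return x, y, chunks, next_chars


x, y, chunks, next_chars = chunk_and_one_hot("fake news is fake", maxlen=5, step=3)
print("X shape:", (len(x), len(x[0]), len(x[0][0])))  # (chunks, maxlen, characters)
print("y shape:", (len(y), len(y[0])))                # (chunks, characters)
for chunk, nxt in zip(chunks, next_chars):
    print(repr(chunk), "->", repr(nxt))
```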

Run the below code to get *X* and *y* as described above. The code that does this can be found near the top of this notebook, if you're curious how this was done. The code also gives us *char_indices* and *indices_char*, which are just dictionaries mapping from character to its one-hot index and vice-versa. We'll need these later.

In [0]:
X, y, char_indices, indices_char = get_X_y(articles_raw)

print("Shapes:", X.shape, y.shape)

number_of_chunks, chunk_length, number_of_characters = X.shape

Now just build a Keras model to fit this data. Your model should be an LSTM (non-bidirectional). Feel free to add multiple layers, but note that this will make your model more computationally expensive (if you do this, ensure all LSTM layers before the last one have the *return_sequences* flag set to True). The *input_shape* for the first layer should be (chunk_length, number_of_characters). The output layer should be fully-connected (Dense) with *number_of_characters* neurons and a softmax activation, since we are doing classification among this many classes. You'll find the documentation [here](https://keras.io/layers/recurrent/) helpful.

In [0]:
model = Sequential()
### YOUR CODE HERE

### END CODE

optimizer = RMSprop(learning_rate=0.0005)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()

You could try to train this model, but it will take a while. We trained a version of this model for you so you can see the results. Run the below cell to load the pretrained model and a summary of its architecture.

In [0]:
pretrained_model = load_model('unit5_resources/pretrained_model.h5')
pretrained_model.summary()

Run the below cell to sample text from the language model. This code picks a random chunk in our dataset, feeds it into our language model to predict the next character, appends that character to the chunk (dropping the oldest one), and repeats. The diversity refers to the tendency of the model to sample less likely characters for the sake of producing more novel text. Higher diversity results in more novel text, but it often makes less sense. Try running the below cell a few times.

In [0]:
sample_from_model(pretrained_model, articles_raw, char_indices, indices_char, chunk_length, number_of_characters)
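
To see what the diversity setting does numerically, here is a small sketch of the reweighting step in the `sample` helper near the top of this notebook, rewritten with the standard library only. The function name `apply_temperature` and the four-entry probability list are made up for illustration. The real helper then draws one index from the reweighted distribution with `np.random.multinomial`.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution by dividing its log by temperature."""
    scaled = [math.log(p + 1e-8) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical model output over a four-character vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
for temperature in (0.2, 0.5, 0.7, 1.0):
    reweighted = apply_temperature(probs, temperature)
    print(temperature, [round(p, 3) for p in reweighted])
```

Lower values concentrate almost all of the probability on the most likely character, while a value of 1.0 leaves the model's distribution essentially unchanged.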

If you squint, you may be able to see a conspiracy theory taking shape! For more fun with this, supply your own initial chunk (seed). The initial chunk can be at most 40 characters (otherwise it will be cut off). If it is shorter, it will be left-padded with whitespace. The seed can only contain characters in the dataset! Note that since the seed isn't coming from the dataset, the result may be less coherent.

In [0]:
sample_from_model(pretrained_model, articles_raw, char_indices, indices_char, chunk_length, number_of_characters, seed="Hilary Clinton is under invest")
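
The seed handling described above can be sketched as follows. `prepare_seed` is a hypothetical helper, not part of the notebook: it lowercases, truncates, and left-pads the seed the same way `sample_from_model` does, and adds an explicit check for unknown characters (the notebook's helper would instead fail with a `KeyError` when it looks the character up). The toy character table stands in for the real *char_indices* returned by `get_X_y`.

```python
def prepare_seed(seed, chunk_length, char_indices):
    """Lowercase, validate, truncate, and left-pad a seed string."""
    seed = seed.lower()
    unknown = sorted(set(seed) - set(char_indices))
    if unknown:
        raise ValueError("Seed contains characters not in the dataset: %r" % unknown)
    seed = seed[:chunk_length]       # keep at most chunk_length characters
    return seed.rjust(chunk_length)  # left-pad with spaces


# Hypothetical character table; the real one comes from get_X_y.
toy_char_indices = {c: i for i, c in enumerate(" abcdefghijklmnopqrstuvwxyz")}
padded = prepare_seed("Breaking news", 40, toy_char_indices)
print(repr(padded), len(padded))
```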

Congratulations on completing this notebook! There's plenty of room for improvement with this model (see the [OpenAI blog post mentioned in class](https://openai.com/blog/better-language-models/) for examples). For the homework, you'll develop a metric to quantitatively evaluate language models, optionally train your own model, and evaluate it against the metric.

You can find the actual script we used to train this model in the resources folder [here](http://web.stanford.edu/class/cs21si/resources/unit5_resources.zip). You can play around with hyperparameters, create more compelling demonstrations, or do anything else you want!