# Finding Fact-checkable Tweets with Machine Learning

This notebook was copied and modified from one originally created by Jeremy Howard and the other folks at [fast.ai](https://fast.ai) as part of [this fantastic class](https://course.fast.ai/). Specifically, it comes from Lesson 4. You can [see the lesson video](https://course.fast.ai/videos/?lesson=4) and [the original class notebook](https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson3-imdb.ipynb).

For more information about this project, and details about how to use this work in the wild, check out our [Quartz AI Studio blog post about the checkable-tweets project](https://qz.ai/?p=89).

-- John Keefe


## Using this notebook

Essentially you need a computer with a GPU that's running fast.ai. There are a few ways to do this without owning a computer with a GPU (I certainly don't), and there are [lots of options](https://course.fast.ai/index.html). I like to use [the Amazon EC2 setup](https://course.fast.ai/start_aws.html), which is probably the most complicated. In most of these cases, you'll just clone [the workshop repository](https://github.com/Quartz/aistudio-workshops) and get the notebook running.

I'm also tailoring this notebook for use with [Google Colaboratory](https://colab.research.google.com), which as of this writing is the fastest, cheapest (free) way to get going.


### If you're using Google Colaboratory ...

Be aware that Google Colab instances are ephemeral -- they vanish *Poof* when you close them, or after a period of sitting idle (currently 90 minutes).

There are great steps on the fast.ai site for [getting started with fast.ai and Google Colab](https://course.fast.ai/start_colab.html).

Those instructions will show you how to save your own copy of this _notebook_ to Google Drive.

They also tell you how to save a copy of your _data_ to Google Drive (Step 4), which is unnecessary for this workshop.

In [None]:
## ALL GOOGLE COLAB USERS RUN THIS CELL

## This runs a script that installs fast.ai
!curl -s https://course.fast.ai/setup/colab | bash

### If you are _not_ using Google Colaboratory ...

Run the cell below.

In [None]:
## NON-COLABORATORY USERS SHOULD RUN THIS CELL
%reload_ext autoreload
%autoreload 2
%matplotlib inline

### Everybody do this ...

In [None]:
## AND *EVERYBODY* SHOULD RUN THIS CELL

from fastai.text import *

## The Challenge

Figure out which `#txlege` tweets are checkable statements of fact.

Take a look at the [tweets in question](https://twitter.com/search?q=%23txlege&src=typed_query).

## The Plan

Here's what we're going to do:

- Grab files with a bunch of tweets
- Make a **language model** from a model pretrained on Wikipedia _plus_ as many tweets as we have
- Make a **classification model** to predict whether a given tweet is checkable or not, using tweets that were hand-labeled by folks at the Austin American-Statesman.
- Use that classification model to predict the checkability of unseen tweets.

## The Data

The data isn't stored with this notebook, so we've put it online for you to download here:

In [None]:
!wget -N https://qz-aistudio-public.s3.amazonaws.com/workshops/austin_tweet_data.zip --quiet
!unzip -q austin_tweet_data.zip
print('Done!')

We now have a directory called `data` with two files of tweets. Let's take a look.

In [None]:
%ls data/

### Take a peek at the tweet data

Working with Dan Keemahill and Madlin Mekelburg over a couple of weeks during the 2019 Texas state legislative session, I have a set of 3,797 tweets that humans at the Austin American-Statesman have determined are – or are not – statements that can be fact-checked.

In [None]:
# Here I read the csv into a data frame I called `hand_coded_tweets`
# and take a look at the first few rows
path = Path('./data')
hand_coded_tweets = pd.read_csv(path/'hand_coded_austin_tweets.csv')
hand_coded_tweets.head()

## The language model

First we need a model that 'understands' the rules of English – the language model. 

We'll start with a language model pretrained on thousands of Wikipedia articles, a dataset called [wikitext-103](https://einstein.ai/research/blog/the-wikitext-long-term-dependency-language-modeling-dataset). That language model has been trained to guess the next word in a sentence based on all the previous words. It has a recurrent structure with a hidden state that is updated each time it sees a new word. This hidden state thus contains information about the sentence up to that point.
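To make "guess the next word" concrete, here's a toy sketch that simply counts which word tends to follow which. This is _not_ how the Wikitext model works (it uses a recurrent network that considers all the previous words, while this toy looks at only one), and the sentences are made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up training sentences (not from the workshop data).
sentences = [
    "the house passed the bill",
    "the senate passed the budget",
    "the house passed the budget",
]

# Count which word follows which.
following = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1

def guess_next(word):
    """Return the word most often seen after `word`, or None if unseen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

print(guess_next("passed"))
print(guess_next("house"))
```

A real language model does the same job with far more context and far more text, but the goal is the same: given what came before, pick a likely next word.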

For our project, we want to infuse the Wikitext model with our particular dataset – the #txlege tweets. Because the English of #txlege tweets isn't the same as the English of Wikipedia, we'll adjust the internal parameters of the model by a little bit. That includes adding words that might be extremely common in the tweets but would be barely present in Wikipedia, and therefore might not be part of the vocabulary the model was trained on.
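Here's a rough sketch of what "words the model hasn't seen" means. Both the tiny vocabulary and the tweets are hypothetical stand-ins I made up; the real vocabulary has tens of thousands of entries.

```python
# Hypothetical stand-ins: a tiny "pretrained" vocabulary and a few tweets.
pretrained_vocab = {"the", "house", "passed", "a", "bill", "today", "on", "school", "finance"}
tweets = [
    "the house passed a bill today #txlege",
    "school finance bill on the floor today #txlege",
]

# Collect every distinct token in the tweets.
tweet_tokens = {token for tweet in tweets for token in tweet.lower().split()}

# Tokens the pretrained vocabulary has never seen.
new_tokens = sorted(tweet_tokens - pretrained_vocab)
print(new_tokens)
```

Tokens like the hashtag itself are exactly the kind of thing the fine-tuning step needs to learn about.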

### Adding more tweets for the language model

We want as many tweets for the language model as possible. We'll start with the text of the 3,797 "hand-coded" tweets (though for the language model, we ignore the checkable/not checkable part of that file). To give it even more examples, I used the website [IFTTT](https://ifttt.com) and Google Spreadsheets to collect several days' worth of #txlege tweets, and then saved them into a file called `tweet_corpus.txt`.


In [None]:
# read in the corpus, which has one tweet per row,
# and take a look at the first few rows
corpus_tweets = pd.read_csv(path/'tweet_corpus.txt')
corpus_tweets.head()

In [None]:
# here I concatenate the two tweet sets into one big set
lm_tweets = pd.concat([hand_coded_tweets,corpus_tweets], sort=True)

# as a sanity check, let's look at the size of each set, 
# and then the concatenated set
print('hand coded tweets:', len(hand_coded_tweets) )
print('corpus of tweets:', len(corpus_tweets) )
print('total tweets:', len(lm_tweets) )

Great: Now we have 7,485 tweets to use for the language model.

One thing to note ... the first set had two columns, `checkable` and `tweet_text`, while the corpus had just one column, `tweet_text`. The combined set has the original two columns, though many of the `checkable` entries will be `NaN` for "not a number." That's okay, because we're only going to use the `tweet_text` column for the language model.




In [None]:
# Saving as csv for easier reading in a moment
lm_tweets.to_csv(path/'lm_tweets.csv', index=False)

Fast.ai uses a concept called a "[data bunch](https://docs.fast.ai/basic_data.html)" to handle machine-learning data, which takes care of a lot of the more fickle machine-learning data preparation.

We have to use a special kind of data bunch for the language model, one that ignores the labels, and will shuffle the texts at each epoch before concatenating them all together (only the training set gets shuffled; we don't shuffle for the validation set). It will also create batches that read the text in order, with targets (aka the correct answers the model tries to guess) that are the next word in the sentence.


In [None]:
# Loading in data with the TextLMDataBunch factory class, using all the defaults
data_lm = TextLMDataBunch.from_csv(path, 'lm_tweets.csv', text_cols='tweet_text', label_cols='checkable')
data_lm.save('data_lm_tweets')
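The "targets are the next word" idea can be sketched in a few lines of plain Python. This is a simplified illustration with made-up text, not what `TextLMDataBunch` does internally (the real thing also tokenizes, adds special tokens, and splits the stream into batches).

```python
# Made-up tweets standing in for the training texts.
texts = ["the house passed the bill", "the senate votes today"]

# Concatenate every text into one long stream of tokens.
stream = [token for text in texts for token in text.split()]

# Inputs are the stream; targets are the same stream shifted by one word.
inputs = stream[:-1]
targets = stream[1:]

for seen, target in zip(inputs[:4], targets[:4]):
    print(f"after {seen!r} the model should guess {target!r}")
```

Because the targets come straight from the text itself, nobody has to label anything by hand for the language model.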

We can then put all of our tweets (now stored in `data_lm`) into a learner object along with the pretrained Wikitext model -- here called `AWD_LSTM`, which is downloaded the first time you execute the following line.

In [None]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

One of the most important settings when we actually _train_ our model is the **learning rate**. I'm not going to dive into it here (though I encourage you to explore it), but will use a fast.ai tool to find the best learning rate to start with:

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot(skip_end=15)

This gives us a graph of the loss at different learning rates. A good learning rate to start with is the point where the graph really dives downward (`1e-02`). Again, there's much more on picking learning rates in the fast.ai course.
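If you'd rather not eyeball the plot, here's one rough heuristic for "where the graph dives downward": find the stretch where the loss falls fastest. The numbers are hypothetical, the function name is my own, and this is just a sketch of the idea, not what fast.ai does internally.

```python
import math

# Hypothetical (learning rate, loss) pairs like those lr_find records.
learning_rates = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [5.0, 4.95, 4.9, 4.7, 3.8, 6.0]

def steepest_drop(lrs, losses):
    """Return the learning rate at the start of the steepest loss decrease."""
    best_lr, best_slope = None, 0.0
    for i in range(len(lrs) - 1):
        # Slope of loss against log10(lr), since the plot's x-axis is logarithmic.
        slope = (losses[i + 1] - losses[i]) / (math.log10(lrs[i + 1]) - math.log10(lrs[i]))
        if slope < best_slope:
            best_lr, best_slope = lrs[i], slope
    return best_lr

print(steepest_drop(learning_rates, losses))
```

With these made-up numbers the steepest drop starts at `1e-2`. On a real plot the curve is noisier, so treat any automatic pick as a starting point.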

Now we can train the Language Model. (Essentially, we're training it to be good at guessing the *next word* in a sentence, given all of the previous words.)

The variables we're passing are `1` to just do one cycle of learning, the learning rate of `1e-2`, and some momentum settings we won't get into here -- but these are pretty safe.

In [None]:
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

In [None]:
learn.fit_one_cycle(1, 1e-1, moms=(0.8,0.7))


To complete the fine-tuning, we "unfreeze" the original Wikitext language model and let the new training efforts work their way into the original neural network.

In [None]:
# This takes a couple of minutes!
learn.unfreeze()
learn.fit_one_cycle(2, 1e-3, moms=(0.8,0.7))

While our accuracy may _seem_ low ... in this case it means the language model correctly guessed the next word in a sentence more than 1/3 of the time. That's pretty good! And we can see that even when it's wrong, it makes some pretty "logical" guesses. 
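For a language model, accuracy is just the share of next-word guesses that match the word that actually came next. A tiny sketch with made-up guesses:

```python
# Hypothetical next-word guesses versus the words that actually came next.
actual_next_words = ["passed", "the", "bill", "on", "school", "finance"]
model_guesses     = ["passed", "a",   "bill", "in", "the",    "year"]

correct = sum(guess == actual for guess, actual in zip(model_guesses, actual_next_words))
accuracy = correct / len(actual_next_words)
print(f"{correct} of {len(actual_next_words)} correct, accuracy = {accuracy:.2f}")
```

Given how many words _could_ come next at any point in a sentence, getting the exact word right a third of the time is a lot better than it sounds.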

Let's give it a starting phrase and see how it does:


In [None]:
TEXT = "I wonder if this"
N_WORDS = 40
N_SENTENCES = 3

print("\n\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))

Remember, these are not real ... they were _generated_ by the model when it tried to guess each of the next words in the sentence! Generating text like this is not why we made the language model (though you can see where text-generation AI starts from!)

Also note that the model is often crafting the response _in the form of a tweet!_
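The `temperature` setting controls how adventurous those guesses are. Here's a sketch of the general idea with hypothetical scores for three candidate words. The function name is mine, and I'm showing one common way temperature is applied, not fast.ai's exact code.

```python
import math
import random

def apply_temperature(scores, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [score / temperature for score in scores]
    biggest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate next words and raw model scores.
candidates = ["bill", "vote", "session"]
scores = [2.0, 1.0, 0.5]

for temperature in (1.0, 0.75):
    probs = apply_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Draw one next word using the lower-temperature probabilities.
random.seed(0)
next_word = random.choices(candidates, weights=apply_temperature(scores, 0.75), k=1)[0]
print(next_word)
```

A temperature below 1 pushes more probability toward the top-scoring word, so the text is more predictable. A temperature above 1 does the opposite.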

We now save the model's encoder, which is the mathematical representation of what the language model "understands" about English patterns infused by our tweets.

In [None]:
learn.save_encoder('fine_tuned_enc')  

## Building the classifier model

This is the model that will use our language model **and** the hand-coded tweets to guess if new tweets are fact-checkable or not.

We'll create a new data bunch that only grabs the hand-coded tweets and keeps track of the labels there (true or false, for fact-checkability). We also pass in the `vocab` -- which is the list of the most useful words from the language model.

In [None]:
data_clas = TextClasDataBunch.from_csv(path, 'hand_coded_austin_tweets.csv', vocab=data_lm.vocab, text_cols='tweet_text', label_cols='checkable')

data_clas.save('data_clas_tweets')
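Why pass in the language model's `vocab`? The model never sees words, only ID numbers, and the encoder we saved learned what each ID means. The classifier has to use the same word-to-ID mapping. Here's a sketch with a hypothetical six-word vocabulary (`xxunk` is the placeholder fast.ai uses for unknown words; the rest is made up):

```python
# Hypothetical vocabulary: position in the list is the token's ID.
# "xxunk" is the placeholder fastai uses for unknown tokens.
vocab = ["xxunk", "the", "house", "passed", "bill", "#txlege"]
token_to_id = {token: index for index, token in enumerate(vocab)}

def numericalize(text):
    """Map each lowercase, whitespace-separated token to its ID (0 if unknown)."""
    return [token_to_id.get(token, 0) for token in text.lower().split()]

print(numericalize("The House passed the bill #txlege"))
print(numericalize("The Senate passed the bill"))
```

If the classifier built its own vocabulary, ID 2 might mean something other than "house," and everything the encoder learned would be scrambled.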

And here's how the computer has tokenized the tweets:

In [None]:
data_clas.show_batch()

We can then create a model to classify tweets. You can see that in the next two lines we include the processed, hand-coded tweets (`data_clas`), the original Wikitext model (`AWD_LSTM`), and the knowledge we saved after infusing the language model with tweets (`fine_tuned_enc`).

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc');

With neural networks, there are lots of settings you can tweak (known as "hyperparameters"), such as learning rate and momentum. The fast.ai defaults are pretty great, and the tools it has for finding the learning rate are super useful. I'm going to skip those details here for now. There's more to learn at [qz.ai](https://qz.ai) or at [this great fast.ai course](https://course.fast.ai/).

In [None]:
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))

learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))
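A note on those `slice(...)` values: as I understand it, fast.ai spreads the two numbers across the model's layer groups, giving the earliest layers the smallest learning rate and the last layers the largest. Here's a sketch of that spread. The function name is mine, and the five layer groups are an assumption on my part.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Return n_groups learning rates spaced by a constant multiplier."""
    step = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * step ** i for i in range(n_groups)]

# Assumption: the classifier has five layer groups, which is why the
# divisor in the cell above is 2.6 ** 4 (four steps between five groups).
rates = spread_learning_rates(1e-2 / (2.6 ** 4), 1e-2, 5)
for group_number, rate in enumerate(rates, start=1):
    print(f"layer group {group_number}: {rate:.6f}")
```

Under that assumption, each group's learning rate is 2.6 times that of the group before it. That way the early layers, which hold general knowledge about English, change only a little while the final layers adapt the most.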

Let's give it an example ....

In [None]:
example = "Four states have two universities represented in the top 20 highest-paid executives of public colleges. Texas has SIX"
learn.predict(example)

`True` means checkable! 

We can open the "black box" a little to see what words the model is keying into.

In [None]:
interp = TextClassificationInterpretation.from_learner(learn) 
interp.show_intrinsic_attention(example)

Let's save our work.

## Saving to Google Drive

At present, your Google Colaboratory Notebook disappears when you close it — along with all of your data. If you'd like to save your model to your Google Drive, run the following cell and grant the permissions it requests.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'ai-workshop/'
path = Path(base_dir)
path.mkdir(parents=True, exist_ok=True)

The next line will save everything we need for predictions in a ~90MB file on your Google Drive.

In [None]:
# `learn` still points at ./data, so pass the Google Drive path explicitly
learn.export(path/'export.pkl')

For details about deploying a predictor in the cloud using Render, see our [blog post about building the checkable-tweets project](https://qz.ai/?p=89).