# Classify text using fast.ai

This notebook will walk you through a simple example that trains a model to sort the text of political Facebook ads into categories, and then uses that model to label a larger set of ads.

This work is based on the early lessons in [Practical Deep Learning for Coders](https://course.fast.ai/), taught online by Jeremy Howard. I **highly** recommend this free online course.

## Using this notebook

Essentially, you need a computer with a GPU that's running fast.ai. There are a few ways to do this without owning a computer with a GPU (I certainly don't), and the fast.ai site lists [lots of options](https://course.fast.ai/index.html). I like to use [the Amazon EC2 setup](https://course.fast.ai/start_aws.html), which is probably the most complicated. In most of these cases, you'll just clone [the workshop repository](https://github.com/Quartz/aistudio-workshops) and get the notebook running.

I'm also tailoring this notebook for use with [Google Colaboratory](https://colab.research.google.com), which as of this writing is the fastest, cheapest (free) way to get going.


### If you're using Google Colaboratory ...

Be aware that Google Colab instances are ephemeral -- they vanish *Poof* when you close them, or after a period of sitting idle (currently 90 minutes).

There are great steps on the fast.ai site for [getting started with fast.ai and Google Colab](https://course.fast.ai/start_colab.html).

Those instructions will show you how to save your own copy of this _notebook_ to Google Drive.

They also tell you how to save a copy of your _data_ to Google Drive (Step 4), which is unnecessary for this workshop.

In [None]:
## ALL GOOGLE COLAB USERS RUN THIS CELL

## This runs a script that installs fast.ai
!curl -s https://course.fast.ai/setup/colab | bash

### If you are _not_ using Google Colaboratory ...

Run the cell below.

In [None]:
## NON-COLABORATORY USERS SHOULD RUN THIS CELL
%reload_ext autoreload
%autoreload 2
%matplotlib inline

### Everybody do this ...

In [None]:
## AND *EVERYBODY* SHOULD RUN THIS CELL

from fastai.text import *

## The Plan

Given a set of political Facebook ads, we want to sort them into three categories: fundraising, list-building, and persuasion.

We're going to take a hand-coded set of 1,700 ads (which Jeremy B. Merrill coded on a long flight), and apply what a model learns from them to the larger Facebook ad database. As of this writing, that database has nearly 165,000 ads and clocks in at about 3.2 GB. So for this class, as a proof of concept, we'll take a slice of 5,000 ads.

Our plan will be:

- Download an English-language recognition **language model** pre-trained on Wikipedia articles
- Further train that **language model** on the type of English we're working with, specifically the corpus of Facebook ads we have
- Train a **classification model** on the difference between fundraising, list-building, and persuasion ads.
- Use that **classification model** to label the bigger group of ads

## The Data

Let's get the two data sets we'll be using: The hand-labeled set of 1,700 ads and the raw set of 5,000 ads.

In [None]:
!wget -N https://qz-aistudio-public.s3.amazonaws.com/workshops/facebook_ad_data.zip
!unzip facebook_ad_data.zip > /dev/null
print('Done!')

Now you have a subdirectory called `facebook_ad_data` which contains two files.

In [None]:
%ls facebook_ad_data

Next we'll load the `hand-labeled-ads.csv` file into a structure called a "data frame," which is a common way to handle large amounts of data in Python.

In [None]:
hand_coded_ads = pd.read_csv('facebook_ad_data/hand-labeled-ads.csv')

Let's take a peek!

In [None]:
hand_coded_ads.head()

And let's load in the 5,000 raw ads.

In [None]:
raw_ads = pd.read_csv('facebook_ad_data/fbpac-ads-en-US-slice.csv')

In [None]:
raw_ads.head()

### A little cleaning

We want to study the "message" line in each row, but it's got a bunch of HTML tags inside ...

In [None]:
raw_ads.iloc[0]['message']

This code will clean it all up for us!

In [None]:
import re
def remove_html_tags(text):
    """Remove html tags from a string"""
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

Let's try it ...

In [None]:
remove_html_tags(raw_ads.iloc[0]['message'])

Ahhh, that's better. OK, let's apply this function to both of our data sets and store the result in a new column called "clean_message".

In [None]:
hand_coded_ads['clean_message'] = hand_coded_ads['message'].apply(remove_html_tags)
raw_ads['clean_message'] = raw_ads['message'].apply(remove_html_tags)

In [None]:
hand_coded_ads.head()

In [None]:
hand_coded_ads.iloc[4]['message']

In [None]:
hand_coded_ads.iloc[4]['clean_message']
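
The `remove_html_tags` function above depends on a non-greedy match (`.*?`). Here's a small sketch of why that `?` matters, using a made-up snippet rather than a row from our data. It also shows that HTML entities such as `&amp;` survive tag stripping. Decoding them with the standard library's `html.unescape` is an optional extra step I'm assuming you might want; it isn't part of the workshop pipeline.

```python
import html
import re

# Hypothetical ad text, not taken from the data set.
sample = '<p>Chip in <b>$5</b> today &amp; help us win!</p>'

# Non-greedy: each tag is matched and removed on its own.
non_greedy = re.sub(r'<.*?>', '', sample)

# Greedy: a single match runs from the first "<" to the last ">",
# so here the whole string gets swallowed.
greedy = re.sub(r'<.*>', '', sample)

# Optional extra step (an assumption, not in the notebook):
# turn entities like "&amp;" back into plain characters.
decoded = html.unescape(non_greedy)

print(repr(non_greedy))
print(repr(greedy))
print(repr(decoded))
```

If the ads in your slice contain a lot of entities, you could fold the `html.unescape` call into `remove_html_tags` before building the "clean_message" column.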

## The Language Model

First we're going to take a model trained on Wikipedia and give it some additional training on Facebook ads. We can use all of the data we have available to make this work better, so we're going to make a dataframe of ALL of our ads.

In [None]:
all_ads = pd.concat([hand_coded_ads,raw_ads], sort=True)

In [None]:
len(all_ads), len(hand_coded_ads), len(raw_ads)

In [None]:
# This takes a random 80% of the rows as the training set, and the rest as validation
# all_ads_train = all_ads.sample(frac=0.8,random_state=200)
# all_ads_validation = all_ads.drop(all_ads_train.index)

In [None]:
# Loading in data with the TextLMDataBunch factory class, using all the defaults
# data_lm = TextLMDataBunch.from_df('facebook_ad_data', train_df=all_ads_train, valid_df=all_ads_validation, text_cols='clean_message')

data_lm = (TextList.from_df(all_ads, cols=['clean_message'])
    .split_by_rand_pct(0.2)
    .label_for_lm()
    .databunch(bs=64))
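
The `split_by_rand_pct(0.2)` call above holds out a random 20% of the ads for validation. Conceptually, that looks something like the plain-Python sketch below. The `random_split` helper and the toy list of ads are hypothetical names for illustration, and this is a rough picture of the idea rather than fast.ai's actual code.

```python
import random

def random_split(items, valid_pct=0.2, seed=None):
    """Return (train, valid), holding out about valid_pct of items at random."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    # The first `cut` shuffled positions become the validation set.
    cut = int(valid_pct * len(items))
    valid_idx = set(indices[:cut])
    train = [item for i, item in enumerate(items) if i not in valid_idx]
    valid = [item for i, item in enumerate(items) if i in valid_idx]
    return train, valid

# Toy stand-in for the ad texts.
ads = [f"ad {i}" for i in range(10)]
train, valid = random_split(ads, valid_pct=0.2, seed=42)
print(len(train), len(valid))
```

Because the split is random, the validation numbers you see during training will shift a little from run to run unless you fix a seed.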

In [None]:
data_lm.save()

In [None]:
data_lm.show_batch()

Now we'll actually train the language model!

In [None]:
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

In [None]:
learn_lm.lr_find()

In [None]:
learn_lm.recorder.plot(suggestion=True)

In [None]:
learn_lm.fit_one_cycle(1, 7e-02, moms=(0.8,0.7))

In [None]:
learn_lm.fit_one_cycle(2, 1e-01, moms=(0.8,0.7))
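
The `fit_one_cycle` calls above don't hold the learning rate fixed. The rate ramps up to the value we pass in and then anneals back down, while the momentum moves between the two `moms` values in the opposite direction. Here's a rough sketch of that shape. The warm-up fraction (30%) and the start and end divisors (25 and 10,000) are assumptions I picked for illustration, not values read from this notebook, and `one_cycle` is a hypothetical helper, not a fast.ai function.

```python
import math

def cosine_anneal(start, end, pct):
    """Move smoothly from start to end as pct goes from 0 to 1."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle(step, total_steps, lr_max, moms=(0.8, 0.7),
              warmup_pct=0.3, div_start=25.0, div_end=1e4):
    """Return (learning_rate, momentum) for a given training step."""
    warmup_steps = int(total_steps * warmup_pct)
    if step < warmup_steps:
        # Phase 1: learning rate climbs, momentum falls.
        pct = step / warmup_steps
        lr = cosine_anneal(lr_max / div_start, lr_max, pct)
        mom = cosine_anneal(moms[0], moms[1], pct)
    else:
        # Phase 2: learning rate decays, momentum recovers.
        pct = (step - warmup_steps) / (total_steps - warmup_steps)
        lr = cosine_anneal(lr_max, lr_max / div_end, pct)
        mom = cosine_anneal(moms[1], moms[0], pct)
    return lr, mom

for step in (0, 30, 65, 100):
    lr, mom = one_cycle(step, total_steps=100, lr_max=7e-2)
    print(f"step {step:3d}: lr={lr:.6f} momentum={mom:.3f}")
```

The learning rate peaks exactly when the momentum bottoms out, which is why the two arguments are usually tuned together.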

In [None]:
# optionally save and reload the model (file is about 150MB)
learn_lm.save('fit_head')
learn_lm.load('fit_head');

In [None]:
# This takes a few minutes!
learn_lm.unfreeze()
learn_lm.fit_one_cycle(4, 1e-3, moms=(0.8,0.7))

In [None]:
TEXT = "I wonder if this"
N_WORDS = 40
N_SENTENCES = 3

print("\n".join(learn_lm.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
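
The `temperature` argument above controls how adventurous the generated text is. Here's a general sketch of temperature sampling with made-up tokens and scores. It illustrates the idea and is not fast.ai's exact implementation. The helper names and the three-word "vocabulary" are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature=1.0, rng=random):
    """Pick one token at random, weighted by the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical next-word candidates and scores.
tokens = ["vote", "donate", "sign"]
logits = [2.0, 1.0, 0.5]

for t in (0.25, 0.75, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

print(sample_token(tokens, logits, temperature=0.75))
```

Lower temperatures pile probability onto the top-scoring word, so the text gets safer and more repetitive. Higher temperatures spread the probability out, so the text gets more varied and more error-prone.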

In [None]:
# save the encoder we used for this ...
learn_lm.save_encoder('fine_tuned_encoder2')

## The Classification Model

This is where we use the language model to train a new model that will learn to classify ads as fundraising, list-building, or persuasion.

First, we train it on the hand-coded ads, splitting them into "training" and "validation" sets.

In [None]:
# data_lm = load_data('facebook_ad_data')

In [None]:
# data_clas = TextClasDataBunch.from_df('facebook_ad_data', train_df=hand_coded_train, valid_df=hand_coded_validation, vocab=data_lm.vocab, text_cols='clean_message', label_cols='label')

In [None]:
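# Only the cleaned text is used as input; including the 'label' column here would leak the answer into the text
data_clas = (TextList.from_df(hand_coded_ads, cols='clean_message', vocab=data_lm.vocab)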
    .split_by_rand_pct(0.2)
    .label_from_df(cols='label')
    .databunch(bs=64))
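
We pass `vocab=data_lm.vocab` above so the classifier turns words into the same numbers the language model was trained on. Otherwise the fine-tuned encoder would be looking up the wrong words. Here's a toy sketch of the idea. It splits on whitespace and keeps only an unknown-word token (`xxunk`), which is a big simplification of fast.ai's real tokenizer and vocabulary. The helper names and sample texts are hypothetical.

```python
from collections import Counter

UNK = "xxunk"  # stand-in for words the vocabulary has never seen

def build_vocab(texts, min_freq=1):
    """Build index-to-string and string-to-index lookups from a corpus."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    itos = [UNK] + sorted(tok for tok, n in counts.items() if n >= min_freq)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(text, stoi):
    """Map each word to its id, falling back to the unknown-word id."""
    return [stoi.get(tok, stoi[UNK]) for tok in text.lower().split()]

# Toy "language model" corpus.
lm_texts = ["donate today", "sign the petition today"]
itos, stoi = build_vocab(lm_texts)

# The classifier text must be encoded with the SAME lookup table.
ids = numericalize("donate to the petition", stoi)
print(itos)
print(ids)
```

Any word the language model never saw ("to" in this toy example) collapses to the unknown-word id, which is one reason to fine-tune the language model on as many ads as we can.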

In [None]:
data_clas.show_batch()

In [None]:
class_learner2 = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
class_learner2.load_encoder('fine_tuned_encoder2')

In [None]:
class_learner2.lr_find()

In [None]:
class_learner2.recorder.plot(suggestion=True)

In [None]:
lr = 1e-02
class_learner2.freeze()
class_learner2.fit_one_cycle(1,slice(lr/(2.6**4),lr), moms=(0.8,0.7) )

In [None]:
class_learner2.fit_one_cycle(2,slice(lr/(2.6**4),lr), moms=(0.8,0.7) )
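
The `slice(lr/(2.6**4), lr)` argument above gives different layer groups different learning rates: the earliest layers get the smallest rate and the final layers get `lr`. Here's a sketch of how such a range can be spread out in equal multiplicative steps. I'm assuming five layer groups, which would make each group's rate 2.6 times the previous one. `even_steps` is a hypothetical helper written for this illustration.

```python
def even_steps(start, stop, n):
    """Return n values from start to stop, each a constant multiple of the last."""
    if n == 1:
        return [stop]
    factor = (stop / start) ** (1 / (n - 1))
    return [start * factor ** i for i in range(n)]

lr = 1e-2
n_groups = 5  # assumption: number of layer groups in the classifier

group_lrs = even_steps(lr / (2.6 ** 4), lr, n_groups)
for i, group_lr in enumerate(group_lrs):
    print(f"layer group {i}: lr={group_lr:.6f}")
```

The idea is that the early layers already hold general knowledge about English from the language model, so we nudge them gently, while the freshly added classification layers need bigger updates.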

In [None]:
class_learner2.freeze_to(-2)
class_learner2.fit_one_cycle(1,slice(lr/(2.6**4),lr), moms=(0.8,0.7) )

In [None]:
class_learner2.freeze_to(-2)
class_learner2.fit_one_cycle(1,slice(lr/(2.6**4),lr), moms=(0.8,0.7) )

In [None]:
class_learner2.fit_one_cycle(3,slice(lr/(2.6**4),lr), moms=(0.8,0.7) )

In [None]:
class_learner2.freeze()
class_learner2.fit_one_cycle(1,slice(lr/(2.6**4),lr), moms=(0.8,0.7) )

In [None]:
class_learner2.freeze_to(-3)
class_learner2.fit_one_cycle(3,slice(lr/(2.6**4),lr), moms=(0.8,0.7) )

In [None]:
class_learner2.lr_find()

In [None]:
class_learner2.recorder.plot(suggestion=True)