# IMDB Review

## Download the Dataset File

In [0]:
# !pip install fastai==0.7.0
# import sys
# !{sys.executable} -m pip install torchtext==0.2.3

In [5]:
# !curl http://files.fast.ai/data/aclImdb.tgz -o aclImdb.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  139M  100  139M    0     0  29.9M      0  0:00:04  0:00:04 --:--:-- 31.9M


In [0]:
# !tar -xvzf aclImdb.tar.gz

## Load the Libraries

In [0]:
from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy

### Language Modeling

#### Data

The Large Movie Review Dataset contains a collection of 50,000 reviews from IMDB, with an even number of positive and negative reviews. The authors considered only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included. The dataset is divided into training and test sets of 25,000 labeled reviews each.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

However, before we try to classify sentiment, we will simply try to create a language model; that is, a model that can predict the next word in a sentence. Why? Because our model first needs to understand the structure of English, before we can expect it to recognize positive vs negative sentiment.

So our plan of attack is the same as we used for Dogs v Cats: pretrain a model to do one thing (predict the next word), and fine tune it to do something else (classify sentiment).

Unfortunately, there are no good pretrained language models available to download, so we need to create our own. To follow along with this notebook, we suggest downloading the dataset from files.fast.ai (http://files.fast.ai/data/aclImdb.tgz, the URL used in the `curl` cell above).

In [10]:
PATH = 'aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

imdbEr.txt  imdb.vocab  README  test/  train/


We do not have separate test and validation sets in this case. Just like in vision, the training directory has a bunch of files in it.

Let's look inside the training folder:

In [11]:
trn_files = !ls {TRN}
trn_files[:10]

['0_0.txt       1562_10.txt  24997_0.txt\t34371_0.txt  43748_0.txt  6248_7.txt',
 '0_3.txt       15621_0.txt  24998_0.txt\t3437_1.txt   43749_0.txt  6249_0.txt',
 '0_9.txt       1562_1.txt   24999_0.txt\t34372_0.txt  437_4.txt\t  6249_2.txt',
 '10000_0.txt   15622_0.txt  25000_0.txt\t34373_0.txt  43750_0.txt  6249_7.txt',
 '10000_4.txt   15623_0.txt  2500_0.txt\t34374_0.txt  4375_0.txt   624_9.txt',
 '10000_8.txt   15624_0.txt  25001_0.txt\t34375_0.txt  43751_0.txt  6250_0.txt',
 '1000_0.txt    15625_0.txt  2500_1.txt\t34376_0.txt  4375_1.txt   6250_10.txt',
 '10001_0.txt   15626_0.txt  25002_0.txt\t34377_0.txt  43752_0.txt  6250_1.txt',
 '10001_10.txt  15627_0.txt  25003_0.txt\t34378_0.txt  43753_0.txt  625_0.txt',
 '10001_4.txt   15628_0.txt  25004_0.txt\t3437_8.txt   43754_0.txt  625_10.txt']

Let's also look at an example review

In [12]:
review = !cat {TRN}{trn_files[6]}
review[0]

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop-socky fung-ku, but what I got instead was a comedy. So, it wasn't quite was I was expecting, but I really liked it anyway! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them!! I was laughing my ass off. I mean, the cops were just so bad! And when I say bad, I mean The Shield Vic Macky bad. But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose...man, oh man. What can you say about that hottie. She was great and put those other actresses to shame. She should work more often!!!!! I also really liked the fight scene outside of the building. That was done really well. Lots of fighting and people getting their heads banged up. FUN! Last, but not least Joe Estevez and William Smith were great as the...well, I wasn't sure what they were, but they seemed to be having fun and throwing out 

Now we will check how many words are in the dataset:

In [13]:
!find {TRN} -name '*.txt' | xargs cat | wc -w

17486581


In [14]:
!find {VAL} -name '*.txt' | xargs cat | wc -w

5686719


Before we can do anything with text, we have to turn it into a list of tokens. A token is roughly a word or a punctuation mark. Eventually we will turn the tokens into a list of numbers, but the first step is to turn the text into a list of tokens; this is called “tokenization” in NLP. A good tokenizer recognizes the pieces of a sentence: each punctuation mark becomes its own token, and multi-part words such as “wasn't” are split as appropriate (here into “was” and “n't”). spaCy is an NLP library that, among other things, provides a good tokenizer.
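
To make the idea concrete, here is a rough regex-based stand-in for a tokenizer. It is only a sketch: `simple_tokenize` is a hypothetical helper, not spaCy's algorithm, and it handles contractions and hyphenated words far more crudely than spaCy does.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("I really liked it anyway!"))
```

The point is only that punctuation ends up as separate tokens; the spaCy tokenizer used below makes much better decisions about where to split.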

In [0]:
spacy_tok = spacy.load('en')

In [16]:
' '.join([sent.string.strip() for sent in spacy_tok(review[0])])

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop - socky fung - ku , but what I got instead was a comedy . So , it was n't quite was I was expecting , but I really liked it anyway ! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them ! ! I was laughing my ass off . I mean , the cops were just so bad ! And when I say bad , I mean The Shield Vic Macky bad . But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose ... man , oh man . What can you say about that hottie . She was great and put those other actresses to shame . She should work more often ! ! ! ! ! I also really liked the fight scene outside of the building . That was done really well . Lots of fighting and people getting their heads banged up . FUN ! Last , but not least Joe Estevez and William Smith were great as the ... well , I was n't sure what they were , but they see

First, we create a torchtext field, which describes how to preprocess a piece of text - in this case, we tell torchtext to make everything lowercase, and tokenize it with spacy.



In [0]:
TEXT = data.Field(lower=True, tokenize="spacy")

- `lower=True` : lowercase the text
- `tokenize="spacy"` : tokenize with spaCy


In [0]:
bs=64
bptt=70

- `bs` : batch size
- `bptt` : backprop through time; how many tokens of each sequence we put on the GPU at once

In [0]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

- `PATH` : as usual, where the data is, where to save models, etc.
- `TEXT` : torchtext's Field definition
- `**FILES` : a dict naming the training, validation, and test folders (to keep things simple, we do not have a separate validation and test set, so both point to the `test/all/` folder)
- `min_freq=10` : in a moment, we are going to replace words with integers (a unique index for every word). Any word that occurs fewer than 10 times is simply treated as unknown.

After building our ModelData object, it automatically fills the TEXT object with a very important attribute: TEXT.vocab. This is a vocabulary, which stores which words (or tokens) have been seen in the text, and how each word will be mapped to a unique integer id. 
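
The vocabulary and the effect of `min_freq` can be sketched in plain Python. This is a simplified illustration, not torchtext's implementation: `build_vocab` and `numericalize` are hypothetical helpers, and the alphabetical tie-breaking between equally frequent words is an assumption.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): special tokens first, then words by descending frequency."""
    counts = Counter(tokens)
    # Keep words seen at least min_freq times; break frequency ties alphabetically
    kept = sorted((w for w, c in counts.items() if c >= min_freq),
                  key=lambda w: (-counts[w], w))
    itos = list(specials) + kept
    stoi = {w: i for i, w in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to IDs; anything missing from the vocab becomes <unk> (ID 0)."""
    return [stoi.get(t, stoi["<unk>"]) for t in tokens]

tokens = "the movie was good . the plot was thin . the end".split()
itos, stoi = build_vocab(tokens, min_freq=2)
print(itos)                                                # ['<unk>', '<pad>', 'the', '.', 'was']
print(numericalize(["the", "plot", "was", "good"], stoi))  # [2, 0, 4, 0]
```

Words below the frequency threshold ("plot", "good") never enter the vocabulary, so they all collapse onto the single `<unk>` ID.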

In [0]:
# !mkdir -p aclImdb/models

In [0]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl', 'wb'))

Here are the: # batches; # unique tokens in the vocab; # items in the training dataset (just one, the whole corpus); # tokens in the training set

In [24]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(4583, 37392, 1, 20540756)

This is the start of the mapping from integer IDs to unique tokens.

In [26]:
TEXT.vocab.itos[:12]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'in', 'it']

`itos` (integer-to-string) is sorted by frequency, except for the first two special tokens. Using `vocab.stoi` (string-to-integer), torchtext will turn words into integer IDs for us:

In [27]:
TEXT.vocab.stoi['the']

2


Note that in a LanguageModelData object there is only one item in each dataset: all the words of the text joined together.

In [28]:
md.trn_ds[0].text[:12]

['one',
 'of',
 'my',
 'all',
 'time',
 'favorite',
 'movies',
 'about',
 'wwii',
 '.',
 'we',
 'forget']

torchtext will handle turning these words into integer IDs for us automatically.

In [29]:
TEXT.numericalize([md.trn_ds[0].text[:12]])

Variable containing:
   37
    7
   76
   41
   71
  534
  116
   54
 3035
    4
   83
  872
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

Our LanguageModelData object will create batches with 64 columns (that's our batch size), and varying sequence lengths of around 70 tokens (that's our bptt parameter - backprop through time).

Each batch also contains the exact same data as labels, but one word later in the text - since we're trying to always predict the next word. The labels are flattened into a 1d array.

In [30]:
next(iter(md.trn_dl))

(Variable containing:
     37     10     73  ...    2961     95    119
      7      2     11  ...     495    126      3
     76    146     82  ...      17    279    246
         ...            ⋱           ...         
     35    256    181  ...     399      4    464
      2    180     20  ...       3    164      5
    498    394     11  ...    6554    291   1588
 [torch.cuda.LongTensor of size 66x64 (GPU 0)], Variable containing:
      7
      2
     11
   ⋮   
      9
    139
     21
 [torch.cuda.LongTensor of size 4224 (GPU 0)])

What happens in a language model is even though we have lots of movie reviews, they all get concatenated together into one big block of text. So we predict the next word in this huge long thing which is all of the IMDB movie reviews concatenated together.

- We split up the concatenated reviews into batches. In this case, we will split it to 64 sections

- We then move each section underneath the previous one, and transpose it.

- We end up with a matrix of roughly 321,000 rows by 64 columns (the 20,540,756 training tokens divided into 64 sections).

- We then grab a little chunk at a time, and those chunk lengths are approximately equal to `bptt`. Here, we grab a little section about 70 rows long, and that is the first thing we chuck into our GPU (i.e. the batch).

- We grab our first training batch by wrapping data loader with iter then calling next.

- We got back a 66 by 64 tensor (approximately 70 rows, but not exactly).

- A neat trick torchtext does is to randomly change the bptt number every time, so each epoch it gets slightly different bits of text, similar to shuffling images in computer vision. We cannot randomly shuffle the words because they need to be in the right order, so instead we randomly move their breakpoints a little bit.

- The target value is also 66 by 64, but for minor technical reasons it is flattened out into a single vector (of length 66 × 64 = 4224).
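
The steps above can be sketched in plain Python on a toy stream of 20 token IDs. This is an illustration under assumptions, not the library's code: `batchify`, `sample_seq_len`, and `get_batch` are hypothetical names, and drawing the sequence length from a normal distribution around `bptt` is one plausible scheme that the notebook does not confirm.

```python
import random

def batchify(ids, bs):
    """Arrange a flat token-ID stream into rows of bs columns.

    Column j holds the j-th contiguous section of the stream, so reading
    down a column follows the original text order.
    """
    n_rows = len(ids) // bs
    ids = ids[:n_rows * bs]          # drop the tail that does not fill a column
    return [[ids[j * n_rows + i] for j in range(bs)] for i in range(n_rows)]

def sample_seq_len(bptt, rng=random):
    """Draw a sequence length near bptt (assumed scheme, not taken from the notebook)."""
    return max(5, int(rng.gauss(bptt, 5)))

def get_batch(matrix, start, seq_len):
    """Return (inputs, flattened targets); targets are the inputs shifted one row down."""
    seq_len = min(seq_len, len(matrix) - 1 - start)
    x = matrix[start:start + seq_len]
    y = [tok for row in matrix[start + 1:start + 1 + seq_len] for tok in row]
    return x, y

matrix = batchify(list(range(20)), bs=4)
x, y = get_batch(matrix, start=0, seq_len=3)
print(x)   # 3 rows of 4 columns
print(y)   # 12 targets, each one token later in its column
```

Note how the flattened targets begin with the second row of the inputs, which matches the real batch shown earlier (the labels start `7, 2, 11`, the second row of the input tensor).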

## Train

We have a number of parameters to set - we'll learn more about these later, but you should find these values suitable for many problems



In [0]:
em_sz = 200   # size of each embedding vector
nh = 500      # number of hidden activations per layer
nl = 3        # number of layers

Researchers have found that large amounts of momentum (which we'll learn about later) don't work well with these kinds of RNN models, so we create a version of the Adam optimizer with less momentum than its default of 0.9.

In [0]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
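
The first value in `betas` controls how much of the previous gradient average is kept at each step. The sketch below shows only that exponential moving average, not the full Adam update (no bias correction, no second-moment term); the gradient sequence is made up for illustration.

```python
def ema(values, beta):
    """Exponential moving average: avg = beta * avg + (1 - beta) * value."""
    avg, out = 0.0, []
    for v in values:
        avg = beta * avg + (1 - beta) * v
        out.append(avg)
    return out

# Hypothetical gradient signal that flips sign halfway through
grads = [1.0] * 10 + [-1.0] * 10

high = ema(grads, beta=0.9)
low = ema(grads, beta=0.7)

# Number of steps after the flip (index 10) before each average turns negative
flip_high = next(i for i, v in enumerate(high[10:]) if v < 0)
flip_low = next(i for i, v in enumerate(low[10:]) if v < 0)
print(flip_high, flip_low)
```

With the lower beta the average follows the change of direction sooner, which is what "less momentum" means here.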

fastai uses a variant of the state-of-the-art AWD LSTM Language Model developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through dropout. There is no simple way known (yet!) to find the best values of the dropout parameters below - you just have to experiment...

However, the other parameters (alpha, beta, and clip) shouldn't generally need tuning.

In [0]:
learner = md.get_model(opt_fn, em_sz, nh, nl, dropouti=0.05,
                       dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

- There is another way to avoid overfitting that we will talk about in the last class. For now, `learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)` works reliably, so all of your NLP models probably want this particular line.

- `learner.clip=0.3` : when you look at your gradients and multiply them by the learning rate to decide how much to update your weights by, this will not allow them to be more than 0.3. This is a cool little trick to prevent us from taking too big a step.
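
Gradient clipping can be sketched in a few lines. This assumes the limit is applied to the overall L2 norm of the gradients, as PyTorch's `clip_grad_norm_` utility does; the notebook does not spell out the exact rule, and `clip_by_norm` is a hypothetical helper.

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale grads down so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)            # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=0.3)   # original norm is 5.0
print(clipped, math.sqrt(sum(g * g for g in clipped)))
```

The direction of the update is preserved; only its length is capped, which is why clipping limits the step size without changing where the step points.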

In [0]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

In [0]:
learner.save_encoder('adam1_enc')

In [0]:
learner.load_encoder('adam1_enc')

In [0]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)

In the sentiment analysis section, we'll just need half of the language model - the encoder, so we save that part.



In [0]:
learner.save_encoder('adam3_10_enc')

In [0]:
learner.load_encoder('adam3_10_enc')

Language modeling accuracy is generally measured using the metric perplexity, which is simply exp() of the loss function we used.

In [0]:
math.exp(4.165)
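
To get a feel for the number, here is a small sketch of perplexity computed from the probabilities a model assigns to the correct next tokens. It assumes the loss is the mean cross-entropy measured in nats; the probabilities are made up for illustration.

```python
import math

def perplexity(target_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    loss = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(loss)

# A model that spreads probability evenly over 4 choices has perplexity 4
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A more confident (and correct) model scores lower
confident = perplexity([0.9, 0.8, 0.95, 0.7])
print(uniform, confident)
```

One way to read the metric: a perplexity of N means the model is, on average, about as uncertain as if it were choosing uniformly among N words.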

In [0]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

## Test

We can play around with our language model a bit to check it seems to be working OK. First, let's create a short bit of text to 'prime' a set of predictions. We'll use our torchtext field to numericalize it so we can feed it to our language model.

In [0]:
m = learner.model
ss = """. So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
s = [TEXT.preprocess(ss)]
t=TEXT.numericalize(s)
' '.join(s[0])

We haven't yet added methods to make it easy to test a language model, so we'll need to manually go through the steps.

In [0]:
# set batch size to 1
m[0].bs = 1
# Turn off dropout
m.eval()
# reset hidden state
m.reset()
# Get predictions from model
res, *_ = m(t)
# put the batch size back to what it was
m[0].bs = bs

Let's see what the top 10 predictions were for the next word after our short text:

In [0]:
nexts = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

...and let's see if our model can generate a bit more text all by itself!

In [0]:
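print(ss,"\n")
for i in range(50):
    # take the two most likely next tokens
    n=res[-1].topk(2)[1]
    # if the top choice is '<unk>' (index 0), use the runner-up instead
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    # feed the chosen token back in to predict the one after it
    res,*_ = m(n[0].unsqueeze(0))
print('...')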
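
The same greedy loop can be tried without a GPU on a toy stand-in for the model. Everything here is hypothetical: `toy_scores` is a made-up lookup table, and unlike the real RNN it keeps no hidden state between steps.

```python
UNK = 0  # index of '<unk>' in the vocabulary, as in TEXT.vocab.itos above

def pick_next(scores):
    """Return the index of the highest score, or the runner-up if the top one is <unk>."""
    top2 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
    return top2[1] if top2[0] == UNK else top2[0]

def generate(score_fn, start, n_words):
    """Greedy decoding: feed each chosen token back in as the next input."""
    out, tok = [], start
    for _ in range(n_words):
        tok = pick_next(score_fn(tok))
        out.append(tok)
    return out

# Hypothetical stand-in for the language model: one score list per previous token
toy_scores = {
    1: [0.9, 0.0, 0.5, 0.1],   # <unk> scores highest, so index 2 is chosen
    2: [0.1, 0.2, 0.0, 0.7],
    3: [0.0, 0.6, 0.3, 0.1],
}
generated = generate(toy_scores.__getitem__, start=1, n_words=4)
print(generated)   # [2, 3, 1, 2]
```

Because the choice is always the single most likely token, greedy decoding like this tends to fall into repetitive loops, as the toy output already hints.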

## Sentiment

We'll need the saved vocab from the language model, since we need to ensure the same words map to the same IDs.

In [0]:
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl','rb'))

`sequential=False` tells torchtext that a text field should not be tokenized (in this case, we just want to store the 'positive' or 'negative' single label).

`splits` is a torchtext method that creates train, test, and validation sets. The IMDB dataset is built into torchtext, so we can take advantage of that. Take a look at `lang_model-arxiv.ipynb` to see how to define your own fastai/torchtext datasets.

In [0]:
IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

In [0]:
t = splits[0].examples[0]

In [0]:
t.label, ' '.join(t.text[:16])

fastai can create a ModelData object directly from torchtext splits.

In [0]:
md2 = TextData.from_splits(PATH, splits, bs)

In [0]:
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl,
                   dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.load_encoder('adam3_10_enc')

Because we're fine-tuning a pretrained model, we'll use differential learning rates, and also increase the max gradient for clipping, to allow the SGDR to work better.

In [0]:
m3.clip=25.
lrs=np.array([1e-4, 1e-4, 1e-4, 1e-3, 1e-2])

In [0]:
m3.freeze_to(-1)
m3.fit(lrs/2, 1, metrics=[accuracy])
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)

In [0]:
m3.fit(lrs, 7, metrics=[accuracy], cycle_len=2, cycle_save_name='imdb2')

In [0]:
m3.load_cycle('imdb2', 4)

In [0]:
accuracy_np(*m3.predict_with_targs())