In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as Pickle
import spacy



## Language modeling

### Data

The [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains a collection of 50,000 reviews from IMDB. The dataset contains an even number of positive and negative reviews. The authors considered only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets of 25,000 labeled reviews each.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

However, before we try to classify *sentiment*, we will simply try to create a *language model*; that is, a model that can predict the next word in a sentence. Why? Because our model first needs to understand the structure of English, before we can expect it to recognize positive vs negative sentiment.

So our plan of attack is the same as we used for Dogs v Cats: pretrain a model to do one thing (predict the next word), and fine tune it to do something else (classify sentiment).

Unfortunately, there are no good pretrained language models available to download, so we need to create our own. To follow along with this notebook, we suggest downloading the dataset from [this location](http://files.fast.ai/data/aclImdb.tgz) on files.fast.ai.

I have also added a dump of top Wikipedia articles from [this location](http://www.evanjones.ca/software/wikipedia2text.html), specifically the item listed as
**Extracted plain text: wikipedia2text-extracted.txt.bz2 (18 MB compressed; 63 MB uncompressed; 10 million words)**

The hope is that adding 10 million more words will yield a more general PKL file that can be reused in other NLP problems.

In [2]:
PATH='data/aclImdb'

TRN_PATH = 'train/all'
VAL_PATH = 'test/all'
TRN = f'{PATH}/{TRN_PATH}'
VAL = f'{PATH}/{VAL_PATH}'

# %ls {PATH}
!C:/ProgramData/chocolatey/bin/ls {PATH}
    # {PATH}

README
imdb.vocab
imdbEr.txt
models
test
tmp
train


Let's look inside the training folder...

In [3]:
trn_files = !ls {TRN}
trn_files[:10]

['0_0.txt',
 '0_3.txt',
 '0_9.txt',
 '10000_0.txt',
 '10000_4.txt',
 '10000_8.txt',
 '10001_0.txt',
 '10001_10.txt',
 '10001_4.txt',
 '10002_0.txt']

...and at an example review.

In [4]:
review = !cat {TRN}/{trn_files[6]}
review[0]

"Everybody has seen 'Back To The Future,' right? Whether you LIKE that movie or not, you've seen an example of how to make a time-travel movie work. A torn-up poster for 'Back To The Future' shows up in this movie, representing, perhaps unintentionally, what the makers of 'Tangents' (aka 'Time Chasers') did to the time-travel formula. Then again, the movie claims to have been made in 1994, but it looks -- and sounds -- like it was produced at least ten years earlier, so maybe they achieved time-travel after all.<br /><br />Start with an intensely unappealing leading man. I mean, what woman doesn't love gangly, whiny, lantern-jawed, butt-chinned, mullet-men with Coke-bottle glasses? Oh, none of you? Prepare to tough it out, ladies, cuz that's what this movie gives you.<br /><br />Second, add a leading lady who -- while not entirely unattractive -- represents many '80s clichÃ©s: big hair, too much makeup, two different plaids, shoulder pads, acid-washed mom-jeans, etc.<br /><br />Throw i

### Tokenize the text

Before we can analyze text, we must first *tokenize* it. This refers to the process of splitting a sentence into an array of words (or more generally, into an array of *tokens*).
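
As a rough illustration of what tokenizing means, here is a minimal sketch that uses only Python's `re` module. The `simple_tokenize` function and its regular expression are hypothetical stand-ins; spaCy's tokenizer, used below, handles many more cases (contractions, hyphenation, and so on).

```python
import re

# Hypothetical helper for illustration only; this is not spaCy's algorithm.
# Matches a run of word characters, or any single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into a list of word and punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = simple_tokenize("Everybody has seen 'Back To The Future,' right?")
print(tokens)
```

Punctuation marks become tokens of their own, which is why the tokenized review further down shows spaces around commas and quotes.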

*Note:* If you get an error like:

    Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
    
then you need to install the Spacy language model by running this command on the command-line:

    $ python -m spacy download en

In [5]:
spacy_tok = spacy.load('en')

In [6]:
' '.join([sent.string.strip() for sent in spacy_tok(review[0])])

"Everybody has seen ' Back To The Future , ' right ? Whether you LIKE that movie or not , you 've seen an example of how to make a time - travel movie work . A torn - up poster for ' Back To The Future ' shows up in this movie , representing , perhaps unintentionally , what the makers of ' Tangents ' ( aka ' Time Chasers ' ) did to the time - travel formula . Then again , the movie claims to have been made in 1994 , but it looks -- and sounds -- like it was produced at least ten years earlier , so maybe they achieved time - travel after all.<br /><br />Start with an intensely unappealing leading man . I mean , what woman does n't love gangly , whiny , lantern - jawed , butt - chinned , mullet - men with Coke - bottle glasses ? Oh , none of you ? Prepare to tough it out , ladies , cuz that 's what this movie gives you.<br /><br />Second , add a leading lady who -- while not entirely unattractive -- represents many ' 80s clichÃ © s : big hair , too much makeup , two different plaids , sh

We use PyTorch's [torchtext](https://github.com/pytorch/text) library to preprocess our data, telling it to use the wonderful [spaCy](https://spacy.io/) library to handle tokenization.

First, we create a torchtext *field*, which describes how to preprocess a piece of text - in this case, we tell torchtext to make everything lowercase, and tokenize it with spacy.

In [7]:
TEXT = data.Field(lower=True, tokenize="spacy")

fastai works closely with torchtext. We create a ModelData object for language modeling by taking advantage of `LanguageModelData`, passing it our torchtext field object, and the paths to our training, test, and validation sets. In this case, we don't have a separate test set, so we'll just use `VAL_PATH` for that too.

As well as the usual `bs` (batch size) parameter, we also now have `bptt`; this defines how many words are processed at a time in each row of the mini-batch. More importantly, it defines how many 'layers' we will backprop through. Making this number higher will increase time and memory requirements, but will improve the model's ability to handle long sentences.

In [8]:
bs=16; bptt=35

In [9]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

After building our `ModelData` object, it automatically fills the `TEXT` object with a very important attribute: `TEXT.vocab`. This is a *vocabulary*, which stores which words (or *tokens*) have been seen in the text, and how each word will be mapped to a unique integer id. We'll need to use this information again later, so we save it.

*(Technical note: Python's standard `pickle` library can't handle this correctly, so at the top of this notebook we imported the `dill` library instead, under the name `Pickle`.)*

In [10]:
Pickle.dump(TEXT, open(f'{PATH}/models/TEXTLOCALIMDBWIKIPEDIA.pkl','wb'))

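Here are the number of batches, the number of unique tokens in the vocab, the number of items in the training dataset (just one, for reasons explained below), and the number of tokens in that item: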

In [11]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(57838, 56823, 1, 32389949)

This is the start of the mapping from integer IDs to unique tokens.

In [12]:
# 'itos': 'int-to-string'
TEXT.vocab.itos[:12]

['<unk>', '<pad>', 'the', ',', '.', 'of', 'and', 'a', 'to', 'in', 'is', 'it']

In [13]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['the']

2
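
To make the two mappings concrete, here is a minimal sketch of how such a vocabulary could be built with the standard library. The `build_vocab` helper is hypothetical and only approximates what torchtext does: it reserves `<unk>` and `<pad>`, drops tokens rarer than `min_freq`, and orders the rest by frequency.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) for a token list; a toy stand-in for TEXT.vocab."""
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically so the order is stable.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

toy_tokens = "the movie was great , the plot was thin".split()
itos, stoi = build_vocab(toy_tokens, min_freq=2)

# Tokens below min_freq fall back to the <unk> id (0).
ids = [stoi.get(tok, stoi["<unk>"]) for tok in toy_tokens]
print(itos)
print(ids)
```

Because the ids depend on the exact token order in `itos`, any model trained on these ids is only usable together with the same vocabulary.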

Note that in a `LanguageModelData` object there is only one item in each dataset: all the words of the text joined together.

In [14]:
md.trn_ds[0].text[:12]

['i',
 'admit',
 ',',
 'the',
 'great',
 'majority',
 'of',
 'films',
 'released',
 'before',
 'say',
 '1933']

torchtext will handle turning these words into integer IDs for us automatically.

In [15]:
TEXT.numericalize([md.trn_ds[0].text[:12]])

Variable containing:
   17
 1570
    3
    2
  107
  994
    5
  140
  576
  160
  179
 5092
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

Our `LanguageModelData` object will create batches with 16 columns (that's our batch size), and varying sequence lengths of around 35 tokens (that's our `bptt` parameter - *backprop through time*).

Each batch also contains the exact same data as labels, but one word later in the text - since we're trying to always predict the next word. The labels are flattened into a 1d array.
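
The layout described here can be sketched in plain Python. This is an illustrative assumption about the shape of the data, not fastai's actual implementation (which, among other things, varies the sequence length from batch to batch): the token stream is cut into `bs` contiguous columns, and each mini-batch takes `bptt` consecutive rows, with labels shifted by one position. The helper names `batchify` and `get_batch` are hypothetical.

```python
def batchify(ids, bs):
    """Arrange a flat id list into rows x bs columns, dropping the remainder.
    Each column is one contiguous stretch of the original text."""
    n_rows = len(ids) // bs
    cols = [ids[c * n_rows:(c + 1) * n_rows] for c in range(bs)]
    # Transpose so that rows[r][c] is the r-th token of column c.
    return [[cols[c][r] for c in range(bs)] for r in range(n_rows)]

def get_batch(rows, start, bptt):
    """Return (inputs, flattened labels) for one mini-batch."""
    # Leave at least one row after the inputs so every input has a label.
    seq_len = min(bptt, len(rows) - 1 - start)
    x = rows[start:start + seq_len]
    # Labels are the same data one step later, flattened into a 1d list.
    y = [tok for row in rows[start + 1:start + 1 + seq_len] for tok in row]
    return x, y

rows = batchify(list(range(24)), bs=4)   # 6 rows, 4 columns
x, y = get_batch(rows, start=0, bptt=3)
print(x)
print(y)
```

Reading down a column of `x` gives consecutive tokens of the text, and each entry of `y` is the token that follows the corresponding entry of `x`.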

In [16]:
next(iter(md.trn_dl))

(Variable containing:
 
 Columns 0 to 10 
     17     19   9528   3230     98      4      4      2      0     17   3823
   1570     40    580     75    349     44   1125   1583     10   3978    154
      3      5      4   1024      3     43      3     49      7    158      3
      2      2      0     52     82     93  14438     88     97     46     62
    107    182   5090  26867     42     18      3    288   1651     93    183
    994    156     36     13    118    256  14438      3     32     33     36
      5    967    210     11      3      7    199     11      3    543      7
    140    355      8     10     13     40      3     15     20     11    124
    576      8    210    314     31    761   1897     62      7      3      4
    160    969      4     33     69    186      3     13    186     17     26
    179     21      9      8   4247   1335   8544      2     70     64      2
   5092      2   6923   5720    426     20      3     32     11   1459   3392
     28    544    121 

### Train

We have a number of parameters to set - we'll learn more about these later, but you should find these values suitable for many problems.

In [17]:
em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

Researchers have found that large amounts of *momentum* (which we'll learn about later) don't work well with these kinds of *RNN* models, so we create a version of the *Adam* optimizer with less momentum than its default of `0.9`.

In [18]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
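
To see what "less momentum" means for the `betas=(0.7, 0.99)` setting, recall that Adam's first beta controls an exponential moving average of past gradients. The sketch below is plain Python rather than PyTorch, it omits Adam's bias correction, and the gradient sequence is made up for illustration. It suggests that with 0.7 the average follows a change in the gradient faster than with 0.9.

```python
def ema(values, beta):
    """Exponential moving average in the style of Adam's first moment
    (without bias correction): m = beta * m + (1 - beta) * g."""
    m, out = 0.0, []
    for g in values:
        m = beta * m + (1 - beta) * g
        out.append(m)
    return out

# Made-up gradients: steady at +1, then flipping to -1.
grads = [1.0] * 10 + [-1.0] * 3

high = ema(grads, beta=0.9)   # Adam's default first beta
low = ema(grads, beta=0.7)    # the value used in this notebook

# After three negative gradients, the lower-momentum average has already
# turned negative, while the default one is still positive.
print(round(high[-1], 3), round(low[-1], 3))
```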

fastai uses a variant of the state of the art [AWD LSTM Language Model](https://arxiv.org/abs/1708.02182) developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through [Dropout](https://en.wikipedia.org/wiki/Convolutional_neural_network#Dropout). There is no simple way known (yet!) to find the best values of the dropout parameters below - you just have to experiment...

However, the other parameters (`alpha`, `beta`, and `clip`) shouldn't generally need tuning.

In [19]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

As you can see below, I gradually tuned the language model in a few stages. I possibly could have trained it further (it wasn't yet overfitting), but I didn't have time to experiment more. Maybe you can see if you can train it to a better accuracy! (I used `lr_find` to find a good learning rate, but didn't save the output in this notebook. Feel free to try running it yourself now.)

In [23]:
learner.fit(1e-3, 3, wds=1e-6, cycle_len=1, cycle_mult=2)

HBox(children=(IntProgress(value=0, description='Epoch', max=7), HTML(value='')))

epoch      trn_loss   val_loss                                                                                         
    0      5.089685   4.708887  
    1      5.078059   4.733121                                                                                         
 27%|█████████████████▌                                                | 15353/57838 [27:36<1:16:24,  9.27it/s, loss=5]

KeyboardInterrupt: 

In [21]:
learner.save_encoder('adam1_local_wenc')

In [None]:
learner.load_encoder('adam1_local_wenc')

In [None]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)

In the sentiment analysis section, we'll just need half of the language model - the *encoder*, so we save that part.

In [None]:
learner.save_encoder('adam3_local_w10_enc')

In [None]:
learner.load_encoder('adam3_local_w10_enc')

Language modeling accuracy is generally measured using the metric *perplexity*, which is simply `exp()` of the loss function we used.
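
As a small worked example, assuming the loss is the mean cross-entropy (in nats) over the predicted tokens, perplexity can be computed with the standard library. The `perplexity` helper and the probabilities are made up for illustration; the last line applies the same conversion to the first validation loss logged earlier in this notebook.

```python
import math

def perplexity(probs_of_correct_token):
    """exp of the mean negative log-likelihood of the correct next tokens."""
    nll = [-math.log(p) for p in probs_of_correct_token]
    return math.exp(sum(nll) / len(nll))

# A model that always gives the right word probability 1/4 is as uncertain
# as a uniform choice among 4 words, so its perplexity is 4 (up to rounding).
print(perplexity([0.25, 0.25, 0.25]))

# Converting a logged loss directly:
print(math.exp(4.708887))
```

Lower is better: a perplexity of 1 would mean the model always assigns probability 1 to the correct next word.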

In [None]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)

In [None]:
learner.save_encoder('adam3_local_w20_enc')

In [None]:
learner.load_encoder('adam3_local_w20_enc')

In [None]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

In [None]:
learner.save_encoder('adam3_local_w35_enc')

In [None]:
math.exp(4.165)

### Test

We can play around with our language model a bit to check it seems to be working OK. First, let's create a short bit of text to 'prime' a set of predictions. We'll use our torchtext field to numericalize it so we can feed it to our language model.

In [None]:
m=learner.model
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
s = [TEXT.preprocess(ss)]
t=TEXT.numericalize(s)
' '.join(s[0])

We haven't yet added methods to make it easy to test a language model, so we'll need to manually go through the steps.

In [None]:
# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs

Let's see what the top 10 predictions were for the next word after our short text:

In [None]:
nexts = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

...and let's see if our model can generate a bit more text all by itself!

In [None]:
print(ss,"\n")
for i in range(50):
    n=res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    res,*_ = m(n[0].unsqueeze(0))
print('...')
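
The loop above picks the most likely next token at each step, falling back to the second choice when the top one is `<unk>` (id 0). The same selection rule, stripped of the model, can be sketched in plain Python. The `pick_next` helper, the toy vocabulary, and the `scores` list are invented here and simply stand in for one row of model output.

```python
def pick_next(scores, unk_id=0):
    """Return the index of the highest score, skipping unk_id."""
    # Indices sorted by score, best first; keep the top two, like topk(2).
    top2 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
    return top2[1] if top2[0] == unk_id else top2[0]

toy_itos = ["<unk>", "<pad>", "the", "movie", "best"]
scores = [3.1, -2.0, 0.4, 2.7, 1.9]   # "<unk>" scores highest here

print(toy_itos[pick_next(scores)])
```

Always taking the top choice is the simplest decoding strategy; it is deterministic and tends to repeat itself over long outputs.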

We'll need to save the vocab from the language model as well as the encodings, since we need to ensure the same words map to the same IDs.

### End