# Language Models/Natural Language Processing

We can use neural networks to create a __language model__.

* __Language model__: A model that, given some words, can predict the next word.

* __Vocabulary__: The set of unique tokens (words and punctuation) known to the language model.

* __Sentiment classification problem__: Given some text, classify it according to category. For example, is a movie review positive or negative, is an online comment benign or should it be deleted, etc.

* __Perplexity__: The standard measure of how well a language model predicts text. It is just `exp()` of the loss function we use (`exp(x)` == $e^x$), so lower is better.
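
As a rough illustration of the perplexity definition, here is a minimal sketch. It assumes the loss is the average per-token cross-entropy measured with natural logarithms; the `perplexity` helper is a name made up for this example, not a fastai function.

```python
import math

def perplexity(cross_entropy_loss):
    """Hypothetical helper: perplexity is exp() of the average per-token loss."""
    return math.exp(cross_entropy_loss)

# A model that spreads its probability evenly over 4 possible next tokens
# has a loss of ln(4), which corresponds to a perplexity of about 4:
# the model is "as confused as" a uniform choice between 4 tokens.
uniform_loss = math.log(4)
print(perplexity(uniform_loss))
```

A lower loss therefore always means a lower perplexity, and a perplexity of 1 would mean the model predicts every next token with certainty.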

We will take a sentiment classification problem for IMDB: given a review, we want to classify it as positive or negative.

We will build this on `torchtext`, the `PyTorch` NLP library. `torchtext` uses `Fields`, which define how to pre-process and tokenize our text. E.g., make it all lowercase.

The steps are as follows:

First, build a language model:
1. Take a collection of existing movie reviews
2. Stick them all together into one big block of text
3. Split this text into batches.

-- Note: these batches differ from our image classification batches. Here the batch size is the number of segments we split the text into, not the number of items we process at once. For example, with a batch size of 64 we split our text into 64 segments, but if we had 1,000,000 words, each segment would still hold 1,000,000/64 = 15,625 words, which is too many to process in one go. So we also use:

__bptt__: Backpropagation through time. We take this many words from each segment at a time. It is also the number of time steps that gradients are backpropagated through.
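
The splitting described above can be sketched in plain Python. This is a toy version, not the fastai data loader: `batchify` and `bptt_chunks` are hypothetical helper names, and the token stream is just a list of integers.

```python
def batchify(token_ids, batch_size):
    """Split one long token stream into `batch_size` contiguous segments."""
    segment_len = len(token_ids) // batch_size
    # Segment i holds tokens [i * segment_len, (i + 1) * segment_len);
    # any leftover tokens at the end are dropped.
    return [token_ids[i * segment_len:(i + 1) * segment_len]
            for i in range(batch_size)]

def bptt_chunks(segments, bptt):
    """Yield slices of at most `bptt` tokens taken from every segment in parallel."""
    segment_len = len(segments[0])
    for start in range(0, segment_len, bptt):
        yield [segment[start:start + bptt] for segment in segments]

stream = list(range(1_000_000))               # stand-in for 1,000,000 token IDs
segments = batchify(stream, batch_size=64)
print(len(segments), len(segments[0]))        # 64 15625

first_chunk = next(bptt_chunks(segments, bptt=70))
print(len(first_chunk), len(first_chunk[0]))  # 64 70
```

Each chunk is small enough to process in one step, while consecutive chunks continue the same 64 segments of text.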

### Back to building the language model...

4. Map all of our unique words to an integer (unique tokens)
5. Map all of these unique tokens (representing words) to an embedding matrix
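6. Set `learner.clip` to clip the gradients, so that a single update never takes too big a step
7. Take __bptt__ words from our batches and process them
-- at each position the model predicts the next word, the loss compares that prediction with the actual next word, and backprop updates the embedding matrix along with the rest of the network's weights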
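
Step 4 (mapping unique words to integers) can be sketched as follows. This is a toy version of what the vocabulary does, under stated assumptions: `build_vocab` and `numericalize` are hypothetical helpers written for this example, and ties between equally frequent tokens are broken alphabetically, which may differ from the real library.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1):
    """Return (itos, stoi) for tokens seen at least `min_freq` times."""
    counts = Counter(tokens)
    itos = ['<unk>', '<pad>']  # special tokens come first
    # Most frequent tokens first; ties broken alphabetically (an assumption)
    for token, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_freq:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to integers; anything outside the vocab becomes <unk>."""
    return [stoi.get(token, stoi['<unk>']) for token in tokens]

tokens = "the film was good . the cast was good too .".split()
itos, stoi = build_vocab(tokens, min_freq=2)
print(itos)  # ['<unk>', '<pad>', '.', 'good', 'the', 'was']
print(numericalize("the film was great .".split(), stoi))  # [4, 0, 5, 0, 2]
```

Rare words such as `film` fall below `min_freq` and collapse into `<unk>`, which keeps the embedding matrix from growing a row for every one-off token.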

Once the language model is built we:

1. Run our individual reviews through the pre-trained language model
-- Adding extra layers on top to perform the classification (such as positive or negative)

In [5]:
# First set up our imports
# Note that pickle has some problems with the data, so we use dill instead
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle

In [6]:
# Next set up our paths
PATH = 'data/aclImdb/'

TRAINING_PATH = 'train/all/'
VALIDATION_PATH = 'test/all/'

TRAINING = f'{PATH}{TRAINING_PATH}'
VALIDATION = f'{PATH}{VALIDATION_PATH}'

%ls {PATH}

imdbEr.txt  imdb.vocab  models/  README  test/  tmp/  train/


In [7]:
# Inside the training folder, we have text files with IMDB reviews
training_files = !ls {TRAINING}
training_files[:10]

['0_0.txt',
 '0_3.txt',
 '0_9.txt',
 '10000_0.txt',
 '10000_4.txt',
 '10000_8.txt',
 '1000_0.txt',
 '10001_0.txt',
 '10001_10.txt',
 '10001_4.txt']

In [8]:
# The reviews look like this:
review = !cat {TRAINING}{training_files[0]}
review[0]

'I admit, the great majority of films released before say 1933 are just not for me. Of the dozen or so "major" silents I have viewed, one I loved (The Crowd), and two were very good (The Last Command and City Lights, that latter Chaplin circa 1931).<br /><br />So I was apprehensive about this one, and humor is often difficult to appreciate (uh, enjoy) decades later. I did like the lead actors, but thought little of the film.<br /><br />One intriguing sequence. Early on, the guys are supposed to get "de-loused" and for about three minutes, fully dressed, do some schtick. In the background, perhaps three dozen men pass by, all naked, white and black (WWI ?), and for most, their butts, part or full backside, are shown. Was this an early variation of beefcake courtesy of Howard Hughes?'

In [9]:
# Count the review files in the training set
# (without `cat`, xargs just echoes the file names, so wc -w counts files, not words)
!find {TRAINING} -name '*.txt'  | xargs | wc -w

75000


In [10]:
!find {VALIDATION} -name '*.txt' | xargs | wc -w

25000


Before we can do anything with the text, we have to __tokenize__ it -- split it up into tokens.

We use __spaCy__ to do this. Note that the tokens will not just be words, but also punctuation. Contractions like _wasn't_ are split into multiple tokens too -- _was_ and _n't_.

In [11]:
' '.join(spacy_tok(review[0]))

'I admit , the great majority of films released before say 1933 are just not for me . Of the dozen or so " major " silents I have viewed , one I loved ( The Crowd ) , and two were very good ( The Last Command and City Lights , that latter Chaplin circa 1931 ) . \n\n So I was apprehensive about this one , and humor is often difficult to appreciate ( uh , enjoy ) decades later . I did like the lead actors , but thought little of the film . \n\n One intriguing sequence . Early on , the guys are supposed to get " de - loused " and for about three minutes , fully dressed , do some schtick . In the background , perhaps three dozen men pass by , all naked , white and black ( WWI ? ) , and for most , their butts , part or full backside , are shown . Was this an early variation of beefcake courtesy of Howard Hughes ?'

In [12]:
# Now we will set up our torchtext Field. The Field defines how to preprocess the text.
TEXT = data.Field(lower=True, tokenize=spacy_tok)

In [13]:
# Next set up our batch size and bptt (backpropagation through time)
# We will split our text into batch_size batches, and then take bptt words from each batch
batch_size = 64
bptt = 70

In [14]:
# Now we'll build our ModelData object.
# It will fill the TEXT object with the vocab -- the list of unique words that have been seen, and how they map
# to a unique integer.
# Only words that occur at least min_freq times will be included.

# Note we don't have separate test data, so the validation set is reused as the test set
FILES = dict(train=TRAINING_PATH, validation=VALIDATION_PATH, test=VALIDATION_PATH)
model_data = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=batch_size, bptt=bptt, min_freq=10)

In [15]:
# Now save the TEXT field, including its vocab, so the same word-to-integer mapping can be reloaded later
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl', 'wb'))

We can now query the model to get some information.

In [16]:
# Number of batches (training data loader)
len(model_data.trn_dl)

4603

In [17]:
# Number of unique vocab tokens
model_data.nt

34933

In [18]:
# Number of items in the training dataset
# (Everything is stuck together into 1 big sentence, so there is just 1)
len(model_data.trn_ds)

1

In [19]:
# Number of tokens in the training text
len(model_data.trn_ds[0].text)

20626674

In [20]:
# We can use int-to-string to see some of the words we have
# <unk> = unknown
TEXT.vocab.itos[:10]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']

In [21]:
# Conversely we can see the int mapping for a given word
TEXT.vocab.stoi['the']

2

In [22]:
# Note that torchtext will handle turning the words into integers
TEXT.numericalize([model_data.trn_ds[0].text[:10]])

Variable containing:
   40
  101
    3
   12
  212
   13
   19
    6
  685
    8
[torch.cuda.LongTensor of size 10x1 (GPU 0)]

Now, our `model_data` object will create 64 columns of data -- i.e., it will split all of the text into 64 segments (64 is our batch size). It then cuts these columns into sequences of around 70 tokens; 70 is our `bptt` value. Note that it is _around_ 70, because the data loader randomizes the sequence length slightly for each batch to keep some variety in the data (the batch shown below happens to be 81 tokens long).

So these ~70 tokens * 64 columns can be thought of as our batch -- the chunk of data we're going to process.

Each of these batches also contains a flattened-out version of the same data, but one token ahead, since we are trying to predict the next token.

<img src="https://raw.githubusercontent.com/pekoto/fast.ai/master/images/nlp-batch2.jpg" width=800 height=500>
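
The relationship between the inputs and the one-token-ahead targets can be sketched in plain Python. This is an illustration under stated assumptions rather than the fastai loader: `get_batch` is a hypothetical helper, and the two short segments stand in for the 64 real columns.

```python
def get_batch(segments, start, seq_len):
    """Return (inputs, targets) for `seq_len` rows beginning at `start`.

    inputs  -- seq_len rows, one column per segment
    targets -- the same positions shifted one token ahead, flattened row by row
    """
    segment_len = len(segments[0])
    # Leave room for the final target token
    seq_len = min(seq_len, segment_len - 1 - start)
    inputs = [[segment[start + row] for segment in segments]
              for row in range(seq_len)]
    targets = [segment[start + row + 1]
               for row in range(seq_len)
               for segment in segments]
    return inputs, targets

segments = [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]  # 2 columns instead of 64
inputs, targets = get_batch(segments, start=0, seq_len=3)
print(inputs)   # [[0, 10], [1, 11], [2, 12]]
print(targets)  # [1, 11, 2, 12, 3, 13]
```

The targets begin with the second row of the inputs, which is the same pattern visible in the real batch output: a `rows x 64` input tensor paired with a flat target of `rows * 64` values.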

In [23]:
# Convert the training data loader into an iterator, and get the next one
# Effectively getting the next batch to process
next(iter(model_data.trn_dl))

(Variable containing:
     40   8328     16  ...     613     98      4
    101     72      6  ...       2      8      8
      3    674    759  ...     111     79     14
         ...            ⋱           ...         
    680     24     99  ...      32     19   4934
      4     63    193  ...       0    635   1219
     10     11    112  ...       5    274      3
 [torch.cuda.LongTensor of size 81x64 (GPU 0)], Variable containing:
    101
     72
      6
   ⋮   
    397
    947
      2
 [torch.cuda.LongTensor of size 5184 (GPU 0)])

Now it's time to start training.

To train, we will use an embedding matrix like we used for structured data prediction.
So each word will map to a row in this embedding matrix.

Jeremy suggests a good size for the rows (vectors) in the embedding matrix is around 200-600.
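
Looking a word up in an embedding matrix is just row indexing. The sketch below uses plain Python lists with random values purely for illustration; `make_embedding_matrix` and `embed` are hypothetical names, and a real model would learn the values during training.

```python
import random

def make_embedding_matrix(vocab_size, emb_size, seed=0):
    """One row of `emb_size` small random numbers per vocabulary entry."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(emb_size)]
            for _ in range(vocab_size)]

def embed(token_ids, matrix):
    """Replace each token ID with its row (vector) from the matrix."""
    return [matrix[token_id] for token_id in token_ids]

matrix = make_embedding_matrix(vocab_size=10, emb_size=200)
vectors = embed([2, 5, 2], matrix)
print(len(vectors), len(vectors[0]))  # 3 200
```

The same token ID always returns the same row, so every occurrence of a word shares (and updates) one vector.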

In [24]:
embedding_matrix_vector_size = 200

In [25]:
hidden_activations_per_layer = 500

In [26]:
number_of_layers = 3

In [27]:
# We'll also define an optimization function, which will be covered later
optimization_function = partial(optim.Adam, betas=(0.7, 0.99))

We'll set various dropout parameters for our learner. These values generally should not need tweaking. They are based on [this research](https://arxiv.org/abs/1708.02182).

The `.clip` parameter clips the gradients (here at 0.3) for the usual reason -- to stop a single update from taking too big a step and overshooting the minimum.
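
The clipping step can be sketched as follows, assuming `clip` bounds the overall gradient norm (as `torch.nn.utils.clip_grad_norm_` does) rather than the learning rate itself. `clip_grad_norm` here is a plain-Python stand-in written for this example, operating on a flat list of gradient values.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale `grads` so their L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already small enough, leave untouched
    scale = max_norm / total_norm
    # Every gradient shrinks by the same factor, so the direction is preserved
    return [g * scale for g in grads]

clipped = clip_grad_norm([3.0, 4.0], max_norm=0.3)
print(clipped)  # roughly [0.18, 0.24]: norm 5 scaled down to norm 0.3
```

Because only the length of the gradient vector changes, the update still points the same way; it just cannot be arbitrarily large.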

In [28]:
learner = model_data.get_model(optimization_function, embedding_matrix_vector_size, hidden_activations_per_layer, number_of_layers,
                              dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)

learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

In [29]:
# Now we do our learning
# wds = weight decay
learner.fit(3e-4, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

epoch      trn_loss   val_loss                                
    0      5.62888    5.534121  
 51%|█████     | 2336/4603 [10:53<10:34,  3.57it/s, loss=5.33]

KeyboardInterrupt: 

In [30]:
learner.save_encoder('adam1_enc')

In [31]:
learner.load_encoder('adam1_enc')

In [33]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)

epoch      trn_loss   val_loss                                
    0      4.836986   4.719616  
    1      4.638133   4.516003                                
 52%|█████▏    | 2372/4603 [11:06<10:26,  3.56it/s, loss=4.57]


KeyboardInterrupt: 

In [34]:
learner.save_encoder('adam3_10_enc')

In [35]:
learner.save_encoder('adam3_20_enc')

In [36]:
learner.load_encoder('adam3_20_enc')

In [37]:
# Now we get the perplexity -- a standard measure of how well a language model predicts text
# This is just exp(loss), i.e. e^loss; 4.57 is the last training loss shown above
math.exp(4.57)

96.54410977284468

In [38]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

## Testing
We can play around with our language model a bit.
If we give it some words, it should write some more.

In [39]:
model = learner.model
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
s = [spacy_tok(ss)]
t=TEXT.numericalize(s)
' '.join(s[0])

". So , it was n't quite was I was expecting , but I really liked it anyway ! The best"

In [42]:
# Set up for testing, get predictions
# Set batch size to 1
model[0].bs=1
# Turn off dropout
model.eval()
# Reset hidden state
model.reset()
# Get predictions from model
res,*_ = model(t)
# Put the batch size back to what it was
model[0].bs=batch_size

In [43]:
# Get the top 10 next predictions from our text
next_words = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(next_words)]

['performance',
 'is',
 'part',
 'of',
 'was',
 'actor',
 ',',
 'movie',
 'friend',
 'thing']

In [45]:
# Let the model try to generate more text by itself
print(ss,"\n")
for i in range(50):
    n=res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    res,*_ = model(n[0].unsqueeze(0))
print('...')

. So, it wasn't quite was I was expecting, but I really liked it anyway! The best 

performance by the director , the director , and the director , and the director , and the director , and the director , and the director , and the director , and the director , and the director , should have been able to make a film . 

 ...
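
The selection rule used in the generation loop (take the top prediction, unless it is `<unk>`, in which case take the runner-up) can be sketched in isolation. `pick_next_token` is a hypothetical helper for this example and works on a plain list of scores rather than a tensor.

```python
def pick_next_token(scores, unk_index=0):
    """Return the index of the best-scoring token, skipping <unk> if it wins."""
    # Indices of the two highest scores, best first
    top_two = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
    return top_two[1] if top_two[0] == unk_index else top_two[0]

print(pick_next_token([0.9, 0.1, 0.5]))  # 2 (index 0 is <unk>, so use the runner-up)
print(pick_next_token([0.1, 0.2, 0.7]))  # 2 (the top prediction is a real token)
```

Always taking the single most likely token is a greedy strategy, which is one plausible reason the generated text above falls into a repeating loop; sampling from the predicted distribution is a common alternative.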


## Sentiment analysis

When performing sentiment analysis, we need to make sure we use the same vocabulary as the model. The same words need to map to the same ID.
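
A small sketch of why the saved vocabulary matters. The dictionaries here are made-up toy mappings, and the standard `pickle` module stands in for `dill`; the point is only that a reloaded mapping keeps its IDs, while a mapping rebuilt from different text may not.

```python
import pickle

# Toy word-to-ID mapping used when the language model was trained
stoi = {'<unk>': 0, '<pad>': 1, 'the': 2, ',': 3, '.': 4}

# Saving and reloading preserves every ID exactly
restored = pickle.loads(pickle.dumps(stoi))
print(restored['the'])  # 2

# A vocabulary rebuilt from other text could order the words differently
rebuilt = {'<unk>': 0, '<pad>': 1, '.': 2, 'the': 3}
print(rebuilt['the'])   # 3 -- this would select the wrong embedding row
```

Since each ID is a row index into the embedding matrix learned by the language model, a shifted ID silently feeds the classifier the vector of a different word.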

In [46]:
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl', 'rb'))

In [48]:
# sequential=False -- the label is a single value ('pos' or 'neg'), so it should not be tokenized
# splits -- tell torchtext to create the dataset splits
IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

downloading aclImdb_v1.tar.gz


In [49]:
t = splits[0].examples[0]

In [50]:
t.label, ' '.join(t.text[:16])

('pos',
 "fantastic documentary of 1924 . this early 20th century geography of today 's iraq was powerful")

In [53]:
# Create the fast ai model data from the torchtext splits
model2 = TextData.from_splits(PATH, splits, batch_size)

In [62]:
# Train the sentiment model using differential learning rates since we have a pretrained model
model3 = model2.get_model(optimization_function, 1500, bptt, emb_sz=embedding_matrix_vector_size, n_hid=hidden_activations_per_layer, n_layers=number_of_layers, 
           dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
model3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
model3.load_encoder('adam3_20_enc')

# Set gradient clipping on the classifier learner (model3), not on the language model
model3.clip=25.
learning_rates=np.array([1e-4,1e-3,1e-2])

In [64]:
# Train
model3.freeze_to(-1)

In [None]:
model3.fit(learning_rates/2, 1, metrics=[accuracy])

In [None]:
model3.unfreeze()
model3.fit(learning_rates, 1, metrics=[accuracy], cycle_len=1)

In [None]:
model3.fit(learning_rates, 7, metrics=[accuracy], cycle_len=2, cycle_save_name='imdb2')

In [None]:
accuracy(*model3.predict_with_targs())