# NLP with RNN

This notebook uses an RNN for text classification. We use the IMDB dataset to train a model that classifies movie reviews as either positive or negative.

## Tokenization  

### Word Tokenization with fast.ai

Grabbing the IMDB dataset:

In [None]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

We use get_text_files to collect the text files under the path for tokenization.  
We can also pass folders to restrict the search to a particular list of subfolders:

In [None]:
files = get_text_files(path, folders = ['train', 'test', 'unsup'])

Open the first file and read it:

In [None]:
txt = files[0].open().read(); txt[:75]

fastai uses a library called spaCy for word tokenization, which we get through WordTokenizer.  
We use fastai's coll_repr(collection, n) function to display the results; it shows the first n items of collection.  
The tokenizer only takes a collection of documents, so we have to pass txt wrapped in a list:

In [None]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

spaCy separates "." when it's being used to terminate a sentence but not in an acronym or number:

In [None]:
first(spacy(['The U.S. dollar $1 is $1.00.']))

fastai's Tokenizer class adds some additional functionality to the tokenization process:

In [None]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

Everything is lowercased, so the embedding matrix only has to handle lowercase text.  
Words that began with a capital letter are preceded by the special token 'xxmaj', and the beginning of a stream is marked by 'xxbos'.

Other special tokens:

In [None]:
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)

xxrep: replaces any character repeated 3 or more times with a special token for repetition (xxrep), followed by the number of times it's repeated, then the character.  
xxup: lowercases a word written in all caps and adds a special token for all caps (xxup) in front of it.
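
The rules described above can be sketched in plain Python. This is a simplified illustration, not fastai's implementation: the function names `replace_rep`, `mark_case`, and `toy_rules` are hypothetical, and whitespace splitting stands in for a real word tokenizer.

```python
import re

def replace_rep(text):
    """Replace a character repeated 3+ times with 'xxrep <count> <char>'."""
    def _sub(match):
        char, rest = match.groups()
        return f" xxrep {len(rest) + 1} {char} "
    return re.sub(r"(\S)(\1{2,})", _sub, text)

def mark_case(tokens):
    """Lowercase tokens, adding xxup before all-caps words and xxmaj before capitalized ones."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += ["xxup", tok.lower()]
        elif tok[:1].isupper():
            out += ["xxmaj", tok.lower()]
        else:
            out.append(tok)
    return out

def toy_rules(text):
    # Assumption: splitting on whitespace is good enough for this toy example.
    return ["xxbos"] + mark_case(replace_rep(text).split())

tokens = toy_rules("This movie was GREAT and sooooo good !!!!")
print(tokens)
```

Because the special tokens carry the case and repetition information, the vocab only needs the lowercase form of each word.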

### Subword Tokenization

This method has two steps:  
1) Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab.
2) Tokenize the corpus using this vocab of subword units.

Let's look at an example.  
For our corpus, we'll use the first 2,000 movie reviews:

In [None]:
txts = L(o.open().read() for o in files[:2000])

Now we can use setup(), a special fastai method that trains our tokenizer to find common sequences of characters and create the vocab.  
We'll create a function that trains a subword tokenizer with a given vocab size:

In [None]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    # Tokenize the sample review and show its first 40 subword tokens
    return ' '.join(first(sp([txt]))[:40])

Trying it:

In [None]:
subword(1000)

The special character ▁ represents a space character in the original text when using fastai's subword tokenizer.

For small vocabs, each token will represent fewer characters, and it will take more tokens to represent a sentence.  
For larger vocabs, most common English words will end up in the vocab themselves, and we will not need as many to represent a sentence.

In [None]:
subword(200)

In [None]:
subword(10000)
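
The effect of vocab size on token count can be illustrated with a toy merge loop. This is a minimal byte-pair-style sketch under stated assumptions, not the algorithm fastai's subword tokenizer actually runs: `learn_merges`, `apply_merge`, and the three-sentence corpus are hypothetical, and each merge stands in for adding one entry to the vocab.

```python
from collections import Counter

def apply_merge(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` with the joined symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(corpus, n_merges):
    """Greedily merge the most frequent adjacent pair of symbols, up to n_merges times."""
    # Start from single characters; '▁' marks a space in the original text.
    seqs = [list(text.replace(" ", "▁")) for text in corpus]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges, seqs

toy_corpus = ["the movie was good", "the movie was bad", "the film was good"]
for n in (0, 5, 20):
    _, tokenized = learn_merges(toy_corpus, n)
    print(n, len(tokenized[0]), tokenized[0])
```

With more merges (a larger vocab), the same sentence is covered by fewer, longer tokens, which mirrors the trade-off described above.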

### Numericalization

Numericalization is the process of mapping tokens to integers. It involves:  
1) Making a list of all possible levels of the categorical variable (here, the tokens); this list is the vocab.  
2) Replacing each level with its index in the vocab.

We'll use the word tokenized text:

In [None]:
toks = tkn(txt)
print(coll_repr(toks, 31))

In order to numericalize, we first have to call setup(), which creates the vocab.  
Since tokenization takes a while, it's done in parallel by fastai; but for this manual walkthrough, we'll use a small subset:

In [None]:
toks200 = txts[:200].map(tkn)
toks200[0]

In [None]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)

Our special rules tokens appear first, and then every word appears once, in frequency order.  
By default the vocab is capped at 60,000 entries, and that's the size of the embedding matrix by default.

Once we've created our Numericalize object, we can use it as if it were a function:

In [None]:
nums = num(toks)[:20]; nums

This replaces our tokens with a tensor of integers.  
We can check that they map back to the original text:

In [None]:
' '.join(num.vocab[o] for o in nums)
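
The two numericalization steps can be sketched without fastai. This is a simplified illustration, not what Numericalize does internally: `build_vocab`, `numericalize`, the choice of special tokens, and the toy documents are all assumptions made for the example.

```python
from collections import Counter

def build_vocab(token_lists, specials=("xxunk", "xxpad", "xxbos"), min_freq=1, max_vocab=60000):
    """Special tokens first, then the remaining tokens in descending frequency order."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = [tok for tok, n in counts.most_common(max_vocab)
             if n >= min_freq and tok not in specials]
    return list(specials) + words

def numericalize(tokens, vocab):
    """Map each token to its index; unseen tokens fall back to index 0 (xxunk here)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

docs = [["xxbos", "the", "movie", "was", "good"],
        ["xxbos", "the", "film", "was", "bad"],
        ["xxbos", "the", "movie", "was", "long"]]
vocab = build_vocab(docs)
ids = numericalize(["xxbos", "the", "movie", "was", "great"], vocab)
decoded = [vocab[i] for i in ids]
print(vocab)
print(ids, decoded)
```

The word "great" never appears in the toy documents, so it maps to the unknown token when decoded back.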

## Putting our Text into Batches

The first step is to transform the individual texts into a stream by concatenating them together. At the beginning of each epoch we will shuffle the entries to make a new stream (we shuffle the order of the documents, not the order of the words inside them, or the texts would not make sense anymore!). We then cut this stream into a certain number of contiguous mini-streams; that number is our batch size.

In [None]:
nums200 = toks200.map(num)

We pass the numericalized texts to LMDataLoader:

In [None]:
dl = LMDataLoader(nums200)

Let's see if we get the expected results by grabbing the first batch:

In [None]:
x,y = first(dl)
x.shape,y.shape

Looking at the first row of the independent variable, which should be the start of the first text:

In [None]:
' '.join(num.vocab[o] for o in x[0][:20])

The dependent variable is the same thing offset by one token:

In [None]:
' '.join(num.vocab[o] for o in y[0][:20])
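
The batching scheme can be sketched on a toy stream of integers. This is a minimal illustration of the idea, not LMDataLoader's implementation: `lm_batches` is a hypothetical helper, and shuffling of documents is left out.

```python
def lm_batches(stream, bs, seq_len):
    """Yield (x, y) batches from one long token stream; y is x shifted by one token."""
    # Each of the bs rows reads its own contiguous slice of the stream.
    row_len = (len(stream) - 1) // bs
    rows = [stream[i * row_len:(i + 1) * row_len + 1] for i in range(bs)]
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y

stream = list(range(25))  # stand-in for concatenated, numericalized texts
batches = list(lm_batches(stream, bs=3, seq_len=4))
x, y = batches[0]
print(x)
print(y)
```

Row 0 of the second batch continues where row 0 of the first batch stopped, so a recurrent model can carry its state from one batch to the next.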

## Train a Text Classifier

### Language Model Using DataBlock

Here's how we use TextBlock to create a language model, using fastai's defaults. TextBlock handles both tokenization and numericalization automatically.

In [None]:
#clear gpu cache memory
import torch
torch.cuda.empty_cache()

In [None]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks = TextBlock.from_folder(path, is_lm=True),
    get_items = get_imdb, splitter = RandomSplitter(0.1)
).dataloaders(path, path=path, bs=60, seq_len=100)

We set a batch size of 60 and a sequence length of 100.

We call show_batch:

In [None]:
dls_lm.show_batch(max_n=2)

### Fine tuning the Language Model

We'll create a learner which is going to learn to predict the next word of a movie review:

In [None]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics = [accuracy, Perplexity()]).to_fp16()

Call fit_one_cycle():

In [None]:
learn.fit_one_cycle(1, 2e-2)
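
The Perplexity metric reported during training is commonly defined as the exponential of the mean cross-entropy loss. The sketch below illustrates that relationship in plain Python; the helper names and the probabilities are hypothetical and are not taken from a real training run.

```python
import math

def cross_entropy(probs_of_correct_token):
    """Mean negative log-probability assigned to the correct next token."""
    return -sum(math.log(p) for p in probs_of_correct_token) / len(probs_of_correct_token)

def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)

# Hypothetical probabilities a model gave to the true next token at four positions.
probs = [0.25, 0.25, 0.25, 0.25]
loss = cross_entropy(probs)
print(loss, perplexity(loss))
```

A model that always spreads its belief evenly over four candidate words has a perplexity of about 4, so lower perplexity means the model is less "surprised" by the next word.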

### Saving and Loading Models

We can save the state of our model:

In [None]:
learn.save('1epoch')

We can then load the model:

In [None]:
learn.load('1epoch')

We can continue fine-tuning the model after unfreezing:

In [None]:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

We save only the encoder, since the final, task-specific layers will not be reused:

In [None]:
learn.save_encoder('finetuned')

#### Text Generation

We can also try something different: generating the text of a movie review.

In [None]:
TEXT = "I like this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature = 0.75)
         for _ in range(N_SENTENCES)]

In [None]:
print("\n".join(preds))

Not bad for a model with 35% accuracy.
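
The temperature argument used above controls how random the generated text is. The sketch below shows the usual way temperature is applied when sampling the next word; it is an illustration with hypothetical scores and helper names, and it does not reproduce how learn.predict works internally.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, rng=random):
    """Draw one index at random according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words
for t in (1.0, 0.75, 0.1):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
print(sample_next(logits, temperature=0.75, rng=random.Random(0)))
```

A temperature below 1 makes the most likely word even more likely, which gives more conservative text; a temperature above 1 flattens the distribution and gives more varied output.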

### Creating the Classifier Dataloaders

As we want a model that classifies reviews as positive or negative, we need to fine-tune the model for that specific task.  
Here's the dataloader for that:

In [None]:
dls_clas = DataBlock(
    blocks = (TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y = parent_label,
    get_items = partial(get_text_files, folders = ['train', 'test']),
    splitter = GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=60, seq_len=100)

Now, we can show batch:

In [None]:
dls_clas.show_batch(max_n=3)
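
In the DataBlock above, parent_label takes each review's label from the name of its parent folder, and GrandparentSplitter assigns it to the training or validation set based on the folder one level higher. The sketch below mimics that logic with pathlib; the helper names and the file paths are hypothetical and assume an `imdb/<split>/<label>/<file>` layout.

```python
from pathlib import PurePosixPath

def label_from_parent(path):
    """Use the name of the containing folder as the label (e.g. 'pos' or 'neg')."""
    return PurePosixPath(path).parent.name

def split_by_grandparent(paths, train_name="train", valid_name="test"):
    """Return (train_indices, valid_indices) based on the folder two levels up."""
    grand = [PurePosixPath(p).parent.parent.name for p in paths]
    train = [i for i, g in enumerate(grand) if g == train_name]
    valid = [i for i, g in enumerate(grand) if g == valid_name]
    return train, valid

# Hypothetical file paths following the assumed folder layout.
paths = ["imdb/train/pos/0_9.txt", "imdb/train/neg/1_2.txt", "imdb/test/pos/5_8.txt"]
labels = [label_from_parent(p) for p in paths]
train_idx, valid_idx = split_by_grandparent(paths)
print(labels, train_idx, valid_idx)
```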

For the language model, we could split the data into sequences of equal length, group them into batches, and feed them to the learner. We cannot do this for the movie reviews: we need to load a whole review to be able to classify it. With a batch size of 60 and reviews that can be up to about 3,000 words long, fitting a batch in GPU memory can be a problem. We can split the reviews into shorter sequences to avoid memory errors.  
Let's see this with an example by trying to create a mini-batch containing the first 10 documents. First we'll numericalize them:

In [None]:
nums_samp=toks200[:10].map(num)

Let's now look at how many tokens each of these 10 movie reviews have:

In [None]:
nums_samp.map(len)

The output shows that the reviews have different lengths, so they cannot be split into sequences of 100 as they are. We need padding (expanding the shorter texts so that they all have the same size) to be able to split them into sequences of equal length. As we are using the data block API (TextBlock with is_lm=False), we don't have to do this manually.
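
Padding can be sketched on a few short lists of token ids. This is a minimal illustration, not fastai's padding code: `pad_batch` is a hypothetical helper, the padding index of 1 is an assumption, and padding on the right is a choice made for the example.

```python
def pad_batch(seqs, pad_idx=1):
    """Right-pad every sequence with pad_idx up to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_idx] * (max_len - len(s)) for s in seqs]

# Hypothetical numericalized reviews of different lengths.
reviews = [[2, 8, 15, 9], [2, 40], [2, 7, 7, 31, 12, 5]]
padded = pad_batch(reviews)
print(padded)

# Sorting by length before batching keeps similarly sized texts together,
# so less padding is needed per batch.
by_length = sorted(reviews, key=len)
print([len(r) for r in by_length])
```

Once every row has the same length, the batch can be stored as a single rectangular tensor.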

We can now create a model to classify our texts:

In [None]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()

We can now load the encoder we saved earlier:

In [None]:
learn.load_encoder('finetuned')