In [1]:
from fastai2.text.all import *

In [2]:
path = untar_data(URLs.IMDB)
Path.BASE_PATH = path
path.ls()

(#7) [Path('README'),Path('tmp_lm'),Path('tmp_clas'),Path('imdb.vocab'),Path('train'),Path('test'),Path('unsup')]

In [3]:
files = get_text_files(path, folders = ['train', 'test', 'unsup']) # folders restricts the search to files from those folders
files

(#100000) [Path('train/neg/4336_4.txt'),Path('train/neg/10833_3.txt'),Path('train/neg/4851_3.txt'),Path('train/neg/4602_3.txt'),Path('train/neg/10929_1.txt'),Path('train/neg/1183_2.txt'),Path('train/neg/6287_4.txt'),Path('train/neg/1907_2.txt'),Path('train/neg/7652_3.txt'),Path('train/neg/11370_1.txt')...]

In [24]:
txt = files[0].open().read()
txt

"I watched this movie and the original Carlitos Way back to back. The difference between the two is disgusting. Now i know that people are going to say that the prequel was made on a small budget but that never had anything to do with a bad script. Now maybe it's just me, but i always thought that a prequel was made to go set up the other movie, starring key characters and maybe filling in a bit about life that we didn't know. Rise to Power is just a movie that has Carlito's name. There should have been at least a few characters from the original movie, the ending makes no sense in relation to the original. In the end of this movie he retires with his sweet heart but how the hell do we get him coming out of prison in the next movie? And his woman isn't even the same woman that he talks about as his only love in the original. I would say the movie is mildly entertaining in its self, with a few decent bits but it pales when held up to it's big brother. Don't lay awake at night waiting to

### Spacy

In [19]:
# Using Fastai's default tokenizer: Spacy
# Two methods available: WordTokenizer and SpacyTokenizer

spacy = WordTokenizer() # WordTokenizer points to fastai's current default word tokenizer, which need not always be spaCy

In [25]:
toks = first(spacy([txt]))
coll_repr(toks, 30)

"(#233) ['I','watched','this','movie','and','the','original','Carlitos','Way','back','to','back','.','The','difference','between','the','two','is','disgusting','.','Now','i','know','that','people','are','going','to','say'...]"

In [26]:
first(spacy(['The U.S. dollar $1 is $1.00.']))

(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']

### Fastai's Tokenizer Class

fastai's `Tokenizer` class applies a few small modifications on top of the spaCy tokenizer.

In [27]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#246) ['xxbos','i','watched','this','movie','and','the','original','xxmaj','carlitos','xxmaj','way','back','to','back','.','xxmaj','the','difference','between','the','two','is','disgusting','.','xxmaj','now','i','know','that','people'...]


- Some tokens now start with the characters "xx", which is not a common word prefix in English. These are special tokens.
- "xxbos" is a special token that indicates the start of a new text ("BOS" is a standard NLP acronym which means "beginning of stream").
- These special tokens are added by the fastai library on top of spaCy's tokenization.


> For instance, the rules will replace a sequence of four exclamation points with a single exclamation point, followed by a special repeated character token, and then the number four. In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalised word will be replaced with a special capitalisation token, followed by the lower case version of the word. This way, the embedding matrix only needs the lower case version of the words, saving compute and memory, but can still learn the concept of capitalisation.

To see the rules that were used:

In [30]:
defaults.text_proc_rules

[<function fastai2.text.core.fix_html(x)>,
 <function fastai2.text.core.replace_rep(t)>,
 <function fastai2.text.core.replace_wrep(t)>,
 <function fastai2.text.core.spec_add_spaces(t)>,
 <function fastai2.text.core.rm_useless_spaces(t)>,
 <function fastai2.text.core.replace_all_caps(t)>,
 <function fastai2.text.core.replace_maj(t)>,
 <function fastai2.text.core.lowercase(t, add_bos=True, add_eos=False)>]

In [31]:
replace_rep??

Signature: replace_rep(t)
Source:
def replace_rep(t):
    "Replace repetitions at the character level: cccc -- TK_REP 4 c"
    def _replace_rep(m):
        c,cc = m.groups()
        return f' {TK_REP} {len(cc)+1} {c} '
    return _re_rep.sub(_replace_rep, t)
File:      /opt/conda/lib/python3.7/site-packages/fastai2/text/core.py
Type:      function


Here is a brief summary of what each does:

- `fix_html`: replaces special HTML characters with a readable version (IMDb reviews have quite a few of them, for instance);
- `replace_rep`: replaces any character repeated three times or more with a special token for repetition (xxrep), the number of times it's repeated, then the character;
- `replace_wrep`: replaces any word repeated three times or more with a special token for word repetition (xxwrep), the number of times it's repeated, then the word;
- `spec_add_spaces`: adds spaces around / and #;
- `rm_useless_spaces`: removes all repetitions of the space character;
- `replace_all_caps`: lowercases a word written in all caps and adds a special token for all caps (xxup) in front of it;
- `replace_maj`: lowercases a capitalized word and adds a special token for capitalized (xxmaj) in front of it;
- `lowercase`: lowercases all text and adds a special token at the beginning (xxbos) and/or the end (xxeos).
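
To make two of these rules concrete, here is a minimal standalone sketch. `replace_rep_sketch` mirrors the `replace_rep` source shown above, but the regular expression `_re_rep` is an assumption, since its definition is not part of that output. `mark_capitals` is a hypothetical token-level simplification of what `replace_maj` and `lowercase` achieve together; it is not fastai's implementation.

```python
import re

TK_REP, TK_MAJ = "xxrep", "xxmaj"  # token names taken from the rule summary above

# Assumed pattern: one non-space character followed by 2+ copies of itself
_re_rep = re.compile(r"(\S)(\1{2,})")

def replace_rep_sketch(t):
    "Replace a character repeated 3+ times with: TK_REP, the count, the character"
    def _replace(m):
        c, cc = m.groups()
        return f" {TK_REP} {len(cc) + 1} {c} "
    return _re_rep.sub(_replace, t)

def mark_capitals(tokens):
    "Hypothetical helper: prefix capitalised words with TK_MAJ, then lowercase everything"
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            out += [TK_MAJ, tok.lower()]
        else:
            out.append(tok.lower())  # all-caps words are simply lowercased in this sketch
    return out

print(replace_rep_sketch("great!!!!").split())
print(mark_capitals(["The", "movie", "was", "great"]))
```

The point of both rules is the same: the vocab only needs one entry for `!` and one for `the`, while the extra special tokens preserve the information that was removed.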

In [36]:
coll_repr(tkn('Fast.ai'))

"(#3) ['xxbos','xxmaj','fast.ai']"

---

### Subword Tokenization

In addition to the word tokenization approach seen in the last section, another popular tokenization method is subword tokenization. Word tokenization relies on an assumption that spaces provide a useful separation of components of meaning in a sentence. However, this assumption is not always appropriate. For instance, consider this sentence: 我的名字是郝杰瑞 (which means "My name is Jeremy Howard" in Chinese). That's not going to work very well with a word tokenizer, because there are no spaces in it! Languages like Chinese and Japanese don't use spaces, and in fact they don't even have a well-defined concept of a "word". There are also languages, like Turkish and Hungarian, which can add many bits together without spaces, to create very long words which include a lot of separate pieces of information.

To handle these cases, it's generally best to use subword tokenization. This proceeds in two steps:

1. Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab.
2. Tokenize the corpus using this vocab of *subword units*.
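
The two steps can be illustrated with a toy learner in the style of byte-pair encoding (BPE). This is only a sketch of the general idea: fastai's `SubwordTokenizer` relies on the SentencePiece library, whose algorithm differs, and the helper names `merge_pair`, `learn_merges`, and `tokenize` are hypothetical.

```python
from collections import Counter

def merge_pair(syms, pair):
    "Fuse every adjacent occurrence of `pair` in a tuple of symbols"
    out, i = [], 0
    while i < len(syms):
        if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
            out.append(syms[i] + syms[i + 1])
            i += 2
        else:
            out.append(syms[i])
            i += 1
    return tuple(out)

def learn_merges(words, n_merges):
    "Step 1: repeatedly merge the most frequent adjacent pair of symbols"
    vocab = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for syms, freq in vocab.items():
            merged[merge_pair(syms, best)] += freq
        vocab = merged
    return merges

def tokenize(word, merges):
    "Step 2: apply the learned merges, in order, to a new word"
    syms = tuple(word)
    for pair in merges:
        syms = merge_pair(syms, pair)
    return list(syms)

words = "the movie was the best movie of the year".split()
for n in (0, 3, 10):
    print(n, tokenize("movie", learn_merges(words, n)))
```

With zero merges every token is a single character; as the number of merges grows, frequent words collapse into fewer and longer tokens.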

---

Let's look at a corpus of 2000 reviews:

In [38]:
corpus = L(o.open().read() for o in files[:2000])
# corpus

(#2000) ["I watched this movie and the original Carlitos Way back to back. The difference between the two is disgusting. Now i know that people are going to say that the prequel was made on a small budget but that never had anything to do with a bad script. Now maybe it's just me, but i always thought that a prequel was made to go set up the other movie, starring key characters and maybe filling in a bit about life that we didn't know. Rise to Power is just a movie that has Carlito's name. There should have been at least a few characters from the original movie, the ending makes no sense in relation to the original. In the end of this movie he retires with his sweet heart but how the hell do we get him coming out of prison in the next movie? And his woman isn't even the same woman that he talks about as his only love in the original. I would say the movie is mildly entertaining in its self, with a few decent bits but it pales when held up to it's big brother. Don't lay awake at night w

In [39]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(corpus)  # train the subword vocab on the 2000-review corpus defined above
    return ' '.join(first(sp([txt]))[:40])

`sz` is the size of the subword vocab that `SubwordTokenizer` builds (it is passed as `vocab_sz`), not the size of the embeddings. It determines how the subword tokens are formed:

- With a small vocab, each token represents only a few characters, so many tokens are needed to represent a sentence.

- With a large vocab, entire common words in the text become tokens, so fewer tokens are needed.

Finding a good size is therefore a matter of trial and error. Picking a subword vocab size represents a compromise: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn.

>Overall, subword tokenization provides a way to easily scale between character tokenization (i.e. use a small subword vocab) and word tokenization (i.e. use a large subword vocab), and handles every human language without needing language-specific algorithms to be developed. It can even handle other "languages" such as genomic sequences or MIDI music notation! For this reason, in the last year its popularity has soared, and it seems likely to become the most common tokenization approach (it may well already be, by the time you read this!)

Once we have our texts as tokens, the next step is to numericalize them.

### Numericalization with Fastai

In [40]:
toks = tkn(txt)
print(coll_repr(tkn(txt), 31))

(#246) ['xxbos','i','watched','this','movie','and','the','original','xxmaj','carlitos','xxmaj','way','back','to','back','.','xxmaj','the','difference','between','the','two','is','disgusting','.','xxmaj','now','i','know','that','people'...]


In [42]:
toks200 = corpus[:200].map(tkn)
toks200[0]

(#246) ['xxbos','i','watched','this','movie','and','the','original','xxmaj','carlitos'...]

In [43]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)

"(#1912) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','to','of','is','i','it','this'...]"

In [44]:
nums = num(toks)[:20]; nums

tensor([   2,   17,  281,   19,   27,   13,    9,  203,    8,    0,    8,  130,
         177,   14,  177,   10,    8,    9, 1403,  263])

In [45]:
' '.join(num.vocab[o] for o in nums)

'xxbos i watched this movie and the original xxmaj xxunk xxmaj way back to back . xxmaj the difference between'
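
Conceptually, numericalization is a lookup table built from token counts. The following is a minimal sketch, not fastai's `Numericalize` itself: `build_vocab` and `numericalize` are hypothetical helpers, and the `min_freq` and `max_vocab` defaults are assumptions chosen to resemble fastai's behaviour.

```python
from collections import Counter

# Special tokens in the order shown in the vocab output above
SPECIALS = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld",
            "xxrep", "xxwrep", "xxup", "xxmaj"]

def build_vocab(tokenized_docs, min_freq=3, max_vocab=60000):
    "Special tokens first, then the remaining tokens from most to least frequent"
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    words = [w for w, c in counts.most_common(max_vocab)
             if c >= min_freq and w not in SPECIALS]
    return SPECIALS + words

def numericalize(tokens, vocab):
    "Map each token to its index; anything outside the vocab becomes xxunk (index 0)"
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

docs = [["xxbos", "the", "movie", "the"], ["xxbos", "the", "film"]]
vocab = build_vocab(docs, min_freq=2)
print(numericalize(["xxbos", "the", "carlitos"], vocab))
```

This is why `carlitos` decodes to `xxunk` in the output above: it is too rare in the 200 sample reviews to earn its own vocab entry.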

---

## Training a Text Classifier

### Language model using DataBlock

fastai handles tokenization and numericalization automatically when `TextBlock` is passed to `DataBlock`. All of the arguments that can be passed to `Tokenizer` and `Numericalize` can also be passed to `TextBlock`. In the next chapter we'll discuss the easiest ways to run each of these steps separately, to ease debugging--but you can always just debug by running them manually on a subset of your data as shown in the previous sections. And don't forget about `DataBlock`'s handy `summary` method, which is very useful for debugging data issues.

Here's how we use `TextBlock` to create a language model, using fastai's defaults:

In [47]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

One thing that's different to previous types used in `DataBlock` is that we're not just using the class directly (i.e. `TextBlock(...)`), but instead are calling a *class method*. A class method is a Python method which, as the name suggests, belongs to a *class* rather than an *object*. (Be sure to search online for more information about class methods if you're not familiar with them, since they're commonly used in many Python libraries and applications; we've used them a few times previously in the book, but haven't called attention to them.) The reason that `TextBlock` is special is that setting up the numericalizer's vocab can take a long time (we have to read every document and tokenize it to get the vocab); to be as efficient as possible fastai does things such as:

- Save the tokenized documents in a temporary folder, so fastai doesn't have to tokenize more than once.
- Run multiple tokenization processes in parallel, to take advantage of your computer's CPUs.

Therefore we need to tell `TextBlock` how to access the texts, so that it can do this initial preprocessing--that's what `from_folder` does.

`show_batch` then works in the usual way:

In [49]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos i really hope that xxmaj concorde / xxmaj new xxmaj horizons was n't trying to make a serious horror , or even action movie when they made xxmaj carnosaur 3 . xxmaj the movie is flat - out silly from start to finish . xxmaj even the humor in xxup xxunk is funny because it 's bad . xxmaj definitely a high water mark in the ' so xxmaj bad it 's xxmaj good ' genre . xxmaj if","i really hope that xxmaj concorde / xxmaj new xxmaj horizons was n't trying to make a serious horror , or even action movie when they made xxmaj carnosaur 3 . xxmaj the movie is flat - out silly from start to finish . xxmaj even the humor in xxup xxunk is funny because it 's bad . xxmaj definitely a high water mark in the ' so xxmaj bad it 's xxmaj good ' genre . xxmaj if you"
1,"at the xxmaj film xxmaj forum , recently , i could not resist watching this masterpiece once more when it was shown by xxup tcm , the other night . \n\n xxmaj this movie owes a debt of gratitude to xxmaj graham xxmaj greene , a writer who had the most developed sense of intrigue among his contemporaries and one of the best writers of the last century . xxmaj it also helped that a great director , xxmaj carol","the xxmaj film xxmaj forum , recently , i could not resist watching this masterpiece once more when it was shown by xxup tcm , the other night . \n\n xxmaj this movie owes a debt of gratitude to xxmaj graham xxmaj greene , a writer who had the most developed sense of intrigue among his contemporaries and one of the best writers of the last century . xxmaj it also helped that a great director , xxmaj carol xxmaj"
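
In the output above, the `text_` column is the `text` column shifted forward by one token: the target at each position is simply the next token. A minimal sketch of how such input/target pairs can be cut from one long token stream is shown here. `lm_pairs` is a hypothetical helper; fastai's own language-model data loader additionally concatenates and shuffles documents and arranges the stream across the batch dimension.

```python
def lm_pairs(stream, seq_len):
    "Cut a token stream into (input, target) chunks, with the target offset by one token"
    pairs = []
    for i in range(0, len(stream) - 1, seq_len):
        x = stream[i : i + seq_len]
        y = stream[i + 1 : i + seq_len + 1]
        if len(x) == len(y):  # drop a final chunk that has no complete target
            pairs.append((x, y))
    return pairs

stream = "xxbos i really hope that xxmaj concorde was good".split()
for x, y in lm_pairs(stream, seq_len=4):
    print(x, "->", y)
```

No labels are needed for this stage, which is why the unlabeled `unsup` folder can be included when building `dls_lm`.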


### Finetuning language model

To convert the integer word indices into activations that we can use for our neural network, we will use embeddings, just like we did for collaborative filtering and tabular modelling. Those embeddings are then fed into a *Recurrent Neural Network* (RNN), using an architecture called *AWD_LSTM* (we will show how to write such a model from scratch in a later chapter). As we discussed earlier, the embeddings in the pretrained model are merged with random embeddings added for words that weren't in the pretraining vocabulary. This is handled automatically inside `language_model_learner`:

In [50]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()

The loss function used by default is cross entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab). A metric often used in NLP for language models is called *perplexity*. It is the exponential of the loss (i.e. `torch.exp(cross_entropy)`). We will also add accuracy, to see how many times our model is right when trying to predict the next word, since cross entropy (as we've seen) is both hard to interpret, and also tells you more about the model's confidence, rather than just its accuracy.
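
The relationship between the two numbers can be checked with plain Python. This is a small sketch using made-up probabilities; `cross_entropy` here is a hypothetical helper that takes the probability the model assigned to each correct next token, not the PyTorch function.

```python
import math

def cross_entropy(probs_of_correct_token):
    "Mean negative log-probability assigned to the correct next token"
    return -sum(math.log(p) for p in probs_of_correct_token) / len(probs_of_correct_token)

# Made-up example: the model gave the right token 1/2, 1/4, and 1/8 probability
loss = cross_entropy([0.5, 0.25, 0.125])
perplexity = math.exp(loss)
print(loss, perplexity)

# The same conversion applied to the validation loss of the training run in this notebook
print(math.exp(3.908968))
```

A perplexity of about 50 can be read as the model being, on average, as uncertain as if it were choosing uniformly among roughly 50 candidate tokens.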

In [51]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.104637,3.908968,0.299777,49.847504,26:16


In [52]:
learn.save('1epoch')

In [53]:
learn = learn.load('1epoch')

In [None]:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time


In [None]:
learn.save_encoder('finetuned')

---

### Text Generation

In [None]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

In [None]:
print("\n".join(preds))
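
The `temperature` argument controls how random the generated text is. One common way to apply a temperature, sketched here in plain Python, is to divide the model's scores by it before the softmax and then sample from the result. The function names are hypothetical and the details inside `learn.predict` may differ.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    "Turn raw scores into probabilities; a lower temperature sharpens the distribution"
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, rng=random):
    "Draw one token index at random, weighted by the tempered probabilities"
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocab
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.75))
print(sample_next(logits, temperature=0.75))
```

Because sampling is random, the two predictions generated above will generally differ from each other and from run to run.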

---

### Creating the classifier dataloader

In [None]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

In [None]:
dls_clas.show_batch(max_n=3)

Looking at the `DataBlock` definition above, every piece is familiar from previous data blocks we've built, with two important exceptions:

- `TextBlock.from_folder` no longer has the `is_lm=True` parameter, and
- We pass the `vocab` we created for the language model fine-tuning.

The reason that we pass the vocab of the language model is to make sure we use the same correspondence of token to index. Otherwise the embeddings we learned in our fine-tuned language model won't make any sense to this model, and the fine-tuning step won't be of any use.

By passing `is_lm=False` (or not passing `is_lm` at all, since it defaults to `False`) we tell `TextBlock` that we have regular labeled data, rather than using the next tokens as labels. There is one challenge we have to deal with, however, which is to do with collating multiple documents into a minibatch.

---

Remember, PyTorch `DataLoader`s need to collate all the items in a batch into a single tensor, and that a single tensor has a fixed shape (i.e. it has some particular length on every axis, and all items must be consistent). This should look a bit familiar: we had the same issue with images. In that case, we use cropping, padding, and/or squishing to make everything the same size. Cropping might not be a good idea for documents, because it seems likely we'd remove some key information (having said that, the same issue is true for images, and we use cropping there; data augmentation hasn't been well explored for NLP yet, so perhaps there are actually opportunities to use cropping in NLP too!) You can't really "squish" a document. So that leaves padding!

We will expand the shortest texts to make them all the same size. To do this, we use a special token that will be ignored by our model. This is called *padding* (just like in vision). Additionally, to avoid memory issues and improve performance, we will batch together texts that are roughly the same length (with some shuffling for the training set). We do this by (approximately, for the training set) sorting the documents by length prior to each epoch. The result of this is that the documents collated into a single batch will tend to be of similar lengths. We won't pad every batch to the same size, but will instead use the size of the largest document in each batch. (It is possible to do something similar with images, which is especially useful for irregularly sized rectangular images, although as we write these words, no library provides good support for this yet, and there aren't any papers covering it. It's something we're planning to add to fastai soon however, so have a look on the book website, where we'll add information about this if and when it's working well.)

The padding and sorting is automatically done by the data block API for us when using a `TextBlock`, with `is_lm=False`. (We don't have this same issue for language model data, since we concatenate all the documents together first, and then split them into equally sized sections.)
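
The sorting and padding idea can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the data block API's implementation: `padded_batches` is a hypothetical helper, it sorts exactly rather than approximately, and it pads at the end of each document.

```python
PAD_IDX = 1  # index of xxpad in the vocab shown earlier

def padded_batches(docs, bs):
    "Sort numericalized documents by length, then pad each batch to its longest member"
    ordered = sorted(docs, key=len, reverse=True)
    batches = []
    for i in range(0, len(ordered), bs):
        chunk = ordered[i : i + bs]
        width = max(len(d) for d in chunk)
        batches.append([d + [PAD_IDX] * (width - len(d)) for d in chunk])
    return batches

docs = [[2, 5], [2, 5, 6, 7], [2], [2, 5, 6]]
for batch in padded_batches(docs, bs=2):
    print(batch)
```

Grouping documents of similar length keeps the amount of padding per batch small, so less computation is spent on tokens the model ignores.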

We can now create a model to classify our texts:

In [None]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()

We need to load the fine-tuned language model from the previous step as the encoder, and use this learner's head as the decoder that classifies the text.

We use `load_encoder` instead of `load` because we only have pretrained weights available for the encoder; `load` by default raises an exception if an incomplete model is loaded.

In [None]:
learn = learn.load_encoder('finetuned')

---

### Finetuning the classifier

The last step is to train with discriminative learning rates and **gradual unfreezing**. In computer vision, we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference.

In [None]:
learn.fit_one_cycle(1, 2e-2)

In [None]:
# We can pass -2 to freeze_to to freeze all except the last two parameter groups:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

In [None]:
# unfreeze a bit more and train further
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

In [None]:
# unfreeze the entire model and train
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
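
To see what the `slice(lo, hi)` arguments and `freeze_to` calls above amount to, here is a small sketch. It assumes the learning rates are spaced geometrically across the parameter groups and that the classifier has five groups; both are assumptions for illustration, and `spread_lrs` and `trainable_groups` are hypothetical helpers rather than fastai functions.

```python
def spread_lrs(lo, hi, n_groups):
    "Geometrically spaced learning rates from `lo` (earliest group) to `hi` (last group)"
    if n_groups == 1:
        return [hi]
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step**i for i in range(n_groups)]

def trainable_groups(n_groups, freeze_to):
    "Indices of the groups left trainable: everything before `freeze_to` is frozen"
    return list(range(n_groups))[freeze_to:]

n_groups = 5  # assumed number of parameter groups
lrs = spread_lrs(1e-2 / 2.6**4, 1e-2, n_groups)
print(lrs)
print(trainable_groups(n_groups, -2))  # after freeze_to(-2)
print(trainable_groups(n_groups, -3))  # after freeze_to(-3)
```

Under these assumptions, dividing the top rate by `2.6**4` makes each group's learning rate 2.6 times smaller than that of the group after it, so the earliest layers change the least.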

We reach 94.3% accuracy, which was state-of-the-art just three years ago. By training a model on all the texts read backwards and averaging the predictions of those two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the [ULMFiT paper](https://arxiv.org/abs/1801.06146). It was only beaten a few months ago, by fine-tuning a much bigger model and using expensive data augmentation (translating sentences into another language and back, using another model for translation).