# Lesson 13 - NLP with Deep Learning

> An introduction to Deep Learning and its applications in NLP

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lewtun/dslectures/blob/master/notebooks/lesson13_nlp-deep.ipynb)[![slides](https://img.shields.io/static/v1?label=slides&message=lesson13_nlp-deep.pdf&color=blue&logo=Google-drive)](https://drive.google.com/open?id=1g613_b3643zUuPVB1JBBuyxXka_WRRWF)


> Note: Make sure you are connected to a GPU machine when running the Colab notebook by clicking on `Runtime -> Change runtime type` and setting the hardware type to GPU.

## Learning objectives
In this lecture we cover the basics of Deep Learning and its application to NLP. The learning goals are:
* Understand the basics of transfer learning
* Preprocess data with the fastai data loader
* Train a text classifier with the fastai library

## References
* Practical Deep Learning for Coders - Lesson 4: Deep Learning 2019 by fastai [[video](https://www.youtube.com/watch?v=qqt3aMPB81c)]

This notebook follows fastai's excellent tutorial in this video, and the original notebook can be found [here](https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson3-imdb.ipynb).

## Homework
As homework, read the references, work carefully through the notebook, and solve the exercises.

## Introduction
Last time we built a sentiment classifier with `scikit-learn` and achieved around 85% accuracy on the test set. This is already pretty good, but can we do better? We are still wrong about 15 times out of 100. It turns out we can do better if we use deep learning.

<div style="text-align: center">
<img src='https://github.com/lewtun/dslectures/blob/master/notebooks/images/deeper.jpg?raw=1' width='400'>
</div>

Deep learning uses an architecture that is modeled after the brain and uses networks of artificial neurons to mimic its behaviour. These models are much bigger than the models we have encountered so far and can have millions to billions of parameters. Training these models and adjusting the parameters is also more challenging, and generally requires much more data and compute. Many of the computations are easily parallelizable, which is a strength of modern GPUs. Therefore we will run this notebook on a GPU, which enables much faster training than a CPU.

<div style="text-align: center">
<img src='https://github.com/lewtun/dslectures/blob/master/notebooks/images/gpu_meme.jpg?raw=1' width='400'>
</div>

Since the IMDb dataset does not contain much training data by deep learning standards, we use transfer learning to still achieve high accuracy in predicting the sentiment of the movie reviews.

## Transfer learning
Training deep learning models requires a lot of data. It is not uncommon to train models on millions of images or gigabytes of text data to achieve good results. Most real-world problems don't have that amount of labeled data ready, and not all companies or individuals who want to train a model can afford to hire people to label data for them.

For many years this was a major obstacle. Fortunately, a solution emerged for image-based models a few years ago and, more recently, for NLP as well. The approach that helps train models with limited labeled data is called **transfer learning**.

The idea is that once a model is trained on a large dataset for a specific task (e.g., classifying houses vs. planes), the model has learned certain features of the data that can be reused for another task. Such features could be how to detect edges or textures in images. If these features are useful for another task, then we can train the model on new data without requiring as many labels as if we were training it from scratch.

### ULMFiT
For transfer learning in NLP, Jeremy Howard and Sebastian Ruder came up with a similar approach for texts called `ULMFiT` (Universal Language Model Fine-tuning for Text Classification). The central theme of the approach is language modeling.

#### Language modeling
In language modeling the goal is to predict the next word based on the previous words in a text. An example:

`Yesterday I discovered a really nice cinema so I went there to watch a ____ .`

The task of the model is to fill in the blank. The advantage of language modeling is that it does not require any labels, but to achieve good results, the model needs to learn a lot about language. In this example, the model needs to understand that one watches movies in cinemas. The same goes for sentiment and other topics. With `ULMFiT` one can train a language model and then use it for classification tasks in three steps.
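To make the idea of label-free training concrete, here is a minimal sketch of a toy language model. It is not the model used in this lesson: it only counts which word follows which (a bigram model), whereas a neural language model takes a much longer context into account. The function names and the tiny corpus are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(texts):
    """Count, for every word, which words follow it in the training texts."""
    followers = defaultdict(Counter)
    for text in texts:
        tokens = text.lower().split()
        # The "label" for each word is simply the word that comes next.
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev][nxt] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

corpus = [
    "we went to the cinema to watch a movie",
    "i want to watch a movie tonight",
    "she likes to watch a series",
]
model = train_bigram_model(corpus)
print(predict_next(model, "a"))  # most frequent word after "a" in the toy corpus
```

Note that the training targets come directly from the raw text, which is why no manual labeling is needed.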

#### Three steps
The three steps are visualised in the following figure:
<div style="text-align: center">
<img src='https://github.com/lewtun/dslectures/blob/master/notebooks/images/ulmfit_approach.png?raw=1' width='800'>
</div>

1. **Language model (wiki)**: A language model is trained on a large dataset. Wikipedia is a common choice for this task as it covers many topics and the text is of high quality. This step usually takes the most time, on the order of days. In this step, the model learns the general structure of language.

2. **Language model (domain)**: The language model trained on Wikipedia might be missing some aspects of the domain we are interested in. If we want to do sentiment classification, Wikipedia does not offer much insight since Wikipedia articles are generally of neutral sentiment. Therefore, we continue training the language model on the text we are interested in. This step still takes several hours.

3. **Classifier (domain)**: Now that the language model works well on the text we are interested in, it is time to build a classifier. We do this by adapting the output of the network to yield classes instead of words. This step only takes a couple of minutes to an hour to complete.

The power of this approach is that you only need a little labeled data for the last step and only need to go through the expensive first step once. Even the result of the second step can be reused on the same dataset if you, for example, build a sentiment classifier and additionally a topic classifier. Training such an additional classifier can be done in minutes and allows us to achieve great results with little time and resources.

### Other methods
Today, there are many approaches in NLP that use transfer learning, such as Google's BERT. Pretraining these models (step 1) costs on the order of $100'000 and requires massive computational facilities. Fortunately, most of these models are shared and can then be fine-tuned on small machines.

## The `fastai` library
The `fastai` library wraps around the deep learning framework `PyTorch` and has a lot of functionality built in to achieve great results quickly. The library abstracts a lot of functionality, so it can be difficult to follow initially. To get a better understanding, we highly recommend the [fastai course](https://course.fast.ai/). In this lesson, we will use the library to build a world-class classifier with just a few lines of code.

## Imports

In [1]:
!pip install dslectures


In [2]:
from fastai.text import *
from dslectures.core import get_dataset

## Training data

### Download data
The fastai library includes a lot of datasets, including the IMDb dataset we already know. Similar to our `download_dataset()` function, we can download it with fastai's `untar_data()` function.

In [3]:
path = untar_data(URLs.IMDB)

Downloading https://s3.amazonaws.com/fast-ai-nlp/imdb.tgz


### Data structure
Looking at the downloaded folder, we can see that there are several files and folders. The relevant ones for our case are `test`, `train`, and `unsup`. The `train` and `test` folders split the data the same way we split it in the previous lecture. The new `unsup` (for unsupervised) folder contains 50k movie reviews that are not classified. We can't use them for training a classifier, but we can use them to fine-tune the language model.

In [4]:
path.ls()

(#7) [Path('/root/.fastai/data/imdb/test'),Path('/root/.fastai/data/imdb/README'),Path('/root/.fastai/data/imdb/tmp_lm'),Path('/root/.fastai/data/imdb/train'),Path('/root/.fastai/data/imdb/tmp_clas'),Path('/root/.fastai/data/imdb/imdb.vocab'),Path('/root/.fastai/data/imdb/unsup')]

Looking at the training folder, we find two folders for the negative and positive movie reviews.

In [5]:
(path/'train/').ls()

(#4) [Path('/root/.fastai/data/imdb/train/neg'),Path('/root/.fastai/data/imdb/train/labeledBow.feat'),Path('/root/.fastai/data/imdb/train/unsupBow.feat'),Path('/root/.fastai/data/imdb/train/pos')]

In both folders, we find many files, each containing one movie review. This is exactly the same data we used last time. It is just arranged in a different structure. We don't need to load all these files manually - the fastai library does this automatically. 

In [6]:
(path/'train/neg').ls()[:10]

(#10) [Path('/root/.fastai/data/imdb/train/neg/6553_4.txt'),Path('/root/.fastai/data/imdb/train/neg/6577_4.txt'),Path('/root/.fastai/data/imdb/train/neg/7009_4.txt'),Path('/root/.fastai/data/imdb/train/neg/6706_1.txt'),Path('/root/.fastai/data/imdb/train/neg/4758_4.txt'),Path('/root/.fastai/data/imdb/train/neg/2025_1.txt'),Path('/root/.fastai/data/imdb/train/neg/7627_4.txt'),Path('/root/.fastai/data/imdb/train/neg/11850_3.txt'),Path('/root/.fastai/data/imdb/train/neg/12493_1.txt'),Path('/root/.fastai/data/imdb/train/neg/1835_2.txt')]
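To illustrate what loading from such a folder structure involves, here is a minimal sketch that reads one review per file and takes the label from the folder name. This is not fastai's implementation; `read_labelled_reviews` is a hypothetical helper, and the example builds a tiny temporary folder instead of using the real IMDb data.

```python
import tempfile
from pathlib import Path

def read_labelled_reviews(root, classes=("neg", "pos")):
    """Collect (text, label) pairs from root/<label>/*.txt files."""
    samples = []
    for label in classes:
        for file in sorted((Path(root) / label).glob("*.txt")):
            samples.append((file.read_text(encoding="utf-8"), label))
    return samples

# Build a tiny stand-in for the train folder so the sketch is runnable anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for label, text in [("neg", "Boring plot."), ("pos", "Great acting!")]:
        folder = Path(tmp) / label
        folder.mkdir()
        (folder / "0_1.txt").write_text(text, encoding="utf-8")
    reviews = read_labelled_reviews(tmp)

print(reviews)
```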

## Fine-tune language model

> Note: Fine-tuning the language model takes around 4h. You can skip this step and download the fine-tuned model in the *Load language model* section of the notebook.

### Preprocess data for language modeling
In the last lecture, we implemented our own function to preprocess the texts and tokenize them. In principle, we could do the same here, but fastai comes with built-in functions to take care of this. In addition, we can specify which folders to use and what percentage to split off for validation. The batch size `bs` specifies how many samples the model processes in each optimisation step.

In [7]:
bs=48
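As a rough illustration of what the batch size means, the following sketch cuts a dataset into mini-batches and counts the optimisation steps in one pass over the data. It is a simplification under stated assumptions: the samples are plain integers, `iterate_batches` is a hypothetical helper, and fastai's own data loader for language models arranges the text differently.

```python
import math

def iterate_batches(samples, batch_size):
    """Yield consecutive mini-batches of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(100))   # stand-in for 100 training examples
bs = 48
batch_sizes = [len(batch) for batch in iterate_batches(samples, bs)]
print(batch_sizes)                      # [48, 48, 4]
print(math.ceil(len(samples) / bs))     # 3 optimisation steps per pass
```

A larger batch size means fewer but more expensive steps, and it is usually limited by the memory of the GPU.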

In [8]:
data_lm = (TextList.from_folder(path)
           #Inputs: all the text files in path
            .filter_by_folder(include=['train', 'test', 'unsup']) 
           #We may have other temp folders that contain text files so we only keep what's in train, test and unsup
            .split_by_rand_pct(0.1)
           #We randomly split and keep 10% (10,000 reviews) for validation
            .label_for_lm()           
           #We want to do a language model so we label accordingly
            .databunch(bs=bs))

Similar to the vectorizer vocabulary, we can have a look at the encoding scheme. The `itos` (stands for id-to-string) object tells us which token is encoded at which position.

In [9]:
len(data_lm.vocab.itos)

60000

We can see that the vocabulary contains 60,000 tokens.

In [10]:
data_lm.vocab.itos[:20]


['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the',
 '.',
 ',',
 'and',
 'a',
 'of',
 'to',
 'is',
 'it',
 'in',
 'i']

We can see that the first few positions are reserved for special tokens starting with `xx`. The token `xxunk` is used for a word that is not in the dictionary. The `xxbos` and `xxeos` identify the beginning and the end of a string. So if the first entry in the encoding vector is 1 this means that the token is `xxunk`. If the third entry is 1 then the token is `xxbos`.
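The mapping between tokens and positions can be sketched in a few lines. The code below is an illustration, not fastai's vocabulary class: `build_vocab`, `numericalize`, and their parameters are hypothetical, and only a subset of the special tokens is used. It shows how the most frequent tokens get the lowest ids after the special tokens, and how an unseen word falls back to `xxunk`.

```python
from collections import Counter

SPECIAL_TOKENS = ["xxunk", "xxpad", "xxbos", "xxeos"]  # subset of the tokens listed above

def build_vocab(token_lists, max_vocab=60000, min_freq=1):
    """Return itos (id-to-string list) and stoi (string-to-id dict)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    words = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    # Special tokens come first, followed by ordinary tokens by frequency.
    itos = SPECIAL_TOKENS + [w for w in words if w not in SPECIAL_TOKENS]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to ids; unknown tokens get the id of xxunk."""
    return [stoi.get(tok, stoi["xxunk"]) for tok in tokens]

docs = [["xxbos", "the", "movie", "was", "great"],
        ["xxbos", "the", "plot", "was", "thin"]]
itos, stoi = build_vocab(docs)
ids = numericalize(["xxbos", "the", "soundtrack", "was", "great"], stoi)
print(ids)
print([itos[i] for i in ids])  # "soundtrack" was never seen, so it becomes xxunk
```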

We can also look at a processed text. We notice that the token `xxmaj` is used frequently. It signifies that the first letter of the following word is capitalised.

In [11]:
data_lm.train_ds[0][0]

Text [  2  19 698  21 ... 121  76 548  10]
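The role of `xxmaj` can be illustrated with a simplified tokenizer. This is a sketch based on the behaviour visible in the processed reviews, not fastai's actual rules: `tokenize_with_caps` is a hypothetical function, the regular expression is a crude stand-in for a real tokenizer, and the handling of `xxup` (whole word in upper case) and of single-letter words is an assumption.

```python
import re

def tokenize_with_caps(text):
    """Lowercase all tokens and mark capitalisation with special tokens."""
    tokens = []
    for tok in re.findall(r"\w+|[^\w\s]", text):
        if len(tok) > 1 and tok.isupper():
            tokens.append("xxup")    # the whole word was written in upper case
        elif len(tok) > 1 and tok[0].isupper():
            tokens.append("xxmaj")   # only the first letter was capitalised
        tokens.append(tok.lower())
    return tokens

print(tokenize_with_caps("This film stars Mike Myers on TV."))
```

Lowercasing keeps the vocabulary small, while the special tokens preserve the information that a word was capitalised.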

With fastai we can also sample a batch from the dataset and display the sample in a dataframe:

In [12]:
data_lm.show_batch()



idx,text
0,"the film attempts to tell the story of a dark future , one in which xxmaj hawk ( a xxmaj mad xxmaj max type of character ) heads off to rescue a damsel in distress . xxmaj in reality , the plot is a thinly disguised excuse for the producers to promote their own philosophies on life ( watch the end credits and the ' these people are not real"
1,"guess it was worth my time sitting through this * once * but i wo n't be watching it again . xxmaj there are several things about this film that irritated me . \n \n xxmaj first , man ... i really hated the characters . i had the same problem with xxmaj sid and xxmaj nancy . i have a hard time rationalizing spending a fair chunk of"
2,"xxmaj this film is a cartoonish piece of snot with bright colors and bad mediocre acting . xxmaj was xxmaj mike xxmaj myers even in this movie actually ? xxmaj and another thing , the fish . xxmaj what is with that stupid fish ! xxmaj first time you see him , he 's an actual fish . xxmaj next time you see him , he 's all animated and"
3,"her brother is a scientist who 's trying to deal with a speeding comment headed for xxmaj earth . xxmaj but in the runaway comet business , even a near miss causes some real problems as the xxmaj earth 's orbit goes out of kilter . \n \n xxmaj from the survival of the xxmaj earth we go to the survival of xxmaj van xxmaj dien and his immediate"
4,"xxup tv did n't have such a dominant presence in the industry , this movie would have seemed entertaining . xxmaj but xxmaj mike and xxmaj mark are so obviously playing themselves , xxmaj mike and xxmaj mark . xxmaj at times they are funny and some of the lines seem off the cuff , but mostly they do not ring true . xxmaj they are the reality version of"


### Representation
When we vectorized the texts in the last lecture, we represented them as count vectors. The architecture we use in this lecture processes the words one at a time and thus preserves the order information. Therefore we encode each token with a **one-hot encoding**. However, storing these encodings as full vectors would not be very memory efficient, since one entry is 1 and all the other entries are 0. It is more efficient to just store the index of the entry that is 1 and then create the vector when we need it.
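The relation between the compact index and the full vector can be sketched as follows. The vocabulary size of 6 and the token ids are made up for illustration; with the 60,000-token vocabulary of this lesson, each full vector would hold 60,000 entries instead of a single integer.

```python
def one_hot(index, vocab_size):
    """Expand a token id into a one-hot vector of length vocab_size."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector

token_ids = [2, 0, 4]   # compact storage: one integer per token
vocab_size = 6          # tiny vocabulary, for illustration only
vectors = [one_hot(i, vocab_size) for i in token_ids]
for i, v in zip(token_ids, vectors):
    print(i, v)
```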

We can look at the data representation of the example we printed above. Each number represents a word in the vocabulary and specifies the entry in the one-hot encoding that is set to one.

In [13]:
data_lm.train_ds[0][0].data[:10]

array([  2,  19, 698,  21,  19,  26,  38,  60,  15,  85])

With the `itos` we can translate it back to tokens:

In [14]:
for i in data_lm.train_ds[0][0].data[:10]:
    print(data_lm.vocab.itos[i])

xxbos
i
knew
that
i
was
not
about
to
see


### Train model
We will train a variant of a model called an LSTM (long short-term memory) network. This is a neural network with a feedback loop: when fed a sequence of tokens, it feeds its output back in as an input for the next prediction. This gives the model a mechanism for remembering past inputs, which is especially useful when dealing with sequential data such as texts, where the order of words and characters carries important meaning.

<div style="text-align: center">
<img src='https://github.com/lewtun/dslectures/blob/master/notebooks/images/rnn-diagram.png?raw=1' width='400'>
    <p style="text-align: center;"> <b>Reference:</b> https://colah.github.io/posts/2015-08-Understanding-LSTMs/ </p>
</div>
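The feedback loop can be sketched with a single scalar recurrent cell. This is a plain recurrent unit, not an LSTM (which adds gates that control what is kept and forgotten), and the weights `w_x`, `w_h`, and `b` are arbitrary values chosen for illustration rather than learned parameters.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a scalar recurrent cell: mix the new input with the old state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h)   # the previous output is fed back in
        states.append(h)
    return states

# Same inputs in a different order give a different final state.
print(run_rnn([1.0, 0.0, 0.0])[-1])
print(run_rnn([0.0, 0.0, 1.0])[-1])
```

In the first sequence the input arrives early and its trace fades with each step, but the final state is still non-zero: the cell remembers that it saw something earlier.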

#### Load pretrained model
Training the model on Wikipedia takes a day or two. Fortunately, people have already trained the model and shared it in the fastai library. Therefore, we can just load the pretrained language model. When we load it, we also pass the dataset it will be trained on.

In [15]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, model_dir="../data/")

Downloading https://s3.amazonaws.com/fast-ai-modelzoo/wt103-fwd.tgz


#### Learning rate finder
The learning rate is a key parameter when training models in deep learning. It specifies how strongly we update the model parameters. If the learning rate is too small, the training takes forever. If the learning rate is too big, we will never converge to a minimum.
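The effect of the learning rate can be seen on a toy problem. The sketch below minimises $f(x) = x^2$ with plain gradient descent; the function and the three learning rates are chosen for illustration and have nothing to do with the actual model.

```python
def gradient_descent(lr, steps=20, start=1.0):
    """Minimise f(x) = x**2 with plain gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        grad = 2 * x        # derivative of x**2
        x = x - lr * grad   # update scaled by the learning rate
    return x

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: x after 20 steps = {gradient_descent(lr):.4f}")
```

With the smallest learning rate, `x` has barely moved from its starting value after 20 steps. With the middle one it is close to the minimum at 0, and with the largest one each step overshoots so much that `x` moves further away from the minimum.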

With the `lr_find()` function, we can explore how the loss function behaves with respect to the value of the learning rate:

In [None]:
learn.lr_find()

epoch,train_loss,valid_loss,accuracy,time


In [None]:
learn.recorder.plot(skip_end=15)

First of all, we see that if we choose the learning rate too big, the loss starts to increase. We want to avoid this at all costs. Instead, we look for the largest learning rate at which the loss still decreases steeply. In this case, a good value is `1e-2`. We pass it to `fit_one_cycle()`, whose first parameter determines how many epochs we train. One epoch corresponds to one pass through the training set.

In [None]:
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

Deep learning models learn more and more abstractions with each layer. The first layers of an image model might learn about edges and textures in an image, and as you progress through the layers, you can see how the model combines edges to eyes or ears and eventually combines these to faces. Therefore the last few layers are usually the ones that are very task-specific while the others contain general information.

For this reason, we usually start by just tuning the last few layers because we don't want to lose the general information in the earlier layers, and then, in the end, fine-tune the whole model. This is what we did above: we just trained the last few layers. Now, to get the best possible performance, we want to train the whole model. The `unfreeze()` function enables the training of the whole model. We train the model for 10 more epochs with a ten times lower learning rate.

In [None]:
learn.unfreeze()

In [None]:
learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))

### Save language model
Since the step above took about 4h, we save the progress so that we don't have to repeat it when we restart the notebook.

In [None]:
# uncomment if you fine-tuned the language model yourself and want to save it
# data_lm.path = Path('')
# data_lm.save('data_lm.pkl')
# learn.save('fine_tuned')

### Load language model

In [None]:
get_dataset("fine_tuned.pth")
get_dataset("data_lm.pkl")

In [None]:
data_lm = load_data(Path("../data/"), 'data_lm.pkl', bs=bs)
data_lm.path = path

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, model_dir="../data/")
learn.path = Path("")
learn.load('fine_tuned');
learn.path = path

### Text generation
The objective of a language model is to predict the next word based on a sequence of words. We can use the trained model to generate some movie reviews:

In [None]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2

In [None]:
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
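The `temperature` argument controls how adventurous the sampling is. A common way to implement this idea is sketched below: the model's scores are divided by the temperature before they are turned into probabilities. This is an illustration of the general technique, not a copy of fastai's internals, and the words and scores are made up.

```python
import math
import random

def softmax_with_temperature(scores, temperature=1.0):
    """Turn scores into probabilities; low temperatures sharpen the distribution."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_word(words, scores, temperature=1.0, rng=random):
    """Draw one word at random according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(words, weights=probs, k=1)[0]

words = ["movie", "film", "banana"]
scores = [2.0, 1.5, 0.1]   # made-up model scores for the next word
for t in (0.75, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(scores, t)])
print(sample_next_word(words, scores, temperature=0.75, rng=random.Random(0)))
```

Lower temperatures concentrate the probability on the most likely words, which gives safer but more repetitive text. Higher temperatures make unlikely words more probable.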

### Exercise 1
Generate a few movie reviews with different input texts. Post the funniest review on the Teams channel.

### Encoder
As mentioned previously, the last layers of a deep learning model are usually the most task-specific. In the case of language modeling, the last layer predicts the next word in a sequence. We want to do text classification, however, so we don't need that layer. Therefore, we discard the last layer and only save what is called the encoder. In the next step, we add a new layer on top of the encoder for text classification.

In [None]:
learn.save_encoder('fine_tuned_enc')

## Train classifier
In this section, we will use the fine-tuned language model and build a text classifier on top of it.  The procedure is very similar to the language model fine-tuning but needs some minor adjustments for text classification.

### Preprocess data for classification
Preprocessing the data follows similar steps as for language modeling. The main differences are that 1) we don't want a random train/valid split, but the official one and 2) we want to label each element with its sentiment based on the folder name.

In [None]:
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             #grab all the text files in path
             .split_by_folder(valid='test')
             #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
             .label_from_folder(classes=['neg', 'pos'])
             #label them all with their folders
             .databunch(bs=bs))

When we display a batch we see that the tokens look the same with the addition of a label column:

In [None]:
data_clas.show_batch()

### Load model
We create a text classifier model and load the pretrained encoder part from the fine-tuned language model into it.

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5, model_dir="../data/")
learn.load_encoder('fine_tuned_enc');

### Find the learning rate
Again, we need to find the best learning rate for training the classifier.

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

### Train classifier
We start by just training the very last layer:

In [None]:
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))

Then we unfreeze the second-to-last layer as well and train the last two layers.

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))

Next, we train the last three layers.

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))

Finally, we train all layers for two epochs.

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))
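The gradual unfreezing schedule used above can be sketched with a list of layer groups and a flag per group. This is a conceptual illustration under stated assumptions: the `LayerGroup` class and the group names are hypothetical, and the sketch assumes that `freeze_to(n)` freezes every group before position `n` and trains the rest, which matches the description in the text.

```python
class LayerGroup:
    """A named group of layers with a flag saying whether it is trained."""
    def __init__(self, name):
        self.name = name
        self.trainable = False

def freeze_to(groups, n):
    """Freeze groups[:n] and make groups[n:] trainable (n may be negative)."""
    for group in groups[:n]:
        group.trainable = False
    for group in groups[n:]:
        group.trainable = True

def trainable_names(groups):
    """List the names of the groups that would be updated during training."""
    return [g.name for g in groups if g.trainable]

# Hypothetical group names; the grouping of the real model may differ.
names = ("embedding", "rnn_1", "rnn_2", "rnn_3", "classifier_head")
groups = [LayerGroup(name) for name in names]

for n in (-1, -2, -3, 0):   # n = 0 corresponds to unfreezing everything
    freeze_to(groups, n)
    print(n, trainable_names(groups))
```

Training the new head first and the pretrained layers later reduces the risk that large early updates destroy what the language model has already learned.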

Note that after the last optimisation step the model can predict the sentiment on the test set with **94%** accuracy. This is roughly 10 percentage points better than our Naïve Bayes model from the last lecture. In other words, with an error rate of about 6% instead of 15%, this model makes roughly **2.5 times** fewer mistakes than the Naïve Bayes model!

### Make predictions

In [None]:
learn.predict("I really loved that movie, it was awesome!")

### Exercise 2
Experiment with the trained classifier and see if you can fool it. Can you find a pattern that fools it consistently?