In [None]:
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [None]:
from fastbook import *
from fastai.text.all import *
from IPython.display import display,HTML

We'll be looking at the IMDB dataset, so let's download it first.

In [None]:
path = untar_data(URLs.IMDB)

# Conceptual review

In this notebook, we are going to review some of the content from Chapter 10 and explore how alternative choices would affect the final outcome.

## Question 1: What changes would need to be made to the notebook from Chapter 10 to use Subword Tokenization? Why might you want to use it?

Consider both the pretrained model and the fine-tuning step.
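
As a reminder of what subword tokenization does, here is a toy sketch in plain Python. It greedily splits a word into the longest pieces found in a small vocabulary. The `segment` helper and `TOY_VOCAB` are hypothetical names made up for this illustration; real subword tokenizers learn their vocabulary from the corpus rather than using a handwritten one, so treat this as an intuition aid, not as how fastai does it.

```python
def segment(word, vocab):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

# Handwritten toy vocabulary (an assumption for this example only).
TOY_VOCAB = {"un", "watch", "able", "re", "ing", "film"}

print(segment("unwatchable", TOY_VOCAB))
print(segment("rewatching", TOY_VOCAB))
```

Notice that neither word needs to be in the vocabulary as a whole for every piece of it to be represented.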

Answer here

### Text Generation

In [None]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()


Before we move on to fine-tuning the classifier, let's quickly try something different: using our model to generate random reviews. Since it's trained to guess what the next word of the sentence is, we can use the model to write new reviews:

In [None]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

In [None]:
print("\n".join(preds))

With a fine-tuned language model, Jeremy got the following results:
```
i liked this movie because of its story and characters . The story line was very strong , very good for a sci - fi film . The main character , Alucard , was very well developed and brought the whole story
i liked this movie because i like the idea of the premise of the movie , the ( very ) convenient virus ( which , when you have to kill a few people , the " evil " machine has to be used to protect
```
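
The call to `learn.predict` above passes `temperature=0.75`. The sketch below shows, in plain Python, what a temperature generally does when sampling the next token: the scores are divided by the temperature before being turned into probabilities, so a lower temperature concentrates probability on the most likely token. This is a generic illustration, not fastai's exact code, and the `logits` values are made up.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, rng=random):
    """Draw one token index according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 0.75, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Try a few temperatures with `learn.predict` and compare how repetitive or how surprising the generated reviews become.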

## Question 2
In the section above, Jeremy has shown us how convincingly the fine-tuned language model can generate movie reviews. Your task is to explore generating examples from a language model that hasn't been fine-tuned. What can we say about what the fine-tuning has taught the language model?


## Question 3
The fine-tuned classification model in Chapter 10 reached 94.3% accuracy. Explore how close you can get without starting from a fine-tuned language model.

While this model trains, try out question 4.

Think about whether the language model fine tuning is worth the time it takes.

In [None]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

In [None]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()

In [None]:
learn.fit_one_cycle(1, 2e-2)

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
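
The `slice(...)` arguments in the cells above give the lowest and the highest learning rate to use. The sketch below shows how such a range can be spread over the layer groups, assuming geometric spacing between the two ends (which is our understanding of what fastai does with a `slice`) and assuming the classifier has five layer groups. `spread_lrs` and `N_GROUPS` are hypothetical names for this example, and the group count is an assumption you can check against your own `learn`.

```python
def spread_lrs(start, stop, n_groups):
    """Geometrically space `n_groups` learning rates from `start` to `stop`."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

N_GROUPS = 5  # assumption: number of layer groups in the classifier

# Same range as the first gradual-unfreezing step above.
lrs = spread_lrs(1e-2 / (2.6 ** 4), 1e-2, N_GROUPS)
for i, lr in enumerate(lrs):
    print(f"group {i}: {lr:.2e}")
```

Under these assumptions each group trains 2.6 times faster than the one before it, which is where the `2.6**4` in the lower bound comes from.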

## Question 4
In the `Creating the Classifier DataLoaders` section of https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb, the classification `DataLoaders` are defined with a `seq_len` of 72. Later in the section, Jeremy explains: "`We will expand the shortest texts to make them all the same size.`" Some of the samples in the first batch have over 500 tokens. Explain how `seq_len` can be 72 when the samples contain so many tokens.

Hint: start by looking at https://github.com/fastai/fastai/blob/99d38fec7207db9b4209568bebc85ded7e3d3f1b/fastai/text/models/core.py#L115

## Questionnaire

1. What is "self-supervised learning"?
1. What is a "language model"?
1. Why is a language model considered self-supervised?
1. What are self-supervised models usually used for?
1. Why do we fine-tune language models?
1. What are the three steps to create a state-of-the-art text classifier?
1. How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?
1. What are the three steps to prepare your data for a language model?
1. What is "tokenization"? Why do we need it?
1. Name three different approaches to tokenization.
1. What is `xxbos`?
1. List four rules that fastai applies to text during tokenization.
1. Why are repeated characters replaced with a token showing the number of repetitions and the character that's repeated?
1. What is "numericalization"?
1. Why might there be words that are replaced with the "unknown word" token?
1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer on the book's website.)
1. Why do we need padding for text classification? Why don't we need it for language modeling?
1. What does an embedding matrix for NLP contain? What is its shape?
1. What is "perplexity"?
1. Why do we have to pass the vocabulary of the language model to the classifier data block?
1. What is "gradual unfreezing"?
1. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?