## 00:37:56 - Text generation

* First, rebuild the language model learner from the previous notebook and load its fine-tuned encoder:

In [1]:
from fastai.text.all import *

In [2]:
path = untar_data(URLs.IMDB)

get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()],
    path='/kaggle/working'
).to_fp16()

In [3]:
learn.load_encoder('/kaggle/input/lesson-8/models/finetuned')

<fastai.text.learner.LMLearner at 0x7f374aed6bd0>

* Then make some predictions using a short sentence as a prompt:

In [4]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]
print("\n".join(preds))

i liked this movie because there were n't any twists or turns and was never dull . But the acting was pretty good considering everything else . The characters were believable and believable . Tom Sizemore and Emily Dafoe
i liked this movie because of its relatively low budget . The special effects are n't bad either -- except for an exaggerated facial expression . 

 But now for the sudden appearance of these aliens . In fact , these aliens


In [5]:
TEXT = "I thought this movie would be more like Jaws but it was "
N_WORDS = 40
N_SENTENCES = 2

preds = [
    learn.predict(TEXT, N_WORDS, temperature=0.75)
    for _ in range(N_SENTENCES)
]

print("\n".join(preds))

i thought this movie would be more like Jaws but it was n't a rip off . It was mostly about middle spectacular men searching for boat and finding each other with their own basic twists and turns . This was written and directed by Al Pacino who
i thought this movie would be more like Jaws but it was more like a remake of Jaws than a remake or update . 

 Although they did n't compare this to Jaws 3 because it still has n't changed the storyline since it was made . 




* There are better ways to do language generation, but this tells us that the model has learned something.
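
* The `temperature` argument controls how adventurous the sampling is. The following is a minimal sketch of the general idea in plain Python, not fastai's internal code; the `vocab` and `logits` values are made up for illustration. Dividing the scores by a temperature below 1 sharpens the distribution, so the most likely word is picked more often.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened (T < 1) or flattened (T > 1)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(vocab, logits, temperature=1.0, rng=random):
    """Sample one token from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical scores for the word following "I liked this movie because"
vocab = ["of", "it", "the", "there"]
logits = [2.0, 1.5, 1.0, 0.5]

p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.75)
next_word = sample_next(vocab, logits, temperature=0.75)
```

Because each word is sampled rather than always taking the top prediction, the two generated reviews above differ even though the prompt is the same.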

## 00:39:51 - Creating classification model

* Create another DataBlock. This time, we pass the language model's vocab (`dls_lm.vocab`) to `TextBlock.from_folder`, so the classifier uses the same token-to-index mapping as the fine-tuned language model.
* We also don't pass `is_lm=True`. Instead, we add a `CategoryBlock`, because each review in this dataset has a sentiment label.
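
* Why the shared vocab matters can be sketched in plain Python. This is an illustration under assumptions, not fastai code: `numericalize`, `lm_vocab` and `rebuilt_vocab` are hypothetical names. The encoder's embedding rows were learned against the language model's ids, so a vocab built independently would send the same tokens to different rows.

```python
def numericalize(tokens, vocab, unk="xxunk"):
    """Map tokens to integer ids by position in `vocab`; unknown tokens map to `unk`."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(t, index[unk]) for t in tokens]

# Vocab the (hypothetical) language model was trained with
lm_vocab = ["xxunk", "xxpad", "xxbos", "the", "movie", "was", "great"]
# A vocab rebuilt from the classification texts alone could order tokens differently
rebuilt_vocab = ["xxunk", "xxpad", "xxbos", "great", "was", "movie", "the"]

tokens = ["xxbos", "the", "movie", "was", "great"]
ids_shared = numericalize(tokens, lm_vocab)
ids_rebuilt = numericalize(tokens, rebuilt_vocab)

# Same tokens, different ids: the pretrained embeddings would be looked up wrongly
mismatch = ids_shared != ids_rebuilt
```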

In [6]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

In [7]:
dls_clas.show_batch(max_n=3)

,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos xxmaj titanic directed by xxmaj james xxmaj cameron presents a fictional love story on the historical setting of the xxmaj titanic . xxmaj the plot is simple , xxunk , or not for those who love plots that twist and turn and keep you in suspense . xxmaj the end of the movie can be figured out within minutes of the start of the film , but the love story is an interesting one , however . xxmaj kate xxmaj winslett is wonderful as xxmaj rose , an aristocratic young lady betrothed by xxmaj cal ( billy xxmaj zane ) . xxmaj early on the voyage xxmaj rose meets xxmaj jack ( leonardo dicaprio ) , a lower class artist on his way to xxmaj america after winning his ticket aboard xxmaj titanic in a poker game . xxmaj if he wants something , he goes and gets it",pos
2,"xxbos xxmaj warning : xxmaj does contain spoilers . \n\n xxmaj open xxmaj your xxmaj eyes \n\n xxmaj if you have not seen this film and plan on doing so , just stop reading here and take my word for it . xxmaj you have to see this film . i have seen it four times so far and i still have n't made up my mind as to what exactly happened in the film . xxmaj that is all i am going to say because if you have not seen this film , then stop reading right now . \n\n xxmaj if you are still reading then i am going to pose some questions to you and maybe if anyone has any answers you can email me and let me know what you think . \n\n i remember my xxmaj grade 11 xxmaj english teacher quite well . xxmaj",pos


## 00:41:04 - Question: Do tokenisers do stemming or lemmatisation, or is that outdated?

* Stemming and lemmatisation are not part of tokenisation.
* Words take different forms for a reason, so we keep them as written rather than reducing them to a stem.

## 00:42:21 - Handling different sequence lengths

* With the language model, we can concatenate all documents together and split the result into equal-length chunks. This ensures each mini-batch has the same shape (batch size x sequence length).
  * We can't do that for classification. Each review has to stay associated with its dependent variable (the sentiment label), and reviews vary in length.
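
* The language model batching described above can be sketched in plain Python. This is a simplified illustration, not fastai's data loader: `lm_batches` is a hypothetical helper, the integers stand in for token ids, and the one-token-shifted targets are left out.

```python
def lm_batches(stream, bs, seq_len):
    """Split one long token stream into mini-batches of shape (bs, seq_len)."""
    # Each of the `bs` rows gets one contiguous slice of the stream
    row_len = len(stream) // bs
    rows = [stream[i * row_len:(i + 1) * row_len] for i in range(bs)]
    # Yield only full-width windows, dropping any ragged tail
    for start in range(0, row_len - seq_len + 1, seq_len):
        yield [row[start:start + seq_len] for row in rows]

# 24 stand-in token ids from all documents concatenated together
stream = list(range(24))
batches = list(lm_batches(stream, bs=2, seq_len=4))
```

Each row continues where it left off in the previous batch, which is what lets the model carry its hidden state from one mini-batch to the next.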

In [8]:
spacy = WordTokenizer()
tkn = Tokenizer(spacy)
files = get_text_files(path, folders=['train', 'test', 'unsup'])
txts = L(o.open().read() for o in files[:2000])

toks200 = txts[:200].map(tkn)
toks200[0]

num = Numericalize()
num.setup(toks200)

In [9]:
nums_samp = toks200[:10].map(num)

* Looking at the lengths, we can see they're all quite different:

In [10]:
nums_samp.map(len)

(#10) [158,319,181,193,114,145,260,146,252,295]

* What we need to do is add padding.
  * Add a special `xxpad` token to the shorter sequences so that they are all the length of the longest sequence in the batch.
* fastai also tries to batch similarly sized texts together to minimise padding.
* All of this happens when you call `TextBlock.from_folder`.
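
* A minimal sketch of both ideas in plain Python, using the ten lengths printed above. `pad_batch` and `padding_needed` are hypothetical helpers, and padding on the right is an assumption made for simplicity (which side gets the padding is a library detail).

```python
def pad_batch(seqs, pad_id):
    """Right-pad every sequence with `pad_id` up to the longest length in the batch."""
    longest = max(len(s) for s in seqs)
    return [s + [pad_id] * (longest - len(s)) for s in seqs]

def padding_needed(lengths, bs):
    """Count the pad tokens needed when `lengths` are batched in the given order."""
    total = 0
    for i in range(0, len(lengths), bs):
        chunk = lengths[i:i + bs]
        total += sum(max(chunk) - n for n in chunk)
    return total

padded = pad_batch([[5, 6, 7], [8]], pad_id=1)

# Lengths of the ten numericalized reviews shown above
lengths = [158, 319, 181, 193, 114, 145, 260, 146, 252, 295]
unsorted_pad = padding_needed(lengths, bs=2)
sorted_pad = padding_needed(sorted(lengths), bs=2)
```

Grouping texts of similar length means each batch is only as wide as it needs to be, so far fewer positions are wasted on `xxpad`.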

## 00:45:30 - Create and fine tune classifier

In [11]:
learn = text_classifier_learner(
    dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()

* Now we can load the encoder:

In [12]:
learn = learn.load_encoder('/kaggle/input/lesson-8/models/finetuned')

In [13]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.269207,0.195298,0.9244,01:46


* The results are similar to the first classifier, but training took under 2 minutes.
* In NLP, it's better to unfreeze a few layers at a time rather than the whole model at once:

In [14]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.253995,0.181843,0.93144,01:58


* We can then unfreeze a bit more:

In [15]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.211313,0.166533,0.93672,02:42


* Then the whole model:

In [16]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.173864,0.164418,0.93888,03:18
1,0.170001,0.163286,0.93964,03:17


* Training a second model on all the texts read backwards and averaging the two models' predictions can get to 95.1% accuracy.
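
* The `slice(lr/(2.6**4), lr)` arguments used above give earlier layers smaller learning rates than the head. The sketch below shows the arithmetic in plain Python. It assumes the model is split into five parameter groups and that the rates are spaced geometrically between the two ends of the slice; `spread_lrs` is a hypothetical helper, not a fastai function.

```python
def spread_lrs(lo, hi, n_groups):
    """Geometrically space `n_groups` learning rates from `lo` (earliest layers) to `hi` (head)."""
    if n_groups == 1:
        return [hi]
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]

hi = 1e-3
# Assumption: five parameter groups, so 2.6**4 spans four steps
lrs = spread_lrs(hi / 2.6 ** 4, hi, n_groups=5)

# Ratio between each group's learning rate and the one before it
ratios = [b / a for a, b in zip(lrs, lrs[1:])]
```

Under those assumptions, each group trains with a learning rate 2.6 times larger than the group before it.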

## 00:48:54 - Question: How can a model trained to predict the next word work on a different task like sentiment?

* To predict the next word of a sentence, you have to know a lot about language and the world. That knowledge transfers to other tasks.

## 00:51:00 - Question: How do you do data augmentation on text?

* One approach is to translate the text into another language and back again.
* Some good ideas in this paper: [Unsupervised Data Augmentation for Consistency Training](https://arxiv.org/abs/1904.12848).

## 00:51:52 - Ethics and risks of text generation

* The FCC asked for comments about a proposal to repeal Net Neutrality. [It turned out that fewer than 800k of the 22M comments were unique](https://hackernoon.com/more-than-a-million-pro-repeal-net-neutrality-comments-were-likely-faked-e9f0e3ed36a6).
* What would happen if someone created a million Twitter bots, so that 99% of the content came from deep learning bots?