<a href="https://colab.research.google.com/github/kellyslpang/unpackAIworkbooks/blob/main/Kelly_10_nlp_own_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Deep Dive - Own Code

Own refactored code and notes for *Chapter 10: NLP Deep Dive: RNNs* ([`10_nlp.ipynb`](https://colab.research.google.com/github/vtecftwy/fastbook/blob/master/10_nlp.ipynb)).

## Instructions

It is recommended that you work in two steps:
1. Copy the code from the fastbook notebook and make sure it works.
2. Refactor (i.e. rewrite the code in your own style) by:
    - regrouping things in a way that makes sense to you
    - adding text cells that explain in your own words what the code does, with references to any documentation you consulted
    - deleting code that was only there to explain things and is not required once you run models end to end

When you have done that, you will have a customized reference notebook that you can consult later when you forget the details, without having to read the full fastbook notebook.

## Your code

In [2]:
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
from fastbook import *
from IPython.display import display,HTML

Mounted at /content/gdrive


In [3]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

In [4]:
# 1. List all the folders under path (using the path.iterdir() method)
print(f"path to dataset: {path.absolute()}")
[f"{'file:  ' if p.is_file() else 'folder:'} {p.name}" for p in path.iterdir()]

path to dataset: /root/.fastai/data/imdb


['file:   README',
 'folder: unsup',
 'folder: train',
 'folder: test',
 'file:   imdb.vocab',
 'folder: tmp_lm',
 'folder: tmp_clas']

In [5]:
# 2. Get the full text of README
with open(path/'README', mode='r') as f:
    txt = f.readlines()
print(''.join(txt))

Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a scor

In [6]:
# List the folders and list the files
print('Folders:')
display([p.name for p in path.iterdir() if p.is_dir()])
print('Files:')
display([p.name for p in path.iterdir() if p.is_file()])

Folders:


['unsup', 'train', 'test', 'tmp_lm', 'tmp_clas']

Files:


['README', 'imdb.vocab']

In [7]:
# Contents of the train folder, the test folder (used for validation) and the unsup folder (excluding .txt files for unsup)
[p.name for p in (path/'train').iterdir()], [p.name for p in (path/'test').iterdir()], [p.name for p in (path/'unsup').iterdir() if 'txt' not in p.suffix]

(['labeledBow.feat', 'neg', 'pos', 'unsupBow.feat'],
 ['labeledBow.feat', 'neg', 'pos'],
 [])

In [8]:
# First five training files in the positive (pos) and negative (neg) review folders. As mentioned in the README, the file name format is id_rating.txt
[p.name for p in (path/'train/pos').iterdir()][:5], [p.name for p in (path/'train/neg').iterdir()][:5]

(['7236_8.txt', '3889_10.txt', '8885_10.txt', '7490_8.txt', '5845_7.txt'],
 ['10311_3.txt', '8879_4.txt', '6222_1.txt', '4851_3.txt', '8626_1.txt'])

In [9]:
# First five test files in the positive (pos) and negative (neg) review folders. As mentioned in the README, the file name format is id_rating.txt
[p.name for p in (path/'test/pos').iterdir()][:5], [p.name for p in (path/'test/neg').iterdir()][:5]

(['7814_8.txt', '2943_10.txt', '4776_8.txt', '177_8.txt', '8338_7.txt'],
 ['8879_4.txt', '4866_3.txt', '6222_1.txt', '9901_1.txt', '7848_2.txt'])

In [10]:
# First five files in the unsup folder. The file name format is still id_rating.txt, but the rating is always 0
[p.name for p in (path/'unsup').iterdir()][:5]

['5962_0.txt', '49064_0.txt', '10550_0.txt', '25322_0.txt', '593_0.txt']

In [4]:
files = get_text_files(path, folders = ['train', 'test', 'unsup'])
files

(#100000) [Path('/root/.fastai/data/imdb/train/neg/428_1.txt'),Path('/root/.fastai/data/imdb/train/neg/4231_1.txt'),Path('/root/.fastai/data/imdb/train/neg/10597_2.txt'),Path('/root/.fastai/data/imdb/train/neg/4184_4.txt'),Path('/root/.fastai/data/imdb/train/neg/6746_1.txt'),Path('/root/.fastai/data/imdb/train/neg/4399_4.txt'),Path('/root/.fastai/data/imdb/train/neg/1534_1.txt'),Path('/root/.fastai/data/imdb/train/neg/7054_2.txt'),Path('/root/.fastai/data/imdb/train/neg/1785_2.txt'),Path('/root/.fastai/data/imdb/train/neg/2131_2.txt')...]

In [10]:
txt = files[1].open().read()
txt[:150]

'I felt obliged to watch this movie all the way through, since I had found it in a bargain bin and bought it for my own, but I came close many times to'

In [11]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

(#275) ['I','felt','obliged','to','watch','this','movie','all','the','way','through',',','since','I','had','found','it','in','a','bargain','bin','and','bought','it','for','my','own',',','but','I'...]


In [14]:
first(spacy(['The U.S. dollar $1 is $1.00.']))

(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']

In [12]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn('The U.S. dollar $1 is $1.00.'),20))

(#13) ['xxbos','xxmaj','the','xxup','u.s','.','dollar','$','1','is','$','1.00','.']


In [13]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#292) ['xxbos','i','felt','obliged','to','watch','this','movie','all','the','way','through',',','since','i','had','found','it','in','a','bargain','bin','and','bought','it','for','my','own',',','but','i'...]


In [17]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html>,
 <function fastai.text.core.replace_rep>,
 <function fastai.text.core.replace_wrep>,
 <function fastai.text.core.spec_add_spaces>,
 <function fastai.text.core.rm_useless_spaces>,
 <function fastai.text.core.replace_all_caps>,
 <function fastai.text.core.replace_maj>,
 <function fastai.text.core.lowercase>]

In [18]:
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"
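
The `xxrep 3 w` in the output above comes from the repeated-character rule. As a rough illustration of the idea only (this is a simplified stand-in, not fastai's `replace_rep` code; the function name `replace_repeats` and the `min_run` parameter are hypothetical), such a rule can be sketched with a regular expression:

```python
import re

REP_TOKEN = "xxrep"  # same special token name that appears in the output above


def replace_repeats(text: str, min_run: int = 3) -> str:
    """Replace runs of `min_run` or more identical non-space characters
    with ' xxrep <count> <char> '. Simplified illustration, not fastai's rule."""
    # (\S) captures one character; \1{n,} requires at least n more copies of it
    pattern = re.compile(r"(\S)\1{%d,}" % (min_run - 1))

    def _sub(match: re.Match) -> str:
        char = match.group(1)
        count = len(match.group(0))
        return f" {REP_TOKEN} {count} {char} "

    return pattern.sub(_sub, text)


print(replace_repeats("www.fast.ai").split())
```

Encoding a run as a count plus a character keeps the vocabulary small: the model sees one token for "repeated", one for the count, and one for the character, instead of a separate vocabulary entry for every possible run length.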

In [19]:
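# spacy(...) returns a generator, so displaying it shows the generator object rather than the tokens
tokens = spacy([txt])
display(tokens)

# next() pulls the first (and here only) tokenized document out of the generator
tokens = spacy([txt])
display(next(tokens, None))

# first() from fastcore does the same thing
tokens = spacy([txt])
display(first(tokens))

# Read the first three reviews and show the start of each one
txt0 = files[0].open().read()
print(1, ': ', txt0[0:90])
txt1 = files[1].open().read()
print(2, ': ', txt1[0:90])
txt2 = files[2].open().read()
print(3, ': ', txt2[0:90])

# Tokenize the three reviews together; each call to first() consumes the next document from the generator
txt_collection = [txt0, txt1, txt2]
toks_collection = spacy(txt_collection)
display(first(toks_collection))
display(first(toks_collection))
display(first(toks_collection))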

<generator object SpacyTokenizer.__call__.<locals>.<genexpr> at 0x7f4a757b9650>

(#450) ['Well',',','where','to','start','?','I','want','to','start'...]

(#450) ['Well',',','where','to','start','?','I','want','to','start'...]

1 :  The story is completely different from the video game. I liked that. You don't know whats 
2 :  Well, where to start? I want to start off by saying that this movie was pretty bad, but th
3 :  This show is different, thats for sure. It lacks the slap stick humor your used to when wa


(#154) ['The','story','is','completely','different','from','the','video','game','.'...]

(#450) ['Well',',','where','to','start','?','I','want','to','start'...]

(#220) ['This','show','is','different',',','that','s','for','sure','.'...]

### Subword Tokenization

In [7]:
txts = L(o.open().read() for o in files[:2000])

In [8]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])

In [14]:
subword(1000)

'▁I ▁felt ▁ob li g ed ▁to ▁watch ▁this ▁movie ▁all ▁the ▁way ▁through , ▁since ▁I ▁had ▁found ▁it ▁in ▁a ▁b ar g ain ▁b in ▁and ▁b ough t ▁it ▁for ▁my ▁own , ▁but ▁I ▁came'

In [23]:
subword(200)

'▁ W e ll , ▁w h er e ▁to ▁ st ar t ? ▁I ▁w an t ▁to ▁ st ar t ▁of f ▁b y ▁s ay ing ▁that ▁this ▁movie ▁was ▁p re t t y'

In [24]:
subword(10000)

'▁Well , ▁where ▁to ▁start ? ▁I ▁want ▁to ▁start ▁off ▁by ▁saying ▁that ▁this ▁movie ▁was ▁pretty ▁bad , ▁but ▁the ▁gratuitous ▁nudity ▁was ▁comfort ing . ▁This ▁movie ▁runs ▁about ▁8 7 ▁minutes , ▁which ▁is ▁amazing ▁considering'

### Numericalization with fastai

In [15]:
toks = tkn(txt)
print(coll_repr(toks, 31))

(#292) ['xxbos','i','felt','obliged','to','watch','this','movie','all','the','way','through',',','since','i','had','found','it','in','a','bargain','bin','and','bought','it','for','my','own',',','but','i'...]


In [16]:
toks200 = txts[:200].map(tkn)
toks200[0]

(#78) ['xxbos','xxmaj','wow',',','a','movie','about','xxup','nyc','politics'...]

In [17]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)

"(#2184) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','is','it','i','in'...]"

In [18]:
nums = num(toks)[:20]
nums

TensorText([  2,  18, 335,   0,  15, 127,  21,  27,  46,   9, 118, 163,  11, 235,  18,  77, 363,  17,  19,  12])

In [19]:
' '.join(num.vocab[o] for o in nums)

'xxbos i felt xxunk to watch this movie all the way through , since i had found it in a'
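
To make the mapping above concrete, here is a minimal pure-Python sketch of numericalization. It is not fastai's `Numericalize` implementation: the helper names `build_vocab` and `numericalize`, the `min_freq` value, and the reduced list of special tokens are assumptions chosen for illustration. The idea is the same, though: build a vocabulary from token counts, then replace each token with its index, falling back to `xxunk` (index 0) for anything not in the vocabulary.

```python
from collections import Counter

# Illustrative subset of the special tokens; xxunk sits at index 0 as in the vocab above
SPECIAL = ["xxunk", "xxpad", "xxbos"]


def build_vocab(token_lists, min_freq=2):
    """Keep tokens seen at least `min_freq` times, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, n in counts.most_common()
                if n >= min_freq and tok not in SPECIAL]
    return SPECIAL + frequent


def numericalize(tokens, vocab):
    """Map each token to its vocab index; unknown tokens map to 0 (xxunk)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]


docs = [["xxbos", "i", "liked", "this", "movie"],
        ["xxbos", "i", "hated", "this", "movie"]]
vocab = build_vocab(docs)
ids = numericalize(["xxbos", "i", "felt", "this"], vocab)
print(vocab)
print(ids, [vocab[i] for i in ids])
```

In this toy example "felt" never appears in `docs`, so it becomes `xxunk`, which is the same effect seen for "obliged" in the cell above.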

### Putting Our Texts into Batches for a Language Model

In [20]:
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)

bs, seq_len = 6, 15

d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])

df = pd.DataFrame(d_tokens)

display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
xxbos,xxmaj,in,this,chapter,",",we,will,go,back,over,the,example,of,classifying
movie,reviews,we,studied,in,chapter,1,and,dig,deeper,under,the,surface,.,xxmaj
first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into,numbers,and
how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have,another,example
of,the,preprocessor,used,in,the,data,block,xxup,api,.,\n,xxmaj,then,we
will,study,how,we,build,a,language,model,and,train,it,for,a,while,.


In [21]:
bs,seq_len = 6, 5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
xxbos,xxmaj,in,this,chapter
movie,reviews,we,studied,in
first,we,will,look,at
how,to,customize,it,.
of,the,preprocessor,used,in
will,study,how,we,build


In [22]:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
over,the,example,of,classifying
under,the,surface,.,xxmaj
convert,text,into,numbers,and
we,'ll,have,another,example
.,\n,xxmaj,then,we
it,for,a,while,.


We create an `LMDataLoader` by first applying our `Numericalize` object to the tokenized texts:

In [23]:
nums200 = toks200.map(num)

In [24]:
dl = LMDataLoader(nums200)

In [25]:
x, y = first(dl)
x.shape, y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

In [26]:
' '.join(num.vocab[o] for o in x[0][:20])

'xxbos xxmaj wow , a movie about xxup nyc xxunk seemingly written by someone who has never set foot in'

In [27]:
' '.join(num.vocab[o] for o in y[0][:20])

'xxmaj wow , a movie about xxup nyc xxunk seemingly written by someone who has never set foot in xxup'
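
The relationship between `x` and `y` above (same text, offset by one token) and the layout of the earlier token tables can be sketched in plain Python. This is a simplified illustration under stated assumptions, not how `LMDataLoader` is implemented: the function name `lm_batches` is hypothetical, and the sketch ignores document shuffling and works on a list of integers standing in for token ids.

```python
def lm_batches(ids, bs, seq_len):
    """Split one long token stream into `bs` contiguous rows, then yield
    (x, y) mini-batches where y is x shifted forward by one token."""
    row_len = (len(ids) - 1) // bs  # -1 leaves room for the last target token
    # Each row keeps one extra token so the final position still has a target
    rows = [ids[r * row_len:(r + 1) * row_len + 1] for r in range(bs)]
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y


stream = list(range(21))  # stand-in for a numericalized text stream
batches = list(lm_batches(stream, bs=2, seq_len=5))
for x, y in batches:
    print("x:", x)
    print("y:", y)
```

Note that row 0 of the second batch continues exactly where row 0 of the first batch stopped. That continuity is what lets a recurrent model carry its hidden state from one batch to the next.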

## Training a Text Classifier

### Language Model Using DataBlock

In [28]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(blocks=TextBlock.from_folder(path, is_lm=True),
                   get_items=get_imdb, splitter=RandomSplitter(0.1)
                   ).dataloaders(path, path=path, bs=128, seq_len=80)

In [30]:
dls_lm.show_batch(max_n=5)

Unnamed: 0,text,text_
0,"xxbos xxmaj it plays like your usual teenage - audience xxup t&a movie , but the sentiment is incredibly bleak . xxmaj if it was made today , it 'd be considered an art house movie . xxmaj it goes through the usual routine of a guy trying to get laid , but the results of his efforts are harsh and cruel and unsatisfying . \n\n xxmaj the whole teen flick formula is adhered to , but nothing turns out","xxmaj it plays like your usual teenage - audience xxup t&a movie , but the sentiment is incredibly bleak . xxmaj if it was made today , it 'd be considered an art house movie . xxmaj it goes through the usual routine of a guy trying to get laid , but the results of his efforts are harsh and cruel and unsatisfying . \n\n xxmaj the whole teen flick formula is adhered to , but nothing turns out the"
1,"\n\n xxmaj this is a two - hour film and not the typical action - packed macho xxmaj will xxmaj smith film . xxmaj in fact , the most shocking aspect might be seeing the drawn , sad face of xxmaj smith throughout this story . xxmaj it almost does n't even look like him in a number of shots . xxmaj he looks like he 's lost weight and is sick . xxmaj smith does a great job portraying","xxmaj this is a two - hour film and not the typical action - packed macho xxmaj will xxmaj smith film . xxmaj in fact , the most shocking aspect might be seeing the drawn , sad face of xxmaj smith throughout this story . xxmaj it almost does n't even look like him in a number of shots . xxmaj he looks like he 's lost weight and is sick . xxmaj smith does a great job portraying a"
2,"old and long for the days before reality xxup tv ruled when good drama and sitcoms proliferated . xxbos no , i'm not a radical xxunk bashing the hentai and yaoi genre , i just find it really boring and xxunk god , i was xxup made to watch this for initiation from some stupid punk and my my , even an xxup mst3k movie has a storyline , not to mention that this xxup hentai crap is what 's","and long for the days before reality xxup tv ruled when good drama and sitcoms proliferated . xxbos no , i'm not a radical xxunk bashing the hentai and yaoi genre , i just find it really boring and xxunk god , i was xxup made to watch this for initiation from some stupid punk and my my , even an xxup mst3k movie has a storyline , not to mention that this xxup hentai crap is what 's giving"
3,"xxmaj stone ) , an experienced hunter friend of xxmaj challenger . xxmaj prof . xxmaj summerlee ( arthur xxmaj hoyt ) goes as well , hoping to prove that xxmaj challenger is a fraud , and finally , reporter xxmaj edward xxmaj malone ( lloyd xxmaj hughes ) joins the expedition , hoping to prove his girlfriend xxmaj gladys ( alma xxmaj bennet ) that he is brave enough to face death . \n\n xxmaj cleverly adapted by xxmaj","stone ) , an experienced hunter friend of xxmaj challenger . xxmaj prof . xxmaj summerlee ( arthur xxmaj hoyt ) goes as well , hoping to prove that xxmaj challenger is a fraud , and finally , reporter xxmaj edward xxmaj malone ( lloyd xxmaj hughes ) joins the expedition , hoping to prove his girlfriend xxmaj gladys ( alma xxmaj bennet ) that he is brave enough to face death . \n\n xxmaj cleverly adapted by xxmaj broadway"
4,"intelligence too . xxmaj her xxmaj lady xxmaj marion was much more than a helpless female . xxmaj was she a damsel in distress ? xxmaj oh , most surely she was that , but not a screaming , whiny helpless girl . \n\n xxmaj basil xxmaj rathbone ( sir xxmaj guy of xxmaj gisbourne ) was perhaps the best villain in the business . xxmaj next to his characterization of xxmaj sherlock xxmaj holmes in all those films (","too . xxmaj her xxmaj lady xxmaj marion was much more than a helpless female . xxmaj was she a damsel in distress ? xxmaj oh , most surely she was that , but not a screaming , whiny helpless girl . \n\n xxmaj basil xxmaj rathbone ( sir xxmaj guy of xxmaj gisbourne ) was perhaps the best villain in the business . xxmaj next to his characterization of xxmaj sherlock xxmaj holmes in all those films ( and"


### Fine-Tuning the Language Model

In [31]:
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, 
                               metrics=[accuracy, Perplexity()]
                               ).to_fp16()

In [32]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.015255,3.896605,0.300395,49.235023,21:10
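
The perplexity column is the exponential of the cross-entropy loss. A quick check against the validation loss reported in the table above (the variable names are only for illustration):

```python
import math

valid_loss = 3.896605              # valid_loss from the table above
perplexity = math.exp(valid_loss)  # perplexity = exp(cross-entropy loss)
print(perplexity)
```

The result should agree with the `perplexity` value in the table up to rounding, so the two columns carry the same information on different scales.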


### Saving and Loading Models

In [33]:
learn.save('1epoch')

Path('/root/.fastai/data/imdb/models/1epoch.pth')

In [34]:
learn = learn.load('1epoch')

In [35]:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.757856,3.752456,0.317064,42.62566,22:37
1,3.700445,3.69568,0.323698,40.272964,22:44
2,3.634137,3.644895,0.329365,38.278748,23:11
3,3.571423,3.614856,0.332681,37.145996,23:14
4,3.501326,3.590635,0.336075,36.257084,23:32
5,3.444747,3.573214,0.338475,35.630913,23:24
6,3.375748,3.565423,0.339745,35.354424,23:24
7,3.293104,3.562633,0.340946,35.255905,23:12
8,3.249682,3.564199,0.341284,35.311157,22:51
9,3.224097,3.568984,0.341065,35.480511,23:04


In [36]:
learn.save_encoder('finetuned')

In [41]:
!ls /root/.fastai/data/imdb/models/




1epoch.pth  finetuned.pth


In [45]:
!ls /content/gdrive/MyDrive/ai/nb10

finetuned.pth


In [44]:
path2 = Path('gdrive/MyDrive/ai/nb10')
!cp /root/.fastai/data/imdb/models/finetuned.pth /content/gdrive/MyDrive/ai/nb10

### Text Generation

In [37]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

In [38]:
print("\n".join(preds))

i liked this movie because the storyline was very well done . It had one of the best movies ever made . It was like a big budget film made on a lot of budget , but i did n't think it was
i liked this movie because it was really about a boy 's father who has a son who wants to take care of his father who is a Texan . That 's the way the movie was filmed called Miami . 
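
The `temperature` argument controls how random the generated text is. The sketch below shows the general technique of temperature sampling on a toy set of scores; it is a generic illustration, not the code inside `learn.predict`, and the function name `sample_with_temperature` and the example logits are hypothetical.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Turn raw scores into probabilities after dividing by `temperature`,
    then draw one index at random according to those probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs


logits = [2.0, 1.0, 0.0]  # toy scores for three candidate next tokens
_, p_low = sample_with_temperature(logits, temperature=0.5)
_, p_high = sample_with_temperature(logits, temperature=2.0)
print("T=0.5:", [round(p, 3) for p in p_low])
print("T=2.0:", [round(p, 3) for p in p_high])
```

A temperature below 1 sharpens the distribution, so the most likely token is picked more often; a temperature above 1 flattens it, which gives more varied but less coherent text.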




### Creating the Classifier DataLoaders

In [39]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

In [46]:
dls_clas.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos * ! ! - xxup spoilers - ! ! * \n\n xxmaj before i begin this , let me say that i have had both the advantages of seeing this movie on the big screen and of having seen the "" authorized xxmaj version "" of this movie , remade by xxmaj stephen xxmaj king , himself , in 1997 . \n\n xxmaj both advantages made me appreciate this version of "" the xxmaj shining , "" all the more . \n\n xxmaj also , let me say that xxmaj i 've read xxmaj mr . xxmaj king 's book , "" the xxmaj shining "" on many occasions over the years , and while i love the book and am a huge fan of his work , xxmaj stanley xxmaj kubrick 's retelling of this story is far more compelling … and xxup scary . \n\n xxmaj kubrick",pos
2,"xxbos xxmaj warning : xxmaj does contain spoilers . \n\n xxmaj open xxmaj your xxmaj eyes \n\n xxmaj if you have not seen this film and plan on doing so , just stop reading here and take my word for it . xxmaj you have to see this film . i have seen it four times so far and i still have n't made up my mind as to what exactly happened in the film . xxmaj that is all i am going to say because if you have not seen this film , then stop reading right now . \n\n xxmaj if you are still reading then i am going to pose some questions to you and maybe if anyone has any answers you can email me and let me know what you think . \n\n i remember my xxmaj grade 11 xxmaj english teacher quite well . xxmaj",pos


There is one challenge we have to deal with, however, which is to do with collating multiple documents into a mini-batch. Let's see with an example, by trying to create a mini-batch containing the first 10 documents. First we'll numericalize them:

In [47]:
nums_samp = toks200[:10].map(num)

In [48]:
nums_samp.map(len)

(#10) [78,292,187,548,558,204,189,168,584,178]
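
The lengths above all differ, but a mini-batch must be a rectangular tensor. The usual remedy is padding: extend every document to the length of the longest one with a special padding token. A minimal sketch, assuming padding is appended at the end and using index 1 because `xxpad` is at position 1 in the vocab shown earlier (the helper name `pad_batch` is hypothetical, and fastai's own collation differs in detail):

```python
def pad_batch(seqs, pad_id=1):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_id] * (max_len - len(s)) for s in seqs]


sample = [[2, 18, 335], [2, 9], [2, 46, 9, 118, 163]]  # toy numericalized documents
padded = pad_batch(sample)
for row in padded:
    print(row)
```

Grouping documents of similar length into the same batch keeps the amount of padding, and therefore wasted computation, small.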

In [49]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()

In [50]:
learn = learn.load_encoder('finetuned')

### Fine-Tuning the Classifier

In [51]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.246536,0.183735,0.9286,01:09


In [52]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.222068,0.165184,0.93832,01:14


In [53]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.188739,0.148851,0.9438,01:33


In [54]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.159825,0.147953,0.94564,01:53
1,0.152178,0.14802,0.9456,01:56
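
The `slice(lo, hi)` learning rates used above give the earliest layers the smallest rate and the last layers the largest. The sketch below shows one way such a range can be spread geometrically across layer groups. The helper name `spread_lrs` and the choice of five layer groups are assumptions for illustration, not values read from the learner.

```python
def spread_lrs(lo, hi, n_groups):
    """Return `n_groups` learning rates from `lo` to `hi`, each a constant
    multiple of the previous one (geometric spacing)."""
    if n_groups == 1:
        return [hi]
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]


# Same bounds as the final training cell above; 5 groups is an assumed value
lrs = spread_lrs(1e-3 / (2.6 ** 4), 1e-3, n_groups=5)
for i, lr in enumerate(lrs):
    print(f"group {i}: {lr:.2e}")
```

With five groups, dividing the top rate by `2.6**4` means each group trains 2.6 times faster than the one before it, which explains the otherwise odd-looking constant.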


## Questionnaire

1. **What is "self-supervised learning"?**
1. **What is a "language model"?**
1. **Why is a language model considered self-supervised?**
1. What are self-supervised models usually used for?
1. **Why do we fine-tune language models?**
1. What are the three steps to create a state-of-the-art text classifier?
1. **How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?**
1. What are the three steps to prepare your data for a language model?
1. **What is "tokenization"? Why do we need it?**
1. Name three different approaches to tokenization.
1. **What is `xxbos`?**
1. List four rules that fastai applies to text during tokenization.
1. **Why are repeated characters replaced with a token showing the number of repetitions and the character that's repeated?**
1. What is "numericalization"?
1. **Why might there be words that are replaced with the "unknown word" token?**
1. **With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain?**
1. Why do we need padding for text classification? Why don't we need it for language modeling?
1. **What does an embedding matrix for NLP contain? What is its shape?**
1. What is "perplexity"?
1. Why do we have to pass the vocabulary of the language model to the classifier data block?
1. What is "gradual unfreezing"?
1. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?