In [1]:
!pip install git+https://github.com/fastai/fastai.git &> /dev/null

## 00:00:00 - Intro and NLP Review

* In Lesson 1, we created a sentiment classifier, but we haven't yet looked at what's going on behind the scenes.

## 00:01:31 - Language models in NLP

* The pretrained model we used for the IMDB sentiment analysis was a pretrained language model.
* One example of a language model: a model trained to predict the next word of a text.
  * To be able to do this, a language model needs to have a good understanding of language.
* You can download pretrained language models through various Model Zoos.
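
To make "predict the next word" concrete, here is a deliberately tiny sketch that counts which word follows which in a toy corpus. It is only an illustration: the function names and the corpus are made up for this note, and the pretrained model in the lesson is a neural network, not a bigram counter.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        follows[cur][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]

# Toy corpus (hypothetical, for illustration only).
corpus = "the movie was great and the movie was long but the ending was great".split()
model = train_bigram(corpus)
print(predict_next(model, "movie"))
```

Even this crude counter has to pick up something about how words are used together, which is the intuition behind why a good next-word predictor ends up knowing a lot about language.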
  
## 00:04:36 - Review of lesson 1.

* We downloaded a pretrained model and fine tuned it:

In [2]:
from fastai.text.all import *

dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.449429,0.379644,0.831,02:17


epoch,train_loss,valid_loss,accuracy,time
0,0.297342,0.359466,0.86068,04:14
1,0.251144,0.222691,0.91048,04:16
2,0.198196,0.195068,0.92292,04:14
3,0.146078,0.191531,0.92836,04:15


## 00:05:08 - Improving results by creating a domain-specific language model

* One trick is to start with a Wikitext language model, then fine-tune it into a domain-specific language model. In this example, an IMDb language model.
  * It will learn IMDb-specific words.
  
## 00:05:58 - Building a language model from scratch

* Sentences can be different lengths. Documents can be very long.
* We already looked at using categorical variables as independent variables:
  * Make list of all possible levels of the categorical variable (vocab)
  * Replace each level with index in vocab
  * Create embedding matrix containing a row for each level
  * Use embedding matrix as first layer of the neural network
* We can almost do the same thing with text.
  * Create a list of all words to generate vocab
  * Only question is how to use a sequence?
* Independent variable is the sequence of words from the first to the second-to-last.
* Dependent variable is the sequence of words from the second to the last.
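
The offset between the two variables can be shown with plain list slicing. This is a minimal sketch on a made-up token list, not the fastai implementation:

```python
# Hypothetical tokenised text, for illustration only.
tokens = ["xxbos", "the", "movie", "was", "great", "."]

# Independent variable: every token except the last.
x = tokens[:-1]
# Dependent variable: the same sequence shifted left by one token.
y = tokens[1:]

# Each input token is paired with the token that follows it.
for inp, target in zip(x, y):
    print(f"{inp!r} -> {target!r}")
```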

* When we create vocab, a lot of words will be already in embedding matrix
  * May be new ones: informal slang, words not in Wikipedia etc.
* For words in Wikipedia, we will use the pretrained embedding vector.
* For new words, we will create a new random row.
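
A rough sketch of that merging step, using plain Python lists in place of tensors. The function name `build_embeddings`, the toy vocabularies, and the Gaussian initialisation are all assumptions made for illustration; fastai does this internally and its initialisation of new rows may differ.

```python
import random

def build_embeddings(new_vocab, old_vocab, old_emb, emb_size, seed=0):
    """Build one embedding row per word in `new_vocab`.

    Words already in `old_vocab` reuse their pretrained row;
    unseen words get a freshly initialised random row.
    """
    rng = random.Random(seed)
    old_index = {w: i for i, w in enumerate(old_vocab)}
    rows = []
    for word in new_vocab:
        if word in old_index:
            rows.append(list(old_emb[old_index[word]]))  # copy the pretrained vector
        else:
            rows.append([rng.gauss(0, 0.1) for _ in range(emb_size)])  # new random row
    return rows

# Toy "pretrained" vocab with 2-dimensional embeddings (made-up numbers).
old_vocab = ["the", "movie", "film"]
old_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# The new corpus contains a slang word the pretrained model never saw.
new_vocab = ["the", "movie", "meh"]
new_emb = build_embeddings(new_vocab, old_vocab, old_emb, emb_size=2)
print(new_emb)
```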

* List of steps:
  * **Tokenisation** - convert the text into a list of words (or characters, or substrings).
  * **Numericalisation** - make a list of all the unique words that appear (the vocab), then convert each word into a number by looking up its index in the vocab.
  * **Language model data loader** - fastai has `LMDataLoader`, which handles creating a dependent variable that is offset from the independent variable by one token.
  * **Language model** - need a special model that can handle sequences of arbitrary length.

## 00:10:27 - Tokenisation

* Converting a text into a list of words.
* Questions to ask: What to do with punctuation? How to deal with words like "don't"? Long medical or chemical words? etc
* 3 common approaches:
  * Word-based: used by default in English. Splits on spaces and applies language-specific rules to separate parts of meaning, like turning "don't" into "do n't". Punctuation marks are also split into separate tokens.
  * Subword-based: splits words into smaller parts, based on the most commonly occurring substrings. For example, "occasion" might be tokenised as "o c ca sion".
  * Character-based: splits a sentence into its individual characters.
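
As a rough illustration of the word-based and character-based approaches, here is a small regex sketch. The functions `word_tokenize` and `char_tokenize` are hypothetical helpers written for this note; real word tokenisers such as spaCy apply many more language-specific rules.

```python
import re

def word_tokenize(text):
    """Very simplified word tokeniser: splits off "n't" and punctuation."""
    # Separate the contraction so "don't" becomes "do n't".
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # Match the contraction suffix, runs of word characters, or single punctuation marks.
    return re.findall(r"n't|\w+|[^\w\s]", text)

def char_tokenize(text):
    """Character-based tokeniser: every character is its own token."""
    return list(text)

print(word_tokenize("I don't like it."))
print(char_tokenize("hi!"))
```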
  
## 00:12:19 - Word tokenisation

* fastai doesn't invent its own tokenisers, but provides a consistent interface to a range of existing tokenisers.
* Start by getting all text files from IMDB dataset:

In [3]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

In [4]:
files = get_text_files(path, folders=['train', 'test', 'unsup'])

* Can get and read first one:

In [5]:
txt = files[0].open().read(); txt[:75]

'Jiang Xian uses the complex backstory of Ling Ling and Mao Daobing to study'

* The default English word tokeniser in fastai uses a library called spaCy.

In [6]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

(#143) ['Jiang','Xian','uses','the','complex','backstory','of','Ling','Ling','and','Mao','Daobing','to','study','Mao',"'s",'"','cultural','revolution','"','(','1966','-','1976',')','at','the','village','level','.'...]


* Another example:

In [7]:
first(spacy(['The U.S. dollar $1 is $1.00']))

(#8) ['The','U.S.','dollar','$','1','is','$','1.00']

* fastai adds some functionality on top of this. It inserts "special tokens", which start with the characters `xx`.
  * `xxbos` indicates the start of a new text ("beginning of stream").
  * `xxmaj` indicates that the next word begins with a capital letter.
  * `xxunk` indicates that the word is unknown (not in the vocab).
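
The idea behind the capitalisation markers can be sketched in a few lines. This is a simplified stand-in, not fastai's actual rule set: `add_special_tokens` is a hypothetical helper, and the real rules handle more cases (repeated characters, HTML clean-up, and so on).

```python
BOS, MAJ, UP = "xxbos", "xxmaj", "xxup"

def add_special_tokens(words):
    """Lower-case every word, recording the original casing with marker tokens."""
    out = [BOS]  # mark the start of a new text
    for w in words:
        if len(w) > 1 and w.isupper():
            out += [UP, w.lower()]     # whole word was upper case
        elif len(w) > 1 and w[0].isupper():
            out += [MAJ, w.lower()]    # word started with a capital
        else:
            out.append(w.lower())      # nothing worth marking
    return out

print(add_special_tokens(["This", "movie", "is", "so", "STUPID", "!", "I", "mean"]))
```

Lower-casing with markers means "This" and "this" share one embedding row, while the information that a capital was present is kept as a separate token.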

In [8]:
tkn = Tokenizer(spacy)

In [9]:
coll_repr(tkn('&copy;    Fast.ai www.fast.ai/INDEX'), 31)

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"

## 00:17:38 - Subword tokeniser

* Some languages, such as Chinese, don't use spaces between words, so splitting on spaces doesn't work. Subword tokenisation can be used instead.
  * Look at a corpus of documents and find the most commonly occurring groups of letters: they become the vocab.
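
One way to get a feel for "most commonly occurring groups of letters" is a byte-pair-encoding style merge loop, sketched below on a toy word list. This is an assumption-laden illustration: fastai's `SubwordTokenizer` is backed by the SentencePiece library, whose training algorithm may differ from this, and the helper names here are made up.

```python
from collections import Counter

def most_common_pair(words):
    """Return the most frequent pair of adjacent symbols across all words."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Start from single characters (toy corpus, for illustration only).
words = [list(w) for w in ["ing", "ling", "ling", "in"]]
vocab = {ch for w in words for ch in w}

# Two merge steps: each adds the currently most frequent pair to the vocab.
for _ in range(2):
    pair = most_common_pair(words)
    words = merge_pair(words, pair)
    vocab.add("".join(pair))

print(words)
print(sorted(vocab))
```

More merge steps give a larger vocab with longer pieces, which matches the behaviour seen when the vocab size is increased from 1,000 to 10,000.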

In [10]:
txts = L(o.open().read() for o in files[:2000])

* The `setup` method is used to prepare or train a transform. In this example, it trains the subword tokeniser on a subset of the corpus.

In [11]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])

In [12]:
subword(1000)

'▁J i ang ▁ X ian ▁us es ▁the ▁comp le x ▁back st or y ▁of ▁L ing ▁L ing ▁and ▁Ma o ▁Da o b ing ▁to ▁st u d y ▁Ma o \' s ▁" c ul'

* With a larger vocab, most common English words end up in vocab itself

In [13]:
subword(10000)

'▁J ian g ▁X ian ▁uses ▁the ▁complex ▁back story ▁of ▁L ing ▁L ing ▁and ▁Ma o ▁Da ob ing ▁to ▁study ▁Ma o \' s ▁" cul t ural ▁revolution " ▁( 1966 - 1 9 7 6'

* Jeremy predicts that subword will be the most popular form of tokenisation.

## 00:21:21 - Question: how can we determine if pretrained model is suitable for downstream task? If there's limited vocab overlap, should we create a language model from scratch?

* In same language, Wikitext usually works well.
* If you were working with genomic sequences, or with Greek, then you would likely need a domain-specific language model.

## 00:23:25 - Numericalization

* After splitting into tokens, the next step is numericalization.

In [14]:
toks200 = txts[:200].map(tkn)
toks200[0]

(#158) ['xxbos','xxmaj','jiang','xxmaj','xian','uses','the','complex','backstory','of'...]

* Can then pass into `setup` to create `vocab`

In [15]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab, 20)

"(#2152) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the',',','.','and','a','of','to','is','in','it','i'...]"

* The tokens are in order of frequency.
  * Defaults:
    * `min_freq=3` - minimum frequency required.
    * `max_vocab=60000` - limit the size of the embedding matrix
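
A minimal sketch of what those two defaults control, written with the standard library only. `make_vocab` and `numericalize` are hypothetical helpers, and the list of special tokens is shortened; fastai's `Numericalize` is the real implementation and differs in detail.

```python
from collections import Counter

SPECIAL = ["xxunk", "xxpad", "xxbos"]  # shortened list, for illustration

def make_vocab(token_lists, min_freq=3, max_vocab=60000):
    """Special tokens first, then words in order of frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Keep at most `max_vocab` of the most frequent tokens,
    # and drop anything seen fewer than `min_freq` times.
    words = [w for w, c in counts.most_common(max_vocab)
             if c >= min_freq and w not in SPECIAL]
    return SPECIAL + words

def numericalize(tokens, vocab):
    """Replace each token with its vocab index; unknown tokens map to 0 (xxunk)."""
    index = {w: i for i, w in enumerate(vocab)}
    return [index.get(t, 0) for t in tokens]

docs = [
    ["xxbos", "the", "movie", "was", "great"],
    ["xxbos", "the", "movie", "was", "bad"],
    ["xxbos", "the", "film", "was", "great"],
]
vocab = make_vocab(docs, min_freq=2)
ids = numericalize(docs[2], vocab)
print(vocab)
print(ids, [vocab[i] for i in ids])
```

Rare words such as "film" fall below `min_freq`, so they are mapped to `xxunk` rather than getting their own row in the embedding matrix.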

* Can now call `num` as if it were a function:

In [16]:
nums = num(toks)[:20]; nums

TensorText([   0,    0, 1268,    9, 1269,    0,   14,    0,    0,   12,    0,    0,
          15, 1270,    0,   22,   24,    0,  795,   24])

* Can convert back by indexing into vocab:

In [17]:
' '.join(num.vocab[o] for o in nums)

'xxunk xxunk uses the complex xxunk of xxunk xxunk and xxunk xxunk to study xxunk \'s " xxunk revolution "'

## 00:25:43 - Putting texts into batches for language model

* Items in a mini-batch need to be the same size to be stacked in Tensor.
    * With images, we resize them all to a consistent size.
* How do we do it with text?
* We could concatenate the texts into one stream and break it into contiguous blocks, one block per row of the batch. Here's an example with a batch size of 6:

In [18]:
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs,seq_len = 6,15
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,xxbos,xxmaj,in,this,chapter,",",we,will,go,back,over,the,example,of,classifying
1,movie,reviews,we,studied,in,chapter,1,and,dig,deeper,under,the,surface,.,xxmaj
2,first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into,numbers,and
3,how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have,another,example
4,of,the,preprocessor,used,in,the,data,block,xxup,api,.,\n,xxmaj,then,we
5,will,study,how,we,build,a,language,model,and,train,it,for,a,while,.


* That won't scale to a giant corpus, so we have to divide into subarrays with a fixed sequence length.
* Let's try a sequence length of 5:

In [19]:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
df

Unnamed: 0,0,1,2,3,4
0,xxbos,xxmaj,in,this,chapter
1,movie,reviews,we,studied,in
2,first,we,will,look,at
3,how,to,customize,it,.
4,of,the,preprocessor,used,in
5,will,study,how,we,build


* Then the next batch:

In [20]:
#hide_input
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
df

Unnamed: 0,0,1,2,3,4
0,",",we,will,go,back
1,chapter,1,and,dig,deeper
2,the,processing,steps,necessary,to
3,xxmaj,by,doing,this,","
4,the,data,block,xxup,api
5,a,language,model,and,train


* And so on.
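
The same slicing can be written as a small generator. This is a sketch of the idea only (the name `lm_batches` is made up, and integers stand in for tokens); fastai's `LMDataLoader` also handles shuffling documents and building the offset targets.

```python
def lm_batches(tokens, bs, seq_len):
    """Yield mini-batches, each a list of `bs` rows of up to `seq_len` tokens.

    The stream is first cut into `bs` contiguous rows; each batch then takes
    the next `seq_len` tokens from every row, so row i of one batch continues
    exactly where row i of the previous batch stopped.
    """
    row_len = len(tokens) // bs  # tokens per row; any leftover tokens are dropped
    rows = [tokens[i * row_len:(i + 1) * row_len] for i in range(bs)]
    for start in range(0, row_len, seq_len):
        yield [row[start:start + seq_len] for row in rows]

# Integers make it easy to see which part of the stream lands where.
batches = list(lm_batches(list(range(24)), bs=3, seq_len=4))
for b in batches:
    print(b)
```

Keeping each row contiguous across batches matters because the model carries its state from one batch to the next, so it should see the text in reading order.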

## 00:29:24 - LMDataLoader

* Don't have to do these steps yourself, `LMDataLoader` in fastai handles it for you.

In [21]:
nums200 = toks200.map(num)

In [22]:
dl = LMDataLoader(nums200)

In [23]:
x, y = first(dl)
x.shape, y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

* 64 is the batch size and 72 is the sequence length.
* Can examine the first 20 tokens of independent variable:

In [24]:
' '.join(num.vocab[o] for o in x[0][:20])

'xxbos xxmaj xxunk xxmaj xxunk uses the complex xxunk of xxmaj xxunk xxmaj xxunk and xxmaj xxunk xxmaj xxunk to'

* Then the dependent variable, which is the same thing offset by one token:

In [25]:
' '.join(num.vocab[o] for o in y[0][:20])

'xxmaj xxunk xxmaj xxunk uses the complex xxunk of xxmaj xxunk xxmaj xxunk and xxmaj xxunk xxmaj xxunk to study'

## 00:31:07 - Creating language model with DataBlock

* In the `DataBlock`, we pass `TextBlock.from_folder` as the `blocks` argument.
  * Using the class method, rather than constructing `TextBlock` directly, allows the tokenisation to be cached.

In [26]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

* Note the items are text items from folders.
* We can then call `show_batch`:

In [27]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj this movie is so xxup stupid ! i mean .. why does xxmaj liam xxmaj neeson take this woman so seriously , having only heard her say 3 xxup words ! xxmaj when they found her in the woods , they should xxunk have committed her to psychiatric hospital to try and make her a real human being . xxmaj just to see xxmaj jodie "" hey - see how stupid i look "" xxmaj foster dance around","xxmaj this movie is so xxup stupid ! i mean .. why does xxmaj liam xxmaj neeson take this woman so seriously , having only heard her say 3 xxup words ! xxmaj when they found her in the woods , they should xxunk have committed her to psychiatric hospital to try and make her a real human being . xxmaj just to see xxmaj jodie "" hey - see how stupid i look "" xxmaj foster dance around chanting"
1,"tracking down vehicles that may have entered the crime scene from camera - visible locations adjacent to the crime scene as part of developing clues . \n\n * xxmaj in xxmaj england , driving is on the left . xxmaj the director goes out of his way to have the car at the crime scene park on the right , several meters away from the flower kiosk , when it could have easily parked immediately behind , or even on","down vehicles that may have entered the crime scene from camera - visible locations adjacent to the crime scene as part of developing clues . \n\n * xxmaj in xxmaj england , driving is on the left . xxmaj the director goes out of his way to have the car at the crime scene park on the right , several meters away from the flower kiosk , when it could have easily parked immediately behind , or even on the"


## 00:33:23 - Fine tuning a language model

* Fine-tuning the language model creates a learner which learns to predict the next word of a movie review.
* `AWD_LSTM` is the architecture.
* `drop_mult` is amount of Dropout (covered later in the course).

In [28]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()],
    path='/kaggle/working'
).to_fp16()

In [29]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.006212,3.899826,0.300805,49.393833,27:45
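
The perplexity column is the exponential of the validation loss (cross-entropy), so it can be checked by hand from the row above. A quick sanity check, using only the numbers printed in the table:

```python
import math

valid_loss = 3.899826           # validation loss from the training output
reported_perplexity = 49.393833  # perplexity from the same row

# Perplexity is exp(cross-entropy loss).
perplexity = math.exp(valid_loss)
print(perplexity, reported_perplexity)
```

The two values should agree up to the rounding of the printed loss. Lower perplexity means the model is, on average, less "surprised" by the next token.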


## 00:35:07 - Saving and loading models

* Can use `learn.save` to save intermediate results:

In [30]:
learn.save('1epoch')

Path('/kaggle/working/models/1epoch.pth')

* And load with `learn.load`:

In [31]:
learn.load('1epoch')

<fastai.text.learner.LMLearner at 0x7fa49e0a2e10>

* Can then finetune after unfreezing:

In [32]:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.763982,3.759459,0.316605,42.925179,30:04
1,3.684646,3.699707,0.323066,40.435436,30:15
2,3.62389,3.654301,0.328286,38.640499,30:13
3,3.564344,3.620885,0.332433,37.370613,30:22
4,3.501577,3.598828,0.335188,36.555359,30:30
5,3.427867,3.579867,0.337651,35.868767,30:05
6,3.362838,3.575086,0.33917,35.697678,30:37
7,3.302284,3.569533,0.340268,35.499996,30:19
8,3.239792,3.574208,0.34049,35.666363,30:12
9,3.215606,3.578157,0.340285,35.807476,30:14


* We can then save just the encoder.
* Encoder is all of the model that isn't the final layer.

In [33]:
learn.save_encoder('finetuned')

## 00:36:44 - Question: Do language models attempt to provide meaning?

* Language models do tend to get good at understanding the nuances of language.

## 00:37:56 - Text generation (next notebook)