In [1]:
# Workaround for training output not visible in JupyterNotebook https://github.com/microsoft/vscode-jupyter/issues/13163
from IPython.display import clear_output, DisplayHandle
def update_patch(self, obj):
    clear_output(wait=True)
    self.display(obj)
DisplayHandle.update = update_patch

In [2]:
from fastbook import *
from IPython.display import display,HTML

## NLP with RNNs

A language model is trained to guess the next word in a text, based on the words that came before it. This is called self-supervised learning, because the labels come from the text itself rather than from external annotation.
The IMDB example used a language model pretrained on Wikipedia, but fine-tuning it on a corpus of the target text (in this case IMDB) produces much better results. The IMDB dataset contains many words, slang terms, and names that don't appear in the Wikipedia dataset.

The process goes:

Tokenization - convert the text into a list of tokens, which can be words, subwords, or characters.
Numericalization - make a list of all unique tokens that appear (the vocabulary) and convert each into a number, its index in that list.
Language model data loader creation - create an independent variable which is the sequence of words from 1 to n-1, and a dependent variable which is the sequence from 2 to n.
Language model creation - build a Recurrent Neural Network (RNN) language model that can handle input sequences of arbitrary length.
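
The first three steps can be sketched on a single sentence in plain Python. This is only a toy illustration: the whitespace tokenizer and the variable names (`token_to_idx`, `ids`) are assumptions made for the example, not what fastai does internally.

```python
# Toy walk-through of the first three steps on one sentence (plain Python, no fastai).
text = "the cat sat on the mat"

# 1. Tokenization: here just a whitespace split.
tokens = text.split()

# 2. Numericalization: vocabulary of unique tokens, in order of first appearance,
#    then each token is replaced by its index in that vocabulary.
vocab = list(dict.fromkeys(tokens))
token_to_idx = {tok: i for i, tok in enumerate(vocab)}
ids = [token_to_idx[tok] for tok in tokens]

# 3. Language-model data: the targets are the inputs shifted by one token.
x = ids[:-1]   # tokens 1 .. n-1 (independent variable)
y = ids[1:]    # tokens 2 .. n   (dependent variable)

print(vocab)
print(ids)
print(x, y)
```

Each element of `y` is the token that follows the corresponding element of `x`, which is exactly what the model is asked to predict.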




## Tokenization

In [4]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)  

In [5]:
files = get_text_files(path, folders = ['train', 'test', 'unsup'])

In [6]:
txt = files[0].open().read(); txt[:75]

'Don\'t mistake "War Inc." for a sharply chiseled satire or a brainy comedy f'

In [7]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))


(#344) ['Do',"n't",'mistake','"','War','Inc.','"','for','a','sharply','chiseled','satire','or','a','brainy','comedy','full','of','inside','jokes','for','news','buffs','.','It',"isn't.<br",'/><br','/>This','is','an'...]


In [8]:
first(spacy(['The U.S. dollar 1.00.']))

(#5) ['The','U.S.','dollar','1.00','.']

In [9]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#383) ['xxbos','xxmaj','do',"n't",'mistake','"','war','xxmaj','inc','.','"','for','a','sharply','chiseled','satire','or','a','brainy','comedy','full','of','inside','jokes','for','news','buffs','.','xxmaj','it','is'...]


The letter pair `xx` is rare in English, so it's used here as a prefix to mark special tokens. `xxbos` marks the start of a new text (BOS: beginning of stream). `xxmaj` means the next word begins with a capital letter. `xxunk` stands for an unknown word.

Similarly, `xxrep` marks a repeated character; it is followed by the number of repetitions and then the character itself.

In [10]:
print(coll_repr(tkn("I like turtles!!!!"), 31))

(#7) ['xxbos','i','like','turtles','xxrep','4','!']


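These special tokens help with model training: information such as capitalization and repetition is kept as separate tokens, so the model can recognize the important parts of a sentence without needing a separate vocabulary entry for every variant of a word.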
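
To make the special tokens concrete, here is a rough imitation of two of the rules in plain Python. The function names (`mark_repeats`, `mark_capitals`, `toy_tokenize`) are hypothetical and the logic is a simplification; fastai's own rules handle more cases (all-caps words, repeated words, HTML, punctuation spacing).

```python
import re

def mark_repeats(text):
    """Replace a character repeated 3 or more times with ' xxrep <count> <char> '."""
    def _replace(match):
        char = match.group(1)
        count = len(match.group(0))
        return f" xxrep {count} {char} "
    return re.sub(r"(\S)\1{2,}", _replace, text)

def mark_capitals(text):
    """Lowercase every word; put 'xxmaj' before capitalised words longer than one letter."""
    words = []
    for word in text.split():
        if word[0].isupper() and len(word) > 1:
            words.append("xxmaj")
        words.append(word.lower())
    return words

def toy_tokenize(text):
    """Hypothetical mini-tokenizer: start-of-text marker, repeats, then capitals."""
    return ["xxbos"] + mark_capitals(mark_repeats(text))

print(toy_tokenize("I like turtles!!!!"))
print(toy_tokenize("The film was sooooo long"))
```

The vocabulary only needs entries for `!` and `so`-style base forms plus the markers, instead of one entry per spelling variant.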

In [10]:
# See the rules
defaults.text_proc_rules

[<function fastai.text.core.fix_html(x)>,
 <function fastai.text.core.replace_rep(t)>,
 <function fastai.text.core.replace_wrep(t)>,
 <function fastai.text.core.spec_add_spaces(t)>,
 <function fastai.text.core.rm_useless_spaces(t)>,
 <function fastai.text.core.replace_all_caps(t)>,
 <function fastai.text.core.replace_maj(t)>,
 <function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]

In [11]:
coll_repr(tkn('©   Fast.ai www.fast.ai/INDEX'), 31)

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"

## Subword tokenization

Another approach to tokenization is to split text into subwords instead of full words. Subword tokenization is useful for languages like Chinese and Japanese, which don't necessarily use spaces or have the same notion of a word as English. Similarly, Turkish, Hungarian, and German can join many subwords together without spaces to make new words.

It works in two steps:

Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocabulary.
Then, tokenize the corpus using this vocabulary of 'subword units'.
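
The two steps can be sketched with a deliberately simple stand-in algorithm: count frequent character groups, then split each word greedily by longest match. This is not the unigram model that SentencePiece trains; the function names and the `max_len` parameter are assumptions made for the illustration.

```python
from collections import Counter

def build_subword_vocab(corpus, vocab_size, max_len=4):
    """Step 1: single characters plus the most frequent multi-character groups."""
    words = " ".join(corpus).split()
    chars = sorted({ch for word in words for ch in word})
    counts = Counter(
        word[i:i + n]
        for word in words
        for n in range(2, max_len + 1)
        for i in range(len(word) - n + 1)
    )
    n_extra = max(vocab_size - len(chars), 0)
    frequent = [piece for piece, _ in counts.most_common(n_extra)]
    return set(chars) | set(frequent)

def subword_tokenize(word, vocab):
    """Step 2: greedy longest-match split of one word into pieces from vocab."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest piece first
            if word[i:j] in vocab or j == i + 1:  # fall back to a single character
                pieces.append(word[i:j])
                i = j
                break
    return pieces

corpus = ["low lower lowest", "new newer newest"]
small = build_subword_vocab(corpus, vocab_size=0)    # characters only
large = build_subword_vocab(corpus, vocab_size=30)   # room for frequent groups
print(subword_tokenize("lowest", small))
print(subword_tokenize("lowest", large))
```

As with `subword(200)` versus `subword(10000)` in the cells that follow, a small vocabulary pushes the tokenizer toward single characters, while a larger one lets whole fragments or words become single tokens.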

In [11]:
txts = L(o.open().read() for o in files[:1000])

In [13]:
len(txts)

1000

In [12]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])
     

In [15]:
subword(1000)

sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=tmp/texts.out --vocab_size=1000 --model_prefix=tmp/spm --character_coverage=0.99999 --model_type=unigram --unk_id=9 --pad_id=-1 --bos_id=-1 --eos_id=-1 --minloglevel=2 --user_defined_symbols=▁xxunk,▁xxpad,▁xxbos,▁xxeos,▁xxfld,▁xxrep,▁xxwrep,▁xxup,▁xxmaj --hard_vocab_limit=false


'▁Do n \' t ▁mi s t ake ▁" W ar ▁I n c . " ▁for ▁a ▁sh ar p ly ▁ch is el ed ▁s at i re ▁or ▁a ▁ br ain y ▁comedy ▁full ▁of ▁in'

In [16]:
subword(200)

'▁ D on \' t ▁ m i s t a k e ▁ " W ar ▁I n c . " ▁for ▁a ▁ s h ar p ly ▁ ch i s e l ed ▁ s a'

In [17]:
subword(10000)

'▁Don \' t ▁mistake ▁" W ar ▁In c . " ▁for ▁a ▁sharply ▁ch ise led ▁satire ▁or ▁a ▁br ain y ▁comedy ▁full ▁of ▁inside ▁jokes ▁for ▁new s ▁buffs . ▁It ▁isn \' t . < br'

## Numericalization with fastai
Numericalization maps the tokens to integers.
As with a categorical variable, make a list of all possible levels (the vocabulary) and replace each level with its index in that list.



In [13]:
toks = tkn(txt)
print(coll_repr(toks, 31))

(#383) ['xxbos','xxmaj','do',"n't",'mistake','"','war','xxmaj','inc','.','"','for','a','sharply','chiseled','satire','or','a','brainy','comedy','full','of','inside','jokes','for','news','buffs','.','xxmaj','it','is'...]


In [14]:
# Using a smaller subset for the purpose of the lesson.
toks200 = txts[:200].map(tkn)
toks200[0]

(#383) ['xxbos','xxmaj','do',"n't",'mistake','"','war','xxmaj','inc','.'...]

In [15]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,50)

'(#2280) [\'xxunk\',\'xxpad\',\'xxbos\',\'xxeos\',\'xxfld\',\'xxrep\',\'xxwrep\',\'xxup\',\'xxmaj\',\'the\',\',\',\'.\',\'and\',\'a\',\'of\',\'to\',\'is\',\'in\',\'it\',\'i\',\'that\',"\'s",\'this\',\'as\',\'"\',\'\\n\\n\',\'with\',\'was\',\'for\',\'film\',\'but\',\'-\',\'you\',\'he\',\'movie\',\'on\',\'his\',\')\',\'are\',\'(\',\'not\',\'have\',"n\'t",\'one\',\'be\',\'who\',\'at\',\'all\',\'from\',"\'"...]'

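The defaults for Numericalize() are min_freq=3 and max_vocab=60000. Any words beyond the 60K most common, or appearing fewer than 3 times, are replaced with xxunk (unknown). That is why the vocabulary built from these 200 reviews has only 2,280 entries.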
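
A minimal sketch of how such a vocabulary could be built, assuming the behaviour described above: tokens are ranked by frequency, rare ones are dropped, and anything missing maps to `xxunk`. The helper names `build_vocab` and `numericalize` are hypothetical, and only a subset of the special tokens is included.

```python
from collections import Counter

SPECIAL = ["xxunk", "xxpad", "xxbos"]   # subset of the special tokens, for illustration

def build_vocab(token_lists, min_freq=3, max_vocab=60000):
    """Special tokens first, then the most frequent tokens seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, n in counts.most_common(max_vocab)
                if n >= min_freq and tok not in SPECIAL]
    return SPECIAL + frequent

def numericalize(tokens, vocab):
    """Replace each token by its vocabulary index; unknown tokens become xxunk (0)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    unk = index["xxunk"]
    return [index.get(tok, unk) for tok in tokens]

docs = [["xxbos", "a", "good", "film"],
        ["xxbos", "a", "bad", "film"],
        ["xxbos", "a", "film", "a", "film"]]
vocab = build_vocab(docs, min_freq=3)
ids = numericalize(["xxbos", "a", "good", "film"], vocab)
print(vocab)
print(ids)
```

Here "good" appears only once, so it falls below `min_freq` and is mapped to index 0, the same way rare words such as "sharply" show up as `xxunk` in the notebook output.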

In [16]:
nums = num(toks)[:20]; nums

TensorText([   2,    8,   71,   42, 1307,   24,  287,    8,    0,   11,   24,   28,   13,    0,    0, 1672,   64,   13,    0,  226])

In [22]:
' '.join(num.vocab[o] for o in nums)

'xxbos xxmaj do n\'t mistake " war xxmaj xxunk . " for a xxunk xxunk satire or a xxunk comedy'

### Putting Our Texts into Batches for a Language Model


In [23]:
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs,seq_len = 6,15
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
xxbos,xxmaj,in,this,chapter,",",we,will,go,back,over,the,example,of,classifying
movie,reviews,we,studied,in,chapter,1,and,dig,deeper,under,the,surface,.,xxmaj
first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into,numbers,and
how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have,another,example
of,the,preprocessor,used,in,the,data,block,xxup,api,.,\n,xxmaj,then,we
will,study,how,we,build,a,language,model,and,train,it,for,a,while,.


In [24]:
#hide_input
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
xxbos,xxmaj,in,this,chapter
movie,reviews,we,studied,in
first,we,will,look,at
how,to,customize,it,.
of,the,preprocessor,used,in
will,study,how,we,build


In [25]:

#hide_input
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
",",we,will,go,back
chapter,1,and,dig,deeper
the,processing,steps,necessary,to
xxmaj,by,doing,this,","
the,data,block,xxup,api
a,language,model,and,train


In [26]:
#hide_input
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
over,the,example,of,classifying
under,the,surface,.,xxmaj
convert,text,into,numbers,and
we,'ll,have,another,example
.,\n,xxmaj,then,we
it,for,a,while,.


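The idea is to concatenate all the documents into one long stream of tokens and slice it into `bs` mini-streams, one per row of the batch. Within each mini-stream the tokens stay in order, and the model reads the batches in sequence, so each row of a batch continues from the same row of the previous batch. The shuffling happens at the document level: the documents are shuffled at the start of each epoch, before they are concatenated.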
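
The slicing shown in the tables above can be sketched as follows. This is a simplified illustration under stated assumptions: `lm_batches` is a hypothetical helper, and fastai's `LMDataLoader` additionally shuffles the documents each epoch and works on tensors rather than lists.

```python
def lm_batches(stream, bs, seq_len):
    """Cut one token stream into bs mini-streams, then return (x, y) batches of width seq_len."""
    # Tokens per mini-stream; one token is held back so the last target exists.
    stream_len = (len(stream) - 1) // bs
    batches = []
    for start in range(0, stream_len, seq_len):
        width = min(seq_len, stream_len - start)
        x, y = [], []
        for row in range(bs):
            offset = row * stream_len + start
            x.append(stream[offset:offset + width])
            y.append(stream[offset + 1:offset + width + 1])   # same text, shifted by one
        batches.append((x, y))
    return batches

# Integers stand in for numericalized tokens so the ordering is easy to follow.
stream = list(range(25))
batches = lm_batches(stream, bs=3, seq_len=4)
for x, y in batches:
    print(x)
    print(y)
```

Row 0 of the second batch starts with token 4, directly after row 0 of the first batch ended with token 3, which is what lets a recurrent model carry its state from one batch to the next.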


In [17]:
nums200 = toks200.map(num)

In [18]:
dl = LMDataLoader(nums200)

In [19]:
x,y = first(dl)
x.shape, y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

In [30]:
' '.join(num.vocab[o] for o in x[0][:20])

'xxbos xxmaj do n\'t mistake " war xxmaj xxunk . " for a xxunk xxunk satire or a xxunk comedy'

In [31]:
# The dependent variable is the same text off by one
' '.join(num.vocab[o] for o in y[0][:20])

'xxmaj do n\'t mistake " war xxmaj xxunk . " for a xxunk xxunk satire or a xxunk comedy full'

## Training a Text Classifier

Training a text classifier with transfer learning takes two steps: first fine-tune the language model, pretrained on Wikipedia, on the IMDB review dataset. Then use that model to train a classifier.

First, prepare the data for the language model using DataBlock().

In [20]:
# Here's how we use TextBlock to create the DataLoaders for a language model, using fastai's defaults:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=64, seq_len=80, device=torch.device('cuda:0'))  # bs 64, seq len 80

In [21]:
# Now show how it has sliced the independent and dependent variables
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos a fascinating tale of lust , jealousy and mourning . xxmaj well acted and skillfully written , it shows why xxmaj british cinema is best at giving us a view of the dark side of human nature . \n\n xxmaj this film is not for the squeamish , and those of a delicate stomach should close their eyes at the first sound of "" anyone xxmaj who xxmaj had a heart "" , and not open them until it","a fascinating tale of lust , jealousy and mourning . xxmaj well acted and skillfully written , it shows why xxmaj british cinema is best at giving us a view of the dark side of human nature . \n\n xxmaj this film is not for the squeamish , and those of a delicate stomach should close their eyes at the first sound of "" anyone xxmaj who xxmaj had a heart "" , and not open them until it ends"
1,first foray into post - nuke sci - fi / action cinema remains to this very day his single most novel and idiosyncratic entry in that sub - genre . xxmaj it 's a wickedly wacked - out black comic tongue - in - cheek end - of - the - world oddity which fuses vintage 40 's film noir conventions -- morally upright gumshoes with a strong personal code of honor that 's constantly being challenged by every twisted,foray into post - nuke sci - fi / action cinema remains to this very day his single most novel and idiosyncratic entry in that sub - genre . xxmaj it 's a wickedly wacked - out black comic tongue - in - cheek end - of - the - world oddity which fuses vintage 40 's film noir conventions -- morally upright gumshoes with a strong personal code of honor that 's constantly being challenged by every twisted turn


Now that our data is ready, we can fine-tune the pretrained language model.


## Fine-Tuning the Language Model

To turn the word indices into activations that the neural network can use, convert them into embeddings: each index selects one row of a learned weight matrix.
Feed the embeddings into an RNN, using the AWD-LSTM architecture.
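
An embedding is just a lookup into a weight matrix, which the following plain-Python sketch illustrates. The sizes and the random initial values are assumptions for the example; in the real model the matrix is a tensor whose values are learned during training.

```python
import random

random.seed(0)
vocab_size, emb_dim = 6, 3   # illustrative sizes, far smaller than a real model

# One row of emb_dim numbers per vocabulary entry.
embedding = [[random.uniform(-1, 1) for _ in range(emb_dim)]
             for _ in range(vocab_size)]

token_ids = [2, 5, 2]                                # a numericalized sequence
activations = [embedding[i] for i in token_ids]      # one vector per token

for vector in activations:
    print(vector)
```

The same token index always produces the same vector, so the first and third rows printed here are identical.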

In [22]:

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()

Perplexity is a metric often used with language models. It is the exponential of the loss (`torch.exp(cross_entropy)`). The accuracy metric shows the fraction of times the model predicted the next word correctly.
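
Both metrics are easy to reproduce by hand. The sketch below uses a validation loss of 3.913044, the value reported for the first epoch in this notebook; the `predicted` and `actual` token ids are made up for the illustration.

```python
import math

# Perplexity is the exponential of the cross-entropy loss.
valid_loss = 3.913044
perplexity = math.exp(valid_loss)
print(round(perplexity, 2))   # roughly 50.05

# Accuracy is the fraction of positions where the predicted next token is correct.
predicted = [5, 9, 2, 7]   # hypothetical predicted token ids
actual = [5, 1, 2, 3]      # hypothetical true next-token ids
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
print(accuracy)
```

A perplexity of about 50 can be read loosely as the model being as uncertain as if it were choosing uniformly among 50 words at each step.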

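So far, we've taken the language model pretrained on WikiText and prepared the DataLoaders and Learner for training it on IMDB. It's now ready for fine-tuning.
Since training takes a long time and fine_tune() doesn't save intermediate results, use fit_one_cycle() instead and save after each stage. language_model_learner() automatically calls freeze() when using a pretrained model, so this first epoch trains only the embeddings.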

In [35]:
torch.cuda.empty_cache()
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.03365,3.913044,0.300011,50.051064,21:24


In [29]:
# Now save this first epoch.  It is saved to learn.path/models/
learn.save('1epoch')

Path('/home/mendhak/.fastai/data/imdb/models/1epoch.pth')

In [30]:
# Load the saved model. 
learn = learn.load('1epoch')

In [31]:
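# Save the encoder (the model without its final next-word prediction layer) for use in the classifier.
learn.save_encoder('finetuned')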

In [34]:
TEXT = "This movie was awful."
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]   


In [35]:
print("\n".join(preds))

This movie was awful . The story centers around Ally Sheedy , a neighbor who is writing the commercials for a movie about the lives of two people . She has not completely forgotten how to make a movie like this
This movie was awful . It was boring - boring , and the plot was very predictable . The acting was n't that good , but i did n't think it would be bad . i liked it . The actors were


In [28]:
# Unfreeze the model and run another cycle. 
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.767079,3.755556,0.317983,42.75798,23:13
1,3.714692,3.716784,0.322324,41.131912,22:31
2,3.659372,3.674558,0.327532,39.43124,22:53


KeyboardInterrupt: 

I didn't run all 10 epochs; it was taking too long. I ran it for a few epochs, then loaded the saved model and did a prediction, which was... uh... decent.

## Creating classifier dataloaders
A language model predicts the next word in a sequence. It needs no external labels.
A classifier predicts an external label. In the case of IMDB, that label is the sentiment of the review.