<a href="https://colab.research.google.com/github/hlf401/nlpbook/blob/main/ch02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers and Transfer Learning

Now that you've been introduced to the field of natural language processing, there's something important you need to understand. It's not actually a very long journey from where you start to the state of the art.

Eventually, we _will_ return to the basics, discuss the fundamentals, and understand all the details, of course. But we're going to show you the promised land before we venture on the long and hard journey to get there.
    
One of the most important ideas to implement if you want to get deep learning working in the real world is transfer learning, which is the process of taking a model that has already been trained on another dataset and fine-tuning it to fit your new dataset. For example, if you're training a language model to generate compelling short stories in the style of Hemingway, you could fine-tune a model trained on a wide variety of books instead of training on just the text samples of Hemingway, of which there may not be many.

.Who's That Pokémon? Language Models
> Tip: A language model is a function that takes in a sequence of words and returns a probability distribution over all the possible next words in that sequence. This task is considered one of the most important in NLP because, as the reasoning goes, to predict the next word in a sentence, you **must** have a good understanding of the language. Language models learn the features and characteristics of language in order to guess what the next word should be after any given prior phrase or sentence. Language models are the backbone of NLP today because they do not require explicit annotations (labels) and can be trained on massive corpuses without any material data preparation. Once they learn the properties of language well, language models can be fine-tuned to perform more specific NLP tasks such as text classification, which is exactly what we're going to do in this chapter.
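
To make the definition in the tip concrete, here is a minimal sketch of a toy language model. It is not the kind of model used in this chapter: it only conditions on the single previous word (a bigram model), and the three-sentence corpus is made up for illustration. It does show the core contract, though: text in, probability distribution over next words out.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_distribution(counts, prev_word):
    """Turn the follower counts of `prev_word` into probabilities."""
    followers = counts[prev_word.lower()]
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}

# Hypothetical three-sentence corpus, made up for illustration.
toy_corpus = [
    "the movie was great",
    "the movie was boring",
    "the plot was great",
]
bigram_counts = train_bigram_model(toy_corpus)
dist = next_word_distribution(bigram_counts, "was")
print(dist)
```

Note that no labels were needed: the "answer" for each position is simply the word that comes next in the raw text.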

> Note: When we refer to pretrained models throughout this book, we are generally referring to large, pretrained *language* models that have been trained to perform language modeling on large corpuses.

A nice analogy in object-oriented programming is the concept of inheritance in classes. Suppose we're making some sort of zoo management video game, where each animal is represented by a class. The animals have properties like weight and height, as well as functions like eat and sleep. In theory, we could just create a new class for each animal and replicate those shared functions, but in practice, we usually refactor our code so that we have a superclass for a generic animal and a subclass for each species to avoid duplication in our code, making it easier to read.
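
The zoo example could look something like the sketch below. The class and attribute names are our own invention for illustration; the point is only that shared behaviour is written once in the superclass and reused by every species.

```python
class Animal:
    """Generic animal: shared properties and behaviour live here once."""

    def __init__(self, name, weight_kg, height_m):
        self.name = name
        self.weight_kg = weight_kg
        self.height_m = height_m

    def eat(self):
        return f"{self.name} is eating"

    def sleep(self):
        return f"{self.name} is sleeping"


class Lion(Animal):
    """Species-specific subclass: only adds what is new."""

    def roar(self):
        return f"{self.name} roars"


class Penguin(Animal):
    def swim(self):
        return f"{self.name} is swimming"


simba = Lion("Simba", weight_kg=190, height_m=1.2)
print(simba.eat())   # inherited from Animal
print(simba.roar())  # defined on Lion
```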

By training on the larger dataset, the model essentially inherits a large amount of extra knowledge, which it can use to perform better on the task you care about. From a practical standpoint, transfer learning helps you get better performing models faster since fine-tuning, if done correctly, is often computationally cheaper than training from scratch.footnote:[Assuming that the original dataset you're transferring *from* is much larger than the dataset you're using for fine-tuning. If your fine-tuning dataset is larger, perhaps you should be applying transfer learning the other way around! But in practice, it's very hard to find natural language text datasets that are of comparable size to the ones used for pretraining.]

The other big advancement we'll discuss is the use of a new kind of model architecture called the transformer. Training transformers can be complicated and does not always work well without some fine-tuning. So, instead of training one from scratch, we'll show you the pretraining technique on another architecture, and then use a popular pretrained transformer to perform inference.

For this chapter, it's important that you have your compute environment set up since we'll be training models. Check out our [GitHub page](https://github.com/nlpbook/nlpbook/) for more info on how to do this.

## Training with fastai

The first thing we'll look at is an idea called transfer learning. We're going to fine-tune a language model and then transform it into a text classifier that categorizes text based on sentiment. We'll start with the simplest working implementation, and progressively train our network using the [ULMFiT](https://arxiv.org/abs/1801.06146) technique. This particular example was

The dataset we're going to use here is the IMDB movie review dataset. It's not very fun, but it's simple and small, which is what we want when starting off.

In [1]:
from fastai.text.all import *

### Mount Google Drive

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Symlink to Google Drive (too slow, not practical)

In [None]:
#!ln -s /content/drive/MyDrive/root/.fastai /root/.fastai
#!ls /root/.fastai -ls

0 lrwxrwxrwx 1 root root 35 Jan  9 03:53 /root/.fastai -> /content/drive/MyDrive/root/.fastai


### Using the high-level fastai API

`fastai` is much more than your standard deep learning library. It includes tools that help you solve the problem at hand end-to-end as fast as possible. One of those tools is a built-in set of common datasets that can be easily downloaded.

In [3]:
# Download the dataset. The download may fail (for example, behind a VPN in China). If it does, delete the folders
# C:\Users\Administrator\.fastai\archive and C:\Users\Administrator\.fastai\data, then run this cell again.
# On Colab the data is stored in /root/.fastai/data/
path = untar_data(URLs.IMDB)
path.ls()
print("abc")
#print(path.ls())

abc


This particular instance of the IMDB dataset is organized just like ImageNet is (i.e. one directory per class). So in this case, the positive reviews are saved under `pos` and the negative reviews are saved under `neg`.

We can set up our dataset and prepare for training by using the `TextDataLoaders.from_folder` method built into `fastai`. The only thing we need to specify is the name of the validation folder, which is "test" (and not the default "valid").

In [5]:
(path/'train').ls()

(#4) [Path('/root/.fastai/data/imdb/train/labeledBow.feat'),Path('/root/.fastai/data/imdb/train/unsupBow.feat'),Path('/root/.fastai/data/imdb/train/pos'),Path('/root/.fastai/data/imdb/train/neg')]

### About the dataset
The data follows an ImageNet-style organization: in the train folder, we have two subfolders, pos and neg (for positive reviews and negative reviews).


We can gather it by using the TextDataLoaders.from_folder method. The only thing we need to specify is the name of the validation folder, which is "test" (and not the default "valid").
"test" is the name of the directory that holds the test dataset. If it is not specified, the default is "valid".

Function: TextDataLoaders.from_folder (path, train='train', valid='valid',...)
Description: Create from imagenet style dataset in path with train and valid subfolders (or provide valid_pct)
Reference:
https://docs.fast.ai/text.data.html#textdataloaders.from_folder


### Copy the IMDB dataset created by fastai to Google Drive for inspection

In [None]:
!ls /root/.fastai/data/imdb/
!tar -zcvf /content/drive/MyDrive/root/fastai.tgz /root/.fastai

In [13]:
# When running in a Jupyter notebook on Windows (for example, Win10), you may see the warning
# "Due to IPython and Windows limitation, python multiprocessing isn't available now".
# You can ignore it: the code still works, but slowly (about 30 minutes on the first run).
# See https://forums.fast.ai/t/dataloaders-due-to-ipython-and-windows-limitation-python-multiprocessing-isnt-available-now/93906/2
# See https://christianjmills.com/posts/fastai-book-notes/chapter-11/index.html for the source code behind the warning.
dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
# The tokenized data is stored in C:\Users\Administrator\.fastai\data\imdb_tok



In [None]:
!ls /root/.fastai/data/imdb/
!tar -zcvf /content/drive/MyDrive/root/fastai2.tgz /root/.fastai

We can then have a look at the data with the `show_batch` method:

In [18]:
dls.show_batch()

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos i thought that xxup rotj was clearly the best out of the three xxmaj star xxmaj wars movies . i find it surprising that xxup rotj is considered the weakest installment in the xxmaj trilogy by many who have voted . xxmaj to me it seemed like xxup rotj was the best because it had the most profound plot , the most suspense , surprises , most xxunk the ending ) and definitely the most episodic movie . i personally like the xxmaj empire xxmaj strikes xxmaj back a lot also but i think it is slightly less good than than xxup rotj since it was slower - moving , was not as episodic , and i just did not feel as much suspense or emotion as i did with the third movie . \n\n xxmaj it also seems like to me that after reading these surprising reviews that",pos
2,"xxbos xxrep 3 * xxup spoilers xxrep 3 * xxrep 3 * xxup spoilers xxrep 3 * xxmaj continued … \n\n xxmaj from here on in the whole movie collapses in on itself . xxmaj first we meet a rogue program with the indication we 're gon na get ghosts and vampires and werewolves and the like . xxmaj we get a guy with a retarded accent talking endless garbage , two ' ghosts ' that serve no real purpose and have no character what - so - ever and a bunch of henchmen . xxmaj someone 's told me they 're vampires ( straight out of xxmaj blade 2 ) , but they 're so undefined i did n't realise . \n\n xxmaj the funny accented guy with a ridiculous name suffers the same problem as the xxmaj oracle , only for far longer and far far worse .",neg
3,"xxbos xxmaj i 've rented and watched this movie for the 1st time on xxup dvd without reading any reviews about it . xxmaj so , after 15 minutes of watching xxmaj i 've noticed that something is wrong with this movie ; it 's xxup terrible ! i mean , in the trailers it looked scary and serious ! \n\n i think that xxmaj eli xxmaj roth ( mr . xxmaj director ) thought that if all the characters in this film were stupid , the movie would be funny … ( so stupid , it 's funny … ? xxup wrong ! ) xxmaj he should watch and learn from better horror - comedies such xxunk xxmaj night "" , "" the xxmaj lost xxmaj boys "" and "" the xxmaj return xxmaj of the xxmaj living xxmaj dead "" ! xxmaj those are funny ! \n\n """,neg
4,"xxbos xxup myra xxup breckinridge is one of those rare films that established its place in film history immediately . xxmaj praise for the film was absolutely nonexistent , even from the people involved in making it . xxmaj this film was loathed from day one . xxmaj while every now and then one will come across some maverick who will praise the film on philosophical grounds ( aggressive feminism or the courage to tackle the issue of xxunk ) , the film has not developed a cult following like some notorious flops do . xxmaj it 's not hailed as a misunderstood masterpiece like xxup scarface , or trotted out to be ridiculed as a camp classic like xxup showgirls . \n\n xxmaj undoubtedly the reason is that the film , though outrageously awful , is not lovable , or even likable . xxup myra xxup breckinridge is just",neg
5,"xxbos xxmaj my xxmaj comments for xxup vivah : - xxmaj its a charming , idealistic love story starring xxmaj shahid xxmaj kapoor and xxmaj amrita xxmaj rao . xxmaj the film takes us back to small pleasures like the bride and bridegroom 's families sleeping on the floor , playing games together , their friendly banter and mutual respect . xxmaj vivah is about the sanctity of marriage and the importance of commitment between two individuals . xxmaj yes , the central romance is naively visualized . xxmaj but the sneaked - in romantic moments between the to - be - married couple and their stubborn resistance to modern courtship games makes you crave for the idealism . xxmaj the film predictably concludes with the marriage and the groom , on the wedding night , tells his new bride who suffers from burn injuries : "" come let me",pos
6,"xxbos xxup warning : xxup possible xxup spoilers ( but not really - keep reading ) . a xxrep 3 h , there are so many reasons to become utterly addicted to this spoof gem that i wo n't have room to list them all . xxmaj the opening credits set the playful scene with kitsch late 1950s cartoon stills ; an enchanting xxmaj xxunk ' prez ' xxmaj xxunk mambo theme which appears to be curiously uncredited ( but his grunts are unmistakable , and no - one else did them ) ; and with familiar cast names , including xxmaj kathy xxmaj xxunk a full year before she hit with xxmaj sister xxmaj acts 1 & 2 plus xxmaj teri xxmaj hatcher from tv 's xxmaj superman . \n\n xxmaj every scene is imbued with shallow injustices flung at various actors , actresses and producers in daytime xxup",pos
7,"xxbos xxmaj pier xxmaj paolo xxmaj pasolini , or xxmaj pee - pee - pee as i prefer to call him ( due to his love of showing male genitals ) , is perhaps xxup the most overrated xxmaj european xxmaj marxist director - and they are thick on the ground . xxmaj how anyone can see "" art "" in this messy , cheap sex - romp concoction is beyond me . xxmaj some of the "" stories "" here could have come straight out of a soft - core porn film , and i am not even so much referring to the nudity but the simplistic and banal , often pointless stories . xxmaj anyone who enjoyed this relatively watchable but dumb oddity should really sink his teeth into the "" der xxmaj xxunk "" soft - porn xxmaj german 70s movie series , because that 's what",neg
8,"xxbos xxmaj well , on it 's credit side ( if it can be said to have one ) , xxmaj timothy xxmaj hines xxup did manage to capture the original setting of xxup h.g . xxmaj wells ' outstanding novella . xxmaj but other than that - well , to call a spade a spade - it sucks bigtime . xxmaj what the xxmaj master xxmaj ed xxmaj wood could have done with the alleged $ 20 million dollar budget ! xxmaj timothy xxmaj hines really does make xxmaj mr . xxmaj wood , who was a flawed genius anyway , look like the best filmmaker of all time . xxmaj the special effects ( i guess you 'd call them that ) are not even up to computer game standards . xxmaj the acting is , well , perhaps about dinner theater comparable , and the accents are",neg


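Questions:
*   Where does show_batch() take the displayed data from?
*   Where is the number of rows (9) defined?
*   The output differs on every run, so the rows appear to be picked at random from the training set.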



We can see that the library automatically processed all the texts to split them into *tokens*. During tokenization it also adds some special tokens of its own, like:

- `xxbos` to indicate the beginning of a text
- `xxmaj` to indicate the next word was capitalized
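
To get a feel for what these markers do, here is a deliberately simplified imitation of the rules. This is not fastai's tokenizer: it splits on whitespace only, and the `xxup` rule (for fully upper-case words) is inferred from the batch output above rather than from the library's documentation.

```python
def mark_tokens(text):
    """Very simplified imitation of the special-token rules described above."""
    tokens = ["xxbos"]  # beginning of the text
    for word in text.split():
        if len(word) > 1 and word.isupper():
            tokens.append("xxup")   # the whole word was upper case
        elif len(word) > 1 and word[0].isupper():
            tokens.append("xxmaj")  # the word started with a capital letter
        tokens.append(word.lower())
    return tokens

print(mark_tokens("That Movie was GREAT"))
```

The design idea is that the vocabulary only needs one lower-case entry per word, while the information about capitalization is preserved in a separate token the model can still learn from.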

`fastai` uses an object called a `Learner` for doing pretty much everything. We can construct one for text classification in one line of code:

In [19]:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

  wgts = torch.load(wgts_fname, map_location = lambda storage,loc: storage)


Instead of the transformer model that we've been raving about (and will continue to discuss throughout the vast majority of the book), we're going to use the [AWD LSTM](https://arxiv.org/abs/1708.02182) architecture for now, since it's easier and faster to train.

There are a few other details: `drop_mult` is a parameter that controls the magnitude of all dropouts in that model, and we use `accuracy` to track how well we are doing. But you don't need to worry too much about hyperparameters just yet.

With the `Learner` defined, we can now fine-tune our pretrained model, using a method with an unsurprising name:

In [20]:
learn.fine_tune(4, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.457851,0.408969,0.81664,03:25


epoch,train_loss,valid_loss,accuracy,time
0,0.314508,0.267143,0.89232,06:59
1,0.248217,0.199403,0.92232,07:00
2,0.194687,0.201378,0.92124,06:59
3,0.141802,0.185537,0.93216,06:59


In [21]:
learn.fine_tune(4, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.15061,0.201975,0.93004,03:35


epoch,train_loss,valid_loss,accuracy,time
0,0.178695,0.204791,0.9214,07:01
1,0.168506,0.18203,0.9302,06:59
2,0.111449,0.197089,0.9312,06:59
3,0.074513,0.221328,0.93072,06:59


93% accuracy looks good! But let's see how well it's actually doing...

In [24]:
learn.show_results()

Unnamed: 0,text,category,category_
0,"xxbos xxmaj there 's a sign on xxmaj the xxmaj lost xxmaj highway that says : \n\n * major xxup spoilers xxup ahead * \n\n ( but you already knew that , did n't you ? ) \n\n xxmaj since there 's a great deal of people that apparently did not get the point of this movie , xxmaj i 'd like to contribute my interpretation of why the plot makes perfect sense . xxmaj as others have pointed out , one single viewing of this movie is not sufficient . xxmaj if you have the xxup dvd of xxup md , you can "" cheat "" by looking at xxmaj david xxmaj lynch 's "" top 10 xxmaj hints to xxmaj unlocking xxup md "" ( but only upon second or third viewing , please . ) ;) \n\n xxmaj first of all , xxmaj mulholland xxmaj drive is",pos,pos
1,"xxbos ( some spoilers included :) \n\n xxmaj although , many commentators have called this film surreal , the term fits poorly here . xxmaj to quote from xxmaj encyclopedia xxmaj xxunk 's , surreal means : \n\n "" fantastic or incongruous imagery "" : xxmaj one need n't explain to the unimaginative how many ways a plucky ten - year - old boy at large and seeking his fortune in the driver 's seat of a red xxmaj mustang could be fantastic : those curious might read xxmaj james xxmaj kincaid ; but if you asked said lad how he were incongruous behind the wheel of a sports car , he 'd surely protest , "" no way ! "" xxmaj what fantasies and incongruities the film offers mostly appear within the first fifteen minutes . xxmaj thereafter we get more iterations of the same , in an ever",pos,neg
2,"xxbos xxmaj tony xxmaj hawk 's xxmaj pro xxmaj skater 2x , is n't much different at all from the previous games ( excluding xxmaj tony xxmaj hawk 3 ) . xxmaj the only thing new that is featured in xxmaj tony xxmaj hawk 's xxmaj pro xxmaj skater 2x , is the new selection of levels , and tweaked out graphics . xxmaj tony xxmaj hawk 's xxmaj pro xxmaj skater 2x offers a new career mode , and that is the 2x career . xxmaj the 2x career is basically xxmaj tony xxmaj hawk 1 career , because there is only about five challenges per level . xxmaj if you missed xxmaj tony xxmaj hawk 1 and 2 , i suggest that you buy xxmaj tony xxmaj hawk 's xxmaj pro xxmaj skater 2x , but if you have played the first two games , you should still",pos,pos
3,"xxbos xxmaj warner xxmaj brothers tampered considerably with xxmaj american history in "" big xxmaj trail "" director xxmaj raoul xxmaj walsh 's first - rate western "" they xxmaj died with xxmaj their xxmaj boots xxmaj on , "" a somewhat inaccurate but wholly exhilarating biography of cavalry officer xxmaj george xxmaj armstrong xxmaj custer . xxmaj the film chronicles xxmaj custer from the moment that he arrives at xxmaj west xxmaj point xxmaj academy until the xxmaj indians massacre him at the xxmaj little xxmaj big xxmaj horn . xxmaj this is one of xxmaj errol xxmaj flynn 's signature roles and one of xxmaj raoul xxmaj walsh 's greatest epics . xxmaj walsh and xxmaj flynn teamed in quite often afterward , and "" they xxmaj died with xxmaj their xxmaj boots xxmaj on "" reunited xxmaj olivia de xxmaj havilland as xxmaj flynn 's romantic interest",pos,pos
4,"xxbos xxmaj may 2nd : someone clicked 11 nos , and then proceeded to do 15 more on my previous 15 comments : almost as funny as this turkey ! \n\n xxmaj may 1st : \n\n xxmaj as i write this , xxmaj i 'm still very much under the impression of what must be the funniest thriller xxmaj i 've ever seen . xxmaj i 've got a major case of the giggles , but xxmaj i 'll try and calm down . ( it 's kind of hard to write when your nose spills snot and the mouth ejects sporadic drool onto the keyboard . ) \n\n a pair of young women who just returned from a vacation take a ride on a shuttle bus . a couple of young guys join them . xxmaj but the bus is n't really a taxi service : it 's a",neg,neg
5,"xxbos xxmaj this is , per se , an above average film but why in the name of xxmaj bog was it made ? xxmaj it 's impossible to treat it as a thing unto itself because it is an almost shot - for - shot remake of an xxmaj alfred xxmaj hitchcock classic of 1960 . xxmaj you ca n't watch it without the 1960 film nudging into your consciousness . \n\n xxmaj what does the word "" credit "" mean ? xxmaj how can we credit xxmaj van xxmaj xxunk and his associates with anything except deciding to use different actors , slightly different sets , and color ? \n\n xxmaj anne xxmaj heche is attractive but lacks xxmaj janet xxmaj leigh 's stolid determination to become a respectable middle - class woman . xxmaj and xxmaj heche is younger than xxmaj leigh , who brought to her",neg,neg
6,"xxbos by xxmaj dane xxmaj youssef \n\n i was kind of looking forward to this one . i enjoy xxmaj eddie xxmaj murphy and i love it when a star hand - makes a vehicle for themselves or when someone who writes decides to mark their own directorial debut . xxmaj but when the star 's head gets too big for the rest of his body , there 's always a danger of a big - budgeted xxmaj hollywood vanity production . \n\n xxmaj will the filmmaker keep it real ▁ or will he just waste amounts of money ( the studio 's , ours ) and time ( the studio 's , ours & his own ) patting himself on the back for an hour in a half ? xxmaj sadly , it 's the latter here . \n\n xxmaj another thing i really like is when someone breathes",neg,neg
7,"xxbos i do not think i am alone when i say that 2005 has not been particularly kind to the horror genre . xxmaj while "" cursed "" , "" hide and xxmaj seek "" , "" the xxmaj ring xxmaj two "" , and "" the xxmaj amityville xxmaj horror "" all showed glimpses of interest and potential , there have been more misses than hits . xxmaj for proof , see : "" white xxmaj noise "" , "" boogeyman "" , "" the xxmaj jacket "" , "" mindhunters "" , and "" alone in the xxmaj dark "" . xxmaj imagine my surprise when "" house of xxmaj wax "" , tightly written by siblings xxmaj chad and xxmaj carey xxmaj hayes , turned out to be … well , a surprise . \n\n xxmaj carly xxmaj jones ( elisha xxmaj cuthbert ) is a young",pos,pos
8,"xxbos xxmaj why would xxmaj burt xxmaj lancaster allow himself to play a poor schnook who is ultimately undermined by femme fatale xxmaj anna xxmaj dundee , played by xxmaj yvonne decarlo in ' criss xxmaj cross ' ? xxmaj the same reason why xxmaj robert xxmaj mitchum allows himself to be cast as another loser who falls for femme fatale xxmaj faith xxmaj domergue in the 1950 noir , "" where xxmaj danger xxmaj lives "" . xxmaj perhaps they both felt it was a good way to show that they had ' range ' as xxunk playing against type , the usual ' tough - guy ' role they were known for , would enhance their image as actors who could play any role . xxmaj but the problem was that roles like xxmaj steve xxmaj thompson , the pathetic love - sick milquetoast in ' criss xxmaj",neg,neg


**HLF COMMENTS:**

The output differs on every run; a different set of texts is classified each time.

We can also run prediction on individual sentences one at a time:

In [None]:
!ls /root/.fastai/data/imdb/
!tar -zcvf /content/drive/MyDrive/root/fastai3.tgz /root/.fastai

In [26]:
learn.predict("That movie was wicked cool!")

('pos', tensor(1), tensor([0.0459, 0.9541]))

In [27]:
learn.predict("That movie was stupid!")

('neg', tensor(0), tensor([0.9976, 0.0024]))

In [28]:
learn.predict("That movie was very good!")

('pos', tensor(1), tensor([0.0037, 0.9963]))

Here we can see the model has considered the review to be positive. The second part of the result is the index of "pos" in our data vocabulary and the last part is the probabilities attributed to each class (99.63% for "pos" and 0.37% for "neg").
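
The three parts of the result are related in a simple way, which the plain-Python sketch below tries to spell out. The numbers are copied from the last prediction above; the class order `["neg", "pos"]` is an inference from the indices shown in the outputs (`tensor(0)` for "neg", `tensor(1)` for "pos").

```python
# Values copied from the last prediction above; the class order
# ["neg", "pos"] is inferred from the indices in the outputs.
class_names = ["neg", "pos"]
probabilities = [0.0037, 0.9963]

# The predicted index is the position of the largest probability.
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted_label = class_names[predicted_index]

for name, p in zip(class_names, probabilities):
    print(f"{name}: {p:.2%}")
print(predicted_label, predicted_index)
```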

#### Building a Dataset with fastai's DataBlock API

We can also use the `fastai` data block API to get our data in a `DataLoaders`. This is a bit more advanced, so feel free to skip this part if you are not comfortable with `fastai` just yet. This approach will give us the same results in the end.

A datablock is built by giving the fastai library a bunch of information:

- the types used, through an argument called `blocks`: here we have texts and categories, so we pass `TextBlock` and `CategoryBlock`. To inform the library our texts are files in a folder, we use the `from_folder` class method.
- how to get the raw items, here our function `get_text_files`.
- how to label those items, here with the parent folder.
- how to split those items, here with the grandparent folder.
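
The last two bullets can be sketched with nothing but `pathlib`. The helper names and file names below are hypothetical and only mimic the idea behind labelling by parent folder and splitting by grandparent folder; they are not the library's implementation.

```python
from pathlib import PurePosixPath

def label_from_parent(path):
    """Label = name of the folder that directly contains the file."""
    return PurePosixPath(path).parent.name

def split_from_grandparent(path, train_name="train", valid_name="test"):
    """Assign the file to a split based on the folder two levels up."""
    grandparent = PurePosixPath(path).parent.parent.name
    if grandparent == train_name:
        return "train"
    if grandparent == valid_name:
        return "valid"
    return None  # the file belongs to neither split

# Hypothetical file names; only the folder layout matches the dataset.
files = [
    "imdb/train/pos/0_9.txt",
    "imdb/train/neg/1_2.txt",
    "imdb/test/pos/5_8.txt",
]
for f in files:
    print(f, label_from_parent(f), split_from_grandparent(f))
```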

In [None]:
imdb = DataBlock(blocks=(TextBlock.from_folder(path), CategoryBlock),
                 get_items=get_text_files,
                 get_y=parent_label,
                 splitter=GrandparentSplitter(valid_name='test'))

This only gives a blueprint on how to assemble the data. To actually create it, we need to use the `dataloaders` method:

In [None]:
dls = imdb.dataloaders(path)

### ULMFiT for Transfer Learning

The pretrained model we used in the previous section is called a language model. It was trained to guess the next word on a set of Wikipedia articles after reading all the words before. We got great results by directly fine-tuning this language model to a movie review classifier, but with one extra step, we can do even better.

The Wikipedia English is slightly different from the IMDb English. So instead of jumping directly to the classifier, we could fine-tune our pretrained language model to the IMDb dataset and then use *that* as the base for our classifier instead of the Wikipedia language model.

This intuitively makes sense - if you, as a literate human being, get some context on what movie reviews generally sound like, you'd probably do a better job of classifying them. It's kind of like getting the passage to read a few days in advance before you take the SAT. Only here, we won't call the language model out for cheating, since we're friends footnote:[See, I said it right here. Please don't eat me, robot overlords in the future.].

But beyond that, another very important reason this is useful is that we often have more unlabelled data than *labelled* data. Labelling is expensive and generally requires human time and effort, so it's not uncommon to have a large database of text records where only a small subset of them are used for, say, document tagging. But with this fine-tuning approach, we can still use the unlabelled data to fine-tune the *language model* even before we train the classifier.

At the risk of dragging on a flawed analogy, this is almost like getting access to years of previous SAT passages. None of them will show up on the test *exactly*, but practicing them will help get a sense of what the SAT is like.

This approach is called ULMFiT, introduced by Jeremy Howard footnote:[Who also happens to be the creator of fastai!] and Sebastian Ruder in 2018. The process is summarized in [[ulmfit]].

![ULMFit](https://github.com/hlf401/nlpbook/blob/main/images/ulmfit.png?raw=1)

Arrows and circles make everything so much simpler, don't they?

Since we already have the pretrained Wikipedia language model, we can start with step 2 of the pipeline in [[ulmfit]] - fine-tuning the IMDB language model.

Addition:

But there is another very practical reason, which is that you get even better results if you fine tune the (sequence-based) language model prior to fine tuning the classification model. For instance, in the IMDb sentiment analysis task, the dataset includes 50,000 additional movie reviews that do not have any positive or negative labels attached in the unsup folder. We can use all of these reviews to fine tune the pretrained language model — this will result in a language model that is particularly good at predicting the next word of a movie review. In contrast, the pretrained model was trained only on Wikipedia articles.

The unsup folder holds the unlabelled texts. We can use them for predicting the next word of a movie review.


### Fine-tuning a language model on IMDb

We can get our texts in a `DataLoaders` suitable for language modeling very easily:

In [None]:
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)

We need to pass something for `valid_pct` otherwise this method will try to split the data by using the grandparent folder names. By passing `valid_pct=0.1`, we tell it to get a random 10% of those reviews for the validation set.

We can have a look at our data using `show_batch`. Here the task is to guess the next word, so we can see the targets have all shifted one word to the right.

Here the goal is to predict the following words, not to classify.
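
The "shifted by one word" relationship between inputs and targets can be shown on a short token list. This is only a sketch with made-up tokens; the real batches are long streams of token ids, but the offset is the same idea.

```python
tokens = ["xxbos", "i", "liked", "this", "movie", "because"]

# Inputs are all tokens but the last; targets are all tokens but the first.
inputs = tokens[:-1]
targets = tokens[1:]

# At every position, the target is simply the token that comes next.
for x, y in zip(inputs, targets):
    print(f"{x!r} -> {y!r}")
```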

**HLF COMMENTS:**


https://docs.fast.ai/text.data.html#textdataloaders.from_folder

textdataloaders.from_folder()


If valid_pct is provided, a random split is performed (with an optional seed) by setting aside that percentage of the data for the validation set (instead of looking at the grandparents folder).
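
A random split of this kind might be sketched as follows. The function name, the stand-in review names, and the seed value are assumptions for illustration; the library's own splitter may differ in detail.

```python
import random

def random_split(items, valid_pct=0.1, seed=None):
    """Shuffle indices and set aside `valid_pct` of them for validation."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    indices = list(range(len(items)))
    rng.shuffle(indices)
    cut = int(valid_pct * len(items))
    valid = [items[i] for i in indices[:cut]]
    train = [items[i] for i in indices[cut:]]
    return train, valid

reviews = [f"review_{i}" for i in range(100)]  # hypothetical stand-ins
train_items, valid_items = random_split(reviews, valid_pct=0.1, seed=42)
print(len(train_items), len(valid_items))
```

Because the split ignores folder names entirely, it works even when all the texts sit in folders that were never meant to separate training from validation data.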


In [None]:
dls_lm.show_batch(max_n=5)

Then we have a convenience method to directly grab a `Learner` from it, using the `AWD_LSTM` architecture like before. We use accuracy and perplexity as metrics (the latter is the exponential of the loss) and we set a default weight decay of 0.1. `to_fp16` puts the `Learner` in mixed precision, which is going to help speed up training on GPUs that have Tensor Cores.
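
Since perplexity is just the exponential of the loss, it is easy to compute by hand. The sketch below assumes the loss is an average cross-entropy measured with the natural logarithm; the helper name is ours.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is exp(loss), with the loss measured in nats."""
    return math.exp(cross_entropy_loss)

# A model that spreads its guess evenly over 4 tokens has loss ln(4),
# which corresponds to a perplexity of 4.
uniform_loss = math.log(4)
print(perplexity(uniform_loss))
```

One way to read the number: a perplexity of `k` means the model is, on average, about as uncertain as if it were choosing uniformly among `k` tokens.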

In [None]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()],
    path=path, wd=0.1).to_fp16()

**HLF COMMENTS**
Reference: https://docs.fast.ai/text.data.html#textdataloaders.from_folder



By default, a pretrained `Learner` is in a frozen state, meaning that only the head of the model will train while the body stays frozen. We show you what is behind the fine_tune method here and use a fit_one_cycle method to fit the model:

In [None]:
learn.fit_one_cycle(1, 1e-2)
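
For context, the sketch below lays out the two phases that `fine_tune` appears to chain together: a frozen phase followed by an unfrozen one, which would match the two result tables printed after `learn.fine_tune(4, 1e-2)` earlier. The default values and the halving of the learning rate are assumptions based on our reading of the fastai source, so check them against your installed version. The function only describes the plan as dictionaries; it does not train anything.

```python
def fine_tune_phases(epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100):
    """Describe the two training phases as plain dictionaries (illustrative)."""
    frozen = {"trainable": "head only", "epochs": freeze_epochs,
              "max_lr": base_lr}
    base_lr /= 2  # assumed: the second phase starts from a smaller rate
    unfrozen = {"trainable": "all layers", "epochs": epochs,
                "lr_range": (base_lr / lr_mult, base_lr)}
    return [frozen, unfrozen]

phases = fine_tune_phases(4, 1e-2)
for phase in phases:
    print(phase)
```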

This model takes a while to train, so it's a good opportunity to talk about saving intermediary results.

You can easily save the state of your model like so:

In [None]:
learn.save('1epoch')

In [None]:
print(learn.path)
#!ls /root/.fastai/data/imdb/
#!tar -zcvf /content/drive/MyDrive/root/fastai4.tgz /root/.fastai

It will create a file in `learn.path/models/` named "1epoch.pth". If you want to load your model on another machine after creating your `Learner` the same way, or resume training later, you can load the content of this file with:

In [None]:
learn = learn.load('1epoch')

We can then fine-tune the model after unfreezing:

In [None]:
learn.unfreeze()
learn.fit_one_cycle(10, 1e-3)

Once this is done, we save all of our model except the final layer that converts activations to probabilities of picking each token in our vocabulary. The model not including the final layer is called the *encoder*. We can save it with `save_encoder`:

In [None]:
learn.save_encoder('finetuned')

.Who's That Pokémon?
> Tip: The encoder is the model not including the task-specific final layer(s). It means much the same thing as *body* when applied to vision CNNs, but tends to be more used for NLP and generative models.

Before using this to fine-tune a classifier on the reviews, we can use our model to generate random reviews. Since it's trained to guess what the next word of a sentence is, we can use it to write new text, one word at a time:

In [None]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]

In [None]:
print("\n".join(preds))
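
The `temperature` argument in the cell above controls how adventurous the sampling is. The sketch below shows the general idea in plain Python, under the assumption that temperature rescales the model's scores before they are turned into probabilities. It is an illustration of the concept, not fastai's exact implementation, and the scores are made up.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words (made-up numbers).
logits = [2.0, 1.0, 0.1]

probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_sharper = softmax_with_temperature(logits, temperature=0.75)

print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_sharper])
```

A temperature below 1 concentrates probability on the highest-scoring words, so the generated text is more conservative; a temperature above 1 flattens the distribution and makes the output more varied.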

With the language model fine-tuned on movie reviews, we can now modify it to *classify* movie reviews. The idea is that at this point, if the model is "smart enough" to predict the next word, it *must* be able to do a simple positive/negative classification.

### Training a text classifier

We can gather our data for text classification almost exactly like before:

In [None]:
dls_clas = TextDataLoaders.from_folder(
    untar_data(URLs.IMDB), valid='test',
    text_vocab=dls_lm.vocab)

The main difference is that we have to use the exact same vocabulary as when we were fine-tuning our language model, or the weights learned won't make any sense. We pass that vocabulary with `text_vocab`.

Function: `TextDataLoaders.from_folder(path, train='train', valid='valid', ...)`
Description: Create from imagenet style dataset in path with train and valid subfolders (or provide valid_pct)
Reference: https://docs.fast.ai/text.data.html#textdataloaders.from_folder

Then we can define our text classifier like before:

In [None]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

The difference is that before training it, we load the previous encoder:

In [None]:
learn = learn.load_encoder('finetuned')

The last step is to train with discriminative learning rates and *gradual unfreezing*. In computer vision, we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference.

In [None]:
learn.fit_one_cycle(1, 2e-2)

In just one epoch we get the same result as our training in the first section, not too bad! We can pass `-2` to `freeze_to` to freeze all except the last two parameter groups:

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
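
The `slice(...)` passed to `fit_one_cycle` asks for discriminative learning rates: the earliest layers get the smallest rate and the last group gets the largest. The sketch below shows one way such a range could be spread across parameter groups on a multiplicative scale. The helper name `spread_learning_rates` and the choice of five groups are assumptions for illustration, not fastai's own code.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Space n_groups learning rates evenly on a multiplicative scale."""
    if n_groups == 1:
        return [highest]
    step = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * step ** i for i in range(n_groups)]

# Assumption: the model is split into five parameter groups.
lr = 1e-2
lrs = spread_learning_rates(lr / (2.6 ** 4), lr, n_groups=5)

for i, group_lr in enumerate(lrs):
    print(f"group {i}: lr = {group_lr:.6f}")
```

With five groups, dividing by `2.6**4` means each group's learning rate is 2.6 times the rate of the group before it.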

Then we can unfreeze a bit more, and continue training:

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

Finally, we can unfreeze the entire model, and let it train all the layers to get a final boost in accuracy.

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
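
The gradual unfreezing schedule used in the last few cells can be sketched in plain Python. This assumes the model is an ordered list of parameter groups and that freezing up to an index leaves every group from that index onward trainable; the group names are hypothetical.

```python
def trainable_groups(group_names, freeze_to):
    """Return the names of the groups left trainable after freezing up to an index."""
    return group_names[freeze_to:]  # groups before the index are frozen

# Hypothetical names for five parameter groups, from input side to output side.
groups = ["embedding", "rnn_1", "rnn_2", "rnn_3", "head"]

schedule = [
    ("frozen (default)", -1),
    ("freeze_to(-2)", -2),
    ("freeze_to(-3)", -3),
    ("unfreeze()", 0),
]
for label, index in schedule:
    print(f"{label:18} -> trainable: {trainable_groups(groups, index)}")
```

Each step trains one more group than the step before, working backward from the head toward the embedding layer.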

Now you have a text classification model that can accurately predict if a movie review has positive or negative sentiment based on the raw text content of the review alone. With an understanding of the `fastai` APIs, you should now be able to implement your own text classifier on a dataset of your choice.

While IMDB itself was fairly simple, many NLP problems today can be formulated as text classification problems. Some of the things you can do with text classification include:

- Predicting the programming language of some source code
- Building a simple email spam classifier
- Improving the functionality of an automated content moderation bot for online chats or forums
- Categorizing documents based on their language footnote:[To do this well, you need a powerful tokenizer that can recognize text encoding in many languages]

One of the best parts about text classification is that there's a single, simple, interpretable metric to optimize - accuracy. So not only can you solve these tasks, but you can also know how well you're doing using statistics that many people are familiar with.
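
As a reminder of how simple this metric is, here is a minimal sketch of accuracy in plain Python. The labels are made up, and the function is named `simple_accuracy` (a hypothetical name) so that it does not shadow the `accuracy` metric imported from `fastai`.

```python
def simple_accuracy(predicted_labels, true_labels):
    """Fraction of examples whose predicted label matches the true label."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("both lists must have the same length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# Made-up predictions and labels for five reviews.
predicted = ["pos", "neg", "pos", "pos", "neg"]
actual = ["pos", "neg", "neg", "pos", "neg"]

print(simple_accuracy(predicted, actual))
```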

While the IMDB model we built just now does a wonderful job, it's perhaps not super impressive. We've had spam classifiers that do pretty well since the dawn of the dinosaurs, so binary predictions on text are not something you might associate with the glorious future we sold you on in [[ch01]]. But it turns out that this idea of a language model is so powerful that it has become the poster child for NLP today.

To illustrate this, let's give a language model on its own, with no additional training or fine-tuning, a chance to flex its muscles.

## Inference with Huggingface

Now that we know how to train language models, we could conceptually train a very, very large one on a lot of data, and get it to produce very accurate sounding text. Here, we'll use the huggingface library to get prediction samples from a language model that was trained using a procedure similar to the one we used above.

In [None]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode a text input
text = "With great power comes great "
indexed_tokens = tokenizer.encode(text)

# Convert indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])

The code snippet above initializes a tokenizer, which is a function that takes strings as input and returns arrays of numbers that are easier for the model to interpret. We'll be covering tokenizers in much more detail in the next chapter, but for now, if you want a quick look into what our model sees, try printing `tokens_tensor`.

In [None]:
print(tokens_tensor)
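
To build intuition for what a tokenizer does, here is a toy word-level version in plain Python. It is only a sketch: the vocabulary and the helper names `toy_encode` and `toy_decode` are hypothetical, and the real GPT-2 tokenizer works on sub-word pieces rather than whole words.

```python
# Hypothetical word-level vocabulary. GPT-2's real vocabulary is far larger
# and is made of sub-word pieces rather than whole words.
toy_vocab = {"<unk>": 0, "with": 1, "great": 2, "power": 3, "comes": 4, "responsibility": 5}
id_to_token = {i: tok for tok, i in toy_vocab.items()}

def toy_encode(text):
    """Map each lowercase, whitespace-separated word to its integer id."""
    return [toy_vocab.get(word, toy_vocab["<unk>"]) for word in text.lower().split()]

def toy_decode(ids):
    """Map integer ids back to words and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)

ids = toy_encode("With great power comes great")
print(ids)
print(toy_decode(ids + [5]))
```

Encoding turns text into integers the model can consume, and decoding is the reverse lookup that turns a predicted integer back into text.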

Now, let's do the actual inference, which is, again, just a few lines of code thanks to the amazing huggingface transformers library.

In [None]:
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# get the predicted next sub-word
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
print(predicted_text)

Nice! It looks like whatever we just ran, in just a few lines of code, was able to recreate the wisdom of Uncle Ben!footnote:[A character from the Spider-Man comic book series, who once said "with great power comes great responsibility," just like our language model did!]

And to be clear, this wasn't just some simple lookup, database search, or something like that. This was an actual state-of-the-art neural network that, after reading large amounts of text on the internet, is able to complete sentences based on the "knowledge" it gained. Pretty cool, huh?

But without context, this is all just a black box that you throw sentences into. So now, let's break down each line of code in the block we just ran to really get a good idea of what's going on.

### Loading Models

First, we load a pretrained model. This is the single most important step for transfer learning. It downloads the model that we're going to use to make predictions from somewhere on the internet and loads it in the right format into an object in our code. All of that functionality is thankfully packed into this one line of code:

In [None]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

Most deep learning libraries package this model loading functionality neatly into a simple function. It's the last thing you'll have to worry about.

The specific model we're loading here is unimportant at this stage, but just so you know, it's called "GPT2", which was really revolutionary when it came out and basically broke the internet. You can read more about it in [an article](https://blog.floydhub.com/gpt2/) that Ajay wrote in 2019, but we'll talk about it in this book as well, in [[ch09]].

.Loading Models
> **Note:** Loading models into a variable named `model`, regardless of the task or domain, is extremely common in deep learning, so keep that in mind when you're browsing notebooks or code samples online.

Next we run this little line of code, which tells our model that we're not training now and are instead going to make predictions (i.e., perform inference). There are a few things that change internally in the `model` object when we call this line footnote:[Primarily, dropout is disabled and BatchNorm layers switch to their stored running statistics, since both behave differently during training], allowing us to generate predictions from the `model`. Again, this is not the most important line for what we're doing now, but make sure that you call this function whenever you would like to generate predictions. Running this line in a notebook will also print out all the layers of the model in the standard PyTorch format, so maybe scroll through that if you're feeling curious.

In [None]:
model.eval()
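
To see why evaluation mode matters, here is a toy dropout layer in plain Python. This is a sketch of the general idea only: `ToyDropout` is a hypothetical class operating on Python lists, not PyTorch's implementation.

```python
import random

class ToyDropout:
    """A toy dropout layer operating on plain Python lists."""

    def __init__(self, p=0.5):
        self.p = p            # probability of zeroing each value
        self.training = True  # layers start out in training mode

    def eval(self):
        self.training = False
        return self

    def forward(self, values):
        if not self.training:
            return list(values)  # evaluation mode: pass values through unchanged
        keep = 1.0 - self.p
        # training mode: randomly zero values and rescale the survivors
        return [v / keep if random.random() < keep else 0.0 for v in values]

layer = ToyDropout(p=0.5)
inputs = [1.0, 2.0, 3.0, 4.0]

print("training:", layer.forward(inputs))  # varies from run to run
layer.eval()
print("eval:    ", layer.forward(inputs))  # always identical to the input
```

In training mode the same input gives a different output on every call, which is why forgetting to switch to evaluation mode leads to predictions that are not reproducible.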

With the weights downloaded, the model loaded into memory, and the `model` object set to evaluation mode, it's time to crank out some output from our lean footnote:[Ok, maybe this phrase is not applicable to GPT-2 specifically, but when we all have computers that are 300 times faster than what we have today, this adjective will be accurate.], mean, text-generating machine.

### Generating Predictions

We're going to group the next three lines together, since they work as a block.

In [None]:
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

The first line, `with torch.no_grad():`, tells PyTorch to run the lines in that indented block in the `torch.no_grad()` context, which means PyTorch won't track the operations needed to calculate gradients (the backward pass) for the model. If you're not familiar with backpropagation, or not entirely clear why gradients are calculated in the first place, refer to the resources we have in the introduction. Strictly speaking, we don't _need_ to turn off gradient computation, but doing so saves time, memory, and compute, and makes the inference run faster.

In the `torch.no_grad()` context, we then run a forward pass. As always, PyTorch makes this extremely simple. Just call `model` as a function, with the `tokens_tensor` we prepared above as the input.

But wait, wasn't `model` an object with the pretrained weights that we loaded above? How is it also a function?

.Python Dunder Methods
****

In Python, you can actually do this! You have to define a `__call__` method in your class, which is a special kind of method called a dunder (double underscore) method. Python has a lot of these cool dunder methods, some of which you've likely encountered before, like `__init__`, which lets you set up a constructor for your class, and `__len__`, which lets you define a "length" for your objects that you can access via the `len()` function. Python dunder methods allow you to define a lot of cool functionality for your custom classes, such as addition, equality, and more.

If you define a method called `__call__()` in your Python class, you can then treat instances of your class as functions, and the `__call__()` method will be invoked every time you do so. We'll soon talk about PyTorch `nn.Module` objects, which are the building blocks for neural nets. The `nn.Module` class implements the `__call__` method by default. Therefore, every PyTorch model (and submodule) can also be called as a function, which can make your code very neat and tidy. This is why we can both store the model in the `model` variable and call it as we would a function.

If you're interested in learning more about Python dunder methods, check out this [tutorial](https://rszalski.github.io/magicmethods/) or read more online (there are plenty of great resources one search away).

****
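
The sidebar's point about `__call__` can be sketched with a tiny class. `Scaler` is a hypothetical example class, and the way its `__call__` hands off to a `forward` method is a simplified picture of the pattern, not PyTorch's actual code.

```python
class Scaler:
    """A tiny class whose instances can be called like functions."""

    def __init__(self, factor=2):
        self.factor = factor  # state stored on the object, like a model's weights

    def __call__(self, x):
        # Runs whenever an instance is called like a function: scaler(3)
        return self.forward(x)

    def forward(self, x):
        return x * self.factor

    def __len__(self):
        return 1  # this object holds a single "parameter"

scaler = Scaler()
print(scaler(21))   # same as scaler.__call__(21)
print(len(scaler))  # same as scaler.__len__()
```

The object keeps its state in attributes and still behaves like a function, which is the same combination that lets us write `model(tokens_tensor)`.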

Calling `model(input_tensor)` will, in general, return a `torch.Tensor` object with the predictions. In this case, though, the huggingface library gives us a lot of other items as well: `model(tokens_tensor)` returns a tuple (or, in recent versions of the library, a tuple-like output object) whose first element is the predictions tensor. Let's quickly confirm all of this by checking the shape of that first element.

In [None]:
outputs[0].shape

This checks out, because according to the huggingface transformers documentation, the predictions tensor is supposed to have shape `(batch_size, sequence_length, config.vocab_size)`. Here, the batch size is 1, since we're only passing in one sentence. The sequence length should be 5, which makes sense if you take a look at the line where we define the input sentence, which had five words (space-delimited substrings) in the string:

```
text = "With great power comes great "
```

The value of 50257 for the vocabulary size seems accurate, but this is something we could always double-check by going through the documentation for this model.


> **Tip:** We can't emphasize enough how useful this technique of checking the size, shape, and dimensions of a `torch.Tensor` is. It's one of the most effective ways of debugging your model. Hopefully, as you start training more complex models and building your own architectures from scratch, this will come naturally. But until then, always remember to check the size with `.shape` (or `.size()`) and reason through what's going on in your model.

Since it seems like `outputs[0]` is what we want, we'll assign it to the variable `predictions`. Putting these together and wrapping them in the `torch.no_grad()` context gives us that mini-block of code that we had above.

In [None]:
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

`predictions` is a `torch.Tensor` whose values score how likely each word is at each position. Strictly speaking, these are unnormalized scores (logits) rather than probabilities, but a larger value still means a more likely word. Remember, one of the dimensions of this tensor is the size of the vocabulary (i.e., the number of possible words that the model could predict). What we want now is the word that is most likely to come next in our sentence. We grab this by using the `argmax` function, which gets the index of the largest value in the array.

In [None]:
predicted_index = torch.argmax(predictions[0, -1, :]).item()

To ensure that we're absolutely clear on what exactly we're doing, let's also quickly break down the way we index `predictions`. It's a three-dimensional tensor, so we specify three indices. The first, along the batch dimension, is `0`. Since we're not running batch predictions, there's only one element in this axis, so it's what we pick. Along the sequence length dimension, we pick the last element with `-1`. This is because the output at the last position is the model's prediction for the word that follows the sentence we passed in. The last index is `:`, which means we want to grab everything. We need all the elements along the vocabulary dimension to calculate which one is most likely.
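
The indexing described above can be tried out on a small stand-in for `predictions`. This sketch uses nested Python lists with made-up scores and a made-up shape of `(1, 3, 4)` so that it runs without PyTorch or the real model.

```python
import math

# A made-up "predictions" array with shape (batch=1, sequence=3, vocab=4),
# written as nested lists so it runs without PyTorch.
predictions = [
    [
        [0.1, 2.0, 0.3, 0.0],  # scores after the 1st token
        [1.5, 0.2, 0.1, 0.4],  # scores after the 2nd token
        [0.3, 0.1, 0.2, 3.0],  # scores after the 3rd (last) token
    ]
]

# Equivalent of predictions[0, -1, :]: first batch item, last position, all scores.
last_scores = predictions[0][-1][:]

# Equivalent of torch.argmax(...).item(): index of the largest score.
predicted_index = max(range(len(last_scores)), key=lambda i: last_scores[i])

# Softmax turns scores into probabilities without changing which index is largest.
exps = [math.exp(s - max(last_scores)) for s in last_scores]
probs = [e / sum(exps) for e in exps]

print(predicted_index)
print([round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the scores, taking `argmax` directly on the raw scores picks the same word as taking it on the probabilities.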

Finally, we decode the index we got into a word using the `tokenizer.decode()` function. This is just a simple lookup.

In [None]:
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
print(predicted_text)

And there we have it! Recreating wisdom in just a few lines of code.

## Conclusion

In [[ch08]] we'll try putting these ideas together to develop a technique that utilizes both transformers and transfer learning together to create an incredibly powerful set of models that can solve these tasks we just demonstrated as well as many more.
    
There's a lot in this chapter that we haven't explained yet. We've intentionally left out a lot of details such as what exactly a model is/does, how the tokenizer is implemented in code, and perhaps most importantly, how to use the pretrained model for transfer learning.

Don't worry though: we'll eventually get to all that. The goal of this chapter was to help you understand some of the important components of an NLP pipeline by running code and seeing results in real time. To test your understanding of the material so far, try to use a different language model, swap out the prompt, and see if you can get the model to predict a popular quote, phrase, or idiom. Note that to do this, you might need to swap out the tokenizer as well.
    
Once you're able to perform these tasks, you should be ready to move on to the next chapter, in which we formally introduce some of the most popular NLP applications today and build a few together.