A typical workflow with fastai's text module has three steps:
1. Pre-process data
2. Fine-tune a pre-trained model
3. Create other models (e.g. classifiers) on top of the encoder of the fine-tuned model

In [1]:
from fastai.text import *

In [2]:
imdb = untar_data(URLs.IMDB_SAMPLE)

In [3]:
imdb

PosixPath('/Users/p787144/.fastai/data/imdb_sample')

In [4]:
df = pd.read_csv(imdb/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


# Fine-tune

The language-model data can also be built directly from a dataframe, for example by splitting on the `is_valid` column:

`train_df, valid_df = df[~df.is_valid], df[df.is_valid]`

`data_lm = TextLMDataBunch.from_df(imdb, train_df, valid_df, text_cols='text', bs=32)`

In [5]:
data_lm = (
    TextList
    .from_csv(imdb, 'texts.csv', cols='text')
    .split_by_rand_pct()
    .label_for_lm()
    .databunch()
)

data_lm.save()

`data_lm.show_batch()` shows the beginning of each sequence of text along the batch dimension (the target is to guess the next word).

You may notice quite a few strange tokens starting with `xx`. These are special fastai tokens with the following meanings:

- xxunk: Token used instead of unknown words (words not found in the vocabulary).
- xxbos: Beginning of a text.
- xxfld: Represents separate parts of your document (several columns in a dataframe) like headline, body, summary, etc.
- xxmaj: Indicates that the next word starts with a capital, e.g. “House” will be tokenized as “xxmaj house”.
- xxup: Indicates that the next word is written in all caps, e.g. “WHY” will be tokenized as “xxup why”.
- xxrep: Indicates that a character is repeated n times, e.g. 10 `$` in a row will be tokenized as “xxrep 10 $” (in general “xxrep n {char}”).
- xxwrep: Indicates that a word is repeated n times.
- xxpad: Token used as padding (so every text in a batch has the same length).
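
The capitalization and repetition rules above can be illustrated with a small pure-Python sketch. This is a simplified approximation, not fastai's tokenizer: the function names (`mark_repeated_chars`, `mark_case`, `toy_tokenize`), the whitespace splitting, and the threshold of four repeated characters are assumptions made for the example.

```python
import re

def mark_repeated_chars(text):
    """Replace a run of 4+ identical non-space characters with ' xxrep n c '.
    The threshold of 4 is an assumption for this sketch."""
    def _replace(match):
        char = match.group(1)
        return f" xxrep {len(match.group(0))} {char} "
    return re.sub(r"(\S)\1{3,}", _replace, text)

def mark_case(tokens):
    """Lower-case every token, adding xxup / xxmaj markers where they apply."""
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():
            out += ["xxup", tok.lower()]       # all caps: WHY -> xxup why
        elif len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            out += ["xxmaj", tok.lower()]      # capitalized: House -> xxmaj house
        else:
            out.append(tok.lower())
    return out

def toy_tokenize(text):
    """Hypothetical helper: whitespace tokenization plus the marker rules."""
    text = mark_repeated_chars(text)
    return ["xxbos"] + mark_case(text.split())

print(toy_tokenize("WHY is the House $$$$$$$$$$"))
```

The real tokenizer also handles punctuation, contractions, `xxwrep`, and unknown words, which this sketch leaves out.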

In [5]:
data_lm.show_batch()

idx,text
0,"! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a xxunk of xxmaj jonestown - hollywood style . xxmaj xxunk ! xxbos xxmaj this is a extremely well - made film . xxmaj the acting , script and camera - work are all first - rate . xxmaj the music is good , too , though it is"
1,"about a beautiful family support and faith for their children and a special dream for their xxunk son and his sister . xxbos xxmaj father and son xxunk very little . xxup in fact they speak different xxunk . but when the son drives his father xxunk miles for his xxunk 's to xxmaj xxunk , the conversations finally take place . they are difficult and xxunk is necessary on"
2,"like this one , xxup xxunk xxunk , xxup xxunk ( xxmaj xxunk ) and xxup all xxup quiet xxup on xxup the xxup western xxup front ( xxmaj xxunk ) had great messages of peace and harmony but ultimately were failures in positively xxunk public opinion . xxmaj so , from a historical point of view , it 's an amazing and sad xxunk that is well worth seeing"
3,"xxunk . xxmaj some aspects of xxmaj victorian dress may appear odd , particularly the use of xxunk or xxunk on head and facial hair . \n \n xxmaj this film is the only one that follows with some xxunk xxmaj wells ' original narrative  as has been noted . xxmaj viewers may find it informative to note plot details that appear here that are occasionally xxunk in"
4,"funny because these are schools that want racial xxunk , equality etc . and i can honestly say , that it 's there . xxmaj but the thing is when class lets out , or when they 're just hanging out waiting for class , they ( students ) seem to just hang around with people of their own race or ethnicity . xxmaj is that bad ? xxmaj not"
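
Each row of the batch above is an input sequence whose target is the same text shifted by one token, which is what "guess the next word" means in practice. A minimal sketch of that pairing, assuming a flat list of tokens and a hypothetical helper `lm_pairs` (fastai's own batching is more involved):

```python
def lm_pairs(tokens, seq_len):
    """Split a token stream into (input, target) windows.
    Each target is its input shifted one position to the right."""
    pairs = []
    for start in range(0, len(tokens) - 1, seq_len):
        x = tokens[start:start + seq_len]
        y = tokens[start + 1:start + 1 + seq_len]
        x = x[:len(y)]  # trim the last window so input and target align
        pairs.append((x, y))
    return pairs

tokens = "xxbos xxmaj this is a extremely well - made film .".split()
for x, y in lm_pairs(tokens, seq_len=5):
    print(x, "->", y)
```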


<br>

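Note: `language_model_learner` starts from pretrained weights with all but the last layer group frozen, so calling `learn.unfreeze()` before fitting should change which layers are updated. The cell below trains without unfreezing.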

In [9]:
learn = language_model_learner(data_lm, AWD_LSTM)
#learn.fit_one_cycle(5, 1e-2)
learn.fit_one_cycle(5)
learn.save('mini_train_lm')
learn.save_encoder('mini_train_encoder')

epoch,train_loss,valid_loss,accuracy,time
0,4.419681,3.908901,0.281756,03:57
1,4.243346,3.775755,0.292664,04:00
2,4.082794,3.733076,0.296116,04:10
3,3.960967,3.726044,0.296577,04:08
4,3.887568,3.724927,0.296905,04:10


To evaluate your language model, you can run the `Learner.predict` method and specify the number of words you want it to generate.

`learn.predict("This is a review about", n_words=10)`

Or

In [10]:
learn.show_results()

text,target,pred
"xxbos xxmaj well , because i 'm a xxunk i thought , maybe i 'll check this movie out on","xxup xxunk , nothing else good on . xxmaj one of the worst mistakes of my life so far ,",the dvd . but else . . the xxmaj it of the best movies i the life is far .
even if it is xxunk ... but the only lame here is the end ... sorry xxbos a woman asks,"for advice on the road to reach a mysterious town , and hears two xxunk stories from the local xxunk",her a . how subject . the the xxunk place . and the the xxunk of . the xxunk police
"of a cross between "" xxmaj desperate xxmaj xxunk "" with "" xxmaj the xxmaj xxunk xxmaj wives "" and","other better known features , combined with a mild dose of xxunk . xxmaj the best thing about the movie","xxmaj xxunk "" "" of and with the xxunk - of xxunk , xxmaj the xxunk thing about this film"
"tries to kill xxmaj ben . xxmaj the xxunk attempt is , of course , xxunk , if overly melodramatic",". xxmaj it also xxunk xxmaj ben the opportunity to pick up , and pick on , the very xxunk",", xxmaj the 's xxunk the xxunk 's xxunk to xxunk up the and the up the the xxunk xxunk"
"white if you can figure out the sense in that ) , and xxunk , ( so funny my fellow",audience member who usually like movies like this actually xxunk and laughed when then the doc 's xxunk finally ended,xxunk members ) was xxunk xxmaj like xxmaj ) ) ) xxunk at i xxunk xxunk is xxunk xxunk xxunk


# Build a classifier

In [6]:
data_clas = (
    TextList
    .from_csv(imdb, 'texts.csv', cols='text', vocab=data_lm.vocab)
    .split_from_df(col='is_valid')
    .label_from_df(cols='label')
    .databunch(bs=32)
)

In [7]:
data_clas.show_batch()

text,target
"xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n \n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and xxunk , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and xxunk xxmaj raising xxmaj",negative
"xxbos xxup the xxup shop xxup around xxup the xxup xxunk is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with",positive
"xxbos xxmaj now that xxmaj xxunk ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of "" xxmaj at xxmaj the xxmaj movies "" in taking xxmaj steven xxmaj xxunk to xxunk . \n \n xxmaj it 's usually satisfying to watch a film director change his style /",negative
"xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj xxunk . \n \n xxmaj the format is the same as xxmaj max xxmaj xxunk ' "" xxmaj la xxmaj xxunk",positive
"xxbos xxmaj many xxunk that this is n't just a classic due to the fact that it 's the first xxup 3d game , or even the first xxunk - up . xxmaj it 's also one of the first xxunk games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphics",positive


### Method 1:

fastai currently provides three architectures for this: AWD_LSTM, Transformer and TransformerXL.

In [8]:
learn = text_classifier_learner(data_clas, AWD_LSTM)
learn.load_encoder('mini_train_encoder')
#learn.fit_one_cycle(4, slice(1e-3,1e-2))
learn.fit_one_cycle(4)
learn.save('mini_train_clas')

RuntimeError: Error(s) in loading state_dict for AWD_LSTM:
	size mismatch for encoder.weight: copying a param with shape torch.Size([6160, 400]) from checkpoint, the shape in current model is torch.Size([6056, 400]).
	size mismatch for encoder_dp.emb.weight: copying a param with shape torch.Size([6160, 400]) from checkpoint, the shape in current model is torch.Size([6056, 400]).
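
The size mismatch above means the saved encoder was trained with a vocabulary of 6160 tokens while the current model was built for 6056: the embedding matrix has one row per vocabulary entry, so the two cannot be loaded into each other. One plausible explanation is that the classifier data used a vocabulary different from the one the encoder was saved with, for instance because `data_lm` was rebuilt with a new random split after the encoder was saved. The sketch below shows how vocabularies built from different text subsets end up with different sizes. `build_vocab`, `min_freq`, and the toy texts are hypothetical and only illustrate the idea.

```python
from collections import Counter

def build_vocab(texts, min_freq=2):
    """Hypothetical vocab builder: keep tokens seen at least min_freq times."""
    counts = Counter(tok for text in texts for tok in text.split())
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    return ["xxunk", "xxpad"] + kept

lm_texts = [
    "the film was great",
    "the film was dull",
    "great acting and a dull plot",
]
clas_texts = lm_texts[:2]  # a different subset of the same corpus

lm_vocab = build_vocab(lm_texts)
clas_vocab = build_vocab(clas_texts)

# One embedding row per vocab entry, so these sizes must agree
# before an encoder trained on one can be loaded into the other.
print(len(lm_vocab), len(clas_vocab))
```

Reusing the exact vocabulary of the language-model data that produced the saved encoder, as `vocab=data_lm.vocab` does, keeps the two sizes equal.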

In [14]:
learn.show_results()

text,target,prediction
"xxbos \n \n i 'm sure things did n't exactly go the same way in the real life of xxmaj homer xxmaj hickam as they did in the film adaptation of his book , xxmaj rocket xxmaj boys , but the movie "" xxmaj october xxmaj sky "" ( an xxunk of the book 's title ) is good enough to stand alone . i have not read xxmaj",positive,positive
"xxbos xxmaj to review this movie , i without any doubt would have to quote that memorable scene in xxmaj tarantino 's "" xxmaj pulp xxmaj fiction "" ( xxunk ) when xxmaj jules and xxmaj vincent are talking about xxmaj xxunk xxmaj xxunk and what she does for a living . xxmaj jules tells xxmaj vincent that the "" xxmaj only thing she did worthwhile was pilot "" .",negative,negative
"xxbos xxmaj how viewers react to this new "" adaption "" of xxmaj shirley xxmaj jackson 's book , which was promoted as xxup not being a remake of the original 1963 movie ( true enough ) , will be based , i suspect , on the following : those who were big fans of either the book or original movie are not going to think much of this one",negative,negative
"xxbos xxmaj the trouble with the book , "" xxmaj memoirs of a xxmaj geisha "" is that it had xxmaj japanese xxunk but underneath the xxunk it was all an xxmaj american man 's way of thinking . xxmaj reading the book is like watching a magnificent ballet with great music , sets , and costumes yet performed by xxunk animals dressed in those xxunk far from xxmaj japanese",negative,negative
"xxbos xxmaj bonanza had a great cast of wonderful actors . xxmaj xxunk xxmaj xxunk , xxmaj pernell xxmaj whitaker , xxmaj michael xxmaj xxunk , xxmaj dan xxmaj blocker , and even xxmaj guy xxmaj williams ( as the cousin who was brought in for several episodes during 1964 to replace xxmaj adam when he was leaving the series ) . xxmaj the cast had chemistry , and they",positive,positive


### Method 2:

In [9]:
learn = text_classifier_learner(data_clas, AWD_LSTM)
learn.load_encoder('mini_train_encoder')
learn.fit_one_cycle(1, 1e-2)

RuntimeError: Error(s) in loading state_dict for AWD_LSTM:
	size mismatch for encoder.weight: copying a param with shape torch.Size([6160, 400]) from checkpoint, the shape in current model is torch.Size([6136, 400]).
	size mismatch for encoder_dp.emb.weight: copying a param with shape torch.Size([6160, 400]) from checkpoint, the shape in current model is torch.Size([6136, 400]).

In [26]:
data_clas = (
    TextList
    .from_csv(imdb, 'texts.csv', cols='text')
    .split_from_df(col='is_valid')
    .label_from_df(cols='label')
    .databunch(bs=42)
)

learn = text_classifier_learner(data_clas, AWD_LSTM)

learn.freeze()

In [24]:
learn.fit_one_cycle(4, slice(1e-3,1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.7738,0.690593,0.535,28:54
1,0.735652,0.695866,0.47,05:31
2,0.723024,0.69868,0.465,05:15
3,0.713905,0.695252,0.47,05:32
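
A `slice(1e-3, 1e-2)` learning rate asks for different rates per layer group, smaller for earlier layers and larger for later ones. A rough sketch of that idea, assuming geometric spacing and a hypothetical count of three layer groups (neither is taken from this notebook):

```python
def spread_learning_rates(start, stop, n_groups):
    """Geometrically space n_groups learning rates from start to stop."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

# n_groups=3 is an assumed value for illustration only
for group, lr in enumerate(spread_learning_rates(1e-3, 1e-2, n_groups=3)):
    print(f"layer group {group}: lr = {lr:.5f}")
```

Because `learn.freeze()` was called before fitting, only the last layer group is actually being trained in the run above.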


In [28]:
learn.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,time
0,0.80389,0.711848,0.465,04:59
