In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai.text import *

### Preparing the data

In [3]:
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()

[WindowsPath('C:/Users/my/.fastai/data/imdb_sample/data_save.pkl'),
 WindowsPath('C:/Users/my/.fastai/data/imdb_sample/texts.csv')]

In [4]:
df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


In [5]:
df['text'][1]

'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it\'s also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.<br /><br />The result is that there is much cruelty and inhumanity in the situation and this is very u

In [6]:
data_lm = TextDataBunch.from_csv(path, 'texts.csv')

In [7]:
data_lm.save()

In [8]:
data = load_data(path)

### Tokenization

The first processing step is to split the raw sentences into words, or more precisely, tokens. The simplest approach would be to split each string on spaces, but we can be smarter:

* we need to take care of punctuation
* some words are contractions of two different words, like isn't or don't
* we may need to clean some parts of our texts, if there's HTML code for instance

To see what the tokenizer did behind the scenes, let's have a look at a few texts in a batch.

In [9]:
data = TextClasDataBunch.from_csv(path, 'texts.csv')
data.show_batch()

text,target
"xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n \n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , steaming bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj",negative
"xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the sweetest and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with",positive
"xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj sydney , after xxunk ) , i can xxunk join both xxunk of "" xxmaj at xxmaj the xxmaj movies "" in taking xxmaj steven xxmaj soderbergh to task . \n \n xxmaj it 's usually satisfying to watch a film director change his style /",negative
"xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj yorkers . \n \n xxmaj the format is the same as xxmaj max xxmaj xxunk ' "" xxmaj la xxmaj ronde",positive
"xxbos i really wanted to love this show . i truly , honestly did . \n \n xxmaj for the first time , gay viewers get their own version of the "" xxmaj the xxmaj bachelor "" . xxmaj with the help of his obligatory "" hag "" xxmaj xxunk , xxmaj james , a good looking , well - to - do thirty - something has the chance",negative


The texts are truncated at 100 tokens for readability. We can see that the tokenizer did more than just split on spaces and punctuation symbols:

* the "'s" are grouped together in one token
* contractions are split like this: "did", "n't"
* content has been cleaned of any HTML markup and lowercased
* there are several special tokens (all those that begin with xx) that replace unknown tokens (see below) or introduce different text fields (here we only have one).
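
To make the rules listed above more concrete, here is a toy tokenizer written with the standard `re` module. It is only a sketch: `sketch_tokenize` and `TOKEN_RE` are hypothetical names introduced for illustration, the pattern handles just a few English contractions, and the real fastai tokenizer applies more rules than this.

```python
import re

# Simplified token pattern: newline | "n't" | a few 's-style suffixes |
# a word stem followed by "n't" | a plain word | a single punctuation mark.
TOKEN_RE = re.compile(r"\n|n't|'(?:s|re|ve|ll|d|m)\b|\w+(?=n't)|\w+|[^\w\s]")

def sketch_tokenize(text):
    """Toy tokenizer imitating the rules listed above (not fastai's code)."""
    text = re.sub(r"<br\s*/?>", "\n", text)   # clean HTML line breaks
    tokens = ["xxbos"]                         # mark the beginning of the text
    for tok in TOKEN_RE.findall(text):
        if len(tok) > 1 and tok.isupper():
            tokens.append("xxup")              # the next token was all caps
        elif len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            tokens.append("xxmaj")             # the next token was capitalized
        tokens.append(tok.lower())             # the token itself is lowercased
    return tokens

print(sketch_tokenize("It's GREAT.<br />I didn't mind"))
```

The capitalization information is not lost by lowercasing: it is moved into the separate xxmaj and xxup marker tokens, which keeps the vocabulary smaller.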


### Numericalization

Once we have extracted tokens from our texts, we convert them to integers by creating a list of all the words used. By default, we keep only the words that appear at least twice, up to a maximum vocabulary size of 60,000, and replace the ones that don't make the cut with the unknown token xxunk.

The correspondence from ids to tokens is stored in the vocab attribute of our datasets, in a list called itos (for int to string).

In [10]:
data.vocab.itos[:10]

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the']
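
The numericalization step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `build_itos` and `numericalize` are hypothetical helpers, and the special tokens are simply copied from the output above.

```python
from collections import Counter

# Special tokens, in the order shown in the vocab output above.
SPECIAL_TOKENS = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld",
                  "xxmaj", "xxup", "xxrep", "xxwrep"]

def build_itos(tokenized_texts, max_vocab=60000, min_freq=2):
    """Build an int-to-string list: special tokens first, then frequent words."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(SPECIAL_TOKENS)
    for tok, freq in counts.most_common(max_vocab):
        # Keep only words seen at least min_freq times.
        if freq >= min_freq and tok not in SPECIAL_TOKENS:
            itos.append(tok)
    return itos[:max_vocab]

def numericalize(tokens, itos):
    """Map tokens to ids; anything outside the vocab becomes xxunk (id 0)."""
    stoi = {tok: i for i, tok in enumerate(itos)}
    return [stoi.get(tok, 0) for tok in tokens]

texts = [["xxbos", "the", "movie", "the", "end"],
         ["xxbos", "the", "movie", "rocks"]]
itos = build_itos(texts)
print(itos[9:])
print(numericalize(["xxbos", "the", "movie", "rocks"], itos))
```

In this toy corpus, "end" and "rocks" appear only once, so they fall below the frequency threshold and map to the id of xxunk.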

And if we look at what's in our datasets, we'll see the tokenized text as a representation:

In [11]:
data.train_ds[0][0]

Text xxbos xxmaj not everything is said in this excellent first feature from xxmaj xxunk xxmaj xxunk . xxmaj the friendship , the " wanting to fit in " , the first sexual feelings ... xxmaj all this and much more is xxunk through the underwater xxunk swimming scenes . 
 
  xxmaj all three girls in the movie try to find and express their personality in a very different way . xxmaj it is a much less violent approach to the understanding of the teenage years compared to , say , " xxmaj thirteen " , but a very worthwhile trip nonetheless . 
 
  a must see , and please leave all xxmaj american cinematographic xxunk at he door . xxmaj the soundtrack is xxup a+ by the way . 
 
  xxmaj bon xxunk !

In [12]:
data.train_ds[0][0].data[:10]

array([  2,   5,  39, 361,  16, 309,  18,  21, 405, 106], dtype=int64)

In [13]:
data = (TextList.from_csv(path, 'texts.csv', cols='text')
                .split_from_df(col=2)
                .label_from_df(cols=0)
                .databunch())

### Language model

In [14]:
bs=20

In [15]:
path = untar_data(URLs.IMDB)
path.ls()

[WindowsPath('C:/Users/my/.fastai/data/imdb/data_lm.pkl'),
 WindowsPath('C:/Users/my/.fastai/data/imdb/imdb.vocab'),
 WindowsPath('C:/Users/my/.fastai/data/imdb/models'),
 WindowsPath('C:/Users/my/.fastai/data/imdb/README'),
 WindowsPath('C:/Users/my/.fastai/data/imdb/test'),
 WindowsPath('C:/Users/my/.fastai/data/imdb/tmp_clas'),
 WindowsPath('C:/Users/my/.fastai/data/imdb/tmp_lm'),
 WindowsPath('C:/Users/my/.fastai/data/imdb/train'),
 WindowsPath('C:/Users/my/.fastai/data/imdb/unsup')]

In [16]:
(path/'train').ls()

[WindowsPath('C:/Users/my/.fastai/data/imdb/train/labeledBow.feat'),
 WindowsPath('C:/Users/my/.fastai/data/imdb/train/neg'),
 WindowsPath('C:/Users/my/.fastai/data/imdb/train/pos'),
 WindowsPath('C:/Users/my/.fastai/data/imdb/train/unsupBow.feat')]

In [17]:
data_lm = (TextList.from_folder(path)
           #Inputs: all the text files in path
            .filter_by_folder(include=['train', 'test', 'unsup']) 
           #We may have other temp folders that contain text files so we only keep what's in train, test and unsup
            .split_by_rand_pct(0.1)
           #We randomly split and keep 10% (10,000 reviews) for validation
            .label_for_lm()           
           #We want to do a language model so we label accordingly
            .databunch(bs=bs))

data_lm.save('data_lm.pkl')

We have to use a special kind of TextDataBunch for the language model: it ignores the labels (that's why we put 0 everywhere), shuffles the texts at each epoch before concatenating them all together (only for training; we don't shuffle the validation set), and sends batches that read that text in order, with targets that are the next word in the sequence.

Since the data preparation cell above takes a while to run, we saved its result; the following cell quickly reloads the final ids.

In [18]:
data_lm = load_data(path, 'data_lm.pkl', bs=bs)

In [19]:
data_lm.show_batch()

idx,text
0,"later , by which time i did not care . xxmaj the character we should really care about is a very cocky , overconfident xxmaj ashton xxmaj kutcher . xxmaj the problem is he comes off as kid who thinks he 's better than anyone else around him and shows no signs of a cluttered closet . xxmaj his only obstacle appears to be winning over xxmaj costner . xxmaj"
1,"like the xxmaj stooges , or very athletically , like xxmaj buster xxmaj keaton , can be hilarious . xxmaj but otherwise it 's boring and , well , stupid . i think i got one good laugh out of the entire movie . \n \n xxmaj avoid this one . i saw it for free on cable , and still wanted my money back . xxbos ' xxmaj"
2,"with a story that leaves much to be desired . xxmaj with a script that the screen - writers for "" xxmaj touched by an xxmaj angel "" might have passed up as being too xxunk , xxmaj ozpetek still keeps us interested at times . xxmaj in fact , i wanted to focus on the positives but i found the last act so bafflingly bizarre and awful that i"
3,"last half hour is not so great , with many questions left unanswered . xxmaj this will doubtless annoy others as it annoyed me . xxmaj but nevertheless , good fun and a very smart first feature from xxmaj xxunk . xxbos xxmaj on many levels it 's very good . xxmaj in fact , considering that this was a low - budget xxmaj british indie by a first time"
4,"everyone feel so good because the characters embody what every man and woman wants to be , not what they are . xxmaj minnie and xxmaj moskowitz , instead of indulging in any hint of fantasy in the realm of romance , depicts people who may just be more common than the attractive , confident people with so much experience playing the field . xxmaj what 's the story behind"
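
Each row in the batch above reads a contiguous stretch of the concatenated reviews. As a rough sketch of how such batches and their next-word targets can be produced, here is a simplified version in plain Python. `lm_batches` is a hypothetical helper, `bptt` is used here as an illustrative name for the number of tokens per row in one batch, and the per-epoch shuffling is left out.

```python
def lm_batches(stream, bs, bptt):
    """Yield (inputs, targets) pairs from one long list of token ids.

    The stream is cut into bs contiguous rows; each batch takes the next
    bptt ids from every row, and the targets are the inputs shifted by one.
    """
    # One extra id is needed at the end so the last input has a target.
    row_len = (len(stream) - 1) // bs
    rows_x = [stream[r * row_len:(r + 1) * row_len] for r in range(bs)]
    rows_y = [stream[r * row_len + 1:(r + 1) * row_len + 1] for r in range(bs)]
    for start in range(0, row_len, bptt):
        x = [row[start:start + bptt] for row in rows_x]
        y = [row[start:start + bptt] for row in rows_y]
        yield x, y

# Toy "concatenated corpus" of 13 token ids, split into 2 rows.
batches = list(lm_batches(list(range(13)), bs=2, bptt=3))
for x, y in batches:
    print(x, y)
```

Because consecutive batches continue where the previous one stopped in every row, a recurrent model can carry its hidden state from one batch to the next.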


In [20]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

In [21]:
learn.lr_find()

epoch,train_loss,valid_loss,accuracy,time


LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.


RuntimeError: CUDA out of memory. Tried to allocate 322.00 MiB (GPU 0; 4.00 GiB total capacity; 1.29 GiB already allocated; 315.15 MiB free; 1.36 GiB reserved in total by PyTorch)

In [None]:
learn.recorder.plot(skip_end=15)

In [None]:
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

In [None]:
learn.save('fine_tuned')

In [None]:
learn.load('fine_tuned');

In [None]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2

In [None]:
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
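
The temperature argument controls how adventurous the sampling is. A minimal sketch of the general idea, assuming the model produces one score (logit) per vocabulary entry: dividing the scores by a temperature below 1 sharpens the distribution before the next token is drawn. The helper names are hypothetical, and this illustrates the technique rather than the exact code behind learn.predict.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn scores into probabilities; lower temperature means a sharper distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, rng=random):
    """Draw one token id at random according to the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]                     # toy scores for a 3-word vocabulary
p_plain = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.75)
print(p_plain, p_sharp, sample_next(logits, 0.75))
```

With temperature=0.75 the most likely word gets a larger share of the probability mass than at temperature 1, so the generated text is less random but still varies between runs.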

In [None]:
learn.save_encoder('fine_tuned_enc')

### Classifier

Now we'll create a new data object that grabs only the labelled data and keeps those labels. Again, this cell takes a bit of time to run.

In [None]:
path = untar_data(URLs.IMDB)

In [None]:
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             #grab all the text files in path
             .split_by_folder(valid='test')
             #split by folder: 'train' for training, 'test' for validation (only these two are kept, so no need to filter)
             .label_from_folder(classes=['neg', 'pos'])
             #label them all with their folders
             .databunch(bs=bs))

data_clas.save('data_clas.pkl')

In [None]:
data_clas = load_data(path, 'data_clas.pkl', bs=bs)

In [None]:
data_clas.show_batch()

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))

In [None]:
learn.save('first')

In [None]:
learn.load('first');

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))

In [None]:
learn.save('second')

In [None]:
learn.load('second');

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))

In [None]:
learn.save('third')

In [None]:
learn.load('third');

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))
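
The slice(lr/(2.6**4), lr) arguments in the training cells above give different learning rates to different parts of the model: the earliest layers get the smallest rate and the classifier head gets the largest. A minimal sketch of how such a range can be spread, assuming the rates are spaced geometrically and the model is split into five layer groups (both are assumptions here, and `spread_lrs` is a hypothetical helper):

```python
def spread_lrs(start, stop, n_groups):
    """Geometrically spaced learning rates from start to stop, one per layer group."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

# Assumption: the classifier is split into 5 layer groups.
lrs = spread_lrs(1e-2 / 2.6**4, 1e-2, 5)
for group, lr in enumerate(lrs):
    print(f"layer group {group}: lr = {lr:.2e}")
```

Under these assumptions each layer group trains with a rate 2.6 times larger than the group before it, which is why the lower bound is written as the upper bound divided by 2.6**4.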

In [None]:
learn.predict("I really loved that movie, it was awesome!")

expected output: (Category pos, tensor(1), tensor([7.5928e-04, 9.9924e-01]))