<a href="https://colab.research.google.com/github/vinaykumar2491/Project_MachineLearning/blob/master/imflash217_fastai_3_imdb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Preparing the dataset
```
This IMDB dataset of public movie reviews was curated by Andrew Maas et al. and has the following features:
1. 100,000 reviews in total
2. 25,000 labelled (positive or negative) for training
3. 25,000 labelled for testing
4. 50,000 unlabelled
```

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [0]:
from fastai import *
from fastai.text import *

In [0]:
### loading the IMDB sample dataset as the original dataset is very big
path = untar_data(url=URLs.IMDB_SAMPLE)

In [6]:
path

PosixPath('/root/.fastai/data/imdb_sample')

In [7]:
path.ls()

[PosixPath('/root/.fastai/data/imdb_sample/texts.csv')]

In [8]:
df = pd.read_csv(path/"texts.csv")
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


In [14]:
df["text"][1]

'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it\'s also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.<br /><br />The result is that there is much cruelty and inhumanity in the situation and this is very u

In [0]:
data_lm = TextDataBunch.from_csv(path=path, csv_name="texts.csv")
data_lm.save()

In [26]:
data_lm

TextClasDataBunch;

Train: LabelList (799 items)
x: TextList
xxbos xxmaj yes , i 'm sentimental & schmaltzy ! ! xxmaj but this movie ( and it 's theme song ) remain one of my all time xxunk ! ! xxmaj robert xxmaj xxunk xxmaj jr. does such justice to the role of " xxmaj louis xxmaj xxunk " xxunk and the storyline ( although far - xxunk ) is romantic & makes one believe in happy endings ! !,xxbos xxmaj astronaut xxmaj steve xxmaj west ( xxmaj alex xxmaj rebar ) and his comrades xxunk a space mission that sees them flying through the rings of xxmaj saturn . xxmaj his comrades are killed instantly , but it would seem that they are in fact the lucky ones . xxmaj steve returns to xxmaj earth a constantly xxunk mass of xxunk pulp ; as he turns into a savage killer , melting every step of the way , he is tracked by his friend , xxmaj dr. xxmaj ted xxmaj nelson ( xxmaj burr debenning ) . 
 
  xxmaj this is often so xxunk funny - with enough absurd lines and situations to go around - that it 's 

In [27]:
path.ls()

[PosixPath('/root/.fastai/data/imdb_sample/data_save.pkl'),
 PosixPath('/root/.fastai/data/imdb_sample/texts.csv')]

In [0]:
### Load the DataBunch saved above (data_save.pkl); this is preprocessed data, not a trained model
data = load_data(path)

## Tokenization
```
The first processing step the texts go through is:
* Split the raw sentences into words (more precisely, TOKENS)

The easiest way to do this would be to split the strings on SPACES, but we can be smarter:
* We need to take care of the punctuation
* Some words are contractions of two different words, like isn't or don't
* We may need to clean some parts of the text, for instance if there is HTML code
```
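The points above can be illustrated with a toy tokenizer. This is only a minimal sketch using the standard library; `simple_tokenize` is a hypothetical helper, not the tokenizer fastai uses (which relies on spaCy plus its own rules), and the contraction list is an assumption chosen for the example.

```python
import html
import re

def simple_tokenize(text):
    """Toy tokenizer: clean HTML, split contractions, separate punctuation."""
    text = html.unescape(text)                     # "&amp;" -> "&"
    text = re.sub(r"<br\s*/?>", " ", text)         # HTML line breaks -> whitespace
    text = re.sub(r"<[^>]+>", " ", text)           # drop any other tags
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)  # "didn't" -> "did n't"
    text = re.sub(r"(\w)('(?:s|m|re|ve|ll|d))\b", r"\1 \2", text)  # "it's" -> "it 's"
    # Emit contraction pieces, words, or single punctuation marks.
    return re.findall(r"n't|'\w+|\w+|[^\w\s]", text)

tokens = simple_tokenize("It isn't bad.<br /><br />Don't miss it!")
# tokens == ['It', 'is', "n't", 'bad', '.', 'Do', "n't", 'miss', 'it', '!']
```

Splitting on spaces alone would have produced items such as `bad.<br` and `it!`, which is why punctuation, contractions, and HTML are handled explicitly.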

In [31]:
### Let's take a separate look at what the tokenizer does to the text
data1 = TextClasDataBunch.from_csv(path=path, csv_name="texts.csv")
data1.show_batch()

text,target
"xxbos xxmaj this film sat on my xxmaj tivo for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj yorkers . \n \n xxmaj the format is the same as xxmaj max xxmaj xxunk ' "" xxmaj la xxmaj ronde",positive
"xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first xxup 3d game , or even the first xxunk - up . xxmaj it 's also one of the first stealth games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphics",positive
"xxbos i really wanted to love this show . i truly , honestly did . \n \n xxmaj for the first time , gay viewers get their own version of the "" xxmaj the xxmaj bachelor "" . xxmaj with the help of his obligatory "" hag "" xxmaj xxunk , xxmaj james , a good looking , well - to - do thirty - something has the chance",negative
"xxbos \n \n i 'm sure things did n't exactly go the same way in the real life of xxmaj homer xxmaj hickam as they did in the film adaptation of his book , xxmaj rocket xxmaj boys , but the movie "" xxmaj october xxmaj sky "" ( an xxunk of the book 's title ) is good enough to stand alone . i have not read xxmaj",positive
"xxbos xxmaj to review this movie , i without any doubt would have to quote that memorable scene in xxmaj tarantino 's "" xxmaj pulp xxmaj fiction "" ( xxunk ) when xxmaj jules and xxmaj vincent are talking about xxmaj mia xxmaj wallace and what she does for a living . xxmaj jules tells xxmaj vincent that the "" xxmaj only thing she did worthwhile was pilot "" .",negative


```
The tokenizer above has done the following:
1. The "'s" are grouped together in one token
2. The contractions are separated like this: (did, n't)
3. Content has been cleaned of any HTML symbols
4. Content has been lower-cased
5. There are several special tokens (starting with xx), like xxup (the next word was in upper case), xxmaj (the next word was capitalized), xxunk (unknown token), ...

These special tokens are used to 
    * replace unknown tokens or
    * introduce different text fields
```
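To make the role of `xxmaj` and `xxup` concrete, here is a minimal sketch of how casing information can be kept while lower-casing the text. `add_case_tokens` is a hypothetical helper written for this note; it imitates the behaviour visible in the `show_batch()` output above and is not fastai's own code.

```python
def add_case_tokens(tokens):
    """Lower-case tokens, recording the original casing with xxmaj / xxup."""
    out = ["xxbos"]  # beginning-of-text marker
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out.append("xxup")   # whole word was upper case, e.g. "3D"
        elif len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            out.append("xxmaj")  # word was capitalized, e.g. "Pulp"
        out.append(tok.lower())
    return out

marked = add_case_tokens(["I", "loved", "Pulp", "Fiction", "in", "3D", "!"])
# marked == ['xxbos', 'i', 'loved', 'xxmaj', 'pulp', 'xxmaj', 'fiction',
#            'in', 'xxup', '3d', '!']
```

The vocabulary then only needs one entry per lower-cased word, while the model can still learn from the casing markers.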

## Numericalization
```
Once we have extracted TOKENS from the text, we take the following steps:
1. We convert TOKENS to integers by creating a list of all the words used
2. We only keep the ones that appear at least twice, up to a maximum vocabulary size (default 60K)
3. We replace the TOKENS that don't make the cut (i.e. are not included in the vocabulary) with the unknown token `UNK` (`xxunk`)

The correspondence from ids to tokens is stored in the `vocab` attribute of our datasets, 
in a list called `itos` (for int-to-string)
```
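The steps above can be sketched in plain Python. This is an illustrative sketch, not fastai's implementation: `build_vocab` and `numericalize` are hypothetical names, and the short `SPECIALS` list is an assumption (the real vocabulary reserves more special tokens, as the `itos` output below shows).

```python
from collections import Counter

SPECIALS = ["xxunk", "xxpad", "xxbos"]  # reserved ids; xxunk is id 0 in this sketch

def build_vocab(token_lists, max_vocab=60000, min_freq=2):
    """Return (itos, stoi), keeping tokens that occur at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    for special in SPECIALS:
        counts.pop(special, None)  # specials are added once, at the front
    kept = [tok for tok, n in counts.most_common(max_vocab) if n >= min_freq]
    itos = SPECIALS + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to ids, sending out-of-vocabulary tokens to xxunk."""
    return [stoi.get(tok, stoi["xxunk"]) for tok in tokens]

docs = [["xxbos", "the", "movie", "was", "good"],
        ["xxbos", "the", "movie", "was", "bad"]]
itos, stoi = build_vocab(docs)
ids = numericalize(["xxbos", "the", "movie", "was", "great"], stoi)
# itos == ['xxunk', 'xxpad', 'xxbos', 'the', 'movie', 'was']
# ids  == [2, 3, 4, 5, 0]   ("great" is unknown, so it maps to xxunk)
```

Here "good" and "bad" each appear only once, so they are dropped from the vocabulary and would also be replaced by `xxunk`.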

In [33]:
### The first ten entries of the vocabulary (the special tokens come first)
data1.vocab.itos[:10]

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the']

In [34]:
### taking a look at the TOKENized dataset
data.train_ds[0][0]

Text xxbos xxmaj yes , i 'm sentimental & schmaltzy ! ! xxmaj but this movie ( and it 's theme song ) remain one of my all time xxunk ! ! xxmaj robert xxmaj xxunk xxmaj jr. does such justice to the role of " xxmaj louis xxmaj xxunk " xxunk and the storyline ( although far - xxunk ) is romantic & makes one believe in happy endings ! !

In [35]:
### The TOKENized data above is actually stored as numbers, one id per token,
### so that it can be used to train the neural network
data.train_ds[0][0].data[:10]

array([   2,    5,  384,   10,   19,  176, 4754,  215, 6172,   54])

## Using the datablock API

In [0]:
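### Columns of texts.csv: 0 = label, 1 = text, 2 = is_valid
data = (TextList.from_csv(path=path, csv_name="texts.csv", cols="text")  # read the reviews from the "text" column
                .split_from_df(col=2)       # is_valid marks the validation rows
                .label_from_df(cols=0)      # labels come from the "label" column
                .databunch()
        )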

In [37]:

bs = 16
path = untar_data(url=URLs.IMDB)
path.ls()

[PosixPath('/root/.fastai/data/imdb/README'),
 PosixPath('/root/.fastai/data/imdb/imdb.vocab'),
 PosixPath('/root/.fastai/data/imdb/tmp_lm'),
 PosixPath('/root/.fastai/data/imdb/unsup'),
 PosixPath('/root/.fastai/data/imdb/train'),
 PosixPath('/root/.fastai/data/imdb/tmp_clas'),
 PosixPath('/root/.fastai/data/imdb/test')]

In [0]:
### Creating the DataBunch for our language model
data_lm = (TextList.from_folder(path=path)
                   .filter_by_folder(include=["train", "test", "unsup"])
                   .split_by_rand_pct(0.1)
                   .label_for_lm()
                   .databunch(bs=bs))

### Saving the preprocessed DataBunch (not a trained model) so it can be reloaded later
data_lm.save("data_lm_IMDB.pkl")

In [0]:
data_lm = load_data(path=path, file="data_lm_IMDB.pkl", bs=bs)

In [40]:
data_lm.show_batch()

idx,text
0,"my girlfriend and we saw it together . xxmaj we are both xxmaj xxunk and she loves xxmaj latin films but she totally agrees with me . xxmaj do n't lose time and money , do n't even try this one ! xxmaj love 's a xxmaj bitch , xxmaj nine xxmaj queens , xxmaj cidade xxup de xxmaj deus if you wanna give south xxmaj america / xxmaj latin"
1,the foreign channel ) . xxmaj being a big fan of xxmaj japanese horror films i waited up to watch it . xxmaj its really not bad ! xxmaj the film opens with a particularly gruesome death scene - which got my attention straight away - and it is suggested it was caused by these people using voodoo dolls . xxmaj we are then introduced to a high school and
2,"one kill scene involving a pendulum hanging from the ceiling and a creepy puppet in a clown suit . xxmaj otherwise , xxup class xxup reunion xxup massacre was 90 minutes that i should 've put to better use , like clipping my toe nails or watching paint dry . xxbos i realize the movie has a fine cast , that it was adapted from a play by xxmaj hart"
3,". xxbos xxmaj it seemed like bad cut scenes from a video game . a couple of the cast are known actors , but they do n't want to be known by this film . xxmaj the camera work may have been done by a xxmaj jr. high xxmaj student . xxmaj please forgive any insult i have made to xxmaj jr. xxmaj high cameramen who are competent . xxmaj"
4,xxunk are xxunk love xxmaj en xxmaj xxunk xxmaj xxunk and xxmaj xxunk xxmaj xxunk the xxunk most amazing movie i 've seen in a while . xxbos xxmaj is this movie better than the xxmaj street xxmaj fighter movie with xxup jcvd ? xxmaj yes and xxmaj no ... xxmaj yes because there was better acting and no because it 's 2009 and they still ca n't seem to


In [0]:
### Creating the learner object
learner = language_model_learner(data=data_lm, arch=AWD_LSTM, drop_mult=0.3)

In [43]:
learner.lr_find()
learner.recorder.plot(skip_end=15)

LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.


KeyboardInterrupt: ignored

In [44]:
learner.fit_one_cycle(cyc_len=1, max_lr=1e-2, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time


KeyboardInterrupt: ignored
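For intuition about what `fit_one_cycle` does with `max_lr` and `moms`, here is a simplified sketch of a one-cycle schedule: the learning rate warms up to `max_lr` and then decays, while the momentum moves the opposite way between the two `moms` values. The function names, the cosine shape, and the values `pct_start=0.3` and `div_factor=25` are assumptions made for this sketch; check the fastai documentation for the exact schedule your version uses.

```python
import math

def annealing_cos(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle(step, total_steps, max_lr=1e-2, moms=(0.8, 0.7),
              pct_start=0.3, div_factor=25.0):
    """Return (learning rate, momentum) for a given training step."""
    warm = int(total_steps * pct_start)
    if step < warm:  # phase 1: lr rises, momentum falls
        pct = step / warm
        return (annealing_cos(max_lr / div_factor, max_lr, pct),
                annealing_cos(moms[0], moms[1], pct))
    pct = (step - warm) / (total_steps - warm)  # phase 2: lr decays, momentum rises
    return (annealing_cos(max_lr, max_lr / (div_factor * 1e4), pct),
            annealing_cos(moms[1], moms[0], pct))

for s in (0, 30, 99):
    lr, mom = one_cycle(s, total_steps=100)
    print(f"step {s:3d}: lr={lr:.6f} momentum={mom:.3f}")
```

With these assumed values the run starts at `max_lr / 25` with momentum 0.8, reaches `max_lr` with momentum 0.7 at 30% of training, and then anneals back down.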

In [0]:
learner.save("fit_head")
learner.load("fit_head")

In [0]:
learner.unfreeze()
learner.fit_one_cycle(cyc_len=5, max_lr=1e-3, moms=(0.8,0.7))

In [0]:
learner.save("fine_tuned")

In [0]:
### Testing the model's ability to predict the next words of a given sentence
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
print("\n".join(learner.predict(text=TEXT, n_words=N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
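The `temperature` argument controls how adventurous the generated text is. The sketch below shows the general idea of temperature sampling over a made-up score vector; `temperature_probs` and `sample_next` are hypothetical helpers for illustration and do not reproduce fastai's internals.

```python
import math
import random

def temperature_probs(logits, temperature=0.75):
    """Softmax of logits / temperature (lower temperature = sharper distribution)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=0.75, rng=random):
    """Draw one token id according to the temperature-adjusted probabilities."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]                       # made-up scores for three tokens
print(temperature_probs(logits, 1.0))
print(temperature_probs(logits, 0.75))         # more mass on the top-scoring token
print(sample_next(logits, 0.75, random.Random(0)))
```

A temperature below 1 (such as the 0.75 used above) favours the most likely words, which tends to give more coherent but less varied sentences.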

### We also need to save the ENCODER along with the MODEL.
```
The ENCODER is the part of the model that is responsible for creating and updating the hidden state.
For the next part (the classifier) we don't care about the part that tries to guess the next word.
```

In [0]:
learner.save_encoder("fine_tuned_encoder")

## Classifier

In [0]:
data_classifier = (TextList.from_folder(path=path, vocab=data_lm.vocab)
                           .split_by_folder(valid="test")
                           .label_from_folder(classes=["neg", "pos"])
                           .databunch(bs=bs))

data_classifier.save("data_classifier.pkl")

In [0]:
data_classifier = load_data(path=path, file="data_classifier.pkl", bs=bs)

In [0]:
data_classifier.show_batch()

In [0]:
### Step: Create a model to classify the reviews/texts
### Step: Load the ENCODER we saved before

learner = text_classifier_learner(data=data_classifier, arch=AWD_LSTM, drop_mult=0.5)
learner.load_encoder("fine_tuned_encoder")

In [0]:
learner.lr_find()
learner.recorder.plot()

In [0]:
learner.fit_one_cycle(cyc_len=4, max_lr=2e-2, moms=(0.8,0.7))
learner.save("stage1")
learner.load("stage1")

In [0]:
learner.freeze_to(-2)
learner.fit_one_cycle(cyc_len=4, max_lr=slice(1e-2/(2.6**4), 1e-2), moms=(0.8,0.7))
learner.save("stage2")
learner.load("stage2")
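The `slice(1e-2/(2.6**4), 1e-2)` argument gives different learning rates to different layer groups: the earliest group gets the smallest rate and the last group the largest. The sketch below shows the arithmetic, assuming the rates are spaced geometrically and that the classifier has five layer groups; `even_mults` here is a small re-implementation written for this note, and both assumptions should be checked against your fastai version.

```python
def even_mults(start, stop, n):
    """n values from start to stop, each a constant multiple of the previous one."""
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

n_groups = 5                                   # assumed number of layer groups
lrs = even_mults(1e-2 / 2.6**4, 1e-2, n_groups)
for i, lr in enumerate(lrs):
    print(f"layer group {i}: lr = {lr:.6f}")
```

With five groups the exponent 4 in `2.6**4` makes each group's learning rate 2.6 times that of the group before it, ending at `1e-2` for the final group.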

In [0]:
learner.freeze_to(-3)
learner.fit_one_cycle(cyc_len=4, max_lr=slice(5e-3/(2.6**4), 5e-3), moms=(0.8,0.7))
learner.save("stage3")
learner.load("stage3")

In [0]:
learner.unfreeze()
learner.fit_one_cycle(cyc_len=4, max_lr=slice(1e-3/(2.6**4), 1e-3), moms=(0.8,0.7))
learner.save("stage4")
learner.load("stage4")

In [0]:
learner.predict(text="I really loved that movie; it was awesome!!!")