The chapter 10 notebook from the book: https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb mentions the following:

> We reached 94.3% accuracy, which was state-of-the-art performance just three years ago. By training another model on all the texts read backwards and averaging the predictions of those two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the ULMFiT paper. It was only beaten a few months ago, by fine-tuning a much bigger model and using expensive data augmentation techniques (translating sentences in another language and back, using another model for translation).

A good way to consolidate my understanding of [lesson 4](https://course.fast.ai/Lessons/lesson4.html) of the fast.ai course would be to try to replicate this forward + backward LM language model training (the book only covers forward training), averaging the weights, and confirming I get a commensurate accuracy increase.

In [5]:
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [6]:
from fastbook import *
from IPython.display import display,HTML

In [7]:
import torch
torch.cuda.empty_cache()

### Load Data

In [8]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

Note: The IMDB dataset is large (at least relative to the size of my GPU/paperspace instance). I'm not trying to train a "real" predictive model here, I just want to get the gist and intuition of the approaches. Thus, I've defined the `get_imdb` method to load a configurable subset of text data (by default 0.01), which can be increased if I get a bigger GPU with more memory.

In [29]:
def get_imdb(path, folders, pct):
    files = get_text_files(path, folders=folders)
    n_files = int(len(files) * pct)
    return files[:n_files]

pct = 0.1
bs = 48
seq_len = 60
get_items = partial(get_imdb, folders=['train', 'test'], pct=pct)

In [12]:
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_items, splitter=RandomSplitter(0.2)
).dataloaders(path, path=path, bs=bs, seq_len=seq_len)
dls_lm.show_batch(max_n=2)

xxbos xxmaj acting xxmaj this film is a very well acted film . i will say that the performances are slightly weak at times ; but for the most part , the acting is very good . xxmaj the only actor that blew me away with his performance was xxmaj jude xxmaj law as xxmaj harlen xxmaj maguire . xxmaj
journey . " xxmaj this is a movie that is heaps of fun to watch , xxmaj keanu and xxmaj alex make a great on screen team reprising their characters from " bill and xxmaj ted 's excellent adventure " with even more ' style ' then they had in 1st movie . xxmaj it 's not rocket science but
xxmaj acting xxmaj this film is a very well acted film . i will say that the performances are slightly weak at times ; but for the most part , the acting is very good . xxmaj the only actor that blew me away with his performance was xxmaj jude xxmaj law as xxmaj harlen xxmaj maguire . xxmaj he
. " xxmaj this is a movie that is heaps of fun to watch , xxmaj keanu and xxmaj alex make a great on screen team repris

## Finetune the Language Model

Finetune the language model on the IMDB data, as in the notebook.

In [13]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.5, 
    metrics=[accuracy, Perplexity()]).to_fp16()

In [19]:
print(len(learn.dls.train))
print(len(learn.dls.valid))

423

In [16]:
learn.fit_one_cycle(1, 2e-2)
learn.save('1epoch')
learn.fit_one_cycle(5, 2e-3)
learn.save_encoder('finetuned')

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.285134,4.148143,0.285002,63.31633,02:43


  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.887031,4.11717,0.288344,61.385296,02:43
1,3.855094,4.087678,0.290501,59.601364,02:43
2,3.754683,4.073962,0.291986,58.789436,02:43
3,3.673376,4.067539,0.292753,58.413017,02:44
4,3.667172,4.066792,0.292801,58.369442,02:44


On 1% of the data, accuracy is 23% after 10 epochs (curiously, it seems to degrade over time...this may be a function of data).

## Finetune the Classifier

As in the notebook, now that we've finetuned the language model on IMDB data, we're now going to finetune the finetuned language model on sentiment classification.

Define a new data loader that is pinned to the previous data loader's vocab.

In [35]:
# Note: need to use more documents since the classification model considers each document as an `item` whereas language modeling considers
# each sequence_len of tokens as a batch. 
get_items_for_clas = partial(get_imdb, folders=['train', 'test'], pct=0.5)
# Note: 
# - here we need the label (since this is not a language modeling task)
# - The label is the sentiment (positive or negative)
# - And it's obtained from the parent folder 
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y = parent_label,  
    get_items=get_items_for_clas,
    splitter=RandomSplitter(0.2)  # forget about 'train' and 'valid' folders, just split randomly
).dataloaders(path, path=path, bs=bs, seq_len=seq_len)

Define the learner by creating a text classifier, then loading the finetuned encoder.

In [36]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()
learn = learn.load_encoder('finetuned')

And do the training

In [37]:
print(len(learn.dls.valid))
print(len(learn.dls.train))

105
416


In [38]:
learn.fit_one_cycle(1, 2e-2)

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,time
0,0.412384,0.372616,0.8404,02:20


In [39]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,time
0,0.339496,0.267835,0.8934,02:47


In [40]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,time
0,0.247152,0.207702,0.9172,04:15


In [41]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,time
0,0.193395,0.193456,0.925,05:35
1,0.185944,0.192839,0.9282,05:35


Note: 92.8% accuracy for the forward classifier.

In [42]:
learn.save('forwards_clas')

Path('/home/paperspace/.fastai/data/imdb/models/forwards_clas.pth')

## Training Backwards Versions

Now, let's do the interesting part: train backwards versions of these models, average them together, and confirm the overall accuracy improves as indicated in the book. 

The approach is:
- reverse the text
- train a backwards version language model on the reversed text
- train a backwards classifier on the reversed text (same labels)
- create an ensemble model that averages the forwards classifier and the backwards classier
- evaluate the ensemble model on the validation set



Reverse the text by passing `backwards=True`

In [43]:
dls_lm_reversed = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True, backwards=True),
    get_items=get_items, splitter=RandomSplitter(0.2)
).dataloaders(path, path=path, bs=bs, seq_len=seq_len)
dls_lm_reversed.show_batch(max_n=2)

:) ! ) time through propaganda of fashion of curiosity for , once least at ( it see must xxmaj .. indian xxmaj local drunk a from manhattan xxmaj of purchasing the and amsterdam xxmaj new xxmaj of " xxunk " the pretty and columbus xxmaj by continent xxmaj american xxmaj the of discovering the describing of way the is
aidan xxmaj help to order in xxmaj . illuminate to how brendon xxmaj teach to offers aiden xxmaj chagrin 's uncle his to much xxmaj . personality warm his and him towards drifts brendon xxmaj , monastery destroyed a from arrives aidan xxmaj xxunk legendary a when xxmaj . time own his in things doing by uncle his of ire
! ) time through propaganda of fashion of curiosity for , once least at ( it see must xxmaj .. indian xxmaj local drunk a from manhattan xxmaj of purchasing the and amsterdam xxmaj new xxmaj of " xxunk " the pretty and columbus xxmaj by continent xxmaj american xxmaj the of discovering the describing of way the is funny
xxmaj help to order in xxmaj . illumin

In [44]:
learn_reversed = language_model_learner(
    dls_lm_reversed, AWD_LSTM, drop_mult=0.5, 
    metrics=[accuracy, Perplexity()]).to_fp16()

In [45]:
learn_reversed.fit_one_cycle(1, 2e-2)
learn_reversed.save('1epoch_reversed')
learn_reversed.unfreeze()
learn_reversed.fit_one_cycle(5, 2e-3)
learn_reversed.save_encoder('finetuned_reversed')

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,5.175122,5.044724,0.234684,155.201385,02:45


  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.571108,4.770617,0.260896,117.992081,03:05
1,4.323902,4.521734,0.287128,91.994957,03:05
2,4.066789,4.437424,0.298586,84.556816,03:05
3,3.757337,4.437806,0.301543,84.589157,03:04
4,3.515084,4.475455,0.301009,87.834579,03:05


Since `dls_clas` and `dls_clas_reversed` are used to train classifiers that will ultimately be ensembled together, we need to ensure the validation set of both models are the same documents in the same order.

In [49]:
# Create a custom splitter that uses the exact same indices
def fixed_splitter(items):
    # Get validation set indices from dls_clas
    valid_idxs = dls_clas.splits[1]
    
    # Create train indices (everything not in valid)
    all_idxs = set(range(len(items)))
    valid_set = set(valid_idxs)
    train_idxs = list(all_idxs - valid_set)
    return train_idxs, list(valid_idxs)
    
dls_clas_reversed = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab, backwards=True), CategoryBlock),
    get_y = parent_label,  
    get_items=get_items_for_clas,  # same function as above
    splitter=fixed_splitter
).dataloaders(path, path=path, bs=bs, seq_len=seq_len)

In [50]:
learn_reversed = learn_reversed.load_encoder('finetuned_reversed')
learn_reversed = text_classifier_learner(dls_clas_reversed, AWD_LSTM, drop_mult=0.5, 
                                        metrics=accuracy).to_fp16()

Ensure dls_clas and dls_clas_reversed have the same labels in the same order.

In [73]:
len(dls_clas.valid.items) == len(dls_clas_reversed.valid.items)

True

In [74]:
all(dls_clas.valid.items[0] == dls_clas_reversed.valid.items[0] for i in range(len(dls_clas.valid.items)))

True

In [76]:
all(dls_clas.valid.items[0] == dls_clas_reversed.valid.items[0] for i in range(len(dls_clas.valid.items)))
all(i1[1] == i2[1] for i1, i2 in zip(dls_clas.valid_ds, dls_clas_reversed.valid_ds))

True

In [77]:
learn_reversed.fit_one_cycle(1, 2e-2)
learn_reversed.freeze_to(-2)
learn_reversed.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
learn_reversed.freeze_to(-3)
learn_reversed.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
learn_reversed.unfreeze()
learn_reversed.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,time
0,0.486241,0.441408,0.7982,02:20


  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,time
0,0.398709,0.355015,0.8462,02:46


  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,time
0,0.31413,0.266313,0.8918,04:16


  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,time
0,0.277666,0.250407,0.8994,05:35
1,0.231261,0.247488,0.9008,05:35


Accuracy here is 90.1%

In [78]:
learn_reversed.save('reversed_clas')

Path('/home/paperspace/.fastai/data/imdb/models/reversed_clas.pth')

## Ensembling

Now, we have classifiers `learn` (92.8%) and `learn_reversed` (90.1%). 

Next target: average their predictions together on the `valid` dataset and see if the averaged predictions eclipse what we found on either the individual classifiers.

In [80]:
# Get predictions on validation set (we confirmed above that dls_clas.valid and dls_clas_reverse.valid are ordered the same)
preds_forward, targs_forward = learn.get_preds(dl=dls_clas.valid)
preds_reversed, targs_reversed = learn_reversed.get_preds(dl=dls_clas_reversed.valid)

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


In [95]:
targs_forward[:5]

tensor([1, 0, 0, 1, 0])

In [96]:
preds_forward[:5]

tensor([[4.2554e-04, 9.9957e-01],
        [9.7327e-01, 2.6734e-02],
        [9.8475e-01, 1.5248e-02],
        [8.3772e-04, 9.9916e-01],
        [9.5573e-01, 4.4266e-02]])

In [97]:
dls_clas.vocab[1]

['neg', 'pos']

In [98]:
# Average the predictions
ensemble_preds = (preds_forward + preds_reversed) / 2

# Calculate accuracy
from fastai.metrics import accuracy
acc = accuracy(ensemble_preds, targs_forward)
print(f"Ensemble accuracy: {acc}")

Ensemble accuracy: TensorBase(0.9296)


The ensembled accuracy is 92.96%, which is a slight nudge up in accuracy from just the forward preds.