# NLP Model Distillation
The aim of this task was to distill a small sentiment classifier from a pretrained RoBERTa teacher model (delivered via the `bert-fast` library). For the student model, GloVe embeddings were fed into a variant of the computer-vision [ResNeXt model](https://arxiv.org/abs/1611.05431) that was adapted to accommodate 1D sequences instead of 2D image inputs. All training commands (given below) are executed from the command line. This notebook presents results rather than training the models, but if you have checkpoints available you can run the evaluation below.

## Student Architecture
The student architecture design choices were as follows:
* Avoid attention-based residual blocks/models, as these would significantly increase student inference time, which defeats the point of distillation.
* For the model trunk, use the tried-and-tested 2016 ResNeXt model, as this should make reasonably efficient use of parameters.
* Use ResNeXt-18 instead of any of ResNeXt-{34, 50, 101, 152}.

I used the `torchvision` implementation of ResNeXt from [this link](https://github.com/pytorch/vision/blob/052edcecef3eb0ae9fe9e4b256fa2a488f9f395b/torchvision/models/resnet.py#L14) and made the following changes:
1. Replace the 2D convolutions with 1D operators to support 1D input sequences instead of 2D images.
2. Reduce model size significantly. Specifically, reduce the base channel sizes from `[64, 128, 256, 512]` to `[16, 32, 64, 128]`. This reduced the number of parameters from 9M to 0.6M.
3. When downsampling the feature map, use max-pooling and convolutions in parallel (instead of max-pooling alone) and concatenate the results. This also increases the feature map size (the previous defaults collapsed most sequences to a length of 1 very early in the network).
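
To see why the default downsampling is too aggressive for short text, the sketch below traces the length of a sequence through the downsampling stages of the stock torchvision trunk, read along a single axis (a 7-wide stride-2 stem convolution, a stride-2 max-pool, then three stride-2 stages). The 20-token and 8-token inputs are assumptions chosen for illustration, not measurements from this dataset, and the trace covers the unmodified defaults rather than the student described above.

```python
def conv_out_len(length: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a 1D convolution or pooling layer (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) for each downsampling stage of the default
# torchvision ResNet/ResNeXt trunk, taken along one spatial axis.
DEFAULT_STAGES = [
    ("stem conv", 7, 2, 3),
    ("stem maxpool", 3, 2, 1),
    ("layer2", 3, 2, 1),
    ("layer3", 3, 2, 1),
    ("layer4", 3, 2, 1),
]


def trace_lengths(length, stages=DEFAULT_STAGES):
    """Return the sequence length before and after every downsampling stage."""
    lengths = [length]
    for _name, kernel, stride, padding in stages:
        length = conv_out_len(length, kernel, stride, padding)
        lengths.append(length)
    return lengths


print(trace_lengths(20))  # [20, 10, 5, 3, 2, 1]
print(trace_lengths(8))   # [8, 4, 2, 1, 1, 1]
```

Under these assumptions an 8-token input is already down to a single position after the first strided stage, which leaves the later blocks nothing to convolve over.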


## Student from scratch
To train the student from scratch run the following command:

```bash
python distill/train/train_student.py \
    --expt_name from-scratch \
    --input_csv input-csv-containing-labelled-text.csv
```

This will train and validate the model on a 60:20 train/validation split of `--input_csv`; the final 20% is set aside as a test set.
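
As a rough illustration of a 60:20:20 split, the sketch below shuffles row indices with a fixed seed and cuts them into three lists. The function name, the seed, and the 1000-row example are hypothetical; the actual splitting logic lives in `train_student.py` and may differ.

```python
import random


def split_indices(n_rows, train_pct=60, val_pct=20, seed=0):
    """Shuffle row indices and cut them into train/validation/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the test set stable
    n_train = n_rows * train_pct // 100
    n_val = n_rows * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder, so no row is dropped
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 600 200 200
```

Fixing the seed matters here because the from-scratch and distilled students are compared on the same held-out rows.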

## Distillation
To distill a provided `bert-fast` model directory into a randomly initialized student model run the following command:

```bash
python distill/train/distill_from_teacher.py \
    --expt_name teacher \
    --input_csv input-csv-containing-unlabelled-text.csv
```
This will train the student model on the teacher's softmax outputs.  
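
One common way to train on soft targets is to minimize the cross-entropy between the teacher's probabilities and the student's softmax output. The framework-free sketch below shows that loss for a single example; it is an assumption about the general technique, not a copy of the loss used in `distill_from_teacher.py`, and the probabilities and logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_cross_entropy(student_logits, teacher_probs):
    """Cross-entropy between teacher probabilities and the student softmax."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Hypothetical teacher output over [negative, neutral, positive].
teacher_probs = [0.1, 0.7, 0.2]

agree = soft_target_cross_entropy([0.0, 2.0, 0.5], teacher_probs)
disagree = soft_target_cross_entropy([2.0, 0.0, 0.5], teacher_probs)
print(agree < disagree)  # True: the loss is lower when the student agrees
```

Unlike a hard label, the soft target also tells the student how the teacher ranks the two losing classes, which is why no labels are needed in the input CSV.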

## Aside: Potential Optimization
The distillation is slow, and unnecessarily so, because I regenerate the teacher outputs on every epoch. A simple optimization would be to compute these once at the start of training.
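
A minimal sketch of that optimization, assuming the training texts are fixed across epochs (no augmentation): run the teacher once, keep its outputs, and reuse them in every epoch. `CountingTeacher` is a hypothetical stand-in for the RoBERTa forward pass, used only to count how often the teacher is called.

```python
def precompute_teacher_probs(teacher_fn, texts):
    """Run the teacher once per text and keep the outputs in memory."""
    return [teacher_fn(text) for text in texts]


class CountingTeacher:
    """Hypothetical stand-in for the expensive teacher forward pass."""

    def __init__(self):
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        return [0.2, 0.5, 0.3]  # dummy [negative, neutral, positive] probabilities


texts = ["stocks rally", "markets flat", "shares slump"]
teacher_fn = CountingTeacher()
cached_probs = precompute_teacher_probs(teacher_fn, texts)

n_epochs = 5
for epoch in range(n_epochs):
    for text, target in zip(texts, cached_probs):
        pass  # a real loop would update the student towards `target` here

print(teacher_fn.calls)  # 3 calls in total, rather than 3 * n_epochs = 15
```

For a large unlabelled corpus the cached outputs may not fit in memory, in which case writing them to disk once would serve the same purpose.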

# Evaluation
I load local versions of these models to generate the results below. To run the cells, you first need to train the models and manually select the best epoch on the validation set after viewing the training printouts. The two student models and the teacher are all evaluated on my held-out test set of labelled headlines.

In [1]:
cd ..

/home/julian/challenges/permutable-test


In [2]:
student_from_scratch_fp = './logs/n5/ConvClassifier_90.pt'
student_distilled_fp = './logs/distill/distill2/ConvClassifier_7.pt'
teacher_dir = './model'

In [3]:
import argparse 
import copy 
import time 

import torch 

from distill.evaluate import evaluate, print_eval_res
from distill.train.train_student import add_train_args, train_init
from distill.labels import probs_to_labels, all_labels
from distill.teacher import TeacherNLPClassifier
from distill.train.train_teacher import unpack_batch_send_to_device as unpack_batch_teacher

In [4]:
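def init_and_load_student_model(ckpt_path):
    """Build the student via the training setup code, then load checkpoint weights."""
    parser = argparse.ArgumentParser()
    parser = add_train_args(parser)
    args = parser.parse_args()
    # train_init returns a dict holding the model and its data loaders.
    student_dict = train_init(args)
    student = student_dict['model']
    # The checkpoint stores the weights under the 'model' key.
    student.load_state_dict(torch.load(ckpt_path)['model'])
    return student_dict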

In [5]:
scratch_dict = init_and_load_student_model(student_from_scratch_fp)
distilled_dict = init_and_load_student_model(student_distilled_fp)
teacher = TeacherNLPClassifier(teacher_dir)

test_loader = scratch_dict['test_loader']

### Evaluate student trained from scratch

In [6]:
results = evaluate(
    **scratch_dict, 
    loader=test_loader, 
    subset='test',
    probs_to_labels=probs_to_labels, 
    all_labels=all_labels
)
print_eval_res(results)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Accuracy:       	av=64.7%  
F1 Scores:      	negative=0.351  neutral=0.782  positive=0.431  av=0.521  av_weight=0.634  micro=0.647  
Confusion [negative,neutral,positive]
[[ 37  47  30]
 [ 20 487  80]
 [ 40 125 104]]


### Evaluate distilled student

In [7]:
t1 = time.time()
results = evaluate(
    **distilled_dict, 
    loader=test_loader, 
    subset='test',
    probs_to_labels=probs_to_labels, 
    all_labels=all_labels
)
t2 = time.time()
print_eval_res(results)
print(f'Time taken for evaluation = {(t2 - t1):.3f}s')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Accuracy:       	av=74.8%  
F1 Scores:      	negative=0.574  neutral=0.846  positive=0.553  av=0.658  av_weight=0.733  micro=0.748  
Confusion [negative,neutral,positive]
[[ 66  35  13]
 [ 15 540  32]
 [ 35 114 120]]
Time taken for evaluation = 2.284s


### Evaluate teacher

In [8]:
t1 = time.time()
results = evaluate(
    model=teacher,
    unpack_batch_fn=unpack_batch_teacher,
    loader=test_loader, 
    subset='test',
    probs_to_labels=probs_to_labels, 
    all_labels=all_labels
)
t2 = time.time()
print_eval_res(results)
print(f'Time taken for evaluation = {(t2 - t1):.3f}s')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Accuracy:       	av=93.3%  
F1 Scores:      	negative=0.902  neutral=0.947  positive=0.915  av=0.921  av_weight=0.933  micro=0.933  
Confusion [negative,neutral,positive]
[[101  12   1]
 [  7 555  25]
 [  2  18 249]]
Time taken for evaluation = 23.307s


## Discussion
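- Distillation improved the student's topline accuracy by about 10 percentage points (64.7% -> 74.8%).
- The distilled student also ran the evaluation about 10x faster than the teacher (2.3s vs 23.3s on a large GPU; the gap on CPU is likely greater).
- There is still significant room for improvement for the `distilled` model vs the `teacher` (74.8% vs 93.3% accuracy).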
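
The headline numbers can be re-derived from the confusion matrices and timings printed above. The sketch below recomputes accuracy and per-class F1 in plain Python; the matrices and timings are copied from the cell outputs, while the helper names are my own.

```python
LABELS = ["negative", "neutral", "positive"]

# Confusion matrices copied from the evaluation outputs above.
CONFUSION = {
    "scratch": [[37, 47, 30], [20, 487, 80], [40, 125, 104]],
    "distilled": [[66, 35, 13], [15, 540, 32], [35, 114, 120]],
    "teacher": [[101, 12, 1], [7, 555, 25], [2, 18, 249]],
}


def accuracy(cm):
    """Fraction of examples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def f1_per_class(cm):
    """Per-class F1 = 2*TP / (row sum + column sum)."""
    scores = []
    for i in range(len(cm)):
        true_pos = cm[i][i]
        row_total = sum(cm[i])
        col_total = sum(row[i] for row in cm)
        scores.append(2 * true_pos / (row_total + col_total))
    return scores


for name, cm in CONFUSION.items():
    f1 = ", ".join(f"{l}={s:.3f}" for l, s in zip(LABELS, f1_per_class(cm)))
    print(f"{name}: accuracy={accuracy(cm):.1%}  {f1}")

gain = 100 * (accuracy(CONFUSION["distilled"]) - accuracy(CONFUSION["scratch"]))
speedup = 23.307 / 2.284  # teacher vs distilled evaluation time, in seconds
print(f"gain={gain:.1f} points, speedup={speedup:.1f}x")
```

Because F1 is symmetric in the row and column totals, the calculation does not depend on whether rows hold the true or the predicted labels.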