# Experiment notebook for "Context is All You Need"
In this notebook we conduct experiments involving few-shot fine-tuning, in-context learning (ICL), and a novel implementation of context distillation.

To install the development environment, run the following:
```
conda env create -f environment.yml
conda activate fine-tuning
```

If module imports fail, run `pip install -e .`.

In [27]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


If the check below fails, verify your PyTorch installation by following the steps at https://pytorch.org/get-started/locally/.

In [36]:
from src.utils import cuda_check

cuda_check()

Cuda available: True
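
For readers who do not have `src/utils.py` at hand, a check of this kind can be approximated as shown below. This is a minimal sketch, not the repository's implementation: the name `cuda_check_sketch`, its boolean return value, and the fallback message are assumptions. The only library call it relies on is PyTorch's `torch.cuda.is_available()`.

```python
def cuda_check_sketch() -> bool:
    """Hypothetical stand-in for src.utils.cuda_check (assumed behaviour)."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        print("PyTorch is not installed")
        return False
    available = torch.cuda.is_available()
    print(f"Cuda available: {available}")
    return available


cuda_ok = cuda_check_sketch()
```

If this prints `Cuda available: False` on a machine with a GPU, the installed PyTorch build is most likely a CPU-only one.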


### Import datasets

Datasets are imported with helper functions from `src/data/data.py`. They are downloaded from Hugging Face and stored in `/data`; once downloaded, they are loaded from the local copy.

Our in-domain dataset is [MNLI](https://huggingface.co/datasets/glue). Our out-of-domain dataset is [HANS](https://huggingface.co/datasets/hans).

In [4]:
from src.data.data import get_in_domain, get_out_domain

in_domain_train, in_domain_test = get_in_domain()
out_domain = get_out_domain()

print(f"In domain (MNLI):\n{in_domain_train}")
print(in_domain_train[1])

print(f"\nOut of domain (HANS):\n{out_domain}")
print(out_domain[10])

In domain (MNLI):
Dataset({
    features: ['premise', 'hypothesis', 'label', 'idx'],
    num_rows: 261802
})
{'premise': 'One of our number will carry out your instructions minutely.', 'hypothesis': 'A member of my team will execute your orders with immense precision.', 'label': 0, 'idx': 2}

Out of domain (HANS):
Dataset({
    features: ['premise', 'hypothesis', 'label', 'parse_premise', 'parse_hypothesis', 'binary_parse_premise', 'binary_parse_hypothesis', 'heuristic', 'subcase', 'template'],
    num_rows: 10000
})
{'premise': 'The president avoided the athlete .', 'hypothesis': 'The athlete avoided the president .', 'label': 1, 'parse_premise': '(ROOT (S (NP (DT The) (NN president)) (VP (VBD avoided) (NP (DT the) (NN athlete))) (. .)))', 'parse_hypothesis': '(ROOT (S (NP (DT The) (NN athlete)) (VP (VBD avoided) (NP (DT the) (NN president))) (. .)))', 'binary_parse_premise': '( ( The president ) ( ( avoided ( the athlete ) ) . ) )', 'binary_parse_hypothesis': '( ( The athlete ) ( ( a

### Import models

Models are imported with helper functions from `src/models/opt.py`. They are downloaded from Hugging Face and stored in `/models/pretrained`; once downloaded, they are loaded from the local copy.

In [None]:
# from src.model.model import get_model

# # Get SequenceClassification models
# model_opt125, tokenizer_opt125 = get_model(model_name='opt-125m', model_type='SequenceClassification', pretrained=True)
# model_opt350, tokenizer_opt350 = get_model(model_name='opt-350m', model_type='SequenceClassification', pretrained=True)

# # Get CausalLM models
# model_opt125_causal, tokenizer_opt125_causal = get_model(model_name='opt-125m', model_type='CausalLM', pretrained=True)
# model_opt350_causal, tokenizer_opt350_causal = get_model(model_name='opt-350m', model_type='CausalLM', pretrained=True)

### Create training and evaluation datasets

The `get_random_subsets` method from `src/data/data.py` creates a dictionary of training data keyed by sample size, plus in-domain and out-of-domain evaluation sets. Each sample size contains 10 randomly generated trials of that size. Evaluation sets contain 50 samples and are generated only once, to ensure consistent comparison across fine-tuning methods.

Before generating our datasets, we set random seeds to ensure reproducibility.

In [38]:
import numpy as np
import random
import pprint
from src.data.data import get_random_subsets

# Seed random generators
seed = 42
np.random.seed(seed)
random.seed(seed)

# Generate training and evaluation datasets
train_datasets, eval_dataset_in, eval_dataset_out = get_random_subsets(train_dataset=in_domain_train,
                                                                       eval_dataset_in=in_domain_test,
                                                                       eval_dataset_out=out_domain,
                                                                       train_sample_sizes=[2, 4, 8, 16],
                                                                       num_trials=10,
                                                                       eval_sample_size=50)

print("Train datasets:")
pprint.pprint(train_datasets, depth=1)
print("Eval datasets:")
pprint.pprint(eval_dataset_in, depth=1)
pprint.pprint(eval_dataset_out, depth=1)

Train datasets:
{2: [...], 4: [...], 8: [...], 16: [...]}
Eval datasets:
Dataset({
    features: ['premise', 'hypothesis', 'label', 'idx'],
    num_rows: 50
})
Dataset({
    features: ['premise', 'hypothesis', 'label', 'parse_premise', 'parse_hypothesis', 'binary_parse_premise', 'binary_parse_hypothesis', 'heuristic', 'subcase', 'template'],
    num_rows: 50
})
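
The sampling scheme described above (several random trials per sample size, one fixed evaluation subset) can be sketched in plain Python. This is an illustration under assumptions, not the code in `src/data/data.py`: the function `random_subsets_sketch`, its local `random.Random` generator, and the toy rows are all hypothetical.

```python
import random


def random_subsets_sketch(train_rows, eval_rows, train_sample_sizes,
                          num_trials, eval_sample_size, seed=42):
    """Hypothetical illustration of per-size random trials (not the repo code)."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    # One evaluation subset, drawn a single time and shared by every method
    eval_subset = rng.sample(eval_rows, eval_sample_size)
    # For each sample size, draw `num_trials` independent training subsets
    train_subsets = {
        size: [rng.sample(train_rows, size) for _ in range(num_trials)]
        for size in train_sample_sizes
    }
    return train_subsets, eval_subset


# Toy rows standing in for dataset examples (assumption, not real data)
train_rows = [{"idx": i} for i in range(1000)]
eval_rows = [{"idx": i} for i in range(200)]

subsets, eval_subset = random_subsets_sketch(
    train_rows, eval_rows, train_sample_sizes=[2, 4, 8, 16],
    num_trials=10, eval_sample_size=50)

print({size: len(trials) for size, trials in subsets.items()})
print(len(eval_subset))
```

Because the generator is seeded, calling the function twice with the same arguments yields identical subsets, which is the property the fixed seed in the cell above is meant to provide.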


### Zero-shot baseline

We evaluate both models on the in-domain and out-of-domain evaluation sets with no training or context, using the `generate` method. The results serve as a baseline for the other methods. Model parameters are not updated.

In [None]:
from src.finetuners.zeroshot import batch_evaluate

metrics = batch_evaluate(model_names=['opt-125m', 'opt-350m'], 
                         eval_dataset_in=eval_dataset_in, 
                         eval_dataset_out=eval_dataset_out, 
                         verbose=False, 
                         disable_tqdm=False)

print("Metrics:")
pprint.pprint(metrics)
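
Evaluating with `generate` means the model returns free text that must be mapped back to a class label before accuracy can be computed. The sketch below shows one way this could be done. It is not taken from `src/finetuners/zeroshot.py`: the helpers `parse_label` and `accuracy`, the "Yes"/"No" verbalizers, and the sample strings are assumptions made for illustration.

```python
def parse_label(generated_text, verbalizers):
    """Return the label whose verbalizer starts the generated text, or -1 if none does."""
    text = generated_text.strip().lower()
    for label, word in verbalizers.items():
        if text.startswith(word.lower()):
            return label
    return -1  # unparseable generations count as wrong


def accuracy(predictions, references):
    """Fraction of predictions that equal their reference label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Assumed verbalizers: 0 -> "Yes" (entailment), 1 -> "No" (non-entailment)
verbalizers = {0: "Yes", 1: "No"}

# Made-up generations for illustration only, not model output
generated = [" Yes", "No.", "maybe", " yes, it does"]
references = [0, 1, 1, 1]

predictions = [parse_label(text, verbalizers) for text in generated]
print(predictions, accuracy(predictions, references))
```

Treating unparseable generations as errors keeps the metric conservative; a different policy (for example, skipping them) would change the reported accuracy.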

### Few-shot fine-tuning

We fine-tune both models on 10 trials of training data per sample size and evaluate each fine-tuned model on the in-domain and out-of-domain evaluation sets. This method updates all model parameters. Fine-tuned models can be saved locally to `models/finetuned/` by setting the `save_trials` parameter to `True`.

In [None]:
from src.finetuners.fewshot import batch_fine_tune

metrics, training_histories = batch_fine_tune(model_name=['opt-125m', 'opt-350m'], 
                                              train_datasets=train_datasets, 
                                              eval_dataset_in=eval_dataset_in, 
                                              eval_dataset_out=eval_dataset_out, 
                                              save_trials=False)

print("Metrics:")
pprint.pprint(metrics)
print("Training histories:")
pprint.pprint(training_histories)
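
With several trials per sample size, it is usually more informative to look at the mean and spread across trials than at individual runs. The sketch below assumes the metrics are available as a flat list of per-trial records with `sample_size` and `accuracy` keys; that layout, the helper `summarize_trials`, and the numbers are hypothetical and do not reflect the actual structure returned by `batch_fine_tune`.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_trials(trial_metrics, metric="accuracy"):
    """Group per-trial records by sample size and report mean and standard deviation."""
    by_size = defaultdict(list)
    for row in trial_metrics:
        by_size[row["sample_size"]].append(row[metric])
    return {
        size: {
            "mean": mean(values),
            # stdev needs at least two values; fall back to 0.0 for a single trial
            "std": stdev(values) if len(values) > 1 else 0.0,
            "trials": len(values),
        }
        for size, values in sorted(by_size.items())
    }


# Placeholder records for illustration only, not experimental results
example_metrics = [
    {"sample_size": 2, "accuracy": 0.5},
    {"sample_size": 2, "accuracy": 0.75},
    {"sample_size": 4, "accuracy": 1.0},
]

summary = summarize_trials(example_metrics)
print(summary)
```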

### In-context learning (ICL)

ICL is performed like zero-shot evaluation, using the `generate` method, except that context (labeled training examples) is prepended to each evaluation example. Model parameters are not updated.

In [None]:
from src.finetuners.incontext import batch_evaluate

metrics = batch_evaluate(model_names=['opt-125m'],
                         train_datasets=train_datasets,
                         eval_dataset_in=eval_dataset_in,
                         eval_dataset_out=eval_dataset_out)

print("Metrics:")
pprint.pprint(metrics)
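
Prepending context to an evaluation example amounts to building a single prompt string: labeled demonstrations first, then the unlabeled query. The sketch below illustrates the idea using the MNLI and HANS examples printed earlier in this notebook. The template ("Premise / Hypothesis / Answer"), the "Yes"/"No" label words, and the helper names are assumptions, not the format used in `src/finetuners/incontext.py`.

```python
def format_example(example, answer=None):
    """Render one NLI example; leave the answer slot open when `answer` is None."""
    text = (f"Premise: {example['premise']}\n"
            f"Hypothesis: {example['hypothesis']}\n"
            f"Answer:")
    return text if answer is None else f"{text} {answer}"


def build_icl_prompt(context_examples, eval_example, label_words):
    """Prepend labeled demonstrations to the unlabeled evaluation example."""
    demos = [format_example(ex, label_words[ex["label"]]) for ex in context_examples]
    return "\n\n".join(demos + [format_example(eval_example)])


# Assumed label words: 0 -> "Yes", 1 -> "No"
label_words = {0: "Yes", 1: "No"}

context = [{
    "premise": "One of our number will carry out your instructions minutely.",
    "hypothesis": "A member of my team will execute your orders with immense precision.",
    "label": 0,
}]
query = {
    "premise": "The president avoided the athlete .",
    "hypothesis": "The athlete avoided the president .",
}

prompt = build_icl_prompt(context, query, label_words)
print(prompt)
```

The prompt ends with an open `Answer:` slot, so the first generated tokens can be read as the model's prediction for the query.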

### Context-distillation fine-tuning
TODO: add description

### Plot in-domain vs. out-of-domain metrics

In [None]:
from src.visualization.plot import plot_in_out_domain

plot_in_out_domain(logfile='opt-125m_fewshot_metrics_2_4_6_8_16.csv', metric='accuracy')
plot_in_out_domain(logfile='opt-125m_fewshot_metrics_2_4_6_8_16.csv', metric='peak_memory_gb')
plot_in_out_domain(logfile='opt-125m_fewshot_metrics_2_4_6_8_16.csv', metric='runtime')

### Plot learning curves

In [None]:
from src.visualization.plot import plot_learning_curves

plot_learning_curves(logfile='opt-125m_fewshot_training_history_2_4_6_8_16.csv', subplot=False)
plot_learning_curves(logfile='opt-125m_fewshot_training_history_2_4_6_8_16.csv', subplot=True)