# Assignment B (Group A)
* In this assignment, you will first learn how random word permutations and misspelled words in evaluation datasets can decrease the evaluation accuracy of different types of models.
* You will also learn how fine-tuning one of the models on a training dataset that also contains random word permutations and misspelled words can increase its evaluation accuracy.
* Finally, you will be asked to increase the other model's evaluation accuracy above a threshold **as quickly as possible**, based on the provided information.
* This notebook walks you through this process step by step. Run each code cell and read the text instructions until you reach section 6, where you need to write your own code for the task.
* If you have any questions during the assignment, please ask the instructor directly. Consulting any generative language model, e.g. ChatGPT, about this assignment is prohibited.

#### You are given up to 40 minutes to finish this assignment. Let the instructor start timing when you read this sentence.

# 1. Library Import (run the code, no need to read through it)

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '1'
os.environ['HF_HOME'] = '/workspace/HF_cache/'
os.environ['HF_DATASETS_CACHE'] = '/workspace/HF_cache/datasets'
os.environ['TRANSFORMERS_CACHE'] = '/workspace/HF_cache/transformers_cache/'
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
import torch
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import transformers
import datasets
from transformers import AutoTokenizer, AutoConfig, TrainingArguments, Trainer
from user_functions_group_A import set_dataset_logging_level
import logging
set_dataset_logging_level(logging.ERROR, ["datasets"])
!chmod -R 777 .
!rm -rf tmp_*
!rm -rf models/*_versioned

# 2. Datasets

* There are seven datasets under ```datasets``` that we used to train (with suffix ```_train```) and evaluate (with suffix ```_eval```) language models in this assignment. You can display the datasets by running ```!ls datasets``` later.
```
    mlm_eval	      mlm_shifted_train     sst2_eval		    sst2_train
    mlm_shifted_eval  mlm_train	         sst2_shifted_eval
```
* The prefix ```mlm``` indicates that the dataset is used for models performing the pre-training task, i.e. masked language modeling, and the prefix ```sst2``` indicates that the dataset is used for models performing the downstream task, i.e. sequence classification.
* ```_shifted_``` in a name indicates that the dataset contains randomly permuted word order and misspelled words, altered from a corresponding dataset, e.g. ```sst2_shifted_eval``` is altered from ```sst2_eval```.

# 3. An Example of Random Permutation of Word Order and Misspelled Words

Below is a comparison between the same input from datasets ```mlm_eval``` and ```mlm_shifted_eval```:
* ```their``` in the second sentence is misspelled as ```thwir```, and ```attributes``` in the last sentence is misspelled as ```atributes```.
* Words in the last sentence are randomly permuted.
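
The notebook does not say how the shifted datasets were generated. As a rough illustration only, the two kinds of alteration could be sketched as below; `misspell` and `permute_words` are hypothetical helpers written for this example, not functions from `user_functions_group_A`.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def misspell(word, rng):
    """Hypothetical helper: replace one character of `word` with a different letter."""
    if not word.isalpha() or len(word) < 2:
        return word  # leave punctuation and single characters untouched
    i = rng.randrange(len(word))
    choices = [c for c in ALPHABET if c != word[i].lower()]
    return word[:i] + rng.choice(choices) + word[i + 1:]

def permute_words(sentence, rng):
    """Hypothetical helper: shuffle the whitespace-separated tokens of `sentence`."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = "Troopers can switch classes by changing their assigned weapon ."
shuffled = permute_words(original, rng)
typo = misspell("their", rng)
print(shuffled)
print(typo)
```

The shuffled sentence keeps exactly the same tokens in a different order, and the misspelled word differs from the original in a single character, which matches the kind of change seen in ```thwir```.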

In [2]:
print(datasets.load_from_disk('datasets/mlm_eval')[11]) # load the dataset from its path and display the item at index 11

{'text': " Troops are divided into five classes : Scouts , Shocktroopers , Engineers , Lancers and Armored Soldier . Troopers can switch classes by changing their assigned weapon . Changing class does not greatly affect the stats gained while in a previous class . With victory in battle , experience points are awarded to the squad , which are distributed into five different attributes shared by the entire squad , a feature differing from early games ' method of distributing to different unit types . \n"}


In [3]:
print(datasets.load_from_disk('datasets/mlm_shifted_eval')[11])

{'text': "Troops are divided into five classes : Scouts , Shocktroopers , Engineers , Lancers and Armored Soldier . Troopers can switch classes by changing thwir assigned weapon . Changing class does not greatly affect the stats gained while in a previous class . With victory in battle , experience points are awarded to the squad , which are distributed into five different atributes shared by distributing method feature unit , early games entire the squad of different types to a from differing ' . "}


# 4. Models

There are three models under the directory ```models```; you can display them by running ```!ls models``` later.
```
    distilbert  distilbert_v2  distilbert-sentiment
```
* Among the models, ```models/distilbert_v2``` and ```models/distilbert-sentiment``` are derived from ```models/distilbert```.
* Specifically, ```models/distilbert-sentiment``` is adapted from ```models/distilbert``` and is trained on dataset ```sst2_train``` to perform a downstream task, i.e. sequence classification.
* ```models/distilbert_v2``` is the next version of ```models/distilbert```, produced via fine-tuning; both perform the same pre-training task, i.e. masked language modeling. We will introduce more details about this model in section 5.

# 5. Model Accuracy Drop due to Random Permutation of Word Order and Misspelled Words

When ```distilbert``` and ```distilbert-sentiment``` were evaluated on ```mlm_eval``` and ```sst2_eval```, which do not contain random word permutations or misspelled words, their evaluation accuracies were ```0.505``` and ```0.905``` respectively. However, when they were evaluated on ```mlm_shifted_eval``` and ```sst2_shifted_eval```, their evaluation accuracies dropped to ```0.307``` and ```0.825``` respectively.

Your colleague noticed the decrease, created a dataset ```mlm_shifted_train```, and fine-tuned ```distilbert``` on it using ```transformers.Trainer``` to produce ```distilbert_v2```. As a result, ```distilbert_v2```'s evaluation accuracy on ```mlm_shifted_eval``` increased to ```0.382```.
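
To put these numbers side by side, the short sketch below computes the absolute and relative accuracy drops, and how much of the drop ```distilbert_v2``` recovers. It uses only the accuracies quoted above; the variable and function names are chosen for this example.

```python
# Accuracies quoted in this section.
clean = {"distilbert": 0.505, "distilbert-sentiment": 0.905}
shifted = {"distilbert": 0.307, "distilbert-sentiment": 0.825}
v2_shifted = 0.382  # distilbert_v2 on mlm_shifted_eval

def accuracy_drop(model):
    """Return (absolute drop, drop relative to the clean accuracy)."""
    absolute = clean[model] - shifted[model]
    return absolute, absolute / clean[model]

for name in clean:
    absolute, relative = accuracy_drop(name)
    print(f"{name}: drop of {absolute:.3f} ({relative:.1%} of clean accuracy)")

# How much of distilbert's drop does fine-tuning on mlm_shifted_train win back?
gain = v2_shifted - shifted["distilbert"]
recovered = gain / accuracy_drop("distilbert")[0]
print(f"distilbert_v2: gain of {gain:.3f} ({recovered:.1%} of the drop recovered)")
```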

### Here is the comparison between inputs from mlm_train and mlm_shifted_train

In [4]:
print(datasets.load_from_disk('datasets/mlm_train')[1])
print(datasets.load_from_disk('datasets/mlm_shifted_train')[1])

{'text': ' = = In the Union Navy = = \n'}
{'text': '= In Unoon Navy = ther = = '}


### Here we evaluate the aforementioned models and show their accuracy

In [5]:
# load models and tokenizers
model_path = "models/distilbert"
tokenizer1 = AutoTokenizer.from_pretrained(model_path)
architecture = AutoConfig.from_pretrained(model_path).architectures[0]
model1 = getattr(transformers, architecture).from_pretrained(model_path)
 
model_path = "models/distilbert-sentiment"
tokenizer2 = AutoTokenizer.from_pretrained(model_path)
architecture = AutoConfig.from_pretrained(model_path).architectures[0]
model2 = getattr(transformers, architecture).from_pretrained(model_path)

model_path = "models/distilbert_v2"
tokenizer3 = AutoTokenizer.from_pretrained(model_path)
architecture = AutoConfig.from_pretrained(model_path).architectures[0]
model3 = getattr(transformers, architecture).from_pretrained(model_path)

In [6]:
# Here we import some functions for model evaluation
from user_functions_group_A import evaluate, mlm_preprocess_function, compute_metrics_mlm, glue_preprocess_function, compute_metrics_glue

In [14]:
# Remember to apply the preprocessing functions to a dataset before train/eval
eval_dataset = mlm_preprocess_function(datasets.load_from_disk('datasets/mlm_eval'), tokenizer1)
print('distilbert\'s accuracy on mlm_eval:',evaluate(model1.eval(), tokenizer1, eval_dataset, compute_metrics_mlm))

eval_dataset = mlm_preprocess_function(datasets.load_from_disk('datasets/mlm_shifted_eval'), tokenizer1)
print('distilbert\'s accuracy on mlm_shifted_eval:',evaluate(model1.eval(), tokenizer1, eval_dataset, compute_metrics_mlm))

model1 = model1.cpu() # move the model back to the CPU; unloading idle models from the GPU frees memory for other models

eval_dataset = glue_preprocess_function(datasets.load_from_disk('datasets/sst2_eval'), tokenizer2)
print('distilbert-sentiment\'s accuracy on sst2_eval:',evaluate(model2.eval(), tokenizer2, eval_dataset, compute_metrics_glue))

eval_dataset = glue_preprocess_function(datasets.load_from_disk('datasets/sst2_shifted_eval'), tokenizer2)
print('distilbert-sentiment\'s accuracy on sst2_shifted_eval:',evaluate(model2.eval(), tokenizer2, eval_dataset, compute_metrics_glue))

model2 = model2.cpu()

eval_dataset = mlm_preprocess_function(datasets.load_from_disk('datasets/mlm_shifted_eval'), tokenizer3)
print(type(eval_dataset))
print('distilbert_v2\'s accuracy on mlm_shifted_eval:',evaluate(model3.eval(), tokenizer3, eval_dataset, compute_metrics_mlm))

model3 = model3.cpu()

distilbert's accuracy on mlm_eval: {'eval_accuracy': '0.505'}


distilbert's accuracy on mlm_shifted_eval: {'eval_accuracy': '0.307'}


distilbert-sentiment's accuracy on sst2_eval: {'eval_accuracy': '0.905'}


distilbert-sentiment's accuracy on sst2_shifted_eval: {'eval_accuracy': '0.825'}
<class 'datasets.arrow_dataset.Dataset'>


distilbert_v2's accuracy on mlm_shifted_eval: {'eval_accuracy': '0.382'}


### **Note: ```distilbert_v2``` is a new version of ```distilbert``` that has higher accuracy on ```mlm_shifted_eval```**

# 6. It's Your Turn

* Now it's your turn to create a new version of ```distilbert-sentiment``` and increase its accuracy on ```sst2_shifted_eval``` by at least 1 percentage point, i.e., from 0.825 to at least 0.835, **as quickly as possible** (let the instructor know when you finish so that timing can be stopped).
* You may refer back to the tutorial for API usage.
* Don't use any ```eval``` dataset for training.
* #### Let the instructor know when you read this sentence.
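
One way to check whether a result reaches the target is sketched below. It assumes the result has the same shape as the outputs printed in section 5, i.e. a dict whose ```eval_accuracy``` value is a string; `meets_target` is a hypothetical helper written for this example and is not part of `user_functions_group_A`.

```python
def meets_target(result, baseline=0.825, min_gain=0.01):
    """Return True if the reported accuracy is at least baseline + min_gain.

    `result` is assumed to look like {'eval_accuracy': '0.825'},
    with the accuracy stored as a string.
    """
    accuracy = float(result["eval_accuracy"])
    # Small tolerance so that 0.835 is not rejected because of float rounding.
    return accuracy >= baseline + min_gain - 1e-9

print(meets_target({"eval_accuracy": "0.825"}))
print(meets_target({"eval_accuracy": "0.835"}))
```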