# Week 14: Sequence Classification with BERT

This week's assignment is sequence classification. It may sound like what we did in the previous assignment, but this week the classifier is BERT rather than a traditional machine learning model.

The objective is to predict the CEFR level of a sentence.  
[CEFR](https://www.cambridgeenglish.org/exams-and-tests/cefr/) is a standard for describing a person's language ability. It consists of six levels, A1, A2, B1, B2, C1, and C2, going from easiest to hardest.  
A dataset of sentences labeled with their CEFR levels is provided, and you will use it to train a BERT-based sentence classifier.  
The dataset was collected and processed from research by Alison Chi, 李書卉, 李冠霖 and Prof. Chang. Thank you all for allowing us to use it in the lecture.

As for the implementation, this week we introduce the [🤗 transformers](https://huggingface.co/) library, maintained by Hugging Face, as the training framework. [PyTorch](https://pytorch.org/) is the deep learning backend in this tutorial, but with the transformers library the code can easily be adapted to TensorFlow if you prefer.  

## Prepare your environment

Again, we highly recommend installing all packages with a virtual environment manager, like [venv](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/) or [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html), to prevent version conflicts between packages.  

If you haven't used one before and don't know which to choose, I would suggest starting with [mamba](https://github.com/mamba-org/mamba#installation) or [mambaforge](https://github.com/conda-forge/miniforge#mambaforge).

### Install CUDA

Deep learning is computationally intensive. Training on the CPU alone takes a long time, especially on a large dataset, which is why using a GPU is generally recommended.  
To use a GPU for computation, you have to install the [CUDA toolkit](https://developer.nvidia.com/cuda-toolkit) as well as the [cuDNN library](https://developer.nvidia.com/cudnn) provided by NVIDIA.  

If you already had CUDA installed on your machine, then great! You're done here.  
If you don't, you can refer to [Appendix 1](#Appendix-1-Install-CUDA) to see how to do so.

### Install python packages

Dependencies:

1. `numpy`: for matrix operations
2. `scikit-learn`: for label encoding
3. `datasets`: for data preparation
4. `transformers`: for model loading and fine-tuning
5. (choose one) `tensorflow` / `pytorch`: the backend DL framework
   - Note that the tf/pt version must support the CUDA version you've installed if you want to use a GPU.


### Select GPU(s) for your backend

Skip this section if you have no intention of using a GPU with TensorFlow/PyTorch.

In [3]:
import os

# select your GPU. Note that this should be set before you load tensorflow or pytorch.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

# To use multiple GPUs, combine all GPU ID with commas
# e.g. >>> os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,3'
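
As a small illustration of the comma-separated format mentioned above, the sketch below builds the value from a list of GPU IDs. The helper name `visible_devices` and the chosen IDs are made up for this example; only the `CUDA_VISIBLE_DEVICES` variable itself comes from the cell above.

```python
import os

def visible_devices(gpu_ids):
    """Join GPU IDs into the comma-separated form CUDA_VISIBLE_DEVICES expects."""
    return ','.join(str(i) for i in gpu_ids)

# Hypothetical choice of GPUs; set this before importing tensorflow or pytorch.
os.environ['CUDA_VISIBLE_DEVICES'] = visible_devices([0, 1, 3])
print(os.environ['CUDA_VISIBLE_DEVICES'])
```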

#### >> Check Pytorch

In [4]:
import torch
# Check if any GPU is used
torch.cuda.is_available()

False

#### >> Check Tensorflow

In [3]:
import tensorflow as tf
# Check if your GPU(s) is(are) listed below 
tf.config.list_physical_devices()


AttributeError: module 'tensorflow._api.v1.config' has no attribute 'list_physical_devices'

## Prepare the dataset

Before training, we of course need to load and process our dataset. But wait a second: let's decide which model to use first.  

In case you are not familiar with it, [BERT](https://arxiv.org/abs/1810.04805) (**B**idirectional **E**ncoder **R**epresentations from **T**ransformers) is a language model proposed by Google AI in 2018, and it is one of the most popular models in NLP.  
However, we will not use BERT directly in this tutorial, because it is large and takes a long time to train. Instead, we are using [DistilBERT](https://medium.com/huggingface/distilbert-8cf3380435b5) this week.  

DistilBERT is a distilled version of BERT that is much more lightweight than the original model while retaining about 95% of its accuracy, which makes it a good fit for our task today.  

Further Reading:
 - [BERT Explained: A Complete Guide with Theory and Tutorial](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/) by Samia, 2019.
 - [進擊的 BERT：NLP 界的巨人之力與遷移學習](https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html) by 李孟, 2019

In [5]:
# the model you want to use. Available models can be found here: https://huggingface.co/models
MODEL_NAME = 'distilbert-base-uncased'

### Load data

Like the `transformers` library, `datasets` is a package provided by Hugging Face. It gives access to many public datasets and helps with data processing.  
We can use the `load_dataset` function to read the input `.csv` file.

Reference:
 - [Official datasets document](https://huggingface.co/docs/datasets)
 - [datasets.load_dataset](https://huggingface.co/docs/datasets/loading.html)

In [7]:
import os
from datasets import load_dataset

In [8]:
dataset = load_dataset('csv', data_files = os.path.join('data', 'evp.train.csv'))

Using custom data configuration default-e7064344f20d7991


Downloading and preparing dataset csv/default to /home/pride829/.cache/huggingface/datasets/csv/default-e7064344f20d7991/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e...


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /home/pride829/.cache/huggingface/datasets/csv/default-e7064344f20d7991/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [9]:
# Take a look at the data structure
print(dataset['train'])
print(dataset['train'][1])
print(dataset['train']['text'][:5])

Dataset({
    features: ['text', 'level'],
    num_rows: 20720
})
{'text': 'You can contact me by e-mail.', 'level': 'A1'}
['My mother is having her car repaired.', 'You can contact me by e-mail.', 'He had a break for the weekend, and he called me: "I am in London, so, if you want to see me, it\'s the time!"', "Research shows that 40 percent of the program's viewers are aged over 55.", "I'd guess she's about my age."]


### Preprocessing

As before, texts should be tokenized, converted to token IDs, and padded before being fed into the model.  
But don't worry: with the Hugging Face libraries, the procedure is much easier now.

#### Sentence processing

Different pre-trained language models come with their own preprocessing, which is why we should use the tokenizer that was trained along with the model. In our case we are using DistilBERT, so we should use the DistilBERT tokenizer.  

With Hugging Face, loading the right tokenizer is easy: import `AutoTokenizer` from `transformers`, tell it which model you plan to use, and it handles the rest.

Reference:
 - [transformers.AutoTokenizer](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoTokenizer)

In [11]:
from transformers import AutoTokenizer # For tokenization

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

#### Play with BERTTokenizer

<small><i>*You can safely skip this section if you're already familiar with BERTTokenizer.</i></small>

Let's play with this tokenizer a little bit before we go on.

Using this tokenizer is pretty easy: just call this object, and it processes the sentences for you.  

In [12]:
example = "This so-called \"Perfect Evening\" was so disappointing, as well as discouraging us from coming to your Circle Theatre again."

embeddings = tokenizer(example)
embeddings

{'input_ids': [101, 2023, 2061, 1011, 2170, 1000, 3819, 3944, 1000, 2001, 2061, 15640, 1010, 2004, 2092, 2004, 12532, 4648, 4726, 2149, 2013, 2746, 2000, 2115, 4418, 3004, 2153, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [13]:
type(embeddings)

transformers.tokenization_utils_base.BatchEncoding

As you can see, the sentence has already been tokenized and converted into token IDs (`input_ids`). A default attention mask is returned as well.  

Getting the tokens back is easy as well!

In [14]:
decoded_tokens = tokenizer.batch_decode(embeddings['input_ids'])
print(' '.join(decoded_tokens))

[CLS] this so - called " perfect evening " was so disappointing , as well as disco ##ura ##ging us from coming to your circle theatre again . [SEP]


You may notice some odd items in the output, like `[CLS]` or `[SEP]`. The word *discouraging* is even split into `disco`, `##ura` and `##ging`.  
`[CLS]`, `[SEP]`, `[UNK]` and `[MASK]` are four special tokens introduced by BERT, which stand for "classification", "separator", "unknown" and "mask" respectively.  
As for the `##` prefix, it marks a *wordpiece* that continues the previous piece, a concept [also introduced by Google](https://arxiv.org/abs/1609.08144). The key idea is to split words into common sub-word units, so that the number of rare words decreases significantly.
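
To make the sub-word idea concrete, here is a minimal sketch of greedy longest-match splitting, which is the general strategy WordPiece tokenizers follow at tokenization time. The function `wordpiece` and the tiny `toy_vocab` are made up for illustration; the real tokenizer uses its own vocabulary and extra normalization steps.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Split one lowercase word into sub-word units by greedy longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, made up for illustration only.
toy_vocab = {'disco', '##ura', '##ging', 'perfect', 'evening'}
print(wordpiece('discouraging', toy_vocab))
print(wordpiece('perfect', toy_vocab))
print(wordpiece('theatre', toy_vocab))
```

With this toy vocabulary, *discouraging* is split the same way as in the output above, while a word with no matching pieces falls back to `[UNK]`.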

Besides simply tokenizing a sentence, the tokenizer accepts many parameters. Try changing them and observe the difference.

Document:
 - [transformers.Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer)

In [15]:
# EXAMPLE: return the encoded inputs directly as tensors
embeddings = tokenizer(example,
                       # padding='longest',         # padding strategy
                       # max_length=10,             # maximum length for padding/truncation
                       is_split_into_words=False,
                       truncation=True,
                       return_tensors='pt',         # 'tf' for tensorflow, 'pt' for pytorch, 'np' for numpy
                       # return_length=True         # whether to return length
                       # Any other parameters you want to try
                      )
embeddings

{'input_ids': tensor([[  101,  2023,  2061,  1011,  2170,  1000,  3819,  3944,  1000,  2001,
          2061, 15640,  1010,  2004,  2092,  2004, 12532,  4648,  4726,  2149,
          2013,  2746,  2000,  2115,  4418,  3004,  2153,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1]])}

In [16]:
type(embeddings)

transformers.tokenization_utils_base.BatchEncoding

#### Label processing

Before we process the sentences in the whole dataset, don't forget that we need to process the labels as well.

In the following section, I will introduce the `OneHotEncoder` provided by scikit-learn.

Documents:
 - [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder)

In [18]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# First, declare a new encoder
# (newer scikit-learn versions rename the `sparse` argument to `sparse_output`)
encoder = OneHotEncoder(sparse = False)
# Then, let the encoder learn all categories in the given dataset
# Keep in mind that `fit` functions in sklearn only make the encoder learn from the data; they do not transform it yet.
encoder = encoder.fit(np.reshape(dataset['train']['level'], (-1, 1)))

In [19]:
LABEL_COUNT = len(encoder.categories_[0])
print(LABEL_COUNT)

6


#### Play with OneHotEncoder

<small><i>*You can safely skip this section if you're already familiar with sklearn.</i></small>

One thing to always keep in mind: the inputs to OneHotEncoder are always treated as 2D arrays, because it supports samples with multiple feature columns. (See its [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder) for an example.)  
That's why you have to reshape the levels into (-1, 1), i.e. from `['A1', 'B1', 'C1', ...]` to `[['A1'], ['B1'], ['C1'], ...]`.

In [20]:
# Let's see which categories the encoder has captured
print(encoder.categories_)

[array(['A1', 'A2', 'B1', 'B2', 'C1', 'C2'], dtype='<U2')]


In [21]:
# use `encoder.transform` to get the one-hot code of a label
print(encoder.transform([['B1'], ['C2']]))

[[0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]]


In [22]:
# To decode, use `encoder.inverse_transform` instead
print(encoder.inverse_transform([[0, 0, 1, 0, 0, 0]]))

[['B1']]
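
If it helps to see what the encoder does without scikit-learn, the sketch below reproduces the same mapping in plain Python. The names `LEVELS`, `one_hot` and `decode` are hypothetical helpers written for this example, not part of any library.

```python
# The six categories, in the same order the encoder printed above.
LEVELS = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']

def one_hot(level):
    """Return a vector with 1.0 at the position of `level` and 0.0 elsewhere."""
    vec = [0.0] * len(LEVELS)
    vec[LEVELS.index(level)] = 1.0
    return vec

def decode(vec):
    """Map a one-hot (or score) vector back to the level with the largest value."""
    return LEVELS[vec.index(max(vec))]

print(one_hot('B1'))
print(decode([0, 0, 0, 0, 0, 1]))
```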


#### [ TODO ] Process the data

With the tokenizer and encoder prepared, we can write a function to process our dataset!

In [23]:
def preprocess(dataslice):
    """ Input: a batch of your dataset
        Example: { 'text': ['sentence1', 'sentence2', ...],
                   'level': ['level1', 'level2', ...] }

        Output: a batch of processed dataset
        Example: { 'input_ids': ...,
                   'attention_mask': ...,
                   'label': ... }
    """
    # Tokenize the whole batch at once. Padding is intentionally left out here,
    # because it is done later, at training time.
    embeddings = tokenizer(dataslice['text'],
                           truncation=True,   # cut sentences longer than the model's limit
                           # Any other parameters you want to try
                          )

    # Reuse the encoder fitted on the full training set above. Fitting a new
    # encoder per batch would give inconsistent columns if a batch lacks a level.
    label = encoder.transform(np.reshape(dataslice['level'], (-1, 1)))

    output = {}
    output['input_ids'] = embeddings['input_ids']
    output['attention_mask'] = embeddings['attention_mask']
    output['label'] = label

    return output

Now, map the function to the whole dataset.

In [24]:
processed_data = dataset.map(preprocess,    # your processing function
                             batched = True # Process in batches so it can be faster
                            )

  0%|          | 0/21 [00:00<?, ?ba/s]

In [25]:
# Take a look at processed dataset
print(processed_data)
processed_data['train'][0]

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'level', 'text'],
        num_rows: 20720
    })
})


{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101, 2026, 2388, 2003, 2383, 2014, 2482, 13671, 1012, 102],
 'label': [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
 'level': 'B1',
 'text': 'My mother is having her car repaired.'}

### DataCollator

You may notice that we didn't pad the sentences in the preprocessing function. That is because we are going to pad them at training time.  

For training-time processing, we can use the data collator classes provided by `transformers`. Even better, `transformers` already provides one that handles padding for us!

 - [transformers.DataCollatorWithPadding](https://huggingface.co/docs/transformers/master/en/main_classes/data_collator#transformers.DataCollatorWithPadding)

In [26]:
from transformers import DataCollatorWithPadding

# declare a collator to do padding during training.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [27]:
print(data_collator)

DataCollatorWithPadding(tokenizer=PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}), padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')
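
Conceptually, padding a batch means extending every sequence to the length of the longest one and marking the added positions with 0 in the attention mask. The sketch below illustrates the idea in plain Python; `pad_batch` is a hypothetical helper, not the collator's actual implementation, and the second sequence is made up.

```python
def pad_batch(features, pad_id=0):
    """Right-pad every sequence in the batch to the longest length."""
    longest = max(len(f['input_ids']) for f in features)
    batch = {'input_ids': [], 'attention_mask': []}
    for f in features:
        n_pad = longest - len(f['input_ids'])
        batch['input_ids'].append(f['input_ids'] + [pad_id] * n_pad)
        # Padded positions get mask 0 so the model ignores them.
        batch['attention_mask'].append(f['attention_mask'] + [0] * n_pad)
    return batch

features = [
    # First processed sentence from the dataset above (10 tokens)
    {'input_ids': [101, 2026, 2388, 2003, 2383, 2014, 2482, 13671, 1012, 102],
     'attention_mask': [1] * 10},
    # A shorter, made-up sequence (3 tokens)
    {'input_ids': [101, 7592, 102], 'attention_mask': [1] * 3},
]

batch = pad_batch(features)
print(batch['input_ids'][1])
print(batch['attention_mask'][1])
```

Padding per batch rather than for the whole dataset keeps batches of short sentences short, which saves computation.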


## Training

### Preparation

We can load the pretrained model from `transformers`.  
Generally, you need to build your own model on top of BERT if you want to use it for a downstream task, but sequence classification is a popular one. With support from the `transformers` library, all the work can be done in two lines of code: 

1. Load `AutoModelForSequenceClassification` Class.
2. Load the pretrained model.

In [28]:
# Change to TFAutoModelForSequenceClassification if you're using tensorflow
from transformers import AutoModelForSequenceClassification

# Use the same model name as the tokenizer so the two always match
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                           num_labels = LABEL_COUNT)

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

#### [ TODO ] Split train/val data

The `Dataset` class we prepared before already has the `train_test_split` method. You can use it to split your dataset.

Document:
 - [datasets.Dataset - Sort, shuffle, select, split, and shard](https://huggingface.co/docs/datasets/process.html#sort-shuffle-select-split-and-shard)


In [29]:
# [ TODO ] Choose the validation data size
# [ DONE ]
train_val_dataset = processed_data['train'].train_test_split(test_size = 0.2)

In [30]:
# Take a look at split data
print(train_val_dataset)

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'level', 'text'],
        num_rows: 16576
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'level', 'text'],
        num_rows: 4144
    })
})


#### [ TODO ] Setup training parameters

We are using the Trainer API for training. Trainer is yet another utility provided by Hugging Face, which helps you train the model with ease.  

Document:
- [transformers.TrainingArguments](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.TrainingArguments)
- [transformers.Trainer](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.Trainer)

In [31]:
# Change to TFTrainingArguments, TFTrainer if you're using tensorflow
from transformers import TrainingArguments, Trainer

In [32]:
# [ TODO ] Set and tune your training properties
LEARNING_RATE = 5e-05
BATCH_SIZE = 8
EPOCH = 3
training_args = TrainingArguments(
    output_dir = 'model',
    learning_rate = LEARNING_RATE,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size = BATCH_SIZE,
    num_train_epochs = EPOCH,
    # You can also set other parameters here
)

# Now give all information to a trainer.
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_val_dataset["train"],
    eval_dataset = train_val_dataset["test"],
    tokenizer = tokenizer,
    data_collator = data_collator,
    # You can also set other parameters
)

### Training

Training is pretty easy. Simply ask the trainer to train the model for you!

In [33]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: level, text.
***** Running training *****
  Num examples = 16576
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6216


Step,Training Loss
500,0.395
1000,0.3587
1500,0.3468
2000,0.3433
2500,0.2744
3000,0.2649
3500,0.2612
4000,0.2544
4500,0.1954
5000,0.168


Saving model checkpoint to model/checkpoint-500
Configuration saved in model/checkpoint-500/config.json
Model weights saved in model/checkpoint-500/pytorch_model.bin
tokenizer config file saved in model/checkpoint-500/tokenizer_config.json
Special tokens file saved in model/checkpoint-500/special_tokens_map.json
Saving model checkpoint to model/checkpoint-1000
Configuration saved in model/checkpoint-1000/config.json
Model weights saved in model/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in model/checkpoint-1000/tokenizer_config.json
Special tokens file saved in model/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to model/checkpoint-1500
Configuration saved in model/checkpoint-1500/config.json
Model weights saved in model/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in model/checkpoint-1500/tokenizer_config.json
Special tokens file saved in model/checkpoint-1500/special_tokens_map.json
Saving model checkpoint to model/checkpoint-2000

TrainOutput(global_step=6216, training_loss=0.26146100325320526, metrics={'train_runtime': 2692.1203, 'train_samples_per_second': 18.472, 'train_steps_per_second': 2.309, 'total_flos': 446386949619552.0, 'train_loss': 0.26146100325320526, 'epoch': 3.0})

You can see that Trainer saves checkpoints along the way, so you can load your model from one of them if you want to fall back to a specific version.
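
The step counts in the training log can be checked by hand. The sketch below assumes a single device and no gradient accumulation, as the log reports, and assumes checkpoints are written every 500 steps, which is consistent with the `checkpoint-500`, `checkpoint-1000`, ... directories listed above.

```python
import math

num_examples = 16576   # training rows after the split
batch_size = 8         # per-device batch size, single device
epochs = 3
save_steps = 500       # assumption: matches the checkpoint names in the log

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch)        # optimization steps in one epoch
print(total_steps)            # should agree with "Total optimization steps"
print(checkpoint_steps[-1])   # step of the last checkpoint written
```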

### Save for future use

In [34]:
model.save_pretrained(os.path.join('model', 'finetuned'))

Configuration saved in model/finetuned/config.json
Model weights saved in model/finetuned/pytorch_model.bin


## Prediction

We now know how to train a model, but how do we actually use it to make predictions?

### Load finetuned model

In [36]:
# Same, change to TFxxxxxx if you are using tensorflow
from transformers import AutoModelForSequenceClassification

mymodel = AutoModelForSequenceClassification.from_pretrained(os.path.join('model', 'finetuned'))

loading configuration file model/finetuned/config.json
Model config DistilBertConfig {
  "_name_or_path": "model/finetuned",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "multi_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.14.1",
  "vocab_size": 30522
}

loading weights file model/finetuned/pyt

### Get the prediction

Given four example sentences...

In [37]:
examples = [
    # A2
    "Remember to write me a letter.",
    # B2
    "Strawberries and cream - a perfect combination.",
    "This so-called \"Perfect Evening\" was so disappointing, as well as discouraging us from coming to your Circle Theatre again.",
    # C1
    "Some may altogether give up their studies, which I think is a disastrous move.",
]

...all you need to do is transform them into model inputs, and then you can get predictions by calling your fine-tuned model.  

Note that, since there is no DataCollator here to pad the sentences and convert them to tensors for you, you have to pad and convert them on your own.

In [39]:
# Transform the sentences into padded tensors of token IDs
# (named `inputs` to avoid shadowing Python's built-in `input`)
inputs = tokenizer(examples, truncation=True, padding=True, return_tensors="pt") # change return_tensors if you're using tensorflow
# Get the output
logits = mymodel(**inputs).logits
logits

tensor([[ 1.9050, -2.1400, -4.5247, -5.2337, -4.6768, -4.7797],
        [-7.5280, -6.9617, -5.1950, -1.7884,  1.2411, -2.9181],
        [-7.5949, -6.8332, -5.4501,  0.4709, -0.3539, -4.5285],
        [-7.6892, -7.3095, -5.6460,  0.0680, -0.4457, -3.2494]],
       grad_fn=<AddmmBackward0>)

Logits aren't very readable for us. Let's use softmax activation to transform them into more probability-like numbers.

In [40]:
# Or `from tensorflow import nn` and `nn.softmax`
from torch import nn

predicts = nn.functional.softmax(logits, dim = -1)
predicts

tensor([[9.7795e-01, 1.7124e-02, 1.5774e-03, 7.7628e-04, 1.3548e-03, 1.2224e-03],
        [1.4584e-04, 2.5692e-04, 1.5034e-03, 4.5344e-02, 9.3810e-01, 1.4653e-02],
        [2.1681e-04, 4.6438e-04, 1.8516e-03, 6.9025e-01, 3.0256e-01, 4.6537e-03],
        [2.6095e-04, 3.8147e-04, 2.0134e-03, 6.1017e-01, 3.6506e-01, 2.2118e-02]],
       grad_fn=<SoftmaxBackward0>)
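
To see what softmax does to a single row, here is a plain-Python sketch applied to the first row of logits printed earlier. The function `softmax` below is written for illustration and is not the PyTorch implementation; subtracting the maximum first is a common trick for numerical stability.

```python
import math

def softmax(logits):
    """Exponentiate each logit and normalize so the results sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First row of the logits tensor shown above
row = [1.9050, -2.1400, -4.5247, -5.2337, -4.6768, -4.7797]
probs = softmax(row)
print([round(p, 4) for p in probs])
print(sum(probs))
```

The largest logit keeps the largest probability, so taking the argmax of the logits and of the softmax output gives the same label.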

#### [ TODO ] transform logits back to labels

Now you've got the output. Write a function to map it back into labels!

In [48]:
# [ TODO ] try to process the result
# [ DONE ]
def predicts_to_labels(predicts):
    label_cat = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
    _, indices = torch.max(predicts, 1)
    
    labels = []
    for idx in indices:
        labels.append(label_cat[idx])
    
    return labels

predicts_to_labels(predicts)

['A1', 'C1', 'B2', 'B2']

## [ TODO ] Evaluation

It's your turn!  
Load the testing data and calculate your accuracy.

We want you to calculate three kinds of accuracy: six-level (exact) accuracy, three-level accuracy, and fuzzy accuracy, each explained in the sections below.


In [56]:
# [ TODO ] 
# load test data

test_dataset = load_dataset('csv', data_files = os.path.join('data', 'evp.test.csv'))

# preprocess

processed_test_data = test_dataset.map(preprocess,    # your processing function
                             batched = True # Process in batches so it can be faster
                            )

# get predictions

test_input = tokenizer(processed_test_data['train']['text'], truncation=True, padding=True, return_tensors="pt")

# transform predictions back into labels

Using custom data configuration default-6b7e41f952df3365
Reusing dataset csv (/home/pride829/.cache/huggingface/datasets/csv/default-6b7e41f952df3365/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e)


  0%|          | 0/1 [00:00<?, ?it/s]

Loading cached processed dataset at /home/pride829/.cache/huggingface/datasets/csv/default-6b7e41f952df3365/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e/cache-26675242a5204b1c.arrow


In [58]:
logits = mymodel(**test_input).logits
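
Passing the whole test set through the model in one call, as above, can need a lot of memory. One possible alternative is to go through the sentences in small chunks. The helper `chunks` below is a hypothetical, pure-Python sketch of the chunking step only; in the notebook, each chunk would be tokenized and passed to `mymodel`, ideally inside `torch.no_grad()` since gradients are not needed for prediction.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder sentences, standing in for the test set texts
sentences = ['sentence 1', 'sentence 2', 'sentence 3', 'sentence 4', 'sentence 5']

for batch in chunks(sentences, 2):
    # In the notebook, this is where a chunk would be tokenized and
    # sent through the model, collecting the logits of each chunk.
    print(len(batch), batch)
```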

In [60]:
predicts = nn.functional.softmax(logits, dim = -1)
predicts
test_predicts = predicts_to_labels(predicts)

In [61]:
test_predicts

['C2',
 'B2',
 'B2',
 'C2',
 'C1',
 'A1',
 'B1',
 'B2',
 'A2',
 'C2',
 'C2',
 'C1',
 'C1',
 'C1',
 'B1',
 'B2',
 'C1',
 'C1',
 'A2',
 'B2',
 'C1',
 'B2',
 'B2',
 'B2',
 'A2',
 'B1',
 'B1',
 'B2',
 'A2',
 'C2',
 'B2',
 'A2',
 'B1',
 'B2',
 'B2',
 'C1',
 'B2',
 'B2',
 'B2',
 'B1',
 'A2',
 'B1',
 'B2',
 'B1',
 'B1',
 'B1',
 'B1',
 'B2',
 'B2',
 'B1',
 'A1',
 'C1',
 'A2',
 'C2',
 'B1',
 'C1',
 'B1',
 'A1',
 'B1',
 'B1',
 'B2',
 'B2',
 'C2',
 'A2',
 'B2',
 'C2',
 'B1',
 'B1',
 'C1',
 'B1',
 'B2',
 'A2',
 'B2',
 'B2',
 'B1',
 'C2',
 'B2',
 'B1',
 'A2',
 'B2',
 'B2',
 'C1',
 'C1',
 'B2',
 'B1',
 'A2',
 'B2',
 'C2',
 'B2',
 'C1',
 'B1',
 'B2',
 'C2',
 'B2',
 'B1',
 'C2',
 'B2',
 'B2',
 'B2',
 'B2',
 'B2',
 'B2',
 'C2',
 'C2',
 'C1',
 'A1',
 'C1',
 'B2',
 'A1',
 'C1',
 'C1',
 'B2',
 'B2',
 'B2',
 'B2',
 'B2',
 'B1',
 'B2',
 'B2',
 'C1',
 'B2',
 'B1',
 'C1',
 'A1',
 'A2',
 'A1',
 'B2',
 'B1',
 'C2',
 'B1',
 'C2',
 'C1',
 'B1',
 'C1',
 'C2',
 'B2',
 'B1',
 'C1',
 'C2',
 'A2',
 'B1',
 'B2',
 'C1',

In [64]:
processed_test_data['train']['level']

['C2',
 'B2',
 'B2',
 'C2',
 'C1',
 'A2',
 'B1',
 'B2',
 'A2',
 'C1',
 'B2',
 'C1',
 'C1',
 'C1',
 'B1',
 'B1',
 'C1',
 'C2',
 'C1',
 'C1',
 'B2',
 'C2',
 'B2',
 'C1',
 'A2',
 'B1',
 'B2',
 'C1',
 'A2',
 'C2',
 'B1',
 'A2',
 'B2',
 'C1',
 'B2',
 'C2',
 'B2',
 'B2',
 'B2',
 'B1',
 'A2',
 'C2',
 'C2',
 'A2',
 'B1',
 'B1',
 'B2',
 'B2',
 'B1',
 'A2',
 'A1',
 'C2',
 'B2',
 'C1',
 'A2',
 'B1',
 'A2',
 'B1',
 'B2',
 'C1',
 'C2',
 'B2',
 'C2',
 'A1',
 'B2',
 'C1',
 'C1',
 'B2',
 'B2',
 'B2',
 'C1',
 'A2',
 'B2',
 'B2',
 'B1',
 'C2',
 'C2',
 'B2',
 'A2',
 'B1',
 'B1',
 'C1',
 'C1',
 'B2',
 'B1',
 'C1',
 'B2',
 'C2',
 'C2',
 'B2',
 'B2',
 'B2',
 'C1',
 'B1',
 'B2',
 'C2',
 'B2',
 'C2',
 'B2',
 'B2',
 'C1',
 'B2',
 'C2',
 'C2',
 'C2',
 'A1',
 'C2',
 'C2',
 'A1',
 'B2',
 'C1',
 'B2',
 'C2',
 'B2',
 'B2',
 'C2',
 'B1',
 'B2',
 'B2',
 'C1',
 'B2',
 'B1',
 'C2',
 'A2',
 'A2',
 'A2',
 'B2',
 'B2',
 'B2',
 'B1',
 'C2',
 'C1',
 'A2',
 'B2',
 'C2',
 'B2',
 'B1',
 'C1',
 'C2',
 'B1',
 'A2',
 'B2',
 'C2',

In [65]:
# We still recommend printing some predictions at the end of every step, to check whether the outputs are reasonable and whether you need to adjust your model.

# Each printed line is "prediction: ground truth"
for idx, (gold, pred) in enumerate(zip(processed_test_data['train']['level'], test_predicts)):
    if idx >= 10: break
    print(f'{pred}: {gold}') 

C2: C2
B2: B2
B2: B2
C2: C2
C1: C1
A1: A2
B1: B1
B2: B2
A2: A2
C2: C1


### Six Level Accuracy

Exact accuracy is the one you are already familiar with:

$
accuracy = \frac{\#exactly\:the\:same\:levels}{\#total}
$

Example:
```
Prediction:   A1 A2 B1 B2 C1 C2
Ground truth: A2 B1 B1 B2 B2 C2
                    ^  ^     ^
```

The six-level accuracy is $\frac{3}{6} = 0.5$

As a requirement, <u>your exact accuracy should be higher than $0.5$</u>.
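
The worked example can be reproduced with a few lines of Python. The function name `exact_accuracy` is made up for this sketch.

```python
def exact_accuracy(predictions, truths):
    """Fraction of predictions that equal the ground truth exactly."""
    matches = sum(p == t for p, t in zip(predictions, truths))
    return matches / len(truths)

prediction   = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
ground_truth = ['A2', 'B1', 'B1', 'B2', 'B2', 'C2']
print(exact_accuracy(prediction, ground_truth))  # 3 of 6 match
```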

In [68]:
same_count = 0

# Count predictions that match the ground-truth level exactly
for gold, pred in zip(processed_test_data['train']['level'], test_predicts):
    if gold == pred:
        same_count += 1
        
same_count / len(test_predicts)

0.5430434782608695

### [ TODO ] Three Level Accuracy

Three-level accuracy is used when you only care whether the general level (A, B, or C) is right.

$
accuracy = \frac{\#the\:same\:ABC\:levels}{\#total}
$

Example:
```
Prediction:   A1 A2 B1 B2 C1 C2
Ground truth: A2 B1 B1 B2 B2 C2
              ^     ^  ^     ^
```

The three-level accuracy is $\frac{4}{6} = 0.667$

As a requirement, <u>your three-level accuracy should be higher than $0.6$</u>.
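
As a quick check of the example, only the letter of each level needs to be compared. The function name `three_level_accuracy` is made up for this sketch.

```python
def three_level_accuracy(predictions, truths):
    """Fraction of predictions whose letter (A, B or C) matches the ground truth."""
    # Compare only the first character: 'B1' and 'B2' both count as 'B'.
    matches = sum(p[0] == t[0] for p, t in zip(predictions, truths))
    return matches / len(truths)

prediction   = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
ground_truth = ['A2', 'B1', 'B1', 'B2', 'B2', 'C2']
print(round(three_level_accuracy(prediction, ground_truth), 3))  # 4 of 6 match
```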

In [71]:
same_count = 0

# Compare only the first character (A, B or C) of each level
for gold, pred in zip(processed_test_data['train']['level'], test_predicts):
    if gold[0] == pred[0]:
        same_count += 1
        
same_count / len(test_predicts)

0.73

### [ TODO ] Fuzzy accuracy

However, the level of a sentence is relatively subjective. Generally speaking, $\pm1$ errors are tolerated in real linguistic evaluation.  

For example, if the label is actually 'B1' but the model predicts 'B2', we still consider the prediction good enough, and the same applies when the model predicts 'A2'.

Hence, the fuzzy accuracy is

$
accuracy = \frac{\#good\:enough\:answers}{\#total}
$

Example (levels written as indices, from A1 = 0 to C2 = 5):
```
Prediction:   0 1 2 3 4 5
Ground truth: 0 1 1 3 3 3
              ^ ^ ^ ^ ^
```

The fuzzy accuracy is $\frac{5}{6} = 0.833$

As a requirement, <u>your fuzzy accuracy should be higher than $0.8$</u>.
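
The fuzzy example above can be verified the same way. This is a minimal sketch that works on level indices; `fuzzy_accuracy` and its `tolerance` parameter are hypothetical names, and `LEVEL_TO_INDEX` only illustrates how CEFR labels could be turned into indices first.

```python
LEVELS = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
# Map each level to its position: {'A1': 0, 'A2': 1, ..., 'C2': 5}
LEVEL_TO_INDEX = {level: idx for idx, level in enumerate(LEVELS)}


def fuzzy_accuracy(gold_indices, pred_indices, tolerance=1):
    """Fraction of predictions at most `tolerance` levels away from the label."""
    matches = sum(abs(gold - pred) <= tolerance
                  for gold, pred in zip(gold_indices, pred_indices))
    return matches / len(gold_indices)


# Toy data copied from the example above (already written as indices).
predictions = [0, 1, 2, 3, 4, 5]
ground_truth = [0, 1, 1, 3, 3, 3]
print(round(fuzzy_accuracy(ground_truth, predictions), 3))  # 0.833

# String labels can be converted before scoring.
print([LEVEL_TO_INDEX[level] for level in ['B1', 'A2']])  # [2, 1]
```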

In [73]:
label_cat = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']

# map each level to its index: {'A1': 0, 'A2': 1, ..., 'C2': 5}
label_dict = {label: idx for idx, label in enumerate(label_cat)}

same_count = 0

for gold, pred in zip(processed_test_data['train']['level'], test_predicts):
    # a prediction at most one level away from the label counts as good enough
    if abs(label_dict[gold] - label_dict[pred]) <= 1:
        same_count += 1

same_count / len(test_predicts)

0.8704347826086957

## TA's note

Congratulations! You've finished this week's assignment.  
Don't forget to <b>[make an appointment with a TA](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit#gid=134737606) to demo/explain your implementation <u>before <font color="red">12/23 15:30</font></u></b>.  
Also make sure you submit your `{student_id}.ipynb` to [eeclass](https://eeclass.nthu.edu.tw/course/homework/6053).

This is the last assignment of this class. A TA will still be in the online classroom to answer your questions during class time in the following weeks, and you can do make-up demos then.  
Prof. Chang's office hours are on Tuesday to Thursday evenings. You can come to Delta 712 to consult him then, but make sure you follow the appointment rules written on the bulletin or [the appointment sheet](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit?usp=sharing).




## Appendix 

<a name="Appendix-1-Install-CUDA"></a>

### Appendix 1 - Install CUDA

1. Check your GPU vs. CUDA compatibility:
   - [NVIDIA -> Your GPU Compute Capability](https://developer.nvidia.com/cuda-gpus) -> GeForce and TITAN Products
2. Check library vs. CUDA compatibility: 
   - PyTorch: [Previous PyTorch Versions](https://pytorch.org/get-started/previous-versions/)
   - TensorFlow: [Linux/macOS](https://www.tensorflow.org/install/source#tested_build_configurations) or [Windows](https://www.tensorflow.org/install/source_windows#tested_build_configurations)
3. Note the highest CUDA version that fits your system.

#### >> for conda/mamba users

You can install the CUDA library directly, using the CUDA version you selected.
1. Get [the driver for NVIDIA GPU](https://www.nvidia.com/download/index.aspx)
2. `conda/mamba install -c conda-forge cudatoolkit=${VERSION}`

#### >> for non-conda users

1. Get [the driver for NVIDIA GPU](https://www.nvidia.com/download/index.aspx)
2. Download and install [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit-archive)
3. Download and install [cuDNN Library](https://developer.nvidia.com/rdp/cudnn-archive)
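
After either installation path, it can help to confirm from Python that PyTorch actually sees the GPU. The following is a minimal sketch, assuming PyTorch is the backend you installed; `describe_cuda_setup` is a hypothetical helper name, and it only reports what it finds rather than fixing anything.

```python
def describe_cuda_setup():
    """Return a one-line summary of whether PyTorch can use a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return 'PyTorch is not installed in this environment.'

    if not torch.cuda.is_available():
        return f'PyTorch {torch.__version__} is installed, but no CUDA GPU is usable.'

    # torch.version.cuda is the CUDA version PyTorch was built against.
    return (f'PyTorch {torch.__version__} built with CUDA {torch.version.cuda}; '
            f'GPU 0 is {torch.cuda.get_device_name(0)}.')


print(describe_cuda_setup())
```

If the summary says no CUDA GPU is usable even though you have one, the driver or the CUDA version chosen in the steps above is the first thing to recheck.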

<a name="Appendix-2-TAs-Environmental-setup"></a>

### Appendix 2 - TA's Environmental Setup

The following is my setup for this notebook. You can refer to it if you run into environment issues.  

System: Ubuntu 18.04.6, x64, with GPU support. All packages are installed in a new conda environment, with conda-forge as the default channel.

1. Python 3.8.12
2. numpy=1.21.2
3. cudatoolkit=11.1.74
4. pytorch=1.8.2
5. datasets=1.16.1
6. transformers=4.12.5
7. scikit-learn=1.0.1

Notes:

 - conda create -n week14 python=3.8 && conda activate week14
 - conda config --add channels conda-forge
 - conda config --set channel_priority strict
 - conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch-lts -c nvidia
 - conda install transformers
 - conda install datasets scikit-learn
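
To compare your environment with the list above, you can print the installed versions side by side. This is a minimal sketch using the standard-library `importlib.metadata` (available from Python 3.8); `TA_VERSIONS` and `installed_versions` are hypothetical names, and it assumes PyTorch is registered under the distribution name `torch`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the TA's list above.
# Assumption: the pytorch package is registered as the distribution `torch`.
TA_VERSIONS = {
    'numpy': '1.21.2',
    'torch': '1.8.2',
    'datasets': '1.16.1',
    'transformers': '4.12.5',
    'scikit-learn': '1.0.1',
}


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, installed in installed_versions(TA_VERSIONS).items():
    print(f'{name}: installed={installed}, TA={TA_VERSIONS[name]}')
```

A mismatch is not necessarily a problem; the comparison is mainly useful when something that works in the TA's setup fails in yours.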


### Appendix 3 - Further Readings

1. [Hugging Face Official Tutorials](https://github.com/huggingface/notebooks/tree/master/examples)
2. How to use BERT with other downstream tasks: [How to use BERT from the Hugging Face transformer library](https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209)
3. Training with the PyTorch backend: [transformers-tutorials](https://github.com/abhimishra91/transformers-tutorials)
4. A more complicated example that includes manual data/training processing with PyTorch: [Transformers for Multi-Label Classification made simple](https://towardsdatascience.com/transformers-for-multilabel-classification-71a1a0daf5e1)
5. [Text Classification with tensorflow](https://github.com/huggingface/notebooks/blob/master/examples/text_classification-tf.ipynb): TensorFlow example