# Week 9: Sentence Level Classification with BERT

Your goal this week is to train a classifier that can predict the CEFR level of any given sentence. In this notebook we will guide you through the process of using 🤗 [Hugging Face](https://huggingface.co/) and its transformers library as the training framework, with [PyTorch](https://pytorch.org/) as the deep learning backend, but feel free to use [TensorFlow](https://www.tensorflow.org) if that's what you are more familiar with.

For this assignment we will provide a dataset containing sentences with the corresponding CEFR level, and you have to use BERT and train a sentence classifier with this dataset.

## Prepare your environment

As always, we highly recommend that you install all packages with a virtual environment manager, like [venv](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/) or [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html), to prevent version conflicts of different packages.  

### Install CUDA
Deep learning is computationally intensive. Training takes a long time on a CPU alone, especially with a large dataset, which is why using a GPU is generally recommended.  
To use a GPU for computation, you have to install the [CUDA toolkit](https://developer.nvidia.com/cuda-toolkit) as well as the [cuDNN library](https://developer.nvidia.com/cudnn) provided by NVIDIA.  

If you already have CUDA installed on your machine, then great! You're done here.  
If you don't, refer to [Appendix 1](#Appendix-1-Install-CUDA) to see how to install it.


### Install python packages
The following python packages will be used in this tutorial:

1. `numpy`: for matrix operations
2. `scikit-learn`: for label encoding
3. `datasets`: for data preparation
4. `transformers`: for model loading and fine-tuning
5. `pytorch`: the backend DL framework
  - Note that the PyTorch version must support the CUDA version you've installed if you want to use a GPU.

### Select GPU(s) for your backend

Skip this section if you have no intention of using a GPU with TensorFlow/PyTorch.

In [1]:
import os
# Select your GPU. Note that this should be set before you load tensorflow or pytorch,
# so torch is imported in the next cell rather than here.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

# To use multiple GPUs, combine all GPU ID with commas
# e.g. >>> os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,3'

In [2]:
import torch
# Check if any GPU is used
torch.cuda.is_available()

True
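
If you switch between machines with different GPU layouts, it can help to build the `CUDA_VISIBLE_DEVICES` value from a list rather than typing it by hand. The snippet below is only a sketch: `visible_devices_value` is a hypothetical helper name, and the GPU IDs are placeholders for whatever your machine has.

```python
import os

def visible_devices_value(gpu_ids):
    """Build the comma-separated string expected by CUDA_VISIBLE_DEVICES."""
    return ",".join(str(gpu_id) for gpu_id in gpu_ids)

# Hypothetical choice: expose GPUs 0, 1 and 3 to this process.
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices_value([0, 1, 3])
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 0,1,3
```

As noted above, this only has an effect if it runs before the deep learning framework is loaded.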

## Prepare the dataset

Before starting the training, we need to load and process our dataset - but wait, let's decide which model we want to use first.  

In the highly unlikely chance you've never heard of it, [BERT](https://arxiv.org/abs/1810.04805) (**B**idirectional **E**ncoder **R**epresentations from **T**ransformers) is a language model proposed by Google AI in 2018, and it's currently one of the most popular models used in NLP.  
You can learn more about it here:
- [BERT Explained: A Complete Guide with Theory and Tutorial](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/) by Samia, 2019.


However, we will not use BERT directly in this tutorial, because it is large and takes too long to train. Instead, we'll be using [DistilBert](https://medium.com/huggingface/distilbert-8cf3380435b5), a lighter-weight version of BERT that retains about 95% of its original accuracy.




In [3]:
# the model you want to use. Available models can be found here: https://huggingface.co/models
MODEL_NAME = 'distilbert-base-uncased'

### Load data

Like `transformers`, `datasets` is a package by huggingface. It gives access to many public datasets and can help us with the data processing.  
We can use the `load_dataset` function to read the input `.csv` file provided for this assignment.

Reference:
 - [Official datasets document](https://huggingface.co/docs/datasets)
 - [datasets.load_dataset](https://huggingface.co/docs/datasets/loading.html)

In [4]:
# [ TODO ] load the data using the load_dataset function
from datasets import load_dataset
dataset = load_dataset("csv",data_files="data/csv")

Using custom data configuration default-a3ae748087bf9cd4
Found cached dataset csv (C:/Users/user/.cache/huggingface/datasets/csv/default-a3ae748087bf9cd4/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)


  0%|          | 0/1 [00:00<?, ?it/s]

In [5]:
print(dataset['train'])
print(dataset['train'][1])
print(dataset['train']['text'][:5])

Dataset({
    features: ['text', 'level'],
    num_rows: 23020
})
{'text': "Unfortunately he was too fast and I couldn't keep up with him.", 'level': 'B2'}
['No longer a remote, backward, unimportant country, it became a force to be reckoned with in Europe.', "Unfortunately he was too fast and I couldn't keep up with him.", 'Most mushrooms are totally harmless, but some are poisonous.', 'This provided solid evidence that he committed the crime.', "You can't just accept everything you read in the newspapers at face value."]


In [6]:
print(dataset['train']['level'][:5])

['C2', 'B2', 'B2', 'C2', 'C1']
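
Before training, it is usually worth checking how the labels are distributed, since a heavily skewed dataset affects how you read the accuracy later. Here is a minimal sketch using only the standard library; the `levels` list is just the five labels printed above, used as a stand-in for the full column.

```python
from collections import Counter

# Stand-in sample: the five labels printed above.
levels = ['C2', 'B2', 'B2', 'C2', 'C1']

counts = Counter(levels)
for level in sorted(counts):
    share = counts[level] / len(levels)
    print(f"{level}: {counts[level]} ({share:.0%})")
```

On the real data you would pass `dataset['train']['level']` to `Counter` instead.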


### Preprocessing

As always, texts should be tokenized, embedded, and padded before being put into the model.  
But not to worry, there are libraries from huggingface to help with this, too.

#### Sentence processing

Different pre-trained language models may have their own preprocessing models, and that's why we should use the tokenizers trained along with that model. In our case, we are using distilBERT, so we should use the distilBERT tokenizer.  

With huggingface, loading different tokenizers is extremely easy: just import the AutoTokenizer from `transformers` and tell it what model you plan to use, and it will handle everything for you.

Reference:
 - [transformers.AutoTokenizer](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoTokenizer)

In [7]:
# [ TODO ] load the distilBERT tokenizer using AutoTokenizer
# Note: this solution loads the DistilBERT tokenizer class directly;
# AutoTokenizer.from_pretrained(MODEL_NAME) is the generic route the TODO asks for.

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

#### Label processing

Our labels also need to be processed, so let's do that next.

For this tutorial, we'll use the OneHotEncoder provided by scikit-learn.

For now, just declare a new encoder and use `fit` to learn the data. Hint: you should still end up with 6 labels.

Documents:
 - [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder)

In [8]:
# [ TODO ] declare a new encoder and let it learn from the dataset
import numpy as np
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(np.reshape(dataset['train']['level'],(-1,1)))

OneHotEncoder(handle_unknown='ignore')

In [9]:
# check if you still have 6 labels
LABEL_COUNT = len(encoder.categories_[0])
print(LABEL_COUNT)

6
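
To make it clearer what the encoder does, here is a hand-written sketch of one-hot encoding in plain Python. It is not scikit-learn's implementation; `fit_categories` and `one_hot` are hypothetical helper names, and the label list is a made-up sample. It mirrors two behaviours of the encoder used above: categories are kept in sorted order, and with `handle_unknown='ignore'` an unseen label becomes a row of zeros.

```python
def fit_categories(labels):
    """Return the sorted unique labels, mirroring how the encoder orders categories."""
    return sorted(set(labels))

def one_hot(label, categories):
    """Return a one-hot row; a label not seen during fitting gives all zeros."""
    return [1.0 if label == category else 0.0 for category in categories]

# Made-up sample containing all six CEFR levels.
categories = fit_categories(['C2', 'B2', 'A1', 'B1', 'C1', 'A2', 'B2'])
print(categories)                  # ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
print(one_hot('C2', categories))   # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(one_hot('D1', categories))   # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```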


In [10]:
np.reshape(dataset['train']['level'],(-1,1))

array([['C2'],
       ['B2'],
       ['B2'],
       ...,
       ['B1'],
       ['B2'],
       ['A1']], dtype='<U2')

#### Process the data

To make things easier, we can write a function to process our dataset in batches. 

In [11]:
def preprocess(dataslice):
    """ Input: a batch of your dataset
        Example: { 'text': ['sentence1', 'sentence2', ...],
                   'level': ['level1', 'level2', ...] }

        Output: a batch of processed dataset
        Example: { 'input_ids': ...,
                   'attention_mask': ...,
                   'label': ... }
    """
    # One-hot encode the CEFR levels; the result has shape (batch_size, 6)
    one_hot = encoder.transform(np.reshape(dataslice['level'], (-1, 1))).toarray()
    dataslice["label"] = one_hot
    # Note: padding='max_length' pads every sentence to the model's maximum
    # length here; leaving padding out would defer it to the DataCollator.
    return tokenizer(dataslice["text"], padding='max_length', truncation=True)

In [12]:
encoder.transform(np.reshape(dataset['train']['level'],(-1,1))).toarray()

array([[0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       ...,
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0.]])

In [13]:
encoder2 = OneHotEncoder(handle_unknown='ignore')
encoder2.fit_transform(np.reshape(dataset['train']['level'],(-1,1))).toarray()

array([[0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       ...,
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0.]])

In [14]:
encoded_str = tokenizer(dataset['train']['text'][0], padding=True, truncation=True) 
encoded_str

{'input_ids': [101, 2053, 2936, 1037, 6556, 1010, 8848, 1010, 4895, 5714, 6442, 4630, 2406, 1010, 2009, 2150, 1037, 2486, 2000, 2022, 29072, 2098, 2007, 1999, 2885, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
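
The output above has 27 IDs for a 17-word sentence. Punctuation marks and the special `[CLS]`/`[SEP]` tokens (IDs 101 and 102) get their own entries, and rarer words are split into sub-word pieces. BERT-style tokenizers do this with WordPiece, which repeatedly takes the longest vocabulary entry that matches the rest of the word. The sketch below illustrates that greedy matching with a tiny made-up vocabulary; `wordpiece` and `toy_vocab` are hypothetical, and the pieces it produces are not necessarily the ones the real tokenizer would choose.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Made-up vocabulary for illustration only.
toy_vocab = {"un", "##im", "##port", "##import", "##ant", "country"}

print(wordpiece("unimportant", toy_vocab))  # ['un', '##import', '##ant']
print(wordpiece("country", toy_vocab))      # ['country']
print(wordpiece("xyz", toy_vocab))          # ['[UNK]']
```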

In [15]:
encoder.categories_[0]

array(['A1', 'A2', 'B1', 'B2', 'C1', 'C2'], dtype='<U2')

In [43]:
# map the function to the whole dataset
processed_data = dataset.map(preprocess,    # your processing function
                             batched = True, # Process in batches so it can be faster                            
                             )


Loading cached processed dataset at C:/Users/user/.cache/huggingface/datasets/csv/default-a3ae748087bf9cd4/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-3e881b76b7d4c251.arrow


In [17]:
a=[[1,2,3],
   [4,5,6]
  ]
np.argmax(a, axis=1)

array([2, 2], dtype=int64)

In [44]:
print(processed_data)
processed_data['train'][20]

DatasetDict({
    train: Dataset({
        features: ['text', 'level', 'label', 'input_ids', 'attention_mask'],
        num_rows: 23020
    })
})


{'text': 'She seemed ideally suited for the job.',
 'level': 'B2',
 'label': [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
 'input_ids': [101, 2016, 2790, 28946, 10897, 2005, 1996, 3105, 1012, 102, 0, 0, 0, ...],
 ...}

In [19]:
#processed_data.set_format(type='torch', columns=['input_ids', 'level', 'attention_mask', 'label'])
#tokenized_datasets = tokenized_datasets.remove_columns_(dataset["train"].column_names)

### DataCollator

You might have noticed that we skipped padding the sentences. That's because we are going to do it during training.  

To do training-time processing, we can use the DataCollator Class provided by `transformers`. And guess what - transformers has a class that will handle padding for us, too!

 - [transformers.DataCollatorWithPadding](https://huggingface.co/docs/transformers/master/en/main_classes/data_collator#transformers.DataCollatorWithPadding)

In [20]:
# [ TODO ] declare a collator to do padding during training
# https://zhuanlan.zhihu.com/p/414552021
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
data_collator

DataCollatorWithPadding(tokenizer=PreTrainedTokenizer(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}), padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')
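
Padding at training time means each batch only needs to be as long as its own longest sentence. The following is a plain-Python sketch of that idea, not the collator's actual code; `pad_batch` is a hypothetical helper. It assumes right-side padding with pad ID 0, which matches the `padding_side='right'` and `[PAD]` settings printed above, and the second sentence's IDs are invented for illustration.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 2016, 2790, 28946, 10897, 2005, 1996, 3105, 1012, 102],  # "She seemed ideally suited for the job."
    [101, 7592, 102],                                              # invented short sentence
]
padded = pad_batch(batch)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

Both sequences come out with length 10 here, instead of the model's maximum length.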

## Training

Finally, we can move on to training.

### Preparation

We can load the pretrained model from `transformers`.  
Generally, you need to build your own model on top of BERT if you want to use it for a downstream task, but sequence classification is popular enough that the `transformers` library supports it directly. It can be done in two lines of code: 

1. Load `AutoModelForSequenceClassification` Class.
2. Load the pretrained model.

In [21]:
'''label2id, id2label = dict(), dict()
for i, label in enumerate(encoder.categories_[0]):
    label2id[label] = str(i)
    id2label[str(i)] = label'''


In [22]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased',
                                                           num_labels = LABEL_COUNT,
                                                           #label2id=label2id,
                                                           #id2label=id2label
                                                          )

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifi

#### Split train/val data

The `Dataset` class we prepared before has a `train_test_split` method. You can use it to split your (processed) dataset.

Document:
 - [datasets.Dataset - Sort, shuffle, select, split, and shard](https://huggingface.co/docs/datasets/process.html#sort-shuffle-select-split-and-shard)

In [23]:
# [ TODO ] choose a validation size and split your data
train_val_dataset = processed_data['train'].train_test_split(test_size=0.2)

In [24]:
print(train_val_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'level', 'label', 'input_ids', 'attention_mask'],
        num_rows: 18416
    })
    test: Dataset({
        features: ['text', 'level', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4604
    })
})
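
Conceptually, the split shuffles the row indices and cuts off a fraction for validation. Here is a sketch of that idea in plain Python; it is not the library's implementation, and `train_test_split_indices` is a hypothetical helper. The fixed seed makes the split reproducible, which matters if you want to compare runs; `train_test_split` also accepts a `seed` argument for the same purpose.

```python
import random

def train_test_split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(23020, test_size=0.2)
print(len(train_idx), len(test_idx))  # 18416 4604
```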


#### Setup training parameters

We are using the Trainer API to do the training. Trainer is yet another utility provided by huggingface, which helps you train the model with ease.  

Document:
- [transformers.TrainingArguments](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.TrainingArguments)
- [transformers.Trainer](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.Trainer)

In [25]:
from transformers import TrainingArguments, Trainer

In [26]:
# [ TODO ] set and tune your training properties
OUTPUT_DIR = "data/out.txt"  # Trainer treats this path as a directory for checkpoints
LEARNING_RATE = 5e-5         # 0.1 is far too large for fine-tuning; 5e-5 is the Trainer default
BATCH_SIZE = 16
EPOCH = 40
training_args = TrainingArguments(
    output_dir = OUTPUT_DIR,
    learning_rate = LEARNING_RATE,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size = BATCH_SIZE,
    num_train_epochs = EPOCH,
    # you can set more parameters here if you want
)

# now give all the information to a trainer
trainer = Trainer(
    model=model,
    args=training_args,  # without this, the settings above are ignored and defaults are used
    data_collator=data_collator,
    tokenizer=tokenizer,
    train_dataset=train_val_dataset['train'],
    eval_dataset=train_val_dataset['test']
    # set your parameters here
)
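
When choosing the batch size and number of epochs, it helps to estimate how many optimization steps the run will take. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `total_training_steps` is a hypothetical helper, not part of `transformers`.

```python
import math

def total_training_steps(n_train, batch_size, epochs):
    """Estimate optimizer steps: batches per epoch times number of epochs."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs

# 18416 training rows, batch size 16, 40 epochs
print(total_training_steps(18416, 16, 40))  # 46040
```

Since the trainer logs the loss every 500 steps by default, this number also tells you roughly how many log lines to expect.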

### Training

This is the easy part. Simply ask the trainer to train the model for you!

In [27]:
#https://blog.csdn.net/weixin_41868756/article/details/122961147
#hidden
'''
Step	Training Loss
500	0.397600
1000	0.357100
1500	0.349800
2000	0.339100
2500	0.310500
3000	0.257100
3500	0.256400
4000	0.253900
4500	0.241700
5000	0.184100
5500	0.156000
6000	0.146900
6500	0.144500
'''
#trainer.train()


### Save for future use

Hint: try using `save_pretrained`

In [28]:
# [ TODO ] practice saving your model for future use
#hidden
'''model_path='./data/model'
model.save_pretrained(model_path)'''

"model_path='./data/model'\nmodel.save_pretrained(model_path)"

## Prediction

Now we know exactly how to train a model, but how do we use it for predicting results?

### Load finetuned model

In [47]:
# [ TODO ] load the model that you saved
# https://stackoverflow.com/questions/72108945/saving-finetuned-model-locally


model_path='./data/model'
mymodel = AutoModelForSequenceClassification.from_pretrained(model_path)


loading configuration file ./data/model\config.json
Model config DistilBertConfig {
  "_name_or_path": "./data/model",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "multi_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.21.0",
  "vocab_size": 30522
}

loading weights file ./data/model\pytorch_mode

### Get the prediction

Here are a few example sentences:

In [48]:
examples = [
    # A2
    "Remember to write me a letter.",
    # B2
    "Strawberries and cream - a perfect combination.",
    "This so-called \"Perfect Evening\" was so disappointing, as well as discouraging us from coming to your Circle Theatre again.",
    # C1
    "Some may altogether give up their studies, which I think is a disastrous move.",
]

All we need to do is to transform them to token IDs, and then we can get predictions by calling your finetuned model.  

Since we don't have a DataCollator to pad the sentences and convert them to tensors this time, we have to do both on our own.

In [49]:
# Tokenize the sentences into padded tensors of token IDs
inputs = tokenizer(examples, truncation=True, padding=True, return_tensors="pt")
# Get the output
logits = mymodel(**inputs).logits
logits

tensor([[-0.5930,  0.7402, -4.4594, -5.6575, -5.8491, -6.1265],
        [-6.9154, -6.3664, -3.0927,  1.6226, -3.7276, -2.8576],
        [-7.5301, -7.0676, -3.6806,  3.0856, -4.1731, -4.8100],
        [-6.9661, -6.6125, -5.2798, -3.6026,  2.8461, -3.5893]],
       grad_fn=<AddmmBackward0>)

Logits aren't very readable for us. Let's use the softmax function to transform them into more probability-like numbers.

In [50]:
from torch import nn

predicts = nn.functional.softmax(logits, dim = -1) 
predicts

tensor([[2.0707e-01, 7.8539e-01, 4.3346e-03, 1.3080e-03, 1.0799e-03, 8.1833e-04],
        [1.9100e-04, 3.3071e-04, 8.7337e-03, 9.7507e-01, 4.6289e-03, 1.1049e-02],
        [2.4473e-05, 3.8862e-05, 1.1494e-03, 9.9771e-01, 7.0243e-04, 3.7153e-04],
        [5.4584e-05, 7.7732e-05, 2.9473e-04, 1.5769e-03, 9.9640e-01, 1.5979e-03]],
       grad_fn=<SoftmaxBackward0>)
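
To see where these numbers come from, here is softmax written out by hand for the first row of logits above. This is a sketch in plain Python rather than the PyTorch implementation; subtracting the maximum before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    shift = max(logits)                       # for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First row of the logits printed above.
row = [-0.5930, 0.7402, -4.4594, -5.6575, -5.8491, -6.1265]
probs = softmax(row)
print([round(p, 4) for p in probs])
```

The second entry comes out at roughly 0.785, matching the first row of the tensor above.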

#### Transform logits back to labels

Now you've got the output. Write a function to map it back into labels!

In [51]:
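# [ TODO ] try to process the result
def Decoder(encoder,predicts):
    # inverse_transform maps each row back to the category with the largest score
    label=encoder.inverse_transform(predicts.detach().numpy())
    # flatten the (n, 1) result into a 1-D array of labels
    label=np.reshape(label,(1,-1))
    return label[0]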

Decoder(encoder,predicts)


array(['A2', 'B2', 'B2', 'C1'], dtype='<U2')
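
The same mapping can be done without the encoder by taking the index of the largest score in each row. The sketch below uses plain Python; `decode` is a hypothetical helper and `LEVELS` assumes the sorted category order shown earlier. Because softmax preserves the ordering of the scores, it works on raw logits as well as on probabilities.

```python
LEVELS = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']

def decode(rows, levels=LEVELS):
    """Map each row of scores to the level with the highest score."""
    return [levels[max(range(len(row)), key=row.__getitem__)] for row in rows]

# The logits printed earlier for the four example sentences.
example_logits = [
    [-0.5930,  0.7402, -4.4594, -5.6575, -5.8491, -6.1265],
    [-6.9154, -6.3664, -3.0927,  1.6226, -3.7276, -2.8576],
    [-7.5301, -7.0676, -3.6806,  3.0856, -4.1731, -4.8100],
    [-6.9661, -6.6125, -5.2798, -3.6026,  2.8461, -3.5893],
]
print(decode(example_logits))  # ['A2', 'B2', 'B2', 'C1']
```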

## Evaluation

Let's see how you did!  
Load the testing data and calculate your accuracy.

We want you to calculate the three kinds of accuracy mentioned in the lecture, which will also be explained in the following section.

In [52]:
train_val_dataset['test']['text'][:3]

['You must have a firm, outgoing personality, but be self-reliant and strong-willed.',
 'I like living on my own.',
 'I think this milk is bad.']

In [53]:
#test ground truth
test_ground_truth=train_val_dataset['test']['level']
test_ground_truth

['C2', 'B1', 'B2', 'C2', 'C1', 'C1', 'B1', 'C1', 'C1', 'B2', ...]

In [54]:
# [ TODO ] 
# load test data
# preprocess
# get predictions
# transform predictions back into labels
#test predict
with torch.no_grad():
    test_input=tokenizer(train_val_dataset['test']['text'], truncation=True, padding=True, return_tensors="pt")
    test_logits = mymodel(**test_input).logits
    predict_ = nn.functional.softmax(test_logits, dim = -1) 

In [55]:
predict_label = Decoder(encoder,predict_)
predict_label

array(['C2', 'B1', 'A2', ..., 'C1', 'C2', 'C1'], dtype='<U2')

In [56]:
# Print a few predictions next to the ground truth to check that the outputs are reasonable.
# Each line reads "prediction: ground truth".
for idx, (truth, pred) in enumerate(zip(test_ground_truth, predict_label)):
    if idx >= 10: break
    print(f'{pred}: {truth}') 

C2: C2
B1: B1
A2: B2
C2: C2
C1: C1
C1: C1
B1: B1
B2: C1
B2: C1
B2: B2


### Six Level Accuracy

Exact accuracy is probably what you're most familiar with:

$
accuracy = \frac{\#exactly\:the\:same\:levels}{\#total}
$

Example:
```
Prediction:   A1 A2 B1 B2 C1 C2
Ground truth: A2 B1 B1 B2 B2 C2
                    ^  ^     ^
```

The six level accuracy is $\frac{3}{6} = 0.5$

As a requirement, <u>your exact accuracy should be higher than $0.5$</u>.
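
A quick way to check your understanding of the formula is to run it on the toy example above. This is a minimal sketch; `exact_accuracy` is a hypothetical helper name.

```python
def exact_accuracy(predictions, truths):
    """Share of predictions that match the ground truth exactly."""
    matches = sum(p == t for p, t in zip(predictions, truths))
    return matches / len(truths)

pred = "A1 A2 B1 B2 C1 C2".split()
truth = "A2 B1 B1 B2 B2 C2".split()
print(exact_accuracy(pred, truth))  # 0.5
```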

In [57]:
# [ TODO ] calculate accuracy
'''
BATCH_SIZE = 16
EPOCH = 40
Accuracy=0.5612510860121633
'''
total=len(predict_label)
correct=0
for truth, pred in zip(test_ground_truth, predict_label):
    if truth==pred:
        correct+=1
accuracy=correct/total
print('Accuracy:{}'.format(accuracy))

Accuracy:0.8307993049522154


### Three Level Accuracy

Three Level Accuracy is used when you only want a more general sense of right or wrong.

$
accuracy = \frac{\#the\:same\:ABC\:levels}{\#total}
$

Example:
```
Prediction:   A1 A2 B1 B2 C1 C2
Ground truth: A2 B1 B1 B2 B2 C2
              ^     ^  ^     ^
```

The three level accuracy is $\frac{4}{6} = 0.667$

As a requirement, <u>your three-level accuracy should be higher than $0.6$</u>.
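
Since the A/B/C band is just the first character of a CEFR label, the toy example above can be checked by comparing first letters. This is a sketch, with `three_level_accuracy` as a hypothetical helper name.

```python
def three_level_accuracy(predictions, truths):
    """Share of predictions in the same A/B/C band as the ground truth."""
    matches = sum(p[0] == t[0] for p, t in zip(predictions, truths))
    return matches / len(truths)

pred = "A1 A2 B1 B2 C1 C2".split()
truth = "A2 B1 B1 B2 B2 C2".split()
print(round(three_level_accuracy(pred, truth), 3))  # 0.667
```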

In [58]:
# [ TODO ] calculate accuracy
'''
BATCH_SIZE = 16
EPOCH = 40
Accuracy:0.7454387489139879
'''
correct=0
for truth, pred in zip(test_ground_truth, predict_label):
    # count a hit when prediction and ground truth share the same A/B/C band
    if truth in ('A1','A2'):
        if pred in ('A1','A2'):
            correct+=1
    elif truth in ('B1','B2'):
        if pred in ('B1','B2'):
            correct+=1
    else:
        if pred in ('C1','C2'):
            correct+=1
accuracy=correct/total
print('Accuracy:{}'.format(accuracy))

Accuracy:0.9057341442224153


### Fuzzy accuracy

However, the level of a sentence is relatively subjective. In practice, linguistic evaluations generally tolerate an error of $\pm1$ level.  

For example, if the actual label is 'B1', we'll also consider the prediction 'right' if the model predicts 'B2' or 'A2'.

Hence, the fuzzy accuracy is

$
accuracy = \frac{\#good\:enough\:answers}{\#total}
$

Example:
```
Prediction:   0 1 2 3 4 5
Ground truth: 0 1 1 3 3 3
              ^ ^ ^ ^ ^
```

The fuzzy accuracy is $\frac{5}{6} = 0.833$

As a requirement, <u>your fuzzy accuracy should be higher than $0.8$</u>.
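
The $\pm1$ rule becomes a one-line check once each level is mapped to its position in the ordered list of levels. The sketch below reproduces the toy example above; `fuzzy_accuracy` and its `tolerance` parameter are hypothetical names.

```python
LEVELS = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']

def fuzzy_accuracy(predictions, truths, tolerance=1):
    """Share of predictions within `tolerance` levels of the ground truth."""
    good = sum(
        abs(LEVELS.index(p) - LEVELS.index(t)) <= tolerance
        for p, t in zip(predictions, truths)
    )
    return good / len(truths)

# The example above, with indices 0..5 mapped to A1..C2.
pred = [LEVELS[i] for i in [0, 1, 2, 3, 4, 5]]
truth = [LEVELS[i] for i in [0, 1, 1, 3, 3, 3]]
print(round(fuzzy_accuracy(pred, truth), 3))  # 0.833
```

Setting `tolerance=0` gives the exact accuracy again.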

In [59]:
# [ TODO ] calculate accuracy
'''
BATCH_SIZE = 16
EPOCH = 40
Accuracy:0.8746741963509991
'''
correct=0
for truth, pred in zip(test_ground_truth, predict_label):
    # count a hit when the prediction is at most one level away from the ground truth
    if truth=='A1':
        if pred in ('A1','A2'):
            correct+=1
    elif truth=='A2':
        if pred in ('A1','A2','B1'):
            correct+=1
    elif truth=='B1':
        if pred in ('A2','B1','B2'):
            correct+=1
    elif truth=='B2':
        if pred in ('B1','B2','C1'):
            correct+=1
    elif truth=='C1':
        if pred in ('B2','C1','C2'):
            correct+=1
    elif truth=='C2':
        if pred in ('C1','C2'):
            correct+=1
accuracy=correct/total
print('Accuracy:{}'.format(accuracy))

Accuracy:0.960251954821894


## TA's Note

Congratulations, you made it to the end of the tutorial! Make sure you make an appointment to show your work and turn in your finished assignment before next week's lesson. We will ask you to run your code, so double check that everything is working and that your model is saved. Don't worry if you didn't pass the evaluation requirements; you'll still get partial points for trying.

## Appendix 


<a name="Appendix-1-Install-CUDA"></a>

### Appendix 1 - Install CUDA

1. Check your GPU vs. CUDA compatibility:
   - [NVIDIA -> Your GPU Compute Capability](https://developer.nvidia.com/cuda-gpus) -> GeForce and TITAN Products
2. Check library vs. CUDA compatibility: 
   - Pytorch: [Previous PyTorch Versions](https://pytorch.org/get-started/previous-versions/)
   - TensorFlow: [Linux/macOS](https://www.tensorflow.org/install/source#tested_build_configurations) or [Windows](https://www.tensorflow.org/install/source_windows#tested_build_configurations)
3. Note the highest CUDA version that fits your system.

#### >> for conda/mamba users

You can directly install CUDA library with the selected CUDA version.
1. Get [the driver for NVIDIA GPU](https://www.nvidia.com/download/index.aspx)
2. `conda/mamba install -c conda-forge cudatoolkit=${VERSION}`

#### >> for non-conda users

1. Get [the driver for NVIDIA GPU](https://www.nvidia.com/download/index.aspx)
2. Download and install [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit-archive)
3. Download and install [cuDNN Library](https://developer.nvidia.com/rdp/cudnn-archive)

### Appendix 2 - Further Readings

1. [Huggingface Official Tutorials](https://github.com/huggingface/notebooks/tree/master/examples)
2. How to use BERT with other downstream tasks: [How to use BERT from the Hugging Face transformer library](https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209)
3. Training with PyTorch backend: [transformers-tutorials](https://github.com/abhimishra91/transformers-tutorials)
4. A more complicated example that includes manual data/training processing with PyTorch: [Transformers for Multi-Label Classification made simple](https://towardsdatascience.com/transformers-for-multilabel-classification-71a1a0daf5e1)
5. [Text Classification with tensorflow](https://github.com/huggingface/notebooks/blob/master/examples/text_classification-tf.ipynb): TensorFlow example