# Week 14: Senquence Classification with BERT

The assignment this week is to do the senquence classification. This may sound like what we had done in the previous assignment, but we are using BERT as our classifier this week, rather than Machine Learning.

The objective is to judge the CEFR level of a sentence.  
[CEFR](https://www.cambridgeenglish.org/exams-and-tests/cefr/) is a standard for describing language ability of a person. It consists of 6 levels, A1, A2, B1, B2, C1, and C2, going from easier to harder.  
A dataset that contains sentences with the corresponding CEFR level is provided, and you have to use BERT and train a sentence classifier with this dataset.  
The dataset is collected and processed from a research by Alison Chi, 李書卉, 李冠霖 and Prof. Chang. Thank you all for allowing us to use it in the lecture.

As to the implementatin, we will introduce you the [🤗 transformers](https://huggingface.co/) library, which is mantained by huggingface company, as the training framework this week. [Pytorch](https://pytorch.org/) is used as the deep learning backend in this tutorial, but with the transformers library, all codes can be easily changed to tensorflow if you prefer so.  

## Prepare your environment

Again, we highly recommend you to install all packages with a virtual environment manager, like [venv](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/) or [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html), to prevent version conflicts of different packages.  

If you haven't used it before and don't know which to use, I would suggest you start with [mamba](https://github.com/mamba-org/mamba#installation) or [mambaforge](https://github.com/conda-forge/miniforge#mambaforge).

### Install CUDA

Deep learning is a computionally extensive process. It takes lots of time if relying only on the CPU, especially when it's trained on a large dataset. That's why using GPU instead is generally recommended.  
To use GPU for computation, you have to install [CUDA toolkit](https://developer.nvidia.com/cuda-toolkit) as well as the [cuDNN library](https://developer.nvidia.com/cudnn) provided by NVIDIA.  

If you already had CUDA installed on your machine, then great! You're done here.  
If you don't, you can refer to [Appendix 1](#Appendix-1-Install-CUDA) to see how to do so.

### Install python packages

Dependencies:

1. `numpy`: for matrix operation
2. `scikit-learn`: for label encoding
3. `datasets`: for data preparation
4. `transformers`: for model loading and finetuing
5. (choose one) `tensorflow` / `pytorch`: the backend DL framework
   - Note that the tf/pt version must support the CUDA version you've installed if you want to use GPU.


### Select GPU(s) for your backend

Skip this section if you have no intension of using GPU with tensorflow/pytorch.

In [1]:
import os

# select your GPU. Note that this should be set before you load tensorflow or pytorch.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# To use multiple GPUs, combine all GPU ID with commas
# e.g. >>> os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,3'

#### >> Check Pytorch

In [3]:
import torch
# Check if any GPU is used
torch.cuda.is_available()

True

#### >> Check Tensorflow

In [4]:
import tensorflow as tf
# Check if your GPU(s) is(are) listed below 
tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

## Prepare the dataset

Before starting the training, of course we need to load and process our dataset - but wait a sec. Let's decide which model we want to use first.  

In case you are not familiar with it, [BERT](https://arxiv.org/abs/1810.04805) (**B**idirectional **E**ncoder **R**epresentations from **T**ransformers) is a language model proposed by Google AI in 2018, and it's one of the most popular models used in NLP area.  
However, we will not directly use BERT in this tutorial, because it's large and needs plenty of time to train. Instead, we are using [DistilBert](https://medium.com/huggingface/distilbert-8cf3380435b5) this week.  

DistilBERT is a distilled (蒸餾) version of BERT that is much more light-weighted than original model while reserving 95% of its original accuracy, which makes it perfect for our task today.  

Further Reading:
 - [BERT Explained: A Complete Guide with Theory and Tutorial](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/) by Samia, 2019.
 - [進擊的 BERT：NLP 界的巨人之力與遷移學習](https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html) by 李孟, 2019

In [5]:
# the model you want to use. Available models can be found here: https://huggingface.co/models
#MODEL_NAME = 'distilbert-base-uncased'
MODEL_NAME = 'bert-base-uncased'

### Load data

Similar to `transformers` library, `datasets` is also a package provided by huggingface. It contains many public datasets online and can help us with the data processing.  
We can use `load_dataset` function to read the input `.csv` file.

Reference:
 - [Official datasets document](https://huggingface.co/docs/datasets)
 - [datasets.load_dataset](https://huggingface.co/docs/datasets/loading.html)

In [6]:
import os
from datasets import load_dataset
import csv
import json

In [7]:
dataset = load_dataset('csv', data_files = 'train_df_noemoji.csv')
#dataset = load_dataset('csv', data_files = os.path.join('data', 'train_df.csv'))

Using custom data configuration default-790bd679a5cc5029
Reusing dataset csv (/home/neaf-3090/.cache/huggingface/datasets/csv/default-790bd679a5cc5029/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff)


In [14]:
# Take a look at the data structure
print(dataset['train'])
print(dataset['train'][1])
print(dataset['train']['text'][:10])

Dataset({
    features: ['text', 'emotion'],
    num_rows: 1455563
})
{'text': 'brianklaas As we see Trump is dangerous to freepress around the world What a LH LH TrumpLegacy  CNN', 'emotion': 'sadness'}
['People who post add me on Snapchat must be dehydrated Cuz man thats LH', 'brianklaas As we see Trump is dangerous to freepress around the world What a LH LH TrumpLegacy  CNN', 'Now ISSA is stalking Tasha  LH', 'RISKshow TheKevinAllison Thx for the BEST TIME tonight What stories Heartbreakingly LH authentic LaughOutLoud good', 'Still waiting on those supplies Liscus LH', 'Love knows no gender  LH', 'DStvNgCare DStvNg More highlights are being shown than actual sports Who watches triathlon highlights anyway LH LeagueCup', 'The SSM debate LH a manufactured fantasy used to distract the ignorant masses from their mundane lives V gender diversity a m', 'I love suffering  I love when valium does nothing to help  I love when my doctors say that theyve done all they can  LH', 'Can someone tel

### Preprocessing

Same as before, texts should be tokenized, embedded, and padded before put into the model.  
But don't worry, with the libraries from huggingface, the procedure is much easier now.

#### Sentence processing

Different pre-trained language models may have their own preprocessing models, and that's why we should use the tokenizers trained along with that model. In our case, we are using distilBERT, so we should use the distilBERT tokenizer.  

With huggingface, loading different tokenizer is extremely easy: just import the AutoTokenizer from `transformers` and tell it what model you plan to use, and it will handle everything for you.

Reference:
 - [transformers.AutoTokenizer](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoTokenizer)

In [9]:
from transformers import AutoTokenizer # For tokenization

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

#### Play with BERTTokenizer

<small><i>*You can safely skip this section if you're already familar with BERTTokenizer.</i></small>

Let's play with this tokenizer a little bit before we go on.

Using this tokenizer is pretty easy: just call this object, and it processes the sentences for you.  

In [9]:
example = "This so-called \"Perfect Evening\" was so disappointing, as well as discouraging us from coming to your Circle Theatre again."

embeddings = tokenizer(example)
embeddings

{'input_ids': [101, 2023, 2061, 1011, 2170, 1000, 3819, 3944, 1000, 2001, 2061, 15640, 1010, 2004, 2092, 2004, 12532, 4648, 4726, 2149, 2013, 2746, 2000, 2115, 4418, 3004, 2153, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

As you can see, the sentence has already been tokenized and embedded. A default attention mask is returned as well.  

To get the token back is easy as well!

In [10]:
decoded_tokens = tokenizer.batch_decode(embeddings['input_ids'])
print(' '.join(decoded_tokens))

[CLS] this so - called " perfect evening " was so disappointing , as well as disco ##ura ##ging us from coming to your circle theatre again . [SEP]


You may notice that there're some weird stuffs appearing in your task, like `[CLS]` or `[SEP]`. The word *discouraging* is even split into `disco` `##ura` and `##ging` .  
`[CLS]`, `[SEP]`, `[UKN]` and `[MASK]` are four symbols introduced by BERT, which stand for "classification", "seperator", "unknown" and "mask" respectively.  
As to `##` thing, it's called a *wordpiece*, which is a concept [also brought out by Google](https://arxiv.org/abs/1609.08144). The key idea is to split words into common sub-word units, so the number of rare words can significantly decrease.

Besides simply tokenizing a sentence, there are also many parameters you can set. You can play with it a bit, changing the parameters and observe the difference.

Document:
 - [transformers.Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer)

In [11]:
# EXAMPLE: directly transform into embedding tensor
embeddings = tokenizer(example,
                       # padding='longest',         # padding strategy
                       # max_length=10,             # how long to pad sentences
                       is_split_into_words=False,
                       truncation=True,
                       return_tensors='pt',         # 'tf' for tensofrlow, 'pt' for pytorch, 'np' for numpy
                       # return_length=True         # whether to return length
                       # Any other parameters you want to try
                      )
embeddings

{'input_ids': tensor([[  101,  2023,  2061,  1011,  2170,  1000,  3819,  3944,  1000,  2001,
          2061, 15640,  1010,  2004,  2092,  2004, 12532,  4648,  4726,  2149,
          2013,  2746,  2000,  2115,  4418,  3004,  2153,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1]])}

#### Label processing

Before we start to process sentences in the whole dataset, don't forget we need to process labels as well.

In the following section, I will introduce you the OneHotEncoder provided by scikit-learn.

Documents:
 - [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder)

In [10]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# First, declare a new encoder
encoder = OneHotEncoder(sparse = False)
# Then, let the encoder learns all features in the given dataset
# Keep in mind that all `fit` functions in sklearn only make the encoder learn from the data, not transforming the data yet.
encoder = encoder.fit(np.reshape(dataset['train']['emotion'], (-1, 1)))

In [11]:
LABEL_COUNT = len(encoder.categories_[0])
print(LABEL_COUNT)

8


#### Play with OneHotEncoder

<small><i>*You can safely skip this section if you're already familar with sklearn.</i></small>

One thing you should always keep in mind is: features learned by OneHotEncoder are always treated as arrays, because it allows multi-field features. (See its [document](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder) for an example)  
That's why you have to reshape the level into (-1, 1), i.e. from `['A1', 'B1', 'C1', ...]` to `[['A1'], ['B1'], ['C1'], ...]` .

In [14]:
# Let's see what features has the encoder captured
print(encoder.categories_)

[array(['anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness',
       'surprise', 'trust'], dtype='<U12')]


In [15]:
# use `encoder.transform` to get the one-hot code of a label
print(encoder.transform([['anger'], ['fear']]))

[[1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0.]]


In [16]:
# To decode, use `encoder.inverse_transform` instead
print(encoder.inverse_transform([[0, 0, 1, 0, 0, 0, 0, 0]]))

[['disgust']]


#### [ TODO ] Process the data

With the tokenizor and encoder prepared, we can write a function to process our dataset!

In [15]:
def preprocess(dataslice):
    """ Input: a batch of your dataset
        Example: { 'text': [['sentence1'], ['setence2'], ...],
                   'label': ['label1', 'label2', ...] }
    """
    dataset = {}
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    emotion_list = []
    text_list = []
    
    for text, emotion in zip(dataslice['text'], dataslice['emotion']):
        text = text.replace("LH", "").replace("  ", " ").lower()
        embeddings = tokenizer(text,
                       padding=True,         # padding strategy
                       # max_length=10,             # how long to pad sentences
                       is_split_into_words=False,
                       truncation=True,
                       return_tensors='pt',         # 'tf' for tensofrlow, 'pt' for pytorch, 'np' for numpy
                       # return_length=True         # whether to return length
                       # Any other parameters you want to try
                      )
        input_ids_list.append(embeddings['input_ids'][0])
        token_type_ids_list.append(embeddings['token_type_ids'][0])
        attention_mask_list.append(embeddings['attention_mask'][0])
        #label_list.append(encoder.transform(np.reshape(label, (-1, 1))))
        emotion_list.append([emotion])
        text_list.append(text)
        
        
        
    dataset['input_ids'] = input_ids_list
    dataset['token_type_ids'] = token_type_ids_list
    dataset['attention_mask'] = attention_mask_list
    dataset['label'] = encoder.transform(emotion_list)
    dataset['emotion'] = emotion_list
    dataset['text'] = text_list
    return dataset
        
        
    
    # [ TODO ]
    ...

    """ Output: a batch of processed dataset
        Example: { 'input_ids': ...,
                   'attention_masks': ...,
                   'label': ... }
    """

Now, map the function to the whole dataset.

In [16]:
processed_data = dataset.map(preprocess,    # your processing function
                             batched = True # Process in batches so it can be faster
                            )

  0%|          | 0/1456 [00:00<?, ?ba/s]

In [17]:
# Take a look at processed dataset
print(processed_data)
processed_data['train'][0]

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'emotion', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 1455563
    })
})


{'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'emotion': ['anticipation'],
 'input_ids': [101,
  2111,
  2040,
  2695,
  5587,
  2033,
  2006,
  10245,
  7507,
  2102,
  2442,
  2022,
  2139,
  10536,
  7265,
  3064,
  12731,
  2480,
  2158,
  2008,
  2015,
  102],
 'label': [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 'text': 'people who post add me on snapchat must be dehydrated cuz man thats ',
 'token_type_ids': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0]}

### DataCollator

You may notice that we didn't pad the sentences in the preprocessing function, because we are going to do it during the training time.  

To do the training-time processing, we can use the DataCollator Class provided by `transformers`. What's even better is, transformers already provides a class that handles padding for us!

 - [transformers.DataCollatorWithPadding](https://huggingface.co/docs/transformers/master/en/main_classes/data_collator#transformers.DataCollatorWithPadding)

In [18]:
from transformers import DataCollatorWithPadding

# declare a collator to do padding during traning.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Training

### Preparation

We can load the pretrained model from `transformers`.  
Generally, you need to build your own model on top of BERT if you want to use BERT for some downstream tasks, but again, sequence classification is a popular topic. With the support from `transformers` library, all works can be done in two lines of codes: 

1. Load `AutoModelForSequenceClassification` Class.
2. Load the pretrained model.

In [19]:
# Change to TFAutoModelForSequenceClassification if you're using tensoflow
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased',
                                                           num_labels = LABEL_COUNT)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

#### [ TODO ] Split train/val data

The `Dataset` class we prepared before already has the `train_test_split` method. You can use it to split your dataset.

Document:
 - [datasets.Dataset - Sort, shuffle, select, split, and shard](https://huggingface.co/docs/datasets/process.html#sort-shuffle-select-split-and-shard)


In [16]:
# [ TODO ] Choose the validation data size                                v here
train_val_dataset = processed_data['train'].train_test_split(test_size = 0.1)

Loading cached split indices for dataset at /home/neaf-3090/.cache/huggingface/datasets/csv/default-27ef3c0bbd90db4a/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff/cache-5812b656b1494ff3.arrow and /home/neaf-3090/.cache/huggingface/datasets/csv/default-27ef3c0bbd90db4a/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff/cache-d4937ddb40026fe9.arrow


In [17]:
# Take a look at split data
print(train_val_dataset)

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'emotion', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 1310006
    })
    test: Dataset({
        features: ['attention_mask', 'emotion', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 145557
    })
})


#### [ TODO ] Setup training parameters

We are using the TrainerAPI to do the training. Trainer is yet another utility provided by huggingface, which helps you train the model with ease.  

Document:
- [transformers.TrainingArguments](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.TrainingArguments)
- [transformers.Trainer](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.Trainer)

In [20]:
# Change to TFTrainingArguments, TFTrainer if you're using tensoflow
from transformers import TrainingArguments, Trainer

In [22]:
# [ TODO ] Set and tune your training properties
LEARNING_RATE = 5e-5
BATCH_SIZE = 128
EPOCH = 8
#WARMUP = 0.1
training_args = TrainingArguments(
    output_dir = 'model',
    #overwrite_output_dir = True,
    learning_rate = LEARNING_RATE,
    per_device_train_batch_size = BATCH_SIZE,
    #per_device_eval_batch_size = BATCH_SIZE,
    num_train_epochs = EPOCH,
    load_best_model_at_end=True,
    #warmup_ratio = WARMUP,
    # You can also set other parameters here
)

# Now give all information to a trainer.
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = processed_data['train'],
    #train_dataset = train_val_dataset["train"],
    #eval_dataset = train_val_dataset["test"],
    tokenizer = tokenizer,
    data_collator = data_collator,
    
    # You can also set other parameters
)

### Training

Training is pretty easy. Simply ask the trainer to train the model for you!

In [23]:
trainer.train()

Step,Training Loss
500,0.2902
1000,0.2494
1500,0.2397
2000,0.2333
2500,0.2298
3000,0.2266
3500,0.2253
4000,0.2232
4500,0.2226
5000,0.2194


TrainOutput(global_step=90976, training_loss=0.1418504040718414, metrics={'train_runtime': 13465.0756, 'train_samples_per_second': 864.793, 'train_steps_per_second': 6.756, 'total_flos': 2.9288245545525286e+17, 'train_loss': 0.1418504040718414, 'epoch': 8.0})

You can see that Trainer saves some ckeckpoints, so you can load your model from those checkpoints if you want to fallback to a specific version.

### Save for future use

In [24]:
model.save_pretrained(os.path.join('model', 'finetuned'))

## Prediction

We've known how to train a model now, but how to really use it for predicting results?

### Load finetuned model

In [25]:
# Same, change to TFxxxxxx if you are using tensorflow
from transformers import AutoModelForSequenceClassification

mymodel = AutoModelForSequenceClassification.from_pretrained(os.path.join('model', 'finetuned'))

### Get the prediction

Given six example sentences...

In [30]:
examples = [
    # A2
    "Remember to write me a letter.",
    # B2
    "Strawberries and cream - a perfect combination.",
    "This so-called \"Perfect Evening\" was so disappointing, as well as discouraging us from coming to your Circle Theatre again.",
    # C1
    "Some may altogether give up their studies, which I think is a disastrous move.",
]

...all you need to do is to transform them to embeddings, and then you can get predictions by calling your finetuned model.  

Note that, since you don't have a DataCollator to pad the sentence and do the matrix transformation for you, you have to pad and transform the matrice on your own.

In [31]:
# Transform the sentences into embeddings
input = tokenizer(examples, truncation=True, padding=True, return_tensors="pt") # change return_tensors if youre using tensorflow
# Get the output
logits = mymodel(**input).logits
logits

tensor([[-0.6909,  0.7161, -3.7188, -4.3224, -3.6388, -4.1544],
        [-5.4560, -5.4394, -3.9893,  2.4557, -3.0997, -2.8560],
        [-5.0532, -5.2660, -4.2828,  2.7641, -2.9263, -3.7443],
        [-4.8139, -4.8385, -4.6691, -2.5907,  2.7176, -4.7137]],
       grad_fn=<AddmmBackward>)

Logits aren't very readable for us. Let's use softmax activation to transform them into more probability-like numbers.

In [32]:
# Or `from tensorflow import nn` and `nn.softmax`
from torch import nn

predicts = nn.functional.softmax(logits, dim = -1)
predicts

tensor([[1.9076e-01, 7.7898e-01, 9.2353e-03, 5.0503e-03, 1.0005e-02, 5.9746e-03],
        [3.6239e-04, 3.6846e-04, 1.5709e-03, 9.8900e-01, 3.8237e-03, 4.8791e-03],
        [4.0011e-04, 3.2343e-04, 8.6450e-04, 9.9357e-01, 3.3564e-03, 1.4813e-03],
        [5.3209e-04, 5.1918e-04, 6.1502e-04, 4.9150e-03, 9.9283e-01, 5.8820e-04]],
       grad_fn=<SoftmaxBackward>)

#### [ TODO ] transform logits back to labels

Now you've got the output. Write a function to map it back into labels!

In [23]:
# [ TODO ] try to process the result
def transform_labels(logits):
    logits = logits.detach().cpu().numpy()
    if isinstance(logits, list):
        logits = np.array(logits)
    a = np.argmax(logits, axis=1)
    b = np.zeros((len(a), logits.shape[1]))
    b[np.arange(len(a)), a] = 1
    return b

In [34]:
predicts = transform_labels(predicts)
print(encoder.inverse_transform(predicts))

[['A2']
 ['B2']
 ['B2']
 ['C1']]


## [ TODO ] Evaluation

It's your turn!  
Load the testing data and calculate your accuracy.

We want you to calculate two kinds of accuracy, exact accuracy and fuzzy accuracy, which will be explained in the following section.


In [26]:
# [ TODO ] 
# load test data
# preprocess
# get predictions
# transform predictions back into labels

#eval_dataset = load_dataset('csv', data_files = os.path.join('data', 'test_df.csv'))
eval_dataset = load_dataset('csv', data_files = 'test_df_noemoji.csv')
print(eval_dataset['train'])
print(eval_dataset['train'][1])
print(eval_dataset['train']['text'][:5])
processed_evaldata = eval_dataset.map(preprocess,    # your processing function
                             batched = True # Process in batches so it can be faster
                            )
print("-----------")
print(processed_evaldata)
print(processed_evaldata['train'][0])

Using custom data configuration default-473e3bdc78202b1e


Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/neaf-3090/.cache/huggingface/datasets/csv/default-473e3bdc78202b1e/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff...


0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /home/neaf-3090/.cache/huggingface/datasets/csv/default-473e3bdc78202b1e/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff. Subsequent calls will reuse this data.
Dataset({
    features: ['text', 'emotion'],
    num_rows: 411972
})
{'text': 'Trust is not the same as faith A friend is someone you trust Putting faith in anyone is a mistake  Christopher Hitchens LH LH', 'emotion': None}
['Confident of your obedience I write to you knowing that you will do even more than I ask Philemon 121 34 bibleverse LH LH', 'Trust is not the same as faith A friend is someone you trust Putting faith in anyone is a mistake  Christopher Hitchens LH LH', 'When do you have enough  When are you satisfied  Is you goal really all about money   materialism money possessions LH', 'God woke you up now chase the day GodsPlan GodsWork LH', 'In these tough times who do YOU turn to as your symbol of hope LH']


  0%|          | 0/412 [00:00<?, ?ba/s]

ValueError: Found unknown categories [None] in column 0 during transform

In [27]:
from transformers import pipeline
mymodel = mymodel.cpu()
classifier = pipeline('sentiment-analysis', model=mymodel, tokenizer=tokenizer)

In [28]:
predict_list = []
label_list = []
print_num = 0
for i in processed_evaldata['train']:
    result = classifier(i['text'])[0]
    predict_list.append(int(result['label'][-1]))
    label_list.append(int(np.argmax(i['label'])))
    #if print_num < 10:
        #print(result['label'])
        #print_num += 1
#print(prediction)

NameError: name 'processed_evaldata' is not defined

In [38]:
print_num = 0
for p, l in zip(predict_list, label_list):
    if print_num < 50:
        print((p, l))
        print_num += 1
    #break

(5, 5)
(3, 3)
(3, 3)
(3, 5)
(4, 4)
(1, 1)
(2, 2)
(3, 3)
(2, 1)
(5, 4)
(3, 3)
(4, 4)
(4, 4)
(4, 4)
(3, 2)
(3, 2)
(4, 4)
(3, 5)
(1, 4)
(4, 4)
(4, 3)
(3, 5)
(3, 3)
(4, 4)
(0, 1)
(2, 2)
(3, 3)
(3, 4)
(1, 1)
(5, 5)
(3, 2)
(1, 1)
(3, 3)
(3, 4)
(3, 3)
(4, 5)
(3, 3)
(3, 3)
(2, 3)
(2, 2)
(0, 1)
(3, 5)
(3, 5)
(2, 1)
(1, 2)
(2, 2)
(3, 3)
(2, 3)
(3, 2)
(0, 1)


### Six Level Accuracy

Exact accuracy is what you've been familiar with:

$
accuracy = \frac{\#exactly\:the\:same\:levels}{\#total}
$

Example:
```
Prediction:   A1 A2 B1 B2 C1 C2
Ground truth: A2 B1 B1 B2 B2 C2
                    ^  ^     ^
```

The six level accuracy is $\frac{3}{6} = 0.5$

As the requirement, <u>your exact accuracy should be higher than $0.5$</u>.

In [39]:
# [ TODO ] calculate accuracy
correct = 0
total = 0
for p, l in zip(predict_list, label_list):
    if p == l:
        correct += 1
    total += 1
print("accuracy: ", correct/total)

accuracy:  0.5569565217391305


### [ TODO ] Three Level Accuracy

Three Level Accuracy is used when you only want the general of right or wrong.

$
accuracy = \frac{\#the\:same\:ABC\:levels}{\#total}
$

Example:
```
Prediction:   A1 A2 B1 B2 C1 C2
Ground truth: A2 B1 B1 B2 B2 C2
              ^     ^  ^     ^
```

The six level accuracy is $\frac{4}{6} = 0.667$

As the requirement, <u>your exact accuracy should be higher than $0.6$</u>.

In [40]:
# [ TODO ] calculate accuracy
correct = 0
total = 0
label_a = [0, 1]
label_b = [2, 3]
label_c = [4, 5]
for p, l in zip(predict_list, label_list):
    if p in label_a and l in label_a:
        correct += 1
    elif p in label_b and l in label_b:
        correct += 1
    elif p in label_c and l in label_c:
        correct += 1
    total += 1
print("accuracy: ", correct/total)

accuracy:  0.7421739130434782


### [ TODO ] Fuzzy accuracy

However, the level of a sentence is relatively subjective. Generally speaking, $\pm1$ errors are allowed in the real evaluation in linguistic area.  

For example, if the label is actually 'B1', but the model predicts 'B2', we still consider the prediction good enough, and this also applys when the model predicts 'A2'.

Hence, the fuzzy accuracy is

$
accuracy = \frac{\#good\:enough\:answers}{\#total}
$

Example:
```
Prediction:   0 1 2 3 4 5
Ground truth: 0 1 1 3 3 3
              ^ ^ ^ ^ ^
```

The fuzzy accuracy is $\frac{5}{6} = 0.833$

As the requirement, <u>your accuracy should be higher than $0.8$</u>.

In [42]:
# [ TODO ] calculate accuracy
for p, l in zip(predict_list, label_list):
    if np.abs(p-l) <= 1:
        correct += 1
    total += 1
print("accuracy: ", correct/total)

accuracy:  0.8291304347826087


## TA's note

Congratuation! You've finished the assignment this week.  
Don't forget to <b>[make an appoiment with TA](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit#gid=134737606) to demo/explain your implementation <u>before <font color="red">12/23 15:30</font></u></b> .  
Also make sure you submit your `{student_id}.ipynb` to [eeclass](https://eeclass.nthu.edu.tw/course/homework/6053).

This is the last assignment of this class. A TA will still be at the online classroom and answer your question during the class time in the following weeks, and you can have make-up demos at that time.  
Prof. Chang's office hours are in Tues. to Thurs. evenings. You can come to Delta 712 to consult him at that time, but make sure you follow the appointment rules written on the bulletin or [the appointment sheet](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit?usp=sharing).




## Appendix 

<a name="Appendix-1-Install-CUDA"></a>

### Appendix 1 - Install CUDA

1. Check your GPU vs. CUDA compatibility:
   - [NVIDIA -> Your GPU Compute Capability](https://developer.nvidia.com/cuda-gpus) -> GeForce and TITAN Products
2. Check library vs. CUDA compatibility: 
   - Pytorch: [Previous PyTorch Versions](https://pytorch.org/get-started/previous-versions/)
   - Tensorflow: [Linux/MacOX](https://www.tensorflow.org/install/source#tested_build_configurations) or [Windows](https://www.tensorflow.org/install/source_windows#tested_build_configurations)
3. Note the highest CUDA version that fits your system.

#### >> for conda/mamba users

You can directly install CUDA library with the selected CUDA version.
1. Get [the driver for NVIDIA GPU](https://www.nvidia.com/download/index.aspx)
2. `conda/mamba install -c conda-forge cudatoolkit=${VERSION}`

#### >> for non-conda users

1. Get [the driver for NVIDIA GPU](https://www.nvidia.com/download/index.aspx)
2. Download and install [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit-archive)
3. Download and install [cuDNN Library](https://developer.nvidia.com/rdp/cudnn-archive)

<a name="Appendix-2-TAs-Environmental-setup"></a>

### Appendix 2 - TA's Environmental Setup

The following is my setup for this notebook. You can refer to it if you encounter some environmental issues.  

System: Ubuntu 18.04.6, x64, With GPU support. All packages are installed in new conda environment with channels default to conda-forge.

1. Python 3.8.12
2. numpy=1.21.2
3. cudatoolkit=11.1.74
4. pytorch=1.8.2
5. datasets=1.16.1
6. transformers=4.12.5
7. scikit-learn=1.0.1

Notes:

 - conda create -n week14 python=3.8 & conda activate week14
 - conda config --add channels conda-forge
 - conda config --set channel_priority strict
 - conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch-lts -c nvidia
 - conda install transformers
 - conda install datasets scikit-learn


### Appendix 3 - Further Readings

1. [Huggingface Official Tutorials](https://github.com/huggingface/notebooks/tree/master/examples)
2. How to use Bert with other downstream tasks: [How to use BERT from the Hugging Face transformer library](https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209): 
3. Training with pytorch backend: [transformers-tutorials](https://github.com/abhimishra91/transformers-tutorials)
4. A more complicated example that include manual data/training processing with Pytorch: [Transformers for Multi-Label Classification made simple](https://towardsdatascience.com/transformers-for-multilabel-classification-71a1a0daf5e1)
5. [Text Classification with tensorflow](https://github.com/huggingface/notebooks/blob/master/examples/text_classification-tf.ipynb): tensorflow example