# Evaluate

This notebook:

    I. presents the evaluate module
    II. shows how to use it
    III. applies it to training

## I. Presentation

`evaluate` is a Hugging Face module for computing evaluation metrics.
A good description of the module is available at https://huggingface.co/docs/evaluate/a_quick_tour.

The available metrics are listed at https://huggingface.co/evaluate-metric.

## II. Usage

In [1]:
# import
########

import evaluate



In [8]:
## show available modules

# evaluate.list_evaluation_modules()

# select
evaluate.list_evaluation_modules(include_community=False, with_details=True)

[]

### A. Load

In [6]:
## loading

acc = evaluate.load("accuracy")

In [11]:
## print the description

# the documentation is available through the docstring:
# print(acc.__doc__)
# or through attributes such as `description`:

print(acc.description)


Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative



In [10]:
## print the usage

print(acc.inputs_description)


Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
        {'accuracy': 0.5}

    Example 2-The same as Example 1, except with `normalize` set to `False`.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> res
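
The `normalize` argument described in the usage text above switches between a fraction and a raw count. The following is a minimal plain-Python sketch of that behaviour, using the data from Example 1. The helper `accuracy_sketch` is hypothetical and is not part of `evaluate`; it only illustrates the documented semantics.

```python
def accuracy_sketch(references, predictions, normalize=True):
    """Hypothetical helper mimicking the documented `normalize` behaviour."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    # Count the positions where the prediction equals the reference.
    correct = sum(r == p for r, p in zip(references, predictions))
    # normalize=True returns a fraction, normalize=False the raw count.
    return correct / len(references) if normalize else correct


references = [0, 1, 2, 0, 1, 2]
predictions = [0, 1, 1, 2, 1, 0]

fraction = accuracy_sketch(references, predictions)                # 0.5
count = accuracy_sketch(references, predictions, normalize=False)  # 3
print(fraction, count)
```

Three of the six predictions match their reference, which gives the fraction 0.5 reported in Example 1.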

### B. Use case

In [14]:
## for pairs of comparisons

acc = evaluate.load("accuracy")
ref = [0, 1, 2, 0, 1, 2]
pre = [0, 1, 1, 2, 1, 0]

for r, p in zip(ref, pre) :
    acc.add(reference=r, prediction=p)

print(acc.compute())

{'accuracy': 0.5}


In [15]:
## batched results

acc = evaluate.load("accuracy")
refs = [[0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2]]
pres = [[0, 1, 1, 2, 1, 0], [0, 1, 1, 2, 1, 0]]

for r, p in zip(refs, pres) :
    acc.add_batch(references=r, predictions=p)

print(acc.compute())

# note: "add" accepts both "reference" and "references" as the argument name,
# whereas "add_batch" only accepts "references".

{'accuracy': 0.5}
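
The two cells above follow the same pattern: accumulate pairs or batches first, then call `compute()` once. The sketch below imitates that pattern in plain Python. The class `RunningAccuracy` is hypothetical and is not how `evaluate` is implemented; in particular, clearing the buffers inside `compute()` is an assumption of this sketch.

```python
class RunningAccuracy:
    """Hypothetical stand-in for the add / add_batch / compute pattern."""

    def __init__(self):
        self._references = []
        self._predictions = []

    def add(self, *, reference, prediction):
        # Store a single (reference, prediction) pair.
        self._references.append(reference)
        self._predictions.append(prediction)

    def add_batch(self, *, references, predictions):
        # Store a whole batch at once.
        self._references.extend(references)
        self._predictions.extend(predictions)

    def compute(self):
        # Score everything stored so far, then reset the buffers
        # (the reset is an assumption of this sketch).
        correct = sum(r == p for r, p in zip(self._references, self._predictions))
        result = {"accuracy": correct / len(self._references)}
        self._references, self._predictions = [], []
        return result


metric = RunningAccuracy()

# Pair by pair, as with `add`.
for r, p in zip([0, 1, 2, 0, 1, 2], [0, 1, 1, 2, 1, 0]):
    metric.add(reference=r, prediction=p)
pairwise = metric.compute()

# Batch by batch, as with `add_batch`.
for _ in range(2):
    metric.add_batch(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
batched = metric.compute()

print(pairwise, batched)
```

Both calls give an accuracy of 0.5, because the batched case simply repeats the same six pairs twice.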


In [16]:
## combine several criteria

metrics = evaluate.combine(["accuracy", "recall", "f1", "precision"])

In [17]:
ref = [0, 1, 1, 0, 1, 1]
pre = [0, 1, 1, 1, 1, 0]
metrics.compute(references=ref, predictions=pre)

{'accuracy': 0.6666666666666666, 'recall': 0.75, 'f1': 0.75, 'precision': 0.75}
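
The combined result above can be checked by hand from the confusion counts. This is a minimal sketch that assumes label 1 is the positive class; the helper `binary_scores` is hypothetical and has no guard against division by zero.

```python
def binary_scores(references, predictions, positive=1):
    """Hypothetical helper: binary metrics from confusion counts."""
    pairs = list(zip(references, predictions))
    tp = sum(r == positive and p == positive for r, p in pairs)
    tn = sum(r != positive and p != positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)

    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "precision": precision,
    }


scores = binary_scores([0, 1, 1, 0, 1, 1], [0, 1, 1, 1, 1, 0])
print(scores)
```

Here TP = 3, TN = 1, FP = 1 and FN = 1, so accuracy is 4/6 while precision, recall and F1 are all 0.75, in line with the output of `metrics.compute` above.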

In [None]:
## plot

# evaluate also provides a way to visualize (plot) results for comparison

# from evaluate.visualization import radar_plot
# data = [{"accuracy": 0.8, "precision": 0.7, "f1": 0.6, "latency_in_seconds": 10}, ...]
# models = ["model1", ...]
# plot = radar_plot(data=data, model_names=models)

## III. Update Training

We take the Training section from "hf_transformers_basics_datasets.ipynb" and replace the manually defined evaluation function with the evaluate module explained above.

In [1]:
# select the GPU to use
# This machine has 2 GPUs and only one should be used, hence this cell.
# Run this cell first, before anything initializes CUDA.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"
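
The cell above needs to run first because the variable is assumed to be read when CUDA is initialised, so a later change may have no effect. The sketch below wraps the assignment in a hypothetical helper, `restrict_visible_gpus`, which also reports whether `torch` was already imported when it was called.

```python
import os
import sys


def restrict_visible_gpus(device_ids="0"):
    """Hypothetical helper: expose only the listed GPU ids to this process."""
    # If torch is already imported, the setting may come too late to matter.
    already_imported = "torch" in sys.modules
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return already_imported


too_late = restrict_visible_gpus("0")
print(os.environ["CUDA_VISIBLE_DEVICES"], "torch already imported:", too_late)
```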

In [2]:
# 1) load dataset
#################

# not changed

from datasets import load_dataset

ckp_data = "davidberg/sentiment-reviews"

data = load_dataset(ckp_data, split="train")

data = data.filter(lambda row: "neutral" not in row["division"])  # filter out neutral reviews
data

Dataset({
    features: ['Unnamed: 0', 'review', 'polarity', 'division'],
    num_rows: 3548
})

In [3]:
# 2) split data
###############

# not changed

split_data = data.train_test_split(test_size=0.1)
split_data

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'review', 'polarity', 'division'],
        num_rows: 3193
    })
    test: Dataset({
        features: ['Unnamed: 0', 'review', 'polarity', 'division'],
        num_rows: 355
    })
})
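
The row counts above follow from the 10% test share. The sketch below reproduces them with a hypothetical helper, `split_indices`, that shuffles row indices and cuts off the test part. It assumes the test size is rounded up, which matches the 3193 / 355 split shown; it is not the `datasets` implementation.

```python
import math
import random


def split_indices(n_rows, test_size=0.1, seed=0):
    """Hypothetical helper: shuffle row indices and cut off a test share."""
    indices = list(range(n_rows))
    # A fixed seed makes the shuffle, and therefore the split, reproducible.
    random.Random(seed).shuffle(indices)
    # Assumption: the test share is rounded up.
    n_test = math.ceil(test_size * n_rows)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(3548, test_size=0.1)
print(len(train_idx), len(test_idx))
```

Since no seed is passed in the cell above, each run produces a different split; `train_test_split` accepts a `seed` argument when a reproducible split is wanted.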

In [4]:
# 3) tokenizer
##############

# not changed

from transformers import AutoTokenizer
import torch

label2id = {"negative":0, "positive": 1}
id2label = {0: "negative", 1: "positive"}

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

def process(batch):

    toks = tokenizer(batch["review"], max_length=128, truncation=True, padding="max_length", return_tensors="pt")
    toks["labels"] = torch.tensor([label2id.get(item) for item in batch["division"]])

    return toks

tokenized_data = split_data.map(process, batched=True, remove_columns=data.column_names)
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3193
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 355
    })
})
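
The `max_length=128, truncation=True, padding="max_length"` settings give every example the same length. The sketch below shows the idea on plain lists of made-up token ids. The helper `pad_or_truncate` is hypothetical, assumes a padding id of 0, and simply cuts off the tail, so it does not preserve a closing special token as a real tokenizer would.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper: force a list of token ids to a fixed length."""
    # Truncation: keep at most max_length ids.
    kept = token_ids[:max_length]
    # Padding: fill the remainder with pad_id.
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([11, 12, 13, 14], max_length=6)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17, 18], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```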

In [5]:
# 4) dataloader
###############

# not changed

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

trainset, validset = tokenized_data["train"], tokenized_data["test"]

trainloader = DataLoader(trainset, batch_size=32, shuffle=True, collate_fn=DataCollatorWithPadding(tokenizer))
validloader = DataLoader(validset, batch_size=32, shuffle=False, collate_fn=DataCollatorWithPadding(tokenizer))



In [6]:
# 5) load model
###############

# not changed

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")

# send to GPU

if torch.cuda.is_available():
    model = model.cuda()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
# 6) define optimizer
#####################

# not changed

from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=2e-5)

In [14]:
# 7) evaluation
###############

# replace the manual label counting with the metrics provided by evaluate

import evaluate

metrics_fct = evaluate.combine(["accuracy", "f1"])

def eval():  # note: shadows the built-in eval(); name kept because train() calls it

    model.eval()

    # gradients are not needed for evaluation, so disable them to save memory
    with torch.no_grad():

        for batch in validloader:

            # if there is GPU, send the data to GPU
            if torch.cuda.is_available():
                batch = {k: v.to(model.device) for k, v in batch.items()}

            output = model(**batch)

            pred = torch.argmax(output.logits, dim=-1)

            metrics_fct.add_batch(predictions=pred.int(), references=batch["labels"].int())

    return metrics_fct.compute()
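
The evaluation step turns logits into label ids by taking the index of the largest score in each row, which is what `torch.argmax(..., dim=-1)` does. The sketch below shows the same operation on plain lists with made-up logits; the helper `argmax_rows` is hypothetical.

```python
def argmax_rows(logits):
    """Return the index of the largest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


id2label = {0: "negative", 1: "positive"}

# Made-up logits for three reviews, one row per review, one column per class.
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]

pred = argmax_rows(logits)            # [0, 1, 0]
labels = [id2label[i] for i in pred]  # readable class names
print(pred, labels)
```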

In [15]:
# 8) Train
##########

# not changed

def train(epoch=3, log_step=50):

    gStep = 0

    for e in range(epoch):

        model.train()

        for batch in trainloader:
            
            # if there is GPU, send the data to GPU
            if torch.cuda.is_available():
                batch = {k: v.to(model.device) for k, v in batch.items()}

            optimizer.zero_grad()

            output = model(**batch)

            output.loss.backward()

            optimizer.step()

            if gStep % log_step == 0:

                print(f"{e+1} / {epoch} - global step: {gStep}, loss: {output.loss.item()}")

            gStep += 1

        metrics = eval()

        print(f"{e+1} / {epoch} - {metrics}")

In [16]:
# not changed

train()

1 / 3 - global step: 0, loss: 0.12130105495452881
1 / 3 - global step: 50, loss: 0.17807073891162872
1 / 3 - {'accuracy': 0.952112676056338, 'f1': 0.9704347826086956}
2 / 3 - global step: 100, loss: 0.0437583364546299
2 / 3 - global step: 150, loss: 0.129874587059021
2 / 3 - {'accuracy': 0.9549295774647887, 'f1': 0.9719298245614035}
3 / 3 - global step: 200, loss: 0.03587239980697632
3 / 3 - global step: 250, loss: 0.006216248031705618
3 / 3 - {'accuracy': 0.9464788732394366, 'f1': 0.9663716814159292}
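
The global step numbers in this log follow from the training set size and the batch size. The arithmetic below is a small sketch using the values from this notebook (3193 training rows, batch size 32, 3 epochs, logging every 50 steps) and assumes the final, smaller batch of each epoch is kept.

```python
import math

n_train, batch_size, epochs, log_step = 3193, 32, 3, 50

# The last batch of an epoch is smaller, so round up.
steps_per_epoch = math.ceil(n_train / batch_size)

# (epoch, global step) for every step at which the loss is printed.
logged = [
    (step // steps_per_epoch + 1, step)
    for step in range(epochs * steps_per_epoch)
    if step % log_step == 0
]
print(steps_per_epoch, logged)
```

With 100 steps per epoch, steps 0 and 50 fall in epoch 1, steps 100 and 150 in epoch 2, and steps 200 and 250 in epoch 3, as in the log above.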
