<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning ChemBERTa (and friends) for multi-label text classification

In this notebook, we are going to fine-tune ChemBERTa to predict one or more labels for a given SMILES string. Note that this notebook illustrates how to fine-tune the `seyonec/PubChem10M_SMILES_BPE_450k` checkpoint (a RoBERTa-style model), but you can also fine-tune a BERT, RoBERTa, DeBERTa, DistilBERT, CANINE, ... checkpoint in the same way.

All of those work in the same way: they add a linear layer on top of the base model, which is used to produce a tensor of shape (batch_size, num_labels), indicating the unnormalized scores for a number of labels for every example in the batch.
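As a rough illustration of that classification head, the sketch below applies a single linear layer to a batch of pooled hidden states with NumPy. The sizes (`hidden_size = 768`, `batch_size = 4`) and the random weights are placeholders chosen for illustration, and `num_labels = 27` matches the dataset used later in this notebook. Real heads may add dropout or an extra dense layer, so treat this as a shape check rather than the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, hidden_size, num_labels = 4, 768, 27  # illustrative sizes

# Pooled representation of each example, as produced by the base model.
pooled = rng.normal(size=(batch_size, hidden_size))

# Parameters of the linear classification head (randomly initialized here).
W = rng.normal(scale=0.02, size=(num_labels, hidden_size))
b = np.zeros(num_labels)

# One unnormalized score (logit) per label for every example in the batch.
logits = pooled @ W.T + b
print(logits.shape)  # (4, 27)
```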

## Load dataset

Next, let's download a multi-label text classification dataset from the [hub](https://huggingface.co/).

At the time of writing, I picked a random one as follows:   

* first, go to the "datasets" tab on huggingface.co
* next, select the "multi-label-classification" tag on the left as well as the "1k<10k" tag (to find a relatively small dataset).

Note that you can also easily load your local data (e.g. CSV files, txt files, Parquet files, JSON, ...) as explained [here](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files). The cell below does exactly that with local CSV files.



In [3]:
from datasets import load_dataset, DatasetDict
import os

data_dir = "./data/processed_data/scaffold/"
# Load the datasets
train_dataset = load_dataset('csv', data_files=os.path.join(data_dir, "train/metadata.csv"))['train']
valid_dataset = load_dataset('csv', data_files=os.path.join(data_dir, "validation/metadata.csv"))['train']
test_dataset = load_dataset('csv', data_files=os.path.join(data_dir, "test/metadata.csv"))['train']

# Create a DatasetDict
datasets = DatasetDict({
    'train': train_dataset,
    'valid': valid_dataset,
    'test': test_dataset
})

The resulting `DatasetDict` contains 3 splits: one for training, one for validation and one for testing.

Let's check an example of the training split (the one at index 1):

In [4]:
example = datasets['train'][1]
example

{'Unnamed: 0': 1,
 'smiles': 'O=C1CN(N=CC2:C:C:C(C3:C:C:C([N+](=O)[O-]):C:C:3):O:2)C([O-])=N1',
 'Hepatobiliary disorders': 1,
 'Metabolism and nutrition disorders': 1,
 'Product issues': 0,
 'Eye disorders': 1,
 'Investigations': 1,
 'Musculoskeletal and connective tissue disorders': 1,
 'Gastrointestinal disorders': 1,
 'Social circumstances': 0,
 'Immune system disorders': 1,
 'Reproductive system and breast disorders': 0,
 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)': 1,
 'General disorders and administration site conditions': 1,
 'Endocrine disorders': 0,
 'Surgical and medical procedures': 0,
 'Vascular disorders': 1,
 'Blood and lymphatic system disorders': 1,
 'Skin and subcutaneous tissue disorders': 1,
 'Congenital, familial and genetic disorders': 0,
 'Infections and infestations': 0,
 'Respiratory, thoracic and mediastinal disorders': 1,
 'Psychiatric disorders': 1,
 'Renal and urinary disorders': 1,
 'Pregnancy, puerperium and perinatal conditions'

The dataset consists of SMILES strings, each labeled with one or more adverse drug reaction (ADR) categories.

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [5]:
labels = [label for label in datasets['train'].features.keys() if label not in ['Unnamed: 0', 'smiles']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
print(labels)
print(id2label)
print(label2id)

['Hepatobiliary disorders', 'Metabolism and nutrition disorders', 'Product issues', 'Eye disorders', 'Investigations', 'Musculoskeletal and connective tissue disorders', 'Gastrointestinal disorders', 'Social circumstances', 'Immune system disorders', 'Reproductive system and breast disorders', 'Neoplasms benign, malignant and unspecified (incl cysts and polyps)', 'General disorders and administration site conditions', 'Endocrine disorders', 'Surgical and medical procedures', 'Vascular disorders', 'Blood and lymphatic system disorders', 'Skin and subcutaneous tissue disorders', 'Congenital, familial and genetic disorders', 'Infections and infestations', 'Respiratory, thoracic and mediastinal disorders', 'Psychiatric disorders', 'Renal and urinary disorders', 'Pregnancy, puerperium and perinatal conditions', 'Ear and labyrinth disorders', 'Cardiac disorders', 'Nervous system disorders', 'Injury, poisoning and procedural complications']
{0: 'Hepatobiliary disorders', 1: 'Metabolism and nu

## Preprocess data

As models like BERT don't expect text as direct input, but rather `input_ids`, etc., we tokenize the text using the tokenizer. Here I'm using the `AutoTokenizer` API, which will automatically load the appropriate tokenizer based on the checkpoint on the hub.

What's a bit tricky is that we also need to provide labels to the model. For multi-label text classification, this is a matrix of shape (batch_size, num_labels). Also important: this should be a tensor of floats rather than integers, otherwise PyTorch's `BCEWithLogitsLoss` (which the model will use) will complain, as explained [here](https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3).
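To make the float requirement concrete, here is a small sketch with made-up logits and labels (not taken from the dataset). Integer targets are expected to be rejected by the loss, which is why the attempt is wrapped in `try`/`except`; casting the labels with `.float()` is the fix.

```python
import torch
import torch.nn as nn

loss_fn = nn.BCEWithLogitsLoss()

logits = torch.tensor([[1.2, -0.7, 0.3]])   # (batch_size=1, num_labels=3)
int_labels = torch.tensor([[1, 0, 1]])      # integer (long) labels

try:
    loss_fn(logits, int_labels)
except RuntimeError as err:
    # Integer targets are expected to end up here.
    print("integer labels failed:", err)

# Casting to float gives the loss the dtype it expects.
float_labels = int_labels.float()
loss = loss_fn(logits, float_labels)
print(loss.item())
```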

In [6]:
from transformers import AutoTokenizer
import numpy as np
pretrained_path = "seyonec/PubChem10M_SMILES_BPE_450k"
tokenizer = AutoTokenizer.from_pretrained(pretrained_path)

def preprocess_data(examples):
  # take a batch of texts
  smiles = examples["smiles"]
  # encode them
  encoding = tokenizer(smiles, padding="max_length", truncation=True, max_length=300)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(smiles), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()
  
  return encoding

In [7]:
encoded_dataset = datasets.map(preprocess_data, batched=True, remove_columns=datasets['train'].column_names)

In [8]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1143
    })
    valid: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 142
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 142
    })
})

In [9]:
example = encoded_dataset['train'][0]
print(example)
print(len(example['input_ids']))
print(example.keys())

{'input_ids': [0, 262, 263, 51, 13, 50, 12, 262, 12, 51, 13, 291, 12, 39, 12, 39, 287, 51, 13, 39, 21, 30, 39, 12, 45, 13, 30, 39, 12, 39, 263, 51, 13, 301, 12, 51, 13, 298, 13, 30, 39, 12, 45, 13, 30, 39, 12, 39, 263, 51, 13, 301, 12, 51, 13, 298, 13, 30, 39, 30, 21, 45, 13, 39, 21, 30, 39, 12, 45, 13, 30, 39, 12, 39, 263, 51, 13, 301, 12, 51, 13, 298, 13, 30, 39, 12, 45, 13, 30, 39, 12, 39, 263, 51, 13, 301, 12, 51, 13, 298, 13, 30, 39, 30, 21, 45, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
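The long tail of `1`s in `input_ids` is padding added by `padding="max_length"`. The sketch below mimics that step in plain Python, assuming a pad id of 1 as seen in the output above. `pad_to_max_length` is a hypothetical helper written for illustration, not a tokenizer method, and its truncation simply slices the list, whereas a real tokenizer also takes care of the closing special token.

```python
def pad_to_max_length(token_ids, max_length, pad_id=1):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask

# Toy sequence: ids 0 and 2 open and close it, as in the example above.
ids, mask = pad_to_max_length([0, 262, 263, 51, 2], max_length=8)
print(ids)   # [0, 262, 263, 51, 2, 1, 1, 1]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```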

In [10]:
tokenizer.decode(example['input_ids'])

'<s>CC(=O)N(CC(O)CN(C(C)=O)C1:C(I):C(C(=O)NCC(O)CO):C(I):C(C(=O)NCC(O)CO):C:1I)C1:C(I):C(C(=O)NCC(O)CO):C(I):C(C(=O)NCC(O)CO):C:1I</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><

In [11]:
example['labels']

[0.0,
 1.0,
 0.0,
 1.0,
 1.0,
 1.0,
 1.0,
 0.0,
 1.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 1.0,
 1.0,
 1.0,
 0.0,
 1.0,
 1.0,
 1.0,
 1.0,
 0.0,
 1.0,
 1.0,
 1.0,
 1.0]

In [12]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['Metabolism and nutrition disorders',
 'Eye disorders',
 'Investigations',
 'Musculoskeletal and connective tissue disorders',
 'Gastrointestinal disorders',
 'Immune system disorders',
 'General disorders and administration site conditions',
 'Vascular disorders',
 'Blood and lymphatic system disorders',
 'Skin and subcutaneous tissue disorders',
 'Infections and infestations',
 'Respiratory, thoracic and mediastinal disorders',
 'Psychiatric disorders',
 'Renal and urinary disorders',
 'Ear and labyrinth disorders',
 'Cardiac disorders',
 'Nervous system disorders',
 'Injury, poisoning and procedural complications']

Finally, we set the format of our data to PyTorch tensors. This will turn the training, validation and test sets into standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html). 

In [13]:
encoded_dataset.set_format("torch")

## Define model

Here we define a model that consists of a pre-trained base (i.e. the weights from `seyonec/PubChem10M_SMILES_BPE_450k` are loaded) with a randomly initialized classification head on top. One should fine-tune this head, together with the pre-trained base, on a labeled dataset.

This is also printed by the warning.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.
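For reference, with logits $x_{n,c}$, float targets $y_{n,c} \in \{0, 1\}$, a batch of $N$ examples and $C$ labels, `BCEWithLogitsLoss` with its default mean reduction computes:

$$
\mathcal{L} = -\frac{1}{N C} \sum_{n=1}^{N} \sum_{c=1}^{C} \Bigl[ y_{n,c} \log \sigma(x_{n,c}) + (1 - y_{n,c}) \log\bigl(1 - \sigma(x_{n,c})\bigr) \Bigr], \qquad \sigma(x) = \frac{1}{1 + e^{-x}}
$$

Each label is treated as an independent binary decision, which is what allows several labels to be active for the same example.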

In [14]:
import torch.nn as nn
import torch

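class WeightedBCEWithLogitsLoss(nn.Module):
    """BCE-with-logits loss where each label's loss is scaled by a per-label weight."""

    def __init__(self, weights):
        super().__init__()
        # 1-D tensor of shape (num_labels,)
        self.weights = weights

    def forward(self, logits, targets):
        # keep the per-element losses, shape (batch_size, num_labels)
        loss_fn = nn.BCEWithLogitsLoss(reduction='none')
        loss = loss_fn(logits, targets)
        # broadcast the weights over the batch dimension, on the same device as the loss
        weighted_loss = loss * self.weights.to(loss.device)
        return weighted_loss.mean()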
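To make the broadcasting in `WeightedBCEWithLogitsLoss` concrete, here is a NumPy sketch of the same computation on made-up numbers. The logits, targets and the three weights are illustrative values only (the actual weights are loaded from disk later in the notebook). The element-wise loss uses the numerically stable form `max(x, 0) - x * y + log(1 + exp(-|x|))`.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Element-wise binary cross-entropy on logits (stable formulation)."""
    return np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))

logits = np.array([[2.0, -1.0, 0.5],
                   [0.0,  3.0, -2.0]])     # (batch_size=2, num_labels=3)
targets = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0]])
weights = np.array([0.5, 1.0, 10.0])       # one illustrative weight per label

per_element = bce_with_logits(logits, targets)  # shape (2, 3)
weighted = per_element * weights                # weights broadcast over the batch axis
loss = weighted.mean()                          # mean over all batch * label entries

print(per_element.shape, weighted.shape)  # (2, 3) (2, 3)
print(loss)
```

Because the mean is taken over all entries, a label with weight 10 contributes ten times as much to the gradient as a label with weight 1 for the same per-element loss.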


In [15]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(pretrained_path, 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at seyonec/PubChem10M_SMILES_BPE_450k and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train the model!

We are going to train the model using HuggingFace's Trainer API. This requires us to define 2 things: 

* `TrainingArguments`, which specify training hyperparameters. All options can be found in the [docs](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments). Below, we specify, for example, that we want to evaluate and save the model after every epoch, and we set the learning rate, the batch size for training/evaluation, the number of epochs to train for, and so on.
* a `Trainer` object (docs can be found [here](https://huggingface.co/transformers/main_classes/trainer.html#id1)).

In [16]:
batch_size = 4
metric_name = "f1"

In [42]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "ChemBERTa_weighted",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.

In [43]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch
    
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
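    # note: ROC AUC is computed on the thresholded predictions here, not on the probabilities
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    # note: for multi-label targets this is subset accuracy (all labels of an example must match)
    accuracy = accuracy_score(y_true, y_pred)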
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result



Let's verify a batch as well as a forward pass:

In [44]:
encoded_dataset['train'][0]['labels'].type()

'torch.FloatTensor'

In [45]:
encoded_dataset['train']['input_ids'][0]

tensor([  0, 262, 263,  51,  13,  50,  12, 262,  12,  51,  13, 291,  12,  39,
         12,  39, 287,  51,  13,  39,  21,  30,  39,  12,  45,  13,  30,  39,
         12,  39, 263,  51,  13, 301,  12,  51,  13, 298,  13,  30,  39,  12,
         45,  13,  30,  39,  12,  39, 263,  51,  13, 301,  12,  51,  13, 298,
         13,  30,  39,  30,  21,  45,  13,  39,  21,  30,  39,  12,  45,  13,
         30,  39,  12,  39, 263,  51,  13, 301,  12,  51,  13, 298,  13,  30,
         39,  12,  45,  13,  30,  39,  12,  39, 263,  51,  13, 301,  12,  51,
         13, 298,  13,  30,  39,  30,  21,  45,   2,   1,   1,   1,   1,   1,
          1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
          1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
          1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
          1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
          1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,  

In [46]:
#forward pass
outputs = model(input_ids=encoded_dataset['train']['input_ids'][0].unsqueeze(0), labels=encoded_dataset['train'][0]['labels'].unsqueeze(0))
outputs


SequenceClassifierOutput(loss=tensor(0.2913, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.0543,  1.0951, -3.1105,  0.3504,  1.8950,  1.0990,  2.4127, -1.4540,
          1.5247,  0.1080, -1.0859,  2.0167, -1.2291, -1.5817,  1.8846,  0.6076,
          2.6498, -1.2971,  1.1798,  1.2497,  0.6378,  0.7856, -2.0823, -0.0781,
          0.9122,  2.4336,  0.9523]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

Let's start training!

In [47]:
import transformers
print(transformers.__version__)


4.35.2


In [50]:
from torch import nn
from transformers import Trainer

weights = torch.load("./data/processed_data/train_class_dist_rate.pt")
print(weights)

# Initialize custom loss function with the calculated weights
loss_func = WeightedBCEWithLogitsLoss(weights=weights)

tensor([ 0.9948,  0.4524, 59.1579,  0.6422,  0.2505,  0.4505,  0.1151,  5.0476,
         0.4059,  1.0159,  2.8746,  0.1097,  3.6653,  5.8443,  0.3033,  0.6662,
         0.0896,  4.4689,  0.4252,  0.3527,  0.4323,  0.6099, 10.7835,  1.2544,
         0.4673,  0.1022,  0.5488], dtype=torch.float64)


In [51]:
from transformers import Trainer
import torch.nn as nn

class CustomTrainer(Trainer):
    def __init__(self, model, args, loss_func=None, **kwargs):
        # Store the custom loss function
        self.loss_func = loss_func

        # Call the parent class's constructor without the custom loss function
        super().__init__(model=model, args=args, **kwargs)

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get('labels')
        outputs = model(**inputs)
        logits = outputs.logits

        # Use custom loss function if provided
        if self.loss_func:
            loss = self.loss_func(logits, labels.float())
        else:
            # Default to the standard loss calculation
            loss = outputs.loss if 'loss' in outputs else None

        return (loss, outputs) if return_outputs else loss


In [52]:
trainer = CustomTrainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["valid"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    loss_func=loss_func
)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [53]:
trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|
| 0 | No log | 0.973603 | 0.831231 | 0.749575 | 0.021127 |
| 2 | No log | 0.981648 | 0.831887 | 0.754578 | 0.021127 |
| 4 | No log | 1.000904 | 0.829229 | 0.75109 | 0.021127 |


TrainOutput(global_step=355, training_loss=0.8210913429797535, metrics={'train_runtime': 359.1246, 'train_samples_per_second': 15.914, 'train_steps_per_second': 0.989, 'total_flos': 440754008666400.0, 'train_loss': 0.8210913429797535, 'epoch': 4.97})
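The `global_step=355` above can be reproduced from the settings with a little arithmetic. The sketch assumes a single device and that the number of batches per epoch is floor-divided by `gradient_accumulation_steps` to get optimizer steps, which is consistent with the reported numbers. It is a back-of-the-envelope check, not the Trainer's actual code.

```python
import math

num_train_rows = 1143            # size of the training split
per_device_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 5

# Examples contributing to each optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Batches per epoch (the last, smaller batch is kept).
batches_per_epoch = math.ceil(num_train_rows / per_device_batch_size)

# Optimizer updates per epoch and in total (assumed floor division).
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

# Fraction of epochs covered by those updates.
epochs_covered = total_updates * gradient_accumulation_steps / batches_per_epoch

print(effective_batch_size)      # 16
print(batches_per_epoch)         # 286
print(total_updates)             # 355
print(round(epochs_covered, 2))  # 4.97
```

The last value matches the `'epoch': 4.97` reported in the training output.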

## Evaluate

After training, we evaluate our model on the validation set.

In [54]:
trainer.evaluate()

{'eval_loss': 0.9816477756341142,
 'eval_f1': 0.8318869828456106,
 'eval_roc_auc': 0.754578167333307,
 'eval_accuracy': 0.02112676056338028,
 'eval_runtime': 2.2533,
 'eval_samples_per_second': 63.02,
 'eval_steps_per_second': 15.977,
 'epoch': 4.97}
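The low `eval_accuracy` next to a high F1 is expected: for multi-label inputs, scikit-learn's `accuracy_score` computes subset accuracy, so an example only counts as correct when all of its labels are predicted exactly. The NumPy sketch below contrasts that with per-label accuracy on a tiny made-up example.

```python
import numpy as np

y_true = np.array([[1, 0, 1, 1],
                   [0, 1, 1, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 1, 0],   # one label wrong
                   [0, 1, 1, 0],   # exact match
                   [1, 0, 0, 0]])  # one label wrong

# Subset (exact-match) accuracy: every label of an example must be correct.
subset_accuracy = np.all(y_true == y_pred, axis=1).mean()

# Per-label accuracy: fraction of individual label decisions that are correct.
label_accuracy = (y_true == y_pred).mean()

print(subset_accuracy)  # 1 of 3 examples matches exactly
print(label_accuracy)   # 10 of 12 label decisions are correct
```

With 27 labels per example, an exact match is much harder to achieve, so subset accuracy can stay near zero even when most individual labels are right.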

In [55]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import pandas as pd

# Path to the saved checkpoint directory; replace it with the checkpoint from your own run
model_path = './ChemBERTa_weighted/checkpoint-286'
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [56]:
test_dataset = pd.read_csv(os.path.join(data_dir, "test/metadata.csv"))
test_smiles = test_dataset["smiles"].values
test_ground_truth = test_dataset.drop(["smiles", "Unnamed: 0"], axis=1)

In [57]:
test_ground_truth_values = test_ground_truth.values
test_smiles.shape
test_smiles_list = test_smiles.tolist()

In [58]:
tokenized_inputs = tokenizer(test_smiles_list, padding="max_length", truncation=True, max_length=300, return_tensors="pt")

In [59]:
import torch

# Assuming you are using a PyTorch model
with torch.no_grad():
    outputs = model(**tokenized_inputs)

In [60]:
probabilities = torch.sigmoid(outputs.logits)
probabilities[0]

tensor([0.4720, 0.6340, 0.0036, 0.6293, 0.8017, 0.6970, 0.8754, 0.1200, 0.7078,
        0.4500, 0.2256, 0.8912, 0.1557, 0.0874, 0.7558, 0.5509, 0.9042, 0.1179,
        0.6518, 0.7003, 0.6899, 0.6468, 0.0261, 0.5428, 0.6183, 0.8828, 0.6678])

In [61]:
threshold = 0.5
predictions = (probabilities > threshold).int()
predictions[0]

tensor([0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
        1, 1, 1], dtype=torch.int32)
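To read such a 0/1 vector as label names, the `id2label` mapping can be applied to the active positions. The sketch below is a shortened, self-contained version that uses only the first five labels and the first five entries of `predictions[0]` shown above; in the notebook, the full `id2label` dictionary would be used instead.

```python
# First five labels of the dataset, in column order.
labels = ['Hepatobiliary disorders', 'Metabolism and nutrition disorders',
          'Product issues', 'Eye disorders', 'Investigations']
id2label = dict(enumerate(labels))

# First five entries of predictions[0] shown above.
first_prediction = [0, 1, 0, 1, 1]

# Keep the names of the labels whose prediction is 1.
predicted_labels = [id2label[idx] for idx, flag in enumerate(first_prediction) if flag == 1]
print(predicted_labels)
# ['Metabolism and nutrition disorders', 'Eye disorders', 'Investigations']
```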

In [62]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

import numpy as np

# Convert your tensors to numpy arrays if they aren't already
predictions_np = predictions.numpy()
true_labels_np = test_ground_truth_values  # ground-truth matrix of shape (num_examples, num_labels)
probabilities_np = probabilities.numpy()

# Initialize lists to store metrics for each label
accuracies = []
precisions = []
recalls = []
f1_scores = []
AUC_scores = []

# Calculate metrics for each label
for label in range(predictions_np.shape[1]):
    print(label)
    accuracies.append(accuracy_score(true_labels_np[:, label], predictions_np[:, label]))
    precisions.append(precision_score(true_labels_np[:, label], predictions_np[:, label], zero_division=0))
    recalls.append(recall_score(true_labels_np[:, label], predictions_np[:, label], zero_division=0))
    f1_scores.append(f1_score(true_labels_np[:, label], predictions_np[:, label], zero_division=0))
    # ROC AUC is only defined when both classes are present for this label
    if len(np.unique(true_labels_np[:, label])) == 2:
        AUC_scores.append(roc_auc_score(true_labels_np[:, label], probabilities_np[:, label]))

# Calculate average of each metric
average_accuracy = np.mean(accuracies)
average_precision = np.mean(precisions)
average_recall = np.mean(recalls)
average_f1_score = np.mean(f1_scores)
average_AUC = np.mean(AUC_scores)

print(f"Average Accuracy: {average_accuracy}")
print(f"Average Precision: {average_precision}")
print(f"Average Recall: {average_recall}")
print(f"Average F1 Score: {average_f1_score}")
print(f"Average AUC: {average_AUC}")

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Average Accuracy: 0.7767344809598329
Average Precision: 0.6074995658309068
Average Recall: 0.6921425327089815
Average F1 Score: 0.6222410258944913
Average AUC: 0.6345645218304565


In [63]:
print(f1_scores)

[0.716577540106952, 0.8215767634854771, 0.0, 0.75, 0.916030534351145, 0.859437751004016, 0.982078853046595, 0.0, 0.8685258964143426, 0.631578947368421, 0.0, 0.9710144927536231, 0.052631578947368425, 0.0, 0.8818897637795275, 0.8355555555555555, 0.982078853046595, 0.0, 0.8215767634854771, 0.8455284552845529, 0.880952380952381, 0.8067226890756304, 0.0, 0.5245901639344263, 0.8284518828451882, 0.9747292418772564, 0.8489795918367347]
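As a quick sanity check, the sketch below recomputes the average F1 from the per-label values printed above (copied verbatim) and lists the label indices whose F1 is exactly zero. It only re-reads the printed numbers, so it says nothing about why those labels score zero.

```python
f1_scores = [
    0.716577540106952, 0.8215767634854771, 0.0, 0.75, 0.916030534351145,
    0.859437751004016, 0.982078853046595, 0.0, 0.8685258964143426,
    0.631578947368421, 0.0, 0.9710144927536231, 0.052631578947368425, 0.0,
    0.8818897637795275, 0.8355555555555555, 0.982078853046595, 0.0,
    0.8215767634854771, 0.8455284552845529, 0.880952380952381,
    0.8067226890756304, 0.0, 0.5245901639344263, 0.8284518828451882,
    0.9747292418772564, 0.8489795918367347,
]

# Unweighted (macro) average over all 27 labels, zeros included.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Indices of labels for which no positive example was predicted correctly.
zero_f1_labels = [idx for idx, score in enumerate(f1_scores) if score == 0.0]

print(round(macro_f1, 4))  # 0.6222
print(zero_f1_labels)      # [2, 7, 10, 13, 17, 22]
```

The six zero-F1 labels pull the macro average well below the micro-averaged F1 reported during training, since every label counts equally regardless of how often it occurs.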


In [None]:
[0.7024390243902439, 0.8166666666666667, 0.0, 0.7555555555555554, 0.916030534351145, 0.859437751004016, 0.982078853046595, 0.0, 0.873015873015873, 0.7103825136612022, 0.0, 0.9710144927536231, 0.052631578947368425, 0.0, 0.8774703557312252, 0.8220338983050849, 0.982078853046595, 0.0, 0.8264462809917354, 0.8455284552845529, 0.880952380952381, 0.8215767634854771, 0.0, 0.4827586206896552, 0.825, 0.9747292418772564, 0.8455284552845529]
