### Motivation

This notebook demonstrates how to use the performance metrics available in the `ThreeWToolkit` to evaluate machine learning models. It covers both classification and regression tasks, from data loading and model training through performance evaluation.

### What you will learn:
- **Classification Model Evaluation**:
    - How to train a classifier on the 3W dataset.
    - How to use and interpret the following metrics:
        - `accuracy_score`
        - `balanced_accuracy_score`
        - `average_precision_score`
        - `precision_score`
        - `recall_score`
        - `f1_score`
        - `roc_auc_score`

- **Regression Model Evaluation**:
    - How to set up and train a regression model.
    - How to apply and understand the `explained_variance_score`.

### Imports

**Add the project root to `sys.path` so that `ThreeWToolkit` can be imported. For demonstration purposes only.**

In [1]:
import sys
import os
import numpy as np

# Add the project root directory to sys.path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../../')))

**Required**

In [2]:
from ThreeWToolkit.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    average_precision_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    explained_variance_score
)
from ThreeWToolkit.core.base_preprocessing import WindowingConfig
from ThreeWToolkit.preprocessing import Windowing
from ThreeWToolkit.trainer.trainer import ModelTrainer, TrainerConfig
from ThreeWToolkit.models.mlp import MLPConfig
from ThreeWToolkit.dataset import ParquetDataset
from ThreeWToolkit.core.base_dataset import ParquetDatasetConfig

from pathlib import Path
import torch
from tqdm import tqdm
import pandas as pd

### How to use metrics for classification tasks

#### Loading 3W Dataset

In [3]:
dataset_path = Path("../../data/raw/")
ds_config = ParquetDatasetConfig(
    path=dataset_path, clean_data=True, download=False, target_class=[0, 1, 2]
)
ds = ParquetDataset(ds_config)
ds[19]




[ParquetDataset] Dataset found at ../../data/raw
[ParquetDataset] Validating dataset integrity...
[ParquetDataset] Dataset integrity check passed!


{'signal':                      ABER-CKGL  ABER-CKP  ESTADO-DHSV  ESTADO-M1  ESTADO-M2  \
 timestamp                                                                     
 2017-05-25 13:00:00        0.0       0.0     0.867921   0.414652  -0.681653   
 2017-05-25 13:00:01        0.0       0.0     0.867921   0.414652  -0.681653   
 2017-05-25 13:00:02        0.0       0.0     0.867921   0.414652  -0.681653   
 2017-05-25 13:00:03        0.0       0.0     0.867921   0.414652  -0.681653   
 2017-05-25 13:00:04        0.0       0.0     0.867921   0.414652  -0.681653   
 ...                        ...       ...          ...        ...        ...   
 2017-05-25 18:59:04        0.0       0.0     0.867921   0.414652  -0.681653   
 2017-05-25 18:59:05        0.0       0.0     0.867921   0.414652  -0.681653   
 2017-05-25 18:59:06        0.0       0.0     0.867921   0.414652  -0.681653   
 2017-05-25 18:59:07        0.0       0.0     0.867921   0.414652  -0.681653   
 2017-05-25 18:59:08        0.

### Setting up model 

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"

window_size = 1000
mlp_config = MLPConfig(
    input_size=window_size,
    hidden_sizes=(32, 16),
    output_size=3,
    random_seed=11,
    activation_function="relu",
    regularization=None,
)

trainer_config = TrainerConfig(
    optimizer="adam",
    criterion="cross_entropy",
    batch_size=32,
    epochs=20,
    seed=11,
    config_model=mlp_config,
    learning_rate=0.001,
    device=device,
    cross_validation=False,
    shuffle_train=True
)

trainer = ModelTrainer(trainer_config)
print(trainer.model)

MLP(
  (activation_func): ReLU()
  (model): Sequential(
    (0): Linear(in_features=1000, out_features=32, bias=True)
    (1): ReLU()
    (2): Linear(in_features=32, out_features=16, bias=True)
    (3): ReLU()
    (4): Linear(in_features=16, out_features=3, bias=True)
  )
)
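
As a quick sanity check on the printed architecture, the number of trainable parameters can be counted by hand: each `Linear` layer holds `in_features * out_features` weights plus `out_features` biases. The sketch below is plain Python and uses only the layer sizes shown in the output above; the helper name `count_linear_params` is illustrative and not part of `ThreeWToolkit`.

```python
def count_linear_params(layer_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total

# Layer sizes taken from the printed model: 1000 -> 32 -> 16 -> 3
n_params = count_linear_params([1000, 32, 16, 3])
print(n_params)  # 32611
```

Almost all of the parameters sit in the first layer, because the input is a full window of 1000 samples.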


In [5]:
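# Select the signal column and build the windowed dataset
selected_col = "T-TPT"
dfs = []

# Hann windows of `window_size` samples with 50% overlap; the last window is padded
wind = Windowing(
    WindowingConfig(
        window="hann",
        window_size=window_size,
        overlap=0.5,
        pad_last_window=True,
    )
)

for event in tqdm(ds):
    windowed_signal = wind(event["signal"][selected_col])
    # Drop the `win` column so that only the signal samples remain as features
    windowed_signal.drop(columns=["win"], inplace=True)
    # Label every window with the first (smallest) unique class value of the event
    windowed_signal["label"] = np.unique(event["label"]["class"])[0]
    dfs.append(windowed_signal)

# Stack the windows from all events into a single dataframe
dfs_final = pd.concat(dfs, ignore_index=True, axis=0)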

100%|██████████| 760/760 [00:23<00:00, 32.79it/s]
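
The windowing step can be pictured with a small NumPy sketch. This is not the `ThreeWToolkit` implementation: the function name `sliding_hann_windows` is hypothetical, `np.hanning` is assumed to be a reasonable stand-in for the toolkit's `"hann"` window, and the padding of the last window (`pad_last_window=True`) is deliberately left out.

```python
import numpy as np

def sliding_hann_windows(signal, window_size, overlap=0.5):
    """Split a 1-D signal into overlapping windows, each scaled by a Hann taper.

    Illustrative only: trailing samples that do not fill a window are dropped.
    """
    step = int(window_size * (1 - overlap))  # hop between window starts
    starts = range(0, len(signal) - window_size + 1, step)
    taper = np.hanning(window_size)
    return np.stack([signal[s:s + window_size] * taper for s in starts])

toy_signal = np.arange(10, dtype=float)
windows = sliding_hann_windows(toy_signal, window_size=4, overlap=0.5)
print(windows.shape)  # (4, 4): windows start at samples 0, 2, 4 and 6
```

With `overlap=0.5`, consecutive windows share half of their samples, so a signal yields roughly twice as many rows as it would with non-overlapping windows.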


### Model Validation

In [6]:
from sklearn.model_selection import train_test_split

# Separate features (X) and target (y) from the final dataframe
X = dfs_final.iloc[:, :-1]
y = dfs_final['label']

# Perform a stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Print the shapes of the resulting datasets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
print("-" * 25)

# Show the class distribution in the training and testing sets
print("Train occurrences by class:")
print(y_train.value_counts())
print("\nTest occurrences by class:")
print(y_test.value_counts())


X_train shape: (35516, 1000)
X_test shape: (8879, 1000)
y_train shape: (35516,)
y_test shape: (8879,)
-------------------------
Train occurrences by class:
label
0    19706
1    14614
2     1196
Name: count, dtype: int64

Test occurrences by class:
label
0    4927
1    3653
2     299
Name: count, dtype: int64
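
The counts above can be used to confirm that the stratified split preserved the class proportions, and to see how imbalanced the data is. The sketch below only re-uses the numbers printed in the output; the helper `proportions` is illustrative.

```python
# Class counts copied from the output above
train_counts = {0: 19706, 1: 14614, 2: 1196}
test_counts = {0: 4927, 1: 3653, 2: 299}

def proportions(counts):
    """Convert absolute class counts into fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

train_prop = proportions(train_counts)
test_prop = proportions(test_counts)
for label in train_counts:
    print(f"class {label}: train {train_prop[label]:.4f}, test {test_prop[label]:.4f}")
```

Class 2 makes up only about 3.4% of either set, which is why metrics that average per class (balanced accuracy, macro averages) can differ noticeably from plain accuracy in the cells that follow.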


In [7]:
# Train the MLP model using the new ModelTrainer interface
trainer.train(x_train=X_train, y_train=y_train)

[Pipeline] Training:   0%|          | 0/20 [00:00<?, ?epoch/s]

In [8]:
X_test = torch.Tensor(X_test.to_numpy()).to(device)
y_pred = trainer.model(X_test).detach().cpu().numpy().argmax(axis=1)
y_true = y_test.to_numpy()

### Evaluating the Classification Model

With the model trained and predictions made on the test set, we can now evaluate its performance. The following cells demonstrate how to use the various classification metrics imported from `ThreeWToolkit.metrics`. We will use `y_true` (the actual labels) and `y_pred` (the model's predicted labels) to calculate these scores.

**Accuracy Score**

Basic usage

In [9]:
acc = accuracy_score(y_true = y_true,
                     y_pred = y_pred)

print(f"The accuracy is {(acc * 100):.3}%.")

The accuracy is 95.0%.


Using sample weight

In [10]:
sample_weight = np.random.rand(len(y_true))

acc = accuracy_score(y_true = y_true,
                     y_pred = y_pred,
                     sample_weight = sample_weight)

print(f"The accuracy is {(acc * 100):.3}%.")

The accuracy is 94.9%.
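
To make the effect of `sample_weight` concrete, here is a minimal pure-Python sketch of weighted accuracy under its usual definition: the summed weight of correctly classified samples divided by the total weight. It is assumed, not confirmed, that the toolkit's `accuracy_score` follows this definition; `weighted_accuracy` and the toy arrays are hypothetical.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Fraction of total sample weight that falls on correct predictions."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)

y_true_toy = [0, 1, 2, 2]
y_pred_toy = [0, 1, 1, 2]

print(weighted_accuracy(y_true_toy, y_pred_toy))                # 0.75
print(weighted_accuracy(y_true_toy, y_pred_toy, [1, 1, 2, 1]))  # 0.6
```

In the toy case the misclassified sample carries double weight, so the score drops. With random weights, as in the cell above, errors are not systematically up- or down-weighted, which is consistent with the weighted result staying close to the unweighted one.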


_________

**Balanced Accuracy Score**

Basic usage

In [11]:
balanced_acc = balanced_accuracy_score(y_true = y_true,
                                       y_pred = y_pred)

print(f"The balanced accuracy is {(balanced_acc * 100):.3}%.")

The balanced accuracy is 86.7%.


Using sample weight

In [12]:
sample_weight = np.random.rand(len(y_true))

balanced_acc = balanced_accuracy_score(y_true = y_true,
                                       y_pred = y_pred,
                                       sample_weight = sample_weight)

print(f"The balanced accuracy is {(balanced_acc * 100):.3}%.")

The balanced accuracy is 87.5%.


______

**Average Precision Score**

Basic usage

In [13]:
y_pred = trainer.model(X_test).detach().cpu().numpy()  # per-class scores (raw model outputs) instead of argmax labels

ap = average_precision_score(y_true = y_true, y_pred = y_pred, average = 'weighted')

print(f"The average precision is {(ap * 100):.3}%.")

The average precision is 92.0%.
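
Ranking metrics such as average precision work on per-class scores rather than hard labels. The cell above passes the raw model outputs, while the ROC AUC cell later in this notebook applies `torch.softmax` first. The NumPy sketch below, with a hypothetical `softmax` helper and made-up logits, illustrates one consequence: softmax never changes the predicted label of a sample, but it can change how samples are ordered by their score for a given class, so ranking metrics computed on logits and on probabilities need not agree.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Two made-up samples with three class scores each
logits = np.array([[3.0, 0.0, 0.0],
                   [4.0, 4.0, 4.0]])
probs = softmax(logits)

# The predicted label per sample is unchanged by softmax
same_labels = (logits.argmax(axis=1) == probs.argmax(axis=1)).all()

# Ordering of the samples by their class-0 score, highest first
logit_order = np.argsort(-logits[:, 0])  # sample 1 ranks first (4.0 > 3.0)
prob_order = np.argsort(-probs[:, 0])    # sample 0 ranks first (about 0.91 > 0.33)
print(same_labels, logit_order, prob_order)
```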


Using sample weight

In [14]:
sample_weight = np.random.rand(len(y_true))

# `average` is left at its default here, whereas the previous cell passed average='weighted'
ap = average_precision_score(y_true = y_true,
                             y_pred = y_pred,
                             sample_weight = sample_weight)

print(f"The average precision is {(ap * 100):.3}%.")

The average precision is 86.9%.


___________

**Precision Score**

Basic usage

In [15]:
y_pred = trainer.model(X_test).detach().cpu().numpy().argmax(axis=1)

# Calculate precision for each class (one-vs-rest)
precision = precision_score(y_true=y_true, y_pred=y_pred, average=None)

# Print precision for each class
for i, p in enumerate(precision):
    print(f"The precision for class {i} is {(p * 100):.3f}%.")

# Calculate weighted average precision
precision_weighted = precision_score(y_true=y_true, y_pred=y_pred, average='weighted')
print(f"\nThe weighted average precision is {(precision_weighted * 100):.3f}%.")

The precision for class 0 is 92.209%.
The precision for class 1 is 99.310%.
The precision for class 2 is 95.434%.

The weighted average precision is 95.239%.


Using sample weight

In [16]:
sample_weight = np.random.rand(len(y_true))
y_pred = trainer.model(X_test).detach().cpu().numpy().argmax(axis=1)

# Calculate precision for each class (one-vs-rest)
precision = precision_score(y_true=y_true, y_pred=y_pred, average=None, sample_weight=sample_weight)

# Print precision for each class
for i, p in enumerate(precision):
    print(f"The precision for class {i} is {(p * 100):.3f}%.")

# Calculate weighted average precision (sample_weight is not passed here, so this matches the basic-usage value)
precision_weighted = precision_score(y_true=y_true, y_pred=y_pred, average='weighted')
print(f"\nThe weighted average precision is {(precision_weighted * 100):.3f}%.")

The precision for class 0 is 92.209%.
The precision for class 1 is 99.282%.
The precision for class 2 is 94.686%.

The weighted average precision is 95.239%.


Using different average options

In [17]:
precision = precision_score(y_true = y_true, y_pred = y_pred, average = "macro")

print(f"The precision [macro] is {(precision * 100):.3}%.")

####
precision = precision_score(y_true = y_true, y_pred = y_pred, average = "micro")

print(f"The precision [micro] is {(precision * 100):.3}%.")

####
precision = precision_score(y_true = y_true, y_pred = y_pred, average = "weighted")

print(f"The precision [weighted] is {(precision * 100):.3}%.")

The precision [macro] is 95.7%.
The precision [micro] is 95.0%.
The precision [weighted] is 95.2%.
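
The `macro` and `weighted` values can be reproduced by hand from the per-class precisions printed in the basic-usage cell and the test-set class counts. The sketch below assumes the standard definitions (macro is the unweighted mean over classes, weighted uses each class's number of true samples as its weight); the numbers are copied from the outputs above.

```python
# Per-class precision from the basic-usage output, and test-set class counts
precision_per_class = [0.92209, 0.99310, 0.95434]
support = [4927, 3653, 299]

# Macro: every class counts equally, regardless of its size
macro = sum(precision_per_class) / len(precision_per_class)

# Weighted: each class contributes in proportion to its number of true samples
weighted = sum(p * n for p, n in zip(precision_per_class, support)) / sum(support)

print(f"macro:    {macro * 100:.1f}%")
print(f"weighted: {weighted * 100:.3f}%")
```

Both results should land on the values reported above, up to rounding of the printed inputs. For single-label multiclass problems, the `micro` average pools all predictions before dividing, which makes it equal to plain accuracy.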


__________

**Recall Score**

Basic usage

In [18]:
y_pred = trainer.model(X_test).detach().cpu().numpy().argmax(axis=1)

# Calculate recall for each class (one-vs-rest)
recall = recall_score(y_true=y_true, y_pred=y_pred, average=None)

# Print recall for each class
for i, p in enumerate(recall):
    print(f"The recall for class {i} is {(p * 100):.3f}%.")

# Calculate weighted average recall
recall_weighted = recall_score(y_true=y_true, y_pred=y_pred, average='weighted')
print(f"\nThe weighted average recall is {(recall_weighted * 100):.3f}%.")

The recall for class 0 is 99.696%.
The recall for class 1 is 90.610%.
The recall for class 2 is 69.900%.

The weighted average recall is 94.954%.


Using sample weight

In [19]:
sample_weight = np.random.rand(len(y_true))
y_pred = trainer.model(X_test).detach().cpu().numpy().argmax(axis=1)

# Calculate recall for each class (one-vs-rest)
recall = recall_score(y_true=y_true, y_pred=y_pred, average=None, sample_weight=sample_weight)

# Print recall for each class
for i, r in enumerate(recall):
    print(f"The recall for class {i} is {(r * 100):.3f}%.")

# Calculate weighted average recall (sample_weight is not passed here, so this matches the basic-usage value)
recall_weighted = recall_score(y_true=y_true, y_pred=y_pred, average='weighted')
print(f"\nThe weighted average recall is {(recall_weighted * 100):.3f}%.")

The recall for class 0 is 99.741%.
The recall for class 1 is 90.480%.
The recall for class 2 is 69.223%.

The weighted average recall is 94.954%.


Using different average options

In [20]:
recall = recall_score(y_true = y_true, y_pred = y_pred, average = "macro")

print(f"The recall [macro] is {(recall * 100):.3}%.")

####
recall = recall_score(y_true = y_true, y_pred = y_pred, average = "micro")

print(f"The recall [micro] is {(recall * 100):.3}%.")

####
recall = recall_score(y_true = y_true, y_pred = y_pred, average = "weighted")

print(f"The recall [weighted] is {(recall * 100):.3}%.")

The recall [macro] is 86.7%.
The recall [micro] is 95.0%.
The recall [weighted] is 95.0%.
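
These recall averages tie back to the earlier metrics. Since recall is true positives divided by the number of true samples in a class, the per-class true-positive counts can be recovered from the printed recalls and the test-set class counts. The sketch below does this in plain Python; the recovered counts are a reconstruction from rounded outputs, not values reported by the toolkit.

```python
# Per-class recall from the basic-usage output, and test-set class counts
recall_per_class = [0.99696, 0.90610, 0.69900]
support = [4927, 3653, 299]

# recall = TP / support, so TP = recall * support (rounded to whole samples)
true_positives = [round(r * n) for r, n in zip(recall_per_class, support)]

# Macro recall: unweighted mean over classes (same quantity as balanced accuracy)
macro = sum(recall_per_class) / len(recall_per_class)

# Micro recall: pooled correct predictions over all samples (same as accuracy)
micro = sum(true_positives) / sum(support)

print(true_positives)
print(f"macro: {macro * 100:.1f}%, micro: {micro * 100:.3f}%")
```

The macro value agrees with the balanced accuracy of 86.7% reported earlier, and the micro value agrees with both the accuracy and the weighted recall. The gap between the two comes almost entirely from class 2, where roughly 30% of the samples are missed.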


_______

**F1 Score**

Basic usage

In [21]:
y_pred = trainer.model(X_test).detach().cpu().numpy().argmax(axis=1)

# Calculate F1 for each class (one-vs-rest).
# Note: sample_weight is reused from the previous sample-weight cell, so the per-class values are weighted.
f1_per_class = f1_score(y_true=y_true, y_pred=y_pred, average=None, sample_weight=sample_weight)

# Print f1 for each class
for i, f in enumerate(f1_per_class):
    print(f"The f1 for class {i} against the rest is {(f * 100):.3f}%.")

# Calculate weighted average f1 (without sample weights)
f1_weighted = f1_score(y_true=y_true, y_pred=y_pred, average='weighted')
print(f"\nThe weighted average f1 is {(f1_weighted * 100):.3f}%.")

The f1 for class 0 against the rest is 95.688%.
The f1 for class 1 against the rest is 94.744%.
The f1 for class 2 against the rest is 80.969%.

The weighted average f1 is 94.867%.


Using different average options

In [22]:
f1 = f1_score(y_true = y_true, y_pred = y_pred, average = "macro")

print(f"The f1_score [macro] is {(f1 * 100):.3}%.")

####
f1 = f1_score(y_true = y_true, y_pred = y_pred, average = "micro")

print(f"The f1_score [micro] is {(f1 * 100):.3}%.")

####
f1 = f1_score(y_true = y_true, y_pred = y_pred, average = "weighted")

print(f"The f1_score [weighted] is {(f1 * 100):.3}%.")

The f1_score [macro] is 90.4%.
The f1_score [micro] is 95.0%.
The f1_score [weighted] is 94.9%.
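
F1 is the harmonic mean of precision and recall, so the averaged values above can be cross-checked against the unweighted per-class precision and recall printed in the earlier basic-usage cells. The sketch assumes the standard convention that macro and weighted F1 average the per-class F1 values. The per-class F1 values printed in cell 21 differ slightly from the ones computed here because that cell passes `sample_weight`.

```python
# Unweighted per-class precision and recall from the earlier outputs
precision_per_class = [0.92209, 0.99310, 0.95434]
recall_per_class = [0.99696, 0.90610, 0.69900]
support = [4927, 3653, 299]  # test-set class counts

# Harmonic mean of precision and recall for each class
f1_per_class = [2 * p * r / (p + r)
                for p, r in zip(precision_per_class, recall_per_class)]

macro_f1 = sum(f1_per_class) / len(f1_per_class)
weighted_f1 = sum(f * n for f, n in zip(f1_per_class, support)) / sum(support)

print([round(f, 4) for f in f1_per_class])
print(f"macro: {macro_f1 * 100:.1f}%, weighted: {weighted_f1 * 100:.3f}%")
```

Class 2 has high precision but low recall, and the harmonic mean is pulled toward the lower of the two, which is what drags the macro F1 below the weighted one.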


_____________

**ROC AUC Score**

Basic usage

In [23]:
y_pred = torch.softmax(trainer.model(X_test), dim=1).detach().cpu().numpy()

# Calculate ROC AUC for each class (One-vs-Rest)
print("ROC AUC for each class (One-vs-Rest):")
for i in range(y_pred.shape[1]):
    y_true_class = (y_true == i).astype(int)
    y_pred_class = y_pred[:, i]
    roc_auc = roc_auc_score(y_true=y_true_class, y_pred=y_pred_class)
    print(f"The ROC AUC for class {i} is {(roc_auc * 100):.3f}%.")

print("\nUsing multi_class options:")
# Calculate ROC AUC using One-vs-Rest (ovr) strategy with macro averaging
roc_auc_ovr_macro = roc_auc_score(y_true=y_true, y_pred=y_pred, multi_class='ovr', average='macro')
print(f"The ROC AUC (OVR, macro) is {(roc_auc_ovr_macro * 100):.3f}%.")

# Calculate ROC AUC using One-vs-Rest (ovr) strategy with weighted averaging
roc_auc_ovr_weighted = roc_auc_score(y_true=y_true, y_pred=y_pred, multi_class='ovr', average='weighted')
print(f"The ROC AUC (OVR, weighted) is {(roc_auc_ovr_weighted * 100):.3f}%.")

# Calculate ROC AUC using One-vs-One (ovo) strategy with macro averaging
roc_auc_ovo_macro = roc_auc_score(y_true=y_true, y_pred=y_pred, multi_class='ovo', average='macro')
print(f"The ROC AUC (OVO, macro) is {(roc_auc_ovo_macro * 100):.3f}%.")

# Calculate ROC AUC using One-vs-One (ovo) strategy with weighted averaging
roc_auc = roc_auc_score(y_true=y_true, y_pred=y_pred, multi_class='ovo', average='weighted')

print(f"The roc_auc is {(roc_auc * 100):.3}%.")

ROC AUC for each class (One-vs-Rest):
The ROC AUC for class 0 is 94.889%.
The ROC AUC for class 1 is 96.151%.
The ROC AUC for class 2 is 90.482%.

Using multi_class options:
The ROC AUC (OVR, macro) is 93.841%.
The ROC AUC (OVR, weighted) is 95.260%.
The ROC AUC (OVO, macro) is 92.335%.
The roc_auc is 92.7%.
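
For intuition, the binary ROC AUC used in the one-vs-rest loop above equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The pure-Python sketch below computes it by brute force over all positive/negative pairs. `pairwise_auc` is a hypothetical helper for illustration only; it is quadratic in the number of samples and is not how the toolkit computes the metric.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```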


_____

In [24]:


from ThreeWToolkit.core.base_preprocessing import ImputeMissingConfig
from ThreeWToolkit.preprocessing._data_processing import ImputeMissing

x_columns = ["P-TPT", "P-PDG", "T-TPT", "P-MON-CKP", "T-JUS-CKP", "P-JUS-CKGL"]

ds_config = ParquetDatasetConfig(
    columns=x_columns,
    target_column="QGL",
    path=dataset_path, 
    clean_data=True
)
cleaned_dataset = ParquetDataset(ds_config)

# Add some noise to the signal for demonstration purposes
X = cleaned_dataset[20]['signal'] + np.random.normal(0, 0.01, cleaned_dataset[20]['signal'].shape)
y = cleaned_dataset[20]['label']['QGL'] + np.random.normal(0, 1, cleaned_dataset[20]['label']['QGL'].shape)

# Perform a random (non-stratified) train-test split, since the target is continuous
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Print the shapes of the resulting datasets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
print("-" * 25)

[ParquetDataset] Dataset found at ../../data/raw
[ParquetDataset] Validating dataset integrity...
[ParquetDataset] Dataset integrity check passed!


>> ['P-TPT', 'P-PDG', 'T-TPT', 'P-MON-CKP', 'T-JUS-CKP', 'P-JUS-CKGL']
X_train shape: (27839, 6)
X_test shape: (6960, 6)
y_train shape: (27839,)
y_test shape: (6960,)
-------------------------


### How to use metrics for regression tasks

Now, let's explore how to evaluate a model on a regression task. We will set up a new MLP model configured for regression (predicting a continuous value) and then use appropriate metrics to assess its performance.

### Setting up model

In [25]:

mlp_config = MLPConfig(
    input_size=len(x_columns),
    hidden_sizes=(32, 16),
    output_size=1,
    random_seed=11,
    activation_function="relu",
    regularization=None,
)

trainer_config = TrainerConfig(
    optimizer="adam",
    criterion="mse",
    batch_size=32,
    epochs=20,
    seed=42,
    config_model=mlp_config,
    learning_rate=0.001,
    device=device,
    cross_validation=False,
    shuffle_train=True
)
trainer = ModelTrainer(trainer_config) 

trainer.train(x_train=X_train, y_train=y_train)


[Pipeline] Training:   0%|          | 0/20 [00:00<?, ?epoch/s]

In [26]:
X_test = torch.Tensor(X_test.to_numpy()).to(device)
y_pred = trainer.model(X_test).detach().cpu().numpy()
y_true = y_test.to_numpy()

**Explained Variance Score**

Basic usage

In [28]:
evscore = explained_variance_score(y_true = y_true, y_pred = y_pred)

print(f"The explained_variance_score is {(evscore * 100):.3}%.")

The explained_variance_score is 91.0%.
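
The explained variance score is commonly defined as `1 - Var(y_true - y_pred) / Var(y_true)`. The NumPy sketch below implements that definition for a single output, on the assumption that the toolkit's `explained_variance_score` follows it; `explained_variance` and the toy data are hypothetical. Both arrays are flattened first, because subtracting a `(N, 1)` prediction array from a `(N,)` target array would otherwise broadcast to an `(N, N)` matrix.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(targets) for a single-output regression."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

y_true_toy = [3.0, -0.5, 2.0, 7.0]
y_pred_toy = [2.5, 0.0, 2.0, 8.0]

ev = explained_variance(y_true_toy, y_pred_toy)
# Adding a constant offset to every prediction leaves the residual variance unchanged
ev_shifted = explained_variance(y_true_toy, [p + 10.0 for p in y_pred_toy])

print(f"{ev:.4f}")          # 0.9572
print(f"{ev_shifted:.4f}")  # same value: a constant bias is not penalized
```

Because a constant bias goes unnoticed, it is usually worth reading this score alongside an error metric that does penalize it.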


Using sample weight

In [29]:
weights = np.random.rand(len(y_true))

evscore = explained_variance_score(y_true = y_true, y_pred = y_pred, sample_weight = weights)

print(f"The explained_variance_score is {(evscore * 100):.3}%.")

The explained_variance_score is 91.0%.


Using different multioutput options

In [30]:
# uniform_average (default)
evscore = explained_variance_score(y_true, y_pred, multioutput = 'uniform_average')
print(f"The explained_variance [uniform_average] is {(evscore * 100):.2f}%.")

# raw_values: returns one value per output
evscore = explained_variance_score(y_true, y_pred, multioutput = 'raw_values')
print(f"The explained_variance [raw_values] is {evscore}.")

# variance_weighted
evscore = explained_variance_score(y_true, y_pred, multioutput = 'variance_weighted')
print(f"The explained_variance [variance_weighted] is {(evscore * 100):.2f}%.")

The explained_variance [uniform_average] is 91.03%.
The explained_variance [raw_values] is [0.91030194].
The explained_variance [variance_weighted] is 91.03%.


____________