In [1]:
import sys

sys.path.insert(0, "/home/cristian/holisticai/src")

# **How Sensitive Is Your Model's Accuracy to Changes in the Dataset's Structure?**


For a variety of reasons, a machine learning (ML) practitioner may face dataset shift or lose instances from a dataset. This set of experiments shows how exposed the practitioner is to variations in accuracy that stem from inherent characteristics of the dataset and that traditional ML experiments do not reveal. We take a deeper look at the geometric structure of some datasets and propose a method, called the **accuracy degradation profile** (ADP), to measure the practitioner's exposure to significant changes in the dataset.

In [3]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from holisticai.robustness.metrics import (
    accuracy_degradation_profile,
    accuracy_degradation_factor,
)

from holisticai.robustness.plots import (
    plot_2d,
    plot_adp_and_adf,
    plot_label_and_prediction,
    plot_neighborhood,
)


First, create your own 2D classification dataset for experiments:

In [None]:
# Imports
from sklearn.datasets import make_blobs

random_state = 42

# Generate synthetic data using make_blobs with closer centers
center_box = 2.5
X, y = make_blobs(n_samples=100, 
                  centers=2, 
                  n_features=2, 
                  cluster_std=0.8, 
                  center_box=(-center_box, center_box), 
                  random_state=random_state)


In [None]:
# Plot the 2D classification data
plot_2d(X, y)

Let us perform the usual machine learning pipeline to infer the test accuracy over the **entire** test set using a Decision Tree. You may change the classifier.

In [None]:
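# Imports
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score


# NOTE: `pre_process_data` is neither defined nor imported in this notebook.
# The helper below is an assumed stand-in built on train_test_split; it also
# returns the row positions of the test samples so they can be highlighted later.
def pre_process_data(X, y, test_size=0.3, random_state=42):
    indices = np.arange(len(X))
    X_train, X_test, y_train, y_test, _, test_indices = train_test_split(
        X, y, indices, test_size=test_size, random_state=random_state
    )
    return X_train, X_test, y_train, y_test, test_indices


X_train, X_test, y_train, y_test, test_indices = pre_process_data(X, y, test_size=0.3, random_state=random_state)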

# Train a classifier over the data
clf = tree.DecisionTreeClassifier(random_state=random_state)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy over the test set
accuracy_test = accuracy_score(y_test, y_pred)
print(f"Accuracy test set: {accuracy_test:.4f}")


Let us highlight the test set on the plot. This will help us keep track of the changes applied to it.

In [None]:
# Highlight the test set on the plot
plot_2d(X, y, highlight_group=test_indices)

Let us keep only the test set on the plot, labeling each point with a (very!) small index for tracking.

In [None]:
# Plot test set with index
plot_2d(X, y, highlight_group=test_indices, show_just_group=True)

Let us now plot y_test and y_pred together in the same graph. The values of y_pred (shaded circles) are shifted vertically by a small amount for better visualization. Misclassified points can be spotted **by the different colors** of y_test and y_pred.
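
The same comparison can also be made numerically. The snippet below is a minimal sketch on toy arrays (the names `y_true_demo` and `y_pred_demo` and their values are hypothetical, not the notebook's data); it lists the positions where the prediction disagrees with the true label.

```python
import numpy as np

# Toy labels and predictions (hypothetical values, not the notebook's data)
y_true_demo = np.array([0, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 1, 0, 0, 1, 1])

# Positions where the prediction disagrees with the true label
misclassified = np.flatnonzero(y_true_demo != y_pred_demo)
print(misclassified)
```

Applied to `y_test` and `y_pred`, the resulting positions refer to rows of the test set, which can then be matched against the indices shown on the plot.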

In [None]:
# Plot y_test and y_pred simultaneously
plot_label_and_prediction(X_test, y_test, y_pred, vertical_offset=0.1)

Let us now analyze the neighborhood of a list of selected points in the test set.

In [None]:
# Choose the size of the neighborhood
n_neighbors = 4

# Choose the points of interest
# Example:
points_of_interest = [96, 83, 76]

# Plot points of interest and its neighborhoods
plot_neighborhood(X_test,
                  y_test,
                  y_pred,
                  n_neighbors,
                  points_of_interest = points_of_interest,
                  vertical_offset = 0.10,
                  indices_show = test_indices)

Check whether the accuracies computed in each convex hull, formed by a point of interest and its neighborhood, have the same value. Empirical experiments on different synthetic and real datasets have shown that these values differ when *points_of_interest* or *n_neighbors* vary. You have probably observed the same. **Why does this happen?** Because the dataset has its own nuances in its topology, which make the local accuracies, computed only over the neighborhoods of the points of interest, differ from one another.
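
To make the idea of a local accuracy concrete, here is a minimal sketch, not the holisticai implementation. It assumes the neighborhood of a point is simply its `n_neighbors` nearest points (the point itself included) rather than every point inside a convex hull; the function name `local_accuracies` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def local_accuracies(X, y_true, y_pred, n_neighbors):
    """Accuracy of y_pred inside each point's n_neighbors-nearest neighborhood."""
    X = np.asarray(X, dtype=float)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    # neighbor_idx[i] lists the n_neighbors points closest to point i (itself included)
    _, neighbor_idx = nn.kneighbors(X)
    return correct[neighbor_idx].mean(axis=1)


# Toy 1-D data (hypothetical): two well-separated groups of three points
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_true_demo = np.array([0, 0, 0, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 0, 0])  # errors only in the second group

global_acc = float((y_true_demo == y_pred_demo).mean())
local_acc = local_accuracies(X_demo, y_true_demo, y_pred_demo, n_neighbors=3)
print(global_acc)  # one number for the whole set
print(local_acc)   # one number per point, depending on where the point sits
```

In this toy case all errors sit in one group, so the local accuracies of the two groups differ sharply even though a single global accuracy summarizes both.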

At this point, we propose the **accuracy degradation profile (ADP)**, a method to evaluate the **robustness** of machine learning models on datasets by iteratively **reducing the test set size** and analyzing the impact on accuracy. For each portion of the reduced dataset, ADP indicates whether the accuracy computed over each sample's neighborhood falls below a defined threshold. ADP is most easily understood through an example.

Let us perform ADP over the selected dataset:

In [None]:
# Perform accuracy degradation profile (ADP)
results = accuracy_degradation_profile(X_test, 
                                    y_test, 
                                    y_pred, 
                                    n_neighbors = 20,
                                    step_size = 0.04,
                                    )
results

Understanding the ADP profile matrix, column-by-column:

- **size_factor** is the proportion of the test set that is being considered. For example, if **size_factor** = 0.95, each sample of the test set uses only its 95% nearest neighbors for the accuracy calculation.

- **above_threshold** is the number of samples with accuracy above the *threshold*. The *threshold* is the minimum acceptable accuracy, calculated as *baseline_accuracy* * *threshold_percentual*, where *threshold_percentual* is the threshold percentage for accuracy degradation (default value: 0.95). For example, if **above_threshold** = 30, then 30 samples of the test set have an accuracy (calculated over their neighborhood) above the minimum acceptable accuracy.

- **ADP** (accuracy degradation profile) is the proportion of samples with accuracy above the *threshold*. For example, if **ADP** = 1.0, then 100% of the samples of the test set have an accuracy (calculated over their neighborhood) above the minimum acceptable accuracy.

- **average_accuracy** is the mean accuracy considering the **size_factor**.

- **variance_accuracy** is the variance of accuracies considering the **size_factor**.

- **decision** results from comparing the proportion of samples above the threshold (the **ADP** column) with *above_percentual*, the proportion of samples required to be above the threshold to avoid degradation (default value: 0.90). If that proportion is greater than or equal to *above_percentual*, there is no accuracy degradation (marked as 'OK'). If it is smaller than *above_percentual*, there is accuracy degradation (marked as 'acc degrad!').
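
The column definitions above can be sketched for a single row of the profile. This is an illustration under stated assumptions, not the library's code: the helper name `adp_row` is hypothetical, "above" is taken as greater than or equal to the threshold, the variance is the population variance, and the decision compares the ADP proportion with *above_percentual*.

```python
import numpy as np


def adp_row(local_acc, baseline_accuracy, threshold_percentual=0.95, above_percentual=0.90):
    """Summarize per-sample neighborhood accuracies into one ADP row (sketch)."""
    local_acc = np.asarray(local_acc, dtype=float)
    # Minimum acceptable accuracy
    threshold = baseline_accuracy * threshold_percentual
    # Number and proportion of samples at or above the threshold (assumed: >=)
    above_threshold = int((local_acc >= threshold).sum())
    adp = above_threshold / local_acc.size
    return {
        "above_threshold": above_threshold,
        "ADP": adp,
        "average_accuracy": float(local_acc.mean()),
        "variance_accuracy": float(local_acc.var()),  # population variance (assumed)
        "decision": "OK" if adp >= above_percentual else "acc degrad!",
    }


# Hypothetical neighborhood accuracies for four samples, baseline accuracy 0.85
row = adp_row([0.9, 0.9, 0.8, 0.6], baseline_accuracy=0.85)
print(row)
```

Here the threshold is 0.85 * 0.95 = 0.8075, so only two of the four samples pass; the proportion 0.5 is below 0.90 and the row is flagged as degraded.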

**Summary**:

Every time you see 'acc degrad!' in the ADP matrix, **watch out** (!!!): *the accuracies over the convex hulls (samples and their neighborhoods) do not exceed the minimum acceptable accuracy often enough*. It may be valuable to look deeper into the topology of the dataset to check what is actually happening with your classifier on the reduced dataset. Even if your dataset is multidimensional, you may use dimensionality reduction to look at it in 2D.

The **accuracy degradation factor** (ADF) is the first **size_factor** at which an accuracy degradation occurs. ADF ranges from 0.0 to 1.0. The closer it is to 1.0, the **less robust** the model is to dataset shifts under the ADP methodology. The closer it is to 0.0, the **more robust** the model is to dataset shifts under the ADP methodology.
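
Reading the ADF off a profile table can be sketched as follows. This is a minimal illustration, not the `accuracy_degradation_factor` implementation: the helper name `first_degradation_size_factor` and the toy table are hypothetical, "first" is taken to mean the largest degraded **size_factor**, and returning `None` when no row is degraded is a choice of this sketch only.

```python
import pandas as pd


def first_degradation_size_factor(profile):
    """Largest size_factor whose decision is not 'OK', or None if there is none (sketch)."""
    # Walk the profile from the full test set (1.0) towards smaller size factors
    ordered = profile.sort_values("size_factor", ascending=False)
    degraded = ordered.loc[ordered["decision"] != "OK", "size_factor"]
    return float(degraded.iloc[0]) if not degraded.empty else None


# Hypothetical profile table with the two columns the sketch needs
profile_demo = pd.DataFrame(
    {
        "size_factor": [1.0, 0.96, 0.92, 0.88],
        "decision": ["OK", "OK", "acc degrad!", "OK"],
    }
)
adf_demo = first_degradation_size_factor(profile_demo)
print(adf_demo)
```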

In [None]:
adf = accuracy_degradation_factor(pd.DataFrame(results.data))
adf

In [None]:
# Plot ADP and ADF
plot_adp_and_adf(results.data)

Let us now apply ADP and ADF to **real datasets:**

(You can import your own dataset)

In [None]:
from scipy.spatial import KDTree
import numpy as np

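def kd_tree_uniform_sampling(X, sample_size=100):
    """Return the row positions of up to `sample_size` points of X."""
    X = np.asarray(X, dtype=float)

    # Build the K-D tree
    kd_tree = KDTree(X)

    # Collect indices by querying the nearest neighbors of each point in turn
    sampled_indices = []
    # Query at least one neighbor per point (k=0 is not a valid query)
    points_per_leaf = max(1, sample_size // kd_tree.n)

    for point_index in range(kd_tree.n):
        if len(sampled_indices) >= sample_size:
            break
        _, indices = kd_tree.query(X[point_index], k=points_per_leaf)
        # query returns a scalar index when k=1, so normalize to a 1-D array
        sampled_indices.extend(np.atleast_1d(indices).tolist())

    # Remove duplicates and limit the number of samples to the desired size
    sampled_indices = sorted(set(sampled_indices))[:sample_size]

    # Return positions (not rows) so the caller can index both X and y
    return sampled_indices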

In [None]:
from holisticai.datasets import load_dataset

# Choose any of the following datasets:
# 'adult'
# 'law_school'
# 'student_multiclass'
# 'us_crime_multiclass'
# 'clinical_records'

# New datasets:
# 'german_credit'
# 'census_kdd'
# 'bank_marketing'
# 'compass'
# 'diabetes'
# 'acsincome'
# 'acspublic'

# Load dataset
dataset = load_dataset('adult')
print(f'Original X shape: {dataset["X"].shape}')
print(f'Original y shape: {dataset["y"].shape}')

# Shrink the dataset
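n_rows = 20000 # Number of rows to keep
# n_rows = dataset.data.shape[0] # Select all rows
# Sample from the loaded dataset (not the synthetic X above), using the function defined earlier
sampled_indices = kd_tree_uniform_sampling(dataset['X'], sample_size=n_rows)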

X = dataset['X'].iloc[sampled_indices,:]
y = dataset['y'].iloc[sampled_indices]


In [None]:
# Split data into training and test sets
X_train, X_test, y_train, y_test, test_indices = pre_process_data(X, y, test_size = 0.3, random_state = random_state)

In [None]:
# Train a classifier over the data
clf = tree.DecisionTreeClassifier(random_state=random_state)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)  # same input type as in fit, so feature names stay consistent

# Accuracy over the entire test set
accuracy_test = accuracy_score(y_test, y_pred)
print(f"Accuracy test set: {accuracy_test:.4f}")

In [None]:
# Perform accuracy degradation profile (ADP)
results = accuracy_degradation_profile(X_test, 
                                    y_test, 
                                    y_pred, 
                                    n_neighbors = 50,
                                    step_size = 0.02,
                                    )
results

In [None]:
adf = accuracy_degradation_factor(pd.DataFrame(results.data))
adf

In [None]:
# Plot ADP and ADF
plot_adp_and_adf(results.data)

In [None]:
results

You may want to visualize the projections on a 2D plot of your multidimensional dataset.

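First, let us choose the two axes to plot. Here they are the first two principal components of a PCA on the standardized features; picking two features by a *feature importance* criterion is an alternative.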

In [None]:
# Use the two leading principal components (PCs) of a PCA as the plot axes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X)
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

features_to_plot = ['PC1', 'PC2']
X = pca_df

In [None]:
# Assign data structures
X_train, X_test, y_train, y_test, test_indices = pre_process_data(X, y, test_size = 0.3, random_state = random_state)


In [None]:
# Plot the 2D classification data
plot_2d(X, y, features_to_plot = features_to_plot)

In [None]:
# Highlight the test set on the plot
plot_2d(X, y, highlight_group=test_indices, features_to_plot = features_to_plot)

In [None]:
# Plot test set with index
plot_2d(X,
        y,
        highlight_group=test_indices,
        show_just_group=True,
        features_to_plot = features_to_plot)

In [None]:
# Plot y_test and y_pred simultaneously
# It may be necessary to adjust the 'vertical_offset' parameter
plot_label_and_prediction(X_test,
                          y_test,
                          y_pred,
                          vertical_offset=0.2,
                          features_to_plot = features_to_plot)

In [None]:
# Choose the size of the neighborhood
n_neighbors = 10

# Choose the points of interest
points_of_interest = test_indices[:3]

# Plot points of interest and its neighborhoods
plot_neighborhood(X_test,
                  y_test,
                  y_pred,
                  n_neighbors,
                  points_of_interest = points_of_interest,
                  vertical_offset = 0.15,
                  indices_show = test_indices)