# Tabular Dataset Out-of-Distribution Augmentations

In this notebook, we examine the proposed out-of-distribution augmentations for tabular datasets and evaluate how often the nearest neighbor of an augmented sample differs from the original sample.

We do this to verify that the proposed augmentations are not too aggressive: an augmented sample should remain closer to its original sample, or to another sample of the same class, than to a sample from a different class. We then identify a scaling factor for each dataset so that the augmentations do not change the data distribution too much.
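
As a rough illustration of this check, the sketch below uses only NumPy and synthetic two-class data (hypothetical, not one of the UCI datasets used here). The helper `nearest_neighbor_match_rate` is an assumption made for this example and is not part of the notebook's code. It reports the fraction of augmented samples whose nearest clean neighbor still carries the original label.

```python
import numpy as np

def nearest_neighbor_match_rate(clean_x, clean_y, aug_x):
    """Fraction of augmented samples whose nearest clean neighbor has the original label.

    Assumes row i of aug_x is the augmented version of row i of clean_x.
    """
    # Pairwise squared Euclidean distances between augmented and clean samples
    dists = ((aug_x[:, None, :] - clean_x[None, :, :]) ** 2).sum(axis=-1)
    nearest = dists.argmin(axis=1)
    return float(np.mean(clean_y[nearest] == clean_y))

rng = np.random.default_rng(0)
# Two well-separated synthetic classes (hypothetical data)
clean_x = np.concatenate([
    rng.normal(-5.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
clean_y = np.array([0] * 50 + [1] * 50)

# Additive Gaussian noise at increasing strengths
for scale in [0.0, 0.1, 100.0]:
    aug_x = clean_x + rng.normal(0.0, 1.0, size=clean_x.shape) * scale
    print(scale, nearest_neighbor_match_rate(clean_x, clean_y, aug_x))
```

With no noise every sample is its own nearest neighbor, so the match rate is 1.0. As the noise grows far beyond the spacing between classes, the rate drops, which is the behavior the scaling factors are meant to keep in check.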

In [1]:
import sys
sys.path.append('../')
import numpy as np

from src.data.uci import UCI
from src.third_party.corruptions import TabularCorruption, TABULAR_MULTIPLICATIVE_SCALE, TABULAR_ADDITIVE_SCALE
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
import pickle
import copy

import warnings

def warn(*args, **kwargs):
    pass

# Silence all warnings by replacing warnings.warn with a no-op
warnings.warn = warn

DATASETS = ["regression_concrete", "regression_boston", "regression_energy", "regression_wine", "regression_yacht", "classification_wine", "classification_toxicity", "classification_abalone", "classification_students", "classification_adult"]
DATA_ROOT = "~/.torch"
TEST_PORTION = 0.0 
SEEDS = 3
CORRUPTION_LEVELS = TabularCorruption.levels
CORRUPTION_NAMES = TabularCorruption.corruption_names
VANILLA_TABULAR_MULTIPLICATIVE_SCALE = copy.deepcopy(TABULAR_MULTIPLICATIVE_SCALE)
VANILLA_TABULAR_ADDITIVE_SCALE = copy.deepcopy(TABULAR_ADDITIVE_SCALE)



In [2]:
def benchmark_match_after_corruptions(datasets=DATASETS, seeds=SEEDS, corruption_levels=CORRUPTION_LEVELS, corruption_names=CORRUPTION_NAMES):
    results = {}
    for seed in range(seeds):
        if seed not in results:
            results[seed] = {}
        # Fix the seed
        np.random.seed(seed)
        for dataset in datasets:
            if dataset not in results[seed]:
                results[seed][dataset] = {}
            task, dataset_name = dataset.split("_")
            clean_dataset = UCI(dataset_name, root=DATA_ROOT, task=task, train=True, test_portion=TEST_PORTION, transform=None)
            clean_xy = [(x, y) for x, y in clean_dataset]
            clean_x = np.array([x.numpy() for x, _ in clean_xy]).reshape(-1, clean_xy[0][0].shape[0])
            clean_y = np.array([y.numpy() for _, y in clean_xy]).reshape(-1, 1)            
            
            # Fit a 1-nearest-neighbor model (classifier or regressor, depending on the task) on the clean data
            clean_knn = KNeighborsClassifier(n_neighbors=1) if task == "classification" else KNeighborsRegressor(n_neighbors=1)
            clean_knn.fit(clean_x, clean_y.ravel())
            
            for level in range(corruption_levels):
                if level not in results[seed][dataset]:
                    results[seed][dataset][level] = {}
                for name in corruption_names:
                    if name not in results[seed][dataset][level]:
                        results[seed][dataset][level][name] = {}
                        
                    aug_dataset = UCI(dataset_name, root=DATA_ROOT, task=task, train=True, test_portion=TEST_PORTION, transform=TabularCorruption(name, level, dataset_scale=1.0))
                    aug_xy = [(x, y) for x, y in aug_dataset]
                    aug_x, aug_y = np.array([x.numpy() for x, _ in aug_xy]), np.array([y.numpy() for _, y in aug_xy])
                    aug_x = aug_x.reshape(-1, aug_xy[0][0].shape[0])
                    aug_y = aug_y.reshape(-1, 1)
                    
                    pred_y = clean_knn.predict(aug_x)
                    
                    if task == "classification":
                        acc = np.mean(pred_y.ravel() == aug_y.ravel())
                        results[seed][dataset][level][name] = acc
                    else:
                        mse = np.mean((pred_y.ravel() - aug_y.ravel()) ** 2)
                        results[seed][dataset][level][name] = mse
    return results

In [3]:
def analyse_results(results):
    """Create a table of results.
    
    Analyse the result across all datasets with respect to each corruption level across all corruption types separately for each task.
    Analyse the result across all datasets with respect to each corruption type across all corruption levels separately for each task.
    Analyse the result across all options separately for each task.
    """
    for task in ["regression", "classification"]:
        print(f"Task: {task}")
        print("Level\t" + "\t".join([str(level) for level in range(CORRUPTION_LEVELS)]))
        for name in CORRUPTION_NAMES:
            print(f"{name}\t" + "\t".join([f"{np.mean([results[seed][dataset][level][name] for seed in range(SEEDS) for dataset in DATASETS if dataset.startswith(task)])} +- {np.std([results[seed][dataset][level][name] for seed in range(SEEDS) for dataset in DATASETS if dataset.startswith(task)])}" for level in range(CORRUPTION_LEVELS)]))
        print("Average\t" + "\t".join([f"{np.mean([results[seed][dataset][level][name] for seed in range(SEEDS) for dataset in DATASETS if dataset.startswith(task) for name in CORRUPTION_NAMES])} +- {np.std([results[seed][dataset][level][name] for seed in range(SEEDS) for dataset in DATASETS if dataset.startswith(task) for name in CORRUPTION_NAMES])}" for level in range(CORRUPTION_LEVELS)]))
        print()
        print("Complete average\t" + f"{np.mean([results[seed][dataset][level][name] for seed in range(SEEDS) for dataset in DATASETS if dataset.startswith(task) for name in CORRUPTION_NAMES for level in range(CORRUPTION_LEVELS)])} +- {np.std([results[seed][dataset][level][name] for seed in range(SEEDS) for dataset in DATASETS if dataset.startswith(task) for name in CORRUPTION_NAMES for level in range(CORRUPTION_LEVELS)])}")
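
The aggregation in `analyse_results` can be hard to read because of the nested comprehensions. The sketch below shows the same idea on a small hypothetical results dictionary laid out as seed -> dataset -> level -> corruption name -> metric. The names `toy_results` and `summarise` are assumptions made for this example only.

```python
import numpy as np

# Hypothetical results: seed -> dataset -> level -> corruption name -> metric
toy_results = {
    seed: {
        "classification_toy": {
            level: {"noise_a": 1.0 - 0.1 * level, "noise_b": 1.0 - 0.2 * level}
            for level in range(2)
        }
    }
    for seed in range(2)
}

def summarise(results, level, name):
    """Mean and standard deviation over seeds and datasets for one level and corruption."""
    values = [
        results[seed][dataset][level][name]
        for seed in results
        for dataset in results[seed]
    ]
    return float(np.mean(values)), float(np.std(values))

mean, std = summarise(toy_results, level=1, name="noise_b")
print(f"{mean} +- {std}")
```

Each table cell printed by `analyse_results` is one such mean and standard deviation; the "Average" row additionally pools over corruption names, and the "Complete average" pools over levels as well.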
    

In [4]:
# Sanity check: set every corruption scale to zero and rerun the benchmark
for i in range(len(TABULAR_MULTIPLICATIVE_SCALE)):
    TABULAR_MULTIPLICATIVE_SCALE[i] = 0.0
    TABULAR_ADDITIVE_SCALE[i] = 0.0
    
results = benchmark_match_after_corruptions(datasets=DATASETS, seeds=SEEDS, corruption_levels=CORRUPTION_LEVELS, corruption_names=CORRUPTION_NAMES)
analyse_results(results)

Task: regression
Level	0	1	2	3	4
additive_gaussian_noise	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127
multiplicative_gaussian_noise	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127
additive_uniform_noise	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127
multiplicative_uniform_noise	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600596964 +- 0.0016191720496863127	0.0008100830600

In [5]:
# Run the benchmark on the original corruption strengths
for i in range(len(TABULAR_MULTIPLICATIVE_SCALE)):
    TABULAR_MULTIPLICATIVE_SCALE[i] = VANILLA_TABULAR_MULTIPLICATIVE_SCALE[i]
    TABULAR_ADDITIVE_SCALE[i] = VANILLA_TABULAR_ADDITIVE_SCALE[i]
    
results = benchmark_match_after_corruptions(datasets=DATASETS, seeds=SEEDS, corruption_levels=CORRUPTION_LEVELS, corruption_names=CORRUPTION_NAMES)
analyse_results(results)
# Use a context manager so the file is flushed and closed after writing
with open("./match_after_corruptions.pkl", "wb") as f:
    pickle.dump(results, f)

Task: regression
Level	0	1	2	3	4
additive_gaussian_noise	0.7245924472808838 +- 1.4266492128372192	0.8565842509269714 +- 1.6084719896316528	0.9593607783317566 +- 1.6850926876068115	1.0404561758041382 +- 1.7138861417770386	1.0950682163238525 +- 1.6524163484573364
multiplicative_gaussian_noise	0.8654730319976807 +- 1.6306952238082886	1.0199588537216187 +- 1.6769318580627441	1.1537892818450928 +- 1.6079789400100708	1.311373233795166 +- 1.6085981130599976	1.47493577003479 +- 1.5505502223968506
additive_uniform_noise	0.6107798218727112 +- 1.2170313596725464	0.8011578917503357 +- 1.5698291063308716	0.8779402375221252 +- 1.6662099361419678	0.9475632309913635 +- 1.7502315044403076	0.9661989808082581 +- 1.7197080850601196
multiplicative_uniform_noise	0.7805746793746948 +- 1.5208638906478882	0.9210113286972046 +- 1.7049763202667236	0.9961715936660767 +- 1.656866192817688	1.0976078510284424 +- 1.7251814603805542	1.1811020374298096 +- 1.6513701677322388
multiplicative_bernoulli_noise	0.259395986795

In [6]:
# For each dataset, try different corruption strengths which are multiples of the original corruption strengths
# This enables us to find the corruption strength which is the most suitable for each dataset
dataset_results = {}
for scaling_factor in np.geomspace(0.0001, 1.0, 20):
    for i in range(len(TABULAR_MULTIPLICATIVE_SCALE)):
        TABULAR_MULTIPLICATIVE_SCALE[i] = VANILLA_TABULAR_MULTIPLICATIVE_SCALE[i] * scaling_factor
        TABULAR_ADDITIVE_SCALE[i] = VANILLA_TABULAR_ADDITIVE_SCALE[i] * scaling_factor
    
    for dataset in DATASETS:
        if dataset not in dataset_results:
            dataset_results[dataset] = {}
        dataset_results[dataset][scaling_factor] = benchmark_match_after_corruptions(datasets=[dataset], seeds=SEEDS, corruption_levels=CORRUPTION_LEVELS, corruption_names=CORRUPTION_NAMES)
# Use a context manager so the file is flushed and closed after writing
with open("./match_after_corruptions_dataset_results.pkl", "wb") as f:
    pickle.dump(dataset_results, f)
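
The scaling factors above come from `np.geomspace(0.0001, 1.0, 20)`, which spaces 20 values evenly on a logarithmic scale. A minimal sketch of what that grid looks like (the variable names `grid` and `ratio` are chosen for this example only):

```python
import numpy as np

grid = np.geomspace(0.0001, 1.0, 20)

# Consecutive values differ by a constant multiplicative step of 10 ** (4 / 19)
ratio = grid[1:] / grid[:-1]

print(len(grid), grid[0], grid[-1])
print(ratio.min(), ratio.max())
```

Because the scaling factors are used as floating-point dictionary keys, later lookups rely on regenerating the grid with exactly the same arguments.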

In [7]:
def get_strongest_scaling_factor(dataset_results, dataset, threshold, task):
    """Find the strongest scaling factor for a dataset that still satisfies the threshold.
    
    For classification, the mean accuracy must stay above the threshold.
    For regression, the threshold is the maximum MSE, so the mean MSE must stay below it.
    Returns 0.0 if no scaling factor satisfies the threshold.
    """
    strongest_scaling_factor = 0.0
    for scaling_factor in np.geomspace(0.0001, 1.0, 20):
        # Average the metric over all seeds, corruption levels, and corruption types
        results = dataset_results[dataset][scaling_factor]
        mean_metric = np.mean([results[seed][dataset][level][name] for seed in range(SEEDS) for level in range(CORRUPTION_LEVELS) for name in CORRUPTION_NAMES])
        if task == "classification" and mean_metric > threshold:
            strongest_scaling_factor = scaling_factor
        if task == "regression" and mean_metric < threshold:
            strongest_scaling_factor = scaling_factor
    return strongest_scaling_factor
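
To make the selection rule concrete, here is a small self-contained sketch on hypothetical numbers. The function `strongest_factor` and the toy dictionaries are assumptions for illustration; they mirror the rule above (keep the last factor, in ascending order, that satisfies the threshold) but are not the notebook's own code.

```python
def strongest_factor(mean_metric_by_factor, threshold, higher_is_better):
    """Return the largest factor whose mean metric satisfies the threshold, or 0.0 if none does."""
    strongest = 0.0
    for factor in sorted(mean_metric_by_factor):
        value = mean_metric_by_factor[factor]
        satisfied = value > threshold if higher_is_better else value < threshold
        if satisfied:
            strongest = factor
    return strongest

# Hypothetical mean accuracies per scaling factor (classification)
toy_accuracy = {0.01: 0.999, 0.1: 0.985, 1.0: 0.90}
print(strongest_factor(toy_accuracy, 0.98, higher_is_better=True))  # 0.1

# Hypothetical mean MSEs per scaling factor (regression)
toy_mse = {0.01: 0.005, 0.1: 0.05, 1.0: 0.5}
print(strongest_factor(toy_mse, 0.1, higher_is_better=False))  # 0.1
```

Note that the loop does not stop at the first failing factor, so if the metric is not monotone in the scaling factor, a larger factor can still be selected after a smaller one has failed.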

In [8]:
# Given thresholds 0.99, 0.98, 0.97, 0.96, 0.95, find the strongest scaling factor for classification datasets
# Given thresholds 0.01, 0.02, 0.1, 0.2, 0.3, find the strongest scaling factor for regression datasets
classification_thresholds = [0.99, 0.98, 0.97, 0.96, 0.95]
regression_thresholds = [0.01, 0.02, 0.1, 0.2, 0.3]
strongest_scaling_factors = {} # Dataset -> threshold -> strongest scaling factor
for dataset in DATASETS:
    strongest_scaling_factors[dataset] = {}
    if dataset.startswith("classification"):
        for threshold in classification_thresholds:
            strongest_scaling_factors[dataset][threshold] = get_strongest_scaling_factor(dataset_results, dataset, threshold, "classification")
    if dataset.startswith("regression"):
        for threshold in regression_thresholds:
            strongest_scaling_factors[dataset][threshold] = get_strongest_scaling_factor(dataset_results, dataset, threshold, "regression")

In [9]:
# Print a table of the strongest scaling factors for the thresholds
print("Classification (thresholds)")
print("\t".join([str(threshold) for threshold in classification_thresholds]))
for dataset in DATASETS:
    if dataset.startswith("classification"):
        print(dataset + "\t" + "\t".join([str(strongest_scaling_factors[dataset][threshold]) for threshold in classification_thresholds]))
print()
print("Regression (thresholds)")
print("\t".join([str(threshold) for threshold in regression_thresholds]))
for dataset in DATASETS:
    if dataset.startswith("regression"):
        print(dataset + "\t" + "\t".join([str(strongest_scaling_factors[dataset][threshold]) for threshold in regression_thresholds]))

Classification (thresholds)
0.99	0.98	0.97	0.96	0.95
classification_wine	0.007847599703514606	0.012742749857031334	0.012742749857031334	0.012742749857031334	0.0206913808111479
classification_toxicity	0.3792690190732246	0.615848211066026	0.615848211066026	1.0	1.0
classification_abalone	0.08858667904100823	0.14384498882876628	0.14384498882876628	0.14384498882876628	0.23357214690901212
classification_students	0.23357214690901212	0.3792690190732246	0.615848211066026	0.615848211066026	1.0
classification_adult	0.14384498882876628	0.23357214690901212	0.3792690190732246	0.3792690190732246	0.615848211066026

Regression (thresholds)
0.01	0.02	0.1	0.2	0.3
regression_concrete	0.0206913808111479	0.05455594781168514	0.3792690190732246	0.615848211066026	0.615848211066026
regression_boston	0.08858667904100823	0.14384498882876628	0.3792690190732246	0.615848211066026	1.0
regression_energy	0.05455594781168514	0.14384498882876628	0.615848211066026	1.0	1.0
regression_wine	0.004832930238571752	0.00784759970