# Contents
* [Introduction](#Introduction)
* [Imports and configuration](#Imports-and-configuration)
* [Setup](#Setup)
* [Feature transformation](#Feature-transformation)
* [Results](#Results)

# Introduction

To speed development, 5-fold train-test splits were preprocessed and prepared as separate .feather files. Further, several features were extracted for each split and aggregated. This notebook performs uniform quantile transformation on the aggregated FRILL-based data per split.

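I noticed outliers and skewed distributions in the features. I chose quantile transformation because it is non-parametric and robust to outliers: each feature is mapped through its empirical CDF, so only the rank order of its values matters. I chose a uniform output distribution because it bounds every transformed feature to [0, 1], which should make the result even more robust to extreme values.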
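
As a minimal sketch of this behaviour, the following applies `QuantileTransformer` to a synthetic skewed feature with one injected outlier. The log-normal data, the sample size, and the `n_quantiles` value are assumptions chosen for illustration; they are not taken from the notebook's features.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(2021)

# Hypothetical skewed feature: log-normal draws plus one extreme outlier
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))
x[0, 0] = 1e6

qt = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform", random_state=2021
)
x_uniform = qt.fit_transform(x)

# The transformed feature is bounded to [0, 1]; the outlier lands on the upper bound
print("raw max / raw median:", x.max() / np.median(x))
print("transformed range:", x_uniform.min(), x_uniform.max())
print("transformed outlier:", x_uniform[0, 0])
```

However large the outlier is in raw units, it can never exceed 1 after the transform, so it no longer dominates the feature's scale.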

Some models are not very sensitive to these issues, but the transformation can still help them indirectly: GaussianNB, for instance, could benefit if a RidgeClassifier selects better features from the transformed data.
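
One way such a setup might look is sketched below: a ridge classifier's coefficients pick the features that a `GaussianNB` then sees. This is an illustration under assumptions, not the pipeline used in this project; the synthetic data from `make_classification`, the use of `SelectFromModel`, and all parameter values are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in data (assumption): 20 features, only 5 of them informative
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=2021
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)

pipeline = make_pipeline(
    QuantileTransformer(
        n_quantiles=100, output_distribution="uniform", random_state=2021
    ),
    # Keep features whose absolute ridge coefficient is at least the mean
    SelectFromModel(RidgeClassifier()),
    GaussianNB(),
)
pipeline.fit(X_train, y_train)

n_selected = int(pipeline.named_steps["selectfrommodel"].get_support().sum())
test_accuracy = pipeline.score(X_test, y_test)
print(f"{n_selected} of {X.shape[1]} features kept; accuracy {test_accuracy:.3f}")
```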

# Imports and configuration

In [1]:
from time import time

notebook_begin_time = time()

# set random seeds

from os import environ
from random import seed as random_seed
from numpy.random import seed as np_seed
from tensorflow.random import set_seed


def reset_seeds(seed: int) -> None:
    """Utility function for resetting random seeds"""
    environ["PYTHONHASHSEED"] = str(seed)
    random_seed(seed)
    np_seed(seed)
    set_seed(seed)


reset_seeds(SEED := 2021)
del environ
del random_seed
del np_seed
del set_seed
del reset_seeds
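
A small check of what reseeding buys, written as a sketch: the helper `reset_basic_seeds` is a hypothetical, reduced version of the function above that leaves out TensorFlow so the example runs without it.

```python
import random

import numpy as np


def reset_basic_seeds(seed: int) -> None:
    """Reseed Python's and NumPy's global generators (TensorFlow omitted)"""
    random.seed(seed)
    np.random.seed(seed)


reset_basic_seeds(2021)
first_draw = (random.random(), np.random.rand(3).tolist())

reset_basic_seeds(2021)
second_draw = (random.random(), np.random.rand(3).tolist())

# Reseeding with the same value replays the same pseudo-random sequence
print(first_draw == second_draw)
```

Note that `PYTHONHASHSEED` is read when the interpreter starts, so assigning it from inside a running kernel should only affect processes launched afterwards, not the kernel's own string hashing.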

In [2]:
# extensions
%load_ext autotime
%load_ext lab_black
%load_ext nb_black

In [3]:
# core
# import numpy as np
import pandas as pd

# utility
from gc import collect as gc_collect

# faster sklearn
from sklearnex import patch_sklearn

patch_sklearn()
del patch_sklearn

# other sklearn
from sklearn.linear_model import RidgeClassifierCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import QuantileTransformer

# typing
from typing import Tuple

# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"
del InteractiveShell

_ = gc_collect()

time: 1.67 s


Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


Constants

In [4]:
# Location of CV .feather files
CV_FEATHERS_FOLDER = "."

# Location where this notebook will output
DATA_OUT_FOLDER = "."

FOLDS = (0, 1, 2, 3, 4)

_ = gc_collect()

time: 116 ms


# Setup

In [5]:
def read_data(fold_num: int, filename: str) -> pd.DataFrame:
    """Read the non-FRILL .feather features or associated labels for the specified split"""
    return pd.read_feather(f"{CV_FEATHERS_FOLDER}/cv_{fold_num}/{filename}.feather")


def scale_data(
    X_train: pd.DataFrame, X_test: pd.DataFrame
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Fit a uniform quantile transformer on the training data and apply it to both splits"""
    # Fit on the training split only so that no test-set statistics leak into the transform
    scaler = QuantileTransformer(output_distribution="uniform", random_state=SEED).fit(
        X_train
    )  # output_distribution="uniform" is default
    X_train = pd.DataFrame(scaler.transform(X_train), columns=scaler.feature_names_in_)
    X_test = pd.DataFrame(scaler.transform(X_test), columns=scaler.feature_names_in_)
    return X_train, X_test


def save_transformed_dataframe(fold_num: int, filename: str, df: pd.DataFrame) -> None:
    """Save a transformed dataframe to the folder for the specific split"""
    df.to_feather(f"{DATA_OUT_FOLDER}/cv_{fold_num}/{filename}.feather")


def accuracy_with_ridge(
    X_train: pd.DataFrame, X_test: pd.DataFrame, y_train: pd.Series, y_test: pd.Series
) -> float:
    """Given split data, return the test accuracy of a ridge classifier"""
    return accuracy_score(
        y_true=y_test,
        y_pred=RidgeClassifierCV(scoring="accuracy")
        .fit(X_train, y_train)
        .predict(X_test),
    )


def perform_transformation(fold_num: int) -> None:
    """Transform and save data and display ridge classification accuracy before and after for a given fold"""
    print(
        "accuracy before:",
        accuracy_with_ridge(
            X_train := read_data(fold_num, filename="X_train_nonFRILL"),
            X_test := read_data(fold_num, filename="X_test_nonFRILL"),
            y_train := read_data(fold_num, filename="y_train_untransformed").iloc[:, 0],
            y_test := read_data(fold_num, filename="y_test_untransformed").iloc[:, 0],
        ),
    )
    begin = time()
    X_train, X_test = scale_data(X_train, X_test)
    print(f"transformation completed in {time() - begin} seconds")
    print("accuracy after:", accuracy_with_ridge(X_train, X_test, y_train, y_test))
    save_transformed_dataframe(
        fold_num, filename="X_train_FRILL-based_uniQT", df=X_train
    )
    save_transformed_dataframe(fold_num, filename="X_test_FRILL-based_uniQT", df=X_test)
    print("saved transformed data")
    print()

time: 3.04 ms


# Feature transformation

In [6]:
already_completed = []
for fold_num in FOLDS:
    if fold_num in already_completed:
        continue
    perform_transformation(fold_num)

accuracy before: 0.4736179042537424
transformation completed in 7.5569915771484375 seconds
accuracy after: 0.6625907810878909
saved transformed data

accuracy before: 0.645001735508504
transformation completed in 7.6337890625 seconds
accuracy after: 0.6493405067684832
saved transformed data

accuracy before: 0.5295822765153476
transformation completed in 7.416523694992065 seconds
accuracy after: 0.5997204824266267
saved transformed data

accuracy before: 0.6147278136228412
transformation completed in 7.200856685638428 seconds
accuracy after: 0.6137619286790558
saved transformed data

accuracy before: 0.30316961045245583
transformation completed in 7.482924699783325 seconds
accuracy after: 0.5754294701185579
saved transformed data

time: 1min 25s


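# Results

Accuracy improved in four of the five folds, most notably in fold 4 (0.303 to 0.575) and fold 0 (0.474 to 0.663); fold 3 dropped marginally (0.6147 to 0.6138).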
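
The per-fold figures printed above can be tabulated to make the comparison explicit. This is a small sketch that copies the accuracies from the run output; the DataFrame layout and the names `results`, `delta`, and `n_improved` are illustrative choices.

```python
import pandas as pd

# Accuracies copied from the run output above, one row per fold
results = pd.DataFrame(
    {
        "before": [
            0.4736179042537424,
            0.645001735508504,
            0.5295822765153476,
            0.6147278136228412,
            0.30316961045245583,
        ],
        "after": [
            0.6625907810878909,
            0.6493405067684832,
            0.5997204824266267,
            0.6137619286790558,
            0.5754294701185579,
        ],
    },
    index=pd.Index(range(5), name="fold"),
)
results["delta"] = results["after"] - results["before"]

n_improved = int((results["delta"] > 0).sum())
mean_before = results["before"].mean()
mean_after = results["after"].mean()

print(results.round(4))
print(
    f"{n_improved} of {len(results)} folds improved; "
    f"mean accuracy {mean_before:.3f} -> {mean_after:.3f}"
)
```

Averaged over the five folds, ridge accuracy rises from about 0.513 to about 0.620.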

In [7]:
print(f"Time elapsed since notebook_begin_time: {time() - notebook_begin_time} s")
_ = gc_collect()

Time elapsed since notebook_begin_time: 92.18702030181885 s
time: 110 ms


[^top](#Contents)