# Contents
* [Introduction](#Introduction)
* [Imports and configuration](#Imports-and-configuration)
* [Load data](#Load-data)
* [Strata](#Strata)
* [Hyperparameters](#Hyperparameters)
* [Models](#Models)
* [Evaluate](#Evaluate)
* [Discussion](#Discussion)

# Introduction

After several rounds of feature engineering, visualization and exploration, and tuning of the classification pipeline, we are ready to build a benchmark prototype classifier with `RandomForestClassifier`. In this notebook, we perform a grid search over `n_estimators` using out-of-bag accuracy instead of cross-validation. The other hyperparameters come from an earlier 5-fold cross-validation on a related set of features.
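
As a rough illustration of an out-of-bag search over `n_estimators`, the sketch below grows one forest incrementally with `warm_start=True` and records `oob_score_` at each candidate size. The synthetic data from `make_classification`, the candidate grid, and the helper name `oob_search` are assumptions for illustration; they are not the data, grid, or code used in this project.

```python
from typing import Dict, Sequence

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def oob_search(X, y, candidates: Sequence[int], seed: int = 2021) -> Dict[int, float]:
    """Return out-of-bag accuracy for each candidate n_estimators (hypothetical helper)."""
    forest = RandomForestClassifier(
        bootstrap=True,  # OOB scoring requires bootstrap sampling
        oob_score=True,
        warm_start=True,  # keep the trees already grown and only add new ones
        random_state=seed,
    )
    scores = {}
    for n in sorted(candidates):
        forest.set_params(n_estimators=n)
        forest.fit(X, y)
        scores[n] = forest.oob_score_
    return scores


# toy data standing in for the real feature matrix (assumption)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
oob_scores = oob_search(X_toy, y_toy, candidates=[50, 100, 150])
best_n = max(oob_scores, key=oob_scores.get)
```

Because each sample is scored only by trees that did not see it during bootstrapping, this gives a validation-like estimate without refitting the forest once per fold. Note that it does not respect speaker groups the way the grouped cross-validator below does.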

# Imports and configuration

In [1]:
from time import time

notebook_begin_time = time()

# set random seeds

from os import environ
from random import seed as random_seed
from numpy.random import seed as np_seed
from tensorflow.random import set_seed


def reset_seeds(seed: int) -> None:
    """Utility function for resetting random seeds"""
    environ["PYTHONHASHSEED"] = str(seed)
    random_seed(seed)
    np_seed(seed)
    set_seed(seed)


reset_seeds(SEED := 2021)
del environ
del random_seed
del np_seed
del set_seed
del reset_seeds

In [2]:
# extensions
%load_ext autotime
%load_ext lab_black
%load_ext nb_black

In [3]:
# core
import numpy as np
import pandas as pd

# utility
from joblib import dump
from gc import collect as gc_collect
from tqdm.notebook import tqdm

# faster
import swifter
from sklearnex import patch_sklearn

patch_sklearn()
del patch_sklearn

# typing
from typing import List, Dict

# metrics
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

# other sklearn
from sklearn.calibration import CalibratedClassifierCV
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_validate, StratifiedGroupKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# display outputs w/o print calls
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"
del InteractiveShell

time: 3.85 s


Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [4]:
# Location of FRILL .feather files
FRILL_FEATHERS_FOLDER = "../1.0-mic-extract_FRILL_embeddings"

# Location of pre-final features
FEATURES_FOLDER = "../19.0-mic-extract_FRILL-based_features_from_full_data"

# Location where this notebook will output
DATA_OUT_FOLDER = "."

_ = gc_collect()

time: 123 ms


# Load data

In [5]:
def load_labels() -> pd.DataFrame:
    """Load just the labels"""
    keep_columns = [
        "id",
        "source",
        "speaker_id",
        "speaker_gender",
        "emo",
        "valence",
        "lang1",
        "length",
    ]
    labels = pd.concat(
        (
            pd.read_feather(
                f"{FRILL_FEATHERS_FOLDER}/dev_labels.feather", columns=keep_columns
            ),
            pd.read_feather(
                f"{FRILL_FEATHERS_FOLDER}/nondev_labels.feather", columns=keep_columns
            ),
        )
    ).set_index("id")
    return labels


def load_data(unscaled=False) -> pd.DataFrame:
    """Loads the FRILL-based features"""
    if unscaled:
        df = pd.read_feather(
            f"{FEATURES_FOLDER}/unscaled_features_ready_for_selection.feather"
        ).set_index("id")
    else:
        df = pd.read_feather(
            f"{FEATURES_FOLDER}/features_ready_for_selection.feather"
        ).set_index("id")
    df.columns = df.columns.astype(str)
    return df


data = load_data(unscaled=True)
labels = load_labels()
y_true = labels.valence
gnb_features = ["spherical-LDA1", "spherical-LDA2"]
assert all(data.index == labels.index)
_ = gc_collect()

time: 398 ms


In [6]:
data.info()
labels.info()

<class 'pandas.core.frame.DataFrame'>
UInt64Index: 86752 entries, 0 to 87363
Columns: 118 entries, theta_LDA1+LDA2 to LDA-ocSVM_poly6_pos
dtypes: float64(118)
memory usage: 78.8 MB
<class 'pandas.core.frame.DataFrame'>
UInt64Index: 86752 entries, 0 to 87363
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   source          86752 non-null  category
 1   speaker_id      86752 non-null  category
 2   speaker_gender  86752 non-null  category
 3   emo             86752 non-null  category
 4   valence         86752 non-null  int8    
 5   lang1           86752 non-null  category
 6   length          86752 non-null  category
dtypes: category(6), int8(1)
memory usage: 1.4 MB
time: 30 ms


# Strata

In [7]:
N_SPLITS = 7

# fields are concatenated for quick permutation omitting non-existent combos
strata = labels.loc[
    :, ["source", "speaker_gender", "emo", "valence", "lang1", "length"]
]
strata.valence = strata.valence.astype(str)
strata = strata.swifter.apply("".join, axis=1)

Pandas Apply: 100%|██████████| 86752/86752 [00:01<00:00, 79904.24it/s]

time: 1.19 s
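
The concatenation step above can be seen on a toy table. This is a minimal sketch using plain pandas `apply` instead of `swifter`; the column names follow the notebook, while the rows and the `toy_` variable names are made up for illustration.

```python
import pandas as pd

# toy label table; the rows are invented for illustration
toy_labels = pd.DataFrame(
    {
        "source": ["esd", "esd", "MELD", "esd"],
        "speaker_gender": ["f", "f", "m", "f"],
        "emo": ["hap", "hap", "neu", "sur"],
        "valence": [2, 2, 1, 0],
    }
)

# cast everything to str so each row can be joined into a single key
toy_strata = toy_labels.astype(str).apply("".join, axis=1)

# only combinations that actually occur appear in the counts
toy_counts = toy_strata.value_counts()
```

Joining the fields into one string means every observed combination becomes its own stratum label, and combinations that never occur are simply absent.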





In [8]:
# utility function for identifying strata with only i occurrences
def get_solo(i: int, strata_: pd.Series) -> np.ndarray:
    """Given a series of stratum memberships, return a shuffled array of strata with only i members."""
    return np.unique(
        strata_.loc[
            strata_.isin(
                (strata_counts := strata_.value_counts())
                .where(strata_counts == i)
                .dropna()
                .index
            )
        ]
        .sample(frac=1, random_state=SEED)
        .values
    )


# get solos, print stuff
def get_onlys(
    strata_: pd.Series, print_me: str = "", n_splits: int = N_SPLITS
) -> List[Dict[int, np.ndarray]]:
    """Optionally prints something and returns the results of get_solo calls on strata_ in a list"""
    print(print_me)
    solos = []
    for i in range(1, n_splits):
        solo: np.ndarray = get_solo(i, strata_)
        print(f"only {i}:", (_ := solo.size))
        if _:  # >= 1 strata with only i samples
            solos.append({i: solo})
    return solos


def process_strata(strata: pd.Series, n_splits: int = N_SPLITS) -> pd.Series:
    """Corrects strata membership column according to n_splits"""

    count = get_onlys_calls = 0

    while onlys := get_onlys(
        strata,
        print_me=f"merge passes performed: {get_onlys_calls}",
        n_splits=n_splits,
    ):
        get_onlys_calls += 1
        if len(onlys) == 1:
            last = onlys[0]
            strata_to_merge: np.ndarray = list(last.values())[0]
            only_key = list(last.keys())[0]
            tuplet_size = n_splits // only_key + (1 if n_splits % only_key else 0)
            # perform tuplet merge
            interval = len(strata_to_merge) // n_splits
            for strata_tuplet in zip(
                *[
                    strata_to_merge[interval * i : interval * (i + 1)]
                    for i in range(tuplet_size)
                ]
            ):
                strata = strata.replace(strata_tuplet, f"stratum_group_{count}")
                count += 1
            remainder = strata_to_merge[tuplet_size * interval :]
            if len(remainder) == 1:
                # process remainder unmatched
                n = n_splits
                strata_counts = strata.value_counts()
                while not (candidates := strata_counts.loc[strata_counts == n]).size:
                    n += 1
                strata = strata.replace(
                    [remainder[0], candidates.sample(n=1, random_state=SEED).index[0]],
                    f"stratum_group_{count}",
                )
                count += 1
            else:
                # self-pair last
                remainder = remainder.tolist()
                while len(remainder) >= 2:
                    strata = strata.replace(
                        (remainder.pop(), remainder.pop()), f"stratum_group_{count}"
                    )
                    count += 1
        else:
            pop_onlys = lambda _: list(onlys.pop(_).values())[0].tolist()
            while len(onlys) >= 2:
                # pop the ends
                shortside = pop_onlys(0)
                longside = pop_onlys(-1)
                # merge until one end empty
                while shortside and longside:
                    strata = strata.replace(
                        (shortside.pop(), longside.pop()), f"stratum_group_{count}"
                    )
                    count += 1
            if onlys:
                # self-pair middle
                remainder = pop_onlys(0)
                while len(remainder) >= 2:
                    strata = strata.replace(
                        (remainder.pop(), remainder.pop()), f"stratum_group_{count}"
                    )
                    count += 1
    return strata


_ = gc_collect()

time: 105 ms
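
The merge passes target strata with fewer members than `n_splits`, presumably because a stratified splitter cannot spread such a stratum across every fold. The sketch below is only a simplified diagnostic in the spirit of `get_onlys`, written with the standard library; the helper name `small_strata` and the toy data are assumptions, and it performs no merging.

```python
from collections import Counter
from typing import Dict, Iterable, List


def small_strata(memberships: Iterable[str], n_splits: int) -> Dict[int, List[str]]:
    """Map each size i < n_splits to the sorted strata that have exactly i members."""
    sizes = Counter(memberships)
    found: Dict[int, List[str]] = {}
    for stratum, size in sizes.items():
        if size < n_splits:
            found.setdefault(size, []).append(stratum)
    return {size: sorted(names) for size, names in sorted(found.items())}


# toy memberships: "a" is large enough for 3 splits, the others are not
toy = ["a"] * 5 + ["b"] * 2 + ["c"] * 2 + ["d"]
too_small = small_strata(toy, n_splits=3)
```

An empty result corresponds to the final log block in the next cell, where every `only i` count is 0 and the merge loop stops.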


In [9]:
STRATA = process_strata(strata, n_splits=N_SPLITS)
STRATA.value_counts()
cross_validator = lambda: StratifiedGroupKFold(
    n_splits=N_SPLITS, shuffle=True, random_state=SEED
).split(X=data, y=STRATA, groups=labels.speaker_id)

merge passes performed: 0
only 1: 53
only 2: 39
only 3: 31
only 4: 17
only 5: 27
only 6: 13
merge passes performed: 1
only 1: 40
only 2: 12
only 3: 14
only 4: 0
only 5: 0
only 6: 0
merge passes performed: 2
only 1: 26
only 2: 0
only 3: 0
only 4: 20
only 5: 0
only 6: 0
merge passes performed: 3
only 1: 6
only 2: 0
only 3: 0
only 4: 0
only 5: 20
only 6: 0
merge passes performed: 4
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 14
only 6: 6
merge passes performed: 5
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 8
only 6: 0
merge passes performed: 6
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 0


MELDmneu1engmedium               2905
MELDfneu1engmedium               2452
esdfhap2engmedium                1750
esdmneu1cmnmedium                1750
esdfsur0engmedium                1750
                                 ... 
LimaCastroScottmamu2___medium       7
stratum_group_56                    7
stratum_group_50                    7
LimaCastroScottfdis0___medium       7
stratum_group_54                    7
Length: 455, dtype: int64

time: 2.46 s


# Hyperparameters

In [10]:
rf_params = lambda: {
    "n_estimators": 342,
    "criterion": "entropy",
    "max_depth": 17,
    "min_samples_split": 9,
    "min_samples_leaf": 4,
    "max_features": "sqrt",
    "max_leaf_nodes": 2207,
    "bootstrap": True,
    "oob_score": True,
    "n_jobs": -1,
    "random_state": SEED,
    "warm_start": False,
    "class_weight": "balanced_subsample",
}

var_smoothing = 0.10556503264086932
gnb_params = lambda: {
    "base_estimator": GaussianNB(var_smoothing=var_smoothing),
    "n_estimators": 23,
    "oob_score": True,
    "warm_start": False,
    "n_jobs": -1,
    "random_state": SEED,
}

calibration_params = lambda: {
    "method": "isotonic",
    "cv": list(cross_validator()),
    "n_jobs": -1,
}

_ = gc_collect()

time: 144 ms


# Models

In [11]:
gnb_data = data.loc[:, gnb_features]
plain_gnb = lambda: GaussianNB(var_smoothing=var_smoothing).fit(gnb_data, y_true)
bagging_gnb = lambda: BaggingClassifier(**gnb_params()).fit(gnb_data, y_true)
rf = lambda: RandomForestClassifier(**rf_params()).fit(data, y_true)
voting = lambda: VotingClassifier(
    estimators=[
        (
            "rf",
            CalibratedClassifierCV(
                base_estimator=RandomForestClassifier(**rf_params()),
                **calibration_params()
            ),
        ),
        (
            "gnb",
            Pipeline(
                steps=[
                    (
                        "select_features",
                        ColumnTransformer(
                            transformers=[("selector", "passthrough", gnb_features)],
                            n_jobs=-1,
                        ),
                    ),
                    (
                        "clf",
                        CalibratedClassifierCV(
                            base_estimator=BaggingClassifier(**gnb_params()),
                            **calibration_params()
                        ),
                    ),
                ]
            ),
        ),
    ],
    voting="soft",
    n_jobs=-1,
).fit(data, y_true)

_ = gc_collect()

time: 125 ms


In [12]:
models = {
    "plain_GNB": plain_gnb(),
    "bagging_GNB": bagging_gnb(),
    "randomforest": rf(),
    "voting_ensemble": voting(),
}

  warn(
  oob_decision_function = predictions / predictions.sum(axis=1)[:, np.newaxis]


time: 12min 52s


In [13]:
for model, estimator in tqdm(models.items()):
    dump(estimator, f"{DATA_OUT_FOLDER}/prototypes/{model}.joblib")

  0%|          | 0/4 [00:00<?, ?it/s]

['./prototypes/plain_GNB.joblib']

['./prototypes/bagging_GNB.joblib']

['./prototypes/randomforest.joblib']

['./prototypes/voting_ensemble.joblib']

time: 2.83 s


# Evaluate

In [14]:
print(models["bagging_GNB"].oob_score_)
print(models["randomforest"].oob_score_)
_ = gc_collect()

0.8737665990409443
0.9298575248985614
time: 131 ms


In [15]:
metrics = ["accuracy", "neg_log_loss", "roc_auc_ovo"]
_ = gc_collect()

time: 106 ms


In [16]:
cross_val_scores: Dict[str, Dict[str, np.ndarray]] = {}  # cross_validate returns per-fold arrays
for model, estimator in tqdm(models.items()):
    if model == "voting_ensemble":
        continue
    cross_val_scores[model] = cross_validate(
        estimator=estimator,
        X=data.loc[:, gnb_features] if "GNB" in model else data,
        y=y_true,
        scoring=metrics,
        groups=labels.speaker_id,
        cv=cross_validator(),
        n_jobs=-1,
        error_score="raise",
    )

  0%|          | 0/4 [00:00<?, ?it/s]

time: 9min 34s


In [17]:
results = {
    "model": [],
    "metric": [],
    "mean_score": [],
    "fit_time": [],
    "score_time": [],
}
for model, score_dict in cross_val_scores.items():
    for metric in metrics:
        results["model"].append(model)
        results["metric"].append(metric)
        results["mean_score"].append(np.mean(score_dict[f"test_{metric}"]))
        results["fit_time"].append(np.mean(score_dict["fit_time"]))
        results["score_time"].append(np.mean(score_dict["score_time"]))
results = pd.DataFrame(results)
results

| | model | metric | mean_score | fit_time | score_time |
|---|---|---|---|---|---|
| 0 | plain_GNB | accuracy | 0.874059 | 0.042411 | 0.044643 |
| 1 | plain_GNB | neg_log_loss | -0.331784 | 0.042411 | 0.044643 |
| 2 | plain_GNB | roc_auc_ovo | 0.970611 | 0.042411 | 0.044643 |
| 3 | bagging_GNB | accuracy | 0.874098 | 0.953775 | 0.17411 |
| 4 | bagging_GNB | neg_log_loss | -0.331794 | 0.953775 | 0.17411 |
| 5 | bagging_GNB | roc_auc_ovo | 0.970609 | 0.953775 | 0.17411 |
| 6 | randomforest | accuracy | 0.925451 | 549.847078 | 2.863276 |
| 7 | randomforest | neg_log_loss | -0.204123 | 549.847078 | 2.863276 |
| 8 | randomforest | roc_auc_ovo | 0.988636 | 549.847078 | 2.863276 |


time: 20 ms


In [18]:
results.reset_index(drop=True).to_csv(f"{DATA_OUT_FOLDER}/7-fold_CV_scores.csv")

time: 10 ms


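I'm having trouble scoring the voting ensemble with `cross_validate` like the others, so we'll do that manually. I think the custom CV splits inside the `CalibratedClassifierCV` steps are causing the issue: they are precomputed index lists for the full dataset, so they likely no longer line up once `cross_validate` hands the ensemble only a training subset. In the manual loop below, the strata and inner splits are rebuilt from each training fold.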

In [19]:
accuracy, auroc, logloss, fit_times, predict_times = [], [], [], [], []
for train_idx, test_idx in cross_validator():
    X_train, train_labels = (
        data.iloc[train_idx, :].reset_index(drop=True),
        labels.iloc[train_idx, :].reset_index(drop=True),
    )
    y_train = train_labels.valence
    strata = train_labels.loc[
        :, ["source", "speaker_gender", "emo", "valence", "lang1", "length"]
    ]
    strata.valence = strata.valence.astype(str)
    strata = strata.swifter.apply("".join, axis=1)
    strata = process_strata(strata, n_splits=N_SPLITS)
    cross_validator_ = lambda: StratifiedGroupKFold(
        n_splits=N_SPLITS, shuffle=True, random_state=SEED
    ).split(X=X_train, y=strata, groups=train_labels.speaker_id)
    model = VotingClassifier(
        estimators=[
            (
                "rf",
                CalibratedClassifierCV(
                    base_estimator=RandomForestClassifier(**rf_params()),
                    method="isotonic",
                    cv=list(cross_validator_()),
                    n_jobs=-1,
                ),
            ),
            (
                "gnb",
                Pipeline(
                    steps=[
                        (
                            "select_features",
                            ColumnTransformer(
                                transformers=[
                                    ("selector", "passthrough", gnb_features)
                                ],
                                n_jobs=-1,
                            ),
                        ),
                        (
                            "clf",
                            CalibratedClassifierCV(
                                base_estimator=BaggingClassifier(**gnb_params()),
                                method="isotonic",
                                cv=list(cross_validator_()),
                                n_jobs=-1,
                            ),
                        ),
                    ]
                ),
            ),
        ],
        voting="soft",
        n_jobs=-1,
    )
    begin = time()
    model.fit(X_train, y_train)
    end = time()
    fit_times.append(end - begin)
    del X_train
    del y_train
    _ = gc_collect()
    X_test, y_test = (
        data.iloc[test_idx, :].reset_index(drop=True),
        labels.iloc[test_idx, :].reset_index(drop=True).valence,
    )
    begin = time()
    y_pred = model.predict(X_test)
    end = time()
    predict_times.append(end - begin)
    accuracy.append(accuracy_score(y_test, y_pred))
    # class probabilities feed both AUROC and log loss
    y_proba = model.predict_proba(X_test)
    auroc.append(roc_auc_score(y_test, y_proba, multi_class="ovo"))
    del model
    del X_test
    _ = gc_collect()
    logloss.append(log_loss(y_test, y_proba))
    del y_test
    del begin
    del end
results = {
    "model": ["Voting_GNB+RF"],
    "accuracy": [np.mean(accuracy)],
    "auroc": [np.mean(auroc)],
    "log_loss": [np.mean(logloss)],
    "predict_time": [np.mean(predict_times)],
    "fit_time": [np.mean(fit_times)],
}
print(results)
pd.DataFrame(results).reset_index(drop=True).to_csv(
    f"{DATA_OUT_FOLDER}/Voting_GNB+RF_results.csv"
)

Pandas Apply: 100%|██████████| 74180/74180 [00:00<00:00, 95787.28it/s] 


merge passes performed: 0
only 1: 51
only 2: 49
only 3: 25
only 4: 25
only 5: 21
only 6: 8
merge passes performed: 1
only 1: 43
only 2: 28
only 3: 0
only 4: 0
only 5: 0
only 6: 0
merge passes performed: 2
only 1: 15
only 2: 0
only 3: 28
only 4: 0
only 5: 0
only 6: 0
merge passes performed: 3
only 1: 0
only 2: 0
only 3: 13
only 4: 15
only 5: 0
only 6: 0
merge passes performed: 4
only 1: 0
only 2: 0
only 3: 0
only 4: 2
only 5: 0
only 6: 0
merge passes performed: 5
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 0


VotingClassifier(estimators=[('rf',
                              CalibratedClassifierCV(base_estimator=RandomForestClassifier(class_weight='balanced_subsample',
                                                                                           criterion='entropy',
                                                                                           max_depth=17,
                                                                                           max_features='sqrt',
                                                                                           max_leaf_nodes=2207,
                                                                                           min_samples_leaf=4,
                                                                                           min_samples_split=9,
                                                                                           n_estimators=342,
                                                                 

Pandas Apply: 100%|██████████| 68212/68212 [00:00<00:00, 83185.71it/s]


merge passes performed: 0
only 1: 51
only 2: 41
only 3: 31
only 4: 20
only 5: 15
only 6: 8
merge passes performed: 1
only 1: 43
only 2: 26
only 3: 11
only 4: 0
only 5: 0
only 6: 0
merge passes performed: 2
only 1: 32
only 2: 0
only 3: 0
only 4: 24
only 5: 0
only 6: 0
merge passes performed: 3
only 1: 8
only 2: 0
only 3: 0
only 4: 0
only 5: 24
only 6: 0
merge passes performed: 4
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 16
only 6: 8
merge passes performed: 5
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 8
only 6: 0
merge passes performed: 6
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 0



Pandas Apply: 100%|██████████| 78765/78765 [00:00<00:00, 89328.42it/s]


merge passes performed: 0
only 1: 53
only 2: 45
only 3: 25
only 4: 18
only 5: 26
only 6: 17
merge passes performed: 1
only 1: 36
only 2: 19
only 3: 7
only 4: 0
only 5: 0
only 6: 0
merge passes performed: 2
only 1: 29
only 2: 1
only 3: 0
only 4: 16
only 5: 0
only 6: 0
merge passes performed: 3
only 1: 13
only 2: 1
only 3: 0
only 4: 0
only 5: 16
only 6: 0
merge passes performed: 4
only 1: 0
only 2: 1
only 3: 0
only 4: 0
only 5: 3
only 6: 13
merge passes performed: 5
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 1
only 6: 12
merge passes performed: 6
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 11
merge passes performed: 7
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 1
merge passes performed: 8
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 0



Pandas Apply: 100%|██████████| 66936/66936 [00:00<00:00, 90731.73it/s]


merge passes performed: 0
only 1: 57
only 2: 38
only 3: 32
only 4: 21
only 5: 20
only 6: 11
merge passes performed: 1
only 1: 46
only 2: 18
only 3: 11
only 4: 0
only 5: 0
only 6: 0
merge passes performed: 2
only 1: 35
only 2: 0
only 3: 0
only 4: 20
only 5: 0
only 6: 0
merge passes performed: 3
only 1: 15
only 2: 0
only 3: 0
only 4: 0
only 5: 20
only 6: 0
merge passes performed: 4
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 5
only 6: 15
merge passes performed: 5
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 10
merge passes performed: 6
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 0



Pandas Apply: 100%|██████████| 78147/78147 [00:00<00:00, 89285.41it/s]


merge passes performed: 0
only 1: 58
only 2: 38
only 3: 26
only 4: 21
only 5: 13
only 6: 16
merge passes performed: 1
only 1: 42
only 2: 25
only 3: 5
only 4: 0
only 5: 0
only 6: 0
merge passes performed: 2
only 1: 37
only 2: 1
only 3: 0
only 4: 17
only 5: 0
only 6: 0
merge passes performed: 3
only 1: 20
only 2: 1
only 3: 0
only 4: 0
only 5: 17
only 6: 0
merge passes performed: 4
only 1: 3
only 2: 1
only 3: 0
only 4: 0
only 5: 0
only 6: 17
merge passes performed: 5
only 1: 0
only 2: 1
only 3: 0
only 4: 0
only 5: 0
only 6: 14
merge passes performed: 6
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 13
merge passes performed: 7
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 1
merge passes performed: 8
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 0



Pandas Apply: 100%|██████████| 77620/77620 [00:00<00:00, 88713.91it/s]


merge passes performed: 0
only 1: 54
only 2: 42
only 3: 28
only 4: 20
only 5: 18
only 6: 14
merge passes performed: 1
only 1: 40
only 2: 24
only 3: 8
only 4: 0
only 5: 0
only 6: 0
merge passes performed: 2
only 1: 32
only 2: 0
only 3: 0
only 4: 20
only 5: 0
only 6: 0
merge passes performed: 3
only 1: 12
only 2: 0
only 3: 0
only 4: 0
only 5: 20
only 6: 0
merge passes performed: 4
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 8
only 6: 12
merge passes performed: 5
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 4
merge passes performed: 6
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 0



Pandas Apply: 100%|██████████| 76652/76652 [00:00<00:00, 81368.82it/s]


merge passes performed: 0
only 1: 58
only 2: 38
only 3: 34
only 4: 23
only 5: 28
only 6: 10
merge passes performed: 1
only 1: 48
only 2: 10
only 3: 11
only 4: 0
only 5: 0
only 6: 0
merge passes performed: 2
only 1: 37
only 2: 0
only 3: 0
only 4: 16
only 5: 0
only 6: 0
merge passes performed: 3
only 1: 21
only 2: 0
only 3: 0
only 4: 0
only 5: 16
only 6: 0
merge passes performed: 4
only 1: 5
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 16
merge passes performed: 5
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 11
merge passes performed: 6
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 1
merge passes performed: 7
only 1: 0
only 2: 0
only 3: 0
only 4: 0
only 5: 0
only 6: 0





{'model': ['Voting_GNB+RF'], 'accuracy': [0.9141433350896476], 'auroc': [0.9837112542548955], 'log_loss': [0.23075349167485068], 'predict_time': [7.645049776349749], 'fit_time': [534.9342562811715]}
time: 1h 4min 37s
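
For readers unfamiliar with soft voting, the sketch below shows the arithmetic on made-up numbers: with equal weights, the ensemble averages the class probabilities of its members and predicts the argmax. The two probability arrays and the toy labels are assumptions for illustration, not outputs of the models above.

```python
import numpy as np

# made-up class probabilities from two calibrated models for three samples
proba_rf = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.3, 0.5]])
proba_gnb = np.array([[0.5, 0.4, 0.1], [0.3, 0.3, 0.4], [0.1, 0.2, 0.7]])

# soft voting with equal weights: average the probabilities, then take the argmax
proba_vote = np.mean([proba_rf, proba_gnb], axis=0)
y_pred_toy = proba_vote.argmax(axis=1)

y_true_toy = np.array([0, 1, 2])
accuracy_toy = float(np.mean(y_pred_toy == y_true_toy))

# multiclass log loss: mean negative log probability assigned to the true class
true_class_proba = proba_vote[np.arange(len(y_true_toy)), y_true_toy]
log_loss_toy = float(-np.mean(np.log(true_class_proba)))
```

Averaging probabilities only makes sense when both members produce comparable probability estimates, which is one reason to calibrate each member before voting.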


# Discussion

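These scores are encouraging: in 7-fold cross-validation, the random forest reached about 0.925 accuracy and 0.989 AUROC, and the voting ensemble about 0.914 accuracy and 0.984 AUROC. I can't wait to try these models on the holdout data.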

Do note that the final hyperparameter tuning used the whole dataset (with initial guidance from nested cross-validation). That is, the tuning data and evaluation data are not disjoint. Although the cross-validator keeps clean splits with no speaker leakage, these models may be overfit, especially given the exposure to test data during feature extraction and scaling.
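
One common way to avoid the scaling part of that exposure is to put the scaler inside a `Pipeline`, so it is refit on the training portion of every fold. The sketch below assumes a `StandardScaler` and synthetic data purely for illustration; the project's actual scaling happens in an earlier notebook and may differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy data standing in for unscaled features (assumption)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# the scaler is fit only on each training fold, never on the held-out fold
fold_safe_model = Pipeline(
    steps=[("scale", StandardScaler()), ("clf", GaussianNB())]
)
fold_safe_scores = cross_validate(
    fold_safe_model, X_toy, y_toy, cv=5, scoring="accuracy"
)
```

The same idea extends to any fitted preprocessing step, although feature extraction that is expensive to repeat per fold may make this impractical.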

Before the modified feature engineering pipeline (dropping rho, omitting redundant scaling, etc.), GaussianNB may have been the most performant classifier. But now, the data may be more suited for RandomForestClassifier.

In [20]:
print(f"Time elapsed since notebook_begin_time: {time() - notebook_begin_time} s")
_ = gc_collect()

Time elapsed since notebook_begin_time: 5247.077162265778 s
time: 136 ms


[^top](#Contents)