I recently competed in my first [Kaggle](https://www.kaggle.com/) competition.
I didn't place well (top 60% isn't anything to brag about), but I had fun.
To get some ideas on what other competitors were doing, I'd occasionally take a peek at their notebooks.
The imports would be simple enough, but when I got to the actual code, it felt... smelly.
I don't mean smelly in the sense that the notebook couldn't run from top to bottom without raising an error, or that it was illegible
(though some comments and docstrings wouldn't hurt).
I mean smelly as in the wheel was reinvented in several places when
[`scikit-learn`](https://scikit-learn.org/stable/index.html) already provides a solution.

# An example

This [notebook](https://www.kaggle.com/code/aspillai/11th-place-solution-single-model-cv-cat)
took [11th place](https://www.kaggle.com/competitions/playground-series-s4e1/leaderboard)
in the [Binary Classification with a Bank Customer Churn Dataset competition](https://www.kaggle.com/competitions/playground-series-s4e1/overview).

> **Note:**
> 
> For brevity, I will only display the necessary code.

In [1]:
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

train = pd.read_csv("playground-series-s4e1/train.csv", index_col="id")
test = pd.read_csv("playground-series-s4e1/test.csv", index_col="id")

#### Our first smell: Scaling features one at a time

In [2]:
scale_cols = ["Age", "CreditScore", "Balance", "EstimatedSalary"]
for c in scale_cols:
    min_value = train[c].min()
    max_value = train[c].max()
    train[f"{c}_scaled"] = (train[c] - min_value) / (max_value - min_value)
    test[f"{c}_scaled"] = (test[c] - min_value) / (max_value - min_value)

The type of scaling performed here maps each feature's training values to the range $[0, 1]$.
It's a common enough technique that `scikit-learn` offers it as
[`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), whose default `feature_range` is `(0, 1)`.

In [3]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train[scale_cols])
test_scaled = scaler.transform(test[scale_cols])

We can see that the values produced are the same (within a tolerance).

In [4]:
import numpy as np

np.allclose(
    a=train[[f"{c}_scaled" for c in scale_cols]],
    b=train_scaled,
)

True
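The check above only covers the training split. As a minimal sketch on made-up data (the `toy_*` frames and their values are my own stand-ins, not the competition data), the same comparison can be made for the test split. It also shows a detail both approaches share: because the minimum and maximum come from the training data, test values can land outside $[0, 1]$.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data standing in for the competition's train/test splits.
toy_train = pd.DataFrame({"Age": [20.0, 40.0, 60.0], "Balance": [0.0, 50.0, 100.0]})
toy_test = pd.DataFrame({"Age": [30.0, 80.0], "Balance": [25.0, 150.0]})
cols = ["Age", "Balance"]

# Manual scaling, as in the original loop: training statistics applied to the test split.
col_min = toy_train[cols].min()
col_max = toy_train[cols].max()
manual_test = (toy_test[cols] - col_min) / (col_max - col_min)

# MinMaxScaler: fit on the training split only, then transform the test split.
toy_scaler = MinMaxScaler().fit(toy_train[cols])
sk_test = np.asarray(toy_scaler.transform(toy_test[cols]))

print(np.allclose(manual_test, sk_test))  # both approaches agree on the test split
print(sk_test.max())                      # larger than 1: test values exceed the training maximum
```

If out-of-range test values are a problem for a downstream model, `MinMaxScaler` also accepts `clip=True`, which the manual loop would need extra code to replicate.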

The values in `train_scaled` may be the *same*, but `train_scaled` itself is not a dataframe.

In [5]:
type(train_scaled)

numpy.ndarray

True, but in
[`scikit-learn` 1.2](https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_2_0.html)
the developers introduced the
[`set_output`](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html)
API, allowing us to convert the output from the default
[`numpy.ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)
to a
[`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

In [6]:
scaler = MinMaxScaler().set_output(transform="pandas")  # Set output container for `scaler`.
scaler.fit_transform(train[scale_cols]).head()

| id | Age | CreditScore | Balance | EstimatedSalary |
|---:|---:|---:|---:|---:|
| 0 | 0.202703 | 0.636 | 0.0 | 0.907279 |
| 1 | 0.202703 | 0.554 | 0.0 | 0.247483 |
| 2 | 0.297297 | 0.656 | 0.0 | 0.924364 |
| 3 | 0.216216 | 0.462 | 0.593398 | 0.422787 |
| 4 | 0.202703 | 0.732 | 0.0 | 0.075293 |


In fact, we could set the output of all `scikit-learn` transformers to a `pandas.DataFrame` globally.

In [7]:
from sklearn import set_config

set_config(transform_output="pandas")  # Set global scikit-learn configuration.
scaler = MinMaxScaler()
scaler.fit_transform(train[scale_cols]).head()

| id | Age | CreditScore | Balance | EstimatedSalary |
|---:|---:|---:|---:|---:|
| 0 | 0.202703 | 0.636 | 0.0 | 0.907279 |
| 1 | 0.202703 | 0.554 | 0.0 | 0.247483 |
| 2 | 0.297297 | 0.656 | 0.0 | 0.924364 |
| 3 | 0.216216 | 0.462 | 0.593398 | 0.422787 |
| 4 | 0.202703 | 0.732 | 0.0 | 0.075293 |
