# Cross validation, imputation, and feature engineering

Below sets up our examples. We will be learning the following:

1. For cross validation:
    - `StratifiedKFold` for classification or  `KFold` for numeric prediction for set up
    - `cross_val_score` to run the validation.
2. For imputation:
    - `SimpleImputer` for simple imputations
    - `KNNImputer` for a model-based imputation of numeric variables
3. For feature engineering:
    - `StandardScaler` for standardization
    - `MinMaxScaler` for minmax normalization
    - `FunctionTransformer` for custom functions applied to columns
4. We'll learn to set up pipelines, which will simplify our code, rather than having to manually apply each step to the training and test sets.
    - `Pipeline` for pipeline steps
    - `ColumnTransformer` to pull our pipeline steps together into a single function to call on each data set.

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# standard modeling
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# for cross validation
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imputation and feature engineering
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# For pipelines, make feature engineering and imputation more straightforward
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


In [2]:
penguins = sns.load_dataset("penguins")  # built into seaborn

penguins.head()
penguins.info()
penguins.isna().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

You should see (approx):

* sex: 11 missing
* bill_length_mm: 2 missing
* bill_depth_mm: 2 missing
* flipper_length_mm: 2 missing
* body_mass_g: 2 missing

For this module, let’s:

Predict species (multiclass classification)

Use numeric features + sex as predictors

In [4]:
features = [
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g",
    "sex"
]
target = "species"

# Drop rows where target is missing (should be none, but be safe)
penguins = penguins.dropna(subset=[target]).copy()

# establish matrix of predictors and target variable
X = penguins[features]
y = (penguins[target] == "Chinstrap").astype(int) # binary classification, chinstrap vs. all


# Simple K-Fold Cross Validation

Done only with complete cases and numeric features to  keep things simple for now.

In [5]:
# Use only numeric features to start
num_features = [
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g"
]

X_num = penguins[num_features]

# Drop rows with numeric missing to keep it simple for this first demo
non_na_indicator = ~X_num.isna().any(axis=1) # flags rows without missing values

# Breakdown of above:
# ~ means "not" (reverses True/False of booleans, below)
# X_num.isna() returns a matrix of True/False where X_num has/has not na values
# .any(axis=1) means any True value row-wise, converting to a vector

X_num_complete = X_num[non_na_indicator]

y_complete = y[non_na_indicator]

# Define model object
log_reg = LogisticRegression(
    max_iter=1000
)

# 5-fold Stratified CV (preserve class proportions)
cv = StratifiedKFold( # sets up CV object
    n_splits=5, # how many folds?
    shuffle=True, # True means stratify by the class variable
    random_state=42
    )

# Note: if we were doing numeric prediction, e.g., predicting body_mass_g
# as we did in earlier weeks, we would use the function KFold instead of
# StratifiedKFold. Its arguments are the same, but knowing which you need
# will prevent errors when calling cross_val_score, below.

scores = cross_val_score( # actual sampling and training done here
    log_reg, # defined model object
    X_num_complete, # x variable
    y_complete, # y variable (used for stratification)
    cv=cv, # cv object
    scoring="accuracy"
)

print("CV accuracy scores:", scores)
print("Mean accuracy:", scores.mean())


CV accuracy scores: [0.97101449 0.98550725 0.97058824 1.         0.98529412]
Mean accuracy: 0.9824808184143222


# Standardization & normalization

Now we will:
* Fit the scaler on train only
* Transform both train and test

As a reminder, we need to fit the scaler on the train only. (Think of it as a little model.) We do this because the mean and standard deviation (or min and max for normalization) are pulled from the data. If we do this on the whole dataset, we have information from the test set leaking into training, meaning our estimate of out-of-sample performance would be overly optimistic.

This problem *does* affect our evaluation using K-fold cross validation. But we are going to accept it because it's still a useful comparison between models, even if it's not a fully fair evaluation of out-of-sample performance.

One could get more complicated and put the feature engineering step *inside* of a cross validation loop. But I leave that to you later when your Python programming is stronger.


In [6]:
### Standardization ----

# Do train/test split before feature engineering

X_train, X_test, y_train, y_test = train_test_split(
    X_num_complete,
    y_complete,
    test_size=0.3,
    stratify=y_complete,
    random_state=42
)

# Standardization: (x - mean) / std
scaler = StandardScaler()

X_train_std = scaler.fit_transform(X_train)   # fit on train
X_test_std  = scaler.transform(X_test)       # transform test

# fit a model
log_reg = LogisticRegression(
    solver="lbfgs",
    max_iter=1000
)

log_reg.fit(X_train_std, y_train)

# get predictions and test
y_pred = log_reg.predict(X_test_std)

print("Test AUC (standardized):", roc_auc_score(y_test, y_pred))


Test AUC (standardized): 0.875


In [7]:
### Normalization ----
minmax = MinMaxScaler()

X_train_mm = minmax.fit_transform(X_train)
X_test_mm  = minmax.transform(X_test)

# fit a model
log_reg_mm = LogisticRegression(
    solver="lbfgs",
    max_iter=1000
)

log_reg_mm.fit(X_train_mm, y_train)

# get predictions and test
y_pred_mm = log_reg_mm.predict(X_test_mm)

print("Test AUC (min–max):", roc_auc_score(y_test, y_pred_mm))


Test AUC (min–max): 0.7


# Simple imputation (mean/median + most_frequent)

Now we bring back the missing values and show imputation.

Simple imputation replaces missing values with either the mean or median of the non-missing values of numeric variables or it replaces missing values with the most frequent category for categorical variables.

Please note: imputing missing values is not without hazzard. It can alter the distribution of your data and, again, cause a model to fail spectacularly on new data.

Like standardization and normalization, we can think of imputation as a mini model, as it requires statistics from the training set to implement. So by "training" an imputer on training data and then applying it to the test set for prediction, we get a fair evaluation of our modeling *process* on held out data.

There are more elaborate packages and methods for imputation. But this is good for a first course. I leave the rest to your Googling skills. :)

In [8]:
### Numeric-only: median imputation + standardization ----

# train test split FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,
    random_state=42
)

# define imputer and scaler
num_imputer = SimpleImputer(strategy="median")

scaler = StandardScaler()

# Fit imputer and scaler on training data only
X_train[num_features] = num_imputer.fit_transform(X_train[num_features])
X_train[num_features] = scaler.fit_transform(X_train[num_features])

# Now apply to test data
X_test[num_features]  = num_imputer.transform(X_test[num_features])
X_test[num_features]  = scaler.transform(X_test[num_features])

# train a model
log_reg = LogisticRegression(
    solver="lbfgs",
    max_iter=1000
)

log_reg.fit(X_train[num_features], y_train)

# Predict and eval on test set
y_pred = log_reg.predict(X_test[num_features])

print("Test AUC (median imputation + standardization):", roc_auc_score(y_test, y_pred))


Test AUC (median imputation + standardization): 0.8809523809523809


In [9]:
X

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,39.1,18.7,181.0,3750.0,Male
1,39.5,17.4,186.0,3800.0,Female
2,40.3,18.0,195.0,3250.0,Female
3,,,,,
4,36.7,19.3,193.0,3450.0,Female
...,...,...,...,...,...
339,,,,,
340,46.8,14.3,215.0,4850.0,Female
341,50.4,15.7,222.0,5750.0,Male
342,45.2,14.8,212.0,5200.0,Female


In [10]:
### Categorical imputation (for sex) and including it as a feature ----

# Now include sex as a categorical feature and impute its missing values with the most frequent category.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,
    random_state=42
)

# Define Imputers
num_imputer = SimpleImputer(strategy="median")

cat_imputer = SimpleImputer(strategy="most_frequent")


# Train imputers and apply to training set only
# Numeric columns
X_train[num_features] = num_imputer.fit_transform(X_train[num_features])

# Categorical columns
# Note: double brackets required because a single column gets converted to
# a series, not a one-column data frame. The double brackets enforce the
# data frame structure.
X_train[['sex']] = cat_imputer.fit_transform(X_train[['sex']])

# Apply to test set
X_test[num_features]  = num_imputer.transform(X_test[num_features])
X_test[['sex']]  = cat_imputer.transform(X_test[['sex']])


# Note: Sex is still a string {'male', 'female'}
# We need to convert it to 0/1 to use it for training
X_train



Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
10,37.8,17.1,186.0,3300.0,Female
288,43.5,14.2,220.0,4700.0,Female
67,41.1,19.1,188.0,4100.0,Male
280,45.3,13.8,208.0,4200.0,Female
301,52.5,15.6,221.0,5450.0,Male
...,...,...,...,...,...
193,46.2,17.5,187.0,3650.0,Female
332,43.5,15.2,213.0,4650.0,Female
60,35.7,16.9,185.0,3150.0,Female
147,36.6,18.4,184.0,3475.0,Female


In [11]:
# One-hot encode sex
X_train['sex'] = (X_train['sex'] == "male").astype(int)
X_test['sex']  = (X_test['sex'] == "male").astype(int)

X_train

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
10,37.8,17.1,186.0,3300.0,0
288,43.5,14.2,220.0,4700.0,0
67,41.1,19.1,188.0,4100.0,0
280,45.3,13.8,208.0,4200.0,0
301,52.5,15.6,221.0,5450.0,0
...,...,...,...,...,...
193,46.2,17.5,187.0,3650.0,0
332,43.5,15.2,213.0,4650.0,0
60,35.7,16.9,185.0,3150.0,0
147,36.6,18.4,184.0,3475.0,0


In [12]:
# Let's also scale numeric columns
scaler = StandardScaler()
X_train[num_features] = scaler.fit_transform(X_train[num_features])
X_test[num_features]  = scaler.transform(X_test[num_features])

# Train a classifier
log_reg = LogisticRegression(
    max_iter=1000
)

log_reg.fit(X_train, y_train)

# Get predictions and evaluate on test set
y_pred = log_reg.predict(X_test)

print("Test AUC (numeric + sex, imputed):", roc_auc_score(y_test, y_pred))

Test AUC (numeric + sex, imputed): 0.8809523809523809


# Pipelines: impute → scale → encode → model

Without pipelines, on a sufficiently large and complex data set, you'll end up with a lot to manage. Imagine if you wanted to normalize some columns, standardize others, and leave others alone.

Pipelines make managing this complexity easier by wrapping all of our data transformation steps into a single function, trained on training data, that can be applied to both train and test sets.

In [13]:

# Define a pipeline step for numeric variables,
# doing both standardization and imputation at once
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Do the same for categorical variables

# First, define a function to encode a categorical variable to binary
def encode_sex(X):
    # X is a 2D array of shape (n_samples, 1)
    # Compare to "male" elementwise, get True/False, cast to int (1/0)
    return (X == "male").astype(int)

# now, define the pipeline step with imputation and encoding
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encode_sex", FunctionTransformer(encode_sex))
])

# Now, define our single function that can be applied at once
# for numeric and categorical variables
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, ['sex'])
    ]
)

clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(
        max_iter=1000
    ))
])

# K-fold CV with *no leakage* – all preprocessing inside each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    clf,
    X,
    y,
    cv=cv,
    scoring="accuracy"
)

print("Pipeline CV accuracy scores:", scores)
print("Mean accuracy:", scores.mean())


Pipeline CV accuracy scores: [0.95652174 0.97101449 0.97101449 0.98550725 0.97058824]
Mean accuracy: 0.970929241261722


# Model-based imputation (rough & ready): KNNImputer
So far, we only imputed median (or mean) values for numeric columns. We can be smarter. `KNNImputer` uses the K-nearest-neighbor model (a very simple predictive model based on nearby points) to predict what a missing value should be based on the values of other non-missing predictors for each observation. This should make us more confident in imputed values for numeric variables.

(One could, in principle, do the same for categorical variables. But we'd need to use a different Python package and introduce too much complexity for now. So, exercise left to your Google-fu once more.)

In [14]:
# train test split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    stratify=y,
    random_state=42
)


# define our knn imputer and apply it to numeric columns
imputer_knn = KNNImputer(n_neighbors=5)

X_train[num_features] = imputer_knn.fit_transform(X_train[num_features])
X_test[num_features] = imputer_knn.transform(X_test[num_features])

# standardize variables as well
scaler = StandardScaler()
X_train[num_features] = scaler.fit_transform(X_train[num_features])
X_test[num_features]  = scaler.transform(X_test[num_features])

# train model
log_reg = LogisticRegression(
    max_iter=1000
)

log_reg.fit(X_train[num_features], y_train)

# predict and evaluate on test set
y_pred = log_reg.predict(X_test[num_features])

print("Test AUC with KNN imputation:", roc_auc_score(y_test, y_pred))


Test AUC with KNN imputation: 0.8809523809523809


If you want it inside a pipeline:

In [15]:
knn_pipeline = Pipeline(steps=[
    ("imputer", KNNImputer(n_neighbors=5)),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000))
])

scores_knn = cross_val_score(
    knn_pipeline,
    X_train[num_features],
    y_train,
    cv=cv,
    scoring="accuracy"
)
print("KNN imputation + pipeline CV accuracy:", scores_knn.mean())


KNN imputation + pipeline CV accuracy: 0.9833333333333332
