# Cleaning and Preprocessing

## 1. Notebook overview

This notebook prepares the dataset for modeling by defining and exporting a reusable preprocessing pipeline.

Specifically, it:

- Applies custom feature engineering using a `FeatureEngineer` transformer.
- Encodes ordinal and nominal categorical variables with `OrdinalEncoder` and `OneHotEncoder`.
- Scales selected continuous features using `StandardScaler`.
- Bundles all preprocessing steps into a scikit-learn `Pipeline`.
- Serializes the complete pipeline using `joblib` for reuse in `03_modeling.ipynb`.

No modeling or data splitting occurs in this notebook. The pipeline is defined and saved unfitted; `03_modeling.ipynb` loads it, then fits and applies it to the data during training and evaluation.

This is the second notebook in the workflow, following `01_eda.ipynb`, and it feeds directly into `03_modeling.ipynb`.

In [None]:
# Imports for preprocessing
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

import joblib

# Set options for display
pd.set_option('display.max_columns', None)

print("Preprocessing environment initialized.")

Preprocessing environment initialized.


## 2. Reload data and confirm schema

We load the cleaned dataset exported from the EDA notebook (`data_01.csv`) and verify its structure before proceeding with preprocessing.

This step ensures:
- The dataset was saved correctly.
- The schema matches expectations (column names, data types).
- There are no unexpected missing values or type mismatches introduced during export.

We also validate that the dataset includes exactly the expected columns — no more, no less. This prevents issues downstream if column names are altered, dropped, or duplicated.

The expected schema includes only meaningful, cleaned features after removing non-informative columns in the EDA phase.

In [2]:
# Load cleaned data
df = pd.read_csv('../data/processed/data_01.csv')

# Define expected column names after EDA cleanup
expected_columns = [
    'Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
    'DistanceFromHome', 'Education', 'EducationField', 'EnvironmentSatisfaction',
    'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
    'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
    'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
    'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears',
    'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
    'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager'
]

# Get actual columns from the loaded DataFrame
actual_columns = list(df.columns)

# Compare against expected
missing_columns = set(expected_columns) - set(actual_columns)
unexpected_columns = set(actual_columns) - set(expected_columns)

# Display results
if not missing_columns and not unexpected_columns:
    print("Column schema validation passed.")
else:
    if missing_columns:
        print("Missing columns:", missing_columns)
    if unexpected_columns:
        print("Unexpected columns:", unexpected_columns)

Column schema validation passed.
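
The set comparison above covers column names only. As a minimal sketch of the data-type part of the check, the snippet below tests whether columns expected to be integers were read back as integers and looks for duplicated column labels. The toy frame and the `integer_columns` list are illustrative assumptions, not the notebook's actual validation.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Illustrative subset of columns that should be read back as integers (assumption)
integer_columns = ['Age', 'MonthlyIncome']

# Toy frame standing in for the loaded dataset
toy_df = pd.DataFrame({
    'Age': [41, 49],
    'Attrition': ['Yes', 'No'],
    'MonthlyIncome': [5993.0, 5130.0],  # deliberately float to trigger a mismatch
})

# Columns whose dtype is not an integer type
non_integer = [col for col in integer_columns if not is_integer_dtype(toy_df[col])]

# Column labels that appear more than once
duplicated = toy_df.columns[toy_df.columns.duplicated()].tolist()

print("Non-integer columns:", non_integer)
print("Duplicated columns:", duplicated)
```

A check along these lines could run right after the column-name comparison, so that a type change introduced during export is caught before any transformer depends on it.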


## 3. Cleaning checks

We perform additional cleaning checks on the dataset before continuing preprocessing.

This includes:
- Verifying data types are appropriate.
- Checking for unexpected nulls (none should exist).
- Ensuring no constant or identifier columns remain.
- Confirming target class distribution.

In [3]:
# Check for nulls (none expected)
null_counts = df.isnull().sum()
if null_counts.any():
    print("Unexpected null values found:")
    display(null_counts[null_counts > 0])
else:
    print("No null values found.")

# Recheck data types
print("\nData types:")
display(df.dtypes)

# Identify constant columns (only one unique value)
nunique = df.nunique()
constant_cols = nunique[nunique == 1].index.tolist()

if constant_cols:
    print(f"Constant columns detected and dropped: {constant_cols}")
    df.drop(columns=constant_cols, inplace=True)
    print(f"New shape after dropping: {df.shape}")
else:
    print("No constant columns detected.")

# Confirm target variable distribution
print("\nClass balance in 'Attrition':")
display(df['Attrition'].value_counts(normalize=True).round(3))

No null values found.

Data types:


Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole  

No constant columns detected.

Class balance in 'Attrition':


Attrition
No     0.839
Yes    0.161
Name: proportion, dtype: float64

## 4. Feature engineering pipeline

### Feature engineering transformer

This step defines a custom `FeatureEngineer` class that inherits from `BaseEstimator` and `TransformerMixin`, allowing integration into scikit-learn pipelines. The `transform()` method generates several new features based on domain logic.

The operations performed include:

Tenure-based features:
- `TenureCategory`: Categorizes `YearsAtCompany` into four ordinal bins.
- `YearsAtCompanyRatio`: Ratio of years at company to total working years.
- `YearsInRoleRatio`: Ratio of years in current role to years at company.
- `YearsWithManagerRatio`: Ratio of years with current manager to years at company.
- `IncomePerYear`: Monthly income divided by years at company.

Work history flags:
- `ZeroCompanyTenureFlag`: Flags employees with 0 years at company.
- `NoWorkHistoryFlag`: Flags employees with 0 years at company and 0 total working years.

Job role grouping:
- `JobRoleGroup`: Maps specific job titles into broader role categories such as technical and sales.

Satisfaction features:
- `SatisfactionAvg`: Average of environment, job, and relationship satisfaction.
- `SatisfactionRange`: Difference between max and min satisfaction values.
- `LowSatisfactionFlag`: Flags employees with any satisfaction metric less than or equal to 2.

Interaction features:
- `OverTime_JobLevel`: Combines overtime status and job level.
- `Travel_Occupation`: Combines business travel and job role.

Composite risk flag:
- `StressRisk`: Flags employees who work overtime, have at least one low satisfaction score, and have a satisfaction average below 2.5.

The `transform()` method returns a copy of the input DataFrame with the new features added.

In [None]:
class FeatureEngineer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        df = X.copy()

        df['TenureCategory'] = pd.cut(
            df['YearsAtCompany'],
            bins=[-1, 2, 5, 10, np.inf],
            labels=['<3 yrs', '3–5 yrs', '6–10 yrs', '10+ yrs']
        )

        df['YearsAtCompanyRatio'] = df['YearsAtCompany'] / df['TotalWorkingYears'].replace(0, np.nan)
        df['YearsInRoleRatio'] = df['YearsInCurrentRole'] / df['YearsAtCompany'].replace(0, np.nan)
        df['YearsWithManagerRatio'] = df['YearsWithCurrManager'] / df['YearsAtCompany'].replace(0, np.nan)
        df['IncomePerYear'] = df['MonthlyIncome'] / df['YearsAtCompany'].replace(0, np.nan)

        df['ZeroCompanyTenureFlag'] = (df['YearsAtCompany'] == 0).astype(int)
        df['NoWorkHistoryFlag'] = ((df['YearsAtCompany'] == 0) & (df['TotalWorkingYears'] == 0)).astype(int)

        df['JobRoleGroup'] = df['JobRole'].replace({
            'Laboratory Technician': 'Technical',
            'Research Scientist': 'Technical',
            'Healthcare Representative': 'Sales',
            'Sales Executive': 'Sales',
            'Sales Representative': 'Sales'
        })

        df['SatisfactionAvg'] = df[[
            'EnvironmentSatisfaction', 'JobSatisfaction', 'RelationshipSatisfaction'
        ]].mean(axis=1)

        df['SatisfactionRange'] = (
            df[['EnvironmentSatisfaction', 'JobSatisfaction', 'RelationshipSatisfaction']].max(axis=1) -
            df[['EnvironmentSatisfaction', 'JobSatisfaction', 'RelationshipSatisfaction']].min(axis=1)
        )

        df['LowSatisfactionFlag'] = (
            (df['EnvironmentSatisfaction'] <= 2) |
            (df['JobSatisfaction'] <= 2) |
            (df['RelationshipSatisfaction'] <= 2)
        ).astype(int)

        df['OverTime_JobLevel'] = df['OverTime'].astype(str) + "_" + df['JobLevel'].astype(str)
        df['Travel_Occupation'] = df['BusinessTravel'].astype(str) + "_" + df['JobRole'].astype(str)

        df['StressRisk'] = (
            (df['OverTime'] == 'Yes') &
            (df['LowSatisfactionFlag'] == 1) &
            (df['SatisfactionAvg'] < 2.5)
        ).astype(int)

        return df
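
To make the binning and the zero-safe division concrete, here is a small sketch on a toy frame. It reuses the same `pd.cut` bins and the same `.replace(0, np.nan)` pattern as the transformer; the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data chosen to hit every bin edge
toy = pd.DataFrame({
    'YearsAtCompany': [0, 2, 3, 10, 11],
    'YearsInCurrentRole': [0, 1, 3, 5, 11],
})

# Bins are right-inclusive: (-1, 2], (2, 5], (5, 10], (10, inf]
tenure = pd.cut(
    toy['YearsAtCompany'],
    bins=[-1, 2, 5, 10, np.inf],
    labels=['<3 yrs', '3–5 yrs', '6–10 yrs', '10+ yrs'],
)

# Replacing 0 with NaN avoids division by zero; the ratio is NaN for zero tenure
ratio = toy['YearsInCurrentRole'] / toy['YearsAtCompany'].replace(0, np.nan)

print(pd.DataFrame({'tenure': tenure, 'ratio': ratio}))
```

Two details follow from this. An employee with exactly 10 years lands in `6–10 yrs`, and only 11 or more years map to `10+ yrs`. Rows with zero tenure get `NaN` ratios; `StandardScaler` keeps `NaN` values in its output, so a downstream estimator that does not accept missing values would need an imputation step.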

### ColumnTransformer: preprocessing steps

This step defines a preprocessing pipeline using `ColumnTransformer` to handle different types of features appropriately:

- `ordinal_col`: Encodes the `TenureCategory` feature with an explicit order: less than 3 years, 3–5 years, 6–10 years, and more than 10 years.
- `nominal_cols`: Specifies all nominal (unordered categorical) features to be one-hot encoded.
- `scale_cols`: Lists all continuous numerical features that should be standardized using `StandardScaler`.

The `ColumnTransformer` applies:
- `OrdinalEncoder` to ordinal features using the defined category order.
- `OneHotEncoder` to nominal features, dropping the first category to avoid multicollinearity and ignoring unknown categories.
- `StandardScaler` to continuous features to normalize them to zero mean and unit variance.

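All columns not listed above are passed through unchanged via `remainder='passthrough'`. This covers the engineered binary flags, but also the remaining original columns such as `DailyRate` and `Education`. Note that `Gender` and `OverTime` are not in `nominal_cols`, so they pass through as strings, as does `Attrition` if the target has not been separated first.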

In [None]:


ordinal_col = ['TenureCategory']
ordinal_map = [['<3 yrs', '3–5 yrs', '6–10 yrs', '10+ yrs']]

nominal_cols = [
    'BusinessTravel', 'Department', 'EducationField', 'JobRole',
    'MaritalStatus', 'JobRoleGroup', 'Travel_Occupation', 'OverTime_JobLevel'
]

# Manually chosen continuous variables to scale
scale_cols = [
    'Age', 'DistanceFromHome', 'MonthlyIncome', 'YearsAtCompany',
    'YearsInCurrentRole', 'YearsWithCurrManager', 'YearsSinceLastPromotion',
    'TotalWorkingYears', 'YearsInRoleRatio', 'YearsWithManagerRatio',
    'YearsAtCompanyRatio', 'IncomePerYear', 'SatisfactionAvg', 'SatisfactionRange'
]

preprocessor = ColumnTransformer([
    ('ordinal', OrdinalEncoder(categories=ordinal_map), ordinal_col),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), nominal_cols),
    ('scale', StandardScaler(), scale_cols)
], remainder='passthrough')

### Preprocessing pipeline

This pipeline combines all preprocessing steps into a single object using `sklearn.pipeline.Pipeline`.

It consists of two stages:
1. `features`: Applies the custom `FeatureEngineer` transformer, which performs domain-specific feature engineering such as generating tenure ratios, satisfaction flags, and interaction features.
2. `preprocessing`: Applies the `ColumnTransformer` defined earlier, which encodes categorical variables and scales numeric features.

This modular structure ensures all transformations are applied consistently and reproducibly, enabling clean integration with downstream modeling and tuning workflows.

In [None]:


preprocessing_pipeline = Pipeline([
    ('features', FeatureEngineer()),
    ('preprocessing', preprocessor)
])

### Exporting the preprocessing pipeline

The complete preprocessing pipeline is serialized using `joblib` and saved to disk as a `.pkl` file.

This allows the exact same transformation steps to be reused in other notebooks (such as `03_modeling.ipynb`) without redefining the feature engineering, encoding, or scaling logic. This is critical for keeping preprocessing consistent between training and testing.

Saved to: `../models/preprocessing_pipeline.pkl`


In [None]:
joblib.dump(preprocessing_pipeline, '../models/preprocessing_pipeline.pkl')
print("Preprocessing pipeline saved.")

Preprocessing pipeline saved.
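
A minimal sketch of the intended round trip is shown below: dump an unfitted pipeline, load it back, fit it on the training split only, and reuse the fitted object on the test split. A one-step `StandardScaler` pipeline, a temporary directory, and synthetic data stand in for the real pipeline, path, and dataset; all of these are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the real preprocessing pipeline (assumption for this sketch)
demo_pipeline = Pipeline([('scale', StandardScaler())])

# Save and reload the unfitted pipeline, as the modeling notebook would
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_pipeline.pkl')
    joblib.dump(demo_pipeline, path)
    loaded = joblib.load(path)

# Synthetic feature matrix: 10 rows, 2 columns
X = np.arange(20, dtype=float).reshape(10, 2)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

# Fit on the training split only, then apply the fitted steps to the test split
X_train_t = loaded.fit_transform(X_train)
X_test_t = loaded.transform(X_test)

print(X_train_t.shape, X_test_t.shape)
```

One caveat for the real pipeline: `FeatureEngineer` is defined inside this notebook, and pickled objects store a reference to their class rather than its code. The loading notebook would therefore most likely need the same class definition available (defined or imported) before calling `joblib.load`.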


## 5. Preprocessing summary and next steps

This notebook implemented the full preprocessing pipeline for the attrition dataset, including:

- Custom feature engineering via a `FeatureEngineer` transformer
- Encoding of ordinal and nominal categorical features using `OrdinalEncoder` and `OneHotEncoder`
- Standardization of selected continuous variables using `StandardScaler`
- Integration of all steps into a scikit-learn `Pipeline` for modular reuse
- Exporting the complete pipeline to `../models/preprocessing_pipeline.pkl` using `joblib`

The exported pipeline captures every transformation step, so preprocessing stays consistent across model training, evaluation, and deployment. When loaded into `03_modeling.ipynb`, it should be fitted on the training split only, which keeps test-set statistics from leaking into the encoders and scaler.