# Cleaning and Preprocessing

## 1. Notebook overview

This notebook prepares the dataset for modeling by defining and exporting a reusable preprocessing pipeline.

Specifically, it:

- Applies custom feature engineering using a `FeatureEngineer` transformer.
- Encodes ordinal and nominal categorical variables with `OrdinalEncoder` and `OneHotEncoder`.
- Scales selected continuous features using `StandardScaler`.
- Bundles all preprocessing steps into a scikit-learn `Pipeline`.
- Serializes the complete pipeline using `joblib` for reuse in `03_modeling.ipynb`.

No modeling or data splitting occurs in this notebook. The pipeline is fitted and saved here; applying it to the data during training and evaluation is deferred to `03_modeling.ipynb`, which loads the saved pipeline.

This is the second step in the pipeline following `01_eda.ipynb`, and it feeds directly into `03_modeling.ipynb`.

In [1]:
# Imports for preprocessing
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

from imblearn.over_sampling import SMOTE
import joblib

import sys
sys.path.append('../src')  
from feature_engineering import FeatureEngineer

# Set options for display
pd.set_option('display.max_columns', None)

print("Preprocessing environment initialized.")

Preprocessing environment initialized.


## 2. Reload data and confirm schema

We load the cleaned dataset exported from the EDA notebook (`data_01.csv`) and verify its structure before proceeding with preprocessing.

This step ensures:
- The dataset was saved correctly.
- The schema matches expectations (column names, data types).
- There are no unexpected missing values or type mismatches introduced during export.

We also validate that the dataset includes exactly the expected columns — no more, no less. This prevents issues downstream if column names are altered, dropped, or duplicated.

The expected schema includes only meaningful, cleaned features after removing non-informative columns in the EDA phase.

In [2]:
# Load cleaned data
df = pd.read_csv('../data/processed/data_01.csv')

# Define expected column names after EDA cleanup
expected_columns = [
    'Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
    'DistanceFromHome', 'Education', 'EducationField', 'EnvironmentSatisfaction',
    'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
    'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
    'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
    'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears',
    'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
    'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager'
]

# Get actual columns from the loaded DataFrame
actual_columns = list(df.columns)

# Compare against expected
missing_columns = set(expected_columns) - set(actual_columns)
unexpected_columns = set(actual_columns) - set(expected_columns)

# Display results
if not missing_columns and not unexpected_columns:
    print("Column schema validation passed.")
else:
    if missing_columns:
        print("Missing columns:", missing_columns)
    if unexpected_columns:
        print("Unexpected columns:", unexpected_columns)

Column schema validation passed.
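
The set comparison above reports missing and unexpected columns, but it cannot see duplicated names and it only prints its findings. A stricter variant could raise instead, so the notebook stops on a bad schema. The sketch below is one possible shape for that check; `validate_schema` is a hypothetical helper, not part of this project. Note that `pd.read_csv` renames duplicate headers by default (for example `Age.1`), so after a CSV load duplicates would normally surface as unexpected columns.

```python
import pandas as pd


def validate_schema(df, expected):
    """Raise ValueError if columns are missing, unexpected, or duplicated.

    Hypothetical helper for illustration only.
    """
    actual = list(df.columns)
    missing = sorted(set(expected) - set(actual))
    unexpected = sorted(set(actual) - set(expected))
    # Index.duplicated() marks the second and later occurrences of a name
    duplicated = sorted(df.columns[df.columns.duplicated()].unique())

    problems = []
    if missing:
        problems.append(f"missing: {missing}")
    if unexpected:
        problems.append(f"unexpected: {unexpected}")
    if duplicated:
        problems.append(f"duplicated: {duplicated}")
    if problems:
        raise ValueError("Schema validation failed; " + "; ".join(problems))


# Passes silently on a matching schema
validate_schema(pd.DataFrame({"Age": [30], "Attrition": ["No"]}), ["Age", "Attrition"])
```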


## 3. Cleaning checks

We perform additional cleaning checks on the dataset before continuing preprocessing.

This includes:
- Verifying data types are appropriate.
- Checking for unexpected nulls (none should exist).
- Ensuring no constant or identifier columns remain.
- Confirming target class distribution.

In [3]:
# Check for nulls (none expected)
null_counts = df.isnull().sum()
if null_counts.any():
    print("Unexpected null values found:")
    display(null_counts[null_counts > 0])
else:
    print("No null values found.")

# Recheck data types
print("\nData types:")
display(df.dtypes)

# Identify constant columns (only one unique value)
nunique = df.nunique()
constant_cols = nunique[nunique == 1].index.tolist()

if constant_cols:
    print(f"Constant columns detected and dropped: {constant_cols}")
    df.drop(columns=constant_cols, inplace=True)
    print(f"New shape after dropping: {df.shape}")
else:
    print("No constant columns detected.")

# Confirm target variable distribution
print("\nClass balance in 'Attrition':")
display(df['Attrition'].value_counts(normalize=True).round(3))

No null values found.

Data types:


Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole  

No constant columns detected.

Class balance in 'Attrition':


Attrition
No     0.839
Yes    0.161
Name: proportion, dtype: float64

## 4. Feature engineering pipeline

This section creates custom features to capture patterns not directly visible in the raw data. It includes tenure ratios, satisfaction aggregates, income transformations, and interaction terms that reflect role dynamics, workload, and employee stability. These engineered features aim to improve model performance by providing more expressive signals aligned with real-world attrition behavior.

### Custom Feature Engineering

The `FeatureEngineer` transformer creates domain-informed features designed to surface complex relationships in employee behavior, satisfaction, and tenure. These engineered variables aim to boost model signal by compressing nonlinear interactions and exposing patterns not easily captured by raw variables.
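
The source of `FeatureEngineer` lives in `../src/feature_engineering.py` and is not shown in this notebook. As a rough illustration of how such a transformer fits the scikit-learn API, the sketch below implements a single stateful feature. `IncomeFlagEngineer` and its `quantile` parameter are hypothetical names; the real class may be organized differently.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class IncomeFlagEngineer(BaseEstimator, TransformerMixin):
    """Hypothetical, much-reduced stand-in for FeatureEngineer."""

    def __init__(self, quantile=0.25):
        self.quantile = quantile

    def fit(self, X, y=None):
        # Learn the income threshold only from the data passed to fit
        self.income_threshold_ = X["MonthlyIncome"].quantile(self.quantile)
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is left untouched
        X = X.copy()
        X["LowIncomeFlag"] = (X["MonthlyIncome"] < self.income_threshold_).astype(int)
        return X


train = pd.DataFrame({"MonthlyIncome": [2000, 3000, 5000, 9000, 12000]})
new = pd.DataFrame({"MonthlyIncome": [2500, 10000]})

eng = IncomeFlagEngineer().fit(train)
out = eng.transform(new)
```

Storing the threshold in `fit` means new data is flagged against the value learned at fit time rather than against its own distribution.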

Below is a detailed breakdown of each derived feature and the rationale for its inclusion:

---

### Tenure and Experience Features

**`TenureCategory`**  
Buckets `YearsAtCompany` into tenure groups:  
- `0–3 yrs`  
- `4–6 yrs`  
- `7–10 yrs`  
- `10+ yrs`  
This captures key career stage segments, which may correspond to different attrition risks.

**`TenureGap`**  
Calculates: `YearsInCurrentRole` − `YearsAtCompany`  
Surfaces employees who may have changed roles internally versus those who stayed static, potentially indicating engagement or stagnation.

**`TenureRatio`**  
Calculates: `YearsInCurrentRole` / `YearsAtCompany`  
Normalizes role tenure by company tenure to identify fast or slow transitions. High ratios may indicate stagnation, while low ratios may indicate fast promotions or instability.

**`ZeroCompanyTenureFlag`**  
Binary flag indicating `YearsAtCompany` == 0  
Captures newly joined employees who may behave differently or lack long-term integration.

**`NewJoinerFlag`**  
Flags employees with:
- `YearsAtCompany` < 2  
- `TotalWorkingYears` > 3  
These are experienced professionals recently joining, which may suggest job-hopping or career instability.
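
A minimal pandas sketch of the tenure features described above. The bin edges, the handling of `YearsAtCompany == 0` in the ratio (set to 0 here), and the function name `add_tenure_features` are assumptions; the actual `FeatureEngineer` code may differ.

```python
import numpy as np
import pandas as pd


def add_tenure_features(df):
    """Illustrative only: derive the tenure features from raw columns."""
    out = df.copy()
    years = out["YearsAtCompany"]
    role = out["YearsInCurrentRole"]

    # Assumed bin edges: (-1, 3], (3, 6], (6, 10], (10, inf)
    out["TenureCategory"] = pd.cut(
        years,
        bins=[-1, 3, 6, 10, np.inf],
        labels=["0–3 yrs", "4–6 yrs", "7–10 yrs", "10+ yrs"],
    ).astype(str)

    out["TenureGap"] = role - years

    # Avoid division by zero: treat the ratio as 0 for zero company tenure (assumption)
    out["TenureRatio"] = (role / years.replace(0, np.nan)).fillna(0.0)

    out["ZeroCompanyTenureFlag"] = (years == 0).astype(int)
    out["NewJoinerFlag"] = ((years < 2) & (out["TotalWorkingYears"] > 3)).astype(int)
    return out


sample = pd.DataFrame({
    "YearsAtCompany": [0, 1, 5, 12],
    "YearsInCurrentRole": [0, 1, 2, 12],
    "TotalWorkingYears": [0, 8, 6, 20],
})
tenure = add_tenure_features(sample)
```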

---

### Role and Work Simplifications

**`JobRole_Simplified`**  
Collapses job roles into broader categories:  
- “Technical” vs. “Other”  
Highlights differences in attrition patterns between technical vs. non-technical roles.

**`OverTime_JobLevel`**  
Encodes interaction between `OverTime` and `JobLevel`  
Useful for identifying high-level staff taking overtime (possible burnout) or junior staff being overworked.

**`Travel_Occupation`**  
Encodes combined effect of travel status and job role.  
Can surface role-travel patterns associated with higher attrition.
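
One plausible way to build these categorical features is plain string concatenation, sketched below. Which roles count as "Technical" is not stated in this notebook, so `TECHNICAL_ROLES` is a made-up grouping, and the `_` separator and the function name `add_role_features` are assumptions as well.

```python
import numpy as np
import pandas as pd

# Hypothetical grouping: the real mapping used by FeatureEngineer is not shown
TECHNICAL_ROLES = {"Research Scientist", "Laboratory Technician"}


def add_role_features(df, technical_roles=TECHNICAL_ROLES):
    """Illustrative only: simplified role plus two interaction categories."""
    out = df.copy()
    out["JobRole_Simplified"] = np.where(
        out["JobRole"].isin(technical_roles), "Technical", "Other"
    )
    # Interaction terms encoded as combined category labels
    out["OverTime_JobLevel"] = out["OverTime"] + "_" + out["JobLevel"].astype(str)
    out["Travel_Occupation"] = out["BusinessTravel"] + "_" + out["JobRole"]
    return out


roles_sample = pd.DataFrame({
    "JobRole": ["Research Scientist", "Sales Executive"],
    "OverTime": ["Yes", "No"],
    "JobLevel": [1, 3],
    "BusinessTravel": ["Travel_Rarely", "Non-Travel"],
})
roles = add_role_features(roles_sample)
```

Because the combined labels are later one-hot encoded, each distinct pair becomes its own indicator column, so rare combinations produce sparse columns.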

---

### Satisfaction Features

**`SatisfactionMean`**  
Averages the 3 key satisfaction scores:  
- `EnvironmentSatisfaction`  
- `JobSatisfaction`  
- `RelationshipSatisfaction`  
Provides a generalized view of employee sentiment.

**`SatisfactionRange`**  
Calculates: max − min of the 3 satisfaction scores  
Surfaces inconsistency or volatility in perceived satisfaction, potentially indicating internal conflict or instability.

**`SatisfactionStability`**  
Binary flag: 1 if all 3 satisfaction scores are equal  
Identifies employees with consistent satisfaction levels across all domains.
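
The three satisfaction aggregates follow directly from row-wise statistics. The sketch below assumes the scores are numeric; the function name `add_satisfaction_features` is hypothetical.

```python
import pandas as pd

SAT_COLS = ["EnvironmentSatisfaction", "JobSatisfaction", "RelationshipSatisfaction"]


def add_satisfaction_features(df):
    """Illustrative only: mean, range, and equality flag across the three scores."""
    out = df.copy()
    sat = out[SAT_COLS]
    out["SatisfactionMean"] = sat.mean(axis=1)
    out["SatisfactionRange"] = sat.max(axis=1) - sat.min(axis=1)
    # All three scores are equal exactly when the range is zero
    out["SatisfactionStability"] = (out["SatisfactionRange"] == 0).astype(int)
    return out


sat_sample = pd.DataFrame({
    "EnvironmentSatisfaction": [4, 1],
    "JobSatisfaction": [4, 3],
    "RelationshipSatisfaction": [4, 4],
})
sat_features = add_satisfaction_features(sat_sample)
```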

---

### Financial Features

**`Log_MonthlyIncome`**  
Applies log transform to `MonthlyIncome`  
Reduces skew, stabilizes model learning, and compresses extreme values.

**`Log_DistanceFromHome`**  
Applies log transform to `DistanceFromHome`  
Captures diminishing marginal effect of commute distance on attrition.

**`LowIncomeFlag`**  
Binary flag for employees earning below the 25th percentile of income  
Captures financial dissatisfaction and potential disengagement.
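
A short sketch of the financial features. The text only says "log transform"; `np.log1p` is used here as an assumption because it stays finite at zero. The threshold is passed in explicitly so that it can come from the data the pipeline was fitted on, and `add_financial_features` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def add_financial_features(df, income_threshold):
    """Illustrative only: log-scaled columns and a low-income flag."""
    out = df.copy()
    # log1p(x) = log(1 + x); an assumed choice, the real transform may be plain log
    out["Log_MonthlyIncome"] = np.log1p(out["MonthlyIncome"])
    out["Log_DistanceFromHome"] = np.log1p(out["DistanceFromHome"])
    out["LowIncomeFlag"] = (out["MonthlyIncome"] < income_threshold).astype(int)
    return out


fin_sample = pd.DataFrame({
    "MonthlyIncome": [1000, 4000, 6000, 20000],
    "DistanceFromHome": [1, 2, 10, 29],
})
# 25th percentile computed from the same sample purely for demonstration
threshold = fin_sample["MonthlyIncome"].quantile(0.25)
fin = add_financial_features(fin_sample, threshold)
```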

---

### Composite Burnout Risk

**`StressRisk`**  
Binary flag for employees where:  
- `OverTime` == Yes  
- `JobSatisfaction` ≤ 2  
- `SatisfactionMean` < 2.5  
Combines workload and dissatisfaction into a high-risk signal for possible voluntary attrition.
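
The three conditions combine with a logical AND. The sketch below assumes `SatisfactionMean` has already been computed and that `OverTime` holds the strings `Yes` and `No`; `add_stress_risk` is a hypothetical name.

```python
import pandas as pd


def add_stress_risk(df):
    """Illustrative only: flag rows that meet all three burnout conditions."""
    out = df.copy()
    out["StressRisk"] = (
        (out["OverTime"] == "Yes")
        & (out["JobSatisfaction"] <= 2)
        & (out["SatisfactionMean"] < 2.5)
    ).astype(int)
    return out


stress_sample = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "No"],
    "JobSatisfaction": [1, 2, 1],
    "SatisfactionMean": [2.0, 3.0, 1.5],
})
stress = add_stress_risk(stress_sample)
```

Only the first row is flagged: the second fails the `SatisfactionMean` condition and the third has no overtime.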

---

### Preprocessing pipeline definition 

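- Ordinal variables are encoded with ordered categories to preserve their scale. Nominal categorical features are one-hot encoded.
- Continuous numerical features are standardized to ensure uniform scale. Binary flags are passed through unchanged. The scaled group also includes `PromotionPerYear` and `YearsCompany_Satisfaction`, two engineered features not described in the breakdown above.
- The `ColumnTransformer` bundles all transformations, and a full `Pipeline` integrates both custom feature engineering and preprocessing for consistent application during training and inference.
- Columns that appear in none of the groups (for example `DailyRate`, `Education`, `MonthlyIncome`, and `JobRole`) are dropped, because `ColumnTransformer` defaults to `remainder='drop'`.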


In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
import joblib

def make_preprocessing_pipeline():
    # Ordinal encoding map
    ordinal_cols = ['EnvironmentSatisfaction', 'JobSatisfaction', 'RelationshipSatisfaction', 'WorkLifeBalance']
    ordinal_map = [['1', '2', '3', '4']] * len(ordinal_cols)

    # Nominal (OneHot) encoded categorical features
    nominal_cols = [
        'BusinessTravel', 'Department', 'EducationField', 'Gender', 'MaritalStatus', 'OverTime',
        'JobRole_Simplified', 'TenureCategory', 'OverTime_JobLevel', 'Travel_Occupation'
    ]

    # Scaled numeric features (original + engineered)
    scale_cols = [
        'Age', 'DistanceFromHome', 'HourlyRate', 'JobInvolvement', 'JobLevel',
        'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating',
        'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany',
        'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager',
        'TenureRatio', 'TenureGap', 'SatisfactionMean', 'SatisfactionRange',
        'PromotionPerYear', 'YearsCompany_Satisfaction',
        'Log_MonthlyIncome', 'Log_DistanceFromHome'
    ]

    # Binary passthrough flags
    passthrough_cols = [
        'ZeroCompanyTenureFlag', 'NewJoinerFlag', 'LowIncomeFlag',
        'SatisfactionStability', 'StressRisk'
    ]

    # Build column transformer
    preprocessor = ColumnTransformer(transformers=[
        ('ordinal', OrdinalEncoder(categories=ordinal_map), ordinal_cols),
        ('nominal', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), nominal_cols),
        ('scale', StandardScaler(), scale_cols),
        ('passthrough', 'passthrough', passthrough_cols)
    ])

    # Wrap with full pipeline
    pipeline = Pipeline(steps=[
        ('feature_engineering', FeatureEngineer()),
        ('preprocessing', preprocessor)
    ])

    return pipeline

### Fit and export pipeline 

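- The preprocessing pipeline is instantiated and fitted on the cleaned dataset with the target column `Attrition` excluded, so the target never enters the features. Because the fit uses every row, learned statistics such as the scaler means also reflect rows that may later fall into a test split; refitting on the training split in `03_modeling.ipynb` avoids that.
- After fitting, the pipeline is serialized using `joblib` and saved to disk. This allows the same preprocessing steps to be consistently reused during model inference and evaluation.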


In [7]:
# Fit on cleaned data (exclude target)
X_clean = df.drop(columns='Attrition')
pipeline = make_preprocessing_pipeline()
pipeline.fit(X_clean)

# Save pipeline
joblib.dump(pipeline, '../models/preprocessing_pipeline.pkl')
print("Preprocessing pipeline saved.")

Preprocessing pipeline saved.
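
If strict separation between training and test data matters, the usual pattern is to split first and fit the preprocessing on the training rows only. The sketch below shows that pattern on synthetic data with a bare `StandardScaler`; the array sizes, class ratio, and random seeds are arbitrary assumptions and do not come from this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one numeric feature and an imbalanced binary target
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 1))
y = np.array([0] * 160 + [1] * 40)

# Split first, stratifying on the target to keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit on the training rows only, then apply the same statistics to the test rows
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The same idea carries over to the full pipeline: call `fit` on the training split's features and `transform` on everything else.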


## 5. Preprocessing summary and next steps

This notebook implemented the full preprocessing pipeline for the attrition dataset, including:

- Custom feature engineering via a `FeatureEngineer` transformer
- Encoding of ordinal and nominal categorical features using `OrdinalEncoder` and `OneHotEncoder`
- Standardization of selected continuous variables using `StandardScaler`
- Integration of all steps into a scikit-learn `Pipeline` for modular reuse
- Exporting the complete pipeline to `../models/preprocessing_pipeline.pkl` using `joblib`

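The exported pipeline preserves all transformations and ensures consistent preprocessing during model training, evaluation, and deployment. It can be loaded into `03_modeling.ipynb` so that every dataset passes through identical transformations; to avoid leakage there, fit or refit it on the training split only.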
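
As a sketch of the save-and-reload round trip, the example below dumps a small stand-in pipeline to a temporary directory and checks that the reloaded copy transforms new data identically. The stand-in pipeline and the temporary path are assumptions chosen to keep the example self-contained; the real file is `../models/preprocessing_pipeline.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in for the real preprocessing pipeline
pipe = Pipeline(steps=[("scale", StandardScaler())])
pipe.fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessing_pipeline.pkl")
    joblib.dump(pipe, path)
    loaded = joblib.load(path)

# The reloaded pipeline should reproduce the original transformation
X_new = np.array([[2.0], [4.0]])
same = np.allclose(pipe.transform(X_new), loaded.transform(X_new))
print(same)
```

When loading the real pipeline, the module that defines `FeatureEngineer` must be importable in the loading notebook (for example by appending `../src` to `sys.path` as done here), since pickled objects store a reference to their class rather than its code.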