# **Preprocessing**

## Notebook overview

---

This notebook transforms the features to better capture the structure of the data, based on the insights from the previous notebook. To do this, we define a class that encapsulates all of this logic, then wrap it in a reusable pipeline so that no data leaks between splits during modeling.

Tasks: 

- Apply custom feature engineering logic (`FeatureEngineer`) to extract meaningful patterns.
- Encode categorical variables using one-hot encoding.
- Scale numeric features to help stabilize logistic regression modeling.
- Combine all preprocessing steps into a single `Pipeline` object.
- Save the full pipeline with `joblib` so we can apply it consistently later.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

from imblearn.over_sampling import SMOTE
import joblib

import sys
sys.path.append('../src')  
from feature_engineering import FeatureEngineer

pd.set_option('display.max_columns', None)

print("Preprocessing environment initialized.")

Preprocessing environment initialized.


## 1. Reload data

---

We reload the cleaned dataset (`data_01.csv`) and validate that it contains exactly the expected columns, with no extras and none missing.

In [2]:
# Load cleaned data
df = pd.read_csv('../data/processed/data_01.csv')

# Define expected column names after EDA cleanup
expected_columns = [
    'Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
    'DistanceFromHome', 'Education', 'EducationField', 'EnvironmentSatisfaction',
    'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
    'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
    'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
    'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears',
    'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
    'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager'
]

# Get actual columns from the loaded DataFrame
actual_columns = list(df.columns)

# Compare against expected
missing_columns = set(expected_columns) - set(actual_columns)
unexpected_columns = set(actual_columns) - set(expected_columns)

# Display results
if not missing_columns and not unexpected_columns:
    print("Column schema validation passed.")
else:
    if missing_columns:
        print("Missing columns:", missing_columns)
    if unexpected_columns:
        print("Unexpected columns:", unexpected_columns)

Column schema validation passed.


## 2. Validation

---

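Before preprocessing, we run validations to confirm that:
- No missing values or constant columns remain
- Data types are as expected
- The class distribution of the target (`Attrition`) is known (the output below shows it is imbalanced, roughly 84% `No` to 16% `Yes`)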

In [3]:
# Check for nulls (none expected)
null_counts = df.isnull().sum()
if null_counts.any():
    print("Unexpected null values found:")
    display(null_counts[null_counts > 0])
else:
    print("No null values found.")

print("\nData types:")
display(df.dtypes)

# Identify constant columns
nunique = df.nunique()
constant_cols = nunique[nunique == 1].index.tolist()

if constant_cols:
    print(f"Constant columns detected and dropped: {constant_cols}")
    df.drop(columns=constant_cols, inplace=True)
    print(f"New shape after dropping: {df.shape}")
else:
    print("No constant columns detected.")

# Confirm target variable distribution
print("\nClass balance in 'Attrition':")
display(df['Attrition'].value_counts(normalize=True).round(3))

No null values found.

Data types:


Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole  

No constant columns detected.

Class balance in 'Attrition':


Attrition
No     0.839
Yes    0.161
Name: proportion, dtype: float64

## 3. Feature engineering pipeline

---

This section creates custom features to capture patterns not directly visible in the raw data. We encapsulate this logic in a class, `FeatureEngineer`, and then place it inside a `Pipeline` so that the same preprocessing steps are applied consistently and no data leaks between splits.

### `FeatureEngineer()` and `make_preprocessing_pipeline`

To capture interactions between features, and to make features suitable for modeling, all feature engineering logic lives in the `FeatureEngineer` class (imported from the `feature_engineering` module in `../src`). This avoids repeating the logic in subsequent notebooks.
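
The source of `FeatureEngineer` is not reproduced in this notebook. As a minimal sketch of the pattern it presumably follows, assuming a scikit-learn-style transformer, the hypothetical `ToyFeatureEngineer` below adds a single derived column and can be dropped into a `Pipeline` step in the same way:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ToyFeatureEngineer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for FeatureEngineer; adds one derived column."""

    def fit(self, X, y=None):
        # Nothing is learned in this toy version; return self so it chains.
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X["ZeroCompanyTenureFlag"] = (X["YearsAtCompany"] == 0).astype(int)
        return X


toy = pd.DataFrame({"YearsAtCompany": [0, 3, 12]})
out = ToyFeatureEngineer().fit_transform(toy)
print(out)
```

Separating `fit` from `transform` is what lets a pipeline learn any statistics from the training data only and then reuse them on new data.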

Below is a breakdown of each added/modified feature:

---
---

### Tenure and experience features

**`TenureCategory`**  
Buckets `YearsAtCompany` into tenure groups:  
- `0–3 yrs`  
- `4–6 yrs`  
- `7–10 yrs`  
- `10+ yrs`  
This captures key career stage segments, which may correspond to different attrition risks.
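
One way to produce these buckets is `pd.cut`. The bin edges below are an assumption (the listed groups leave it open whether exactly 10 years belongs to `7–10 yrs` or `10+ yrs`; this sketch puts it in `7–10 yrs`), so the actual `FeatureEngineer` may differ:

```python
import numpy as np
import pandas as pd

years = pd.Series([0, 3, 4, 6, 7, 10, 11, 25], name="YearsAtCompany")

# Right-closed bins: (-1, 3], (3, 6], (6, 10], (10, inf)
tenure_category = pd.cut(
    years,
    bins=[-1, 3, 6, 10, np.inf],
    labels=["0–3 yrs", "4–6 yrs", "7–10 yrs", "10+ yrs"],
)
print(pd.concat([years, tenure_category.rename("TenureCategory")], axis=1))
```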

**`TenureGap`**  
Calculates: `YearsInCurrentRole` − `YearsAtCompany`  
Separates employees who have changed roles internally (negative gap) from those who have stayed in the same role since joining (gap of zero), potentially indicating engagement or stagnation.

**`TenureRatio`**  
Calculates: `YearsInCurrentRole` / `YearsAtCompany`  
Identifies fast or slow role transitions. High ratios may indicate stagnation, while low ratios may indicate fast promotions or instability.

**`ZeroCompanyTenureFlag`**  
Binary flag indicating `YearsAtCompany` == 0  
Captures newly joined employees who may behave differently.

**`NewJoinerFlag`**  
Flags employees with:
- `YearsAtCompany` < 2  
- `TotalWorkingYears` > 3  
These are experienced employees who recently joined, a group that may behave differently due to habits or philosophies carried over from previous jobs.
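
The remaining tenure features can be sketched on a toy frame as follows. How the real class handles `YearsAtCompany` == 0 in `TenureRatio` is not stated here; mapping that case to 0 is an assumption made only for this example:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "YearsAtCompany":     [0, 1, 5, 10],
    "YearsInCurrentRole": [0, 1, 2, 10],
    "TotalWorkingYears":  [1, 8, 6, 12],
})

toy["TenureGap"] = toy["YearsInCurrentRole"] - toy["YearsAtCompany"]

# Assumption: a zero denominator yields a ratio of 0 instead of inf/NaN.
toy["TenureRatio"] = (
    toy["YearsInCurrentRole"] / toy["YearsAtCompany"].replace(0, np.nan)
).fillna(0.0)

toy["ZeroCompanyTenureFlag"] = (toy["YearsAtCompany"] == 0).astype(int)
toy["NewJoinerFlag"] = (
    (toy["YearsAtCompany"] < 2) & (toy["TotalWorkingYears"] > 3)
).astype(int)

print(toy)
```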

---
---

### Role and work features

**`OverTime_JobLevel`**  
Interaction between `OverTime` and `JobLevel`  
Useful for identifying levels of staff that are potentially overworked.

**`Travel_Occupation`**  
Combined effect of travel frequency and job role.  
Identifies roles with high levels of travel, which correlate with elevated attrition risk. 
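
Since both interaction features are one-hot encoded later, a plausible construction is a concatenated category label. The sketch below assumes that `Travel_Occupation` combines `BusinessTravel` with `JobRole` and that an underscore separator is used; neither detail is confirmed by this notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "OverTime": ["Yes", "No"],
    "JobLevel": [1, 3],
    "BusinessTravel": ["Travel_Frequently", "Non-Travel"],
    "JobRole": ["Sales Executive", "Manager"],
})

# Each combination becomes its own category for the one-hot encoder.
toy["OverTime_JobLevel"] = toy["OverTime"] + "_" + toy["JobLevel"].astype(str)
toy["Travel_Occupation"] = toy["BusinessTravel"] + "_" + toy["JobRole"]

print(toy[["OverTime_JobLevel", "Travel_Occupation"]])
```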

---
---

### Satisfaction features

**`SatisfactionMean`**  
Averages the satisfaction scores:  
- `EnvironmentSatisfaction`  
- `JobSatisfaction`  
- `RelationshipSatisfaction`  
Provides a general overview of employee sentiment.

**`SatisfactionRange`**  
Calculates the range (maximum minus minimum) of the three satisfaction scores  
Measures inconsistency in perceived satisfaction, potentially indicating internal conflict or instability.

**`SatisfactionStability`**  
Binary flag: 1 if all 3 satisfaction scores are equal  
Identifies employees with consistent satisfaction levels across all domains.
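
A minimal sketch of the three satisfaction features on toy data, assuming row-wise aggregation over the three score columns:

```python
import pandas as pd

cols = ["EnvironmentSatisfaction", "JobSatisfaction", "RelationshipSatisfaction"]
toy = pd.DataFrame([[4, 4, 4], [1, 3, 4]], columns=cols)

toy["SatisfactionMean"] = toy[cols].mean(axis=1)
toy["SatisfactionRange"] = toy[cols].max(axis=1) - toy[cols].min(axis=1)
# All three scores are equal exactly when the range is zero.
toy["SatisfactionStability"] = (toy["SatisfactionRange"] == 0).astype(int)

print(toy)
```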

---
---

### Financial features

**`Log_MonthlyIncome`**  
Applies a log transform to `MonthlyIncome`  
Reduces skew and compresses extreme values.

**`Log_DistanceFromHome`**  
Applies a log transform to `DistanceFromHome`  
Reduces skew and compresses extreme values.

**`LowIncomeFlag`**  
Binary flag for employees earning below the 25th percentile of income  
Captures possible financial dissatisfaction.
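
The percentile threshold is a statistic of the data, so to stay leakage-free it would need to be learned in `fit` and reused in `transform`. The hypothetical `ToyIncomeFeatures` class below sketches that idea; the use of `np.log1p` (rather than a plain log) and the attribute name `income_q25_` are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ToyIncomeFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: learns the income threshold from the fit data."""

    def fit(self, X, y=None):
        # 25th percentile of income, computed on the data passed to fit only.
        self.income_q25_ = X["MonthlyIncome"].quantile(0.25)
        return self

    def transform(self, X):
        X = X.copy()
        X["Log_MonthlyIncome"] = np.log1p(X["MonthlyIncome"])
        X["Log_DistanceFromHome"] = np.log1p(X["DistanceFromHome"])
        X["LowIncomeFlag"] = (X["MonthlyIncome"] < self.income_q25_).astype(int)
        return X


train = pd.DataFrame({
    "MonthlyIncome": [2000, 4000, 6000, 8000, 10000],
    "DistanceFromHome": [1, 2, 5, 10, 29],
})
new = pd.DataFrame({"MonthlyIncome": [3000, 9000], "DistanceFromHome": [3, 7]})

fe = ToyIncomeFeatures().fit(train)
out = fe.transform(new)  # uses the threshold learned from `train`
print(out)
```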

---
---

### Burnout risk

**`StressRisk`**  
Binary flag for employees where:  
- `OverTime` == Yes  
- `JobSatisfaction` ≤ 2  
- `SatisfactionMean` < 2.5  
Combines workload and dissatisfaction into a high-risk signal for possible voluntary attrition.
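
Reading the three conditions as jointly required (an assumption; the list above does not state whether they are combined with AND or OR), the flag can be sketched as:

```python
import pandas as pd

toy = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "No"],
    "JobSatisfaction": [1, 4, 1],
    "SatisfactionMean": [2.0, 3.5, 1.5],
})

# Flag only rows where all three conditions hold at once.
toy["StressRisk"] = (
    (toy["OverTime"] == "Yes")
    & (toy["JobSatisfaction"] <= 2)
    & (toy["SatisfactionMean"] < 2.5)
).astype(int)

print(toy)
```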

---
---

### Preprocessing pipeline definition

We finalize the preprocessing logic here by defining which columns to encode, scale, or pass through unchanged:

- Categorical variables are one-hot encoded.
- Continuous numeric features are standardized with `StandardScaler`.
- Binary flags from feature engineering are passed through untouched.
- Columns not listed in any of these groups are dropped, since `ColumnTransformer` defaults to `remainder='drop'`.
- All transformations are bundled into a `ColumnTransformer`, which is embedded in a reusable `Pipeline`.

This pipeline will be saved and applied during modeling (`03_modeling.ipynb`) to ensure consistent preprocessing and no data leakage.

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
import joblib

def make_preprocessing_pipeline():

    # One-hot encode categorical features
    nominal_cols = [
        'Department', 'EducationField', 'Gender', 'MaritalStatus',
        'OverTime', 'TenureCategory', 'OverTime_JobLevel', 'Travel_Occupation'
    ]

    # Standardize continuous features 
    scale_cols = [
        'Age', 'DistanceFromHome', 'HourlyRate', 'JobInvolvement', 'JobLevel',
        'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
        'PerformanceRating', 'StockOptionLevel', 'TotalWorkingYears',
        'TrainingTimesLastYear',
        'YearsSinceLastPromotion', 'YearsWithCurrManager',
        'TenureRatio', 'TenureGap', 'SatisfactionMean', 'SatisfactionRange',
        'PromotionPerYear', 'YearsCompany_Satisfaction',
        'Log_MonthlyIncome', 'Log_DistanceFromHome'
    ]

    # Pass through binary flags
    passthrough_cols = [
        'ZeroCompanyTenureFlag', 'NewJoinerFlag', 'LowIncomeFlag',
        'SatisfactionStability', 'StressRisk'
    ]

    # Build column transformer
    preprocessor = ColumnTransformer(transformers=[
        ('nominal', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), nominal_cols),
        ('scale', StandardScaler(), scale_cols),
        ('passthrough', 'passthrough', passthrough_cols)
    ])

    # Full pipeline
    pipeline = Pipeline(steps=[
        ('feature_engineering', FeatureEngineer()),
        ('preprocessing', preprocessor)
    ])

    return pipeline


In [5]:
print(df.columns)


Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'OverTime',
       'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')


### Export pipeline

We export the preprocessing pipeline unfitted, so that the next notebook can fit it on the training set only.

In [6]:
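# Drop the target to preview the feature columns the pipeline will receive
df_clean = df.drop(columns='Attrition')
print(df_clean.columns)

# Build the pipeline without fitting it; fitting happens on the training split later
pipeline = make_preprocessing_pipeline()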

# Save pipeline
joblib.dump(pipeline, '../models/preprocessing_pipeline.pkl')
print("Preprocessing pipeline saved.")

Index(['Age', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
       'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
       'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
Preprocessing pipeline saved.
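
As a rough sketch of how an unfitted pipeline might be consumed downstream, the example below saves a toy pipeline, reloads it, and fits it on the training split only. The toy data, the temporary file path, and the split parameters are all assumptions for illustration; the real workflow lives in `03_modeling.ipynb`:

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the real data and pipeline (both hypothetical).
rng = np.random.default_rng(0)
X = pd.DataFrame({"Age": rng.integers(18, 60, size=100)})
y = pd.Series(["Yes"] * 16 + ["No"] * 84, name="Attrition")

path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.pkl")
joblib.dump(Pipeline([("scale", StandardScaler())]), path)  # saved unfitted

# Stratify so both splits keep the imbalanced class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

pipe = joblib.load(path)
X_train_t = pipe.fit_transform(X_train)  # statistics learned from train only
X_test_t = pipe.transform(X_test)        # reused on test, never refit
```

Fitting after the split means the scaler's mean and variance never see the test rows, which is the leakage the unfitted export is meant to prevent.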


# **Preprocessing summary**

This notebook:

- Wraps the feature logic in `FeatureEngineer` as the first pipeline step
- One-hot encodes categorical features using `OneHotEncoder`
- Scales numerical features with `StandardScaler`
- Bundles everything into a reusable, unfitted `Pipeline` saved with `joblib`

This exported pipeline ensures consistent preprocessing across training and evaluation.