# 1. Importing the Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 2. Loading the Dataset

In [2]:
df = pd.read_csv("/content/WA_Fn-UseC_-HR-Employee-Attrition.csv")

# 3. Preprocessing

## Dropping Useless Columns

In [3]:
df.drop(columns=["EmployeeCount", "EmployeeNumber", "Over18", "StandardHours"], inplace=True)

`EmployeeCount`, `Over18`, and `StandardHours` hold a single constant value, and `EmployeeNumber` is a row identifier, so none of them contributes to prediction.
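
One way to check this before dropping anything is to count distinct values per column. The sketch below uses a small made-up frame (`toy`) in place of the real dataset: a column with one distinct value is constant, and a column with as many distinct values as rows behaves like an identifier.

```python
import pandas as pd

# Hypothetical stand-in for the HR dataset; the values are made up.
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "EmployeeNumber": [1, 2, 4, 5],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [80, 80, 80, 80],
    "Age": [41, 49, 37, 41],
})

n_unique = toy.nunique()

# Columns with a single distinct value carry no information.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns that are unique per row behave like identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print(constant_cols)
print(id_like_cols)
```

On the real data, the same two checks could be run on `df` before the `drop` call. A numeric feature that happens to be unique per row would also show up as identifier-like, so the result still needs a manual look.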

## Encoding Target Variable

In [4]:
df["Attrition"] = df["Attrition"].map({"Yes": 1, "No": 0})
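
`Series.map` with a dict turns any value that is not a key into `NaN` without raising, so a quick check after encoding can catch unexpected labels. A minimal sketch on made-up labels (not the real column):

```python
import pandas as pd

# Made-up labels; the last one has stray whitespace on purpose.
labels = pd.Series(["Yes", "No", "No", "Yes "])

encoded = labels.map({"Yes": 1, "No": 0})

# Values missing from the mapping become NaN instead of raising.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

On the notebook's data, the equivalent check would be that `df["Attrition"].isna().sum()` is zero after the mapping.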

## Separating Features & Target

In [5]:
X = df.drop("Attrition", axis=1)
y = df["Attrition"]

## Handling Categorical Variables (IMPORTANT)

In [6]:
X = pd.get_dummies(X, drop_first=True)

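Why `drop_first=True`?

> It drops one dummy column per categorical feature. The dropped level becomes the baseline, which avoids perfectly collinear columns (the dummy variable trap) in linear models.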
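
The effect is easiest to see on a tiny example. The sketch below uses a made-up `Department` column; `dtype=int` is only there to make the printed values easier to read.

```python
import pandas as pd

# Made-up example column; the values are illustrative.
toy = pd.DataFrame({"Department": ["Sales", "HR", "R&D", "Sales"]})

full = pd.get_dummies(toy, dtype=int)
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

# With all levels kept, each row's dummies sum to 1, which
# duplicates the information carried by an intercept term.
print(full.sum(axis=1).tolist())

# drop_first removes the first level in sorted order ("HR" here);
# a row of all zeros now means "HR".
print(reduced.columns.tolist())
```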



## Train-Test Split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Stratifying on `y` keeps the attrition ratio (approximately) the same in the training and test sets.
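
A quick way to confirm this is to compare the positive-class rate in both splits. The sketch below uses a synthetic target with 20% positives instead of the real data, with the same split arguments as the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 20% positives (made-up data).
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both rates should match the overall positive rate.
print(y_tr.mean(), y_te.mean())
```

The same comparison on the notebook's variables would be `y_train.mean()` against `y_test.mean()`.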

## Feature Scaling

In [8]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

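*   `fit` (here via `fit_transform`) only on the training data, so test-set statistics do not leak into the scaler
*   `transform` only on the test set and on future inputs (e.g. from the website)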
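
The two rules above can be illustrated on a made-up single-feature array: the scaler learns its mean and standard deviation from the training rows and reuses them unchanged for anything transformed later.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

demo_scaler = StandardScaler()

# Learns mean = 2 and a standard deviation of about 0.816 from train.
train_scaled = demo_scaler.fit_transform(train)

# Reuses the training statistics; nothing is re-estimated here.
test_scaled = demo_scaler.transform(test)

print(demo_scaler.mean_)
print(test_scaled)
```

Because the test value is scaled with the training statistics, it does not have to fall in the same range as the scaled training data.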

# 4. Saving the Artifacts

In [9]:
import pickle

with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

with open("feature_columns.pkl", "wb") as f:
    pickle.dump(X.columns, f)
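
At inference time a single new record rarely produces the same dummy columns as the training frame, so the saved column list is typically used to realign it before calling `scaler.transform`. A minimal sketch, assuming a made-up subset of the saved columns (`feature_columns`) and a hypothetical input row; in practice the list would be loaded from `feature_columns.pkl`.

```python
import pandas as pd

# Stand-in for the columns saved in feature_columns.pkl (made-up subset).
feature_columns = ["Age", "Department_Research & Development", "Department_Sales"]

# A single new record as it might arrive from a form (hypothetical input).
new_row = pd.DataFrame([{"Age": 35, "Department": "Sales"}])

# No drop_first here: with one row it would drop the only level present.
encoded = pd.get_dummies(new_row, dtype=int)

# Align to the training layout: add missing dummy columns as 0,
# drop unseen ones, and fix the column order.
aligned = encoded.reindex(columns=feature_columns, fill_value=0)

print(aligned.to_dict("records"))
```

The aligned frame has the same columns in the same order as the training frame, which is what the fitted scaler expects.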

**Preprocessing Summary**

* Removed non-informative columns

* Encoded categorical features using one-hot encoding

* Standardized all features (including the one-hot columns) with `StandardScaler`

* Created a reproducible, stratified train-test split (`random_state=42`)

* Saved preprocessing artifacts for deployment