### 1. Load the Cleaned Dataset

We reload the cleaned German Credit dataset and separate the features from the target variable.


In [1]:
import pandas as pd

# Load data
df = pd.read_csv("../data/cleaned.csv")

# Separate the features (X) from the target column (y)
X = df.drop(columns="Target")
y = df["Target"]

### 2. Identify Variable Types

We identify the categorical and numerical columns so that each type can receive the appropriate preprocessing: encoding for categorical columns and scaling for numerical ones.


In [2]:
# Identify column types
cat_cols = X.select_dtypes(include="object").columns.tolist()
num_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

print("Categorical features:", cat_cols)
print("Numerical features:", num_cols)


Categorical features: ['Status_checking_account', 'Credit_history', 'Purpose', 'Savings_account', 'Employment_since', 'Personal_status_sex', 'Debtors', 'Property', 'Other_installment_plans', 'Housing', 'Job', 'Telephone', 'Foreign_worker']
Numerical features: ['Duration_month', 'Credit_amount', 'Installment_rate', 'Residence_since', 'Age', 'Existing_credits', 'Dependents']
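
One caveat with selecting numerical columns by exact dtype is that columns stored as other numeric dtypes (for example `int32`) are silently skipped. The sketch below uses a small made-up frame, not the German Credit data, to compare the exact-dtype selection with the broader `include="number"` selector; the column values and the `int32` column are assumptions for illustration only.

```python
import pandas as pd

# Toy frame (illustrative values): one int64 column, one int32 column, one text column
toy = pd.DataFrame({
    "Purpose": ["car", "tv", "car"],
    "Age": pd.Series([25, 40, 33], dtype="int64"),
    "Duration_month": pd.Series([12, 24, 6], dtype="int32"),
})

# Exact-dtype selection, as in the cell above
narrow = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Broader selection that matches every numeric dtype
broad = toy.select_dtypes(include="number").columns.tolist()

# Numeric columns the exact-dtype selection would miss
missed = sorted(set(broad) - set(narrow))

print("Exact dtypes:", narrow)
print("All numeric: ", broad)
print("Missed:      ", missed)
```

If `missed` is empty on the real data, as the output above suggests, both selectors give the same result and the exact-dtype version is fine.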


### 3. Train-Test Split

We split the dataset into training and testing sets using stratified sampling, so both sets preserve the proportion of good and bad credit risks found in the full dataset.


In [3]:
from sklearn.model_selection import train_test_split

# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


Train shape: (800, 20)
Test shape: (200, 20)
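
To confirm that stratification worked, the class proportions of the two splits can be compared. The sketch below runs on a synthetic 70/30 target rather than the real data; the names `X_demo`, `y_demo`, and the class ratio are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 70 rows of class 1 and 30 rows of class 0 (illustrative only)
y_demo = pd.Series([1] * 70 + [0] * 30, name="Target")
X_demo = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of each class in the train and test splits
train_props = y_tr.value_counts(normalize=True).sort_index()
test_props = y_te.value_counts(normalize=True).sort_index()

print("Train proportions:\n", train_props)
print("Test proportions:\n", test_props)
```

Applying the same `value_counts(normalize=True)` call to `y_train` and `y_test` should show near-identical class shares in both sets.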


### 4. Encode Categorical Variables and Scale Numeric Features

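We use one-hot encoding for the categorical variables and standard scaling for the numerical variables, and combine both transformations in a `ColumnTransformer`. The preprocessor is fit on the training set only, so no information from the test set leaks into the scaling statistics or the learned categories. Setting `handle_unknown="ignore"` lets the encoder handle categories that appear only in the test set.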


In [5]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# Preprocessing pipeline
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
])

# Fit on training data and transform both train and test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
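
The effect of `handle_unknown="ignore"` is easiest to see on a tiny example. In the sketch below, the frames `train_demo` and `test_demo` are made up for illustration: the category `"free"` appears only in the test frame, so the encoder fitted on the training frame has no column for it.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative frames: "free" never appears in the training data
train_demo = pd.DataFrame({"Housing": ["own", "rent", "own"]})
test_demo = pd.DataFrame({"Housing": ["rent", "free"]})

# Fit on the training frame only, as in the notebook
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_demo)

# The encoder returns a sparse matrix by default; convert it for display
encoded_test = encoder.transform(test_demo).toarray()

print(encoder.categories_)
print(encoded_test)
```

The row for `"rent"` gets a 1 in the `rent` column, while the unseen `"free"` row is encoded as all zeros instead of raising an error.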


### 5. Check Final Processed Shapes

We check the shape of the processed data. Encoding increases the number of features from 20 to 61: the 7 numerical columns are kept as scaled columns, and the 13 categorical columns expand into 54 one-hot columns.


In [6]:
print("X_train processed shape:", X_train_processed.shape)
print("X_test processed shape:", X_test_processed.shape)


X_train processed shape: (800, 61)
X_test processed shape: (200, 61)


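### 6. Save the Processed Data

We save the processed feature matrices, the target vectors, and the raw train and test splits with `joblib` so they can be reused later for modeling, predictions, or explainability. The fitted `preprocessor` itself is not written to disk in this cell, so it would need to be saved separately to transform new data at prediction time.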


In [10]:
import joblib

joblib.dump(X_train_processed, "../models/X_train_processed.pkl")
joblib.dump(X_test_processed, "../models/X_test_processed.pkl")
joblib.dump(y_train, "../models/y_train.pkl")
joblib.dump(y_test, "../models/y_test.pkl")
joblib.dump(X_test, "../models/X_test.pkl")
joblib.dump(X_train, "../models/X_train.pkl")



['../models/X_train.pkl']
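
A fitted transformer can be persisted with `joblib` in the same way as the arrays. The sketch below is a minimal round trip under stated assumptions: it uses a small `StandardScaler` in place of the notebook's `preprocessor`, writes to a temporary directory, and the file name `preprocessor_demo.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative training data; the scaler stands in for the fitted preprocessor
train_demo = pd.DataFrame({"Age": [20.0, 30.0, 40.0]})
scaler = StandardScaler().fit(train_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, written to a throwaway directory
    path = os.path.join(tmp_dir, "preprocessor_demo.pkl")
    joblib.dump(scaler, path)
    reloaded = joblib.load(path)

# The reloaded object keeps the statistics learned during fit
new_rows = pd.DataFrame({"Age": [30.0]})
scaled = reloaded.transform(new_rows)
print(scaled)
```

Because the reloaded object carries the fitted means and scales, new rows are transformed consistently with the training data without refitting. Pickled scikit-learn objects are generally expected to be loaded with the same library version that created them.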