ðŸ““ Customer Transaction Prediction

1. Problem Statement

### Business Objective

Banks want to identify customers who are likely to make a transaction in the future.
Predicting such behavior helps in:

* Targeted marketing
* Customer engagement
* Revenue optimization

### Problem Definition

The objective of this project is to **predict whether a customer will make a transaction in the future**, irrespective of the transaction amount.

### Dataset Description

* Domain: Banking
* The dataset is anonymized
* Total features: 200 (`var_0` to `var_199`)
* Target variable:

  * `0` â†’ Customer will NOT make a transaction
  * `1` â†’ Customer WILL make a transaction

Traditional EDA is limited due to anonymized feature names.

2. Import Libraries & Configuration

In [None]:
# Core libraries
import numpy as np
import pandas as pd

# Visualization (minimal)
import matplotlib.pyplot as plt
import seaborn as sns

# ML & evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Utility
import warnings
warnings.filterwarnings("ignore")

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

3. Data Loading

In [None]:
data = pd.read_csv("train.csv")
data.head()

4. Data Integrity & Basic Analysis

In [None]:
# Dataset Shape
data.shape

# Column Overview
data.columns

# Dataset Information
data.info()

# Target Variable Distribution
data['target'].value_counts(normalize=True)

# Dataset Summary

* Total records: 200,000
* Total columns: 202
* Target distribution:

  * 0 â†’ ~90%
  * 1 â†’ ~10%

# The dataset is **highly imbalanced**, which is common in banking use cases.


# Missing Values Check

data.isnull().sum().sum()

# Duplicate Records Check

data.duplicated().sum()

# Unique ID Validation
data['ID_code'].nunique()

# Feature Variance Check

feature_cols = [col for col in data.columns if col.startswith('var_')]
len(data[feature_cols].var()[data[feature_cols].var() == 0])

# Note on Exploratory Data Analysis (EDA)

```
Due to anonymized feature names, traditional EDA techniques do not provide meaningful insights.
Hence, analysis focuses on:

* Data integrity
* Target distribution
* Model-driven learning

```

5. Data Preprocessing

In [None]:
# Separate Features and Target

X = data.drop(columns=['target'])
y = data['target']

# Drop ID Column

X = X.drop(columns=['ID_code'])

```
**Reason:**
ID_code is a unique identifier and does not contribute to prediction.
```

# Trainâ€“Test Split (Stratified)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Verify Class Distribution

y_train.value_counts(normalize=True), y_test.value_counts(normalize=True)

# Feature Scaling

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

6. Baseline Model â€“ Logistic Regression

In [None]:
# Why Logistic Regression?
```
* Simple and interpretable
* Industry-accepted baseline
* Benchmark for complex models
```

# Model Training

lr_model = LogisticRegression(
    max_iter=1000,
    random_state=42,
    n_jobs=-1
)

lr_model.fit(X_train_scaled, y_train)

# Predictions & Evaluation

y_train_pred_proba = lr_model.predict_proba(X_train_scaled)[:, 1]
y_test_pred_proba = lr_model.predict_proba(X_test_scaled)[:, 1]

roc_auc_score(y_train, y_train_pred_proba), roc_auc_score(y_test, y_test_pred_proba)

# Classification Report

y_test_pred = (y_test_pred_proba >= 0.5).astype(int)
print(classification_report(y_test, y_test_pred))

# Confusion Matrix

confusion_matrix(y_test, y_test_pred)

7. Tree-Based Models

In [None]:
# Why Random Forest?

```* Captures non-linear patterns
* Robust to noise
* Handles high-dimensional data```


# Training

rf_model = RandomForestClassifier(
    n_estimators=300,
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train, y_train)

# Evaluation

y_train_rf = rf_model.predict_proba(X_train)[:, 1]
y_test_rf = rf_model.predict_proba(X_test)[:, 1]

roc_auc_score(y_train, y_train_rf), roc_auc_score(y_test, y_test_rf)

# Classification Report

y_test_pred_rf = (y_test_rf >= 0.5).astype(int)
print(classification_report(y_test, y_test_pred_rf))

7.1 Random Forest

### Random Forest Summary

* Improved recall (~0.40)
* High training score â†’ overfitting risk

7.2 Gradient Boosting â€“ XGBoost

#### Why XGBoost?

* Best-in-class for tabular data
* Better generalization
* Widely used in banking

In [None]:
# Training

xgb_model = XGBClassifier(
    n_estimators=500,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    eval_metric='auc',
    random_state=42,
    n_jobs=-1
)

xgb_model.fit(X_train, y_train)

# Evaluation

y_train_xgb = xgb_model.predict_proba(X_train)[:, 1]
y_test_xgb = xgb_model.predict_proba(X_test)[:, 1]

roc_auc_score(y_train, y_train_xgb), roc_auc_score(y_test, y_test_xgb)

# Classification Report

y_test_pred_xgb = (y_test_xgb >= 0.5).astype(int)
print(classification_report(y_test, y_test_pred_xgb))


8. Model Comparison Report

In [None]:
comparison = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "XGBoost"],
    "Train ROC-AUC": [0.86, 0.99, 0.95],
    "Test ROC-AUC": [0.85, 0.89, 0.91],
    "Recall (Class 1)": ["~0.30", "~0.40", "~0.45â€“0.50"]
})

comparison

9. Challenges Faced & Solutions


| Challenge           | Solution             | Reason                      |
| ------------------- | -------------------- | --------------------------- |
| Anonymized features | Model-based learning | Semantics unavailable       |
| Class imbalance     | ROC-AUC metric       | Robust to imbalance         |
| High dimensionality | Tree ensembles       | Handle feature interactions |
| Overfitting         | Gradient boosting    | Better generalization       |
| Scaling requirement | StandardScaler       | Needed for LR               |

10. Final Conclusion

This project successfully built a **customer transaction prediction system** using anonymized banking data.
Multiple models were evaluated, and **XGBoost** emerged as the best-performing model.

The solution is:

* Robust
* Scalable
* Suitable for real-world banking use

**Future work** may include:

* Threshold optimization
* SHAP-based explainability
* Model deployment