# Credit Score Classification Project

### Problem Statement
This project focuses on automating credit score classification for a global finance company. The company aims to use historical credit data to categorize customers into groups such as *Good*, *Standard*, and *Poor*, reducing manual review and improving credit decision-making efficiency.

### Objective
The goal was to build a supervised machine learning model capable of predicting a person’s **Credit Score** category based on their financial and behavioral data.

### Approach
1. **Data Preparation:** Separated the target column, imputed missing values, and cast categorical columns to strings.  
2. **Feature Processing:** Used separate pipelines for numerical features (median imputation, standard scaling) and categorical features (most-frequent imputation, ordinal encoding).  
3. **Modeling:** Applied a **Random Forest Classifier** for its stability on tabular data once categorical features are encoded as numbers.  
4. **Evaluation:** Held out a stratified 15% validation set to measure accuracy and per-class precision, recall, and F1.  
5. **Prediction:** Generated and exported predictions for unseen data (`submission.csv`).

### Tools & Libraries
Python, Pandas, NumPy, Scikit-learn, Jupyter Notebook / VS Code

### Limitations
As this project was part of a *Python Machine Learning* course, the focus was on building a clear, functional pipeline rather than achieving maximum accuracy. Advanced methods like hyperparameter tuning or feature engineering were not fully implemented due to time and resource limits.

### Conclusion
This project demonstrates a complete, functional credit score classification pipeline that reaches about 77% validation accuracy. Despite its simplicity, it shows how machine learning can help automate real-world financial decision-making.


### Install All Required Libraries

In [12]:
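import importlib
import subprocess
import sys

def install(pkg):
    # Install into the interpreter that is running this notebook
    subprocess.check_call([sys.executable, "-m", "pip", "install", pkg, "--quiet"])

# Map each pip package name to its import name (scikit-learn imports as sklearn)
packages = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
}

for pkg, module in packages.items():
    try:
        importlib.import_module(module)
        print(f" {pkg} already installed.")
    except ImportError:
        print(f" Installing {pkg} ...")
        install(pkg)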


 pandas already installed.
 numpy already installed.
 Installing scikit-learn ...


### Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


### Load Data

In [13]:
# Make sure your train.csv and test.csv are in the same folder as this script
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print("Data Loaded Successfully!")
print("Train shape:", train.shape)
print("Test shape:", test.shape)
train.head()



Data Loaded Successfully!
Train shape: (100000, 28)
Test shape: (50000, 27)


| Unnamed: 0 | ID | Customer_ID | Month | Name | Age | SSN | Occupation | Annual_Income | Monthly_Inhand_Salary | Num_Bank_Accounts | ... | Credit_Mix | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0x1602 | CUS_0xd40 | January | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | _ | 809.98 | 26.82262 | 22 Years and 1 Months | No | 49.574949 | 80.41529543900253 | High_spent_Small_value_payments | 312.49408867943663 | Good |
| 1 | 0x1603 | CUS_0xd40 | February | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 |  | 3 | ... | Good | 809.98 | 31.94496 |  | No | 49.574949 | 118.28022162236736 | Low_spent_Large_value_payments | 284.62916249607184 | Good |
| 2 | 0x1604 | CUS_0xd40 | March | Aaron Maashoh | -500 | 821-00-0265 | Scientist | 19114.12 |  | 3 | ... | Good | 809.98 | 28.609352 | 22 Years and 3 Months | No | 49.574949 | 81.699521264648 | Low_spent_Medium_value_payments | 331.2098628537912 | Good |
| 3 | 0x1605 | CUS_0xd40 | April | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 |  | 3 | ... | Good | 809.98 | 31.377862 | 22 Years and 4 Months | No | 49.574949 | 199.4580743910713 | Low_spent_Small_value_payments | 223.45130972736783 | Good |
| 4 | 0x1606 | CUS_0xd40 | May | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | Good | 809.98 | 24.797347 | 22 Years and 5 Months | No | 49.574949 | 41.420153086217326 | High_spent_Medium_value_payments | 341.48923103222177 | Good |
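
The first rows already show a few data-quality issues: an `Age` of -500, a `Credit_Mix` of `_`, and `Credit_History_Age` stored as text such as `22 Years and 1 Months`. The notebook does not clean these values, so the following is only a sketch of how they might be handled. The helper `history_to_months`, the new column `Credit_History_Months`, and the 0 to 120 age bounds are assumptions made for illustration, and the toy frame only mirrors values visible above.

```python
import re

import numpy as np
import pandas as pd


def history_to_months(text):
    """Convert text like '22 Years and 1 Months' to total months; NaN if unparseable."""
    if not isinstance(text, str):
        return np.nan
    match = re.fullmatch(r"\s*(\d+)\s+Years?\s+and\s+(\d+)\s+Months?\s*", text)
    if match is None:
        return np.nan
    return int(match.group(1)) * 12 + int(match.group(2))


# Toy frame mirroring a few values from the rows shown above
sample = pd.DataFrame({
    "Age": [23, 23, -500],
    "Credit_Mix": ["_", "Good", "Good"],
    "Credit_History_Age": ["22 Years and 1 Months", np.nan, "22 Years and 3 Months"],
})

cleaned = sample.copy()
# Parse the text duration into a numeric month count (hypothetical new column)
cleaned["Credit_History_Months"] = cleaned["Credit_History_Age"].map(history_to_months)
# Treat the "_" placeholder as a missing value
cleaned["Credit_Mix"] = cleaned["Credit_Mix"].mask(cleaned["Credit_Mix"] == "_")
# Blank out implausible ages (the bounds are an assumption, not from the notebook)
cleaned["Age"] = cleaned["Age"].where(cleaned["Age"].between(0, 120))

print(cleaned)
```

Values set to missing this way would then be filled by the imputers in the preprocessing pipelines rather than being treated as valid inputs.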


### Encode Target & Identify Feature Types

In [14]:
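# Rebuild the target and feature matrix from the raw training frame on every run,
# so that re-running this cell never re-encodes an already-encoded target
target_col = "Credit_Score"
X = train.drop(columns=[target_col])
y = train[target_col]

# Encode target labels as integers (LabelEncoder sorts the classes alphabetically)
le = LabelEncoder()
y = le.fit_transform(y)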

# Separate numeric & categorical features
numeric_feats = X.select_dtypes(include=[np.number]).columns.tolist()
cat_feats = X.select_dtypes(exclude=[np.number]).columns.tolist()

print(" Target encoded successfully.")
print("Numeric features:", len(numeric_feats))
print("Categorical features:", len(cat_feats))


 Target encoded successfully.
Numeric features: 8
Categorical features: 19
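
Only 8 of the 27 features are detected as numeric, which may mean that some numeric columns are stored as text, and the identifier columns (`ID`, `Customer_ID`, `Name`, `SSN`) are still among the features. The sketch below shows one way to check both points on a toy frame. The values, the `id_cols` list, and the helper `numeric_fraction` are assumptions for illustration; the notebook itself does not perform this step.

```python
import pandas as pd

# Toy frame: column names come from the head() output, the values are made up
df = pd.DataFrame({
    "ID": ["0x1602", "0x1603", "0x1604", "0x1605"],
    "SSN": ["821-00-0265", "821-00-0265", "004-07-5839", "004-07-5839"],
    "Annual_Income": ["19114.12", "19114.12", "34847.84_", "143162.64"],
    "Occupation": ["Scientist", "Scientist", "Teacher", "Engineer"],
})

# Assumed identifier columns; drop the ones that are present
id_cols = ["ID", "Customer_ID", "Name", "SSN"]
features = df.drop(columns=[c for c in id_cols if c in df.columns])


def numeric_fraction(series):
    """Share of non-null values that can be parsed as numbers."""
    non_null = series.dropna()
    if non_null.empty:
        return 0.0
    parsed = pd.to_numeric(non_null, errors="coerce")
    return float(parsed.notna().mean())


# Report how "numeric" each non-numeric column looks
report = {
    col: numeric_fraction(features[col])
    for col in features.select_dtypes(exclude="number").columns
}
print(report)
```

A column with a fraction close to 1.0 is a candidate for cleaning and conversion with `pd.to_numeric` before the numeric and categorical feature lists are built.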


### Build Preprocessing Pipelines

In [15]:
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_feats),
    ("cat", categorical_transformer, cat_feats)
])

print(" Preprocessor pipeline created successfully.")


 Preprocessor pipeline created successfully.
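
The categorical pipeline relies on `OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)` so that categories first seen at prediction time do not raise an error. The following minimal sketch illustrates that behavior in isolation; the occupation values other than `Scientist` are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the categories seen during training (one column, three rows)
train_occupations = np.array([["Scientist"], ["Teacher"], ["Engineer"]], dtype=object)
encoder.fit(train_occupations)

# Learned categories are sorted: Engineer -> 0, Scientist -> 1, Teacher -> 2
print(encoder.categories_[0])

# "Astronaut" was never seen during fit, so it is encoded as -1
new_occupations = np.array([["Teacher"], ["Astronaut"]], dtype=object)
encoded = encoder.transform(new_occupations)
print(encoded.ravel())  # Teacher -> 2.0, Astronaut -> -1.0
```

Note that ordinal codes impose an arbitrary order on the categories. Tree-based models such as a random forest tend to tolerate this, which is presumably why it is used here instead of one-hot encoding.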


### Build & Train Model

In [18]:
# Cast categorical columns to strings once, before splitting, to prevent
# mixed-type errors in the encoder. Note that missing values become the
# string "nan" and are then treated as a category of their own.
X[cat_feats] = X[cat_feats].astype(str)

model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])

# Stratified train-validation split (85% / 15%)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)

# Fit the model
model.fit(X_train, y_train)
val_preds = model.predict(X_val)

print("Model trained successfully (fixed version).")
print("Validation Accuracy:", accuracy_score(y_val, val_preds))
print("\nClassification Report:\n", classification_report(y_val, val_preds, target_names=[str(c) for c in le.classes_]))


Model trained successfully (fixed version).
Validation Accuracy: 0.7698666666666667

Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.70      0.70      2674
           1       0.78      0.74      0.76      4350
           2       0.78      0.81      0.80      7976

    accuracy                           0.77     15000
   macro avg       0.76      0.75      0.75     15000
weighted avg       0.77      0.77      0.77     15000
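
The report above labels the classes 0, 1, and 2 rather than by name. `LabelEncoder` assigns integer codes in sorted order of the original labels, so assuming the three labels are *Good*, *Poor*, and *Standard* (the categories named in the problem statement), the codes can be mapped back as in this small sketch. The encoder `le_demo` and the toy label list are illustrative and not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Fit a fresh encoder on the assumed label names
le_demo = LabelEncoder()
codes = le_demo.fit_transform(["Good", "Standard", "Poor", "Good"])

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {code: str(label) for code, label in enumerate(le_demo.classes_)}

print(codes.tolist())  # [0, 2, 1, 0]
print(mapping)         # {0: 'Good', 1: 'Poor', 2: 'Standard'}
```

Under that assumption, class 2 (*Standard*) is the largest group in the validation set with 7,976 rows, and class 0 (*Good*) is the smallest with 2,674.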



### Final Training on Full Data + Test Prediction

In [17]:
# Train on full dataset
model.fit(X, y)

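X_test = test.copy()
if target_col in X_test.columns:
    X_test = X_test.drop(columns=[target_col])
# Apply the same string cast that was applied to the training features
X_test[cat_feats] = X_test[cat_feats].astype(str)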

preds = model.predict(X_test)
preds_labels = le.inverse_transform(preds)

print(" Predictions done on test set!")


 Predictions done on test set!


### Save Submission File

In [None]:
if "Customer_ID" in test.columns:
    submission = pd.DataFrame({
        "Customer_ID": test["Customer_ID"],
        "Credit_Score": preds_labels
    })
else:
    submission = pd.DataFrame({"Credit_Score": preds_labels})

submission.to_csv("submission.csv", index=False)
print(" Submission file 'submission.csv' saved successfully!")
print(submission.head())


 Submission file 'submission.csv' saved successfully!
  Customer_ID Credit_Score
0   CUS_0xd40         Good
1   CUS_0xd40         Good
2   CUS_0xd40         Good
3   CUS_0xd40         Good
4  CUS_0x21b1     Standard
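
Before sharing the file, it can help to confirm that it has one prediction per test row and contains only known labels. The helper below is a hypothetical sanity check, not part of the original notebook; the function name `check_submission` and the label list are assumptions.

```python
import pandas as pd


def check_submission(submission, expected_rows, allowed_labels):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(submission) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(submission)}")
    if submission["Credit_Score"].isna().any():
        problems.append("missing predictions")
    unexpected = set(submission["Credit_Score"].dropna()) - set(allowed_labels)
    if unexpected:
        problems.append(f"unexpected labels: {sorted(unexpected)}")
    return problems


# Toy submission mirroring the rows printed above
toy = pd.DataFrame({
    "Customer_ID": ["CUS_0xd40", "CUS_0xd40", "CUS_0x21b1"],
    "Credit_Score": ["Good", "Good", "Standard"],
})
labels = ["Good", "Poor", "Standard"]  # assumed label set

print(check_submission(toy, expected_rows=3, allowed_labels=labels))  # []
```

In the notebook this could be called as `check_submission(submission, len(test), ["Good", "Poor", "Standard"])` right after the file is written.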
