# Credit Score Classification
#### A global finance company has collected customer banking and credit-related data over several years. The objective is to classify customers into credit score brackets automatically with machine learning, reducing manual evaluation effort and improving consistency.
#### Outcome: Build a machine learning classification model that predicts a person’s credit score category from their credit-related attributes.

## Data Preprocessing

### Importing Libraries


In [3]:
!pip install pandas
!pip install scikit-learn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier






### Loading and Managing Dataset

In [4]:
df = pd.read_csv('train.csv')
df.head()
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  object 
 1   Customer_ID               100000 non-null  object 
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  object 
 5   SSN                       100000 non-null  object 
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  object 
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  object 
 13  Type_of_Loan              88592 non-null   ob



ID                              0
Customer_ID                     0
Month                           0
Name                         9985
Age                             0
SSN                             0
Occupation                      0
Annual_Income                   0
Monthly_Inhand_Salary       15002
Num_Bank_Accounts               0
Num_Credit_Card                 0
Interest_Rate                   0
Num_of_Loan                     0
Type_of_Loan                11408
Delay_from_due_date             0
Num_of_Delayed_Payment       7002
Changed_Credit_Limit            0
Num_Credit_Inquiries         1965
Credit_Mix                      0
Outstanding_Debt                0
Credit_Utilization_Ratio        0
Credit_History_Age           9030
Payment_of_Min_Amount           0
Total_EMI_per_month             0
Amount_invested_monthly      4479
Payment_Behaviour               0
Monthly_Balance              1200
Credit_Score                    0
dtype: int64

### Missing Values Handling

In [5]:
for col in df.columns:
    if df[col].dtype == 'object':
        # Text columns: fill gaps with the most frequent value.
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        # Numeric columns: fill gaps with the median.
        df[col] = df[col].fillna(df[col].median())
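
The `df.info()` output lists `Age`, `Annual_Income`, and `Num_of_Loan` as `object`, so the loop fills them with the mode and the later label encoding treats them as categories rather than quantities. The following is a minimal sketch of one way to recover numeric values first. It assumes the non-numeric dtype comes from stray characters such as a trailing underscore, which the notebook does not confirm. The helper `to_numeric_clean` and the toy values are hypothetical.

```python
import pandas as pd

def to_numeric_clean(series: pd.Series) -> pd.Series:
    """Drop every character except digits, minus sign, and decimal point, then parse.

    Values that still cannot be parsed become NaN (errors='coerce').
    """
    cleaned = series.astype(str).str.replace(r"[^0-9.\-]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Toy data standing in for a real column (hypothetical values).
toy = pd.DataFrame({"Age": ["23", "28_", "abc", "41"]})
toy["Age"] = to_numeric_clean(toy["Age"])

# Impute with the median, assigning the result back instead of using inplace=True.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
print(toy["Age"].tolist())
```

Whether this cleaning is appropriate depends on what the raw values look like, so inspecting a sample of each `object` column first would be a sensible check.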


### Encoding Categorical Variables

In [6]:
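# A single encoder is refit for every text column, so after the loop `le`
# only remembers the mapping of the last column it was fitted on.
le = LabelEncoder()

for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].astype(str)  # make every value a string before encoding
        df[col] = le.fit_transform(df[col])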
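
Because the same `LabelEncoder` is refit on each column, the integer codes in the later reports (0, 1, 2) cannot easily be traced back to label names. A minimal sketch of keeping a dedicated encoder for the target is shown here. The label strings are hypothetical stand-ins, since the notebook never prints the raw `Credit_Score` values, and `target_encoder` is an illustrative name.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values; the real classes are whatever appears in Credit_Score.
labels = ["Good", "Poor", "Standard", "Good", "Standard"]

target_encoder = LabelEncoder()
encoded = target_encoder.fit_transform(labels)

# classes_ is sorted, so each integer code is an index into this array.
print(list(target_encoder.classes_))
print(encoded.tolist())

# Map predicted integers back to readable labels.
decoded = target_encoder.inverse_transform([0, 2, 1])
print(list(decoded))
```

Saving `target_encoder` alongside the model would let predictions be reported as labels instead of integers.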

## Train/Test Split

In [7]:
X = df.drop('Credit_Score', axis=1)
y = df['Credit_Score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
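
The split passes `stratify=y`, which keeps the class proportions roughly equal in the training and test sets. A small sketch on toy labels illustrates the effect. The 20/30/50 class mix is made up for the example and is not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 20% class 0, 30% class 1, 50% class 2 (hypothetical).
y_toy = np.array([0] * 200 + [1] * 300 + [2] * 500)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Class shares in each split should match the full data closely.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share, test_share)
```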

### Feature Scaling

In [8]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
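
The scaler is fitted on the training data only and then applied to the test data, which avoids leaking test statistics into training. One way to make that pattern harder to break is to bundle the scaler and the estimator in a scikit-learn pipeline. This is a sketch on synthetic data, not the notebook's own workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the real features (hypothetical).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# fit() learns the scaling from X_tr only; score() reuses it on X_te.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
print(round(acc, 3))
```

A pipeline also means a single object can be saved and reloaded, rather than a separate scaler and model.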

## Model

### Linear Logistic Regression

In [9]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
y_pred_lr = log_reg.predict(X_test_scaled)
acc_lr = accuracy_score(y_test, y_pred_lr)
print("Logistic Regression Accuracy:", acc_lr)

Logistic Regression Accuracy: 0.5924333333333334


### Polynomial Logistic Regression

In [10]:
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

In [11]:
poly_lr = LogisticRegression(max_iter=600)
poly_lr.fit(X_train_poly, y_train)
y_pred_poly = poly_lr.predict(X_test_poly)

In [12]:
acc_poly = accuracy_score(y_test, y_pred_poly)
print("Polynomial Logistic Regression Accuracy:", acc_poly)

Polynomial Logistic Regression Accuracy: 0.6224666666666666
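
A degree-2 expansion adds a bias term, every squared feature, and every pairwise product, so the feature count grows quadratically. The sketch below checks that count for 27 input columns, which is the number of columns in `X` in this notebook (28 columns minus `Credit_Score`). The variable names are illustrative.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 27  # columns in X for this notebook

# Number of monomials of degree <= 2 in n variables, including the bias term.
expected = comb(n_features + 2, 2)

# Fitting on a dummy row is enough to learn the output width.
poly_check = PolynomialFeatures(degree=2).fit(np.zeros((1, n_features)))
print(expected, poly_check.n_output_features_)
```

Going from 27 to 406 columns makes the logistic regression slower to converge, which may be worth keeping in mind when choosing `max_iter`.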


### KNN

In [13]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)

In [14]:
acc_knn = accuracy_score(y_test, y_pred_knn)
print("KNN Accuracy:", acc_knn)

KNN Accuracy: 0.6085


### Naive Bayes

In [15]:
nb = GaussianNB()
nb.fit(X_train_scaled, y_train)
y_pred_nb = nb.predict(X_test_scaled)

In [16]:
acc_nb = accuracy_score(y_test, y_pred_nb)
print("Naive Bayes Accuracy:", acc_nb)

Naive Bayes Accuracy: 0.5871


### PCA

In [17]:
pca = PCA(n_components=0.80)  # keep enough components to explain 80% of the variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
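
Passing a float to `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below shows how the chosen count and the cumulative ratios could be inspected. The data is synthetic and the names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy data: 10 features built from 3 latent factors plus a little noise.
latent = rng.normal(size=(300, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))
X_toy_scaled = StandardScaler().fit_transform(X_toy)

pca_toy = PCA(n_components=0.80).fit(X_toy_scaled)

# Running total of variance explained by the retained components.
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)
print(pca_toy.n_components_, cumulative.round(3))
```

Running the same two lines on the notebook's `pca` object would show how many of the 27 scaled features survive the reduction.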

### SVC

In [18]:
svc = SVC(kernel='rbf')
svc.fit(X_train_pca, y_train)



In [19]:
y_pred_svc = svc.predict(X_test_pca)
acc_svc = accuracy_score(y_test, y_pred_svc)
print("SVC Accuracy:", acc_svc)

SVC Accuracy: 0.5973


### Decision Tree

In [20]:
dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train, y_train)
y_pred_dtree = dtree.predict(X_test)

In [21]:
acc_dtree = accuracy_score(y_test, y_pred_dtree)
print("Decision Tree Accuracy:", acc_dtree)

Decision Tree Accuracy: 0.6757


### Random Forest

In [22]:
Rf = RandomForestClassifier(n_estimators=100, random_state=50)
Rf.fit(X_train, y_train)
y_pred_Rf = Rf.predict(X_test)

In [23]:
acc_Rf = accuracy_score(y_test, y_pred_Rf)
print("Random Forest Accuracy:", acc_Rf)

Random Forest Accuracy: 0.7855666666666666
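
A fitted random forest exposes `feature_importances_`, which can help explain which inputs drive its predictions. This is a minimal sketch on synthetic data. The column names `f0` to `f5` are placeholders, not columns from the dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the real features (hypothetical).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
cols = [f"f{i}" for i in range(6)]

forest = RandomForestClassifier(n_estimators=50, random_state=50).fit(X_toy, y_toy)

# Importances are normalised to sum to 1; sort to see the strongest features first.
importances = pd.Series(forest.feature_importances_, index=cols).sort_values(
    ascending=False
)
print(importances)
```

Applied to the notebook's model, the equivalent call would be `pd.Series(Rf.feature_importances_, index=X.columns)`.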


### ANN

In [24]:
ann = MLPClassifier(hidden_layer_sizes=(60, 30), activation='relu', max_iter=200, random_state=44)
ann.fit(X_train_scaled, y_train)
y_pred_ann = ann.predict(X_test_scaled)



In [25]:
acc_ann = accuracy_score(y_test, y_pred_ann)
print("ANN Accuracy:", acc_ann)

ANN Accuracy: 0.6670333333333334


## Comparison between Multiple Models

In [26]:
models = ['Logistic Regression', 'Polynomial Logistic', 'Decision Tree', 'Random Forest']
accuracies = [acc_lr, acc_poly, acc_dtree, acc_Rf]
comparison = pd.DataFrame({'Model': models, 'Accuracy': accuracies})
print(comparison)

                 Model  Accuracy
0  Logistic Regression  0.592433
1  Polynomial Logistic  0.622467
2        Decision Tree  0.675700
3        Random Forest  0.785567
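
The table lists four of the eight models trained in this notebook. A sketch of a full ranking follows, using the accuracies printed by the individual cells and rounded to four decimals. In the notebook itself the `acc_*` variables could be used in place of the literals.

```python
import pandas as pd

# Accuracies copied from the cell outputs in this notebook, rounded to 4 places.
reported = {
    "Logistic Regression": 0.5924,
    "Polynomial Logistic": 0.6225,
    "KNN": 0.6085,
    "Naive Bayes": 0.5871,
    "SVC (after PCA)": 0.5973,
    "Decision Tree": 0.6757,
    "Random Forest": 0.7856,
    "ANN": 0.6670,
}

# Sort from best to worst so the ranking is visible at a glance.
full_comparison = (
    pd.DataFrame({"Model": list(reported), "Accuracy": list(reported.values())})
    .sort_values("Accuracy", ascending=False)
    .reset_index(drop=True)
)
print(full_comparison)
```

Random Forest remains the strongest model, followed by the Decision Tree and the ANN.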


In [27]:
# Evaluation 1: per-class precision, recall, and F1 for the best model
print("\nClassification Report (Random Forest):")
print(classification_report(y_test, y_pred_Rf))
# Evaluation 2: confusion matrix (rows = true class, columns = predicted class)
print("Confusion Matrix (Random Forest):")
print(confusion_matrix(y_test, y_pred_Rf))


Classification Report (Random Forest):
              precision    recall  f1-score   support

           0       0.75      0.71      0.73      5349
           1       0.78      0.79      0.78      8699
           2       0.80      0.81      0.80     15952

    accuracy                           0.79     30000
   macro avg       0.78      0.77      0.77     30000
weighted avg       0.79      0.79      0.79     30000

Confusion Matrix (Random Forest):
[[ 3816    26  1507]
 [  145  6883  1671]
 [ 1144  1940 12868]]
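
The figures in the classification report can be recomputed directly from the confusion matrix, which makes the link between the two outputs explicit. This sketch uses the matrix printed by the notebook. In scikit-learn's convention, rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the Random Forest output (rows: true, columns: predicted).
cm = np.array([
    [3816, 26, 1507],
    [145, 6883, 1671],
    [1144, 1940, 12868],
])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / actual count per class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted count per class
accuracy = np.trace(cm) / cm.sum()

print(recall.round(2), precision.round(2), round(accuracy, 4))
```

The off-diagonal counts show that most errors involve class 2: class 0 is confused with class 2 in 1507 cases but with class 1 in only 26.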


## Saving Model

In [28]:
import pickle

In [29]:
# Write each object with a context manager so the file handle is closed afterwards.
for obj, path in [(Rf, "rf_model.pkl"), (log_reg, "log_reg_model.pkl"), (scaler, "scaler.pkl")]:
    with open(path, "wb") as f:
        pickle.dump(obj, f)

In [30]:
print("Models and scaler saved successfully")

Models and scaler saved successfully
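
A saved model is only useful if it can be loaded and produces the same predictions. The sketch below round-trips a small stand-in model through `pickle` in a temporary directory. The toy data and the names `toy_model` and `loaded` are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy model standing in for the trained Rf (hypothetical data).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
toy_model = RandomForestClassifier(n_estimators=10, random_state=50).fit(X_toy, y_toy)

# Save and reload using context managers so the files are closed properly.
path = os.path.join(tempfile.mkdtemp(), "rf_model.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

# The reloaded model should reproduce the original predictions exactly.
same = bool((loaded.predict(X_toy) == toy_model.predict(X_toy)).all())
print(same)
```

In this notebook `Rf` was fitted on the unscaled `X_train`, while `log_reg` was fitted on `X_train_scaled`, so the saved scaler is needed only before calling the logistic regression model. Pickle files should be loaded only from trusted sources and with a compatible scikit-learn version.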


In [None]:
X.columns

Index(['ID', 'Customer_ID', 'Month', 'Name', 'Age', 'SSN', 'Occupation',
       'Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts',
       'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan', 'Type_of_Loan',
       'Delay_from_due_date', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit',
       'Num_Credit_Inquiries', 'Credit_Mix', 'Outstanding_Debt',
       'Credit_Utilization_Ratio', 'Credit_History_Age',
       'Payment_of_Min_Amount', 'Total_EMI_per_month',
       'Amount_invested_monthly', 'Payment_Behaviour', 'Monthly_Balance'],
      dtype='object')
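
The feature list includes `ID`, `Customer_ID`, `Name`, and `SSN`, which look like identifiers rather than credit attributes. Label-encoded identifiers can let a model memorise individual customers instead of learning general patterns. The sketch below shows how such columns could be dropped before training. Treating these four columns as identifiers is an assumption, the toy rows are made up, and the accuracies reported in this notebook were obtained with these columns included.

```python
import pandas as pd

# Column names come from the X.columns output; which ones count as identifiers is an assumption.
id_like = ["ID", "Customer_ID", "Name", "SSN"]

# Two made-up rows, just enough to demonstrate the column selection.
toy = pd.DataFrame({
    "ID": ["row_1", "row_2"],
    "Customer_ID": ["cust_1", "cust_1"],
    "Name": ["A", "A"],
    "SSN": ["000-00-0000", "000-00-0000"],
    "Age": [23, 23],
    "Credit_Score": ["Good", "Good"],
})

# Remove identifiers and the target, leaving only candidate features.
X_reduced = toy.drop(columns=id_like + ["Credit_Score"])
print(list(X_reduced.columns))
```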