# Credit Score Classification

<a href="https://www.kaggle.com/datasets/parisrohan/credit-score-classification?select=train.csv"> Credit Score Classification Kaggle </a>

**Problem Statement** <br>
You are working as a data scientist in a global finance company. Over the years, the company has collected basic bank details and gathered a lot of credit-related information. Management wants to build an intelligent system that sorts people into credit score brackets to reduce manual effort.

**Task** <br>
Given a person’s credit-related information, build a machine learning model that can classify the credit score.



In [None]:
# import libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# import data

train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')

In [None]:
train.head(4)

In [None]:
test.head(4)

In [None]:
# work on a copy so the in-place edits below leave the raw training data unchanged
df = train.copy()

In [None]:
df['Credit_Score'].unique()

## Feature Engineering

### Feature Selection

In [None]:
print(df.columns)

In [None]:
df.describe(include='all')

In [None]:
df.info()

In [None]:
df.drop(['ID', 'Customer_ID', 'Name', 'Month', 'SSN', 'Monthly_Inhand_Salary'], axis=1, inplace=True)

### Dealing with Missing Data

In [None]:
df.isnull().sum()

In [None]:
missing = df.isnull().sum()
missing = missing[missing>0]
missing

In [None]:
missing_cols = list(missing.index)
missing_cols

In [None]:
df[missing_cols].info()

Clearly, we need to change the dtype of certain features from object to float or integer.
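
As a minimal sketch of the conversion described above, the snippet below cleans a small made-up series that mixes valid numbers, stray underscores, and missing markers. The helper name `clean_numeric` and the sample values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

def clean_numeric(series: pd.Series) -> pd.Series:
    """Strip stray underscores and convert to a numeric dtype; unparsable entries become NaN."""
    cleaned = series.astype(str).str.replace("_", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")

# made-up sample values, for illustration only
sample = pd.Series(["7", "3_", None, "_", "12"])
numeric = clean_numeric(sample)
print(numeric.dtype)
print(numeric.tolist())
```

Using `errors="coerce"` leaves the decision about how to fill the resulting NaN values (zero, median, back fill) as a separate, explicit step.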

In [None]:
df['Num_of_Delayed_Payment'] = df['Num_of_Delayed_Payment'].apply(lambda x: str(x).replace("None", "0"))
df['Num_of_Delayed_Payment'] = df['Num_of_Delayed_Payment'].apply(lambda x: str(x).replace("nan", "0"))
df['Num_of_Delayed_Payment'] = df['Num_of_Delayed_Payment'].apply(lambda x: str(x).replace("_", " "))

df['Num_of_Delayed_Payment'] = pd.to_numeric(df['Num_of_Delayed_Payment'])

In [None]:
# back fill: each missing value takes the next valid value below it
df['Num_Credit_Inquiries'] = df['Num_Credit_Inquiries'].bfill()
df['Num_Credit_Inquiries'].isnull().sum()

In [None]:
df['Credit_History_Age'] = df['Credit_History_Age']

In [None]:
# convert Credit_History_Age from string dtype to float (total months): capture the year and month numbers with a regex

# the first number is read as years and the second as months, so two-digit month values are kept whole
parts = df['Credit_History_Age'].str.extract(r'(\d+)\D+(\d+)').astype(float)
df['Credit_History_Age'] = (parts[0] * 12 + parts[1]).fillna(0)

In [None]:
df['Amount_invested_monthly'].isnull().sum()

In [None]:
df['Amount_invested_monthly'] = df['Amount_invested_monthly'].apply(lambda x:str(x).replace("_", " "))
df['Amount_invested_monthly'] = df['Amount_invested_monthly'].apply(lambda x:str(x).replace("nan", "0"))
df['Amount_invested_monthly'] = df['Amount_invested_monthly'].apply(lambda x:str(x).replace("None", "0"))
df['Amount_invested_monthly'] = pd.to_numeric(df['Amount_invested_monthly'])

In [None]:
df['Amount_invested_monthly'].isnull().sum()

In [None]:
df['Monthly_Balance'] = df['Monthly_Balance'].apply(lambda x:str(x).replace("_", " "))
df['Monthly_Balance'] = df['Monthly_Balance'].apply(lambda x:str(x).replace("None", "0"))
df['Monthly_Balance'] = df['Monthly_Balance'].apply(lambda x:str(x).replace("nan", "0"))
df['Monthly_Balance'] = df['Monthly_Balance'].str[:7]
# entries that cannot be parsed become NaN, so the column always ends up numeric
df['Monthly_Balance'] = pd.to_numeric(df['Monthly_Balance'], errors='coerce')

In [None]:
# Type_of_Loan is too chaotic to deal with, so drop it

df.drop(['Type_of_Loan'], axis=1, inplace=True)

In [None]:
df.info()

In [None]:
df['Age'] = df['Age'].apply(lambda x:str(x).replace("_", " "))
df['Age'] = pd.to_numeric(df['Age'], downcast='integer')

In [None]:
df['Num_of_Loan'] = df['Num_of_Loan'].apply(lambda x:str(x).replace("_", " "))
df['Num_of_Loan'] = pd.to_numeric(df['Num_of_Loan'], downcast='integer')

In [None]:
df['Outstanding_Debt'] = df['Outstanding_Debt'].apply(lambda x:str(x).replace("_", " "))
df['Outstanding_Debt'] = pd.to_numeric(df['Outstanding_Debt'], downcast='float')

In [None]:
df['Changed_Credit_Limit'].unique()

In [None]:
df['Changed_Credit_Limit'] = df['Changed_Credit_Limit'].apply(lambda x:str(x).replace("_", " "))
df['Changed_Credit_Limit'] = df['Changed_Credit_Limit'].apply(lambda x:str(x).replace(" ", "0"))
df['Changed_Credit_Limit'] = df['Changed_Credit_Limit'].str[:7]
df['Changed_Credit_Limit'] = pd.to_numeric(df['Changed_Credit_Limit'], downcast='float')

In [None]:
df['Annual_Income'] = df['Annual_Income'].apply(lambda x:str(x).replace("_", " "))
df['Annual_Income'] = pd.to_numeric(df['Annual_Income'], downcast='float')

### Encoding the Data

In [None]:
objects = [column for column, is_type in (df.dtypes=='object').items() if is_type]
print('Object type columns in dataset :', objects, '\n\n')
for obj in objects:
    print(df[obj].head(2), '\n------------------------------\n')

In [None]:
objects_dict = {column: list(df[column].unique()) for column in df.select_dtypes('object').columns} 

In [None]:
for k, v in objects_dict.items():
    print(k, ':', v, '\n-----------------\n')

Ordinal features can be both categorized and ranked (for example Good/Better), whereas nominal features can only be categorized (for example Male/Female). <br>
Ordinal features are encoded with the label encoding technique, while nominal features are encoded with the one-hot encoding technique.
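
To make the distinction concrete, here is a small sketch on made-up rows. It swaps in `OrdinalEncoder` with an explicit category list for the ordinal column, because `LabelEncoder` assigns codes in sorted (alphabetical) order rather than by rank; the ranking `Bad < Standard < Good` and the toy data are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# made-up rows, for illustration only
toy = pd.DataFrame({
    "Credit_Mix": ["Good", "Bad", "Standard", "Good"],
    "Occupation": ["Teacher", "Doctor", "Teacher", "Lawyer"],
})

# ordinal feature: pass the ranking explicitly so the integer codes follow it
ordinal = OrdinalEncoder(categories=[["Bad", "Standard", "Good"]])
toy["Credit_Mix_code"] = ordinal.fit_transform(toy[["Credit_Mix"]]).ravel()

# nominal feature: one indicator column per category, with no implied order
one_hot = pd.get_dummies(toy["Occupation"], prefix="Occupation", dtype=int)
encoded = toy.drop(columns=["Occupation"]).join(one_hot)
print(encoded)
```

With the explicit ranking, `Bad` maps to 0, `Standard` to 1, and `Good` to 2, while each occupation gets its own 0/1 column.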

In [None]:
ordinal_categorical_features = ['Credit_Score', 'Payment_of_Min_Amount', 'Credit_Mix']

nominal_categorical_features = ['Occupation', 'Payment_Behaviour']

In [10]:
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()

def LabelEncode(column):
    df[column] = labelencoder.fit_transform(df[column])

In [11]:
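from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()

def oneHotEncode(df, column):
    encoded = enc.fit_transform(df[[column]]).toarray()
    # the encoder sorts its categories, so take the column names from the fitted encoder
    # (df[column].unique() order would attach the wrong label to each indicator column)
    encoded_cols = pd.DataFrame(encoded, columns=enc.categories_[0], index=df.index)
    df = df.join(encoded_cols)
    df.drop(column, axis=1, inplace=True)
    return df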

In [None]:
for col in ordinal_categorical_features:
    LabelEncode(col)

In [None]:
for col in nominal_categorical_features:
    df = oneHotEncode(df, col)

In [None]:
df.columns

## Exploratory Data Analysis

In [None]:
df.describe()

In [12]:
from matplotlib import colors
from matplotlib.colors import ListedColormap

In [None]:
# Pairplot

sns.set(rc={"axes.facecolor":"#CEE8C8","figure.facecolor":"#FFFFF5"})
pallet = ["#FAD89F", "#53726D", "#424141", "#FFFFF5"]
cmap = colors.ListedColormap(["#ACC7EF", "#FAD89F", "#53726D", "#424141", "#FFFFF5"])

#Plotting following features
features = [ "Age", "Interest_Rate", "Delay_from_due_date", "Num_of_Delayed_Payment", "Credit_Mix"]
print("Relative Plot Of Features")
# sns.pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(df[features], hue= "Credit_Mix", palette= (["#ACC7EF", "#FAD89F", "#53726D", "#424141"]))

plt.show()

In [None]:
# Correlation Heatmap

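# restrict to numeric columns so any remaining object-dtype column does not raise an error
corrmat = df.corr(numeric_only=True)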
plt.figure(figsize=(20,20))  
sns.heatmap(corrmat, annot=True, cmap=cmap, center=0)

In [None]:
before_pipeline_df = df

In [None]:
# import libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

%matplotlib inline

## Pipeline creation

In [1]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

# The DropColumns class inherits from the sklearn.base classes (BaseEstimator, TransformerMixin)
# This makes it compatible with scikit-learn’s Pipelines

class DropColumns(BaseEstimator, TransformerMixin):
    '''
    Drop columns from DataFrame
    Inherit from the sklearn.base classes (BaseEstimator, TransformerMixin) to make this compatible with scikit-learn’s Pipelines
    '''
    
    # initializer 
    def __init__(self, columns):
        # list of columns we derived that needs to be dropped
        self.columns = columns
        
        
    def fit(self, X, y=None):    #, columns
        # self.columns = columns
        return self
    
    def transform(self, X, y=None):
        # return a copy of the dataframe without the listed features; names missing from X are skipped
        return X.drop(columns=self.columns, errors='ignore')
    
    def fit_transform(self, X, y=None):
        return self.transform(X)
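
As a small usage sketch of the custom-transformer pattern, the snippet below runs a column-dropping transformer inside a `Pipeline` on a made-up frame. `ColumnDropper` is a hypothetical stand-in written only for this example; it relies on `TransformerMixin` to supply `fit_transform` and returns a new DataFrame instead of modifying its input.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class ColumnDropper(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for DropColumns that returns a new DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        # errors="ignore" skips names that are not present in X
        return X.drop(columns=self.columns, errors="ignore")

# made-up frame, for illustration only
raw = pd.DataFrame({"ID": [1, 2], "Age": [23, 41], "Name": ["a", "b"]})
pipe = Pipeline([("drop", ColumnDropper(columns=["ID", "Name", "SSN"]))])
out = pipe.fit_transform(raw)
print(list(out.columns))
print(list(raw.columns))
```

Because `transform` does not touch `raw`, the same frame can be passed through the pipeline again, which is convenient when re-running notebook cells.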

In [2]:
class str_to_num(BaseEstimator, TransformerMixin):
    '''
    Convert DataFrame columns from Object Dtype to numeric Dtype while performing general cleaning for this dataset
    Inherit from the sklearn.base classes (BaseEstimator, TransformerMixin) to make this compatible with scikit-learn’s Pipelines
    '''
    
    def __init__(self, columns):
        self.columns = columns
    
    def custom_feature_handling(self, X):
        '''
        Convert 'Credit_History_Age' feature from Object Dtype to float (total months) by capturing the year and month numbers with a regex
        Fill empty data points in the 'Num_Credit_Inquiries' column
        '''
        
        # the first number is read as years and the second as months, so two-digit month values are kept whole
        parts = X['Credit_History_Age'].str.extract(r'(\d+)\D+(\d+)').astype(float)
        X['Credit_History_Age'] = (parts[0] * 12 + parts[1]).fillna(0)
        
        # fill 'Num_Credit_Inquiries' NaN points with back fill method
        X['Num_Credit_Inquiries'] = X['Num_Credit_Inquiries'].bfill()
        return X
    
    def fit(self, X=None, y=None):
        return self
    
    def transform(self, X, y=None):
        for column in self.columns:
            # drop trailing underscores (e.g. '28_') so the number survives; entries that still cannot be parsed become NaN
            X[column] = pd.to_numeric(X[column].astype(str).str.rstrip('_'), errors='coerce')
        X = self.custom_feature_handling(X)
        X = X.fillna(0)
        return X
    
    def fit_transform(self, X, y=None):
        return self.transform(X)

In [3]:
class categorical_encoding(BaseEstimator, TransformerMixin):
    '''
    Perform Categorical encoding on Ordinal and Nominal Categorical features accordingly
    Inherit from the sklearn.base classes (BaseEstimator, TransformerMixin) to make this compatible with scikit-learn’s Pipelines
    '''
    
    def __init__(self):
        self.ordinal_categorical_features = ['Payment_of_Min_Amount', 'Credit_Mix', 'Credit_Score'] 
        self.nominal_categorical_features = ['Occupation', 'Payment_Behaviour']
        self.features_category_dict = {'Credit_Mix': ['_', 'Good', 'Standard', 'Bad'], 'Payment_of_Min_Amount': ['No', 'NM', 'Yes'], 'Credit_Score': ['Good', 'Standard', 'Poor'],
                             'Occupation': ['Scientist', '_______', 'Teacher', 'Engineer', 'Entrepreneur', 'Developer', 'Lawyer', 'Media_Manager', 'Doctor', 'Journalist', 'Manager', 'Accountant', 'Musician', 'Mechanic', 'Writer', 'Architect'], 
                             'Payment_Behaviour': ['High_spent_Small_value_payments', 'Low_spent_Large_value_payments', 'Low_spent_Medium_value_payments', 'Low_spent_Small_value_payments', 'High_spent_Medium_value_payments', '!@9#%8', 'High_spent_Large_value_payments']}
        self.labelencoder = LabelEncoder()
        self.ordinalencoder = OneHotEncoder()
        
    def fit(self, X=None, y=None):
        return self
        
    
    def transform(self, X, y=None):
        # labelencoding
        for col in self.ordinal_categorical_features:
            X[col] = self.labelencoder.fit_transform(X[col])
        # onehotencoding
        for col in self.nominal_categorical_features:
            encoded = self.ordinalencoder.fit_transform(X[[col]]).toarray()
            # name the new columns from the fitted encoder (it sorts categories) so each label matches its indicator column
            encoded_cols = pd.DataFrame(encoded, columns=self.ordinalencoder.categories_[0], index=X.index)
            X = X.join(encoded_cols)
            X.drop(col, axis=1, inplace=True)
        return X
    
    def fit_transform(self, X, y=None):
        return self.transform(X)

In [4]:
## standardscaler transformer for selected features

from sklearn.preprocessing import StandardScaler

class scaler(BaseEstimator, TransformerMixin):
    '''
    implement StandardScaler from sklearn.preprocessing module on selected features/columns
    '''
    
    def __init__(self, columns):
        self.columns = columns
        self.is_fitted = False
        self.scaler = StandardScaler()
        
    def fit(self, X, y=None):
        self.is_fitted = True
        self.scaler.fit(X[self.columns])
        return self
    
    def transform(self, X, y=None):
        if not self.is_fitted:
            raise Exception("call fit() before transform() on X")
        X[self.columns] = self.scaler.transform(X[self.columns])
        return X
        
    def fit_transform(self, X, y=None):
        # fit once, then reuse transform() so the scaler is not fitted a second time
        return self.fit(X).transform(X)
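
The reason for keeping `fit` and `transform` separate is that statistics learned on one dataset can then be reused on another. A minimal sketch with plain `StandardScaler` and two made-up frames (the names `train_part`, `test_part`, and `demo_scaler` are illustrative assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# made-up frames, for illustration only
train_part = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Num_of_Loan": [1.0, 2.0, 3.0]})
test_part = pd.DataFrame({"Age": [30.0, 50.0], "Num_of_Loan": [2.0, 4.0]})

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train_part)  # learns mean and std from the training rows
test_scaled = demo_scaler.transform(test_part)        # reuses those statistics, learns nothing new

print(demo_scaler.mean_)
print(test_scaled)
```

An age of 30 in `test_part` maps to 0 because 30 is the mean of the training ages, not of the test ages.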

### Using custom transformers to create ML Pipeline

In [5]:
drop_columns = ['ID', 'Customer_ID', 'Name', 'Month', 'SSN', 'Monthly_Inhand_Salary', 'Type_of_Loan']
str_columns = ['Num_of_Delayed_Payment', 'Amount_invested_monthly', 'Monthly_Balance', 'Age', 'Num_of_Loan', 'Outstanding_Debt', 'Changed_Credit_Limit', 'Annual_Income']
scale_cols = ['Age', 'Annual_Income', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan', 'Delay_from_due_date', 'Num_of_Delayed_Payment', 
              'Changed_Credit_Limit', 'Num_Credit_Inquiries', 'Credit_Mix', 'Outstanding_Debt', 'Credit_Utilization_Ratio', 'Credit_History_Age',
              'Payment_of_Min_Amount', 'Total_EMI_per_month', 'Amount_invested_monthly', 'Monthly_Balance']

### `Pipeline`

In [8]:
X_train = pd.read_csv("train.csv")

In [13]:
## Pipeline

from sklearn.pipeline import Pipeline

custom_transformers = [('drop', DropColumns(columns=drop_columns)), ('convert', str_to_num(columns=str_columns)),  
                ('encode', categorical_encoding()), ('StandardScaler', scaler(columns=scale_cols))]     
    
transformers = Pipeline(custom_transformers)
# transformers

In [14]:
X = transformers.fit_transform(X_train)

### train_test_split

Separate the target column from the transformed features, then split the data into training and test sets.

In [15]:
from sklearn.model_selection import train_test_split

X, y = X.drop('Credit_Score', axis=1), X['Credit_Score']

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
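
If reproducible splits or preserved class proportions matter, `train_test_split` also accepts `random_state` and `stratify`. The sketch below uses made-up imbalanced labels; the seed value 42 and the array names are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up imbalanced labels: 80 rows of class 0 and 20 rows of class 1
features = np.arange(200).reshape(100, 2)
labels = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42
)
print(X_tr.shape, X_te.shape)
print(y_te.mean())
```

With `stratify=labels`, the 20% share of class 1 is kept in both parts, and fixing `random_state` returns the same rows on every run.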

## Machine Learning for prediction

### import libraries for model evaluation

In [17]:
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

### import libraries for Machine Learning algorithms

In [18]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

from sklearn.svm import SVC
svm = SVC()

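# KNeighborsClassifier is the supervised k-NN model; NearestNeighbors is unsupervised and has no predict()
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)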

from sklearn.tree import DecisionTreeClassifier
dectree = DecisionTreeClassifier(random_state=0)

from sklearn.ensemble import RandomForestClassifier
randfor = RandomForestClassifier()

from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier()

from xgboost import XGBClassifier
xgb = XGBClassifier()

In [28]:
def predictions(algorithms):
    rows = []
    for algo in algorithms:
        algo.fit(X_train, y_train)
        pred = algo.predict(X_test)
        # scikit-learn metrics take the true labels first, then the predictions
        accuracy = round(metrics.accuracy_score(y_test, pred), 4)
        cross_val = cross_val_score(algo, X, y, cv=5).mean()
        conf_matrix = confusion_matrix(y_test, pred)
        rows.append([type(algo).__name__, accuracy, cross_val, conf_matrix])
    # collect one row per algorithm and build the summary table at the end
    return pd.DataFrame(rows, columns=['Algorithm', 'Accuracy', 'Cross Validation Score', 'Confusion Matrix'])
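
scikit-learn's metric functions take the true labels first and the predictions second. Accuracy is symmetric in the two arguments, but the confusion matrix is not: rows correspond to true classes and columns to predicted classes. A short sketch on made-up labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# made-up labels, for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(accuracy_score(y_true, y_pred))
cm = confusion_matrix(y_true, y_pred)
print(cm)

# swapping the arguments transposes the matrix
print(confusion_matrix(y_pred, y_true))
```

Reading the first row of `cm`: of the two samples whose true class is 0, one was predicted as 0 and one as 1.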

In [None]:
algorithms = [logreg, svm, knn, dectree, randfor, adaboost, xgb]
result = predictions(algorithms)
print(result)

### GridSearch for best fitting Machine Learning model with optimal hyperparameters

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# GridSearch for each Machine Learning Algorithm's optimal parameters
parameters_dict = dict() ## (key: str), (value: (ML method(), parameters))
parameters_dict['logreg_parameters'] = (LogisticRegression(), {'C': [0.01, 0.03, 0.1, 0.3, 1, 3]})
# C must be strictly positive, so 0 is not part of the grid
parameters_dict['svm_parameters'] = (SVC(), {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'), 'C': [0.001,0.005,0.01,0.5,0.1,1,2,5,10,50,100,500,1000]})
parameters_dict['knn_parameters'] = (KNeighborsClassifier(), {'n_neighbors': np.arange(81, 121, 2), 'weights':['uniform', 'distance']})
parameters_dict['dectree_parameters'] = (DecisionTreeClassifier(), {'criterion':['gini','entropy', 'log_loss']})
parameters_dict['randfor_parameters'] = (RandomForestClassifier(), {'criterion':['gini','entropy', 'log_loss']})
parameters_dict['adaboost_parameters'] = (AdaBoostClassifier(), {'n_estimators': [10, 50, 100, 500], 'learning_rate': [0.0001, 0.001, 0.01, 0.03, 0.1, 0.3, 1], 'algorithm': ['SAMME', 'SAMME.R']})
parameters_dict['xgb_parameters'] = (XGBClassifier(), {'nthread':[4], 'learning_rate': [0.01, 0.03, 0.1, 0.3, 1], 'max_depth': [6, 7, 8]})

In [None]:
rows = []

for algo, parameters in parameters_dict.values():
    model = GridSearchCV(algo, parameters)
    model.fit(X, y)
    best_score, best_parameters = model.best_score_, model.best_params_
    rows.append([type(algo).__name__, best_score, best_parameters])

# collect one row per algorithm and build the summary table at the end
results = pd.DataFrame(rows, columns=['algorithm', 'best score', 'best parameters'])

In [None]:
print(results)
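
For a quick check of how `GridSearchCV` behaves before launching the full search, it can be tried on a small synthetic problem. The sketch below uses `make_classification` data as a stand-in for the transformed credit features; the grid values, `max_iter=1000`, and the names `X_demo`, `y_demo`, and `search` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic data standing in for the transformed credit features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(round(search.best_score_, 4))
```

`best_score_` is the mean cross-validated score of the best parameter combination, and `cv_results_` holds the scores for every combination tried.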

## References

<a href="https://www.kaggle.com/code/adityaprataprathore/credit-score-classification-ml-basics">Credit Score Classification: ML Basics </a>

<a href="https://www.kaggle.com/code/fitrahrmdhn/cleaning-preprocessing-knn-param-tuning">Cleaning, Preprocessing, KNN Param Tuning</a>

<a href="https://towardsdatascience.com/creating-custom-transformers-for-sklearn-pipelines-d3d51852ecc1">Creating Custom Transformers for sklearn Pipelines </a>

<a href="https://www.andrewvillazon.com/custom-scikit-learn-transformers/"> Creating custom scikit-learn Transformers </a>
