
## Credit Score Classifier

[Data Source](https://www.kaggle.com/datasets/parisrohan/credit-score-classification)


Credit Scores will have one of three values:
* `Good`
* `Standard`
* `Poor`

##### Data Notes

The same users appear multiple times in both the train and test sets.

In my opinion, this train/test split is not a great way to measure model performance.
The credit scores of a single person in two different months are going to be highly correlated,
which could be perceived as a form of data leakage.

A more reasonable approach would be to keep only one row per person,
so that a person can appear in either train or test, but not both.

As this is for practice, I will allow this data leakage. In a real project, it could let the model overfit to individual customers and make the evaluation scores look better than they really are.
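
For reference, here is a minimal sketch of a split that keeps every customer on one side only. It is not what this notebook does. It uses scikit-learn's `GroupShuffleSplit` on a made-up frame; `toy_df`, its values, and the choice of holding out two customers are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the real data: 6 hypothetical customers, 4 monthly rows each.
toy_df = pd.DataFrame({
    "Customer_ID": np.repeat([f"CUS_{i}" for i in range(6)], 4),
    "Outstanding_Debt": np.arange(24, dtype=float),
    "Credit_Score": np.tile(["Good", "Standard", "Poor"], 8),
})

# An integer test_size is the number of *groups* (customers) to hold out,
# not the number of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=2, random_state=99)
train_idx, test_idx = next(splitter.split(toy_df, groups=toy_df["Customer_ID"]))

toy_train_df = toy_df.iloc[train_idx]
toy_test_df = toy_df.iloc[test_idx]

# No customer should appear on both sides of the split.
shared_customers = set(toy_train_df["Customer_ID"]) & set(toy_test_df["Customer_ID"])
print(f"customers in both splits: {len(shared_customers)}")
```

Splitting by group this way gives up stratification on `Credit_Score`, so the class mix in each split would need to be checked separately.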

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

pd.set_option("display.max_columns", None)
%config Completer.use_jedi = False

#### Read in Data

In [2]:

# since the test.csv doesn't have labels, we can't use it to evaluate model performance
input_path = "../input/credit-score-classification/"
credit_df = pd.read_csv(input_path + "train.csv", low_memory=False)
credit_df.head(n=5)

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Type_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Credit_Score
0,0x1602,CUS_0xd40,January,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",3,7.0,11.27,4.0,_,809.98,26.82262,22 Years and 1 Months,No,49.574949,80.41529543900253,High_spent_Small_value_payments,312.49408867943663,Good
1,0x1603,CUS_0xd40,February,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",-1,,11.27,4.0,Good,809.98,31.94496,,No,49.574949,118.28022162236736,Low_spent_Large_value_payments,284.62916249607184,Good
2,0x1604,CUS_0xd40,March,Aaron Maashoh,-500,821-00-0265,Scientist,19114.12,,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",3,7.0,_,4.0,Good,809.98,28.609352,22 Years and 3 Months,No,49.574949,81.699521264648,Low_spent_Medium_value_payments,331.2098628537912,Good
3,0x1605,CUS_0xd40,April,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",5,4.0,6.27,4.0,Good,809.98,31.377862,22 Years and 4 Months,No,49.574949,199.4580743910713,Low_spent_Small_value_payments,223.45130972736783,Good
4,0x1606,CUS_0xd40,May,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",6,,11.27,4.0,Good,809.98,24.797347,22 Years and 5 Months,No,49.574949,41.420153086217326,High_spent_Medium_value_payments,341.48923103222177,Good


#### Data Cleaning

In [3]:
# ---- Age ----
credit_df["Age"] = pd.to_numeric(credit_df["Age"], errors="coerce")
bad_ages = (credit_df["Age"] < 0) | (credit_df["Age"] > 120)
credit_df.loc[bad_ages, ["Age"]] = np.nan

# ---- Occupation ---- 
occ_one_hot_df = pd.get_dummies(credit_df["Occupation"])
occ_one_hot_df.columns = ["Occupation_"+ c for c in occ_one_hot_df.columns]
credit_df = pd.concat([occ_one_hot_df, credit_df], axis=1)


# ---- Annual_Income ---- 
credit_df["Annual_Income"] = pd.to_numeric(credit_df["Annual_Income"], errors="coerce")

# ---- Amount_invested_monthly ----
credit_df["Amount_invested_monthly"] = pd.to_numeric(credit_df["Amount_invested_monthly"], errors="coerce")

# ---- Num_of_Loan ---- 
credit_df["Num_of_Loan"] = pd.to_numeric(credit_df["Num_of_Loan"], errors="coerce")

# ---- Num_of_Delayed_Payment ----
credit_df["Num_of_Delayed_Payment"] = pd.to_numeric(credit_df["Num_of_Delayed_Payment"], errors="coerce")

# ---- Changed_Credit_Limit ----
credit_df["Changed_Credit_Limit"] = pd.to_numeric(credit_df["Changed_Credit_Limit"], errors="coerce")

# ---- Outstanding_Debt ----
credit_df["Outstanding_Debt"] = pd.to_numeric(credit_df["Outstanding_Debt"], errors="coerce")


# ---- Credit_History_Age ----
credit_age_df = credit_df["Credit_History_Age"].str.split(' Years and ', expand=True)
credit_age_df.columns = ["Years", "Months"]
credit_age_df["Years"] = pd.to_numeric(credit_age_df["Years"])
credit_age_df["Months"] = pd.to_numeric(credit_age_df["Months"].str.split(" ", expand=True)[0])
credit_df["Credit_History_Age"] = credit_age_df["Years"] + credit_age_df["Months"]/12

# ---- Payment_of_Min_Amount ----
min_amount_one_hot_df = pd.get_dummies(credit_df["Payment_of_Min_Amount"])
min_amount_one_hot_df.columns = ["Payment_of_Min_Amount_"+ c for c in min_amount_one_hot_df.columns]
credit_df = pd.concat([min_amount_one_hot_df, credit_df], axis=1)


# ---- Monthly_Balance ----
credit_df["Monthly_Balance"] = pd.to_numeric(credit_df["Monthly_Balance"], errors="coerce")


# ---- Type of Loan --- 
# for string column replace nan with empty string
credit_df.loc[credit_df["Type_of_Loan"].isna(),["Type_of_Loan"]] = ""


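# vectorize our list of loans by splitting on the "," delimiter
# (raw string, so that "\s" reaches the regex engine instead of being
# treated as an invalid string escape)
# NOTE: the last item in a list keeps its leading "and ", so it becomes
# a separate token such as "and auto loan"
count_vectorizer = CountVectorizer(tokenizer=lambda text: re.split(r",\s+|,|\s+,", text))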
loan_types = count_vectorizer.fit_transform(credit_df["Type_of_Loan"])


# convert sparse array back into a dataframe
loan_types_df = pd.DataFrame.sparse.from_spmatrix(loan_types,
                                                  columns=count_vectorizer.get_feature_names_out())

loan_types_df.drop(columns="", inplace=True)
credit_df = pd.concat([loan_types_df, credit_df], axis=1)


# ---- Credit Mix ----
credit_mix_one_hot_df = pd.get_dummies(credit_df["Credit_Mix"])
credit_mix_one_hot_df.columns = ["Credit_Mix_"+ c for c in credit_mix_one_hot_df.columns]
credit_df = pd.concat([credit_mix_one_hot_df, credit_df], axis=1)


# ---- Payment_Behavior ----
pb_one_hot_df = pd.get_dummies(credit_df["Payment_Behaviour"])
pb_one_hot_df.columns = ["Payment_Behavior_"+ c for c in pb_one_hot_df.columns]
credit_df = pd.concat([pb_one_hot_df, credit_df], axis=1)

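# ---- Credit_Score (our target variable) ----
# LabelEncoder assigns integer codes in sorted class order:
# Good=0, Poor=1, Standard=2
le = LabelEncoder()
credit_df["Credit_Score"] = le.fit_transform(credit_df["Credit_Score"])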


# Drop columns that identify a person, or that have already been converted
# into numeric features above.
# NOTE: I am not using "Month" in the model. People might tend to have
# worse credit in December and January (due to holiday spending). However,
# "Month" doesn't directly influence credit ratings.
# "Customer_ID" is kept for now; it is dropped in a later cell.
drop_cols = ["ID", "Month", "Name", "SSN", "Occupation",
             "Type_of_Loan", "Credit_Mix", "Payment_of_Min_Amount",
             "Payment_Behaviour"]


credit_df.drop(drop_cols, axis=1, inplace=True)


# shift column 'Customer_ID' to first position
first_column = credit_df.pop('Customer_ID')
  
# insert column using insert(position,column_name,first_column) function
credit_df.insert(0, 'Customer_ID', first_column)
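
As a small worked example of the `Credit_History_Age` conversion above, the sketch below turns strings like `22 Years and 1 Months` into fractional years. It uses `str.extract` with a regular expression instead of the two `str.split` calls in the cell; the `history` series and the group names `years` and `months` are made up for illustration.

```python
import pandas as pd

# Made-up sample in the same text format as the raw column, including a missing value.
history = pd.Series(["22 Years and 1 Months", None, "22 Years and 3 Months"])

# Pull out the two integers with named capture groups.
# Rows that are missing (or don't match the pattern) come back as NaN.
parts = history.str.extract(r"(?P<years>\d+) Years and (?P<months>\d+) Months").astype(float)

# Convert to fractional years: years + months / 12
history_years = parts["years"] + parts["months"] / 12
print(history_years)
```

The first value works out to 22 + 1/12 ≈ 22.083333, which matches the first row of the cleaned output.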
  

#### View output

In [4]:
credit_df

Unnamed: 0,Customer_ID,Payment_Behavior_!@9#%8,Payment_Behavior_High_spent_Large_value_payments,Payment_Behavior_High_spent_Medium_value_payments,Payment_Behavior_High_spent_Small_value_payments,Payment_Behavior_Low_spent_Large_value_payments,Payment_Behavior_Low_spent_Medium_value_payments,Payment_Behavior_Low_spent_Small_value_payments,Credit_Mix_Bad,Credit_Mix_Good,Credit_Mix_Standard,Credit_Mix__,and auto loan,and credit-builder loan,and debt consolidation loan,and home equity loan,and mortgage loan,and not specified,and payday loan,and personal loan,and student loan,auto loan,credit-builder loan,debt consolidation loan,home equity loan,mortgage loan,not specified,payday loan,personal loan,student loan,Payment_of_Min_Amount_NM,Payment_of_Min_Amount_No,Payment_of_Min_Amount_Yes,Occupation_Accountant,Occupation_Architect,Occupation_Developer,Occupation_Doctor,Occupation_Engineer,Occupation_Entrepreneur,Occupation_Journalist,Occupation_Lawyer,Occupation_Manager,Occupation_Mechanic,Occupation_Media_Manager,Occupation_Musician,Occupation_Scientist,Occupation_Teacher,Occupation_Writer,Occupation________,Age,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Total_EMI_per_month,Amount_invested_monthly,Monthly_Balance,Credit_Score
0,CUS_0xd40,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,23.0,19114.12,1824.843333,3,4,3,4.0,3,7.0,11.27,4.0,809.98,26.822620,22.083333,49.574949,80.415295,312.494089,0
1,CUS_0xd40,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,23.0,19114.12,,3,4,3,4.0,-1,,11.27,4.0,809.98,31.944960,,49.574949,118.280222,284.629162,0
2,CUS_0xd40,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,,19114.12,,3,4,3,4.0,3,7.0,,4.0,809.98,28.609352,22.250000,49.574949,81.699521,331.209863,0
3,CUS_0xd40,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,23.0,19114.12,,3,4,3,4.0,5,4.0,6.27,4.0,809.98,31.377862,22.333333,49.574949,199.458074,223.451310,0
4,CUS_0xd40,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,23.0,19114.12,1824.843333,3,4,3,4.0,6,,11.27,4.0,809.98,24.797347,22.416667,49.574949,41.420153,341.489231,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,CUS_0x942c,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,25.0,39628.99,3359.415833,4,6,7,2.0,23,7.0,11.50,3.0,502.38,34.663572,31.500000,35.104023,60.971333,479.866228,1
99996,CUS_0x942c,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,25.0,39628.99,3359.415833,4,6,7,2.0,18,7.0,11.50,3.0,502.38,40.565631,31.583333,35.104023,54.185950,496.651610,1
99997,CUS_0x942c,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,25.0,39628.99,3359.415833,4,6,5729,2.0,27,6.0,11.50,3.0,502.38,41.255522,31.666667,35.104023,24.028477,516.809083,1
99998,CUS_0x942c,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,25.0,39628.99,3359.415833,4,6,7,2.0,20,,11.50,3.0,502.38,33.638208,31.750000,35.104023,251.672582,319.164979,2


#### Summary of missing values

In [5]:
null_counts = credit_df.isna().sum()
has_nulls = null_counts > 0

null_counts[has_nulls].sort_values()

Outstanding_Debt            1009
Monthly_Balance             1209
Num_Credit_Inquiries        1965
Changed_Credit_Limit        2091
Num_of_Loan                 4785
Annual_Income               6980
Age                         7624
Amount_invested_monthly     8784
Credit_History_Age          9030
Num_of_Delayed_Payment      9746
Monthly_Inhand_Salary      15002
dtype: int64
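
This notebook leaves the missing values in place. Since values like `Monthly_Inhand_Salary` often repeat across a customer's rows (see the first five rows above), one possible way to fill gaps would be to use each customer's own median. The sketch below shows the idea on made-up data; `toy_df` and the `_filled` column name are assumptions, and it would have to run before `Customer_ID` is dropped.

```python
import numpy as np
import pandas as pd

# Made-up rows for two hypothetical customers with gaps in their salary.
toy_df = pd.DataFrame({
    "Customer_ID": ["CUS_a"] * 3 + ["CUS_b"] * 3,
    "Monthly_Inhand_Salary": [1824.84, np.nan, 1824.84, np.nan, np.nan, 3359.42],
})

# Median of the observed values per customer, broadcast back to every row.
customer_median = toy_df.groupby("Customer_ID")["Monthly_Inhand_Salary"].transform("median")

# Only the missing entries are replaced; observed values are left untouched.
toy_df["Monthly_Inhand_Salary_filled"] = toy_df["Monthly_Inhand_Salary"].fillna(customer_median)
print(toy_df)
```

A customer with no observed value at all would still be left as NaN, so this would not remove every gap on its own.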

#### Drop the Customer_ID column
We no longer need it.

In [6]:
credit_df.drop(columns=["Customer_ID"], inplace=True)

##### Make sure that all columns are some sort of numeric value

In [7]:
col_dtypes = credit_df.dtypes 
col_dtypes[col_dtypes == "object"]

Series([], dtype: object)

### Save Processed Data to CSV

In [8]:
output_path = "/kaggle/working/"
credit_df.to_csv(output_path + "data_cleaned.csv", header=True, index=False)

#### Train / Dev / Test

With 100,000 rows, our data set isn't that big.
If we need more training data, we could fold our dev set into our train set and use cross-validation instead.
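
Here is a rough sketch of what folding dev into train could look like. It runs on synthetic data, and the `LogisticRegression` model is only a placeholder, since this notebook doesn't pick a model. `make_toy_split`, the feature names, and the fold count are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(99)

def make_toy_split(n_rows):
    """Build a small synthetic frame with two numeric features and a 3-class target."""
    return pd.DataFrame({
        "feature_a": rng.normal(size=n_rows),
        "feature_b": rng.normal(size=n_rows),
        "Credit_Score": np.arange(n_rows) % 3,
    })

toy_train_df = make_toy_split(210)
toy_dev_df = make_toy_split(45)

# Fold dev back into train, then let cross-validation carve out the validation folds.
toy_train_dev_df = pd.concat([toy_train_df, toy_dev_df], ignore_index=True)
X = toy_train_dev_df.drop(columns=["Credit_Score"])
y = toy_train_dev_df["Credit_Score"]

# Stratified folds keep the class mix roughly the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=99)
fold_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"mean accuracy over {len(fold_scores)} folds: {fold_scores.mean():.3f}")
```

The test set stays out of this entirely, so it can still be used once at the end for a final estimate.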


In [9]:
# first split: 70% train, 30% held out for dev and test
credit_train_df, credit_rem_df = train_test_split(
    credit_df,
    test_size=0.3,
    shuffle=True,
    random_state=99,
    stratify=credit_df["Credit_Score"],
)


# second split: divide the held-out 30% evenly into dev and test
credit_dev_df, credit_test_df = train_test_split(
    credit_rem_df,
    test_size=0.5,
    shuffle=True,
    random_state=99,
    stratify=credit_rem_df["Credit_Score"],
)



# print row counts to make sure that sampling was done correctly
print("train count: {}".format(credit_train_df.shape))
print("dev count: {}".format(credit_dev_df.shape))
print("test count: {}".format(credit_test_df.shape))


# write to file
# output_path already ends with "/", so no leading slash on the file names
credit_train_df.to_csv(output_path + "train_cleaned.csv", header=True, index=False)
credit_dev_df.to_csv(output_path + "dev_cleaned.csv", header=True, index=False)
credit_test_df.to_csv(output_path + "test_cleaned.csv", header=True, index=False)

train count: (70000, 66)
dev count: (15000, 66)
test count: (15000, 66)
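
The row counts confirm the 70/15/15 sizes, but not that stratification kept the class mix the same. A check along these lines could do that. It is shown here on a made-up frame with the same two-step split; `toy_df` and its 20/30/50 class mix are assumptions, and 0, 1, 2 stand in for the encoded labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with an uneven class mix.
toy_df = pd.DataFrame({
    "feature": np.arange(1000),
    "Credit_Score": np.repeat([0, 1, 2], [200, 300, 500]),
})

# Same two-step 70/15/15 split as above.
toy_train_df, toy_rem_df = train_test_split(
    toy_df, test_size=0.3, random_state=99, stratify=toy_df["Credit_Score"])
toy_dev_df, toy_test_df = train_test_split(
    toy_rem_df, test_size=0.5, random_state=99, stratify=toy_rem_df["Credit_Score"])

# One column per split, one row per class, values are class proportions.
class_mix = pd.DataFrame({
    name: part["Credit_Score"].value_counts(normalize=True).sort_index()
    for name, part in [("train", toy_train_df), ("dev", toy_dev_df), ("test", toy_test_df)]
})
print(class_mix.round(3))
```

If the three columns are close to each other, the stratified sampling did its job.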
