Basic exploration of the cleaned data

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", None)
%config Completer.use_jedi = False

Read in Data

In [2]:
train_df = pd.read_csv("../data/train_cleaned.csv")
train_df["data_sample"] = "train"

dev_df = pd.read_csv("../data/dev_cleaned.csv")
dev_df["data_sample"] = "dev"

test_df = pd.read_csv("../data/test_cleaned.csv")
test_df["data_sample"] = "test"

# combine the three splits into one frame with a fresh 0..n-1 index
all_df = pd.concat([train_df, dev_df, test_df], ignore_index=True)

all_df.head()

Unnamed: 0,Payment_Behavior_!@9#%8,Payment_Behavior_High_spent_Large_value_payments,Payment_Behavior_High_spent_Medium_value_payments,Payment_Behavior_High_spent_Small_value_payments,Payment_Behavior_Low_spent_Large_value_payments,Payment_Behavior_Low_spent_Medium_value_payments,Payment_Behavior_Low_spent_Small_value_payments,Credit_Mix_Bad,Credit_Mix_Good,Credit_Mix_Standard,Credit_Mix__,and auto loan,and credit-builder loan,and debt consolidation loan,and home equity loan,and mortgage loan,and not specified,and payday loan,and personal loan,and student loan,auto loan,credit-builder loan,debt consolidation loan,home equity loan,mortgage loan,not specified,payday loan,personal loan,student loan,Payment_of_Min_Amount_NM,Payment_of_Min_Amount_No,Payment_of_Min_Amount_Yes,Occupation_Accountant,Occupation_Architect,Occupation_Developer,Occupation_Doctor,Occupation_Engineer,Occupation_Entrepreneur,Occupation_Journalist,Occupation_Lawyer,Occupation_Manager,Occupation_Mechanic,Occupation_Media_Manager,Occupation_Musician,Occupation_Scientist,Occupation_Teacher,Occupation_Writer,Occupation________,Age,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Total_EMI_per_month,Amount_invested_monthly,Monthly_Balance,Credit_Score,data_sample
0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,46.0,35707.45,3124.620833,8,4,5,3.0,8,8.0,-3.14,4.0,933.97,34.216697,19.119048,75.508156,75.290744,353.841382,0,train
1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,3,1,1,2,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,28.0,15349.54,1180.128333,9,7,22,9.0,40,22.0,27.08,12.0,2988.2,32.072191,19.119048,108.461016,24.76283,234.788988,2,train
2,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,23.0,7123.915,756.659583,4,6,19,4.0,5,11.0,10.24,8.0,1380.44,24.896803,19.119048,19.530369,33.036671,313.098918,2,train
3,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,14.0,37777.92,3055.16,3,4,20,-100.0,30,18.0,18.37,9.0,1336.0,34.768715,19.119048,100.42795,173.101314,291.986736,1,train
4,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,2,1,0,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,19.0,54889.92,4512.16,7,4,17,6.0,24,10.0,10.48,4.0,1603.48,29.799462,19.119048,147.047371,109.485731,434.682898,1,train


Distribution of our labels

In [3]:
# count rows per ("Credit_Score", "data_sample") pair
count_by_score_df = (
    all_df
    .groupby(["Credit_Score", "data_sample"])
    .size()
    .reset_index(name="count")
)


# count rows per "data_sample"; these totals are the normalization denominators
count_by_sample_df = (
    all_df
    .groupby("data_sample")["Credit_Score"]
    .count()
    .reset_index(name="total_sample_count")
)


all_agg_df = pd.merge(count_by_score_df, count_by_sample_df, on="data_sample", how="inner")
all_agg_df["perc"] = all_agg_df["count"] / all_agg_df["total_sample_count"]
all_agg_df.sort_values(["Credit_Score", "data_sample"], inplace=True)
all_agg_df

Unnamed: 0,Credit_Score,data_sample,count,total_sample_count,perc
0,0,dev,272,1875,0.145067
3,0,test,333,1875,0.1776
6,0,train,1370,8750,0.156571
1,1,dev,527,1875,0.281067
4,1,test,576,1875,0.3072
7,1,train,2479,8750,0.283314
2,2,dev,1076,1875,0.573867
5,2,test,966,1875,0.5152
8,2,train,4901,8750,0.560114



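From the table above, we can see that the `Credit_Score` classes are not balanced, and that the class shares are similar across the train, dev, and test splits. Approximate share of each class:

* 0: ~15% (14.5% to 17.8% depending on the split)
* 1: ~30% (28.1% to 30.7%)
* 2: ~55% (51.5% to 57.4%)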
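
As a cross-check on these shares, the same per-split normalization can be sketched in a single call with `pd.crosstab`. The frame `demo_df` below is a hypothetical stand-in for `all_df`, rebuilt only from the counts printed in the table, so the snippet runs without the CSV files.

```python
import pandas as pd

# Hypothetical stand-in for all_df, rebuilt from the counts in the table above.
table_counts = {
    ("dev", 0): 272, ("dev", 1): 527, ("dev", 2): 1076,
    ("test", 0): 333, ("test", 1): 576, ("test", 2): 966,
    ("train", 0): 1370, ("train", 1): 2479, ("train", 2): 4901,
}
rows = [
    (sample, score)
    for (sample, score), n in table_counts.items()
    for _ in range(n)
]
demo_df = pd.DataFrame(rows, columns=["data_sample", "Credit_Score"])

# normalize="index" divides each row (one split) by that split's total.
class_share = pd.crosstab(
    demo_df["data_sample"], demo_df["Credit_Score"], normalize="index"
)
print(class_share.round(3))
```

In the notebook itself, passing `all_df["data_sample"]` and `all_df["Credit_Score"]` should give the same values as the `perc` column, laid out with one row per split.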


All three class shares are within the same order of magnitude: in every split, the largest class is roughly three to four times the size of the smallest.
So the imbalance is moderate and shouldn't be too bad to work with.
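
One way to put a number on the imbalance is the ratio of the most frequent class to the least frequent class within each split. The sketch below uses only the counts from the aggregated table; the name `imbalance_ratio` is illustrative and not part of the original notebook.

```python
import pandas as pd

# Per-class counts copied from the aggregated table (rows: Credit_Score).
counts = pd.DataFrame(
    {
        "dev": [272, 527, 1076],
        "test": [333, 576, 966],
        "train": [1370, 2479, 4901],
    },
    index=pd.Index([0, 1, 2], name="Credit_Score"),
)

# Largest class count divided by smallest class count, per split.
imbalance_ratio = counts.max() / counts.min()
print(imbalance_ratio.round(2))
```

The ratios come out at roughly 3 to 4 for all three splits, well below a factor of 10.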