# Data exploration

We'll begin by reading in the data.

In [33]:
import pandas as pd
df = pd.read_csv("data/credit-data.csv")

Checking the column names:

In [44]:
df.columns

Index(['PersonID', 'SeriousDlqin2yrs', 'RevolvingUtilizationOfUnsecuredLines',
       'age', 'zipcode', 'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio',
       'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
       'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines',
       'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents'],
      dtype='object')

Now we'll check the type of each column:

In [31]:
for col in df.columns:
    print(col, 'is type --->', df[col].dtype)

PersonID is type ---> int64
SeriousDlqin2yrs is type ---> int64
RevolvingUtilizationOfUnsecuredLines is type ---> float64
age is type ---> int64
zipcode is type ---> int64
NumberOfTime30-59DaysPastDueNotWorse is type ---> int64
DebtRatio is type ---> float64
MonthlyIncome is type ---> float64
NumberOfOpenCreditLinesAndLoans is type ---> int64
NumberOfTimes90DaysLate is type ---> int64
NumberRealEstateLoansOrLines is type ---> int64
NumberOfTime60-89DaysPastDueNotWorse is type ---> int64
NumberOfDependents is type ---> float64


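Some columns are stored as floats when integers would be more natural. `NumberOfDependents`, for example, is a count, but pandas falls back to float64 when an integer column contains missing values. For our exploration purposes, though, every variable has an appropriate numeric type.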
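If integer types are needed later on, one option is pandas' nullable `Int64` dtype, which keeps missing entries as `<NA>` instead of forcing the column to float. The sketch below uses a toy frame with made-up values, not the real dataset, and assumes the non-missing values are whole numbers:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column; these values are invented for illustration.
toy = pd.DataFrame({"NumberOfDependents": [0.0, 2.0, np.nan, 1.0]})

# "Int64" (capital I) is the nullable integer dtype: NaN becomes <NA>.
toy["NumberOfDependents"] = toy["NumberOfDependents"].astype("Int64")

print(toy["NumberOfDependents"].dtype)
print(toy["NumberOfDependents"].isna().sum())
```

The cast raises an error if any non-missing value has a fractional part, so it also works as a quick sanity check.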

Checking the proportion of missing values:

In [46]:
total = len(df)
for col in df.columns:
    print(col, 'has', df[col].isna().sum() / total, 'as missings proportion')

PersonID has 0.0 as missings proportion
SeriousDlqin2yrs has 0.0 as missings proportion
RevolvingUtilizationOfUnsecuredLines has 0.0 as missings proportion
age has 0.0 as missings proportion
zipcode has 0.0 as missings proportion
NumberOfTime30-59DaysPastDueNotWorse has 0.0 as missings proportion
DebtRatio has 0.0 as missings proportion
MonthlyIncome has 0.1944119368051492 as missings proportion
NumberOfOpenCreditLinesAndLoans has 0.0 as missings proportion
NumberOfTimes90DaysLate has 0.0 as missings proportion
NumberRealEstateLoansOrLines has 0.0 as missings proportion
NumberOfTime60-89DaysPastDueNotWorse has 0.0 as missings proportion
NumberOfDependents has 0.02528281646186854 as missings proportion


`MonthlyIncome` is missing for 19.4% of observations and `NumberOfDependents` for 2.5%. Dropping those rows would discard a non-trivial share of the data, so we'll impute these values later on.
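As a rough illustration of what an imputation rule could look like, the sketch below fills each column's missing entries with that column's median. The choice of the median is an assumption made here for illustration (the rule to be used has not been decided at this point), and the toy values are invented:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names match the dataset.
toy = pd.DataFrame({
    "MonthlyIncome": [3000.0, np.nan, 5000.0, 4000.0],
    "NumberOfDependents": [0.0, 1.0, np.nan, 3.0],
})

# Per-column medians (NaN is skipped by default).
medians = toy.median()

# fillna with a Series fills each column using the value under its own label.
imputed = toy.fillna(medians)

print(imputed)
```

The median is less sensitive to extreme values than the mean, which may matter for a skewed variable such as income, but other rules could be equally reasonable.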

Tabulating the target attribute:

In [15]:
df.groupby('SeriousDlqin2yrs').size()

SeriousDlqin2yrs
0    34396
1     6620
dtype: int64

Our dataframe has 34396 + 6620 = 41016 observations, and about 16.1% of them (6620 / 41016) have `SeriousDlqin2yrs` equal to one.
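The same proportion can be obtained directly with `value_counts(normalize=True)`. The sketch below rebuilds a stand-in target column from the two counts reported above rather than reading the real file:

```python
import numpy as np
import pandas as pd

# Stand-in for df['SeriousDlqin2yrs'], rebuilt from the reported counts.
target = pd.Series(np.repeat([0, 1], [34396, 6620]), name="SeriousDlqin2yrs")

# normalize=True returns class shares instead of raw counts.
shares = target.value_counts(normalize=True)

print(shares.round(3))
```

On the real dataframe the equivalent call would be `df['SeriousDlqin2yrs'].value_counts(normalize=True)`.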

Next, we'll plot the correlation matrix as a color map to see which variables are correlated with the target attribute.

In [36]:
df.corr().style.background_gradient(cmap='coolwarm')

| | PersonID | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | zipcode | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PersonID | 1.0 | -0.622739 | 0.00449866 | 0.108533 | -0.0210101 | -0.0825982 | 0.00686411 | 0.0197553 | 0.0226691 | -0.0768556 | 0.00434091 | -0.0653526 | -0.0303635 |
| SeriousDlqin2yrs | -0.622739 | 1.0 | -0.00458616 | -0.173728 | -0.045051 | 0.149334 | -0.0135022 | -0.0328096 | -0.0398977 | 0.139609 | -0.0106409 | 0.121886 | 0.065708 |
| RevolvingUtilizationOfUnsecuredLines | 0.00449866 | -0.00458616 | 1.0 | -0.00800343 | 0.00600898 | -0.00199912 | 0.0222501 | 0.00583184 | -0.0145899 | -0.00168577 | 0.00476292 | -0.0014134 | 0.00534205 |
| age | 0.108533 | -0.173728 | -0.00800343 | 1.0 | 0.00540794 | -0.0686957 | 0.0388284 | 0.0481377 | 0.159866 | -0.0690361 | 0.0491676 | -0.0636221 | -0.211002 |
| zipcode | -0.0210101 | -0.045051 | 0.00600898 | 0.00540794 | 1.0 | -0.00242439 | 0.00208776 | -0.00498002 | -0.0092136 | -0.00148706 | 0.0031411 | -0.00119796 | -0.00174394 |
| NumberOfTime30-59DaysPastDueNotWorse | -0.0825982 | 0.149334 | -0.00199912 | -0.0686957 | -0.00242439 | 1.0 | -0.0116197 | -0.0152238 | -0.0707039 | 0.984465 | -0.0378634 | 0.98853 | -0.00783968 |
| DebtRatio | 0.00686411 | -0.0135022 | 0.0222501 | 0.0388284 | 0.00208776 | -0.0116197 | 1.0 | -0.0229878 | 0.0827911 | -0.01479 | 0.177858 | -0.0132897 | -0.0705581 |
| MonthlyIncome | 0.0197553 | -0.0328096 | 0.00583184 | 0.0481377 | -0.00498002 | -0.0152238 | -0.0229878 | 1.0 | 0.1071 | -0.0179537 | 0.127313 | -0.0153363 | 0.060528 |
| NumberOfOpenCreditLinesAndLoans | 0.0226691 | -0.0398977 | -0.0145899 | 0.159866 | -0.0092136 | -0.0707039 | 0.0827911 | 0.1071 | 1.0 | -0.0981764 | 0.442776 | -0.0871536 | 0.0602182 |
| NumberOfTimes90DaysLate | -0.0768556 | 0.139609 | -0.00168577 | -0.0690361 | -0.00148706 | 0.984465 | -0.01479 | -0.0179537 | -0.0981764 | 1.0 | -0.0546613 | 0.992143 | -0.0157375 |


Surprisingly, the variable most strongly correlated with `SeriousDlqin2yrs` is `PersonID` (-0.62). Since the data dictionary provided doesn't explain how the ID was assigned, we'll assume this correlation is spurious. Apart from that, the variable with the strongest correlation with `SeriousDlqin2yrs` is `age` (-0.17). In general, no variable has a remarkably high correlation with our target attribute: every remaining coefficient is below 0.15 in absolute value, so they all sit at similarly low levels.
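To make this ranking explicit, the correlations with the target can be sorted by absolute value. The sketch below copies the `SeriousDlqin2yrs` row from the matrix above into a Series instead of recomputing it, and drops `PersonID` on the assumption stated above that its correlation is spurious:

```python
import pandas as pd

# Correlations with SeriousDlqin2yrs, copied from the matrix above.
target_corr = pd.Series({
    "PersonID": -0.622739,
    "RevolvingUtilizationOfUnsecuredLines": -0.00458616,
    "age": -0.173728,
    "zipcode": -0.045051,
    "NumberOfTime30-59DaysPastDueNotWorse": 0.149334,
    "DebtRatio": -0.0135022,
    "MonthlyIncome": -0.0328096,
    "NumberOfOpenCreditLinesAndLoans": -0.0398977,
    "NumberOfTimes90DaysLate": 0.139609,
    "NumberRealEstateLoansOrLines": -0.0106409,
    "NumberOfTime60-89DaysPastDueNotWorse": 0.121886,
    "NumberOfDependents": 0.065708,
})

# Rank by strength of association, ignoring sign and the ID column.
ranked = target_corr.drop("PersonID").abs().sort_values(ascending=False)

print(ranked.head(4))
```

On the real dataframe, `df.corr()['SeriousDlqin2yrs'].drop('SeriousDlqin2yrs')` would give the starting Series. After `age`, the next three entries are the past-due counters.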