# Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building the **credit score** of a potential customer. The **credit score** is used to evaluate the ability of a potential borrower to repay their loan.

- Which parameters do we need to calculate to build the credit score?



## Open the data file and have a look at the general information. 


In [2]:
import pandas as pd

In [3]:
import numpy as np

In [4]:
df = pd.read_csv('/datasets/credit_scoring_eng.csv')

## Task 1. Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan



In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


In [6]:
df.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

In [7]:
df.head()


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


- Most columns are fully populated, but `total_income` and `days_employed` have fewer non-null values (19351) than the total number of rows (21525). These two columns need a closer look before we draw any conclusions.
- `days_employed` also contains implausible values (negative numbers and very large positive ones, such as 340266 days), so we need to study them and decide how to treat them.

In [8]:
df[df['days_employed'].isna()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


In [277]:
df['days_employed'].isna().sum()

2174

- The missing values appear to be symmetrical: `days_employed` and `total_income` seem to be missing in the same rows.

- To confirm whether the missing values really are symmetrical, let's filter on both conditions at once and look at the number of rows in the filtered table.

In [278]:
df[(df['days_employed'].isna() & df['total_income'].isna())]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


The filtered table has the same number of rows as the count returned by the `isna()` method. This confirms that the missing values sit in the same rows of the `days_employed` and `total_income` columns.

Next, we need to work out what percentage of the dataframe these missing values represent, and then decide how to treat them.
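The symmetry check can also be done without comparing row counts by eye. The following is a minimal sketch on a toy frame (the `toy` data is made up for illustration and is not the bank's dataset): it compares the two `isna()` masks row by row and lists any rows where only one of the two columns is missing.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "days_employed": [-100.0, np.nan, 250.0, np.nan],
    "total_income": [2000.0, np.nan, 3100.0, np.nan],
})

# Boolean masks marking the missing entries in each column.
days_missing = toy["days_employed"].isna()
income_missing = toy["total_income"].isna()

# The gaps are symmetrical only if the two masks agree on every row.
masks_match = bool((days_missing == income_missing).all())

# Rows where exactly one of the two columns is missing (should be empty).
one_sided = toy[days_missing ^ income_missing]

print(masks_match, len(one_sided))
```

If `one_sided` were not empty, the two columns would have to be treated separately.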

In [279]:
total_rows = len(df)  # number of rows in the whole dataset
na_rows = df['days_employed'].isna().sum()  # rows with a missing value
na_ratio = na_rows / total_rows
print(f'For Null values % is: {na_ratio:.0%}')

For Null values % is: 10%


In [280]:
df[df['days_employed'].isna()]['income_type'].value_counts(normalize=True)

employee         0.508280
business         0.233671
retiree          0.189972
civil servant    0.067617
entrepreneur     0.000460
Name: income_type, dtype: float64

- Checking the distribution (see the cell above): about half of the rows with missing values (51%) belong to clients with income type `employee`, followed by `business` (23%) and `retiree` (19%). `entrepreneur` accounts for the fewest, with a single row.


**Possible reasons for missing values in data**

Based on the distribution above, the missing values are spread across several `income_type` categories rather than concentrated in one.
This suggests that the missing values occur randomly, yet symmetrically across the two columns.

Let's check other columns to see whether the missing values are random.

In [281]:
df[df['days_employed'].isna()]['family_status'].value_counts(normalize=True)

married              0.568997
civil partnership    0.203312
unmarried            0.132475
divorced             0.051518
widow / widower      0.043698
Name: family_status, dtype: float64

In [282]:
df[df['days_employed'].isna()]['education'].value_counts(normalize=True)

secondary education    0.647654
bachelor's degree      0.228151
SECONDARY EDUCATION    0.030819
Secondary Education    0.029899
some college           0.025299
Bachelor's Degree      0.011500
BACHELOR'S DEGREE      0.010580
primary education      0.008740
Some College           0.003220
SOME COLLEGE           0.003220
PRIMARY EDUCATION      0.000460
Primary Education      0.000460
Name: education, dtype: float64

This distribution also suggests that the missing values occur randomly throughout the dataframe: nearly every education value has rows with missing data. Keep in mind, however, that many values in this column differ only in letter case and need to be normalized to one spelling per category.

In [283]:
df[df['days_employed'].isna()]['children'].value_counts(normalize=False)

 0     1439
 1      475
 2      204
 3       36
 20       9
 4        7
-1        3
 5        1
Name: children, dtype: int64

In [284]:
df[df['days_employed'].isna()]['debt'].value_counts(normalize=False)

0    2004
1     170
Name: debt, dtype: int64

In [285]:
df.debt.value_counts(normalize=True)

0    0.919117
1    0.080883
Name: debt, dtype: float64

In [286]:
df.query('debt == 0')

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.422610,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21518,0,373995.710838,59,SECONDARY EDUCATION,1,married,0,F,retiree,0,24618.344,purchase of a car
21519,1,-2351.431934,37,graduate degree,4,divorced,3,M,employee,0,18551.846,buy commercial real estate
21520,1,-4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions
21521,0,343937.404131,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car


**Intermediate conclusion**

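- Most rows with missing values belong to clients without debt (2004 of 2174, about 92%), but this matches the share of such clients in the whole dataset (also about 92%), so the missing values do not appear to be related to `debt`. They may simply be a bug in the data, since they occur randomly across the categories.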
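To judge whether missing values are random with respect to a column, it helps to put the two distributions side by side. This is a minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration): it compares the share of each `debt` value in all rows with the share in the rows where `total_income` is missing.

```python
import numpy as np
import pandas as pd

# Toy data: eight clients, half of them with a missing income.
toy = pd.DataFrame({
    "debt": [0, 0, 1, 0, 0, 1, 0, 0],
    "total_income": [1.0, np.nan, 2.0, np.nan, 3.0, np.nan, 4.0, np.nan],
})

# Rows where the income is missing.
missing = toy[toy["total_income"].isna()]

# One column per distribution, aligned on the debt value.
comparison = pd.concat(
    {
        "all_rows": toy["debt"].value_counts(normalize=True),
        "missing_rows": missing["debt"].value_counts(normalize=True),
    },
    axis=1,
)
comparison["difference"] = comparison["missing_rows"] - comparison["all_rows"]

print(comparison)
```

Differences close to zero suggest that the gaps are not tied to that column; large differences would point to a pattern worth investigating.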

**Conclusions**

- During data exploration we did not find any pattern that explains what leads to the missing values. They occur across all categories, apparently at random.

- As it stands, we cannot determine the correct values for these missing entries. They make up about 10% of the data, so dropping them is one option; in the section on working with missing values below, we restore them from group means instead.

- Moving on, the data has several issues (duplicates, inconsistent letter case, incorrect artifacts, and missing values) that we need to clean up by:
    - Dropping all duplicates
    - Normalizing letter case
    - Correcting artifacts and missing values

- Before we proceed, we do a final check by counting the non-missing values to confirm the symmetry.

In [287]:
df[~(df['days_employed'].isna())]['debt'].value_counts(normalize=False).sum()

19351

In [288]:
df[~(df['days_employed'].isna())]['children'].value_counts(normalize=False).sum()

19351

- We now know that 2174 rows have missing values and that the entire dataset has 21525 rows. Subtracting one from the other gives 19351 non-missing rows, which matches the counts above and is consistent with the missing values being symmetrical.


In [289]:
df_with_no_missing_value = df[~(df['days_employed'].isna())]
df_with_no_missing_value['debt'].value_counts(normalize=True)

0    0.918816
1    0.081184
Name: debt, dtype: float64

In [290]:
df_with_no_missing_value['children'].value_counts(normalize=True)

 0     0.656814
 1     0.224433
 2     0.095654
 3     0.015193
 20    0.003462
-1     0.002274
 4     0.001757
 5     0.000413
Name: children, dtype: float64

In [291]:
df_with_no_missing_value['education'].value_counts(normalize=True)

secondary education    0.637796
bachelor's degree      0.218180
SECONDARY EDUCATION    0.036432
Secondary Education    0.033383
some college           0.031678
BACHELOR'S DEGREE      0.012971
Bachelor's Degree      0.012557
primary education      0.011937
Some College           0.002067
SOME COLLEGE           0.001137
PRIMARY EDUCATION      0.000827
Primary Education      0.000723
graduate degree        0.000207
Graduate Degree        0.000052
GRADUATE DEGREE        0.000052
Name: education, dtype: float64

In [292]:
df_with_no_missing_value['family_status'].value_counts(normalize=True)

married              0.575836
civil partnership    0.193013
unmarried            0.130484
divorced             0.055966
widow / widower      0.044701
Name: family_status, dtype: float64

In [293]:
df_with_no_missing_value['gender'].value_counts(normalize=True)

F      0.658984
M      0.340964
XNA    0.000052
Name: gender, dtype: float64

In [294]:
df_with_no_missing_value[["debt", "children", "education", "family_status", "gender"]].value_counts(normalize=False)

debt  children  education            family_status      gender
0     0         secondary education  married            F         2762
                                                        M         1342
                                     civil partnership  F         1028
      1         secondary education  married            F          873
      0         bachelor's degree    married            F          847
                                                                  ... 
      5         PRIMARY EDUCATION    married            F            1
                secondary education  civil partnership  F            1
                                     married            M            1
      20        BACHELOR'S DEGREE    married            F            1
1     20        secondary education  unmarried          M            1
Length: 476, dtype: int64

Summing this distribution brings us back to 19351, which is again consistent with the missing values being symmetrical.

In [295]:
df_with_no_missing_value[["debt", "children", "education", "family_status", "gender"]].value_counts(normalize=False).sum()

19351

## Data transformation

The main goal of this section is to eliminate duplicates and to correct any illogical values we find in this dataset.

- Let's go through each column to see what issues it may have.


In [296]:
df['education'].unique()

array(["bachelor's degree", 'secondary education', 'Secondary Education',
       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'some college',
       'primary education', "Bachelor's Degree", 'SOME COLLEGE',
       'Some College', 'PRIMARY EDUCATION', 'Primary Education',
       'Graduate Degree', 'GRADUATE DEGREE', 'graduate degree'],
      dtype=object)

Since the `education` column contains many duplicates that differ only in letter case, we create a function to normalize them.

In [297]:
def replace_wrong_values(wrong_values, correct_value):
    # Replace every spelling variant in `education` with the canonical value.
    for wrong_value in wrong_values:
        df['education'] = df['education'].replace(wrong_value, correct_value)


# Spelling variants (differing only in letter case) and their canonical names.
sec_duplicates = ['SECONDARY EDUCATION', 'Secondary Education']
sec_name = 'secondary education'
deg_duplicates = ["BACHELOR'S DEGREE", "Bachelor's Degree"]
deg_name = "bachelor's degree"
pri_duplicates = ['PRIMARY EDUCATION', 'Primary Education']
pri_name = 'primary education'
col_duplicates = ['Some College', 'SOME COLLEGE']
col_name = 'some college'
gra_duplicates = ['GRADUATE DEGREE', 'Graduate Degree']
gra_name = 'graduate degree'

replace_wrong_values(sec_duplicates, sec_name)
replace_wrong_values(deg_duplicates, deg_name)
replace_wrong_values(pri_duplicates, pri_name)
replace_wrong_values(col_duplicates, col_name)
replace_wrong_values(gra_duplicates, gra_name)

Let's check the unique values again to make sure the duplicates are gone.

In [298]:
df['education'].unique()


array(["bachelor's degree", 'secondary education', 'some college',
       'primary education', 'graduate degree'], dtype=object)
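Because the variants in `education` differ only in letter case, the same result can probably be reached in one step with pandas string methods. The sketch below uses a small made-up Series rather than the real column, and it is an alternative to the helper function above, not the approach this notebook actually ran.

```python
import pandas as pd

# A few of the spelling variants seen in the `education` column.
education = pd.Series([
    "bachelor's degree", "Secondary Education", "SECONDARY EDUCATION",
    "Some College", "PRIMARY EDUCATION", "Graduate Degree",
])

# Lower-casing maps every variant onto its canonical spelling.
normalized = education.str.lower()

print(sorted(normalized.unique()))
```

This avoids listing each variant by hand, but it only works when case is the sole difference between the variants.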

Next, we check the data in the `children` column.

In [299]:
df['children'].value_counts(normalize=False)

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

- About 0.2% of clients (47 rows) have -1 children. This is impossible, so we treat it as a typo and replace it with the absolute value, 1.

- Some clients (76 rows) have 20 children. This is very unlikely in the real world, so we treat it as a data-entry error and replace it with 2, the closest plausible value.

In [300]:
df['children'].median()

0.0

In [301]:
df['children'] = df['children'].replace(-1,1)

In [302]:
df['children'] = df['children'].replace(20,2)

In [303]:
df['children'].value_counts(normalize=False)

0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64
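The two replacements above can also be expressed as a single mapping, which keeps the correction rules in one place. This is a minimal sketch on a made-up Series; the name `fixes` is introduced here for illustration.

```python
import pandas as pd

# Toy values including the two artifacts found in `children`.
children = pd.Series([0, 1, -1, 20, 2, 3, 20])

# Mapping from each invalid value to its assumed correction.
fixes = {-1: 1, 20: 2}

# Series.replace accepts a dict and applies all substitutions at once.
cleaned = children.replace(fixes)

print(cleaned.tolist())
```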

In [304]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


- After addressing the `children` column, we move on to `days_employed`. Earlier we noticed that this column contains negative values, so we first calculate what share of the data they represent.
- Negative values would distort the results of our analysis, so we need to find a way to correct them.

In [305]:
neg_days = df.query('days_employed < 0')['days_employed'].count()  # negative entries
total_days = df['days_employed'].count()  # non-missing entries
neg_days_percentage = neg_days / total_days
pos_days_percentage = 1 - neg_days_percentage
print(f'For Negative day % is: {neg_days_percentage:.0%}')
print(f'For Positive day % is: {pos_days_percentage:.0%}')

For Negative day % is: 82%
For Positive day % is: 18%


- Surprisingly, more than 80% of the values in the `days_employed` column are negative. We cannot afford to drop this much data, and the negative sign is probably a bug in the data. A possible solution is to convert the negative values to absolute values with the `abs()` method.
- First, we use the `describe()` method to get a rough idea of this column.

In [306]:
df.days_employed.describe()

count     19351.000000
mean      63046.497661
std      140827.311974
min      -18388.949901
25%       -2747.423625
50%       -1203.369529
75%        -291.095954
max      401755.400475
Name: days_employed, dtype: float64

- As noted above, negative values make up around 82% of the data, and dropping such a large share would most probably distort our analysis. We therefore convert the column to absolute values for the rest of the analysis.

In [307]:
df['days_employed'] = df['days_employed'].abs()

In [308]:
df['days_employed'].describe()

count     19351.000000
mean      66914.728907
std      139030.880527
min          24.141633
25%         927.009265
50%        2194.220567
75%        5537.882441
max      401755.400475
Name: days_employed, dtype: float64

- However, we still see illogical values in `days_employed`: some clients have supposedly worked for more than 100,000 days (over 270 years).
- The average human life span is around 73 years, so we use it to derive an upper bound for plausible values of `days_employed`.

In [309]:
life_span = 73 # average life span of human
start_work = 18 # minimum age to start working
log_days_employed = (life_span-start_work)*365
print(log_days_employed)

20075


- We then need to work out what percentage of the data these outliers represent.

In [310]:
ill_days = df.query('days_employed >= 20075')['days_employed'].count()
print(ill_days)

3445


In [311]:
total_days_employed = df['days_employed'].count()  # non-missing entries (19351)
per_ill_days = ill_days / total_days_employed
print(f'{per_ill_days:.1%}')

17.8%


- These outliers make up about 18% of the data, which is too much to drop, so we will replace them with the median of `days_employed` instead.
- We continue with this dataset: after filling in the missing values, we will replace the implausibly high employment durations with the median.
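The planned replacement could look roughly like the sketch below. It runs on a made-up Series, and it assumes that the median is taken over the plausible values only; the notebook does not specify which median is meant, so that choice and the names `MAX_PLAUSIBLE_DAYS`, `too_high`, and `capped` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Upper bound derived above: (73 - 18) years of work, in days.
MAX_PLAUSIBLE_DAYS = (73 - 18) * 365  # 20075

# Toy values: three plausible durations, two outliers, one missing value.
days = pd.Series([927.0, 2194.0, 5537.0, 340266.0, 401755.0, np.nan])

# Flag the implausibly high values (NaN compares as False and is not flagged).
too_high = days >= MAX_PLAUSIBLE_DAYS

# Median of the remaining values; median() skips NaN by default.
plausible_median = days[~too_high].median()

# Replace only the flagged values, leaving everything else untouched.
capped = days.mask(too_high, plausible_median)

print(capped.tolist())
```

Missing values stay missing here, so they can still be filled in a separate step.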

In [312]:
df.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

In [313]:
df['days_employed'].describe()

count     19351.000000
mean      66914.728907
std      139030.880527
min          24.141633
25%         927.009265
50%        2194.220567
75%        5537.882441
max      401755.400475
Name: days_employed, dtype: float64

- Now, let's look at the clients' ages and check whether there are any issues there.

In [314]:
df['dob_years'].value_counts(normalize=True)

35    0.028664
40    0.028293
41    0.028200
34    0.028014
38    0.027782
42    0.027735
33    0.026992
39    0.026620
31    0.026016
36    0.025784
44    0.025412
29    0.025319
30    0.025087
48    0.024994
37    0.024948
50    0.023879
43    0.023833
32    0.023693
49    0.023600
28    0.023368
45    0.023089
27    0.022904
56    0.022625
52    0.022485
47    0.022300
54    0.022253
46    0.022067
58    0.021417
57    0.021370
53    0.021324
51    0.020813
59    0.020627
55    0.020581
26    0.018955
60    0.017515
25    0.016585
61    0.016492
62    0.016353
63    0.012497
64    0.012311
24    0.012265
23    0.011800
65    0.009013
66    0.008502
22    0.008502
67    0.007758
21    0.005157
0     0.004692
68    0.004599
69    0.003949
70    0.003020
71    0.002695
20    0.002369
72    0.001533
19    0.000650
73    0.000372
74    0.000279
75    0.000046
Name: dob_years, dtype: float64

An age of 0 is clearly incorrect. Age is recorded in whole years, and the zeros themselves pull the mean down, so we *will use the median* as the replacement value.

In [315]:
df['dob_years'].describe()

count    21525.000000
mean        43.293380
std         12.574584
min          0.000000
25%         33.000000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

In [316]:
df['dob_years'].median()

42.0

In [317]:
df['dob_years'] = df['dob_years'].replace(0,42)
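The replacement value 42 is hard-coded above. As a minimal sketch of a variation, the median could be computed from the valid ages only and then used for the replacement, so that the invalid zeros do not influence it. The ages below are made up, and `valid_median` is a name introduced for illustration.

```python
import pandas as pd

# Toy ages with two invalid zeros.
ages = pd.Series([0, 25, 33, 42, 58, 0, 61])

# Median of the valid ages only (zeros excluded).
valid_median = ages[ages > 0].median()

# Replace the zeros with that median, cast back to a whole number of years.
fixed = ages.replace(0, int(valid_median))

print(valid_median, fixed.tolist())
```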

In [318]:
df['dob_years'].value_counts()

42    698
35    617
40    609
41    607
34    603
38    598
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
43    513
32    510
49    508
28    503
45    497
27    493
56    487
52    484
47    480
54    479
46    475
58    461
57    460
53    459
51    448
59    444
55    443
26    408
60    377
25    357
61    355
62    352
63    269
64    265
24    264
23    254
65    194
22    183
66    183
67    167
21    111
68     99
69     85
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: dob_years, dtype: int64

- Let's check the `family_status` column. See what kind of values there are and what problems we may need to address.

In [319]:
df['family_status'].sort_values().value_counts()


married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: family_status, dtype: int64

- None of the values seem odd, so this column needs no changes.

Moving on to the `gender` column.

In [320]:
df['gender'].sort_values().value_counts()

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

Let's look at the XNA row in more detail.

In [321]:
df[df['gender']== 'XNA']

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
10701,0,2358.600502,24,some college,2,civil partnership,1,XNA,business,0,32624.825,buy real estate


Since XNA appears only once in the data, we replace it with `unknown`.

In [322]:
df['gender'] = df['gender'].replace('XNA','unknown')

In [323]:
df['gender'].sort_values().value_counts()

F          14236
M           7288
unknown        1
Name: gender, dtype: int64

- Furthermore, let's check the `income_type` column. 

In [324]:
df['income_type'].sort_values().value_counts()

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

- None of the values seem odd, so this column needs no changes.

In [325]:
duplicated_df = df[df.duplicated()] 
duplicated_df


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
2849,0,,41,secondary education,1,married,0,F,employee,0,,purchase of the house for my family
3290,0,,58,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
4182,1,,34,bachelor's degree,0,civil partnership,1,F,employee,0,,wedding ceremony
4851,0,,60,secondary education,1,civil partnership,1,F,retiree,0,,wedding ceremony
5557,0,,58,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
20702,0,,64,secondary education,1,married,0,F,retiree,0,,supplementary education
21032,0,,60,secondary education,1,married,0,F,retiree,0,,to become educated
21132,0,,47,secondary education,1,married,0,F,employee,0,,housing renovation
21281,1,,30,bachelor's degree,0,married,0,F,employee,0,,buy commercial real estate


In [326]:
df.duplicated().sum()

72
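The cells above count the duplicated rows but do not remove them. If they are to be dropped, as planned in the conclusions of the exploration step, a minimal sketch could look like this (shown on a made-up frame, not on the real data):

```python
import pandas as pd

# Toy frame in which the first row appears three times.
toy = pd.DataFrame({
    "children": [0, 1, 0, 0],
    "purpose": ["wedding", "car", "wedding", "wedding"],
})

# duplicated() flags every repeat after the first occurrence.
n_dupes = int(toy.duplicated().sum())

# Drop the repeats and rebuild a gap-free index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```

Note that rows with identical values are not necessarily the same client, especially when `days_employed` and `total_income` are missing, so whether to drop them is a judgment call.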

In [327]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


- In summary, we cleaned the data by:
    - Normalizing the letter case of the values in the `education` column.
    - Changing the illogical values in the `children` column.
    - Converting the `days_employed` column to absolute values.
    - Replacing the invalid value in `gender` with `unknown` and the incorrect values in `dob_years` with the median.
- Overall, we kept the changes to the dataset to a minimum while keeping the core of the data in place.

## Working with missing values

### Restoring missing values in `total_income`

In [328]:
def age_group(age):
    if 18 <= age <= 21:
        return 'young adult'
    if 22 <= age <= 45:
        return 'adult'
    if 46 <= age <= 64:
        return 'veteran'
    return 'retired'  
 
   

In [329]:
print(age_group(75))

retired


In [330]:
print(age_group(32))

adult


In [331]:
print(age_group(55))

veteran


In [332]:
df['age_category'] = df['dob_years'].apply(age_group)

In [333]:
df[['dob_years','age_category']].head()

Unnamed: 0,dob_years,age_category
0,42,adult
1,36,adult
2,33,adult
3,32,adult
4,53,veteran
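The same age bands can also be assigned without a Python-level function by using `pd.cut`. This is a sketch under the assumption that all ages are 18 or older; unlike `age_group`, which returns `'retired'` for any age outside the first three bands, `pd.cut` leaves ages below the first bin edge as missing.

```python
import numpy as np
import pandas as pd

# Toy ages, one per band.
ages = pd.Series([19, 32, 55, 75])

# Bin edges are right-inclusive: (17, 21], (21, 45], (45, 64], (64, inf).
bins = [17, 21, 45, 64, np.inf]
labels = ["young adult", "adult", "veteran", "retired"]

age_category = pd.cut(ages, bins=bins, labels=labels)

print(age_category.tolist())
```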


Next, we create a table that contains only the rows without missing values. This data will be used to restore the missing values.

In [334]:
non_null_df = df.dropna(subset=['total_income','days_employed'])

In [335]:
non_null_df.isna().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
age_category        0
dtype: int64

In [336]:
non_null_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19351 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          19351 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         19351 non-null  int64  
 3   education         19351 non-null  object 
 4   education_id      19351 non-null  int64  
 5   family_status     19351 non-null  object 
 6   family_status_id  19351 non-null  int64  
 7   gender            19351 non-null  object 
 8   income_type       19351 non-null  object 
 9   debt              19351 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           19351 non-null  object 
 12  age_category      19351 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.1+ MB


In [337]:
non_null_df.groupby('education')['total_income'].mean()

education
bachelor's degree      33142.802434
graduate degree        27960.024667
primary education      21144.882211
secondary education    24594.503037
some college           29045.443644
Name: total_income, dtype: float64

In [338]:
non_null_df.groupby('education')['total_income'].median()

education
bachelor's degree      28054.5310
graduate degree        25161.5835
primary education      18741.9760
secondary education    21836.5830
some college           25618.4640
Name: total_income, dtype: float64

A person's income can be estimated from their age, work experience, or education, so we replace each missing value with the mean income of the clients in the same group, computed from the non-missing rows (the same rows as in `non_null_df`). We use the mean within each group because the data has already been split into categories.

In [339]:
df['total_income'].describe()

count     19351.000000
mean      26787.568355
std       16475.450632
min        3306.762000
25%       16488.504500
50%       23202.870000
75%       32549.611000
max      362496.645000
Name: total_income, dtype: float64

In [340]:
df['total_income'] = df.groupby(['age_category', 'education', 'dob_years'])['total_income'].transform(lambda x: x.fillna(x.mean()))
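The group-wise fill used above can be illustrated on a toy frame. The sketch below groups by a single made-up column and uses `transform("mean")` followed by `fillna`, which should give the same result as the lambda version; the data and the column `total_income_filled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: group "c" has no observed income at all.
toy = pd.DataFrame({
    "education": ["a", "a", "a", "b", "b", "c"],
    "total_income": [100.0, 300.0, np.nan, 50.0, np.nan, np.nan],
})

# Mean of each group, broadcast back to the original row order.
group_mean = toy.groupby("education")["total_income"].transform("mean")

# Fill only the missing entries with the mean of their own group.
toy["total_income_filled"] = toy["total_income"].fillna(group_mean)

print(toy)
```

A group with no observed values has no mean, so its rows stay missing. It is therefore worth re-checking `isna().sum()` after a fill like this, especially when grouping by several columns at once.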

In [341]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      21525 non-null  float64
 11  purpose           21525 non-null  object 
 12  age_category      21525 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.1+ MB


In [342]:
df['total_income'].describe()

count     21525.000000
mean      26795.588841
std       15691.754174
min        3306.762000
25%       17236.158000
50%       24109.506000
75%       32071.939000
max      362496.645000
Name: total_income, dtype: float64

###  Restoring values in `days_employed`

- For the `days_employed` column, we assume that the length of employment depends mainly on age, so we use the `age_category` and `dob_years` columns to fill in the missing values. Here, too, we use the group mean.

In [343]:
df.groupby('age_category')['days_employed'].median()

age_category
adult            1481.706633
retired        360304.232308
veteran          4652.167869
young adult       639.051245
Name: days_employed, dtype: float64

In [344]:
df.groupby('age_category')['days_employed'].mean()

age_category
adult            5484.091709
retired        314080.528722
veteran        131889.087476
young adult       695.547762
Name: days_employed, dtype: float64

A person's days of employment can be estimated from the mean of clients in the same category, so we replace each missing value with the mean of its group in the dataset. We use the group mean because the data are already categorized by age.
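The two summaries above differ a lot for some groups (for `veteran`, the median is about 4652 days and the mean about 131889 days), which suggests a few very large values. The sketch below, on made-up numbers, shows how one extreme entry moves the mean while leaving the median in place; whether that matters for the fill value is a judgment call.

```python
import pandas as pd

# Hypothetical employment lengths in days; the last value imitates
# the kind of implausibly large entry suggested by the summary above.
days = pd.Series([900.0, 1500.0, 2200.0, 4000.0, 360000.0])

print('mean  :', days.mean())    # pulled upward by the single extreme value
print('median:', days.median())  # stays at the middle observation
```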

In [345]:
df['days_employed'].describe()

count     19351.000000
mean      66914.728907
std      139030.880527
min          24.141633
25%         927.009265
50%        2194.220567
75%        5537.882441
max      401755.400475
Name: days_employed, dtype: float64

In [346]:
df['days_employed'] = df.groupby(['age_category', 'dob_years'])['days_employed'].transform(lambda x: x.fillna(x.mean()))


In [347]:
df['days_employed'].describe()


count     21525.000000
mean      67182.410759
std      135524.280012
min          24.141633
25%        1016.709379
50%        2477.202993
75%        7568.821191
max      401755.400475
Name: days_employed, dtype: float64

Now we check the data again.


In [348]:
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     21525 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      21525 non-null  float64
 11  purpose           21525 non-null  object 
 12  age_category      21525 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.1+ MB


Now all the missing values have been filled!
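A direct way to confirm this is to count missing values per column. The sketch below uses a small hypothetical frame in place of `df`.

```python
import pandas as pd

# Hypothetical frame standing in for the cleaned data.
toy = pd.DataFrame({
    'days_employed': [1200.0, 3400.0, 560.0],
    'total_income': [21000.0, 18500.0, 30250.0],
})

# Number of missing values per column; all zeros means nothing is left to fill.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Fail loudly if a later step reintroduces missing values.
assert missing_per_column.sum() == 0, 'some values are still missing'
```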


## Categorization of data

To answer the hypothesis questions, we create categories for the `purpose` column and the `total_income` column.
These categories let us compare default rates between groups of clients.

In [143]:
df.purpose.unique()

array(['purchase of the house', 'car purchase', 'supplementary education',
       'to have a wedding', 'housing transactions', 'education',
       'having a wedding', 'purchase of the house for my family',
       'buy real estate', 'buy commercial real estate',
       'buy residential real estate', 'construction of own property',
       'property', 'building a property', 'buying a second-hand car',
       'buying my own car', 'transactions with commercial real estate',
       'building a real estate', 'housing',
       'transactions with my real estate', 'cars', 'to become educated',
       'second-hand car purchase', 'getting an education', 'car',
       'wedding ceremony', 'to get a supplementary education',
       'purchase of my own house', 'real estate transactions',
       'getting higher education', 'to own a car', 'purchase of a car',
       'profile education', 'university education',
       'buying property for renting out', 'to buy a car',
       'housing renovation', 'going

In [144]:
def purpose_group(purpose):
    """Map a free-text loan purpose to one of four categories."""
    # Real-estate purposes: any mention of a house, housing, estate or property.
    if any(word in purpose for word in ('house', 'estate', 'housing', 'property')):
        return 'property'
    # 'car' also matches 'cars', so a single check is enough.
    if 'car' in purpose:
        return 'vehicle'
    # Education purposes.
    if any(word in purpose for word in ('education', 'educated', 'university')):
        return 'education'
    # Anything that matched none of the keywords is treated as a wedding loan.
    return 'wedding'

In [145]:
print(purpose_group('transactions with commercial real estate'))

property


In [146]:
print(purpose_group('car purchase'))

vehicle


In [147]:
print(purpose_group('getting higher education'))

education


In [148]:
df['purpose_category'] = df['purpose'].apply(purpose_group)

In [149]:
df['purpose_category'].value_counts()

property     10840
vehicle       4315
education     4022
wedding       2348
Name: purpose_category, dtype: int64
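Because `purpose_group` returns `'wedding'` for anything it does not recognise, an unexpected purpose would be counted as a wedding loan without any warning. A stricter variant is sketched below; `PURPOSE_KEYWORDS` and `purpose_group_strict` are hypothetical names, and the keywords simply mirror the checks above.

```python
# Hypothetical keyword table; the keywords mirror the checks in purpose_group.
PURPOSE_KEYWORDS = {
    'property': ('house', 'estate', 'housing', 'property'),
    'vehicle': ('car',),
    'education': ('education', 'educated', 'university'),
    'wedding': ('wedding',),
}


def purpose_group_strict(purpose):
    """Return the first category whose keyword occurs in the text, else 'other'."""
    for category, keywords in PURPOSE_KEYWORDS.items():
        if any(keyword in purpose for keyword in keywords):
            return category
    return 'other'


print(purpose_group_strict('buying a second-hand car'))  # vehicle
print(purpose_group_strict('wedding ceremony'))          # wedding
print(purpose_group_strict('holiday trip'))              # other
```

If an `'other'` category shows up in `value_counts()`, that is a signal to extend the keyword table rather than a result in itself.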

In [150]:
df.groupby('purpose_category')['debt'].sum()

purpose_category
education    370
property     782
vehicle      403
wedding      186
Name: debt, dtype: int64

Next, we create categories for income.

In [151]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     21525 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      21525 non-null  float64
 11  purpose           21525 non-null  object 
 12  age_category      21525 non-null  object 
 13  purpose_category  21525 non-null  object 
dtypes: float64(2), int64(5), object(7)
memory usage: 2.3+ MB


In [152]:
df.total_income.describe()



count     21525.000000
mean      26795.588841
std       15691.754174
min        3306.762000
25%       17236.158000
50%       24109.506000
75%       32071.939000
max      362496.645000
Name: total_income, dtype: float64

Income is commonly described on a scale from low to high, with a small number of individuals counted as wealthy. We therefore set income ranges and use them to group all the clients in this DataFrame.

In [153]:
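def income_group(income):
    """Assign an income band based on total_income."""
    # Note: the bounds are written for whole numbers, so a non-integer income
    # that lies between two bands (for example 5000.5) matches none of the
    # checks below and falls through to 'wealthy'.
    if 0 <= income <= 5000:
        return 'low income'
    if 5001 <= income <= 10000:
        return 'mid income'
    if 10001 <= income <= 20000:
        return 'big income'
    return 'wealthy'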
 


In [154]:
print(income_group(4500))

low income


In [155]:
print(income_group(7800))

mid income


In [156]:
print(income_group(14300))

big income


In [157]:
print(income_group(45000))

wealthy


In [158]:
df['income_category'] = df['total_income'].apply(income_group)

In [159]:
df

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category,purpose_category,income_category
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,adult,property,wealthy
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,adult,vehicle,big income
2,0,5623.422610,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,adult,property,wealthy
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,adult,education,wealthy
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,veteran,wedding,wealthy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21520,1,4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions,adult,property,wealthy
21521,0,343937.404131,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car,retired,vehicle,wealthy
21522,1,2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property,adult,property,big income
21523,3,3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car,adult,vehicle,wealthy


In [160]:
df.income_category.value_counts()

wealthy       14100
big income     6499
mid income      900
low income       26
Name: income_category, dtype: int64
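The bounds in `income_group` leave small gaps for non-integer incomes: for instance, `income_group(5000.5)` returns `'wealthy'` because the value lies between 5000 and 5001. One way to get contiguous bands is `pd.cut`; the sketch below uses made-up incomes and the same thresholds, and applying it to `df` could shift the counts above slightly.

```python
import pandas as pd

# Hypothetical incomes, including values that sit between the integer bounds.
incomes = pd.Series([4500.0, 5000.5, 7800.0, 14300.0, 20000.0, 45000.0])

# Right-closed bins: [0, 5000], (5000, 10000], (10000, 20000], (20000, inf).
bins = [0, 5000, 10000, 20000, float('inf')]
labels = ['low income', 'mid income', 'big income', 'wealthy']

income_category = pd.cut(incomes, bins=bins, labels=labels, include_lowest=True)
print(income_category.tolist())
```

With contiguous bins, every non-negative income lands in exactly one band, so no value reaches `'wealthy'` by accident.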

## Checking the Hypotheses


**Is there a correlation between having children and paying back on time?**

In [161]:
children_df = df.pivot_table(index='children',values='debt' ,aggfunc='sum', margins=True)
children_df

children,debt
0,1063
1,445
2,202
3,27
4,4
5,0
All,1741


In [162]:
targetted_children = df.groupby(['children','debt'])['debt'].count()
targetted_children

children  debt
0         0       13086
          1        1063
1         0        4420
          1         445
2         0        1929
          1         202
3         0         303
          1          27
4         0          37
          1           4
5         0           9
Name: debt, dtype: int64

In [163]:
children_total_count =  df.groupby(['children'])['debt'].count()
children_total_count

children
0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: debt, dtype: int64

In [164]:
targetted_children/children_total_count

children  debt
0         0       0.924871
          1       0.075129
1         0       0.908530
          1       0.091470
2         0       0.905209
          1       0.094791
3         0       0.918182
          1       0.081818
4         0       0.902439
          1       0.097561
5         0       1.000000
Name: debt, dtype: float64
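Since `debt` holds only 0 and 1, its mean within a group is that group's default rate, so the counts and the rate can be obtained in one step. A minimal sketch on made-up rows (the `toy` frame and the column names of `summary` are assumptions):

```python
import pandas as pd

# Hypothetical rows; debt is 1 for a client who defaulted, 0 otherwise.
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 2, 2],
    'debt':     [0, 0, 0, 1, 0, 1, 1, 1],
})

# count = clients in the group, sum = defaulters, mean = default rate.
summary = toy.groupby('children')['debt'].agg(['count', 'sum', 'mean'])
summary.columns = ['clients', 'defaults', 'default_rate']
print(summary)
```

Keeping the group size next to the rate makes it easier to see when a rate rests on very few clients.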

**Conclusion**

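- Based on the rates above, clients with `4` children have the highest share of late repayment (about 9.8%), followed by `2` children (9.5%), `1` child (9.1%), `3` children (8.2%) and `0` children (7.5%). None of the 9 clients with `5` children has a debt.
- Overall, clients with children repay late more often than clients without children, but the rate does not rise steadily with the number of children, and the groups with 3 or more children are small.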


**Is there a correlation between family status and paying back on time?**

In [165]:

df.pivot_table(index='family_status',values='debt' ,aggfunc='sum', margins=True)



family_status,debt
civil partnership,388
divorced,85
married,931
unmarried,274
widow / widower,63
All,1741


In [166]:
targetted_family = df.groupby(['family_status','debt'])['debt'].count()
family_total_count =  df.groupby(['family_status'])['debt'].count()


In [167]:
targetted_family

family_status      debt
civil partnership  0        3789
                   1         388
divorced           0        1110
                   1          85
married            0       11449
                   1         931
unmarried          0        2539
                   1         274
widow / widower    0         897
                   1          63
Name: debt, dtype: int64

In [168]:
targetted_family/family_total_count

family_status      debt
civil partnership  0       0.907110
                   1       0.092890
divorced           0       0.928870
                   1       0.071130
married            0       0.924798
                   1       0.075202
unmarried          0       0.902595
                   1       0.097405
widow / widower    0       0.934375
                   1       0.065625
Name: debt, dtype: float64
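The same shares can also be produced with `pd.crosstab`, which normalizes each row in a single call. The sketch below uses a small hypothetical frame rather than `df`.

```python
import pandas as pd

# Hypothetical rows, not taken from the bank's dataset.
toy = pd.DataFrame({
    'family_status': ['married', 'married', 'married', 'married',
                      'unmarried', 'unmarried'],
    'debt': [0, 0, 0, 1, 0, 1],
})

# normalize='index' divides every row by its own total, so each row sums to 1.
rates = pd.crosstab(toy['family_status'], toy['debt'], normalize='index')
print(rates)

# Rank the groups by the share of clients with debt == 1.
print(rates[1].sort_values(ascending=False))
```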

**Conclusion**

- Based on the table above, `unmarried` clients have the highest default rate (about 9.7%), followed by `civil partnership` (9.3%), `married` (7.5%) and `divorced` (7.1%); `widow / widower` clients have the lowest rate (6.6%).

**Is there a correlation between income level and paying back on time?**

In [169]:
df.pivot_table(index='income_category',values='debt' ,aggfunc='sum', margins=True)


income_category,debt
big income,556
low income,2
mid income,56
wealthy,1127
All,1741


In [170]:
targetted_income = df.groupby(['income_category','debt'])['debt'].count()
income_total_count =  df.groupby(['income_category'])['debt'].count()

In [171]:
targetted_income

income_category  debt
big income       0        5943
                 1         556
low income       0          24
                 1           2
mid income       0         844
                 1          56
wealthy          0       12973
                 1        1127
Name: debt, dtype: int64

In [172]:
targetted_income/income_total_count

income_category  debt
big income       0       0.914448
                 1       0.085552
low income       0       0.923077
                 1       0.076923
mid income       0       0.937778
                 1       0.062222
wealthy          0       0.920071
                 1       0.079929
Name: debt, dtype: float64

**Conclusion**

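Based on the rates above, `big income` clients have the highest share of late repayment (about 8.6%), followed by `wealthy` (8.0%), `low income` (7.7%) and `mid income` (6.2%). The `low income` group contains only 26 clients, so its rate is the least reliable of the four.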
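To get a feel for how much a rate based on 26 clients can be trusted, a rough interval can be attached to each rate. The sketch below uses the Wilson score interval, which is a technique brought in here for illustration and not part of the original analysis; `wilson_interval` is a hypothetical helper, and the counts are the ones from the tables above.

```python
import math


def wilson_interval(defaults, clients, z=1.96):
    """Approximate 95% Wilson score interval for a default rate."""
    p = defaults / clients
    denom = 1 + z**2 / clients
    centre = (p + z**2 / (2 * clients)) / denom
    half = z * math.sqrt(p * (1 - p) / clients + z**2 / (4 * clients**2)) / denom
    return centre - half, centre + half


# Counts taken from the tables above.
for name, defaults, clients in [('low income', 2, 26), ('big income', 556, 6499)]:
    low, high = wilson_interval(defaults, clients)
    print(f'{name}: {defaults / clients:.3f} (roughly {low:.3f} to {high:.3f})')
```

The interval for the small group comes out much wider than the one for the large group, which is the reason to read the `low income` rate with caution.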

**How does credit purpose affect the default rate?**

In [173]:
df.groupby('purpose_category')['debt'].sum()

purpose_category
education    370
property     782
vehicle      403
wedding      186
Name: debt, dtype: int64

In [174]:
targetted_purpose = df.groupby(['purpose_category','debt'])['debt'].count()
purpose_total_count =  df.groupby(['purpose_category'])['debt'].count()

In [175]:
targetted_purpose

purpose_category  debt
education         0        3652
                  1         370
property          0       10058
                  1         782
vehicle           0        3912
                  1         403
wedding           0        2162
                  1         186
Name: debt, dtype: int64

In [176]:
targetted_purpose/purpose_total_count

purpose_category  debt
education         0       0.908006
                  1       0.091994
property          0       0.927860
                  1       0.072140
vehicle           0       0.906605
                  1       0.093395
wedding           0       0.920784
                  1       0.079216
Name: debt, dtype: float64

Based on the default rates, loans for a `vehicle` are the most likely to be repaid late (about 9.3%), followed by `education` (9.2%), `wedding` (7.9%) and `property` (7.2%).

**Conclusion**

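- The default rate does not rise steadily with the number of children. Clients with no children account for the most defaults in absolute terms (1063), but they have the lowest default rate (about 7.5%); clients with 1, 2 or 4 children default at roughly 9% to 10%.
- Married clients account for the most defaults in absolute terms (931), but the highest default rate belongs to `unmarried` clients (about 9.7%).
- Lastly, the `wealthy` group accounts for the most defaults in absolute terms (1127), but the highest default rate belongs to the `big income` group (about 8.6%).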


# General Conclusion 

- Overall, we examined and preprocessed the data before reaching the final conclusion. We handled the missing values and duplicates, and considered possible reasons and solutions for problematic artifacts.

- For the bank's loan division, the analysis suggests the following:
    - clients with children tend to repay late more often than clients without children
    - the wealthiest clients are not the most disciplined about paying on time
    - `unmarried` clients are the most likely to repay late
    - loans for a `vehicle` are the most likely to be repaid late