# Analyzing borrowers’ risk of defaulting

This report is for a bank’s loan division. It analyzes whether **a customer’s marital status and number of children affect the likelihood that the customer will default on a loan.**


 ## Contents <a id='back'></a>
 * [Introduction](#intro)
 * [Data exploration](#data_expolration)
 * [Data transformation](#data_transformation)
 * [Working with missing values](#working_with_missing_values)
 * [Categorization of data](#Categorization_of_data)
 * [Checking the Hypotheses](#checking_the_hypotheses)
 * [General conclusion](#general_conclution)


## Introduction <a id='data_review'></a>

A bank has provided data on customers’ creditworthiness. This data will be used to look for a relationship between a customer’s marital status and number of children and the likelihood of defaulting on a loan.

### **The Purpose of this project:** <a id='data_review'></a>

To prepare a report for a bank’s loan division on any relationships in the data that affect whether a customer will default on a loan. The two focus areas are:

- customer’s marital status 
- and number of children

### **The Hypotheses being tested:** <a id='data_review'></a>

A customer’s marital status and number of children have an effect on whether the customer defaults on a loan. Specifically:
- Married people have a higher chance of defaulting on a loan
- People with children have a higher chance of defaulting on a loan

The reasoning in both cases is that their expenses are higher than those of unmarried people and people without children.


### **Task decomposition:** <a id='data_review'></a>

- Data Transformation
- Working with missing values
- Categorization of data
- Checking the Hypotheses 
- Concluding the study

### Open the data file and have a look at the general information. 

In [101]:
# Loading all the libraries
import pandas as pd
import numpy as np

In [102]:
# Load the data
df = pd.read_csv('/datasets/credit_scoring_eng.csv')
df.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


## Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan

In [103]:
# Let's see how many rows and columns our dataset has


rows = len(df.axes[0])
 
cols = len(df.axes[1])
 
print("Number of Rows: ", rows)
print("Number of Columns: ", cols)

Number of Rows:  21525
Number of Columns:  12


In [104]:
# let's print the first N rows

df.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


**Issues with the data:**
- The `days_employed` column contains negative values

There appears to be a relationship between the sign of `days_employed` (work experience) and `income_type`: in the rows shown above, the retiree has a positive `days_employed` value, whereas customers with the `employee` or `business` income type have negative values.
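One way to check this relationship across the whole dataset is to tabulate the sign of `days_employed` against `income_type`. The sketch below uses a small hypothetical frame (`sample`) in place of the real data; only the column names come from the dataset, and the `sign` helper series is introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the real dataset
sample = pd.DataFrame({
    "income_type": ["employee", "retiree", "business", "retiree", "employee"],
    "days_employed": [-8437.67, 340266.07, -926.19, np.nan, -152.78],
})

# Label each value as positive, negative, or missing
sign = pd.Series("positive", index=sample.index, name="sign")
sign[sample["days_employed"] < 0] = "negative"
sign[sample["days_employed"].isna()] = "missing"

# Rows: income type; columns: sign of days_employed
sign_by_income = pd.crosstab(sample["income_type"], sign)
print(sign_by_income)
```

If the pattern holds on the real data, each income type should have its counts concentrated in a single sign column.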


In [105]:
# Get info on data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


Two columns have missing values:

- `days_employed`
- `total_income`

Both have a non-null count of 19351, which is less than the 21525 rows in the dataset.

In [106]:
# Let's look at the rows where the first column with missing data, days_employed, is empty

df[df["days_employed"].isna()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


There is a symmetrical relationship between the missing values. For every missing value in days_employed the corresponding value in total_income is also missing.
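This claim can be verified directly instead of by eye. The following is a minimal sketch on a hypothetical four-row frame (`sample`); on the real data the same comparison would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; NaN marks a missing value
sample = pd.DataFrame({
    "days_employed": [-8437.67, np.nan, -926.19, np.nan],
    "total_income": [40620.10, np.nan, 40922.17, np.nan],
})

missing_days = sample["days_employed"].isna()
missing_income = sample["total_income"].isna()

# True only if both columns are missing in exactly the same rows
same_rows = bool((missing_days == missing_income).all())
print(same_rows)
```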

In [107]:
# Count the missing values in each column of the filtered table
df[df["days_employed"].isna()].isna().sum()


children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

### **Intermediate conclusion**

The number of rows in the filtered table matches the number of missing values. For each column, the missing and non-missing counts add up to the full 21525 rows:

- columns without missing values: 0 + 21525 = 21525
- `days_employed` and `total_income`: 2174 + 19351 = 21525


In [108]:
percent_missing = df.isnull().sum() * 100 / len(df)
print(round(percent_missing,2))

children             0.0
days_employed       10.1
dob_years            0.0
education            0.0
education_id         0.0
family_status        0.0
family_status_id     0.0
gender               0.0
income_type          0.0
debt                 0.0
total_income        10.1
purpose              0.0
dtype: float64


The missing values occur in the same rows: wherever `days_employed` is missing, `total_income` is missing as well.

At 10.1% of rows (2174 of 21525), the gap is large enough that it should be analyzed and corrected where possible.

The two 10.1% figures describe the same 2174 rows, so they should not be added together:
- 10.1% of the values in `days_employed` are missing
- 10.1% of the values in `total_income` are missing
- overall, 10.1% of rows have at least one missing value
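Per-column percentages and the share of affected rows are different quantities, and it helps to compute both. The sketch below does so on a hypothetical frame (`sample`) in which one of four rows lacks both values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 1 of 4 rows lacks both values
sample = pd.DataFrame({
    "children": [1, 0, 2, 0],
    "days_employed": [-8437.67, np.nan, -926.19, -152.78],
    "total_income": [40620.10, np.nan, 40922.17, 21731.83],
})

# Share of missing values in each column
pct_missing_by_column = sample.isna().mean() * 100

# Share of rows that have at least one missing value
pct_rows_with_missing = sample.isna().any(axis=1).mean() * 100

print(pct_missing_by_column.round(2))
print(round(pct_rows_with_missing, 2))
```

Here both columns show 25% missing, yet only 25% of rows are affected, because the gaps fall in the same row.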

In [109]:
# Build a boolean mask that marks every missing value in the dataset
df_na=df.isna()
print(df_na)
print()
print(df_na.info())


       children  days_employed  dob_years  education  education_id  \
0         False          False      False      False         False   
1         False          False      False      False         False   
2         False          False      False      False         False   
3         False          False      False      False         False   
4         False          False      False      False         False   
...         ...            ...        ...        ...           ...   
21520     False          False      False      False         False   
21521     False          False      False      False         False   
21522     False          False      False      False         False   
21523     False          False      False      False         False   
21524     False          False      False      False         False   

       family_status  family_status_id  gender  income_type   debt  \
0              False             False   False        False  False   
1              Fals

In [110]:
# Checking the distribution in the whole dataset
display(df.describe())

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,26787.568355
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,16475.450632
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,3306.762
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,16488.5045
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,23202.87
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,32549.611
max,20.0,401755.400475,75.0,4.0,4.0,1.0,362496.645


**Description of the findings:**

- The missing data affects the `days_employed` and `total_income` columns
- The negative numbers in `days_employed` also distort the statistics for that column
- A significant portion of the rows, about 10.1%, has missing values

**Averages (mean) of the columns without missing values**

- children: 0.54
- dob_years: 43.29
- education_id: 0.82

**Possible reasons for missing values in data**

- There is a pattern to the missing data: both values are absent in the same rows
- Perhaps that data was unavailable at the time

**Intermediate conclusion**

The missing values in `days_employed` and `total_income` appear together in the same rows, but the data alone does not explain why.

In [111]:
# Check for other reasons and patterns that could lead to missing values

df.groupby("family_status")["total_income"].count()


family_status
civil partnership     3735
divorced              1083
married              11143
unmarried             2525
widow / widower        865
Name: total_income, dtype: int64

In [112]:
df.groupby("family_status")["days_employed"].count()

family_status
civil partnership     3735
divorced              1083
married              11143
unmarried             2525
widow / widower        865
Name: days_employed, dtype: int64

In [113]:
df.groupby("family_status")["debt"].count()

family_status
civil partnership     4177
divorced              1195
married              12380
unmarried             2813
widow / widower        960
Name: debt, dtype: int64

In [114]:
df.groupby("gender")["total_income"].count()

gender
F      12752
M       6598
XNA        1
Name: total_income, dtype: int64

In [115]:
df.groupby("gender")["days_employed"].count()

gender
F      12752
M       6598
XNA        1
Name: days_employed, dtype: int64

In [116]:
df.groupby("gender")["debt"].count()

gender
F      14236
M       7288
XNA        1
Name: debt, dtype: int64

**Comparing the distribution of each column in the subset with missing values against the whole dataset to establish a pattern:**

In [117]:
# Rows where total_income is missing
df_na = df[df['total_income'].isna()]

In [118]:
df_na["family_status"].value_counts(normalize=True)

married              0.568997
civil partnership    0.203312
unmarried            0.132475
divorced             0.051518
widow / widower      0.043698
Name: family_status, dtype: float64

In [119]:
df["family_status"].value_counts(normalize=True)

married              0.575145
civil partnership    0.194053
unmarried            0.130685
divorced             0.055517
widow / widower      0.044599
Name: family_status, dtype: float64

Comparing the distribution of `family_status` in the rows with missing values against the whole dataset shows only small differences, all under one percentage point. No formal significance test was run, but there is no observable pattern.
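Placing the two distributions side by side makes the comparison easier to read. The sketch below is one way to do that, using hypothetical data: `full` stands in for the whole dataset and `subset` for the rows with missing income. The column names `missing_rows`, `all_rows`, and `diff_pp` are illustrative.

```python
import pandas as pd

# Hypothetical data: `full` stands in for df, `subset` for the rows with missing income
full = pd.DataFrame(
    {"family_status": ["married"] * 6 + ["unmarried"] * 3 + ["divorced"]}
)
subset = full.iloc[[0, 1, 2, 6, 9]]

comparison = pd.concat(
    [
        subset["family_status"].value_counts(normalize=True).rename("missing_rows"),
        full["family_status"].value_counts(normalize=True).rename("all_rows"),
    ],
    axis=1,
)

# Difference between the two shares, in percentage points
comparison["diff_pp"] = (comparison["missing_rows"] - comparison["all_rows"]) * 100
print(comparison.round(3))
```

The same pattern can be reused for `debt`, `gender`, or any other categorical column.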

In [120]:
df_na["debt"].value_counts(normalize=True)

0    0.921803
1    0.078197
Name: debt, dtype: float64

In [121]:
df["debt"].value_counts(normalize=True)

0    0.919117
1    0.080883
Name: debt, dtype: float64

The share of customers with debt is 7.8% in the rows with missing values and 8.1% in the whole dataset (the mean of `debt` reported by `describe()` above). The difference is small, so there is no observable pattern.

In [122]:
df_na["days_employed"].value_counts(normalize=True)

Series([], Name: days_employed, dtype: float64)

In [123]:
display(df["days_employed"].value_counts(normalize=True))

-327.685916     0.000052
-1580.622577    0.000052
-4122.460569    0.000052
-2828.237691    0.000052
-2636.090517    0.000052
                  ...   
-7120.517564    0.000052
-2146.884040    0.000052
-881.454684     0.000052
-794.666350     0.000052
-3382.113891    0.000052
Name: days_employed, Length: 19351, dtype: float64

Because `days_employed` is missing in every row of `df_na`, its distribution there is empty and cannot be compared with the whole dataset.

In [124]:
df_na["total_income"].value_counts(normalize=True)

Series([], Name: total_income, dtype: float64)

In [125]:
df["total_income"].value_counts(normalize=True)

42413.096    0.000103
17312.717    0.000103
31791.384    0.000103
14427.878    0.000052
20837.034    0.000052
               ...   
27715.458    0.000052
23834.534    0.000052
26124.613    0.000052
28692.182    0.000052
41428.916    0.000052
Name: total_income, Length: 19348, dtype: float64

`df_na` was built from the rows where `total_income` is missing, so its distribution there is empty and cannot be compared with the whole dataset.

In [126]:
df_na["gender"].value_counts(normalize=True)

F    0.682613
M    0.317387
Name: gender, dtype: float64

In [127]:
df["gender"].value_counts(normalize=True)

F      0.661370
M      0.338583
XNA    0.000046
Name: gender, dtype: float64

Comparing the distribution of `gender`: women make up 68.3% of the rows with missing values and 66.1% of the whole dataset, a difference of about two percentage points. No formal significance test was run, but the difference is small and there is no clear pattern.

In [128]:
df[df['total_income'].isna()==True]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


### **Conclusions**

There are no notable patterns in the missing data.
Because about 10% of rows are affected, dropping them would discard a lot of data, so it is necessary to replace the missing values with a mean or median value.

The next steps:

- removing duplicates
- replacing missing values with mean or median values
- correcting any data that may be spelled inconsistently or labelled inefficiently

Possible reasons why data is missing in the `total_income` and `days_employed` columns:

- The data may have been unknown for those columns and was therefore left empty
- These could have been optional fields, so completing them was not mandatory, and they may have been left empty for confidentiality reasons.


## Data transformation

At this point the columns will be reviewed, which includes:
- removing duplicates
- fixing educational information where required


In [129]:
# Let's see all values in education column to check if and what spellings will need to be fixed

In [130]:
print (df['education'].unique())

["bachelor's degree" 'secondary education' 'Secondary Education'
 'SECONDARY EDUCATION' "BACHELOR'S DEGREE" 'some college'
 'primary education' "Bachelor's Degree" 'SOME COLLEGE' 'Some College'
 'PRIMARY EDUCATION' 'Primary Education' 'Graduate Degree'
 'GRADUATE DEGREE' 'graduate degree']


In [131]:
# Normalize letter case so that each education level has a single spelling
df['education'] = df['education'].str.lower()

In [132]:
# Checking all the values in the column to make sure we fixed them

print (df['education'].unique())

["bachelor's degree" 'secondary education' 'some college'
 'primary education' 'graduate degree']
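After normalizing the case, it is worth confirming that each `education_id` maps to exactly one label. The sketch below shows the check on a hypothetical frame (`sample`); the identifier values in it are illustrative and not taken from the full dataset.

```python
import pandas as pd

# Hypothetical rows with mixed-case labels, as in the raw data
sample = pd.DataFrame({
    "education": [
        "bachelor's degree",
        "Secondary Education",
        "SECONDARY EDUCATION",
        "some college",
    ],
    "education_id": [0, 1, 1, 2],
})
sample["education"] = sample["education"].str.lower()

# Number of distinct labels per identifier; all ones means a one-to-one mapping
labels_per_id = sample.groupby("education_id")["education"].nunique()
print(labels_per_id)
```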


### **Checking the data in the `children` column**

In [133]:
# Let's see the distribution of values in the `children` column
print (df['children'].unique())

[ 1  0  3  2 -1  4 20  5]


The column contains a negative number. A customer cannot have -1 children, so the assumption is that the value should be 1. There is also an outlier of 20, which is assumed to be a data-entry error for 2.

In [134]:
# Replace -1 with 1 and 20 with 2, based on the assumptions above
df.loc[df['children'] == -1, 'children'] = 1
df.loc[df['children'] == 20, 'children'] = 2
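Before overwriting values, it can be useful to measure how many rows the correction touches. The sketch below does that on a hypothetical series (`children`) and applies the same mapping in one call; the mapping itself rests on the assumption stated above that -1 and 20 are entry errors.

```python
import pandas as pd

# Hypothetical values, including the two suspicious codes -1 and 20
children = pd.Series([1, 0, 3, 2, -1, 4, 20, 5, 0, 0], name="children")

# Share of rows affected by the suspicious values
suspicious_share = children.isin([-1, 20]).mean() * 100

# Same correction as above, expressed as a single mapping
children_fixed = children.replace({-1: 1, 20: 2})

print(f"{suspicious_share:.1f}% of rows affected")
print(sorted(children_fixed.unique()))
```

If the affected share were large, the assumption behind the mapping would deserve a closer look before being applied.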

In [135]:
# Checking the `children` column again to make sure it's all fixed
print (df['children'].unique())

[1 0 3 2 4 5]


### **Checking the data in the `days_employed` column**

In [136]:
# Find problematic data in `days_employed`, if they exist, and calculate the percentage
print (df['days_employed'])

0         -8437.673028
1         -4024.803754
2         -5623.422610
3         -4124.747207
4        340266.072047
             ...      
21520     -4529.316663
21521    343937.404131
21522     -2113.346888
21523     -3112.481705
21524     -1984.507589
Name: days_employed, Length: 21525, dtype: float64


In [137]:
df['days_employed'].value_counts()

-327.685916     1
-1580.622577    1
-4122.460569    1
-2828.237691    1
-2636.090517    1
               ..
-7120.517564    1
-2146.884040    1
-881.454684     1
-794.666350     1
-3382.113891    1
Name: days_employed, Length: 19351, dtype: int64

In [138]:
print (df['days_employed'].describe())

count     19351.000000
mean      63046.497661
std      140827.311974
min      -18388.949901
25%       -2747.423625
50%       -1203.369529
75%        -291.095954
max      401755.400475
Name: days_employed, dtype: float64


Issues observed with the `days_employed` column:
- it contains negative values, which is not possible for this type of data (the assumption is that the sign is an error and the values should be positive)
- the maximum value of 401755.40 days is over a thousand years, which cannot be correct

In [139]:
# Address the problematic values, if they exist
#convert the negative numbers in days_employed to positive

df['days_employed'] = df['days_employed'].abs()                  

In [140]:
# Check the result - make sure it's fixed
print(df['days_employed'])

0          8437.673028
1          4024.803754
2          5623.422610
3          4124.747207
4        340266.072047
             ...      
21520      4529.316663
21521    343937.404131
21522      2113.346888
21523      3112.481705
21524      1984.507589
Name: days_employed, Length: 21525, dtype: float64


In [141]:
print (df['days_employed'].describe())

count     19351.000000
mean      66914.728907
std      139030.880527
min          24.141633
25%         927.009265
50%        2194.220567
75%        5537.882441
max      401755.400475
Name: days_employed, dtype: float64


The maximum `days_employed` is 401755 days, which is about 1100 years, so there must be an error in the data. Let's assume that some values were entered as hours and not days. To correct this, we take all values above 10320 days and divide them by 24.

**10320 = (65-17) * 215**

17 minimum working age

65 maximum working age

215 official number of working days per calendar year, according to the EU
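The threshold arithmetic and the conversion can be sketched as follows. The constant names are introduced here for illustration, and the three values in `days` are taken from the outputs shown earlier (two ordinary entries and the observed maximum).

```python
import pandas as pd

MIN_WORKING_AGE = 17
MAX_WORKING_AGE = 65
WORKING_DAYS_PER_YEAR = 215

# Upper bound on plausible working days: (65 - 17) * 215 = 10320
max_plausible_days = (MAX_WORKING_AGE - MIN_WORKING_AGE) * WORKING_DAYS_PER_YEAR

# Two ordinary entries and the observed maximum
days = pd.Series([8437.673028, 152.779569, 401755.400475])

# Treat values above the bound as hours and convert them to days
converted = days.where(days <= max_plausible_days, days / 24)

print(max_plausible_days)
print(converted.round(2).tolist())
```

The converted maximum, about 16740 days, is still above the 10320-day bound, so the bound acts as a trigger for the conversion rather than as a hard cap on the result.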


In [142]:
df.loc[df['days_employed'] > 10320, 'days_employed'] /= 24

In [143]:
# Check the result - make sure it's fixed
print (df['days_employed'].describe())

count    19351.000000
mean      4504.800783
std       5308.895662
min         24.141633
25%        884.297961
50%       2103.410292
75%       5203.869932
max      16739.808353
Name: days_employed, dtype: float64


### **Checking the data in the `dob_years` column**

In [144]:
# Check the `dob_years` for suspicious values and count the percentage
print (df['dob_years'].describe())
print ()
print (df['dob_years'].unique())


count    21525.000000
mean        43.293380
std         12.574584
min          0.000000
25%         33.000000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

[42 36 33 32 53 27 43 50 35 41 40 65 54 56 26 48 24 21 57 67 28 63 62 47
 34 68 25 31 30 20 49 37 45 61 64 44 52 46 23 38 39 51  0 59 29 60 55 58
 71 22 73 66 69 19 72 70 74 75]


The 0 ages can be replaced by the mean age: a 0-year-old infant cannot take out a loan, and in most countries neither can anyone under the age of 18. Note that the mean used below (43.29) is calculated over all rows, including the zeros themselves.

In [145]:
# Address the issues in the `dob_years` column, if they exist
age_avg = df['dob_years'].mean()
print ()
df.loc[df['dob_years'] == 0, 'dob_years'] = age_avg 
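As an alternative, the replacement value could be computed from the known ages only, and kept as a whole number so the column stays integer-typed. This is a sketch of that variant on a hypothetical series (`ages`), not the approach used in the cell above; the median is swapped in for the mean here.

```python
import pandas as pd

# Hypothetical ages; 0 marks an unknown age
ages = pd.Series([42, 36, 0, 53, 27, 0, 65], name="dob_years")

# Median of the known ages only, so the zeros do not pull the estimate down
known_median = int(ages[ages > 0].median())

# Replace the zeros with the whole-number median
ages_fixed = ages.replace(0, known_median)

print(known_median)
print(ages_fixed.tolist())
```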




In [146]:
# Check the result - make sure it's fixed
print (df['dob_years'].describe())
print ()
print (df['dob_years'].unique())

count    21525.000000
mean        43.496522
std         12.218174
min         19.000000
25%         34.000000
50%         43.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

[42.         36.         33.         32.         53.         27.
 43.         50.         35.         41.         40.         65.
 54.         56.         26.         48.         24.         21.
 57.         67.         28.         63.         62.         47.
 34.         68.         25.         31.         30.         20.
 49.         37.         45.         61.         64.         44.
 52.         46.         23.         38.         39.         51.
 43.29337979 59.         29.         60.         55.         58.
 71.         22.         73.         66.         69.         19.
 72.         70.         74.         75.        ]


### **Checking the `family_status` column**

In [147]:
# Let's see the values for the column
print (df['family_status'].describe())
print ()
print (df['family_status'].unique())

count       21525
unique          5
top       married
freq        12380
Name: family_status, dtype: object

['married' 'civil partnership' 'widow / widower' 'divorced' 'unmarried']


There are no problematic values in the `family_status` column and there is no missing data.

Perhaps 'married' and 'civil partnership' could fall under one category, but I see them as different categories, so I am keeping them as they are.

In [148]:
# Check the result - make sure it's fixed
print (df['family_status'].describe())
print ()
print (df['family_status'].unique())

count       21525
unique          5
top       married
freq        12380
Name: family_status, dtype: object

['married' 'civil partnership' 'widow / widower' 'divorced' 'unmarried']


### **Checking the `gender` column.**

In [149]:
# Let's see the values in the column
print (df['gender'].describe())
print ()
print (df['gender'].unique())

count     21525
unique        3
top           F
freq      14236
Name: gender, dtype: object

['F' 'M' 'XNA']


In [150]:
# Address the problematic values, if they exist
print (df['gender'].value_counts())

F      14236
M       7288
XNA        1
Name: gender, dtype: int64


We have no way to determine what the value XNA is so it can be declared as an unknown value.

In [151]:
df.loc[df['gender'] == 'XNA', 'gender'] = 'unkown'

In [152]:
# Check the result - make sure it's fixed
print (df['gender'].describe())
print ()
print (df['gender'].unique())
print ()
print (df['gender'].value_counts())

count     21525
unique        3
top           F
freq      14236
Name: gender, dtype: object

['F' 'M' 'unkown']

F         14236
M          7288
unkown        1
Name: gender, dtype: int64


### **Checking the `income_type` column**

In [153]:
# Let's see the values in the column
print (df['income_type'].describe())
print ()
print (df['income_type'].unique())
print ()
print (df['income_type'].value_counts())

count        21525
unique           8
top       employee
freq         11119
Name: income_type, dtype: object

['employee' 'retiree' 'business' 'civil servant' 'unemployed'
 'entrepreneur' 'student' 'paternity / maternity leave']

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64


No issues were found in the `income_type` column.

### **Addressing duplicates in the dataset**

In [154]:
# Checking duplicates
print(df.duplicated().sum()) 

71


In [155]:
df = df.drop_duplicates().reset_index(drop=True)
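Since the dataset has no customer identifier, fully identical rows may be repeated records or, less likely, different customers with the same attributes. Inspecting them before dropping is one way to judge this. The sketch below uses a hypothetical frame (`sample`) with three of the dataset's columns.

```python
import pandas as pd

# Hypothetical rows; the first and third are identical
sample = pd.DataFrame({
    "children": [0, 1, 0],
    "dob_years": [41, 36, 41],
    "education": ["secondary education", "bachelor's degree", "secondary education"],
})

# keep=False marks every copy, so duplicated rows can be inspected together
all_copies = sample[sample.duplicated(keep=False)]
print(all_copies)

deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(len(sample) - len(deduplicated), "row(s) removed")
```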

In [156]:
# Last check whether we have any duplicates
print(df.duplicated().sum()) 

0


In [157]:
# Check the size of the dataset that you now have after your first manipulations with it

rows = len(df.axes[0])
 
cols = len(df.axes[1])
 
print("Number of Rows: ", rows)
print("Number of Columns: ", cols)

Number of Rows:  21454
Number of Columns:  12


**Brief Description of the new dataset:** 

The dataset is smaller because the duplicates have been removed:

- Number of rows: 21525 (original)
- Duplicates: 71
- Number of rows: 21454 (original - duplicates)

## Working with missing values

### Restoring missing values in `total_income`

Currently there are 2 columns in the dataset that have missing values, namely:
- days_employed
- total_income


In [158]:
# Let's write a function that calculates the age category

def age_category(dob_years):
    """Return the ten-year age bracket for an age given in years."""
    if dob_years < 21:
        return '11-20'
    elif dob_years < 31:
        return '21-30'
    elif dob_years < 41:
        return '31-40'
    elif dob_years < 51:
        return '41-50'
    elif dob_years < 61:
        return '51-60'
    elif dob_years < 71:
        return '61-70'
    elif dob_years < 81:
        return '71-80'
    # Ages of 81 and above do not occur in this dataset (the maximum is 75)
    return '81+'
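The same bracketing can also be expressed with `pd.cut`, which avoids the chain of comparisons. This is a sketch of an alternative, not the method used in this notebook; `bin_edges`, `bin_labels`, and `ages` are illustrative names. One difference to keep in mind: `pd.cut` returns `NaN` for ages outside the listed bins, whereas the function above labels any age under 21 as '11-20'.

```python
import pandas as pd

# Left-closed bins [11, 21), [21, 31), ... mirror the `<` comparisons above
bin_edges = [11, 21, 31, 41, 51, 61, 71, 81]
bin_labels = ["11-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80"]

# Hypothetical ages, including a non-integer value
ages = pd.Series([19, 21, 43.29, 54, 75])

age_groups = pd.cut(ages, bins=bin_edges, labels=bin_labels, right=False)
print(age_groups.tolist())
```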


In [159]:
# Test if the function works
print(age_category(54))

51-60


In [160]:
# Creating new column based on function
df['age_category'] = df['dob_years'].apply(age_category)

In [161]:
# Checking the counts of values in the new column
df['age_category'].value_counts()

31-40    5732
41-50    5361
51-60    4518
21-30    3652
61-70    2022
71-80     104
11-20      65
Name: age_category, dtype: int64

In [162]:
df.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21454.0,19351.0,21454.0,21454.0,21454.0,21454.0,19351.0
mean,0.480563,4504.800783,43.475046,0.817097,0.973898,0.08115,26787.568355
std,0.756069,5308.895662,12.21347,0.548674,1.421567,0.273072,16475.450632
min,0.0,24.141633,19.0,0.0,0.0,0.0,3306.762
25%,0.0,884.297961,33.0,1.0,0.0,0.0,16488.5045
50%,0.0,2103.410292,43.0,1.0,0.0,0.0,23202.87
75%,1.0,5203.869932,53.0,1.0,1.0,0.0,32549.611
max,5.0,16739.808353,75.0,4.0,4.0,1.0,362496.645


In [163]:
# Create a table without missing values and print a few of its rows to make sure it looks fine
df_no_na = df.dropna()
display(df_no_na.head())

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category
0,1,8437.673028,42.0,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,41-50
1,1,4024.803754,36.0,secondary education,1,married,0,F,employee,0,17932.802,car purchase,31-40
2,0,5623.42261,33.0,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,31-40
3,3,4124.747207,32.0,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,31-40
4,0,14177.753002,53.0,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,51-60


In [164]:
# Look at the mean values for income based on your identified factors
mean_income_by_age_category = df_no_na.groupby('age_category')['total_income'].mean()
print(mean_income_by_age_category)

age_category
11-20    19586.303559
21-30    25928.848368
31-40    28376.735148
41-50    28332.806009
51-60    25482.856294
61-70    23245.390243
71-80    19575.454327
Name: total_income, dtype: float64


In [165]:
# Look at the median values for income based on your identified factors
median_income_by_age_category = df_no_na.groupby('age_category')['total_income'].median()
print(median_income_by_age_category)

age_category
11-20    17257.2770
21-30    23079.3820
31-40    24825.1865
41-50    24563.6500
51-60    22056.7710
61-70    19705.8550
71-80    18611.5935
Name: total_income, dtype: float64


There are notable differences between the mean and the median. In every age category the median is lower than the mean, which suggests that the income distribution is skewed to the right by a small number of high incomes. Both measures follow the same trend: they rise to a peak in the 31-40 category and then decline with age.

In this case, using the median rather than the mean to replace the missing data may be beneficial, as the mean is more susceptible to being affected by outliers.
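A group-wise median fill could look like the sketch below. It assumes `age_category` is the only grouping factor, which is a simplification, and it runs on a hypothetical frame (`sample`) rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; NaN marks a missing income
sample = pd.DataFrame({
    "age_category": ["31-40", "31-40", "31-40", "51-60", "51-60", "51-60"],
    "total_income": [20000.0, 30000.0, np.nan, 18000.0, np.nan, 26000.0],
})

# Median income within each age category, aligned to the original rows
group_median = sample.groupby("age_category")["total_income"].transform("median")

# Fill only the missing entries; observed incomes stay unchanged
sample["total_income"] = sample["total_income"].fillna(group_median)
print(sample)
```

Using `transform` keeps the result aligned with the original index, so `fillna` can match each missing value to the median of its own group.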
 


***The below explores the correlation between income and gender***

In [166]:
# Look at the mean values for income based on your identified factors
mean_income_by_gender = df_no_na.groupby('gender')['total_income'].mean()
print(mean_income_by_gender)

gender
F         24655.604757
M         30907.144369
unkown    32624.825000
Name: total_income, dtype: float64


In [167]:
# Look at the median values for income based on your identified factors
median_income_by_gender = df_no_na.groupby('gender')['total_income'].median()
print(median_income_by_gender)

gender
F         21464.845
M         26834.295
unkown    32624.825
Name: total_income, dtype: float64


In [168]:
print(median_income_by_gender.describe())

count        3.000000
mean     26974.655000
std       5581.313833
min      21464.845000
25%      24149.570000
50%      26834.295000
75%      29729.560000
max      32624.825000
Name: total_income, dtype: float64


Observation: Females earn less than males on both the mean and the median of total income.
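The mean and median per group can also be placed side by side in one table, which makes the gap between them easier to read. This is a sketch on a small made-up frame (`toy` and its values are assumptions), not a replacement for the cells above.

```python
import pandas as pd

# Toy data standing in for df_no_na; the values are illustrative only
toy = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M', 'M'],
    'total_income': [20000.0, 22000.0, 30000.0, 25000.0, 27000.0, 41000.0],
})

# One groupby call returns both statistics as columns
income_by_gender = toy.groupby('gender')['total_income'].agg(['mean', 'median'])
# A positive gap means the mean sits above the median in that group
income_by_gender['mean_minus_median'] = income_by_gender['mean'] - income_by_gender['median']
print(income_by_gender)
```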

***The below explores the correlation between marital status and income***

In [169]:
mean_income_by_fstatus = df_no_na.groupby('family_status')['total_income'].mean()
print(mean_income_by_fstatus)

family_status
civil partnership    26694.428597
divorced             27189.354550
married              27041.784689
unmarried            26934.069805
widow / widower      22984.208556
Name: total_income, dtype: float64


In [170]:
median_income_by_fstatus = df_no_na.groupby('family_status')['total_income'].median()
print(median_income_by_fstatus)

family_status
civil partnership    23186.534
divorced             23515.096
married              23389.540
unmarried            23149.028
widow / widower      20514.190
Name: total_income, dtype: float64


In [171]:
print(median_income_by_fstatus.describe())

count        5.000000
mean     22750.877600
std       1259.266759
min      20514.190000
25%      23149.028000
50%      23186.534000
75%      23389.540000
max      23515.096000
Name: total_income, dtype: float64


***The below explores the correlation between children and income***

In [172]:
mean_income_by_children = df_no_na.groupby('children')['total_income'].mean()
print(mean_income_by_children)
print()
median_income_by_children = df_no_na.groupby('children')['total_income'].median()
print(median_income_by_children)
print()
print(median_income_by_children.describe())

children
0    26422.404866
1    27368.627863
2    27478.854282
3    29322.623993
4    27289.829647
5    27268.847250
Name: total_income, dtype: float64

children
0    23029.9535
1    23660.5630
2    23136.1155
3    25155.4480
4    24981.6340
5    29816.2255
Name: total_income, dtype: float64

count        6.000000
mean     24963.323250
std       2544.079341
min      23029.953500
25%      23267.227375
50%      24321.098500
75%      25111.994500
max      29816.225500
Name: total_income, dtype: float64


***The below explores the correlation between education and income***

In [173]:
mean_income_by_education = df_no_na.groupby('education')['total_income'].mean()
print(mean_income_by_education)
print()
median_income_by_education = df_no_na.groupby('education')['total_income'].median()
print(median_income_by_education)
print()
print(median_income_by_education.describe())

education
bachelor's degree      33142.802434
graduate degree        27960.024667
primary education      21144.882211
secondary education    24594.503037
some college           29045.443644
Name: total_income, dtype: float64

education
bachelor's degree      28054.5310
graduate degree        25161.5835
primary education      18741.9760
secondary education    21836.5830
some college           25618.4640
Name: total_income, dtype: float64

count        5.000000
mean     23882.627500
std       3628.575187
min      18741.976000
25%      21836.583000
50%      25161.583500
75%      25618.464000
max      28054.531000
Name: total_income, dtype: float64


***The below explores the correlation between number of days employed and income***

In [174]:
mean_income_by_days_employed = df_no_na.groupby('days_employed')['total_income'].mean()
print(mean_income_by_days_employed)
print()
median_income_by_days_employed = df_no_na.groupby('days_employed')['total_income'].median()
print(median_income_by_days_employed)
print()
print(median_income_by_days_employed.describe())

days_employed
24.141633       26712.386
24.240695       19858.460
30.195337       37033.790
33.520665       20568.944
34.701045       14489.279
                  ...    
16735.993752     7725.831
16736.436110    52063.316
16736.462226    20194.323
16738.158823     9182.441
16739.808353    28204.551
Name: total_income, Length: 19351, dtype: float64

days_employed
24.141633       26712.386
24.240695       19858.460
30.195337       37033.790
33.520665       20568.944
34.701045       14489.279
                  ...    
16735.993752     7725.831
16736.436110    52063.316
16736.462226    20194.323
16738.158823     9182.441
16739.808353    28204.551
Name: total_income, Length: 19351, dtype: float64

count     19351.000000
mean      26787.568355
std       16475.450632
min        3306.762000
25%       16488.504500
50%       23202.870000
75%       32549.611000
max      362496.645000
Name: total_income, dtype: float64
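Because `days_employed` is continuous, grouping on its raw values gives one group per row (19351 groups above), so the mean and median come out identical. One possible alternative is to bin the values first. The sketch below uses a toy frame, and the bin edges and labels are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy data standing in for df_no_na; the values are illustrative only
toy = pd.DataFrame({
    'days_employed': [120.5, 800.0, 1500.2, 2900.7, 4100.0, 9000.3],
    'total_income': [15000.0, 21000.0, 24000.0, 26000.0, 30000.0, 22000.0],
})

# Hypothetical bin edges in days (roughly 0-2, 2-5, 5-10 and 10+ years)
edges = [0, 730, 1825, 3650, float('inf')]
labels = ['0-2y', '2-5y', '5-10y', '10y+']
toy['employed_band'] = pd.cut(toy['days_employed'], bins=edges, labels=labels)

# Median income per band instead of per unique days_employed value
median_income_by_band = toy.groupby('employed_band', observed=True)['total_income'].median()
print(median_income_by_band)
```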


***The below explores the correlation between marital status and debt***

In [175]:
mean_debt_by_fstatus = df_no_na.groupby('family_status')['debt'].mean()
print(mean_debt_by_fstatus)

family_status
civil partnership    0.090763
divorced             0.070175
married              0.075922
unmarried            0.100594
widow / widower      0.064740
Name: debt, dtype: float64


The characteristics that define income most are:
- Gender
- Education

The median is used for both characteristics, as the mean values are closer together and show little variation.

In [176]:
# Fill missing total_income values with the median of the matching age category

df['total_income'] = df['total_income'].fillna(df.groupby('age_category')['total_income'].transform('median'))

In [177]:
# Check if it works
print(df['total_income'].isnull().sum())
print(len(df['total_income']))

0
21454
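The fill step above relies on `groupby(...).transform('median')` returning one value per row, aligned with the original index. A minimal sketch on a toy frame (the data is made up) shows how each missing value receives the median of its own age category:

```python
import numpy as np
import pandas as pd

# Toy data; the values are illustrative only
toy = pd.DataFrame({
    'age_category': ['21-30', '21-30', '21-30', '31-40', '31-40'],
    'total_income': [20000.0, np.nan, 24000.0, 30000.0, np.nan],
})

# transform('median') broadcasts each group's median back to every row of that group
group_median = toy.groupby('age_category')['total_income'].transform('median')
# fillna only touches the rows that were missing
toy['total_income'] = toy['total_income'].fillna(group_median)
print(toy)
```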


In [178]:
# Checking the number of entries in the columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21454 entries, 0 to 21453
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21454 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21454 non-null  float64
 3   education         21454 non-null  object 
 4   education_id      21454 non-null  int64  
 5   family_status     21454 non-null  object 
 6   family_status_id  21454 non-null  int64  
 7   gender            21454 non-null  object 
 8   income_type       21454 non-null  object 
 9   debt              21454 non-null  int64  
 10  total_income      21454 non-null  float64
 11  purpose           21454 non-null  object 
 12  age_category      21454 non-null  object 
dtypes: float64(3), int64(4), object(6)
memory usage: 2.1+ MB


There are now no errors or missing values in the total_income column.

###  Restoring values in `days_employed`

The `days_employed` column has missing values. As with the `total_income` column, the data is skewed, so using the median may be beneficial, as the mean is more susceptible to outliers.

In [179]:
# Distribution of `days_employed` medians based on your identified parameters

median_employed_by_age_category = df_no_na.groupby('age_category')['days_employed'].median()
print(median_employed_by_age_category)


age_category
11-20      695.968951
21-30     1064.560075
31-40     1630.193189
41-50     2111.008029
51-60     5198.707579
61-70    14836.708905
71-80    15007.100954
Name: days_employed, dtype: float64


In [180]:
# Distribution of `days_employed` means based on your identified parameters
mean_employed_by_age_category = df_no_na.groupby('age_category')['days_employed'].mean()
print(mean_employed_by_age_category)

age_category
11-20      673.648361
21-30     1319.599497
31-40     2184.741981
41-50     3174.581770
51-60     7844.451510
61-70    12609.601992
71-80    13901.494417
Name: days_employed, dtype: float64


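The data is skewed: in the 21-30 to 51-60 age categories the mean is above the median, while in the 11-20, 61-70 and 71-80 categories it is below the median. Using the median to replace the missing values will therefore be better than using the mean, which is more affected by outliers.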
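One quick, informal way to check the direction of the skew is to compare the mean and the median in each group. The sketch below uses the values printed in the two outputs above. Treat the comparison as a rule of thumb, not a formal test.

```python
# Medians and means of days_employed per age category, copied from the outputs above
median_days = {
    '11-20': 695.968951, '21-30': 1064.560075, '31-40': 1630.193189,
    '41-50': 2111.008029, '51-60': 5198.707579, '61-70': 14836.708905,
    '71-80': 15007.100954,
}
mean_days = {
    '11-20': 673.648361, '21-30': 1319.599497, '31-40': 2184.741981,
    '41-50': 3174.581770, '51-60': 7844.451510, '61-70': 12609.601992,
    '71-80': 13901.494417,
}

direction = {}
for group, median_value in median_days.items():
    gap = mean_days[group] - median_value
    # Mean above median hints at a long right tail, mean below median at a long left tail
    direction[group] = 'mean above median' if gap > 0 else 'mean below median'
    print(f'{group}: mean - median = {gap:9.1f} ({direction[group]})')
```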

In [181]:
# Fill missing days_employed values with the median of the matching age category
df['days_employed'] = df['days_employed'].fillna(df.groupby('age_category')['days_employed'].transform('median'))


In [182]:
# Check that no missing values remain in days_employed
print(df['days_employed'].isnull().sum())
print(len(df['days_employed']))


0
21454


In [183]:
df[df["days_employed"].isna()].isna().sum()

children            0.0
days_employed       0.0
dob_years           0.0
education           0.0
education_id        0.0
family_status       0.0
family_status_id    0.0
gender              0.0
income_type         0.0
debt                0.0
total_income        0.0
purpose             0.0
age_category        0.0
dtype: float64

In [184]:
# Count the missing values in income_type
print(df['income_type'].isnull().sum())



0


In [185]:
# Check the number of entries in income_type
print(len(df['income_type']))


21454


There are no missing values in the `income_type` column.

In [186]:
# Check the entries in all columns - make sure we fixed all missing values
df.info()
print()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21454 entries, 0 to 21453
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21454 non-null  int64  
 1   days_employed     21454 non-null  float64
 2   dob_years         21454 non-null  float64
 3   education         21454 non-null  object 
 4   education_id      21454 non-null  int64  
 5   family_status     21454 non-null  object 
 6   family_status_id  21454 non-null  int64  
 7   gender            21454 non-null  object 
 8   income_type       21454 non-null  object 
 9   debt              21454 non-null  int64  
 10  total_income      21454 non-null  float64
 11  purpose           21454 non-null  object 
 12  age_category      21454 non-null  object 
dtypes: float64(3), int64(4), object(6)
memory usage: 2.1+ MB



## Categorization of data

**The Purpose of this project:** <a id='data_review'></a>

To prepare a report for a bank’s loan division on any correlations that may affect whether a customer will default on a loan. The two focus areas are:

- customer’s marital status 
- and number of children

The data will be categorized according to:

- Total Income
- Purpose of the loan

In [187]:
# Print the values for your selected data for categorization

print(df['total_income'])

0        40620.102
1        17932.802
2        23341.752
3        42820.568
4        25378.572
           ...    
21449    35966.698
21450    24959.969
21451    14347.610
21452    39054.888
21453    13127.587
Name: total_income, Length: 21454, dtype: float64


In [188]:
# Check the unique values
print(df['total_income'].unique())

[40620.102 17932.802 23341.752 ... 14347.61  39054.888 13127.587]


In [189]:
print(df['total_income'].describe())

count     21454.000000
mean      26448.546683
std       15689.532417
min        3306.762000
25%       17219.817250
50%       23234.830000
75%       31330.237250
max      362496.645000
Name: total_income, dtype: float64


In [190]:
# Compute the income quartiles once, then categorize each income against them
q25, q50, q75 = df['total_income'].quantile([0.25, 0.5, 0.75])

def income_category(total_income):
    if total_income < q25:
        return 'low'
    elif total_income < q50:
        return 'lower-mid'
    elif total_income < q75:
        return 'upper-mid'
    else:
        return 'high'

In [191]:
#testing the income category
print(income_category(25000))

upper-mid


In [192]:
# Create a column with the categories and count the values for them
df['income_category'] = df['total_income'].apply(income_category)
print(df['income_category'].value_counts())

high         5364
low          5364
upper-mid    5363
lower-mid    5363
Name: income_category, dtype: int64


In [193]:
# Getting summary statistics for the column
print(df['income_category'].describe())

count     21454
unique        4
top        high
freq       5364
Name: income_category, dtype: object


In [194]:
# Print the values for your selected data for categorization
df['purpose'].sort_values().unique()

array(['building a property', 'building a real estate',
       'buy commercial real estate', 'buy real estate',
       'buy residential real estate', 'buying a second-hand car',
       'buying my own car', 'buying property for renting out', 'car',
       'car purchase', 'cars', 'construction of own property',
       'education', 'getting an education', 'getting higher education',
       'going to university', 'having a wedding', 'housing',
       'housing renovation', 'housing transactions', 'profile education',
       'property', 'purchase of a car', 'purchase of my own house',
       'purchase of the house', 'purchase of the house for my family',
       'real estate transactions', 'second-hand car purchase',
       'supplementary education', 'to become educated', 'to buy a car',
       'to get a supplementary education', 'to have a wedding',
       'to own a car', 'transactions with commercial real estate',
       'transactions with my real estate', 'university education',
       'we

In [195]:
print(df['purpose'].value_counts())

wedding ceremony                            791
having a wedding                            768
to have a wedding                           765
real estate transactions                    675
buy commercial real estate                  661
housing transactions                        652
buying property for renting out             651
transactions with commercial real estate    650
housing                                     646
purchase of the house                       646
purchase of the house for my family         638
construction of own property                635
property                                    633
transactions with my real estate            627
building a real estate                      624
buy real estate                             621
purchase of my own house                    620
building a property                         619
housing renovation                          607
buy residential real estate                 606
buying my own car                       

The Purpose column can be grouped by similarities mainly:
- Buying Property
- Education
- Buying a car
- For a Wedding

For clarity we will label them respectively as:
- real estate
- education
- car
- wedding


In [196]:
# Grouping the Purpose data by similarity in the form of lists
property_list = ['building a property', 
                 'building a real estate', 
                 'buy commercial real estate', 
                 'buy real estate', 
                 'buy residential real estate',
                 'buying property for renting out',
                 'construction of own property',
                 'housing',
                 'housing renovation', 
                 'housing transactions',
                 'property',
                 'purchase of my own house',
                 'purchase of the house', 
                 'purchase of the house for my family',
                 'real estate transactions',
                 'transactions with commercial real estate',
                 'transactions with my real estate']
wedding_list = ['having a wedding',
                'to have a wedding',
                'wedding ceremony']
car_list = ['buying a second-hand car',
            'buying my own car',
            'car',
            'car purchase', 
            'cars',
            'purchase of a car',
            'second-hand car purchase',
            'to buy a car',
            'to own a car']
education_list = ['education', 
                  'getting an education', 
                  'getting higher education',
                  'going to university',
                  'profile education',
                  'supplementary education', 
                  'to become educated',
                  'to get a supplementary education',
                  'university education']

In [197]:
# Let's write a function to categorize the data based on common topics
def purpose_grouping(row):
    if 'car' in row['purpose']:
        return 'car'
    if 'hous' in row['purpose'] or 'prop' in row['purpose'] or 'real est' in row['purpose']:
        return 'real estate'
    if 'wedd' in row['purpose']:
        return 'wedding'
    if 'educ' in row['purpose'] or 'uni' in row['purpose']:
        return 'education'
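The substring checks above work for this data set. An alternative that uses explicit lists like those defined in the previous cell is to build a lookup dictionary and apply it with `Series.map`. The sketch below uses shortened sample lists with hypothetical `_sample` names so that it does not overwrite the real ones.

```python
import pandas as pd

# Shortened, illustrative versions of the purpose lists (hypothetical names)
wedding_sample = ['having a wedding', 'to have a wedding', 'wedding ceremony']
car_sample = ['car', 'cars', 'to buy a car']
real_estate_sample = ['housing', 'property', 'buy real estate']
education_sample = ['education', 'going to university']

# Invert the lists into a purpose -> group lookup
purpose_to_group = {}
for group, purposes in [('wedding', wedding_sample), ('car', car_sample),
                        ('real estate', real_estate_sample),
                        ('education', education_sample)]:
    for purpose in purposes:
        purpose_to_group[purpose] = group

sample_purposes = pd.Series(['wedding ceremony', 'to buy a car', 'housing', 'unknown purpose'])
# Purposes missing from the lookup become NaN, which makes unmatched values easy to spot
grouped = sample_purposes.map(purpose_to_group)
print(grouped)
```

The trade-off is that an exact-match lookup needs every purpose string listed, while the substring approach tolerates new wordings but can match unintended text.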

In [198]:
df.apply(purpose_grouping,axis=1)

0        real estate
1                car
2        real estate
3          education
4            wedding
            ...     
21449    real estate
21450            car
21451    real estate
21452            car
21453            car
Length: 21454, dtype: object

In [199]:
#testing the purpose_grouping
print(purpose_grouping({'purpose':'wedding ceremony'}))

wedding


In [200]:
 # Creating a column to store the purpose categories
df['purpose_grouping'] = df.apply(purpose_grouping,axis=1)

In [201]:
# Getting summary statistics for the column
print(df['purpose'].describe())

count                21454
unique                  38
top       wedding ceremony
freq                   791
Name: purpose, dtype: object


In [202]:
# Count the values in each category to see the distribution
df['purpose_grouping'].value_counts()

real estate    10811
car             4306
education       4013
wedding         2324
Name: purpose_grouping, dtype: int64

## Checking the Hypotheses


**Is there a correlation between having children and paying back on time?**

**Checking the correlation between having children and debt**

In [203]:
# Check the children data and paying back on time
df[['children', 'debt']]

Unnamed: 0,children,debt
0,1,0
1,1,0
2,0,0
3,3,0
4,0,0
...,...,...
21449,1,0
21450,0,0
21451,1,1
21452,3,1


In [204]:
# Calculating default-rate based on the number of children
df.pivot_table(index = 'children', values = 'debt', aggfunc = ['sum', 'count', 'mean', 'median'])

Unnamed: 0_level_0,sum,count,mean,median
Unnamed: 0_level_1,debt,debt,debt,debt
children,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
0,1063,14091,0.075438,0
1,445,4855,0.091658,0
2,202,2128,0.094925,0
3,27,330,0.081818,0
4,4,41,0.097561,0
5,0,9,0.0,0


### **Conclusion**

From the above we can see that:
- There is a **7.5%** chance that someone with **no children** will default on a loan
- There is a **9.2%** chance that someone with **1 child** will default on a loan
- There is a **9.5%** chance that someone with **2 children** will default on a loan
- There is a **8.2%** chance that someone with **3 children** will default on a loan
- There is a **9.8%** chance that someone with **4 children** will default on a loan
- None of the 9 people with **5 children** defaulted, but this group is too small to draw conclusions from

In general, people with children have a higher chance of defaulting on a loan than people without children, although the rate does not rise with every additional child.
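Since the groups differ greatly in size, it may help to attach a rough margin of error to each default rate. The sketch below uses the counts from the pivot table above and a simple binomial standard error. This is an assumption made for illustration, and the normal approximation behind the 1.96 factor is unreliable for the very small groups.

```python
from math import sqrt

# (defaults, borrowers) per number of children, copied from the pivot table above
counts = {0: (1063, 14091), 1: (445, 4855), 2: (202, 2128),
          3: (27, 330), 4: (4, 41), 5: (0, 9)}

summary = {}
for children, (defaults, total) in counts.items():
    rate = defaults / total
    # Standard error of a proportion under a simple binomial assumption
    std_error = sqrt(rate * (1 - rate) / total)
    summary[children] = (rate, std_error)
    print(f'{children} children: {rate:.1%} +/- {1.96 * std_error:.1%} (n={total})')
```

The margin for the group with 4 children is many times wider than for the group with no children, so differences between the larger groups are the more reliable ones.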


**Is there a correlation between family status and paying back on time?**

### **Checking the correlation between family status and debt**

In [205]:
# Check the family status data and paying back on time
df[['family_status', 'debt']]

Unnamed: 0,family_status,debt
0,married,0
1,married,0
2,married,0
3,married,0
4,civil partnership,0
...,...,...
21449,civil partnership,0
21450,married,0
21451,civil partnership,1
21452,married,1


In [206]:
# Calculating default-rate based on family status
df.pivot_table(index = 'family_status', values = 'debt', aggfunc = ['sum', 'count', 'mean', 'median'])

Unnamed: 0_level_0,sum,count,mean,median
Unnamed: 0_level_1,debt,debt,debt,debt
family_status,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
civil partnership,388,4151,0.093471,0
divorced,85,1195,0.07113,0
married,931,12339,0.075452,0
unmarried,274,2810,0.097509,0
widow / widower,63,959,0.065693,0


### **Conclusion**

From the above we can see that:
- There is a **9.3%** chance that people with the family status: **civil partnership** will default on a loan
- There is a **7.1%** chance that people with the family status: **divorced** will default on a loan
- There is a **7.5%** chance that people with the family status: **married** will default on a loan
- There is a **9.8%** chance that people with the family status: **unmarried** will default on a loan
- There is a **6.6%** chance that people with the family status: **widow / widower** will default on a loan


People who are **unmarried** have the highest chance of defaulting on a loan, whereas those who are **widow / widower** have the lowest.

**Is there a correlation between income level and paying back on time?**

### **Checking the correlation between income level and debt**

In [207]:
# Check the income level data and paying back on time
df[['income_category', 'debt']]

Unnamed: 0,income_category,debt
0,high,0
1,lower-mid,0
2,upper-mid,0
3,high,0
4,upper-mid,0
...,...,...
21449,high,0
21450,upper-mid,0
21451,low,1
21452,high,1


In [208]:
# Calculating default-rate based on the income_category
df.pivot_table(index = 'income_category', values = 'debt', aggfunc = ['sum', 'count', 'mean', 'median'])

Unnamed: 0_level_0,sum,count,mean,median
Unnamed: 0_level_1,debt,debt,debt,debt
income_category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
high,383,5364,0.071402,0
low,427,5364,0.079605,0
lower-mid,462,5363,0.086146,0
upper-mid,469,5363,0.087451,0


### **Conclusion**

From the above we can see that:
- There is a **7.1%** chance that people in the **high** income level will default on a loan
- There is a **8.0%** chance that people in the **low** income level will default on a loan
- There is a **8.6%** chance that people in the **lower-mid** income level will default on a loan
- There is a **8.7%** chance that people in the **upper-mid** income level will default on a loan


People in the **upper-mid** income level have the highest chance of defaulting on a loan, whereas those in the **high** income level have the lowest. The difference between the two groups is small: **1.6 percentage points**.
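The gap between two default rates can be expressed in percentage points or as a relative difference, and the two read very differently. A small sketch using the rates from the pivot table above:

```python
# Default rates copied from the pivot table above
high_rate = 0.071402
upper_mid_rate = 0.087451

# Absolute gap, in percentage points
point_gap = (upper_mid_rate - high_rate) * 100
# Relative gap, as a percentage of the high-income rate
relative_gap = (upper_mid_rate - high_rate) / high_rate * 100

print(f'{point_gap:.1f} percentage points, {relative_gap:.1f}% relative difference')
```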

### **How does credit purpose affect the default rate?**

In [209]:
# Check the percentages for default rate for each credit purpose and analyze them
df.pivot_table(index = 'purpose_grouping', values = 'debt', aggfunc = ['sum', 'count', 'mean', 'median'])


Unnamed: 0_level_0,sum,count,mean,median
Unnamed: 0_level_1,debt,debt,debt,debt
purpose_grouping,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
car,403,4306,0.09359,0
education,370,4013,0.0922,0
real estate,782,10811,0.072334,0
wedding,186,2324,0.080034,0


### **Conclusion**

From the above we can see that:
- There is a **9.4%** chance that people who took out a loan for a **car** will default on it
- There is a **9.2%** chance that people who took out a loan for **education** will default on it
- There is a **7.2%** chance that people who took out a loan for **real estate** will default on it
- There is a **8.0%** chance that people who took out a loan for a **wedding** will default on it

People who took out a loan for a **car** have the highest chance of defaulting, whereas those who took out a loan for **real estate** have the lowest.
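The same pivot is repeated for children, family status, income level and purpose. A small helper could produce the table for any column and sort it by default rate. The function name `default_rate_by` and the toy data are hypothetical; this is a sketch, not part of the original analysis.

```python
import pandas as pd

def default_rate_by(frame, column):
    """Return defaults, borrower count and default rate (%) for each value of `column`."""
    table = frame.groupby(column)['debt'].agg(defaults='sum', borrowers='count', rate='mean')
    # Express the rate as a percentage rounded to one decimal place
    table['rate'] = (table['rate'] * 100).round(1)
    return table.sort_values('rate', ascending=False)

# Toy data; the values are illustrative only
toy = pd.DataFrame({
    'purpose_grouping': ['car', 'car', 'car', 'car', 'wedding', 'wedding'],
    'debt': [1, 0, 0, 0, 0, 0],
})
result = default_rate_by(toy, 'purpose_grouping')
print(result)
```

With the real data, a call such as `default_rate_by(df, 'family_status')` would give the same counts and rates as the pivot tables above, already ordered from the highest to the lowest rate.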

## General Conclusion 


**The initial hypothesis being tested was:** <a id='data_review'></a>

A customer’s marital status and number of children have an effect on a customer defaulting on a loan, specifically:
- Married people have a higher chance of defaulting on a loan and
- People with children have a higher chance of defaulting on a loan 

From testing the data we found that the people with the highest chances of defaulting on a loan are:

**1. Those who have more children** 

Comparison of those with no children and those with 4 children:
- There is a **7.5%** chance that someone with **no children** will default on a loan
- There is a **9.8%** chance that someone with **4 children** will default on a loan

In general, people with children have a higher chance of defaulting on a loan than people without children, although the rate does not rise with every additional child.

**2. Those who are unmarried** 

Comparison of those who are married and unmarried:
- There is a **7.5%** chance that people with the family status: **married** will default on a loan
- There is a **9.8%** chance that people with the family status: **unmarried** will default on a loan

**3. Those who are in the upper-mid income level** 

Comparison of income levels:
- There is a **7.1%** chance that people in the **high** income level will default on a loan
- There is a **8.0%** chance that people in the **low** income level will default on a loan
- There is a **8.6%** chance that people in the **lower-mid** income level will default on a loan
- There is a **8.7%** chance that people in the **upper-mid** income level will default on a loan

**4. Those who took out a loan for a car** 

Comparison of loan purpose:
- There is a **9.4%** chance that people who took out a loan for a **car** will default on it
- There is a **7.2%** chance that people who took out a loan for **real estate** will default on it

Statistically, the greatest contributing factors to defaulting on a loan are:

- The number of children in a household: people with children have a higher chance of defaulting on a loan than those without
- The purpose of the loan, specifically loans taken out for a car

Being married does not increase the chance of defaulting on a loan: married people have a smaller chance of defaulting than unmarried people.

