## Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building a **credit score** for a potential customer. A **credit score** is used to evaluate the ability of a potential borrower to repay their loan.

### Step 1. Open the data file and have a look at the general information. 

In [1]:
import pandas as pd
df = pd.read_csv('/datasets/credit_scoring_eng.csv')
df.info()  # info() prints its own summary and returns None, so print() is not needed


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


### Conclusion

I imported the pandas library as pd, read the .csv file with the pd.read_csv() function, and saved the result to the variable 'df'. Then I called the .info() method to see the general information about the DataFrame.
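As a minimal sketch of this loading step, the example below reads a made-up three-row CSV from memory instead of the real file (the sample values and the names `csv_text`, `sample`, and `missing_counts` are assumptions for illustration). It also shows that .info() returns None, which is why wrapping it in print() prints an extra `None` line.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
csv_text = """children,days_employed,total_income,debt
1,-8437.67,253875.64,0
0,,,0
2,-4024.80,112080.01,1
"""

sample = pd.read_csv(io.StringIO(csv_text))

# DataFrame.info() prints its summary itself and returns None.
result = sample.info()

# Count the missing values in each column directly.
missing_counts = sample.isna().sum()
print(missing_counts)
```

Counting missing values with .isna().sum() gives the same information as comparing the non-null counts in the .info() output by eye.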

### Step 2. Data preprocessing

### Processing missing values

In [2]:
import pandas as pd
df = pd.read_csv('/datasets/credit_scoring_eng.csv')
df_null = df[df['days_employed'].isnull()]
#print(df_null['dob_years'].value_counts())
df['days_employed'] = df['days_employed'].fillna(value=df['days_employed'].mean())
df['total_income'] = df['total_income'].fillna(value=df['total_income'].mean())
df.info()  # info() prints its own summary and returns None, so print() is not needed


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       21525 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        21525 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


### Conclusion

After looking at the general info, I found missing values in both the 'days_employed' and 'total_income' columns (2174 in each: 19351 non-null values out of 21525). I then created a table ('df_null') containing the rows with NaN values for 'days_employed' to get a better understanding of which demographic was affected. The affected individuals span a wide age range, so the missing values do not appear to be tied to one age group. Therefore, using the average length of employment and the average income in place of the missing values should still be a reasonable approximation and should not adversely affect the analysis. I used the .fillna() method to replace the missing values with their respective column means, and printed the general info to verify that all missing values were filled. I would alert the developers to a possible bug in the employment reporting that causes length of employment and income not to be recorded.
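The fill step can be sketched on a small made-up table (the values and the names `sample` and `rows_with_gaps` are assumptions, not taken from the bank's data). Each column's gaps are filled with that column's own mean.

```python
import pandas as pd

# Hypothetical sample; None marks a missing value.
sample = pd.DataFrame({
    'days_employed': [100.0, None, 300.0, 200.0],
    'total_income': [1000.0, 3000.0, None, 2000.0],
})

# Select the rows with a missing value to inspect who is affected.
rows_with_gaps = sample[sample['days_employed'].isna()]

# Fill each column's gaps with the mean of its known values.
for column in ['days_employed', 'total_income']:
    sample[column] = sample[column].fillna(sample[column].mean())

print(sample)
```

If a column contains extreme values, the median (.median()) may be a more robust fill value than the mean; which one fits better depends on the data.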

### Data type replacement

In [3]:
import pandas as pd
df = pd.read_csv('/datasets/credit_scoring_eng.csv')
df_null = df[df['days_employed'].isnull()]
#print(df_null['dob_years'].value_counts())
df['days_employed'] = df['days_employed'].fillna(value=df['days_employed'].mean())
df['total_income'] = df['total_income'].fillna(value=df['total_income'].mean())
df['days_employed'] = df['days_employed'].astype(int)
df['total_income'] = df['total_income'].astype(int)
df.info()  # info() prints its own summary and returns None, so print() is not needed


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       21525 non-null int64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        21525 non-null int64
purpose             21525 non-null object
dtypes: int64(7), object(5)
memory usage: 2.0+ MB


### Conclusion

I used the .astype() method to convert the 'days_employed' and 'total_income' columns to integers, which removes the floating-point decimals. This makes the numbers easier to work with, because you don't have to calculate with fractional values.
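One detail worth knowing about this conversion: .astype(int) drops the fractional part instead of rounding. A short sketch with made-up values (the names `values`, `truncated`, and `rounded` are for illustration only):

```python
import pandas as pd

values = pd.Series([1.9, -1.9, 340266.07])

# astype(int) truncates toward zero.
truncated = values.astype(int)

# Rounding first gives the nearest integer instead.
rounded = values.round().astype(int)

print(truncated.tolist())  # [1, -1, 340266]
print(rounded.tolist())    # [2, -2, 340266]
```

The conversion also has to come after the missing values are filled, because a float column that still contains NaN cannot be cast to a plain integer type.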

### Processing duplicates

In [4]:
import pandas as pd
df = pd.read_csv('/datasets/credit_scoring_eng.csv')
df_null = df[df['days_employed'].isnull()]
#print(df_null['dob_years'].value_counts())
df['days_employed'] = df['days_employed'].fillna(value=df['days_employed'].mean())
df['total_income'] = df['total_income'].fillna(value=df['total_income'].mean())
df['days_employed'] = df['days_employed'].astype(int)
df['total_income'] = df['total_income'].astype(int)
#print(df['education'].value_counts())
#print(df['family_status'].value_counts())
#print(df['gender'].value_counts())
#print(df['income_type'].value_counts())
#print(df['purpose'].value_counts())
df['education'] = df['education'].str.lower()
print(df['education'].value_counts())

secondary education    15233
masters degree          5260
bachelor degree          744
primary education        282
academic degree            6
Name: education, dtype: int64


### Conclusion

In this step I reviewed the value counts for each column containing objects. I found duplicate categories in the 'education' column that differed only in letter case, so I converted the values of that column to lowercase using the 'str.lower()' method. The other columns with object data types seem to be free of duplicates. The duplication probably comes from text being entered with inconsistent capitalization when the data was collected from customers. I would tell the department in charge of collecting the data that we could improve efficiency and accuracy by normalizing the case of text input (or by offering a fixed list of choices). The 'purpose' column has a lot of very similar values, but those will be handled in the next step through lemmatization.
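To illustrate the idea on made-up values (the spellings below are assumptions, not copied from the data set), lowercasing collapses categories that differ only in capitalization:

```python
import pandas as pd

education = pd.Series([
    'Secondary Education', 'secondary education', 'SECONDARY EDUCATION',
    'Primary education', 'primary education',
])

# Five distinct spellings before normalizing.
print(education.nunique())

# After lowercasing, only two real categories remain.
normalized = education.str.lower()
print(normalized.value_counts())
```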

### Lemmatization

In [5]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemma = WordNetLemmatizer()
from collections import Counter
import pandas as pd
df = pd.read_csv('/datasets/credit_scoring_eng.csv')
df_null = df[df['days_employed'].isnull()]
#print(df_null['dob_years'].value_counts())
df['days_employed'] = df['days_employed'].fillna(value=df['days_employed'].mean())
df['total_income'] = df['total_income'].fillna(value=df['total_income'].mean())
df['days_employed'] = df['days_employed'].astype(int)
df['total_income'] = df['total_income'].astype(int)
#print(df['education'].value_counts())
#print(df['family_status'].value_counts())
#print(df['gender'].value_counts())
#print(df['income_type'].value_counts())
#print(df['purpose'].value_counts())
df['education'] = df['education'].str.lower()
def lemma_words(row):
    return [wordnet_lemma.lemmatize(word, pos='n') for word in nltk.word_tokenize(row)]
df['purpose'] = df['purpose'].apply(lemma_words)
def lemma_category(row):
    if 'car' in row:
        return 'car'
    if 'house' in row or 'estate' in row or 'property' in row:
        return 'house'
    if 'wedding' in row:
        return 'wedding'
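    # The original test `'education' in row or 'university'` was always true,
    # because a non-empty string is truthy. Membership is now tested explicitly,
    # and the final return keeps the old catch-all so the results are unchanged.
    if 'education' in row or 'university' in row:
        return 'education'
    return 'education'  # catch-all for any purpose not matched above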
            
df['purpose'] = df['purpose'].apply(lemma_category)
print(df['purpose'].value_counts())

house        9540
education    5322
car          4315
wedding      2348
Name: purpose, dtype: int64


### Conclusion

I used the NLTK library to create a function that tokenizes and lemmatizes a string, then used the .apply() method to lemmatize the data in the 'purpose' column. Then I created a function that uses if rules to sort the 'purpose' column into 4 distinct categories: car, house, wedding, and education. Any purpose that matches none of the earlier rules ends up in the education category. This simplification lets us group the loans by one of the four basic categories.
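The categorization rules can be sketched without NLTK as follows. This is an illustration under assumptions: `categorize_purpose` is a hypothetical name, str.split() stands in for the tokenizer and lemmatizer, the sample purposes are made up, and the 'other' label is an assumed fallback that the notebook itself does not use.

```python
def categorize_purpose(words):
    """Map a list of lowercase words to a loan category."""
    if 'car' in words:
        return 'car'
    if 'house' in words or 'estate' in words or 'property' in words:
        return 'house'
    if 'wedding' in words:
        return 'wedding'
    # Each side of `or` needs its own membership test. Writing
    # `'education' in words or 'university'` is always true, because a
    # non-empty string is truthy.
    if 'education' in words or 'university' in words:
        return 'education'
    return 'other'  # hypothetical fallback label for unmatched purposes


purposes = ['buy real estate', 'going to university', 'car purchase',
            'wedding ceremony', 'holiday trip']

# str.split() stands in for tokenizing and lemmatizing here.
categories = [categorize_purpose(p.split()) for p in purposes]
print(categories)
# ['house', 'education', 'car', 'wedding', 'other']
```

An explicit fallback label makes it easy to see with value_counts() how many purposes the rules failed to match.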

### Categorizing Data

In [6]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemma = WordNetLemmatizer()
from collections import Counter
import pandas as pd
df = pd.read_csv('/datasets/credit_scoring_eng.csv')
df_null = df[df['days_employed'].isnull()]
#print(df_null['dob_years'].value_counts())
df['days_employed'] = df['days_employed'].fillna(value=df['days_employed'].mean())
df['total_income'] = df['total_income'].fillna(value=df['total_income'].mean())
df['days_employed'] = df['days_employed'].astype(int)
df['total_income'] = df['total_income'].astype(int)
#print(df['education'].value_counts())
#print(df['family_status'].value_counts())
#print(df['gender'].value_counts())
#print(df['income_type'].value_counts())
#print(df['purpose'].value_counts())
df['education'] = df['education'].str.lower()
def lemma_words(row):
    return [wordnet_lemma.lemmatize(word, pos='n') for word in nltk.word_tokenize(row)]
df['purpose'] = df['purpose'].apply(lemma_words)
def lemma_category(row):
    if 'car' in row:
        return 'car'
    if 'house' in row or 'estate' in row or 'property' in row:
        return 'house'
    if 'wedding' in row:
        return 'wedding'
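    # The original test `'education' in row or 'university'` was always true,
    # because a non-empty string is truthy. Membership is now tested explicitly,
    # and the final return keeps the old catch-all so the results are unchanged.
    if 'education' in row or 'university' in row:
        return 'education'
    return 'education'  # catch-all for any purpose not matched above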
            
df['purpose'] = df['purpose'].apply(lemma_category)
loan_dict = df[['days_employed', 'debt', 'total_income', 'purpose']]
loan_dict = loan_dict.drop_duplicates().reset_index(drop=True)
print(loan_dict)




       days_employed  debt  total_income    purpose
0              -8437     0        253875      house
1              -4024     0        112080        car
2              -5623     0        145885      house
3              -4124     0        267628  education
4             340266     0        158616    wedding
...              ...   ...           ...        ...
19354          -4529     0        224791  education
19355         343937     0        155999        car
19356          -2113     1         89672      house
19357          -3112     1        244093        car
19358          -1984     0         82047        car

[19359 rows x 4 columns]


### Conclusion

In this step I built a table ('loan_dict') from the columns I felt were the most important factors when considering loan repayment: days employed, debt, total income, and loan purpose. I then dropped all duplicate rows and reset the index, which left 19359 unique rows.
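A small sketch of the deduplication step on made-up rows (the values and the names `sample` and `unique_rows` are assumptions for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    'debt': [0, 0, 1, 0],
    'total_income': [253875, 253875, 89672, 112080],
    'purpose': ['house', 'house', 'house', 'car'],
})

# duplicated() flags rows that repeat an earlier row in every column.
print(sample.duplicated().sum())  # 1

# Drop the repeats and renumber the index from 0.
unique_rows = sample.drop_duplicates().reset_index(drop=True)
print(len(unique_rows))  # 3
```

Note that after selecting only a few columns, two different customers can look identical, so dropping duplicates on a column subset may remove more rows than it would on the full table.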

### Step 3. Answer these questions

- Is there a relation between having kids and repaying a loan on time?

In [7]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemma = WordNetLemmatizer()
from collections import Counter
import pandas as pd
df = pd.read_csv('/datasets/credit_scoring_eng.csv')
df_null = df[df['days_employed'].isnull()]
#print(df_null['dob_years'].value_counts())
df['days_employed'] = df['days_employed'].fillna(value=df['days_employed'].mean())
df['total_income'] = df['total_income'].fillna(value=df['total_income'].mean())
df['days_employed'] = df['days_employed'].astype(int)
df['total_income'] = df['total_income'].astype(int)
#print(df['education'].value_counts())
#print(df['family_status'].value_counts())
#print(df['gender'].value_counts())
#print(df['income_type'].value_counts())
#print(df['purpose'].value_counts())
df['education'] = df['education'].str.lower()
def lemma_words(row):
    return [wordnet_lemma.lemmatize(word, pos='n') for word in nltk.word_tokenize(row)]
df['purpose'] = df['purpose'].apply(lemma_words)
def lemma_category(row):
    if 'car' in row:
        return 'car'
    if 'house' in row or 'estate' in row or 'property' in row:
        return 'house'
    if 'wedding' in row:
        return 'wedding'
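    # The original test `'education' in row or 'university'` was always true,
    # because a non-empty string is truthy. Membership is now tested explicitly,
    # and the final return keeps the old catch-all so the results are unchanged.
    if 'education' in row or 'university' in row:
        return 'education'
    return 'education'  # catch-all for any purpose not matched above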
            
df['purpose'] = df['purpose'].apply(lemma_category)
#loan_dict = df[['days_employed', 'debt', 'total_income', 'purpose']]
#loan_dict = loan_dict.drop_duplicates().reset_index(drop=True)
#print(loan_dict)
children_debt = df[['children', 'debt']].copy()
#print(children_debt.groupby('children'))
def children_debt_func(row):
    children = row['children']
    debt = row['debt']
    if children == 0:
        if debt == 0:
            return '0_children_paid'
    if children == 0: 
        if debt == 1:
            return '0_children_unpaid'
    if children == 1:
        if debt == 0:
            return '1_children_paid'
    if children == 1:
        if debt == 1:
            return '1_children_unpaid'
    if children == 2: 
        if debt == 0:
            return '2_children_paid'
    if children == 2: 
        if debt == 1:
            return '2_children_unpaid'
    if children == 3:
        if debt == 0:
            return '3_children_paid'
    if children == 3: 
        if debt == 1:
            return '3_children_unpaid'
    if children == 4:
        if debt == 0:
            return '4_children_paid'
    if children == 4: 
        if debt == 1:
            return '4_children_unpaid'
    if children == 5:
        if debt == 0:
            return '5_children_paid'
    if children == 5: 
        if debt == 1:
            return '5_children_unpaid'
    if children == 20: 
        if debt == 0:
            return '20_children_paid'
    if children == 20: 
        if debt == 1:
            return '20_children_unpaid'
children_debt['children_debt_group'] = children_debt.apply(children_debt_func, axis=1)
children_debt = children_debt.dropna()
children_debt = children_debt['children_debt_group'].value_counts()
zero_children_total = children_debt['0_children_unpaid'] + children_debt['0_children_paid']
zero_children_def_rate = children_debt['0_children_unpaid'] / zero_children_total
one_children_total = children_debt['1_children_unpaid'] + children_debt['1_children_paid']
one_children_def_rate = children_debt['1_children_unpaid'] / one_children_total
two_children_total = children_debt['2_children_unpaid'] + children_debt['2_children_paid']
two_children_def_rate = children_debt['2_children_unpaid'] / two_children_total
three_children_total = children_debt['3_children_unpaid'] + children_debt['3_children_paid']
three_children_def_rate = children_debt['3_children_unpaid'] / three_children_total
four_children_total = children_debt['4_children_unpaid'] + children_debt['4_children_paid']
four_children_def_rate = children_debt['4_children_unpaid'] / four_children_total
#five_children_total = children_debt['5_children_unpaid'] + children_debt['5_children_paid']
#five_children_def_rate = children_debt['5_children_unpaid'] / five_children_total
# There were only 9 individuals in the data set with 5 children, so I excluded them
twenty_children_total = children_debt['20_children_unpaid'] + children_debt['20_children_paid']
twenty_children_def_rate = children_debt['20_children_unpaid'] / twenty_children_total
print('People with no children default on their loans {:.2%} of the time'.format(zero_children_def_rate))
print('People with one child default on their loans {:.2%} of the time'.format(one_children_def_rate))
print('People with two children default on their loans {:.2%} of the time'.format(two_children_def_rate))
print('People with three children default on their loans {:.2%} of the time'.format(three_children_def_rate))
print('People with four children default on their loans {:.2%} of the time'.format(four_children_def_rate))
print('People with twenty children default on their loans {:.2%} of the time'.format(twenty_children_def_rate))

People with no children default on their loans 7.51% of the time
People with one child default on their loans 9.22% of the time
People with two children default on their loans 9.44% of the time
People with three children default on their loans 8.18% of the time
People with four children default on their loans 9.76% of the time
People with twenty children default on their loans 10.53% of the time


### Conclusion

For this question I separated the entries by number of children, then further separated each group by loan status (paid or unpaid). I then calculated the default rate by dividing the number of unpaid loans by the total number of loans in each group. The default rate rises by 1.71 percentage points between people with no children (7.51%) and people with one child (9.22%). After that, the rate increases only slightly as the number of children increases, with the exception of three children (8.18%). In my opinion there is a slight correlation between the number of children a borrower has and their ability to repay a loan. There were only 9 entries in the data with 5 children, and all of them had repaid their loans on time, so I excluded them from the calculations: a default rate of zero from such a small sample is not reliable.
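Because 'debt' is coded as 0 (paid) or 1 (unpaid), its mean within a group is that group's default rate. The same calculation can therefore be sketched with a groupby, shown here on made-up rows (the data and the `min_borrowers` cutoff are assumptions for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 1, 1, 5],
    'debt':     [0, 0, 0, 1, 0, 1, 1, 0, 0],
})

# Group size and default rate (mean of the 0/1 debt flag) per group.
summary = sample.groupby('children')['debt'].agg(
    borrowers='count', default_rate='mean'
)

# Hypothetical cutoff: ignore groups too small to give a stable rate.
min_borrowers = 2
reliable = summary[summary['borrowers'] >= min_borrowers]
print(reliable)
```

Keeping the group size next to the rate makes it easier to spot groups, like the one with 5 children, whose rate rests on very few borrowers.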

- Is there a relation between marital status and repaying a loan on time?

In [8]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemma = WordNetLemmatizer()
from collections import Counter
import pandas as pd
df = pd.read_csv('/datasets/credit_scoring_eng.csv')
df_null = df[df['days_employed'].isnull()]
#print(df_null['dob_years'].value_counts())
df['days_employed'] = df['days_employed'].fillna(value=df['days_employed'].mean())
df['total_income'] = df['total_income'].fillna(value=df['total_income'].mean())
df['days_employed'] = df['days_employed'].astype(int)
df['total_income'] = df['total_income'].astype(int)
#print(df['education'].value_counts())
#print(df['family_status'].value_counts())
#print(df['gender'].value_counts())
#print(df['income_type'].value_counts())
#print(df['purpose'].value_counts())
df['education'] = df['education'].str.lower()
def lemma_words(row):
    return [wordnet_lemma.lemmatize(word, pos='n') for word in nltk.word_tokenize(row)]
df['purpose'] = df['purpose'].apply(lemma_words)
def lemma_category(row):
    if 'car' in row:
        return 'car'
    if 'house' in row or 'estate' in row or 'property' in row:
        return 'house'
    if 'wedding' in row:
        return 'wedding'
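    # The original test `'education' in row or 'university'` was always true,
    # because a non-empty string is truthy. Membership is now tested explicitly,
    # and the final return keeps the old catch-all so the results are unchanged.
    if 'education' in row or 'university' in row:
        return 'education'
    return 'education'  # catch-all for any purpose not matched above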
            
df['purpose'] = df['purpose'].apply(lemma_category)
marital_debt = df[['family_status', 'family_status_id', 'debt']].copy()
marital_debt_count = marital_debt['family_status'].value_counts()
married_total = marital_debt_count['married']
civil_total = marital_debt_count['civil partnership']
widow_total = marital_debt_count['widow / widower']
divorced_total = marital_debt_count['divorced']
unmarried_total = marital_debt_count['unmarried']
#marital_debt_count
def marital_debt_func(row):
    status = row['family_status_id']
    debt = row['debt']
    if status == 0:
        if debt == 0:
            return 'married_paid'
    if status == 0:
        if debt == 1:
            return 'married_unpaid'
    if status == 1:
        if debt == 0:
            return 'civil_paid'
    if status == 1:
        if debt == 1:
            return 'civil_unpaid'
    if status == 2:
        if debt == 0:
            return 'widow_paid'
    if status == 2:
        if debt == 1:
            return 'widow_unpaid'
    if status == 3:
        if debt == 0:
            return 'divorced_paid'
    if status == 3:
        if debt == 1:
            return 'divorced_unpaid'
    if status == 4:
        if debt == 0:
            return 'unmarried_paid'
    if status == 4:
        if debt == 1:
            return 'unmarried_unpaid'
marital_debt['marital_debt_group'] = marital_debt.apply(marital_debt_func, axis=1)
marital_debt_group_counts = marital_debt['marital_debt_group'].value_counts()
print('Married people default on their loans {:.2%} of the time'.format(marital_debt_group_counts['married_unpaid'] / married_total))
print('People in a Civil Partnership default on their loans {:.2%} of the time'.format(marital_debt_group_counts['civil_unpaid'] / civil_total))
print('Widowed people default on their loans {:.2%} of the time'.format(marital_debt_group_counts['widow_unpaid'] / widow_total))
print('Divorced people default on their loans {:.2%} of the time'.format(marital_debt_group_counts['divorced_unpaid'] / divorced_total))
print('Unmarried people default on their loans {:.2%} of the time'.format(marital_debt_group_counts['unmarried_unpaid'] / unmarried_total))

Married people default on their loans 7.52% of the time
People in a Civil Partnership default on their loans 9.29% of the time
Widowed people default on their loans 6.56% of the time
Divorced people default on their loans 7.11% of the time
Unmarried people default on their loans 9.74% of the time


### Conclusion

To find the relation between marital status and loan repayment I used the same tactic as with the number of children: I calculated a default rate for each group. I found that married (7.52%), widowed (6.56%), and divorced (7.11%) people default on their loans at a lower rate than those who are unmarried (9.74%) or in a civil partnership (9.29%). This could be due to several factors. People who are married, widowed, or divorced could have more savings to put towards a loan if need be, or more financial stability due to shared, inherited, or transferred wealth. Those who are unmarried or in a civil partnership may not have a financial partner and the stability that comes with one, and they may tend to be younger and have had less time to build up the same amount of wealth.
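As an alternative sketch of the same per-group rate, pd.crosstab with normalize='index' gives the share of paid and unpaid loans within each marital status. The rows below are made up for illustration, and the names `shares` and `default_rates` are assumptions.

```python
import pandas as pd

sample = pd.DataFrame({
    'family_status': ['married', 'married', 'married', 'married',
                      'unmarried', 'unmarried', 'divorced', 'divorced'],
    'debt': [0, 0, 0, 1, 1, 0, 0, 0],
})

# Each row of the result sums to 1: share of paid (0) and unpaid (1) loans.
shares = pd.crosstab(sample['family_status'], sample['debt'], normalize='index')

# Column 1 holds the default rate; sort it from highest to lowest.
default_rates = shares[1].sort_values(ascending=False)
print(default_rates)
```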

- Is there a relation between income level and repaying a loan on time?

In [9]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemma = WordNetLemmatizer()
from collections import Counter
import pandas as pd
df = pd.read_csv('/datasets/credit_scoring_eng.csv')
df_null = df[df['days_employed'].isnull()]
#print(df_null['dob_years'].value_counts())
df['days_employed'] = df['days_employed'].fillna(value=df['days_employed'].mean())
df['total_income'] = df['total_income'].fillna(value=df['total_income'].mean())
df['days_employed'] = df['days_employed'].astype(int)
df['total_income'] = df['total_income'].astype(int)
#print(df['education'].value_counts())
#print(df['family_status'].value_counts())
#print(df['gender'].value_counts())
#print(df['income_type'].value_counts())
#print(df['purpose'].value_counts())
df['education'] = df['education'].str.lower()
def lemma_words(row):
    return [wordnet_lemma.lemmatize(word, pos='n') for word in nltk.word_tokenize(row)]
df['purpose'] = df['purpose'].apply(lemma_words)
def lemma_category(row):
    if 'car' in row:
        return 'car'
    if 'house' in row or 'estate' in row or 'property' in row:
        return 'house'
    if 'wedding' in row:
        return 'wedding'
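    # The original test `'education' in row or 'university'` was always true,
    # because a non-empty string is truthy. Membership is now tested explicitly,
    # and the final return keeps the old catch-all so the results are unchanged.
    if 'education' in row or 'university' in row:
        return 'education'
    return 'education'  # catch-all for any purpose not matched above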
            
df['purpose'] = df['purpose'].apply(lemma_category)
#print(df['total_income'].sort_values().head(20).tail(20))
income_description = df['total_income'].describe()
low_income_threshold = income_description['25%']
#mid_income_threshold = income_description['50%']
high_income_threshold = income_description['75%']
def income_debt_func(row):
    income = row['total_income']
    debt = row['debt']
    if income <= low_income_threshold:
        if debt == 0:
            return 'low_income_paid'
    if income <= low_income_threshold:
        if debt == 1:
            return 'low_income_unpaid'
    if income <= high_income_threshold:
        if debt == 0:
            return 'mid_income_paid'
    if income <= high_income_threshold:
        if debt == 1:
            return 'mid_income_unpaid'
    if income >= high_income_threshold:
        if debt == 0:
            return 'high_income_paid'
    if income >= high_income_threshold:
        if debt == 1:
            return 'high_income_unpaid'
df['income_debt_group'] = df.apply(income_debt_func, axis=1)
income_debt_groups = df['income_debt_group'].value_counts()
low_income_def_rate = income_debt_groups['low_income_unpaid'] / (income_debt_groups['low_income_unpaid'] + income_debt_groups['low_income_paid'])
mid_income_def_rate = income_debt_groups['mid_income_unpaid'] / (income_debt_groups['mid_income_unpaid'] + income_debt_groups['mid_income_paid'])
high_income_def_rate = income_debt_groups['high_income_unpaid'] / (income_debt_groups['high_income_unpaid'] + income_debt_groups['high_income_paid'])
print('People with low income default on their loans {:.2%} of the time'.format(low_income_def_rate))
print('People with medium income default on their loans {:.2%} of the time'.format(mid_income_def_rate))
print('People with high income default on their loans {:.2%} of the time'.format(high_income_def_rate))

People with low income default on their loans 7.93% of the time
People with medium income default on their loans 8.62% of the time
People with high income default on their loans 7.17% of the time


### Conclusion

After calculating the default rate for the different income groups, I found that the rates were all within about 1.5 percentage points of each other (7.17% to 8.62%), with no steady increase or decrease as income rises. Therefore I do not see a clear correlation between income level and ability to repay a loan. This could be because the bank does not disburse loans unless they fit within an individual's debt-to-income ratio.
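The income grouping (low up to the 25th percentile, mid up to the 75th, high above it) can be sketched with pd.cut. The incomes below are made up, and the names `q25`, `q75`, and `rates` are assumptions for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    'total_income': [10, 20, 30, 40, 50, 60, 70, 80],
    'debt':         [1, 0, 0, 0, 1, 0, 0, 0],
})

q25 = sample['total_income'].quantile(0.25)
q75 = sample['total_income'].quantile(0.75)

# Bins are closed on the right: low <= q25 < mid <= q75 < high.
sample['income_group'] = pd.cut(
    sample['total_income'],
    bins=[float('-inf'), q25, q75, float('inf')],
    labels=['low', 'mid', 'high'],
)

# Mean of the 0/1 debt flag is the default rate of each income group.
rates = sample.groupby('income_group', observed=True)['debt'].mean()
print(rates)
```

Assigning each borrower to exactly one bin up front avoids the overlapping comparisons that a chain of if statements has to get right by ordering.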

- How do different loan purposes affect on-time repayment of the loan?

In [10]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemma = WordNetLemmatizer()
from collections import Counter
import pandas as pd
df = pd.read_csv('/datasets/credit_scoring_eng.csv')
df_null = df[df['days_employed'].isnull()]
#print(df_null['dob_years'].value_counts())
df['days_employed'] = df['days_employed'].fillna(value=df['days_employed'].mean())
df['total_income'] = df['total_income'].fillna(value=df['total_income'].mean())
df['days_employed'] = df['days_employed'].astype(int)
df['total_income'] = df['total_income'].astype(int)
#print(df['education'].value_counts())
#print(df['family_status'].value_counts())
#print(df['gender'].value_counts())
#print(df['income_type'].value_counts())
#print(df['purpose'].value_counts())
df['education'] = df['education'].str.lower()
def lemma_words(row):
    return [wordnet_lemma.lemmatize(word, pos='n') for word in nltk.word_tokenize(row)]
df['purpose'] = df['purpose'].apply(lemma_words)
def lemma_category(row):
    if 'car' in row:
        return 'car'
    if 'house' in row or 'estate' in row or 'property' in row:
        return 'house'
    if 'wedding' in row:
        return 'wedding'
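    # The original test `'education' in row or 'university'` was always true,
    # because a non-empty string is truthy. Membership is now tested explicitly,
    # and the final return keeps the old catch-all so the results are unchanged.
    if 'education' in row or 'university' in row:
        return 'education'
    return 'education'  # catch-all for any purpose not matched above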
            
df['purpose'] = df['purpose'].apply(lemma_category)
purpose_counts = df['purpose'].value_counts()
def purpose_debt_func(row):
    purpose = row['purpose']
    debt = row['debt']
    if purpose == 'house':
        if debt == 0:
            return 'house_paid'
    if purpose == 'house':
        if debt == 1:
            return 'house_unpaid'
    if purpose == 'education':
        if debt == 0:
            return 'education_paid'
    if purpose == 'education':
        if debt == 1:
            return 'education_unpaid'
    if purpose == 'car':
        if debt == 0:
            return 'car_paid'
    if purpose == 'car':
        if debt == 1:
            return 'car_unpaid'
    if purpose == 'wedding':
        if debt == 0:
            return 'wedding_paid'
    if purpose == 'wedding':
        if debt == 1:
            return 'wedding_unpaid'
df['purpose_debt_group'] = df.apply(purpose_debt_func, axis=1)
purpose_group_counts = df['purpose_debt_group'].value_counts()
print('People who get a loan for real estate default {:.2%} of the time'.format(purpose_group_counts['house_unpaid'] / purpose_counts['house']))
print('People who get a loan for education default {:.2%} of the time'.format(purpose_group_counts['education_unpaid'] / purpose_counts['education']))
print('People who get a loan for a car default {:.2%} of the time'.format(purpose_group_counts['car_unpaid'] / purpose_counts['car']))
print('People who get a loan for a wedding default {:.2%} of the time'.format(purpose_group_counts['wedding_unpaid'] / purpose_counts['wedding']))

People who get a loan for real estate default 7.21% of the time
People who get a loan for education default 8.72% of the time
People who get a loan for a car default 9.34% of the time
People who get a loan for a wedding default 7.92% of the time


### Conclusion

After calculating the default rate based on the purpose of the loan, I found that the lowest default rates belonged to real estate and weddings. For real estate I found a default rate of 7.21%. This lower rate could be due to the loan payment replacing the rent an individual was already paying for a residential property, or to the payment being covered by the rent that tenants pay in commercial and investment properties. For weddings (7.92%), the lower rate could be due to the added financial stability of a spouse. Car loans have the highest default rate (9.34%); this may be because vehicle loans are typically secured by assets that are worth less than the principal balance of the loan, which makes them more prone to default. Education loans have a slightly higher rate of default (8.72%), which could be due to the demographic of those needing education loans (young, low income, unmarried).