---

# Analyzing borrowers’ risk of defaulting

The project is to prepare a report for a bank’s loan division. We need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

The report will be considered when building the **credit score** of a potential customer. The **credit score** is used to evaluate the ability of a potential borrower to repay their loan.

The purpose of the project is to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. My hypotheses are:

1. Clients with a smaller number of children will be more likely to repay the loan on time.
2. Married clients will be more likely to repay the loan.
3. Married clients with higher income will be more likely to repay the loan.
4. Clients with lower income will be less likely to default on the loan.
5. Unmarried clients with a larger number of children and lower income will be less likely to default on the loan.
6. Clients taking money for basic necessities will be less likely to repay the loan.

## Let's open the data file and have a look at the general information. 


In [484]:
# Loading all the libraries
import pandas as pd
import numpy as np
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemma = WordNetLemmatizer()

from nltk.stem import PorterStemmer
ps = PorterStemmer()


In [485]:
# Load the data
credit_score = pd.read_csv('/datasets/credit_scoring_eng.csv')

## Task 1. Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan


In [486]:
# Let's see how many rows and columns our dataset has
credit_score.shape

(21525, 12)

The dataset has 21,525 rows and 12 columns

In [487]:
credit_score.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,26787.568355
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,16475.450632
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,3306.762
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,16488.5045
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,23202.87
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,32549.611
max,20.0,401755.400475,75.0,4.0,4.0,1.0,362496.645


In [488]:
credit_score.describe(include='all')

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
count,21525.0,19351.0,21525.0,21525,21525.0,21525,21525.0,21525,21525,21525.0,19351.0,21525
unique,,,,15,,5,,3,8,,,38
top,,,,secondary education,,married,,F,employee,,,wedding ceremony
freq,,,,13750,,12380,,14236,11119,,,797
mean,0.538908,63046.497661,43.29338,,0.817236,,0.972544,,,0.080883,26787.568355,
std,1.381587,140827.311974,12.574584,,0.548138,,1.420324,,,0.272661,16475.450632,
min,-1.0,-18388.949901,0.0,,0.0,,0.0,,,0.0,3306.762,
25%,0.0,-2747.423625,33.0,,1.0,,0.0,,,0.0,16488.5045,
50%,0.0,-1203.369529,42.0,,1.0,,0.0,,,0.0,23202.87,
75%,1.0,-291.095954,53.0,,1.0,,1.0,,,0.0,32549.611,


In [489]:
# let's print the first 15 rows
credit_score.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


In [490]:
#Let's also see the last 15 rows
credit_score.tail(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
21510,2,,28,secondary education,1,married,0,F,employee,0,,car purchase
21511,0,-612.569129,29,bachelor's degree,0,civil partnership,1,F,employee,1,22410.956,buying property for renting out
21512,0,-165.377752,26,bachelor's degree,0,unmarried,4,M,business,0,23568.233,to get a supplementary education
21513,0,-1166.216789,35,secondary education,1,married,0,F,employee,0,40157.783,purchase of the house
21514,0,-280.469996,27,some college,2,unmarried,4,M,business,0,56958.145,building a property
21515,1,-467.68513,28,secondary education,1,married,0,F,employee,1,17517.812,to become educated
21516,0,-914.391429,42,bachelor's degree,0,married,0,F,business,0,51649.244,purchase of my own house
21517,0,-404.679034,42,bachelor's degree,0,civil partnership,1,F,business,0,28489.529,buying my own car
21518,0,373995.710838,59,SECONDARY EDUCATION,1,married,0,F,retiree,0,24618.344,purchase of a car
21519,1,-2351.431934,37,graduate degree,4,divorced,3,M,employee,0,18551.846,buy commercial real estate


In [491]:
credit_score.duplicated().sum()

54

**Data Sample Description**

1. There seem to be missing values in the 'days_employed' and 'total_income' columns. This should be investigated further.
2. 'days_employed' has negative values, but there can't be "minus days". What do they mean? Is there a pattern? How will this affect the analysis? Also, length of employment is usually reported in years rather than days, so we might have to convert it.
3. The 'education' column has the same categories in lower case, upper case, and title case. Python treats those as different categories, even though the content is identical. All values should be changed to lower case to avoid implicit duplicates.
4. The describe() output shows that the 'children' column has at least one negative value (the min is -1) and a maximum of 20. We will have to investigate what these values mean and how to address them.
5. In the 'dob_years' column the min value is 0, but someone who is 0 years old cannot apply for a loan, so there must be a mistake, and we'll have to investigate further.
6. There seem to be some duplicates in the dataset (duplicated() counts 54 rows), which need to be addressed.
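Points 3 and 6 interact: rows that differ only in letter case are not counted by duplicated() until the case is normalized. Below is a minimal sketch of that effect on a made-up three-row frame (the `toy` data is an assumption for illustration, not taken from the bank's file):

```python
import pandas as pd

# Toy rows invented for illustration; only the column names follow the dataset
toy = pd.DataFrame({
    "education": ["secondary education", "Secondary Education", "bachelor's degree"],
    "children": [0, 0, 1],
})

# Rows 0 and 1 differ only in letter case, so they are not flagged yet
dupes_before = int(toy.duplicated().sum())

# Normalize the case, then count fully identical rows again
normalized = toy.assign(education=toy["education"].str.lower())
dupes_after = int(normalized.duplicated().sum())

print(dupes_before, dupes_after)
```

On the real data this suggests re-running the duplicate count after the 'education' column has been lower-cased, since the figure of 54 may change.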

In [492]:
# Let's get info on data
credit_score.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


In [493]:
credit_score.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

The dataset has 21,525 rows, but the 'days_employed' and 'total_income' columns have fewer than 21,525 non-null values, so both columns have missing values. The count of missing values is identical in the two columns: 2,174. Before making any assumptions, we need to check whether the missing values are symmetric (that is, whether they occur in the same rows) and whether there is a pattern behind them.

In [494]:
# Let's filter the table by the first column with missing data and look at the result
credit_score_filtered1 = credit_score[credit_score.days_employed.isna()]
print(credit_score_filtered1.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2174 entries, 12 to 21510
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          2174 non-null   int64  
 1   days_employed     0 non-null      float64
 2   dob_years         2174 non-null   int64  
 3   education         2174 non-null   object 
 4   education_id      2174 non-null   int64  
 5   family_status     2174 non-null   object 
 6   family_status_id  2174 non-null   int64  
 7   gender            2174 non-null   object 
 8   income_type       2174 non-null   object 
 9   debt              2174 non-null   int64  
 10  total_income      0 non-null      float64
 11  purpose           2174 non-null   object 
dtypes: float64(2), int64(5), object(5)
memory usage: 220.8+ KB
None


In [495]:
credit_score_filtered1.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
65,0,,21,secondary education,1,unmarried,4,M,business,0,,transactions with commercial real estate
67,0,,52,bachelor's degree,0,married,0,F,retiree,0,,purchase of the house for my family
72,1,,32,bachelor's degree,0,married,0,M,civil servant,0,,transactions with commercial real estate
82,2,,50,bachelor's degree,0,married,0,F,employee,0,,housing
83,0,,52,secondary education,1,married,0,M,employee,0,,housing


**Analysis of missing values**

The data above shows that the two columns with missing values, days_employed and total_income, have an identical number of them. The sample above also suggests that whenever days_employed is missing, total_income is missing as well. We can assume that the missing values are indeed symmetric, but we need to verify it. To do so, we will filter the dataset down to the rows where both columns have missing values.

In [496]:
#Let's apply multiple conditions for filtering data and look at the number of rows in the filtered table.
credit_score_filtered2 = credit_score[credit_score.days_employed.isna()&credit_score.total_income.isna()]

In [497]:
credit_score_filtered2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2174 entries, 12 to 21510
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          2174 non-null   int64  
 1   days_employed     0 non-null      float64
 2   dob_years         2174 non-null   int64  
 3   education         2174 non-null   object 
 4   education_id      2174 non-null   int64  
 5   family_status     2174 non-null   object 
 6   family_status_id  2174 non-null   int64  
 7   gender            2174 non-null   object 
 8   income_type       2174 non-null   object 
 9   debt              2174 non-null   int64  
 10  total_income      0 non-null      float64
 11  purpose           2174 non-null   object 
dtypes: float64(2), int64(5), object(5)
memory usage: 220.8+ KB


In [498]:
#calculating percentage of missing values in 'days_employed' column
days_employed_missing_values_percentage = credit_score['days_employed'].isna().sum()/len(credit_score['days_employed'])

#calculating percentage of missing values in 'total_income' column
total_income_missing_values_percentage = credit_score['total_income'].isna().sum()/len(credit_score['total_income'])

print(f'Percentage of missing values in days_employed column is: {days_employed_missing_values_percentage:.0%}')
print(f'Percentage of missing values in total_income column is: {total_income_missing_values_percentage:.0%}')

Percentage of missing values in days_employed column is: 10%
Percentage of missing values in total_income column is: 10%


**Intermediate conclusion**

The number of rows in the filtered table (2,174) matches the number of missing values in each of the two columns. Therefore, we can conclude that the missing values are indeed symmetric: they occur in the same rows.
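The same check can also be written as a direct comparison of the two missing-value masks. This is a minimal sketch on a made-up frame (the `toy` values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; column names follow the dataset
toy = pd.DataFrame({
    "days_employed": [-100.0, np.nan, -250.0, np.nan],
    "total_income": [2000.0, np.nan, 3100.0, np.nan],
})

missing_days = toy["days_employed"].isna()
missing_income = toy["total_income"].isna()

# True only if every row is missing in both columns or in neither
same_rows = bool((missing_days == missing_income).all())

# Number of rows where exactly one of the two values is missing
mismatched = int((missing_days ^ missing_income).sum())

print(same_rows, mismatched)
```

A `mismatched` count of zero carries the same information as the row-count comparison used in this notebook, in a single step.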

The share of missing values in both the days_employed and total_income columns is 10%, which is rather high and may affect the final analysis. It may be reasonable to fill in the missing data. The missing values might be related to employment status: perhaps unemployed clients have no employment days recorded. Alternatively, they may be related to the applicant's age: maybe a client applying for a loan is too young or too old to work. There may also be a dependency between days_employed, gender, and income_type. In certain cultures, one may assume that married women can be unemployed and choose to be stay-at-home moms.

1. We need to check the relationship between employment status (income_type) and the values in the days_employed column.
2. If there's a dependency between age, employment status, and the values in days_employed and total_income, we need to list the unique employment statuses with the unique() method and see which categories include missing values in the days_employed and total_income columns.
3. It would also be reasonable to create age groups from the age column so that we can identify patterns by age. This can be achieved by writing a function and applying it with apply() to the dob_years column.
4. We should also check the distribution by gender and compare the percentage of women among clients with missing data.
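Step 3 above can be sketched as follows. The function name `age_group` and the cut-off ages are assumptions chosen for illustration; the notebook does not prescribe specific bins.

```python
import pandas as pd

def age_group(age):
    """Map an age in years to a coarse label (cut-offs are illustrative assumptions)."""
    if age <= 0:
        return "unknown"  # e.g. the dob_years == 0 rows
    if age < 30:
        return "under 30"
    if age < 45:
        return "30-44"
    if age < 60:
        return "45-59"
    return "60+"

# Toy ages invented for illustration
toy = pd.DataFrame({"dob_years": [0, 21, 34, 47, 65]})
toy["age_group"] = toy["dob_years"].apply(age_group)

print(toy)
```

Keeping a separate "unknown" label for non-positive ages avoids silently placing the zero-age rows in the youngest group.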

In [499]:
# Let's describe the clients whose days_employed value is missing
credit_score[credit_score.days_employed.isna()].describe(include="all")

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
count,2174.0,0.0,2174.0,2174,2174.0,2174,2174.0,2174,2174,2174.0,0.0,2174
unique,,,,12,,5,,2,5,,,38
top,,,,secondary education,,married,,F,employee,,,having a wedding
freq,,,,1408,,1237,,1484,1105,,,92
mean,0.552438,,43.632015,,0.800828,,0.975161,,,0.078197,,
std,1.469356,,12.531481,,0.530157,,1.41822,,,0.268543,,
min,-1.0,,0.0,,0.0,,0.0,,,0.0,,
25%,0.0,,34.0,,0.25,,0.0,,,0.0,,
50%,0.0,,43.0,,1.0,,0.0,,,0.0,,
75%,1.0,,54.0,,1.0,,1.0,,,0.0,,


From the describe() output we know that there's at least one client with dob_years = 0. To test whether days_employed depends on age, and in particular whether some clients are under 19 and therefore have no employment history, we will check the age distribution to see if there's a large number of clients younger than 19 (college age).

In [500]:
credit_score[credit_score.days_employed.isna()]['dob_years'].value_counts()

34    69
40    66
31    65
42    65
35    64
36    63
47    59
41    59
30    58
28    57
57    56
58    56
54    55
38    54
56    54
37    53
52    53
39    51
33    51
50    51
51    50
45    50
49    50
29    50
43    50
46    48
55    48
48    46
53    44
44    44
60    39
61    38
62    38
64    37
32    37
27    36
23    36
26    35
59    34
63    29
25    23
24    21
66    20
65    20
21    18
22    17
67    16
0     10
68     9
69     5
20     5
71     5
70     3
72     2
19     1
73     1
Name: dob_years, dtype: int64

# **Distribution**

We will use two datasets:
1. credit_score - the original general dataset
2. credit_score_filtered2 - the filtered dataset which includes only clients who have missing data in days_employed and total_income columns.

**Distribution by income_type**

In [501]:
(credit_score_filtered2.value_counts(subset = ['income_type'])/credit_score_filtered2.value_counts(subset = ['income_type']).sum()).map("{:.1%}".format)

income_type  
employee         50.8%
business         23.4%
retiree          19.0%
civil servant     6.8%
entrepreneur      0.0%
dtype: object

In [502]:
credit_score_filtered2['income_type'].nunique()

5

In [503]:
credit_score['income_type'].nunique()

8

In [504]:
credit_score['income_type'].unique()

array(['employee', 'retiree', 'business', 'civil servant', 'unemployed',
       'entrepreneur', 'student', 'paternity / maternity leave'],
      dtype=object)

**Distribution by education**

In [505]:
(credit_score_filtered2.value_counts(subset = ['education'])/credit_score_filtered2.value_counts(subset = ['education']).sum()).map("{:.1%}".format)

education          
secondary education    64.8%
bachelor's degree      22.8%
SECONDARY EDUCATION     3.1%
Secondary Education     3.0%
some college            2.5%
Bachelor's Degree       1.1%
BACHELOR'S DEGREE       1.1%
primary education       0.9%
SOME COLLEGE            0.3%
Some College            0.3%
PRIMARY EDUCATION       0.0%
Primary Education       0.0%
dtype: object

**Distribution by gender**

In [506]:
(credit_score_filtered2.value_counts(subset = ['gender'])/credit_score_filtered2.value_counts(subset = ['gender']).sum()).map("{:.1%}".format)

gender
F         68.3%
M         31.7%
dtype: object

**Distribution by gender and family_status**

In [507]:
(credit_score_filtered2.value_counts(subset = ['gender', 'family_status'])/credit_score_filtered2.value_counts(subset = ['gender', 'family_status']).sum()).map("{:.1%}".format)

gender  family_status    
F       married              36.8%
M       married              20.1%
F       civil partnership    14.2%
        unmarried             8.8%
M       civil partnership     6.1%
        unmarried             4.4%
F       divorced              4.3%
        widow / widower       4.1%
M       divorced              0.9%
        widow / widower       0.2%
dtype: object

The categories in the 'income_type' column in the filtered table don't include 'unemployed', 'student', or 'paternity / maternity leave'.

The categories in the 'education' column in the filtered table don't include 'graduate degree'. The remaining categories appear in several letter cases and need to be normalized.

**Possible reasons for missing values in data**

The categories in the 'income_type' column suggest that clients who are 'unemployed', 'student', or on 'paternity / maternity leave' always have a value for days_employed, which contradicts the original assumption that they might have no income and no days_employed. If that assumption were true, we could have treated the missing values for those categories as 0.
As for the 'education' column, the only category missing from the filtered dataset is 'graduate degree'. But, as we will see below, the share of clients with a graduate degree is negligible, so we can't draw any conclusions from that.
Based on the analysis above, there doesn't seem to be any definite pattern, and the values appear to be missing at random. It may be a technical error or a human error.

# Checking the distribution in the whole dataset

In [508]:
#Distribution of income_type categories in the original dataset
(credit_score.value_counts(subset = ['income_type'])/credit_score.value_counts(subset = ['income_type']).sum()).map("{:.1%}".format)

income_type                
employee                       51.7%
business                       23.6%
retiree                        17.9%
civil servant                   6.8%
entrepreneur                    0.0%
unemployed                      0.0%
paternity / maternity leave     0.0%
student                         0.0%
dtype: object

In [509]:
#Distribution of education categories in the original dataset

(credit_score.value_counts(subset = ['education'])/credit_score.value_counts(subset = ['education']).sum()).map("{:.1%}".format)

education          
secondary education    63.9%
bachelor's degree      21.9%
SECONDARY EDUCATION     3.6%
Secondary Education     3.3%
some college            3.1%
BACHELOR'S DEGREE       1.3%
Bachelor's Degree       1.2%
primary education       1.2%
Some College            0.2%
SOME COLLEGE            0.1%
PRIMARY EDUCATION       0.1%
Primary Education       0.1%
graduate degree         0.0%
GRADUATE DEGREE         0.0%
Graduate Degree         0.0%
dtype: object

In [510]:
#Distribution of gender categories in the original dataset

(credit_score.value_counts(subset = ['gender'])/credit_score.value_counts(subset = ['gender']).sum()).map("{:.1%}".format)

gender
F         66.1%
M         33.9%
XNA        0.0%
dtype: object

In [511]:
#Distribution by gender and family status categories in the original dataset

(credit_score.value_counts(subset = ['gender', 'family_status'])/credit_score.value_counts(subset = ['gender', 'family_status']).sum()).map("{:.1%}".format)

gender  family_status    
F       married              36.2%
M       married              21.3%
F       civil partnership    13.3%
        unmarried             8.0%
M       civil partnership     6.1%
        unmarried             5.0%
F       divorced              4.3%
        widow / widower       4.2%
M       divorced              1.2%
        widow / widower       0.3%
XNA     civil partnership     0.0%
dtype: object

**Intermediate conclusion**

The distributions in the original dataset are indeed similar to those in the filtered table, which suggests that the values are missing at random and that no single column explains them. However, we will continue checking to try to find possible dependencies.
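The comparison is easier to read when both distributions sit side by side in one table. This is a minimal sketch on made-up data (the `toy` frame and the column labels `all_rows`, `missing_rows`, and `diff` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; column names follow the dataset
toy = pd.DataFrame({
    "income_type": ["employee", "employee", "business", "retiree", "employee", "business"],
    "total_income": [1000.0, np.nan, 2000.0, np.nan, 1500.0, np.nan],
})
missing = toy[toy["total_income"].isna()]

# Category shares in the full data and in the rows with missing income
comparison = pd.concat(
    {
        "all_rows": toy["income_type"].value_counts(normalize=True),
        "missing_rows": missing["income_type"].value_counts(normalize=True),
    },
    axis=1,
).fillna(0)  # a category absent from one side gets a share of 0

# Positive diff: the category is over-represented among rows with missing income
comparison["diff"] = comparison["missing_rows"] - comparison["all_rows"]

print(comparison)
```

Differences close to zero in every category are consistent with the values missing at random. A large difference in one category would point to a pattern worth investigating.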

In [512]:
credit_score['family_status'].value_counts(normalize=True)

married              0.575145
civil partnership    0.194053
unmarried            0.130685
divorced             0.055517
widow / widower      0.044599
Name: family_status, dtype: float64

In [513]:
credit_score_filtered2['family_status'].value_counts(normalize=True)

married              0.568997
civil partnership    0.203312
unmarried            0.132475
divorced             0.051518
widow / widower      0.043698
Name: family_status, dtype: float64

In [514]:
credit_score.groupby('gender')['family_status'].value_counts(normalize=True)

gender  family_status    
F       married              0.547555
        civil partnership    0.201461
        unmarried            0.121663
        divorced             0.065749
        widow / widower      0.063571
M       married              0.629116
        civil partnership    0.179473
        unmarried            0.148326
        divorced             0.035538
        widow / widower      0.007547
XNA     civil partnership    1.000000
Name: family_status, dtype: float64

In [515]:
credit_score[credit_score.total_income.isna()].groupby('gender')['family_status'].value_counts(normalize=True)

gender  family_status    
F       married              0.539084
        civil partnership    0.208221
        unmarried            0.129380
        divorced             0.062668
        widow / widower      0.060647
M       married              0.633333
        civil partnership    0.192754
        unmarried            0.139130
        divorced             0.027536
        widow / widower      0.007246
Name: family_status, dtype: float64

**Intermediate conclusion**

At this point it is safe to conclude that the missing values are accidental and there's no pattern that could explain them.

**Conclusions**

No apparent patterns were found during the analysis, as the distributions by various columns in the general dataset and in the filtered dataset were similar.

There are two columns with missing values:
- `days_employed` - the values appear to be missing completely at random, so we have no basis for predicting or restoring them. For now we will leave the missing values in this column as they are.
- `total_income` - it is possible to estimate the missing values and fill them with the average per income_type and education.
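Filling total_income with an average per income_type and education could look roughly like the sketch below. It runs on made-up numbers and writes to a separate column, `total_income_filled`, which is a name assumed here for illustration:

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; column names follow the dataset
toy = pd.DataFrame({
    "income_type": ["employee", "employee", "employee", "retiree", "retiree"],
    "education": ["secondary education"] * 5,
    "total_income": [20000.0, 30000.0, np.nan, 15000.0, np.nan],
})

# Mean income of each (income_type, education) group, aligned to the original rows;
# missing values are skipped when the mean is computed
group_mean = toy.groupby(["income_type", "education"])["total_income"].transform("mean")

# Only the missing entries are replaced; known incomes stay untouched
toy["total_income_filled"] = toy["total_income"].fillna(group_mean)

print(toy)
```

A group with no known incomes at all would remain NaN after this step and would need a fallback. Swapping "mean" for "median" is a common variation when the income distribution is skewed.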

**Next steps**

1. Fill in the missing values in the total_income column with an estimated average based on income_type and education.
2. Estimate the percentage of negative values in the 'days_employed' column and see whether these values can be restored.
3. Change all values in the 'education' column to lower case.
4. Investigate and address the negative values in the 'children' column.
5. In the 'dob_years' column the min value is 0. We'll check for other issues in this column and decide how to fix the problem.
6. Check for duplicates and decide how to handle them.

## Data transformation

Let's go through each column to see what issues we may have in them

In [516]:
# Let's see all values in education column
credit_score['education'].unique()

array(["bachelor's degree", 'secondary education', 'Secondary Education',
       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'some college',
       'primary education', "Bachelor's Degree", 'SOME COLLEGE',
       'Some College', 'PRIMARY EDUCATION', 'Primary Education',
       'Graduate Degree', 'GRADUATE DEGREE', 'graduate degree'],
      dtype=object)

In [517]:
#Let's fix the upper case and lower case issues by turning all values to lower case.
credit_score.education = credit_score.education.str.lower()

In [518]:
# Checking all the values in the column to make sure we fixed them
credit_score['education'].unique()

array(["bachelor's degree", 'secondary education', 'some college',
       'primary education', 'graduate degree'], dtype=object)

In [519]:
# Let's see the distribution of values in the `children` column
credit_score['children'].unique()

array([ 1,  0,  3,  2, -1,  4, 20,  5])

In [520]:
credit_score['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

In [521]:
(credit_score.value_counts(subset=['children'])/credit_score.value_counts(subset = ['children']).sum()).map("{:.1%}".format)

children
 0          65.7%
 1          22.4%
 2           9.5%
 3           1.5%
 20          0.4%
-1           0.2%
 4           0.2%
 5           0.0%
dtype: object


According to the data above, there are two obvious issues with the values in the column:
1. The value -1: 47 clients seem to have -1 child. It's unclear what this means or how to interpret it. Since it affects only 0.2% of the data, for the sake of the research we will treat it as a typo and change -1 to 1.
2. The value 20: 76 clients appear to have 20 children each. While this can happen in some cultures, the data makes it very unlikely here, since no client has between 6 and 19 children. It must be a human error or a code for something. We will assume that it is a human error and that the intended number of children was 2. Even though this is a guess, it will not have a significant effect on the data, as only 0.4% of rows are affected.

In [522]:
#Let's fix the data
credit_score['children'] = credit_score['children'].replace(20, 2)
credit_score['children'] = credit_score['children'].replace(-1, 1)

In [523]:
# Checking the `children` column again to make sure it's all fixed
credit_score['children'].value_counts()

0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64

Let's now check the days_employed column. We already saw that there are negative values in this column. We will calculate the percentage of such values to estimate whether the data could be restored. Additionally, we will convert the days into years to look at the average and, based on it, try to restore the data.

In [524]:
credit_score['days_employed'].unique()

array([-8437.67302776, -4024.80375385, -5623.42261023, ...,
       -2113.3468877 , -3112.4817052 , -1984.50758853])

In [525]:
credit_score['days_employed'].describe()

count     19351.000000
mean      63046.497661
std      140827.311974
min      -18388.949901
25%       -2747.423625
50%       -1203.369529
75%        -291.095954
max      401755.400475
Name: days_employed, dtype: float64

In [526]:
max_years = credit_score['days_employed'].max()/365

avg_years = credit_score['days_employed'].mean()/365

median_years = credit_score['days_employed'].median()/365

print(f'Maximum amount of years in days_employed: {max_years :.2f}')
print(f'Average amount of years in days_employed: {avg_years :.2f}')
print(f'Middle value for amount of years in days_employed: {median_years :.2f}')

Maximum amount of years in days_employed: 1100.70
Average amount of years in days_employed: 172.73
Middle value for amount of years in days_employed: -3.30


In [527]:
credit_score[credit_score.days_employed<0]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.422610,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.170,purchase of the house
...,...,...,...,...,...,...,...,...,...,...,...,...
21519,1,-2351.431934,37,graduate degree,4,divorced,3,M,employee,0,18551.846,buy commercial real estate
21520,1,-4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions
21522,1,-2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property
21523,3,-3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car


In [528]:
# Find problematic data in `days_employed`, if they exist, and calculate the percentage

(credit_score[credit_score.days_employed < 0]['days_employed'].count())/len(credit_score['days_employed'])

0.7389547038327526

74% of the values in the days_employed column are negative. In addition, if we convert days_employed to years to check whether the positive values are realistic, the maximum is about 1,100 years and the mean over the whole column is about 172 years. No one can be employed for that long. Therefore, with 74% negative values plus a further share of unrealistic positive values, it's safe to say that about 90% of the data in this column is problematic. In theory it would be reasonable to drop this column completely to avoid further issues; however, we will not do that and will leave the column as is. Moreover, the project description says that our goal is to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan, so this column is not required for the analysis.
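The arithmetic behind that estimate becomes more transparent when the column is split into negative, positive, and missing shares. Below is a minimal sketch on a made-up series (the values in `days` are illustrative assumptions, not a sample drawn from the file):

```python
import numpy as np
import pandas as pd

# Toy values invented for illustration, mimicking the three kinds of entries
days = pd.Series([-8437.7, -4024.8, 340266.1, np.nan, -926.2,
                  373995.7, -152.8, np.nan, -2879.2, -5623.4])

# Comparisons with NaN evaluate to False, so missing entries
# are not counted as negative or positive
shares = pd.Series({
    "negative": (days < 0).mean(),
    "positive": (days > 0).mean(),
    "missing": days.isna().mean(),
})

# Largest value expressed in years, to judge whether it is plausible
max_years = days.max() / 365

print(shares)
print(round(max_years, 1))
```

Dividing by the full number of rows, as the notebook does, means the negative and positive shares are fractions of all clients, including those with missing values.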

Let's now look at the client's age and whether there are any issues there.

In [529]:
# Check the `dob_years` for suspicious values and count the percentage
credit_score['dob_years'].describe()

count    21525.000000
mean        43.293380
std         12.574584
min          0.000000
25%         33.000000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

In [530]:
credit_score[credit_score.dob_years == 0]['dob_years'].count()

101

In [531]:
(credit_score[credit_score.dob_years == 0]['dob_years'].count())/len(credit_score['dob_years'])

0.004692218350754936

The problematic values here are the zeros, assuming that anyone applying for a loan must be of legal age. They make up a small share of the data (only 101 cases, or about 0.47%), and can therefore be replaced with a typical age.

In [532]:
#Let's calculate the median age by income_type (the median is less sensitive to outliers than the mean), as income_type also carries some implicit
#age categories, such as student, retiree, or someone on maternity/paternity leave.
credit_score.groupby('income_type')['dob_years'].median()

income_type
business                       39.0
civil servant                  40.0
employee                       39.0
entrepreneur                   42.5
paternity / maternity leave    39.0
retiree                        60.0
student                        22.0
unemployed                     38.0
Name: dob_years, dtype: float64

We see that the median by income type is, for the most part, close to the overall median of 42 (retirees and students being the obvious exceptions). Therefore, we will replace the 0 values with the overall median.

In [533]:
credit_score['dob_years']=credit_score['dob_years'].replace(0,credit_score['dob_years'].median())

In [534]:
credit_score['dob_years'].describe()

count    21525.000000
mean        43.490453
std         12.218595
min         19.000000
25%         34.000000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64
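
For comparison, here is a minimal sketch of an alternative that this notebook does not use: replacing the zeros with the median age of each `income_type` group instead of the overall median. The toy frame is invented for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Toy frame for illustration; column names follow the dataset
df = pd.DataFrame({
    'income_type': ['retiree', 'retiree', 'retiree', 'student', 'student', 'employee'],
    'dob_years':   [60, 0, 64, 22, 0, 39],
})

# Treat 0 as missing so it does not pull the group medians down
ages = df['dob_years'].mask(df['dob_years'] == 0)

# Median age within each income_type, aligned to the original rows
group_median = ages.groupby(df['income_type']).transform('median')

df['dob_years'] = ages.fillna(group_median)
print(df)
```

With only 101 affected rows the difference between the two approaches is likely small, which is consistent with the decision above to use the overall median.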

In [535]:
# Let's see the values for the family_status column.
credit_score['family_status'].describe()

count       21525
unique          5
top       married
freq        12380
Name: family_status, dtype: object

In [536]:
credit_score['family_status'].unique()

array(['married', 'civil partnership', 'widow / widower', 'divorced',
       'unmarried'], dtype=object)

In [537]:
credit_score['family_status'].value_counts(normalize=True)

married              0.575145
civil partnership    0.194053
unmarried            0.130685
divorced             0.055517
widow / widower      0.044599
Name: family_status, dtype: float64

There don't seem to be any problematic values.


In [538]:
# Let's see the values in the gender column
credit_score['gender'].describe()

count     21525
unique        3
top           F
freq      14236
Name: gender, dtype: object

In [539]:
credit_score['gender'].unique()

array(['F', 'M', 'XNA'], dtype=object)

In [540]:
credit_score['gender'].value_counts(normalize=True)

F      0.661370
M      0.338583
XNA    0.000046
Name: gender, dtype: float64

We don't know the nature of the single XNA value. If the person was given the option not to make a binary male/female choice, we can't decide for them; therefore, we will leave the values unchanged.

In [541]:
# Let's see the values in the income_type column
credit_score['income_type'].describe()

count        21525
unique           8
top       employee
freq         11119
Name: income_type, dtype: object

In [542]:
credit_score['income_type'].unique()

array(['employee', 'retiree', 'business', 'civil servant', 'unemployed',
       'entrepreneur', 'student', 'paternity / maternity leave'],
      dtype=object)

In [543]:
credit_score['income_type'].value_counts()

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

No problematic values were found in this column. It might be worthwhile to reduce the number of categories by removing the rows with student, paternity / maternity leave, and unemployed, and by merging entrepreneur into the business category: together these account for only 6 rows, so the effect on the results would be negligible. However, we will not do that.



In [544]:
#Let's check for duplicates
credit_score.duplicated().sum()

72

In [545]:
credit_score[credit_score.duplicated(keep=False)].sort_values(by=['dob_years'],ascending=False)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
7938,0,,71.0,secondary education,1,civil partnership,1,F,retiree,0,,having a wedding
6537,0,,71.0,secondary education,1,civil partnership,1,F,retiree,0,,having a wedding
9604,0,,71.0,secondary education,1,civil partnership,1,F,retiree,0,,having a wedding
5865,0,,66.0,secondary education,1,widow / widower,2,F,retiree,0,,transactions with my real estate
9528,0,,66.0,secondary education,1,widow / widower,2,F,retiree,0,,transactions with my real estate
...,...,...,...,...,...,...,...,...,...,...,...,...
18328,0,,29.0,bachelor's degree,0,married,0,M,employee,0,,buy residential real estate
19321,0,,23.0,secondary education,1,unmarried,4,F,employee,0,,second-hand car purchase
15892,0,,23.0,secondary education,1,unmarried,4,F,employee,0,,second-hand car purchase
20297,1,,23.0,secondary education,1,civil partnership,1,F,employee,0,,to have a wedding


Because the dataset has no unique client identifier, we cannot conclude unequivocally that these rows are true duplicates. However, rows that match on every column, including the free-text purpose, suggest that they indeed are. As the share of duplicated rows is low (72 of 21,525, about 0.3%), we will simply drop them.

In [546]:
credit_score = credit_score.drop_duplicates()

In [547]:
# Last check whether we still have any duplicates
credit_score.duplicated().sum()

0
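
The difference between the two `duplicated()` calls used above can be sketched on a toy frame (the rows are invented for illustration). The `reset_index` call is an optional extra that this notebook does not apply.

```python
import pandas as pd

# Toy rows for illustration; rows 0 and 2 are identical
df = pd.DataFrame({
    'dob_years': [71, 66, 71, 23],
    'purpose':   ['having a wedding', 'property', 'having a wedding', 'car'],
})

# Default keep='first' flags only the repeats: this is how many rows would be dropped
n_repeats = df.duplicated().sum()

# keep=False flags every member of a duplicated group, which is handy for inspection
n_involved = df.duplicated(keep=False).sum()

deduplicated = df.drop_duplicates().reset_index(drop=True)
print(n_repeats, n_involved, len(deduplicated))
```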

In [548]:
credit_score.shape

(21453, 12)

In [549]:
credit_score.describe(include='all')

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
count,21453.0,19351.0,21453.0,21453,21453.0,21453,21453.0,21453,21453,21453.0,19351.0,21453
unique,,,,5,,5,,3,8,,,38
top,,,,secondary education,,married,,F,employee,,,wedding ceremony
freq,,,,15171,,12339,,14173,11083,,,791
mean,0.480585,63046.497661,43.469025,,0.817089,,0.973896,,,0.081154,26787.568355,
std,0.756079,140827.311974,12.214162,,0.548686,,1.421601,,,0.273078,16475.450632,
min,0.0,-18388.949901,19.0,,0.0,,0.0,,,0.0,3306.762,
25%,0.0,-2747.423625,33.0,,1.0,,0.0,,,0.0,16488.5045,
50%,0.0,-1203.369529,42.0,,1.0,,0.0,,,0.0,23202.87,
75%,1.0,-291.095954,53.0,,1.0,,1.0,,,0.0,32549.611,


So far we have addressed the following issues in the data:
- `children`: there are no negative values in the data
- `days_employed`: left untouched, as it will not be needed in the analysis
- `dob_years`: no longer contains 0 values
- Duplicated rows have been dropped
- `gender`: values left untouched
- `education`: values were transformed to lower case to avoid duplicated categories

## Working with missing values

### Restoring missing values in `total_income`

The total_income column still has missing values, which need to be addressed before we proceed with the analysis. We can estimate them by calculating the median for each income_type category and replacing the missing values with that median. Additionally, education and age may affect total_income. We will start by defining age groups and creating a new column with the assigned age group for every row.

In [550]:
# Let's write a function that calculates the age category
def assign_age_category(dob_years):
    if dob_years < 20:
        return '19'
    elif dob_years < 30:
        return '20-29'
    elif dob_years < 40:
        return '30-39'
    elif dob_years < 50:
        return '40-49'
    elif dob_years < 60:
        return '50-59'
    elif dob_years < 70:
        return '60-69'
    else:
        return '70+'


In [551]:
print(assign_age_category(75))


70+


In [552]:
#New column based on function
credit_score['age_category'] = credit_score['dob_years'].apply(assign_age_category)

In [553]:
# Checking how the values are distributed in the new column
credit_score['age_category'].value_counts(normalize=True)

30-39    0.263926
40-49    0.254230
50-59    0.217079
20-29    0.147578
60-69    0.108656
70+      0.007878
19       0.000653
Name: age_category, dtype: float64
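
As an aside, the same grouping can be sketched with `pd.cut` instead of a row-wise function. This is an alternative, not the approach used in this notebook, and the ages below are invented for illustration. One assumed difference: values below 0 would become missing here, whereas `assign_age_category` would label them '19'.

```python
import pandas as pd

ages = pd.Series([19, 20, 29, 30, 45, 59, 69, 70, 75])

# Left-closed bins [0, 20), [20, 30), ... mirror the `<` comparisons in assign_age_category
bins = [0, 20, 30, 40, 50, 60, 70, float('inf')]
labels = ['19', '20-29', '30-39', '40-49', '50-59', '60-69', '70+']

age_category = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_category.tolist())
```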

In [554]:
# Let's create a table without missing values and print a few of its rows to make sure it looks fine
credit_score_filtered3 = credit_score.dropna()
credit_score_filtered3.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category
0,1,-8437.673028,42.0,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,40-49
1,1,-4024.803754,36.0,secondary education,1,married,0,F,employee,0,17932.802,car purchase,30-39
2,0,-5623.42261,33.0,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,30-39
3,3,-4124.747207,32.0,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,30-39
4,0,340266.072047,53.0,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,50-59
5,0,-926.185831,27.0,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,20-29
6,0,-2879.202052,43.0,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,40-49
7,0,-152.779569,50.0,secondary education,1,married,0,M,employee,0,21731.829,education,50-59
8,2,-6929.865299,35.0,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,30-39
9,0,-2188.756445,41.0,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,40-49


In [563]:
credit_score_filtered3['total_income'].agg(['median', 'mean'])

median    23202.870000
mean      26787.568355
Name: total_income, dtype: float64

In [561]:
#Let's see the total_income average and median by age groups
credit_score_filtered3.groupby(['age_category']).agg({'total_income': ['median', 'mean']})

Unnamed: 0_level_0,total_income,total_income
Unnamed: 0_level_1,median,mean
age_category,Unnamed: 1_level_2,Unnamed: 2_level_2
19,14934.901,16993.942462
20-29,22799.258,25572.630177
30-39,24667.528,28312.479963
40-49,24755.696,28491.929026
50-59,22203.0745,25811.700327
60-69,19817.44,23242.812818
70+,18751.324,20125.658331


In [559]:
#Let's see the total_income average and median by education groups
credit_score_filtered3.groupby(['education']).agg({'total_income': ['median', 'mean']})

Unnamed: 0_level_0,total_income,total_income
Unnamed: 0_level_1,median,mean
education,Unnamed: 1_level_2,Unnamed: 2_level_2
bachelor's degree,28054.531,33142.802434
graduate degree,25161.5835,27960.024667
primary education,18741.976,21144.882211
secondary education,21836.583,24594.503037
some college,25618.464,29045.443644


In [564]:
#Let's see the total_income average and median by family status
credit_score_filtered3.groupby(['family_status']).agg({'total_income': ['median', 'mean']})

Unnamed: 0_level_0,total_income,total_income
Unnamed: 0_level_1,median,mean
family_status,Unnamed: 1_level_2,Unnamed: 2_level_2
civil partnership,23186.534,26694.428597
divorced,23515.096,27189.35455
married,23389.54,27041.784689
unmarried,23149.028,26934.069805
widow / widower,20514.19,22984.208556


In [565]:
#Let's see the total_income average and median by income type
credit_score_filtered3.groupby(['income_type']).agg({'total_income': ['median', 'mean']})

Unnamed: 0_level_0,total_income,total_income
Unnamed: 0_level_1,median,mean
income_type,Unnamed: 1_level_2,Unnamed: 2_level_2
business,27577.272,32386.793835
civil servant,24071.6695,27343.729582
employee,22815.1035,25820.841683
entrepreneur,79866.103,79866.103
paternity / maternity leave,8612.661,8612.661
retiree,18962.318,21940.394503
student,15712.26,15712.26
unemployed,21014.3605,21014.3605


In [566]:
#Let's see the total_income average and median by number of children
credit_score_filtered3.groupby(['children']).agg({'total_income': ['median', 'mean']})

Unnamed: 0_level_0,total_income,total_income
Unnamed: 0_level_1,median,mean
children,Unnamed: 1_level_2,Unnamed: 2_level_2
0,23029.9535,26422.404866
1,23660.563,27368.627863
2,23136.1155,27478.854282
3,25155.448,29322.623993
4,24981.634,27289.829647
5,29816.2255,27268.84725


In [567]:
#Let's see the total_income average and median by gender
credit_score_filtered3.groupby(['gender']).agg({'total_income': ['median', 'mean']})

Unnamed: 0_level_0,total_income,total_income
Unnamed: 0_level_1,median,mean
gender,Unnamed: 1_level_2,Unnamed: 2_level_2
F,21464.845,24655.604757
M,26834.295,30907.144369
XNA,32624.825,32624.825


In [568]:
#Let's see the total_income average and median by those who have unpaid debt and those who don't
credit_score_filtered3.groupby(['debt']).agg({'total_income': ['median', 'mean']})

Unnamed: 0_level_0,total_income,total_income
Unnamed: 0_level_1,median,mean
debt,Unnamed: 1_level_2,Unnamed: 2_level_2
0,23225.905,26848.661065
1,22928.48,26096.143537


It's best to use the median. In almost every group the mean is noticeably higher than the median, which means the distribution is skewed by outliers (mostly very high incomes), so the median gives a more truthful picture of a typical income.
Based on the analysis above, the characteristics that define income most are 'education', 'income_type', and, to some extent, 'age_category': when the data is grouped by each of these columns, the gap between the overall median of total_income and the group medians is the largest. We also see that 'gender' has an impact on total_income; however, we would rather not build the wage gap between genders into the imputed values, and prefer to imagine a fair world where gender has nothing to do with how much someone earns. Therefore, we will leave gender out of the equation.
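
To illustrate why the median is the more robust choice here, a small sketch with invented incomes (not taken from the dataset) shows how a single very large value moves the mean but not the median.

```python
from statistics import mean, median

# Invented incomes for illustration: most are similar, one is very large
incomes = [18000, 21000, 23000, 24000, 26000, 360000]

print(mean(incomes))    # pulled up by the single large value
print(median(incomes))  # stays near the typical income
```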

In [569]:
#Filling the missing values in total_income based on education, income_type, and age_category
credit_score['total_income'] = credit_score['total_income'].fillna(credit_score.groupby(['education', 'income_type', 'age_category'])['total_income'].transform('median'))

In [570]:
credit_score['total_income'].isna().sum()

3

In [571]:
# Show the rows where total_income is still missing
credit_score[credit_score.total_income.isna()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category
1303,1,,70.0,primary education,3,civil partnership,1,F,employee,0,,transactions with commercial real estate,70+
5936,0,,58.0,bachelor's degree,0,married,0,M,entrepreneur,0,,buy residential real estate,50-59
8142,0,,64.0,primary education,3,civil partnership,1,F,civil servant,0,,to have a wedding,60-69


In [572]:
#To fix the remaining missing values we will simplify the code and replace them only based on the income_type
credit_score['total_income'] = credit_score['total_income'].fillna(credit_score.groupby('income_type')['total_income'].transform('median'))

In [573]:
credit_score['total_income'].isna().sum()

0
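
The two-step fill above can be sketched on a toy frame (invented values; only the column names follow the dataset). It also shows one plausible reason why a few rows stayed empty after the first step: when no row in a narrow group has a known income, the group median is itself missing.

```python
import pandas as pd

# Toy frame for illustration; column names follow the dataset
df = pd.DataFrame({
    'education':    ['secondary', 'secondary', 'secondary', 'primary', 'bachelor'],
    'income_type':  ['employee', 'employee', 'employee', 'employee', 'employee'],
    'total_income': [20000.0, 24000.0, None, None, 30000.0],
})

# Step 1: median of the narrow group (education + income_type)
narrow = df.groupby(['education', 'income_type'])['total_income'].transform('median')
df['total_income'] = df['total_income'].fillna(narrow)

# The 'primary' row is alone in its group, so its group median is NaN and it stays missing
still_missing = df['total_income'].isna().sum()

# Step 2: fall back to the broader income_type median for whatever is left
broad = df.groupby('income_type')['total_income'].transform('median')
df['total_income'] = df['total_income'].fillna(broad)
print(still_missing, df['total_income'].tolist())
```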

In [574]:
# Checking the number of entries in the columns
credit_score['education'].count()

21453

In [575]:
credit_score['age_category'].count()

21453

In [576]:
credit_score['total_income'].count()

21453

In [577]:
credit_score.isnull().sum()

children               0
days_employed       2102
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income           0
purpose                0
age_category           0
dtype: int64

All columns are now complete except days_employed, which still has 2,102 missing values and which we deliberately left untouched.

## Categorization of data


The initial question we had was whether a customer’s marital status and number of children have an impact on whether they will default on a loan.
To categorize the data, let's look at the following columns:
1. total_income
2. purpose
3. children

In [578]:
credit_score['total_income'].describe()

count     21453.000000
mean      26477.986277
std       15733.778501
min        3306.762000
25%       17191.455000
50%       22934.395000
75%       31657.491000
max      362496.645000
Name: total_income, dtype: float64

In [579]:
credit_score.purpose.unique()

array(['purchase of the house', 'car purchase', 'supplementary education',
       'to have a wedding', 'housing transactions', 'education',
       'having a wedding', 'purchase of the house for my family',
       'buy real estate', 'buy commercial real estate',
       'buy residential real estate', 'construction of own property',
       'property', 'building a property', 'buying a second-hand car',
       'buying my own car', 'transactions with commercial real estate',
       'building a real estate', 'housing',
       'transactions with my real estate', 'cars', 'to become educated',
       'second-hand car purchase', 'getting an education', 'car',
       'wedding ceremony', 'to get a supplementary education',
       'purchase of my own house', 'real estate transactions',
       'getting higher education', 'to own a car', 'purchase of a car',
       'profile education', 'university education',
       'buying property for renting out', 'to buy a car',
       'housing renovation', 'going

In [580]:
credit_score['purpose'].value_counts()

wedding ceremony                            791
having a wedding                            767
to have a wedding                           765
real estate transactions                    675
buy commercial real estate                  661
housing transactions                        652
buying property for renting out             651
transactions with commercial real estate    650
purchase of the house                       646
housing                                     646
purchase of the house for my family         638
construction of own property                635
property                                    633
transactions with my real estate            627
building a real estate                      624
buy real estate                             621
purchase of my own house                    620
building a property                         619
housing renovation                          607
buy residential real estate                 606
buying my own car                       

Let's check the unique values in the children column.

In [581]:
credit_score['children'].unique()

array([1, 0, 3, 2, 4, 5])

In [582]:
credit_score['children'].value_counts()

0    14090
1     4855
2     2128
3      330
4       41
5        9
Name: children, dtype: int64

We will categorize three characteristics: 

total_income: it's customary to look at income as average, above average, and below average. This is how we will categorize it:
- Average: within one standard deviation of the mean (mean +/- std)
- Below average: lower than mean - std
- Above average: higher than mean + std

Purpose: when running the unique() method on the column, we see that the following topics come up:
- Home owning** - purchase or reconstruction of the client's own home
- Real Estate Investment** - investment in real estate other than the client's own home
- Wedding
- Purchasing a vehicle
- Education

** It's important to note that some values in the purpose column are ambiguous as to which category they belong to, specifically between the Home owning and Real Estate Investment categories. We made a subjective decision to assign them to one of the two. For instance, 'construction of own property', 'buy residential real estate', and 'transactions with my real estate' were assigned to Home owning, as the client states their own property as the purpose, whereas 'property' and 'real estate transactions' were assigned to Real Estate Investment. We will later check whether the categories were assigned correctly.

Children:
- 0 - a significant share of the clients have no children
- 1-3 - an average number of children in developed countries and societies with high socio-economic status
- 4-5 - above average

Let's categorize non-numeric data first

In [583]:

lemmas_list_all = []

for purpose in credit_score.purpose.unique():
    words = nltk.word_tokenize(purpose)
    lemmas = [wordnet_lemma.lemmatize(w, pos = 'n') for w in words]
    lemmas=[l.lower() for l in lemmas]
    for i in lemmas:
        if i==',':
            continue
        else:    
            lemmas_list_all.append(i)

In [584]:
lemmas_list_all

['purchase',
 'of',
 'the',
 'house',
 'car',
 'purchase',
 'supplementary',
 'education',
 'to',
 'have',
 'a',
 'wedding',
 'housing',
 'transaction',
 'education',
 'having',
 'a',
 'wedding',
 'purchase',
 'of',
 'the',
 'house',
 'for',
 'my',
 'family',
 'buy',
 'real',
 'estate',
 'buy',
 'commercial',
 'real',
 'estate',
 'buy',
 'residential',
 'real',
 'estate',
 'construction',
 'of',
 'own',
 'property',
 'property',
 'building',
 'a',
 'property',
 'buying',
 'a',
 'second-hand',
 'car',
 'buying',
 'my',
 'own',
 'car',
 'transaction',
 'with',
 'commercial',
 'real',
 'estate',
 'building',
 'a',
 'real',
 'estate',
 'housing',
 'transaction',
 'with',
 'my',
 'real',
 'estate',
 'car',
 'to',
 'become',
 'educated',
 'second-hand',
 'car',
 'purchase',
 'getting',
 'an',
 'education',
 'car',
 'wedding',
 'ceremony',
 'to',
 'get',
 'a',
 'supplementary',
 'education',
 'purchase',
 'of',
 'my',
 'own',
 'house',
 'real',
 'estate',
 'transaction',
 'getting',
 'higher'

In [585]:
wedding_category = ['wedding', 'ceremony']
home_owning_category = ['housing', 'house', 'family', 'construction', 'renovation', 'residential']
real_estate_investment_category = ['real', 'estate','commercial', 'property', 'renting', 
                          'building']
# Tokens are single words, so a two-word entry such as 'car purchase' could never match
vehicle_purchase_category = ['car']
education_category = ['education', 'university', 'educated']

In [586]:
def lemmatization_func(line):
  
    words = nltk.word_tokenize(line)
    lemmas = [wordnet_lemma.lemmatize(w, pos = 'n') for w in words]
    lemmas=[l.lower() for l in lemmas]
    
    if any(word in lemmas for word in wedding_category):
        return 'wedding'
    elif  any(word in lemmas for word in real_estate_investment_category):
        return 'real_estate_investment'
    elif  any(word in lemmas for word in home_owning_category):
        return 'home_owning'
    elif  any(word in lemmas for word in vehicle_purchase_category):
        return 'vehicle_purchase'
    elif  any(word in lemmas for word in education_category):
        return 'education'

In [587]:
credit_score['purpose_category']=credit_score['purpose'].apply(lemmatization_func)

In [588]:
credit_score['purpose_category'].value_counts()

real_estate_investment    7002
vehicle_purchase          4306
education                 4013
home_owning               3809
wedding                   2323
Name: purpose_category, dtype: int64

In [589]:
credit_score['purpose_category'].isna().sum()

0

In [590]:
credit_score['total_income'].describe()

count     21453.000000
mean      26477.986277
std       15733.778501
min        3306.762000
25%       17191.455000
50%       22934.395000
75%       31657.491000
max      362496.645000
Name: total_income, dtype: float64

Let's check whether the categories for the purpose column were assigned correctly.

In [591]:
credit_score.head(30)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category,purpose_category
0,1,-8437.673028,42.0,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,40-49,home_owning
1,1,-4024.803754,36.0,secondary education,1,married,0,F,employee,0,17932.802,car purchase,30-39,vehicle_purchase
2,0,-5623.42261,33.0,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,30-39,home_owning
3,3,-4124.747207,32.0,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,30-39,education
4,0,340266.072047,53.0,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,50-59,wedding
5,0,-926.185831,27.0,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,20-29,home_owning
6,0,-2879.202052,43.0,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,40-49,home_owning
7,0,-152.779569,50.0,secondary education,1,married,0,M,employee,0,21731.829,education,50-59,education
8,2,-6929.865299,35.0,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,30-39,wedding
9,0,-2188.756445,41.0,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,40-49,home_owning


In [592]:
credit_score.tail(30)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category,purpose_category
21495,1,,50.0,secondary education,1,civil partnership,1,F,employee,0,21831.2455,wedding ceremony,50-59,wedding
21496,1,-759.680851,31.0,secondary education,1,married,0,F,employee,0,19102.819,to buy a car,30-39,vehicle_purchase
21497,0,,48.0,bachelor's degree,0,married,0,F,business,0,34604.478,building a property,40-49,real_estate_investment
21498,1,-1330.627998,32.0,secondary education,1,civil partnership,1,M,employee,0,38522.812,to have a wedding,30-39,wedding
21499,0,-9929.015065,57.0,secondary education,1,civil partnership,1,M,business,0,25208.505,wedding ceremony,50-59,wedding
21500,0,-578.082757,26.0,some college,2,unmarried,4,M,business,0,12450.127,transactions with commercial real estate,20-29,real_estate_investment
21501,0,334343.096304,57.0,secondary education,1,married,0,F,retiree,0,13797.14,housing,50-59,home_owning
21502,1,,42.0,secondary education,1,married,0,F,employee,0,22192.7345,building a real estate,40-49,real_estate_investment
21503,0,-3096.881131,58.0,secondary education,1,married,0,F,employee,0,42280.16,to become educated,50-59,education
21504,0,355235.728158,68.0,secondary education,1,married,0,F,retiree,0,12890.611,supplementary education,60-69,education


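We see that some purposes ('construction of own property', 'buy residential real estate', 'transactions with my real estate') were assigned the wrong category: real_estate_investment instead of home_owning. This happens because lemmatization_func checks the real-estate keywords ('real', 'estate', 'property') before the home-owning ones. Let's check whether that's true for all rows with 'construction of own property'.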

In [593]:
credit_score[credit_score.purpose=='construction of own property']

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category,purpose_category
15,1,-972.364419,26.0,secondary education,1,married,0,F,employee,0,18691.345,construction of own property,20-29,real_estate_investment
27,0,-529.191635,28.0,bachelor's degree,0,married,0,M,employee,0,49415.837,construction of own property,20-29,real_estate_investment
28,1,-717.274324,26.0,bachelor's degree,0,married,0,F,employee,0,30058.118,construction of own property,20-29,real_estate_investment
48,0,-3341.067886,45.0,secondary education,1,married,0,F,employee,0,25930.483,construction of own property,40-49,real_estate_investment
105,0,-2098.626296,62.0,secondary education,1,widow / widower,2,F,employee,0,12301.470,construction of own property,60-69,real_estate_investment
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21363,2,-1384.029879,36.0,secondary education,1,married,0,M,business,0,35366.945,construction of own property,30-39,real_estate_investment
21368,1,-2936.875865,39.0,secondary education,1,married,0,F,employee,0,8868.062,construction of own property,30-39,real_estate_investment
21392,2,-1644.111448,35.0,bachelor's degree,0,married,0,M,business,0,101842.447,construction of own property,30-39,real_estate_investment
21454,0,-1228.222676,48.0,secondary education,1,married,0,M,employee,0,33738.832,construction of own property,40-49,real_estate_investment
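
A simplified sketch of how the order of the keyword checks produces this result. It is a stand-in for `lemmatization_func`, not a copy of it: it assumes plain whitespace tokenization instead of nltk and covers only two of the five categories, and the names `categorize`, `real_estate_keywords` and `home_owning_keywords` are hypothetical.

```python
# Simplified stand-in for lemmatization_func: whitespace tokens, no lemmatizer
real_estate_keywords = ['real', 'estate', 'commercial', 'property', 'renting', 'building']
home_owning_keywords = ['housing', 'house', 'family', 'construction', 'renovation', 'residential']

def categorize(purpose):
    tokens = purpose.lower().split()
    # The first matching branch wins, so the order of the checks decides ties
    if any(word in tokens for word in real_estate_keywords):
        return 'real_estate_investment'
    if any(word in tokens for word in home_owning_keywords):
        return 'home_owning'
    return None

# 'property' matches the first branch before 'construction' is ever considered
print(categorize('construction of own property'))
print(categorize('purchase of the house'))
```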


Let's replace the incorrect categories with correct ones.

In [594]:
# A single vectorized assignment is enough; no loop is needed
credit_score.loc[credit_score['purpose']=='construction of own property', 'purpose_category'] = 'home_owning'

In [595]:
# A single vectorized assignment is enough; no loop is needed
credit_score.loc[credit_score['purpose']=='buy residential real estate', 'purpose_category'] = 'home_owning'

In [596]:
# A single vectorized assignment is enough; no loop is needed
credit_score.loc[credit_score['purpose']=='transactions with my real estate', 'purpose_category'] = 'home_owning'

In [597]:
credit_score[credit_score.purpose=='transactions with my real estate'].head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category,purpose_category
34,0,-4488.067031,35.0,secondary education,1,married,0,F,employee,0,16745.672,transactions with my real estate,30-39,home_owning
84,0,-7125.215028,53.0,secondary education,1,married,0,F,civil servant,0,40945.468,transactions with my real estate,50-59,home_owning
114,0,-1599.879161,26.0,bachelor's degree,0,married,0,F,employee,1,22955.474,transactions with my real estate,20-29,home_owning
130,0,-897.322806,42.0,secondary education,1,married,0,F,business,0,15335.319,transactions with my real estate,40-49,home_owning
136,0,357880.159379,60.0,primary education,3,married,0,M,retiree,0,18099.872,transactions with my real estate,60-69,home_owning


In [598]:
credit_score[credit_score.purpose=='buy residential real estate'].head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category,purpose_category
14,0,-1844.956182,56.0,bachelor's degree,0,civil partnership,1,F,business,1,26420.466,buy residential real estate,50-59,home_owning
96,0,,44.0,secondary education,1,married,0,F,employee,0,22192.7345,buy residential real estate,40-49,home_owning
116,1,-540.038425,50.0,secondary education,1,married,0,M,employee,0,13338.611,buy residential real estate,50-59,home_owning
134,1,-4171.107677,46.0,some college,2,divorced,3,F,employee,0,26562.526,buy residential real estate,40-49,home_owning
201,0,-6542.46079,39.0,secondary education,1,married,0,F,business,0,13919.489,buy residential real estate,30-39,home_owning


In [599]:
credit_score['purpose_category'].value_counts()

home_owning               5677
real_estate_investment    5134
vehicle_purchase          4306
education                 4013
wedding                   2323
Name: purpose_category, dtype: int64
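
The three corrections could also be written as one assignment with `isin`. The sketch below uses a toy frame (invented rows, column names as in this notebook) and a hypothetical list name, `home_purposes`.

```python
import pandas as pd

# Toy frame for illustration; column names follow the notebook
df = pd.DataFrame({
    'purpose': ['construction of own property', 'property',
                'buy residential real estate', 'transactions with my real estate'],
    'purpose_category': ['real_estate_investment'] * 4,
})

# Purposes that should be reassigned to home_owning
home_purposes = [
    'construction of own property',
    'buy residential real estate',
    'transactions with my real estate',
]

# One vectorized assignment covers all three purposes
df.loc[df['purpose'].isin(home_purposes), 'purpose_category'] = 'home_owning'
print(df['purpose_category'].tolist())
```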

Now let's get down to numeric columns.

In [600]:
credit_score['total_income'].describe()

count     21453.000000
mean      26477.986277
std       15733.778501
min        3306.762000
25%       17191.455000
50%       22934.395000
75%       31657.491000
max      362496.645000
Name: total_income, dtype: float64

In [601]:
#Let's calculate the mean of total_income plus one standard deviation to get the upper threshold for the average income
credit_score['total_income'].mean() + credit_score['total_income'].std()

42211.76477835212

In [602]:
#Let's calculate the mean of total_income minus one standard deviation to get the lower threshold for the average income
credit_score['total_income'].mean() - credit_score['total_income'].std()

10744.207775696277

In [603]:
#Let's write a function which will assign an income level to the client's income

# Compute the thresholds once instead of recalculating them on every call
mean_plus_std = credit_score['total_income'].mean() + credit_score['total_income'].std()
mean_minus_std = credit_score['total_income'].mean() - credit_score['total_income'].std()

def total_income_categ(total_income):
    """
    Evaluate the income level of the client.
    The values of total_income.mean +/- std() were calculated above.
    - if (total_income.mean - std) <= total_income <= (total_income.mean + std) - the function will return 'average income'
    - if total_income < (total_income.mean - std) - the function will return 'below average income'
    - if total_income > (total_income.mean + std) - it will return 'above average income'
    """
    if mean_minus_std <= total_income <= mean_plus_std:
        return 'average income'
    if total_income > mean_plus_std:
        return 'above average income'
    return 'below average income'

print(total_income_categ(42000))
print(total_income_categ(45000))
print(total_income_categ(8000))


average income
above average income
below average income


In [605]:
# Creating a column with the categories and counting the values for them
credit_score['income_level'] = credit_score['total_income'].apply(total_income_categ)
print(credit_score['income_level'].value_counts())

average income          17852
above average income     2320
below average income     1281
Name: income_level, dtype: int64
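
The row-wise `apply` above recomputes the mean and standard deviation for every client. A minimal vectorized sketch of the same rule is shown below. It runs on a small made-up `income` Series rather than `credit_score['total_income']`, the helper name `income_level_vectorized` is hypothetical, and it assumes there are no missing incomes (a missing value would fall into the default label here).

```python
import numpy as np
import pandas as pd


def income_level_vectorized(income: pd.Series) -> pd.Series:
    """Label each income relative to mean +/- one standard deviation."""
    # Compute the thresholds once instead of once per row.
    lower = income.mean() - income.std()
    upper = income.mean() + income.std()
    labels = np.select(
        [income < lower, income > upper],
        ['below average income', 'above average income'],
        default='average income',  # also used for missing values (assumed absent)
    )
    return pd.Series(labels, index=income.index, name='income_level')


# Made-up incomes for illustration only (not taken from the dataset).
income = pd.Series([5_000, 20_000, 21_000, 22_000, 23_000, 24_000, 60_000])
income_levels = income_level_vectorized(income)
print(income_levels.value_counts())
```

On the real data this would be called as `income_level_vectorized(credit_score['total_income'])`.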


In [606]:
# Let's write a function which will categorize the number of children a client has

def children_categ(children):
    """
    Categorize the number of children the client has.

    - if children == 0, return '0'
    - if 1 <= children <= 3, return '1-3'
    - if 4 <= children <= 5, return '4-5'
    - otherwise (more than 5 children), return 'Other'
    """
    
    if  children == 0:
        return '0'
    elif children <= 3:
        return '1-3'
    elif children <= 5:
        return '4-5'
    else:
        return 'Other'
    
print(children_categ(0))
print(children_categ(2))
print(children_categ(4))
print(children_categ(6))


0
1-3
4-5
Other


In [607]:
#applying the function and creating a new column
credit_score['children_category']=credit_score['children'].apply(children_categ)
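
The same binning can also be expressed with `pd.cut`. This is a sketch on made-up child counts, assuming the counts have already been cleaned (no negative values; a negative count would come out as missing here instead of being labelled).

```python
import pandas as pd

# Made-up child counts for illustration only.
children = pd.Series([0, 1, 3, 4, 5, 6])

# Right-closed bins: (-1, 0], (0, 3], (3, 5], (5, inf]
children_category = pd.cut(
    children,
    bins=[-1, 0, 3, 5, float('inf')],
    labels=['0', '1-3', '4-5', 'Other'],
)
print(children_category.tolist())
```

The result is a categorical column, so the labels keep their natural order when grouped or sorted.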

In [609]:
# Getting summary statistics for the column
credit_score['purpose_category'].describe()

count           21453
unique              5
top       home_owning
freq             5677
Name: purpose_category, dtype: object

In [610]:
credit_score['purpose_category'].value_counts(normalize=True)

home_owning               0.264625
real_estate_investment    0.239314
vehicle_purchase          0.200718
education                 0.187060
wedding                   0.108283
Name: purpose_category, dtype: float64

In [611]:
credit_score['income_level'].describe()

count              21453
unique                 3
top       average income
freq               17852
Name: income_level, dtype: object

In [612]:
credit_score['income_level'].value_counts(normalize=True)

average income          0.832145
above average income    0.108143
below average income    0.059712
Name: income_level, dtype: float64

In [613]:
credit_score['children_category'].describe()

count     21453
unique        3
top           0
freq      14090
Name: children_category, dtype: object

In [614]:
credit_score['children_category'].value_counts(normalize=True)

0      0.656785
1-3    0.340885
4-5    0.002331
Name: children_category, dtype: float64

## Checking the Hypotheses


**Is there a correlation between having children and paying back on time?**

Let's first calculate the default rate.

In [615]:
default_rate = credit_score['debt'].sum()/credit_score['debt'].count()

print(f"{default_rate:.1%} of clients defaulted on the debt.")

8.1% of clients defaulted on the debt.


In [616]:
credit_score.groupby(['children_category'])['debt'].mean().reset_index().sort_values(by='debt')

Unnamed: 0,children_category,debt
0,0,0.075444
2,4-5,0.08
1,1-3,0.092165
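
A default rate on its own hides how many clients stand behind it. The sketch below, on made-up rows, reports the group size and the number of defaults next to each rate. The column names `clients`, `defaults` and `default_rate` are chosen here for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; `debt` is 1 if the client defaulted.
toy = pd.DataFrame({
    'children_category': ['0', '0', '0', '0', '1-3', '1-3', '1-3', '4-5'],
    'debt':              [0,   0,   0,   1,   0,     1,     1,     0],
})

summary = (
    toy.groupby('children_category')['debt']
       .agg(clients='count', defaults='sum', default_rate='mean')
       .sort_values('default_rate')
)
print(summary)
```

Because `debt` is a 0/1 flag, its mean within a group is that group's default rate.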


**Conclusion**

Overall, 8.1% of clients defaulted on their loan.
Clients with 1-3 children have the highest default rate (9.2%), compared with 7.5% for clients with no children and 8.0% for clients with 4-5 children. This is partly consistent with the original hypothesis that clients with a smaller number of children are more likely to repay the loan on time: clients without children do have the lowest default rate. However, the hypothesis wasn't fully confirmed, because the default rate does not rise steadily with the number of children: clients with 1-3 children default more often than clients with 4-5 children. Note that the 4-5 group is very small (about 0.2% of clients), so its rate is less reliable.


**Is there a correlation between family status and paying back on time?**

In [617]:
credit_score['family_status'].value_counts(normalize=True)

married              0.575164
civil partnership    0.193446
unmarried            0.130984
divorced             0.055703
widow / widower      0.044702
Name: family_status, dtype: float64

In [618]:
credit_score.groupby(['family_status'])['debt'].mean().reset_index().sort_values(by='debt')

Unnamed: 0,family_status,debt
4,widow / widower,0.065693
1,divorced,0.07113
2,married,0.075452
0,civil partnership,0.093494
3,unmarried,0.097509


In [619]:
family_status_final = credit_score.groupby(['family_status', 'children_category'])['debt'].mean().reset_index().sort_values(by='debt')
family_status_final

Unnamed: 0,family_status,children_category,debt
2,civil partnership,4-5,0.0
5,divorced,4-5,0.0
14,widow / widower,4-5,0.0
12,widow / widower,0,0.062574
6,married,0,0.069095
3,divorced,0,0.070153
4,divorced,1-3,0.073171
8,married,4-5,0.083333
0,civil partnership,0,0.083914
7,married,1-3,0.085212


In [620]:
family_status_pivot = family_status_final.pivot_table(index='family_status', columns='children_category', values='debt', aggfunc='sum')
family_status_pivot

children_category,0,1-3,4-5
family_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
civil partnership,0.083914,0.112686,0.0
divorced,0.070153,0.073171,0.0
married,0.069095,0.085212,0.083333
unmarried,0.092838,0.115385,0.5
widow / widower,0.062574,0.09009,0.0


**Conclusion**

It's clear from the data that unmarried clients are the least likely to repay the loan on time (9.8% default rate), followed by clients in a civil partnership (9.3%). The data is also consistent with the original hypothesis that unmarried clients with a larger number of children (4-5) are more likely to default on the loan: 50% of them failed to repay their loan on time. However, clients with 4-5 children make up only about 0.2% of the data, so this rate rests on very few clients and should be treated with caution. Next come clients with 1-3 children who are unmarried (11.5%) or in a civil partnership (11.3%).
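
Some cells in the table above may be based on very few clients, and rates from different cells cannot simply be added together. A minimal sketch on made-up rows: calling `pivot_table` directly on row-level data with `aggfunc='mean'` and `margins=True` gives size-weighted combined rates in the `All` row and column, and a second table with `aggfunc='count'` shows how many clients each cell contains.

```python
import pandas as pd

# Made-up rows for illustration only; `debt` is 1 if the client defaulted.
toy = pd.DataFrame({
    'family_status': ['married'] * 4 + ['unmarried'] * 4,
    'children_category': ['0', '0', '1-3', '1-3', '0', '0', '0', '1-3'],
    'debt': [0, 0, 0, 1, 0, 0, 1, 1],
})

# Default rate per cell; the 'All' margins are computed from the raw rows,
# so they are size-weighted rates rather than sums of the cell rates.
rates = toy.pivot_table(index='family_status', columns='children_category',
                        values='debt', aggfunc='mean', margins=True)

# Number of clients behind each cell, to judge how reliable each rate is.
counts = toy.pivot_table(index='family_status', columns='children_category',
                         values='debt', aggfunc='count', fill_value=0)
print(rates)
print(counts)
```

In this toy data the unmarried clients with 1-3 children show a 100% default rate, but the count table reveals that the cell holds a single client.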

**Is there a correlation between income level and paying back on time?**

In [621]:
credit_score.groupby(['income_level'])['debt'].mean().reset_index().sort_values(by='debt')

Unnamed: 0,income_level,debt
0,above average income,0.068966
2,below average income,0.070258
1,average income,0.08352


**Conclusion**

The original hypothesis about lower-income clients wasn't confirmed: the default rate does not rise or fall steadily with income. Clients with average income have the highest default rate (8.4%), while clients with below average income (7.0%) and above average income (6.9%) default at similar, lower rates.

**How does credit purpose affect the default rate?**

In [622]:
# Let's check the default rate for each credit purpose
credit_score.groupby(['purpose_category'])['debt'].mean().reset_index().sort_values(by='debt')

Unnamed: 0,purpose_category,debt
1,home_owning,0.068522
2,real_estate_investment,0.076549
4,wedding,0.080069
0,education,0.0922
3,vehicle_purchase,0.09359


In [623]:
# Let's also check the default rate for each credit purpose per income level
data_final = credit_score.groupby(['purpose_category', 'income_level'])['debt'].mean().reset_index().sort_values(by='debt')
data_final

Unnamed: 0,purpose_category,income_level,debt
5,home_owning,below average income,0.042683
12,wedding,above average income,0.056
2,education,below average income,0.060241
3,home_owning,above average income,0.06129
6,real_estate_investment,above average income,0.066901
4,home_owning,average income,0.071262
8,real_estate_investment,below average income,0.074675
9,vehicle_purchase,above average income,0.076271
7,real_estate_investment,average income,0.077971
13,wedding,average income,0.082432


In [624]:
credit_score_pivot = data_final.pivot_table(index='income_level', columns='purpose_category', values='debt', aggfunc='sum')
credit_score_pivot

purpose_category,education,home_owning,real_estate_investment,vehicle_purchase,wedding
income_level,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
above average income,0.082927,0.06129,0.066901,0.076271,0.056
average income,0.095707,0.071262,0.077971,0.095518,0.082432
below average income,0.060241,0.042683,0.074675,0.098485,0.090909


**Conclusion**

Clients who take a loan for a vehicle purchase or for education are less likely to repay it on time (default rates of 9.4% and 9.2%, respectively).
Clients with below average income who take a loan for a vehicle purchase are the most likely to default (9.8% of clients with these characteristics). Just behind them are clients with average income who take a loan for education or a vehicle purchase (about 9.6% in each category). In fourth place are clients with below average income who take a loan for a wedding (9.1%).
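
The ranking in this conclusion can be read off programmatically instead of by eye. A sketch on made-up rows that keeps the combinations with the highest default rate together with their group sizes (the names `by_cell` and `riskiest` are illustrative):

```python
import pandas as pd

# Made-up rows for illustration only; `debt` is 1 if the client defaulted.
toy = pd.DataFrame({
    'purpose_category': ['vehicle_purchase'] * 5 + ['home_owning'] * 6,
    'income_level': (['below average income'] * 3 + ['average income'] * 2
                     + ['below average income'] * 2 + ['average income'] * 4),
    'debt': [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
})

by_cell = toy.groupby(['purpose_category', 'income_level'])['debt'].agg(
    clients='count', default_rate='mean')

# Keep the two combinations with the highest default rate.
riskiest = by_cell.nlargest(2, 'default_rate')
print(riskiest)
```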

# General Conclusion 

We have addressed the following issues in the data:
- The children column: there were negative values (-1) in the data, which were replaced with 1 on the assumption that they were typos.
- The days_employed column was left untouched. After careful analysis it seems that its missing values were missing completely at random, which means there's no way we could predict or restore them. In fact, 74% of the days_employed values were negative. Additionally, after converting the days_employed values to years, the maximum and mean values were unrealistic. In total, about 90% of the data in this column was problematic. In theory it would be reasonable to delete this column completely to avoid further issues; however, it was decided to leave the column as is, since it was not required for the analysis. The nature of the problematic data suggests a technical issue, perhaps an error in the code that records and stores the data.
- The dob_years column had 0 values, which were replaced with the average age.
- The duplicated rows were dropped, as they made up an insignificant share of the data.
- The values in the gender column were left untouched, as we believe that a person has the right to decide their own gender.
- The values in the education column were transformed to lower case to avoid duplicated categories.
- No problematic values were found in the income_type column. It might have been worthwhile to reduce the number of categories in the column by removing the rows with student, paternity / maternity leave, entrepreneur (which could be combined with the business category), and unemployed, as removing those few rows would not materially affect the results. However, it was decided to keep all categories.

Note: some values in the purpose column were ambiguous as to which category they belonged to, specifically between the home owning and real estate investment categories. We made a subjective decision to assign them to one of the two. For instance, 'construction of own property', 'buy residential real estate' and 'transactions with my real estate' were assigned to home owning, as the client states their own property as the purpose, whereas 'property' and 'real estate transactions' were assigned to the real estate investment category.
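
The cleaning steps listed above could look roughly like the sketch below. It runs on made-up rows and is not the notebook's actual cleaning code. In particular, it assumes that the average age is computed over the non-zero ages only.

```python
import pandas as pd

# Made-up rows for illustration only, mimicking the issues listed above.
raw = pd.DataFrame({
    'children': [-1, 0, 2, 2],
    'dob_years': [0, 40, 30, 30],
    'education': ['Secondary Education', 'SECONDARY EDUCATION',
                  'some college', 'Some College'],
})

clean = raw.copy()
# Treat -1 children as a typo for 1.
clean['children'] = clean['children'].replace(-1, 1)
# Replace age 0 with the mean of the non-zero ages (an assumption of this sketch).
mean_age = clean.loc[clean['dob_years'] > 0, 'dob_years'].mean()
clean['dob_years'] = clean['dob_years'].mask(clean['dob_years'] == 0, mean_age)
# Lower-case education so that spelling variants collapse into one category.
clean['education'] = clean['education'].str.lower()
# Drop rows that are now exact duplicates and renumber the index.
clean = clean.drop_duplicates().reset_index(drop=True)
print(clean)
```

Lower-casing before `drop_duplicates` matters: rows that differ only in letter case are only recognised as duplicates after the case is normalised.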


It is worth noting that, based on the data, about 70% of the clients taking a loan were female. Female clients also had the lowest average income.
Additionally, it would be worth investigating further why, according to the data, about 66% of the clients taking a loan don't have children. We would have expected at least as many clients with children as without.

The conclusions regarding the questions posed, namely whether the number of children and marital status have an impact on a client's likelihood of repaying the loan, are as follows:

- Clients with an average number of children (1-3) are less likely to repay their loan on time.
- Unmarried clients are less likely to repay the loan on time. Additionally, unmarried clients with a larger number of children (4-5) are the most likely to default: 50% of them fail to repay their loan on time, although this group is very small. After them come clients with 1-3 children who are unmarried (11.5%) or in a civil partnership (11.3%).
- Clients with average income have the highest rate of failing to repay their loan on time (8.4%).
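- The hypothesis that clients taking money for a basic necessity are less likely to repay the loan is not supported if home owning is counted as a basic necessity: loans for home owning have the lowest default rate (6.9%).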
- Clients who take a loan for a vehicle purchase or education are less likely to repay it on time, especially clients with average income (about 9.6% in both categories) and clients with below average income borrowing for a vehicle purchase (9.8%).