## Analyzing borrowers’ risk of defaulting

The project is to prepare a simple report for a bank’s loan division. We will find out whether a customer’s marital status and number of children have an impact on whether they default on a loan. We will use existing data that the bank already holds on customers’ creditworthiness.

### Open the data file and have a look at the general information. 

In [1]:
import pandas as pd 
data = pd.read_csv('credit_scoring_eng.csv')
data.info()
#print (data.head(15))
print (data.sample(15))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB
       children  days_employed  dob_years            education  education_id  \
2692          1    -131.876565         40  secondary education             1   
2702       

## Conclusion

- Variable types look fine 

## Observations: 
- **days_employed**: why do we have a positive value while all the rest are negative?
- **education**: there are capitalization differences between values, e.g. Secondary Education, secondary education & SECONDARY EDUCATION. Also, BACHELOR'S DEGREE & bachelor's degree.
- **total_income**: we observe a NaN for a retiree. Do we want to keep it as is?
- **debt**: the flag for whether a customer defaulted on a loan is usable, but why is it not more qualitative? Are all defaulters the same?
- **purpose**: many category values mean the same thing, such as 'buying my own car', 'purchase of a car', 'to own a car' or 'cars'.
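
The first two observations can be quantified rather than eyeballed. The sketch below uses a toy frame with made-up values (the name `sample` and its contents are assumptions, not rows from the bank's file); in the notebook the same calls would run on `data`.

```python
import pandas as pd

# Toy rows with hypothetical values that mimic the issues listed above.
sample = pd.DataFrame({
    'days_employed': [-131.9, 340266.1, -4024.8, None],
    'education': ['Secondary Education', 'secondary education',
                  'SECONDARY EDUCATION', "BACHELOR'S DEGREE"],
})

# How many values are positive vs. negative (NaN rows are excluded)?
valid_days = sample['days_employed'].dropna()
sign_counts = (valid_days > 0).map({True: 'positive', False: 'negative'}).value_counts()
print(sign_counts)

# How many distinct spellings remain once case is ignored?
raw_levels = sample['education'].nunique()
normalized_levels = sample['education'].str.lower().nunique()
print(raw_levels, normalized_levels)
```

A large gap between the raw and the normalized count of levels is a quick signal that the column needs case normalization.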

### Data preprocessing

### Processing missing values

In [3]:
nulls = data.isnull().sum()

print (nulls)
# 'days_employed' & 'total_income' have nulls, why?






children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64


In [4]:

print(data[data['days_employed'].isnull()])

# no clear pattern here, e.g. these rows are not all retirees
# 'income_type' is not exclusively business / retiree but also 'employee'


       children  days_employed  dob_years            education  education_id  \
12            0            NaN         65  secondary education             1   
26            0            NaN         41  secondary education             1   
29            0            NaN         63  secondary education             1   
41            0            NaN         50  secondary education             1   
55            0            NaN         54  secondary education             1   
...         ...            ...        ...                  ...           ...   
21489         2            NaN         47  Secondary Education             1   
21495         1            NaN         50  secondary education             1   
21497         0            NaN         48    BACHELOR'S DEGREE             0   
21502         1            NaN         42  secondary education             1   
21510         2            NaN         28  secondary education             1   

           family_status  family_status

In [5]:
print(data[data['total_income'].isnull()])

# no distinct pattern, except that these rows appear to match the ones with missing 'days_employed' (the number of nulls is also the same)

       children  days_employed  dob_years            education  education_id  \
12            0            NaN         65  secondary education             1   
26            0            NaN         41  secondary education             1   
29            0            NaN         63  secondary education             1   
41            0            NaN         50  secondary education             1   
55            0            NaN         54  secondary education             1   
...         ...            ...        ...                  ...           ...   
21489         2            NaN         47  Secondary Education             1   
21495         1            NaN         50  secondary education             1   
21497         0            NaN         48    BACHELOR'S DEGREE             0   
21502         1            NaN         42  secondary education             1   
21510         2            NaN         28  secondary education             1   

           family_status  family_status
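
Whether the two columns are missing on exactly the same rows can be checked directly instead of by comparing printouts. This is a minimal sketch on a toy frame with hypothetical values (`toy` is an assumed name); in the notebook the comparison would use `data`.

```python
import pandas as pd

# Toy frame with hypothetical values; in the notebook this would be `data`.
toy = pd.DataFrame({
    'days_employed': [-100.0, None, -250.0, None],
    'total_income': [20000.0, None, 31000.0, None],
})

missing_days = toy['days_employed'].isnull()
missing_income = toy['total_income'].isnull()

# True only if every row is missing both values or neither of them.
same_rows = bool((missing_days == missing_income).all())
print(same_rows)
```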

In [6]:
#switch income_type of nulls to 'other' in order not to let the missing values ruin the averages later on

data.loc[data['total_income'].isnull(),'income_type'] = 'other'
print (data['income_type'].value_counts())

print(data[data['days_employed'].isnull()])

employee                       10014
business                        4577
retiree                         3443
other                           2174
civil servant                   1312
unemployed                         2
student                            1
paternity / maternity leave        1
entrepreneur                       1
Name: income_type, dtype: int64
       children  days_employed  dob_years            education  education_id  \
12            0            NaN         65  secondary education             1   
26            0            NaN         41  secondary education             1   
29            0            NaN         63  secondary education             1   
41            0            NaN         50  secondary education             1   
55            0            NaN         54  secondary education             1   
...         ...            ...        ...                  ...           ...   
21489         2            NaN         47  Secondary Education             

In [7]:
#fill nulls with 0 so we can work with the data
data['days_employed'] = data['days_employed'].fillna(0)
data['total_income'] = data['total_income'].fillna(0)

In [8]:
#test
print (data.isnull().sum())


children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64


In [9]:
print (data['dob_years'].describe())

#based on 'min', it looks like some age values are 0

print(data[(data['dob_years'] < 15) & (data['dob_years'] > 0)])

#there are no other implausibly low values, so the 0's are not "wrong" values but missing values
#they could be replaced; it's not critical since we do not plan to use this column further this time
#the ideal way would be to replace them with random values drawn from the distribution of the valid data
#that being said, the mean is a simpler approach that still improves on the current 0's
#and is more than enough for our specific analysis

print(data[data['dob_years'] == 0])


count    21525.000000
mean        43.293380
std         12.574584
min          0.000000
25%         33.000000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64
Empty DataFrame
Columns: [children, days_employed, dob_years, education, education_id, family_status, family_status_id, gender, income_type, debt, total_income, purpose]
Index: []
       children  days_employed  dob_years            education  education_id  \
99            0  346541.618895          0  Secondary Education             1   
149           0   -2664.273168          0  secondary education             1   
270           3   -1872.663186          0  secondary education             1   
578           0  397856.565013          0  secondary education             1   
1040          0   -1158.029561          0    bachelor's degree             0   
...         ...            ...        ...                  ...           ...   
19829         0       0.000000          0  secondary

In [10]:
#replacing values with mean:

data.loc[data['dob_years'] == 0,'dob_years'] = data['dob_years'].mean()

#check that the min value has changed:

print (data['dob_years'].describe())

count    21525.000000
mean        43.496522
std         12.218174
min         19.000000
25%         34.000000
50%         43.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64
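
One caveat about the replacement above: `data['dob_years'].mean()` is computed over the whole column, zeros included, so the fill value is the 43.29 reported by the first `describe()` and is pulled slightly downwards. A possible alternative, sketched here on a hypothetical series (`ages` and its values are assumptions), is to take the median of the valid ages only.

```python
import pandas as pd

# Hypothetical ages; 0 stands for a missing value, as in the notebook.
ages = pd.Series([0, 25, 35, 45, 0, 55])

# Compute the fill value from valid ages only.
valid = ages[ages > 0]
fill_value = valid.median()
naive_mean = ages.mean()  # pulled down by the zeros

# Keep valid ages, replace the zeros with the fill value.
cleaned = ages.where(ages > 0, fill_value)
print(fill_value, naive_mean, cleaned.min())
```

The median also stays an integer-like age more often than the mean does, which keeps the column easier to read.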


In [11]:
print (data['children'].describe())

print (data[data['children'] < 0])

# There are values of -1 children, which must be corrupted somehow

# No pattern in these rows and everything else in them is complete, so it would not be ideal to drop them

# Ideally we find a way to keep them from skewing the number-of-children analysis

# Either switch them to N/A, or leave them as -1 and make sure to take them out of affected calculations such as the mean

# For the default rate we can just ignore this group and it will not do any harm

# If we chose to use correlation functions, it would have been an issue

count    21525.000000
mean         0.538908
std          1.381587
min         -1.000000
25%          0.000000
50%          0.000000
75%          1.000000
max         20.000000
Name: children, dtype: float64
       children  days_employed  dob_years            education  education_id  \
291          -1   -4417.703588       46.0  secondary education             1   
705          -1    -902.084528       50.0  secondary education             1   
742          -1   -3174.456205       57.0  secondary education             1   
800          -1  349987.852217       54.0  secondary education             1   
941          -1       0.000000       57.0  Secondary Education             1   
1363         -1   -1195.264956       55.0  SECONDARY EDUCATION             1   
1929         -1   -1461.303336       38.0  secondary education             1   
2073         -1   -2539.761232       42.0  secondary education             1   
3814         -1   -3045.290443       26.0  Secondary Education           
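
The "switch to N/A" option mentioned in the comments could look roughly like this. It is a sketch on a hypothetical series (`children` here is a stand-alone toy, not the notebook column), relying on the fact that pandas skips missing values in `mean()` by default.

```python
import pandas as pd

# Hypothetical child counts, including the -1 value seen in the data.
children = pd.Series([0, 1, -1, 2, 0, -1])

# Negative counts become missing; mean() skips missing values by default.
children_clean = children.where(children >= 0)
print(children.mean(), children_clean.mean())
```

The rows themselves stay in the table, so the other columns remain available for the default-rate analysis.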

### Conclusion

- Two columns ('total_income' & 'days_employed') have some rows with null values, which were switched to 0.
- It is unclear whether the data is missing or should have been 0, so we chose to re-categorize the null total_income rows under a separate 'income_type' ('other'), so that nulls becoming zeros will not affect our calculations later on.
- We could potentially make a similar assumption for 'days_employed', but chose not to change its category since we do not plan to check correlation with it later on (unlike income level).
- 'dob_years' has missing values recorded as 0. We chose to replace them with the mean, which is more than enough for our current situation, in which we do not plan to use the column. We replaced them just in case we want to make some calculation later, so that the data is not skewed by the 0's.

### Fixing Capitalization

In [12]:
data['education'] = data['education'].str.lower()

print (data['education'].value_counts())



secondary education    15233
bachelor's degree       5260
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64


### Data type replacement

In [13]:
#the 'days_employed' values don't make sense; upon examination of the data, it looks like we need to flip the sign.
#In addition, to make it easier to work with, we add another column, 'years_employed'

data['years_employed'] = (data['days_employed'] * -1) / 365

print (data['years_employed'].describe())

print (data.sample(10))


#some values don't make sense - e.g. 366756.592599 days employed, which effectively means more than 1,000 years
#the question is whether it is time effective to investigate further or to delete the unrealistic values
#since this variable is not going to be used further in our specific analysis, it is not time effective to deal with it



count    21525.000000
mean      -155.284588
std        369.508940
min      -1100.699727
25%         -0.000000
50%          2.691868
75%          6.899093
max         50.380685
Name: years_employed, dtype: float64
       children  days_employed  dob_years            education  education_id  \
3197          0   -1598.597777       54.0  secondary education             1   
16617         0  355323.891557       48.0    bachelor's degree             0   
11584         1       0.000000       53.0  secondary education             1   
10153         0   -5916.181617       68.0  secondary education             1   
12643         2   -3822.733633       41.0  secondary education             1   
16537         0       0.000000       33.0  secondary education             1   
7159          1       0.000000       61.0  secondary education             1   
1447          0   -1980.262882       29.0  secondary education             1   
19473         0    -351.433552       64.0    bachelor's degree     
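
The negative mean of `years_employed` in the output above comes from the large positive `days_employed` values, which turn into large negative years once the sign is flipped. If the column were needed later, one option is to treat those rows as missing before converting. The sketch below rests on an assumption that is not confirmed by the data owner: negative values are real durations with a flipped sign, and positive values are corrupt.

```python
import pandas as pd

# Hypothetical values in the style of the column: negative durations,
# one implausibly large positive value, and a 0 from the earlier fillna.
days = pd.Series([-1598.6, 355323.9, 0.0, -5916.2])

# Assumption: negative values are real durations with a flipped sign,
# while positive values are corrupt and should be treated as missing.
suspicious = days > 0
years = (-days).where(~suspicious) / 365
print(suspicious.sum(), years.max())
```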

In [14]:
#debt should be boolean instead of int
#days_employed has negatives, is it supposed to be a date?
#ideally, maybe 'dob_years' should be a float instead of int (though we can't change the data now)
#should we turn gender, income_type, purpose to 'category' data type?

data['debt'] = data['debt'].astype('bool')
data['gender'] = data['gender'].astype('category')
data['education'] = data['education'].astype('category')
data['family_status'] = data['family_status'].astype('category')
data['income_type'] = data['income_type'].astype('category')
data['purpose'] = data['purpose'].astype('category')
data.info()
print (data.sample())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 13 columns):
children            21525 non-null int64
days_employed       21525 non-null float64
dob_years           21525 non-null float64
education           21525 non-null category
education_id        21525 non-null int64
family_status       21525 non-null category
family_status_id    21525 non-null int64
gender              21525 non-null category
income_type         21525 non-null category
debt                21525 non-null bool
total_income        21525 non-null float64
purpose             21525 non-null category
years_employed      21525 non-null float64
dtypes: bool(1), category(5), float64(4), int64(3)
memory usage: 1.3 MB
       children  days_employed  dob_years          education  education_id  \
10800         3   -4677.803335       42.0  bachelor's degree             0   

      family_status  family_status_id gender income_type   debt  total_income  \
10800       married       

### Processing duplicates

In [15]:
#checking whether duplicates exist
#print (data.duplicated().sum())
#they exist, possibly because the data is pulled from various systems that are not all synchronized
#preferably we remove those.

data = data.drop_duplicates().reset_index(drop=True)

print (data.duplicated().sum())
print (data.head(20))

#no duplicates now and the index has been reset



0
    children  days_employed  dob_years            education  education_id  \
0          1   -8437.673028       42.0    bachelor's degree             0   
1          1   -4024.803754       36.0  secondary education             1   
2          0   -5623.422610       33.0  secondary education             1   
3          3   -4124.747207       32.0  secondary education             1   
4          0  340266.072047       53.0  secondary education             1   
5          0    -926.185831       27.0    bachelor's degree             0   
6          0   -2879.202052       43.0    bachelor's degree             0   
7          0    -152.779569       50.0  secondary education             1   
8          2   -6929.865299       35.0    bachelor's degree             0   
9          0   -2188.756445       41.0  secondary education             1   
10         2   -4171.483647       36.0    bachelor's degree             0   
11         0    -792.701887       40.0  secondary education             1 

In [16]:
#there are only a few 'graduate degree' values, so we combine them with 'secondary education'
data['education'] = data['education'].replace('graduate degree','secondary education')

### Categorizing Data

In [17]:
#dealing with the 'purpose' column
temp = data['purpose'].value_counts()
print (temp)
#the results show we should use lemmatization or stemming to categorize properly
#or use manual rules, since there are not that many categories


wedding ceremony                            788
to have a wedding                           764
having a wedding                            764
real estate transactions                    674
buy commercial real estate                  658
housing transactions                        651
buying property for renting out             650
transactions with commercial real estate    649
purchase of the house                       646
housing                                     643
purchase of the house for my family         636
construction of own property                634
property                                    630
transactions with my real estate            626
building a real estate                      623
building a property                         618
purchase of my own house                    617
buy real estate                             617
housing renovation                          604
buy residential real estate                 604
buying my own car                       

In [18]:
# tried to use a lemmatizer but did not figure it out
#decided to stick with the simpler rule-based approach for now

# import nltk
# from nltk.stem import WordNetLemmatizer
# wordnet_lemma = WordNetLemmatizer()
# data['lemma_purpose'] = data['purpose']

# for word in data['purpose']:
#     words = nltk.word_tokenize(word)
#    #lemmas = ' '.join([wordnet_lemma.lemmatize(w, pos = 'n') for w in words])
#     data['lemma_purpose'] = data['lemma_purpose'].replace(word,' '.join([wordnet_lemma.lemmatize(w, pos = 'n') for w in words]))

# print (data['lemma_purpose'].value_counts())


queries = data['purpose']
data['new_purpose'] = data['purpose']

                                                      
for query in queries:
    if 'hous' in query:
        data['new_purpose'] = data['new_purpose'].replace(query,'Buying a house')
        
    elif 'wedd' in query:
        data['new_purpose'] = data['new_purpose'].replace(query,'Wedding') 
        
    elif 'car' in query:
        data['new_purpose'] = data['new_purpose'].replace(query,'Car')
        
    elif 'educ' in query or 'univer' in query:
        data['new_purpose'] = data['new_purpose'].replace(query,'Education')
        
    elif 'property' in query or 'estate' in query:
        data['new_purpose'] = data['new_purpose'].replace(query,'Investment')        
   
        
        
print (data['new_purpose'].value_counts())



Investment        6983
Car               4300
Education         4002
Buying a house    3797
Wedding           2316
Name: new_purpose, dtype: int64
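
The loop above calls `replace` on the whole column once per row. The same rules, in the same order, can be written as a function that is evaluated once per unique purpose and then mapped onto the column. This is a sketch, not the notebook's code: `categorize_purpose` and `mapping` are hypothetical names, and the sample strings are a handful of values from the counts printed earlier.

```python
import pandas as pd

def categorize_purpose(purpose):
    """Map a free-text purpose to a category, using the same rule order as above."""
    if 'hous' in purpose:
        return 'Buying a house'
    if 'wedd' in purpose:
        return 'Wedding'
    if 'car' in purpose:
        return 'Car'
    if 'educ' in purpose or 'univer' in purpose:
        return 'Education'
    if 'property' in purpose or 'estate' in purpose:
        return 'Investment'
    return purpose  # unmatched values are left as they are

# A few purpose strings taken from the value counts above.
purposes = pd.Series(['wedding ceremony', 'housing renovation',
                      'buying my own car', 'buy real estate', 'property'])

# Build the mapping once per unique value, then apply it to the column.
mapping = {p: categorize_purpose(p) for p in purposes.unique()}
new_purpose = purposes.map(mapping)
print(new_purpose.tolist())
```

Because the rule order is unchanged, this should produce the same grouping as the loop while doing far less work on a column of about 21,000 rows.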


### Analyse the data

- Is there a relation between having kids and repaying a loan on time?

In [20]:
#calculate the default rate for each number-of-children group

data_pivot = data.pivot_table(index=['children'], values=['debt'],  aggfunc=[len,sum])

data_pivot['default_rate'] = data_pivot['sum'] / data_pivot['len']

print (data_pivot)

#print (data_pivot.corr())



            len     sum default_rate
           debt    debt             
children                            
-1           47     1.0     0.021277
 0        14043  1063.0     0.075696
 1         4803   444.0     0.092442
 2         2049   194.0     0.094680
 3          330    27.0     0.081818
 4           41     4.0     0.097561
 5            9     0.0     0.000000
 20          76     8.0     0.105263


### Conclusion

Customers with children default more often than customers without: the default rate is about 7.6% with no children versus 9.2% to 9.8% with 1, 2 or 4 children. The relationship is not strictly increasing, though (3 children: 8.2%).
- Even though the '5 children' group has no defaults, a sample of 9 is too small to draw conclusions.
- Keep in mind that we ignore the '-1 children' group. The value does not make sense, but the rest of the data in those rows looks complete, so we chose not to drop them.
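
The same count / sum / rate table is built for every question in this section, so it could be wrapped in a small helper. The sketch below uses `groupby` with named aggregation instead of `pivot_table`; `default_rate_by` is a hypothetical name and the toy rows are made up.

```python
import pandas as pd

def default_rate_by(df, column):
    """Return loans, defaults and default rate for each value of `column`."""
    summary = df.groupby(column)['debt'].agg(loans='count', defaults='sum')
    summary['default_rate'] = summary['defaults'] / summary['loans']
    return summary

# Toy data with hypothetical values; `debt` is True when the customer defaulted.
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1],
    'debt': [False, False, False, True, True, False],
})
rates = default_rate_by(toy, 'children')
print(rates)
```

Keeping the `loans` column next to the rate makes small groups, such as the 9 customers with 5 children, easy to spot before reading anything into their rate.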


- Is there a relation between marital status and repaying a loan on time?

In [21]:
data_pivot = data.pivot_table(index=['family_status'], values=['debt'],  aggfunc=[len,sum])

data_pivot['default_rate'] = data_pivot['sum'] / data_pivot['len']

print (data_pivot)

                     len    sum default_rate
                    debt   debt             
family_status                               
civil partnership   4142  388.0     0.093675
divorced            1194   85.0     0.071189
married            12296  931.0     0.075716
unmarried           2809  274.0     0.097544
widow / widower      957   63.0     0.065831


### Conclusion

We do see a difference in default rate between marital statuses. Widows and widowers are the least likely to default on their loans. Next come the divorced and married groups, followed by civil partnership and unmarried.

Just to emphasize, an unmarried person has an almost 50% higher chance of defaulting on a loan than a widow or widower.
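
The "almost 50%" figure is a relative difference between two default rates. As a short worked example, with the rates copied from the table above and the widow / widower group chosen as the baseline (the variable names are illustrative only):

```python
# Default rates copied from the table above.
rates = {
    'civil partnership': 0.093675,
    'divorced': 0.071189,
    'married': 0.075716,
    'unmarried': 0.097544,
    'widow / widower': 0.065831,
}

baseline = rates['widow / widower']

# Relative difference of each group against the baseline, in percent.
relative = {group: (rate / baseline - 1) * 100 for group, rate in rates.items()}
print(round(relative['unmarried'], 1))
```

In absolute terms the gap between these two groups is about 3 percentage points, which is worth stating alongside the relative figure.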

- Is there a relation between income level and repaying a loan on time?

In [62]:
# added income level groups so we can calculate a default rate per group.
# Found some easier ways online to cut the data, but they required numpy, which we do not use, so I did it manually

#sorting the data and resetting the index so we can use it to split the rows into 10 groups
data = data.sort_values(by='total_income', ascending=True).reset_index(drop=True)

#length of the observations
length = data['total_income'].size

#the size of each group (one tenth of the observations)
tenth = length / 10 

#nested for loops - it was a challenge for me to understand how to address a row by its index
#note: the data is sorted ascending, so level 0 is the highest-income tenth and level 9 the lowest
for i, row in data.iterrows():
    
    for z in range(10):
        if i >= (length - (z+1)*tenth):
            data.at[i, 'income_level'] = z
            break
    
# at first I did it manually just to see that the logic works:

#     if i > (length - tenth):
#         data.at[i, 'income_level'] = '10%'
#         #row['income_level'] = '10%'
        
#     elif i > (length - 2*tenth):
#         data.at[i, 'income_level'] = '20%'
#         #row['income_level'] = '20%'
        
#     elif i > (length - 3*tenth):
#         data.at[i, 'income_level'] = '30%'
#         #row['income_level'] = '30%'
        
#     elif i > (length - 4*tenth):
#         data.at[i, 'income_level'] = '40%'
#         #row['income_level'] = '40%'
        
#     elif i > (length - 5*tenth):
#         data.at[i, 'income_level'] = '50%'
#         #row['income_level'] = '50%'
        
#     else: data.at[i, 'income_level'] = '0'
        
        
#checking a couple of times to see that it works (after checking head & tail as well):

print (data.sample(10))
        

       children  days_employed  dob_years            education  education_id  \
3602          0   -5804.653447   40.00000    primary education             3   
2695          0  396597.151698   52.00000    bachelor's degree             0   
7954          0   -2163.226637   41.00000  secondary education             1   
7762          0  328923.448222   57.00000  secondary education             1   
8114          1   -1285.872512   35.00000  secondary education             1   
13069         0    -183.705474   27.00000  secondary education             1   
5454          1  339310.871728   63.00000    bachelor's degree             0   
1966          0       0.000000   43.29338  secondary education             1   
19119         1  356498.450958   48.00000  secondary education             1   
3128          0   -1147.059848   39.00000  secondary education             1   

           family_status  family_status_id gender income_type   debt  \
3602   civil partnership                 1     
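
For comparison, pandas itself offers `pd.qcut` for splitting a column into equal-sized groups, with no separate numpy import needed. The sketch below is an alternative to the loop, not a drop-in replacement: the incomes are hypothetical, and the levels run in the opposite direction (0 is the lowest-income tenth here). Ranking first is a precaution, since the many rows filled with 0 would otherwise produce duplicate bin edges.

```python
import pandas as pd

# Hypothetical incomes, including zeros standing in for filled-in nulls.
income = pd.Series([0, 0, 0, 12000, 15000, 18000, 21000, 24000, 30000, 45000,
                    16000, 17000, 19000, 22000, 26000, 28000, 33000, 38000, 41000, 52000])

# rank(method='first') breaks ties, so repeated values (the zeros) cannot
# produce duplicate bin edges; labels=False returns the bin number 0..9.
income_level = pd.qcut(income.rank(method='first'), 10, labels=False)

# Unlike the loop above, level 0 is the lowest-income tenth here.
print(income_level.value_counts().sort_index())
```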

In [63]:
data_pivot = data.pivot_table(index=['income_level'], values=['debt'],  aggfunc=[len,sum])

data_pivot['default_rate'] = data_pivot['sum'] / data_pivot['len']

print (data_pivot)

# print(data_pivot.corr())

               len    sum default_rate
              debt   debt             
income_level                          
0             2139  151.0     0.070594
1             2140  148.0     0.069159
2             2140  179.0     0.083645
3             2140  191.0     0.089252
4             2140  193.0     0.090187
5             2139  180.0     0.084151
6             2140  182.0     0.085047
7             2140  177.0     0.082710
8             2140  165.0     0.077103
9             2140  175.0     0.081776


### Conclusion

Income level does appear to affect the default rate, up to a point. Note that `income_level` 0 is the highest-income tenth and 9 the lowest. The two highest-income groups were the least likely to default (about 7%), while the middle groups had the highest rates (about 9%). The lowest group (8.2%) also defaulted less often than the middle groups, but it consists mostly of rows whose missing income was filled with 0, so it says little about genuinely low earners.
My hypothesis is that, generally speaking, the higher your income, the less likely you are to default on a loan. What we see in the data for the lower income tiers could be the result of much higher barriers to getting a loan in the first place (the historical data could be biased in advance).

- How do different loan purposes affect on-time repayment of the loan?

In [65]:
data_pivot = data.pivot_table(index=['new_purpose'], values=['debt'],  aggfunc=[len,sum])

data_pivot['default_rate'] = data_pivot['sum'] / data_pivot['len']

print (data_pivot)

                 len    sum default_rate
                debt   debt             
new_purpose                             
Buying a house  3797  256.0     0.067422
Car             4300  403.0     0.093721
Education       4002  370.0     0.092454
Investment      6983  526.0     0.075326
Wedding         2316  186.0     0.080311


### Conclusion

It is evident from the data that loan purpose is related to the likelihood of default. For instance, a person who takes a loan to buy a car or pay for an education is over 30% more likely to default (about 9.3% versus 6.7%) than someone who takes a loan to buy a house.

### General conclusion

The data shows that number of children, marital status, income level and loan purpose are all related to whether a person is likely to repay a loan on time. That being said, this is just an initial examination of those features, and a real model has to take into account the following:
1. More complex scenarios that account for the effect of one parameter on another, e.g. a younger widow vs. an older widow, or how the effect of having more children differs by location.
2. Different types of default. It could be that the higher your income, the lower your chance of defaulting, but once you do default it puts the whole bank at risk. So, do banks lose 10% of the original loan value, or maybe 75%?
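
A first step towards point 1 would be to look at two features at once. The sketch below builds a two-way default-rate table on toy data; the `children_group` column and all values are hypothetical, and with real data each cell would also need a large enough sample to be meaningful.

```python
import pandas as pd

# Toy data with hypothetical values; `debt` is 1 for a default, as in the raw file.
toy = pd.DataFrame({
    'family_status': ['married', 'married', 'married', 'married',
                      'unmarried', 'unmarried', 'unmarried', 'unmarried'],
    'children_group': ['no children', 'no children', 'has children', 'has children',
                       'no children', 'no children', 'has children', 'has children'],
    'debt': [0, 0, 0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 column is the share of defaults in each cell.
two_way = toy.pivot_table(index='family_status', columns='children_group',
                          values='debt', aggfunc='mean')
print(two_way)
```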
