# Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out if a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building the **credit score** of a potential customer. The **credit score** is used to evaluate the ability of a potential borrower to repay their loan.


[Before you dive into analyzing your data, explain the purposes of the project and hypotheses you're going to test.] 

**In this project I will test four hypotheses about factors that might affect whether a borrower repays a loan to the bank on time. The hypotheses are:**
1. There is a link between the number of children and repaying the loan on time.
2. There is a link between marital status and repaying the loan on time.
3. There is a link between income level and repaying the loan on time.
4. There is a link between the purpose of the loan and repaying the loan on time.

## Open the data file and have a look at the general information. 

[Start with importing the libraries and loading the data. You may realise that you need additional libraries as you go, which is totally fine - just make sure to update this section when you do.]

In [3]:
import pandas as pd 
# Loading all the libraries





In [4]:
data = pd.read_csv('/datasets/credit_scoring_eng.csv')# Load the data

## Task 1. Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan

[Now let's explore our data. You'll want to see how many columns and rows it has, look at a few rows to check for potential issues with the data.]

In [5]:
data.shape# Let's see how many rows and columns our dataset has



(21525, 12)

In [6]:
data.head(15)# let's print the first N rows



Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


[Describe what you see and notice in your printed data sample. Are there any issues that may need further investigation and changes?]  
**Most of the data looks fine, but `days_employed` (negative and implausibly large values) and `total_income` need further investigation. The `education` column also mixes upper and lower case.**

In [7]:
data.info()# Get info on data


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


[Are there missing values across all columns or just a few? Briefly describe what you see in 1-2 sentences.]

**The `days_employed` and `total_income` columns contain missing values that need to be explored: each has 19351 non-null values, while every other column has all 21525.**

In [8]:

data[data['days_employed'].isnull()].head(10)


# Let's look at the first rows of the table filtered on the first column with missing data



Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
65,0,,21,secondary education,1,unmarried,4,M,business,0,,transactions with commercial real estate
67,0,,52,bachelor's degree,0,married,0,F,retiree,0,,purchase of the house for my family
72,1,,32,bachelor's degree,0,married,0,M,civil servant,0,,transactions with commercial real estate
82,2,,50,bachelor's degree,0,married,0,F,employee,0,,housing
83,0,,52,secondary education,1,married,0,M,employee,0,,housing


In [9]:
data[(data['days_employed'].isnull()) & (data['total_income'].isnull())].shape

# Let's apply multiple conditions for filtering data and look at the number of rows in the filtered table.



(2174, 12)

**Intermediate conclusion**

[Does the number of rows in the filtered table match the number of missing values? What conclusion can we make from this?] 
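**Yes: 2174 rows are missing both `days_employed` and `total_income`, which equals the number of missing values in each column (21525 - 19351 = 2174). The two columns are therefore missing in the same rows.**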

[Calculate the percentage of the missing values compared to the whole dataset. Is it a considerably large piece of data? If so, you may want to fill the missing values. To do that, firstly we should consider whether the missing data could be due to the specific client characteristic, such as employment type or something else. You will need to decide which characteristic *you* think might be the reason. Secondly, we should check whether there's any dependence missing values have on the value of other indicators with the columns with identified specific client characteristic.]

[Explain your next steps and how they correlate with the conclusions you made so far.]
**I will fill in the missing values where required, then check for duplicates and delete them.**
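
The share of missing values asked about above can be computed directly. A minimal sketch, assuming a small made-up frame (`sample`) instead of the real dataset; only the column names come from the data description:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are invented for illustration).
sample = pd.DataFrame({
    'days_employed': [-100.0, None, 250.0, None],
    'total_income': [20000.0, None, 31000.0, None],
    'income_type': ['employee', 'retiree', 'business', 'employee'],
})

# Fraction of missing values in each column.
missing_share = sample.isna().mean()

# Rows where both columns are missing at the same time.
both_missing = sample['days_employed'].isna() & sample['total_income'].isna()

print(missing_share)
print(f"Rows missing both columns: {both_missing.sum()} ({both_missing.mean():.1%})")
```

On the real data the same calculation corresponds to 2174 / 21525, or roughly 10% of the rows.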

In [10]:
data_missing = data[data['days_employed'].isnull()]# Let's investigate clients who do not have data on identified characteristic and the column with the missing values



In [11]:
data_missing['income_type'].value_counts(normalize=True)# Checking distribution



employee         0.508280
business         0.233671
retiree          0.189972
civil servant    0.067617
entrepreneur     0.000460
Name: income_type, dtype: float64

[Describe your findings here.]

**The `employee` income type accounts for the largest share of the missing values (about 51%).**

[Propose your ideas on why you think the values might be missing. Do you think they are missing randomly or there are any patterns?] 
**I think it may not be random: perhaps people who don't work have no income. This needs to be verified.**

[Let's start checking whether the missing values are random.]

In [12]:
data['income_type'].value_counts()# Checking the distribution in the whole dataset



employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
entrepreneur                       2
unemployed                         2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

**Intermediate conclusion**

[Is the distribution in the original dataset similar to the distribution of the filtered table? What does that mean for us?]

**The two distributions look fairly similar, so the missing values may be random. I will keep checking this in the next few steps.**

[If you think we can't make any conclusions yet, let's investigate our dataset further. Let's think about other reasons that could lead to data missing and check if we can find any patterns that may lead us to thinking that the missing values are not random. Because this is your work, this section is optional.]

**We can conclude that the distributions are quite similar, which means that missing values are likely to be random**

**Intermediate conclusion**

[Can we finally confirm that missing values are accidental? Check for anything else that you think might be important here.]

**it looks like we can confirm that**

In [13]:
data['income_type'].value_counts()/len(data['income_type'])# Checking for other patterns - explain which

employee                       0.516562
business                       0.236237
retiree                        0.179141
civil servant                  0.067782
entrepreneur                   0.000093
unemployed                     0.000093
student                        0.000046
paternity / maternity leave    0.000046
Name: income_type, dtype: float64

**Conclusions**

[Did you find any patterns? How did you come to this conclusion?] 

**The distribution of income types in the whole dataset is very similar to its distribution among the rows with missing values.**
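
To make this comparison easier to read, the two distributions can be placed side by side. The sketch below uses invented toy data; `sample`, `overall`, `missing` and `comparison` are illustrative names, not part of the original notebook:

```python
import pandas as pd

# Toy data (invented); `income_type` and `days_employed` mirror the real column names.
sample = pd.DataFrame({
    'income_type': ['employee', 'employee', 'business', 'retiree', 'employee', 'business'],
    'days_employed': [None, 120.0, None, 300.0, 45.0, 80.0],
})

# Share of each income type in the whole table and among rows with missing values.
overall = sample['income_type'].value_counts(normalize=True)
missing = sample.loc[sample['days_employed'].isna(), 'income_type'].value_counts(normalize=True)

# Align both distributions on income type; types absent from one side get 0.
comparison = pd.concat([overall, missing], axis=1, keys=['overall', 'missing']).fillna(0)
comparison['difference'] = comparison['missing'] - comparison['overall']
print(comparison)
```

In the notebook's own output the gaps are small, about one percentage point at most (for example, retirees make up 0.190 of the rows with missing values versus 0.179 overall).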

[Explain how you will address the missing values. Consider the categories in which values are missing.]

**The missing values are in `total_income` and `days_employed`, possibly because clients who don't work have no income. I will fill them with a typical value for the client's category.**

[Briefly plan your next steps for transforming data. You will probably need to address different types of issues: duplicates, different registers, incorrect artifacts, and missing values.]

**Duplicates need to be deleted, missing values replaced with appropriate values, upper and lower case unified, negative values in the `days_employed` column corrected, and the data categorized where needed.**

## Data transformation

[Let's go through each column to see what issues we may have in them.]

[Begin with removing duplicates and fixing educational information if required.]

In [14]:
data['education'].value_counts()# Let's see all values in education column to check if and what spellings will need to be fixed


secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
GRADUATE DEGREE            1
Graduate Degree            1
Name: education, dtype: int64

In [15]:
data['education'] = data['education'].str.lower()# Fix the registers if required


In [16]:
data['education'].value_counts()# Checking all the values in the column to make sure we fixed them



secondary education    15233
bachelor's degree       5260
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64

[Check the data the `children` column]

In [17]:
data['children'].value_counts()# Let's see the distribution of values in the `children` column


 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

[Are there any strange things in the column? If yes, how high is the percentage of problematic data? How could they have occurred? Make a decision on what you will do with this data and explain you reasoning.]

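**A family cannot have -1 children, and 20 children seems like a lot. Both are probably data-entry errors. Together they account for 123 rows (about 0.6% of the data), so I will drop them.**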
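
The percentage of problematic values can be worked out from the counts printed above. A short sketch; treating -1 and 20 as the suspicious values is this analysis's assumption:

```python
# Counts copied from the `value_counts()` output above.
children_counts = {0: 14149, 1: 4818, 2: 2055, 3: 330, 20: 76, -1: 47, 4: 41, 5: 9}

# Values assumed to be errors in this analysis.
suspicious_values = [-1, 20]

suspicious = sum(children_counts[value] for value in suspicious_values)
total = sum(children_counts.values())
share = suspicious / total

print(f"{suspicious} suspicious rows out of {total} ({share:.2%})")
```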

In [18]:
data.drop(data[data['children'] == -1].index, inplace=True)# [fix the data based on your decision]
data.drop(data[data['children'] == 20].index, inplace=True)

In [19]:
data['children'].value_counts()# Checking the `children` column again to make sure it's all fixed



0    14149
1     4818
2     2055
3      330
4       41
5        9
Name: children, dtype: int64

[Check the data in the `days_employed` column. Firstly think about what kind of issues could there be and what you may want to check and how you will do it.] **`days_employed` shouldn't be negative; maybe it's an error.**

In [20]:
data['days_employed'].value_counts() # Find problematic data in `days_employed`, if they exist, and calculate the percentage


-327.685916     1
-1580.622577    1
-4122.460569    1
-2828.237691    1
-2636.090517    1
               ..
-201.643573     1
-7120.517564    1
-2146.884040    1
-881.454684     1
-3382.113891    1
Name: days_employed, Length: 19240, dtype: int64

[If the amount of problematic data is high, it could've been due to some technical issues. We may probably want to propose the most obvious reason why it could've happened and what the correct data might've been, as we can't drop these problematic rows.]

In [21]:
data[data['days_employed'] < 0].shape[0]# Address the problematic values, if they exist



15809

In [22]:
data['days_employed'] = data['days_employed'].abs()

In [23]:
data['days_employed'].value_counts()

142.276217     1
402.974768     1
3601.450735    1
1849.622944    1
5849.845620    1
              ..
3951.652030    1
847.043824     1
1745.884477    1
1801.512744    1
1636.419775    1
Name: days_employed, Length: 19240, dtype: int64

In [24]:
data.drop(data[data['days_employed'] >= 300000].index, inplace=True)

In [25]:
data['days_employed'].describe()

count    15809.000000
mean      2353.653992
std       2304.559799
min         24.141633
25%        756.819066
50%       1630.443591
75%       3157.754940
max      18388.949901
Name: days_employed, dtype: float64
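
The checks on `days_employed` can be condensed into a few lines. This is only a sketch on invented toy values (`days` is a hypothetical stand-in for the real column); the 300000 threshold is the one used in the cell above:

```python
import pandas as pd

# Toy values (invented) imitating the three patterns seen in `days_employed`:
# negative numbers, implausibly large numbers, and missing entries.
days = pd.Series([-8437.67, -4024.80, 340266.07, None, -926.19])

# Comparisons with NaN evaluate to False, so missing entries are not counted.
negative_share = (days < 0).mean()
too_large_share = (days >= 300000).mean()

print(f"negative: {negative_share:.0%}, too large: {too_large_share:.0%}")

# Converting to years shows why such values cannot be real work experience.
print(f"340266 days is about {340266 / 365:.0f} years")
```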

[Let's now look at the client's age and whether there are any issues there. Again, think about what data can be strange in this column, i.e. what cannot be someone's age.]

In [26]:
data['dob_years'].value_counts()# Check the `dob_years` for suspicious values and count the percentage



35    613
41    597
40    596
34    594
38    587
42    583
33    575
39    568
31    555
36    550
29    543
30    536
44    533
37    526
48    517
32    503
43    501
28    501
27    487
45    483
49    476
47    467
46    456
50    449
26    404
52    388
51    373
25    356
53    352
54    332
56    299
55    280
24    263
58    253
23    252
57    245
59    187
22    182
61    140
60    133
62    117
21    110
64     86
0      83
63     77
65     58
20     51
66     44
67     35
68     19
19     14
70     11
69     11
71     10
72      5
74      2
73      2
75      1
Name: dob_years, dtype: int64

[Decide what you'll do with the problematic values and explain why.] **No one can be 0 years old, so these rows need to be deleted.**

In [27]:
data.drop(data[data['dob_years'] == 0].index, inplace = True)# Address the issues in the `dob_years` column, if they exist


In [28]:
data['dob_years'].value_counts()# Check the result - make sure it's fixed


35    613
41    597
40    596
34    594
38    587
42    583
33    575
39    568
31    555
36    550
29    543
30    536
44    533
37    526
48    517
32    503
28    501
43    501
27    487
45    483
49    476
47    467
46    456
50    449
26    404
52    388
51    373
25    356
53    352
54    332
56    299
55    280
24    263
58    253
23    252
57    245
59    187
22    182
61    140
60    133
62    117
21    110
64     86
63     77
65     58
20     51
66     44
67     35
68     19
19     14
70     11
69     11
71     10
72      5
74      2
73      2
75      1
Name: dob_years, dtype: int64

[Now let's check the `family_status` column. See what kind of values there are and what problems you may need to address.]

In [29]:
data['family_status'].value_counts()# Let's see the values for the column



married              10395
civil partnership     3563
unmarried             2476
divorced               983
widow / widower        471
Name: family_status, dtype: int64

[Now let's check the `gender` column. See what kind of values there are and what problems you may need to address] **`XNA` is not a defined gender.**

In [31]:
data['gender'].value_counts()# Let's see the values in the column

F      11304
M       6583
XNA        1
Name: gender, dtype: int64

In [32]:
data.drop(data[data['gender'] == 'XNA'].index, inplace =True)# Address the problematic values, if they exist

In [33]:
data['gender'].value_counts()# Check the result - make sure it's fixed



F    11304
M     6583
Name: gender, dtype: int64

[Now let's check the `income_type` column. See what kind of values there are and what problems you may need to address]

In [34]:
data['income_type'].value_counts()# Let's see the values in the column

employee                       10996
business                        5033
civil servant                   1447
retiree                          407
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

In [35]:
# Address the problematic values, if they exist
#no problem

[Now let's see if we have any duplicates in our data. If we do, you'll need to decide what you will do with them and explain why.] **There are 71 duplicates that need to be deleted.**

In [36]:
data.duplicated().sum()# Checking duplicates



71

In [37]:
data = data.drop_duplicates().reset_index(drop=True)# Address the duplicates, if they exist

In [38]:
data.duplicated().sum()# Last check whether we have any duplicates


0

In [39]:
data.shape# Check the size of the dataset that you now have after your first manipulations with it

(17816, 12)

[Describe your new dataset: briefly say what's changed and what's the percentage of the changes, if there were any.]
**We started with 21525 rows and are left with 17816, so about 17% of the rows were removed. The number of columns (12) has not changed.**
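
The size of the change can be checked with simple arithmetic on the two `data.shape` outputs shown above:

```python
rows_before = 21525   # from the first `data.shape`
rows_after = 17816    # from `data.shape` after the first clean-up steps

removed = rows_before - rows_after
print(f"Removed {removed} rows ({removed / rows_before:.1%} of the original data)")
```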

## Working with missing values

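[To speed up working with some data, you may want to work with dictionaries for some values, where IDs are provided. Explain why and which dictionaries you will work with.] **I will use two dictionaries, one mapping `education` to `education_id` and one mapping `family_status` to `family_status_id`, because they will help me when working with missing values.**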

In [40]:
education_total = data[['education', 'education_id']]
education_total = education_total.drop_duplicates().reset_index(drop=True)# Find the dictionaries

In [41]:
family_total = data[['family_status', 'family_status_id']]
family_total = family_total.drop_duplicates().reset_index(drop = True)
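
A lookup table like the ones built above can also be turned into a plain Python dictionary when a quick id-to-name translation is needed. A minimal sketch on invented rows; `education_lookup` and `id_to_education` are hypothetical names:

```python
import pandas as pd

# Toy rows (invented) with the same column pair as the real data.
sample = pd.DataFrame({
    'education': ['secondary education', "bachelor's degree", 'secondary education'],
    'education_id': [1, 0, 1],
})

# One row per distinct (name, id) pair, as in the cells above.
education_lookup = sample[['education', 'education_id']].drop_duplicates().reset_index(drop=True)

# Convert the lookup table into a dict: id -> name.
id_to_education = education_lookup.set_index('education_id')['education'].to_dict()
print(id_to_education)
```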


### Restoring missing values in `total_income`

[Briefly state which column(s) have values missing that you need to address. Explain how you will fix them.]


[Start with addressing total income missing values. Create and age category for clients. Create a new column with the age category. This strategy can help with calculating values for the total income.]


In [42]:
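# Let's write a function that calculates the age category
def age_total(age):
    """Return the age category for a client's age in years."""
    try:
        if age <= 18:
            return 'young'
        elif age <= 60:
            return 'adult'
        else:
            return 'elderly'
    except TypeError:
        # Non-numeric input cannot be compared with a number
        return 0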

    

In [43]:
age_total(10)# Test if the function works


'young'

In [44]:
data['age_total'] = data['dob_years'].apply(age_total)# Creating new column based on function



In [45]:
data['age_total'].value_counts()# Checking the values in the new column



adult      17212
elderly      604
Name: age_total, dtype: int64

[Think about the factors on which income usually depends. Eventually, you will want to find out whether you should use mean or median values for replacing missing values. To make this decision you will probably want to look at the distribution of the factors you identified as impacting one's income.]

[Create a table that only has data without missing values. This data will be used to restore the missing values.]

In [46]:
table_new = data[data['age_total'].notna()]
table_new# Create a table without missing values and print a few of its rows to make sure it looks fine

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_total
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,adult
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,adult
2,0,5623.422610,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,adult
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,adult
4,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.170,purchase of the house,adult
...,...,...,...,...,...,...,...,...,...,...,...,...,...
17811,1,2351.431934,37,graduate degree,4,divorced,3,M,employee,0,18551.846,buy commercial real estate,adult
17812,1,4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions,adult
17813,1,2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property,adult
17814,3,3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car,adult


In [47]:
table_new.pivot_table(index = 'income_type', columns='age_total', values='total_income', aggfunc='mean')# Look at the mean values for income based on your identified factors

age_total,adult,elderly
income_type,Unnamed: 1_level_1,Unnamed: 2_level_1
business,32422.304637,32500.258163
civil servant,27230.498861,30226.86263
employee,25780.286084,27645.800965
entrepreneur,79866.103,
paternity / maternity leave,8612.661,
student,15712.26,


In [48]:
table_new.pivot_table(index = 'income_type', columns='age_total', values='total_income', aggfunc='median')# Look at the median values for income based on your identified factors


age_total,adult,elderly
income_type,Unnamed: 1_level_1,Unnamed: 2_level_1
business,27560.3745,29171.989
civil servant,24067.224,24623.8775
employee,22795.7425,23521.586
entrepreneur,79866.103,
paternity / maternity leave,8612.661,
student,15712.26,


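[Make a decision on what characteristics define income most and whether you will use a median or a mean. Explain why you made this decision] **Income depends mostly on income type, age and education. In every large group the mean is noticeably higher than the median (for example, 32422 vs. 27560 for adults with business income), which points to a skewed distribution with outliers, so I will use the median to fill in the missing values.**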


In [49]:
# Median income for every (age category, income type) pair, split by education
pivot_median = table_new.pivot_table(index=['age_total', 'income_type'],
                                     columns='education',
                                     values='total_income',
                                     aggfunc='median')


# Function that we will use for filling in missing values
def median_fun(x):
    education = x['education']
    age_total = x['age_total']
    income_type = x['income_type']
    try:
        return pivot_median[education][age_total][income_type]
    except KeyError:
        # No median is available for this combination of categories
        return 'error'
        
        

In [50]:
pivot_median['secondary education']['adult']['employee']# Check if it works


21853.63

In [51]:
data['median_fun'] = data.apply(median_fun, axis = 1)# Apply it to every row


In [52]:
data[data['median_fun'] == 'error']# Check if we got any errors


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_total,median_fun
11,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding,elderly,error
25,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate,elderly,error
48,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding,adult,error
59,0,,52,bachelor's degree,0,married,0,F,retiree,0,,purchase of the house for my family,adult,error
122,0,,62,secondary education,1,married,0,M,retiree,0,,building a property,elderly,error
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17638,0,,59,secondary education,1,unmarried,4,F,retiree,0,,construction of own property,adult,error
17643,0,,49,secondary education,1,married,0,F,retiree,0,,buying property for renting out,adult,error
17652,0,,56,secondary education,1,married,0,F,retiree,0,,real estate transactions,adult,error
17723,0,,65,secondary education,1,married,0,F,retiree,0,,purchase of my own house,elderly,error


In [53]:
data.drop(data[data['median_fun'] == 'error'].index,inplace=True)

In [54]:
data[data['median_fun'] == 'error']

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_total,median_fun


[If you've came across errors in preparing the values for missing data, it probably means there's something special about the data for the category. Give it some thought - you may want to fix some things manually, if there's enough data to find medians/means.]


In [55]:
data['total_income'] = data['total_income'].fillna(data['median_fun'])# Replacing missing values if there are any errors


[When you think you've finished with `total_income`, check that the total number of values in this column matches the number of values in other ones.]

In [56]:
data['total_income']# Checking the number of entries in the columns



0        40620.102
1        17932.802
2        23341.752
3        42820.568
4        40922.170
           ...    
17811    18551.846
17812    35966.698
17813    14347.610
17814    39054.888
17815    13127.587
Name: total_income, Length: 17436, dtype: float64
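
The pivot-table lookup above fills gaps with group medians. For comparison, a more compact way to express the same idea is `groupby(...).transform('median')`. This is an alternative technique, not what the notebook does, and the sketch below groups by `income_type` only, on invented toy data:

```python
import pandas as pd

# Toy data (invented); column names follow the dataset description.
sample = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'business', 'business'],
    'total_income': [20000.0, 30000.0, None, 40000.0, None],
})

# Median income of each group, broadcast back to every row of that group.
group_median = sample.groupby('income_type')['total_income'].transform('median')

# Fill only the missing entries; existing values stay untouched.
sample['total_income'] = sample['total_income'].fillna(group_median)
print(sample)
```

Because `transform` returns a Series aligned with the original index, no helper column or row-wise `apply` is needed.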

###  Restoring values in `days_employed`

[Think about the parameters that may help you restore the missing values in this column. Eventually, you will want to find out whether you should use mean or median values for replacing missing values. You will probably conduct a research similar to the one you've done when restoring data in a previous column.]

In [57]:
table_new.pivot_table(index = 'income_type', columns='age_total', values='days_employed', aggfunc='median')
# Distribution of `days_employed` medians based on your identified parameters




age_total,adult,elderly
income_type,Unnamed: 1_level_1,Unnamed: 2_level_1
business,1532.0672,2358.485102
civil servant,2666.58336,3217.780761
employee,1561.231462,2641.207922
entrepreneur,520.848083,
paternity / maternity leave,3296.759962,
student,578.751554,


In [58]:
table_new.pivot_table(index = 'income_type', columns='age_total', values='days_employed', aggfunc='mean')# Distribution of `days_employed` means based on your identified parameters

age_total,adult,elderly
income_type,Unnamed: 1_level_1,Unnamed: 2_level_1
business,2075.640637,3691.779433
civil servant,3360.594254,4252.203735
employee,2289.671205,3869.714995
entrepreneur,520.848083,
paternity / maternity leave,3296.759962,
student,578.751554,


[Decide what you will use: means or medians. Explain why.] **I will use the median, because it is less sensitive to extreme values: the means are well above the medians (for example, 2290 vs. 1561 days for adult employees).**
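
How differently the two statistics react to extreme values can be seen in a tiny example. The numbers below are invented for illustration and are not taken from the dataset:

```python
from statistics import mean, median

# Invented values: most are close together, one is an extreme outlier.
days = [500, 800, 1200, 1500, 18000]

print(f"mean:   {mean(days):.0f}")
print(f"median: {median(days):.0f}")
```

A single large value pulls the mean far above the typical observation, while the median stays at the middle value.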

In [59]:
# Median of `days_employed` for each income type
mix_days_income = data.groupby('income_type')['days_employed'].median()


# Function that returns the median based on the identified parameter
def filling(income_type):
    try:
        return mix_days_income[income_type]
    except KeyError:
        return 'error'


In [60]:
filling('employee')# Check that the function works



1573.791064067419

In [61]:
data['median_new'] = data['income_type'].apply(filling)# Apply function to the income_type



In [62]:
data['income_type'].value_counts()# Check if function worked



employee                       10961
business                        5026
civil servant                   1445
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

In [63]:
data['days_employed'] = data['days_employed'].fillna(data['median_new']) # Replacing missing values with the median for the income type



[When you think you've finished with `total_income`, check that the total number of values in this column matches the number of values in other ones.]

In [64]:
data.isnull().sum()# Check the entries in all columns - make sure we fixed all missing values

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        1
purpose             0
age_total           0
median_fun          1
median_new          0
dtype: int64

## Categorization of data

[To answer the questions and test the hypotheses, you will want to work with categorized data. Look at the questions that were posed to you and that you should answer. Think about which of the data will need to be categorized to answer these questions. Below you will find a template through which you can work your way when categorizing data. The first step-by-step processing covers the text data; the second one addresses the numerical data that needs to be categorized. You can use both or none of the suggested instructions - it's up to you.]

[Despite of how you decide to address the categorization, make sure to provide clear explanation of why you made your decision. Remember: this is your work and you make all decisions in it.]


In [65]:
data['purpose'].value_counts()# Print the values for your selected data for categorization



wedding ceremony                            647
having a wedding                            618
to have a wedding                           614
real estate transactions                    545
buy commercial real estate                  541
housing transactions                        534
buying property for renting out             532
transactions with commercial real estate    527
purchase of my own house                    526
housing                                     526
property                                    523
purchase of the house                       516
purchase of the house for my family         514
building a property                         514
construction of own property                513
transactions with my real estate            508
buy residential real estate                 506
buy real estate                             505
building a real estate                      501
housing renovation                          499
buying my own car                       

[Let's check unique values]

In [66]:
unique_purpose = data['purpose'].unique()
unique_purpose# Check the unique values

array(['purchase of the house', 'car purchase', 'supplementary education',
       'housing transactions', 'education', 'having a wedding',
       'purchase of the house for my family', 'buy real estate',
       'buy commercial real estate', 'buy residential real estate',
       'construction of own property', 'property', 'building a property',
       'buying my own car', 'buying a second-hand car',
       'to have a wedding', 'housing', 'transactions with my real estate',
       'cars', 'to become educated', 'second-hand car purchase',
       'getting an education', 'car', 'wedding ceremony',
       'to get a supplementary education', 'purchase of my own house',
       'real estate transactions', 'getting higher education',
       'transactions with commercial real estate', 'to own a car',
       'purchase of a car', 'profile education', 'university education',
       'to buy a car', 'buying property for renting out',
       'building a real estate', 'housing renovation',
       'going

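[What main groups can you identify based on the unique values?] **The unique purposes fall into four main groups: wedding, real estate (house, housing, property), car, and education. Separately, the `children` column takes the values 0 to 5, and families with 0 children make up the largest part of the data.**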

[Based on these themes, we will probably want to categorize our data.]


In [67]:
def purpose_category(y):
    """Map a row's free-text purpose to one of four categories."""
    # Keyword found in the purpose text -> category
    mapping = {'wedding':'wedding',
               'estate':'house',
               'property':'house',
               'hous':'house',
               'car':'car',
               'education':'education',
               'university':'education','college':'education'
              }
    for key in mapping:
        if key in y['purpose']:
            return mapping[key]
    # No keyword matched (for example 'to become educated'): leave the category empty
    return None


In [68]:
data['purpose_category'] = data.apply(purpose_category, axis = 1)

In [69]:
data

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_total,median_fun,median_new,purpose_category
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,adult,26560.9705,1573.791064,house
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,adult,21853.63,1573.791064,car
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,adult,21853.63,1573.791064,house
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,adult,21853.63,1573.791064,education
4,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.170,purchase of the house,adult,32424.849,1555.993659,house
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17811,1,2351.431934,37,graduate degree,4,divorced,3,M,employee,0,18551.846,buy commercial real estate,adult,31771.321,1573.791064,house
17812,1,4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions,adult,25401.27,1555.993659,house
17813,1,2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property,adult,21853.63,1573.791064,house
17814,3,3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car,adult,21853.63,1573.791064,car


[If you decide to categorize the numerical data, you'll need to come up with the categories for it too.]

In [70]:
data.head()# Looking through all the numerical data in your selected column for categorization


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_total,median_fun,median_new,purpose_category
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,adult,26560.9705,1573.791064,house
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,adult,21853.63,1573.791064,car
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,adult,21853.63,1573.791064,house
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,adult,21853.63,1573.791064,education
4,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,adult,32424.849,1555.993659,house


In [71]:
data['purpose_category'].describe()# Getting summary statistics for the column



count     17098
unique        4
top       house
freq       8830
Name: purpose_category, dtype: object
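The count of 17098 above is lower than the 17436 rows in the data, so some purposes received no category. One likely cause is that a string such as 'to become educated' does not contain the keyword 'education'. A minimal sketch of a stem-based variant follows; the stem list, the `'other'` fallback, and the function name `purpose_category_by_stem` are assumptions for illustration, not part of the original analysis.

```python
# Hypothetical stem -> category mapping. Shorter stems such as 'educat'
# also match 'educated'; the exact stems are an assumption.
STEM_TO_CATEGORY = {
    'wedding': 'wedding',
    'estate': 'house',
    'property': 'house',
    'hous': 'house',
    'car': 'car',
    'educat': 'education',
    'university': 'education',
    'college': 'education',
}


def purpose_category_by_stem(purpose, default='other'):
    """Return the category of the first stem found in the purpose text."""
    for stem, category in STEM_TO_CATEGORY.items():
        if stem in purpose:
            return category
    # Nothing matched: return an explicit fallback instead of None
    return default


examples = ['to become educated', 'car purchase',
            'housing renovation', 'wedding ceremony']
categories = [purpose_category_by_stem(p) for p in examples]
print(categories)
```

On the full data this could be applied with `data['purpose'].apply(purpose_category_by_stem)`. An explicit fallback makes unmatched rows visible in `value_counts()` instead of silently dropping them from the pivot tables.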

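[Decide what ranges you will use for grouping and explain why.] **The purpose column has 4 main categories, so I grouped it with the function above into: wedding, house, car, education. For income I use three ranges: low income (up to 20000), normal income (above 20000 and up to 40000), and high income (above 40000).**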

In [72]:
def categorize_income(income):
    """Assign an income group based on fixed ranges."""
    if 0 < income <= 20000:
        return 'low income'
    if 20000 < income <= 40000:
        return 'normal income'
    if income > 40000:
        return 'high income'
    # Non-positive or missing income matches no range
    return None



In [73]:
categorize_income(700000)  # Checking that categorize_income works


'high income'
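The same ranges can also be expressed with `pd.cut`, which bins a whole column at once. This is only a sketch of an alternative, assuming the same three ranges as `categorize_income`; the variable names and the sample values (three incomes from the data plus the two boundary values) are chosen for illustration.

```python
import pandas as pd

# Sample incomes: three values from the data plus both range boundaries
incomes = pd.Series([17932.802, 23341.752, 40620.102, 20000.0, 40000.0])

# Bin edges are right-inclusive by default: (0, 20000], (20000, 40000], (40000, inf)
income_bins = [0, 20000, 40000, float('inf')]
income_labels = ['low income', 'normal income', 'high income']

income_group = pd.cut(incomes, bins=income_bins, labels=income_labels)
print(income_group.tolist())
```

Values outside every bin, such as a non-positive or missing income, come out as NaN, which mirrors the function returning nothing for those cases.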

In [74]:
data['total_income'] = pd.to_numeric(data['total_income'], errors='coerce')

In [75]:
data['total_income']

0        40620.102
1        17932.802
2        23341.752
3        42820.568
4        40922.170
           ...    
17811    18551.846
17812    35966.698
17813    14347.610
17814    39054.888
17815    13127.587
Name: total_income, Length: 17436, dtype: float64

In [76]:
data['categorize_income'] = data['total_income'].apply(categorize_income)# Creating column with categories


In [77]:
data['categorize_income']  # Inspect the new income category column

0          high income
1           low income
2        normal income
3          high income
4          high income
             ...      
17811       low income
17812    normal income
17813       low income
17814    normal income
17815       low income
Name: categorize_income, Length: 17436, dtype: object
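To see the distribution of the categories rather than the raw column, `value_counts` can be used. A minimal sketch on the first five values shown above (the variable names are assumptions; on the full data the call would be `data['categorize_income'].value_counts()`):

```python
import pandas as pd

# First five category values from the output above
sample = pd.Series(
    ['high income', 'low income', 'normal income', 'high income', 'high income'],
    name='categorize_income',
)

counts = sample.value_counts()                 # absolute number of rows per category
shares = sample.value_counts(normalize=True)   # fraction of rows per category

print(counts)
print(shares)
```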

## Checking the Hypotheses


**Is there a correlation between having children and paying back on time?** **Yes, there is a small correlation.**

In [78]:
pivot_children = data.pivot_table(index='children', columns='debt', values='days_employed', aggfunc='count')  # Count loans by number of children and debt status

# Default rate in percent: defaults divided by all loans in the group
pivot_children['wanted'] = (pivot_children[1] / (pivot_children[1] + pivot_children[0]) * 100).round(2)

pivot_children.sort_values(by='wanted', ascending=True)



debt,0,1,wanted
children,,,
3,296.0,26.0,8.07
0,9669.0,861.0,8.18
1,4090.0,426.0,9.43
2,1828.0,191.0,9.46
4,36.0,4.0,10.0
5,9.0,,
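The default rate of a group is the number of defaults divided by all loans in the group, expressed in percent. Operator precedence matters in this expression: multiplication binds tighter than addition, so the parenthesis around the sum has to close before the factor of 100 is applied. A small sketch using the counts for borrowers with four children (36 repaid, 4 defaulted); the function name `default_rate_percent` is an assumption for illustration.

```python
def default_rate_percent(repaid, defaulted):
    """Share of loans in a group that defaulted, in percent."""
    total = repaid + defaulted
    return 100 * defaulted / total


# Borrowers with four children: 36 repaid on time, 4 defaulted
rate_four_children = default_rate_percent(repaid=36, defaulted=4)
print(rate_four_children)

# Pitfall: closing the parenthesis after `* 100` scales only the repaid count,
# which yields a much smaller number that is not a percentage.
misplaced = 4 / (4 + 36 * 100)
print(misplaced)
```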


**Conclusion**

[Write your conclusions based on your manipulations and observations.]


**Is there a correlation between family status and paying back on time?** **There is a small correlation.**

In [79]:
pivot_family = data.pivot_table(index='family_status', columns='debt', values='days_employed', aggfunc='count')  # Count loans by family status and debt status


# Default rate in percent: defaults divided by all loans in the group
pivot_family['wanted'] = (pivot_family[1] / (pivot_family[1] + pivot_family[0]) * 100).round(2)
pivot_family.sort_values(by='wanted', ascending=True)


debt,0,1,wanted
family_status,,,
widow / widower,393,26,6.21
divorced,892,71,7.37
married,9346,808,7.96
civil partnership,3118,347,10.01
unmarried,2179,256,10.51
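Because `debt` is coded as 0 or 1, the mean of `debt` within a group is already the default share, so a `groupby` can produce the same kind of table without a pivot. This is a sketch on a small made-up sample, not the bank's data; the frame `loans` and the output column names are assumptions.

```python
import pandas as pd

# Small hypothetical sample: debt is 1 for a default, 0 otherwise
loans = pd.DataFrame({
    'family_status': ['married', 'married', 'married', 'married',
                      'unmarried', 'unmarried'],
    'debt': [0, 0, 0, 1, 0, 1],
})

# Named aggregation: group size, number of defaults, and default share
summary = loans.groupby('family_status')['debt'].agg(
    n_loans='count',
    n_defaults='sum',
    default_rate='mean',
)
summary['default_rate'] = (summary['default_rate'] * 100).round(2)  # in percent
summary = summary.sort_values('default_rate')
print(summary)
```

Keeping the group size next to the rate makes it easier to judge how reliable each rate is, since small groups give noisy percentages.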


**Conclusion**

[Write your conclusions based on your manipulations and observations.]

**Is there a correlation between income level and paying back on time?**

In [80]:
pivot_income = data.pivot_table(index='categorize_income', columns='debt', values='days_employed', aggfunc='count')  # Count loans by income level and debt status


# Default rate in percent: defaults divided by all loans in the group
pivot_income['wanted'] = (pivot_income[1] / (pivot_income[1] + pivot_income[0]) * 100).round(2)

pivot_income.sort_values(by='wanted', ascending=True)





debt,0,1,wanted
categorize_income,,,
high income,2328,178,7.1
normal income,8646,814,8.6
low income,4953,516,9.43


**Conclusion**

[Write your conclusions based on your manipulations and observations.]



**How does credit purpose affect the default rate?**

In [81]:
pivot_purpose = data.pivot_table(index='purpose_category', columns='debt', values='days_employed', aggfunc='count')  # Count loans by credit purpose and debt status
# Default rate in percent: defaults divided by all loans in the group
pivot_purpose['wanted'] = (pivot_purpose[1] / (pivot_purpose[1] + pivot_purpose[0]) * 100).round(2)
pivot_purpose.sort_values(by='wanted', ascending=True)



debt,0,1,wanted
purpose_category,,,
house,8146,684,7.75
wedding,1723,156,8.3
education,2630,289,9.9
car,3123,347,10.0


In [82]:
data

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_total,median_fun,median_new,purpose_category,categorize_income
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,adult,26560.9705,1573.791064,house,high income
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,adult,21853.63,1573.791064,car,low income
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,adult,21853.63,1573.791064,house,normal income
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,adult,21853.63,1573.791064,education,high income
4,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.170,purchase of the house,adult,32424.849,1555.993659,house,high income
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17811,1,2351.431934,37,graduate degree,4,divorced,3,M,employee,0,18551.846,buy commercial real estate,adult,31771.321,1573.791064,house,low income
17812,1,4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions,adult,25401.27,1555.993659,house,normal income
17813,1,2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property,adult,21853.63,1573.791064,house,low income
17814,3,3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car,adult,21853.63,1573.791064,car,normal income


**Conclusion**

[Write your conclusions based on your manipulations and observations.] **It seems that many factors affect whether a borrower repays their loan. I checked for missing values and tried to fill them, and I also investigated the duplicates.**


# General Conclusion 

[List your conclusions in this final section. Make sure you include all your important conclusions you made that led you to the way you processed and analyzed the data. Cover the missing values, duplicates, and possible reasons and solutions for problematic artifacts that you had to address.]

[List your conclusions regarding the posed questions here as well.]

**I dealt with several issues in this project:
About 10% of the data was missing. This could have many causes, for example customers who are retired or students who do not have an income. I replaced the missing data with the median.
Another issue was that there were 71 duplicates, which had to be deleted.
After processing the data I reached several conclusions. One of them is that most people who take a loan have a normal income.
About my hypotheses: children: borrowers with no children default slightly less often than borrowers with one or two children (861 of 10530 loans versus 426 of 4516 and 191 of 2019); the groups with three or more children are small, so their rates are less reliable. Purpose: loans for buying a house are the most common and have the lowest share of defaults (684 of 8830). Family status: most people who take loans are married (from the pivot table by family status).**


<b>Overall reviewer's comment v.2.</b> <a class="tocSkip"></a>

Hello! 

Thanks for your corrections!

I'm glad to say that your project is perfect now and your code has passed review!

Congratulations and good luck on the next sprint!

&#127881;
&#127881;
&#127881;