## Project description.

The project is to prepare a report for a bank’s loan division. <br>
We need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness. <br>
The overall goal of the project is to build a credit score for a potential customer. A credit score is used to evaluate the ability of a potential borrower to repay their loan.

First, we will preprocess the dataset, which includes the following steps:
 - [Identifying and filling in missing values](#processing_missing_values)
 - [Replacing the real number data type with the integer type](#data_type_replacement)
 - Deleting duplicate data
 - Categorizing the data

Then we will analyze the resulting dataframe to answer the following questions:  
 - Is there a connection between having kids and repaying a loan on time?
 - Is there a connection between marital status and repaying a loan on time?
 - Is there a connection between income level and repaying a loan on time?
 - How do different loan purposes affect on-time loan repayment?

## Step 1. Open the data file and have a look at the general information. 

In [3]:
import pandas as pd
df = pd.read_csv("credit_scoring_eng.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


Here we can see that our table consists of 12 columns. <br>
Two of them, `days_employed` and `total_income`, contain null values. Interestingly, these two columns have exactly the same number of null values (2174 each), which suggests that employment and income information is missing for the same group of people. That may be due to their unemployment at the moment or to some technical error. <br>
<br>
Now let's take a look at the table's content:

In [1]:
df.sample(10)


Now we see that the `days_employed` column contains not only null values but also other artifacts: negative numbers, which make no sense for a number of working days. <br>
#### Conclusion: 
After a first brief look at the data file we already see a few columns that contain missing values and values that don't correspond to reality. <br>
We are going to fix them and take a closer look at each of the other columns in the next data preprocessing step to find out whether there are any duplicates or other data corruption.

<a id="processing_missing_values"></a>

## Step 2. Data preprocessing
### Processing missing values
#### `days_employed` column

In this column we have missing values (NaN) and some negative values. <br>
First of all, the number of working days can't be negative. Most likely it's some sort of technical error that occurred while importing data into the database, so I'm going to convert all negative values to positive ones to fix that. <br>
In addition, I'm going to check how many null values we have in this column.

In [479]:
df['days_employed'] = df['days_employed'].abs()
df['days_employed'].isnull().sum()

2174

So we have 2174 missing values. What could be the reason for that? Maybe those customers are unemployed at the moment? <br> 
Let's test this hypothesis by checking the `income_type` values for all rows where `days_employed` is null:

In [480]:
df[df['days_employed'].isnull()]['income_type'].value_counts()

employee         1105
business          508
retiree           413
civil servant     147
entrepreneur        1
Name: income_type, dtype: int64

It turns out that none of the customers with missing `days_employed` values are unemployed, so the gaps are probably the result of a technical error. <br>
Let's check whether `total_income` values are missing for the same group of people:

In [481]:
df[df['days_employed'].isnull()]['total_income'].value_counts()

Series([], Name: total_income, dtype: int64)

The output is an empty Series, which means all rows with a missing `days_employed` value have a missing `total_income` value as well (`value_counts()` ignores nulls by default). Taking into account that the `days_employed` and `total_income` columns have the same number of null values, we can say that it's the same group of people. <br>
<br>
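The same check can be made more explicit by comparing the null masks of the two columns row by row. This is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not rows from the bank data:

```python
import pandas as pd

# Toy data (hypothetical): both columns are missing in exactly the same rows.
toy = pd.DataFrame({
    'days_employed': [1200.5, float('nan'), 340.0, float('nan')],
    'total_income': [25000.0, float('nan'), 18000.0, float('nan')],
})

# True only if every row is missing in both columns or present in both.
same_rows_missing = bool(
    (toy['days_employed'].isna() == toy['total_income'].isna()).all()
)
print(same_rows_missing)
```

Unlike an empty `value_counts()` result, this comparison would also catch rows where only `total_income` is missing.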
The number of rows containing null values is too big to just drop them, so it's preferable to fill them with a value that won't distort our final results. For quantitative missing values like those in the `days_employed` column, it is good practice to replace them with the mean or median of the column. The median is preferred when the data has serious outliers. <br>
Let's find out if we have them here: 

In [482]:
df['days_employed'].describe()

count     19351.000000
mean      66914.728907
std      139030.880527
min          24.141633
25%         927.009265
50%        2194.220567
75%        5537.882441
max      401755.400475
Name: days_employed, dtype: float64

We see that **std**, the standard deviation of the observations, is much bigger than the **mean** of the column. There's also quite a big difference between the **mean** and the **median** (50% in the output table). <br>
That means the data has significant outliers, so replacing nulls with the median would be more representative.

In [483]:
df.loc[df['days_employed'].isnull(), 'days_employed'] = df['days_employed'].median()
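To illustrate why the median is the safer fill value when outliers are present, here is a small sketch using only the standard library. The numbers are made up for the illustration and are not taken from the dataset:

```python
from statistics import mean, median

# Made-up employment lengths in days; the extra value is an extreme outlier.
typical = [900, 1500, 2200, 3100, 5500]
with_outlier = typical + [400000]

print(mean(typical), median(typical))
print(mean(with_outlier), median(with_outlier))
```

Adding one extreme value moves the median only from 2200 to 2650, while the mean jumps from 2640 to roughly 68867, which is the same pattern we see in the `describe()` output above.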

#### `total_income` column
As we figured out, in the `total_income` column we're missing values for the same group of people as in `days_employed` column. <br>
So we can apply the same method for filling null values here. Let's see if our data has any significant outliers:

In [484]:
df['total_income'].describe()

count     19351.000000
mean      26787.568355
std       16475.450632
min        3306.762000
25%       16488.504500
50%       23202.870000
75%       32549.611000
max      362496.645000
Name: total_income, dtype: float64

In the case of the `total_income` column, **std** is smaller than the **mean**, and the **mean** and **median** are fairly close (about 26788 vs. 23203). <br>
That suggests the outliers here are far less severe than in `days_employed`, so replacing nulls with the mean is acceptable in this case. 

In [485]:
df.loc[df['total_income'].isnull(), 'total_income'] = df['total_income'].mean()

#### `dob_years` column
Let's have a look at the values of the `dob_years` (customers' age) column:

In [486]:
df['dob_years'].describe()

count    21525.000000
mean        43.293380
std         12.574584
min          0.000000
25%         33.000000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

We see that the minimum age is 0. Let's check whether there are other unrealistically young ages.   

In [487]:
df.loc[df['dob_years'] != 0, 'dob_years'].min()

19

So, other than 0, the minimum age is 19, which makes sense. <br>
We should treat rows with age 0 as missing values. Zero could be the default value that the age column gets when someone doesn't fill out their age.  <br>
<br>
Let's apply the same method here as for the `days_employed` and `total_income` columns and replace missing values with the mean or median. <br>
As we saw above, where we applied the `.describe()` method to the `dob_years` column, **std** is much smaller than the **mean**, and the **mean** and **median** are almost the same. That means the data doesn't have any serious outliers, which would be odd to have in age data anyway. So we can replace the zeros with the **mean** value. <br>
<br>
It also might be a good idea to rename this column from `dob_years` to `age`, which is shorter and easier to understand.

In [488]:
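# Average age over rows with a real age; the placeholder zeros are excluded
mean_age = df.loc[df['dob_years'] != 0, 'dob_years'].mean()
df.loc[df['dob_years'] == 0, 'dob_years'] = int(mean_age)
df['dob_years'] = df['dob_years'].astype('int64')
df.rename(columns={'dob_years': 'age'}, inplace=True)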
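When zeros stand in for missing ages, it matters whether they take part in the average that replaces them. The sketch below shows the difference on a made-up series (`ages` and its values are assumptions for the example only):

```python
import pandas as pd

# Hypothetical ages; 0 means "age not provided".
ages = pd.Series([0, 25, 35, 45, 0, 55], name='dob_years')

naive_mean = ages.mean()              # pulled down by the placeholder zeros
valid_mean = ages[ages != 0].mean()   # average over real ages only

# Keep real ages, replace the zeros with the average of the real ages.
cleaned = ages.where(ages != 0, int(valid_mean))
print(naive_mean, valid_mean)
print(cleaned.tolist())
```

With only a small share of zeros in a large column the two averages are close, but excluding the placeholders keeps the fill value independent of how many ages are missing.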

#### Conclusion: 

In [489]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     21525 non-null  float64
 2   age               21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      21525 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


Now we have got rid of all missing values in the data frame.

### Data type replacement <a id="data_type_replacement"></a>
`days_employed` and `total_income` columns have the float data type. Since the numbers in those columns are in the thousands or tens of thousands, the digits after the decimal point give precision that we don't really need. <br>
Let's convert the values of those two columns into integers:

In [490]:
df['days_employed'] = df['days_employed'].astype('int64')
df['total_income'] = df['total_income'].astype('int64')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   children          21525 non-null  int64 
 1   days_employed     21525 non-null  int64 
 2   age               21525 non-null  int64 
 3   education         21525 non-null  object
 4   education_id      21525 non-null  int64 
 5   family_status     21525 non-null  object
 6   family_status_id  21525 non-null  int64 
 7   gender            21525 non-null  object
 8   income_type       21525 non-null  object
 9   debt              21525 non-null  int64 
 10  total_income      21525 non-null  int64 
 11  purpose           21525 non-null  object
dtypes: int64(7), object(5)
memory usage: 2.0+ MB


#### Conclusion: 
Now we have a data frame containing only integer and object (String) values.
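Note that `astype('int64')` drops the fractional part instead of rounding, which is why the minimum income 3306.762 later appears as 3306. A short sketch of the difference, using three values from the `describe()` output above rounded to two decimals for illustration:

```python
import pandas as pd

incomes = pd.Series([26787.57, 23202.87, 3306.76])

truncated = incomes.astype('int64')        # drops the fractional part
rounded = incomes.round().astype('int64')  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

For values of this size the difference is at most one unit, so either choice is fine for the analysis here; rounding first is simply the more predictable habit.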

### Processing duplicates
#### `education` column
Let's have a look at all unique category titles that we have in `education` column:

In [491]:
df['education'].value_counts()

secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
GRADUATE DEGREE            1
Graduate Degree            1
Name: education, dtype: int64

We see a lot of duplicates caused by writing the same category title in different letter cases (all caps, title case, or all lowercase). <br>
Let's get rid of such duplicates by lowercasing all category titles.

In [492]:
df['education'] = df['education'].str.lower()
df['education'].value_counts()

secondary education    15233
bachelor's degree       5260
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64

Now after we got rid of all duplicates, the number of categories in the `education` column decreased to 5. <br>
Let's see if that corresponds to the number of category IDs that we have in the `education_id` column: 

In [493]:
df['education_id'].value_counts()

1    15233
0     5260
2      744
3      282
4        6
Name: education_id, dtype: int64

Great. We have the same number of titles and IDs, and the same number of rows under each category title and ID.
That means we can now pair up each education level title with its ID number.
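One possible way to do that pairing is to build an ID-to-title lookup from the distinct pairs. This is a sketch on a small made-up frame (`toy` and `education_lookup` are illustrative names, not part of the original notebook):

```python
import pandas as pd

toy = pd.DataFrame({
    'education': ["bachelor's degree", 'secondary education',
                  'secondary education', 'some college'],
    'education_id': [0, 1, 1, 2],
})

# One row per distinct (id, title) pair, turned into an id -> title lookup.
education_lookup = (
    toy[['education_id', 'education']]
    .drop_duplicates()
    .set_index('education_id')['education']
    .to_dict()
)
print(education_lookup)
```

If the lookup ends up with as many entries as there are distinct pairs, each ID maps to exactly one title.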
#### `purpose` column
Let's have a look at all unique category titles that we have in `purpose` column:

In [494]:
df['purpose'].value_counts()

wedding ceremony                            797
having a wedding                            777
to have a wedding                           774
real estate transactions                    676
buy commercial real estate                  664
housing transactions                        653
buying property for renting out             653
transactions with commercial real estate    651
purchase of the house                       647
housing                                     647
purchase of the house for my family         641
construction of own property                635
property                                    634
transactions with my real estate            630
building a real estate                      626
buy real estate                             624
building a property                         620
purchase of my own house                    620
housing renovation                          612
buy residential real estate                 607
buying my own car                       

We see here a lot of categories which are not identical duplicates but refer to the same reason for taking a loan. <br>
For example, _"wedding ceremony", "having a wedding"_ and _"to have a wedding"_ all refer to _"wedding"_. <br>
<br>
We can find such duplicates using stemming. <br>
First of all, let's create a function that takes a stem as an argument, looks for categories in the `purpose` column that contain a word with that stem, and replaces each such category with the argument:

In [495]:
import nltk
from nltk.stem import SnowballStemmer
eng_stemmer = SnowballStemmer('english')

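def stemming_function(argument_word):
    """Replace every purpose that contains a word whose stem equals argument_word."""
    for purpose in df['purpose'].unique():
        for word in purpose.split(" "):
            stemmed_word = eng_stemmer.stem(word)
            if stemmed_word == argument_word:
                # Overwrite the whole purpose text with the matching stem
                df.loc[df['purpose'] == purpose, 'purpose'] = argument_word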

Here's the list of words that appear multiple times in `purpose` column and could be used as argument words:
 - car
 - education
 - wedding
 - house
 - property
 - real estate
 - university
 
Let's find out stems for them, so we can apply them as arguments for stemming function. <br>
_"real estate"_ contains two words, so we might apply only _"estate"_ as an argument word.

In [496]:
print("car —", eng_stemmer.stem('car'))
print("education —", eng_stemmer.stem('education'))
print("wedding —", eng_stemmer.stem('wedding'))
print("house —", eng_stemmer.stem('house'))
print("property —", eng_stemmer.stem('property'))
print("real estate —", eng_stemmer.stem('estate'))
print("university —", eng_stemmer.stem('university'))

car — car
education — educ
wedding — wed
house — hous
property — properti
real estate — estat
university — univers


Let's apply the stemming function with all stem words that we got.

In [497]:
stemming_function('car')
stemming_function('educ')
stemming_function('wed')
stemming_function('hous')
stemming_function('properti')
stemming_function('estat')
stemming_function('univers')
df['purpose'].value_counts()

estat       4478
car         4315
hous        3820
educ        3526
properti    2542
wed         2348
univers      496
Name: purpose, dtype: int64

Now there are still some more categories that we can join together: 
 - _education_ and _university_ can be combined in one category: _education_
 - _house, property_ and _real estate_ can be combined into _real estate_.

Also we might want to rename all categories back to their normal names instead of stem words. 

In [498]:
df.loc[df['purpose'].isin(['estat', 'hous', 'properti']), 'purpose'] = 'real_estate'
df.loc[df['purpose'].isin(['educ', 'univers']), 'purpose'] = 'education'
df.loc[df['purpose'] == 'wed', 'purpose'] = 'wedding'
df['purpose'].value_counts()

real_estate    10840
car             4315
education       4022
wedding         2348
Name: purpose, dtype: int64

#### Conclusion: 
We processed all of the duplicates and decreased the number of categories in the `education` column to 5 and in the `purpose` column to 4.
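For readers without NLTK, the grouping of `purpose` values can be roughly approximated with plain prefix matching. This is a different and cruder technique than Snowball stemming, shown here only as a sketch; `PREFIX_TO_CATEGORY` and `categorize_purpose` are hypothetical names, and the prefixes are modelled on the stems used above:

```python
# Hypothetical prefix -> category table, modelled on the stems used above.
PREFIX_TO_CATEGORY = {
    'car': 'car',
    'wed': 'wedding',
    'educ': 'education',
    'univers': 'education',
    'hous': 'real_estate',
    'propert': 'real_estate',
    'estat': 'real_estate',
}

def categorize_purpose(purpose):
    """Return the category of the first word that starts with a known prefix."""
    for word in purpose.lower().split():
        for prefix, category in PREFIX_TO_CATEGORY.items():
            if word.startswith(prefix):
                return category
    return 'other'  # nothing matched; worth inspecting by hand

examples = ['wedding ceremony', 'buying my own car',
            'housing renovation', 'buy commercial real estate']
for text in examples:
    print(text, '->', categorize_purpose(text))
```

Prefix matching can misfire on unrelated words that share a prefix (for example a word like "career" would match `car`), so the `'other'` bucket and a manual look at the matches are advisable before trusting the result.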

### Other data artifacts
#### `children` column
Let's have a look at all unique values of `children` column:

In [499]:
df['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

The values **20** and **-1** look like mistakes. <br>
Since no one else has more than 5 children, **20** looks like a misprint for **2**. <br>
**-1** is most likely a misprint for **1**.

In [500]:
df.loc[df['children'] == 20, 'children'] = 2
df.loc[df['children'] == -1, 'children'] = 1
df['children'].value_counts()

0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64

Now values of `children` column make more sense.
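The two corrections can also be written as a single lookup with `Series.replace`, which keeps the assumed misprints in one place. A minimal sketch on a made-up series:

```python
import pandas as pd

children = pd.Series([0, 1, 20, -1, 3, 2])

# Suspicious value -> assumed intended value (both are assumptions about typos).
fixed = children.replace({20: 2, -1: 1})
print(fixed.tolist())
```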
#### `gender` column
Let's take a look at `gender` column:

In [501]:
df['gender'].value_counts()

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

We see one row with the value `XNA`, meaning the gender was not specified. <br>
That's 1 case out of 21525, too few to affect the results, so we can drop that row.

In [502]:
df.drop(df[df['gender'] == 'XNA'].index, inplace = True) 

#### `income_type` column
Let's take a look at all unique categories of `income_type` column:

In [503]:
df['income_type'].value_counts()

employee                       11119
business                        5084
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

Here we see that some income types don't have enough cases to support any conclusions. <br>
1-2 cases out of 21525 is a share extremely close to 0%.
 - _"entrepreneur"_ category could be joined with _"business"_
 - _"unemployed", "paternity / maternity leave"_ and _"student"_ could be dropped.

In [504]:
df.loc[df['income_type'] == 'entrepreneur', 'income_type'] = 'business'
rare_income_types = ['unemployed', 'paternity / maternity leave', 'student']
df.drop(df[df['income_type'].isin(rare_income_types)].index, inplace=True)

#### Conclusion: 
We fixed some artifact values in the `children` column and got rid of some rare values in the `gender` and `income_type` columns.

### Categorizing Data
Now that we have processed the data and got rid of missing values and duplicates, it's time to analyse it. <br>
Our project goal is to check whether there is any correlation between the number of children or marital status and whether customers default on a loan.  <br>
Let's split our data into groups according to those parameters and compare the percentage of people who didn't repay their debt on time. 
#### Categorizing by number of children
First, let's group the table by the number of children. <br>
We're going to group the data frame by the unique values of the `children` column and then apply two functions to the `debt` column: **count**, to count the rows for each number of children, and **sum**, to add up the `debt` values in each category. Since `debt` is 0 if the customer didn't default on a loan and 1 if they did, summing the `debt` column for each category shows how many people in that category didn't repay their loan on time. <br>
Let's save the results to the new `categorized_by_children` data frame:

In [505]:
categorized_by_children = df.groupby('children').agg({'debt':['count', 'sum']})
categorized_by_children

| children | debt: count | debt: sum |
|---|---|---|
| 0 | 14146 | 1063 |
| 1 | 4864 | 444 |
| 2 | 2130 | 201 |
| 3 | 330 | 27 |
| 4 | 41 | 4 |
| 5 | 9 | 0 |


Now we're going to divide the number of default cases (`sum` column) by the total number of cases (`count` column) and multiply the result by 100 to get the percentage of default cases in each group. The result is saved to the new `default_percent` column. <br>
Besides that, let's reset the column names to get rid of the two-level column index.

In [506]:
categorized_by_children.columns =['count', 'sum'] 
categorized_by_children['default_percent'] = (categorized_by_children['sum'] / categorized_by_children['count']*100).round(2)
categorized_by_children

| children | count | sum | default_percent |
|---|---|---|---|
| 0 | 14146 | 1063 | 7.51 |
| 1 | 4864 | 444 | 9.13 |
| 2 | 2130 | 201 | 9.44 |
| 3 | 330 | 27 | 8.18 |
| 4 | 41 | 4 | 9.76 |
| 5 | 9 | 0 | 0.0 |
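Because the same three steps (group, aggregate, compute the percentage) are needed for every parameter, they could be wrapped in a small helper. The sketch below assumes a frame with a 0/1 `debt` column; `default_rate_by` is a hypothetical name and the toy data is made up:

```python
import pandas as pd

def default_rate_by(frame, column):
    """Count loans and defaults per category and add the default rate in percent."""
    # Passing a list of functions on a single column yields flat column names,
    # so no renaming of a two-level column index is needed afterwards.
    summary = frame.groupby(column)['debt'].agg(['count', 'sum'])
    summary['default_percent'] = (summary['sum'] / summary['count'] * 100).round(2)
    return summary

toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1],
    'debt':     [0, 0, 0, 1, 1, 0],
})
result = default_rate_by(toy, 'children')
print(result)
```

With such a helper, each of the following tables would be a single call, for example `default_rate_by(df, 'family_status')`.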


#### Categorizing by marital status
Let's apply the same method to compare default percent for each family status:

In [507]:
categorized_by_family_status = df.groupby('family_status').agg({'debt':['count', 'sum']})
categorized_by_family_status.columns =['count', 'sum'] 
categorized_by_family_status['default_percent'] = (categorized_by_family_status['sum'] / categorized_by_family_status['count']*100).round(2)
categorized_by_family_status

| family_status | count | sum | default_percent |
|---|---|---|---|
| civil partnership | 4175 | 388 | 9.29 |
| divorced | 1195 | 85 | 7.11 |
| married | 12378 | 929 | 7.51 |
| unmarried | 2812 | 274 | 9.74 |
| widow / widower | 960 | 63 | 6.56 |


#### Categorizing by income level
Let's find out if income has any impact on the percentage of default cases. <br>
We have data for income type and income amount. Let's check income type first:

In [508]:
categorized_by_income_type = df.groupby('income_type').agg({'debt':['count', 'sum']})
categorized_by_income_type.columns =['count', 'sum'] 
categorized_by_income_type['default_percent'] = (categorized_by_income_type['sum'] / categorized_by_income_type['count']*100).round(2)
categorized_by_income_type

| income_type | count | sum | default_percent |
|---|---|---|---|
| business | 5086 | 376 | 7.39 |
| civil servant | 1459 | 86 | 5.89 |
| employee | 11119 | 1061 | 9.54 |
| retiree | 3856 | 216 | 5.6 |


Unlike income type, income amount (`total_income` column) is quantitative data, so in order to analyze it we need to split it into categories. <br>
Let's see how we could do that:

In [509]:
df['total_income'].describe()

count     21520.000000
mean      26788.688569
std       15621.873202
min        3306.000000
25%       17248.000000
50%       25024.000000
75%       31285.250000
max      362496.000000
Name: total_income, dtype: float64

We can split the whole range of `total_income` values into 4 groups at the 25%, median and 75% values. Let's name those categories _"very low"_ for the quarter below the 25% value, _"low"_ for 25-50%, _"high"_ for 50-75% and _"very high"_ for values above the 75% value. <br>
Now let's write a function that does that separation for us and apply it to the `total_income` column, saving the result in a new column `total_income_grouped`:

In [510]:
def income_group(income):
    if income <= 17248:
        return 'very low'
    if income <= 25024:
        return 'low'   
    if income <= 31285:
        return 'high'  
    return 'very high' 

df['total_income_grouped'] = df['total_income'].apply(income_group)
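An alternative to hard-coding the quartile boundaries is to let pandas compute them with `pd.qcut`. The following is a sketch on made-up incomes, not a replacement for the cell above:

```python
import pandas as pd

incomes = pd.Series([3000, 12000, 17000, 21000, 25000, 28000, 31000, 90000])

labels = ['very low', 'low', 'high', 'very high']
# q=4 asks pandas to derive the quartile edges from the data itself.
income_groups = pd.qcut(incomes, q=4, labels=labels)
print(income_groups.tolist())
```

The advantage is that the boundaries stay correct if the data changes; the hand-written function makes the thresholds easier to read in the notebook.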

Now that the income amounts are split into categories, let's compare the default percentage for each of them:

In [511]:
categorized_by_income = df.groupby('total_income_grouped').agg({'debt':['count', 'sum']})
categorized_by_income.columns =['count', 'sum'] 
categorized_by_income['default_percent'] = (categorized_by_income['sum'] / categorized_by_income['count']*100).round(2)
categorized_by_income

| total_income_grouped | count | sum | default_percent |
|---|---|---|---|
| high | 5379 | 455 | 8.46 |
| low | 5380 | 473 | 8.79 |
| very high | 5380 | 386 | 7.17 |
| very low | 5381 | 425 | 7.9 |


#### Categorizing by purpose
Let's apply the same technique to the `purpose` parameter as we did for children, family status and income:

In [512]:
categorized_by_purpose = df.groupby('purpose').agg({'debt':['count', 'sum']})
categorized_by_purpose.columns =['count', 'sum'] 
categorized_by_purpose['default_percent'] = (categorized_by_purpose['sum'] / categorized_by_purpose['count']*100).round(2)
categorized_by_purpose

| purpose | count | sum | default_percent |
|---|---|---|---|
| car | 4314 | 402 | 9.32 |
| education | 4022 | 370 | 9.2 |
| real_estate | 10836 | 781 | 7.21 |
| wedding | 2348 | 186 | 7.92 |


#### Conclusion: 
We categorized the data by a few parameters: number of kids, family status, income level and purpose of the loan. For each category we calculated the percentage of default cases and collected the results in tables. Now we are ready to conclude which categories are more likely to default on a loan.

## Step 3. Answer these questions

- **Is there a relation between having kids and repaying a loan on time?**

In [513]:
categorized_by_children.sort_values('default_percent', ascending = False)

| children | count | sum | default_percent |
|---|---|---|---|
| 4 | 41 | 4 | 9.76 |
| 2 | 2130 | 201 | 9.44 |
| 1 | 4864 | 444 | 9.13 |
| 3 | 330 | 27 | 8.18 |
| 0 | 14146 | 1063 | 7.51 |
| 5 | 9 | 0 | 0.0 |


From this table we see that customers with children default more often than customers without children: 9.13% to 9.76% for 1, 2 or 4 kids and 8.18% for 3 kids, against 7.51% for no kids. <br>
There is also a slight tendency for the default rate to grow with the number of kids, but it is not consistent: customers with **3** kids don't fit that pattern, the group with **4** kids has only 41 cases, and the group with **5** kids (9 cases) is too small to support any conclusion about its default rate.
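To get a feeling for how much the small groups can be trusted, one option is to put an approximate confidence interval around each default rate. The sketch below uses the Wilson score interval, which is a technique added here for illustration and not part of the original analysis; the counts are copied from the table above:

```python
import math

def wilson_interval(defaults, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = defaults / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# number of kids -> (defaults, customers), taken from the table above
groups = {0: (1063, 14146), 3: (27, 330), 4: (4, 41), 5: (0, 9)}
for kids, (defaults, total) in groups.items():
    low, high = wilson_interval(defaults, total)
    print(kids, round(low * 100, 1), round(high * 100, 1))
```

For customers without kids the interval is under one percentage point wide, while for the 41 customers with 4 kids it spans roughly 4% to 23%, so the 9.76% figure for that group says very little on its own.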

- **Is there a relation between marital status and repaying a loan on time?**

In [514]:
categorized_by_family_status.sort_values('default_percent', ascending = False)

| family_status | count | sum | default_percent |
|---|---|---|---|
| unmarried | 2812 | 274 | 9.74 |
| civil partnership | 4175 | 388 | 9.29 |
| married | 12378 | 929 | 7.51 |
| divorced | 1195 | 85 | 7.11 |
| widow / widower | 960 | 63 | 6.56 |


**Unmarried** customers and those in a **civil partnership** are the two groups most likely to default on a loan (9.74% and 9.29%). <br>
**Married** and **divorced** customers are less likely to do so (7.51% and 7.11%), while **widows / widowers** have the lowest default rate (6.56%). 

- **Is there a relation between income level and repaying a loan on time?**

In [515]:
categorized_by_income.sort_values('default_percent', ascending = False)

| total_income_grouped | count | sum | default_percent |
|---|---|---|---|
| low | 5380 | 473 | 8.79 |
| high | 5379 | 455 | 8.46 |
| very low | 5381 | 425 | 7.9 |
| very high | 5380 | 386 | 7.17 |


People with the **lowest** and **highest** incomes repay on time more often (default rates of 7.9% and 7.17%) than the two middle groups (8.79% and 8.46%).

In [516]:
categorized_by_income_type.sort_values('default_percent', ascending = False)

| income_type | count | sum | default_percent |
|---|---|---|---|
| employee | 11119 | 1061 | 9.54 |
| business | 5086 | 376 | 7.39 |
| civil servant | 1459 | 86 | 5.89 |
| retiree | 3856 | 216 | 5.6 |


Comparing sources of income, **retirees** and **civil servants** have the lowest default rates (5.6% and 5.89%), while **employees** are the most likely to default on their loan (9.54%). Those who run their own **business** have a default rate (7.39%) a little lower than the overall average. 
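The overall average that the groups are compared against can be recomputed from the counts in the table above. A small sketch in plain Python; the dictionary simply restates the table:

```python
# income type -> (customers, defaults), copied from the table above
by_income_type = {
    'employee': (11119, 1061),
    'business': (5086, 376),
    'civil servant': (1459, 86),
    'retiree': (3856, 216),
}

total_customers = sum(count for count, _ in by_income_type.values())
total_defaults = sum(defaults for _, defaults in by_income_type.values())
overall_percent = round(total_defaults / total_customers * 100, 2)
print(total_customers, total_defaults, overall_percent)
```

This gives 1739 defaults among 21520 customers, an overall default rate of about 8.08%, so the 7.39% for business owners is indeed slightly below average.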

- **How do different loan purposes affect on-time repayment of the loan?**

In [517]:
categorized_by_purpose.sort_values('default_percent', ascending = False)

| purpose | count | sum | default_percent |
|---|---|---|---|
| car | 4314 | 402 | 9.32 |
| education | 4022 | 370 | 9.2 |
| wedding | 2348 | 186 | 7.92 |
| real_estate | 10836 | 781 | 7.21 |


People who take loans to pay for a **wedding** or **real estate** are more likely to repay on time (default rates of 7.92% and 7.21%), while those who borrow to buy a **car** or get an **education** default more often (9.32% and 9.2%). 

## Step 4. General conclusion

In this project we analysed a dataset of bank loan customers, looking for connections between a customer’s marital status and number of kids and whether they repay a loan on time. <br>
The dataset contained some missing values, which we filled, as well as some duplicates in categorical values, which we identified and merged. After preprocessing we split the data into categories by number of children, family status, income and loan purpose, and calculated the percentage of default cases for each category to find out which groups default most often. <br> 
Our analysis shows that customers with kids, and customers who are unmarried or in a civil partnership, are the most likely not to repay loans on time. 