# Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building the **credit score** of a potential customer. The **credit score** is used to evaluate the ability of a potential borrower to repay their loan.

[In this notebook you're provided with hints, brief instructions, and thinking prompts. Don't ignore them: they are designed to give you a structure for the project and to help you analyze what you're doing on a deeper level. Before submitting your project, make sure you remove all the hints and descriptions provided to you. Instead, make this report look as if you're sending it to your teammates to demonstrate your findings - they shouldn't know you had some external help from us! To help you out, we've placed the hints you should remove in square brackets.]

[Before you dive into analyzing your data, explain the purposes of the project and hypotheses you're going to test.]

## Open the data file and have a look at the general information. 

[Start with importing the libraries and loading the data. You may realise that you need additional libraries as you go, which is totally fine - just make sure to update this section when you do.]

In [None]:
# Loading all the libraries


# Load the data


## Task 1. Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan

[Now let's explore our data. You'll want to see how many columns and rows it has, look at a few rows to check for potential issues with the data.]

In [None]:
# Let's see how many rows and columns our dataset has



In [None]:
# let's print the first N rows



[Describe what you see and notice in your printed data sample. Are there any issues that may need further investigation and changes?]

In [None]:
# Get info on data


[Are there missing values across all columns or just a few? Briefly describe what you see in 1-2 sentences.]

In [None]:
# Let's look at the filtered table with missing values in the first column with missing data



[Do the missing values seem symmetric? Can we be sure of this assumption? Explain your thoughts briefly in this section. You may want to investigate further and count the missing values in all the rows with missing values to confirm that the missing samples are the same size.]
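One way to test the symmetry assumption is to compare the missing-value masks of the affected columns. The sketch below assumes that `days_employed` and `total_income` are the two columns with gaps, and it uses a small made-up table in place of the real data; all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the real dataset (all values are hypothetical)
df = pd.DataFrame({
    "days_employed": [1200.0, np.nan, 350.0, np.nan, 4100.0],
    "total_income": [2500.0, np.nan, 1800.0, np.nan, 3900.0],
    "income_type": ["employee", "business", "retiree", "employee", "employee"],
})

# Boolean masks: True where a value is missing
missing_days = df["days_employed"].isna()
missing_income = df["total_income"].isna()

# Rows where both columns are missing at the same time
both_missing = df[missing_days & missing_income]

print("missing in days_employed:", missing_days.sum())
print("missing in total_income:", missing_income.sum())
print("missing in both:", len(both_missing))

# True only if the gaps occur in exactly the same rows
same_rows = missing_days.equals(missing_income)
print("same rows affected:", same_rows)
```

If the three counts match and `same_rows` is true, the gaps in the two columns belong to the same clients.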

In [None]:
# Let's apply multiple conditions for filtering data and look at the number of rows in the filtered table.



**Intermediate conclusion**

[Does the number of rows in the filtered table match the number of missing values? What conclusion can we make from this?]

[Calculate the percentage of missing values relative to the whole dataset. Is it a considerably large share of the data? If so, you may want to fill the missing values. To do that, first consider whether the missing data could be due to a specific client characteristic, such as employment type or something else. You will need to decide which characteristic *you* think might be the reason. Second, check whether the missing values depend on the values of other indicators in the columns for the client characteristic you identified.]
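As a rough illustration of both steps, the sketch below computes the share of missing values and then compares how one candidate characteristic is distributed among the rows with gaps versus the whole table. Choosing `income_type` as the characteristic is an assumption made for the example, and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up sample (hypothetical values); income_type is the assumed characteristic
df = pd.DataFrame({
    "income_type": ["employee", "employee", "business", "retiree",
                    "employee", "business", "retiree", "employee"],
    "total_income": [2500.0, np.nan, 3100.0, 1500.0,
                     np.nan, np.nan, 1700.0, 2800.0],
})

# Share of rows with a missing total_income, in percent
missing_pct = df["total_income"].isna().mean() * 100
print(f"missing: {missing_pct:.1f}%")

# Distribution of income_type among rows with missing income
missing_dist = df.loc[df["total_income"].isna(), "income_type"].value_counts(normalize=True)
# Distribution of income_type in the whole dataset
overall_dist = df["income_type"].value_counts(normalize=True)

# Side-by-side comparison; categories absent from the missing rows get 0
comparison = pd.DataFrame({"missing_rows": missing_dist, "all_rows": overall_dist}).fillna(0)
print(comparison)
```

Similar proportions in both columns would suggest the gaps are not tied to that characteristic; large differences would point to a pattern worth investigating.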

[Explain your next steps and how they correlate with the conclusions you made so far.]

In [None]:
# Let's investigate clients who do not have data on identified characteristic and the column with the missing values



In [None]:
# Checking distribution



[Describe your findings here.]

**Possible reasons for missing values in data**

[Propose your ideas on why you think the values might be missing. Do you think they are missing at random, or are there any patterns?]

[Let's start checking whether the missing values are random.]

In [None]:
# Checking the distribution in the whole dataset



**Intermediate conclusion**

[Is the distribution in the original dataset similar to the distribution of the filtered table? What does that mean for us?]

[If you think we can't draw any conclusions yet, let's investigate our dataset further. Let's think about other reasons that could lead to missing data and check whether we can find any patterns suggesting that the missing values are not random. Because this is your work, this section is optional.]

In [None]:
# Check for other reasons and patterns that could lead to missing values



**Intermediate conclusion**

[Can we finally confirm that missing values are accidental? Check for anything else that you think might be important here.]

In [None]:
# Checking for other patterns - explain which

**Conclusions**

[Did you find any patterns? How did you come to this conclusion?]

[Explain how you will address the missing values. Consider the categories in which values are missing.]

[Briefly plan your next steps for transforming the data. You will probably need to address different types of issues: duplicates, inconsistent letter case, incorrect artifacts, and missing values.]

## Data transformation

[Let's go through each column to see what issues we may have in them.]

[Begin with removing duplicates and fixing educational information if required.]
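A minimal sketch of the spelling fix, assuming the only problem in `education` is inconsistent letter case. The sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample (hypothetical spellings of the same education levels)
df = pd.DataFrame({
    "education": ["Bachelor's Degree", "bachelor's degree", "SECONDARY EDUCATION",
                  "secondary education", "Secondary Education"],
})

# Inspect the raw spellings first
print(df["education"].unique())

# Bring every value to lowercase so identical levels collapse into one
df["education"] = df["education"].str.lower()

# Confirm that only the distinct levels remain
print(df["education"].value_counts())
```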

In [None]:
# Let's see all values in education column to check if and what spellings will need to be fixed


In [None]:
# Fix the letter case if required


In [None]:
# Checking all the values in the column to make sure we fixed them



[Check the data in the `children` column.]

In [None]:
# Let's see the distribution of values in the `children` column


[Are there any strange things in the column? If yes, how high is the percentage of problematic data? How could they have occurred? Make a decision on what you will do with this data and explain your reasoning.]
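The sketch below shows one way to measure the share of problematic values. It assumes, purely for illustration, that the oddity is a negative child count and that the minus sign is a data-entry error; the real column may contain different issues that call for a different decision.

```python
import pandas as pd

# Made-up sample (hypothetical values, including impossible negative counts)
df = pd.DataFrame({"children": [0, 1, 2, -1, 0, 3, 1, 0, -1, 0]})

# Distribution of values, sorted by the number of children
print(df["children"].value_counts().sort_index())

# Flag values that cannot be a real number of children
suspicious = df["children"] < 0
suspicious_pct = suspicious.mean() * 100
print(f"suspicious rows: {suspicious_pct:.1f}%")

# One possible decision (an assumption): treat the minus sign as a typo
df["children"] = df["children"].abs()
print(df["children"].value_counts().sort_index())
```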

In [None]:
# Fix the data based on your decision


In [None]:
# Checking the `children` column again to make sure it's all fixed



[Check the data in the `days_employed` column. First, think about what kinds of issues there could be, what you may want to check, and how you will do it.]

In [None]:
# Find problematic data in `days_employed`, if they exist, and calculate the percentage


[If the amount of problematic data is high, it could be due to some technical issue. Since we can't drop these problematic rows, we may want to propose the most likely reason it happened and what the correct data might have been.]
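As a sketch of how such a check might look, the example below flags two kinds of implausible work experience: negative values and values longer than the client's whole life. Both rules are assumptions chosen for illustration, and the data is made up.

```python
import pandas as pd

# Made-up sample (hypothetical values)
df = pd.DataFrame({
    "days_employed": [-1500.0, 2300.0, -420.0, 365000.0, 980.0],
    "dob_years": [35, 41, 27, 60, 30],
})

# Rule 1 (assumption): work experience cannot be negative
negative = df["days_employed"] < 0

# Rule 2 (assumption): nobody can have worked longer than they have lived
too_long = df["days_employed"] > df["dob_years"] * 365

print(f"negative: {negative.mean():.0%}")
print(f"implausibly long: {too_long.mean():.0%}")

# Show the flagged rows to help reason about what went wrong
print(df[negative | too_long])
```

Looking at the flagged rows side by side often hints at the cause, for example a flipped sign or a value recorded in the wrong unit.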

In [None]:
# Address the problematic values, if they exist



In [None]:
# Check the result - make sure it's fixed


[Let's now look at the client's age and whether there are any issues there. Again, think about what data could be strange in this column, i.e. what cannot be someone's age.]

In [None]:
# Check the `dob_years` for suspicious values and count the percentage



[Decide what you'll do with the problematic values and explain why.]

In [None]:
# Address the issues in the `dob_years` column, if they exist


In [None]:
# Check the result - make sure it's fixed


[Now let's check the `family_status` column. See what kind of values there are and what problems you may need to address.]

In [None]:
# Let's see the values for the column



In [None]:
# Address the problematic values in `family_status`, if they exist



In [None]:
# Check the result - make sure it's fixed


[Now let's check the `gender` column. See what kind of values there are and what problems you may need to address]

In [None]:
# Let's see the values in the column

In [None]:
# Address the problematic values, if they exist

In [None]:
# Check the result - make sure it's fixed



[Now let's check the `income_type` column. See what kind of values there are and what problems you may need to address]

In [None]:
# Let's see the values in the column

In [None]:
# Address the problematic values, if they exist

In [None]:
# Check the result - make sure it's fixed



[Now let's see if we have any duplicates in our data. If we do, you'll need to decide what you will do with them and explain why.]
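A minimal sketch of counting and dropping fully duplicated rows, using a made-up table. Whether dropping is the right decision for the real data is for you to justify.

```python
import pandas as pd

# Made-up sample (hypothetical values) with two repeated rows
df = pd.DataFrame({
    "children": [0, 1, 0, 2, 1],
    "education": ["secondary education", "bachelor's degree", "secondary education",
                  "secondary education", "bachelor's degree"],
    "debt": [0, 0, 0, 1, 0],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = df.duplicated().sum()
print("fully duplicated rows:", n_duplicates)

# Keep the first occurrence of each row and rebuild a clean index
df_clean = df.drop_duplicates().reset_index(drop=True)
print("rows before:", len(df), "rows after:", len(df_clean))
```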

In [None]:
# Checking duplicates



In [None]:
# Address the duplicates, if they exist

In [None]:
# Last check whether we have any duplicates


In [None]:
# Check the size of the dataset that you now have after your first manipulations with it

[Describe your new dataset: briefly say what's changed and what's the percentage of the changes, if there were any.]


## Working with missing values

[To speed up working with some data, you may want to work with dictionaries for some values, where IDs are provided. Explain why and which dictionaries you will work with.]
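One way to build such a dictionary is to take the ID column together with its text column and keep only the unique pairs. The sketch below does this for `education_id` and `education` on made-up data; the same pattern would apply to `family_status_id` and `family_status`.

```python
import pandas as pd

# Made-up sample (hypothetical ID-to-label pairs)
df = pd.DataFrame({
    "education": ["secondary education", "bachelor's degree",
                  "secondary education", "primary education"],
    "education_id": [1, 0, 1, 2],
})

# Lookup table with one row per unique (id, label) pair
education_dict = (
    df[["education_id", "education"]]
    .drop_duplicates()
    .sort_values("education_id")
    .reset_index(drop=True)
)
print(education_dict)

# The same lookup as a plain Python dict
education_lookup = dict(zip(education_dict["education_id"], education_dict["education"]))
print(education_lookup)
```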

In [None]:
# Find the dictionaries

### Restoring missing values in `total_income`

[Briefly state which column(s) have values missing that you need to address. Explain how you will fix them.]


[Start with addressing the missing values in total income. Create an age category for clients and store it in a new column. This strategy can help with calculating values for the total income.]
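A possible shape for the age-category function is sketched below. The function name `age_group` and the bracket boundaries are illustrative assumptions, not values taken from the project brief.

```python
def age_group(age):
    """Return a coarse age bracket for a client's age in years.

    The boundaries are hypothetical; pick ones that suit your data.
    """
    if age < 30:
        return "under 30"
    if age < 45:
        return "30-44"
    if age < 60:
        return "45-59"
    return "60+"


# Quick check on a few sample ages, including the bracket edges
for sample_age in (22, 30, 59, 71):
    print(sample_age, "->", age_group(sample_age))
```

With a DataFrame at hand, the new column could then be created with something like `df["age_group"] = df["dob_years"].apply(age_group)`.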


In [None]:
# Let's write a function that calculates the age category

    

In [None]:
# Test if the function works


In [None]:
# Creating new column based on function



In [None]:
# Checking the values in the new column



[Think about the factors on which income usually depends. Eventually, you will want to find out whether you should use mean or median values for replacing missing values. To make this decision you will probably want to look at the distribution of the factors you identified as impacting one's income.]

[Create a table that only has data without missing values. This data will be used to restore the missing values.]
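The sketch below builds the table without missing values and compares the mean and median income per group. Using `income_type` as the grouping factor is an assumption for the example, and the figures are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample (hypothetical values); one employee has an unusually high income
df = pd.DataFrame({
    "income_type": ["employee", "employee", "employee", "business",
                    "business", "business", "employee"],
    "total_income": [2000.0, 2400.0, 9800.0, 3000.0, 3400.0, np.nan, np.nan],
})

# Keep only the rows where total_income is known
df_complete = df.dropna(subset=["total_income"])
print(df_complete.head())

# Mean, median and group size for each income type
income_stats = df_complete.groupby("income_type")["total_income"].agg(["mean", "median", "count"])
print(income_stats)
```

A mean that sits well above the median within a group usually signals a few very large values, which is one argument for preferring the median as a fill value.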

In [None]:
# Create a table without missing values and print a few of its rows to make sure it looks fine

In [None]:
# Look at the mean values for income based on your identified factors

In [None]:
# Look at the median values for income based on your identified factors


[Repeat such comparisons for multiple factors. Make sure you consider different aspects and explain your thinking process.]



[Make a decision on what characteristics define income most and whether you will use a median or a mean. Explain why you made this decision]
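If you settle on a group median, one possible way to fill the gaps is sketched below. The helper name `fill_with_group_median` and the choice of `income_type` as the grouping column are assumptions for illustration; a row-by-row function with `apply` would be an equally valid route.

```python
import numpy as np
import pandas as pd


def fill_with_group_median(data, target, group_cols):
    """Return a copy of `data` where NaNs in `target` are replaced by
    the median of `target` within each group defined by `group_cols`."""
    result = data.copy()
    # transform() returns a Series aligned with the original rows
    group_median = result.groupby(group_cols)[target].transform("median")
    result[target] = result[target].fillna(group_median)
    return result


# Made-up sample (hypothetical values); the only student has no income on record
df = pd.DataFrame({
    "income_type": ["employee", "employee", "employee", "business", "business", "student"],
    "total_income": [2000.0, 2400.0, np.nan, 3000.0, np.nan, np.nan],
})

filled = fill_with_group_median(df, "total_income", ["income_type"])
print(filled)

# Groups with no known values at all stay unfilled and need separate handling
print("still missing:", filled["total_income"].isna().sum())
```

In this toy data the student row stays empty because its group has no known incomes to take a median from, so such rows have to be handled separately.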


In [None]:
#  Write a function that we will use for filling in missing values
        
        

In [None]:
# Check if it works


In [None]:
# Apply it to every row


In [None]:
# Check if we got any errors


[If you've come across errors in preparing the values for missing data, it probably means there's something special about the data for the category. Give it some thought - you may want to fix some things manually, if there's enough data to find medians/means.]


In [None]:
# Replacing missing values if there are any errors


[When you think you've finished with `total_income`, check that the total number of values in this column matches the number of values in other ones.]

In [None]:
# Checking the number of entries in the columns



###  Restoring values in `days_employed`

[Think about the parameters that may help you restore the missing values in this column. Eventually, you will want to find out whether you should use mean or median values for replacing missing values. You will probably conduct research similar to what you did when restoring data in the previous column.]

In [None]:
# Distribution of `days_employed` medians based on your identified parameters




In [None]:
# Distribution of `days_employed` means based on your identified parameters

[Decide what you will use: means or medians. Explain why.]

In [None]:
# Let's write a function that calculates means or medians (depending on your decision) based on your identified parameter


In [None]:
# Check that the function works



In [None]:
# Apply function to the income_type



In [None]:
# Check if function worked



In [None]:
# Replacing missing values



[When you think you've finished with `days_employed`, check that the total number of values in this column matches the number of values in other ones.]

In [None]:
# Check the entries in all columns - make sure we fixed all missing values

## Categorization of data

[To answer the questions and test the hypotheses, you will want to work with categorized data. Look at the questions that were posed to you and that you should answer. Think about which of the data will need to be categorized to answer these questions. Below you will find a template through which you can work your way when categorizing data. The first step-by-step processing covers the text data; the second one addresses the numerical data that needs to be categorized. You can use both or none of the suggested instructions - it's up to you.]

[Regardless of how you decide to address the categorization, make sure to provide a clear explanation of why you made your decision. Remember: this is your work and you make all decisions in it.]


In [None]:
# Print the values for your selected data for categorization



[Let's check unique values]

In [None]:
# Check the unique values

[What main groups can you identify based on the unique values?]

[Based on these themes, we will probably want to categorize our data.]
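A keyword-based categorizer might look like the sketch below. The function name `purpose_category`, the keywords, the topic names, and the sample purposes are all illustrative assumptions; base yours on the unique values you actually find in the `purpose` column.

```python
def purpose_category(purpose):
    """Map a free-text loan purpose to a broad topic by keyword matching.

    Keywords and topic names are hypothetical examples.
    """
    text = purpose.lower()
    if "car" in text:
        return "car"
    if "wedding" in text:
        return "wedding"
    if "educat" in text or "university" in text:
        return "education"
    if "hous" in text or "property" in text or "real estate" in text:
        return "real estate"
    return "other"


# Made-up sample purposes to try the function on
samples = [
    "buying a second-hand car",
    "to have a wedding",
    "University education",
    "purchase of the house",
    "holiday",
]
for sample in samples:
    print(sample, "->", purpose_category(sample))
```

Plain substring matching is simple but blunt: a short keyword can also match inside unrelated words, so it is worth reviewing what ends up in each category.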


In [None]:
# Let's write a function to categorize the data based on common topics


In [None]:
# Create a column with the categories and count the values for them



[If you decide to categorize the numerical data, you'll need to come up with the categories for it too.]

In [None]:
# Looking through all the numerical data in your selected column for categorization


In [None]:
# Getting summary statistics for the column



[Decide what ranges you will use for grouping and explain why.]
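As an alternative to writing a custom function, `pd.cut` can assign range-based categories directly. The bin edges and labels below are hypothetical and the data is made up; choose ranges that you can justify from the summary statistics.

```python
import pandas as pd

# Made-up sample (hypothetical incomes)
df = pd.DataFrame({"total_income": [900.0, 1500.0, 2999.0, 3000.0, 7200.0]})

# Hypothetical bin edges and labels
bins = [0, 1000, 3000, float("inf")]
labels = ["low", "medium", "high"]

# right=False makes each bin include its lower edge and exclude its upper edge
df["income_level"] = pd.cut(df["total_income"], bins=bins, labels=labels, right=False)

print(df)
print(df["income_level"].value_counts().sort_index())
```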

In [None]:
# Creating function for categorizing into different numerical groups based on ranges



In [None]:
# Creating column with categories


In [None]:
# Count each category's values to see the distribution


## Checking the Hypotheses


**Is there a correlation between having children and paying back on time?**
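One way to approach questions of this kind is to group by the factor and aggregate the `debt` column. The sketch below assumes `debt` is coded as 1 for clients who had repayment problems and 0 otherwise, so that its mean within a group is the default rate; the data is made up.

```python
import pandas as pd

# Made-up sample (hypothetical values); debt is assumed to be coded 0/1
df = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 2, 2],
    "debt": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Number of clients, number of defaults and default rate per group
default_by_children = df.groupby("children")["debt"].agg(
    clients="count",
    defaults="sum",
    default_rate="mean",
)
print(default_by_children)
```

Keeping the group size next to the rate helps avoid reading too much into groups with only a handful of clients. The same pattern works for `family_status`, income level, or loan purpose by changing the grouping column.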

In [None]:
# Check the children data and paying back on time


# Calculating default-rate based on the number of children



**Conclusion**

[Write your conclusions based on your manipulations and observations.]


**Is there a correlation between family status and paying back on time?**

In [None]:
# Check the family status data and paying back on time



# Calculating default-rate based on family status



**Conclusion**

[Write your conclusions based on your manipulations and observations.]

**Is there a correlation between income level and paying back on time?**

In [None]:
# Check the income level data and paying back on time



# Calculating default-rate based on income level



**Conclusion**

[Write your conclusions based on your manipulations and observations.]

**How does credit purpose affect the default rate?**

In [None]:
# Check the percentages for default rate for each credit purpose and analyze them



**Conclusion**

[Write your conclusions based on your manipulations and observations.]


## General Conclusion

[List your conclusions in this final section. Make sure you include all your important conclusions you made that led you to the way you processed and analyzed the data. Cover the missing values, duplicates, and possible reasons and solutions for problematic artifacts that you had to address.]

[List your conclusions regarding the posed questions here as well.]
