Hi, I'm Sergey Russkikh. I'm here to help you improve the project. I will leave comments after the cells I'd like to discuss.

You can find my comments in <font color='green'>green</font>, <font color='amber'>yellow</font> or <font color='red'>red</font> boxes like this:

<div class="alert alert-block alert-success">
<b>Success:</b> if everything is done successfully
</div>

<div class="alert alert-block alert-warning">
<b>Remarks: </b> if I can give some recommendations
</div>

<div class="alert alert-block alert-danger">
<b>Needs fixing:</b> if the block requires some corrections. Work can't be accepted with red comments.
</div>
Please, do not delete my comments

## Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out if a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building a **credit scoring** of a potential customer. A **credit scoring** is used to evaluate the ability of a potential borrower to repay their loan.

<div class="alert alert-block alert-success">
<b>Success:</b> 
    
You've done a really good job!
Keep up the good work, and good luck on the next sprint!
</div>



### Step 1. Open the data file and have a look at the general information. 

In [1]:
import pandas as pd 
import numpy as np
credit_scoring_eng=pd.read_csv('/datasets/credit_scoring_eng.csv')
#print(credit_scoring_eng.head(10))
print(credit_scoring_eng.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB
None


### Conclusion

### Information About The Dataset

The dataset includes 12 columns: 

* children
* days_employed
* dob_years
* education
* education_id
* family_status
* family_status_id 
* gender
* income_type
* debt
* total_income
* purpose

Each column of the dataset contains 21,525 non-null entries, with the exception of days_employed and total_income, which contain 19,351 entries each (2,174 values are missing from each). 'days_employed' and 'total_income' are floating point numbers, and will likely need to be converted to integers for further analysis. This is also a likely source of error that should be investigated further.

# Comments

After loading the data, a few inconsistencies stand out. It looks like we are potentially missing some data on days_employed and total_income for some of the borrowers in the dataset. Alternatively, we may have an issue with duplicated data in the other columns. In order to answer the questions we initially set out to answer, we'll first need to clean up our data!
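One way to tell these two explanations apart is to count the missing values per column and check whether they fall on the same rows. The sketch below uses a small made-up frame (`df`) standing in for `credit_scoring_eng`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for credit_scoring_eng (values are made up).
df = pd.DataFrame({
    "children": [1, 0, 2, 0],
    "days_employed": [-8437.7, np.nan, -5623.4, np.nan],
    "total_income": [40620.1, np.nan, 23341.8, np.nan],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Check whether both columns are missing on exactly the same rows.
same_rows = (df["days_employed"].isna() == df["total_income"].isna()).all()
print("Missing on the same rows:", same_rows)
```

If both columns are missing on the same rows, the gaps more likely come from one upstream cause than from two independent problems.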

<div class="alert alert-block alert-success">
Data has been read, move on

</div>

### Step 2. Data preprocessing

### Processing missing values

In [2]:
#I'd probably want to delete these early comments, but I started off by exploring the comments and then trying to get value_count on each column.
#value_counts helps us to identify unique values and their counts. it's a good way of quickly checking the total of a column.
#since we don't know which columns have missing values, we should check each one. 
#print(credit_scoring_eng.head(10))
#print('number of children:', credit_scoring_eng['children'].value_counts())
#print('education:', credit_scoring_eng['education'].value_counts())
#print('date of birth:', credit_scoring_eng['dob_years'].value_counts())
#checking each one by hand is slow and boring. 
#let's speed this up with a loop.

#list_columns=['children', 'days_employed', 'dob_years','education','education_id','family_status',
             # 'family_status_id','gender','income_type','debt','total_income','purpose']
#for cheese in list_columns:
    #f"number of {cheese} : {credit_scoring_eng[cheese].value_counts()}"
    #I think there's a bug in the platform code runner so the fstring renders in a super unreadable format (http://share.karljtaylor.com/A0KnLd)
    
    #print("number of " + cheese + " : ", credit_scoring_eng[cheese].value_counts())
    
#alternatively, we can check the NaN values this way. 
#temp=credit_scoring_eng
#test=credit_scoring_eng.dropna()
#def dataframe_difference(df1, df2, which=None):
    #"""Find rows which are different between two DataFrames."""
    #comparison_df = df1.merge(df2,
     #                         indicator=True,
     #                         how='outer')
    #if which is None:
     #   diff_df = comparison_df[comparison_df['_merge'] != 'both']
    #else:
     #   diff_df = comparison_df[comparison_df['_merge'] == which]
    #return diff_df

#new_dance=dataframe_difference(temp, test, which=None)
#print(new_dance)

# it looks like two columns have missing information, days_employed & total_income.
# in the case of days_employed, we have a number of negative days. is it possible we're tracking total time unemployed in this column?
# since our analysis revolves around borrowers and their risk of default, let's fill the missing total_income values with the median,
# fill the missing days_employed values with the mean, and continue our analysis.

savg_income=credit_scoring_eng['total_income'].median()
credit_scoring_eng['total_income'] = credit_scoring_eng['total_income'].fillna(value=savg_income)
avg_days_employed=credit_scoring_eng['days_employed'].mean()
credit_scoring_eng['days_employed']=credit_scoring_eng['days_employed'].fillna(value=avg_days_employed)
print(credit_scoring_eng.info())
print(credit_scoring_eng.head(10))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       21525 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        21525 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB
None
   children  days_employed  dob_years            education  education_id  \
0         1   -8437.673028         42    bachelor's degree             0   
1         1   -4024.803754         36  secondary education             1   
2         0   -5623.422610         33  Secondary Education             1   
3         3   -4124

<div class="alert alert-block alert-warning">
<b>Remarks: </b> 
</div>
You can do this more concisely, for example:

df.loc[df.days_employed.isna(),'days_employed'] = df.days_employed.mean()

### Updated Two Missing Columns

We identified two columns, 'days_employed' and 'total_income', which were missing values in several rows. There may be other errors in these columns, as 'days_employed' currently reports a negative number of days in several rows. We calculated the median income and the mean number of days employed, then used .fillna() to replace the missing values in 'total_income' with the median and those in 'days_employed' with the mean. In a subsequent analysis, we can make this more precise by calculating an average for a specific category, but we should really try to find out why the days are negative first!
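As a minimal sketch of the per-category idea, assuming `income_type` is the grouping of interest (any other categorical column could be used), the missing incomes can be filled with the median of their own group. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two income types, each with one missing income (values are made up).
df = pd.DataFrame({
    "income_type": ["employee", "employee", "employee", "retiree", "retiree"],
    "total_income": [20000.0, 30000.0, np.nan, 15000.0, np.nan],
})

# Median income of each row's own group, aligned with the original index.
group_median = df.groupby("income_type")["total_income"].transform("median")

# Fill only the missing entries with their group's median.
df["total_income"] = df["total_income"].fillna(group_median)
print(df)
```

Here `transform` returns a series with the same index as `df`, which is what lets `fillna` line the group medians up with the rows that need them.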

### Data type replacement
During our previous work, we noticed that two columns in our dataset, days_employed and total_income, are stored as floats rather than integers. This will limit our ability to perform analysis on these columns, and might make it difficult to answer the questions we've set out to answer. Our data is already in a numeric format, so let's use .astype() to change these data types!

In [3]:
print(credit_scoring_eng.info())
days_employed_float=credit_scoring_eng['days_employed']
total_income_float=credit_scoring_eng['total_income']
credit_scoring_eng['days_employed']=credit_scoring_eng['days_employed'].astype('int')
credit_scoring_eng['total_income']=credit_scoring_eng['total_income'].astype('int')
credit_scoring_eng['purpose']=credit_scoring_eng['purpose'].astype('str')
print(credit_scoring_eng.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       21525 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        21525 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       21525 non-null int64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_s

### Conclusion

Using .astype() we were able to update those numbers! Let's continue by checking for duplicates! 
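One detail worth keeping in mind: converting floats with `.astype('int')` drops the fractional part instead of rounding. The short sketch below shows the difference on two illustrative values (not taken from the full dataset).

```python
import pandas as pd

# Two illustrative float values, one negative and one positive.
days = pd.Series([-8437.673028, 4024.803754])

# astype('int') discards the fractional part (truncates toward zero).
truncated = days.astype("int")

# Rounding first gives the nearest whole number instead.
rounded = days.round().astype("int")

print(truncated.tolist())
print(rounded.tolist())
```

For day counts and incomes of this size the difference is at most one unit, so either choice is unlikely to change the conclusions.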

<div class="alert alert-block alert-success">
<b>Success:</b> 
    all right
</div>

### Processing duplicates

Now that we've filled in the missing values, and made sure our data is the right type, let's start by looking for duplicates. We can do that by calling value_counts. We have a few columns, (like children, days_employed, dob_years, education_id, family_status_id, debt, total_income) where it's ok for more than one row to have the same number, so we can exclude those from our search. 

In [4]:
from collections import Counter

list_columns=['education','family_status',
              'gender','income_type','purpose']
for cheese in list_columns:
    #f"number of {cheese} : {credit_scoring_eng[cheese].value_counts()}"
    #I think there's a bug in the platform code runner so the fstring renders in a super unreadable format (http://share.karljtaylor.com/A0KnLd)
    
    print("number of " + cheese + " : ", credit_scoring_eng[cheese].value_counts())

number of education :  secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
GRADUATE DEGREE            1
Graduate Degree            1
Name: education, dtype: int64
number of family_status :  married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: family_status, dtype: int64
number of gender :  F      14236
M       7288
XNA        1
Name: gender, dtype: int64
number of income_type :  employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur        

### Yikes, We Have Duplicates! 

Oh no! It looks like we have a few different types of duplicates in our data. 

But first, here's the good news! It doesn't look like there are any duplicates in the family_status, gender, or income_type columns! 

The education column seems to have a wide variety of different capitalizations of the type of education. We'll be able to fix those by converting the strings! 

The purpose column is a mEeSs! We're going to have to use a lemmatizer to try and better group these duplicates!

## Updating Duplicate Strings

In [5]:
#print(credit_scoring_eng.head(10))
credit_scoring_eng['education']=credit_scoring_eng['education'].str.lower()
#print(credit_scoring_eng['education'].value_counts())

In [6]:
credit_scoring_eng['education'].value_counts()

secondary education    15233
bachelor's degree       5260
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64

<div class="alert alert-block alert-warning">
<b>Remarks: </b> 
It is better to do this without print()
</div>

<span style="color:blue">totally fair. I'll comment them out here. I've generally been using print to sanity check outputs along the way. Do you have some recommendations on different approaches here? </span>



<div class="alert alert-block alert-warning">
<b>Remarks: </b> 
just use different cells
</div>

### Education Column Is Fixed! 

Now that's better. We've updated the education column, and now we can see the difference some education makes! We can probably improve this even more by determining the difference between 'secondary,' 'bachelor's degree,' 'some college,' and 'graduate degree,' as the extra precision here will improve our model!

In some datasets, we might want to drop the additional values here, but since each borrower's education type is an important factor in their creditworthiness, we need to keep the data! 
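If exact duplicate rows are also a concern, they only become visible once the capitalization is normalized. The sketch below, on a made-up three-row frame, counts full-row duplicates before and after lowercasing and shows how they could be dropped; whether dropping is appropriate for the real data is a judgment call.

```python
import pandas as pd

# Toy frame: rows 0 and 1 differ only in the capitalization of education.
df = pd.DataFrame({
    "education": ["secondary education", "Secondary Education", "bachelor's degree"],
    "total_income": [23000, 23000, 41000],
    "debt": [0, 0, 0],
})

# Full-row duplicates before normalizing the text.
before = df.duplicated().sum()

# Normalize capitalization, then count again.
df["education"] = df["education"].str.lower()
after = df.duplicated().sum()

# Drop the repeated rows and rebuild a clean index.
deduplicated = df.drop_duplicates().reset_index(drop=True)
print(before, after, len(deduplicated))
```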

## Parsing The Purpose 

It looks like our 'purpose' field was open to applicant input, because we've got a whole bunch of different purposes. Let's take a closer look!

In [7]:
print(credit_scoring_eng['purpose'].value_counts())

wedding ceremony                            797
having a wedding                            777
to have a wedding                           774
real estate transactions                    676
buy commercial real estate                  664
housing transactions                        653
buying property for renting out             653
transactions with commercial real estate    651
purchase of the house                       647
housing                                     647
purchase of the house for my family         641
construction of own property                635
property                                    634
transactions with my real estate            630
building a real estate                      626
buy real estate                             624
building a property                         620
purchase of my own house                    620
housing renovation                          612
buy residential real estate                 607
buying my own car                       

### Fixing the Duplicates
Looking at the list, we can see a number of similar entries, like 'getting an education,' 'education,' 'getting higher education,' etc! Let's apply lemmatization to help us find the purpose!

In [8]:
# nltk.download('averaged_perceptron_tagger')

In [9]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')
# I needed this download on first load; it is skipped if the tagger is already up to date.

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts."""
    tag=nltk.pos_tag([word])[0][1][0]
    tag_dict = {"J": wordnet.ADJ, 
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV,}
    return tag_dict.get(tag, wordnet.NOUN)

wordnet_lemma = WordNetLemmatizer()

#for purpose in credit_scoring_eng['purpose']:
    #credit_scoring_eng['purpose_pos']=get_wordnet_pos(purpose)
    #print(get_wordnet_pos(purpose))
lemma_strings=[]
for purpose in credit_scoring_eng['purpose']:
    lemmatized=[wordnet_lemma.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(purpose)]
    lemma_strings.append(lemmatized)
    #print('added: ' + lemmatized)
    
credit_scoring_eng['purpose_lemmas']=lemma_strings
    
def categorize_purpose(lemmas):
    """Map a list of purpose lemmas to a coarse loan category."""
    if 'car' in lemmas:
        return "car_purchase"
    elif "wedding" in lemmas:
        return "finance_wedding"
    elif "construction" in lemmas or "build" in lemmas or "renovation" in lemmas:
        return "construction"
    elif "real" in lemmas or "house" in lemmas or "housing" in lemmas:
        return "mortgage"
    elif "education" in lemmas or "university" in lemmas or "educate" in lemmas:
        return "education"
    else:
        return "other"
                

credit_purpose_simplified=credit_scoring_eng['purpose_lemmas'].apply(categorize_purpose)
credit_scoring_eng['purpose_category']=credit_purpose_simplified
print(credit_scoring_eng['purpose_category'].value_counts())      


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


mortgage           7686
car_purchase       4315
education          4022
finance_wedding    2348
other              1907
construction       1247
Name: purpose_category, dtype: int64


In [10]:
# credit_scoring_eng['purpose_lemmas'].value_counts()

<div class="alert alert-block alert-warning">
<b>Needs fixing:</b>
</div>
We lost education

<span style="color:blue">Good call out. Looks like we lost more than just education, also 'construction' </span>

<div class="alert alert-block alert-warning">
<b>Remarks:</b>
</div>

We probably have 4 categories:
* Housing
* car_purchase       
* wedding
* education

### We Got It! 

People sure use a lot of words to describe something simple! We've created a new column, purpose_category which will help us speed up our analysis. To populate that column with data, we first iterated over the 'purpose' column and identified the part of speech for each purpose. We then grouped the results into categories, and relied on lemmatization to create a simple set of categories we can use to further our analysis. 

## We Identified, and Removed Two Kinds of Duplicates

We found duplicate data in both the 'education' and 'purpose' categories. 

The duplicates we found in the 'education' column were most likely the result of inconsistent capitalization throughout the field. This can happen when there's an issue with a form entry or the information is being collected in a nonstandardized way. By converting the strings in the 'education' column to lower case, we were able to quickly identify the duplicate entries. In subsequent exploration, we might want to learn more about what the different categories of college are, so that we could further simplify our model.

We also found duplicates in the 'purpose' column, which reports on the use of proceeds for a loan. Because these responses can vary widely, we first needed to re-organize the data contained in the column. We did this by analyzing the parts of speech in the response, and using a function to categorize what we found. In this way we were able to group together ideas like 'to purchase real estate' and 'to buy a house' as being _mortgage_ borrowing.
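For comparison, the four categories suggested in the review (housing, car, wedding, education) can also be reached with a plain keyword lookup on the raw text, without lemmatization. This is a simplified sketch, not the method used above: the keyword sets are assumptions chosen from the purposes visible in the output, and the purposes in the real column may need more keywords.

```python
def categorize_purpose_simple(purpose: str) -> str:
    """Assign a loan purpose to one of four broad categories by keyword."""
    words = set(purpose.lower().split())
    if words & {"car", "cars"}:
        return "car"
    if "wedding" in words:
        return "wedding"
    if words & {"education", "educated", "university"}:
        return "education"
    if words & {"housing", "house", "property", "estate"}:
        return "housing"
    return "other"


# Purposes copied from the value_counts output above.
examples = [
    "wedding ceremony",
    "buy commercial real estate",
    "housing renovation",
    "buying my own car",
    "getting higher education",
    "building a property",
]
for text in examples:
    print(text, "->", categorize_purpose_simple(text))
```

Any purpose that lands in "other" is a signal that a keyword is missing, which makes the gaps easy to spot with `value_counts()`.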

### Categorizing Data
Our dataframe, credit_scoring_eng, contains much more data than we're going to need for our analysis. In order to make our report as clear as possible, we should focus on better categorizing our data so that we can present the information needed to make a decision. In addition, since we'll also need to calculate some basic statistics like a default rate, we have a little more work to do. It wouldn't be very pleasant to try and read our 'total_income' column value by value, so we should group that data and deal with any surprises in the data (like the 47 people who have -1 children!).

In [11]:

def group_children(children):
    """Count borrowers recorded with -1 children together with those who have 0."""
    return max(children, 0)

credit_scoring_eng['children_adjusted'] = credit_scoring_eng['children'].apply(group_children)

# Inclusive upper bound and label for each income band, in ascending order.
INCOME_BANDS = [
    (10000, "0-10000"),
    (20000, "10000-20000"),
    (30000, "20001-30000"),
    (40000, "30001-40000"),
    (60000, "40000-60000"),
    (80000, "60001-80000"),
    (100000, "80001-100000"),
    (150000, "100001-150000"),
    (200000, "150001-200000"),
    (250000, "200001-250000"),
    (300000, "250001-300000"),
    (350000, "300000-350000"),
]

def organize_income(income):
    """Return the label of the first band whose upper bound covers the income."""
    for upper_bound, label in INCOME_BANDS:
        if income <= upper_bound:
            return label
    return "350000+"

credit_scoring_eng['income_range'] = credit_scoring_eng['total_income'].apply(organize_income)

# debt is a 0/1 flag, so the sum divided by the row count is the share of borrowers who defaulted.
default_rate = credit_scoring_eng['debt'].sum() / len(credit_scoring_eng['debt'])
print("The default rate for all loans is: " + str(default_rate))

The default rate for all loans is: 0.08088269454123112


In [12]:
# credit_scoring_eng.children.value_counts()

<div class="alert alert-block alert-warning">
<b>Remarks:</b> 
</div>
We need to exclude or replace outliers in the data (-1,20)

<span style="color:blue">I deal with -1 down below in the categorization stage, and would have applied a similar approach to 20, but I spent a little time thinking about it, and in both cases I'm not sure declaring them 'outliers' is entirely warranted. In the case of the -1, it seems very likely that this could be an issue related to the database, and since one of the conclusions we draw is that having child==increased default risk, this is probably worth scrutinizing. One thing I did earlier on and ended up deleting was using median instead of mean, but I realized that since our measurement is on a 0-1 default rate, the median would hide more information than it would share. happy to rethink this if needed, but that's sort of what my frame of thought was</span>

You can assume that these are typos and do the following -1 = 1, 20 = 2
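The replacement suggested here could be sketched as follows, assuming the -1 and 20 entries really are typos. The series is a small made-up stand-in for the `children` column.

```python
import pandas as pd

# Toy stand-in for the children column, including the two suspicious values.
children = pd.Series([1, 0, -1, 20, 3])

# Map the assumed typos to their intended values; other values are untouched.
children_fixed = children.replace({-1: 1, 20: 2})
print(children_fixed.tolist())
```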


## Conclusion

We identified a few places where our data could benefit from extra classification. 

First, we counted people who had -1 children in with individuals who had 0 children. One possible reason for this is that individuals recorded with negative children are likely not currently raising children, like the respondents with 0 children. Another likely explanation is an issue with the database we got the data from. We should follow up our analysis by further scrutinizing the data. It is possible, for example, that these respondents have 1 child and the 47 rows contain an error. If this is the case, it will skew our analysis! In the case of 20 children, while this number is large, it may not necessarily be invalid.

Secondly, reading through the incomes one entry at a time wouldn't be particularly informative and likely wouldn't have yielded any actionable insight, so we created a function to group the total_income column into a few predefined ranges. At smaller incomes, we used narrower ranges, as additional income is likely to have a greater impact on the default rate there.
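The same kind of grouping can be expressed with `pd.cut`, which keeps the band edges and labels in one place. This is a sketch with a shortened, hypothetical set of edges rather than the exact bands used in the function above.

```python
import numpy as np
import pandas as pd

# Illustrative incomes (made up).
income = pd.Series([9500, 10000, 25000, 59000, 120000, 400000])

# Band edges; each band includes its upper edge, and the last band is open-ended.
edges = [0, 10_000, 20_000, 30_000, 40_000, 60_000, 80_000, 100_000, np.inf]
labels = [
    "0-10000", "10001-20000", "20001-30000", "30001-40000",
    "40001-60000", "60001-80000", "80001-100000", "100001+",
]

# include_lowest=True makes the first band include an income of exactly 0.
income_range = pd.cut(income, bins=edges, labels=labels, include_lowest=True)
print(income_range.astype(str).tolist())
```

Because the result is an ordered categorical, grouping by it keeps the bands in ascending order instead of sorting the labels alphabetically.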

Finally, answering the questions below will require us to know a lot about what conditions result in a higher risk of default. In order to accomplish this comparison, we calculated a baseline default rate by performing calculations on our dataset. 

### Step 3. Answering Questions About The Data

Now that we've processed, deduplicated, and categorized our data, we're finally ready to begin answering some key questions for the team! 

### How Does Having Children Impact Rate Of Default?
- Is there a relation between having kids and repaying a loan on time?

In [13]:
# print(credit_scoring_eng.pivot_table(index='children_adjusted', values=['debt'],aggfunc=['count','mean']))
children_default_risk = (
    credit_scoring_eng.groupby(['children_adjusted'])['debt']
    .agg(['count', 'mean'])
    .sort_values('count', ascending=False)
    .style.format({'mean': '{:,.1%}'.format})
    .applymap(lambda x: 'background-color: orange' if x > default_rate else '', subset=['mean'])
)
children_default_risk

| children_adjusted | count | mean |
| --- | --- | --- |
| 0 | 14196 | 7.5% |
| 1 | 4818 | 9.2% |
| 2 | 2055 | 9.4% |
| 3 | 330 | 8.2% |
| 20 | 76 | 10.5% |
| 4 | 41 | 9.8% |
| 5 | 9 | 0.0% |


### Borrowers With Children Have An Increased Risk Of Default 

With the exception of borrowers who have 5 children (a group of only 9 borrowers, none of whom defaulted), every group of borrowers with children had a higher default rate than our baseline of about 8.1%. This suggests that we will want to apply additional scrutiny when scoring applicants with children.
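Since some groups are very small, it can help to flag them before reading too much into their rates. The sketch below uses made-up data and a hypothetical cut-off (`MIN_GROUP_SIZE`); the right threshold for the real dataset is a judgment call.

```python
import pandas as pd

# Toy data: 12 borrowers in three groups (values are made up).
df = pd.DataFrame({
    "children": [0] * 6 + [1] * 4 + [5] * 2,
    "debt": [0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0],
})

# Overall default rate across all borrowers.
baseline = df["debt"].mean()

# Hypothetical minimum group size below which a rate is treated as unreliable.
MIN_GROUP_SIZE = 4

# Group size and default rate per number of children.
summary = df.groupby("children")["debt"].agg(["count", "mean"])
summary["above_baseline"] = summary["mean"] > baseline
summary["small_sample"] = summary["count"] < MIN_GROUP_SIZE
print(summary)
```

A group flagged as `small_sample` can still be reported, but its rate is better treated as inconclusive than as evidence of lower or higher risk.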

## What Impact Does Marital Status Have On Loan Payments
- Is there a relation between marital status and repaying a loan on time?

In [14]:
#print(credit_scoring_eng.pivot_table(index='family_status', values=['debt'],aggfunc=['count','mean']))

relationship_default_risk = (
    credit_scoring_eng.groupby(['family_status'])['debt']
    .agg(['count', 'mean'])
    .sort_values('count', ascending=False)
    .style.format({'mean': '{:,.1%}'.format})
    .applymap(lambda x: 'background-color: orange' if x > default_rate else '', subset=['mean'])
)
relationship_default_risk

| family_status | count | mean |
| --- | --- | --- |
| married | 12380 | 7.5% |
| civil partnership | 4177 | 9.3% |
| unmarried | 2813 | 9.7% |
| divorced | 1195 | 7.1% |
| widow / widower | 960 | 6.6% |


### Unmarried Borrowers and Borrowers In Partnerships Have an Increased Rate of Default

If we evaluate the default rate for borrowers by their marital status, we find that borrowers who are unmarried (9.7%) or in a civil partnership (9.3%) have a higher default rate than the baseline of about 8.1%.

## What Impact Does Income Have On Default?
- Is there a relation between income level and repaying a loan on time?

In [15]:
income_default_risk = (
    credit_scoring_eng.groupby(['income_range'])['debt']
    .agg(['count', 'mean'])
    .sort_values('count', ascending=False)
    .style.format({'mean': '{:,.1%}'.format})
    .applymap(lambda x: 'background-color: orange' if x > default_rate else '', subset=['mean'])
)
income_default_risk

| income_range | count | mean |
| --- | --- | --- |
| 20001-30000 | 8236 | 8.5% |
| 10000-20000 | 6443 | 8.5% |
| 30001-40000 | 3107 | 7.8% |
| 40000-60000 | 2140 | 7.3% |
| 0-10000 | 927 | 6.3% |
| 60001-80000 | 450 | 5.3% |
| 80001-100000 | 123 | 6.5% |
| 100001-150000 | 71 | 5.6% |
| 150001-200000 | 17 | 5.9% |
| 200001-250000 | 5 | 0.0% |


### Some Incomes At Elevated Risk of Default

Borrowers with incomes between $10,000 and $30,000, or of $350,000 and above, have an increased risk of default.

## Use of Funds 
- How do different loan purposes affect on-time repayment of the loan?

In [16]:
#print(credit_scoring_eng.pivot_table(index='purpose_category',values=['debt'],aggfunc=['count', 'mean']))
purpose_default_risk = (
    credit_scoring_eng.groupby(['purpose_category'])['debt']
    .agg(['count', 'mean'])
    .sort_values('count', ascending=False)
    .style.format({'mean': '{:,.1%}'.format})
    .applymap(lambda x: 'background-color: orange' if x > default_rate else '', subset=['mean'])
)
purpose_default_risk

| purpose_category | count | mean |
| --- | --- | --- |
| mortgage | 7686 | 7.2% |
| car_purchase | 4315 | 9.3% |
| education | 4022 | 9.2% |
| finance_wedding | 2348 | 7.9% |
| other | 1907 | 7.8% |
| construction | 1247 | 6.2% |


### Auto Loans and Education Have Increased Risk of Default

Our analysis shows that borrowers who are using funds for financing an auto loan or an education have an increased risk of default. 

## Step 4. General conclusion

We began our project with four specific questions we needed to answer. Through our analysis, we identified that: 

- Some borrowers with children are at increased risk of default. 
- Unmarried and partnered borrowers are at increased risk of default. 
- Borrowers with an income between 10,000 and 30,000 dollars are at increased risk of default. 
- Borrowers with an income greater than 350,000 are at the highest risk of default. 
- Borrowers using proceeds for auto purchases are at increased risk of default. 
- Borrowers using proceeds for education are at an increased risk of default. 

<div class="alert alert-block alert-success">

### Code

Everything is fine. I am very pleased: the project structure is followed, the steps of the job are identified and executed sequentially, the code is written carefully and commented so that each operation is quick to understand, and the variable names convey the meaning of the operations. As a tip, I suggest that you study the try-except construction more thoroughly and start applying it in your projects. This will improve the code's fault tolerance and protect it from future breakdowns.

### Conclusions

You are very good at analyzing complex data, making correct hypotheses, and checking your conclusions against reality. You can see a deep understanding of the essence of the analysis. It was very interesting to check your project and follow your train of thought, keep it up!
</div>
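As a small illustration of the try-except tip, the data loading step could be wrapped so that a missing file produces a readable message instead of a traceback. The helper name `load_credit_data` and the example path are hypothetical; this is a sketch, not part of the submitted project.

```python
import pandas as pd


def load_credit_data(path):
    """Read the CSV at path, or return None with a message if it is missing."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"Could not find {path}; check the dataset location.")
        return None


# A path that is assumed not to exist, to show the fallback behaviour.
data = load_credit_data("/datasets/this_file_does_not_exist.csv")
print(data is None)
```

Catching only `FileNotFoundError` keeps other problems, such as a malformed file, visible instead of silently hiding them.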


### Project Readiness Checklist

Put 'x' in the completed points. Then press Shift + Enter.

- [x]  file open;
- [x]  file examined;
- [x]  missing values defined;
- [x]  missing values are filled;
- [x]  an explanation of which missing value types were detected;
- [x]  explanation for the possible causes of missing values;
- [x]  an explanation of how the blanks are filled;
- [x]  replaced the real data type with an integer;
- [x]  an explanation of which method is used to change the data type and why;
- [x]  duplicates deleted;
- [x]  an explanation of which method is used to find and remove duplicates;
- [x]  description of the possible reasons for the appearance of duplicates in the data;
- [x]  data is categorized;
- [x]  an explanation of the principle of data categorization;
- [x]  an answer to the question "Is there a relation between having kids and repaying a loan on time?";
- [x]  an answer to the question "Is there a relation between marital status and repaying a loan on time?";
- [x]  an answer to the question "Is there a relation between income level and repaying a loan on time?";
- [x]  an answer to the question "How do different loan purposes affect on-time repayment of the loan?";
- [x]  conclusions are present on each stage;
- [x]  a general conclusion is made.