Hello, my name is **Lyuman** and I'm going to review your project!

You can find my comments in <font color='green'>green</font>, <font color='blue'>blue</font> or <font color='red'>red</font> boxes like this:

<div class="alert alert-block alert-success">
<b>Success:</b> if everything is done successfully
</div>

<div class="alert alert-block alert-info">
<b>Improve: </b> "Improve" comments mean that there are tiny corrections that could help you to make your project better.
</div>

<div class="alert alert-block alert-danger">
<b>Needs fixing:</b> if the block requires some corrections. Work can't be accepted with the red comments.
</div>


<font color='orange' style='font-size:24px; font-weight:bold'>General feedback</font>
* Thank you so much for submitting your project!
* I'm glad to say that you executed your project really well.
* I left some tips, I would like you to pay attention to them.
* There are a couple of things that need to be done before your project is complete, but they're pretty straightforward.
* This will be an easy fix for you!
* You're almost there!

<font color='orange' style='font-size:24px; font-weight:bold'>General feedback[2]</font>
- Thanks for sending in your project with corrections. It's clear you've put a lot of effort into it.
- I'm really glad to see that the parts with missing values and duplicates are much improved! 
- However, there is one cell with an execution error. Could you take a look at it?
- I would recommend using the 'Kernel -> Restart & run all' option before sending the project.
- Also, please update the findings in the last step a little.
- One more time and you'll have it!

<font color='orange' style='font-size:24px; font-weight:bold'>General feedback[3]</font>
- Your corrections look great, you've improved your work significantly!
- Now your project is a true "A". Congratulations!
- Your project has been accepted and you can go to the next sprint!
- Keep at it. You're improving every day!

# Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out if a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building the **credit scoring** of a potential customer. **Credit scoring** is used to evaluate the ability of a potential borrower to repay their loan.

# Step 1. Open the data file and have a look at the general information. 
## Description of the data
- `children`: the number of children in the family

- `days_employed`: how long the customer has been working

- `dob_years`: the customer’s age

- `education`: the customer’s education level

- `education_id`: identifier for the customer’s education

- `family_status`: the customer’s marital status

- `family_status_id`: identifier for the customer’s marital status

- `gender`: the customer’s gender

- `income_type`: the customer’s income type

- `debt`: whether the customer has ever defaulted on a loan

- `total_income`: monthly income

- `purpose`: reason for taking out a loan

<div class="alert alert-block alert-info">
<b>Improve: </b> Thank you for the project description. It would be good to see a plan of action for the project too.
</div>

## Plan of actions:
1. From the description of the data I see that **credit scoring** can be deduced from `days_employed`, `dob_years`, `total_income` and `purpose`.

In [1]:
from __future__ import division
from IPython.display import display 
import pandas as pd

<div class="alert alert-block alert-info">
<b>Improve: </b> It is better to collect all imports in the first cell.<br> If a user or a customer launches your entire project and lacks a package, it is better for the notebook to fail at the beginning than somewhere in the middle. <br> This saves time, as projects sometimes take hours or days to run. It also increases the readability of the entire notebook.
</div>

In [2]:
try:
    credit_scoring = pd.read_csv('credit_scoring_eng.csv')
except FileNotFoundError:
    credit_scoring = pd.read_csv('/datasets/credit_scoring_eng.csv')
credit_scoring.info()
print()
print()
credit_scoring_missing = credit_scoring.isna()
credit_scoring_missing_num = credit_scoring_missing.sum()
print('What are the columns with the missings:\n{}'.format(credit_scoring.loc[:, credit_scoring.isnull().any()].columns))
print()
print('Total number of missing values in each column:\n{}'.format(credit_scoring_missing_num))
print()
print('Number of rows in our DataFrame:\n{:}'.format(len(credit_scoring)))
print()
print("percentage of missing value in days_employed column:\n{:.2%}".format(credit_scoring_missing_num['days_employed']/len(credit_scoring))) 
print()
print("percentage of missing value in total_income column:\n{:.2%}".format(credit_scoring_missing_num['total_income']/len(credit_scoring))) 
print()
print(credit_scoring.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


What are the columns with the missings:
Index(['days_employed', 'total_income'], dtype='object')

Total number of missing values in each column:
children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender

In [3]:
#A general view to 'days_employed' column
display(credit_scoring)
print(credit_scoring['days_employed'].head())
print()
print(credit_scoring['days_employed'].tail())
print()
print('Apparently we have many negative values in days_employed column, lets check:\n{:}'.format(credit_scoring[credit_scoring['days_employed'] < 0]['days_employed'].count()))
print()
print('Positive values?\n{:}'.format(credit_scoring[credit_scoring['days_employed'] >= 0]['days_employed'].count()))
print()
print('Lets see if we have meaningnfull values regarding dob_years column\n')
print('Calculating mean of days_employed per years\n{:}'.format(abs((credit_scoring['days_employed'].mean()))/365))
print()
print('Calculating mean of dob_years\n{:}'.format(credit_scoring['dob_years'].mean()))
print()
print('Lets see now for each negative and positive values')
print('Positive values mean per years\n{:}'.format(credit_scoring[credit_scoring['days_employed'] >= 0]['days_employed'].mean()/365))
print()
print('Negative values mean per years\n{:}'.format(abs(credit_scoring[credit_scoring['days_employed'] < 0]['days_employed'].mean())/365))


| | children | days_employed | dob_years | education | education_id | family_status | family_status_id | gender | income_type | debt | total_income | purpose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -8437.673028 | 42 | bachelor's degree | 0 | married | 0 | F | employee | 0 | 40620.102 | purchase of the house |
| 1 | 1 | -4024.803754 | 36 | secondary education | 1 | married | 0 | F | employee | 0 | 17932.802 | car purchase |
| 2 | 0 | -5623.422610 | 33 | Secondary Education | 1 | married | 0 | M | employee | 0 | 23341.752 | purchase of the house |
| 3 | 3 | -4124.747207 | 32 | secondary education | 1 | married | 0 | M | employee | 0 | 42820.568 | supplementary education |
| 4 | 0 | 340266.072047 | 53 | secondary education | 1 | civil partnership | 1 | F | retiree | 0 | 25378.572 | to have a wedding |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21520 | 1 | -4529.316663 | 43 | secondary education | 1 | civil partnership | 1 | F | business | 0 | 35966.698 | housing transactions |
| 21521 | 0 | 343937.404131 | 67 | secondary education | 1 | married | 0 | F | retiree | 0 | 24959.969 | purchase of a car |
| 21522 | 1 | -2113.346888 | 38 | secondary education | 1 | civil partnership | 1 | M | employee | 1 | 14347.610 | property |
| 21523 | 3 | -3112.481705 | 38 | secondary education | 1 | married | 0 | M | employee | 1 | 39054.888 | buying my own car |


0     -8437.673028
1     -4024.803754
2     -5623.422610
3     -4124.747207
4    340266.072047
Name: days_employed, dtype: float64

21520     -4529.316663
21521    343937.404131
21522     -2113.346888
21523     -3112.481705
21524     -1984.507589
Name: days_employed, dtype: float64

Apparently we have many negative values in days_employed column, lets check:
15906

Positive values?
3445

Lets see if we have meaningnfull values regarding dob_years column

Calculating mean of days_employed per years
172.73013057937914

Calculating mean of dob_years
43.29337979094077

Lets see now for each negative and positive values
Positive values mean per years
1000.011807989777

Negative values mean per years
6.446618991777744


In [4]:
print('Lets see if we have any duplicate rows??\n{:}'.format(credit_scoring.duplicated().sum()))

Lets see if we have any duplicate rows??
54


Let's see if we can deduce some possible scenarios of missing values.

In [5]:
credit_scoring.groupby('education')['days_employed'].count()

education
BACHELOR'S DEGREE        251
Bachelor's Degree        243
GRADUATE DEGREE            1
Graduate Degree            1
PRIMARY EDUCATION         16
Primary Education         14
SECONDARY EDUCATION      705
SOME COLLEGE              22
Secondary Education      646
Some College              40
bachelor's degree       4222
graduate degree            4
primary education        231
secondary education    12342
some college             613
Name: days_employed, dtype: int64

In [6]:
#strong relation between NAs and `secondary education` rows
credit_scoring.loc[credit_scoring['days_employed'].isna(),'education_id'].value_counts()

1    1540
0     544
2      69
3      21
Name: education_id, dtype: int64

In [7]:
credit_scoring.loc[credit_scoring['days_employed'].isna(),'education'].value_counts()

secondary education    1408
bachelor's degree       496
SECONDARY EDUCATION      67
Secondary Education      65
some college             55
Bachelor's Degree        25
BACHELOR'S DEGREE        23
primary education        19
Some College              7
SOME COLLEGE              7
PRIMARY EDUCATION         1
Primary Education         1
Name: education, dtype: int64

In [8]:
credit_scoring.loc[credit_scoring['total_income'].isna(),'education'].value_counts()

secondary education    1408
bachelor's degree       496
SECONDARY EDUCATION      67
Secondary Education      65
some college             55
Bachelor's Degree        25
BACHELOR'S DEGREE        23
primary education        19
Some College              7
SOME COLLEGE              7
PRIMARY EDUCATION         1
Primary Education         1
Name: education, dtype: int64

In [9]:
credit_scoring.loc[credit_scoring['days_employed'].isna(),'gender'].value_counts()

F    1484
M     690
Name: gender, dtype: int64

In [10]:
credit_scoring.loc[credit_scoring['days_employed'].isna(),'income_type'].value_counts()

employee         1105
business          508
retiree           413
civil servant     147
entrepreneur        1
Name: income_type, dtype: int64

### Conclusion

**Column types**
- `children`: Numeric

- `days_employed`: Numeric

- `dob_years`: Numeric

- `education`: Categorical

- `education_id`: Numeric

- `family_status_id`: Numeric

- `family_status`: Categorical

- `gender`: Boolean

- `income_type`: Categorical

- `debt`: Boolean

- `total_income`: Numeric

- `purpose`: Categorical

**Missing values**  
1. We have 2174 *NaN* values in each of the `days_employed` and `total_income` columns (about 10% of the rows in each column).
    - The identical share is probably because the two columns are related to each other (total income presumably depends on how long the client has worked).
2. Most of the missing values fall in `secondary education` rows (1540 of 2174), so maybe some clients with secondary education don't have an official or stable job and a stable income. Keep in mind that `secondary education` is also by far the largest group in the data, so this share may simply mirror the overall distribution.
3. Missing values type, *MNAR*: I don't have any exact information about the source of this missing data, and it's not obvious. Maybe the data were lost during conversion to CSV...
4. The `days_employed` column contains many negative values (15906), which cannot be correct.
5. There is a huge gap between the positive `days_employed` values and `dob_years` (the days employed are far larger than the customers' ages).
6. Also, the mean of the positive `days_employed` values is about 1000 years, which is illogical.
7. 54 duplicate rows.
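
The first two points can be checked directly. Below is a minimal sketch on a small invented table (the `toy` frame and its values are assumptions for illustration, not the project data): it tests whether the two columns are missing in exactly the same rows, and compares the education mix of the rows with missing values against the mix of the whole table.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the values are invented for illustration.
toy = pd.DataFrame({
    'education_id': [1, 1, 0, 1, 2, 1, 0, 1],
    'days_employed': [-100.0, np.nan, -250.0, np.nan, -30.0, -75.0, np.nan, -410.0],
    'total_income': [2000.0, np.nan, 3500.0, np.nan, 1800.0, 2200.0, np.nan, 2600.0],
})

# True when both columns are missing in exactly the same rows.
same_rows = (toy['days_employed'].isna() == toy['total_income'].isna()).all()

# Share of each education level among all rows and among rows with missing values.
missing_mask = toy['days_employed'].isna()
shares = pd.DataFrame({
    'all_rows': toy['education_id'].value_counts(normalize=True),
    'missing_rows': toy.loc[missing_mask, 'education_id'].value_counts(normalize=True),
}).fillna(0)

print(same_rows)
print(shares)
```

If the two share columns are close, the missing values are spread across education levels in roughly the same proportion as the data itself, which would weaken the idea that education explains them.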

<div class="alert alert-block alert-success">
<b>Success:</b> Data loading and initial analysis are well done.
</div>

## Step 2. Data preprocessing

### Processing missing values

In [11]:
print('Identifying what family status we do have:\n{:}'.format(credit_scoring['family_status'].unique()))
print()
print('Identifying what family status ids we do have:\n{:}'.format(credit_scoring['family_status_id'].unique()))
credit_scoring['family_status'].value_counts()

Identifying what family status we do have:
['married' 'civil partnership' 'widow / widower' 'divorced' 'unmarried']

Identifying what family status ids we do have:
[0 1 2 3 4]


married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: family_status, dtype: int64

- Well I think those two columns stand for the same thing!

- Now let's see what to do about our missing values!
- 'days_employed' and 'total_income' are quantitative columns, so we can replace missing data with representative values!
- Let's start by analysing 'total_income', since its only problem is missing values (no negative values). Is it related to any other column?
- Apparently we have 'income_type', let's see:

In [12]:
print(credit_scoring.groupby('income_type')['total_income'].sum())

income_type
business                       1.482344e+08
civil servant                  3.587497e+07
employee                       2.585699e+08
entrepreneur                   7.986610e+04
paternity / maternity leave    8.612661e+03
retiree                        7.554078e+07
student                        1.571226e+04
unemployed                     4.202872e+04
Name: total_income, dtype: float64


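- The totals differ by several orders of magnitude between income types, and 'total_income' appears to contain some serious outliers.
- The median is less sensitive to outliers than the mean, so we can use it to fill the missing values.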
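
Since 'income_type' looks related to income, one possible refinement is to fill each gap with the median of its own income type rather than a single overall median. This is only a sketch on invented numbers, and it is not what the notebook does in the next cell. The column `total_income_filled` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Invented example rows; only the column names follow the project data.
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree', 'retiree'],
    'total_income': [20000.0, 22000.0, np.nan, 9000.0, np.nan, 11000.0],
})

# Median income within each income type, broadcast back to every row.
group_median = toy.groupby('income_type')['total_income'].transform('median')

# Fill only the missing entries; observed values are left untouched.
toy['total_income_filled'] = toy['total_income'].fillna(group_median)
print(toy)
```

Groups with very few rows (for example a single student) would get an unstable median, so a fallback to the overall median may still be needed.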


In [13]:
median_income = credit_scoring['total_income'].median()
credit_scoring['total_income'] = credit_scoring['total_income'].fillna(value = median_income).reset_index(drop=True)
print('Total number of missing values in each column:\n{}'.format(credit_scoring.isna().sum()))

Total number of missing values in each column:
children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income           0
purpose                0
dtype: int64


**total_income & days_employed:** 
- These two columns have the same number of missing values, which makes sense I think, since there should be a relation between them. 
- Negative values in `days_employed` are maybe due to formatting problems when the data were saved to Excel. 

- OK, cool!

- Let's see what to do with 'days_employed' now!
- We already found some negative values in this column, let's check other columns.


In [14]:
quantitative_cols = ['children','dob_years','education_id','family_status_id','debt','total_income']
display_count_zero = [credit_scoring[credit_scoring[c] < 0][c].count() for c in quantitative_cols]
for i in range(len(quantitative_cols)):
    print(quantitative_cols[i]+' '+str(display_count_zero[i]))

children 47
dob_years 0
education_id 0
family_status_id 0
debt 0
total_income 0


- As we see over here, 'children' contains negative values too, which is impossible.

In [15]:
credit_scoring['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

In [16]:
#checking for 'dob_years'...
credit_scoring['dob_years'].value_counts()
#I can see 101 rows with age '0' here, 
print(credit_scoring[credit_scoring['dob_years'] == 0]['dob_years'].count())

101


1. `children`: 
    - A value of 20 is a bit weird for the number of children; it might be a data entry mistake. Maybe it's a 2 and not a 20. 
    - For the -1 values, a negative number of children is impossible, so they are probably entry mistakes too.
2. `dob_years`: 0 is an impossible age. 
3. `days_employed` & `total_income`: quantitative data, so we can replace NAs with a representative value such as the median or mean. The outliers in the `total_income` column are very significant, so the best choice is the `median()` function.
    - *negative values* in `days_employed`: probably a data conversion issue.

I'm going to check all different columns.

<div class="alert alert-block alert-success">
<b>Success:</b> Good job! I really liked your approach to filling missing values. 
</div>

<div class="alert alert-block alert-danger">
<b>Needs fixing:</b> Please try to identify the type of missing values and propose some hypotheses on why they appeared.
</div>

<div class="alert alert-block alert-success">
<b>Success[2]:</b> Good job!
</div>

In [17]:
#replacing negative values in 'children' column!
median_children = credit_scoring['children'].median()
credit_scoring.loc[credit_scoring['children'] < 0,'children'] = median_children
credit_scoring['children'].value_counts()
#okay!

0.0     14196
1.0      4818
2.0      2055
3.0       330
20.0       76
4.0        41
5.0         9
Name: children, dtype: int64

<div class="alert alert-block alert-info">
<b>Info:</b> The children = 20 value also looks like an outlier.
</div>

We can replace the 20 value by 2.

In [18]:
credit_scoring['children'] = credit_scoring['children'].replace(20,2)

In [19]:
credit_scoring['children'].value_counts()

0.0    14196
1.0     4818
2.0     2131
3.0      330
4.0       41
5.0        9
Name: children, dtype: int64

In [20]:
#replacing null values in 'dob_years' column!
median_dobyears = credit_scoring['dob_years'].median()
credit_scoring.loc[credit_scoring['dob_years'] == 0,'dob_years'] = median_dobyears
credit_scoring['dob_years'].value_counts()
#okay!

42.0    698
35.0    617
40.0    609
41.0    607
34.0    603
38.0    598
33.0    581
39.0    573
31.0    560
36.0    555
44.0    547
29.0    545
30.0    540
48.0    538
37.0    537
50.0    514
43.0    513
32.0    510
49.0    508
28.0    503
45.0    497
27.0    493
56.0    487
52.0    484
47.0    480
54.0    479
46.0    475
58.0    461
57.0    460
53.0    459
51.0    448
59.0    444
55.0    443
26.0    408
60.0    377
25.0    357
61.0    355
62.0    352
63.0    269
64.0    265
24.0    264
23.0    254
65.0    194
66.0    183
22.0    183
67.0    167
21.0    111
68.0     99
69.0     85
70.0     65
71.0     58
20.0     51
72.0     33
19.0     14
73.0      8
74.0      6
75.0      1
Name: dob_years, dtype: int64

- Looking now for qualitative data, if there is any processing needed, let's see!

In [21]:
credit_scoring['education'].value_counts()

secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
GRADUATE DEGREE            1
Graduate Degree            1
Name: education, dtype: int64

- Hmmm, some format standardization is needed here.

In [22]:
credit_scoring['education'] = credit_scoring['education'].str.lower()

In [23]:
credit_scoring['education'].value_counts()

secondary education    15233
bachelor's degree       5260
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64
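
The effect of this standardization can be seen on a tiny invented series. The `str.strip()` call is an extra assumption added here for stray spaces; the notebook itself only lowercases.

```python
import pandas as pd

# Invented labels that differ only in letter case or surrounding spaces.
education = pd.Series(["Secondary Education", "SECONDARY EDUCATION",
                       "secondary education", " Bachelor's Degree"])

# Remove surrounding whitespace, then lowercase everything.
normalized = education.str.strip().str.lower()

# Number of distinct labels before and after normalization.
print(education.nunique(), normalized.nunique())
```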

In [24]:
#family_status
credit_scoring['family_status'].value_counts()

married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: family_status, dtype: int64

In [25]:
credit_scoring['family_status_id'].value_counts()
#btw, i still think this is a duplicated column ('family_status'/'family_status_id')

0    12380
1     4177
4     2813
3     1195
2      960
Name: family_status_id, dtype: int64

In [26]:
credit_scoring['gender'].value_counts()
#XNA?

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

In [27]:
credit_scoring.loc[credit_scoring['gender'] == 'XNA','gender'] = 'F' #Voila!

<div class="alert alert-block alert-success">
<b>Success:</b> Nice !=)
</div>

In [28]:
credit_scoring['gender'].value_counts()

F    14237
M     7288
Name: gender, dtype: int64

In [29]:
credit_scoring['income_type'].value_counts()
#nothing to fix

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
paternity / maternity leave        1
student                            1
Name: income_type, dtype: int64

In [30]:
credit_scoring['purpose'].value_counts()
#sounds good!

wedding ceremony                            797
having a wedding                            777
to have a wedding                           774
real estate transactions                    676
buy commercial real estate                  664
buying property for renting out             653
housing transactions                        653
transactions with commercial real estate    651
purchase of the house                       647
housing                                     647
purchase of the house for my family         641
construction of own property                635
property                                    634
transactions with my real estate            630
building a real estate                      626
buy real estate                             624
purchase of my own house                    620
building a property                         620
housing renovation                          612
buy residential real estate                 607
buying my own car                       

### Conclusion

- We're done now with the NAs in 'total_income' and with implausible values, for both qualitative and quantitative columns.
- I'm thinking about dropping the 'family_status_id' and 'education_id' columns.
- Let's see what else I can do with this data!

**PS: I forgot about 'days_employed'!!**
I don't think there are many options. I'm thinking about transforming it into absolute values, let's see.

In [31]:
credit_scoring['days_employed'] = credit_scoring['days_employed'].abs()

In [32]:
print(credit_scoring.loc[credit_scoring['days_employed'] < 0,'days_employed'].sum())

0.0


Done with *'days_employed'* column.
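
Taking absolute values fixes the sign, but the large positive values noticed in Step 1 are still there. A small sketch of how they could be flagged, using four rows copied from the table shown earlier (the helper columns `years_employed` and `implausible` are hypothetical names):

```python
import pandas as pd

# Four rows taken from the displayed data (ages and days employed).
toy = pd.DataFrame({
    'dob_years': [42, 36, 53, 67],
    'days_employed': [-8437.673028, -4024.803754, 340266.072047, 343937.404131],
})

# Flip the sign of the negative entries.
toy['days_employed'] = toy['days_employed'].abs()

# Express the tenure in years and flag rows where it exceeds the person's age.
toy['years_employed'] = toy['days_employed'] / 365
toy['implausible'] = toy['years_employed'] > toy['dob_years']
print(toy)
```

How to treat the flagged rows is a separate decision; this sketch only shows how to find them.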

<div class="alert alert-block alert-success">
<b>Success:</b> Nice going.
</div>

## Data type replacement

In [33]:
credit_scoring.dtypes

children            float64
days_employed       float64
dob_years           float64
education            object
education_id          int64
family_status        object
family_status_id      int64
gender               object
income_type          object
debt                  int64
total_income        float64
purpose              object
dtype: object

### Conclusion

Nothing much to fix in the data types, I think! Hmmm, I could for example convert some float64 columns to int64? 
- For the `children` column, for example, a float makes no sense.

In [34]:
credit_scoring['children'] = credit_scoring['children'].astype('int64', errors='ignore')

In [35]:
credit_scoring['children'].dtypes

dtype('int64')

<div class="alert alert-block alert-success">
<b>Success:</b> Yes, you can! It can help to save memory and makes the data a little easier to read.
</div>

## Processing duplicates

Checking if there are any duplicates?

In [36]:
credit_scoring.duplicated().sum()

72

In [37]:
credit_scoring[credit_scoring.duplicated()]

| | children | days_employed | dob_years | education | education_id | family_status | family_status_id | gender | income_type | debt | total_income | purpose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2849 | 0 | NaN | 41.0 | secondary education | 1 | married | 0 | F | employee | 0 | 23202.87 | purchase of the house for my family |
| 3290 | 0 | NaN | 58.0 | secondary education | 1 | civil partnership | 1 | F | retiree | 0 | 23202.87 | to have a wedding |
| 4182 | 1 | NaN | 34.0 | bachelor's degree | 0 | civil partnership | 1 | F | employee | 0 | 23202.87 | wedding ceremony |
| 4851 | 0 | NaN | 60.0 | secondary education | 1 | civil partnership | 1 | F | retiree | 0 | 23202.87 | wedding ceremony |
| 5557 | 0 | NaN | 58.0 | secondary education | 1 | civil partnership | 1 | F | retiree | 0 | 23202.87 | to have a wedding |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20702 | 0 | NaN | 64.0 | secondary education | 1 | married | 0 | F | retiree | 0 | 23202.87 | supplementary education |
| 21032 | 0 | NaN | 60.0 | secondary education | 1 | married | 0 | F | retiree | 0 | 23202.87 | to become educated |
| 21132 | 0 | NaN | 47.0 | secondary education | 1 | married | 0 | F | employee | 0 | 23202.87 | housing renovation |
| 21281 | 1 | NaN | 30.0 | bachelor's degree | 0 | married | 0 | F | employee | 0 | 23202.87 | buy commercial real estate |


In [38]:
#dropping duplicates
credit_scoring = credit_scoring.drop_duplicates().reset_index(drop=True)
credit_scoring.duplicated().sum()

0

### Conclusion

Let's convert all qualitative data to a consistent lower or upper case and check for duplicates again.

In [39]:
credit_scoring['family_status'] = credit_scoring['family_status'].str.lower()
credit_scoring['gender'] = credit_scoring['gender'].str.upper()
credit_scoring['income_type'] = credit_scoring['income_type'].str.lower()
credit_scoring['purpose'] = credit_scoring['purpose'].str.lower()

In [40]:
credit_scoring.duplicated().sum()

0

1. 72 duplicates is not many, so we did not lose many rows. 
2. The duplicates may appear because a client requested the same credit several times and a new record was created each time. As we can see in the `purpose` column, some records also repeat the same purpose expressed in different ways.
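
The duplicate count went from 54 on the raw data to 72 here, presumably because the cleaning steps (such as lowercasing `education`) made more rows identical. The sketch below reproduces that effect on three invented rows; it is an illustration of the mechanism, not a reconstruction of the actual duplicates.

```python
import pandas as pd

# Invented rows: the first two differ only in the letter case of 'education'.
toy = pd.DataFrame({
    'education': ['secondary education', 'Secondary Education', "bachelor's degree"],
    'purpose': ['car purchase', 'car purchase', 'housing'],
    'dob_years': [41, 41, 30],
})

# Exact duplicates before and after normalizing the letter case.
before = toy.duplicated().sum()
toy['education'] = toy['education'].str.lower()
after = toy.duplicated().sum()
print(before, after)

# Drop the repeated rows and rebuild a clean 0..n-1 index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(len(deduplicated))
```

This is why it is usually safer to normalize text columns first and remove duplicates afterwards.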

<div class="alert alert-block alert-success">
<b>Success:</b> The duplicates were processed correctly. Good that you used str.lower() method for education before.
</div>

<div class="alert alert-block alert-danger">
<b>Needs fixing:</b> Here too. Please try to identify the type of missing value and propose some hypotheses on why the duplicated values appeared.
</div>

<div class="alert alert-block alert-success">
<b>Success[2]:</b> Excellent! Now it is better!
</div>

## Lemmatization

Now it's the `purpose` column's turn.

In [41]:
credit_scoring['purpose'].value_counts()

wedding ceremony                            791
having a wedding                            767
to have a wedding                           765
real estate transactions                    675
buy commercial real estate                  661
housing transactions                        652
buying property for renting out             651
transactions with commercial real estate    650
purchase of the house                       646
housing                                     646
purchase of the house for my family         638
construction of own property                635
property                                    633
transactions with my real estate            627
building a real estate                      624
buy real estate                             621
purchase of my own house                    620
building a property                         619
housing renovation                          607
buy residential real estate                 606
buying my own car                       

At first sight, several rows mean the same thing.

In [42]:
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Resources for POS tagging and stop-word removal.
# word_tokenize and WordNetLemmatizer also rely on NLTK data (e.g. 'punkt' and 'wordnet'),
# which must already be installed for this cell to run.
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

stop = set(stopwords.words('english'))
wordnet_lemma = WordNetLemmatizer()

# Lemmatize every non-stop word of every purpose, treating each word as a noun.
lemmas = []
for text in credit_scoring['purpose']:
    words = nltk.word_tokenize(text)
    lemmas += [wordnet_lemma.lemmatize(w, pos='n') for w in words if w not in stop]

# Unique lemmas in order of first appearance.
words_categories = list(Counter(lemmas).keys())

# Keep only the lemmas tagged as singular nouns ('NN').
nouns = [word for word, tag in nltk.pos_tag(words_categories) if tag == 'NN']
print(nouns)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['purchase', 'house', 'car', 'education', 'housing', 'transaction', 'family', 'estate', 'construction', 'property', 'second-hand', 'become', 'ceremony', 'profile', 'university', 'renovation']


The list above contains the main nouns found in the `purpose` column.
Below is a function that I'm going to use to add a new column to our dataframe.

In [43]:
def purpose_categories(text):
    if 'edu' in text or 'univ' in text:
        return 'education'
    if 'car' in text:
        return 'car'
    if 'wedding' in text:
        return 'wedding'
    if ('house' in text or 'property' in text) or ('real' in text or 'estate' in text): 
        return 'house'
    else:
        return 'other'

<div class="alert alert-block alert-info">
<b>Improve: </b> 'not listed' and 'real estate' could be related to house group.
</div>
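
One thing to watch with substring matching: the string 'house' is not contained in 'housing', so purposes such as 'housing transactions' fall into 'other'. A possible variant that matches the shorter stem 'hous' is sketched below. The function name `purpose_categories_v2` is hypothetical, and this is an illustration, not the function used in the rest of the notebook.

```python
def purpose_categories_v2(text):
    """Map a free-text loan purpose to a coarse category (illustrative variant)."""
    text = text.lower()
    if 'edu' in text or 'univ' in text:
        return 'education'
    if 'car' in text:
        return 'car'
    if 'wedding' in text:
        return 'wedding'
    # 'hous' matches both 'house' and 'housing'.
    if any(stem in text for stem in ('hous', 'property', 'real', 'estate')):
        return 'house'
    return 'other'


# A few purposes taken from the value counts shown earlier.
for purpose in ['housing transactions', 'housing renovation', 'car purchase',
                'to have a wedding', 'supplementary education', 'buy real estate']:
    print(purpose, '->', purpose_categories_v2(purpose))
```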

In [44]:
credit_scoring['purpose_categories'] = credit_scoring['purpose'].apply(purpose_categories)

In [45]:
credit_scoring['purpose_categories'].head(20)
#VOILAA

0         house
1           car
2         house
3     education
4       wedding
5         house
6         other
7     education
8       wedding
9         house
10        house
11        house
12      wedding
13          car
14        house
15        house
16        house
17        house
18          car
19          car
Name: purpose_categories, dtype: object

## Categorizing Data

In [46]:
credit_scoring['family_status'].value_counts()

married              12339
civil partnership     4150
unmarried             2810
divorced              1195
widow / widower        959
Name: family_status, dtype: int64

In [47]:
credit_scoring['family_status_id'].value_counts()

0    12339
1     4150
4     2810
3     1195
2      959
Name: family_status_id, dtype: int64

**Let's check that each `family_status_id` corresponds to exactly one `family_status`**

In [48]:
print(credit_scoring.loc[credit_scoring['family_status_id'] == 0,'family_status'].value_counts())
print()
print(credit_scoring.loc[credit_scoring['family_status_id'] == 1,'family_status'].value_counts())
print()
print(credit_scoring.loc[credit_scoring['family_status_id'] == 2,'family_status'].value_counts())
print()
print(credit_scoring.loc[credit_scoring['family_status_id'] == 3,'family_status'].value_counts())
print()
print(credit_scoring.loc[credit_scoring['family_status_id'] == 4,'family_status'].value_counts())



married    12339
Name: family_status, dtype: int64

civil partnership    4150
Name: family_status, dtype: int64

widow / widower    959
Name: family_status, dtype: int64

divorced    1195
Name: family_status, dtype: int64

unmarried    2810
Name: family_status, dtype: int64
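
The same check can be written more compactly. The sketch below uses a small invented frame with the id/label pairs seen above; `labels_per_id` and `family_status_lookup` are hypothetical names. It also keeps a lookup dictionary, which could be handy if the id is needed again after the column is dropped.

```python
import pandas as pd

# Invented rows that follow the id/label pairs printed above.
toy = pd.DataFrame({
    'family_status_id': [0, 0, 1, 2, 3, 4, 4],
    'family_status': ['married', 'married', 'civil partnership', 'widow / widower',
                      'divorced', 'unmarried', 'unmarried'],
})

# Number of distinct labels per id; all ones means each id has a single label.
labels_per_id = toy.groupby('family_status_id')['family_status'].nunique()
each_id_has_one_label = (labels_per_id == 1).all()

# Keep the id -> label lookup before dropping the id column.
family_status_lookup = (
    toy[['family_status_id', 'family_status']]
    .drop_duplicates()
    .set_index('family_status_id')['family_status']
    .to_dict()
)
print(each_id_has_one_label)
print(family_status_lookup)
```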


In [49]:
credit_scoring.drop(columns='family_status_id',inplace=True)

In [50]:
credit_scoring.head()

| | children | days_employed | dob_years | education | education_id | family_status | gender | income_type | debt | total_income | purpose | purpose_categories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 8437.673028 | 42.0 | bachelor's degree | 0 | married | F | employee | 0 | 40620.102 | purchase of the house | house |
| 1 | 1 | 4024.803754 | 36.0 | secondary education | 1 | married | F | employee | 0 | 17932.802 | car purchase | car |
| 2 | 0 | 5623.42261 | 33.0 | secondary education | 1 | married | M | employee | 0 | 23341.752 | purchase of the house | house |
| 3 | 3 | 4124.747207 | 32.0 | secondary education | 1 | married | M | employee | 0 | 42820.568 | supplementary education | education |
| 4 | 0 | 340266.072047 | 53.0 | secondary education | 1 | civil partnership | F | retiree | 0 | 25378.572 | to have a wedding | wedding |


The same goes for *'education'* and *'education_id'*

In [51]:
credit_scoring['education'].value_counts()

secondary education    15171
bachelor's degree       5250
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64

In [52]:
credit_scoring['education_id'].value_counts()

1    15171
0     5250
2      744
3      282
4        6
Name: education_id, dtype: int64

In [53]:
credit_scoring.groupby('education_id')['education'].count()

education_id
0     5250
1    15171
2      744
3      282
4        6
Name: education, dtype: int64

In [54]:
credit_scoring.drop(columns='education_id',inplace=True)

In [55]:
credit_scoring.head(1)

,children,days_employed,dob_years,education,family_status,gender,income_type,debt,total_income,purpose,purpose_categories
0,1,8437.673028,42.0,bachelor's degree,married,F,employee,0,40620.102,purchase of the house,house


### Conclusion

- I created *'purpose_categories'*, a clearer and easier-to-use column.
- I dropped the columns that duplicate the same information (*'family_status_id'* and *'education_id'*).
- I think the dataset is now categorized.
- We can also add another column that categorizes *'total_income'*; please see the last part of this report.

<div class="alert alert-block alert-success">
<b>Success:</b> This step was done quite well!
</div>

## Step 3. Answer these questions

- Is there a relation between having kids and repaying a loan on time?

In [56]:
credit_scoring.groupby('children')['debt'].sum() 

children
0    1064
1     444
2     202
3      27
4       4
5       0
Name: debt, dtype: int64

In [57]:
# Share of customers without (0) and with (1) a loan default, per number of children
pd.crosstab(credit_scoring['children'],credit_scoring['debt']).apply(lambda r: r/r.sum(), axis=1)

children,debt=0,debt=1
0,0.924737,0.075263
1,0.907654,0.092346
2,0.905075,0.094925
3,0.918182,0.081818
4,0.902439,0.097561
5,1.0,0.0


- Well, YES!

### Conclusion

≈ 90% of customers who have children repay their loans on time.

<div class="alert alert-block alert-info">
<b>Improve:</b> I think everything is exactly the opposite :D
</div>
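
Since `debt` is a 0/1 flag, its mean within a group is that group's default rate, so the counts and shares from the two cells above can be read side by side. A minimal sketch on made-up data (`sample` is hypothetical; the real call would use `credit_scoring`):

```python
import pandas as pd

# Made-up data for illustration only.
sample = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 2, 2],
    'debt':     [0, 0, 0, 1, 0, 1, 1, 0],
})

# Group size, number of defaults and default rate for each group.
summary = (
    sample.groupby('children')['debt']
    .agg(customers='count', defaults='sum', default_rate='mean')
)
print(summary)
```

Showing the group size next to the rate also helps to spot rates that may rest on very few customers, which could be the case for the five-children group above.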

- Is there a relation between marital status and repaying a loan on time?

In [58]:
credit_scoring.pivot_table(values='debt',index='family_status', aggfunc='sum') 

family_status,debt
civil partnership,388
divorced,85
married,931
unmarried,274
widow / widower,63


In [59]:
pd.crosstab(credit_scoring['family_status'],credit_scoring['debt']).apply(lambda r: r/r.sum(), axis=1)

family_status,debt=0,debt=1
civil partnership,0.906506,0.093494
divorced,0.92887,0.07113
married,0.924548,0.075452
unmarried,0.902491,0.097509
widow / widower,0.934307,0.065693


### Conclusion

- In every family status group, ≈ 90% or more of customers pay back their loans, so there is only a weak relation between marital status and repaying a loan on time.

- Is there a relation between income level and repaying a loan on time?

To classify customers by *'total_income'*, let's first check which ranges we can work with:

In [60]:
credit_scoring.pivot_table(values='total_income',index='debt', aggfunc=['mean','median','min','max']) 

debt,mean total_income,median total_income,min total_income,max total_income
0,26491.332111,23202.87,3392.845,362496.645
1,25813.629751,23202.87,3306.762,352136.354


In [61]:
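def categorizing_income(income_nbr):
    """Return 'low', 'medium' or 'high' for an income value."""
    # Cut-offs reuse figures from the summary above: 3306.762 is the
    # minimum income and 26491.332111 the mean income of non-defaulters.
    if 3306.762 <= income_nbr <= 13202.87:
        return 'low'
    if 13202.87 < income_nbr <= 26491.332111:
        return 'medium'
    if income_nbr > 26491.332111:
        return 'high'
    # Values below 3306.762 (or NaN) fall through and return None.

credit_scoring['income_category'] = credit_scoring['total_income'].apply(categorizing_income)
print(credit_scoring['income_category'].value_counts())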

medium    11273
high       7637
low        2543
Name: income_category, dtype: int64


<div class="alert alert-block alert-success">
<b>Success:</b> Good job, this is one possible way to categorize the income column.
</div>
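
As an alternative to hand-written thresholds, quantile-based bins give groups of roughly equal size. A sketch using `pd.qcut`, which is a different technique from the function above and is shown here on made-up incomes (`incomes` is hypothetical):

```python
import pandas as pd

# Made-up incomes for illustration only.
incomes = pd.Series([5000, 12000, 18000, 23000, 27000, 40000, 90000, 150000, 300000])

# Three equal-frequency bins labelled from lowest to highest income.
income_category = pd.qcut(incomes, q=3, labels=['low', 'medium', 'high'])
print(income_category.value_counts())
```

With equal-sized groups, differences in default rate are easier to compare, because no single category dominates the counts.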

The *'income_category'* column added above is the income categorization mentioned in the previous part of this report.

In [62]:
credit_scoring.pivot_table(values='debt',index='income_category', aggfunc='sum') 

income_category,debt
high,589
low,196
medium,956


In [63]:
pd.crosstab(credit_scoring['income_category'],credit_scoring['debt']).apply(lambda r: r/r.sum(), axis=1)

income_category,debt=0,debt=1
high,0.922875,0.077125
low,0.922926,0.077074
medium,0.915196,0.084804


### Conclusion

As we can see, there is no significant difference between the income ranges, so there is only a weak relation between income level and repaying a loan on time:
- ≈ 92% of customers in the high, low, and medium income groups pay back their loans.

- How do different loan purposes affect on-time repayment of the loan?

In [64]:
credit_scoring.pivot_table(values='debt',index='purpose_categories', aggfunc='sum')     

purpose_categories,debt
car,403
education,370
house,653
other,129
wedding,186


In [65]:
pd.crosstab(credit_scoring['purpose_categories'],credit_scoring['debt']).apply(lambda r: r/r.sum(), axis=1)

purpose_categories,debt=0,debt=1
car,0.90641,0.09359
education,0.9078,0.0922
house,0.926679,0.073321
other,0.932283,0.067717
wedding,0.919931,0.080069


### Conclusion

No matter what the purpose of the loan is, ≈ 90% of customers pay back their loans.

<div class="alert alert-block alert-danger">
<b>Needs fixing:</b> I do not quite agree with your conclusions, since even a small percentage difference can play a significant role in the bank's profit.
</div>
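
To see why small differences can matter, the default shares from the crosstab above can be compared in relative terms. A small worked calculation using the rounded shares printed in that table; the variable names are only for illustration:

```python
# Default shares copied from the crosstab output above (debt = 1 column).
default_rate = {
    'car': 0.09359,
    'education': 0.0922,
    'house': 0.073321,
    'other': 0.067717,
    'wedding': 0.080069,
}

baseline = min(default_rate.values())  # lowest default share among purposes

# Ratio of each purpose's default share to the lowest one.
relative_to_lowest = {
    purpose: round(rate / baseline, 2) for purpose, rate in default_rate.items()
}
print(relative_to_lowest)
```

On these figures, car loans default roughly 1.4 times as often as loans in the 'other' category, which is a noticeable gap even though both repayment shares round to about 90%.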

<div class="alert alert-block alert-success">
<b>Success[3]:</b> Well done
</div>

## Step 4. General conclusion

Analyzing the data is obviously the most important part before going any further on any project. I took time to arrange the *'purpose'* column, and at each stage of the analysis I realized I had forgotten something in the previous steps. I see that the columns are not strongly correlated (no clear relation between them). I also think the data could answer more questions and support more statistics.

<div class="alert alert-block alert-success">
<b>Success:</b> Excellent work! You have mastered the tools: pivot_table and groupby!
</div>

<div class="alert alert-block alert-info">
<b>Recommendation: </b>
    
Some tips for **Markdown** style, which could help you to improve your conclusions even more!
    
**BOLD** <br>
*Italics*
    
--- 

# First level heading

--- 
    
## Second level heading
  
---  
Lists:
    
- one 
- two
- three
   
---
    
1. one
2. one two
3. one two three
    
--- 
    
Displaying `variables`
    
    

    
</div>

### Project Readiness Checklist

Put 'x' in the completed points. Then press Shift + Enter.

- [x]  file open;
- [x]  file examined;
- [x]  missing values defined;
- [x]  missing values are filled;
- [x]  an explanation of which missing value types were detected;
- [x]  explanation for the possible causes of missing values;
- [x]  an explanation of how the blanks are filled;
- [x]  replaced the real data type with an integer;
- [x]  an explanation of which method is used to change the data type and why;
- [x]  duplicates deleted;
- [x]  an explanation of which method is used to find and remove duplicates;
- [x]  description of the possible reasons for the appearance of duplicates in the data;
- [x]  data is categorized;
- [x]  an explanation of the principle of data categorization;
- [x]  an answer to the question "Is there a relation between having kids and repaying a loan on time?";
- [x]  an answer to the question "Is there a relation between marital status and repaying a loan on time?";
- [x]  an answer to the question "Is there a relation between income level and repaying a loan on time?";
- [x]  an answer to the question "How do different loan purposes affect on-time repayment of the loan?";
- [x]  conclusions are present on each stage;
- [x]  a general conclusion is made.