
<p style="background-color:#FDFEFE; font-family:arial; color:#09042b; font-size:400%; text-align:center; border-radius:10px 10px;"> Credit Score Classification</p>

<p style="background-color:#FDFEFE; font-family:arial; color:#09042b; font-size:350%; text-align:center; border-radius:10px 10px;"> EDA Project Part 1 </p>


<img src="https://t3.ftcdn.net/jpg/04/62/56/22/360_F_462562264_vzm8SoTxft5Ug3AEHjoPyHndtSGx6ymb.jpg" align="center"/>

<a id="toc"></a>

## <p style="background-color:#7da6ff; font-family:arial; color:#09042b; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [Aim of the Project](#0)
* [Dataset Info](#1)
* [Importing Related Libraries](#2)
* [Recognizing and Understanding Data](#3)
* [Cleaning Data](#4)
* [Handling Missing Values](#5)
* [Handling Outliers](#6)

## <p style="background-color:#7da6ff; font-family:arial; color:#09042b; font-size:175%; text-align:center; border-radius:10px 10px;">Aim of the Project</p>

<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:#09042b; background-color:#7da6ff" data-toggle="popover">Content</a>


A credit score is a number or a class that depicts a consumer’s creditworthiness. The higher the score, the better a borrower looks to potential lenders.

A credit score is based on credit history: number of open accounts, total levels of debt, repayment history, and other factors. Lenders use credit scores to evaluate the probability that an individual will repay loans in a timely manner.

In this project, a comprehensive Exploratory Data Analysis (EDA) was conducted to recognize and understand the credit score classification data, and the data was prepared for use with machine learning algorithms.


## <p style="background-color:#7da6ff; font-family:arial; color:#09042b; font-size:175%; text-align:center; border-radius:10px 10px;">Dataset Info</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:#09042b; background-color:#7da6ff" data-toggle="popover">Content</a>


The credit score classification dataset has 100,000 entries and 28 columns, covering 12,500 distinct customers. Each entry contains the following information about a customer:

**ID :** a unique identification of an entry

**Customer_ID :** a unique identification of a person

**Month :** the month of the year

**Name :** the name of a person

**Age :** the age of the person

**SSN :** the social security number of a person

**Occupation :** the occupation of the person

**Annual_Income :** the annual income of the person

**Monthly_Inhand_Salary :** the monthly base salary of a person

**Num_Bank_Accounts :** the number of bank accounts a person holds

**Num_Credit_Card :** the number of other credit cards held by a person

**Interest_Rate :** the interest rate on credit card

**Num_of_Loan :** the number of loans taken from the bank

**Type_of_Loan :** the types of loan taken by a person

**Delay_from_due_date :** the average number of days delayed from the payment date

**Num_of_Delayed_Payment :** the average number of payments delayed by a person

**Changed_Credit_Limit :** the percentage change in credit card limit

**Num_Credit_Inquiries :** the number of credit card inquiries

**Credit_Mix :** the classification of the mix of credits

**Outstanding_Debt :** the remaining debt to be paid (in USD)

**Credit_Utilization_Ratio :** the utilization ratio of credit card

**Credit_History_Age :** the age of credit history of the person

**Payment_of_Min_Amount :** whether only the minimum amount was paid by the person

**Total_EMI_per_month :** the monthly EMI payments (in USD)

**Amount_invested_monthly :** the monthly amount invested by the customer (in USD)

**Payment_Behaviour :** the payment behavior of the customer (a category such as High_spent_Small_value_payments)

**Monthly_Balance :** the monthly balance amount of the customer (in USD)

**Credit_Score :** the bracket of credit score (Poor, Standard, Good)


## <p style="background-color:#7da6ff; font-family:arial; color:#09042b; font-size:175%; text-align:center; border-radius:10px 10px;">Importing Related Libraries</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:#09042b; background-color:#7da6ff" data-toggle="popover">Content</a>

In [1]:
# import data analysis and visualisation libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import patches
import seaborn as sns

# import warnings to suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Statistics functions
from scipy.stats import norm
from scipy import stats
from scipy.stats import chi2_contingency
from scipy.stats import chi2

# Changing the figure size of a seaborn axes 
sns.set(rc={"figure.figsize": (10, 6)})

# The style parameters control properties
sns.set_style("whitegrid")

# To display maximum columns
pd.set_option('display.max_columns', None)

# To display maximum rows
pd.set_option('display.max_rows', None)

### <p style="background-color:#7da6ff; font-family:arial; color:#09042b; font-size:150%; text-align:left; border-radius:10px; line-height:1.5; text-align:center;">Reading the data from file</p>

In [2]:
df= pd.read_csv("train.csv")

## <p style="background-color:#7da6ff; font-family:arial; color:#09042b; font-size:175%; text-align:center; border-radius:10px 10px;">Recognizing and Understanding Data</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:#09042b; background-color:#7da6ff" data-toggle="popover">Content</a>

### Checking the dataframe with head, tail and sample

In [3]:
df.head()

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Type_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Credit_Score
0,0x1602,CUS_0xd40,January,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",3,7.0,11.27,4.0,_,809.98,26.82262,22 Years and 1 Months,No,49.574949,80.41529543900253,High_spent_Small_value_payments,312.49408867943663,Good
1,0x1603,CUS_0xd40,February,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",-1,,11.27,4.0,Good,809.98,31.94496,,No,49.574949,118.28022162236736,Low_spent_Large_value_payments,284.62916249607184,Good
2,0x1604,CUS_0xd40,March,Aaron Maashoh,-500,821-00-0265,Scientist,19114.12,,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",3,7.0,_,4.0,Good,809.98,28.609352,22 Years and 3 Months,No,49.574949,81.699521264648,Low_spent_Medium_value_payments,331.2098628537912,Good
3,0x1605,CUS_0xd40,April,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",5,4.0,6.27,4.0,Good,809.98,31.377862,22 Years and 4 Months,No,49.574949,199.4580743910713,Low_spent_Small_value_payments,223.45130972736783,Good
4,0x1606,CUS_0xd40,May,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",6,,11.27,4.0,Good,809.98,24.797347,22 Years and 5 Months,No,49.574949,41.420153086217326,High_spent_Medium_value_payments,341.48923103222177,Good


In [4]:
df.tail()

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Type_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Credit_Score
99995,0x25fe9,CUS_0x942c,April,Nicks,25,078-73-5990,Mechanic,39628.99,3359.415833,4,6,7,2,"Auto Loan, and Student Loan",23,7.0,11.5,3.0,_,502.38,34.663572,31 Years and 6 Months,No,35.104023,60.97133255718485,High_spent_Large_value_payments,479.866228,Poor
99996,0x25fea,CUS_0x942c,May,Nicks,25,078-73-5990,Mechanic,39628.99,3359.415833,4,6,7,2,"Auto Loan, and Student Loan",18,7.0,11.5,3.0,_,502.38,40.565631,31 Years and 7 Months,No,35.104023,54.18595028760385,High_spent_Medium_value_payments,496.65161,Poor
99997,0x25feb,CUS_0x942c,June,Nicks,25,078-73-5990,Mechanic,39628.99,3359.415833,4,6,5729,2,"Auto Loan, and Student Loan",27,6.0,11.5,3.0,Good,502.38,41.255522,31 Years and 8 Months,No,35.104023,24.02847744864441,High_spent_Large_value_payments,516.809083,Poor
99998,0x25fec,CUS_0x942c,July,Nicks,25,078-73-5990,Mechanic,39628.99,3359.415833,4,6,7,2,"Auto Loan, and Student Loan",20,,11.5,3.0,Good,502.38,33.638208,31 Years and 9 Months,No,35.104023,251.67258219721603,Low_spent_Large_value_payments,319.164979,Standard
99999,0x25fed,CUS_0x942c,August,Nicks,25,078-73-5990,Mechanic,39628.99_,3359.415833,4,6,7,2,"Auto Loan, and Student Loan",18,6.0,11.5,3.0,Good,502.38,34.192463,31 Years and 10 Months,No,35.104023,167.1638651610451,!@9#%8,393.673696,Poor


In [5]:
df.sample(10)

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Type_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Credit_Score
98597,0x257b7,CUS_0x972a,June,Kajimotou,33,#F%$D@*&8,Entrepreneur,87605.68,7299.473333,5,7,19,3,"Credit-Builder Loan, Personal Loan, and Person...",29,15.0,-0.3500000000000001,6.0,Standard,512.74,32.589849,29 Years and 10 Months,Yes,112.148801,189.52762304766608,Low_spent_Medium_value_payments,708.270909,Poor
89008,0x21f8a,CUS_0x3127,January,Sarahj,19,051-95-4107,Accountant,37972.16,2986.346667,9,10,21,7,"Credit-Builder Loan, Mortgage Loan, Student Lo...",51,,20.14,6.0,Bad,4708.63,28.007441,11 Years and 7 Months,Yes,217.653368,123.30591185917326,High_spent_Small_value_payments,217.67538702315449,Standard
12988,0x621a,CUS_0x67a4,May,,31,495-84-9751,Mechanic,20825.75,1770.479167,6,5,33,3_,"Personal Loan, Student Loan, and Debt Consolid...",21,14.0,7.3,11.0,Standard,2116.54,36.24653,15 Years and 11 Months,Yes,46.644959,__10000__,Low_spent_Large_value_payments,256.47649514212947,Poor
52073,0x1471f,CUS_0x3c5e,February,,24,516-91-0042,Media_Manager,20588.45,1557.704167,6,10,23,-100,"Payday Loan, Payday Loan, Not Specified, and D...",18,16.0,5.12,,Bad,2160.86,38.079023,5 Years and 9 Months,Yes,54.058592,109.06161352900757,Low_spent_Small_value_payments,282.65021104659525,Standard
12966,0x61f8,CUS_0x62ec,July,Stevenn,24,567-99-2219,Mechanic,20060.93,1806.744167,7,6,14,5,"Student Loan, Not Specified, Personal Loan, St...",21,,12.21,9.0,Standard,1445.82,29.520085,8 Years and 1 Months,Yes,48.564731,194.62908529921958,Low_spent_Small_value_payments,227.48060047358163,Standard
64471,0x18fc1,CUS_0x5cd7,August,Manuela Badawyh,18,306-56-6615,Mechanic,68917.94_,5498.161667,5,2,11,0,,0,10.0,0.1899999999999999,4.0,Good,1263.18,36.619696,17 Years and 11 Months,No,0.0,162.17299074339152,High_spent_Medium_value_payments,637.6431759232752,Good
65220,0x19426,CUS_0x682e,May,Moiral,39,130-71-2530,Teacher,131701.12,10704.093333,1145,7,18,2,"Home Equity Loan, and Not Specified",14,14.0,9.97,4.0,Standard,1263.18,39.402939,,Yes,150.612921,129.97487137711568,!@9#%8,1029.82154065413,Good
18186,0x8090,CUS_0x8d45,March,Terril Yued,37,239-07-9719,Mechanic,18718.52,1784.876667,6,9,16,9,"Credit-Builder Loan, Student Loan, Credit-Buil...",61,15.0,22.73,12.0,Bad,4432.96,38.640674,11 Years and 3 Months,Yes,126.491489,68.32509265805145,High_spent_Medium_value_payments,233.67108478089023,Poor
77751,0x1dd91,CUS_0xadf1,August,Dougm,14,393-85-2534,Teacher,20756.69,,8,10,16,7,"Credit-Builder Loan, Payday Loan, Debt Consoli...",56,24.0,27.93,13.0,Bad,4206.64,30.676184,1 Years and 6 Months,Yes,62.388472,86.08758151681081,Low_spent_Medium_value_payments,313.3963630879142,Poor
93410,0x23954,CUS_0x56ef,March,,20,454-36-2008,Mechanic,15867.45,1468.2875,10,10,22,2_,"Payday Loan, and Student Loan",17,17.0,13.02,7.0,Standard,1773.9,28.338393,17 Years and 6 Months,Yes,18.342812,132.3904165776869,Low_spent_Small_value_payments,286.09552187038923,Poor


In [6]:
df.shape

(100000, 28)

### Checking the summary information of df

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  object 
 1   Customer_ID               100000 non-null  object 
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  object 
 5   SSN                       100000 non-null  object 
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  object 
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  object 
 13  Type_of_Loan              88592 non-null   ob

In [8]:
# Checking the null values of df

df.isnull().sum()

ID                              0
Customer_ID                     0
Month                           0
Name                         9985
Age                             0
SSN                             0
Occupation                      0
Annual_Income                   0
Monthly_Inhand_Salary       15002
Num_Bank_Accounts               0
Num_Credit_Card                 0
Interest_Rate                   0
Num_of_Loan                     0
Type_of_Loan                11408
Delay_from_due_date             0
Num_of_Delayed_Payment       7002
Changed_Credit_Limit            0
Num_Credit_Inquiries         1965
Credit_Mix                      0
Outstanding_Debt                0
Credit_Utilization_Ratio        0
Credit_History_Age           9030
Payment_of_Min_Amount           0
Total_EMI_per_month             0
Amount_invested_monthly      4479
Payment_Behaviour               0
Monthly_Balance              1200
Credit_Score                    0
dtype: int64

In [9]:
# Checking the duplicated values in df

df.duplicated().sum()

0

In [10]:
# Checking the number of uniques in df

df.nunique()

ID                          100000
Customer_ID                  12500
Month                            8
Name                         10139
Age                           1788
SSN                          12501
Occupation                      16
Annual_Income                18940
Monthly_Inhand_Salary        13235
Num_Bank_Accounts              943
Num_Credit_Card               1179
Interest_Rate                 1750
Num_of_Loan                    434
Type_of_Loan                  6260
Delay_from_due_date             73
Num_of_Delayed_Payment         749
Changed_Credit_Limit          4384
Num_Credit_Inquiries          1223
Credit_Mix                       4
Outstanding_Debt             13178
Credit_Utilization_Ratio    100000
Credit_History_Age             404
Payment_of_Min_Amount            3
Total_EMI_per_month          14950
Amount_invested_monthly      91049
Payment_Behaviour                7
Monthly_Balance              98792
Credit_Score                     3
dtype: int64

In [11]:
# Checking the descriptive statistics of the numeric columns in df

df.describe()

Unnamed: 0,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Delay_from_due_date,Num_Credit_Inquiries,Credit_Utilization_Ratio,Total_EMI_per_month
count,84998.0,100000.0,100000.0,100000.0,100000.0,98035.0,100000.0,100000.0
mean,4194.17085,17.09128,22.47443,72.46604,21.06878,27.754251,32.285173,1403.118217
std,3183.686167,117.404834,129.05741,466.422621,14.860104,193.177339,5.116875,8306.04127
min,303.645417,-1.0,0.0,1.0,-5.0,0.0,20.0,0.0
25%,1625.568229,3.0,4.0,8.0,10.0,3.0,28.052567,30.30666
50%,3093.745,6.0,5.0,13.0,18.0,6.0,32.305784,69.249473
75%,5957.448333,7.0,7.0,20.0,28.0,9.0,36.496663,161.224249
max,15204.633333,1798.0,1499.0,5797.0,67.0,2597.0,50.0,82331.0


### Defining a function for a first look at each feature

In [12]:
def first_look(col):
    print('column name : ', col)
    print('--------------------------------')
    print('Percent_of_Nulls   : ', '%', round(df[col].isnull().sum() / df.shape[0]*100, 2))
    print('Number_of_Nulls   : ', df[col].isnull().sum())
    print('Number_of_Uniques : ', df[col].nunique())
    print("Value_counts :\n",df[col].value_counts(dropna = False))
    print("##"*20)
    print()
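
Because `first_look` prints every distinct value, columns such as `ID` (100,000 unique values) produce very long output. A minimal sketch of a capped variant follows; the name `first_look_top` and the `top_n` parameter are hypothetical and not part of the original notebook, and the small DataFrame is made-up demo data.

```python
import numpy as np
import pandas as pd


def first_look_top(frame, col, top_n=10):
    """Summarise one column, listing only its top_n most frequent values.

    Hypothetical helper: unlike first_look, it takes the DataFrame as an
    argument and returns the summary as a dict in addition to printing it.
    """
    n_nulls = int(frame[col].isnull().sum())
    summary = {
        "column": col,
        "percent_nulls": round(n_nulls / frame.shape[0] * 100, 2),
        "n_nulls": n_nulls,
        "n_uniques": int(frame[col].nunique()),
        "top_values": frame[col].value_counts(dropna=False).head(top_n),
    }
    print("column name : ", col)
    print("Percent_of_Nulls  : ", "%", summary["percent_nulls"])
    print("Number_of_Nulls   : ", summary["n_nulls"])
    print("Number_of_Uniques : ", summary["n_uniques"])
    print("Top value_counts :\n", summary["top_values"])
    return summary


# Made-up demo data, only to show the call
demo = pd.DataFrame({"Occupation": ["Lawyer", "Lawyer", np.nan, "Teacher"]})
result = first_look_top(demo, "Occupation", top_n=2)
```

Capping the listing keeps the per-column summary readable while still showing the most frequent values, which is usually where placeholder strings show up.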

In [13]:
# Checking the information for each column using the first_look function

for col in df.columns :
    first_look(col)

column name :  ID
--------------------------------
Percent_of_Nulls   :  % 0.0
Number_of_Nulls   :  0
Number_of_Uniques :  100000
Value_counts :
 0x1602     1
0x19c88    1
0x19caa    1
0x19ca5    1
0x19ca4    1
0x19ca3    1
0x19ca2    1
0x19ca1    1
0x19ca0    1
0x19c9f    1
0x19c9e    1
0x19c99    1
0x19c98    1
0x19c97    1
0x19c96    1
0x19c95    1
0x19c94    1
0x19c93    1
0x19c92    1
0x19c8d    1
0x19c8c    1
0x19c8b    1
0x19c8a    1
0x19cab    1
0x19cac    1
0x19cad    1
0x19cbd    1
0x19cce    1
0x19cc9    1
0x19cc8    1
0x19cc7    1
0x19cc6    1
0x19cc5    1
0x19cc4    1
0x19cc3    1
0x19cc2    1
0x19cbc    1
0x19cae    1
0x19cbb    1
0x19cba    1
0x19cb9    1
0x19cb8    1
0x19cb7    1
0x19cb6    1
0x19cb1    1
0x19cb0    1
0x19caf    1
0x19c89    1
0x19c87    1
0x19cd0    1
0x19c86    1
0x19c5d    1
0x19c5c    1
0x19c5b    1
0x19c5a    1
0x19c59    1
0x19c58    1
0x19c57    1
0x19c56    1
0x19c51    1
0x19c50    1
0x19c4f    1
0x19c4e    1
0x19c4d    1
0x19c4c    1
0x19c4b  




## <p style="background-color:#7da6ff; font-family:arial; color:#09042b; font-size:175%; text-align:center; border-radius:10px; line-height:1.2">Cleaning Data</p>

<a id="4"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:#09042b; background-color:#7da6ff" data-toggle="popover">Content</a>

The data contains many strange placeholder values (such as `_`, `#F%$D@*&8` and `!@9#%8`) that should be cleaned or replaced with np.nan.

In [14]:
# defining a function to replace "_", "_______", "#F%$D@*&8", "!@9#%8", "__-333333333333333333333333333__" with np.nan

def clean_data(row) :
    if row in ["_", "_______", "#F%$D@*&8", "!@9#%8", "__-333333333333333333333333333__"]:
        return np.nan
    else :
        return row

In [15]:
# replacing the strange values with np.nan using clean_data function

for col in df.columns :
    df[col] = df[col].apply(clean_data)
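
The same exact-match replacement can likely be done without calling a Python function on every cell. A minimal sketch, assuming the same five placeholder strings and using `DataFrame.replace`; the toy DataFrame is made-up demo data, not rows from `train.csv`.

```python
import numpy as np
import pandas as pd

# Placeholder strings taken from the clean_data function
PLACEHOLDERS = [
    "_",
    "_______",
    "#F%$D@*&8",
    "!@9#%8",
    "__-333333333333333333333333333__",
]

# Made-up demo rows that mimic the noisy columns
demo = pd.DataFrame({
    "Credit_Mix": ["_", "Good", "Standard"],
    "SSN": ["821-00-0265", "#F%$D@*&8", "078-73-5990"],
    "Payment_Behaviour": ["!@9#%8", "Low_spent_Small_value_payments", "_______"],
})

# Exact-match replacement: only cells equal to a placeholder become NaN,
# so values that merely contain "_" are left alone
cleaned = demo.replace(PLACEHOLDERS, np.nan)
print(cleaned.isnull().sum())
```

Values such as `Low_spent_Small_value_payments` survive because the match is on the whole cell, not on substrings.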

In [16]:
# Strip the "_" character from the beginning and end of values in the object (string) columns

for col in df.columns :
    if df[col].dtype == "O" :
        if df[col].str.contains("_").sum() > 0:
            df[col] = df[col].str.strip("_")

In [17]:
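# Convert ID and Customer_ID from hexadecimal strings to decimal integers

df["ID"] = df["ID"].apply(lambda x: int(x, 16))
# Remove the literal "CUS_" prefix; str.strip("CUS_") would strip a set of characters, not a prefix
df["Customer_ID"] = df["Customer_ID"].str.replace("CUS_", "", regex=False).apply(lambda x: int(x, 16))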
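
The conversion relies on Python's `int(x, 16)`, which accepts strings with a `0x` prefix. Note that `str.strip` treats its argument as a set of characters rather than a literal prefix, which is why removing `CUS_` as a literal prefix is the sturdier choice. A small sketch using IDs that appear in the `head` and `tail` output; the helper name `hex_to_int` is hypothetical.

```python
import pandas as pd


def hex_to_int(value):
    """Parse a hexadecimal string such as '0x1602' into a decimal integer."""
    return int(value, 16)


ids = pd.Series(["0x1602", "0x1603"])
customer_ids = pd.Series(["CUS_0xd40", "CUS_0x942c"])

ids_dec = ids.apply(hex_to_int)
# Drop the literal "CUS_" prefix, then parse the remaining hex string
customers_dec = customer_ids.str.replace("CUS_", "", regex=False).apply(hex_to_int)

print(ids_dec.tolist())        # [5634, 5635]
print(customers_dec.tolist())  # [3392, 37932]

# str.strip removes any of the characters C, U, S, _ from both ends,
# so on other inputs it can remove far more than the prefix
stripped = "CUS_SUCCESS".strip("CUS_")
print(stripped)                # E
```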

### Fixing Data Type

In [18]:
# defining column lists whose data types should be fixed to integer and float

col_int = ["ID", "Customer_ID", "Age", "Num_Bank_Accounts", "Num_Credit_Card", "Num_of_Loan", "Delay_from_due_date"]

col_float = ["Num_of_Delayed_Payment", "Annual_Income", "Interest_Rate", "Changed_Credit_Limit", "Outstanding_Debt",
             "Amount_invested_monthly", "Monthly_Balance"]

In [19]:
# fixing data types to integer
for col in col_int :
    df[col] = df[col].astype(int)

In [20]:
# fixing data types to float
for col in col_float :
    df[col] = df[col].astype(float)
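
`astype` raises a `ValueError` on the first value it cannot parse, so these loops only succeed once every stray character has been removed. A minimal sketch of a more forgiving alternative, assuming it is acceptable to turn unparseable values into `NaN`, is `pd.to_numeric` with `errors="coerce"`. The sample values are made up, modelled on the noisy `Annual_Income` entries shown earlier.

```python
import pandas as pd

# Made-up sample: two values with a trailing "_" and one that is not a number
raw = pd.Series(["19114.12", "39628.99_", "68917.94_", "not_a_number"])

# Strip stray underscores, then coerce anything still unparseable to NaN
parsed = pd.to_numeric(raw.str.strip("_"), errors="coerce")
print(parsed)
```

Coercion hides bad values rather than reporting them, so it is worth comparing the null counts before and after the conversion.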

In [21]:
# checking the data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  int32  
 1   Customer_ID               100000 non-null  int32  
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  int32  
 5   SSN                       94428 non-null   object 
 6   Occupation                92938 non-null   object 
 7   Annual_Income             100000 non-null  float64
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int32  
 10  Num_Credit_Card           100000 non-null  int32  
 11  Interest_Rate             100000 non-null  float64
 12  Num_of_Loan               100000 non-null  int32  
 13  Type_of_Loan              88592 non-null   ob

## <p style="background-color:#7da6ff; font-family:arial; color:#09042b; font-size:175%; text-align:center; border-radius:10px 10px;">Handling Missing Values</p>

<a id="5"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:#09042b; background-color:#7da6ff" data-toggle="popover">Content</a>

### Defining a function for null check

In [22]:
def null_check (col) :
    print("Column name :", col)
    print("--"*16)
    print("Number of nulls :", df[col].isnull().sum())
    print("Percent of nulls :", '%', round((df[col].isnull().sum()/df.shape[0])*100, 2))
    print("Value_counts :", "\n",df[col].value_counts(dropna=False).head(10))  # shows only the 10 most frequent values

* There are 13 columns with null values in df.
* Now let's work through these null values column by column.

### Name column

In [23]:
# Checking the null values in Name column

null_check("Name")

Column name : Name
--------------------------------
Number of nulls : 9985
Percent of nulls : % 9.98
Value_counts : 
 NaN                   9985
Stevex                  44
Langep                  44
Jessicad                39
Vaughanl                39
Deepa Seetharamanm      38
Danielz                 38
Jessica Wohlt           38
Raymondr                38
Strupczewskid           37
Name: Name, dtype: int64


In [24]:
# Grouping by Customer_ID shows that a customer's missing names can be filled from that customer's other rows.
# Forward-fill, then backward-fill within each customer so values never leak across customers.

df["Name"] = df.groupby("Customer_ID")["Name"].transform(lambda s: s.ffill().bfill())
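
To see why both fills should stay inside each customer group, consider a toy frame in which one customer has no name at all. This is a sketch with made-up data; the column names mirror the dataset and everything else is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: customer 2 has no recorded name in any row
demo = pd.DataFrame({
    "Customer_ID": [1, 1, 1, 2, 2, 3, 3],
    "Name": [np.nan, "Aaron", np.nan, np.nan, np.nan, "Nicks", np.nan],
})

# Forward-fill, then backward-fill, strictly within each customer
group_filled = demo.groupby("Customer_ID")["Name"].transform(lambda s: s.ffill().bfill())

# Group-wise forward fill followed by a backward fill over the whole column:
# the second step is no longer grouped, so it borrows from the next customer
leaked = demo.groupby("Customer_ID")["Name"].ffill().bfill()

print(pd.DataFrame({"group_filled": group_filled, "leaked": leaked}))
```

With the fully grouped fill, customer 2 stays null and can be handled explicitly; with the ungrouped backward fill, customer 2 silently receives customer 3's name.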

### SSN column

In [25]:
# Checking the null values in SSN column

null_check("SSN")

Column name : SSN
--------------------------------
Number of nulls : 5572
Percent of nulls : % 5.57
Value_counts : 
 NaN            5572
078-73-5990       8
486-78-3816       8
750-67-7525       8
903-50-0305       8
376-28-6303       8
194-93-5515       8
442-30-8588       8
362-78-8068       8
221-76-9774       8
Name: SSN, dtype: int64


In [26]:
# Null values in the SSN column can be filled with ffill and bfill within each customer, as for the Name column.

df["SSN"] = df.groupby("Customer_ID")["SSN"].transform(lambda s: s.ffill().bfill())

### Occupation column

In [27]:
# Checking the null values in Occupation column

null_check("Occupation")

Column name : Occupation
--------------------------------
Number of nulls : 7062
Percent of nulls : % 7.06
Value_counts : 
 NaN              7062
Lawyer           6575
Architect        6355
Engineer         6350
Scientist        6299
Mechanic         6291
Accountant       6271
Developer        6235
Media_Manager    6232
Teacher          6215
Name: Occupation, dtype: int64


In [28]:
# Null values in the Occupation column can be filled with ffill and bfill within each customer, as for the Name column.

df["Occupation"] = df.groupby("Customer_ID")["Occupation"].transform(lambda s: s.ffill().bfill())

### Monthly_Inhand_Salary column

In [29]:
# Checking the null values in Monthly_Inhand_Salary column

null_check("Monthly_Inhand_Salary")

Column name : Monthly_Inhand_Salary
--------------------------------
Number of nulls : 15002
Percent of nulls : % 15.0
Value_counts : 
 NaN            15002
2295.058333       15
6082.187500       15
6769.130000       15
6358.956667       15
3080.555000       14
4387.272500       13
6639.560000       13
5766.491667       13
536.431250        12
Name: Monthly_Inhand_Salary, dtype: int64


In [30]:
# Checking the null values in Monthly_Inhand_Salary column by grouping Customer_ID

df.groupby("Customer_ID").Monthly_Inhand_Salary.value_counts(dropna=False).head(10)

Customer_ID  Monthly_Inhand_Salary
1006         1331.348333              8
1007         1496.742500              7
             NaN                      1
1008         2655.035833              8
1009         6692.636667              8
1011         8433.546667              7
             NaN                      1
1013         2684.892500              7
             NaN                      1
1014         5197.393333              5
Name: Monthly_Inhand_Salary, dtype: int64

In [31]:
# Null values in the Monthly_Inhand_Salary column can be filled with ffill and bfill within each customer, as for the Name column.

df["Monthly_Inhand_Salary"] = df.groupby("Customer_ID")["Monthly_Inhand_Salary"].transform(lambda s: s.ffill().bfill())

### Type_of_Loan column

In [32]:
# Checking the null values in Type_of_Loan column

null_check("Type_of_Loan")

Column name : Type_of_Loan
--------------------------------
Number of nulls : 11408
Percent of nulls : % 11.41
Value_counts : 
 NaN                        11408
Not Specified               1408
Credit-Builder Loan         1280
Personal Loan               1272
Debt Consolidation Loan     1264
Student Loan                1240
Payday Loan                 1200
Mortgage Loan               1176
Auto Loan                   1152
Home Equity Loan            1136
Name: Type_of_Loan, dtype: int64


In [33]:
# Checking the null values in Type_of_Loan column by grouping Customer_ID

df.groupby("Customer_ID").Type_of_Loan.value_counts(dropna=False).head(10)

Customer_ID  Type_of_Loan                                                                                                       
1006         Credit-Builder Loan, and Payday Loan                                                                                   8
1007         Home Equity Loan, Mortgage Loan, and Student Loan                                                                      8
1008         NaN                                                                                                                    8
1009         Credit-Builder Loan, Student Loan, Not Specified, and Student Loan                                                     8
1011         Personal Loan, Auto Loan, and Auto Loan                                                                                8
1013         Home Equity Loan, Mortgage Loan, Not Specified, and Personal Loan                                                      8
1014         Payday Loan, Mortgage Loan, and Home Equity Loan      

Domain knowledge suggests that the number of loans taken and how they are repaid affect the credit score more than the type of loan does. Other features already capture these effects, and the null values in the Type_of_Loan column cannot be filled with appropriate values (customer 1008, for example, has no value in any of its 8 rows), so we drop the Type_of_Loan column.

In [34]:
df.drop(columns="Type_of_Loan", inplace=True)

### Num_of_Delayed_Payment column

In [35]:
# Checking the null values in Num_of_Delayed_Payment column

null_check("Num_of_Delayed_Payment")

Column name : Num_of_Delayed_Payment
--------------------------------
Number of nulls : 7002
Percent of nulls : % 7.0
Value_counts : 
 NaN     7002
19.0    5481
17.0    5412
16.0    5312
10.0    5309
15.0    5237
18.0    5216
20.0    5089
12.0    5059
9.0     4981
Name: Num_of_Delayed_Payment, dtype: int64


In [36]:
# Checking the null values in Num_of_Delayed_Payment column by grouping Customer_ID

df.groupby("Customer_ID").Num_of_Delayed_Payment.value_counts(dropna=False).head(10)

Customer_ID  Num_of_Delayed_Payment
1006         12.0                      4
             NaN                       1
             10.0                      1
             11.0                      1
             13.0                      1
1007         19.0                      6
             20.0                      1
             21.0                      1
1008         11.0                      5
             NaN                       2
Name: Num_of_Delayed_Payment, dtype: int64

In [37]:
# checking the number of delayed payment values for Customer 1011

df[df.Customer_ID == 1011]

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Credit_Score
22480,39354,1011,January,Terry Wadeu,43,422-13-0011,Writer,104142.56,8433.546667,3,5,5.0,3,15,17.0,15.28,6.0,,1371.8,26.89019,14 Years and 11 Months,Yes,257.738646,292.212704,High_spent_Small_value_payments,553.403317,Standard
22481,39355,1011,February,Terry Wadeu,44,422-13-0011,Writer,104142.56,8433.546667,3,5,5.0,3,15,16.0,15.28,6.0,,1371.8,36.633754,15 Years and 0 Months,Yes,257.738646,,Low_spent_Small_value_payments,228.71773,Standard
22482,39356,1011,March,Terry Wadeu,44,422-13-0011,Writer,104142.56,8433.546667,3,5,5.0,3,20,14.0,15.28,6.0,Standard,1371.8,36.187409,,Yes,257.738646,187.594897,High_spent_Medium_value_payments,648.021124,Standard
22483,39357,1011,April,Terry Wadeu,44,422-13-0011,Writer,104142.56,8433.546667,3,5,5.0,3,16,,15.28,6.0,Standard,1371.8,30.476065,15 Years and 2 Months,Yes,257.738646,275.63571,Low_spent_Large_value_payments,579.980311,Standard
22484,39358,1011,May,Terry Wadeu,44,422-13-0011,Writer,104142.56,8433.546667,3,5,5.0,3,20,17.0,19.28,6.0,Standard,1371.8,35.932885,15 Years and 3 Months,NM,257.738646,487.806206,Low_spent_Medium_value_payments,377.809814,Standard
22485,39359,1011,June,Terry Wadeu,44,422-13-0011,Writer,104142.56,8433.546667,3,5,5.0,3,20,14.0,15.28,6.0,Standard,1371.8,32.785394,15 Years and 4 Months,Yes,257.738646,,Low_spent_Small_value_payments,,Standard
22486,39360,1011,July,Terry Wadeu,44,422-13-0011,Writer,104142.56,8433.546667,3,5,5.0,3,20,11.0,15.28,6.0,Standard,1371.8,33.647849,15 Years and 5 Months,Yes,257.738646,124.299524,High_spent_Large_value_payments,701.316497,Standard
22487,39361,1011,August,Terry Wadeu,44,422-13-0011,Writer,104142.56,8433.546667,3,5,5.0,3,24,14.0,15.28,692.0,Standard,1371.8,36.339502,15 Years and 6 Months,Yes,257.738646,481.799088,Low_spent_Medium_value_payments,383.816932,Standard


In [38]:
# Checking the mode values in Num_of_Delayed_Payment column by grouping Customer_ID

df.groupby("Customer_ID").Num_of_Delayed_Payment.apply(lambda x: x.mode()[0]).head(10)

Customer_ID
1006    12.0
1007    19.0
1008    11.0
1009    18.0
1011    14.0
1013     2.0
1014    18.0
1015    14.0
1017    25.0
1019     8.0
Name: Num_of_Delayed_Payment, dtype: float64

In [39]:
# Null values in Num_of_Delayed_Payment column can be filled with mode by grouping Customer_ID.
    
df["Num_of_Delayed_Payment"] = df.groupby("Customer_ID")["Num_of_Delayed_Payment"].transform(lambda x : x.fillna(x.mode()[0]))
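
The group-wise mode fill can be tried out on a small toy frame. This is a minimal sketch, not the notebook's own code: the values are made up, and the helper `fill_with_group_mode` is a hypothetical name. It also guards against a customer whose values are all missing, a case in which `x.mode()[0]` would raise an error.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "Customer_ID": [1, 1, 1, 1, 2, 2, 2],
    "Num_of_Delayed_Payment": [14.0, np.nan, 14.0, 17.0, 3.0, 3.0, np.nan],
})

def fill_with_group_mode(s):
    """Fill NaNs in one customer's values with that customer's most frequent value."""
    modes = s.mode()              # mode() ignores NaN; it is empty if every value is NaN
    if modes.empty:
        return s                  # nothing to fill with, leave the group unchanged
    return s.fillna(modes.iloc[0])

# transform keeps the original row index, so the result aligns with the frame
toy["filled"] = (
    toy.groupby("Customer_ID")["Num_of_Delayed_Payment"]
       .transform(fill_with_group_mode)
)
print(toy)
```

Because `transform` returns a result aligned to the original rows, it can be assigned straight back to a column.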

### Changed_Credit_Limit column

In [40]:
# Checking the null values in Changed_Credit_Limit column

null_check("Changed_Credit_Limit")

Column name : Changed_Credit_Limit
--------------------------------
Number of nulls : 2091
Percent of nulls : % 2.09
Value_counts : 
 NaN      2091
8.22      133
11.50     127
11.32     126
7.35      121
10.06     121
8.23      115
11.49     113
7.69      110
9.25      110
Name: Changed_Credit_Limit, dtype: int64


In [41]:
# Checking the null values in Changed_Credit_Limit column by grouping Customer_ID

df.groupby("Customer_ID").Changed_Credit_Limit.value_counts(dropna=False).head(10)

Customer_ID  Changed_Credit_Limit
1006         10.66                   8
1007         5.13                    6
             2.13                    1
             12.13                   1
1008         14.11                   8
1009         16.91                   7
             19.91                   1
1011         15.28                   7
             19.28                   1
1013         3.06                    7
Name: Changed_Credit_Limit, dtype: int64

In [42]:
# Null values in Changed_Credit_Limit column can be filled with ffill and bfill method by grouping Customer_ID.
    
# Both ffill and bfill run inside transform so that neither can borrow a value from another customer
df["Changed_Credit_Limit"] = df.groupby("Customer_ID")["Changed_Credit_Limit"].transform(lambda x : x.ffill().bfill())
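
Where the backward fill runs matters. A backward fill chained after a grouped forward fill operates on the whole column, so it can copy a value from the next customer when a customer has no valid value at all. The sketch below uses made-up values to compare the two variants; the names `within` and `leaky` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): customer 2 has no valid value at all
toy = pd.DataFrame({
    "Customer_ID": [1, 1, 1, 2, 2, 3, 3],
    "Changed_Credit_Limit": [np.nan, 5.0, np.nan, np.nan, np.nan, 9.0, np.nan],
})
grouped = toy.groupby("Customer_ID")["Changed_Credit_Limit"]

# Both fills happen inside each customer's group
within = grouped.transform(lambda s: s.ffill().bfill())

# Grouped forward fill, then an ungrouped backward fill on the whole column
leaky = grouped.ffill().bfill()

print(pd.DataFrame({"within": within, "leaky": leaky}))
```

In the `within` result customer 2 stays missing, which makes the problem visible. In the `leaky` result customer 2 silently receives customer 3's value.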

### Num_Credit_Inquiries column

In [43]:
# Checking the null values in Num_Credit_Inquiries column

null_check("Num_Credit_Inquiries")

Column name : Num_Credit_Inquiries
--------------------------------
Number of nulls : 1965
Percent of nulls : % 1.96
Value_counts : 
 4.0    11271
3.0     8890
6.0     8111
7.0     8058
2.0     8028
8.0     7866
1.0     7588
0.0     6972
5.0     5693
9.0     5283
Name: Num_Credit_Inquiries, dtype: int64


In [44]:
# Checking the null values in Num_Credit_Inquiries column by grouping Customer_ID

df.groupby("Customer_ID").Num_Credit_Inquiries.value_counts(dropna=False).head(10)

Customer_ID  Num_Credit_Inquiries
1006         8.0                     7
             NaN                     1
1007         1.0                     7
             1196.0                  1
1008         10.0                    7
             6.0                     1
1009         7.0                     8
1011         6.0                     7
             692.0                   1
1013         1.0                     8
Name: Num_Credit_Inquiries, dtype: int64

In [45]:
# Null values in Num_Credit_Inquiries column can be filled with ffill and bfill method by grouping Customer_ID.
 
df["Num_Credit_Inquiries"] = df.groupby("Customer_ID")["Num_Credit_Inquiries"].transform(lambda x : x.ffill().bfill())

### Credit_Mix column

In [46]:
# Checking the null values in Credit_Mix column

null_check("Credit_Mix")

Column name : Credit_Mix
--------------------------------
Number of nulls : 20195
Percent of nulls : % 20.2
Value_counts : 
 Standard    36479
Good        24337
NaN         20195
Bad         18989
Name: Credit_Mix, dtype: int64


In [47]:
# Checking the descriptive values of Credit_Mix column

df.Credit_Mix.describe()

count        79805
unique           3
top       Standard
freq         36479
Name: Credit_Mix, dtype: object

In [48]:
# Checking the null values in Credit_Mix column by grouping Customer_ID

df.groupby("Customer_ID").Credit_Mix.value_counts(dropna=False).head(10)

Customer_ID  Credit_Mix
1006         Standard      7
             NaN           1
1007         Standard      5
             NaN           3
1008         Standard      6
             NaN           2
1009         Standard      7
             NaN           1
1011         Standard      6
             NaN           2
Name: Credit_Mix, dtype: int64

In [49]:
# Checking the Credit_Mix values for each Customer

df[["Customer_ID","Credit_Mix"]].head(10)

Unnamed: 0,Customer_ID,Credit_Mix
0,3392,
1,3392,Good
2,3392,Good
3,3392,Good
4,3392,Good
5,3392,Good
6,3392,Good
7,3392,Good
8,8625,Good
9,8625,Good


In [50]:
# According to the analysis above null values in Credit_Mix column can be filled with mode by grouping Customer_ID

df["Credit_Mix"] = df.groupby("Customer_ID")["Credit_Mix"].transform(lambda x : x.fillna(x.mode()[0]))

### Credit_History_Age column

In [51]:
# Checking the null values in Credit_History_Age column

null_check("Credit_History_Age")

Column name : Credit_History_Age
--------------------------------
Number of nulls : 9030
Percent of nulls : % 9.03
Value_counts : 
 NaN                       9030
15 Years and 11 Months     446
19 Years and 4 Months      445
19 Years and 5 Months      444
17 Years and 11 Months     443
19 Years and 3 Months      441
17 Years and 9 Months      438
15 Years and 10 Months     436
17 Years and 10 Months     435
15 Years and 9 Months      432
Name: Credit_History_Age, dtype: int64


In [52]:
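# Defining a function to convert the Credit_History_Age values to months

def Cred_Hist_Age (cha) :
    # Missing values arrive as float NaN; return them unchanged
    if isinstance(cha, float) :
        return cha
    # "15 Years and 11 Months" -> parts[0] is the years, parts[3] is the months
    parts = cha.split()
    return int(parts[0])*12 + int(parts[3])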

In [53]:
# Converting the Credit_History_Age values to months

df["Credit_History_Age"] = df.Credit_History_Age.apply(Cred_Hist_Age).astype(float)
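
The same conversion can also be written without a row-wise `apply`. The sketch below is an alternative, not the method used in this notebook: it assumes every non-null value follows the `"<n> Years and <m> Months"` pattern seen in the value counts, and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values in the same text format as Credit_History_Age
ages = pd.Series(["15 Years and 11 Months", np.nan, "0 Years and 4 Months"])

# Pull the two numbers out with named capture groups; unmatched rows stay missing
parts = ages.str.extract(r"(?P<years>\d+)\s+Years?\s+and\s+(?P<months>\d+)\s+Months?")

# Convert the captured text to numbers and combine into a month count
months = pd.to_numeric(parts["years"]) * 12 + pd.to_numeric(parts["months"])
print(months)
```

Rows that do not match the pattern come back as missing rather than raising an error, so unexpected formats are easier to spot.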

In [54]:
# Checking the Credit_History_Age values after converting to months

df["Credit_History_Age"].head(10)

0    265.0
1      NaN
2    267.0
3    268.0
4    269.0
5    270.0
6    271.0
7      NaN
8    319.0
9    320.0
Name: Credit_History_Age, dtype: float64

In [55]:
# Checking the null values in Credit_History_Age column by grouping Customer_ID

df.groupby("Customer_ID")["Credit_History_Age"].value_counts(dropna=False).head(10)

Customer_ID  Credit_History_Age
1006         182.0                 1
             183.0                 1
             184.0                 1
             185.0                 1
             186.0                 1
             187.0                 1
             188.0                 1
             189.0                 1
1007         NaN                   2
             346.0                 1
Name: Credit_History_Age, dtype: int64

In [56]:
# Defining a function to fill null values in Credit_History_Age column incrementally (one month per row)
# TIP: null values in Credit_History_Age column should be replaced with 0 (zero) before using the defined function

def fill_credit_history_age_incremental (x) :
    if x.values[0] == 0 :
        return np.arange((int(x.values[1])-1), (int(x.values[1])-1)+len(x))
    else :
        return np.arange(int(x.values[0]), int(x.values[0])+len(x))

In [57]:
# Filling null values in Credit_History_Age column incrementally

# First replace null values with 0, then apply the function defined above
df["Credit_History_Age"] = df.Credit_History_Age.fillna(0)

# Equivalent alternative using a lambda:
# df["Credit_History_Age"] = df.groupby("Customer_ID")["Credit_History_Age"].transform(lambda x : fill_credit_history_age_incremental(x))

df["Credit_History_Age"] = df.groupby("Customer_ID")["Credit_History_Age"].transform(fill_credit_history_age_incremental)
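
The function above anchors on a customer's first row, or on the second row when the first is missing, so it gives a wrong start when the first two rows are both missing. A minimal sketch of a variant that anchors on the first known value wherever it sits is shown below. It is an assumption-laden alternative: `fill_incremental_from_first_known` is a hypothetical helper, it expects NaN (not 0) for missing values, and it assumes each customer's rows are in month order with one row per month.

```python
import numpy as np
import pandas as pd

def fill_incremental_from_first_known(s):
    """Rebuild a monthly counter from the first non-null value in the group."""
    known = s.notna().to_numpy()
    if not known.any():
        return s                          # no value to anchor on, leave as is
    pos = int(known.argmax())             # position of the first non-null value
    start = s.iloc[pos] - pos             # implied value for the group's first row
    return pd.Series(start + np.arange(len(s)), index=s.index, dtype=float)

# Toy data (hypothetical): customer 1 is missing its first two months
toy = pd.DataFrame({
    "Customer_ID": [1, 1, 1, 1, 2, 2, 2],
    "Credit_History_Age": [np.nan, np.nan, 267.0, 268.0, 319.0, np.nan, 321.0],
})
toy["filled"] = (
    toy.groupby("Customer_ID")["Credit_History_Age"]
       .transform(fill_incremental_from_first_known)
)
print(toy)
```

Like the notebook's function, this overwrites every row of the group with the rebuilt sequence rather than filling only the gaps.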

### Amount_invested_monthly column

In [58]:
# Checking the null values in Amount_invested_monthly column

null_check("Amount_invested_monthly")

Column name : Amount_invested_monthly
--------------------------------
Number of nulls : 4479
Percent of nulls : % 4.48
Value_counts : 
 NaN             4479
10000.000000    4305
0.000000         169
36.662351          1
89.738489          1
59.937259          1
165.180659         1
62.030803          1
215.577059         1
44.611359          1
Name: Amount_invested_monthly, dtype: int64


In [59]:
# Checking the null values in Amount_invested_monthly column by grouping Customer_ID

df.groupby("Customer_ID").Amount_invested_monthly.value_counts(dropna=False).head(10)

Customer_ID  Amount_invested_monthly
1006         45.301068                  1
             51.726244                  1
             56.494982                  1
             60.828288                  1
             61.732715                  1
             66.718248                  1
             90.078423                  1
             95.648648                  1
1007         NaN                        1
             30.373472                  1
Name: Amount_invested_monthly, dtype: int64

In [60]:
# Checking the descriptive values in Amount_invested_monthly column by grouping Customer_ID

df.groupby("Customer_ID").Amount_invested_monthly.describe().head(10)

Customer_ID,count,mean,std,min,25%,50%,75%,max
1006,8.0,66.066077,17.835684,45.301068,55.302797,61.280501,72.558292,95.648648
1007,7.0,1494.809555,3750.671241,30.373472,51.814458,62.812486,133.426005,10000.0
1008,7.0,167.094236,86.267981,74.198069,112.169039,165.204435,184.359666,337.199741
1009,8.0,217.806923,126.209154,101.120201,141.751476,175.009344,262.312412,492.397491
1011,6.0,308.224688,149.717088,124.299524,209.6051,283.924207,434.402492,487.806206
1013,8.0,118.20515,65.066768,53.342282,73.464382,87.144365,168.101341,233.307837
1014,8.0,215.490003,135.541679,49.065258,130.078088,213.118301,244.986151,471.074683
1015,7.0,116.653518,90.805503,32.228848,45.183284,91.991601,167.598153,266.791305
1017,8.0,78.114594,31.791424,31.228813,59.738139,84.253038,101.159509,119.133287
1019,8.0,2571.229443,4585.278561,38.604002,93.159377,108.171798,2616.610348,10000.0


In [61]:
# According to the analysis above, null values in Amount_invested_monthly column can be filled with median by grouping Customer_ID

df["Amount_invested_monthly"] = df.groupby("Customer_ID")["Amount_invested_monthly"].transform(lambda x : x.fillna(x.median()))
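
The describe output above shows why the median is a safer fill value than the mean here: customers 1007 and 1019 each have a 10000.0 entry that pulls their mean far above their typical amounts. The sketch below illustrates the effect on made-up values. Treating 10000.0 as an extreme or placeholder value is an assumption based on how often it appears in the value counts.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one customer with a missing value and one extreme value
toy = pd.DataFrame({
    "Customer_ID": [1007] * 5,
    "Amount_invested_monthly": [30.0, 50.0, np.nan, 130.0, 10000.0],
})
grp = toy.groupby("Customer_ID")["Amount_invested_monthly"]

# Fill the gap with the per-customer median and, for comparison, the mean
median_fill = toy["Amount_invested_monthly"].fillna(grp.transform("median"))
mean_fill = toy["Amount_invested_monthly"].fillna(grp.transform("mean"))

print("median fill:", median_fill.iloc[2])
print("mean fill:  ", mean_fill.iloc[2])
```

The median-based value stays close to the customer's ordinary amounts, while the mean-based value is dominated by the single extreme entry.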

### Payment_Behaviour column

In [62]:
# Checking the null values in Payment_Behaviour column

null_check("Payment_Behaviour")

Column name : Payment_Behaviour
--------------------------------
Number of nulls : 7600
Percent of nulls : % 7.6
Value_counts : 
 Low_spent_Small_value_payments      25513
High_spent_Medium_value_payments    17540
Low_spent_Medium_value_payments     13861
High_spent_Large_value_payments     13721
High_spent_Small_value_payments     11340
Low_spent_Large_value_payments      10425
NaN                                  7600
Name: Payment_Behaviour, dtype: int64


In [63]:
# Checking the null values in Payment_Behaviour column by grouping Customer_ID

df.groupby("Customer_ID").Payment_Behaviour.value_counts(dropna=False).head(10)

Customer_ID  Payment_Behaviour               
1006         Low_spent_Large_value_payments      2
             Low_spent_Small_value_payments      2
             NaN                                 1
             High_spent_Medium_value_payments    1
             High_spent_Small_value_payments     1
             Low_spent_Medium_value_payments     1
1007         Low_spent_Small_value_payments      4
             High_spent_Medium_value_payments    3
             NaN                                 1
1008         High_spent_Small_value_payments     3
Name: Payment_Behaviour, dtype: int64

Based on domain knowledge, other features (such as Num_of_Delayed_Payment, Delay_from_due_date and Payment_of_Min_Amount) already represent payment behaviour. In addition, the null values in the Payment_Behaviour column cannot be filled with appropriate values: as the grouped counts above show, a customer's values vary from month to month. For these reasons, we can drop the Payment_Behaviour column.

In [64]:
df.drop(columns="Payment_Behaviour", inplace=True)

### Monthly_Balance column

In [65]:
# Checking the null values in Monthly_Balance column

null_check("Monthly_Balance")

Column name : Monthly_Balance
--------------------------------
Number of nulls : 1209
Percent of nulls : % 1.21
Value_counts : 
 NaN           1209
312.494089       1
589.699342       1
250.093168       1
289.755075       1
260.625832       1
606.830389       1
111.990521       1
299.545375       1
559.540554       1
Name: Monthly_Balance, dtype: int64


In [66]:
# Checking the null values in Monthly_Balance column by grouping Customer_ID

df.groupby("Customer_ID").Monthly_Balance.value_counts(dropna=False).head(10)

Customer_ID  Monthly_Balance
1006         280.044097         1
             295.614321         1
             309.197763         1
             310.391676         1
             323.966500         1
             328.974496         1
             333.960030         1
             334.864456         1
1007         239.464815         1
             245.618985         1
Name: Monthly_Balance, dtype: int64

In [67]:
# Checking the descriptive values in Monthly_Balance column by grouping Customer_ID

df.groupby("Customer_ID").Monthly_Balance.describe().head(10).T

Customer_ID,1006,1007,1008,1009,1011,1013,1014,1015,1017,1019
count,8.0,8.0,8.0,8.0,7.0,8.0,8.0,8.0,8.0,8.0
mean,314.626667,285.607087,362.407369,523.000149,496.152247,356.811815,458.586852,268.795818,293.41751,299.294272
std,19.571376,35.226049,68.81423,116.960687,170.101308,65.221016,120.817915,71.124471,25.886204,27.896229
min,280.044097,239.464815,218.303843,258.40958,228.71773,242.959128,233.002171,135.526078,268.648817,246.681334
25%,305.801902,260.520345,347.208442,494.750193,380.813373,300.665623,424.090704,240.703969,274.680356,290.918286
50%,317.179088,283.251084,362.466886,566.997305,553.403317,400.155792,465.958553,290.522638,281.107013,294.489864
75%,330.220879,313.095216,409.978715,584.756229,614.000718,403.035431,533.998767,326.355132,314.821846,321.473497
max,334.864456,334.619588,441.305514,629.68687,701.316497,404.877521,605.011597,330.088535,336.553291,333.557795


According to the analysis above, null values in the Monthly_Balance column should first be filled with the interpolate method, applied per customer. By default, interpolate does not fill a null value in a customer's first row, because there is no earlier value to interpolate from. Any nulls that remain should then be filled with that customer's minimum Monthly_Balance.

In [68]:
df["Monthly_Balance"] = df.groupby("Customer_ID")["Monthly_Balance"].transform(lambda x : x.interpolate().fillna(x.min()))
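
The two-step fill can be checked on a single made-up series. This is a minimal sketch with hypothetical values, showing what the default linear `interpolate` fills and what it leaves for the second step.

```python
import numpy as np
import pandas as pd

# Hypothetical balances for one customer: gaps at the start, middle and end
s = pd.Series([np.nan, 300.0, np.nan, 500.0, np.nan])

# Step 1: default linear interpolation fills forward only, so the leading gap remains
step1 = s.interpolate()

# Step 2: fill whatever is left with the customer's minimum observed balance
step2 = step1.fillna(s.min())

print(pd.DataFrame({"raw": s, "interpolated": step1, "final": step2}))
```

The interior gap is filled with the midpoint of its neighbours, the trailing gap repeats the last known value, and only the leading gap falls through to the minimum.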

In [69]:
# Checking the descriptive values in Monthly_Balance column after filling missing values

df.groupby("Customer_ID").Monthly_Balance.describe().head(10).T

Customer_ID,1006,1007,1008,1009,1011,1013,1014,1015,1017,1019
count,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0
mean,314.626667,285.607087,362.407369,523.000149,501.57861,356.811815,458.586852,268.795818,293.41751,299.294272
std,19.571376,35.226049,68.81423,116.960687,158.229342,65.221016,120.817915,71.124471,25.886204,27.896229
min,280.044097,239.464815,218.303843,258.40958,228.71773,242.959128,233.002171,135.526078,268.648817,246.681334
25%,305.801902,260.520345,347.208442,494.750193,382.315153,300.665623,424.090704,240.703969,274.680356,290.918286
50%,317.179088,283.251084,362.466886,566.997305,546.483236,400.155792,465.958553,290.522638,281.107013,294.489864
75%,330.220879,313.095216,409.978715,584.756229,596.990514,403.035431,533.998767,326.355132,314.821846,321.473497
max,334.864456,334.619588,441.305514,629.68687,701.316497,404.877521,605.011597,330.088535,336.553291,333.557795


In [70]:
# Last check for null values in df

df.isnull().sum()

ID                          0
Customer_ID                 0
Month                       0
Name                        0
Age                         0
SSN                         0
Occupation                  0
Annual_Income               0
Monthly_Inhand_Salary       0
Num_Bank_Accounts           0
Num_Credit_Card             0
Interest_Rate               0
Num_of_Loan                 0
Delay_from_due_date         0
Num_of_Delayed_Payment      0
Changed_Credit_Limit        0
Num_Credit_Inquiries        0
Credit_Mix                  0
Outstanding_Debt            0
Credit_Utilization_Ratio    0
Credit_History_Age          0
Payment_of_Min_Amount       0
Total_EMI_per_month         0
Amount_invested_monthly     0
Monthly_Balance             0
Credit_Score                0
dtype: int64

In [71]:
# saving cleaned data as "df_cleaned.csv"
df.to_csv("df_cleaned.csv", index=False)

In [72]:
pd.read_csv("df_cleaned.csv").head()

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Monthly_Balance,Credit_Score
0,5634,3392,January,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,4,3.0,4,3,7.0,11.27,4.0,Good,809.98,26.82262,265,No,49.574949,80.415295,312.494089,Good
1,5635,3392,February,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,4,3.0,4,-1,4.0,11.27,4.0,Good,809.98,31.94496,266,No,49.574949,118.280222,284.629162,Good
2,5636,3392,March,Aaron Maashoh,-500,821-00-0265,Scientist,19114.12,1824.843333,3,4,3.0,4,3,7.0,11.27,4.0,Good,809.98,28.609352,267,No,49.574949,81.699521,331.209863,Good
3,5637,3392,April,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,4,3.0,4,5,4.0,6.27,4.0,Good,809.98,31.377862,268,No,49.574949,199.458074,223.45131,Good
4,5638,3392,May,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,4,3.0,4,6,4.0,11.27,4.0,Good,809.98,24.797347,269,No,49.574949,41.420153,341.489231,Good


## <p style="background-color:#7da6ff; font-family:arial; color:#09042b; font-size:175%; text-align:center; border-radius:10px 10px;">Handling With Outliers</p>

<a id="6"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:#09042b; background-color:#7da6ff" data-toggle="popover">Content</a>

We have now finished filling the missing values in df. The next step, handling outliers, will be covered in the next notebook.