## Table of Contents:

  1. [Import Libraries and setup environment](#Import)
  2. [Read csv data file and create Dataframe](#ReadData)
  3. [Data Imbalance Analysis](#Imbalance)
  4. [Splitting dataframe](#TargetDF)
  5. [Correlation Analysis](#Correlation)
    - [Positive Correlation](#+Corr)
    - [Negative Correlation](#-Corr)
  6. [Segmented Univariate analysis](#SegUni)
    - [NAME_HOUSING_TYPE](#NAME_HOUSING_TYPE)
    - [NAME_FAMILY_STATUS](#NAME_FAMILY_STATUS)
    - [NAME_INCOME_TYPE](#NAME_INCOME_TYPE)
    - [OCCUPATION_TYPE](#OCCUPATION_TYPE)
    - [EDUCATION](#EDUCATION)
  7. [Univariate Analysis](#Uni)
    - [INCOME_TOTAL](#Uni1)
    - [AMT_ANNUITY](#Uni2)
    - [AMT_CREDIT](#Uni3)
    - [REGION_POPULATION_RELATIVE](#Uni4)
    - [DAYS_BIRTH](#Uni5)
  8. [Bivariate Analysis](#Bivariate)
    - [DAYS_BIRTH and DAYS_EMPLOYED](#BA1)
    - [AMT_CREDIT and DAYS_EMPLOYED](#BA2)
    - [AMT_CREDIT and AMT_GOODS_PRICE](#BA3)
    - [AMT_CREDIT and AMT_ANNUITY](#BA4)
    - [AMT_GOODS_PRICE and AMT_ANNUITY](#BA5)
  9. [Analyse Previous application data](#PrevApp)
    - [Data load and basic check](#dlbc)
    - [Creation of merged dataframes](#cmdf)
    - [Segmented Univariate Analysis](#pasua)
    - [Univariate Analysis](#paua)
    - [Bivariate Analysis](#paba)
  10. [Summary](#summary)


<a id='Import'></a>
## Import all the required libraries

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.plotly as py
from plotly import tools
import plotly.graph_objs as go
sns.set(style="darkgrid")
%matplotlib inline
#importing plotly and cufflinks in offline mode
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

In [0]:
# Create a class color for setting print formatting
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

In [0]:
# Function to display plotly offline plots in Jupyter notebook cells
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False)

In [0]:
#Setup the display options so that all the columns are displayed on the screen

pd.options.display.max_columns = 150

<a id='ReadData'></a>
## Read csv data file and create Dataframe


In [0]:
df_cleaned = pd.read_csv("Data_notebook1.csv")

In [0]:
df_cleaned = df_cleaned.drop(columns=['Unnamed: 0'])

<a id='Imbalance'></a>
## Data Imbalance Analysis

In [0]:
df_cleaned.head()

In [0]:
# Plot the class counts and annotate each bar with its share of all rows
ax = sns.countplot(x=df_cleaned['TARGET'])
total = df_cleaned['TARGET'].shape[0]
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    ax.annotate('{:.4f} %'.format(100 * height / total), (p.get_x() + .35 * width, p.get_y() + .4 * height))
plt.title("Data Imbalance %", fontsize=16, color='darkcyan')
plt.tight_layout()
plt.show()

### Conclusion

After data cleanup and imputation:

**TARGET = 0 accounts for 91.6499% of the data**

**TARGET = 1 accounts for 8.3501% of the data**
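
As a minimal sketch, the same percentages can be read directly from `value_counts(normalize=True)` instead of from the plot annotations. The toy frame below is a hypothetical stand-in for `df_cleaned`; its values are illustrative and do not reproduce the figures above.

```python
import pandas as pd

# Hypothetical stand-in for df_cleaned: 11 non-defaulters and 1 defaulter.
toy = pd.DataFrame({"TARGET": [0] * 11 + [1] * 1})

# Share of each class as a percentage of all rows.
class_pct = toy["TARGET"].value_counts(normalize=True).mul(100).round(4)
print(class_pct)
```

On the real data the call would be `df_cleaned['TARGET'].value_counts(normalize=True)`.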


<a id = 'TargetDF'></a>

## Splitting data in separate dataframes

In [0]:
#Create dataframe with TARGET =  1 for further analysis

Target_1_data = df_cleaned[df_cleaned['TARGET'] == 1]

#Create dataframe with TARGET =  0 for further analysis

Target_0_data = df_cleaned[df_cleaned['TARGET'] == 0]

In [0]:
Target_1_data.shape

In [0]:
Target_0_data.shape

In [0]:
Target_1_data.head()

<a id = 'Correlation'></a>

## Correlation Analysis

**In this section we do a correlation analysis to check whether the correlations between numeric variables are similar for the TARGET = 0 and TARGET = 1 datasets**

**We will analyse any mismatches between the two sets of correlations**

In [0]:
corr_0 = Target_0_data.corr()

corr_0_select = corr_0.unstack().sort_values().drop_duplicates()
corr_0_select = corr_0_select[corr_0_select.notna()]
corr_0_select = corr_0_select[corr_0_select != 1]
corr_0_sorted = corr_0_select.sort_values(kind="quicksort")
corr_0_sorted = corr_0_sorted.to_frame()
corr_0_sorted = corr_0_sorted.reset_index()
corr_0_sorted.rename(columns={'level_0': 'Var_1', 'level_1': 'Var_2', 0 : 'Corr'}, inplace=True)

In [0]:
plt.figure(figsize=(24,24))
sns.heatmap(corr_0)
plt.title('Correlation between variables for Non-Defaulters')
plt.show()

In [0]:
corr_1 = Target_1_data.corr()

corr_1_select = corr_1.unstack().sort_values().drop_duplicates()
corr_1_select = corr_1_select[corr_1_select.notna()]
corr_1_select = corr_1_select[corr_1_select != 1]
corr_1_sorted = corr_1_select.sort_values(kind="quicksort")
corr_1_sorted = corr_1_sorted.to_frame()
corr_1_sorted = corr_1_sorted.reset_index()
corr_1_sorted.rename(columns={'level_0': 'Var_1', 'level_1': 'Var_2', 0 : 'Corr'}, inplace=True)

In [0]:
plt.figure(figsize=(24,24))
sns.heatmap(corr_1)
plt.title('Correlation between variables for Defaulters')
plt.show()

<a id = '+Corr'></a>

### Positive correlation

In [0]:
corr_0_sorted.tail(10)

In [0]:

corr_1_sorted.tail(10)

### +ve Correlation observation for variables

**The top 10 positive correlations between variables range from 0.861 to 0.998, and both datasets (defaulter and non-defaulter) show similar correlations.**


**The variable pairs in the top 10 match across both datasets, with their correlations differing by no more than about 0.01.**
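
The pair tables above are built with `unstack().drop_duplicates()`, which removes mirrored pairs by value and could in principle also drop two different pairs that happen to share the same correlation. A hedged alternative sketch is to keep only the upper triangle of the correlation matrix, so each pair appears exactly once. The toy columns `A`, `B`, `C` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data: B is almost a multiple of A, C is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "A": a,
    "B": 2 * a + rng.normal(scale=0.1, size=200),
    "C": rng.normal(size=200),
})

corr = toy.corr()

# Boolean mask that is True strictly above the diagonal.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

pairs = (
    corr.where(upper)                  # mask the diagonal and lower triangle
    .stack()                           # flatten to a (Var_1, Var_2) indexed Series
    .dropna()                          # discard any masked entries that remain
    .rename("Corr")
    .rename_axis(["Var_1", "Var_2"])
    .reset_index()
    .sort_values("Corr")
    .reset_index(drop=True)
)
print(pairs)
```

`pairs.tail(10)` and `pairs.head(10)` would then play the role of the positive and negative top-10 tables.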

<a id = '-Corr'></a>

### Negative correlation

In [0]:
corr_0_sorted.head(10)

In [0]:
corr_1_sorted.head(10)

### -ve Correlation observation for variables

**The top 10 negative correlations between variables range from -0.275 to -0.999, and both datasets (defaulter and non-defaulter) show similar correlations.**

**The following combinations of variables have a stronger negative correlation in the defaulter dataset than in the non-defaulter dataset:**


  - **DAYS_EMPLOYED & FLAG_DOCUMENT_3: -0.282129**

  - **HOUR_APPR_PROCESS_START & REGION_RATING_CLIENT_W_CITY: -0.275582**
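
One way to find such mismatches systematically, rather than by reading the two tables side by side, is to join the pair tables and sort by the absolute difference. The sketch below assumes tables shaped like `corr_0_sorted` and `corr_1_sorted` (`Var_1`, `Var_2`, `Corr`); the helper `normalise_pairs` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def normalise_pairs(df):
    """Order each (Var_1, Var_2) pair alphabetically so mirrored pairs match."""
    out = df.copy()
    ordered = np.sort(out[["Var_1", "Var_2"]].to_numpy(dtype=object), axis=1)
    out["Var_1"], out["Var_2"] = ordered[:, 0], ordered[:, 1]
    return out

# Hypothetical pair tables for non-defaulters and defaulters.
corr_nondef = pd.DataFrame({
    "Var_1": ["A", "A", "B"], "Var_2": ["B", "C", "C"], "Corr": [0.90, -0.10, -0.20],
})
corr_def = pd.DataFrame({
    "Var_1": ["A", "C", "B"], "Var_2": ["B", "A", "C"], "Corr": [0.91, -0.28, -0.21],
})

comparison = normalise_pairs(corr_nondef).merge(
    normalise_pairs(corr_def), on=["Var_1", "Var_2"], suffixes=("_nondef", "_def")
)
comparison["abs_diff"] = (comparison["Corr_def"] - comparison["Corr_nondef"]).abs()
comparison = comparison.sort_values("abs_diff", ascending=False).reset_index(drop=True)
print(comparison)
```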

<a id = 'SegUni' ></a>

## Segmented Univariate analysis

<a id = 'NAME_FAMILY_STATUS'></a>
### Univariate Analysis based on NAME_FAMILY_STATUS of Target 0 and 1

In [0]:
plt.figure(figsize=(12,5))

plt.subplot(1,2,1);
Target_0_data.NAME_FAMILY_STATUS.value_counts(normalize =True).plot(kind= 'bar', color= ['g','b','k','r','c','m']);
plt.title('Impact of Matrimonial status on Target_0')
plt.xticks(rotation = 45)
plt.subplot(1,2,2);
Target_1_data.NAME_FAMILY_STATUS.value_counts(normalize =True).plot(kind= 'bar', color= ['g','b','k','r','c','m']);
plt.title('Impact of Matrimonial status on Target_1')
plt.xticks(rotation = 45)
plt.show()

**Observation :**
  
  - **The proportion of Married customers is lower among defaulters than among non-defaulters**
  - **For Single and Civil marriage customers, however, the proportion among defaulters is slightly higher.**
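
Because `value_counts()` orders categories by frequency, the bars in the two panels are not guaranteed to appear in the same order. As a sketch, a column-normalised crosstab puts both groups on the same category index, which makes the proportions directly comparable. The toy rows below are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df_cleaned with one categorical column.
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "NAME_FAMILY_STATUS": [
        "Married", "Married", "Married", "Married", "Single", "Widow",
        "Married", "Married", "Single", "Single",
    ],
})

# normalize="columns" makes each TARGET column sum to 1,
# i.e. the share of each family status within that group.
shares = pd.crosstab(toy["NAME_FAMILY_STATUS"], toy["TARGET"], normalize="columns")
print(shares)
```

Calling `shares.plot(kind="bar")` would draw the two groups as adjacent bars per category.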
  

<a id = 'NAME_HOUSING_TYPE'></a>
### Univariate Analysis based on NAME_HOUSING_TYPE of Target 0 and 1

In [0]:
plt.figure(figsize=(12,5))

plt.subplot(1,2,1);
Target_0_data.NAME_HOUSING_TYPE.value_counts(normalize =True).plot(kind= 'bar', color= ['g','b','k','r','c','m']);
plt.title('Housing Type for Target_0')
plt.xticks(rotation = 45)
plt.subplot(1,2,2);
Target_1_data.NAME_HOUSING_TYPE.value_counts(normalize =True).plot(kind= 'bar', color= ['g','b','k','r','c','m']);
plt.title('Housing Type for Target_1')
plt.xticks(rotation = 45)
plt.show()

**Observation :**
  
  - **It seems that customers living with parents make up a slightly higher proportion of defaulters than of non-defaulters**
  - **Similarly, Municipal and Rented apartment accommodation show a slightly higher proportion among defaulters**

<a id = 'NAME_INCOME_TYPE'></a>
### Univariate Analysis based on NAME_INCOME_TYPE of Target 0 and 1

In [0]:
plt.figure(figsize=(12,5))

plt.subplot(1,2,1);
Target_0_data.NAME_INCOME_TYPE.value_counts(normalize =True).plot(kind= 'bar', color= ['g','b','k','r','c','m']);
plt.title('Income source impact on Target_0')
plt.xticks(rotation = 45)
plt.subplot(1,2,2);
Target_1_data.NAME_INCOME_TYPE.value_counts(normalize =True).plot(kind= 'bar', color= ['g','b','k','r','c','m']);
plt.title('Income source impact on Target_1')
plt.xticks(rotation = 45)
plt.show()

**Observation :**
  
  - **It seems that customers who are currently working make up a higher proportion of defaulters**
  - **Pensioners seem to pay back their loans, so their proportion among defaulters is lower**
  - **Similarly, State servants show comparatively less tendency towards defaulting**

<a id = 'OCCUPATION_TYPE'></a>
### Univariate Analysis based on OCCUPATION_TYPE of Target 0 and 1

In [0]:
plt.figure(figsize=(12,5))

plt.subplot(1,2,1);
Target_0_data.OCCUPATION_TYPE.value_counts(normalize =True).plot(kind= 'bar'); 
plt.title('Impact of Occupation on Target_0')
plt.subplot(1,2,2);
Target_1_data.OCCUPATION_TYPE.value_counts(normalize =True).plot(kind= 'bar');
plt.title('Impact of Occupation on Target_1')
plt.show()

**Observation :**
  
  - **It seems that customers whose profession is Laborers make up a higher proportion of defaulters**
  - **Another observation is that IT/HR staff have a lower proportion of defaulting**


<a id = 'EDUCATION'></a>
### Univariate Analysis based on EDUCATION of Target 0 and 1

In [0]:
plt.figure(figsize=(12,5))

plt.subplot(1,2,1);
Target_0_data.NAME_EDUCATION_TYPE.value_counts(normalize =True).plot(kind= 'bar', color= ['g','b','k','r','c']);
plt.title('Impact of Education on Target_0')
plt.xticks(rotation = 45)
plt.subplot(1,2,2);
Target_1_data.NAME_EDUCATION_TYPE.value_counts(normalize =True).plot(kind= 'bar', color= ['g','b','k','r','c']); 
plt.title('Impact of Education on Target_1')
plt.xticks(rotation = 45)
plt.show()

**Observation :**
  
  - **Customers with Secondary education make up a higher proportion of defaulters than of non-defaulters**
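
The plots in this section compare how each group is composed. A complementary view, sketched below under the assumption that `TARGET` is coded 0/1, is the default rate within each category, which is simply the mean of `TARGET` per group. The toy values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df_cleaned.
toy = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Secondary"] * 5 + ["Higher"] * 4,
    "TARGET": [0, 0, 0, 1, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 column is the fraction of defaulters; size gives the group count.
rate = (
    toy.groupby("NAME_EDUCATION_TYPE")["TARGET"]
    .agg(default_rate="mean", n="size")
    .sort_values("default_rate", ascending=False)
)
print(rate)
```

Keeping the group size `n` next to the rate helps avoid over-reading rates that come from very small categories.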



<a id = 'Uni'></a>

## Univariate Analysis

<a id = 'Uni1'></a>
### Univariate Analysis - INCOME_TOTAL for Target 0 and Target 1

In [0]:
plt.figure(figsize=(12,5))
plt.subplot(121)
sns.distplot(Target_0_data.AMT_INCOME_TOTAL.dropna(), kde=False, bins = 10)
plt.title("Distribution of INCOME_TOTAL of Target 0",color="g")
plt.xticks(rotation = 90)
plt.subplot(122)
sns.distplot(Target_1_data.AMT_INCOME_TOTAL.dropna(), kde=False, bins = 10)
plt.title("Distribution of INCOME_TOTAL of Target 1",color="g")
plt.xticks(rotation = 90)

plt.show()

In [0]:
plt.figure(figsize=(12,5))
plt.subplot(121)
sns.boxplot(Target_0_data.AMT_INCOME_TOTAL)
plt.subplot(122)
sns.boxplot(Target_1_data.AMT_INCOME_TOTAL)
plt.show()

**Observation :**
  
  - **The income of the customers seems to have a similar distribution for both defaulters and non-defaulters**
  - **The average income seems to be around 140K for both segments**



<a id = 'Uni2'></a>
### Univariate Analysis - Annuity for Target 0 and Target 1

In [0]:
fig = plt.figure(figsize=(18,6))
plt.subplot(121)
sns.distplot(Target_0_data.AMT_ANNUITY.dropna(), kde=False, bins = 10)
plt.title("Distribution of ANNUITY of Target 0",color="r")

plt.subplot(122)
sns.distplot(Target_1_data.AMT_ANNUITY.dropna(), kde=False, bins = 10)
plt.title("Distribution of ANNUITY of Target 1",color="r")
plt.show()

In [0]:
plt.figure(figsize=(12,5))
plt.subplot(121)
sns.boxplot(Target_0_data.AMT_ANNUITY)
plt.subplot(122)
sns.boxplot(Target_1_data.AMT_ANNUITY)
plt.show()

**Observation :**
  

  - **The defaulters seem to have more outliers compared to non-defaulters**
  - **The average annuity is similar for both defaulters and non-defaulters, at around 30K**


<a id = 'Uni3'></a>
### Univariate Analysis - CREDIT for Target 0 and Target 1

In [0]:
fig = plt.figure(figsize=(18,6))
plt.subplot(121)
sns.distplot(Target_0_data.AMT_CREDIT.dropna(), kde=False, bins = 10)
plt.title("Distribution of CREDIT of Target 0",color="g")

plt.subplot(122)
sns.distplot(Target_1_data.AMT_CREDIT.dropna(), kde=False, bins = 10)
plt.title("Distribution of CREDIT of Target 1",color="g")
plt.show()

In [0]:
plt.figure(figsize=(12,5))
plt.subplot(121)
sns.boxplot(Target_0_data.AMT_CREDIT)
plt.xticks(rotation = 45)
plt.subplot(122)
sns.boxplot(Target_1_data.AMT_CREDIT)
plt.xticks(rotation = 45)
plt.show()

**Observation :**
  

  - **The defaulters seem to have more outliers compared to non-defaulters**
  - **The upper fence for defaulters is around 1.2M, compared to around 1.5M for non-defaulters**
  - **A large number of defaulters have credit between 200K and 600K**
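
The upper fence mentioned above is the whisker limit used by a standard box plot: the third quartile plus 1.5 times the interquartile range. A minimal sketch of that calculation follows; the helper name `upper_fence` and the toy credit values are hypothetical.

```python
import pandas as pd

def upper_fence(values):
    """Return Q3 + 1.5 * IQR, the usual box-plot upper whisker limit."""
    q1, q3 = values.quantile([0.25, 0.75])
    return q3 + 1.5 * (q3 - q1)

# Hypothetical credit amounts with one extreme value.
credit = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

fence = upper_fence(credit)
outliers = credit[credit > fence]
print(fence, outliers.tolist())
```

Applied to `Target_0_data.AMT_CREDIT` and `Target_1_data.AMT_CREDIT`, this would give the fence values numerically instead of reading them off the plot.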


<a id = 'Uni4'></a>
### Univariate Analysis - REGION_POPULATION for Target 0 and Target 1

In [0]:
fig = plt.figure(figsize=(18,6))
plt.subplot(121)
sns.distplot(Target_0_data.REGION_POPULATION_RELATIVE.dropna(), kde=False, bins = 10)
plt.title("Distribution of REGION_POPULATION_RELATIVE of Target 0",color="g")

plt.subplot(122)
sns.distplot(Target_1_data.REGION_POPULATION_RELATIVE.dropna(), kde=False, bins = 10)
plt.title("Distribution of REGION_POPULATION_RELATIVE of Target 1",color="g")
plt.show()


**Observation :**
  
   - **It seems that the majority of defaulters are from less populated areas; the proportions at REGION_POPULATION_RELATIVE values of 0.05 and 0.07 are lower than for non-defaulters**

<a id = 'Uni5'></a>
### Univariate Analysis - DAYS_BIRTH for Target 0 and Target 1

In [0]:
fig = plt.figure(figsize=(18,6))
plt.subplot(121)
sns.distplot(Target_0_data.DAYS_BIRTH.dropna(), kde=False, bins = 90)
plt.title("Distribution of DAYS_BIRTH of Target 0",color="g")

plt.subplot(122)
sns.distplot(Target_1_data.DAYS_BIRTH.dropna(), kde=False, bins = 90)
plt.title("Distribution of DAYS_BIRTH of Target 1",color="g")
plt.show()

In [0]:
plt.figure(figsize=(12,5))
plt.subplot(121)
sns.boxplot(Target_0_data.DAYS_BIRTH)
plt.xticks(rotation = 45)
plt.subplot(122)
sns.boxplot(Target_1_data.DAYS_BIRTH)
plt.xticks(rotation = 45)
plt.show()

**Observation :**
  
   - **The median age of defaulters is around 14000 days, which is roughly 38 years**
   - **It looks like the proportion of defaulters decreases as age increases**
   - **Younger customers seem to have a higher proportion of defaulters**
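
To make the age observations easier to check, the day counts can be converted to years and grouped into bands. The sketch below assumes `DAYS_BIRTH` already holds positive day counts, as the plots above suggest; the column names `AGE_YEARS` and `AGE_BAND`, the band edges, and the toy rows are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df_cleaned with positive DAYS_BIRTH values.
toy = pd.DataFrame({
    "DAYS_BIRTH": [8000, 9000, 12000, 14000, 16000, 20000, 22000, 23000],
    "TARGET": [1, 0, 1, 0, 0, 0, 0, 0],
})

# Convert days to years (365.25 accounts for leap years on average).
toy["AGE_YEARS"] = toy["DAYS_BIRTH"] / 365.25

# Ten-year age bands, then the default rate (mean of TARGET) per band.
toy["AGE_BAND"] = pd.cut(toy["AGE_YEARS"], bins=[20, 30, 40, 50, 60, 70])
rate_by_age = toy.groupby("AGE_BAND", observed=False)["TARGET"].mean()
print(rate_by_age)
```

With this conversion, 14000 days corresponds to about 38.3 years.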

<a id = 'Bivariate'></a>

## Bivariate Analysis

**In this section we do bivariate analysis to find the correlation between pairs of variables**

<a id = 'BA1'></a>
## Bivariate analysis for DAYS_BIRTH and DAYS_EMPLOYED

In [0]:
sns.scatterplot( x = 'DAYS_BIRTH',y = 'DAYS_EMPLOYED',data = Target_0_data[Target_0_data['DAYS_EMPLOYED']< 300000],
               label = 'Target = 0')
sns.scatterplot( x = 'DAYS_BIRTH',y = 'DAYS_EMPLOYED',data = Target_1_data[Target_1_data['DAYS_EMPLOYED']< 300000],
               label = 'Target = 1')
plt.show()

In [0]:
# Correlation between DAYS_EMPLOYED and AMT_CREDIT, with and without the
# DAYS_EMPLOYED >= 300000 rows (treated in this notebook as retired applicants)
corr_0_non_retired = Target_0_data[Target_0_data['DAYS_EMPLOYED']< 300000][['DAYS_EMPLOYED','AMT_CREDIT']].corr()['AMT_CREDIT'].iloc[0]
corr_0_all_data = Target_0_data[['DAYS_EMPLOYED','AMT_CREDIT']].corr()['AMT_CREDIT'].iloc[0]

corr_1_non_retired = Target_1_data[Target_1_data['DAYS_EMPLOYED']< 300000][['DAYS_EMPLOYED','AMT_CREDIT']].corr()['AMT_CREDIT'].iloc[0]
corr_1_all_data = Target_1_data[['DAYS_EMPLOYED','AMT_CREDIT']].corr()['AMT_CREDIT'].iloc[0]

print(color.BOLD + color.BLUE + 'The correlation between employment and loan amount for non-defaulters is {0}, but if we consider only non-retired applicants the correlation is: {1}'.format(round(corr_0_all_data,4),round(corr_0_non_retired,4) )
      + color.END)

print(color.BOLD + color.BLUE + 'The correlation between employment and loan amount for loan defaulters is {0}, but if we consider only non-retired applicants the correlation is: {1}'.format(round(corr_1_all_data,4),round(corr_1_non_retired,4) )
      + color.END)



### Observation:

**For this analysis, I have considered data for non-retired applicants as well as all data**

For payment defaulters (TARGET = 1), loan defaults decrease as age and days employed increase.

So it might be the case that younger people with a short employment history tend to default more.

Also, for loan defaulters, the correlation between employment period and loan amount is around 0.1124, which is higher than the value for non-defaulters (0.0818). (This observation applies only to non-retired applicants.)
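
The gap between the "all data" and "non-retired" correlations comes from the very large placeholder value in `DAYS_EMPLOYED` (the notebook later treats 365243 as missing in the previous-application data). As a sketch with hypothetical toy rows, a handful of such placeholder rows is enough to flip the sign of the correlation:

```python
import pandas as pd

SENTINEL = 365243  # placeholder value; rows at or above 300000 are filtered out above

# Hypothetical applicants: four employed, two carrying the placeholder value.
toy = pd.DataFrame({
    "DAYS_EMPLOYED": [500, 1500, 2500, 4000, SENTINEL, SENTINEL],
    "AMT_CREDIT": [200000.0, 350000.0, 500000.0, 800000.0, 300000.0, 250000.0],
})

# Correlation on every row, placeholder included.
corr_all = toy["DAYS_EMPLOYED"].corr(toy["AMT_CREDIT"])

# Same filter as used in the cells above.
working = toy[toy["DAYS_EMPLOYED"] < 300000]
corr_working = working["DAYS_EMPLOYED"].corr(working["AMT_CREDIT"])

print(round(corr_all, 4), round(corr_working, 4))
```

`Series.corr` is used here as a shorter equivalent of selecting one entry from the 2x2 matrix returned by `DataFrame.corr()`.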

<a id = 'BA2'></a>
## Bivariate analysis for AMT_CREDIT and DAYS_EMPLOYED

In [0]:
sns.scatterplot( x = 'AMT_CREDIT',y = 'DAYS_EMPLOYED',data = Target_0_data[Target_0_data['DAYS_EMPLOYED']< 300000],
               label = 'Target = 0')
sns.scatterplot( x = 'AMT_CREDIT',y = 'DAYS_EMPLOYED',data = Target_1_data[Target_1_data['DAYS_EMPLOYED']< 300000],
               label = 'Target = 1')
plt.show()




### Observation:

**For this analysis, I have considered data for non-retired applicants only**

For payment defaulters (TARGET = 1), the credit amount of the loan seems to be lower at higher experience levels.
Loan defaults are also concentrated below a credit amount of 1.5M and below 10000 days of employment (roughly 27 years of job experience).

If we concentrate on non-retired applicants, the correlation between DAYS_EMPLOYED and AMT_CREDIT for defaulters is around 0.1124, as computed in the previous subsection.

<a id = 'BA3'></a>
## Bivariate analysis for AMT_CREDIT and AMT_GOODS_PRICE

In [0]:
sns.scatterplot( x = 'AMT_GOODS_PRICE',y = 'AMT_CREDIT',data = Target_0_data,label = 'Target = 0')
sns.scatterplot( x = 'AMT_GOODS_PRICE',y = 'AMT_CREDIT',data = Target_1_data,label = 'Target = 1')
plt.show()

# Use dedicated names so the corr_0 / corr_1 correlation matrices are not overwritten
corr_0_goods = Target_0_data[['AMT_GOODS_PRICE','AMT_CREDIT']].corr()['AMT_CREDIT'].iloc[0]
corr_1_goods = Target_1_data[['AMT_GOODS_PRICE','AMT_CREDIT']].corr()['AMT_CREDIT'].iloc[0]

print(color.BOLD + color.BLUE + 'The correlation between goods price and loan amount for non-defaulters is {0} \n but for defaulters it is: {1}'.format(round(corr_0_goods,4),round(corr_1_goods,4) )+ color.END)


### Observation:

Credit amount and goods price are highly correlated for both defaulters and non-defaulters.

So as the goods price increases, the loan amount also increases, which is logical.

<a id = 'BA4'></a>
## Bivariate analysis for AMT_CREDIT and AMT_ANNUITY

In [0]:
sns.scatterplot( x = 'AMT_ANNUITY',y = 'AMT_CREDIT',data = Target_0_data, label = 'Target = 0')
sns.scatterplot( x = 'AMT_ANNUITY',y = 'AMT_CREDIT',data = Target_1_data, label = 'Target = 1')
plt.show()

# Use dedicated names so the corr_0 / corr_1 correlation matrices are not overwritten
corr_0_annuity = Target_0_data[['AMT_ANNUITY','AMT_CREDIT']].corr()['AMT_CREDIT'].iloc[0]
corr_1_annuity = Target_1_data[['AMT_ANNUITY','AMT_CREDIT']].corr()['AMT_CREDIT'].iloc[0]

print(color.BOLD + color.BLUE + 'The correlation between AMT_ANNUITY (EMI) and loan amount for non-defaulters is {0} \n but for defaulters it is: {1}'.format(round(corr_0_annuity,4),round(corr_1_annuity,4) )+ color.END)


### Observation:

Credit amount and AMT_ANNUITY (EMI) are highly correlated for both defaulters and non-defaulters.

So as the credit amount increases, the EMI amount also increases, which is logical.

<a id = 'BA5'></a>
## Bivariate analysis for AMT_GOODS_PRICE and AMT_ANNUITY

In [0]:
sns.scatterplot( x = 'AMT_ANNUITY',y = 'AMT_GOODS_PRICE',data = Target_0_data, label = 'Target = 0')
sns.scatterplot( x = 'AMT_ANNUITY',y = 'AMT_GOODS_PRICE',data = Target_1_data, label = 'Target = 1')
plt.show()

# Use dedicated names so the corr_0 / corr_1 correlation matrices are not overwritten
corr_0_goods_annuity = Target_0_data[['AMT_ANNUITY','AMT_GOODS_PRICE']].corr()['AMT_GOODS_PRICE'].iloc[0]
corr_1_goods_annuity = Target_1_data[['AMT_ANNUITY','AMT_GOODS_PRICE']].corr()['AMT_GOODS_PRICE'].iloc[0]

print(color.BOLD + color.BLUE + 'The correlation between AMT_ANNUITY (EMI) and goods price for non-defaulters is {0} \n but for defaulters it is: {1}'.format(round(corr_0_goods_annuity,4),round(corr_1_goods_annuity,4) )+ color.END)


### Observation:

AMT_ANNUITY (EMI) and goods price are highly correlated for both defaulters and non-defaulters.

So as the goods price increases, the EMI amount also increases, which is logical.

<a id = 'PrevApp'></a>
## Analyse Previous application data

<a id = 'dlbc'></a>
### Data load and basic check

In [0]:
df_pa = pd.read_csv("previous_application.csv")

#df_pa = pd.read_csv("/content/drive/My Drive/Colab Notebooks/data/previous_application.csv")

In [0]:
df_pa.shape

In [0]:
df_pa.head()

In [0]:
#Check missing values in the dataset
100*df_pa.isnull().sum()/df_pa.shape[0]

In [0]:
#Delete the columns with 50% or more of their data missing

df_pa = df_pa.loc[:, 100*df_pa.isnull().sum()/df_pa.shape[0] < 50]

### Inspect Datatypes of Variables

In [0]:
#check all dtype counts
df_pa.dtypes.value_counts()

### Inspect columns with object dtype

In [0]:
#print columns for object dtype
df_pa.select_dtypes('object').columns

In [0]:
#print the dtype = object records so that we can have quick look at data to check if any data issues
df_pa[df_pa.select_dtypes('object').columns].head()

**<span style="color:blue">
    As dtype = object columns are categorical variables, they do not need any dtype change
    </span>**

### Inspect columns with int64 dtype

In [0]:
#print columns for int64 dtype
df_pa.select_dtypes('int64').columns

In [0]:
#print the dtype = int64 records so that we can have quick look at data to check if any data issues
df_pa[df_pa.select_dtypes('int64').columns].head()

<a id = 'Fix_1'></a>
**<span style="color:blue">
    DAYS_DECISION holds discrete values.
    We convert them to positive values, as negative values are awkward to interpret and can affect later analysis.
    </span>**

In [0]:
#As DAYS_DECISION is a discrete variable,
#we convert its values to positive, as negative values are awkward to interpret

no_of_rec = df_pa[df_pa['DAYS_DECISION'] >= 0].shape[0]
print(color.BOLD + color.BLUE + 'Total no of records with + value for DAYS_DECISION  : {0}'.format(no_of_rec ) + color.END)
df_pa['DAYS_DECISION']  = np.abs(df_pa['DAYS_DECISION'])



print(color.BOLD + color.BLUE + 'Changed columns DAYS_DECISION values from -ve to absolute' + color.END)

### Inspect columns with float64 dtype

In [0]:
#print columns for float64 dtype
df_pa.select_dtypes('float64').columns

In [0]:
#print the dtype = float64 records so that we can have quick look at data to check if any data issues
df_pa[df_pa.select_dtypes('float64').columns].head()

In [0]:
#Step 1: change values from negative to absolute

df_pa['DAYS_FIRST_DUE']  = np.abs(df_pa['DAYS_FIRST_DUE'])
df_pa['DAYS_LAST_DUE']  = np.abs(df_pa['DAYS_LAST_DUE'])
df_pa['DAYS_TERMINATION']  = np.abs(df_pa['DAYS_TERMINATION'])
df_pa['DAYS_FIRST_DRAWING']  = np.abs(df_pa['DAYS_FIRST_DRAWING'])

print(color.BOLD + color.BLUE + 'Changed columns DAYS_FIRST_DUE, DAYS_LAST_DUE, DAYS_TERMINATION & DAYS_FIRST_DRAWING values from -ve to absolute' + color.END)


In [0]:
df_pa.replace(365243,np.nan, inplace = True)

<a id = 'cmdf'></a>
### Creation of merged dataframes 

In [0]:
cmb_Tar_0 = pd.merge(Target_0_data, df_pa, how = 'inner', on = 'SK_ID_CURR')

In [0]:
cmb_Tar_1 = pd.merge(Target_1_data, df_pa, how = 'inner', on = 'SK_ID_CURR')

<a id = 'pasua'></a>
### Segmented Univariate Analysis

In [0]:
fig, (ax_5, ax_6) = plt.subplots(1,2,figsize=(12,5)) 

sns.countplot(cmb_Tar_0['NAME_CONTRACT_STATUS'],ax = ax_5)
for p_5 in ax_5.patches:
    ax_5.text(p_5.get_x()+.3, p_5.get_height()+5, str(int(p_5.get_height())), fontsize=10,color='green',ha= "center")
ax_5.set_title("Previous contract status for Non-Defaulters",fontsize= 12, color = 'darkcyan')
for tick in ax_5.get_xticklabels():
        tick.set_rotation(45)  


sns.countplot(cmb_Tar_1['NAME_CONTRACT_STATUS'],ax = ax_6)
for p_6 in ax_6.patches:
    ax_6.text(p_6.get_x()+.3, p_6.get_height()+5,str(int(p_6.get_height())), fontsize=10,color='green',ha= "center")
ax_6.set_title("Previous contract status for Defaulters",fontsize= 12, color = 'darkcyan')
    
 
for tick in ax_6.get_xticklabels():
        tick.set_rotation(45)  
plt.tight_layout()

nd_ref = round(100*cmb_Tar_0[cmb_Tar_0['NAME_CONTRACT_STATUS'] == 'Refused'].shape[0]/cmb_Tar_0.shape[0],2)
def_ref = round(100*cmb_Tar_1[cmb_Tar_1['NAME_CONTRACT_STATUS'] == 'Refused'].shape[0]/cmb_Tar_1.shape[0],2)

print(color.BOLD + color.BLUE + 'The previously refused % of applications for non-defaulters is: {0}'.format(nd_ref) + color.END)


print(color.BOLD + color.BLUE + 'The previously refused % of applications for defaulters is:   {0}'.format(def_ref) + color.END)


### Observation:

It seems that TARGET = 1 clients have a comparatively larger proportion of previously refused applications:

  - **The previously refused % of applications for non-defaulters is: 16.75**
  - **The previously refused % of applications for defaulters is:   23.96**

In [0]:
plt.figure(figsize=(12,5))
plt.subplot(121)
sns.countplot(cmb_Tar_0.NAME_CONTRACT_TYPE_y.dropna())
plt.title("Previous contract type for Non-Defaulters",color="g")
plt.xticks(rotation = 45)

plt.subplot(122)
sns.countplot(cmb_Tar_1.NAME_CONTRACT_TYPE_y.dropna())
plt.title("Previous contract type for Defaulters",color="g")
plt.xticks(rotation = 45)
plt.show()

### Observation:

**It seems that for defaulters more of the previous applications were for cash loans**

**For non-defaulters, the proportions of consumer loans and cash loans are similar**


<a id = 'paua'></a>
###  Univariate Analysis

In [0]:
# DAYS_TERMINATION was already converted to absolute values when the data was loaded

plt.figure(figsize=(12,5))
plt.subplot(121)
sns.distplot(cmb_Tar_0['DAYS_TERMINATION'].dropna())
plt.title("Distribution of DAYS_TERMINATION of Non-Defaulters",color="g")
plt.xticks(rotation = 45)

plt.subplot(122)
sns.distplot(cmb_Tar_1['DAYS_TERMINATION'].dropna())
plt.title("Distribution of DAYS_TERMINATION of Defaulters",color="g")
plt.xticks(rotation = 45)
plt.show()



In [0]:
cmb_Tar_0['DAYS_TERMINATION'].dropna().describe()

In [0]:
cmb_Tar_1['DAYS_TERMINATION'].dropna().describe()

### Observation:

**It seems that for defaulters the mean time since previous applications were closed is smaller than for non-defaulters**



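### The previous loan period can be calculated as DAYS_FIRST_DRAWING - DAYS_TERMINATION

Both columns were converted to absolute day counts above, so the difference is the number of days between the first drawing and the termination. We can analyse this period.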
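
As a small worked sketch of this calculation, the toy frame below applies the same two preparation steps used earlier in this notebook (absolute values, then 365243 replaced with NaN) before taking the difference. The column name `LOAN_PERIOD_DAYS` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical previous-application rows in their raw form:
# negative day offsets, with 365243 used as a placeholder.
toy = pd.DataFrame({
    "DAYS_FIRST_DRAWING": [-900.0, 365243.0, -400.0],
    "DAYS_TERMINATION": [-300.0, -100.0, 365243.0],
})

# Same preparation as above: absolute values, then placeholder -> NaN.
toy = toy.abs().replace(365243, np.nan)

# Days between the first drawing and the termination of the previous loan.
toy["LOAN_PERIOD_DAYS"] = toy["DAYS_FIRST_DRAWING"] - toy["DAYS_TERMINATION"]
print(toy)
```

Rows where either date is a placeholder end up as NaN, which is why the cells below call `dropna()` before plotting or describing the period.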

In [0]:


plt.figure(figsize=(12,5))
plt.subplot(121)
sns.distplot((cmb_Tar_0['DAYS_FIRST_DRAWING']-  cmb_Tar_0['DAYS_TERMINATION']).dropna())
plt.title("Distribution of Loan period OF Non-Defaulters",color="g")
plt.xticks(rotation = 45)

plt.subplot(122)
sns.distplot((cmb_Tar_1['DAYS_FIRST_DRAWING']-  cmb_Tar_1['DAYS_TERMINATION']).dropna())
plt.title("Distribution of Loan period of Defaulters",color="g")
plt.xticks(rotation = 45)
plt.show()


In [0]:
(cmb_Tar_0['DAYS_FIRST_DRAWING']-  cmb_Tar_0['DAYS_TERMINATION']).dropna().describe()

In [0]:
(cmb_Tar_1['DAYS_FIRST_DRAWING']-  cmb_Tar_1['DAYS_TERMINATION']).dropna().describe()

### Observation:

**It seems that defaulters have a shorter previous loan duration than non-defaulters**


<a id = 'paba'></a>
###  Bivariate Analysis

In [0]:
sns.scatterplot( x = 'AMT_APPLICATION',y = 'AMT_CREDIT_y',data = cmb_Tar_0,
               label = 'Target = 0')
plt.xticks(rotation = 45)
sns.scatterplot( x = 'AMT_APPLICATION',y = 'AMT_CREDIT_y',data = cmb_Tar_1,
               label = 'Target = 1')
plt.xticks(rotation = 45)
plt.show()

In [0]:
cmb_Tar_0[['AMT_CREDIT_y','AMT_APPLICATION']].corr()

In [0]:
cmb_Tar_1[['AMT_CREDIT_y','AMT_APPLICATION']].corr()

In [0]:
sns.scatterplot( x = 'AMT_APPLICATION',y = 'AMT_CREDIT_x',data = cmb_Tar_0,
               label = 'Target = 0')
plt.xticks(rotation = 45)
sns.scatterplot( x = 'AMT_APPLICATION',y = 'AMT_CREDIT_x',data = cmb_Tar_1,
               label = 'Target = 1')
plt.xticks(rotation = 45)
plt.show()

In [0]:
cmb_Tar_0[['AMT_CREDIT_x','AMT_APPLICATION']].corr()

In [0]:
cmb_Tar_1[['AMT_CREDIT_x','AMT_APPLICATION']].corr()

### Observation:

**The previous application amount and previous credit amount show a positive correlation of around 0.97**


**It also seems that the previous application amount and the current credit amount have a weak positive correlation of 0.092 for non-defaulters and 0.095 for defaulters**
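
The `_x` and `_y` suffixes used in this subsection come from the merge: when both frames contain a column with the same name, pandas appends `_x` to the left (current application) column and `_y` to the right (previous application) column. The sketch below uses hypothetical toy frames to show this, and also that one current application can match several previous ones.

```python
import pandas as pd

# Hypothetical current applications (left side of the merge).
current = pd.DataFrame({
    "SK_ID_CURR": [1, 2],
    "AMT_CREDIT": [500000.0, 250000.0],
})

# Hypothetical previous applications (right side); client 1 has two of them.
previous = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 3],
    "AMT_APPLICATION": [110000.0, 125000.0, 95000.0],
    "AMT_CREDIT": [100000.0, 120000.0, 90000.0],
})

merged = pd.merge(current, previous, how="inner", on="SK_ID_CURR")
print(merged)
```

Because the current credit amount is repeated on every matching row, clients with many previous applications carry more weight in correlations computed on the merged frame.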


<a id = 'summary'></a>

## Summary:

**The dataset is highly imbalanced, with 8.35% of the data for loan defaulters and the remaining 91.65% for non-defaulters.**

In the application dataset:
  - The top 10 positive correlations between numerical variables are consistent across defaulters and non-defaulters
  - For negative correlations, 2 combinations of numerical variables have a stronger correlation for defaulters compared to non-defaulters:

    - DAYS_EMPLOYED & FLAG_DOCUMENT_3 -0.282129
    - HOUR_APPR_PROCESS_START & REGION_RATING_CLIENT_W_CITY -0.275582
    
### Observations from Segmented Univariate analysis


  - **The proportion of Married customers is lower among defaulters than among non-defaulters**
  - **For Single and Civil marriage customers, however, the proportion among defaulters is slightly higher.**
  - **It seems that customers living with parents make up a slightly higher proportion of defaulters than of non-defaulters**
  - **Similarly, Municipal and Rented apartment accommodation show a slightly higher proportion among defaulters**
  - **It seems that customers who are currently working make up a higher proportion of defaulters**
  - **Pensioners seem to pay back their loans, so their proportion among defaulters is lower**
  - **Similarly, State servants show comparatively less tendency towards defaulting**
  - **It seems that customers whose profession is Laborers make up a higher proportion of defaulters**
  - **Another observation is that IT/HR staff have a lower proportion of defaulting**
  - **Customers with Secondary education make up a higher proportion of defaulters than of non-defaulters**
  
  
### Observations from Univariate analysis


  - **The income of the customers seems to have a similar distribution for both defaulters and non-defaulters**
  - **The average income seems to be around 140K for both segments**
  - **For annuity, the defaulters seem to have more outliers compared to non-defaulters**
  - **The average annuity is similar for both defaulters and non-defaulters, at around 30K**
  - **For loan amount, the defaulters seem to have more outliers compared to non-defaulters**
  - **The upper fence for defaulters is around 1.2M, compared to around 1.5M for non-defaulters**
  - **A large number of defaulters have credit between 200K and 600K**
  - **It seems that the majority of defaulters are from less populated areas; the proportions at REGION_POPULATION_RELATIVE values of 0.05 and 0.07 are lower than for non-defaulters**
  - **The median age of defaulters is around 14000 days, which is roughly 38 years**
  - **It looks like the proportion of defaulters decreases as age increases**
  - **Younger customers seem to have a higher proportion of defaulters**
  
  
### Observations from Bivariate analysis

**DAYS_BIRTH and DAYS_EMPLOYED**

  - **The correlation between employment and loan amount for non-defaulters is -0.0663, but if we consider only non-retired applicants the correlation is 0.0818**
  - **The correlation between employment and loan amount for loan defaulters is 0.0018, but if we consider only non-retired applicants the correlation is 0.1124**
  - **For payment defaulters (TARGET = 1), loan defaults decrease as age and days employed increase. So it might be the case that younger people with a short employment history tend to default more.**
  - **Also, for loan defaulters, the correlation between employment period and loan amount is around 0.1124, which is higher than the value for non-defaulters (0.0818). (This observation applies only to non-retired applicants.)**

**AMT_CREDIT and DAYS_EMPLOYED**

  - **For payment defaulters (TARGET = 1), the credit amount of the loan seems to be lower at higher experience levels. Loan defaults are also concentrated below a credit amount of 1.5M and below 10000 days of employment (roughly 27 years of job experience)**

  
  
**AMT_CREDIT and AMT_GOODS_PRICE**

  - **The correlation between goods price and loan amount for non-defaulters is 0.9816, but for defaulters it is 0.9776**
  - **Credit amount and goods price are highly correlated for both defaulters and non-defaulters. So as the goods price increases, the loan amount also increases, which is logical**


**AMT_CREDIT and AMT_ANNUITY**

  - **The correlation between AMT_ANNUITY (EMI) and loan amount for non-defaulters is 0.7609, but for defaulters it is 0.7401**
  - **Credit amount and AMT_ANNUITY (EMI) are highly correlated for both defaulters and non-defaulters. So as the credit amount increases, the EMI amount also increases, which is logical**


**AMT_ANNUITY and AMT_GOODS_PRICE**

  - **The correlation between AMT_ANNUITY (EMI) and goods price for non-defaulters is 0.7604, but for defaulters it is 0.7374**
  
  
### Observations from Previous Application data

  - **The previously refused % of applications for non-defaulters is: 16.75**
  - **The previously refused % of applications for defaulters is: 23.96**
  - **It seems that TARGET = 1 clients have a larger proportion of previously refused applications**
  - **It seems that for defaulters more of the previous applications were for cash loans**
  - **For non-defaulters, the proportions of consumer loans and cash loans are similar**
  - **It seems that for defaulters the mean time since previous applications were closed is smaller than for non-defaulters**
  - **Defaulters have a shorter previous loan duration than non-defaulters**

**Bivariate analysis of AMT_CREDIT_y and AMT_APPLICATION**

   - **The previous application amount and previous credit amount show a positive correlation of around 0.97. It also seems that the previous application amount and the current credit amount have a weak positive correlation of 0.092 for non-defaulters and 0.095 for defaulters**
  
  
  
  
  

