# About Company
Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility for the loan.

# Business problem
The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate the process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can target those customers specifically. A partial data set has been provided.

**Reference**: [Analytics Vidhya - Loan status prediction competition](https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/)

# Import libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from statistics import mode

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input/"))

# Any results you write to the current directory are saved as output.

# Understanding the data

In [None]:
train_df = pd.read_csv("../input/train3/train.csv")
test_df = pd.read_csv("../input/test-file/test.csv")

print('Train data shape', train_df.shape)
print('Test data shape', test_df.shape)
# print(train_df.columns)

In [None]:
train_df.sample(3)

In [None]:
train_df.info()

We have two types of variables in our dataframe: categorical and numerical.

**1. Categorical features**
- Nominal features: Categories with no inherent order (Gender, Married, Self_Employed, Credit_History, Loan_Status).
- Ordinal features: Categories with some order involved (Dependents, Education, Property_Area).

**2. Numerical features**: 
- Continuous features: Values that vary continuously (ApplicantIncome, CoapplicantIncome, LoanAmount).
- Discrete features: Values drawn from a small set of distinct numbers (Loan_Amount_Term, Credit_History). Credit_History is stored as a number but acts as a binary category, which is why it appears in both lists.

There are also many missing values in the above columns. Let's get the exact count and impute them (replace them with appropriate values).

# Null value treatment

In [None]:
def get_null_columns(df):
    """Return the null count per column (columns with nulls only) and plot it."""
    null_cols = df.isna().sum().sort_values(ascending=False)
    null_cols = null_cols[null_cols > 0]
    if len(null_cols):
        plt.title('Count of null values per column')
        # Recent seaborn versions require keyword arguments for x and y
        sns.barplot(x=null_cols.values, y=null_cols.index)
    # Returning an empty Series (not None) keeps callers working when nothing is missing
    return null_cols

def get_num_cols(df):
    return df.select_dtypes(include="number").columns.values

def get_cat_cols(df):
    return df.select_dtypes(exclude="number").columns.values

def treat_null_cols(df):
    """Impute the missing values of df in place."""
    nulls = get_null_columns(df).index.tolist()

    for col in nulls:
        if not pd.api.types.is_numeric_dtype(df[col]):
            # Categorical columns: forward fill, then back fill any leading gaps
            df[col] = df[col].ffill().bfill()
        elif col in ("Credit_History", "Loan_Amount_Term"):
            # Discrete numerical columns: most frequent value
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            # Continuous numerical columns: median
            df[col] = df[col].fillna(df[col].median())
    print("Columns after imputation".center(38, '='))
    df.info()
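
The imputation rules in `treat_null_cols` (median for continuous columns, mode for `Credit_History` and `Loan_Amount_Term`) can be illustrated on a tiny frame. This is only a sketch: the values below are made up for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    "LoanAmount": [100.0, 120.0, np.nan, 500.0],
    "Loan_Amount_Term": [360.0, 360.0, 180.0, np.nan],
})

# Continuous column: the median is not pulled up by the large value 500
loan_amount_fill = toy["LoanAmount"].median()
# Discrete column: the most frequent value
term_fill = toy["Loan_Amount_Term"].mode()[0]

toy["LoanAmount"] = toy["LoanAmount"].fillna(loan_amount_fill)
toy["Loan_Amount_Term"] = toy["Loan_Amount_Term"].fillna(term_fill)

print(toy)
```

The mean of the observed loan amounts here would be 240, twice the median, which suggests why the median is the safer fill value when a column has outliers.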

## Imputing train data

In [None]:
treat_null_cols(train_df)

## Imputing test data

In [None]:
treat_null_cols(test_df)

# Exploratory data analysis

## Univariate analysis

1. For categorical features, we can use frequency tables or bar plots, which show the count of each category in a variable.
2. For numerical features, probability density plots can be used to check the distribution of the variable.

### Loan status 
(Dependent variable)

In [None]:
print(train_df['Loan_Status'].value_counts())

loan_status_count = train_df['Loan_Status'].value_counts(normalize=True)
print(loan_status_count)
ax = sns.barplot(x=loan_status_count.index,y=loan_status_count.values)

The number of approved loans is more than twice the number of rejections, so we are dealing with an imbalanced dataset here.
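
One practical consequence of class imbalance is that plain accuracy can look better than it is: a model that always predicts the majority class already scores the majority share. The sketch below shows this on made-up labels; the 7:3 split is an assumption for illustration, not the actual class ratio.

```python
import pandas as pd

# Hypothetical labels, for illustration only
labels = pd.Series(["Y"] * 7 + ["N"] * 3, name="Loan_Status")

shares = labels.value_counts(normalize=True)
majority_class = shares.idxmax()
# Accuracy of a "model" that always predicts the majority class
baseline_accuracy = shares.max()

print(majority_class, baseline_accuracy)
```

Any classifier built on this data should be compared against such a baseline rather than against 50%.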

### Gender, Married, Self_Employed, Credit_History
(Independent Nominal variables)

In [None]:
plt.figure(1)
plt.subplot(221)
train_df['Gender'].value_counts().plot.pie(title='Total % distribution of each Gender',figsize=(20,10),autopct="%1.1f",explode=[0,.1])
# train_df['Gender'].value_counts(normalize=True).plot.bar(figsize=(20,10), title= 'Gender')

plt.subplot(222)
train_df['Married'].value_counts().plot.pie(title='Total % distribution of each Marital Status',autopct="%1.1f",explode=[0,.1])

plt.subplot(223)
train_df['Self_Employed'].value_counts().plot.pie(title='Total % distribution of Self Employment',autopct="%1.1f",explode=[0,.1])

plt.subplot(224)
train_df['Credit_History'].value_counts().plot.pie(title='Total % distribution of credit history',autopct="%1.1f",explode=[0,.1])

From the above charts we can see that most loan applicants are `Male`, `Married`, not self-employed, and have a credit history.

### Dependents, Education, Property_Area
(Independent Ordinal Variables)

In [None]:
sns.set()
plt.figure(figsize=(20, 10))
# Top row: counts per category; bottom row: percentage share per category
plt.subplot(231)
sns.countplot(x=train_df['Dependents'])
plt.subplot(234)
train_df['Dependents'].value_counts().plot.pie(autopct="%1.1f%%",explode=[0,0,0,0.2])

plt.subplot(232)
sns.countplot(x=train_df['Education'])
plt.subplot(235)
train_df['Education'].value_counts().plot.pie(autopct="%1.1f%%",explode=[0,0.1])

plt.subplot(233)
sns.countplot(x=train_df['Property_Area'])
plt.subplot(236)
train_df['Property_Area'].value_counts().plot.pie(autopct="%1.1f%%",explode=[0,0,0.1])

From the above charts we can infer that:
- Most of the applicants (57.6%) do not have any dependents.
- 78.2% of the applicants are graduates.
- Even though most of the applicants are from Semiurban areas, the other regions are not far behind.

## Independent Numerical Variables

### Continuous values

#### Applicant Income

In [None]:
plt.figure(1,figsize=(16,5))
plt.subplot(121)
print(train_df['ApplicantIncome'].skew())
# histplot with a KDE curve replaces the deprecated distplot
sns.histplot(x=train_df['ApplicantIncome'], kde=True, stat="density")

plt.subplot(122)
sns.boxplot(x=train_df['ApplicantIncome'])

plt.show()

- Applicant income is right skewed (skewness of 6.53; a positive value indicates a long right tail).
- Many outliers are present in the distribution.
- Most applicant incomes are in the range of about 5000.
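
To see how the sign of `Series.skew()` relates to the shape of a distribution, here is a small sketch on made-up incomes (the numbers are hypothetical and not from the dataset). A few very large values stretch the tail to the right, which gives a positive skewness and pulls the mean above the median.

```python
import pandas as pd

# Hypothetical incomes: mostly modest values plus two very large ones
incomes = pd.Series([2500, 3000, 3200, 3800, 4000, 4500, 5000, 20000, 60000])

skewness = incomes.skew()
mean_income = incomes.mean()
median_income = incomes.median()

print("skewness:", skewness)
print("mean:", mean_income, "median:", median_income)
```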

Let us check whether this uneven distribution of income is related to the difference in education level.

In [None]:
plt.figure(figsize=(16,5))
ax = sns.boxplot(data=train_df,x="ApplicantIncome",y="Education")

It can be observed that the variation in applicant income is greater among graduates.

#### Coapplicant Income

In [None]:
plt.figure(1,figsize=(16,5))
plt.subplot(121)
print(train_df['CoapplicantIncome'].skew())
# histplot with a KDE curve replaces the deprecated distplot
sns.histplot(x=train_df['CoapplicantIncome'], kde=True, stat="density")

plt.subplot(122)
sns.boxplot(x=train_df['CoapplicantIncome'])
plt.show()

plt.figure(2,figsize=(16,5))
ax = sns.boxplot(data=train_df,x="CoapplicantIncome",y="Education")

- Similar to applicant income, coapplicant income is also right skewed (7.49) and not normally distributed.
- There are also a number of outliers in the distribution, some of which may be attributable to the level of education.

#### Loan Amount

In [None]:
plt.figure(1)
plt.subplot(121)

# Distribution plots cannot handle NaN.
# Nulls were already imputed above, so this is a safety net and should drop no rows.
train_df = train_df.dropna()
sns.histplot(x=train_df['LoanAmount'], kde=True, stat="density")

plt.subplot(122)
# figsize = (width, height) in inches
train_df['LoanAmount'].plot.box(figsize=(15,7))
print(train_df['LoanAmount'].skew())

The above three columns (ApplicantIncome, CoapplicantIncome and LoanAmount) have outliers and a high level of skewness.

We can later use a `log transformation` to reduce the skewness; it will also help to scale down the outliers.
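
As a rough illustration of why a log transformation helps, the sketch below compares the skewness of a made-up, right-skewed series before and after taking the natural log. The values are hypothetical; on real data the size of the reduction will differ.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, for illustration only
values = pd.Series([1200, 2500, 3000, 3800, 4500, 5000, 8000, 20000, 60000])

skew_before = values.skew()
# The log compresses large values far more than small ones,
# which shortens the right tail
skew_after = np.log(values).skew()

print("before:", skew_before)
print("after: ", skew_after)
```

Note that `np.log` is only defined for strictly positive values; for a column that contains zeros, `np.log1p` (the log of 1 + x) is the usual substitute.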

### Discrete Numeric values

#### Loan Amount Term

In [None]:
plt.figure(figsize=(16,5))
lat = train_df['Loan_Amount_Term'].value_counts(normalize=True)
sns.barplot(x=lat.values,y=lat.index, palette="rocket",orient='h')
print(train_df['Loan_Amount_Term'].skew())

Most applicants opted for a 30-year (360-month) term, followed by a 15-year (180-month) term.

#### Credit history

In [None]:
sns.countplot(x=train_df['Credit_History'],palette="rocket")
print(train_df['Credit_History'].skew())

## Bivariate Analysis

In [None]:
# Removing Loan_ID since it is only an identifier and won't contribute to the analysis
# Also removing Loan_Status, since it is our target variable
cat_columns = get_cat_cols(train_df).tolist()
cat_columns.remove('Loan_ID')
cat_columns.remove('Loan_Status')
n_cat_cols = len(cat_columns)

num_cols = get_num_cols(train_df)
n_num_cols = len(num_cols)

target = 'Loan_Status'

### Categorical vs Target Variable

We will try to find answers to questions like these:
1. Is gender a factor in loan approvals?
2. Does being married decrease your chance of getting a loan?
3. Does having fewer dependents raise your chance?
4. Does approval depend on the region you belong to?

In [None]:
fig,ax = plt.subplots(1,n_cat_cols,figsize=(6*n_cat_cols,8))
sns.set_style("darkgrid")
for i,col in enumerate(cat_columns):
    # Row-normalized cross table: share of each Loan_Status within every category
    ct = pd.crosstab(train_df[col],train_df[target],normalize="index")
    ct.plot.bar(stacked=True,ax=ax[i])

plt.show()

- No significant pattern was observed based on `Gender` or `Self_Employed` status.
- Married applicants have a higher proportion of approved loans.
- The distribution of applicants with 1 or 3+ dependents is similar across both categories of Loan_Status.
- Property area plays a role in approval too: the `Semiurban` region has the highest share of approvals.
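
The stacked bars above rely on `pd.crosstab(..., normalize="index")`, which divides every row by its row total so that each category shows its own approval and rejection shares. A minimal sketch on made-up applications (hypothetical rows, for illustration only):

```python
import pandas as pd

# Hypothetical applications, for illustration only
toy = pd.DataFrame({
    "Married":     ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Loan_Status": ["Y",   "Y",   "Y",   "N",   "Y",  "Y",  "N",  "N"],
})

# Each row sums to 1: the shares of N and Y within that Married category
rates = pd.crosstab(toy["Married"], toy["Loan_Status"], normalize="index")
print(rates)
```

Normalizing by row makes categories of very different sizes comparable, which raw counts would not.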

### Numerical vs Target Variable

Let's create some new features to support the EDA.

Three bins (Low, Medium and High) each for:
1. Applicant Income
2. Coapplicant Income
3. LoanAmount

In [None]:
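groups = ['Low','Medium','High']

def get_categories(x, q1, q3):
    """Bin x by quartiles: below Q1 is Low, from Q1 up to Q3 is Medium, the rest is High."""
    if x < q1:
        return groups[0]
    elif x < q3:
        return groups[1]
    else:
        return groups[2]

for col_name in ['ApplicantIncome','CoapplicantIncome','LoanAmount']:
    q1 = train_df[col_name].quantile(q=0.25)
    q3 = train_df[col_name].quantile(q=0.75)
    # Pass the quartiles explicitly instead of relying on global variables
    train_df[col_name+'_cat'] = train_df[col_name].apply(get_categories, args=(q1, q3))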
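
The same quartile-based binning could also be expressed with `pd.cut`, which avoids a Python-level function call per row. This is an alternative sketch rather than the approach used in this kernel, shown on made-up values; `right=False` makes each bin include its left edge, matching the `x < q1` and `x < q3` comparisons.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

q1, q3 = values.quantile([0.25, 0.75])

# Bins: [-inf, q1) -> Low, [q1, q3) -> Medium, [q3, inf) -> High
binned = pd.cut(
    values,
    bins=[-np.inf, q1, q3, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)
print(binned.tolist())
```

One caveat: `pd.cut` raises an error when two bin edges are equal, so this form needs Q1 and Q3 to differ.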

#### ApplicantIncome

In [None]:
train_df.groupby('Loan_Status')['ApplicantIncome'].mean().plot.bar()

The mean income differs very little between the two classes.

Let's check based on the applicant income category.

In [None]:
cross = pd.crosstab(train_df['ApplicantIncome_cat'],train_df['Loan_Status'],normalize="index")
print(cross)
cross.plot.bar(stacked=True)

It can be inferred that applicant income does not noticeably affect the chances of loan approval. This contradicts our hypothesis that a higher applicant income would mean a higher chance of approval.

#### CoapplicantIncome

In [None]:
cross = pd.crosstab(train_df['CoapplicantIncome_cat'],train_df['Loan_Status'],normalize="index")
print(cross)
cross.plot.bar(stacked=True)

#### LoanAmount

In [None]:
cross = pd.crosstab(train_df['LoanAmount_cat'],train_df['Loan_Status'],normalize="index")
print(cross)
cross.plot.bar(stacked=True)

It can be seen that the proportion of approved loans is higher for Low and Medium Loan Amount as compared to that of High Loan Amount.

## Data preparation

- Let’s drop the bins which we created for EDA. 
- Categorical columns: We will encode the categorical columns to numbers.
- Numerical columns: The continuous numerical variables with outliers and skewness will be transformed by applying a log transformation.

We will also convert the target variable’s categories into 0 and 1 so that we can find its correlation with the numerical variables. Another reason to do so is that some models, such as logistic regression, take only numeric values as input. We will replace N with 0 and Y with 1.

In [None]:
train_df = train_df.drop(['ApplicantIncome_cat', 'CoapplicantIncome_cat', 'LoanAmount_cat'], axis=1)

In [None]:
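# Results are assigned back to the column: replace(..., inplace=True) on a
# selected column may not update the dataframe in recent pandas versions.
# astype(int) makes sure every encoded column ends up numeric.

# Property_Area
col_name="Property_Area"
d = {'Urban':2,'Semiurban':1,'Rural':0}
train_df[col_name] = train_df[col_name].replace(d).astype(int)
test_df[col_name] = test_df[col_name].replace(d).astype(int)

# Self_Employed
col_name="Self_Employed"
d = {'Yes':1,'No':0}
train_df[col_name] = train_df[col_name].replace(d).astype(int)
test_df[col_name] = test_df[col_name].replace(d).astype(int)

# Education
col_name="Education"
d ={'Graduate':1, 'Not Graduate':0}
train_df[col_name] = train_df[col_name].replace(d).astype(int)
test_df[col_name] = test_df[col_name].replace(d).astype(int)

# Married
d = {'Yes':1,'No':0}
train_df['Married'] = train_df['Married'].replace(d).astype(int)
test_df['Married'] = test_df['Married'].replace(d).astype(int)

# Gender
gender = {'Male':1,'Female':0}
train_df['Gender'] = train_df['Gender'].replace(gender).astype(int)
test_df['Gender'] = test_df['Gender'].replace(gender).astype(int)

# Dependents: '3+' becomes 3, then the whole column is converted to integers
d = {'3+':3}
train_df['Dependents'] = train_df['Dependents'].replace(d).astype(int)
test_df['Dependents'] = test_df['Dependents'].replace(d).astype(int)

# Loan status
loan_status = {'N':0,'Y':1}
train_df['Loan_Status'] = train_df['Loan_Status'].replace(loan_status).astype(int)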

In [None]:
# Loan Amount
col_name = 'LoanAmount'
train_df[col_name] = np.log(train_df[col_name])
print(train_df[col_name].skew())
# histplot with a KDE curve replaces the deprecated distplot
ax = sns.histplot(x=train_df[col_name], kde=True, stat="density")

In [None]:
# Applicant Income
col_name = 'ApplicantIncome'
train_df[col_name] = np.log(train_df[col_name])
print(train_df[col_name].skew())
ax = sns.histplot(x=train_df[col_name], kde=True, stat="density")

In [None]:
# Coapplicant Income
col_name = 'CoapplicantIncome'
# Since there are 0 values, we use log1p (log of 1 + x) to avoid infinite values
train_df[col_name] = np.log1p(train_df[col_name])
print(train_df[col_name].skew())
ax = sns.histplot(x=train_df[col_name], kde=True, stat="density")

In [None]:
train_df.info()

Now let's look at the correlation between all the numerical variables. We will use a heat map to visualize the correlation.

Heatmaps visualize data through variations in color. Each cell's color encodes the correlation between a pair of variables, and the color bar shows which shades correspond to a stronger correlation.

In [None]:
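# numeric_only=True skips non-numeric columns such as Loan_ID
matrix = train_df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 12))
# Boolean mask that hides the upper triangle (including the diagonal)
mask = np.triu(np.ones_like(matrix, dtype=bool))
with sns.axes_style("dark"):
    sns.heatmap(matrix, mask=mask, annot=True, vmax=.8, square=True)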

Looking at the correlation map, we can see that `credit history` has the strongest correlation with `Loan Status`, which makes it the most important feature at this stage.

Also, `Applicant Income` and `LoanAmount` are correlated with each other. 
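
Besides reading the heatmap, the features can be ranked by the absolute value of their correlation with the target. The sketch below does this on a made-up frame; the rows are hypothetical and only the column names follow the dataset. Keep in mind that Pearson correlation captures linear association only, so this ranking is a first hint and not a measure of feature importance.

```python
import pandas as pd

# Hypothetical encoded rows, for illustration only
toy = pd.DataFrame({
    "Credit_History":  [1, 1, 1, 0, 1, 0, 1, 0],
    "ApplicantIncome": [8.1, 8.5, 8.3, 8.4, 9.0, 8.2, 8.6, 8.8],
    "Loan_Status":     [1, 1, 1, 0, 1, 0, 0, 0],
})

# Correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["Loan_Status"]
    .drop("Loan_Status")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```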

We will cover baseline model building, evaluation metrics for classification models, feature engineering, model selection and hyperparameter tuning in the next kernel.

In [None]:
# Sending out the cleaned train and test data as output.
# index=False avoids writing the row index as an extra unnamed column
train_df.to_csv('train_clean.csv', index=False)
test_df.to_csv('test_clean.csv', index=False)