# Loan predictions

## Problem Statement

We want to automate the loan eligibility process using the details customers provide when they fill in an online application form. You can find the dataset [here](https://drive.google.com/file/d/1h_jl9xqqqHflI5PsuiQd_soNYxzFfjKw/view?usp=sharing). These details include the customer's Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others.

Variable - Description<br>
Loan_ID - Unique Loan ID<br>
Gender - Male/ Female<br>
Married -  Applicant married (Y/N)<br>
Dependents - Number of dependents<br>
Education - Applicant Education (Graduate/ Under Graduate)<br>
Self_Employed - Self employed (Y/N)<br>
ApplicantIncome - Applicant income<br>
CoapplicantIncome - Coapplicant income<br>
LoanAmount - Loan amount in thousands<br>
Loan_Amount_Term - Term of loan in months<br>
Credit_History - credit history meets guidelines<br>
Property_Area - Urban/ Semi Urban/ Rural<br>
Loan_Status - Loan approved (Y/N)<br>



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

## 1. Hypothesis Generation

Generating hypotheses is a major step in the process of analyzing data. It involves understanding the problem and formulating meaningful hypotheses about what could influence the outcome. This is done BEFORE looking at the data, and the result is a list of the different analyses we could perform if the data is available.

#### Possible hypotheses
Which applicants are more likely to get a loan?

1. Applicants having a credit history 
2. Applicants with higher applicant and co-applicant incomes
3. Applicants with higher education level
4. Applicants with properties in urban areas with high growth prospects

Do more brainstorming and create some hypotheses of your own. Remember that the data might not be sufficient to test all of these, but forming these enables a better understanding of the problem.

## 2. Data Exploration
Let's do some basic data exploration here and come up with some inferences about the data. Go ahead and try to figure out some irregularities and address them in the next section. 

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')


df = pd.read_csv("data.csv") 
df.head()

One of the key challenges in any data set is missing values. Let's start by checking which columns contain missing values.

In [None]:
df['Credit_History'].value_counts()

In [None]:
# count missing values per column
df.isnull().sum()

Look at some basic statistics for numerical variables.

In [None]:
df.describe()

1. How many applicants have a `Credit_History`? (`Credit_History` has value 1 for those who have a credit history and 0 otherwise)
2. Is the `ApplicantIncome` distribution in line with your expectation? Similarly, what about `CoapplicantIncome`?
3. Tip: Can you see possible skewness in the data by comparing the mean to the median (the 50% figure of a feature)?
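
The mean-versus-median comparison from the tip can be sketched as follows. The small frame below is made-up illustrative data, not the loan dataset; with the real data you would work on `df` instead.

```python
import pandas as pd

# Made-up incomes with one very large value (illustrative only)
toy = pd.DataFrame({"ApplicantIncome": [2500, 3000, 3200, 4100, 81000]})

# Mean and median side by side for one feature
summary = toy["ApplicantIncome"].agg(["mean", "median"])

# A mean well above the median hints at a right-skewed distribution
right_skewed = summary["mean"] > summary["median"]
print(summary)
print("Right-skewed:", right_skewed)
```

The same two numbers appear in `df.describe()` as the `mean` and `50%` rows.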



Let's look at the nominal (categorical) variables. Look at the number of unique values in each of them.

In [None]:
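# turn loan status into binary
# note: `modified` is another name for df, so df['Loan_Status'] is updated as well
modified = df
modified['Loan_Status'] = df['Loan_Status'].apply(lambda x: 0 if x == "N" else 1)
# approval rate for each Credit_History value
# (select the column before aggregating so text columns are not averaged)
modified.groupby('Credit_History')['Loan_Status'].mean()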

Explore further using the frequency of different categories in each nominal variable. Exclude the ID for obvious reasons.
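
One way to get the category frequencies is `value_counts` on each nominal column. This is a minimal sketch on made-up rows; only the column names are taken from the variable list above.

```python
import pandas as pd

# Made-up rows using column names from the variable list (illustrative only)
toy = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP3", "LP4"],
    "Gender": ["Male", "Female", "Male", "Male"],
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban"],
})

# Every column except the identifier is treated as nominal here
nominal_cols = toy.columns.drop("Loan_ID")

# dropna=False keeps missing values visible as their own category
frequencies = {col: toy[col].value_counts(dropna=False) for col in nominal_cols}
for col, counts in frequencies.items():
    print(counts, end="\n\n")
```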

### Distribution analysis

Study the distribution of various variables. Plot the histogram of `ApplicantIncome` and try different numbers of bins.



In [None]:
# histogram of applicant income (pass bins=... to try other bin counts)
sns.histplot(df.ApplicantIncome)


Look at box plots to understand the distributions. 

In [None]:
# box plot of applicant income
sns.boxplot(x=df.ApplicantIncome)

Look at the distribution of income segregated by `Education`.

In [None]:
sns.boxplot(x='Education',y='ApplicantIncome',data=df)

Look at the histogram and boxplot of LoanAmount

In [None]:
sns.histplot(x='LoanAmount', data=df)

In [None]:
sns.boxplot(x='LoanAmount',data=df)

In [None]:
sns.histplot(x='Loan_Amount_Term', data=df)

There might be some extreme values. Both `ApplicantIncome` and `LoanAmount` require some data munging. `LoanAmount` has missing as well as extreme values, while `ApplicantIncome` has a few extreme values that demand a deeper understanding.
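
To put a number on "extreme", one common convention is the 1.5 × IQR rule that box plots use for their whiskers. The sketch below applies it to made-up loan amounts; the rule is an assumption chosen for illustration, not something the notebook prescribes.

```python
import pandas as pd

# Made-up loan amounts in thousands (illustrative only)
loan_amount = pd.Series([100, 110, 120, 128, 135, 150, 700], name="LoanAmount")

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = loan_amount.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as extreme
extreme = loan_amount[(loan_amount < lower) | (loan_amount > upper)]
print(extreme)
```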

### Categorical variable analysis

Try to understand categorical variables in more detail using `pandas.DataFrame.pivot_table` and some visualizations.

In [None]:
#pd.DataFrame.pivot_table(df)
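
A minimal `pivot_table` sketch, assuming `Loan_Status` has already been turned into 0/1 as in the earlier cell. The rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; Loan_Status is assumed to be binary already (illustrative only)
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "Property_Area": ["Urban", "Rural", "Urban", "Rural", "Urban", "Rural"],
    "Loan_Status": [1, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 column is the approval rate for each combination
approval_rate = toy.pivot_table(
    values="Loan_Status",
    index="Credit_History",
    columns="Property_Area",
    aggfunc="mean",
)
print(approval_rate)
```

Swapping `index` or `columns` for another categorical column gives the approval rate along a different pair of dimensions.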

## 3. Data Cleaning

This step typically involves imputing missing values and treating outliers. 

### Imputing Missing Values

Missing values may not always be NaNs. For instance, the `Loan_Amount_Term` might be 0, which does not make sense.
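
A sketch of how such disguised missing values could be converted to NaN so that the imputation step treats them like any other gap. The data is made up, and whether the real dataset contains zero terms is something to check rather than assume.

```python
import numpy as np
import pandas as pd

# Made-up terms in months, including zeros and one real NaN (illustrative only)
toy = pd.DataFrame({"Loan_Amount_Term": [360.0, 0.0, 180.0, np.nan, 0.0]})

# Count the zero terms before converting them
n_zero_terms = int((toy["Loan_Amount_Term"] == 0).sum())

# Treat a zero-month term as missing
toy["Loan_Amount_Term"] = toy["Loan_Amount_Term"].replace(0, np.nan)
n_missing = int(toy["Loan_Amount_Term"].isna().sum())
print(n_zero_terms, n_missing)
```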



Impute missing values for all columns. Use the values you find most meaningful (mean, mode, median, zero, or perhaps different means for different groups).
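
Group-wise imputation can be sketched with `groupby(...).transform(...)`. Here the median `LoanAmount` within each `Education` group fills the gaps; the choice of grouping column and of the median is an assumption for illustration, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows (illustrative only)
toy = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Under Graduate", "Under Graduate", "Graduate"],
    "LoanAmount": [100.0, np.nan, 80.0, np.nan, 200.0],
})

# transform returns one value per row: the median of that row's group
group_median = toy.groupby("Education")["LoanAmount"].transform("median")

# Fill each missing amount with the median of its own group
toy["LoanAmount"] = toy["LoanAmount"].fillna(group_median)
print(toy)
```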

In [None]:
# impute missing values
# categorical and discrete columns: fill each with its own most frequent value
mode_cols = ['Gender', 'Married', 'Dependents', 'Self_Employed',
             'Loan_Amount_Term', 'Credit_History']
for col in mode_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# numerical: fill with the column mean
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())

### Extreme values
Try a log transformation to dampen the effect of the extreme values in `LoanAmount`. Plot the histogram before and after the transformation.

Combine both incomes as total income and take a log transformation of the same.
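
Besides the histograms, the effect of the log transformation can be checked numerically with the sample skewness. A small sketch on made-up loan amounts:

```python
import numpy as np
import pandas as pd

# Made-up loan amounts with one extreme value (illustrative only)
loan_amount = pd.Series(
    [50.0, 80.0, 100.0, 110.0, 120.0, 128.0, 135.0, 150.0, 180.0, 700.0],
    name="LoanAmount",
)
loan_amount_log = np.log(loan_amount)

# Skewness closer to zero means a more symmetric distribution
skew_before = loan_amount.skew()
skew_after = loan_amount_log.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Note that `np.log` returns `-inf` for zero, so a column that can contain zeros needs separate handling (for example `np.log1p`).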

In [None]:
#create TotalIncome column as a sum of ApplicantIncome and CoapplicantIncome
df['TotalIncome']=df['ApplicantIncome']+df['CoapplicantIncome']

In [None]:
#create TotalIncome_log column as a log of TotalIncome
df['TotalIncome_log']=np.log(df['TotalIncome'])

In [None]:
#create LoanAmount_log column
df['LoanAmount_log']=np.log(df['LoanAmount'])

In [None]:
df.drop(columns=['ApplicantIncome','CoapplicantIncome','LoanAmount', 'TotalIncome'],inplace=True)

In [None]:
df.head()

### Data cleaning with pipelines

In [None]:
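from sklearn.base import BaseEstimator, TransformerMixin

class LogDfTransform(BaseEstimator, TransformerMixin):
    """Apply a natural log to the named DataFrame columns."""
    def __init__(self, columnNames):
        self.columnNames = columnNames
    def fit(self, X, y=None):
        # stateless transformer: nothing to learn
        return self
    def transform(self, X, y=None):
        X = X.copy()  # leave the caller's frame untouched
        X[self.columnNames] = np.log(X[self.columnNames])
        return X

# note: these raw columns are dropped in the cleaning step above,
# so this transformer only works on a frame that still contains them
income_log = LogDfTransform(['ApplicantIncome', 'LoanAmount'])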

In [None]:
# import libraries for our pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest

# create pipelines for numerical and categorical columns
# pipeline for numerical columns (imputation -> standard scaler -> selectkbest)
# the log columns were already created in the cleaning step above
numerical_transform = Pipeline([('impute_mean', SimpleImputer(strategy='mean')),
                                ('scaling', StandardScaler()),
                                ('select_kbest', SelectKBest(k=3))])

# pipeline for categorical columns
# handle_unknown='ignore' avoids errors on categories not seen during fit
categorical_transform = Pipeline([('impute_mode', SimpleImputer(strategy='most_frequent')),
                                ('one-hot-encode', OneHotEncoder(handle_unknown='ignore'))])

# columntransformer for numerical and categorical columns
preprocessing_df = ColumnTransformer([
    ('numerical', numerical_transform,
     ['TotalIncome_log', 'LoanAmount_log', 'Loan_Amount_Term', 'Credit_History']),
    ('categorical', categorical_transform,
     ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area'])])

## 4. Building a Predictive Model

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

In [None]:
from sklearn.linear_model import LogisticRegression

# create a LogisticRegression classifier
logistic = LogisticRegression(max_iter=10000)

# build a pipeline for our model
pipeline = Pipeline([('preprocessing', preprocessing_df),
                    ('classifier', logistic)])

In [None]:
from sklearn.model_selection import train_test_split

#split data into training and test sets
X=df.drop(['Loan_Status', 'Loan_ID'], axis=1)
y=df['Loan_Status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Try a parameter grid search to improve the results.

In [None]:
from sklearn.model_selection import GridSearchCV

# find the best parameters for the model using GridSearchCV
# keys must follow the step names of the pipeline defined above
# (step__substep__parameter); names of steps that do not exist raise an error
param_grid = {'preprocessing__numerical__select_kbest__k': [2, 3, 4],
                'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# create gridsearch object
grid = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)

# fit grid search
grid.fit(X_train, y_train)

In [None]:
best_model = grid.best_estimator_
best_hyperparams = grid.best_params_
best_acc = grid.score(X_test, y_test)
print(f'Best test set accuracy: {best_acc}\nAchieved with hyperparameters: {best_hyperparams}')

## 5. Using Pipeline
If you didn't use pipelines before, transform your data preparation, feature engineering, and modeling steps into a Pipeline. It will be helpful for deployment.

The goal here is to create a pipeline that takes one row of our dataset and predicts whether the loan is granted (`predict`) or the probability of it being granted (`predict_proba`).

`pipeline.predict(x)`
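
Note that `predict` returns a class label, while scikit-learn classifiers such as `LogisticRegression` expose `predict_proba` for class probabilities. The sketch below shows the difference on a toy pipeline with made-up data; `toy_pipeline` and its single feature are hypothetical stand-ins for the loan pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set: one feature, label 1 for larger values
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pipeline = Pipeline([("scaling", StandardScaler()),
                         ("classifier", LogisticRegression())])
toy_pipeline.fit(X_toy, y_toy)

row = np.array([[4.5]])                   # a single row, kept two-dimensional
label = toy_pipeline.predict(row)         # predicted class label
proba = toy_pipeline.predict_proba(row)   # one probability per class

# Columns of predict_proba follow the order of classes_
positive_index = list(toy_pipeline.classes_).index(1)
approval_probability = proba[0, positive_index]
print(label[0], approval_probability)
```

With a DataFrame, selecting the row as `X_test.iloc[[5]]` (double brackets) keeps it two-dimensional in the same way.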

In [None]:
#fit the model
pipeline.fit(X_train, y_train)


# get accuracy score
print('Accuracy: ', accuracy_score(y_test, pipeline.predict(X_test)))

In [None]:
# Display HTML representation in a jupyter context
from sklearn import set_config
set_config(display='diagram')

pipeline

In [None]:
# Or, save the HTML to a file
from sklearn.utils import estimator_html_repr

with open('model_pipeline.html', 'w') as f:  
    f.write(estimator_html_repr(pipeline))

In [None]:
X_tester = X_test.iloc[[5]]
y_pred = pipeline.predict(X_tester)
print(type(X_tester))
print(y_pred)

In [None]:
X_tester.head()

## 6. Deploy your model to the cloud and test it with Postman, Bash or Python

In [None]:
import pickle

In [None]:

# store the fitted pipeline in a pickle file
with open('model.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

In [None]:

# load the pipeline back to check the round trip
with open('model.pkl', 'rb') as f:
    testmodel = pickle.load(f)

In [None]:
y_pred = testmodel.predict(X_tester)

In [None]:
print(y_pred[0])

In [None]:
import requests

# requests needs an explicit scheme (http://) in the URL
url = 'http://ec2-52-14-229-23.us-east-2.compute.amazonaws.com:5000'
json_entry = {
    "Gender": "Male",
    "Married": "No",
    "Dependents": 1,
    "Education": "Graduate",
    "Self_Employed": "No",
    "ApplicantIncome": 2345,
    "CoapplicantIncome": 0,
    "LoanAmount": 128.0,
    "Loan_Amount_Term": 360.0,
    "Credit_History": 1.0,
    "Property_Area": "Urban"
}

# send the entry as a JSON body and print the response if the call succeeded
res = requests.post(url, json=json_entry)
if res.ok:
    print(res.json())
else:
    print('Request failed with status', res.status_code)
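
The JSON entry carries the raw `ApplicantIncome`, `CoapplicantIncome` and `LoanAmount` fields, while the pipeline above was trained on `TotalIncome_log` and `LoanAmount_log`. The notebook does not show how the deployed service bridges that gap, so the helper below is only a sketch of one way a server-side handler might do it; `payload_to_frame` is a hypothetical name, not part of the original code.

```python
import numpy as np
import pandas as pd

def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Hypothetical helper: turn one JSON entry into a one-row frame of model features."""
    row = pd.DataFrame([payload])
    # Recreate the engineered columns used during training
    total_income = row["ApplicantIncome"] + row["CoapplicantIncome"]
    row["TotalIncome_log"] = np.log(total_income)
    row["LoanAmount_log"] = np.log(row["LoanAmount"])
    # Drop the raw columns, mirroring the cleaning step
    return row.drop(columns=["ApplicantIncome", "CoapplicantIncome", "LoanAmount"])

# Same fields as the request body above
example = {
    "Gender": "Male", "Married": "No", "Dependents": 1,
    "Education": "Graduate", "Self_Employed": "No",
    "ApplicantIncome": 2345, "CoapplicantIncome": 0,
    "LoanAmount": 128.0, "Loan_Amount_Term": 360.0,
    "Credit_History": 1.0, "Property_Area": "Urban",
}
features = payload_to_frame(example)
print(features.columns.tolist())
```

The resulting frame could then be passed to the unpickled pipeline's `predict` or `predict_proba`, provided the values use the same types and category spellings as the training data.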
