# Loan predictions

## Problem Statement

We want to automate the loan eligibility process using the customer details provided in the online application form. You can find the dataset [here](https://drive.google.com/file/d/1h_jl9xqqqHflI5PsuiQd_soNYxzFfjKw/view?usp=sharing). The details include the customer's gender, marital status, education, number of dependents, income, loan amount, credit history, and others.

|Variable|Description|
|:-------------|:-------------|
|Loan_ID|Unique loan ID|
|Gender|Male/Female|
|Married|Applicant married (Y/N)|
|Dependents|Number of dependents|
|Education|Applicant education (Graduate/Under Graduate)|
|Self_Employed|Self-employed (Y/N)|
|ApplicantIncome|Applicant income|
|CoapplicantIncome|Coapplicant income|
|LoanAmount|Loan amount in thousands|
|Loan_Amount_Term|Term of loan in months|
|Credit_History|Credit history meets guidelines|
|Property_Area|Urban/Semi Urban/Rural|
|Loan_Status|Loan approved (Y/N)|



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

## 1. Hypothesis Generation

Generating hypotheses is a major step in data analysis. It involves understanding the problem and formulating meaningful hypotheses about which factors could substantially affect the outcome. This is done BEFORE looking at the data, and the result is a laundry list of analyses that we can perform if the data is available.

#### Possible hypotheses
Which applicants are more likely to get a loan?

1. Applicants with a credit history
2. Applicants with higher applicant and co-applicant incomes
3. Applicants with a higher education level
4. Applicants with properties in urban areas with high growth prospects
5. Applicants with higher income per person in the family (considering the number of dependents)
6. Applicants requesting a lower loan amount
7. Applicants with high savings (data not provided here)

Do more brainstorming and create some hypotheses of your own. Remember that the data might not be sufficient to test all of these, but forming these enables a better understanding of the problem.
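Hypothesis 5 can be written down as a formula before any data is inspected. The following is a minimal sketch, assuming that `Dependents` holds strings such as `'2'` or `'3+'` and that a missing entry counts as zero dependents (both are assumptions to check against the real data). The helper names `parse_dependents` and `income_per_person` are hypothetical.

```python
def parse_dependents(value):
    """Turn a Dependents entry such as '2' or '3+' into an integer count."""
    if value is None:
        return 0  # assumption: a missing entry means no dependents
    return int(str(value).strip().rstrip("+"))


def income_per_person(applicant_income, coapplicant_income, dependents):
    """Total income divided by the applicant plus their dependents."""
    household = 1 + parse_dependents(dependents)
    return (applicant_income + coapplicant_income) / household


# Made-up numbers: 6000 total income shared by the applicant and 3 dependents.
example = income_per_person(4000, 2000.0, "3+")
print(example)
```

Treating `'3+'` as exactly 3 understates larger families, so this is only a first approximation of the feature.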

## 5. Using Pipeline
If you haven't used pipelines before, convert your data preparation, feature engineering, and modeling steps into a `Pipeline`. This will help with deployment.

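The goal here is to create a pipeline that takes one row of our dataset and predicts whether the loan will be granted:

`pipeline.predict(x)`

Note that `predict` returns the class label. For the probability of approval, use `pipeline.predict_proba(x)` with a classifier that supports it.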
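The difference between a predicted label and a predicted probability can be seen on a small example. This is a sketch on made-up toy data, not the loan dataset; `toy_pipeline`, `X_toy`, and `y_toy` are hypothetical names, and the classifier is assumed to expose `predict_proba` (as `LogisticRegression` does).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up toy data: one numeric feature, labels 'N'/'Y' as in Loan_Status.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array(["N", "N", "N", "Y", "Y", "Y"])

toy_pipeline = Pipeline([
    ("scaling", StandardScaler()),
    ("classifier", LogisticRegression()),
])
toy_pipeline.fit(X_toy, y_toy)

row = X_toy[[5]]                         # a single row, kept two-dimensional
label = toy_pipeline.predict(row)        # array with one class label
proba = toy_pipeline.predict_proba(row)  # shape (1, 2): one column per class

# Columns of predict_proba follow the order of classes_.
yes_column = list(toy_pipeline.classes_).index("Y")
p_yes = proba[0, yes_column]
print(label[0], p_yes)
```

Looking up the column through `classes_` avoids relying on the label order.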

In [59]:
import pandas as pd
df_Loan = pd.read_csv("data.csv")

### Pipeline order
##### Preprocessing
1. Creating a new column, `TotalIncome`
2. Filling in missing values
3. Log transformation

##### Main pipeline
 - Preprocessing
 - Numerical and categorical value split, with one-hot encoding for categorical values
 - Classifier

##### After pipeline
 - Grid search
 - Pickling
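One way to realize the numerical/categorical split in the outline is a single `ColumnTransformer` with one branch per column type. The following is a sketch under assumptions, not the pipeline used in this notebook: it runs on made-up toy rows, uses only four of the dataset's column names, and the names `numeric_branch`, `categorical_branch`, `split`, and `sketch_pipeline` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

numeric_cols = ["LoanAmount", "TotalIncome"]
categorical_cols = ["Gender", "Property_Area"]

# Numeric branch: impute, log-transform, then scale.
numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("log", FunctionTransformer(np.log)),
    ("scale", StandardScaler()),
])
# Categorical branch: impute, then one-hot encode.
categorical_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
split = ColumnTransformer([
    ("num", numeric_branch, numeric_cols),
    ("cat", categorical_branch, categorical_cols),
])
sketch_pipeline = Pipeline([("prep", split), ("classifier", LogisticRegression())])

# Made-up toy rows, only to show that the sketch runs end to end.
toy_X = pd.DataFrame({
    "LoanAmount": [128.0, 66.0, np.nan, 120.0, 141.0, 95.0],
    "TotalIncome": [5000.0, 6000.0, 3000.0, 4500.0, 8000.0, 2500.0],
    "Gender": ["Male", "Male", np.nan, "Female", "Male", "Female"],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban", "Rural", "Semiurban"],
})
toy_y = np.array(["Y", "N", "Y", "Y", "N", "Y"])

sketch_pipeline.fit(toy_X, toy_y)
n_features = sketch_pipeline.named_steps["prep"].transform(toy_X).shape[1]
print(n_features)
print(sketch_pipeline.predict(toy_X.iloc[[0]]))
```

Because each branch names its own columns, only the categorical columns reach the encoder, and no intermediate conversion back to a DataFrame is needed.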

In [60]:
# Preprocessing step 1: new column
df_Loan['TotalIncome'] = df_Loan['ApplicantIncome'] + df_Loan['CoapplicantIncome']
df_Loan = df_Loan.drop('Loan_ID', axis=1)
df_Loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             601 non-null    object 
 1   Married            611 non-null    object 
 2   Dependents         599 non-null    object 
 3   Education          614 non-null    object 
 4   Self_Employed      582 non-null    object 
 5   ApplicantIncome    614 non-null    int64  
 6   CoapplicantIncome  614 non-null    float64
 7   LoanAmount         592 non-null    float64
 8   Loan_Amount_Term   600 non-null    float64
 9   Credit_History     564 non-null    float64
 10  Property_Area      614 non-null    object 
 11  Loan_Status        614 non-null    object 
 12  TotalIncome        614 non-null    float64
dtypes: float64(5), int64(1), object(7)
memory usage: 62.5+ KB


In [61]:
X = df_Loan.drop(['Loan_Status','ApplicantIncome','CoapplicantIncome'], axis=1)
y = df_Loan['Loan_Status'].values

In [62]:
# Data Splitting 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=27)

Originally I implemented each of the steps above with its own column transformer, but it turns out that a column transformer returns a NumPy array, so you can't refer to columns by name afterwards. Thus, I changed the approach to use the column transformer only once.

In [63]:
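import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer that applies a plain function inside a pipeline
class DataframeTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, func):
        self.func = func

    def fit(self, X, y=None, **fit_params):
        # Stateless: nothing is learned from the data
        return self

    def transform(self, input_df, **transform_params):
        return self.func(input_df)

def create_total_income_feature(input_df):
    # Work on a copy so the caller's DataFrame is not modified
    output_df = input_df.copy()
    output_df['TotalIncome'] = output_df['ApplicantIncome'] + output_df['CoapplicantIncome']
    return output_df

def to_dataframe(array):
    # Column order after a column transformer: the transformed columns first,
    # then the passthrough columns in their original order
    columns = ['LoanAmount', 'TotalIncome', 'Gender', 'Dependents', 'Self_Employed', 'Credit_History',
               'Married', 'Education', 'Loan_Amount_Term', 'Property_Area']
    df = pd.DataFrame(array, columns=columns)
    # Restore the numeric dtype of the two columns that are log-transformed later
    convert_dict = {'LoanAmount': 'float64', 'TotalIncome': 'float64'}
    return df.astype(convert_dict)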


In [68]:
# Pipeline 
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import precision_recall_fscore_support, f1_score

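# Fill missing values. Output columns follow the order listed here, then the passthrough columns.
# Note: Married and Loan_Amount_Term also have missing values and pass through unimputed.
fillna_trans = ColumnTransformer([
    ('fill_avg', SimpleImputer(strategy='mean'), ['LoanAmount', 'TotalIncome']),
    ('fill_mode', SimpleImputer(strategy='most_frequent'), ['Gender', 'Dependents', 'Self_Employed']),
    ('fill_zero', SimpleImputer(strategy='constant', fill_value=0), ['Credit_History'])], remainder='passthrough')

# Log-transform the two continuous columns
log_trans = ColumnTransformer([('log_trans', FunctionTransformer(np.log), ['LoanAmount', 'TotalIncome'])],
                              remainder='passthrough')

# Each column transformer returns a NumPy array, so convert back to a DataFrame after each one
preprocessing = Pipeline([
    ('fillna_trans', fillna_trans),
    ('to_dataframe', DataframeTransformer(to_dataframe)),
    ('log_trans', log_trans),
    ('to_dataframe2', DataframeTransformer(to_dataframe))
])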

preprocessing

In [65]:


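from sklearn.linear_model import LogisticRegression

# Note: the encoder is applied to every column here, so the numeric columns
# are one-hot encoded as well.
# `sparse_output` was called `sparse` before scikit-learn 1.2.
pipeline = Pipeline(steps=[('preprocessing', preprocessing),
                           ('encoding', OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
                           ('scaling', MinMaxScaler()),
                           ('classifier', LogisticRegression())])

pipeline.fit(X_train, y_train)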


In [77]:
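# Grid search over classifiers and scalers
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, f1_score, make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

param_grid = {'classifier': [LogisticRegression(), KNeighborsClassifier(), GaussianNB(),
                             DecisionTreeClassifier(), SVC(), LinearDiscriminantAnalysis()],
              'scaling': [MinMaxScaler(), StandardScaler()]
             }

# The labels are the strings 'Y'/'N', so the positive class is named explicitly;
# plain scoring='f1' assumes a positive label of 1.
f1_yes = make_scorer(f1_score, pos_label='Y')
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1, scoring=f1_yes)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
best_hyperparams = grid.best_params_
best_score = grid.best_score_  # mean cross-validated score, not a test-set score
predictions = grid.predict(X_test)
print(f'Best cross-validated f1_score:\n\t {best_score}\nAchieved with hyperparameters:\n\t {best_hyperparams}')
print(classification_report(y_test, predictions))

The output recorded below comes from a run with `scoring='f1'`. With string labels that scorer most likely failed on every fold, which explains the `nan` score; when every candidate scores `nan`, the reported best candidate is simply the first one in the grid.

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best test set f1_score:
	 nan
Achieved with hyperparameters:
	 {'classifier': LogisticRegression(), 'scaling': MinMaxScaler()}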
              precision    recall  f1-score   support

           N       0.63      0.33      0.43        52
           Y       0.72      0.90      0.80       102

    accuracy                           0.71       154
   macro avg       0.68      0.61      0.62       154
weighted avg       0.69      0.71      0.68       154
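The per-class F1 values in a report like the one above depend on which label is treated as positive. A small sketch with made-up labels shows how `pos_label` selects the class, and how the same choice can be wrapped in a scorer for `GridSearchCV`. The variable names are hypothetical.

```python
from sklearn.metrics import f1_score, make_scorer

# Made-up labels in the same 'Y'/'N' encoding as Loan_Status.
y_true = ["Y", "N", "Y", "Y", "N"]
y_pred = ["Y", "Y", "Y", "N", "N"]

# F1 for the approved class: 2 true positives, 1 false positive, 1 false negative.
f1_approved = f1_score(y_true, y_pred, pos_label="Y")
# F1 for the rejected class: 1 true positive, 1 false positive, 1 false negative.
f1_rejected = f1_score(y_true, y_pred, pos_label="N")
print(f1_approved, f1_rejected)

# The same choice as a scorer object that can be passed to GridSearchCV(scoring=...).
f1_approved_scorer = make_scorer(f1_score, pos_label="Y")
```

Encoding `Loan_Status` as 0/1 before fitting would be another way to make the default `'f1'` scorer applicable.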





## 6. Deploy your model to the cloud and test it with Postman, Bash or Python

In [None]:
#pickling 
import pickle 
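The pickling step can be sketched as a save-and-reload round trip. This is a minimal example on a made-up toy pipeline rather than the grid-search result; the file name `model.pkl` and the variable names are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up toy data standing in for the training set.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array(["N", "N", "N", "Y", "Y", "Y"])

model = Pipeline([
    ("scaling", MinMaxScaler()),
    ("classifier", LogisticRegression()),
]).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    # Save the fitted pipeline to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Load it back, as a deployed service would do at start-up.
    with open(path, "rb") as f:
        restored = pickle.load(f)

same_predictions = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same_predictions)
```

Pickle stores custom classes and functions by reference, so a pipeline that contains `DataframeTransformer` and `to_dataframe` can only be loaded where those definitions are importable. The scikit-learn version should also match between saving and loading.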
