# 1.0.0 Introduction

**Setting the context for this notebook.**

* This notebook should be read together with the [phase-5 report](https://docs.google.com/document/d/1Ks52x1MUSMBvMw50IinyWVJZFMfsZ1JYgyhIqI4tQSk/edit?usp=sharing) on the Home Credit loan-default prediction problem hosted on Kaggle [here](https://www.kaggle.com/c/home-credit-default-risk/data).

* The complete problem context and a high-level summary of the datasets used are available [here.](https://docs.google.com/document/d/1qcbp5zqYPSARrVfMg2flBY-Zgbp7j_IWExBpGuZIrkI/edit?usp=sharing)
* This notebook builds on the key insights and preliminary feature engineering from the previous EDA phase and delves directly into the modeling.
* The datasets used in this phase were already processed using the insights and feature engineering from the earlier EDA phase, which can be found [here.](https://colab.research.google.com/drive/1npayzHKNq-oHzbpfjBVKrLCZY3Z18dZx?usp=sharing)
* Finally, the deployed app, hosted on Heroku, can be tried [here](https://loan-def-predict.herokuapp.com/).
* The GitHub repository for the deployment, with a README covering how to use the app and the steps involved, is [here](https://github.com/nanorohan/Loan-Defaulting-Tendency-predictor), and the collection of all the Colab notebooks for the end-to-end deployment is [here](https://github.com/nanorohan/Loan-Defaulting-Tendency-predictor-notebooks).

**A quick refresher about Home Credit's motivation for this problem**

Although many people seek loans from banks and lending institutions, only a few are approved, primarily because the applicants have insufficient or non-existent credit histories. This population is often taken advantage of by untrustworthy lenders.
To give these applicants a positive borrowing experience, Home Credit uses data analytics to predict applicants' loan repayment abilities, so that clients capable of repaying do not have their applications rejected.

## 1.1.0 High level summary of this notebook

* **Section 1.0.0** - A brief summary of the project, dataset and intent of this notebook.

* **Section 2.0.0** - Contains the necessary groundwork for proceeding with the modeling & data pipeline for deployment.

* **Section 3.0.0** - Comprises the modeling and data pipeline built with the PyCaret framework.

* **Section 4.0.0** - Comprises the modeling and data pipeline built with the sklearn framework.

* **Section 5.0.0** - Concludes and summarizes the deployment phase

# 2.0.0 Necessary groundwork for proceeding with the model & dataflow pipeline and eventual deployment

## 2.1.0 Mounting Google Drive to access processed datasets from previous phases

In [None]:
#Mount Google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2.2.0 Installing the dependencies for libraries used in this notebook

In [None]:
%%capture
#Libraries for model & modeling pipeline [PyCaret] and outlier detection [PyOD]
#Note: the %%capture cell magic must be the first line of the cell
!pip install pycaret[full]
!pip install --upgrade pycaret
!pip install pyod
!pip install --upgrade pyod

## 2.3.0 Importing the necessary libraries

In [None]:
#Import libraries

#The essential basics
import pandas as pd
import numpy as np
import pickle

#For modeling & dataflow pipeline - sklearn as well as PyCaret
from scipy.stats import uniform
from pycaret.classification import *
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier

#Library for outlier detection & removal
from pyod.models.cblof import CBLOF



## 2.4.0 Importing the original & processed datasets for modeling & dataflow pipeline
>
>In this notebook, we build the dataflow pipeline.
>
>Each of the steps below was already carried out separately in earlier phases.
>
>Here, they are collated to form a single data pipeline.
>
>To begin, the original Home Credit dataset is loaded.
>

### 2.4.1 Custom function for optimizing size of the loaded datasets

In [None]:
#Function for reducing the size of the Pandas Dataframe
def df_size_optimizer(df):
  """Downcast each numeric column of a Pandas DataFrame to the smallest dtype that can hold its values. This is primarily done to reduce the RAM the dataframe occupies."""
  #Source ref [1] & Code Credit - https://www.kaggle.com/rinnqd/reduce-memory-usage
  #Source ref [2] - https://www.analyticsvidhya.com/blog/2021/04/how-to-reduce-memory-usage-in-python-pandas/ [For understanding the logic and implementation]  
  for col in df.columns:
    col_type=df[col].dtype
    if col_type!=object:
      c_min=df[col].min()
      c_max=df[col].max()
      if str(col_type)[:3]=='int':
        if c_min>np.iinfo(np.int8).min and c_max<np.iinfo(np.int8).max:
          df[col]=df[col].astype(np.int8)
        elif c_min>np.iinfo(np.int16).min and c_max<np.iinfo(np.int16).max:
          df[col]=df[col].astype(np.int16)
        elif c_min>np.iinfo(np.int32).min and c_max<np.iinfo(np.int32).max:
          df[col]=df[col].astype(np.int32)
        elif c_min>np.iinfo(np.int64).min and c_max<np.iinfo(np.int64).max:
          df[col]=df[col].astype(np.int64)  
      else:
        if c_min>np.finfo(np.float16).min and c_max<np.finfo(np.float16).max:
          df[col]=df[col].astype(np.float16)
        elif c_min>np.finfo(np.float32).min and c_max<np.finfo(np.float32).max:
          df[col]=df[col].astype(np.float32)
        else:
          df[col]=df[col].astype(np.float64)
  return df
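
To see what this kind of downcasting buys, memory use can be compared before and after. The sketch below is illustrative only: it uses a small synthetic frame with made-up column names and a manual `astype` call rather than the notebook's data, and `memory_mb` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a loaded dataset (column names are hypothetical)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "small_int": rng.integers(0, 100, size=10_000),   # int64 by default
    "medium_float": rng.random(10_000) * 1_000.0,     # float64 by default
})

def memory_mb(df):
    """Return the total in-memory size of a DataFrame in megabytes."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2

before = memory_mb(demo)

# Downcast by hand, mirroring the kind of choice df_size_optimizer makes
downcast = demo.astype({"small_int": np.int8, "medium_float": np.float32})
after = memory_mb(downcast)

print(f"before: {before:.3f} MB, after: {after:.3f} MB")
```

One caveat worth keeping in mind: `float16` carries only about three significant decimal digits, so the most aggressive float branch trades precision for memory.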

### 2.4.2 Loading the original Home Credit datasets

In [None]:
#Read application_train dataset
application_train = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/application_train.csv'))

#Read application_test dataset
application_test = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/application_test.csv'))

#Read bureau dataset
bureau = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/bureau.csv'))

#Read previous_application dataset
previous_application = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/previous_application.csv'))

### 2.4.3 Feature creation & data augmentation
>
>Here, three ratio features are added, constructed from domain knowledge gathered in the phase-1 literature study.
>
>These ratios are used because they rank among the top 20 features for predicting an applicant's defaulting tendency.

In [None]:
#create new feature DEBT_INCOME_RATIO in application_train
application_train['DEBT_INCOME_RATIO'] = application_train['AMT_ANNUITY']/application_train['AMT_INCOME_TOTAL']

#create new feature LOAN_VALUE_RATIO in application_train
application_train['LOAN_VALUE_RATIO'] = application_train['AMT_CREDIT']/application_train['AMT_GOODS_PRICE']

#create new feature LOAN_INCOME_RATIO in application_train
application_train['LOAN_INCOME_RATIO'] = application_train['AMT_CREDIT']/application_train['AMT_INCOME_TOTAL']
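
Plain division yields `inf` when a denominator is 0 and `NaN` when either side is missing. Whether that actually occurs in these columns is not checked here. The sketch below, with a hypothetical helper `safe_ratio` and toy values, shows one way to turn infinities into `NaN` so that a later imputation step can handle them.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise ratio with +/-inf (division by zero) replaced by NaN."""
    ratio = numerator / denominator
    return ratio.replace([np.inf, -np.inf], np.nan)

# Toy values, not taken from the Home Credit data
toy = pd.DataFrame({
    "AMT_CREDIT": [500_000.0, 200_000.0, 100_000.0],
    "AMT_GOODS_PRICE": [450_000.0, 0.0, np.nan],
})
toy["LOAN_VALUE_RATIO"] = safe_ratio(toy["AMT_CREDIT"], toy["AMT_GOODS_PRICE"])
print(toy)
```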

### 2.4.4 Merging the 'Bureau' data with the application_train dataset for applicants having records in both the sets.
>
>As seen in the EDA, a large share of the applicant pool [train & test sets] has existing records in the bureau database. Merging them therefore adds to the information available to the model.

In [None]:
#Create a dataframe with numerical features of bureau
bureau_numerical = bureau.select_dtypes(exclude=object)

#Create a dataframe with categorical features of bureau
bureau_categorical = bureau.select_dtypes(include=object)

#Merge numerical features from bureau to application_train
bureau_numerical_merge = bureau_numerical.groupby(by=['SK_ID_CURR']).median().reset_index()
application_train_bureau = application_train.merge(bureau_numerical_merge, on='SK_ID_CURR', how='left', suffixes=('', '_BUREAU'))

#Merge categorical features from bureau to application_train
bureau_categorical['SK_ID_CURR'] = bureau['SK_ID_CURR']
bureau_categorical_merge = bureau_categorical.groupby(by=['SK_ID_CURR']).agg(lambda x:x.value_counts().index[0] if len(x.value_counts()) != 0 else '').reset_index()
application_train_bureau = application_train_bureau.merge(bureau_categorical_merge, on='SK_ID_CURR', how='left', suffixes=('', '_BUREAU'))

#Drop SK_ID_BUREAU
application_train_bureau = application_train_bureau.drop(columns = ['SK_ID_BUREAU'])

#Shape of application and bureau data combined
print('The shape of application_train and bureau data merged: ', application_train_bureau.shape)

The shape of application_train and bureau data merged:  (307511, 140)
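
The aggregate-then-merge pattern used in this cell can be seen on a toy example. This is a minimal sketch with made-up values, and `most_frequent` is a hypothetical helper that mirrors the lambda used above: numeric columns collapse to the per-applicant median, categorical columns to the most frequent value, and a left join keeps applicants without any records.

```python
import pandas as pd

# Toy one-to-many table: several bureau-style rows per applicant (values made up)
records = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "AMT_CREDIT_SUM": [100.0, 300.0, 200.0, 50.0],
    "CREDIT_ACTIVE": ["Closed", "Active", "Closed", "Active"],
})
applicants = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

def most_frequent(values):
    """Return the most frequent non-null value, or '' if there is none."""
    counts = values.value_counts()
    return counts.index[0] if len(counts) else ""

# One row per applicant: median for numeric, most frequent value for categorical
per_applicant = records.groupby("SK_ID_CURR").agg(
    AMT_CREDIT_SUM=("AMT_CREDIT_SUM", "median"),
    CREDIT_ACTIVE=("CREDIT_ACTIVE", most_frequent),
).reset_index()

# Left join keeps every applicant; those without records get NaN
merged = applicants.merge(per_applicant, on="SK_ID_CURR", how="left")
print(merged)
```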


### 2.4.5 Checkpoint saving of the processed dataset to save time & compute requirement on Colab free box. 

In [None]:
#Saving the dataframes into CSV format for future use
bureau_numerical_merge.to_csv('bureau_numerical_merge.csv', index = False)
bureau_categorical_merge.to_csv('bureau_categorical_merge.csv', index = False)

### 2.4.6 Merging the 'Previous Application' data with the application_train dataset for applicants having records in both the sets.
>
>As seen in the EDA, a large share of the applicant pool [train & test sets] consists of existing Home Credit customers with entries in the previous application database. Merging them therefore adds to the information available to the model.

In [None]:
#Create a dataframe with numerical features of previous_application
previous_application_numerical = previous_application.select_dtypes(exclude=object)

#Create a dataframe with categorical features of previous_application
previous_application_categorical = previous_application.select_dtypes(include=object)

#Merge numerical features from previous_application to application_train_bureau
previous_numerical_merge = previous_application_numerical.groupby(by=['SK_ID_CURR']).mean().reset_index()
application_train_bureau_previous = application_train_bureau.merge(previous_numerical_merge, on='SK_ID_CURR', how='left', suffixes=('', '_PREVIOUS'))

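#Merge categorical features from previous_application to application_train_bureau
#The key must come from previous_application itself so that it lines up row by row with its categorical columns
previous_application_categorical['SK_ID_CURR'] = previous_application['SK_ID_CURR']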
previous_categorical_merge = previous_application_categorical.groupby(by=['SK_ID_CURR']).agg(lambda x:x.value_counts().index[0] if len(x.value_counts()) != 0 else '').reset_index()
application_train_bureau_previous = application_train_bureau_previous.merge(previous_categorical_merge, on='SK_ID_CURR', how='left', suffixes=('', '_PREVIOUS'))

#Drop SK_ID_PREV
application_train_bureau_previous = application_train_bureau_previous.drop(columns = ['SK_ID_PREV'])

#Shape of application_train_bureau and previous_application data combined
print('The shape of application_train_bureau and previous_application data merged: ', application_train_bureau_previous.shape)

The shape of application_train_bureau and previous_application data merged:  (307511, 175)
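
Because each aggregated table is meant to hold one row per `SK_ID_CURR`, the left merges should leave the row count unchanged (307511 in the outputs above). A sketch of how that expectation could be enforced with the `validate` argument of pandas' `merge`, shown on toy frames with made-up values:

```python
import pandas as pd

left = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "AMT_CREDIT": [10.0, 20.0, 30.0]})
right_ok = pd.DataFrame({"SK_ID_CURR": [1, 2], "CNT_PAYMENT": [12.0, 24.0]})
right_dup = pd.DataFrame({"SK_ID_CURR": [1, 1], "CNT_PAYMENT": [12.0, 6.0]})

# many_to_one: pandas raises MergeError if the right-hand keys are not unique
merged = left.merge(right_ok, on="SK_ID_CURR", how="left", validate="many_to_one")
row_count_preserved = len(merged) == len(left)

try:
    left.merge(right_dup, on="SK_ID_CURR", how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print("row count preserved:", row_count_preserved)
print("duplicate keys detected:", duplicate_detected)
```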


### 2.4.7 Checkpoint saving of the processed dataset to save time & compute requirement on Colab free box. 

In [None]:
#Saving the dataframes into CSV format for future use
previous_numerical_merge.to_csv('previous_numerical_merge.csv', index = False)
previous_categorical_merge.to_csv('previous_categorical_merge.csv', index = False)

## 2.5.0 Readying the processed train dataset and saving it for further use.

In [None]:
#Final train data ready for preprocessing
train_data = application_train_bureau_previous.drop(columns=['SK_ID_CURR'])

In [None]:
#Save the dataframes into CSV files for future use
train_data.to_csv('train_data.csv', index = False)

# 3.0.0 Modeling & Data Pipeline for Deployment using the PyCaret framework

In [None]:
#Saving the column names of application_test (the raw layout of the query data)
with open('columns_query_data.pkl', 'wb') as file:
  pickle.dump(list(application_test.columns), file)

## 3.1.0 Loading the saved & processed dataset

In [None]:
#Reading the train_data
train_data_full = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/train_data.csv'))

## 3.2.0 Building the model & data pipeline
>
>This builds the entire data preprocessing pipeline together with the best model identified in previous phases.

In [None]:
#Creating the lists of numerical and categorical features
columns_numerical = list(train_data_full.select_dtypes(exclude=object).columns)
columns_numerical.remove('TARGET')
columns_categorical = list(train_data_full.select_dtypes(include=object).columns)

In [None]:
#Setting up the dataflow pipeline for ingestion to the PyCaret model
data = setup(data=train_data_full, target="TARGET", categorical_features=columns_categorical, numeric_features=columns_numerical, train_size=0.9, 
             numeric_imputation='median', normalize=True, remove_outliers=True, data_split_stratify=True, feature_selection=True, feature_selection_threshold=0.35)

Unnamed: 0,Description,Value
0,session_id,2299
1,Target,TARGET
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(307511, 174)"
5,Missing Values,True
6,Numeric Features,138
7,Categorical Features,35
8,Ordinal Features,False
9,High Cardinality Features,False


In [None]:
#Training the model
model = create_model('lightgbm')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9194,0.7681,0.0239,0.5667,0.0459,0.0396,0.1043
1,0.9182,0.7662,0.0164,0.3977,0.0315,0.0253,0.0672
2,0.9195,0.7572,0.023,0.5904,0.0442,0.0384,0.105
3,0.9191,0.7795,0.0235,0.5319,0.0449,0.0383,0.0989
4,0.9197,0.7631,0.0282,0.5941,0.0538,0.0468,0.1167
5,0.9194,0.7641,0.0253,0.5625,0.0485,0.0418,0.1068
6,0.9192,0.7711,0.0235,0.5376,0.045,0.0384,0.0997
7,0.9192,0.7646,0.0239,0.5368,0.0458,0.0392,0.1006
8,0.9189,0.7647,0.0272,0.4915,0.0516,0.0434,0.101
9,0.9193,0.7643,0.023,0.5444,0.0441,0.0378,0.0995


In [None]:
#Tuning the model
tuned_model = tune_model(model)

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9196,0.7673,0.0206,0.6197,0.0399,0.0349,0.1027
1,0.9188,0.7679,0.0141,0.4688,0.0273,0.0227,0.0702
2,0.9195,0.7581,0.0188,0.6154,0.0364,0.0318,0.0974
3,0.9195,0.7785,0.023,0.5976,0.0443,0.0385,0.1058
4,0.9193,0.7634,0.0216,0.561,0.0416,0.0358,0.0984
5,0.9192,0.7629,0.0188,0.5479,0.0363,0.0311,0.0903
6,0.9197,0.7711,0.0235,0.6173,0.0452,0.0395,0.1092
7,0.9193,0.7644,0.0206,0.557,0.0398,0.0342,0.0957
8,0.9194,0.7658,0.023,0.5765,0.0442,0.0382,0.1034
9,0.9192,0.7662,0.0169,0.5538,0.0328,0.0281,0.0862
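
The two fold-level tables above are easier to compare once each metric is condensed into a mean and standard deviation. The sketch below copies the AUC and Recall columns from those tables by hand and summarizes them with NumPy. It is only a convenience calculation, not part of the PyCaret workflow.

```python
import numpy as np

# AUC and Recall per fold, copied from the create_model and tune_model tables above
auc_base = np.array([0.7681, 0.7662, 0.7572, 0.7795, 0.7631,
                     0.7641, 0.7711, 0.7646, 0.7647, 0.7643])
auc_tuned = np.array([0.7673, 0.7679, 0.7581, 0.7785, 0.7634,
                      0.7629, 0.7711, 0.7644, 0.7658, 0.7662])
recall_base = np.array([0.0239, 0.0164, 0.0230, 0.0235, 0.0282,
                        0.0253, 0.0235, 0.0239, 0.0272, 0.0230])
recall_tuned = np.array([0.0206, 0.0141, 0.0188, 0.0230, 0.0216,
                         0.0188, 0.0235, 0.0206, 0.0230, 0.0169])

summary = {
    "AUC (untuned)": auc_base,
    "AUC (tuned)": auc_tuned,
    "Recall (untuned)": recall_base,
    "Recall (tuned)": recall_tuned,
}

# Population standard deviation (NumPy's default, ddof=0)
for name, scores in summary.items():
    print(f"{name}: mean={scores.mean():.4f}, sd={scores.std():.4f}")
```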


In [None]:
#Saving the best model and storing it in Google Drive for future use
save_model(tuned_model, "model")

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=['NAME_CONTRACT_TYPE',
                                                             'CODE_GENDER',
                                                             'FLAG_OWN_CAR',
                                                             'FLAG_OWN_REALTY',
                                                             'NAME_TYPE_SUITE',
                                                             'NAME_INCOME_TYPE',
                                                             'NAME_EDUCATION_TYPE',
                                                             'NAME_FAMILY_STATUS',
                                                             'NAME_HOUSING_TYPE',
                                                             'OCCUPATION_TYPE',
                                                             'WEEKDAY_APPR_PROCESS_START',
                                                    

>* The data pipeline & model work correctly in the Colab notebook.
>
>* However, deployment on Heroku ran into issues that point to a compatibility problem between PyCaret or Git LFS and Heroku, probably related to the file size of the model & pipeline.
>
>* **To reduce the file size of the model, training was carried out on reduced datasets. These strategies did not work, but they are documented for possible future debugging.**
>
>* One of them [training on 25% of the data] is retained below.
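
Since file size is the suspected cause, it can help to measure a saved artifact before pushing it. A minimal sketch, assuming the artifact is an ordinary file on disk: `file_size_mb` is a hypothetical helper, and a pickled list stands in for the real pipeline. The size limits of Heroku or Git LFS are not stated in this notebook and are not assumed here.

```python
import os
import pickle
import tempfile

def file_size_mb(path):
    """Return the size of the file at `path` in megabytes."""
    return os.path.getsize(path) / 1024 ** 2

# Stand-in artifact: a pickled list instead of a real model & pipeline
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = os.path.join(tmp_dir, "artifact.pkl")
    with open(artifact_path, "wb") as handle:
        pickle.dump(list(range(100_000)), handle)
    size_mb = file_size_mb(artifact_path)

print(f"artifact size: {size_mb:.2f} MB")
```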

## 3.3.0 Building the model & data pipeline using 25% of original training dataset
>
>As explained above, this builds the entire data preprocessing pipeline together with the best model on less training data, to reduce the model & pipeline file size for deployment on Heroku.

In [None]:
#Using train_test_split to extract a stratified 25% sample of the train data
#Only the 25% part is kept; the remaining 75% is discarded
train_data, _ = train_test_split(train_data_full, test_size=0.75, random_state=42,
                                 stratify=train_data_full['TARGET'])
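
Stratifying on `TARGET` is what keeps the share of defaulters in the 25% sample close to that of the full data. A small sketch on synthetic data (about 8% positives, chosen purely for illustration) shows the effect:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with a rare positive class (made-up data)
rng = np.random.default_rng(42)
frame = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "TARGET": (rng.random(10_000) < 0.08).astype(int),
})

# Keep 25% of the rows, stratified on the target
subset, _ = train_test_split(
    frame, test_size=0.75, random_state=42, stratify=frame["TARGET"]
)

full_rate = frame["TARGET"].mean()
subset_rate = subset["TARGET"].mean()
print(f"full: {full_rate:.4f}, subset: {subset_rate:.4f}, rows: {len(subset)}")
```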

In [None]:
#Creating lists of numerical and categorical features
columns_numerical = list(train_data.select_dtypes(exclude=object).columns)
columns_numerical.remove('TARGET')
columns_categorical = list(train_data.select_dtypes(include=object).columns)

In [None]:
#Setting up the dataflow pipeline for ingestion to the PyCaret model
data = setup(data=train_data, target="TARGET", categorical_features=columns_categorical, numeric_features=columns_numerical, train_size=0.9, 
             numeric_imputation='median', normalize=True, remove_outliers=True, data_split_stratify=True, feature_selection=True, feature_selection_threshold=0.35)

Unnamed: 0,Description,Value
0,session_id,1779
1,Target,TARGET
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(76877, 174)"
5,Missing Values,True
6,Numeric Features,138
7,Categorical Features,35
8,Ordinal Features,False
9,High Cardinality Features,False


In [None]:
#Training the model
model = create_model('lightgbm')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9175,0.75,0.0207,0.3438,0.039,0.0301,0.0674
1,0.918,0.7498,0.0132,0.3333,0.0253,0.0193,0.0524
2,0.9189,0.7598,0.0132,0.4667,0.0256,0.0212,0.0676
3,0.9189,0.7701,0.0188,0.4762,0.0362,0.0302,0.082
4,0.9195,0.7337,0.0244,0.5652,0.0468,0.0404,0.1052
5,0.9192,0.7431,0.0244,0.52,0.0467,0.0397,0.0995
6,0.918,0.7324,0.0169,0.36,0.0323,0.0252,0.0632
7,0.9206,0.7577,0.0395,0.6562,0.0745,0.0659,0.1475
8,0.9195,0.7407,0.0263,0.56,0.0503,0.0433,0.1085
9,0.9181,0.7465,0.0188,0.3704,0.0358,0.0282,0.0682


In [None]:
#Tuning the model
tuned_model = tune_model(model)

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9191,0.7229,0.0,0.0,0.0,0.0,0.0
1,0.9191,0.7068,0.0,0.0,0.0,0.0,0.0
2,0.9191,0.7181,0.0,0.0,0.0,0.0,0.0
3,0.9191,0.7209,0.0,0.0,0.0,0.0,0.0
4,0.9191,0.7156,0.0,0.0,0.0,0.0,0.0
5,0.9191,0.6958,0.0,0.0,0.0,0.0,0.0
6,0.9191,0.6951,0.0,0.0,0.0,0.0,0.0
7,0.9191,0.7239,0.0,0.0,0.0,0.0,0.0
8,0.9191,0.7173,0.0,0.0,0.0,0.0,0.0
9,0.9192,0.7094,0.0,0.0,0.0,0.0,0.0


In [None]:
#Saving the best model and storing it in Google Drive for future use
save_model(tuned_model, "model")

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=['NAME_CONTRACT_TYPE',
                                                             'CODE_GENDER',
                                                             'FLAG_OWN_CAR',
                                                             'FLAG_OWN_REALTY',
                                                             'NAME_TYPE_SUITE',
                                                             'NAME_INCOME_TYPE',
                                                             'NAME_EDUCATION_TYPE',
                                                             'NAME_FAMILY_STATUS',
                                                             'NAME_HOUSING_TYPE',
                                                             'OCCUPATION_TYPE',
                                                             'WEEKDAY_APPR_PROCESS_START',
                                                    

## 3.4.0 Running the PyCaret model & data pipeline on the notebook for validation

In [None]:
#Importing checkpoint-saved data and pickle files
bureau_numerical_merge = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/bureau_numerical_merge.csv'))
bureau_categorical_merge = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/bureau_categorical_merge.csv'))
previous_numerical_merge = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/previous_numerical_merge.csv'))
previous_categorical_merge = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/previous_categorical_merge.csv'))
filename = open('/content/drive/MyDrive/Data/columns_train_data.pkl', 'rb')
columns = pickle.load(filename)
filename.close()
tuned_model = load_model('/content/drive/MyDrive/model_pycaret')

Transformation Pipeline and Model Successfully Loaded


In [None]:
#Reading query data point/s
query = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/application_test.csv'))

In [None]:
#Function as a pipeline for prediction
def predictor(query):
  """Apply the feature creation & merge steps to the query data and return the model predictions."""

  #create new feature DEBT_INCOME_RATIO in the query data
  query['DEBT_INCOME_RATIO'] = query['AMT_ANNUITY']/query['AMT_INCOME_TOTAL']

  #create new feature LOAN_VALUE_RATIO in the query data
  query['LOAN_VALUE_RATIO'] = query['AMT_CREDIT']/query['AMT_GOODS_PRICE']

  #create new feature LOAN_INCOME_RATIO in the query data
  query['LOAN_INCOME_RATIO'] = query['AMT_CREDIT']/query['AMT_INCOME_TOTAL']

  #Merge numerical features from bureau to query data
  query_bureau = query.merge(bureau_numerical_merge, on='SK_ID_CURR', how='left', suffixes=('', '_BUREAU'))

  #Merge categorical features from bureau to query data
  query_bureau = query_bureau.merge(bureau_categorical_merge, on='SK_ID_CURR', how='left', suffixes=('', '_BUREAU'))

  #Drop SK_ID_BUREAU
  query_bureau = query_bureau.drop(columns = ['SK_ID_BUREAU'])

  #Shape of query and bureau data combined
  print('The shape of query and bureau data merged: ', query_bureau.shape)
  
  #Merge numerical features from previous_application to query_bureau
  query_bureau_previous = query_bureau.merge(previous_numerical_merge, on='SK_ID_CURR', how='left', suffixes=('', '_PREVIOUS'))

  #Merge categorical features from previous_application to query_bureau
  query_bureau_previous = query_bureau_previous.merge(previous_categorical_merge, on='SK_ID_CURR', how='left', suffixes=('', '_PREVIOUS'))

  #Drop SK_ID_PREV
  query_bureau_previous = query_bureau_previous.drop(columns = ['SK_ID_PREV'])

  #Shape of query_bureau and previous_application data combined
  print('The shape of query_bureau and previous_application data merged: ', query_bureau_previous.shape)
  
  #Drop SK_ID_CURR
  query_bureau_previous = query_bureau_previous.drop(columns = ['SK_ID_CURR'])

  #Check that every expected feature (all saved columns except TARGET) is present in the query data
  missing_columns = set(list(columns)) - set(['TARGET']) - set(list(query_bureau_previous.columns))
  if len(missing_columns) != 0:
    #No prediction is made (the function returns None) when columns are missing
    print("Please enter values for all columns")
  else:
    predictions = predict_model(tuned_model, query_bureau_previous)
    return predictions
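
The column check in `predictor` prints only a generic message. The sketch below shows a more informative variant on toy data: `check_columns` is a hypothetical helper, not part of the notebook, that reports which expected columns are missing and, when none are, returns the query restricted to the expected feature order.

```python
import pandas as pd

def check_columns(query_df, expected_columns, target="TARGET"):
    """Return (aligned_df, missing): the query restricted to the expected
    feature order, and the sorted list of expected features it lacks.
    aligned_df is None when any expected feature is missing."""
    features = [c for c in expected_columns if c != target]
    missing = sorted(set(features) - set(query_df.columns))
    if missing:
        return None, missing
    return query_df[features], missing

# Toy layout and frames (made up for illustration)
expected = ["AMT_CREDIT", "AMT_ANNUITY", "TARGET"]
good = pd.DataFrame({"AMT_ANNUITY": [1.0], "AMT_CREDIT": [2.0], "EXTRA": [0]})
bad = pd.DataFrame({"AMT_CREDIT": [2.0]})

aligned, missing_good = check_columns(good, expected)
not_aligned, missing_bad = check_columns(bad, expected)
print(list(aligned.columns), missing_bad)
```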

In [None]:
#Using the data pipeline to predict defaulting tendency
query_prediction = predictor(query)
query_prediction

The shape of query and bureau data merged:  (48744, 139)
The shape of query_bureau and previous_application data merged:  (48744, 174)


Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19
,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,DEBT_INCOME_RATIO,LOAN_VALUE_RATIO,LOAN_INCOME_RATIO,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,CREDIT_TYPE,AMT_ANNUITY_PREVIOUS,AMT_APPLICATION,AMT_CREDIT_PREVIOUS,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE_PREVIOUS,HOUR_APPR_PROCESS_START_PREVIOUS,NFLAG_LAST_APPL_IN_DAY,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,DAYS_DECISION,SELLERPLACE_AREA,CNT_PAYMENT,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL,NAME_CONTRACT_TYPE_PREVIOUS,WEEKDAY_APPR_PROCESS_START_PREVIOUS,FLAG_LAST_APPL_PER_CONTRACT,NAME_CASH_LOAN_PURPOSE,NAME_CONTRACT_STATUS,NAME_PAYMENT_TYPE,CODE_REJECT_REASON,NAME_TYPE_SUITE_PREVIOUS,NAME_CLIENT_TYPE,NAME_GOODS_CATEGORY,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,CHANNEL_TYPE,NAME_SELLER_INDUSTRY,NAME_YIELD_GROUP,PRODUCT_COMBINATION,Label,Score
0,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,Unaccompanied,Working,Higher education,Married,House / apartment,0.018845,-19241,-2329,-5168.0,-812,,1,1,0,1,0,1,,2.0,2,2,TUESDAY,18,0,0,0,0,0,0,Kindergarten,0.752441,0.789551,0.159546,0.065979,0.058990,0.973145,,,,0.137939,0.125000,,,,0.050507,,,0.067200,0.061188,0.973145,,,,0.137939,0.125000,,,,0.052612,,,0.066589,0.058990,0.973145,,,,0.137939,0.125000,,,,0.051392,,,,block of flats,0.039215,"Stone, brick",No,0.0,0.0,0.0,0.0,-1740.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.152300,1.2640,4.213333,-857.0,0.0,-179.0,-715.0,,0.0,168345.000000,0.000,0.0,0.0,-155.0,0.0,Closed,currency 1,Consumer credit,3951.000000,24835.500000,23787.000000,2520.00,24835.500000,13.000000,1.0,0.104309,,,-1740.0,23.000000,8.000000,365243.000000,-1709.000000,-1499.000000,-1619.000000,-1612.000000,0.000000,Revolving loans,SUNDAY,Y,XAP,Approved,XNA,XAP,Family,Repeater,XNA,Cards,XNA,Credit and cash offices,XNA,XNA,Card Street,0,0.9752
1,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.035797,-18064,-4469,-9120.0,-1623,,1,1,0,1,0,0,Low-skill Laborers,2.0,2,2,FRIDAY,9,0,0,0,0,0,0,Self-employed,0.564941,0.291748,0.432861,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0,0.175455,1.2376,2.250182,-137.0,0.0,122.0,-123.0,0.00,0.0,58500.000000,25321.500,0.0,0.0,-31.0,0.0,Active,currency 1,Consumer credit,4813.200195,22308.750000,20076.750000,4464.00,44617.500000,10.500000,1.0,0.108948,,,-536.0,18.000000,12.000000,365243.000000,-706.000000,-376.000000,-466.000000,-460.000000,0.000000,Cash loans,SATURDAY,Y,XAP,Refused,Cash through the bank,XAP,Unaccompanied,Repeater,XNA,Cash,XNA,Country-wide,Connectivity,middle,POS mobile with interest,0,0.9138
2,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,,Working,Higher education,Married,House / apartment,0.019104,-20038,-4458,-2176.0,-3503,5.0,1,1,0,1,0,0,Drivers,2.0,2,2,MONDAY,14,0,0,0,0,0,0,Transport: type 3,,0.699707,0.610840,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-856.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0,0.344578,1.0528,3.275378,-1835.0,0.0,-999.0,-1168.0,19305.00,0.0,391770.000000,0.000,,0.0,-882.0,0.0,Closed,currency 1,Car loan,11478.195312,130871.250000,146134.125000,3375.00,174495.000000,14.500000,1.0,0.067200,,,-837.5,82.000000,17.328125,365243.000000,-1005.666687,-515.666687,-715.666687,-710.333313,0.333252,Cash loans,THURSDAY,Y,XNA,Canceled,XNA,XAP,Unaccompanied,Repeater,XNA,XNA,XNA,Credit and cash offices,XNA,XNA,Cash,0,0.9548
3,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.026398,-13976,-1866,-2000.0,-4208,,1,1,0,1,1,0,Sales staff,4.0,2,2,WEDNESDAY,11,0,0,0,0,0,0,Business Entity Type 3,0.525879,0.509766,0.612793,0.305176,0.197388,0.997070,0.958984,0.116516,0.320068,0.275879,0.375000,0.041687,0.204224,0.240356,0.367188,0.038605,0.080017,0.310791,0.204956,0.997070,0.960938,0.117615,0.322266,0.275879,0.375000,0.041687,0.208862,0.262695,0.382812,0.03891,0.084717,0.308105,0.197388,0.997070,0.959473,0.11731,0.320068,0.275879,0.375000,0.041687,0.207764,0.244629,0.373779,0.038788,0.081726,reg oper account,block of flats,0.370117,Panel,No,0.0,0.0,0.0,0.0,-1805.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0,0.155614,1.0000,5.000000,-1612.0,0.0,-896.5,-1375.0,0.00,0.0,129614.039062,0.000,0.0,0.0,-683.5,0.0,Closed,currency 1,Consumer credit,8091.584961,49207.500000,92920.500000,3750.00,82012.500000,10.796875,1.0,0.057709,,,-1124.0,1409.599976,11.335938,243054.328125,-1271.000000,121221.335938,121171.335938,121182.664062,0.000000,Consumer loans,SUNDAY,Y,XAP,Approved,Cash through the bank,XAP,Unaccompanied,Repeater,Computers,POS,XNA,Country-wide,Consumer electronics,middle,POS household with interest,0,0.9666
4,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.010033,-13040,-2191,-4000.0,-4262,16.0,1,1,1,1,0,0,,3.0,2,2,FRIDAY,5,0,0,0,0,1,1,Business Entity Type 3,0.202148,0.425781,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-821.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,0.178150,1.0000,3.475000,,,,,,,,,,,,,,,,17782.156250,267727.500000,300550.500000,8095.50,267727.500000,5.500000,1.0,0.087524,,,-466.0,13.000000,24.000000,365243.000000,-787.000000,-457.000000,-457.000000,-449.000000,0.000000,,,,,,,,,,,,,,,,,0,0.8171
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48739,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,270000.0,Unaccompanied,Working,Secondary / secondary special,Widow,House / apartment,0.002043,-19970,-5169,-9096.0,-3399,,1,1,1,1,1,0,,1.0,3,3,WEDNESDAY,16,0,0,0,0,0,0,Other,,0.648438,0.643066,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,0.0,1.0,0.0,-684.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.143815,1.5280,3.395555,-601.0,0.0,-98.0,-603.0,11427.75,0.0,145867.500000,0.000,0.0,0.0,-99.0,0.0,Closed,currency 1,Consumer credit,14222.429688,225000.000000,254700.000000,,225000.000000,14.000000,1.0,,,,-683.0,-1.000000,24.000000,365243.000000,-653.000000,37.000000,-593.000000,-591.000000,1.000000,Cash loans,SATURDAY,Y,XNA,Approved,Cash through the bank,XAP,Unaccompanied,Repeater,XNA,Cash,x-sell,Credit and cash offices,XNA,middle,Cash X-Sell: middle,0,0.9423
48740,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,495000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Married,House / apartment,0.035797,-11186,-1149,-3016.0,-3003,,1,1,0,1,0,0,Sales staff,4.0,2,2,MONDAY,11,0,0,0,0,1,1,Trade: type 7,,0.684570,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,0.202600,1.2574,3.951828,,,,,,,,,,,,,,,,6968.891113,86871.375000,98704.125000,1200.00,86871.375000,12.250000,1.0,0.042999,,,-1552.0,99.000000,17.500000,365243.000000,-1519.750000,-1024.750000,-1024.750000,-1019.500000,0.500000,,,,,,,,,,,,,,,,,0,0.8857
48741,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,315000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Married,House / apartment,0.026398,-15922,-3037,-2680.0,-1504,4.0,1,1,0,1,1,0,,3.0,2,2,WEDNESDAY,12,0,0,0,0,0,0,Business Entity Type 3,0.733398,0.632812,0.283691,0.111328,0.136353,0.995605,,,0.160034,0.137939,0.333252,,,,0.138306,,0.054199,0.113403,0.141479,0.995605,,,0.161133,0.137939,0.333252,,,,0.144043,,0.057404,0.112427,0.136353,0.995605,,,0.160034,0.137939,0.333252,,,,0.140747,,0.055389,,block of flats,0.166260,"Stone, brick",No,0.0,0.0,0.0,0.0,-838.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,3.0,1.0,0.163978,1.0000,1.555556,-349.0,0.0,-407.5,-406.0,0.00,0.0,54000.000000,0.000,0.0,0.0,-159.0,0.0,Closed,currency 1,Consumer credit,14201.078125,141060.078125,132516.828125,8543.25,141060.078125,20.000000,1.0,0.054474,,,-461.0,146.000000,11.000000,365243.000000,-423.500000,-123.500000,182293.500000,182307.500000,0.000000,Consumer loans,SATURDAY,Y,XAP,Approved,Cash through the bank,XAP,Family,New,Consumer Electronics,POS,XNA,Country-wide,Consumer electronics,middle,POS household with interest,0,0.9841
48742,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,450000.0,Family,Commercial associate,Higher education,Married,House / apartment,0.018845,-13968,-2731,-1461.0,-1364,,1,1,1,1,1,0,Managers,2.0,2,2,MONDAY,10,0,1,1,0,1,1,Self-employed,0.373047,0.445801,0.595215,0.162842,0.072327,0.989746,,,0.160034,0.068970,0.625000,,,,0.156250,,0.149048,0.166016,0.075012,0.989746,,,0.161133,0.068970,0.625000,,,,0.120422,,0.157715,0.164551,0.072327,0.989746,,,0.160034,0.068970,0.625000,,,,0.159058,,0.152100,,block of flats,0.197388,Panel,No,0.0,0.0,0.0,0.0,-2308.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0,0.111680,1.0000,2.000000,-1421.0,0.0,-1122.0,-1513.0,0.00,0.0,147339.000000,0.000,0.0,0.0,-1058.0,0.0,Closed,currency 1,Consumer credit,11486.215820,113758.203125,127578.601562,1500.00,142197.750000,14.000000,1.0,0.036346,,,-1284.0,22.600000,14.500000,365243.000000,-1409.000000,-929.000000,181620.500000,181622.500000,0.000000,Cash loans,SATURDAY,Y,XNA,Approved,Cash through the bank,XAP,Unaccompanied,Repeater,XNA,Cash,XNA,Credit and cash offices,XNA,XNA,Cash,0,0.9537


## 3.5.0 Key highlights of PyCaret framework for modeling & data pipeline
>
>* PyCaret is a low-code library for building and comparing multiple models and for setting up a data pipeline.
>
>* The model and data pipeline created by PyCaret are very large, and deployment on Heroku throws an error.
>
>* To reduce the file size, the model was retrained on 50% and then 25% of the original data. However, the error persisted.
>
>* To deploy the model on Heroku or another platform, building the model and pipeline with sklearn will be tried next.
>
>* A complete log of the modeling and deployment iterations can be read in the accompanying documentation.
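
Since artifact size is one of the suspected causes of the deployment error, it can help to measure how large an object becomes once serialized before pushing it. The snippet below is a minimal sketch using only the standard library; `pickled_size_mb` is a hypothetical helper, and the small dictionaries merely stand in for the real PyCaret model and pipeline.

```python
import pickle

def pickled_size_mb(obj):
    """Return the size of obj in megabytes once serialized with pickle."""
    return len(pickle.dumps(obj)) / (1024 * 1024)

# Stand-in objects (assumption): a larger payload serializes to more bytes
small_artifact = {"weights": list(range(1_000))}
large_artifact = {"weights": list(range(100_000))}

print(f"small: {pickled_size_mb(small_artifact):.3f} MB")
print(f"large: {pickled_size_mb(large_artifact):.3f} MB")
```

Running the same helper on the actual model object before and after retraining on less data would show whether the reduction was large enough to matter.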

# 4.0.0 Modeling & Data pipeline for Deployment using the sklearn framework
>
>* As seen in the previous section, the PyCaret model and pipeline work in the Colab notebook but could not be deployed on Heroku, most likely because of compatibility or size issues.
>
>* The model and the data pipeline are therefore rebuilt in sklearn.

## 4.1.0 Setting up & Processing the data

In [None]:
#Reading application_test data
application_test = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/application_test.csv'))

#Saving feature names from application_test
columns_input = list(application_test.columns)

#Creating save-point of feature names
file = open('columns_input.pkl', 'wb')
pickle.dump(columns_input, file)
file.close()
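
The save-points in this notebook all follow the same open, dump, close pattern. As a sketch of a slightly safer variant, the hypothetical helpers `save_pickle` and `load_pickle` below use a `with` block so the file is closed even if pickling raises. The temporary directory and the shortened column list are assumptions made only so the example runs anywhere.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Serialize obj to path; the with-block closes the file even on error."""
    with open(path, "wb") as file:
        pickle.dump(obj, file)

def load_pickle(path):
    """Load and return the object stored at path."""
    with open(path, "rb") as file:
        return pickle.load(file)

# Round-trip a small list of column names (illustrative subset)
example_columns = ["SK_ID_CURR", "AMT_CREDIT", "AMT_ANNUITY"]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "columns_input.pkl")
    save_pickle(example_columns, path)
    restored_columns = load_pickle(path)

print(restored_columns == example_columns)
```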

In [None]:
#Reading train_data
train_data = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/train_data.csv'))

In [None]:
#Creating lists of numerical and categorical features
y_train = train_data['TARGET']
X_train_numerical = train_data.select_dtypes(exclude=object).drop(columns=['TARGET'])
X_train_categorical = train_data.select_dtypes(include=object)
columns_numerical = X_train_numerical.columns
columns_categorical = X_train_categorical.columns

## 4.2.0 Imputing missing values & Scaling of data for Numerical features

In [None]:
#Imputation of missing data
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(X_train_numerical)
X_train_numerical_imputed = imputer.transform(X_train_numerical)

In [None]:
#Saving the imputer
file = open('imputer.pkl', 'wb')
pickle.dump(imputer, file)
file.close()

In [None]:
#Scaling of data
scaler = StandardScaler()
scaler.fit(X_train_numerical_imputed)
X_train_numerical_imputed_scaled = scaler.transform(X_train_numerical_imputed)
X_train_numerical_imputed_scaled_df = pd.DataFrame(data = X_train_numerical_imputed_scaled, columns = columns_numerical)

In [None]:
#Save the scaler
file = open('scaler.pkl', 'wb')
pickle.dump(scaler, file)
file.close()
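
To make the two transformations concrete, here is a pure-Python sketch of what median imputation followed by standard scaling computes for a single column. It illustrates the arithmetic rather than sklearn's implementation; the toy values are made up, and the population standard deviation is used because that is what `StandardScaler` divides by.

```python
import statistics

# Toy numerical column with one missing value (illustrative, not from the dataset)
column = [1.0, None, 3.0, 8.0]

# Median imputation: the median is learned from the observed values only
observed = [value for value in column if value is not None]
median = statistics.median(observed)
imputed = [median if value is None else value for value in column]

# Standard scaling: subtract the mean and divide by the population standard deviation
mean = statistics.fmean(imputed)
std = statistics.pstdev(imputed)
scaled = [(value - mean) / std for value in imputed]

print(imputed)
print([round(value, 3) for value in scaled])
```

At prediction time the same learned median, mean and standard deviation have to be reused, which is why the fitted `imputer` and `scaler` are pickled rather than refitted on the query data.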

## 4.3.0 One hot encoding of data for Categorical features

In [None]:
#Imputation of missing data
imputer_constant = SimpleImputer(strategy='constant', fill_value='missing_vale')
imputer_constant.fit(X_train_categorical)
X_train_categorical_imputed = imputer_constant.transform(X_train_categorical)

In [None]:
#Saving imputer_constant
file = open('imputer_constant.pkl', 'wb')
pickle.dump(imputer_constant, file)
file.close()

In [None]:
#One hot encoding of categorical data
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(X_train_categorical_imputed)
X_train_categorical_imputed_ohe = ohe.transform(X_train_categorical_imputed)
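#get_feature_names was removed in scikit-learn 1.2; get_feature_names_out is available from version 1.0 onwards
columns_ohe = ohe.get_feature_names_out(input_features=columns_categorical)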
X_train_categorical_imputed_ohe_df = pd.DataFrame(data = X_train_categorical_imputed_ohe.toarray(), columns = list(columns_ohe))

In [None]:
#Saving the one-hot-encoded column names
file = open('columns_ohe.pkl', 'wb')
pickle.dump(columns_ohe, file)
file.close()

In [None]:
#Saving the fitted one-hot encoder
file = open('ohe.pkl', 'wb')
pickle.dump(ohe, file)
file.close()
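
The effect of `handle_unknown='ignore'` is easiest to see on a tiny example. The sketch below mimics the behaviour in plain Python and is not sklearn's implementation; `fit_categories` and `one_hot` are hypothetical helpers, and the category values are illustrative.

```python
def fit_categories(values):
    """Learn the sorted list of categories seen during training."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode value as a 0/1 vector; an unseen value yields all zeros."""
    return [1 if value == category else 0 for category in categories]

# Categories learned from a toy training column (illustrative values)
train_values = ["Cash loans", "Revolving loans", "Cash loans"]
categories = fit_categories(train_values)

seen = one_hot("Revolving loans", categories)
unseen = one_hot("Some new loan type", categories)

print(categories)
print(seen)
print(unseen)
```

An unseen category therefore contributes no active column instead of raising an error, so the encoded query keeps the same width as the training data.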

## 4.5.0 Define train data with all columns

In [None]:
#Define train data with all columns
X_train_all_columns = pd.concat([X_train_numerical_imputed_scaled_df, X_train_categorical_imputed_ohe_df], axis = 1)

## 4.6.0 Outlier removal

In [None]:
#Defining the outlier detector and fitting it to X_train_all_columns with contamination = 0.05
clf = CBLOF(contamination=0.05, check_estimator=False, random_state=42)
clf.fit(X_train_all_columns)
scores_pred = clf.decision_function(X_train_all_columns) * -1

#Classifying the datapoints as outlier or inlier
outlier_prediction = clf.predict(X_train_all_columns)
inliers = len(outlier_prediction) - np.count_nonzero(outlier_prediction)
outliers = np.count_nonzero(outlier_prediction == 1)

In [None]:
#Removing the outliers
X_train_all_columns_outlier_label = X_train_all_columns.copy()
X_train_all_columns_outlier_label['outlier'] = outlier_prediction.tolist()
X_y_train_all_columns_outlier_label = pd.concat([X_train_all_columns_outlier_label, y_train], axis = 1)
X_y_train_final_outlier_removed = X_y_train_all_columns_outlier_label[X_y_train_all_columns_outlier_label['outlier'] != 1]
X_train = X_y_train_final_outlier_removed.drop(columns = ['TARGET', 'outlier'])
y_train = X_y_train_final_outlier_removed['TARGET']
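
The key point in the cell above is that features and target are filtered with the same outlier labels, so they stay aligned row by row. A minimal sketch of that idea follows, with plain lists standing in for the dataframes and made-up labels standing in for the CBLOF output.

```python
# Toy rows, targets and outlier labels (1 = outlier, 0 = inlier); values are illustrative
rows = [[0.1, 0.2], [5.0, 9.0], [0.3, 0.1], [0.2, 0.4]]
targets = [0, 1, 0, 1]
outlier_labels = [0, 1, 0, 0]

# Keep a row and its target only when the detector labelled the row as an inlier
kept = [
    (row, target)
    for row, target, label in zip(rows, targets, outlier_labels)
    if label != 1
]
rows_clean = [row for row, _ in kept]
targets_clean = [target for _, target in kept]

print(len(rows_clean), len(targets_clean))
```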

## 4.7.0 Feature Selection

In [None]:
#Defining the model for feature selection
model_feature_selection = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

In [None]:
#Ranking the features by importance and keeping the top 175 as a list
feature_importance = pd.DataFrame(model_feature_selection.feature_importances_, index=X_train.columns, columns=['importance']).sort_values('importance', ascending=False)
selected_features = list(feature_importance['importance'].head(175).index)

In [None]:
#Saving the selected features as a list
file = open('selected_features.pkl', 'wb')
pickle.dump(selected_features, file)
file.close()
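
The selection step boils down to ranking features by importance and keeping the first k names. As a sketch, the hypothetical helper `top_k_features` below does this on a plain dictionary; the importance values are made up for illustration and are not results from this notebook.

```python
def top_k_features(importances, k):
    """Return the k feature names with the highest importance, best first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Illustrative importances (assumption); the real ones come from feature_importances_
importances = {
    "EXT_SOURCE_2": 0.30,
    "EXT_SOURCE_3": 0.25,
    "DEBT_INCOME_RATIO": 0.10,
    "FLAG_MOBIL": 0.00,
}

selected = top_k_features(importances, k=2)
print(selected)
```

Saving only the resulting list of names is enough for deployment, because the prediction pipeline just needs to subset the query columns in the same way.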

## 4.8.0 Train model

In [None]:
#Defining the model in sklearn
model = GradientBoostingClassifier(random_state=0).fit(X_train[selected_features], y_train)

In [None]:
#Saving the model
file = open('model.pkl', 'wb')
pickle.dump(model, file)
file.close()

## 4.9.0 Create Pipeline and predict

In [None]:
#Importing the checkpoint-saved data and pickle files
bureau_numerical_merge = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/bureau_numerical_merge.csv'))
bureau_categorical_merge = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/bureau_categorical_merge.csv'))
previous_numerical_merge = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/previous_numerical_merge.csv'))
previous_categorical_merge = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/previous_categorical_merge.csv'))
filename = open('/content/drive/MyDrive/Data/columns_input.pkl', 'rb')
columns_input = pickle.load(filename)
filename.close()
model_temp = open('/content/drive/MyDrive/Data/model.pkl', 'rb')
model = pickle.load(model_temp)
model_temp.close()
imputer_temp = open('/content/drive/MyDrive/Data/imputer.pkl', 'rb')
imputer = pickle.load(imputer_temp)
imputer_temp.close()
scaler_temp = open('/content/drive/MyDrive/Data/scaler.pkl', 'rb')
scaler = pickle.load(scaler_temp)
scaler_temp.close()
constant_temp = open('/content/drive/MyDrive/Data/imputer_constant.pkl', 'rb')
imputer_constant = pickle.load(constant_temp)
constant_temp.close()
ohe_temp = open('/content/drive/MyDrive/Data/ohe.pkl', 'rb')
ohe = pickle.load(ohe_temp)
ohe_temp.close()
selected_temp = open('/content/drive/MyDrive/Data/selected_features.pkl', 'rb')
selected_features = pickle.load(selected_temp)
selected_temp.close()
ohe_col_temp = open('/content/drive/MyDrive/Data/columns_ohe.pkl', 'rb')
columns_ohe = pickle.load(ohe_col_temp)
ohe_col_temp.close()

In [None]:
#Function as a pipeline for prediction
def predictor(query):
  #Create additional features named DEBT_INCOME_RATIO, LOAN_VALUE_RATIO & LOAN_INCOME_RATIO in a copy of the query data
  query_with_additional_features = query.copy()
  query_with_additional_features['DEBT_INCOME_RATIO'] = query_with_additional_features['AMT_ANNUITY']/query_with_additional_features['AMT_INCOME_TOTAL']
  query_with_additional_features['LOAN_VALUE_RATIO'] = query_with_additional_features['AMT_CREDIT']/query_with_additional_features['AMT_GOODS_PRICE']
  query_with_additional_features['LOAN_INCOME_RATIO'] = query_with_additional_features['AMT_CREDIT']/query_with_additional_features['AMT_INCOME_TOTAL']

  #Merge numerical features from bureau to query data
  query_bureau = query_with_additional_features.merge(bureau_numerical_merge, on='SK_ID_CURR', how='left', suffixes=('', '_BUREAU'))

  #Merge categorical features from bureau to query data
  query_bureau = query_bureau.merge(bureau_categorical_merge, on='SK_ID_CURR', how='left', suffixes=('', '_BUREAU'))

  #Drop SK_ID_BUREAU
  query_bureau = query_bureau.drop(columns = ['SK_ID_BUREAU'])
  
  #Merge numerical features from previous_application to query_bureau
  query_bureau_previous = query_bureau.merge(previous_numerical_merge, on='SK_ID_CURR', how='left', suffixes=('', '_PREVIOUS'))

  #Merge categorical features from previous_application to query_bureau
  query_bureau_previous = query_bureau_previous.merge(previous_categorical_merge, on='SK_ID_CURR', how='left', suffixes=('', '_PREVIOUS'))

  #Drop SK_ID_PREV
  query_bureau_previous = query_bureau_previous.drop(columns = ['SK_ID_PREV'])
    
  #Drop SK_ID_CURR
  query_bureau_previous = query_bureau_previous.drop(columns = ['SK_ID_CURR'])

  #Split the merged query into numerical and categorical features
  query_numerical = query_bureau_previous.select_dtypes(exclude=object)
  query_categorical = query_bureau_previous.select_dtypes(include=object)

  columns_numerical = query_numerical.columns
  columns_categorical = query_categorical.columns

  #Impute and scale the numerical features with the transformers fitted on the train data
  query_numerical_imputed = imputer.transform(query_numerical)
  query_numerical_imputed_scaled = scaler.transform(query_numerical_imputed)
  query_numerical_imputed_scaled_df = pd.DataFrame(data = query_numerical_imputed_scaled, columns = columns_numerical)

  #Impute and one-hot encode the categorical features with the transformers fitted on the train data
  query_categorical_imputed = imputer_constant.transform(query_categorical)
  query_categorical_imputed_ohe = ohe.transform(query_categorical_imputed)
  query_categorical_imputed_ohe_df = pd.DataFrame(data = query_categorical_imputed_ohe.toarray(), columns = list(columns_ohe))

  #Combine all features and keep only the ones selected during training
  query_data_all_features = pd.concat([query_numerical_imputed_scaled_df, query_categorical_imputed_ohe_df], axis = 1)
  query_data = query_data_all_features[selected_features]

  #Predict and map the class labels to a defaulting tendency (0 -> "Low", otherwise "High")
  predictions = model.predict(query_data)
  pred_cat = ["Low" if prediction == 0 else "High" for prediction in predictions]
  applicant_no = query['SK_ID_CURR'].copy()
  pred_df = pd.DataFrame(pred_cat, columns = ['Defaulting Tendency'])
  pred_out = pd.concat([applicant_no, pred_df], axis=1, ignore_index=False)
  return pred_out
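
As a worked example of the three engineered ratios, the sketch below recomputes them for applicant 100038 using the amounts listed for that applicant in this notebook's output tables. The helper name `ratio_features` is hypothetical; the formulas are the ones used inside `predictor`.

```python
# Amounts for applicant 100038, as listed in the output tables of this notebook
applicant = {
    "AMT_INCOME_TOTAL": 180000.0,
    "AMT_CREDIT": 625500.0,
    "AMT_ANNUITY": 32067.0,
    "AMT_GOODS_PRICE": 625500.0,
}

def ratio_features(row):
    """Compute the three engineered ratios used by the predictor."""
    return {
        "DEBT_INCOME_RATIO": row["AMT_ANNUITY"] / row["AMT_INCOME_TOTAL"],
        "LOAN_VALUE_RATIO": row["AMT_CREDIT"] / row["AMT_GOODS_PRICE"],
        "LOAN_INCOME_RATIO": row["AMT_CREDIT"] / row["AMT_INCOME_TOTAL"],
    }

ratios = ratio_features(applicant)
for name, value in ratios.items():
    print(f"{name}: {value:.6f}")
```

The results (0.178150, 1.000000 and 3.475000) agree with the ratio columns shown for this applicant in the earlier prediction output.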

In [None]:
#Reading the query data point/s
query = df_size_optimizer(pd.read_csv('/content/drive/MyDrive/Data/application_test.csv'))
col_names=query.columns.values.tolist()
if col_names == columns_input:
  query_prediction = predictor(query)
  query_pred=pd.DataFrame(query_prediction)
  query_pred.columns = ['Applicant ID', 'Defaulting Tendency']
  pred_append=pd.concat([query,query_pred['Defaulting Tendency']], axis=1, ignore_index=False)
  display(pred_append)
else:
  print("Input applicant form-fields are not in the prescribed format. Please adhere to the same for processing.")

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,Defaulting Tendency
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,Unaccompanied,Working,Higher education,Married,House / apartment,0.018845,-19241,-2329,-5168.0,-812,,1,1,0,1,0,1,,2.0,2,2,TUESDAY,18,0,0,0,0,0,0,Kindergarten,0.752441,0.789551,0.159546,0.065979,0.058990,0.973145,,,,0.137939,0.125000,,,,0.050507,,,0.067200,0.061188,0.973145,,,,0.137939,0.125000,,,,0.052612,,,0.066589,0.058990,0.973145,,,,0.137939,0.125000,,,,0.051392,,,,block of flats,0.039215,"Stone, brick",No,0.0,0.0,0.0,0.0,-1740.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,Low
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.035797,-18064,-4469,-9120.0,-1623,,1,1,0,1,0,0,Low-skill Laborers,2.0,2,2,FRIDAY,9,0,0,0,0,0,0,Self-employed,0.564941,0.291748,0.432861,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0,Low
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,,Working,Higher education,Married,House / apartment,0.019104,-20038,-4458,-2176.0,-3503,5.0,1,1,0,1,0,0,Drivers,2.0,2,2,MONDAY,14,0,0,0,0,0,0,Transport: type 3,,0.699707,0.610840,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-856.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0,Low
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.026398,-13976,-1866,-2000.0,-4208,,1,1,0,1,1,0,Sales staff,4.0,2,2,WEDNESDAY,11,0,0,0,0,0,0,Business Entity Type 3,0.525879,0.509766,0.612793,0.305176,0.197388,0.997070,0.958984,0.116516,0.320068,0.275879,0.375000,0.041687,0.204224,0.240356,0.367188,0.038605,0.080017,0.310791,0.204956,0.997070,0.960938,0.117615,0.322266,0.275879,0.375000,0.041687,0.208862,0.262695,0.382812,0.03891,0.084717,0.308105,0.197388,0.997070,0.959473,0.11731,0.320068,0.275879,0.375000,0.041687,0.207764,0.244629,0.373779,0.038788,0.081726,reg oper account,block of flats,0.370117,Panel,No,0.0,0.0,0.0,0.0,-1805.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0,Low
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.010033,-13040,-2191,-4000.0,-4262,16.0,1,1,1,1,0,0,,3.0,2,2,FRIDAY,5,0,0,0,0,1,1,Business Entity Type 3,0.202148,0.425781,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-821.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48739,456221,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,270000.0,Unaccompanied,Working,Secondary / secondary special,Widow,House / apartment,0.002043,-19970,-5169,-9096.0,-3399,,1,1,1,1,1,0,,1.0,3,3,WEDNESDAY,16,0,0,0,0,0,0,Other,,0.648438,0.643066,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,0.0,1.0,0.0,-684.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,Low
48740,456222,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,495000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Married,House / apartment,0.035797,-11186,-1149,-3016.0,-3003,,1,1,0,1,0,0,Sales staff,4.0,2,2,MONDAY,11,0,0,0,0,1,1,Trade: type 7,,0.684570,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,Low
48741,456223,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,315000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Married,House / apartment,0.026398,-15922,-3037,-2680.0,-1504,4.0,1,1,0,1,1,0,,3.0,2,2,WEDNESDAY,12,0,0,0,0,0,0,Business Entity Type 3,0.733398,0.632812,0.283691,0.111328,0.136353,0.995605,,,0.160034,0.137939,0.333252,,,,0.138306,,0.054199,0.113403,0.141479,0.995605,,,0.161133,0.137939,0.333252,,,,0.144043,,0.057404,0.112427,0.136353,0.995605,,,0.160034,0.137939,0.333252,,,,0.140747,,0.055389,,block of flats,0.166260,"Stone, brick",No,0.0,0.0,0.0,0.0,-838.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,3.0,1.0,Low
48742,456224,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,450000.0,Family,Commercial associate,Higher education,Married,House / apartment,0.018845,-13968,-2731,-1461.0,-1364,,1,1,1,1,1,0,Managers,2.0,2,2,MONDAY,10,0,1,1,0,1,1,Self-employed,0.373047,0.445801,0.595215,0.162842,0.072327,0.989746,,,0.160034,0.068970,0.625000,,,,0.156250,,0.149048,0.166016,0.075012,0.989746,,,0.161133,0.068970,0.625000,,,,0.120422,,0.157715,0.164551,0.072327,0.989746,,,0.160034,0.068970,0.625000,,,,0.159058,,0.152100,,block of flats,0.197388,Panel,No,0.0,0.0,0.0,0.0,-2308.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0,Low
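
The validation in the cell above compares the uploaded column names with `columns_input` as whole lists, so it can only report that something is wrong. As a possible refinement, and purely as a sketch, the hypothetical helper `diff_columns` below reports which names are missing, unexpected or merely reordered; the column lists are shortened for illustration.

```python
def diff_columns(expected, received):
    """Describe how the received column names deviate from the expected ones."""
    missing = [name for name in expected if name not in received]
    unexpected = [name for name in received if name not in expected]
    # Same names in a different order still fail an exact list comparison
    reordered = not missing and not unexpected and list(expected) != list(received)
    return {"missing": missing, "unexpected": unexpected, "reordered": reordered}

# Illustrative subset of the template columns and a faulty upload
expected_columns = ["SK_ID_CURR", "AMT_CREDIT", "AMT_ANNUITY"]
uploaded_columns = ["SK_ID_CURR", "AMT_ANNUITY", "AMT_CREDITS"]

report = diff_columns(expected_columns, uploaded_columns)
print(report)
```

A message built from such a report would tell the user exactly which form fields to correct before uploading again.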


# 5.0.0 Conclusion & Summary

>**The finalised deployment platform**
>
>After trying out the mentioned combinations of platforms and services, I opted for Streamlit + Heroku as the primary method of deployment, for the following reasons:
>* Streamlit let me customise the app UI far more than I was able to with FastAPI.
>* Heroku required the least time for iterative deployment, and because the entire repository is on GitHub, I could modify, build and redeploy from anywhere.


>**Basic architecture of the app system**
>* The user interacts with the app through any browser on their local PC, which runs the Streamlit client, and uploads the query CSV file containing the applicant data.
>* The model hosted on the remote Heroku box computes the predictions and sends back the results, which are displayed in the user’s browser and can also be downloaded.
>* When the GitHub repo is linked with the Heroku site for the first time, the files are pulled into the Heroku box.


>**Highlights of the deployed app**
>
>*App engagement*
>
>* The app accepts the Home Credit applicant details as a CSV file in the same format as the test dataset.
> 
>* A downloadable template is provided for the user to enter data into.
>* Individual form fields are not provided because the large number of fields would make for an unpleasant UX.
>
>* The model predictions are displayed on screen as an interactive dataframe and are also available as a downloadable CSV file in which they are appended to the original query set.
>
>* Importantly, the following error-handling methods are implemented:
>>* The uploaded CSV is checked against the feature names required by the template; in case of a mismatch, the app displays a message saying so.
>>* When a new, unseen category value is encountered in the query data, it is ignored by setting the encoder's ‘handle_unknown’ parameter to ‘ignore’, and processing continues.
>
>
>*Scalability, Throughput, Latency and real-world case*
>
>* The app was fed the raw Home Credit test dataset of around 50k applicant records, with a file size of approximately 26 MB.
>
>* After the upload, which took around 5-30 seconds depending on internet connectivity, the app does the entire data processing and predicts the defaulting tendency for all applicants in less than 40 seconds.
>
>* In the real-world scenario, latency is not a strict requirement, so this is acceptable.
>
>* As for throughput, the app can be run several times a day or even once per application, so throughput volume is not a limiting factor.

