**Table of contents**<a id='toc0_'></a>    
- [Data Pre-processing:](#toc1_)    
  - [Import Libraries and Datasets](#toc1_1_)    
  - [Formatting](#toc1_2_)    
  - [Examining Data:](#toc1_3_)    
    - [Columns and Data Types](#toc1_3_1_)    
    - [Dropping Non-Required Columns](#toc1_3_2_)    
    - [Checking/Handling Missing Values](#toc1_3_3_)    
- [Feature Engineering:](#toc2_)    
  - [Merchant Characteristics](#toc2_1_)    
  - [Loan Characteristics](#toc2_2_)    
  - [Transaction Characteristics](#toc2_3_)    
  - [Loan-Sales Ratios](#toc2_4_)    
  - [Target Variables](#toc2_5_)    
- [Data Exploration:](#toc3_)    
  - [Checking/Handling Missing Values](#toc3_1_)    
  - [Checking/Removing Outliers](#toc3_2_)    
- [Models:](#toc4_)    
  - [Splitting Dataset into Training and Testing](#toc4_1_)    
  - [Model Selection and Training:](#toc4_2_)    
    - [XGBoost](#toc4_2_1_)    
    - [Decision Tree and Random Forest](#toc4_2_2_)    
  - [Hypertuning](#toc4_3_)    
- [Notes:](#toc4_4_)    


# <a id='toc1_'></a> Data Pre-processing:  

## <a id='toc1_1_'></a> Import Libraries and Datasets

In [16]:
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn import preprocessing 
import seaborn as sns
import xgboost as xgb # type: ignore
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split,GridSearchCV
import joblib
from sklearn.metrics import f1_score
%matplotlib inline

In [13]:
merchantDetails = pd.DataFrame({

})

loanScheduleDetails = pd.DataFrame({
 
})

loanLedgerDetails = pd.DataFrame({
 
})

transactionDetails = pd.DataFrame({

})

## <a id='toc1_2_'></a> Formatting  

In [None]:
# Rename columns with same name

merchantDetails.rename(columns={'id': 'merchant_code', 'created_at': 'merchant_created_at', 'updated_at':'merchant_updated_at'}, inplace=True)

loanScheduleDetails.rename(columns={'id': 'loan_schedule_id', 'created_at': 'loan_schedule_created_at', 'updated_at':'loan_schedule_updated_at', 'paid_date':'loan_schedule_paid_date'}, inplace=True)

loanLedgerDetails.rename(columns={'id': 'loan_id', 'created_at': 'loan_created_at', 'updated_at':'loan_updated_at', 'deleted_at':'loan_deleted_at', 'transaction_date':'loan_transaction_date', 'transaction_type':'loan_transaction_type', 'description':'loan_description'}, inplace=True)

transactionDetails.rename(columns={'id': 'transaction_id', 'created_at': 'transaction_created_at'}, inplace=True)

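# Convert date columns to datetime format (apply to_datetime only to the date columns, not the whole frame)

merchant_date_cols = ['merchant_created_at', 'merchant_updated_at']
merchantDetails[merchant_date_cols] = merchantDetails[merchant_date_cols].apply(pd.to_datetime)

schedule_date_cols = ['loan_schedule_created_at', 'loan_schedule_updated_at', 'loan_schedule_paid_date', 'schedule_date']
loanScheduleDetails[schedule_date_cols] = loanScheduleDetails[schedule_date_cols].apply(pd.to_datetime)

# loan_transaction_date is converted as well because it is used with .dt accessors later on
ledger_date_cols = ['loan_created_at', 'loan_updated_at', 'loan_deleted_at', 'loan_transaction_date']
loanLedgerDetails[ledger_date_cols] = loanLedgerDetails[ledger_date_cols].apply(pd.to_datetime)

transaction_date_cols = ['transaction_datetime', 'transaction_created_at']
transactionDetails[transaction_date_cols] = transactionDetails[transaction_date_cols].apply(pd.to_datetime)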

# Use loans only older than 3 months

# Calculate the cutoff date for loans older than 3 months
cutoff_date = pd.Timestamp.today() - pd.DateOffset(months=3)

loanSchedule = loanScheduleDetails[loanScheduleDetails['schedule_date'] < cutoff_date]

loanLedger = loanLedgerDetails[loanLedgerDetails['loan_created_at'] < cutoff_date]

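# Merge the data on 'merchant_code' to get only merchants who have or have had loans
# (drop duplicate codes first so each merchant appears once, not once per schedule row)

merchants = pd.merge(merchantDetails, loanSchedule[['merchant_code']].drop_duplicates(), on='merchant_code', how='inner')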

# Filter transactionDetails to only include merchants who have taken loans and transactions prior to loan disbursement

transactions = pd.merge(transactionDetails, loanSchedule[['merchant_code', 'loan_schedule_created_at']], on='merchant_code', how='inner')

transactions = transactions[transactions['transaction_created_at'] < transactions['loan_schedule_created_at']]

## <a id='toc1_3_'></a> Examining Data:  

### <a id='toc1_3_1_'></a> Columns and Data Types

In [None]:
merchants.head()
loanSchedule.head()
loanLedger.head()
transactions.head()



In [None]:
merchants.info()
loanSchedule.info()
loanLedger.info()
transactions.info()

### <a id='toc1_3_2_'></a> Dropping Non-Required Columns 

In [None]:
merchants=merchants.drop([''],axis=1)
loanSchedule=loanSchedule.drop([''],axis=1)
loanLedger=loanLedger.drop([''],axis=1)
transactions=transactions.drop([''],axis=1)

### <a id='toc1_3_3_'></a> Checking/Handling Missing Values

In [None]:
missing_info_merchants = merchants.isnull().sum()
missing_info_loanSchedule = loanSchedule.isnull().sum()
missing_info_loanLedger = loanLedger.isnull().sum()
missing_info_transactions = transactions.isnull().sum()
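
The counts above are computed but never displayed. As a minimal sketch, the helper below reports both the count and the percentage of missing values per column. The function name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing_count"] > 0].sort_values("missing_count", ascending=False)

# Toy frame standing in for one of the tables above (hypothetical values)
toy = pd.DataFrame({
    "merchant_code": ["A", "B", "C", "D"],
    "amount": [10.0, np.nan, 30.0, np.nan],
    "response_code": [0, 0, None, 0],
})
summary = missing_summary(toy)
print(summary)
```

The same call could be applied to `merchants`, `loanSchedule`, `loanLedger` and `transactions` in turn.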

# <a id='toc2_'></a> Feature Engineering:  

## <a id='toc2_1_'></a> Merchant Characteristics  

In [None]:
merch_char = pd.DataFrame()
merch_char['merchant_code'] = merchants['merchant_code']

# Age of relationship with the payment company at the time of disbursal (in months)

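disbursement_df = loanLedger[loanLedger['loan_transaction_type'] == 'DISBURSEMENT'].copy()  # copy so new columns can be added safely
disbursement_df['loan_transaction_date'] = pd.to_datetime(disbursement_df['loan_transaction_date'])
merchant_creation_map = merchants.set_index('merchant_code')['merchant_created_at'].to_dict()
disbursement_df['merchant_created_at'] = pd.to_datetime(disbursement_df['merchant_code'].map(merchant_creation_map))
# Whole calendar months between merchant creation and disbursal (a timedelta has no .dt.month accessor)
disbursement_df['relationage'] = (
    (disbursement_df['loan_transaction_date'].dt.year - disbursement_df['merchant_created_at'].dt.year) * 12
    + (disbursement_df['loan_transaction_date'].dt.month - disbursement_df['merchant_created_at'].dt.month)
)
disbursement_age_df = disbursement_df[['merchant_code', 'relationage']]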

merch_char = pd.merge(merch_char, disbursement_age_df, on='merchant_code', how='left')

# Age of the business owner at the time of the payment company loan
# Gender (Dummy variable assigning Male = 1)

# (For repeat borrowers) Number of days between disbursal of the repeat loan and closure of the previous loan
# (For repeat borrowers) Principal amount of the previous loan
# (For repeat borrowers) implied days past due for previous loan
# (For repeat borrowers) 1 if the previous loan was late on implied criteria (lastidpd > 30)
# (For repeat borrowers) Loan number (first loan, second loan, . . . )
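
The repeat-borrower features listed in the comments above are not implemented yet. The sketch below shows one way some of them could be derived with `groupby`, `cumcount` and `shift`. It assumes a hypothetical one-row-per-loan table; the column names `disbursed_at`, `closed_at` and `principal` are placeholders and do not come from the notebook's data.

```python
import pandas as pd

# Hypothetical one-row-per-loan table; column names are assumptions
loans = pd.DataFrame({
    "merchant_code": ["A", "A", "B", "A"],
    "disbursed_at": pd.to_datetime(["2023-01-10", "2023-06-01", "2023-02-15", "2023-11-20"]),
    "closed_at": pd.to_datetime(["2023-05-02", "2023-10-21", "2023-07-30", None]),
    "principal": [1000.0, 1500.0, 800.0, 2000.0],
})

# Order each merchant's loans chronologically before using positional operations
loans = loans.sort_values(["merchant_code", "disbursed_at"]).reset_index(drop=True)

# Loan number: 1 for the first loan, 2 for the second, ...
loans["loan_number"] = loans.groupby("merchant_code").cumcount() + 1

# Principal of the previous loan (NaN for first-time borrowers)
loans["prev_principal"] = loans.groupby("merchant_code")["principal"].shift(1)

# Days between closure of the previous loan and disbursal of this one
prev_closed_at = loans.groupby("merchant_code")["closed_at"].shift(1)
loans["days_since_prev_closure"] = (loans["disbursed_at"] - prev_closed_at).dt.days

print(loans)
```

First loans get `NaN` for the previous-loan columns, which keeps first-time and repeat borrowers distinguishable downstream.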


## <a id='toc2_2_'></a> Loan Characteristics  

In [None]:
loan_char = pd.DataFrame()
loan_char['merchant_code'] = merchants['merchant_code']

# Month of loan disbursal

disbursement_df['mdisburse'] = disbursement_df['loan_transaction_date'].dt.month
month_of_disbursement = disbursement_df[['merchant_code', 'mdisburse']]
loan_char = pd.merge(loan_char, month_of_disbursement, on='merchant_code', how='left')

# Principal amount

loan_char = pd.merge(loan_char, disbursement_df[['merchant_code', 'credit']], on='merchant_code', how='left')
loan_char.rename(columns={'credit': 'loanamount'}, inplace=True)

# Suggested tenure



## <a id='toc2_3_'></a> Transaction Characteristics  

In [None]:
def calculate_trans_char(transactions, sales_period_end=1, sales_period_start=28, suffix=''):

    # Merge disbursement date to transactions
    disbursement_dates_amounts = loanLedger[loanLedger['loan_transaction_type'] == 'DISBURSEMENT']

    transactions = pd.merge(transactions, disbursement_dates_amounts[['merchant_code', 'loan_transaction_date', 'debit']], on='merchant_code', how='left')
    results = []

    # Loop for each merchant
    for merchant_code in transactions['merchant_code'].unique():

        # Filter transactions for the specific merchant
        merchant_transactions = transactions[transactions['merchant_code'] == merchant_code]
        disbursement_date = merchant_transactions['loan_transaction_date'].iloc[0]

        # Filter transactions for the sales period before disbursement
        sales_period_transactions_all = merchant_transactions[
            (merchant_transactions['transaction_datetime'] >= (disbursement_date - pd.Timedelta(days=sales_period_start))) &
            (merchant_transactions['transaction_datetime'] <= (disbursement_date - pd.Timedelta(days=sales_period_end)))
        ]

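        # Filter transactions by response_code = 0
        sales_period_transactions = sales_period_transactions_all[sales_period_transactions_all['response_code'] == 0]

        # Skip merchants with no matching transactions in the window; the statistics below are undefined for them
        if sales_period_transactions.empty:
            continue

        # Aggregate sales by date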
        daily_sales = sales_period_transactions.groupby(sales_period_transactions['transaction_datetime'].dt.date)['amount'].sum().reset_index(name='daily_sales')
        daily_sales['transaction_datetime'] = pd.to_datetime(daily_sales['transaction_datetime'])

        # Ensure continuity of dates: fill in missing dates with 0 sales
        # (normalize to midnight so the calendar lines up with the date-only keys in daily_sales)
        window_start = (disbursement_date - pd.Timedelta(days=sales_period_start)).normalize()
        window_end = (disbursement_date - pd.Timedelta(days=sales_period_end)).normalize()
        all_dates = pd.date_range(start=window_start, end=window_end)
        daily_sales_continuous = pd.DataFrame({'transaction_datetime': all_dates})
        daily_sales_continuous = pd.merge(daily_sales_continuous, daily_sales, on='transaction_datetime', how='left').fillna(0)
        
        # Average of total daily transaction value (in log)
        # (the aggregated column is named 'daily_sales', not 'amount')
        avg_daily_sales = daily_sales_continuous['daily_sales'].mean()
        log_avg_transday = np.log1p(avg_daily_sales)
        
        # Average transaction sizes (in log)
        avg_transaction_size = sales_period_transactions['amount'].mean()
        log_avgtrans = np.log1p(avg_transaction_size)
        
        # Coefficient of variation of transaction sizes
        std_dev_transaction_size = sales_period_transactions['amount'].std()
        cv_trans = std_dev_transaction_size / avg_transaction_size if avg_transaction_size != 0 else 0
        
        # Coefficient of variation of total daily transaction values
        std_dev_daily_sales = daily_sales_continuous['daily_sales'].std()
        cv_transday = std_dev_daily_sales / avg_daily_sales if avg_daily_sales != 0 else 0
        
        # Herfindahl-Hirschman index of customers' total transaction value
        customer_sales = sales_period_transactions.groupby('pan')['amount'].sum()
        total_sales = customer_sales.sum()
        customer_sales_proportion = customer_sales / total_sales
        HH_cust_trans = (customer_sales_proportion ** 2).sum()
        
        # Average of periods without any transactions (# of inactive days/total number of transaction days)
        transaction_dates = sales_period_transactions['transaction_datetime'].dt.date.unique()
        inactive_days = len(set(all_dates.date) - set(transaction_dates))
        total_transaction_days = len(transaction_dates)
        avg_inactivity = inactive_days / total_transaction_days if total_transaction_days != 0 else 0
        
        # Days since the period of longest inactivity
        all_dates_df = pd.DataFrame(all_dates, columns=['date'])
        # Compare date objects with date objects; Timestamps may not match the entries in transaction_dates
        all_dates_df['had_transaction'] = all_dates_df['date'].dt.date.isin(transaction_dates)
        all_dates_df['inactive_period'] = (~all_dates_df['had_transaction']).astype(int).diff().ne(0).cumsum()
        inactive_periods = all_dates_df[~all_dates_df['had_transaction']].groupby('inactive_period')['date'].agg(['min', 'max', 'size'])
        if inactive_periods.empty:
            # No inactive day in the window, so the feature is undefined
            days_since_max_inactivity = np.nan
        else:
            longest_inactivity_period = inactive_periods.loc[inactive_periods['size'].idxmax()]
            days_since_max_inactivity = (disbursement_date.date() - longest_inactivity_period['max'].date()).days
        
        # Days since last transaction
        last_transaction_date = sales_period_transactions['transaction_datetime'].max().date()
        days_since_lasttrans = (disbursement_date.date() - last_transaction_date).days
        
        # (Relative to disbursal) Day with the largest transaction value
        largest_transaction_day = daily_sales.loc[daily_sales['daily_sales'].idxmax(), 'transaction_datetime']
        # Convert the Timestamp to a date before subtracting a date
        max_trans_dt = (largest_transaction_day.date() - disbursement_date.date()).days
    
        # (Relative to disbursal) Day with most number of transactions
        daily_transaction_count = sales_period_transactions.groupby(sales_period_transactions['transaction_datetime'].dt.date).size().reset_index(name='transaction_count')
        most_transactions_day = daily_transaction_count.loc[daily_transaction_count['transaction_count'].idxmax(), 'transaction_datetime']
        max_transcount_dt = (most_transactions_day - disbursement_date.date()).days
        
        # Days since last transaction of most frequent customer at the time of disbursal.
        most_frequent_customer = sales_period_transactions.groupby('pan').size().idxmax()
        most_frequent_customer_transactions = sales_period_transactions[sales_period_transactions['pan'] == most_frequent_customer]
        most_frequent_customer_last_transaction_date = most_frequent_customer_transactions['transaction_datetime'].max().date()
        dayspast_freqcust = (disbursement_date.date() - most_frequent_customer_last_transaction_date).days
        
        # Days since last transaction of largest customer at the time of disbursal.
        largest_customer = sales_period_transactions.groupby('pan')['amount'].sum().idxmax()
        largest_customer_transactions = sales_period_transactions[sales_period_transactions['pan'] == largest_customer]
        last_transaction_largest_customer = largest_customer_transactions['transaction_datetime'].max().date()
        dayspast_largcust = (disbursement_date.date() - last_transaction_largest_customer).days
        
        # Number of distinct customers within period (in log).
        custcount = sales_period_transactions['pan'].nunique()
        log_custcount = np.log1p(custcount)
        
        # Share of total transaction value conducted on {DayOfWeek} (0 = Monday, ..., 6 = Sunday)
        # Group on the derived series directly instead of adding a column to a filtered slice
        transaction_value_by_day = sales_period_transactions.groupby(sales_period_transactions['transaction_datetime'].dt.dayofweek)['amount'].sum()
        total_transaction_value = sales_period_transactions['amount'].sum()
        share_by_day_of_week = transaction_value_by_day / total_transaction_value
        day_of_week_share = share_by_day_of_week.to_dict()
        
        # Weekdays without transactions are absent from the dict, so their share defaults to 0
        shr_Mon_trans = day_of_week_share.get(0, 0)
        shr_Tue_trans = day_of_week_share.get(1, 0)
        shr_Wed_trans = day_of_week_share.get(2, 0)
        shr_Thu_trans = day_of_week_share.get(3, 0)
        shr_Fri_trans = day_of_week_share.get(4, 0)
        shr_Sat_trans = day_of_week_share.get(5, 0)
        shr_Sun_trans = day_of_week_share.get(6, 0)
        
        results.append({'merchant_code': merchant_code, f'log_avg_transday{suffix}':log_avg_transday, f'log_avgtrans{suffix}':log_avgtrans, 
                        f'cv_trans{suffix}':cv_trans, f'cv_transday{suffix}':cv_transday, f'HH_cust_trans{suffix}':HH_cust_trans, 
                        f'avg_inactivity{suffix}':avg_inactivity, f'days_since_max_inactivity{suffix}':days_since_max_inactivity, 
                        f'days_since_lasttrans{suffix}':days_since_lasttrans, f'max_trans_dt{suffix}':max_trans_dt, 
                        f'max_transcount_dt{suffix}':max_transcount_dt, f'dayspast_freqcust{suffix}':dayspast_freqcust, 
                        f'dayspast_largcust{suffix}':dayspast_largcust, f'log_custcount{suffix}':log_custcount, 
                        f'shr_Mon_trans{suffix}':shr_Mon_trans, f'shr_Tue_trans{suffix}':shr_Tue_trans, 
                        f'shr_Wed_trans{suffix}':shr_Wed_trans, f'shr_Thu_trans{suffix}':shr_Thu_trans, 
                        f'shr_Fri_trans{suffix}':shr_Fri_trans, f'shr_Sat_trans{suffix}':shr_Sat_trans,
                        f'shr_Sun_trans{suffix}':shr_Sun_trans, f'avg_daily_sales{suffix}':avg_daily_sales})

    trans_char_df = pd.DataFrame(results)



    return trans_char_df

# 91 day period (long term - lt)

trans_char_df_lt = calculate_trans_char(transactions, sales_period_end=1, sales_period_start=91, suffix='_lt')

# 91 - 64 (t_1) and 1 - 28 (t_2) periods to then calculate change over lt

trans_char_df_t_1 = calculate_trans_char(transactions, sales_period_end=64, sales_period_start=91, suffix='_t_1')

trans_char_df_t_2 = calculate_trans_char(transactions, sales_period_end=1, sales_period_start=28, suffix='_t_2')

# Change t_1, t_2 (da - absolute change, dr - relative change, log...da - absolute change in log approximating relative change) 

trans_char_df_t_1_t_2 = pd.merge(trans_char_df_t_1, trans_char_df_t_2, on='merchant_code', how='left')

def calculate_da(trans_char_df_t_1_t_2, da_variables):
    for var in da_variables:
        trans_char_df_t_1_t_2[f'{var}_da'] = trans_char_df_t_1_t_2[f'{var}_t_1'] - trans_char_df_t_1_t_2[f'{var}_t_2']
    result_da = trans_char_df_t_1_t_2[['merchant_code'] + [f'{var}_da' for var in da_variables]]
    return result_da

def calculate_dr(trans_char_df_t_1_t_2, dr_variables):
    for var in dr_variables:
        t_1 = trans_char_df_t_1_t_2[f'{var}_t_1']
        t_2 = trans_char_df_t_1_t_2[f'{var}_t_2']
        # Element-wise guard: the relative change is set to 0 where the denominator is 0
        trans_char_df_t_1_t_2[f'{var}_dr'] = ((t_1 - t_2) / t_2).where(t_2 != 0, 0)
    # Select the '_dr' columns created above
    result_dr = trans_char_df_t_1_t_2[['merchant_code'] + [f'{var}_dr' for var in dr_variables]]
    return result_dr

da_variables = ['log_avg_transday', 'log_avgtrans', 'log_custcount',
                'shr_Mon_trans', 'shr_Tue_trans', 'shr_Wed_trans',
                'shr_Thu_trans', 'shr_Fri_trans', 'shr_Sat_trans',
                'shr_Sun_trans']

dr_variables = ['cv_trans', 'cv_transday', 'HH_cust_trans',
                'avg_inactivity', 'days_since_max_inactivity',
                'days_since_lasttrans', 'max_trans_dt',
                'max_transcount_dt', 'dayspast_freqcust',
                'dayspast_largcust']

trans_char_da = calculate_da(trans_char_df_t_1_t_2, da_variables)
trans_char_dr = calculate_dr(trans_char_df_t_1_t_2, dr_variables)

trans_char = pd.merge(trans_char_df_lt, trans_char_da, on='merchant_code', how='outer')
trans_char = pd.merge(trans_char, trans_char_dr, on='merchant_code', how='outer')

trans_char.drop('avg_daily_sales_lt', axis=1, inplace=True)
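
As a small worked example of two of the statistics computed inside `calculate_trans_char`, the sketch below evaluates the Herfindahl-Hirschman index of customer shares and the coefficient of variation of transaction sizes on a toy frame. The data and the helper name `herfindahl_index` are illustrative assumptions; `pan` plays the same customer-identifier role as in the notebook.

```python
import pandas as pd

# Hypothetical card transactions for one merchant; 'pan' identifies the customer
tx = pd.DataFrame({
    "pan": ["c1", "c1", "c2", "c3"],
    "amount": [50.0, 50.0, 60.0, 40.0],
})

def herfindahl_index(values: pd.Series) -> float:
    """Sum of squared shares: 1/n for n equal shares, 1.0 for a single customer."""
    total = values.sum()
    if total == 0:
        return float("nan")  # shares are undefined when there is no volume
    shares = values / total
    return float((shares ** 2).sum())

# Total value per customer: c1 = 100, c2 = 60, c3 = 40
customer_totals = tx.groupby("pan")["amount"].sum()
# Shares are 0.5, 0.3 and 0.2, so the index is 0.25 + 0.09 + 0.04 = 0.38
hhi = herfindahl_index(customer_totals)

# Coefficient of variation of transaction sizes (sample standard deviation / mean)
cv = tx["amount"].std() / tx["amount"].mean()

print(f"HHI = {hhi:.2f}, CV = {cv:.3f}")
```

An index close to 1 indicates sales concentrated in a few customers, while a value close to 1/n indicates sales spread evenly over n customers.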


## <a id='toc2_4_'></a> Loan-Sales Ratios  

In [None]:
# Loan-sales ratio. Loan amount divided by average daily sales calculated in the 91-day period before disbursal.

loan_sales_ratio_df = pd.merge(loan_char, trans_char_df_lt[['merchant_code', 'avg_daily_sales_lt']], on='merchant_code', how='left')
loan_sales_ratio_df['lsratio'] = loan_sales_ratio_df['loanamount'] / loan_sales_ratio_df['avg_daily_sales_lt']

# Loan-sales ratio calculated with average daily sales calculated in the 28-day period before disbursal.

loan_sales_ratio_df = pd.merge(loan_sales_ratio_df, trans_char_df_t_2[['merchant_code', 'avg_daily_sales_t_2']], on='merchant_code', how='left')
loan_sales_ratio_df['rtsshort'] = loan_sales_ratio_df['loanamount'] / loan_sales_ratio_df['avg_daily_sales_t_2']

## <a id='toc2_5_'></a> Target Variables  

# <a id='toc3_'></a> Data Exploration  

In [None]:
dataset = pd.DataFrame()
dataset['merchant_code'] = merchants['merchant_code']
dataset = pd.merge(dataset, loan_sales_ratio_df[['merchant_code', 'lsratio', 'rtsshort']], on='merchant_code', how='left')
dataset = pd.merge(dataset, trans_char, on='merchant_code', how='left')
dataset = pd.merge(dataset, loan_char, on='merchant_code', how='left')
dataset = pd.merge(dataset, merch_char, on='merchant_code', how='left')

In [None]:
# Getting whole dataset information:
dataset.info()

In [None]:
# Finding number of unique values in each column
print(dataset.nunique())

## <a id='toc3_1_'></a> Checking/Handling Missing Values  

In [None]:
missing_info = dataset.isnull().sum()
print(missing_info)

## <a id='toc3_2_'></a> Checking/Removing Outliers  

In [None]:
# Getting statistical information of the dataset
dataset.describe()

In [None]:
# Check the correlation of the features with the target feature, i.e. 'default'
corr_matrix = dataset.corr(numeric_only=True)  # skip non-numeric columns such as identifiers
plt.figure(figsize=(25, 25))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
best_features = corr_matrix.index[corr_matrix['default'].abs() > 0.3]

In [None]:
# Plotting the target variable
target = 'default'
class_counts = dataset[target].value_counts()
plt.bar([0,1], class_counts.values,width=0.3)
plt.title('Bar Plot of ' + target)
plt.xlabel(target)
plt.ylabel('Number of samples')
plt.show()

In [None]:
# Over-sample: resample every class (with replacement) up to the size of the majority class

majority_class = class_counts.idxmax()
n_samples = class_counts[majority_class]
over_sampled_dataset = pd.concat(
    [group.sample(n_samples, replace=True) for _, group in dataset.groupby(target)],
    ignore_index=True
)

print('Original dataset shape is', dataset.shape)
print('Over-sampled dataset shape is ', over_sampled_dataset.shape)

In [None]:
class_counts = over_sampled_dataset[target].value_counts()
plt.bar([0,1], class_counts.values,width=0.3)
plt.title('Bar Plot of ' + target)
plt.xlabel(target)
plt.ylabel('Number of samples')
plt.show()

In [None]:
# Distribution
columns=over_sampled_dataset.columns.tolist()
for column in over_sampled_dataset:
  sns.displot(over_sampled_dataset[columns], kde=True, x=column, color="red", edgecolor="purple", linewidth=5, bins=int(180/5))

In [None]:
# Box plot
for column in dataset:
    over_sampled_dataset[[column]].boxplot(boxprops=dict(color='red'))
    plt.title(column)
    plt.show()

In [None]:
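# Splitting X and y (select the target by name; the identifier 'merchant_code' is not a feature)
X = over_sampled_dataset.drop(columns=[target, 'merchant_code'])
y = over_sampled_dataset[target]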

In [None]:
# Capping outliers at the IQR fences (values are clipped to the limits, not dropped)
cols = []

print("Skewness before capping outliers")
for column in cols:
    print(column, X[column].skew())
    Q1 = X[column].quantile(0.25)
    Q3 = X[column].quantile(0.75)
    IQR = Q3 - Q1
    upper_limit = Q3 + IQR * 1.5
    lower_limit = Q1 - IQR * 1.5
    X[column] = X[column].clip(lower=lower_limit, upper=upper_limit)

# Checking skewness of the indicated columns in cols after capping outliers
print("Skewness after capping outliers")
for column in cols:
    print(column, X[column].skew())

In [None]:
# Box Plot after handling outliers
for column in X:
    X[[column]].boxplot(boxprops=dict(color='red'))
    plt.title(column)
    plt.show()

# <a id='toc4_'></a> Models  

In [None]:
# Payment Footprint Basic Model

pf_basic = pd.DataFrame()
pf_basic['merchant_code'] = merchants['merchant_code']

pf_basic = pd.merge(pf_basic, loan_sales_ratio_df[['merchant_code', 'lsratio', 'rtsshort']], on='merchant_code', how='left')
pf_basic = pd.merge(pf_basic, trans_char[['merchant_code', 'log_avgtrans_lt', 'log_avg_transday_lt', 'cv_trans_lt', 'cv_transday_lt', 'log_avgtrans_da', 'log_avg_transday_da', 'cv_trans_dr', 'cv_transday_dr']], on='merchant_code', how='left')

# Payment Footprint Deep Model

pf_deep = pd.DataFrame()
pf_deep['merchant_code'] = merchants['merchant_code']

pf_deep = pd.merge(pf_deep, loan_sales_ratio_df[['merchant_code', 'lsratio', 'rtsshort']], on='merchant_code', how='left')
pf_deep = pd.merge(pf_deep, trans_char, on='merchant_code', how='left')

# Payment Footprint Basic Model + Loan Char

pf_basic_lc = pd.merge(pf_basic, loan_char, on='merchant_code', how='left')

# Payment Footprint Deep Model + Loan Char

pf_deep_lc = pd.merge(pf_deep, loan_char, on='merchant_code', how='left')

# Payment Footprint Basic Model + Merchant Char

pf_basic_mc = pd.merge(pf_basic, merch_char, on='merchant_code', how='left')

# Payment Footprint Deep Model + Merchant Char

pf_deep_mc = pd.merge(pf_deep, merch_char, on='merchant_code', how='left')

# Payment Footprint Basic Model + Loan Char + Merchant Char

pf_basic_lc_mc = pd.merge(pf_basic_lc, merch_char, on='merchant_code', how='left')

# Payment Footprint Deep Model + Loan Char + Merchant Char

pf_deep_lc_mc = pd.merge(pf_deep_lc, merch_char, on='merchant_code', how='left')

models = [
    ('pf_basic', pf_basic),
    ('pf_deep', pf_deep),
    ('pf_basic_lc', pf_basic_lc),
    ('pf_deep_lc', pf_deep_lc),
    ('pf_basic_mc', pf_basic_mc),
    ('pf_deep_mc', pf_deep_mc),
    ('pf_basic_lc_mc', pf_basic_lc_mc),
    ('pf_deep_lc_mc', pf_deep_lc_mc)
]

## <a id='toc4_1_'></a> Splitting Dataset into Training and Testing  

In [None]:
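# Fixed seed for reproducibility; stratify keeps the class ratio equal in both splits
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)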

## <a id='toc4_2_'></a> Model Selection and Training  

### <a id='toc4_2_1_'></a> XGBoost  

In [None]:
dtrain = xgb.DMatrix(xtrain, label=ytrain) #Converting data into DMatrix format, which is the input format required by XGBoost
dtest = xgb.DMatrix(xtest, label=ytest)
params = { # Specifying parameters
    "objective": "binary:logistic",
    "max_depth": 3,
    "eta": 0.1,
    "gamma": 0.1,
    "min_child_weight": 1,
    "subsample": 0.5,
    "colsample_bytree": 0.5,
    "verbosity": 0
}
num_of_rounds = 100
model = xgb.train(params, dtrain, num_of_rounds) # Training
y_pred = model.predict(dtest) # Predicted probabilities on the test set
y_train_pred = model.predict(dtrain) # Predicted probabilities on the training set
train_acc = accuracy_score(ytrain, y_train_pred.round()) # Training accuracy (probabilities rounded at 0.5)
acc = accuracy_score(ytest, y_pred.round()) # Test accuracy
print(f"Training Accuracy: {train_acc}")
print(f"Accuracy: {acc}")
print(classification_report(ytest, y_pred.round()))

### <a id='toc4_2_2_'></a> Decision Tree and Random Forest  

In [None]:
dt = DecisionTreeClassifier()
rf = RandomForestClassifier(random_state=0)


model_list = [rf,dt]
test_acc = []
train_acc=[]
for i in model_list:
    i_model = i.fit(xtrain,ytrain)
    ypred_train = i_model.predict(xtrain)
    ypred_test = i_model.predict(xtest)
    print(i)
    print(classification_report(ytest, ypred_test))
    print(f1_score(ytest, ypred_test, average='macro'))
    print(f1_score(ytest, ypred_test, average='micro'))
    train_acc.append(accuracy_score(ytrain,ypred_train))
    test_acc.append(accuracy_score(ytest,ypred_test))
print("Training Accuracy: ",train_acc)
print("Testing Accuracy: ",test_acc)

## <a id='toc4_3_'></a> Hypertuning  

In [None]:
# Define hyperparameters and their possible values
hyperparameters = {
    'max_depth': [None,36,40,44],
    'min_samples_split': [2, 5, 10],
    'max_features': [None, 'sqrt', 'log2'],
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random']
}
Decision_tree = DecisionTreeClassifier()

# Grid Search for Decision Trees:
grid_search_dt = GridSearchCV(Decision_tree, hyperparameters, cv=4)
grid_search_dt.fit(xtrain, ytrain)

# Best parameters and score
print('Best Parameters:', grid_search_dt.best_params_)
# Validation score
print('Best validation Score:', grid_search_dt.best_score_)

# <a id='toc4_4_'></a> Notes:  

To-Do:
- Deployment
    - filename = 'DeploymentModel.joblib'
    - joblib.dump({Final Model}, filename)
- Target var
- Run a for loop for each pf_... dataset
- Edit pf_... to include balanced data
- Consider K-Fold Cross-Validation or multiple splits with different random seeds
- Consider different models (neural network, logit regression)
- Early warning exercise
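
For the K-Fold item in the list above, a minimal sketch with scikit-learn's `StratifiedKFold` and `cross_val_score` could look as follows. It runs on synthetic data from `make_classification`, since the notebook's target variable is not built yet; the names `X_demo`, `y_demo` and `clf` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for one of the pf_... datasets (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Stratified folds keep the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# One macro-F1 score per fold; the spread indicates how stable the estimate is
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1_macro")
print("Fold scores:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```

The same pattern could be wrapped in a loop over the `models` list once each `pf_...` frame has a target column, which would also cover the "for loop for each pf_... dataset" item.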

Target Vars:

Default - binary variable that takes the value 1 if the loan had a shortfall > 5% of the repayment amount and was either written off or is still pending

Late - binary variable that takes the value 1 if a loan was non-defaulting and took at least 30 days longer than the implied tenure to be fully paid off

Non-performing - binary variable that takes the value 1 when the loan is categorized as either Default or Late

N.B. Implied tenure is the number of days it would take to repay the loan (loan amount + interest) given the 10% deduction rate, assuming the merchant continued to transact after disbursal at their pre-disbursal long-term average. The pre-disbursal long-term average is the average daily transaction value calculated over the 91 days between day 120 and day 30 before loan disbursal.
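
The definitions above can be turned into columns along the following lines. This is a sketch under stated assumptions: the loan-level table, the column names (`repayment_amount`, `amount_repaid`, `status`, `days_to_repay`) and the status labels are all hypothetical, and only the 5% shortfall threshold, the 30-day margin and the 10% deduction rate come from the notes.

```python
import numpy as np
import pandas as pd

DEDUCTION_RATE = 0.10        # share of daily sales deducted towards repayment (from the notes)
SHORTFALL_THRESHOLD = 0.05   # default if more than 5% of the repayment amount is missing
LATE_MARGIN_DAYS = 30        # late if repayment took at least 30 days beyond the implied tenure

# Hypothetical loan-level table; column names and status labels are assumptions
loans = pd.DataFrame({
    "merchant_code": ["A", "B", "C"],
    "repayment_amount": [1100.0, 2200.0, 550.0],   # loan amount + interest
    "amount_repaid": [1100.0, 1500.0, 550.0],
    "status": ["closed", "written_off", "closed"],
    "days_to_repay": [50, np.nan, 120],
    "avg_daily_sales_lt": [200.0, 400.0, 100.0],   # pre-disbursal long-term average
})

# Implied tenure: days needed to repay if sales continued at the long-term average
loans["implied_tenure"] = loans["repayment_amount"] / (DEDUCTION_RATE * loans["avg_daily_sales_lt"])

shortfall_share = (loans["repayment_amount"] - loans["amount_repaid"]) / loans["repayment_amount"]
unresolved = loans["status"].isin(["written_off", "pending"])

loans["default"] = ((shortfall_share > SHORTFALL_THRESHOLD) & unresolved).astype(int)
loans["late"] = (
    (loans["default"] == 0)
    & (loans["days_to_repay"] >= loans["implied_tenure"] + LATE_MARGIN_DAYS)
).astype(int)
loans["non_performing"] = ((loans["default"] == 1) | (loans["late"] == 1)).astype(int)

print(loans[["merchant_code", "implied_tenure", "default", "late", "non_performing"]])
```

In this toy data every loan has an implied tenure of 55 days, so merchant C, who needed 120 days, is flagged as late, while merchant B is a default because of the written-off shortfall.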

Variables:

merchant_code - Merchant Identifier

relationage - Age of relationship with the payment company at the time of disbursal (in months)

mdisburse - Month of loan disbursal

loanamount - Principal amount

lsratio - Loan-sales ratio. Loan amount divided by average daily sales calculated in the 91-day period before disbursal.

rtsshort - Loan-sales ratio calculated with average daily sales calculated in the 28-day period before disbursal.

log_avg_transday - Average of total daily transaction value (in log).

log_avgtrans - Average transaction sizes (in log).

log_custcount - Number of distinct customers within period (in log).

cv_trans - Coefficient of variation of transaction sizes.

cv_transday - Coefficient of variation of total daily transaction values.

HH_cust_trans - Herfindahl-Hirschman index of customers' total transaction value.

avg_inactivity - Average of periods without any transactions (# of inactive days/total number of transaction days)

days_since_max_inactivity - Days since the period of longest inactivity

days_since_lasttrans - Days since last transaction

max_trans_dt - (Relative to disbursal) Day with the largest transaction value.

max_transcount_dt - (Relative to disbursal) Day with the highest number of transactions.

dayspast_freqcust - Days since last transaction of most frequent customer at the time of disbursal.

dayspast_largcust -  Days since last transaction of largest customer at the time of disbursal.

shr_Mon_trans - Share of total transaction value conducted on Mondays

shr_Tue_trans - Share of total transaction value conducted on Tuesdays

shr_Wed_trans - Share of total transaction value conducted on Wednesdays

shr_Thu_trans - Share of total transaction value conducted on Thursdays

shr_Fri_trans - Share of total transaction value conducted on Fridays

shr_Sat_trans - Share of total transaction value conducted on Saturdays

shr_Sun_trans - Share of total transaction value conducted on Sundays

N.B.
Variables ending with the suffix "_lt" in the code (written "_lh" in the list under "Variables to add") are calculated in the 91-day window before disbursal.
Variables ending with "_dr" and "_da" represent relative and absolute changes, respectively, between the values of the payment variables calculated in the 28-day period before disbursal and the values calculated in the 63-day to 91-day period before disbursal.
Absolute changes in log variables approximate the relative change in the variable.
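
A quick numeric check of the last statement, using made-up sales figures: for small changes the difference in logs is close to the relative change, and the approximation degrades as the change grows. (The notebook uses `np.log1p`; for values far above 1 the difference from `np.log` is negligible.)

```python
import numpy as np

# Hypothetical average daily sales in an earlier and a later window
old, new = 1000.0, 1050.0
relative_change = (new - old) / old          # exactly 5%
log_change = np.log(new) - np.log(old)       # about 4.9%, close to the relative change

# For a large change the two measures diverge
big_old, big_new = 1000.0, 2000.0
big_relative_change = (big_new - big_old) / big_old   # 100%
big_log_change = np.log(big_new) - np.log(big_old)    # ln(2), about 69%

print(f"small change: relative={relative_change:.4f}, log diff={log_change:.4f}")
print(f"large change: relative={big_relative_change:.4f}, log diff={big_log_change:.4f}")
```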

Variables to add:

shr_AMEX_trans_lh 
shr_MAESTRO_trans_lh 
shr_MASTERCARDVISA_trans_lh 
shr_OTHER_trans_lh 
shr_fallbacktrans_lh 
lastfallback_days_lh 
lastfallback_amt_relavg_lh 
shr_refundtrans_lh
++ shr card types, shr switch name?, demographic info (location, age, gender, etc?)
