# I. Project Team Members

| Prepared by | Email | Prepared for |
| :-: | :-: | :-: |
| **Hardefa Rogonondo** | hardefarogonondo@gmail.com | **IBRD Credit Scorecard Predictive Engine** |

# II. Notebook Target Definition

This notebook documents the Feature Engineering phase of the IBRD Credit Scorecard Predictive Engine project. Starting from the cleaned and preprocessed loan data, we derive new features intended to improve the model's predictive power, using techniques such as binning, polynomial feature creation, and interaction term generation. The output of this step is the dataset used in the next phase, model building and validation.

# III. Notebook Setup

## III.A. Import Libraries

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## III.B. Import Data

In [2]:
X_train = pd.read_pickle('../../data/processed/X_train.pkl')
X_test = pd.read_pickle('../../data/processed/X_test.pkl')
y_train = pd.read_pickle('../../data/processed/y_train.pkl')
y_test = pd.read_pickle('../../data/processed/y_test.pkl')

In [3]:
X_train.head()

Unnamed: 0,End of Period,Loan Number,Region,Country Code,Country,Borrower,Guarantor Country Code,Guarantor,Loan Type,Loan Status,Interest Rate,Project ID,Project Name,Original Principal Amount,First Repayment Date,Last Repayment Date,Agreement Signing Date,Board Approval Date,Effective Date (Most Recent),Closed Date (Most Recent),Last Disbursement Date
2153,2023-04-30,IBRD18870,SOUTH ASIA,IN,India,CONTROLLER OF AID ACCOUNTS & AUDIT,IN,India,NPL,Fully Repaid,8.25,P009768,FARRAKA THERMAL,25000000.0,1985-10-01,2000-04-01,1980-07-11,1980-06-26,1980-12-10,1989-06-30,1990-02-01
5370,2023-04-30,IBRD39120,LATIN AMERICA AND CARIBBEAN,MX,Mexico,"NACIONAL FINANCIERA, S.N.C. NAFIN",MX,Mexico,SCL,Fully Repaid,0.0,P040462,ESSENTIAL SOCIAL SER,500000000.0,1999-04-15,2010-10-15,1995-06-23,1995-06-22,1995-07-06,1998-06-30,1998-07-02
6537,2023-04-30,IBRD70690,MIDDLE EAST AND NORTH AFRICA,DZ,Algeria,MINISTERE DES FINANCES,DZ,Algeria,FSL,Fully Repaid,0.0,P054217,DZ-FINANCIAL SYSTEM INFRASTR. MODERN.,16500000.0,2007-04-15,2017-10-15,2002-01-04,2001-07-26,2002-09-03,2006-06-30,2006-10-11
1014,2023-04-30,IBRD09570,EUROPE AND CENTRAL ASIA,TR,Turkiye,Ministry of Treasury and Finance,TR,Turkiye,NPL,Fully Repaid,7.25,P008906,ANTALYA FORESTRY&PAPER M,40000000.0,1979-05-01,1989-11-01,1974-01-28,1974-01-15,1976-05-26,1982-06-30,NaT
4831,2023-04-30,IBRD36330,LATIN AMERICA AND CARIBBEAN,BR,Brazil,Ministério da Fazenda,BR,Brazil,CPL,Fully Repaid,7.54,P006547,RIO DE JANEIRO METROPOLITAN TRANSPORT,81023670.0,1999-03-01,2008-09-01,1993-10-14,1993-06-29,1994-03-14,2000-12-31,2000-12-29


In [4]:
X_test.head()

Unnamed: 0,End of Period,Loan Number,Region,Country Code,Country,Borrower,Guarantor Country Code,Guarantor,Loan Type,Loan Status,Interest Rate,Project ID,Project Name,Original Principal Amount,First Repayment Date,Last Repayment Date,Agreement Signing Date,Board Approval Date,Effective Date (Most Recent),Closed Date (Most Recent),Last Disbursement Date
7270,2023-04-30,IBRD78320,LATIN AMERICA AND CARIBBEAN,PE,Peru,Ministerio De Economia Y Finanzas,PE,Peru,FSL,Terminated,0.0,P116929,PE Safe and Sustainable Transport,150000000.0,2018-07-15,2031-01-15,NaT,2010-01-14,NaT,2014-07-01,NaT
2085,2023-04-30,IBRD18320,LATIN AMERICA AND CARIBBEAN,CL,Chile,MINISTERIO DE OBRAS PUBLICAS,CL,Chile,NPL,Fully Repaid,8.25,P006602,W/S PROJECT,38000000.0,1984-01-01,1995-07-01,1980-08-15,1980-04-17,1980-11-07,1987-06-30,1988-01-08
6585,2023-04-30,IBRD71170,MIDDLE EAST AND NORTH AFRICA,LB,Lebanon,MINISTRY OF FINANCE,LB,Lebanon,FSL,Fully Repaid,0.0,P074042,LB - Ba'albeck Water and Wastewater,43530000.0,2009-11-15,2015-05-15,2002-09-26,2002-06-04,2003-07-31,2012-06-15,2013-05-30
6515,2023-04-30,IBRD70470,MIDDLE EAST AND NORTH AFRICA,DZ,Algeria,MINISTERE DES FINANCES,DZ,Algeria,FSL,Fully Repaid,0.0,P064921,DZ-Budget System Modernization,23700000.0,2013-04-15,2016-10-15,2001-04-18,2001-02-06,2001-07-17,2009-02-28,2010-10-15
14,2023-04-30,IBRD00112,LATIN AMERICA AND CARIBBEAN,BR,Brazil,Ministério da Fazenda,BR,Brazil,NPL,Fully Repaid,4.25,P006214,POWER AND TELEPHONE,15000000.0,1955-07-01,1976-01-01,1951-01-18,1951-01-16,1951-04-10,1954-12-31,NaT


In [5]:
y_train.head()

2153    0
5370    0
6537    0
1014    0
4831    0
Name: bad, dtype: int32

In [6]:
y_test.head()

7270    1
2085    0
6585    0
6515    0
14      0
Name: bad, dtype: int32

# IV. Feature Engineering

## IV.A. Data Shape Inspection

In [7]:
X_train.shape, X_test.shape

((4680, 21), (2006, 21))

In [8]:
y_train.shape, y_test.shape

((4680,), (2006,))

## IV.B. Data Information Inspection

In [9]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4680 entries, 2153 to 6611
Data columns (total 21 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   End of Period                 4680 non-null   datetime64[ns]
 1   Loan Number                   4680 non-null   object        
 2   Region                        4680 non-null   object        
 3   Country Code                  4679 non-null   object        
 4   Country                       4680 non-null   object        
 5   Borrower                      4642 non-null   object        
 6   Guarantor Country Code        4490 non-null   object        
 7   Guarantor                     4491 non-null   object        
 8   Loan Type                     4680 non-null   object        
 9   Loan Status                   4680 non-null   object        
 10  Interest Rate                 4631 non-null   float64       
 11  Project ID                    46

In [10]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2006 entries, 7270 to 175
Data columns (total 21 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   End of Period                 2006 non-null   datetime64[ns]
 1   Loan Number                   2006 non-null   object        
 2   Region                        2006 non-null   object        
 3   Country Code                  2005 non-null   object        
 4   Country                       2006 non-null   object        
 5   Borrower                      1987 non-null   object        
 6   Guarantor Country Code        1917 non-null   object        
 7   Guarantor                     1918 non-null   object        
 8   Loan Type                     2006 non-null   object        
 9   Loan Status                   2006 non-null   object        
 10  Interest Rate                 1988 non-null   float64       
 11  Project ID                    200

In [11]:
y_train.info()

<class 'pandas.core.series.Series'>
Index: 4680 entries, 2153 to 6611
Series name: bad
Non-Null Count  Dtype
--------------  -----
4680 non-null   int32
dtypes: int32(1)
memory usage: 54.8 KB


In [12]:
y_test.info()

<class 'pandas.core.series.Series'>
Index: 2006 entries, 7270 to 175
Series name: bad
Non-Null Count  Dtype
--------------  -----
2006 non-null   int32
dtypes: int32(1)
memory usage: 23.5 KB


## IV.C. Unused Feature Removal

In [None]:
def unused_feat_removal(df, feature_to_remove):
    df.drop(columns = feature_to_remove, inplace = True)
    return df

In [None]:
feature_to_remove = ["column_0", "column_1"]

In [None]:
unused_feat_removal(X_train, feature_to_remove)
unused_feat_removal(X_test, feature_to_remove)
X_train.shape, X_test.shape

In [None]:
X_train.head()

In [None]:
X_test.head()

## IV.E. Specific Feature Engineering

## IV.F. Final Feature Inspection

In [None]:
X_train.shape, X_test.shape

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
X_train.info()

In [None]:
X_test.info()

# V. Feature Selection

## V.E. Feature Manual Binning

### V.E.1. Feature Weight of Evidence and Information Value Inspection

In [None]:
def woe_analysis(X, feature, y):
    df = pd.concat([X[feature], y], axis = 1)
    df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
                    df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
    df = df.iloc[:, [0, 1, 3]]
    df.columns = [df.columns[0], "n_observation", "proportion_of_category"]
    df["proportion_of_observation"] = df["n_observation"] / df["n_observation"].sum()
    # The target series is named "bad" (1 = bad, 0 = good), so the per-category mean is the bad rate
    df["n_good"] = (1 - df["proportion_of_category"]) * df["n_observation"]
    df["n_bad"] = df["proportion_of_category"] * df["n_observation"]
    df["proportion_of_good"] = df["n_good"] / df["n_good"].sum()
    df["proportion_of_bad"] = df["n_bad"] / df["n_bad"].sum()
    df["WoE"] = np.log(df["proportion_of_good"] / df["proportion_of_bad"])
    df = df.sort_values(["WoE"]).reset_index(drop = True)
    df["diff_proportion_of_category"] = df["proportion_of_category"].diff().abs()
    df["diff_WoE"] = df["WoE"].diff().abs()
    df["IV"] = (df["proportion_of_good"] - df["proportion_of_bad"]) * df["WoE"]
    df["IV"] = df["IV"].sum()
    return df

def plot_by_woe(woe_df, rotation_of_x_axis_labels = 0):
    x = np.array(woe_df.iloc[:, 0].astype(str))
    y = woe_df["WoE"]
    plt.figure(figsize = (18, 6))
    plt.plot(x, y, marker = 'o', linestyle = '--', color = 'k')
    plt.xlabel(woe_df.columns[0])
    plt.ylabel("Weight of Evidence")
    plt.title("Weight of Evidence by " + woe_df.columns[0])
    plt.xticks(rotation = rotation_of_x_axis_labels)
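
For reference, the quantities computed in `woe_analysis` can be written as follows. The symbols $g_i$ and $b_i$ (the number of good and bad observations in category $i$) are notation introduced here for illustration and do not appear in the code.

```latex
\mathrm{WoE}_i = \ln\!\left( \frac{g_i / \sum_j g_j}{b_i / \sum_j b_j} \right),
\qquad
\mathrm{IV} = \sum_i \left( \frac{g_i}{\sum_j g_j} - \frac{b_i}{\sum_j b_j} \right) \mathrm{WoE}_i
```

A category with no good or no bad observations makes the ratio zero or undefined, so its WoE is infinite or missing. Such categories are typical candidates for merging during coarse classing.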

#### V.E.1.A. _Column_0_

In [None]:
# Categorical Feature
column_0_woe = woe_analysis(X_train, "column_0", y_train)
column_0_woe

In [None]:
plot_by_woe(column_0_woe)
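
Before reading a real WoE table, it can help to check the calculation by hand on a tiny example. The sketch below uses a hypothetical `grade` feature and assumes the target is coded 1 = bad and 0 = good, as in `y_train`. It does not call `woe_analysis`; it recomputes the same quantities directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1 = bad loan, 0 = good loan
toy = pd.DataFrame({
    "grade": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "bad":   [0,   0,   0,   1,   0,   1,   1,   1],
})

# Count observations and bad loans per category
counts = toy.groupby("grade")["bad"].agg(n_observation="count", n_bad="sum")
counts["n_good"] = counts["n_observation"] - counts["n_bad"]

# Share of all good / all bad loans that fall in each category
counts["proportion_of_good"] = counts["n_good"] / counts["n_good"].sum()
counts["proportion_of_bad"] = counts["n_bad"] / counts["n_bad"].sum()

# Weight of Evidence per category and Information Value of the feature
counts["WoE"] = np.log(counts["proportion_of_good"] / counts["proportion_of_bad"])
iv = ((counts["proportion_of_good"] - counts["proportion_of_bad"]) * counts["WoE"]).sum()

print(counts)
print("IV:", iv)
```

Grade A holds 3 of the 4 good loans and 1 of the 4 bad loans, so its WoE is ln(0.75 / 0.25) = ln 3, and grade B gets the opposite sign. A positive WoE therefore marks a category with relatively more good loans.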

#### V.E.1.B. _Column_1_

In [None]:
# Continuous Feature
# Fine Classing or Coarse Classing
# Bin a temporary copy of the column so the raw values stay available for re-binning
column_1_woe = woe_analysis(X_train.assign(column_1 = pd.cut(X_train["column_1"], 10)), "column_1", y_train) # This is an iterative process
column_1_woe

In [None]:
plot_by_woe(column_1_woe)
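
Fine classing can use equal-width bins (`pd.cut`) or equal-frequency bins (`pd.qcut`). The sketch below compares the two on a synthetic, skewed feature; the data and variable names are assumptions made for illustration, not values from this project. It also shows one possible way to reuse the training bin edges on the test set via `retbins=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed feature standing in for column_1
rng = np.random.default_rng(0)
train_values = pd.Series(rng.lognormal(mean=2.0, sigma=0.5, size=1000))
test_values = pd.Series(rng.lognormal(mean=2.0, sigma=0.5, size=200))

# Equal-width bins: on skewed data some bins may hold very few observations
equal_width, width_edges = pd.cut(train_values, bins=10, retbins=True)

# Equal-frequency bins: each bin holds roughly the same number of observations
equal_freq, freq_edges = pd.qcut(train_values, q=10, retbins=True)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())

# Reuse the training edges on the test set;
# test values outside the training range become NaN
test_binned = pd.cut(test_values, bins=freq_edges, include_lowest=True)
```

Sparse bins tend to give unstable WoE estimates, which is one reason to inspect the bin counts before settling on a bin count or merging neighbours.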

## V.F. Feature Manual Binning Weight of Evidence Encoding

In [None]:
def manual_binning_woe_encoding(X, feature, df_woe_analysis):
    X_encoded = X.copy()
    woe_values = df_woe_analysis.set_index(feature)["WoE"]
    X_encoded[feature] = X_encoded[feature].map(woe_values)
    return X_encoded

In [None]:
# Encode both sets with the WoE table fitted on the training data
X_train_woe = manual_binning_woe_encoding(X_train, "column_0", column_0_woe)
X_test_woe = manual_binning_woe_encoding(X_test, "column_0", column_0_woe)
X_train_woe.shape, X_test_woe.shape

In [None]:
X_train_woe.head()

In [None]:
X_test_woe.head()
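
Because the encoding relies on `Series.map`, any category in the test set that is missing from the training WoE table is mapped to `NaN`. The sketch below illustrates this with a hypothetical WoE table and toy test frame. Filling with 0.0 (a neutral WoE) is an assumption chosen for illustration, not a choice made in this notebook.

```python
import pandas as pd

# Hypothetical WoE table in the shape returned by woe_analysis
woe_table = pd.DataFrame({"column_0": ["A", "B"], "WoE": [0.4, -0.7]})

# Toy test data; category "C" never appeared in training
X_test_toy = pd.DataFrame({"column_0": ["A", "B", "C"]})

# Same lookup as manual_binning_woe_encoding
woe_values = woe_table.set_index("column_0")["WoE"]
encoded = X_test_toy["column_0"].map(woe_values)

# Count categories that could not be mapped
n_unmapped = encoded.isna().sum()
print("Unmapped rows:", n_unmapped)

# Assumption: treat unseen categories as neutral evidence (WoE = 0)
encoded_filled = encoded.fillna(0.0)
print(encoded_filled)
```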

## V.G. Feature Manual Binning One-Hot Encoding

### V.G.1. Categorical Feature Dummy Encoding

In [None]:
def dummy_encoding(df, columns_list):
    df_dummies = pd.get_dummies(df[columns_list], prefix = columns_list, prefix_sep = ":")
    df = pd.concat([df, df_dummies], axis = 1)
    return df

In [None]:
X_train = dummy_encoding(X_train, ["column_0", "column_1"])
X_test = dummy_encoding(X_test, ["column_0", "column_1"])
X_train.shape, X_test.shape

In [None]:
X_train.head()

In [None]:
X_test.head()
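
Calling `pd.get_dummies` separately on the training and test sets can produce different columns when a category appears in only one of them. A minimal sketch of one way to align them, using toy frames invented for illustration, is to reindex the test dummies to the training columns:

```python
import pandas as pd

# Toy frames: "C" appears only in train, "D" only in test
train = pd.DataFrame({"column_0": ["A", "B", "C"]})
test = pd.DataFrame({"column_0": ["A", "D"]})

train_dummies = pd.get_dummies(train, columns=["column_0"], prefix_sep=":", dtype=int)
test_dummies = pd.get_dummies(test, columns=["column_0"], prefix_sep=":", dtype=int)

# Keep exactly the training columns: missing ones are filled with 0,
# columns that exist only in the test set are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned)
```

With this approach the row containing the unseen category "D" ends up with all zeros, so it is worth checking how often that happens on real data.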

### V.G.2. Overall Feature One-Hot Encoding

In [None]:
def manual_binning_ohe_encoding(X):
    # Start from an empty frame so that only the bin columns below are kept
    X_encoded = pd.DataFrame(index = X.index)
    # Categorical Features (dummy columns created by dummy_encoding)
    X_encoded["feature_0:bin_0"] = X.loc[:, "feature_0:bin_0"]
    X_encoded["feature_0:bin_1"] = X.loc[:, "feature_0:bin_1"]
    # Numerical Features
    X_encoded["feature_1:36"] = np.where((X["feature_1"] == 36), 1, 0) # Change this according to your data
    X_encoded["feature_1:60"] = np.where((X["feature_1"] == 60), 1, 0) # Change this according to your data
    # Continuous Features
    X_encoded["feature_2:<7.071"] = np.where((X["feature_2"] <= 7.071), 1, 0) # Change this according to your data
    X_encoded["feature_2:7.071-10.374"] = np.where((X["feature_2"] > 7.071) & (X["feature_2"] <= 10.374), 1, 0) # Change this according to your data
    return X_encoded

In [None]:
X_train_ohe = manual_binning_ohe_encoding(X_train)
X_test_ohe = manual_binning_ohe_encoding(X_test)
X_train_ohe.shape, X_test_ohe.shape

In [None]:
X_train_ohe.head()

In [None]:
X_test_ohe.head()
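
The `np.where` conditions for continuous features can also be expressed with explicit bin edges, which makes it easier to confirm that every row falls into exactly one bin. The sketch below is one possible alternative, not the method used in this notebook. The edges reuse the placeholder values above, while the toy data and the third label (`>10.374`) are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Toy values, including two that sit exactly on a bin edge
X_toy = pd.DataFrame({"feature_2": [5.0, 7.071, 8.2, 10.374, 12.0]})

# Right-closed intervals: (-inf, 7.071], (7.071, 10.374], (10.374, inf)
bins = [-np.inf, 7.071, 10.374, np.inf]
labels = ["feature_2:<=7.071", "feature_2:7.071-10.374", "feature_2:>10.374"]

binned = pd.cut(X_toy["feature_2"], bins=bins, labels=labels)
ohe = pd.get_dummies(binned, dtype=int)

# Every row should belong to exactly one bin
print(ohe)
print((ohe.sum(axis=1) == 1).all())
```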

## V.H. Export Data

In [None]:
X_train_woe.to_pickle('../../data/processed/X_train_woe.pkl')
X_test_woe.to_pickle('../../data/processed/X_test_woe.pkl')

X_train_ohe.to_pickle('../../data/processed/X_train_ohe.pkl')
X_test_ohe.to_pickle('../../data/processed/X_test_ohe.pkl')