# I. Project Team Members

| Prepared by | Email | Prepared for |
| :-: | :-: | :-: |
| **Hardefa Rogonondo** | hardefarogonondo@gmail.com | **IBRD Credit Scorecard Predictive Engine** |

# II. Notebook Target Definition

This notebook covers the exploratory data analysis (EDA) and preprocessing stages of the IBRD Loan Credit Scorecard Predictive Engine project. Starting from the cleaned IBRD loan dataset, we run EDA to surface patterns, anomalies, and relationships in the data, supported by visualizations. Along the way, we handle missing values to preserve the integrity of the data. In the preprocessing stage, we normalize the data where necessary so that it is in a suitable form for the downstream machine learning models. The output of this notebook is a preprocessed, ready-to-use dataset for the next step in the pipeline: feature engineering.

# III. Notebook Setup

## III.A. Import Libraries

In [1]:
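import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import ADASYN, KMeansSMOTE, RandomOverSampler, SMOTE, SMOTEN, SVMSMOTE
from imblearn.under_sampling import RandomUnderSampler
from scipy.stats import chi2_contingency, ttest_ind
from sklearn.feature_selection import f_classif
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)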

## III.B. Import Data

In [2]:
df = pd.read_pickle('../../data/processed/df.pkl')
df.head()

Unnamed: 0,End of Period,Loan Number,Region,Country Code,Country,Borrower,Guarantor Country Code,Guarantor,Loan Type,Loan Status,Interest Rate,Currency of Commitment,Project ID,Project Name,Original Principal Amount,Cancelled Amount,Undisbursed Amount,Disbursed Amount,Repaid to IBRD,Due to IBRD,Exchange Adjustment,Borrower's Obligation,Sold 3rd Party,Repaid 3rd Party,Due 3rd Party,Loans Held,First Repayment Date,Last Repayment Date,Agreement Signing Date,Board Approval Date,Effective Date (Most Recent),Closed Date (Most Recent),Last Disbursement Date
0,2023-04-30,IBRD00010,EUROPE AND CENTRAL ASIA,FR,France,CREDIT NATIONAL,FR,France,NPL,Fully Repaid,4.25,,P037383,RECONSTRUCTION,250000000.0,0.0,0.0,250000000.0,38000.0,0.0,0.0,0.0,249962000.0,249962000.0,0,0.0,1952-11-01,1977-05-01,1947-05-09,1947-05-09,1947-06-09,1947-12-31,NaT
1,2023-04-30,IBRD00020,EUROPE AND CENTRAL ASIA,NL,Netherlands,,,,NPL,Fully Repaid,4.25,,P037452,RECONSTRUCTION,191044200.0,0.0,0.0,191044200.0,103372200.0,0.0,0.0,0.0,87672000.0,87672000.0,0,0.0,1952-04-01,1972-10-01,1947-08-07,1947-08-07,1947-09-11,1948-03-31,NaT
2,2023-04-30,IBRD00021,EUROPE AND CENTRAL ASIA,NL,Netherlands,,,,NPL,Fully Repaid,4.25,,P037452,RECONSTRUCTION,3955788.0,0.0,0.0,3955788.0,0.0,0.0,0.0,0.0,3955788.0,3955788.0,0,0.0,1953-04-01,1954-04-01,1948-05-25,1947-08-07,1948-06-01,1948-06-30,NaT
3,2023-04-30,IBRD00030,EUROPE AND CENTRAL ASIA,DK,Denmark,,,,NPL,Fully Repaid,4.25,,P037362,RECONSTRUCTION,40000000.0,0.0,0.0,40000000.0,17771000.0,0.0,0.0,0.0,22229000.0,22229000.0,0,0.0,1953-02-01,1972-08-01,1947-08-22,1947-08-22,1947-10-17,1949-03-31,NaT
4,2023-04-30,IBRD00040,EUROPE AND CENTRAL ASIA,LU,Luxembourg,,,,NPL,Fully Repaid,4.25,,P037451,RECONSTRUCTION,12000000.0,238016.98,0.0,11761980.0,1619983.0,0.0,0.0,0.0,10142000.0,10142000.0,0,0.0,1949-07-15,1972-07-15,1947-08-28,1947-08-28,1947-10-24,1949-03-31,NaT


# IV. Exploratory Data Analysis

## IV.A. Data Shape Inspection

In [None]:
df.shape

## IV.B. Data Information Inspection

In [None]:
df.info()

## IV.C. Missing Values Inspection

In [None]:
df_missing = pd.DataFrame(df.isnull().sum().sort_values() / len(df) * 100).reset_index()
df_missing.columns = ["variables", "missing_percentage"]
df_missing

In [None]:
sns.barplot(data = df_missing,
            x = "variables",
            y = "missing_percentage",
            palette = 'Blues')
plt.title("Dataset Null Values Proportion")
plt.xticks(rotation = 'vertical')
plt.show()

### IV.C.1. Missing Values Handling

In [None]:
df.drop(columns = ["Currency of Commitment"], inplace = True)
df.shape

In [None]:
df.head()
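
The column above is dropped by name. As a hedged alternative, the sketch below drops every column whose missing percentage exceeds a threshold. The helper name `drop_sparse_columns`, the 50% threshold, and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_missing_pct=50.0):
    """Return a copy of `frame` without columns whose missing share exceeds the threshold."""
    missing_pct = frame.isnull().mean() * 100  # percentage of nulls per column
    keep = missing_pct[missing_pct <= max_missing_pct].index
    return frame.loc[:, keep].copy()


# Toy data (hypothetical): "currency" is 75% missing, "rate" is 25% missing
toy = pd.DataFrame({
    "loan": ["A", "B", "C", "D"],
    "rate": [4.25, np.nan, 3.0, 2.5],
    "currency": [np.nan, np.nan, np.nan, "USD"],
})
toy_clean = drop_sparse_columns(toy, max_missing_pct=50.0)
print(list(toy_clean.columns))
```

Returning a copy instead of using `inplace = True` keeps the original frame available for comparison, at the cost of extra memory.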

## IV.D. Duplicated Values Inspection

In [None]:
df_duplicated = df[df.duplicated(keep = False)]
df_duplicated.shape

In [None]:
df_duplicated

## IV.E. Preliminary Data Analysis

### IV.E.1. Loan 

In [3]:
signing_date_max = df["Agreement Signing Date"].max()
signing_date_min = df["Agreement Signing Date"].min()

print(f"Max Agreement Signing Date: {signing_date_max}")
print(f"Min Agreement Signing Date: {signing_date_min}")

Max Agreement Signing Date: 2023-04-26 00:00:00
Min Agreement Signing Date: 1947-05-09 00:00:00


In [4]:
board_approval_max = df["Board Approval Date"].max()
board_approval_min = df["Board Approval Date"].min()

print(f"Max Board Approval Date: {board_approval_max}")
print(f"Min Board Approval Date: {board_approval_min}")

Max Board Approval Date: 2023-04-28 00:00:00
Min Board Approval Date: 1947-05-09 00:00:00


In [5]:
effective_date_max = df["Effective Date (Most Recent)"].max()
effective_date_min = df["Effective Date (Most Recent)"].min()

print(f"Max Effective Date (Most Recent): {effective_date_max}")
print(f"Min Effective Date (Most Recent): {effective_date_min}")

Max Effective Date (Most Recent): 2023-12-31 00:00:00
Min Effective Date (Most Recent): 1947-06-09 00:00:00


## IV.F. Data Visualization

### IV.E.1. Region Distribution

In [None]:
plt.title("Region Distribution")
region_distribution = sns.countplot(data = df,
                                    x = "Region",
                                    palette = "Set1")
region_distribution.bar_label(region_distribution.containers[0])
region_distribution.set_xticklabels(region_distribution.get_xticklabels(), rotation = 45, ha = 'right')
plt.show()

### IV.E.2. Loan Type Distribution

In [None]:
plt.title("Loan Type Distribution")
loan_type_distribution = sns.countplot(data = df,
                                       x = "Loan Type",
                                       palette = "Set1")
loan_type_distribution.bar_label(loan_type_distribution.containers[0])
plt.show()

### IV.E.3. Loan Status Distribution

In [None]:
plt.title("Loan Status Distribution")
loan_status_distribution = sns.countplot(data = df,
                                         x = "Loan Status",
                                         palette = "Set1")
loan_status_distribution.bar_label(loan_status_distribution.containers[0])
loan_status_distribution.set_xticklabels(loan_status_distribution.get_xticklabels(), rotation = 45, ha = 'right')
plt.show()

### IV.E.4. Interest Rate Distribution

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(data=df, x='Interest Rate', kde=True)  # kde=True will also plot a density line
plt.title('Distribution of Interest Rates')
plt.show()

In [None]:
import plotly.express as px

# Count the number of records for each 'Country'
df_grouped = df['Country'].value_counts().reset_index()
df_grouped.columns = ['Country', 'Count']  # renaming the columns appropriately

fig = px.choropleth(df_grouped, locations='Country',
                    locationmode='country names',  # to tell function that 'Country' column contains country names
                    color='Count',  # column defining color intensity
                    hover_name='Country',  # hover text
                    color_continuous_scale=px.colors.sequential.Plasma,  # color scale
                    title='Number of Records by Country')
fig.show()


In [None]:
# Same per-country record counts as the previous map; the country column is
# labelled 'Borrower' here because df['Borrower'] holds entity names, not countries
df_grouped = df['Country'].value_counts().reset_index()
df_grouped.columns = ['Borrower', 'Count']  # renaming the columns appropriately

fig = px.choropleth(df_grouped, locations='Borrower',
                    locationmode='country names',  # to tell function that 'Borrower' column contains country names
                    color='Count',  # column defining color intensity
                    hover_name='Borrower',  # hover text
                    color_continuous_scale=px.colors.sequential.Plasma,  # color scale
                    title='Number of Records by Country')
fig.show()


In [None]:
# Count the number of records for each 'Guarantor' (this column holds country names)
df_grouped = df['Guarantor'].value_counts().reset_index()
df_grouped.columns = ['Guarantor', 'Count']  # renaming the columns appropriately

fig = px.choropleth(df_grouped, locations='Guarantor',
                    locationmode='country names',  # to tell function that 'Guarantor' column contains country names
                    color='Count',  # column defining color intensity
                    hover_name='Guarantor',  # hover text
                    color_continuous_scale=px.colors.sequential.Plasma,  # color scale
                    title='Number of Records by Guarantor')
fig.show()


### IV.E.5. Target Label Proportion

In [None]:
# Barplot
plt.title("Target Label Proportion")
y_proportion = sns.countplot(data = y,
                             x = y["target_label"],
                             palette = 'Blues')
y_proportion.bar_label(y_proportion.containers[0])
plt.show()

In [None]:
# Pie Chart
plt.title("Target Label Proportion")
target_counts = y["target_label"].value_counts()
plt.pie(x = target_counts,
        labels = target_counts.index,  # class names, not the counts themselves
        colors = sns.color_palette('Set3'),
        autopct = '%1.1f%%')
plt.show()

## IV.F. Statistical Analysis

### IV.F.1. Statistical Description

In [None]:
X.describe()

### IV.F.2. Skewness Analysis

In [None]:
X_skewness = X.skew()
X_skewness = pd.DataFrame({"variables": X_skewness.index, "skewness": X_skewness.values})

In [None]:
plt.title("Skewness Analysis")
plt.bar(X_skewness["variables"], X_skewness["skewness"])
plt.xticks(rotation = 45)
plt.xlabel("Variables")
plt.ylabel("Skewness")
plt.show()

### IV.F.3. Chi-Squared Analysis

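Test whether each categorical feature is independent of the target label. For every feature, we build a contingency table against the target and run a chi-squared test of independence; a small p-value suggests that the feature and the target are associated.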

In [None]:
X_categorical = X.select_dtypes(include = 'object').copy()
X_numerical = X.select_dtypes(include = 'number').copy()
X_categorical.shape, X_numerical.shape

In [None]:
X_categorical.columns

In [None]:
X_numerical.columns

In [None]:
chi2_rows = []

for column in X_categorical.columns:
    # Contingency table between the target label and the categorical feature
    cross_tab = pd.crosstab(y["target_label"], X_categorical[column])
    chi2, p_value, degree_of_freedom, expected_frequencies = chi2_contingency(cross_tab)
    chi2_rows.append({"variables": column, "p-value": round(p_value, 10)})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
chi2_result = pd.DataFrame(chi2_rows, columns = ["variables", "p-value"])

chi2_result.sort_values(by = "p-value", ascending = True, inplace = True, ignore_index = True)
chi2_result
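
To make the test of independence concrete, here is a minimal sketch on a hand-checkable 2x2 table. The toy column names and values are assumptions for illustration only. Every expected count is 2 and every observed count differs from it by 1, so the statistic should come out as 4 × (1² / 2) = 2 with one degree of freedom.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data (hypothetical): a categorical feature vs. a binary target label
toy = pd.DataFrame({
    "loan_type": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "target_label": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Contingency table of observed counts: rows = target, columns = category
observed = pd.crosstab(toy["target_label"], toy["loan_type"])

# correction=False disables Yates' continuity correction for the 2x2 case,
# so the statistic matches the plain sum of (observed - expected)^2 / expected
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}")
```

With counts this small the chi-squared approximation is unreliable; the example only illustrates the mechanics.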

### IV.F.4. T-Statistics Analysis

Assess whether the mean of each numerical feature differs significantly between the two target classes (`target_label` 0 and 1), using an independent two-sample t-test per feature.

In [None]:
X_numerical.fillna(X_numerical.mean(), inplace = True)

In [None]:
t_test_results = []
for variable in X_numerical.columns:
    group_0_values = X_numerical.loc[y["target_label"] == 0, variable]
    group_1_values = X_numerical.loc[y["target_label"] == 1, variable]
    t_statistic, p_value = ttest_ind(group_0_values, group_1_values)
    t_test_results.append({"variables": variable, "t-statistic": t_statistic, "p-value": p_value})

t_test_table = pd.DataFrame(t_test_results)
t_test_table.sort_values(by = "t-statistic", ascending = False, inplace = True, ignore_index = True)
t_test_table
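
By default, `ttest_ind` runs Student's t-test, which assumes both groups share the same variance. When class sizes and spreads differ, as is common with an imbalanced target, Welch's variant (`equal_var=False`) may be the safer choice. The sketch below compares the two on synthetic data; the group sizes, means, and spreads are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups (hypothetical): different sizes and different spreads
rng = np.random.default_rng(777)
group_0 = rng.normal(loc=4.0, scale=1.0, size=200)
group_1 = rng.normal(loc=4.5, scale=2.0, size=50)

# Student's t-test (default) pools the variances of both groups
student = ttest_ind(group_0, group_1)

# Welch's t-test drops the equal-variance assumption
welch = ttest_ind(group_0, group_1, equal_var=False)

print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.4f}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.4f}")
```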

### IV.F.5. ANOVA F Analysis

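Compute the one-way ANOVA F-statistic of each numerical feature against the target classes. ANOVA generalizes the two-sample t-test to two or more groups: a large F-score means the feature's mean varies more between classes than within them.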

In [None]:
X_numerical.fillna(X_numerical.mean(), inplace = True)

In [None]:
f_statistic, p_values = f_classif(X_numerical, y)

anova_f_table = pd.DataFrame({"variables": X_numerical.columns, "f-score": f_statistic, "p-values": p_values.round(decimals = 10)})
anova_f_table.sort_values(by = "f-score", ascending = False, inplace = True, ignore_index = True)
anova_f_table
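
As a hand-checkable illustration of the F-statistic, the sketch below uses `scipy.stats.f_oneway` on three tiny groups; to my understanding `f_classif` computes the same one-way ANOVA F-value, once per feature. The group values are assumptions for illustration. Here the between-group sum of squares is 26 (2 degrees of freedom) and the within-group sum of squares is 6 (6 degrees of freedom), so F = (26 / 2) / (6 / 6) = 13.

```python
from scipy.stats import f_oneway

# Toy groups (hypothetical): one numerical feature split by three class labels
class_a = [1.0, 2.0, 3.0]
class_b = [2.0, 3.0, 4.0]
class_c = [5.0, 6.0, 7.0]

# F = (between-group mean square) / (within-group mean square)
f_statistic, p_value = f_oneway(class_a, class_b, class_c)
print(f"F={f_statistic:.2f}, p-value={p_value:.4f}")
```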

## IV.G. Correlation Matrix

In [None]:
X.corr(numeric_only = True)  # skip object columns, which raise an error in pandas 2.0+

In [None]:
sns.heatmap(data = X.corr(numeric_only = True))

# V. Preprocessing

## V.A. Columns Reorder

In [None]:
custom_order = ["column_0", "column_1", "column_2"]

In [None]:
X = X.reindex(columns = custom_order)
X.shape

In [None]:
X.head()

## V.B. Specific Preprocessing

## V.C. Imbalance Data Preprocessing

### V.C.1. Random Undersampling

In [None]:
rus = RandomUnderSampler(random_state = 777)
X_undersampled, y_undersampled = rus.fit_resample(X, y)
y_undersampled.value_counts()

### V.C.2. Random Oversampling

In [None]:
ros = RandomOverSampler(random_state = 777)
X_oversampled, y_oversampled = ros.fit_resample(X, y)
y_oversampled.value_counts()

### V.C.3. Synthetic Minority Oversampling Technique (SMOTE)

In [None]:
smote = SMOTE(random_state = 777)
X_smote, y_smote = smote.fit_resample(X, y)
y_smote.value_counts()

### V.C.4. Synthetic Minority Oversampling Technique for Nominal (SMOTEN)

In [None]:
smoten = SMOTEN(random_state = 777)
X_smoten, y_smoten = smoten.fit_resample(X, y)
y_smoten.value_counts()

### V.C.5. Adaptive Synthetic Sampling (ADASYN)

In [None]:
adasyn = ADASYN(random_state = 777)
X_adasyn, y_adasyn = adasyn.fit_resample(X, y)
y_adasyn.value_counts()

### V.C.6. KMeans Clustering + Synthetic Minority Oversampling Technique (SMOTE)

In [None]:
kmeanssmote = KMeansSMOTE(random_state = 777)
X_kmeanssmote, y_kmeanssmote = kmeanssmote.fit_resample(X, y)
y_kmeanssmote.value_counts()

### V.C.7. Support Vector Machine (SVM) + Synthetic Minority Oversampling Technique (SMOTE)

In [None]:
svmsmote = SVMSMOTE(random_state = 777)
X_svmsmote, y_svmsmote = svmsmote.fit_resample(X, y)
y_svmsmote.value_counts()

### V.C.8. Synthetic Minority Oversampling Technique (SMOTE) + Edited Nearest Neighbour (ENN)

In [None]:
smoteenn = SMOTEENN(random_state = 777)
X_smoteenn, y_smoteenn = smoteenn.fit_resample(X, y)
y_smoteenn.value_counts()

### V.C.9. Synthetic Minority Oversampling Technique (SMOTE) + Tomek Links

In [None]:
smotetomek = SMOTETomek(random_state = 777)
X_smotetomek, y_smotetomek = smotetomek.fit_resample(X, y)
y_smotetomek.value_counts()

## V.D. Data Splitting

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 777)

In [None]:
X_train.shape, X_test.shape

In [None]:
y_train.shape, y_test.shape
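
Because the target is treated as imbalanced in this notebook, a plain random split can leave the test set with too few minority samples. A hedged option is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both splits. The toy frame and the 90/10 class ratio below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data (hypothetical): 90 negatives, 10 positives
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 90 + [1] * 10, name="target_label")

# stratify=y_toy preserves the 90/10 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=777, stratify=y_toy
)
print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```

If one of the resampled datasets is used for modelling, it is generally safer to fit the resampler on the training split only, so that synthetic or duplicated rows do not leak into the test set.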

## V.E. Export Data

In [None]:
X_train.to_pickle('../../data/processed/X_train.pkl')
X_test.to_pickle('../../data/processed/X_test.pkl')
y_train.to_pickle('../../data/processed/y_train.pkl')
y_test.to_pickle('../../data/processed/y_test.pkl')