# I. Project Team Members

| Prepared by | Email | Prepared for |
| :-: | :-: | :-: |
| **Hardefa Rogonondo** | hardefarogonondo@gmail.com | **IBRD Credit Scorecard Predictive Engine** |

# II. Notebook Target Definition

This notebook outlines the data preparation process for the IBRD Loan Credit Scorecard Predictive Engine project. We start by loading the IBRD loan dataset from a CSV file. The dataset covers loan-related attributes, including loan status and type, borrower information, and repayment details. In this notebook, we clean and format the data and run quality checks so that it is suitable for the subsequent predictive analysis. The output is a clean, well-structured dataset that serves as the foundation for credit score prediction.

# III. Notebook Setup

## III.A. Import Libraries

In [1]:
import numpy as np
import pandas as pd
import pickle

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## III.B. Import Data

In [2]:
df = pd.read_csv('../../data/raw/IBRD_Statement_of_Loans_-_Latest_Available_Snapshot.csv')
df.head()

Unnamed: 0,End of Period,Loan Number,Region,Country Code,Country,Borrower,Guarantor Country Code,Guarantor,Loan Type,Loan Status,Interest Rate,Currency of Commitment,Project ID,Project Name,Original Principal Amount,Cancelled Amount,Undisbursed Amount,Disbursed Amount,Repaid to IBRD,Due to IBRD,Exchange Adjustment,Borrower's Obligation,Sold 3rd Party,Repaid 3rd Party,Due 3rd Party,Loans Held,First Repayment Date,Last Repayment Date,Agreement Signing Date,Board Approval Date,Effective Date (Most Recent),Closed Date (Most Recent),Last Disbursement Date
0,04/30/2023 12:00:00 AM,IBRD00010,EUROPE AND CENTRAL ASIA,FR,France,CREDIT NATIONAL,FR,France,NPL,Fully Repaid,4.25,,P037383,RECONSTRUCTION,250000000.0,0.0,0.0,250000000.0,38000.0,0.0,0.0,0.0,249962000.0,249962000.0,0,0.0,11/01/1952 12:00:00 AM,05/01/1977 12:00:00 AM,05/09/1947 12:00:00 AM,05/09/1947 12:00:00 AM,06/09/1947 12:00:00 AM,12/31/1947 12:00:00 AM,
1,04/30/2023 12:00:00 AM,IBRD00020,EUROPE AND CENTRAL ASIA,NL,Netherlands,,,,NPL,Fully Repaid,4.25,,P037452,RECONSTRUCTION,191044200.0,0.0,0.0,191044200.0,103372200.0,0.0,0.0,0.0,87672000.0,87672000.0,0,0.0,04/01/1952 12:00:00 AM,10/01/1972 12:00:00 AM,08/07/1947 12:00:00 AM,08/07/1947 12:00:00 AM,09/11/1947 12:00:00 AM,03/31/1948 12:00:00 AM,
2,04/30/2023 12:00:00 AM,IBRD00021,EUROPE AND CENTRAL ASIA,NL,Netherlands,,,,NPL,Fully Repaid,4.25,,P037452,RECONSTRUCTION,3955788.0,0.0,0.0,3955788.0,0.0,0.0,0.0,0.0,3955788.0,3955788.0,0,0.0,04/01/1953 12:00:00 AM,04/01/1954 12:00:00 AM,05/25/1948 12:00:00 AM,08/07/1947 12:00:00 AM,06/01/1948 12:00:00 AM,06/30/1948 12:00:00 AM,
3,04/30/2023 12:00:00 AM,IBRD00030,EUROPE AND CENTRAL ASIA,DK,Denmark,,,,NPL,Fully Repaid,4.25,,P037362,RECONSTRUCTION,40000000.0,0.0,0.0,40000000.0,17771000.0,0.0,0.0,0.0,22229000.0,22229000.0,0,0.0,02/01/1953 12:00:00 AM,08/01/1972 12:00:00 AM,08/22/1947 12:00:00 AM,08/22/1947 12:00:00 AM,10/17/1947 12:00:00 AM,03/31/1949 12:00:00 AM,
4,04/30/2023 12:00:00 AM,IBRD00040,EUROPE AND CENTRAL ASIA,LU,Luxembourg,,,,NPL,Fully Repaid,4.25,,P037451,RECONSTRUCTION,12000000.0,238016.98,0.0,11761980.0,1619983.0,0.0,0.0,0.0,10142000.0,10142000.0,0,0.0,07/15/1949 12:00:00 AM,07/15/1972 12:00:00 AM,08/28/1947 12:00:00 AM,08/28/1947 12:00:00 AM,10/24/1947 12:00:00 AM,03/31/1949 12:00:00 AM,


# IV. Data Preparation

## IV.A. Data Shape Inspection

In [3]:
df.shape

(8991, 33)

## IV.B. Data Information Inspection

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8991 entries, 0 to 8990
Data columns (total 33 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   End of Period                 8991 non-null   object 
 1   Loan Number                   8991 non-null   object 
 2   Region                        8991 non-null   object 
 3   Country Code                  8989 non-null   object 
 4   Country                       8991 non-null   object 
 5   Borrower                      8934 non-null   object 
 6   Guarantor Country Code        8709 non-null   object 
 7   Guarantor                     8711 non-null   object 
 8   Loan Type                     8991 non-null   object 
 9   Loan Status                   8991 non-null   object 
 10  Interest Rate                 8897 non-null   float64
 11  Currency of Commitment        0 non-null      float64
 12  Project ID                    8991 non-null   object 
 13  Pro
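
The `df.info()` output above reports uneven non-null counts, including `Currency of Commitment` with 0 non-null values. As a minimal sketch of one possible quality check, the helper below summarizes missing values per column and lists columns that are entirely empty. The function name `summarize_missing` and the toy frame are assumptions made for illustration; they are not part of the original notebook, and the toy values do not reproduce the real data.

```python
import numpy as np
import pandas as pd


def summarize_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and shares, worst first."""
    summary = pd.DataFrame({
        "missing_count": frame.isna().sum(),
        "missing_share": frame.isna().mean(),
    })
    return summary.sort_values("missing_count", ascending=False)


# Toy frame standing in for the loan data (illustrative values only).
toy = pd.DataFrame({
    "Loan Number": ["IBRD00010", "IBRD00020", "IBRD00021"],
    "Interest Rate": [4.25, np.nan, 4.25],
    "Currency of Commitment": [np.nan, np.nan, np.nan],
})

missing_summary = summarize_missing(toy)

# Columns with a missing share of 1.0 carry no information at all.
all_null_columns = missing_summary.index[
    missing_summary["missing_share"] == 1.0
].tolist()

print(missing_summary)
print(all_null_columns)
```

On the real data, the same call would be `summarize_missing(df)`. Whether an all-null column should be dropped is a project decision that this sketch does not make.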

## IV.C. Data Definition

| Variables | Columns Definition |
| :-: | :-: |
| End of Period | _Column Definition_ |
| Loan Number | _Column Definition_ |
| Region | _Column Definition_ |
| Country Code | _Column Definition_ |
| Country | _Column Definition_ |
| Borrower | _Column Definition_ |
| Guarantor Country Code | _Column Definition_ |
| Guarantor | _Column Definition_ |
| Loan Type | _Column Definition_ |
| Loan Status | _Column Definition_ |
| Interest Rate | _Column Definition_ |
| Currency of Commitment | _Column Definition_ |
| Project ID | _Column Definition_ |
| Project Name | _Column Definition_ |
| Original Principal Amount | _Column Definition_ |
| Cancelled Amount | _Column Definition_ |
| Undisbursed Amount | _Column Definition_ |
| Disbursed Amount | _Column Definition_ |
| Repaid to IBRD | _Column Definition_ |
| Due to IBRD | _Column Definition_ |
| Exchange Adjustment | _Column Definition_ |
| Borrower's Obligation | _Column Definition_ |
| Sold 3rd Party | _Column Definition_ |
| Repaid 3rd Party | _Column Definition_ |
| Due 3rd Party | _Column Definition_ |
| Loans Held | _Column Definition_ |
| First Repayment Date | _Column Definition_ |
| Last Repayment Date | _Column Definition_ |
| Agreement Signing Date | _Column Definition_ |
| Board Approval Date | _Column Definition_ |
| Effective Date (Most Recent) | _Column Definition_ |
| Closed Date (Most Recent) | _Column Definition_ |
| Last Disbursement Date | _Column Definition_ |

## IV.D. Data Validation

| Variables | Data Types |
| :-: | :-: |
| _column_0_ | _Data Type_ |
| _column_1_ | _Data Type_ |
| _column_2_ | _Data Type_ |
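
One way to turn a variable-to-type table like the one above into an executable check is to compare each column's actual dtype against an expected dtype. The sketch below is a hedged illustration: the helper `find_dtype_mismatches`, the `expected_dtypes` mapping, and the toy frame are all hypothetical, and the expected types shown are assumptions rather than the project's agreed schema.

```python
import pandas as pd


def find_dtype_mismatches(frame: pd.DataFrame, expected: dict) -> dict:
    """Map each mismatched column to (actual dtype, expected dtype)."""
    mismatches = {}
    for column, expected_dtype in expected.items():
        if column not in frame.columns:
            # A column named in the schema but absent from the data.
            mismatches[column] = ("<missing>", expected_dtype)
            continue
        actual_dtype = str(frame[column].dtype)
        if actual_dtype != expected_dtype:
            mismatches[column] = (actual_dtype, expected_dtype)
    return mismatches


# Toy frame: dates are still raw strings, as they are after read_csv.
toy = pd.DataFrame({
    "Interest Rate": [4.25, 4.25],
    "Board Approval Date": ["05/09/1947 12:00:00 AM", "08/07/1947 12:00:00 AM"],
})

# Assumed target types, for illustration only.
expected_dtypes = {
    "Interest Rate": "float64",
    "Board Approval Date": "datetime64[ns]",
}

mismatches = find_dtype_mismatches(toy, expected_dtypes)
print(mismatches)
```

Columns reported here are the ones that still need a conversion before the data can be considered validated.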

In [None]:
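# Work on a copy so the raw DataFrame loaded above stays unchanged
df_aggregated = df.copy()

# Template conversions: replace "column_name" with an actual column and
# keep only the conversion that matches that column's target type

# Convert to boolean
df_aggregated["column_name"] = df_aggregated["column_name"].astype(bool)

# Convert to datetime
df_aggregated["column_name"] = pd.to_datetime(df_aggregated["column_name"])

# Convert to float
df_aggregated["column_name"] = df_aggregated["column_name"].astype(float)

# Convert to integer (raises an error if the column contains missing values)
df_aggregated["column_name"] = df_aggregated["column_name"].astype(int)

# Convert to string
df_aggregated["column_name"] = df_aggregated["column_name"].astype(str)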
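
The date columns in the preview rows use the pattern `04/30/2023 12:00:00 AM`. As a minimal sketch of the datetime conversion for such values, the example below passes an explicit format string to `pd.to_datetime` and coerces unparseable or missing entries to `NaT`. The constant `DATE_FORMAT` and the toy frame are assumptions for illustration, and the sketch assumes every date column follows the same pattern as the preview rows.

```python
import pandas as pd

# Month/day/year with a 12-hour clock and AM/PM marker,
# matching values such as "05/09/1947 12:00:00 AM".
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Toy column with two valid dates and one missing entry.
toy = pd.DataFrame({
    "Board Approval Date": [
        "05/09/1947 12:00:00 AM",
        "08/07/1947 12:00:00 AM",
        None,
    ],
})

# errors="coerce" turns missing or malformed values into NaT
# instead of raising an exception.
parsed = pd.to_datetime(
    toy["Board Approval Date"], format=DATE_FORMAT, errors="coerce"
)

print(parsed)
```

An explicit format avoids ambiguity between month-first and day-first dates. Counting `NaT` values before and after the conversion is a simple way to spot rows that did not match the assumed pattern.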

In [None]:
df_aggregated.head()

In [None]:
df_aggregated.info()

## IV.E. Data Segregation

In [None]:
# "label_target" is a placeholder for the target column name
X = df_aggregated.drop(columns="label_target")
y = df_aggregated["label_target"]

In [None]:
X.shape, y.shape

In [None]:
X.head()

In [None]:
y.head()

## IV.F. Export Data

In [None]:
X.to_pickle('../../data/processed/X.pkl')
y.to_pickle('../../data/processed/y.pkl')
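
A quick way to confirm that an export like the one above is readable is to load the pickles back and compare them with the in-memory objects. The sketch below illustrates that round trip under stated assumptions: it writes to a temporary directory rather than `../../data/processed/`, and `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the feature matrix and the target (illustrative values).
X_toy = pd.DataFrame({
    "Interest Rate": [4.25, 4.25],
    "Region": ["EUROPE AND CENTRAL ASIA", "EUROPE AND CENTRAL ASIA"],
})
y_toy = pd.Series([0, 1], name="label_target")

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = Path(tmp_dir) / "X.pkl"
    y_path = Path(tmp_dir) / "y.pkl"

    # Write both objects, then read them back from disk.
    X_toy.to_pickle(x_path)
    y_toy.to_pickle(y_path)
    X_loaded = pd.read_pickle(x_path)
    y_loaded = pd.read_pickle(y_path)

# These raise AssertionError if values, dtypes, or index differ.
pd.testing.assert_frame_equal(X_toy, X_loaded)
pd.testing.assert_series_equal(y_toy, y_loaded)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.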