# Loan Default Prediction

**Project Overview**

Build a predictive model that assigns default probabilities to loan applications.
Minimize financial risk by accurately predicting the likelihood of loan defaults, enabling
more informed and strategic lending decisions.

# Notebook Structure
---
<details>
<summary><b>1. Business Problem and Objectives</b></summary>
   Define the problem being addressed and its relevance to real-world scenarios.
</details>

---

<details>
<summary><b>2. Data Acquisition and Preparation</b></summary>

- ### **2.1 Data Source and Download**  
  Explanation of the dataset source and how it was obtained.  

- ### **2.2 Installing Required Modules**  
  List and install the libraries needed for the project.  

- ### **2.3 Importing Modules and Global Variables**  
  Set up imports and define constants or global variables.  

- ### **2.4 Defining Supplemental Functions**  
  Helper functions to streamline data processing.  

- ### **2.5 Data Loading**  
  Load the dataset into a DataFrame or suitable data structure.  

- ### **2.6 Basic Data Understanding**  
  Perform initial data exploration, including shape, columns, and types.  

</details>

---

<details>
<summary><b>3. Data Preprocessing and Feature Engineering</b></summary>

- ### **3.1 Cleaning**  

- ### **3.2 Preprocessing**  

- ### **3.3 Feature Extraction**  

</details>

---

<details>
<summary><b>4. Predictive Analysis</b></summary>

- ### **4.1 Train-Test Data Split**  
- ### **4.2 Classification with Simple Model**   
  Choose a simple base classification model and train it on the preprocessed data. Assess model performance using metrics like accuracy, precision, and recall.
- ### **4.3 Selecting Best Model for Feature Reduction**
  Train several advanced, interpretable classification models and compare their performance.
- ### **4.4 Feature Reduction Using Best Advanced Model**
  Reduce the dataset to the most important features identified by the best-performing model.
- ### **4.5 Tuning Hyperparameters for Logistic Regression Model**
  Use parameter grid search to find the best-performing model.  
- ### **4.6 Improved Model Performance**
  Assess model performance using metrics like accuracy, precision, and recall.
- ### **4.7 Feature Interpretation**
  Visualize results and discuss findings, including strengths and limitations.
</details>

---

<details>
<summary><b>5. Conclusion</b></summary>
Summarize work.
Summarize findings, including strengths and limitations.
Suggest future work.
</details>


# 1. Business Problem and Objectives

**Problem Statement:**
**Key Questions:**
**Project Objectives:**


# 2. Data Acquisition and Preparation

## 2.1 Data Source and Download

This section describes the source of the data used in this project and provides instructions for downloading it.

**Data Sources**
**Data Relevance**
**Data Limitations**
**Download Instructions**
**Data Storage**
**Data Loading**

## 2.2 Installing Required Modules

This section covers installing the Python libraries and packages that the project requires.

1. **Requirements File**
    - We retrieve the list of required packages from a `requirements.txt` file hosted on GitHub using `wget`. This file contains the names and versions of all the dependencies.
    - This ensures that we install the correct versions of the libraries for compatibility and reproducibility.
2. **Installation using pip**
    - We use Python's `pip` package manager to install the libraries listed in the `requirements.txt` file.
    - The `-r` flag instructs `pip` to read the requirements file and install all the packages listed within.
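
The two steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual install cell: it assumes `requirements.txt` has already been fetched into the working directory, and the helper names `build_pip_command` and `install_requirements` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_pip_command(requirements_path):
    """Return the pip command that installs everything listed in a requirements file."""
    # sys.executable ties pip to the interpreter that is running the notebook.
    return [sys.executable, "-m", "pip", "install", "-r", str(requirements_path)]


def install_requirements(requirements_path="requirements.txt"):
    """Install the listed dependencies; raise if the file is missing or pip fails."""
    path = Path(requirements_path)
    if not path.is_file():
        raise FileNotFoundError(f"Requirements file not found: {path}")
    # check=True raises CalledProcessError when pip exits with a non-zero status.
    subprocess.run(build_pip_command(path), check=True)


# Show the command without running it; call install_requirements() to install.
print(build_pip_command("requirements.txt")[1:])
```

Calling pip through `sys.executable -m pip` avoids installing into a different Python environment than the one the kernel uses.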

Obtaining Data from the [Loan Default Prediction Competition](https://www.kaggle.com/competitions/home-credit-default-risk/overview) on **Kaggle:**

Before accessing the loan default prediction data, you must first join the competition and agree to its specific Terms & Conditions. Follow these steps:

1. Create/Log In to Your Kaggle Account.
2. Navigate to the Competition Page. This page will have all the relevant details about the competition, including the rules and guidelines.
3. Click the “Join Competition” button. This action will prompt you to review and agree to the competition's Terms & Conditions. You must accept these terms before you can access the data.
4. Download the Dataset.
  
  a) **Manual Download**. After joining and agreeing to the terms, use the download button provided on the competition page to download the dataset directly to your computer.

  b) **Using the Kaggle API**. If you prefer using the command line, install the Kaggle API.

  `!pip install kaggle`

  Next, ensure your Kaggle API credentials (found in your account settings) are set up correctly, typically by placing the `kaggle.json` file in the `~/.kaggle/` directory. Then run:

  `!kaggle competitions download -c home-credit-default-risk`

In [10]:
# Upload kaggle.json
from google.colab import files
files.upload()  # Upload your kaggle.json file here
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
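
Outside Colab, the `files.upload()` helper is not available, so the credential setup could be done in plain Python instead. The sketch below mirrors the `mkdir`, `cp`, and `chmod 600` steps; the function name `install_kaggle_credentials` and its `home_dir` parameter are hypothetical and exist only for illustration.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_credentials(source_path, home_dir=None):
    """Copy kaggle.json into <home>/.kaggle and restrict its permissions."""
    home = Path(home_dir) if home_dir is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(source_path, target)
    # Owner read/write only, matching `chmod 600` in the shell version.
    os.chmod(target, 0o600)
    return target
```

A call such as `install_kaggle_credentials("kaggle.json")` would place the file under the current user's home directory.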


This block of code downloads the dataset archive into the current working directory, which is `/content` when the environment is Colab.

In [11]:
# Download the dataset. Notebook shortcut: !kaggle competitions download -c home-credit-default-risk
# The subprocess call below does the same with try/except error handling, which also works when the notebook runs locally
import subprocess

try:
    # Run the Kaggle dataset download command
    result = subprocess.run(
        ["kaggle", "competitions", "download", "-c", "home-credit-default-risk"],
        check=True,  # Raise an exception if the command fails
        text=True,   # Decode stdout and stderr as text
        capture_output=True  # Capture stdout and stderr
    )
    print("Dataset downloaded successfully!")
    print(result.stdout)  # Print the command output
except subprocess.CalledProcessError as e:
    print("Error occurred while downloading the dataset.")
    print(f"Return code: {e.returncode}")
    print(f"Error output: {e.stderr}")
except FileNotFoundError:
    print("Kaggle CLI is not installed. Please install it and ensure it's in your PATH.")

Dataset downloaded successfully!
Downloading home-credit-default-risk.zip to /content
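
As a small variation on the cell above, the download command can be built with an explicit target directory, and the CLI can be checked for before anything runs. This is a sketch under assumptions: the helper names are hypothetical, and it relies on the Kaggle CLI's `-p` option for choosing the download folder.

```python
import shutil


def kaggle_download_command(competition, dest_dir="."):
    """Build the Kaggle CLI command that downloads a competition archive."""
    # -c selects the competition; -p sets the folder the archive is saved to.
    return ["kaggle", "competitions", "download", "-c", competition, "-p", str(dest_dir)]


def kaggle_cli_available():
    """Return True when a `kaggle` executable is found on PATH."""
    return shutil.which("kaggle") is not None


cmd = kaggle_download_command("home-credit-default-risk")
print(" ".join(cmd))
print("Kaggle CLI found:", kaggle_cli_available())
```

The resulting list can be passed to `subprocess.run` in the same way as in the cell above.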




This block of code defines supplemental functions that extract the datasets and move them into the default `data` directory.

In [12]:
# Import modules
import os
import zipfile
import requests

# Function checks if directory exists
def ensure_directory(path):
    """
    Ensure that a directory exists. If not, create it.
    """
    os.makedirs(path, exist_ok=True)
    print(f"Directory ensured: {path}")

# Function downloads files
def download_files(base_url, file_names, destination_dir):
    """
    Download a list of files from a base URL to a specified directory.

    Args:
    - base_url (str): The base URL for the files.
    - file_names (list): List of filenames to download.
    - destination_dir (str): Directory to save the downloaded files.
    """
    for file_name in file_names:
        url = f"{base_url}/{file_name}"
        dest_path = os.path.join(destination_dir, file_name)
        if not os.path.exists(dest_path):
            print(f"Downloading {file_name}...")
            response = requests.get(url)
            response.raise_for_status()
            with open(dest_path, "wb") as f:
                f.write(response.content)
            print(f"Downloaded: {file_name}")
        else:
            print(f"File already exists: {file_name}")

# Function unzips archive into directory
def unzip_dataset(zip_path, destination_dir):
    """
    Unzip a dataset into the specified directory.

    Args:
    - zip_path (str): Path to the zip file.
    - destination_dir (str): Directory to extract the zip contents.
    """
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(destination_dir)
    print(f"Unzipped: {zip_path} to {destination_dir}")

# Function to determine data directory, depending on runtime environment.

def determine_data_dir():
    """
    Determines the data directory based on the execution environment:
    - Local: Uses the 'data' directory in the current working directory.
    - Cloud (e.g., Google Colab): Uses '/content/data' as the data directory.

    Returns:
        str: Path to the appropriate data directory.
    """
    if 'COLAB_GPU' in os.environ:  # Check if running in Google Colab
        data_dir = "/content/data"
        print(f"Running in Google Colab. Using data directory: {data_dir}")
    else:
        data_dir = os.path.join(os.getcwd(), "data")
        print(f"Running locally. Using data directory: {data_dir}")

        # Ensure the 'data' directory exists locally
        if not os.path.isdir(data_dir):
            print(f"The directory '{data_dir}' does not exist. Please create it and place the data files there.")
            raise FileNotFoundError(f"'{data_dir}' directory is required for local execution.")

    return data_dir
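
The `COLAB_GPU` environment variable used above is a heuristic. One possible alternative, shown here only as a sketch, is to check whether the `google.colab` package can be found. The function names `running_in_colab` and `resolve_data_dir` are hypothetical and are not part of the notebook.

```python
import importlib.util
import os


def running_in_colab():
    """Best-effort check for Google Colab based on the google.colab package."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # Raised when the parent `google` package is missing or malformed.
        return False


def resolve_data_dir(base_dir=None):
    """Return <base>/data, using /content on Colab and the working directory otherwise."""
    if base_dir is None:
        base_dir = "/content" if running_in_colab() else os.getcwd()
    return os.path.join(base_dir, "data")


print("Colab detected:", running_in_colab())
print("Data directory:", resolve_data_dir())
```

Accepting an optional `base_dir` makes the path logic easy to exercise outside either environment.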

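This block of code recreates the repository's directory structure (`data`, `models`, `images`) in the Colab or local environment, unzips the dataset, and downloads any supplemental files.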

In [13]:
# Check data directories
data_dir = determine_data_dir()

# Get the parent directory of data_dir
base_dir = os.path.dirname(data_dir)
models_dir = os.path.join(base_dir, "models")
images_dir = os.path.join(base_dir, "images")

# Ensure directories exist
ensure_directory(data_dir)
ensure_directory(models_dir)
ensure_directory(images_dir)

# Dataset path
zip_file_path_dataset = "home-credit-default-risk.zip"

# Check if the file exists
if os.path.exists(zip_file_path_dataset):
    print("File found. Proceeding to unzip...")
    # Unzip dataset
    unzip_dataset(zip_file_path_dataset, data_dir)
    # Remove after unzipping
    os.remove(zip_file_path_dataset)
    print(f"Removed ZIP file: {zip_file_path_dataset}")
else:
    print("File not found. Please check the path or download the Dataset from Kaggle.")

# Download supplemental data
github_base_url = "https://raw.githubusercontent.com/leksea/loan-default-prediction/main/data"
supplemental_files = [
]
download_files(github_base_url, supplemental_files, data_dir)

# Download model into models directory
model_base_url = "https://raw.githubusercontent.com/leksea/loan-default-prediction/main/models"
model_files = [
]
download_files(model_base_url, model_files, models_dir)
print("Setup complete.")

Running in Google Colab. Using data directory: /content/data
Directory ensured: /content/data
Directory ensured: /content/models
Directory ensured: /content/images
File found. Proceeding to unzip...
Unzipped: home-credit-default-risk.zip to /content/data
Removed ZIP file: home-credit-default-risk.zip
Setup complete.
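
After unzipping, it can be useful to confirm that every CSV the notebook later reads is actually present. The sketch below assumes the file names used in the data loading cell of this notebook; the helper `find_missing_files` is a hypothetical name.

```python
import os

# File names taken from the data loading cell of this notebook.
EXPECTED_FILES = [
    "HomeCredit_columns_description.csv",
    "application_train.csv",
    "application_test.csv",
    "bureau.csv",
    "bureau_balance.csv",
    "credit_card_balance.csv",
    "installments_payments.csv",
    "POS_CASH_balance.csv",
    "previous_application.csv",
]


def find_missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(data_dir, name))
    ]
```

An empty list from `find_missing_files(data_dir)` suggests the archive was extracted completely.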


## 2.3 Importing Modules and Global Variables

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 2.4 Defining Supplemental Functions

## 2.5 Data Loading

In [18]:
## Loading the files
# determine the data directory
data_dir = determine_data_dir()

df_col_desc = pd.read_csv(os.path.join(data_dir, 'HomeCredit_columns_description.csv'), encoding='latin1')
df_application_train = pd.read_csv(os.path.join(data_dir, 'application_train.csv'), encoding='latin1')
df_application_test = pd.read_csv(os.path.join(data_dir, 'application_test.csv'), encoding='latin1')
df_bureau = pd.read_csv(os.path.join(data_dir, 'bureau.csv'), encoding='latin1')
df_bureau_balance = pd.read_csv(os.path.join(data_dir, 'bureau_balance.csv'), encoding='latin1')
df_credit_card_balance = pd.read_csv(os.path.join(data_dir, 'credit_card_balance.csv'), encoding='latin1')
df_installments_payments = pd.read_csv(os.path.join(data_dir, 'installments_payments.csv'), encoding='latin1')
df_POS_CASH_balance = pd.read_csv(os.path.join(data_dir, 'POS_CASH_balance.csv'), encoding='latin1')
df_previous_application = pd.read_csv(os.path.join(data_dir, 'previous_application.csv'), encoding='latin1')

Running in Google Colab. Using data directory: /content/data
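
The repeated `pd.read_csv` calls could also be written as a loop that returns a dictionary of DataFrames. This is an optional sketch, not the notebook's own loading code; `load_tables` is a hypothetical helper, and it assumes each table is stored as `<name>.csv` in the data directory.

```python
import os

import pandas as pd


def load_tables(data_dir, table_names, encoding="latin1"):
    """Read <name>.csv for each table name and return a dict of DataFrames."""
    tables = {}
    for name in table_names:
        path = os.path.join(data_dir, f"{name}.csv")
        tables[name] = pd.read_csv(path, encoding=encoding)
    return tables
```

For example, `load_tables(data_dir, ["bureau", "bureau_balance"])["bureau"]` would hold the same data as `df_bureau` above.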


## 2.6 Basic Data Understanding

Run built-in pandas functions to get a first look at the DataFrames.
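
A first pass of this kind might look like the sketch below, which reports the shape plus per-column type, missing-value counts, and distinct-value counts. The toy DataFrame and its column names are made up for illustration and are not the project's data; `summarize` is a hypothetical helper.

```python
import pandas as pd


def summarize(df):
    """Return one row per column with dtype, missing counts, and distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Toy data with hypothetical columns, used only to show the output layout.
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "AMT_INCOME": [100.0, None, 250.0, 250.0],
    "TARGET": [0, 1, 0, 0],
})

print(toy.shape)
summary = summarize(toy)
print(summary)
```

The same function could then be applied to each loaded table, for example `summarize(df_application_train)`.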