# Loan Default Prediction

**Project Overview**

Build a predictive model that assigns default probabilities to loan applications.
Minimize financial risk by accurately predicting the likelihood of loan defaults, enabling
more informed and strategic lending decisions.

# Notebook Structure
---
<details>
<summary><b>1. Business Problem and Objectives</b></summary>
   Define the problem being addressed and its relevance to real-world scenarios.
</details>

---

<details>
<summary><b>2. Data Acquisition and Preparation</b></summary>

- ### **2.1 Data Source and Download**  
  Explanation of the dataset source and how it was obtained.  

- ### **2.2 Installing Required Modules**  
  List and install the libraries needed for the project.  

- ### **2.3 Importing Modules and Global Variables**  
  Set up imports and define constants or global variables.  

- ### **2.4 Defining Supplemental Functions**  
  Helper functions to streamline data processing.  

- ### **2.5 Data Loading**  
  Load the dataset into a DataFrame or suitable data structure.  

- ### **2.6 Basic Data Understanding**  
  Perform initial data exploration, including shape, columns, and types.  

</details>

---

<details>
<summary><b>3. Data Preprocessing and Feature Engineering</b></summary>

- ### **3.1 Cleaning**  

- ### **3.2 Preprocessing**  

- ### **3.3 Feature Extraction**  

</details>

---

<details>
<summary><b>4. Predictive Analysis</b></summary>

- ### **4.1 Train-Test Data Split**  
- ### **4.2 Classification with Simple Model**   
  Choose a simple base classification model and train it on the preprocessed data. Assess model performance using metrics like accuracy, precision, and recall.
- ### **4.3 Selecting Best Model for Feature Reduction**
  Train several more advanced classification models that support feature interpretation, and select the best performer.
- ### **4.4 Feature Reduction Using Best Advanced Model**
  Reduce the dataset to the most important features identified by the best-performing model.
- ### **4.5 Tuning Hyperparameters for Logistic Regression Model**
  Use a parameter grid search to find the best-performing hyperparameters.  
- ### **4.6 Improved Model Performance**
  Assess model performance using metrics like accuracy, precision, and recall.
- ### **4.7 Feature Interpretation**
  Visualize results and discuss findings, including strengths and limitations.
</details>

---

<details>
<summary><b>5. Conclusion</b></summary>
Summarize the work.
Summarize the findings, including strengths and limitations.
Suggest future work.
</details>


# 1. Business Problem and Objectives

**Problem Statement:**
**Key Questions:**
**Project Objectives:**


# 2. Data Acquisition and Preparation

## 2.1 Data Source and Download

This section outlines the source of the data used in this project and provides instructions for downloading it.

**Data Sources**
**Data Relevance**
**Data Limitations**
**Download Instructions**
**Data Storage**
**Data Loading**

## 2.2 Installing Required Modules

This section covers installing the Python libraries and packages required for the project.

1. **Requirements File**
    - We retrieve the list of required packages from a `requirements.txt` file hosted on GitHub using `wget`. This file contains the names and versions of all the dependencies.
    - This ensures that we install the correct versions of the libraries for compatibility and reproducibility.
2. **Installation using pip**
    - We use Python's `pip` package manager to install the libraries listed in the `requirements.txt` file.
    - The `-r` flag instructs `pip` to read the requirements file and install all the packages listed within.
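
The two steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the notebook's actual cell: `REQUIREMENTS_URL` is a placeholder (the real GitHub location is not given here), and the helper names `fetch_requirements`, `pip_install_command`, and `install_requirements` are hypothetical. It uses `urllib.request` from the standard library in place of `wget`, and calls `pip` through `sys.executable` so the packages are installed into the interpreter that runs the notebook.

```python
import subprocess
import sys
import urllib.request
from pathlib import Path

# Placeholder (assumption): replace with the raw GitHub URL of the project's requirements.txt.
REQUIREMENTS_URL = "https://example.com/requirements.txt"


def fetch_requirements(url: str, dest: str = "requirements.txt") -> Path:
    """Download the requirements file (stands in for the `wget` step)."""
    path, _headers = urllib.request.urlretrieve(url, dest)
    return Path(path)


def pip_install_command(requirements_path) -> list:
    """Build the `pip install -r <file>` command for the current interpreter."""
    return [sys.executable, "-m", "pip", "install", "-r", str(requirements_path)]


def install_requirements(requirements_path) -> None:
    """Install every package listed in the requirements file."""
    # check=True raises CalledProcessError if pip exits with a non-zero status.
    subprocess.run(pip_install_command(requirements_path), check=True)


# Usage (left commented out because REQUIREMENTS_URL is only a placeholder):
# install_requirements(fetch_requirements(REQUIREMENTS_URL))
```

Building the command in its own function keeps the download and install steps separate, so the exact `pip` invocation can be inspected before anything is installed.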

Getting Data from **Kaggle**

```python
# Download the dataset from a notebook cell:
# !kaggle competitions download -c home-credit-default-risk
# The same download wrapped in a try/except block, for when the notebook is executed locally
import subprocess

try:
    # Run the Kaggle competition download command
    # (competition data is fetched with `competitions download -c`, not `datasets download -d`)
    result = subprocess.run(
        ["kaggle", "competitions", "download", "-c", "home-credit-default-risk"],
        check=True,  # Raise CalledProcessError if the command fails
        text=True,   # Decode stdout and stderr as text
        capture_output=True  # Capture stdout and stderr
    )
    print("Dataset downloaded successfully!")
    print(result.stdout)  # Print the command output
except subprocess.CalledProcessError as e:
    print("Error occurred while downloading the dataset.")
    print(f"Return code: {e.returncode}")
    print(f"Error output: {e.stderr}")
except FileNotFoundError:
    # Raised when the `kaggle` executable cannot be found
    print("Kaggle CLI is not installed. Please install it and ensure it's in your PATH.")
```

Kaggle CLI is not installed. Please install it and ensure it's in your PATH.
