**Here is the background information on your task**

Now that you are familiar with the portfolio, and the personal loans and risk team is using your model as a guide to loss provisions for the upcoming year, the team asks you to look at their mortgage book. They suspect that FICO scores will provide a good indication of how likely a customer is to default on their mortgage. Charlie wants to build a machine learning model that will predict the probability of default, but while you are discussing the methodology, she mentions that the architecture she is using requires categorical data. As FICO ratings can take integer values in a large range, they will need to be mapped into buckets. She asks if you can find the best way of doing this to allow her to analyze the data.

A FICO score is a standardized credit score created by the Fair Isaac Corporation (FICO) that quantifies the creditworthiness of a borrower as a value between 300 and 850, based on various factors. FICO scores are used in 90% of mortgage application decisions in the United States. The risk manager provides you with FICO scores for the borrowers in the bank’s portfolio and wants you to construct a technique for predicting the PD (probability of default) for the borrowers using these scores.

**Here is your task**

Charlie wants to make her model work for future data sets, so she needs a general approach to generating the buckets. Given a set number of buckets corresponding to the number of input labels for the model, she would like to find out the boundaries that best summarize the data. You need to create a rating map that maps the FICO score of the borrowers to a rating where a lower rating signifies a better credit score.

The process of doing this is known as quantization. You could consider many ways of solving the problem by optimizing different properties of the resulting buckets, such as the mean squared error or log-likelihood (see below for definitions). For background on quantization, see [here](link).
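As a minimal sketch of what such a rating map could look like once boundaries are known: the function name `fico_to_rating` and the boundary values below are illustrative assumptions, not values derived from the loan data. Rating 1 is assigned to the highest-score bucket so that a lower rating means a better credit score.

```python
import numpy as np


def fico_to_rating(fico_scores, boundaries):
    """Map FICO scores to integer ratings, where rating 1 is the best bucket."""
    boundaries = np.sort(np.asarray(boundaries, dtype=float))
    # np.digitize returns 0 for scores below the first boundary and
    # len(boundaries) for scores at or above the last boundary.
    bucket_index = np.digitize(fico_scores, boundaries)
    # Flip the index so the highest-score bucket receives rating 1.
    return len(boundaries) + 1 - bucket_index


# Hypothetical boundaries, chosen only to illustrate the mapping
example_boundaries = [580, 670, 740, 800]
ratings = fico_to_rating(np.array([520, 600, 700, 760, 820]), example_boundaries)
print(ratings)  # [5 4 3 2 1]
```

The same function works for any number of buckets, since the number of ratings is inferred from the number of boundaries.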

**Mean squared error**

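You can view this question as an approximation problem: map all the entries in a bucket to one value, and choose the buckets so that the associated squared error is as small as possible. We are then looking to minimize the following expression:

```math
\sum_{i=1}^{N} (y_i - \hat{y}_i)^2
```

Here \(y_i\) is the FICO score of record \(i\), \(\hat{y}_i\) is the value that its bucket maps it to, and the sum runs over all \(N\) records.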
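For a fixed set of buckets, the squared error is smallest when each bucket is represented by the mean of its members. The sketch below evaluates the sum under that assumption; the function name `bucket_squared_error` and the sample scores are illustrative and not taken from the loan data.

```python
import numpy as np


def bucket_squared_error(scores, boundaries):
    """Sum of squared deviations of each score from its bucket mean."""
    scores = np.asarray(scores, dtype=float)
    buckets = np.digitize(scores, np.sort(boundaries))
    total = 0.0
    # Only non-empty buckets appear in np.unique(buckets)
    for b in np.unique(buckets):
        members = scores[buckets == b]
        total += np.sum((members - members.mean()) ** 2)
    return total


sample_scores = [500, 520, 700, 720]
error_good_split = bucket_squared_error(sample_scores, [600])
error_poor_split = bucket_squared_error(sample_scores, [510])
print(error_good_split)  # 400.0
print(error_poor_split > error_good_split)  # True
```

A boundary at 600 separates the two natural clusters, so it gives a much smaller error than a boundary at 510.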

**Log-likelihood**

A more sophisticated possibility is to maximize the following log-likelihood function:

```math
\sum_{i=1}^{N} \left(k_i \log(p_i) + (n_i - k_i) \log(1 - p_i)\right)
```

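Where:
- \(b_i\) are the bucket boundaries, which determine \(n_i\) and \(k_i\)
- \(n_i\) is the number of records in bucket \(i\)
- \(k_i\) is the number of defaults in bucket \(i\)
- \(p_i = \frac{k_i}{n_i}\) is the probability of default in bucket \(i\)

Here the sum runs over buckets, so \(N\) is the number of buckets.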
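To make the formula concrete, the sketch below evaluates it from per-bucket counts. The function name `bucket_log_likelihood` and the counts are made up for illustration, and the code assumes every bucket contains at least one record. Terms of the form 0 · log(0) are treated as 0, so buckets with no defaults (or only defaults) do not break the sum.

```python
import numpy as np


def bucket_log_likelihood(n, k):
    """Log-likelihood from record counts n and default counts k per bucket."""
    n = np.asarray(n, dtype=float)
    k = np.asarray(k, dtype=float)
    p = k / n  # assumes n > 0 in every bucket
    # Suppress warnings from log(0); those entries are replaced by 0 below
    with np.errstate(divide="ignore", invalid="ignore"):
        default_term = np.where(k > 0, k * np.log(p), 0.0)
        survive_term = np.where(k < n, (n - k) * np.log(1 - p), 0.0)
    return float(np.sum(default_term + survive_term))


# Hypothetical counts: two buckets with different default rates,
# compared with the same records merged into a single bucket
ll_split = bucket_log_likelihood([100, 100], [30, 5])
ll_merged = bucket_log_likelihood([200], [35])
print(ll_split > ll_merged)  # True
```

Separating records with different default rates raises the log-likelihood, which is why the boundaries that maximize it tend to fall where the default rate changes.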

The log-likelihood function accounts for both how coarse the discretization is and the density of defaults in each bucket. The problem can be split into subproblems that are solved incrementally (i.e., through a dynamic programming approach). For example, you can break the problem into two subproblems, creating five buckets for FICO scores ranging from 300 to 600 and five buckets for FICO scores ranging from 600 to 850. Refer to [this page](link) for more context behind a likelihood function. [This page](link) may also be helpful for background on dynamic programming.
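One possible dynamic programming formulation is sketched below. It is an assumption about how the subproblems could be set up, not a method prescribed by the task: the best log-likelihood for splitting the first `j` distinct scores into `b` buckets is built from the best split of a shorter prefix into `b - 1` buckets. The function name `optimal_buckets` and the toy data are hypothetical, and the code assumes there are at least as many distinct scores as buckets.

```python
import numpy as np


def optimal_buckets(scores, defaults, num_buckets):
    """Return (boundaries, log_likelihood) maximizing the bucketed log-likelihood."""
    values, inverse = np.unique(np.asarray(scores), return_inverse=True)
    m = len(values)
    defaults = np.asarray(defaults, dtype=float)
    # Prefix sums of record and default counts over the distinct scores
    n_cum = np.concatenate(([0.0], np.cumsum(np.bincount(inverse, minlength=m))))
    k_cum = np.concatenate(
        ([0.0], np.cumsum(np.bincount(inverse, weights=defaults, minlength=m)))
    )

    def segment_ll(i, j):
        # Log-likelihood of a single bucket covering values[i:j]
        n = n_cum[j] - n_cum[i]
        k = k_cum[j] - k_cum[i]
        if k == 0 or k == n:
            return 0.0  # 0 * log(0) is treated as 0
        p = k / n
        return k * np.log(p) + (n - k) * np.log(1 - p)

    # best[b, j]: best log-likelihood using b buckets for the first j distinct scores
    best = np.full((num_buckets + 1, m + 1), -np.inf)
    cut = np.zeros((num_buckets + 1, m + 1), dtype=int)
    best[0, 0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, m + 1):
            for i in range(b - 1, j):
                candidate = best[b - 1, i] + segment_ll(i, j)
                if candidate > best[b, j]:
                    best[b, j] = candidate
                    cut[b, j] = i

    # Walk back through the stored cuts to recover the lower edge of each bucket
    boundaries = []
    j = m
    for b in range(num_buckets, 1, -1):
        j = cut[b, j]
        boundaries.append(values[j])
    return boundaries[::-1], float(best[num_buckets, m])


# Toy data: defaults occur only below a score of 600
toy_scores = [300, 400, 500, 600, 700, 800]
toy_defaults = [1, 1, 1, 0, 0, 0]
toy_boundaries, toy_ll = optimal_buckets(toy_scores, toy_defaults, 2)
print(toy_boundaries, toy_ll)
```

Each returned boundary is the lowest score of a bucket, which matches the convention of `np.digitize`. The triple loop costs on the order of `num_buckets` times the square of the number of distinct scores, which stays manageable because FICO scores take at most a few hundred distinct values.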


In [None]:
import pandas as pd
from scipy.optimize import minimize
import numpy as np

# Load the sample data
data = "../data/Loan_Data.csv"

df = pd.read_csv(data)

In [None]:
# Sort the DataFrame by FICO score
df.sort_values("fico_score", inplace=True)

# Define the number of buckets
num_buckets = 5


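# Function to calculate mean squared error
def mean_squared_error(boundaries, data):
    data = np.asarray(data, dtype=float)
    n_buckets = len(boundaries) + 1
    # np.digitize returns bucket indices 0..len(boundaries), so index from 0
    buckets = np.digitize(data, np.sort(boundaries))
    counts = np.bincount(buckets, minlength=n_buckets)
    sums = np.bincount(buckets, weights=data, minlength=n_buckets)
    # Mean of each bucket; empty buckets are never indexed below
    bucket_means = sums / np.maximum(counts, 1)
    mse = np.mean((data - bucket_means[buckets]) ** 2)
    return mse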


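# Function to calculate log-likelihood (negated, because minimize() minimizes)
def log_likelihood(boundaries, data, labels):
    n_buckets = len(boundaries) + 1
    buckets = np.digitize(np.asarray(data, dtype=float), np.sort(boundaries))
    # Records (ni) and defaults (ki) per bucket, indexed 0..len(boundaries)
    ni = np.bincount(buckets, minlength=n_buckets)
    ki = np.bincount(buckets, weights=np.asarray(labels, dtype=float), minlength=n_buckets)
    # Clip pi away from 0 and 1 so the logarithms stay finite
    pi = np.clip(ki / np.maximum(ni, 1), 1e-12, 1 - 1e-12)
    ll = np.sum(ki * np.log(pi) + (ni - ki) * np.log(1 - pi))
    return -ll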


# Initial guess for boundaries
initial_boundaries = np.linspace(
    df["fico_score"].min(), df["fico_score"].max(), num_buckets + 1
)[1:-1]

# Optimize for mean squared error
result_mse = minimize(
    mean_squared_error,
    initial_boundaries,
    args=(df["fico_score"],),
    method="Nelder-Mead",
)

# Optimize for log-likelihood
labels = df["default"]
result_ll = minimize(
    log_likelihood,
    initial_boundaries,
    args=(df["fico_score"], labels),
    method="Nelder-Mead",
)

# Print the results
print("Mean Squared Error Optimal Boundaries:", result_mse.x)
print("Log-Likelihood Optimal Boundaries:", result_ll.x)