# Introduction to the Project 

## HELOC 
A home equity line of credit, or HELOC, is a loan in which the lender agrees to lend a maximum amount within an agreed period (called a term), where the collateral is the borrower’s equity in his/her house (akin to a second mortgage). Because a home often is a consumer’s most valuable asset, many homeowners use home equity credit lines only for major items, such as education, home improvements, or medical bills, and choose not to use them for day-to-day expenses.

Since the amount of such credit is not small, banks carefully review the financial situation of applicants. Great care is taken to keep the whole process transparent and the decision easy to explain to the client.

## My Dataset
I have taken a subset of the HELOC (Home Equity Line of Credit) dataset. The selected columns are listed below. They correspond to various attributes and features related to credit, payment history, and borrower behavior.

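### Dependent Variable
**RiskPerformance**: The binary target ("Good" or "Bad") that the models below predict; it is encoded as 0/1 before fitting.

### Independent Variables

1. **ExternalRiskEstimate**: A risk assessment or credit score provided by an external source.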

2. **MSinceOldestTradeOpen**: The number of months since the oldest trade line (credit account) was opened.

3. **MSinceMostRecentTradeOpen**: The number of months since the most recent trade line was opened.

4. **NumSatisfactoryTrades**: The number of satisfactory (positive) credit trades.

5. **NumTrades90Ever2DerogPubRec**: Number of trades that have ever been 90 or more days past due or have had a derogatory public record.

6. **PercentTradesNeverDelq**: Percentage of trades that have never been delinquent.

7. **MSinceMostRecentDelq**: The number of months since the most recent delinquency.

8. **NumTradesOpeninLast12M**: Number of credit trades opened in the last 12 months.

9. **MSinceMostRecentInqexcl7days**: The number of months since the most recent inquiry (excluding inquiries within the last 7 days).

10. **NumInqLast6M**: Number of inquiries in the last 6 months.

11. **NetFractionRevolvingBurden**: Revolving balance divided by credit limit, i.e. the share of available revolving credit (such as credit cards) that remains unpaid at the end of a billing cycle.

These columns seem to provide a comprehensive set of information about a borrower's credit history, payment behavior, and financial standing. 
We can use these columns to gain insights into the creditworthiness and risk associated with the borrowers in the dataset.

<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>


# SECTION 1: Exploratory Data Analysis

### 1. Importing Libraries

In [None]:
import numpy as np  # for numerical calculations
import pandas as pd  # for handling data in tabular form
import matplotlib.pyplot as plt  # for creating visualizations
import seaborn as sns  # for enhanced data visualization

In [None]:
from optbinning import BinningProcess   # OptBinning: a library built for optimal binning

from sklearn.linear_model import LogisticRegression  # for logistic regression modeling
from sklearn.metrics import classification_report  # for generating classification reports
from sklearn.metrics import auc, roc_auc_score, roc_curve  # for ROC curve and AUC calculations
from sklearn.model_selection import train_test_split  # for splitting data into training and testing sets

### 2. Loading the Dataset

In [None]:
df = pd.read_csv(r'C:\Users\91989\OneDrive\Desktop\Python Importing Files Project\FICO_HELOC\Newdataset.csv')

### 3. Assigning Independent and Dependent Variables

In [None]:
# Define the list of variable names
variable_names = list(df.columns[1:])

# Create the predictor variable X as a NumPy array (matrix) - Works well with ML algorithms
X = df[variable_names].values

### 4. Transforming the categorical dichotomic target variable into numerical type.

In [None]:
# Encode the target without modifying df in place: "Bad" -> 1 (event), anything else -> 0
y = (df["RiskPerformance"] == "Bad").astype(int).values

In [None]:
# Create a DataFrame for X and y
data = pd.DataFrame(data=X, columns=variable_names)
data['RiskPerformance'] = y

In [None]:
df.head()

### 5. EDA (Exploratory Data Analysis)

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)

In [None]:
# Create a scatter plot matrix with different colors for 'good' (0) and 'bad' (1)
sns.set(style="ticks")
sns.pairplot(data, hue="RiskPerformance", palette={0: "blue", 1: "red"})

# Show the plot
plt.show()

In [None]:
# Select numerical columns from the DataFrame
numerical_columns = df.select_dtypes(include='number')

# Adjust the number of bins, figsize, and layout as needed
numerical_columns.hist(bins=15, figsize=(15, 10), layout=(3, 4))
plt.show()

In [None]:
# Select numerical columns from the DataFrame
numerical_columns = df.select_dtypes(include='number')

# Adjust the figsize as needed
plt.figure(figsize=(15, 10))

# Create box plots for numerical columns
numerical_columns.boxplot()
plt.title('Box Plots for Numerical Columns')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

plt.show()

<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>

# SECTION 2: Model Selection 

### 1. Incorporate Unique Conditions relating to the Model

Here we define the special codes, a dictionary of binning parameters, and a ***BinningProcess*** object built from the variable names.

The data dictionary of this challenge includes three special values/codes: <BR>
-9 means No Bureau Record or No Investigation  <BR>
-8 means No Usable/Valid Trades or Inquiries   <BR>
-7 means Condition not Met (e.g. No Inquiries, No Delinquencies) 

In [None]:
# The Optbinning library helps consider these special codes
special_codes = [-9, -8, -7]
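
To see why these codes need separate handling, here is a minimal sketch in plain Python. The sample values are invented for illustration, and the logic is only an intuition aid, not OptBinning's internal implementation.

```python
# Hypothetical sample of one feature column; -9, -8 and -7 are sentinel
# codes from the data dictionary, not real measurements.
SPECIAL_CODES = {-9, -8, -7}
values = [75, -9, 62, -8, 88, -7, 54, -9]

# Separate genuine measurements from sentinel codes.
regular = [v for v in values if v not in SPECIAL_CODES]
special = [v for v in values if v in SPECIAL_CODES]

# Averaging everything treats the codes as if they were real values.
naive_mean = sum(values) / len(values)
clean_mean = sum(regular) / len(regular)

print(f"naive mean: {naive_mean:.2f}, mean of regular values: {clean_mean:.2f}")
print("special code counts:", {c: special.count(c) for c in sorted(SPECIAL_CODES)})
```

Treating the codes as ordinary numbers drags the summary statistic far below any real observation, which is why they are passed to the binning process as special values instead.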

Note that in scorecard building we impose monotonicity constraints on the probability of the target for many of the variables. We apply these constraints by passing the following dictionary of parameters for the variables involved.

We assume monotonicity mainly because logistic regression assumes a linear relationship between the logit (log-odds) of the target and the independent variables; a monotonic trend across bins keeps the binned variables consistent with that assumption and makes the scorecard easier to explain.
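
As a rough illustration of what a "descending" trend means, the sketch below computes the event rate and the Weight of Evidence (WoE) for a few bins of one feature. The bin counts are invented, and the WoE convention (log of the non-event share over the event share) is an assumption stated in the code.

```python
import math

# Hypothetical bins of one feature, ordered from low to high values.
# Each tuple is (non_events, events); the counts are made up.
bins = [(100, 90), (150, 80), (200, 60), (250, 30)]

total_non_events = sum(n for n, _ in bins)
total_events = sum(e for _, e in bins)

# Event rate per bin: share of "bad" records inside the bin.
event_rates = [e / (n + e) for n, e in bins]

# WoE convention assumed here: log(share of non-events / share of events).
woe = [math.log((n / total_non_events) / (e / total_events)) for n, e in bins]


def is_descending(xs):
    """True if each value is no larger than the one before it."""
    return all(a >= b for a, b in zip(xs, xs[1:]))


print("event rates:", [round(r, 3) for r in event_rates])
print("WoE:", [round(w, 3) for w in woe])
print("event rate descending:", is_descending(event_rates))
```

With this convention, a descending event rate corresponds to an ascending WoE, so the binned feature moves in one direction with the log-odds.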

In [None]:
binning_fit_params = {
    "ExternalRiskEstimate": {"monotonic_trend": "descending"},
    "MSinceOldestTradeOpen": {"monotonic_trend": "descending"},
    "MSinceMostRecentTradeOpen": {"monotonic_trend": "descending"},
    "NumSatisfactoryTrades": {"monotonic_trend": "descending"},
    "NumTrades90Ever2DerogPubRec": {"monotonic_trend": "ascending"},
    "PercentTradesNeverDelq": {"monotonic_trend": "descending"},
    "MSinceMostRecentDelq": {"monotonic_trend": "descending"},
    "NumTradesOpeninLast12M": {"monotonic_trend": "ascending"},
    "MSinceMostRecentInqexcl7days": {"monotonic_trend": "descending"},
    "NumInqLast6M": {"monotonic_trend": "ascending"},
    "NetFractionRevolvingBurden": {"monotonic_trend": "ascending"},
}

In [None]:
binning_process = BinningProcess(variable_names, special_codes=special_codes,
                                 binning_fit_params=binning_fit_params)

### 2. Creating explainable model pipelines.

In [None]:
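# clf1 is trained on the features transformed by the binning process
clf1 = LogisticRegression(solver="lbfgs")

# clf2 is trained on the raw features and serves as a baseline
clf2 = LogisticRegression(solver="lbfgs")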

### 3. Split dataset into train and test AND fit pipelines with training data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
binning_process.fit(X_train, y_train)

OptBinning can also replace the usual binning of selected numerical variables with a piecewise continuous binning: since version 0.9.2, the binning process includes the method ***update_binned_variable***, which updates the optimal binning of one variable without re-processing the rest. That step is not applied in this notebook; the models below use the binning fitted above.

### 4. Comparing the Performance of both Models

In [None]:
clf1.fit(binning_process.transform(X_train), y_train)

In [None]:
clf2.fit(X_train, y_train)

In [None]:
# Finding Confusion Matrix for both Models
from sklearn.metrics import confusion_matrix

# Calculate confusion matrices for both classifiers
y_pred1 = clf1.predict(binning_process.transform(X_test))
y_pred2 = clf2.predict(X_test)

cm1 = confusion_matrix(y_test, y_pred1)
cm2 = confusion_matrix(y_test, y_pred2)

# Function to plot a confusion matrix with labels
def plot_confusion_matrix(ax, cm, title):
    sns.set(font_scale=1.2)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', linewidths=.5, annot_kws={"size": 16}, ax=ax)
    ax.set_xlabel('Predicted Labels')
    ax.set_ylabel('True Labels')
    ax.set_title(title)

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot confusion matrices on subplots
plot_confusion_matrix(axes[0], cm1, 'Confusion Matrix - Binning + Logistic Regression')
plot_confusion_matrix(axes[1], cm2, 'Confusion Matrix - Logistic Regression')

# Adjust spacing between subplots
plt.tight_layout()

# Show the figure with both confusion matrices
plt.show()

In [None]:
y_pred = clf1.predict(binning_process.transform(X_test))
print(classification_report(y_test, y_pred))

In [None]:
y_pred = clf2.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
probs = clf1.predict_proba(binning_process.transform(X_test))
preds = probs[:,1]
fpr1, tpr1, threshold = roc_curve(y_test, preds)
roc_auc1 = auc(fpr1, tpr1)

probs = clf2.predict_proba(X_test)
preds = probs[:,1]
fpr2, tpr2, threshold = roc_curve(y_test, preds)
roc_auc2 = auc(fpr2, tpr2)

In [None]:
plt.title('Receiver Operating Characteristic')
plt.plot(fpr1, tpr1, 'b', label='Binning+LR: AUC = {0:.5f}'.format(roc_auc1))
plt.plot(fpr2, tpr2, 'g', label='LR: AUC = {0:.5f}'.format(roc_auc2))
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1],'k--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#### Based on this comparison, we choose the logistic regression model with binning over the model without binning

### 5. Binning Process Statistics

The fitted binning process can report its option settings, problem statistics, and timings.

In [None]:
binning_process.information(print_level=2)

In [None]:
binning_process.summary()

### 6. Retrieve Binning Process Stats for EACH individual variable
The ***get_binned_variable*** method serves to retrieve an optimal binning object, which can be analyzed in detail afterward.

In [None]:
optb = binning_process.get_binned_variable("ExternalRiskEstimate")

In [None]:
optb.binning_table.build()

In [None]:
# We disable sns for the moment as OPTBINNING Library already gives plots.
sns.reset_orig()

In [None]:
optb.binning_table.plot(metric="event_rate")

In [None]:
optb.binning_table.analysis()

<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>

# SECTION 3: Scorecard Building

This section builds a scorecard for a binary target, using logistic regression as the estimator.

### 1. Importing Libraries

In [None]:
from optbinning import BinningProcess
from optbinning import Scorecard

### 2. Loading the Dataset

In [None]:
df = pd.read_csv(r'C:\Users\91989\OneDrive\Desktop\Python Importing Files Project\FICO_HELOC\Newdataset.csv')

### 3. Assigning Independent and Dependent Variables

In [None]:
# Define the list of variable names
variable_names = list(df.columns[1:])

# Create the predictor variable X as a DataFrame - more suitable for data analysis & manipulation.
X = df[variable_names]
# Previously we created a predictor variable X as a NumPy array (matrix) ((X = df[variable_names].values))

### 4. Transforming the categorical dichotomic target variable into numerical type.

In [None]:
# Encode the target without modifying df in place: "Bad" -> 1 (event), anything else -> 0
y = (df["RiskPerformance"] == "Bad").astype(int).values

### 5. Incorporating Selection Criteria for importance of Variables
We specify selection criteria in terms of the Information Value (IV) and a minimum quality score to remove low-quality variables.

In [None]:
selection_criteria = {
    "iv": {"min": 0.02, "max": 1},
    "quality_score": {"min": 0.01}
}
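
To make the IV bounds concrete, the following sketch computes the Information Value of two made-up variables from per-bin counts and applies the same 0.02 to 1 filter. The variable names and counts are hypothetical, and the quality score is not reproduced here.

```python
import math


def information_value(bins):
    """IV of one variable from a list of (non_events, events) counts per bin."""
    total_non_events = sum(n for n, _ in bins)
    total_events = sum(e for _, e in bins)
    iv = 0.0
    for n, e in bins:
        p_non_event = n / total_non_events
        p_event = e / total_events
        iv += (p_non_event - p_event) * math.log(p_non_event / p_event)
    return iv


# Hypothetical bin counts for two invented variables.
candidates = {
    "strong_feature": [(100, 300), (200, 200), (300, 100)],
    "weak_feature": [(200, 198), (200, 202), (200, 200)],
}

iv_min, iv_max = 0.02, 1.0  # same bounds as selection_criteria["iv"]
ivs = {name: information_value(b) for name, b in candidates.items()}
selected = [name for name, iv in ivs.items() if iv_min <= iv <= iv_max]

for name, iv in ivs.items():
    print(f"{name}: IV = {iv:.4f}")
print("selected:", selected)
```

The lower bound drops variables with almost no predictive power, while the upper bound guards against variables that look suspiciously predictive.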

### 6. Instantiate the Binning Process
Then, we instantiate a ***BinningProcess*** object class with variable names, special codes and selection criteria.

In [None]:
binning_process = BinningProcess(variable_names, special_codes=special_codes,
                                 selection_criteria=selection_criteria)

### 7. Choosing a Suitable ML Model

In [None]:
# We select as an estimator a logistic regression to be solved using the non-linear solver L-BFGS-B.
estimator = LogisticRegression(solver="lbfgs")

### 8. Instantiate the Scorecard

We instantiate a Scorecard class with a binning process object and an estimator. We also apply a scaling method to the scorecard points.

In [None]:
scorecard = Scorecard(binning_process=binning_process,
                      estimator=estimator, scaling_method="min_max",
                      scaling_method_params={"min": 300, "max": 850})   # To keep in the FICO Score Range
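
The idea behind min-max scaling can be sketched as a linear map that sends the lowest and highest attainable total scores to 300 and 850. The raw points below are invented, and this is only a sketch of the general technique; OptBinning's internal computation may differ in its details.

```python
# Hypothetical raw points per bin for two invented variables.
raw_points = {
    "var_a": [-1.2, 0.1, 0.9],
    "var_b": [-0.5, 0.4],
}
target_min, target_max = 300, 850

# Lowest and highest total a record can reach before scaling.
lowest_total = sum(min(p) for p in raw_points.values())
highest_total = sum(max(p) for p in raw_points.values())

slope = (target_max - target_min) / (highest_total - lowest_total)
# Spread the constant offset evenly across variables so that the
# per-variable points still add up to the total score.
offset = (target_min - slope * lowest_total) / len(raw_points)

scaled = {
    name: [slope * p + offset for p in points]
    for name, points in raw_points.items()
}

scaled_min_total = sum(min(p) for p in scaled.values())
scaled_max_total = sum(max(p) for p in scaled.values())
print(f"score range: {scaled_min_total:.1f} to {scaled_max_total:.1f}")
```

Because the map is linear, the ranking of applicants is unchanged; only the scale of the points moves into the familiar FICO-style range.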

In [None]:
scorecard.fit(X, y, show_digits=4)

### 9. Scorecard Process Statistics

Similar to other objects in OptBinning, we can print overview information about the options settings, problems statistics, and the number of selected variables after the binning process. 

With these settings, using the selection criteria, **1** variable is removed.

In [None]:
scorecard.information(print_level=2)

The method ***table*** returns the scorecard table. A scorecard table has a wide range of real-world business applications, being an interpretable tool to summarize relationships among variables. The scorecard table can handle binary and continuous targets. Two scorecard styles are available: ***style="summary"*** shows the variable name, and their corresponding bins and assigned points; ***style="detailed"*** adds information from the corresponding binning table.

In [None]:
scorecard.table(style="summary")

In [None]:
scorecard.table(style="detailed")

<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>

# SECTION 4: Deriving Business Implications

In [None]:
# Set pandas display options to show all rows without truncation
pd.set_option('display.max_rows', None)

# Display the detailed scorecard table
scorecard_table = scorecard.table(style="detailed")
scorecard_table

### 1. Group the variable using aggregate Information Value (I.V.) and Logistic Coefficient


In [None]:
variable_iv_coeff = scorecard_table.groupby("Variable").agg({"IV": "sum", "Coefficient": "mean"})
print(variable_iv_coeff)

### 2. Plotting Feature Importances for Business Analysis

In [None]:
# Sort the DataFrame by IV in descending order
sorted_iv_coeff = variable_iv_coeff.sort_values(by='IV', ascending=False)

# Plot feature importance based on IV
plt.figure(figsize=(10, 6))
plt.barh(sorted_iv_coeff.index, sorted_iv_coeff['IV'], color='skyblue')
plt.xlabel('IV (Information Value)')
plt.title('Feature Importance Based on IV')
plt.gca().invert_yaxis()
plt.show()


# Sort the DataFrame by the absolute values of Coefficients in descending order
sorted_iv_coeff['Abs_Coefficient'] = abs(sorted_iv_coeff['Coefficient'])
sorted_iv_coeff = sorted_iv_coeff.sort_values(by='Abs_Coefficient', ascending=False)

# Plot feature importance based on Coefficients
plt.figure(figsize=(10, 6))
plt.barh(sorted_iv_coeff.index, sorted_iv_coeff['Coefficient'], color='lightcoral')
plt.xlabel('Coefficient')
plt.title('Feature Importance Based on Coefficients')
plt.gca().invert_yaxis()
plt.show()

For business analysis, we consider using ONLY Information Value (IV) for feature importance as IV is specifically designed for assessing the predictive power of variables in logistic regression models and is widely used in credit scoring and risk assessment. It provides a more direct measure of the variables' impact on the target variable (e.g., default or non-default) and helps identify the most influential features for decision-making. **Logistic regression coefficients can be affected by variable scaling and multicollinearity, making IV a more robust choice for feature importance in this context.**

Based on the provided data with Information Value (IV) and Logistic Regression Coefficients, here are some business implications:
### Important Variables from a Business Standpoint
1. **External Risk Estimate (IV: 0.97)**: A higher External Risk Estimate indicates a lower credit risk. Consider offering more favorable terms (e.g., lower interest rates) to customers with higher External Risk Estimates. This can attract low-risk borrowers and reduce default rates.

2. **Manage Revolving Burden (IV: 0.55)**: Customers with a high revolving burden (high credit card balances) are riskier. Encourage responsible credit card use and educate customers on managing their balances effectively.

3. **Percent of Trades Never Delinquent (IV: 0.35)**: Customers with a high percentage of non-delinquent trades are lower risk. Offer benefits or lower rates to customers with a strong history of on-time payments.

### Other Variables to Look out for

1. **Recent Delinquencies Matter (IV: 0.26)**: Pay close attention to customers with a recent history of delinquencies. Implement proactive customer support or targeted promotions to help them get back on track and reduce the risk of default.

2. **Oldest Trade Open (IV: 0.23)**: Longer credit histories indicate stability. Consider offering incentives or preferential terms to customers with longer trade histories to attract loyal, low-risk customers.

3. **Short Time Since Recent Inquiries (IV: 0.17)**: Customers with frequent recent credit inquiries may represent higher credit risk. Implement stricter approval criteria or adjust interest rates for this segment to mitigate potential default.

### Remaining Variables

1. **Inquiries in the Last 6 Months (IV: 0.09)**: Be cautious with customers who have had multiple inquiries in the last six months. They may be actively seeking credit and could be higher risk. Evaluate their creditworthiness more carefully.

2. **Satisfactory Trades (IV: 0.12)**: Customers with a higher number of satisfactory trades are less likely to default. Consider offering incentives or rewards for maintaining a positive trade history.

3. **Trades with Derogatory Records (IV: 0.13, Coefficient: 0.06)**: While the IV suggests importance, the coefficient is low. Monitor this variable but focus on other high-impact factors for now.

 
These implications provide a simple and actionable strategy to manage credit risk and make data-driven decisions to reduce defaults and attract valuable customers.

<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>

# SECTION 5: Scorecard performance (Model Validation) 

### 1. Computing predicted probabilities of the fitted estimator.

In [None]:
df = pd.read_csv(r'C:\Users\91989\OneDrive\Desktop\Python Importing Files Project\FICO_HELOC\Newdataset.csv')

In [None]:
y_pred = scorecard.predict_proba(X)[:, 1]

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
from sklearn.linear_model import LogisticRegression

from optbinning import BinningProcess
from optbinning import Scorecard
from optbinning.scorecard import plot_auc_roc, plot_cap, plot_ks

### 2. Plotting Classification Curves to measure Performance

In [None]:
plot_auc_roc(y, y_pred)
plt.show()

AUC-ROC (Area Under the Receiver Operating Characteristic Curve):

* **Significance**: AUC-ROC is a widely used metric in credit scoring and scorecard development. It evaluates the model's ability to discriminate between good and bad instances (e.g., creditworthy vs. non-creditworthy customers). It shows how well the model separates positive and negative cases across different probability thresholds.
* **Usefulness**: AUC-ROC provides an overall measure of the model's discriminative power but doesn't take into account the specific thresholds for decision-making. It's a good initial indicator of model performance.
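
One way to read the AUC is as the share of (event, non-event) pairs in which the event receives the higher predicted probability. The sketch below computes it that way on a tiny invented sample; it is meant to build intuition and is not how `plot_auc_roc` is implemented.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the share of (event, non-event) pairs ranked correctly; ties count half."""
    events = [s for t, s in zip(y_true, y_score) if t == 1]
    non_events = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(events) * len(non_events))


# Hypothetical labels (1 = default) and predicted probabilities of default.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

auc_value = auc_by_pairs(labels, scores)
gini = 2 * auc_value - 1  # Gini coefficient, often quoted alongside AUC

print(f"AUC = {auc_value:.4f}, Gini = {gini:.4f}")
```

Here 8 of the 9 pairs are ordered correctly, so the AUC is 8/9.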

In [None]:
plot_cap(y, y_pred)
plt.show()

CAP Profile (Cumulative Accuracy Profile):

* **Significance**: The CAP curve compares the cumulative percentage of positive outcomes against the cumulative percentage of cases targeted by the model. It helps understand how well the model is performing compared to random selection.
* **Usefulness**: The CAP curve allows you to assess how effective your model is at differentiating high-risk cases early. It's particularly relevant in credit scoring to evaluate how well the model rank-orders risk and what proportion of default cases is captured among the highest-risk applicants.

In [None]:
plot_ks(y, y_pred)
plt.show()

KS Curve (Kolmogorov-Smirnov Curve):

* **Significance**: The KS statistic is often used in credit scoring to assess the separation between the cumulative distributions of good and bad instances. The KS curve visualizes this separation and helps in identifying an optimal cutoff threshold for making decisions.
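* **Usefulness**: The point on the KS curve where the separation between the two cumulative distributions is largest is a natural candidate for the cutoff threshold, since it is where the true positive rate exceeds the false positive rate by the widest margin. It's valuable as a starting point for the practical threshold of the scorecard, which should also reflect the relative cost of false positives and false negatives.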
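
The KS statistic can be computed directly as the largest gap between the cumulative distributions of the two classes. The sketch below does this on a small invented sample; it illustrates the definition rather than the implementation behind `plot_ks`.

```python
def ks_statistic(y_true, y_score):
    """Return (largest CDF gap, threshold at which it occurs)."""
    events = [s for t, s in zip(y_true, y_score) if t == 1]
    non_events = [s for t, s in zip(y_true, y_score) if t == 0]
    best_gap, best_threshold = 0.0, None
    for thr in sorted(set(y_score)):
        # Share of each class with a score at or below the threshold.
        cdf_event = sum(s <= thr for s in events) / len(events)
        cdf_non_event = sum(s <= thr for s in non_events) / len(non_events)
        gap = abs(cdf_non_event - cdf_event)
        if gap > best_gap:
            best_gap, best_threshold = gap, thr
    return best_gap, best_threshold


# Hypothetical labels (1 = default) and predicted probabilities of default.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.4, 0.3, 0.1]

ks_value, ks_threshold = ks_statistic(labels, scores)
print(f"KS = {ks_value:.2f} at threshold {ks_threshold}")
```

In this sample all non-events but only one of the four events score at or below 0.6, which gives the widest separation.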

### 3. Calculating Scores and Plotting Their Distribution for Event and Non-Event Records

In [None]:
score = scorecard.score(X)

In [None]:
mask = y == 0
plt.hist(score[mask], label="non-event", color="b", alpha=0.35)
plt.hist(score[~mask], label="event", color="r", alpha=0.35)
plt.xlabel("score")
plt.legend()
plt.show()

<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>

# SECTION 6: ScoreCard Insights 

### 1. Sanity Checking
After a score of about 630, the number of events (defaults) declines considerably, and it keeps declining as the score increases.

Compared with the general FICO score distribution chart, this score is approximately in the middle of the FAIR score range.

![image.png](attachment:image.png)

### 2. Data Analysis on Probability of Defaults

In [None]:
scorecard.score(X)

In [None]:
# As we have already fitted the scorecard model as 'scorecard'
# Calculate the predicted probabilities of default for all individuals
pred = scorecard.predict_proba(X)[:, 1]

# Create a DataFrame to display the individual scores and predicted probabilities
individual_scores_df = pd.DataFrame({
    "Individual_Score": scorecard.score(X),  # Scores from the scorecard
    "Probability_of_Default": pred,  # Predicted probabilities
    "RiskPerformance": y # Original RiskPerformance labels
})

# Display the individual scores and predicted probabilities
print(individual_scores_df)

# Export the DataFrame to a CSV file if needed
individual_scores_df.to_csv("individual_scores_with_probabilities.csv", index=False)

In [None]:
# Load the individual_scores_df DataFrame from the CSV file
individual_scores_df = pd.read_csv("individual_scores_with_probabilities.csv")  # Load the updated DataFrame

# Group the DataFrame by 'RiskPerformance' (1 for defaulters, 0 for non-defaulters)
grouped = individual_scores_df.groupby('RiskPerformance')

# Calculate descriptive statistics for scores of defaulters
defaulters_score_stats = grouped.get_group(1)["Individual_Score"].describe()

# Calculate descriptive statistics for scores of non-defaulters
non_defaulters_score_stats = grouped.get_group(0)["Individual_Score"].describe()

# Calculate descriptive statistics for probabilities of default for defaulters
defaulters_prob_stats = grouped.get_group(1)["Probability_of_Default"].describe() * 100  # Multiply by 100 to convert to percentages

# Calculate descriptive statistics for probabilities of default for non-defaulters
non_defaulters_prob_stats = grouped.get_group(0)["Probability_of_Default"].describe() * 100  # Multiply by 100 to convert to percentages

# Create DataFrames for descriptive statistics
score_stats_df = pd.DataFrame({
    'Defaulters Scores': defaulters_score_stats,
    'Non-Defaulters Scores': non_defaulters_score_stats
})

prob_stats_df = pd.DataFrame({
    'Defaulters PD%': defaulters_prob_stats,
    'Non-Defaulters PD%': non_defaulters_prob_stats
})

# Concatenate DataFrames (Side By Side Data Tables)
concatenated_df = pd.concat([score_stats_df, prob_stats_df], axis=1)

# Display the concatenated table without styling
concatenated_df

### 3. Data Analysis on Credit Scores

In [None]:
# Create subplots for individual scores and PDs
fig, axes = plt.subplots(1, 2, figsize=(15, 10))

# Box plot for individual scores
sns.boxplot(ax=axes[0], x='RiskPerformance', y='Individual_Score', data=individual_scores_df)
axes[0].set_title('Box Plot of Individual Scores by Risk Performance')
axes[0].set_xlabel('Risk Performance')
axes[0].set_ylabel('Individual Score')

# Box plot for Probability of Default (PD)
sns.boxplot(ax=axes[1], x='RiskPerformance', y='Probability_of_Default', data=individual_scores_df)
axes[1].set_title('Box Plot of Probability of Default by Risk Performance')
axes[1].set_xlabel('Risk Performance')
axes[1].set_ylabel('Probability of Default')

# Adjust layout and display plots
plt.tight_layout()
plt.show()

In [None]:
# Create subplots for individual scores and PDs histograms
fig, axes = plt.subplots(1, 2, figsize=(15, 7))

# Histogram for individual scores
sns.histplot(ax=axes[0], data=individual_scores_df, x='Individual_Score', hue='RiskPerformance', bins=30, kde=True)
axes[0].set_title('Histogram of Individual Scores by Risk Performance')
axes[0].set_xlabel('Individual Score')
axes[0].set_ylabel('Frequency')
axes[0].legend(title='Risk Performance', labels=['Defaulter', 'Non-Defaulter']) 

# Histogram for Probability of Default (PD)
sns.histplot(ax=axes[1], data=individual_scores_df, x='Probability_of_Default', hue='RiskPerformance', bins=30, kde=True)
axes[1].set_title('Histogram of Probability of Default by Risk Performance')
axes[1].set_xlabel('Probability of Default')
axes[1].set_ylabel('Frequency')
axes[1].legend(title='Risk Performance', labels=['Defaulter', 'Non-Defaulter'])

# Adjust layout and display plots
plt.tight_layout()
plt.show()

In [None]:
# Create summary statistics for individual scores and PDs
summary_stats_scores = individual_scores_df.groupby('RiskPerformance')['Individual_Score'].agg(['mean', 'median']).reset_index()
summary_stats_pds = individual_scores_df.groupby('RiskPerformance')['Probability_of_Default'].agg(['mean', 'median']).reset_index()

# Create subplots for mean individual scores and mean PDs
fig, axes = plt.subplots(1, 2, figsize=(15, 7))

# Bar plot for mean individual scores
sns.barplot(ax=axes[0], x='RiskPerformance', y='mean', data=summary_stats_scores)
axes[0].set_title('Mean Individual Scores by Risk Performance')
axes[0].set_xlabel('Risk Performance')
axes[0].set_ylabel('Mean Individual Score')

# Bar plot for mean PDs
sns.barplot(ax=axes[1], x='RiskPerformance', y='mean', data=summary_stats_pds)
axes[1].set_title('Mean Probability of Default by Risk Performance')
axes[1].set_xlabel('Risk Performance')
axes[1].set_ylabel('Mean Probability of Default')

# Adjust layout and display plots
plt.tight_layout()
plt.show()

<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>
<div style="border-top: 15px solid black;"></div> <br>

# SECTION 7: Scorecard Monitoring

It is important to determine whether the distribution of new data has shifted with respect to the original data used to develop the scorecard. Monitoring is also useful for detecting errors in raw data and tracking scorecard performance.

During the model building and monitoring phases, PSI and CSI can be very powerful metrics. This section explains what they are, then shows when and how to use them.

**Population Stability Index (PSI)**: As the name suggests, it measures the shift in the distribution of a variable across different time intervals. Here the focus is on the model score. The PSI is a divergence measure equivalent to the Information Value (IV), also known as Jeffreys divergence. It assesses whether the actual score distribution has shifted from the expected score distribution.

**Characteristic Stability Index (CSI)**: It measures the change in the distribution of the independent variables over time. It can be used for both testing and performance tracking in a similar way to PSI; the difference is that CSI compares the distributions of individual variables, whereas PSI compares the distribution of model scores.
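
Both indices rely on the same divergence formula, applied either to score bins (PSI) or to the bins of a single variable (CSI). The sketch below computes it from invented bin counts; the 0.1 and 0.25 cutoffs in the comments are a commonly quoted rule of thumb, not a property of the library.

```python
import math


def psi(expected_counts, actual_counts):
    """Sum over bins of (actual share - expected share) * ln(actual share / expected share)."""
    total_expected = sum(expected_counts)
    total_actual = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        p_expected = e / total_expected
        p_actual = a / total_actual
        value += (p_actual - p_expected) * math.log(p_actual / p_expected)
    return value


# Hypothetical record counts per score bin.
expected = [100, 200, 400, 200, 100]  # development sample
stable = [52, 98, 201, 99, 50]        # new sample with a similar shape
shifted = [200, 300, 300, 150, 50]    # new sample skewed towards low scores

psi_stable = psi(expected, stable)
psi_shifted = psi(expected, shifted)

# Rule of thumb (assumption): below 0.1 is stable, above 0.25 is a major shift.
print(f"PSI stable sample:  {psi_stable:.4f}")
print(f"PSI shifted sample: {psi_shifted:.4f}")
```

Empty bins would make the logarithm undefined, so a real implementation needs a small adjustment for zero counts.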

We split the data into training and test sets to check how robust the developed scorecard is on the test set.

### 1. Import Libraries

In [None]:
from optbinning.scorecard import ScorecardMonitoring

### 2. Loading the Dataset

In [None]:
df = pd.read_csv(r'C:\Users\91989\OneDrive\Desktop\Python Importing Files Project\FICO_HELOC\Newdataset.csv')

### 3. Assigning Independent and Dependent Variables

In [None]:
# Define the list of variable names
variable_names = list(df.columns[1:])

# Create the predictor variable X as a DataFrame
X = df[variable_names]

### 4. Transforming the categorical dichotomic target variable into numerical type.

In [None]:
target = "RiskPerformance"
# Encode the target without modifying df in place: "Bad" -> 1 (event), anything else -> 0
y = (df[target] == "Bad").astype(int).values

### 5. Splitting the Data in Training and Testing set (as we want to find PSI)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

### 6. Instantiate and Fit the Scorecard

Now, we instantiate a Scorecard class with a binning process object and an estimator, and fit it with the training data. We also apply a scaling method to the scorecard points.

In [None]:
scorecard = Scorecard(binning_process=binning_process,
                      estimator=estimator, scaling_method="min_max",
                      scaling_method_params={"min": 0, "max": 100})   # Scale scorecard points to a 0-100 range

In [None]:
scorecard.fit(X_train, y_train, metric_special="empirical", metric_missing="empirical")

### 7. Scorecard Process Statistics

In [None]:
scorecard.information(print_level=2)

### 8. Discriminate between the Train and Test sets 

Once the scorecard is fitted, we use the ScorecardMonitoring class to check that the resulting scorecard discriminates on both train and test data. Furthermore, this class analyzes whether the distributions of train and test data differ significantly. In practice, the training data would be the (expected) data used for scorecard development, whereas the test data stands in for the (actual) evolved data.

In [None]:
monitoring = ScorecardMonitoring(scorecard=scorecard, psi_method="cart",
                                 psi_n_bins=10, verbose=True)

In [None]:
monitoring.fit(X_test, y_test, X_train, y_train)

### 9. Population Stability Index (PSI)

In [None]:
monitoring.psi_table()

We can plot the PSI table using method ***psi_plot***, where the population distribution and event rate for each bin (Bin ID) are shown.

In [None]:
monitoring.psi_plot()

This analysis computes statistical tests to determine if the event rate on train and test data are significantly different using the Chi-square test. The null hypothesis is that actual = expected.
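
For intuition, a chi-square test of equal event rates on a single bin can be sketched as follows. The counts are invented, no continuity correction is applied, and the exact settings used by `tests_table` may differ; the p-value uses the closed form for one degree of freedom.

```python
import math


def chi2_event_rate_test(events_actual, total_actual, events_expected, total_expected):
    """Chi-square test (1 degree of freedom) that two samples share one event rate."""
    observed = [
        [events_actual, total_actual - events_actual],
        [events_expected, total_expected - events_expected],
    ]
    row_totals = [total_actual, total_expected]
    col_totals = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]
    grand_total = total_actual + total_expected

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected_cell = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected_cell) ** 2 / expected_cell

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical bin: 50 events out of 100 actual records vs 30 out of 100 expected.
chi2_diff, p_diff = chi2_event_rate_test(50, 100, 30, 100)
# Hypothetical bin with identical event rates (30%) in both samples.
chi2_same, p_same = chi2_event_rate_test(30, 100, 60, 200)

print(f"different rates: chi2 = {chi2_diff:.3f}, p = {p_diff:.4f}")
print(f"equal rates:     chi2 = {chi2_same:.3f}, p = {p_same:.4f}")
```

A small p-value suggests rejecting the null hypothesis that the actual and expected event rates are equal for that bin.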

In [None]:
monitoring.tests_table()

### 10. Characteristic Stability Index (CSI)

The ***ScorecardMonitoring*** also implements functionalities to perform the characteristic stability report. The ***psi_variable_table*** method returns the PSI using the optimal bins incorporated in the scorecard at a characteristic level.

In [None]:
monitoring.psi_variable_table(style="summary")

In [None]:
monitoring.psi_variable_table(style="detailed")