<a href="https://colab.research.google.com/github/mar7i4ka/Lin_Reg/blob/main/%F0%9F%93%9CComplete_Credit_Risk_Modeling_%F0%9F%8E%AF_%7C_2_SC_WoE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
husainsb_lendingclub_issued_loans_path = kagglehub.dataset_download('husainsb/lendingclub-issued-loans')
wordsforthewise_lending_club_path = kagglehub.dataset_download('wordsforthewise/lending-club')
ethon0426_lending_club_20072020q1_path = kagglehub.dataset_download('ethon0426/lending-club-20072020q1')
adarshsng_lending_club_loan_data_csv_path = kagglehub.dataset_download('adarshsng/lending-club-loan-data-csv')
beatafaron_loan_credit_risk_and_population_stability_path = kagglehub.dataset_download('beatafaron/loan-credit-risk-and-population-stability')

print('Data source import complete.')


## Continuation of a previous notebook
> This notebook continues 📜Complete Credit Risk Modeling 🔍 | 1. EDA.
> Direct link:
> https://www.kaggle.com/code/beatafaron/complete-credit-risk-modeling-1-eda

# 1.📘 SCOPE

> In this continuation we create a behavioral scorecard. The general framework involves setting score ranges and corresponding risk groups (e.g., high-risk, medium-risk, low-risk) based on statistical analysis and business policies.


## **1. 1. Behavioral Scorecards**

**Score Distribution**:
   - Behavioral scorecards typically assign scores on a scale (e.g., 300–900) that reflects the likelihood of favorable behavior (e.g., repayment).
   - Higher scores indicate lower risk (e.g., low probability of default).
   - Lower scores indicate higher risk (e.g., high probability of default).
   - Risk thresholds are typically based on statistical analysis (e.g., percentiles or clustering) or business rules (e.g., cutoffs for default rates).
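
As a minimal sketch of how score ranges map to risk groups, the snippet below bins a few scores with `pd.cut`. The scores, the cutoffs (550 and 700) and the group labels are hypothetical values chosen only for illustration; a real policy would derive them from observed default rates.

```python
import pandas as pd

# Hypothetical scores on a 300-900 scale
scores = pd.Series([310, 480, 555, 640, 720, 890], name="score")

# Hypothetical cutoffs; not taken from this notebook's data
band_edges = [300, 550, 700, 900]
band_labels = ["high-risk", "medium-risk", "low-risk"]

# Higher score -> lower risk, so the lowest band is the high-risk one
risk_group = pd.cut(scores, bins=band_edges, labels=band_labels, include_lowest=True)
print(pd.concat([scores, risk_group.rename("risk_group")], axis=1))
```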


---

## **1. 2. Rules for Assigning Scores**

1. **Weight of Evidence (WOE) Transformation**:
   - Continuous features (e.g., credit utilization, payment-to-balance ratio) and categorical features are binned and transformed into WOE values for score calculation.

2. **Logistic Regression**:
   - Scores are typically derived from logistic regression models, where the predicted probability of default (PD) is converted into a score.

3. **Score Scaling**:
   - Scores are scaled to a user-friendly range using the formula:
$$
\text{Score} = \text{Base Score} + \text{Factor} \times \log \left( \frac{1 - \text{PD}}{\text{PD}} \right)
$$
     Where:
     - **Base Score**: Starting point for scores (e.g., 300); under this formula it is the score at even odds (PD = 0.5).
     - **Factor**: Controls the spread of the distribution; for 20 points per doubling of the odds, Factor = 20 / ln 2 ≈ 28.85.
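
The scaling formula can be sketched in a few lines. The function name `pd_to_score` and the default values (base score 300, 20 points to double the odds) are assumptions for illustration, not parameters used later in this notebook.

```python
import numpy as np

def pd_to_score(pd_value, base_score=300.0, pdo=20.0):
    """Score = base_score + factor * ln((1 - PD) / PD), with factor = pdo / ln(2)."""
    factor = pdo / np.log(2)
    return base_score + factor * np.log((1 - pd_value) / pd_value)

score_even = pd_to_score(0.5)      # odds of good to bad are 1:1
score_double = pd_to_score(1 / 3)  # odds of good to bad are 2:1

print(score_even, score_double)
```

Doubling the odds from 1:1 to 2:1 raises the score by `pdo` points, which is what "20 per doubling of odds" means.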

---


In [None]:
"""
Jupyter: scikit-learn 1.5.1
Kaggle: scikit-learn 1.2.2 (Older version!)
This version mismatch is causing the huge difference in logistic regression coefficients (intercept_ values).

model.intercept_ in Kaggle : array([-7.52160171])
model.intercept_ in Jupyter : array([-3.48804904])
"""
!pip install --upgrade scikit-learn==1.5.1


In [None]:
import sklearn
import pandas as pd
import numpy as np

print("scikit-learn version:", sklearn.__version__)
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)


In [None]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')
pd.options.display.max_columns = None

# 2.📌Read & Bin

In [None]:
df=pd.read_csv('/kaggle/input/loan-credit-risk-and-population-stability/df_2014-18_selected.csv')
dictionary=pd.read_csv('/kaggle/input/loan-credit-risk-and-population-stability/dictionary_selected.csv')
df = df.round(3)
df.head()

### 📌 Binning

Here I include a short digression on quick and clever data binning. We will build two functions and compare them to choose a good way of binning continuous variables.

1. First one **`bin_and_plot_woe_manual`** → Uses `pd.cut()` to bin a continuous variable and plots WOE.  
2. Second one **`bin_and_plot_woe_tree`** → Uses `DecisionTreeClassifier` to bin a continuous variable and plots WOE.  

---


###  **Differences Between the Two Methods**
| Method | Description | Pros | Cons |
|--------|------------|------|------|
| **`pd.cut()` (Equal Width Binning)** | Divides data into `bins` of equal width | Simple, interpretable | Might not capture patterns well |
| **DecisionTreeClassifier Binning** | Uses tree-based splits to define bins | Data-driven, captures patterns | Can overfit if `max_depth` is too high |

---
Let's check it out!


In [None]:
df.total_rec_late_fee.describe()

> Although the mean and the standard deviation are both around 2, the maximum is very large: 1.598520e+03. We probably have outliers. Let's check how they influence the binning.
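
To see why a few extreme values matter for equal-width binning, here is a small sketch on synthetic data (not the notebook's dataset). The exponential distribution, the single outlier at 1600 and the 99th-percentile clip are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "fee" values plus one extreme outlier
fees = pd.Series(np.append(rng.exponential(scale=2.0, size=999), 1600.0))

# Equal-width bins on the raw values: the outlier stretches the range
bins_raw = pd.cut(fees, bins=5, include_lowest=True)

# Same binning after clipping at the 99th percentile
clipped = fees.clip(upper=fees.quantile(0.99))
bins_clipped = pd.cut(clipped, bins=5, include_lowest=True)

print(bins_raw.value_counts(sort=False))
print(bins_clipped.value_counts(sort=False))
```

With the outlier present, almost every observation lands in the first bin, so the bins carry very little information about the target.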

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def plot_woe_distribution(woe_df, feature):
    """
    Plot WOE distribution.

    Args:
        woe_df (pd.DataFrame): DataFrame with bins and WOE values.
        feature (str): Feature name for title.
    """
    plt.figure(figsize=(4, 3))
    bin_labels = [str(b) for b in woe_df['bin']]
    plt.plot(bin_labels, woe_df['WOE'], marker='o', linestyle='-', color='lightblue', label='WOE')

    # Style improvements
    plt.xlabel('Bins', fontsize=8)
    plt.ylabel('Weight of Evidence (WOE)', fontsize=8)
    plt.title(f'WOE Distribution for {feature}', fontsize=8)
    plt.xticks(rotation=45, ha='right', fontsize=8)
    plt.yticks(fontsize=8)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.legend(fontsize=10)

    # Remove all four spines (black borders)
    ax = plt.gca()
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    plt.show()

# 3.🎯WoE 🎯IV

In [None]:
def bin_and_plot_woe_manual(data, feature, target, bins=5, eps=0.0001):
    """
    Bin a continuous variable using equal-width binning (pd.cut) and plot WOE.

    Args:
        data (pd.DataFrame): The dataset containing the feature and target.
        feature (str): The name of the continuous feature.
        target (str): The binary target variable.
        bins (int): Number of bins for pd.cut().
        eps (float): Small value to prevent division by zero.

    Returns:
        woe_df (pd.DataFrame): DataFrame with bins and WOE values.
    """
    # Create bins using pd.cut
    data['bin'] = pd.cut(data[feature], bins=bins, include_lowest=True, duplicates='drop')

    # Aggregate event & non-event counts
    woe_df = data.groupby('bin', observed=True).agg(
        total_count=(target, 'count'),
        event_count=(target, 'sum')
    ).reset_index()

    woe_df['non_event_count'] = woe_df['total_count'] - woe_df['event_count']

    # Compute event and non-event rates (avoid division by zero)
    woe_df['event_rate'] = (woe_df['event_count'] + eps) / woe_df['event_count'].sum()
    woe_df['non_event_rate'] = (woe_df['non_event_count'] + eps) / woe_df['non_event_count'].sum()

    # Compute WOE
    woe_df['WOE'] = round(np.log(woe_df['non_event_rate'] / woe_df['event_rate']),4)

    # Compute IV for each bin
    woe_df['IV'] = round((woe_df['non_event_rate'] - woe_df['event_rate']) * woe_df['WOE'],4)

    # Compute total IV
    total_IV = woe_df['IV'].sum()
    print(f'Total IV for {feature}: {total_IV:.4f}')

    # Convert bins to string format for plotting
    # woe_df['bin_str'] = woe_df['bin'].astype(str)

    # Plot WOE distribution
    plot_woe_distribution(woe_df, feature)

    return woe_df

In [None]:
from sklearn.tree import DecisionTreeClassifier
from decimal import Decimal, ROUND_DOWN

def truncate(value, decimals=3):
    factor = Decimal('1.' + '0' * decimals)
    return float(Decimal(value).quantize(factor, rounding=ROUND_DOWN))


def bin_and_plot_woe_tree(data, feature, target, max_depth=4, min_samples_leaf=2000, eps=0.0001):
    """
    Bin a continuous variable using DecisionTreeClassifier and compute WOE.

    Args:
        data (pd.DataFrame): The dataset containing the feature and target.
        feature (str): The name of the continuous feature.
        target (str): The binary target variable.
        max_depth (int): Max depth of the decision tree for binning.
        min_samples_leaf (int): Minimum samples per leaf to avoid overfitting.
        eps (float): Small value to prevent division by zero.

    Returns:
        woe_df (pd.DataFrame): DataFrame with bins, WOE values, and IV.
    """
    # Fit Decision Tree for binning
    X = data[[feature]]
    y = data[target]
    tree = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=42)
    tree.fit(X, y)

    # Get bin edges from decision tree and truncate to 3 decimal places
    thresholds = tree.tree_.threshold
    thresholds = thresholds[thresholds != -2]  # Remove leaf node markers
    thresholds = sorted(thresholds)
    thresholds = [truncate(th, 3) for th in thresholds]

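    # Define bin edges. The minimum and maximum are left untruncated because
    # rounding them towards zero could push the extreme values outside the outer bins.
    bins = [float(data[feature].min())] + thresholds + [float(data[feature].max())]
    bins = sorted(set(bins))  # truncation can produce duplicate edges, which pd.cut rejects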

    # Bin the data
    data['bin'] = pd.cut(data[feature], bins=bins, include_lowest=True)

    # Aggregate event & non-event counts
    woe_df = data.groupby('bin').agg(
        total_count=(target, 'count'),
        event_count=(target, 'sum')
    ).reset_index()
    woe_df['non_event_count'] = woe_df['total_count'] - woe_df['event_count']

    # Compute event and non-event rates (avoid division by zero)
    woe_df['event_rate'] = (woe_df['event_count'] + eps) / woe_df['event_count'].sum()
    woe_df['non_event_rate'] = (woe_df['non_event_count'] + eps) / woe_df['non_event_count'].sum()

    # Compute WOE
    woe_df['WOE'] = np.log(woe_df['non_event_rate'] / woe_df['event_rate'])

    # Compute IV for each bin
    woe_df['IV'] = (woe_df['non_event_rate'] - woe_df['event_rate']) * woe_df['WOE']

    # Compute total IV
    total_IV = woe_df['IV'].sum()
    print(f'Total IV for {feature}: {total_IV:.4f}')

    # Plot WOE distribution
    plot_woe_distribution(woe_df, feature)

    return woe_df


In [None]:
print("Weight of evidence - Equal Width Binning")
woe_manual = bin_and_plot_woe_manual(df, 'total_rec_late_fee', 'loan_status_binary', bins=5)
print("Weight of evidence - Decision Tree Classifier Binning")
woe_tree = bin_and_plot_woe_tree(df, 'total_rec_late_fee', 'loan_status_binary', max_depth=4)

> Look how different the results are! The two plots use different bins. For the manual (equal-width) plot we should check the outliers and try binning without them.
> Notice how much the Information Value (IV) changes when the data are grouped properly. <br>
> 1. Equal-width binning: total IV for total_rec_late_fee: 0.0004
> 2. Decision tree binning: total IV for total_rec_late_fee: 0.2404 <br>
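
For reference, the WOE and IV arithmetic can be checked by hand on a tiny table. The counts below are made up for illustration; the sign convention (non-event share over event share) follows the functions defined in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical counts per bin
toy = pd.DataFrame({
    "bin": ["low", "mid", "high"],
    "event_count": [100, 300, 600],      # rows with target == 1
    "non_event_count": [500, 300, 200],  # rows with target == 0
})

# Share of all events / non-events that fall into each bin
toy["event_dist"] = toy["event_count"] / toy["event_count"].sum()
toy["non_event_dist"] = toy["non_event_count"] / toy["non_event_count"].sum()

# WOE per bin and its contribution to the Information Value
toy["WOE"] = np.log(toy["non_event_dist"] / toy["event_dist"])
toy["IV"] = (toy["non_event_dist"] - toy["event_dist"]) * toy["WOE"]
total_iv = toy["IV"].sum()

print(toy)
print(f"Total IV: {total_iv:.4f}")
```

A bin whose event and non-event shares are equal (the `mid` row) has a WOE of zero and adds nothing to the IV.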


In [None]:
woe_tree.head()

In [None]:
woe_manual.head()

> As we can see, there are two outliers. In the manual binning the bulk of the distribution falls into a single bin from -1.59 to 319, whereas DecisionTreeClassifier narrows the first bin to the range from -inf to 0.035.

In [None]:
df=pd.read_csv('/kaggle/input/loan-credit-risk-and-population-stability/df_2014-18_selected.csv')
df = df.round(3)
df.head()

In [None]:
# set target and features
target = 'loan_status_binary'
features = df.columns.drop(target).tolist()
features

In [None]:
df_prep=pd.DataFrame()
dropped_first=pd.DataFrame()

def transform_to_dummy(data, feature, woe_df, df_prep=None, dropped_first=None):
    """
    Transform a continuous feature into categorical dummy variables based on binning.

    Args:
        data (pd.DataFrame): The original dataset.
        feature (str): The name of the continuous feature to transform.
        woe_df (pd.DataFrame): DataFrame containing bin information from binning.
        df_prep (pd.DataFrame, optional): DataFrame to store transformed categorical dummy variables.
        dropped_first (pd.DataFrame, optional): DataFrame to store dropped bin information.

    Returns:
        df_prep (pd.DataFrame): Updated dataset with only the transformed categorical dummy variables.
        dropped_first (pd.DataFrame): Updated DataFrame storing the feature and dropped bin information.
    """
    # Ensure bins are applied to the original data
    bin_labels = woe_df['bin'].astype(str).values
    data['bin'] = pd.cut(
        data[feature],
        bins=[b.left for b in woe_df['bin'].cat.categories] + [woe_df['bin'].cat.categories[-1].right],
        include_lowest=True,
        labels=bin_labels
    )

    # Identify the dropped bin
    dropped_bin = bin_labels[0] if len(bin_labels) > 0 else None
    new_dropped = pd.DataFrame({'feature': [feature], 'dropped_bin': [dropped_bin], 'col_name': [feature + "_" + dropped_bin]})

    # Convert categorical bins into dummy variables with proper naming
    new_df_prep = pd.get_dummies(data[['bin']], columns=['bin'], prefix=f"{feature}", drop_first=False)

    # Append to existing dataframes if provided
    if df_prep is not None:
        df_prep = pd.concat([df_prep, new_df_prep], axis=1)
    else:
        df_prep = new_df_prep

    if dropped_first is not None:
        dropped_first = pd.concat([dropped_first, new_dropped], ignore_index=True)
    else:
        dropped_first = new_dropped

    return df_prep, dropped_first


## 3. 1. Using Decision Tree Classifier

In [None]:
for feature in features:
    woe_tree = bin_and_plot_woe_tree(df, feature, 'loan_status_binary', max_depth=4)
    df_prep, dropped_first = transform_to_dummy(df, feature, woe_tree, df_prep, dropped_first)

> Great! Now let's check how our data look after binning and transforming to dummy variables.
>
> The first (reference) bin of each feature, which will be dropped, is stored in `dropped_first`.
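
The idea of one dummy column per bin, with the first bin kept aside as the reference category, can be sketched on a toy feature. The values, the bin edges and the name `fee` are hypothetical.

```python
import pandas as pd

# Hypothetical feature values and bin edges
values = pd.Series([0.0, 0.5, 3.0, 12.0, 40.0], name="fee")
binned = pd.cut(values, bins=[0, 1, 10, 50], include_lowest=True)

# One indicator column per bin, named "<feature>_<bin>"
dummies = pd.get_dummies(binned, prefix="fee")

# Treat the first bin as the reference category and drop its column
reference_col = dummies.columns[0]
dummies_ref_dropped = dummies.drop(columns=[reference_col])

print(dummies.columns.tolist())
print(dummies_ref_dropped.head())
```

Dropping one column per feature avoids perfectly collinear dummies; the dropped bin then acts as the baseline with an implicit coefficient of 0.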

In [None]:
df_prep.head()

In [None]:
dropped_first

In [None]:
df_prep_dropped_first=pd.DataFrame()
df_prep_dropped_first = df_prep.drop(columns=dropped_first["col_name"].tolist(), errors='ignore')


In [None]:
dropped_first.shape, df_prep_dropped_first.shape, df_prep.shape


# 4.🤖Training Logistic Regression

In [None]:
target= df['loan_status_binary']

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# df_prep_dropped_first holds the dummy features; target holds the 0/1 labels
X = df_prep_dropped_first.copy()
y = target  # Target variable (0/1)

# Train logistic regression
model = LogisticRegression()
model.fit(X, y)

In [None]:
model.intercept_

In [None]:
model.coef_

> Now we create summary_table to store the features and their coefficients (including the intercept).

In [None]:
summary_table = pd.DataFrame()
summary_table['feature_name'] = df_prep_dropped_first.columns
summary_table['coefficience'] = np.transpose(model.coef_)
summary_table.loc[-1] = ['intercept', model.intercept_[0]]
summary_table = summary_table.sort_index().reset_index(drop=True)
summary_table

# 5.📈Scorecard
> Creating scorecard scaled between 300-900

In [None]:
new = pd.DataFrame({'feature_name': dropped_first["feature"] + "_" + dropped_first["dropped_bin"],
                    'coefficience': 0})
scorecard=pd.DataFrame()
scorecard=pd.concat([summary_table,new]).reset_index()
scorecard

In [None]:
scorecard['feature_original'] = scorecard['feature_name'].str.split('_(', regex=False).str[0]
scorecard

In [None]:
scorecard.groupby('feature_original')['coefficience'].min()

In [None]:
min_sum=scorecard.groupby('feature_original')['coefficience'].min().sum()
max_sum=scorecard.groupby('feature_original')['coefficience'].max().sum()
max_score=900
min_score=300

In [None]:
scorecard['score_cal']= scorecard['coefficience']*(max_score-min_score)/(max_sum-min_sum)
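# Use .loc instead of chained indexing so the assignment reliably updates scorecard
scorecard.loc[0, 'score_cal'] = (scorecard.loc[0, 'coefficience'] - min_sum) / (max_sum - min_sum) * (max_score - min_score) + min_score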
scorecard
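
The scaling above can be verified on a toy scorecard. The coefficients below are invented for illustration; the column names mirror the ones used in this notebook. The point of the sketch is that the lowest and highest achievable totals land on 300 and 900.

```python
import numpy as np
import pandas as pd

# Hypothetical scorecard: intercept plus two features with their bins
toy_card = pd.DataFrame({
    "feature_original": ["intercept", "f1", "f1", "f2", "f2", "f2"],
    "coefficience": [-1.0, 0.0, 0.8, 0.0, -0.5, 0.4],
})
min_score, max_score = 300, 900

grouped = toy_card.groupby("feature_original")["coefficience"]
min_sum, max_sum = grouped.min().sum(), grouped.max().sum()

# Every bin coefficient is stretched by the same factor ...
scale = (max_score - min_score) / (max_sum - min_sum)
toy_card["score_cal"] = toy_card["coefficience"] * scale
# ... and the intercept row (index 0) absorbs the shift to min_score
toy_card.loc[0, "score_cal"] = (toy_card.loc[0, "coefficience"] - min_sum) * scale + min_score

lowest = toy_card.groupby("feature_original")["score_cal"].min().sum()
highest = toy_card.groupby("feature_original")["score_cal"].max().sum()
print(lowest, highest)
```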

In [None]:
min_check=scorecard.groupby('feature_original')['score_cal'].min().sum().round()
min_check

In [None]:
max_check=scorecard.groupby('feature_original')['score_cal'].max().sum().round()
max_check

In [None]:
df_prep.insert(0, 'intercept', 1)
df_prep = df_prep[scorecard['feature_name'].values]
scores = scorecard['score_cal']

In [None]:
scores.shape, df_prep.shape

In [None]:
scores=scores.values.reshape(-1,1)  # -1 infers the number of scorecard rows instead of hard-coding it

In [None]:
df_prep_scores = df_prep.dot(scores)
df_prep_scores = df_prep_scores.astype(int)
df_prep_scores

# 6.📏logistic (sigmoid) function

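>We have df_prep_scores, which contains the total score for each observation. Now we map these scores to values between 0 and 1 using the logistic (sigmoid) function. This mapping is a monotone rescaling of the score, so it preserves the ranking of observations (and therefore the ROC-AUC), but it is not identical to the probability predicted by the logistic regression: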

In [None]:
# Convert score to probability
y_score = 1 / (1 + np.exp(-(df_prep_scores - min_score) / (max_score - min_score)))

# Flatten to a 1D array
y_score = y_score.values.flatten()
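
The cell above applies the sigmoid to `(score - min_score) / (max_score - min_score)`, which keeps the ordering of observations. If the model's own probability is needed, one option is to undo the linear scaling first. The sketch below assumes the scaling used in the scorecard section; the function name `score_to_probability` and the example values of `min_sum` and `max_sum` are hypothetical.

```python
import numpy as np

def score_to_probability(score, min_sum, max_sum, min_score=300, max_score=900):
    """Undo the linear score scaling to recover the log-odds, then apply the sigmoid."""
    log_odds = (score - min_score) / (max_score - min_score) * (max_sum - min_sum) + min_sum
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical sums of the minimum / maximum coefficients per feature
min_sum, max_sum = -1.5, 0.2

p_low = score_to_probability(300, min_sum, max_sum)   # equals sigmoid(min_sum)
p_high = score_to_probability(900, min_sum, max_sum)  # equals sigmoid(max_sum)
print(p_low, p_high)
```

Because the scores were truncated to integers, probabilities recovered this way match the model's output only approximately.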

>Now, let's calculate once again the ROC-AUC score to assess model performance and visualize the effectiveness of the scorecard.

In [None]:
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Compute ROC-AUC
auc = roc_auc_score(y, y_score)
print(f"ROC-AUC: {auc:.4f}")

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y, y_score)  # Make sure 'thresholds' is defined!

# Plot the ROC Curve
plt.figure(figsize=(5, 5))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.4f})', color='blue')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')  # Random model line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()


> Rule of thumb for reading the AUC: 0.9 - 1.0 → Excellent.

# 7.➡️Youden's J statistic
>The optimal cutoff balances the True Positive Rate (TPR) and the False Positive Rate (FPR). A common method is Youden’s J statistic, maximized over all candidate thresholds:
>
>$$
>J = \text{TPR} - \text{FPR}
>$$
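
A small worked example may help. The labels and predicted values below are made up; `roc_curve` is the same scikit-learn function used earlier in this notebook, and the `toy_` names are chosen so they do not clash with the notebook's variables.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted values
toy_y = np.array([0, 0, 0, 0, 0, 1, 1, 1])
toy_p = np.array([0.1, 0.2, 0.3, 0.4, 0.7, 0.6, 0.8, 0.9])

toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_y, toy_p)

# Youden's J at every candidate threshold; the best cutoff maximizes it
toy_j = toy_tpr - toy_fpr
best = np.argmax(toy_j)
print(f"best threshold = {toy_thresholds[best]}, J = {toy_j[best]:.2f}")
```

At a threshold of 0.6 all three positives are captured (TPR = 1) while one of five negatives is flagged (FPR = 0.2), giving J = 0.8.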


In [None]:
# Compute Youden's J statistic
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds[optimal_idx]

print(f"Optimal Cutoff Probability: {optimal_threshold:.4f}")

>Convert Probability Cutoff to Score Cutoff
>
>To determine the equivalent score cutoff, invert the logistic function used above:

$$
\text{Score} = \text{min\_score} + (\text{max\_score} - \text{min\_score}) \times \log\left(\frac{p}{1-p}\right)
$$
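
A quick round trip confirms that this formula is the inverse of the score-to-probability mapping used in this notebook. The helper names `score_to_p` and `p_to_score` are hypothetical.

```python
import numpy as np

min_score, max_score = 300, 900

def score_to_p(score):
    # Sigmoid of the score rescaled to the 0-1 range
    return 1 / (1 + np.exp(-(score - min_score) / (max_score - min_score)))

def p_to_score(p):
    # Logit of p, stretched back to the score range
    return min_score + (max_score - min_score) * np.log(p / (1 - p))

example_scores = np.array([300.0, 450.0, 600.0, 900.0])
recovered = p_to_score(score_to_p(example_scores))
print(score_to_p(example_scores))
print(recovered)
```

Under this mapping, scores between 300 and 900 correspond to values between 0.5 and about 0.731, so a cutoff found on the ROC curve will also lie in that interval.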

In [None]:
# Convert probability cutoff to score cutoff
optimal_score_cutoff = min_score + (max_score - min_score) * np.log(optimal_threshold / (1 - optimal_threshold))

print(f"Optimal Score Cutoff: {optimal_score_cutoff:.0f}")


>Now, classify customers into "Good" or "Bad" risk segments using the optimal cutoff:

# 8.📉Risk Category🔴🟢

In [None]:
# Assign risk category based on cutoff
# Compare the single score column (not the whole DataFrame) with the cutoff
df_prep_scores['Risk_Category'] = np.where(df_prep_scores.iloc[:, 0] >= optimal_score_cutoff, "Good", "Bad")

# View distribution
df_prep_scores['Risk_Category'].value_counts()

>
>- Customers above the cutoff → "Good" Risk (Low probability of default).
>- Customers below the cutoff → "Bad" Risk (High probability of default).


## 🎯Next Steps

> Direct links to all parts of the series:
>
> 1. EDA <br>
> https://www.kaggle.com/code/beatafaron/complete-credit-risk-modeling-1-eda <br>
> 2. Behavioral scorecards, weight of evidence, logistic regression <br>
> https://www.kaggle.com/code/beatafaron/complete-credit-risk-modeling-2-sc-woe <br>
> 3. Population stability index <br>
> https://www.kaggle.com/code/beatafaron/complete-credit-risk-modeling-3-psi