# Fine-Tuning BERT for Default Probability Assessment

In this notebook, we will fine-tune a BERT-family model (DistilBERT) to estimate loan default probabilities from free-text loan descriptions. We'll explore how to:
- Process and prepare financial data for fine-tuning
- Adapt a BERT model for binary classification with calibrated probabilities
- Evaluate the model's performance
- Interpret the softmax probabilities as real-world default probabilities

## 1. Setup and Dependencies

In [None]:
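# Install required packages
# (the PyPI package is scikit-learn, not sklearn; seaborn is imported below;
#  recent transformers releases need accelerate for Trainer)
!pip install transformers datasets scikit-learn pandas numpy matplotlib seaborn torch evaluate accelerate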

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import EarlyStoppingCallback
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, brier_score_loss
from sklearn.calibration import calibration_curve
import evaluate
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 2. Load and Explore Dataset

We'll use a synthetic loan dataset with default information. Each row contains a textual loan description, the structured attributes used to generate it, the true default probability, and a sampled default outcome.

In [None]:
# Load dataset (adjust the path as needed)
# For this example, we'll create a simplified synthetic dataset
# In practice, you would load your actual loan data

def create_synthetic_loan_data(n_samples=1000):
    """Create a synthetic loan dataset with default information"""
    np.random.seed(seed)
    
    # Create loan purposes and descriptions
    purposes = ['Home improvement', 'Debt consolidation', 'Business', 'Medical expenses', 'Education', 'Vacation', 'Car purchase']
    income_levels = ['Low', 'Medium', 'High', 'Very high']
    employment_status = ['Employed', 'Self-employed', 'Unemployed', 'Retired', 'Student']
    credit_history = ['Excellent', 'Good', 'Fair', 'Poor']
    
    data = []
    for _ in range(n_samples):
        purpose = np.random.choice(purposes)
        income = np.random.choice(income_levels)
        employment = np.random.choice(employment_status)
        credit = np.random.choice(credit_history)
        loan_amount = np.random.randint(1000, 50000)
        debt_to_income = np.random.uniform(0.1, 0.8)
        
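        # Generate description with variations
        # Note: the dollar figure is drawn independently of the income level label,
        # so the two can disagree (e.g. a "low income" of 140000)
        description = f"I am {employment.lower()} with a {income.lower()} income of {np.random.randint(20000, 150000)} per year. "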
        description += f"I am applying for a {purpose.lower()} loan of ${loan_amount}. "
        description += f"My credit history is {credit.lower()} and my debt-to-income ratio is {debt_to_income:.2f}. "
        
        # Add some random details
        if purpose == 'Home improvement':
            description += f"I plan to renovate my {np.random.choice(['kitchen', 'bathroom', 'basement', 'roof', 'entire house'])}." 
        elif purpose == 'Debt consolidation':
            description += f"I have {np.random.randint(2, 8)} credit cards with high interest rates that I want to consolidate."
        elif purpose == 'Business':
            description += f"I run a small {np.random.choice(['retail', 'online', 'service', 'consulting', 'food'])} business."
        
        # Determine default probability based on features
        default_prob = 0.05  # Base rate
        
        # Adjust based on income
        if income == 'Low':
            default_prob += 0.15
        elif income == 'Medium':
            default_prob += 0.05
        elif income == 'High':
            default_prob -= 0.03
        elif income == 'Very high':
            default_prob -= 0.04
        
        # Adjust based on employment
        if employment == 'Unemployed':
            default_prob += 0.2
        elif employment == 'Self-employed':
            default_prob += 0.05
        elif employment == 'Student':
            default_prob += 0.1
        
        # Adjust based on credit history
        if credit == 'Poor':
            default_prob += 0.25
        elif credit == 'Fair':
            default_prob += 0.1
        elif credit == 'Excellent':
            default_prob -= 0.04
        
        # Adjust based on debt-to-income
        default_prob += debt_to_income * 0.2
        
        # Cap probability between 0.01 and 0.9
        default_prob = max(min(default_prob, 0.9), 0.01)
        
        # Generate default status based on probability
        default = 1 if np.random.random() < default_prob else 0
        
        data.append({
            'description': description,
            'purpose': purpose,
            'income_level': income,
            'employment_status': employment,
            'credit_history': credit,
            'loan_amount': loan_amount,
            'debt_to_income': debt_to_income,
            'default_prob': default_prob,
            'default': default
        })
    
    return pd.DataFrame(data)

# Create synthetic dataset
loan_data = create_synthetic_loan_data(2000)

# Display sample data
loan_data.head()
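To make the labeling rule concrete, the sketch below recomputes the synthetic default probability for a single applicant using the same additive adjustments as `create_synthetic_loan_data`. The helper name `synthetic_default_prob` and the lookup dictionaries are introduced here purely for illustration.

```python
# Illustrative restatement of the additive rule in create_synthetic_loan_data.
# The function name and dictionaries are hypothetical helpers for this example only.
INCOME_ADJ = {'Low': 0.15, 'Medium': 0.05, 'High': -0.03, 'Very high': -0.04}
EMPLOYMENT_ADJ = {'Unemployed': 0.20, 'Self-employed': 0.05, 'Student': 0.10}
CREDIT_ADJ = {'Poor': 0.25, 'Fair': 0.10, 'Excellent': -0.04}

def synthetic_default_prob(income, employment, credit, debt_to_income, base_rate=0.05):
    """Base rate plus categorical adjustments plus 0.2 * DTI, clipped to [0.01, 0.9]."""
    p = base_rate
    p += INCOME_ADJ.get(income, 0.0)          # categories without an entry add nothing
    p += EMPLOYMENT_ADJ.get(employment, 0.0)
    p += CREDIT_ADJ.get(credit, 0.0)
    p += debt_to_income * 0.2
    return max(min(p, 0.9), 0.01)

risky = synthetic_default_prob('Low', 'Unemployed', 'Poor', 0.65)
safe = synthetic_default_prob('High', 'Employed', 'Excellent', 0.25)
print(f"risky applicant: {risky:.2f}, safe applicant: {safe:.2f}")
```

Because each `default` label is a single Bernoulli draw from this probability, some label noise is irreducible: even a model that recovered `default_prob` exactly would not reach a Brier score of zero on this data.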

In [None]:
# Explore the dataset
print(f"Dataset shape: {loan_data.shape}")
print(f"Default rate: {loan_data['default'].mean():.2f}")

# Explore default rates by category
print("\nDefault rate by purpose:")
print(loan_data.groupby('purpose')['default'].mean().sort_values(ascending=False))

print("\nDefault rate by income level:")
print(loan_data.groupby('income_level')['default'].mean().sort_values(ascending=False))

print("\nDefault rate by credit history:")
print(loan_data.groupby('credit_history')['default'].mean().sort_values(ascending=False))

In [None]:
# Visualize the distribution of default probabilities
plt.figure(figsize=(10, 6))
sns.histplot(data=loan_data, x='default_prob', hue='default', bins=20, kde=True)
plt.title('Distribution of Default Probabilities')
plt.xlabel('Default Probability')
plt.ylabel('Count')
plt.show()

## 3. Prepare Data for BERT

We'll now prepare our data for fine-tuning a BERT model. We'll focus on using the textual description as the primary input.

In [None]:
# Split data into train, validation, and test sets
train_data, test_data = train_test_split(loan_data, test_size=0.2, random_state=seed, stratify=loan_data['default'])
train_data, val_data = train_test_split(train_data, test_size=0.25, random_state=seed, stratify=train_data['default'])

print(f"Train data size: {len(train_data)}")
print(f"Validation data size: {len(val_data)}")
print(f"Test data size: {len(test_data)}")
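The two-stage split above yields a 60/20/20 partition: 20% is held out for testing, and 25% of the remaining 80% (another 20% of the total) becomes the validation set. A minimal sketch of that arithmetic follows, assuming the held-out fraction is rounded up to a whole number of rows; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size=0.2, val_size_of_remainder=0.25):
    """Row counts produced by holding out a test set, then a validation set."""
    n_test = math.ceil(n_samples * test_size)                 # first split
    n_remaining = n_samples - n_test
    n_val = math.ceil(n_remaining * val_size_of_remainder)    # second split
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(2000)
print(n_train, n_val, n_test)
```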

In [None]:
# Initialize tokenizer
model_name = "distilbert-base-uncased"  # We'll use DistilBERT for efficiency
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create a PyTorch dataset
class LoanDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length=128):
        self.descriptions = dataframe['description'].tolist()
        self.targets = dataframe['default'].tolist()
        self.tokenizer = tokenizer
        self.max_length = max_length
        
    def __len__(self):
        return len(self.descriptions)
    
    def __getitem__(self, idx):
        description = self.descriptions[idx]
        target = self.targets[idx]
        
        encoding = self.tokenizer(
            description,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        # Remove only the leading batch dimension added by return_tensors='pt'
        encoding = {k: v.squeeze(0) for k, v in encoding.items()}
        encoding['labels'] = torch.tensor(target, dtype=torch.long)
        
        return encoding

# Create datasets
train_dataset = LoanDataset(train_data, tokenizer)
val_dataset = LoanDataset(val_data, tokenizer)
test_dataset = LoanDataset(test_data, tokenizer)

# Check a sample
sample = train_dataset[0]
print("Sample input keys:", sample.keys())
print("Input IDs shape:", sample['input_ids'].shape)
print("Attention mask shape:", sample['attention_mask'].shape)
print("Label:", sample['labels'])

## 4. Define Evaluation Metrics

We'll define functions to compute evaluation metrics specific to our task.

In [None]:
def compute_metrics(eval_pred):
    """Compute metrics for evaluation"""
    predictions, labels = eval_pred
    # Get probabilities with softmax
    probs = torch.nn.functional.softmax(torch.tensor(predictions), dim=1)
    preds = np.argmax(predictions, axis=1)
    
    # Get default probability (probability of class 1)
    default_probs = probs[:, 1].numpy()
    
    # Calculate metrics
    accuracy = (preds == labels).mean()
    
    # ROC AUC score
    roc_auc = roc_auc_score(labels, default_probs)
    
    # Precision-Recall AUC
    precision, recall, _ = precision_recall_curve(labels, default_probs)
    pr_auc = auc(recall, precision)
    
    # Brier score (for probability calibration)
    brier = brier_score_loss(labels, default_probs)
    
    return {
        'accuracy': accuracy,
        'roc_auc': roc_auc,
        'pr_auc': pr_auc,
        'brier_score': brier
    }
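As a small worked example of what `compute_metrics` does with the logits, the sketch below converts two-class logits into a default probability and scores it with the Brier score, using only the standard library. The logits and labels are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def brier_score(labels, probs):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

# Made-up logits in the order [non-default, default]
example_logits = [[2.0, -1.0], [0.0, 0.0], [-0.5, 1.5]]
example_labels = [0, 1, 1]

example_default_probs = [softmax(z)[1] for z in example_logits]
example_brier = brier_score(example_labels, example_default_probs)
print([round(p, 4) for p in example_default_probs], round(example_brier, 4))
```

With two classes, the softmax probability of class 1 equals a sigmoid applied to the logit difference `z1 - z0`, so only that difference affects the predicted default probability.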

## 5. Initialize and Fine-Tune BERT Model

Now we'll fine-tune a pre-trained BERT model for our default prediction task.

In [None]:
# Initialize model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=2,  # Binary classification: default or not
    problem_type="single_label_classification"
)

# Define training arguments
batch_size = 16
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # this argument is named eval_strategy in recent transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",
    push_to_hub=False,
    report_to="none",  # Disable W&B and TensorBoard reporting
    save_total_limit=2,  # Keep at most 2 checkpoints on disk (the best one is retained with load_best_model_at_end=True)
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)
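The early stopping rule configured above can be pictured as a patience counter on the validation metric. The sketch below is a simplified illustration of that idea, not the library's implementation; `early_stopping_epoch` and the scores are hypothetical.

```python
def early_stopping_epoch(scores, patience=2):
    """Return (1-based epoch at which training stops, best score seen so far).

    Assumes higher scores are better, as with ROC AUC.
    """
    best = None
    stale = 0  # consecutive evaluations without improvement
    for epoch, score in enumerate(scores, start=1):
        if best is None or score > best:
            best = score
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            return epoch, best
    return len(scores), best

# Made-up per-epoch validation ROC AUC values
val_roc_auc = [0.70, 0.74, 0.73, 0.72, 0.75]
stop_epoch, best_score = early_stopping_epoch(val_roc_auc, patience=2)
print(stop_epoch, best_score)
```

In this made-up run, training stops after epoch 4 and never reaches the better epoch 5, which is the usual trade-off of a small patience value.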

In [None]:
# Train the model
trainer.train()

In [None]:
# Evaluate on validation set
val_results = trainer.evaluate()
print("Validation Results:", val_results)

## 6. Evaluate Model Performance

Now let's evaluate our model on the test set and analyze its performance.

In [None]:
# Evaluate on test set
test_results = trainer.evaluate(test_dataset)
print("Test Results:", test_results)

In [None]:
# Get predictions on test set
test_predictions = trainer.predict(test_dataset)
logits = test_predictions.predictions
probabilities = torch.nn.functional.softmax(torch.tensor(logits), dim=1).numpy()
default_probabilities = probabilities[:, 1]  # Probability of class 1 (default)
predicted_labels = np.argmax(logits, axis=1)
true_labels = test_data['default'].values

In [None]:
# Plot ROC curve
from sklearn.metrics import roc_curve

fpr, tpr, _ = roc_curve(true_labels, default_probabilities)
roc_auc = roc_auc_score(true_labels, default_probabilities)

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

In [None]:
# Plot precision-recall curve
precision, recall, _ = precision_recall_curve(true_labels, default_probabilities)
pr_auc = auc(recall, precision)

plt.figure(figsize=(10, 6))
plt.plot(recall, precision, color='darkorange', lw=2, label=f'PR curve (area = {pr_auc:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="lower left")
plt.show()

## 7. Calibration of Softmax Probabilities

Let's evaluate how well calibrated our softmax probabilities are and apply calibration if needed.

In [None]:
# Check calibration
prob_true, prob_pred = calibration_curve(true_labels, default_probabilities, n_bins=10)

plt.figure(figsize=(10, 6))
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated')
plt.plot(prob_pred, prob_true, marker='o', label='Model')
plt.xlabel('Predicted probability')
plt.ylabel('True probability (fraction of positives)')
plt.title('Calibration Curve (Reliability Diagram)')
plt.legend()
plt.show()
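A reliability diagram groups predictions into probability bins and compares each bin's mean prediction with its observed default rate. The standard-library sketch below shows that binning with equal-width bins; it also computes the expected calibration error (ECE), a one-number summary that is an addition here and is not used elsewhere in this notebook. The function names and data are illustrative.

```python
def reliability_bins(labels, probs, n_bins=10):
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[idx][0] += p
        sums[idx][1] += y
        sums[idx][2] += 1
    return [(sp / n, sy / n, n) for sp, sy, n in sums if n > 0]

def expected_calibration_error(labels, probs, n_bins=10):
    """Count-weighted average gap between mean prediction and observed rate."""
    total = len(labels)
    return sum(n / total * abs(mean_p - rate)
               for mean_p, rate, n in reliability_bins(labels, probs, n_bins))

# Made-up predictions: five at 0.15 (one default), five at 0.85 (four defaults)
demo_probs = [0.15] * 5 + [0.85] * 5
demo_labels = [0, 0, 0, 0, 1] + [1, 1, 1, 1, 0]

demo_bins = reliability_bins(demo_labels, demo_probs)
demo_ece = expected_calibration_error(demo_labels, demo_probs)
print(demo_bins, round(demo_ece, 4))
```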

In [None]:
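# Apply isotonic regression for calibration
from sklearn.isotonic import IsotonicRegression

# Fit the calibrator on validation-set predictions rather than on the test set,
# so the test-set comparison below remains out-of-sample
val_logits = trainer.predict(val_dataset).predictions
val_default_probs = torch.nn.functional.softmax(torch.tensor(val_logits), dim=1).numpy()[:, 1]
val_labels = val_data['default'].values

ir = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds='clip')
ir.fit(val_default_probs, val_labels)

# Calibrate the test-set probabilities
calibrated_probs = ir.predict(default_probabilities)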

# Check calibration after correction
prob_true_cal, prob_pred_cal = calibration_curve(true_labels, calibrated_probs, n_bins=10)

plt.figure(figsize=(10, 6))
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated')
plt.plot(prob_pred, prob_true, marker='o', label='Uncalibrated')
plt.plot(prob_pred_cal, prob_true_cal, marker='s', label='Calibrated (Isotonic)')
plt.xlabel('Predicted probability')
plt.ylabel('True probability (fraction of positives)')
plt.title('Calibration Curve Before and After Calibration')
plt.legend()
plt.show()
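Temperature scaling is a common parametric alternative to isotonic regression for neural classifiers: the logits are divided by a single scalar T chosen to minimize negative log-likelihood on held-out data. The sketch below is a minimal standard-library illustration using a grid search; the helper names, the grid, and the data are assumptions made for this example, and in practice T would be fit on validation logits.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nll(margins, labels, temperature):
    """Mean negative log-likelihood; margin = logit(default) - logit(non-default)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for m, y in zip(margins, labels):
        p = sigmoid(m / temperature)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)

def fit_temperature(margins, labels):
    """Pick the temperature on a fixed grid (0.5 to 10.0) with the lowest NLL."""
    grid = [0.5 + 0.05 * i for i in range(191)]
    return min(grid, key=lambda t: nll(margins, labels, t))

# Made-up overconfident model: margins of +/-4 imply ~98% confidence,
# but the predictions are right only 70% of the time
demo_margins = [4.0] * 10 + [-4.0] * 10
demo_targets = [1] * 7 + [0] * 3 + [0] * 7 + [1] * 3

best_t = fit_temperature(demo_margins, demo_targets)
print(best_t, round(sigmoid(4.0 / best_t), 3))
```

A fitted temperature above 1 softens overconfident predictions. Unlike isotonic regression, temperature scaling cannot change the ranking of predictions, so ROC AUC is unaffected.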

In [None]:
# Compare Brier scores before and after calibration
brier_before = brier_score_loss(true_labels, default_probabilities)
brier_after = brier_score_loss(true_labels, calibrated_probs)

print(f"Brier score before calibration: {brier_before:.4f}")
print(f"Brier score after calibration: {brier_after:.4f}")
print(f"Improvement: {(brier_before - brier_after) / brier_before * 100:.2f}%")

## 8. Compare Predicted Probabilities to True Default Rates

Let's compare our model's predicted probabilities to the true default probabilities in our synthetic data.

In [None]:
# Add predictions to test data
test_results_df = test_data.copy()
test_results_df['predicted_prob'] = default_probabilities
test_results_df['calibrated_prob'] = calibrated_probs
test_results_df['predicted_default'] = predicted_labels

# Compare predicted vs true default probabilities
plt.figure(figsize=(10, 6))
plt.scatter(test_results_df['default_prob'], test_results_df['predicted_prob'], alpha=0.5, label='Uncalibrated')
plt.scatter(test_results_df['default_prob'], test_results_df['calibrated_prob'], alpha=0.5, label='Calibrated')
plt.plot([0, 1], [0, 1], 'r--', label='Perfect prediction')
plt.xlabel('True Default Probability')
plt.ylabel('Predicted Default Probability')
plt.title('Comparison of True vs. Predicted Default Probabilities')
plt.legend()
plt.show()

In [None]:
# Calculate error metrics
test_results_df['uncal_error'] = np.abs(test_results_df['default_prob'] - test_results_df['predicted_prob'])
test_results_df['cal_error'] = np.abs(test_results_df['default_prob'] - test_results_df['calibrated_prob'])

print(f"Mean absolute error (Uncalibrated): {test_results_df['uncal_error'].mean():.4f}")
print(f"Mean absolute error (Calibrated): {test_results_df['cal_error'].mean():.4f}")

## 9. Error Analysis

Let's examine cases where our model performed poorly.

In [None]:
# Find examples with largest errors
worst_predictions = test_results_df.sort_values('cal_error', ascending=False).head(5)

print("Examples with largest errors:")
for i, row in worst_predictions.iterrows():
    print(f"\nExample {i}:")
    print(f"Description: {row['description']}")
    print(f"True default prob: {row['default_prob']:.4f}")
    print(f"Predicted prob (uncalibrated): {row['predicted_prob']:.4f}")
    print(f"Predicted prob (calibrated): {row['calibrated_prob']:.4f}")
    print(f"Actual default: {row['default']}")

## 10. Build a Default Probability Prediction Function

Let's create a function that takes a loan description and returns a calibrated default probability.

In [None]:
def predict_default_probability(description, model=model, tokenizer=tokenizer, calibrator=ir):
    """Predict default probability for a loan description"""
    # Tokenize the input
    inputs = tokenizer(
        description,
        add_special_tokens=True,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    
    # Move inputs to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    # Get prediction (eval mode disables dropout so outputs are deterministic)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
    
    # Apply softmax to get probabilities
    probabilities = torch.nn.functional.softmax(logits, dim=1).cpu().numpy()
    default_prob = probabilities[0, 1]  # Probability of class 1 (default)
    
    # Apply calibration
    calibrated_prob = calibrator.predict([default_prob])[0]
    
    return {
        'raw_default_probability': float(default_prob),
        'calibrated_default_probability': float(calibrated_prob),
        'predicted_class': 'Default' if calibrated_prob > 0.5 else 'Non-Default'
    }

In [None]:
# Test the function with some examples
test_descriptions = [
    "I am employed with a high income of 120000 per year. I am applying for a home improvement loan of $25000. My credit history is excellent and my debt-to-income ratio is 0.25. I plan to renovate my kitchen.",
    "I am unemployed with a low income of 25000 per year. I am applying for a debt consolidation loan of $15000. My credit history is poor and my debt-to-income ratio is 0.65. I have 6 credit cards with high interest rates that I want to consolidate.",
    "I am self-employed with a medium income of 75000 per year. I am applying for a business loan of $35000. My credit history is good and my debt-to-income ratio is 0.40. I run a small retail business."
]

for i, description in enumerate(test_descriptions):
    result = predict_default_probability(description)
    print(f"\nExample {i+1}:")
    print(f"Description: {description}")
    print(f"Raw Default Probability: {result['raw_default_probability']:.4f}")
    print(f"Calibrated Default Probability: {result['calibrated_default_probability']:.4f}")
    print(f"Predicted Class: {result['predicted_class']}")

## 11. Save the Model and Calibrator

Let's save our model and calibrator for future use.

In [None]:
# Save the model
model_save_path = "./default_prediction_model"
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

# Save the calibrator
import pickle
with open(f"{model_save_path}/calibrator.pkl", 'wb') as f:
    pickle.dump(ir, f)

print(f"Model and calibrator saved to {model_save_path}")
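For completeness, here is a hedged sketch of how the saved artifacts could be loaded back in a later session. The helper names are hypothetical; the directory layout matches the save cell above, with `calibrator.pkl` stored next to the model files.

```python
import os
import pickle

def load_calibrator(model_dir):
    """Load the pickled calibrator stored alongside the model files."""
    with open(os.path.join(model_dir, "calibrator.pkl"), "rb") as f:
        return pickle.load(f)

def load_model_and_tokenizer(model_dir):
    """Reload the fine-tuned model and tokenizer written by save_pretrained."""
    # Imported lazily so load_calibrator works without transformers installed
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model.eval()  # inference mode: disables dropout
    return model, tokenizer
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code. A pickled scikit-learn object is also best loaded with the same scikit-learn version that created it.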

## 12. Conclusion

In this notebook, we've demonstrated how to fine-tune a BERT model to predict loan default probabilities. We've covered:

1. Data preparation and exploration
2. Fine-tuning a BERT model for default prediction
3. Evaluating model performance with appropriate metrics
4. Calibrating softmax probabilities so that they better match observed default rates
5. Building a prediction function for new loan descriptions

This approach can be extended to real-world credit risk assessment by:
- Using actual loan data with historical default information
- Incorporating additional structured data features
- Implementing more sophisticated calibration techniques
- Conducting fairness audits to ensure unbiased predictions
- Regularly retraining the model with new data

The model we've built can serve as a starting point for more advanced credit risk assessment systems.