## Prepare data and predictions for topic modeling

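This step scores the test set with the fine-tuned FinBERT classifier and labels each true risk complaint as a true positive (TP) or a false negative (FN). That lets us see which topics of risky complaints the model captures well and which ones it tends to miss.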

In [9]:
import os
import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Paths
colab_dir = "/content/drive/MyDrive/NLP/respect-cfpb"
checkpoint_dir = os.path.join(colab_dir, "models/finbert_full/checkpoint-1332")

# Load data
test_df = pd.read_csv(os.path.join(colab_dir, "data/processed/test.csv"))

# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

# Load tokenizer and full fine-tuned model
base_model_name = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(base_model_name, use_fast=True)

print("Loading model from:", checkpoint_dir)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint_dir).to(device)
model.eval()

# Decision threshold selected on the validation set and reused here on the test set
best_threshold = 0.448

# Compute probabilities and predictions on test set
test_texts = test_df["clean_text"].tolist()
probs_test = []

# Run a batch prediction loop
with torch.no_grad():
    for i in range(0, len(test_texts), 32):
        batch = test_texts[i:i+32]
        # tokenize text into tensors
        enc = tokenizer(
            batch,
            truncation=True,
            padding=True,
            max_length=512,
            return_tensors="pt"
        )
        enc = {k: v.to(device) for k, v in enc.items()}
        # gets logits from the model
        logits = model(**enc).logits
        # apply softmax to convert logits
        p = torch.softmax(logits, dim=1)[:, 1]  # probability of risk_flag = 1
        probs_test.extend(p.cpu().numpy())

# Convert prob to pred using our validation threshold
probs_test = np.array(probs_test)
preds_test = (probs_test >= best_threshold).astype(int)

# attach prediction and probability
risk_df = test_df.copy()
risk_df["prob"] = probs_test
risk_df["pred"] = preds_test

# keep only true risk cases
risk_df = risk_df[risk_df["risk_flag"] == 1].copy()

# label outcomes
risk_df["outcome"] = np.where(risk_df["pred"] == 1, "TP", "FN")

print("Risk subset shape:", risk_df.shape)
print(risk_df["outcome"].value_counts())

Device: cuda
Loading model from: /content/drive/MyDrive/NLP/respect-cfpb/models/finbert_full/checkpoint-1332
Risk subset shape: (116, 15)
outcome
FN    90
TP    26
Name: count, dtype: int64
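
The counts above imply an overall recall of 26 / (26 + 90), roughly 0.22, on the 116 true risk complaints at this threshold. A minimal sketch of the thresholding and labeling step follows; the `toy_probs` values are made up for illustration and are not model outputs.

```python
import numpy as np

best_threshold = 0.448  # same threshold as in the cell above

# Hypothetical probabilities for five true risk complaints (illustration only)
toy_probs = np.array([0.10, 0.448, 0.90, 0.30, 0.44])

# A complaint is predicted risky when its probability meets the threshold
toy_preds = (toy_probs >= best_threshold).astype(int)

# Every row here is a true risk case, so pred == 1 is a TP and pred == 0 is an FN
toy_outcome = np.where(toy_preds == 1, "TP", "FN")

# Recall = TP / (TP + FN)
toy_recall = (toy_outcome == "TP").mean()

# Overall recall implied by the reported counts (26 TP, 90 FN)
reported_recall = 26 / (26 + 90)
print(toy_outcome, toy_recall, round(reported_recall, 3))
```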


## Sentence Embedding for Risk-Case Clustering

To understand patterns among the true risk cases (both TP and FN), we generate a vector embedding of each complaint using the `all-MiniLM-L6-v2` SentenceTransformer model, which yields one 384-dimensional vector per complaint.

In [10]:
!pip install -q sentence-transformers

import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Embedding device:", device)

embed_model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

texts = risk_df["clean_text"].tolist()

embeddings = embed_model.encode(
    texts,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

embeddings.shape

Embedding device: cuda


(116, 384)
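
Each of the 116 risk complaints is now a 384-dimensional vector. Complaints with similar meaning should have embeddings that point in similar directions, which is commonly measured with cosine similarity. The sketch below uses random stand-in vectors; `fake_embeddings` is an assumption so that the example runs without downloading the model.

```python
import numpy as np

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 384))  # placeholder for `embeddings`

def cosine_similarity_matrix(x):
    """Return the pairwise cosine similarity of the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / norms  # scale every row to unit length
    return unit @ unit.T

sim = cosine_similarity_matrix(fake_embeddings)
print(sim.shape)
```

With the real data, passing `embeddings` instead of `fake_embeddings` would give a 116 x 116 matrix that can be inspected for near-duplicate complaints.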

## Clustering Risk Cases Using K-Means

To uncover structural patterns within the true risk complaints, we cluster the sentence embeddings using K-Means. We specify k = 14 clusters to obtain moderately granular groupings without over-segmentation. Each complaint is assigned a cluster label, allowing us to analyze where True Positives and False Negatives concentrate.

In [None]:
from sklearn.cluster import KMeans

k = 14 # 14 gave us the most meaningful separations
kmeans = KMeans(
    n_clusters=k,
    random_state=42,
    n_init=10
)

cluster_labels = kmeans.fit_predict(embeddings)
risk_df["cluster"] = cluster_labels

risk_df[["risk_flag", "outcome", "cluster"]].head()

| index | risk_flag | outcome | cluster |
|---|---|---|---|
| 33 | 1 | FN | 10 |
| 42 | 1 | FN | 4 |
| 129 | 1 | FN | 2 |
| 153 | 1 | TP | 10 |
| 154 | 1 | FN | 0 |
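
The value k = 14 was chosen by inspecting the resulting groups. One way to cross-check such a choice is to compare silhouette scores over a range of k. The sketch below is an assumption about how that check could look, not the procedure used in this notebook, and it runs on synthetic blobs (`toy_points`) rather than the complaint embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Three tight, well-separated synthetic blobs (stand-in for real embeddings)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for candidate_k in range(2, 7):
    labels = KMeans(n_clusters=candidate_k, random_state=42, n_init=10).fit_predict(toy_points)
    # Silhouette is higher when clusters are compact and far apart
    scores[candidate_k] = silhouette_score(toy_points, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

For the real data, `toy_points` would be replaced by `embeddings`. With only 116 complaints, high values of k leave few cases per cluster, so the scores are best read alongside the cluster sizes.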


## Recall Analysis by Cluster

After assigning each risk-labeled complaint to a semantic cluster, we compute how well the model performs within each group.  
For every cluster, we compute:

- **TP**: correctly predicted risk cases
- **FN**: missed risk cases
- **total**: all risk complaints in the cluster (TP + FN)
- **recall**: TP / (TP + FN), the share of risk complaints in the cluster that the model flags

In [27]:
# count TP and FN per cluster
cluster_summary = (
    risk_df
    .groupby("cluster")["outcome"]
    .value_counts()
    .unstack(fill_value=0)
    .reset_index()
)

# ensure FN and TP columns exist
for col in ["FN", "TP"]:
    if col not in cluster_summary.columns:
        cluster_summary[col] = 0

cluster_summary["total"] = cluster_summary["FN"] + cluster_summary["TP"]
cluster_summary["recall"] = cluster_summary["TP"] / cluster_summary["total"]

cluster_summary = cluster_summary.sort_values("recall")

cluster_summary

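| index | cluster | FN | TP | total | recall |
|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 3 | 0.0 |
| 6 | 6 | 3 | 0 | 3 | 0.0 |
| 12 | 12 | 3 | 0 | 3 | 0.0 |
| 8 | 8 | 5 | 0 | 5 | 0.0 |
| 13 | 13 | 7 | 1 | 8 | 0.125 |
| 5 | 5 | 6 | 1 | 7 | 0.142857 |
| 4 | 4 | 10 | 2 | 12 | 0.166667 |
| 2 | 2 | 5 | 1 | 6 | 0.166667 |
| 9 | 9 | 3 | 1 | 4 | 0.25 |
| 7 | 7 | 8 | 3 | 11 | 0.272727 |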
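
Several clusters in the table above hold only 3 to 5 complaints, so their recall values are noisy point estimates. One hedged way to express that uncertainty is a Wilson score interval. This is an added illustration, not part of the original analysis, and the helper name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if n == 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges
    return max(0.0, center - half), min(1.0, center + half)

# (cluster, TP, total) values taken from the cluster table above
for cluster_id, tp, total in [(0, 0, 3), (4, 2, 12), (7, 3, 11)]:
    low, high = wilson_interval(tp, total)
    print(f"cluster {cluster_id}: recall={tp / total:.3f}, interval=({low:.2f}, {high:.2f})")
```

A wide interval for a three-complaint cluster means that its recall of 0.0 should be read as "no detections observed so far" rather than as a precise estimate.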


### Extract Top Terms for Each Cluster Using TF-IDF

This block identifies the most representative phrases for each cluster, helping us interpret what themes each cluster contains. We compute the average TF-IDF weight of each term within each cluster.


In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# TF-IDF on risk texts
tfidf = TfidfVectorizer(
    ngram_range=(1, 3),
    min_df=2,              # drop rare phrases
    max_features=3000,     # keep it compact
    stop_words="english"
)

X_tfidf = tfidf.fit_transform(risk_df["clean_text"])
terms = np.array(tfidf.get_feature_names_out())

cluster_keywords = {}

for c in range(k):
    idx = np.where(risk_df["cluster"] == c)[0]
    if len(idx) == 0:
        continue

    # mean TF-IDF score for this cluster
    cluster_mean = X_tfidf[idx].mean(axis=0).A1
    top_idx = cluster_mean.argsort()[::-1][:10]
    cluster_keywords[c] = terms[top_idx].tolist()

cluster_summary["top_terms"] = cluster_summary["cluster"].map(cluster_keywords)

cluster_summary


| index | cluster | FN | TP | total | recall | top_terms |
|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 3 | 0.0 | [appraisal, home, income, value, used, propert... |
| 6 | 6 | 3 | 0 | 3 | 0.0 | [scheduled, current, payment, upload, avoid fo... |
| 12 | 12 | 3 | 0 | 3 | 0.0 | [flood, insurance, flood insurance, truist, ho... |
| 8 | 8 | 5 | 0 | 5 | 0.0 | [credit, late, usc, section, account, payment,... |
| 13 | 13 | 7 | 1 | 8 | 0.125 | [insurance, truist, policy, inspection, year, ... |
| 5 | 5 | 6 | 1 | 7 | 0.142857 | [escrow, insurance, policy, escrow account, mo... |
| 4 | 4 | 10 | 2 | 12 | 0.166667 | [credit, forbearance plan, loan, plan, applica... |
| 2 | 2 | 5 | 1 | 6 | 0.166667 | [pnc, fee, late fee, payment, late, check, mai... |
| 9 | 9 | 3 | 1 | 4 | 0.25 | [application, td, rep, year year, statements, ... |
| 7 | 7 | 8 | 3 | 11 | 0.272727 | [escrow, payment, loan, taxes, rocket, money, ... |


Across the four lowest-recall clusters (0, 6, 12, and 8), the model misses all 14 risk complaints. These missed cases share a clear structural pattern: they involve technical or procedural issues that, if the allegations are accurate, constitute regulatory violations requiring remediation, independent of narrative style or emotional tone.

These clusters consistently map to domains governed by explicit, rule-based obligations, including appraisal-related impacts, foreclosure prevention and payment posting, flood-insurance requirements, and credit-reporting accuracy.

This pattern indicates that the model is not reliably capturing the underlying regulatory context. As a result, it systematically under-detects risk in categories where violations are defined by strict rules rather than by linguistic cues of harm or sentiment.

## Model Weaknesses and Future Considerations

The model elevates complaints that sound risky, including those with emotional, accusatory, or harm-focused language, even when the cases describe situations that do not meet the regulatory definition of a violation. This produces a consistent pattern of false positives.

In contrast, the model frequently misses complaints that involve concrete, rule-based violations, especially when the consumer writes in a calm, procedural, or factual tone. These cases, which often involve escrow requirements, foreclosure timelines, flood insurance rules, appraisal-based adjustments, or credit reporting obligations, appear prominently in the false negatives.

Together, these patterns indicate that the model excels at detecting narrative cues of consumer harm, but struggles with violations defined by explicit regulatory rules, which are the most critical for compliance remediation.

Improving performance therefore requires strengthening the model’s access to regulatory context, not merely adding more sentiment or linguistic cues.

A practical next step is to incorporate a Retrieval-Augmented Generation (RAG) layer.
By retrieving relevant Consumer Financial Protection Laws and Regulations, investor guidelines, and internal policies, the system could check whether the facts described in a complaint conflict with a specific rule, even when the narrative tone is neutral.

This provides a complementary capability that the classifier alone lacks, and it could help improve recall on rule-driven violations without sacrificing precision.
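
The retrieval half of such a layer can be sketched as follows. This is a minimal illustration under stated assumptions: the `rule_snippets` texts are hypothetical placeholders written for this example (not real regulatory language), bag-of-words cosine similarity stands in for whatever retriever a real system would use, and the generation step is omitted.

```python
import math
from collections import Counter

# Hypothetical rule summaries, written for illustration only (not real regulatory text)
rule_snippets = {
    "escrow": "servicer must pay escrow taxes and insurance premiums on time",
    "flood": "flood insurance coverage required for property in flood zone",
    "credit": "furnisher must report accurate payment history to credit bureaus",
}

def bag_of_words(text):
    """Lowercase whitespace tokenization into term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(complaint, rules, top_n=1):
    """Rank rule snippets by similarity to the complaint text."""
    query = bag_of_words(complaint)
    scored = [(rule_id, cosine(query, bag_of_words(text))) for rule_id, text in rules.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

complaint = "my escrow account was short because the servicer paid my taxes late"
top_match = retrieve(complaint, rule_snippets)
print(top_match)
```

The complaint here is written in a neutral tone, yet it still matches the escrow snippet on shared terms. A fuller system would likely use dense embeddings for retrieval and pass the retrieved rule text to a downstream model or reviewer for the actual compliance check.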