In [1]:
pwd

'/home/jovyan/silver-iguana/spark/notebooks'

<h2 style="color:red; font-weight:bold; text-align:center;">
How do we prevent an AI model from hallucinating?
</h2>

Large Language Models (LLMs) can summarize, reason, and synthesize text — but they still **invent facts** and **fail to quantify uncertainty**.  

Recent research explores many ways to reduce this problem, yet most solutions remain **qualitative** rather than **mathematical**.  

My thesis: *LLMs alone can’t provide trustworthy, numerically grounded predictions without leaning on a statistical framework underneath.*

---

- **Islam et al. (2024)** — *A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models*  
  ([arXiv:2401.01313](https://arxiv.org/abs/2401.01313))  

  **Summary:**  
  This paper surveys more than 32 techniques for reducing hallucinations, including *prompt engineering*, *retrieval-augmented generation*, *self-refinement*, and *knowledge injection*.  
  The key takeaway: nearly all of these methods are **heuristics**, clever but ad hoc fixes rather than mathematically grounded systems. None provides true **probabilistic calibration** or **confidence intervals**.  

  ![Taxonomy of hallucination mitigation strategies](Images/Islam.png)  

  *Figure:* A taxonomy of mitigation strategies. What’s missing is a **statistical backbone** — none of these approaches measure how confident the model actually is in a mathematically valid way.  

---

- **Liang et al. (2024)** — *THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models*  
  ([arXiv:2409.11353](https://arxiv.org/abs/2409.11353))  

  ![THaMES](Images/THaMES.png)  

  **Summary:**  
  THaMES builds an automated pipeline for mitigating hallucinations by generating QA datasets, benchmarking model responses, and applying methods like **In-Context Learning**, **Retrieval-Augmented Generation**, and **Parameter-Efficient Fine-Tuning**.  
  It does reduce hallucination frequency, but only in a **qualitative** way — classifying outputs as “fact” or “non-fact,” with no mathematical notion of **confidence** or **error bounds**.  

  **Why this approach doesn’t fit the Spark / ScoreCard project:**  
  THaMES improves behavior through **data curation and prompt tuning**, not through **statistical validity**.  
  The ScoreCard system, by contrast, must justify **quantitative model outputs** — real probabilities from logistic regression models grounded in procurement data.  
  A THaMES-style pipeline might clean inputs, but it can’t explain *why* a score was produced or *how sure* the system is.  

  *In short:* **THaMES reduces hallucination by reshaping data — ScoreCard prevents it by grounding explanations in math.**

---

- **Pang, Jang & Fang (2024)** — *Generating Descriptive Explanations of Machine Learning Models Using LLM*  
  ([OpenReview PDF](https://openreview.net/pdf/7966b9a46e995f470152dc158cf2c09a957eae89.pdf))  

  **Summary:**  
  This paper introduces a complementary strategy: using an LLM as a **translator** between classical machine-learning models and human users.  
  Instead of retraining or altering data, the LLM reads a model’s internal parameters — such as feature weights and correlations — and converts them into **natural-language explanations** describing *how* the model reasons.  
  The insight is simple but powerful: an LLM can offer interpretability **without touching the underlying math**.  

  ![LLM as interpreter of a statistical model](Images/Pang.png)  

  *Figure:* The statistical model remains the **source of truth**, while the LLM acts as the **interpreter** — explaining logic, relevance, and uncertainty to humans.  
  This is the same strategy adopted in the **Spark / ScoreCard** system: use AI not to alter the numbers, but to **explain and justify** them in plain language.


In [None]:
import sys
import os

# Just for demo purposes
sys.path.insert(0, os.path.abspath(os.path.join("..", "src")))

from scorecard import (ScoreCardConfig, 
                      ScoreCardState, 
                      ConnectionManager,
                      ScoreCardModeling,
                      ScoreCardRag,
                      ScoreCardPipeline,
                      Horizon,              # Multi-horizon support
                      SUPPORTED_HORIZONS)   # List of supported horizons


config = ScoreCardConfig()
state = ScoreCardState(config)
conn = ConnectionManager(config, state)
modeler = ScoreCardModeling(config, state, conn)
rag = ScoreCardRag(config, state, conn)

pipeline = ScoreCardPipeline(config, state, conn, modeler)

This notebook demonstrates a **hybrid model pipeline** for vendor performance classification and explanation — integrating **statistical modeling**, **vector search**, and **generative reasoning**. The architecture connects classical machine learning, dense embedding retrieval, and GPT-based justification into one coherent system.

---

## Hybrid Modeling Approach

The system has three main layers:

---

### 1. **Logistic Regression over Linguistic Features**

Implemented in the `ScoreCardModeling` class — this is the **statistical core** of the ScoreCard system.

---

#### What is Logistic Regression?

Logistic regression is a **classification algorithm** that predicts the probability of each possible outcome (here Green, Yellow, or Red).  
It combines the input features linearly and passes the resulting scores through a **sigmoid** (two classes) or **softmax** (three or more classes) function, producing probabilities between 0 and 1 that sum to 1 across classes.

$$
P(y = k \mid \vec{x}) = \text{softmax}(W\vec{x} + b)_k, \qquad k \in \{\text{Green}, \text{Yellow}, \text{Red}\}
$$

Where:
- $\vec{x}$ = vectorized representation of each supplier note and its metadata  
- $W$ = learned feature weights, one row per class (interpretable coefficients)  
- $b$ = bias vector, one entry per class  

Because logistic regression is **transparent**, we can inspect which words and features drive predictions — and even attach **confidence intervals** to them.
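
As a worked illustration of the softmax formula above, the following sketch computes class probabilities for a single note by hand. The weights, biases, and the two-feature input are made-up numbers chosen for readability; they are not values from the ScoreCard model.

```python
import math

CLASSES = ["Green", "Yellow", "Red"]

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, b, x):
    """Compute softmax(Wx + b) for a single feature vector x."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + bias
              for row, bias in zip(W, b)]
    return softmax(scores)

# Hypothetical model over two features: counts of "missed" and "resolved"
W = [[-1.0, 1.5],   # Green row
     [0.5, 0.0],    # Yellow row
     [1.2, -1.0]]   # Red row
b = [0.5, 0.0, -0.5]
x = [2, 0]          # "missed" appears twice, "resolved" never

probs = predict_proba(W, b, x)
for name, p in zip(CLASSES, probs):
    print(f"{name}: {p:.3f}")
```

With these numbers the Red row receives the highest raw score, so Red also receives the highest probability. Each row of `W` can be read directly as the pull of each feature toward that class.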

---

#### Vectorization: Turning Text into Numbers

LLMs and machine learning models can only “understand” text once it’s represented as **vectors** — numerical encodings of meaning.  
In this pipeline, we **vectorize nouns, verbs, and adjectives** from supplier notes to capture both *what happened* (nouns) and *how it was described* (verbs/adjectives).

Example:

> “Vendor missed the critical delivery deadline.”

becomes feature tokens like:
- **Nouns:** `vendor`, `delivery`, `deadline`  
- **Verbs/Adjectives:** `missed`, `critical`

These are then transformed into numeric vectors using **Count**, **TF-IDF**, or **Embedding** encoders.
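
To make the Count and TF-IDF encodings concrete, here is a minimal pure-Python sketch. It assumes the part-of-speech filtering has already happened and that each note arrives as a list of tokens; the three notes are invented. The TF-IDF weighting uses the smoothed form `ln((1 + n) / (1 + df)) + 1` without length normalization, which is one common convention and not necessarily the one the pipeline applies.

```python
import math
from collections import Counter

# Tokens as they might look after part-of-speech filtering (illustrative notes)
docs = [
    ["vendor", "missed", "critical", "delivery", "deadline"],
    ["vendor", "delivered", "early"],
    ["delivery", "deadline", "missed"],
]

# Fixed vocabulary: one vector dimension per distinct token
vocab = sorted({tok for doc in docs for tok in doc})

def count_vector(tokens):
    """Raw term counts in vocabulary order."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def tfidf_vector(tokens):
    """Term counts weighted by smoothed inverse document frequency."""
    n = len(docs)
    counts = Counter(tokens)
    vec = []
    for term in vocab:
        df = sum(term in doc for doc in docs)   # number of notes containing the term
        idf = math.log((1 + n) / (1 + df)) + 1
        vec.append(counts[term] * idf)
    return vec

print(vocab)
print(count_vector(docs[0]))
print([round(v, 2) for v in tfidf_vector(docs[0])])
```

In the TF-IDF vector, a token that appears in only one note (`critical`) gets more weight than one that appears in several (`vendor`), which is the point of the weighting.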

---

#### Word Embedding Analogy

Vectorization allows words to occupy positions in a multidimensional space, preserving relationships between meanings.  
A famous example is:

$$
\text{king} - \text{man} + \text{woman} \approx \text{queen}
$$

In the ScoreCard context, this concept generalizes to vendor language:

> “missed deadline” − “late” + “resolved” → closer to “on time”

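This is how embedding-based encoders capture semantic nuance beyond raw text; Count and TF-IDF vectors, by contrast, treat each token as an independent dimension and encode no similarity between words.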

![Word embedding analogy: king - man + woman = queen](https://kawine.github.io/blog/assets/parallelogram.png)
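
The analogy arithmetic can be reproduced with toy vectors. The sketch below uses hand-picked two-dimensional "embeddings" (a royalty axis and a maleness axis) purely for illustration; real embeddings are learned and have hundreds of dimensions.

```python
import math

# Hypothetical 2-D embeddings: (royalty, maleness)
vectors = {
    "king":  (0.9, 0.8),
    "man":   (0.1, 0.9),
    "woman": (0.1, 0.1),
    "queen": (0.9, 0.1),
    "apple": (0.0, 0.5),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word nearest to a - b + c, excluding the three inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman"))
```

Subtracting `man` removes the maleness component and adding `woman` leaves the royalty component intact, so the nearest remaining word is `queen`.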

---

#### Building and Tuning the Model

The `build_model_grid` method performs a **structured grid search** across:

- Feature sets (linguistic vs. metadata)
- Sampling strategies (e.g., downsample overrepresented Green cases)
- Vectorizer types (`count`, `tfidf`, etc.)
- Class weights (penalize risky misclassifications more heavily)

This process — known as **hyperparameter tuning** — systematically trains dozens of logistic models to find the best balance between **interpretability** and **predictive strength**.

![Logistic regression S-curve](Images/logreg.jpg)
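
The shape of such a grid can be sketched with `itertools.product`. This is not the body of `build_model_grid`; the option names below are placeholders (only `count` and `tfidf` are named in the text), and the point is simply that every combination of the four dimensions becomes one candidate model.

```python
from itertools import product

# Option names are illustrative; only "count" and "tfidf" come from the text above
feature_sets = ["main_words_only", "main_words_plus_metadata"]
sampling = ["no_downsample", "downsample_green"]
vectorizers = ["count", "tfidf"]
class_weights = [
    None,                          # unweighted
    {0: 0.5, 1: 1.35, 2: 1.15},    # example weights favoring Yellow and Red
]

def build_grid():
    """Enumerate every combination as a dict describing one candidate model."""
    return [
        {"features": f, "sampling": s, "vectorizer": v, "class_weight": w}
        for f, s, v, w in product(feature_sets, sampling, vectorizers, class_weights)
    ]

grid = build_grid()
print(len(grid), "candidate configurations")
print(grid[0])
```

Two options along each of four dimensions already yields 16 candidates, which is why a grid of this kind quickly reaches "dozens" of models.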

---

#### Evaluation and Selection

Each candidate model is trained and tested (80/20 split) and scored on:
- **False negatives** — our top priority (don’t classify risky vendors as “green”)  
- **Overall accuracy**  
- **Confusion matrix structure**

Selection rule:
1. Minimize false negatives  
2. Use test accuracy as a tie-breaker  

This ensures a **safety-sensitive bias** toward over-detection rather than under-detection.
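
The selection rule amounts to a lexicographic comparison: false negatives first, accuracy second. A minimal sketch, assuming each candidate has already been scored on the test split (the `Candidate` class and the numbers are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    false_negatives: int   # Yellow/Red cases predicted as Green on the test split
    accuracy: float        # test accuracy

def select_best(candidates):
    """Fewest false negatives wins; higher accuracy breaks ties."""
    return min(candidates, key=lambda c: (c.false_negatives, -c.accuracy))

# Made-up results for three candidates
results = [
    Candidate("count_weighted", false_negatives=12, accuracy=0.81),
    Candidate("tfidf_weighted", false_negatives=12, accuracy=0.84),
    Candidate("tfidf_unweighted", false_negatives=30, accuracy=0.90),
]

best = select_best(results)
print(best.name)
```

Note that the most accurate candidate loses here: it misses more risky vendors, and the rule never trades false negatives for accuracy.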

---

#### Why This Matters

Unlike opaque neural models, logistic regression provides:
- **Transparent coefficients** showing which terms push a note toward Red, Yellow, or Green  
- **Probabilistic outputs**: class probabilities that sum to 1 and can be checked for calibration on held-out data  
- **Explainable logic** where every prediction can be traced to specific linguistic or metadata features  

Together, this forms the **quantitative foundation** of the ScoreCard system:  
> The model predicts with math — the LLM later explains with language.
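
One way to put a number on coefficient uncertainty is a Wald interval. The sketch below assumes a coefficient estimate and its standard error are already available; scikit-learn's `LogisticRegression` does not report standard errors, so they would have to come from another source such as a bootstrap or a statistics package. The values are hypothetical.

```python
import math

def wald_interval(beta, std_err, z=1.96):
    """Approximate 95% confidence interval for a coefficient (z = 1.96)."""
    return beta - z * std_err, beta + z * std_err

def as_odds_ratio(interval):
    """Exponentiate a log-odds interval to express it as an odds ratio."""
    low, high = interval
    return math.exp(low), math.exp(high)

# Hypothetical coefficient and standard error for the token "missed"
beta, std_err = 0.80, 0.20

ci = wald_interval(beta, std_err)
print(f"log-odds: {beta:.2f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
print("odds-ratio CI: ({:.2f}, {:.2f})".format(*as_odds_ratio(ci)))
```

An interval that excludes 0 on the log-odds scale (equivalently, 1 on the odds-ratio scale) indicates that the token's effect is distinguishable from noise at that confidence level.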


### 2. **Vector Store of Notes (RAG)**

Past scorecard notes are embedded and indexed in a local Elasticsearch dense vector store (`scorecard_rag_notes`).

![RAG](Images/RAG.png)


- Each note → 1024-dimensional vector via `SentenceTransformer` (“BAAI/bge-large-en-v1.5”)  
- Metadata (date, SID, label) stored for contextual retrieval  

Managed by the `ScoreCardRag` class:
- `embed_and_index_notes`: builds embeddings and indexes  
- `similar_notes_same_sid`: finds earlier notes for the same supplier  
- `get_vendor_trouble_notes`: retrieves non-green notes across contracts  

This allows **semantic retrieval** — finding related issues without keyword overlap.
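
The retrieval idea behind a method like `similar_notes_same_sid` can be sketched without Elasticsearch: keep only earlier notes from the same SID, then rank them by cosine similarity to the query note. This brute-force version is an illustration under assumed field names (`sid`, `year`, `month`, `vec`), not the class's actual implementation, which delegates the search to the vector index.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similar_earlier_notes(query, notes, top_k=2):
    """Rank earlier notes from the same SID by similarity to the query note."""
    earlier = [
        n for n in notes
        if n["sid"] == query["sid"]
        and (n["year"], n["month"]) < (query["year"], query["month"])
    ]
    ranked = sorted(earlier, key=lambda n: cosine(query["vec"], n["vec"]), reverse=True)
    return ranked[:top_k]

# Toy 3-D vectors stand in for the 1024-dimensional embeddings
notes = [
    {"id": "a", "sid": 4, "year": 2017, "month": 6,  "vec": (0.9, 0.1, 0.0)},
    {"id": "b", "sid": 4, "year": 2017, "month": 11, "vec": (0.1, 0.9, 0.2)},
    {"id": "c", "sid": 4, "year": 2018, "month": 5,  "vec": (0.8, 0.2, 0.1)},  # later than the query
    {"id": "d", "sid": 3, "year": 2017, "month": 1,  "vec": (0.9, 0.1, 0.0)},  # different SID
]
query = {"id": "q", "sid": 4, "year": 2018, "month": 1, "vec": (1.0, 0.0, 0.0)}

print([n["id"] for n in similar_earlier_notes(query, notes)])
```

Restricting the candidates to earlier notes keeps the retrieved context free of information from the future relative to the note being explained.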

---

### 3. **GPT-Based Justification Engine**

After prediction, GPT-4o generates a natural-language **justification** of the label.

Prompts include:
- The vendor’s note history (`sid_key_0–3`)
- Similar notes retrieved via RAG
- Vendor-level context (other non-green scores)

Executed through:
- `ScoreCardRag.generate_justifications`: prompt construction and GPT call  
- `run_gpt_justification_pass`: parallelized execution with backoff/retry logic  

Example output:

> “This vendor is trending yellow due to repeated late-delivery notes and unresponsiveness, consistent with similar issues in SID 301884.”

Thus, the LLM supplies **qualitative reasoning** that complements the model’s **quantitative prediction**.
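
The parallel execution with backoff and retry can be sketched with the standard library alone. The helper names and the simulated failure below are hypothetical stand-ins; `run_gpt_justification_pass` itself is not shown in this notebook, so this illustrates the general pattern only.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def call_with_backoff(fn, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on exceptions with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args), attempt
        except Exception:  # real client code would catch only rate-limit or timeout errors
            if attempt == max_attempts:
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, ... plus random jitter
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))

def run_parallel(fn, items, workers=4, **retry_kwargs):
    """Apply fn to every item concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: call_with_backoff(fn, item, **retry_kwargs), items))

# Stand-in for the GPT call: fails on the first attempt for even-numbered notes
_seen = set()
def flaky_justify(note_id):
    if note_id % 2 == 0 and note_id not in _seen:
        _seen.add(note_id)
        raise RuntimeError("simulated rate limit")
    return f"justification for note {note_id}"

# sleep is replaced with a no-op so the demonstration runs instantly
results = run_parallel(flaky_justify, [1, 2, 3, 4], sleep=lambda s: None)
for text, attempts in results:
    print(f"{text} (attempt {attempts})")
```

Passing `sleep` as a parameter is a design choice for testability; in production the default `time.sleep` applies the real delays.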

---

### Pipeline Integration

The `ScoreCardPipeline` class orchestrates end-to-end flow:
1. `_stage_1_download` — load data from SQL / Elasticsearch  
2. `_stage_2_text_enrichment` — NLP preprocessing and note windowing  
3. `_stage_3_modeling_and_prediction` — train or load the best logistic model, generate predictions  
4. Optionally trigger GPT justifications

Overall:

$$
\text{Prediction} = \text{LogisticRegression}(\text{Linguistic Features})
$$  

$$
\text{Explanation} = \text{GPT-4o}(\text{Note History + Vector Similarity})
$$  

This hybrid pipeline delivers **statistical precision**, **semantic memory**, and **contextual explanations** — bridging classical analytics with modern generative AI.


In [3]:
pipeline.run()

[SQL] 	28118 rows downloaded and stored in state.details_df.
[ES] 	Indexed 28118 documents to 'scorecard_details' with 0 errors.
[PIPE] 	Completed all Downloads from SQL or ES


[ES] 	Indexed 28087 documents to 'scorecard_enriched' with 0 errors.
[ES] 	Indexed 28087 documents to 'scorecard_sid_history' with 0 errors.
[PIPE] 	Completed all Text Prep Steps
[DATA] 	Filtered to 23982 trainable rows with total_notes >= 5 and valid target (0/1/2) 14.6% culled
[MODL] 	Rehydrated best model: complete_main_words_only | no_downsample_weighted | count | Weights {0: 0.5, 1: 1.35, 2: 1.15}
[ML] 	Best model: complete_main_words_only | no_downsample_weighted | count | Weights {0: 0.5, 1: 1.35, 2: 1.15}
[ES] 	Indexed 0 documents to 'scorecard_model_summary' with 0 errors.
[ES] 	Indexed 28087 documents to 'scorecard_predictions' with 0 errors.
[ML] 	Predictions uploaded to 'scorecard_predictions'
[JOIN] 	Merged predictions into enriched_df. Final shape: (28087, 56)


## Multi-Horizon Prediction Results

The pipeline now supports **two prediction horizons**:

| Horizon | Description | Target | Minimum Notes |
|---------|-------------|--------|---------------|
| **H1** | Next card prediction | 1 step ahead | 5 notes |
| **H2** | Card after next | 2 steps ahead | 6 notes |

### Key Design Decisions

1. **No Recursive Predictions**: H2 is trained directly from historical data, NOT conditioned on H1 predictions
2. **Backward Compatibility**: H1 uses original column names (`trainable`, `target`, `predicted_label`); H2 uses `_h2` suffix
3. **Separate Models**: Each horizon has its own trained model with independent evaluation metrics
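
The "no recursive predictions" decision can be illustrated by how direct targets are built. In the sketch below, each note's H1 target is the label one card later and its H2 target is the label two cards later, both read straight from history. How the pipeline constructs its targets is not shown in this notebook, so treat the function and the sample history as assumptions.

```python
def horizon_targets(labels, horizon):
    """Pair each position i with the label observed `horizon` cards later (None if unavailable)."""
    return [
        labels[i + horizon] if i + horizon < len(labels) else None
        for i in range(len(labels))
    ]

# One supplier's chronological ratings (0=Green, 1=Yellow, 2=Red), made up for illustration
history = [0, 0, 1, 2, 1, 0]

target_h1 = horizon_targets(history, 1)
target_h2 = horizon_targets(history, 2)

print(target_h1)
print(target_h2)
```

Because the H2 target comes from observed labels rather than from an H1 prediction, errors in the H1 model cannot propagate into H2 training. The trailing `None` entries also show why H2 has fewer trainable rows than H1.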

### Output Columns

After running `pipeline.run()`, `state.complete_df` contains:

**H1 Columns (original names):**
- `predicted_label` - H1 predicted label (0=Green, 1=Yellow, 2=Red)
- `prob_green`, `prob_yellow`, `prob_red` - H1 probabilities
- `predicted_color` - Human-readable H1 prediction

**H2 Columns (with _h2 suffix):**
- `predicted_label_h2` - H2 predicted label
- `prob_green_h2`, `prob_yellow_h2`, `prob_red_h2` - H2 probabilities
- `predicted_color_h2` - Human-readable H2 prediction

In [None]:
import pandas as pd
from IPython.display import display, Markdown

def display_one_note(row: pd.Series) -> None:
    """
    Nicely displays a single ScoreCard prediction with H1 and H2 results.
    Updated for multi-horizon prediction support.
    
    Example use:
        display_one_note(pipeline.state.complete_df.sample(1).iloc[0])
    """

    sid = row.get("SID", "unknown")
    note = row.get("Scorecard_Note", "").strip()
    supplier = row.get("Supplier_Name", "").strip()
    program = row.get("Program_Name", "")
    year = row.get("Note_Year", "")
    
    # H1 prediction (original column names)
    color_h1 = row.get("predicted_color", "Unknown").title()
    p_green_h1 = row.get("prob_green", 0)
    p_yellow_h1 = row.get("prob_yellow", 0)
    p_red_h1 = row.get("prob_red", 0)
    
    # H2 prediction (with _h2 suffix)
    color_h2 = row.get("predicted_color_h2", "N/A")
    if pd.notna(color_h2) and color_h2 != "N/A":
        color_h2 = str(color_h2).title()
    else:
        color_h2 = "N/A"
    p_green_h2 = row.get("prob_green_h2", None)
    p_yellow_h2 = row.get("prob_yellow_h2", None)
    p_red_h2 = row.get("prob_red_h2", None)

    # Choose style by H1 color
    if color_h1 == "Red":
        style = '<hr style="border-top: 3px double #d33;">'
    elif color_h1 == "Yellow":
        style = '<hr style="border-top: 3px double #e6c300;">'
    else:
        style = '<hr style="border-top: 3px double #3a9c35;">'

    # Format H2 probabilities (may be missing if not enough notes)
    if p_green_h2 is not None and pd.notna(p_green_h2):
        h2_probs = f"""- Green: {p_green_h2:.5f}  
- Yellow: {p_yellow_h2:.5f}  
- Red: {p_red_h2:.5f}"""
    else:
        h2_probs = "*(Insufficient note history for H2 prediction)*"

    # Markdown block with both horizons
    md = f"""
{style}

**SID:** `{sid}`  **Program:** *{program}* **Supplier:** *{supplier}*  **Year:** {year}  

---

**Scorecard Note:**  
> {note}

---

### H1: Next Card Prediction
**Predicted Rating: {color_h1}**

**Model Probabilities**  
- Green: {p_green_h1:.5f}  
- Yellow: {p_yellow_h1:.5f}  
- Red: {p_red_h1:.5f}  

---

### H2: Card After Next Prediction
**Predicted Rating: {color_h2}**

**Model Probabilities**  
{h2_probs}

{style}
"""
    display(Markdown(md))

In [None]:
# Sample a note and display both H1 and H2 predictions
sample = pipeline.state.complete_df[[
    "SID", 
    "Scorecard_Note",
    "Supplier_Name",
    "Note_Year",
    "Program_Name",
    "color_set",
    # H1 columns (original names)
    "prob_green",
    "prob_yellow",
    "prob_red",
    "predicted_color",
    # H2 columns (with _h2 suffix)
    "prob_green_h2",
    "prob_yellow_h2",
    "prob_red_h2",
    "predicted_color_h2"
]].sample()

display_one_note(sample.iloc[0])

### Horizon Comparison Analysis

Let's examine how H1 and H2 predictions differ across the dataset.

In [None]:
# Horizon comparison analysis
df_complete = pipeline.state.complete_df

# Count predictions by horizon
h1_counts = df_complete['predicted_color'].value_counts()
h2_counts = df_complete['predicted_color_h2'].dropna().value_counts()

print("=" * 50)
print("H1 (Next Card) Prediction Distribution:")
print("=" * 50)
for color, count in h1_counts.items():
    print(f"  {color}: {count:,} ({100*count/len(df_complete):.1f}%)")

print("\n" + "=" * 50)
print("H2 (Card After Next) Prediction Distribution:")
print("=" * 50)
h2_total = df_complete['predicted_color_h2'].notna().sum()
for color, count in h2_counts.items():
    print(f"  {color}: {count:,} ({100*count/h2_total:.1f}%)")
print(f"\n  (H2 predictions available for {h2_total:,} of {len(df_complete):,} rows)")

# Show cases where H1 and H2 predictions differ
both_available = df_complete.dropna(subset=['predicted_color', 'predicted_color_h2'])
differs = both_available[both_available['predicted_color'] != both_available['predicted_color_h2']]
print("\n" + "=" * 50)
print("Horizon Disagreement:")
print("=" * 50)
# Guard against division by zero when no row has both predictions
if len(both_available) > 0:
    print(f"  {len(differs):,} rows ({100*len(differs)/len(both_available):.1f}%) have different H1 vs H2 predictions")
else:
    print("  No rows have both H1 and H2 predictions")

In [None]:
# Access per-horizon model information
print("=" * 60)
print("Best Model Keys by Horizon:")
print("=" * 60)

for horizon in SUPPORTED_HORIZONS:
    key = state.best_model_key_by_horizon.get(int(horizon), "Not trained")
    print(f"\n  Horizon {horizon} (H{int(horizon)}):")
    print(f"    {key}")

# Show per-horizon predictions DataFrames
print("\n" + "=" * 60)
print("Per-Horizon Predictions DataFrames:")
print("=" * 60)
for horizon in SUPPORTED_HORIZONS:
    df_h = state.predictions_df_by_horizon.get(int(horizon))
    if df_h is not None:
        print(f"\n  H{int(horizon)}: {len(df_h):,} rows, {len(df_h.columns)} columns")
    else:
        print(f"\n  H{int(horizon)}: Not available")

## Model Validation: Confusion Matrices & Metrics

This section evaluates the predictive performance of both H1 (next card) and H2 (card after next) models using:

- **Confusion Matrix**: Shows true vs predicted labels for each class (Green/Yellow/Red)
- **Classification Metrics**: Precision, Recall, F1-Score per class
- **Overall Accuracy**: Percentage of correct predictions
- **False Negative Analysis**: Critical for safety - identifying risky vendors misclassified as Green

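Note: Validation is performed only on **trainable** rows where ground truth targets are available. If these rows include data the model was trained on, the metrics below will be optimistic relative to the held-out 80/20 test scores.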

In [None]:
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

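def plot_confusion_matrix(y_true, y_pred, title, labels=('Green', 'Yellow', 'Red')):
    """
    Plot a confusion matrix with counts and row-normalized percentages.
    """
    labels = list(labels)  # tuple default avoids a shared mutable default argument
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
    # Row-normalize; a class with no true samples yields a zero row instead of NaN
    row_sums = cm.sum(axis=1, keepdims=True)
    cm_normalized = np.divide(cm, row_sums, out=np.zeros(cm.shape, dtype=float), where=row_sums > 0)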
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Raw counts
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=labels, yticklabels=labels, ax=axes[0])
    axes[0].set_xlabel('Predicted')
    axes[0].set_ylabel('Actual')
    axes[0].set_title(f'{title} - Counts')
    
    # Normalized (percentages)
    sns.heatmap(cm_normalized, annot=True, fmt='.1%', cmap='Blues',
                xticklabels=labels, yticklabels=labels, ax=axes[1])
    axes[1].set_xlabel('Predicted')
    axes[1].set_ylabel('Actual')
    axes[1].set_title(f'{title} - Normalized by Row')
    
    plt.tight_layout()
    plt.show()
    
    return cm

def compute_false_negatives(y_true, y_pred):
    """
    Compute false negatives: Yellow/Red predicted as Green (safety-critical).
    """
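    # Accept lists or Series as well as arrays, and use vectorized sums
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fn_yellow = int(np.sum((y_true == 1) & (y_pred == 0)))  # Yellow predicted as Green
    fn_red = int(np.sum((y_true == 2) & (y_pred == 0)))     # Red predicted as Green
    total_yellow = int(np.sum(y_true == 1))
    total_red = int(np.sum(y_true == 2))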
    
    return {
        'yellow_as_green': fn_yellow,
        'yellow_total': total_yellow,
        'yellow_fn_rate': fn_yellow / total_yellow if total_yellow > 0 else 0,
        'red_as_green': fn_red,
        'red_total': total_red,
        'red_fn_rate': fn_red / total_red if total_red > 0 else 0,
        'total_fn': fn_yellow + fn_red
    }

def print_validation_report(y_true, y_pred, horizon_name):
    """
    Print comprehensive validation metrics.
    """
    print("=" * 70)
    print(f"  {horizon_name} VALIDATION REPORT")
    print("=" * 70)
    
    # Overall accuracy
    acc = accuracy_score(y_true, y_pred)
    print(f"\nOverall Accuracy: {acc:.4f} ({100*acc:.2f}%)")
    print(f"Total Samples: {len(y_true):,}")
    
    # Classification report
    print("\n" + "-" * 70)
    print("Classification Report:")
    print("-" * 70)
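    # Fix the label set so the report still works when a class is absent from the data
    print(classification_report(y_true, y_pred,
                                labels=[0, 1, 2],
                                target_names=['Green (0)', 'Yellow (1)', 'Red (2)'],
                                digits=4, zero_division=0))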
    
    # False negative analysis (safety-critical)
    fn_stats = compute_false_negatives(y_true, y_pred)
    print("-" * 70)
    print("False Negative Analysis (Safety-Critical):")
    print("-" * 70)
    print(f"  Yellow misclassified as Green: {fn_stats['yellow_as_green']:,} / {fn_stats['yellow_total']:,} ({100*fn_stats['yellow_fn_rate']:.2f}%)")
    print(f"  Red misclassified as Green:    {fn_stats['red_as_green']:,} / {fn_stats['red_total']:,} ({100*fn_stats['red_fn_rate']:.2f}%)")
    print(f"  Total False Negatives:         {fn_stats['total_fn']:,}")
    print("=" * 70 + "\n")
    
    return fn_stats

### H1: Next Card Prediction Validation

Evaluating the model that predicts the **next** scorecard rating (1 step ahead).

In [None]:
# H1 Validation: Filter to trainable rows with valid targets
df_complete = pipeline.state.complete_df.copy()

# H1 uses original column names: 'trainable' and 'target'
h1_valid = df_complete[
    (df_complete['trainable'] == True) & 
    (df_complete['target'].isin([0, 1, 2])) &
    (df_complete['predicted_label'].notna())
].copy()

y_true_h1 = h1_valid['target'].astype(int).values
y_pred_h1 = h1_valid['predicted_label'].astype(int).values

print(f"H1 Validation Set: {len(h1_valid):,} trainable rows with valid targets\n")

# Print validation report
h1_fn_stats = print_validation_report(y_true_h1, y_pred_h1, "H1 (Next Card)")

# Plot confusion matrix
cm_h1 = plot_confusion_matrix(y_true_h1, y_pred_h1, "H1: Next Card Prediction")

### H2: Card After Next Prediction Validation

Evaluating the model that predicts the scorecard rating **two steps ahead**.

Note: H2 requires more historical notes (6+), so the validation set is smaller than H1.

In [None]:
# H2 Validation: Filter to trainable_h2 rows with valid targets
# H2 uses _h2 suffix: 'trainable_h2' and 'target_h2'

# Check if H2 columns exist
if 'trainable_h2' in df_complete.columns and 'target_h2' in df_complete.columns:
    h2_valid = df_complete[
        (df_complete['trainable_h2'] == True) & 
        (df_complete['target_h2'].isin([0, 1, 2])) &
        (df_complete['predicted_label_h2'].notna())
    ].copy()
    
    if len(h2_valid) > 0:
        y_true_h2 = h2_valid['target_h2'].astype(int).values
        y_pred_h2 = h2_valid['predicted_label_h2'].astype(int).values
        
        print(f"H2 Validation Set: {len(h2_valid):,} trainable rows with valid targets\n")
        
        # Print validation report
        h2_fn_stats = print_validation_report(y_true_h2, y_pred_h2, "H2 (Card After Next)")
        
        # Plot confusion matrix
        cm_h2 = plot_confusion_matrix(y_true_h2, y_pred_h2, "H2: Card After Next Prediction")
    else:
        print("No valid H2 validation data available.")
        print("Ensure the pipeline was run with H2 model training enabled.")
else:
    print("H2 columns not found in complete_df.")
    print("Ensure the pipeline was run with multi-horizon support enabled.")

### Horizon Comparison Summary

Side-by-side comparison of H1 and H2 model performance.

In [None]:
# Side-by-side comparison of H1 and H2 performance
from sklearn.metrics import precision_recall_fscore_support

def get_metrics_summary(y_true, y_pred):
    """Extract key metrics for comparison."""
    acc = accuracy_score(y_true, y_pred)
    precision, recall, f1, support = precision_recall_fscore_support(
        y_true, y_pred, labels=[0, 1, 2], average=None, zero_division=0
    )
    fn_stats = compute_false_negatives(y_true, y_pred)
    
    return {
        'accuracy': acc,
        'precision_green': precision[0],
        'precision_yellow': precision[1],
        'precision_red': precision[2],
        'recall_green': recall[0],
        'recall_yellow': recall[1],
        'recall_red': recall[2],
        'f1_green': f1[0],
        'f1_yellow': f1[1],
        'f1_red': f1[2],
        'fn_yellow_rate': fn_stats['yellow_fn_rate'],
        'fn_red_rate': fn_stats['red_fn_rate'],
        'total_fn': fn_stats['total_fn'],
        'samples': len(y_true)
    }

# Build comparison table
h1_metrics = get_metrics_summary(y_true_h1, y_pred_h1)

comparison_data = {
    'Metric': [
        'Validation Samples',
        'Overall Accuracy',
        '',
        'Precision - Green',
        'Precision - Yellow', 
        'Precision - Red',
        '',
        'Recall - Green',
        'Recall - Yellow',
        'Recall - Red',
        '',
        'F1-Score - Green',
        'F1-Score - Yellow',
        'F1-Score - Red',
        '',
        'False Neg Rate (Yellow→Green)',
        'False Neg Rate (Red→Green)',
        'Total False Negatives'
    ],
    'H1 (Next Card)': [
        f"{h1_metrics['samples']:,}",
        f"{h1_metrics['accuracy']:.4f}",
        '',
        f"{h1_metrics['precision_green']:.4f}",
        f"{h1_metrics['precision_yellow']:.4f}",
        f"{h1_metrics['precision_red']:.4f}",
        '',
        f"{h1_metrics['recall_green']:.4f}",
        f"{h1_metrics['recall_yellow']:.4f}",
        f"{h1_metrics['recall_red']:.4f}",
        '',
        f"{h1_metrics['f1_green']:.4f}",
        f"{h1_metrics['f1_yellow']:.4f}",
        f"{h1_metrics['f1_red']:.4f}",
        '',
        f"{h1_metrics['fn_yellow_rate']:.2%}",
        f"{h1_metrics['fn_red_rate']:.2%}",
        f"{h1_metrics['total_fn']:,}"
    ]
}

# Add H2 metrics if available
if 'y_true_h2' in dir() and len(y_true_h2) > 0:
    h2_metrics = get_metrics_summary(y_true_h2, y_pred_h2)
    comparison_data['H2 (Card After Next)'] = [
        f"{h2_metrics['samples']:,}",
        f"{h2_metrics['accuracy']:.4f}",
        '',
        f"{h2_metrics['precision_green']:.4f}",
        f"{h2_metrics['precision_yellow']:.4f}",
        f"{h2_metrics['precision_red']:.4f}",
        '',
        f"{h2_metrics['recall_green']:.4f}",
        f"{h2_metrics['recall_yellow']:.4f}",
        f"{h2_metrics['recall_red']:.4f}",
        '',
        f"{h2_metrics['f1_green']:.4f}",
        f"{h2_metrics['f1_yellow']:.4f}",
        f"{h2_metrics['f1_red']:.4f}",
        '',
        f"{h2_metrics['fn_yellow_rate']:.2%}",
        f"{h2_metrics['fn_red_rate']:.2%}",
        f"{h2_metrics['total_fn']:,}"
    ]

comparison_df = pd.DataFrame(comparison_data)
print("=" * 70)
print("  HORIZON COMPARISON SUMMARY")
print("=" * 70)
display(comparison_df.set_index('Metric'))

In [None]:
# Visual comparison of H1 vs H2 key metrics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

metrics_to_plot = ['accuracy', 'recall_yellow', 'recall_red']
metric_labels = ['Overall Accuracy', 'Yellow Recall', 'Red Recall']
colors = ['#2ecc71', '#f1c40f', '#e74c3c']

for idx, (metric, label, color) in enumerate(zip(metrics_to_plot, metric_labels, colors)):
    values = [h1_metrics[metric]]
    labels = ['H1']
    
    if 'h2_metrics' in dir():
        values.append(h2_metrics[metric])
        labels.append('H2')
    
    bars = axes[idx].bar(labels, values, color=color, alpha=0.7, edgecolor='black')
    axes[idx].set_title(label, fontsize=12, fontweight='bold')
    axes[idx].set_ylim(0, 1.0)
    axes[idx].set_ylabel('Score')
    
    # Add value labels on bars
    for bar, val in zip(bars, values):
        axes[idx].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                      f'{val:.3f}', ha='center', va='bottom', fontsize=11)

plt.suptitle('H1 vs H2 Model Performance Comparison', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# False negative comparison
fig, ax = plt.subplots(figsize=(10, 5))

fn_metrics = ['fn_yellow_rate', 'fn_red_rate']
fn_labels = ['Yellow → Green\n(False Negative)', 'Red → Green\n(False Negative)']
x = np.arange(len(fn_labels))
width = 0.35

h1_fn_values = [h1_metrics[m] for m in fn_metrics]
bars1 = ax.bar(x - width/2, h1_fn_values, width, label='H1 (Next Card)', color='#3498db', alpha=0.8)

if 'h2_metrics' in dir():
    h2_fn_values = [h2_metrics[m] for m in fn_metrics]
    bars2 = ax.bar(x + width/2, h2_fn_values, width, label='H2 (Card After Next)', color='#9b59b6', alpha=0.8)

ax.set_ylabel('False Negative Rate')
ax.set_title('False Negative Rates by Horizon (Lower is Better)', fontsize=12, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(fn_labels)
ax.legend()
ax.set_ylim(0, max(h1_fn_values + (h2_fn_values if 'h2_metrics' in dir() else [])) * 1.3 + 0.05)

# Add value labels
for bar in bars1:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
           f'{bar.get_height():.1%}', ha='center', va='bottom', fontsize=10)
if 'h2_metrics' in dir():
    for bar in bars2:
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
               f'{bar.get_height():.1%}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

In [4]:
rag.embed_and_index_notes()

[EMBD] 	Encoding 28087 notes for embedding...


[EMBD] 	Token count stats: {
  "count": 28087.0,
  "mean": 187.10382027272402,
  "std": 314.3341133517609,
  "min": 0.0,
  "25%": 32.0,
  "50%": 93.0,
  "75%": 241.0,
  "max": 13213.0
}
[ES] 	Deleted existing index 'scorecard_rag_notes'
[ES] 	Created index 'scorecard_rag_notes' with vector mapping
[ES] 	Indexed 28087 documents to 'scorecard_rag_notes'


In [5]:
rag.run_gpt_justification_pass()

[RAG] 	No prior notes found for SID 4 before 2018-1
[RAG] 	No prior notes found for SID 4 before 2018-1
[RAG] 	No prior notes found for SID 4 before 2018-1
[RAG] 	No prior notes found for SID 4 before 2018-1
[RAG] 	No prior notes found for SID 4 before 2018-1
[RAG] 	No prior notes found for SID 3 before 2018-8
[RAG] 	No prior notes found for SID 3 before 2018-8
[RAG] 	No prior notes found for SID 4 before 2018-1
[RAG] 	No prior notes found for SID 3 before 2018-8
[RAG] 	No prior notes found for SID 1 before 2017-12
[RAG] 	No prior notes found for SID 4 before 2018-1
[RAG] 	No prior notes found for SID 3 before 2018-8
[RAG] 	No prior notes found for SID 4 before 2018-1
[RAG] 	No prior notes found for SID 2 before 2016-2
[GPT] 	[15] 000002.2016.03.000008 succeeded (attempt 1)
[GPT] 	[30] 000004.2018.01.000023 succeeded (attempt 1)
[GPT] 	[24] 000004.2018.01.000015 succeeded (attempt 1)
[GPT] 	[27] 000004.2018.01.000018 succeeded (attempt 1)
[GPT] 	[22] 000003.2020.03.004593 succeeded (at

In [8]:
from elasticsearch import Elasticsearch, helpers
import pandas as pd

def fetch_index_as_dataframe(es_client, index_name, scroll='2m', page_size=1000) -> pd.DataFrame:
    """
    Fetches all documents from an Elasticsearch index and returns them as a DataFrame.
    
    Parameters:
    - es_client: Elasticsearch client instance (e.g., pipeline.conn.es_client)
    - index_name: Name of the index to retrieve
    - scroll: Scroll duration to keep context alive
    - page_size: Number of documents per scroll batch

    Returns:
    - pd.DataFrame containing all _source fields from the index
    """
    results = []

    # Initial search opens the scroll context and returns the first page
    resp = es_client.search(
        index=index_name,
        scroll=scroll,
        size=page_size,
        body={"query": {"match_all": {}}}
    )
    scroll_id = resp['_scroll_id']

    try:
        hits = resp['hits']['hits']
        # Keep scrolling until a page comes back empty
        while hits:
            results.extend(doc['_source'] for doc in hits)
            resp = es_client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = resp['_scroll_id']
            hits = resp['hits']['hits']
    finally:
        # Always release the scroll context, even if a page request fails
        es_client.clear_scroll(scroll_id=scroll_id)

    return pd.DataFrame(results)
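
The scroll pattern used in `fetch_index_as_dataframe` can also be written as a generator, which makes the paging logic easy to exercise without a live cluster. The sketch below is illustrative only: `iter_scroll_sources` and `FakeScrollClient` are hypothetical names that do not appear in the notebook, and the fake client merely mimics the response shape of the three client calls already used above (`search`, `scroll`, `clear_scroll`).

```python
from typing import Any, Dict, Iterator


def iter_scroll_sources(es_client: Any, index_name: str,
                        scroll: str = "2m", page_size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield each document's _source, one scroll page at a time."""
    resp = es_client.search(
        index=index_name,
        scroll=scroll,
        size=page_size,
        body={"query": {"match_all": {}}},
    )
    scroll_id = resp["_scroll_id"]
    try:
        hits = resp["hits"]["hits"]
        while hits:
            for doc in hits:
                yield doc["_source"]
            resp = es_client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = resp["_scroll_id"]
            hits = resp["hits"]["hits"]
    finally:
        # Release the scroll context even if the caller stops early or a request fails
        es_client.clear_scroll(scroll_id=scroll_id)


class FakeScrollClient:
    """Hypothetical in-memory stand-in for the Elasticsearch client (illustration only)."""

    def __init__(self, docs, page_size):
        self._pages = [docs[i:i + page_size] for i in range(0, len(docs), page_size)]
        self._next = 0
        self.cleared = False

    def _page(self):
        # Return the next page of hits, or an empty page once the data is exhausted
        docs = self._pages[self._next] if self._next < len(self._pages) else []
        self._next += 1
        return {"_scroll_id": "fake-scroll", "hits": {"hits": [{"_source": d} for d in docs]}}

    def search(self, **kwargs):
        self._next = 0
        return self._page()

    def scroll(self, **kwargs):
        return self._page()

    def clear_scroll(self, **kwargs):
        self.cleared = True


fake = FakeScrollClient([{"SID": n} for n in range(5)], page_size=2)
rows = list(iter_scroll_sources(fake, "scorecard_rag_notes", page_size=2))
print(len(rows), fake.cleared)
```

Against a real client, the same generator could feed `pd.DataFrame(list(iter_scroll_sources(...)))`, assuming the index fits in memory.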



In [9]:
df = fetch_index_as_dataframe(pipeline.conn.es_client, "scorecard_rag_notes")



In [17]:
df.sample().to_dict()

{'SID': {27283: 2909},
 'Scorecard_Detail_Note_SID': {27283: 40044},
 'Scorecard_Note': {27283: 'Program security and the program are working accreditation items. Subcontract closeout to happen upon expiration of PoP on 4/1/25.'},
 'Note_Year': {27283: 2025},
 'Note_Month': {27283: '03'},
 'PO_Number': {27283: '4105668131'},
 'PO_Contract_Type': {27283: 'FFP'},
 'PO_Complexity_Level': {27283: 3},
 'PO_Lifecycle_Phase': {27283: 'Production'},
 'Supplier_Name': {27283: 'FEDERAL CONTRACTING, INC.'},
 'LM_Vendor_ID': {27283: 'LM1581528'},
 'Supplier_Site_Location': {27283: 'COLORADO SPRINGS, CO'},
 'Supplier_DandB_Number': {27283: '039142492'},
 'LOB_Name': {27283: 'LMSS CENTRAL'},
 'Program_Name': {27283: 'CIP HISTORICAL'},
 'Prime_Contract_Number': {27283: 'I30002'},
 'Program_Contract_Type': {27283: 'Capital'},
 'PO_Contract_Dollars_Mil': {27283: 3371898.63},
 'PO_Funding_Dollars_Mil': {27283: 3371898.63},
 'PO_Definitization_Status': {27283: 'DEF'},
 'PO_Funding_Override_Ind': {27283: 

In [13]:
df['justification'].isna().sum()

0

In [None]:
# Clean up DataFrame for export - remove embeddings but keep H1 and H2 prediction columns
df = df.dropna(subset=['justification'])
exclude_cols = ['embedding', 'tokens', 'text_for_embedding']
df = df.drop(columns=exclude_cols, errors='ignore')

# Verify H2 columns are present
h2_cols = [c for c in df.columns if '_h2' in c]
print(f"H2 columns in export: {h2_cols}")
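
Before exporting, it may be worth confirming that each row's three class probabilities form a valid distribution. The sketch below is a hypothetical helper, not part of the pipeline. It assumes green, yellow, and red are the only classes, so that `prob_green`, `prob_yellow`, and `prob_red` (or their `_h2` counterparts) should sum to 1, and it works on plain dictionaries such as those produced by `df.to_dict("records")`.

```python
import math
from typing import Dict, Iterable, List

COLORS = ("green", "yellow", "red")


def _is_missing(value) -> bool:
    """True for None or NaN, the two ways a missing probability can show up."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def rows_with_bad_probabilities(records: Iterable[Dict], suffix: str = "",
                                tol: float = 1e-6) -> List[int]:
    """Return positions of records whose class probabilities do not sum to 1.

    Assumption: green/yellow/red are exhaustive, so a valid row sums to 1.
    Rows with no prediction at all (every probability missing) are skipped.
    """
    bad = []
    for pos, rec in enumerate(records):
        probs = [rec.get(f"prob_{color}{suffix}") for color in COLORS]
        if all(_is_missing(p) for p in probs):
            continue  # no prediction for this row, e.g. H2 with too little history
        if any(_is_missing(p) for p in probs):
            bad.append(pos)  # partially filled prediction
        elif not math.isclose(sum(probs), 1.0, abs_tol=tol):
            bad.append(pos)
    return bad


example = [
    {"prob_green": 0.7, "prob_yellow": 0.2, "prob_red": 0.1},
    {"prob_green": 0.7, "prob_yellow": 0.2, "prob_red": 0.3},
    {"prob_green": float("nan"), "prob_yellow": float("nan"), "prob_red": float("nan")},
]
print(rows_with_bad_probabilities(example))
```

For the H2 columns, the same check would be called with `suffix="_h2"`.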

In [15]:
df.to_csv("predictions.csv")

In [None]:
def display_one_note(row: pd.Series) -> None:
    """Display a single note with H1 and H2 predictions and GPT justification."""
    sid = row.get("sid_key", "unknown")
    program = row.get("Program_Name", "N/A")
    po_number = row.get("PO_Number", "N/A")
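    # Text fields can be NaN (a float) when missing, so coerce before calling .strip()
    note_raw = row.get("Scorecard_Note", "")
    note = str(note_raw).strip() if pd.notna(note_raw) else ""
    justification_raw = row.get("justification", "")
    justification = str(justification_raw).strip() if pd.notna(justification_raw) else ""

    # H1 prediction (original column names)
    color_raw = row.get("predicted_color", "Unknown")
    color_h1 = str(color_raw).title() if pd.notna(color_raw) else "Unknown"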
    p_green_h1 = f"{100 * row.get('prob_green', 0):.1f}%"
    p_yellow_h1 = f"{100 * row.get('prob_yellow', 0):.1f}%"
    p_red_h1 = f"{100 * row.get('prob_red', 0):.1f}%"
    
    # H2 prediction (with _h2 suffix)
    color_h2 = row.get("predicted_color_h2", None)
    if pd.notna(color_h2):
        color_h2 = str(color_h2).title()
        p_green_h2 = f"{100 * row.get('prob_green_h2', 0):.1f}%"
        p_yellow_h2 = f"{100 * row.get('prob_yellow_h2', 0):.1f}%"
        p_red_h2 = f"{100 * row.get('prob_red_h2', 0):.1f}%"
        h2_section = f"""### H2: Card After Next  
**Predicted Rating**: **{color_h2}**

- **Green**: {p_green_h2}  
- **Yellow**: {p_yellow_h2}  
- **Red**: {p_red_h2}"""
    else:
        h2_section = "### H2: Card After Next  \n*(Insufficient note history)*"

    # Safe handling for possibly missing or NaN target_label
    target_raw = row.get("target_label", "")
    target = str(target_raw).strip() if pd.notnull(target_raw) else "N/A"

    md = f"""
<hr style="border-top: 3px double red;">

**SID**: `{sid}`   **Program**: `{program}`   **PO Number**: `{po_number}`

---

### H1: Next Card  
**Predicted Rating**: **{color_h1}**

- **Green**: {p_green_h1}  
- **Yellow**: {p_yellow_h1}  
- **Red**: {p_red_h1}

---

{h2_section}

---

**Ground Truth**: **{target}**

---

**Raw Note**  
> {note}

---

**GPT Justification (H1)**  
{justification if justification else "*(No justification generated)*"}

---
"""
    display(Markdown(md))
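
The percentage f-strings in `display_one_note` render a missing probability as `nan%`, because `row.get(..., 0)` only falls back to `0` when the column is absent, not when its value is NaN. One possible guard is a small formatting helper. `format_pct` is a hypothetical name introduced here for illustration, and the `"N/A"` placeholder is an assumption about how a missing value should be shown.

```python
import math


def format_pct(value, missing: str = "N/A") -> str:
    """Format a probability in [0, 1] as a percentage string, tolerating None and NaN."""
    if value is None:
        return missing
    try:
        number = float(value)
    except (TypeError, ValueError):
        return missing  # not a number at all
    if math.isnan(number):
        return missing
    return f"{100 * number:.1f}%"


print(format_pct(0.25))
print(format_pct(float("nan")))
print(format_pct(None))
```

Inside `display_one_note`, a call such as `format_pct(row.get("prob_green"))` could stand in for the inline f-string.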

In [None]:
from IPython.display import Markdown, display

sampled_df = sample_random_rag_notes()
for _, row in sampled_df.iterrows():
    display_one_note(row)