## Coverage analysis

This notebook evaluates the production system against the test split of the Safe-Guard Prompt Injection Dataset. The system has three components: a static keyword checker, a TF-IDF classifier, and an LLM-based agent.

The notebook also compares the coverage of prompt safety detection across these three approaches. The goal is to understand the strengths, overlaps, and gaps of each method, and to judge whether additional pipeline components are needed.

## Setup

In this section, we install the dependencies required to run the code in this notebook, define common variables used throughout, and add the project root to `PYTHONPATH` so we can use components from the `src/` folder.

In [None]:
import sys
import os

# Add project root to path
sys.path.append(os.path.abspath(".."))

In [None]:
# flake8-noqa-cell
import json
from dataclasses import dataclass
from typing import cast

import plotly.graph_objects as go
from datasets import DatasetDict, load_dataset
from datasets.arrow_dataset import Column
from sklearn.metrics import classification_report, confusion_matrix

from src.analyzers import KeywordChecker, TfidfClassifier

In [None]:
# Synthetic prompt injection dataset: https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection.
dataset_id = "xTRam1/safe-guard-prompt-injection"

In [None]:
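# `__file__` is not defined inside a notebook, so the literal string "__file__" is resolved
# against the current working directory, which is assumed to be the notebooks folder
notebooks_dir = os.path.dirname(os.path.abspath("__file__"))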
plots_dir = os.path.abspath(os.path.join(notebooks_dir, "..", "docs", "content", "plots"))
models_dir = os.path.abspath(os.path.join(notebooks_dir, "..", "models"))
src_dir = os.path.abspath(os.path.join(notebooks_dir, "..", "src"))
data_dir = os.path.abspath(os.path.join(notebooks_dir, "..", "data"))

### System evaluation

In this section, we evaluate each component on the test split of the project dataset, and then combine their predictions to produce a unified classification report for the complete system.

In [None]:
# Load project dataset
dataset = cast(DatasetDict, load_dataset(dataset_id))
X_test, y_test = cast(Column, dataset["test"]["text"]), cast(Column, dataset["test"]["label"])

In [None]:
# Get list of predictions for static keyword analyzer
keyword_checker = KeywordChecker()
y_pred_keyword = []
for prompt in X_test:
    report = keyword_checker.analyze(prompt=prompt)

    if report is None:
        # The keyword checker didn't find any unsafe keywords in the prompt
        y_pred_keyword.append(0)
    else:
        y_pred_keyword.append(1)

In [None]:
# Get list of predictions for TF-IDF classifier
tfidf_classifier = TfidfClassifier()
y_pred_tfidf = []
for prompt in X_test:
    report = tfidf_classifier.analyze(prompt=prompt)
    y_pred_tfidf.append(report.label)

Instead of re-running the LLM-based solution, let's load the evaluation data from a file...

In [None]:
@dataclass
class EvaluationResult:
    """
    Represents the results of a model evaluation.
    """

    y_true: list[int]
    y_pred: list[int]
    failed_indices: list[int]


# Path to the saved JSON file
# result_filepath = os.path.join(data_dir, "llm_safety_eval_Mistral-7B-Instruct-v0.3.json")
result_filepath = os.path.join(data_dir, "llm_safety_eval_Qwen3-4B-Instruct-2507.json")

filename = os.path.basename(result_filepath)
llm_model_name = filename.replace("llm_safety_eval_", "").replace(".json", "")

# Load JSON from file
with open(result_filepath, "r") as f:
    data = json.load(f)

# Rebuild EvaluationResult instance
llm_result = EvaluationResult(y_true=data["y_true"], y_pred=data["y_pred"], failed_indices=data["failed_indices"])

In [None]:
# Failed indices are not included in llm_result.y_pred in the evaluation report, so we
# insert zeros (i.e. treat failed prompts as safe) at those positions to reconstruct
# the full list of LLM predictions
y_pred_llm = []
failed_set = set(llm_result.failed_indices)  # For O(1) lookups
llm_iter = iter(llm_result.y_pred)

for i in range(len(llm_result.y_pred) + len(llm_result.failed_indices)):
    if i in failed_set:
        y_pred_llm.append(0)
    else:
        y_pred_llm.append(next(llm_iter))
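
The reconstruction step above can also be written as a small reusable helper. The sketch below is illustrative only: the function name `reinsert_failed` and the `fill` parameter are hypothetical, and it assumes, as the cell above does, that `failed_indices` are positions in the original test set. It is checked here on toy data rather than on the real evaluation file.

```python
def reinsert_failed(y_pred: list[int], failed_indices: list[int], fill: int = 0) -> list[int]:
    """Rebuild a full-length prediction list, placing `fill` at each failed index."""
    failed_set = set(failed_indices)
    total = len(y_pred) + len(failed_set)

    # Every failed index must be a valid position in the reconstructed list
    if any(i < 0 or i >= total for i in failed_set):
        raise ValueError("failed index out of range")

    preds = iter(y_pred)
    return [fill if i in failed_set else next(preds) for i in range(total)]


# Toy data: five prompts, the LLM failed on positions 1 and 3
full = reinsert_failed(y_pred=[1, 0, 1], failed_indices=[1, 3])
print(full)  # [1, 0, 0, 0, 1]
```

Filling with `0` treats a failed prompt as safe, which can only lower the LLM's recall for unsafe prompts; passing `fill=1` would be the conservative alternative.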

In [None]:
# Combine predictions from all components: mark 1 if any component predicted 1, otherwise 0
assert len(y_pred_llm) == len(y_pred_tfidf) == len(y_pred_keyword) == len(y_test)
y_pred_system = [
    1 if llm == 1 or tfidf == 1 or keyword == 1 else 0
    for llm, tfidf, keyword in zip(y_pred_llm, y_pred_tfidf, y_pred_keyword)
]
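
Because the system prediction is a logical OR, it hides which component actually caught each unsafe prompt. A minimal sketch of a coverage breakdown is shown below, using toy labels and predictions; the helper name `coverage_breakdown` and the component names in the dictionary are hypothetical and not part of the project code. In the notebook, the same function could be called with `y_test` and the three prediction lists.

```python
from collections import Counter


def coverage_breakdown(y_true, preds_by_component):
    """Count, for each truly unsafe prompt, which set of components flagged it."""
    names = list(preds_by_component)
    counts = Counter()
    for i, label in enumerate(y_true):
        if label != 1:
            continue  # Coverage is only measured on unsafe prompts
        flagged_by = tuple(name for name in names if preds_by_component[name][i] == 1)
        counts[flagged_by] += 1
    return counts


# Toy labels and predictions, for illustration only
toy_true = [1, 1, 1, 0, 1]
toy_preds = {
    "keyword": [1, 0, 0, 0, 0],
    "tfidf": [1, 1, 0, 1, 0],
    "llm": [0, 1, 0, 0, 1],
}

breakdown = coverage_breakdown(toy_true, toy_preds)
for combo, n in breakdown.most_common():
    # An empty tuple means no component flagged the unsafe prompt
    print(combo or ("missed by all",), n)
```

Combinations with a single component name show where that component adds unique coverage, while the empty combination counts the unsafe prompts that the whole system misses.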

With the unified predictions for the complete system now computed, let's generate the performance metrics report and the corresponding confusion matrix.

In [None]:
def generate_confusion_matrix(y_pred: list[int], y: Column, title: str) -> go.Figure:
    """
    Build a Plotly heatmap of the confusion matrix for predictions `y_pred` against true labels `y`.
    """
    labels = ["Safe (0)", "Unsafe (1)"]
    cm = confusion_matrix(y, y_pred, labels=[0, 1])

    fig = go.Figure(
        data=go.Heatmap(
            z=cm,
            x=labels,
            y=labels,
            colorscale="Blues",
            hoverongaps=False,
            text=cm,
            texttemplate="%{text}",
            showscale=True,
            colorbar=dict(title="Count"),
        )
    )

    fig.update_layout(
        title=title,
        xaxis_title="Predicted Label",
        yaxis_title="True Label",
        yaxis=dict(autorange="reversed"),
        width=580,
        height=500,  # Make the plot square
        margin=dict(l=80, r=80, t=100, b=80),
    )

    return fig

In [None]:
print(classification_report(y_test, y_pred_system))

In [None]:
fig = generate_confusion_matrix(
    y_pred=y_pred_system,
    y=y_test,
    title=f"Test Set Confusion - System with {llm_model_name}",
)
fig.show()

Comparing the two LLM variants, the Mistral-based system achieved an overall accuracy of 94%, compared to 98% for the Qwen-based system. Both systems attained near-perfect recall for unsafe prompts; the main performance difference is in precision for unsafe prompts. The Qwen-based system achieved 96% precision (30 misclassifications), whereas the Mistral-based system achieved 84% precision (120 misclassifications). These results indicate that we should proceed with the Qwen-based system.
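
To make the precision and recall figures concrete, the sketch below shows how both metrics for the unsafe class follow from the raw counts. It uses toy labels, not the notebook's results, and the helper name `unsafe_precision_recall` is hypothetical; `classification_report` reports the same quantities for class `1`.

```python
def unsafe_precision_recall(y_true, y_pred):
    """Return (precision, recall, false_positives) for the unsafe class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # unsafe, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # safe, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # unsafe, missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fp


# Toy data, for illustration only: 3 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 1, 1, 0, 0, 0, 0]

precision, recall, false_positives = unsafe_precision_recall(toy_true, toy_pred)
print(precision, recall, false_positives)  # 0.75 0.75 1
```

Since the system flags a prompt when any component flags it, adding components can raise recall but never precision, which is why precision is the metric that separates the two systems here.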

In [None]:
# Save confusion matrix to file for use in the report
html_str = f"""
<div style="display: flex; justify-content: center;">
  {fig.to_html(full_html=False, include_plotlyjs='cdn')}
</div>
"""  # noqa: E702, E222
output_file = os.path.join(plots_dir, f"conf_matrix_system_test_{llm_model_name}.html")
with open(output_file, "w") as f:
    f.write(html_str)