# Data exploration

In this notebook, we explore the [Safe-Guard Prompt Injection Dataset](https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection). The objectives are twofold: first, to understand the project dataset; and second, to identify any trends within the data that we can leverage in our solution.

## Setup

In this section, we import the dependencies required to run the code in this notebook and define the dataset identifier and output paths used later.

In [None]:
import math
import os
import random
import statistics
from collections import Counter
from math import log
from typing import Iterable, cast

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from datasets import DatasetDict, load_dataset
from scipy.stats import norm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.mixture import GaussianMixture

In [None]:
# Synthetic prompt injection dataset: https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection
dataset_id = "xTRam1/safe-guard-prompt-injection"

In [None]:
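# `__file__` is not defined in notebooks; abspath of the literal string resolves against the
# current working directory, so this assumes the notebook is run from the notebooks directory.
notebooks_dir = os.path.dirname(os.path.abspath("__file__"))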
plots_dir = os.path.abspath(os.path.join(notebooks_dir, "..", "docs", "content", "plots"))

## Preliminary analysis

In this section, we load the dataset from Hugging Face, examine its structure, review the class distribution, and inspect sample entries. Datasets available on the Hugging Face Hub can be accessed using the [`load_dataset()`](https://huggingface.co/docs/datasets/v4.0.0/en/package_reference/loading_methods#datasets.load_dataset) function from the [Datasets](https://pypi.org/project/datasets/) Python library.


In [None]:
dataset = cast(DatasetDict, load_dataset(dataset_id))
print(dataset)

In [None]:
X_train, y_train = dataset["train"]["text"], dataset["train"]["label"]
X_test, y_test = dataset["test"]["text"], dataset["test"]["label"]

To better understand the dataset, let’s manually inspect a few random examples.

In [None]:
# Split negative and positive examples
negatives = [text for text, label in zip(X_train, y_train) if label == 0]
positives = [text for text, label in zip(X_train, y_train) if label == 1]

# Randomly select one from each class
random_negative = random.choice(negatives)
random_positive = random.choice(positives)

print("Random negative example:")
print(random_negative)

print("\nRandom positive example:")
print(random_positive)

Class imbalance occurs when one class is significantly more represented than the others. If not properly mitigated, models trained on imbalanced datasets can exhibit bias by favoring the majority class. Let's examine the class distribution to identify any class imbalance that could impact our solution.

In [None]:
label_counts = Counter(y_train)

label_map = {0: "safe", 1: "unsafe"}
labels = [label_map[label] for label in label_counts.keys()]
counts = list(label_counts.values())
total = sum(counts)
percentages = [100 * c / total for c in counts]

fig = px.bar(
    x=labels,
    y=counts,
    text=[f"{p:.1f}%" for p in percentages],
    labels={"x": "Label", "y": "Number of prompts"},
    title="Label Distribution in Training Set",
)
fig.update_xaxes(type="category")
fig.update_traces(textposition="inside")

fig.show()

In [None]:
# Save label distribution figure to file for use in the report
html_str = pio.to_html(fig, full_html=False, include_plotlyjs="cdn")
output_file = os.path.join(plots_dir, "label_distribution.html")
with open(output_file, "w") as f:
    f.write(html_str)

We see here that the dataset is indeed imbalanced, consisting of approximately 70% safe prompts and 30% unsafe prompts.
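One common way to account for this kind of imbalance is to weight each class inversely to its frequency. The following is a minimal sketch and not necessarily the mitigation used in this project: the helper `balanced_class_weights` and the toy label list are hypothetical, and the formula `n_samples / (n_classes * class_count)` is the "balanced" heuristic used by scikit-learn.

```python
from collections import Counter


def balanced_class_weights(labels: list[int]) -> dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels with a 70/30 split, mirroring the proportions observed above
toy_labels = [0] * 70 + [1] * 30
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these weights, the minority (unsafe) class receives a larger weight than the majority (safe) class, so errors on unsafe prompts would count for more during training.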

## Text length analysis

In this section, we analyse the distribution of prompt lengths by class to see, for example, whether prompts of one class are systematically longer than those of the other.

In [None]:
def prompt_lengths(texts: Iterable[str]) -> list[int]:
    return [len(text.split()) for text in texts]


def get_stats(lengths: list[int]) -> dict:
    return {
        "Count": len(lengths),
        "Min": min(lengths),
        "Max": max(lengths),
        "Mean": round(statistics.mean(lengths), 2),
        "Variance": round(statistics.variance(lengths), 2),
    }


# Positive and negative texts
unsafe_prompts = [text for text, label in zip(X_train, y_train) if label == 1]
safe_prompts = [text for text, label in zip(X_train, y_train) if label == 0]

all_lengths = prompt_lengths(X_train)
unsafe_lengths = prompt_lengths(unsafe_prompts)
safe_lengths = prompt_lengths(safe_prompts)

data = {
    "Positive (unsafe)": get_stats(unsafe_lengths),
    "Negative (safe)": get_stats(safe_lengths),
}
df = pd.DataFrame(data).T

print(df)

In [None]:
# Shortest and longest unsafe prompt
shortest_unsafe = min(unsafe_prompts, key=lambda x: len(x.split()))
longest_unsafe = max(unsafe_prompts, key=lambda x: len(x.split()))

# Shortest and longest safe prompt
shortest_safe = min(safe_prompts, key=lambda x: len(x.split()))
longest_safe = max(safe_prompts, key=lambda x: len(x.split()))

print("Shortest unsafe prompt:\n", shortest_unsafe)
print("\nLongest unsafe prompt:\n", longest_unsafe)

print("\nShortest safe prompt:\n", shortest_safe)
print("\nLongest safe prompt:\n", longest_safe)

In [None]:
bins = np.arange(0, 2150, 50)  # bin edges every 50 words
bin_labels = [f"{b}-{b+50}" for b in bins[:-1]]

unsafe_counts, _ = np.histogram(unsafe_lengths, bins=bins)
safe_counts, _ = np.histogram(safe_lengths, bins=bins)

# Normalize counts to get proportions
unsafe_freqs = unsafe_counts / unsafe_counts.sum()
safe_freqs = safe_counts / safe_counts.sum()

fig = go.Figure()

fig.add_trace(go.Bar(x=bin_labels, y=unsafe_freqs, name="Positive (unsafe)", marker_color="red"))

fig.add_trace(go.Bar(x=bin_labels, y=safe_freqs, name="Negative (safe)", marker_color="green"))

fig.update_layout(
    barmode="group",
    title="Normalized Prompt Length Distribution by Label",
    xaxis_title="Number of Words in Prompt",
    yaxis_title="Proportion",
    xaxis_tickangle=-45,
    bargap=0.2,
    yaxis_type="log",
)
fig.update_layout(
    legend=dict(
        x=0.98,
        y=0.94,
        xanchor="right",
        yanchor="top",
        bgcolor="rgba(255,255,255,0.8)",
        bordercolor="black",
        borderwidth=1,
    )
)

fig.show()

In [None]:
# Save plot of normalized prompt length distribution by label to file for use in the report
html_str = pio.to_html(fig, full_html=False, include_plotlyjs="cdn")
output_file = os.path.join(plots_dir, "prompt_length_distribution.html")
with open(output_file, "w") as f:
    f.write(html_str)

Prompt length alone does not appear to give us a meaningful signal for separating the classes.

## Entropy analysis

In this section, we calculate the Shannon entropy of each prompt to quantify how diverse or unpredictable its tokens (whitespace-separated words) are.
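To build some intuition, the sketch below computes the entropy H = -Σ p·log2(p) over the token frequencies of two toy strings. The helper `token_entropy` and the example strings are hypothetical and only illustrate the idea: repeated tokens lower the entropy, while distinct tokens raise it.

```python
import math
from collections import Counter


def token_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the whitespace-token distribution of `text`."""
    tokens = text.split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / len(tokens)) * math.log2(c / len(tokens)) for c in counts.values())


repetitive = "a a b b"  # two distinct tokens, each with probability 0.5
diverse = "a b c d"  # four distinct tokens, each with probability 0.25

print(token_entropy(repetitive))  # 1.0
print(token_entropy(diverse))  # 2.0
```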

In [None]:
# Idea is that unsafe prompts might be more repetitive or formulaic.
def shannon_entropy(text: str) -> float:
    tokens = text.split()
    counts = Counter(tokens)
    probs = [count / len(tokens) for count in counts.values()]
    return -sum(p * math.log2(p) for p in probs)


entropies = [shannon_entropy(t) for t in X_train]

# Prepare DataFrame
df_entropy = pd.DataFrame({"entropy": entropies, "label": y_train})
df_entropy["label_name"] = df_entropy["label"].map({0: "Safe", 1: "Unsafe"})

# Histogram with normalized counts (relative frequencies)
fig_hist = px.histogram(
    df_entropy,
    x="entropy",
    color="label_name",
    nbins=30,
    barmode="group",
    opacity=0.6,
    histnorm="probability",  # <-- normalize within each class
    labels={"entropy": "Shannon Entropy", "label_name": "Prompt Class"},
    title="Normalized Entropy Distribution for Safe vs Unsafe Prompts",
    marginal="rug",  # optional
)
fig_hist.update_layout(bargap=0.1)

fig_hist.show()

# Boxplot distribution
# fig_box = px.box(
#     df_entropy, x="label_name", y="entropy",
#     color="label_name",
#     labels={"label_name": "Prompt Class", "entropy": "Shannon Entropy"},
#     title="Entropy Comparison between Safe and Unsafe Prompts"
# )
# fig_box.show()

In [None]:
def add_histogram(fig, data, bins, name, color):
    counts, bin_edges = np.histogram(data, bins=bins, density=False)
    bin_width = bin_edges[1] - bin_edges[0]
    fig.add_trace(
        go.Bar(
            x=bin_edges[:-1],
            y=counts,
            width=bin_width * 0.9,
            name=f"{name}",
            marker_color=color,
            opacity=0.5,
        )
    )
    return counts, bin_edges, bin_width


def add_normal_fit(fig, data, bin_width, color="green", name="Safe Fit (Normal)"):
    mu, std = norm.fit(data)
    x = np.linspace(min(data), max(data), 300)
    pdf = norm.pdf(x, mu, std)
    pdf_scaled = pdf * len(data) * bin_width
    fig.add_trace(go.Scatter(x=x, y=pdf_scaled, mode="lines", line=dict(color=color, width=3), name=name))


def add_gmm_fit(fig, data, bin_width, color="red", name="Unsafe Fit (GMM)", n_components=2):
    # Reshape data for GMM (expects 2D)
    data_reshaped = data.reshape(-1, 1)

    # Fit GMM
    gmm = GaussianMixture(n_components=n_components, random_state=0)
    gmm.fit(data_reshaped)

    # Create x-axis range for smooth plot
    x = np.linspace(min(data), max(data), 300).reshape(-1, 1)

    # Compute weighted sum of component PDFs for each x
    logprob = gmm.score_samples(x)
    pdf = np.exp(logprob)

    # Scale PDF to histogram counts
    pdf_scaled = pdf * len(data) * bin_width

    fig.add_trace(go.Scatter(x=x.flatten(), y=pdf_scaled, mode="lines", line=dict(color=color, width=3), name=name))


# Calculate entropies
entropies = [shannon_entropy(t) for t in X_train]
df_entropy = pd.DataFrame({"entropy": entropies, "label": y_train})
df_entropy["label_name"] = df_entropy["label"].map({0: "Safe", 1: "Unsafe"})

fig = go.Figure()
bins = 30

safe_data = df_entropy[df_entropy["label"] == 0]["entropy"].values
unsafe_data = df_entropy[df_entropy["label"] == 1]["entropy"].values

safe_counts, safe_bin_edges, safe_bin_width = add_histogram(fig, safe_data, bins, "Safe", "green")
unsafe_counts, unsafe_bin_edges, unsafe_bin_width = add_histogram(fig, unsafe_data, bins, "Unsafe", "red")

add_normal_fit(fig, safe_data, safe_bin_width)
# add_gmm_fit(fig, unsafe_data, unsafe_bin_width)
add_gmm_fit(fig, unsafe_data, unsafe_bin_width, n_components=3)

fig.update_layout(
    title="Entropy Distribution",
    xaxis_title="Shannon Entropy",
    yaxis_title="Count",
    barmode="overlay",
    bargap=0.3,
)

fig.update_layout(
    legend=dict(
        x=0.98,
        y=0.94,
        xanchor="right",
        yanchor="top",
        bgcolor="rgba(255,255,255,0.8)",
        bordercolor="black",
        borderwidth=1,
    )
)

fig.show()

Here we see that the entropy distribution of safe prompts is approximately normal, indicating the variation and diversity typical of natural language. Unsafe prompts, by contrast, are much more likely to exhibit lower entropy values in the range of 3 to 4.5, which suggests that they often contain repeated or formulaic wording and less linguistic diversity. That said, the tail of the unsafe distribution shows that some unsafe prompts have high entropy values and so retain considerable linguistic diversity.

In [None]:
# Save plot of entropy distribution to file for use in the report
html_str = pio.to_html(fig, full_html=False, include_plotlyjs="cdn")
output_file = os.path.join(plots_dir, "entropy_distribution.html")
with open(output_file, "w") as f:
    f.write(html_str)


## Exploring n-grams

In this section, we look at n-grams, which are contiguous sequences of n words. The idea is to determine whether n-grams separate safe and unsafe prompts and, if so, which n-gram lengths best separate the two classes. We will also inspect a few example n-grams to better understand common phrases in each class.

For example, unsafe prompts may contain certain phrases, such as “ignore all previous instructions”, that are exceedingly rare in safe prompts (i.e., “trigger” phrases). Or unsafe prompts may mention certain entities or roles repeatedly. If so, simple n-gram frequency statistics could be very useful for class separation.
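To make the definition of an n-gram concrete, here is a minimal sketch that extracts word n-grams from a single string. The helper `word_ngrams` is hypothetical and uses plain whitespace tokenisation; the `CountVectorizer` used in the rest of this section applies its own tokeniser and removes English stop words, so its output will differ.

```python
def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return all contiguous sequences of n lowercased, whitespace-separated words."""
    tokens = text.lower().split()
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


example = "Ignore all previous instructions"
print(word_ngrams(example, 2))
# [('ignore', 'all'), ('all', 'previous'), ('previous', 'instructions')]
```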

In [None]:
def get_top_ngrams(texts, ngram_range, top_k):
    """Extract top n-grams from list of texts."""
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words="english")
    X = vectorizer.fit_transform(texts)
    freqs = X.sum(axis=0).A1
    ngrams = vectorizer.get_feature_names_out()
    freq_df = pd.DataFrame({"ngram": ngrams, "count": freqs})

    return freq_df.sort_values(by="count", ascending=False).head(top_k)


# Separate safe and unsafe prompts
safe_texts = [text for text, label in zip(X_train, y_train) if label == 0]
unsafe_texts = [text for text, label in zip(X_train, y_train) if label == 1]

top_k = 10

for n in [1, 2, 3]:
    print(f"Top {n}-grams in safe prompts:")
    top_safe = get_top_ngrams(safe_texts, ngram_range=(n, n), top_k=top_k)
    print(top_safe)
    print()

    print(f"Top {n}-grams in unsafe prompts:")
    top_unsafe = get_top_ngrams(unsafe_texts, ngram_range=(n, n), top_k=top_k)
    print(top_unsafe)
    print("\n" + "=" * 40 + "\n")

To get a more objective understanding of whether n-grams are useful for class separation, we can shuffle the labels on the prompts and calculate the average log-likelihood ratio (LLR) for each shuffle. If the real average LLR is much higher than in these shuffled cases, it indicates that the n-grams capture meaningful patterns.
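One way to summarise such a comparison is an empirical p-value: the fraction of shuffles that score at least as high as the real labels. The sketch below is an optional addition rather than part of the analysis; the helper `empirical_p_value` is hypothetical, and the numbers are made up for illustration and are not results from this dataset.

```python
def empirical_p_value(real_score: float, shuffled_scores: list[float]) -> float:
    """Fraction of shuffles scoring at least as high as the real labels (with +1 smoothing)."""
    at_least_as_high = sum(score >= real_score for score in shuffled_scores)
    return (at_least_as_high + 1) / (len(shuffled_scores) + 1)


# Made-up illustrative numbers, not results from this dataset
toy_real = 50.0
toy_shuffled = [3.0, 4.5, 2.8, 5.1]
print(empirical_p_value(toy_real, toy_shuffled))  # 0.2
```

The +1 smoothing keeps the estimate above zero, so with only a handful of shuffles the smallest attainable value is `1 / (num_shuffles + 1)`.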

In [None]:
def shuffled_avg_llr_test(safe_texts, unsafe_texts, n, top_k=10, num_shuffles=10):
    """
    Compare average top-k LLR of real labels to shuffled labels.

    Parameters:
    - safe_texts: list of safe prompts (label=0)
    - unsafe_texts: list of unsafe prompts (label=1)
    - n: int, n-gram length
    - top_k: int, number of top LLRs to average
    - num_shuffles: int, how many times to shuffle for baseline

    Returns:
    - real_avg_llr: average top-k LLR with real labels
    - shuffle_avg_llrs: list of average top-k LLRs from shuffled labels
    """

    all_texts = safe_texts + unsafe_texts
    labels = [0] * len(safe_texts) + [1] * len(unsafe_texts)

    # Compute real average LLR
    real_avg_llr = get_avg_topk_llr_for_ngram_range(safe_texts, unsafe_texts, n, top_k=top_k)

    shuffle_avg_llrs = []
    for _ in range(num_shuffles):
        random.shuffle(labels)
        # Split texts based on shuffled labels
        shuffled_safe = [text for text, label in zip(all_texts, labels) if label == 0]
        shuffled_unsafe = [text for text, label in zip(all_texts, labels) if label == 1]

        # Compute average LLR for shuffled labels
        avg_llr_shuffled = get_avg_topk_llr_for_ngram_range(shuffled_safe, shuffled_unsafe, n, top_k=top_k)
        shuffle_avg_llrs.append(avg_llr_shuffled)

    return real_avg_llr, shuffle_avg_llrs


# Example usage.
# Note: get_avg_topk_llr_for_ngram_range is defined in a later cell, so run that cell first.
real_llr, shuffled_llrs = shuffled_avg_llr_test(safe_texts, unsafe_texts, n=1, top_k=10, num_shuffles=20)

print(f"Real average top-10 LLR: {real_llr:.4f}")
print(f"Mean shuffled average top-10 LLR: {np.mean(shuffled_llrs):.4f}")
print(f"Shuffled LLRs range: {min(shuffled_llrs):.4f} - {max(shuffled_llrs):.4f}")

To understand how useful different n-gram sizes are at distinguishing safe vs. unsafe prompts, let's plot average top-k log-likelihood ratio (LLR) as a function of n-gram length. Higher average LLR means those n-grams provide stronger, more consistent signals for class separation.
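For a single n-gram, the LLR compares an observed 2×2 contingency table (n-gram present or absent, in unsafe or safe prompts) with the counts we would expect if presence were independent of the label. The sketch below works through two toy tables using the form 2·Σ k·ln(k/E); the helper `llr_2x2` and the counts are hypothetical and not taken from the dataset.

```python
from math import log


def llr_2x2(k11: int, k12: int, k21: int, k22: int) -> float:
    """G^2 statistic: 2 * sum(observed * ln(observed / expected)) over the four cells."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)  # prompts with / without the n-gram
    cols = (k11 + k21, k12 + k22)  # unsafe / safe prompts
    observed = ((k11, k12), (k21, k22))
    llr = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            if observed[i][j] > 0:  # a zero cell contributes nothing to the sum
                llr += observed[i][j] * log(observed[i][j] / expected)
    return 2 * llr


# Made-up counts: the n-gram appears in 30 of 100 unsafe prompts and 5 of 200 safe prompts
skewed = llr_2x2(30, 5, 70, 195)
# Same n-gram frequency (10%) in both classes, so presence is independent of the label
independent = llr_2x2(10, 20, 90, 180)
print(f"skewed: {skewed:.2f}, independent: {independent:.2f}")
```

When the n-gram is equally frequent in both classes, the observed and expected counts coincide and the LLR is zero; the more its frequency differs between classes, the larger the LLR.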

In [None]:
def compute_llr(k11, k12, k21, k22):
    def safe_log(x):
        return log(x) if x > 0 else 0

    row_sum_1 = k11 + k12
    row_sum_2 = k21 + k22
    col_sum_1 = k11 + k21
    col_sum_2 = k12 + k22
    total = row_sum_1 + row_sum_2

    E11 = row_sum_1 * col_sum_1 / total
    E12 = row_sum_1 * col_sum_2 / total
    E21 = row_sum_2 * col_sum_1 / total
    E22 = row_sum_2 * col_sum_2 / total

    llr = 2 * (
        k11 * safe_log(k11 / E11) + k12 * safe_log(k12 / E12) + k21 * safe_log(k21 / E21) + k22 * safe_log(k22 / E22)
    )
    return llr


def get_avg_topk_llr_for_ngram_range(safe_texts, unsafe_texts, n, top_k=10):
    """
    Compute the average of the top-k LLR scores for n-grams of length n.

    Parameters:
    - safe_texts: list of safe prompt strings
    - unsafe_texts: list of unsafe prompt strings
    - n: int, n-gram length
    - top_k: int, number of top LLR scores to average (default 10)

    Returns:
    - avg_llr: float, average of the top_k highest log-likelihood ratios for n-grams of length n
    """
    vectorizer = CountVectorizer(ngram_range=(n, n), stop_words="english", binary=True)
    all_texts = safe_texts + unsafe_texts
    X = vectorizer.fit_transform(all_texts)

    features = vectorizer.get_feature_names_out()
    X_safe = X[: len(safe_texts)]
    X_unsafe = X[len(safe_texts) :]

    llr_scores = []
    for i in range(len(features)):
        k11 = X_unsafe[:, i].sum()
        k12 = X_safe[:, i].sum()
        k21 = len(unsafe_texts) - k11
        k22 = len(safe_texts) - k12
        llr = compute_llr(k11, k12, k21, k22)
        llr_scores.append(llr)

    llr_scores.sort(reverse=True)
    return sum(llr_scores[:top_k]) / top_k


def plot_llr_for_ngram_lengths(safe_texts, unsafe_texts, n, top_k=10):
    """
    Compute and plot average top-k LLR for n-gram lengths from 1 to n.

    Parameters:
    - safe_texts: list of safe prompt strings
    - unsafe_texts: list of unsafe prompt strings
    - n: int, maximum n-gram length to evaluate
    - top_k: int, number of top LLR scores to average (default 10)

    Returns:
    - fig: plotly Figure with one bar per n-gram length (call fig.show() to display it)
    """
    avg_llrs = []
    labels = []
    for i in range(1, n + 1):
        avg_llr = get_avg_topk_llr_for_ngram_range(safe_texts, unsafe_texts, i, top_k=top_k)
        avg_llrs.append(avg_llr)
        if i == 1:
            labels.append("Unigrams")
        elif i == 2:
            labels.append("Bigrams")
        elif i == 3:
            labels.append("Trigrams")
        else:
            labels.append(f"{i}-grams")

    fig = go.Figure(data=[go.Bar(x=labels, y=avg_llrs, marker_color="blue")])
    fig.update_layout(
        title=f"Average Top-{top_k} LLR by N-gram Length",
        xaxis_title="N-gram Type",
        yaxis_title="Average Log-Likelihood Ratio (LLR)",
        template="plotly_white",
    )
    return fig

In [None]:
# Note: This cell may take several minutes to run, especially for n-gram lengths of 3 or higher
fig = plot_llr_for_ngram_lengths(safe_texts, unsafe_texts, 5)
fig.show()

Here we see that unigrams provide the strongest discriminative signal for class separation, and this signal tends to decrease as the n-gram length increases. This means we should focus on unigrams (and maybe bigrams and trigrams) for our TF-IDF-based classifier.

In [None]:
# Save plot of average top-k LLR by n-gram length to file for use in the report
html_str = pio.to_html(fig, full_html=False, include_plotlyjs="cdn")
output_file = os.path.join(plots_dir, "avg_topk_llr_by_ngram.html")
with open(output_file, "w") as f:
    f.write(html_str)