PromptShield – Real‑Time LLM Safety Gateway

PromptShield is a lightweight security layer that sits in front of any Large Language Model and inspects user prompts before they ever reach the model. It spots jailbreaks, decodes obfuscated payloads, cleans up risky input, and re-wraps safe prompts so attackers can’t rely on fixed patterns.

Think of it as a smart WAF for prompts: fast, transparent, and explainable.

Why it matters

LLMs routinely see untrusted input: chats, tickets, emails, agent instructions.
A single prompt injection can bypass safety, leak sensitive context, or trigger dangerous tools.
Most apps rely only on the model vendor’s guardrails and have no visibility into what was attempted.

PromptShield adds a focused layer in front of any LLM endpoint so you can:

Detect and neutralize prompt injection attempts in a few milliseconds.
Keep a forensic trail of attacks, scores, and transformations.
Wrap and normalize prompts so downstream models stay simple and safe.

What PromptShield does

Analyzes every prompt with multiple lightweight detectors:
- targeted regex and keyword heuristics for direct jailbreak cues,
- entropy and encoding probes for base64/obfuscation,
- a supervised risk model trained on benign vs. adversarial prompts.
Scores and categorizes each input into:
- pass – safe enough, send as-is,
- sanitize – clean it, then continue,
- block – stop and respond safely.
Sanitizes risky content:
- decodes obvious payloads,
- strips or redacts dangerous directives,
- normalizes weird whitespace and control characters,
- records each step so you can audit exactly what changed.
Re-wraps prompts with polymorphic templates (PPA), so even cleaned input is hard to reuse as a universal jailbreak.
Logs everything into SQLite with per-layer scores, indicators, timings, and templates, ready for dashboards or exports.

Architecture at a glance

          +----------------+        +--------------------+        +----------------+
User ---> | Your App / API | -----> |  PromptShield API  | -----> |   Target LLM   |
          +----------------+        +--------------------+        +----------------+
                                             |
                      +----------------------+----------------------+
                      |                                             |
              +---------------+                           +------------------+
              | Detection     |                           | Sanitization     |
              | Engine        |                           | + PPA Wrapping   |
              +---------------+                           +------------------+
                      |                                             |
                      +----------------------+----------------------+
                                             v
                                     +--------------+
                                     | SQLite Log   |
                                     +--------------+
                                             |
                                             v
                                     +--------------+
                                     | React UI     |
                                     | Dashboard    |
                                     +--------------+

How it works (at a glance)

Prompt in
Your application sends a user prompt to PromptShield instead of directly to the LLM.
Risk analysis
The detection engine extracts cheap structural and lexical features, runs them through:
- rule-based detectors, and
- a trained logistic-regression risk model loaded from on-disk artifacts.
Decision & transformation
Based on the total score, PromptShield:
- lets the prompt pass untouched,
- sanitizes and annotates it, or
- blocks it and produces a safe fallback message.
Wrapping & forwarding
Safe or sanitized prompts are wrapped with a randomized safe template and sent to your LLM or an internal handler.
Telemetry & insights
Every request captures:
- which signals fired (regex/keywords/entropy/ML),
- why an action was chosen,
- what the sanitizer did,
- latency and template ID, all visible in the React dashboard and accessible via API.

Key signals and metrics

Layered risk score
Every prompt is mapped to a 0–100 risk score with a clear breakdown by detector family: regex, entropy, keyword, and ML anomaly. The UI makes it obvious which layer contributed most to the final decision.
Action bands with hard outcomes
Pass, sanitize, and block are not fuzzy labels: each one drives a different downstream behavior, from forwarding the raw prompt to returning a safe refusal when risk is too high.
Designed for low latency
The entire pipeline is built from cheap, CPU‑friendly operations (string scans, simple statistics, and a compact linear model). In practice this keeps end‑to‑end gateway time comfortably below normal network latency for typical prompts.
Built‑in observability
Every request records timing, scores, indicators, and sanitization steps, and the gateway maintains running counters for action ratios and average latency. This makes it easy to plug PromptShield into dashboards and alerting systems.

What you can show in a demo

A benign prompt flowing through with a low score and “PASS” action.
A straightforward jailbreak (“ignore previous instructions…”) being:
- flagged by regex and keyword detectors,
- assigned a high risk score,
- sanitized with directives redacted,
- wrapped in a safe template.
A “content policy” bypass attempt that looks innocent but gets caught by the learned model and new heuristics.
The timeline view: raw → sanitized → wrapped, plus a human-readable list of everything the gateway did.
The history and metrics view showing:
- pass/sanitize/block ratios,
- average latency,
- top threat indicators over recent traffic.

Running PromptShield locally

Backend: Python + FastAPI, runs on CPU only.
Frontend: React + Vite dashboard.
Storage: local SQLite database; no external dependencies.

High-level flow:

Install backend Python dependencies and start the API server.
Install frontend dependencies and start the dashboard.
Point the dashboard at the backend URL (default http://localhost:8000).
Paste prompts into the UI and watch scores, insights, and transformations update in real time.

Optional toggles:

set an API key header to protect your gateway,
cap maximum prompt length,
retrain the risk model when you add more examples to the training corpus,
run the included evaluation script to compute detection and latency statistics.

Where PromptShield fits

Customer-facing chatbots and support agents that must not leak internal notes, previous tickets, or hidden system messages.
Internal copilots and workflow engines that can trigger real actions such as ticket creation, deployments, or CRM writes.
RAG and tool-using agents that ingest untrusted documents, URLs, or email-style content from the open world.

Outcomes at a glance

Sub-millisecond gateway processing on a typical laptop for ordinary prompts, thanks to compact CPU-only detectors and a small linear model.
Clear separation between benign and adversarial prompts, with multi-layer risk scores typically several times higher on attacks than on normal traffic.
Full audit trail for every analyzed prompt: raw, sanitized, and wrapped views, plus scores, indicators, actions, and latency tied together in one place.

Walkthrough snapshots

Why PromptShield stands out

Purpose-built for prompt injection, not a generic firewall.
Fast and frugal: handmade features and compact models designed to run comfortably on CPUs.
Explainable by design: every decision surfaces signals, scores, and a step-by-step timeline.
Easy to adopt: simple HTTP API, self-contained dashboard, and a clear story for plugging in front of any LLM.

PromptShield is meant to be small, sharp, and ready to sit between untrusted users and powerful models. Use it as a gateway today, and as a foundation for more advanced safety systems tomorrow.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
backend		backend
frontend		frontend
.gitignore		.gitignore
README.md		README.md
image 2.png		image 2.png
image 3.png		image 3.png
image 4.png		image 4.png
image1.png		image1.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PromptShield – Real‑Time LLM Safety Gateway

Why it matters

What PromptShield does

Architecture at a glance

How it works (at a glance)

Key signals and metrics

What you can show in a demo

Running PromptShield locally

Where PromptShield fits

Outcomes at a glance

Walkthrough snapshots

Why PromptShield stands out

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PromptShield – Real‑Time LLM Safety Gateway

Why it matters

What PromptShield does

Architecture at a glance

How it works (at a glance)

Key signals and metrics

What you can show in a demo

Running PromptShield locally

Where PromptShield fits

Outcomes at a glance

Walkthrough snapshots

Why PromptShield stands out

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages