Skip to content

prabujayant/PromptShield

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PromptShield – Real‑Time LLM Safety Gateway

PromptShield is a lightweight security layer that sits in front of any Large Language Model and inspects user prompts before they ever reach the model. It spots jailbreaks, decodes obfuscated payloads, cleans up risky input, and re-wraps safe prompts so attackers can’t rely on fixed patterns.

Think of it as a smart WAF for prompts: fast, transparent, and explainable.


Why it matters

  • LLMs routinely see untrusted input: chats, tickets, emails, agent instructions.
  • A single prompt injection can bypass safety, leak sensitive context, or trigger dangerous tools.
  • Most apps rely only on the model vendor’s guardrails and have no visibility into what was attempted.

PromptShield adds a focused layer in front of any LLM endpoint so you can:

  • Detect and neutralize prompt injection attempts in a few milliseconds.
  • Keep a forensic trail of attacks, scores, and transformations.
  • Wrap and normalize prompts so downstream models stay simple and safe.

What PromptShield does

  • Analyzes every prompt with multiple lightweight detectors:
    • targeted regex and keyword heuristics for direct jailbreak cues,
    • entropy and encoding probes for base64/obfuscation,
    • a supervised risk model trained on benign vs. adversarial prompts.
  • Scores and categorizes each input into:
    • pass – safe enough, send as-is,
    • sanitize – clean it, then continue,
    • block – stop and respond safely.
  • Sanitizes risky content:
    • decodes obvious payloads,
    • strips or redacts dangerous directives,
    • normalizes weird whitespace and control characters,
    • records each step so you can audit exactly what changed.
  • Re-wraps prompts with polymorphic templates (PPA), so even cleaned input is hard to reuse as a universal jailbreak.
  • Logs everything into SQLite with per-layer scores, indicators, timings, and templates, ready for dashboards or exports.

Architecture at a glance

          +----------------+        +--------------------+        +----------------+
User ---> | Your App / API | -----> |  PromptShield API  | -----> |   Target LLM   |
          +----------------+        +--------------------+        +----------------+
                                             |
                      +----------------------+----------------------+
                      |                                             |
              +---------------+                           +------------------+
              | Detection     |                           | Sanitization     |
              | Engine        |                           | + PPA Wrapping   |
              +---------------+                           +------------------+
                      |                                             |
                      +----------------------+----------------------+
                                             v
                                     +--------------+
                                     | SQLite Log   |
                                     +--------------+
                                             |
                                             v
                                     +--------------+
                                     | React UI     |
                                     | Dashboard    |
                                     +--------------+

How it works (at a glance)

  1. Prompt in
    Your application sends a user prompt to PromptShield instead of directly to the LLM.

  2. Risk analysis
    The detection engine extracts cheap structural and lexical features, runs them through:

    • rule-based detectors, and
    • a trained logistic-regression risk model loaded from on-disk artifacts.
  3. Decision & transformation
    Based on the total score, PromptShield:

    • lets the prompt pass untouched,
    • sanitizes and annotates it, or
    • blocks it and produces a safe fallback message.
  4. Wrapping & forwarding
    Safe or sanitized prompts are wrapped with a randomized safe template and sent to your LLM or an internal handler.

  5. Telemetry & insights
    Every request captures:

    • which signals fired (regex/keywords/entropy/ML),
    • why an action was chosen,
    • what the sanitizer did,
    • latency and template ID, all visible in the React dashboard and accessible via API.

Key signals and metrics

  • Layered risk score
    Every prompt is mapped to a 0–100 risk score with a clear breakdown by detector family: regex, entropy, keyword, and ML anomaly. The UI makes it obvious which layer contributed most to the final decision.

  • Action bands with hard outcomes
    Pass, sanitize, and block are not fuzzy labels: each one drives a different downstream behavior, from forwarding the raw prompt to returning a safe refusal when risk is too high.

  • Designed for low latency
    The entire pipeline is built from cheap, CPU‑friendly operations (string scans, simple statistics, and a compact linear model). In practice this keeps end‑to‑end gateway time comfortably below normal network latency for typical prompts.

  • Built‑in observability
    Every request records timing, scores, indicators, and sanitization steps, and the gateway maintains running counters for action ratios and average latency. This makes it easy to plug PromptShield into dashboards and alerting systems.


What you can show in a demo

  • A benign prompt flowing through with a low score and “PASS” action.
  • A straightforward jailbreak (“ignore previous instructions…”) being:
    • flagged by regex and keyword detectors,
    • assigned a high risk score,
    • sanitized with directives redacted,
    • wrapped in a safe template.
  • A “content policy” bypass attempt that looks innocent but gets caught by the learned model and new heuristics.
  • The timeline view: raw → sanitized → wrapped, plus a human-readable list of everything the gateway did.
  • The history and metrics view showing:
    • pass/sanitize/block ratios,
    • average latency,
    • top threat indicators over recent traffic.

Running PromptShield locally

  • Backend: Python + FastAPI, runs on CPU only.
  • Frontend: React + Vite dashboard.
  • Storage: local SQLite database; no external dependencies.

High-level flow:

  1. Install backend Python dependencies and start the API server.
  2. Install frontend dependencies and start the dashboard.
  3. Point the dashboard at the backend URL (default http://localhost:8000).
  4. Paste prompts into the UI and watch scores, insights, and transformations update in real time.

Optional toggles:

  • set an API key header to protect your gateway,
  • cap maximum prompt length,
  • retrain the risk model when you add more examples to the training corpus,
  • run the included evaluation script to compute detection and latency statistics.

Where PromptShield fits

  • Customer-facing chatbots and support agents that must not leak internal notes, previous tickets, or hidden system messages.
  • Internal copilots and workflow engines that can trigger real actions such as ticket creation, deployments, or CRM writes.
  • RAG and tool-using agents that ingest untrusted documents, URLs, or email-style content from the open world.

Outcomes at a glance

  • Sub-millisecond gateway processing on a typical laptop for ordinary prompts, thanks to compact CPU-only detectors and a small linear model.
  • Clear separation between benign and adversarial prompts, with multi-layer risk scores typically several times higher on attacks than on normal traffic.
  • Full audit trail for every analyzed prompt: raw, sanitized, and wrapped views, plus scores, indicators, actions, and latency tied together in one place.

Walkthrough snapshots

Analyze View

Risk Breakdown

Timeline + Insights

Attack History


Why PromptShield stands out

  • Purpose-built for prompt injection, not a generic firewall.
  • Fast and frugal: handmade features and compact models designed to run comfortably on CPUs.
  • Explainable by design: every decision surfaces signals, scores, and a step-by-step timeline.
  • Easy to adopt: simple HTTP API, self-contained dashboard, and a clear story for plugging in front of any LLM.

PromptShield is meant to be small, sharp, and ready to sit between untrusted users and powerful models. Use it as a gateway today, and as a foundation for more advanced safety systems tomorrow.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors