A Self-Healing AI Agent That Finds, Reproduces, and Fixes Security Vulnerabilities, Automatically
Security vulnerabilities cost the industry an estimated $9.5 trillion annually. Code reviews are slow, manual, and error-prone. Existing tools detect issues, but they don't fix them. And they certainly don't validate their own fixes before deploying them.
What if an AI agent could run the entire pipeline (detect → reproduce → fix → validate → deploy) autonomously?
SecureCodeEnv++ is a production-ready AI environment where an autonomous agent:
- 🔍 Detects security vulnerabilities in source code
- 🔁 Reproduces the bug to confirm it's exploitable
- 🛠️ Generates a patch using Hugging Face models
- ✅ Validates the fix (syntax, tests, regression)
- 🚀 Makes the deployment decision (with confidence gating)
All of this runs inside a standardized, scored benchmark (OpenEnv spec), so agents can be compared, ranked, and improved objectively.
We started by defining the language of the system: what does the agent see, what can it do, and how is it scored?
Observation → "Here's vulnerable code. Find the problems."
Action → "I found these bugs; here are my fixes."
Reward → "You scored 0.87. Here's what you got right and wrong."
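As a minimal sketch, these three message types might look like the dataclasses below. The field names are illustrative assumptions, not the actual `models.py` schemas:

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """What the agent sees: a task id plus the vulnerable source."""
    task_id: str
    source_code: str

@dataclass
class Action:
    """What the agent submits: found vulnerabilities and proposed fixes."""
    vulnerabilities: list = field(default_factory=list)
    fixes: list = field(default_factory=list)

@dataclass
class Reward:
    """What the grader returns: a 0-1 score plus per-component feedback."""
    score: float
    feedback: dict = field(default_factory=dict)

# Illustrative round trip for the easy task
obs = Observation("SEC-EASY-001", 'AWS_KEY = "..."')  # placeholder source
act = Action(vulnerabilities=["hardcoded_secret"], fixes=["load key from env"])
rew = Reward(score=0.87, feedback={"detection": "1/1"})
```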
We then created 3 progressively harder security challenges:
| Task | Difficulty | What's Wrong |
|---|---|---|
| SEC-EASY-001 | 🟢 Easy | AWS secret keys hardcoded directly in Python source |
| SEC-MED-001 | 🟡 Medium | SQL injection via string concatenation + no input validation in Flask |
| SEC-HARD-001 | 🔴 Hard | 6 simultaneous vulnerabilities: eval(), pickle.loads(), path traversal, unsafe YAML, SQL injection, os.system() command injection |
Every task is frozen and deterministic: same input, same expected output, every single time.
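Frozen tasks can be expressed as plain data. This is a hypothetical sketch of what a registry entry might look like, not the real `tasks.py` (the vulnerability type names are assumptions):

```python
# Hypothetical task registry: each task pins its expected findings,
# so grading the same submission always yields the same score.
TASKS = {
    "SEC-EASY-001": {
        "difficulty": "easy",
        "expected_vulns": {"hardcoded_secret"},
    },
}

def get_task(task_id: str) -> dict:
    """Look up a frozen task definition by id."""
    return TASKS[task_id]
```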
We built a 3-component grading system that makes evaluation objective and reproducible:
Total score (0–1.0):

| Component | Weight | Question | How It's Scored |
|---|---|---|---|
| 🔍 Vulnerability Detection | ×0.4 | Did you find the right bugs? | Set intersection of predicted vs. expected types |
| 📝 Explanation Quality | ×0.3 | Can you explain WHY it's a bug? | Keyword matching on critical concepts |
| 🔧 Fix Correctness | ×0.3 | Does your patch actually work? | Token overlap + structure vs. reference fixes |
No LLM judge. No randomness. Pure deterministic scoring.
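A sketch of how such a weighted, rule-based score could be computed. The function names and the detection formula are assumptions, not the actual `grader.py`:

```python
def detection_score(predicted: set, expected: set) -> float:
    """Set intersection: fraction of expected vulnerability types found."""
    return len(predicted & expected) / len(expected) if expected else 1.0

def total_score(detection: float, explanation: float, fix: float) -> float:
    """Weighted sum mirroring the 0.4 / 0.3 / 0.3 breakdown."""
    return 0.4 * detection + 0.3 * explanation + 0.3 * fix

# Example: found 1 of 2 expected bugs, decent explanation, good fix
s = total_score(
    detection_score({"sql_injection"}, {"sql_injection", "eval"}),
    explanation=0.8,
    fix=0.9,
)
# s == 0.4*0.5 + 0.3*0.8 + 0.3*0.9 ≈ 0.71
```

Because every component is a pure function of the submission, re-running the grader on the same action always produces the same number.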
We implemented the OpenEnv standard: three methods that make our environment plug-and-play with any agent:
```python
env = SecureCodeEnv()
obs = env.reset("SEC-HARD-001")   # Agent sees the vulnerable code
reward = env.step(agent_action)   # Agent submits findings → gets scored
obs = env.state()                 # Peek without advancing
```

This means any agent (ours, yours, or a competitor's) can be benchmarked against the same tasks with the same scoring.
We wrapped everything in a FastAPI server with strict Pydantic validation:
| Endpoint | Method | Purpose |
|---|---|---|
| `/reset` | POST | Start a new episode (optionally pick a task) |
| `/step` | POST | Submit analysis + fixes → receive scored reward |
| `/state` | GET | Check current observation |
| `/tasks` | GET | List available challenges |
| `/health` | GET | Service health check |
Full OpenAPI docs auto-generated at /docs.
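With the server running locally, a full episode over HTTP might look like this sketch. The base URL is the uvicorn default from the quickstart, and the JSON field names are assumptions about the Pydantic schemas:

```python
BASE = "http://localhost:8000"  # assumed default uvicorn address

def step_body(vulnerabilities, fixes):
    """Build the POST /step payload (field names are illustrative)."""
    return {"vulnerabilities": vulnerabilities, "fixes": fixes}

# With the server up (and the `requests` library installed):
#   import requests
#   obs = requests.post(f"{BASE}/reset", json={"task_id": "SEC-EASY-001"}).json()
#   reward = requests.post(
#       f"{BASE}/step",
#       json=step_body(["hardcoded_secret"], ["load the key from an env var"]),
#   ).json()
#   print(reward)
```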
This is the core innovation. A 5-stage pipeline that simulates what a real autonomous security agent would do:
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ 1. DETECT    │────▶│ 2. REPRODUCE │────▶│ 3. PATCH     │
│ Find vulns   │     │ Confirm bug  │     │ Generate fix │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
┌──────────────┐     ┌──────────────┐            │
│ 5. DEPLOY?   │◀────│ 4. VALIDATE  │◀───────────┘
│ Score > 0.6? │     │ Syntax+Tests │
└──────────────┘     └──────────────┘
```
Pipeline Scoring:
| Stage | Weight | What It Checks |
|---|---|---|
| Reproduction | +0.2 | Can we confirm the vulnerability exists? |
| Compile | +0.2 | Does the patched code have valid syntax? |
| Tests | +0.3 | Does the patch pass simulated test cases? |
| Regression | +0.2 | All stages passed → no regressions introduced |
| Deploy | +0.1 | Score ≥ 0.8 → safe to deploy |
Safety guardrails built in:
- ✅ Max iteration limit prevents infinite loops
- 🚦 Confidence threshold gates deployment (score < 0.6 → NO deploy)
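The two guardrails combine into a simple bounded retry loop. A sketch under the stated thresholds; the real `pipeline.py` surely differs in structure and naming:

```python
def self_heal(run_pipeline, max_iters: int = 3, gate: float = 0.6):
    """Retry detect→patch→validate until the score clears the deploy gate.

    run_pipeline(attempt) is assumed to return a 0-1 validation score.
    """
    best = 0.0
    for attempt in range(max_iters):      # guardrail: bounded iterations
        best = max(best, run_pipeline(attempt))
        if best >= gate:                  # guardrail: confidence gate
            return {"deploy": True, "score": best, "attempts": attempt + 1}
    return {"deploy": False, "score": best, "attempts": max_iters}

# A run whose patches improve on each attempt:
result = self_heal(lambda i: [0.3, 0.5, 0.7][i])
# result == {"deploy": True, "score": 0.7, "attempts": 3}
```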
Zero OpenAI. Zero Anthropic. 100% Hugging Face.
| Purpose | Model | Fallback |
|---|---|---|
| Code analysis & patching | `bigcode/starcoder2-15b` | `deepseek-ai/deepseek-coder-6.7b-instruct` |
| Security reasoning | `mistralai/Mistral-7B-Instruct-v0.3` | `mistralai/Mixtral-8x7B-Instruct-v0.1` |
All generation is deterministic: temperature=0, top_p=1, do_sample=False.
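In `transformers`, `do_sample=False` selects greedy decoding, which is what makes generation repeatable. A sketch of how these settings might be pinned (the `max_new_tokens` value and the wrapper shape are assumptions about `hf_client.py`):

```python
# Pinned generation settings: with do_sample=False decoding is greedy,
# so temperature/top_p have no stochastic effect and outputs are repeatable.
GEN_KWARGS = {
    "do_sample": False,
    "temperature": 0.0,
    "top_p": 1.0,
    "max_new_tokens": 512,  # assumed limit, not stated in the original
}

# Usage (requires `transformers` and model weights):
#   from transformers import pipeline
#   gen = pipeline("text-generation", model="bigcode/starcoder2-15b")
#   out = gen(prompt, **GEN_KWARGS)
```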
One command to build. One command to run.
```bash
docker build -t secure-code-env .
docker run -p 8000:8000 -e HF_TOKEN=hf_... secure-code-env
```

Production-grade: health checks, layer caching, minimal image, single-worker uvicorn.
```
======================== 42 passed in 0.21s ========================
```
| Test Suite | Tests | Status |
|---|---|---|
| Models (Pydantic schemas) | 5 | ✅ All pass |
| Tasks (determinism + registry) | 7 | ✅ All pass |
| Grader (scoring engine) | 6 | ✅ All pass |
| Environment (OpenEnv lifecycle) | 10 | ✅ All pass |
| API (FastAPI endpoints) | 8 | ✅ All pass |
| Pipeline (self-healing E2E) | 6 | ✅ All pass |
```bash
python3 -m secure_code_env.inference --all
```
| Task | Difficulty | Score | Vulns Found | Fixes Generated |
|---|---|---|---|---|
| SEC-EASY-001 | 🟢 Easy | 1.0000 | 1/1 ✅ | 1/1 ✅ |
| SEC-MED-001 | 🟡 Medium | 0.8714 | 3/3 ✅ | 2/2 ✅ |
| SEC-HARD-001 | 🔴 Hard | 0.8637 | 6/6 ✅ | 5/5 ✅ |
| **Average** | | 0.9117 | 10/10 | 8/8 |
| Feature | Us | Typical Security Tools |
|---|---|---|
| Detects vulnerabilities | ✅ | ✅ |
| Explains why it's dangerous | ✅ | ❌ |
| Generates fixes automatically | ✅ | ❌ |
| Validates its own fixes | ✅ | ❌ |
| Makes deployment decisions | ✅ | ❌ |
| Standardized benchmark (OpenEnv) | ✅ | ❌ |
| Deterministic & reproducible | ✅ | ❌ |
| Open-source HF models only | ✅ | ❌ |
```bash
# Install
pip install -e ".[dev]"

# Run the agent against all tasks (no server needed)
python3 -m secure_code_env.inference --all

# Start the API server
uvicorn secure_code_env.app:app --port 8000

# Run tests
python3 -m pytest tests/ -v -p no:anyio

# Docker
docker build -t secure-code-env .
docker run -p 8000:8000 secure-code-env
```
```
secure-code-env/
├── secure_code_env/          # Core package
│   ├── models.py             # Observation / Action / Reward schemas
│   ├── tasks.py              # 3 deterministic security tasks
│   ├── grader.py             # Rule-based scoring (0.4 + 0.3 + 0.3)
│   ├── env.py                # OpenEnv engine (reset / step / state)
│   ├── app.py                # FastAPI REST API
│   ├── hf_client.py          # Hugging Face model wrapper
│   ├── pipeline.py           # 5-stage self-healing pipeline
│   └── inference.py          # Baseline agent + CLI
├── tests/test_env.py         # 42 integration tests
├── openenv.yaml              # OpenEnv specification
├── Dockerfile                # Production container
├── pyproject.toml            # Project config
├── requirements.txt          # Dependencies
└── README.md                 # You are here
```
```
 Agent (inference.py)
          │
          ▼
┌──────────────────────────────────────────────┐
│                SecureCodeEnv++               │
│                                              │
│  ┌──────────┐    ┌─────────┐    ┌─────────┐  │
│  │  Tasks   │───▶│   Env   │───▶│ Grader  │  │
│  │ Registry │    │(OpenEnv)│    │(Scoring)│  │
│  └──────────┘    └────┬────┘    └─────────┘  │
│                       │                      │
│              ┌────────▼────────┐             │
│              │   FastAPI App   │             │
│              │  /reset /step   │             │
│              │  /state /tasks  │             │
│              └─────────────────┘             │
└──────────────────────┬───────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│            Self-Healing Pipeline             │
│   Detect → Reproduce → Patch →               │
│   Validate → Deploy Decision                 │
└──────────────────────┬───────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│             Hugging Face Models              │
│   StarCoder2 · DeepSeek-Coder                │
│   Mistral · Mixtral                          │
└──────────────────────────────────────────────┘
```
MIT licensed: use it, extend it, build on top of it.
Built with 🧠 AI + ⚙️ Engineering
Because security shouldn't wait for the next code review.