
πŸ” SecureCodeEnv++

A Self-Healing AI Agent That Finds, Reproduces, and Fixes Security Vulnerabilities β€” Automatically

Python 3.10+ OpenEnv 1.0 Tests HF Only License: MIT


## 🎯 The Problem

Security vulnerabilities cost the industry an estimated $9.5 trillion annually. Code reviews are slow, manual, and error-prone. Existing tools detect issues, but they don't fix them, and they certainly don't validate their own fixes before deploying.

What if an AI agent could run the entire pipeline (detect → reproduce → fix → validate → deploy) autonomously?


## 💡 Our Solution

SecureCodeEnv++ is a production-ready AI environment where an autonomous agent:

1. 🔍 **Detects** security vulnerabilities in source code
2. 🔁 **Reproduces** the bug to confirm it's exploitable
3. 🛠️ **Generates a patch** using Hugging Face models
4. ✅ **Validates** the fix (syntax, tests, regression)
5. 🚀 **Makes the deployment decision** (with confidence gating)

All of this runs inside a standardized, scored benchmark (the OpenEnv spec), so agents can be compared, ranked, and improved objectively.


πŸ—οΈ How We Built It β€” Step by Step

Phase 1: The Foundation (Models + Tasks)

We started by defining the language of the system β€” what does the agent see, what can it do, and how is it scored?

Observation  β†’  "Here's vulnerable code. Find the problems."
Action       β†’  "I found these bugs, here are my fixes."
Reward       β†’  "You scored 0.87. Here's what you got right/wrong."

We then created 3 progressively harder security challenges:

| Task | Difficulty | What's Wrong |
|------|------------|--------------|
| SEC-EASY-001 | 🟢 Easy | AWS secret keys hardcoded directly in Python source |
| SEC-MED-001 | 🟡 Medium | SQL injection via string concatenation + no input validation in Flask |
| SEC-HARD-001 | 🔴 Hard | 6 simultaneous vulnerabilities: `eval()`, `pickle.loads()`, path traversal, unsafe YAML, SQL injection, `os.system()` command injection |

Every task is frozen and deterministic: same input, same expected output, every single time.
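The frozen-task guarantee can be sketched with an immutable registry; the class, field, and function names below are illustrative, not the actual `tasks.py` API:

```python
from dataclasses import dataclass

# Hypothetical sketch of a frozen, deterministic task registry.
@dataclass(frozen=True)
class SecurityTask:
    task_id: str
    difficulty: str
    vulnerable_code: str
    expected_vulns: frozenset  # vulnerability types the grader expects

TASK_REGISTRY = {
    "SEC-EASY-001": SecurityTask(
        task_id="SEC-EASY-001",
        difficulty="easy",
        vulnerable_code='AWS_SECRET_KEY = "wJalr..."  # hardcoded!',
        expected_vulns=frozenset({"hardcoded_secret"}),
    ),
}

def get_task(task_id: str) -> SecurityTask:
    # Same input always returns the same frozen object: determinism.
    return TASK_REGISTRY[task_id]
```

Freezing the dataclass and keying the registry by ID means two calls with the same task ID can never disagree, which is what makes benchmark runs reproducible.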


### Phase 2: The Scoring Engine

We built a 3-component grading system that makes evaluation objective and reproducible:

```text
┌─────────────────────────────────┐
│       TOTAL SCORE (0–1.0)       │
├─────────────────────────────────┤
│                                 │
│  🔍 Vulnerability Detection     │ ×0.4  - Did you find the right bugs?
│     (set intersection)          │         Match predicted vs. expected types
│                                 │
│  📝 Explanation Quality         │ ×0.3  - Can you explain WHY it's a bug?
│     (keyword matching)          │         Must mention critical concepts
│                                 │
│  🔧 Fix Correctness             │ ×0.3  - Does your patch actually work?
│     (token overlap + structure) │         Compared against reference fixes
│                                 │
└─────────────────────────────────┘
```

No LLM judge. No randomness. Pure deterministic scoring.
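A minimal sketch of how such a weighted, deterministic grader might be written (function and parameter names are assumptions, not the real `grader.py`):

```python
# Hedged sketch of the 3-component weighted grader: set intersection
# for detection, keyword matching for explanations, token overlap
# for fixes, combined as 0.4 + 0.3 + 0.3.
def grade(predicted_vulns: set, expected_vulns: set,
          explanation: str, required_keywords: set,
          fix_tokens: set, reference_tokens: set) -> float:
    # Detection: fraction of expected vulnerability types found (×0.4).
    detection = len(predicted_vulns & expected_vulns) / max(len(expected_vulns), 1)
    # Explanation: fraction of critical keywords mentioned (×0.3).
    words = set(explanation.lower().split())
    explanation_q = len(required_keywords & words) / max(len(required_keywords), 1)
    # Fix: token overlap with the reference patch (×0.3).
    fix = len(fix_tokens & reference_tokens) / max(len(reference_tokens), 1)
    return 0.4 * detection + 0.3 * explanation_q + 0.3 * fix
```

Because every component is pure set arithmetic over deterministic inputs, the same submission always earns the same score: there is nothing to sample and no judge to drift.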


### Phase 3: The OpenEnv Engine

We implemented the OpenEnv standard: three methods that make our environment plug-and-play with any agent:

```python
env = SecureCodeEnv()

obs = env.reset("SEC-HARD-001")     # Agent sees the vulnerable code
reward = env.step(agent_action)     # Agent submits findings → gets scored
obs = env.state()                   # Peek without advancing
```

This means any agent (ours, yours, a competitor's) can be benchmarked against the same tasks with the same scoring.
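The reset/step/state contract can be illustrated with a toy stand-in environment (not the real `SecureCodeEnv`, whose observation and action types are richer):

```python
# Toy illustration of the OpenEnv-style lifecycle: reset starts an
# episode, step scores an action and ends it, state peeks.
class ToyEnv:
    def __init__(self):
        self._obs = None
        self._done = True

    def reset(self, task_id: str) -> dict:
        # Start an episode: return the observation for the chosen task.
        self._obs = {"task_id": task_id, "code": "eval(user_input)"}
        self._done = False
        return self._obs

    def step(self, action: dict) -> dict:
        # Score the submitted findings and terminate the episode.
        found = "eval" in action.get("vulns", [])
        self._done = True
        return {"score": 1.0 if found else 0.0, "done": self._done}

    def state(self) -> dict:
        # Peek at the current observation without advancing.
        return self._obs
```

Any agent that can call these three methods can be dropped into the benchmark, which is the point of standardizing on the interface.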


### Phase 4: The REST API

We wrapped everything in a FastAPI server with strict Pydantic validation:

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/reset` | POST | Start a new episode (optionally pick a task) |
| `/step` | POST | Submit analysis + fixes → receive scored reward |
| `/state` | GET | Check current observation |
| `/tasks` | GET | List available challenges |
| `/health` | GET | Service health check |

Full OpenAPI docs are auto-generated at `/docs`.
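The effect of strict request validation can be sketched with stdlib-only code; the required field names here are illustrative assumptions, not the actual Pydantic schema:

```python
import json

# Sketch of the kind of validation a /step request body goes through
# before reaching the environment (field names are hypothetical).
REQUIRED_FIELDS = {"vulnerabilities", "explanations", "fixes"}

def validate_step_request(raw_body: str) -> dict:
    body = json.loads(raw_body)
    missing = REQUIRED_FIELDS - body.keys()
    if missing:
        # In the real server, Pydantic would return a 422 response.
        raise ValueError(f"missing fields: {sorted(missing)}")
    return body
```

Rejecting malformed submissions at the boundary keeps invalid actions from ever reaching the grader, so every scored episode is a well-formed one.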


### Phase 5: The Self-Healing Pipeline 🧬

This is the core innovation: a 5-stage pipeline that simulates what a real autonomous security agent would do:

```text
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  1. DETECT   │────▶│ 2. REPRODUCE │────▶│  3. PATCH    │
│  Find vulns  │     │ Confirm bug  │     │ Generate fix │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                     ┌──────────────┐     ┌──────▼───────┐
                     │  5. DEPLOY?  │◀────│ 4. VALIDATE  │
                     │ Score > 0.6? │     │ Syntax+Tests │
                     └──────────────┘     └──────────────┘
```

**Pipeline Scoring:**

| Stage | Weight | What It Checks |
|-------|--------|----------------|
| Reproduction | +0.2 | Can we confirm the vulnerability exists? |
| Compile | +0.2 | Does the patched code have valid syntax? |
| Tests | +0.3 | Does the patch pass simulated test cases? |
| Regression | +0.2 | All stages passed → no regressions introduced |
| Deploy | +0.1 | Score ≥ 0.8 → safe to deploy |

Safety guardrails built in:

- ⛔ A max-iteration limit prevents infinite loops
- 🚦 A confidence threshold gates deployment (score < 0.6 → NO deploy)
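The stage weights and confidence gate above can be sketched as a small scoring function (a simplification of whatever `pipeline.py` actually does; names are illustrative):

```python
# Stage weights from the pipeline scoring table.
STAGE_WEIGHTS = {"reproduction": 0.2, "compile": 0.2,
                 "tests": 0.3, "regression": 0.2}

def score_attempt(results: dict) -> dict:
    # Sum the weight of every stage that passed.
    score = sum(w for stage, w in STAGE_WEIGHTS.items() if results.get(stage))
    if score >= 0.8:    # all four validation stages passed
        score += 0.1    # deploy bonus from the table
    # Confidence gate: below 0.6, the patch is never deployed.
    return {"score": round(score, 2), "deploy": score >= 0.6}
```

With this weighting, a patch that fails its tests caps out at 0.6 even if everything else passes, so the deploy decision hinges on the heaviest-weighted stage.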

### Phase 6: Hugging Face Integration

Zero OpenAI. Zero Anthropic. 100% Hugging Face.

| Purpose | Model | Fallback |
|---------|-------|----------|
| Code analysis & patching | `bigcode/starcoder2-15b` | `deepseek-ai/deepseek-coder-6.7b-instruct` |
| Security reasoning | `mistralai/Mistral-7B-Instruct-v0.3` | `mistralai/Mixtral-8x7B-Instruct-v0.1` |

All generation is deterministic: `temperature=0`, `top_p=1`, `do_sample=False`.
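The deterministic decoding setup might look like the following config sketch; `max_new_tokens` is an illustrative addition not mentioned above:

```python
# Sketch of the deterministic decoding configuration described in the
# README (the exact kwargs passed to the HF client are an assumption).
GENERATION_CONFIG = {
    "temperature": 0,       # no sampling randomness
    "top_p": 1,             # no nucleus truncation
    "do_sample": False,     # greedy decoding: same prompt, same output
    "max_new_tokens": 512,  # illustrative cap, not from the README
}
```

Note that with `do_sample=False`, Hugging Face `transformers` performs greedy decoding and effectively ignores `temperature` and `top_p`; passing them anyway documents the intent explicitly.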


### Phase 7: Containerized Deployment

One command to build. One command to run.

```shell
docker build -t secure-code-env .
docker run -p 8000:8000 -e HF_TOKEN=hf_... secure-code-env
```

Production-grade: health checks, layer caching, minimal image, single-worker uvicorn.


## 📊 Test Results

### Unit & Integration Tests

```text
======================== 42 passed in 0.21s ========================
```

| Test Suite | Tests | Status |
|------------|-------|--------|
| Models (Pydantic schemas) | 5 | ✅ All pass |
| Tasks (determinism + registry) | 7 | ✅ All pass |
| Grader (scoring engine) | 6 | ✅ All pass |
| Environment (OpenEnv lifecycle) | 10 | ✅ All pass |
| API (FastAPI endpoints) | 8 | ✅ All pass |
| Pipeline (self-healing E2E) | 6 | ✅ All pass |

### Baseline Agent Benchmark

```shell
python3 -m secure_code_env.inference --all
```

| Task | Difficulty | Score | Vulns Found | Fixes Generated |
|------|------------|-------|-------------|-----------------|
| SEC-EASY-001 | 🟢 Easy | 1.0000 | 1/1 ✅ | 1/1 ✅ |
| SEC-MED-001 | 🟡 Medium | 0.8714 | 3/3 ✅ | 2/2 ✅ |
| SEC-HARD-001 | 🔴 Hard | 0.8637 | 6/6 ✅ | 5/5 ✅ |
| **Average** | | **0.9117** | **10/10** | **8/8** |

## 🧠 What Makes This Different

| Feature | Us | Typical Security Tools |
|---------|----|------------------------|
| Detects vulnerabilities | ✅ | ✅ |
| Explains why it's dangerous | ✅ | ❌ |
| Generates fixes automatically | ✅ | ❌ |
| Validates its own fixes | ✅ | ❌ |
| Makes deployment decisions | ✅ | ❌ |
| Standardized benchmark (OpenEnv) | ✅ | ❌ |
| Deterministic & reproducible | ✅ | ❌ |
| Open-source HF models only | ✅ | ❌ |

## 🚀 Quick Start

```shell
# Install
pip install -e ".[dev]"

# Run the agent against all tasks (no server needed)
python3 -m secure_code_env.inference --all

# Start the API server
uvicorn secure_code_env.app:app --port 8000

# Run tests
python3 -m pytest tests/ -v -p no:anyio

# Docker
docker build -t secure-code-env .
docker run -p 8000:8000 secure-code-env
```

πŸ“ Project Structure

secure-code-env/
β”œβ”€β”€ secure_code_env/           # Core package
β”‚   β”œβ”€β”€ models.py              # Observation / Action / Reward schemas
β”‚   β”œβ”€β”€ tasks.py               # 3 deterministic security tasks
β”‚   β”œβ”€β”€ grader.py              # Rule-based scoring (0.4 + 0.3 + 0.3)
β”‚   β”œβ”€β”€ env.py                 # OpenEnv engine (reset / step / state)
β”‚   β”œβ”€β”€ app.py                 # FastAPI REST API
β”‚   β”œβ”€β”€ hf_client.py           # Hugging Face model wrapper
β”‚   β”œβ”€β”€ pipeline.py            # 5-stage self-healing pipeline
β”‚   └── inference.py           # Baseline agent + CLI
β”œβ”€β”€ tests/test_env.py          # 42 integration tests
β”œβ”€β”€ openenv.yaml               # OpenEnv specification
β”œβ”€β”€ Dockerfile                 # Production container
β”œβ”€β”€ pyproject.toml             # Project config
β”œβ”€β”€ requirements.txt           # Dependencies
└── README.md                  # You are here

πŸ›οΈ Architecture

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚            SecureCodeEnv++                   β”‚
                    β”‚                                             β”‚
   Agent            β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
   (inference.py)──▢│  β”‚  Tasks  │───▢│  Env   │───▢│ Grader  β”‚  β”‚
        β”‚           β”‚  β”‚ Registryβ”‚    β”‚(OpenEnv)β”‚    β”‚(Scoring)β”‚  β”‚
        β”‚           β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
        β”‚           β”‚                      β”‚                      β”‚
        β”‚           β”‚             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
        β”‚           β”‚             β”‚   FastAPI App    β”‚             β”‚
        β”‚           β”‚             β”‚  /reset  /step   β”‚             β”‚
        β”‚           β”‚             β”‚  /state  /tasks   β”‚             β”‚
        β”‚           β”‚             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
        β”‚           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚     Self-Healing Pipeline           β”‚
   β”‚  Detect β†’ Reproduce β†’ Patch β†’      β”‚
   β”‚  Validate β†’ Deploy Decision        β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
                  β–Ό
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚     Hugging Face Models              β”‚
   β”‚  StarCoder2 Β· DeepSeek-Coder       β”‚
   β”‚  Mistral Β· Mixtral                  β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

## 📜 License

MIT. Use it, extend it, build on top of it.


Built with 🧠 AI + ☕ Engineering
*Because security shouldn't wait for the next code review.*

## About

SecureCodeEnv++ is an AI environment for security agents. Built on the OpenEnv standard, it detects, reproduces, and patches vulnerabilities. It is robust, safe, and open-source.
