An agentic DevSecOps reviewer that runs on every pull request. It combines deterministic security scanners with LLM-driven reasoning to produce a structured review comment, a SARIF report, and a CI gate decision (PASS / WARN / FAIL).
The deterministic side handles what scanners are good at: secret detection, SAST patterns, dependency CVEs, and IaC misconfigurations. The LLM side handles what scanners miss: business-logic flaws, authorization gaps, design-level threats (STRIDE), exploitability reasoning, and patch generation. The final policy decision is always made by deterministic Python; LLM output is advisory.
```mermaid
flowchart TB
    PR([PR event]) --> CTX[context_agent<br/>diff + sensitive-file detection]
    CTX --> SEC[secrets_scan<br/>Gitleaks]
    CTX --> SAST[sast_scan<br/>Semgrep]
    CTX --> DEP[dependency_scan<br/>Grype]
    CTX --> IAC[iac_scan<br/>Checkov]
    CTX --> AID[ai_discovery<br/>LLM]
    SEC & SAST & DEP & IAC & AID --> NORM[normalize<br/>dedup + scope filter]
    NORM --> THM[threat_mapping<br/>CWE / OWASP / ATT&CK]
    THM --> ENR[enrichment<br/>OSV / CVSS]
    ENR --> REACH[reachability filter]
    REACH --> EXP[exploitability<br/>LLM second-opinion]
    EXP --> PATCH[patch_generation<br/>LLM + git apply + scanner rescan + LLM review]
    EXP --> TM[threat_model<br/>STRIDE delta LLM]
    PATCH & TM --> DEC[policy engine<br/>PASS / WARN / FAIL]
    DEC --> OUT([JSON · Markdown · SARIF · PR comment])
    classDef llm fill:#fef3c7,stroke:#d97706
    class AID,EXP,PATCH,TM llm
```
LLM-using nodes are shaded. Every LLM call goes through:
- A cross-provider failover chain so rate-limits, schema-validation failures, and quota exhaustion route to the next provider automatically.
- A content-addressed cache keyed on `(prompt_version, model, temperature, hashed input)` so re-pushes do not re-bill.
- A per-PR token budget with graceful degradation when exhausted.
- A Pydantic-validated structured-output contract.
- A `json_repair` fallback that recovers almost-valid model output (missing key quotes, unterminated strings near `max_tokens`, missing commas) without burning an extra LLM call.
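A minimal sketch of how a content-addressed key could be derived from the four fields named in the list above. The function name `cache_key`, the use of SHA-256, and the JSON serialization are assumptions for illustration; the project's actual cache implementation is not shown here.

```python
import hashlib
import json


def cache_key(prompt_version: str, model: str, temperature: float, prompt_input: str) -> str:
    """Derive a deterministic key from (prompt_version, model, temperature, hashed input)."""
    # Hash the (possibly large) input first so the key material stays small.
    input_digest = hashlib.sha256(prompt_input.encode("utf-8")).hexdigest()
    # A compact JSON list gives a stable serialization of the four fields.
    material = json.dumps(
        [prompt_version, model, temperature, input_digest],
        separators=(",", ":"),
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


# Hypothetical inputs: the same diff pushed twice, then a prompt-version bump.
first_push = cache_key("v3", "deepseek-chat", 0.1, "diff --git a/app.py b/app.py")
re_push = cache_key("v3", "deepseek-chat", 0.1, "diff --git a/app.py b/app.py")
new_prompt = cache_key("v4", "deepseek-chat", 0.1, "diff --git a/app.py b/app.py")
```

With a key of this shape, an identical re-push maps to the same cache entry, while changing any one field (here the prompt version) produces a different key and forces a fresh LLM call.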
Patches additionally get a dedicated chain (DeepSeek first), a gibberish sanity check that rejects mojibake before applying, and a second-opinion LLM review that confirms the patch addresses the specific vulnerability and matches the surrounding code style.
- 5 parallel scanners (Gitleaks, Semgrep, Grype, Checkov, AI Vulnerability Discovery)
- 4 LLM reasoning agents (AI Discovery, Exploitability, STRIDE Threat-Model Delta, Patch Generation + Review)
- 4-link provider failover chain (DeepSeek, Gemini, Groq, OpenRouter) with JSON-repair fallback and tolerant schemas
- 3 policy profiles (`advisory` / `balanced` / `strict`) and dependency triage (`direct_runtime` / `direct_dev` / `transitive`) for calibrated CI gating
- Configurable `scanners.grype.include_transitive` toggle so reviewers can choose SBOM-style full reporting (default) or PR-scope-only output
- 248 unit tests, ruff-clean
- Validated on 40 labeled fixtures across 2 pipeline modes (80 orchestrator runs total) on a standard Ubuntu CI runner with zero degraded stages
The agentic-review pipeline has caught path-traversal vulnerabilities in this repository's own code twice — once on the PR that introduced the affected file, and again when the live shipping codebase was scanned by itself. Both fixes are signed off by the bot's own threat-modeling agent.
The dependency-triage feature was reviewed by SecureFlow AI itself before merge. Pointing the live bot at the PR that introduced the new manifest_parser tool, the STRIDE Threat-Modeling Delta agent flagged a real path-traversal weakness the human reviewer had missed:
> Manifest parser reads arbitrary files from repo path (`secureflow/tools/manifest_parser.py:63`) · severity medium · confidence 0.70
>
> The `parse_manifests` function reads files from the repository based on user-provided manifest paths. If an attacker can control the `manifest_paths` list (e.g., via a crafted PR), they could cause the parser to read arbitrary files outside the intended manifests, potentially leaking sensitive information.
>
> Required mitigations before merge:
> - Validate that resolved paths are within the repository directory (e.g., using `os.path.commonpath` or `Path.relative_to`).
> - Restrict `manifest_paths` to only files that were actually changed in the PR.
The original code resolved every path naively:
```python
full = (repo / rel).resolve()
if not full.exists() or not full.is_file():
    continue
sub = _dispatch(full)
```

The shipped fix applies exactly the mitigation the bot suggested, `Path.relative_to(repo)`:

```python
repo = Path(repo_path).resolve()
for rel in manifest_paths:
    full = (repo / rel).resolve()
    try:
        full.relative_to(repo)
    except ValueError:
        # Resolved path escapes the repository root, so skip it.
        continue
    # ... + 2 MiB per-manifest size cap as DoS guard
```

Two unit tests (`test_parse_rejects_path_traversal`, `test_parse_skips_oversized_manifest`) lock the fix in. The bot also flagged a missing file-size cap (DoS); the same patch addressed it.
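The containment check can be tried in isolation. The sketch below is a standalone illustration, not the project's code: the helper name `is_inside_repo` is hypothetical, and a temporary directory stands in for the repository root.

```python
import tempfile
from pathlib import Path


def is_inside_repo(repo_path: str, rel: str) -> bool:
    """Return True only if `rel`, resolved against the repo root, stays inside it."""
    repo = Path(repo_path).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    full = (repo / rel).resolve()
    try:
        full.relative_to(repo)
    except ValueError:
        # The resolved path is not below the repository root.
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    inside = is_inside_repo(tmp, "requirements.txt")
    escaped = is_inside_repo(tmp, "../../etc/passwd")
    # Joining an absolute path discards the repo prefix, so this is rejected too.
    absolute = is_inside_repo(tmp, "/etc/passwd")
```

Resolving both sides before comparing is what makes the check robust: a plain string-prefix comparison on the unresolved path would miss `..` segments and symlinks.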
After Round 1 shipped, the live pipeline was pointed at this repository's own source via secureflow scan --repo . (see .github/workflows/dogfood.yml) to see what it would find. The threat-modeling agent flagged the same path-traversal pattern in two functions that Round 1 hadn't covered:
- `secureflow/agents/ai_discovery_agent.py:103`: `_load_file_contents` builds `Path(repo_path) / rel` and reads the file without verifying the resolved path stays inside `repo_path`. A crafted `changed_files` entry of `../../etc/passwd` would have leaked system files into the LLM prompt.
- `secureflow/agents/threat_model_agent.py:202`: `_load_changed_files` had identical structure, same risk.
Both functions now match Round 1's hardening — Path.relative_to(repo) check + a hard 4× per-file-budget size cap. The fix PR was reviewed by the bot itself, which confirmed both mitigations at confidence 1.00:
> Path traversal protection in agent file readers · severity high · confidence 1.00 · Suggested: PASS
>
> Both `_load_file_contents` and `_load_changed_files` now enforce that resolved file paths stay within the repo root using `Path.relative_to(repo)`. This prevents an attacker from reading arbitrary files (e.g., `/etc/passwd`) via crafted relative paths like `../../etc/passwd`.
Six unit tests in tests/unit/test_agent_file_reader_security.py lock the fix in. End-to-end this is the strongest evidence the agentic-review approach catches design-level weaknesses scanner-only tools miss: the same vulnerability class was found in three different files across two rounds, the fix PR for Round 2 was signed off by the bot at confidence 1.00, and the dogfood workflow remains available so future review rounds run themselves.
See ARCHITECTURE.md and design/ for the full component catalog and per-subsystem specs.
- Python 3.11
- `gitleaks` and `grype` binaries on `$PATH`
- `semgrep` and `checkov` Python packages (installed via pip below)
```bash
git clone https://github.com/<your-account>/secureflow-ai.git
cd secureflow-ai
python -m venv .venv
source .venv/bin/activate    # PowerShell: .\.venv\Scripts\Activate.ps1
pip install -e . semgrep checkov
```

Install the scanner binaries:

```bash
# Grype (Linux / macOS)
curl -sSfL https://raw.githubusercontent.com/anchore/grype/main/install.sh \
  | sudo sh -s -- -b /usr/local/bin

# Gitleaks (Linux x64 asset shown; on other platforms pick the matching asset from the releases page)
GITLEAKS_VERSION=8.30.1
curl -sSL "https://github.com/gitleaks/gitleaks/releases/download/v${GITLEAKS_VERSION}/gitleaks_${GITLEAKS_VERSION}_linux_x64.tar.gz" \
  | tar xz -C /tmp gitleaks
sudo install -m 0755 /tmp/gitleaks /usr/local/bin/gitleaks
```

On Windows: download the prebuilt binaries from each tool's releases page and place them on `PATH`.
Copy .env.example to .env and fill in at least one provider:
```bash
GROQ_API_KEY=           # free tier: https://console.groq.com/keys
GEMINI_API_KEY=         # free tier: https://aistudio.google.com/apikey
DEEPSEEK_API_KEY=       # paid prepaid: https://platform.deepseek.com/
OPENROUTER_AI_API_KEY=  # free tier: https://openrouter.ai/settings/keys
```

The pipeline degrades gracefully when keys are missing: agents that require an LLM emit a single skip-banner on the PR comment, and the deterministic policy decision still applies.
```bash
# Scan the current directory
secureflow scan --repo . --output report.json --markdown report.md --sarif report.sarif

# Fast iteration without LLM calls
secureflow scan --repo . --no-llm
```

The CLI writes JSON, Markdown, and SARIF artifacts. Exit codes: 0 for PASS or WARN, 1 for FAIL, 2 for terminal errors.
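The exit-code contract above can be expressed as a small mapping. This is a sketch under assumptions: the `Decision` enum and both function names are hypothetical and only mirror the documented behavior, not the CLI's internals.

```python
from enum import Enum
from typing import Callable


class Decision(str, Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


def exit_code_for(decision: Decision) -> int:
    """PASS and WARN exit 0; only FAIL exits 1."""
    return 1 if decision is Decision.FAIL else 0


def run_and_get_exit_code(scan: Callable[[], Decision]) -> int:
    """Run a scan callable; any unhandled error maps to exit code 2."""
    try:
        return exit_code_for(scan())
    except Exception:
        return 2


def broken_scan() -> Decision:
    # Stand-in for a terminal error such as a missing scanner binary.
    raise RuntimeError("scanner binary missing")
```

Keeping WARN at exit code 0 means a warning never fails the CI job on its own; a caller that wants to react to WARN has to read the JSON or Markdown report rather than the process status.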
```bash
# Deterministic baseline (no LLM cost)
secureflow eval run --no-llm --output reports/eval_scanners_only.md

# Full pipeline
secureflow eval run --output reports/eval_full.md \
  --llm-concurrency 1 --max-findings-to-exploit 15 --max-patches 5
```

A drop-in workflow ships at `.github/workflows/secureflow.yml`. Copy it into your repository's `.github/workflows/` directory and:
- Add at least one provider API key as a repository secret. `GROQ_API_KEY` is the simplest free option; `DEEPSEEK_API_KEY` is recommended for patch generation.
- Optionally add `GEMINI_API_KEY` and `OPENROUTER_AI_API_KEY` for chain failover.
- Open a pull request.
The workflow:
- Caches `.secureflow_cache/` keyed on the head SHA so re-pushes reuse LLM responses.
- Uses a concurrency group that cancels stale runs on rapid re-pushes.
- Posts a structured PR comment under the marker `<!-- secureflow-ai:bot-comment -->` and edits the same comment on subsequent pushes rather than creating new ones.
- Uploads SARIF to the repository's Security tab when Code Scanning is enabled (free on public repos; requires GitHub Advanced Security on private repos).
- Exits with code 1 on a `FAIL` decision so the PR's CI status reflects the outcome.
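The edit-in-place comment behavior listed above comes down to searching existing comments for the marker. The sketch below shows only that decision logic over plain dictionaries; the function name `plan_comment_update` and the dictionary shape are assumptions, and no GitHub API call is made.

```python
MARKER = "<!-- secureflow-ai:bot-comment -->"


def plan_comment_update(existing_comments: list[dict], body: str) -> dict:
    """Decide whether to edit the bot's earlier comment or create a new one."""
    marked_body = f"{MARKER}\n{body}"
    for comment in existing_comments:
        # The hidden HTML marker identifies a comment written by an earlier run.
        if MARKER in comment.get("body", ""):
            return {"action": "update", "comment_id": comment["id"], "body": marked_body}
    return {"action": "create", "body": marked_body}


# Hypothetical comment lists for a first push and a re-push.
first_run = plan_comment_update([{"id": 1, "body": "LGTM"}], "Decision: WARN")
second_run = plan_comment_update(
    [{"id": 1, "body": "LGTM"}, {"id": 7, "body": f"{MARKER}\nDecision: WARN"}],
    "Decision: PASS",
)
```

Because the marker is an HTML comment, it is invisible in the rendered PR thread but still present in the raw body that the lookup inspects.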
A separate workflow .github/workflows/eval.yml runs the evaluation harness on PRs that touch fixtures or pipeline code and uploads the full report as an artifact.
The shipped workflow requests only:
```yaml
permissions:
  contents: read
  pull-requests: write
  issues: write
  security-events: write
  actions: read
```

`security-events: write` is needed only when SARIF upload is enabled. `issues: write` is required because the PR-comment update path goes through the issues API for non-review-thread comments. `actions: read` lets the `codeql-action/upload-sarif` step fetch workflow-run metadata without logging "Resource not accessible by integration" warnings. The workflow does not need write access to repository contents.
Everything lives in .secureflow.yml at the repository root. Every section is optional with sensible defaults.
```yaml
llm:
  # Cross-provider chain, strongest first.
  provider: deepseek
  model: deepseek-chat
  fallback_providers: [gemini, groq, openrouter]
  # Patch generation has its own chain because patches require higher
  # JSON-output reliability than the other schemas.
  patch_provider: deepseek
  patch_fallback_providers: [gemini, groq]
  temperature: 0.1
  max_tokens: 4096
  cache: true

scanners:
  semgrep: { enabled: true, config: auto }
  gitleaks: { enabled: true }
  grype: { enabled: true }
  checkov: { enabled: true }   # IaC / Dockerfile / Kubernetes / GH Actions

ai_discovery:
  enabled: true
  trigger_on_sensitive_files: true

policy:
  # advisory | balanced | strict (see "Policy profiles" below).
  profile: balanced
  fail_on:
    - critical_secret
    - critical_cve
    - high_confidence_injection
    - confirmed_auth_bypass
  warn_on:
    - medium_ai_discovery
    - low_confidence_high_impact
    - outdated_dependency
  minimum_fail_confidence: 0.80
  minimum_warn_confidence: 0.50

limits:
  max_findings_to_exploit_check: 30
  max_patches_per_pr: 10
  max_llm_concurrency: 1
```

Alternative configurations (Ollama for local-only, Gemini-only, Ollama for patches) ship in `examples/configs/`.
policy.profile controls how strictly findings translate into a CI FAIL. Three profiles ship; balanced is the default.
| Profile | When to use | Behavior |
|---|---|---|
| `advisory` | Initial rollout, shadow-mode runs, repos where the team wants visibility before enforcement | Never blocks CI. Every finding that would normally FAIL is reported as WARN with a marker line so reviewers can still see what would have blocked. |
| `balanced` (default) | Day-to-day use on most repositories | Blocks on critical secrets, critical CVEs in direct runtime and transitive dependencies, high-confidence injection patterns, and AI-discovered critical findings at confidence ≥ 0.85. Critical CVEs in dev-only dependencies (eslint, pytest, etc.) downgrade to WARN. |
| `strict` | Security-sensitive repositories where false negatives cost more than false positives | Adds or tightens four blockers: AI-discovered high findings at ≥ 0.85, AI critical at ≥ 0.75 (down from 0.85), threat-model FAIL recommendations at ≥ 0.70 (down from 0.80), and high-severity direct dependencies with a fix available. |
```yaml
policy:
  profile: strict
```

Dependency findings are classified by scope to reduce noise:
| Scope | Source | Policy effect (balanced) |
|---|---|---|
| `direct_runtime` | Package declared in `dependencies` / `[project.dependencies]` / `[packages]` / runtime `requirements.txt` | Full strictness: critical FAILs, high WARNs |
| `direct_dev` | Package declared in `devDependencies` / `[tool.poetry.group.dev.dependencies]` / `[dev-packages]` / `requirements-dev.txt` | Critical downgrades to WARN (build-only deps don't ship with the application) |
| `transitive` | Package not declared in any direct-deps section of a changed manifest | Same FAIL bar as direct runtime: reachability is unknown, so this is the safe default |
| `unknown` | No manifests in the PR diff, or parser couldn't read them | Pre-triage behavior preserved (no regression) |
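As a rough illustration of the scope table under the `balanced` profile, the sketch below maps a `(scope, severity)` pair to a decision. The function name `triage_balanced` is hypothetical. Three behaviors are assumptions that the table does not state: `unknown` is treated like the pre-triage default (critical FAILs), high severity WARNs for every scope, and lower severities pass.

```python
def triage_balanced(scope: str, severity: str) -> str:
    """Map a dependency finding to PASS / WARN / FAIL under the balanced profile."""
    if severity == "critical":
        # Dev-only dependencies do not ship with the application,
        # so a critical CVE there downgrades to WARN.
        if scope == "direct_dev":
            return "WARN"
        # direct_runtime and transitive share the same FAIL bar.
        # Assumption: "unknown" keeps the pre-triage behavior and also FAILs.
        return "FAIL"
    if severity == "high":
        # Stated for direct_runtime; assumed for the other scopes.
        return "WARN"
    # Assumption: the table is silent on medium and low severities.
    return "PASS"


decisions = {
    scope: triage_balanced(scope, "critical")
    for scope in ("direct_runtime", "direct_dev", "transitive", "unknown")
}
```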
Manifests supported: package.json, pyproject.toml (PEP 621 + Poetry), Pipfile, requirements*.txt.
Grype reports CVEs across the full resolved dependency tree (Flask 1.0.0 pulls in old Werkzeug + Jinja2 + itsdangerous + click; CVEs in all of them surface). Default behavior keeps this SBOM-style report. Set scanners.grype.include_transitive: false to drop findings tagged transitive and keep only direct_runtime / direct_dev for a cleaner bot comment on PRs that touch package manifests.
```yaml
scanners:
  grype:
    enabled: true
    include_transitive: false   # drop transitive CVEs; default is true
```

Findings tagged `unknown` (ecosystems the manifest parser doesn't yet cover: Go `go.mod`, Rust `Cargo.toml`, Java `pom.xml`/Gradle, Ruby `Gemfile`, PHP `composer.json`, .NET `*.csproj`, Docker-image deps) are never dropped by the toggle. The markdown report renders `(unknown)` inline so reviewers on those ecosystems see why the toggle had no effect on their PR.
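The toggle's effect can be sketched as a filter over findings that carry a `scope` tag. The function name `filter_by_scope` and the dictionary shape are assumptions; the point is that only `transitive` is ever dropped, so `unknown` findings always survive.

```python
def filter_by_scope(findings: list[dict], include_transitive: bool = True) -> list[dict]:
    """Drop findings tagged `transitive` when the toggle is off; keep everything else."""
    if include_transitive:
        return list(findings)
    # Only the explicit "transitive" tag is removed; "unknown" is never dropped.
    return [f for f in findings if f.get("scope") != "transitive"]


# Hypothetical findings, one per scope.
sample = [
    {"package": "flask", "scope": "direct_runtime"},
    {"package": "pytest", "scope": "direct_dev"},
    {"package": "werkzeug", "scope": "transitive"},
    {"package": "some-go-module", "scope": "unknown"},
]
kept = [f["package"] for f in filter_by_scope(sample, include_transitive=False)]
```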
The repository ships with 40 labeled fixtures under tests/fixtures/:
| Class | Count | Coverage |
|---|---|---|
| Base AppSec | 20 | SQLi, command injection, SSRF, XSS, XXE, path traversal, weak crypto, JWT alg:none, insecure deserialization, IAM wildcard, payment-logic bug, private key, SHA1 password, SSL verify=False, missing authorization, hardcoded secret, open redirect, weak YAML, vulnerable dependency |
| Cross-language | 6 | Go, Java, Ruby, PHP, C#, TypeScript |
| Adversarial prompt-injection | 4 | Comment-override, fake-review, role-injection, authority-claim |
| Static IaC | 5 | Public S3 (Terraform), wildcard IAM, Dockerfile root, open security group, over-permissioned GitHub Actions |
| STRIDE Threat-Model | 2 | New admin route, new file upload |
| True negatives | 3 | Docs-only, safe Python change, safe subprocess use |
Captured 2026-05-18 on the combined W1 + W5 + W15 + W19 (toggle) + W22 (matcher) + Round 2 hardening state. See reports/eval_full.md for per-scenario breakdown and reports/eval_versions.yaml for the reproducibility sidecar.
| Metric | `scanners_only` | `secureflow_full` | Δ |
|---|---|---|---|
| Recall | 0.61 | 0.78 | +0.17 |
| Precision | 0.54 | 0.36 | −0.18 |
| Decisions correct | 27 / 40 (67.5%) | 31 / 40 (77.5%) | +4 |
| True positives | 28 | 36 | +8 |
| False positives | 24 | 65 | +41 |
| Secondary findings (not FP) | 41 | 41 | +0 |
| Avg latency per scenario | 5.9 s | 19.8 s | +13.9 s |
| LLM tokens (in / out) | 0 / 0 | 388,885 / 52,737 | — |
| Patches generated / scanner-verified | 0 / 0 | 40 / 10 | +10 |
Secondary findings are extra CVEs on a labeled package (Django 2.2.0 has 15 published CVEs but the label expects one match) or extra Checkov sub-checks on a labeled IaC resource. The W22 matcher fix credits these as secondary instead of FP since the system correctly detected the labeled vulnerability and the additional findings represent the same underlying issue. Pre-W22 these inflated the FP count: full-mode dropped from 107 FP → 65 FP when the matcher started crediting them correctly.
Precision moved up sharply (scanners-only: 0.30 → 0.54) for the same reason — the system was always finding these CVEs; the eval just wasn't crediting them honestly.
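The precision figures in the aggregate table follow directly from the true-positive and false-positive counts. A short check, using only numbers from that table (the helper name `precision` is illustrative):

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)


# Counts taken from the aggregate table.
scanners_only = round(precision(28, 24), 2)   # 28 / 52
secureflow_full = round(precision(36, 65), 2)  # 36 / 101
delta = round(secureflow_full - scanners_only, 2)
```

Recall cannot be recomputed the same way because the table does not list false negatives.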
| Stat | Value |
|---|---|
| Orchestrator runs completed | 80 / 80 |
| Schema-validation warnings | 0 |
| `json_repair` recoveries needed | 0 |
| Chain failovers triggered | 0 |
| Skip-banner triggers | 0 |
Full per-scenario breakdown is in reports/eval_full.md; raw data in reports/eval_full.json; scanner and provider versions in reports/eval_versions.yaml.
Nine scenarios where the deterministic pipeline gave the wrong decision and the full pipeline gave the right one:
| Scenario | `scanners_only` | `secureflow_full` |
|---|---|---|
| `scenario_02_missing_authz` | PASS (wrong) | FAIL (AI Discovery found IDOR) |
| `scenario_08_path_traversal` | PASS (wrong) | FAIL (AI Discovery surfaced the traversal) |
| `scenario_14_business_logic_payment` | PASS (wrong) | WARN (semantic logic bug; no SAST pattern) |
| `scenario_16_jwt_alg_none` | WARN (wrong) | FAIL (exploitability agent elevated) |
| `scenario_17_private_key` | WARN (wrong) | FAIL (LLM elevated severity) |
| `scenario_20_xxe` | PASS (wrong) | FAIL (AI Discovery caught it) |
| `scenario_js_sqli` | WARN (wrong) | FAIL (LLM elevated to FAIL) |
| `scenario_tm_new_admin_route` | PASS (wrong) | FAIL (STRIDE Threat-Model Delta) |
| `scenario_iac_gha_overprivileged` | PASS (wrong) | WARN (STRIDE flagged `pull_request_target` + `permissions: write-all`) |
All three clean-change scenarios correctly produced PASS in both modes:
| Scenario | `scanners_only` | `secureflow_full` |
|---|---|---|
| `scenario_09_safe_subprocess_fp` | PASS | PASS |
| `scenario_docs_only` | PASS | PASS |
| `scenario_safe_python_change` | PASS | PASS |
On these three fixtures, the LLM half did not invent vulnerabilities on clean code.
All four scenario_pi_* fixtures (comment-override, fake-review, role-injection, authority-claim) correctly produce FAIL in both modes. The system prompts treat repository content as untrusted data and do not follow embedded directives.
- PR code is untrusted. Every system prompt instructs the LLM to treat code, comments, and string literals as data, not instructions.
- API keys are read from the environment only. They are never persisted to disk or to the configuration file. A defensive BOM strip at the env boundary prevents secrets with a UTF-8 BOM from crashing `urllib`'s latin-1 header encoder.
- Secrets in scan output are masked. The masker runs on every log record and report body. Reports show `AKIA****` rather than the full credential.
- Patches are never auto-applied. Generated patches render as GitHub suggestion blocks; the human reviewer applies them.
- The GitHub Action requests least-privilege permissions (`contents: read`, `pull-requests: write`, `issues: write`, `security-events: write`, `actions: read`).
- IaC review is static-only. No live AWS, Azure, or GCP credentials are required; Checkov runs against committed files.
- Multi-file data-flow analysis is heuristic. Path-based reachability and AST signals only; no full call graph or alias analysis. Catches the common single-file taint patterns; misses vulnerabilities that span multiple files.
- Patch verification has two tiers. Patches for scanner-detected findings get a full scanner re-run on the patched tree; if the rerun no longer reports the original finding, the patch is marked `verified`. Patches for AI-discovered findings have no scanner rule to re-run against, so they go through the LLM-review path only and are marked `verified` or `unverified` based on the second-opinion verdict. Both tiers are surfaced to the reviewer with the verdict and concerns; nothing is auto-applied. The current run's verified count is in the aggregate table above.
- Free-tier LLM quotas are tight. Groq caps at roughly 30 RPM and 6K TPM. Gemini caps at 200 RPD. DeepSeek is paid (approximately $0.05 per PR with 10 patches).
- Grype reports transitive-dependency CVEs. Grype scans the whole resolved dependency tree, not just the diff. In a PR that bumps one direct dep, this can surface CVEs in the dep's transitive children that the reviewer did not directly choose. Documented as a known cost of using Grype rather than a manifest-diff-only tool.
- Local Ollama is supported via configuration but not exercised in the shipped CI workflow. It works on the local CLI when the daemon is running; CI runners do not provision Ollama by default.
- NVD enrichment is opt-in. OSV plus a local CVSS v3 calculator covers most use cases without an API key. Enable NVD in `.secureflow.yml` and set `NVD_API_KEY` for the higher rate-limit tier.
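The secret-masking behavior listed above (reports show `AKIA****` instead of the full credential) can be illustrated for one credential type. This is a sketch under assumptions: the regular expression covers only the well-known AWS access key ID shape, and `mask_secrets` is a hypothetical name; the project's masker and its pattern set are not shown here.

```python
import re

# AWS access key IDs: the literal prefix "AKIA" followed by 16 uppercase alphanumerics.
AWS_ACCESS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def mask_secrets(text: str) -> str:
    """Replace each matched key with its first four characters plus '****'."""
    return AWS_ACCESS_KEY_RE.sub(lambda m: m.group(0)[:4] + "****", text)


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
masked = mask_secrets("aws_access_key_id = AKIAIOSFODNN7EXAMPLE")
```

Keeping the four-character prefix lets a reviewer recognize the credential type without the report itself becoming a leak.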
| Document | Purpose |
|---|---|
| `ARCHITECTURE.md` | Component catalog, state machine, cross-cutting concerns. |
| `design/` | Per-subsystem specs (LLM stack, orchestrator, patch validation, eval harness, schemas, sensitive-file detection). |
| `CONTRIBUTING.md` | How to add a new agent, scanner, or fixture. |
| `examples/` | Alternative configurations and a vulnerable Terraform demo. |
| `reports/` | Latest evaluation results and reproducibility sidecar. |
AGPL-3.0. See LICENSE.