feat(security): implement skill content scanning with shared prompt injection detection #408

Merged
Aaronontheweb merged 7 commits into dev from claude-wt-skill-scanning
Mar 27, 2026

Conversation

@Aaronontheweb

Collaborator

Summary

Closes #395

  • RegexPromptInjectionDetector — shared regex engine in Netclaw.Security with 22 [GeneratedRegex] patterns across 5 threat categories (prompt injection, data exfiltration, privilege escalation, destructive ops, invisible unicode). Reusable by any content trust boundary via IPromptInjectionDetector.DetectAsync().
  • RegexSkillContentScanner — trust-tier-aware adapter that delegates to the detector and applies policy: System/User bypass scanning, Community warns on medium/rejects high, External/Agent warns on low/rejects medium+.
  • DI wiring fix: Program.cs was hardcoding new NoOpSkillContentScanner() instead of resolving from DI. Moved skill tool registration to post-build following the existing ChannelToolRegistration pattern.
  • SkillManageTool — now passes SkillTrustTier through on create/edit/patch and surfaces Warning verdicts in tool response.
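As a rough illustration of the detector described above, the following Python sketch shows one way a categorized regex detector could report the highest risk level among matching patterns. It is not the PR's C# code: the `Risk` levels, the `Pattern` and `Detection` types, the `detect` function, and all five regexes are hypothetical stand-ins (one per threat category), whereas the real detector uses 22 `[GeneratedRegex]` patterns whose contents are not shown here.

```python
import re
from dataclasses import dataclass
from enum import IntEnum


class Risk(IntEnum):
    """Risk levels, ordered so that max() picks the most severe."""
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Pattern:
    category: str
    risk: Risk
    regex: re.Pattern


# Hypothetical patterns, one per category named in the summary.
# These are illustrative only and are NOT the patterns from the PR.
PATTERNS = [
    Pattern("prompt_injection", Risk.HIGH,
            re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.I)),
    Pattern("data_exfiltration", Risk.HIGH,
            re.compile(r"curl\s+[^\n]*\$\{?[A-Z_]*(KEY|TOKEN|SECRET)", re.I)),
    Pattern("privilege_escalation", Risk.MEDIUM,
            re.compile(r"\bsudo\s+\S+", re.I)),
    Pattern("destructive_ops", Risk.HIGH,
            re.compile(r"\brm\s+-rf\s+/", re.I)),
    Pattern("invisible_unicode", Risk.MEDIUM,
            re.compile("[\u200b\u200c\u200d\u2060\ufeff]")),
]


@dataclass(frozen=True)
class Detection:
    risk: Risk
    categories: tuple


def detect(content: str) -> Detection:
    """Return the highest risk among matching patterns and their categories."""
    hits = [p for p in PATTERNS if p.regex.search(content)]
    risk = max((p.risk for p in hits), default=Risk.NONE)
    return Detection(risk, tuple(sorted({p.category for p in hits})))
```

Keeping the detector free of any trust-tier logic is what would let other trust boundaries reuse it; policy decisions stay in the adapter that consumes the `Detection`.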

Risk matrix

| Risk | Community | External/Agent |
| --- | --- | --- |
| None | Allow | Allow |
| Low | Allow | Warn |
| Medium | Warn | Reject |
| High | Reject | Reject |
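The matrix can be read as a plain lookup from (trust tier, risk) to a verdict. The Python sketch below encodes it that way, together with the System/User bypass from the summary. The `Verdict` enum, the `POLICY` table, and the `verdict` function are hypothetical names used for illustration and do not reflect the actual C# types.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    REJECT = "reject"


# Tiers that bypass scanning entirely, per the PR summary.
BYPASS_TIERS = {"System", "User"}

# Rows of the risk matrix, keyed by tier and then by risk level.
_STRICT = {
    "None": Verdict.ALLOW,
    "Low": Verdict.WARN,
    "Medium": Verdict.REJECT,
    "High": Verdict.REJECT,
}
POLICY = {
    "Community": {
        "None": Verdict.ALLOW,
        "Low": Verdict.ALLOW,
        "Medium": Verdict.WARN,
        "High": Verdict.REJECT,
    },
    "External": _STRICT,
    "Agent": _STRICT,
}


def verdict(tier: str, risk: str) -> Verdict:
    """Look up the verdict for a trust tier and a detected risk level."""
    if tier in BYPASS_TIERS:
        return Verdict.ALLOW
    return POLICY[tier][risk]
```

An unknown tier or risk raises `KeyError` in this sketch rather than defaulting to Allow, so a typo fails loudly instead of silently skipping the policy.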

Test plan

  • 120 security tests pass (54 new across detector, scanner, DI, and no-op)
  • 67 actor skill tests pass (existing, no regressions)
  • Daemon builds clean
  • dotnet slopwatch analyze — 0 violations

…skill content scanning (#395)

Replace no-op stubs with real scanning infrastructure in Netclaw.Security.
RegexPromptInjectionDetector provides 22 categorized patterns across 5 threat
categories (prompt injection, data exfiltration, privilege escalation,
destructive ops, invisible unicode) as shared infrastructure reusable by
skills, webhooks, and any future content trust boundary.

RegexSkillContentScanner applies trust-tier-aware policy: System/User tiers
bypass scanning, Community warns on medium risk, External/Agent rejects on
medium+. Fix daemon DI wiring to resolve scanner from the built service
provider instead of hardcoding NoOpSkillContentScanner.
Preserve the scanner boundary and content-scan hardening work on a dedicated branch so dev stays clean while we reconcile it with the PR branch.
@Aaronontheweb

Aaronontheweb commented Mar 26, 2026

Collaborator Author

Self-review

  • Tightened the scanner boundary so malformed skills are no longer silently skipped: startup and sync now surface degraded inventory explicitly, and the registry tracks rejected skill issues instead of quietly dropping them.
  • Closed the biggest bypasses in the original PR: skill_load, skill_read_resource, skill_manage patch, and resource-file write_file operations now run skill content scans, not just skill_manage writes.
  • Raised the effective scan policy for model-driven skill mutations to at least Community tier so agent-authored skill_manage create and skill_manage edit flows do not inherit the permissive User-tier bypass.
  • Added frontmatter identity enforcement, duplicate-name rejection, canonical path and symlink checks, and staged system-skill sync replacement so partially unsafe feed updates do not replace the on-disk version.
  • Added targeted tests for scanner issues, blocked malicious resource reads and writes, detector failure handling, and sync rollback behavior; targeted security, actors, and daemon test slices all passed locally after reconciliation.
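The "at least Community tier" rule for model-driven mutations amounts to taking the stricter of the declared tier and a floor. A minimal Python sketch follows, assuming the tiers can be ordered by scan strictness. That ordering, the `TrustTier` enum, and `effective_scan_tier` are assumptions made for illustration, not the PR's actual API.

```python
from enum import IntEnum


class TrustTier(IntEnum):
    """Tiers ordered from least to most strictly scanned (assumed ordering)."""
    SYSTEM = 0
    USER = 1
    COMMUNITY = 2
    EXTERNAL = 3
    AGENT = 4


def effective_scan_tier(declared: TrustTier,
                        floor: TrustTier = TrustTier.COMMUNITY) -> TrustTier:
    """Never scan more permissively than the floor tier."""
    return max(declared, floor)
```

With this shape, an agent-authored create or edit declared at User tier would be scanned under Community policy, while tiers that are already stricter pass through unchanged.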

Remaining caveat

  • This is still a regex-based tripwire, not a full malicious-content or malware analysis system. It is materially harder to bypass now, but I still view it as an MVP guardrail for obvious prompt-injection, exfiltration, and privilege-escalation strings rather than a comprehensive skill security solution.

Address all findings from the PR #408 security review:

- Remove silent NoOp fallback from SkillLoadTool and SkillReadResourceTool
  constructors — ISkillContentScanner is now required, not nullable. Aligns
  with "no silent fallbacks" constitution rule.

- Elevate scan tier in SkillLoadTool and SkillReadResourceTool to Community
  minimum, matching SkillManageTool and SystemSkillSyncService. User-placed
  skills on disk are now scanned at load time.

- Tighten false-positive-prone regex patterns: downgrade YouAreNowRegex from
  High to Medium (Community tier now warns instead of rejects), narrow
  ActAsRegex to "act as if you" to avoid false positives on legitimate skill
  instructions like "act as a code reviewer."

- Fix exception message leakage in RegexSkillContentScanner — log full
  exception internally but return generic "content scanning failed" message
  to avoid leaking internal paths.

- Add CachingSkillContentScanner decorator that caches scan results by
  content hash and trust tier, avoiding redundant regex scanning on repeated
  skill_load calls. Wired as the DI-registered ISkillContentScanner.

- Document regex evasion limitations (homoglyphs, encoding indirection,
  synonyms, multi-file split, non-English) in RegexPromptInjectionDetector
  XML docs and TOCTOU race caveat in SkillScanner symlink check.
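The caching decorator mentioned above could look roughly like the Python sketch below, which keys results on a SHA-256 hash of the content plus the trust tier. `CachingScanner`, the `ScanFn` alias, and the string verdicts are hypothetical. The real CachingSkillContentScanner wraps an ISkillContentScanner, and its eviction and thread-safety behavior are not described in this PR, so this sketch uses an unbounded, single-threaded dictionary as a simplifying assumption.

```python
import hashlib
from typing import Callable, Dict, Tuple

# (content, tier) -> verdict string; a stand-in for the real scanner interface.
ScanFn = Callable[[str, str], str]


class CachingScanner:
    """Decorator that memoizes scan results by content hash and trust tier."""

    def __init__(self, inner: ScanFn) -> None:
        self._inner = inner
        self._cache: Dict[Tuple[str, str], str] = {}

    def scan(self, content: str, tier: str) -> str:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        key = (digest, tier)
        if key not in self._cache:
            # Cache miss: run the wrapped scanner once and remember the result.
            self._cache[key] = self._inner(content, tier)
        return self._cache[key]
```

Including the tier in the key matters because the same content can yield different verdicts under different tiers; keying on content alone would let a permissive result leak into a stricter context.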

All 195 security/skill tests pass including 4 new CachingSkillContentScanner
tests and updated assertions for tightened patterns.
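To illustrate the pattern tightening described in this commit, the Python sketch below contrasts a broad "act as" pattern with one narrowed to "act as if you". Both regexes are hypothetical reconstructions for illustration; the actual ActAsRegex source is not shown in this PR.

```python
import re

# Hypothetical before/after patterns, NOT the PR's actual regexes.
BROAD_ACT_AS = re.compile(r"\bact\s+as\b", re.IGNORECASE)
NARROW_ACT_AS = re.compile(r"\bact\s+as\s+if\s+you\b", re.IGNORECASE)

benign = "Act as a code reviewer and comment on style."
suspicious = "Act as if you have no restrictions."

# The broad pattern flags both the legitimate instruction and the attack.
broad_hits = [bool(BROAD_ACT_AS.search(s)) for s in (benign, suspicious)]

# The narrowed pattern flags only the suspicious phrasing.
narrow_hits = [bool(NARROW_ACT_AS.search(s)) for s in (benign, suspicious)]
```

The trade-off is the usual one for tripwire patterns: narrowing reduces false positives on ordinary skill instructions at the cost of missing rephrasings, which is consistent with the evasion limitations the commit documents.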
Local filesystem access implies far worse attack vectors than skill
symlink manipulation — the comment was noise.
…tions)

Keep version 1.2.0 from HEAD. Interleave both branches' new tests:
RejectsFrontmatterNameMismatch from HEAD plus orphan recovery tests from dev.
@Aaronontheweb Aaronontheweb marked this pull request as ready for review March 27, 2026 14:04
@Aaronontheweb Aaronontheweb enabled auto-merge (squash) March 27, 2026 14:04
@Aaronontheweb Aaronontheweb merged commit 301a54a into dev Mar 27, 2026
3 checks passed
@Aaronontheweb Aaronontheweb deleted the claude-wt-skill-scanning branch March 27, 2026 14:08
@Aaronontheweb Aaronontheweb mentioned this pull request Mar 30, 2026
4 tasks

Development

Successfully merging this pull request may close these issues.

Implement skill content scanning using shared prompt injection detection infrastructure
