feat: Add multilingual support (EN/VI/JP) and multi-stack technology tags #2
…tags

Changes:
- extract-knowledge.py: Expand tag_patterns from 20 to 60+ covering AWS, TypeScript, React Native, testing tools, etc. All original tags preserved (thymeleaf, csrf, liquibase, java, gradle, excel, JPQL, messages.properties). Fix false-positive patterns: Lambda→AWS Lambda, CDK→AWS CDK, token→bearer token, Copilot→GitHub Copilot. Remove overly generic JSON tag. Fix .tsx?/.py regex to use word names only. Add Japanese indicators for all 4 categories. Restore gradle/maven/spring boot in TOOL_INDICATORS alongside new tools.
- briefing.py: Add branch/commit stopwords for conventional commits. Remove org-specific names from stopwords (keep generic only).
- README.md: Add Multilingual & Multi-Stack Support section with trilingual docs, technology tag table, and branch parsing examples.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR extends the knowledge extraction pipeline to better classify content across English/Vietnamese/Japanese and to extract a broader set of technology tags across multiple stacks, while updating auto-detected briefing keywords and documentation to reflect the new capabilities.
Changes:
- Expanded indicator/noise regexes to include Japanese (and added more tool indicators) for paragraph classification.
- Expanded technology tag extraction to include AWS/services, frontend/testing/tooling/security tags in addition to the existing Java/Spring tags.
- Updated briefing keyword extraction stopwords and documented multilingual/tag support + branch parsing behavior in the README.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| README.md | Documents multilingual support, tag taxonomy, and branch parsing examples/stopwords. |
| extract-knowledge.py | Adds Japanese/Vietnamese indicators/noise filters and expands tag extraction patterns substantially. |
| briefing.py | Extends branch/commit stopword lists for better auto-detected briefing queries. |
    (r"\b(?:CloudFormation|CFN)\b", "cloudformation"),
    (r"\b(?:Step\s+Functions?)\b", "step-functions"),
    (r"\b(?:X-Ray|XRay)\b", "xray"),
    (r"\b(?:WebSocket|wss?://)\b", "websocket"),
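The websocket tag regex wraps wss?:// in \b...\b. Because the URL alternative ends in /, a non-word character, the trailing \b only matches when a word character follows immediately. URLs such as wss://example.com still match, but wss:// followed by a non-word character (for example an IPv6 host like wss://[::1]) or by the end of the text does not. Consider separating the alternatives so only WebSocket uses word boundaries, or remove the trailing \b for the URL form.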
Suggested change:

    - (r"\b(?:WebSocket|wss?://)\b", "websocket"),
    + (r"(?:\bWebSocket\b|wss?://)", "websocket"),
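To make the boundary behavior concrete, here is a small sketch that runs both patterns through Python's `re` module. The sample strings and the names `ORIGINAL`, `SUGGESTED`, and `samples` are made up for illustration. The original pattern does match `wss://` when a word character follows the slashes; it misses the URL form only when a non-word character or the end of the text comes next.

```python
import re

# Pattern as it appears in the diff, and the suggested replacement.
ORIGINAL = re.compile(r"\b(?:WebSocket|wss?://)\b")
SUGGESTED = re.compile(r"(?:\bWebSocket\b|wss?://)")

samples = [
    "connect to wss://example.com/feed",  # word character after the slashes
    "connect to wss://[::1]:8443/feed",   # '[' follows the slashes
    "the scheme is wss://",               # nothing follows the slashes
    "uses a WebSocket connection",        # keyword form
]

for text in samples:
    print(
        f"{text!r}: original={bool(ORIGINAL.search(text))}, "
        f"suggested={bool(SUGGESTED.search(text))}"
    )
```

The suggested pattern matches all four samples, while the original misses the second and third.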
    TOOL_INDICATORS = [
        r"(?:install|configure|setup|version|upgrade|dependency)\b",
        r"(?:gradle|maven|docker|redis|postgres|spring\s+boot)\b",
        r"(?:JDK|SDK|IDE|VSCode|extension)\b",
        r"(?:yarn|npm|cdk|playwright|jest|eslint|prettier)\b",
        r"(?:JDK|SDK|IDE|VSCode|extension|plugin|MCP)\b",
TOOL_INDICATORS entry (?:yarn|npm|cdk|playwright|jest|eslint|prettier)\b has no leading word-boundary, so it can match inside larger tokens that end with one of these strings (e.g., mycdk). Adding a leading \b (or otherwise anchoring) would reduce accidental matches.
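A short sketch of the difference a leading `\b` makes, using the alternation quoted in the comment. The `re.IGNORECASE` flag and the sample sentences are assumptions; the source does not show how `TOOL_INDICATORS` is applied.

```python
import re

ALTERNATIVES = r"(?:yarn|npm|cdk|playwright|jest|eslint|prettier)"

# Pattern as quoted in the review comment: trailing boundary only.
UNANCHORED = re.compile(ALTERNATIVES + r"\b", re.IGNORECASE)
# Same alternatives with a leading word boundary added.
ANCHORED = re.compile(r"\b" + ALTERNATIVES + r"\b", re.IGNORECASE)

for text in ["run mycdk to deploy", "run cdk deploy"]:
    print(
        f"{text!r}: unanchored={bool(UNANCHORED.search(text))}, "
        f"anchored={bool(ANCHORED.search(text))}"
    )
```

The unanchored pattern matches the `cdk` suffix inside `mycdk`; the anchored one matches only the standalone word.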
    def extract_tags(text: str) -> str:
        """Extract relevant tags from text."""
        tag_patterns = [
            (r"\b(?:Spring\s+Boot|SpringBoot)\b", "spring-boot"),
            (r"\b(?:Thymeleaf)\b", "thymeleaf"),
            (r"\b(?:JPQL|JPA|Hibernate)\b", "jpa"),
            (r"\b(?:PostgreSQL|Postgres|PG)\b", "postgresql"),
            # Cloud & Infrastructure
            (r"\b(?:AWS|Amazon\s+Web\s+Services)\b", "aws"),
            (r"\b(?:AWS\s+CDK|CDK\s+(?:stack|construct|deploy|app))\b", "aws-cdk"),
extract_tags() rebuilds the full tag_patterns list on every call. Since this runs per extracted chunk, the expanded 50+ tag list will repeatedly allocate/compile regexes and can become a hot path. Consider moving patterns to a module-level constant and precompiling them (e.g., [(re.compile(..., re.I), tag), ...]).
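A minimal sketch of the module-level, precompiled layout the comment describes. Only four of the patterns quoted in the diff are included. The constant name `TAG_PATTERNS`, the `re.IGNORECASE` flag, and the comma-separated return format are assumptions, since the body of the real `extract_tags()` is not shown here.

```python
import re

# Built once at import time. Illustrative subset of the patterns in the diff.
TAG_PATTERNS = [
    (re.compile(r"\b(?:Spring\s+Boot|SpringBoot)\b", re.IGNORECASE), "spring-boot"),
    (re.compile(r"\b(?:Thymeleaf)\b", re.IGNORECASE), "thymeleaf"),
    (re.compile(r"\b(?:JPQL|JPA|Hibernate)\b", re.IGNORECASE), "jpa"),
    (re.compile(r"\b(?:AWS|Amazon\s+Web\s+Services)\b", re.IGNORECASE), "aws"),
]


def extract_tags(text: str) -> str:
    """Return matching tags as a comma-separated string, in pattern order."""
    tags = []
    for pattern, tag in TAG_PATTERNS:
        if tag not in tags and pattern.search(text):
            tags.append(tag)
    return ",".join(tags)


print(extract_tags("Spring Boot app on AWS"))
```

Python's `re` module already keeps an internal cache of recently compiled patterns, so the gain is likely to come mostly from not rebuilding the list and skipping the cache lookups on each call.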
    and p not in ("feature", "fix", "chore", "update", "and",
                  "dev", "bug", "refactor", "docs",
                  "release", "hotfix", "main", "master"))
Branch stopword filtering is case-sensitive (p not in (...)) even though commit stopword filtering uses w.lower(). If a branch segment is capitalized (e.g., Dev/Feature/...), it will leak into keywords. Consider normalizing p (and/or branch) to lowercase for stopword comparison before adding to keywords.
Suggested change:

    - and p not in ("feature", "fix", "chore", "update", "and",
    -               "dev", "bug", "refactor", "docs",
    -               "release", "hotfix", "main", "master"))
    + and p.lower() not in ("feature", "fix", "chore", "update", "and",
    +                       "dev", "bug", "refactor", "docs",
    +                       "release", "hotfix", "main", "master"))
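The case-insensitive comparison can be sketched in isolation as follows. The stopword set is copied from the diff; the function name `branch_keywords` and the rule for splitting on `/`, `-`, and `_` are assumptions, because the surrounding code in `briefing.py` (including any other filter conditions) is not shown.

```python
import re

BRANCH_STOPWORDS = frozenset({
    "feature", "fix", "chore", "update", "and",
    "dev", "bug", "refactor", "docs",
    "release", "hotfix", "main", "master",
})


def branch_keywords(branch: str) -> list[str]:
    """Split a branch name into segments and drop stopwords, ignoring case."""
    parts = re.split(r"[/_-]+", branch)  # hypothetical splitting rule
    return [p for p in parts if p and p.lower() not in BRANCH_STOPWORDS]


print(branch_keywords("Feature/Add-Payment_API"))
print(branch_keywords("DEV/hotfix/login"))
```

Using a `frozenset` keeps membership checks constant-time and lets the same set be shared with other filters if they need it.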
### Supported Technology Tags (50+)

| Category | Tags |
|---|---|
| **Cloud & Infra** | `aws`, `aws-cdk`, `lambda`, `dynamodb`, `s3`, `sqs`, `sns`, `cognito`, `cloudwatch`, `api-gateway`, `eventbridge`, `cloudformation`, `step-functions`, `xray`, `websocket`, `docker`, `vpc` |
| **Languages** | `typescript`, `javascript`, `python`, `nodejs`, `java` |
The documentation says "Supported Technology Tags (50+)" here, while the PR description claims "60+" tags. Either adjust the wording in the README to match the actual tag count/pattern set, or update the PR description so the numbers are consistent.
magicpro97 left a comment
V2 Review — All previous issues resolved ✅
Checklist vs PR #1 feedback:
| Issue | Status |
|---|---|
| 🔴 Breaking compat (removed tags) | ✅ Fixed: All 20 original tags preserved with exact same names |
| 🔴 `fpt` hardcoded | ✅ Fixed: Removed from stopwords |
| 🔴 Generic tags (Lambda, CDK, JSON, token) | ✅ Fixed: `AWS Lambda\|Lambda function`, `AWS CDK\|CDK stack`, JSON removed, `SAML\|OAuth2?\|JWT\|bearer token` |
| `.tsx?` matching `.tsv` | ✅ Fixed: Removed file extension patterns, uses full word `TypeScript` |
| … | ✅ Fixed: Original `gradle\|maven\|spring boot` preserved |
Minor nits (non-blocking):
- `(r"\b(?:wss?://)\b", "websocket")`: `\b` doesn't work around `://`. The `WebSocket` keyword covers it, so this sub-pattern is dead code. Consider removing.
- README category "Database & ORM" includes `spring-boot`; consider renaming to "Java Ecosystem".
Verdict: LGTM 🚀
Clean, well-organized, backward-compatible. Ready to merge.
- #2 Remove bypass hint from deny messages (hook was coaching agent how to bypass)
- #3 Fail-closed on empty/malformed stdin (enforce-briefing denies, others safe-pass)
- #4 learn.py detection requires python3 execution, not just substring match
- #5 enforce-tentacle now gates bash file writes, not just edit/create
- #6 verify-integrity checks hooks.json hash (was stored but never verified)
- #8 git commit detection uses regex to handle interleaved flags

Remaining: #1 marker spoofing (needs architecture change), #7 warn-only integrity, #9 relative paths, #10 double-counting, #11 non-git dirs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
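As an illustration of item #8, detecting `git commit` when global options sit between `git` and the subcommand could look roughly like this. The pattern and the function name `is_git_commit` are hypothetical, not taken from the hook. The sketch covers only a few option shapes (`-C <path>`, `-c <key=value>`, and `--flag` or `--flag=value`) and is not exhaustive.

```python
import re

# Hypothetical pattern: "git", then any number of global options, then "commit".
GIT_COMMIT_RE = re.compile(
    r"\bgit(?:\s+(?:-[cC]\s+\S+|--?[\w-]+(?:=\S+)?))*\s+commit\b"
)


def is_git_commit(command: str) -> bool:
    """Return True when the shell command appears to invoke `git commit`."""
    return GIT_COMMIT_RE.search(command) is not None


for cmd in [
    "git commit -m 'msg'",
    "git -C /repo --no-pager commit -m 'msg'",
    "git log --grep commit",
]:
    print(f"{cmd!r}: {is_git_commit(cmd)}")
```

A plain substring check for `git commit` would miss the second command and would also fire on text that merely mentions the phrase, which is the kind of gap a regex is meant to close.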
Same class of vulnerability as #2 — hook was literally telling the agent 'create ~/.copilot/markers/tentacle-bypass to skip'. Agent in session followed this instruction and bypassed tentacle enforcement entirely. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CRITICAL fixes:
- #1: Remove CLI 'sign' command from marker_auth.py (agents could forge markers via 'python3 marker_auth.py sign tentacle-done')
- #2: tentacle-suggest.py now uses sign_list_marker/verify_list_marker instead of raw file I/O (was corrupting HMAC-signed tentacle-edits)

HIGH fixes:
- #3: Add missing WARN constant in install.py (--lock-hooks crashed with NameError on config.json sanitization and non-root runs)
- #4: Expand is_secret_access() to block marker_auth.py execution, glob wildcards (.marker-*), and .copilot/hooks/. access patterns

MEDIUM fixes:
- #5: Add isinstance(tool_args, dict) to enforce-tentacle.py (same crash as enforce-learn.py when toolArgs is a string)
- #6: Fix tentacle-suggest.py redirect regex to match relative paths (was only matching absolute paths starting with /)
- #7: Add stale marker cleanup at sessionStart in auto-briefing.py (crash recovery: old markers from dead sessions no longer persist)

All 74 tests pass (9 security + 65 fixes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
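The `isinstance(tool_args, dict)` guard from fix #5 might be sketched as below. The helper name `normalize_tool_args` and the payload shape are hypothetical. Decoding a string value as JSON is an extra assumption that goes beyond what the commit message states; the message only confirms that a string `toolArgs` used to crash the hook.

```python
import json
from typing import Any


def normalize_tool_args(payload: dict[str, Any]) -> dict[str, Any]:
    """Return toolArgs as a dict, or {} when it is missing or unusable."""
    tool_args = payload.get("toolArgs")
    if isinstance(tool_args, str):
        # Assumption: a string value may be JSON-encoded, so try to decode it.
        try:
            tool_args = json.loads(tool_args)
        except json.JSONDecodeError:
            return {}
    # Anything that is still not a dict (list, number, None) is discarded.
    return tool_args if isinstance(tool_args, dict) else {}


print(normalize_tool_args({"toolArgs": {"path": "a.py"}}))
print(normalize_tool_args({"toolArgs": "not json"}))
```

Returning an empty dict keeps later `.get()` calls safe; whether the hook should then allow or deny the tool call is a separate policy decision.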
- Fix #1: integration_auth_bearer_valid_passes now uses /api/noroute to exercise Bearer auth middleware (asserts 404 post-auth, not open /healthz)
- Fix #2: start() uses tuple bind (host, port) for IPv6-safe TcpListener binding
- Fix #3: TraceLayer customized to record only URI path, not query string (avoids ?token= leaks)
- Fix #4: resolve_file maps PathSafetyError::Empty to 404 (not 403) for root /
- Fix #5: serve_static rejects non-GET/HEAD with 405 Method Not Allowed
- Fix #6: symlink_metadata and canonicalize moved into spawn_blocking to avoid async runtime starvation
- Fix #7: rand dep aligned to 0.9 (removes duplicate rand 0.8); generate_nonce uses rand::rng() (0.9 API); add tracing dep for TraceLayer span customization
- Fix #8: CORS preflight Allow-Headers includes Last-Event-ID and X-Resume-Token for Python parity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- live.rs: honor Last-Event-ID header so reconnecting clients resume from the provided event id (positive integer); invalid/missing falls back to latest MAX(id) snapshot without panicking (#1)
- live.rs: move initial latest_entry_id() call into spawn_blocking to avoid blocking the Tokio executor with synchronous SQLite/r2d2 (#2)
- db.rs: replace .ok().flatten() on created_at with explicit has_ca branch: propagates real row conversion errors via ? when the column exists; skips r.get() entirely for NULL projection; collect errors instead of silently dropping rows via filter_map (#3)
- tests: replace fixed-size header read with read_http_headers() helper that loops until \r\n\r\n, preventing packet-splitting flakes (#4)
- tests: replace fixed sleep + single read with read_until_sse_contains() helper that accumulates chunks until expected strings appear or a deadline passes, avoiding timing/chunk fragility (#5)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
feat: Add Multilingual Support (EN/VI/JP) and Multi-Stack Technology Tags

Overview

Extends `copilot-session-knowledge` beyond Java/Spring to support trilingual knowledge extraction (English, Vietnamese, Japanese) and 60+ technology tags across multiple stacks.

All original tags and patterns are fully preserved for backward compatibility.

Changes

`extract-knowledge.py`

Tag patterns (20 → 60+):
- All original tags preserved (`thymeleaf`, `csrf`, `liquibase`, `java`, `gradle`, `excel`, `jpa` with `JPQL`, `i18n` with `messages.properties`)
- Fixed false-positive patterns:
  - `Lambda` → `AWS Lambda|Lambda function|Lambda handler` (won't match Java lambdas)
  - `CDK` → `AWS CDK|CDK stack|CDK construct` (won't match Chrome DevTools)
  - `token` → `SAML|OAuth2?|JWT|bearer token` (won't match generic tokens)
  - `Copilot` → `GitHub Copilot|Copilot CLI|Copilot chat` (won't match aircraft)
- Removed the overly generic `JSON` tag
- Removed `.tsx?`/`.jsx?`/`.py` file extension patterns (matched `.tsv`, `.pyc`, etc.)

Indicators: added Japanese regex to all 4 categories.

TOOL_INDICATORS: original `gradle|maven|spring boot` preserved, added `yarn|npm|cdk|playwright|jest|eslint|prettier` alongside.

Noise filters: added Japanese patterns (面接, 採点, 評価基準).

`briefing.py`
- Branch stopwords added: `dev`, `bug`, `refactor`, `docs`, `release`, `hotfix`, `main`, `master`
- Commit stopwords added: `use`, `refactor`, `feat`, `chore`, `docs`, `style`, `test`, `build`

`README.md`
- Adds a Multilingual & Multi-Stack Support section with trilingual docs, a technology tag table, and branch parsing examples.

Backward Compatibility

Original tags preserved:
- `spring-boot`
- `thymeleaf`
- `jpa` (with JPQL)
- `postgresql`
- `redis`
- `gradle` (… `java-build`, now reverted)
- `csrf`
- `liquibase`
- `java`
- `excel` (… `spreadsheet`, now reverted)
- `javascript` (with `JS`)
- `i18n` (with `messages.properties`)
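The effect of narrowing a tag pattern, as described for `Lambda` in the PR description, can be sketched like this. The alternatives come from the description; the `\b` anchors, the `\s+` between words, the `re.IGNORECASE` flag, and the sample sentences are assumptions.

```python
import re

# Broad form that the description says caused false positives.
BROAD_LAMBDA = re.compile(r"\bLambda\b", re.IGNORECASE)
# Narrowed form built from the alternatives listed in the description.
NARROW_LAMBDA = re.compile(
    r"\b(?:AWS\s+Lambda|Lambda\s+(?:function|handler))\b", re.IGNORECASE
)

java_text = "Use a lambda expression with the Stream API"
aws_text = "Deploy the Lambda function behind API Gateway"

for label, text in [("java", java_text), ("aws", aws_text)]:
    print(
        f"{label}: broad={bool(BROAD_LAMBDA.search(text))}, "
        f"narrow={bool(NARROW_LAMBDA.search(text))}"
    )
```

The broad pattern tags both sentences, while the narrowed one tags only the AWS sentence. The trade-off is that a bare mention of "Lambda" in an AWS context no longer produces the tag.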