feat: Add multilingual support (EN/VI/JP) and multi-stack technology tags #1
…tags

Changes:
- extract-knowledge.py: Add Japanese regex indicators for all 4 categories (mistakes, patterns, decisions, tools), expand tag_patterns from 20 to 55+ covering AWS, TypeScript, React Native, Expo, testing tools, and more. Add Japanese noise filters. Keep legacy Java/Spring tags for backward compat.
- briefing.py: Add branch name stopwords for dev/fpt/ convention and conventional commit prefixes (feat, chore, docs, style, test, build).
- README.md: Add Multilingual & Multi-Stack Support section documenting trilingual indicators, 50+ technology tags, and branch parsing examples. Add two new features to the feature table.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 left a comment
Overall
Great contribution! Multilingual support (EN/VI/JP) and multi-stack tags are very welcome. The Japanese indicators and noise filters are well-researched. A few issues to address before merging:
🔴 Must Fix
- Breaking backward compatibility: PR description says "all original tags preserved" but several are removed: the `Thymeleaf`, `CSRF`, `Liquibase`, and `JDK/Java` tags are gone, and `gradle`→`java-build` and `excel`→`spreadsheet` are renamed. Existing knowledge entries with old tags will become orphaned.
- `fpt` hardcoded in stopwords: This is an organization-specific name. A public tool shouldn't have company names hardcoded. Make this configurable or remove it.
- Overly generic tag patterns: `Lambda`, `CDK`, `JSON`, `token`, `Copilot` will cause false positives in almost every session.
🟡 Should Fix
- `\.tsx?` and `\.py` regex in tag patterns can match unintended strings (e.g., `.tsv`, `.pyc`). Need end-of-word boundary.
- `TOOL_INDICATORS` removed `gradle|maven|spring\s+boot`: these are real tool indicators that Java users need.
🟢 Nice to Have
- Consider adding Korean (한국어) indicators as a future TODO since the pattern is established.
See inline comments for specific locations.
| @@ -141,6 +156,9 @@ def classify_paragraph(text: str) -> list[tuple[str, float]]: | |||
| r"câu\s*hỏi\s*phỏng\s*vấn", | |||
Lambda is too generic — matches Java lambdas, Python lambda expressions, math lambda calculus discussions, etc.
Suggest: `r"\b(?:AWS\s+Lambda|Lambda\s+function|lambda\s+handler)\b"` or require AWS context.
Same concern for CDK (Chrome DevTools Kit?) and JSON (matches literally everything).
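A quick sketch of how the narrowed pattern behaves next to a bare `Lambda` match. The variable names, the sample sentences, and the use of `re.IGNORECASE` are assumptions for illustration; they are not taken from `extract-knowledge.py`.

```python
import re

# Hypothetical "before" pattern: a bare word match (assumed, not copied from the PR).
BROAD_LAMBDA = re.compile(r"\bLambda\b", re.IGNORECASE)

# Narrowed pattern as suggested in the review comment above.
NARROW_LAMBDA = re.compile(
    r"\b(?:AWS\s+Lambda|Lambda\s+function|lambda\s+handler)\b", re.IGNORECASE
)

samples = [
    "Deployed the AWS Lambda behind API Gateway",
    "Replaced the anonymous class with a Java lambda",
    "The lambda handler timed out after 3 seconds",
]

for text in samples:
    # Columns: broad match, narrow match, sample text.
    print(bool(BROAD_LAMBDA.search(text)), bool(NARROW_LAMBDA.search(text)), text)
```

The bare pattern fires on all three sentences, while the narrowed one skips the Java sentence. With case-insensitive matching, a phrase like "Java lambda function" would still match the `Lambda\s+function` alternative, so requiring nearby AWS context may still be worth considering.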
| (r"\b(?:Step\s+Functions?)\b", "step-functions"), | ||
| (r"\b(?:X-Ray|XRay)\b", "xray"), | ||
| (r"\b(?:WebSocket|wss?://)\b", "websocket"), | ||
| (r"\b(?:Docker|docker-compose)\b", "docker"), |
token is extremely generic. CSRF tokens, embedding tokens, API tokens, OAuth tokens — this would tag almost everything as auth.
Suggest: restrict to `r"\b(?:SAML|OAuth2?|JWT|bearer\s+token)\b"`
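To make the difference concrete, here is a small comparison. The "before" pattern is a guess at what a `token`-inclusive auth pattern might look like (the actual line in the PR may differ), and `re.IGNORECASE` is assumed.

```python
import re

# Hypothetical "before" pattern that includes the bare word "token" (assumption).
BROAD_AUTH = re.compile(r"\b(?:SAML|OAuth2?|JWT|token)\b", re.IGNORECASE)

# Restricted pattern from the suggestion above.
NARROW_AUTH = re.compile(r"\b(?:SAML|OAuth2?|JWT|bearer\s+token)\b", re.IGNORECASE)

samples = [
    "Send the bearer token in the Authorization header",
    "The prompt exceeded the model's token limit",
    "Rotate the JWT signing key",
]

broad_hits = [bool(BROAD_AUTH.search(s)) for s in samples]
narrow_hits = [bool(NARROW_AUTH.search(s)) for s in samples]

for sample, broad, narrow in zip(samples, broad_hits, narrow_hits):
    print(f"broad={broad!s:5} narrow={narrow!s:5} {sample}")
```

Only the restricted pattern leaves the "token limit" sentence untagged, which is the false positive this comment is about.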
| (r"\b(?:JPQL|JPA|Hibernate)\b", "jpa"), | ||
| (r"\b(?:PostgreSQL|Postgres|PG)\b", "postgresql"), | ||
| # Cloud & Infrastructure | ||
| (r"\b(?:AWS|Amazon\s+Web\s+Services)\b", "aws"), |
\.tsx? without end anchor can match .tsv, .tsc, etc. Same issue with \.jsx? and \.py.
Suggest: `r"\b(?:TypeScript)\b|\.tsx?\b"` or use `(?:\.tsx?)(?:\s|$|["'])`
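A minimal check of the boundary fix, assuming plain `re.search` is used on the text; the constant names and file names are made up for the example.

```python
import re

# Unanchored extension patterns, as flagged in the comment above.
TS_UNANCHORED = re.compile(r"\.tsx?")
PY_UNANCHORED = re.compile(r"\.py")

# Same patterns with a trailing word boundary.
TS_ANCHORED = re.compile(r"\bTypeScript\b|\.tsx?\b")
PY_ANCHORED = re.compile(r"\.py\b")

ts_names = ["App.tsx", "index.ts", "export.tsv", "tsconfig.tsbuildinfo"]
py_names = ["main.py", "main.pyc"]

# Which names each pattern accepts.
ts_unanchored_hits = [n for n in ts_names if TS_UNANCHORED.search(n)]
ts_anchored_hits = [n for n in ts_names if TS_ANCHORED.search(n)]
py_unanchored_hits = [n for n in py_names if PY_UNANCHORED.search(n)]
py_anchored_hits = [n for n in py_names if PY_ANCHORED.search(n)]

print(ts_unanchored_hits, ts_anchored_hits)
print(py_unanchored_hits, py_anchored_hits)
```

The trailing `\b` fails when the extension is followed by another word character, which is what rules out `.tsv` and `.pyc`. It still accepts names like `file.ts-old`, since `-` is not a word character.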
| r"(?:インストール|設定|バージョン|依存関係|ツール|環境構築)", | ||
| ] | ||
🔴 Breaking change: `TOOL_INDICATORS` removed `gradle|maven|spring\s+boot`. Java/Spring users rely on these to classify tool-related entries. Please keep the original patterns and ADD new ones.

    r"\b(?:gradle|maven|docker|redis|postgres|spring\s+boot)\b",  # keep original
    r"\b(?:yarn|npm|cdk|playwright|jest|eslint|prettier)\b",  # add new
🔴 Backward compat broken: Original tags Thymeleaf, CSRF, Liquibase, JDK/Java are removed entirely (not in Legacy section). gradle renamed to java-build, excel renamed to spreadsheet.
Existing knowledge entries tagged with gradle, java, excel etc. will no longer match.
Fix: Keep ALL original tag patterns. Add new ones alongside, don't replace.
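One way to keep this additive is sketched below. The legacy regexes and tag names are stand-ins (the real ones live in `extract-knowledge.py`), and `extract_tags` and `missing_legacy_tags` are hypothetical helpers, not existing functions in the repo.

```python
import re

# Hypothetical stand-ins for the original (legacy) entries.
LEGACY_TAG_PATTERNS = [
    (r"\bThymeleaf\b", "thymeleaf"),
    (r"\bCSRF\b", "csrf"),
    (r"\bLiquibase\b", "liquibase"),
    (r"\b(?:JDK|Java)\b", "java"),
    (r"\bGradle\b", "gradle"),
    (r"\bExcel\b", "excel"),
]

# New entries (patterns taken from the diff context) are appended, never substituted.
NEW_TAG_PATTERNS = [
    (r"\b(?:Step\s+Functions?)\b", "step-functions"),
    (r"\b(?:Docker|docker-compose)\b", "docker"),
    (r"\b(?:AWS|Amazon\s+Web\s+Services)\b", "aws"),
]

TAG_PATTERNS = LEGACY_TAG_PATTERNS + NEW_TAG_PATTERNS


def extract_tags(text):
    """Return the sorted set of tags whose pattern matches the text."""
    return sorted(
        {tag for pattern, tag in TAG_PATTERNS if re.search(pattern, text, re.IGNORECASE)}
    )


def missing_legacy_tags(patterns):
    """List legacy tag names that a candidate pattern list no longer emits."""
    emitted = {tag for _, tag in patterns}
    return sorted({tag for _, tag in LEGACY_TAG_PATTERNS} - emitted)


print(extract_tags("Gradle build on AWS exports an Excel report"))
print(missing_legacy_tags(TAG_PATTERNS))
```

A check like `missing_legacy_tags` could run as a unit test so that a renamed or dropped legacy tag fails loudly instead of silently orphaning existing entries.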
| parts = branch.replace("/", "-").replace("_", "-").split("-") | ||
| keywords.update(p for p in parts if len(p) > 2 | ||
| and p not in ("feature", "fix", "chore", "update", "and")) | ||
| and p not in ("feature", "fix", "chore", "update", "and", |
"fpt" is a company name (FPT Software). This is a public open-source tool, so it shouldn't hardcode organization names.
Options:
- Remove `fpt` from this list
- Make stopwords configurable via a config file
- At minimum, add a comment explaining why it's here
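A rough sketch of the config-file option, built around the branch-splitting logic shown in the diff. The `branch_stopwords` key, the JSON format, and the function names are all assumptions; `briefing.py` may want a different config mechanism.

```python
import json
from pathlib import Path

# Generic defaults only; organization-specific words come from user config.
DEFAULT_STOPWORDS = frozenset({"feature", "fix", "chore", "update", "and"})


def load_stopwords(config_path=None):
    """Merge the defaults with optional extras from a JSON config file.

    The "branch_stopwords" key is a hypothetical name chosen for this sketch.
    """
    extra = set()
    if config_path is not None and Path(config_path).is_file():
        data = json.loads(Path(config_path).read_text(encoding="utf-8"))
        extra.update(str(word).lower() for word in data.get("branch_stopwords", []))
    return DEFAULT_STOPWORDS | extra


def branch_keywords(branch, stopwords=DEFAULT_STOPWORDS):
    """Split a branch name on '/', '_' and '-' and drop short parts and stopwords."""
    parts = branch.replace("/", "-").replace("_", "-").split("-")
    return {p for p in parts if len(p) > 2 and p not in stopwords}


if __name__ == "__main__":
    # Without user config, "dev" and "fpt" are kept as keywords.
    print(sorted(branch_keywords("dev/fpt/feature/5022-copy-to-group")))
```

A team using the `dev/fpt/` convention could then list `dev` and `fpt` in its own config file, and the public defaults stay free of company names.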
Closing to recreate from owner account with review fixes applied. Thank you for the detailed review!
- #2 Remove bypass hint from deny messages (hook was coaching agent how to bypass)
- #3 Fail-closed on empty/malformed stdin (enforce-briefing denies, others safe-pass)
- #4 learn.py detection requires python3 execution, not just substring match
- #5 enforce-tentacle now gates bash file writes, not just edit/create
- #6 verify-integrity checks hooks.json hash (was stored but never verified)
- #8 git commit detection uses regex to handle interleaved flags

Remaining: #1 marker spoofing (needs architecture change), #7 warn-only integrity, #9 relative paths, #10 double-counting, #11 non-git dirs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CRITICAL fixes:
- #1: Remove CLI 'sign' command from marker_auth.py (agents could forge markers via 'python3 marker_auth.py sign tentacle-done')
- #2: tentacle-suggest.py now uses sign_list_marker/verify_list_marker instead of raw file I/O (was corrupting HMAC-signed tentacle-edits)

HIGH fixes:
- #3: Add missing WARN constant in install.py (--lock-hooks crashed with NameError on config.json sanitization and non-root runs)
- #4: Expand is_secret_access() to block marker_auth.py execution, glob wildcards (.marker-*), and .copilot/hooks/. access patterns

MEDIUM fixes:
- #5: Add isinstance(tool_args, dict) to enforce-tentacle.py (same crash as enforce-learn.py when toolArgs is a string)
- #6: Fix tentacle-suggest.py redirect regex to match relative paths (was only matching absolute paths starting with /)
- #7: Add stale marker cleanup at sessionStart in auto-briefing.py (crash recovery: old markers from dead sessions no longer persist)

All 74 tests pass (9 security + 65 fixes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix #1: integration_auth_bearer_valid_passes now uses /api/noroute to exercise Bearer auth middleware (asserts 404 post-auth, not open /healthz)
- Fix #2: start() uses tuple bind (host, port) for IPv6-safe TcpListener binding
- Fix #3: TraceLayer customized to record only URI path, not query string (avoids ?token= leaks)
- Fix #4: resolve_file maps PathSafetyError::Empty to 404 (not 403) for root /
- Fix #5: serve_static rejects non-GET/HEAD with 405 Method Not Allowed
- Fix #6: symlink_metadata and canonicalize moved into spawn_blocking to avoid async runtime starvation
- Fix #7: rand dep aligned to 0.9 (removes duplicate rand 0.8); generate_nonce uses rand::rng() (0.9 API); add tracing dep for TraceLayer span customization
- Fix #8: CORS preflight Allow-Headers includes Last-Event-ID and X-Resume-Token for Python parity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- live.rs: honor Last-Event-ID header so reconnecting clients resume from the provided event id (positive integer); invalid/missing falls back to latest MAX(id) snapshot without panicking (#1)
- live.rs: move initial latest_entry_id() call into spawn_blocking to avoid blocking the Tokio executor with synchronous SQLite/r2d2 (#2)
- db.rs: replace .ok().flatten() on created_at with explicit has_ca branch: propagates real row conversion errors via ? when the column exists; skips r.get() entirely for NULL projection; collect errors instead of silently dropping rows via filter_map (#3)
- tests: replace fixed-size header read with read_http_headers() helper that loops until \r\n\r\n, preventing packet-splitting flakes (#4)
- tests: replace fixed sleep + single read with read_until_sse_contains() helper that accumulates chunks until expected strings appear or a deadline passes, avoiding timing/chunk fragility (#5)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
feat: Add Multilingual Support (EN/VI/JP) and Multi-Stack Technology Tags
Overview
This PR extends `copilot-session-knowledge` beyond its original Java/Spring focus to support multilingual knowledge extraction (English, Vietnamese, Japanese) and multi-stack technology tagging (AWS, TypeScript, React Native, and more).

Originally, the knowledge extraction was optimized for Java/Spring Boot projects. This update adds comprehensive support for cloud-native and mobile development stacks while maintaining full backward compatibility with existing Java patterns.
Changes
`extract-knowledge.py`
Backward compatibility: All original Java/Spring/PostgreSQL tags are preserved under a "Legacy" section.
`briefing.py`
- Add branch name stopwords for the `dev/fpt/` convention and conventional commit prefixes (`feat`, `chore`, `docs`, `style`, `test`, `build`)
- Example branch: `dev/fpt/feature/5022-copy-to-group`
`README.md`
Motivation
We use this tool in a bilingual (Vietnamese/Japanese) team working on an AWS CDK + Expo/React Native project. The original Java-focused patterns missed most of our knowledge entries. This PR makes the tool useful for a broader range of technology stacks and multilingual teams.
Testing
- Tested `extract-knowledge.py` against session checkpoints containing Japanese/Vietnamese content