feat: Add multilingual support (EN/VI/JP) and multi-stack technology tags by linhnt102-fpt · Pull Request #1 · magicpro97/copilot-session-knowledge

linhnt102-fpt · 2026-04-10T04:26:51Z

feat: Add Multilingual Support (EN/VI/JP) and Multi-Stack Technology Tags

Overview

This PR extends copilot-session-knowledge beyond its original Java/Spring focus to support multilingual knowledge extraction (English, Vietnamese, Japanese) and multi-stack technology tagging (AWS, TypeScript, React Native, and more).

Originally, the knowledge extraction was optimized for Java/Spring Boot projects. This update adds comprehensive support for cloud-native and mobile development stacks while maintaining full backward compatibility with existing Java patterns.

Changes

`extract-knowledge.py`

| Area | Before | After |
|---|---|---|
| Tag patterns | 20 Java-centric patterns | 55+ patterns covering AWS, TypeScript, React Native, testing, auth, TLS, etc. |
| Mistake indicators | English only | + Vietnamese + Japanese regex |
| Pattern indicators | English only | + Vietnamese + Japanese regex |
| Decision indicators | English only | + Vietnamese + Japanese regex |
| Tool indicators | Java tools (gradle, maven, spring) | Modern stack (yarn, npm, cdk, playwright, jest, eslint, prettier) + Japanese |
| Noise patterns | English only | + Japanese interview/scoring patterns |
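
To illustrate the per-language indicator rows in the table, here is a minimal sketch of how one category could combine English, Vietnamese, and Japanese patterns. The names `MISTAKE_INDICATORS_DEMO` and `looks_like_mistake` and the specific phrases are assumptions for illustration only; the PR's actual regexes are not shown in this description.

```python
import re

# Hypothetical indicator phrases, one entry per language (not the PR's real lists).
MISTAKE_INDICATORS_DEMO = [
    r"\b(?:bug|mistake|root\s+cause)\b",  # English: \b is safe around Latin words
    r"(?:lỗi|nguyên\s*nhân)",              # Vietnamese: "error", "cause"
    r"(?:バグ|エラー|原因)",                 # Japanese: no \b, words are not space-delimited
]

_MISTAKE_RE = re.compile("|".join(MISTAKE_INDICATORS_DEMO), re.IGNORECASE)


def looks_like_mistake(text: str) -> bool:
    """Return True if any indicator, in any of the three languages, appears in text."""
    return _MISTAKE_RE.search(text) is not None


print(looks_like_mistake("The root cause was a missing index"))  # English sample
print(looks_like_mistake("Nguyên nhân là do cấu hình sai"))      # Vietnamese sample
print(looks_like_mistake("原因はキャッシュの設定ミスでした"))         # Japanese sample
```

In Python 3, `\b` is defined relative to `\w`, which also matches kana and kanji, so a pattern such as `\bバグ\b` does not match inside running Japanese text. That is why the Japanese entry in this sketch omits word boundaries.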

Backward compatibility: All original Java/Spring/PostgreSQL tags are preserved under a "Legacy" section.

`briefing.py`

- Added branch name stopwords for the `dev/fpt/` convention and conventional commit prefixes (feat, chore, docs, style, test, build)
- Better keyword extraction from branch names like `dev/fpt/feature/5022-copy-to-group`

`README.md`

- Added "Multilingual & Multi-Stack Support" section documenting:
  - Trilingual indicator support (EN/VI/JP) with examples
  - 50+ technology tags organized by category
  - Branch name parsing examples with stopword filtering
- Added two new features to the feature comparison table

Motivation

We use this tool in a bilingual (Vietnamese/Japanese) team working on an AWS CDK + Expo/React Native project. The original Java-focused patterns missed most of our knowledge entries. This PR makes the tool useful for a broader range of technology stacks and multilingual teams.

Testing

- Manual verification: ran `extract-knowledge.py` against session checkpoints containing Japanese/Vietnamese content
- Tag extraction correctly identifies AWS, TypeScript, React Native patterns
- Japanese indicators correctly classify mistakes, patterns, decisions, and tools
- Legacy Java patterns still match Java-related content

…tags

Changes:
- extract-knowledge.py: Add Japanese regex indicators for all 4 categories (mistakes, patterns, decisions, tools), expand tag_patterns from 20 to 55+ covering AWS, TypeScript, React Native, Expo, testing tools, and more. Add Japanese noise filters. Keep legacy Java/Spring tags for backward compat.
- briefing.py: Add branch name stopwords for dev/fpt/ convention and conventional commit prefixes (feat, chore, docs, style, test, build).
- README.md: Add Multilingual & Multi-Stack Support section documenting trilingual indicators, 50+ technology tags, and branch parsing examples. Add two new features to the feature table.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

magicpro97

Overall

Great contribution! Multilingual support (EN/VI/JP) and multi-stack tags are very welcome. The Japanese indicators and noise filters are well-researched. A few issues to address before merging:

🔴 Must Fix

- Breaking backward compatibility: PR description says "all original tags preserved" but several are removed: Thymeleaf, CSRF, Liquibase, JDK/Java tags are gone, and gradle → java-build, excel → spreadsheet are renamed. Existing knowledge entries with old tags will become orphaned.
- `fpt` hardcoded in stopwords: This is an organization-specific name. A public tool shouldn't have company names hardcoded. Make this configurable or remove it.
- Overly generic tag patterns: Lambda, CDK, JSON, token, Copilot will cause false positives in almost every session.

🟡 Should Fix

- `\.tsx?` and `\.py` regex in tag patterns can match unintended strings (e.g., .tsv, .pyc). Need end-of-word boundary.
- `TOOL_INDICATORS` removed `gradle|maven|spring\s+boot`; these are real tool indicators that Java users need.

🟢 Nice to Have

Consider adding Korean (한국어) indicators as a future TODO since the pattern is established.

See inline comments for specific locations.

magicpro97 · 2026-04-10T06:23:08Z

@@ -141,6 +156,9 @@ def classify_paragraph(text: str) -> list[tuple[str, float]]:
    r"câu\s*hỏi\s*phỏng\s*vấn",


⚠️ False positive risk: Lambda is too generic — matches Java lambdas, Python lambda expressions, math lambda calculus discussions, etc.

Suggest: `r"\b(?:AWS\s+Lambda|Lambda\s+function|lambda\s+handler)\b"` or require AWS context.

Same concern for CDK (Chrome DevTools Kit?) and JSON (matches literally everything).
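
As a rough check of this concern, the sketch below compares a bare `Lambda` pattern with the scoped pattern suggested above on a few made-up sentences. Case-insensitive matching and the sample sentences are assumptions; the flags the tool actually passes to `re` are not shown in this thread.

```python
import re

# Assumption: patterns are matched case-insensitively; the PR's flags are not shown.
GENERIC = re.compile(r"\bLambda\b", re.IGNORECASE)
SCOPED = re.compile(
    r"\b(?:AWS\s+Lambda|Lambda\s+function|lambda\s+handler)\b", re.IGNORECASE
)

# Made-up sample sentences: one AWS-related, two unrelated uses of "lambda".
SAMPLES = [
    "Deployed the AWS Lambda behind API Gateway",
    "Replaced the anonymous class with a Java lambda",
    "Notes on lambda calculus and reduction rules",
]

for text in SAMPLES:
    # Columns: generic match, scoped match, sample text.
    print(bool(GENERIC.search(text)), bool(SCOPED.search(text)), text)
```

The generic pattern tags all three samples, while the scoped one tags only the AWS sentence. Under case-insensitive matching, a phrase like "a Python lambda function" would still match the scoped pattern, so requiring AWS context may still be worth considering.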

magicpro97 · 2026-04-10T06:23:08Z

+        (r"\b(?:Step\s+Functions?)\b", "step-functions"),
+        (r"\b(?:X-Ray|XRay)\b", "xray"),
+        (r"\b(?:WebSocket|wss?://)\b", "websocket"),
        (r"\b(?:Docker|docker-compose)\b", "docker"),


⚠️ token is extremely generic. CSRF tokens, embedding tokens, API tokens, OAuth tokens — this would tag almost everything as auth.

Suggest: restrict to `r"\b(?:SAML|OAuth2?|JWT|bearer\s+token)\b"`
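
A small sketch of the difference, under stated assumptions: `BROAD_AUTH` is a hypothetical pattern with a bare `token` alternative (the PR's exact regex is not quoted here), `NARROW_AUTH` is the restriction suggested above, and the sample sentences are invented.

```python
import re

# Hypothetical "before" pattern containing a bare `token` alternative.
BROAD_AUTH = re.compile(r"\b(?:SAML|OAuth2?|JWT|token)\b", re.IGNORECASE)
# The restricted pattern suggested in the review comment.
NARROW_AUTH = re.compile(r"\b(?:SAML|OAuth2?|JWT|bearer\s+token)\b", re.IGNORECASE)

SAMPLES = [
    "The embedding model has a 512 token limit",        # not about auth
    "Validate the JWT signature on every request",       # auth
    "Send the bearer token in the Authorization header",  # auth
]

for text in SAMPLES:
    # Columns: broad match, narrow match, sample text.
    print(bool(BROAD_AUTH.search(text)), bool(NARROW_AUTH.search(text)), text)
```

The narrow pattern drops the token-limit sentence while keeping both auth sentences. It also stops tagging phrases such as "CSRF token", which may or may not be the desired trade-off.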

magicpro97 · 2026-04-10T06:23:08Z

-        (r"\b(?:JPQL|JPA|Hibernate)\b", "jpa"),
-        (r"\b(?:PostgreSQL|Postgres|PG)\b", "postgresql"),
+        # Cloud & Infrastructure
+        (r"\b(?:AWS|Amazon\s+Web\s+Services)\b", "aws"),


⚠️ \.tsx? without end anchor can match .tsv, .tsc, etc. Same issue with \.jsx? and \.py.

Suggest: `r"\b(?:TypeScript)\b|\.tsx?\b"` or use `(?:\.tsx?)(?:\s|$|["'])`
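
The effect of the trailing `\b` can be checked directly. This is a minimal sketch with invented file names; the pattern names are placeholders, not identifiers from `extract-knowledge.py`.

```python
import re

LOOSE_TS = re.compile(r"\.tsx?")      # no end anchor
BOUNDED_TS = re.compile(r"\.tsx?\b")  # extension must end at a word boundary
BOUNDED_PY = re.compile(r"\.py\b")

for name in ["App.tsx", "index.ts", "export.tsv", "cache.pyc", "briefing.py"]:
    # Columns: file name, loose .ts match, bounded .ts match, bounded .py match.
    print(
        name,
        bool(LOOSE_TS.search(name)),
        bool(BOUNDED_TS.search(name)),
        bool(BOUNDED_PY.search(name)),
    )
```

Without the boundary, `export.tsv` matches because `\.ts` is satisfied by the first three characters of `.tsv`. With `\b`, the match fails there because `s` and `v` are both word characters.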

magicpro97 · 2026-04-10T06:23:08Z

+    r"(?:インストール|設定|バージョン|依存関係|ツール|環境構築)",
 ]




🔴 Breaking change: TOOL_INDICATORS removed gradle|maven|spring\s+boot. Java/Spring users rely on these to classify tool-related entries. Please keep the original patterns and ADD new ones.

    r"\b(?:gradle|maven|docker|redis|postgres|spring\s+boot)\b",  # keep original
    r"\b(?:yarn|npm|cdk|playwright|jest|eslint|prettier)\b",  # add new

🔴 Backward compat broken: Original tags Thymeleaf, CSRF, Liquibase, JDK/Java are removed entirely (not in Legacy section). gradle renamed to java-build, excel renamed to spreadsheet.

Existing knowledge entries tagged with gradle, java, excel etc. will no longer match.

Fix: Keep ALL original tag patterns. Add new ones alongside, don't replace.
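
One way to read this fix is to append the new patterns to the original list instead of editing it in place. The sketch below assumes a list of `(regex, tag)` tuples as in the diff hunks quoted in this review; the `gradle` entry, the list names, and `extract_tags` are hypothetical stand-ins, since the original definitions are not shown here.

```python
import re

# Patterns that existed before the PR. The jpa, postgresql, and docker entries
# appear in the diff hunks quoted in this review; the gradle entry is a
# hypothetical stand-in for a tag the review says was renamed.
LEGACY_TAG_PATTERNS = [
    (r"\b(?:JPQL|JPA|Hibernate)\b", "jpa"),
    (r"\b(?:PostgreSQL|Postgres|PG)\b", "postgresql"),
    (r"\b(?:Docker|docker-compose)\b", "docker"),
    (r"\b(?:Gradle|gradlew)\b", "gradle"),
]

# New patterns quoted as added lines in the diff hunks.
NEW_TAG_PATTERNS = [
    (r"\b(?:AWS|Amazon\s+Web\s+Services)\b", "aws"),
    (r"\b(?:Step\s+Functions?)\b", "step-functions"),
]

# Append rather than replace, so tags on existing entries keep matching.
TAG_PATTERNS = LEGACY_TAG_PATTERNS + NEW_TAG_PATTERNS


def extract_tags(text: str) -> list[str]:
    """Return the sorted, de-duplicated tags whose pattern occurs in text."""
    return sorted({tag for pattern, tag in TAG_PATTERNS if re.search(pattern, text)})


print(extract_tags("Migrated the Gradle build to AWS and kept Postgres for storage"))
```

Keeping the old tag names unchanged means entries already stored under `gradle` or `postgresql` remain reachable, while new content can pick up the added tags as well.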

magicpro97 · 2026-04-10T06:23:08Z

            parts = branch.replace("/", "-").replace("_", "-").split("-")
            keywords.update(p for p in parts if len(p) > 2
-                           and p not in ("feature", "fix", "chore", "update", "and"))
+                           and p not in ("feature", "fix", "chore", "update", "and",


⚠️ Org-specific: "fpt" is a company name (FPT Software). This is a public open-source tool, so it shouldn't hardcode organization names.

Options:

- Remove `fpt` from this list
- Make stopwords configurable via a config file
- At minimum, add a comment explaining why it's here
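
A minimal sketch of the configurable option, with assumptions called out: an environment variable named `BRIEFING_EXTRA_STOPWORDS` stands in for the config file mentioned above, and `load_stopwords` and `branch_keywords` are hypothetical helpers modeled on the diff hunk, not functions from `briefing.py`.

```python
import os

# Built-in, organization-neutral stopwords (the original tuple from the diff).
DEFAULT_STOPWORDS = frozenset({"feature", "fix", "chore", "update", "and"})


def load_stopwords(env=None) -> frozenset:
    """Merge the defaults with comma-separated extras from a hypothetical env var."""
    env = os.environ if env is None else env
    raw = env.get("BRIEFING_EXTRA_STOPWORDS", "")
    extras = {word.strip().lower() for word in raw.split(",") if word.strip()}
    return DEFAULT_STOPWORDS | extras


def branch_keywords(branch: str, stopwords: frozenset) -> set:
    """Split a branch name on / _ - and drop short parts and stopwords."""
    parts = branch.replace("/", "-").replace("_", "-").split("-")
    return {p for p in parts if len(p) > 2 and p not in stopwords}


# A team that uses the dev/fpt/ prefix opts in through its own environment.
team_env = {"BRIEFING_EXTRA_STOPWORDS": "dev, fpt"}
print(sorted(branch_keywords("dev/fpt/feature/5022-copy-to-group", load_stopwords(team_env))))
```

With this shape the public defaults stay free of organization names, and each team supplies its own prefixes locally.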

linhnt102-fpt · 2026-04-10T06:25:38Z

Closing to recreate from owner account with review fixes applied. Thank you for the detailed review!

- #2 Remove bypass hint from deny messages (hook was coaching agent how to bypass)
- #3 Fail-closed on empty/malformed stdin (enforce-briefing denies, others safe-pass)
- #4 learn.py detection requires python3 execution, not just substring match
- #5 enforce-tentacle now gates bash file writes, not just edit/create
- #6 verify-integrity checks hooks.json hash (was stored but never verified)
- #8 git commit detection uses regex to handle interleaved flags

Remaining: #1 marker spoofing (needs architecture change), #7 warn-only integrity, #9 relative paths, #10 double-counting, #11 non-git dirs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

CRITICAL fixes:
- #1: Remove CLI 'sign' command from marker_auth.py (agents could forge markers via 'python3 marker_auth.py sign tentacle-done')
- #2: tentacle-suggest.py now uses sign_list_marker/verify_list_marker instead of raw file I/O (was corrupting HMAC-signed tentacle-edits)

HIGH fixes:
- #3: Add missing WARN constant in install.py (--lock-hooks crashed with NameError on config.json sanitization and non-root runs)
- #4: Expand is_secret_access() to block marker_auth.py execution, glob wildcards (.marker-*), and .copilot/hooks/. access patterns

MEDIUM fixes:
- #5: Add isinstance(tool_args, dict) to enforce-tentacle.py (same crash as enforce-learn.py when toolArgs is a string)
- #6: Fix tentacle-suggest.py redirect regex to match relative paths (was only matching absolute paths starting with /)
- #7: Add stale marker cleanup at sessionStart in auto-briefing.py (crash recovery: old markers from dead sessions no longer persist)

All 74 tests pass (9 security + 65 fixes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix #1: integration_auth_bearer_valid_passes now uses /api/noroute to exercise Bearer auth middleware (asserts 404 post-auth, not open /healthz)
- Fix #2: start() uses tuple bind (host, port) for IPv6-safe TcpListener binding
- Fix #3: TraceLayer customized to record only URI path, not query string (avoids ?token= leaks)
- Fix #4: resolve_file maps PathSafetyError::Empty to 404 (not 403) for root /
- Fix #5: serve_static rejects non-GET/HEAD with 405 Method Not Allowed
- Fix #6: symlink_metadata and canonicalize moved into spawn_blocking to avoid async runtime starvation
- Fix #7: rand dep aligned to 0.9 (removes duplicate rand 0.8); generate_nonce uses rand::rng() (0.9 API); add tracing dep for TraceLayer span customization
- Fix #8: CORS preflight Allow-Headers includes Last-Event-ID and X-Resume-Token for Python parity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- live.rs: honor Last-Event-ID header so reconnecting clients resume from the provided event id (positive integer); invalid/missing falls back to latest MAX(id) snapshot without panicking (#1)
- live.rs: move initial latest_entry_id() call into spawn_blocking to avoid blocking the Tokio executor with synchronous SQLite/r2d2 (#2)
- db.rs: replace .ok().flatten() on created_at with explicit has_ca branch: propagates real row conversion errors via ? when the column exists; skips r.get() entirely for NULL projection; collect errors instead of silently dropping rows via filter_map (#3)
- tests: replace fixed-size header read with read_http_headers() helper that loops until \r\n\r\n, preventing packet-splitting flakes (#4)
- tests: replace fixed sleep + single read with read_until_sse_contains() helper that accumulates chunks until expected strings appear or a deadline passes, avoiding timing/chunk fragility (#5)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

magicpro97 requested changes Apr 10, 2026

linhnt102-fpt closed this Apr 10, 2026

magicpro97 mentioned this pull request Apr 10, 2026

feat: Add multilingual support (EN/VI/JP) and multi-stack technology tags #2

Merged

This was referenced May 13, 2026

[EPIC] Superpower Resilience - keep tentacle goal progressing across compaction, interruption, and quota limits #180

Closed

EPIC-0 — Maintainability roadmap board bootstrap #245

Closed

linhnt102-fpt wants to merge 1 commit into magicpro97:main from linhnt102-fpt:feature/multilingual-stack-support

3 participants
