feat: Add multilingual support (EN/VI/JP) and multi-stack technology tags by magicpro97 · Pull Request #2 · magicpro97/copilot-session-knowledge

magicpro97 · 2026-04-10T06:32:43Z

feat: Add Multilingual Support (EN/VI/JP) and Multi-Stack Technology Tags

Overview

Extends copilot-session-knowledge beyond Java/Spring to support trilingual knowledge extraction (English, Vietnamese, Japanese) and 60+ technology tags across multiple stacks.

All original tags and patterns are fully preserved for backward compatibility.

Changes

`extract-knowledge.py`

Tag patterns (20 → 60+):

✅ All 20 original tags preserved with exact same names (thymeleaf, csrf, liquibase, java, gradle, excel, jpa with JPQL, i18n with messages.properties)
➕ Added: AWS services, TypeScript, React Native, Expo, Node.js, testing tools, security patterns
🔒 False-positive prevention:
- Lambda → AWS Lambda|Lambda function|Lambda handler (won't match Java lambdas)
- CDK → AWS CDK|CDK stack|CDK construct (won't match Chrome DevTools)
- token → SAML|OAuth2?|JWT|bearer token (won't match generic tokens)
- Copilot → GitHub Copilot|Copilot CLI|Copilot chat (won't match aircraft)
- Removed overly generic JSON tag
- Removed .tsx?/.jsx?/.py file extension patterns (matched .tsv, .pyc, etc.)
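The narrowing described above can be checked with a small sketch. The CDK pattern is the one shown in the PR diff; the Lambda pattern, the case-sensitive matching, and the helper name `matches` are assumptions made for illustration and may differ from the actual code in `extract-knowledge.py`.

```python
import re

# Assumed, case-sensitive form of the narrowed Lambda pattern.
LAMBDA = re.compile(r"\b(?:AWS\s+Lambda|Lambda\s+(?:function|handler))\b")
# CDK pattern as it appears in the PR diff.
CDK = re.compile(r"\b(?:AWS\s+CDK|CDK\s+(?:stack|construct|deploy|app))\b")

# Removed extension pattern: the optional "x" lets ".ts" match inside ".tsv".
OLD_TS_EXT = re.compile(r"\.tsx?")
# Replacement that matches the full word only.
NEW_TS = re.compile(r"\bTypeScript\b")


def matches(pattern, text):
    """Return True when the pattern is found anywhere in text."""
    return pattern.search(text) is not None


print(matches(LAMBDA, "Deploy the AWS Lambda handler"))
print(matches(LAMBDA, "Replace the loop with a Java lambda"))
print(matches(OLD_TS_EXT, "export.tsv"))
print(matches(NEW_TS, "export.tsv"))
```

With case-insensitive matching, a phrase such as "lambda function" in a Java context would still match, so the narrowing reduces false positives rather than removing them.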

Indicators — added Japanese regex to all 4 categories:

Mistakes: エラー, バグ, 不具合, 障害, 修正, 原因, 間違い, 注意
Patterns: パターン, ルール, 規約, 推奨, 必須, ベストプラクティス
Decisions: 決定, 選択, 理由, 比較, トレードオフ, 代替案, 方針
Tools: インストール, 設定, バージョン, 依存関係, ツール, 環境構築

Noise filters — added Japanese patterns (面接, 採点, 評価基準).
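A minimal sketch of how the Japanese indicators and noise filters listed above could drive paragraph classification. The terms are copied from the lists; the dictionary layout, the function name `classify_ja`, and the "most hits wins" rule are assumptions, not the PR's implementation. The alternations are written without `\b` because Japanese text does not separate words with spaces.

```python
import re

# Indicator terms from the PR description, grouped by category.
JA_INDICATORS = {
    "mistake": r"エラー|バグ|不具合|障害|修正|原因|間違い|注意",
    "pattern": r"パターン|ルール|規約|推奨|必須|ベストプラクティス",
    "decision": r"決定|選択|理由|比較|トレードオフ|代替案|方針",
    "tool": r"インストール|設定|バージョン|依存関係|ツール|環境構築",
}
# Noise terms from the PR description (interview and grading vocabulary).
JA_NOISE = re.compile(r"面接|採点|評価基準")

COMPILED = {name: re.compile(pattern) for name, pattern in JA_INDICATORS.items()}


def classify_ja(paragraph):
    """Return the category with the most indicator hits, or None."""
    if JA_NOISE.search(paragraph):
        return None  # treat the paragraph as noise
    hits = {name: len(rx.findall(paragraph)) for name, rx in COMPILED.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


print(classify_ja("ビルドのエラーの原因はバグでした"))
print(classify_ja("面接の評価基準とエラー"))
```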

`briefing.py`

Added branch stopwords: dev, bug, refactor, docs, release, hotfix, main, master
Added commit message stopwords: use, refactor, feat, chore, docs, style, test, build
No org-specific names hardcoded

`README.md`

Added "Multilingual & Multi-Stack Support" section with trilingual indicator table, 60+ tag reference, and branch parsing examples

Backward Compatibility

| Original Tag | Status |
|---|---|
| `spring-boot` | ✅ Preserved |
| `thymeleaf` | ✅ Preserved |
| `jpa` (with JPQL) | ✅ Preserved |
| `postgresql` | ✅ Preserved |
| `redis` | ✅ Preserved |
| `gradle` | ✅ Preserved (was renamed to `java-build`, now reverted) |
| `csrf` | ✅ Preserved |
| `liquibase` | ✅ Preserved |
| `java` | ✅ Preserved |
| `excel` | ✅ Preserved (was renamed to `spreadsheet`, now reverted) |
| `javascript` (with `JS`) | ✅ Preserved |
| `i18n` (with `messages.properties`) | ✅ Preserved |
| All others | ✅ Preserved |

…tags

Changes:
- extract-knowledge.py: Expand tag_patterns from 20 to 60+ covering AWS, TypeScript, React Native, testing tools, etc. All original tags preserved (thymeleaf, csrf, liquibase, java, gradle, excel, JPQL, messages.properties). Fix false-positive patterns: Lambda→AWS Lambda, CDK→AWS CDK, token→bearer token, Copilot→GitHub Copilot. Remove overly generic JSON tag. Fix .tsx?/.py regex to use word names only. Add Japanese indicators for all 4 categories. Restore gradle/maven/spring boot in TOOL_INDICATORS alongside new tools.
- briefing.py: Add branch/commit stopwords for conventional commits. Remove org-specific names from stopwords (keep generic only).
- README.md: Add Multilingual & Multi-Stack Support section with trilingual docs, technology tag table, and branch parsing examples.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR extends the knowledge extraction pipeline to better classify content across English/Vietnamese/Japanese and to extract a broader set of technology tags across multiple stacks, while updating auto-detected briefing keywords and documentation to reflect the new capabilities.

Changes:

Expanded indicator/noise regexes to include Japanese (and added more tool indicators) for paragraph classification.
Expanded technology tag extraction to include AWS/services, frontend/testing/tooling/security tags in addition to the existing Java/Spring tags.
Updated briefing keyword extraction stopwords and documented multilingual/tag support + branch parsing behavior in the README.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| README.md | Documents multilingual support, tag taxonomy, and branch parsing examples/stopwords. |
| extract-knowledge.py | Adds Japanese/Vietnamese indicators/noise filters and expands tag extraction patterns substantially. |
| briefing.py | Extends branch/commit stopword lists for better auto-detected briefing queries. |

Copilot · 2026-04-10T06:36:22Z

+        (r"\b(?:CloudFormation|CFN)\b", "cloudformation"),
+        (r"\b(?:Step\s+Functions?)\b", "step-functions"),
+        (r"\b(?:X-Ray|XRay)\b", "xray"),
+        (r"\b(?:WebSocket|wss?://)\b", "websocket"),


The websocket tag regex wraps wss?:// in \b...\b. Because the URL alternative ends with /, a non-word character, the trailing \b only matches when a word character follows immediately. The pattern therefore misses a bare wss:// and URLs whose host starts with a non-word character, such as wss://[::1]. Consider separating the alternatives so only WebSocket uses word boundaries, or remove the trailing \b for the URL form.

Suggested change

-        (r"\b(?:WebSocket|wss?://)\b", "websocket"),
+        (r"(?:\bWebSocket\b|wss?://)", "websocket"),
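A quick way to compare the original and suggested patterns on sample strings. Both regexes are copied from the review comment; the helper `check` and the sample inputs are illustrative only, and no flags are assumed.

```python
import re

ORIGINAL = re.compile(r"\b(?:WebSocket|wss?://)\b")
SUGGESTED = re.compile(r"(?:\bWebSocket\b|wss?://)")


def check(text):
    """Return (original_matches, suggested_matches) for one sample."""
    return (ORIGINAL.search(text) is not None,
            SUGGESTED.search(text) is not None)


samples = [
    "use a WebSocket here",    # keyword form
    "wss://example.com/live",  # a word character follows the slashes
    "wss://[::1]:8443",        # a non-word character follows the slashes
    "news://example.org",      # contains "ws://" as a substring
]
for sample in samples:
    print(sample, check(sample))
```

The original pattern still matches a URL when a word character follows `//`, and fails otherwise. The suggested pattern has no leading boundary on the URL alternative, so it also matches the tail of `news://`; writing that alternative as `\bwss?://` would be one way to avoid this.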

Copilot · 2026-04-10T06:36:23Z

 TOOL_INDICATORS = [
    r"(?:install|configure|setup|version|upgrade|dependency)\b",
    r"(?:gradle|maven|docker|redis|postgres|spring\s+boot)\b",
-    r"(?:JDK|SDK|IDE|VSCode|extension)\b",
+    r"(?:yarn|npm|cdk|playwright|jest|eslint|prettier)\b",
+    r"(?:JDK|SDK|IDE|VSCode|extension|plugin|MCP)\b",


Copilot · 2026-04-10T06:36:23Z

 def extract_tags(text: str) -> str:
    """Extract relevant tags from text."""
    tag_patterns = [
-        (r"\b(?:Spring\s+Boot|SpringBoot)\b", "spring-boot"),
-        (r"\b(?:Thymeleaf)\b", "thymeleaf"),
-        (r"\b(?:JPQL|JPA|Hibernate)\b", "jpa"),
-        (r"\b(?:PostgreSQL|Postgres|PG)\b", "postgresql"),
+        # Cloud & Infrastructure
+        (r"\b(?:AWS|Amazon\s+Web\s+Services)\b", "aws"),
+        (r"\b(?:AWS\s+CDK|CDK\s+(?:stack|construct|deploy|app))\b", "aws-cdk"),


extract_tags() rebuilds the full tag_patterns list on every call. Since this runs per extracted chunk, the expanded 50+ tag list will repeatedly allocate/compile regexes and can become a hot path. Consider moving patterns to a module-level constant and precompiling them (e.g., [(re.compile(..., re.I), tag), ...]).
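The suggestion above could look roughly like the following sketch. The three patterns are taken from the diff; the constant name `TAG_PATTERNS`, the `re.IGNORECASE` flag (following the reviewer's `re.I` example), and the comma-separated return format are assumptions.

```python
import re

# Module-level table, compiled once at import time instead of on every call.
TAG_PATTERNS = [
    (re.compile(r"\b(?:AWS|Amazon\s+Web\s+Services)\b", re.IGNORECASE), "aws"),
    (re.compile(r"\b(?:AWS\s+CDK|CDK\s+(?:stack|construct|deploy|app))\b",
                re.IGNORECASE), "aws-cdk"),
    (re.compile(r"\b(?:Spring\s+Boot|SpringBoot)\b", re.IGNORECASE), "spring-boot"),
]


def extract_tags(text: str) -> str:
    """Return the matching tags as a comma-separated string, in table order."""
    return ",".join(tag for pattern, tag in TAG_PATTERNS if pattern.search(text))


print(extract_tags("Deploy the AWS CDK stack with Spring Boot"))
```

Keeping the table at module level also makes it straightforward to count the distinct tags, which helps keep the README and PR description figures consistent.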

Copilot · 2026-04-10T06:36:23Z

+                           and p not in ("feature", "fix", "chore", "update", "and",
+                                         "dev", "bug", "refactor", "docs",
+                                         "release", "hotfix", "main", "master"))


Branch stopword filtering is case-sensitive (p not in (...)) even though commit stopword filtering uses w.lower(). If a branch segment is capitalized (e.g., Dev/Feature/...), it will leak into keywords. Consider normalizing p (and/or branch) to lowercase for stopword comparison before adding to keywords.

Suggested change

-                           and p not in ("feature", "fix", "chore", "update", "and",
-                                         "dev", "bug", "refactor", "docs",
-                                         "release", "hotfix", "main", "master"))
+                           and p.lower() not in ("feature", "fix", "chore", "update", "and",
+                                         "dev", "bug", "refactor", "docs",
+                                         "release", "hotfix", "main", "master"))
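A self-contained sketch of case-insensitive branch stopword filtering. The stopword list comes from the diff; the function name `branch_keywords`, the separator set, and the minimum segment length are hypothetical and may not match `briefing.py`.

```python
import re

BRANCH_STOPWORDS = frozenset({
    "feature", "fix", "chore", "update", "and",
    "dev", "bug", "refactor", "docs",
    "release", "hotfix", "main", "master",
})


def branch_keywords(branch: str) -> list[str]:
    """Split a branch name into segments and drop stopwords, ignoring case."""
    parts = re.split(r"[/_\-]+", branch)
    # Assumed rule: skip very short segments, then compare in lowercase.
    return [p for p in parts if len(p) > 2 and p.lower() not in BRANCH_STOPWORDS]


print(branch_keywords("Feature/Multilingual-Stack-Support"))
print(branch_keywords("Dev/hotfix"))
```

Using a `frozenset` keeps the membership test constant-time and makes the list reusable if commit message filtering shares some of the same stopwords.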

Copilot · 2026-04-10T06:36:23Z

+### Supported Technology Tags (50+)
+
+| Category | Tags |
+|---|---|
+| **Cloud & Infra** | `aws`, `aws-cdk`, `lambda`, `dynamodb`, `s3`, `sqs`, `sns`, `cognito`, `cloudwatch`, `api-gateway`, `eventbridge`, `cloudformation`, `step-functions`, `xray`, `websocket`, `docker`, `vpc` |
+| **Languages** | `typescript`, `javascript`, `python`, `nodejs`, `java` |


The documentation says "Supported Technology Tags (50+)" here, while the PR description claims "60+" tags. Either adjust the wording in the README to match the actual tag count/pattern set, or update the PR description so the numbers are consistent.

magicpro97

V2 Review — All previous issues resolved ✅

Checklist vs PR #1 feedback:

| Issue | Status |
|---|---|
| 🔴 Breaking compat (removed tags) | ✅ Fixed: all 20 original tags preserved with exact same names |
| 🔴 `fpt` hardcoded | ✅ Fixed: removed from stopwords |
| 🔴 Generic tags (Lambda, CDK, JSON, token) | ✅ Fixed: `AWS Lambda\|Lambda function`, `AWS CDK\|CDK stack`, JSON removed, `SAML\|OAuth2?\|JWT\|bearer token` |
| ⚠️ Regex `.tsx?` matching `.tsv` | ✅ Fixed: removed file extension patterns, uses full word `TypeScript` |
| ⚠️ TOOL_INDICATORS missing gradle/maven | ✅ Fixed: original `gradle\|maven\|spring boot` preserved |

Minor nits (non-blocking):

(r"\b(?:wss?://)\b", "websocket"): the trailing \b after :// only matches when a word character follows, so this sub-pattern is unreliable for URLs. The WebSocket keyword covers most mentions; consider removing the sub-pattern.
README category "Database & ORM" includes spring-boot — consider renaming to "Java Ecosystem".

Verdict: LGTM 🚀

Clean, well-organized, backward-compatible. Ready to merge.

#2 Remove bypass hint from deny messages (hook was coaching agent how to bypass) #3 Fail-closed on empty/malformed stdin (enforce-briefing denies, others safe-pass) #4 learn.py detection requires python3 execution, not just substring match #5 enforce-tentacle now gates bash file writes, not just edit/create #6 verify-integrity checks hooks.json hash (was stored but never verified) #8 git commit detection uses regex to handle interleaved flags Remaining: #1 marker spoofing (needs architecture change), #7 warn-only integrity, #9 relative paths, #10 double-counting, #11 non-git dirs Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Same class of vulnerability as #2 — hook was literally telling the agent 'create ~/.copilot/markers/tentacle-bypass to skip'. Agent in session followed this instruction and bypassed tentacle enforcement entirely. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

CRITICAL fixes: - #1: Remove CLI 'sign' command from marker_auth.py (agents could forge markers via 'python3 marker_auth.py sign tentacle-done') - #2: tentacle-suggest.py now uses sign_list_marker/verify_list_marker instead of raw file I/O (was corrupting HMAC-signed tentacle-edits) HIGH fixes: - #3: Add missing WARN constant in install.py (--lock-hooks crashed with NameError on config.json sanitization and non-root runs) - #4: Expand is_secret_access() to block marker_auth.py execution, glob wildcards (.marker-*), and .copilot/hooks/. access patterns MEDIUM fixes: - #5: Add isinstance(tool_args, dict) to enforce-tentacle.py (same crash as enforce-learn.py when toolArgs is a string) - #6: Fix tentacle-suggest.py redirect regex to match relative paths (was only matching absolute paths starting with /) - #7: Add stale marker cleanup at sessionStart in auto-briefing.py (crash recovery — old markers from dead sessions no longer persist) All 74 tests pass (9 security + 65 fixes). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix #1: integration_auth_bearer_valid_passes now uses /api/noroute to exercise Bearer auth middleware (asserts 404 post-auth, not open /healthz) - Fix #2: start() uses tuple bind (host, port) for IPv6-safe TcpListener binding - Fix #3: TraceLayer customized to record only URI path, not query string (avoids ?token= leaks) - Fix #4: resolve_file maps PathSafetyError::Empty to 404 (not 403) for root / - Fix #5: serve_static rejects non-GET/HEAD with 405 Method Not Allowed - Fix #6: symlink_metadata and canonicalize moved into spawn_blocking to avoid async runtime starvation - Fix #7: rand dep aligned to 0.9 (removes duplicate rand 0.8); generate_nonce uses rand::rng() (0.9 API); add tracing dep for TraceLayer span customization - Fix #8: CORS preflight Allow-Headers includes Last-Event-ID and X-Resume-Token for Python parity Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- live.rs: honor Last-Event-ID header so reconnecting clients resume from the provided event id (positive integer); invalid/missing falls back to latest MAX(id) snapshot without panicking (#1) - live.rs: move initial latest_entry_id() call into spawn_blocking to avoid blocking the Tokio executor with synchronous SQLite/r2d2 (#2) - db.rs: replace .ok().flatten() on created_at with explicit has_ca branch — propagates real row conversion errors via ? when the column exists; skips r.get() entirely for NULL projection; collect errors instead of silently dropping rows via filter_map (#3) - tests: replace fixed-size header read with read_http_headers() helper that loops until \\r\\n\\r\\n, preventing packet-splitting flakes (#4) - tests: replace fixed sleep + single read with read_until_sse_contains() helper that accumulates chunks until expected strings appear or a deadline passes, avoiding timing/chunk fragility (#5) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 10, 2026 06:32

Copilot started reviewing on behalf of magicpro97 April 10, 2026 06:33 View session

magicpro97 merged commit c947c09 into main Apr 10, 2026
4 checks passed

magicpro97 deleted the feature/multilingual-stack-support branch April 10, 2026 06:43

This was referenced May 7, 2026

feat(browse.py): add --broker-mode telegram for outbound-only connectivity in tunnel-hostile networks #71

Closed

[Dreaming] Epic: sleep-like memory consolidation roadmap #156

Closed
