feat: Add multilingual support (EN/VI/JP) and multi-stack technology tags #2

Merged
magicpro97 merged 1 commit into main from feature/multilingual-stack-support
Apr 10, 2026

@magicpro97
Owner

feat: Add Multilingual Support (EN/VI/JP) and Multi-Stack Technology Tags

Overview

Extends copilot-session-knowledge beyond Java/Spring to support trilingual knowledge extraction (English, Vietnamese, Japanese) and 60+ technology tags across multiple stacks.

All original tags and patterns are fully preserved for backward compatibility.

Changes

extract-knowledge.py

Tag patterns (20 → 60+):

  • ✅ All 20 original tags preserved with exact same names (thymeleaf, csrf, liquibase, java, gradle, excel, jpa with JPQL, i18n with messages.properties)
  • ➕ Added: AWS services, TypeScript, React Native, Expo, Node.js, testing tools, security patterns
  • 🔒 False-positive prevention:
    • Lambda → AWS Lambda|Lambda function|Lambda handler (won't match Java lambdas)
    • CDK → AWS CDK|CDK stack|CDK construct (won't match Chrome DevTools)
    • token → SAML|OAuth2?|JWT|bearer token (won't match generic tokens)
    • Copilot → GitHub Copilot|Copilot CLI|Copilot chat (won't match aircraft)
    • Removed overly generic JSON tag
    • Removed .tsx?/.jsx?/.py file extension patterns (matched .tsv, .pyc, etc.)
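
To illustrate the false-positive fixes listed above, here is a minimal sketch comparing a bare `Lambda` pattern with the narrowed form. The narrowed alternatives come from the list; the "before" patterns, the `re.IGNORECASE` flag, the variable names, and the sample sentences are assumptions for illustration, not code taken from extract-knowledge.py.

```python
import re

# Hypothetical "before" pattern: any occurrence of the word Lambda.
GENERIC_LAMBDA = re.compile(r"\bLambda\b", re.IGNORECASE)

# Narrowed pattern built from the alternatives listed in the PR description.
AWS_LAMBDA = re.compile(
    r"\b(?:AWS\s+Lambda|Lambda\s+function|Lambda\s+handler)\b", re.IGNORECASE
)

java_text = "Replace the anonymous class with a lambda expression."
aws_text = "The AWS Lambda handler times out after 30 seconds."

generic_hits = [bool(GENERIC_LAMBDA.search(t)) for t in (java_text, aws_text)]
narrow_hits = [bool(AWS_LAMBDA.search(t)) for t in (java_text, aws_text)]

print(generic_hits)  # [True, True]
print(narrow_hits)   # [False, True]

# Hypothetical unanchored extension pattern: ".ts" is a prefix of ".tsv",
# so a spreadsheet export is mistaken for a TypeScript file.
EXT_PATTERN = re.compile(r"\.tsx?")
ext_false_positive = bool(EXT_PATTERN.search("export.tsv"))
print(ext_false_positive)  # True
```

The narrowed pattern trades recall for precision: a sentence that mentions only "Lambda" with no AWS context is no longer tagged.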

Indicators — added Japanese regex to all 4 categories:

  • Mistakes: エラー, バグ, 不具合, 障害, 修正, 原因, 間違い, 注意
  • Patterns: パターン, ルール, 規約, 推奨, 必須, ベストプラクティス
  • Decisions: 決定, 選択, 理由, 比較, トレードオフ, 代替案, 方針
  • Tools: インストール, 設定, バージョン, 依存関係, ツール, 環境構築
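
One detail worth checking when adding Japanese indicators: in Python 3, kana and kanji count as word characters, and Japanese text has no spaces between words, so `\b` anchors that work for English keywords may never fire. The sketch below uses the mistake indicators from the list without `\b`. The constant and function names are hypothetical, and whether the merged patterns use `\b` is not shown in this PR excerpt.

```python
import re

# Mistake indicators copied from the list above; the constant name is hypothetical.
MISTAKE_INDICATORS_JA = re.compile(r"(?:エラー|バグ|不具合|障害|修正|原因|間違い|注意)")


def looks_like_mistake(paragraph: str) -> bool:
    """Return True if the paragraph contains any Japanese mistake indicator."""
    return MISTAKE_INDICATORS_JA.search(paragraph) is not None


print(looks_like_mistake("ビルド時にエラーが発生した"))  # True
print(looks_like_mistake("今日は天気が良い"))            # False

# With \b on both sides the same keyword is missed, because the neighbouring
# kana are word characters and no boundary exists between them.
bounded = re.search(r"\bエラー\b", "ビルド時にエラーが発生した")
print(bounded)  # None
```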

TOOL_INDICATORS — original gradle|maven|spring boot preserved, added yarn|npm|cdk|playwright|jest|eslint|prettier alongside.

Noise filters — added Japanese patterns (面接, 採点, 評価基準).

briefing.py

  • Added branch stopwords: dev, bug, refactor, docs, release, hotfix, main, master
  • Added commit message stopwords: use, refactor, feat, chore, docs, style, test, build
  • No org-specific names hardcoded

README.md

  • Added "Multilingual & Multi-Stack Support" section with trilingual indicator table, 60+ tag reference, and branch parsing examples

Backward Compatibility

Original Tag Status
spring-boot ✅ Preserved
thymeleaf ✅ Preserved
jpa (with JPQL) ✅ Preserved
postgresql ✅ Preserved
redis ✅ Preserved
gradle ✅ Preserved (was renamed to java-build, now reverted)
csrf ✅ Preserved
liquibase ✅ Preserved
java ✅ Preserved
excel ✅ Preserved (was renamed to spreadsheet, now reverted)
javascript (with JS) ✅ Preserved
i18n (with messages.properties) ✅ Preserved
All others ✅ Preserved

…tags

Changes:
- extract-knowledge.py: Expand tag_patterns from 20 to 60+ covering AWS,
  TypeScript, React Native, testing tools, etc. All original tags preserved
  (thymeleaf, csrf, liquibase, java, gradle, excel, JPQL, messages.properties).
  Fix false-positive patterns: Lambda→AWS Lambda, CDK→AWS CDK, token→bearer token,
  Copilot→GitHub Copilot. Remove overly generic JSON tag. Fix .tsx?/.py regex
  to use word names only. Add Japanese indicators for all 4 categories.
  Restore gradle/maven/spring boot in TOOL_INDICATORS alongside new tools.
- briefing.py: Add branch/commit stopwords for conventional commits. Remove
  org-specific names from stopwords (keep generic only).
- README.md: Add Multilingual & Multi-Stack Support section with trilingual
  docs, technology tag table, and branch parsing examples.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 10, 2026 06:32

Copilot AI left a comment

Pull request overview

This PR extends the knowledge extraction pipeline to better classify content across English/Vietnamese/Japanese and to extract a broader set of technology tags across multiple stacks, while updating auto-detected briefing keywords and documentation to reflect the new capabilities.

Changes:

  • Expanded indicator/noise regexes to include Japanese (and added more tool indicators) for paragraph classification.
  • Expanded technology tag extraction to include AWS/services, frontend/testing/tooling/security tags in addition to the existing Java/Spring tags.
  • Updated briefing keyword extraction stopwords and documented multilingual/tag support + branch parsing behavior in the README.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
README.md Documents multilingual support, tag taxonomy, and branch parsing examples/stopwords.
extract-knowledge.py Adds Japanese/Vietnamese indicators/noise filters and expands tag extraction patterns substantially.
briefing.py Extends branch/commit stopword lists for better auto-detected briefing queries.

Comment thread extract-knowledge.py
(r"\b(?:CloudFormation|CFN)\b", "cloudformation"),
(r"\b(?:Step\s+Functions?)\b", "step-functions"),
(r"\b(?:X-Ray|XRay)\b", "xray"),
(r"\b(?:WebSocket|wss?://)\b", "websocket"),

Copilot AI Apr 10, 2026

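The websocket tag regex wraps wss?:// in \b...\b. Because that alternative ends with /, a non-word character, the trailing \b only matches when a word character follows immediately, so the URL form is missed whenever :// is followed by whitespace, punctuation, or the end of the text (wss://example.com still matches, a bare wss:// does not). Consider separating the alternatives so only WebSocket uses word boundaries, or remove the trailing \b for the URL form.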

Suggested change
- (r"\b(?:WebSocket|wss?://)\b", "websocket"),
+ (r"(?:\bWebSocket\b|wss?://)", "websocket"),
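
The boundary behavior discussed in this comment can be checked directly. The following sketch runs the original and the suggested pattern against three sample strings; the sample strings and variable names are assumptions chosen for illustration.

```python
import re

ORIGINAL = re.compile(r"\b(?:WebSocket|wss?://)\b")
SUGGESTED = re.compile(r"(?:\bWebSocket\b|wss?://)")

samples = [
    "connect to wss://example.com/socket",  # a word character follows ://
    "the scheme is wss:// for TLS",         # a space follows ://
    "use WebSocket for streaming",          # keyword form
]

original_hits = [bool(ORIGINAL.search(s)) for s in samples]
suggested_hits = [bool(SUGGESTED.search(s)) for s in samples]

print(original_hits)   # [True, False, True]
print(suggested_hits)  # [True, True, True]
```

The original pattern only misses the URL form when no word character follows the slashes; the suggested pattern removes that dependency.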

Comment thread extract-knowledge.py
Comment on lines 69 to +73
 TOOL_INDICATORS = [
     r"(?:install|configure|setup|version|upgrade|dependency)\b",
     r"(?:gradle|maven|docker|redis|postgres|spring\s+boot)\b",
-    r"(?:JDK|SDK|IDE|VSCode|extension)\b",
+    r"(?:yarn|npm|cdk|playwright|jest|eslint|prettier)\b",
+    r"(?:JDK|SDK|IDE|VSCode|extension|plugin|MCP)\b",

Copilot AI Apr 10, 2026

TOOL_INDICATORS entry (?:yarn|npm|cdk|playwright|jest|eslint|prettier)\b has no leading word-boundary, so it can match inside larger tokens that end with one of these strings (e.g., mycdk). Adding a leading \b (or otherwise anchoring) would reduce accidental matches.
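
A short sketch of the difference a leading `\b` makes, using the pattern quoted in this comment. The sample strings and variable names are assumptions; no regex flags are applied because the flags used in extract-knowledge.py are not shown here.

```python
import re

UNANCHORED = re.compile(r"(?:yarn|npm|cdk|playwright|jest|eslint|prettier)\b")
ANCHORED = re.compile(r"\b(?:yarn|npm|cdk|playwright|jest|eslint|prettier)\b")

samples = ["run mycdk deploy", "run cdk deploy"]

unanchored_hits = [bool(UNANCHORED.search(s)) for s in samples]
anchored_hits = [bool(ANCHORED.search(s)) for s in samples]

print(unanchored_hits)  # [True, True]   "mycdk" matches on its "cdk" suffix
print(anchored_hits)    # [False, True]  only the standalone token matches
```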

Comment thread extract-knowledge.py
Comment on lines 228 to +233
 def extract_tags(text: str) -> str:
     """Extract relevant tags from text."""
     tag_patterns = [
-        (r"\b(?:Spring\s+Boot|SpringBoot)\b", "spring-boot"),
-        (r"\b(?:Thymeleaf)\b", "thymeleaf"),
-        (r"\b(?:JPQL|JPA|Hibernate)\b", "jpa"),
-        (r"\b(?:PostgreSQL|Postgres|PG)\b", "postgresql"),
+        # Cloud & Infrastructure
+        (r"\b(?:AWS|Amazon\s+Web\s+Services)\b", "aws"),
+        (r"\b(?:AWS\s+CDK|CDK\s+(?:stack|construct|deploy|app))\b", "aws-cdk"),

Copilot AI Apr 10, 2026

extract_tags() rebuilds the full tag_patterns list on every call. Since this runs per extracted chunk, the expanded 50+ tag list will repeatedly allocate/compile regexes and can become a hot path. Consider moving patterns to a module-level constant and precompiling them (e.g., [(re.compile(..., re.I), tag), ...]).
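
One way the suggestion could look is sketched below, with an abbreviated pattern list taken from the diff above. The constant name `_TAG_PATTERNS`, the `re.IGNORECASE` flag, and the comma-separated return format are assumptions, since the body of the real function is not shown. Python's `re` module already keeps an internal cache of compiled patterns, so the practical saving is mostly the per-call list construction and cache lookups.

```python
import re

# Compiled once at import time. Abbreviated list; the (pattern, tag) pairs are
# copied from the diff, the constant name is hypothetical.
_TAG_PATTERNS = [
    (re.compile(pattern, re.IGNORECASE), tag)
    for pattern, tag in [
        (r"\b(?:Spring\s+Boot|SpringBoot)\b", "spring-boot"),
        (r"\b(?:JPQL|JPA|Hibernate)\b", "jpa"),
        (r"\b(?:AWS|Amazon\s+Web\s+Services)\b", "aws"),
        (r"\b(?:AWS\s+CDK|CDK\s+(?:stack|construct|deploy|app))\b", "aws-cdk"),
    ]
]


def extract_tags(text: str) -> str:
    """Return matching tags as a comma-separated string, in pattern order.

    The return format is an assumption made for this sketch.
    """
    return ",".join(tag for regex, tag in _TAG_PATTERNS if regex.search(text))


result = extract_tags("Deployed the CDK stack to AWS; JPA entities unchanged.")
print(result)  # jpa,aws,aws-cdk
```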

Comment thread briefing.py
Comment on lines +66 to +68
and p not in ("feature", "fix", "chore", "update", "and",
"dev", "bug", "refactor", "docs",
"release", "hotfix", "main", "master"))

Copilot AI Apr 10, 2026

Branch stopword filtering is case-sensitive (p not in (...)) even though commit stopword filtering uses w.lower(). If a branch segment is capitalized (e.g., Dev/Feature/...), it will leak into keywords. Consider normalizing p (and/or branch) to lowercase for stopword comparison before adding to keywords.

Suggested change
-    and p not in ("feature", "fix", "chore", "update", "and",
-                  "dev", "bug", "refactor", "docs",
-                  "release", "hotfix", "main", "master"))
+    and p.lower() not in ("feature", "fix", "chore", "update", "and",
+                          "dev", "bug", "refactor", "docs",
+                          "release", "hotfix", "main", "master"))
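
A self-contained sketch of case-insensitive branch filtering, using the stopword list from the diff above. The function name, the split on `/`, `_`, and `-`, and the lowercased output are assumptions; the surrounding logic in briefing.py (including any other conditions on `p`) is not shown in this excerpt.

```python
import re

# Stopwords copied from the diff; the constant name is hypothetical.
BRANCH_STOPWORDS = frozenset({
    "feature", "fix", "chore", "update", "and",
    "dev", "bug", "refactor", "docs",
    "release", "hotfix", "main", "master",
})


def branch_keywords(branch: str) -> list[str]:
    """Split a branch name into keywords, dropping stopwords case-insensitively."""
    parts = re.split(r"[/_\-]+", branch)
    return [p.lower() for p in parts if p and p.lower() not in BRANCH_STOPWORDS]


keywords = branch_keywords("Feature/Multilingual-Stack-Support")
print(keywords)  # ['multilingual', 'stack', 'support']
```

Using a `frozenset` instead of a tuple also makes the membership test constant-time, although with this few stopwords the difference is negligible.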

Comment thread README.md
Comment on lines +284 to +289
### Supported Technology Tags (50+)

| Category | Tags |
|---|---|
| **Cloud & Infra** | `aws`, `aws-cdk`, `lambda`, `dynamodb`, `s3`, `sqs`, `sns`, `cognito`, `cloudwatch`, `api-gateway`, `eventbridge`, `cloudformation`, `step-functions`, `xray`, `websocket`, `docker`, `vpc` |
| **Languages** | `typescript`, `javascript`, `python`, `nodejs`, `java` |

Copilot AI Apr 10, 2026

The documentation says "Supported Technology Tags (50+)" here, while the PR description claims "60+" tags. Either adjust the wording in the README to match the actual tag count/pattern set, or update the PR description so the numbers are consistent.

Owner Author

@magicpro97 magicpro97 left a comment

V2 Review — All previous issues resolved ✅

Checklist vs PR #1 feedback:

Issue Status
🔴 Breaking compat (removed tags) Fixed — All 20 original tags preserved with exact same names
🔴 fpt hardcoded Fixed — Removed from stopwords
🔴 Generic tags (Lambda, CDK, JSON, token) Fixed: AWS Lambda|Lambda function, AWS CDK|CDK stack, JSON removed, SAML|OAuth2?|JWT|bearer token
⚠️ Regex .tsx? matching .tsv Fixed — Removed file extension patterns, uses full word TypeScript
⚠️ TOOL_INDICATORS missing gradle/maven Fixed — Original gradle|maven|spring boot preserved

Minor nits (non-blocking):

  1. (r"\b(?:wss?://)\b", "websocket"): \b is unreliable around ://, because the trailing \b only matches when a word character follows. The WebSocket keyword covers most of these cases, so this sub-pattern is largely redundant. Consider removing.
  2. README category "Database & ORM" includes spring-boot — consider renaming to "Java Ecosystem".

Verdict: LGTM 🚀

Clean, well-organized, backward-compatible. Ready to merge.

@magicpro97 magicpro97 merged commit c947c09 into main Apr 10, 2026
4 checks passed
@magicpro97 magicpro97 deleted the feature/multilingual-stack-support branch April 10, 2026 06:43
magicpro97 pushed a commit that referenced this pull request Apr 17, 2026
#2  Remove bypass hint from deny messages (hook was coaching agent how to bypass)
#3  Fail-closed on empty/malformed stdin (enforce-briefing denies, others safe-pass)
#4  learn.py detection requires python3 execution, not just substring match
#5  enforce-tentacle now gates bash file writes, not just edit/create
#6  verify-integrity checks hooks.json hash (was stored but never verified)
#8  git commit detection uses regex to handle interleaved flags

Remaining: #1 marker spoofing (needs architecture change), #7 warn-only integrity,
#9 relative paths, #10 double-counting, #11 non-git dirs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 pushed a commit that referenced this pull request Apr 17, 2026
Same class of vulnerability as #2 — hook was literally telling the agent
'create ~/.copilot/markers/tentacle-bypass to skip'. Agent in session
followed this instruction and bypassed tentacle enforcement entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 pushed a commit that referenced this pull request Apr 17, 2026
CRITICAL fixes:
- #1: Remove CLI 'sign' command from marker_auth.py (agents could
  forge markers via 'python3 marker_auth.py sign tentacle-done')
- #2: tentacle-suggest.py now uses sign_list_marker/verify_list_marker
  instead of raw file I/O (was corrupting HMAC-signed tentacle-edits)

HIGH fixes:
- #3: Add missing WARN constant in install.py (--lock-hooks crashed
  with NameError on config.json sanitization and non-root runs)
- #4: Expand is_secret_access() to block marker_auth.py execution,
  glob wildcards (.marker-*), and .copilot/hooks/. access patterns

MEDIUM fixes:
- #5: Add isinstance(tool_args, dict) to enforce-tentacle.py
  (same crash as enforce-learn.py when toolArgs is a string)
- #6: Fix tentacle-suggest.py redirect regex to match relative paths
  (was only matching absolute paths starting with /)
- #7: Add stale marker cleanup at sessionStart in auto-briefing.py
  (crash recovery — old markers from dead sessions no longer persist)

All 74 tests pass (9 security + 65 fixes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 pushed a commit that referenced this pull request Apr 20, 2026
#2  Remove bypass hint from deny messages (hook was coaching agent how to bypass)
#3  Fail-closed on empty/malformed stdin (enforce-briefing denies, others safe-pass)
#4  learn.py detection requires python3 execution, not just substring match
#5  enforce-tentacle now gates bash file writes, not just edit/create
#6  verify-integrity checks hooks.json hash (was stored but never verified)
#8  git commit detection uses regex to handle interleaved flags

Remaining: #1 marker spoofing (needs architecture change), #7 warn-only integrity,
#9 relative paths, #10 double-counting, #11 non-git dirs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 pushed a commit that referenced this pull request Apr 20, 2026
Same class of vulnerability as #2 — hook was literally telling the agent
'create ~/.copilot/markers/tentacle-bypass to skip'. Agent in session
followed this instruction and bypassed tentacle enforcement entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 pushed a commit that referenced this pull request Apr 20, 2026
CRITICAL fixes:
- #1: Remove CLI 'sign' command from marker_auth.py (agents could
  forge markers via 'python3 marker_auth.py sign tentacle-done')
- #2: tentacle-suggest.py now uses sign_list_marker/verify_list_marker
  instead of raw file I/O (was corrupting HMAC-signed tentacle-edits)

HIGH fixes:
- #3: Add missing WARN constant in install.py (--lock-hooks crashed
  with NameError on config.json sanitization and non-root runs)
- #4: Expand is_secret_access() to block marker_auth.py execution,
  glob wildcards (.marker-*), and .copilot/hooks/. access patterns

MEDIUM fixes:
- #5: Add isinstance(tool_args, dict) to enforce-tentacle.py
  (same crash as enforce-learn.py when toolArgs is a string)
- #6: Fix tentacle-suggest.py redirect regex to match relative paths
  (was only matching absolute paths starting with /)
- #7: Add stale marker cleanup at sessionStart in auto-briefing.py
  (crash recovery — old markers from dead sessions no longer persist)

All 74 tests pass (9 security + 65 fixes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 added a commit that referenced this pull request May 22, 2026
- Fix #1: integration_auth_bearer_valid_passes now uses /api/noroute to exercise Bearer auth middleware (asserts 404 post-auth, not open /healthz)
- Fix #2: start() uses tuple bind (host, port) for IPv6-safe TcpListener binding
- Fix #3: TraceLayer customized to record only URI path, not query string (avoids ?token= leaks)
- Fix #4: resolve_file maps PathSafetyError::Empty to 404 (not 403) for root /
- Fix #5: serve_static rejects non-GET/HEAD with 405 Method Not Allowed
- Fix #6: symlink_metadata and canonicalize moved into spawn_blocking to avoid async runtime starvation
- Fix #7: rand dep aligned to 0.9 (removes duplicate rand 0.8); generate_nonce uses rand::rng() (0.9 API); add tracing dep for TraceLayer span customization
- Fix #8: CORS preflight Allow-Headers includes Last-Event-ID and X-Resume-Token for Python parity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 added a commit that referenced this pull request May 22, 2026
- live.rs: honor Last-Event-ID header so reconnecting clients resume
  from the provided event id (positive integer); invalid/missing falls
  back to latest MAX(id) snapshot without panicking (#1)
- live.rs: move initial latest_entry_id() call into spawn_blocking to
  avoid blocking the Tokio executor with synchronous SQLite/r2d2 (#2)
- db.rs: replace .ok().flatten() on created_at with explicit has_ca
  branch — propagates real row conversion errors via ? when the column
  exists; skips r.get() entirely for NULL projection; collect errors
  instead of silently dropping rows via filter_map (#3)
- tests: replace fixed-size header read with read_http_headers() helper
  that loops until \\r\\n\\r\\n, preventing packet-splitting flakes (#4)
- tests: replace fixed sleep + single read with read_until_sse_contains()
  helper that accumulates chunks until expected strings appear or a
  deadline passes, avoiding timing/chunk fragility (#5)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>