Skip to content

feat: Add multilingual support (EN/VI/JP) and multi-stack technology tags#1

Closed
linhnt102-fpt wants to merge 1 commit into
magicpro97:mainfrom
linhnt102-fpt:feature/multilingual-stack-support
Closed

feat: Add multilingual support (EN/VI/JP) and multi-stack technology tags#1
linhnt102-fpt wants to merge 1 commit into
magicpro97:mainfrom
linhnt102-fpt:feature/multilingual-stack-support

Conversation

@linhnt102-fpt
Copy link
Copy Markdown

feat: Add Multilingual Support (EN/VI/JP) and Multi-Stack Technology Tags

Overview

This PR extends copilot-session-knowledge beyond its original Java/Spring focus to support multilingual knowledge extraction (English, Vietnamese, Japanese) and multi-stack technology tagging (AWS, TypeScript, React Native, and more).

Originally, the knowledge extraction was optimized for Java/Spring Boot projects. This update adds comprehensive support for cloud-native and mobile development stacks while maintaining full backward compatibility with existing Java patterns.

Changes

extract-knowledge.py

Area Before After
Tag patterns 20 Java-centric patterns 55+ patterns covering AWS, TypeScript, React Native, testing, auth, TLS, etc.
Mistake indicators English only + Vietnamese + Japanese regex
Pattern indicators English only + Vietnamese + Japanese regex
Decision indicators English only + Vietnamese + Japanese regex
Tool indicators Java tools (gradle, maven, spring) Modern stack (yarn, npm, cdk, playwright, jest, eslint, prettier) + Japanese
Noise patterns English only + Japanese interview/scoring patterns

Backward compatibility: All original Java/Spring/PostgreSQL tags are preserved under a "Legacy" section.

briefing.py

  • Added branch name stopwords for dev/fpt/ convention and conventional commit prefixes (feat, chore, docs, style, test, build)
  • Better keyword extraction from branch names like dev/fpt/feature/5022-copy-to-group

README.md

  • Added "Multilingual & Multi-Stack Support" section documenting:
    • Trilingual indicator support (EN/VI/JP) with examples
    • 50+ technology tags organized by category
    • Branch name parsing examples with stopword filtering
  • Added two new features to the feature comparison table

Motivation

We use this tool in a bilingual (Vietnamese/Japanese) team working on an AWS CDK + Expo/React Native project. The original Java-focused patterns missed most of our knowledge entries. This PR makes the tool useful for a broader range of technology stacks and multilingual teams.

Testing

  • Manual verification: ran extract-knowledge.py against session checkpoints containing Japanese/Vietnamese content
  • Tag extraction correctly identifies AWS, TypeScript, React Native patterns
  • Japanese indicators correctly classify mistakes, patterns, decisions, and tools
  • Legacy Java patterns still match Java-related content

…tags

Changes:
- extract-knowledge.py: Add Japanese regex indicators for all 4 categories
  (mistakes, patterns, decisions, tools), expand tag_patterns from 20 to 55+
  covering AWS, TypeScript, React Native, Expo, testing tools, and more.
  Add Japanese noise filters. Keep legacy Java/Spring tags for backward compat.
- briefing.py: Add branch name stopwords for dev/fpt/ convention and
  conventional commit prefixes (feat, chore, docs, style, test, build).
- README.md: Add Multilingual & Multi-Stack Support section documenting
  trilingual indicators, 50+ technology tags, and branch parsing examples.
  Add two new features to the feature table.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Owner

@magicpro97 magicpro97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall

Great contribution! Multilingual support (EN/VI/JP) and multi-stack tags are very welcome. The Japanese indicators and noise filters are well-researched. A few issues to address before merging:

🔴 Must Fix

  1. Breaking backward compatibility — PR description says "all original tags preserved" but several are removed: Thymeleaf, CSRF, Liquibase, JDK/Java tags are gone, and gradlejava-build, excelspreadsheet are renamed. Existing knowledge entries with old tags will become orphaned.
  2. fpt hardcoded in stopwords — This is an organization-specific name. A public tool shouldn't have company names hardcoded. Make this configurable or remove it.
  3. Overly generic tag patternsLambda, CDK, JSON, token, Copilot will cause false positives in almost every session.

🟡 Should Fix

  1. \.tsx? and \.py regex in tag patterns can match unintended strings (e.g., .tsv, .pyc). Need end-of-word boundary.
  2. TOOL_INDICATORS removed gradle|maven|spring\s+boot — these are real tool indicators that Java users need.

🟢 Nice to Have

  1. Consider adding Korean (한국어) indicators as a future TODO since the pattern is established.

See inline comments for specific locations.

Comment thread extract-knowledge.py
@@ -141,6 +156,9 @@ def classify_paragraph(text: str) -> list[tuple[str, float]]:
r"câu\s*hỏi\s*phỏng\s*vấn",
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ False positive risk: Lambda is too generic — matches Java lambdas, Python lambda expressions, math lambda calculus discussions, etc.

Suggest: r\"\b(?:AWS\s+Lambda|Lambda\s+function|lambda\s+handler)\b\" or require AWS context.

Same concern for CDK (Chrome DevTools Kit?) and JSON (matches literally everything).

Comment thread extract-knowledge.py
(r"\b(?:Step\s+Functions?)\b", "step-functions"),
(r"\b(?:X-Ray|XRay)\b", "xray"),
(r"\b(?:WebSocket|wss?://)\b", "websocket"),
(r"\b(?:Docker|docker-compose)\b", "docker"),
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ token is extremely generic. CSRF tokens, embedding tokens, API tokens, OAuth tokens — this would tag almost everything as auth.

Suggest: restrict to r\"\b(?:SAML|OAuth2?|JWT|bearer\s+token)\b\"

Comment thread extract-knowledge.py
(r"\b(?:JPQL|JPA|Hibernate)\b", "jpa"),
(r"\b(?:PostgreSQL|Postgres|PG)\b", "postgresql"),
# Cloud & Infrastructure
(r"\b(?:AWS|Amazon\s+Web\s+Services)\b", "aws"),
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ \.tsx? without end anchor can match .tsv, .tsc, etc. Same issue with \.jsx? and \.py.

Suggest: r\"\b(?:TypeScript)\b|\.tsx?\b\" or use (?:\.tsx?)(?:\s|$|[\"'])

Comment thread extract-knowledge.py
r"(?:インストール|設定|バージョン|依存関係|ツール|環境構築)",
]


Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Breaking change: TOOL_INDICATORS removed gradle|maven|spring\s+boot. Java/Spring users rely on these to classify tool-related entries. Please keep the original patterns and ADD new ones.

python r\"\(?:gradle|maven|docker|redis|postgres|spring\s+boot)\b\", # keep original r\"\(?:yarn|npm|cdk|playwright|jest|eslint|prettier)\b\", # add new
"@
},
@{
path = "extract-knowledge.py"
position = 62
body = @"
🔴 Backward compat broken: Original tags Thymeleaf, CSRF, Liquibase, JDK/Java are removed entirely (not in Legacy section). gradle renamed to java-build, excel renamed to spreadsheet.

Existing knowledge entries tagged with gradle, java, excel etc. will no longer match.

Fix: Keep ALL original tag patterns. Add new ones alongside, don't replace.

Comment thread briefing.py
parts = branch.replace("/", "-").replace("_", "-").split("-")
keywords.update(p for p in parts if len(p) > 2
and p not in ("feature", "fix", "chore", "update", "and"))
and p not in ("feature", "fix", "chore", "update", "and",
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Org-specific: \"fpt\" is a company name (FPT Software). This is a public open-source tool — shouldn't hardcode organization names.

Options:

  1. Remove fpt from this list
  2. Make stopwords configurable via a config file
  3. At minimum, add a comment explaining why it's here

@linhnt102-fpt
Copy link
Copy Markdown
Author

Closing to recreate from owner account with review fixes applied. Thank you for the detailed review!

magicpro97 pushed a commit that referenced this pull request Apr 17, 2026
#2  Remove bypass hint from deny messages (hook was coaching agent how to bypass)
#3  Fail-closed on empty/malformed stdin (enforce-briefing denies, others safe-pass)
#4  learn.py detection requires python3 execution, not just substring match
#5  enforce-tentacle now gates bash file writes, not just edit/create
#6  verify-integrity checks hooks.json hash (was stored but never verified)
#8  git commit detection uses regex to handle interleaved flags

Remaining: #1 marker spoofing (needs architecture change), #7 warn-only integrity,
#9 relative paths, #10 double-counting, #11 non-git dirs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 pushed a commit that referenced this pull request Apr 17, 2026
CRITICAL fixes:
- #1: Remove CLI 'sign' command from marker_auth.py (agents could
  forge markers via 'python3 marker_auth.py sign tentacle-done')
- #2: tentacle-suggest.py now uses sign_list_marker/verify_list_marker
  instead of raw file I/O (was corrupting HMAC-signed tentacle-edits)

HIGH fixes:
- #3: Add missing WARN constant in install.py (--lock-hooks crashed
  with NameError on config.json sanitization and non-root runs)
- #4: Expand is_secret_access() to block marker_auth.py execution,
  glob wildcards (.marker-*), and .copilot/hooks/. access patterns

MEDIUM fixes:
- #5: Add isinstance(tool_args, dict) to enforce-tentacle.py
  (same crash as enforce-learn.py when toolArgs is a string)
- #6: Fix tentacle-suggest.py redirect regex to match relative paths
  (was only matching absolute paths starting with /)
- #7: Add stale marker cleanup at sessionStart in auto-briefing.py
  (crash recovery — old markers from dead sessions no longer persist)

All 74 tests pass (9 security + 65 fixes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 pushed a commit that referenced this pull request Apr 20, 2026
#2  Remove bypass hint from deny messages (hook was coaching agent how to bypass)
#3  Fail-closed on empty/malformed stdin (enforce-briefing denies, others safe-pass)
#4  learn.py detection requires python3 execution, not just substring match
#5  enforce-tentacle now gates bash file writes, not just edit/create
#6  verify-integrity checks hooks.json hash (was stored but never verified)
#8  git commit detection uses regex to handle interleaved flags

Remaining: #1 marker spoofing (needs architecture change), #7 warn-only integrity,
#9 relative paths, #10 double-counting, #11 non-git dirs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 pushed a commit that referenced this pull request Apr 20, 2026
CRITICAL fixes:
- #1: Remove CLI 'sign' command from marker_auth.py (agents could
  forge markers via 'python3 marker_auth.py sign tentacle-done')
- #2: tentacle-suggest.py now uses sign_list_marker/verify_list_marker
  instead of raw file I/O (was corrupting HMAC-signed tentacle-edits)

HIGH fixes:
- #3: Add missing WARN constant in install.py (--lock-hooks crashed
  with NameError on config.json sanitization and non-root runs)
- #4: Expand is_secret_access() to block marker_auth.py execution,
  glob wildcards (.marker-*), and .copilot/hooks/. access patterns

MEDIUM fixes:
- #5: Add isinstance(tool_args, dict) to enforce-tentacle.py
  (same crash as enforce-learn.py when toolArgs is a string)
- #6: Fix tentacle-suggest.py redirect regex to match relative paths
  (was only matching absolute paths starting with /)
- #7: Add stale marker cleanup at sessionStart in auto-briefing.py
  (crash recovery — old markers from dead sessions no longer persist)

All 74 tests pass (9 security + 65 fixes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 pushed a commit that referenced this pull request Apr 20, 2026
#2  Remove bypass hint from deny messages (hook was coaching agent how to bypass)
#3  Fail-closed on empty/malformed stdin (enforce-briefing denies, others safe-pass)
#4  learn.py detection requires python3 execution, not just substring match
#5  enforce-tentacle now gates bash file writes, not just edit/create
#6  verify-integrity checks hooks.json hash (was stored but never verified)
#8  git commit detection uses regex to handle interleaved flags

Remaining: #1 marker spoofing (needs architecture change), #7 warn-only integrity,
#9 relative paths, #10 double-counting, #11 non-git dirs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 pushed a commit that referenced this pull request Apr 20, 2026
CRITICAL fixes:
- #1: Remove CLI 'sign' command from marker_auth.py (agents could
  forge markers via 'python3 marker_auth.py sign tentacle-done')
- #2: tentacle-suggest.py now uses sign_list_marker/verify_list_marker
  instead of raw file I/O (was corrupting HMAC-signed tentacle-edits)

HIGH fixes:
- #3: Add missing WARN constant in install.py (--lock-hooks crashed
  with NameError on config.json sanitization and non-root runs)
- #4: Expand is_secret_access() to block marker_auth.py execution,
  glob wildcards (.marker-*), and .copilot/hooks/. access patterns

MEDIUM fixes:
- #5: Add isinstance(tool_args, dict) to enforce-tentacle.py
  (same crash as enforce-learn.py when toolArgs is a string)
- #6: Fix tentacle-suggest.py redirect regex to match relative paths
  (was only matching absolute paths starting with /)
- #7: Add stale marker cleanup at sessionStart in auto-briefing.py
  (crash recovery — old markers from dead sessions no longer persist)

All 74 tests pass (9 security + 65 fixes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 added a commit that referenced this pull request May 22, 2026
- Fix #1: integration_auth_bearer_valid_passes now uses /api/noroute to exercise Bearer auth middleware (asserts 404 post-auth, not open /healthz)
- Fix #2: start() uses tuple bind (host, port) for IPv6-safe TcpListener binding
- Fix #3: TraceLayer customized to record only URI path, not query string (avoids ?token= leaks)
- Fix #4: resolve_file maps PathSafetyError::Empty to 404 (not 403) for root /
- Fix #5: serve_static rejects non-GET/HEAD with 405 Method Not Allowed
- Fix #6: symlink_metadata and canonicalize moved into spawn_blocking to avoid async runtime starvation
- Fix #7: rand dep aligned to 0.9 (removes duplicate rand 0.8); generate_nonce uses rand::rng() (0.9 API); add tracing dep for TraceLayer span customization
- Fix #8: CORS preflight Allow-Headers includes Last-Event-ID and X-Resume-Token for Python parity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
magicpro97 added a commit that referenced this pull request May 22, 2026
- live.rs: honor Last-Event-ID header so reconnecting clients resume
  from the provided event id (positive integer); invalid/missing falls
  back to latest MAX(id) snapshot without panicking (#1)
- live.rs: move initial latest_entry_id() call into spawn_blocking to
  avoid blocking the Tokio executor with synchronous SQLite/r2d2 (#2)
- db.rs: replace .ok().flatten() on created_at with explicit has_ca
  branch — propagates real row conversion errors via ? when the column
  exists; skips r.get() entirely for NULL projection; collect errors
  instead of silently dropping rows via filter_map (#3)
- tests: replace fixed-size header read with read_http_headers() helper
  that loops until \\r\\n\\r\\n, preventing packet-splitting flakes (#4)
- tests: replace fixed sleep + single read with read_until_sse_contains()
  helper that accumulates chunks until expected strings appear or a
  deadline passes, avoiding timing/chunk fragility (#5)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants