-
Notifications
You must be signed in to change notification settings - Fork 224
fix(tokenize): correct capture group reference in website regex #1004
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(tokenize): correct capture group reference in website regex #1004
Conversation
The websites regex /[.](com|net|org|io|gov|edu|me)/g has only one capture group for the TLD, but was incorrectly referencing $2. This caused URLs to be corrupted with literal '$2' text. Changed $2 to $1 to correctly reference the TLD capture group.
🦋 Changeset detectedLatest commit: 9209c5e The changes in this PR will be included in the next version bump. This PR includes changesets to release 18 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
Caution Review failedThe pull request is closed. 📝 WalkthroughWalkthroughThe change corrects a regex capture-group reference in website tokenization: the replacement now uses the first capture group ($1) instead of the second ($2), adjusting how domain abbreviations (e.g., com, net) are inserted after a Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. 📜 Recent review detailsConfiguration used: Organization UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
✏️ Tip: You can disable this entire section by setting Comment |
Fixes the capture group reference in the regex used for tokenization on the website.
lukasIO
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great catch!
Thank you for the fix and the detailed description
fix(tokenize): correct capture group reference in website regex
The websites regex
/[.](com|net|org|io|gov|edu|me)/ghas only one capture group for the TLD, but was incorrectly referencing$2. This caused URLs to be corrupted with literal'$2'text.Changed
$2to$1to correctly reference the TLD capture group.Description
The sentence tokenizer in
agents/src/tokenize/basic/sentence.tscontains a bug where thewebsitesregex replacement uses an incorrect capture group reference.The Bug:
When JavaScript encounters a non-existent capture group reference like
$2, it treats it as a literal string instead of a regex backreference. This corrupts any text containing domain extensions.Example of Current Behavior:
"Visit example.com for more information""Visit example<prd>$2 for more information"❌"Visit example<prd>com for more information"✅This bug affects text-to-speech pronunciation when using the agents framework with services like OpenAI Realtime API, as the literal
$2string is spoken instead of the proper domain extension.Changes Made
agents/src/tokenize/basic/sentence.tstext.replaceAll(websites, '<prd>$2');totext.replaceAll(websites, '<prd>$1');This is a single character change (
2→1) that fixes the regex backreference to correctly use the first (and only) capture group.Pre-Review Checklist
@livekit/plugins-ai-coustics@0.1.7not found in npm registry). CI will validate the build.Testing
$2text, causing incorrect TTS pronunciation$2to$1and URLs now process correctlyHow to Reproduce the Bug (Before Fix):
Expected Behavior (After Fix):
Additional Notes
Root Cause
The regex pattern
[.](com|net|org|io|gov|edu|me)creates only one capture group containing the TLD extension. Regex capture groups are numbered starting from 1:$1= the matched TLD (com, net, org, etc.)$2= does not existUsing
$2in the replacement string causes JavaScript to insert the literal string"$2"instead of the captured group value.Impact
This bug affects any agent that mentions URLs in its speech output, particularly when using:
The corrupted text impacts user experience as the TTS engine attempts to pronounce
"dollar-two"instead of the proper domain extension.Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.