
Conversation

@IlyaShelestov (Contributor) commented Jan 29, 2026

fix(tokenize): correct capture group reference in website regex

The websites regex /[.](com|net|org|io|gov|edu|me)/g has only one capture group for the TLD, but the replacement string incorrectly referenced $2. This caused URLs to be corrupted with a literal '$2' in the text.

Changed $2 to $1 to correctly reference the TLD capture group.

Description

The sentence tokenizer in agents/src/tokenize/basic/sentence.ts contains a bug where the websites regex replacement uses an incorrect capture group reference.

The Bug:

const websites = /[.](com|net|org|io|gov|edu|me)/g;  // Has 1 capture group: (com|net|org|...)
text = text.replaceAll(websites, '<prd>$2');  // ❌ References $2 which doesn't exist

When JavaScript encounters a reference to a non-existent capture group, such as $2 here, it inserts the reference as a literal string rather than substituting the captured value. This corrupts any text containing these domain extensions.

Example of Current Behavior:

  • Input: "Visit example.com for more information"
  • Current Output: "Visit example<prd>$2 for more information"
  • Expected Output: "Visit example<prd>com for more information"

This bug affects text-to-speech pronunciation when using the agents framework with services like OpenAI Realtime API, as the literal $2 string is spoken instead of the proper domain extension.
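
For context, the tokenizer uses <prd> as a temporary placeholder for periods that should not end a sentence. The sketch below illustrates the technique in simplified form; the splitSentences helper is hypothetical, and the real sentence.ts handles many more abbreviation patterns:

const websites = /[.](com|net|org|io|gov|edu|me)/g;

// Simplified sketch of the placeholder technique:
// 1. protect non-terminal periods with <prd>
// 2. split on sentence-ending punctuation
// 3. restore the protected periods
function splitSentences(text: string): string[] {
  const protectedText = text.replaceAll(websites, '<prd>$1');
  const parts = protectedText.split(/(?<=[.!?])\s+/);
  return parts.map((part) => part.replaceAll('<prd>', '.'));
}

console.log(splitSentences('Visit example.com today. It works.'));
// ["Visit example.com today.", "It works."]

With the buggy '$2' replacement, the restored first sentence would read "Visit example.$2 today.", which is exactly the corrupted text the TTS engine receives.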

Changes Made

  • File: agents/src/tokenize/basic/sentence.ts
  • Line 30: Changed text.replaceAll(websites, '<prd>$2'); to text.replaceAll(websites, '<prd>$1');

This is a single-character change ($2 → $1) that fixes the regex backreference to correctly use the first (and only) capture group.

Pre-Review Checklist

  • Build passes: Unable to run locally due to missing dependency (@livekit/plugins-ai-coustics@0.1.7 not found in npm registry). CI will validate the build.
  • AI-generated code reviewed: Not applicable (manual fix)
  • Changes explained: Change is fully documented above
  • Scope appropriate: Single-line fix directly addresses the PR title
  • Video demo: Not applicable for this tokenization bug fix

Testing

  • Manual testing: This bug was discovered in a production environment while using the agents framework with OpenAI Realtime API
  • Impact verified: URLs were being corrupted with literal $2 text, causing incorrect TTS pronunciation
  • Fix verified: Changed $2 to $1 and URLs now process correctly

How to Reproduce the Bug (Before Fix):

const websites = /[.](com|net|org|io|gov|edu|me)/g;
const text = "Visit example.com today";
const result = text.replaceAll(websites, '<prd>$2');
console.log(result); // "Visit example<prd>$2 today" ❌

Expected Behavior (After Fix):

const websites = /[.](com|net|org|io|gov|edu|me)/g;
const text = "Visit example.com today";
const result = text.replaceAll(websites, '<prd>$1');
console.log(result); // "Visit example<prd>com today" ✅
  • Automated tests added/updated: Not applicable (no test file exists for this tokenizer; see the test sketch after this list)
  • All tests pass: Unable to run locally due to dependency issues
  • restaurant_agent.ts and realtime_agent.ts work properly: Not tested locally, but fix is isolated to tokenization logic
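
If a test file is added later, a minimal unit test could look like the sketch below. This is hypothetical: it assumes vitest as the test runner and exercises the regex replacement directly rather than the tokenizer's public API.

import { describe, expect, it } from 'vitest';

const websites = /[.](com|net|org|io|gov|edu|me)/g;

describe('website regex replacement', () => {
  it('protects the domain extension with <prd>', () => {
    const result = 'Visit example.com today'.replaceAll(websites, '<prd>$1');
    expect(result).toBe('Visit example<prd>com today');
  });

  it('leaves no literal $2 in the output', () => {
    const result = 'Visit example.com today'.replaceAll(websites, '<prd>$1');
    expect(result).not.toContain('$2');
  });
});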

Additional Notes

Root Cause

The regex pattern [.](com|net|org|io|gov|edu|me) creates only one capture group containing the TLD extension. Regex capture groups are numbered starting from 1:

  • $1 = the matched TLD (com, net, org, etc.)
  • $2 = does not exist

Using $2 in the replacement string causes JavaScript to insert the literal string "$2" instead of the captured group value.
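
As an aside (a suggestion, not part of this PR), a named capture group would make this class of bug harder to introduce, because the replacement refers to the group by name instead of by position:

// Hypothetical alternative using a named capture group:
const websites = /[.](?<tld>com|net|org|io|gov|edu|me)/g;
const result = 'Visit example.com today'.replaceAll(websites, '<prd>$<tld>');
console.log(result); // "Visit example<prd>com today"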

Impact

This bug affects any agent that mentions URLs in its speech output, particularly when using:

  • OpenAI Realtime API
  • ElevenLabs TTS
  • Any other TTS provider integrated with the framework

The corrupted text impacts user experience as the TTS engine attempts to pronounce "dollar-two" instead of the proper domain extension.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed website-domain tokenization used during sentence boundary detection so domain abbreviations (e.g., .com, .org) are handled correctly. This improves accuracy of sentence splitting and reduces incorrect breaks around web addresses, resulting in more reliable text parsing and downstream processing.


@CLAassistant commented Jan 29, 2026

CLA assistant check
All committers have signed the CLA.

@changeset-bot commented Jan 29, 2026

🦋 Changeset detected

Latest commit: 9209c5e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 18 packages
Name Type
@livekit/agents Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-baseten Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-hedra Patch
@livekit/agents-plugin-inworld Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-openai Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugins-test Patch
@livekit/agents-plugin-xai Patch


@coderabbitai bot commented Jan 29, 2026

Caution: Review failed. The pull request is closed.

📝 Walkthrough

The change corrects a regex capture-group reference in website tokenization: the replacement now uses the first capture group ($1) instead of the second ($2), adjusting how domain abbreviations (e.g., com, net) are inserted after a <prd> token during sentence tokenization.

Changes

  • Website Pattern Tokenization Fix (agents/src/tokenize/basic/sentence.ts): Updated replaceAll to use $1 (the first capture group) instead of $2 when substituting website pattern matches, fixing domain tokenization after dots.
  • Release Changeset (.changeset/neat-kangaroos-wonder.md): Added a changeset file describing the tokenization fix for a patch release.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐇 I nibbled the dot, then found the clue,
Swapped second for first—now domains fit true.
Tiny tweak, tidy hop, token paths align,
A rabbit's patch, neat and fine. ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

  • Title check: Passed. The title accurately and specifically describes the main change: a single-character fix to correct a regex capture group reference from $2 to $1 in the website tokenization pattern.
  • Description check: Passed. The description is comprehensive and well-structured, including a clear bug explanation with code examples, before/after behavior, root cause analysis, and impact assessment. All major template sections are addressed appropriately.
  • Docstring Coverage: Passed. No functions were found in the changed files, so the docstring coverage check was skipped.



📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0287f2 and 9209c5e.

📒 Files selected for processing (1)
  • .changeset/neat-kangaroos-wonder.md




Fixes the capture group reference in the website regex used for tokenization.
@lukasIO (Contributor) left a comment

great catch!

Thank you for the fix and the detailed description

@lukasIO merged commit db4e259 into livekit:main Jan 29, 2026
1 of 2 checks passed
@github-actions bot mentioned this pull request Jan 28, 2026
@lukasIO mentioned this pull request Jan 29, 2026
