fix(tokenize): correct capture group reference in website regex #1004

IlyaShelestov · 2026-01-29T13:36:32Z

fix(tokenize): correct capture group reference in website regex

The websites regex /[.](com|net|org|io|gov|edu|me)/g has only one capture group (the TLD), but the replacement string referenced $2. As a result, the domain extension in URLs was replaced with the literal text '$2'.

Changed $2 to $1 to correctly reference the TLD capture group.

Description

The sentence tokenizer in agents/src/tokenize/basic/sentence.ts contains a bug where the websites regex replacement uses an incorrect capture group reference.

The Bug:

const websites = /[.](com|net|org|io|gov|edu|me)/g;  // Has 1 capture group: (com|net|org|...)
text = text.replaceAll(websites, '<prd>$2');  // ❌ References $2 which doesn't exist

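When a replacement string references a capture group that the pattern does not define, such as $2 here, JavaScript substitutes nothing and leaves the characters $2 in the output as literal text. As a result, every dot-plus-extension match (for example .com) is replaced with <prd>$2, and the extension itself is lost.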
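To make the replacement-string behavior concrete, here is a small sketch that runs the same pattern with a few different replacement patterns. The variable names are illustrative only and do not come from sentence.ts. It uses replace() rather than replaceAll(); with a regex that carries the g flag, the two substitute identically.

```typescript
// Same pattern as in the PR: exactly one capture group, holding the TLD.
const websites = /[.](com|net|org|io|gov|edu|me)/g;
const sample = "Visit example.com today";

// $1 refers to the only capture group, so the TLD is preserved.
const withGroup1 = sample.replace(websites, "<prd>$1"); // "Visit example<prd>com today"

// $2 refers to a group that does not exist, so it is left as literal text.
const withGroup2 = sample.replace(websites, "<prd>$2"); // "Visit example<prd>$2 today"

// $& inserts the whole match, including the dot.
const wholeMatch = sample.replace(websites, "<prd>$&"); // "Visit example<prd>.com today"

// $$ is an escaped dollar sign, so "$$1" yields the literal text "$1".
const escapedDollar = sample.replace(websites, "<prd>$$1"); // "Visit example<prd>$1 today"
```

Only the first form produces the placeholder text that the tokenizer expects.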

Example of Current Behavior:

Input: "Visit example.com for more information"
Current Output: "Visit example<prd>$2 for more information" ❌
Expected Output: "Visit example<prd>com for more information" ✅

This bug affects text-to-speech pronunciation when using the agents framework with services like OpenAI Realtime API, as the literal $2 string is spoken instead of the proper domain extension.

Changes Made

File: agents/src/tokenize/basic/sentence.ts
Line 30: Changed text.replaceAll(websites, '<prd>$2'); to text.replaceAll(websites, '<prd>$1');

This is a single-character change (2 → 1) that makes the replacement string reference the first (and only) capture group.

Pre-Review Checklist

Build passes: Unable to run locally due to missing dependency (@livekit/plugins-ai-coustics@0.1.7 not found in npm registry). CI will validate the build.
AI-generated code reviewed: Not applicable (manual fix)
Changes explained: Change is fully documented above
Scope appropriate: Single-line fix directly addresses the PR title
Video demo: Not applicable for this tokenization bug fix

Testing

Manual testing: This bug was discovered in a production environment while using the agents framework with OpenAI Realtime API
Impact verified: URLs were being corrupted with literal $2 text, causing incorrect TTS pronunciation
Fix verified: Changed $2 to $1 and URLs now process correctly

How to Reproduce the Bug (Before Fix):

const websites = /[.](com|net|org|io|gov|edu|me)/g;
const text = "Visit example.com today";
const result = text.replaceAll(websites, '<prd>$2');
console.log(result); // "Visit example<prd>$2 today" ❌

Expected Behavior (After Fix):

const websites = /[.](com|net|org|io|gov|edu|me)/g;
const text = "Visit example.com today";
const result = text.replaceAll(websites, '<prd>$1');
console.log(result); // "Visit example<prd>com today" ✅

Automated tests added/updated: Not applicable (no test file exists for this tokenizer)
All tests pass: Unable to run locally due to dependency issues
restaurant_agent.ts and realtime_agent.ts work properly: Not tested locally, but fix is isolated to tokenization logic
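Since no automated test covers this path, a regression check could look roughly like the sketch below. It is an assumption-laden illustration: protectWebsites is a hypothetical helper that mirrors the fixed line, not an export of sentence.ts, and the cases are not taken from the repository's test suite.

```typescript
// Hypothetical helper mirroring the fixed line; not the real sentence.ts API.
function protectWebsites(input: string): string {
  const websites = /[.](com|net|org|io|gov|edu|me)/g;
  return input.replace(websites, "<prd>$1");
}

// Each case pairs an input with the placeholder form expected after the fix.
const urlCases: Array<[string, string]> = [
  ["Visit example.com today", "Visit example<prd>com today"],
  ["Docs are on example.net and example.org", "Docs are on example<prd>net and example<prd>org"],
  ["No domains here", "No domains here"],
];

// Returns a description of every failing case; an empty array means all passed.
function runUrlCases(): string[] {
  const failures: string[] = [];
  urlCases.forEach(([input, expected]) => {
    const actual = protectWebsites(input);
    if (actual !== expected) failures.push(input + " -> " + actual);
    // A stray '$' would indicate an unresolved group reference such as "$2".
    if (actual.indexOf("$") !== -1) failures.push("stray '$' in: " + actual);
  });
  return failures;
}
```

Calling runUrlCases() and asserting that it returns an empty array would have caught the original $2 reference.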

Additional Notes

Root Cause

The regex pattern [.](com|net|org|io|gov|edu|me) creates only one capture group containing the TLD extension. Regex capture groups are numbered starting from 1:

$1 = the matched TLD (com, net, org, etc.)
$2 = does not exist

Using $2 in the replacement string causes JavaScript to insert the literal string "$2" instead of the captured group value.
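One way to make the referenced group more visible is a replacer callback instead of a numbered reference in a string. This is only a sketch of an alternative, not what the PR does; websitePattern and protectWithCallback are hypothetical names.

```typescript
const websitePattern = /[.](com|net|org|io|gov|edu|me)/g;

// The callback receives the whole match first, then each capture group in order.
// Here the single group (the TLD) arrives as the second argument.
function protectWithCallback(input: string): string {
  return input.replace(websitePattern, (_match: string, tld: string) => "<prd>" + tld);
}
```

This does not by itself prevent choosing the wrong parameter, but the group is referenced by a named variable rather than by a digit inside a string, which tends to make a mismatch easier to spot in review.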

Impact

This bug affects any agent that mentions URLs in its speech output, particularly when using:

OpenAI Realtime API
ElevenLabs TTS
Any other TTS provider integrated with the framework

The corrupted text impacts user experience as the TTS engine attempts to pronounce "dollar-two" instead of the proper domain extension.

Summary by CodeRabbit

Bug Fixes
- Fixed website-domain tokenization used during sentence boundary detection so domain abbreviations (e.g., .com, .org) are handled correctly. This improves accuracy of sentence splitting and reduces incorrect breaks around web addresses, resulting in more reliable text parsing and downstream processing.


CLAassistant · 2026-01-29T13:36:39Z

All committers have signed the CLA.

changeset-bot · 2026-01-29T13:36:43Z

🦋 Changeset detected

Latest commit: 9209c5e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 18 packages

Name	Type
@livekit/agents	Patch
@livekit/agents-plugin-anam	Patch
@livekit/agents-plugin-baseten	Patch
@livekit/agents-plugin-bey	Patch
@livekit/agents-plugin-cartesia	Patch
@livekit/agents-plugin-deepgram	Patch
@livekit/agents-plugin-elevenlabs	Patch
@livekit/agents-plugin-google	Patch
@livekit/agents-plugin-hedra	Patch
@livekit/agents-plugin-inworld	Patch
@livekit/agents-plugin-livekit	Patch
@livekit/agents-plugin-neuphonic	Patch
@livekit/agents-plugin-openai	Patch
@livekit/agents-plugin-resemble	Patch
@livekit/agents-plugin-rime	Patch
@livekit/agents-plugin-silero	Patch
@livekit/agents-plugins-test	Patch
@livekit/agents-plugin-xai	Patch


coderabbitai · 2026-01-29T13:36:51Z

Caution

Review failed

The pull request is closed.

📝 Walkthrough


The change corrects a regex capture-group reference in website tokenization: the replacement now uses the first capture group ($1) instead of the second ($2), adjusting how domain abbreviations (e.g., com, net) are inserted after a <prd> token during sentence tokenization.
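To show why the <prd> token matters for sentence splitting, here is a deliberately simplified sketch of the placeholder technique. It is not the code in sentence.ts: the function name, the <stop> marker, and the splitting rule are assumptions made for illustration.

```typescript
// Simplified sketch of placeholder-based sentence splitting; NOT the real sentence.ts.
function splitSentencesSketch(input: string): string[] {
  const websites = /[.](com|net|org|io|gov|edu|me)/g;
  // 1. Protect dots that belong to domains so they are not treated as sentence ends.
  let work = input.replace(websites, "<prd>$1");
  // 2. Mark the remaining sentence-ending punctuation with a hypothetical <stop> marker.
  work = work.replace(/([.!?])/g, "$1<stop>");
  // 3. Restore the protected dots.
  work = work.replace(/<prd>/g, ".");
  // 4. Split on the marker, trim whitespace, and drop empty pieces.
  return work
    .split("<stop>")
    .map((piece) => piece.trim())
    .filter((piece) => piece.length > 0);
}
```

With "$1" in step 1, the input "Visit example.com today. It is free!" splits into "Visit example.com today." and "It is free!". With "$2" there, the first sentence would instead come back as "Visit example.$2 today.".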

Changes

Cohort / File(s)	Summary
Website Pattern Tokenization Fix `agents/src/tokenize/basic/sentence.ts`	Updated `replaceAll` to use `$1` (first capture group) instead of `$2` when substituting website pattern matches, fixing domain tokenization after dots.
Release Changeset `.changeset/neat-kangaroos-wonder.md`	Added a changeset file describing the tokenization fix for a patch release.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐇 I nibbled the dot, then found the clue,
Swapped second for first—now domains fit true.
Tiny tweak, tidy hop, token paths align,
A rabbit's patch, neat and fine. ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and specifically describes the main change: a single-character fix to correct a regex capture group reference from $2 to $1 in the website tokenization pattern.
Description check	✅ Passed	The description is comprehensive and well-structured, including clear bug explanation with code examples, before/after behavior, root cause analysis, and impact assessment. All major template sections are addressed appropriately.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0287f2 and 9209c5e.

📒 Files selected for processing (1)

.changeset/neat-kangaroos-wonder.md


Fixes the capture group reference in the website regex used by the sentence tokenizer.

lukasIO

great catch!

Thank you for the fix and the detailed description

Fix capture group reference in website regex

9209c5e


lukasIO approved these changes Jan 29, 2026


lukasIO merged commit db4e259 into livekit:main Jan 29, 2026
1 of 2 checks passed

github-actions bot mentioned this pull request Jan 28, 2026

Version Packages #1001

Merged

lukasIO mentioned this pull request Jan 29, 2026

Add URLs to tokenizer tests #1005

Merged

8 tasks
