fix(collection): use UTF-8 byte length for SQL query size check by yodakaEngineer · Pull Request #3717 · nuxt/content

yodakaEngineer · 2026-02-12T08:04:20Z

🔗 Linked issue

Resolves #3533

❓ Type of change

📖 Documentation (updates to the documentation or readme)
🐞 Bug fix (a non-breaking change that fixes an issue)
👌 Enhancement (improving an existing functionality like performance)
✨ New feature (a non-breaking change that adds functionality)
⚠️ Breaking change (fix or feature that would cause existing functionality to change)

📚 Description

The SQL split logic in generateCollectionInsert used string.length (character count) to check against Cloudflare D1's 100KB byte limit.
Content with multibyte characters (e.g. Japanese: 1 char = 3 bytes) could therefore exceed the limit without triggering a split, causing SQLite errors on D1.
This PR replaces string.length with utf8ByteLength() for size comparisons and slice index calculations.

FYI: 心 means heart.
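
To make the gap between character count and byte count concrete, here is a minimal sketch using the standard TextEncoder API. The helper name byteLengthOf is hypothetical and only for illustration; the PR's own utf8ByteLength() may be implemented differently.

```typescript
// TextEncoder is a standard Web API and is also a global in Node.js.
const encoder = new TextEncoder()

// Hypothetical helper for illustration; not taken from the PR.
function byteLengthOf(str: string): number {
  return encoder.encode(str).length
}

const ascii = 'heart'
const kanji = '心'

// ASCII: one UTF-16 code unit and one UTF-8 byte per character.
console.log(ascii.length, byteLengthOf(ascii)) // 5 5

// Kanji: one UTF-16 code unit, but three UTF-8 bytes.
console.log(kanji.length, byteLengthOf(kanji)) // 1 3
```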

📝 Checklist

I have linked an issue or discussion.
I have updated the documentation accordingly.

vercel · 2026-02-12T08:04:24Z

@yodakaEngineer is attempting to deploy a commit to the Nuxt Team on Vercel.

A member of the Team first needs to authorize it.

pkg-pr-new · 2026-02-12T08:07:27Z

npm i https://pkg.pr.new/@nuxt/content@3717

commit: 5c7c79f

coderabbitai · 2026-02-12T08:09:21Z

📝 Walkthrough


Added UTF-8 byte-length utilities in src/utils/collection.ts: utf8ByteLength(str: string) and charIndexAtByteOffset(str: string, targetBytes: number) plus a UTF-8 encoder constant. Replaced character-length checks and slicing logic with UTF-8 byte-aware calculations across SQL size validation, largest-column selection, initial slice index computation, and iterative multi-statement insert slicing to avoid splitting multibyte characters. Tests updated/added to verify splitting behavior for multibyte UTF-8 content and that emoji boundaries are respected. No public API removals; exports increased.
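
The diff itself is not shown in this conversation, so the following is only a sketch of what the two utilities named above might look like. The function names and signatures come from the walkthrough; the bodies and the exact semantics of charIndexAtByteOffset (return the largest string index whose prefix fits in targetBytes without cutting a character) are assumptions.

```typescript
const utf8Encoder = new TextEncoder()

// Number of bytes `str` occupies when encoded as UTF-8.
function utf8ByteLength(str: string): number {
  return utf8Encoder.encode(str).length
}

// Assumed semantics: largest index i such that str.slice(0, i) is at most
// targetBytes long in UTF-8 and does not end in the middle of a character.
function charIndexAtByteOffset(str: string, targetBytes: number): number {
  let bytes = 0
  let index = 0
  while (index < str.length) {
    const code = str.charCodeAt(index)
    const next = str.charCodeAt(index + 1) // NaN past the end of the string
    const isPair = code >= 0xD800 && code <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF
    // UTF-8 size of the code point that starts at `index`.
    const size = isPair ? 4 : code < 0x80 ? 1 : code < 0x800 ? 2 : 3
    if (bytes + size > targetBytes) break
    bytes += size
    index += isPair ? 2 : 1 // a surrogate pair spans two UTF-16 code units
  }
  return index
}

console.log(charIndexAtByteOffset('心心心', 7)) // 2 (two kanji use 6 bytes, a third would need 9)
console.log(charIndexAtByteOffset('aa🎲aa', 3)) // 2 (the 4-byte emoji does not fit)
```

Walking the string by code unit and computing the UTF-8 size arithmetically avoids encoding each character separately; whether the merged code does the same is not something this page confirms.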

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: using UTF-8 byte length instead of character length for SQL query size checks.
Description check	✅ Passed	The description is directly related to the changeset, explaining the problem with multibyte UTF-8 characters and the solution implemented.
Linked Issues check	✅ Passed	The code changes directly address issue `#3533` by implementing UTF-8 byte length checks and slice calculations instead of character length operations.
Out of Scope Changes check	✅ Passed	All changes are scoped to fixing the UTF-8 byte length issue in the collection insert logic with corresponding tests; no unrelated modifications present.


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments

test/unit/generateCollectionInsert.test.ts (1)
127-169: Add null-check assertions on regex matches for better test diagnostics.

Lines 162–166 access regex match results via ! non-null assertions without first asserting the match succeeded. If the implementation ever changes the SQL format, these lines will throw a cryptic TypeError instead of a clear test failure. The emoji test at lines 209–214 already does this correctly with expect(...).not.toBeNull().
Proposed fix
     const insertMatch = sql[0]!.match(/'(心+)'/)
+    expect(insertMatch).not.toBeNull()
     let reconstructed = insertMatch![1]!
     for (let i = 1; i < sql.length; i++) {
       const updateMatch = sql[i]!.match(/CONCAT\(content, '(心+)'\)/)
+      expect(updateMatch).not.toBeNull()
       reconstructed += updateMatch![1]!
     }


yodakaEngineer · 2026-02-12T09:17:15Z

I'm sorry.
This change will likely stabilize deployment to D1, but this PR cannot resolve #3533.
I think it only looked like it fixed the issue because the slice position changed.

yodakaEngineer · 2026-02-12T12:17:56Z

Finally, I got it!
Calling .slice() on a string that contains an emoji can split the character, as shown below.

'aa🎲aa'.slice(3) // This is '\uDFB2aa'

The current implementation breaks when an emoji appears at the SLICE_SIZE-th character position.
I think this is the cause of #3533.

This PR changes the size calculation from characters to bytes, so I think it resolves #3533.
I also added a test.
Please review it.
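
The slice behaviour described above can be reproduced in isolation. The sketch below also shows one way to keep a cut position off the middle of a surrogate pair; safeCutIndex is a hypothetical helper for illustration, not a function from this PR.

```typescript
const text = 'aa🎲aa'

// '🎲' is stored as two UTF-16 code units: \uD83C\uDFB2.
// Index 3 falls between them, so the slice starts with a lone low surrogate.
const broken = text.slice(3)
console.log(broken === '\uDFB2aa') // true

// Hypothetical helper: move a cut position back by one code unit when it
// would land between a high surrogate and its low surrogate.
function safeCutIndex(str: string, index: number): number {
  const prev = str.charCodeAt(index - 1) // NaN when index is 0
  const curr = str.charCodeAt(index)     // NaN when index is str.length
  const splitsPair = prev >= 0xD800 && prev <= 0xDBFF && curr >= 0xDC00 && curr <= 0xDFFF
  return splitsPair ? index - 1 : index
}

const cut = safeCutIndex(text, 3)
console.log(cut) // 2
console.log(text.slice(0, cut) + '|' + text.slice(cut)) // aa|🎲aa
```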

yodakaEngineer · 2026-02-12T12:57:30Z

FYI: This test fails on the main branch.

  test('Emoji at SLICE_SIZE boundary is not broken', () => {
    const collection = resolveCollection('content', defineCollection({
      type: 'data',
      source: '**',
      schema: z.object({
        content: z.string(),
      }),
    }))!

    // biggestColumn = `'${content}'` (with surrounding SQL quotes)
    // getSliceIndex checks column[SLICE_SIZE - 1] = column[69999]
    // column[0] = opening quote, so column[69999] = content[69998]
    // Place an emoji (surrogate pair, 2 UTF-16 code units) at content[69998]
    const prefix = 'a'.repeat(69998)
    const emoji = '😀'
    const suffix = 'b'.repeat(40000)
    const content = prefix + emoji + suffix

    const { queries: sql } = generateCollectionInsert(collection, {
      id: 'foo.md',
      stem: 'foo',
      extension: 'md',
      meta: {},
      content,
    })

    // Should be split into multiple queries
    expect(sql.length).toBeGreaterThan(1)

    // Each query should not contain broken surrogate pairs
    for (const query of sql) {
      console.log(query)
      for (let i = 0; i < query.length; i++) {
        const code = query.charCodeAt(i)
        if (code >= 0xD800 && code <= 0xDBFF) {
          // High surrogate must be followed by low surrogate
          const next = query.charCodeAt(i + 1)
          expect(next >= 0xDC00 && next <= 0xDFFF).toBe(true)
        }
        if (code >= 0xDC00 && code <= 0xDFFF) {
          // Low surrogate must be preceded by high surrogate
          const prev = query.charCodeAt(i - 1)
          expect(prev >= 0xD800 && prev <= 0xDBFF).toBe(true)
        }
      }
    }
  })
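
The surrogate check inside the test loop could also be written as a standalone predicate. This is only a sketch; hasLoneSurrogate is a hypothetical name and is not part of the PR or the test file.

```typescript
// Hypothetical helper: true if `str` contains a surrogate without its partner.
function hasLoneSurrogate(str: string): boolean {
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i)
    if (code >= 0xD800 && code <= 0xDBFF) {
      // A high surrogate must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1)
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return true
      i++ // skip the low surrogate that completes the pair
    }
    else if (code >= 0xDC00 && code <= 0xDFFF) {
      // A low surrogate reached here has no preceding high surrogate.
      return true
    }
  }
  return false
}

console.log(hasLoneSurrogate('a😀b'))          // false
console.log(hasLoneSurrogate('a😀b'.slice(2))) // true, the slice starts mid-emoji
```

With such a helper, the per-query assertion would reduce to a single expectation that it returns false for every generated query.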

farnabaz · 2026-02-13T11:51:57Z

Thanks for the PR @yodakaEngineer.

Don't we need to reduce MAX_SQL_QUERY_SIZE? If Japanese characters are 3 bytes each, 100K characters might exceed Cloudflare D1's 100KB hard limit.

yodakaEngineer · 2026-02-14T03:37:41Z

@farnabaz Thanks for the quick response.

I think MAX_SQL_QUERY_SIZE is a byte limit, not a character limit.

The existing implementation compares it against a character count, so this PR changes the comparison to use a byte count.

So we don't need to change MAX_SQL_QUERY_SIZE.
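
The point can be checked with a short sketch. The limit value, table name, and query text below are assumptions chosen for illustration (100KB is read here as 100,000 bytes); they are not taken from the nuxt/content source.

```typescript
// Assumed value for illustration only: 100KB read as 100,000 bytes.
const LIMIT_BYTES = 100000

const encoder = new TextEncoder()

// 40,000 kanji: well under the limit by character count,
// but 3 bytes each once encoded as UTF-8.
const value = '心'.repeat(40000)
const query = `INSERT INTO t VALUES ('${value}')`

const charCount = query.length
const byteCount = encoder.encode(query).length

console.log(charCount > LIMIT_BYTES) // false: a character-based check lets the query through
console.log(byteCount > LIMIT_BYTES) // true: a byte-based check triggers a split
```

Under these assumptions, the constant can stay the same; only the quantity compared against it needs to change.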

vercel · 2026-02-17T15:21:45Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
content	Ready	Preview, Comment	Feb 17, 2026 3:25pm

farnabaz

Thanks for the PR and the Japanese tip. 心

test(collection): add test for emoji at SLICE_SIZE byte boundary

5c7c79f

vercel bot deployed to Preview February 17, 2026 15:25 View deployment

farnabaz approved these changes Feb 17, 2026


farnabaz merged commit 9f8402a into nuxt:main Feb 17, 2026
7 checks passed

farnabaz merged 2 commits into nuxt:main from yodakaEngineer:fix/multibyte-generate-collection-insert
