fix(collection): use UTF-8 byte length for SQL query size check #3717

Merged
farnabaz merged 2 commits into nuxt:main from yodakaEngineer:fix/multibyte-generate-collection-insert
Feb 17, 2026

@yodakaEngineer
Contributor

🔗 Linked issue

Resolves #3533

❓ Type of change

  • 📖 Documentation (updates to the documentation or readme)
  • 🐞 Bug fix (a non-breaking change that fixes an issue)
  • 👌 Enhancement (improving an existing functionality like performance)
  • ✨ New feature (a non-breaking change that adds functionality)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)

📚 Description

The SQL split logic in generateCollectionInsert used string.length (character count) to check against Cloudflare D1's 100KB byte limit.
Multibyte characters (e.g. Japanese: 1 char = 3 bytes) could exceed the limit without triggering a split, causing SQLite errors on D1.
So, I replaced string.length with utf8ByteLength() for size comparisons and slice index calculations.

FYI: 心 means heart.
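
To make the gap between the two measurements concrete, here is a minimal sketch. Strictly speaking, `string.length` counts UTF-16 code units rather than characters, while the D1 limit is expressed in bytes. The helper name `compareLengths` is hypothetical and is not part of the PR.

```typescript
// TextEncoder is a standard global in Node.js and in browsers.
const encoder = new TextEncoder()

// Report both the UTF-16 code-unit count and the UTF-8 byte count of a string.
function compareLengths(str: string): { codeUnits: number, bytes: number } {
  return { codeUnits: str.length, bytes: encoder.encode(str).length }
}

const ascii = compareLengths('a'.repeat(10)) // 10 code units, 10 bytes
const kanji = compareLengths('心'.repeat(10)) // 10 code units, 30 bytes
const emoji = compareLengths('😀') // 2 code units, 4 bytes
```

For ASCII the two numbers agree, which is likely why the mismatch goes unnoticed on English-only content.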

📝 Checklist

  • I have linked an issue or discussion.
  • I have updated the documentation accordingly.

@vercel

vercel bot commented Feb 12, 2026

@yodakaEngineer is attempting to deploy a commit to the Nuxt Team on Vercel.

A member of the Team first needs to authorize it.

@pkg-pr-new

pkg-pr-new bot commented Feb 12, 2026

npm i https://pkg.pr.new/@nuxt/content@3717

commit: 5c7c79f

@coderabbitai

coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

Added UTF-8 byte-length utilities in src/utils/collection.ts: utf8ByteLength(str: string) and charIndexAtByteOffset(str: string, targetBytes: number) plus a UTF-8 encoder constant. Replaced character-length checks and slicing logic with UTF-8 byte-aware calculations across SQL size validation, largest-column selection, initial slice index computation, and iterative multi-statement insert slicing to avoid splitting multibyte characters. Tests updated/added to verify splitting behavior for multibyte UTF-8 content and that emoji boundaries are respected. No public API removals; exports increased.
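
The walkthrough names the two helpers and their signatures but does not show their bodies. The following is one possible sketch of how such helpers could work, not the code merged in the PR. The constant name `utf8Encoder` and the exact rounding behavior (never exceeding `targetBytes`) are assumptions.

```typescript
// Assumed name for the shared encoder constant mentioned in the walkthrough.
const utf8Encoder = new TextEncoder()

// Number of bytes `str` occupies when encoded as UTF-8.
function utf8ByteLength(str: string): number {
  return utf8Encoder.encode(str).length
}

// Largest string index i such that str.slice(0, i) fits within `targetBytes`
// UTF-8 bytes and does not end in the middle of a code point.
function charIndexAtByteOffset(str: string, targetBytes: number): number {
  let bytes = 0
  let index = 0
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of str) {
    const size = utf8ByteLength(ch)
    if (bytes + size > targetBytes) break
    bytes += size
    index += ch.length // 1 for BMP characters, 2 for surrogate pairs
  }
  return index
}

const sample = 'a心😀b' // 1 + 3 + 4 + 1 = 9 bytes
const cut = charIndexAtByteOffset(sample, 7) // 2: the 4-byte emoji no longer fits
```

Encoding one code point at a time keeps the sketch short; a production version would probably compute byte sizes from code point ranges to avoid the per-character allocations.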

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: using UTF-8 byte length instead of character length for SQL query size checks. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the problem with multibyte UTF-8 characters and the solution implemented. |
| Linked Issues check | ✅ Passed | The code changes directly address issue #3533 by implementing UTF-8 byte length checks and slice calculations instead of character length operations. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to fixing the UTF-8 byte length issue in the collection insert logic with corresponding tests; no unrelated modifications present. |

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
test/unit/generateCollectionInsert.test.ts (1)

127-169: Add null-check assertions on regex matches for better test diagnostics.

Lines 162–166 access regex match results via ! non-null assertions without first asserting the match succeeded. If the implementation ever changes the SQL format, these lines will throw a cryptic TypeError instead of a clear test failure. The emoji test at lines 209–214 already does this correctly with expect(...).not.toBeNull().

Proposed fix
     const insertMatch = sql[0]!.match(/'(心+)'/)
+    expect(insertMatch).not.toBeNull()
     let reconstructed = insertMatch![1]!
     for (let i = 1; i < sql.length; i++) {
       const updateMatch = sql[i]!.match(/CONCAT\(content, '(心+)'\)/)
+      expect(updateMatch).not.toBeNull()
       reconstructed += updateMatch![1]!
     }

@yodakaEngineer
Contributor Author

I'm sorry.
This change will likely stabilize deployment to D1, but this PR cannot resolve #3533.
I think the issue only appeared to be fixed because the slice position changed.

@yodakaEngineer
Contributor Author

Finally, I got it!
Calling .slice() on a string that contains an emoji can break the character, as shown below.

'aa🎲aa'.slice(3) // This is '\uDFB2aa'

The current implementation breaks when an emoji sits at the SLICE_SIZE-th character position.
I think this is the cause of #3533.

This PR changes the size measurement from characters to bytes, so I think it resolves #3533.
I also added a test.
Please review it.
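
The broken slice above can be avoided by checking whether the cut falls between the two halves of a surrogate pair. This is a small illustrative sketch of that idea only; `safeSliceIndex` is a hypothetical helper and not the byte-based approach the PR takes.

```typescript
// Move a slice index back by one if it would land between the high and low
// halves of a surrogate pair.
function safeSliceIndex(str: string, index: number): number {
  if (index <= 0 || index >= str.length) return index
  const prev = str.charCodeAt(index - 1)
  const next = str.charCodeAt(index)
  const splitsPair = prev >= 0xD800 && prev <= 0xDBFF
    && next >= 0xDC00 && next <= 0xDFFF
  return splitsPair ? index - 1 : index
}

const text = 'aa🎲aa'
const naive = text.slice(3) // starts with a lone low surrogate
const safe = text.slice(safeSliceIndex(text, 3)) // '🎲aa'
```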

FYI: This test is broken on main branch.

  test('Emoji at SLICE_SIZE boundary is not broken', () => {
    const collection = resolveCollection('content', defineCollection({
      type: 'data',
      source: '**',
      schema: z.object({
        content: z.string(),
      }),
    }))!

    // biggestColumn = `'${content}'` (with surrounding SQL quotes)
    // getSliceIndex checks column[SLICE_SIZE - 1] = column[69999]
    // column[0] = opening quote, so column[69999] = content[69998]
    // Place an emoji (surrogate pair, 2 UTF-16 code units) at content[69998]
    const prefix = 'a'.repeat(69998)
    const emoji = '😀'
    const suffix = 'b'.repeat(40000)
    const content = prefix + emoji + suffix

    const { queries: sql } = generateCollectionInsert(collection, {
      id: 'foo.md',
      stem: 'foo',
      extension: 'md',
      meta: {},
      content,
    })

    // Should be split into multiple queries
    expect(sql.length).toBeGreaterThan(1)

    // Each query should not contain broken surrogate pairs
    for (const query of sql) {
      console.log(query)
      for (let i = 0; i < query.length; i++) {
        const code = query.charCodeAt(i)
        if (code >= 0xD800 && code <= 0xDBFF) {
          // High surrogate must be followed by low surrogate
          const next = query.charCodeAt(i + 1)
          expect(next >= 0xDC00 && next <= 0xDFFF).toBe(true)
        }
        if (code >= 0xDC00 && code <= 0xDFFF) {
          // Low surrogate must be preceded by high surrogate
          const prev = query.charCodeAt(i - 1)
          expect(prev >= 0xD800 && prev <= 0xDBFF).toBe(true)
        }
      }
    }
  })
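
The surrogate check inside the test loop could also be factored into a standalone predicate. The sketch below is one way to write it; `hasLoneSurrogate` is a hypothetical name. Newer runtimes also provide `String.prototype.isWellFormed()`, which answers the inverse question where it is available.

```typescript
// Returns true if `str` contains a high surrogate without a following low
// surrogate, or a low surrogate without a preceding high surrogate.
function hasLoneSurrogate(str: string): boolean {
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i)
    if (code >= 0xD800 && code <= 0xDBFF) {
      const next = str.charCodeAt(i + 1) // NaN at the end of the string
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return true
      i++ // skip the low half of a valid pair
    }
    else if (code >= 0xDC00 && code <= 0xDFFF) {
      return true
    }
  }
  return false
}

const intact = hasLoneSurrogate('a😀b') // false
const broken = hasLoneSurrogate('😀'.slice(1)) // true
```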

@farnabaz
Member

Thanks for the PR @yodakaEngineer.

Don’t we need to reduce MAX_SQL_QUERY_SIZE? If Japanese characters are 3 bytes, 100K characters might go above the 100KB Cloudflare D1 hard limit.

@yodakaEngineer
Contributor Author

yodakaEngineer commented Feb 14, 2026

@farnabaz Thanks for the quick response.

I think MAX_SQL_QUERY_SIZE is a byte limit, not a character limit.

The existing implementation compares it against a character count, so this PR changes the comparison to use a byte count.

So we don't need to change MAX_SQL_QUERY_SIZE.
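
A short sketch of why the constant can stay the same once the comparison is done in bytes. The value `100000` and the `INSERT` statement are assumptions for illustration; the thread only states that the limit is about 100KB.

```typescript
// Assumed value for illustration; the thread only says the limit is ~100KB.
const MAX_SQL_QUERY_SIZE = 100000

const byteLength = (s: string): number => new TextEncoder().encode(s).length

// 40,000 three-byte characters wrapped in a hypothetical INSERT statement.
const query = `INSERT INTO demo VALUES ('${'心'.repeat(40000)}');`

// Character-based check: about 40K code units, so the query is not split.
const passesCharCheck = query.length < MAX_SQL_QUERY_SIZE
// Byte-based check: about 120K bytes, so the query is correctly flagged.
const passesByteCheck = byteLength(query) < MAX_SQL_QUERY_SIZE
```

With the byte-based check, the same threshold catches the oversized query that the character-based check lets through.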

@vercel

vercel bot commented Feb 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
content Ready Ready Preview, Comment Feb 17, 2026 3:25pm

Member

@farnabaz farnabaz left a comment

Thanks for the PR and the Japanese tip. 心

@farnabaz farnabaz merged commit 9f8402a into nuxt:main Feb 17, 2026
7 checks passed

Development

Successfully merging this pull request may close these issues.

SQLite fails when UTF-8 multibyte character is split at exactly 350,000 bytes boundary

2 participants