Conversation
The SQL split logic in `generateCollectionInsert` used `string.length` (character count) to check against Cloudflare D1's 100KB byte limit. Multibyte characters (e.g. Japanese: 1 char = 3 bytes) could exceed the limit without triggering a split, causing SQLite errors on D1. Replace `string.length` with `utf8ByteLength()` for size comparisons and slice index calculations.
📝 Walkthrough: Added UTF-8 byte-length utilities in `src/utils/collection.ts`.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
I'm sorry.
Finally, I got it! `'aa🎲aa'.slice(3)` // This is `'\uDFB2aa'`. The current implementation breaks when an emoji appears at the slice boundary. This PR changes how size is measured from characters to bytes, so I think this resolves #3533.
FYI: This test is broken on the main branch.

```ts
// Assumes resolveCollection, defineCollection, generateCollectionInsert,
// and z (zod) are imported from the repo's utils and test helpers.
test('Emoji at SLICE_SIZE boundary is not broken', () => {
  const collection = resolveCollection('content', defineCollection({
    type: 'data',
    source: '**',
    schema: z.object({
      content: z.string(),
    }),
  }))!
  // biggestColumn = `'${content}'` (with surrounding SQL quotes)
  // getSliceIndex checks column[SLICE_SIZE - 1] = column[69999]
  // column[0] = opening quote, so column[69999] = content[69998]
  // Place an emoji (surrogate pair, 2 UTF-16 code units) at content[69998]
  const prefix = 'a'.repeat(69998)
  const emoji = '😀'
  const suffix = 'b'.repeat(40000)
  const content = prefix + emoji + suffix
  const { queries: sql } = generateCollectionInsert(collection, {
    id: 'foo.md',
    stem: 'foo',
    extension: 'md',
    meta: {},
    content,
  })
  // Should be split into multiple queries
  expect(sql.length).toBeGreaterThan(1)
  // Each query should not contain broken surrogate pairs
  for (const query of sql) {
    for (let i = 0; i < query.length; i++) {
      const code = query.charCodeAt(i)
      if (code >= 0xD800 && code <= 0xDBFF) {
        // High surrogate must be followed by low surrogate
        const next = query.charCodeAt(i + 1)
        expect(next >= 0xDC00 && next <= 0xDFFF).toBe(true)
      }
      if (code >= 0xDC00 && code <= 0xDFFF) {
        // Low surrogate must be preceded by high surrogate
        const prev = query.charCodeAt(i - 1)
        expect(prev >= 0xD800 && prev <= 0xDBFF).toBe(true)
      }
    }
  }
})
```
Thanks for the PR @yodakaEngineer. Don't we need to reduce `MAX_SQL_QUERY_SIZE`?
@farnabaz Thanks for the quick response. I think `MAX_SQL_QUERY_SIZE` is a byte limit, not a character limit, but the existing implementation calculated it based on character count. This PR changes the calculation to byte count, so we don't need to change `MAX_SQL_QUERY_SIZE`.
farnabaz left a comment
Thanks for the PR and the Japanese tip. 心
🔗 Linked issue
Resolves #3533
❓ Type of change
📚 Description
The SQL split logic in `generateCollectionInsert` used `string.length` (character count) to check against Cloudflare D1's 100KB byte limit.
Multibyte characters (e.g. Japanese: 1 char = 3 bytes) could exceed the limit without triggering a split, causing SQLite errors on D1.
So I replaced `string.length` with `utf8ByteLength()` for size comparisons and slice index calculations.
FYI: 心 means "heart".

📝 Checklist