feat: add batch size parameter to prevent context leaking #1969
Hellnight2005 wants to merge 12 commits into lingodotdev:main from feat/export-locale-helpers
Conversation
📝 Walkthrough
Adds a CLI flag for batch-size control.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as CLI User
    participant CLI as CLI Parser
    participant Setup as Setup Flow
    participant Localizer as Explicit Localizer
    participant Processor as Processor / Chunker
    participant AISDK as AI SDK
    participant Callback as Progress Callback
    User->>CLI: run --batch-size 5
    CLI->>Setup: parse flags, pass batchSize
    Setup->>Localizer: createLocalizer(..., batchSize=5)
    User->>Localizer: localize(translations, onProgress)
    Localizer->>Processor: extractPayloadChunks(payload, batchSize=5)
    Processor-->>Localizer: chunks[]
    loop chunk processing
        Localizer->>AISDK: generateText(systemPrompt, userMsg with chunk)
        AISDK-->>Localizer: translatedChunk
        Localizer->>Callback: onProgress(percentage, chunk, result)
    end
    Localizer->>Localizer: merge subResults
    Localizer-->>User: final mergedResult
```
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/cli/src/cli/processor/basic.ts (1)
10-35: ⚠️ Potential issue | 🟡 Minor
Progress never reaches 100% for chunked runs.
Because `i` is zero-based, the final chunk reports `(n-1)/n * 100`. Use `(i + 1)` instead.
🛠️ Proposed fix

```diff
- onProgress((i / chunks.length) * 100, chunk, result);
+ onProgress(((i + 1) / chunks.length) * 100, chunk, result);
```
🤖 Fix all issues with AI agents
In `@packages/cli/src/cli/localizer/explicit.ts`:
- Around line 279-281: The progress calculation currently uses (i /
chunks.length) * 100 which reports 0% for the first processed chunk; update the
call to onProgress to use ((i + 1) / chunks.length) * 100 (or compute
processedCount = i + 1 and use processedCount / chunks.length) so progress
reflects completed chunks, and clamp to 100% if needed; apply the same fix to
the other instance that uses the same pattern (the calls invoking onProgress
with i, chunks, finalResult).
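The off-by-one being fixed can be illustrated with a tiny helper (`chunkProgress` is a hypothetical name for illustration, not the PR's actual code):

```typescript
// Progress should be based on completed chunks, i.e. (i + 1) for a zero-based
// index, clamped so rounding can never push it past 100.
function chunkProgress(i: number, totalChunks: number): number {
  return Math.min(100, ((i + 1) / totalChunks) * 100);
}
```

With this, the last of `n` chunks reports exactly 100% instead of `(n-1)/n * 100`.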
🧹 Nitpick comments (2)
packages/cli/src/cli/processor/basic.spec.ts (1)
22-60: Test input is missing required fields that exist in the `LocalizerInput` type.
Looking at the `createBasicTranslator` implementation, `LocalizerInput` likely requires `sourceData` and `targetData` fields in addition to what's provided. While the current test may pass because the mock intercepts `generateText`, the input object is incomplete:

```diff
 const input = {
   sourceLocale: "en",
   targetLocale: "fr",
   processableData: {
     key1: "value1",
     key2: "value2",
     key3: "value3",
   },
+  sourceData: {},
+  targetData: {},
 };
```

This applies to all three test cases. The tests focus correctly on batching behavior, but adding these fields would make the test inputs more realistic.
packages/cli/src/cli/localizer/explicit.ts (1)
296-324: Consider the interaction between word count and batch size constraints.
The chunking logic has two constraints: `idealBatchItemSize` (250 words) AND `batchSize` (key count). A chunk is finalized when either limit is exceeded. This means:
- If a user sets `batchSize: 100` but entries are long, chunks may be smaller due to the 250-word limit.
- The 250-word limit is hardcoded and not configurable.

This is likely acceptable behavior (prevents overly large payloads), but users expecting exact batch sizes may be surprised. Consider documenting this dual-constraint behavior in the CLI help text for `--batch-size`.
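The dual-constraint behavior described above can be sketched as follows (a simplified stand-in with a naive whitespace word count; `chunkEntries` is not the PR's actual implementation):

```typescript
// A chunk is closed once it exceeds the word budget OR reaches the key-count
// limit, whichever comes first.
function chunkEntries(
  payload: Record<string, string>,
  batchSize: number,
  idealBatchItemSize = 250,
): Record<string, string>[] {
  const chunks: Record<string, string>[] = [];
  let current: Record<string, string> = {};
  let itemCount = 0;
  let wordCount = 0;

  for (const [key, value] of Object.entries(payload)) {
    current[key] = value;
    itemCount++;
    wordCount += value.split(/\s+/).filter(Boolean).length;

    if (wordCount > idealBatchItemSize || itemCount >= batchSize) {
      chunks.push(current);
      current = {};
      itemCount = 0;
      wordCount = 0;
    }
  }
  if (itemCount > 0) chunks.push(current); // flush the trailing partial chunk
  return chunks;
}
```

Note how a single long entry can close a chunk well before `batchSize` keys have accumulated, which is exactly the surprise the comment warns about.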
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@packages/cli/src/cli/localizer/explicit.ts`:
- Around line 260-276: Wrap the initial JSON.parse(response.text) in a try/catch
inside the function that processes the LLM response (the block using variables
result, finalResult and jsonrepair); on parse failure attempt to run jsonrepair
on response.text and parse the repaired string, falling back to an empty object
if that also fails, and ensure you surface a clear error/log message that
includes the raw response.text and the parse error so the batch doesn't crash
(refer to result, finalResult, response.text, and jsonrepair in your changes).
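The fallback described in that prompt can be sketched like this. The PR uses the `jsonrepair` library; here `repair` defaults to the identity function so the sketch stays self-contained, and all names are illustrative:

```typescript
// Try a plain parse, then a repaired parse, then fall back to an empty object
// so one bad chunk doesn't crash the whole batch.
function safeParseLLMResponse(
  text: string,
  repair: (s: string) => string = (s) => s,
): Record<string, unknown> {
  try {
    return JSON.parse(text);
  } catch {
    try {
      return JSON.parse(repair(text));
    } catch (e2) {
      // Log only a short snippet, not the full (potentially sensitive) body.
      const snippet = text.slice(0, 200);
      console.error(`Failed to parse model response: ${e2} (snippet: ${snippet})`);
      return {}; // empty fallback keeps the batch alive
    }
  }
}
```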
🧹 Nitpick comments (1)
packages/cli/src/cli/localizer/explicit.ts (1)
296-324: Chunk size check occurs after adding the item, so chunks may exceed limits.
The current implementation adds an entry to `currentChunk` (line 308) before checking size limits (lines 312-316). This means a chunk can contain one more item than `batchSize` or exceed `idealBatchItemSize` by the size of one entry.
For `batchSize=1` (the main use case for context isolation), this works correctly, since the check `currentChunkItemCount >= batchSize` triggers immediately after adding each item.
For larger batch sizes, this is likely acceptable, but if strict limits are needed:
♻️ Optional: check before adding to enforce strict limits

```diff
 for (let i = 0; i < payloadEntries.length; i++) {
   const [key, value] = payloadEntries[i];
+
+  // Check if adding this item would exceed limits
+  const testChunk = { ...currentChunk, [key]: value };
+  const testChunkSize = countWordsInRecord(testChunk);
+  const wouldExceedLimits =
+    currentChunkItemCount > 0 &&
+    (testChunkSize > idealBatchItemSize || currentChunkItemCount >= batchSize);
+
+  if (wouldExceedLimits) {
+    result.push(currentChunk);
+    currentChunk = {};
+    currentChunkItemCount = 0;
+  }
+
   currentChunk[key] = value;
   currentChunkItemCount++;
-
-  const currentChunkSize = countWordsInRecord(currentChunk);
-  if (
-    currentChunkSize > idealBatchItemSize ||
-    currentChunkItemCount >= batchSize ||
-    i === payloadEntries.length - 1
-  ) {
+
+  if (i === payloadEntries.length - 1) {
     result.push(currentChunk);
-    currentChunk = {};
-    currentChunkItemCount = 0;
   }
 }
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@packages/cli/src/cli/localizer/explicit.ts`:
- Around line 268-271: Replace the current console.error and thrown error that
include the full response.text with a safe, truncated or redacted preview: in
the error handling block around response and e2 in explicit.ts (the lines
calling console.error(`Failed to parse response... ${response.text}`) and throw
new Error(`Failed to parse response...: ${e2}`)), log only a short sanitized
excerpt (e.g., first N characters or a redacted placeholder) and include the
parse error (e2) but never the full response body; ensure the thrown Error
message also contains no full response text—only the error summary and the same
sanitized preview so no PII is leaked.
- Around line 310-333: The function extractPayloadChunks currently defaults
batchSize to 25 which forces chunking even when callers omit the flag; change
the signature to accept batchSize?: number (remove the =25 default) and compute
an effectiveBatchSize inside the function as batchSize ?? payloadEntries.length
(or Infinity/payloadEntries.length) so that when batchSize is undefined all keys
are kept in a single batch; then replace uses of batchSize in the chunking
condition (currentChunkItemCount >= batchSize) with currentChunkItemCount >=
effectiveBatchSize (and defensively handle non-positive values if needed).
Ensure you update the parameter type and the condition checks in
extractPayloadChunks to use the new effectiveBatchSize while leaving the
idealBatchItemSize logic intact.
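The core of that prompt, computing an effective batch size from an optional flag, can be sketched as a tiny helper (the function name is illustrative, not the PR's code):

```typescript
// When batchSize is omitted or non-positive, keep everything in one batch by
// using the entry count itself as the limit (minimum 1 for empty payloads).
function resolveEffectiveBatchSize(
  batchSize: number | undefined,
  entryCount: number,
): number {
  return batchSize && batchSize > 0 ? batchSize : Math.max(entryCount, 1);
}
```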
| "Brand voice enabled", | ||
| "Translation memory connected", | ||
| "Glossary enabled", | ||
| "Quality assurance enabled", | ||
| ].map((title) => ({ title, task: () => {} })) | ||
| "Brand voice enabled", | ||
| "Translation memory connected", | ||
| "Glossary enabled", | ||
| "Quality assurance enabled", | ||
| ].map((title) => ({ title, task: () => { } })) | ||
| : isPseudo | ||
| ? [ | ||
| "Pseudo-localization mode active", | ||
| "Character replacement configured", | ||
| "No external API calls", | ||
| ].map((title) => ({ title, task: () => {} })) | ||
| "Pseudo-localization mode active", | ||
| "Character replacement configured", | ||
| "No external API calls", | ||
| ].map((title) => ({ title, task: () => { } })) | ||
| : [ | ||
| "Skipping brand voice", | ||
| "Skipping glossary", | ||
| "Skipping translation memory", | ||
| "Skipping quality assurance", | ||
| ].map((title) => ({ title, task: () => {}, skip: true })); | ||
| "Skipping brand voice", | ||
| "Skipping glossary", | ||
| "Skipping translation memory", | ||
| "Skipping quality assurance", | ||
| ].map((title) => ({ title, task: () => { }, skip: true })); |
Not sure where all the whitespace changes come from, but it adds a significant amount of noise to the changeset
I'm not sure whether the 250-word limit is really necessary, or whether it just complicates the logic 🤔
That’s fair 👍
The 250-word limit was kept as a safety guard around prompt size, but I agree the batching behavior should be user-controlled.
Proposed behavior:
If no batching option is provided, preserve the current behavior (single batch).
If --batch-size is explicitly set, apply user-defined batching.
The 250-word limit acts only as a fallback safety guard, not as the primary batching mechanism.
What do you think? Happy to simplify or adjust this if the maintainers prefer a single, explicit batching strategy.
Yeah I would wait for the maintainers opinion on this. Since there is similar logic in other parts they might want to keep it
@Hellnight2005 I took a look; it looks mostly good to me. However, since I'm not a maintainer, you'll probably need someone from the lingo.dev team to review it.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/cli/src/cli/processor/basic.ts (1)
10-45: ⚠️ Potential issue | 🟠 Major
Extract `batchSize` from settings before spreading into `generateText`.
The AI SDK v6.0.25 uses schema validation and rejects unknown options. Spreading `settings` (which includes `batchSize`) directly into `generateText()` will cause a runtime validation error in production. The tests pass only because they mock `generateText`, bypassing actual SDK validation.
Extract `batchSize` outside the spread to prevent this error:
Suggested fix

```diff
 export function createBasicTranslator(
   model: LanguageModel,
   systemPrompt: string,
   settings: ModelSettings = {},
 ) {
+  const { batchSize, ...modelSettings } = settings;
   return async (input: LocalizerInput, onProgress: LocalizerProgressFn) => {
     const chunks = extractPayloadChunks(
       input.processableData,
-      settings.batchSize,
+      batchSize,
     );
@@
     const response = await generateText({
       model,
-      ...settings,
+      ...modelSettings,
       messages: [
```
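The separation being suggested can be sketched in isolation (`splitSettings` and the simplified `ModelSettings` shape are illustrative stand-ins, not the PR's actual types):

```typescript
// Pull the CLI-only batchSize out of the settings object so only model
// options get spread into generateText().
interface ModelSettings {
  temperature?: number;
  batchSize?: number;
}

function splitSettings(settings: ModelSettings) {
  const { batchSize, ...modelSettings } = settings;
  return { batchSize, modelSettings }; // modelSettings no longer carries batchSize
}
```

Rest destructuring keeps the pass-through object free of options the SDK's schema would reject.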
🤖 Fix all issues with AI agents
In `@packages/cli/src/cli/localizer/explicit.ts`:
- Around line 286-296: The string-response branch that extracts JSON from
result.data (using index/lastIndex, jsonrepair, and JSON.parse) lacks error
handling and can throw; wrap the parsing steps (const trimmed = ..., const
repaired = jsonrepair(trimmed), const parsed = JSON.parse(repaired)) in a
try/catch similar to the main parsing block, catch parsing errors, log or warn
the failure (including result.data or the error), and ensure finalResult is
safely set to parsed.data || parsed || {} or {} on error so the batch doesn't
crash; reference the result.data handling block and the jsonrepair/JSON.parse
sequence when applying the change.
In `@packages/cli/src/cli/processor/basic.ts`:
- Around line 92-96: The JSDoc comment for the function that extracts payload
chunks (around line 92-96) contains an inaccurate default value description for
the batchSize parameter. The doc states the default is 25, but the actual
implementation defaults to no max key limit. Update the `@param` documentation for
batchSize to accurately reflect what the implementation actually does when
batchSize is not provided, ensuring the documentation matches the actual default
behavior.
🧹 Nitpick comments (1)
packages/cli/src/cli/localizer/explicit.ts (1)
310-343: Outdated JSDoc and inefficient loop computation.
Two minor issues:
- Line 313 JSDoc says "(default: 25)" but the default is now to process all keys in one batch when `batchSize` is omitted.
- `effectiveBatchSize` (lines 332-333) is computed on every loop iteration but its value never changes. Move it outside the loop.

♻️ Proposed refactor

```diff
 /**
  * Extract payload chunks based on the ideal chunk size
  * @param payload - The payload to be chunked
- * @param batchSize - Max number of keys per chunk (default: 25)
+ * @param batchSize - Max number of keys per chunk (optional; if omitted, all keys in one batch)
  * @returns An array of payload chunks
  */
 function extractPayloadChunks(
   payload: Record<string, string>,
   batchSize?: number,
 ): Record<string, string>[] {
   const idealBatchItemSize = 250;
   const result: Record<string, string>[] = [];
   let currentChunk: Record<string, string> = {};
   let currentChunkItemCount = 0;
   const payloadEntries = Object.entries(payload);
+  const effectiveBatchSize =
+    batchSize && batchSize > 0 ? batchSize : payloadEntries.length || 1;
+
   for (let i = 0; i < payloadEntries.length; i++) {
     const [key, value] = payloadEntries[i];
     currentChunk[key] = value;
     currentChunkItemCount++;

     const currentChunkSize = countWordsInRecord(currentChunk);
-    const effectiveBatchSize =
-      batchSize && batchSize > 0 ? batchSize : payloadEntries.length || 1;
     if (
       currentChunkSize > idealBatchItemSize ||
       currentChunkItemCount >= effectiveBatchSize ||
       i === payloadEntries.length - 1
     ) {
```
```ts
/**
 * Extract payload chunks based on the ideal chunk size
 * @param payload - The payload to be chunked
 * @param batchSize - Max number of keys per chunk (default: 25)
 * @returns An array of payload chunks
```
JSDoc default for batchSize is inaccurate.
The comment says default is 25, but the implementation defaults to “no max key limit.” Please align the doc with the behavior.
✏️ Suggested doc fix

```diff
- * @param batchSize - Max number of keys per chunk (default: 25)
+ * @param batchSize - Max number of keys per chunk (default: unlimited; uses payload length when omitted)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
/**
 * Extract payload chunks based on the ideal chunk size
 * @param payload - The payload to be chunked
 * @param batchSize - Max number of keys per chunk (default: unlimited; uses payload length when omitted)
 * @returns An array of payload chunks
```
🤖 Prompt for AI Agents
In `@packages/cli/src/cli/processor/basic.ts` around lines 92 - 96, The JSDoc
comment for the function that extracts payload chunks (around line 92-96)
contains an inaccurate default value description for the batchSize parameter.
The doc states the default is 25, but the actual implementation defaults to no
max key limit. Update the `@param` documentation for batchSize to accurately
reflect what the implementation actually does when batchSize is not provided,
ensuring the documentation matches the actual default behavior.
Good catch 👍
This doc mismatch is tied to the open question around default batching behavior. I’m waiting for maintainer guidance on whether we want to keep single-batch as the default or introduce a capped default.
Once that’s clarified, I’ll align both the implementation and JSDoc accordingly.
@Hellnight2005 Sounds good! Waiting for maintainer input on the default strategy makes sense. Feel free to ping me once you've updated the docs to match the final decision.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@packages/cli/src/cli/localizer/explicit.ts`:
- Around line 296-301: The current catch for nested JSON parse errors swallows
the failure and pushes an empty finalResult into subResults (variables:
finalResult, subResults, result.data), causing silent data loss; make this
consistent with the primary JSON parse behavior by re-throwing the caught error
instead of continuing (i.e., remove the silent fallback and throw a new Error or
rethrow e) so callers fail fast, or alternatively set a clear marker on
finalResult (e.g., finalResult.__incomplete = true and include an error message
field) before pushing to subResults so callers can detect partial results —
update the catch block accordingly and ensure any callers that iterate
subResults check for that marker if you choose the marker approach.
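The two options in that prompt can be sketched side by side (hypothetical shape; `__incomplete` and `handleParseFailure` are illustrative names, not the PR's actual code):

```typescript
// Option A: fail fast by re-throwing. Option B: push a detectable partial
// result instead of silently losing the chunk.
type ChunkResult = Record<string, unknown> & {
  __incomplete?: boolean;
  __error?: string;
};

function handleParseFailure(
  subResults: ChunkResult[],
  error: unknown,
  failFast: boolean,
): void {
  if (failFast) {
    // Option A: surface the failure to the caller immediately
    throw new Error(`Chunk parse failed: ${error}`);
  }
  // Option B: record a marked partial result so callers can detect it
  subResults.push({ __incomplete: true, __error: String(error) });
}
```

With the marker approach, any caller iterating `subResults` must check `__incomplete` before trusting the merged output.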
🧹 Nitpick comments (3)
packages/cli/src/cli/localizer/explicit.ts (3)
311-312: Potential key collision on merge with nested objects.
`_.merge` performs deep merging, which means that if two chunks have overlapping keys with nested structures, later chunks will overwrite earlier values. While translation keys should be unique across chunks (since `extractPayloadChunks` distributes keys), if the same key appears in multiple chunks due to a bug or edge case, data could be silently overwritten.
Consider using `Object.assign` for a shallow merge (appropriate if values are flat strings) or add a debug assertion that chunks have disjoint keys.
339-340: Hoist `effectiveBatchSize` outside the loop.
The `effectiveBatchSize` calculation doesn't depend on loop variables (`i`, `key`, `value`, etc.) but is recomputed on every iteration. Move it before the loop for a minor efficiency improvement.
♻️ Proposed refactor

```diff
 const payloadEntries = Object.entries(payload);
+const effectiveBatchSize =
+  batchSize && batchSize > 0 ? batchSize : payloadEntries.length || 1;
+
 for (let i = 0; i < payloadEntries.length; i++) {
   const [key, value] = payloadEntries[i];
   currentChunk[key] = value;
   currentChunkItemCount++;

   const currentChunkSize = countWordsInRecord(currentChunk);
-  const effectiveBatchSize =
-    batchSize && batchSize > 0 ? batchSize : payloadEntries.length || 1;
   if (
```
360-361: Simplify redundant type annotation.
The type `any | Record<string, any> | Array<any>` is redundant since `any` subsumes all other types. Consider using just `unknown` for type safety, or `any` if you need flexibility.
♻️ Proposed simplification

```diff
 function countWordsInRecord(
-  payload: any | Record<string, any> | Array<any>,
+  payload: unknown,
 ): number {
```

Using `unknown` instead of `any` is safer, and the existing type guards (`Array.isArray`, `typeof === "object"`, `typeof === "string"`) already narrow the type appropriately.
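A word counter with the suggested `unknown` signature could look like this (a sketch consistent with the type guards mentioned above, not the PR's exact implementation):

```typescript
// The type guards narrow the unknown value at each level of recursion.
function countWordsInRecord(payload: unknown): number {
  if (Array.isArray(payload)) {
    return payload.reduce((acc: number, item) => acc + countWordsInRecord(item), 0);
  }
  if (payload !== null && typeof payload === "object") {
    return Object.values(payload).reduce(
      (acc: number, value) => acc + countWordsInRecord(value),
      0,
    );
  }
  if (typeof payload === "string") {
    return payload.trim().split(/\s+/).filter(Boolean).length;
  }
  return 0; // numbers, booleans, null, undefined contribute no words
}
```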
…loss (CodeRabbit review)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@packages/cli/src/cli/localizer/explicit.ts`:
- Around line 272-277: The error text incorrectly hardcodes "Lingo.dev" when
parsing responses for explicit BYOK providers; update the console.error and the
thrown Error to include the actual provider identifier (e.g., use provider.id or
providerName from the surrounding scope) instead of "Lingo.dev", keep the
snippet and original error (e2) in the message for debugging, and ensure you
fallback to a generic label if the provider id is missing.
🧹 Nitpick comments (1)
packages/cli/src/cli/localizer/explicit.ts (1)
340-346: Move `effectiveBatchSize` calculation outside the loop.
The `effectiveBatchSize` value doesn't change during iteration since it only depends on `batchSize` and `payloadEntries.length`. Computing it inside the loop on every iteration is wasteful.
♻️ Proposed refactor

```diff
 const payloadEntries = Object.entries(payload);
+const effectiveBatchSize =
+  batchSize && batchSize > 0 ? batchSize : payloadEntries.length || 1;
+
 for (let i = 0; i < payloadEntries.length; i++) {
   const [key, value] = payloadEntries[i];
   currentChunk[key] = value;
   currentChunkItemCount++;

   const currentChunkSize = countWordsInRecord(currentChunk);
-  const effectiveBatchSize =
-    batchSize && batchSize > 0 ? batchSize : payloadEntries.length || 1;
   if (
     currentChunkSize > idealBatchItemSize ||
     currentChunkItemCount >= effectiveBatchSize ||
```
```ts
console.error(
  `Failed to parse response from Lingo.dev. Response snippet: ${snippet}`,
);
throw new Error(
  `Failed to parse response from Lingo.dev: ${e2} (Snippet: ${snippet})`,
);
```
Error message incorrectly references "Lingo.dev" for BYOK providers.
This error message fires when parsing responses from explicit providers (OpenAI, Anthropic, Google, etc.), but it references "Lingo.dev" which is misleading. Consider using the actual provider ID for clearer diagnostics.
🔧 Proposed fix

```diff
 console.error(
-  `Failed to parse response from Lingo.dev. Response snippet: ${snippet}`,
+  `Failed to parse response from ${params.id}. Response snippet: ${snippet}`,
 );
 throw new Error(
-  `Failed to parse response from Lingo.dev: ${e2} (Snippet: ${snippet})`,
+  `Failed to parse response from ${params.id}: ${e2} (Snippet: ${snippet})`,
 );
```

🤖 Prompt for AI Agents
In `@packages/cli/src/cli/localizer/explicit.ts` around lines 272 - 277, The error
text incorrectly hardcodes "Lingo.dev" when parsing responses for explicit BYOK
providers; update the console.error and the thrown Error to include the actual
provider identifier (e.g., use provider.id or providerName from the surrounding
scope) instead of "Lingo.dev", keep the snippet and original error (e2) in the
message for debugging, and ensure you fallback to a generic label if the
provider id is missing.
Thanks for your contribution! However, this PR references issue #1733 where you are not assigned as the owner. To contribute to this project, please find an unassigned issue and get assigned to it first, or create a new issue and get it assigned to you before submitting a PR.
@Hellnight2005 did this make it into a future version, or not?
No, this didn't make it into a release yet. The PR was automatically closed by GitHub because I wasn't assigned to issue #1733.
@Hellnight2005 should we try and press them to reopen this? I would really like this feature too.
@Hellnight2005 do you want to get assigned to the issue? Your PR got auto-closed because you weren't assigned.
@philipellis101 Thanks for your enthusiasm! Honestly, don't worry about pushing this. I created the PR with some AI help to explore a solution before making any decisions. This is only my third open-source contribution, so I'm still figuring things out. If the maintainers review the code and feel it's good, they can go ahead with it; otherwise, we'll see how it goes in the future.
@sumitsaurabh927 Yes, please! I'd love to get assigned to the issue so we can reopen the PR. Thanks for offering; I'm still learning the ropes with open-source contributions, so I really appreciate the guidance.
Just assigned; have a blast!
Hi @sumitsaurabh927, now that I'm assigned to issue #1733, would it be possible to reopen PR #1969 so it can be reviewed and merged? If reopening isn't possible, I can create a new PR from the same branch; just wanted to check what's preferred. Thanks!
@Hellnight2005 please open a new PR, thanks :)
@sumitsaurabh927 I've opened a new PR for this issue (#2055). Could you please take a look when you have time?
Summary
Implements the `--batch-size` parameter for both the `run` and `i18n` commands to control the number of translation strings sent to the LLM in a single request. This addresses the context-leaking issue where translator notes/hints from one key could influence others.
Closes #1733: Add batch size parameter to prevent context leaking of translator notes/hints
Changes
- `--batch-size <number>` flag added to the `run` and `i18n` (deprecated) commands.
- `createBasicTranslator` in `src/cli/processor/basic.ts` updated to respect the `batchSize` parameter and chunk requests accordingly.
- `createExplicitLocalizer` in `src/cli/localizer/explicit.ts` updated to support batch processing.
- Pseudo localizer fixed by adding the missing `validateSettings` method.

Verification
Unit Tests
Added `src/cli/processor/basic.spec.ts` to verify chunking logic:
- `batchSize: 1` sends individual requests (isolation).
- `batchSize > 1` creates correct chunk sizes.

Manual Verification
Ran the CLI command `lingo.dev run --batch-size 1 --pseudo`.

CI/CD
`pnpm turbo test` includes the new unit tests, ensuring the GitHub workflow (`pr-check.yml`) covers this feature.

Summary by CodeRabbit
New Features
Tests