nes-datagen: stream multi-GB inputs/outputs and switch output to JSON Lines#320089
Merged
ulugbekna merged 7 commits intoJun 5, 2026
Merged
Conversation
…imit Reading the whole input via fs.readFile fails for files larger than 2 GiB (and exceeds V8's max string length). Add a streaming JSON-array parser and use it in both the sequential and parallel pipeline paths so multi-GB recordings can be processed with bounded memory. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Auto-detect the input format from the first non-whitespace character: a leading '[' is parsed as a single JSON array, otherwise the file is parsed as JSON Lines (one JSON object per line). Both formats are streamed so multi-GB inputs work regardless of shape. Rename streamJsonArray -> streamJsonRecords to reflect the broader purpose. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use the file extension (.jsonl/.ndjson -> JSON Lines, otherwise JSON array) to select the streaming parser instead of sniffing the content. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ed input The new streaming parser previously accepted any prefix of a JSON array silently: a truncated file (no closing ']'), a missing element between commas, a trailing comma or trailing data after the array all produced zero or fewer records rather than an error. That is especially dangerous for the multi-GB inputs this parser was introduced for, because the underlying file is much more likely to be incomplete. Tighten the state machine to surface these as errors, matching what the old whole-file JSON.parse would have done, and add tests for each case. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
For multi-GB inputs the parent process was still hitting V8's ~512 MiB max-string-length limit in two places after the input-side fix: 1. Merging worker result files used fs.promises.readFile + JSON.parse on each result, but with a 5+ GB input split N ways each per-worker file is hundreds of MB of similar-shaped data and easily exceeds the string limit. 2. writeSamples serialized the entire validSamples array via a single JSON.stringify(arr, null, 2) before writing, which has the same problem on output. Switch both to stream over individual records: - A new shared openWriteStream(filePath) helper wraps fs.createWriteStream, attaches an 'error' listener immediately (so async write failures don't surface as uncaughtException and skip cleanup), awaits backpressure via the per-write callback, and exposes an idempotent close(). - writeChunkFiles uses the helper inside a try/finally so any mid-stream ENOSPC/EIO bubbles up cleanly and the tmp dir is still removed. - The merge step now uses streamJsonRecords<ISample>(resultPath), so the parent never materializes a single worker's output as one string. - writeSamples emits the output JSON array incrementally: per-sample JSON.stringify(..., null, 2) (indented two spaces to match the previous layout) joined with ',\n'. Byte size is accumulated for the existing IWriteResult.fileSize. Also documents that single-process loadAndParseInput still buffers the full row set in memory and that --parallelism is required for very large inputs (workers each only load their slice). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both the final user-facing output and the per-worker intermediate result
files now use JSON Lines (one record per line) instead of a
pretty-printed JSON array. JSONL is dramatically simpler to write and
read incrementally: no surrounding brackets/commas to track, no
multi-line per-element indentation, just JSON.stringify + '\n' per
record on the write side and split-on-newline + JSON.parse per non-empty
line on the read side (this is what streamJsonRecords already does when
it detects the .jsonl extension).
Changes:
- writeSamples emits one JSON.stringify(sample) + '\n' per validated
sample via no array wrapper, no pretty-printing.openWriteStream
- resolveOutputPath defaults the implicit output to <input>_output.jsonl
(was <input>_output.json).
- Per-worker result files in runInputPipelineParallel are now
result_${w}.jsonl, so the merge step's streamJsonRecords auto-picks
the JSONL parser from the extension.
- E2e tests updated to read JSONL (split on newline, JSON.parse per
line) and to use .jsonl output paths.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… text Two review follow-ups: 1. The merge step that streams each worker's result file used to wrap the iteration in a try/catch that downgraded any parse error to a console.error warning. With the new streaming reader that is unsafe: streamJsonRecords yields N valid records first and then throws on a malformed/truncated tail, leaving those N partial records already in allSamples. The pipeline would then quietly emit a truncated training-data output. Drop the swallowing try/catch so a corrupt worker result aborts the run non-zero. 2. The --out help text in simulationOptions.ts still advertised the old JSON-array default (<input>_output.json). Update it to reflect the JSONL output, and also note in --input that the format is inferred from the .jsonl/.ndjson extension. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates the extensions/copilot/test/pipeline “nes-datagen” tooling to handle multi-GB inputs/outputs by moving from whole-file JSON parsing/writing to streaming record processing, and switches generated outputs to JSON Lines to avoid Node/V8 size limits.
Changes:
- Added streaming JSON record reader (
streamJsonRecords) supporting JSON arrays and JSON Lines with extension-based format inference. - Added a streaming file writer helper (
openWriteStream) and used it to incrementally write chunk files and JSONL outputs. - Updated pipeline + e2e tests to consume JSONL outputs and added unit tests for the new helpers.
Show a summary per file
| File | Description |
|---|---|
| extensions/copilot/test/pipeline/writeStream.ts | New write-stream wrapper used to stream output/chunk file writes. |
| extensions/copilot/test/pipeline/test/writeStream.spec.ts | Unit tests for incremental writes, idempotent close, and async FS errors. |
| extensions/copilot/test/pipeline/streamJsonRecords.ts | New streaming reader for JSON array / JSONL inputs. |
| extensions/copilot/test/pipeline/test/streamJsonRecords.spec.ts | Unit tests covering JSON array + JSONL parsing and malformed/truncation cases. |
| extensions/copilot/test/pipeline/pipeline.ts | Parallel pipeline updated to stream-count, stream-split into chunk files, and stream-merge JSONL worker results. |
| extensions/copilot/test/pipeline/parseInput.ts | Input parsing changed from readFile+JSON.parse to streaming record iteration. |
| extensions/copilot/test/pipeline/output.ts | Output switched to JSON Lines and written incrementally via openWriteStream. |
| extensions/copilot/test/base/simulationOptions.ts | CLI help text updated to document JSONL input inference and JSONL output default. |
| extensions/copilot/test/pipeline/test/pipeline.spec.ts | Updated expected output filename extension to .jsonl. |
| extensions/copilot/test/pipeline/test/pipeline.e2e.spec.ts | E2e tests updated to parse JSONL output via helper. |
Copilot's findings
- Files reviewed: 10/10 changed files
- Comments generated: 5
alexr00
approved these changes
Jun 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
nes-datagenfailed withRangeError: File size (5643888945) is greater than 2 GiBwhen given multi-GB inputs. Beyond Node's 2 GiBreadFilelimit, V8 also caps strings at ~512 MiB, so the previous approach (whole-filereadFile+JSON.parse, thenJSON.stringifythe entire output array) was wrong even for inputs that fit under the 2 GiB ceiling once the input/output grew past ~512 MB.output flow streaming, and switches the output (and per-worker intermediate result files) to JSON Lines so the streaming reader/writer stay trivially simple.
Changes
JSON Lines, otherwise JSON array). Strictly validates truncation, missing elements between commas, trailing commas, and trailing matches
JSON.parse's failure modes.datawriteStream.ts): wrapsfs.createWriteStream, attaches'error'listener immediately (no moreuncaughtExceptionwindow that would skip tmp-dir cleanup), awaits backpressure via the per-write callback, idempotentclose().runInputPipelineParallel: usesstreamJsonRecordsto count records, then writes per-worker chunk files incrementally via the helper inside atry/finally. The merge step also usesstreamJsonRecords<ISample>(resultPath)instead ofreadFile+JSON.parse, so per-worker result files (~hundreds of MB at this input size) no longer trip the V8 string cap. Parse errors during merge now abort the run non-zero rather than silently truncating the output.writeSamples(output.ts): emits oneJSON.stringify(sample) + '\n'per validated sample viaopenWriteStream. No array wrapper, no pretty-printing. Default output filename is now<input>_output.jsonl.loadAndParseInput: usesstreamJsonRecords. Doc comment notes that this still buffersrowsin memory for single-process mode;--parallelismis required for very large inputs since each worker only loads its slice.--inputnow documents that the format is inferred from.jsonl/.ndjson;--outadvertises JSONL /.jsonl.Tests
streamJsonRecords.spec.tscovering JSON-array, JSONL, format inference, truncation, and malformed-input rejection.writeStream.spec.tscovering normal writes, idempotent close, andENOENT-style async errors surfacing as rejections (notuncaughtException).pipeline.e2e.spec.ts) updated to parse JSONL via aparseJsonl<T>helper; all 20 still pass.Verification:
tsgoclean,eslint --max-warnings=0clean, 49/49 pipeline tests pass (1 unrelated test skipped).Reviews
Both a code-review and an explore subagent were run on the final state of the branch. Findings (all addressed in the last commit):
try/catch; corrupt worker result aborts the run.--outhelp text still advertised the old_output.jsondefault.The explore subagent confirmed no external consumer (CI script, package.json script, README, etc.) depends on the old pretty-printed JSON-array format.