fix: CSV/TSV reader tracks quote state across newlines (0.12.1)#435
Merged
Conversation
parse_format split content via content.lines() before tracking quote state, so a cell containing \n (which write_csv_tsv correctly emits as a quoted multi-line field per RFC 4180) was re-parsed as two rows. ilo silently mis-read CSV it had just written. csv-pipeline rerun10 caught the round-trip drift: wrote 4 rows, read back 5. Replace the line-by-line approach with parse_csv_content: a single-pass scanner that walks the whole document, treats \n / \r\n as record separators only when out of quotes, keeps embedded newlines and CRLF verbatim inside quoted cells, and never emits a phantom trailing row for files that end in \n. Same signature shape going in, Vec<Vec<String>> coming out. Unit tests cover multi-line cells, escaped quotes inside multi-line cells, CRLF row separators, CRLF inside a quoted field, embedded commas, mixed quoted/unquoted cells in one row, empty trailing fields, empty quoted fields, the UTF-8 BOM (pinned as preserved), unterminated quoted fields, and a full write_csv_tsv -> parse_csv_content round-trip.
vm_parse_format had the same bug as the tree-walker: content.lines() ran before quote-state tracking, so a quoted multi-line cell came back as two rows. The VM is the default engine on 0.12.x so any csv-pipeline user hit this on every read. vm_parse_csv_content mirrors interpreter::parse_csv_content -- single pass, in_quotes state, \n and \r\n as record separators outside quotes, preserved inside. Both code paths now use the same algorithm so a write on one engine and a read on another stays byte-stable.
tests/regression_csv_multiline_roundtrip.rs drives wr! then rd! across tree and VM. Pins the row count, the multi-line cell value, the quote+newline combined case, and the plain (no-newline) negative control so a future regression in either engine fails loudly. examples/csv-multiline-roundtrip.ilo gives agents an in-context example of the now-correct behaviour and rides the existing examples_engines harness so every engine runs it on each test pass. Covers a row-count round-trip, a single-cell length round-trip, an embedded-quote + embedded-newline cell, and a CRLF cell.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
danieljohnmorris
added a commit
that referenced
this pull request
May 19, 2026
…val_inline strings #435 added regression_csv_multiline_roundtrip.rs with an engines list that still included --run-tree, which the soft-deprecate removes. VM round-trip alone covers the bug fix scope. Also refreshed an error-message string in eval_inline.rs that still referenced --run-tree after the prior sweep renamed the actual flag to --run-vm.
danieljohnmorris
added a commit
that referenced
this pull request
May 19, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The CSV/TSV reader split content on
\nbefore tracking quote state, so a cell containing an embedded newline (which the writer correctly emits as a quoted multi-line field per RFC 4180) was re-parsed as two rows. ilo silently mis-read CSV it had just written. csv-pipeline rerun10 caught the drift: wrote 4 rows, read back 5.Both the tree-walker (
parse_format) and the VM (vm_parse_format) had the same bug, so the fix lands in both: replace the line-based split with a single-pass scanner that tracksin_quotesacross\n/\r\n. Same return shape, no caller changes needed.Repro
Before:
After (regression tests pin this):
What's in the diff
fix(interpreter): scan CSV/TSV in one pass with quote-state tracking- newparse_csv_contentplus unit tests for multi-line cells, escaped quotes inside multi-line cells, CRLF, embedded commas, mixed quoted/unquoted in one row, empty trailing fields, empty quoted fields, UTF-8 BOM (pinned as preserved), unterminated quoted fields, and a full writer round-trip.fix(vm): scan CSV/TSV in one pass to match interpreter scanner- mirror of the tree-walker change invm_parse_csv_content. The VM is the default engine on 0.12.x so every csv-pipeline run hit this.test: cross-engine CSV multi-line round-trip regression + example-tests/regression_csv_multiline_roundtrip.rsrunswr!thenrd!across tree and VM, asserting row count + cell value + quote+newline + a plain negative control.examples/csv-multiline-roundtrip.ilocovers row count, cell length, embedded quote + newline, and CRLF inside a quoted cell through the examples_engines harness.docs: changelog 0.12.1 entry for CSV multi-line round-trip fixTest plan
cargo test --release --features cranelift- full suite greencargo fmt && cargo clippy --release --features cranelift -- -D warnings- cleanparse_csv_contentpass (11 cases)examples_enginesharness picks upcsv-multiline-roundtrip.ilo(4 cases x 2 engines = 8 assertions)Doc-sync
This is a behaviour-restoration fix with no surface change: SPEC.md, ai.txt, SKILL.md, and the site builtins reference don't need edits. CHANGELOG.md gets the 0.12.1 Fixed entry above.
Follow-ups
parse_csv_contentwould be a friendly addition - but it's a deliberate behaviour change, not a bugfix, so it lives as a separate ticket if and when.Proposed new persona
csv-roundtrip- a thin persona focused exclusively on the writer/reader contract for tabular data. csv-pipeline already covers the messy-input direction (broken headers, ragged rows, type coercion), but the round-trip direction is a different failure mode: ilo produces output that ilo itself cannot read back. A dedicated persona that does nothing exceptwr -> rd -> assert equalacross a corpus of cells (embedded\n, embedded", embedded sep, embedded CRLF, empty cells, leading/trailing whitespace, unicode, BOM, very long cells, very many cells) would have caught this bug on the day it landed, and would catch any future writer/reader drift before it ships. Worth adding to the persona suite when capacity allows.