Skip to content

fix: CSV/TSV reader tracks quote state across newlines (0.12.1)#435

Merged
danieljohnmorris merged 4 commits into
mainfrom
fix/csv-multiline-quoted-roundtrip
May 19, 2026
Merged

fix: CSV/TSV reader tracks quote state across newlines (0.12.1)#435
danieljohnmorris merged 4 commits into
mainfrom
fix/csv-multiline-quoted-roundtrip

Conversation

@danieljohnmorris
Copy link
Copy Markdown
Collaborator

Summary

The CSV/TSV reader split content on \n before tracking quote state, so a cell containing an embedded newline (which the writer correctly emits as a quoted multi-line field per RFC 4180) was re-parsed as two rows. ilo silently mis-read CSV it had just written. csv-pipeline rerun10 caught the drift: wrote 4 rows, read back 5.

Both the tree-walker (parse_format) and the VM (vm_parse_format) had the same bug, so the fix lands in both: replace the line-based split with a single-pass scanner that tracks in_quotes across \n / \r\n. Same return shape, no caller changes needed.

Repro

Before:

wrote 4 rows, read back 5 rows (the multi-line cell split into two)

After (regression tests pin this):

wrote 4 rows, read back 4 rows; multi-line cell intact byte-for-byte

What's in the diff

  1. fix(interpreter): scan CSV/TSV in one pass with quote-state tracking - new parse_csv_content plus unit tests for multi-line cells, escaped quotes inside multi-line cells, CRLF, embedded commas, mixed quoted/unquoted in one row, empty trailing fields, empty quoted fields, UTF-8 BOM (pinned as preserved), unterminated quoted fields, and a full writer round-trip.
  2. fix(vm): scan CSV/TSV in one pass to match interpreter scanner - mirror of the tree-walker change in vm_parse_csv_content. The VM is the default engine on 0.12.x so every csv-pipeline run hit this.
  3. test: cross-engine CSV multi-line round-trip regression + example - tests/regression_csv_multiline_roundtrip.rs runs wr! then rd! across tree and VM, asserting row count + cell value + quote+newline + a plain negative control. examples/csv-multiline-roundtrip.ilo covers row count, cell length, embedded quote + newline, and CRLF inside a quoted cell through the examples_engines harness.
  4. docs: changelog 0.12.1 entry for CSV multi-line round-trip fix

Test plan

  • cargo test --release --features cranelift - full suite green
  • cargo fmt && cargo clippy --release --features cranelift -- -D warnings - clean
  • New unit tests in parse_csv_content pass (11 cases)
  • Cross-engine regression tests pass on tree + VM
  • examples_engines harness picks up csv-multiline-roundtrip.ilo (4 cases x 2 engines = 8 assertions)
  • CI green

Doc-sync

This is a behaviour-restoration fix with no surface change: SPEC.md, ai.txt, SKILL.md, and the site builtins reference don't need edits. CHANGELOG.md gets the 0.12.1 Fixed entry above.

Follow-ups

  • None blocking. Future-but-not-now: the UTF-8 BOM is currently preserved into the first cell (test pins this). If users start hitting Excel-exported CSVs at scale, a one-shot BOM strip at the front of parse_csv_content would be a friendly addition - but it's a deliberate behaviour change, not a bugfix, so it lives as a separate ticket if and when.

Proposed new persona

csv-roundtrip - a thin persona focused exclusively on the writer/reader contract for tabular data. csv-pipeline already covers the messy-input direction (broken headers, ragged rows, type coercion), but the round-trip direction is a different failure mode: ilo produces output that ilo itself cannot read back. A dedicated persona that does nothing except wr -> rd -> assert equal across a corpus of cells (embedded \n, embedded ", embedded sep, embedded CRLF, empty cells, leading/trailing whitespace, unicode, BOM, very long cells, very many cells) would have caught this bug on the day it landed, and would catch any future writer/reader drift before it ships. Worth adding to the persona suite when capacity allows.

parse_format split content via content.lines() before tracking quote
state, so a cell containing \n (which write_csv_tsv correctly emits as
a quoted multi-line field per RFC 4180) was re-parsed as two rows. ilo
silently mis-read CSV it had just written. csv-pipeline rerun10 caught
the round-trip drift: wrote 4 rows, read back 5.

Replace the line-by-line approach with parse_csv_content: a single-pass
scanner that walks the whole document, treats \n / \r\n as record
separators only when out of quotes, keeps embedded newlines and CRLF
verbatim inside quoted cells, and never emits a phantom trailing row
for files that end in \n. Same signature shape going in, Vec<Vec<String>>
coming out.

Unit tests cover multi-line cells, escaped quotes inside multi-line
cells, CRLF row separators, CRLF inside a quoted field, embedded commas,
mixed quoted/unquoted cells in one row, empty trailing fields, empty
quoted fields, the UTF-8 BOM (pinned as preserved), unterminated quoted
fields, and a full write_csv_tsv -> parse_csv_content round-trip.
vm_parse_format had the same bug as the tree-walker: content.lines()
ran before quote-state tracking, so a quoted multi-line cell came back
as two rows. The VM is the default engine on 0.12.x so any csv-pipeline
user hit this on every read.

vm_parse_csv_content mirrors interpreter::parse_csv_content -- single
pass, in_quotes state, \n and \r\n as record separators outside quotes,
preserved inside. Both code paths now use the same algorithm so a write
on one engine and a read on another stays byte-stable.
tests/regression_csv_multiline_roundtrip.rs drives wr! then rd! across
tree and VM. Pins the row count, the multi-line cell value, the
quote+newline combined case, and the plain (no-newline) negative
control so a future regression in either engine fails loudly.

examples/csv-multiline-roundtrip.ilo gives agents an in-context example
of the now-correct behaviour and rides the existing examples_engines
harness so every engine runs it on each test pass. Covers a row-count
round-trip, a single-cell length round-trip, an embedded-quote +
embedded-newline cell, and a CRLF cell.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 96.47887% with 5 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/vm/mod.rs 78.26% 5 Missing ⚠️

📢 Thoughts on this report? Let us know!

@danieljohnmorris danieljohnmorris merged commit ed0ce91 into main May 19, 2026
5 checks passed
@danieljohnmorris danieljohnmorris deleted the fix/csv-multiline-quoted-roundtrip branch May 19, 2026 17:51
danieljohnmorris added a commit that referenced this pull request May 19, 2026
…val_inline strings

#435 added regression_csv_multiline_roundtrip.rs with an engines
list that still included --run-tree, which the soft-deprecate
removes. VM round-trip alone covers the bug fix scope. Also
refreshed an error-message string in eval_inline.rs that still
referenced --run-tree after the prior sweep renamed the actual
flag to --run-vm.
danieljohnmorris added a commit that referenced this pull request May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant