Add an experimental csv module exposing a streaming csv parser #3743
Conversation
Take this as just a simple idea rather than something that's really a requirement for this pull request to move forward, but considering that you explicitly mentioned it, it would be nice to have a small benchmark for comparison.
Thanks for giving form to what we started during the Crococon 💟
I left multiple comments as some form of initial feedback, but generally speaking I think this approach is more than okay, and from my side I'd suggest moving forward (with tests and all that) 🚀
I'm not sure how far we are from being able to publish this as an experimental module, but I guess it's precisely the feedback and usage we will collect from users during its experimental stage that will help us answer some of the open questions you left, and actually confirm whether the current approach is good enough or not.
Posting here a summary of the use-cases we discussed privately, and that we'd like the module to tackle:
Co-authored-by: Joan López de la Franca Beltran <5459617+joanlopez@users.noreply.github.com>
Awesome! 🚀 💜
```diff
@@ -48,7 +51,7 @@ func (f *file) Read(into []byte) (n int, err error) {
 	// Check if we have reached the end of the file
 	if currentOffset == fileSize {
-		return 0, newFsError(EOFError, "EOF")
+		return 0, io.EOF
```
I might be wrong since I'm missing the module context, but this seems like a breaking change, since later on there is logic that depends on handling this error type.
For this specific case, I think it isn't: this is the "private" file construct and, as far as I can tell, it is never exposed directly to the JS runtime. By the looks of it, we have also updated all the places calling this code to expect `io.EOF` instead of an `fsError` 🙇🏻
Sorry, I wasn't clear; I meant that the breaking change is that previously the read method resolved with null on EOF, whereas after the changes it probably resolves with the EOF error.
Anyway, keeping in mind that both modules are experimental, I'm leaving the decision to you @oleiade on whether it should be fixed, and when. The rest of the PR looks good to me 👍
Oh, I see what you mean. Thanks for clarifying, indeed 👍🏻
I reverified, and we're safe (no breaking changes), but I admit the code does not convey what happens clearly enough.
The module's `Read` method callback checks if the error returned from `File.Read` is `io.EOF`, and if it is, it rewraps it as an `FsError` (which we do expose to the runtime) with its kind set to `EOFError`. At a later point in the code, if the error is an `FsError` with its kind set to `EOFError`, then it returns null. So the behavior is unchanged.
We have a test to assert that already, but I'm taking a note for the cooldown period to see if I can improve it and make it easier to read and understand 🙇🏻
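For illustration, here is a minimal sketch of the user-facing invariant (hedged: the file name is a placeholder, and the null-on-EOF resolution is the behavior described above, not new API):

```javascript
import { open } from 'k6/experimental/fs';

let file;
(async function () {
  // open() runs in the init context; 'data.csv' is a placeholder path.
  file = await open('data.csv');
})();

export default async function () {
  const buffer = new Uint8Array(128);
  let bytesRead;
  // read() resolves with the number of bytes read, and with null once EOF
  // is reached: internally io.EOF is rewrapped as an FsError of kind
  // EOFError, which the read callback then translates to null.
  while ((bytesRead = await file.read(buffer)) !== null) {
    // process buffer.subarray(0, bytesRead)
  }
}
```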
What?
This PR is a cleaned-up version of the CSV streaming parser we hacked during Crococon.
It aims to address #2976, and adds the capability to parse a whole CSV file as a `SharedArray` natively (without having to resort to papaparse).

Parse function
The `parse` function takes a `fs.File` instance as input, as well as options, parses the whole file as CSV, and returns a `SharedArray` instance containing the parsed records.

It aims to offer a similar experience to what is currently possible with the `open` function and `papaparse`, with the added benefits of:
- relying on the `fs.open` function: the file content and the parsed records will be shared across VUs, too (albeit a copy in itself).
- performing the parsing and the `SharedArray` construction in Go. Through our pairing sessions with @joanlopez we profiled the execution extensively with Pyroscope. We made some comparisons, and found out that most of the CPU time spent parsing with papaparse into a `SharedArray` was spent in the JS runtime. The approach picked in this PR mitigates that.

This API trades memory for performance. The whole file content will still be held in memory a couple of times, and we'll also hold a copy of all the file's parsed rows; however, in our benchmark, this approach was significantly faster than using papaparse.
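To make that concrete, here is a hedged usage sketch; the `k6/experimental/csv` import path and the `delimiter` option name are assumptions based on this description, not confirmed API:

```javascript
import { open } from 'k6/experimental/fs';
import csv from 'k6/experimental/csv';

let file;
let records;
(async function () {
  file = await open('data.csv'); // placeholder path
  // Parses the whole file in Go and returns a SharedArray of records,
  // shared across VUs (albeit a copy in itself).
  records = await csv.parse(file, { delimiter: ',' });
})();

export default function () {
  // Every VU reads from the same underlying shared records.
  console.log(records[0]);
}
```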
Parser
The parser results from our initial CSV parsing workshop at Crococon. Its API is specifically designed to address #2976. It exposes a Parser object and its constructor, which behave similarly to a JS iterator: its `next` method can be called and returns the next set of records, as well as a `done` marker indicating whether there is more to consume.

The parser relies exclusively on the `fs.File` constructs and parses rows as it goes, instead of storing them all in memory. As such, it consumes less memory, but is also somewhat slower to parse (comparable to papaparse), as each call to `next()` needs to go through the whole JS runtime and event loop (observed during our profiling sessions in Pyroscope), making the cost of creating/awaiting the `next` promise significantly bigger than the actual parsing operation.

The parser effectively trades performance for memory, but offers some flexibility in parsing and interpreting the results.
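And a hedged sketch of the streaming side; the constructor signature is an assumption derived from the description above:

```javascript
import { open } from 'k6/experimental/fs';
import csv from 'k6/experimental/csv';

let file;
let parser;
(async function () {
  file = await open('data.csv'); // placeholder path
  // The parser streams records from the file instead of holding them
  // all in memory.
  parser = new csv.Parser(file);
})();

export default async function () {
  // next() behaves like a JS iterator step: { done, value }.
  const { done, value } = await parser.next();
  if (done) {
    return;
  }
  console.log(value);
}
```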
Implementation details & Open Questions
- We made some changes to the `fs` module in order to facilitate opening and manipulating files using it from another module. The biggest part of the change was to introduce an interface specific to the `fs.File` behavior that we needed to rely on from the outside, namely read, seek and stat: `ReadSeekStater`. See commit 8e782c1 for more details.
- We added a `SharedArray` constructor to the Go data module that allows us to replicate the behavior of the JS constructor in Go, and effectively bypass most of the runtime overhead. We were not sure this was the best approach, let us know if you think of something better. See commit d5e6ebc for more details.

What's not there yet
- A dedicated way to restart the parser, possibly: considering that `csv.Parser.next` returns an iterator-like object with a `done` property, seeking through the file is possible, and re-instantiating the parser once the end is reached is an option, would we indeed want to have a dedicated method/API for that?

Why?
Using CSV files in k6 tests is a very common pattern, and until recently, doing it efficiently could prove tricky. One common issue users encounter is that JS tends to be rather slow when performing parsing operations. Hence, we are leveraging the `fs` module constructs and the asynchronous APIs introduced in Goja over the last year to implement a Go-based, "high-performance" streaming CSV parser.

Checklist
- I have run the linter locally (`make lint`) and all checks pass.
- I have run tests locally (`make tests`) and all tests pass.

Related PR(s)/Issue(s)
#2976