
Add an experimental csv module exposing a streaming csv parser #3743

Merged
15 commits merged into master from streaming-csv on Sep 10, 2024

Conversation

oleiade
Member

@oleiade oleiade commented May 15, 2024

What?

This PR is a cleaned-up version of the CSV streaming parser we hacked during Crococon.

It aims to address #2976, and adds the capability to natively parse a whole CSV file into a SharedArray (without having to resort to papaparse).

Parse function

The parse function takes an fs.File instance and a set of options as input, parses the whole file as CSV, and returns a SharedArray instance containing the parsed records.

It aims to offer an experience similar to what is currently possible with the open function and papaparse, with the added benefits of:

  • consuming less memory, as it uses the new fs.open function: the file content and the parsed records are shared across VUs (albeit each being held as its own copy).
  • being faster, especially for larger files, as it is designed to bypass most of the JS runtime and parse the file directly in Go, storing the results in a SharedArray. Through our pairing sessions with @joanlopez, we profiled the execution extensively with Pyroscope, made some comparisons, and found that most of the CPU time spent parsing with papaparse into a SharedArray was spent in the JS runtime. The approach picked in this PR mitigates that.

This API trades memory for performance. The whole file content is still held in memory a couple of times, and we also hold a copy of all the file's parsed rows; however, in our benchmarks, this approach was significantly faster than using papaparse.

import { open } from 'k6/experimental/fs'
import csv from 'k6/experimental/csv'
import { scenario } from 'k6/execution'

export const options = {
	iterations: 10,
}

// Open the csv file, and parse it ahead of time.
let file;
let csvRecords;
(async function () {
	file = await open('data.csv');

	// The `csv.parse` function consumes the entire file at once, and returns
	// the parsed records as a SharedArray object.
	csvRecords = await csv.parse(file, { delimiter: ',', skipFirstLine: true, fromLine: 10, toLine: 1000})
})();


export default async function() {
	console.log(csvRecords[scenario.iterationInTest])
}

Parser

The parser results from our initial CSV parsing workshop at Crococon. Its API is specifically designed to address #2976. It exposes a Parser object and its constructor, which behave similarly to a JS iterator: calling the next method returns the next set of records, along with a done marker indicating whether there is more to consume.

The parser relies exclusively on the fs.File constructs and parses rows as they are read, instead of storing them all in memory. As such, it consumes less memory, but is also somewhat slower to parse (comparable to papaparse), as each call to next() needs to go through the whole JS runtime and event loop (as observed during our profiling sessions in Pyroscope); this makes the cost of creating/awaiting the next promise significantly bigger than the actual parsing operation.

The parser effectively trades performance for memory but offers some flexibility in parsing and interpreting the results.

import { open } from 'k6/experimental/fs'
import csv from 'k6/experimental/csv'

export const options = {
	iterations: 10,
}

let file;
let parser;
(async function () {
	file = await open('data.csv');
	parser = new csv.Parser(file, { delimiter: ',', skipFirstLine: true, fromLine: 10, toLine: 1000});
})();

export default async function() {
	const {done, value} = await parser.next();
	if (done) {
		throw new Error("No more rows to read");
	}

	console.log(value);
}

Implementation details & Open Questions

  • In order to support the module, we had to significantly reshape the internals of the fs module, so that other modules can open and manipulate files through it. The biggest part of the change was introducing an interface dedicated to the fs.File behavior we need to rely on from the outside, namely read, seek and stat: ReadSeekStater. See commit 8e782c1 for more details.
  • As our profiling investigation showed, instantiating shared arrays in JS and filling them with the results of parsing the CSV file in Go brought little benefit, since most of the execution time was still spent in the JS runtime. To make things faster, we added a Go SharedArray constructor to the Go data module, which replicates the behavior of the JS constructor in Go and effectively bypasses most of the runtime overhead. We were not sure this was the best approach; let us know if you can think of something better. See commit d5e6ebc for more details.

What's not there yet

  • Part of the initial design described in Add a streaming-based CSV parser to k6 #2976 included two concepts I haven't included here yet, as I'm not sure what the best API or performance-oriented solution would be (ideas welcome 🤝); a sketch of how both could be approximated with the existing API follows this list:
    • The ability to describe a strategy for the parser to select which lines should be picked for parsing or ignored (say, you want a file's lines to be spread evenly across all your VUs, for instance).
    • The ability to instruct the parser to cycle through the file: once it reaches the end, it restarts from the top. My main question mark is that, since this is probably already possible with the existing APIs (csv.Parser.next returns an iterator-like object with a done property, seeking through the file is possible, and re-instantiating the parser once the end is reached is an option), would we indeed want a dedicated method/API for that?
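
As a rough illustration, here is a minimal sketch of how both ideas could be approximated today with the API proposed in this PR: each VU streams over the file with its own Parser instance (the init context runs once per VU) and only acts on every Nth record, and the done marker is where a hypothetical cycle option would kick in. The partitioning logic (totalVUs, recordIndex) is purely illustrative and not part of the module.

import { open } from 'k6/experimental/fs'
import csv from 'k6/experimental/csv'
import exec from 'k6/execution'

export const options = {
	vus: 4,
	iterations: 40,
}

let file;
let parser;
(async function () {
	file = await open('data.csv');
	parser = new csv.Parser(file, { delimiter: ',', skipFirstLine: true });
})();

// Tracks how many records this VU has consumed so far, so that every Nth
// record can be attributed to a specific VU (illustrative partitioning
// strategy, not part of the module's API).
let recordIndex = 0;

export default async function () {
	const totalVUs = 4; // assumed to match options.vus in this sketch
	const vuID = exec.vu.idInTest; // 1-based VU id, not available in the init context

	while (true) {
		const { done, value } = await parser.next();
		if (done) {
			// End of file reached; a dedicated "cycle" option would restart
			// from the top here. For now, we simply stop the iteration.
			return;
		}

		// Only act on the records assigned to this VU (every Nth record).
		if (recordIndex++ % totalVUs === vuID - 1) {
			console.log(`VU ${vuID} processing record:`, value);
			return; // one record per iteration
		}
	}
}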

Why?

Using CSV files in k6 tests is a very common pattern, and until recently, doing it efficiently could prove tricky. One common issue users encounter is that JS tends to be rather slow at parsing operations. Hence, we leverage the fs module constructs and the asynchronous APIs introduced in Goja over the last year to implement a Go-based, "high-performance" streaming CSV parser.

Checklist

  • I have performed a self-review of my code.
  • I have added tests for my changes.
  • I have run linter locally (make lint) and all checks pass.
  • I have run tests locally (make tests) and all tests pass.
  • I have commented on my code, particularly in hard-to-understand areas.

Related PR(s)/Issue(s)

#2976

@oleiade oleiade self-assigned this May 15, 2024
@joanlopez
Contributor

One common issue encountered by users is that JS tends to be rather slow when performing parsing operations.

Take this as just a simple idea rather than something that's really a requirement for this pull request to move forward, but considering that you explicitly mentioned it, it would be nice to have a small benchmark for comparison.

Contributor

@joanlopez joanlopez left a comment


Thanks for giving form to what we started during the Crococon 💟

I left multiple comments as some form of initial feedback, but generally speaking I think this approach is more than okay, and from my side I'd suggest moving forward (with tests and all that) 🚀

I'm not sure how far we are from being able to publish this as an experimental module, but I guess the feedback and usage we collect from users during its experimental stage is what will help us answer some of the open questions you left, and actually confirm whether the current approach is good enough or not.

js/modules/k6/experimental/csv/csv.js (review comment, resolved)
@oleiade
Member Author

oleiade commented May 27, 2024

Posting here a summary of the use-cases we discussed privately, and that we'd like the module to tackle:

  1. As a user, I want to read a CSV file containing 1000 credentials, and have each credential processed by a single iteration.
    • no credential should be processed more than once
    • unless the parser is explicitly instructed to restart from the beginning; in that scenario, the same credential can be processed multiple times.
    • if the option is not set, and the user calls parser.next() after all credentials are consumed, they keep getting a { done: true, value: undefined } response.
  2. As a user, I want to read a CSV file containing 1000 credentials, and have each subset of those credentials reserved for processing by a single VU.
    • the subset of credentials could be, for instance, a chunk: credentials 0-100 go to VU 1, 101-200 go to VU 2, etc.
    • the subset of credentials could be every Nth credential: 0, 10, 20, 30, etc. go to VU 1; 1, 11, 21, 31, etc. go to VU 2; etc.
    • this is possible with the existing SharedArray approach, but it needs a faster way of processing the rows.
  3. As a user, I want each iteration to stream through my CSV file, and have the ability to act upon each returned record (see the sketch after this list).
    • the user has the ability to skip a record, or to stop the iteration, based on the content of the record or the line number.
    • this assumes that each iteration needs the whole content of the file to perform its test.
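
For the third use case, here is a minimal sketch of what acting on (or skipping) each streamed record could look like with the Parser proposed in this PR; the credentials file name, the assumption that value holds the record's fields, and the 'disabled' status check are all hypothetical. Restarting the stream on every iteration would require re-instantiating the parser, which ties into the open questions above.

import { open } from 'k6/experimental/fs'
import csv from 'k6/experimental/csv'

let file;
let parser;
(async function () {
	file = await open('credentials.csv');
	parser = new csv.Parser(file, { delimiter: ',', skipFirstLine: true });
})();

export default async function () {
	while (true) {
		const { done, value } = await parser.next();
		if (done) {
			// Reached the end of the file: stop the iteration.
			break;
		}

		// Skip a record based on its content (hypothetical condition, assuming
		// `value` holds the record's fields and the first one is a status).
		if (value[0] === 'disabled') {
			continue;
		}

		// Act upon the record, e.g. use it to drive a request.
		console.log(value);
	}
}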

oleiade and others added 4 commits September 9, 2024 15:52
Co-authored-by: Joan López de la Franca Beltran <5459617+joanlopez@users.noreply.github.com>
Contributor

@joanlopez joanlopez left a comment


Awesome! 🚀 💜

@@ -48,7 +51,7 @@ func (f *file) Read(into []byte) (n int, err error) {

 	// Check if we have reached the end of the file
 	if currentOffset == fileSize {
-		return 0, newFsError(EOFError, "EOF")
+		return 0, io.EOF
Contributor


I might be wrong since I am missing the module context, but this seems like a breaking change, since later on there is logic that depends on handling this error type.

Member Author

@oleiade oleiade Sep 10, 2024


For this specific case, I think it isn't: this is the "private" file construct and, as far as I can tell, is never exposed directly to the JS runtime. By the looks of it, we have also updated all the places calling this code to expect io.EOF instead of an fsError too 🙇🏻

Contributor


Sorry, I wasn't clear; I meant that the breaking change is that previously, the read method resolved with null in case of EOF, whereas after the changes, it's probably resolved with the EOF error.

Contributor


Anyway, keeping in mind that both modules are experimental, I'm leaving it to you @oleiade to decide whether it should be fixed, and when. The rest of the PR looks good to me 👍

Member Author


Oh, I see what you mean. Thanks for clarifying, indeed 👍🏻

I reverified, and we're safe (no breaking changes), but I admit the code does not convey what happens clearly enough.

The module's Read method callback checks whether the error returned from File.Read is io.EOF, and if it is, rewraps it as an FsError (which we do expose to the runtime) with its kind set to EOFError. At a later point in the code, if the error is an FsError with its kind set to EOFError, it returns null. So the behavior is unchanged.

We already have a test asserting that, but I'm taking a note for the cooldown period to see if I can improve the code and make it easier to read and understand 🙇🏻
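
For readers less familiar with the fs module, here is a minimal sketch of the JS-facing contract being preserved, assuming read takes a buffer and resolves with the number of bytes read, or null once the end of the file is reached, as described in this thread; data.csv is a placeholder.

import { open } from 'k6/experimental/fs'

let file;
(async function () {
	file = await open('data.csv');
})();

export default async function () {
	const buffer = new Uint8Array(128);

	while (true) {
		// `read` resolves with the number of bytes read, or with `null` once
		// the end of the file is reached; the io.EOF rework discussed above
		// keeps this contract intact.
		const bytesRead = await file.read(buffer);
		if (bytesRead === null) {
			break;
		}

		console.log(`read ${bytesRead} bytes`);
	}
}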

@oleiade oleiade requested review from codebien and removed request for codebien September 10, 2024 10:11
@oleiade oleiade merged commit 93667ec into master Sep 10, 2024
23 checks passed
@oleiade oleiade deleted the streaming-csv branch September 10, 2024 10:27
Labels
documentation-needed (A PR which will need a separate PR for documentation), feature