Add an experimental csv module exposing a streaming csv parser #3743
Conversation
Take this as just a simple idea rather than something that's really a requirement for this pull request to move forward, but considering that you explicitly mentioned it, it would be nice to have a small benchmark for comparison.
Thanks for giving form to what we started during the Crococon 💟
I left multiple comments as some form of initial feedback, but generally speaking I think this approach is more than okay, and from my side I'd suggest moving forward (with tests and all that) 🚀
I'm not sure how far we are from being able to publish this as an experimental module, but I guess it's precisely the feedback and usage we will collect from users during its experimental stage that will help us answer some of the open questions you left, and actually confirm whether the current approach is good enough or not.
Posting here a summary of the use-cases we discussed privately, and that we'd like the module to tackle:
Co-authored-by: Joan López de la Franca Beltran <5459617+joanlopez@users.noreply.github.com>
Awesome! 🚀 💜
```diff
@@ -48,7 +51,7 @@ func (f *file) Read(into []byte) (n int, err error) {
 	// Check if we have reached the end of the file
 	if currentOffset == fileSize {
-		return 0, newFsError(EOFError, "EOF")
+		return 0, io.EOF
```
I might be wrong since I'm missing the module context, but this seems like a breaking change, since later on there is logic that depends on handling this error type.
For this specific case, I think it isn't: this is the "private" file construct and, as far as I can tell, it is never exposed directly to the JS runtime. By the looks of it, we have also updated all the places calling this code to expect `io.EOF` instead of an `fsError` 🙇🏻
Sorry, I wasn't clear; I meant that the breaking change is that previously the read method resolved with null on EOF, whereas after the changes it probably resolves with the EOF error.
Anyway, keeping in mind that both modules are experimental, I'm leaving the decision to you @oleiade on whether it should be fixed, and when. The rest of the PR looks good to me 👍
Oh, I see what you mean. Thanks for clarifying, indeed 👍🏻
I reverified, and we're safe (no breaking changes), but I admit the code does not convey what happens clearly enough.
The module's `Read` method callback checks if the error returned from `File.Read` is `io.EOF`, and if it is, it rewraps it as an `FsError` (which we do expose to the runtime) with its kind set to `EOFError`. At a later point in the code, if the error is an `FsError` with its kind set to `EOFError`, then it returns null. So the behavior is unchanged.
We have a test to assert that already, but I'm taking a note for the cooldown period to see if I can improve it and make it easier to read and understand 🙇🏻
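For illustration, here is a minimal sketch of the user-facing invariant (hedged: the file name is a placeholder, and the null-on-EOF resolution is the behavior described above, not new API):

```javascript
import { open } from 'k6/experimental/fs';

let file;
(async function () {
  // open() runs in the init context; 'data.csv' is a placeholder path.
  file = await open('data.csv');
})();

export default async function () {
  const buffer = new Uint8Array(128);
  let bytesRead;
  // read() resolves with the number of bytes read, and with null once EOF
  // is reached: internally io.EOF is rewrapped as an FsError of kind
  // EOFError, which the read callback then translates to null.
  while ((bytesRead = await file.read(buffer)) !== null) {
    // process buffer.subarray(0, bytesRead)
  }
}
```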
What?
This PR is a cleaned-up version of the CSV streaming parser we hacked during Crococon.
It aims to address #2976, and adds the capability to parse a whole CSV file as a `SharedArray` natively (without having to resort to papaparse).

Parse function
The `parse` function takes a `fs.File` instance as input, as well as options, parses the whole file as CSV, and returns a `SharedArray` instance containing the parsed records.

It aims to offer a similar experience to what is currently possible with the `open` function and `papaparse`, with the added benefits of:
- relying on the `fs.open` function: the file content and the parsed records will be shared across VUs, too (albeit a copy in itself).
- performing the parsing and the `SharedArray` construction in Go. Through our pairing sessions with @joanlopez we profiled the execution extensively with Pyroscope. We made some comparisons, and found out that most of the CPU time spent parsing with papaparse into a `SharedArray` was spent in the JS runtime. The approach picked in this PR mitigates that.

This API trades memory for performance. The whole file content will still be held in memory a couple of times, and we'll also hold a copy of all the file's parsed rows; however, in our benchmark, this approach was significantly faster than using papaparse.
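To make that concrete, here is a hedged usage sketch; the `k6/experimental/csv` import path and the `delimiter` option name are assumptions based on this description, not confirmed API:

```javascript
import { open } from 'k6/experimental/fs';
import csv from 'k6/experimental/csv';

let file;
let records;
(async function () {
  file = await open('data.csv'); // placeholder path
  // Parses the whole file in Go and returns a SharedArray of records,
  // shared across VUs (albeit a copy in itself).
  records = await csv.parse(file, { delimiter: ',' });
})();

export default function () {
  // Every VU reads from the same underlying shared records.
  console.log(records[0]);
}
```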
Parser
The parser results from our initial CSV parsing workshop at Crococon. Its API is specifically designed to address #2976. It exposes a Parser object and its constructor, which behave similarly to a JS iterator: its `next` method can be called and returns the next set of records, as well as a `done` marker indicating whether there is more to consume.

The parser relies exclusively on the `fs.File` constructs and parses rows as it goes, instead of storing them all in memory. As such, it consumes less memory, but is also somewhat slower to parse (comparable to papaparse), as each call to `next()` needs to go through the whole JS runtime and event loop (observed during our profiling sessions in Pyroscope), making the cost of creating/awaiting the `next` promise significantly bigger than the actual parsing operation.

The parser effectively trades performance for memory, but offers some flexibility in parsing and interpreting the results.
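And a hedged sketch of the streaming side; the constructor signature is an assumption derived from the description above:

```javascript
import { open } from 'k6/experimental/fs';
import csv from 'k6/experimental/csv';

let file;
let parser;
(async function () {
  file = await open('data.csv'); // placeholder path
  // The parser streams records from the file instead of holding them
  // all in memory.
  parser = new csv.Parser(file);
})();

export default async function () {
  // next() behaves like a JS iterator step: { done, value }.
  const { done, value } = await parser.next();
  if (done) {
    return;
  }
  console.log(value);
}
```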
Implementation details & Open Questions
- We made some changes to the `fs` module in order to facilitate opening and manipulating files using it from another module. The biggest part of the change was to introduce an interface specific to the `fs.File` behavior that we needed to rely on from the outside, namely read, seek and stat: `ReadSeekStater`. See commit 8e782c1 for more details.
- We added a `SharedArray` constructor to the Go data module that allows us to replicate the behavior of the JS constructor in Go, and effectively bypass most of the runtime overhead. We were not sure this was the best approach, let us know if you think of something better. See commit d5e6ebc for more details.

What's not there yet
- A dedicated way to restart the parser, possibly: considering that `csv.Parser.next` returns an iterator-like object with a `done` property, seeking through the file is possible, and re-instantiating the parser once the end is reached is an option, would we indeed want to have a dedicated method/API for that?

Why?
Using CSV files in k6 tests is a very common pattern, and until recently, doing it efficiently could prove tricky. One common issue users encounter is that JS tends to be rather slow when performing parsing operations. Hence, we are leveraging the `fs` module constructs and the asynchronous APIs introduced in Goja over the last year to implement a Go-based, "high-performance" streaming CSV parser.

Checklist
- I have run the linter locally (`make lint`) and all checks pass.
- I have run tests locally (`make tests`) and all tests pass.

Related PR(s)/Issue(s)
#2976