
Initial implementation of a fast CSV model. #3997

Merged: 61 commits into jupyterlab:master on Mar 16, 2018

Conversation

@jasongrout (Contributor) commented Feb 25, 2018

This attempts to curtail memory usage by just maintaining a list of row offsets into the CSV string data.

This is the beginning of an attempt to address #3632. If someone else wants to take this up and work on it, feel free to comment here.

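To make the idea concrete, here is a minimal sketch (not the PR's actual code) of a row-offset index in TypeScript: scan the text once for row boundaries, then slice a row out of the original string only when it is requested. It assumes a plain '\n' row delimiter and ignores quoted fields that contain delimiters, both of which the real parser has to handle.

// Build an index of row start offsets into the CSV text (assumes '\n' rows).
function computeRowOffsets(data: string): number[] {
  let offsets = [0];
  for (let i = data.indexOf('\n'); i !== -1; i = data.indexOf('\n', i + 1)) {
    if (i + 1 < data.length) {
      offsets.push(i + 1);
    }
  }
  return offsets;
}

// Slice one row out of the original string only when it is requested.
function getRowFields(data: string, offsets: number[], row: number, delimiter = ','): string[] {
  let start = offsets[row];
  let end: number;
  if (row + 1 < offsets.length) {
    end = offsets[row + 1] - 1; // stop before the '\n' that starts the next row
  } else {
    end = data.endsWith('\n') ? data.length - 1 : data.length; // last row
  }
  return data.slice(start, end).split(delimiter);
}

The memory cost is then one number per row rather than one string per cell.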
@jasongrout changed the title from "Initial very naive implementation of a CSV model." to "[WIP] Initial very naive implementation of a CSV model." on Feb 25, 2018
@sccolbert (Contributor) left a comment

I think this is a great start!


// Parse first row, return that number of columns
if (region === 'body') {
if (this._offsets.length > 1) {
return this._data.slice(0, this._offsets[1]).split(this._delimiter).length;
@sccolbert (Contributor) commented Feb 25, 2018

This should ideally be computed only once, as this method is called often.

@jasongrout (Author) commented Feb 25, 2018

good point.

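A minimal sketch of the caching being suggested here (illustrative names, not the PR's code): compute the body column count from the first row once and reuse it on subsequent calls.

class RowOffsetModel {
  private _columnCount: number | null = null;

  constructor(private _data: string, private _offsets: number[], private _delimiter = ',') { }

  columnCount(): number {
    if (this._columnCount === null) {
      // Parse the first row exactly once; later calls return the cached value.
      let end = this._offsets.length > 1 ? this._offsets[1] : this._data.length;
      this._columnCount = this._data.slice(0, end).split(this._delimiter).length;
    }
    return this._columnCount;
  }
}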

switch (region) {
case 'body':
if (this._parsedRowNum !== row) {
this._parsedRow = this._data.slice(this._offsets[row], this._offsets[row + 1]).split(this._delimiter);
@sccolbert (Contributor) commented Feb 25, 2018

Rather than keeping just row offsets, you can keep starting index and length of each datum, then slice the source string each time you need the data. That would eliminate the temporary row array.

@jasongrout (Author) commented Feb 25, 2018

Are you saying to store the offset/length of each datum in the entire data? That means my current 30MB row offset array (in the 8-million-row case I tested) balloons to 30MB*(num of columns)*2.

@jasongrout (Author) commented Feb 25, 2018

I agree that this cache is assuming that data is typically asked for in row-major order - not sure how true that is. It did subjectively feel fast in the one test I tried.

@jasongrout (Author) commented Mar 6, 2018

> I agree that this cache is assuming that data is typically asked for in row-major order - not sure how true that is.

It isn't at all. The current grid loops over columns, then rows, on the idea that the data is often columnar.

@jasongrout (Author) commented Feb 28, 2018

I set up a benchmarking repo at https://github.com/jasongrout/benchmark to test out some ideas. As an update, the test parser now parses about 1 million rows/sec in Chrome and about 500k rows/sec in Firefox on my computer (using a 1-million-row by 10-column CSV file of doubles).

@jasongrout (Author) commented Mar 6, 2018

Interesting: from https://stackoverflow.com/questions/34957890/javascript-string-size-limit-256-mb-for-me-is-it-the-same-for-all-browsers, it appears that at least my Firefox has a max string length of 268,435,455 characters (about 2**28, about 256 MB on disk), and my Chrome has a max string length of about 4x that, 1,073,741,799 characters (about 2**30, about 1 GB on disk). My Safari has twice Chrome's max length, 2,147,483,647 characters (about 2**31, about 2 GB on disk). IE Edge successfully allocated a string of Safari's size (2**31-1).

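These limits are browser- and version-specific, so treat the numbers above as snapshots. One rough way to probe the limit in whatever browser you are running (added here for illustration, not something from the PR) is a binary search over 'x'.repeat(n), catching the RangeError thrown when the requested length is too long.

function probeMaxStringLength(): number {
  let lo = 0;
  let hi = Math.pow(2, 32); // strings are certainly shorter than 2**32 characters
  while (lo + 1 < hi) {
    let mid = Math.floor((lo + hi) / 2);
    try {
      'x'.repeat(mid); // throws (RangeError or allocation failure) if too long
      lo = mid;
    } catch (e) {
      hi = mid;
    }
  }
  return lo;
}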

@jasongrout (Author) commented Mar 6, 2018

Some back-of-the-envelope calculations: even for the smallest max string (2**28 characters), a worst case of 2**28 rows at 4 bytes per row offset is 2**30 bytes, i.e., about 1 GB just for the row index. I would say there's no hope of storing all the column offsets at those sizes. However, if we can parse 200-400 rows/ms (which are the numbers I'm seeing for parsing large datasets), we could still do a full refresh every frame, reparsing each row we need, if we didn't have to parse each row once per column, i.e., if we could cache the row parsing.

@sccolbert (Contributor) commented Mar 6, 2018

Assuming a flat array of ints which alternate [pos, len, pos, len, ...] for each cell, a million-row by twenty-column index should consume ~152 MB:

1,000,000 rows × 20 cols × 8 bytes (pos + len) ÷ 1,048,576 bytes/MB ≈ 152 MB

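For concreteness, the same arithmetic written out with a typed array (a Uint32Array stores each pos or len in 4 bytes, so the doubles caveat in the next comment applies to plain JS arrays rather than to this layout):

// Flat [pos, len, pos, len, ...] index for 1,000,000 rows x 20 columns.
const rows = 1000000;
const cols = 20;
const index = new Uint32Array(rows * cols * 2); // one pos and one len per cell
console.log(index.byteLength / (1024 * 1024)); // ~152.6 MB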

@sccolbert (Contributor) commented Mar 6, 2018

Caveat: If you have positions which are larger than 31 bits, the array will double in size since it will be storing them as doubles instead of ints.

@jasongrout (Author) commented Mar 6, 2018

> Caveat: If you have positions which are larger than 31 bits, the array will double in size since it will be storing them as doubles instead of ints.

Thanks. I'm storing the offsets in a Uint32Array buffer, but they originally come in as a normal JS list. Of course, one unimplemented idea is to chunk up the parsing so the list isn't that large at any particular time.

Right now I'm storing just the offsets and just slicing the string from one offset to the next. I have to clean up the string anyway (for the quoting), so there already is a bit of processing after slicing the string. What advantage does storing the length give me?

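A small illustrative sketch of that packing step (the PR may do it differently): accumulate offsets in a plain number[] while scanning, then freeze them into a Uint32Array so each offset costs 4 bytes.

// Pack row offsets into 4-byte slots once parsing is done. Offsets must be
// below 2**32, which matches the model's "up to 2**32 characters" limit.
function packOffsets(offsets: number[]): Uint32Array {
  return Uint32Array.from(offsets);
}

// e.g. const packed = packOffsets(computeRowOffsets(csvText)); // using the sketch above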

@jasongrout (Author) commented Mar 6, 2018

How about this: we initially just parse for row offsets. That also gives us the number of rows and columns. We decide based on the size of rows*cols whether we want to create an offset array that is rows*cols, or just rows. If we decide we have too many fields for a full rows*cols array, we just store rows and parse as needed. If we have a rows*cols array, we parse rows as we are asked for the data in the rows, and incrementally fill in the offset array (i.e., treat the offset array as a cache for all field offsets).

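A sketch of the decision being described, with made-up names and a made-up threshold; the point is only the shape of the logic, not the PR's actual numbers.

interface OffsetIndex {
  rowOffsets: Uint32Array;          // always present: one offset per row
  fieldOffsets: Uint32Array | null; // full rows*cols cache, filled lazily, or null
  filledRows: number;               // how many rows of fieldOffsets are valid so far
}

// Hypothetical budget: no more than ~2**25 cached field offsets (128 MB of Uint32s).
const MAX_CACHED_FIELDS = 33554432;

function makeIndex(rowOffsets: Uint32Array, columns: number): OffsetIndex {
  let fields = rowOffsets.length * columns;
  return {
    rowOffsets,
    fieldOffsets: fields <= MAX_CACHED_FIELDS ? new Uint32Array(fields) : null,
    filledRows: 0
  };
}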

@jasongrout (Author) commented Mar 6, 2018

The main question there, of course, is when to cut over to the full offset array.

@jasongrout (Author) commented Mar 6, 2018

> If we decide we have too many fields for a full rows*cols array, we just store rows and parse as needed.

Even in this case, we probably ought to have essentially a cache line of, say, 500-1000 rows. Anytime we are asked for a value, we parse 500-1000 rows from that value forward and cache the full offsets. This should speed up many full-screen repaints, as well as small repaints like scrolling sideways.

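A sketch of that "cache line" idea (names and the window size are illustrative, and this sketch caches parsed fields rather than offsets for brevity): any miss reparses a window of rows starting at the requested row.

const CACHE_WINDOW = 1000; // rows to parse ahead on a cache miss

class RowWindowCache {
  private _startRow = -1;
  private _rows: string[][] = [];

  // _parseRows(start, count) is assumed to parse `count` rows beginning at `start`.
  constructor(private _parseRows: (start: number, count: number) => string[][]) { }

  getField(row: number, column: number): string {
    if (row < this._startRow || row >= this._startRow + this._rows.length) {
      // Miss: reparse a window of rows from the requested row forward.
      this._startRow = row;
      this._rows = this._parseRows(row, CACHE_WINDOW);
    }
    return this._rows[row - this._startRow][column];
  }
}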

jasongrout added 3 commits Mar 6, 2018
This takes parsing 1,169,059 rows (38,578,947 fields) down from about 6s to 5s in Chrome, and from 3.5s down to 3s in Firefox.
… as a max number of rows or set number of columns.

This sets the stage for incrementally parsing the file. Current idea is to initially just parse for row offsets. That also gives us the number of rows and columns. We decide based on the size of rows*cols whether we want to create an offset array that is rows*cols, or just rows. If we decide we have too many fields for a full rows*cols array, we just store rows and parse as needed. If we have a rows*cols array, we parse rows as we are asked for the data in the rows, and incrementally fill in the offset array (i.e., treat the offset array as a cache for all field offsets).

Even in the case of just storing row offsets, we should maintain a cache of column offsets for, say, 500-1000 rows, invalidated when we ask for any data item outside the cache. That should allow for full-screen paints and sideways scrolling without reparsing the data.
@sccolbert (Contributor) commented Mar 6, 2018

> Right now I'm storing just the offsets and just slicing the string from one offset to the next. I have to clean up the string anyway (for the quoting), so there already is a bit of processing after slicing the string. What advantage does storing the length give me?

Nothing. Storing just the offsets is sufficient (unless you want to trim whitespace). My bad.

@sccolbert (Contributor) commented Mar 6, 2018

> How about this: we initially just parse for row offsets. That also gives us the number of rows and columns.

IIRC V8 has a fixed depth you can subsequently slice a (sub)string before it creates a copy of the underlying data (as opposed to just creating a view of the original string), which is why I was suggesting a full parse table up front. A simple slice into the original string should be "almost free".

@jasongrout (Author) commented Mar 6, 2018

> IIRC V8 has a fixed depth you can subsequently slice a (sub)string before it creates a copy of the underlying data (as opposed to just creating a view of the original string), which is why I was suggesting a full parse table up front. A simple slice into the original string should be "almost free".

Nice, I didn't know this. By "fixed depth", do you mean a fixed slice length? I'm doing all I can to just return a fixed slice (i.e., I'm only post-processing it if it was a quoted field).

@sccolbert (Contributor) commented Mar 6, 2018

Not a fixed length; more like the number of times you can subsequently slice a substring, as in foo.slice(...).slice(...).slice(...), before it makes a copy of foo.

@jasongrout (Author) commented Mar 6, 2018

Ah, okay. Yeah, I'm figuring out the right indices up front and then slicing just once: https://github.com/jupyterlab/jupyterlab/pull/3997/files#diff-5049d61f71d31ada9e714e442f2c1fccR180

@sccolbert (Contributor) commented Mar 6, 2018

string.replace is almost certain to allocate a new string.

@jasongrout (Author) commented Mar 6, 2018

> string.replace is almost certain to allocate a new string.

Ah, perhaps I should first look to see if there are any quotes (almost never) before doing a replace. Thanks!

@sccolbert (Contributor) commented Mar 6, 2018

> Ah, perhaps I should first look to see if there are any quotes (almost never) before doing a replace. Thanks!

Or find the quote indices, and slice between them.

@jasongrout (Author) commented Mar 6, 2018

I'm already doing that (that's the trimRight/trimLeft). The replace is to take care of escaped quotes in the middle of a field. The CSV spec says that two double quotes in a row in a quoted field is really a single escaped quote. Again, probably almost never happens, so I should do a quick indexOf to see if I need to do a replace.

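Putting the two points together, a sketch of the quote handling being discussed (illustrative, not the PR's exact code): slice off the surrounding quotes, and only pay for replace() when indexOf finds a quote inside the field.

function unquoteField(field: string): string {
  // Fast path: not a quoted field.
  if (field.charCodeAt(0) !== 0x22 /* '"' */) {
    return field;
  }
  // Drop the surrounding quotes.
  let inner = field.slice(1, -1);
  // Per the CSV spec (RFC 4180), a doubled quote inside a quoted field is an
  // escaped quote. Check first, since replace() allocates a new string.
  if (inner.indexOf('"') !== -1) {
    inner = inner.replace(/""/g, '"');
  }
  return inner;
}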

@sccolbert (Contributor) commented Mar 6, 2018

ah, makes sense


@jasongrout changed the title from "[WIP] Initial implementation of a fast CSV model." to "Initial implementation of a fast CSV model." on Mar 12, 2018
@jasongrout (Author) commented Mar 12, 2018

Setting as not work-in-progress anymore. I think this is ready for review/merging.

CC @ellisonbg, @ian-r-rose, @blink1073, @afshin, @sccolbert.

@ian-r-rose (Member) commented Mar 12, 2018

Can you recommend a large CSV for testing?

@jasongrout (Author) commented Mar 12, 2018

I've been using the file mentioned here for Chrome: #3632 (comment)

@jasongrout (Author) commented Mar 12, 2018

I also used the CSV file mentioned in #4164 (comment) in Chrome, and the larger one in Safari.


* This model handles data with up to 2**32 characters.
*/
export
class DSVModel extends DataModel implements IDisposable {
@sccolbert (Contributor) commented Mar 12, 2018

Several methods/getters are lacking a return type. Is that on purpose?

@jasongrout (Author) commented Mar 12, 2018

No, thanks for catching that.


return;
}
this._resetParser();
this._data = null;
@sccolbert (Contributor) commented Mar 12, 2018

Why make this disposable? Also, resetting the parser causes the changed signal to be emitted, which means listeners will be dealing with a "disposed" object.

@jasongrout (Author) commented Mar 12, 2018

I made it disposable because it has async parsing that could be scheduled. Calling the dispose method explicitly cancels any pending parsing work.

@jasongrout (Author) commented Mar 12, 2018

Again, good catch. I tried to factor out the code that cancels the async processing, but that introduced this problem. I'll make another commit separating these two use cases.

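A sketch of the dispose pattern being discussed: cancel any scheduled chunked parse without emitting change signals. The field name _delayedParse mirrors the setTimeout handle that shows up in a later review thread; the class itself is illustrative.

class DisposableParser {
  private _isDisposed = false;
  private _delayedParse: number | null = null;

  get isDisposed(): boolean {
    return this._isDisposed;
  }

  dispose(): void {
    if (this._isDisposed) {
      return;
    }
    this._isDisposed = true;
    // Cancel pending chunked parsing, but do not reset state that emits signals.
    if (this._delayedParse !== null) {
      window.clearTimeout(this._delayedParse);
      this._delayedParse = null;
    }
  }
}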

/**
* The data source for the data model.
*
* The data model takes full ownership of the data source.
@sccolbert (Contributor) commented Mar 12, 2018

Strings are immutable, so "ownership" doesn't really matter.

@jasongrout (Author) commented Mar 12, 2018

You're right, that doesn't make sense. I don't know what I was thinking there.


jasongrout added 3 commits Mar 12, 2018
Thanks to @sccolbert for catching that in the previous case, a signal would be emitted in the dispose method.
@jasongrout (Author) commented Mar 12, 2018

@sccolbert - thanks for the review. I think I've corrected the problems you pointed out.


@jasongrout added this to the Beta 2 milestone on Mar 13, 2018
@jzf2101 (Contributor) commented Mar 13, 2018

@jasongrout How do these compare to what's in the demo?

@jasongrout (Author) commented Mar 13, 2018

@jzf2101 (Contributor) commented Mar 13, 2018

OK, I just wanted to clarify that the files you're using are bigger. How do we collect the time to load?

@jasongrout (Author) commented Mar 13, 2018

When you double-click on the file, you're seeing both the time it takes to transfer the file to the browser and the time it takes to parse the csv and render it in the grid. Once the grid is displayed, changing the delimiter shows you how much time it takes to just parse and render the file.

@jzf2101 - I'm not sure exactly what you're asking, but you should see a big difference in that with this PR you can load much larger files (for example, you can load the files I mentioned above in Chrome), and the time to parse/render the file is way shorter (for example, the time it takes to change the delimiter and see the new rendering).


}

// Define a function to recursively parse the next chunk after a delay.
let that = this;
@afshin (Member) commented Mar 15, 2018

Can this function not be done as an arrow function without needing that?

// Define a function to recursively parse the next chunk after a delay.
let delayedParse = () => {
  // Parse up to the new end row.
  let done = parseChunk(currentRows + chunkRows);
  currentRows += chunkRows;

  // Gradually double the chunk size until we reach a million rows, if we
  // start below a million-row chunk size.
  if (chunkRows < 1000000) {
    chunkRows *= 2;
  }

  // If we aren't done, then schedule another parse.
  if (done) {
    this._delayedParse = null;
  } else {
    this._delayedParse = window.setTimeout(delayedParse, delay);
  }
}


@afshin (Member) commented Mar 15, 2018

You do the same above, so I assume there's a reason I am missing.

@jasongrout (Author) commented Mar 15, 2018

It's the recursion. I need a name for the function so I can call it to set the timeout.

@jasongrout (Author) commented Mar 15, 2018

huh, will what you wrote work? I'm not sure if the closure will include the right variable. I guess it's easy to check...

I just checked - it seems to work okay in Firefox.

@jasongrout (Author) commented Mar 15, 2018

And same in Chrome:

let x = () => {console.log(setTimeout(x, 1000))}
x()

Okay, commit coming up.


jasongrout added 3 commits Mar 15, 2018
…quoted field.

In some cases and different browsers, the regex approach below is faster by
quite a bit, but in others, the loop is faster. More testing is needed to see
which approach we want to keep. In my microbenchmarks, it seemed that Firefox
was faster with the plain loop we are keeping in this commit, and Chrome seemed
faster with the regex.

const endfield = new RegExp(`[${delimiter}${rowDelimiter}]`, 'g');

endfield.lastIndex = i;
let match = endfield.exec(data);
if (match) {
  i = match.index;
  char = data.charCodeAt(i);
}
@@ -145,10 +146,13 @@ class CSVViewer extends Widget implements DocumentRegistry.IReadyWidget {
* Create the json model for the grid.
@afshin (Member) commented Mar 15, 2018

Out-of-date docstring.

@afshin approved these changes on Mar 16, 2018
@afshin (Member) left a comment

Awesome, thanks!

@afshin (Member) commented Mar 16, 2018

The CSV spectrum in JupyterLab:

[screenshot: CSV spectrum files rendered in JupyterLab]


@jzf2101 mentioned this pull request on Mar 16, 2018
@jzf2101 (Contributor) commented Mar 16, 2018

I actually like that; we should perhaps use something like it in the docs.

@jasongrout (Author) commented Mar 16, 2018

Darian, feel free to hit the big green button! :)


@afshin merged commit f775baa into jupyterlab:master on Mar 16, 2018
2 checks passed
@lock (bot) locked as resolved and limited conversation to collaborators on Aug 9, 2019