[Performance] Data viewer can't handle large DFs #5469
View large DataFrames (>1000 columns, >1000 rows) in under 1 minute
When opening large DFs (current is 709x3201) the Data Viewer stops at showing the structure with all values at 'loading ...' (current runtime 20 minutes).
Steps to reproduce:
I was really looking forward to these features, so thanks for getting them in there! However, when dealing with quantitative finance problems we often have very large dataframes, and it would be nice to be able to use the data viewer to explore them.
I tried viewing just a (3226,) pandas.Series and got the following error thrown back:
Error: Failure during variable extraction:
TypeError Traceback (most recent call last)
C:\ProgramData\Anaconda3\envs\DEV64\lib\json_init_.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
C:\ProgramData\Anaconda3\envs\DEV64\lib\json\encoder.py in encode(self, o)
C:\ProgramData\Anaconda3\envs\DEV64\lib\json\encoder.py in iterencode(self, o, _one_shot)
C:\ProgramData\Anaconda3\envs\DEV64\lib\json\encoder.py in default(self, o)
TypeError: Object of type 'Timestamp' is not JSON serializable
So I created a DataFrame of 2 series of 700 values with a pandas.DatetimeIndex, and while I can see the values the index is not shown, which would defeat the purpose of viewing DFs when working with time series.
For the huge amount of columns, I don't think we'll be able to support it for a while. The control we're using isn't virtualizing columns, just rows, so it adds the 3000 columns into the DOM and the nodejs process runs out of memory.
@FranciscoRZ, I think we're going to have to limit the number of columns displayed.
Here's the code I used to repro the first issue:
@rchiodo, thanks for the quick response (and sorry for the late reply, I'm guessing we're in different time zones
To reproduce the problem with large DFs, I used the following code:
--> double click df in variable explorer
From that, I reproduced the problem with the DatetimeIndex as follows:
--> double click df2 in variable explorer
Here's what I get:
Sorry I can't be of more help, I really have no experience / knowledge of web development technologies.
Also, while working yesterday I noticed that as my variable environment grew, the variable explorer started to flicker more and more, and took a while to reload. Is it reevaluating all variables each time a variable is defined? If so, it seems like a really resource intensive process. I haven't yet come accross performance issues I can pin to this, but maybe a "refresh" button in the variable explorer would be more user friendly / resource conscious?
Thanks for the code. That helps. Although I'm confused as to the bug for the second? It's not showing the 'TypeError: Object of type 'Timestamp' is not JSON serializable' anymore? So that part of it is fixed/not reproing?
Thanks for the feedback on the refresh too. You can disable refreshing by just collapsing the variables explorer, but yes we do refresh every variable every cell.
It works like a debugger does, every time you step it updates all of the variables.
We should do some perf testing to see if there's any interference with actually executing cells.
Adding @IanMatthewHuff for an FYI.
Sorry, I forgot to specify that one.
I know some pandas attributes are not serializable (I'm pretty sure
Lastly, I've been playing around with the date type to see if this is a pandas related problem.
It's interesting that in the case of
So I took a quick look at the source code. I'm guessing that the Python side of the data viewer's magic happens in
Anyways, hope this helps.
Yes it does help. Thanks.
I'm going to split out the dataframe index and timestamp problems. They're different from the too many columns not working.
Actually there's already another bug for this:
I'm going to fix the timestamp problem there too. It has to do with timestamps/datetimes having custom string formatting and pandas not using the same values as str() does.
referenced this issue
May 30, 2019
I just submitted a fix for the column virtualization. Please feel free to try it out in our next insiders
It should support any number of columns and rows, but it will ask if you're sure you want to open the view if there's more than 1000 columns. More than a 1000 columns causes the initial bring up to take awhile and fetching the data can take longer too (it has to turn the rows into a JSON string in order to send it to our UI - function of how VS code works).
1000 x 10000 DF takes me about 5 minutes to load.
However it also now supports filtering with expressions on numeric columns. Example: