table view struggles with too many rows #1489

Closed
kragol opened this issue Jul 15, 2020 · 4 comments

kragol commented Jul 15, 2020

The table view does not seem to open when trying to view a large table (e.g. millions of rows with a few dozen columns). It gets progressively slower to open as the number of rows increases, until at roughly 1 million rows it seems to hang indefinitely (or maybe I am just not patient enough).

The reason is likely that the extension is trying to load the table as a whole when a somewhat lazy solution should be preferred for very large arrays/tables.

@davidanthoff davidanthoff added this to the Backlog milestone Jul 15, 2020
@pfitzseb
Member

See https://github.com/JuliaComputing/TableView.jl for how that could look (modulo all the WebIO weirdness, of course). I'm not at all familiar with this side of the tables ecosystem though, unfortunately. A Tables.jl based solution seems like it would be the most generic, but that would probably pull in too many dependencies.

@davidanthoff
Member

Yes, that is exactly so :) We need to change the implementation to a lazy one. The key there will be to not have a direct communication channel between the webview and the Julia process, but instead have any communication hop via the extension, otherwise things won't work in the remote scenarios.
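A minimal sketch of that hop, assuming a simple request/response shape: the webview only ever talks to a relay object in the extension, and the relay forwards a clamped request to the Julia side. All names here (`RowRequest`, `ExtensionRelay`, `fakeJulia`, the 500-row cap) are hypothetical and not taken from the extension; a real transport would be asynchronous, which is left out to keep the example short.

```typescript
// Hypothetical message shapes; none of these names come from the extension.
type RowRequest = { start: number; count: number };
type RowResponse = { start: number; rows: Array<[number, string]> };

// Stand-in for whatever transport links the extension to the Julia process.
type JuliaBackend = (req: RowRequest) => RowResponse;

// The extension sits in the middle: the webview never reaches Julia directly.
class ExtensionRelay {
  private readonly backend: JuliaBackend;
  private readonly maxRows: number;

  constructor(backend: JuliaBackend, maxRows: number) {
    this.backend = backend;
    this.maxRows = maxRows;
  }

  // Called when the webview asks for a window of rows.
  handleWebviewRequest(req: RowRequest): RowResponse {
    // Sanitise the request so a single message can never ask for the whole table.
    const start = Math.max(0, Math.floor(req.start));
    const count = Math.min(this.maxRows, Math.max(0, Math.floor(req.count)));
    return this.backend({ start, count });
  }
}

// Fake backend that generates rows on demand instead of holding a full table.
const fakeJulia: JuliaBackend = (req) => ({
  start: req.start,
  rows: Array.from(
    { length: req.count },
    (_, i): [number, string] => [req.start + i, `row ${req.start + i}`]
  ),
});

const relay = new ExtensionRelay(fakeJulia, 500);
```

Because the webview depends only on the relay, the same webview code would work whether the Julia process runs locally or on a remote machine.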

One other thing I've been thinking about for these lazy scenarios: I think we should probably make a complete copy of the data that we want to view in-memory in the Julia process and then have the web view poll that copy for the lazy updates. Otherwise we would have to deal with a situation where someone opens, say, a DataFrame and then edits the content of the data structure while a grid of that data structure is visible, which would be a complete nightmare to handle in terms of race conditions. So the idea would be that if one calls vscodedisplay(x) on something, it will always display a snapshot of x taken at that moment.
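The snapshot idea can be illustrated with a small sketch. It is written in TypeScript only for illustration (the copy described above would live in the Julia process), and `TableSnapshot`, `page`, and the sample data are hypothetical names, not part of the extension.

```typescript
// All names here are hypothetical; this is not the extension's code.
type Row = ReadonlyArray<string | number>;

class TableSnapshot {
  private readonly rows: Row[];

  constructor(source: ReadonlyArray<Row>) {
    // Copy every row so later edits to `source` cannot leak into the view.
    this.rows = source.map((row) => [...row]);
  }

  get rowCount(): number {
    return this.rows.length;
  }

  // Return at most `count` rows starting at `start`; out-of-range requests yield [].
  page(start: number, count: number): Row[] {
    const from = Math.max(0, start);
    return this.rows.slice(from, from + Math.max(0, count));
  }
}

// The "live" table that the user keeps editing after displaying it.
const firstRow: Array<string | number> = [1, "a"];
const live = [firstRow, [2, "b"], [3, "c"]];

const snapshot = new TableSnapshot(live);

// Edits made after the snapshot was taken are not visible through it.
firstRow[1] = "edited";
live.push([4, "d"]);
```

The grid would then request pages from `snapshot` as the user scrolls, so it never races against edits to the original data. The cost is one full copy in memory, which is the trade-off raised later in the thread for tables backed by disk or network storage.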

It would of course be even nicer if we found some way for such a lazy view to keep working even if the Julia REPL process is killed or blocked... I think medium to long term one solution might be that we serialize the whole table into an Arrow buffer in the Julia process, get that somehow into the extension process (I think that could be made semi-fast, even for very large tables) and then have the lazy part just operate between the webview and the full data copy in the extension. But that would require a lot more stuff than we have, for example a low-dependency Arrow writer, which is not on the horizon, as far as I can tell...

> I'm not at all familiar with this side of the tables ecosystem though, unfortunately. A Tables.jl based solution seems like it would be the most generic, but that would probably pull in too many dependencies.

The current table viewer is based on the https://github.com/queryverse/TableTraits.jl interface, which brings in almost no dependencies, and I think most Tables.jl sources should fulfill that interface as well, so that seems to be the easiest way here, I think.

@kragol
Author

kragol commented Jul 16, 2020

Nice! I see you are already on top of that issue.

There might be another difficulty in store about data integrity: tables could represent data stored on disk (or network) and lazily loaded into memory. In that case, you probably should not take a snapshot of the full table, so I don't know how you could guarantee that what is displayed is consistent with the state of the (full) table when the display command was issued. Besides, it seems like a typical problem that is faced by multi-user database software. Maybe there are ideas to take from there?

Personally, I'd be happy with not lazy loading any data while the REPL process is busy and a basic refresh button that is only available when the REPL is idle. The idea would be that unless you just pressed that button (and you know that no one else is playing with the data source if it is stored on disk/network), you can't guarantee that what you see is the current content. The cherry on the cake would be some visual indication whenever the displayed table is known to be dirty.
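A rough sketch of that behaviour, under the simplifying assumption that any REPL activity may have changed the source and therefore marks the view as possibly stale. `TableViewState` and its methods are hypothetical names used only for illustration.

```typescript
// Hypothetical state model; names are illustrative only.
type ReplState = "idle" | "busy";

class TableViewState {
  private repl: ReplState = "idle";
  private dirty = false;

  setReplState(state: ReplState): void {
    // Assumption: anything the REPL runs may have modified the displayed data.
    if (state === "busy") {
      this.dirty = true;
    }
    this.repl = state;
  }

  // The refresh button is offered only while the REPL is idle.
  canRefresh(): boolean {
    return this.repl === "idle";
  }

  // Drives the visual "possibly out of date" indicator.
  isDirty(): boolean {
    return this.dirty;
  }

  // Returns true if the refresh ran; a real view would reload its rows here.
  refresh(): boolean {
    if (!this.canRefresh()) {
      return false;
    }
    this.dirty = false;
    return true;
  }
}

const view = new TableViewState();
```

This model cannot detect changes made by other users to a disk or network source, so the indicator only ever means "possibly stale", never "certainly current".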

Anyway, thanks for the great work, whatever time it takes!

@pfitzseb
Member

pfitzseb commented Nov 4, 2021

Should be fixed in the next version of the extension.

@pfitzseb pfitzseb closed this as completed Nov 4, 2021