Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactoring ResultSet #470

Closed
edublancas opened this issue Apr 28, 2023 · 7 comments · Fixed by #624
Closed

refactoring ResultSet #470

edublancas opened this issue Apr 28, 2023 · 7 comments · Fixed by #624
Assignees
Labels
stash Label used to categorize issues that will be worked on next

Comments

@edublancas
Copy link

edublancas commented Apr 28, 2023

we can improve peformance by refactoring ResultSet

  • it currently fetches all rows in the constructor. this is an expensive operation, especially for large results. and users might do it by mistake. Instead, resultSet should only fetch the minimum number of rows when needed. for example if the user puts result_set in a cell, we should only fetch the rows required to show the preview table. if they do something like list(result_set), then, yes, fetch all results.
  • do not use fetchall() when using duckdb. duckdb has native support for exporting results to pandas and polars. so we shouldn't call fetchall(), and instead use the native API (.df() or .pl())
  • this might render the polars config option to pass custom kwargs obsolete since .pl() only accepts the chunk_size argument

note that a solution for improving duckdb + pandas performance has been applied here: #469 however, we should move such logic inside ResultSet, as #469 only works when autopandas is turned on

@edublancas
Copy link
Author

impportant: see this comment, @ned2 makes a good case for going an alternative route. rather than making ResultSet lazy, we could avoid them altogether.

@edublancas edublancas added stash Label used to categorize issues that will be worked on next med complexity labels May 23, 2023
@edublancas
Copy link
Author

so to summarize: let's first work on the lazy loading approach since that will benefit all databases, once that's fixed, we can work on further DuckDB improvements in #536

@yafimvo
Copy link

yafimvo commented Jun 14, 2023

  • it currently fetches all rows in the constructor. this is an expensive operation, especially for large results. and users might do it by mistake. Instead, resultSet should only fetch the minimum number of rows when needed. for example if the user puts result_set in a cell, we should only fetch the rows required to show the preview table. if they do something like list(result_set), then, yes, fetch all results.

AC:

  1. Move fetch logic from the constructor (let's say to a new function called def fetch_results)
  2. Instead of using self._results = sqlaproxy.fetchall() we can use self._results = sqlaproxy.fetchmany(size=config.displaylimit) which will return the min number of rows to display.
  3. list(result_set) should return all results

Question:
Do we still want to count the total number of rows of a given query?
We need it to display this message:
image

@edublancas

@edublancas
Copy link
Author

@yafimvo: re counting total number of rows. fair point, if we want to, the only way would be to run a select count(*) query, right? I don't see any other option. performance wise this is pretty bad so I think we'll have to remove the message. but we can keep the display limit part.

@tonykploomber
Copy link

tonykploomber commented Jun 20, 2023

select count(*)

Interesting topic, I think it depends on database storage engine

For MyISAM the total row count is stored for each table so SELECT COUNT(*) FROM yourtable is an operation O(1). It just needs to read this value.
For InnoDB the total row count is not stored so a full scan is required. This is an O(n) operation.

Probably need to evaluate if the counting number is the bottleneck, if yes, I agree we might need to remove the message

@rupurt
Copy link

rupurt commented Sep 2, 2023

It would be great if there was a magic available to set the fetch size. For example when doing bulk ETL loads the fetch size can significantly improve performance.

@edublancas
Copy link
Author

edublancas commented Sep 3, 2023

@rupurt Can you provide more details? (Please open a new issue so we can discuss)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stash Label used to categorize issues that will be worked on next
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants