refactoring ResultSet #470

edublancas · 2023-04-28T18:14:06Z

we can improve performance by refactoring ResultSet

it currently fetches all rows in the constructor. this is an expensive operation, especially for large results, and users might trigger it by mistake. Instead, ResultSet should only fetch the minimum number of rows when needed. for example, if the user puts result_set in a cell, we should only fetch the rows required to show the preview table. if they do something like list(result_set), then, yes, fetch all results.
do not use fetchall() when using duckdb. duckdb has native support for exporting results to pandas and polars, so we shouldn't call fetchall() and should instead use the native API (.df() or .pl())
this might render the polars config option to pass custom kwargs obsolete, since .pl() only accepts the chunk_size argument

note that a solution for improving duckdb + pandas performance has been applied in #469. however, we should move that logic inside ResultSet, as #469 only works when autopandas is turned on


edublancas · 2023-04-29T17:04:37Z

important: see this comment. @ned2 makes a good case for going an alternative route: rather than making ResultSet lazy, we could avoid them altogether.

edublancas · 2023-05-29T16:15:00Z

so to summarize: let's first work on the lazy loading approach since that will benefit all databases. once that's done, we can work on further DuckDB improvements in #536

yafimvo · 2023-06-14T15:08:19Z

it currently fetches all rows in the constructor. this is an expensive operation, especially for large results. and users might do it by mistake. Instead, resultSet should only fetch the minimum number of rows when needed. for example if the user puts result_set in a cell, we should only fetch the rows required to show the preview table. if they do something like list(result_set), then, yes, fetch all results.

AC:

Move fetch logic from the constructor (let's say to a new function called def fetch_results)
Instead of using self._results = sqlaproxy.fetchall() we can use self._results = sqlaproxy.fetchmany(size=config.displaylimit) which will return the min number of rows to display.
list(result_set) should return all results
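
The acceptance criteria above can be sketched roughly as follows. This is only an illustration, not the actual implementation: `LazyResultSet`, `preview`, and the `displaylimit` argument are hypothetical names, and the standard-library `sqlite3` cursor stands in for `sqlaproxy` (both expose `fetchmany` and `fetchall`).

```python
import sqlite3


class LazyResultSet:
    """Hypothetical sketch: fetch rows from a cursor only when they are needed."""

    def __init__(self, cursor, displaylimit=10):
        self._cursor = cursor            # nothing is fetched in the constructor
        self._displaylimit = displaylimit
        self._results = []               # rows fetched so far
        self._exhausted = False          # True once the cursor has no more rows

    def fetch_results(self, size=None):
        """Ensure at least `size` rows are cached; size=None fetches everything."""
        if self._exhausted:
            return
        if size is None:
            self._results.extend(self._cursor.fetchall())
            self._exhausted = True
            return
        missing = size - len(self._results)
        if missing > 0:
            rows = self._cursor.fetchmany(missing)
            self._results.extend(rows)
            if len(rows) < missing:
                self._exhausted = True

    def preview(self):
        """Return only the rows needed for the preview table."""
        self.fetch_results(self._displaylimit)
        return self._results[: self._displaylimit]

    def __iter__(self):
        # list(result_set) ends up here and fetches all remaining rows
        self.fetch_results()
        return iter(self._results)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(100)])

cursor = conn.execute("SELECT n FROM numbers ORDER BY n")
result_set = LazyResultSet(cursor, displaylimit=10)
preview_rows = result_set.preview()                   # fetches 10 rows only
rows_cached_after_preview = len(result_set._results)
all_rows = list(result_set)                           # fetches the remaining rows
```

Keeping the already-fetched rows in `_results` means the preview rows are not lost when the rest of the cursor is consumed later.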

Question:
Do we still want to count the total number of rows of a given query?
We need it to display this message:

@edublancas

edublancas · 2023-06-15T18:42:02Z

@yafimvo: re counting total number of rows. fair point, if we want to, the only way would be to run a select count(*) query, right? I don't see any other option. performance wise this is pretty bad so I think we'll have to remove the message. but we can keep the display limit part.
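
To make the cost concrete, here is a minimal sketch of what counting would involve, assuming the user's query is a plain SELECT without a trailing semicolon. `count_rows` is a hypothetical helper and `sqlite3` is used only so the example is self-contained.

```python
import sqlite3


def count_rows(conn, query):
    """Hypothetical helper: count the rows a SELECT query would return.

    Wrapping the query in a subquery makes the database execute it again,
    which is why this can be expensive for large or slow queries.
    """
    cursor = conn.execute(f"SELECT COUNT(*) FROM ({query}) AS subquery")
    return cursor.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(100)])

total = count_rows(conn, "SELECT n FROM numbers WHERE n % 2 = 0")
```

The count runs as a second query on top of the one that produces the preview rows, so the database does the work twice.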

tonykploomber · 2023-06-20T15:24:50Z

select count(*)

Interesting topic, I think it depends on database storage engine

For MyISAM, the total row count is stored for each table, so SELECT COUNT(*) FROM yourtable is an O(1) operation: it only needs to read this value.
For InnoDB, the total row count is not stored, so a full scan is required. This is an O(n) operation.

We probably need to evaluate whether counting the rows is the bottleneck. If it is, I agree we might need to remove the message

rupurt · 2023-09-02T08:39:15Z

It would be great if there was a magic available to set the fetch size. For example when doing bulk ETL loads the fetch size can significantly improve performance.

edublancas · 2023-09-03T17:08:54Z

@rupurt Can you provide more details? (Please open a new issue so we can discuss)

edublancas mentioned this issue Apr 28, 2023

improving performance of polars conversion with duckdb #453

Closed

ned2 mentioned this issue Apr 29, 2023

improving performance when converting DuckDB's results to pandas #451

Closed

edublancas added the stash and med complexity labels May 23, 2023

eitsupi mentioned this issue May 29, 2023

magic_duckdb based magic? PRQL/pyprql#175

Open

edublancas assigned yafimvo May 29, 2023

yafimvo mentioned this issue Jun 19, 2023

Refactored ResultSet to lazy loading #624

Merged

4 tasks

sync-by-unito bot closed this as completed Jun 20, 2023

edublancas reopened this Jun 20, 2023

edublancas mentioned this issue Jun 21, 2023

improving performance when using duckdb #637

Closed

edublancas closed this as completed in #624 Jun 22, 2023
