Improve performance of DataFrameEngine.df_like #2029

Merged (22 commits) on May 13, 2022

geoffreyangus (Collaborator) commented on May 12, 2022

This PR improves the performance of the changes made to DaskEngine and PandasEngine in #2020 and #2021 while preserving the fixes they introduced. It essentially reverts those changes (going back to iterative column assignment instead of the inner join) and adds type casting after NaNs are dropped. Benchmark results below are wall-clock times in seconds:

| size \ method | iterative assign + no casting (original) | inner join (current master) | iterative assign + casting (this PR) |
|---|---|---|---|
| 100MB | 39.46 | 53.18 | 38.99 |
| 1GB | 73.25 | 94.78 | 64.83 |
| 10GB | 294.14 | exit code 137 (OOM) | 294.5 |

github-actions bot commented on May 12, 2022

Unit Test Results

6 files ±0, 6 suites ±0, duration 1h 27m 44s ⏱️ (- 3m 31s)

| | total | passed ✔️ | skipped 💤 | failed |
|---|---|---|---|---|
| tests | 2 776 ±0 | 2 741 ±0 | 35 ±0 | 0 ±0 |
| runs | 8 328 ±0 | 8 218 ±0 | 110 ±0 | 0 ±0 |

Results for commit b2f7308. ± Comparison against base commit c3e4abf.

♻️ This comment has been updated with latest results.

geoffreyangus marked this pull request as ready for review on May 13, 2022, 16:27
tgaddair merged commit caf7fdc into master on May 13, 2022
tgaddair deleted the speedup-dask-df-like branch on May 13, 2022, 17:40