Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataframe-split is slow with many groups #5

Closed
hinkelman opened this issue Sep 7, 2020 · 6 comments
Closed

dataframe-split is slow with many groups #5

hinkelman opened this issue Sep 7, 2020 · 6 comments

Comments

@hinkelman
Copy link
Owner

hinkelman commented Sep 7, 2020

(define split-example
  (-> (dataframe-crossing (cons 'grp (iota 300))
                          (cons 'obs (iota 300)))
      (dataframe-split 'grp)))
@hinkelman
Copy link
Owner Author

I think the performance bottleneck is unique-rows in helpers.sls. Improvements in unique-rows has implications for dataframe-split, dataframe-aggregate, and dataframe-spread.

@hinkelman
Copy link
Owner Author

hinkelman commented May 16, 2021

Issue might have been running out of stack in remove-duplicates, which is at the hear of unique-rows. Potentially improved by commit eb30064, but needs more testing with large datasets.

@hinkelman
Copy link
Owner Author

Switch to using hashtable to store and lookup keys in remove-duplicates; should fix performance issues?; see commit 59a6359

@hinkelman
Copy link
Owner Author

hinkelman commented May 18, 2021

Performance is now not terrible for about 100,000 rows and 300 groups (i.e., split-example code above), but 50,000 rows and 1,000 groups (not shown here) are still frustratingly slow. I was perhaps overly optimistic that improving remove-duplicates would solve performance problems in dataframe-split.

@hinkelman hinkelman reopened this May 18, 2021
@hinkelman
Copy link
Owner Author

These tests show that run time increases roughly linearly with group size (when controlling for total df size):

(define g2 (dataframe-crossing (cons 'grp (iota 2))
                               (cons 'obs (iota 50000))))
(define g10 (dataframe-crossing (cons 'grp (iota 10))
                                (cons 'obs (iota 10000))))
(define g50 (dataframe-crossing (cons 'grp (iota 50))
                                (cons 'obs (iota 2000))))
(define g100 (dataframe-crossing (cons 'grp (iota 100))
                                 (cons 'obs (iota 1000))))

(define (test-split df)
  (dataframe-split df 'grp)
  (void))

(for-each (lambda (x) (time (test-split x))) (list g2 g10 g50 g100))

@hinkelman
Copy link
Owner Author

Went back to square one with dataframe-split, the previous code was a real mess. The new code is greatly simplified and brings some performance improvements. The dataframe-split changes started with commit 6ae6c00, but there were cascading effects of those changes with impacts spread across many commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant