dataframe-split is slow with many groups #5

hinkelman · 2020-09-07T16:50:06Z

(define split-example
  (-> (dataframe-crossing (cons 'grp (iota 300))
                          (cons 'obs (iota 300)))
      (dataframe-split 'grp)))


hinkelman · 2021-05-16T19:02:53Z

I think the performance bottleneck is unique-rows in helpers.sls. Improvements in unique-rows have implications for dataframe-split, dataframe-aggregate, and dataframe-spread.

hinkelman · 2021-05-16T23:40:01Z

The issue might have been running out of stack in remove-duplicates, which is at the heart of unique-rows. Potentially improved by commit eb30064, but needs more testing with large datasets.

hinkelman · 2021-05-17T17:04:06Z

Switched to using a hashtable to store and look up keys in remove-duplicates, which should (hopefully) fix the performance issues. See commit 59a6359.
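For readers following along, here is a minimal sketch of the general idea: deduplicating a list with a hashtable while keeping first occurrences in order. It is not the code from commit 59a6359. The name remove-duplicates-sketch is hypothetical, and it assumes the items (e.g., lists of key values) can be compared with equal?.

```scheme
(import (rnrs))

;; Hypothetical sketch, not the library's remove-duplicates.
;; Keeps the first occurrence of each item and preserves order.
(define (remove-duplicates-sketch ls)
  ;; equal-hash with equal? lets lists (such as rows of key values) act as keys
  (let ([seen (make-hashtable equal-hash equal?)])
    ;; named let in tail position, so stack use does not grow with list length
    (let loop ([ls ls] [acc '()])
      (cond [(null? ls) (reverse acc)]
            [(hashtable-contains? seen (car ls))
             (loop (cdr ls) acc)]
            [else
             (hashtable-set! seen (car ls) #t)
             (loop (cdr ls) (cons (car ls) acc))]))))

(remove-duplicates-sketch '((1 a) (2 b) (1 a) (3 c) (2 b)))
;; => ((1 a) (2 b) (3 c))
```

Each membership check is a hashtable lookup rather than a scan of the items kept so far, so the cost per item does not grow with the number of unique keys.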

hinkelman · 2021-05-18T14:27:43Z

Performance is now not terrible for about 100,000 rows and 300 groups (i.e., split-example code above), but 50,000 rows and 1,000 groups (not shown here) is still frustratingly slow. I was perhaps overly optimistic that improving remove-duplicates would solve performance problems in dataframe-split.

hinkelman · 2023-04-21T21:05:38Z

These tests show that run time increases roughly linearly with the number of groups (holding total dataframe size constant at 100,000 rows):

(define g2 (dataframe-crossing (cons 'grp (iota 2))
                               (cons 'obs (iota 50000))))
(define g10 (dataframe-crossing (cons 'grp (iota 10))
                                (cons 'obs (iota 10000))))
(define g50 (dataframe-crossing (cons 'grp (iota 50))
                                (cons 'obs (iota 2000))))
(define g100 (dataframe-crossing (cons 'grp (iota 100))
                                 (cons 'obs (iota 1000))))

(define (test-split df)
  (dataframe-split df 'grp)
  (void))

(for-each (lambda (x) (time (test-split x))) (list g2 g10 g50 g100))
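To compare the runs numerically rather than reading the printed time output, something like the following might help. This is only a sketch: elapsed-ms and build-list-of are hypothetical helpers, not part of the dataframe library, and it assumes Chez Scheme's cpu-time, which reports CPU time in milliseconds.

```scheme
(import (chezscheme))

;; Hypothetical helper: run a thunk and return the elapsed CPU time
;; in milliseconds, as reported by Chez Scheme's cpu-time.
(define (elapsed-ms thunk)
  (let ([start (cpu-time)])
    (thunk)
    (- (cpu-time) start)))

;; Stand-in workload so this block runs without the dataframe library.
(define (build-list-of n)
  (length (iota n)))

(display (elapsed-ms (lambda () (build-list-of 100000))))
(newline)

;; With the dataframe library loaded and the definitions above in scope:
;; (map (lambda (df) (elapsed-ms (lambda () (test-split df))))
;;      (list g2 g10 g50 g100))
```

Dividing each result by the number of groups (2, 10, 50, 100) gives a rough check on whether the scaling is linear.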

hinkelman · 2023-06-30T05:45:14Z

Went back to square one with dataframe-split because the previous code was a real mess. The new code is greatly simplified and brings some performance improvements. The dataframe-split changes started with commit 6ae6c00, but those changes had cascading effects that are spread across many commits.

hinkelman closed this as completed May 17, 2021

hinkelman reopened this May 18, 2021

hinkelman closed this as completed Jun 30, 2023
