Test new join #80

justinGilmer · 2024-02-23T17:01:52Z

When the number of streams in a streamset grows into the hundreds or more, the join logic for windows and aligned windows queries becomes a computational burden, because the pyarrow table join only operates on one table at a time.

Since we can just join on the 'time' column, a simpler approach is to iterate through all of the windowed data and build a unique, sorted list of all timestamps. We then preallocate a null arrow table for all of the data, with the sorted unique timestamps as its time column, and replace the null entries with whatever values the windows queries returned. The null preallocation is needed because aligned windows queries return an empty table for time ranges where no data is present, while windows queries return an entry for every timestamp.
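As a rough illustration of the preallocate-and-fill idea described above, here is a minimal sketch that uses plain Python lists in place of arrow tables. It is not the code from this PR: the function name `merge_on_time`, the column names, and the input layout (a dict mapping a column name to a pair of timestamp and value lists) are all hypothetical, chosen only to show the three steps.

```python
from bisect import bisect_left


def merge_on_time(stream_results):
    """Join per-stream results on their time values.

    stream_results is a hypothetical layout: it maps a column name to a
    pair of equal-length lists, (timestamps, values).
    """
    # 1. Unique, sorted list of every timestamp seen in any stream.
    all_times = sorted({t for times, _ in stream_results.values() for t in times})

    # 2. Preallocate one all-null column per stream.
    merged = {"time": all_times}
    for name in stream_results:
        merged[name] = [None] * len(all_times)

    # 3. Overwrite the nulls with the values each stream actually returned.
    for name, (times, values) in stream_results.items():
        column = merged[name]
        for t, v in zip(times, values):
            # all_times is sorted and contains t, so bisect finds its row.
            column[bisect_left(all_times, t)] = v
    return merged


# Illustrative data only: stream_c returned nothing for this time range.
results = {
    "stream_a/mean": ([0, 10, 20], [1.0, 2.0, 3.0]),
    "stream_b/mean": ([10, 30], [5.0, 6.0]),
    "stream_c/mean": ([], []),
}
merged = merge_on_time(results)
```

Each stream is visited once and never joined against another table, which is why the cost grows roughly with the total number of returned points rather than with repeated pairwise joins. A stream with no data simply keeps its all-null column.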

This approach scales well in run time as the number of streams increases: for 1000 streams it is approximately 1.75-2x faster than the previous approach.

justinGilmer added 12 commits February 12, 2024 16:08

include alternative join method for streamset windows.

5c532f5

Clean up merge table func.

ff8d749

Resolve incorrect indexing issue.

d5ea6b1

Handle case of no data.

e965841

Try to speed up schema creation

ccd950b

Try to speed up schema creation, fix small issue.

096ccc4

Less work in the dict comp.

95af189

Less pass around the allocated data.

5257598

Dont pass around the preallocated data as much.

7ecd667

Update old aligned windows test and ensure that new join logic for arrow tables is equivalent to the non-arrow logic.

82fc80e

Remove unneeded code.

6e5501b

Change double for loop to list comprehension.

bb03c28

justinGilmer requested a review from jleifnf February 23, 2024 17:28

docmerlin approved these changes Feb 23, 2024

justinGilmer merged commit 405fb6f into staging Feb 23, 2024
18 checks passed

justinGilmer deleted the test_new_join branch February 23, 2024 20:12
