Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test new join #80

Merged
merged 12 commits into from Feb 23, 2024
Merged

Test new join #80

merged 12 commits into from Feb 23, 2024

Conversation

justinGilmer
Copy link

When the amount of streams in a streamset grows into the 100s+, the join logic for windows and aligned windows queries became a computational burden due to the join logic in pyarrow for tables only operating on a table at a time.

Since we can just join on the 'time' column, a simpler approach is to iterate through all windowed data, get a unique sorted list of all timestamps, preallocate a null arrow table for all data with the time column being all the unique sorted timestamps above, and then take all the values that are returned from the windows queries and replace the null entries with their available data. This is needed because aligned windows queries will return an empty table for timeranges where there are no data present, while windows queries will return an entry for every timestamp.

This approach scales well as the number of streams increases in terms of run time, for 1000 streams its approximately 1.75-2x faster than the previous approach.

@justinGilmer justinGilmer merged commit 405fb6f into staging Feb 23, 2024
18 checks passed
@justinGilmer justinGilmer deleted the test_new_join branch February 23, 2024 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants