Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
When the amount of streams in a streamset grows into the 100s+, the join logic for windows and aligned windows queries became a computational burden due to the join logic in pyarrow for tables only operating on a table at a time.
Since we can just join on the 'time' column, a simpler approach is to iterate through all windowed data, get a unique sorted list of all timestamps, preallocate a null arrow table for all data with the time column being all the unique sorted timestamps above, and then take all the values that are returned from the windows queries and replace the null entries with their available data. This is needed because aligned windows queries will return an empty table for timeranges where there are no data present, while windows queries will return an entry for every timestamp.
This approach scales well as the number of streams increases in terms of run time, for 1000 streams its approximately 1.75-2x faster than the previous approach.