-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stream processing support (in the style of flink/risingwave/arroyo/etc.) #17010
Comments
Sorry didn't notice that #6839 was closed with out of scope, shall I close this, or is this something that might be in scope with the new streaming engine design? @ritchie46 |
I understand it may be out of scope. Just noting here the use of streaming for real time plots with hvplot and streamz DataFrames:
It is tantalizing to want to use Polars |
@orlp do you think supporting this use case is something that's feasible in the new streaming engine? |
@kszlim Just to confirm, when you mean "stream processing" you are referring to a long-running query with some aggregates or window functions, like an overall mean, sum, or e.g. the median of the last 10 minutes. Then you could poll the value of those aggregates as they're being computed, and see how they update with new data coming in? If I understand you correctly and that is what you meant, the new streaming engine would certainly be a step towards it, but still be quite far off. I would not expect to see such streaming capabilities in Polars in 2024 or 2025. It is certainly something we have thought of as 'perhaps one day', but I wouldn't even accept this issue right now as it's not even on any roadmap currently. |
Yes, that's exactly it. Thanks for the answer to this! It would really be a killer feature some time late in the future imo, hopefully polars will keep this in mind as the query engine evolves. |
@orlp What might than be the missing parts? Checkpointing for instance? Any other parts on top of your mind? Would be an interesting experiment to try out initially separate from Polars 🙂 |
@CLWilliam Considering the new streaming engine doesn't even have window functions at all yet, and only the most basic aggregations, no way to poll them interactively, no support for detecting non-streamable queries yet, no infinite input streams, etc, etc. |
I'll be very happy even if there's no commitment to this but just building with it in mind so that this feature could be added in the future and isn't hard blocked. |
I absolutely agree with this idea. Even simple traits that extend the current streaming API to enable stream processing of potentially unbounded data would be great. This way contributors and authors of third-party libraries could provide support for different backends. |
@AndreiPashkin You might take a look at datafusion which has made some meaningful progress in that direction. |
Stumbled across this issue but we're working on turning DataFusion into a full-featured embedded stream processing engine complete with streaming windows, check-pointing/recovery, UDFs, and a python interface: https://github.com/probably-nothing-labs/denormalized Feel free to reach out, would love any feedback! |
Description
Polars currently is the best dataframe experience for batch processing, it would be worth considering whether it'd be possible to support stream processing.
Some prior literature (for adding on streaming support) in this area exists within datafusion, it might be worth picking their brains:
apache/datafusion#9016
https://synnada.medium.com/running-windowing-queries-in-stream-processing-93068d3a5
This would be a killer feature for polars as you could now use one system to rule them all as opposed to having a bespoke/separate stream processing framework for real time analytics, unifying them would be great.
I'd imagine this could be taken into consideration during the construction of the new streaming engine.
Understandably this is a huge feature request and I totally understand if it's closed with a not planned.
The text was updated successfully, but these errors were encountered: