Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream processing support (in the style of flink/risingwave/arroyo/etc.) #17010

Open
kszlim opened this issue Jun 17, 2024 · 11 comments
Open

Stream processing support (in the style of flink/risingwave/arroyo/etc.) #17010

kszlim opened this issue Jun 17, 2024 · 11 comments
Labels
blocked Cannot be worked on due to external dependencies, or significant new internal features needed first enhancement New feature or an improvement of an existing feature new-streaming Features for or dependent on the new streaming engine wish Features we would ideally want to support, but not right now

Comments

@kszlim
Copy link
Contributor

kszlim commented Jun 17, 2024

Description

Polars currently is the best dataframe experience for batch processing, it would be worth considering whether it'd be possible to support stream processing.

Some prior literature (for adding on streaming support) in this area exists within datafusion, it might be worth picking their brains:
apache/datafusion#9016
https://synnada.medium.com/running-windowing-queries-in-stream-processing-93068d3a5

This would be a killer feature for polars as you could now use one system to rule them all as opposed to having a bespoke/separate stream processing framework for real time analytics, unifying them would be great.

I'd imagine this could be taken into consideration during the construction of the new streaming engine.

Understandably this is a huge feature request and I totally understand if it's closed with a not planned.

@kszlim kszlim added the enhancement New feature or an improvement of an existing feature label Jun 17, 2024
@kszlim
Copy link
Contributor Author

kszlim commented Jun 17, 2024

Sorry didn't notice that #6839 was closed with out of scope, shall I close this, or is this something that might be in scope with the new streaming engine design? @ritchie46

@jkleckner
Copy link

I understand it may be out of scope. Just noting here the use of streaming for real time plots with hvplot and streamz DataFrames:

It is tantalizing to want to use Polars read_ipc_stream and write_ipc_stream with a dynamically large block during bulk processing and a small block during real time processing. Maybe even some sort of flush. It would be great for hvplot to be able to source that pyarrow ipc stream.

@kszlim
Copy link
Contributor Author

kszlim commented Aug 23, 2024

@orlp do you think supporting this use case is something that's feasible in the new streaming engine?

@orlp
Copy link
Collaborator

orlp commented Aug 23, 2024

@kszlim Just to confirm, when you mean "stream processing" you are referring to a long-running query with some aggregates or window functions, like an overall mean, sum, or e.g. the median of the last 10 minutes. Then you could poll the value of those aggregates as they're being computed, and see how they update with new data coming in?

If I understand you correctly and that is what you meant, the new streaming engine would certainly be a step towards it, but still be quite far off. I would not expect to see such streaming capabilities in Polars in 2024 or 2025. It is certainly something we have thought of as 'perhaps one day', but I wouldn't even accept this issue right now as it's not even on any roadmap currently.

@kszlim
Copy link
Contributor Author

kszlim commented Aug 23, 2024

@kszlim Just to confirm, when you mean "stream processing" you are referring to a long-running query with some aggregates or window functions, like an overall mean, sum, or e.g. the median of the last 10 minutes. Then you could poll the value of those aggregates as they're being computed, and see how they update with new data coming in?

Yes, that's exactly it. Thanks for the answer to this! It would really be a killer feature some time late in the future imo, hopefully polars will keep this in mind as the query engine evolves.

@CLWilliam
Copy link

@orlp What might than be the missing parts? Checkpointing for instance? Any other parts on top of your mind? Would be an interesting experiment to try out initially separate from Polars 🙂

@orlp
Copy link
Collaborator

orlp commented Aug 24, 2024

@CLWilliam Considering the new streaming engine doesn't even have window functions at all yet, and only the most basic aggregations, no way to poll them interactively, no support for detecting non-streamable queries yet, no infinite input streams, etc, etc.

@orlp orlp added blocked Cannot be worked on due to external dependencies, or significant new internal features needed first wish Features we would ideally want to support, but not right now new-streaming Features for or dependent on the new streaming engine labels Aug 24, 2024
@kszlim
Copy link
Contributor Author

kszlim commented Aug 24, 2024

I'll be very happy even if there's no commitment to this but just building with it in mind so that this feature could be added in the future and isn't hard blocked.

@AndreiPashkin
Copy link

I absolutely agree with this idea. Even simple traits that extend the current streaming API to enable stream processing of potentially unbounded data would be great. This way contributors and authors of third-party libraries could provide support for different backends.

@jkleckner
Copy link

@AndreiPashkin You might take a look at datafusion which has made some meaningful progress in that direction.

@emgeee
Copy link

emgeee commented Sep 4, 2024

Stumbled across this issue but we're working on turning DataFusion into a full-featured embedded stream processing engine complete with streaming windows, check-pointing/recovery, UDFs, and a python interface: https://github.com/probably-nothing-labs/denormalized

Feel free to reach out, would love any feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked Cannot be worked on due to external dependencies, or significant new internal features needed first enhancement New feature or an improvement of an existing feature new-streaming Features for or dependent on the new streaming engine wish Features we would ideally want to support, but not right now
Projects
None yet
Development

No branches or pull requests

6 participants