Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(flink): introduce Apache Flink backend #6408

Merged
merged 1 commit into from
Jun 14, 2023

Conversation

chloeh13q
Copy link
Contributor

This PR looks to add support for Apache Flink (streaming) in Ibis.

Motivation

As data grows, developers want to do increasingly sophisticated data science, as well as reduce the end-to-end time to value. There has been an explosion of tools, libraries, and frameworks to democratize the performance and functionality needed to create increasingly complex data systems that address these new data science demands. That said, the combinatorics of tools has forced developers to spend an increasing amount of their time rewriting code or learning new APIs. Building and extending existing standards would ameliorate much of this pain by allowing developers to write code once and use it across a myriad of tools in the data tool chain. Ibis already works with numerous backends for batch analytics, and we want to extend Ibis into real-time systems, starting with Apache Flink.

Current State

Ibis is currently a batch-oriented library. All of the current supported backends derive from a batch paradigm (aside from Spark, which does offer support for stream processing, albeit using micro-batches underneath the hood).

Unlike batch systems, which operate on bounded data, streaming systems are designed with unbounded data in mind. In order to deal with an infinite data stream, streaming data systems operate with unique concepts such as “event time”, “processing time”, “watermark”, etc.

As streaming systems gain more use cases, there have been efforts to close the gap between batch and streaming. Flink SQL, for example, was born as a part of such effort and, through allowing users to write streaming engines in a SQL-like manner, have been vastly successful in that regard. The success of Flink SQL both validates the potential of stream and batch unification and inspires the community to push for better standards, a vision that Ibis is at a unique and valuable position to help build.

Adding support for a streaming engine like Flink, however, is non-trivial.

This PR

We would like to make the work to support Apache Flink & streaming in Ibis incremental. In the first stage of the implementation, we will focus on a "string-generating backend", where a top-level function (similar to ibis.backends.clickhouse.compiler.core.translate) can be used to handle just the expr -> SQL compilation. Later on we will use this as part of the backend, when we're ready to add support for it.

This PR introduces the aforementioned function and sets up the testing infrastructure via pytest-snapshot without introducing new APIs for the time being.

@chloeh13q chloeh13q force-pushed the feat/flink-1 branch 3 times, most recently from 20720ea to 7aef924 Compare June 9, 2023 19:33
@chloeh13q chloeh13q marked this pull request as ready for review June 9, 2023 20:08
poetry.lock Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! This is off to a good start - I've left some comments below.

README.md Outdated Show resolved Hide resolved
docs/backends/Flink.md Outdated Show resolved Hide resolved
docs/backends/index.md Outdated Show resolved Hide resolved


def _count_star(translator: ExprTranslator, op: ops.Node) -> str:
return "count(*)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're inheriting the base registry, CountStar should already be handled - I don't believe you'll need to define a handler for this op here.

Copy link
Contributor Author

@chloeh13q chloeh13q Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it compiles into something like SELECT t0.`i`, count(1) AS `count` instead of count(*)

ibis/backends/flink/registry.py Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
ibis/__init__.py Outdated Show resolved Hide resolved
gen_matrix.py Outdated Show resolved Hide resolved
ibis/backends/conftest.py Outdated Show resolved Hide resolved
ibis/backends/flink/tests/test_translator.py Show resolved Hide resolved
@chloeh13q chloeh13q force-pushed the feat/flink-1 branch 8 times, most recently from df8b339 to cd9d433 Compare June 14, 2023 09:22
@chloeh13q chloeh13q requested a review from jcrist June 14, 2023 09:45
@jcrist
Copy link
Member

jcrist commented Jun 14, 2023

Overall this looks good! I pushed up a commit fixing the lockfile changes - what version of poetry did you use when updating the lockfile?

Last thing - can you squash your commits? We generate our changelog from commit messages - specifically anything with feat or fix will feature highly in the changelog (see e.g. https://ibis-project.org/release_notes/#features). Commits with chore as the type are ignored. IMO this PR should ideally be a single commit with a message like feat(flink): add initial flink SQL compiler.

@chloeh13q
Copy link
Contributor Author

chloeh13q commented Jun 14, 2023

@jcrist Thanks for helping fix the lockfile changes. I'm using poetry 1.2.2. Also squashed the commits. Lmk if there's anything else you'd like me to address!

@cpcloud cpcloud added flink Issues or PRs related to Flink new backend PRs or issues related to adding new backends labels Jun 14, 2023
@cpcloud cpcloud added this to the 6.0 milestone Jun 14, 2023
Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thansk! 🚀

@jcrist jcrist merged commit 053a6d2 into ibis-project:master Jun 14, 2023
@chloeh13q chloeh13q deleted the feat/flink-1 branch June 14, 2023 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flink Issues or PRs related to Flink new backend PRs or issues related to adding new backends
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants