Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(risingwave): init impl for Risingwave #7954

Merged
merged 2 commits into from Jan 28, 2024

Conversation

KeXiangWang
Copy link
Contributor

@KeXiangWang KeXiangWang commented Jan 11, 2024

Description of changes

This is the initial PR for ibis to support risingwave.

RisingWave is a distributed SQL streaming database engineered to provide the simplest and most cost-efficient approach for processing and managing streaming data with utmost reliability.

After a few weeks of investigation, here's my phasic results.

  1. As Risingwave is largely compatible with Postgres, Ibis can be easily extended to support Risingwave. The main work is to add a new dialect for Risingwave, which is similar to the postgres dialect. I've almost finished this part in this PR. With this PR, Ibis can be used to connect to Risingwave and run some basic queries. I've manually tested some queries which works well and add some tests imitating Postgres Backend. I would appreciate if you can help test more queries.

  2. As ibis relies on SQLalchemy to support Postgres, we follow its implementation to support Risingwave. However, there are also some differences in semantics between Risingwave and Postgres, which require some modification in either ibis or SQLalchemy. sqlalchemy-risingwave is designed to reduce this mismatch. So in this PR, I introduce the new dependencies sqlalchemy-risingwave to ibis.

  3. Ibis has no support for Materialized View natively. However Materialized View is a core concept in RW, people use RW because of its convenient auto-updated Materialized View. Now, if a user wants to create a new MV, he needs to use a raw SQL. Adding DDLs like CreateMaterializedView, CreateSource and CreateSink in the base ddl file may help. We would appreciate it if you can help offer some suggestions.

Besides this, I also met some obstacle that may need your help.

  1. Risingwave hasn't supported TEMPORARY VIEW yet, so I changed some implementations relying on TEMPORARY VIEW to a normal view. For example, for the _metadata() functions, RW backend's implementation is con.exec_driver_sql(f"CREATE VIEW IF NOT EXISTS {name} AS {query}"). While in PG it's con.exec_driver_sql(f"CREATE TEMPORARY VIEW {name} AS {query}"). Do you have any suggestions for other ways to work around this?
    Besides, I didn't quite get what _metadata() is doing here. I would appreciate it if you could explain it a bit.

  2. There's some mismatch between the postgres dialect and risingwave dialect, which are still not fully tested in this PR. We'll continue to work on it.

  3. This PR requires some new features of Risingwave v1.6.0 and sqlalchemy-risingwave v1.0.0 which are not released yet. They'll be released soon. Done. BTW, how should I indicate this backend is only for risingwave > 1.6?

  4. I don't quite understand the test pipeline of Ibis. I copied the test cases from the postgres dialect and modified them to fit the risingwave dialect, and some of them are commented temporarily due to the lack of support. I also added an SQL script to help set up the test environment, which creates tables and loads data. But I don't know how to run it in the test pipeline. Any suggestions or guidance are welcomed. I suppose the test pipeline would require a docker image of Risingwave. We can provide one if needed.

  5. I'm a newbie in the ibis community, this PR may not be perfect considering others. Any comments are welcomed and I sincerely appreciate your time and patience.

closes #8038

@deepyaman
Copy link
Contributor

Just a couple views on your questions, didn't answer everything:

  1. This PR requires some new features of Risingwave v1.6.0 and sqlalchemy-risingwave v1.0.0 which are not released yet. They'll be released soon. BTW, how should I indicate this backend is only for risingwave > 1.6?

I assume you could just include it in the backend docs, perhaps similar to how you have some notes for Oracle here: https://ibis-project.org/backends/oracle#connecting-to-older-oracle-databases

  1. I don't quite understand the test pipeline of Ibis. I copied the test cases from the postgres dialect and modified them to fit the risingwave dialect, and some of them are commented temporarily due to the lack of support. I also added an SQL script to help set up the test environment, which creates tables and loads data. But I don't know how to run it in the test pipeline. Any suggestions or guidance are welcomed. I suppose the test pipeline would require a docker image of Risingwave. We can provide one if needed.

Yes, it would require relevant Docker Compose configuration be added to compose.yaml, and then the TestConf would be able to load the data (as you've already defined in ci/schema/risingwave.sql).

@KeXiangWang
Copy link
Contributor Author

@deepyaman Hi Deepyaman, can you help invite more reviewers to push this PR forward. It seems that I have no access to invite.

@cpcloud
Copy link
Member

cpcloud commented Jan 17, 2024

@KeXiangWang Thanks for the PR and welcome to ibis 👋🏻, exciting work! I'll respond to your comments and questions inline.

I would appreciate if you can help test more queries.

We'll get into this a bit later, but ideally we don't need to crowd source testing because we have the backend test suite to help us out!

So in this PR, I introduce the new dependencies sqlalchemy-risingwave to ibis.

Sounds good! FYI we'll be moving away from sqlalchemy in two releases (version 9.0.0) and we'll be using sqlglot instead. In the case of risingwave I suspect that it will maintain a similar relationship to the postgres implementation, which is that a some of or a good chunk of the code can be shared.

Ibis has no support for Materialized View natively. ... We would appreciate it if you can help offer some suggestions.

Since RisingWave would be the first backend to support this, I suggest we avoid a new DDL object (these are in need of refactoring) for the time being and simply implement a create_materialize_view(self, ...) API directly on the RisingWave backend.

Risingwave hasn't supported TEMPORARY VIEW yet

Totally fine.

I didn't quite get what _metadata() is doing here.

_metadata accepts a query string and returns an iterator of 2-tuples, where the first element of the tuple is a column name and the second element is an ibis type. This method is used to discover the schema (names and types) of arbitrary queries. It's used in con.sql and TableExpr.sql.

how should I indicate this backend is only for risingwave > 1.6?

I think this is probably best communicated in our documentation.

I don't quite understand the test pipeline of Ibis.

Strap in 😂

Generally speaking the tests in ibis/backend/tests run for any and all backends. This is true regardless of backend architecture and deployment.

There are also usually backend-specific tests, those go in ibis/backends/$BACKEND/tests.

To run tests you'd write:

pytest -m risingwave

This will setup the data and find all the right tests (every test under ibis/backends/tests and every test under ibis/backends/risingwave/tests) and then run them.

The other piece of the puzzle is running RisingWave itself. I understand y'all publish a container, so that needs to be set up in our compose.yaml file.

When the PR is closer to ready (local tests all passing for at least one person 😅) we can roll it into CI, which in theory won't be that different from any of our other service-based backends like Trino or ClickHouse.

Excited to have you on board!

Copy link
Contributor

@chloeh13q chloeh13q left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, from a maintainability perspective, I'm wondering if RisingWave intends to continue to maintain/increase compatibility with Postgres? @KeXiangWang If so, would it make sense to try to reuse functions and classes from the postgres backend as much as possible so that the same code doesn't have to be maintained in two places?

ibis/backends/risingwave/__init__.py Show resolved Hide resolved
ibis/backends/risingwave/__init__.py Show resolved Hide resolved
@KeXiangWang
Copy link
Contributor Author

Hi @cpcloud . Thank you for the detailed explanation. I now understand how to enhance this PR and will make the necessary improvements ASAP.

compose.yaml Outdated Show resolved Hide resolved
compose.yaml Outdated Show resolved Hide resolved
compose.yaml Outdated Show resolved Hide resolved
compose.yaml Outdated Show resolved Hide resolved
compose.yaml Show resolved Hide resolved
docker/risingwave/risingwave.toml Show resolved Hide resolved
@cpcloud
Copy link
Member

cpcloud commented Jan 24, 2024

😮 Extremely impressive, you've already got the backend working in CI (there's an XPASS, but that's generally good news and easily addressed by removing "risingwave" from the decorator's backend list).

compose.yaml Outdated Show resolved Hide resolved
ibis/backends/risingwave/tests/conftest.py Outdated Show resolved Hide resolved
ibis/backends/risingwave/tests/test_client.py Outdated Show resolved Hide resolved
ibis/backends/risingwave/tests/test_client.py Outdated Show resolved Hide resolved
ibis/backends/risingwave/tests/test_client.py Outdated Show resolved Hide resolved
@KeXiangWang
Copy link
Contributor Author

All the tests have passed now. Can we first merge this PR? I prefer to continue on the critical features like create_materialized_view, create_source, and create_sink in new PRs. Let's carry it out step by step.

Copy link
Member

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good overall, @KeXiangWang !

I'd like to not merge in a bunch of commented-out tests. They should either be removed (especially if they've duplicative of the tests in the general backend test suite) or xfailed using the appropriate markers.

ibis/backends/risingwave/tests/test_client.py Outdated Show resolved Hide resolved
ibis/backends/risingwave/tests/test_client.py Outdated Show resolved Hide resolved
ibis/backends/risingwave/tests/test_functions.py Outdated Show resolved Hide resolved
ibis/backends/risingwave/tests/test_functions.py Outdated Show resolved Hide resolved
ibis/backends/risingwave/tests/test_functions.py Outdated Show resolved Hide resolved
ibis/backends/risingwave/tests/test_functions.py Outdated Show resolved Hide resolved
@lostmygithubaccount
Copy link
Member

lostmygithubaccount commented Jan 25, 2024

can we include this in the v8 release assuming it is merged today or tomorrow?

that would be better from a "we're announcing streaming support in Ibis" perspective, but also not critical (if it doesn't get in we'll just announce it's in progress for RisingWave)

@gforsyth
Copy link
Member

I'm going to add the risingwave sqlalchemy dialect to conda-forge so we have the option of releasing a conda-forge version of this backend sooner rather than later

@KeXiangWang
Copy link
Contributor Author

I'm going to add the risingwave sqlalchemy dialect to conda-forge.

Thanks! Anything we can help?

@gforsyth
Copy link
Member

Hey @KeXiangWang -- I'm adding you as a feedstock maintainer, so if you can comment in conda-forge/staged-recipes#25146 saying that you agree to be a maintainer that will move this along.

@KeXiangWang
Copy link
Contributor Author

Got it. It seems the name in conda-forge/staged-recipes#25146 is incorrect. My github handle is KeXiangWang instead of KeiXiangWang

@gforsyth
Copy link
Member

Got it. It seems the name in conda-forge/staged-recipes#25146 is incorrect. My github handle is KeXiangWang instead of KeiXiangWang

Yeah, typo on my part, should be fixed now!

@cpcloud
Copy link
Member

cpcloud commented Jan 27, 2024

I'm able to run the backend test suite locally 🎉

I'm going to push up a few changes:

  • I was able to reuse the minio configuration.
  • Set the config file to a ro (readonly) volume mount
  • Disable telemetry
  • Reduce the # of lines of text for command
  • Added a comment about using a release when posix_fs is launched
  • Set restart to on-failure

I'll rebase on main first and then push up those changes.

@cpcloud cpcloud added feature Features or general enhancements new backend PRs or issues related to adding new backends labels Jan 27, 2024
@cpcloud
Copy link
Member

cpcloud commented Jan 27, 2024

@KeXiangWang Can you enable allowing maintainers to edit the pull request?

@cpcloud cpcloud added the risingwave The RisingWave backend label Jan 27, 2024
@cpcloud
Copy link
Member

cpcloud commented Jan 27, 2024

After removing the --config-path arguments and removing the /risingwave.toml config file volume mount the config file appears to be unnecessary.

@KeXiangWang
Copy link
Contributor Author

KeXiangWang commented Jan 27, 2024

enable allowing maintainers to edit the pull request?

Sorry I cannot find a button for that.

Alternatively, can you push the updates to a new branch? I can first rebase my branch on it, and then push the updates to this PR.

@deepyaman
Copy link
Contributor

enable allowing maintainers to edit the pull request?

Sorry I cannot find a button for that.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests

You should see it on the right sidebar of the PR; I checked on one of mine.

@KeXiangWang
Copy link
Contributor Author

You should see it on the right sidebar of the PR

I came across the same page. It's weird but there's no such a button for this PR.🙁

@cpcloud
Copy link
Member

cpcloud commented Jan 28, 2024

Actually, I think I'll just follow up with the changes in a separate PR.

@KeXiangWang Can you rebase on main and resolve the conflicts? Once that is done and CI is green I will merge!

ci/schema/risingwave.sql Outdated Show resolved Hide resolved
@cpcloud cpcloud added this to the 8.0 milestone Jan 28, 2024
@KeXiangWang
Copy link
Contributor Author

Can you rebase on main and resolve the conflicts?

Done

@cpcloud cpcloud merged commit 351747a into ibis-project:main Jan 28, 2024
82 checks passed
@cpcloud
Copy link
Member

cpcloud commented Jan 28, 2024

Thanks @KeXiangWang great work!

gforsyth pushed a commit to gforsyth/ibis that referenced this pull request Jan 31, 2024
This is the initial PR for ibis to support risingwave.

[RisingWave](https://github.com/risingwavelabs/risingwave) is a
distributed SQL streaming database engineered to provide the simplest
and most cost-efficient approach for processing and managing streaming
data with utmost reliability.

After a few weeks of investigation, here's my phasic results.

1. As Risingwave is largely compatible with Postgres, Ibis can be easily
extended to support Risingwave. The main work is to add a new dialect
for Risingwave, which is similar to the `postgres` dialect. I've almost
finished this part in this PR. With this PR, Ibis can be used to connect
to Risingwave and run some basic queries. I've manually tested some
queries which works well and add some tests imitating Postgres Backend.
I would appreciate if you can help test more queries.

2. As ibis relies on SQLalchemy to support Postgres, we follow its
implementation to support Risingwave. However, there are also some
differences in semantics between Risingwave and Postgres, which require
some modification in either ibis or SQLalchemy. `sqlalchemy-risingwave`
is designed to reduce this mismatch. So in this PR, I introduce the new
dependencies `sqlalchemy-risingwave` to ibis.

3. Ibis has no support for Materialized View natively. However
Materialized View is a core concept in RW, people use RW because of its
convenient auto-updated Materialized View. Now, if a user wants to
create a new MV, he needs to use a raw SQL. Adding DDLs like
`CreateMaterializedView`, `CreateSource` and `CreateSink` in the [base
ddl
file](https://github.com/ibis-project/ibis/blob/main/ibis/backends/base/sql/ddl.py)
may help. We would appreciate it if you can help offer some suggestions.

Besides this, I also met some obstacle that may need your help.

1. Risingwave hasn't supported `TEMPORARY VIEW` yet, so I changed some
implementations relying on `TEMPORARY VIEW` to a normal view. For
example, for the `_metadata()` functions, RW backend's implementation is
`con.exec_driver_sql(f"CREATE VIEW IF NOT EXISTS {name} AS {query}")`.
While in PG it's `con.exec_driver_sql(f"CREATE TEMPORARY VIEW {name} AS
{query}")`. Do you have any suggestions for other ways to work around
this?
Besides, I didn't quite get what `_metadata()` is doing here. I would
appreciate it if you could explain it a bit.

2. There's some mismatch between the `postgres` dialect and `risingwave`
dialect, which are still not fully tested in this PR. We'll continue to
work on it.

3. ~~This PR requires some new features of Risingwave v1.6.0 and
sqlalchemy-risingwave v1.0.0 which are not released yet. They'll be
released soon.~~ Done. BTW, how should I indicate this backend is only
for risingwave > 1.6?

4. I don't quite understand the test pipeline of Ibis. I copied the test
cases from the `postgres` dialect and modified them to fit the
`risingwave` dialect, and some of them are commented temporarily due to
the lack of support. I also added an SQL script to help set up the test
environment, which creates tables and loads data. But I don't know how
to run it in the test pipeline. Any suggestions or guidance are
welcomed. I suppose the test pipeline would require a docker image of
Risingwave. We can provide one if needed.

5. I'm a newbie in the ibis community, this PR may not be perfect
considering others. Any comments are welcomed and I sincerely appreciate
your time and patience.

closes ibis-project#8038
cpcloud pushed a commit to gforsyth/ibis that referenced this pull request Feb 1, 2024
This is the initial PR for ibis to support risingwave.

[RisingWave](https://github.com/risingwavelabs/risingwave) is a
distributed SQL streaming database engineered to provide the simplest
and most cost-efficient approach for processing and managing streaming
data with utmost reliability.

After a few weeks of investigation, here's my phasic results.

1. As Risingwave is largely compatible with Postgres, Ibis can be easily
extended to support Risingwave. The main work is to add a new dialect
for Risingwave, which is similar to the `postgres` dialect. I've almost
finished this part in this PR. With this PR, Ibis can be used to connect
to Risingwave and run some basic queries. I've manually tested some
queries which works well and add some tests imitating Postgres Backend.
I would appreciate if you can help test more queries.

2. As ibis relies on SQLalchemy to support Postgres, we follow its
implementation to support Risingwave. However, there are also some
differences in semantics between Risingwave and Postgres, which require
some modification in either ibis or SQLalchemy. `sqlalchemy-risingwave`
is designed to reduce this mismatch. So in this PR, I introduce the new
dependencies `sqlalchemy-risingwave` to ibis.

3. Ibis has no support for Materialized View natively. However
Materialized View is a core concept in RW, people use RW because of its
convenient auto-updated Materialized View. Now, if a user wants to
create a new MV, he needs to use a raw SQL. Adding DDLs like
`CreateMaterializedView`, `CreateSource` and `CreateSink` in the [base
ddl
file](https://github.com/ibis-project/ibis/blob/main/ibis/backends/base/sql/ddl.py)
may help. We would appreciate it if you can help offer some suggestions.

Besides this, I also met some obstacle that may need your help.

1. Risingwave hasn't supported `TEMPORARY VIEW` yet, so I changed some
implementations relying on `TEMPORARY VIEW` to a normal view. For
example, for the `_metadata()` functions, RW backend's implementation is
`con.exec_driver_sql(f"CREATE VIEW IF NOT EXISTS {name} AS {query}")`.
While in PG it's `con.exec_driver_sql(f"CREATE TEMPORARY VIEW {name} AS
{query}")`. Do you have any suggestions for other ways to work around
this?
Besides, I didn't quite get what `_metadata()` is doing here. I would
appreciate it if you could explain it a bit.

2. There's some mismatch between the `postgres` dialect and `risingwave`
dialect, which are still not fully tested in this PR. We'll continue to
work on it.

3. ~~This PR requires some new features of Risingwave v1.6.0 and
sqlalchemy-risingwave v1.0.0 which are not released yet. They'll be
released soon.~~ Done. BTW, how should I indicate this backend is only
for risingwave > 1.6?

4. I don't quite understand the test pipeline of Ibis. I copied the test
cases from the `postgres` dialect and modified them to fit the
`risingwave` dialect, and some of them are commented temporarily due to
the lack of support. I also added an SQL script to help set up the test
environment, which creates tables and loads data. But I don't know how
to run it in the test pipeline. Any suggestions or guidance are
welcomed. I suppose the test pipeline would require a docker image of
Risingwave. We can provide one if needed.

5. I'm a newbie in the ibis community, this PR may not be perfect
considering others. Any comments are welcomed and I sincerely appreciate
your time and patience.

closes ibis-project#8038
gforsyth pushed a commit that referenced this pull request Apr 7, 2024
Risingwave plays a role as a computing engine and storage system in
Streaming ecosystems.
Usually, a streaming workload will include two/three systems like this:
`upstream data source --> Risingwave`
Or,
`upstream data source --> Risingwave --> downstream data sink`

This PR adds streaming support, mainly new relations related to
streaming, for the Risingwave backend.

To be specific, Four types of new relations are introduced:
- `Source`, data sources that encompass a connector connected to an
upstream data system like Kafka. A source works like an extraction or a
placeholder for an upstream system, and although it has columns, it does
not store any data itself.
- `Materialized View`, which is Risingwave's core concept for streaming.
One can create an MV with a query on existing tables or sources. The
data in MV is automatically updated in a real-time way. Users can access
the data with a `select` statement just like accessing a table.
- `Table` with connector, works like a combination of a source and a
normal table. The difference between `Table` with connector and `Source`
is that, once the table is created, it will automatically start to
consume data from upstream systems and update it into Risingwave.
- `Sink`, unlink a `Materialized View` which stores the result of a
query in RW, users can choose to sink the result to a downstream system,
e.g. Redis, and then read the data in the downstream system. Unlike an
MV, User cannot access the data directly from a sink.

Some minor changes:
1. Bump the image version to a new nightly build, as the new image
solves some issues found in the [previous
PR](#7954).
2. Mark one test, `test_semi_join`, as skipped for risingwave backend.
The test sometimes stuck and it seems to be Risingwave's fault. I'll
continue to investigate.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements new backend PRs or issues related to adding new backends risingwave The RisingWave backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: RisingWave backend
6 participants