-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clickhouse query backend #1127
Clickhouse query backend #1127
Conversation
|
@kszucs Thanks! Will do a deeper review this weekend. Couple of minor things:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kszucs This is a good start, thanks for doing this!
My main concern is that there appears to be a lot of copy paste from the impala backend. I understand why you took this approach, but it's very difficult to review since it's hard to pick out what's Clickhouse specific vs not.
I would suggest implementing the bare minimum to get a viable SELECT operation. So, leave out DDL, special types like arrays, and window function support.
The main focus of this PR should be the client, scalar types, and a bare bones SQL translator.
| df = pd.read_csv(path, index_col=None, header=0, dtype=dtype) | ||
| if table == 'functional_alltypes': | ||
| df = df.rename(columns={'Unnamed: 0': 'Unnamed_0'}) | ||
| cols = ['date_string_col', 'string_col'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should already be strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a temporary workaround, I'd like to use zkostyan/clickhouse-sqlalchemy with pandas to_sql, but there are a couple of issues.
| df[cols] = df[cols].astype(str) | ||
| df.timestamp_col = df.timestamp_col.astype('datetime64[s]') | ||
| elif table == 'batting': | ||
| cols = ['playerID', 'teamID', 'lgID'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should already be strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
functional_alltypes.string_col is actually int64 which raises error on clickhouse side
| cols = df.select_dtypes([float]).columns | ||
| df[cols] = df[cols].fillna(0).astype(int) | ||
| elif table == 'awards_players': | ||
| cols = ['playerID', 'awardID', 'lgID', 'tie', 'notes'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should already be strings.
ibis/clickhouse/api.py
Outdated
| >>> client = ibis.clickhouse.connect( | ||
| ... host=clickhouse_host, | ||
| ... port=clickhouse_port, | ||
| ... hdfs_client=hdfs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no hdfs_client argument in this function.
ibis/clickhouse/api.py
Outdated
| Parameters | ||
| ---------- | ||
| host : str, optional | ||
| Host name of the clickhoused or HiveServer2 in Hive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just clickhoused, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, c&p...
ibis/clickhouse/identifiers.py
Outdated
| @@ -0,0 +1,169 @@ | |||
| # Copyright 2014 Cloudera Inc. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This copyright header isn't necessary.
ibis/clickhouse/identifiers.py
Outdated
| # limitations under the License. | ||
|
|
||
|
|
||
| _identifiers = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a frozenset so there isn't a linear time average case lookup for identifier quoting.
ibis/clickhouse/tests/test_array.py
Outdated
| @@ -0,0 +1,115 @@ | |||
| # @pytest.fixture | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this file and we'll add support for arrays in a follow up PR.
ibis/clickhouse/tests/test_client.py
Outdated
| @@ -0,0 +1,343 @@ | |||
| # Copyright 2014 Cloudera Inc. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this copyright header.
ibis/clickhouse/client.py
Outdated
| @@ -0,0 +1,951 @@ | |||
| # Copyright 2014 Cloudera Inc. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the copyright header.
|
@cpcloud Thanks for the review! I'll fix these. I'm on vacation until Sunday. I intend to implement as much as I can during the next week, including basic ddl operations too (I can split this up to multiple PRs though). |
|
I've resolved most of requests. Also I left a couple of commented test cases and TODO notes, because organizing them was a time consuming task. I want to implement or drop them before merging. |
|
Clickhouse supports a lot of hash functions, and I've implemented them in the compiler. The only problem is ibis expr is currently support only |
|
I think @cpcloud is out of e-mail access this week so I'll take a look through this today and let you know if I have any more feedback, and then merge. Thanks for all your work on this |
|
@wesm Thanks in advance! I'm not sure that we can consider this PR ready to merge though. |
|
I'd be OK with the clickhouse backend remaining WIP for a while as long as incremental patches do not break master. Let me review a bit later and see how things look |
ibis/expr/operations.py
Outdated
| @@ -1472,22 +1472,20 @@ class OuterJoin(Join): | |||
| pass | |||
|
|
|||
|
|
|||
| class LeftSemiJoin(Join): | |||
| class InnerSemiJoin(Join): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this different from a LEFT SEMI JOIN (or ANY LEFT JOIN in clickhouse parlance)?
If I understand the clickhouse JOIN documentation correctly it isn't.
If INNER is specified, the result will contain only those rows that have a matching row in the right table.If ANY is specified and there are multiple matching rows in the right table, only the first one will be joined.
Composing those two statements yields (paraphrasing):
If INNER is specified and ANY is specified the result will contain only the first row from the left table where there's a matching row in the right table.
Adding LEFT into the mix:
LEFT is specified, any rows in the left table that don’t have matching rows in the right table will be assigned the default value - zeros or empty rows.
Combining that with ANY from above would yield the same statement composed from the definitions of ANY and INNER that I wrote above.
Is there something I'm missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I've got confused, but here are the test cases from 49 to 53
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are four type of joins so must be a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it turns out there's a difference :)
ANY INNER JOINis most similar to aLEFT SEMI JOIN(which ibis supports)ANY LEFT JOINhas no equivalent single node operation in ibis.
I say "most similar to" instead of "equivalent" because the test suite shows that you can select rows from the right table.
In every relational database that I've ever used that supports a LEFT SEMI JOIN syntax you can only select fields from the left table. Databases that don't implement a syntax for this operation implement it through an EXISTS query, which precludes selecting from the right table (since there is no right table in that case).
The reason for this difference is probably because the relational algebraic definition of a semijoin only includes tuples from the left table.
This means that to fully support clickhouse joins we need to introduce new types of joins that are only available to the clickhouse backend.
You can implement this kind of join in other databases using ROW_NUMBER() OVER () AS i on the left table and filtering out only those rows where i = 1 (the first row to be joined).
I think the solution here is to bring back the any_left_join etc methods and we can implement them in other databases down the line as desired.
|
@kszucs I will give this another review today. |
ibis/expr/operations.py
Outdated
|
|
||
| """ | ||
| def _get_schema(self): | ||
| return self.left.schema() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In light of the clickhouse test suite the schema of this operation isn't known until you select columns out of it, similar to the rest of ibis's join operations.
ibis/expr/api.py
Outdated
| @@ -2472,6 +2473,8 @@ def _table_drop(self, fields): | |||
| join=join, | |||
| cross_join=cross_join, | |||
| inner_join=_regular_join_method('inner_join', 'inner'), | |||
| inner_semi_join=_regular_join_method('inner_semi_join', 'inner_semi'), | |||
| left_semi_join=_regular_join_method('left_semi_join', 'left_semi'), | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my below comments, remove these and bring back the any_left/any_inner (all_* isn't necessary because those have the conventional join semantics.)
|
@cpcloud done |
|
@cpcloud There is some packaging issue, setup.py wants to download toolz despite it's defined in the recipe. Also, clickhouse packages have just landed in conda-forge, we might define them in ibis builds. |
|
Just to recap, here are a couple of todos I have in mind:
Should I implement all of those in separate PRs? |
|
@kszucs It appears there are multiple issues here with the most recent CI run:
|
|
@kszucs Sweet, thanks for the reference. |
|
@kszucs Is it possible to run the tests only on Python 2.7 for now? |
|
Using the pip installed version, that is. |
|
I'll revert then, just a couple of minutes. |
|
Alternatively, you can install |
|
@cpcloud It's green now. |
|
@kszucs Is it possible for you to rebase on top of master? |
…ally supported table aliases
|
@cpcloud Rebase done. Thanks for the help! |
|
@kszucs Thanks for doing this! Excited to see use cases like this. Please keep the PRs and issues coming! |
This is the first draft for clickhouse backend #1120, based on impala.