
Clickhouse query backend #1127

Closed · wants to merge 51 commits

Conversation

kszucs
Member

@kszucs kszucs commented Aug 21, 2017

This is the first draft of the ClickHouse backend (#1120), based on the Impala backend.

@cpcloud cpcloud self-assigned this Aug 22, 2017
@cpcloud cpcloud added the feature Features or general enhancements label Aug 22, 2017
@cpcloud cpcloud added this to the 0.12 milestone Aug 22, 2017
@cpcloud
Member

cpcloud commented Aug 22, 2017

@kszucs Thanks! Will do a deeper review this weekend.

Couple of minor things:

  1. If you enable CircleCI and AppVeyor on your fork, you can run builds faster because you get your own set of Docker containers (as opposed to using those allocated to the ibis-project user).
  2. It looks like there might be conflicting ports. The Impala Docker image has fs.default.name set to port 9000, which appears to also be the port that ClickHouse uses. I'm happy to change the default in the Impala Docker image, or if it's easier for you to set a different port, that's also a viable solution. I can do whatever requires less work for you.

@cpcloud cpcloud self-requested a review August 22, 2017 13:11
Member

@cpcloud cpcloud left a comment


@kszucs This is a good start, thanks for doing this!

My main concern is that there appears to be a lot of copy paste from the impala backend. I understand why you took this approach, but it's very difficult to review since it's hard to pick out what's Clickhouse specific vs not.

I would suggest implementing the bare minimum to get a viable SELECT operation. So, leave out DDL, special types like arrays, and window function support.

The main focus of this PR should be the client, scalar types, and a bare bones SQL translator.

df = pd.read_csv(path, index_col=None, header=0, dtype=dtype)
if table == 'functional_alltypes':
df = df.rename(columns={'Unnamed: 0': 'Unnamed_0'})
cols = ['date_string_col', 'string_col']
Member


These should already be strings.

Member Author


This is just a temporary workaround. I'd like to use xzkostyan/clickhouse-sqlalchemy with pandas to_sql, but there are a couple of issues.

df[cols] = df[cols].astype(str)
df.timestamp_col = df.timestamp_col.astype('datetime64[s]')
elif table == 'batting':
cols = ['playerID', 'teamID', 'lgID']
Member


These should already be strings.

Member Author


functional_alltypes.string_col is actually int64, which raises an error on the ClickHouse side.

cols = df.select_dtypes([float]).columns
df[cols] = df[cols].fillna(0).astype(int)
elif table == 'awards_players':
cols = ['playerID', 'awardID', 'lgID', 'tie', 'notes']
Member


These should already be strings.

>>> client = ibis.clickhouse.connect(
... host=clickhouse_host,
... port=clickhouse_port,
... hdfs_client=hdfs,
Member


There's no hdfs_client argument in this function.

Parameters
----------
host : str, optional
Host name of the clickhoused or HiveServer2 in Hive
Member


This is just clickhoused, correct?

Member Author


Correct, c&p...

@@ -0,0 +1,169 @@
# Copyright 2014 Cloudera Inc.
Member


This copyright header isn't necessary.

# limitations under the License.


_identifiers = (
Member


This should be a frozenset so there isn't a linear time average case lookup for identifier quoting.
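The suggestion above can be sketched as follows. This is a minimal illustration, not the actual ibis implementation; the keyword list and backtick quoting style are assumptions:

```python
# Sketch of frozenset-based identifier quoting. The keyword list and
# backtick style are illustrative assumptions, not the real ibis code.
_identifiers = frozenset({'select', 'from', 'where', 'group', 'order', 'by'})


def quote_identifier(name, force=False):
    # frozenset membership is O(1) on average, vs a linear scan of a tuple
    if force or name.lower() in _identifiers:
        return '`{}`'.format(name)
    return name
```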

@@ -0,0 +1,115 @@
# @pytest.fixture
Member


Please remove this file and we'll add support for arrays in a follow up PR.

@@ -0,0 +1,343 @@
# Copyright 2014 Cloudera Inc.
Member


No need for this copyright header.

@@ -0,0 +1,951 @@
# Copyright 2014 Cloudera Inc.
Member


Remove the copyright header.

@kszucs
Member Author

kszucs commented Aug 28, 2017

@cpcloud Thanks for the review! I'll fix these. I'm on vacation until Sunday. I intend to implement as much as I can during the next week, including basic DDL operations (I can split this into multiple PRs though).
At the end of next week I'll clean up all the TODOs and commented-out parts.

@kszucs
Member Author

kszucs commented Sep 3, 2017

I've resolved most of the requests. I've also left a couple of commented-out test cases and TODO notes, because organizing them was time consuming. I want to implement or drop them before merging.

@kszucs
Member Author

kszucs commented Sep 5, 2017

ClickHouse supports a lot of hash functions, and I've implemented them in the compiler. The only problem is that ibis expressions currently support only FNV.

@wesm
Member

wesm commented Sep 5, 2017

I think @cpcloud is out of e-mail access this week, so I'll take a look through this today, let you know if I have any more feedback, and then merge. Thanks for all your work on this!

@kszucs
Member Author

kszucs commented Sep 5, 2017

@wesm Thanks in advance! I'm not sure we can consider this PR ready to merge, though.

@wesm
Member

wesm commented Sep 5, 2017

I'd be OK with the ClickHouse backend remaining a WIP for a while, as long as incremental patches don't break master. Let me review a bit later and see how things look.

@@ -1472,22 +1472,20 @@ class OuterJoin(Join):
pass


class LeftSemiJoin(Join):
class InnerSemiJoin(Join):
Member

@cpcloud cpcloud Sep 10, 2017


How is this different from a LEFT SEMI JOIN (or ANY LEFT JOIN in clickhouse parlance)?

If I understand the clickhouse JOIN documentation correctly it isn't.

  • If INNER is specified, the result will contain only those rows that have a matching row in the right table.
  • If ANY is specified and there are multiple matching rows in the right table, only the first one will be joined.

Composing those two statements yields (paraphrasing):

  • If INNER is specified and ANY is specified the result will contain only the first row from the left table where there's a matching row in the right table.

Adding LEFT into the mix:

  • LEFT is specified, any rows in the left table that don’t have matching rows in the right table will be assigned the default value - zeros or empty rows.

Combining that with ANY from above would yield the same statement composed from the definitions of ANY and INNER that I wrote above.

Is there something I'm missing here?

Member Author


Actually, I got confused, but here are the test cases from 49 to 53.

Member Author


There are four types of joins, so there must be a difference.

Member


So it turns out there's a difference :)

  • ANY INNER JOIN is most similar to a LEFT SEMI JOIN (which ibis supports)
  • ANY LEFT JOIN has no equivalent single node operation in ibis.

I say "most similar to" instead of "equivalent" because the test suite shows that you can select rows from the right table.

In every relational database that I've ever used that supports a LEFT SEMI JOIN syntax you can only select fields from the left table. Databases that don't implement a syntax for this operation implement it through an EXISTS query, which precludes selecting from the right table (since there is no right table in that case).

The reason for this difference is probably because the relational algebraic definition of a semijoin only includes tuples from the left table.

This means that to fully support clickhouse joins we need to introduce new types of joins that are only available to the clickhouse backend.

You can implement this kind of join in other databases using ROW_NUMBER() OVER () AS i on the left table and filtering out only those rows where i = 1 (the first row to be joined).

I think the solution here is to bring back the any_left_join etc methods and we can implement them in other databases down the line as desired.
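As a rough illustration of the ANY LEFT JOIN semantics described above (each left row joins at most the first matching right row), here is a pandas sketch. The function name and this approach are hypothetical illustrations, not how the backend implements it:

```python
import pandas as pd


def any_left_join(left, right, on):
    """Rough pandas emulation of ClickHouse's ANY LEFT JOIN: each left row
    joins at most the first matching right row; unmatched left rows get
    nulls (ClickHouse would fill in type defaults instead)."""
    # keep only the first right-table row per join key
    first_match = right.drop_duplicates(subset=on, keep='first')
    return left.merge(first_match, on=on, how='left')


left = pd.DataFrame({'k': [1, 2, 3]})
right = pd.DataFrame({'k': [1, 1, 2], 'v': ['a', 'b', 'c']})
result = any_left_join(left, right, on='k')
# k=1 joins only 'a' (the first match); k=3 has no match
```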

@kszucs
Member Author

kszucs commented Sep 12, 2017

@cpcloud @wesm please review

@cpcloud
Member

cpcloud commented Sep 14, 2017

@kszucs I will give this another review today.


"""
def _get_schema(self):
return self.left.schema()
Member


In light of the clickhouse test suite the schema of this operation isn't known until you select columns out of it, similar to the rest of ibis's join operations.

ibis/expr/api.py Outdated
@@ -2472,6 +2473,8 @@ def _table_drop(self, fields):
join=join,
cross_join=cross_join,
inner_join=_regular_join_method('inner_join', 'inner'),
inner_semi_join=_regular_join_method('inner_semi_join', 'inner_semi'),
left_semi_join=_regular_join_method('left_semi_join', 'left_semi'),
Member


Per my comments below, remove these and bring back any_left/any_inner (all_* isn't necessary because those have the conventional join semantics).

@kszucs
Member Author

kszucs commented Sep 18, 2017

@cpcloud done

@kszucs kszucs changed the title from [WIP] Clickhouse backend to Clickhouse backend Sep 26, 2017
@kszucs
Member Author

kszucs commented Oct 5, 2017

@cpcloud There is some packaging issue: setup.py wants to download toolz despite its being defined in the recipe.

Also, ClickHouse packages have just landed on conda-forge; we might use them in ibis builds.

@kszucs
Member Author

kszucs commented Oct 5, 2017

Just to recap, here are a couple of todos I have in mind:

  • a better pandas wrapper around clickhouse-driver, to avoid changing the data orientation twice (columnar → row-wise → columnar)
  • support passing external tables to clickhouse
  • clickhouse async connection (thanks to @xzkostyan we now have aioch)
  • DDL operations
  • support arrays

Should I implement all of those in separate PRs?
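The first item above could look roughly like this. The sketch assumes clickhouse-driver's column-oriented result shape from `execute(..., columnar=True, with_column_types=True)`; the helper name is hypothetical, and a fake result is used here in place of a live server:

```python
import pandas as pd


def columnar_to_frame(data, column_types):
    """Build a DataFrame directly from column-oriented results, avoiding a
    columnar -> row-wise -> columnar round trip."""
    names = [name for name, _type in column_types]
    return pd.DataFrame(dict(zip(names, data)), columns=names)


# Fake result in the shape clickhouse-driver returns with
# columnar=True, with_column_types=True (no live server needed here):
data = [(1, 2, 3), ('a', 'b', 'c')]
column_types = [('id', 'UInt8'), ('name', 'String')]
df = columnar_to_frame(data, column_types)
```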

@cpcloud
Member

cpcloud commented Oct 10, 2017

@kszucs It appears there are multiple issues here with the most recent CI run:

  • Python 2.7 - no dependency issues, looks like a transient failure.
  • Python 3.4 - The SQLAlchemy clickhouse driver conda package isn't compatible with anything less than Python 3.5, this will also show up with Python 2.7 once we get to that point in the build.
  • Python 3.5 - Even though the SQLAlchemy clickhouse driver conda package is only compatible with Python 3.5+ it appears to require the enum34 package which is already in the stdlib as the enum package in 3.4 and up.
  • Python 3.6 - Same issue as Python 3.5.

@kszucs
Member Author

kszucs commented Oct 10, 2017

@cpcloud
Member

cpcloud commented Oct 10, 2017

@kszucs Sweet, thanks for the reference.

@cpcloud
Member

cpcloud commented Oct 10, 2017

@kszucs Is it possible to run the tests only on Python 2.7 for now?

@cpcloud
Member

cpcloud commented Oct 10, 2017

Using the pip installed version, that is.

@kszucs
Member Author

kszucs commented Oct 10, 2017

I'll revert then; just a couple of minutes.

@cpcloud
Member

cpcloud commented Oct 10, 2017

Alternatively, you can install enum34 with Python 3.4+ but I'm not sure if this kind of setup has well-defined behavior.

@kszucs
Member Author

kszucs commented Oct 10, 2017

@cpcloud It's green now.

@cpcloud
Member

cpcloud commented Oct 10, 2017

@kszucs Is it possible for you to rebase on top of master?

@kszucs
Member Author

kszucs commented Oct 13, 2017

@cpcloud Rebase done. Thanks for the help!

@cpcloud cpcloud closed this in 3324d1f Oct 15, 2017
@cpcloud
Member

cpcloud commented Oct 15, 2017

@kszucs Thanks for doing this! Excited to see use cases like this. Please keep the PRs and issues coming!

@kszucs kszucs deleted the clickhouse branch December 29, 2017 14:09