Clickhouse query backend #1127

Closed
wants to merge 51 commits into
base: master
from

4 participants
@kszucs
Member

kszucs commented Aug 21, 2017

This is the first draft for clickhouse backend #1120, based on impala.

@cpcloud cpcloud self-assigned this Aug 22, 2017

@cpcloud cpcloud added the enhancement label Aug 22, 2017

@cpcloud cpcloud added this to the 0.12 milestone Aug 22, 2017

@cpcloud

Member

cpcloud commented Aug 22, 2017

@kszucs Thanks! Will do a deeper review this weekend.

Couple of minor things:

  1. If you enable CircleCI and Appveyor on your fork you can run builds faster because you get your own set of docker containers to run (as opposed to using those allocated to the ibis-project user).
  2. It looks like there might be conflicting ports. The Impala docker image has fs.default.name set to port 9000, which it looks like is also the port that clickhouse uses. I'm happy to change the default in the Impala docker image, or if it's easier for you to set a different port then that's also a viable solution. I can do whatever requires less work for you.

@cpcloud cpcloud self-requested a review Aug 22, 2017

@cpcloud

@kszucs This is a good start, thanks for doing this!

My main concern is that there appears to be a lot of copy-paste from the impala backend. I understand why you took this approach, but it makes the PR very difficult to review, since it's hard to pick out what's Clickhouse-specific and what isn't.

I would suggest implementing the bare minimum to get a viable SELECT operation. So, leave out DDL, special types like arrays, and window function support.

The main focus of this PR should be the client, scalar types, and a bare bones SQL translator.

df = pd.read_csv(path, index_col=None, header=0, dtype=dtype)
if table == 'functional_alltypes':
    df = df.rename(columns={'Unnamed: 0': 'Unnamed_0'})
    cols = ['date_string_col', 'string_col']

@cpcloud

cpcloud Aug 27, 2017

Member

These should already be strings.

@kszucs

kszucs Aug 28, 2017

Member

This is just a temporary workaround. I'd like to use xzkostyan/clickhouse-sqlalchemy with pandas to_sql, but there are a couple of issues.

    df[cols] = df[cols].astype(str)
    df.timestamp_col = df.timestamp_col.astype('datetime64[s]')
elif table == 'batting':
    cols = ['playerID', 'teamID', 'lgID']

@cpcloud

cpcloud Aug 27, 2017

Member

These should already be strings.

@kszucs

kszucs Sep 4, 2017

Member

functional_alltypes.string_col is actually int64, which raises an error on the ClickHouse side.
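
To illustrate the dtype problem: a minimal sketch, assuming a small stand-in frame (the values are made up and are not the real functional_alltypes data) in which string_col was parsed as integers. Casting the listed columns with astype(str), as the loader under review does, turns them into Python strings before they are sent to ClickHouse.

```python
import pandas as pd

# Hypothetical stand-in for functional_alltypes; values are illustrative only.
df = pd.DataFrame({
    'date_string_col': ['01/01/09', '01/02/09'],
    'string_col': [0, 1],  # parsed as integers, as reported in the thread
})

cols = ['date_string_col', 'string_col']
df[cols] = df[cols].astype(str)  # same cast as in the loader under review

string_values = df['string_col'].tolist()
```

After the cast, every value in string_col is a str, so the insert no longer sends integers into a string column.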

    cols = df.select_dtypes([float]).columns
    df[cols] = df[cols].fillna(0).astype(int)
elif table == 'awards_players':
    cols = ['playerID', 'awardID', 'lgID', 'tie', 'notes']

@cpcloud

cpcloud Aug 27, 2017

Member

These should already be strings.

>>> client = ibis.clickhouse.connect(
...     host=clickhouse_host,
...     port=clickhouse_port,
...     hdfs_client=hdfs,

@cpcloud

cpcloud Aug 27, 2017

Member

There's no hdfs_client argument in this function.

Parameters
----------
host : str, optional
Host name of the clickhoused or HiveServer2 in Hive

@cpcloud

cpcloud Aug 27, 2017

Member

This is just clickhoused, correct?

@kszucs

kszucs Aug 28, 2017

Member

Correct, c&p...

@@ -0,0 +1,169 @@
# Copyright 2014 Cloudera Inc.

@cpcloud

cpcloud Aug 27, 2017

Member

This copyright header isn't necessary.

# limitations under the License.
_identifiers = (

@cpcloud

cpcloud Aug 27, 2017

Member

This should be a frozenset so there isn't a linear time average case lookup for identifier quoting.

@@ -0,0 +1,115 @@
# @pytest.fixture

@cpcloud

cpcloud Aug 27, 2017

Member

Please remove this file and we'll add support for arrays in a follow up PR.

@@ -0,0 +1,343 @@
# Copyright 2014 Cloudera Inc.

@cpcloud

cpcloud Aug 27, 2017

Member

No need for this copyright header.

@@ -0,0 +1,951 @@
# Copyright 2014 Cloudera Inc.

@cpcloud

cpcloud Aug 27, 2017

Member

Remove the copyright header.

@kszucs

Member

kszucs commented Aug 28, 2017

@cpcloud Thanks for the review! I'll fix these. I'm on vacation until Sunday. I intend to implement as much as I can during the next week, including basic DDL operations too (I can split this up into multiple PRs though).
At the end of next week I'll clean up all the TODOs and commented-out parts.

@kszucs

Member

kszucs commented Sep 3, 2017

I've resolved most of the requests. I also left a couple of commented-out test cases and TODO notes, because organizing them was a time-consuming task. I want to implement or drop them before merging.

@kszucs

Member

kszucs commented Sep 5, 2017

Clickhouse supports a lot of hash functions, and I've implemented them in the compiler. The only problem is that ibis expressions currently support only fnv.

@wesm

Member

wesm commented Sep 5, 2017

I think @cpcloud is out of e-mail access this week so I'll take a look through this today and let you know if I have any more feedback, and then merge. Thanks for all your work on this

@kszucs

Member

kszucs commented Sep 5, 2017

@wesm Thanks in advance! I'm not sure that we can consider this PR ready to merge though.

@wesm

Member

wesm commented Sep 5, 2017

I'd be OK with the clickhouse backend remaining WIP for a while as long as incremental patches do not break master. Let me review a bit later and see how things look

@@ -1472,22 +1472,20 @@ class OuterJoin(Join):
pass
class LeftSemiJoin(Join):
class InnerSemiJoin(Join):

@cpcloud

cpcloud Sep 10, 2017

Member

How is this different from a LEFT SEMI JOIN (or ANY LEFT JOIN in clickhouse parlance)?

If I understand the clickhouse JOIN documentation correctly it isn't.

  • If INNER is specified, the result will contain only those rows that have a matching row in the right table.
  • If ANY is specified and there are multiple matching rows in the right table, only the first one will be joined.

Composing those two statements yields (paraphrasing):

  • If INNER is specified and ANY is specified the result will contain only the first row from the left table where there's a matching row in the right table.

Adding LEFT into the mix:

  • If LEFT is specified, any rows in the left table that don’t have matching rows in the right table will be assigned the default value - zeros or empty rows.

Combining that with ANY from above would yield the same statement composed from the definitions of ANY and INNER that I wrote above.

Is there something I'm missing here?

@kszucs

kszucs Sep 10, 2017

Member

Actually, I got confused, but here are the test cases from 49 to 53

@kszucs

kszucs Sep 12, 2017

Member

There are four types of joins, so there must be a difference.

@cpcloud

cpcloud Sep 14, 2017

Member

So it turns out there's a difference :)

  • ANY INNER JOIN is most similar to a LEFT SEMI JOIN (which ibis supports)
  • ANY LEFT JOIN has no equivalent single node operation in ibis.

I say "most similar to" instead of "equivalent" because the test suite shows that you can select rows from the right table.
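
The difference between the three operations can be sketched with pandas. This is only an emulation on made-up data, not ClickHouse itself: the frame names and values are hypothetical, "first matching row" is approximated by keeping the first right-hand row per key, and pandas fills unmatched rows with NaN where ClickHouse would use default values as described in the documentation quoted earlier in the thread.

```python
import pandas as pd

# Hypothetical tables: key 1 matches twice on the right, key 2 never matches.
left = pd.DataFrame({'k': [1, 2, 3], 'lv': ['a', 'b', 'c']})
right = pd.DataFrame({'k': [1, 1, 3], 'rv': ['x', 'y', 'z']})

# ANY: keep only the first matching right-hand row for each key.
first_right = right.drop_duplicates(subset='k', keep='first')

# ANY INNER JOIN: matched left rows only, right columns still selectable.
any_inner = left.merge(first_right, on='k', how='inner')

# ANY LEFT JOIN: every left row, at most one right row each.
any_left = left.merge(first_right, on='k', how='left')

# LEFT SEMI JOIN: matched left rows only, left columns only.
left_semi = left[left['k'].isin(right['k'])]
```

Here any_inner and left_semi contain the same left rows (keys 1 and 3), but only any_inner exposes the rv column, which is the "most similar to" distinction made above.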

In every relational database that I've ever used that supports a LEFT SEMI JOIN syntax you can only select fields from the left table. Databases that don't implement a syntax for this operation implement it through an EXISTS query, which precludes selecting from the right table (since there is no right table in that case).

The reason for this difference is probably because the relational algebraic definition of a semijoin only includes tuples from the left table.

This means that to fully support clickhouse joins we need to introduce new types of joins that are only available to the clickhouse backend.

You can implement this kind of join in other databases using ROW_NUMBER() OVER () AS i on the left table and keeping only those rows where i = 1 (the first row to be joined).

I think the solution here is to bring back the any_left_join etc methods and we can implement them in other databases down the line as desired.
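
A sketch of the ROW_NUMBER() idea using Python's sqlite3 module (window functions need SQLite 3.25 or newer). The table names and data are hypothetical. Note one deliberate variation: this version numbers the rows of the right table within each join key and keeps i = 1, which is one way to read the suggestion above rather than a literal transcription of it.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE l (k INTEGER, lv TEXT);
    CREATE TABLE r (k INTEGER, rv TEXT);
    INSERT INTO l VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO r VALUES (1, 'x'), (1, 'y'), (3, 'z');
""")

# Emulate ANY LEFT JOIN: number the right-hand rows per key and join
# only against the first one, so each left row appears at most once.
query = """
    SELECT l.k, l.lv, fr.rv
    FROM l
    LEFT JOIN (
        SELECT k, rv,
               ROW_NUMBER() OVER (PARTITION BY k ORDER BY rowid) AS i
        FROM r
    ) AS fr ON fr.k = l.k AND fr.i = 1
    ORDER BY l.k
"""
rows = con.execute(query).fetchall()
con.close()
```

Unmatched left rows come back with NULL for rv here, whereas ClickHouse fills in default values according to the documentation quoted above.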

@kszucs

Member

kszucs commented Sep 12, 2017

@cpcloud @wesm please review

@cpcloud

Member

cpcloud commented Sep 14, 2017

@kszucs I will give this another review today.

"""
def _get_schema(self):
    return self.left.schema()

@cpcloud

cpcloud Sep 14, 2017

Member

In light of the clickhouse test suite, the schema of this operation isn't known until you select columns out of it, similar to the rest of ibis's join operations.

@@ -2472,6 +2473,8 @@ def _table_drop(self, fields):
join=join,
cross_join=cross_join,
inner_join=_regular_join_method('inner_join', 'inner'),
inner_semi_join=_regular_join_method('inner_semi_join', 'inner_semi'),
left_semi_join=_regular_join_method('left_semi_join', 'left_semi'),

@cpcloud

cpcloud Sep 14, 2017

Member

Per my comments below, remove these and bring back any_left/any_inner (all_* isn't necessary because those have the conventional join semantics).

@kszucs

Member

kszucs commented Sep 18, 2017

@cpcloud done

@kszucs kszucs changed the title from [WIP] Clickhouse backend to Clickhouse backend Sep 26, 2017

@kszucs

Member

kszucs commented Oct 5, 2017

@cpcloud There is a packaging issue: setup.py wants to download toolz even though it's defined in the recipe.

Also, clickhouse packages have just landed in conda-forge, we might define them in ibis builds.

@kszucs

Member

kszucs commented Oct 5, 2017

Just to recap, here are a couple of todos I have in mind:

  • a better pandas wrapper around clickhouse-driver to prevent double changing of data orientation (columnar->row-wise->columnar)
  • support passing external tables to clickhouse
  • clickhouse async connection (thanks to @xzkostyan we now have aioch)
  • DDL operations
  • support arrays

Should I implement all of those in separate PRs?
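
The first item in the list can be illustrated without a server. This is a sketch under stated assumptions: the column names and values are made up, and the columnar payload stands in for whatever a driver would return. It only shows why the row-wise detour is redundant when the data is already column-oriented.

```python
import pandas as pd

# Hypothetical column-oriented result: one sequence per column.
names = ['id', 'name']
columns = [[1, 2, 3], ['a', 'b', 'c']]

# Detour: columnar -> row-wise tuples -> columnar DataFrame.
rows = list(zip(*columns))
df_via_rows = pd.DataFrame.from_records(rows, columns=names)

# Direct: build the DataFrame from the columns as they are.
df_direct = pd.DataFrame(dict(zip(names, columns)))

same_content = df_direct.to_dict('list') == df_via_rows.to_dict('list')
```

Both frames hold the same data; the direct path just skips materializing every row as a tuple.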

@cpcloud

Member

cpcloud commented Oct 10, 2017

@kszucs It appears there are multiple issues here with the most recent CI run:

  • Python 2.7 - no dependency issues, looks like a transient failure.
  • Python 3.4 - The SQLAlchemy clickhouse driver conda package isn't compatible with anything less than Python 3.5, this will also show up with Python 2.7 once we get to that point in the build.
  • Python 3.5 - Even though the SQLAlchemy clickhouse driver conda package is only compatible with Python 3.5+ it appears to require the enum34 package which is already in the stdlib as the enum package in 3.4 and up.
  • Python 3.6 - Same issue as Python 3.5.
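
For context on the enum34 point, a package can restrict the backport to interpreters that lack the stdlib enum module. The dependency list below is hypothetical and is not the recipe of the driver package in question; it only sketches the two usual ways to express the condition.

```python
import sys

# Hypothetical dependency list; only the enum34 handling matters here.
install_requires = ['toolz']

# The enum module is in the standard library from Python 3.4 on,
# so the backport is only needed on older interpreters.
if sys.version_info < (3, 4):
    install_requires.append('enum34')

# Equivalent declarative form using a PEP 508 environment marker.
enum34_requirement = 'enum34; python_version < "3.4"'
```
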
@kszucs

Member

kszucs commented Oct 10, 2017

@cpcloud

Member

cpcloud commented Oct 10, 2017

@kszucs Sweet, thanks for the reference.

@cpcloud

Member

cpcloud commented Oct 10, 2017

@kszucs Is it possible to run the tests only on Python 2.7 for now?

@cpcloud

Member

cpcloud commented Oct 10, 2017

Using the pip installed version, that is.

@kszucs

Member

kszucs commented Oct 10, 2017

I'll revert then, just a couple of minutes.

@cpcloud

Member

cpcloud commented Oct 10, 2017

Alternatively, you can install enum34 with Python 3.4+ but I'm not sure if this kind of setup has well-defined behavior.

@kszucs

Member

kszucs commented Oct 10, 2017

@cpcloud It's green now.

@cpcloud

Member

cpcloud commented Oct 10, 2017

@kszucs Is it possible for you to rebase on top of master?

@kszucs kszucs force-pushed the kszucs:clickhouse branch from b5627c3 to 4986f93 Oct 13, 2017

@kszucs

Member

kszucs commented Oct 13, 2017

@cpcloud Rebase done. Thanks for the help!

@cpcloud cpcloud closed this in 3324d1f Oct 15, 2017

@cpcloud

Member

cpcloud commented Oct 15, 2017

@kszucs Thanks for doing this! Excited to see use cases like this. Please keep the PRs and issues coming!

@kszucs kszucs deleted the kszucs:clickhouse branch Dec 29, 2017
