Clickhouse query backend #1127

Closed · wants to merge 51 commits
Commits (51)
d24b478
Missing ibis-testing-data path suffix in developer documentation; add…
kszucs Aug 21, 2017
9dc605f
todo note to cleanup datamgr's clickhouse command
kszucs Aug 21, 2017
e7bb4c1
clickhouse skeleton; client tests are passing
kszucs Aug 21, 2017
6b71f26
wip expr tests
kszucs Aug 22, 2017
b855f95
removed null and window functions
kszucs Aug 22, 2017
5d1dea4
clickhouse: conditional reductions
kszucs Aug 22, 2017
1ff165f
clickhouse string functions
kszucs Aug 22, 2017
906b019
clickhouse: first attempt to make ci actually run
kszucs Aug 22, 2017
6ee06b3
clickhouse: flake8
kszucs Aug 22, 2017
2657bc5
clickhouse: added tcpwait 9000
kszucs Aug 22, 2017
3a4024f
clickhouse: format group by
kszucs Aug 22, 2017
a477e23
clickhouse: basic aggregation tests
kszucs Aug 22, 2017
320b308
clickhouse: wip port of test_functions from postgres
kszucs Aug 23, 2017
ae7fa50
clickhouse: wip function implementations
kszucs Aug 24, 2017
5aa93b4
clickhouse: simplified function formatter implementations
kszucs Aug 24, 2017
3a4261d
clickhouse: timestamp comparison tests
kszucs Aug 24, 2017
47553a4
clickhouse: minor cleanup; select tests
kszucs Aug 24, 2017
243dff2
clickhouse: any|all inner|left join types
kszucs Aug 25, 2017
9d4d365
clickhouse: join tests
kszucs Aug 25, 2017
c9c6156
clickhouse: fix couple of review issues
kszucs Sep 3, 2017
995192b
clickhouse: reverted any joins to inner_semi and left_semi
kszucs Sep 3, 2017
10d43c3
clickhouse: added clickhouse-sqlalchemy missing dependency to ci
kszucs Sep 3, 2017
b2b8820
clickhouse: removed ddl; split test functions file; refactored test c…
kszucs Sep 3, 2017
e8b8fd4
clickhouse: flake8
kszucs Sep 3, 2017
e6702ee
clickhouse: cleanup
kszucs Sep 3, 2017
2f2ab74
clickhouse: reverted count distinct condition
kszucs Sep 3, 2017
5a082ea
clickhouse: revert nunique with condition
kszucs Sep 3, 2017
52806fe
clickhouse: removed importorskip from conftest
kszucs Sep 3, 2017
2a20131
clickhouse: minor cleanup
kszucs Sep 3, 2017
f7f4c31
clickhouse: fix parenthesize or condition in select builder
kszucs Sep 4, 2017
60e8814
clickhouse: added parse_url
kszucs Sep 5, 2017
e787d88
clickhouse: hash functions
kszucs Sep 5, 2017
caa0b2e
clickhouse: timestamp from integer
kszucs Sep 5, 2017
a1c9d66
clcikhouse: cleanup identifiers
kszucs Sep 11, 2017
a95a807
clickhouse: rename semi joins to any prefixed ones
kszucs Sep 17, 2017
5e73b8a
clickhouse: basic timedelta support; more verbose names for type mapp…
kszucs Sep 17, 2017
b13e5da
flake8
kszucs Sep 17, 2017
2809bbf
clickhouse: turn off table prefixes in column formatting due to parti…
kszucs Oct 2, 2017
637dedd
clickhouse: support timestamp truncate (Y, M, D, H, MI)
kszucs Oct 3, 2017
6656801
clickhouse: zeroifnull; nullifzero
kszucs Oct 3, 2017
b9cba61
clickhouse: flake8
kszucs Oct 3, 2017
8abe2a3
clickhouse: fixed client unsupported test case
kszucs Oct 3, 2017
27bdc37
clickhouse removed accidentally committed ctags file
kszucs Oct 6, 2017
ff41290
clickhouse: update requirements in setup.py
kszucs Oct 9, 2017
c07a3c6
clickhouse: extras require
kszucs Oct 9, 2017
82c76ea
clickhouse: updated development dependencies, packages now available …
kszucs Oct 9, 2017
b638c0b
clickhouse: removed unused client code
kszucs Oct 9, 2017
609aa9e
clickhouse: unused import
kszucs Oct 9, 2017
bab6ffb
clickhouse: retrigger circleci builds
kszucs Oct 10, 2017
2c5339b
clickhouse: install dependencies via pip instead of conda
kszucs Oct 10, 2017
4986f93
clickhouse: flake8
kszucs Oct 10, 2017
11 changes: 11 additions & 0 deletions .circleci/config.yml
@@ -4,6 +4,10 @@ environment: &environment
- IBIS_TEST_POSTGRES_DB: ibis_testing
- IBIS_POSTGRES_USER: ubuntu
- IBIS_POSTGRES_PASS: ubuntu
- IBIS_TEST_CLICKHOUSE_DB: ibis_testing
- IBIS_CLICKHOUSE_USER: default
- IBIS_CLICKHOUSE_HOST: localhost
- IBIS_CLICKHOUSE_PASS: ''

- DATA_URL: https://storage.googleapis.com/ibis-ci-data

@@ -49,6 +53,13 @@ python_test_steps: &python_test_steps
--script ci/postgresql_load.sql \
functional_alltypes diamonds awards_players batting

- run: |
ci/datamgr.py clickhouse \
--database "$IBIS_TEST_CLICKHOUSE_DB" \
--data-directory "$DATA_DIR" \
--script ci/clickhouse_load.sql \
functional_alltypes diamonds awards_players batting

- run: test_data_admin.py load --data --data-dir "$DATA_DIR"
- run: mkdir -p /tmp/reports
- run: |
96 changes: 96 additions & 0 deletions ci/clickhouse_load.sql
@@ -0,0 +1,96 @@
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds (
`date` Date DEFAULT today(),
carat Float64,
cut String,
color String,
clarity String,
depth Float64,
`table` Float64,
price Int64,
x Float64,
y Float64,
z Float64
) ENGINE = MergeTree(date, (`carat`), 8192);

DROP TABLE IF EXISTS batting;

CREATE TABLE batting (
`date` Date DEFAULT today(),
`playerID` String,
`yearID` Int64,
stint Int64,
`teamID` String,
`lgID` String,
`G` Int64,
`AB` Int64,
`R` Int64,
`H` Int64,
`X2B` Int64,
`X3B` Int64,
`HR` Int64,
`RBI` Int64,
`SB` Int64,
`CS` Int64,
`BB` Int64,
`SO` Int64,
`IBB` Int64,
`HBP` Int64,
`SH` Int64,
`SF` Int64,
`GIDP` Int64
) ENGINE = MergeTree(date, (`playerID`), 8192);

DROP TABLE IF EXISTS awards_players;

CREATE TABLE awards_players (
`date` Date DEFAULT today(),
`playerID` String,
`awardID` String,
`yearID` Int64,
`lgID` String,
tie String,
notes String
) ENGINE = MergeTree(date, (`playerID`), 8192);

DROP TABLE IF EXISTS functional_alltypes;

CREATE TABLE functional_alltypes (
`date` Date DEFAULT toDate(timestamp_col),
`index` Int64,
`Unnamed_0` Int64,
id Int32,
bool_col UInt8,
tinyint_col Int8,
smallint_col Int16,
int_col Int32,
bigint_col Int64,
float_col Float32,
double_col Float64,
date_string_col String,
string_col String,
timestamp_col DateTime,
year Int32,
month Int32
) ENGINE = MergeTree(date, (`index`), 8192);

DROP TABLE IF EXISTS tzone;

CREATE TABLE tzone (
`date` Date DEFAULT today(),
ts DateTime,
key String,
value Float64
) ENGINE = MergeTree(date, (key), 8192);

DROP TABLE IF EXISTS array_types;

CREATE TABLE IF NOT EXISTS array_types (
`date` Date DEFAULT today(),
x Array(Int64),
y Array(String),
z Array(Float64),
grouper String,
scalar_column Float64
) ENGINE = MergeTree(date, (scalar_column), 8192);
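
All of the tables above use the legacy three-argument MergeTree engine signature, `MergeTree(date_column, primary_key, index_granularity)`, with a synthetic `date` partition column defaulted via `today()`. As a rough illustration of the pattern the load script repeats, a hypothetical helper (not part of this PR) that renders such DDL might look like:

```python
def mergetree_ddl(table, columns, order_key, date_col="date", granularity=8192):
    """Render a CREATE TABLE using the legacy MergeTree signature
    MergeTree(date_column, primary_key, index_granularity), mirroring
    the statements in ci/clickhouse_load.sql. Illustrative helper only."""
    cols = ",\n    ".join(
        "`{}` {}".format(name, type_) for name, type_ in columns
    )
    return (
        "CREATE TABLE {} (\n    {}\n) ENGINE = MergeTree({}, (`{}`), {});"
        .format(table, cols, date_col, order_key, granularity)
    )

ddl = mergetree_ddl(
    "tzone",
    [("date", "Date DEFAULT today()"),
     ("ts", "DateTime"),
     ("key", "String"),
     ("value", "Float64")],
    "key",
)
```

Newer ClickHouse releases express the same engine settings with `PARTITION BY` / `ORDER BY` clauses instead of positional arguments.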
79 changes: 79 additions & 0 deletions ci/datamgr.py
@@ -24,6 +24,83 @@ def cli():
pass


@cli.command()
@click.argument('tables', nargs=-1)
@click.option('-S', '--script', type=click.File('rt'), required=True)
@click.option(
'-d', '--database',
default=os.environ.get('IBIS_TEST_CLICKHOUSE_DB', 'ibis_testing')
)
@click.option(
'-D', '--data-directory',
default=tempfile.gettempdir(), type=click.Path(exists=True)
)
def clickhouse(script, tables, database, data_directory):
username = os.environ.get('IBIS_CLICKHOUSE_USER', 'default')
host = os.environ.get('IBIS_CLICKHOUSE_HOST', 'localhost')
password = os.environ.get('IBIS_CLICKHOUSE_PASS', '')

url = sa.engine.url.URL(
'clickhouse+native',
username=username,
host=host,
password=password,
)
engine = sa.create_engine(str(url))
engine.execute('DROP DATABASE IF EXISTS "{}"'.format(database))
engine.execute('CREATE DATABASE "{}"'.format(database))

url = sa.engine.url.URL(
'clickhouse+native',
username=username,
host=host,
password=password,
database=database,
)
engine = sa.create_engine(str(url))
script_text = script.read()

# missing stmt
# INSERT INTO array_types (x, y, z, grouper, scalar_column) VALUES
# ([1, 2, 3], ['a', 'b', 'c'], [1.0, 2.0, 3.0], 'a', 1.0),
# ([4, 5], ['d', 'e'], [4.0, 5.0], 'a', 2.0),
# ([6], ['f'], [6.0], 'a', 3.0),
# ([1], ['a'], [], 'b', 4.0),
# ([2, 3], ['b', 'c'], [], 'b', 5.0),
# ([4, 5], ['d', 'e'], [4.0, 5.0], 'c', 6.0);

with engine.begin() as con:
# doesn't support multiple statements
for stmt in script_text.split(';'):
if len(stmt.strip()):
con.execute(stmt)

table_paths = [
os.path.join(data_directory, '{}.csv'.format(table))
for table in tables
]
dtype = {'bool_col': np.bool_}
for table, path in zip(tables, table_paths):
# correct dtypes per table to be able to insert
# TODO: cleanup, kinda ugly
df = pd.read_csv(path, index_col=None, header=0, dtype=dtype)
if table == 'functional_alltypes':
df = df.rename(columns={'Unnamed: 0': 'Unnamed_0'})
cols = ['date_string_col', 'string_col']
Member: These should already be strings.

Member Author: This is just a temporary workaround; I'd like to use zkostyan/clickhouse-sqlalchemy with pandas to_sql, but there are a couple of issues.

df[cols] = df[cols].astype(str)
df.timestamp_col = df.timestamp_col.astype('datetime64[s]')
elif table == 'batting':
cols = ['playerID', 'teamID', 'lgID']
Member: These should already be strings.

Member Author: functional_alltypes.string_col is actually int64, which raises an error on the ClickHouse side.

df[cols] = df[cols].astype(str)
cols = df.select_dtypes([float]).columns
df[cols] = df[cols].fillna(0).astype(int)
elif table == 'awards_players':
cols = ['playerID', 'awardID', 'lgID', 'tie', 'notes']
Member: These should already be strings.

df[cols] = df[cols].astype(str)

df.to_sql(table, engine, index=False, if_exists='append')


@cli.command()
@click.argument('tables', nargs=-1)
@click.option('-S', '--script', type=click.File('rt'), required=True)
@@ -103,6 +180,8 @@ def sqlite(script, tables, database, data_directory):
os.path.join(data_directory, '{}.csv'.format(table))
for table in tables
]
click.echo(tables)
click.echo(table_paths)
for table, path in zip(tables, table_paths):
df = pd.read_csv(path, index_col=None, header=0)
with engine.begin() as con:
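
Because the native driver executes one statement per call, the `clickhouse` loader above splits the SQL script on semicolons before executing each piece. Assuming no semicolon ever appears inside a string literal (true for ci/clickhouse_load.sql), that splitting logic can be sketched and tested in isolation:

```python
def split_statements(script_text):
    """Split a SQL script into individual statements on ';', dropping
    empty fragments -- mirrors the loop in the clickhouse command.
    Assumes ';' never occurs inside a string literal."""
    return [stmt.strip() for stmt in script_text.split(";") if stmt.strip()]

stmts = split_statements("""
DROP TABLE IF EXISTS tzone;

CREATE TABLE tzone (ts DateTime, key String) ENGINE = Memory;
""")
```

A proper SQL tokenizer would be needed if string literals could contain semicolons, which is part of why the comment in the diff flags this as worth cleaning up.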
2 changes: 2 additions & 0 deletions ci/requirements-dev-2.7.yml
@@ -23,3 +23,5 @@ dependencies:
- toolz
- pip:
- hdfs>=2.0.0
- clickhouse-driver
- clickhouse-sqlalchemy
2 changes: 2 additions & 0 deletions ci/requirements-dev-3.4.yml
@@ -21,3 +21,5 @@ dependencies:
- toolz
- pip:
- hdfs>=2.0.0
- clickhouse-driver
- clickhouse-sqlalchemy
2 changes: 2 additions & 0 deletions ci/requirements-dev-3.5.yml
@@ -21,3 +21,5 @@ dependencies:
- toolz
- pip:
- hdfs>=2.0.0
- clickhouse-driver
- clickhouse-sqlalchemy
2 changes: 2 additions & 0 deletions ci/requirements-dev-3.6.yml
@@ -21,3 +21,5 @@ dependencies:
- toolz
- pip:
- hdfs>=2.0.0
- clickhouse-driver
- clickhouse-sqlalchemy
24 changes: 22 additions & 2 deletions docs/source/developer.rst
@@ -69,6 +69,26 @@ Impala (with UDFs)

test_data_admin.py load --data --data-dir=$DATA_DIR

Clickhouse
^^^^^^^^^^

#. **Start the Clickhouse Server docker image in another terminal**:

.. code:: sh

# Keep this running for as long as you want to test ibis
docker run --rm -p 9000:9000 --tty yandex/clickhouse-server

#. **Load data**:

.. code:: sh

ci/datamgr.py clickhouse \
--database $IBIS_TEST_CLICKHOUSE_DB \
--data-directory $DATA_DIR/ibis-testing-data \
--script ci/clickhouse_load.sql \
functional_alltypes batting diamonds awards_players

PostgreSQL
^^^^^^^^^^

Expand All @@ -81,7 +101,7 @@ Here's how to load test data into PostgreSQL:

ci/datamgr.py postgres \
--database $IBIS_TEST_POSTGRES_DB \
--data-directory $DATA_DIR \
--data-directory $DATA_DIR/ibis-testing-data \
--script ci/postgresql_load.sql \
functional_alltypes batting diamonds awards_players

@@ -95,7 +115,7 @@ instructions above, then SQLite will be available in the conda environment.

ci/datamgr.py sqlite \
--database $IBIS_TEST_SQLITE_DB_PATH \
--data-directory $DATA_DIR \
--data-directory $DATA_DIR/ibis-testing-data \
--script ci/sqlite_load.sql \
functional_alltypes batting diamonds awards_players

5 changes: 5 additions & 0 deletions ibis/__init__.py
@@ -39,6 +39,11 @@
except ImportError: # pip install ibis-framework[postgres]
pass

try:
import ibis.clickhouse.api as clickhouse
except ImportError: # pip install ibis-framework[clickhouse]
pass

try:
from multipledispatch import halt_ordering, restart_ordering
halt_ordering()
Empty file added ibis/clickhouse/__init__.py
Empty file.
81 changes: 81 additions & 0 deletions ibis/clickhouse/api.py
@@ -0,0 +1,81 @@
import ibis.common as com

from ibis.config import options
from ibis.clickhouse.client import ClickhouseClient


def compile(expr):
"""
Force compilation of an expression as though it depended on Clickhouse.
Note that you can also call expr.compile().

Returns
-------
compiled : string
"""
from .compiler import to_sql
return to_sql(expr)


def verify(expr):
"""
Determine if expression can be successfully translated to execute on
Clickhouse
"""
try:
compile(expr)
return True
except com.TranslationError:
return False


def connect(host='localhost', port=9000, database='default', user='default',
password='', client_name='ibis', compression=False):
"""Create a ClickhouseClient for use with Ibis.

Parameters
----------
host : str, optional
Host name of the clickhouse server
port : int, optional
Clickhouse server's port
database : str, optional
Default database when executing queries
user : str, optional
User to authenticate with
password : str, optional
Password to authenticate with
client_name : str, optional
This will appear in the Clickhouse server logs
compression : str or bool, optional
Whether or not to use compression. Defaults to False.
Possible choices: 'lz4', 'lz4hc', 'quicklz', 'zstd'.
True is equivalent to 'lz4'.

Examples
--------
>>> import ibis
>>> import os
>>> clickhouse_host = os.environ.get('IBIS_TEST_CLICKHOUSE_HOST',
... 'localhost')
>>> clickhouse_port = int(os.environ.get('IBIS_TEST_CLICKHOUSE_PORT',
... 9000))
>>> client = ibis.clickhouse.connect(
... host=clickhouse_host,
... port=clickhouse_port
... )
>>> client # doctest: +ELLIPSIS
<ibis.clickhouse.client.ClickhouseClient object at 0x...>

Returns
-------
ClickhouseClient
"""

client = ClickhouseClient(host, port=port, database=database, user=user,
password=password, client_name=client_name,
compression=compression)
if options.default_backend is None:
options.default_backend = client

return client
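
The tail of `connect()` registers only the first client created as the global default backend. That first-wins pattern can be sketched independently of ClickHouse (the names below are illustrative stand-ins, not ibis's actual internals):

```python
class Options:
    """Stand-in for ibis.config.options; illustrative only."""
    default_backend = None

options = Options()

def register_default(client):
    # Mirror the end of connect(): only the first client created
    # becomes the session-wide default backend; later clients are
    # returned as usual but do not replace it.
    if options.default_backend is None:
        options.default_backend = client
    return client

first = register_default("client-a")
second = register_default("client-b")
```

Subsequent `connect()` calls therefore return fully usable clients without disturbing whichever backend expressions fall back to by default.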