340 changes: 340 additions & 0 deletions docs/source/notebooks/tutorial/9-Adding-a-new-expression.ipynb
@@ -0,0 +1,340 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extending Ibis Part 1: Adding a New Expression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two parts of ibis that users typically want to extend:\n",
"\n",
"1. Expressions (for example, by adding a new operation)\n",
"1. Backends\n",
"\n",
"This notebook will show you how to add a new operation (`sha1`) to an existing backend (BigQuery)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Description\n",
"\n",
"We're going to add a **`sha1`** method to ibis. [SHA1](https://en.wikipedia.org/wiki/SHA-1) is a hash algorithm, employed in systems such as git."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Define the Operation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's define the `sha` operation as a function that takes one string input argument and returns a hexidecimal string.\n",
"\n",
"```haskell\n",
"sha1 :: string -> string\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ibis.expr.datatypes as dt\n",
"\n",
"from ibis.expr import rules\n",
"from ibis.expr.operations import ValueOp\n",
"\n",
"\n",
"class SHA1(ValueOp):\n",
" \n",
" input_type = [rules.string]\n",
" output_type = rules.shape_like_arg(0, 'string')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We just defined a `SHA1` class that takes one argument of type string or binary, and returns a binary. This matches the description of the function provided by BigQuery."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Define the API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because we know the output type of the operation, to make an expression out of ``SHA1`` we simply need to construct it and call its `ibis.expr.types.Node.to_expr` method.\n",
"\n",
"We still need to add a method to `StringValue` and `BinaryValue` (this needs to work on both scalars and columns).\n",
"\n",
"When you add a method to any of the expression classes whose name matches `*Value` both the scalar and column child classes will pick it up, making it easy to define operations for both scalars and columns in one place.\n",
"\n",
"We can do this by defining a function and assigning it to the appropriate class\n",
"of expressions.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from ibis.expr.types import StringValue, BinaryValue\n",
"\n",
"\n",
"def sha1(string_value):\n",
" return SHA1(string_value).to_expr()\n",
"\n",
"\n",
"StringValue.sha1 = sha1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Interlude: Create some expressions with `sha1`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ibis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t = ibis.table([('string_col', 'string')])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t.string_col.sha1()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Turn the Expression into SQL"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sqlalchemy as sa\n",
"\n",
"\n",
"@ibis.postgres.compiles(SHA1)\n",
"def compile_sha1(translator, expr):\n",
" # pull out the arguments to the expression\n",
" arg, = expr.op().args\n",
" \n",
" # compile the argument\n",
" compiled_arg = translator.translate(arg)\n",
" \n",
" # return a SQLAlchemy expression that calls into the PostgreSQL pgcrypto extension\n",
" return sa.func.encode(sa.func.digest(compiled_arg, 'sha1'), 'hex')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Putting it all Together"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Connect to the `ibis_testing` database"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:**\n",
"\n",
"To be able to execute the rest of this notebook you need to run:\n",
"\n",
"```sh\n",
"docker-compose up -d --no-build postgres impala clickhouse mysql dns\n",
"docker-compose run waiter\n",
"docker-compose run ibis ci/load-data.sh\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"con = ibis.postgres.connect(\n",
" database='ibis_testing', user='postgres', host='postgres', password='postgres')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Register the [`pgcrypto`](https://www.postgresql.org/docs/10/static/pgcrypto.html) extension"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"con.raw_sql('CREATE EXTENSION IF NOT EXISTS pgcrypto'); # we don't care about the output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create and execute a `sha1` expression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t = con.table('functional_alltypes')\n",
"t"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sha1_expr = t.string_col.sha1()\n",
"sha1_expr"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql_expr = sha1_expr.compile()\n",
"print(sql_expr)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result = sha1_expr.execute()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because we've defined our operation on `StringValue`, and not just on `StringColumn` we get operations on both string scalars *and* string columns for free"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"string_scalar = ibis.literal('abcdefg')\n",
"string_scalar"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sha1_scalar = string_scalar.sha1()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"con.execute(sha1_scalar)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
94 changes: 94 additions & 0 deletions docs/source/release.rst
Expand Up @@ -9,6 +9,86 @@ Release Notes
releases (e.g., ``0.5.1``) will generally not be found here and contain
only bug fixes.

Current ``ibis.__version__``: |version|

v0.13.0 (March 20, 2018)
------------------------

This release brings new backends, including support for executing against
files (CSV, HDF5, and Parquet) and MySQL, as well as pandas user defined
scalar and aggregate functions, along with a number of bug fixes and
reliability enhancements. We recommend that all users upgrade from earlier
versions of Ibis.

New Backends
~~~~~~~~~~~~

* File Support for CSV & HDF5 (:issue:`1165`, :issue:`1194`)
* File Support for Parquet Format (:issue:`1175`, :issue:`1194`)
* Experimental support for ``MySQL`` thanks to @kszucs (:issue:`1224`)

New Features
~~~~~~~~~~~~

* Support for Unsigned Integer Types (:issue:`1194`)
* Support for Interval types and expressions with support for execution on the
Impala and Clickhouse backends (:issue:`1243`)
* Isnan, isinf operations for float and double values (:issue:`1261`)
* Support for an interval with a quarter period (:issue:`1259`)
* ``ibis.pandas.from_dataframe`` convenience function (:issue:`1155`)
* Remove the restriction on ``ROW_NUMBER()`` requiring it to have an
``ORDER BY`` clause (:issue:`1371`)
* Add ``.get()`` operation on a Map type (:issue:`1376`)
* Allow visualization of custom defined expressions
* Add experimental support for pandas UDFs/UDAFs (:issue:`1277`)
* Functions can be used as groupby keys (:issue:`1214`, :issue:`1215`)
* Generalize the use of the ``where`` parameter to reduction operations
(:issue:`1220`)
* Support for interval operations thanks to @kszucs (:issue:`1243`,
:issue:`1260`, :issue:`1249`)
* Support for the ``PARTITIONTIME`` column in the BigQuery backend
(:issue:`1322`)
* Add ``arbitrary()`` method for selecting the first non-null value in a column
(:issue:`1230`, :issue:`1309`)
* Windowed ``MultiQuantile`` operation in the pandas backend thanks to
@DiegoAlbertoTorres (:issue:`1343`)
* Rules for validating table expressions thanks to @DiegoAlbertoTorres
(:issue:`1298`)
* Complete end-to-end testing framework for all supported backends
(:issue:`1256`)
* ``contains``/``not contains`` now supported in the pandas backend
(:issue:`1210`, :issue:`1211`)
* CI builds are now reproducible *locally* thanks to @kszucs (:issue:`1121`,
:issue:`1237`, :issue:`1255`, :issue:`1311`)
* ``isnan``/``isinf`` operations thanks to @kszucs (:issue:`1261`)
* Framework for generalized dtype and schema inference, and implicit casting
thanks to @kszucs (:issue:`1221`, :issue:`1269`)
* Generic utilities for expression traversal thanks to @kszucs (:issue:`1336`)
* ``day_of_week`` API (:issue:`306`, :issue:`1047`)
* Design documentation for ibis (:issue:`1351`)

Bug Fixes
~~~~~~~~~

* Unbound parameters were failing in the simple case of a
:meth:`~ibis.expr.types.TableExpr.mutate` call with no operation
(:issue:`1378`)
* Fix parameterized subqueries (:issue:`1300`, :issue:`1331`, :issue:`1303`,
:issue:`1378`)
* Fix subquery extraction, which wasn't happening in topological order
(:issue:`1342`)
* Fix parenthesization of ``isnull`` (:issue:`1307`)
* Calling ``drop`` after ``mutate`` did not work (:issue:`1296`, :issue:`1299`)
* SQLAlchemy backends were missing an implementation of
:class:`~ibis.expr.operations.NotContains`.
* Support ``REGEX_EXTRACT`` in PostgreSQL 10 (:issue:`1276`, :issue:`1278`)

API Changes
~~~~~~~~~~~

* Fixing :issue:`1378` required the removal of the ``name`` parameter to the
:func:`~ibis.param` function. Use the :meth:`~ibis.expr.types.Expr.name`
method instead.

v0.12.0 (October 28, 2017)
--------------------------

Expand All @@ -22,6 +102,7 @@ New Backends
* BigQuery backend (:issue:`1170`), thanks to @tsdlovell.
* Clickhouse backend (:issue:`1127`), thanks to @kszucs.


New Features
~~~~~~~~~~~~

Expand Down Expand Up @@ -83,6 +164,19 @@ Performance Enhancements
Contributors
~~~~~~~~~~~~

The following people contributed to the 0.12.0 release ::

$ git shortlog -sn --no-merges v0.11.2..v0.12.0
63 Phillip Cloud
8 Jeff Reback
2 Krisztián Szűcs
2 Tory Haavik
1 Anirudh
1 Szucs Krisztian
1 dlovell
1 kwangin


0.11.0 (June 28, 2017)
----------------------

Expand Down
5 changes: 2 additions & 3 deletions docs/source/sql.rst
Expand Up @@ -368,13 +368,12 @@ functions:

.. ipython:: python
stats = dict(
expr = pop.group_by('country').aggregate(
num_persons=pop.count(),
avg_age=pop.age.mean(),
avg_male=pop.age.mean(where=pop.gender == 'M'),
avg_female=pop.age.mean(where=pop.gender == 'F')
)
expr = pop.group_by('country').aggregate(**stats)
This indeed generates the correct SQL. Note that SQL engines handle ``NULL``
values differently in aggregation functions, but Ibis will write the SQL
Expand All @@ -401,7 +400,7 @@ Consider the SQL idiom:

.. code-block:: sql
SELECT {{ COLUMN_EXPR }}, count(*)
SELECT some_column_expression, count(*)
FROM table
GROUP BY 1
Expand Down
20 changes: 14 additions & 6 deletions docs/source/tutorial.rst
@@ -1,10 +1,18 @@
.. _tutorial:

*******************
Expression tutorial
*******************
Tutorial
========

These notebooks come from http://github.com/cloudera/ibis-notebooks and are
reproduced here using ``nbconvert``.
Here we show Jupyter notebooks that take you through various tasks using ibis.

.. include:: generated-notebooks/manifest.txt
.. toctree::
:maxdepth: 1

notebooks/tutorial/1-Intro-and-Setup.ipynb
notebooks/tutorial/2-Basics-Aggregate-Filter-Limit.ipynb
notebooks/tutorial/3-Projection-Join-Sort.ipynb
notebooks/tutorial/4-More-Value-Expressions.ipynb
notebooks/tutorial/5-IO-Create-Insert-External-Data.ipynb
notebooks/tutorial/6-Advanced-Topics-TopK-SelfJoins.ipynb
notebooks/tutorial/7-Advanced-Topics-ComplexFiltering.ipynb
notebooks/tutorial/8-More-Analytics-Helpers.ipynb
7 changes: 0 additions & 7 deletions docs/source/type-system.rst

This file was deleted.

107 changes: 107 additions & 0 deletions docs/source/udf.rst
@@ -0,0 +1,107 @@
.. _udf:

User Defined Functions
======================

Ibis provides a mechanism for writing custom scalar functions (UDFs) and
aggregate functions (UDAFs). UDFs and UDAFs are a complex topic, and the level
of support varies from backend to backend.

This section of the documentation discusses the backend-specific details of
user defined functions.

API
---

.. warning::

The UDF/UDAF API is quite experimental at this point and is therefore
provisional and subject to change.

Going forward, the API for user defined *scalar* functions will look like this:

.. code-block:: python
@udf(input_type=[double], output_type=double)
def add_one(x):
return x + 1.0
User defined *aggregate* functions are nearly identical, except that they use
the ``@udaf`` decorator instead of the ``@udf`` decorator.
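
As a minimal sketch, reusing the ``double`` type object from the scalar
example above (the ``median`` name is illustrative, and the body assumes the
argument arrives as a ``pandas.Series``, as it does on the pandas backend
described below):

.. code-block:: python

   @udaf(input_type=[double], output_type=double)
   def median(x):
       # reduce a column of values down to a single scalar
       return x.median()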

Impala
------

TODO

Pandas
------

The pandas backend supports defining both UDFs and UDAFs.

When you define a UDF you automatically get support for applying that UDF in a
scalar context, *as well as* in any group by operation.

When you define a UDAF you automatically get support for standard scalar
aggregations, group bys, *as well as* any supported windowing operation.

The API for these functions is the same as described above.
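
As a rough sketch (the ``my_mean`` function, the ``df`` DataFrame, and its
``key`` and ``value`` columns are hypothetical; the ``@udaf`` decorator and
``double`` type are assumed to be imported as in the examples above):

.. code-block:: python

   import ibis
   import pandas as pd

   @udaf(input_type=[double], output_type=double)
   def my_mean(series):
       # on the pandas backend the argument arrives as a pandas.Series;
       # return a single scalar value
       return series.mean()

   df = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1.0, 2.0, 3.0]})
   t = ibis.pandas.connect({'df': df}).table('df')

   # plain scalar aggregation
   expr = my_mean(t.value)

   # the same UDAF used inside a group by
   by_key = t.group_by('key').aggregate(avg_value=my_mean(t.value))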

The objects you receive as input arguments are either ``pandas.Series`` objects
or Python/NumPy scalars, depending on the operation.

Using ``add_one`` from above as an example, the following call will receive a
``pandas.Series`` for the ``x`` argument:

.. code-block:: python
>>> import ibis
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3]})
>>> con = ibis.pandas.connect({'df': df})
>>> t = con.table('df')
>>> expr = add_one(t.a)
And this will receive the ``int`` 1:

.. code-block:: python
>>> expr = add_one(1)
Finally, since the pandas backend passes around ``**kwargs``, you can accept
``**kwargs`` in your function:

.. code-block:: python
@udf([double], double)
def add_one(x, **kwargs):
return x + 1.0
Or you can leave them out, as we did in the example above. You can also accept
*specific* keyword arguments, though doing so usefully requires some knowledge
of how the pandas backend works:

.. note::

Any keyword arguments (other than ``**kwargs``) must be given a default
value or the UDF/UDAF **will not work**. A standard Python convention is to
set the default value to ``None``.

For example:

.. code-block:: python
@udf([double], double)
def add_one(x, scope=None):
return x + 1.0
BigQuery
--------

TODO

SQLite
------

TODO
81 changes: 41 additions & 40 deletions ibis/__init__.py
Expand Up @@ -14,52 +14,64 @@


# flake8: noqa
from multipledispatch import halt_ordering, restart_ordering

from ibis.filesystems import HDFS, WebHDFS
from ibis.common import IbisError

import ibis.config_init
import ibis.util as util
import ibis.expr.api as api
import ibis.expr.types as ir

from ibis.config import options
from ibis.common import IbisError
from ibis.compat import suppress
from ibis.filesystems import HDFS, WebHDFS

# speeds up signature registration
halt_ordering()

# __all__ is defined
from ibis.expr.api import *

try:
# pandas backend is mandatory
import ibis.pandas.api as pandas

with suppress(ImportError):
# pip install ibis-framework[csv]
import ibis.file.csv as csv

with suppress(ImportError):
# pip install ibis-framework[parquet]
import ibis.file.parquet as parquet

with suppress(ImportError):
# pip install ibis-framework[hdf5]
import ibis.file.hdf5 as hdf5

with suppress(ImportError):
# pip install ibis-framework[impala]
import ibis.impala.api as impala
except ImportError: # pip install ibis-framework[impala]
pass

try:
with suppress(ImportError):
# pip install ibis-framework[sqlite]
import ibis.sql.sqlite.api as sqlite
except ImportError: # pip install ibis-framework[sqlite]
pass

try:
with suppress(ImportError):
# pip install ibis-framework[postgres]
import ibis.sql.postgres.api as postgres
except ImportError: # pip install ibis-framework[postgres]
pass

try:
with suppress(ImportError):
# pip install ibis-framework[mysql]
import ibis.sql.mysql.api as mysql

with suppress(ImportError):
# pip install ibis-framework[clickhouse]
import ibis.clickhouse.api as clickhouse
except ImportError: # pip install ibis-framework[clickhouse]
pass

try:
with suppress(ImportError):
# pip install ibis-framework[bigquery]
import ibis.bigquery.api as bigquery
except ImportError: # pip install ibis-framework[bigquery]
pass

try:
from multipledispatch import halt_ordering, restart_ordering
halt_ordering()
import ibis.pandas.api as pandas
restart_ordering()
except ImportError: # pip install ibis-framework[pandas]
pass

import ibis.config_init
from ibis.config import options
import ibis.util as util
restart_ordering()


def hdfs_connect(host='localhost', port=50070, protocol='webhdfs',
Expand Down Expand Up @@ -117,17 +129,6 @@ def hdfs_connect(host='localhost', port=50070, protocol='webhdfs',
hdfs_client = InsecureClient(url, **kwds)
return WebHDFS(hdfs_client)

def test(impala=False):
import pytest
import ibis
import os

ibis_dir, _ = os.path.split(ibis.__file__)

args = ['--pyargs', ibis_dir]
if impala:
args.append('--impala')
pytest.main(args)

from ._version import get_versions
__version__ = get_versions()['version']
Expand Down
12 changes: 7 additions & 5 deletions ibis/bigquery/api.py
@@ -1,9 +1,11 @@
import google.cloud.bigquery # noqa: F401 fail early if bigquery is missing
import ibis.common as com
from ibis.config import options # noqa: F401
from ibis.bigquery.client import BigQueryClient
from ibis.bigquery.compiler import dialect


def compile(expr):
def compile(expr, params=None):
"""
Force compilation of expression as though it were an expression depending
on BigQuery. Note you can also call expr.compile()
Expand All @@ -12,17 +14,17 @@ def compile(expr):
-------
compiled : string
"""
from .compiler import to_sql
return to_sql(expr)
from ibis.bigquery.compiler import to_sql
return to_sql(expr, dialect.make_context(params=params))


def verify(expr):
def verify(expr, params=None):
"""
Determine if expression can be successfully translated to execute on
BigQuery
"""
try:
compile(expr)
compile(expr, params=params)
return True
except com.TranslationError:
return False
Expand Down
226 changes: 216 additions & 10 deletions ibis/bigquery/client.py
@@ -1,13 +1,29 @@
import re
import regex as re
import time
import collections
import datetime

import six

import pandas as pd
import google.cloud.bigquery as bq

from multipledispatch import Dispatcher

import ibis
import ibis.common as com
import ibis.expr.types as ir
import ibis.expr.schema as sch
import ibis.expr.datatypes as dt

from ibis.compat import parse_version
from ibis.client import Database, Query, SQLClient
from ibis.bigquery import compiler as comp
import google.cloud.bigquery

from google.api.core.exceptions import BadRequest


NATIVE_PARTITION_COL = '_PARTITIONTIME'


def _ensure_split(table_id, dataset_id):
Expand All @@ -22,6 +38,64 @@ def _ensure_split(table_id, dataset_id):
return (table_id, dataset_id)


_IBIS_TYPE_TO_DTYPE = {
'string': 'STRING',
'int64': 'INT64',
'double': 'FLOAT64',
'boolean': 'BOOL',
'timestamp': 'TIMESTAMP',
'date': 'DATE',
}

_DTYPE_TO_IBIS_TYPE = {
'INT64': dt.int64,
'FLOAT64': dt.double,
'BOOL': dt.boolean,
'STRING': dt.string,
'DATE': dt.date,
# FIXME: enforce no tz info
'DATETIME': dt.timestamp,
'TIME': dt.time,
'TIMESTAMP': dt.timestamp,
'BYTES': dt.binary,
}


_LEGACY_TO_STANDARD = {
'INTEGER': 'INT64',
'FLOAT': 'FLOAT64',
'BOOLEAN': 'BOOL',
}


@dt.dtype.register(bq.schema.SchemaField)
def bigquery_field_to_ibis_dtype(field):
typ = field.field_type
if typ == 'RECORD':
fields = field.fields
assert fields
names = [el.name for el in fields]
ibis_types = list(map(dt.dtype, fields))
ibis_type = dt.Struct(names, ibis_types)
else:
ibis_type = _LEGACY_TO_STANDARD.get(typ, typ)
ibis_type = _DTYPE_TO_IBIS_TYPE.get(ibis_type, ibis_type)
if field.mode == 'REPEATED':
ibis_type = dt.Array(ibis_type)
return ibis_type


@sch.infer.register(bq.table.Table)
def bigquery_schema(table):
pairs = [(el.name, dt.dtype(el)) for el in table.schema]
try:
if table.list_partitions():
pairs.append((NATIVE_PARTITION_COL, dt.timestamp))
except BadRequest:
pass
return sch.schema(pairs)


class BigQueryCursor(object):
"""Cursor to allow the BigQuery client to reuse machinery in ibis/client.py
"""
Expand All @@ -46,14 +120,30 @@ def __exit__(self, exc_type, exc_value, traceback):

class BigQuery(Query):

def __init__(self, client, ddl, query_parameters=None):
super(BigQuery, self).__init__(client, ddl)
self.query_parameters = query_parameters or {}

def _fetch(self, cursor):
return pd.DataFrame(cursor.fetchall(), columns=cursor.columns)
df = pd.DataFrame(cursor.fetchall(), columns=cursor.columns)
return self.schema().apply_to(df)

def execute(self):
# synchronous by default
with self.client._execute(
self.compiled_ddl,
results=True,
query_parameters=self.query_parameters
) as cur:
result = self._fetch(cur)

return self._wrap_result(result)


class BigQueryAPIProxy(object):

def __init__(self, project_id):
self._client = google.cloud.bigquery.Client(project_id)
self._client = bq.Client(project_id)

@property
def client(self):
Expand All @@ -79,19 +169,100 @@ def get_table(self, table_id, dataset_id, reload=True):
def get_schema(self, table_id, dataset_id):
return self.get_table(table_id, dataset_id).schema

def run_sync_query(self, stmt):
query = self.client.run_sync_query(stmt)
query.use_legacy_sql = False
query.run()
# run_sync_query is not really synchronous: there's a timeout
while not query.job.done():
query.job.reload()
time.sleep(0.1)
return query


class BigQueryDatabase(Database):
pass


bigquery_param = Dispatcher('bigquery_param')


@bigquery_param.register(ir.StructScalar, collections.OrderedDict)
def bq_param_struct(param, value):
field_params = [bigquery_param(param[k], v) for k, v in value.items()]
return bq.StructQueryParameter(param.get_name(), *field_params)


@bigquery_param.register(ir.ArrayValue, list)
def bq_param_array(param, value):
param_type = param.type()
assert isinstance(param_type, dt.Array), str(param_type)

try:
bigquery_type = _IBIS_TYPE_TO_DTYPE[str(param_type.value_type)]
except KeyError:
raise com.UnsupportedBackendType(param_type)
else:
return bq.ArrayQueryParameter(param.get_name(), bigquery_type, value)


@bigquery_param.register(
ir.TimestampScalar,
six.string_types + (datetime.datetime, datetime.date)
)
def bq_param_timestamp(param, value):
assert isinstance(param.type(), dt.Timestamp)

# TODO(phillipc): Not sure if this is the correct way to do this.
timestamp_value = pd.Timestamp(value, tz='UTC').to_pydatetime()
return bq.ScalarQueryParameter(
param.get_name(), 'TIMESTAMP', timestamp_value)


@bigquery_param.register(ir.StringScalar, six.string_types)
def bq_param_string(param, value):
return bq.ScalarQueryParameter(param.get_name(), 'STRING', value)


@bigquery_param.register(ir.Int64Scalar, six.integer_types)
def bq_param_integer(param, value):
return bq.ScalarQueryParameter(param.get_name(), 'INT64', value)


@bigquery_param.register(ir.DoubleScalar, float)
def bq_param_double(param, value):
return bq.ScalarQueryParameter(param.get_name(), 'FLOAT64', value)


@bigquery_param.register(ir.BooleanScalar, bool)
def bq_param_boolean(param, value):
return bq.ScalarQueryParameter(param.get_name(), 'BOOL', value)


@bigquery_param.register(ir.DateScalar, six.string_types)
def bq_param_date_string(param, value):
return bigquery_param(param, pd.Timestamp(value).to_pydatetime().date())


@bigquery_param.register(ir.DateScalar, datetime.datetime)
def bq_param_date_datetime(param, value):
return bigquery_param(param, value.date())


@bigquery_param.register(ir.DateScalar, datetime.date)
def bq_param_date(param, value):
return bq.ScalarQueryParameter(param.get_name(), 'DATE', value)


class BigQueryClient(SQLClient):

sync_query = BigQuery
database_class = BigQueryDatabase
proxy_class = BigQueryAPIProxy
dialect = comp.BigQueryDialect

def __init__(self, project_id, dataset_id):
self._proxy = self.__class__.proxy_class(project_id)
self._proxy = type(self).proxy_class(project_id)
self._dataset_id = dataset_id

@property
Expand All @@ -106,8 +277,24 @@ def dataset_id(self):
def _table_expr_klass(self):
return ir.TableExpr

def _build_ast(self, expr, params=None):
return comp.build_ast(expr, params=params)
def table(self, *args, **kwargs):
t = super(BigQueryClient, self).table(*args, **kwargs)
if NATIVE_PARTITION_COL in t.columns:
col = ibis.options.bigquery.partition_col
assert col not in t
return (t
.mutate(**{col: t[NATIVE_PARTITION_COL]})
.drop([NATIVE_PARTITION_COL]))
return t

def _build_ast(self, expr, context):
result = comp.build_ast(expr, context)
return result

def _execute_query(self, ddl, async=False):
klass = self.async_query if async else self.sync_query
inst = klass(self, ddl, query_parameters=ddl.context.params)
return inst.execute()

def _fully_qualified_name(self, name, database):
dataset_id = database or self.dataset_id
Expand All @@ -116,11 +303,21 @@ def _fully_qualified_name(self, name, database):
def _get_table_schema(self, qualified_name):
return self.get_schema(qualified_name)

def _execute(self, stmt, results=True):
def _execute(self, stmt, results=True, query_parameters=None):
# TODO(phillipc): Allow **kwargs in calls to execute
query = self._proxy.client.run_sync_query(stmt)
query.use_legacy_sql = False
query.query_parameters = [
bigquery_param(param.to_expr(), value)
for param, value in (query_parameters or {}).items()
]
query.run()

# run_sync_query is not really synchronous: there's a timeout
while not query.job.done():
query.job.reload()
time.sleep(0.1)

return BigQueryCursor(query)

def database(self, name=None):
Expand Down Expand Up @@ -165,7 +362,11 @@ def list_tables(self, like=None, database=None):
def get_schema(self, name, database=None):
(table_id, dataset_id) = _ensure_split(name, database)
bq_table = self._proxy.get_table(table_id, dataset_id)
return bigquery_table_to_ibis_schema(bq_table)
return sch.infer(bq_table)

@property
def version(self):
return parse_version(bq.__version__)


_DTYPE_TO_IBIS_TYPE = {
Expand Down Expand Up @@ -206,5 +407,10 @@ def _discover_type(field):


def bigquery_table_to_ibis_schema(table):
pairs = ((el.name, _discover_type(el)) for el in table.schema)
pairs = [(el.name, _discover_type(el)) for el in table.schema]
try:
if table.list_partitions():
pairs.append((NATIVE_PARTITION_COL, dt.timestamp))
except BadRequest:
pass
return ibis.schema(pairs)
341 changes: 301 additions & 40 deletions ibis/bigquery/compiler.py

Large diffs are not rendered by default.

10 changes: 10 additions & 0 deletions ibis/bigquery/tests/conftest.py
Expand Up @@ -29,6 +29,16 @@ def df(alltypes):
return alltypes.execute()


@pytest.fixture(scope='session')
def parted_alltypes(client):
return client.table('functional_alltypes_parted')


@pytest.fixture(scope='session')
def parted_df(parted_alltypes):
return parted_alltypes.execute()


@pytest.fixture(scope='session')
def struct_table(client):
return client.table('struct_table')
248 changes: 236 additions & 12 deletions ibis/bigquery/tests/test_client.py
@@ -1,10 +1,16 @@
import collections

from datetime import date, datetime

import pytest

import numpy as np
import pandas as pd
import pandas.util.testing as tm

import ibis
import ibis.common as com
import ibis.expr.datatypes as dt
import ibis.expr.types as ir


Expand Down Expand Up @@ -40,7 +46,10 @@ def test_simple_aggregate_execute(alltypes, df):


def test_list_tables(client):
assert len(client.list_tables(like='functional_alltypes')) == 1
assert set(client.list_tables(like='functional_alltypes')) == {
'functional_alltypes',
'functional_alltypes_parted',
}


def test_current_database(client):
Expand All @@ -61,14 +70,6 @@ def test_database_layer(client):
assert sorted(actual) == sorted(expected)


def test_compile_verify(alltypes):
column = alltypes['string_col']
unsupported_expr = column.replace('foo', 'bar')
supported_expr = column.lower()
assert not unsupported_expr.verify()
assert supported_expr.verify()


def test_compile_toplevel():
t = ibis.table([('foo', 'double')], name='t0')

Expand Down Expand Up @@ -134,11 +135,234 @@ def test_array_length(struct_table):
tm.assert_series_equal(result, expected)


@pytest.mark.xfail
def test_array_collect(struct_table):
key = struct_table.array_of_structs_col[0].string_field
expr = struct_table.groupby(key).aggregate(
expr = struct_table.groupby(key=key).aggregate(
foo=lambda t: t.array_of_structs_col[0].int_field.collect()
)
result = expr.execute()
assert result == -1
expected = struct_table.execute()
expected = expected.assign(
key=expected.array_of_structs_col.apply(lambda x: x[0]['string_field'])
).groupby('key').apply(
lambda df: list(
df.array_of_structs_col.apply(lambda x: x[0]['int_field'])
)
).reset_index().rename(columns={0: 'foo'})
tm.assert_frame_equal(result, expected)


def test_count_distinct_with_filter(alltypes):
expr = alltypes.string_col.nunique(
where=alltypes.string_col.cast('int64') > 1
)
result = expr.execute()
expected = alltypes.string_col.execute()
expected = expected[expected.astype('int64') > 1].nunique()
assert result == expected


@pytest.mark.parametrize('type', ['date', dt.date])
def test_cast_string_to_date(alltypes, df, type):
import toolz

string_col = alltypes.date_string_col
month, day, year = toolz.take(3, string_col.split('/'))

expr = '20' + ibis.literal('-').join([year, month, day])
expr = expr.cast(type)
result = expr.execute().astype(
'datetime64[ns]'
).sort_values().reset_index(drop=True).rename('date_string_col')
expected = pd.to_datetime(
df.date_string_col
).dt.normalize().sort_values().reset_index(drop=True)
tm.assert_series_equal(result, expected)


def test_has_partitions(alltypes, parted_alltypes, client):
col = ibis.options.bigquery.partition_col
assert col not in alltypes.columns
assert col in parted_alltypes.columns


def test_different_partition_col_name(client):
col = ibis.options.bigquery.partition_col = 'FOO_BAR'
alltypes = client.table('functional_alltypes')
parted_alltypes = client.table('functional_alltypes_parted')
assert col not in alltypes.columns
assert col in parted_alltypes.columns


def test_subquery_scalar_params(alltypes):
t = alltypes
param = ibis.param('timestamp').name('my_param')
expr = t[['float_col', 'timestamp_col', 'int_col', 'string_col']][
lambda t: t.timestamp_col < param
].groupby('string_col').aggregate(
foo=lambda t: t.float_col.sum()
).foo.count()
result = expr.compile(params={param: '20140101'})
expected = """\
SELECT count(`foo`) AS `count`
FROM (
SELECT `string_col`, sum(`float_col`) AS `foo`
FROM (
SELECT `float_col`, `timestamp_col`, `int_col`, `string_col`
FROM testing.functional_alltypes
WHERE `timestamp_col` < @my_param
) t1
GROUP BY 1
) t0"""
assert result == expected


_IBIS_TYPE_TO_DTYPE = {
'string': 'STRING',
'int64': 'INT64',
'double': 'FLOAT64',
'boolean': 'BOOL',
'timestamp': 'TIMESTAMP',
'date': 'DATE',
}


def test_scalar_param_string(alltypes, df):
param = ibis.param('string')
expr = alltypes[alltypes.string_col == param]

string_value = '0'
result = expr.execute(
params={param: string_value}
).sort_values('id').reset_index(drop=True)
expected = df.loc[
df.string_col == string_value
].sort_values('id').reset_index(drop=True)
tm.assert_frame_equal(result, expected)


def test_scalar_param_int64(alltypes, df):
param = ibis.param('int64')
expr = alltypes[alltypes.string_col.cast('int64') == param]

int64_value = 0
result = expr.execute(
params={param: int64_value}
).sort_values('id').reset_index(drop=True)
expected = df.loc[
df.string_col.astype('int64') == int64_value
].sort_values('id').reset_index(drop=True)
tm.assert_frame_equal(result, expected)


def test_scalar_param_double(alltypes, df):
param = ibis.param('double')
expr = alltypes[alltypes.string_col.cast('int64').cast('double') == param]

double_value = 0.0
result = expr.execute(
params={param: double_value}
).sort_values('id').reset_index(drop=True)
expected = df.loc[
df.string_col.astype('int64').astype('float64') == double_value
].sort_values('id').reset_index(drop=True)
tm.assert_frame_equal(result, expected)


def test_scalar_param_boolean(alltypes, df):
param = ibis.param('boolean')
expr = alltypes[(alltypes.string_col.cast('int64') == 0) == param]

bool_value = True
result = expr.execute(
params={param: bool_value}
).sort_values('id').reset_index(drop=True)
expected = df.loc[
df.string_col.astype('int64') == 0
].sort_values('id').reset_index(drop=True)
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize(
'timestamp_value',
['2009-01-20 01:02:03', date(2009, 1, 20), datetime(2009, 1, 20, 1, 2, 3)]
)
def test_scalar_param_timestamp(alltypes, df, timestamp_value):
param = ibis.param('timestamp')
expr = alltypes[alltypes.timestamp_col <= param][['timestamp_col']]

result = expr.execute(
params={param: timestamp_value}
).sort_values('timestamp_col').reset_index(drop=True)
value = pd.Timestamp(timestamp_value, tz='UTC')
expected = df.loc[
df.timestamp_col <= value, ['timestamp_col']
].sort_values('timestamp_col').reset_index(drop=True)
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize(
'date_value',
['2009-01-20', date(2009, 1, 20), datetime(2009, 1, 20)]
)
def test_scalar_param_date(alltypes, df, date_value):
param = ibis.param('date')
expr = alltypes[alltypes.timestamp_col.cast('date') <= param]

result = expr.execute(
params={param: date_value}
).sort_values('timestamp_col').reset_index(drop=True)
value = pd.Timestamp(date_value)
expected = df.loc[
df.timestamp_col.dt.normalize() <= value
].sort_values('timestamp_col').reset_index(drop=True)
tm.assert_frame_equal(result, expected)


def test_scalar_param_array(alltypes, df):
param = ibis.param('array<double>')
expr = alltypes.sort_by('id').limit(1).double_col.collect() + param
result = expr.execute(params={param: [1]})
expected = [df.sort_values('id').double_col.iat[0]] + [1.0]
assert result == expected


def test_scalar_param_struct(client):
struct_type = dt.Struct.from_tuples([('x', dt.int64), ('y', dt.string)])
param = ibis.param(struct_type)
value = collections.OrderedDict([('x', 1), ('y', 'foobar')])
result = client.execute(param, {param: value})
assert value == result


@pytest.mark.xfail(
raises=com.UnsupportedBackendType,
reason='Cannot handle nested structs/arrays in 0.27 API',
)
def test_scalar_param_nested(client):
param = ibis.param('struct<x: array<struct<y: array<double>>>>')
value = collections.OrderedDict([
(
'x',
[
collections.OrderedDict([
('y', [1.0, 2.0, 3.0])
])
]
)
])
result = client.execute(param, {param: value})
assert value == result


def test_raw_sql(client):
assert client.raw_sql('SELECT 1').fetchall() == [(1,)]


def test_scalar_param_scope(alltypes):
t = alltypes
param = ibis.param('timestamp')
mut = t.mutate(param=param).compile(params={param: '2017-01-01'})
assert mut == """\
SELECT *, @param AS `param`
FROM testing.functional_alltypes"""
14 changes: 14 additions & 0 deletions ibis/bigquery/tests/test_compiler.py
@@ -0,0 +1,14 @@
import ibis
import ibis.expr.datatypes as dt


def test_timestamp_accepts_date_literals(alltypes):
date_string = '2009-03-01'
param = ibis.param(dt.timestamp).name('param_0')
expr = alltypes.mutate(param=param)
params = {param: date_string}
result = expr.compile(params=params)
expected = """\
SELECT *, @param AS `param`
FROM testing.functional_alltypes"""
assert result == expected
29 changes: 20 additions & 9 deletions ibis/clickhouse/api.py
Expand Up @@ -2,9 +2,20 @@

from ibis.config import options
from ibis.clickhouse.client import ClickhouseClient
from ibis.clickhouse.compiler import dialect


def compile(expr):
__all__ = 'compile', 'verify', 'connect', 'dialect'


try:
import lz4 # noqa: F401
_default_compression = 'lz4'
except ImportError:
_default_compression = False


def compile(expr, params=None):
"""
Force compilation of expression as though it were an expression depending
on Clickhouse. Note you can also call expr.compile()
Expand All @@ -13,24 +24,24 @@ def compile(expr):
-------
compiled : string
"""
from .compiler import to_sql
return to_sql(expr)
from ibis.clickhouse.compiler import to_sql
return to_sql(expr, dialect.make_context(params=params))


def verify(expr):
def verify(expr, params=None):
"""
Determine if expression can be successfully translated to execute on
Clickhouse
"""
try:
compile(expr)
compile(expr, params=params)
return True
except com.TranslationError:
return False


def connect(host='localhost', port=9000, database='default', user='default',
password='', client_name='ibis', compression=False):
password='', client_name='ibis', compression=_default_compression):
"""Create an ClickhouseClient for use with Ibis.
Parameters
Expand All @@ -48,8 +59,9 @@ def connect(host='localhost', port=9000, database='default', user='default',
client_name: str, optional
This will appear in clickhouse server logs
compression: str, optional
Weather or not to use compression. Default is False.
Possible choices: lz4, lz4hc, quicklz, zstd
Whether or not to use compression.
Default is lz4 if installed, otherwise False.
Possible choices: lz4, lz4hc, quicklz, zstd, True, False
True is equivalent to 'lz4'.
Examples
Expand All @@ -71,7 +83,6 @@ def connect(host='localhost', port=9000, database='default', user='default',
-------
ClickhouseClient
"""

client = ClickhouseClient(host, port=port, database=database, user=user,
password=password, client_name=client_name,
compression=compression)
Expand Down
201 changes: 153 additions & 48 deletions ibis/clickhouse/client.py
@@ -1,67 +1,141 @@
import re
import numpy as np
import pandas as pd

import ibis.common as com
import ibis.expr.datatypes as dt
import ibis.expr.operations as ops
import ibis.expr.types as ir
import ibis.expr.schema as sch
import ibis.expr.datatypes as dt

from ibis.config import options
from ibis.compat import zip as czip
from ibis.compat import zip as czip, parse_version
from ibis.client import Query, Database, DatabaseEntity, SQLClient
from ibis.clickhouse.compiler import build_ast
from ibis.clickhouse.compiler import ClickhouseDialect, build_ast
from ibis.util import log
from ibis.sql.compiler import DDL

from clickhouse_driver.client import Client as _DriverClient

from .types import clickhouse_to_pandas, clickhouse_to_ibis


fully_qualified_re = re.compile(r"(.*)\.(?:`(.*)`|(.*))")


_clickhouse_dtypes = {
'Null': dt.Null,
'UInt8': dt.UInt8,
'UInt16': dt.UInt16,
'UInt32': dt.UInt32,
'UInt64': dt.UInt64,
'Int8': dt.Int8,
'Int16': dt.Int16,
'Int32': dt.Int32,
'Int64': dt.Int64,
'Float32': dt.Float32,
'Float64': dt.Float64,
'String': dt.String,
'FixedString': dt.String,
'Date': dt.Date,
'DateTime': dt.Timestamp
}
_ibis_dtypes = {v: k for k, v in _clickhouse_dtypes.items()}
_ibis_dtypes[dt.String] = 'String'


class ClickhouseDataType(object):

__slots__ = 'typename', 'nullable'

def __init__(self, typename, nullable=False):
if typename not in _clickhouse_dtypes:
raise com.UnsupportedBackendType(typename)
self.typename = typename
self.nullable = nullable

def __str__(self):
if self.nullable:
return 'Nullable({})'.format(self.typename)
else:
return self.typename

def __repr__(self):
return '<Clickhouse {}>'.format(str(self))

@classmethod
def parse(cls, spec):
# TODO(kszucs): spare parsing, depends on clickhouse-driver#22
if spec.startswith('Nullable'):
return cls(spec[9:-1], nullable=True)
else:
return cls(spec)

def to_ibis(self):
return _clickhouse_dtypes[self.typename](nullable=self.nullable)

@classmethod
def from_ibis(cls, dtype, nullable=None):
typename = _ibis_dtypes[type(dtype)]
if nullable is None:
nullable = dtype.nullable
return cls(typename, nullable=nullable)


@dt.dtype.register(ClickhouseDataType)
def clickhouse_to_ibis_dtype(clickhouse_dtype):
return clickhouse_dtype.to_ibis()


class ClickhouseDatabase(Database):
pass


class ClickhouseQuery(Query):

def _external_tables(self):
tables = []
for name, df in self.extra_options.get('external_tables', {}).items():
if not isinstance(df, pd.DataFrame):
raise TypeError('External table is not an instance of pandas '
'dataframe')

schema = sch.infer(df)
chtypes = map(ClickhouseDataType.from_ibis, schema.types)
structure = list(zip(schema.names, map(str, chtypes)))

tables.append(dict(name=name,
data=df.to_dict('records'),
structure=structure))
return tables

def execute(self):
# synchronous by default
cursor = self.client._execute(self.compiled_ddl)
cursor = self.client._execute(
self.compiled_ddl,
external_tables=self._external_tables()
)
result = self._fetch(cursor)
return self._wrap_result(result)

def _fetch(self, cursor):
data, columns = cursor
names, types = czip(*columns)

cols = {}
for (col, name, db_type) in czip(data, names, types):
dtype = self._db_type_to_dtype(db_type, name)
try:
cols[name] = pd.Series(col, dtype=dtype)
except TypeError:
cols[name] = pd.Series(col)

return pd.DataFrame(cols, columns=names)
data, colnames, _ = cursor
if not len(data):
# handle empty resultset
return pd.DataFrame([], columns=colnames)

def _db_type_to_dtype(self, db_type, column):
return clickhouse_to_pandas[db_type]
df = pd.DataFrame.from_items(zip(colnames, data))
return self.schema().apply_to(df)


class ClickhouseClient(SQLClient):
"""An Ibis client interface that uses Clickhouse"""

database_class = ClickhouseDatabase
sync_query = ClickhouseQuery
dialect = ClickhouseDialect

def __init__(self, *args, **kwargs):
self.con = _DriverClient(*args, **kwargs)

def _build_ast(self, expr, params=None):
return build_ast(expr, params=params)
def _build_ast(self, expr, context):
return build_ast(expr, context)

@property
def current_database(self):
Expand All @@ -79,12 +153,23 @@ def close(self):
"""Close Clickhouse connection and drop any temporary objects"""
self.con.disconnect()

def _execute(self, query):
def _execute(self, query, external_tables=(), results=True):
if isinstance(query, DDL):
query = query.compile()
self.log(query)

return self.con.execute(query, columnar=True, with_column_types=True)
response = self.con.process_ordinary_query(
query, columnar=True, with_column_types=True,
external_tables=external_tables
)
if not results:
return response

data, columns = response
colnames, typenames = czip(*columns)
coltypes = list(map(ClickhouseDataType.parse, typenames))

return data, colnames, coltypes

def _fully_qualified_name(self, name, database):
if bool(fully_qualified_re.search(name)):
Expand Down Expand Up @@ -120,7 +205,8 @@ def list_tables(self, like=None, database=None):
return self.list_tables(like=like, database=database)
statement += " LIKE '{0}'".format(like)

return self._execute(statement)
data, _, _ = self.raw_sql(statement, results=True)
return data[0]

def set_database(self, name):
"""
Expand Down Expand Up @@ -161,7 +247,8 @@ def list_databases(self, like=None):
if like:
statement += " WHERE name LIKE '{0}'".format(like)

return self._execute(statement)
data, _, _ = self.raw_sql(statement, results=True)
return data[0]

def get_schema(self, table_name, database=None):
"""
Expand All @@ -179,12 +266,12 @@ def get_schema(self, table_name, database=None):
"""
qualified_name = self._fully_qualified_name(table_name, database)
query = 'DESC {0}'.format(qualified_name)
data, _ = self._execute(query)
data, _, _ = self.raw_sql(query, results=True)

names, types = data[:2]
ibis_types = map(clickhouse_to_ibis.get, types)
colnames, coltypes = data[:2]
coltypes = list(map(ClickhouseDataType.parse, coltypes))

return dt.Schema(names, ibis_types)
return sch.schema(colnames, coltypes)

@property
def client_options(self):
Expand Down Expand Up @@ -221,10 +308,8 @@ def _get_table_schema(self, tname):
return self.get_schema(tname)

def _get_schema_using_query(self, query):
_, types = self._execute(query)
names, clickhouse_types = zip(*types)
ibis_types = map(clickhouse_to_ibis.get, clickhouse_types)
return dt.Schema(names, ibis_types)
_, colnames, coltypes = self._execute(query)
return sch.schema(colnames, coltypes)

def _exec_statement(self, stmt, adapter=None):
query = ClickhouseQuery(self, stmt)
Expand All @@ -237,6 +322,21 @@ def _table_command(self, cmd, name, database=None):
qualified_name = self._fully_qualified_name(name, database)
return '{0} {1}'.format(cmd, qualified_name)

@property
def version(self):
self.con.connection.force_connect()

try:
server = self.con.connection.server_info
vstring = '{}.{}.{}'.format(server.version_major,
server.version_minor,
server.revision)
except Exception:
self.con.connection.disconnect()
raise
else:
return parse_version(vstring)


class ClickhouseTable(ir.TableExpr, DatabaseEntity):
"""References a physical table in Clickhouse"""
Expand Down Expand Up @@ -287,18 +387,23 @@ def name(self):
def _execute(self, stmt):
return self._client._execute(stmt)

def insert(self, obj, **kwargs):
from .identifiers import quote_identifier
schema = self.schema()

class ClickhouseTemporaryTable(ops.DatabaseTable):
assert isinstance(obj, pd.DataFrame)
assert set(schema.names) >= set(obj.columns)

def __del__(self):
try:
self.drop()
except com.IbisError:
pass
columns = ', '.join(map(quote_identifier, obj.columns))
query = 'INSERT INTO {table} ({columns}) VALUES'.format(
table=self._qualified_name, columns=columns)

def drop(self):
try:
self.source.drop_table(self.name)
except Exception: # ClickhouseError
# database might have been dropped
pass
# convert data columns with datetime64 pandas dtype to native date
# because clickhouse-driver 0.0.10 does arithmetic operations on it
obj = obj.copy()
for col in obj.select_dtypes(include=[np.datetime64]):
if isinstance(schema[col], dt.Date):
obj[col] = obj[col].dt.date

data = obj.to_dict('records')
return self._client.con.process_insert_query(query, data, **kwargs)
30 changes: 23 additions & 7 deletions ibis/clickhouse/compiler.py
Expand Up @@ -9,8 +9,8 @@
from .operations import _operation_registry, _name_expr


def build_ast(expr, context=None, params=None):
builder = ClickhouseQueryBuilder(expr, context=context, params=params)
def build_ast(expr, context):
builder = ClickhouseQueryBuilder(expr, context=context)
return builder.get_result()


Expand Down Expand Up @@ -40,10 +40,6 @@ class ClickhouseQueryBuilder(comp.QueryBuilder):

select_builder = ClickhouseSelectBuilder

@property
def _make_context(self):
return ClickhouseQueryContext


class ClickhouseQueryContext(comp.QueryContext):

Expand Down Expand Up @@ -82,6 +78,19 @@ def format_group_by(self):

return '\n'.join(lines)

def format_limit(self):
if not self.limit:
return None

buf = StringIO()

n, offset = self.limit['n'], self.limit['offset']
buf.write('LIMIT {}'.format(n))
if offset is not None and offset != 0:
buf.write(', {}'.format(offset))

return buf.getvalue()


class ClickhouseTableSetFormatter(comp.TableSetFormatter):

Expand Down Expand Up @@ -143,13 +152,20 @@ def _quote_identifier(self, name):
class ClickhouseExprTranslator(comp.ExprTranslator):

_registry = _operation_registry
_context_class = ClickhouseQueryContext
context_class = ClickhouseQueryContext

def name(self, translated, name, force=True):
return _name_expr(translated,
quote_identifier(name, force=force))


class ClickhouseDialect(comp.Dialect):

translator = ClickhouseExprTranslator


dialect = ClickhouseDialect

compiles = ClickhouseExprTranslator.compiles
rewrites = ClickhouseExprTranslator.rewrites

Expand Down
178 changes: 123 additions & 55 deletions ibis/clickhouse/operations.py
Expand Up @@ -8,19 +8,17 @@
import ibis.sql.transforms as transforms

from ibis.clickhouse.identifiers import quote_identifier
from ibis.clickhouse.types import ibis_to_clickhouse


def _cast(translator, expr):
from ibis.clickhouse.client import ClickhouseDataType

op = expr.op()
arg, target = op.args
arg_ = translator.translate(arg)
type_ = str(ClickhouseDataType.from_ibis(target, nullable=False))

if isinstance(arg, ir.CategoryValue) and target == 'int32':
return arg_
else:
type_ = ibis_to_clickhouse[target.name.lower()]
return 'CAST({0!s} AS {1!s})'.format(arg_, type_)
return 'CAST({0!s} AS {1!s})'.format(arg_, type_)


def _between(translator, expr):
Expand Down Expand Up @@ -65,7 +63,7 @@ def formatter(translator, expr):
arg_count = len(op.args)
if arity != arg_count:
msg = 'Incorrect number of args {0} instead of {1}'
raise com.TranslationError(msg.format(arg_count, arity))
raise com.UnsupportedOperationError(msg.format(arg_count, arity))
return _call(translator, func_name, *op.args)
return formatter

Expand All @@ -81,7 +79,7 @@ def agg_variance_like(func):
'pop': '{0}Pop'.format(func)}

def formatter(translator, expr):
arg, where, how = expr.op().args
arg, how, where = expr.op().args
return _aggregate(translator, variants[how], arg, where)

return formatter
Expand Down Expand Up @@ -129,6 +127,14 @@ def varargs_formatter(translator, expr):
return varargs_formatter


def _arbitrary(translator, expr):
arg, how, where = expr.op().args
functions = {'first': 'any',
'last': 'anyLast',
'heavy': 'anyHeavy'}
return _aggregate(translator, functions[how], arg, where=where)


def _substring(translator, expr):
# arg_ is the formatted notation
op = expr.op()
Expand All @@ -151,8 +157,9 @@ def _string_find(translator, expr):
op = expr.op()
arg, substr, start, _ = op.args
if start is not None:
raise com.TranslationError('String find doesn\'t '
'support start argument')
raise com.UnsupportedOperationError(
"String find doesn't support start argument"
)

return _call(translator, 'position', arg, substr) + ' - 1'

Expand All @@ -169,12 +176,6 @@ def _regex_extract(translator, expr):
return 'extractAll({0}, {1})'.format(arg_, pattern_)


def _string_join(translator, expr):
op = expr.op()
arg, strings = op.args
return _call(translator, 'concat_ws', arg, *strings)


def _parse_url(translator, expr):
op = expr.op()
arg, extract, key = op.args
Expand All @@ -191,8 +192,9 @@ def _parse_url(translator, expr):
else:
return _call(translator, 'queryString', arg)
else:
raise com.TranslationError('Parse url with extrac {0} is not '
'supported'.format(extract))
raise com.UnsupportedOperationError(
'Parse url with extract {0} is not supported'.format(extract)
)


def _index_of(translator, expr):
Expand Down Expand Up @@ -233,8 +235,9 @@ def _hash(translator, expr):
'sipHash64', 'sipHash128'}

if how not in algorithms:
raise com.TranslationError('Unsupported hash algorithm {0}'
.format(how))
raise com.UnsupportedOperationError(
'Unsupported hash algorithm {0}'.format(how)
)

return _call(translator, how, arg)

Expand All @@ -261,6 +264,26 @@ def _value_list(translator, expr):
return '({0})'.format(', '.join(values_))


def _interval_format(translator, expr):
if expr.unit in {'ms', 'us', 'ns'}:
raise com.UnsupportedOperationError(
"Clickhouse doesn't support subsecond interval resolutions")

return 'INTERVAL {} {}'.format(expr.op().value, expr.resolution.upper())


def _interval_from_integer(translator, expr):
op = expr.op()
arg, unit = op.args

if expr.unit in {'ms', 'us', 'ns'}:
raise com.UnsupportedOperationError(
"Clickhouse doesn't support subsecond interval resolutions")

arg_ = translator.translate(arg)
return 'INTERVAL {} {}'.format(arg_, expr.resolution.upper())


def literal(translator, expr):
value = expr.op().value
if isinstance(expr, ir.BooleanValue):
Expand All @@ -269,6 +292,8 @@ def literal(translator, expr):
return "'{0!s}'".format(value.replace("'", "\\'"))
elif isinstance(expr, ir.NumericValue):
return repr(value)
elif isinstance(expr, ir.IntervalValue):
return _interval_format(translator, expr)
elif isinstance(expr, ir.TimestampValue):
if isinstance(value, datetime):
if value.microsecond != 0:
Expand All @@ -280,8 +305,10 @@ def literal(translator, expr):
if isinstance(value, date):
value = value.strftime('%Y-%m-%d')
return "toDate('{0!s}')".format(value)
elif isinstance(expr, ir.ArrayValue):
return str(list(value))
else:
raise NotImplementedError
raise NotImplementedError(type(expr))


class CaseFormatter(object):
Expand Down Expand Up @@ -359,45 +386,32 @@ def _timestamp_from_unix(translator, expr):
op = expr.op()
arg, unit = op.args

if unit == 'ms':
raise ValueError('`ms` unit is not supported!')
elif unit == 'us':
raise ValueError('`us` unit is not supported!')
if unit in {'ms', 'us', 'ns'}:
raise ValueError('`{}` unit is not supported!'.format(unit))

return _call(translator, 'toDateTime', arg)


def _timestamp_delta(translator, expr):
op = expr.op()
arg, offset = op.args

if isinstance(arg, ir.TimestampValue):
offset_ = offset.to_unit('s').n
elif isinstance(arg, ir.DateValue):
offset_ = offset.to_unit('d').n
else:
raise com.TranslationError('Unsupported timedelta operation')

arg_ = translator.translate(arg)
return '{0} + {1}'.format(arg_, offset_)


def _truncate(translator, expr):
op = expr.op()
arg, unit = op.args

converters = {
'Y': 'toStartOfYear',
'M': 'toStartOfMonth',
'W': 'toMonday',
'D': 'toDate',
'H': 'toStartOfHour',
'MI': 'toStartOfMinute'
'h': 'toStartOfHour',
'm': 'toStartOfMinute',
's': 'toDateTime'
}

try:
converter = converters[unit]
except KeyError:
raise com.TranslationError('Unsupported concat unit {0}'.format(unit))
raise com.UnsupportedOperationError(
'Unsupported truncate unit {}'.format(unit)
)

return _call(translator, converter, arg)

Expand Down Expand Up @@ -436,7 +450,7 @@ def _table_column(translator, expr):
proj_expr = table.projection([field_name]).to_array()
return _table_array_view(translator, proj_expr)

# TODO: table aliasing is partially supported
# TODO(kszucs): table aliasing is partially supported
# if ctx.need_aliases():
# alias = ctx.get_ref(table)
# if alias is not None:
Expand All @@ -445,7 +459,40 @@ def _table_column(translator, expr):
return quoted_name


# TODO: clickhouse uses differenct string functions
def _string_split(translator, expr):
value, sep = expr.op().args
return 'splitByString({}, {})'.format(
translator.translate(sep),
translator.translate(value)
)


def _string_join(translator, expr):
sep, elements = expr.op().args
assert isinstance(elements.op(), ir.ValueList), \
'elements must be a ValueList, got {}'.format(type(elements.op()))
return 'arrayStringConcat([{}], {})'.format(
', '.join(map(translator.translate, elements)),
translator.translate(sep),
)


def _string_repeat(translator, expr):
value, times = expr.op().args
result = 'arrayStringConcat(arrayMap(x -> {}, range({})))'.format(
translator.translate(value), translator.translate(times)
)
return result


def _string_like(translator, expr):
value, pattern = expr.op().args[:2]
return '{} LIKE {}'.format(
translator.translate(value), translator.translate(pattern)
)


# TODO: clickhouse uses different string functions
# for ascii and utf-8 encodings,

_binary_infix_ops = {
Expand Down Expand Up @@ -481,6 +528,9 @@ def _table_column(translator, expr):
# Unary operations
ops.TypeOf: unary('toTypeName'),

ops.IsNan: unary('isNaN'),
ops.IsInf: unary('isInfinite'),

ops.Abs: unary('abs'),
ops.Ceil: unary('ceil'),
ops.Floor: unary('floor'),
Expand Down Expand Up @@ -514,6 +564,7 @@ def _table_column(translator, expr):

ops.Count: agg('count'),
ops.CountDistinct: agg('uniq'),
ops.Arbitrary: _arbitrary,

# string operations
ops.StringLength: unary('length'),
Expand All @@ -524,26 +575,34 @@ def _table_column(translator, expr):
ops.StringFind: _string_find,
ops.FindInSet: _index_of,
ops.StringReplace: fixed_arity('replaceAll', 3),
ops.StringJoin: _string_join,
ops.StringSplit: _string_split,
ops.StringSQLLike: _string_like,
ops.Repeat: _string_repeat,

# TODO: there are no concat_ws in clickhouse
# ops.StringJoin: varargs('concat'),

ops.StringSQLLike: binary_infix_op('LIKE'),
ops.RegexSearch: fixed_arity('match', 2),
# TODO: extractAll(haystack, pattern)[index + 1]
ops.RegexExtract: _regex_extract,
ops.RegexReplace: fixed_arity('replaceRegexpAll', 3),
ops.ParseURL: _parse_url,

# Timestamp operations
# Temporal operations
ops.Date: unary('toDate'),
ops.DateTruncate: _truncate,

ops.TimestampNow: lambda *args: 'now()',
ops.TimestampTruncate: _truncate,

ops.TimeTruncate: _truncate,

ops.IntervalFromInteger: _interval_from_integer,

ops.ExtractYear: unary('toYear'),
ops.ExtractMonth: unary('toMonth'),
ops.ExtractDay: unary('toDayOfMonth'),
ops.ExtractHour: unary('toHour'),
ops.ExtractMinute: unary('toMinute'),
ops.ExtractSecond: unary('toSecond'),
ops.Truncate: _truncate,

# Other operations
ops.E: lambda *args: 'e()',
@@ -553,6 +612,8 @@ def _table_column(translator, expr):

ops.Cast: _cast,

    # for more than 2 args this should be arrayGreatest|Least(array([]))
    # because clickhouse's greatest and least don't support varargs
    # (see the sketch just below this registry)
ops.Greatest: varargs('greatest'),
ops.Least: varargs('least'),

@@ -568,18 +629,25 @@ def _table_column(translator, expr):
ops.TableColumn: _table_column,
ops.TableArrayView: _table_array_view,

ops.TimestampDelta: _timestamp_delta,
ops.DateAdd: binary_infix_op('+'),
ops.DateSub: binary_infix_op('-'),
ops.DateDiff: binary_infix_op('-'),
ops.TimestampAdd: binary_infix_op('+'),
ops.TimestampSub: binary_infix_op('-'),
ops.TimestampDiff: binary_infix_op('-'),
ops.TimestampFromUNIX: _timestamp_from_unix,

transforms.ExistsSubquery: _exists_subquery,
transforms.NotExistsSubquery: _exists_subquery
transforms.NotExistsSubquery: _exists_subquery,

ops.ArrayLength: unary('length'),
}
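
# Sketch of the greatest/least workaround mentioned in the comment inside the
# registry above (illustrative only; assumes arrayGreatest/arrayLeast exist,
# as that comment suggests): since ClickHouse's greatest()/least() take a
# fixed number of arguments, anything beyond two would go through an array.
def _greatest_sql_sketch(translated_args):
    if len(translated_args) <= 2:
        return 'greatest({})'.format(', '.join(translated_args))
    return 'arrayGreatest([{}])'.format(', '.join(translated_args))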


def raise_error(translator, expr, *args):
msg = 'Clickhouse backend doesn\'t support {0} operation!'
msg = "Clickhouse backend doesn't support {0} operation!"
op = expr.op()
raise com.TranslationError(msg.format(type(op)))
raise com.UnsupportedOperationError(msg.format(type(op)))


def _null_literal(translator, expr):
15 changes: 9 additions & 6 deletions ibis/clickhouse/tests/conftest.py
@@ -3,17 +3,18 @@
import pytest


CLICKHOUSE_HOST = os.environ.get('IBIS_CLICKHOUSE_HOST', 'localhost')
CLICKHOUSE_PORT = int(os.environ.get('IBIS_CLICKHOUSE_PORT', 9000))
CLICKHOUSE_USER = os.environ.get('IBIS_CLICKHOUSE_USER', 'default')
CLICKHOUSE_PASS = os.environ.get('IBIS_CLICKHOUSE_PASS', '')
CLICKHOUSE_HOST = os.environ.get('IBIS_TEST_CLICKHOUSE_HOST', 'localhost')
CLICKHOUSE_PORT = int(os.environ.get('IBIS_TEST_CLICKHOUSE_PORT', 9000))
CLICKHOUSE_USER = os.environ.get('IBIS_TEST_CLICKHOUSE_USER', 'default')
CLICKHOUSE_PASS = os.environ.get('IBIS_TEST_CLICKHOUSE_PASSWORD', '')
IBIS_TEST_CLICKHOUSE_DB = os.environ.get('IBIS_TEST_DATA_DB', 'ibis_testing')


@pytest.fixture(scope='module')
def con():
return ibis.clickhouse.connect(
host=CLICKHOUSE_HOST,
port=CLICKHOUSE_PORT,
user=CLICKHOUSE_USER,
password=CLICKHOUSE_PASS,
database=IBIS_TEST_CLICKHOUSE_DB,
@@ -37,5 +38,7 @@ def df(alltypes):

@pytest.fixture
def translate():
from ibis.clickhouse.compiler import ClickhouseExprTranslator
return lambda expr: ClickhouseExprTranslator(expr).get_result()
from ibis.clickhouse.compiler import ClickhouseDialect
dialect = ClickhouseDialect()
context = dialect.make_context()
return lambda expr: dialect.translator(expr, context).get_result()
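
# Usage sketch for the reworked fixture above (illustrative only; a real test
# would live in one of the test modules, not in conftest.py): the callable it
# returns still turns an ibis expression into ClickHouse SQL text, it just
# routes through the dialect's context machinery now.
def test_translate_fixture_sketch(alltypes, translate):
    # Expected to render the aggregate against the backtick-quoted column.
    assert translate(alltypes.double_col.sum()) == 'sum(`double_col`)'
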
141 changes: 3 additions & 138 deletions ibis/clickhouse/tests/test_aggregations.py
@@ -27,7 +27,6 @@ def test_reduction_where(con, alltypes, translate, reduction, func_translated):
expr = method(where=cond)

assert translate(expr) == expected
assert isinstance(con.execute(expr), (np.float, np.uint))


def test_std_var_pop(con, alltypes, translate):
@@ -55,98 +54,15 @@ def test_reduction_invalid_where(con, alltypes, reduction):
fn(alltypes.double_col)


# @pytest.mark.parametrize(
# ('func', 'pandas_func'),
# [
# # tier and histogram
# (
# lambda d: d.bucket([0, 10, 25, 50, 100]),
# lambda s: pd.cut(
# s, [0, 10, 25, 50, 100], right=False, labels=False,
# )
# ),
# (
# lambda d: d.bucket([0, 10, 25, 50], include_over=True),
# lambda s: pd.cut(
# s, [0, 10, 25, 50, np.inf], right=False, labels=False
# )
# ),
# (
# lambda d: d.bucket([0, 10, 25, 50], close_extreme=False),
# lambda s: pd.cut(s, [0, 10, 25, 50], right=False, labels=False),
# ),
# (
# lambda d: d.bucket(
# [0, 10, 25, 50], closed='right', close_extreme=False
# ),
# lambda s: pd.cut(
# s, [0, 10, 25, 50],
# include_lowest=False,
# right=True,
# labels=False,
# )
# ),
# (
# lambda d: d.bucket([10, 25, 50, 100], include_under=True),
# lambda s: pd.cut(
# s, [0, 10, 25, 50, 100], right=False, labels=False
# ),
# ),
# ]
# )
# def test_bucket(alltypes, df, func, pandas_func):
# expr = func(alltypes.double_col)
# result = expr.execute()
# expected = pandas_func(df.double_col)
# tm.assert_series_equal(result, expected, check_names=False)


# def test_category_label(alltypes, df):
# t = alltypes
# d = t.double_col

# bins = [0, 10, 25, 50, 100]
# labels = ['a', 'b', 'c', 'd']
# bucket = d.bucket(bins)
# expr = bucket.label(labels)
# result = expr.execute().astype('category', ordered=True)
# result.name = 'double_col'

# expected = pd.cut(df.double_col, bins, labels=labels, right=False)

# tm.assert_series_equal(result, expected)


@pytest.mark.parametrize(('func', 'pandas_func'), [
(
lambda t, cond: t.bool_col.count(),
lambda df, cond: df.bool_col.count(),
),
# (
# lambda t, cond: t.bool_col.nunique(),
# lambda df, cond: df.bool_col.nunique(),
# ),
(
lambda t, cond: t.bool_col.approx_nunique(),
lambda df, cond: df.bool_col.nunique(),
),
# group_concat
# (
# lambda t, cond: t.bool_col.any(),
# lambda df, cond: df.bool_col.any(),
# ),
# (
# lambda t, cond: t.bool_col.all(),
# lambda df, cond: df.bool_col.all(),
# ),
# (
# lambda t, cond: t.bool_col.notany(),
# lambda df, cond: ~df.bool_col.any(),
# ),
# (
# lambda t, cond: t.bool_col.notall(),
# lambda df, cond: ~df.bool_col.all(),
# ),
(
lambda t, cond: t.double_col.sum(),
lambda df, cond: df.double_col.sum(),
@@ -157,7 +73,7 @@ def test_reduction_invalid_where(con, alltypes, reduction):
),
(
lambda t, cond: t.int_col.approx_median(),
lambda df, cond: df.int_col.median(),
lambda df, cond: np.int32(df.int_col.median()),
),
(
lambda t, cond: t.double_col.min(),
@@ -187,14 +103,6 @@ def test_reduction_invalid_where(con, alltypes, reduction):
lambda t, cond: t.bool_col.count(where=cond),
lambda df, cond: df.bool_col[cond].count(),
),
# (
# lambda t, cond: t.bool_col.nunique(where=cond),
# lambda df, cond: df.bool_col[cond].nunique(),
# ),
# (
# lambda t, cond: t.bool_col.approx_nunique(where=cond),
# lambda df, cond: df.bool_col[cond].nunique(),
# ),
(
lambda t, cond: t.double_col.sum(where=cond),
lambda df, cond: df.double_col[cond].sum(),
@@ -204,8 +112,8 @@ def test_reduction_invalid_where(con, alltypes, reduction):
lambda df, cond: df.double_col[cond].mean(),
),
(
lambda t, cond: t.int_col.approx_median(where=cond),
lambda df, cond: df.int_col[cond].median(),
lambda t, cond: t.float_col.approx_median(where=cond),
lambda df, cond: df.float_col[cond].median(),
),
(
lambda t, cond: t.double_col.min(where=cond),
@@ -247,22 +155,6 @@ def test_aggregations(alltypes, df, func, pandas_func, translate):
np.testing.assert_allclose(result, expected)


# def test_group_concat(alltypes, df):
# expr = alltypes.string_col.group_concat()
# result = expr.execute()
# expected = ','.join(df.string_col.dropna())
# assert result == expected


# TODO: requires CountDistinct to support condition
# def test_distinct_aggregates(alltypes, df, translate):
# expr = alltypes.limit(100).double_col.nunique()
# result = expr.execute()

# assert translate(expr) == 'uniq(`double_col`)'
# assert result == df.head(100).double_col.nunique()


@pytest.mark.parametrize('op', [
methodcaller('sum'),
methodcaller('mean'),
@@ -284,33 +176,6 @@ def test_anonymus_aggregate(alltypes, df, translate):
tm.assert_frame_equal(result, expected, check_like=True)


# def test_rank(con):
# t = con.table('functional_alltypes')
# expr = t.double_col.rank()
# sqla_expr = expr.compile()
# result = str(sqla_expr.compile(compile_kwargs=dict(literal_binds=True)))
# expected = """\
# assert result == expected


# def test_percent_rank(con):
# t = con.table('functional_alltypes')
# expr = t.double_col.percent_rank()
# sqla_expr = expr.compile()
# result = str(sqla_expr.compile(compile_kwargs=dict(literal_binds=True)))
# expected = """\
# assert result == expected


# def test_ntile(con):
# t = con.table('functional_alltypes')
# expr = t.double_col.ntile(7)
# sqla_expr = expr.compile()
# result = str(sqla_expr.compile(compile_kwargs=dict(literal_binds=True)))
# expected = """\
# assert result == expected


def test_boolean_summary(alltypes):
expr = alltypes.bool_col.summary()
result = expr.execute()
50 changes: 49 additions & 1 deletion ibis/clickhouse/tests/test_client.py
@@ -4,11 +4,11 @@
import ibis
import ibis.config as config
import ibis.expr.types as ir
import pandas.util.testing as tm

from ibis import literal as L
from ibis.compat import StringIO


pytest.importorskip('clickhouse_driver')
pytestmark = pytest.mark.clickhouse

@@ -170,3 +170,51 @@ def test_execute_exprs_no_table_ref(con):
ibis.now().name('b'),
L(2).log().name('c')])
con.execute(exlist)


def test_insert(con, alltypes, df):
drop = 'DROP TABLE IF EXISTS temporary_alltypes'
create = ('CREATE TABLE IF NOT EXISTS '
'temporary_alltypes AS functional_alltypes')

con.raw_sql(drop)
con.raw_sql(create)

temporary = con.table('temporary_alltypes')
records = df[:10]

assert len(temporary.execute()) == 0
temporary.insert(records)

tm.assert_frame_equal(temporary.execute(), records)


def test_insert_with_less_columns(con, alltypes, df):
drop = 'DROP TABLE IF EXISTS temporary_alltypes'
create = ('CREATE TABLE IF NOT EXISTS '
'temporary_alltypes AS functional_alltypes')

con.raw_sql(drop)
con.raw_sql(create)

temporary = con.table('temporary_alltypes')
records = df.loc[:10, ['string_col', 'date_col']]

with pytest.raises(AssertionError):
temporary.insert(records)


def test_insert_with_more_columns(con, alltypes, df):
drop = 'DROP TABLE IF EXISTS temporary_alltypes'
create = ('CREATE TABLE IF NOT EXISTS '
'temporary_alltypes AS functional_alltypes')

con.raw_sql(drop)
con.raw_sql(create)

temporary = con.table('temporary_alltypes')
records = df[:10]
records['non_existing_column'] = 'raise on me'

with pytest.raises(AssertionError):
temporary.insert(records)
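
# Possible cleanup sketch (not part of this diff): the three insert tests
# above repeat the same DROP/CREATE setup, which could instead be factored
# into a fixture along these lines.
@pytest.fixture
def temporary_alltypes_sketch(con):
    # Recreate an empty copy of functional_alltypes for each test.
    con.raw_sql('DROP TABLE IF EXISTS temporary_alltypes')
    con.raw_sql('CREATE TABLE IF NOT EXISTS '
                'temporary_alltypes AS functional_alltypes')
    return con.table('temporary_alltypes')
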
245 changes: 31 additions & 214 deletions ibis/clickhouse/tests/test_functions.py
@@ -12,7 +12,7 @@
from ibis import literal as L


pytest.importorskip('clickhouse_driver')
clickhouse_driver = pytest.importorskip('clickhouse_driver')
pytestmark = pytest.mark.clickhouse


@@ -39,17 +39,6 @@ def test_cast_string_col(alltypes, translate, to_type, expected):
assert translate(expr) == expected


# def test_char_varchar_types(con):
# sql = """\
# SELECT CAST(string_col AS varchar(20)) AS varchar_col,
# CAST(string_col AS CHAR(5)) AS char_col
# FROM ibis_testing.`functional_alltypes`"""

# t = con.sql(sql)
# assert isinstance(t.varchar_col, api.StringColumn)
# assert isinstance(t.char_col, api.StringColumn)


@pytest.mark.xfail(raises=AssertionError,
reason='Clickhouse doesn\'t have decimal type')
def test_decimal_cast():
@@ -58,7 +47,7 @@ def test_decimal_cast():

@pytest.mark.parametrize('column', [
'index',
'Unnamed_0', # FIXME rename to `Unnamed: 0`
'Unnamed: 0',
'id',
'bool_col',
'tinyint_col',
@@ -99,16 +88,17 @@ def test_timestamp_now(con, translate):


@pytest.mark.parametrize(('unit', 'expected'), [
('y', pd.Timestamp('2009-01-01')),
('m', pd.Timestamp('2009-05-01')),
('d', pd.Timestamp('2009-05-17')),
('h', pd.Timestamp('2009-05-17 12:00:00')),
('minute', pd.Timestamp('2009-05-17 12:34:00')),
pytest.mark.xfail(('y', '2009-01-01')),
pytest.mark.xfail(('m', '2009-05-01')),
pytest.mark.xfail(('d', '2009-05-17')),
pytest.mark.xfail(('w', '2009-05-11')),
('h', '2009-05-17 12:00:00'),
('minute', '2009-05-17 12:34:00'),
])
def test_timestamp_truncate(con, translate, unit, expected):
stamp = ibis.timestamp('2009-05-17 12:34:56')
expr = stamp.truncate(unit)
assert con.execute(expr) == expected
assert con.execute(expr) == pd.Timestamp(expected)


@pytest.mark.parametrize(('func', 'expected'), [
@@ -159,26 +149,6 @@ def test_coalesce(con, expr, expected):
assert con.execute(expr) == expected


# TODO: clickhouse cannot cast NULL to other types
# @pytest.mark.parametrize(
# ('expr', 'expected'),
# [
# (ibis.coalesce(ibis.NA, ibis.NA), None),
# (ibis.coalesce(ibis.NA, ibis.NA, ibis.NA.cast('double')), None),
# (
# ibis.coalesce(
# ibis.NA.cast('int8'),
# ibis.NA.cast('int8'),
# ibis.NA.cast('int8'),
# ),
# None,
# ),
# ]
# )
# def test_coalesce_all_na(con, expr, expected):
# assert con.execute(expr) == expected


@pytest.mark.parametrize(('expr', 'expected'), [
(ibis.NA.fillna(5), 5),
(L(5).fillna(10), 5),
@@ -199,7 +169,12 @@ def test_fillna_nullif(con, expr, expected):
(L(1.2345), 'Float64'),
(L(datetime(2015, 9, 1, hour=14, minute=48, second=5)), 'DateTime'),
(L(date(2015, 9, 1)), 'Date'),
(ibis.NA, 'Null')
pytest.mark.xfail(
(ibis.NA, 'Null'),
raises=AssertionError,
reason=('Client/server version mismatch not handled in the clickhouse '
'driver')
)
])
def test_typeof(con, value, expected):
assert con.execute(value.typeof()) == expected
@@ -308,30 +283,6 @@ def test_parse_url_query_parameter(con, translate):
assert con.execute(expr) == 'kEuEcWfewf8'


# def test_string_join(self):
# cases = [
# (L(',').join(['a', 'b']), "concat_ws(',', 'a', 'b')")
# ]
# self._check_expr_cases(cases)


# TODO
# def test_identical_to(self):
# cases = [
# (ibis.NA.cast('int64'), ibis.NA.cast('int64'), True),
# (L(1), L(1), True),
# (ibis.NA.cast('int64'), L(1), False),
# (L(1), ibis.NA.cast('int64'), False),
# (L(0), L(1), False),
# (L(1), L(0), False),
# ]
# con = self.con
# for left, right, expected in cases:
# expr = left.identical_to(right)
# result = con.execute(expr)
# assert result == expected


@pytest.mark.parametrize(('expr', 'expected'), [
(L('foobar').find('bar'), 3),
(L('foobar').find('baz'), -1),
@@ -477,62 +428,21 @@ def test_column_regexp_replace(con, alltypes, translate):
assert len(con.execute(expr))


# @pytest.mark.parametrize('how', [
# 'MD5', 'halfMD5',
# 'SHA1', 'SHA224', 'SHA256',
# 'intHash32', 'intHash64',
# 'cityHash64',
# 'sipHash64', 'sipHash128'
# ])
# def test_hash(con, translate, how):
# expr = L('test').hash(how=how)
# assert translate(expr) == "{0}('test')".format(how)
# assert len(con.execute(expr))


def test_numeric_builtins_work(con, alltypes, df, translate):
expr = alltypes.double_col
result = expr.execute()
expected = df.double_col.fillna(0)
tm.assert_series_equal(result, expected)


# TODO
# def test_distinct_array(con, alltypes, translate):
# expr = alltypes.string_col.distinct()
# result = con.execute(expr)
# assert isinstance(result, pd.Series)


# def test_not_exists(alltypes, df):
# t = alltypes
# t2 = t.view()

# expr = t[~(t.string_col == t2.string_col).any()]
# result = expr.execute()

# left, right = df, t2.execute()
# expected = left[left.string_col != right.string_col]

# tm.assert_frame_equal(
# result, expected,
# check_index_type=False,
# check_dtype=False,
# )


# def test_interactive_repr_shows_error(alltypes):
# # #591. Doing this in PostgreSQL because so many built-in functions are
# # not available

# expr = alltypes.double_col.approx_median()

# with config.option_context('interactive', True):
# result = repr(expr)

# assert 'no translator rule' in result.lower()


@pytest.mark.xfail(
raises=clickhouse_driver.errors.UnknownTypeError,
reason=(
'Newer clickhouse server uses Nullable(Nothing) type '
'for Null values which is currently unhandled by '
'clickhouse-driver'
)
)
def test_null_column(alltypes, translate):
t = alltypes
nrows = t.count().execute()
@@ -542,31 +452,6 @@ def test_null_column(alltypes, translate):
tm.assert_series_equal(result, expected)


# def test_null_column_union(alltypes, df):
# t = alltypes
# s = alltypes[['double_col']].mutate(
# string_col=ibis.NA.cast('string'),
# )
# expr = t[['double_col', 'string_col']].union(s)
# result = expr.execute()
# nrows = t.count().execute()
# expected = pd.concat(
# [
# df[['double_col', 'string_col']],
# pd.concat(
# [
# df[['double_col']],
# pd.DataFrame({'string_col': [None] * nrows})
# ],
# axis=1,
# )
# ],
# axis=0,
# ignore_index=True
# )
# tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize(('attr', 'expected'), [
(operator.methodcaller('year'), {2009, 2010}),
(operator.methodcaller('month'), set(range(1, 13))),
Expand All @@ -586,79 +471,11 @@ def test_timestamp_from_integer(con, alltypes, translate):
assert len(con.execute(expr))


# def test_timestamp_with_timezone(con):
# t = con.table('tzone')
# result = t.ts.execute()
# assert str(result.dtype.tz)


# @pytest.fixture(
# params=[
# None,
# 'UTC',
# 'America/New_York',
# 'America/Los_Angeles',
# 'Europe/Paris',
# 'Chile/Continental',
# 'Asia/Tel_Aviv',
# 'Asia/Tokyo',
# 'Africa/Nairobi',
# 'Australia/Sydney',
# ]
# )
# def tz(request):
# return request.param


# @pytest.yield_fixture
# def tzone_compute(con, guid, tz):
# schema = ibis.schema([
# ('ts', dt.timestamp(tz)),
# ('b', 'double'),
# ('c', 'string'),
# ])
# con.create_table(guid, schema=schema)
# t = con.table(guid)

# n = 10
# df = pd.DataFrame({
# 'ts': pd.date_range('2017-04-01', periods=n, tz=tz).values,
# 'b': np.arange(n).astype('float64'),
# 'c': list(string.ascii_lowercase[:n]),
# })

# df.to_sql(
# guid,
# con.con,
# index=False,
# if_exists='append',
# dtype={
# 'ts': sa.TIMESTAMP(timezone=True),
# 'b': sa.FLOAT,
# 'c': sa.TEXT,
# }
# )

# try:
# yield t
# finally:
# con.drop_table(guid)
# assert guid not in con.list_tables()


# def test_ts_timezone_is_preserved(tzone_compute, tz):
# assert dt.Timestamp(tz).equals(tzone_compute.ts.type())


# def test_timestamp_with_timezone_select(tzone_compute, tz):
# ts = tzone_compute.ts.execute()
# assert str(getattr(ts.dtype, 'tz', None)) == str(tz)


# def test_timestamp_type_accepts_all_timezones(con):
# assert all(
# dt.Timestamp(row.name).timezone == row.name
# for row in con.con.execute(
# 'SELECT name FROM pg_timezone_names'
# )
# )
def test_count_distinct_with_filter(alltypes):
expr = alltypes.string_col.nunique(
where=alltypes.string_col.cast('int64') > 1
)
result = expr.execute()
expected = alltypes.string_col.execute()
expected = expected[expected.astype('int64') > 1].nunique()
assert result == expected
30 changes: 2 additions & 28 deletions ibis/clickhouse/tests/test_operators.py
@@ -14,23 +14,6 @@
pytestmark = pytest.mark.clickhouse


# def test_not(alltypes):
# t = alltypes.limit(10)
# expr = t.projection([(~t.double_col.isnull()).name('double_col')])
# result = expr.execute().double_col
# expected = ~t.execute().double_col.isnull()
# tm.assert_series_equal(result, expected)


# @pytest.mark.parametrize('op', [operator.invert, operator.neg])
# def test_not_and_negate_bool(con, op, df):
# t = con.table('functional_alltypes').limit(10)
# expr = t.projection([op(t.bool_col).name('bool_col')])
# result = expr.execute().bool_col
# expected = op(df.head(10).bool_col)
# tm.assert_series_equal(result, expected)


@pytest.mark.parametrize(('left', 'right', 'type'), [
(L('2017-04-01'), date(2017, 4, 2), dt.date),
(date(2017, 4, 2), L('2017-04-01'), dt.date),
@@ -188,15 +171,6 @@ def test_negate_non_boolean(con, alltypes, field, df):
tm.assert_series_equal(result, expected)


# def test_negate_boolean(con, alltypes, df):
# t = alltypes.limit(10)
# expr = t.projection([(~t.bool_col).name('bool_col')])
# result = expr.execute().bool_col
# print(result)
# expected = ~df.head(10).bool_col
# tm.assert_series_equal(result, expected)


def test_negate_literal(con):
expr = -L(5.245)
assert round(con.execute(expr), 3) == -5.245
@@ -206,13 +180,13 @@ def test_negate_literal(con):
(
lambda t: (t.double_col > 20).ifelse(10, -20),
lambda df: pd.Series(np.where(df.double_col > 20, 10, -20),
dtype='int16')
dtype='int8')
),
(
lambda t: (t.double_col > 20).ifelse(10, -20).abs(),
lambda df: (pd.Series(np.where(df.double_col > 20, 10, -20))
.abs()
.astype('uint16'))
.astype('int8'))
),
])
def test_ifelse(alltypes, df, op, pandas_op, translate):
78 changes: 52 additions & 26 deletions ibis/clickhouse/tests/test_select.py
@@ -1,12 +1,13 @@
import sys
import pytest
import pandas as pd
import pandas.util.testing as tm

import ibis
import ibis.common as com


pytest.importorskip('clickhouse_driver')
driver = pytest.importorskip('clickhouse_driver')
pytestmark = pytest.mark.clickhouse


@@ -68,6 +69,15 @@ def test_head(alltypes):
tm.assert_frame_equal(result, expected)


def test_limit_offset(alltypes):
expected = alltypes.execute()

tm.assert_frame_equal(alltypes.limit(4).execute(), expected.head(4))
tm.assert_frame_equal(alltypes.limit(8).execute(), expected.head(8))
tm.assert_frame_equal(alltypes.limit(4, offset=4).execute(),
expected.ix[4:7].reset_index(drop=True))


def test_subquery(alltypes, df):
t = alltypes

@@ -441,31 +451,47 @@ def test_named_from_filter_groupby():
assert ibis.clickhouse.compile(expr) == expected


# def test_filter_with_analytic():
# x = ibis.table(ibis.schema([('col', 'int32')]), 'x')
# with_filter_col = x[x.columns + [ibis.null().name('filter')]]
# filtered = with_filter_col[with_filter_col['filter'].isnull()]
# subquery = filtered[filtered.columns]
def test_join_with_external_table_errors(con, alltypes, df):
external_table = ibis.table([
('a', 'string'),
('b', 'int64'),
('c', 'string')
], name='external')

# with_analytic = subquery[['col', subquery.count().name('analytic')]]
# expr = with_analytic[with_analytic.columns]
alltypes = alltypes.mutate(b=alltypes.tinyint_col)
expr = alltypes.inner_join(external_table, ['b'])[
external_table.a, external_table.c, alltypes.id]

# result = ibis.clickhouse.compile(expr)
# expected = """\
# SELECT `col`, `analytic`
# FROM (
# SELECT `col`, count(*) OVER () AS `analytic`
# FROM (
# SELECT `col`, `filter`
# FROM (
# SELECT *
# FROM (
# SELECT `col`, NULL AS `filter`
# FROM x
# ) t3
# WHERE `filter` IS NULL
# ) t2
# ) t1
# ) t0"""
with pytest.raises(driver.errors.ServerException):
expr.execute()

# assert result == expected
with pytest.raises(TypeError):
expr.execute(external_tables={'external': []})


def test_join_with_external_table(con, alltypes, df):
external_df = pd.DataFrame([
('alpha', 1, 'first'),
('beta', 2, 'second'),
('gamma', 3, 'third')
], columns=['a', 'b', 'c'])
external_df['b'] = external_df['b'].astype('int8')

external_table = ibis.table([
('a', 'string'),
('b', 'int64'),
('c', 'string')
], name='external')

alltypes = alltypes.mutate(b=alltypes.tinyint_col)
expr = alltypes.inner_join(external_table, ['b'])[
external_table.a, external_table.c, alltypes.id]

result = expr.execute(external_tables={'external': external_df})
expected = (df.assign(b=df.tinyint_col)
.merge(external_df, on='b')[['a', 'c', 'id']])

result = result.sort_values('id').reset_index(drop=True)
expected = expected.sort_values('id').reset_index(drop=True)

tm.assert_frame_equal(result, expected, check_column_type=False)