63 changes: 62 additions & 1 deletion docs/source/release.rst
@@ -11,7 +11,68 @@ Release Notes

Current ``ibis.__version__``: |version|

v0.13.0 (March 20, 2018)
v0.14.0 (August 23, 2018)
--------------------------

This release brings a refactored, more composable core and rule system to
ibis. We also focused heavily on the BigQuery backend in this release.

New Features
~~~~~~~~~~~~

* Allow keyword arguments in Node subclasses (:issue:`968`)
* Splat args into Node subclasses instead of requiring a list (:issue:`969`)
* Add support for ``UNION`` in the BigQuery backend (:issue:`1408`,
:issue:`1409`)
* Support for writing UDFs in BigQuery (:issue:`1377`). See :ref:`the BigQuery
UDF docs <udf.bigquery>` for more details.
* Support for cross-project expressions in the BigQuery backend
(:issue:`1427`, :issue:`1428`)
* Add ``strftime`` and ``to_timestamp`` support for BigQuery (:issue:`1422`,
:issue:`1410`)
* Require ``google-cloud-bigquery >=1.0`` (:issue:`1424`)
* Limited support for interval arithmetic in the pandas backend (:issue:`1407`)
* Support for subclassing ``TableExpr`` (:issue:`1439`)
* Fill out pandas backend operations (:issue:`1423`)
* Add common DDL APIs to the pandas backend (:issue:`1464`)
* Implement the ``sql`` method for BigQuery (:issue:`1463`)
* Add ``to_timestamp`` for BigQuery (:issue:`1455`)
* Add the ``mapd`` backend (:issue:`1419`)
* Implement range windows (:issue:`1349`)
* Support for map types in the pandas backend (:issue:`1498`)
* Add ``mean`` and ``sum`` for ``boolean`` types in BigQuery (:issue:`1516`)
* All recent versions of SQLAlchemy are now supported (:issue:`1384`)
* Add support for ``NUMERIC`` types in the BigQuery backend (:issue:`1534`)
* Speed up grouped and rolling operations in the pandas backend (:issue:`1549`)
* Implement ``TimestampNow`` for BigQuery and pandas (:issue:`1575`)

Bug Fixes
~~~~~~~~~

* Nullable property is now propagated through value types (:issue:`1289`)
* Implicit casting between signed and unsigned integers checks boundaries
* Fix precedence of case statement (:issue:`1412`)
* Fix handling of large timestamps (:issue:`1440`)
* Fix ``identical_to`` precedence (:issue:`1458`)
* Pandas 0.23 compatibility (:issue:`1458`)
* Preserve timezones in timestamp-typed literals (:issue:`1459`)
* Fix incorrect topological ordering of ``UNION`` expressions (:issue:`1501`)
* Fix projection fusion bug when attempting to fuse columns of the same name
(:issue:`1496`)
* Fix output type for some decimal operations (:issue:`1541`)

API Changes
~~~~~~~~~~~

* The previous, private rules API has been rewritten (:issue:`1366`)
* Input arguments for operations are now defined in a more readable way than
the previous ``input_type`` list
* Removed support for asynchronous query execution (it was previously only
implemented for Impala)
* Remove support for Python 3.4 (:issue:`1326`)
* BigQuery division defaults to using ``IEEE_DIVIDE`` (:issue:`1390`)
* Add ``tolerance`` parameter to ``asof_join`` (:issue:`1443`)

v0.13.0 (March 30, 2018)
------------------------

This release brings new backends, including support for executing against
2 changes: 2 additions & 0 deletions docs/source/tutorial.rst
@@ -16,3 +16,5 @@ Here we show Jupyter notebooks that take you through various tasks using ibis.
notebooks/tutorial/6-Advanced-Topics-TopK-SelfJoins.ipynb
notebooks/tutorial/7-Advanced-Topics-ComplexFiltering.ipynb
notebooks/tutorial/8-More-Analytics-Helpers.ipynb
notebooks/tutorial/9-Adding-a-new-elementwise-expression.ipynb
notebooks/tutorial/10-Adding-a-new-reduction-expression.ipynb
74 changes: 69 additions & 5 deletions docs/source/udf.rst
@@ -13,12 +13,13 @@ details of user defined functions.
API
---

.. _udf.api:

.. warning::

The UDF/UDAF API is quite experimental at this point and is therefore
provisional and subject to change.
The UDF/UDAF API is experimental. It is provisional and subject to change.

Going forward, the API for user defined *scalar* functions will look like this:
The API for user defined *scalar* functions will look like this:

.. code-block:: python
@@ -33,15 +34,19 @@ of using the ``@udaf`` decorator instead of the ``@udf`` decorator.
Impala
------

.. _udf.impala:

TODO

Pandas
------

.. _udf.pandas:

The pandas backend supports defining both UDFs and UDAFs.

When you define a UDF you automatically get support for applying that UDF in a
scalar context, *as well as* in any group by operation.
scalar context, as well as in any group by operation.

When you define a UDAF you automatically get support for standard scalar
aggregations and group bys, as well as any supported windowing operation.
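
A minimal sketch of what these definitions look like (the import path below
is an assumption, not taken from this PR):

.. code-block:: python

   import ibis.expr.datatypes as dt
   from ibis.pandas.udf import udf, udaf  # assumed import path

   @udf(input_type=[dt.double], output_type=dt.double)
   def add_one(x):
       # ``x`` arrives as a pandas Series in scalar and grouped contexts
       return x + 1.0

   @udaf(input_type=[dt.double], output_type=dt.double)
   def my_mean(x):
       # reductions receive the (possibly grouped) Series and return a scalar
       return x.mean()
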
@@ -99,9 +104,68 @@ For example:
BigQuery
--------

TODO
.. _udf.bigquery:

.. note::

BigQuery only supports scalar UDFs at this time.

BigQuery supports UDFs through JavaScript. Ibis provides support for this by
turning Python code into JavaScript.

The interface is very similar to the pandas UDF API:

.. code-block:: python

   @udf([double], double)
   def my_bigquery_add_one(x):
       return x + 1.0

Ibis will parse the source of the function and turn the resulting Python AST
into JavaScript source code (technically, ECMAScript 2015). Most of the Python
language is supported, including classes, functions, and generators.

If you want to inspect the generated code, you can look at the ``js`` property
of the function.

.. code-block:: python

   >>> print(my_bigquery_add_one.js)
   CREATE TEMPORARY FUNCTION my_bigquery_add_one(x FLOAT64)
   RETURNS FLOAT64
   LANGUAGE js AS """
   'use strict';
   function my_bigquery_add_one(x) {
     return (x + 1.0);
   }
   return my_bigquery_add_one(x);
   """;

When you want to use this function, you call it like any other Python
function, only on an ibis expression:

.. code-block:: python

   >>> import ibis
   >>> t = ibis.table([('a', 'double')])
   >>> expr = my_bigquery_add_one(t.a)
   >>> print(ibis.bigquery.compile(expr))
   CREATE TEMPORARY FUNCTION my_bigquery_add_one(x FLOAT64)
   RETURNS FLOAT64
   LANGUAGE js AS """
   'use strict';
   function my_bigquery_add_one(x) {
     return (x + 1.0);
   }
   return my_bigquery_add_one(x);
   """;
   SELECT my_bigquery_add_one(`a`) AS `tmp`
   FROM t0

SQLite
------

.. _udf.sqlite:

TODO
13 changes: 10 additions & 3 deletions ibis/__init__.py
@@ -14,6 +14,7 @@


# flake8: noqa
import sys
from multipledispatch import halt_ordering, restart_ordering

import ibis.config_init
@@ -26,12 +27,12 @@
from ibis.compat import suppress
from ibis.filesystems import HDFS, WebHDFS

# speeds up signature registration
halt_ordering()

# __all__ is defined
from ibis.expr.api import *

# speeds up signature registration
halt_ordering()

# pandas backend is mandatory
import ibis.pandas.api as pandas

@@ -71,6 +72,12 @@
# pip install ibis-framework[bigquery]
import ibis.bigquery.api as bigquery

with suppress(ImportError):
# pip install ibis-framework[mapd]
if sys.version_info.major < 3:
raise ImportError('The MapD backend is not supported under Python 2.')
import ibis.mapd.api as mapd

restart_ordering()


11 changes: 9 additions & 2 deletions ibis/bigquery/api.py
@@ -1,9 +1,15 @@
import google.cloud.bigquery # noqa: F401 fail early if bigquery is missing
import ibis.common as com

from ibis.config import options # noqa: F401
from ibis.bigquery.client import BigQueryClient
from ibis.bigquery.compiler import dialect

try:
from ibis.bigquery.udf import udf # noqa: F401
except ImportError:
pass


def compile(expr, params=None):
"""
@@ -30,17 +36,18 @@ def verify(expr, params=None):
return False


def connect(project_id, dataset_id):
def connect(project_id, dataset_id, credentials=None):
"""Create a BigQueryClient for use with Ibis
Parameters
----------
project_id: str
dataset_id: str
credentials : google.auth.credentials.Credentials, optional, default None
Returns
-------
BigQueryClient
"""

return BigQueryClient(project_id, dataset_id)
return BigQueryClient(project_id, dataset_id, credentials=credentials)
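
An illustrative usage sketch (editorial, not part of this diff); the key file
path is an assumption, and any ``google.auth`` credentials object works:

    from google.oauth2 import service_account
    import ibis

    credentials = service_account.Credentials.from_service_account_file(
        '/path/to/keyfile.json')
    client = ibis.bigquery.connect('my-project', 'my_dataset',
                                   credentials=credentials)
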
456 changes: 261 additions & 195 deletions ibis/bigquery/client.py

Large diffs are not rendered by default.

265 changes: 222 additions & 43 deletions ibis/bigquery/compiler.py

Large diffs are not rendered by default.

120 changes: 120 additions & 0 deletions ibis/bigquery/datatypes.py
@@ -0,0 +1,120 @@
import six

from multipledispatch import Dispatcher

import ibis.expr.datatypes as dt


class TypeTranslationContext(object):
"""A tag class to allow alteration of the way a particular type is
translated.
Notes
-----
This is used to raise an exception when INT64 types are encountered to
avoid surprising results due to BigQuery's handling of INT64 types in
JavaScript UDFs.
"""
__slots__ = ()


class UDFContext(TypeTranslationContext):
__slots__ = ()


ibis_type_to_bigquery_type = Dispatcher('ibis_type_to_bigquery_type')


@ibis_type_to_bigquery_type.register(six.string_types)
def trans_string_default(datatype):
return ibis_type_to_bigquery_type(dt.dtype(datatype))


@ibis_type_to_bigquery_type.register(dt.DataType)
def trans_default(t):
return ibis_type_to_bigquery_type(t, TypeTranslationContext())


@ibis_type_to_bigquery_type.register(six.string_types, TypeTranslationContext)
def trans_string_context(datatype, context):
return ibis_type_to_bigquery_type(dt.dtype(datatype), context)


@ibis_type_to_bigquery_type.register(dt.Floating, TypeTranslationContext)
def trans_float64(t, context):
return 'FLOAT64'


@ibis_type_to_bigquery_type.register(dt.Integer, TypeTranslationContext)
def trans_integer(t, context):
return 'INT64'


@ibis_type_to_bigquery_type.register(
dt.UInt64, (TypeTranslationContext, UDFContext)
)
def trans_lossy_integer(t, context):
raise TypeError(
'Conversion from uint64 to BigQuery integer type (int64) is lossy'
)


@ibis_type_to_bigquery_type.register(dt.Array, TypeTranslationContext)
def trans_array(t, context):
return 'ARRAY<{}>'.format(
ibis_type_to_bigquery_type(t.value_type, context))


@ibis_type_to_bigquery_type.register(dt.Struct, TypeTranslationContext)
def trans_struct(t, context):
return 'STRUCT<{}>'.format(
', '.join(
'{} {}'.format(
name,
ibis_type_to_bigquery_type(dt.dtype(type), context)
) for name, type in zip(t.names, t.types)
)
)


@ibis_type_to_bigquery_type.register(dt.Date, TypeTranslationContext)
def trans_date(t, context):
return 'DATE'


@ibis_type_to_bigquery_type.register(dt.Timestamp, TypeTranslationContext)
def trans_timestamp(t, context):
if t.timezone is not None:
raise TypeError('BigQuery does not support timestamps with timezones')
return 'TIMESTAMP'


@ibis_type_to_bigquery_type.register(dt.DataType, TypeTranslationContext)
def trans_type(t, context):
return str(t).upper()


@ibis_type_to_bigquery_type.register(dt.Integer, UDFContext)
def trans_integer_udf(t, context):
# JavaScript does not have integers, only a Number class. BigQuery doesn't
# behave as expected with INT64 inputs or outputs
raise TypeError(
'BigQuery does not support INT64 as an argument type or a return type '
'for UDFs. Replace INT64 with FLOAT64 in your UDF signature and '
'cast all INT64 inputs to FLOAT64.'
)


@ibis_type_to_bigquery_type.register(dt.Decimal, TypeTranslationContext)
def trans_numeric(t, context):
if (t.precision, t.scale) != (38, 9):
raise TypeError(
'BigQuery only supports decimal types with precision of 38 and '
'scale of 9'
)
return 'NUMERIC'


@ibis_type_to_bigquery_type.register(dt.Decimal, UDFContext)
def trans_numeric_udf(t, context):
raise TypeError('Decimal types are not supported in BigQuery UDFs')
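
An illustrative sketch (editorial, not part of this diff) of how the
dispatcher defined above behaves, based only on the rules registered in this
file:

    import ibis.expr.datatypes as dt
    from ibis.bigquery.datatypes import ibis_type_to_bigquery_type, UDFContext

    print(ibis_type_to_bigquery_type('array<int64>'))     # ARRAY<INT64>
    print(ibis_type_to_bigquery_type(dt.Decimal(38, 9)))  # NUMERIC
    # INT64 is rejected in a UDF context (see trans_integer_udf above):
    # ibis_type_to_bigquery_type(dt.int64, UDFContext()) raises TypeError
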
46 changes: 43 additions & 3 deletions ibis/bigquery/tests/conftest.py
@@ -5,18 +5,48 @@
import ibis


PROJECT_ID = os.environ.get('GOOGLE_BIGQUERY_PROJECT_ID')
PROJECT_ID = os.environ.get('GOOGLE_BIGQUERY_PROJECT_ID', 'ibis-gbq')
DATASET_ID = 'testing'


def connect(project_id, dataset_id):
ga = pytest.importorskip('google.auth')

try:
return ibis.bigquery.connect(project_id, dataset_id)
except ga.exceptions.DefaultCredentialsError:
pytest.skip(
'no BigQuery credentials found (project_id={}, dataset_id={}), '
'skipping'.format(project_id, dataset_id)
)


@pytest.fixture(scope='session')
def project_id():
return PROJECT_ID


@pytest.fixture(scope='session')
def client():
return connect(PROJECT_ID, DATASET_ID)


@pytest.fixture(scope='session')
def client_no_credentials():
ga = pytest.importorskip('google.auth')

try:
return ibis.bigquery.connect(PROJECT_ID, DATASET_ID)
return ibis.bigquery.connect(PROJECT_ID, DATASET_ID, credentials=None)
except ga.exceptions.DefaultCredentialsError:
pytest.skip("no credentials found, skipping")
pytest.skip(
'no BigQuery credentials found (project_id={}, dataset_id={}), '
'skipping'.format(PROJECT_ID, DATASET_ID)
)


@pytest.fixture(scope='session')
def client2():
return connect(PROJECT_ID, DATASET_ID)


@pytest.fixture(scope='session')
@@ -42,3 +72,13 @@ def parted_df(parted_alltypes):
@pytest.fixture(scope='session')
def struct_table(client):
return client.table('struct_table')


@pytest.fixture(scope='session')
def numeric_table(client):
return client.table('numeric_table')


@pytest.fixture(scope='session')
def public():
return connect(PROJECT_ID, dataset_id='bigquery-public-data.stackoverflow')
407 changes: 369 additions & 38 deletions ibis/bigquery/tests/test_client.py

Large diffs are not rendered by default.

500 changes: 498 additions & 2 deletions ibis/bigquery/tests/test_compiler.py

Large diffs are not rendered by default.

84 changes: 84 additions & 0 deletions ibis/bigquery/tests/test_datatypes.py
@@ -0,0 +1,84 @@
import pytest

from multipledispatch.conflict import ambiguities

import ibis.expr.datatypes as dt
from ibis.bigquery.datatypes import (
ibis_type_to_bigquery_type, UDFContext, TypeTranslationContext
)


def test_no_ambiguities():
ambs = ambiguities(ibis_type_to_bigquery_type.funcs)
assert not ambs


@pytest.mark.parametrize(
('datatype', 'expected'),
[
(dt.float32, 'FLOAT64'),
(dt.float64, 'FLOAT64'),
(dt.uint8, 'INT64'),
(dt.uint16, 'INT64'),
(dt.uint32, 'INT64'),
(dt.int8, 'INT64'),
(dt.int16, 'INT64'),
(dt.int32, 'INT64'),
(dt.int64, 'INT64'),
(dt.string, 'STRING'),
(dt.Array(dt.int64), 'ARRAY<INT64>'),
(dt.Array(dt.string), 'ARRAY<STRING>'),
(
dt.Struct.from_tuples([
('a', dt.int64),
('b', dt.string),
('c', dt.Array(dt.string)),
]),
'STRUCT<a INT64, b STRING, c ARRAY<STRING>>'
),
(dt.date, 'DATE'),
(dt.timestamp, 'TIMESTAMP'),
pytest.mark.xfail(
(dt.timestamp(timezone='US/Eastern'), 'TIMESTAMP'),
raises=TypeError,
reason='Not supported in BigQuery'
),
('array<struct<a: string>>', 'ARRAY<STRUCT<a STRING>>'),
pytest.mark.xfail(
(dt.Decimal(38, 9), 'NUMERIC'),
raises=TypeError,
reason='Not supported in BigQuery'
),
]
)
def test_simple(datatype, expected):
context = TypeTranslationContext()
assert ibis_type_to_bigquery_type(datatype, context) == expected


@pytest.mark.parametrize('datatype', [dt.uint64, dt.Decimal(8, 3)])
def test_simple_failure_mode(datatype):
with pytest.raises(TypeError):
ibis_type_to_bigquery_type(datatype)


@pytest.mark.parametrize(
('type', 'expected'),
[
pytest.mark.xfail((dt.int64, 'INT64'), raises=TypeError),
pytest.mark.xfail(
(dt.Array(dt.int64), 'ARRAY<INT64>'),
raises=TypeError
),
pytest.mark.xfail(
(
dt.Struct.from_tuples([('a', dt.Array(dt.int64))]),
'STRUCT<a ARRAY<INT64>>'
),
raises=TypeError,
)
]
)
def test_ibis_type_to_bigquery_type_udf(type, expected):
context = UDFContext()
assert ibis_type_to_bigquery_type(type, context) == expected
1 change: 1 addition & 0 deletions ibis/bigquery/udf/__init__.py
@@ -0,0 +1 @@
from ibis.bigquery.udf.api import udf # noqa: F401
236 changes: 236 additions & 0 deletions ibis/bigquery/udf/api.py
@@ -0,0 +1,236 @@
import collections
import inspect
import itertools

import ibis.expr.rules as rlz
import ibis.expr.datatypes as dt

from ibis.compat import functools
from ibis.expr.signature import Argument as Arg

from ibis.bigquery.compiler import BigQueryUDFNode, compiles

from ibis.bigquery.udf.core import PythonToJavaScriptTranslator
from ibis.bigquery.datatypes import ibis_type_to_bigquery_type, UDFContext


__all__ = 'udf',


_udf_name_cache = collections.defaultdict(itertools.count)


def create_udf_node(name, fields):
"""Create a new UDF node type.
Parameters
----------
name : str
The name of the UDF node
fields : OrderedDict
Mapping of class member name to definition
Returns
-------
result : type
A new BigQueryUDFNode subclass
"""
definition = next(_udf_name_cache[name])
external_name = '{}_{:d}'.format(name, definition)
return type(external_name, (BigQueryUDFNode,), fields)


def udf(input_type, output_type, strict=True, libraries=None):
'''Define a UDF for BigQuery
Parameters
----------
input_type : List[DataType]
output_type : DataType
strict : bool
Whether or not to put a ``'use strict';`` string at the beginning of
the UDF. Setting to ``False`` is probably a bad idea.
libraries : List[str]
A list of Google Cloud Storage URIs containing JavaScript source
code. Note that any symbols (functions, classes, variables, etc.) that
are exposed in these JavaScript files will be visible inside the UDF.
Returns
-------
wrapper : Callable
The wrapped function
Notes
-----
``INT64`` is not supported as an argument type or a return type, as per
`the BigQuery documentation
<https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions#sql-type-encodings-in-javascript>`_.
Examples
--------
>>> from ibis.bigquery.api import udf
>>> import ibis.expr.datatypes as dt
>>> @udf(input_type=[dt.double], output_type=dt.double)
... def add_one(x):
... return x + 1
>>> print(add_one.js)
CREATE TEMPORARY FUNCTION add_one_0(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
'use strict';
function add_one(x) {
return (x + 1);
}
return add_one(x);
""";
>>> @udf(input_type=[dt.double, dt.double],
... output_type=dt.Array(dt.double))
... def my_range(start, stop):
... def gen(start, stop):
... curr = start
... while curr < stop:
... yield curr
... curr += 1
... result = []
... for value in gen(start, stop):
... result.append(value)
... return result
>>> print(my_range.js)
CREATE TEMPORARY FUNCTION my_range_0(start FLOAT64, stop FLOAT64)
RETURNS ARRAY<FLOAT64>
LANGUAGE js AS """
'use strict';
function my_range(start, stop) {
function* gen(start, stop) {
let curr = start;
while ((curr < stop)) {
yield curr;
curr += 1;
}
}
let result = [];
for (let value of gen(start, stop)) {
result.push(value);
}
return result;
}
return my_range(start, stop);
""";
>>> @udf(
... input_type=[dt.double, dt.double],
... output_type=dt.Struct.from_tuples([
... ('width', 'double'), ('height', 'double')
... ])
... )
... def my_rectangle(width, height):
... class Rectangle:
... def __init__(self, width, height):
... self.width = width
... self.height = height
...
... @property
... def area(self):
... return self.width * self.height
...
... def perimeter(self):
... return 2 * (self.width + self.height)
...
... return Rectangle(width, height)
>>> print(my_rectangle.js)
CREATE TEMPORARY FUNCTION my_rectangle_0(width FLOAT64, height FLOAT64)
RETURNS STRUCT<width FLOAT64, height FLOAT64>
LANGUAGE js AS """
'use strict';
function my_rectangle(width, height) {
class Rectangle {
constructor(width, height) {
this.width = width;
this.height = height;
}
get area() {
return (this.width * this.height);
}
perimeter() {
return (2 * (this.width + this.height));
}
}
return (new Rectangle(width, height));
}
return my_rectangle(width, height);
""";
'''
if libraries is None:
libraries = []

def wrapper(f):
if not callable(f):
raise TypeError('f must be callable, got {}'.format(f))

signature = inspect.signature(f)
parameter_names = signature.parameters.keys()

udf_node_fields = collections.OrderedDict([
(name, Arg(rlz.value(type)))
for name, type in zip(parameter_names, input_type)
] + [
(
'output_type',
lambda self, output_type=output_type: rlz.shape_like(
self.args, dtype=output_type
)
),
('__slots__', ('js',)),
])

udf_node = create_udf_node(f.__name__, udf_node_fields)

@compiles(udf_node)
def compiles_udf_node(t, expr):
return '{}({})'.format(
udf_node.__name__,
', '.join(map(t.translate, expr.op().args))
)

type_translation_context = UDFContext()
return_type = ibis_type_to_bigquery_type(
dt.dtype(output_type), type_translation_context)
bigquery_signature = ', '.join(
'{name} {type}'.format(
name=name,
type=ibis_type_to_bigquery_type(
dt.dtype(type), type_translation_context)
) for name, type in zip(parameter_names, input_type)
)
source = PythonToJavaScriptTranslator(f).compile()
js = '''\
CREATE TEMPORARY FUNCTION {external_name}({signature})
RETURNS {return_type}
LANGUAGE js AS """
{strict}{source}
return {internal_name}({args});
"""{libraries};'''.format(
external_name=udf_node.__name__,
internal_name=f.__name__,
return_type=return_type,
source=source,
signature=bigquery_signature,
strict=repr('use strict') + ';\n' if strict else '',
args=', '.join(parameter_names),
libraries=(
'\nOPTIONS (\n library={}\n)'.format(
repr(list(libraries))
) if libraries else ''
)
)

@functools.wraps(f)
def wrapped(*args, **kwargs):
node = udf_node(*args, **kwargs)
node.js = js
return node.to_expr()

wrapped.__signature__ = signature
wrapped.js = js
return wrapped

return wrapper
638 changes: 638 additions & 0 deletions ibis/bigquery/udf/core.py

Large diffs are not rendered by default.

70 changes: 70 additions & 0 deletions ibis/bigquery/udf/find.py
@@ -0,0 +1,70 @@
import ast

import toolz


class NameFinder:
"""Helper class to find the unique names in an AST.
"""

__slots__ = ()

def find(self, node):
typename = type(node).__name__
method = getattr(self, 'find_{}'.format(typename), None)
if method is None:
fields = getattr(node, '_fields', None)
if fields is None:
return
for field in fields:
value = getattr(node, field)
for result in self.find(value):
yield result
else:
for result in method(node):
yield result

def find_Name(self, node):
# TODO not sure if this is robust to scope changes
yield node

def find_list(self, node):
return list(toolz.concat(map(self.find, node)))

def find_Call(self, node):
if not isinstance(node.func, ast.Name):
fields = node._fields
else:
fields = [field for field in node._fields if field != 'func']
return toolz.concat(map(
self.find, (getattr(node, field) for field in fields)
))


def find_names(node):
"""Return the unique :class:`ast.Name` instances in an AST.
Parameters
----------
node : ast.AST
Returns
-------
unique_names : List[ast.Name]
Examples
--------
>>> import ast
>>> node = ast.parse('a + b')
>>> names = find_names(node)
>>> names # doctest: +ELLIPSIS
[<_ast.Name object at 0x...>, <_ast.Name object at 0x...>]
>>> names[0].id
'a'
>>> names[1].id
'b'
"""
return list(toolz.unique(
filter(None, NameFinder().find(node)),
key=lambda node: (node.id, type(node.ctx))
))
58 changes: 58 additions & 0 deletions ibis/bigquery/udf/rewrite.py
@@ -0,0 +1,58 @@
import ast


def matches(value, pattern):
"""Check whether `value` matches `pattern`.
Parameters
----------
value : ast.AST
pattern : ast.AST
Returns
-------
matched : bool
"""
# types must match exactly
if type(value) != type(pattern):
return False

# primitive value, such as None, True, False etc
if not isinstance(value, ast.AST) and not isinstance(pattern, ast.AST):
return value == pattern

fields = [
(field, getattr(pattern, field))
for field in pattern._fields if hasattr(pattern, field)
]
for field_name, field_value in fields:
if not matches(getattr(value, field_name), field_value):
return False
return True


class Rewriter:
"""AST pattern matching to enable rewrite rules.
Attributes
----------
funcs : List[Tuple[ast.AST, Callable[[ast.AST], ast.AST]]]
"""
def __init__(self):
self.funcs = []

def register(self, pattern):
def wrapper(f):
self.funcs.append((pattern, f))
return f
return wrapper

def __call__(self, node):
# TODO: more efficient way of doing this?
for pattern, func in self.funcs:
if matches(node, pattern):
return func(node)
return node


rewrite = Rewriter()
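
An illustrative sketch (editorial, not part of this diff) of registering a
rewrite rule with the API above; the pattern and handler are invented for
illustration:

    import ast
    from ibis.bigquery.udf.rewrite import rewrite

    # rewrite calls to ``foo(...)`` into calls to ``bar(...)``
    @rewrite.register(ast.Call(func=ast.Name(id='foo', ctx=ast.Load())))
    def rewrite_foo_to_bar(node):
        node.func.id = 'bar'
        return node

    node = ast.parse('foo(1)').body[0].value
    print(rewrite(node).func.id)  # bar
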
Empty file.
542 changes: 542 additions & 0 deletions ibis/bigquery/udf/tests/test_core.py

Large diffs are not rendered by default.

83 changes: 83 additions & 0 deletions ibis/bigquery/udf/tests/test_find.py
@@ -0,0 +1,83 @@
import ast
from ibis.bigquery.udf.find import find_names
from ibis.util import is_iterable


def parse_expr(expr):
body = parse_stmt(expr)
return body.value


def parse_stmt(stmt):
body, = ast.parse(stmt).body
return body


def eq(left, right):
if type(left) != type(right):
return False

if is_iterable(left) and is_iterable(right):
return all(map(eq, left, right))

if not isinstance(left, ast.AST) and not isinstance(right, ast.AST):
return left == right

assert hasattr(left, '_fields') and hasattr(right, '_fields')
return left._fields == right._fields and all(
eq(getattr(left, left_name), getattr(right, right_name))
for left_name, right_name in zip(left._fields, right._fields)
)


def var(id):
return ast.Name(id=id, ctx=ast.Load())


def store(id):
return ast.Name(id=id, ctx=ast.Store())


def test_find_BinOp():
expr = parse_expr('a + 1')
found = find_names(expr)
assert len(found) == 1
assert eq(found[0], var('a'))


def test_find_dup_names():
expr = parse_expr('a + 1 * a')
found = find_names(expr)
assert len(found) == 1
assert eq(found[0], var('a'))


def test_find_Name():
expr = parse_expr('b')
found = find_names(expr)
assert len(found) == 1
assert eq(found[0], var('b'))


def test_find_Tuple():
expr = parse_expr('(a, (b, 1), (((c,),),))')
found = find_names(expr)
assert len(found) == 3
assert eq(found, [var('a'), var('b'), var('c')])


def test_find_Compare():
expr = parse_expr('a < b < c == e + (f, (gh,))')
found = find_names(expr)
assert len(found) == 6
assert eq(
found,
[var('a'), var('b'), var('c'), var('e'), var('f'), var('gh')]
)


def test_find_ListComp():
expr = parse_expr('[i for i in range(n) if i < 2]')
found = find_names(expr)
assert all(isinstance(f, ast.Name) for f in found)
assert eq(found, [var('i'), store('i'), var('n')])
254 changes: 254 additions & 0 deletions ibis/bigquery/udf/tests/test_udf_execute.py
@@ -0,0 +1,254 @@
import os

import pytest

import pandas as pd
import pandas.util.testing as tm

import ibis
import ibis.expr.datatypes as dt

pytest.importorskip('google.cloud.bigquery')

pytestmark = pytest.mark.bigquery

from ibis.bigquery.api import udf # noqa: E402

PROJECT_ID = os.environ.get('GOOGLE_BIGQUERY_PROJECT_ID', 'ibis-gbq')
DATASET_ID = 'testing'


@pytest.fixture(scope='module')
def client():
ga = pytest.importorskip('google.auth')

try:
return ibis.bigquery.connect(PROJECT_ID, DATASET_ID)
except ga.exceptions.DefaultCredentialsError:
pytest.skip("no credentials found, skipping")


@pytest.fixture(scope='module')
def alltypes(client):
t = client.table('functional_alltypes')
expr = t[t.bigint_col.isin([10, 20])].limit(10)
return expr


@pytest.fixture(scope='module')
def df(alltypes):
return alltypes.execute()


def test_udf(client, alltypes, df):
@udf(input_type=[dt.double, dt.double], output_type=dt.double)
def my_add(a, b):
return a + b

expr = my_add(alltypes.double_col, alltypes.double_col)
result = expr.execute()
assert not result.empty

expected = (df.double_col + df.double_col).rename('tmp')
tm.assert_series_equal(
result.value_counts().sort_index(),
expected.value_counts().sort_index()
)


def test_udf_with_struct(client, alltypes, df):
@udf(
input_type=[dt.double, dt.double],
output_type=dt.Struct.from_tuples([
('width', dt.double),
('height', dt.double)
])
)
def my_struct_thing(a, b):
class Rectangle:
def __init__(self, width, height):
self.width = width
self.height = height
return Rectangle(a, b)

assert my_struct_thing.js == '''\
CREATE TEMPORARY FUNCTION my_struct_thing_0(a FLOAT64, b FLOAT64)
RETURNS STRUCT<width FLOAT64, height FLOAT64>
LANGUAGE js AS """
'use strict';
function my_struct_thing(a, b) {
class Rectangle {
constructor(width, height) {
this.width = width;
this.height = height;
}
}
return (new Rectangle(a, b));
}
return my_struct_thing(a, b);
""";'''

expr = my_struct_thing(alltypes.double_col, alltypes.double_col)
result = expr.execute()
assert not result.empty

expected = pd.Series(
[{'width': c, 'height': c} for c in df.double_col],
name='tmp'
)
tm.assert_series_equal(result, expected)


def test_udf_compose(client, alltypes, df):
@udf([dt.double], dt.double)
def add_one(x):
return x + 1.0

@udf([dt.double], dt.double)
def times_two(x):
return x * 2.0

t = alltypes
expr = times_two(add_one(t.double_col))
result = expr.execute()
expected = ((df.double_col + 1.0) * 2.0).rename('tmp')
tm.assert_series_equal(result, expected)


def test_udf_scalar(client):
@udf([dt.double, dt.double], dt.double)
def my_add(x, y):
return x + y

expr = my_add(1, 2)
result = client.execute(expr)
assert result == 3


def test_multiple_calls_has_one_definition(client):

@udf([dt.string], dt.double)
def my_str_len(s):
return s.length

s = ibis.literal('abcd')
expr = my_str_len(s) + my_str_len(s)
sql = client.compile(expr)
expected = '''\
CREATE TEMPORARY FUNCTION my_str_len_0(s STRING)
RETURNS FLOAT64
LANGUAGE js AS """
'use strict';
function my_str_len(s) {
return s.length;
}
return my_str_len(s);
""";
SELECT my_str_len_0('abcd') + my_str_len_0('abcd') AS `tmp`'''
assert sql == expected
result = client.execute(expr)
assert result == 8.0


def test_udf_libraries(client):
@udf(
[dt.Array(dt.string)],
dt.double,
# whatever symbols are exported in the library are visible inside the
# UDF, in this case lodash defines _ and we use that here
libraries=['gs://ibis-testing-libraries/lodash.min.js']
)
def string_length(strings):
return _.sum(_.map(strings, lambda x: x.length)) # noqa: F821

raw_data = ['aaa', 'bb', 'c']
data = ibis.literal(raw_data)
expr = string_length(data)
result = client.execute(expr)
expected = sum(map(len, raw_data))
assert result == expected


def test_udf_with_len(client):
@udf([dt.string], dt.double)
def my_str_len(x):
return len(x)

@udf([dt.Array(dt.string)], dt.double)
def my_array_len(x):
return len(x)

assert client.execute(my_str_len('aaa')) == 3
assert client.execute(my_array_len(['aaa', 'bb'])) == 2


def test_multiple_calls_redefinition(client):

@udf([dt.string], dt.double)
def my_len(s):
return s.length

s = ibis.literal('abcd')
expr = my_len(s) + my_len(s)

@udf([dt.string], dt.double)
def my_len(s):
return s.length + 1
expr = expr + my_len(s)

sql = client.compile(expr)
expected = '''\
CREATE TEMPORARY FUNCTION my_len_0(s STRING)
RETURNS FLOAT64
LANGUAGE js AS """
'use strict';
function my_len(s) {
return s.length;
}
return my_len(s);
""";
CREATE TEMPORARY FUNCTION my_len_1(s STRING)
RETURNS FLOAT64
LANGUAGE js AS """
'use strict';
function my_len(s) {
return (s.length + 1);
}
return my_len(s);
""";
SELECT (my_len_0('abcd') + my_len_0('abcd')) + my_len_1('abcd') AS `tmp`'''
assert sql == expected


@pytest.mark.parametrize(
('argument_type', 'return_type'),
[
pytest.mark.xfail((dt.int64, dt.float64), raises=TypeError),
pytest.mark.xfail((dt.float64, dt.int64), raises=TypeError),
# complex argument type, valid return type
pytest.mark.xfail((dt.Array(dt.int64), dt.float64), raises=TypeError),
# valid argument type, complex invalid return type
pytest.mark.xfail(
(dt.float64, dt.Array(dt.int64)), raises=TypeError),
# both invalid
pytest.mark.xfail(
(dt.Array(dt.Array(dt.int64)), dt.int64), raises=TypeError),
# struct type with nested integer, valid return type
pytest.mark.xfail(
(dt.Struct.from_tuples([('x', dt.Array(dt.int64))]), dt.float64),
raises=TypeError,
)
]
)
def test_udf_int64(client, argument_type, return_type):
# invalid argument type, valid return type
@udf([argument_type], return_type)
def my_int64_add(x):
return 1.0
171 changes: 91 additions & 80 deletions ibis/clickhouse/client.py
@@ -2,10 +2,13 @@
import numpy as np
import pandas as pd

from collections import OrderedDict

import ibis.common as com
import ibis.expr.types as ir
import ibis.expr.schema as sch
import ibis.expr.datatypes as dt
import ibis.expr.operations as ops

from ibis.config import options
from ibis.compat import zip as czip, parse_version
@@ -18,10 +21,12 @@


fully_qualified_re = re.compile(r"(.*)\.(?:`(.*)`|(.*))")
base_typename_re = re.compile(r"(\w+)")


_clickhouse_dtypes = {
'Null': dt.Null,
'Nothing': dt.Null,
'UInt8': dt.UInt8,
'UInt16': dt.UInt16,
'UInt32': dt.UInt32,
@@ -46,9 +51,11 @@ class ClickhouseDataType(object):
__slots__ = 'typename', 'nullable'

def __init__(self, typename, nullable=False):
if typename not in _clickhouse_dtypes:
m = base_typename_re.match(typename)
base_typename = m.groups()[0]
if base_typename not in _clickhouse_dtypes:
raise com.UnsupportedBackendType(typename)
self.typename = typename
self.typename = base_typename
self.nullable = nullable

def __str__(self):
@@ -108,7 +115,7 @@ def _external_tables(self):

def execute(self):
cursor = self.client._execute(
self.compiled_ddl,
self.compiled_sql,
external_tables=self._external_tables()
)
result = self._fetch(cursor)
@@ -120,16 +127,95 @@ def _fetch(self, cursor):
# handle empty resultset
return pd.DataFrame([], columns=colnames)

df = pd.DataFrame.from_items(zip(colnames, data))
df = pd.DataFrame.from_dict(
OrderedDict(zip(colnames, data))
)
return self.schema().apply_to(df)


class ClickhouseTable(ir.TableExpr, DatabaseEntity):
"""References a physical table in Clickhouse"""

@property
def _qualified_name(self):
return self.op().args[0]

@property
def _unqualified_name(self):
return self._match_name()[1]

@property
def _client(self):
return self.op().args[2]

def _match_name(self):
m = fully_qualified_re.match(self._qualified_name)
if not m:
raise com.IbisError('Cannot determine database name from {0}'
.format(self._qualified_name))
db, quoted, unquoted = m.groups()
return db, quoted or unquoted

@property
def _database(self):
return self._match_name()[0]

def invalidate_metadata(self):
self._client.invalidate_metadata(self._qualified_name)

def metadata(self):
"""
Return parsed results of DESCRIBE FORMATTED statement
Returns
-------
meta : TableMetadata
"""
return self._client.describe_formatted(self._qualified_name)

describe_formatted = metadata

@property
def name(self):
return self.op().name

def _execute(self, stmt):
return self._client._execute(stmt)

def insert(self, obj, **kwargs):
from .identifiers import quote_identifier
schema = self.schema()

assert isinstance(obj, pd.DataFrame)
assert set(schema.names) >= set(obj.columns)

columns = ', '.join(map(quote_identifier, obj.columns))
query = 'INSERT INTO {table} ({columns}) VALUES'.format(
table=self._qualified_name, columns=columns)

# convert data columns with datetime64 pandas dtype to native date
# because clickhouse-driver 0.0.10 does arithmetic operations on it
obj = obj.copy()
for col in obj.select_dtypes(include=[np.datetime64]):
if isinstance(schema[col], dt.Date):
obj[col] = obj[col].dt.date

data = obj.to_dict('records')
return self._client.con.process_insert_query(query, data, **kwargs)
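
An illustrative usage sketch (editorial, not part of this diff); the
connection parameters and table name are assumptions:

    import pandas as pd
    import ibis

    client = ibis.clickhouse.connect(host='localhost', port=9000)
    t = client.table('my_table')
    # the frame's columns must be a subset of the table schema (asserted
    # above); datetime64 columns are coerced to dates for Date-typed fields
    t.insert(pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']}))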


class ClickhouseDatabaseTable(ops.DatabaseTable):
pass


class ClickhouseClient(SQLClient):
"""An Ibis client interface that uses Clickhouse"""

database_class = ClickhouseDatabase
sync_query = ClickhouseQuery
query_class = ClickhouseQuery
dialect = ClickhouseDialect
table_class = ClickhouseDatabaseTable
table_expr_class = ClickhouseTable

def __init__(self, *args, **kwargs):
self.con = _DriverClient(*args, **kwargs)
@@ -142,10 +228,6 @@ def current_database(self):
# might be better to use driver.Connection instead of Client
return self.con.connection.database

@property
def _table_expr_klass(self):
return ClickhouseTable

def log(self, msg):
log(msg)

@@ -336,74 +418,3 @@ def version(self):
raise
else:
return parse_version(vstring)


class ClickhouseTable(ir.TableExpr, DatabaseEntity):
"""References a physical table in Clickhouse"""

@property
def _qualified_name(self):
return self.op().args[0]

@property
def _unqualified_name(self):
return self._match_name()[1]

@property
def _client(self):
return self.op().args[2]

def _match_name(self):
m = fully_qualified_re.match(self._qualified_name)
if not m:
raise com.IbisError('Cannot determine database name from {0}'
.format(self._qualified_name))
db, quoted, unquoted = m.groups()
return db, quoted or unquoted

@property
def _database(self):
return self._match_name()[0]

def invalidate_metadata(self):
self._client.invalidate_metadata(self._qualified_name)

def metadata(self):
"""
Return parsed results of DESCRIBE FORMATTED statement
Returns
-------
meta : TableMetadata
"""
return self._client.describe_formatted(self._qualified_name)

describe_formatted = metadata

@property
def name(self):
return self.op().name

def _execute(self, stmt):
return self._client._execute(stmt)

def insert(self, obj, **kwargs):
from .identifiers import quote_identifier
schema = self.schema()

assert isinstance(obj, pd.DataFrame)
assert set(schema.names) >= set(obj.columns)

columns = ', '.join(map(quote_identifier, obj.columns))
query = 'INSERT INTO {table} ({columns}) VALUES'.format(
table=self._qualified_name, columns=columns)

# convert data columns with datetime64 pandas dtype to native date
# because clickhouse-driver 0.0.10 does arithmetic operations on it
obj = obj.copy()
for col in obj.select_dtypes(include=[np.datetime64]):
if isinstance(schema[col], dt.Date):
obj[col] = obj[col].dt.date

data = obj.to_dict('records')
return self._client.con.process_insert_query(query, data, **kwargs)
24 changes: 14 additions & 10 deletions ibis/clickhouse/operations.py
@@ -123,7 +123,7 @@ def _name_expr(formatted_expr, quoted_name):
def varargs(func_name):
def varargs_formatter(translator, expr):
op = expr.op()
return _call(translator, func_name, *op.args)
return _call(translator, func_name, *op.arg)
return varargs_formatter


@@ -142,7 +142,7 @@ def _substring(translator, expr):
arg_, start_ = translator.translate(arg), translator.translate(start)

# Clickhouse is 1-indexed
if length is None or isinstance(length.op(), ir.Literal):
if length is None or isinstance(length.op(), ops.Literal):
if length is not None:
length_ = length.op().value
return 'substring({0}, {1} + 1, {2})'.format(arg_, start_, length_)
@@ -265,23 +265,25 @@ def _value_list(translator, expr):


def _interval_format(translator, expr):
if expr.unit in {'ms', 'us', 'ns'}:
dtype = expr.type()
if dtype.unit in {'ms', 'us', 'ns'}:
raise com.UnsupportedOperationError(
"Clickhouse doesn't support subsecond interval resolutions")

return 'INTERVAL {} {}'.format(expr.op().value, expr.resolution.upper())
return 'INTERVAL {} {}'.format(expr.op().value, dtype.resolution.upper())


def _interval_from_integer(translator, expr):
op = expr.op()
arg, unit = op.args

if expr.unit in {'ms', 'us', 'ns'}:
dtype = expr.type()
if dtype.unit in {'ms', 'us', 'ns'}:
raise com.UnsupportedOperationError(
"Clickhouse doesn't support subsecond interval resolutions")

arg_ = translator.translate(arg)
return 'INTERVAL {} {}'.format(arg_, expr.resolution.upper())
return 'INTERVAL {} {}'.format(arg_, dtype.resolution.upper())


def literal(translator, expr):
@@ -307,6 +309,8 @@ def literal(translator, expr):
return "toDate('{0!s}')".format(value)
elif isinstance(expr, ir.ArrayValue):
return str(list(value))
elif isinstance(expr, ir.SetScalar):
return '({})'.format(', '.join(map(repr, value)))
else:
raise NotImplementedError(type(expr))

@@ -469,7 +473,7 @@ def _string_split(translator, expr):

def _string_join(translator, expr):
sep, elements = expr.op().args
assert isinstance(elements.op(), ir.ValueList), \
assert isinstance(elements.op(), ops.ValueList), \
'elements must be a ValueList, got {}'.format(type(elements.op()))
return 'arrayStringConcat([{}], {})'.format(
', '.join(map(translator.translate, elements)),
@@ -607,8 +611,8 @@ def _string_like(translator, expr):
# Other operations
ops.E: lambda *args: 'e()',

ir.Literal: literal,
ir.ValueList: _value_list,
ops.Literal: literal,
ops.ValueList: _value_list,

ops.Cast: _cast,

@@ -669,7 +673,7 @@ def _zero_if_null(translator, expr):


_undocumented_operations = {
ir.NullLiteral: _null_literal, # undocumented
ops.NullLiteral: _null_literal, # undocumented
ops.IsNull: unary('isNull'),
ops.NotNull: unary('isNotNull'),
ops.IfNull: fixed_arity('ifNull', 2),
5 changes: 3 additions & 2 deletions ibis/clickhouse/tests/test_client.py
@@ -198,7 +198,8 @@ def test_insert_with_less_columns(con, alltypes, df):
con.raw_sql(create)

temporary = con.table('temporary_alltypes')
records = df.loc[:10, ['string_col', 'date_col']]
records = df.loc[:10, ['string_col']].copy()
records['date_col'] = None

with pytest.raises(AssertionError):
temporary.insert(records)
@@ -213,7 +214,7 @@ def test_insert_with_more_columns(con, alltypes, df):
con.raw_sql(create)

temporary = con.table('temporary_alltypes')
records = df[:10]
records = df[:10].copy()
records['non_existing_column'] = 'raise on me'

with pytest.raises(AssertionError):
11 changes: 7 additions & 4 deletions ibis/clickhouse/tests/test_functions.py
@@ -9,6 +9,7 @@

import ibis
import ibis.expr.types as ir
import ibis.expr.datatypes as dt
from ibis import literal as L


@@ -20,7 +21,8 @@
('int8', 'CAST(`double_col` AS Int8)'),
('int16', 'CAST(`double_col` AS Int16)'),
('float', 'CAST(`double_col` AS Float32)'),
('double', '`double_col`')
# alltypes.double_col is non-nullable
(dt.Double(nullable=False), '`double_col`')
])
def test_cast_double_col(alltypes, translate, to_type, expected):
expr = alltypes.double_col.cast(to_type)
@@ -30,7 +32,7 @@ def test_cast_double_col(alltypes, translate, to_type, expected):
@pytest.mark.parametrize(('to_type', 'expected'), [
('int8', 'CAST(`string_col` AS Int8)'),
('int16', 'CAST(`string_col` AS Int16)'),
('string', '`string_col`'),
(dt.String(nullable=False), '`string_col`'),
('timestamp', 'CAST(`string_col` AS DateTime)'),
('date', 'CAST(`string_col` AS Date)')
])
@@ -70,8 +72,9 @@ def test_noop_cast(alltypes, translate, column):


def test_timestamp_cast_noop(alltypes, translate):
result1 = alltypes.timestamp_col.cast('timestamp')
result2 = alltypes.int_col.cast('timestamp')
target = dt.Timestamp(nullable=False)
result1 = alltypes.timestamp_col.cast(target)
result2 = alltypes.int_col.cast(target)

assert isinstance(result1, ir.TimestampColumn)
assert isinstance(result2, ir.TimestampColumn)
31 changes: 9 additions & 22 deletions ibis/clickhouse/tests/test_operators.py
@@ -131,13 +131,17 @@ def test_string_temporal_compare_between_datetimes(con, left, right):
assert result


def test_field_in_literals(con, alltypes, translate):
expr = alltypes.string_col.isin(['foo', 'bar', 'baz'])
assert translate(expr) == "`string_col` IN ('foo', 'bar', 'baz')"
@pytest.mark.parametrize('container', [list, tuple, set])
def test_field_in_literals(con, alltypes, translate, container):
foobar = container(['foo', 'bar', 'baz'])
expected = tuple(set(foobar))

expr = alltypes.string_col.isin(foobar)
assert translate(expr) == "`string_col` IN {}".format(expected)
assert len(con.execute(expr))

expr = alltypes.string_col.notin(['foo', 'bar', 'baz'])
assert translate(expr) == "`string_col` NOT IN ('foo', 'bar', 'baz')"
expr = alltypes.string_col.notin(foobar)
assert translate(expr) == "`string_col` NOT IN {}".format(expected)
assert len(con.execute(expr))


@@ -230,20 +234,3 @@ def test_search_case(con, alltypes, translate):
END"""
assert translate(expr) == expected
assert len(con.execute(expr))


# TODO: Clickhouse raises incompatible type error
# def test_bucket_to_case(con, alltypes, translate):
# buckets = [0, 10, 25, 50]

# expr1 = alltypes.float_col.bucket(buckets)
# expected1 = """\
# CASE
# WHEN (`float_col` >= 0) AND (`float_col` < 10) THEN 0
# WHEN (`float_col` >= 10) AND (`float_col` < 25) THEN 1
# WHEN (`float_col` >= 25) AND (`float_col` <= 50) THEN 2
# ELSE Null
# END"""

# assert translate(expr1) == expected1
# assert len(con.execute(expr1))
25 changes: 13 additions & 12 deletions ibis/clickhouse/tests/test_select.py
@@ -48,19 +48,20 @@ def test_timestamp_extract_field(con, db, alltypes):


def test_isin_notin_in_select(con, db, alltypes, translate):
filtered = alltypes[alltypes.string_col.isin(['foo', 'bar'])]
values = ['foo', 'bar']
filtered = alltypes[alltypes.string_col.isin(values)]
result = ibis.clickhouse.compile(filtered)
expected = """SELECT *
FROM {0}.`functional_alltypes`
WHERE `string_col` IN ('foo', 'bar')"""
assert result == expected.format(db.name)
FROM {}.`functional_alltypes`
WHERE `string_col` IN {}"""
assert result == expected.format(db.name, tuple(set(values)))

filtered = alltypes[alltypes.string_col.notin(['foo', 'bar'])]
filtered = alltypes[alltypes.string_col.notin(values)]
result = ibis.clickhouse.compile(filtered)
expected = """SELECT *
FROM {0}.`functional_alltypes`
WHERE `string_col` NOT IN ('foo', 'bar')"""
assert result == expected.format(db.name)
FROM {}.`functional_alltypes`
WHERE `string_col` NOT IN {}"""
assert result == expected.format(db.name, tuple(set(values)))


def test_head(alltypes):
@@ -321,8 +322,8 @@ def test_where_simple_comparisons(con, db, alltypes):
result = ibis.clickhouse.compile(expr)
expected = """SELECT *
FROM {0}.`functional_alltypes`
WHERE `float_col` > 0 AND
`int_col` < (`float_col` * 2)"""
WHERE (`float_col` > 0) AND
(`int_col` < (`float_col` * 2))"""
assert result == expected.format(db.name)
assert len(con.execute(expr))

@@ -334,8 +335,8 @@ def test_where_with_between(con, db, alltypes):
result = ibis.clickhouse.compile(expr)
expected = """SELECT *
FROM {0}.`functional_alltypes`
WHERE `int_col` > 0 AND
`float_col` BETWEEN 0 AND 1"""
WHERE (`int_col` > 0) AND
(`float_col` BETWEEN 0 AND 1)"""
assert result == expected.format(db.name)
con.execute(expr)

15 changes: 13 additions & 2 deletions ibis/clickhouse/tests/test_types.py
@@ -1,5 +1,4 @@
import pytest
import pandas as pd


pytest.importorskip('clickhouse_driver')
@@ -14,4 +13,16 @@ def test_column_types(alltypes):
assert df.bigint_col.dtype.name == 'int64'
assert df.float_col.dtype.name == 'float32'
assert df.double_col.dtype.name == 'float64'
assert pd.core.common.is_datetime64_dtype(df.timestamp_col.dtype)
assert df.timestamp_col.dtype.name == 'datetime64[ns]'


def test_columns_types_with_additional_argument(con):
sql_types = ["toFixedString('foo', 8) AS fixedstring_col"]
if con.version.base_version >= '1.1.54337':
sql_types.append(
"toDateTime('2018-07-02 00:00:00', 'UTC') AS datetime_col")
sql = 'SELECT {}'.format(', '.join(sql_types))
df = con.sql(sql).execute()
assert df.fixedstring_col.dtype.name == 'object'
if con.version.base_version >= '1.1.54337':
assert df.datetime_col.dtype.name == 'datetime64[ns]'
119 changes: 37 additions & 82 deletions ibis/client.py
@@ -1,18 +1,5 @@
# Copyright 2014 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import abc

import six

from ibis.config import options
@@ -31,29 +18,29 @@ class Client(object):

class Query(object):

"""
Abstraction for DDL query execution to enable both synchronous and
asynchronous queries, progress, cancellation and more (for backends
supporting such functionality).
"""Abstraction for DML query execution to enable queries, progress,
cancellation and more (for backends supporting such functionality).
"""

def __init__(self, client, ddl, **kwargs):
def __init__(self, client, sql, **kwargs):
self.client = client

dml = getattr(sql, 'dml', sql)
self.expr = getattr(
ddl, 'parent_expr', getattr(ddl, 'table_set', None)
dml, 'parent_expr', getattr(dml, 'table_set', None)
)

if isinstance(ddl, comp.DDL):
self.compiled_ddl = ddl.compile()
if not isinstance(sql, six.string_types):
self.compiled_sql = sql.compile()
else:
self.compiled_ddl = ddl
self.compiled_sql = sql

self.result_wrapper = getattr(ddl, 'result_handler', None)
self.result_wrapper = getattr(dml, 'result_handler', None)
self.extra_options = kwargs

def execute(self):
# synchronous by default
with self.client._execute(self.compiled_ddl, results=True) as cur:
with self.client._execute(self.compiled_sql, results=True) as cur:
result = self._fetch(cur)

return self._wrap_result(result)
@@ -77,29 +64,12 @@ def schema(self):
'schema'.format(type(self.expr)))


class AsyncQuery(Query):

"""Abstract asynchronous query"""

def execute(self):
raise NotImplementedError

def is_finished(self):
raise NotImplementedError

def cancel(self):
raise NotImplementedError

def get_result(self):
raise NotImplementedError


class SQLClient(six.with_metaclass(abc.ABCMeta, Client)):

sync_query = Query
async_query = Query

dialect = comp.Dialect
query_class = Query
table_class = ops.DatabaseTable
table_expr_class = ir.TableExpr

def table(self, name, database=None):
"""
@@ -117,12 +87,8 @@ def table(self, name, database=None):
"""
qualified_name = self._fully_qualified_name(name, database)
schema = self._get_table_schema(qualified_name)
node = ops.DatabaseTable(qualified_name, schema, self)
return self._table_expr_klass(node)

@property
def _table_expr_klass(self):
return ir.TableExpr
node = self.table_class(qualified_name, schema, self)
return self.table_expr_class(node)

@property
def current_database(self):
@@ -174,9 +140,7 @@ def sql(self, query):
# there is already a limit in the query, we find and remove it
limited_query = 'SELECT * FROM ({}) t0 LIMIT 0'.format(query)
schema = self._get_schema_using_query(limited_query)

node = ops.SQLQueryResult(query, schema, self)
return ir.TableExpr(node)
return ops.SQLQueryResult(query, schema, self).to_expr()

def raw_sql(self, query, results=False):
"""
@@ -187,7 +151,7 @@ def raw_sql(self, query, results=False):
Parameters
----------
query : string
SQL or DDL statement
DML or DDL statement
results : boolean, default False
Pass True if the query has a result set
@@ -198,8 +162,7 @@
"""
return self._execute(query, results=results)

def execute(self, expr, params=None, limit='default', async=False,
**kwargs):
def execute(self, expr, params=None, limit='default', **kwargs):
"""
Compile and execute Ibis expression using this backend client
interface, returning results in-memory in the appropriate object type
@@ -211,7 +174,6 @@ def execute(self, expr, params=None, limit='default', async=False,
For expressions yielding result sets, retrieve at most this number of
values/rows. Overrides any limit already set on the expression.
params : not yet implemented
async : boolean, default False
Returns
-------
@@ -220,17 +182,13 @@ def execute(self, expr, params=None, limit='default', async=False,
Array expressions: pandas.Series
Scalar expressions: Python scalar value
"""
ast = self._build_ast_ensure_limit(expr, limit, params=params)

if len(ast.queries) > 1:
raise NotImplementedError
else:
return self._execute_query(ast.queries[0], async=async, **kwargs)
query_ast = self._build_ast_ensure_limit(expr, limit, params=params)
result = self._execute_query(query_ast, **kwargs)
return result

def _execute_query(self, ddl, async=False, **kwargs):
klass = self.async_query if async else self.sync_query
inst = klass(self, ddl, **kwargs)
return inst.execute()
def _execute_query(self, dml, **kwargs):
query = self.query_class(self, dml, **kwargs)
return query.execute()

def compile(self, expr, params=None, limit=None):
"""
@@ -240,17 +198,16 @@ def compile(self, expr, params=None, limit=None):
-------
output : single query or list of queries
"""
ast = self._build_ast_ensure_limit(expr, limit, params=params)
queries = [query.compile() for query in ast.queries]
return queries[0] if len(queries) == 1 else queries
query_ast = self._build_ast_ensure_limit(expr, limit, params=params)
return query_ast.compile()

def _build_ast_ensure_limit(self, expr, limit, params=None):
context = self.dialect.make_context(params=params)

ast = self._build_ast(expr, context)
query_ast = self._build_ast(expr, context)
# note: limit can still be None at this point, if the global
# default_limit is None
for query in reversed(ast.queries):
for query in reversed(query_ast.queries):
if (isinstance(query, comp.Select) and
not isinstance(expr, ir.ScalarExpr) and
query.table_set is not None):
@@ -267,7 +224,7 @@ def _build_ast_ensure_limit(self, expr, limit, params=None):
elif limit is not None and limit != 'default':
query.limit = {'n': limit,
'offset': query.limit['offset']}
return ast
return query_ast

def explain(self, expr, params=None):
"""
@@ -280,11 +237,11 @@ def explain(self, expr, params=None):
"""
if isinstance(expr, ir.Expr):
context = self.dialect.make_context(params=params)
ast = self._build_ast(expr, context)
if len(ast.queries) > 1:
query_ast = self._build_ast(expr, context)
if len(query_ast.queries) > 1:
raise Exception('Multi-query expression')

query = ast.queries[0].compile()
query = query_ast.queries[0].compile()
else:
query = expr

@@ -303,8 +260,7 @@ def _build_ast(self, expr, context):

class QueryPipeline(object):
"""
Execute a series of queries, possibly asynchronously, and capture any
result sets generated
Execute a series of queries, and capture any result sets generated
Note: No query pipelines have yet been implemented
"""
@@ -325,10 +281,9 @@ def validate_backends(backends):
return backends


def execute(expr, limit='default', async=False, params=None, **kwargs):
def execute(expr, limit='default', params=None, **kwargs):
backend, = validate_backends(list(find_backends(expr)))
return backend.execute(expr, limit=limit, async=async, params=params,
**kwargs)
return backend.execute(expr, limit=limit, params=params, **kwargs)


def compile(expr, limit=None, params=None, **kwargs):
4 changes: 4 additions & 0 deletions ibis/common.py
@@ -49,6 +49,10 @@ class UnsupportedBackendType(TranslationError):
pass


class UnboundExpressionError(ValueError, IbisError):
pass


class IbisInputError(ValueError, IbisError):
pass

Expand Down
12 changes: 8 additions & 4 deletions ibis/compat.py
@@ -28,6 +28,7 @@ def viewkeys(x):
from inspect import signature, Parameter, _empty
import unittest.mock as mock
range = range
map = map
import builtins
import pickle
maketrans = str.maketrans
@@ -45,6 +46,7 @@ def viewkeys(x):
lzip = zip
zip = itertools.izip
zip_longest = itertools.izip_longest
map = itertools.imap

def viewkeys(x):
return x.viewkeys()
@@ -107,11 +109,13 @@ def suppress(*exceptions):

# pandas compat
try:
from pandas.api.types import (DatetimeTZDtype, # noqa: F401
CategoricalDtype) # noqa: F401
from pandas.api.types import ( # noqa: F401
DatetimeTZDtype, CategoricalDtype, infer_dtype
)
except ImportError:
from pandas.types.dtypes import (DatetimeTZDtype, # noqa: F401
CategoricalDtype) # noqa: F401
from pandas.types.dtypes import ( # noqa: F401
DatetimeTZDtype, CategoricalDtype, infer_dtype
)

try:
from pandas.core.tools.datetimes import to_time, to_datetime # noqa: F401
Expand Down
92 changes: 71 additions & 21 deletions ibis/expr/analysis.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@

from ibis.common import RelationError, ExpressionError, IbisTypeError


# ---------------------------------------------------------------------
# Some expression metaprogramming / graph transformations to support
# compilation later
Expand Down Expand Up @@ -120,7 +121,7 @@ def get_result(self):
return table.projection([named_expr]), name

def _visit(self, expr):
if is_scalar_reduce(expr) and not has_multiple_bases(expr):
if is_scalar_reduction(expr) and not has_multiple_bases(expr):
# An aggregation unit
key = self._key(expr)
if key not in self.memo:
Expand Down Expand Up @@ -229,10 +230,6 @@ def finder(expr):
return lin.traverse(finder, expr)


def is_scalar_reduce(x):
return isinstance(x, ir.ScalarExpr) and ops.is_reduction(x)


def substitute_parents(expr, lift_memo=None, past_projection=True):
rewriter = ExprSimplifier(expr, lift_memo=lift_memo,
block_projection=not past_projection)
Expand All @@ -257,7 +254,7 @@ def __init__(self, expr, lift_memo=None, block_projection=False):
def get_result(self):
expr = self.expr
node = expr.op()
if isinstance(node, ir.Literal):
if isinstance(node, ops.Literal):
return expr

# For table column references, in the event that we're on top of a
Expand Down Expand Up @@ -346,18 +343,15 @@ def lift(self, expr, block=None):

def _lift_TableColumn(self, expr, block=None):
node = expr.op()

tnode = node.table.op()
root = _base_table(tnode)

root = node.table.op()
result = expr

if isinstance(root, ops.Selection):
can_lift = False

for val in root.selections:
if (isinstance(val.op(), ops.PhysicalTable) and
node.name in val.schema()):

can_lift = True
lifted_root = self.lift(val)
elif (isinstance(val.op(), ops.TableColumn) and
Expand All @@ -366,9 +360,6 @@ def _lift_TableColumn(self, expr, block=None):
can_lift = True
lifted_root = self.lift(val.op().table)

# XXX
# can_lift = False

# HACK: If we've projected a join, do not lift the children
# TODO: what about limits and other things?
# if isinstance(root.table.op(), Join):
Expand Down Expand Up @@ -585,12 +576,11 @@ def _filter_selection(expr, predicates):
simplified_predicates = [substitute_parents(x) for x in predicates]
fused_predicates = op.predicates + simplified_predicates
result = ops.Selection(op.table,
proj_exprs=op.selections,
selections=op.selections,
predicates=fused_predicates,
sort_keys=op.sort_keys)
else:
result = ops.Selection(expr, proj_exprs=[],
predicates=predicates)
result = ops.Selection(expr, selections=[], predicates=predicates)

return result.to_expr()

Expand Down Expand Up @@ -664,8 +654,8 @@ def _validate_projection(self, expr):

lifted_node = substitute_parents(expr).op()

is_valid = (col_table.is_ancestor(node.table) or
col_table.is_ancestor(lifted_node.table))
is_valid = (col_table.equals(node.table.op()) or
col_table.equals(lifted_node.table.op()))

return is_valid

Expand Down Expand Up @@ -784,7 +774,7 @@ def _check_fusion(self, root):

# a * projection
if (isinstance(val, ir.TableExpr) and
(self.parent.op().is_ancestor(val) or
(self.parent.op().equals(val.op()) or
# gross we share the same table root. Better way to
# detect?
len(roots) == 1 and val._root_tables()[0] is roots[0])):
Expand Down Expand Up @@ -940,7 +930,7 @@ def validate(self, expr):
if isinstance(arg, ir.ScalarExpr):
# arg_valid = True
pass
elif isinstance(arg, ops.TopKExpr):
elif isinstance(arg, ir.TopKExpr):
# TopK not subjected to further analysis for now
roots_valid.append(True)
elif isinstance(arg, (ir.ColumnExpr, ir.AnalyticExpr)):
Expand Down Expand Up @@ -1076,3 +1066,63 @@ def predicate(expr):
return lin.halt, expr

return list(lin.traverse(predicate, expr, type=ir.BooleanColumn))


def is_analytic(expr, exclude_windows=False):
    """Check whether an expression contains an analytic or reduction op,
    optionally treating window ops as a boundary."""
    def _is_analytic(op):
if isinstance(op, (ops.Reduction, ops.AnalyticOp)):
return True
elif isinstance(op, ops.WindowOp) and exclude_windows:
return False

for arg in op.args:
if isinstance(arg, ir.Expr) and _is_analytic(arg.op()):
return True

return False

return _is_analytic(expr.op())


def is_reduction(expr):
"""Check whether an expression is a reduction or not
Aggregations yield typed scalar expressions, since the result of an
aggregation is a single value. When creating a table expression
containing a GROUP BY equivalent, we need to be able to easily check
that we are looking at the result of an aggregation.
As an example, the expression we are looking at might be something
like: foo.sum().log10() + bar.sum().log10()
We examine the operator DAG in the expression to determine if there
are aggregations present.
A bound aggregation referencing a separate table is a "false
aggregation" in a GROUP BY-type expression and should be treated a
literal, and must be computed as a separate query and stored in a
temporary variable (or joined, for bound aggregations with keys)
Parameters
----------
expr : ir.Expr
Returns
-------
check output : bool
"""
def has_reduction(op):
if getattr(op, '_reduction', False):
return True

for arg in op.args:
if isinstance(arg, ir.ScalarExpr) and has_reduction(arg.op()):
return True

return False

return has_reduction(expr.op() if isinstance(expr, ir.Expr) else expr)


def is_scalar_reduction(expr):
return isinstance(expr, ir.ScalarExpr) and is_reduction(expr)
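
A minimal sketch of how the reduction predicates defined above behave; the unbound table here is invented purely for illustration.

    import ibis
    import ibis.expr.analysis as an

    t = ibis.table([('foo', 'double'), ('bar', 'double')], name='t')

    an.is_reduction(t.foo.sum())                                # True: a plain aggregation
    an.is_reduction(t.foo.sum().log10() + t.bar.sum().log10())  # True: reductions inside the DAG
    an.is_scalar_reduction(t.foo.sum())                         # True: scalar-shaped reduction
    an.is_scalar_reduction(t.foo - t.foo.mean())                # False: column-shaped expression
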
94 changes: 33 additions & 61 deletions ibis/expr/analytics.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,106 +13,78 @@
# limitations under the License.


import ibis.expr.rules as rlz
import ibis.expr.datatypes as dt
import ibis.expr.types as ir
import ibis.expr.rules as rules
import ibis.expr.operations as ops

from ibis.expr.signature import Argument as Arg

def _validate_closed(closed):
closed = closed.lower()
if closed not in {'left', 'right'}:
raise ValueError("closed must be 'left' or 'right'")
return closed


class BucketLike(ir.ValueOp):
class BucketLike(ops.ValueOp):

@property
def nbuckets(self):
return None

def output_type(self):
ctype = dt.Category(self.nbuckets)
return ctype.array_type()
dtype = dt.Category(self.nbuckets)
return dtype.array_type()


class Bucket(BucketLike):

def __init__(self, arg, buckets, closed='left', close_extreme=True,
include_under=False, include_over=False):
self.arg = arg
self.buckets = buckets
self.closed = _validate_closed(closed)

self.close_extreme = bool(close_extreme)
self.include_over = bool(include_over)
self.include_under = bool(include_under)

if not len(buckets):
arg = Arg(rlz.noop)
buckets = Arg(rlz.noop)
closed = Arg(rlz.isin({'left', 'right'}), default='left')
close_extreme = Arg(bool, default=True)
include_under = Arg(bool, default=False)
include_over = Arg(bool, default=False)

def _validate(self):
if not len(self.buckets):
raise ValueError('Must be at least one bucket edge')
elif len(buckets) == 1:
elif len(self.buckets) == 1:
if not self.include_under or not self.include_over:
raise ValueError(
'If one bucket edge provided, must have '
'include_under=True and include_over=True'
)

super(Bucket, self).__init__(
arg, buckets, self.closed,
self.close_extreme, self.include_under, self.include_over
)

@property
def nbuckets(self):
return len(self.buckets) - 1 + self.include_over + self.include_under


class Histogram(BucketLike):

def __init__(
self, arg, nbins, binwidth, base, closed='left', aux_hash=None
):
self.arg = arg
self.nbins = nbins
self.binwidth = binwidth
self.base = base

arg = Arg(rlz.noop)
nbins = Arg(rlz.noop, default=None)
binwidth = Arg(rlz.noop, default=None)
base = Arg(rlz.noop, default=None)
closed = Arg(rlz.isin({'left', 'right'}), default='left')
aux_hash = Arg(rlz.noop, default=None)

def _validate(self):
if self.nbins is None:
if self.binwidth is None:
raise ValueError('Must indicate nbins or binwidth')
elif self.binwidth is not None:
raise ValueError('nbins and binwidth are mutually exclusive')

self.closed = _validate_closed(closed)
self.aux_hash = aux_hash

super(Histogram, self).__init__(
arg, nbins, binwidth, base, self.closed, aux_hash
)

def output_type(self):
# always undefined cardinality (for now)
ctype = dt.Category()
return ctype.array_type()

return dt.category.array_type()

class CategoryLabel(ir.ValueOp):

def __init__(self, arg, labels, nulls):
self.arg = ops.as_value_expr(arg)
self.labels = labels
class CategoryLabel(ops.ValueOp):
arg = Arg(rlz.category)
labels = Arg(rlz.noop)
nulls = Arg(rlz.noop, default=None)
output_type = rlz.shape_like('arg', dt.string)

card = self.arg.type().cardinality
if len(labels) != card:
def _validate(self):
cardinality = self.arg.type().cardinality
if len(self.labels) != cardinality:
raise ValueError('Number of labels must match number of '
'categories: %d' % card)

self.nulls = nulls
super(CategoryLabel, self).__init__(self.arg, labels, nulls)

def output_type(self):
return rules.shape_like(self.arg, 'string')
'categories: {}'.format(cardinality))


def bucket(arg, buckets, closed='left', close_extreme=True,
Expand Down
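
A rough sketch of the user-facing bucketing API these operations back. The table, column, edges, and labels are made up, and label is assumed to be the category-labelling helper exposed on the resulting column.

    import ibis

    t = ibis.table([('price', 'double')], name='purchases')

    # five edges -> four buckets; the result is a category-typed column
    binned = t.price.bucket([0, 10, 25, 50, 100])
    labelled = binned.label(['cheap', 'low', 'mid', 'high'])
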
972 changes: 574 additions & 398 deletions ibis/expr/api.py

Large diffs are not rendered by default.

273 changes: 189 additions & 84 deletions ibis/expr/datatypes.py

Large diffs are not rendered by default.

17 changes: 10 additions & 7 deletions ibis/expr/format.py
Original file line number Diff line number Diff line change
Expand Up @@ -109,13 +109,13 @@ def get_result(self):
text = self._format_node(self.expr)
elif isinstance(what, ops.TableColumn):
text = self._format_column(self.expr)
elif isinstance(what, ir.Literal):
elif isinstance(what, ops.Literal):
text = 'Literal[{}]\n {}'.format(
self._get_type_display(), str(what.value)
)
elif isinstance(what, ir.ScalarParameter):
elif isinstance(what, ops.ScalarParameter):
text = 'ScalarParameter[{}]'.format(self._get_type_display())
elif isinstance(what, ir.Node):
elif isinstance(what, ops.Node):
text = self._format_node(self.expr)

if isinstance(self.expr, ir.ValueExpr) and self.expr._name is not None:
Expand Down Expand Up @@ -157,7 +157,7 @@ def _memoize_tables(self):

if isinstance(op, ops.PhysicalTable):
memo.observe(e, self._format_table)
elif isinstance(op, ir.Node):
elif isinstance(op, ops.Node):
stack.extend(
arg for arg in reversed(op.args)
if isinstance(arg, ir.Expr)
Expand Down Expand Up @@ -214,20 +214,23 @@ def visit(what, extra_indents=0):

formatted_args.append(result)

arg_names = op._arg_names
arg_names = getattr(op, 'display_argnames', op.argnames)

if not arg_names:
for arg in op.args:
if isinstance(arg, list):
if util.is_iterable(arg):
for x in arg:
visit(x)
else:
visit(arg)
else:
for arg, name in zip(op.args, arg_names):
if name == 'arg' and isinstance(op, ops.ValueOp):
# don't display first argument's name in repr
name = None
if name is not None:
name = self._indent('{0}:'.format(name))
if isinstance(arg, list):
if util.is_iterable(arg):
if name is not None and len(arg) > 0:
formatted_args.append(name)
indents = 1
Expand Down
2,301 changes: 1,168 additions & 1,133 deletions ibis/expr/operations.py

Large diffs are not rendered by default.

1,079 changes: 273 additions & 806 deletions ibis/expr/rules.py

Large diffs are not rendered by default.

27 changes: 15 additions & 12 deletions ibis/expr/schema.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
import collections
from multipledispatch import Dispatcher

import ibis.common as com
Expand Down Expand Up @@ -101,6 +102,12 @@ def equals(self, other, cache=None):
def __eq__(self, other):
return self.equals(other)

def __gt__(self, other):
return set(self.items()) > set(other.items())

def __ge__(self, other):
return set(self.items()) >= set(other.items())

def append(self, schema):
return Schema(self.names + schema.names, self.types + schema.types)

Expand Down Expand Up @@ -130,14 +137,6 @@ class HasSchema(object):
concrete dataset or database table.
"""

def __init__(self, schema, name=None):
if not isinstance(schema, Schema):
raise TypeError(
'schema argument to HasSchema class must be a Schema instance'
)
self.schema = schema
self.name = name

def __repr__(self):
return '{}({})'.format(type(self).__name__, repr(self.schema))

Expand All @@ -152,6 +151,10 @@ def equals(self, other, cache=None):
def root_tables(self):
return [self]

@property
def schema(self):
raise NotImplementedError


schema = Dispatcher('schema')
infer = Dispatcher('infer')
Expand All @@ -162,16 +165,16 @@ def identity(s):
return s


@schema.register(dict)
def schema_from_dict(d):
@schema.register(collections.Mapping)
def schema_from_mapping(d):
return Schema.from_dict(d)


@schema.register((tuple, list))
@schema.register(collections.Iterable)
def schema_from_pairs(lst):
return Schema.from_tuples(lst)


@schema.register((tuple, list), (tuple, list))
@schema.register(collections.Iterable, collections.Iterable)
def schema_from_names_types(names, types):
return Schema(names, types)
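
A small sketch of the construction paths registered above and the new set-style comparisons. The column names are invented, and string type names are assumed to be coerced to datatypes by the Schema constructor.

    from ibis.expr.schema import schema

    s1 = schema([('a', 'int64'), ('b', 'string')])   # any iterable of pairs
    s2 = schema({'a': 'int64'})                      # any mapping

    s1 >= s2   # True: every (name, type) pair of s2 is also in s1
    s2 > s1    # False
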
205 changes: 205 additions & 0 deletions ibis/expr/signature.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,205 @@
import six
import itertools

import ibis.util as util
import ibis.expr.rules as rlz

from ibis.compat import PY2
from collections import OrderedDict

try:
from cytoolz import unique
except ImportError:
from toolz import unique


_undefined = object() # marker for missing argument


class Argument(object):
"""Argument definition
"""
if PY2:
# required to maintain definition order in Annotated metaclass
_counter = itertools.count()
__slots__ = '_serial', 'validator', 'default'
else:
__slots__ = 'validator', 'default'

def __init__(self, validator, default=_undefined):
"""Argument constructor
Parameters
----------
validator : Union[Callable[[arg], coerced], Type, Tuple[Type]]
Function which handles validation and/or coercion of the given
argument.
default : Union[Any, Callable[[], str]]
Value to use when the argument is missing or given as None.
Note that the default value (except for None) must also pass the
inner validator.
If a callable is passed, it is executed just before the inner
validator and its return value is treated as the default.
"""
if PY2:
self._serial = next(self._counter)

self.default = default
if isinstance(validator, type):
self.validator = rlz.instance_of(validator)
elif isinstance(validator, tuple):
assert util.all_of(validator, type)
self.validator = rlz.instance_of(validator)
elif callable(validator):
self.validator = validator
else:
raise TypeError('Argument validator must be a callable, type or '
'tuple of types, given: {}'.format(validator))

def __eq__(self, other):
return (
self.validator == other.validator and
self.default == other.default
)

@property
def optional(self):
return self.default is not _undefined

def validate(self, value=_undefined, name=None):
"""
Parameters
----------
value : Any, default undefined
Raises TypeError if the argument is mandatory but no value has
been given.
name : Optional[str]
Argument name for error message
"""
if self.optional:
if value is _undefined or value is None:
if self.default is None:
return None
elif util.is_function(self.default):
value = self.default()
else:
value = self.default
elif value is _undefined:
if name is not None:
name = ' `{}`'.format(name)
raise TypeError('Missing required value for argument' + name)

return self.validator(value)

__call__ = validate # syntactic sugar


class TypeSignature(OrderedDict):

__slots__ = ()

@classmethod
def from_dtypes(cls, dtypes):
return cls(('_{}'.format(i), Argument(rlz.value(dtype)))
for i, dtype in enumerate(dtypes))

def validate(self, *args, **kwargs):
result = []
for i, (name, argument) in enumerate(self.items()):
if i < len(args):
if name in kwargs:
raise TypeError(
'Got multiple values for argument {}'.format(name)
)
value = argument.validate(args[i], name=name)
elif name in kwargs:
value = argument.validate(kwargs[name], name=name)
else:
value = argument.validate(name=name)

result.append((name, value))

return result

__call__ = validate # syntactic sugar

def names(self):
return tuple(self.keys())


class AnnotableMeta(type):

if PY2:
@staticmethod
def _precedes(arg1, arg2):
"""Comparator helper for sorting name-argument pairs"""
return cmp(arg1[1]._serial, arg2[1]._serial) # noqa: F821
else:
@classmethod
def __prepare__(metacls, name, bases, **kwds):
return OrderedDict()

def __new__(meta, name, bases, dct):
slots, signature = [], TypeSignature()

for parent in bases:
# inherit parent slots
if hasattr(parent, '__slots__'):
slots += parent.__slots__
# inherit from parent signatures
if hasattr(parent, 'signature'):
signature.update(parent.signature)

# finally apply definitions from the currently created class
if PY2:
# on python 2 we cannot maintain definition order
attribs, arguments = {}, []
for k, v in dct.items():
if isinstance(v, Argument):
arguments.append((k, v))
else:
attribs[k] = v

# so we need to sort arguments based on their unique counter
signature.update(sorted(arguments, cmp=meta._precedes))
else:
# thanks to __prepare__ attrs are already ordered
attribs = {}
for k, v in dct.items():
if isinstance(v, Argument):
# so we can set directly
signature[k] = v
else:
attribs[k] = v

# if slots or signature are defined no inheritance happens
signature = attribs.get('signature', signature)
slots = attribs.get('__slots__', tuple(slots)) + signature.names()

attribs['signature'] = signature
attribs['__slots__'] = tuple(unique(slots))

return super(AnnotableMeta, meta).__new__(meta, name, bases, attribs)


@six.add_metaclass(AnnotableMeta)
class Annotable(object):

__slots__ = ()

def __init__(self, *args, **kwargs):
for name, value in self.signature.validate(*args, **kwargs):
setattr(self, name, value)
self._validate()

def _validate(self):
pass

@property
def args(self):
return tuple(getattr(self, name) for name in self.signature.names())

@property
def argnames(self):
return self.signature.names()
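
A hedged sketch of how a Node subclass is expected to declare its arguments with this machinery, mirroring the pattern used in analytics.py above. The Clamp operation is invented for illustration, and rlz.numeric is assumed to be available in the refactored rules module.

    import ibis
    import ibis.expr.datatypes as dt
    import ibis.expr.rules as rlz
    import ibis.expr.operations as ops
    from ibis.expr.signature import Argument as Arg


    class Clamp(ops.ValueOp):
        arg = Arg(rlz.numeric)                   # required, validated by the rule
        lower = Arg(rlz.numeric, default=None)   # optional arguments carry defaults
        upper = Arg(rlz.numeric, default=None)
        output_type = rlz.shape_like('arg', dt.double)


    # positional and keyword forms are both validated against the signature
    op = Clamp(ibis.literal(2.5), upper=ibis.literal(10.0))
    op.argnames   # ('arg', 'lower', 'upper')
    op.args       # validated values in declaration order
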
4 changes: 1 addition & 3 deletions ibis/expr/tests/mocks.py
Original file line number Diff line number Diff line change
Expand Up @@ -356,9 +356,7 @@ def _build_ast(self, expr, context):
from ibis.impala.compiler import build_ast
return build_ast(expr, context)

def execute(self, expr, limit=None, async=False, params=None):
if async:
raise NotImplementedError
def execute(self, expr, limit=None, params=None):
ast = self._build_ast_ensure_limit(expr, limit, params=params)
for query in ast.queries:
self.executed_queries.append(query.compile())
Expand Down
5 changes: 4 additions & 1 deletion ibis/expr/tests/test_analysis.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,9 @@
from ibis.tests.util import assert_equal


# TODO: test is_reduction
# TODO: test is_scalar_reduction

# Place to collect esoteric expression analysis bugs and tests


Expand Down Expand Up @@ -329,4 +332,4 @@ def test_is_ancestor_analytic():
with_analytic = subquery[subquery.columns +
[subquery.count().name('analytic')]]

assert not subquery.op().is_ancestor(with_analytic)
assert not subquery.op().equals(with_analytic.op())
41 changes: 23 additions & 18 deletions ibis/expr/tests/test_case.py
Original file line number Diff line number Diff line change
@@ -1,19 +1,6 @@
# Copyright 2014 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import pytest

import ibis.expr.datatypes as dt
import ibis.expr.types as ir
import ibis.expr.operations as ops
import ibis
Expand Down Expand Up @@ -50,7 +37,7 @@ def test_simple_case_expr(table):
.end())

assert_equal(expr1, expr2)
assert isinstance(expr1, ir.Int32Column)
assert isinstance(expr1, ir.IntegerColumn)


def test_multiple_case_expr(table):
Expand All @@ -72,7 +59,7 @@ def test_multiple_case_expr(table):
.end())

op = expr.op()
assert isinstance(expr, ir.DoubleColumn)
assert isinstance(expr, ir.FloatingColumn)
assert isinstance(op, ops.SearchedCase)
assert op.default is default

Expand All @@ -93,7 +80,7 @@ def test_simple_case_null_else(table):

assert isinstance(expr, ir.StringColumn)
assert isinstance(op.default, ir.ValueExpr)
assert isinstance(op.default.op(), ir.NullLiteral)
assert isinstance(op.default.op(), ops.NullLiteral)


def test_multiple_case_null_else(table):
Expand All @@ -102,7 +89,7 @@ def test_multiple_case_null_else(table):

assert isinstance(expr, ir.StringColumn)
assert isinstance(op.default, ir.ValueExpr)
assert isinstance(op.default.op(), ir.NullLiteral)
assert isinstance(op.default.op(), ops.NullLiteral)


@pytest.mark.xfail(raises=AssertionError, reason='NYT')
Expand All @@ -113,3 +100,21 @@ def test_case_type_precedence():
@pytest.mark.xfail(raises=AssertionError, reason='NYT')
def test_no_implicit_cast_possible():
assert False


def test_case_mixed_type():
t0 = ibis.table(
[('one', 'string'),
('two', 'double'),
('three', 'int32')], name='my_data')

expr = (
t0.three
.case()
.when(0, 'low')
.when(1, 'high')
.else_('null')
.end()
.name('label'))
result = t0[expr]
assert result['label'].type().equals(dt.string)
88 changes: 41 additions & 47 deletions ibis/expr/tests/test_datatypes.py
Original file line number Diff line number Diff line change
@@ -1,29 +1,31 @@
import pytest
import datetime
import pytz
from collections import OrderedDict
from multipledispatch.conflict import ambiguities

import ibis

import ibis.expr.api as api
import ibis.expr.types as types
import ibis.expr.rules as rules

from ibis import IbisError
from ibis.expr import datatypes as dt
from ibis.expr.rules import highest_precedence_type
import ibis.expr.datatypes as dt
from ibis.common import IbisTypeError


def test_validate_type():
assert dt.validate_type is dt.dtype


def test_array():
assert dt.dtype('ARRAY<DOUBLE>') == dt.Array(dt.double)


def test_nested_array():
assert dt.dtype('array<array<string>>') == dt.Array(dt.Array(dt.string))
@pytest.mark.parametrize(('spec', 'expected'), [
('ARRAY<DOUBLE>', dt.Array(dt.double)),
('array<array<string>>', dt.Array(dt.Array(dt.string))),
('map<string, double>', dt.Map(dt.string, dt.double)),
('map<int64, array<map<string, int8>>>',
dt.Map(dt.int64, dt.Array(dt.Map(dt.string, dt.int8)))),
('set<uint8>', dt.Set(dt.uint8)),
([dt.uint8], dt.Array(dt.uint8)),
([dt.float32, dt.float64], dt.Array(dt.float64)),
({dt.string}, dt.Set(dt.string))
])
def test_dtype(spec, expected):
assert dt.dtype(spec) == expected


def test_array_with_string_value_type():
Expand All @@ -33,33 +35,24 @@ def test_array_with_string_value_type():
)


def test_map():
assert dt.dtype('map<string, double>') == dt.Map(dt.string, dt.double)


def test_nested_map():
expected = dt.Map(dt.int64, dt.Array(dt.Map(dt.string, dt.int8)))
assert dt.dtype('map<int64, array<map<string, int8>>>') == expected


def test_map_with_string_value_type():
assert dt.Map('int32', 'double') == dt.Map(dt.int32, dt.double)
assert dt.Map('int32', 'array<double>') == \
dt.Map(dt.int32, dt.Array(dt.double))


def test_map_does_not_allow_non_primitive_keys():
with pytest.raises(SyntaxError):
with pytest.raises(IbisTypeError):
dt.dtype('map<array<string>, double>')


def test_token_error():
with pytest.raises(SyntaxError):
with pytest.raises(IbisTypeError):
dt.dtype('array<string>>')


def test_empty_complex_type():
with pytest.raises(SyntaxError):
with pytest.raises(IbisTypeError):
dt.dtype('map<>')


Expand Down Expand Up @@ -126,7 +119,7 @@ def test_struct_with_string_types():
'decimal(3,',
])
def test_decimal_failure(case):
with pytest.raises(SyntaxError):
with pytest.raises(IbisTypeError):
dt.dtype(case)


Expand All @@ -149,7 +142,7 @@ def test_char_varchar(spec):
'char()'
])
def test_char_varchar_invalid(spec):
with pytest.raises(SyntaxError):
with pytest.raises(IbisTypeError):
dt.dtype(spec)


Expand Down Expand Up @@ -182,28 +175,25 @@ def test_primitive(spec, expected):
assert dt.dtype(spec) == expected


def test_precedence_with_no_arguments():
with pytest.raises(ValueError) as e:
highest_precedence_type([])
assert str(e.value) == 'Must pass at least one expression'


def test_rule_instance_of():
class MyOperation(types.Node):
input_type = [rules.instance_of(types.IntegerValue)]

MyOperation([api.literal(5)])

with pytest.raises(IbisError):
MyOperation([api.literal('string')])


def test_literal_mixed_type_fails():
data = [1, 'a']
with pytest.raises(TypeError):
ibis.literal(data)


def test_timestamp_literal_without_tz():
now_raw = datetime.datetime.utcnow()
assert now_raw.tzinfo is None
assert ibis.literal(now_raw).type().timezone is None


def test_timestamp_literal_with_tz():
now_raw = datetime.datetime.utcnow()
now_utc = pytz.utc.localize(now_raw)
assert now_utc.tzinfo == pytz.UTC
assert ibis.literal(now_utc).type().timezone == str(pytz.UTC)


def test_array_type_not_equals():
left = dt.Array(dt.string)
right = dt.Array(dt.int32)
Expand Down Expand Up @@ -240,8 +230,8 @@ def test_timestamp_with_timezone_parser_invalid_timezone():


@pytest.mark.parametrize('unit', [
'Y', 'Q', 'M', 'W', 'D', # date units
'h', 'm', 's', 'ms', 'us', 'ns' # time units
'Y', 'Q', 'M', 'W', 'D', # date units
'h', 'm', 's', 'ms', 'us', 'ns' # time units
])
def test_interval(unit):
definition = "interval('{}')".format(unit)
Expand Down Expand Up @@ -284,7 +274,7 @@ def test_interval_unvalid_unit(unit):
'interval("Y\')',
])
def test_string_argument_parsing_failure_mode(case):
with pytest.raises(SyntaxError):
with pytest.raises(IbisTypeError):
dt.dtype(case)


Expand Down Expand Up @@ -340,6 +330,9 @@ def test_time_valid():
# parametric types
(list('abc'), dt.Array(dt.string)),
(set('abc'), dt.Set(dt.string)),
({1, 5, 5, 6}, dt.Set(dt.int8)),
(frozenset(list('abc')), dt.Set(dt.string)),
([1, 2, 3], dt.Array(dt.int8)),
([1, 128], dt.Array(dt.int16)),
([1, 128, 32768], dt.Array(dt.int32)),
Expand Down Expand Up @@ -390,6 +383,7 @@ def test_implicit_castable(source, target):
@pytest.mark.parametrize(('source', 'target'), [
(dt.string, dt.null),
(dt.int32, dt.int16),
(dt.int16, dt.uint64),
(dt.Decimal(12, 2), dt.int32),
(dt.timestamp, dt.boolean),
(dt.boolean, dt.interval),
Expand Down