75 changes: 44 additions & 31 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
@@ -3,61 +3,74 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Ibis: Python Data Analysis Framework
====================================
Ibis: Python Data Analysis Productivity Framework
=================================================

Ibis is a productivity-centric Python data analysis framework, designed to be
an ideal companion for SQL engines and distributed storage systems like
Hadoop. Ibis is being jointly developed with `Impala <http://impala.io>`_ to
deliver a complete 100% Python user experience on data of any size (small,
medium, or big).

At this time, Ibis supports the following SQL-based systems:

- Impala (on HDFS)
- SQLite

Coming from SQL? Check out :ref:`Ibis for SQL Programmers <sql>`.
Ibis is a toolbox that bridges the gap between local Python environments (such
as pandas and scikit-learn) and remote storage and execution systems: Hadoop
components (HDFS, Impala, Hive, Spark) and SQL databases (PostgreSQL, etc.).
Its goal is to simplify analytical workflows and make you more productive.

We have a handful of specific priority focus areas:

- Enable data analysts to translate analytics running on SQL engines to
Python instead of writing SQL
- Enable data analysts to translate local, single-node data idioms to scalable
computation representations (e.g. SQL or Spark)
- Integration with pandas and other Python data ecosystem components
- Provide high level analytics APIs and workflow tools to enhance productivity
and streamline common or tedious tasks.
- Provide high-performance extensions for the Impala MPP query engine so that
Python code can operate in a scalable Hadoop-like environment
- Abstract away database-specific SQL differences
- Integration with community standard data formats (e.g. Parquet and Avro)
- Integrate with the Python data ecosystem using the above tools
- Abstract away database-specific SQL differences

As the `Apache Arrow <http://arrow.apache.org/>`_ project develops, we will
look to use Arrow to enable computational code written in Python to be executed
natively within other systems like Apache Spark and Apache Impala (incubating).

To learn more about Ibis's vision, roadmap, and updates, please follow
http://ibis-project.org.

Source code is on GitHub: http://github.com/cloudera/ibis

Install Ibis from PyPI with:

::

pip install ibis-framework

Or from `conda-forge <http://conda-forge.github.io>`_ with

::

conda install ibis-framework -c conda-forge

At this time, Ibis offers some level of support for the following systems:

- `Apache Impala (incubating) <http://impala.io/>`_
- `Apache Kudu (incubating) <http://getkudu.io>`_
- Hadoop Distributed File System (HDFS)
- PostgreSQL (Experimental)
- SQLite

Coming from SQL? Check out :ref:`Ibis for SQL Programmers <sql>`.

Architecturally, Ibis features:

- A pandas-like domain specific language (DSL) designed specifically for
analytics, aka **Ibis expressions**, that enable composable, reusable
analytics on structured data. If you can express something with a SQL SELECT
query, you can write it with Ibis.
- Integrated user interfaces to HDFS and other storage systems.
- An extensible translator-compiler system that targets multiple SQL systems
- Tools for wrapping user-defined functions in Impala and eventually other SQL
engines
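The expression-to-SQL pipeline described above can be pictured with a toy sketch. All class names and the output format here are invented for illustration; Ibis's real DSL is far richer.

```python
# Toy sketch of a composable expression DSL compiling to a SELECT.
# All names here are invented for illustration.

class Column(object):
    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        # comparisons build SQL fragments instead of evaluating eagerly
        return '{0} > {1}'.format(self.name, value)


class Table(object):
    def __init__(self, name, columns):
        self.name = name
        self.columns = columns
        self._filters = []

    def filter(self, predicate):
        self._filters.append(predicate)
        return self  # chainable, in the spirit of Ibis expressions

    def compile(self):
        sql = 'SELECT {0}\nFROM {1}'.format(', '.join(self.columns), self.name)
        if self._filters:
            sql += '\nWHERE {0}'.format(' AND '.join(self._filters))
        return sql
```

With this sketch, `Table('t', ['id', 'x']).filter(Column('x') > 0).compile()` produces a complete `SELECT` with a `WHERE` clause, the same compositional shape an Ibis expression has before the translator-compiler renders it.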

SQL engine support near on the horizon:
SQL engine support needing code contributors:

- PostgreSQL
- Redshift
- Vertica
- Spark SQL
- Presto
- Hive
- MySQL / MariaDB

See the project blog http://blog.ibis-project.org for more frequent updates.

To learn more about Ibis's vision and roadmap, please visit
http://ibis-project.org.

Source code is on GitHub: http://github.com/cloudera/ibis

Since this is a young project, the documentation is patchy in places, but it
will improve as development progresses.

26 changes: 26 additions & 0 deletions docs/source/release.rst
@@ -7,6 +7,32 @@ Release Notes
interesting. Point (minor, e.g. 0.5.1) releases will generally not be found
here and contain only bug fixes.

0.8 (May 19, 2016)
------------------

This release brings initial PostgreSQL backend support along with a number of
critical bug fixes and usability improvements. As several correctness bugs with
the SQL compiler were fixed, we recommend that all users upgrade from earlier
versions of Ibis.

New features
~~~~~~~~~~~~
* Initial PostgreSQL backend contributed by Philip Cloud.
* Add ``groupby`` as an alias for ``group_by`` to table expressions

Bug fixes
~~~~~~~~~
* Fix an expression error when filtering based on a new field
* Fix Impala's SQL compilation of using ``OR`` with compound filters
* Various fixes with the ``having(...)`` function in grouped table expressions
* Fix CTE (``WITH``) extraction inside ``UNION ALL`` expressions.
* Fix ``ImportError`` on Python 2 when ``mock`` library not installed

API changes
~~~~~~~~~~~
* The deprecated ``ibis.impala_connect`` and ``ibis.make_client`` APIs have
been removed
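The `groupby` alias added in this release needs nothing more than a class-attribute assignment; a minimal stand-in (not the real `TableExpr`) shows the pattern:

```python
class TableExpr(object):
    """Minimal stand-in for an Ibis table expression."""

    def group_by(self, by):
        return ('grouped', by)

    # the same method under a second, pandas-friendly name
    groupby = group_by
```

Both names are the very same function object, so the alias can never drift out of sync with `group_by`.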

0.7 (March 16, 2016)
--------------------

37 changes: 1 addition & 36 deletions ibis/__init__.py
@@ -28,48 +28,13 @@

import ibis.impala.api as impala
import ibis.sql.sqlite.api as sqlite
import ibis.sql.postgres.api as postgres

import ibis.config_init
from ibis.config import options
import ibis.util as util


# Deprecated
impala_connect = util.deprecate(impala.connect,
'impala_connect is deprecated, use'
' ibis.impala.connect instead')


def make_client(db, hdfs_client=None):
"""
Create an Ibis client from a database connection and optional additional
connections (like HDFS)
Parameters
----------
db : Connection
e.g. produced by ibis.impala.connect
hdfs_client : ibis HDFS client
Examples
--------
>>> con = ibis.impala.connect(**impala_params)
>>> hdfs = ibis.hdfs_connect(**hdfs_params)
>>> client = ibis.make_client(con, hdfs_client=hdfs)
Returns
-------
client : IbisClient
"""
db._hdfs = hdfs_client
return db

make_client = util.deprecate(
make_client, ('make_client is deprecated. '
'Use ibis.impala.connect '
' with hdfs_client=hdfs_client'))


def hdfs_connect(host='localhost', port=50070, protocol='webhdfs',
use_https='default', auth_mechanism='NOSASL',
verify=True, **kwds):
5 changes: 4 additions & 1 deletion ibis/compat.py
@@ -63,7 +63,10 @@ def dict_values(x):
def dict_values(x):
return x.values()

import mock
try:
import mock # mock is an optional dependency
except ImportError:
pass
range = xrange

integer_types = six.integer_types + (np.integer,)
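The optional-import pattern used in `compat.py` above generalizes to any soft dependency; a hedged sketch with an invented helper name:

```python
import importlib


def optional_import(name):
    """Return the named module if importable, else None (no hard failure)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# stdlib module: present; nonsense name: quietly absent
json = optional_import('json')
missing = optional_import('no_such_module_xyz')
```

Callers then check for `None` (or catch an error at the point of use) rather than failing at import time, which is exactly what the `mock` change above achieves.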
110 changes: 36 additions & 74 deletions ibis/expr/analysis.py
@@ -12,96 +12,57 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from ibis.common import RelationError, ExpressionError
from ibis.common import RelationError, ExpressionError, IbisTypeError
from ibis.expr.datatypes import HasSchema
from ibis.expr.window import window
import ibis.expr.types as ir
import ibis.expr.operations as ops
import ibis.util as util
import toolz

# ---------------------------------------------------------------------
# Some expression metaprogramming / graph transformations to support
# compilation later


def sub_for(expr, substitutions):
helper = _Substitutor(expr, substitutions)
return helper.get_result()
mapping = dict((repr(k.op()), v) for k, v in substitutions)
return _subs(expr, mapping)


class _Substitutor(object):

def __init__(self, expr, substitutions, sub_memo=None):
self.expr = expr

self.substitutions = substitutions

self._id_to_expr = {}
for k, v in substitutions:
self._id_to_expr[self._key(k)] = v

self.sub_memo = sub_memo or {}
self.unchanged = True

def get_result(self):
expr = self.expr
node = expr.op()

if node.blocks():
return expr

subbed_args = []
for arg in node.args:
if isinstance(arg, (tuple, list)):
subbed_arg = [self._sub_arg(x) for x in arg]
else:
subbed_arg = self._sub_arg(arg)
subbed_args.append(subbed_arg)

# Do not modify unnecessarily
if self.unchanged:
return expr

subbed_node = type(node)(*subbed_args)
if isinstance(expr, ir.ValueExpr):
result = expr._factory(subbed_node, name=expr._name)
else:
result = expr._factory(subbed_node)

return result

def _sub_arg(self, arg):
if isinstance(arg, ir.Expr):
subbed_arg = self.sub(arg)
if subbed_arg is not arg:
self.unchanged = False
else:
# a string or some other thing
subbed_arg = arg

return subbed_arg

def _key(self, expr):
def _expr_key(expr):
try:
return repr(expr.op())
except AttributeError:
return expr

def sub(self, expr):
key = self._key(expr)

if key in self.sub_memo:
return self.sub_memo[key]

if key in self._id_to_expr:
return self._id_to_expr[key]

result = self._sub(expr)

self.sub_memo[key] = result
return result
@toolz.memoize(key=lambda args, kwargs: _expr_key(args[0]))
def _subs(expr, mapping):
"""Substitute expressions with other expressions
"""
node = expr.op()
key = repr(node)
if key in mapping:
return mapping[key]
if node.blocks():
return expr

new_args = list(node.args)
unchanged = True
for i, arg in enumerate(new_args):
if isinstance(arg, ir.Expr):
new_arg = _subs(arg, mapping)
unchanged = unchanged and new_arg is arg
new_args[i] = new_arg
if unchanged:
return expr
try:
new_node = type(node)(*new_args)
except IbisTypeError:
return expr

def _sub(self, expr):
helper = _Substitutor(expr, self.substitutions,
sub_memo=self.sub_memo)
return helper.get_result()
return expr._factory(new_node, name=getattr(expr, '_name', None))

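The rewritten `_subs` above walks an expression tree, swapping whole subtrees keyed by `repr` and rebuilding a node only when a child actually changed. The same idea over plain nested tuples (a simplified stand-in for Ibis operation nodes):

```python
def subs(node, mapping, memo=None):
    """Rewrite a nested-tuple tree, swapping whole subtrees by repr key."""
    if memo is None:
        memo = {}
    key = repr(node)
    if key in mapping:          # explicit substitution wins
        return mapping[key]
    if key in memo:             # this subtree was already rewritten
        return memo[key]
    if not isinstance(node, tuple):
        return node             # leaf: nothing to do
    new_children = tuple(subs(child, mapping, memo) for child in node)
    # do not build a new node when nothing underneath changed
    result = node if new_children == node else new_children
    memo[key] = result
    return result
```

For example, substituting `('col', 'a')` with `('col', 'c')` inside `('add', ('col', 'a'), ('col', 'b'))` rewrites only the matching branch and leaves the rest shared.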

class ScalarAggregate(object):
@@ -474,8 +435,9 @@ def apply_filter(expr, predicates):
return _filter_selection(expr, predicates)
elif isinstance(op, ops.Aggregation):
# Potential fusion opportunity
simplified_predicates = [sub_for(x, [(expr, op.table)])
for x in predicates]
simplified_predicates = [
sub_for(predicate, [(expr, op.table)]) for predicate in predicates
]

if op.table._is_valid(simplified_predicates):
result = ops.Aggregation(
2 changes: 1 addition & 1 deletion ibis/expr/api.py
@@ -1522,7 +1522,7 @@ def _string_contains(arg, substr):
-------
contains : boolean
"""
return arg.like('%{0}%'.format(substr))
return arg.find(substr) >= 0


def _string_dunder_contains(arg, substr):
2 changes: 1 addition & 1 deletion ibis/expr/datatypes.py
@@ -355,7 +355,7 @@ def __repr__(self):
return ('category(K=%s)' % card)

def __hash__(self):
return hash((self.cardinality))
return hash(self.cardinality)

def __eq__(self, other):
if not isinstance(other, Category):
14 changes: 11 additions & 3 deletions ibis/expr/groupby.py
@@ -20,6 +20,9 @@
import ibis.expr.window as _window
import ibis.util as util

import six
import toolz


def _resolve_exprs(table, exprs):
exprs = util.promote_list(exprs)
@@ -32,9 +35,14 @@ class GroupedTableExpr(object):
Helper intermediate construct
"""

def __init__(self, table, by, having=None, order_by=None, window=None):
def __init__(
self, table, by, having=None, order_by=None, window=None, **expressions
):
self.table = table
self.by = by
self.by = util.promote_list(by if by is not None else []) + [
(table[v] if isinstance(v, six.string_types) else v).name(k)
for k, v in sorted(expressions.items(), key=toolz.first)
]
self._order_by = order_by or []
self._having = having or []
self._window = window
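The new keyword-argument grouping in `GroupedTableExpr.__init__` normalizes positional keys plus sorted `name=column` pairs into one list. A sketch with strings standing in for column expressions (the function name is invented):

```python
def normalize_grouping(by, **expressions):
    """Combine positional grouping keys with name=column keyword keys.

    Keyword groupings are sorted by name so the output is deterministic,
    mirroring the sorted() call in GroupedTableExpr above.
    """
    keys = list(by) if by is not None else []
    keys.extend('{0} AS {1}'.format(column, name)
                for name, column in sorted(expressions.items()))
    return keys
```

Sorting the keyword pairs matters: `**kwargs` carries no reliable order on older Pythons, and grouping-key order changes the emitted SQL.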
@@ -187,7 +195,7 @@ def count(self, metric_name='count'):
The aggregated table
"""
metric = self.table.count().name(metric_name)
return self.table.aggregate([metric], by=self.by)
return self.table.aggregate([metric], by=self.by, having=self._having)

size = count

13 changes: 7 additions & 6 deletions ibis/expr/operations.py
@@ -1792,11 +1792,12 @@ def _plain_subquery(self):
def _attempt_pushdown(self):
metrics_valid, lowered_metrics = self._pushdown_exprs(self.metrics)
by_valid, lowered_by = self._pushdown_exprs(self.by)
having_valid, lowered_having = self._pushdown_exprs(self.having or None)

if metrics_valid and by_valid:
if metrics_valid and by_valid and having_valid:
return Aggregation(self.op.table, lowered_metrics,
by=lowered_by,
having=self.having,
having=lowered_having,
predicates=self.op.predicates,
sort_keys=self.op.sort_keys)
else:
@@ -1859,13 +1860,13 @@ def __init__(self, table, agg_exprs, by=None, having=None,

self.agg_exprs = self._rewrite_exprs(agg_exprs)

by = by or []
by = [] if by is None else by
self.by = self.table._resolve(by)

self.having = having or []
self.having = [] if having is None else having

self.predicates = predicates or []
sort_keys = sort_keys or []
self.predicates = [] if predicates is None else predicates
sort_keys = [] if sort_keys is None else sort_keys
self.sort_keys = [to_sort_key(self.table, k)
for k in util.promote_list(sort_keys)]

8 changes: 8 additions & 0 deletions ibis/expr/tests/test_analysis.py
@@ -282,3 +282,11 @@ def test_no_rewrite(self):
# result = L.substitute_parents(expr)
# expected = t['c'] == 2
# assert_equal(result, expected)


def test_join_table_choice():
# GH807
x = ibis.table(ibis.schema([('n', 'int64')]), 'x')
t = x.aggregate(cnt=x.n.count())
predicate = t.cnt > 0
assert L.sub_for(predicate, [(t, t.op().table)]).equals(predicate)
6 changes: 3 additions & 3 deletions ibis/expr/tests/test_sql_builtins.py
@@ -47,11 +47,11 @@ def test_group_concat(self):
expr = col.group_concat()
assert isinstance(expr.op(), ops.GroupConcat)
arg, sep = expr.op().args
assert sep == ','
assert sep.equals(ibis.literal(','))

expr = col.group_concat('|')
arg, sep = expr.op().args
assert sep == '|'
assert sep.equals(ibis.literal('|'))

def test_zeroifnull(self):
dresult = self.alltypes.double_col.zeroifnull()
@@ -117,7 +117,7 @@ def test_round(self):

result = self.alltypes.double_col.round(2)
assert isinstance(result, ir.DoubleArray)
assert result.op().args[1] == 2
assert result.op().args[1].equals(ibis.literal(2))

# Even integers are double (at least in Impala, check with other DB
# implementations)
2 changes: 1 addition & 1 deletion ibis/expr/tests/test_string.py
@@ -92,7 +92,7 @@ def test_join(self):

def test_contains(self):
expr = self.table.g.contains('foo')
expected = self.table.g.like('%foo%')
expected = self.table.g.find('foo') >= 0
assert_equal(expr, expected)

self.assertRaises(Exception, lambda: 'foo' in self.table.g)
15 changes: 15 additions & 0 deletions ibis/expr/tests/test_table.py
@@ -485,6 +485,13 @@ def test_aggregate_keywords(self):
assert_equal(expr, expected)
assert_equal(expr2, expected)

def test_groupby_alias(self):
t = self.table

result = t.groupby('g').size()
expected = t.group_by('g').size()
assert_equal(result, expected)

def test_summary_expand_list(self):
summ = self.table.f.summary()

@@ -543,6 +550,14 @@ def test_group_by_having_api(self):
expected = self.table.aggregate(metric, by='g', having=postp)
assert_equal(expr, expected)

def test_group_by_kwargs(self):
t = self.table
expr = (t.group_by(['f', t.h], z='g', z2=t.d)
.aggregate(t.d.mean().name('foo')))
expected = (t.group_by(['f', t.h, t.g.name('z'), t.d.name('z2')])
.aggregate(t.d.mean().name('foo')))
assert_equal(expr, expected)

def test_aggregate_root_table_internal(self):
pass

5 changes: 5 additions & 0 deletions ibis/expr/tests/test_value_exprs.py
@@ -34,6 +34,7 @@ def test_null(self):
expr = ibis.literal(None)
assert isinstance(expr, ir.NullScalar)
assert isinstance(expr.op(), ir.NullLiteral)
assert expr._arg.value is None

expr2 = ibis.null()
assert_equal(expr, expr2)
@@ -484,6 +485,10 @@ def test_between(self):
self.assertRaises(TypeError, self.table.f.between, 0, '1')
self.assertRaises(TypeError, self.table.f.between, '0', 1)

def test_chained_comparisons_not_allowed(self):
with self.assertRaises(ValueError):
0 < self.table.f < 1


class TestBinaryArithOps(BasicTestCase, unittest.TestCase):

34 changes: 25 additions & 9 deletions ibis/expr/types.py
@@ -58,6 +58,13 @@ def __repr__(self):
else:
return self._repr()

def __bool__(self):
raise ValueError("The truth value of an Ibis expression is not "
"defined")

def __nonzero__(self):
return self.__bool__()

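The `__bool__` override above is what makes chained comparisons like `0 < expr < 1` fail loudly instead of silently building the wrong predicate. A self-contained sketch with a toy `Expr` (not the real class):

```python
class Expr(object):
    """Toy deferred expression whose truthiness is intentionally undefined."""

    def __init__(self, sql):
        self.sql = sql

    def __gt__(self, other):
        return Expr('({0} > {1})'.format(self.sql, other))

    def __lt__(self, other):
        return Expr('({0} < {1})'.format(self.sql, other))

    def __bool__(self):
        raise ValueError('The truth value of an expression is not defined')

    __nonzero__ = __bool__  # Python 2 spelling


f = Expr('f')
# `0 < f < 1` desugars to `(0 < f) and (f < 1)`; the `and` calls
# bool() on the first comparison, which raises instead of guessing.
```

This is why the diff below also changes tests like `assert sep == ','` into structural `.equals(...)` checks: under this change, truth-testing a comparison expression raises.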
def _repr(self, memo=None):
from ibis.expr.format import ExprFormatter
return ExprFormatter(self).get_result()
@@ -67,8 +74,8 @@ def pipe(self, f, *args, **kwargs):
Generic composition function to enable expression pipelining
>>> (expr
.pipe(f, *args, **kwargs)
.pipe(g, *args2, **kwargs2))
>>> .pipe(f, *args, **kwargs)
>>> .pipe(g, *args2, **kwargs2))
is equivalent to
Expand All @@ -86,9 +93,9 @@ def pipe(self, f, *args, **kwargs):
Examples
--------
>>> def foo(data, a=None, b=None):
pass
... pass
>>> def bar(a, b, data=None):
pass
... pass
>>> expr.pipe(foo, a=5, b=10)
>>> expr.pipe((bar, 'data'), 1, 2)
@@ -321,7 +328,9 @@ def __init__(self, name, table_expr):
Node.__init__(self, [name, table_expr])

if name not in table_expr.schema():
raise KeyError("'{0}' is not a field".format(name))
raise com.IbisTypeError(
"'{0}' is not a field in {1}".format(name, table_expr.columns)
)

self.name = name
self.table = table_expr
@@ -684,7 +693,7 @@ def add_column(self, expr, name=None):

return self.projection([self, expr])

def group_by(self, by):
def group_by(self, by=None, **additional_grouping_expressions):
"""
Create an intermediate grouped table expression, pending some group
operation to be applied with it.
@@ -693,12 +702,19 @@ def group_by(self, by):
--------
x.group_by([b1, b2]).aggregate(metrics)
Notes
-----
group_by and groupby are equivalent, with `groupby` being provided for
ease-of-use for pandas users.
Returns
-------
grouped_expr : GroupedTableExpr
"""
from ibis.expr.groupby import GroupedTableExpr
return GroupedTableExpr(self, by)
return GroupedTableExpr(self, by, **additional_grouping_expressions)

groupby = group_by


# -----------------------------------------------------------------------------
@@ -1065,11 +1081,11 @@ class NullLiteral(ValueNode):
"""

def __init__(self):
pass
self.value = None

@property
def args(self):
return [None]
return [self.value]

def equals(self, other):
return isinstance(other, NullLiteral)
29 changes: 20 additions & 9 deletions ibis/impala/compiler.py
@@ -230,8 +230,14 @@ def format_where(self):

buf = StringIO()
buf.write('WHERE ')
fmt_preds = [self._translate(pred, permit_subquery=True)
for pred in self.where]
fmt_preds = []
for pred in self.where:
new_pred = self._translate(pred, permit_subquery=True)
if isinstance(pred.op(), ops.Or):
# parens for OR exprs because it binds looser than AND
new_pred = _parenthesize(new_pred)
fmt_preds.append(new_pred)

conj = ' AND\n{0}'.format(' ' * 6)
buf.write(conj.join(fmt_preds))
return buf.getvalue()
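The OR-parenthesization in `format_where` can be sketched independently of the translator. Predicates are represented here as `(sql_text, is_or)` pairs purely for illustration:

```python
def format_where(predicates):
    """Render a WHERE clause, parenthesizing OR predicates.

    Each predicate is (sql_text, is_or); OR binds looser than the AND
    used to join the clause, so OR predicates need parentheses.
    """
    parts = []
    for sql_text, is_or in predicates:
        parts.append('({0})'.format(sql_text) if is_or else sql_text)
    return 'WHERE ' + ' AND\n      '.join(parts)
```

Without the parentheses, `a = 1 AND b = 2 OR c = 3` would parse as `(a = 1 AND b = 2) OR c = 3`, which is the correctness bug this hunk fixes.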
@@ -376,11 +382,16 @@ def compile(self):
else:
union_keyword = 'UNION ALL'

left_set = context.get_compiled_expr(self.left)
right_set = context.get_compiled_expr(self.right)
left_set = context.get_compiled_expr(self.left, isolated=True)
right_set = context.get_compiled_expr(self.right, isolated=True)

query = '{0}\n{1}\n{2}'.format(left_set, union_keyword, right_set)
return query
# XXX: hack of all trades - our right relation has a CTE
# TODO: factor out common subqueries in the union
if right_set.startswith('WITH'):
format_string = '({0})\n{1}\n({2})'
else:
format_string = '{0}\n{1}\n{2}'
return format_string.format(left_set, union_keyword, right_set)
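The CTE workaround in `compile` is a small string-level decision; a sketch with an invented helper name:

```python
def format_union(left_sql, right_sql, distinct=False):
    """Join two compiled SELECTs with UNION [ALL].

    If the right side begins with a CTE (WITH ...), both sides are wrapped
    in parentheses so the WITH binds to its own SELECT only.
    """
    keyword = 'UNION' if distinct else 'UNION ALL'
    if right_sql.startswith('WITH'):
        template = '({0})\n{1}\n({2})'
    else:
        template = '{0}\n{1}\n{2}'
    return template.format(left_sql, keyword, right_sql)
```

As the XXX comment says, this is a stopgap; factoring common subqueries out of the union would remove the need for the startswith check.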


# ---------------------------------------------------------------------
@@ -989,7 +1000,7 @@ def _substring(translator, expr):

# Impala is 1-indexed
if length is None or isinstance(length.op(), ir.Literal):
lvalue = length.op().value if length else None
lvalue = length.op().value if length is not None else None
if lvalue:
return 'substr({0}, {1} + 1, {2})'.format(arg_formatted,
start_formatted,
@@ -1010,12 +1021,12 @@ def _string_find(translator, expr):
arg_formatted = translator.translate(arg)
substr_formatted = translator.translate(substr)

if start and not isinstance(start.op(), ir.Literal):
if start is not None and not isinstance(start.op(), ir.Literal):
start_fmt = translator.translate(start)
return 'locate({0}, {1}, {2} + 1) - 1'.format(substr_formatted,
arg_formatted,
start_fmt)
elif start and start.op().value:
elif start is not None and start.op().value:
sval = start.op().value
return 'locate({0}, {1}, {2}) - 1'.format(substr_formatted,
arg_formatted,
14 changes: 14 additions & 0 deletions ibis/impala/tests/test_exprs.py
@@ -1550,3 +1550,17 @@ def test_char_varchar_types(self):

assert isinstance(t.varchar_col, api.StringArray)
assert isinstance(t.char_col, api.StringArray)

def test_unions_with_ctes(self):
t = self.con.table('functional_alltypes')

expr1 = (t.group_by(['tinyint_col', 'string_col'])
.aggregate(t.double_col.sum().name('metric')))
expr2 = expr1.view()

join1 = (expr1.join(expr2, expr1.string_col == expr2.string_col)
[[expr1]])
join2 = join1.view()

expr = join1.union(join2)
self.con.explain(expr)
6 changes: 4 additions & 2 deletions ibis/impala/tests/test_udf.py
@@ -55,15 +55,17 @@ def test_sql_generation(self):
func.register('identity', 'udf_testing')

result = func('hello world')
assert result == "SELECT udf_testing.identity('hello world')"
assert (ibis.impala.compile(result) ==
"SELECT udf_testing.identity('hello world') AS `tmp`")

def test_sql_generation_from_infoclass(self):
func = api.wrap_udf('test.so', ['string'], 'string', 'info_test')
repr(func)

func.register('info_test', 'udf_testing')
result = func('hello world')
assert result == "SELECT udf_testing.info_test('hello world')"
assert (ibis.impala.compile(result) ==
"SELECT udf_testing.info_test('hello world') AS `tmp`")

def test_udf_primitive_output_types(self):
types = [
122 changes: 100 additions & 22 deletions ibis/sql/alchemy.py
@@ -12,12 +12,16 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import numbers
import operator
import six

import sqlalchemy as sa
import sqlalchemy.sql as sql

from sqlalchemy.sql.elements import Over as _Over
from sqlalchemy.ext.compiler import compiles as sa_compiles

from ibis.client import SQLClient, AsyncQuery, Query
from ibis.sql.compiler import Select, Union, TableSetFormatter
import ibis.common as com
@@ -31,34 +35,40 @@


_ibis_type_to_sqla = {
dt.Int8: sa.types.SmallInteger,
dt.Int16: sa.types.SmallInteger,
dt.Int32: sa.types.Integer,
dt.Int64: sa.types.BigInteger,
dt.Int8: sa.SmallInteger,
dt.Int16: sa.SmallInteger,
dt.Int32: sa.Integer,
dt.Int64: sa.BigInteger,

# Mantissa-based
dt.Float: sa.types.Float(precision=24),
dt.Double: sa.types.Float(precision=53),
dt.Float: sa.Float(precision=24),
dt.Double: sa.Float(precision=53),

dt.Boolean: sa.types.Boolean,
dt.Boolean: sa.Boolean,

dt.String: sa.types.String,
dt.String: sa.String,

dt.Timestamp: sa.types.DateTime,
dt.Timestamp: sa.DateTime,

dt.Decimal: sa.types.NUMERIC,
dt.Decimal: sa.NUMERIC,
}

_sqla_type_mapping = {
sa.types.SmallInteger: dt.Int16,
sa.types.INTEGER: dt.Int64,
sa.types.BOOLEAN: dt.Boolean,
sa.types.BIGINT: dt.Int64,
sa.types.FLOAT: dt.Double,
sa.types.REAL: dt.Double,
sa.SmallInteger: dt.Int16,
sa.SMALLINT: dt.Int16,
sa.Integer: dt.Int32,
sa.INTEGER: dt.Int32,
sa.BigInteger: dt.Int64,
sa.BIGINT: dt.Int64,
sa.Boolean: dt.Boolean,
sa.BOOLEAN: dt.Boolean,
sa.FLOAT: dt.Double,
sa.REAL: dt.Float,
sa.VARCHAR: dt.String,
sa.Float: dt.Double,

sa.types.TEXT: dt.String,
sa.types.NullType: dt.String,
sa.types.NullType: dt.Null,
sa.types.Text: dt.String,
}

@@ -84,8 +94,15 @@ def schema_from_table(table):
ibis_class = _sqla_type_to_ibis[c.type]
elif type_class in _sqla_type_to_ibis:
ibis_class = _sqla_type_to_ibis[type_class]
elif isinstance(c.type, sa.DateTime):
ibis_class = dt.Timestamp()
else:
raise NotImplementedError(c.type)
for k, v in _sqla_type_to_ibis.items():
if isinstance(c.type, type(k)):
ibis_class = v
break
else:
raise NotImplementedError(c.type)
t = ibis_class(c.nullable)

types.append(t)
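The lookup order in `schema_from_table` — exact type class first, then an `isinstance` fallback — in miniature, with toy type classes standing in for SQLAlchemy's:

```python
class Text(object):
    pass


class CustomText(Text):
    """Stands in for a dialect-specific SQLAlchemy type subclass."""


TYPE_MAP = {Text: 'string'}


def map_type(column_type):
    """Exact class lookup first, then an isinstance fallback."""
    cls = type(column_type)
    if cls in TYPE_MAP:
        return TYPE_MAP[cls]
    for known, ibis_name in TYPE_MAP.items():
        if isinstance(column_type, known):
            return ibis_name
    raise NotImplementedError(type(column_type).__name__)
```

The fallback is what lets dialect-specific subclasses (like psycopg2's flavored types) map to the same Ibis type as their base class without enumerating every subclass.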
@@ -392,14 +409,17 @@ def __init__(self, *args, **kwargs):
self.dialect = kwargs.pop('dialect', AlchemyDialect)
comp.QueryContext.__init__(self, *args, **kwargs)

def subcontext(self):
return type(self)(dialect=self.dialect, parent=self)
def subcontext(self, isolated=False):
if not isolated:
return type(self)(dialect=self.dialect, parent=self)
else:
return type(self)(dialect=self.dialect)

def _to_sql(self, expr, ctx):
return to_sqlalchemy(expr, context=ctx)

def _compile_subquery(self, expr):
sub_ctx = self.subcontext()
def _compile_subquery(self, expr, isolated=False):
sub_ctx = self.subcontext(isolated=isolated)
return self._to_sql(expr, sub_ctx)

def has_table(self, expr, parent_contexts=False):
@@ -894,3 +914,61 @@ def _floor_divide(t, expr):
return t.translate(new_expr)

return fixed_arity(lambda x, y: x / y, 2)(t, expr)


@compiles(ops.SortKey)
def _sort_key(t, expr):
# We need to define this for window functions that have an order by
by, ascending = expr.op().args
sort_direction = sa.asc if ascending else sa.desc
return sort_direction(t.translate(by))


_valid_frame_types = numbers.Integral, str, type(None)


class Over(_Over):
def __init__(
self,
element,
order_by=None,
partition_by=None,
preceding=None,
following=None,
):
super(Over, self).__init__(
element, order_by=order_by, partition_by=partition_by
)
if not isinstance(preceding, _valid_frame_types):
raise TypeError(
'preceding must be a string, integer or None, got %r' % (
type(preceding).__name__
)
)
if not isinstance(following, _valid_frame_types):
raise TypeError(
'following must be a string, integer or None, got %r' % (
type(following).__name__
)
)
self.preceding = preceding if preceding is not None else 'UNBOUNDED'
self.following = following if following is not None else 'UNBOUNDED'


@sa_compiles(Over)
def compile_over_with_frame(element, compiler, **kw):
clauses = ' '.join(
'%s BY %s' % (word, compiler.process(clause, **kw))
for word, clause in (
('PARTITION', element.partition_by),
('ORDER', element.order_by),
)
if clause is not None and len(clause)
)
return '%s OVER (%s%sROWS BETWEEN %s PRECEDING AND %s FOLLOWING)' % (
compiler.process(getattr(element, 'element', element.func), **kw),
clauses,
' ' if clauses else '', # only add a space if we order by or group by
str(element.preceding).upper(),
str(element.following).upper(),
)
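The frame-aware `OVER` rendering above can be sketched as plain string formatting (an invented helper, not the SQLAlchemy-integrated version):

```python
def over_clause(func, partition_by=None, order_by=None,
                preceding=None, following=None):
    """Render FUNC(...) OVER (...) with an explicit ROWS frame.

    None bounds become UNBOUNDED, mirroring SQL's default frame edges.
    """
    clauses = ' '.join(
        '{0} BY {1}'.format(word, cols)
        for word, cols in (('PARTITION', partition_by), ('ORDER', order_by))
        if cols
    )
    pre = 'UNBOUNDED' if preceding is None else str(preceding)
    fol = 'UNBOUNDED' if following is None else str(following)
    return '{0} OVER ({1}{2}ROWS BETWEEN {3} PRECEDING AND {4} FOLLOWING)'.format(
        func, clauses, ' ' if clauses else '', pre, fol)
```

The real implementation additionally validates that the bounds are integers, strings, or None before rendering, as the `Over.__init__` above shows.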
15 changes: 9 additions & 6 deletions ibis/sql/compiler.py
@@ -926,8 +926,8 @@ def __init__(self, indent=2, parent=None):

self._table_key_memo = {}

def _compile_subquery(self, expr):
sub_ctx = self.subcontext()
def _compile_subquery(self, expr, isolated=False):
sub_ctx = self.subcontext(isolated=isolated)
return self._to_sql(expr, sub_ctx)

def _to_sql(self, expr, ctx):
@@ -943,7 +943,7 @@ def top_context(self):
def set_always_alias(self):
self.always_alias = True

def get_compiled_expr(self, expr):
def get_compiled_expr(self, expr, isolated=False):
this = self.top_context

key = self._get_table_key(expr)
@@ -954,7 +954,7 @@ def get_compiled_expr(self, expr):
if isinstance(op, ops.SQLQueryResult):
result = op.query
else:
result = self._compile_subquery(expr)
result = self._compile_subquery(expr, isolated=isolated)

this.subquery_memo[key] = result
return result
@@ -1009,8 +1009,11 @@ def set_extracted(self, expr):
self.extracted_subexprs.add(key)
self.make_alias(expr)

def subcontext(self):
return type(self)(indent=self.indent, parent=self)
def subcontext(self, isolated=False):
if not isolated:
return type(self)(indent=self.indent, parent=self)
else:
return type(self)(indent=self.indent)

# Maybe temporary hacks for correlated / uncorrelated subqueries

50 changes: 50 additions & 0 deletions ibis/sql/postgres/api.py
@@ -0,0 +1,50 @@
# Copyright 2015 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


from .client import PostgreSQLClient
from .compiler import rewrites # noqa


def compile(expr):
"""
Force compilation of expression for the PostgreSQL target
"""
from .client import PostgreSQLDialect
from ibis.sql.alchemy import to_sqlalchemy
return to_sqlalchemy(expr, dialect=PostgreSQLDialect)


def connect(host=None, user=None, password=None, port=None, database=None,
url=None, driver=None):

"""
Create an Ibis client connected to a PostgreSQL database.
Parameters
----------
host : string, default None
user : string, default None
password : string, default None
port : string or integer, default None
database : string, default None
url : string, default None
Complete SQLAlchemy connection string. If passed, the other connection
arguments are ignored.
driver : string, default 'psycopg2'
"""
return PostgreSQLClient(host=host, user=user, password=password, port=port,
database=database, url=url, driver=driver)
109 changes: 109 additions & 0 deletions ibis/sql/postgres/client.py
@@ -0,0 +1,109 @@
# Copyright 2015 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import sqlalchemy as sa

from ibis.client import Database
from .compiler import PostgreSQLDialect
import ibis.expr.types as ir
import ibis.sql.alchemy as alch


class PostgreSQLTable(alch.AlchemyTable):
pass


class PostgreSQLDatabase(Database):
pass


class PostgreSQLClient(alch.AlchemyClient):

"""
The Ibis PostgreSQL client class
"""

dialect = PostgreSQLDialect
database_class = PostgreSQLDatabase

def __init__(self, host=None, user=None, password=None, port=None,
database=None, url=None, driver=None):
if url is None:
if user is not None:
if password is None:
userpass = user
else:
userpass = '{0}:{1}'.format(user, password)

address = '{0}@{1}'.format(userpass, host)
else:
address = host

if port is not None:
address = '{0}:{1}'.format(address, port)

if database is not None:
address = '{0}/{1}'.format(address, database)

if driver is not None and driver != 'psycopg2':
raise NotImplementedError(driver)

url = 'postgresql://{0}'.format(address)

url = sa.engine.url.make_url(url)
self.name = url.database
self.database_name = 'public'
self.con = sa.create_engine(url)
self.meta = sa.MetaData(bind=self.con, reflect=True)

@property
def current_database(self):
return self.database_name

def list_databases(self):
raise NotImplementedError

    def set_database(self, name):
        raise NotImplementedError

@property
def client(self):
return self

def table(self, name, database=None):
"""
Create a table expression that references a particular table in the
PostgreSQL database
Parameters
----------
name : string
Returns
-------
table : TableExpr
"""
alch_table = self._get_sqla_table(name)
node = PostgreSQLTable(alch_table, self)
return self._table_expr_klass(node)

def drop_table(self):
pass

def create_table(self, name, expr=None):
pass

@property
def _table_expr_klass(self):
return ir.TableExpr
527 changes: 527 additions & 0 deletions ibis/sql/postgres/compiler.py

Large diffs are not rendered by default.

81 changes: 81 additions & 0 deletions ibis/sql/postgres/tests/common.py
@@ -0,0 +1,81 @@
# Copyright 2015 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import getpass
import os

import pytest

from ibis.sql.postgres.compiler import PostgreSQLExprTranslator
import ibis.sql.postgres.api as api

from sqlalchemy.dialects.postgresql import dialect as postgres_dialect


PG_USER = os.environ.get('IBIS_POSTGRES_USER', getpass.getuser())
PG_PASS = os.environ.get('IBIS_POSTGRES_PASS')


@pytest.mark.postgresql
class PostgreSQLTests(object):

@classmethod
def setUpClass(cls):
cls.env = PostgreSQLTestEnv()
cls.dialect = postgres_dialect()

E = cls.env

cls.con = api.connect(host=E.host, user=E.user, password=E.password,
database=E.database_name)
cls.alltypes = cls.con.table('functional_alltypes')

def _check_expr_cases(self, cases, context=None, named=False):
for expr, expected in cases:
result = self._translate(expr, named=named, context=context)

compiled = result.compile(dialect=self.dialect)
ex_compiled = expected.compile(dialect=self.dialect)

assert str(compiled) == str(ex_compiled)

def _translate(self, expr, named=False, context=None):
translator = PostgreSQLExprTranslator(
expr, context=context, named=named
)
return translator.get_result()

def _to_sqla(self, table):
return table.op().sqla_table

def _check_e2e_cases(self, cases):
for expr, expected in cases:
result = self.con.execute(expr)
assert result == expected


class PostgreSQLTestEnv(object):

def __init__(self):
if PG_PASS:
creds = '{0}:{1}'.format(PG_USER, PG_PASS)
else:
creds = PG_USER

self.user = PG_USER
self.password = PG_PASS
self.host = 'localhost'
self.database_name = 'ibis_testing'

self.db_url = 'postgresql://{0}@localhost/ibis_testing'.format(creds)
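The credential handling above reads `IBIS_POSTGRES_USER`/`IBIS_POSTGRES_PASS` from the environment, falling back to the current login name. A sketch of the same logic, taking a plain dict instead of `os.environ` so it can be exercised without setting variables (the `pg_test_url` helper is hypothetical, not part of the test suite):

```python
import getpass


def pg_test_url(environ):
    # Same fallbacks as PG_USER / PG_PASS above
    user = environ.get('IBIS_POSTGRES_USER', getpass.getuser())
    password = environ.get('IBIS_POSTGRES_PASS')

    if password:
        creds = '{0}:{1}'.format(user, password)
    else:
        creds = user

    return 'postgresql://{0}@localhost/ibis_testing'.format(creds)


# → 'postgresql://alice:pw@localhost/ibis_testing'
url = pg_test_url({'IBIS_POSTGRES_USER': 'alice', 'IBIS_POSTGRES_PASS': 'pw'})
```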
81 changes: 81 additions & 0 deletions ibis/sql/postgres/tests/test_client.py
@@ -0,0 +1,81 @@
# Copyright 2015 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import pandas as pd

from .common import PostgreSQLTests
from ibis.compat import unittest
from ibis.tests.util import assert_equal
import ibis.expr.types as ir
import ibis


class TestPostgreSQLClient(PostgreSQLTests, unittest.TestCase):

@classmethod
def tearDownClass(cls):
pass

def test_table(self):
table = self.con.table('functional_alltypes')
assert isinstance(table, ir.TableExpr)

def test_array_execute(self):
d = self.alltypes.limit(10).double_col
s = d.execute()
assert isinstance(s, pd.Series)
assert len(s) == 10

def test_literal_execute(self):
expr = ibis.literal('1234')
result = self.con.execute(expr)
assert result == '1234'

def test_simple_aggregate_execute(self):
d = self.alltypes.double_col.sum()
v = d.execute()
assert isinstance(v, float)

def test_list_tables(self):
assert len(self.con.list_tables()) > 0
assert len(self.con.list_tables(like='functional')) == 1

def test_compile_verify(self):
unsupported_expr = self.alltypes.string_col.approx_nunique()
assert not unsupported_expr.verify()

supported_expr = self.alltypes.double_col.sum()
assert supported_expr.verify()

def test_database_layer(self):
db = self.con.database()

t = db.functional_alltypes
assert_equal(t, self.alltypes)

assert db.list_tables() == self.con.list_tables()

def test_compile_toplevel(self):
# t = ibis.table([
# ('foo', 'double')
# ])

# # it works!
# expr = t.foo.sum()
# ibis.postgres.compile(expr)

# This does not work yet because if the compiler encounters a
# non-SQLAlchemy table it fails
pass
586 changes: 586 additions & 0 deletions ibis/sql/postgres/tests/test_functions.py

Large diffs are not rendered by default.

85 changes: 85 additions & 0 deletions ibis/sql/tests/test_compiler.py
@@ -1447,6 +1447,39 @@ def test_subquery_used_for_self_join(self):
GROUP BY 1"""
assert result == expected

def test_subquery_in_union(self):
t = self.con.table('alltypes')

expr1 = t.group_by(['a', 'g']).aggregate(t.f.sum().name('metric'))
expr2 = expr1.view()

join1 = expr1.join(expr2, expr1.g == expr2.g)[[expr1]]
join2 = join1.view()

expr = join1.union(join2)
result = to_sql(expr)
expected = """\
(WITH t0 AS (
SELECT `a`, `g`, sum(`f`) AS `metric`
FROM alltypes
GROUP BY 1, 2
)
SELECT t0.*
FROM t0
INNER JOIN t0 t1
ON t0.`g` = t1.`g`)
UNION ALL
(WITH t0 AS (
SELECT `a`, `g`, sum(`f`) AS `metric`
FROM alltypes
GROUP BY 1, 2
)
SELECT t0.*
FROM t0
INNER JOIN t0 t1
ON t0.`g` = t1.`g`)"""
assert result == expected

def test_subquery_factor_correlated_subquery(self):
# #173, #183 and other issues

@@ -2235,3 +2268,55 @@ def test_multiple_count_distinct(self):
FROM functional_alltypes
GROUP BY 1"""
assert result == expected


def test_pushdown_with_or():
t = ibis.table(
[('double_col', 'double'),
('string_col', 'string'),
('int_col', 'int32'),
('float_col', 'float')],
'functional_alltypes',
)
subset = t[(t.double_col > 3.14) & t.string_col.contains('foo')]
filt = subset[(subset.int_col - 1 == 0) | (subset.float_col <= 1.34)]
result = to_sql(filt)
expected = """\
SELECT *
FROM functional_alltypes
WHERE (`double_col` > 3.14) AND (locate('foo', `string_col`) - 1 >= 0) AND
(((`int_col` - 1) = 0) OR (`float_col` <= 1.34))"""
assert result == expected


def test_having_size():
t = ibis.table(
[('double_col', 'double'),
('string_col', 'string'),
('int_col', 'int32'),
('float_col', 'float')],
'functional_alltypes',
)
expr = t.group_by(t.string_col).having(t.double_col.max() == 1).size()
result = to_sql(expr)
assert result == """\
SELECT `string_col`, count(*) AS `count`
FROM functional_alltypes
GROUP BY 1
HAVING max(`double_col`) = 1"""


def test_having_from_filter():
t = ibis.table([('a', 'int64'), ('b', 'string')], 't')
filt = t[t.b == 'm']
gb = filt.group_by(filt.b)
having = gb.having(filt.a.max() == 2)
agg = having.aggregate(filt.a.sum().name('sum'))
result = to_sql(agg)
expected = """\
SELECT `b`, sum(`a`) AS `sum`
FROM t
WHERE `b` = 'm'
GROUP BY 1
HAVING max(`a`) = 2"""
assert result == expected
4 changes: 2 additions & 2 deletions ibis/sql/tests/test_sqlalchemy.py
@@ -100,9 +100,9 @@ def test_sqla_schema_conversion(self):
# name, type, nullable
('smallint', sat.SmallInteger, False, dt.int16),
('int', sat.Integer, True, dt.int32),
('integer', sat.INTEGER(), True, dt.int64),
('integer', sat.INTEGER(), True, dt.int32),
('bigint', sat.BigInteger, False, dt.int64),
('real', sat.REAL, True, dt.double),
('real', sat.REAL, True, dt.float),
('bool', sat.Boolean, True, dt.boolean),
('timestamp', sat.DateTime, True, dt.timestamp),
]
2 changes: 1 addition & 1 deletion ibis/tests/conftest.py
@@ -16,7 +16,7 @@

import ibis

groups = ['hdfs', 'impala', 'madlib', 'sqlite', 'kudu']
groups = ['hdfs', 'impala', 'madlib', 'postgresql', 'sqlite', 'kudu']


def pytest_configure(config):
3 changes: 2 additions & 1 deletion requirements.txt
@@ -1,7 +1,8 @@
pytest
numpy>=1.7.0
pandas>=0.12.0
impyla>=0.13.2
impyla>=0.13.7
hdfs>=2.0.0
sqlalchemy>=1.0.0
six
toolz
131 changes: 93 additions & 38 deletions scripts/test_data_admin.py
@@ -13,6 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import getpass
import os
import shutil
import tempfile
@@ -21,6 +22,8 @@
from subprocess import check_call

from click import group, option
import sqlalchemy as sa
from sqlalchemy import create_engine

import ibis
from ibis.compat import BytesIO
@@ -88,7 +91,7 @@ def can_build_udfs():
return True


def is_data_loaded(con):
def is_impala_loaded(con):
if not con.hdfs.exists(ENV.test_data_dir):
return False
if not con.exists_database(ENV.test_data_db):
@@ -215,16 +218,31 @@ def download_parquet_files(con, tmp_db_hdfs_path):
con.hdfs.get(tmp_db_hdfs_path, parquet_path)


def get_postgres_engine():
pg_user = os.environ.get('IBIS_POSTGRES_USER', getpass.getuser())
pg_pass = os.environ.get('IBIS_POSTGRES_PASS')

if pg_pass:
creds = '{0}:{1}'.format(pg_user, pg_pass)
else:
creds = pg_user

engine = (create_engine('postgresql://{0}@localhost/ibis_testing'
.format(creds)))
return engine


def get_sqlite_engine():
    path = pjoin(IBIS_TEST_DATA_LOCAL_DIR, 'ibis_testing.db')
    return create_engine('sqlite:///{0}'.format(path))


def load_sql_databases(con, engines):
    csv_path = guid()

    generate_sql_csv_sources(csv_path, con.database('ibis_testing'))
    for engine in engines:
        make_testing_db(csv_path, engine)
    shutil.rmtree(csv_path)


@@ -333,12 +351,33 @@ def generate_sql_csv_sources(output_path, db):
df.to_csv('{0}.csv'.format(path), na_rep='\\N')


def make_testing_db(csv_dir, con):
for name in _sql_tables:
print(name)
path = osp.join(csv_dir, '{0}.csv'.format(name))
        df = pd.read_csv(path, na_values=['\\N'], dtype={'bool_col': 'bool'})
df.to_sql(
name,
con,
chunksize=10000,
if_exists='replace',
dtype={
'index': sa.INTEGER,
'id': sa.INTEGER,
'bool_col': sa.BOOLEAN,
'tinyint_col': sa.SMALLINT,
'smallint_col': sa.SMALLINT,
'int_col': sa.INTEGER,
'bigint_col': sa.BIGINT,
'float_col': sa.REAL,
'double_col': sa.FLOAT,
'date_string_col': sa.TEXT,
'string_col': sa.TEXT,
'timestamp_col': sa.TIMESTAMP,
'year': sa.INTEGER,
'month': sa.INTEGER,
}
)
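The explicit `dtype` mapping matters because pandas would otherwise infer SQL column types from the CSV data on each load. A minimal round-trip sketch of the same pattern, using an in-memory SQLite engine as a stand-in for the real target database and made-up column values:

```python
import pandas as pd
import sqlalchemy as sa

# In-memory stand-in for the Postgres/SQLite test databases
engine = sa.create_engine('sqlite://')

df = pd.DataFrame({
    'int_col': [1, 2, 3],
    'double_col': [1.5, 2.5, 3.5],
    'string_col': ['a', 'b', 'c'],
})

# Pin the SQL column types instead of letting pandas infer them
df.to_sql(
    'functional_alltypes',
    engine,
    index=False,
    if_exists='replace',
    dtype={
        'int_col': sa.INTEGER,
        'double_col': sa.FLOAT,
        'string_col': sa.TEXT,
    },
)

result = pd.read_sql('SELECT * FROM functional_alltypes', engine)
```

`if_exists='replace'` makes the load idempotent, which is why repeated runs of the admin script do not fail on existing tables.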


# ==========================================
@@ -395,7 +434,10 @@ def create(create_tarball, push_to_s3):
download_parquet_files(con, tmp_db_hdfs_path)
download_avro_files(con)
generate_csv_files()
        # Only populate SQLite here
        engines = [get_sqlite_engine()]
        load_sql_databases(con, engines)
finally:
con.drop_database(tmp_db, force=True)
assert not con.hdfs.exists(tmp_db_hdfs_path)
@@ -436,34 +478,27 @@ def load(data, udf, data_dir, overwrite):

# load the data files
if data:
already_loaded = is_data_loaded(con)
print('Attempting to load Ibis test data (--data)')
if already_loaded and not overwrite:
print('Data is already loaded and not overwriting; moving on')
else:
if already_loaded:
print('Data is already loaded; attempting to overwrite')
tmp_dir = tempfile.mkdtemp(prefix='__ibis_tmp_')
try:
if not data_dir:
print('Did not specify a local dir with the test data, so '
'downloading it from S3')
data_dir = dnload_ibis_test_data_from_s3(tmp_dir)
print('Uploading to HDFS')
upload_ibis_test_data_to_hdfs(con, data_dir)
print('Creating Ibis test data database')
create_test_database(con)
parquet_tables = create_parquet_tables(con)
avro_tables = create_avro_tables(con)
for table in parquet_tables + avro_tables:
print('Computing stats for {0}'.format(table.op().name))
table.compute_stats()

# sqlite database
sqlite_src = osp.join(data_dir, 'ibis_testing.db')
shutil.copy(sqlite_src, '.')
finally:
shutil.rmtree(tmp_dir)
tmp_dir = tempfile.mkdtemp(prefix='__ibis_tmp_')

if not data_dir:
# TODO(wesm): do not download if already downloaded
print('Did not specify a local dir with the test data, so '
'downloading it from S3')
data_dir = dnload_ibis_test_data_from_s3(tmp_dir)
try:
load_impala_data(con, data_dir, overwrite)

# sqlite database
print('Setting up SQLite')
sqlite_src = osp.join(data_dir, 'ibis_testing.db')
shutil.copy(sqlite_src, '.')

print('Loading SQL engines')
# SQL engines
engines = [get_postgres_engine()]
load_sql_databases(con, engines)
finally:
shutil.rmtree(tmp_dir)
else:
print('Skipping Ibis test data load (--no-data)')

@@ -484,6 +519,26 @@ def load(data, udf, data_dir, overwrite):
print('Skipping UDF build/load (--no-udf)')


def load_impala_data(con, data_dir, overwrite=False):
already_loaded = is_impala_loaded(con)
print('Attempting to load Ibis Impala test data (--data)')
if already_loaded and not overwrite:
print('Data is already loaded and not overwriting; moving on')
else:
if already_loaded:
print('Data is already loaded; attempting to overwrite')

print('Uploading to HDFS')
upload_ibis_test_data_to_hdfs(con, data_dir)
print('Creating Ibis test data database')
create_test_database(con)
parquet_tables = create_parquet_tables(con)
avro_tables = create_avro_tables(con)
for table in parquet_tables + avro_tables:
print('Computing stats for {0}'.format(table.op().name))
table.compute_stats()


@main.command()
@option('--test-data', is_flag=True,
help='Cleanup Ibis test data, test database, and also the test UDFs '