21 changes: 14 additions & 7 deletions docs/source/getting-started.rst
@@ -15,8 +15,8 @@ Installation
System dependencies
~~~~~~~~~~~~~~~~~~~

Ibis requires a working Python 2.6 or 2.7 installation (3.x support will come
in a future release). We recommend `Anaconda <http://continuum.io/downloads>`_.
Ibis requires a working Python 2.6, 2.7, or 3.4 installation. We recommend
`Anaconda <http://continuum.io/downloads>`_.

Installing the Python package
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -40,6 +40,14 @@ Some platforms will require that you have Kerberos installed to build properly.
* Redhat / CentOS: ``yum install krb5-devel``
* Ubuntu / Debian: ``apt-get install libkrb5-dev``

.. _install.sqlite:

Ibis SQLite Quickstart
----------------------

See http://blog.ibis-project.org/sqlite-crunchbase-quickstart/ for a quickstart
using SQLite. Otherwise read on to try out Ibis on Impala.
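
For the impatient, here is a minimal sketch of that workflow (the file name
``crunchbase.db`` is a placeholder for any local SQLite database):

.. code-block:: python

   import ibis

   con = ibis.sqlite.connect('crunchbase.db')

   con.list_tables()
   companies = con.table('companies')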

Creating a client
-----------------

@@ -52,7 +60,7 @@ the client using ``ibis.impala.connect``:
hdfs = ibis.hdfs_connect(host=webhdfs_host, port=webhdfs_port)
con = ibis.impala.connect(host=impala_host, port=impala_port,
hdfs_client=hdfs
hdfs_client=hdfs)
Both method calls can take ``auth_mechanism='GSSAPI'`` or
``auth_mechanism='LDAP'`` to connect to Kerberos clusters. Depending on your
@@ -72,16 +80,15 @@ reproduced as part of the documentation.
Using Ibis with the Cloudera Quickstart VM
------------------------------------------

Since Ibis requires a running Impala cluster, we have provided a lean
VirtualBox image to simplify the process for those looking to try out Ibis
Using Ibis with Impala requires a running Impala cluster, so we have provided a
lean VirtualBox image to simplify the process for those looking to try out Ibis
(without setting up a cluster) or start contributing code to the project.

What follows are streamlined setup instructions for the VM. If you wish to
download it directly and setup from the ``ova`` file, use this `download link
<http://archive.cloudera.com/cloudera-ibis/ibis-demo.ova>`_.

The VM was built with Oracle VirtualBox 4.3.28. We recommend using the latest
version of the software for the best compatibility.
The VM was built with Oracle VirtualBox 4.3.28.

TL;DR
~~~~~
132 changes: 0 additions & 132 deletions docs/source/impala-udf.rst

This file was deleted.

793 changes: 779 additions & 14 deletions docs/source/impala.rst

Large diffs are not rendered by default.

7 changes: 4 additions & 3 deletions docs/source/index.rst
@@ -12,11 +12,13 @@ Hadoop. Ibis is being jointly developed with `Impala <http://impala.io>`_ to
deliver a complete 100% Python user experience on data of any size (small,
medium, or big).

At this item, Ibis supports the following SQL-based systems:
At this time, Ibis supports the following SQL-based systems:

- Impala (on HDFS)
- SQLite

Coming from SQL? Check out :ref:`Ibis for SQL Programmers <sql>`.

We have a handful of specific priority focus areas:

- Enable data analysts to transition analytics using SQL engines to Python
@@ -65,10 +67,9 @@ places, but this will improve as things progress.
getting-started
configuration
tutorial
impala-udf
impala
api
sql
impala
release
developer
type-system
89 changes: 78 additions & 11 deletions docs/source/release.rst
@@ -4,10 +4,77 @@ Release Notes

**Note**: These release notes will only include notable or major bug fixes
since most minor bug fixes tend to be esoteric and not generally
interesting.
interesting. Point releases (e.g. 0.5.1) contain only bug fixes and will
generally not be documented here.

0.5.0 (September 10, 2015)
--------------------------
0.6 (December 1, 2015)
----------------------

This release brings expanded pandas and Impala integration, including support
for managing partitioned tables in Impala. See the new :ref:`Ibis for Impala
Users <impala>` guide for more on using Ibis with Impala.

The :ref:`Ibis for SQL Programmers <sql>` guide was also written since the
0.5 release.

This release also includes bug fixes affecting generated SQL correctness. All
users should upgrade as soon as possible.

New features
~~~~~~~~~~~~

* New integrated Impala functionality. See :ref:`Ibis for Impala Users
  <impala>` for more details.

* Improved Impala-pandas integration. Create tables or insert into existing
tables from pandas ``DataFrame`` objects.
* Partitioned table metadata management API. Add, drop, alter, and
insert into table partitions.
* Add ``is_partitioned`` property to ``ImpalaTable``.
* Added support for ``LOAD DATA`` DDL using the ``load_data`` function, also
supporting partitioned tables.
  * Modify table metadata (location, format, SerDe properties, etc.) using
    ``ImpalaTable.alter``.
* Interrupting Impala expression execution with Control-C will attempt to
cancel the running query with the server.
* Set the compression codec (e.g. snappy) used with
``ImpalaClient.set_compression_codec``.
* Get and set query options for a client session with
``ImpalaClient.get_options`` and ``ImpalaClient.set_options``.
* Add ``ImpalaTable.metadata`` method that parses the output of the
``DESCRIBE FORMATTED`` DDL to simplify table metadata inspection.
* Add ``ImpalaTable.stats`` and ``ImpalaTable.column_stats`` to see computed
table and partition statistics.
  * Add ``CHAR`` and ``VARCHAR`` handling.
* Add ``refresh``, ``invalidate_metadata`` DDL options and add
``incremental`` option to ``compute_stats`` for ``COMPUTE INCREMENTAL
STATS``.

* Add ``substitute`` method for performing multiple value substitutions in an
  array or scalar expression (see the sketch after this list).
* Division is by default *true division* like Python 3 for all numeric
data. This means for SQL systems that use C-style division semantics, the
appropriate ``CAST`` will be automatically inserted in the generated SQL.
* Easier joins on tables with overlapping column names. See :ref:`Ibis for SQL Programmers <sql>`.
* Expressions like ``string_expr[:3]`` now work as expected.
* Add ``coalesce`` instance method to all value expressions.
* Passing ``limit=None`` to the ``execute`` method on expressions disables any
default row limits.
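
A rough sketch combining several of the features above (``t`` is a
hypothetical table expression with a string column ``g`` and numeric columns
``a`` and ``b``):

.. code-block:: python

   # Multiple value substitutions, with the original value as the fallback
   labeled = t.g.substitute({'foo': 'one', 'bar': 'two'})

   # String slicing lowers to substr: equivalent to t.g.substr(0, 3)
   prefix = t.g[:3]

   # coalesce as an instance method
   filled = t.g.coalesce('unknown')

   # True division by default; use // for C-style floor division
   ratio = t.a / t.b
   whole = t.a // t.b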

API Changes
~~~~~~~~~~~

* ``ImpalaTable.rename`` no longer mutates the calling table expression.
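
  A sketch of the new behavior (``t`` is a hypothetical ``ImpalaTable``)::

      t2 = t.rename('new_name')  # returns the renamed table expression
      # `t` itself is left pointing at the original table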

Contributors
~~~~~~~~~~~~

::

$ git log v0.5.0..v0.6.0 --pretty=format:%aN | sort | uniq -c | sort -rn

0.5 (September 10, 2015)
------------------------

Highlights in this release are the SQLite, Python 3, Impala UDA support, and an
asynchronous execution API. There are also many usability improvements, bug
@@ -49,8 +116,8 @@ Contributors
9 Uri Laserson
1 Kristopher Overholt

0.4.0 (August 14, 2015)
-----------------------
0.4 (August 14, 2015)
---------------------

New features
~~~~~~~~~~~~
@@ -91,8 +158,8 @@ Contributors
2 Kristopher Overholt
1 Marius van Niekerk

0.3.0 (July 20, 2015)
---------------------
0.3 (July 20, 2015)
-------------------

First public release. See http://ibis-project.org for more.

@@ -129,8 +196,8 @@ Contributors
4 Isaac Hodes
2 Meghana Vuyyuru

0.2.0 (June 16, 2015)
---------------------
0.2 (June 16, 2015)
-------------------

New features
~~~~~~~~~~~~
@@ -195,8 +262,8 @@ Contributors
1 Juliet Hougland
1 Isaac Hodes

0.1.0 (March 26, 2015)
----------------------
0.1 (March 26, 2015)
--------------------

First Ibis release.

1,300 changes: 1,297 additions & 3 deletions docs/source/sql.rst

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/source/tutorial.rst
@@ -1,8 +1,8 @@
.. _tutorial:

********
Tutorial
********
*******************
Expression tutorial
*******************

These notebooks come from http://github.com/cloudera/ibis-notebooks and are
reproduced here using ``nbconvert``.
Expand Down
4 changes: 4 additions & 0 deletions docs/sphinxext/ipython_sphinxext/LICENSE
@@ -0,0 +1,4 @@
Taken from pandas

Source: https://github.com/pydata/pandas/tree/master/doc/sphinxext/ipython_sphinxext
License: BSD
Empty file.
116 changes: 116 additions & 0 deletions docs/sphinxext/ipython_sphinxext/ipython_console_highlighting.py
@@ -0,0 +1,116 @@
"""reST directive for syntax-highlighting ipython interactive sessions.
XXX - See what improvements can be made based on the new (as of Sept 2009)
'pycon' lexer for the python console. At the very least it will give better
highlighted tracebacks.
"""

#-----------------------------------------------------------------------------
# Needed modules

# Standard library
import re

# Third party
from pygments.lexer import Lexer, do_insertions
from pygments.lexers.agile import (PythonConsoleLexer, PythonLexer,
PythonTracebackLexer)
from pygments.token import Comment, Generic

from sphinx import highlighting

#-----------------------------------------------------------------------------
# Global constants
line_re = re.compile('.*?\n')

#-----------------------------------------------------------------------------
# Code begins - classes and functions


class IPythonConsoleLexer(Lexer):

"""
For IPython console output or doctests, such as:
.. sourcecode:: ipython
In [1]: a = 'foo'
In [2]: a
Out[2]: 'foo'
In [3]: print(a)
foo
In [4]: 1 / 0
Notes:
- Tracebacks are not currently supported.
- It assumes the default IPython prompts, not customized ones.
"""

name = 'IPython console session'
aliases = ['ipython']
mimetypes = ['text/x-ipython-console']
input_prompt = re.compile("(In \[[0-9]+\]: )|( \.\.\.+:)")
output_prompt = re.compile("(Out\[[0-9]+\]: )|( \.\.\.+:)")
continue_prompt = re.compile(" \.\.\.+:")
tb_start = re.compile("\-+")

def get_tokens_unprocessed(self, text):
pylexer = PythonLexer(**self.options)
tblexer = PythonTracebackLexer(**self.options)

curcode = ''
insertions = []
for match in line_re.finditer(text):
line = match.group()
input_prompt = self.input_prompt.match(line)
continue_prompt = self.continue_prompt.match(line.rstrip())
output_prompt = self.output_prompt.match(line)
if line.startswith("#"):
insertions.append((len(curcode),
[(0, Comment, line)]))
elif input_prompt is not None:
insertions.append((len(curcode),
[(0, Generic.Prompt, input_prompt.group())]))
curcode += line[input_prompt.end():]
elif continue_prompt is not None:
insertions.append((len(curcode),
[(0, Generic.Prompt, continue_prompt.group())]))
curcode += line[continue_prompt.end():]
elif output_prompt is not None:
# Use the 'error' token for output. We should probably make
                # our own token, but error is typically in a bright color like
# red, so it works fine for our output prompts.
insertions.append((len(curcode),
[(0, Generic.Error, output_prompt.group())]))
curcode += line[output_prompt.end():]
else:
if curcode:
for item in do_insertions(insertions,
pylexer.get_tokens_unprocessed(curcode)):
yield item
curcode = ''
insertions = []
yield match.start(), Generic.Output, line
if curcode:
for item in do_insertions(insertions,
pylexer.get_tokens_unprocessed(curcode)):
yield item


def setup(app):
"""Setup as a sphinx extension."""

# This is only a lexer, so adding it below to pygments appears sufficient.
# But if somebody knows that the right API usage should be to do that via
# sphinx, by all means fix it here. At least having this setup.py
# suppresses the sphinx warning we'd get without it.
pass

#-----------------------------------------------------------------------------
# Register the extension as a valid pygments lexer
highlighting.lexers['ipython'] = IPythonConsoleLexer()
1,089 changes: 1,089 additions & 0 deletions docs/sphinxext/ipython_sphinxext/ipython_directive.py

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions ibis/__init__.py
@@ -123,3 +123,7 @@ def test(impala=False):
if impala:
args.append('--impala')
pytest.main(args)

from ._version import get_versions
__version__ = get_versions()['version']
del get_versions
460 changes: 460 additions & 0 deletions ibis/_version.py

Large diffs are not rendered by default.

15 changes: 9 additions & 6 deletions ibis/client.py
@@ -47,7 +47,7 @@ def __init__(self, client, ddl):
def execute(self):
# synchronous by default
with self.client._execute(self.compiled_ddl, results=True) as cur:
result = self._fetch_from_cursor(cur)
result = self._fetch(cur)

return self._wrap_result(result)

@@ -56,7 +56,7 @@ def _wrap_result(self, result):
result = self.result_wrapper(result)
return result

def _fetch_from_cursor(self, cursor):
def _fetch(self, cursor):
import pandas as pd
rows = cursor.fetchall()
# TODO(wesm): please evaluate/reimpl to optimize for perf/memory
@@ -201,7 +201,7 @@ def raw_sql(self, query, results=False):
"""
return self._execute(query, results=results)

def execute(self, expr, params=None, limit=None, async=False):
def execute(self, expr, params=None, limit='default', async=False):
"""
Compile and execute Ibis expression using this backend client
interface, returning results in-memory in the appropriate object type
@@ -254,13 +254,16 @@ def _build_ast_ensure_limit(self, expr, limit):
not isinstance(expr, ir.ScalarExpr) and
query.table_set is not None):
if query.limit is None:
query_limit = limit or options.sql.default_limit
if limit == 'default':
query_limit = options.sql.default_limit
else:
query_limit = limit
if query_limit:
query.limit = {
'n': query_limit,
'offset': 0
}
elif limit is not None:
elif limit is not None and limit != 'default':
query.limit = {'n': limit,
'offset': query.limit['offset']}
return ast
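
# A sketch of the three limit modes implied above (`con` stands for any
# client instance):
#
#   con.execute(expr)              # limit='default' applies
#                                  # options.sql.default_limit
#   con.execute(expr, limit=None)  # disables row limits entirely
#   con.execute(expr, limit=10)    # returns at most 10 rows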
@@ -306,7 +309,7 @@ class QueryPipeline(object):
pass


def execute(expr, limit=None, async=False):
def execute(expr, limit='default', async=False):
backend = find_backend(expr)
return backend.execute(expr, limit=limit, async=async)

9 changes: 7 additions & 2 deletions ibis/compat.py
@@ -23,8 +23,8 @@
from six import BytesIO, StringIO, string_types as py_string


PY26 = sys.version_info[0] == 2 and sys.version_info[1] == 6
PY3 = (sys.version_info[0] >= 3)
PY26 = sys.version_info[:2] == (2, 6)
PY3 = sys.version_info[0] == 3
PY2 = sys.version_info[0] == 2


@@ -44,6 +44,8 @@ def lzip(*x):
def dict_values(x):
return list(x.values())
from decimal import Decimal
import unittest.mock as mock
range = range
else:
import cPickle

@@ -61,4 +63,7 @@ def dict_values(x):
def dict_values(x):
return x.values()

import mock
range = xrange

integer_types = six.integer_types + (np.integer,)
96 changes: 88 additions & 8 deletions ibis/expr/analysis.py
@@ -567,7 +567,6 @@ class Projector(object):

def __init__(self, parent, proj_exprs):
self.parent = parent

self.input_exprs = proj_exprs

node = self.parent.op()
@@ -580,15 +579,13 @@ def __init__(self, parent, proj_exprs):
self.parent_roots = roots

clean_exprs = []
validator = ExprValidator([parent])
# validator = ExprValidator([parent])

for expr in proj_exprs:
# Perform substitution only if we share common roots
if validator.shares_some_roots(expr):
expr = substitute_parents(expr, past_projection=False)

# if validator.shares_one_root(expr):
# expr = substitute_parents(expr, past_projection=False)
expr = windowize_function(expr)

clean_exprs.append(expr)

self.clean_exprs = clean_exprs
@@ -662,16 +659,99 @@ def validate(self, expr):
return True

def _among_roots(self, node):
return self.roots_shared(node) > 0

def roots_shared(self, node):
count = 0
for root in self.roots:
if root.is_ancestor(node):
return True
return False
count += 1
return count

def shares_some_roots(self, expr):
expr_roots = expr._root_tables()
return any(self._among_roots(root)
for root in expr_roots)

def shares_one_root(self, expr):
expr_roots = expr._root_tables()
total = sum(self.roots_shared(root)
for root in expr_roots)
return total == 1

def shares_multiple_roots(self, expr):
expr_roots = expr._root_tables()
        total = sum(self.roots_shared(root)
for root in expr_roots)
return total > 1

def validate_all(self, exprs):
for expr in exprs:
self.assert_valid(expr)

def assert_valid(self, expr):
if not self.validate(expr):
msg = self._error_message(expr)
raise RelationError(msg)

def _error_message(self, expr):
return ('The expression %s does not fully originate from '
'dependencies of the table expression.' % repr(expr))


class CommonSubexpr(object):

def __init__(self, exprs):
self.parent_exprs = exprs

def validate(self, expr):
if isinstance(expr, ir.TableExpr):
if not self._check(expr):
return False

op = expr.op()

for arg in op.flat_args():
if not isinstance(arg, ir.Expr):
continue
elif not isinstance(arg, ir.TableExpr):
if not self.validate(arg):
return False
else:
# Table expression. Must be found in a parent table expr a
# blocking root of one of the parent tables
if not self._check(arg):
return False

return True

def _check(self, expr):
# Table dependency matches one of the parent exprs
is_valid = False
for parent in self.parent_exprs:
is_valid = is_valid or self._check_table(parent, expr)
return is_valid

def _check_table(self, parent, needle):
def _matches(expr):
op = expr.op()

if expr.equals(needle):
return True

if isinstance(op, ir.BlockingTableNode):
return False

for arg in op.flat_args():
if not isinstance(arg, ir.Expr):
continue
if _matches(arg):
return True

            # no child expression contained the needle
            return False

return _matches(parent)

def validate_all(self, exprs):
for expr in exprs:
self.assert_valid(expr)
161 changes: 120 additions & 41 deletions ibis/expr/api.py
@@ -371,7 +372,8 @@ def f(self, where=None):

def _extract_field(name, klass):
def f(self):
return klass(self).to_expr()
expr = klass(self).to_expr()
return expr.name(name)
f.__name__ = name
return f

@@ -657,16 +658,88 @@ def notin(arg, values):
sub = _binop_expr('__sub__', _ops.Subtract)
mul = _binop_expr('__mul__', _ops.Multiply)
div = _binop_expr('__div__', _ops.Divide)
floordiv = _binop_expr('__floordiv__', _ops.FloorDivide)
pow = _binop_expr('__pow__', _ops.Power)
mod = _binop_expr('__mod__', _ops.Modulus)

rsub = _rbinop_expr('__rsub__', _ops.Subtract)
rdiv = _rbinop_expr('__rdiv__', _ops.Divide)
rfloordiv = _rbinop_expr('__rfloordiv__', _ops.FloorDivide)


def substitute(arg, value, replacement=None, else_=None):
"""
Substitute (replace) one or more values in a value expression
Parameters
----------
value : expr-like or dict
replacement : expr-like, optional
If an expression is passed to value, this must be passed
else_ : expr, optional
Returns
-------
replaced : case statement (for now!)
"""
expr = arg.case()
if isinstance(value, dict):
for k, v in sorted(value.items()):
expr = expr.when(k, v)
else:
expr = expr.when(value, replacement)

if else_ is not None:
expr = expr.else_(else_)
else:
expr = expr.else_(arg)

return expr.end()
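
# A usage sketch for `substitute` (hypothetical table `t` with string
# columns `foo` and `bar`):
#
#   t.foo.substitute({'a': 'one', 'b': t.bar})
#
# builds a CASE expression mapping 'a' -> 'one' and 'b' -> bar, falling
# back to `foo` itself when nothing matches and no `else_` is supplied.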


def _case(arg):
"""
Create a new SimpleCaseBuilder to chain multiple if-else
statements. Add new search expressions with the .when method. These
must be comparable with this array expression. Conclude by calling
.end()
Examples
--------
case_expr = (expr.case()
.when(case1, output1)
.when(case2, output2)
.default(default_output)
.end())
Returns
-------
builder : CaseBuilder
"""
return _ops.SimpleCaseBuilder(arg)


def cases(arg, case_result_pairs, default=None):
"""
    Create a case expression in one shot.

    Returns
    -------
    case_expr : SimpleCase
"""
builder = arg.case()
for case, result in case_result_pairs:
builder = builder.when(case, result)
if default is not None:
builder = builder.else_(default)
return builder.end()


_generic_value_methods = dict(
hash=hash,
cast=cast,
coalesce=coalesce,
typeof=typeof,
fillna=fillna,
nullif=nullif,
@@ -678,6 +751,10 @@ def notin(arg, values):

over=over,

case=_case,
cases=cases,
substitute=substitute,

__add__=add,
add=add,

@@ -689,11 +766,15 @@ def notin(arg, values):

__div__=div,
__truediv__=div,
__floordiv__=floordiv,
div=div,
floordiv=floordiv,

__rdiv__=rdiv,
__rtruediv__=rdiv,
__rfloordiv__=rfloordiv,
rdiv=rdiv,
rfloordiv=rfloordiv,

__pow__=pow,
pow=pow,
@@ -791,44 +872,6 @@ def bottomk(arg, k, by=None):
raise NotImplementedError


def _case(arg):
"""
Create a new SimpleCaseBuilder to chain multiple if-else
statements. Add new search expressions with the .when method. These
must be comparable with this array expression. Conclude by calling
.end()
Examples
--------
case_expr = (expr.case()
.when(case1, output1)
.when(case2, output2)
.default(default_output)
.end())
Returns
-------
builder : CaseBuilder
"""
return _ops.SimpleCaseBuilder(arg)


def cases(arg, case_result_pairs, default=None):
"""
Create a case expression in one shot.
Returns
-------
case_expr : SimpleCase
"""
builder = arg.case()
for case, result in case_result_pairs:
builder = builder.when(case, result)
if default is not None:
builder = builder.else_(default)
return builder.end()


def _generic_summary(arg, exact_nunique=False, prefix=None):
"""
Compute a set of summary metrics from the input value expression
@@ -906,8 +949,6 @@ def expr_list(exprs):


_generic_array_methods = dict(
case=_case,
cases=cases,
bottomk=bottomk,
distinct=distinct,
nunique=nunique,
@@ -1488,7 +1529,25 @@ def _string_dunder_contains(arg, substr):
raise TypeError('Use val.contains(arg)')


def _string_getitem(self, key):
if isinstance(key, slice):
start, stop, step = key.start, key.stop, key.step
if step and step != 1:
raise ValueError('Step can only be 1')

        start = start or 0

        if stop is None:
            # Open-ended slices (e.g. s[2:]) would require the string
            # length, which is not known at expression construction time
            raise ValueError('End index of the slice must be specified')

        if start < 0 or stop < 0:
            raise ValueError('negative slicing not yet supported')

return self.substr(start, stop - start)
else:
raise NotImplementedError
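
# Sketch of the behavior above: for a string column `g`, `g[:3]` lowers to
# `g.substr(0, 3)` and `g[2:6]` to `g.substr(2, 4)`; step values other than
# 1 and negative bounds are rejected.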


_string_value_methods = dict(
__getitem__=_string_getitem,

length=_unary_op('length', _ops.StringLength),
lower=_unary_op('lower', _ops.Lowercase),
upper=_unary_op('upper', _ops.Uppercase),
@@ -2041,10 +2100,30 @@ def _table_view(self):
return TableExpr(_ops.SelfReference(self))


def _table_drop(self, fields):
if len(fields) == 0:
# noop
return self

fields = set(fields)
to_project = []
for name in self.schema():
if name in fields:
fields.remove(name)
else:
to_project.append(name)

if len(fields) > 0:
raise KeyError('Fields not in table: {0!s}'.format(fields))

return self.projection(to_project)
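
# Sketch: for a table `t` with columns a, b and c, `t.drop(['a'])` produces
# the projection t[['b', 'c']]; unknown names raise KeyError and dropping an
# empty list returns `t` unchanged.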


_table_methods = dict(
aggregate=aggregate,
count=_table_count,
distinct=_table_distinct,
drop=_table_drop,
info=_table_info,
limit=_table_limit,
set_column=_table_set_column,
23 changes: 23 additions & 0 deletions ibis/expr/datatypes.py
@@ -48,6 +48,9 @@ def __repr__(self):
def __len__(self):
return len(self.names)

def __iter__(self):
return iter(self.names)

def _repr(self):
buf = StringIO()
space = 2 + max(len(x) for x in self.names)
@@ -59,6 +62,23 @@ def _repr(self):
def __contains__(self, name):
return name in self._name_locs

def __getitem__(self, name):
return self.types[self._name_locs[name]]

def delete(self, names_to_delete):
for name in names_to_delete:
if name not in self:
raise KeyError(name)

new_names, new_types = [], []
for name, type_ in zip(self.names, self.types):
if name in names_to_delete:
continue
new_names.append(name)
new_types.append(type_)

return Schema(new_names, new_types)

@classmethod
def from_tuples(cls, values):
if not isinstance(values, (list, tuple)):
@@ -91,6 +111,9 @@ def append(self, schema):
types = self.types + schema.types
return Schema(names, types)

def items(self):
return zip(self.names, self.types)
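
# A sketch of the mapping-style Schema conveniences added above:
#
#   s = Schema.from_tuples([('a', 'int32'), ('b', 'string')])
#   'a' in s          # True
#   s['a']            # the int32 type
#   list(s)           # ['a', 'b'] -- iteration yields column names
#   list(s.items())   # name/type pairs
#   s.delete(['a'])   # a new Schema containing only 'b'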


class HasSchema(object):

17 changes: 13 additions & 4 deletions ibis/expr/format.py
@@ -14,7 +14,6 @@

import ibis.util as util

import ibis.expr.datatypes as dt
import ibis.expr.types as ir
import ibis.expr.operations as ops

@@ -28,12 +27,22 @@ def __init__(self):
self.aliases = {}
self.ops = {}
self.counts = defaultdict(lambda: 0)
self._repr_memo = {}

def __contains__(self, obj):
return self._key(obj) in self.formatted

def _key(self, obj):
return obj._repr()
memo_key = id(obj)
if memo_key in self._repr_memo:
return self._repr_memo[memo_key]
result = self._format(obj)
self._repr_memo[memo_key] = result

return result

def _format(self, obj):
return obj._repr(memo=self._repr_memo)

def observe(self, obj, formatter=lambda x: x._repr()):
key = self._key(obj)
@@ -81,7 +90,7 @@ def get_result(self):
if self.memoize:
self._memoize_tables()

if isinstance(what, dt.HasSchema):
if isinstance(what, ir.TableNode) and what.has_schema():
# This should also catch aggregations
if not self.memoize and what in self.memo:
text = 'Table: %s' % self.memo.get_alias(what)
@@ -137,7 +146,7 @@ def visit(arg):
visit(op.args)
if isinstance(op, table_memo_ops):
self.memo.observe(op, self._format_node)
elif isinstance(op, dt.HasSchema):
elif isinstance(op, ir.TableNode) and op.has_schema():
self.memo.observe(op, self._format_table)

walk(self.expr)
66 changes: 37 additions & 29 deletions ibis/expr/operations.py
@@ -121,9 +121,6 @@ def __init__(self, table):

Node.__init__(self, [table])

def root_tables(self):
return self.table._root_tables()

def to_expr(self):
ctype = self.table._get_type(self.name)
klass = ctype.array_type()
@@ -154,7 +151,6 @@ class TypeOf(ValueOp):
output_type = rules.shape_like_arg(0, 'string')



class Negate(UnaryOp):

input_type = [number]
@@ -1026,9 +1022,6 @@ def __init__(self, arg):
def output_type(self):
return type(self.arg)

def root_tables(self):
return self.arg._root_tables()

def count(self):
"""
Only valid if the distinct contains a single column
@@ -1309,9 +1302,9 @@ def output_type(self):

class Join(TableNode):

def __init__(self, left, right, join_predicates):
from ibis.expr.analysis import ExprValidator
_arg_names = ['left', 'right', 'predicates']

def __init__(self, left, right, predicates):
if not rules.is_table(left):
raise TypeError('Can only join table expressions, got %s for '
'left table' % type(left))
@@ -1320,21 +1313,32 @@ def __init__(self, left, right, join_predicates):
raise TypeError('Can only join table expressions, got %s for '
                        'right table' % type(right))

if left.equals(right):
right = right.view()

self.left = left
self.right = right
self.predicates = self._clean_predicates(join_predicates)
(self.left,
self.right,
self.predicates) = self._make_distinct(left, right, predicates)

# Validate join predicates. Each predicate must be valid jointly when
# considering the roots of each input table
validator = ExprValidator([self.left, self.right])
from ibis.expr.analysis import CommonSubexpr
validator = CommonSubexpr([self.left, self.right])
validator.validate_all(self.predicates)

Node.__init__(self, [self.left, self.right, self.predicates])

def _clean_predicates(self, predicates):
def _make_distinct(self, left, right, predicates):
# see GH #667

# If left and right table have a common parent expression (e.g. they
# have different filters), must add a self-reference and make the
# appropriate substitution in the join predicates

if left.equals(right):
right = right.view()

predicates = self._clean_predicates(left, right, predicates)
return left, right, predicates

def _clean_predicates(self, left, right, predicates):
import ibis.expr.analysis as L

result = []
@@ -1348,11 +1352,13 @@ def _clean_predicates(self, predicates):
raise com.ExpressionError('Join key tuple must be '
'length 2')
lk, rk = pred
lk = self.left._ensure_expr(lk)
rk = self.right._ensure_expr(rk)
lk = left._ensure_expr(lk)
rk = right._ensure_expr(rk)
pred = lk == rk
else:
pred = L.substitute_parents(pred, past_projection=False)
elif isinstance(pred, six.string_types):
pred = getattr(left, pred) == getattr(right, pred)
elif not isinstance(pred, ir.Expr):
raise NotImplementedError

if not isinstance(pred, ir.BooleanArray):
raise com.ExpressionError('Join predicate must be comparison')
@@ -1640,12 +1646,12 @@ class Projection(ir.BlockingTableNode, HasSchema):
_arg_names = ['table', 'selections']

def __init__(self, table_expr, proj_exprs):
from ibis.expr.analysis import ExprValidator
from ibis.expr.analysis import ExprValidator as Validator

# Need to validate that the column expressions are compatible with the
# input table; this means they must either be scalar expressions or
# array expressions originating from the same root table expression
validator = ExprValidator([table_expr])
validator = Validator([table_expr])

# Resolve schema and initialize
types = []
@@ -1814,13 +1820,18 @@ def output_type(self):

class Divide(BinaryOp):

def output_type(self):
if not util.all_of(self.args, ir.NumericValue):
raise TypeError('One argument was non-numeric')
input_type = [number, number]

def output_type(self):
return rules.shape_like_args(self.args, 'double')


class FloorDivide(Divide):

def output_type(self):
return rules.shape_like_args(self.args, 'int64')


class LogicalBinaryOp(BinaryOp):

def output_type(self):
@@ -2026,9 +2037,6 @@ class Constant(ValueOp):
def __init__(self):
ValueOp.__init__(self, [])

def root_tables(self):
return []


class TimestampNow(Constant):

269 changes: 269 additions & 0 deletions ibis/expr/tests/test_analysis.py
@@ -0,0 +1,269 @@
# Copyright 2014 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import ibis

from ibis.compat import unittest
from ibis.expr.tests.mocks import BasicTestCase
import ibis.expr.analysis as L
import ibis.expr.operations as ops
import ibis.common as com

from ibis.tests.util import assert_equal


# Place to collect esoteric expression analysis bugs and tests


class TestTableExprBasics(BasicTestCase, unittest.TestCase):

def test_rewrite_substitute_distinct_tables(self):
t = self.con.table('test1')
tt = self.con.table('test1')

expr = t[t.c > 0]
expr2 = tt[tt.c > 0]

metric = t.f.sum().name('metric')
expr3 = expr.aggregate(metric)

result = L.sub_for(expr3, [(expr2, t)])
expected = t.aggregate(metric)

assert_equal(result, expected)

def test_rewrite_join_projection_without_other_ops(self):
# Drop out filters and other commutative table operations. Join
# predicates are "lifted" to reference the base, unmodified join roots

# Star schema with fact table
table = self.con.table('star1')
table2 = self.con.table('star2')
table3 = self.con.table('star3')

filtered = table[table['f'] > 0]

pred1 = table['foo_id'] == table2['foo_id']
pred2 = filtered['bar_id'] == table3['bar_id']

j1 = filtered.left_join(table2, [pred1])
j2 = j1.inner_join(table3, [pred2])

# Project out the desired fields
view = j2[[filtered, table2['value1'], table3['value2']]]

# Construct the thing we expect to obtain
ex_pred2 = table['bar_id'] == table3['bar_id']
ex_expr = (table.left_join(table2, [pred1])
.inner_join(table3, [ex_pred2]))

rewritten_proj = L.substitute_parents(view)
op = rewritten_proj.op()
assert_equal(op.table, ex_expr)

# Ensure that filtered table has been substituted with the base table
assert op.selections[0] is table

def test_rewrite_past_projection(self):
table = self.con.table('test1')

# Rewrite past a projection
table3 = table[['c', 'f']]
expr = table3['c'] == 2

result = L.substitute_parents(expr)
expected = table['c'] == 2
assert_equal(result, expected)

# Unsafe to rewrite past projection
table5 = table[(table.f * 2).name('c'), table.f]
expr = table5['c'] == 2
result = L.substitute_parents(expr)
assert result is expr

def test_rewrite_expr_with_parent(self):
table = self.con.table('test1')

table2 = table[table['f'] > 0]

expr = table2['c'] == 2

result = L.substitute_parents(expr)
expected = table['c'] == 2
assert_equal(result, expected)

# Substitution not fully possible if we depend on a new expr in a
# projection

table4 = table[['c', (table['c'] * 2).name('foo')]]
expr = table4['c'] == table4['foo']
result = L.substitute_parents(expr)
expected = table['c'] == table4['foo']
assert_equal(result, expected)

def test_rewrite_distinct_but_equal_objects(self):
t = self.con.table('test1')
t_copy = self.con.table('test1')

table2 = t[t_copy['f'] > 0]

expr = table2['c'] == 2

result = L.substitute_parents(expr)
expected = t['c'] == 2
assert_equal(result, expected)

def test_projection_with_join_pushdown_rewrite_refs(self):
# Observed this expression IR issue in a TopK-rewrite context
table1 = ibis.table([
('a_key1', 'string'),
('a_key2', 'string'),
('a_value', 'double')
], 'foo')

table2 = ibis.table([
('b_key1', 'string'),
('b_name', 'string'),
('b_value', 'double')
], 'bar')

table3 = ibis.table([
('c_key2', 'string'),
('c_name', 'string')
], 'baz')

proj = (table1.inner_join(table2, [('a_key1', 'b_key1')])
.inner_join(table3, [(table1.a_key2, table3.c_key2)])
[table1, table2.b_name.name('b'), table3.c_name.name('c'),
table2.b_value])

cases = [
(proj.a_value > 0, table1.a_value > 0),
(proj.b_value > 0, table2.b_value > 0)
]

for higher_pred, lower_pred in cases:
result = proj.filter([higher_pred])
op = result.op()
assert isinstance(op, ops.Projection)
filter_op = op.table.op()
assert isinstance(filter_op, ops.Filter)
new_pred = filter_op.predicates[0]
assert_equal(new_pred, lower_pred)

def test_multiple_join_deeper_reference(self):
# Join predicates down the chain might reference one or more root
# tables in the hierarchy.
table1 = ibis.table({'key1': 'string', 'key2': 'string',
'value1': 'double'})
table2 = ibis.table({'key3': 'string', 'value2': 'double'})
table3 = ibis.table({'key4': 'string', 'value3': 'double'})

joined = table1.inner_join(table2, [table1['key1'] == table2['key3']])
joined2 = joined.inner_join(table3, [table1['key2'] == table3['key4']])

# it works, what more should we test here?
materialized = joined2.materialize()
repr(materialized)

def test_filter_on_projected_field(self):
# See #173. Impala and other SQL engines do not allow filtering on a
# just-created alias in a projection
region = self.con.table('tpch_region')
nation = self.con.table('tpch_nation')
customer = self.con.table('tpch_customer')
orders = self.con.table('tpch_orders')

fields_of_interest = [customer,
region.r_name.name('region'),
orders.o_totalprice.name('amount'),
orders.o_orderdate
.cast('timestamp').name('odate')]

all_join = (
region.join(nation, region.r_regionkey == nation.n_regionkey)
.join(customer, customer.c_nationkey == nation.n_nationkey)
.join(orders, orders.o_custkey == customer.c_custkey))

tpch = all_join[fields_of_interest]

# Correlated subquery, yikes!
t2 = tpch.view()
conditional_avg = t2[(t2.region == tpch.region)].amount.mean()

# `amount` is part of the projection above as an aliased field
amount_filter = tpch.amount > conditional_avg

result = tpch.filter([amount_filter])

# Now then! Predicate pushdown here is inappropriate, so we check that
# it didn't occur.

# If filter were pushed below projection, the top-level operator type
# would be Projection instead.
assert type(result.op()) == ops.Filter

def test_bad_join_predicate_raises(self):
# Join predicate references a derived table, but we can salvage and
# rewrite it to get the join semantics out
# see ibis #74
table = ibis.table([
('c', 'int32'),
('f', 'double'),
('g', 'string')
], 'foo_table')

table2 = ibis.table([
('key', 'string'),
('value', 'double')
], 'bar_table')

filter_pred = table['f'] > 0
table3 = table[filter_pred]

with self.assertRaises(com.ExpressionError):
table.inner_join(table2, [table3['g'] == table2['key']])

# expected = table.inner_join(table2, [table['g'] == table2['key']])
# assert_equal(result, expected)

def test_filter_self_join(self):
# GH #667
purchases = ibis.table([('region', 'string'),
('kind', 'string'),
('user', 'int64'),
('amount', 'double')], 'purchases')

metric = purchases.amount.sum().name('total')
agged = (purchases.group_by(['region', 'kind'])
.aggregate(metric))

left = agged[agged.kind == 'foo']
right = agged[agged.kind == 'bar']

cond = left.region == right.region
joined = left.join(right, cond)

# unmodified by analysis
assert_equal(joined.op().predicates[0], cond)

metric = (left.total - right.total).name('diff')
what = [left.region, metric]
projected = joined.projection(what)

proj_exprs = projected.op().selections

# proj exprs unaffected by analysis
assert_equal(proj_exprs[0], left.region)
assert_equal(proj_exprs[1], metric)
20 changes: 20 additions & 0 deletions ibis/expr/tests/test_format.py
@@ -177,3 +177,23 @@ def test_named_value_expr_show_name(self):

# not really committing to a particular output yet
assert 'baz' in result2

def test_memoize_filtered_tables_in_join(self):
# related: GH #667
purchases = ibis.table([('region', 'string'),
('kind', 'string'),
('user', 'int64'),
('amount', 'double')], 'purchases')

metric = purchases.amount.sum().name('total')
agged = (purchases.group_by(['region', 'kind'])
.aggregate(metric))

left = agged[agged.kind == 'foo']
right = agged[agged.kind == 'bar']

cond = left.region == right.region
joined = left.join(right, cond)

result = repr(joined)
assert result.count('Filter') == 2
24 changes: 17 additions & 7 deletions ibis/expr/tests/test_sql_builtins.py
@@ -14,9 +14,10 @@

from ibis.expr.tests.mocks import MockConnection
from ibis.compat import unittest
import ibis.expr.api as api
from ibis.tests.util import assert_equal
import ibis.expr.operations as ops
import ibis.expr.types as ir
import ibis


class TestBuiltins(unittest.TestCase):
@@ -83,8 +84,8 @@ def test_ceil_floor(self):
assert type(cresult.op()) == ops.Ceil
assert type(fresult.op()) == ops.Floor

cresult = api.literal(1.2345).ceil()
fresult = api.literal(1.2345).floor()
cresult = ibis.literal(1.2345).ceil()
fresult = ibis.literal(1.2345).floor()
assert isinstance(cresult, ir.Int64Scalar)
assert isinstance(fresult, ir.Int64Scalar)

@@ -102,7 +103,7 @@ def test_sign(self):
assert isinstance(result, ir.FloatArray)
assert type(result.op()) == ops.Sign

result = api.literal(1.2345).sign()
result = ibis.literal(1.2345).sign()
assert isinstance(result, ir.FloatScalar)

dec_col = self.lineitem.l_extendedprice
@@ -130,7 +131,7 @@ def test_round(self):
result = dec.round(2)
assert isinstance(result, ir.DecimalArray)

result = api.literal(1.2345).round()
result = ibis.literal(1.2345).round()
assert isinstance(result, ir.Int64Scalar)

def _check_unary_op(self, expr, fname, ex_op, ex_type):
@@ -142,7 +143,7 @@ def _check_unary_op(self, expr, fname, ex_op, ex_type):
class TestCoalesceLikeFunctions(unittest.TestCase):

def setUp(self):
self.table = api.table([
self.table = ibis.table([
('v1', 'decimal(12, 2)'),
('v2', 'decimal(10, 4)'),
('v3', 'int32'),
@@ -153,7 +154,16 @@ def setUp(self):
('v8', 'boolean')
], 'testing')

self.functions = [api.coalesce, api.greatest, api.least]
self.functions = [ibis.coalesce, ibis.greatest, ibis.least]

def test_coalesce_instance_method(self):
v7 = self.table.v7
v5 = self.table.v5.cast('string')
v8 = self.table.v8.cast('string')

result = v7.coalesce(v5, v8, 'foo')
expected = ibis.coalesce(v7, v5, v8, 'foo')
assert_equal(result, expected)

def test_integer_promotions(self):
t = self.table
12 changes: 11 additions & 1 deletion ibis/expr/tests/test_string.py
@@ -18,6 +18,7 @@

from ibis.expr.tests.mocks import MockConnection
from ibis.compat import unittest
from ibis.tests.util import assert_equal


class TestStringOps(unittest.TestCase):
@@ -92,6 +93,15 @@ def test_join(self):
def test_contains(self):
expr = self.table.g.contains('foo')
expected = self.table.g.like('%foo%')
assert expr.equals(expected)
assert_equal(expr, expected)

self.assertRaises(Exception, lambda: 'foo' in self.table.g)

def test_getitem_slice(self):
cases = [
(self.table.g[:3], self.table.g.substr(0, 3)),
(self.table.g[2:6], self.table.g.substr(2, 4)),
]

for case, expected in cases:
assert_equal(case, expected)
259 changes: 24 additions & 235 deletions ibis/expr/tests/test_table.py
@@ -15,7 +15,6 @@
from ibis.expr.types import ArrayExpr, TableExpr, RelationError
from ibis.common import ExpressionError
from ibis.expr.datatypes import array_type
import ibis.expr.analysis as L
import ibis.expr.api as api
import ibis.expr.types as ir
import ibis.expr.operations as ops
@@ -189,6 +188,11 @@ def test_projection_self(self):

assert_equal(result, expected)

def test_projection_array_expr(self):
result = self.table[self.table.a]
expected = self.table[[self.table.a]]
assert_equal(result, expected)

def test_add_column(self):
# Creates a projection with a select-all on top of a non-projection
# TableExpr
@@ -303,38 +307,6 @@ def test_add_predicate_coalesce(self):
result = interm.filter([interm['b'] > 0])
assert_equal(result, expected)

def test_rewrite_expr_with_parent(self):
table = self.con.table('test1')

table2 = table[table['f'] > 0]

expr = table2['c'] == 2

result = L.substitute_parents(expr)
expected = table['c'] == 2
assert_equal(result, expected)

# Substitution not fully possible if we depend on a new expr in a
# projection

table4 = table[['c', (table['c'] * 2).name('foo')]]
expr = table4['c'] == table4['foo']
result = L.substitute_parents(expr)
expected = table['c'] == table4['foo']
assert_equal(result, expected)

def test_rewrite_distinct_but_equal_objects(self):
t = self.con.table('test1')
t_copy = self.con.table('test1')

table2 = t[t_copy['f'] > 0]

expr = table2['c'] == 2

result = L.substitute_parents(expr)
expected = t['c'] == 2
assert_equal(result, expected)

def test_repr_same_but_distinct_objects(self):
t = self.con.table('test1')
t_copy = self.con.table('test1')
@@ -357,128 +329,6 @@ def test_filter_fusion_distinct_table_objects(self):
assert_equal(expr, expr3)
assert_equal(expr, expr4)

def test_rewrite_substitute_distinct_tables(self):
t = self.con.table('test1')
tt = self.con.table('test1')

expr = t[t.c > 0]
expr2 = tt[tt.c > 0]

metric = t.f.sum().name('metric')
expr3 = expr.aggregate(metric)

result = L.sub_for(expr3, [(expr2, t)])
expected = t.aggregate(metric)

assert_equal(result, expected)

def test_rewrite_join_projection_without_other_ops(self):
# Drop out filters and other commutative table operations. Join
# predicates are "lifted" to reference the base, unmodified join roots

# Star schema with fact table
table = self.con.table('star1')
table2 = self.con.table('star2')
table3 = self.con.table('star3')

filtered = table[table['f'] > 0]

pred1 = table['foo_id'] == table2['foo_id']
pred2 = filtered['bar_id'] == table3['bar_id']

j1 = filtered.left_join(table2, [pred1])
j2 = j1.inner_join(table3, [pred2])

# Project out the desired fields
view = j2[[filtered, table2['value1'], table3['value2']]]

# Construct the thing we expect to obtain
ex_pred2 = table['bar_id'] == table3['bar_id']
ex_expr = (table.left_join(table2, [pred1])
.inner_join(table3, [ex_pred2]))

rewritten_proj = L.substitute_parents(view)
op = rewritten_proj.op()
assert_equal(op.table, ex_expr)

# Ensure that filtered table has been substituted with the base table
assert op.selections[0] is table

def test_rewrite_past_projection(self):
table = self.con.table('test1')

# Rewrite past a projection
table3 = table[['c', 'f']]
expr = table3['c'] == 2

result = L.substitute_parents(expr)
expected = table['c'] == 2
assert_equal(result, expected)

# Unsafe to rewrite past projection
table5 = table[(table.f * 2).name('c'), table.f]
expr = table5['c'] == 2
result = L.substitute_parents(expr)
assert result is expr

def test_projection_predicate_pushdown(self):
# Probably test this during the evaluation phase. In SQL, "fusable"
# table operations will be combined together into a single select
# statement
#
# see ibis #71 for more on this
t = self.table
proj = t['a', 'b', 'c']

# Rewrite a little more aggressively here
result = proj[t.a > 0]

# at one point these yielded different results
filtered = t[t.a > 0]
expected = filtered[t.a, t.b, t.c]
expected2 = filtered.projection(['a', 'b', 'c'])

assert_equal(result, expected)
assert_equal(result, expected2)

def test_projection_with_join_pushdown_rewrite_refs(self):
# Observed this expression IR issue in a TopK-rewrite context
table1 = api.table([
('a_key1', 'string'),
('a_key2', 'string'),
('a_value', 'double')
], 'foo')

table2 = api.table([
('b_key1', 'string'),
('b_name', 'string'),
('b_value', 'double')
], 'bar')

table3 = api.table([
('c_key2', 'string'),
('c_name', 'string')
], 'baz')

proj = (table1.inner_join(table2, [('a_key1', 'b_key1')])
.inner_join(table3, [(table1.a_key2, table3.c_key2)])
[table1, table2.b_name.name('b'), table3.c_name.name('c'),
table2.b_value])

cases = [
(proj.a_value > 0, table1.a_value > 0),
(proj.b_value > 0, table2.b_value > 0)
]

for higher_pred, lower_pred in cases:
result = proj.filter([higher_pred])
op = result.op()
assert isinstance(op, ops.Projection)
filter_op = op.table.op()
assert isinstance(filter_op, ops.Filter)
new_pred = filter_op.predicates[0]
assert_equal(new_pred, lower_pred)

def test_column_relabel(self):
# GH #551. Keeping the test case very high level to not presume that
# the relabel is necessarily implemented using a projection
@@ -566,6 +416,10 @@ def test_table_count(self):
assert isinstance(result.op(), ops.Count)
assert result.get_name() == 'count'

def test_len_raises_expression_error(self):
with self.assertRaises(com.ExpressionError):
len(self.table)

def test_sum_expr_basics(self):
# Impala gives bigint for all integer types
ex_class = api.Int64Scalar
@@ -957,21 +811,6 @@ def test_join_compound_boolean_predicate(self):
# The user might have composed predicates through logical operations
pass

def test_multiple_join_deeper_reference(self):
# Join predicates down the chain might reference one or more root
# tables in the hierarchy.
table1 = ibis.table({'key1': 'string', 'key2': 'string',
'value1': 'double'})
table2 = ibis.table({'key3': 'string', 'value2': 'double'})
table3 = ibis.table({'key4': 'string', 'value3': 'double'})

joined = table1.inner_join(table2, [table1['key1'] == table2['key3']])
joined2 = joined.inner_join(table3, [table1['key2'] == table3['key4']])

# it works, what more should we test here?
materialized = joined2.materialize()
repr(materialized)

def test_filter_join_unmaterialized(self):
table1 = ibis.table({'key1': 'string', 'key2': 'string',
'value1': 'double'})
@@ -982,72 +821,22 @@ def test_filter_join_unmaterialized(self):
filtered = joined.filter([table1.value1 > 0])
repr(filtered)

def test_filter_on_projected_field(self):
# See #173. Impala and other SQL engines do not allow filtering on a
# just-created alias in a projection
region = self.con.table('tpch_region')
nation = self.con.table('tpch_nation')
customer = self.con.table('tpch_customer')
orders = self.con.table('tpch_orders')

fields_of_interest = [customer,
region.r_name.name('region'),
orders.o_totalprice.name('amount'),
orders.o_orderdate
.cast('timestamp').name('odate')]

all_join = (
region.join(nation, region.r_regionkey == nation.n_regionkey)
.join(customer, customer.c_nationkey == nation.n_nationkey)
.join(orders, orders.o_custkey == customer.c_custkey))

tpch = all_join[fields_of_interest]

# Correlated subquery, yikes!
t2 = tpch.view()
conditional_avg = t2[(t2.region == tpch.region)].amount.mean()

# `amount` is part of the projection above as an aliased field
amount_filter = tpch.amount > conditional_avg

result = tpch.filter([amount_filter])

# Now then! Predicate pushdown here is inappropriate, so we check that
# it didn't occur.

# If filter were pushed below projection, the top-level operator type
# would be Projection instead.
assert type(result.op()) == ops.Filter

def test_join_can_rewrite_errant_predicate(self):
# Join predicate references a derived table, but we can salvage and
# rewrite it to get the join semantics out
# see ibis #74
table = ibis.table([
('c', 'int32'),
('f', 'double'),
('g', 'string')
], 'foo_table')

table2 = ibis.table([
('key', 'string'),
('value', 'double')
], 'bar_table')

filter_pred = table['f'] > 0
table3 = table[filter_pred]

result = table.inner_join(table2, [table3['g'] == table2['key']])
expected = table.inner_join(table2, [table['g'] == table2['key']])
assert_equal(result, expected)

def test_non_equijoins(self):
# Move non-equijoin predicates to WHERE during SQL translation if
# possible, per #107
pass

def test_join_overlapping_column_names(self):
pass
t1 = ibis.table([('foo', 'string'),
('bar', 'string'),
('value1', 'double')])
t2 = ibis.table([('foo', 'string'),
('bar', 'string'),
('value2', 'double')])

joined = t1.join(t2, 'foo')
expected = t1.join(t2, t1.foo == t2.foo)
assert_equal(joined, expected)

joined = t1.join(t2, ['foo', 'bar'])
expected = t1.join(t2, [t1.foo == t2.foo,
t1.bar == t2.bar])
assert_equal(joined, expected)

def test_join_key_alternatives(self):
t1 = self.con.table('star1')
12 changes: 3 additions & 9 deletions ibis/expr/tests/test_timestamp.py
@@ -15,7 +15,6 @@
import pandas as pd

import ibis
import ibis.common as com
import ibis.expr.api as api
import ibis.expr.operations as ops
import ibis.expr.types as ir
@@ -50,21 +49,16 @@ def test_extract_fields(self):
('day', ops.ExtractDay, ir.Int32Array),
('hour', ops.ExtractHour, ir.Int32Array),
('minute', ops.ExtractMinute, ir.Int32Array),
('second', ops.ExtractSecond, ir.Int32Array)
('second', ops.ExtractSecond, ir.Int32Array),
('millisecond', ops.ExtractMillisecond, ir.Int32Array),
]

for attr, ex_op, ex_type in cases:
result = getattr(self.col, attr)()
assert result.get_name() == attr
assert isinstance(result, ex_type)
assert isinstance(result.op(), ex_op)

def test_extract_no_propagate_name(self):
# see #146
table = self.con.table('functional_alltypes')

expr = table.timestamp_col.hour()
self.assertRaises(com.ExpressionError, expr.get_name)

def test_now(self):
result = api.now()
assert isinstance(result, ir.TimestampScalar)
24 changes: 24 additions & 0 deletions ibis/expr/tests/test_value_exprs.py
@@ -668,3 +668,27 @@ def test_concat(self):
result = list1.concat(list2)
expected = ibis.expr_list(exprs + exprs2)
assert_equal(result, expected)


class TestSubstitute(unittest.TestCase):

def setUp(self):
self.table = ibis.table([('foo', 'string'),
('bar', 'string')], 't1')

def test_substitute_dict(self):
subs = {'a': 'one', 'b': self.table.bar}

result = self.table.foo.substitute(subs)
expected = (self.table.foo.case()
.when('a', 'one')
.when('b', self.table.bar)
.else_(self.table.foo).end())
assert_equal(result, expected)

result = self.table.foo.substitute(subs, else_=ibis.NA)
expected = (self.table.foo.case()
.when('a', 'one')
.when('b', self.table.bar)
.else_(ibis.NA).end())
assert_equal(result, expected)
44 changes: 37 additions & 7 deletions ibis/expr/types.py
@@ -58,7 +58,7 @@ def __repr__(self):
else:
return self._repr()

def _repr(self):
def _repr(self, memo=None):
from ibis.expr.format import ExprFormatter
return ExprFormatter(self).get_result()

@@ -118,11 +118,17 @@ def factory(arg, name=None):
def _can_implicit_cast(self, arg):
return False

def execute(self, limit=None, async=False):
def execute(self, limit='default', async=False):
"""
        If this expression is based on physical tables in a database backend,
        execute it against that backend.

        Parameters
        ----------
        limit : integer or None, default 'default'
          Pass an integer to effect a specific row limit. limit=None means "no
          limit". The default is whatever is in ibis.options.

        Returns
        -------
        result : expression-dependent
@@ -170,8 +176,8 @@ def _get_unbound_tables(self):
pass


def _safe_repr(x):
return x._repr() if isinstance(x, Expr) else repr(x)
def _safe_repr(x, memo=None):
return x._repr(memo=memo) if isinstance(x, (Expr, Node)) else repr(x)


class Node(object):
@@ -195,12 +201,27 @@ def __init__(self, args):
def __repr__(self):
return self._repr()

def _repr(self):
def _repr(self, memo=None):
# Quick and dirty to get us started
opname = type(self).__name__
pprint_args = []

_pp = _safe_repr
memo = memo or {}

if id(self) in memo:
return memo[id(self)]

def _pp(x):
if isinstance(x, Expr):
key = id(x.op())
else:
key = id(x)

if key in memo:
return memo[key]
result = _safe_repr(x, memo=memo)
memo[key] = result
return result

for x in self.args:
if isinstance(x, (tuple, list)):
@@ -557,9 +578,15 @@ def __getitem__(self, what):
elif isinstance(what, BooleanArray):
# Boolean predicate
return self.filter([what])
elif isinstance(what, ArrayExpr):
# Projection convenience
return self.projection(what)
else:
raise NotImplementedError

def __len__(self):
raise com.ExpressionError('Use .count() instead')

def __getattr__(self, key):
try:
return object.__getattribute__(self, key)
@@ -1104,7 +1131,10 @@ def find_all_base_tables(expr, memo=None):
if memo is None:
memo = {}

if isinstance(expr, TableExpr):
node = expr.op()

if (isinstance(expr, TableExpr) and
isinstance(node, BlockingTableNode)):
if id(expr) not in memo:
memo[id(expr)] = expr
return memo
6 changes: 4 additions & 2 deletions ibis/impala/api.py
@@ -11,8 +11,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from ibis.impala.client import (ImpalaConnection, ImpalaClient, # noqa
Database, ImpalaTable)
from ibis.impala.client import (ImpalaConnection, # noqa
ImpalaClient,
ImpalaDatabase,
ImpalaTable)
from ibis.impala.udf import * # noqa
from ibis.impala.madlib import MADLibAPI # noqa
from ibis.config import options
909 changes: 676 additions & 233 deletions ibis/impala/client.py

Large diffs are not rendered by default.

18 changes: 11 additions & 7 deletions ibis/impala/compiler.py
@@ -1037,15 +1037,10 @@ def _parse_url(translator, expr):

if key is None:
return "parse_url({0}, '{1}')".format(arg_formatted, extract)
elif not isinstance(key.op(), ir.Literal):
key_fmt = translator.translate(key)
return "parse_url({0}, '{1}', {2})".format(arg_formatted,
extract,
key_fmt)
else:
key_fmt = translator.translate(key)
return "parse_url({0}, '{1}', {2})".format(arg_formatted,
extract,
key)
extract, key_fmt)


def _find_in_set(translator, expr):
@@ -1309,3 +1304,12 @@ class ImpalaExprTranslator(comp.ExprTranslator):
def name(self, translated, name, force=True):
return _name_expr(translated,
quote_identifier(name, force=force))

compiles = ImpalaExprTranslator.compiles
rewrites = ImpalaExprTranslator.rewrites


@rewrites(ops.FloorDivide)
def _floor_divide(expr):
left, right = expr.op().args
return left.div(right).floor()
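
# Sketch: with this rewrite an expression like `t.a // t.b` becomes
# `(t.a / t.b).floor()`, so the generated Impala SQL reads
# FLOOR(`a` / `b`) rather than using a dedicated floor-division operator.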
224 changes: 201 additions & 23 deletions ibis/impala/ddl.py
@@ -59,6 +59,30 @@ def _if_exists(self):
return 'IF NOT EXISTS ' if self.can_exist else ''


_format_aliases = {
'TEXT': 'TEXTFILE'
}


def _sanitize_format(format):
if format is None:
return
format = format.upper()
format = _format_aliases.get(format, format)
if format not in ('PARQUET', 'AVRO', 'TEXTFILE'):
raise ValueError('Invalid format: {0}'.format(format))

return format


def _format_properties(props):
tokens = []
for k, v in sorted(props.items()):
tokens.append("'{0!s}'='{1!s}'".format(k, v))

return '({0})'.format(', '.join(tokens))


class CreateTable(CreateDDL):

"""
@@ -78,13 +102,7 @@ def __init__(self, table_name, database=None, external=False,
self.path = path
self.external = external
self.can_exist = can_exist
self.format = self._validate_storage_format(format)

def _validate_storage_format(self, format):
format = format.lower()
if format not in ('parquet', 'avro'):
raise ValueError('Invalid format: {0}'.format(format))
return format
self.format = _sanitize_format(format)

def _create_line(self):
scoped_name = self._get_scoped_name(self.table_name, self.database)
@@ -105,8 +123,8 @@ def _location(self):

def _storage(self):
storage_lines = {
'parquet': '\nSTORED AS PARQUET',
'avro': '\nSTORED AS AVRO'
'PARQUET': '\nSTORED AS PARQUET',
'AVRO': '\nSTORED AS AVRO'
}
return storage_lines[self.format]

@@ -209,7 +227,7 @@ def __init__(self, table_name, schema, table_format, **kwargs):
CreateTable.__init__(self, table_name, **kwargs)

def compile(self):
from ibis.expr.api import schema
from ibis.expr.api import Schema

buf = StringIO()
buf.write(self._create_line())
@@ -219,18 +237,25 @@ def _push_schema(x):
buf.write('{0}'.format(formatted))

if self.partition is not None:
modified_schema = []
partition_schema = []
for name, dtype in zip(self.schema.names, self.schema.types):
if name in self.partition:
partition_schema.append((name, dtype))
else:
modified_schema.append((name, dtype))
main_schema = self.schema
part_schema = self.partition
if not isinstance(part_schema, Schema):
part_schema = Schema(
part_schema,
[self.schema[name] for name in part_schema])

to_delete = []
for name in self.partition:
if name in self.schema:
to_delete.append(name)

if len(to_delete):
main_schema = main_schema.delete(to_delete)

buf.write('\n')
_push_schema(schema(modified_schema))
_push_schema(main_schema)
buf.write('\nPARTITIONED BY ')
_push_schema(schema(partition_schema))
_push_schema(part_schema)
else:
buf.write('\n')
_push_schema(self.schema)
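
Illustrative only (identifier quoting and type names depend on the elided
schema formatter): with columns value: double and year: int32 and
partition=['year'], the year column is removed from the main column list and
re-emitted under PARTITIONED BY, giving DDL roughly like:

    CREATE TABLE t (
      value double
    )
    PARTITIONED BY (
      year int
    )
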
@@ -253,11 +278,12 @@ def to_ddl(self):
class DelimitedFormat(object):

def __init__(self, path, delimiter=None, escapechar=None,
lineterminator=None):
na_rep=None, lineterminator=None):
self.path = path
self.delimiter = delimiter
self.escapechar = escapechar
self.lineterminator = lineterminator
self.na_rep = na_rep

def to_ddl(self):
buf = StringIO()
@@ -276,6 +302,10 @@ def to_ddl(self):

buf.write("\nLOCATION '{0}'".format(self.path))

if self.na_rep is not None:
buf.write("\nTBLPROPERTIES('serialization.null.format'='{0}')"
.format(self.na_rep))

return buf.getvalue()


@@ -304,10 +334,11 @@ class CreateTableDelimited(CreateTableWithSchema):

def __init__(self, table_name, path, schema,
delimiter=None, escapechar=None, lineterminator=None,
external=True, **kwargs):
na_rep=None, external=True, **kwargs):
table_format = DelimitedFormat(path, delimiter=delimiter,
escapechar=escapechar,
lineterminator=lineterminator)
lineterminator=lineterminator,
na_rep=na_rep)
CreateTableWithSchema.__init__(self, table_name, schema,
table_format, external=external,
**kwargs)
@@ -333,11 +364,16 @@ def compile(self):
class InsertSelect(ImpalaDDL):

def __init__(self, table_name, select_expr, database=None,
partition=None,
partition_schema=None,
overwrite=False):
self.table_name = table_name
self.database = database
self.select = select_expr

self.partition = partition
self.partition_schema = partition_schema

self.overwrite = overwrite

def compile(self):
@@ -346,16 +382,158 @@ def compile(self):
else:
cmd = 'INSERT INTO'

if self.partition is not None:
part = _format_partition(self.partition,
self.partition_schema)
partition = ' {0} '.format(part)
else:
partition = ''

select_query = self.select.compile()
scoped_name = self._get_scoped_name(self.table_name, self.database)
        return '{0} {1}\n{2}'.format(cmd, scoped_name, select_query)
        return '{0} {1}{2}\n{3}'.format(cmd, scoped_name, partition,
                                        select_query)


def _format_partition(partition, partition_schema):
tokens = []
if isinstance(partition, dict):
for name in partition_schema:
if name in partition:
tok = '{0}={1}'.format(name, partition[name])
else:
# dynamic partitioning
tok = name
tokens.append(tok)
else:
for name, value in zip(partition_schema, partition):
tok = '{0}={1}'.format(name, value)
tokens.append(tok)

return 'PARTITION ({0})'.format(', '.join(tokens))
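
The two input shapes, traced through the code above:

    _format_partition([2015, 7], ['year', 'month'])
    # -> "PARTITION (year=2015, month=7)"      (fully static)

    _format_partition({'year': 2015}, ['year', 'month'])
    # -> "PARTITION (year=2015, month)"        (month left dynamic)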


class LoadData(ImpalaDDL):

"""
    Generate DDL for a LOAD DATA command. The operation cannot be cancelled.
"""

def __init__(self, table_name, path, database=None,
partition=None, partition_schema=None,
overwrite=False):
self.table_name = table_name
self.database = database
self.path = path

self.partition = partition
self.partition_schema = partition_schema

self.overwrite = overwrite

def compile(self):
overwrite = 'OVERWRITE ' if self.overwrite else ''

if self.partition is not None:
partition = '\n' + _format_partition(self.partition,
self.partition_schema)
else:
partition = ''

scoped_name = self._get_scoped_name(self.table_name, self.database)
return ("LOAD DATA INPATH '{0}' {1}INTO TABLE {2}{3}"
.format(self.path, overwrite, scoped_name, partition))
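
A sketch of the generated statement, assuming no database scoping:

    stmt = LoadData('my_table', '/path/to/files',
                    partition={'year': 2015}, partition_schema=['year'],
                    overwrite=True)
    print(stmt.compile())
    # LOAD DATA INPATH '/path/to/files' OVERWRITE INTO TABLE my_table
    # PARTITION (year=2015)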


class AlterTable(ImpalaDDL):

def __init__(self, table, location=None, format=None, tbl_properties=None,
serde_properties=None):
self.table = table
self.location = location
self.format = _sanitize_format(format)
self.tbl_properties = tbl_properties
self.serde_properties = serde_properties

def _wrap_command(self, cmd):
return 'ALTER TABLE {0}'.format(cmd)

def _format_properties(self, prefix=''):
tokens = []

if self.location is not None:
tokens.append("LOCATION '{0}'".format(self.location))

if self.format is not None:
tokens.append("FILEFORMAT {0}".format(self.format))

if self.tbl_properties is not None:
props = _format_properties(self.tbl_properties)
tokens.append('TBLPROPERTIES {0}'.format(props))

if self.serde_properties is not None:
props = _format_properties(self.serde_properties)
tokens.append('SERDEPROPERTIES {0}'.format(props))

if len(tokens) > 0:
return '\n{0}{1}'.format(prefix, '\n'.join(tokens))
else:
return ''

def compile(self):
props = self._format_properties()
action = '{0} SET {1}'.format(self.table, props)
return self._wrap_command(action)
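
A sketch of the compiled output, again assuming no database scoping (note
the 'text' alias being normalized):

    stmt = AlterTable('my_table', location='/new/location', format='text',
                      tbl_properties={'numRows': 1000})
    print(stmt.compile())
    # ALTER TABLE my_table SET
    # LOCATION '/new/location'
    # FILEFORMAT TEXTFILE
    # TBLPROPERTIES ('numRows'='1000')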


class PartitionProperties(AlterTable):

def __init__(self, table, partition, partition_schema,
location=None, format=None,
tbl_properties=None, serde_properties=None):
self.partition = partition
self.partition_schema = partition_schema

AlterTable.__init__(self, table, location=location, format=format,
tbl_properties=tbl_properties,
serde_properties=serde_properties)

def _compile(self, cmd, property_prefix=''):
part = _format_partition(self.partition, self.partition_schema)
if cmd:
part = '{0} {1}'.format(cmd, part)

props = self._format_properties(property_prefix)
action = '{0} {1}{2}'.format(self.table, part, props)
return self._wrap_command(action)


class AddPartition(PartitionProperties):

def __init__(self, table, partition, partition_schema, location=None):
PartitionProperties.__init__(self, table, partition,
partition_schema,
location=location)

def compile(self):
return self._compile('ADD')


class AlterPartition(PartitionProperties):

def compile(self):
return self._compile('', 'SET ')


class DropPartition(PartitionProperties):

def __init__(self, table, partition, partition_schema):
PartitionProperties.__init__(self, table, partition,
partition_schema)

def compile(self):
return self._compile('DROP')
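
Sketches of the partition-level DDL these compile to, using list-style
static partition specs:

    AddPartition('t', [2015], ['year'], location='/data/2015').compile()
    # ALTER TABLE t ADD PARTITION (year=2015)
    # LOCATION '/data/2015'

    DropPartition('t', [2015], ['year']).compile()
    # ALTER TABLE t DROP PARTITION (year=2015)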


class RenameTable(AlterTable):

320 changes: 320 additions & 0 deletions ibis/impala/metadata.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,320 @@
# Copyright 2014 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from six import StringIO
import pandas as pd


def parse_metadata(descr_table):
parser = MetadataParser(descr_table)
return parser.parse()


def _noop(tup):
return None


def _item_converter(i):
def _get_item(converter=None):
def _converter(tup):
result = tup[i]
if converter is not None:
result = converter(result)
return result

return _converter

return _get_item

_get_type = _item_converter(1)
_get_comment = _item_converter(2)


def _try_timestamp(x):
try:
return pd.Timestamp(x)
except (ValueError, TypeError):
return x


def _try_unix_timestamp(x):
try:
return pd.Timestamp.fromtimestamp(int(x))
except (ValueError, TypeError):
return x


def _try_boolean(x):
try:
x = x.lower()
if x in ('true', 'yes'):
return True
elif x in ('false', 'no'):
return False
return x
except (ValueError, TypeError):
return x


def _try_int(x):
try:
return int(x)
except (ValueError, TypeError):
return x


class MetadataParser(object):

"""
A simple state-ish machine to parse the results of DESCRIBE FORMATTED
"""

def __init__(self, table):
self.table = table
self.tuples = list(self.table.itertuples(index=False))

def _reset(self):
self.pos = 0
self.schema = None
self.partitions = None
self.info = None
self.storage = None

def _next_tuple(self):
if self.pos == len(self.tuples):
raise StopIteration

result = self.tuples[self.pos]
self.pos += 1
return result

def parse(self):
self._reset()
self._parse()

return TableMetadata(self.schema, self.info, self.storage,
partitions=self.partitions)

def _parse(self):
self.schema = self._parse_schema()

next_section = self._next_tuple()
if 'partition' in next_section[0].lower():
self._parse_partitions()
else:
self._parse_info()

def _parse_partitions(self):
self.partitions = self._parse_schema()

next_section = self._next_tuple()
if 'table information' not in next_section[0].lower():
raise ValueError('Table information not present')

self._parse_info()

def _parse_schema(self):
tup = self._next_tuple()
if 'col_name' not in tup[0]:
raise ValueError('DESCRIBE FORMATTED did not return '
'the expected results: {0}'
.format(tup))
self._next_tuple()

# Use for both main schema and partition schema (if any)
schema = []
while True:
tup = self._next_tuple()
if tup[0].strip() == '':
break
schema.append((tup[0], tup[1]))

return schema

def _parse_info(self):
self.info = {}
while True:
tup = self._next_tuple()
orig_key = tup[0].strip(':')
key = _clean_param_name(tup[0])

if key == '' or key.startswith('#'):
# section is done
break

if key == 'table parameters':
self._parse_table_parameters()
elif key in self._info_cleaners:
result = self._info_cleaners[key](tup)
self.info[orig_key] = result
else:
self.info[orig_key] = tup[1]

if 'storage information' not in key:
raise ValueError('Storage information not present')

self._parse_storage_info()

_info_cleaners = {
'database': _get_type(),
'owner': _get_type(),
'createtime': _get_type(_try_timestamp),
'lastaccesstime': _get_type(_try_timestamp),
'protect mode': _get_type(),
'retention': _get_type(_try_int),
'location': _get_type(),
'table type': _get_type()
}

def _parse_table_parameters(self):
params = self._parse_nested_params(self._table_param_cleaners)
self.info['Table Parameters'] = params

_table_param_cleaners = {
'external': _try_boolean,
'column_stats_accurate': _try_boolean,
'numfiles': _try_int,
'totalsize': _try_int,
'stats_generated_via_stats_task': _try_boolean,
'numrows': _try_int,
'transient_lastddltime': _try_unix_timestamp,
}

def _parse_storage_info(self):
self.storage = {}
while True:
# end of the road
try:
tup = self._next_tuple()
except StopIteration:
break

orig_key = tup[0].strip(':')
key = _clean_param_name(tup[0])

if key == '' or key.startswith('#'):
# section is done
break

if key == 'storage desc params':
self._parse_storage_desc_params()
elif key in self._storage_cleaners:
result = self._storage_cleaners[key](tup)
self.storage[orig_key] = result
else:
self.storage[orig_key] = tup[1]

_storage_cleaners = {
'compressed': _get_type(_try_boolean),
'num buckets': _get_type(_try_int),
}

def _parse_storage_desc_params(self):
params = self._parse_nested_params(self._storage_param_cleaners)
self.storage['Desc Params'] = params

_storage_param_cleaners = {}

def _parse_nested_params(self, cleaners):
params = {}
while True:
try:
tup = self._next_tuple()
except StopIteration:
break
if pd.isnull(tup[1]):
break

key, value = tup[1:]

if key.lower() in cleaners:
cleaner = cleaners[key.lower()]
value = cleaner(value)
params[key] = value

return params


def _clean_param_name(x):
return x.strip().strip(':').lower()


def _get_meta(attr, key):
@property
def f(self):
data = getattr(self, attr)
if isinstance(key, list):
result = data
for k in key:
if k not in result:
raise KeyError(k)
result = result[k]
return result
else:
return data[key]
return f


class TableMetadata(object):

"""
Container for the parsed and wrangled results of DESCRIBE FORMATTED for
easier Ibis use (and testing).
"""
def __init__(self, schema, info, storage, partitions=None):
self.schema = schema
self.info = info
self.storage = storage
self.partitions = partitions

def __repr__(self):
import pprint

# Quick and dirty for now
buf = StringIO()
buf.write(str(type(self)))
buf.write('\n')

data = {
'schema': self.schema,
'info': self.info,
'storage info': self.storage
}
if self.partitions is not None:
data['partition schema'] = self.partitions

pprint.pprint(data, stream=buf)

return buf.getvalue()

@property
def is_partitioned(self):
return self.partitions is not None

create_time = _get_meta('info', 'CreateTime')
location = _get_meta('info', 'Location')
owner = _get_meta('info', 'Owner')
num_rows = _get_meta('info', ['Table Parameters', 'numRows'])
hive_format = _get_meta('storage', 'InputFormat')

tbl_properties = _get_meta('info', 'Table Parameters')
serde_properties = _get_meta('storage', 'Desc Params')
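
A usage sketch, assuming descr is the DESCRIBE FORMATTED result already
fetched as a pandas DataFrame:

    meta = parse_metadata(descr)
    meta.is_partitioned  # True if a partition schema section was present
    meta.location        # info['Location']
    meta.num_rows        # info['Table Parameters']['numRows'], cleaned to int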


class TableInfo(object):
pass


class TableStorageInfo(object):
pass
209 changes: 209 additions & 0 deletions ibis/impala/pandas_interop.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,209 @@
# Copyright 2014 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


from posixpath import join as pjoin
import os

import pandas.core.common as pdcom
import pandas as pd

import ibis.common as com

from ibis.config import options
from ibis.util import log
import ibis.compat as compat
import ibis.expr.datatypes as itypes
import ibis.util as util


# ----------------------------------------------------------------------
# pandas integration


def pandas_col_to_ibis_type(col):
import numpy as np
dty = col.dtype

# datetime types
if pdcom.is_datetime64_dtype(dty):
if pdcom.is_datetime64_ns_dtype(dty):
return 'timestamp'
else:
raise com.IbisTypeError("Column {0} has dtype {1}, which is "
"datetime64-like but does "
"not use nanosecond units"
.format(col.name, dty))
if pdcom.is_timedelta64_dtype(dty):
print("Warning: encoding a timedelta64 as an int64")
return 'int64'

if pdcom.is_categorical_dtype(dty):
return itypes.Category(len(col.cat.categories))

if pdcom.is_bool_dtype(dty):
return 'boolean'

# simple numerical types
if issubclass(dty.type, np.int8):
return 'int8'
if issubclass(dty.type, np.int16):
return 'int16'
if issubclass(dty.type, np.int32):
return 'int32'
if issubclass(dty.type, np.int64):
return 'int64'
if issubclass(dty.type, np.float32):
return 'float'
if issubclass(dty.type, np.float64):
return 'double'
if issubclass(dty.type, np.uint8):
return 'int16'
if issubclass(dty.type, np.uint16):
return 'int32'
if issubclass(dty.type, np.uint32):
return 'int64'
if issubclass(dty.type, np.uint64):
raise com.IbisTypeError("Column {0} is an unsigned int64"
.format(col.name))

if pdcom.is_object_dtype(dty):
return _infer_object_dtype(col)

raise com.IbisTypeError("Column {0} is dtype {1}".format(col.name, dty))
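
A few representative mappings (unsigned types widen to the next larger
signed type):

    import pandas as pd

    pandas_col_to_ibis_type(pd.Series([1, 2], dtype='int32'))   # -> 'int32'
    pandas_col_to_ibis_type(pd.Series([0.5, 1.5]))              # -> 'double'
    pandas_col_to_ibis_type(pd.Series([1, 2], dtype='uint16'))  # -> 'int32'
    pandas_col_to_ibis_type(
        pd.to_datetime(pd.Series(['2015-01-01'])))              # -> 'timestamp'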


def _infer_object_dtype(arr):
# TODO: accelerate with Cython/C

BOOLEAN, STRING = 0, 1
state = BOOLEAN

avalues = arr.values if isinstance(arr, pd.Series) else arr
nulls = pd.isnull(avalues)

if nulls.any():
for i in compat.range(len(avalues)):
if state == BOOLEAN:
if not nulls[i] and not pdcom.is_bool(avalues[i]):
state = STRING
elif state == STRING:
break
if state == BOOLEAN:
return 'boolean'
elif state == STRING:
return 'string'
else:
return pd.lib.infer_dtype(avalues)
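
Behavior sketch: when nulls are present, the scan distinguishes
boolean-with-NA from string; otherwise pandas' own inference is used:

    _infer_object_dtype(pd.Series([True, None, False]))  # -> 'boolean'
    _infer_object_dtype(pd.Series(['a', None, 'b']))     # -> 'string'
    _infer_object_dtype(pd.Series(['a', 'b']))           # pd.lib.infer_dtype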


class DataFrameWriter(object):

"""
    Interface class for writing pandas objects to Impala tables.
    Takes ownership of any temporary data written to HDFS.
"""
def __init__(self, client, df, path=None):
self.client = client
self.hdfs = client.hdfs

self.df = df

self.temp_hdfs_dirs = []

def write_temp_csv(self):
temp_hdfs_dir = pjoin(options.impala.temp_hdfs_path,
'pandas_{0}'.format(util.guid()))
self.hdfs.mkdir(temp_hdfs_dir)

# Keep track of the temporary HDFS file
self.temp_hdfs_dirs.append(temp_hdfs_dir)

# Write the file to HDFS
hdfs_path = pjoin(temp_hdfs_dir, '0.csv')

self.write_csv(hdfs_path)

return temp_hdfs_dir

def write_csv(self, path):
import csv

tmp_path = 'tmp_{0}.csv'.format(util.guid())
f = open(tmp_path, 'w+')

try:
# Write the DataFrame to the temporary file path
if options.verbose:
log('Writing DataFrame to temporary file')

self.df.to_csv(f, header=False, index=False,
sep=',',
quoting=csv.QUOTE_NONE,
escapechar='\\',
na_rep='#NULL')
f.seek(0)

if options.verbose:
log('Writing CSV to: {0}'.format(path))

self.hdfs.put(path, f)
finally:
f.close()
try:
os.remove(tmp_path)
except os.error:
pass

return path

def get_schema(self):
# define a temporary table using delimited data
return pandas_to_ibis_schema(self.df)

def delimited_table(self, csv_dir, name=None, database=None):
temp_delimited_name = 'ibis_tmp_pandas_{0}'.format(util.guid())
schema = self.get_schema()

return self.client.delimited_file(csv_dir, schema,
name=temp_delimited_name,
database=database,
delimiter=',',
na_rep='#NULL',
escapechar='\\\\',
external=True,
persist=False)

def __del__(self):
try:
self.cleanup()
except com.IbisError:
pass

def cleanup(self):
for path in self.temp_hdfs_dirs:
self.hdfs.rmdir(path)
self.temp_hdfs_dirs = []
self.csv_dir = None
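
A usage sketch, assuming con is an ImpalaClient with an attached HDFS
client:

    writer = DataFrameWriter(con, df)
    try:
        csv_dir = writer.write_temp_csv()            # stage df as CSV on HDFS
        tmp_table = writer.delimited_table(csv_dir)  # expose as an Ibis table
        # ... e.g. insert tmp_table into a persistent target table ...
    finally:
        writer.cleanup()                             # drop temp HDFS dirs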


def pandas_to_ibis_schema(frame):
from ibis.expr.api import schema
# no analog for decimal in pandas
pairs = []
for col_name in frame:
ibis_type = pandas_col_to_ibis_type(frame[col_name])
pairs.append((col_name, ibis_type))
return schema(pairs)
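
For example, a sketch:

    df = pd.DataFrame({'a': [1, 2],
                       'ts': pd.to_datetime(['2015-01-01', '2015-01-02'])})
    pandas_to_ibis_schema(df)  # -> ibis schema: a: int64, ts: timestamp
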
16 changes: 16 additions & 0 deletions ibis/impala/parquet.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# Copyright 2014 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Impala Parquet configuration and any other Parquet utilities
# / support