61 changes: 53 additions & 8 deletions docs/source/api.rst
@@ -18,19 +18,15 @@ entry to using Ibis.
.. autosummary::
:toctree: generated/

make_client
hdfs_connect

Impala client
-------------
.. currentmodule:: ibis.impala.api

These methods are available on the Impala client object after connecting to
your Impala cluster, HDFS cluster, and creating the client with
``ibis.make_client``.

Use ``ibis.impala.connect`` to create an Impala connection to use for
assembling a client.
your HDFS cluster (``ibis.hdfs_connect``) and connecting to Impala with
``ibis.impala.connect``.
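A minimal end-to-end sketch of that setup follows; the host names and ports are placeholders, not values prescribed by this documentation:

```python
def make_impala_client(impala_host, webhdfs_host):
    # Sketch only: assumes the 0.5-style API described above and a
    # reachable cluster; hosts and ports are hypothetical.
    import ibis
    hdfs = ibis.hdfs_connect(host=webhdfs_host, port=50070)
    con = ibis.impala.connect(host=impala_host, port=21050,
                              hdfs_client=hdfs)
    return con
```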

.. autosummary::
:toctree: generated/
@@ -79,8 +75,10 @@ Table methods
.. autosummary::
:toctree: generated/

ImpalaTable.drop
ImpalaTable.compute_stats
ImpalaTable.drop
ImpalaTable.insert
ImpalaTable.rename

Creating views is also possible:

@@ -110,6 +108,25 @@ Executing expressions
ImpalaClient.execute
ImpalaClient.disable_codegen

.. _api.sqlite:

SQLite client
-------------
.. currentmodule:: ibis.sql.sqlite.api

The SQLite client is accessible through the ``ibis.sqlite`` namespace.

Use ``ibis.sqlite.connect`` to create a SQLite client.

.. autosummary::
:toctree: generated/

connect
SQLiteClient.attach
SQLiteClient.database
SQLiteClient.list_tables
SQLiteClient.table
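A sketch of a typical SQLite session using the methods listed above; the database path and table name are hypothetical:

```python
def sqlite_example(path):
    # Sketch: `path` points at an existing SQLite file; the table
    # name 'my_table' is a placeholder.
    import ibis
    con = ibis.sqlite.connect(path)   # create the client
    print(con.list_tables())          # discover what the file contains
    t = con.table('my_table')         # build expressions against a table
    return t
```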

.. _api.hdfs:

HDFS
@@ -164,6 +181,22 @@ These methods are available directly in the ``ibis`` module namespace.
trailing_window
cumulative_window

.. _api.expr:

General expression methods
--------------------------

.. currentmodule:: ibis.expr.api

.. autosummary::
:toctree: generated/

Expr.compile
Expr.equals
Expr.execute
Expr.pipe
Expr.verify

.. _api.table:

Table methods
@@ -185,8 +218,8 @@ Table methods
TableExpr.group_by
TableExpr.limit
TableExpr.mutate
TableExpr.pipe
TableExpr.projection
TableExpr.relabel
TableExpr.schema
TableExpr.set_column
TableExpr.sort_by
@@ -238,6 +271,7 @@ Scalar or array methods
ValueExpr.isnull
ValueExpr.notnull
ValueExpr.over
ValueExpr.typeof

ValueExpr.add
ValueExpr.sub
@@ -291,6 +325,14 @@ Scalar or array methods
NumericValue.floor
NumericValue.sign
NumericValue.exp
NumericValue.sqrt
NumericValue.log
NumericValue.ln
NumericValue.log2
NumericValue.log10
NumericValue.round
NumericValue.nullifzero
NumericValue.zeroifnull


Array methods
@@ -302,6 +344,9 @@ Array methods
NumericArray.sum
NumericArray.mean

NumericArray.std
NumericArray.var

NumericArray.cumsum
NumericArray.cummean

25 changes: 13 additions & 12 deletions docs/source/configuration.rst
@@ -72,18 +72,19 @@ Therefore, the connection semantics are similar to the other access methods for
working with secure clusters.

Specifically, after authenticating yourself against Kerberos (e.g., by issuing
the appropriate ``kinit`` commmand), simply pass ``use_kerberos=True`` (and set
``kerberos_service_name`` if necessary) to the ``ibis.impala_connect(...)``
method when instantiating an ``ImpalaConnection``. This method also takes
arguments to configure LDAP (``use_ldap``, ``ldap_user``, and
``ldap_password``) and SSL (``use_ssl``, ``ca_cert``). See the documentation
for the Impala shell for more details.
the appropriate ``kinit`` command), simply pass ``auth_mechanism='GSSAPI'`` or
``auth_mechanism='LDAP'`` (setting ``kerberos_service_name``, ``user``, and
``password`` as needed) to the ``ibis.impala.connect(...)`` method when
instantiating an ``ImpalaConnection``.
This method also takes arguments to configure SSL (``use_ssl``, ``ca_cert``).
See the documentation for the Impala shell for more details.
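A sketch of such a call; the port and service name are placeholders, so check your cluster's actual settings:

```python
def kerberized_impala_connect(host):
    # Sketch: assumes `kinit` has already been run; host, port, and
    # kerberos_service_name are hypothetical placeholders.
    import ibis
    return ibis.impala.connect(host=host, port=21050,
                               auth_mechanism='GSSAPI',
                               kerberos_service_name='impala')
```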

Ibis also includes functionality that communicates directly with HDFS, using
the WebHDFS REST API. When calling ``ibis.hdfs_connect(...)``, also pass
``use_kerberos=True``, and ensure that you are connecting to the correct port,
which may likely be an SSL-secured WebHDFS port. Also note that you can pass
``verify=False`` to avoid verifying SSL certificates (which may be helpful in
testing). Ibis will assume ``https`` when connecting to a Kerberized cluster.
Because some Ibis commands create HDFS directories as well as new Impala
databases and/or tables, your user will require the necessary privileges.
``auth_mechanism='GSSAPI'`` or ``auth_mechanism='LDAP'``, and ensure that you
are connecting to the correct port, which may likely be an SSL-secured WebHDFS
port. Also note that you can pass ``verify=False`` to avoid verifying SSL
certificates (which may be helpful in testing). Ibis will assume ``https``
when connecting to a Kerberized cluster. Because some Ibis commands create HDFS
directories as well as new Impala databases and/or tables, your user will
require the necessary privileges.
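For example, a Kerberized WebHDFS connection might be assembled like this sketch; the port is a placeholder and ``verify=False`` is only appropriate for testing:

```python
def kerberized_hdfs_connect(namenode_host):
    # Sketch: the SSL-secured WebHDFS port 14000 is hypothetical;
    # verify=False skips certificate checks (testing only).
    import ibis
    return ibis.hdfs_connect(host=namenode_host, port=14000,
                             auth_mechanism='GSSAPI', verify=False)
```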
24 changes: 15 additions & 9 deletions docs/source/getting-started.rst
@@ -44,21 +44,21 @@ Creating a client
-----------------

To create an Ibis "client", you must first connect your services and assemble
the client using ``ibis.make_client``:
the client using ``ibis.impala.connect``:

.. code-block:: python
import ibis
ic = ibis.impala.connect(host=impala_host, port=impala_port)
hdfs = ibis.hdfs_connect(host=webhdfs_host, port=webhdfs_port)
con = ibis.impala.connect(host=impala_host, port=impala_port,
hdfs_client=hdfs)
con = ibis.make_client(ic, hdfs_client=hdfs)
Both method calls can take ``use_kerberos=True`` to connect to Kerberos
clusters. Depending on your cluster setup, this may also include LDAP or SSL.
See the :ref:`API reference <api.client>` for more, along with the Impala shell
reference, as the connection semantics are identical.
Both method calls can take ``auth_mechanism='GSSAPI'`` or
``auth_mechanism='LDAP'`` to connect to Kerberos clusters. Depending on your
cluster setup, this may also include SSL. See the :ref:`API reference
<api.client>` for more, along with the Impala shell reference, as the
connection semantics are identical.
Learning resources
------------------
@@ -76,6 +76,13 @@ Since Ibis requires a running Impala cluster, we have provided a lean
VirtualBox image to simplify the process for those looking to try out Ibis
(without setting up a cluster) or start contributing code to the project.
What follows are streamlined setup instructions for the VM. If you wish to
download it directly and set it up from the ``ova`` file, use this `download link
<http://archive.cloudera.com/cloudera-ibis/ibis-demo.ova>`_.
The VM was built with Oracle VirtualBox 4.3.28. We recommend using the latest
version of the software for the best compatibility.
TL;DR
~~~~~
@@ -89,7 +96,6 @@ Single Steps
To use Ibis with the special Cloudera Quickstart VM, follow the instructions
below:
* Install Oracle VirtualBox
* Make sure Anaconda is installed. You can get it from
http://continuum.io/downloads. Now prepend the Anaconda Python
to your path like this ``export PATH=$ANACONDA_HOME/bin:$PATH``
54 changes: 25 additions & 29 deletions docs/source/impala-udf.rst
@@ -35,8 +35,7 @@ You can compile this to either a shared library (a ``.so`` file) or to LLVM
bitcode with clang (a ``.ll`` file). Skipping that step for now (will add some
more detailed instructions here later, promise).

To make this function callable, we first create a UDF wrapper with
``ibis.impala.wrap_udf``:
To make this function callable, we use ``ibis.impala.wrap_udf``:

.. code-block:: python
@@ -47,46 +46,25 @@ To make this function callable, we first create a UDF wrapper with
udf_db = 'ibis_testing'
udf_name = 'fuzzy_equals'
wrapper = ibis.impala.wrap_udf(library, inputs, output, symbol, name=udf_name)
fuzzy_equals = ibis.impala.wrap_udf(library, inputs, output,
symbol, name=udf_name)
In typical workflows, you will set up a UDF in Impala once and use it
thereafter. So the *first time* you do this, you need to create the UDF in
Impala:

.. code-block:: python
client.create_udf(wrapper, name=udf_name, database=udf_db)
client.create_function(fuzzy_equals, database=udf_db)
Now, we must register this function as a new Impala operation in Ibis. This
must take place each time you load your Ibis session.

.. code-block:: python
operation_class = wrapper.to_operation()
ibis.impala.add_operation(operation_class, udf_name, udf_db)
fuzzy_equals.register(fuzzy_equals.name, udf_db)
Lastly, we define a *user API* to make ``fuzzy_equals`` callable on Ibis
expressions:

.. code-block:: python
def fuzzy_equals(left, right):
"""
Approximate equals UDF
Parameters
----------
left : numeric
right : numeric
Returns
-------
is_approx_equal : boolean
"""
op = operation_class(left, right)
return op.to_expr()
Now, we have a callable Python function that works with Ibis expressions:
The object ``fuzzy_equals`` is callable and works with Ibis expressions:

.. code-block:: python
@@ -110,7 +88,7 @@ Now, we have a callable Python function that works with Ibis expressions:
9 False
Name: tmp, dtype: bool
Note that the call to ``ibis.impala.add_operation`` must happen each time you
Note that the call to ``register`` on the UDF object must happen each time you
use Ibis. If you have a lot of UDFs, I suggest you create a file with all of
your wrapper declarations and user APIs that you load with your Ibis session to
plug in all your own functions.
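Such a file might look like the following sketch; the library path, signature, and symbol are hypothetical, and the registration call mirrors the pattern shown above:

```python
# my_udfs.py -- hypothetical module collecting UDF declarations.
def load_udfs(udf_db='ibis_testing'):
    # Sketch: wrap each compiled UDF and register it for this session.
    import ibis
    library = '/path/to/udfs.so'      # placeholder path to the compiled .so
    fuzzy_equals = ibis.impala.wrap_udf(
        library, ['double', 'double'], 'boolean',
        'FuzzyEquals', name='fuzzy_equals')
    fuzzy_equals.register(fuzzy_equals.name, udf_db)
    return {'fuzzy_equals': fuzzy_equals}
```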
@@ -120,6 +98,24 @@ Using aggregate functions (UDAs)

Coming soon.

Adding documentation to new functions
-------------------------------------

.. code-block:: python
fuzzy_equals.__doc__ = """\
Approximate equals UDF
Parameters
----------
left : numeric
right : numeric
Returns
-------
is_approx_equal : boolean
"""
Adding UDF functions to Ibis types
----------------------------------

52 changes: 46 additions & 6 deletions docs/source/index.rst
@@ -3,13 +3,53 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Ibis Documentation
==================
Ibis: Python Data Analysis Framework
====================================

Ibis is a productivity-centric Python data analysis framework, designed to be
an ideal companion for SQL engines and distributed storage systems like
Hadoop. Ibis is being jointly developed with `Impala <http://impala.io>`_ to
deliver a complete 100% Python user experience on data of any size (small,
medium, or big).

At this time, Ibis supports the following SQL-based systems:

- Impala (on HDFS)
- SQLite

We have a handful of specific priority focus areas:

- Enable data analysts to translate analytics running on SQL engines into
Python, rather than writing SQL directly.
- Provide high level analytics APIs and workflow tools to enhance productivity
and streamline common or tedious tasks.
- Provide high performance extensions for the Impala MPP query engine to enable
high performance Python code to operate in a scalable Hadoop-like environment.
- Abstract away database-specific SQL differences
- Integration with community standard data formats (e.g. Parquet and Avro)
- Integrate with the Python data ecosystem using the above tools

Architecturally, Ibis features:

- A pandas-like domain specific language (DSL) designed specifically for
analytics, aka **Ibis expressions**, that enable composable, reusable
analytics on structured data. If you can express something with a SQL SELECT
query, you can write it with Ibis.
- An extensible translator-compiler system that targets multiple SQL systems
- Tools for wrapping user-defined functions in Impala and eventually other SQL
engines

SQL engine support on the horizon:

- PostgreSQL
- Redshift
- Vertica
- Spark SQL
- Presto
- Hive
- MySQL / MariaDB

Ibis is a Python data analysis framework, designed to be an ideal companion for
big data storage and computation systems. Ibis is being jointly developed with
Impala to deliver a complete 100% Python user experience on tera- and petascale
big data problems.
See the project blog http://blog.ibis-project.org for more frequent updates.

To learn more about Ibis's vision and roadmap, please visit
http://ibis-project.org.
90 changes: 90 additions & 0 deletions docs/source/release.rst
@@ -2,6 +2,53 @@
Release Notes
=============

**Note**: These release notes include only notable features and major bug
fixes, since most minor bug fixes tend to be esoteric and not generally
interesting.

0.5.0 (September 10, 2015)
--------------------------

Highlights of this release are SQLite, Python 3, and Impala UDA support, and
an asynchronous execution API. There are also many usability improvements, bug
fixes, and other new features.

New features
~~~~~~~~~~~~
* SQLite client and built-in function support
* Ibis now supports Python 3.4 as well as 2.6 and 2.7
* Ibis can utilize Impala user-defined aggregate (UDA) functions
* SQLAlchemy-based translation toolchain, enabling support for more SQL
engines that have SQLAlchemy dialects
* Many window function usability improvements (nested analytic functions and
deferred binding conveniences)
* More convenient aggregation with keyword arguments in ``aggregate`` functions
* Built preliminary wrapper API for MADLib-on-Impala
* Add ``var`` and ``std`` aggregation methods and support in Impala
* Add ``nullifzero`` numeric method for all SQL engines
* Add ``rename`` method to Impala tables (for renaming tables in the Hive
metastore)
* Add ``close`` method to ``ImpalaClient`` for session cleanup (#533)
* Add ``relabel`` method to table expressions
* Add ``insert`` method to Impala tables
* Add ``compile`` and ``verify`` methods to all expressions to test compilation
and ability to compile (since many operations are unavailable in SQLite, for
example)

API changes
~~~~~~~~~~~
* Impala Ibis client creation now uses only ``ibis.impala.connect``, and
``ibis.make_client`` has been deprecated
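The migration looks roughly like this sketch (hosts and ports are placeholders):

```python
def old_style(impala_host, hdfs):
    # Deprecated pre-0.5 pattern, shown for contrast.
    import ibis
    ic = ibis.impala.connect(host=impala_host, port=21050)
    return ibis.make_client(ic, hdfs_client=hdfs)

def new_style(impala_host, hdfs):
    # 0.5 pattern: a single call assembles the client.
    import ibis
    return ibis.impala.connect(host=impala_host, port=21050,
                               hdfs_client=hdfs)
```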

Contributors
~~~~~~~~~~~~
::

$ git log v0.4.0..v0.5.0 --pretty=format:%aN | sort | uniq -c | sort -rn
55 Wes McKinney
9 Uri Laserson
1 Kristopher Overholt

0.4.0 (August 14, 2015)
-----------------------

@@ -33,6 +80,17 @@ New features
to cluster, for better usability.
* Add conda installation recipes

Contributors
~~~~~~~~~~~~
::

$ git log v0.3.0..v0.4.0 --pretty=format:%aN | sort | uniq -c | sort -rn
38 Wes McKinney
9 Uri Laserson
2 Meghana Vuyyuru
2 Kristopher Overholt
1 Marius van Niekerk

0.3.0 (July 20, 2015)
---------------------

@@ -61,6 +119,16 @@ New features
* Add an internal operation type signature API to enhance developer
productivity.

Contributors
~~~~~~~~~~~~
::

$ git log v0.2.0..v0.3.0 --pretty=format:%aN | sort | uniq -c | sort -rn
59 Wes McKinney
29 Uri Laserson
4 Isaac Hodes
2 Meghana Vuyyuru

0.2.0 (June 16, 2015)
---------------------

@@ -118,5 +186,27 @@ Bug fixes
~~~~~~~~~
* Numerous expression API bug fixes and rough edges fixed

Contributors
~~~~~~~~~~~~
::

$ git log v0.1.0..v0.2.0 --pretty=format:%aN | sort | uniq -c | sort -rn
71 Wes McKinney
1 Juliet Hougland
1 Isaac Hodes

0.1.0 (March 26, 2015)
----------------------

First Ibis release.

* Expression DSL design and type system
* Expression to ImpalaSQL compiler toolchain
* Impala built-in function wrappers

::

$ git log 84d0435..v0.1.0 --pretty=format:%aN | sort | uniq -c | sort -rn
78 Wes McKinney
1 srus
1 Henry Robinson
46 changes: 28 additions & 18 deletions ibis/__init__.py
@@ -15,7 +15,7 @@

# flake8: noqa

__version__ = '0.4.0'
__version__ = '0.5.0'

from ibis.filesystems import HDFS, WebHDFS
from ibis.common import IbisError
@@ -27,10 +27,11 @@
from ibis.expr.api import *

import ibis.impala.api as impala
import ibis.sql.sqlite.api as sqlite

import ibis.config_init
from ibis.config import options
import util
import ibis.util as util


# Deprecated
@@ -60,30 +61,40 @@ def make_client(db, hdfs_client=None):
-------
client : IbisClient
"""
return impala.ImpalaClient(db, hdfs_client=hdfs_client)
db._hdfs = hdfs_client
return db

make_client = util.deprecate(
make_client, ('make_client is deprecated. '
'Use ibis.impala.connect '
'with hdfs_client=hdfs_client'))


def hdfs_connect(host='localhost', port=50070, protocol='webhdfs',
use_kerberos=False, verify=True, **kwds):
auth_mechanism='NOSASL', verify=True, **kwds):
"""
Connect to HDFS
Parameters
----------
host : string
port : int, default 50070 (webhdfs default)
host : string, Host name of the HDFS NameNode
port : int, NameNode's WebHDFS port (default 50070)
protocol : {'webhdfs'}
use_kerberos : boolean, default False
verify : boolean, default False
Set to False to turn off verifying SSL certificates
auth_mechanism : string, Set to NOSASL or PLAIN for non-secure clusters.
Set to GSSAPI or LDAP for Kerberos-secured clusters.
verify : boolean, Set to False to turn off verifying SSL certificates.
(default True)
Other keywords are forwarded to hdfs library classes
Returns
-------
client : ibis HDFS client
client : WebHDFS
"""
if use_kerberos:
import requests
session = kwds.setdefault('session', requests.Session())
session.verify = verify
if auth_mechanism in ['GSSAPI', 'LDAP']:
try:
import requests_kerberos
except ImportError:
@@ -93,23 +104,22 @@ def hdfs_connect(host='localhost', port=50070, protocol='webhdfs',
"requests-kerberos` or `pip install hdfs[kerberos]`.")
from hdfs.ext.kerberos import KerberosClient
url = 'https://{0}:{1}'.format(host, port) # note SSL
hdfs_client = KerberosClient(url, mutual_auth='OPTIONAL',
verify=verify, **kwds)
kwds.setdefault('mutual_auth', 'OPTIONAL')
hdfs_client = KerberosClient(url, **kwds)
else:
from hdfs.client import InsecureClient
url = 'http://{0}:{1}'.format(host, port)
hdfs_client = InsecureClient(url, verify=verify, **kwds)
hdfs_client = InsecureClient(url, **kwds)
return WebHDFS(hdfs_client)


def test(include_e2e=False):
def test(impala=False):
import pytest
import ibis
import os

ibis_dir, _ = os.path.split(ibis.__file__)

args = ['--pyargs', ibis_dir]
if include_e2e:
args.append('--e2e')
if impala:
args.append('--impala')
pytest.main(args)
342 changes: 305 additions & 37 deletions ibis/client.py


13 changes: 7 additions & 6 deletions ibis/cloudpickle.py
@@ -125,7 +125,7 @@ def dump(self, obj):
self.inject_addons()
try:
return pickle.Pickler.dump(self, obj)
except RuntimeError, e:
except RuntimeError as e:
if 'recursion' in e.args[0]:
msg = """Could not pickle object as excessively deep recursion required.
Try _fast_serialization=2 or contact PiCloud support"""
@@ -313,7 +313,7 @@ def extract_code_globals(co):
extended_arg = 0
i = i+2
if op == EXTENDED_ARG:
extended_arg = oparg*65536L
extended_arg = oparg*65536
if op in GLOBAL_OPS:
out_names.add(names[oparg])
#print 'extracted', out_names, ' from ', names
@@ -348,7 +348,7 @@ def extract_func_data(self, func):
def get_contents(cell):
try:
return cell.cell_contents
except ValueError, e: #cell is empty error on not yet assigned
except ValueError as e: #cell is empty error on not yet assigned
raise pickle.PicklingError('Function to be pickled has free variables that are referenced before assignment in enclosing scope')


@@ -366,7 +366,8 @@ def get_contents(cell):
outvars.append('globals: ' + str(f_globals))
outvars.append('defaults: ' + str(defaults))
outvars.append('closure: ' + str(closure))
print 'function ', func, 'is extracted to: ', ', '.join(outvars)
print('function {0} is extracted to: {1}'.format(
func, ', '.join(outvars)))

base_globals = self.globals_ref.get(id(func.func_globals), {})
self.globals_ref[id(func.func_globals)] = base_globals
@@ -417,7 +418,7 @@ def save_global(self, obj, name=None, pack=struct.pack):
themodule = sys.modules[modname]
try:
klass = getattr(themodule, name)
except AttributeError, a:
except AttributeError as a:
# print themodule, name, obj, type(obj)
raise pickle.PicklingError("Can't pickle builtin %s" % obj)
else:
@@ -799,7 +800,7 @@ def _modules_to_main(modList):
if type(modname) is str:
try:
mod = __import__(modname)
except Exception, i: #catch all...
except Exception as i: #catch all...
sys.stderr.write('warning: could not import %s\n. Your function may unexpectedly error due to this import failing; \
A version mismatch is likely. Specific error was:\n' % modname)
print_exec(sys.stderr)
30 changes: 30 additions & 0 deletions ibis/compat.py
@@ -14,7 +14,12 @@

# flake8: noqa

import itertools

import numpy as np

import sys
import six
from six import BytesIO, StringIO, string_types as py_string


@@ -29,6 +34,31 @@
import unittest

if PY3:
import pickle
unicode_type = str
def lzip(*x):
return list(zip(*x))
zip = zip
pickle_dump = pickle.dumps
pickle_load = pickle.loads
def dict_values(x):
return list(x.values())
from decimal import Decimal
else:
import cPickle

try:
from cdecimal import Decimal
except ImportError:
from decimal import Decimal

unicode_type = unicode
lzip = zip
zip = itertools.izip
from ibis.cloudpickle import dumps as pickle_dump
pickle_load = cPickle.loads

def dict_values(x):
return x.values()

integer_types = six.integer_types + (np.integer,)
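The shims added above follow a standard version-gate pattern; here is a self-contained sketch of the same idea (not Ibis's actual module):

```python
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    import pickle
    pickle_dump = pickle.dumps
    pickle_load = pickle.loads

    def lzip(*x):
        # zip() is lazy on Python 3; materialize to match Python 2 behavior
        return list(zip(*x))
else:  # Python 2 branch
    import cPickle
    pickle_dump = cPickle.dumps
    pickle_load = cPickle.loads
    lzip = zip

# Round-trip an object and pair up sequences identically on both versions
restored = pickle_load(pickle_dump({'a': 1}))
pairs = lzip([1, 2], ['x', 'y'])
```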
1 change: 1 addition & 0 deletions ibis/config_init.py
@@ -18,6 +18,7 @@
cf.register_option('verbose', False, validator=cf.is_bool)
cf.register_option('verbose_log', None)

cf.register_option('default_backend', None)

sql_default_limit_doc = """
Number of rows to be retrieved for an unlimited table expression
43 changes: 16 additions & 27 deletions ibis/expr/analysis.py
@@ -149,7 +149,8 @@ def get_result(self):

lifted_args = []
for arg in node.args:
lifted_arg, unch_arg = self._lift_arg(arg)
lifted_arg, unch_arg = self._lift_arg(
arg, block=self.block_projection)
lifted_args.append(lifted_arg)

unchanged = unchanged and unch_arg
@@ -257,7 +258,15 @@ def _lift_Aggregation(self, expr, block=None):
block = self.block_projection

op = expr.op()
lifted_table = self.lift(op.table, block=True)

# as exposed in #544, do not lift the table inside (which may be
# filtered or otherwise altered in some way) if blocking

if block:
lifted_table = op.table
else:
lifted_table = self.lift(op.table, block=True)

unch = lifted_table is op.table

lifted_aggs, unch1 = self._lift_arg(op.agg_exprs, block=True)
@@ -282,8 +291,8 @@ def _lift_Projection(self, expr, block=None):
op = expr.op()

if block:
lifted_table = op.table
unch = True
# GH #549: dig no further
return expr
else:
lifted_table, unch = self._lift_arg(op.table, block=True)

@@ -529,7 +538,7 @@ def _walk(x, w):
unchanged = True
windowed_args = []
for arg in op.args:
if not isinstance(arg, ir.Expr):
if not isinstance(arg, ir.ValueExpr):
windowed_args.append(arg)
continue

@@ -559,6 +568,8 @@ class Projector(object):
def __init__(self, parent, proj_exprs):
self.parent = parent

self.input_exprs = proj_exprs

node = self.parent.op()

if isinstance(node, ops.Projection):
@@ -762,25 +773,3 @@ def walk(expr):

walk(expr)
return out_exprs


def find_backend(expr):
from ibis.client import Client

backends = []

def walk(expr):
node = expr.op()
for arg in node.flat_args():
if isinstance(arg, Client):
backends.append(arg)
elif isinstance(arg, ir.Expr):
walk(arg)

walk(expr)
backends = util.unique_by_key(backends, id)

if len(backends) > 1:
raise ValueError('Multiple backends found')

return backends[0]
331 changes: 322 additions & 9 deletions ibis/expr/api.py


63 changes: 48 additions & 15 deletions ibis/expr/datatypes.py
@@ -61,6 +61,9 @@ def __contains__(self, name):

@classmethod
def from_tuples(cls, values):
if not isinstance(values, (list, tuple)):
values = list(values)

if len(values):
names, types = zip(*values)
else:
@@ -135,6 +138,15 @@ def root_tables(self):

class DataType(object):

def __init__(self, nullable=True):
self.nullable = nullable

def __call__(self, nullable=True):
return self._factory(nullable=nullable)

def _factory(self, nullable=True):
return type(self)(nullable=nullable)

def __eq__(self, other):
return self.equals(other)

@@ -145,7 +157,10 @@ def __hash__(self):
return hash(type(self))

def __repr__(self):
return self.name()
name = self.name()
if not self.nullable:
name = '{0}[non-nullable]'.format(name)
return name

def name(self):
return type(self).__name__.lower()
@@ -154,7 +169,8 @@ def equals(self, other):
if isinstance(other, six.string_types):
other = validate_type(other)

return isinstance(other, type(self))
return (isinstance(other, type(self)) and
self.nullable == other.nullable)

def can_implicit_cast(self, other):
return self.equals(other)
@@ -192,8 +208,10 @@ class Integer(Primitive):

def can_implicit_cast(self, other):
if isinstance(other, Integer):
return other._nbytes <= self._nbytes
return False
return ((type(self) == Integer) or
(other._nbytes <= self._nbytes))
else:
return False


class String(Variadic):
@@ -209,7 +227,15 @@ class SignedInteger(Integer):


class Floating(Primitive):
pass

def can_implicit_cast(self, other):
if isinstance(other, Integer):
return True
elif isinstance(other, Floating):
# return other._nbytes <= self._nbytes
return True
else:
return False


class Int8(Integer):
@@ -249,9 +275,10 @@ class Double(Floating):
class Decimal(DataType):
# Decimal types are parametric, we store the parameters in this object

def __init__(self, precision, scale):
def __init__(self, precision, scale, nullable=True):
self.precision = precision
self.scale = scale
DataType.__init__(self, nullable=nullable)

def _base_type(self):
return 'decimal'
@@ -273,6 +300,10 @@ def __eq__(self, other):
return (self.precision == other.precision and
self.scale == other.scale)

@classmethod
def can_implicit_cast(cls, other):
return isinstance(other, (Floating, Decimal))

def array_type(self):
def constructor(op, name=None):
from ibis.expr.types import DecimalArray
@@ -288,8 +319,9 @@ def constructor(op, name=None):

class Category(DataType):

def __init__(self, cardinality=None):
def __init__(self, cardinality=None, nullable=True):
self.cardinality = cardinality
DataType.__init__(self, nullable=nullable)

def _base_type(self):
return 'category'
@@ -335,26 +367,26 @@ def constructor(op, name=None):

class Struct(DataType):

def __init__(self, names, types):
pass
def __init__(self, names, types, nullable=True):
DataType.__init__(self, nullable=nullable)


class Array(Variadic):

def __init__(self, value_type):
pass
def __init__(self, value_type, nullable=True):
Variadic.__init__(self, nullable=nullable)


class Enum(DataType):

def __init__(self, rep_type, value_type):
pass
def __init__(self, rep_type, value_type, nullable=True):
DataType.__init__(self, nullable=nullable)


class Map(DataType):

def __init__(self, key_type, value_type):
pass
def __init__(self, key_type, value_type, nullable=True):
DataType.__init__(self, nullable=nullable)


# ---------------------------------------------------------------------
@@ -363,6 +395,7 @@ def __init__(self, key_type, value_type):
any = Any()
null = Null()
boolean = Boolean()
int_ = Integer()
int8 = Int8()
int16 = Int16()
int32 = Int32()
8 changes: 5 additions & 3 deletions ibis/expr/format.py
@@ -164,9 +164,9 @@ def _format_column(self, expr):
table_formatted = self.memo.get_alias(parent_op)
else:
table_formatted = '\n' + self._indent(self._format_node(parent_op))
return ("Column[%s] '%s' from table %s" % (self.expr.type(),
col.name,
table_formatted))
type_display = self._get_type_display(self.expr)
return ("Column[{0}] '{1}' from table {2}"
.format(type_display, col.name, table_formatted))

def _format_node(self, op):
formatted_args = []
@@ -233,6 +233,8 @@ def _get_type_display(self, expr=None):
return 'table'
elif isinstance(expr, ir.ArrayExpr):
return 'array(%s)' % expr.type()
elif isinstance(expr, ir.SortExpr):
return 'array-sort'
elif isinstance(expr, (ir.ScalarExpr, ir.AnalyticExpr)):
return '%s' % expr.type()
elif isinstance(expr, ir.ExprList):
33 changes: 24 additions & 9 deletions ibis/expr/groupby.py
@@ -41,7 +41,7 @@ def __init__(self, table, by, having=None, order_by=None, window=None):

def __getitem__(self, args):
# Shortcut for projection with window functions
return self.projection(args)
return self.projection(list(args))

def __getattr__(self, attr):
if hasattr(self.table, attr):
Expand All @@ -56,9 +56,9 @@ def _column_wrapper(self, attr):
else:
return GroupedArray(col, self)

def aggregate(self, metrics):
def aggregate(self, metrics=None, **kwds):
return self.table.aggregate(metrics, by=self.by,
having=self._having)
having=self._having, **kwds)

def having(self, expr):
"""
@@ -99,14 +99,21 @@ def order_by(self, expr):

def mutate(self, exprs=None, **kwds):
"""
Returns a table projection with analytic / window functions applied
Returns a table projection with analytic / window functions
applied. Any arguments can be functions.
Parameters
----------
exprs : list, default None
kwds : key=value pairs
Examples
--------
expr = (table
.group_by('foo')
.order_by(ibis.desc('bar'))
.mutate(qux=table.baz.lag()))
>>> expr = (table
...         .group_by('foo')
...         .order_by(ibis.desc('bar'))
...         .mutate(qux=lambda x: x.baz.lag(),
...                 qux2=table.baz.lead()))
Returns
-------
@@ -117,14 +124,22 @@ def mutate(self, exprs=None, **kwds):
else:
exprs = util.promote_list(exprs)

for k, v in kwds.items():
kwd_names = list(kwds.keys())
kwd_values = list(kwds.values())
kwd_values = self.table._resolve(kwd_values)

for k, v in sorted(zip(kwd_names, kwd_values)):
exprs.append(v.name(k))

return self.projection([self.table] + exprs)

def projection(self, exprs):
"""
Like mutate, but do not include existing table columns
"""
w = self._get_window()
windowed_exprs = []
exprs = self.table._resolve(exprs)
for expr in exprs:
expr = L.windowize_function(expr, w=w)
windowed_exprs.append(expr)
152 changes: 91 additions & 61 deletions ibis/expr/operations.py
@@ -13,6 +13,9 @@
# limitations under the License.

import operator
import six

from ibis.expr.types import TableColumn # noqa

from ibis.compat import py_string
from ibis.expr.datatypes import HasSchema, Schema
@@ -53,9 +56,9 @@ def __new__(cls, name, parents, dct):
return super(ValueOperationMeta, cls).__new__(cls, name, parents, dct)


class ValueOp(ValueNode):
class ValueOp(six.with_metaclass(ValueOperationMeta, ValueNode)):

__metaclass__ = ValueOperationMeta
pass


class PhysicalTable(ir.BlockingTableNode, HasSchema):
@@ -82,6 +85,9 @@ def __init__(self, name, schema, source):
TableNode.__init__(self, [name, schema, source])
HasSchema.__init__(self, schema, name=name)

def change_name(self, new_name):
return type(self)(new_name, self.args[1], self.source)


class SQLQueryResult(ir.BlockingTableNode, HasSchema):

@@ -95,36 +101,6 @@ def __init__(self, query, schema, source):
HasSchema.__init__(self, schema)


class TableColumn(ValueNode):

"""
Selects a column from a TableExpr
"""

def __init__(self, name, table_expr):
Node.__init__(self, [name, table_expr])

if name not in table_expr.schema():
raise KeyError("'{0}' is not a field".format(name))

self.name = name
self.table = table_expr

def parent(self):
return self.table

def resolve_name(self):
return self.name

def root_tables(self):
return self.table._root_tables()

def to_expr(self):
ctype = self.table._get_type(self.name)
klass = ctype.array_type()
return klass(self, name=self.name)


class TableArrayView(ValueNode):

"""
Expand Down Expand Up @@ -172,6 +148,13 @@ def output_type(self):
return rules.shape_like(self.args[0], self.args[1])


class TypeOf(ValueOp):

input_type = [value]
output_type = rules.shape_like_arg(0, 'string')



class Negate(UnaryOp):

input_type = [number]
Expand Down Expand Up @@ -232,6 +215,23 @@ class NullIf(ValueOp):
output_type = rules.type_of_arg(0)


class NullIfZero(ValueOp):

"""
Set values to NULL if they equal zero. Commonly used in cases where
divide-by-zero would produce an overflow or infinity.

Equivalent to (value == 0).ifelse(ibis.NA, value)

Returns
-------
maybe_nulled : type of caller
"""

input_type = [number]
output_type = rules.type_of_arg(0)
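
The NULL-propagating semantics described in the docstring can be sketched in plain Python, with None standing in for NULL (this is an illustrative model, not the ibis implementation):

```python
def null_if_zero(value):
    # NULL (here None) propagates; zero becomes NULL so that a
    # subsequent division yields NULL instead of raising.
    if value is None or value == 0:
        return None
    return value

def safe_ratio(numerator, denominator):
    d = null_if_zero(denominator)
    return None if d is None else numerator / d

print(safe_ratio(10, 4))  # 2.5
print(safe_ratio(10, 0))  # None
```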


def _coalesce_upcast(self):
# TODO: how much validation is necessary that the call is valid and can
# succeed?
Expand Down Expand Up @@ -532,8 +532,13 @@ class RegexExtract(ValueOp):

class RegexReplace(ValueOp):

input_type = [string, string(name='pattern'),
string(name='replacement')]
input_type = [string, string(name='pattern'), string(name='replacement')]
output_type = rules.shape_like_arg(0, 'string')


class StringReplace(ValueOp):

input_type = [string, string(name='pattern'), string(name='replacement')]
output_type = rules.shape_like_arg(0, 'string')


Expand Down Expand Up @@ -665,28 +670,30 @@ def _mean_output_type(self):
return t


def _scalar_output(rule):
def f(self):
t = dt.validate_type(rule(self))
return t.scalar_type()
return f
class Sum(Reduction):

output_type = rules.scalar_output(_sum_output_type)


def _array_output(rule):
def f(self):
t = dt.validate_type(rule(self))
return t.array_type()
return f
class Mean(Reduction):

output_type = rules.scalar_output(_mean_output_type)

class Sum(Reduction):

output_type = _scalar_output(_sum_output_type)
class VarianceBase(Reduction):

input_type = [rules.array, boolean(name='where', optional=True),
rules.string_options(['sample', 'pop'],
name='how', optional=True)]
output_type = rules.scalar_output(_mean_output_type)

class Mean(Reduction):

output_type = _scalar_output(_mean_output_type)
class StandardDev(VarianceBase):
pass


class Variance(VarianceBase):
pass


def _decimal_scalar_ctor(precision, scale):
Expand All @@ -710,12 +717,12 @@ def _min_max_output_rule(self):

class Max(Reduction):

output_type = _scalar_output(_min_max_output_rule)
output_type = rules.scalar_output(_min_max_output_rule)


class Min(Reduction):

output_type = _scalar_output(_min_max_output_rule)
output_type = rules.scalar_output(_min_max_output_rule)


class HLLCardinality(Reduction):
Expand Down Expand Up @@ -764,9 +771,17 @@ class WindowOp(ValueOp):
output_type = rules.type_of_arg(0)

def __init__(self, expr, window):
from ibis.expr.window import propagate_down_window
if not is_analytic(expr):
raise com.IbisInputError('Expression does not contain a valid '
'window operation')

table = ir.find_base_table(expr)
if table is not None:
window = window.bind(table)

expr = propagate_down_window(expr, window)

ValueOp.__init__(self, expr, window)

def over(self, window):
Expand Down Expand Up @@ -900,7 +915,7 @@ class CumulativeSum(CumulativeOp):
Cumulative sum. Requires an order window.
"""

output_type = _array_output(_sum_output_type)
output_type = rules.array_output(_sum_output_type)


class CumulativeMean(CumulativeOp):
Expand All @@ -909,7 +924,7 @@ class CumulativeMean(CumulativeOp):
Cumulative mean. Requires an order window.
"""

output_type = _array_output(_mean_output_type)
output_type = rules.array_output(_mean_output_type)


class CumulativeMax(CumulativeOp):
Expand All @@ -918,7 +933,7 @@ class CumulativeMax(CumulativeOp):
Cumulative max. Requires an order window.
"""

output_type = _array_output(_min_max_output_rule)
output_type = rules.array_output(_min_max_output_rule)


class CumulativeMin(CumulativeOp):
Expand All @@ -927,7 +942,7 @@ class CumulativeMin(CumulativeOp):
Cumulative min. Requires an order window.
"""

output_type = _array_output(_min_max_output_rule)
output_type = rules.array_output(_min_max_output_rule)


class PercentRank(AnalyticOp):
Expand Down Expand Up @@ -1081,7 +1096,7 @@ class CumulativeAny(CumulativeOp):
Cumulative any
"""

output_type = _array_output(lambda self: 'boolean')
output_type = rules.array_output(lambda self: 'boolean')


class CumulativeAll(CumulativeOp):
Expand All @@ -1090,7 +1105,7 @@ class CumulativeAll(CumulativeOp):
Cumulative all
"""

output_type = _array_output(lambda self: 'boolean')
output_type = rules.array_output(lambda self: 'boolean')


# ---------------------------------------------------------------------
Expand Down Expand Up @@ -1470,6 +1485,8 @@ def _validate(self):

class Filter(TableNode):

_arg_names = ['table', 'predicates']

def __init__(self, table_expr, predicates):
self.table = table_expr
self.predicates = predicates
Expand Down Expand Up @@ -1542,7 +1559,7 @@ def to_sort_key(table, key):
if isinstance(key, DeferredSortKey):
key = key.resolve(table)

if isinstance(key, SortKey):
if isinstance(key, ir.SortExpr):
return key

if isinstance(key, (tuple, list)):
Expand All @@ -1552,7 +1569,7 @@ def to_sort_key(table, key):

if not isinstance(key, ir.Expr):
key = table._ensure_expr(key)
if isinstance(key, (SortKey, DeferredSortKey)):
if isinstance(key, (ir.SortExpr, DeferredSortKey)):
return to_sort_key(table, key)

if isinstance(sort_order, py_string):
Expand All @@ -1561,10 +1578,12 @@ def to_sort_key(table, key):
elif not isinstance(sort_order, bool):
sort_order = bool(sort_order)

return SortKey(key, ascending=sort_order)
return SortKey(key, ascending=sort_order).to_expr()
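
to_sort_key coerces several spellings of sort order into a boolean ascending flag. A self-contained sketch of that coercion rule (illustrative only; the exact set of accepted strings is an assumption):

```python
def coerce_sort_order(sort_order=True):
    # Accept 'ascending'/'descending' strings, booleans, and truthy
    # ints, normalizing everything to a bool ascending flag.
    if isinstance(sort_order, str):
        return not sort_order.lower().startswith('desc')
    return bool(sort_order)

print(coerce_sort_order('descending'))  # False
print(coerce_sort_order(0))             # False
print(coerce_sort_order())              # True
```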


class SortKey(object):
class SortKey(ir.Node):

_arg_names = ['by', 'ascending']

def __init__(self, expr, ascending=True):
if not rules.is_array(expr):
Expand All @@ -1573,13 +1592,18 @@ def __init__(self, expr, ascending=True):
self.expr = expr
self.ascending = ascending

ir.Node.__init__(self, [self.expr, self.ascending])

def __repr__(self):
# Temporary
rows = ['Sort key:',
' ascending: {0!s}'.format(self.ascending),
util.indent(_safe_repr(self.expr), 2)]
return '\n'.join(rows)

def to_expr(self):
return ir.SortExpr(self)

def equals(self, other):
return (isinstance(other, SortKey) and
self.expr.equals(other.expr) and
Expand All @@ -1594,7 +1618,7 @@ def __init__(self, what, ascending=True):

def resolve(self, parent):
what = parent._ensure_expr(self.what)
return SortKey(what, ascending=self.ascending)
return SortKey(what, ascending=self.ascending).to_expr()


class SelfReference(ir.BlockingTableNode, HasSchema):
Expand Down Expand Up @@ -2087,6 +2111,12 @@ class Truncate(ValueOp):
output_type = rules.shape_like_arg(0, 'timestamp')


class Strftime(ValueOp):

input_type = [rules.timestamp, rules.string(name='format_str')]
output_type = rules.shape_like_arg(0, 'string')


class ExtractTimestampField(TimestampUnaryOp):

output_type = rules.shape_like_arg(0, 'int32')
Expand Down
76 changes: 56 additions & 20 deletions ibis/expr/rules.py
Original file line number Diff line number Diff line change
Expand Up @@ -307,14 +307,21 @@ def _validate(self, args, i):
raise NotImplementedError


def _to_argument(val):
if isinstance(val, dt.DataType):
val = value_typed_as(val)
elif not isinstance(val, Argument):
val = val()
return val


class TypeSignature(object):

def __init__(self, type_specs):
types = []

for val in type_specs:
if not isinstance(val, Argument):
val = val()
val = _to_argument(val)
types.append(val)

self.types = types
Expand Down Expand Up @@ -345,7 +352,7 @@ def _validate(self, args, types):
try:
clean_args[i] = validator.validate(clean_args, i)
except IbisTypeError as e:
exc = e.message
exc = e.args[0]
msg = ('Argument {0}: {1}'.format(i, exc) +
'\nArgument was: {0}'.format(ir._safe_repr(args[i])))
raise IbisTypeError(msg)
Expand All @@ -356,9 +363,7 @@ def _validate(self, args, types):
class VarArgs(TypeSignature):

def __init__(self, arg_type, min_length=1):
if not isinstance(arg_type, Argument):
arg_type = arg_type()
self.arg_type = arg_type
self.arg_type = _to_argument(arg_type)
self.min_length = min_length

def __repr__(self):
Expand All @@ -376,6 +381,26 @@ def validate(self, args):
varargs = VarArgs


def scalar_output(rule):
def f(self):
if isinstance(rule, dt.DataType):
t = rule
else:
t = dt.validate_type(rule(self))
return t.scalar_type()
return f


def array_output(rule):
def f(self):
if isinstance(rule, dt.DataType):
t = rule
else:
t = dt.validate_type(rule(self))
return t.array_type()
return f
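
scalar_output and array_output are rule factories: each closes over a rule (a concrete type or a callable computing one from the op) and returns a method-shaped function. A simplified standalone sketch of the pattern, using strings in place of real type objects:

```python
def make_output_rule(rule, shape):
    # 'rule' is either a fixed type name or a callable computing one
    # from the op; 'shape' tags the result as scalar or array.
    def output_type(op):
        type_name = rule if isinstance(rule, str) else rule(op)
        return (type_name, shape)
    return output_type

scalar_boolean = make_output_rule('boolean', 'scalar')
array_of_input = make_output_rule(lambda op: op['input_type'], 'array')

print(scalar_boolean({}))                       # ('boolean', 'scalar')
print(array_of_input({'input_type': 'int64'}))  # ('int64', 'array')
```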


def shape_like_flatargs(out_type):

def output_type(self):
Expand Down Expand Up @@ -444,14 +469,14 @@ def _validate(self, args, i):
class AnyTyped(Argument):

def __init__(self, types, fail_message, **arg_kwds):
self.types = types
self.types = util.promote_list(types)
self.fail_message = fail_message
Argument.__init__(self, **arg_kwds)

def _validate(self, args, i):
arg = args[i]

if not isinstance(arg, self.types):
if not self._type_matches(arg):
if isinstance(self.fail_message, py_string):
exc = self.fail_message
else:
Expand All @@ -460,6 +485,17 @@ def _validate(self, args, i):

return arg

def _type_matches(self, arg):
for t in self.types:
if (isinstance(t, dt.DataType) or
(isinstance(t, type) and issubclass(t, dt.DataType))):
if t.can_implicit_cast(arg.type()):
return True
else:
if isinstance(arg, t):
return True
return False
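
_type_matches lets a rule's allowed types mix ibis DataType instances (matched via implicit casting) with plain Python classes (matched via isinstance). A simplified standalone sketch, with a toy cast table standing in for can_implicit_cast:

```python
# Toy implicit-cast table: each target type names the source types
# it accepts. Purely illustrative, not ibis's real cast rules.
IMPLICIT_CASTS = {
    'double': {'double', 'float', 'int64', 'int32'},
    'string': {'string'},
}

def type_matches(arg_type, allowed):
    # String entries name a target type with implicit casts; anything
    # else is treated as a Python class to isinstance-check.
    for t in allowed:
        if isinstance(t, str):
            if arg_type in IMPLICIT_CASTS.get(t, ()):
                return True
        elif isinstance(arg_type, t):
            return True
    return False

print(type_matches('int32', ['double']))   # True
print(type_matches('string', ['double']))  # False
```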


class ValueTyped(AnyTyped, ValueArgument):

Expand All @@ -474,8 +510,7 @@ def _validate(self, args, i):
class MultipleTypes(Argument):

def __init__(self, types, **arg_kwds):
self.types = [t() if not isinstance(t, Argument) else t
for t in types]
self.types = [_to_argument(t) for t in types]
Argument.__init__(self, **arg_kwds)

def _validate(self, args, i):
Expand All @@ -487,8 +522,7 @@ def _validate(self, args, i):
class OneOf(Argument):

def __init__(self, types, **arg_kwds):
self.types = [t() if not isinstance(t, Argument) else t
for t in types]
self.types = [_to_argument(t) for t in types]
Argument.__init__(self, **arg_kwds)

def _validate(self, args, i):
Expand Down Expand Up @@ -582,11 +616,15 @@ def _validate(self, args, i):


def integer(**arg_kwds):
return ValueTyped(ir.IntegerValue, 'not integer', **arg_kwds)
return ValueTyped(dt.int_, 'not integer', **arg_kwds)


def double(**arg_kwds):
return ValueTyped(dt.double, 'not double', **arg_kwds)


def decimal(**arg_kwds):
return ValueTyped(ir.DecimalValue, 'not decimal', **arg_kwds)
return ValueTyped(dt.Decimal, 'not decimal', **arg_kwds)


def timestamp(**arg_kwds):
Expand All @@ -599,11 +637,11 @@ def timedelta(**arg_kwds):


def string(**arg_kwds):
return ValueTyped(ir.StringValue, 'not string', **arg_kwds)
return ValueTyped(dt.string, 'not string', **arg_kwds)


def boolean(**arg_kwds):
return ValueTyped(ir.BooleanValue, 'not string', **arg_kwds)
return ValueTyped(dt.boolean, 'not boolean', **arg_kwds)


def one_of(args, **arg_kwds):
Expand All @@ -630,9 +668,7 @@ def _validate(self, args, i):
class ListOf(Argument):

def __init__(self, value_type, min_length=0, **arg_kwds):
if not isinstance(value_type, Argument):
value_type = value_type()
self.value_type = value_type
self.value_type = _to_argument(value_type)
self.min_length = min_length
Argument.__init__(self, **arg_kwds)

Expand All @@ -653,7 +689,7 @@ def _validate(self, args, i):
try:
checked_arg = self.value_type.validate(arg, j)
except IbisTypeError as e:
exc = e.message
exc = e.args[0]
msg = ('List element {0} had a type error: {1}'
.format(j, exc))
raise IbisTypeError(msg)
Expand Down
15 changes: 15 additions & 0 deletions ibis/expr/tests/conftest.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
# Copyright 2015 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from ibis.tests.conftest import * # noqa
232 changes: 230 additions & 2 deletions ibis/expr/tests/mocks.py
Original file line number Diff line number Diff line change
Expand Up @@ -119,7 +119,224 @@ class MockConnection(SQLClient):
('timestamp_col', 'timestamp'),
('year', 'int32'),
('month', 'int32')
]
],
'airlines': [
('year', 'int32'),
('month', 'int32'),
('day', 'int32'),
('dayofweek', 'int32'),
('dep_time', 'int32'),
('crs_dep_time', 'int32'),
('arr_time', 'int32'),
('crs_arr_time', 'int32'),
('carrier', 'string'),
('flight_num', 'int32'),
('tail_num', 'int32'),
('actual_elapsed_time', 'int32'),
('crs_elapsed_time', 'int32'),
('airtime', 'int32'),
('arrdelay', 'int32'),
('depdelay', 'int32'),
('origin', 'string'),
('dest', 'string'),
('distance', 'int32'),
('taxi_in', 'int32'),
('taxi_out', 'int32'),
('cancelled', 'int32'),
('cancellation_code', 'string'),
('diverted', 'int32'),
('carrier_delay', 'int32'),
('weather_delay', 'int32'),
('nas_delay', 'int32'),
('security_delay', 'int32'),
('late_aircraft_delay', 'int32')
],
'tpcds_customer': [
('c_customer_sk', 'int64'),
('c_customer_id', 'string'),
('c_current_cdemo_sk', 'int32'),
('c_current_hdemo_sk', 'int32'),
('c_current_addr_sk', 'int32'),
('c_first_shipto_date_sk', 'int32'),
('c_first_sales_date_sk', 'int32'),
('c_salutation', 'string'),
('c_first_name', 'string'),
('c_last_name', 'string'),
('c_preferred_cust_flag', 'string'),
('c_birth_day', 'int32'),
('c_birth_month', 'int32'),
('c_birth_year', 'int32'),
('c_birth_country', 'string'),
('c_login', 'string'),
('c_email_address', 'string'),
('c_last_review_date', 'string')],
'tpcds_customer_address': [
('ca_address_sk', 'bigint'),
('ca_address_id', 'string'),
('ca_street_number', 'string'),
('ca_street_name', 'string'),
('ca_street_type', 'string'),
('ca_suite_number', 'string'),
('ca_city', 'string'),
('ca_county', 'string'),
('ca_state', 'string'),
('ca_zip', 'string'),
('ca_country', 'string'),
('ca_gmt_offset', 'decimal(5,2)'),
('ca_location_type', 'string')],
'tpcds_customer_demographics': [
('cd_demo_sk', 'bigint'),
('cd_gender', 'string'),
('cd_marital_status', 'string'),
('cd_education_status', 'string'),
('cd_purchase_estimate', 'int'),
('cd_credit_rating', 'string'),
('cd_dep_count', 'int'),
('cd_dep_employed_count', 'int'),
('cd_dep_college_count', 'int')],
'tpcds_date_dim': [
('d_date_sk', 'bigint'),
('d_date_id', 'string'),
('d_date', 'string'),
('d_month_seq', 'int'),
('d_week_seq', 'int'),
('d_quarter_seq', 'int'),
('d_year', 'int'),
('d_dow', 'int'),
('d_moy', 'int'),
('d_dom', 'int'),
('d_qoy', 'int'),
('d_fy_year', 'int'),
('d_fy_quarter_seq', 'int'),
('d_fy_week_seq', 'int'),
('d_day_name', 'string'),
('d_quarter_name', 'string'),
('d_holiday', 'string'),
('d_weekend', 'string'),
('d_following_holiday', 'string'),
('d_first_dom', 'int'),
('d_last_dom', 'int'),
('d_same_day_ly', 'int'),
('d_same_day_lq', 'int'),
('d_current_day', 'string'),
('d_current_week', 'string'),
('d_current_month', 'string'),
('d_current_quarter', 'string'),
('d_current_year', 'string')],
'tpcds_household_demographics': [
('hd_demo_sk', 'bigint'),
('hd_income_band_sk', 'int'),
('hd_buy_potential', 'string'),
('hd_dep_count', 'int'),
('hd_vehicle_count', 'int')],
'tpcds_item': [
('i_item_sk', 'bigint'),
('i_item_id', 'string'),
('i_rec_start_date', 'string'),
('i_rec_end_date', 'string'),
('i_item_desc', 'string'),
('i_current_price', 'decimal(7,2)'),
('i_wholesale_cost', 'decimal(7,2)'),
('i_brand_id', 'int'),
('i_brand', 'string'),
('i_class_id', 'int'),
('i_class', 'string'),
('i_category_id', 'int'),
('i_category', 'string'),
('i_manufact_id', 'int'),
('i_manufact', 'string'),
('i_size', 'string'),
('i_formulation', 'string'),
('i_color', 'string'),
('i_units', 'string'),
('i_container', 'string'),
('i_manager_id', 'int'),
('i_product_name', 'string')],
'tpcds_promotion': [
('p_promo_sk', 'bigint'),
('p_promo_id', 'string'),
('p_start_date_sk', 'int'),
('p_end_date_sk', 'int'),
('p_item_sk', 'int'),
('p_cost', 'decimal(15,2)'),
('p_response_target', 'int'),
('p_promo_name', 'string'),
('p_channel_dmail', 'string'),
('p_channel_email', 'string'),
('p_channel_catalog', 'string'),
('p_channel_tv', 'string'),
('p_channel_radio', 'string'),
('p_channel_press', 'string'),
('p_channel_event', 'string'),
('p_channel_demo', 'string'),
('p_channel_details', 'string'),
('p_purpose', 'string'),
('p_discount_active', 'string')],
'tpcds_store': [
('s_store_sk', 'bigint'),
('s_store_id', 'string'),
('s_rec_start_date', 'string'),
('s_rec_end_date', 'string'),
('s_closed_date_sk', 'int'),
('s_store_name', 'string'),
('s_number_employees', 'int'),
('s_floor_space', 'int'),
('s_hours', 'string'),
('s_manager', 'string'),
('s_market_id', 'int'),
('s_geography_class', 'string'),
('s_market_desc', 'string'),
('s_market_manager', 'string'),
('s_division_id', 'int'),
('s_division_name', 'string'),
('s_company_id', 'int'),
('s_company_name', 'string'),
('s_street_number', 'string'),
('s_street_name', 'string'),
('s_street_type', 'string'),
('s_suite_number', 'string'),
('s_city', 'string'),
('s_county', 'string'),
('s_state', 'string'),
('s_zip', 'string'),
('s_country', 'string'),
('s_gmt_offset', 'decimal(5,2)'),
('s_tax_precentage', 'decimal(5,2)')],
'tpcds_store_sales': [
('ss_sold_time_sk', 'bigint'),
('ss_item_sk', 'bigint'),
('ss_customer_sk', 'bigint'),
('ss_cdemo_sk', 'bigint'),
('ss_hdemo_sk', 'bigint'),
('ss_addr_sk', 'bigint'),
('ss_store_sk', 'bigint'),
('ss_promo_sk', 'bigint'),
('ss_ticket_number', 'int'),
('ss_quantity', 'int'),
('ss_wholesale_cost', 'decimal(7,2)'),
('ss_list_price', 'decimal(7,2)'),
('ss_sales_price', 'decimal(7,2)'),
('ss_ext_discount_amt', 'decimal(7,2)'),
('ss_ext_sales_price', 'decimal(7,2)'),
('ss_ext_wholesale_cost', 'decimal(7,2)'),
('ss_ext_list_price', 'decimal(7,2)'),
('ss_ext_tax', 'decimal(7,2)'),
('ss_coupon_amt', 'decimal(7,2)'),
('ss_net_paid', 'decimal(7,2)'),
('ss_net_paid_inc_tax', 'decimal(7,2)'),
('ss_net_profit', 'decimal(7,2)'),
('ss_sold_date_sk', 'bigint')],
'tpcds_time_dim': [
('t_time_sk', 'bigint'),
('t_time_id', 'string'),
('t_time', 'int'),
('t_hour', 'int'),
('t_minute', 'int'),
('t_second', 'int'),
('t_am_pm', 'string'),
('t_shift', 'string'),
('t_sub_shift', 'string'),
('t_meal_time', 'string')]
}

def __init__(self):
Expand All @@ -129,12 +346,23 @@ def _get_table_schema(self, name):
name = name.replace('`', '')
return Schema.from_tuples(self._tables[name])

def execute(self, expr, limit=None):
def _build_ast(self, expr):
from ibis.impala.compiler import build_ast
return build_ast(expr)

def execute(self, expr, limit=None, async=False):
if async:
raise NotImplementedError
ast = self._build_ast_ensure_limit(expr, limit)
for query in ast.queries:
self.executed_queries.append(query.compile())
return None

def compile(self, expr, limit=None):
ast = self._build_ast_ensure_limit(expr, limit)
queries = [q.compile() for q in ast.queries]
return queries[0] if len(queries) == 1 else queries


_all_types_schema = [
('a', 'int8'),
Expand Down
24 changes: 21 additions & 3 deletions ibis/expr/tests/test_format.py
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,12 @@ def setUp(self):
]
self.schema_dict = dict(self.schema)
self.table = ibis.table(self.schema)
self.con = MockConnection()

def test_format_table_column(self):
# GH #507
result = repr(self.table.f)
assert 'Column[array(double)]' in result

def test_format_projection(self):
# This should produce a ref to the projection
Expand Down Expand Up @@ -118,9 +124,8 @@ def test_format_multiple_join_with_projection(self):
repr(view)

def test_memoize_database_table(self):
con = MockConnection()
table = con.table('test1')
table2 = con.table('test2')
table = self.con.table('test1')
table2 = self.con.table('test2')

filter_pred = table['f'] > 0
table3 = table[filter_pred]
Expand Down Expand Up @@ -148,6 +153,19 @@ def test_memoize_filtered_table(self):
result = repr(delay_filter)
assert result.count('Filter') == 1

def test_memoize_insert_sort_key(self):
table = self.con.table('airlines')

t = table['arrdelay', 'dest']
expr = (t.group_by('dest')
.mutate(dest_avg=t.arrdelay.mean(),
dev=t.arrdelay - t.arrdelay.mean()))

worst = expr[expr.dev.notnull()].sort_by(ibis.desc('dev')).limit(10)

result = repr(worst)
assert result.count('airlines') == 1

def test_named_value_expr_show_name(self):
expr = self.table.f * 2
expr2 = expr.name('baz')
Expand Down
5 changes: 5 additions & 0 deletions ibis/expr/tests/test_interactive.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,3 +86,8 @@ def test_histogram_repr_no_query_execute(self):
with config.option_context('interactive', True):
expr._repr()
assert self.con.executed_queries == []

def test_compile_no_execute(self):
t = self.con.table('functional_alltypes')
t.double_col.sum().compile()
assert self.con.executed_queries == []
87 changes: 80 additions & 7 deletions ibis/expr/tests/test_table.py
Original file line number Diff line number Diff line change
Expand Up @@ -57,11 +57,11 @@ def test_view_new_relation(self):
assert roots[0] is tview.op()

def test_get_type(self):
for k, v in self.schema_dict.iteritems():
for k, v in self.schema_dict.items():
assert self.table._get_type(k) == v

def test_getitem_column_select(self):
for k, v in self.schema_dict.iteritems():
for k, v in self.schema_dict.items():
col = self.table[k]

# Make sure it's the right type
Expand Down Expand Up @@ -142,7 +142,7 @@ def test_projection_unnamed_literal_interactive_blowup(self):
try:
table.select([table.bigint_col, ibis.literal(5)])
except Exception as e:
assert 'named' in e.message
assert 'named' in e.args[0]

def test_projection_of_aggregated(self):
# Fully-formed aggregations "block"; in a projection, column
Expand Down Expand Up @@ -173,6 +173,22 @@ def test_projection_convenient_syntax(self):
proj2 = self.table[[self.table, self.table['a'].name('foo')]]
assert_equal(proj, proj2)

def test_projection_mutate_analysis_bug(self):
# GH #549

t = self.con.table('airlines')

# it works!
(t[t.depdelay.notnull()]
.mutate(leg=ibis.literal('-').join([t.origin, t.dest]))
['year', 'month', 'day', 'depdelay', 'leg'])

def test_projection_self(self):
result = self.table[self.table]
expected = self.table.projection(self.table)

assert_equal(result, expected)

def test_add_column(self):
# Creates a projection with a select-all on top of a non-projection
# TableExpr
Expand Down Expand Up @@ -463,6 +479,17 @@ def test_projection_with_join_pushdown_rewrite_refs(self):
new_pred = filter_op.predicates[0]
assert_equal(new_pred, lower_pred)

def test_column_relabel(self):
# GH #551. Keeping the test case very high level to not presume that
# the relabel is necessarily implemented using a projection
types = ['int32', 'string', 'double']
table = api.table(zip(['foo', 'bar', 'baz'], types))
result = table.relabel({'foo': 'one', 'baz': 'three'})

schema = result.schema()
ex_schema = api.schema(zip(['one', 'bar', 'three'], types))
assert_equal(schema, ex_schema)

def test_limit(self):
limited = self.table.limit(10, offset=5)
assert limited.op().n == 10
Expand All @@ -476,7 +503,7 @@ def test_sort_by(self):
# Default is ascending for anything coercable to an expression,
# and we'll have ascending/descending wrappers to help.
result = self.table.sort_by(['f'])
sort_key = result.op().keys[0]
sort_key = result.op().keys[0].op()
assert_equal(sort_key.expr, self.table.f)
assert sort_key.ascending

Expand All @@ -488,9 +515,9 @@ def test_sort_by(self):
result3 = self.table.sort_by([('f', 'descending')])
result4 = self.table.sort_by([('f', 0)])

key2 = result2.op().keys[0]
key3 = result3.op().keys[0]
key4 = result4.op().keys[0]
key2 = result2.op().keys[0].op()
key3 = result3.op().keys[0].op()
key4 = result4.op().keys[0].op()

assert not key2.ascending
assert not key3.ascending
Expand Down Expand Up @@ -592,6 +619,19 @@ def test_aggregate_non_list_inputs(self):
expected = self.table.aggregate([metric], by=[by], having=[having])
assert_equal(result, expected)

def test_aggregate_keywords(self):
t = self.table

expr = t.aggregate(foo=t.f.sum(), bar=lambda x: x.f.mean(),
by='g')
expr2 = t.group_by('g').aggregate(foo=t.f.sum(),
bar=lambda x: x.f.mean())
expected = t.aggregate([t.f.mean().name('bar'),
t.f.sum().name('foo')], by='g')

assert_equal(expr, expected)
assert_equal(expr2, expected)

def test_summary_expand_list(self):
summ = self.table.f.summary()

Expand Down Expand Up @@ -788,6 +828,17 @@ def test_join_combo_with_projection(self):
proj = joined.projection([t, t2['foo'], t2['bar']])
repr(proj)

def test_join_getitem_projection(self):
region = self.con.table('tpch_region')
nation = self.con.table('tpch_nation')

pred = region.r_regionkey == nation.n_regionkey
joined = region.inner_join(nation, pred)

result = joined[nation]
expected = joined.projection(nation)
assert_equal(result, expected)

def test_self_join(self):
# Self-joins are problematic with this design because column
# expressions may reference either the left or right self. For example:
Expand Down Expand Up @@ -1251,6 +1302,28 @@ def g(x):
expected = self.table.mutate(foo=g)
assert_equal(result, expected)

def test_groupby_mutate(self):
t = self.table

g = t.group_by('g').order_by('f')
expr = g.mutate(foo=lambda x: x.f.lag(),
bar=lambda x: x.f.rank())
expected = g.mutate(foo=t.f.lag(),
bar=t.f.rank())

assert_equal(expr, expected)

def test_groupby_projection(self):
t = self.table

g = t.group_by('g').order_by('f')
expr = g.projection([lambda x: x.f.lag().name('foo'),
lambda x: x.f.rank().name('bar')])
expected = g.projection([t.f.lag().name('foo'),
t.f.rank().name('bar')])

assert_equal(expr, expected)

def test_set_column(self):
def g(x):
return x.f * 2
Expand Down
4 changes: 2 additions & 2 deletions ibis/expr/tests/test_temporal.py
Original file line number Diff line number Diff line change
Expand Up @@ -108,7 +108,7 @@ def test_downconvert_hours(self):
(offset.to_unit('s'), T.second(K * 3600)),
(offset.to_unit('ms'), T.millisecond(K * 3600000)),
(offset.to_unit('us'), T.microsecond(K * 3600000000)),
(offset.to_unit('ns'), T.nanosecond(K * 3600000000000L))
(offset.to_unit('ns'), T.nanosecond(K * 3600000000000))
]
self._check_cases(cases)

Expand All @@ -128,7 +128,7 @@ def test_downconvert_day(self):
(day.to_unit('s'), T.second(K * 86400)),
(day.to_unit('ms'), T.millisecond(K * 86400000)),
(day.to_unit('us'), T.microsecond(K * 86400000000)),
(day.to_unit('ns'), T.nanosecond(K * 86400000000000L))
(day.to_unit('ns'), T.nanosecond(K * 86400000000000))
]
self._check_cases(cases)

Expand Down
9 changes: 4 additions & 5 deletions ibis/expr/tests/test_value_exprs.py
Original file line number Diff line number Diff line change
Expand Up @@ -349,10 +349,9 @@ def test_number_to_string(self):
assert isinstance(casted_literal, api.StringScalar)
assert casted_literal.get_name() == 'bar'

def test_casted_exprs_are_unnamed(self):
def test_casted_exprs_are_named(self):
expr = self.table.f.cast('string')
with self.assertRaises(Exception):
expr.get_name()
assert expr.get_name() == 'cast(f, string)'

# it works! per GH #396
expr.value_counts()
Expand Down Expand Up @@ -493,7 +492,7 @@ def test_binop_string_type_error(self):
ints = self.table['a']
strs = self.table['g']

ops = ['add', 'mul', 'div', 'sub']
ops = ['add', 'mul', 'truediv', 'sub']
for name in ops:
f = getattr(operator, name)
self.assertRaises(TypeError, f, ints, strs)
Expand Down Expand Up @@ -571,7 +570,7 @@ def test_divide_literal_promotions(self):
('b', -5, 'double'),
('c', 5, 'double'),
]
self._check_literal_promote_cases(operator.div, cases)
self._check_literal_promote_cases(operator.truediv, cases)

def test_pow_literal_promotions(self):
cases = [
Expand Down
71 changes: 70 additions & 1 deletion ibis/expr/tests/test_window_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,76 @@ def test_compose_group_by_apis(self):
assert_equal(expr, expr3)

def test_combine_windows(self):
pass
t = self.t
w1 = ibis.window(group_by=t.g, order_by=t.f)
w2 = ibis.window(preceding=5, following=5)

w3 = w1.combine(w2)
expected = ibis.window(group_by=t.g, order_by=t.f,
preceding=5, following=5)
assert_equal(w3, expected)

w4 = ibis.window(group_by=t.a, order_by=t.e)
w5 = w3.combine(w4)
expected = ibis.window(group_by=[t.g, t.a],
order_by=[t.f, t.e],
preceding=5, following=5)
assert_equal(w5, expected)

def test_over_auto_bind(self):
# GH #542
t = self.t

w = ibis.window(group_by='g', order_by='f')

expr = t.f.lag().over(w)

actual_window = expr.op().args[1]
expected = ibis.window(group_by=t.g, order_by=t.f)
assert_equal(actual_window, expected)

def test_window_function_bind(self):
# GH #532
t = self.t

w = ibis.window(group_by=lambda x: x.g,
order_by=lambda x: x.f)

expr = t.f.lag().over(w)

actual_window = expr.op().args[1]
expected = ibis.window(group_by=t.g, order_by=t.f)
assert_equal(actual_window, expected)

def test_auto_windowize_analysis_bug(self):
# GH #544
t = self.con.table('airlines')

annual_delay = (t[t.dest.isin(['JFK', 'SFO'])]
.group_by(['dest', 'year'])
.aggregate(t.arrdelay.mean().name('avg_delay')))
what = annual_delay.group_by('dest')
enriched = what.mutate(grand_avg=annual_delay.avg_delay.mean())

expr = (annual_delay.avg_delay.mean().name('grand_avg')
.over(ibis.window(group_by=annual_delay.dest)))
expected = annual_delay[annual_delay, expr]

assert_equal(enriched, expected)

def test_mutate_sorts_keys(self):
t = self.con.table('airlines')

m = t.arrdelay.mean()

g = t.group_by('dest')

result = g.mutate(zzz=m, yyy=m, ddd=m, ccc=m, bbb=m, aaa=m)

expected = g.mutate([m.name('aaa'), m.name('bbb'), m.name('ccc'),
m.name('ddd'), m.name('yyy'), m.name('zzz')])

assert_equal(result, expected)

def test_window_bind_to_table(self):
w = ibis.window(group_by='g', order_by=ibis.desc('f'))
Expand Down
249 changes: 85 additions & 164 deletions ibis/expr/types.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,15 +17,11 @@

from ibis.common import IbisError, RelationError
import ibis.common as com
import ibis.compat as compat
import ibis.config as config
import ibis.util as util


def _ops():
import ibis.expr.operations as mod
return mod


class Parameter(object):

"""
Expand Down Expand Up @@ -53,9 +49,11 @@ def __repr__(self):
try:
result = self.execute()
return repr(result)
except com.TranslationError:
output = ('Translation to backend failed, repr follows:\n%s'
% self._repr())
except com.TranslationError as e:
output = ('Translation to backend failed\n'
'Error message: {0}\n'
'Expression repr follows:\n{1}'
.format(e.args[0], self._repr()))
return output
else:
return self._repr()
Expand Down Expand Up @@ -120,7 +118,7 @@ def factory(arg, name=None):
def _can_implicit_cast(self, arg):
return False

def execute(self, limit=None):
def execute(self, limit=None, async=False):
"""
If this expression is based on physical tables in a database backend,
execute it against that backend.
Expand All @@ -130,9 +128,30 @@ def execute(self, limit=None):
result : expression-dependent
Result of compiling expression and executing in backend
"""
import ibis.expr.analysis as L
backend = L.find_backend(self)
return backend.execute(self, limit=limit)
from ibis.client import execute
return execute(self, limit=limit, async=async)

def compile(self, limit=None):
"""
Compile the expression for its execution target, without executing it
Returns
-------
compiled : value or list
query representation or list thereof
"""
from ibis.client import compile
return compile(self, limit=limit)

def verify(self):
"""
Returns True if the expression can be compiled against its attached client
"""
try:
self.compile()
return True
except Exception:
return False
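The `verify` pattern above — attempt compilation and report success as a boolean — can be sketched in isolation. Everything below (`TranslationError`, `compile_expr`) is a toy stand-in, not the real ibis API:

```python
class TranslationError(Exception):
    """Raised when an expression cannot be translated for a backend."""


def compile_expr(expr):
    # Toy "compiler": only string expressions are considered translatable.
    if not isinstance(expr, str):
        raise TranslationError('cannot translate {0!r}'.format(expr))
    return 'SELECT {0}'.format(expr)


def verify(expr):
    # Mirrors Expr.verify: True iff compilation succeeds.
    try:
        compile_expr(expr)
        return True
    except TranslationError:
        return False


print(verify('a + 1'))  # True
print(verify(42))       # False
```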

def equals(self, other):
if type(self) != type(other):
@@ -266,6 +285,36 @@ def resolve_name(self):
raise com.ExpressionError('Expression is not named: %s' % repr(self))


class TableColumn(ValueNode):

"""
Selects a column from a TableExpr
"""

def __init__(self, name, table_expr):
Node.__init__(self, [name, table_expr])

if name not in table_expr.schema():
raise KeyError("'{0}' is not a field".format(name))

self.name = name
self.table = table_expr

def parent(self):
return self.table

def resolve_name(self):
return self.name

def root_tables(self):
return self.table._root_tables()

def to_expr(self):
ctype = self.table._get_type(self.name)
klass = ctype.array_type()
return klass(self, name=self.name)
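`TableColumn` validates the column name against the table's schema at construction time, so a typo fails immediately rather than at execution. A minimal stand-in (these class names are illustrative, not ibis's):

```python
class Table(object):
    def __init__(self, schema):
        self._schema = schema  # dict: column name -> type name

    def schema(self):
        return self._schema


class Column(object):
    # Mirrors TableColumn.__init__: reject names absent from the schema.
    def __init__(self, name, table):
        if name not in table.schema():
            raise KeyError("'{0}' is not a field".format(name))
        self.name = name
        self.table = table


t = Table({'a': 'int32', 'b': 'string'})
print(Column('a', t).name)  # a
```

Constructing `Column('z', t)` raises `KeyError` because `'z'` is not in the schema.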


class ExpressionList(Node):

def __init__(self, exprs):
@@ -335,7 +384,7 @@ def output_type(self):
import ibis.expr.rules as rules
if isinstance(self.value, bool):
klass = BooleanScalar
elif isinstance(self.value, (int, long)):
elif isinstance(self.value, compat.integer_types):
int_type = rules.int_literal_class(self.value)
klass = int_type.scalar_type()
elif isinstance(self.value, float):
@@ -502,7 +551,7 @@ def __getitem__(self, what):
if isinstance(what, AnalyticExpr):
what = what._table_getitem()

if isinstance(what, (list, tuple)):
if isinstance(what, (list, tuple, TableExpr)):
# Projection case
return self.projection(what)
elif isinstance(what, BooleanArray):
@@ -547,17 +596,6 @@ def _ensure_expr(self, expr):
def _get_type(self, name):
return self._arg.get_type(name)

def materialize(self):
"""
Force schema resolution for a joined table, selecting all fields from
all tables.
"""
if self._is_materialized():
return self
else:
op = _ops().MaterializedJoin(self)
return TableExpr(op)

def get_columns(self, iterable):
"""
Get multiple columns from the table
@@ -580,7 +618,7 @@ def get_column(self, name):
-------
column : array expression
"""
ref = _ops().TableColumn(name, self)
ref = TableColumn(name, self)
return ref.to_expr()

@property
@@ -599,33 +637,10 @@ def schema(self):
raise IbisError('Table operation is not yet materialized')
return self.op().get_schema()

def to_array(self):
"""
Single column tables can be viewed as arrays.
"""
op = _ops().TableArrayView(self)
return op.to_expr()

def _is_materialized(self):
# The operation produces a known schema
return self.op().has_schema()

def view(self):
"""
Create a new table expression that is semantically equivalent to the
current one, but is considered a distinct relation for evaluation
purposes (e.g. in SQL).
For doing any self-referencing operations, like a self-join, you will
use this operation to create a reference to the current table
expression.
Returns
-------
expr : TableExpr
"""
return TableExpr(_ops().SelfReference(self))

def add_column(self, expr, name=None):
"""
Add indicated column expression to table, producing a new table. Note:
@@ -645,38 +660,6 @@ def add_column(self, expr, name=None):

return self.projection([self, expr])

def distinct(self):
"""
Compute set of unique rows/tuples occurring in this table
"""
op = _ops().Distinct(self)
return op.to_expr()

def projection(self, exprs):
"""
Compute new table expression with the indicated column expressions from
this table.
Parameters
----------
exprs : column expression, or string, or list of column expressions and
strings. If strings passed, must be columns in the table already
Returns
-------
projection : TableExpr
"""
import ibis.expr.analysis as L

if isinstance(exprs, (Expr,) + six.string_types):
exprs = [exprs]

exprs = [self._ensure_expr(e) for e in exprs]
op = L.Projector(self, exprs).get_result()
return TableExpr(op)

select = projection

def group_by(self, by):
"""
Create an intermediate grouped table expression, pending some group
@@ -693,88 +676,6 @@ def group_by(self, by):
from ibis.expr.groupby import GroupedTableExpr
return GroupedTableExpr(self, by)

def aggregate(self, agg_exprs, by=None, having=None):
"""
Aggregate a table with a given set of reductions, with grouping
expressions, and post-aggregation filters.
Parameters
----------
agg_exprs : expression or expression list
by : optional, default None
Grouping expressions
having : optional, default None
Post-aggregation filters
Returns
-------
agg_expr : TableExpr
"""
op = _ops().Aggregation(self, agg_exprs, by=by, having=having)
return TableExpr(op)

def limit(self, n, offset=0):
"""
Select the first n rows of the table (may not be deterministic
depending on the implementation and the presence of a sort order).
Parameters
----------
n : int
Rows to include
offset : int, default 0
Number of rows to skip first
Returns
-------
limited : TableExpr
"""
op = _ops().Limit(self, n, offset=offset)
return TableExpr(op)

def sort_by(self, sort_exprs):
"""
Sort table by the indicated column expressions and sort orders
(ascending/descending)
Parameters
----------
sort_exprs : sorting expressions
Must be one of:
- Column name or expression
- Sort key, e.g. desc(col)
- (column name, True (ascending) / False (descending))
Examples
--------
sorted = table.sort_by([('a', True), ('b', False)])
Returns
-------
sorted : TableExpr
"""
op = _ops().SortBy(self, sort_exprs)
return TableExpr(op)

def union(self, other, distinct=False):
"""
Form the table set union of two table expressions having identical
schemas.
Parameters
----------
other : TableExpr
distinct : boolean, default False
Only union distinct rows not occurring in the calling table (this
can be very expensive, be careful)
Returns
-------
union : TableExpr
"""
op = _ops().Union(self, other, distinct=distinct)
return TableExpr(op)


# -----------------------------------------------------------------------------
# Declare all typed ValueExprs. This is what the user will actually interact
@@ -1160,6 +1061,10 @@ class ListExpr(ArrayExpr, AnyValue):
pass


class SortExpr(Expr):
pass


class ValueList(ValueNode):

"""
@@ -1168,7 +1073,7 @@ def __init__(self, args):

def __init__(self, args):
self.values = [as_value_expr(x) for x in args]
ValueNode.__init__(self, [self.values])
ValueNode.__init__(self, self.values)

def root_tables(self):
return distinct_roots(*self.values)
@@ -1193,3 +1098,19 @@ def find_base_table(expr):
r = find_base_table(arg)
if isinstance(r, TableExpr):
return r


def find_all_base_tables(expr, memo=None):
if memo is None:
memo = {}

if isinstance(expr, TableExpr):
if id(expr) not in memo:
memo[id(expr)] = expr
return memo

for arg in expr.op().flat_args():
if isinstance(arg, Expr):
find_all_base_tables(arg, memo)

return memo
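`find_all_base_tables` walks the expression tree and memoizes tables by `id()`, so a table referenced many times is collected exactly once. The same idea on a toy tuple-based tree (the node encoding here is hypothetical):

```python
def find_tables(node, memo=None):
    # Nodes are ('kind', [children]); 'table' nodes are leaves.
    # Memoizing on id() deduplicates repeated references to one object.
    if memo is None:
        memo = {}
    kind, children = node
    if kind == 'table':
        memo[id(node)] = node
        return memo
    for child in children:
        if isinstance(child, tuple):
            find_tables(child, memo)
    return memo


t = ('table', [])
expr = ('add', [('column', [t]), ('column', [t])])
print(len(find_tables(expr)))  # 1 -- one distinct table, referenced twice
```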
36 changes: 31 additions & 5 deletions ibis/expr/window.py
@@ -43,11 +43,14 @@ def __init__(self, group_by=None, order_by=None,
order_by = []

self._group_by = util.promote_list(group_by)
self._order_by = util.promote_list(order_by)
self._order_by = [ops.SortKey(expr)
if isinstance(expr, ir.Expr)
else expr
for expr in self._order_by]

self._order_by = []
for x in util.promote_list(order_by):
if isinstance(x, ir.SortExpr):
pass
elif isinstance(x, ir.Expr):
x = ops.SortKey(x).to_expr()
self._order_by.append(x)

self.preceding = _list_to_tuple(preceding)
self.following = _list_to_tuple(following)
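The constructor above normalizes `order_by` so a caller can pass a single expression, a list, or ready-made sort expressions interchangeably. A sketch of that normalization with stand-in `SortKey`/`promote_list` helpers:

```python
def promote_list(value):
    # Wrap a scalar in a list; leave lists/tuples as lists.
    return list(value) if isinstance(value, (list, tuple)) else [value]


class SortKey(object):
    def __init__(self, expr, ascending=True):
        self.expr = expr
        self.ascending = ascending


def normalize_order_by(order_by):
    # Mirrors the loop above: plain expressions get wrapped in a
    # SortKey; existing sort expressions pass through untouched.
    keys = []
    for item in promote_list(order_by):
        if not isinstance(item, SortKey):
            item = SortKey(item)
        keys.append(item)
    return keys


keys = normalize_order_by(['f', SortKey('g', ascending=False)])
print([(k.expr, k.ascending) for k in keys])  # [('f', True), ('g', False)]
```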
@@ -208,3 +211,26 @@ def trailing_window(periods, group_by=None, order_by=None):
"""
return Window(preceding=periods, following=0,
group_by=group_by, order_by=order_by)


def propagate_down_window(expr, window):
op = expr.op()

clean_args = []
unchanged = True
for arg in op.args:
if (isinstance(arg, ir.Expr) and
not isinstance(op, ops.WindowOp)):
new_arg = propagate_down_window(arg, window)
if isinstance(new_arg.op(), ops.AnalyticOp):
new_arg = ops.WindowOp(new_arg, window).to_expr()
if arg is not new_arg:
unchanged = False
arg = new_arg

clean_args.append(arg)

if unchanged:
return expr
else:
return type(op)(*clean_args).to_expr()
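`propagate_down_window` rebuilds a node only when one of its arguments actually changed, otherwise returning the original expression untouched — a common optimization for tree rewrites. The core of that idea on a plain tuple tree:

```python
def rewrite(node, transform):
    # Leaves go through `transform`; interior tuples are rebuilt only
    # if some child changed, mirroring the `unchanged` flag above.
    if not isinstance(node, tuple):
        return transform(node)
    unchanged = True
    new_children = []
    for child in node:
        new_child = rewrite(child, transform)
        if new_child is not child:
            unchanged = False
        new_children.append(new_child)
    return node if unchanged else tuple(new_children)


tree = (1, (2, 3), 4)
print(rewrite(tree, lambda x: x) is tree)  # True -- nothing changed, reused
print(rewrite(tree, lambda x: x * 10))     # (10, (20, 30), 40)
```

Reusing unchanged subtrees keeps object identity stable, which matters when later passes compare nodes with `is`.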
170 changes: 32 additions & 138 deletions ibis/filesystems.py
@@ -15,19 +15,13 @@
# This file may adapt small portions of https://github.com/mtth/hdfs (MIT
# license), see the LICENSES directory.

from os import path as osp
from posixpath import join as pjoin
import os
import posixpath
import shutil

import six

from ibis.config import options
import ibis.common as com
import ibis.util as util

from hdfs.util import temppath


class HDFSError(com.IbisError):
@@ -111,6 +105,13 @@ def get(self, hdfs_path, local_path='.', overwrite=False):
----------
hdfs_path : string
local_path : string, default '.'
Further keyword arguments passed down to any internal API used.
Returns
-------
written_path : string
The path to the written file or directory
"""
raise NotImplementedError

@@ -179,7 +180,7 @@ def write(self, hdfs_path, buf, overwrite=False, blocksize=None,
replication=None, buffersize=None):
raise NotImplementedError

def mkdir(self, path, create_parent=False):
def mkdir(self, path):
pass

def ls(self, hdfs_path, status=False):
@@ -231,7 +232,7 @@ def rmdir(self, path):
"""
self.client.delete(path, recursive=True)

def find_any_file(self, hdfs_dir):
def _find_any_file(self, hdfs_dir):
contents = self.ls(hdfs_dir, status=True)

def valid_filename(name):
@@ -270,50 +271,27 @@ def status(self, path):

@implements(HDFS.chmod)
def chmod(self, path, permissions):
self.client.set_permissions(path, permissions)
self.client.set_permission(path, permissions)

@implements(HDFS.chown)
def chown(self, path, owner=None, group=None):
self.client.set_owner(path, owner, group)

@implements(HDFS.exists)
def exists(self, path):
try:
self.client.status(path)
return True
except Exception:
return False
return self.client.status(path, strict=False) is not None

@implements(HDFS.ls)
def ls(self, hdfs_path, status=False):
contents = self.client.list(hdfs_path)
if not status:
return [path for path, detail in contents]
else:
return contents
return self.client.list(hdfs_path, status=status)

@implements(HDFS.mkdir)
def mkdir(self, dir_path, create_parent=False):
# ugh, see #252

# create a temporary file, then delete it
dummy = pjoin(dir_path, util.guid())
self.client.write(dummy, '')
self.client.delete(dummy)
def mkdir(self, dir_path):
self.client.makedirs(dir_path)

@implements(HDFS.size)
def size(self, hdfs_path):
stat = self.status(hdfs_path)

if stat['type'] == 'FILE':
return stat['length']
elif stat['type'] == 'DIRECTORY':
total = 0
for path in self.ls(hdfs_path):
total += self.size(path)
return total
else:
raise NotImplementedError
return self.client.content(hdfs_path)['length']

@implements(HDFS.mv)
def mv(self, hdfs_path_src, hdfs_path_dest, overwrite=True):
@@ -324,118 +302,34 @@ def mv(self, hdfs_path_src, hdfs_path_dest, overwrite=True):

def delete(self, hdfs_path, recursive=False):
"""
Delete a file.
"""
return self.client.delete(hdfs_path, recursive=recursive)

@implements(HDFS.head)
def head(self, hdfs_path, nbytes=1024, offset=0):
gen = self.client.read(hdfs_path, offset=offset, length=nbytes)
return ''.join(gen)
_reader = self.client.read(hdfs_path, offset=offset, length=nbytes)
with _reader as reader:
return reader.read()

@implements(HDFS.put)
def put(self, hdfs_path, resource, overwrite=False, verbose=None,
**kwargs):
verbose = verbose or options.verbose
is_path = isinstance(resource, six.string_types)

if is_path and osp.isdir(resource):
for dirpath, dirnames, filenames in os.walk(resource):
rel_dir = osp.relpath(dirpath, resource)
if rel_dir == '.':
rel_dir = ''
for fpath in filenames:
abs_path = osp.join(dirpath, fpath)
rel_hdfs_path = pjoin(hdfs_path, rel_dir, fpath)
self.put(rel_hdfs_path, abs_path, overwrite=overwrite,
verbose=verbose, **kwargs)
if isinstance(resource, six.string_types):
# `resource` is a path.
return self.client.upload(hdfs_path, resource, overwrite=overwrite,
**kwargs)
else:
if is_path:
basename = os.path.basename(resource)
if self.exists(hdfs_path):
if self.status(hdfs_path)['type'] == 'DIRECTORY':
hdfs_path = pjoin(hdfs_path, basename)
if verbose:
self.log('Writing local {0} to HDFS {1}'.format(resource,
hdfs_path))
self.client.upload(hdfs_path, resource,
overwrite=overwrite, **kwargs)
else:
if verbose:
self.log('Writing buffer to HDFS {0}'.format(hdfs_path))
resource.seek(0)
self.client.write(hdfs_path, resource, overwrite=overwrite,
**kwargs)
# `resource` is a file-like object.
hdfs_path = self.client.resolve(hdfs_path)
self.client.write(hdfs_path, data=resource, overwrite=overwrite,
**kwargs)
return hdfs_path

@implements(HDFS.get)
def get(self, hdfs_path, local_path, overwrite=False, verbose=None):
def get(self, hdfs_path, local_path, overwrite=False, verbose=None,
**kwargs):
verbose = verbose or options.verbose

hdfs_path = hdfs_path.rstrip(posixpath.sep)

if osp.isdir(local_path) and not overwrite:
dest = osp.join(local_path, posixpath.basename(hdfs_path))
else:
local_dir = osp.dirname(local_path) or '.'
if osp.isdir(local_dir):
dest = local_path
else:
# fail early
raise HDFSError('Parent directory %s does not exist',
local_dir)

# TODO: threadpool

def _get_file(remote, local):
if verbose:
self.log('Writing HDFS {0} to local {1}'.format(remote, local))
self.client.download(remote, local, overwrite=overwrite)

def _scrape_dir(path, dst):
objs = self.client.list(path)
for hpath, detail in objs:
relpath = posixpath.relpath(hpath, hdfs_path)
full_opath = pjoin(dst, relpath)

if detail['type'] == 'FILE':
_get_file(hpath, full_opath)
else:
os.makedirs(full_opath)
_scrape_dir(hpath, dst)

status = self.status(hdfs_path)
if status['type'] == 'FILE':
if not overwrite and osp.exists(local_path):
raise IOError('{0} exists'.format(local_path))

_get_file(hdfs_path, local_path)
else:
# TODO: partitioned files

with temppath() as tpath:
_temp_dir_path = osp.join(tpath, posixpath.basename(hdfs_path))
os.makedirs(_temp_dir_path)
_scrape_dir(hdfs_path, _temp_dir_path)

if verbose:
self.log('Moving {0} to {1}'.format(_temp_dir_path,
local_path))

if overwrite and osp.exists(local_path):
# swap and delete
local_swap_path = util.guid()
shutil.move(local_path, local_swap_path)

try:
shutil.move(_temp_dir_path, local_path)
if verbose:
msg = 'Deleting original {0}'.format(local_path)
self.log(msg)
shutil.rmtree(local_swap_path)
except:
# undo our diddle
shutil.move(local_swap_path, local_path)
else:
shutil.move(_temp_dir_path, local_path)

return dest
return self.client.download(hdfs_path, local_path, overwrite=overwrite,
**kwargs)
92 changes: 66 additions & 26 deletions ibis/impala/api.py
@@ -13,49 +13,89 @@

from ibis.impala.client import (ImpalaConnection, ImpalaClient, # noqa
Database, ImpalaTable)
from ibis.impala.udf import add_operation, wrap_udf, wrap_uda # noqa
from ibis.impala.udf import * # noqa
from ibis.impala.madlib import MADLibAPI # noqa
from ibis.config import options
import ibis.common as com


def connect(host='localhost', port=21050, protocol='hiveserver2',
database='default', timeout=45, use_ssl=False, ca_cert=None,
use_ldap=False, ldap_user=None, ldap_password=None,
use_kerberos=False, kerberos_service_name='impala',
pool_size=8):
def compile(expr):
"""
Create an Impala Client for use with Ibis
Force compilation of the expression as though it depended on Impala.
Note that you can also call expr.compile()
Returns
-------
compiled : string
"""
from .compiler import to_sql
return to_sql(expr)


def verify(expr):
"""
Determine if expression can be successfully translated to execute on Impala
"""
try:
compile(expr)
return True
except com.TranslationError:
return False


def connect(host='localhost', port=21050, database='default', timeout=45,
use_ssl=False, ca_cert=None, user=None, password=None,
auth_mechanism='NOSASL', kerberos_service_name='impala',
pool_size=8, hdfs_client=None):
"""
Create an ImpalaClient for use with Ibis.
Parameters
----------
host : host name
port : int, default 21050 (HiveServer 2)
protocol : {'hiveserver2', 'beeswax'}
database :
timeout :
use_ssl :
ca_cert :
use_ldap : boolean, default False
ldap_user :
ldap_password :
use_kerberos : boolean, default False
kerberos_service_name : string, default 'impala'
host : string, Host name of the impalad or HiveServer2 in Hive
port : int, Defaults to 21050 (Impala's HiveServer2)
database : string, Default database when obtaining new cursors
timeout : int, Connection timeout (seconds) when communicating with
HiveServer2
use_ssl : boolean, Use SSL when connecting to HiveServer2
ca_cert : string, Local path to 3rd party CA certificate or copy of server
certificate for self-signed certificates. If SSL is enabled, but this
argument is None, then certificate validation is skipped.
user : string, LDAP user to authenticate
password : string, LDAP password to authenticate
auth_mechanism : string, {'NOSASL' <- default, 'PLAIN', 'GSSAPI', 'LDAP'}.
Use NOSASL for non-secured Impala connections. Use PLAIN for
non-secured Hive clusters. Use LDAP for LDAP authenticated
connections. Use GSSAPI for Kerberos-secured clusters.
kerberos_service_name : string, Specify particular impalad service
principal.
Examples
--------
>>> hdfs = ibis.hdfs_connect(**hdfs_params)
>>> client = ibis.impala.connect(hdfs_client=hdfs, **impala_params)
Returns
-------
con : ImpalaConnection
con : ImpalaClient
"""
params = {
'host': host,
'port': port,
'protocol': protocol,
'database': database,
'timeout': timeout,
'use_ssl': use_ssl,
'ca_cert': ca_cert,
'use_ldap': use_ldap,
'ldap_user': ldap_user,
'ldap_password': ldap_password,
'use_kerberos': use_kerberos,
'user': user,
'password': password,
'auth_mechanism': auth_mechanism,
'kerberos_service_name': kerberos_service_name
}

return ImpalaConnection(pool_size=pool_size, **params)
con = ImpalaConnection(pool_size=pool_size, **params)
client = ImpalaClient(con, hdfs_client=hdfs_client)

if options.default_backend is None:
options.default_backend = client

return client
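`connect` ends by registering the new client as the global default backend if none is set yet; later connections do not overwrite it. A sketch of that first-wins registration (toy `options`/`Client` objects, not the ibis ones):

```python
class Options(object):
    default_backend = None


options = Options()


class Client(object):
    def __init__(self, name):
        self.name = name


def connect(name):
    # Mirrors the tail of ibis.impala.connect: the first client created
    # becomes the process-wide default; later ones leave it alone.
    client = Client(name)
    if options.default_backend is None:
        options.default_backend = client
    return client


first = connect('cluster-a')
second = connect('cluster-b')
print(options.default_backend is first)  # True
```

This lets `Expr.execute()` find a backend implicitly when the user has made exactly one connection.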