385 changes: 382 additions & 3 deletions docs/source/api.rst
Original file line number Diff line number Diff line change
@@ -1,3 +1,382 @@
===
API
===
.. currentmodule:: ibis
.. _api:

*************
API Reference
*************

.. currentmodule:: ibis

.. _api.client:

Creating connections
--------------------

These methods are in the ``ibis`` module namespace and are your main point of
entry to using Ibis.

.. autosummary::
:toctree: generated/

make_client
impala_connect
hdfs_connect

Impala client
-------------

These methods are available on the Impala client object after connecting to
your Impala cluster, HDFS cluster, and creating the client with
``ibis.make_client``.

Table methods
~~~~~~~~~~~~~
.. autosummary::
:toctree: generated/

ImpalaClient.table
ImpalaClient.sql
ImpalaClient.list_tables
ImpalaClient.exists_table
ImpalaClient.drop_table
ImpalaClient.create_table
ImpalaClient.insert
ImpalaClient.truncate_table
ImpalaClient.get_schema
ImpalaClient.cache_table

Creating views is also possible:

.. autosummary::
:toctree: generated/

ImpalaClient.create_view
ImpalaClient.drop_view
ImpalaClient.drop_table_or_view

Accessing data formats in HDFS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: generated/

ImpalaClient.avro_file
ImpalaClient.delimited_file
ImpalaClient.parquet_file

Database methods
~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: generated/

ImpalaClient.set_database
ImpalaClient.create_database
ImpalaClient.drop_database
ImpalaClient.list_databases
ImpalaClient.exists_database

Executing expressions
~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: generated/

ImpalaClient.execute
ImpalaClient.disable_codegen

.. _api.hdfs:

HDFS
----

Client objects have an ``hdfs`` attribute you can use to interact directly with
HDFS.

.. autosummary::
:toctree: generated/

HDFS.ls
HDFS.get
HDFS.head
HDFS.put
HDFS.put_tarfile
HDFS.rm
HDFS.rmdir
HDFS.size
HDFS.status

Top-level expression APIs
-------------------------

These methods are available directly in the ``ibis`` module namespace.

.. autosummary::
:toctree: generated/

case
literal
schema
table
timestamp
where
ifelse
coalesce
greatest
least
negate
desc
now
NA
null
expr_list
row_number
window
trailing_window
cumulative_window

.. _api.table:

Table methods
-------------

.. currentmodule:: ibis.expr.api

.. autosummary::
:toctree: generated/

TableExpr.add_column
TableExpr.aggregate
TableExpr.count
TableExpr.distinct
TableExpr.filter
TableExpr.get_column
TableExpr.get_columns
TableExpr.group_by
TableExpr.limit
TableExpr.mutate
TableExpr.pipe
TableExpr.projection
TableExpr.schema
TableExpr.set_column
TableExpr.sort_by
TableExpr.union
TableExpr.view

TableExpr.join
TableExpr.cross_join
TableExpr.inner_join
TableExpr.left_join
TableExpr.outer_join
TableExpr.semi_join
TableExpr.anti_join


Grouped table methods
~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: generated/

GroupedTableExpr.aggregate
GroupedTableExpr.count
GroupedTableExpr.having
GroupedTableExpr.mutate
GroupedTableExpr.order_by
GroupedTableExpr.over
GroupedTableExpr.projection
GroupedTableExpr.size

Generic value methods
---------------------

.. _api.functions:

Scalar or array methods
~~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: generated/

ValueExpr.between
ValueExpr.cast
ValueExpr.fillna
ValueExpr.isin
ValueExpr.notin
ValueExpr.nullif
ValueExpr.hash
ValueExpr.isnull
ValueExpr.notnull
ValueExpr.over

ValueExpr.add
ValueExpr.sub
ValueExpr.mul
ValueExpr.div
ValueExpr.pow
ValueExpr.rdiv
ValueExpr.rsub

Array methods
~~~~~~~~~~~~~

.. autosummary::
:toctree: generated/

ArrayExpr.case
ArrayExpr.cases
ArrayExpr.distinct

ArrayExpr.count
ArrayExpr.min
ArrayExpr.max
ArrayExpr.approx_median
ArrayExpr.approx_nunique
ArrayExpr.group_concat
ArrayExpr.nunique
ArrayExpr.summary

ArrayExpr.value_counts

ArrayExpr.first
ArrayExpr.last
ArrayExpr.dense_rank
ArrayExpr.rank
ArrayExpr.lag
ArrayExpr.lead
ArrayExpr.cummin
ArrayExpr.cummax

General numeric methods
-----------------------

Scalar or array methods
~~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: generated/

NumericValue.abs
NumericValue.ceil
NumericValue.floor
NumericValue.sign
NumericValue.exp


Array methods
~~~~~~~~~~~~~

.. autosummary::
:toctree: generated/

NumericArray.sum
NumericArray.mean

NumericArray.cumsum
NumericArray.cummean

NumericArray.bottomk
NumericArray.topk
NumericArray.bucket
NumericArray.histogram

Integer methods
---------------

Scalar or array methods
~~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: generated/

IntegerValue.to_timestamp

.. _api.string:

String methods
--------------

All string operations are valid on either scalar or array values.

.. autosummary::
:toctree: generated/

StringValue.length
StringValue.lower
StringValue.upper
StringValue.reverse
StringValue.ascii_str
StringValue.strip
StringValue.lstrip
StringValue.rstrip
StringValue.capitalize
StringValue.contains
StringValue.like
StringValue.parse_url
StringValue.substr
StringValue.left
StringValue.right
StringValue.repeat
StringValue.find
StringValue.translate
StringValue.find_in_set
StringValue.join
StringValue.lpad
StringValue.rpad

StringValue.rlike
StringValue.re_search
StringValue.re_extract
StringValue.re_replace


Timestamp methods
-----------------

All timestamp operations are valid on either scalar or array values.

.. autosummary::
:toctree: generated/

TimestampValue.truncate
TimestampValue.year
TimestampValue.month
TimestampValue.day
TimestampValue.hour
TimestampValue.minute
TimestampValue.second
TimestampValue.millisecond

Boolean methods
---------------

.. autosummary::
:toctree: generated/

BooleanValue.ifelse


.. autosummary::
:toctree: generated/

BooleanArray.any

Category methods
----------------

Category is a logical type with either a known or unknown cardinality. Values
are represented semantically as integers starting at 0.
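The integer-code representation described above can be illustrated in plain
Python (this sketch is not part of the Ibis API; ``encode_categories`` is a
hypothetical helper name):

```python
def encode_categories(values):
    """Assign 0-based integer codes to labels in first-seen order,
    mirroring how a category type maps labels to integers."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return encoded, codes
```

For example, ``encode_categories(['a', 'b', 'a', 'c'])`` yields codes
``[0, 1, 0, 2]`` with the mapping ``{'a': 0, 'b': 1, 'c': 2}``.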

.. autosummary::
:toctree: generated/

CategoryValue.label

Decimal methods
---------------

.. autosummary::
:toctree: generated/

DecimalValue.precision
DecimalValue.scale
8 changes: 8 additions & 0 deletions docs/source/conf.py
@@ -12,6 +12,7 @@
# All configuration values have a default; values that are commented out
# serve to show the default.

import glob
import sys
import os

@@ -35,6 +36,13 @@
'numpydoc'
]

autosummary_generate = glob.glob("*.rst")

# autosummary_generate = True

import numpydoc
numpydoc_show_class_members = False

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

5 changes: 5 additions & 0 deletions docs/source/configuration.rst
@@ -0,0 +1,5 @@
.. _configuration:

****************
Configuring Ibis
****************
46 changes: 46 additions & 0 deletions docs/source/developer.rst
@@ -0,0 +1,46 @@
.. _develop:

***********************************
Developing and Contributing to Ibis
***********************************

For a primer on general open source contributions, see the `pandas contribution
guide <http://pandas.pydata.org/pandas-docs/stable/contributing.html>`_. The
project will be run much like pandas has been.

Test environment setup
----------------------

If you do not have access to an Impala cluster, you may wish to set up the test
virtual machine. We've prepared a Quickstart VM to get you up and running
faster; :ref:`see here <install.quickstart>`.

Unit tests and integration tests that use Impala require a test data load. See
``scripts/load_test_data.py`` in the source repository for the data loading
script.

Contribution Ideas
------------------

Here are a few ideas for contributing outside of the primary development
roadmap:

* Documentation
* Use cases and IPython notebooks
* Other SQL-based backends (Presto, Hive, Spark SQL, PostgreSQL)
* S3 filesystem support
* Integration with MLLib via PySpark

Contributor License Agreements
------------------------------

While Ibis is an Apache-licensed open source project, we require individual and
corporate contributors to execute a `contributor license agreement
<https://en.wikipedia.org/wiki/Contributor_License_Agreement>`_ to avoid
copyright issues and to protect the user base from disruption. The agreement
only needs to be signed once.

We'll use the same CLAs that Impala uses:

* `Individual CLA <https://github.com/cloudera/Impala/wiki/Individual-Contributor-License-Agreement-(ICLA)>`_
* `Corporate CLA <https://github.com/cloudera/Impala/wiki/Corporate-Contributor-License-Agreement-(CCLA)>`_
106 changes: 106 additions & 0 deletions docs/source/getting-started.rst
@@ -0,0 +1,106 @@
.. _install:

********************************
Installation and Getting Started
********************************

Getting up and running with Ibis involves installing the Python package and
connecting to HDFS and Impala. If you don't have a Hadoop cluster available
with Impala, see :ref:`install.quickstart` below for instructions to use a VM
to get up and running quickly.

Installation
------------

System dependencies
~~~~~~~~~~~~~~~~~~~

Ibis requires a working Python 2.6 or 2.7 installation (3.x support will come
in a future release). We recommend `Anaconda <http://continuum.io/downloads>`_.

Some platforms will require that you have Kerberos installed to build properly.

* Redhat / CentOS: ``yum install krb5-devel``
* Ubuntu / Debian: ``apt-get install libkrb5-dev``

Installing the Python package
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Install ibis using ``pip`` (or ``conda``, whenever it becomes available):

::

pip install ibis-framework

This installs the ``ibis`` library to your configured Python environment.

Creating a client
-----------------

To create an Ibis "client", you must first connect your services and assemble
the client using ``ibis.make_client``:

.. code-block:: python

   import ibis

   ic = ibis.impala_connect(host=impala_host, port=impala_port)
   hdfs = ibis.hdfs_connect(host=webhdfs_host, port=webhdfs_port)
   con = ibis.make_client(ic, hdfs_client=hdfs)

Depending on your cluster setup, this may be more complicated, especially if
LDAP or Kerberos is involved. See the :ref:`API reference <api.client>` for
more.

Learning resources
------------------

We are collecting IPython notebooks for learning here:
http://github.com/cloudera/ibis-notebooks. Some of these notebooks will be
reproduced as part of the documentation.

.. _install.quickstart:

Using Ibis with the Cloudera Quickstart VM
------------------------------------------

Since Ibis requires a running Impala cluster, we have provided a lean
VirtualBox image to simplify the process for those looking to try out Ibis
(without setting up a cluster) or start contributing code to the project.

TL;DR
~~~~~

::

curl -s https://raw.githubusercontent.com/cloudera/ibis-notebooks/master/setup/bootstrap.sh | bash

Single Steps
~~~~~~~~~~~~

To use Ibis with the Cloudera Quickstart VM, follow these steps:

* Install Oracle VirtualBox
* Make sure Anaconda is installed (available from
  http://continuum.io/downloads), then prepend the Anaconda Python to your
  path: ``export PATH=$ANACONDA_HOME/bin:$PATH``
* ``pip install ibis-framework``
* ``git clone https://github.com/cloudera/ibis-notebooks.git``
* ``cd ibis-notebooks``
* ``./setup/setup-ibis-demo-vm.sh``
* ``source setup/ibis-env.sh``
* ``ipython notebook``

VM setup
~~~~~~~~

The setup script will download a VirtualBox appliance image and import it into
VirtualBox. In addition, it will create a new host-only network adapter with
DHCP. After the VM is started, the script will extract the VM's current IP
address and add a new ``/etc/hosts`` entry mapping that IP to the hostname
``quickstart.cloudera``. This entry is needed because Hadoop and HDFS require a
working reverse name mapping. If you don't want to run the automated steps,
review the individual steps in the file ``setup/setup-ibis-demo-vm.sh``.
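The resulting ``/etc/hosts`` entry looks like the following (the IP address
shown here is illustrative; the script uses whatever address DHCP assigns to
your VM):

```
192.168.56.101   quickstart.cloudera
```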
17 changes: 14 additions & 3 deletions docs/source/index.rst
@@ -3,16 +3,27 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Ibis
====
Ibis Documentation
==================

Ibis is a Python big data framework. To learn more about Ibis's vision and
roadmap, please visit http://ibis-project.org.

Source code is on GitHub: http://github.com/cloudera/ibis

Contents:
Since this is a young project, the documentation is patchy in places, but it
will improve as the project progresses.

.. toctree::
:maxdepth: 1

getting-started
configuration
tutorial
api
release
developer
type-system
legal

Indices and tables
Expand Down
208 changes: 208 additions & 0 deletions docs/source/legal.rst
@@ -1,3 +1,211 @@
=====
Legal
=====

Ibis is distributed under the Apache License, Version 2.0.

Ibis development is generously sponsored by Cloudera, Inc.

License::

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.

"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:

(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.

You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
29 changes: 27 additions & 2 deletions docs/source/release.rst
@@ -2,8 +2,33 @@
Release Notes
=============

0.3.0 (TBD)
-----------
0.3.0 (July 20, 2015)
---------------------

First public release. See http://ibis-project.org for more.

New features
~~~~~~~~~~~~
* Implement window / analytic function support.
* Enable non-equijoins (join clauses with operations other than ``==``).
* Add remaining :ref:`string functions <api.string>` supported by Impala.
* Add ``pipe`` method to tables (hat-tip to the pandas dev team).
* Add ``mutate`` convenience method to tables.
* Fleshed out ``WebHDFS`` implementations: get/put directories, move files,
etc. See the :ref:`full HDFS API <api.hdfs>`.
* Add ``truncate`` method for timestamp values.
* ``ImpalaClient`` can execute scalar expressions not involving any table.
* Can also create internal Impala tables with a specific HDFS path.
* Make Ibis's temporary Impala database and HDFS paths configurable (see
``ibis.options``).
* Add ``truncate_table`` function to client (if the user's Impala cluster
supports it).
* Python 2.6 compatibility.
* Enable Ibis to execute concurrent queries in multithreaded applications
(earlier versions were not thread-safe).
* Test data load script in ``scripts/load_test_data.py``.
* Add an internal operation type signature API to enhance developer
productivity.

0.2.0 (June 16, 2015)
---------------------
10 changes: 10 additions & 0 deletions docs/source/tutorial.rst
@@ -0,0 +1,10 @@
.. _tutorial:

********
Tutorial
********

These notebooks come from http://github.com/cloudera/ibis-notebooks and are
reproduced here using ``nbconvert``.

.. include:: generated-notebooks/manifest.txt
7 changes: 7 additions & 0 deletions docs/source/type-system.rst
@@ -0,0 +1,7 @@
.. _internals:

*********************
Ibis design internals
*********************

More to come here.
Empty file added docs/sphinxext/__init__.py
Empty file.
23 changes: 12 additions & 11 deletions ibis/__init__.py
@@ -15,10 +15,10 @@

# flake8: noqa

__version__ = '0.2.0'
__version__ = '0.3.0'

from ibis.client import ImpalaConnection, ImpalaClient
from ibis.filesystems import WebHDFS
from ibis.filesystems import HDFS, WebHDFS

import ibis.expr.api as api
import ibis.expr.types as ir
@@ -43,9 +43,9 @@ def make_client(db, hdfs_client=None):
Examples
--------
con = ibis.impala_connect(**impala_params)
hdfs = ibis.hdfs_connect(**hdfs_params)
client = ibis.make_client(con, hdfs_client=hdfs)
>>> con = ibis.impala_connect(**impala_params)
>>> hdfs = ibis.hdfs_connect(**hdfs_params)
>>> client = ibis.make_client(con, hdfs_client=hdfs)
Returns
-------
@@ -55,9 +55,10 @@ def make_client(db, hdfs_client=None):


def impala_connect(host='localhost', port=21050, protocol='hiveserver2',
database=None, timeout=45, use_ssl=False, ca_cert=None,
database='default', timeout=45, use_ssl=False, ca_cert=None,
use_ldap=False, ldap_user=None, ldap_password=None,
use_kerberos=False, kerberos_service_name='impala'):
use_kerberos=False, kerberos_service_name='impala',
pool_size=8):
"""
Create an Impala Client for use with Ibis
@@ -95,7 +96,7 @@ def impala_connect(host='localhost', port=21050, protocol='hiveserver2',
'kerberos_service_name': kerberos_service_name
}

return ImpalaConnection(**params)
return ImpalaConnection(pool_size=pool_size, **params)


def hdfs_connect(host='localhost', port=50070, protocol='webhdfs', **kwds):
@@ -113,7 +114,7 @@ def hdfs_connect(host='localhost', port=50070, protocol='webhdfs', **kwds):
client : ibis HDFS client
"""
from hdfs import InsecureClient
url = 'http://{}:{}'.format(host, port)
url = 'http://{0}:{1}'.format(host, port)
client = InsecureClient(url, **kwds)
return WebHDFS(client)
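The hunk above switches from implicit ``'{}'`` format fields to explicit
``'{0}'`` indices; automatic field numbering was only introduced in Python 2.7,
so explicit indices keep the URL construction working on Python 2.6, which this
release adds compatibility for. A minimal sketch of the pattern (``webhdfs_url``
is a hypothetical helper, not part of the Ibis API):

```python
def webhdfs_url(host, port):
    # Explicit positional indices work on Python 2.6 and later; the
    # implicit form 'http://{}:{}' raises ValueError on 2.6.
    return 'http://{0}:{1}'.format(host, port)

print(webhdfs_url('localhost', 50070))  # http://localhost:50070
```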

@@ -126,6 +127,6 @@ def test(include_e2e=False):
ibis_dir, _ = os.path.split(ibis.__file__)

args = ['--pyargs', ibis_dir]
if not include_e2e:
args.extend(['-m', 'not e2e'])
if include_e2e:
args.append('--e2e')
pytest.main(args)
470 changes: 366 additions & 104 deletions ibis/client.py

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions ibis/cloudpickle.py
@@ -42,6 +42,8 @@
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
"""

# flake8: noqa

from copy_reg import _extension_registry
from functools import partial

6 changes: 5 additions & 1 deletion ibis/common.py
@@ -37,7 +37,11 @@ class TranslationError(IbisError):
pass


class IbisTypeError(IbisError, TypeError):
class IbisInputError(ValueError, IbisError):
pass


class IbisTypeError(TypeError, IbisError):
pass


30 changes: 30 additions & 0 deletions ibis/compat.py
@@ -0,0 +1,30 @@
# Copyright 2015 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# flake8: noqa

import sys
from six import BytesIO


PY26 = sys.version_info[0] == 2 and sys.version_info[1] == 6


if PY26:
import unittest2 as unittest
else:
import unittest


py_string = basestring
8 changes: 8 additions & 0 deletions ibis/config.py
@@ -21,9 +21,12 @@

from collections import namedtuple
from contextlib import contextmanager
import pprint
import warnings
import sys

from six import StringIO

PY3 = (sys.version_info[0] >= 3)

if PY3:
@@ -151,6 +154,11 @@ def __init__(self, d, prefix=""):
object.__setattr__(self, "d", d)
object.__setattr__(self, "prefix", prefix)

def __repr__(self):
buf = StringIO()
pprint.pprint(self.d, stream=buf)
return buf.getvalue()

def __setattr__(self, key, val):
prefix = object.__getattribute__(self, "prefix")
if prefix:
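The ``__repr__`` added to the config holder above pretty-prints the underlying
dict into an in-memory buffer. The same pattern in isolation, using
``io.StringIO`` from the stdlib rather than ``six`` (the ``OptionsView`` class
name is illustrative, not the actual Ibis class):

```python
import pprint
from io import StringIO

class OptionsView(object):
    """Minimal sketch of the repr pattern added in ibis/config.py."""
    def __init__(self, d):
        self.d = d

    def __repr__(self):
        # Pretty-print the option dict into a buffer and return its contents,
        # so repr() shows the nested options legibly.
        buf = StringIO()
        pprint.pprint(self.d, stream=buf)
        return buf.getvalue()
```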
23 changes: 16 additions & 7 deletions ibis/config_init.py
@@ -16,13 +16,7 @@

cf.register_option('interactive', False, validator=cf.is_bool)
cf.register_option('verbose', False, validator=cf.is_bool)


def to_stdout(x):
print(x)


cf.register_option('verbose_log', to_stdout)
cf.register_option('verbose_log', None)


sql_default_limit_doc = """
@@ -32,3 +26,18 @@ def to_stdout(x):

with cf.config_prefix('sql'):
cf.register_option('default_limit', 10000, sql_default_limit_doc)


impala_temp_db_doc = """
Database to use for temporary tables, views, functions, etc.
"""

impala_temp_hdfs_path_doc = """
HDFS path for storage of temporary data
"""


with cf.config_prefix('impala'):
cf.register_option('temp_db', '__ibis_tmp', impala_temp_db_doc)
cf.register_option('temp_hdfs_path', '/tmp/ibis',
impala_temp_hdfs_path_doc)
92 changes: 80 additions & 12 deletions ibis/expr/analysis.py
@@ -12,7 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from ibis.common import RelationError
from ibis.common import RelationError, ExpressionError
from ibis.expr.window import window
import ibis.expr.types as ir
import ibis.expr.operations as ops
import ibis.util as util
@@ -45,6 +46,9 @@ def get_result(self):
expr = self.expr
node = expr.op()

if getattr(node, 'blocking', False):
return expr

subbed_args = []
for arg in node.args:
if isinstance(arg, (tuple, list)):
@@ -227,9 +231,9 @@ def _lift_TableColumn(self, expr, block=None):

can_lift = True
lifted_root = self.lift(val)
elif (isinstance(val.op(), ops.TableColumn)
and val.op().name == val.get_name()
and node.name == val.get_name()):
elif (isinstance(val.op(), ops.TableColumn) and
val.op().name == val.get_name() and
node.name == val.get_name()):
can_lift = True
lifted_root = self.lift(val.op().table)

@@ -457,19 +461,19 @@ def _validate_projection(self, expr):
return False

for val in self.parent.selections:
if (isinstance(val.op(), ops.PhysicalTable)
and node.name in val.schema()):
if (isinstance(val.op(), ops.PhysicalTable) and
node.name in val.schema()):
is_valid = True
elif (isinstance(val.op(), ops.TableColumn)
and node.name == val.get_name()
and not _is_aliased(val)):
elif (isinstance(val.op(), ops.TableColumn) and
node.name == val.get_name() and
not _is_aliased(val)):
# Aliased table columns are no good
col_table = val.op().table.op()

lifted_node = substitute_parents(expr).op()

is_valid = (col_table.is_ancestor(node.table)
or col_table.is_ancestor(lifted_node.table))
is_valid = (col_table.is_ancestor(node.table) or
col_table.is_ancestor(lifted_node.table))

# is_valid = True

@@ -480,6 +484,66 @@ def _is_aliased(col_expr):
return col_expr.op().name != col_expr.get_name()


def windowize_function(expr, w=None):
def _check_window(x):
# row_number() is only valid with an explicit window ordering
arg, window = x.op().args
if isinstance(arg.op(), ops.RowNumber):
if len(window._order_by) == 0:
raise ExpressionError('RowNumber requires explicit '
'window sort')

return x

def _windowize(x, w):
if not isinstance(x.op(), ops.WindowOp):
walked = _walk(x, w)
else:
window_arg, window_w = x.op().args
walked_child = _walk(window_arg, w)

if walked_child is not window_arg:
walked = x._factory(ops.WindowOp(walked_child, window_w),
name=x._name)
else:
walked = x

op = walked.op()
if isinstance(op, (ops.AnalyticOp, ops.Reduction)):
if w is None:
w = window()
return _check_window(walked.over(w))
elif isinstance(op, ops.WindowOp):
if w is not None:
return _check_window(walked.over(w))
else:
return _check_window(walked)
else:
return walked

def _walk(x, w):
op = x.op()

unchanged = True
windowed_args = []
for arg in op.args:
if not isinstance(arg, ir.Expr):
windowed_args.append(arg)
continue

new_arg = _windowize(arg, w)
unchanged = unchanged and arg is new_arg
windowed_args.append(new_arg)

if not unchanged:
new_op = type(op)(*windowed_args)
return x._factory(new_op, name=x._name)
else:
return x

return _windowize(expr, w)


class Projector(object):

"""
@@ -509,6 +573,9 @@ def __init__(self, parent, proj_exprs):
# Perform substitution only if we share common roots
if validator.shares_some_roots(expr):
expr = substitute_parents(expr, past_projection=False)

expr = windowize_function(expr)

clean_exprs.append(expr)

self.clean_exprs = clean_exprs
@@ -699,7 +766,8 @@ def walk(expr):
walk(op.left)
walk(op.right)
else:
raise Exception('Invalid predicate: {!r}'.format(expr))
raise Exception('Invalid predicate: {0!s}'
.format(expr._repr()))

walk(expr)
return out_exprs
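The new `windowize_function` above rebuilds an expression tree, wrapping bare analytic/reduction nodes in a window op. Stripped of ibis's operation classes, the traversal pattern looks roughly like this (toy `Node` classes are stand-ins; the real code also merges window specs into existing `WindowOp`s rather than just skipping them):

```python
class Node(object):
    def __init__(self, *args):
        self.args = args


class Analytic(Node):
    # Stand-in for ops.AnalyticOp / ops.Reduction
    pass


class Windowed(Node):
    # Stand-in for ops.WindowOp; args are (wrapped node, window spec)
    pass


def windowize(node, w):
    if isinstance(node, Windowed):
        # Already explicitly windowed; the real code combines windows here
        return node

    # Rebuild bottom-up so nested analytic ops get wrapped first
    new_args = tuple(windowize(a, w) if isinstance(a, Node) else a
                     for a in node.args)
    node = type(node)(*new_args)

    if isinstance(node, Analytic):
        # Bare analytic op: attach the window spec
        return Windowed(node, w)
    return node


tree = Node(Analytic('lag', 'd'), 'other')
result = windowize(tree, 'window-spec')
```

After the walk, the analytic child is wrapped while non-analytic arguments pass through untouched.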
21 changes: 11 additions & 10 deletions ibis/expr/analytics.py
@@ -14,6 +14,7 @@


import ibis.expr.types as ir
import ibis.expr.rules as rules
import ibis.expr.operations as ops


@@ -53,10 +54,9 @@ def __init__(self, arg, buckets, closed='left', close_extreme=True,
raise ValueError('If one bucket edge provided, must have'
' include_under=True and include_over=True')

ir.ValueNode.__init__(self, [self.arg, self.buckets, self.closed,
self.close_extreme,
self.include_under,
self.include_over])
ir.ValueNode.__init__(self, self.arg, self.buckets, self.closed,
self.close_extreme, self.include_under,
self.include_over)

@property
def nbuckets(self):
@@ -84,8 +84,8 @@ def __init__(self, arg, nbins, binwidth, base, closed='left',
self.closed = self._validate_closed(closed)

self.aux_hash = aux_hash
ir.ValueNode.__init__(self, [self.arg, self.nbins, self.binwidth,
self.base, self.closed, self.aux_hash])
ir.ValueNode.__init__(self, self.arg, self.nbins, self.binwidth,
self.base, self.closed, self.aux_hash)

def output_type(self):
# always undefined cardinality (for now)
@@ -105,15 +105,16 @@ def __init__(self, arg, labels, nulls):
'categories: %d' % card)

self.nulls = nulls
ir.ValueNode.__init__(self, [self.arg, self.labels, self.nulls])
ir.ValueNode.__init__(self, self.arg, self.labels, self.nulls)

def output_type(self):
return ops._shape_like(self.arg, 'string')
return rules.shape_like(self.arg, 'string')


def bucket(arg, buckets, closed='left', close_extreme=True,
include_under=False, include_over=False):
"""
Compute a discrete binning of a numeric array
Parameters
----------
Expand All @@ -122,8 +123,8 @@ def bucket(arg, buckets, closed='left', close_extreme=True,
closed : {'left', 'right'}, default 'left'
Which side of each interval is closed. For example
buckets = [0, 100, 200]
closed = 'left': 100 falls in 2nd bucket
closed = 'right': 100 falls in 1st bucket
closed = 'left': 100 falls in 2nd bucket
closed = 'right': 100 falls in 1st bucket
close_extreme : boolean, default True
Returns
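The `closed='left'` / `closed='right'` semantics described in the docstring can be sketched with the standard library's `bisect`, ignoring `close_extreme`, `include_under`, and `include_over` (so this shows only the core edge rule, not the full bucket operation):

```python
import bisect


def bucket_index(value, edges, closed='left'):
    # left-closed buckets are [lo, hi); right-closed buckets are (lo, hi]
    if closed == 'left':
        i = bisect.bisect_right(edges, value) - 1
    else:
        i = bisect.bisect_left(edges, value) - 1
    if i < 0 or i >= len(edges) - 1:
        return None  # falls outside all buckets
    return i


edges = [0, 100, 200]
assert bucket_index(100, edges, closed='left') == 1   # 2nd bucket
assert bucket_index(100, edges, closed='right') == 0  # 1st bucket
```

This reproduces the docstring's example: with edges `[0, 100, 200]`, the boundary value 100 lands in the second bucket when intervals are left-closed and in the first when they are right-closed.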
554 changes: 525 additions & 29 deletions ibis/expr/api.py

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions ibis/expr/format.py
@@ -97,7 +97,7 @@ def get_result(self):
str(what.value))

if isinstance(self.expr, ir.ValueExpr) and self.expr._name is not None:
text = '{} = {}'.format(self.expr.get_name(), text)
text = '{0} = {1}'.format(self.expr.get_name(), text)

if self.memoize:
alias_to_text = [(self.memo.aliases[x],
@@ -144,13 +144,13 @@ def _indent(self, text, indents=1):

def _format_table(self, table):
# format the schema
rows = ['name: {!s}\nschema:'.format(table.name)]
rows = ['name: {0!s}\nschema:'.format(table.name)]
rows.extend([' %s : %s' % tup for tup in
zip(table.schema.names, table.schema.types)])
opname = type(table).__name__
type_display = self._get_type_display(table)
opline = '%s[%s]' % (opname, type_display)
return '{}\n{}'.format(opline, self._indent('\n'.join(rows)))
return '{0}\n{1}'.format(opline, self._indent('\n'.join(rows)))

def _format_column(self, expr):
# HACK: if column is pulled from a Filter of another table, this parent
@@ -191,7 +191,7 @@ def visit(what, extra_indents=0):
else:
for arg, name in zip(op.args, arg_names):
if name is not None:
name = self._indent('{}:'.format(name))
name = self._indent('{0}:'.format(name))
if isinstance(arg, list):
if name is not None and len(arg) > 0:
formatted_args.append(name)
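The recurring `'{}'` → `'{0}'` rewrites throughout this diff are presumably for Python 2.6 compatibility: auto-numbered format fields only arrived in Python 2.7.

```python
# Explicitly numbered fields work on Python 2.6 and later:
text = '{0} = {1}'.format('expr', 'value')

# The auto-numbered form '{} = {}' is only valid from Python 2.7 on,
# and raises ValueError on Python 2.6 -- hence the churn in this diff.
```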
111 changes: 99 additions & 12 deletions ibis/expr/groupby.py
@@ -14,28 +14,34 @@

# User API for grouped data operations

import ibis.expr.analysis as L
import ibis.expr.operations as ops
import ibis.expr.types as ir
import ibis.expr.window as _window
import ibis.util as util


def _resolve_exprs(table, exprs):
exprs = util.promote_list(exprs)
return table._resolve(exprs)


class GroupedTableExpr(object):

"""
Helper intermediate construct
"""

def __init__(self, table, by, having=None):
if not isinstance(by, (list, tuple)):
if not isinstance(by, ir.Expr):
by = table._resolve([by])
else:
by = [by]
else:
by = table._resolve(by)

def __init__(self, table, by, having=None, order_by=None, window=None):
self.table = table
self.by = by
self.by = _resolve_exprs(table, by)
self._order_by = order_by or []
self._having = having or []
self._window = window

def __getitem__(self, args):
# Shortcut for projection with window functions
return self.projection(args)

def __getattr__(self, attr):
if hasattr(self.table, attr):
@@ -59,13 +65,94 @@ def having(self, expr):
Add a post-aggregation result filter (like the having argument in
`aggregate`), for composability with the group_by API
Parameters
----------
Returns
-------
grouped : GroupedTableExpr
"""
exprs = util.promote_list(expr)
new_having = self._having + exprs
return GroupedTableExpr(self.table, self.by, having=new_having)
return GroupedTableExpr(self.table, self.by, having=new_having,
order_by=self._order_by,
window=self._window)

def order_by(self, expr):
"""
Expressions to use for ordering data for a window function
computation. Ignored in aggregations.
Parameters
----------
expr : value expression or list of value expressions
Returns
-------
grouped : GroupedTableExpr
"""
exprs = util.promote_list(expr)
new_order = self._order_by + exprs
return GroupedTableExpr(self.table, self.by, having=self._having,
order_by=new_order,
window=self._window)

def mutate(self, exprs=None, **kwds):
"""
Returns a table projection with analytic / window functions applied
Examples
--------
expr = (table
.group_by('foo')
.order_by(ibis.desc('bar'))
.mutate(qux=table.baz.lag()))
Returns
-------
mutated : TableExpr
"""
if exprs is None:
exprs = []
else:
exprs = util.promote_list(exprs)

for k, v in kwds.items():
exprs.append(v.name(k))

return self.projection([self.table] + exprs)

def projection(self, exprs):
w = self._get_window()
windowed_exprs = []
for expr in exprs:
expr = L.windowize_function(expr, w=w)
windowed_exprs.append(expr)
return self.table.projection(windowed_exprs)

def _get_window(self):
if self._window is None:
groups = self.by
sorts = self._order_by
preceding, following = None, None
else:
w = self._window
groups = w.group_by + self.by
sorts = w.order_by + self._order_by
preceding, following = w.preceding, w.following

sorts = [ops.to_sort_key(self.table, k) for k in sorts]

return _window.window(preceding=preceding, following=following,
group_by=groups, order_by=sorts)

def over(self, window):
"""
Add a window clause to be applied to downstream analytic expressions
"""
return GroupedTableExpr(self.table, self.by, having=self._having,
order_by=self._order_by,
window=window)

def count(self, metric_name='count'):
"""
@@ -92,7 +179,7 @@ def _group_agg_dispatch(name):
def wrapper(self, *args, **kwargs):
f = getattr(self.arr, name)
metric = f(*args, **kwargs)
alias = '{}({})'.format(name, self.arr.get_name())
alias = '{0}({1})'.format(name, self.arr.get_name())
return self.parent.aggregate(metric.name(alias))

wrapper.__name__ = name
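A grouped, ordered `mutate` like the docstring's `lag` example computes each value's predecessor within its group. The per-group lag semantics can be sketched over plain dict rows (a toy model, not ibis's execution path):

```python
from itertools import groupby
from operator import itemgetter


def grouped_lag(rows, group_key, value_key):
    # rows must already be sorted by (group, order) -- this mirrors the
    # group_by(...).order_by(...) that precedes the lag upstream
    out = []
    for _, grp in groupby(rows, key=itemgetter(group_key)):
        prev = None  # the lag restarts at each group boundary
        for row in grp:
            out.append(prev)
            prev = row[value_key]
    return out


rows = [{'g': 'a', 'v': 1}, {'g': 'a', 'v': 2}, {'g': 'b', 'v': 3}]
lagged = grouped_lag(rows, 'g', 'v')
```

The first row of each group lags to `None` (SQL `NULL`), which is why an explicit ordering matters: without one, "previous row" is undefined.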
1,221 changes: 680 additions & 541 deletions ibis/expr/operations.py

Large diffs are not rendered by default.

423 changes: 419 additions & 4 deletions ibis/expr/rules.py

Large diffs are not rendered by default.

9 changes: 5 additions & 4 deletions ibis/expr/temporal.py
@@ -41,9 +41,9 @@ def __repr__(self):
if self.n == 1:
pretty_unit = self.unit_name
else:
pretty_unit = '{}s'.format(self.unit_name)
pretty_unit = '{0}s'.format(self.unit_name)

return '<Timedelta: {} {}>'.format(self.n, pretty_unit)
return '<Timedelta: {0} {1}>'.format(self.n, pretty_unit)

def replace(self, n):
return type(self)(n)
@@ -99,7 +99,7 @@ def unit(self):

def combine(self, other):
if not isinstance(other, TimeIncrement):
raise TypeError('Must be a fixed size timedelta, was {!r}'
raise TypeError('Must be a fixed size timedelta, was {0!r}'
.format(type(other)))

a, b = _to_common_units([self, other])
@@ -205,7 +205,8 @@ def convert(self, n, from_unit, to_unit):

if j < i:
if n % factor:
raise IbisError('{} is not a multiple of {}'.format(n, factor))
raise IbisError('{0} is not a multiple of {1}'.format(n,
factor))
return n / factor
else:
return n * factor
37 changes: 31 additions & 6 deletions ibis/expr/tests/mocks.py
@@ -14,6 +14,7 @@

from ibis.client import SQLClient
import ibis.expr.types as ir
import ibis


class MockConnection(SQLClient):
@@ -122,16 +123,40 @@ class MockConnection(SQLClient):
}

def __init__(self):
self.last_executed_expr = None
self.executed_queries = []

def _get_table_schema(self, name):
name = name.replace('`', '')
return ir.Schema.from_tuples(self._tables[name])

def execute(self, expr, default_limit=None):
ast, expr = self._build_ast_ensure_limit(expr, default_limit)
def execute(self, expr, limit=None):
ast = self._build_ast_ensure_limit(expr, limit)
for query in ast.queries:
query.compile()

self.last_executed_expr = expr
self.executed_queries.append(query.compile())
return None


_all_types_schema = [
('a', 'int8'),
('b', 'int16'),
('c', 'int32'),
('d', 'int64'),
('e', 'float'),
('f', 'double'),
('g', 'string'),
('h', 'boolean')
]


class BasicTestCase(object):

def setUp(self):
self.schema = _all_types_schema
self.schema_dict = dict(self.schema)
self.table = ibis.table(self.schema)

self.int_cols = ['a', 'b', 'c', 'd']
self.bool_cols = ['h']
self.float_cols = ['e', 'f']

self.con = MockConnection()
19 changes: 17 additions & 2 deletions ibis/expr/tests/test_analytics.py
@@ -12,10 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import unittest

from ibis.expr.tests.mocks import MockConnection
from ibis.compat import unittest
import ibis.expr.types as ir
import ibis


class TestAnalytics(unittest.TestCase):
@@ -65,3 +65,18 @@ def test_histogram(self):
self.assertRaises(ValueError, d.histogram, nbins=10, binwidth=5)
self.assertRaises(ValueError, d.histogram)
self.assertRaises(ValueError, d.histogram, 10, closed='foo')

def test_topk_analysis_bug(self):
# GH #398
airlines = ibis.table([('dest', 'string'),
('origin', 'string'),
('arrdelay', 'int32')], 'airlines')

dests = ['ORD', 'JFK', 'SFO']
t = airlines[airlines.dest.isin(dests)]
delay_filter = t.dest.topk(10, by=t.arrdelay.mean())
filtered = t.filter([delay_filter])

# predicate is unmodified by analysis
post_pred = filtered.op().predicates[1]
assert delay_filter.equals(post_pred)
3 changes: 1 addition & 2 deletions ibis/expr/tests/test_decimal.py
@@ -12,12 +12,11 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import unittest

import ibis.expr.api as api
import ibis.expr.types as ir
import ibis.expr.operations as ops

from ibis.compat import unittest
from ibis.expr.tests.mocks import MockConnection


149 changes: 149 additions & 0 deletions ibis/expr/tests/test_format.py
@@ -0,0 +1,149 @@
# Copyright 2014 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import ibis

from ibis.compat import unittest
from ibis.expr.format import ExprFormatter
from ibis.expr.tests.mocks import MockConnection


class TestExprFormatting(unittest.TestCase):
# Uncertain about how much we want to commit to unit tests around the
# particulars of the output at the moment.

def setUp(self):
self.schema = [
('a', 'int8'),
('b', 'int16'),
('c', 'int32'),
('d', 'int64'),
('e', 'float'),
('f', 'double'),
('g', 'string'),
('h', 'boolean')
]
self.schema_dict = dict(self.schema)
self.table = ibis.table(self.schema)

def test_format_projection(self):
# This should produce a ref to the projection
proj = self.table[['c', 'a', 'f']]
repr(proj['a'])

def test_table_type_output(self):
foo = ibis.table(
[
('job', 'string'),
('dept_id', 'string'),
('year', 'int32'),
('y', 'double')
], 'foo')

expr = foo.dept_id == foo.view().dept_id
result = repr(expr)
assert 'SelfReference[table]' in result
assert 'UnboundTable[table]' in result

def test_memoize_aggregate_correctly(self):
table = self.table

agg_expr = (table['c'].sum() / table['c'].mean() - 1).name('analysis')
agg_exprs = [table['a'].sum().name('sum(a)'),
table['b'].mean().name('mean(b)'), agg_expr]

result = table.aggregate(agg_exprs, by=['g'])

formatter = ExprFormatter(result)
formatted = formatter.get_result()

alias = formatter.memo.get_alias(table.op())
assert formatted.count(alias) == 7

def test_aggregate_arg_names(self):
# Not sure how to test this *well*

t = self.table

by_exprs = [t.g.name('key1'), t.f.round().name('key2')]
agg_exprs = [t.c.sum().name('c'), t.d.mean().name('d')]

expr = self.table.group_by(by_exprs).aggregate(agg_exprs)
result = repr(expr)
assert 'metrics' in result
assert 'by' in result

def test_format_multiple_join_with_projection(self):
# Star schema with fact table
table = ibis.table([
('c', 'int32'),
('f', 'double'),
('foo_id', 'string'),
('bar_id', 'string'),
])

table2 = ibis.table([
('foo_id', 'string'),
('value1', 'double')
])

table3 = ibis.table([
('bar_id', 'string'),
('value2', 'double')
])

filtered = table[table['f'] > 0]

pred1 = table['foo_id'] == table2['foo_id']
pred2 = filtered['bar_id'] == table3['bar_id']

j1 = filtered.left_join(table2, [pred1])
j2 = j1.inner_join(table3, [pred2])

# Project out the desired fields
view = j2[[table, table2['value1'], table3['value2']]]

# it works!
repr(view)

def test_memoize_database_table(self):
con = MockConnection()
table = con.table('test1')
table2 = con.table('test2')

filter_pred = table['f'] > 0
table3 = table[filter_pred]
join_pred = table3['g'] == table2['key']

joined = table2.inner_join(table3, [join_pred])

met1 = (table3['f'] - table2['value']).mean().name('foo')
result = joined.aggregate([met1, table3['f'].sum().name('bar')],
by=[table3['g'], table2['key']])

formatted = repr(result)
assert formatted.count('test1') == 1
assert formatted.count('test2') == 1

def test_named_value_expr_show_name(self):
expr = self.table.f * 2
expr2 = expr.name('baz')

# it works!
repr(expr)

result2 = repr(expr2)

# not really committing to a particular output yet
assert 'baz' in result2
77 changes: 77 additions & 0 deletions ibis/expr/tests/test_interactive.py
@@ -0,0 +1,77 @@
# Copyright 2014 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from ibis.compat import unittest
from ibis.expr.tests.mocks import MockConnection
import ibis.config as config

from ibis.tests.util import assert_equal


class TestInteractiveUse(unittest.TestCase):

def setUp(self):
self.con = MockConnection()

def test_interactive_execute_on_repr(self):
table = self.con.table('functional_alltypes')
expr = table.bigint_col.sum()
with config.option_context('interactive', True):
repr(expr)

assert len(self.con.executed_queries) > 0

def test_default_limit(self):
table = self.con.table('functional_alltypes')

with config.option_context('interactive', True):
repr(table)

expected = """\
SELECT *
FROM functional_alltypes
LIMIT {0}""".format(config.options.sql.default_limit)

assert self.con.executed_queries[0] == expected

def test_disable_query_limit(self):
table = self.con.table('functional_alltypes')

with config.option_context('interactive', True):
with config.option_context('sql.default_limit', None):
repr(table)

expected = """\
SELECT *
FROM functional_alltypes"""

assert self.con.executed_queries[0] == expected

def test_interactive_non_compilable_repr_not_fail(self):
# #170
table = self.con.table('functional_alltypes')

expr = table.string_col.topk(3)

# it works!
with config.option_context('interactive', True):
repr(expr)

def test_histogram_repr_no_query_execute(self):
t = self.con.table('functional_alltypes')
tier = t.double_col.histogram(10).name('bucket')
expr = t.group_by(tier).size()
with config.option_context('interactive', True):
expr._repr()
assert self.con.executed_queries == []
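The tests above rely on `repr` triggering execution when the interactive option is on. Schematically, the mechanism looks like this (a toy class, not ibis's actual expression types):

```python
class InteractiveExpr(object):
    """Toy stand-in for an ibis expression with interactive repr."""

    interactive = False  # stands in for config.options.interactive

    def __init__(self, sql):
        self.sql = sql
        self.executed_queries = []

    def execute(self):
        # Record the query, as the MockConnection above does
        self.executed_queries.append(self.sql)
        return 'col\n1\n2'

    def __repr__(self):
        # Interactive mode: repr runs the query and shows its results;
        # otherwise it just prints the expression.
        if InteractiveExpr.interactive:
            return self.execute()
        return '<expr: {0}>'.format(self.sql)
```

With the flag off, `repr` is purely symbolic and nothing hits the database, which is exactly what `test_histogram_repr_no_query_execute` asserts for non-executable sub-expressions.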
59 changes: 59 additions & 0 deletions ibis/expr/tests/test_pipe.py
@@ -0,0 +1,59 @@
# Copyright 2014 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from ibis.compat import unittest
import ibis


class TestPipe(unittest.TestCase):

def setUp(self):
self.table = ibis.table([
('key1', 'string'),
('key2', 'string'),
('key3', 'string'),
('value', 'double')
], 'foo_table')

def test_pipe_positional_args(self):
def my_func(data, foo, bar):
return data[bar] + foo

result = self.table.pipe(my_func, 4, 'value')
expected = self.table['value'] + 4

assert result.equals(expected)

def test_pipe_keyword_args(self):
def my_func(data, foo=None, bar=None):
return data[bar] + foo

result = self.table.pipe(my_func, foo=4, bar='value')
expected = self.table['value'] + 4

assert result.equals(expected)

def test_pipe_pass_to_keyword(self):
def my_func(x, y, data=None):
return data[x] + y

result = self.table.pipe((my_func, 'data'), 'value', 4)
expected = self.table['value'] + 4

assert result.equals(expected)

def test_call_pipe_equivalence(self):
result = self.table(lambda x: x['key1'].cast('double').sum())
expected = self.table.key1.cast('double').sum()
assert result.equals(expected)
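The three `pipe` call styles exercised above — positional, keyword, and a `(func, keyword-name)` target — can be captured in a few lines. A sketch of the dispatch rule (not necessarily ibis's exact implementation):

```python
def pipe(obj, func, *args, **kwargs):
    # func may be a plain callable, or a (callable, keyword) pair naming
    # the keyword argument that should receive obj
    if isinstance(func, tuple):
        func, target = func
        if target in kwargs:
            raise ValueError('{0} is both the pipe target and a '
                             'keyword argument'.format(target))
        kwargs[target] = obj
        return func(*args, **kwargs)
    return func(obj, *args, **kwargs)
```

In the default form the piped object becomes the first positional argument; the tuple form is the escape hatch for functions that take their data somewhere other than first.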
13 changes: 7 additions & 6 deletions ibis/expr/tests/test_sql_builtins.py
@@ -12,9 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import unittest

from ibis.expr.tests.mocks import MockConnection
from ibis.compat import unittest
import ibis.expr.api as api
import ibis.expr.operations as ops
import ibis.expr.types as ir
@@ -46,10 +45,12 @@ def test_group_concat(self):

expr = col.group_concat()
assert isinstance(expr.op(), ops.GroupConcat)
assert expr.op().sep == ','
arg, sep = expr.op().args
assert sep == ','

expr = col.group_concat('|')
assert expr.op().sep == '|'
arg, sep = expr.op().args
assert sep == '|'

def test_zeroifnull(self):
dresult = self.alltypes.double_col.zeroifnull()
@@ -111,11 +112,11 @@ def test_sign(self):
def test_round(self):
result = self.alltypes.double_col.round()
assert isinstance(result, ir.Int64Array)
assert result.op().digits is None
assert result.op().args[1] is None

result = self.alltypes.double_col.round(2)
assert isinstance(result, ir.DoubleArray)
assert result.op().digits == 2
assert result.op().args[1] == 2

# Even integers are double (at least in Impala, check with other DB
# implementations)
30 changes: 21 additions & 9 deletions ibis/expr/tests/test_string.py
@@ -12,13 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import unittest

import ibis.expr.api as api
from ibis import literal
import ibis.expr.types as ir
import ibis.expr.operations as ops

from ibis.expr.tests.mocks import MockConnection
from ibis.compat import unittest


class TestStringOps(unittest.TestCase):
@@ -37,15 +36,15 @@ def test_lower_upper(self):
assert isinstance(lresult.op(), ops.Lowercase)
assert isinstance(uresult.op(), ops.Uppercase)

lit = api.literal('FoO')
lit = literal('FoO')

lresult = lit.lower()
uresult = lit.upper()
assert isinstance(lresult, ir.StringScalar)
assert isinstance(uresult, ir.StringScalar)

def test_substr(self):
lit = api.literal('FoO')
lit = literal('FoO')

result = self.table.g.substr(2, 4)
lit_result = lit.substr(0, 2)
@@ -55,8 +54,11 @@ def test_substr(self):

op = result.op()
assert isinstance(op, ops.Substring)
assert op.start == 2
assert op.length == 4

start, length = op.args[1:]

assert start.equals(literal(2))
assert length.equals(literal(4))

def test_left_right(self):
result = self.table.g.left(5)
@@ -66,17 +68,27 @@ def test_left_right(self):
result = self.table.g.right(5)
op = result.op()
assert isinstance(op, ops.StrRight)
assert op.nchars == 5
assert op.args[1].equals(literal(5))

def test_length(self):
lit = api.literal('FoO')
lit = literal('FoO')
result = self.table.g.length()
lit_result = lit.length()

assert isinstance(result, ir.Int32Array)
assert isinstance(lit_result, ir.Int32Scalar)
assert isinstance(result.op(), ops.StringLength)

def test_join(self):
dash = literal('-')

expr = dash.join([self.table.f.cast('string'),
self.table.g])
assert isinstance(expr, ir.StringArray)

expr = dash.join([literal('ab'), literal('cd')])
assert isinstance(expr, ir.StringScalar)

def test_contains(self):
expr = self.table.g.contains('foo')
expected = self.table.g.like('%foo%')
1,030 changes: 113 additions & 917 deletions ibis/expr/tests/test_base.py → ibis/expr/tests/test_table.py

Large diffs are not rendered by default.

3 changes: 1 addition & 2 deletions ibis/expr/tests/test_temporal.py
@@ -12,14 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import unittest

from ibis.common import IbisError
import ibis.expr.operations as ops
import ibis.expr.types as ir
import ibis.expr.temporal as T

from ibis.expr.tests.mocks import MockConnection
from ibis.compat import unittest


class TestFixedOffsets(unittest.TestCase):
3 changes: 1 addition & 2 deletions ibis/expr/tests/test_timestamp.py
@@ -12,8 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import unittest

import pandas as pd

import ibis
@@ -23,6 +21,7 @@
import ibis.expr.types as ir

from ibis.expr.tests.mocks import MockConnection
from ibis.compat import unittest


class TestTimestamp(unittest.TestCase):
722 changes: 722 additions & 0 deletions ibis/expr/tests/test_value_exprs.py

Large diffs are not rendered by default.

87 changes: 87 additions & 0 deletions ibis/expr/tests/test_window_functions.py
@@ -0,0 +1,87 @@
# Copyright 2014 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import ibis

from ibis.compat import unittest
from ibis.expr.tests.mocks import BasicTestCase

from ibis.tests.util import assert_equal


class TestWindowFunctions(BasicTestCase, unittest.TestCase):

def setUp(self):
BasicTestCase.setUp(self)
self.t = self.con.table('alltypes')

def test_compose_group_by_apis(self):
t = self.t
w = ibis.window(group_by=t.g, order_by=t.f)

diff = t.d - t.d.lag()
grouped = t.group_by('g').order_by('f')

expr = grouped[t, diff.name('diff')]
expr2 = grouped.mutate(diff=diff)
expr3 = grouped.mutate([diff.name('diff')])

window_expr = (t.d - t.d.lag().over(w)).name('diff')
expected = t.projection([t, window_expr])

assert_equal(expr, expected)
assert_equal(expr, expr2)
assert_equal(expr, expr3)

def test_combine_windows(self):
pass

def test_window_bind_to_table(self):
w = ibis.window(group_by='g', order_by=ibis.desc('f'))

w2 = w.bind(self.t)
expected = ibis.window(group_by=self.t.g,
order_by=ibis.desc(self.t.f))

assert_equal(w2, expected)

def test_preceding_following_validate(self):
# these all work
[
ibis.window(preceding=0),
ibis.window(following=0),
ibis.window(preceding=0, following=0),
ibis.window(preceding=(None, 4)),
ibis.window(preceding=(10, 4)),
ibis.window(following=(4, None)),
ibis.window(following=(4, 10))
]

# these are ill-specified
error_cases = [
lambda: ibis.window(preceding=(1, 3)),
lambda: ibis.window(preceding=(3, 1), following=2),
lambda: ibis.window(preceding=(3, 1), following=(2, 4)),
lambda: ibis.window(preceding=-1),
lambda: ibis.window(following=-1),
lambda: ibis.window(preceding=(-1, 2)),
lambda: ibis.window(following=(2, -1))
]

for i, case in enumerate(error_cases):
with self.assertRaises(Exception):
case()

def test_window_equals(self):
pass
420 changes: 261 additions & 159 deletions ibis/expr/types.py

Large diffs are not rendered by default.

210 changes: 210 additions & 0 deletions ibis/expr/window.py
@@ -0,0 +1,210 @@
# Copyright 2014 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import ibis.expr.types as ir
import ibis.expr.operations as ops
import ibis.util as util
import ibis.common as com


def _list_to_tuple(x):
if isinstance(x, list):
x = tuple(x)
return x


class Window(object):

"""
A generic window function clause, patterned after SQL window clauses for
the time being. Can be expanded to cover more use cases as they arise.
Using None for preceding or following currently indicates unbounded. Use 0
for the current row
"""

def __init__(self, group_by=None, order_by=None,
preceding=None, following=None):
if group_by is None:
group_by = []

if order_by is None:
order_by = []

self._group_by = util.promote_list(group_by)
self._order_by = util.promote_list(order_by)
self._order_by = [ops.SortKey(expr)
if isinstance(expr, ir.Expr)
else expr
for expr in self._order_by]

self.preceding = _list_to_tuple(preceding)
self.following = _list_to_tuple(following)

self._validate_frame()

def _validate_frame(self):
p_tuple = has_p = False
f_tuple = has_f = False
if self.preceding is not None:
p_tuple = isinstance(self.preceding, tuple)
has_p = True

if self.following is not None:
f_tuple = isinstance(self.following, tuple)
has_f = True

if ((p_tuple and has_f) or (f_tuple and has_p)):
raise com.IbisInputError('Can only specify one window side '
'when you want an off-center '
'window')
elif p_tuple:
start, end = self.preceding
if start is None:
assert end >= 0
else:
assert start > end
elif f_tuple:
start, end = self.following
if end is None:
assert start >= 0
else:
assert start < end
else:
if has_p and self.preceding < 0:
raise com.IbisInputError('Window offset must be positive')

if has_f and self.following < 0:
raise com.IbisInputError('Window offset must be positive')

def bind(self, table):
# Internal API, ensure that any unresolved expr references (as strings,
# say) are bound to the table being windowed
groups = table._resolve(self._group_by)
sorts = [ops.to_sort_key(table, k) for k in self._order_by]
return self._replace(group_by=groups, order_by=sorts)

def combine(self, window):
kwds = dict(
preceding=self.preceding or window.preceding,
following=self.following or window.following,
group_by=self._group_by + window._group_by,
order_by=self._order_by + window._order_by
)
return Window(**kwds)

def group_by(self, expr):
new_groups = self._group_by + util.promote_list(expr)
return self._replace(group_by=new_groups)

def _replace(self, **kwds):
new_kwds = dict(
group_by=kwds.get('group_by', self._group_by),
order_by=kwds.get('order_by', self._order_by),
preceding=kwds.get('preceding', self.preceding),
following=kwds.get('following', self.following)
)
return Window(**new_kwds)

def order_by(self, expr):
new_sorts = self._order_by + util.promote_list(expr)
return self._replace(order_by=new_sorts)

def equals(self, other):
if not isinstance(other, Window):
return False

if (len(self._group_by) != len(other._group_by) or
not ir.all_equal(self._group_by, other._group_by)):
return False

if (len(self._order_by) != len(other._order_by) or
not ir.all_equal(self._order_by, other._order_by)):
return False

return (self.preceding == other.preceding and
self.following == other.following)


def window(preceding=None, following=None, group_by=None, order_by=None):
"""
Create a window clause for use with window (analytic and aggregate)
functions.
All window frames / ranges are inclusive.
Parameters
----------
preceding : int, tuple, or None, default None
Specify None for unbounded, 0 to include current row
tuple for off-center window
following : int, tuple, or None, default None
Specify None for unbounded, 0 to include current row
tuple for off-center window
group_by : expressions, default None
Either specify here or with TableExpr.group_by
order_by : expressions, default None
For analytic functions requiring an ordering, specify here, or let Ibis
determine the default ordering (for functions like rank)
Returns
-------
win : ibis Window
"""
return Window(preceding=preceding, following=following,
group_by=group_by, order_by=order_by)
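
The frame validation above can be exercised in isolation. The sketch below is a standalone mirror of `_validate_frame` (hypothetical function name, no ibis dependency), showing why an off-center window may fix bounds on only one side:

```python
def validate_frame(preceding=None, following=None):
    # Standalone mirror of Window._validate_frame -- a sketch, not the
    # ibis API itself
    def as_tuple(x):
        return tuple(x) if isinstance(x, list) else x

    preceding = as_tuple(preceding)
    following = as_tuple(following)

    p_tuple = isinstance(preceding, tuple)
    f_tuple = isinstance(following, tuple)

    # An off-center window pins both bounds on one side, so the other
    # side must be left unspecified
    if (p_tuple and following is not None) or (f_tuple and preceding is not None):
        raise ValueError('Can only specify one window side '
                         'when you want an off-center window')

    if not p_tuple and preceding is not None and preceding < 0:
        raise ValueError('Window offset must be non-negative')
    if not f_tuple and following is not None and following < 0:
        raise ValueError('Window offset must be non-negative')

    return preceding, following
```

For example, `validate_frame(preceding=(10, 5))` accepts a frame between 10 and 5 rows preceding, while combining it with a `following` bound raises.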


def cumulative_window(group_by=None, order_by=None):
"""
Create a cumulative window clause for use with aggregate window functions.
All window frames / ranges are inclusive.
Parameters
----------
group_by : expressions, default None
Either specify here or with TableExpr.group_by
order_by : expressions, default None
For analytic functions requiring an ordering, specify here, or let Ibis
determine the default ordering (for functions like rank)
Returns
-------
win : ibis Window
"""
return Window(preceding=None, following=0,
group_by=group_by, order_by=order_by)


def trailing_window(periods, group_by=None, order_by=None):
"""
Create a trailing window for use with aggregate window functions.
Parameters
----------
periods : int
Number of trailing periods to include. 0 includes only the current period
group_by : expressions, default None
Either specify here or with TableExpr.group_by
order_by : expressions, default None
For analytic functions requiring an ordering, specify here, or let Ibis
determine the default ordering (for functions like rank)
Returns
-------
win : ibis Window
"""
return Window(preceding=periods, following=0,
group_by=group_by, order_by=order_by)
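
Over ordered data, `cumulative_window()` corresponds to an expanding frame (`preceding=None, following=0`) and `trailing_window(n)` to the current row plus the previous `n` rows, with both bounds inclusive. The frame arithmetic can be sketched standalone over a plain list (hypothetical helper names):

```python
def frame_bounds(i, preceding, following):
    # Map a (preceding, following) spec to inclusive index bounds for
    # row i; None means unbounded, matching the Window convention above
    lo = 0 if preceding is None else max(0, i - preceding)
    hi = i + following
    return lo, hi


def windowed_sums(values, preceding, following):
    # Apply an inclusive-frame sum at every row position
    out = []
    for i in range(len(values)):
        lo, hi = frame_bounds(i, preceding, following)
        out.append(sum(values[lo:hi + 1]))
    return out
```

With `values = [1, 2, 3, 4]`, the cumulative spec `(None, 0)` yields running totals, and the trailing spec `(2, 0)` sums at most three rows ending at the current one.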
211 changes: 177 additions & 34 deletions ibis/filesystems.py
@@ -16,14 +16,17 @@
# license), see the LICENSES directory.

from os import path as osp
from posixpath import join as pjoin
import os
import posixpath
import shutil

import six

from ibis.config import options
import ibis.common as com
import ibis.util as util


from hdfs.util import temppath


@@ -54,6 +57,22 @@ def status(self, path):
raise NotImplementedError

def head(self, hdfs_path, nbytes=1024, offset=0):
"""
Retrieve the requested number of bytes from a file
Parameters
----------
hdfs_path : string
Absolute HDFS path
nbytes : int, default 1024 (1K)
Number of bytes to retrieve
offset : int, default 0
Number of bytes at beginning of file to skip before retrieving data
Returns
-------
head_data : bytes
"""
raise NotImplementedError

def get(self, hdfs_path, local_path='.', overwrite=False):
@@ -67,7 +86,7 @@ def get(self, hdfs_path, local_path='.', overwrite=False):
"""
raise NotImplementedError

def put(self, hdfs_path, local_path, overwrite=False, verbose=None,
def put(self, hdfs_path, resource, overwrite=False, verbose=None,
**kwargs):
"""
Write file or directory to HDFS
@@ -76,8 +95,8 @@ def put(self, hdfs_path, local_path, overwrite=False, verbose=None,
----------
hdfs_path : string
Directory or path
local_path : string
Relative or absolute path to local resource
resource : string or buffer-like
Relative or absolute path to local resource, or a file-like object
overwrite : boolean, default False
verbose : boolean, default ibis options.verbose
@@ -90,6 +109,44 @@ def put(self, hdfs_path, local_path, overwrite=False, verbose=None,
"""
raise NotImplementedError

def put_tarfile(self, hdfs_path, local_path, compression='gzip',
verbose=None, overwrite=False):
"""
Write contents of tar archive to HDFS directly without having to
decompress it locally first
Parameters
----------
hdfs_path : string
local_path : string
compression : {'gzip', 'bz2', None}
overwrite : boolean, default False
verbose : boolean, default None (global default)
"""
import tarfile
modes = {
None: 'r',
'gzip': 'r:gz',
'bz2': 'r:bz2'
}

if compression not in modes:
raise ValueError('Invalid compression type {0}'
.format(compression))
mode = modes[compression]

        with tarfile.open(local_path, mode=mode) as tf:
            for info in tf:
                if not info.isfile():
                    continue

                buf = tf.extractfile(info)
                abspath = pjoin(hdfs_path, info.path)
                self.put(abspath, buf, verbose=verbose, overwrite=overwrite)
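
`put_tarfile` streams each archive member through the file-like handle returned by `extractfile`, so nothing is decompressed to disk first. The same iteration pattern, writing to a local directory instead of HDFS (assumed helper name, a sketch rather than the ibis API), looks like this:

```python
import os
import tarfile


def extract_files(tar_path, dest_dir, mode='r:gz'):
    # Iterate members the same way put_tarfile does, skipping non-file
    # entries, and write each member's bytes under dest_dir
    written = []
    with tarfile.open(tar_path, mode=mode) as tf:
        for info in tf:
            if not info.isfile():
                continue
            buf = tf.extractfile(info)
            out_path = os.path.join(dest_dir, os.path.basename(info.name))
            with open(out_path, 'wb') as f:
                f.write(buf.read())
            written.append(out_path)
    return written
```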

def put_zipfile(self, hdfs_path, local_path):
raise NotImplementedError

def write(self, hdfs_path, buf, overwrite=False, blocksize=None,
replication=None, buffersize=None):
raise NotImplementedError
@@ -107,13 +164,43 @@ def ls(self, hdfs_path, status=False):
"""
raise NotImplementedError

    def size(self, hdfs_path):
        """
        Return the total size of a file or directory

        Parameters
        ----------
        hdfs_path : string

        Returns
        -------
        size : int
        """
        raise NotImplementedError

def tail(self, hdfs_path, nbytes=1024):
raise NotImplementedError

def mv(self, hdfs_path_src, hdfs_path_dest, overwrite=True):
"""
Move hdfs_path_src to hdfs_path_dest
Parameters
----------
overwrite : boolean, default True
Overwrite hdfs_path_dest if it exists.
"""
raise NotImplementedError

def cp(self, hdfs_path_src, hdfs_path_dest):
raise NotImplementedError

def rm(self, path):
"""
Delete a single file
"""
return self.delete(path)

def rmdir(self, path):
"""
Delete a directory and all its contents
"""
self.client.delete(path, recursive=True)

def find_any_file(self, hdfs_dir):
@@ -148,6 +235,9 @@ def protocol(self):
return 'webhdfs'

def status(self, path):
"""
Retrieve HDFS metadata for path
"""
return self.client.status(path)

@implements(HDFS.exists)
@@ -171,45 +261,85 @@ def mkdir(self, dir_path, create_parent=False):
# ugh, see #252

# create a temporary file, then delete it
dummy = posixpath.join(dir_path, util.guid())
dummy = pjoin(dir_path, util.guid())
self.client.write(dummy, '')
self.client.delete(dummy)

@implements(HDFS.size)
def size(self, hdfs_path):
stat = self.status(hdfs_path)

if stat['type'] == 'FILE':
return stat['length']
elif stat['type'] == 'DIRECTORY':
total = 0
for path in self.ls(hdfs_path):
total += self.size(path)
return total
else:
raise NotImplementedError
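
The recursive size computation above has a direct local-filesystem analogue; this standalone sketch (hypothetical function name) mirrors the file/directory split in `WebHDFS.size` using `os.walk`:

```python
import os


def tree_size(path):
    # Total bytes under path: a file's own length, or the sum over all
    # files in a directory tree, mirroring WebHDFS.size
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for fname in filenames:
            total += os.path.getsize(os.path.join(dirpath, fname))
    return total
```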

@implements(HDFS.mv)
def mv(self, hdfs_path_src, hdfs_path_dest, overwrite=True):
if overwrite and self.exists(hdfs_path_dest):
if self.status(hdfs_path_dest)['type'] == 'FILE':
self.rm(hdfs_path_dest)
return self.client.rename(hdfs_path_src, hdfs_path_dest)

def delete(self, hdfs_path, recursive=False):
"""
"""
return self.client.delete(hdfs_path, recursive=recursive)

@implements(HDFS.head)
def head(self, hdfs_path, nbytes=1024, offset=0):
gen = self.client.read(hdfs_path, offset=offset, length=nbytes)
return ''.join(gen)

@implements(HDFS.put)
def put(self, hdfs_path, local_path, overwrite=False, verbose=None,
def put(self, hdfs_path, resource, overwrite=False, verbose=None,
**kwargs):
if osp.isdir(local_path):
for dirpath, dirnames, filenames in os.walk(local_path):
rel_dir = osp.relpath(dirpath, local_path)
verbose = verbose or options.verbose
is_path = isinstance(resource, six.string_types)

if is_path and osp.isdir(resource):
for dirpath, dirnames, filenames in os.walk(resource):
rel_dir = osp.relpath(dirpath, resource)
if rel_dir == '.':
rel_dir = ''
for fpath in filenames:
abs_path = osp.join(dirpath, fpath)
rel_hdfs_path = posixpath.join(hdfs_path, rel_dir, fpath)
rel_hdfs_path = pjoin(hdfs_path, rel_dir, fpath)
self.put(rel_hdfs_path, abs_path, overwrite=overwrite,
verbose=verbose, **kwargs)
else:
if verbose:
self.log('Writing local {} to HDFS {}'.format(local_path,
hdfs_path))
self.client.upload(hdfs_path, local_path,
overwrite=overwrite, **kwargs)
if is_path:
basename = os.path.basename(resource)
if self.exists(hdfs_path):
if self.status(hdfs_path)['type'] == 'DIRECTORY':
hdfs_path = pjoin(hdfs_path, basename)
if verbose:
self.log('Writing local {0} to HDFS {1}'.format(resource,
hdfs_path))
self.client.upload(hdfs_path, resource,
overwrite=overwrite, **kwargs)
else:
if verbose:
self.log('Writing buffer to HDFS {0}'.format(hdfs_path))
                # TODO: write from the file handle incrementally to support
                # general handle types
                resource.seek(0)
                self.client.write(hdfs_path, resource.read(),
overwrite=overwrite, **kwargs)

@implements(HDFS.get)
def get(self, hdfs_path, local_path, overwrite=False):
def get(self, hdfs_path, local_path, overwrite=False, verbose=None):
verbose = verbose or options.verbose

hdfs_path = hdfs_path.rstrip(posixpath.sep)

if osp.isdir(local_path):
if osp.isdir(local_path) and not overwrite:
dest = osp.join(local_path, posixpath.basename(hdfs_path))
else:
local_dir = osp.dirname(local_path) or '.'
@@ -222,43 +352,56 @@ def get(self, hdfs_path, local_path, overwrite=False):

# TODO: threadpool

def _get_file(remote, local):
if verbose:
self.log('Writing HDFS {0} to local {1}'.format(remote, local))
self.client.download(remote, local, overwrite=overwrite)

def _scrape_dir(path, dst):
objs = self.client.list(path)
for hpath, detail in objs:
relpath = posixpath.relpath(hpath, hdfs_path)
full_opath = posixpath.join(dst, relpath)
full_opath = pjoin(dst, relpath)

if detail['type'] == 'FILE':
self.client.download(hpath, full_opath)
_get_file(hpath, full_opath)
else:
os.makedirs(full_opath)
_scrape_dir(hpath, dst)

status = self.status(hdfs_path)
if status['type'] == 'FILE':
if not overwrite and osp.exists(local_path):
raise Exception('{0} exists'.format(local_path))
raise IOError('{0} exists'.format(local_path))

self.client.download(hdfs_path, local_path)
_get_file(hdfs_path, local_path)
else:
# TODO: partitioned files

with temppath() as tpath:
_temp_dir_path = osp.join(tpath, posixpath.basename(hdfs_path))
os.makedirs(_temp_dir_path)
_scrape_dir(hdfs_path, _temp_dir_path)
shutil.move(_temp_dir_path, local_path)

return dest

def write(self, hdfs_path, buf, overwrite=False, blocksize=None,
replication=None, buffersize=None):
"""
Write a buffer-like object to indicated HDFS path
if verbose:
self.log('Moving {0} to {1}'.format(_temp_dir_path,
local_path))

if overwrite and osp.exists(local_path):
# swap and delete
local_swap_path = util.guid()
shutil.move(local_path, local_swap_path)

try:
shutil.move(_temp_dir_path, local_path)
if verbose:
msg = 'Deleting original {0}'.format(local_path)
self.log(msg)
shutil.rmtree(local_swap_path)
                    except:
                        # undo the swap and re-raise
                        shutil.move(local_swap_path, local_path)
                        raise
else:
shutil.move(_temp_dir_path, local_path)

Parameters
----------
"""
self.client.write(buf, hdfs_path, overwrite=overwrite,
blocksize=blocksize, replication=replication,
buffersize=buffersize)
return dest
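
The overwrite path in `get` uses a swap-and-rollback move: the existing directory is parked under a temporary name, the new tree is moved into place, and the original is restored if anything fails. That pattern in isolation (a standalone sketch with assumed names, not the ibis API):

```python
import os
import shutil
import uuid


def replace_dir(new_dir, target):
    # Park the existing target, move new_dir into place, and restore the
    # original target if the move fails
    swap = target + '.' + uuid.uuid4().hex
    had_target = os.path.exists(target)
    if had_target:
        shutil.move(target, swap)
    try:
        shutil.move(new_dir, target)
    except Exception:
        if had_target:
            shutil.move(swap, target)
        raise
    if had_target:
        shutil.rmtree(swap)
    return target
```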
124 changes: 83 additions & 41 deletions ibis/sql/compiler.py
@@ -12,14 +12,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.

# An Ibis analytical expression will typically consist of a primary SELECT
# statement, with zero or more supporting DDL queries. For example we would
# want to support converting a text file in HDFS to a Parquet-backed Impala
# table, with optional teardown if the user wants the intermediate converted
# table to be temporary.

from collections import defaultdict

import ibis.common as com
import ibis.expr.analysis as L
import ibis.expr.operations as ops
import ibis.expr.types as ir
@@ -173,8 +168,8 @@ def _build_result_query(self):
def _populate_context(self):
# Populate aliases for the distinct relations used to output this
# select statement.

self._make_table_aliases(self.table_set)
if self.table_set is not None:
self._make_table_aliases(self.table_set)

# XXX: This is a temporary solution to the table-aliasing / correlated
# subquery problem. Will need to revisit and come up with a cleaner
@@ -219,7 +214,7 @@ def _analyze_select_exprs(self):
def _visit_select_expr(self, expr):
op = expr.op()

method = '_visit_select_{}'.format(type(op).__name__)
method = '_visit_select_{0}'.format(type(op).__name__)
if hasattr(self, method):
f = getattr(self, method)
return f(expr)
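
The `_visit_select_{0}`, `_visit_filter_{0}`, and `_collect_{0}` hooks all dispatch on the operation's class name via `getattr`. The pattern in miniature (illustrative class names, not the compiler's actual node types):

```python
class Visitor(object):
    # Dispatch to _visit_<ClassName> when such a method is defined,
    # otherwise fall back to a generic handler, mirroring the compiler's
    # method lookup
    def visit(self, node):
        method = '_visit_{0}'.format(type(node).__name__)
        f = getattr(self, method, None)
        if f is not None:
            return f(node)
        return self._visit_default(node)

    def _visit_default(self, node):
        return 'default'


class Add(object):
    pass


class Demo(Visitor):
    def _visit_Add(self, node):
        return 'add'
```

New node types get handled simply by defining a correspondingly named method, with no central registry to update.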
@@ -299,14 +294,14 @@ def _visit_filter(self, expr):

op = expr.op()

method = '_visit_filter_{}'.format(type(op).__name__)
method = '_visit_filter_{0}'.format(type(op).__name__)
if hasattr(self, method):
f = getattr(self, method)
return f(expr)

unchanged = True
if isinstance(expr, ir.ScalarExpr):
if expr.is_reduction():
if ops.is_reduction(expr):
return self._rewrite_reduction_filter(expr)

if isinstance(op, ops.BinaryOp):
@@ -364,9 +359,16 @@ def _visit_filter_TopK(self, expr):
op = expr.op()

metrics = [op.by.name(metric_name)]
rank_set = (self.table_set.aggregate(metrics, by=[op.arg])
.sort_by([(metric_name, False)])
.limit(op.k))

arg_table = L.find_base_table(op.arg)
by_table = L.find_base_table(op.by)

if arg_table.equals(by_table):
agg = arg_table.aggregate(metrics, by=[op.arg])
else:
agg = self.table_set.aggregate(metrics, by=[op.arg])

rank_set = agg.sort_by([(metric_name, False)]).limit(op.k)

pred = (op.arg == getattr(rank_set, op.arg.get_name()))
self.table_set = self.table_set.semi_join(rank_set, [pred])
@@ -384,20 +386,28 @@ def _collect_elements(self):
# expression that is being translated only depends on a single table
# expression.

source_table = self.query_expr
source_expr = self.query_expr

# hm, is this the best place for this?
root_op = source_table.op()
root_op = source_expr.op()
if (isinstance(root_op, ops.Join) and
not isinstance(root_op, ops.MaterializedJoin)):
# Unmaterialized join
source_table = source_table.materialize()
source_expr = source_expr.materialize()

self._collect(source_table, toplevel=True)
if isinstance(root_op, ops.TableNode):
self._collect(source_expr, toplevel=True)
if self.table_set is None:
raise com.InternalError('no table set')
else:
if isinstance(root_op, ir.ExpressionList):
self.select_set = source_expr.exprs()
else:
self.select_set = [source_expr]

def _collect(self, expr, toplevel=False):
op = expr.op()
method = '_collect_{}'.format(type(op).__name__)
method = '_collect_{0}'.format(type(op).__name__)

# Do not visit nodes twice
if op in self.op_memo:
Expand Down Expand Up @@ -576,7 +586,8 @@ def __init__(self, query, greedy=False):
self.expr_counts = defaultdict(lambda: 0)

def get_result(self):
self.visit(self.query.table_set)
if self.query.table_set is not None:
self.visit(self.query.table_set)

for clause in self.query.filters:
self.visit(clause)
@@ -608,7 +619,7 @@ def _has_been_observed(self, expr):

def visit(self, expr):
node = expr.op()
method = '_visit_{}'.format(type(node).__name__)
method = '_visit_{0}'.format(type(node).__name__)

if hasattr(self, method):
f = getattr(self, method)
@@ -758,47 +769,78 @@ def _adapt_expr(expr):
#
# Canonical case is scalar values or arrays produced by some reductions
# (simple reductions, or distinct, say)
as_is = lambda x: x
def as_is(x):
return x

if isinstance(expr, ir.TableExpr):
return expr, as_is

def _scalar_reduce(x):
return isinstance(x, ir.ScalarExpr) and x.is_reduction()

if _scalar_reduce(expr):
table_expr = _reduction_to_aggregation(expr, agg_name='tmp')
return isinstance(x, ir.ScalarExpr) and ops.is_reduction(x)

if isinstance(expr, ir.ScalarExpr):
def scalar_handler(results):
return results['tmp'][0]

return table_expr, scalar_handler
if _scalar_reduce(expr):
table_expr = _reduction_to_aggregation(expr, agg_name='tmp')
return table_expr, scalar_handler
else:
base_table = L.find_base_table(expr)
if base_table is None:
# expr with no table refs
return expr.name('tmp'), scalar_handler
else:
raise NotImplementedError(expr._repr())

elif isinstance(expr, ir.ExprList):
exprs = expr.exprs()
for expr in exprs:
if not _scalar_reduce(expr):
raise NotImplementedError(expr)

table = L.find_base_table(exprs[0])
return table.aggregate(exprs), as_is
is_aggregation = True
any_aggregation = False

for x in exprs:
if not _scalar_reduce(x):
is_aggregation = False
else:
any_aggregation = True

if is_aggregation:
table = L.find_base_table(exprs[0])
return table.aggregate(exprs), as_is
elif not any_aggregation:
return expr, as_is
else:
raise NotImplementedError(expr._repr())

elif isinstance(expr, ir.ArrayExpr):
op = expr.op()

if isinstance(op, (ops.TableColumn, ops.DistinctArray)):
table_expr = op.table

def _get_column(name):
def column_handler(results):
return results[op.name]
result_handler = column_handler
return results[name]
return column_handler

if isinstance(op, ops.TableColumn):
table_expr = op.table
result_handler = _get_column(op.name)
else:
# Something more complicated.
base_table = L.find_source_table(expr)
table_expr = base_table.projection([expr.name('tmp')])

def projection_handler(results):
return results['tmp']
if isinstance(op, ops.DistinctArray):
expr = op.arg
try:
name = op.arg.get_name()
except Exception:
name = 'tmp'

result_handler = projection_handler
table_expr = (base_table.projection([expr.name(name)])
.distinct())
result_handler = _get_column(name)
else:
table_expr = base_table.projection([expr.name('tmp')])
result_handler = _get_column('tmp')

return table_expr, result_handler
else: