What is Ibis?
===

Ibis is a Python-based analytics framework, inspired by the pandas API and other successful tabular data manipulation toolkits, designed to use remote execution engines to provide speed and scalability for large distributed data sets. At the moment, we are focused on **Impala** as the execution engine for Ibis.

The project breaks down into a number of logical components that fit together:

- A domain-specific language, or **DSL** (as we'll call it henceforth), for describing data transformations, analytics, ETL, and any other dataset or system manipulation steps. This is a fancy way of saying "a Python library with classes and functions that provide a higher-level mode of expression".

- Tools to support workflows involving data ingest, ETL, caching, database view creation, and so forth. We aim to free you from some of the low-level details of interacting with analytical data stores so that you can focus on the actual data analysis.

- Future: Powerful user-defined function support within the Ibis DSL. This will also require some customizations to the execution engine (i.e. Impala) to be supported.

- Future: Use the Ibis DSL with in-database machine learning toolkits like MADlib.

The Ibis DSL aims for several goals:

- Remote computation from a local Python compute session
- Composability; easy to chain together operations and build pipelines
- Integration with pandas and other Python libraries
- Semantic completeness: support any operations possible in the underlying compute systems, e.g. Impala SQL
- Validation as you go: catch mistakes right away, where possible
- Ease of code reuse

Installation
===

To install Ibis, use `pip` (or `easy_install`, though `pip` is the better choice). To install Ibis from source, clone the repository and run the `setup.py` script:

    > git clone https://github.com/cloudera/ibis.git
    > cd ibis
    > python setup.py install


Note that ibis depends on a number of other Python libraries. If you are missing any of those libraries, `pip` will attempt to install them.
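
To see ahead of time which libraries are absent from the current environment, a small check along the following lines may help. This is only a sketch: `missing_modules` is a hypothetical helper, and the module list is illustrative (assumed from the libraries mentioned in this document), not the authoritative dependency list, which lives in Ibis's `setup.py`.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported here."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Illustrative list only (an assumption, not taken from setup.py).
# "impala" is the import name of the impyla package;
# "hdfs" is the import name of the HdfsCLI package.
CANDIDATES = ["ibis", "pandas", "impala", "hdfs"]

if __name__ == "__main__":
    print("Not importable:", missing_modules(CANDIDATES))
```

An empty result means every listed module can be found; it does not verify versions.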

Import ibis and verify that all is working like so:

In [None]:
import ibis
ibis.test()

**If you see some `WARNING` messages, don't worry; they are not serious. We plan to get rid of them at some point.**

Next, make sure you can connect to your Impala cluster:

In [None]:
ic = ibis.impala_connect(host='quickstart.cloudera')
ic

Substitute the parameters that are appropriate for your environment (see the docstring for `ibis.impala_connect`). `impala_connect` uses the same parameters as the DBAPI interface of Impyla (https://pypi.python.org/pypi/impyla).
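
One way to keep environment-specific values out of the notebook is to read them from environment variables. The sketch below is an assumption-laden example: the variable names `IBIS_IMPALA_HOST` and `IBIS_IMPALA_PORT` and the helper `impala_params` are hypothetical, and `port` is assumed to be accepted because Impyla's `connect` takes `host` and `port` (with 21050 as its default port).

```python
import os


def impala_params(env=None):
    """Build keyword arguments for ibis.impala_connect from environment variables.

    The variable names are hypothetical; adapt them to your own conventions.
    """
    env = os.environ if env is None else env
    return {
        # Fall back to the quickstart host used elsewhere in this document.
        "host": env.get("IBIS_IMPALA_HOST", "quickstart.cloudera"),
        # 21050 is the default port used by Impyla's DBAPI connect.
        "port": int(env.get("IBIS_IMPALA_PORT", "21050")),
    }


# Usage (requires a reachable Impala cluster):
#   import ibis
#   ic = ibis.impala_connect(**impala_params())
```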

If you have WebHDFS available, connect to HDFS according to your WebHDFS configuration. For kerberized or more complex HDFS clusters, see http://hdfscli.readthedocs.org/en/latest/ for information on connecting. You can use a connection from that library instead of using `hdfs_connect`.

In [None]:
hdfs = ibis.hdfs_connect(host='quickstart.cloudera', port=50070)
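
If the connection fails, it can help to first confirm that the WebHDFS port is reachable at all from your machine. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, and a successful TCP connection only shows that something is listening, not that WebHDFS is correctly configured.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host name and attempts to connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Usage, with the host and port from the cell above:
#   port_is_open('quickstart.cloudera', 50070)
```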

Finally, create the Ibis client:

In [None]:
con = ibis.make_client(ic, hdfs_client=hdfs)
con

Loading the testing/demo data and running the test suite
===

Look for the separately distributed `ibis-testing-data.tar.gz` tarball. Unpack it and run the `scripts/load_test_data.py` program that is distributed with Ibis. This requires a user login with `CREATE DATABASE` and `CREATE TABLE` permissions, so you may need help from an admin if you don't have these permissions.

    > mv ibis-testing-data.tar.gz scripts
    > cd scripts
    > tar xvf ibis-testing-data.tar.gz
    > python load_test_data.py

**Note that this creates a database called `ibis_testing`, which we will use in these tutorials.**
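
On systems without a `tar` command, the unpacking step can be done from Python instead. This is a rough sketch using the standard-library `tarfile` module; `extract_archive` is a hypothetical helper, not part of Ibis, and it covers only the extraction step, not `load_test_data.py`.

```python
import tarfile


def extract_archive(archive_path, dest_dir="."):
    """Extract a (possibly compressed) tar archive into dest_dir.

    Returns the list of member names found in the archive.
    """
    # Mode "r" (the default) detects gzip compression transparently.
    with tarfile.open(archive_path) as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members (absolute paths, etc.).
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return names


# Usage, run from the scripts directory after moving the tarball there:
#   extract_archive("ibis-testing-data.tar.gz")
```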

To check exhaustively that Ibis works, including all interactions with HDFS and Impala, run:

In [None]:
ibis.test(include_e2e=True)