444 changes: 203 additions & 241 deletions .circleci/config.yml

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions .dockerignore
@@ -0,0 +1,3 @@
ci/udf/CMakeCache.txt
ci/udf/CMakeFiles/
ci/udf/Makefile
13 changes: 10 additions & 3 deletions .gitignore
@@ -9,6 +9,7 @@
*.kdev4
*.log
*.swp
*.swo
*.pdb
.idea

@@ -47,14 +48,20 @@ Icon?
docs/source/generated
docs/source/generated-notebooks

# Ibis testing data
ci/ibis-testing-data*
ci/ibis_testing.db

# UDF testing generated files
testing/udf/CMakeCache.txt
testing/udf/CMakeFiles/
testing/udf/Makefile
ci/udf/CMakeCache.txt
ci/udf/CMakeFiles/
ci/udf/Makefile
.cache/

# test data
scripts/ibis-testing*
ibis_testing*
.tox/
.asv/
.ipynb_checkpoints/
.pytest_cache
4 changes: 0 additions & 4 deletions MANIFEST.in
@@ -4,13 +4,9 @@ include README.md
include LICENSE.txt
graft ibis
graft LICENSES
graft scripts
graft conda-recipes

graft docs
prune docs/source/generated
prune docs/source/generated-notebooks
prune docs/build

global-exclude *CMakeCache*
global-exclude *.o
37 changes: 22 additions & 15 deletions README.md
@@ -1,31 +1,38 @@
# Ibis: Python data analysis framework for Hadoop and SQL engines

[![Anaconda-Server Badge](https://anaconda.org/conda-forge/ibis-framework/badges/version.svg)](https://anaconda.org/conda-forge/ibis-framework)
[![Documentation Status](https://img.shields.io/badge/docs-docs.ibis--project.org-blue.svg)](http://docs.ibis-project.org)
[![CircleCI Status](https://circleci.com/gh/ibis-project/ibis.svg?style=shield&circle-token=b84ff8383cbb0d6788ee0f9635441cb962949a4f)](https://circleci.com/gh/ibis-project/ibis/tree/master)
[![AppVeyor Status](https://ci.appveyor.com/api/projects/status/github/ibis-project/ibis?branch=master&svg=true)](https://ci.appveyor.com/project/cpcloud/ibis-xh5g1)
[![Documentation Status](https://readthedocs.org/projects/ibis-project/badge/?version=latest)](http://ibis-project.readthedocs.io/en/latest/?badge=latest)

Current release from Anaconda.org [![Anaconda-Server Badge](https://anaconda.org/conda-forge/ibis-framework/badges/version.svg)](https://anaconda.org/conda-forge/ibis-framework)

Ibis is a toolbox to bridge the gap between local Python environments, remote
storage, execution systems like Hadoop components (HDFS, Impala, Hive, Spark)
and SQL databases. Its goal is to simplify analytical workflows and make you
more productive.

# Ibis: Python data analysis framework for Hadoop and SQL engines
Install Ibis from PyPI with:

Ibis is a toolbox to bridge the gap between local Python environments,
remote storage, execution systems like Hadoop components (HDFS, Impala,
Hive, Spark) and SQL databases. Its goal is to simplify analytical
workflows and make you more productive.
```sh
pip install ibis-framework
```

Install Ibis from PyPI with:
or from conda-forge with

$ pip install ibis-framework
```sh
conda install ibis-framework -c conda-forge
```

At this time, Ibis provides tools for interacting with the following
systems:
Ibis currently provides tools for interacting with the following systems:

- [Apache Impala (incubating)](http://impala.io/)
- [Apache Kudu](http://getkudu.io)
- [Hadoop Distributed File System (HDFS)](https://hadoop.apache.org/)
- PostgreSQL (Experimental)
- SQLite
- Direct execution of ibis expressions against pandas object (Experimental)
- [PostgreSQL](https://www.postgresql.org/)
- [MySQL](https://www.mysql.com/) (Experimental)
- [SQLite](https://www.sqlite.org/)
- [Pandas](https://pandas.pydata.org/) [DataFrames](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) (Experimental)
- [Clickhouse](https://clickhouse.yandex)
- [BigQuery](https://cloud.google.com/bigquery)

Learn more about using the library at http://docs.ibis-project.org and read the
project blog at http://ibis-project.org for news and updates.
53 changes: 39 additions & 14 deletions appveyor.yml
@@ -4,14 +4,31 @@ platform:
- x64

environment:
PGPORT: "5432"
PGHOST: "localhost"
PGUSER: "postgres"
PGPASSWORD: "Password12!"
IBIS_POSTGRES_USER: "%PGUSER%"
IBIS_POSTGRES_PASS: "%PGPASSWORD%"
DATA_DIR: "%USERPROFILE%\\ibis-testing-data"
DATA_URL: "https://storage.googleapis.com/ibis-ci-data"
IBIS_TEST_POSTGRES_DB: "ibis_testing"
IBIS_TEST_SQLITE_DB_PATH: "%USERPROFILE%\\ibis_testing.db"

IBIS_TEST_DOWNLOAD_DIRECTORY: "%USERPROFILE%"
IBIS_TEST_DOWNLOAD_BASE_URL: "https://storage.googleapis.com/ibis-ci-data"
IBIS_TEST_DOWNLOAD_NAME: "ibis-testing-data.tar.gz"

IBIS_TEST_DATA_DIRECTORY: "%USERPROFILE%\\ibis-testing-data"

IBIS_TEST_POSTGRES_PORT: "%PGPORT%"
IBIS_TEST_POSTGRES_HOST: "%PGHOST%"
IBIS_TEST_POSTGRES_USER: "%PGUSER%"
IBIS_TEST_POSTGRES_PASSWORD: "%PGPASSWORD%"
IBIS_TEST_POSTGRES_DATABASE: "ibis_testing"

IBIS_TEST_MYSQL_HOST: "localhost"
IBIS_TEST_MYSQL_PORT: "3306"
IBIS_TEST_MYSQL_USER: "root"
IBIS_TEST_MYSQL_PASSWORD: "Password12!"
IBIS_TEST_MYSQL_DATABASE: "ibis_testing"

IBIS_TEST_SQLITE_DATABASE: "%USERPROFILE%\\ibis_testing.db"

CONDA: "C:\\Miniconda36-x64\\Scripts\\conda"
ACTIVATE: "C:\\Miniconda36-x64\\Scripts\\activate"

@@ -22,18 +39,26 @@ environment:
- PYTHON_VERSION: "3.6"

services:
- postgresql93
- mysql
- postgresql101

test_script:
- "set PATH=C:\\Program Files\\PostgreSQL\\10\\bin\\;%PATH%"
- "psql -c \"SELECT VERSION()\""

- "%CONDA% --version"
- "%CONDA% config --set always_yes true"
- "%CONDA% install conda=4.3.22 --channel conda-forge"
- "%CONDA% create --name \"ibis_%PYTHON_VERSION%\" python=%PYTHON_VERSION% --channel conda-forge"
- "%ACTIVATE% \"ibis_%PYTHON_VERSION%\""
- "pip install -e .\"[sqlite, postgres, visualization, pandas]\""
- "pip install flake8 mock pytest click \"pbs==0.110\""
- "%CONDA% install --channel conda-forge pytables numpy sqlalchemy psycopg2 graphviz click mock plumbum flake8 pytest"
- "%CONDA% list"
- "pip install -e .\"[sqlite, postgres, mysql, visualization, pandas, csv, hdf5]\""

- "flake8"
- "python ci\\datamgr.py download --directory \"%USERPROFILE%\""
- "python ci\\datamgr.py sqlite --database \"%IBIS_TEST_SQLITE_DB_PATH%\" --data-directory \"%DATA_DIR%\" --script ci\\sqlite_load.sql functional_alltypes batting awards_players diamonds"
- "python ci\\datamgr.py postgres --database \"%IBIS_TEST_POSTGRES_DB%\" --data-directory \"%DATA_DIR%\" --script ci\\postgresql_load.sql functional_alltypes batting awards_players diamonds"
- "pytest --tb=short -m \"not impala and not hdfs\" ibis"

- "python ci\\datamgr.py download"
- "python ci\\datamgr.py parquet -i"
- "python ci\\datamgr.py mysql"
- "python ci\\datamgr.py sqlite"
- "python ci\\datamgr.py postgres"
- "pytest --tb=short -m \"not backend and not clickhouse and not impala and not hdfs and not bigquery\" -rs ibis"
27 changes: 27 additions & 0 deletions ci/Dockerfile
@@ -0,0 +1,27 @@
FROM ibisproject/miniconda3

# fonts are for docs
RUN apt-get -qq update -y \
&& apt-get -qq install -y --no-install-recommends ttf-dejavu \
git gcc make clang libboost-dev postgresql-client ca-certificates \
&& rm -rf /var/lib/apt/lists/*

ARG PYTHON
ARG ENVKIND

ADD ci/requirements-${ENVKIND}-${PYTHON}.yml /

RUN conda env create -q -n ibis-${ENVKIND}-${PYTHON} -f /requirements-${ENVKIND}-${PYTHON}.yml \
&& conda install conda-build -y -q

# We intentionally keep conda artifacts in the image to speed up recipe building.
# To reduce image size instead, run the following in the previous layer:
# && conda clean -a -y

RUN echo 'source activate ibis-'${ENVKIND}-${PYTHON}' && exec "$@"' > activate.sh

ADD . /ibis
WORKDIR /ibis
RUN bash /activate.sh python setup.py develop

ENTRYPOINT ["bash", "/activate.sh"]
16 changes: 10 additions & 6 deletions ci/asvconfig.py
@@ -1,13 +1,17 @@
#!/usr/bin/env python

if __name__ == '__main__':
import os
import json
import socket
import sys
import asv
import json
import socket


import asv
if __name__ == '__main__':
if len(sys.argv) > 1:
hostname = sys.argv[1]
else:
hostname = socket.gethostname()

hostname = 'circle' if os.environ.get('CIRCLECI') else socket.gethostname()
machine_info = asv.machine.Machine.get_defaults()
machine_info['machine'] = hostname
machine_info['ram'] = '{:d}GB'.format(int(machine_info['ram']) // 1000000)
9 changes: 9 additions & 0 deletions ci/benchmark.sh
@@ -0,0 +1,9 @@
#!/usr/bin/env bash

CWD=$(dirname $0)

pip install asv
$CWD/asvconfig.py $1 | tee $HOME/.asv-machine.json
git remote add upstream https://github.com/ibis-project/ibis
git fetch upstream refs/heads/master
asv continuous -f 1.5 -e upstream/master $2 || echo > /dev/null
7 changes: 7 additions & 0 deletions ci/build.sh
@@ -0,0 +1,7 @@
#!/bin/bash -e

docker-compose rm --force --stop
docker-compose up -d --no-build postgres mysql clickhouse impala
docker-compose run --rm waiter
docker-compose build --pull ibis
docker-compose run --rm ibis ci/load-data.sh
466 changes: 215 additions & 251 deletions ci/datamgr.py

Large diffs are not rendered by default.

96 changes: 96 additions & 0 deletions ci/docker-compose.yml
@@ -0,0 +1,96 @@
version: '3'
services:

postgres:
image: postgres
ports:
- 5432:5432
environment:
POSTGRES_PASSWORD: postgres

mysql:
image: mariadb:10.2
ports:
- 3306:3306
environment:
- MYSQL_ALLOW_EMPTY_PASSWORD=1
- MYSQL_DATABASE=ibis_testing
- MYSQL_USER=ibis
- MYSQL_PASSWORD=ibis

impala:
image: ibisproject/impala
hostname: impala
networks:
default:
aliases:
- quickstart.cloudera
environment:
PGPASSWORD: postgres
ports:
# HDFS
- 9020:9020
- 50070:50070
- 50075:50075
- 8020:8020
- 8042:8042
# Hive
- 9083:9083
# Impala
- 21000:21000
- 21050:21050
- 25000:25000
- 25010:25010
- 25020:25020

clickhouse:
image: yandex/clickhouse-server:1.1.54327
ports:
- 8123:8123
- 9000:9000

waiter:
image: jwilder/dockerize
command: |
dockerize -wait tcp://mysql:3306
-wait tcp://postgres:5432
-wait tcp://impala:21050
-wait tcp://impala:50070
-wait tcp://clickhouse:9000
-wait-retry-interval 5s
-timeout 5m
ibis:
image: ibis:${PYTHON_VERSION:-3.6}
environment:
- IBIS_TEST_DOWNLOAD_DIRECTORY=/tmp
- IBIS_TEST_DATA_DIRECTORY=/tmp/ibis-testing-data
- IBIS_TEST_SQLITE_DATABASE=/tmp/ibis_testing.db
- IBIS_TEST_NN_HOST=impala
- IBIS_TEST_IMPALA_HOST=impala
- IBIS_TEST_IMPALA_PORT=21050
- IBIS_TEST_WEBHDFS_PORT=50070
- IBIS_TEST_WEBHDFS_USER=hdfs
- IBIS_TEST_MYSQL_HOST=mysql
- IBIS_TEST_MYSQL_PORT=3306
- IBIS_TEST_MYSQL_USER=ibis
- IBIS_TEST_MYSQL_PASSWORD=ibis
- IBIS_TEST_MYSQL_DATABASE=ibis_testing
- IBIS_TEST_POSTGRES_HOST=postgres
- IBIS_TEST_POSTGRES_PORT=5432
- IBIS_TEST_POSTGRES_USER=postgres
- IBIS_TEST_POSTGRES_PASSWORD=postgres
- IBIS_TEST_POSTGRES_DATABASE=ibis_testing
- IBIS_TEST_CLICKHOUSE_HOST=clickhouse
- IBIS_TEST_CLICKHOUSE_PORT=9000
- IBIS_TEST_CLICKHOUSE_DATABASE=ibis_testing
- GOOGLE_BIGQUERY_PROJECT_ID=ibis-gbq
- GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcloud-service-key.json
volumes:
- /tmp/ibis:/tmp
build:
context: ..
dockerfile: ci/Dockerfile
args:
PYTHON: ${PYTHON_VERSION:-3.6}
ENVKIND: ${ENVKIND:-dev}
14 changes: 14 additions & 0 deletions ci/docs.sh
@@ -0,0 +1,14 @@
#!/bin/bash -e

export ENVKIND=docs
export PYTHON_VERSION="3.6"

docker-compose build --pull ibis
docker-compose run --rm ibis ping -c 1 quickstart.cloudera
docker-compose run --rm ibis rm -rf /tmp/docs.ibis-project.org
docker-compose run --rm ibis git clone \
--branch gh-pages \
https://github.com/ibis-project/docs.ibis-project.org /tmp/docs.ibis-project.org

docker-compose run --rm ibis find /tmp/docs.ibis-project.org -maxdepth 1 ! -wholename /tmp/docs.ibis-project.org ! -name '*.git' ! -name '.' ! -name 'CNAME' ! -name '*.nojekyll' -exec rm -rf {} \;
docker-compose run --rm ibis sphinx-build -b html docs/source /tmp/docs.ibis-project.org -W -j auto -T
43 changes: 24 additions & 19 deletions scripts/test_data_admin.py → ci/impalamgr.py
@@ -14,20 +14,23 @@
# limitations under the License.

import os
import shutil
import ibis
import click
import tempfile

from subprocess import check_call

import sh
import click
from plumbum import local, CommandNotFound
from plumbum.cmd import rm, make, cmake

import ibis
from ibis.compat import BytesIO
from ibis.compat import BytesIO, Path
from ibis.common import IbisError
from ibis.impala.tests.common import IbisTestEnv


SCRIPT_DIR = Path(__file__).parent.absolute()
DATA_DIR = Path(os.environ.get('IBIS_TEST_DATA_DIRECTORY',
SCRIPT_DIR / 'ibis-testing-data'))


ENV = IbisTestEnv()


@@ -63,18 +66,18 @@ def can_write_to_hdfs(con):

def can_build_udfs():
try:
sh.which('cmake')
except sh.ErrorReturnCode:
local.which('cmake')
except CommandNotFound:
print('Could not find cmake on PATH')
return False
try:
sh.which('make')
except sh.ErrorReturnCode:
local.which('make')
except CommandNotFound:
print('Could not find make on PATH')
return False
try:
sh.which('clang++')
except sh.ErrorReturnCode:
local.which('clang++')
except CommandNotFound:
print('Could not find LLVM on PATH; if IBIS_TEST_LLVM_CONFIG is set, '
'try setting PATH="$($IBIS_TEST_LLVM_CONFIG --bindir):$PATH"')
return False
@@ -184,13 +187,15 @@ def create_avro_tables(con):
def build_udfs():
print('Building UDFs')
ibis_home_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
udf_dir = os.path.join(ibis_home_dir, 'testing', 'udf')
check_call('cmake . && make VERBOSE=1', shell=True, cwd=udf_dir)
udf_dir = os.path.join(ibis_home_dir, 'ci', 'udf')

with local.cwd(udf_dir):
assert (cmake('.') and make('VERBOSE=1'))


def upload_udfs(con):
ibis_home_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
build_dir = os.path.join(ibis_home_dir, 'testing', 'udf', 'build')
build_dir = os.path.join(ibis_home_dir, 'ci', 'udf', 'build')
bitcode_dir = os.path.join(ENV.test_data_dir, 'udf')
print('Uploading UDFs to {}'.format(bitcode_dir))
if con.hdfs.exists(bitcode_dir):
@@ -220,7 +225,7 @@ def main():
'Path to testing data. This downloads data from Google Cloud Storage '
'if unset'
),
type=click.Path(exists=True)
default=DATA_DIR
)
@click.option(
'--overwrite', is_flag=True, help='Forces overwriting of data/UDFs'
@@ -241,9 +246,9 @@ def load(data, udf, data_dir, overwrite):
if data:
tmp_dir = tempfile.mkdtemp(prefix='__ibis_tmp_')
try:
load_impala_data(con, data_dir, overwrite)
load_impala_data(con, str(data_dir), overwrite)
finally:
shutil.rmtree(tmp_dir)
rm('-rf', tmp_dir)
else:
print('Skipping Ibis test data load (--no-data)')

41 changes: 41 additions & 0 deletions ci/load-data.sh
@@ -0,0 +1,41 @@
#!/usr/bin/env bash

CWD="$(dirname "${0}")"

declare -A argcommands=([sqlite]=sqlite
[parquet]="parquet -i"
[postgres]=postgres
[clickhouse]=clickhouse
[mysql]=mysql
[impala]=impala)

if [[ "$#" == 0 ]]; then
ARGS=(${!argcommands[@]}) # keys of argcommands
else
ARGS=($*)
fi

python $CWD/datamgr.py download

for arg in ${ARGS[@]}; do
if [[ "${arg}" == "impala" ]]; then
python "${CWD}"/impalamgr.py load --data &
else
python "${CWD}"/datamgr.py ${argcommands[${arg}]} &
fi
done

FAIL=0

for job in `jobs -p`
do
wait "${job}" || let FAIL+=1
done

if [[ "${FAIL}" == 0 ]]; then
echo "Done loading ${ARGS[@]}"
exit 0
else
echo "Failed loading ${ARGS[@]}" >&2
exit 1
fi
33 changes: 22 additions & 11 deletions ci/requirements-dev-2.7.yml
@@ -2,27 +2,38 @@ channels:
- conda-forge
dependencies:
- click
- clickhouse-cityhash
- clickhouse-driver>=0.0.8
- clickhouse-sqlalchemy
- cmake
- enum34
- flake8
- funcsigs
- functools32
- google-cloud-bigquery<0.28
- graphviz
- impyla>=0.13.7
- impyla>=0.14.0
- lz4
- mock
- multipledispatch
- numpy=1.10.0
- pandas=0.18.1
- numpy=1.11.*
- pandas
- pathlib2
- plumbum
- psycopg2
- pyarrow>=0.6.0
- pymysql
- pytables
- pytest
- python=2.7
- python-graphviz
- sh
- python-hdfs>=2.0.16
- regex
- requests
- six
- sqlalchemy>=1.0.0
- thrift<=0.9.3
- sqlalchemy>=1.0.0,<1.1.15
- thriftpy<=0.3.9
- thrift<=0.9.3
- toolz
- clickhouse-driver>=0.0.8
- clickhouse-sqlalchemy
- pip:
- hdfs>=2.0.0
- google-cloud-bigquery
- xorg-libxpm
- xorg-libxrender
18 changes: 10 additions & 8 deletions ci/requirements-dev-3.4.yml
@@ -5,22 +5,24 @@ dependencies:
- cmake
- flake8
- graphviz
- impyla>=0.13.7
- impyla>=0.14.0
- multipledispatch
- numpy=1.11.0
- pandas=0.19.0
- pandas
- plumbum
- psycopg2
- pymysql
- pytest
- python=3.4
- python-graphviz
- sh
- regex
- requests
- six
- sqlalchemy>=1.0.0
- thrift<=0.9.3
- thriftpy<=0.3.9
- sqlalchemy>=1.0.0,<1.1.15
- toolz
- pip:
- hdfs>=2.0.0
- clickhouse-driver>=0.0.8
- clickhouse-sqlalchemy
- google-cloud-bigquery
- google-cloud-bigquery<0.28
- hdfs>=2.0.16
- urllib3
25 changes: 15 additions & 10 deletions ci/requirements-dev-3.5.yml
@@ -2,25 +2,30 @@ channels:
- conda-forge
dependencies:
- click
- clickhouse-cityhash
- clickhouse-driver>=0.0.8
- clickhouse-sqlalchemy
- cmake
- flake8
- google-cloud-bigquery<0.28
- graphviz
- impyla>=0.13.7
- impyla>=0.14.0
- lz4
- multipledispatch
- numpy=1.12.0
- pandas
- plumbum
- psycopg2
- pyarrow>=0.6.0
- pymysql
- pytest
- python=3.5
- python-graphviz
- python-hdfs>=2.0.16
- regex
- requests
- six
- sh
- sqlalchemy>=1.0.0
- thrift<=0.9.3
- thriftpy<=0.3.9
- sqlalchemy>=1.0.0,<1.1.15
- toolz
- clickhouse-driver>=0.0.8
- clickhouse-sqlalchemy
- pip:
- hdfs>=2.0.0
- google-cloud-bigquery
- xorg-libxpm
- xorg-libxrender
25 changes: 16 additions & 9 deletions ci/requirements-dev-3.6.yml
@@ -2,25 +2,32 @@ channels:
- conda-forge
dependencies:
- click
- clickhouse-cityhash
- clickhouse-driver>=0.0.8
- clickhouse-sqlalchemy
- cmake
- flake8
- google-cloud-bigquery<0.28
- graphviz
- impyla>=0.13.7
- impyla>=0.14.0
- lz4
- multipledispatch
- numpy
- pandas
- plumbum
- psycopg2
- pyarrow>=0.6.0
- pymysql
- pytables
- pytest
- python=3.6
- python-graphviz
- sh
- python-hdfs>=2.0.16
- regex
- requests
- six
- sqlalchemy>=1.0.0
- sqlalchemy>=1.0.0,<1.1.15
- thrift
- thriftpy<=0.3.9
- toolz
- clickhouse-driver>=0.0.8
- clickhouse-sqlalchemy
- pip:
- hdfs>=2.0.0
- google-cloud-bigquery
- xorg-libxpm
- xorg-libxrender
27 changes: 17 additions & 10 deletions ci/requirements-docs-3.6.yml
@@ -2,29 +2,36 @@ channels:
- conda-forge
dependencies:
- click
- clickhouse-cityhash
- clickhouse-driver>=0.0.8
- clickhouse-sqlalchemy
- cmake
- flake8
- google-cloud-bigquery<0.28
- graphviz
- impyla>=0.13.7
- impyla>=0.14.0
- ipython
- jupyter
- lz4
- matplotlib
- multipledispatch
- nbsphinx
- numpy
- numpydoc
- pandas
- plumbum
- psycopg2
- pyarrow>=0.6.0
- pymysql
- pytables
- pytest
- python=3.6
- python-graphviz
- sh
- python-hdfs>=2.0.16
- regex
- six
- sphinx_rtd_theme
- sqlalchemy>=1.0.0
- thrift
- thriftpy<=0.3.9
- sqlalchemy>=1.0.0,<1.1.15
- toolz
- clickhouse-driver>=0.0.8
- clickhouse-sqlalchemy
- pip:
- hdfs>=2.0.0
- google-cloud-bigquery
- xorg-libxpm
- xorg-libxrender
14 changes: 1 addition & 13 deletions ci/clickhouse_load.sql → ci/schema/clickhouse.sql
@@ -1,5 +1,3 @@
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds (
`date` Date DEFAULT today(),
carat Float64,
@@ -14,8 +12,6 @@ CREATE TABLE diamonds (
z Float64
) ENGINE = MergeTree(date, (`carat`), 8192);

DROP TABLE IF EXISTS batting;

CREATE TABLE batting (
`date` Date DEFAULT today(),
`playerID` String,
@@ -42,8 +38,6 @@ CREATE TABLE batting (
`GIDP` Int64
) ENGINE = MergeTree(date, (`playerID`), 8192);

DROP TABLE IF EXISTS awards_players;

CREATE TABLE awards_players (
`date` Date DEFAULT today(),
`playerID` String,
@@ -54,12 +48,10 @@ CREATE TABLE awards_players (
notes String
) ENGINE = MergeTree(date, (`playerID`), 8192);

DROP TABLE IF EXISTS functional_alltypes;

CREATE TABLE functional_alltypes (
`date` Date DEFAULT toDate(timestamp_col),
`index` Int64,
`Unnamed_0` Int64,
`Unnamed: 0` Int64,
id Int32,
bool_col UInt8,
tinyint_col Int8,
@@ -75,17 +67,13 @@ CREATE TABLE functional_alltypes (
month Int32
) ENGINE = MergeTree(date, (`index`), 8192);

DROP TABLE IF EXISTS tzone;

CREATE TABLE tzone (
`date` Date DEFAULT today(),
ts DateTime,
key String,
value Float64
) ENGINE = MergeTree(date, (key), 8192);

DROP TABLE IF EXISTS array_types;

CREATE TABLE IF NOT EXISTS array_types (
`date` Date DEFAULT today(),
x Array(Int64),
74 changes: 74 additions & 0 deletions ci/schema/mysql.sql
@@ -0,0 +1,74 @@
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds (
carat FLOAT,
cut TEXT,
color TEXT,
clarity TEXT,
depth FLOAT,
`table` FLOAT,
price BIGINT,
x FLOAT,
y FLOAT,
z FLOAT
) DEFAULT CHARACTER SET = utf8;

DROP TABLE IF EXISTS batting;

CREATE TABLE batting (
`playerID` VARCHAR(255),
`yearID` BIGINT,
stint BIGINT,
`teamID` VARCHAR(7),
`lgID` VARCHAR(7),
`G` BIGINT,
`AB` BIGINT,
`R` BIGINT,
`H` BIGINT,
`X2B` BIGINT,
`X3B` BIGINT,
`HR` BIGINT,
`RBI` BIGINT,
`SB` BIGINT,
`CS` BIGINT,
`BB` BIGINT,
`SO` BIGINT,
`IBB` BIGINT,
`HBP` BIGINT,
`SH` BIGINT,
`SF` BIGINT,
`GIDP` BIGINT
) DEFAULT CHARACTER SET = utf8;

DROP TABLE IF EXISTS awards_players;

CREATE TABLE awards_players (
`playerID` VARCHAR(255),
`awardID` VARCHAR(255),
`yearID` BIGINT,
`lgID` VARCHAR(7),
tie VARCHAR(7),
notes VARCHAR(255)
) DEFAULT CHARACTER SET = utf8;

DROP TABLE IF EXISTS functional_alltypes;

CREATE TABLE functional_alltypes (
`index` BIGINT,
`Unnamed: 0` BIGINT,
id INTEGER,
bool_col BOOLEAN,
tinyint_col TINYINT,
smallint_col SMALLINT,
int_col INTEGER,
bigint_col BIGINT,
float_col FLOAT,
double_col DOUBLE,
date_string_col TEXT,
string_col TEXT,
timestamp_col TIMESTAMP,
year INTEGER,
month INTEGER
) DEFAULT CHARACTER SET = utf8;

CREATE INDEX `ix_functional_alltypes_index` ON functional_alltypes (`index`);
File renamed without changes.
2 changes: 1 addition & 1 deletion ci/sqlite_load.sql → ci/schema/sqlite.sql
@@ -8,7 +8,7 @@ CREATE TABLE functional_alltypes (
int_col BIGINT,
bigint_col BIGINT,
float_col FLOAT,
double_col FLOAT,
double_col REAL,
date_string_col TEXT,
string_col TEXT,
timestamp_col TEXT,
5 changes: 5 additions & 0 deletions ci/test.sh
@@ -0,0 +1,5 @@
#!/bin/bash -e

cmd='$(find /ibis -name "*.py[co]" -delete > /dev/null 2>&1 || true) && pytest "$@"'
docker-compose build --pull ibis
docker-compose run --rm ibis bash -c "$cmd" -- "$@"
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
23 changes: 20 additions & 3 deletions conda-recipes/ibis-framework/meta.yaml
@@ -13,33 +13,47 @@ source:
requirements:
build:
- enum34 # [py27]
- funcsigs # [py27]
- functools32 # [py27]
- pathlib2 # [py27]
- numpy >=1.10.0
- pandas >=0.18.1
- python
- regex
- setuptools
- multipledispatch
- six
- toolz
run:
- enum34 # [py27]
- funcsigs # [py27]
- functools32 # [py27]
- pathlib2 # [py27]
- numpy >=1.10.0
- pandas >=0.18.1
- python
- regex
- setuptools
- multipledispatch
- six
- toolz

test:
requires:
# xorg-* required for distros (often docker containers based on slimmed
# distros) that don't have X stuff installed
#
# see: https://github.com/conda-forge/graphviz-feedstock/issues/18
- graphviz # [not (py34 and win)]
- mock # [py27]
- multipledispatch
- pytest >=3
- python-graphviz # [not (py34 and win)]
- pyarrow >=0.6.0 # [not py34]
imports:
- ibis
- ibis.expr
- ibis.expr.tests
- ibis.expr.visualize # [not (py34 and win)]
- ibis.hive
- ibis.hive.tests
- ibis.impala # [linux]
@@ -56,10 +70,13 @@ test:
- ibis.sql.tests
- ibis.sql.vertica
- ibis.sql.vertica.tests
- ibis.file
- ibis.file.tests
- ibis.pandas
- ibis.tests
- ibis.tests.all
commands:
- pytest --version
- pytest --tb=short --pyargs ibis -m "not impala and not hdfs and not bigquery"
- pytest -x --tb=short -m "not backend and not clickhouse and not impala and not hdfs and not bigquery" -rs "$(python -c 'import site; sp_dir, = site.getsitepackages(); print(sp_dir)')"/ibis

about:
license: Apache License, Version 2.0
389 changes: 191 additions & 198 deletions dev/merge-pr.py
100644 → 100755

Large diffs are not rendered by default.

42 changes: 0 additions & 42 deletions docs/build-notebooks.py

This file was deleted.

Binary file added docs/source/_static/favicon.ico
Binary file not shown.
81 changes: 81 additions & 0 deletions docs/source/backends.rst
@@ -0,0 +1,81 @@
.. _backends:

Backends
========

This document describes the classes of backends, how they work, and any details
about each backend that are relevant to end users.

.. _classes_of_backends:

Classes of Backends
-------------------

There are currently three classes of backends that live in ibis.

#. String generating backends
#. Expression generating backends
#. Direct execution backends

.. _string_generating_backends:

String Generating Backends
~~~~~~~~~~~~~~~~~~~~~~~~~~

The first category of backend translates ibis expressions into strings.
Generally speaking these backends also need to handle their own execution.
They work by translating each node into a string, and passing the generated
string to the database through a driver API.
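
As a minimal sketch of the idea (this assumes the ``ibis.impala.compile``
helper; the exact output varies by backend and version), compiling an
expression yields a plain SQL string:

.. code-block:: python

    import ibis

    # an unbound table: no database connection is needed just to compile
    t = ibis.table([('a', 'int64'), ('b', 'string')], name='t')
    expr = t[t.a > 1].b.lower()

    # the Impala backend walks the expression tree and renders a SQL string
    print(ibis.impala.compile(expr))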

Impala
******

TODO

Clickhouse
**********

TODO

BigQuery
********

TODO

.. _expression_generating_backends:

Expression Generating Backends
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The second category of backends translates ibis expressions into other
expressions. Currently, all expression generating backends generate `SQLAlchemy
expressions <http://docs.sqlalchemy.org/en/latest/core/tutorial.html>`_.

Instead of generating strings at each translation step, these backends build up
an expression. These backends tend to execute their expressions directly
through the driver APIs provided by SQLAlchemy (or one of its transitive
dependencies).
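
A hedged sketch of the difference (this assumes the SQLAlchemy-based backends
expose a ``compile`` function analogous to ``ibis.impala.compile``): the result
is a SQLAlchemy construct rather than a plain string.

.. code-block:: python

    import ibis

    t = ibis.table([('a', 'int64'), ('b', 'string')], name='t')
    expr = t[t.a > 1][['a', 'b']]

    # typically a SQLAlchemy selectable; str() renders it through
    # SQLAlchemy's own compiler
    sa_expr = ibis.sqlite.compile(expr)
    print(type(sa_expr))
    print(sa_expr)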

SQLite
******

TODO

PostgreSQL
**********

TODO

.. _direct_execution_backends:

Direct Execution Backends
~~~~~~~~~~~~~~~~~~~~~~~~~

The only existing backend that directly executes ibis expressions is the pandas
backend. A full description of the implementation can be found in the module
docstring of the pandas backend located in ``ibis/pandas/execution/core.py``.
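
A minimal sketch (assuming the pandas backend's dictionary-of-DataFrames
``connect`` API): expressions are executed directly against the DataFrame, and
no SQL is generated at any point.

.. code-block:: python

    import pandas as pd

    import ibis

    df = pd.DataFrame({'a': [1, 2, 3], 'b': list('abc')})

    # the pandas backend is handed a mapping of table names to DataFrames
    con = ibis.pandas.connect({'t': df})
    t = con.table('t')

    # executed in-process against df, returning a pandas object
    result = t[t.a > 1].execute()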

Pandas
******

TODO
115 changes: 58 additions & 57 deletions docs/source/conf.py
@@ -13,6 +13,7 @@
# serve to show the default.

import glob
import datetime

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
@@ -21,7 +22,7 @@
# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
@@ -32,6 +33,7 @@
'sphinx.ext.extlinks',
'sphinx.ext.mathjax',
'numpydoc',
'nbsphinx',

'IPython.sphinxext.ipython_directive',
'IPython.sphinxext.ipython_console_highlighting',
@@ -50,14 +52,14 @@
source_suffix = '.rst'

# The encoding of source files.
#source_encoding = 'utf-8-sig'
# source_encoding = 'utf-8-sig'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = u'Ibis'
copyright = u'2015, Cloudera, Inc.'
project = 'Ibis'
copyright = '{}, Ibis Developers'.format(datetime.date.today().year)

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
@@ -66,82 +68,83 @@
# The short X.Y version.
# version = '0.2'

from ibis import __version__ as version
from ibis import __version__ as version # noqa: E402

# The full version, including alpha/beta/rc tags.
release = version

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#language = None
# language = None

# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'
# today_fmt = '%B %d, %Y'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []
exclude_patterns = ['_build', '**.ipynb_checkpoints']

# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None
# default_role = None

# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True
# add_function_parentheses = True

# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True
# add_module_names = True

# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False
# show_authors = False

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []
# modindex_common_prefix = []

# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False
# keep_warnings = False


# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.

import sphinx_rtd_theme
import sphinx_rtd_theme # noqa: E402

html_theme = "sphinx_rtd_theme"
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}
# html_theme_options = {}

# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []
# html_theme_path = []

# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None
# html_title = None

# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None
# html_short_title = None

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
#html_logo = None
# html_logo = None

# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None
html_favicon = '_static/favicon.ico'

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
@@ -151,93 +154,91 @@
# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []
# html_extra_path = []

# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'
# html_last_updated_fmt = '%b %d, %Y'

# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True
# html_use_smartypants = True

# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}
# html_sidebars = {}

# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}
# html_additional_pages = {}

# If false, no module index is generated.
#html_domain_indices = True
# html_domain_indices = True

# If false, no index is generated.
#html_use_index = True
# html_use_index = True

# If true, the index is split into individual pages for each letter.
#html_split_index = False
# html_split_index = False

# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True
# html_show_sourcelink = True

# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True
# html_show_sphinx = True

# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True
# html_show_copyright = True

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''
# html_use_opensearch = ''

# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None
# html_file_suffix = None

# Output file base name for HTML help builder.
htmlhelp_basename = 'Ibisdoc'


# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
latex_elements = {}
# The paper size ('letterpaper' or 'a4paper').
#'papersize': 'letterpaper',
# 'papersize': 'letterpaper',

# The font size ('10pt', '11pt' or '12pt').
#'pointsize': '10pt',
# 'pointsize': '10pt',

# Additional stuff for the LaTeX preamble.
#'preamble': '',
}
# 'preamble': '',

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
('index', 'Ibis.tex', u'Ibis Documentation',
u'Cloudera, Inc.', 'manual'),
('index', 'Ibis.tex', 'Ibis Documentation', 'Ibis Developers', 'manual'),
]

# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None
# latex_logo = None

# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False
# latex_use_parts = False

# If true, show page references after internal links.
#latex_show_pagerefs = False
# latex_show_pagerefs = False

# If true, show URL addresses after external links.
#latex_show_urls = False
# latex_show_urls = False

# Documents to append as an appendix to all manuals.
#latex_appendices = []
# latex_appendices = []

# If false, no module index is generated.
#latex_domain_indices = True
# latex_domain_indices = True


# extlinks alias
@@ -249,12 +250,12 @@
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'ibis', u'Ibis Documentation',
[u'Cloudera, Inc.'], 1)
('index', 'ibis', 'Ibis Documentation',
['Ibis Developers'], 1)
]

# If true, show URL addresses after external links.
#man_show_urls = False
# man_show_urls = False


# -- Options for Texinfo output -------------------------------------------
@@ -263,19 +264,19 @@
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
('index', 'Ibis', u'Ibis Documentation',
u'Cloudera, Inc.', 'Ibis', 'One line description of project.',
('index', 'Ibis', 'Ibis Documentation',
'Ibis Developers', 'Ibis', 'Pandas-like expressions for analytics',
'Miscellaneous'),
]

# Documents to append as an appendix to all manuals.
#texinfo_appendices = []
# texinfo_appendices = []

# If false, no module index is generated.
#texinfo_domain_indices = True
# texinfo_domain_indices = True

# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'
# texinfo_show_urls = 'footnote'

# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False
# texinfo_no_detailmenu = False
212 changes: 212 additions & 0 deletions docs/source/design.rst
@@ -0,0 +1,212 @@
.. _design:

Design
======


.. _primary_goals:

Primary Goals
-------------

#. Type safety
#. Expressiveness
#. Composability
#. Familiarity

.. _flow_of_execution:

Flow of Execution
-----------------

#. User writes expression
#. Each method or function call builds a new expression
#. Expressions are type checked as you create them
#. Expressions have some optimizations that happen as the user builds them
#. Backend specific rewrites
#. Expressions are compiled
#. The SQL string generated by the compiler is sent to the database and
   executed (this step is skipped for the pandas backend)
#. The database returns some data that is then turned into a pandas DataFrame
by ibis
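
A minimal sketch of that flow end to end (the compile call assumes
``ibis.impala.compile``; the host name and the final execution step are
placeholders, since they require a live connection):

.. code-block:: python

    import ibis

    # steps 1-4: each call builds and type checks a new expression
    t = ibis.table([('a', 'double'), ('b', 'string')], name='t')
    expr = t[t.a > 0].group_by('b').size()

    # steps 5-6: backend-specific rewrites and compilation to SQL
    print(ibis.impala.compile(expr))

    # steps 7-8: a client sends the SQL and returns a pandas DataFrame, e.g.
    # con = ibis.impala.connect(host='impala-host'); df = con.execute(expr)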

.. _expressions:

Expressions
-----------

The main user-facing component of ibis is expressions. The base class of all
expressions in ibis is the :class:`~ibis.expr.types.Expr` class.

Expressions provide the user-facing API, defined in ``ibis/expr/api.py``.

.. _type_system:

Type System
~~~~~~~~~~~

Ibis's type system consists of a set of rules for specifying the types of
inputs to :class:`~ibis.expr.types.Node` subclasses. Upon construction of a
:class:`~ibis.expr.types.Node` subclass, ibis performs validation of every
input to the node based on the rule that was used to declare the input.

Rules are defined in ``ibis/expr/rules.py``.

.. _expr_class:

The :class:`~ibis.expr.types.Expr` class
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expressions are a thin but important abstraction over operations, containing
only type information and shape information, i.e., whether they are tables,
columns, or scalars.

Examples of expressions include :class:`~ibis.expr.types.Int64Column`,
:class:`~ibis.expr.types.StringScalar`, and
:class:`~ibis.expr.types.TableExpr`.

Here's an example of each type of expression:

.. code-block:: ipython

    import ibis

    t = ibis.table([('a', 'int64')])
    int64_column = t.a
    type(int64_column)

    string_scalar = ibis.literal('some_string_value')
    type(string_scalar)

    table_expr = t.mutate(b=t.a + 1)
    type(table_expr)

.. _node_class:

The :class:`~ibis.expr.types.Node` Class
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:class:`~ibis.expr.types.Node` subclasses make up the core set of operations of
ibis. Each node corresponds to a particular operation.

Most nodes are defined in the :mod:`~ibis.expr.operations` module.

Examples of nodes include :class:`~ibis.expr.operations.Add` and
:class:`~ibis.expr.operations.Sum`.

Nodes have two important members (and often these are the only members defined):

#. ``input_type``: a list of rules
#. ``output_type``: a rule or method

The ``input_type`` member is a list of rules that defines the types of
the inputs to the operation. This is sometimes called the signature.

The ``output_type`` member is a rule or a method that defines the output type
of the operation. This is sometimes called the return type.

An example of ``input_type``/``output_type`` usage is the
:class:`~ibis.expr.operations.Log` class:

.. code-block:: ipython

    class Log(Node):
        input_type = [
            rules.double(),
            rules.double(name='base', optional=True)
        ]
        output_type = rules.shape_like_arg(0, 'double')

This class describes an operation called ``Log`` that takes one required
argument: a double scalar or column, and one optional argument: a double scalar
or column named ``base`` that defaults to nothing if not provided. The base
argument is ``None`` by default so that the expression will behave as the
underlying database does.

These objects are instantiated when you use ibis APIs:

.. code-block:: ipython

    import ibis

    t = ibis.table([('a', 'double')])
    log_1p = (1 + t.a).log()  # an Add and a Log are instantiated here

.. _expr_vs_ops:

Expressions vs Operations: Why are they different?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Separating expressions from their underlying operations makes it easy to
generically describe and validate the inputs to particular nodes. In the log
example, it doesn't matter what *operation* (node) the double-valued arguments
are coming from, they must only satisfy the requirement denoted by the rule.

Separation of the :class:`~ibis.expr.types.Node` and
:class:`~ibis.expr.types.Expr` classes also allows the API to be tied to the
physical type of the expression rather than the particular operation, making it
easy to define the API in terms of types rather than specific operations.

Furthermore, operations often have an output type that depends on the input
type. An example of this is the ``greatest`` function, which takes the maximum
of all of its arguments. Another example is ``CASE`` statements, whose ``THEN``
expressions determine the output type of the expression.

This allows ibis to provide **only** the APIs that make sense for a particular
type, even when an operation yields a different output type depending on its
input. Concretely, this means that you cannot perform operations that don't
make sense, like computing the average of a string column.
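
For example (a sketch; the exact error raised may vary across versions), the
string column below simply has no ``mean`` method, so the mistake is caught at
expression-construction time rather than at execution time:

.. code-block:: python

    import ibis

    t = ibis.table([('s', 'string'), ('x', 'double')], name='t')

    t.x.mean()  # fine: numeric columns expose mean()
    t.s.mean()  # AttributeError: string columns do not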

.. _compilation:

Compilation
-----------

The next major component of ibis is the compilers.

The first few versions of ibis directly generated strings, but the compiler
infrastructure was generalized to support compilation of `SQLAlchemy
<https://docs.sqlalchemy.org/en/latest/core/tutorial.html>`_ based expressions.

The compiler works by translating the different pieces of a SQL expression into
a string or SQLAlchemy expression.

The main pieces of a ``SELECT`` statement are:

#. The set of column expressions (``select_set``)
#. ``WHERE`` clauses (``where``)
#. ``GROUP BY`` clauses (``group_by``)
#. ``HAVING`` clauses (``having``)
#. ``LIMIT`` clauses (``limit``)
#. ``ORDER BY`` clauses (``order_by``)
#. ``DISTINCT`` clauses (``distinct``)

Each of these pieces is translated into a SQL string and finally assembled by
the instance of the :class:`~ibis.sql.compiler.ExprTranslator` subclass
specific to the backend being compiled. For example, the
:class:`~ibis.impala.compiler.ImpalaExprTranslator` is one of the subclasses
that will perform this translation.
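
As a hedged illustration (again assuming ``ibis.impala.compile``), a single
expression can exercise several of these pieces at once, and each clause lands
in the corresponding part of the generated ``SELECT``:

.. code-block:: python

    import ibis

    t = ibis.table([('key', 'string'), ('value', 'double')], name='t')

    filtered = t[t.value > 0]  # WHERE
    expr = (filtered
            .group_by('key')   # GROUP BY
            .aggregate(filtered.value.sum().name('total'))
            .sort_by(('total', False))  # ORDER BY ... DESC
            .limit(10))                 # LIMIT

    print(ibis.impala.compile(expr))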

.. note::

    While ibis was designed with an explicit goal of first-class SQL support,
    ibis can target other systems such as pandas.

.. _execution:

Execution
---------

We presumably want to *do* something with our compiled expressions. This is
where execution comes in.

This is the least complex part of ibis, mostly requiring only that ibis
correctly handle whatever the database hands back.

By and large, the execution of compiled SQL is handled by the database to which
SQL is sent from ibis.

However, once the data arrives from the database we need to convert that
data to a pandas DataFrame.

The Query class, with its :meth:`~ibis.sql.client.Query._fetch` method,
provides a way for ibis :class:`~ibis.sql.client.SQLClient` objects to do any
additional processing necessary after the database returns results to the
client.
56 changes: 38 additions & 18 deletions docs/source/developer.rst
@@ -42,17 +42,49 @@ Conda Environment Setup
# Install ibis
python setup.py develop
All-in-One Command
------------------

The following command does three steps:

#. Downloads the test data
#. Starts each backend via docker-compose
#. Initializes the backends with the test tables

.. code:: sh

    cd testing
    bash start-all.sh

To use specific backends, follow the instructions below.


Download Test Dataset
---------------------

#. `Install docker <https://docs.docker.com/engine/installation/>`_
#. **Download the test data**:

By default this will download and extract the dataset under
testing/ibis-testing-data.

.. code:: sh
DATA_DIR=$PWD
ci/datamgr.py download --directory=$DATA_DIR
testing/datamgr.py download
Setting Up Test Databases
-------------------------

To start the backends:

.. code:: sh

    cd testing
    docker-compose up

Impala (with UDFs)
^^^^^^^^^^^^^^^^^^

@@ -67,7 +99,7 @@ Impala (with UDFs)

.. code:: sh
test_data_admin.py load --data --data-dir=$DATA_DIR
testing/impalamgr.py load --data --data-dir ibis-testing-data
Clickhouse
^^^^^^^^^^
@@ -83,11 +115,7 @@ Clickhouse

.. code:: sh
ci/datamgr.py clickhouse \
--database $IBIS_TEST_CLICKHOUSE_DB \
--data-directory $DATA_DIR/ibis-testing-data \
--script ci/clickhouse_load.sql \
functional_alltypes batting diamonds awards_players
testing/datamgr.py clickhouse
PostgreSQL
^^^^^^^^^^
@@ -99,11 +127,7 @@ Here's how to load test data into PostgreSQL:

.. code:: sh
ci/datamgr.py postgres \
--database $IBIS_TEST_POSTGRES_DB \
--data-directory $DATA_DIR/ibis-testing-data \
--script ci/postgresql_load.sql \
functional_alltypes batting diamonds awards_players
testing/datamgr.py postgres
SQLite
^^^^^^
@@ -113,11 +137,7 @@ instructions above, then SQLite will be available in the conda environment.

.. code:: sh
ci/datamgr.py sqlite \
--database $IBIS_TEST_SQLITE_DB_PATH \
--data-directory $DATA_DIR/ibis-testing-data \
--script ci/sqlite_load.sql \
functional_alltypes batting diamonds awards_players
testing/datamgr.py sqlite
Running Tests
40 changes: 40 additions & 0 deletions docs/source/extending.rst
@@ -0,0 +1,40 @@
.. _extending:


Extending Ibis
==============

Users typically want to extend ibis in one of two ways:

#. Add a new expression
#. Add a new backend


Below we provide notebooks showing how to extend ibis in each of these ways.


Adding a New Expression
-----------------------

.. note::

    Make sure you've run the following commands before executing the notebook:

.. code-block:: sh

    docker-compose up -d --no-build postgres dns
    docker-compose run waiter
    docker-compose run ibis ci/load-data.sh postgres

Here we show how to add a ``sha1`` method to the PostgreSQL backend:

.. toctree::
    :maxdepth: 1

    notebooks/tutorial/9-Adding-a-new-expression.ipynb


Adding a New Backend
--------------------

TBD
75 changes: 31 additions & 44 deletions docs/source/getting-started.rst
@@ -152,56 +152,43 @@ with:
Learning resources
------------------

We are collecting IPython notebooks for learning here:
http://github.com/cloudera/ibis-notebooks. Some of these notebooks will be
reproduced as part of the documentation.
We are collecting Jupyter notebooks for learning here:
https://github.com/ibis-project/ibis/tree/master/docs/source/notebooks. Some of
these notebooks will be reproduced as part of the documentation.

.. _install.quickstart:

Using Ibis with the Cloudera Quickstart VM
------------------------------------------

Using Ibis with Impala requires a running Impala cluster, so we have provided a
lean VirtualBox image to simplify the process for those looking to try out Ibis
(without setting up a cluster) or start contributing code to the project.
Running Ibis Queries using Docker
---------------------------------

What follows are streamlined setup instructions for the VM. If you wish to
download it directly and setup from the ``ova`` file, use this `download link
<http://archive.cloudera.com/cloudera-ibis/ibis-demo.ova>`_.
Contributor `Krisztián Szűcs <https://github.com/kszucs>`_ has spent many hours
crafting a very easy-to-use ``docker-compose`` setup that enables users and
developers of ibis to get up and running quickly.

The VM was built with Oracle VirtualBox 4.3.28.
Here are the steps:

TL;DR
~~~~~

::
.. code-block:: sh
# clone ibis
git clone https://github.com/ibis-project/ibis
# go to where the docker-compose file is
pushd ibis/ci
# build the latest version of ibis
docker-compose build --pull ibis
# spin up containers
docker-compose up -d --no-build postgres impala clickhouse
# wait for things to finish starting
docker-compose run waiter
# load data into databases
docker-compose run ibis ci/load-data.sh
curl -s https://raw.githubusercontent.com/cloudera/ibis-notebooks/master/setup/bootstrap.sh | bash

Single Steps
~~~~~~~~~~~~

To use Ibis with the special Cloudera Quickstart VM follow the below
instructions:

* Make sure Anaconda is installed. You can get it from
http://continuum.io/downloads. Now prepend the Anaconda Python
to your path like this ``export PATH=$ANACONDA_HOME/bin:$PATH``
* ``pip install ibis-framework``
* ``git clone https://github.com/cloudera/ibis-notebooks.git``
* ``cd ibis-notebooks``
* ``./setup/setup-ibis-demo-vm.sh``
* ``source setup/ibis-env.sh``
* ``ipython notebook``

VM setup
~~~~~~~~

The setup script will download a VirtualBox appliance image and import it in
VirtualBox. In addition, it will create a new host only network adapter with
DHCP. After the VM is started, it will extract the current IP address and add a
new /etc/hosts entry pointing from the IP of the VM to the hostname
``quickstart.cloudera``. The reason for this entry is that Hadoop and HDFS
require a working reverse name mapping. If you don't want to run the automated
steps make sure to check the individual steps in the file
``setup/setup-ibis-demo-vm.sh``.
# confirm that you can reach impala
impala_ip_address="$(docker inspect -f '{{.NetworkSettings.Networks.ci_default.IPAddress}}' ci_impala_1)"
ping -c 1 "${impala_ip_address}"
14 changes: 3 additions & 11 deletions docs/source/impala.rst
@@ -21,8 +21,9 @@ can use pandas with Ibis and Impala.
:suppress:
import ibis
hdfs = ibis.hdfs_connect(port=5070)
client = ibis.impala.connect(hdfs_client=hdfs)
host = 'quickstart.cloudera'
hdfs = ibis.hdfs_connect(host=host)
client = ibis.impala.connect(host=host, hdfs_client=hdfs)
The Impala client object
------------------------
@@ -258,15 +259,6 @@ getting information about the partition schema and any existing partition data:
ImpalaTable.partition_schema
ImpalaTable.partitions

For example:

.. ipython:: python
ss = client.table('tpcds_parquet.store_sales')
ss.is_partitioned
ss.partitions()[:5]
ss.partition_schema()
To address a specific partition in any method that is partition specific, you
can either use a dict with the partition key names and values, or pass a list
of the partition values:
13 changes: 8 additions & 5 deletions docs/source/index.rst
@@ -29,7 +29,7 @@ natively within other systems like Apache Spark and Apache Impala (incubating).
To learn more about Ibis's vision, roadmap, and updates, please follow
http://ibis-project.org.

Source code is on GitHub: http://github.com/pandas-dev/ibis
Source code is on GitHub: https://github.com/ibis-project/ibis

Install Ibis from PyPI with:

@@ -47,10 +47,10 @@ At this time, Ibis offers some level of support for the following systems:

- `Apache Impala (incubating) <http://impala.io/>`_
- `Apache Kudu (incubating) <http://getkudu.io>`_
- Hadoop Distributed File System (HDFS)
- PostgreSQL
- SQLite
- Google BigQuery (experimental)
- Yandex Clickhouse
- Direct execution of ibis expressions against pandas objects (Experimental)

Coming from SQL? Check out :ref:`Ibis for SQL Programmers <sql>`.
@@ -71,7 +71,6 @@ SQL engine support needing code contributors:
- Spark SQL
- Presto
- Hive
- MySQL / MariaDB

Since this is a young project, the documentation is definitely patchy in
places, but this will improve as things progress.
@@ -81,15 +80,19 @@ places, but this will improve as things progress.

getting-started
configuration
tutorial
impala
tutorial
api
sql
udf
developer
type-system
design
extending
backends
release
legal


Indices and tables
==================

97 changes: 97 additions & 0 deletions docs/source/notebooks/tutorial/1-Intro-and-Setup.ipynb
@@ -0,0 +1,97 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Impala/HDFS intro and Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting started"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You're going to want to make sure you can import `ibis`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ibis\n",
"import os"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you have WebHDFS available, connect to HDFS with according to your WebHDFS config. For kerberized or more complex HDFS clusters please look at http://hdfscli.readthedocs.org/en/latest/ for info on connecting. You can use a connection from that library instead of using `hdfs_connect`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hdfs_port = os.environ.get('IBIS_WEBHDFS_PORT', 50070)\n",
"hdfs = ibis.hdfs_connect(host='quickstart.cloudera', port=hdfs_port)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, create the Ibis client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"con = ibis.impala.connect('quickstart.cloudera', hdfs_client=hdfs)\n",
"con"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Obviously, substitute the parameters that are appropriate for your environment (see docstring for `ibis.impala.connect`). `impala.connect` uses the same parameters as Impyla's (https://pypi.python.org/pypi/impyla) DBAPI interface"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
617 changes: 617 additions & 0 deletions docs/source/notebooks/tutorial/2-Basics-Aggregate-Filter-Limit.ipynb

Large diffs are not rendered by default.

514 changes: 514 additions & 0 deletions docs/source/notebooks/tutorial/3-Projection-Join-Sort.ipynb

Large diffs are not rendered by default.

526 changes: 526 additions & 0 deletions docs/source/notebooks/tutorial/4-More-Value-Expressions.ipynb

Large diffs are not rendered by default.

661 changes: 661 additions & 0 deletions docs/source/notebooks/tutorial/5-IO-Create-Insert-External-Data.ipynb

Large diffs are not rendered by default.

331 changes: 331 additions & 0 deletions docs/source/notebooks/tutorial/6-Advanced-Topics-TopK-SelfJoins.ipynb
@@ -0,0 +1,331 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Advanced Topics: Top-K and Self Joins"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ibis\n",
"import os\n",
"hdfs_port = os.environ.get('IBIS_WEBHDFS_PORT', 50070)\n",
"hdfs = ibis.hdfs_connect(host='quickstart.cloudera', port=hdfs_port)\n",
"con = ibis.impala.connect(host='quickstart.cloudera', database='ibis_testing',\n",
" hdfs_client=hdfs)\n",
"ibis.options.interactive = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## \"Top-K\" Filtering\n",
"\n",
"\n",
"A common analytical pattern involves subsetting based on some method of ranking. For example, \"the 5 most frequently occurring widgets in a dataset\". By choosing the right metric, you can obtain the most important or least important items from some dimension, for some definition of important.\n",
"\n",
"To carry out the pattern by hand involves the following\n",
"\n",
"- Choose a ranking metric\n",
"- Aggregate, computing the ranking metric, by the target dimension\n",
"- Order by the ranking metric and take the highest K values\n",
"- Use those values as a set filter (either with `semi_join` or `isin`) in your next query\n",
"\n",
"For example, let's look at the TPC-H tables and find the 5 or 10 customers who placed the most orders over their lifetime:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"orders = con.table('tpch_orders')\n",
"\n",
"top_orders = (orders\n",
" .group_by('o_custkey')\n",
" .size()\n",
" .sort_by(('count', False))\n",
" .limit(5))\n",
"top_orders"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we could use these customer keys as a filter in some other analysis:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Among the top 5 most frequent customers, what's the histogram of their order statuses?\n",
"analysis = (orders[orders.o_custkey.isin(top_orders.o_custkey)]\n",
" .group_by('o_orderstatus')\n",
" .size())\n",
"analysis"
]
},
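{
"cell_type": "markdown",
"metadata": {},
"source": [
"The step list above also mentions `semi_join`. As a sketch, here is the same status histogram expressed as a semi-join rather than `isin`; the results should match the cell above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Keep only orders whose o_custkey appears in top_orders, then histogram status.\n",
"analysis_sj = (orders.semi_join(top_orders, orders.o_custkey == top_orders.o_custkey)\n",
"               [orders]\n",
"               .group_by('o_orderstatus')\n",
"               .size())\n",
"analysis_sj"
]
},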
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is such a common pattern that Ibis supports a high level primitive `topk` operation, which can be used immediately as a filter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"top_orders = orders.o_custkey.topk(5)\n",
"orders[top_orders].group_by('o_orderstatus').size()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This goes a little further. Suppose now we want to rank customers by their total spending instead of the number of orders, perhaps a more meaningful metric:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"total_spend = orders.o_totalprice.sum().name('total')\n",
"top_spenders = (orders\n",
" .group_by('o_custkey')\n",
" .aggregate(total_spend)\n",
" .sort_by(('total', False))\n",
" .limit(5))\n",
"top_spenders"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use another metric, just pass it to the `by` argument in `topk`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"top_spenders = orders.o_custkey.topk(5, by=total_spend)\n",
"orders[top_spenders].group_by('o_orderstatus').size()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Self joins\n",
"\n",
"\n",
"If you're a relational data guru, you may have wondered how it's possible to join tables with themselves, because joins clauses involve column references back to the original table.\n",
"\n",
"Consider the SQL\n",
"\n",
"```sql\n",
" SELECT t1.key, sum(t1.value - t2.value) AS metric\n",
" FROM my_table t1\n",
" JOIN my_table t2\n",
" ON t1.key = t2.subkey\n",
" GROUP BY 1\n",
"```\n",
" \n",
"Here, we have an unambiguous way to refer to each of the tables through aliasing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's consider the TPC-H database, and support we want to compute year-over-year change in total order amounts by region using joins."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"region = con.table('tpch_region')\n",
"nation = con.table('tpch_nation')\n",
"customer = con.table('tpch_customer')\n",
"orders = con.table('tpch_orders')\n",
"\n",
"orders.limit(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's join all the things and select the fields we care about:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fields_of_interest = [region.r_name.name('region'), \n",
" nation.n_name.name('nation'),\n",
" orders.o_totalprice.name('amount'),\n",
" orders.o_orderdate.cast('timestamp').name('odate') # these are strings\n",
" ]\n",
"\n",
"joined_all = (region.join(nation, region.r_regionkey == nation.n_regionkey)\n",
" .join(customer, customer.c_nationkey == nation.n_nationkey)\n",
" .join(orders, orders.o_custkey == customer.c_custkey)\n",
" [fields_of_interest])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay, great, let's have a look:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"joined_all.limit(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sweet, now let's aggregate by year and region:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"year = joined_all.odate.year().name('year')\n",
"\n",
"total = joined_all.amount.sum().cast('double').name('total')\n",
"\n",
"annual_amounts = (joined_all\n",
" .group_by(['region', year])\n",
" .aggregate(total))\n",
"annual_amounts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking good so far. Now, we need to join this table on itself, by subtracting 1 from one of the year columns.\n",
"\n",
"We do this by creating a \"joinable\" view of a table that is considered a distinct object within Ibis. To do this, use the `view` function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"current = annual_amounts\n",
"prior = annual_amounts.view()\n",
"\n",
"yoy_change = (current.total - prior.total).name('yoy_change')\n",
"\n",
"results = (current.join(prior, ((current.region == prior.region) & \n",
" (current.year == (prior.year - 1))))\n",
" [current.region, current.year, yoy_change])\n",
"df = results.execute()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['yoy_pretty'] = df.yoy_change.map(lambda x: '$%.2fmm' % (x / 1000000.))\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're being fastidious and want to consider the first year occurring in the dataset for each region to have 0 for the prior year, you will instead need to do an outer join and treat nulls in the prior side of the join as zero:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"yoy_change = (current.total - prior.total.zeroifnull()).name('yoy_change')\n",
"results = (current.outer_join(prior, ((current.region == prior.region) & \n",
" (current.year == (prior.year - 1))))\n",
" [current.region, current.year, current.total,\n",
" prior.total.zeroifnull().name('prior_total'), \n",
" yoy_change])\n",
"\n",
"results.limit(10)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,359 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Advanced Topics: Additional Filtering\n",
"\n",
"The filtering examples we've shown to this point have been pretty simple, either comparisons between columns or fixed values, or set filter functions like `isin` and `notin`. \n",
"\n",
"Ibis supports a number of richer analytical filters that can involve one or more of:\n",
"\n",
"- Aggregates computed from the same or other tables\n",
"- Conditional aggregates (in SQL-speak these are similar to \"correlated subqueries\")\n",
"- \"Existence\" set filters (equivalent to the SQL `EXISTS` and `NOT EXISTS` keywords)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ibis\n",
"import os\n",
"hdfs_port = os.environ.get('IBIS_WEBHDFS_PORT', 50070)\n",
"hdfs = ibis.hdfs_connect(host='quickstart.cloudera', port=hdfs_port)\n",
"con = ibis.impala.connect(host='quickstart.cloudera', database='ibis_testing',\n",
" hdfs_client=hdfs)\n",
"ibis.options.interactive = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using scalar aggregates in filters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"table = con.table('functional_alltypes')\n",
"table.limit(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could always compute some aggregate value from the table and use that in another expression, or we can use a data-derived aggregate in the filter. Take the average of a column for example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"table.double_col.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use this expression as a substitute for a scalar value in a filter, and the execution engine will combine everything into a single query rather than having to access Impala multiple times:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cond = table.bigint_col > table.double_col.mean()\n",
"expr = table[cond & table.bool_col].limit(5)\n",
"expr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conditional aggregates\n",
"\n",
"\n",
"Suppose that we wish to filter using an aggregate computed conditional on some other expressions holding true. Using the TPC-H datasets, suppose that we want to filter customers based on the following criteria: Orders such that their amount exceeds the average amount for their sales region over the whole dataset. This can be computed any numbers of ways (such as joining auxiliary tables and filtering post-join)\n",
"\n",
"Again, from prior examples, here are the joined up tables with all the customer data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"region = con.table('tpch_region')\n",
"nation = con.table('tpch_nation')\n",
"customer = con.table('tpch_customer')\n",
"orders = con.table('tpch_orders')\n",
"\n",
"fields_of_interest = [customer,\n",
" region.r_name.name('region'), \n",
" orders.o_totalprice,\n",
" orders.o_orderdate.cast('timestamp').name('odate')]\n",
"\n",
"tpch = (region.join(nation, region.r_regionkey == nation.n_regionkey)\n",
" .join(customer, customer.c_nationkey == nation.n_nationkey)\n",
" .join(orders, orders.o_custkey == customer.c_custkey)\n",
" [fields_of_interest])\n",
"\n",
"tpch.limit(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this particular case, filtering based on the conditional average `o_totalprice` by region requires creating a table view (similar to the self-join examples from earlier) that can be treated as a distinct table entity in the expression. This would **not** be required if we were computing a conditional statistic from some other table. So this is a little more complicated than some other cases would be:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t2 = tpch.view()\n",
"conditional_avg = t2[(t2.region == tpch.region)].o_totalprice.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you've done this, you can use the conditional average in a filter expression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"amount_filter = tpch.o_totalprice > conditional_avg\n",
"tpch[amount_filter].limit(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By looking at the table sizes before and after applying the filter you can see the relative size of the subset taken. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tpch.count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tpch[amount_filter].count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or even group by year and compare before and after:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tpch.schema()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"year = tpch.odate.year().name('year')\n",
"\n",
"pre_sizes = tpch.group_by(year).size()\n",
"post_sizes = tpch[amount_filter].group_by(year).size().view()\n",
"\n",
"percent = ((post_sizes['count'] / pre_sizes['count'].cast('double'))\n",
" .name('fraction'))\n",
"\n",
"expr = (pre_sizes.join(post_sizes, pre_sizes.year == post_sizes.year)\n",
" [pre_sizes.year, \n",
" pre_sizes['count'].name('pre_count'),\n",
" post_sizes['count'].name('post_count'),\n",
" percent])\n",
"expr"
]
},
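{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted above, no `view()` is needed when the conditional aggregate comes from a *different* table. A minimal sketch (illustrative only): keep rows of `tpch` whose order total exceeds the average total of 1-URGENT orders, an aggregate computed from the separate `orders` table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Scalar aggregate from another table -- no view() required.\n",
"urgent_avg = orders[orders.o_orderpriority == '1-URGENT'].o_totalprice.mean()\n",
"tpch[tpch.o_totalprice > urgent_avg].limit(5)"
]
},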
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## \"Existence\" filters\n",
"\n",
"\n",
"Some filtering involves checking for the existence of a particular value in a column of another table, or amount the results of some value expression. This is common in many-to-many relationships, and can be performed in numerous different ways, but it's nice to be able to express it with a single concise statement and let Ibis compute it optimally.\n",
"\n",
"Here's some examples:\n",
"\n",
"- Filter down to customers having at least one open order\n",
"- Find customers having no open orders with 1-URGENT status\n",
"- Find stores (in the stores table) having the same name as a vendor (in the vendors table).\n",
"\n",
"We'll go ahead and solve the first couple of these problems using the TPC-H tables to illustrate the API:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"customer = con.table('tpch_customer')\n",
"orders = con.table('tpch_orders')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"orders.limit(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We introduce the `any` reduction:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"has_open_orders = ((orders.o_orderstatus == 'O') & \n",
" (customer.c_custkey == orders.o_custkey)).any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is now a valid filter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"customer[has_open_orders].limit(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the second example, in which we want to find customers not having any open urgent orders, we write down the condition that they _do_ have some first:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"has_open_urgent_orders = ((orders.o_orderstatus == 'O') & \n",
" (orders.o_orderpriority == '1-URGENT') & \n",
" (customer.c_custkey == orders.o_custkey)).any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can negate this condition and use it as a filter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"customer[-has_open_urgent_orders].count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, it is true that `customer.c_custkey` has no duplicate values, but that need not be the case. There could be multiple copies of any given value in either table column being compared, and the behavior will be the same (existence or non-existence is verified)."
]
}
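,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, a hedged sketch of the third example from the list above (stores sharing a name with a vendor). The demo database has no stores or vendors tables, so we build *unbound* tables with assumed schemas and simply compile the expression to SQL rather than executing it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical schemas -- assumptions for illustration only.\n",
"stores = ibis.table([('store_id', 'int64'), ('name', 'string')], 'stores')\n",
"vendors = ibis.table([('vendor_id', 'int64'), ('name', 'string')], 'vendors')\n",
"\n",
"has_vendor_name = (stores['name'] == vendors['name']).any()\n",
"expr = stores[has_vendor_name]\n",
"\n",
"print(ibis.impala.compile(expr))"
]
}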
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
292 changes: 292 additions & 0 deletions docs/source/notebooks/tutorial/8-More-Analytics-Helpers.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,292 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Additional Analytics Tools"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ibis\n",
"import os\n",
"hdfs_port = os.environ.get('IBIS_WEBHDFS_PORT', 50070)\n",
"hdfs = ibis.hdfs_connect(host='quickstart.cloudera', port=hdfs_port)\n",
"con = ibis.impala.connect(host='quickstart.cloudera', database='ibis_testing',\n",
" hdfs_client=hdfs)\n",
"ibis.options.interactive = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Frequency tables\n",
"\n",
"Ibis provides the `value_counts` API, just like pandas, for computing a frequency table for a table column or array expression. You might have seen it used already earlier in the tutorial. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lineitem = con.table('tpch_lineitem')\n",
"orders = con.table('tpch_orders')\n",
"\n",
"items = (orders.join(lineitem, orders.o_orderkey == lineitem.l_orderkey)\n",
" [lineitem, orders])\n",
"\n",
"items.o_orderpriority.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This can be customized, of course:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"freq = (items.group_by(items.o_orderpriority)\n",
" .aggregate([items.count().name('nrows'),\n",
" items.l_extendedprice.sum().name('total $')]))\n",
"freq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Binning and histograms\n",
"\n",
"\n",
"Numeric array expressions (columns with numeric type and other array expressions) have `bucket` and `histogram` methods which produce different kinds of binning. These produce category values (the computed bins) that can be used in grouping and other analytics.\n",
"\n",
"Let's have a look at a few examples\n",
"\n",
"I'll use the `summary` function to see the general distribution of lineitem prices in the order data joined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"items.l_extendedprice.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alright then, now suppose we want to split the item prices up into some buckets of our choosing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"buckets = [0, 5000, 10000, 50000, 100000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `bucket` function creates a bucketed category from the prices:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bucketed = items.l_extendedprice.bucket(buckets).name('bucket')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's have a look at the value counts:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bucketed.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The buckets we wrote down define 4 buckets numbered 0 through 3. The `NaN` is a pandas `NULL` value (since that's how pandas represents nulls in numeric arrays), so don't worry too much about that. Since the bucketing ends at 100000, we see there are 4122 values that are over 100000. These can be included in the bucketing with `include_over`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bucketed = (items.l_extendedprice\n",
" .bucket(buckets, include_over=True)\n",
" .name('bucket'))\n",
"bucketed.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `bucketed` object here is a special **_category_** type"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bucketed.type()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Category values can either have a known or unknown **_cardinality_**. In this case, there's either 4 or 5 buckets based on how we used the `bucket` function.\n",
"\n",
"Labels can be assigned to the buckets at any time using the `label` function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bucket_counts = bucketed.value_counts()\n",
"\n",
"labeled_bucket = (bucket_counts.bucket\n",
" .label(['0 to 5000', '5000 to 10000', '10000 to 50000',\n",
" '50000 to 100000', 'Over 100000'])\n",
" .name('bucket_name'))\n",
"\n",
"expr = (bucket_counts[labeled_bucket, bucket_counts]\n",
" .sort_by('bucket'))\n",
"expr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nice, huh?\n",
"\n",
"`histogram` is a linear (fixed size bin) equivalent:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t = con.table('functional_alltypes')\n",
"\n",
"d = t.double_col\n",
"\n",
"tier = d.histogram(10).name('hist_bin')\n",
"expr = (t.group_by(tier)\n",
" .aggregate([d.min(), d.max(), t.count()])\n",
" .sort_by('hist_bin'))\n",
"expr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering in aggregations\n",
"\n",
"\n",
"Suppose that you want to compute an aggregation with a subset of the data for _only one_ of the metrics / aggregates in question, and the complete data set with the other aggregates. Most aggregation functions are thus equipped with a `where` argument. Let me show it to you in action:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t = con.table('functional_alltypes')\n",
"\n",
"d = t.double_col\n",
"s = t.string_col\n",
"\n",
"cond = s.isin(['3', '5', '7'])\n",
"\n",
"metrics = [t.count().name('# rows total'), \n",
" cond.sum().name('# selected'),\n",
" d.sum().name('total'),\n",
" d.sum(where=cond).name('selected total')]\n",
"\n",
"color = (t.float_col\n",
" .between(3, 7)\n",
" .ifelse('red', 'blue')\n",
" .name('color'))\n",
"\n",
"t.group_by(color).aggregate(metrics)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}