570 changes: 428 additions & 142 deletions .github/workflows/main.yml

Large diffs are not rendered by default.

24 changes: 24 additions & 0 deletions .github/workflows/test-report.yml
@@ -0,0 +1,24 @@
name: Test Report
on:
workflow_run:
workflows: ['CI']
types:
- completed
jobs:
report:
if: ${{ github.event.workflow_run.conclusion == 'success' || github.event.workflow_run.conclusion == 'failure' }}
runs-on: ubuntu-latest
concurrency: report
steps:
- name: Download artifact
uses: dawidd6/action-download-artifact@v2
with:
workflow: ${{ github.event.workflow_run.workflow_id }}
workflow_conclusion: completed
path: artifacts

- name: publish test report
uses: EnricoMi/publish-unit-test-result-action@v1
with:
commit: ${{ github.event.workflow_run.head_sha }}
files: artifacts/**/junit.xml
3 changes: 3 additions & 0 deletions .gitignore
@@ -67,3 +67,6 @@ ibis_testing*
.ipynb_checkpoints/
.pytest_cache
.mypy_cache

# temporary doc build
docbuild
16 changes: 12 additions & 4 deletions .pre-commit-config.yaml
@@ -1,16 +1,24 @@
repos:
- repo: https://github.com/timothycrosley/isort
rev: 5.6.4
- repo: https://github.com/PyCQA/isort
rev: 5.9.3
hooks:
- id: isort
- repo: https://github.com/psf/black
rev: 19.10b0
rev: 21.9b0
hooks:
- id: black
exclude: (ibis/_version|versioneer).py
exclude: (ibis/_version|versioneer)\.py
- repo: git://github.com/pre-commit/pre-commit-hooks
rev: v2.1.0
hooks:
- id: flake8
types:
- python
- repo: https://github.com/asottile/pyupgrade
rev: v2.29.0
hooks:
- id: pyupgrade
args: [--py37-plus]
exclude: (ibis/_version|versioneer)\.py
types:
- python
192 changes: 0 additions & 192 deletions Makefile

This file was deleted.

7 changes: 4 additions & 3 deletions README.md
@@ -5,7 +5,7 @@
| Documentation | [![Documentation Status](https://img.shields.io/badge/docs-docs.ibis--project.org-blue.svg)](http://ibis-project.org) |
| Conda packages | [![Anaconda-Server Badge](https://anaconda.org/conda-forge/ibis-framework/badges/version.svg)](https://anaconda.org/conda-forge/ibis-framework) |
| PyPI | [![PyPI](https://img.shields.io/pypi/v/ibis-framework.svg)](https://pypi.org/project/ibis-framework) |
| Azure | [![Azure Status](https://dev.azure.com/ibis-project/ibis/_apis/build/status/ibis-project.ibis)](https://dev.azure.com/ibis-project/ibis/_build) |
| GitHub Actions | [![Build status](https://github.com/ibis-project/ibis/actions/workflows/main.yml/badge.svg)](https://github.com/ibis-project/ibis/actions/workflows/main.yml) |
| Coverage | [![Codecov branch](https://img.shields.io/codecov/c/github/ibis-project/ibis/master.svg)](https://codecov.io/gh/ibis-project/ibis) |


@@ -32,12 +32,13 @@ Ibis currently provides tools for interacting with the following systems:
- [Apache Kudu](https://kudu.apache.org/)
- [Hadoop Distributed File System (HDFS)](https://hadoop.apache.org/)
- [PostgreSQL](https://www.postgresql.org/)
- [MySQL](https://www.mysql.com/) (Experimental)
- [MySQL](https://www.mysql.com/)
- [SQLite](https://www.sqlite.org/)
- [Pandas](https://pandas.pydata.org/) [DataFrames](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)
- [Clickhouse](https://clickhouse.yandex)
- [BigQuery](https://cloud.google.com/bigquery)
- [OmniSciDB](https://www.omnisci.com)
- [Spark](https://spark.apache.org) (Experimental)
- [PySpark](https://spark.apache.org)
- [Dask](https://dask.org/) (Experimental)

Learn more about using the library at http://ibis-project.org.
6 changes: 0 additions & 6 deletions azure-pipelines.yml

This file was deleted.

6 changes: 3 additions & 3 deletions benchmarks/benchmarks.py
@@ -138,7 +138,7 @@ def time_impala_large_expr_compile(self):

class PandasBackend:
def setup(self):
n = 30 * int(2e5)
n = 30 * int(2e4)
self.data = pd.DataFrame(
{
'key': np.random.choice(16000, size=n),
@@ -268,10 +268,10 @@ def time_high_card_grouped_rolling(self):
self.high_card_grouped_rolling.execute()

def time_low_card_grouped_rolling_udf(self):
self.low_card_grouped_rolling_udf.execute()
self.low_card_grouped_rolling_udf_mean.execute()

def time_high_card_grouped_rolling_udf(self):
self.high_card_grouped_rolling_udf.execute()
self.high_card_grouped_rolling_udf_mean.execute()

def time_low_card_window_analytics_udf(self):
self.low_card_window_analytics_udf.execute()
5 changes: 0 additions & 5 deletions ci/.env
@@ -18,8 +18,3 @@ IBIS_TEST_POSTGRES_DATABASE=ibis_testing
IBIS_TEST_CLICKHOUSE_HOST=clickhouse
IBIS_TEST_CLICKHOUSE_PORT=9000
IBIS_TEST_CLICKHOUSE_DATABASE=ibis_testing
IBIS_TEST_OMNISCIDB_HOST=omniscidb
IBIS_TEST_OMNISCIDB_PORT=6274
IBIS_TEST_OMNISCIDB_DATABASE=ibis_testing
IBIS_TEST_OMNISCIDB_USER=admin
IBIS_TEST_OMNISCIDB_PASSWORD=HyperInteractive
28 changes: 0 additions & 28 deletions ci/Dockerfile.dev

This file was deleted.

16 changes: 0 additions & 16 deletions ci/Dockerfile.docs

This file was deleted.

183 changes: 0 additions & 183 deletions ci/azure/linux.yml

This file was deleted.

30 changes: 0 additions & 30 deletions ci/backends-markers.sh

This file was deleted.

41 changes: 0 additions & 41 deletions ci/backends-to-start.sh

This file was deleted.

22 changes: 0 additions & 22 deletions ci/check-services.sh

This file was deleted.

24 changes: 24 additions & 0 deletions ci/condarc
@@ -0,0 +1,24 @@
# vim: ft=yaml
always_yes: true

# remote_connect_timeout_secs (float)
# The number seconds conda will wait for your client to establish a
# connection to a remote url resource.
remote_connect_timeout_secs: 30.0

# remote_max_retries (int)
# The maximum number of retries each HTTP connection should attempt.
#
remote_max_retries: 10

# remote_backoff_factor (int)
# The factor determines how long an HTTP connection should wait
# between retry attempts.
#
remote_backoff_factor: 2

# remote_read_timeout_secs (float)
# Once conda has connected to a remote resource and sent an HTTP
# request, the read timeout is the number of seconds conda will wait for
# the server to send a response.
remote_read_timeout_secs: 60.0
380 changes: 111 additions & 269 deletions ci/datamgr.py

Large diffs are not rendered by default.

2 changes: 0 additions & 2 deletions ci/deps/bigquery.yml

This file was deleted.

12 changes: 7 additions & 5 deletions ci/deps/clickhouse.yml
@@ -1,5 +1,7 @@
sqlalchemy>=1.3
clickhouse-cityhash
clickhouse-driver>=0.1.3
clickhouse-sqlalchemy
lz4
name: ibis
dependencies:
- sqlalchemy=1.3
- clickhouse-cityhash
- clickhouse-driver
- clickhouse-sqlalchemy
- lz4
4 changes: 4 additions & 0 deletions ci/deps/dask-min.yml
@@ -0,0 +1,4 @@
name: ibis
dependencies:
- dask=2021.2.0
- pyarrow
6 changes: 6 additions & 0 deletions ci/deps/dask.yml
@@ -0,0 +1,6 @@
# For exclusions see https://github.com/ibis-project/ibis/pull/2802
# dask 2021.5.0,5.1,6.0 had some issues with meta
name: ibis
dependencies:
- dask!=2021.5.*,!=2021.6.0
- pyarrow
3 changes: 3 additions & 0 deletions ci/deps/hdf5.yml
@@ -0,0 +1,3 @@
name: ibis
dependencies:
- pytables
14 changes: 8 additions & 6 deletions ci/deps/impala.yml
@@ -1,6 +1,8 @@
sqlalchemy>=1.3
impyla>=0.15.0
requests>=2.24
thrift>=0.9.3
thriftpy2>=0.4
thrift_sasl>=0.2.1
name: ibis
dependencies:
- bitarray=2.0.1
- impyla
- requests
- thrift_sasl
- python-hdfs
- boost
6 changes: 4 additions & 2 deletions ci/deps/mysql.yml
@@ -1,2 +1,4 @@
sqlalchemy>=1.3
pymysql
name: ibis
dependencies:
- sqlalchemy
- pymysql
2 changes: 0 additions & 2 deletions ci/deps/omniscidb.yml

This file was deleted.

4 changes: 3 additions & 1 deletion ci/deps/parquet.yml
@@ -1 +1,3 @@
pyarrow>=0.13
name: ibis
dependencies:
- pyarrow
7 changes: 7 additions & 0 deletions ci/deps/postgres-min.yml
@@ -0,0 +1,7 @@
name: ibis
dependencies:
- sqlalchemy=1.3
- psycopg2=2.7
- geoalchemy2=0.6
- geopandas=0.6
- shapely=1.6
10 changes: 7 additions & 3 deletions ci/deps/postgres.yml
@@ -1,3 +1,7 @@
sqlalchemy>=1.3
psycopg2>=2.8
geoalchemy2>=0.6
name: ibis
dependencies:
- sqlalchemy
- psycopg2
- geoalchemy2
- geopandas
- shapely
10 changes: 6 additions & 4 deletions ci/deps/pyspark-min.yml
@@ -1,4 +1,6 @@
# Need to import double-conversion below otherwise `import pyarrow` fails with ImportError: libdouble-conversion.so.3
double-conversion
pyarrow=0.12.1
pyspark=2.4.3
name: ibis
dependencies:
- openjdk=8
- pyarrow=1
- pyspark=2.4.3
- pandas=1.2.5
4 changes: 3 additions & 1 deletion ci/deps/pyspark.yml
@@ -1 +1,3 @@
pyspark>=2.4.3
name: ibis
dependencies:
- pyspark
4 changes: 0 additions & 4 deletions ci/deps/spark-min.yml

This file was deleted.

1 change: 0 additions & 1 deletion ci/deps/spark.yml

This file was deleted.

128 changes: 0 additions & 128 deletions ci/docker-compose.yml

This file was deleted.

41 changes: 0 additions & 41 deletions ci/dockerize.sh

This file was deleted.

17 changes: 0 additions & 17 deletions ci/docs.sh

This file was deleted.

23 changes: 9 additions & 14 deletions ci/impalamgr.py
@@ -30,13 +30,13 @@

env_items = ENV.items()
maxlen = max(map(len, map(toolz.first, env_items))) + len('IbisTestEnv[""]')
format_string = '%-{:d}s == %r'.format(maxlen)
format_string = f'%-{maxlen:d}s == %r'
for key, value in env_items:
logger.info(format_string, 'IbisTestEnv[{!r}]'.format(key), value)
logger.info(format_string, f'IbisTestEnv[{key!r}]', value)


def make_ibis_client(env):
hc = ibis.hdfs_connect(
hc = ibis.impala.hdfs_connect(
host=env.nn_host,
port=env.webhdfs_port,
auth_mechanism=env.auth_mechanism,
@@ -55,16 +55,11 @@ def make_ibis_client(env):
)


def can_write_to_hdfs(con):
def raise_if_cannot_write_to_hdfs(con):
test_path = os.path.join(ENV.test_data_dir, ibis.util.guid())
test_file = BytesIO(ibis.util.guid().encode('utf-8'))
try:
con.hdfs.put(test_path, test_file)
con.hdfs.rm(test_path)
return True
except Exception:
logger.exception('Could not write to HDFS')
return False
con.hdfs.put(test_path, test_file)
con.hdfs.rm(test_path)


def can_build_udfs():
@@ -232,7 +227,7 @@ def upload_udfs(con):
# ==========================================


@click.group(context_settings=dict(help_option_names=['-h', '--help']))
@click.group(context_settings={'help_option_names': ['-h', '--help']})
def main():
"""Manage impala test data for Ibis."""

@@ -260,8 +255,8 @@ def load(data, udf, data_dir, overwrite):
con = make_ibis_client(ENV)

# validate our environment before performing possibly expensive operations
if not can_write_to_hdfs(con):
raise IbisError('Failed to write to HDFS; check your settings')
raise_if_cannot_write_to_hdfs(con)

if udf and not can_build_udfs():
raise IbisError('Build environment does not support building UDFs')

26 changes: 0 additions & 26 deletions ci/load-data.sh

This file was deleted.

28 changes: 28 additions & 0 deletions ci/merge_and_update_env.sh
@@ -0,0 +1,28 @@
#!/usr/bin/env bash
#
# Merge environment files and update the corresponding conda environment

set -euo pipefail

if [ "$#" -eq 0 ]; then
>&2 echo "error: must provide at least one backend"
exit 1
fi

# install conda-merge, don't try to update already installed dependencies
mamba install --freeze-installed --name ibis conda-merge

additional_env_files=()

# pull all files associated with input backends
for backend in "$@"; do
env_file="ci/deps/${backend}.yml"

if [ -f "${env_file}" ]; then
additional_env_files+=("$env_file")
fi
done

env_yaml="$(mktemp --suffix=.yml)"
conda-merge environment.yml "${additional_env_files[@]}" | tee "$env_yaml"
mamba env update --name ibis --file "$env_yaml"
3 changes: 0 additions & 3 deletions ci/omniscidb.conf

This file was deleted.

25 changes: 9 additions & 16 deletions ci/recipe/meta.yaml
@@ -16,10 +16,9 @@ source:
url: https://github.com/ibis-project/ibis/archive/{{ version }}.tar.gz

build:
number: 1
number: 0
script: {{ PYTHON }} -m pip install . --no-deps --ignore-installed --no-cache-dir -vvv
# uncomment noarch when pymapd and pyspark issues are fixed for py38
# noarch: python
skip: true # [py<37]

requirements:
host:
@@ -28,24 +27,23 @@ requirements:
- setuptools

run:
- cached_property
- clickhouse-driver >=0.1.3
- clickhouse-cityhash # [not win]
- clickhouse-sqlalchemy
- geoalchemy2
- geopandas
- google-cloud-bigquery-core >=1.12.0,<1.24.0dev
- graphviz
- impyla >=0.15.0
- lz4
- multipledispatch >=0.6
- numpy >=1.15
- pandas >=0.25.3
- parsy
- psycopg2
- pyarrow >=0.15
- pydata-google-auth
- pymapd 0.24 # [py<38]
- pymysql
- pyspark >=2.4.3 # [py<38]
- pyspark >=2.4.3
- pytables >=3.0.0
- python
- python-graphviz
@@ -59,22 +57,20 @@
- thrift >=0.11
- thriftpy2
- toolz
- tzlocal <3 # not directly needed, but need to pin since 3.0 is broken

test:
imports:
- ibis
- ibis.backends.bigquery
- ibis.backends.clickhouse
- ibis.backends.csv
- ibis.backends.parquet
- ibis.backends.hdf5
- ibis.backends.impala
- ibis.backends.mysql
- ibis.backends.omniscidb # [py<38]
- ibis.backends.pandas
- ibis.backends.postgres
- ibis.backends.pyspark # [py<38]
- ibis.backends.spark
- ibis.backends.pyspark
- ibis.backends.sqlite

about:
@@ -86,9 +82,6 @@

extra:
recipe-maintainers:
- cpcloud
- mariusvniekerk
- wesm
- kszucs
- xmnlab
- jreback
- xmnlab
- datapythonista
28 changes: 16 additions & 12 deletions ci/run_tests.sh
@@ -1,22 +1,26 @@
#!/bin/bash -e
#!/usr/bin/env bash
# Run the Ibis tests. Two environment variables are considered:
# - PYTEST_BACKENDS: Space-separated list of backends to run
# - PYTEST_EXPRESSION: Marker expression, for example "not udf"

TESTS_DIRS="ibis/tests"
for BACKEND in $PYTEST_BACKENDS; do
if [[ -d ibis/$BACKEND/tests ]]; then
TESTS_DIRS="$TESTS_DIRS ibis/$BACKEND/tests"
set -eo pipefail

TESTS_DIRS=()

if [ -n "$PYTEST_BACKENDS" ]; then
TESTS_DIRS+=("ibis/backends/tests")
fi

for backend in $PYTEST_BACKENDS; do
backend_test_dir="ibis/backends/$backend/tests"
if [ -d "$backend_test_dir" ]; then
TESTS_DIRS+=("$backend_test_dir")
fi
done

echo "TESTS_DIRS: $TESTS_DIRS"
echo "PYTEST_EXPRESSION: $PYTEST_EXPRESSION"

set -x

pytest $TESTS_DIRS \
-m "${PYTEST_EXPRESSION}" \
pytest "${TESTS_DIRS[@]}" \
-ra \
--junitxml=junit.xml \
--cov=ibis \
--cov-report=xml:coverage.xml
--cov-report=xml:coverage.xml "$@"
41 changes: 0 additions & 41 deletions ci/schema/bigquery.sql

This file was deleted.

82 changes: 0 additions & 82 deletions ci/schema/omniscidb.sql

This file was deleted.

64 changes: 0 additions & 64 deletions ci/setup_env.sh

This file was deleted.

29 changes: 0 additions & 29 deletions conftest.py

This file was deleted.

10 changes: 5 additions & 5 deletions dev/merge-pr.py
@@ -64,11 +64,11 @@ def merge_pr(
"{GITHUB_API_BASE}/pulls/{pr_num:d}/merge".format(
GITHUB_API_BASE=GITHUB_API_BASE, pr_num=pr_num
),
json=dict(
commit_title=commit_title,
commit_message=commit_message,
merge_method=merge_method,
),
json={
'commit_title': commit_title,
'commit_message': commit_message,
'merge_method': merge_method,
},
auth=(github_user, password),
)
status_code = resp.status_code
41 changes: 40 additions & 1 deletion docs/source/api.rst
@@ -39,6 +39,25 @@ These methods are available directly in the ``ibis`` module namespace.
trailing_range_window
random

.. _api.basebackend:

Backend methods
------------------

.. currentmodule:: ibis.backends.base

.. autosummary::
:toctree: generated/

BaseBackend.connect
BaseBackend.database
BaseBackend.current_database
BaseBackend.list_tables
BaseBackend.table
BaseBackend.version
BaseBackend.compile
BaseBackend.execute
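
These methods form the common interface that every backend implements with
standardized signatures. A minimal usage sketch (assuming a local SQLite
database file; the file name is only illustrative)::

    import ibis

    con = ibis.sqlite.connect('geography.db')  # any backend returns a BaseBackend subclass
    print(con.version)                         # backend version, as a string
    print(con.list_tables())                   # names of the available tables
    countries = con.table('countries')         # build a table expression
    expr = countries.limit(5)
    print(con.compile(expr))                   # the SQL the backend would execute
    df = con.execute(expr)                     # run the expression and fetch the result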

.. _api.expr:

General expression methods
@@ -66,6 +85,7 @@ Table methods
:toctree: generated/

TableExpr.aggregate
TableExpr.asof_join
TableExpr.count
TableExpr.distinct
TableExpr.drop
@@ -229,6 +249,9 @@ Scalar or column methods

IntegerValue.convert_base
IntegerValue.to_timestamp
IntegerColumn.bit_and
IntegerColumn.bit_or
IntegerColumn.bit_xor

.. _api.string:

@@ -419,6 +442,22 @@ Decimal methods
DecimalValue.precision
DecimalValue.scale

.. _api.struct:

Struct methods
-----------------

Scalar or column methods
~~~~~~~~~~~~~~~~~~~~~~~~

Values in a ``StructValue`` can be accessed using indexing, e.g. ``struct_expr['my_col']``. See :meth:`StructValue.__getitem__`.

.. autosummary::
:toctree: generated/

StructValue.destructure
StructValue.__getitem__
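
As an illustrative sketch (the table, column, and field names below are
hypothetical), field access and destructuring look like::

    import ibis

    t = ibis.table([('info', 'struct<city: string, zip: int64>')], name='t')

    city = t.info['city']                      # StructValue.__getitem__ -> a string column
    expanded = t.mutate(t.info.destructure())  # expand every struct field into its own column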

.. _api.geospatial:

Geospatial methods
@@ -450,7 +489,7 @@ Scalar or column methods
GeoSpatialValue.distance
GeoSpatialValue.end_point
GeoSpatialValue.envelope
GeoSpatialValue.equals
GeoSpatialValue.geo_equals
GeoSpatialValue.geometry_n
GeoSpatialValue.geometry_type
GeoSpatialValue.intersection
176 changes: 0 additions & 176 deletions docs/source/backends/bigquery.rst

This file was deleted.

17 changes: 8 additions & 9 deletions docs/source/backends/clickhouse.rst
@@ -7,7 +7,7 @@ Install dependencies for Ibis's Clickhouse dialect(minimal supported version is

::

pip install ibis-framework[clickhouse]
pip install 'ibis-framework[clickhouse]'

Create a client by passing in database connection parameters such as ``host``,
``port``, ``database``, and ``user`` to :func:`ibis.clickhouse.connect`:
@@ -30,11 +30,10 @@ Use ``ibis.clickhouse.connect`` to create a client.
.. autosummary::
:toctree: ../generated/

connect
ClickhouseClient.close
ClickhouseClient.exists_table
ClickhouseClient.exists_database
ClickhouseClient.get_schema
ClickhouseClient.set_database
ClickhouseClient.list_databases
ClickhouseClient.list_tables
Backend.connect
Backend.close
Backend.exists_table
Backend.exists_database
Backend.get_schema
Backend.list_databases
Backend.list_tables
4 changes: 4 additions & 0 deletions docs/source/backends/dask.rst
@@ -0,0 +1,4 @@
`Dask <https://dask.org/>`_
===========================

The Dask backend is currently experimental.
791 changes: 641 additions & 150 deletions docs/source/backends/impala.rst

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions docs/source/backends/index.rst
@@ -15,11 +15,10 @@ For more information on a specific backend, check the next backend pages:
postgres
mysql
impala
omnisci
bigquery
clickhouse
spark
pyspark
pandas
dask


.. _classes_of_backends:
@@ -44,7 +43,6 @@ string to the database through a driver API.
- `Google BigQuery <https://cloud.google.com/bigquery/>`_
- `Hadoop Distributed File System (HDFS) <https://hadoop.apache.org/>`_
- `OmniSciDB <https://www.omnisci.com/>`_
- `PySpark/Spark SQL <https://spark.apache.org/sql/>`_ (Experimental)

.. _expression_generating_backends:

@@ -61,7 +59,7 @@ dependencies).

- `PostgreSQL <https://www.postgresql.org/>`_
- `SQLite <https://www.sqlite.org/>`_
- `MySQL <https://www.mysql.com/>`_ (Experimental)
- `MySQL <https://www.mysql.com/>`_

.. _direct_execution_backends:

@@ -72,3 +70,5 @@ backend. A full description of the implementation can be found in the module
docstring of the pandas backend located in ``ibis/backends/pandas/core.py``.

- `Pandas <http://pandas.pydata.org/>`_
- `PySpark <https://spark.apache.org/sql/>`_
- `Dask <https://dask.org/>`_ (Experimental)
12 changes: 6 additions & 6 deletions docs/source/backends/mysql.rst
@@ -7,7 +7,7 @@ Install dependencies for Ibis's MySQL dialect:

::

pip install ibis-framework[mysql]
pip install 'ibis-framework[mysql]'

Create a client by passing a connection string or individual parameters to
:func:`ibis.mysql.connect`:
@@ -36,8 +36,8 @@ create a client.
.. autosummary::
:toctree: ../generated/

connect
MySQLClient.database
MySQLClient.list_databases
MySQLClient.list_tables
MySQLClient.table
Backend.connect
Backend.database
Backend.list_databases
Backend.list_tables
Backend.table
440 changes: 0 additions & 440 deletions docs/source/backends/omnisci.rst

This file was deleted.

12 changes: 6 additions & 6 deletions docs/source/backends/postgres.rst
@@ -7,7 +7,7 @@ Install dependencies for Ibis's PostgreSQL dialect:

::

pip install ibis-framework[postgres]
pip install 'ibis-framework[postgres]'

Create a client by passing a connection string to the ``url`` parameter or
individual parameters to :func:`ibis.postgres.connect`:
@@ -39,8 +39,8 @@ create a client.
.. autosummary::
:toctree: ../generated/

connect
PostgreSQLClient.database
PostgreSQLClient.list_tables
PostgreSQLClient.list_databases
PostgreSQLClient.table
Backend.connect
Backend.database
Backend.list_tables
Backend.list_databases
Backend.table
36 changes: 36 additions & 0 deletions docs/source/backends/pyspark.rst
@@ -0,0 +1,36 @@
.. _install.pyspark:

`PySpark <https://spark.apache.org/sql/>`_
====================================================

Install dependencies for Ibis's PySpark dialect:

::

pip install 'ibis-framework[pyspark]'

.. note::

When using the PySpark backend with PySpark 2.4.x and pyarrow >= 0.15.0, you
need to set ``ARROW_PRE_0_15_IPC_FORMAT=1``. See `here
<https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html#compatibility-setting-for-pyarrow-0-15-0-and-spark-2-3-x-2-4-x>`_
for details.
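
A sketch of one way to do this (only relevant for that PySpark/pyarrow
combination; ``session`` below stands for whatever ``SparkSession`` you already
create)::

    import os

    # Driver side: set the variable before pyspark starts.
    os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'

    from pyspark.sql import SparkSession

    session = (
        SparkSession.builder
        # Executor side: propagate the same variable to the workers.
        .config('spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT', '1')
        .getOrCreate()
    )

    import ibis

    con = ibis.pyspark.connect(session)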

.. _api.pyspark:

PySpark client
~~~~~~~~~~~~~~
.. currentmodule:: ibis.backends.pyspark

The PySpark client is accessible through the ``ibis.pyspark`` namespace.

Use ``ibis.pyspark.connect`` to create a client.

.. autosummary::
:toctree: ../generated/

Backend.connect
Backend.database
Backend.list_databases
Backend.list_tables
Backend.table
58 changes: 0 additions & 58 deletions docs/source/backends/spark.rst

This file was deleted.

12 changes: 6 additions & 6 deletions docs/source/backends/sqlite.rst
@@ -7,7 +7,7 @@ Install dependencies for Ibis's SQLite dialect:

::

pip install ibis-framework[sqlite]
pip install 'ibis-framework[sqlite]'

Create a client by passing a path to a SQLite database to
:func:`ibis.sqlite.connect`:
@@ -33,8 +33,8 @@ Use ``ibis.sqlite.connect`` to create a SQLite client.
.. autosummary::
:toctree: ../generated/

connect
SQLiteClient.attach
SQLiteClient.database
SQLiteClient.list_tables
SQLiteClient.table
Backend.connect
Backend.attach
Backend.database
Backend.list_tables
Backend.table
3 changes: 1 addition & 2 deletions docs/source/conf.py
@@ -1,4 +1,3 @@
# -*- coding: utf-8 -*-
#
# Ibis documentation build configuration file, created by
# sphinx-quickstart on Wed Jun 10 11:06:29 2015.
@@ -68,7 +67,7 @@

# General information about the project.
project = 'Ibis'
copyright = '{}, Ibis Developers'.format(datetime.date.today().year)
copyright = f'{datetime.date.today().year}, Ibis Developers'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -64,7 +64,7 @@ Ibis can also be installed with Kerberos support for its HDFS functionality:

::

pip install ibis-framework[kerberos]
pip install 'ibis-framework[kerberos]'

Some platforms will require that you have Kerberos installed to build properly.

71 changes: 71 additions & 0 deletions docs/source/release/index.rst
@@ -12,6 +12,77 @@ Release Notes
These release notes are for versions of ibis **1.0 and later**. Release
notes for pre-1.0 versions of ibis can be found at :doc:`release-pre-1.0`

* :support:`2678` Improvement of the backend API. The former `Client` subclasses have been replaced by a `Backend` class that must
subclass `ibis.backends.base.BaseBackend`. The `BaseBackend` class contains abstract methods for the minimum subset of methods that
backends must implement, and their signatures have been standardized across backends. The Ibis compiler has been refactored, and
backends don't need to implement all compiler classes anymore if the default works for them. Only a subclass of
`ibis.backends.base.sql.compiler.Compiler` is now required. Backends now need to register themselves as entry points.
* :support:`2905` Deprecate `exists_table(table)` in favor of `table in list_tables()`
* :bug:`2991` Fix data races in impala connection pool accounting
* :bug:`2985` Fix null literal compilation in the Clickhouse backend
* :bug:`2984` Fix order of limit and offset parameters in the Clickhouse backend
* :support:`2977` Remove handwritten type parser; parsing errors that were previously `IbisTypeError` are now `parsy.ParseError`. `parsy` is now a hard requirement.
* :support:`2962` Methods `current_database` and `list_databases` raise an exception for backends that do not support databases
* :bug:`2956` Rename the geospatial `equals` operation to `geo_equals`
* :support:`2913` Method `set_database` has been deprecated, in favor of creating a new connection to a different database
* :feature:`2938` Serialization-deserialization of Node via pickle is now byte compatible between different processes
* :support:`2914` Removed `log` method of clients, in favor of `verbose_log` option
* :feature:`2916` Support joining on different columns in ClickHouse backend
* :feature:`2908` Support summarization of empty data in Pandas backend
* :support:`2883` Output of `Client.version` returned as a string, instead of a setuptools `Version`
* :feature:`2882` Unify implementation of fillna and isna in Pyspark backend
* :support:`2862` Deprecated `list_schemas` in SQLAlchemy backends in favor of `list_databases`
* :bug:`2829` Fix .drop(fields). The argument can now be either a list of strings or a string.
* :feature:`2873` Support binary operation with Timedelta in Pyspark backend
* :support:`2865` Deprecated `ibis.<backend>.verify()` in favor of capturing exception in `ibis.<backend>.compile()`
* :bug:`2845` Fix projection on differences and intersections for SQL backends
* :feature:`2839` Add `group_concat` operation for Clickhouse backend
* :bug:`2827` Backends are loaded in a lazy way, so third-party backends can import Ibis without circular imports
* :bug:`2830` Disable aggregation optimization due to N squared performance
* :bug:`2821` Fix `.cast()` to array outputting list instead of np.array in Pandas backend
* :bug:`2820` Fix aggregation with mixed reduction datatypes (array + scalar) on Dask backend
* :feature:`2808` Support comparison of ColumnExpr to timestamp literal
* :support:`2789` Simplification of data fetching. Backends don't need to implement `Query` anymore
* :feature:`2805` Make op schema a cached property
* :feature:`2613` :feature:`2778` Implement `.insert()` for SQLAlchemy backends
* :feature:`2792` Infer categorical and decimal Series to more specific Ibis types in Pandas backend
* :feature:`2790` Add `startswith` and `endswith` operations
* :feature:`2776` :feature:`2797` Allow more flexible return type for UDFs
* :feature:`2779` Implement Clip in the Pyspark backend
* :bug:`2770` Fix error when using reduction UDF that returns np.array in a grouped aggregation
* :feature:`2753` Use `ndarray` as array representation in Pandas backend
* :support:`2665` Move BigQuery backend to a `separate repository <https://github.com/ibis-project/ibis-bigquery>`_.
The backend will be released separately, use `pip install ibis-bigquery` or `conda install ibis-bigquery` to
install it, and then use as before.
* :bug:`2712` Fix time context trimming error for multi column udfs in pandas backend
* :bug:`2710` Fix error during compilation of range_window in base_sql backends (:issue:`2608`)
* :feature:`2687` Support Spark filter with window operation
* :bug:`2696` Fix wrong row indexing in the result for 'window after filter' for timecontext adjustment
* :bug:`2702` Fix `aggregate` exploding the output of Reduction ops that return a list/ndarray
* :bug:`2693` Fix issues with context adjustment for filter with PySpark backend
* :support:`2689` Support SQLAlchemy 1.4, and require at least version 1.3
* :support:`2680` Namespace time_col config, fix type check for trim_with_timecontext for pandas window execution
* :feature:`2646` Support context adjustment for udfs for pandas backend
* :feature:`2655` Add `auth_local_webserver`, `auth_external_data`, and
`auth_cache` parameters to BigQuery connect method. Set
`auth_local_webserver` to use a local server instead of copy-pasting an
authorization code. Set `auth_external_data` to true to request additional
scopes required to query Google Drive and Sheets. Set `auth_cache` to
`reauth` or `none` to force reauthentication.
* :bug:`2657` Add temporary struct col in pyspark backend to ensure that UDFs are executed only once
* :bug:`2588` Fix BigQuery connect bug that ignored project ID parameter
* :bug:`2636` Fix overwrite logic to account for DestructColumn inside mutate API
* :feature:`2641` Add `bit_and`, `bit_or`, and `bit_xor` integer column aggregates (BigQuery and MySQL backends)
* :feature:`2379` Backends are defined as entry points (see the registration sketch after these notes)
* :bug:`2635` Fix fusion optimization bug that incorrectly changes operation order
* :feature:`2615` Add `ibis.array` for creating array expressions
* :feature:`2607` Implement Not operation in PySpark backend
* :feature:`2610` Added support for case/when in PySpark backend
* :bug:`2610` Fixes a NPE issue with substr in PySpark backend
* :feature:`2603` Add support for np.array as literals for backends that already support lists as literals
* :bug:`2354` Fixes binary data type translation into BigQuery bytes data type
* :bug:`2577` Make StructValue picklable
* :support:`2505` Remove deprecated `ibis.HDFS`, `ibis.WebHDFS` and `ibis.hdfs_connect`
* :feature:`2514` Add Struct.from_dict
* :feature:`2310` Add hash and hashbytes support for BigQuery backend
* :feature:`2511` Support reduction UDF without groupby to return multiple columns for Pandas backend
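
As a minimal sketch of the entry-point registration described in the notes for
2678 and 2379 above (the package, module, and backend names are hypothetical),
a third-party backend advertises itself under the ``ibis.backends`` entry-point
group::

    # setup.py of a hypothetical ibis-mybackend package
    from setuptools import setup

    setup(
        name='ibis-mybackend',
        packages=['ibis_mybackend'],
        entry_points={
            'ibis.backends': [
                # the module is expected to expose a Backend subclass of
                # ibis.backends.base.BaseBackend
                'mybackend = ibis_mybackend',
            ],
        },
    )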
4 changes: 2 additions & 2 deletions docs/source/release/release-pre-1.0.rst
@@ -20,8 +20,8 @@ New Features
* Splat args into Node subclasses instead of requiring a list (:ghissue:`969`)
* Add support for ``UNION`` in the BigQuery backend (:ghissue:`1408`,
:ghissue:`1409`)
* Support for writing UDFs in BigQuery (:ghissue:`1377`). See :ref:`the BigQuery
UDF docs <udf.bigquery>` for more details.
* Support for writing UDFs in BigQuery (:ghissue:`1377`). See the BigQuery
UDF docs for more details.
* Support for cross-project expressions in the BigQuery backend.
(:ghissue:`1427`, :ghissue:`1428`)
* Add ``strftime`` and ``to_timestamp`` support for BigQuery (:ghissue:`1422`,
2 changes: 1 addition & 1 deletion docs/source/tutorial/01-Introduction-to-Ibis.ipynb
@@ -85,7 +85,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that if you installed Ibis with `pip` instead of `conda`, you may need to install the SQLite backend separately with `pip install ibis-framework[sqlite]`.\n",
"Note that if you installed Ibis with `pip` instead of `conda`, you may need to install the SQLite backend separately with `pip install 'ibis-framework[sqlite]'`.\n",
"\n",
"### Exploring the data\n",
"\n",
165 changes: 86 additions & 79 deletions docs/source/tutorial/04-More-Value-Expressions.ipynb
@@ -20,13 +20,12 @@
"metadata": {},
"outputs": [],
"source": [
"import ibis\n",
"import os\n",
"hdfs_port = os.environ.get('IBIS_WEBHDFS_PORT', 50070)\n",
"hdfs = ibis.hdfs_connect(host='impala', port=hdfs_port)\n",
"con = ibis.impala.connect(host='impala', database='ibis_testing',\n",
" hdfs_client=hdfs)\n",
"ibis.options.interactive = True"
"import ibis\n",
"\n",
"ibis.options.interactive = True\n",
"\n",
"connection = ibis.sqlite.connect(os.path.join('data', 'geography.db'))"
]
},
{
@@ -57,8 +56,9 @@
"metadata": {},
"outputs": [],
"source": [
"table = con.table('functional_alltypes')\n",
"table.string_col.cast('double').sum()"
"countries = connection.table('countries')\n",
"countries\n",
"connection.table('gdp')"
]
},
{
@@ -67,7 +67,17 @@
"metadata": {},
"outputs": [],
"source": [
"table.string_col.cast('decimal(12,2)').sum()"
"countries = connection.table('countries')\n",
"countries.population.cast('float').sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"countries.area_km2.cast('int32').sum()"
]
},
{
@@ -86,15 +96,18 @@
"metadata": {},
"outputs": [],
"source": [
"expr = (table.string_col\n",
"expr = (countries.continent\n",
" .case()\n",
" .when('4', 'fee')\n",
" .when('7', 'fi')\n",
" .when('1', 'fo')\n",
" .when('0', 'fum')\n",
" .else_(table.string_col)\n",
" .when('AF', 'Africa')\n",
" .when('AN', 'Antarctica')\n",
" .when('AS', 'Asia')\n",
" .when('EU', 'Europe')\n",
" .when('NA', 'North America')\n",
" .when('OC', 'Oceania')\n",
" .when('SA', 'South America')\n",
" .else_(countries.continent)\n",
" .end()\n",
" .name('new_strings'))\n",
" .name('continent_name'))\n",
"\n",
"expr.value_counts()"
]
@@ -112,12 +125,16 @@
"metadata": {},
"outputs": [],
"source": [
"expr = (table.string_col\n",
"expr = (countries.continent\n",
" .case()\n",
" .when('4', 'fee')\n",
" .when('7', 'fi')\n",
" .when('AF', 'Africa')\n",
" .when('AS', 'Asia')\n",
" .when('EU', 'Europe')\n",
" .when('NA', 'North America')\n",
" .when('OC', 'Oceania')\n",
" .when('SA', 'South America')\n",
" .end()\n",
" .name('with_nulls'))\n",
" .name('continent_name_with_nulls'))\n",
"\n",
"expr.value_counts()"
]
@@ -136,12 +153,13 @@
"outputs": [],
"source": [
"expr = (ibis.case()\n",
" .when(table.int_col > 5, table.bigint_col * 2)\n",
" .when(table.int_col > 2, table.bigint_col)\n",
" .else_(table.int_col)\n",
" .end())\n",
" .when(countries.population > 25_000_000, 'big')\n",
" .when(countries.population < 5_000_000, 'small')\n",
" .else_('medium')\n",
" .end()\n",
" .name('size'))\n",
"\n",
"table['id', 'int_col', 'bigint_col', expr.name('case_result')].limit(20)"
"countries['name', 'population', expr].limit(10)"
]
},
{
@@ -157,11 +175,11 @@
"metadata": {},
"outputs": [],
"source": [
"expr = ((table.int_col > 5)\n",
" .ifelse(table.bigint_col / 2, table.bigint_col * 2)\n",
" .name('ifelse_result'))\n",
"expr = ((countries.continent == 'AS')\n",
" .ifelse('Asia', 'Not Asia')\n",
" .name('is_asia'))\n",
"\n",
"table['int_col', 'bigint_col', expr].limit(10)"
"countries['name', 'continent', expr].limit(10)"
]
},
{
@@ -183,8 +201,8 @@
"metadata": {},
"outputs": [],
"source": [
"bool_clause = table.string_col.notin(['1', '4', '7'])\n",
"table[bool_clause].string_col.value_counts()"
"is_america = countries.continent.isin(['NA', 'SA'])\n",
"countries[is_america].continent.value_counts()"
]
},
{
@@ -200,9 +218,9 @@
"metadata": {},
"outputs": [],
"source": [
"top_strings = table.string_col.value_counts().limit(3).string_col\n",
"top_filter = table.string_col.isin(top_strings)\n",
"expr = table[top_filter]\n",
"top_continents = countries.continent.value_counts().limit(3).continent\n",
"top_continents_filter = countries.continent.isin(top_continents)\n",
"expr = countries[top_continents_filter]\n",
"\n",
"expr.count()"
]
@@ -220,7 +238,7 @@
"metadata": {},
"outputs": [],
"source": [
"table[table.string_col.topk(3)].count()"
"countries.continent.topk(3)"
]
},
{
@@ -245,13 +263,13 @@
"metadata": {},
"outputs": [],
"source": [
"expr = (table.string_col\n",
"expr = (countries.continent\n",
" .case()\n",
" .when('4', 'fee')\n",
" .when('7', 'fi')\n",
" .when('1', 'fo')\n",
" .when('AF', 'Africa')\n",
" .when('EU', 'Europe')\n",
" .when('AS', 'Asia')\n",
" .end()\n",
" .name('new_strings'))\n",
" .name('top_continent_name'))\n",
"\n",
"expr.isnull().value_counts()"
]
@@ -269,7 +287,7 @@
"metadata": {},
"outputs": [],
"source": [
"expr2 = expr.isnull().ifelse('was null', expr).name('strings')\n",
"expr2 = expr.isnull().ifelse('Other continent', expr).name('continent')\n",
"expr2.value_counts()"
]
},
@@ -289,7 +307,7 @@
"metadata": {},
"outputs": [],
"source": [
"table['int_col', 'bigint_col'].distinct()"
"countries['continent'].distinct()"
]
},
{
@@ -298,7 +316,7 @@
"metadata": {},
"outputs": [],
"source": [
"table.string_col.distinct()"
"countries.continent.distinct()"
]
},
{
@@ -314,9 +332,10 @@
"metadata": {},
"outputs": [],
"source": [
"metric = (table.bigint_col\n",
"metric = (countries.continent\n",
" .distinct().count()\n",
" .name('unique_bigints'))"
" .name('num_continents'))\n",
"metric"
]
},
{
Expand All @@ -332,7 +351,7 @@
"metadata": {},
"outputs": [],
"source": [
"table.string_col.nunique()"
"countries.continent.nunique()"
]
},
{
Expand All @@ -342,7 +361,7 @@
"## String operations\n",
"\n",
"\n",
"What's supported is pretty basic right now. We intend to support the full gamut of regular expression munging with a nice API, though in some cases some work will be required on Impala's backend to support everything. "
"What's supported is pretty basic right now. We intend to support the full gamut of regular expression munging with a nice API, though in some cases some work will be required on SQLite's backend to support everything. "
]
},
{
Expand All @@ -351,8 +370,7 @@
"metadata": {},
"outputs": [],
"source": [
"nation = con.table('tpch_nation')\n",
"nation.limit(5)"
"countries[['name']].limit(5)"
]
},
{
Expand All @@ -368,8 +386,8 @@
"metadata": {},
"outputs": [],
"source": [
"expr = nation.n_name.lower().left(1).name('first_letter')\n",
"expr.value_counts().sort_by(('count', False))"
"expr = countries.name.lower().left(1).name('first_letter')\n",
"expr.value_counts().sort_by(('count', False)).limit(10)"
]
},
{
Expand All @@ -389,7 +407,7 @@
"metadata": {},
"outputs": [],
"source": [
"nation[nation.n_name.like('%GE%')]"
"countries[countries.name.like('%GE%')].name"
]
},
{
Expand All @@ -398,7 +416,7 @@
"metadata": {},
"outputs": [],
"source": [
"nation[nation.n_name.lower().rlike('.*ge.*')]"
"countries[countries.name.lower().rlike('.*ge.*')].name"
]
},
{
Expand All @@ -407,7 +425,7 @@
"metadata": {},
"outputs": [],
"source": [
"nation[nation.n_name.lower().contains('ge')]"
"countries[countries.name.lower().contains('ge')].name"
]
},
{
Expand All @@ -430,9 +448,9 @@
"metadata": {},
"outputs": [],
"source": [
"table = con.table('functional_alltypes')\n",
"independence = connection.table('independence')\n",
"\n",
"table[table.timestamp_col, table.timestamp_col.minute().name('minute')].limit(10)"
"independence[independence.independence_date, independence.independence_date.month().name('month')].limit(10)"
]
},
{
Expand All @@ -449,11 +467,11 @@
"outputs": [],
"source": [
"def get_field(f):\n",
" return getattr(table.timestamp_col, f)().name(f)\n",
" return getattr(independence.independence_date, f)().name(f)\n",
"\n",
"fields = ['year', 'month', 'day', 'hour', 'minute', 'second', 'millisecond']\n",
"projection = [table.timestamp_col] + [get_field(x) for x in fields]\n",
"table[projection].limit(10)"
"fields = ['year', 'month', 'day'] # datetime fields can also use: 'hour', 'minute', 'second', 'millisecond'\n",
"projection = [independence.independence_date] + [get_field(x) for x in fields]\n",
"independence[projection].limit(10)"
]
},
{
Expand All @@ -469,7 +487,9 @@
"metadata": {},
"outputs": [],
"source": [
"table[table.timestamp_col.min(), table.timestamp_col.max(), table.count().name('nrows')]"
"independence[independence.independence_date.min(),\n",
" independence.independence_date.max(),\n",
" independence.count().name('nrows')].distinct()"
]
},
{
Expand All @@ -478,27 +498,14 @@
"metadata": {},
"outputs": [],
"source": [
"table[table.timestamp_col < '2010-01-01'].count()"
"independence[independence.independence_date > '2000-01-01'].count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"table[table.timestamp_col < \n",
" (ibis.timestamp('2010-01-01') + ibis.interval(months=3))].count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"expr = (table.timestamp_col + ibis.interval(days=1) + ibis.interval(hours=4)).name('offset')\n",
"table[table.timestamp_col, expr, ibis.now().name('current_time')].limit(10)"
"Some backends support adding offsets, for example `independence.independence_date + ibis.interval(days=1)` or `ibis.now() - independence.independence_date`."
]
}
],
@@ -518,9 +525,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
"nbformat_minor": 4
}
577 changes: 29 additions & 548 deletions docs/source/tutorial/05-IO-Create-Insert-External-Data.ipynb

Large diffs are not rendered by default.

208 changes: 208 additions & 0 deletions docs/source/tutorial/06-Advanced-Topics-ComplexFiltering.ipynb
@@ -0,0 +1,208 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Advanced Topics: Additional Filtering\n",
"\n",
"The filtering examples we've shown to this point have been pretty simple, either comparisons between columns or fixed values, or set filter functions like `isin` and `notin`. \n",
"\n",
"Ibis supports a number of richer analytical filters that can involve one or more of:\n",
"\n",
"- Aggregates computed from the same or other tables\n",
"- Conditional aggregates (in SQL-speak these are similar to \"correlated subqueries\")\n",
"- \"Existence\" set filters (equivalent to the SQL `EXISTS` and `NOT EXISTS` keywords)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import ibis\n",
"\n",
"ibis.options.interactive = True\n",
"\n",
"connection = ibis.sqlite.connect(os.path.join('data', 'geography.db'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using scalar aggregates in filters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"countries = connection.table('countries')\n",
"countries.limit(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could always compute some aggregate value from the table and use that in another expression, or we can use a data-derived aggregate in the filter. Take the average of a column. For example the average of countries size:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"countries.area_km2.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use this expression as a substitute for a scalar value in a filter, and the execution engine will combine everything into a single query rather than having to access the database multiple times. For example, we want to filter European countries larger than the average country size in the world. See how most countries in Europe are smaller than the world average:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cond = countries.area_km2 > countries.area_km2.mean()\n",
"expr = countries[(countries.continent == 'EU') & cond]\n",
"expr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conditional aggregates\n",
"\n",
"\n",
"Suppose that we wish to filter using an aggregate computed conditional on some other expressions holding true.\n",
"\n",
"For example, we want to filter European countries larger than the average country size, but this time of the average in Africa. African countries have an smaller size compared to the world average, and France gets into the list:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"conditional_avg = countries[countries.continent == 'AF'].area_km2.mean()\n",
"countries[(countries.continent == 'EU') & (countries.area_km2 > conditional_avg)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## \"Existence\" filters\n",
"\n",
"\n",
"Some filtering involves checking for the existence of a particular value in a column of another table, or amount the results of some value expression. This is common in many-to-many relationships, and can be performed in numerous different ways, but it's nice to be able to express it with a single concise statement and let Ibis compute it optimally.\n",
"\n",
"An example could be finding all countries that had **any** year with a higher GDP than 3 trillion US dollars:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gdp = connection.table('gdp')\n",
"gdp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cond = ((gdp.country_code == countries.iso_alpha3) &\n",
" (gdp.value > 3e12)).any()\n",
"\n",
"countries[cond]['name']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note how this is different than a join between `countries` and `gdp`, which would return one row per year. The method `.any()` is equivalent to filtering with a subquery."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering in aggregations\n",
"\n",
"\n",
"Suppose that you want to compute an aggregation with a subset of the data for _only one_ of the metrics / aggregates in question, and the complete data set with the other aggregates. Most aggregation functions are thus equipped with a `where` argument. Let me show it to you in action:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"arctic = countries.name.isin(['United States',\n",
" 'Canada',\n",
" 'Finland',\n",
" 'Greenland',\n",
" 'Iceland',\n",
" 'Norway',\n",
" 'Russia',\n",
" 'Sweden'])\n",
"\n",
"metrics = [countries.count().name('# countries'),\n",
" countries.population.sum().name('total population'),\n",
" countries.population.sum(where=arctic).name('population arctic countries')]\n",
"\n",
"(countries.groupby(countries.continent)\n",
" .aggregate(metrics))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
331 changes: 0 additions & 331 deletions docs/source/tutorial/06-Advanced-Topics-TopK-SelfJoins.ipynb

This file was deleted.
