38 changes: 38 additions & 0 deletions docs/backend_table_hiearchy.qmd
@@ -0,0 +1,38 @@
---
title: Backend Table Hierarchy
---

Several SQL backends support two levels of hierarchy in organizing tables
(although the levels are also used for other purposes, like data access,
billing, etc.).

Ibis uses the following terminology:

- `database`: a collection of tables
- `catalog`: a collection of databases

Below is a table with the terminology used by each backend for the two levels of
hierarchy. This is provided as a reference; note that when using Ibis, we will
use the terms `catalog` and `database` and map them onto the appropriate fields.


| Backend | Catalog | Database |
|------------|----------------|------------|
| bigquery | project | database |
| clickhouse | | database |
| dask | | NA |
| datafusion | catalog | schema |
| druid | dataSourceType | dataSource |
| duckdb | database | schema |
| flink | catalog | database |
| impala | | database |
| mssql | database | schema |
| mysql | | database |
| oracle | | database |
| pandas | | NA |
| polars | | NA |
| postgres | database | schema |
| pyspark | | database |
| risingwave | database | schema |
| snowflake | | database |
| trino | catalog | schema |
4 changes: 1 addition & 3 deletions docs/backends/datafusion.qmd
@@ -1,8 +1,6 @@
# DataFusion

[https://arrow.apache.org/datafusion](https://arrow.apache.org/datafusion)

{{< include /_callouts/experimental_backend.qmd >}}
[https://datafusion.apache.org/](https://datafusion.apache.org/)

![](https://img.shields.io/badge/memtables-not supported- grey?style=flat-square) ![](https://img.shields.io/badge/inputs-CSV | Delta Lake | Parquet-blue?style=flat-square) ![](https://img.shields.io/badge/outputs-CSV | Delta Lake | pandas | Parquet | PyArrow-orange?style=flat-square)

2 changes: 0 additions & 2 deletions docs/backends/druid.qmd
@@ -2,8 +2,6 @@

[https://druid.apache.org](https://druid.apache.org)

{{< include /_callouts/experimental_backend.qmd >}}

![](https://img.shields.io/badge/memtables-fallback-yellow?style=flat-square) ![](https://img.shields.io/badge/inputs-Druid tables-blue?style=flat-square) ![](https://img.shields.io/badge/outputs-Druid tables | CSV | pandas | Parquet | PyArrow-orange?style=flat-square)

## Install
35 changes: 35 additions & 0 deletions docs/backends/duckdb.qmd
@@ -147,6 +147,41 @@ You can store it as an environment variable to avoid having to log in again:
['penguins']
```

## Geospatial

The DuckDB backend has experimental support for
[geospatial](../posts/ibis-duckdb-geospatial/index.qmd) operations.

To enable geospatial operations, install the `geospatial` extra or otherwise install
the additional required dependencies using one of the commands below.

See [`read_geo`](#ibis.backends.duckdb.Backend.read_geo) for some tips on reading
in geospatial data.


::: {.panel-tabset}

## `pip`

```bash
pip install 'ibis-framework[geospatial]'
```


## `conda`

```bash
conda install -c conda-forge geopandas 'shapely>=2,<3'
```

## `mamba`

```bash
mamba install -c conda-forge geopandas 'shapely>=2,<3'
```

:::
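
Once the dependencies are installed, reading a geospatial file returns an
ordinary Ibis table expression. Here is a minimal sketch; the file name is just
a placeholder for your own data:

```python
import ibis

con = ibis.duckdb.connect()

# "zones.geojson" is a hypothetical local file
zones = con.read_geo("zones.geojson")
zones.head()
```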

```{python}
#| echo: false
BACKEND = "DuckDB"
6 changes: 1 addition & 5 deletions docs/backends/exasol.qmd
@@ -73,11 +73,7 @@ con = ibis.exasol.connect(
user = "username",
password = "password",
host = "localhost",
port = 8563,
schema = None,
encryption = True,
certificate_validation = True,
encoding = "en_US.UTF-8"
port = 8563
)
```

12 changes: 2 additions & 10 deletions docs/backends/impala.qmd
@@ -126,13 +126,7 @@ The best way to interact with a single table is through the
render_methods(
table,
"drop",
"drop_partition",
"files",
"insert",
"is_partitioned",
"partition_schema",
"partitions",
"refresh",
"describe_formatted",
)
```
@@ -202,7 +196,7 @@ table or database.
```{python}
#| echo: false
#| output: asis
render_methods(get_object("ibis.backends.base.sql", "BaseSQLBackend"), "table")
render_methods(get_object("ibis.backends.sql", "SQLBackend"), "table")
```

The client's `table` method allows you to create an Ibis table
@@ -1065,9 +1059,7 @@ func.register(fuzzy_equals.name, udf_db)
The object `fuzzy_equals` is callable and works with Ibis expressions:

```python
>>> db = c.database('ibis_testing')

>>> t = db.functional_alltypes
>>> t = con.tables.functional_alltypes

>>> expr = fuzzy_equals(t.float_col, t.double_col / 10)

31 changes: 31 additions & 0 deletions docs/backends/mssql.qmd
@@ -71,14 +71,32 @@ con = ibis.mssql.connect() # <1>

### `ibis.mssql.connect`

::: {.panel-tabset}

## Windows

```python
con = ibis.mssql.connect(
user="username",
password="password",
host="hostname",
driver="SQL Server",
)
```

## OSX / Linux

```python
con = ibis.mssql.connect(
user="username",
password="password",
host="hostname",
driver="FreeTDS", # if you are using unixODBC
)
```

:::

::: {.callout-note}
`ibis.mssql.connect` is a thin wrapper around [`ibis.backends.mssql.Backend.do_connect`](#ibis.backends.mssql.Backend.do_connect).
:::
@@ -102,6 +120,19 @@ passing a properly-formatted MSSQL connection URL to `ibis.connect`:
con = ibis.connect(f"mssql://{user}:{password}@{host}:{port}")
```

### Using Integrated Authentication

Integrated Authentication allows users to log in via NTLM or Kerberos.
To make use of integrated authentication, use `ibis.mssql.connect` as documented
above, but do not specify values for `user` or `password`. The PyODBC driver
will then default to Integrated Authentication.
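
For example, a minimal sketch (the driver name below is an assumption; use
whichever ODBC driver is installed on your system):

```python
con = ibis.mssql.connect(
    host="hostname",
    database="database",
    driver="ODBC Driver 18 for SQL Server",  # assumption: adjust to your installed driver
)
```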

Some environments may require you to set additional keywords to connect.
Additional information is available at the following pages:

- https://learn.microsoft.com/en-us/aspnet/web-api/overview/security/integrated-windows-authentication
- https://learn.microsoft.com/en-us/sql/connect/odbc/linux-mac/using-integrated-authentication

```{python}
#| echo: false
BACKEND = "MSSQL"
2 changes: 0 additions & 2 deletions docs/backends/oracle.qmd
@@ -2,8 +2,6 @@

[https://docs.oracle.com/database/oracle/oracle-database](https://docs.oracle.com/database/oracle/oracle-database)

{{< include /_callouts/experimental_backend.qmd >}}

![](https://img.shields.io/badge/memtables-fallback-yellow?style=flat-square) ![](https://img.shields.io/badge/inputs-Oracle tables-blue?style=flat-square) ![](https://img.shields.io/badge/outputs-Oracle tables | CSV | pandas | Parquet | PyArrow-orange?style=flat-square)

## Install
2 changes: 0 additions & 2 deletions docs/backends/polars.qmd
@@ -2,8 +2,6 @@

[https://www.pola.rs](https://www.pola.rs)

{{< include /_callouts/experimental_backend.qmd >}}

![](https://img.shields.io/badge/memtables-native-green?style=flat-square) ![](https://img.shields.io/badge/inputs-CSV | Delta Lake | pandas | Parquet-blue?style=flat-square) ![](https://img.shields.io/badge/outputs-CSV | pandas | Delta Lake | Parquet | PyArrow-orange?style=flat-square)

## Install
104 changes: 104 additions & 0 deletions docs/backends/risingwave.qmd
@@ -0,0 +1,104 @@
# RisingWave

[https://risingwave.com/](https://risingwave.com/)

![](https://img.shields.io/badge/memtables-fallback-yellow?style=flat-square) ![](https://img.shields.io/badge/inputs-Postgres tables-blue?style=flat-square) ![](https://img.shields.io/badge/outputs-Postgres tables | CSV | pandas | Parquet | PyArrow-orange?style=flat-square)

## Install

Install Ibis and dependencies for the RisingWave backend:

::: {.panel-tabset}

## `pip`

Install with the `risingwave` extra:

```{.bash}
pip install 'ibis-framework[risingwave]'
```

And connect:

```{.python}
import ibis
con = ibis.risingwave.connect() # <1>
```

1. Adjust connection parameters as needed.

## `conda`

Install for RisingWave:

```{.bash}
conda install -c conda-forge ibis-risingwave
```

And connect:

```{.python}
import ibis
con = ibis.risingwave.connect() # <1>
```

1. Adjust connection parameters as needed.

## `mamba`

Install for RisingWave:

```{.bash}
mamba install -c conda-forge ibis-risingwave
```

And connect:

```{.python}
import ibis
con = ibis.risingwave.connect() # <1>
```

1. Adjust connection parameters as needed.

:::

## Connect

### `ibis.risingwave.connect`

```python
con = ibis.risingwave.connect(
user="username",
password="password",
host="hostname",
port=4566,
database="database",
)
```

::: {.callout-note}
`ibis.risingwave.connect` is a thin wrapper around
[`ibis.backends.risingwave.Backend.do_connect`](#ibis.backends.risingwave.Backend.do_connect).
:::


### Connection Parameters

```{python}
#| echo: false
#| output: asis
from _utils import render_do_connect
render_do_connect("risingwave")
```

```{python}
#| echo: false
BACKEND = "Risingwave"
```

{{< include ./_templates/api.qmd >}}
2 changes: 0 additions & 2 deletions docs/backends/trino.qmd
@@ -2,8 +2,6 @@

[https://trino.io](https://trino.io)

{{< include /_callouts/experimental_backend.qmd >}}

![](https://img.shields.io/badge/memtables-fallback-yellow?style=flat-square) ![](https://img.shields.io/badge/inputs-Trino tables-blue?style=flat-square) ![](https://img.shields.io/badge/outputs-Trino tables | CSV | pandas | Parquet | PyArrow-orange?style=flat-square)

## Install
Expand Down
102 changes: 102 additions & 0 deletions docs/backends_sankey.py
@@ -0,0 +1,102 @@
from __future__ import annotations

import plotly.graph_objects as go


def to_greyish(hex_code, grey_value=128):
hex_code = hex_code.lstrip("#")
r, g, b = int(hex_code[0:2], 16), int(hex_code[2:4], 16), int(hex_code[4:6], 16)

new_r = (r + grey_value) // 2
new_g = (g + grey_value) // 2
new_b = (b + grey_value) // 2

new_hex_code = f"#{new_r:02x}{new_g:02x}{new_b:02x}"

return new_hex_code


category_colors = {
"Ibis API": "#7C65A0",
"SQL": "#6A9BC9",
"DataFrame": "#D58273",
}

backend_categories = {
list(category_colors.keys())[1]: [
"BigQuery",
"ClickHouse",
"DataFusion",
"Druid",
"DuckDB",
"Exasol",
"Flink",
"Impala",
"MSSQL",
"MySQL",
"Oracle",
"PostgreSQL",
"PySpark",
"RisingWave",
"Snowflake",
"SQLite",
"Trino",
],
list(category_colors.keys())[2]: ["Dask", "pandas", "Polars"],
}

nodes, links = [], []
node_index = {}

nodes.append({"label": "Ibis API", "color": category_colors["Ibis API"]})
node_index["Ibis API"] = 0

idx = 1
for category, backends in backend_categories.items():
nodes.append({"label": category, "color": category_colors[category]})
node_index[category] = idx
links.append({"source": 0, "target": idx, "value": len(backends)})
idx += 1

for backend in backends:
if backend not in node_index:
nodes.append({"label": backend, "color": category_colors[category]})
node_index[backend] = idx
idx += 1
links.append(
{
"source": node_index[category],
"target": node_index[backend],
"value": 1,
}
)

fig = go.Figure(
data=[
go.Sankey(
node=dict(
pad=20,
thickness=20,
line=dict(color="grey", width=0.5),
label=[node["label"] for node in nodes],
color=[node["color"] for node in nodes],
),
link=dict(
source=[link["source"] for link in links],
target=[link["target"] for link in links],
value=[link["value"] for link in links],
line=dict(color="grey", width=0.5),
color=[to_greyish(nodes[link["target"]]["color"]) for link in links],
),
)
],
)

fig.update_layout(
title_text="Ibis backend types",
font_size=24,
# font_family="Arial",
title_font_size=30,
margin=dict(l=30, r=30, t=80, b=30),
template="plotly_dark",
)
1 change: 0 additions & 1 deletion docs/concepts/composable-ecosystem.qmd
@@ -46,7 +46,6 @@ About pandas'"](https://wesmckinney.com/blog/apache-arrow-pandas-internals/).

Ibis started as a pandas-like API for Apache Impala, but has since expanded to
support many backends. It currently leverages open-source projects like
[SQLAlchemy](https://github.com/sqlalchemy/sqlalchemy) and
[SQLGlot](https://github.com/tobymao/sqlglot) to work with many backends. While
these projects are great, they rely on backend-specific SQL that does not
constitute a standard. Going forward, we expect ADBC and Substrait to be the
117 changes: 117 additions & 0 deletions docs/concepts/datatypes.qmd
@@ -0,0 +1,117 @@
---
title: Datatypes and Datashapes
---

Every value in Ibis has two important properties: a type and a shape.

The type is probably familiar to you. It is something like

- `Integer`
- `Floating`
- `String`
- `Array`

The shape is one of

- `Scalar` (a single value)
- `Column` (a series of values)

## Datatype Flavors

For some datatypes, there are further options that define them.
For instance, `Integer` values can be signed or unsigned, and
they have a precision, for example "uint8", "int64", etc.
These flavors don't affect their capabilities
(e.g. both signed and unsigned ints have a `.abs()` method),
but the flavor does impact how the underlying backend performs the computation.

## Capabilities

Depending on the combination of datatype and datashape, a value has
different capabilities. For example:

- All `String` values (both `StringScalars` and `StringColumns`) have the
method `.upper()` that transforms the string to uppercase.
`Floating` and `Array` values don't have this method, of course.
- `IntegerColumn` and `FloatingColumn` values have `.mean()`, `.max()`, etc methods
because you can aggregate over them, since they are a collection of values.
On the other hand, `IntegerScalar` and `FloatingScalar` values do **not** have these
methods, because it doesn't make sense to take the mean or max of a single value.
- If you call `.to_pandas()` on these values, you get different results.
`Scalar` shapes result in scalar objects:
- `IntegerScalar`: NumPy `int64` object (or whatever specific flavor).
- `FloatingScalar`: NumPy `float64` object (or whatever specific flavor).
- `StringScalar`: plain python `str` object.
- `ArrayScalar`: plain python `list` object.
- On the other hand, `Column` shapes result in `pandas.Series`:
- `IntegerColumn`: pd.Series of integers, with the same flavor.
For example, if the `IntegerColumn` was specifically "uint16",
then the pandas series will hold a numpy array of type "uint16".
- `FloatingColumn`: pd.Series of numpy floats with the same flavor.
- etc.
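
As a quick sketch of these differences (using the `penguins` example table that
also appears below):

```python
import ibis

t = ibis.examples.penguins.fetch()

t.species.upper()                    # StringColumn -> StringColumn
t.bill_length_mm.mean()              # FloatingColumn -> FloatingScalar
t.bill_length_mm.mean().to_pandas()  # a single float
t.species.to_pandas()                # a pandas.Series of strings
```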

## Broadcasting and Alignment

There are rules for how different datashapes are combined. This is similar to
how SQL and NumPy handle combining shapes, if you are familiar with them.

```{python}
import ibis
ibis.options.interactive = True
t1 = ibis.examples.penguins.fetch().head(100)
t1
```

We can look at the datatype of the `year` column:

```{python}
t1.year.type()
```

Combining two `Scalar`s results in a `Scalar`:

```{python}
t1.year.mean() + t1.year.std()
```

Combining a `Column` and `Scalar` results in a `Column`:

```{python}
t1.year + 1000
```

Combining two `Column`s results in a `Column`:

```{python}
t1.year + t1.bill_length_mm
```

One requirement that might surprise you if you are coming from NumPy is
Ibis's requirement on aligning `Column`s: in NumPy, if you have two arbitrary
arrays, each of length 100, you can add them together, and it works because the
elements are "lined up" based on position. Ibis is different. Because it is based
around SQL, and SQL has no notion of inherent row ordering, you cannot "line up"
any two `Column`s in Ibis: they both **have** to be derived from the same
`Table` expression. For example:

```{python}
t2 = ibis.examples.population.fetch().head(100)
t2
```

```{python}
#| error: true
t1.bill_depth_mm + t2.population
```

If you want to use these two columns together, you would need to join the tables together first:

```{python}
j = ibis.join(t1, t2, "year")
j
```

```{python}
j.bill_depth_mm + j.population
```
5 changes: 2 additions & 3 deletions docs/concepts/internals.qmd
@@ -125,11 +125,10 @@ The next major component of Ibis is the compilers.

The first few versions of Ibis directly generated strings, but the compiler
infrastructure was generalized to support compilation of
[SQLAlchemy](https://docs.sqlalchemy.org/en/latest/tutorial/index.html#unified-tutorial) based
expressions.
[SQLGlot](https://sqlglot.com/)-based expressions.

The compiler works by translating the different pieces of SQL expression into a
string or SQLAlchemy expression.
string or SQLGlot expression.

The main pieces of a `SELECT` statement are:

8 changes: 8 additions & 0 deletions docs/concepts/user-testimonials.qmd
@@ -75,6 +75,14 @@ GitHub](https://github.com/ibis-project/ibis/issues/7341#issuecomment-1760625921
> Sadly, I didn't come across Ibis at that time and even contemplated creating a
> set on my own.
---

From [Mark Druffel on Kedro Slack](https://linen-slack.kedro.org/t/16404402/does-anyone-know-if-kedro-viz-should-work-with-data-factorie#bbd4bcf4-fe46-4538-af13-d827bd0aa94a):

> I now have catalog entries that use vanilla pyspark in my databricks environment and
> polars on my laptop which is pretty slick 🔥 Just thought I'd share since I've seen
> your team has been mentioning ibis a bit.

## Have a story to share?

Let us know! We'd love to include it here. Please share your experience with
36 changes: 26 additions & 10 deletions docs/concepts/who.qmd
@@ -8,20 +8,34 @@ guide](/contribute).
## Voltron Data

[Voltron Data](https://voltrondata.com) is the primary sponsor of Ibis, with
most of the core development team employed there. As of writing, this includes
five full-time developers, one technical product manager, and other staff who
contribute to Ibis.
most of the core development team employed there. This includes nine full-time
developers, one technical product manager, and other staff who contribute to
Ibis.

::: {.callout-tip title="Why does Voltron Data support Ibis?"}
Check out the [blog post on why Voltron Data supports
Ibis](../posts/why-voda-supports-ibis/index.qmd).
:::

## Other companies

Ibis is used by many other companies, with various tools built on top of it.
Some include:

- [Google BigQuery DataFrames](https://github.com/googleapis/python-bigquery-dataframes), a clone of the pandas API built on Ibis
- [Starburst Galaxy Python DataFrames](https://www.starburst.io/blog/introducing-python-dataframes/), with support for Ibis
- [Claypot AI's contribution of the Flink backend](https://github.com/claypotai/ibis-flink-example), working in collaboration with Voltron Data
- [Microsoft's Magpie project](https://www.microsoft.com/en-us/research/project/magpie-2/), built on top of Ibis
- [SuperDuperDB](https://github.com/SuperDuperDB/superduperdb), bringing AI to any backend Ibis supports
- [Google BigQuery
DataFrames](https://github.com/googleapis/python-bigquery-dataframes), a clone
of the pandas API built on Ibis
- [Starburst Galaxy Python
DataFrames](https://www.starburst.io/blog/introducing-python-dataframes/), with
support for Ibis
- [Claypot AI's contribution of the Flink
backend](https://github.com/claypotai/ibis-flink-example), working in
collaboration with Voltron Data
- [Microsoft's Magpie
project](https://www.microsoft.com/en-us/research/project/magpie-2/), built on
top of Ibis
- [SuperDuperDB](https://github.com/SuperDuperDB/superduperdb), bringing AI to
any backend Ibis supports

Ibis is also contributed to by other companies. You can [look through the full
list of contributors on
@@ -41,8 +55,10 @@ Wes, Voltron Data, and others to solve problems seen throughout the space that
are compounding as data volume and AI complexity increase. Some good background
material on the composable data ecosystem and Ibis can be found at:

- ["Apache Arrow and the '10 Things I Hate About pandas'" by Wes](https://wesmckinney.com/blog/apache-arrow-pandas-internals/)
- ["The Road to Composable Data Systems: Thoughts on the Last 15 Years and the Future" by Wes](https://wesmckinney.com/blog/looking-back-15-years/)
- ["Apache Arrow and the '10 Things I Hate About pandas'" by
Wes](https://wesmckinney.com/blog/apache-arrow-pandas-internals/)
- ["The Road to Composable Data Systems: Thoughts on the Last 15 Years and the
Future" by Wes](https://wesmckinney.com/blog/looking-back-15-years/)
- ["The Composable Codex" by Voltron Data](https://voltrondata.com/codex)

## Support for production workloads
184 changes: 93 additions & 91 deletions docs/contribute/01_environment.qmd
@@ -11,67 +11,31 @@ hide:

::: {.panel-tabset}

## pip

::: {.callout-warning}
`pip` will not handle installation of system dependencies

`pip` will not install system dependencies needed for some packages such as `psycopg2` and `kerberos`.

For a better development experience see the `conda` or `nix` setup instructions.
:::

1. [Install `gh`](https://cli.github.com/manual/installation)

1. Fork and clone the ibis repository:

```sh
gh repo fork --clone --remote ibis-project/ibis
```

1. Change directory into `ibis`:

```sh
cd ibis
```

1. Install development dependencies

```sh
pip install 'poetry==1.7.1'
pip install -r requirements-dev.txt
```

1. Install ibis in development mode

```sh
pip install -e '.[all]'
```

## Conda

::: {.callout-note}
Some optional dependencies for Windows are not available through `conda`/`mamba`
## macOS users on arm64 CPUs must use a different setup
If you are working with an `arm64` architecture (Mac M1/M2), go directly to
the `conda/mamba` setup and see the note there about `arm64`
architectures.
:::

### Support matrix
## Conda/Mamba

### Support matrix [^conda-callout]

| Python Version | Python 3.9 | Python 3.10 | Python 3.11 |
| -------------------- | ---------------------------- | ---------------- | ---------------- |
| **Operating System** | | | |
| **Linux** | {{< fa check >}}[^supported] | {{< fa check >}} | {{< fa check >}} |
| **macOS (x86_64)** | {{< fa check >}} | {{< fa check >}} | {{< fa check >}} |
| **macOS (aarch64)** | {{< fa check >}} | {{< fa check >}} | {{< fa check >}} |
| **Windows** | {{< fa check >}} | {{< fa check >}} | {{< fa check >}} |
| Python Version {{< fa arrow-right >}} | Python 3.9 | Python 3.10 | Python 3.11 |
| -----------------------------------------: | :--------------------------: | :--------------: | :--------------: |
| **Operating System** {{< fa arrow-down >}} | | | |
| **Linux** | {{< fa check >}}[^supported] | {{< fa check >}} | {{< fa check >}} |
| **macOS (x86_64)** | {{< fa check >}} | {{< fa check >}} | {{< fa check >}} |
| **macOS (aarch64)** | {{< fa check >}} | {{< fa check >}} | {{< fa check >}} |
| **Windows** | {{< fa check >}} | {{< fa check >}} | {{< fa check >}} |

::: {.panel-tabset}

```{python}
#| echo: false
#| output: asis
import itertools
managers = {
"conda": {
"name": "Miniconda",
@@ -80,67 +44,60 @@ managers = {
"mamba": {
"name": "Mamba",
"url": "https://github.com/mamba-org/mamba",
}
}
platforms = {
"Linux": "linux-64",
"macOS (x86_64)": "osx-64",
"macOS (aarch64)": "osx-arm64", "Windows": "win-64"
},
}
for manager, params in managers.items():
# tabsets can't be nested so the count is manual
step = itertools.count(1)
name = params["name"]
url = params["url"]
print(f"## `{manager}`")
print()
print(f"{next(step)}. Install [{name}]({url})")
print(f"1. Install [{name}]({url})")
print()
print(f"{next(step)}. Install `gh`")
print(f"1. Install `gh`")
print()
print(" ```sh")
print(f" {manager} install -c conda-forge gh")
print(" ```")
print()
print(f"{next(step)}. Fork and clone the ibis repository:")
print(f"1. Fork and clone the ibis repository:")
print()
print(" ```sh")
print(" gh repo fork --clone --remote ibis-project/ibis")
print(" ```")
print()
print(f"{next(step)}. Create a Conda environment from a lock file in the repo:")
print(f"1. Create a Conda environment using `environment.yml`")
print()
print("::: {.panel-tabset}")
print(" ```sh")
print(" cd ibis")
print(f" {manager} env create --file conda/environment.yml")
print(" ```")
print()
for os, platform in platforms.items():
print(f"## {os}")
if manager == "conda" or manager == "mamba":
print()
print("```sh")
print(f"# Create a dev environment for {platform}")
print("cd ibis")
print(f"{manager} create -n ibis-dev --file=ci/conda-lock/{platform}/3.10.lock")
print("```")
print(
" If you're using arm64 architecture (Mac M1/M2), use "
"`conda/environment-arm64.yml` for setting up a dev environment for all the "
"backends that are possible to install excluding Flink; use "
"`conda/environment-arm64-flink.yml` for setting up a dev environment for all "
"the backends that are possible to install including Flink. The reason to have "
"two separate environments is because `apache-flink` forces `pyarrow` to "
"downgrade to 11.0, which causes conflicts in other backends."
)
print()
print()
print(":::")
print()
print(f"{next(step)}. Activate the environment")
print(f"1. Activate the environment")
print()
print(" ```sh")
print(f" {manager} activate ibis-dev")
print(" ```")
print()
print(f"{next(step)}. Install your local copy of `ibis` into the Conda environment")
print(
f"1. Install your local copy of `ibis` into the {manager.capitalize()} environment"
)
print()
print(" ```sh")
print(" cd ibis")
print(" pip install -e '.[all]'")
print(" pip install -e .")
print(" ```")
print()
```
@@ -149,15 +106,15 @@ for manager, params in managers.items():

## Nix

#### Support Matrix
### Support matrix

| Python Version {{< fa arrow-right >}} | Python 3.9 | Python 3.10 | Python 3.11 |
| -----------------------------------------: | :---------------------------------------: | :-----------------------------: | :-----------------------------: |
| **Operating System** {{< fa arrow-down >}} | | | |
| **Linux** | {{< fa check >}} [^supported] | {{< fa check >}} | {{< fa check >}} |
| **macOS (x86_64)** | {{< fa check >}} | {{< fa check >}} | {{< fa check >}} |
| **macOS (aarch64)** | {{< fa triangle-exclamation >}}[^unknown] | {{< fa triangle-exclamation >}} | {{< fa triangle-exclamation >}} |
| **Windows** | {{< fa ban >}}[^unlikely] | {{< fa ban >}} | {{< fa ban >}} |
| Python Version {{< fa arrow-right >}} | Python 3.9 | Python 3.10 | Python 3.11 |
| -----------------------------------------: | :--------------------------: | :-----------------------: | :-----------------------: |
| **Operating System** {{< fa arrow-down >}} | | | |
| **Linux** | {{< fa check >}}[^supported] | {{< fa check >}} | {{< fa check >}} |
| **macOS (x86_64)** | {{< fa check >}} | {{< fa check >}} | {{< fa check >}} |
| **macOS (arm64/M1/M2)** | {{< fa ban >}}[^m1] | {{< fa ban >}}[^m1] | {{< fa ban >}}[^m1] |
| **Windows** | {{< fa ban >}}[^unlikely] | {{< fa ban >}}[^unlikely] | {{< fa ban >}}[^unlikely] |

1. [Install `nix`](https://nixos.org/download.html)
1. Configure `nix`
@@ -208,18 +165,63 @@ for manager, params in managers.items():
This will launch a `bash` shell with all of the required dependencies installed.
This may take a while due to artifact download from the cache.


## pip

::: {.callout-warning}
## `pip` will not handle installation of system dependencies

`pip` will not install system dependencies needed for some packages such as `psycopg2` and `kerberos`.

For a better development experience see the `conda/mamba` or `nix` setup instructions.
:::

1. [Install `gh`](https://cli.github.com/manual/installation)

1. Fork and clone the ibis repository:

```sh
gh repo fork --clone --remote ibis-project/ibis
```

1. Change directory into `ibis`:

```sh
cd ibis
```

1. Install development dependencies

```sh
pip install 'poetry==1.8.2'
pip install -r requirements-dev.txt
```

1. Install ibis in development mode

```sh
pip install -e .
```

:::

## Code style and formatting

Ibis uses several code linters and has a style guide for documentation. Please
check out the [style and formatting guide](./03_style.qmd) for instructions on how to set things up.

## Building the docs

Install [`just`](https://just.systems/man/en/chapter_4.html) and run
Install [`just`](https://just.systems/man/en/chapter_4.html) (if you installed
via `conda/mamba` you are covered) and run

```bash
just docs-preview
```

to build and serve the documentation.

[^conda-callout]: Some optional dependencies for Windows and macOS are not available through `conda`/`mamba`.
[^supported]: Tested in CI. If this doesn't work for you, please [file an issue](https://github.com/ibis-project/ibis/issues/new).
[^unknown]: Not tested in CI. Unknown if this setup works.
[^m1]: Not working as of March 2024. See https://github.com/ibis-project/ibis/issues/8696
[^unlikely]: Unlikely to ever be supported or no upstream support.
206 changes: 189 additions & 17 deletions docs/contribute/02_workflow.qmd
@@ -1,31 +1,38 @@
# Contribute to the Ibis codebase
---
title: "Contribute to the Ibis codebase"
---

## Getting started
## Workflow

### Getting started

First, set up a [development environment](./01_environment.qmd).

## Taking issues
If you're developing on Mac M1 or M2, see [docs for setting up Colima](#set-up-colima).

### Taking issues

If you find an issue you want to work on, write a comment with the text
`/take` on the issue. GitHub will then assign the issue to you.

## Running the test suite
### Running the test suite

To run tests that do not require a backend:

```sh
pytest -m core
```

### Backend test suites
#### Backend test suites

:::{.callout-tip}
You may be able to skip this section
:::
::: {.callout-tip}
## You may be able to skip this section

If you haven't made changes to the core of ibis (e.g., `ibis/expr`)
or any specific backends (`ibis/backends`) this material isn't necessary to
follow to make a pull request.
If you haven't made changes to the core of ibis (e.g., `ibis/expr`) or any
specific backends (`ibis/backends`), this material isn't necessary to
follow to make a pull request. You can jump directly to the
[Writing the commit](#writing-the-commit) section.
:::

First, we need to download example data to run the tests successfully:

@@ -39,17 +46,39 @@ To run the tests for a specific backend (e.g. sqlite):
pytest -m sqlite
```

## Setting up non-trivial backends
### Setting up non-trivial backends

::: {.callout-note}
## macOS users on arm64 CPUs must use a different setup

If you are working with an `arm64` architecture (Mac M1/M2), you can set up Docker
with [Colima](https://github.com/abiosoft/colima); refer to
[set up Colima](#set-up-colima). Alternatively, you can install Docker Desktop.
:::

These client-server backends need to be started before testing them.
They can be started with `docker compose` directly, or using the `just` tool.

- ClickHouse: `just up clickhouse`
- PostgreSQL: `just up postgres`
- Exasol: `just up exasol` (no `arm64` support)
- Flink: `just up flink`
- Impala: `just up impala`
- SQL Server: `just up mssql`
- MySQL: `just up mysql`
- impala: `just up impala`
- Oracle: `just up oracle`
- PostgreSQL: `just up postgres`
- RisingWave: `just up risingwave`
- Trino: `just up trino`
- Druid: `just up druid`

### Test the backend locally
and then run the test suite for the backend you just started. For example,
if you ran `just up postgres`:

```sh
pytest -m postgres
```

#### Test the backend locally

If anything seems amiss with a backend, you can of course test it locally:

@@ -58,7 +87,25 @@ export PGPASSWORD=postgres
psql -t -A -h localhost -U postgres -d ibis_testing -c "select 'success'"
```

## Writing the commit
#### Adding appropriate tests

If your pull request involves a new feature, you should add appropriate tests to cover
all ordinary and edge cases.

Pytest markers can be used to assert that a test should fail or raise a specific error.
We use a number of pytest markers in ibis:

- `pytest.mark.notimpl`: the backend can do a thing, we haven't mapped the op
- `pytest.mark.notyet`: the backend cannot do a thing, but might in the future
- `pytest.mark.never`: the backend will never support this / pass this test (common example
here is a test running on sqlite that relies on strong typing)
- `pytest.mark.broken`: this test broke and it's demonstrably unrelated to the PR I'm working
on and fixing it shouldn't block this PR from going in (but we should fix it up pronto)

Refrain from using a generic marker like `pytest.mark.xfail`.
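
As a rough, illustrative sketch (the backend name, reason, and test body are
made up; see the existing tests under `ibis/backends/tests` for the exact
conventions):

```python
import pytest


@pytest.mark.notimpl(["druid"], reason="operation not yet mapped for this backend")
def test_my_new_feature(con):
    ...
```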


### Writing the commit

Ibis follows the [Conventional Commits](https://www.conventionalcommits.org/) structure.
In brief, the commit summary should look like:
@@ -76,6 +123,131 @@ The type (e.g. `fix`) can be:

fixes #4242

## Submit a pull request
### Submit a pull request

Ibis follows the standard GitHub pull request process. The team will review the PR and merge when it's ready.

## Colima setup for Mac M1/M2 users {#set-up-colima}

[Colima](https://github.com/abiosoft/colima) is a container runtime that
supports Mac M1 and M2 CPUs.

If you are working with an M1 or M2 CPU architecture, you will need to
run Docker to be able to test some backends supported by Ibis. With Colima[^1],
you can run Docker on newer Macs without using Docker Desktop[^2].

[^1]: [Installing Colima as a Docker Engine provider with `buildx` and `compose` plugins installed](https://aosolorzano.medium.com/installing-colima-as-a-docker-engine-provider-with-buildx-and-compose-plugins-installed-1ce8b3bae158)
[^2]: [Goodbye Docker Desktop for Mac, Hello Colima](https://jacobtomlinson.dev/posts/2022/goodbye-docker-desktop-for-mac-hello-colima/)

### Get started

#### Uninstall Docker Desktop

If you have Docker Desktop installed, follow [^1] or [^2]; both of them have
a section on how to uninstall Docker Desktop.

#### Install Docker client with [Homebrew](https://brew.sh/)

Make sure your Homebrew installation is up to date.

```bash
$ brew install docker
```

#### Check installation by running

```bash
$ docker --version
```

Notice we haven't installed any Docker Engine yet, and only the Docker client
information is displayed.

#### Install Colima

```bash
$ brew install colima
```

#### Install Docker Compose plugin (optional)

If you want to be able to run `docker compose` or run `just up`, follow these
steps.

The official [Docker
documentation](https://docs.docker.com/compose/install/linux/) suggests
running the following commands in a separate terminal:

Replace the link in the `curl` step with the version you want; the releases are listed at
https://github.com/docker/compose/releases/

```bash
$ DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
$ mkdir -p $DOCKER_CONFIG/cli-plugins
$ curl -SL https://github.com/docker/compose/releases/download/v2.24.6/docker-compose-darwin-aarch64 -o $DOCKER_CONFIG/cli-plugins/docker-compose
```

Then add execution permission to the downloaded binary:

```bash
$ chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
```

Check it worked:

```bash
$ docker compose version
```

#### Start Colima

```bash
$ colima start
```

You can now run `docker version` and you will see the engine.

::: {.callout-note}
## By default this launches containers with 2 CPUs, 2GB of RAM, and 60GB of disk space.

You can modify this as well as the architecture by passing command line
arguments (`--cpu`, `--memory`, `--disk`, and `--arch`) to `colima start`.
:::
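
You can check that the instance is running and list existing instances with: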

```bash
$ colima status
$ colima list
```

In your Ibis clone

```bash
$ just up postgres
```

Once the `just` command finishes, you can run the tests by doing:

```bash
$ pytest -m postgres
```

Once you are done, you can stop the container by doing:

```bash
$ just down postgres
```

If you are done for the day, and want to avoid the Colima instance eating your
resources, you will want to stop it.

```bash
$ colima stop
```

If you upgraded Colima, you need to delete the existing instance. If you want to
modify the allocation, you also need to delete the existing instance (unless
you are only increasing it).

```bash
$ colima delete
```
6 changes: 1 addition & 5 deletions docs/contribute/05_reference.qmd
@@ -1,6 +1,4 @@
---
title: "Test Class Reference"
---
# Test class reference

This page provides a partial reference to the attributes, methods, properties
and class-level variables that are used to help configure a backend for the Ibis
@@ -20,8 +18,6 @@ from ibis.backends.tests.base import BackendTest
class TestConf(BackendTest):
"""Backend-specific class with information for testing."""

supports_divide_by_zero = True
supports_floating_modulus = False
returned_timestamp_unit = "us"
supports_structs = True
supports_json = True
22 changes: 22 additions & 0 deletions docs/fix-version-href.lua
@@ -0,0 +1,22 @@
function Header(el)
-- Walk the pandoc AST and find all links in Header elements
-- Pluck out the version string and set it if it hasn't been set already
local version = nil;

el = el:walk({
Link = function(link)
return link:walk({
Str = function(str)
if version == nil then
version = str.text
end
return str.text
end,
})
end,
})

el.attr.identifier = version

return el
end
10 changes: 5 additions & 5 deletions docs/how-to/analytics/basics.qmd
Expand Up @@ -58,20 +58,20 @@ t.mutate(bill_length_cm=t["bill_length_mm"] / 10).relocate(
Use the `.join()` method to join data:

```{python}
t.join(t, t["species"] == t["species"], how="left_semi")
t.join(t, ["species"], how="left_semi")
```

## Combining it all together

We can use [the underscore to chain expressions together](./chain_expressions.qmd).

```{python}
t.join(t, t["species"] == t["species"], how="left_semi").filter(
ibis._["species"] != "Adelie"
t.join(t, ["species"], how="left_semi").filter(
_.species != "Adelie"
).group_by(["species", "island"]).aggregate(
avg_bill_length=ibis._["bill_length_mm"].mean()
avg_bill_length=_.bill_length_mm.mean()
).order_by(
ibis._["avg_bill_length"].desc()
_.avg_bill_length.desc()
)
```

4 changes: 2 additions & 2 deletions docs/how-to/analytics/chain_expressions.qmd
Expand Up @@ -51,11 +51,11 @@ join = (
t1
# _ is t1
.join(t2, _.x == t2.x)
# _ is the join result:
# `xmod` is a deferred expression:
.mutate(xmod=xmod)
# _ is the TableExpression after mutate:
.group_by(_.xmod)
# `ct` is a ColumnExpression derived from a deferred expression:
# `ymax` and `zmax` are ColumnExpressions derived from a deferred expression:
.aggregate(ymax=ymax, zmax=zmax)
# _ is the aggregation result:
.filter(_.ymax == _.zmax)
276 changes: 276 additions & 0 deletions docs/how-to/analytics/imdb.qmd
@@ -0,0 +1,276 @@
---
title: Analyze IMDB data using Ibis
---

Let's use the Ibis examples module and the DuckDB backend to find some movies
to watch.

Adapted from [Phillip in the Cloud's livestream using the same
data](https://www.youtube.com/watch?v=J7sEn9VklKY).

## Imports

For this example, we'll just use Ibis.

```{python}
from ibis.interactive import * # <1>
```

1. This import makes `ibis.examples` available as `ex`.

## Fetch the example data

We can use the `ibis.examples` module to fetch the IMDB data. Ibis
automatically caches the data on disk so subsequent runs don't require fetching
from cloud storage on each call to `fetch`.

```{python}
name_basics = ex.imdb_name_basics.fetch()
name_basics
```

To ensure column names are Pythonic, we can rename as `snake_case`.

```{python}
name_basics.rename("snake_case")
```

Let's grab all of the relevant IMDB tables and rename columns.

```{python}
name_basics = ex.imdb_name_basics.fetch().rename("snake_case")
title_akas = ex.imdb_title_akas.fetch().rename("snake_case")
title_basics = ex.imdb_title_basics.fetch().rename("snake_case")
title_crew = ex.imdb_title_crew.fetch().rename("snake_case")
title_episode = ex.imdb_title_episode.fetch().rename("snake_case")
title_principals = ex.imdb_title_principals.fetch().rename("snake_case")
title_ratings = ex.imdb_title_ratings.fetch().rename("snake_case")
```

## Preview the data

We'll print out the first few rows of each table to get an idea of what is
contained in each.

```{python}
name_basics.head()
```

```{python}
title_akas.head()
```

```{python}
title_basics.head()
```

```{python}
title_crew.head()
```

```{python}
title_episode.head()
```

```{python}
title_principals.head()
```

```{python}
title_ratings.head()
```

## Basic data exploration

Let's check how many records are in each table. It's just Python, so we can
construct a dictionary and iterate through it in a for loop.

```{python}
tables = {
"name_basics": name_basics,
"title_akas": title_akas,
"title_basics": title_basics,
"title_crew": title_crew,
"title_episode": title_episode,
"title_principals": title_principals,
"title_ratings": title_ratings,
}
max_name_len = max(map(len, tables.keys())) + 1
```

```{python}
print("Length of tables:")
for t in tables:
print(f"\t{t.ljust(max_name_len)}: {tables[t].count().to_pandas():,}")
```

## Clean data

Looking at the data, the `nconst` and `tconst` columns seem to be unique
identifiers. Let's confirm and adjust them accordingly.

```{python}
name_basics.head()
```

Check the number of unique `nconst` values.

```{python}
name_basics.nconst.nunique()
```

Confirm it's equal to the number of rows.

```{python}
name_basics.nconst.nunique() == name_basics.count()
```

Mutate the table to convert `nconst` to an integer.

```{python}
t = name_basics.mutate(nconst=_.nconst.replace("nm", "").cast("int"))
t.head()
```

Let's also turn `primary_profession` into an array of strings instead of
a single comma-separated string.

```{python}
t = t.mutate(primary_profession=_.primary_profession.split(","))
t
```

And, combining the two concepts, convert `known_for_titles` into an array of
integers corresponding to `tconst` identifiers.

```{python}
t = t.mutate(
known_for_titles=_.known_for_titles.split(",").map(
lambda tconst: tconst.replace("tt", "").cast("int")
)
)
t
```

## DRY-ing up the code

We can define functions to convert `nconst` and `tconst` to integers.

```{python}
def nconst_to_int(nconst):
    return nconst.replace("nm", "").cast("int")


def tconst_to_int(tconst):
    return tconst.replace("tt", "").cast("int")
```

Then combine the previous data cleansing in a single mutate call.

```{python}
name_basics = name_basics.mutate(
nconst=nconst_to_int(_.nconst),
primary_profession=_.primary_profession.split(","),
known_for_titles=_.known_for_titles.split(",").map(tconst_to_int),
)
name_basics
```

We can use `ibis.to_sql` to see the SQL this generates.

```{python}
ibis.to_sql(name_basics)
```

Clean the rest of the tables. We'll convert `nconst` and `tconst` columns
consistently to allow for easy joining.

```{python}
title_akas = title_akas.mutate(title_id=tconst_to_int(_.title_id)).rename(
tconst="title_id"
)
title_basics = title_basics.mutate(tconst=tconst_to_int(_.tconst))
title_crew = title_crew.mutate(
tconst=tconst_to_int(_.tconst),
directors=_.directors.split(",").map(nconst_to_int),
writers=_.writers.split(",").map(nconst_to_int),
)
title_episode = title_episode.mutate(
tconst=tconst_to_int(_.tconst), parent_tconst=tconst_to_int(_.parent_tconst)
)
title_principals = title_principals.mutate(
tconst=tconst_to_int(_.tconst), nconst=nconst_to_int(_.nconst)
)
title_ratings = title_ratings.mutate(tconst=tconst_to_int(_.tconst))
```

## Finding good (and bad) movies to watch

Join the IMDB rankings with information about the movies.

```{python}
joined = title_basics.join(title_ratings, "tconst")
joined
```

```{python}
joined.title_type.value_counts().order_by(_.title_type_count.desc())
```

Filter down to movies.

```{python}
joined = joined.filter(_.title_type == "movie")
joined
```

Reorder the columns and drop some.

```{python}
joined = joined.select(
"tconst",
"primary_title",
"average_rating",
"num_votes",
"genres",
"runtime_minutes",
)
joined
```

Sort by the average rating.

```{python}
joined = joined.order_by([_.average_rating.desc(), _.num_votes.desc()])
joined
```

A lot of 10/10 movies I haven't heard of … let's filter to movies with at least
`N` votes.

```{python}
N = 50000
joined = joined.filter(_.num_votes > N)
joined
```

What if you're in the mood for a bad movie?

```{python}
joined = joined.order_by([_.average_rating.asc(), _.num_votes.desc()])
joined
```

And specifically a bad comedy?

```{python}
joined = joined.filter(_.genres.contains("Comedy"))
joined
```

Perfect!

## Next Steps

We only used two of the IMDB tables. What else can we do with the rest of the
data? Play around and let us know!
141 changes: 0 additions & 141 deletions docs/how-to/extending/elementwise.qmd

This file was deleted.

165 changes: 0 additions & 165 deletions docs/how-to/extending/reduction.qmd

This file was deleted.

40 changes: 19 additions & 21 deletions docs/how-to/extending/sql.qmd
@@ -34,7 +34,7 @@ t = ibis.examples.penguins.fetch(backend=con, table_name="penguins") # <2>
1. Connect to an in-memory DuckDB database
2. Read in the `penguins` example with our DuckDB database, and name it `penguins`

## `Table.sql`
## [`Table.sql`](../../reference/expression-tables.qmd#ibis.expr.types.relations.Table.sql)

At the highest level there's the `Table.sql` method. This method allows you to
run arbitrary `SELECT` statements against a table expression:
@@ -119,14 +119,18 @@ another.

## `Backend.sql`

::: {.callout-tip}
## `Backend.sql` supports the `dialect` argument.
:::

There's also the `Backend.sql` method, which can handle arbitrary `SELECT`
statements as well and returns an Ibis table expression.

The main difference with `Table.sql` is that `Backend.sql` **can only refer to
tables that already exist in the database**, because the API is defined on
`Backend` instances.

After the `Backend.sql` call, however, you're able to mix and match similar
After calling `Backend.sql`, however, you're able to mix and match similar
to `Table.sql`:

```{python}
@@ -147,10 +151,6 @@ to `Table.sql`:
)
```

::: {.callout-tip}
## `Backend.sql` also supports the `dialect` argument.
:::

## `Backend.raw_sql`

At the lowest level there's `Backend.raw_sql` which is for those situations
@@ -161,28 +161,26 @@ modeled as a table expression.
with the SQL statement's execution.

::: {.callout-caution}
## You **must** close the cursor returned from `raw_sql` to avoid leaking resources
## You may need to close the cursor returned from `raw_sql` to avoid leaking resources

Failure to do so can result in a variety of errors and hard-to-debug behaviors.

For DDL statements, you may not need to close the cursor since DDL statements
do not produce results.

Failure to do results in variety of errors and hard-to-debug behaviors.
Depending on the backend you may have to experiment to see when closing the
cursor is necessary.

In most cases a cursor returned from a `SELECT` statement requires a call to
`close()`.

The easiest way to do this is to use a context manager:

```{python}
from contextlib import closing
with closing(con.raw_sql("CREATE TEMP TABLE my_table AS SELECT * FROM RANGE(10)")) as c:
with closing(con.raw_sql("SELECT * FROM RANGE(10)")) as c:
... # do something with c if necessary
```
:::

Here's an example:

```{python}
cur = con.raw_sql("CREATE TEMP TABLE t AS SELECT * FROM RANGE(10)")
cur.close() # <1>
```

1. Only needed if you're not using a context manager **and** the query returns
   rows. In this case `CREATE` doesn't return any rows, so the `close()` isn't
   strictly necessary. However, it's good practice to always close cursors,
   even if the `close()` call isn't strictly required.
118 changes: 118 additions & 0 deletions docs/how-to/extending/unbound_expression.qmd
@@ -0,0 +1,118 @@
---
title: Write and execute unbound expressions
---

One of the most powerful features of Ibis is the separation of transformation
logic from the execution engine, which allows you to "write once, execute
everywhere".

## Unbound tables

In Ibis, you can define unbound tables. An unbound table is a table with a
specified schema but not connected to a data source. You can think of it as an
empty spreadsheet with just the header. Even though the spreadsheet is empty,
you know what the data would look like.

Unbound tables allow you to write transformations for data as long as it
conforms to the provided schema. You don't need to connect to a data source
until you're ready to execute the expression and compute outputs.

## Execute an unbound expression

Here's how we can define an unbound table in Ibis:

```{python}
import ibis
schema = {
"carat": "float64",
"cut": "string",
"color": "string",
"clarity": "string",
"depth": "float64",
"table": "float64",
"price": "int64",
"x": "float64",
"y": "float64",
"z": "float64",
}
diamonds = ibis.table(schema, name="diamonds")
diamonds
```

So far, we have an empty `diamonds` table that contains 10 columns. Even though
there is no data in the `diamonds` table right now, we can write
transformations knowing that these are the columns available to us.

Given this table of diamonds of various carats, cuts, and colors, we're
interested in learning the average carat for each color of premium and ideal
diamonds. In order to do this, we can first calculate the average carat for
each color and cut of diamonds, then make a pivot table to show the results:

```{python}
from ibis import _

expr = (
    diamonds.group_by(["cut", "color"])
    .agg(carat=_.carat.mean())
    .pivot_wider(
        names=("Premium", "Ideal"),
        names_from="cut",
        values_from="carat",
        names_sort=True,
        values_agg="mean",
    )
)
```

Now that we're ready to compute results, we can connect to any of Ibis'
supported backends. This feature logic can be reused and you don't need to
modify it again!

This is a dataset that we can process locally. Let's connect to DuckDB and load
the data into a DuckDB table:

```{python}
parquet_dir = "diamonds.parquet"
# download data into a local file
ibis.examples.diamonds.fetch().to_parquet(parquet_dir)
con = ibis.duckdb.connect()
con.read_parquet(parquet_dir, table_name="diamonds")
```

Connecting to this DuckDB table and executing the transformation on the loaded
data is now as simple as

```{python}
con.to_pandas(expr)
```

VoilĂ !

If you want to continue to work with the data in DuckDB, you can create a new
table and insert the outputs into it like so:

```{python}
output_schema = ibis.schema(
{
"color": "string",
"Ideal": "float64",
"Premium": "float64",
}
)
con.create_table("results", schema=output_schema)
con.insert("results", expr)
con.table("results").to_pandas()
```

## Execute on another backend

Because Ibis separates the transformation logic from the execution engine, you
can easily reuse the written transformation for another backend. Here we use
Polars as an example, but you can do the same for any of Ibis' 20+ supported
backends as long as that particular backend supports the operations
(see [the operation support matrix](../../support_matrix.qmd)).

```{python}
pl = ibis.polars.connect()
pl.read_parquet(parquet_dir, table_name="diamonds")
pl.to_pandas(expr)
```
96 changes: 96 additions & 0 deletions docs/how-to/input-output/duckdb-parquet.qmd
@@ -0,0 +1,96 @@
---
title: Read parquet files with Ibis
---

In this example, we will use Ibis's DuckDB backend to analyze data from
a remote parquet source using `ibis.read_parquet`. `ibis.read_parquet` can also
read local parquet files, and there are other `ibis.read_*` functions that
conveniently return a table expression from a file. One such function is
`ibis.read_csv`, which reads from local and remote CSV.

We will be reading from the [**Global Biodiversity Information Facility (GBIF)
Species Occurrences**](https://registry.opendata.aws/gbif/) dataset. It is
hosted on S3 at `s3://gbif-open-data-us-east-1/occurrence/`

## Reading One Partition

We can read a single partition by specifying its path.

We do this by calling
[`read_parquet`](https://ibis-project.org/api/expressions/top_level/#ibis.read_parquet)
on the partition we care about.

So to read the first partition in this dataset, we'll call `read_parquet` on
`00000` in that path:

```{python}
import ibis
t = ibis.read_parquet(
"s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/000000"
)
t
```

Note that we're calling `read_parquet` and receiving a table expression without
establishing a connection first. Ibis spins up a DuckDB connection (or
whichever default backend you have) when you call `ibis.read_parquet` (or even
`ibis.read_csv`).

Since our result, `t`, is a table expression, we can now run queries against
the file using Ibis expressions. For example, we can select columns, filter the
file, and then view the first five rows of the result:

```{python}
cols = [
"gbifid",
"datasetkey",
"occurrenceid",
"kingdom",
"phylum",
"class",
"order",
"family",
"genus",
"species",
"day",
"month",
"year",
]
t.select(cols).filter(t["family"].isin(["Corvidae"])).limit(5).to_pandas()
```

We can count the rows in the table (partition):

```{python}
t.count().to_pandas()
```

## Reading all partitions: filter, aggregate, export

We can use `read_parquet` to read an entire parquet file by globbing all
partitions:

```{python}
t = ibis.read_parquet(
"s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/*"
)
```

Since the function returns a table expression, we can perform valid selections,
filters, aggregations, and exports just as we could with any other table
expression:

```{python}
df = (
t.select(["gbifid", "family", "species"])
.filter(t["family"].isin(["Corvidae"]))
# Here we limit by 10,000 to fetch a quick batch of results
.limit(10000)
.group_by("species")
.count()
.to_pandas()
)
df
```
20 changes: 20 additions & 0 deletions docs/how-to/input-output/gcs_duckdb.qmd
@@ -0,0 +1,20 @@
---
title: Loading Google Cloud Storage files with DuckDB
---

Loading data stored on Google Cloud Storage into DuckDB requires registering a `gcs` filesystem with the connection.

```python
from ibis.interactive import *

import fsspec
import gcsfs  # provides the "gcs" protocol for fsspec

con = ibis.duckdb.connect()

# create a GCS filesystem object and register it with the DuckDB connection
gcs = fsspec.filesystem("gcs")
con.register_filesystem(gcs)

t = con.read_json("gs://...")
```
2 changes: 1 addition & 1 deletion docs/how-to/visualization/altair.qmd
@@ -23,7 +23,7 @@ in Ibis tables or expressions:
import altair as alt
chart = (
alt.Chart(t.group_by("species").agg(count=ibis._.count()).to_pandas())
alt.Chart(t.group_by("species").agg(count=ibis._.count()))
.mark_bar()
.encode(
x="species",
84 changes: 84 additions & 0 deletions docs/how-to/visualization/graphs.qmd
@@ -0,0 +1,84 @@
# GraphViz + Ibis

Ibis supports visualizing an expression as a [directed
graph](https://en.wikipedia.org/wiki/Directed_graph) using
[GraphViz](https://graphviz.org/).

To get started, make sure you've installed the necessary dependencies.

```sh
$ pip install 'ibis-framework[duckdb,examples,visualization]'
```

::: {.callout-note collapse="true"}
## Replace `duckdb` with your backend of choice

For instance, you can replace `duckdb` with `snowflake` if you want to use the
[Snowflake backend](../../backends/snowflake.qmd).
:::


Let's run through a few examples.

First we'll import the things we need.

```{python}
import ibis
from ibis import _
from ibis.expr.visualize import to_graph
```

Now we can visualize an expression graph.

Here's a call to `select`.

```{python}
t = ibis.examples.penguins.fetch()
expr = t.select(lowered=_.species.lower())
to_graph(expr)
```

The way to read the graph is from top to bottom.

* The top of the graph contains Ibis operations with no dependents.
* The edges encode dependencies from one node to another.
* The bold text in the rectangles is the name of the Ibis operation class.
* The bottom of the graph contains Ibis operations with no dependencies.
* If this were a data flow graph, data would flow from bottom to top.

Let's look at a more complex example: `group_by`.

```{python}
expr = (
t.group_by(_.species)
.agg(
bill_depth_avg=_.bill_depth_mm.mean(),
bill_length_avg=_.bill_length_mm.mean(),
)
)
to_graph(expr)
```

Switching gears, let's look at a `join` and show customization of node and edge
attributes.

```{python}
left = ibis.table(dict(a="int64", b="string"), name="left")
right = ibis.table(dict(b="string", c="int64", d="string"), name="right")
expr = (
left.inner_join(right, "b")
.select(left.a, b=right.c, c=right.d)
.mutate(arrays=ibis.array([1, 2, 3]))
)
to_graph(
expr,
label_edges=True,
node_attr={"shape": "hexagon", "color": "green", "fontname": "Roboto Mono"},
edge_attr={"fontsize": "12", "fontname": "Comic Sans MS"}
)
```
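
The object returned by `to_graph` should be a `graphviz` graph, so if you want
to save a rendering to disk rather than display it inline, something like the
following sketch should work (the filename and format are just examples):

```python
graph = to_graph(expr)

# Write the DOT source to "expr_graph" and render it to "expr_graph.png".
graph.render("expr_graph", format="png", cleanup=True)
```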

Please try out `to_graph` and give us feedback on
[Zulip](https://ibis-project.zulipchat.com) or in a [GitHub
issue](https://github.com/ibis-project/ibis/issues/new/choose)!
8 changes: 8 additions & 0 deletions docs/ibis-wasm-wheels_cors_config.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
[
{
"origin": ["*"],
"method": ["GET"],
"responseHeader": ["Content-Type"],
"maxAgeSeconds": 3600
}
]
Binary file added docs/images/backends.png
23 changes: 9 additions & 14 deletions docs/index.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,9 @@ about:
href: install.qmd
- icon: book
href: tutorials/getting_started.qmd
- icon: postcard
text: Blog
href: posts.qmd
- icon: github
text: GitHub
href: https://github.com/ibis-project
Expand Down Expand Up @@ -144,14 +147,6 @@ t["species"].topk(num_species) # <2>

Ibis works with any Python plotting library that supports the [dataframe interchange protocol](https://data-apis.org/dataframe-protocol/latest/index.html).

```{python}
# | code-fold: true
width = 640 # <1>
height = 480 # <1>
```

1. Set the width and height of the plots.

```{python}
grouped = ( # <1>
t.group_by("species") # <1>
Expand Down Expand Up @@ -183,7 +178,7 @@ chart = (
y="count",
tooltip=["species", "count"],
)
.properties(width=width, height=height)
.properties(width=600, height=400)
.interactive()
)
chart
Expand All @@ -201,7 +196,7 @@ import matplotlib.pyplot as plt
chart = grouped.to_pandas().plot.bar(
x="species",
y="count",
figsize=(width / 100, height / 100),
figsize=(600 / 100, 400 / 100),
)
plt.show()
```
Expand All @@ -219,8 +214,8 @@ chart = px.bar(
grouped.to_pandas(),
x="species",
y="count",
width=width,
height=height,
width=600,
height=400,
)
chart
```
Expand All @@ -239,7 +234,7 @@ chart = (
aes(x="species", y="count"),
)
+ geom_bar(stat="identity")
+ theme(figure_size=(width / 100, height / 100))
+ theme(figure_size=(600 / 100, 400 / 100))
)
chart
```
Expand All @@ -258,7 +253,7 @@ chart = sns.barplot(
x="species",
y="count",
)
chart.figure.set_size_inches(width / 100, height / 100)
chart.figure.set_size_inches(600 / 100, 400 / 100)
```

:::
Expand Down
8 changes: 8 additions & 0 deletions docs/jupyter_lite_config.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
{
"PipliteAddon": {
"piplite_urls": [
"https://duckdb.github.io/duckdb-pyodide/wheels/duckdb-0.10.2-cp311-cp311-emscripten_3_1_46_wasm32.whl",
"dist/ibis_framework-8.0.0-py3-none-any.whl"
]
}
}
1,369 changes: 0 additions & 1,369 deletions docs/portability.excalidraw

This file was deleted.

Binary file removed docs/portability.png
Binary file not shown.
4 changes: 2 additions & 2 deletions docs/posts/Ibis-version-3.1.0-release/index.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ categories:

Ibis 3.1 has officially been released as the latest version of the package.
With this release comes new convenience features, increased backend operation coverage and a plethora of bug fixes.
As usual, a full list of the changes can be found in the project release notes [here](../../release_notes.md) Let’s talk about some of the new changes 3.1 brings for Ibis users.
As usual, a full list of the changes can be found in the project release notes [here](../../release_notes.qmd). Let’s talk about some of the new changes 3.1 brings for Ibis users.

## `ibis.connect`

Expand Down Expand Up @@ -226,7 +226,7 @@ Along with these changes, the operation matrix has had a few more holes filled.
Contributors should note that backend test data is now loaded dynamically.
Most users won’t be exposed to this update, but it should make contribution a bit more streamlined.

To see the full patch notes, go to the [patch notes page](../../release_notes.md)
To see the full patch notes, go to the [patch notes page](../../release_notes.qmd).

As always, Ibis is free and open source.
Contributions are welcome and encouraged–drop into the discussions, raise an issue, or put in a pull request.
Expand Down
493 changes: 493 additions & 0 deletions docs/posts/duckdb-for-rag/index.qmd

Large diffs are not rendered by default.

Binary file added docs/posts/duckdb-for-rag/thumbnail.png
307 changes: 307 additions & 0 deletions docs/posts/flink-announcement/index.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,307 @@
---
title: "Ibis goes real-time! Introducing the new Flink backend for Ibis"
author: "Deepyaman Datta"
date: "2024-02-12"
categories:
- blog
- flink
- stream processing
---

## Introduction

Ibis 8.0 marks the official release of the Apache Flink backend for Ibis. Ibis
users can now manipulate data across streaming and batch contexts using the same
interface. Flink is one of the most established stream-processing frameworks out
there and a central part of the real-time data infrastructure at companies like
DoorDash, LinkedIn, Netflix, and Uber. It is commonly applied in use cases such
as fraud detection, anomaly detection, real-time recommendation, dynamic
pricing, and online advertising. The Flink backend is also the first streaming
backend Ibis supports. Follow along as we define and execute a simple streaming
job using Ibis!

## Installation prerequisites

* **Docker Compose:** This tutorial uses Docker Compose to manage an Apache
Kafka environment (including sample data generation) and a Flink cluster (for
[remote execution](#remote-execution)). You can [download and install Docker
Compose from the official website](https://docs.docker.com/compose/install/).
* **JDK 11:** Flink requires Java 11. If you don't already have JDK 11
installed, you can [get the appropriate Eclipse Temurin
release](https://adoptium.net/temurin/releases/?package=jdk&version=11).
* **Python:** To follow along, you need Python 3.9 or 3.10.

## Installing the Flink backend for Ibis

We use a Python client to explore data in Kafka topics. You can install it,
alongside the Flink backend for Ibis, with `pip`, `conda`, `mamba`, or `pixi`:

::: {.panel-tabset}

## Using `pip`

```bash
pip install ibis-framework apache-flink kafka-python
```

## Using `conda`

```bash
conda install -c conda-forge ibis-flink
```

## Using `mamba`

```bash
mamba install -c conda-forge ibis-flink
```

## Using `pixi`

```bash
pixi add ibis-flink
```

:::

## Spinning up the services using Docker Compose

The [ibis-project/ibis-flink-example GitHub
repository](https://github.com/ibis-project/ibis-flink-example) includes the
relevant Docker Compose configuration for this tutorial. Clone the repository,
and run `docker compose up` from the cloned directory to create Kafka topics,
generate sample data, and launch a Flink cluster:

```bash
git clone https://github.com/ibis-project/ibis-flink-example.git
cd ibis-flink-example
docker compose up
```

::: {.callout-tip}
If you don't intend to try [remote execution](#remote-execution), you can start
only the Kafka-related services with `docker compose up kafka init-kafka
data-generator`.
:::

After a few seconds, you should see messages indicating your Kafka environment
is ready:

```bash
ibis-flink-example-init-kafka-1 | Successfully created the following topics:
ibis-flink-example-init-kafka-1 | payment_msg
ibis-flink-example-init-kafka-1 | sink
ibis-flink-example-init-kafka-1 exited with code 0
ibis-flink-example-data-generator-1 | Connected to Kafka
ibis-flink-example-data-generator-1 | Producing 20000 records to Kafka topic payment_msg
```

This example uses mock payments data. The `payment_msg` Kafka topic contains
messages in the following format:

```json
{
"createTime": "2023-09-20 22:19:02.224",
"orderId": 1695248388,
"payAmount": 88694.71922270155,
"payPlatform": 0,
"provinceId": 6
}
```

In a separate terminal, we can explore what these messages look like:

```{python}
from itertools import islice
from kafka import KafkaConsumer
consumer = KafkaConsumer("payment_msg")
for msg in islice(consumer, 3):
print(msg)
```
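
Each record's `value` is the raw message bytes; if you'd rather see the decoded
JSON payloads, here is a small sketch reusing the consumer from above:

```python
import json

# Decode the JSON payload of a few more messages.
for msg in islice(consumer, 3):
    print(json.loads(msg.value))
```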

## Running the tutorial

This tutorial uses Ibis with the Flink backend to process the aforementioned
payment messages. You can choose to either [run it locally](#local-execution) or
[submit a job to an already-running Flink cluster](#remote-execution).

### Local execution

The simpler option is to run the example using the Flink mini cluster.

#### Create a table environment

The [table
environment](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/python/table/table_environment/)
serves as the main entry point for interacting with the Flink runtime. The
`flink` backend does not create `TableEnvironment` objects; you must create a
`TableEnvironment` and pass that to
[`ibis.flink.connect`](../../backends/flink.qmd#ibis.flink.connect):

```{python}
import ibis
from pyflink.table import EnvironmentSettings, TableEnvironment
env_settings = EnvironmentSettings.in_streaming_mode()
table_env = TableEnvironment.create(env_settings)
table_env.get_config().set("parallelism.default", "1") # <1>
con = ibis.flink.connect(table_env)
```
1. write all the data to one file

Flink’s streaming connectors aren't part of the binary distribution. Link the
[Kafka
connector](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/connectors/table/kafka/)
for cluster execution by adding the JAR file from the cloned repository. [Ibis
exposes the `raw_sql` method for situations like this, where you need to run
arbitrary SQL that cannot be modeled as a table
expression](https://ibis-project.org/how-to/extending/sql#backend.raw_sql):

```{python}
#| echo: false
#| output: false
import urllib.request
urllib.request.urlretrieve(
"https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka/3.0.2-1.18/flink-sql-connector-kafka-3.0.2-1.18.jar",
filename="flink-sql-connector-kafka-3.0.2-1.18.jar",
)
```

```{python}
#| output: false
con.raw_sql("ADD JAR 'flink-sql-connector-kafka-3.0.2-1.18.jar'")
```

#### Create the source and sink tables

Use
[`create_table`](../../backends/flink.qmd#ibis.backends.flink.Backend.create_table)
to register tables. Notice the new top-level `ibis.watermark` API for
[specifying a watermark
strategy](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/concepts/time/#event-time-and-watermarks).

```{python}
source_schema = ibis.schema( # <1>
{ # <1>
"createTime": "timestamp(3)", # <1>
"orderId": "int64", # <1>
"payAmount": "float64", # <1>
"payPlatform": "int32", # <1>
"provinceId": "int32", # <1>
} # <1>
) # <1>
source_configs = { # <1>
"connector": "kafka", # <1>
"topic": "payment_msg", # <1>
"properties.bootstrap.servers": "localhost:9092", # <1>
"properties.group.id": "test_3", # <1>
"scan.startup.mode": "earliest-offset", # <1>
"format": "json", # <1>
} # <1>
t = con.create_table( # <1>
"payment_msg", # <1>
schema=source_schema, # <1>
tbl_properties=source_configs, # <1>
watermark=ibis.watermark( # <1>
time_col="createTime", allowed_delay=ibis.interval(seconds=15) # <1>
), # <1>
) # <1>
sink_schema = ibis.schema( # <2>
{ # <2>
"province_id": "int32", # <2>
"pay_amount": "float64", # <2>
} # <2>
) # <2>
sink_configs = { # <2>
"connector": "kafka", # <3>
"topic": "sink", # <2>
"properties.bootstrap.servers": "localhost:9092", # <2>
"format": "json", # <2>
} # <2>
con.create_table( # <2>
"total_amount_by_province_id", schema=sink_schema, tbl_properties=sink_configs # <2>
) # <2>
```
1. create source Table
2. create sink Table

#### Perform calculations

Compute the total pay amount per province in the past 10 seconds (as of each
message, for the province in the incoming message):

```{python}
agged = t.select(
province_id=t.provinceId,
pay_amount=t.payAmount.sum().over(
range=(-ibis.interval(seconds=10), 0),
group_by=t.provinceId,
order_by=t.createTime,
),
)
```

Finally, emit the query result to the sink table:

```{python}
con.insert("total_amount_by_province_id", agged)
```

### Remote execution

You can also submit the example to the [remote cluster started using Docker
Compose](#spinning-up-the-services-using-docker-compose). The
`window_aggregation.py` file in the cloned repository contains the [same steps
that we performed for local execution](#local-execution). We will [use the
method described in the official Flink
documentation](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/cli/#submitting-pyflink-jobs).

::: {.callout-tip}
You can find the `./bin/flink` executable with the following command:

```bash
python -c'from pathlib import Path; import pyflink; print(Path(pyflink.__spec__.origin).parent / "bin" / "flink")'
```
:::

My full command looks like this:

```bash
/opt/miniconda3/envs/ibis-dev/lib/python3.10/site-packages/pyflink/bin/flink run --jobmanager localhost:8081 --python window_aggregation.py
```

The command will exit after displaying a submission message:

```
Job has been submitted with JobID b816faaf5ef9126ea5b9b6a37012cf56
```

## Viewing the results

Similar to how we viewed messages in the `payment_msg` topic, we can print
results from the `sink` topic:

```{python}
consumer = KafkaConsumer("sink")
for msg in islice(consumer, 10):
print(msg)
```

Voilà! You've run your first streaming application using Ibis.

## Shutting down the Compose environment

Press <kbd>Ctrl</kbd>+<kbd>C</kbd> to stop the Docker Compose containers. Once
stopped, run `docker compose down` to remove the services created for this
tutorial.
Binary file added docs/posts/hamilton-ibis/column_lineage.png
Binary file added docs/posts/hamilton-ibis/hamilton_abc.png
Binary file added docs/posts/hamilton-ibis/ibis_lineage.png