# Substrait

[Substrait](https://substrait.io) is a cross-language specification for data compute operations. Ibis can produce Substrait plans using the `ibis-substrait` Python package.

### Why Substrait?

Today, tools like Ibis must build a connector for each data system they target. This is a many-to-many relationship: with N frontends and M backends, up to N × M connectors are needed. Substrait replaces these pairwise connectors with a shared Intermediate Representation (IR). Now, we can have a many-to-one relationship from frontend -> IR and a one-to-many relationship from IR -> backend, so each frontend and each backend only needs to support the IR.
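To make the connector arithmetic concrete, here is a small sketch that compares the two approaches. The function names and the example counts are hypothetical and chosen only for illustration; they do not describe any real ecosystem.

```python
def connectors_without_ir(frontends: int, backends: int) -> int:
    """Every frontend needs its own connector to every backend."""
    return frontends * backends


def connectors_with_ir(frontends: int, backends: int) -> int:
    """Each frontend emits the IR once; each backend consumes the IR once."""
    return frontends + backends


# Hypothetical counts, for illustration only.
n_frontends, n_backends = 4, 6
print(connectors_without_ir(n_frontends, n_backends))  # pairwise connectors
print(connectors_with_ir(n_frontends, n_backends))  # IR producers + consumers
```

The gap widens as either side grows, since the first count is a product and the second is a sum.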

### But, how is this useful to me?

Interoperability now _and in the future_. The same Substrait Plan can run anywhere that has built-in support for the Substrait specification. No need to wait for Ibis to implement the shiny new connector for your data system of choice.

## Example

Let's see Ibis Substrait in action.

### Setup

Let's build a toy example of a database server. Our example uses a local DuckDB database, but in practice we can imagine talking to a database server over the network.

In [None]:
import duckdb
import os
from urllib.request import urlretrieve


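class DatabaseServer:
    """Toy stand-in for a remote database that accepts Substrait plans."""

    DB_NAME = "palmer_penguins.ddb"
    DB_URL = "https://storage.googleapis.com/ibis-tutorial-data/palmer_penguins.ddb"

    def __init__(self):
        # Download the sample database once and reuse the local copy afterwards.
        if not os.path.exists(self.DB_NAME):
            urlretrieve(self.DB_URL, self.DB_NAME)
        self.db = duckdb.connect(self.DB_NAME)
        # DuckDB's Substrait support ships as an extension.
        self.db.install_extension("substrait")
        self.db.load_extension("substrait")

    def execute(self, substrait):
        # Run a serialized Substrait plan (bytes) and return the rows as a list of tuples.
        result = self.db.from_substrait(substrait)
        return result.fetchall()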


db_server = DatabaseServer()

### Ibis Table

We need an Ibis Table to query against. Let's define one that matches the table in our mock DB server.

In [None]:
import ibis
from ibis.expr.datatypes.core import Float64, Int64, String

table = ibis.table(
    name="penguins",
    schema=[
        ("species", String()),
        ("island", String()),
        ("bill_length_mm", Float64()),
        ("bill_depth_mm", Float64()),
        ("flipper_length_mm", Int64()),
        ("body_mass_g", Int64()),
        ("sex", String()),
        ("year", Int64()),
    ],
)

print(table)

### Substrait Compiler

The `ibis-substrait` package provides a `SubstraitCompiler` that can both compile and decompile Substrait Plans.

Let's see it in action:

In [None]:
from ibis import _
from ibis_substrait.compiler.core import SubstraitCompiler

compiler = SubstraitCompiler()

query = table.select(_.species).group_by(_.species).agg(count=_.species.count())

substrait_plan = compiler.compile(query)

print(substrait_plan)

### Substrait Execution

Let's serialize the Substrait Plan to bytes that can be sent over the network and pass them to our mock DB server.

The query counts the number of penguins per species.

In [None]:
plan_bytes = substrait_plan.SerializeToString()

db_server.execute(substrait=plan_bytes)

Success! We've created an Ibis Table expression, serialized it to the Substrait IR, sent it to our DB server, and received the resulting rows back.
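Because the mock server calls `fetchall()`, the rows come back as plain tuples without column names. Below is a minimal sketch of attaching names to them, assuming the column order matches the query (`species`, then `count`). The helper `label_rows` is hypothetical, and the sample rows are made-up placeholders, not output from the server.

```python
def label_rows(rows, columns):
    """Pair each positional value in a row with its column name."""
    return [dict(zip(columns, row)) for row in rows]


# Placeholder rows, invented for illustration; real rows would come from
# db_server.execute(substrait=plan_bytes).
sample_rows = [("species_a", 3), ("species_b", 5)]

labeled = label_rows(sample_rows, columns=["species", "count"])
print(labeled)
```

Keeping the labeling on the client side leaves the server interface unchanged: it still only accepts plan bytes and returns rows.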

We can iterate on our data analysis. Let's see how many penguins of each species live on each island.

In [None]:
query = (
    table.select(_.island, _.species)
    .group_by([_.island, _.species])
    .agg(num=_.species.count())
    .order_by([ibis.asc(_.island), ibis.asc(_.species)])
)

plan_bytes = compiler.compile(query).SerializeToString()

db_server.execute(substrait=plan_bytes)

Interesting! And what is the average body mass in grams for each island and species group?

In [None]:
query = (
    table.select(_.island, _.species, _.body_mass_g)
    .group_by([_.island, _.species])
    .agg(num=_.species.count(), avg_weight=_.body_mass_g.mean())
    .order_by([ibis.asc(_.island), ibis.asc(_.species)])
)

plan_bytes = compiler.compile(query).SerializeToString()

db_server.execute(substrait=plan_bytes)

## Conclusion

We saw how we can translate Ibis expressions into Substrait Plans that can, in principle, run on any backend that supports Substrait. Backend support for Substrait is growing. Check out some compatible projects such as [DuckDB](https://duckdb.org/docs/extensions/substrait), [Apache DataFusion](https://arrow.apache.org/datafusion), and Apache Arrow's [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html)!