| --- | ||
| title: Backend Table Hierarchy | ||
| --- | ||
|
|
||
| Several SQL backends support two levels of hierarchy in organizing tables | ||
| (although the levels are also used for other purposes, like data access, | ||
| billing, etc.). | ||
|
|
||
| Ibis uses the following terminology: | ||
|
|
||
| - `database`: a collection of tables | ||
| - `catalog`: a collection of databases | ||
|
|
||
Below is a table with the terminology used by each backend for the two levels of
hierarchy. This is provided as a reference; note that when using Ibis, we always
use the terms `catalog` and `database` and map them onto each backend's fields.
|
|
||
|
|
||
| | Backend | Catalog | Database | | ||
| |------------|----------------|------------| | ||
| | bigquery | project | database | | ||
| | clickhouse | | database | | ||
| | dask | | NA | | ||
| | datafusion | catalog | schema | | ||
| | druid | dataSourceType | dataSource | | ||
| | duckdb | database | schema | | ||
| | flink | catalog | database | | ||
| | impala | | database | | ||
| | mssql | database | schema | | ||
| | mysql | | database | | ||
| | oracle | | database | | ||
| | pandas | | NA | | ||
| | polars | | NA | | ||
| | postgres | database | schema | | ||
| | pyspark | | database | | ||
| | risingwave | database | schema | | ||
| | snowflake | | database | | ||
| | trino | catalog | schema | |
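
Regardless of a backend's native terminology, you interact with both levels through the
same Ibis methods. Below is a minimal sketch assuming the DuckDB backend and a recent
Ibis version that exposes the `catalog`/`database` terminology; the exact methods
available can vary by backend and version.

```python
import ibis

con = ibis.duckdb.connect()  # an in-memory DuckDB database

# DuckDB's "databases" appear as Ibis catalogs,
# and DuckDB's "schemas" appear as Ibis databases.
con.list_catalogs()
con.list_databases()
```
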
| # RisingWave | ||
|
|
||
| [https://risingwave.com/](https://risingwave.com/) | ||
|
|
||
|    | ||
|
|
||
| ## Install | ||
|
|
||
| Install Ibis and dependencies for the RisingWave backend: | ||
|
|
||
| ::: {.panel-tabset} | ||
|
|
||
| ## `pip` | ||
|
|
||
| Install with the `risingwave` extra: | ||
|
|
||
| ```{.bash} | ||
| pip install 'ibis-framework[risingwave]' | ||
| ``` | ||
|
|
||
| And connect: | ||
|
|
||
| ```{.python} | ||
| import ibis | ||
| con = ibis.risingwave.connect() # <1> | ||
| ``` | ||
|
|
||
| 1. Adjust connection parameters as needed. | ||
|
|
||
| ## `conda` | ||
|
|
||
Install for RisingWave:
|
|
||
| ```{.bash} | ||
| conda install -c conda-forge ibis-risingwave | ||
| ``` | ||
|
|
||
| And connect: | ||
|
|
||
| ```{.python} | ||
| import ibis | ||
| con = ibis.risingwave.connect() # <1> | ||
| ``` | ||
|
|
||
| 1. Adjust connection parameters as needed. | ||
|
|
||
| ## `mamba` | ||
|
|
||
Install for RisingWave:
|
|
||
| ```{.bash} | ||
| mamba install -c conda-forge ibis-risingwave | ||
| ``` | ||
|
|
||
| And connect: | ||
|
|
||
| ```{.python} | ||
| import ibis | ||
| con = ibis.risingwave.connect() # <1> | ||
| ``` | ||
|
|
||
| 1. Adjust connection parameters as needed. | ||
|
|
||
| ::: | ||
|
|
||
| ## Connect | ||
|
|
||
| ### `ibis.risingwave.connect` | ||
|
|
||
| ```python | ||
| con = ibis.risingwave.connect( | ||
| user="username", | ||
| password="password", | ||
| host="hostname", | ||
| port=4566, | ||
| database="database", | ||
| ) | ||
| ``` | ||
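
Once connected, you can sanity-check the connection by listing tables and peeking at
one. The table name below is a placeholder; substitute a table that exists in your
RisingWave database.

```python
con.list_tables()           # confirm the connection works
t = con.table("my_table")   # "my_table" is a placeholder table name
t.head().to_pandas()
```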
|
|
||
| ::: {.callout-note} | ||
| `ibis.risingwave.connect` is a thin wrapper around | ||
| [`ibis.backends.risingwave.Backend.do_connect`](#ibis.backends.risingwave.Backend.do_connect). | ||
| ::: | ||
|
|
||
|
|
||
| ### Connection Parameters | ||
|
|
||
| ```{python} | ||
| #| echo: false | ||
| #| output: asis | ||
| from _utils import render_do_connect | ||
| render_do_connect("risingwave") | ||
| ``` | ||
|
|
||
| ```{python} | ||
| #| echo: false | ||
| BACKEND = "Risingwave" | ||
| ``` | ||
|
|
||
| {{< include ./_templates/api.qmd >}} |
| from __future__ import annotations | ||
|
|
||
| import plotly.graph_objects as go | ||
|
|
||
|
|
||
def to_greyish(hex_code, grey_value=128):
    # Blend a hex color 50/50 with a grey value so Sankey links render as
    # muted versions of their target node's color.
    hex_code = hex_code.lstrip("#")
    r, g, b = int(hex_code[0:2], 16), int(hex_code[2:4], 16), int(hex_code[4:6], 16)

    new_r = (r + grey_value) // 2
    new_g = (g + grey_value) // 2
    new_b = (b + grey_value) // 2

    new_hex_code = f"#{new_r:02x}{new_g:02x}{new_b:02x}"

    return new_hex_code
|
|
||
|
|
||
| category_colors = { | ||
| "Ibis API": "#7C65A0", | ||
| "SQL": "#6A9BC9", | ||
| "DataFrame": "#D58273", | ||
| } | ||
|
|
||
| backend_categories = { | ||
| list(category_colors.keys())[1]: [ | ||
| "BigQuery", | ||
| "ClickHouse", | ||
| "DataFusion", | ||
| "Druid", | ||
| "DuckDB", | ||
| "Exasol", | ||
| "Flink", | ||
| "Impala", | ||
| "MSSQL", | ||
| "MySQL", | ||
| "Oracle", | ||
| "PostgreSQL", | ||
| "PySpark", | ||
| "RisingWave", | ||
| "Snowflake", | ||
| "SQLite", | ||
| "Trino", | ||
| ], | ||
| list(category_colors.keys())[2]: ["Dask", "pandas", "Polars"], | ||
| } | ||
|
|
||
| nodes, links = [], [] | ||
| node_index = {} | ||
|
|
||
| nodes.append({"label": "Ibis API", "color": category_colors["Ibis API"]}) | ||
| node_index["Ibis API"] = 0 | ||
|
|
||
idx = 1
for category, backends in backend_categories.items():
    nodes.append({"label": category, "color": category_colors[category]})
    node_index[category] = idx
    links.append({"source": 0, "target": idx, "value": len(backends)})
    idx += 1

    for backend in backends:
        if backend not in node_index:
            nodes.append({"label": backend, "color": category_colors[category]})
            node_index[backend] = idx
            idx += 1
        links.append(
            {
                "source": node_index[category],
                "target": node_index[backend],
                "value": 1,
            }
        )
|
|
||
| fig = go.Figure( | ||
| data=[ | ||
| go.Sankey( | ||
| node=dict( | ||
| pad=20, | ||
| thickness=20, | ||
| line=dict(color="grey", width=0.5), | ||
| label=[node["label"] for node in nodes], | ||
| color=[node["color"] for node in nodes], | ||
| ), | ||
| link=dict( | ||
| source=[link["source"] for link in links], | ||
| target=[link["target"] for link in links], | ||
| value=[link["value"] for link in links], | ||
| line=dict(color="grey", width=0.5), | ||
| color=[to_greyish(nodes[link["target"]]["color"]) for link in links], | ||
| ), | ||
| ) | ||
| ], | ||
| ) | ||
|
|
||
| fig.update_layout( | ||
| title_text="Ibis backend types", | ||
| font_size=24, | ||
| # font_family="Arial", | ||
| title_font_size=30, | ||
| margin=dict(l=30, r=30, t=80, b=30), | ||
| template="plotly_dark", | ||
| ) |
| --- | ||
| title: Datatypes and Datashapes | ||
| --- | ||
|
|
||
Every value in Ibis has two important properties: a type and a shape.
|
|
||
| The type is probably familiar to you. It is something like | ||
|
|
||
| - `Integer` | ||
| - `Floating` | ||
| - `String` | ||
| - `Array` | ||
|
|
||
| The shape is one of | ||
|
|
||
| - `Scalar` (a single value) | ||
| - `Column` (a series of values) | ||
|
|
||
| ## Datatype Flavors | ||
|
|
||
Some datatypes have further options that refine them.
For instance, `Integer` values can be signed or unsigned and come in
different bit widths, such as `uint8` or `int64`.
These flavors don't affect a value's capabilities
(e.g. both signed and unsigned integers have an `.abs()` method),
but the flavor does impact how the underlying backend stores the values and performs the computation.
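
As a minimal sketch (not tied to any backend), you can spell out a flavor explicitly
when constructing a literal and change it with `cast`:

```python
import ibis

x = ibis.literal(-5, type="int8")  # a signed 8-bit integer
x.abs()                            # available regardless of flavor
x.abs().cast("uint16")             # change the flavor explicitly
```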
|
|
||
| ## Capabilities | ||
|
|
||
| Depending on the combination of datatype and datashape, a value has | ||
| different capabilities. For example: | ||
|
|
||
| - All `String` values (both `StringScalars` and `StringColumns`) have the | ||
| method `.upper()` that transforms the string to uppercase. | ||
| `Floating` and `Array` values don't have this method, of course. | ||
| - `IntegerColumn` and `FloatingColumn` values have `.mean()`, `.max()`, etc methods | ||
| because you can aggregate over them, since they are a collection of values. | ||
| On the other hand, `IntegerScalar` and `FloatingScalar` values do **not** have these | ||
| methods, because it doesn't make sense to take the mean or max of a single value. | ||
| - If you call `.to_pandas()` on these values, you get different results. | ||
| `Scalar` shapes result in scalar objects: | ||
| - `IntegerScalar`: NumPy `int64` object (or whatever specific flavor). | ||
| - `FloatingScalar`: NumPy `float64` object (or whatever specific flavor). | ||
| - `StringScalar`: plain python `str` object. | ||
| - `ArrayScalar`: plain python `list` object. | ||
| - On the other hand, `Column` shapes result in `pandas.Series`: | ||
| - `IntegerColumn`: pd.Series of integers, with the same flavor. | ||
| For example, if the `IntegerColumn` was specifically "uint16", | ||
| then the pandas series will hold a numpy array of type "uint16". | ||
| - `FloatingColumn`: pd.Series of numpy floats with the same flavor. | ||
| - etc. | ||
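
As a small illustration of the `.to_pandas()` behavior described above, using the
penguins example dataset that also appears below:

```python
import ibis

t = ibis.examples.penguins.fetch()
t.body_mass_g.max().to_pandas()  # Scalar shape -> a single scalar value
t.body_mass_g.to_pandas()        # Column shape -> a pandas.Series
```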
|
|
||
| ## Broadcasting and Alignment | ||
|
|
||
There are rules for how different datashapes are combined. If you are familiar with
SQL or NumPy, this is similar to how they handle combining values of different shapes.
|
|
||
| ```{python} | ||
| import ibis | ||
| ibis.options.interactive = True | ||
| t1 = ibis.examples.penguins.fetch().head(100) | ||
| t1 | ||
| ``` | ||
|
|
||
We can look at the datatype of the `year` column:
|
|
||
| ```{python} | ||
| t1.year.type() | ||
| ``` | ||
|
|
||
| Combining two `Scalar`s results in a `Scalar`: | ||
|
|
||
| ```{python} | ||
| t1.year.mean() + t1.year.std() | ||
| ``` | ||
|
|
||
| Combining a `Column` and `Scalar` results in a `Column`: | ||
|
|
||
| ```{python} | ||
| t1.year + 1000 | ||
| ``` | ||
|
|
||
| Combining two `Column`s results in a `Column`: | ||
|
|
||
| ```{python} | ||
| t1.year + t1.bill_length_mm | ||
| ``` | ||
|
|
||
One requirement that might surprise you if you are coming from NumPy is
how Ibis aligns `Column`s: in NumPy, if you have two arbitrary
arrays, each of length 100, you can add them together, and it works because the
elements are "lined up" based on position. Ibis is different. Because it is based
on SQL, and SQL has no notion of inherent row ordering, you cannot "line up"
any two `Column`s in Ibis: they both **have** to be derived from the same
`Table` expression. For example:
|
|
||
| ```{python} | ||
| t2 = ibis.examples.population.fetch().head(100) | ||
| t2 | ||
| ``` | ||
|
|
||
| ```{python} | ||
| #| error: true | ||
| t1.bill_depth_mm + t2.population | ||
| ``` | ||
|
|
||
| If you want to use these two columns together, you would need to join the tables together first: | ||
|
|
||
| ```{python} | ||
| j = ibis.join(t1, t2, "year") | ||
| j | ||
| ``` | ||
|
|
||
| ```{python} | ||
| j.bill_depth_mm + j.population | ||
| ``` |
| function Header(el) | ||
| -- Walk the pandoc AST and find all links in Header elements | ||
| -- Pluck out the version string and set it if it hasn't been set already | ||
| local version = nil; | ||
|
|
||
| el = el:walk({ | ||
| Link = function(link) | ||
| return link:walk({ | ||
| Str = function(str) | ||
| if version == nil then | ||
| version = str.text | ||
| end | ||
| return str.text | ||
| end, | ||
| }) | ||
| end, | ||
| }) | ||
|
|
||
| el.attr.identifier = version | ||
|
|
||
| return el | ||
| end |
| --- | ||
| title: Analyze IMDB data using Ibis | ||
| --- | ||
|
|
||
| Let's use the Ibis examples module and the DuckDB backend to find some movies | ||
| to watch. | ||
|
|
||
| Adapted from [Phillip in the Cloud's livestream using the same | ||
| data](https://www.youtube.com/watch?v=J7sEn9VklKY). | ||
|
|
||
| ## Imports | ||
|
|
||
| For this example, we'll just use Ibis. | ||
|
|
||
| ```{python} | ||
| from ibis.interactive import * # <1> | ||
| ``` | ||
|
|
||
1. This import makes `ibis.examples` available as `ex`.
|
|
||
| ## Fetch the example data | ||
|
|
||
| We can use the `ibis.examples` module to fetch the IMDB data. Ibis | ||
| automatically caches the data on disk so subsequent runs don't require fetching | ||
| from cloud storage on each call to `fetch`. | ||
|
|
||
| ```{python} | ||
| name_basics = ex.imdb_name_basics.fetch() | ||
| name_basics | ||
| ``` | ||
|
|
||
To ensure the column names are Pythonic, we can rename them to `snake_case`.
|
|
||
| ```{python} | ||
| name_basics.rename("snake_case") | ||
| ``` | ||
|
|
||
| Let's grab all of the relevant IMDB tables and rename columns. | ||
|
|
||
| ```{python} | ||
| name_basics = ex.imdb_name_basics.fetch().rename("snake_case") | ||
| title_akas = ex.imdb_title_akas.fetch().rename("snake_case") | ||
| title_basics = ex.imdb_title_basics.fetch().rename("snake_case") | ||
| title_crew = ex.imdb_title_crew.fetch().rename("snake_case") | ||
| title_episode = ex.imdb_title_episode.fetch().rename("snake_case") | ||
| title_principals = ex.imdb_title_principals.fetch().rename("snake_case") | ||
| title_ratings = ex.imdb_title_ratings.fetch().rename("snake_case") | ||
| ``` | ||
|
|
||
| ## Preview the data | ||
|
|
||
| We'll print out the first few rows of each table to get an idea of what is | ||
| contained in each. | ||
|
|
||
| ```{python} | ||
| name_basics.head() | ||
| ``` | ||
|
|
||
| ```{python} | ||
| title_akas.head() | ||
| ``` | ||
|
|
||
| ```{python} | ||
| title_basics.head() | ||
| ``` | ||
|
|
||
| ```{python} | ||
| title_crew.head() | ||
| ``` | ||
|
|
||
| ```{python} | ||
| title_episode.head() | ||
| ``` | ||
|
|
||
| ```{python} | ||
| title_principals.head() | ||
| ``` | ||
|
|
||
| ```{python} | ||
| title_ratings.head() | ||
| ``` | ||
|
|
||
| ## Basic data exploration | ||
|
|
||
| Let's check how many records are in each table. It's just Python, so we can | ||
| construct a dictionary and iterate through it in a for loop. | ||
|
|
||
| ```{python} | ||
| tables = { | ||
| "name_basics": name_basics, | ||
| "title_akas": title_akas, | ||
| "title_basics": title_basics, | ||
| "title_crew": title_crew, | ||
| "title_episode": title_episode, | ||
| "title_principals": title_principals, | ||
| "title_ratings": title_ratings, | ||
| } | ||
| max_name_len = max(map(len, tables.keys())) + 1 | ||
| ``` | ||
|
|
||
| ```{python} | ||
| print("Length of tables:") | ||
for t in tables:
    print(f"\t{t.ljust(max_name_len)}: {tables[t].count().to_pandas():,}")
| ``` | ||
|
|
||
| ## Clean data | ||
|
|
||
| Looking at the data, the `nconst` and `tconst` columns seem to be unique | ||
| identifiers. Let's confirm and adjust them accordingly. | ||
|
|
||
| ```{python} | ||
| name_basics.head() | ||
| ``` | ||
|
|
||
| Check the number of unique `nconst` values. | ||
|
|
||
| ```{python} | ||
| name_basics.nconst.nunique() | ||
| ``` | ||
|
|
||
| Confirm it's equal to the number of rows. | ||
|
|
||
| ```{python} | ||
| name_basics.nconst.nunique() == name_basics.count() | ||
| ``` | ||
|
|
||
| Mutate the table to convert `nconst` to an integer. | ||
|
|
||
| ```{python} | ||
| t = name_basics.mutate(nconst=_.nconst.replace("nm", "").cast("int")) | ||
| t.head() | ||
| ``` | ||
|
|
||
| Let's also turn `primary_profession` into an array of strings instead of | ||
| a single comma-separated string. | ||
|
|
||
| ```{python} | ||
| t = t.mutate(primary_profession=_.primary_profession.split(",")) | ||
| t | ||
| ``` | ||
|
|
||
| And, combining the two concepts, convert `known_for_titles` into an array of | ||
| integers corresponding to `tconst` identifiers. | ||
|
|
||
| ```{python} | ||
| t = t.mutate( | ||
| known_for_titles=_.known_for_titles.split(",").map( | ||
| lambda tconst: tconst.replace("tt", "").cast("int") | ||
| ) | ||
| ) | ||
| t | ||
| ``` | ||
|
|
||
| ## DRY-ing up the code | ||
|
|
||
| We can define functions to convert `nconst` and `tconst` to integers. | ||
|
|
||
| ```{python} | ||
def nconst_to_int(nconst):
    return nconst.replace("nm", "").cast("int")


def tconst_to_int(tconst):
    return tconst.replace("tt", "").cast("int")
| ``` | ||
|
|
||
Then combine the previous data cleaning steps into a single `mutate` call.
|
|
||
| ```{python} | ||
| name_basics = name_basics.mutate( | ||
| nconst=nconst_to_int(_.nconst), | ||
| primary_profession=_.primary_profession.split(","), | ||
| known_for_titles=_.known_for_titles.split(",").map(tconst_to_int), | ||
| ) | ||
| name_basics | ||
| ``` | ||
|
|
||
| We can use `ibis.to_sql` to see the SQL this generates. | ||
|
|
||
| ```{python} | ||
| ibis.to_sql(name_basics) | ||
| ``` | ||
|
|
||
| Clean the rest of the tables. We'll convert `nconst` and `tconst` columns | ||
| consistently to allow for easy joining. | ||
|
|
||
| ```{python} | ||
| title_akas = title_akas.mutate(title_id=tconst_to_int(_.title_id)).rename( | ||
| tconst="title_id" | ||
| ) | ||
| title_basics = title_basics.mutate(tconst=tconst_to_int(_.tconst)) | ||
| title_crew = title_crew.mutate( | ||
| tconst=tconst_to_int(_.tconst), | ||
| directors=_.directors.split(",").map(nconst_to_int), | ||
| writers=_.writers.split(",").map(nconst_to_int), | ||
| ) | ||
| title_episode = title_episode.mutate( | ||
| tconst=tconst_to_int(_.tconst), parent_tconst=tconst_to_int(_.parent_tconst) | ||
| ) | ||
| title_principals = title_principals.mutate( | ||
| tconst=tconst_to_int(_.tconst), nconst=nconst_to_int(_.nconst) | ||
| ) | ||
| title_ratings = title_ratings.mutate(tconst=tconst_to_int(_.tconst)) | ||
| ``` | ||
|
|
||
| ## Finding good (and bad) movies to watch | ||
|
|
||
| Join the IMDB rankings with information about the movies. | ||
|
|
||
| ```{python} | ||
| joined = title_basics.join(title_ratings, "tconst") | ||
| joined | ||
| ``` | ||
|
|
||
| ```{python} | ||
| joined.title_type.value_counts().order_by(_.title_type_count.desc()) | ||
| ``` | ||
|
|
||
| Filter down to movies. | ||
|
|
||
| ```{python} | ||
| joined = joined.filter(_.title_type == "movie") | ||
| joined | ||
| ``` | ||
|
|
||
| Reorder the columns and drop some. | ||
|
|
||
| ```{python} | ||
| joined = joined.select( | ||
| "tconst", | ||
| "primary_title", | ||
| "average_rating", | ||
| "num_votes", | ||
| "genres", | ||
| "runtime_minutes", | ||
| ) | ||
| joined | ||
| ``` | ||
|
|
||
| Sort by the average rating. | ||
|
|
||
| ```{python} | ||
| joined = joined.order_by([_.average_rating.desc(), _.num_votes.desc()]) | ||
| joined | ||
| ``` | ||
|
|
||
| A lot of 10/10 movies I haven't heard of … let's filter to movies with at least | ||
| `N` votes. | ||
|
|
||
| ```{python} | ||
| N = 50000 | ||
| joined = joined.filter(_.num_votes > N) | ||
| joined | ||
| ``` | ||
|
|
||
| What if you're in the mood for a bad movie? | ||
|
|
||
| ```{python} | ||
| joined = joined.order_by([_.average_rating.asc(), _.num_votes.desc()]) | ||
| joined | ||
| ``` | ||
|
|
||
| And specifically a bad comedy? | ||
|
|
||
| ```{python} | ||
| joined = joined.filter(_.genres.contains("Comedy")) | ||
| joined | ||
| ``` | ||
|
|
||
| Perfect! | ||
|
|
||
| ## Next Steps | ||
|
|
||
| We only used two of the IMDB tables. What else can we do with the rest of the | ||
| data? Play around and let us know! |
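
As one possible starting point, here is a sketch that joins `title_principals` and
`name_basics` onto our current result to list the people credited on its first title
(it assumes the column names produced by the cleaning steps above):

```python
first_title = joined.limit(1)
(
    title_principals.join(first_title, "tconst")
    .join(name_basics, "nconst")
    .select("primary_title", "primary_name", "category")
)
```
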
| --- | ||
| title: Write and execute unbound expressions | ||
| --- | ||
|
|
||
| One of the most powerful features of Ibis is the separation of transformation | ||
| logic from the execution engine, which allows you to "write once, execute | ||
| everywhere". | ||
|
|
||
| ## Unbound tables | ||
|
|
||
| In Ibis, you can define unbound tables. An unbound table is a table with a | ||
| specified schema but not connected to a data source. You can think of it as an | ||
| empty spreadsheet with just the header. Even though the spreadsheet is empty, | ||
| you know what the data would look like. | ||
|
|
||
| Unbound tables allow you to write transformations for data as long as it | ||
| conforms to the provided schema. You don't need to connect to a data source | ||
| until you're ready to execute the expression and compute outputs. | ||
|
|
||
| ## Execute an unbound expression | ||
|
|
||
| Here's how we can define an unbound table in Ibis: | ||
|
|
||
| ```{python} | ||
| import ibis | ||
| schema = { | ||
| "carat": "float64", | ||
| "cut": "string", | ||
| "color": "string", | ||
| "clarity": "string", | ||
| "depth": "float64", | ||
| "table": "float64", | ||
| "price": "int64", | ||
| "x": "float64", | ||
| "y": "float64", | ||
| "z": "float64", | ||
| } | ||
| diamonds = ibis.table(schema, name="diamonds") | ||
| diamonds | ||
| ``` | ||
|
|
||
| So far, we have an empty `diamonds` table that contains 10 columns. Even though | ||
| there is no data in the `diamonds` table right now, we can write | ||
| transformations knowing that these are the columns available to us. | ||
|
|
||
| Given this table of diamonds of various carats, cuts, and colors, we're | ||
| interested in learning the average carat for each color of premium and ideal | ||
| diamonds. In order to do this, we can first calculate the average carat for | ||
| each color and cut of diamonds, then make a pivot table to show the results: | ||
|
|
||
| ```{python} | ||
| from ibis import _ | ||
expr = (
    diamonds.group_by(["cut", "color"])
    .agg(carat=_.carat.mean())
    .pivot_wider(
        names_from="cut",
        names=("Premium", "Ideal"),
        values_from="carat",
        values_agg="mean",
        names_sort=True,
    )
)
| ``` | ||
|
|
||
Now that we're ready to compute results, we can connect to any of Ibis'
supported backends. The transformation logic above can be reused as-is; you don't
need to modify it for each backend.
|
|
||
| This is a dataset that we can process locally. Let's connect to DuckDB and load | ||
| the data into a DuckDB table: | ||
|
|
||
| ```{python} | ||
| parquet_dir = "diamonds.parquet" | ||
| # download data into a local file | ||
| ibis.examples.diamonds.fetch().to_parquet(parquet_dir) | ||
| con = ibis.duckdb.connect() | ||
| con.read_parquet(parquet_dir, table_name="diamonds") | ||
| ``` | ||
|
|
||
| Connecting to this DuckDB table and executing the transformation on the loaded | ||
| data is now as simple as | ||
|
|
||
| ```{python} | ||
| con.to_pandas(expr) | ||
| ``` | ||
|
|
||
Voilà!
|
|
||
| If you want to continue to work with the data in DuckDB, you can create a new | ||
| table and insert the outputs into it like so: | ||
|
|
||
| ```{python} | ||
| output_schema = ibis.schema( | ||
| { | ||
| "color": "string", | ||
| "Ideal": "float64", | ||
| "Premium": "float64", | ||
| } | ||
| ) | ||
| con.create_table("results", schema=output_schema) | ||
| con.insert("results", expr) | ||
| con.table("results").to_pandas() | ||
| ``` | ||
|
|
||
| ## Execute on another backend | ||
|
|
||
| Because Ibis separates the transformation logic from the execution engine, you | ||
| can easily reuse the written transformation for another backend. Here we use | ||
| Polars as an example, but you can do the same for any of Ibis' 20+ supported | ||
| backends as long as that particular backend supports the operations | ||
| (see [the operation support matrix](../../support_matrix.qmd)). | ||
|
|
||
| ```{python} | ||
| pl = ibis.polars.connect() | ||
| pl.read_parquet(parquet_dir, table_name="diamonds") | ||
| pl.to_pandas(expr) | ||
| ``` |
| --- | ||
| title: Read parquet files with Ibis | ||
| --- | ||
|
|
||
| In this example, we will use Ibis's DuckDB backend to analyze data from | ||
| a remote parquet source using `ibis.read_parquet`. `ibis.read_parquet` can also | ||
| read local parquet files, and there are other `ibis.read_*` functions that | ||
| conveniently return a table expression from a file. One such function is | ||
| `ibis.read_csv`, which reads from local and remote CSV. | ||
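
For instance, a minimal sketch of `ibis.read_csv`; the path below is a placeholder,
and remote URLs such as `https://` or `s3://` work as well:

```python
import ibis

t = ibis.read_csv("path/to/data.csv")
```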
|
|
||
| We will be reading from the [**Global Biodiversity Information Facility (GBIF) | ||
| Species Occurrences**](https://registry.opendata.aws/gbif/) dataset. It is | ||
hosted on S3 at `s3://gbif-open-data-us-east-1/occurrence/`.
|
|
||
| ## Reading One Partition | ||
|
|
||
| We can read a single partition by specifying its path. | ||
|
|
||
| We do this by calling | ||
| [`read_parquet`](https://ibis-project.org/api/expressions/top_level/#ibis.read_parquet) | ||
| on the partition we care about. | ||
|
|
||
So to read the first partition in this dataset, we'll call `read_parquet` on
`000000` in that path:
|
|
||
| ```{python} | ||
| import ibis | ||
| t = ibis.read_parquet( | ||
| "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/000000" | ||
| ) | ||
| t | ||
| ``` | ||
|
|
||
| Note that we're calling `read_parquet` and receiving a table expression without | ||
| establishing a connection first. Ibis spins up a DuckDB connection (or | ||
| whichever default backend you have) when you call `ibis.read_parquet` (or even | ||
| `ibis.read_csv`). | ||
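
If you want to see or change which backend the `read_*` functions use, here is a
minimal sketch (assuming a recent Ibis version):

```python
import ibis

ibis.get_backend()            # the current default backend: DuckDB unless configured otherwise
# ibis.set_backend("polars")  # optionally switch the default, e.g. to Polars (if installed)
```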
|
|
||
| Since our result, `t`, is a table expression, we can now run queries against | ||
| the file using Ibis expressions. For example, we can select columns, filter the | ||
| file, and then view the first five rows of the result: | ||
|
|
||
| ```{python} | ||
| cols = [ | ||
| "gbifid", | ||
| "datasetkey", | ||
| "occurrenceid", | ||
| "kingdom", | ||
| "phylum", | ||
| "class", | ||
| "order", | ||
| "family", | ||
| "genus", | ||
| "species", | ||
| "day", | ||
| "month", | ||
| "year", | ||
| ] | ||
| t.select(cols).filter(t["family"].isin(["Corvidae"])).limit(5).to_pandas() | ||
| ``` | ||
|
|
||
| We can count the rows in the table (partition): | ||
|
|
||
| ```{python} | ||
| t.count().to_pandas() | ||
| ``` | ||
|
|
||
| ## Reading all partitions: filter, aggregate, export | ||
|
|
||
We can use `read_parquet` to read the entire partitioned dataset by globbing all
of its partitions:
|
|
||
| ```{python} | ||
| t = ibis.read_parquet( | ||
| "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/*" | ||
| ) | ||
| ``` | ||
|
|
||
| Since the function returns a table expression, we can perform valid selections, | ||
| filters, aggregations, and exports just as we could with any other table | ||
| expression: | ||
|
|
||
| ```{python} | ||
| df = ( | ||
| t.select(["gbifid", "family", "species"]) | ||
| .filter(t["family"].isin(["Corvidae"])) | ||
| # Here we limit by 10,000 to fetch a quick batch of results | ||
| .limit(10000) | ||
| .group_by("species") | ||
| .count() | ||
| .to_pandas() | ||
| ) | ||
| df | ||
| ``` |
| --- | ||
| title: Loading Google Cloud Storage files with DuckDB | ||
| --- | ||
|
|
||
Loading data stored in Google Cloud Storage into DuckDB requires registering a `gcs` filesystem with the connection.
|
|
||
| ```python | ||
| from ibis.interactive import * | ||
|
|
||
| con = ibis.duckdb.connect() | ||
|
|
||
import fsspec
import gcsfs  # provides the "gcs" protocol that fsspec uses below
|
|
||
| gcs = fsspec.filesystem("gcs") | ||
|
|
||
| con.register_filesystem(gcs) | ||
|
|
||
| t = con.read_json("gs://...") | ||
| ``` |
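
If the bucket is not public, you can pass credentials to `gcsfs` explicitly. This is a
hedged sketch: the token path and bucket below are placeholders.

```python
gcs = fsspec.filesystem("gcs", token="path/to/service-account.json")
con.register_filesystem(gcs)

t = con.read_json("gs://my-bucket/data/*.json")
```
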
| # GraphViz + Ibis | ||
|
|
||
| Ibis supports visualizing an expression as a [directed | ||
| graph](https://en.wikipedia.org/wiki/Directed_graph) using | ||
| [GraphViz](https://graphviz.org/). | ||
|
|
||
| To get started, make sure you've installed the necessary dependencies. | ||
|
|
||
| ```sh | ||
| $ pip install 'ibis-framework[duckdb,examples,visualization]' | ||
| ``` | ||
|
|
||
| ::: {.callout-note collapse="true"} | ||
| ## Replace `duckdb` with your backend of choice | ||
|
|
||
| For instance, you can replace `duckdb` with `snowflake` if you want to use the | ||
| [Snowflake backend](../../backends/snowflake.qmd). | ||
| ::: | ||
|
|
||
|
|
||
| Let's run through a few examples. | ||
|
|
||
| First we'll import the things we need. | ||
|
|
||
| ```{python} | ||
| import ibis | ||
| from ibis import _ | ||
| from ibis.expr.visualize import to_graph | ||
| ``` | ||
|
|
||
| Now we can visualize an expression graph. | ||
|
|
||
| Here's a call to `select`. | ||
|
|
||
| ```{python} | ||
| t = ibis.examples.penguins.fetch() | ||
| expr = t.select(lowered=_.species.lower()) | ||
| to_graph(expr) | ||
| ``` | ||
|
|
||
| The way to read the graph is from top to bottom. | ||
|
|
||
| * The top of the graph contains Ibis operations with no dependents. | ||
| * The edges encode dependencies from one node to another. | ||
| * The bold text in the rectangles is the name of the Ibis operation class. | ||
| * The bottom of the graph contains Ibis operations with no dependencies. | ||
| * If this were a data flow graph, data would flow from bottom to top. | ||
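
If you want to save a graph to a file instead of (or in addition to) displaying it
inline, the object returned by `to_graph` is a `graphviz.Digraph`, so the standard
Graphviz Python API applies. This assumes the Graphviz `dot` executable is installed:

```python
graph = to_graph(expr)
graph.render("expr_graph", format="png", cleanup=True)  # writes expr_graph.png
```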
|
|
||
| Let's look at a more complex example: `group_by`. | ||
|
|
||
| ```{python} | ||
| expr = ( | ||
| t.group_by(_.species) | ||
| .agg( | ||
| bill_depth_avg=_.bill_depth_mm.mean(), | ||
| bill_length_avg=_.bill_length_mm.mean(), | ||
| ) | ||
| ) | ||
| to_graph(expr) | ||
| ``` | ||
|
|
||
| Switching gears, let's look at a `join` and show customization of node and edge | ||
| attributes. | ||
|
|
||
| ```{python} | ||
| left = ibis.table(dict(a="int64", b="string"), name="left") | ||
| right = ibis.table(dict(b="string", c="int64", d="string"), name="right") | ||
| expr = ( | ||
| left.inner_join(right, "b") | ||
| .select(left.a, b=right.c, c=right.d) | ||
| .mutate(arrays=ibis.array([1, 2, 3])) | ||
| ) | ||
| to_graph( | ||
| expr, | ||
| label_edges=True, | ||
| node_attr={"shape": "hexagon", "color": "green", "fontname": "Roboto Mono"}, | ||
| edge_attr={"fontsize": "12", "fontname": "Comic Sans MS"} | ||
| ) | ||
| ``` | ||
|
|
||
| Please try out `to_graph` and give us feedback on | ||
| [Zulip](https://ibis-project.zulipchat.org) or in a [GitHub | ||
| issue](https://github.com/ibis-project/ibis/issues/new/choose)! |
| [ | ||
| { | ||
| "origin": ["*"], | ||
| "method": ["GET"], | ||
| "responseHeader": ["Content-Type"], | ||
| "maxAgeSeconds": 3600 | ||
| } | ||
| ] |
| { | ||
| "PipliteAddon": { | ||
| "piplite_urls": [ | ||
| "https://duckdb.github.io/duckdb-pyodide/wheels/duckdb-0.10.2-cp311-cp311-emscripten_3_1_46_wasm32.whl", | ||
| "dist/ibis_framework-8.0.0-py3-none-any.whl" | ||
| ] | ||
| } | ||
| } |
| --- | ||
| title: "Ibis goes real-time! Introducing the new Flink backend for Ibis" | ||
| author: "Deepyaman Datta" | ||
| date: "2024-02-12" | ||
| categories: | ||
| - blog | ||
| - flink | ||
| - stream processing | ||
| --- | ||
|
|
||
| ## Introduction | ||
|
|
||
| Ibis 8.0 marks the official release of the Apache Flink backend for Ibis. Ibis | ||
| users can now manipulate data across streaming and batch contexts using the same | ||
| interface. Flink is one of the most established stream-processing frameworks out | ||
| there and a central part of the real-time data infrastructure at companies like | ||
| DoorDash, LinkedIn, Netflix, and Uber. It is commonly applied in use cases such | ||
| as fraud detection, anomaly detection, real-time recommendation, dynamic | ||
| pricing, and online advertising. The Flink backend is also the first streaming | ||
| backend Ibis supports. Follow along as we define and execute a simple streaming | ||
| job using Ibis! | ||
|
|
||
| ## Installation prerequisites | ||
|
|
||
| * **Docker Compose:** This tutorial uses Docker Compose to manage an Apache | ||
| Kafka environment (including sample data generation) and a Flink cluster (for | ||
| [remote execution](#remote-execution)). You can [download and install Docker | ||
| Compose from the official website](https://docs.docker.com/compose/install/). | ||
| * **JDK 11:** Flink requires Java 11. If you don't already have JDK 11 | ||
| installed, you can [get the appropriate Eclipse Temurin | ||
| release](https://adoptium.net/temurin/releases/?package=jdk&version=11). | ||
| * **Python:** To follow along, you need Python 3.9 or 3.10. | ||
|
|
||
| ## Installing the Flink backend for Ibis | ||
|
|
||
| We use a Python client to explore data in Kafka topics. You can install it, | ||
| alongside the Flink backend for Ibis, with `pip`, `conda`, `mamba`, or `pixi`: | ||
|
|
||
| ::: {.panel-tabset} | ||
|
|
||
| ## Using `pip` | ||
|
|
||
| ```bash | ||
| pip install ibis-framework apache-flink kafka-python | ||
| ``` | ||
|
|
||
| ## Using `conda` | ||
|
|
||
| ```bash | ||
| conda install -c conda-forge ibis-flink | ||
| ``` | ||
|
|
||
| ## Using `mamba` | ||
|
|
||
| ```bash | ||
| mamba install -c conda-forge ibis-flink | ||
| ``` | ||
|
|
||
| ## Using `pixi` | ||
|
|
||
| ```bash | ||
| pixi add ibis-flink | ||
| ``` | ||
|
|
||
| ::: | ||
|
|
||
| ## Spinning up the services using Docker Compose | ||
|
|
||
| The [ibis-project/ibis-flink-example GitHub | ||
| repository](https://github.com/ibis-project/ibis-flink-example) includes the | ||
| relevant Docker Compose configuration for this tutorial. Clone the repository, | ||
| and run `docker compose up` from the cloned directory to create Kafka topics, | ||
| generate sample data, and launch a Flink cluster: | ||
|
|
||
| ```bash | ||
git clone https://github.com/ibis-project/ibis-flink-example.git
| cd ibis-flink-example | ||
| docker compose up | ||
| ``` | ||
|
|
||
| ::: {.callout-tip} | ||
| If you don't intend to try [remote execution](#remote-execution), you can start | ||
| only the Kafka-related services with `docker compose up kafka init-kafka | ||
| data-generator`. | ||
| ::: | ||
|
|
||
| After a few seconds, you should see messages indicating your Kafka environment | ||
| is ready: | ||
|
|
||
| ```bash | ||
| ibis-flink-example-init-kafka-1 | Successfully created the following topics: | ||
| ibis-flink-example-init-kafka-1 | payment_msg | ||
| ibis-flink-example-init-kafka-1 | sink | ||
| ibis-flink-example-init-kafka-1 exited with code 0 | ||
| ibis-flink-example-data-generator-1 | Connected to Kafka | ||
| ibis-flink-example-data-generator-1 | Producing 20000 records to Kafka topic payment_msg | ||
| ``` | ||
|
|
||
| This example uses mock payments data. The `payment_msg` Kafka topic contains | ||
| messages in the following format: | ||
|
|
||
| ```json | ||
| { | ||
| "createTime": "2023-09-20 22:19:02.224", | ||
| "orderId": 1695248388, | ||
| "payAmount": 88694.71922270155, | ||
| "payPlatform": 0, | ||
| "provinceId": 6 | ||
| } | ||
| ``` | ||
|
|
||
| In a separate terminal, we can explore what these messages look like: | ||
|
|
||
| ```{python} | ||
| from itertools import islice | ||
| from kafka import KafkaConsumer | ||
| consumer = KafkaConsumer("payment_msg") | ||
for msg in islice(consumer, 3):
    print(msg)
| ``` | ||
|
|
||
| ## Running the tutorial | ||
|
|
||
| This tutorial uses Ibis with the Flink backend to process the aforementioned | ||
| payment messages. You can choose to either [run it locally](#local-execution) or | ||
| [submit a job to an already-running Flink cluster](#remote-execution). | ||
|
|
||
| ### Local execution | ||
|
|
||
| The simpler option is to run the example using the Flink mini cluster. | ||
|
|
||
| #### Create a table environment | ||
|
|
||
| The [table | ||
| environment](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/python/table/table_environment/) | ||
| serves as the main entry point for interacting with the Flink runtime. The | ||
| `flink` backend does not create `TableEnvironment` objects; you must create a | ||
| `TableEnvironment` and pass that to | ||
| [`ibis.flink.connect`](../../backends/flink.qmd#ibis.flink.connect): | ||
|
|
||
| ```{python} | ||
| import ibis | ||
| from pyflink.table import EnvironmentSettings, TableEnvironment | ||
| env_settings = EnvironmentSettings.in_streaming_mode() | ||
| table_env = TableEnvironment.create(env_settings) | ||
| table_env.get_config().set("parallelism.default", "1") # <1> | ||
| con = ibis.flink.connect(table_env) | ||
| ``` | ||
| 1. write all the data to one file | ||
|
|
||
| Flink’s streaming connectors aren't part of the binary distribution. Link the | ||
| [Kafka | ||
| connector](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/connectors/table/kafka/) | ||
| for cluster execution by adding the JAR file from the cloned repository. [Ibis | ||
| exposes the `raw_sql` method for situations like this, where you need to run | ||
| arbitrary SQL that cannot be modeled as a table | ||
| expression](https://ibis-project.org/how-to/extending/sql#backend.raw_sql): | ||
|
|
||
| ```{python} | ||
| #| echo: false | ||
| #| output: false | ||
| import urllib.request | ||
| urllib.request.urlretrieve( | ||
| "https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka/3.0.2-1.18/flink-sql-connector-kafka-3.0.2-1.18.jar", | ||
| filename="flink-sql-connector-kafka-3.0.2-1.18.jar", | ||
| ) | ||
| ``` | ||
|
|
||
| ```{python} | ||
| #| output: false | ||
| con.raw_sql("ADD JAR 'flink-sql-connector-kafka-3.0.2-1.18.jar'") | ||
| ``` | ||
|
|
||
| #### Create the source and sink tables | ||
|
|
||
| Use | ||
| [`create_table`](../../backends/flink.qmd#ibis.backends.flink.Backend.create_table) | ||
| to register tables. Notice the new top-level `ibis.watermark` API for | ||
| [specifying a watermark | ||
| strategy](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/concepts/time/#event-time-and-watermarks). | ||
|
|
||
| ```{python} | ||
| source_schema = ibis.schema( # <1> | ||
| { # <1> | ||
| "createTime": "timestamp(3)", # <1> | ||
| "orderId": "int64", # <1> | ||
| "payAmount": "float64", # <1> | ||
| "payPlatform": "int32", # <1> | ||
| "provinceId": "int32", # <1> | ||
| } # <1> | ||
| ) # <1> | ||
| source_configs = { # <1> | ||
| "connector": "kafka", # <1> | ||
| "topic": "payment_msg", # <1> | ||
| "properties.bootstrap.servers": "localhost:9092", # <1> | ||
| "properties.group.id": "test_3", # <1> | ||
| "scan.startup.mode": "earliest-offset", # <1> | ||
| "format": "json", # <1> | ||
| } # <1> | ||
| t = con.create_table( # <1> | ||
| "payment_msg", # <1> | ||
| schema=source_schema, # <1> | ||
| tbl_properties=source_configs, # <1> | ||
| watermark=ibis.watermark( # <1> | ||
| time_col="createTime", allowed_delay=ibis.interval(seconds=15) # <1> | ||
| ), # <1> | ||
| ) # <1> | ||
| sink_schema = ibis.schema( # <2> | ||
| { # <2> | ||
| "province_id": "int32", # <2> | ||
| "pay_amount": "float64", # <2> | ||
| } # <2> | ||
| ) # <2> | ||
| sink_configs = { # <2> | ||
| "connector": "kafka", # <3> | ||
| "topic": "sink", # <2> | ||
| "properties.bootstrap.servers": "localhost:9092", # <2> | ||
| "format": "json", # <2> | ||
| } # <2> | ||
| con.create_table( # <2> | ||
| "total_amount_by_province_id", schema=sink_schema, tbl_properties=sink_configs # <2> | ||
| ) # <2> | ||
| ``` | ||
| 1. create source Table | ||
| 2. create sink Table | ||
|
|
||
| #### Perform calculations | ||
|
|
||
| Compute the total pay amount per province in the past 10 seconds (as of each | ||
| message, for the province in the incoming message): | ||
|
|
||
| ```{python} | ||
| agged = t.select( | ||
| province_id=t.provinceId, | ||
| pay_amount=t.payAmount.sum().over( | ||
| range=(-ibis.interval(seconds=10), 0), | ||
| group_by=t.provinceId, | ||
| order_by=t.createTime, | ||
| ), | ||
| ) | ||
| ``` | ||
|
|
||
| Finally, emit the query result to the sink table: | ||
|
|
||
| ```{python} | ||
| con.insert("total_amount_by_province_id", agged) | ||
| ``` | ||
|
|
||
| ### Remote execution | ||
|
|
||
| You can also submit the example to the [remote cluster started using Docker | ||
| Compose](#spinning-up-the-services-using-docker-compose). The | ||
| `window_aggregation.py` file in the cloned repository contains the [same steps | ||
| that we performed for local execution](#local-execution). We will [use the | ||
| method described in the official Flink | ||
| documentation](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/cli/#submitting-pyflink-jobs). | ||
|
|
||
| ::: {.callout-tip} | ||
| You can find the `./bin/flink` executable with the following command: | ||
|
|
||
| ```bash | ||
| python -c'from pathlib import Path; import pyflink; print(Path(pyflink.__spec__.origin).parent / "bin" / "flink")' | ||
| ``` | ||
| ::: | ||
|
|
||
| My full command looks like this: | ||
|
|
||
| ```bash | ||
| /opt/miniconda3/envs/ibis-dev/lib/python3.10/site-packages/pyflink/bin/flink run --jobmanager localhost:8081 --python window_aggregation.py | ||
| ``` | ||
|
|
||
| The command will exit after displaying a submission message: | ||
|
|
||
| ``` | ||
| Job has been submitted with JobID b816faaf5ef9126ea5b9b6a37012cf56 | ||
| ``` | ||
|
|
||
| ## Viewing the results | ||
|
|
||
| Similar to how we viewed messages in the `payment_msg` topic, we can print | ||
| results from the `sink` topic: | ||
|
|
||
| ```{python} | ||
| consumer = KafkaConsumer("sink") | ||
for msg in islice(consumer, 10):
    print(msg)
| ``` | ||
|
|
||
Voilà! You've run your first streaming application using Ibis.
|
|
||
| ## Shutting down the Compose environment | ||
|
|
||
| Press <kbd>Ctrl</kbd>+<kbd>C</kbd> to stop the Docker Compose containers. Once | ||
| stopped, run `docker compose down` to remove the services created for this | ||
| tutorial. |