344 changes: 344 additions & 0 deletions docs/posts/1brc/index.qmd
@@ -0,0 +1,344 @@
---
title: "Using one Python dataframe API to take the billion row challenge with DuckDB, Polars, and DataFusion"
author: "Cody"
date: "2024-01-22"
categories:
- blog
- duckdb
- polars
- datafusion
- portability
---

## Overview

This is an implementation of the [The One Billion Row
Challenge](https://www.morling.dev/blog/one-billion-row-challenge/):

> Let’s kick off 2024 true coder style—​I’m excited to announce the One Billion
> Row Challenge (1BRC), running from Jan 1 until Jan 31.
> Your mission, should you decide to accept it, is deceptively simple: write a
> Java program for retrieving temperature measurement values from a text file and
> calculating the min, mean, and max temperature per weather station. There’s just
> one caveat: the file has 1,000,000,000 rows!

I haven't written Java since dropping a computer science course in my second year
of college that forced us to do functional programming exclusively in Java.
However, I'll gladly take the challenge in Python using Ibis! In fact, I did
something like this (generating a billion rows with 26 columns of random numbers
and doing basic aggregations) to test out DuckDB and Polars.

In this blog, we'll demonstrate how Ibis provides a single Python dataframe API
to take the billion row challenge with DuckDB, Polars, and DataFusion.

## Setup

We need to generate the data from the challenge. First, clone the
[repo](https://github.com/gunnarmorling/1brc):

```{.bash}
gh repo clone gunnarmorling/1brc
```

Then change into the Python directory and run the generation script with the
number of rows you want to generate:

```{.bash}
cd 1brc/src/main/python
python create_measurements.py 1_000_000_000
```

This will generate a file called `measurements.txt` in the `data` directory at
the root of the repo. It is 15GB on disk:

```{.bash}
(venv) cody@voda 1brc % du 1brc/data/*
15G 1brc/data/measurements.txt
808K 1brc/data/weather_stations.csv
```

The file consists of one billion rows with two columns separated by a semicolon:

```{.bash}
(venv) cody@voda 1brc % head 1brc/data/measurements.txt
Kusugal;-67.2
Ipil;-88.6
Sohna;-31.2
Lubuagan;-2.3
Szentes;29.2
Sylvan Lake;-70.7
Ambato;-35.2
Berkine;97.0
Wernau;73.4
Kennewick;-19.9
```

Also, you'll need to install Ibis with the three backends we'll use:

```{.bash}
pip install 'ibis-framework[duckdb,polars,datafusion]'
```

## Understanding Ibis

Ibis provides a standard dataframe API decoupled from the execution engine. It
compiles Ibis expressions to a form of intermediary representation (often SQL)
that can be executed by different backends.

This allows us to write a single Ibis expression to complete the challenge with
many different execution engine backends.
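
For example, we can compile the same aggregation to SQL for different backends
without changing the expression. Here's a minimal sketch (the unbound table just
mirrors the challenge's schema; it isn't part of the challenge code):

```{.python}
import ibis

t = ibis.table({"station": "string", "temperature": "float64"}, name="measurements")
expr = t.group_by("station").agg(max_temp=t.temperature.max())

# compile the same expression to a backend's SQL dialect
print(ibis.to_sql(expr, dialect="duckdb"))
```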

:::{.callout-warning}
While Ibis does its best to abstract away the differences between backends, this
cannot be done in some areas like data input and output. For example, the
`read_csv` functions across the various backends (in their SQL and Python forms) take
different parameters. We'll handle that with different `kwargs` dictionaries for
these backends in this post.

In general, besides creating a connection and data input/output, the Ibis API is
the same across backends.
:::

## Completing the challenge thrice

We'll use three great options for local backends -- DuckDB, Polars, and
DataFusion -- to complete the challenge.

### Setup

Before we get started, we'll make some imports, turn on interactive mode, and
define a `kwargs` dictionary for each backend, matching the parameters of its
`read_csv` function:

```{python}
import ibis
import polars as pl
import pyarrow as pa

ibis.options.interactive = True

duckdb_kwargs = {
    "delim": ";",
    "header": False,
    "columns": {"station": "VARCHAR", "temperature": "DOUBLE"},
}

polars_kwargs = {
    "separator": ";",
    "has_header": False,
    "new_columns": ["station", "temperature"],
    "schema": {"station": pl.Utf8, "temperature": pl.Float64},
}

datafusion_kwargs = {
    "delimiter": ";",
    "has_header": False,
    "schema": pa.schema(
        [
            (
                "station",
                pa.string(),
            ),
            (
                "temperature",
                pa.float64(),
            ),
        ]
    ),
    "file_extension": ".txt",
}
```

Let's define a function to run the same code with each backend to complete the challenge:

```{python}
def run_challenge(t):
    res = (
        t.group_by(ibis._.station)
        .agg(
            min_temp=ibis._.temperature.min(),
            mean_temp=ibis._.temperature.mean(),
            max_temp=ibis._.temperature.max(),
        )
        .order_by(ibis._.station.desc())
    )
    return res
```

### Completing the challenge

Let's complete the challenge with each backend.

:::{.callout-note}
The results are the same across backends but look suspicious. The repository
notes that the Python generation code is "unofficial", so it may have some
problems. Given this is a contrived example using generated data, I'm not going
to worry about it.

The point is that we can easily complete the challenge with the same code across
many backends, letting them worry about the details of execution. For this
reason, I'm also not providing execution times. Try it out yourself!
:::

::: {.panel-tabset}

## DuckDB

First let's set the backend to DuckDB (redundantly since it's the default) and
the `kwargs` dictionary:

```{python}
ibis.set_backend("duckdb") # <1>
kwargs = duckdb_kwargs
```

```{python}
# | code-fold: true
# | echo: false
_ = ibis.get_backend().raw_sql("set enable_progress_bar = false")
```

1. Redundant given DuckDB is the default

Next, we'll read in the data and take a look at the table:

```{python}
t = ibis.read_csv("1brc/data/measurements.txt", **kwargs)
t.limit(3)
```

Then let's confirm it's **a billion** rows:

```{python}
f"{t.count().to_pandas():,}"
```

Finally, we'll compute the min, mean, and max temperature per weather station:

```{python}
res = run_challenge(t)
res
```

## Polars

First let's set the backend to Polars and the `kwargs` dictionary:

```{python}
ibis.set_backend("polars") # <1>
kwargs = polars_kwargs
```

1. Set Polars as the default backend used

Next, we'll read in the data and take a look at the table:

```{python}
t = ibis.read_csv("1brc/data/measurements.txt", **kwargs)
t.limit(3)
```

Then let's confirm it's **a billion** rows:

```{python}
f"{t.count().to_pandas():,}"
```

Finally, we'll compute the min, mean, and max temperature per weather station:

```{python}
res = run_challenge(t)
res
```

## DataFusion

First let's set the backend to DataFusion and the `kwargs` dictionary:

```{python}
ibis.set_backend("datafusion") # <1>
kwargs = datafusion_kwargs
```

1. Set DataFusion as the default backend used

Next, we'll read in the data and take a look at the table:

```{python}
t = ibis.read_csv("1brc/data/measurements.txt", **kwargs)
t.limit(3)
```

Then let's confirm it's **a billion** rows:

```{python}
f"{t.count().to_pandas():,}"
```

Finally, we'll compute the min, mean, and max temperature per weather station:

```{python}
res = run_challenge(t)
res
```

:::

## Conclusion

While the one billion row challenge isn't a great benchmark, it's a fun way to
demonstrate how Ibis provides a single Python dataframe API that works across
DuckDB, Polars, and DataFusion. Feel free to try it out with other backends!

Happy coding!

## Bonus: more billion row data generation

While we're here, I'll share the code I've used in the past to generate a
billion rows of random data:

```{.python}
import ibis

con = ibis.connect("duckdb://data.ddb")

ROWS = 1_000_000_000

sql_str = ""
sql_str += "select\n"
for c in list(map(chr, range(ord("a"), ord("z") + 1))):
    sql_str += f" random() as {c},\n"
sql_str += f"from generate_series(1, {ROWS})"

t = con.sql(sql_str)
con.create_table("billion", t, overwrite=True)
```

Nowadays I'd convert that to an Ibis expression:

:::{.callout-note}
This gives a slightly different result with a monotonically increasing index
column, but I prefer it anyway. You could drop that column (see the snippet
after the code block) or adjust the expression.
:::

```{.python}
import ibis

con = ibis.connect("duckdb://data.ddb")

ROWS = 1_000_000_000

t = (
    ibis.range(ROWS)
    .unnest()
    .name("index")
    .as_table()
    .mutate(**{c: ibis.random() for c in list(map(chr, range(ord("a"), ord("z") + 1)))})
)
con.create_table("billion", t, overwrite=True)
```
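
If you don't want the extra `index` column, a minimal tweak is to drop it before
creating the table:

```{.python}
con.create_table("billion", t.drop("index"), overwrite=True)
```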

But if you do need to construct a programmatic SQL string, it's cool that you
can!
614 changes: 614 additions & 0 deletions docs/posts/backend-agnostic-arrays/index.qmd

Large diffs are not rendered by default.

Binary file added docs/posts/ibis-analytics/dag.png
2,109 changes: 1,883 additions & 226 deletions docs/posts/ibis-analytics/index.qmd

Large diffs are not rendered by default.

Binary file modified docs/posts/ibis-analytics/motherduck.png
Binary file modified docs/posts/ibis-analytics/thumbnail.png
Binary file modified docs/posts/ibis-analytics/top.png
201 changes: 201 additions & 0 deletions docs/posts/ibis-duckdb-geospatial-dev-guru/index.qmd
@@ -0,0 +1,201 @@
---
title: "Geospatial analysis with Ibis and DuckDB (redux)"
author: Naty Clementi and Gil Forsyth
date: 2024-01-16
categories:
- blog
- duckdb
- geospatial
execute:
freeze: false
---

Spatial Dev Guru wrote a great [tutorial](https://spatial-dev.guru/2023/12/09/geospatial-analysis-using-duckdb/)
that walks you through a step-by-step geospatial analysis of bike sharing data using DuckDB.

Ibis has support for all the geospatial functions used in the tutorial, and we
decided to replicate it and share it with you.

## Installation

Install Ibis with the dependencies needed to work with geospatial data using DuckDB:

```bash
$ pip install 'ibis-framework[duckdb,geospatial]'
```


## Data

The parquet file used in the original tutorial is available at
https://github.com/iamgeoknight/common_datasets/tree/main/parquet. The original
data is also available from the Citi Bike
[source](https://s3.amazonaws.com/tripdata/index.html), but as a `.csv` file.

```{python}
from pathlib import Path
import tarfile
from urllib.request import urlretrieve

# Download data
url = "https://github.com/iamgeoknight/common_datasets/raw/main/parquet/202003-citibike-tripdata.tar.xz"
tar_path = Path("202003-citibike-tripdata.tar.xz")
parquet_path = Path("202003-citibike-tripdata.parquet")

if not tar_path.exists():
    urlretrieve(url, tar_path)

if not parquet_path.exists():
    with tarfile.open(tar_path, "r:xz") as t:
        t.extract("202003-citibike-tripdata.parquet")
```

Now that we have the data, we import Ibis and turn on interactive mode to
easily explore the output of our queries.

```{python}
from ibis.interactive import *
```

## Let's get started

Because this dataset does not contain any geometries, we have to load the spatial
extension. If the dataset included any geometry columns, Ibis is smart enough to
load the extension for us upon reading the data.

```{python}
con = ibis.duckdb.connect("biketrip.ddb")
con.load_extension("spatial")
# read data and rename columns to use snake case
biketrip = con.read_parquet("202003-citibike-tripdata.parquet").rename("snake_case")
biketrip
```

We have longitude and latitude information for the start and end stations, which
we can use to create geometry points and put the spatial features to use.

## Create bike trip table

In the original tutorial, Spatial Dev Guru creates a table with transformed
"Pickup" and "Dropoff" points. In DuckDB, the `st_transform` function takes points
as `YX` (latitude/longitude) by default, while in Ibis we assume data in the form
`XY` (longitude/latitude) to be consistent with PostGIS and GeoPandas.

```{python}
# Notice longitude/latitude order
pickup = _.start_station_longitude.point(_.start_station_latitude)
dropoff = _.end_station_longitude.point(_.end_station_latitude)

# convert is the equivalent of `st_transform`
biketrip = biketrip.mutate(
    pickup_point=pickup.convert("EPSG:4326", "EPSG:3857"),
    dropoff_point=dropoff.convert("EPSG:4326", "EPSG:3857"),
)
biketrip[["pickup_point", "dropoff_point"]]
```

Using `mutate`, we add two new columns to our `biketrip` table with transformed
spatial points for pickup and dropoff locations, which are in the Web Mercator projection ([EPSG:3857](https://epsg.io/3857)).

## Identify popular starts and end stations

The following queries retrieve a list of bike start and end stations with their respective trip count in descending order.

**Top 10 start stations by trip count**

```{python}
biketrip.group_by(biketrip.start_station_name).agg(trips=ibis._.count()).order_by(
    ibis.desc("trips")
)
```

Similarly, in Ibis you can use the [`topk`](https://ibis-project.org/tutorials/ibis-for-sql-users#top-k-operations) operation:

```{python}
biketrip.start_station_name.topk(10)
```

**Top 10 end stations by trip count**

```{python}
biketrip.end_station_name.topk(10)
```

## Explore trip patterns by user type

We can also calculate the average trip duration and distance traveled for each
user type. According to the [data dictionary](https://ride.citibikenyc.com/system-data), user type can be "customer" or "subscriber" where:

- Customer = 24-hour pass or 3-day pass user
- Subscriber = Annual Member

```{python}
biketrip.group_by(_.usertype).aggregate(
    avg_duration=_.tripduration.mean(),
    avg_distance=_.pickup_point.distance(_.dropoff_point).mean()
)
```

## Analyzing efficiency: trip duration vs linear distance

The original tutorial defines `efficiency_ratio` as the `trip_duration` / `linear_distance`, where a higher efficiency ratio could mean a more direct
route or faster travel times.

```{python}
# linear distance
trip_distance = biketrip.pickup_point.distance(biketrip.dropoff_point)

biketrip = biketrip.mutate(
    linear_distance=trip_distance,
    efficiency_ratio=_.tripduration / trip_distance,
)
biketrip[["pickup_point", "dropoff_point", "linear_distance", "efficiency_ratio"]]
```

Let's take a look at the table ordered by `linear_distance` in descending order, for trips that are longer than 0 meters.

```{python}
biketrip.filter(_.linear_distance > 0).order_by(ibis.desc("linear_distance"))
```

## Analyzing bike trips within a 500 meters radius

In the original tutorial, the author chooses a point (the first point in the table)
and creates a buffer of 500 m radius around it. In our table we already have
the points in meters, since we converted them in a previous query.

The following query shows all the bike trips whose pickup point falls within a
500 meter radius of the first point of the table with `long=-74.00552427` and
`lat=40.71146364`.

```{python}
# grab the first row of the data
first_point = biketrip.limit(1)

trips_within_500 = biketrip.filter(
    _.pickup_point.within(first_point.select(_.pickup_point.buffer(500)).to_array())
)
trips_within_500
```

## Acknowledgements and resources

Thank you to [Spatial Dev Guru](https://spatial-dev.guru/) for the amazing
tutorial showcasing DuckDB spatial features. It was fun to replicate the tutorial
using Ibis.

If you are interested in learning more about Ibis-DuckDB geospatial support,
here is another blog post: [Ibis + DuckDB geospatial: a match made on Earth](https://ibis-project.org/posts/ibis-duckdb-geospatial/).

Here are some resources to learn more about Ibis:

- [Ibis Docs](https://ibis-project.org/)
- [Ibis GitHub](https://github.com/ibis-project/ibis)

Chat with us on Zulip:

- [Ibis Zulip Chat](https://ibis-project.zulipchat.com/)
21 changes: 12 additions & 9 deletions docs/posts/ibis-duckdb-geospatial/index.qmd
Expand Up @@ -23,6 +23,14 @@ You can check Dr. Qiusheng Wu's full Spatial Data Management course material on
[YouTube](https://www.youtube.com/watch?v=A4TOAdsXsEs&list=PLAxJ4-o7ZoPe9SkgnophygyLjTDBzIEbi).
:::

## Installation

Install Ibis with the dependencies needed to work with geospatial data using DuckDB:

```bash
$ pip install 'ibis-framework[duckdb,geospatial]'
```

## Data

We are going to be working with data from New York City. The database contains multiple tables with information about
Expand Down Expand Up @@ -109,7 +117,7 @@ boroughs
```

```{python}
boroughs.filter(boroughs.geom.intersects(broad_station.select(broad_station.geom).to_array()))
boroughs.filter(_.geom.intersects(broad_station.geom))
```

### `d_within` (ST_DWithin)
Expand All @@ -125,15 +133,10 @@ streets
Using the deferred API, we can check which streets are within `d=10` meters of distance.

```{python}
sts_near_broad = streets.filter(_.geom.d_within(broad_station.select(_.geom).to_array(), 10))
sts_near_broad = streets.filter(_.geom.d_within(broad_station.geom, 10))
sts_near_broad
```

::: {.callout-note}
In the previous query, `streets` and `broad_station` are different tables. We use [`to_array()`](../../reference/expression-tables.qmd#ibis.expr.types.relations.Table.to_array) to generate a
scalar subquery from a table with a single column (whose shape is scalar).
:::

To visualize the findings, we will convert the tables to GeoPandas DataFrames.

```{python}
Expand Down Expand Up @@ -193,7 +196,7 @@ To find if there were any homicides in that area, we can find where the polygon
200 meters buffer to our "Broad St" station point intersects with the geometry column in our homicides table.

```{python}
h_near_broad = homicides.filter(_.geom.intersects(broad_station.select(_.geom.buffer(200)).to_array()))
h_near_broad = homicides.filter(_.geom.intersects(broad_station.geom.buffer(200)))
h_near_broad
```

Expand All @@ -202,7 +205,7 @@ data we can't tell the street near which it happened. However, we can check if t
distance of a street.

```{python}
h_street = streets.filter(_.geom.d_within(h_near_broad.select(_.geom).to_array(), 2))
h_street = streets.filter(_.geom.d_within(h_near_broad.geom, 2))
h_street
```

Expand Down
27 changes: 27 additions & 0 deletions docs/posts/kedro-ibis/index.qmd
@@ -0,0 +1,27 @@
---
title: "Building scalable data pipelines with Kedro"
author: "Cody"
date: "2024-01-31"
categories:
- blog
- kedro
- data engineering
---

# Overview

[Kedro](https://kedro.org) is a toolbox for production-ready data science. It is
an open-source Python framework like Ibis, and together you can bring the
portability and scale of Ibis to the production-ready pipelines of Kedro.

> In your ~~Kedro~~ data journey, have you ever...
>
> ...slurped up large amounts of data into memory, instead of pushing execution down to the source database/engine?
>
> ...prototyped a node in pandas, and then rewritten it in PySpark/Snowpark/some other native dataframe API?
>
> ...implemented a proof-of-concept solution in 3-4 months on data extracts, and then struggled massively when you needed to move to running against the production databases and scale out?
> ...

If so, [read the full article on the Kedro
blog](https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis)!
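
For a flavor of what this looks like, here's a minimal, hypothetical sketch of a
Kedro-style node written against an Ibis table (the dataset and column names are
made up). Because the function body is ordinary Ibis code, the same node can run
on DuckDB locally and on a production backend without a rewrite:

```{.python}
import ibis.expr.types as ir
from ibis import _


def summarize_orders(orders: ir.Table) -> ir.Table:
    """A node is just a Python function: Ibis table in, Ibis table out."""
    return (
        orders.filter(_.status == "complete")
        .group_by(_.customer_id)
        .agg(total_spend=_.amount.sum(), n_orders=_.count())
    )
```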
2 changes: 1 addition & 1 deletion docs/posts/pydata-performance/index.qmd
Expand Up @@ -149,7 +149,7 @@ def version_pair(module):
subprocess.run(("git", "fetch", "upstream"), check=True, capture_output=True)
cmd = "git", "rev-parse", "--short", "upstream/master"
cmd = "git", "rev-parse", "--short", "upstream/main"
proc = subprocess.run(cmd, check=True, text=True, capture_output=True)
commit = proc.stdout.strip()
link = f"https://github.com/ibis-project/ibis/tree/{commit}"
Expand Down
43 changes: 43 additions & 0 deletions docs/posts/zulip/index.qmd
@@ -0,0 +1,43 @@
---
title: "Announcing Zulip for Ibis community chat"
author: "Ibis team"
date: "2024-01-04"
categories:
- blog
- chat
- community
---

## Overview

The Ibis project has moved to Zulip for its community chat! We've been testing
it out for a few months and are happy with the results. From the [Zulip
repository's README](https://github.com/zulip/zulip):

> Zulip is an open-source team collaboration tool with unique topic-based
> threading that combines the best of email and chat to make remote work
> productive and delightful. Fortune 500 companies, leading open source projects,
> and thousands of other organizations use Zulip every day. Zulip is the only
> modern team chat app that is designed for both live and asynchronous
> conversations.
>
> Zulip is built by a distributed community of developers from
> all around the world, with 74+ people who have each contributed 100+ commits.
> With over 1000 contributors merging over 500 commits a month, Zulip is the
> largest and fastest growing open source team chat project.

## Benefits for Ibis users

GitHub issues remain the source of truth for work item tracking and bug reports,
while Zulip offers a more interactive chat experience. This is useful when
you're not sure if you've found a bug or just need help with something. It's
also a great place to ask questions about Ibis or get help with your code.

Zulip splits conversations into streams (like channels in Slack or Teams), but
uniquely requires each individual conversation to also have a topic. This makes
it easy to follow along with conversations that are relevant to you, and to find
conversations that you've participated in.

## Next steps

[Join us on Zulip and introduce yourself!](https://ibis-project.zulipchat.com/)
144 changes: 142 additions & 2 deletions docs/release_notes.md

Large diffs are not rendered by default.

98 changes: 66 additions & 32 deletions docs/support_matrix.qmd
@@ -1,57 +1,91 @@
---
title: "Operation support matrix"
format: dashboard
hide:
- toc
---

# Operation support matrix
```{python}
#| echo: false
!python ../gen_matrix.py
```

We provide Ibis's operation support matrix as
a [Streamlit](https://streamlit.io/) app that shows supported operations for
each backend. Ibis defines a common API for analytics and data transformation
code that is transpiled to native code for each backend. Due to differences in
SQL dialects and upstream support for different operations in different
backends, support for the full breadth of the Ibis API varies.
```{python}
#| echo: false
import pandas as pd
support_matrix = pd.read_csv("./backends/raw_support_matrix.csv")
support_matrix = support_matrix.assign(
Category=support_matrix.Operation.map(lambda op: op.rsplit(".", 1)[0].rsplit(".", 1)[-1]),
Operation=support_matrix.Operation.map(lambda op: op.rsplit(".", 1)[-1]),
).set_index(["Category", "Operation"])
all_visible_ops_count = len(support_matrix)
coverage = pd.Index(
support_matrix.sum()
.map(lambda n: f"{n} ({round(100 * n / all_visible_ops_count)}%)")
.T
)
support_matrix.columns = pd.MultiIndex.from_tuples(
list(zip(support_matrix.columns, coverage)), names=("Backend", "API coverage")
)
support_matrix = support_matrix.replace({True: "✔", False: "🚫"})
```

## {height=25%}

::: {.card title="Welcome to the operation support matrix!"}

You can use this page to see which operations are supported on each backend.
This is a [Quarto dashboard](https://quarto.org/docs/dashboards/) that shows
the operations each backend supports.

Due to differences in SQL dialects and upstream support for different
operations in different backends, support for the full breadth of the Ibis API
varies.

::: {.callout-tip}
Backends with low coverage are good places to start contributing!

Each backend implements operations differently, but this is usually very similar to other backends. If you want to start contributing to ibis, it's a good idea to start by adding missing operations to backends that have low operation coverage.
:::

<div class="streamlit-app">
<iframe class="streamlit-app-inner" width=760 height=600 src="https://ibis-project.streamlit.app/?embedded=true"></iframe>
</div>

::: {.callout-note}
This app is built using [`streamlit`](https://streamlit.io/).

You can develop the app locally by editing `docs/backends/app/backend_info_app.py` and opening a PR with your changes.
:::

Test your changes locally by running
### {width=25%}

```sh
$ streamlit run docs/backends/app/backend_info_app.py
```{python}
#| content: valuebox
#| title: "Number of backends"
import ibis
dict(
value=len(ibis.util.backend_entry_points()),
color="info",
icon="signpost-split-fill",
)
```

The changes will show up in the dev docs when your PR is merged!
:::

## Raw Data
### {width=25%}

```{python}
#| echo: false
!python ../gen_matrix.py
#| content: valuebox
#| title: "Number of SQL backends"
import importlib
from ibis.backends.base.sql import BaseSQLBackend
sql_backends = sum(
issubclass(
importlib.import_module(f"ibis.backends.{entry_point.name}").Backend,
BaseSQLBackend
)
for entry_point in ibis.util.backend_entry_points()
)
dict(value=sql_backends, color="green", icon="database")
```

You can also download data from the above tables in [CSV format](./backends/raw_support_matrix.csv).

The code used to generate the linked CSV file is below.
## {height=70%}

```{python}
#| echo: false
#| output: asis
with open("../gen_matrix.py", mode="r") as f:
print(f"```python\n{f.read()}\n```")
from itables import show
show(support_matrix, ordering=False, paging=False, buttons=["copy", "excel", "csv"])
```
5 changes: 5 additions & 0 deletions docs/theme-dark.scss
@@ -1,2 +1,7 @@
/*-- scss:defaults --*/
$code-color: #c2d94c;
$code-bg: #2b2b2b;

thead.tableFloatingHeaderOriginal {
background-color: rgb(47, 47, 47);
}
4 changes: 4 additions & 0 deletions docs/theme-light.scss
Expand Up @@ -2,3 +2,7 @@
.quarto-title-banner .quarto-title .title {
color: #ccd1d5;
}

thead.tableFloatingHeaderOriginal {
background-color: #ffffff;
}
2 changes: 1 addition & 1 deletion docs/tutorials/ibis-for-sql-users.qmd
Expand Up @@ -1144,7 +1144,7 @@ SQL dialects often support two kinds of `UNION` operations:
- `UNION ALL`: the combination of all rows from each table, whether or
not they are distinct.

The Ibis `union` function by distinct is a `UNION ALL`, and you can set
The Ibis `union` function by default is a `UNION ALL`, and you can set
`distinct=True` to get the normal `UNION` behavior:

```{python}
Expand Down
24 changes: 12 additions & 12 deletions flake.lock
9 changes: 9 additions & 0 deletions flake.nix
Expand Up @@ -49,6 +49,9 @@
duckdb
# mysql
mariadb-client
# pyodbc setup debugging
# in particular: odbcinst -j
unixODBC
# pyspark
openjdk17_headless
# postgres client
Expand Down Expand Up @@ -111,6 +114,12 @@
MSSQL_SA_PASSWORD = "1bis_Testing!";
DRUID_URL = "druid://localhost:8082/druid/v2/sql";

# needed for mssql+pyodbc
ODBCSYSINI = pkgs.writeTextDir "odbcinst.ini" ''
[FreeTDS]
Driver = ${pkgs.lib.makeLibraryPath [ pkgs.freetds ]}/libtdsodbc.so
'';

__darwinAllowLocalNetworking = true;
};
in
Expand Down
2 changes: 1 addition & 1 deletion gen_matrix.py
Expand Up @@ -44,7 +44,7 @@ def main():
with Path(ibis.__file__).parents[1].joinpath(
"docs", "backends", "raw_support_matrix.csv"
).open(mode="w") as f:
df.to_csv(f, index_label="FullOperation")
df.to_csv(f, index_label="Operation")


if __name__ == "__main__":
Expand Down
2 changes: 1 addition & 1 deletion ibis/__init__.py
@@ -1,7 +1,7 @@
"""Initialize Ibis module."""
from __future__ import annotations

__version__ = "7.2.0"
__version__ = "8.0.0"

from ibis import examples, util
from ibis.backends.base import BaseBackend
Expand Down
4 changes: 2 additions & 2 deletions ibis/backends/base/__init__.py
Expand Up @@ -1024,8 +1024,8 @@ def compile(
def _to_sql(self, expr: ir.Expr, **kwargs) -> str:
"""Convert an expression to a SQL string.
Called by `ibis.to_sql`/`ibis.show_sql`, gives the backend an
opportunity to generate nicer SQL for human consumption.
Called by `ibis.to_sql`; gives the backend an opportunity to generate
nicer SQL for human consumption.
"""
raise NotImplementedError(f"Backend '{self.name}' backend doesn't support SQL")

Expand Down
32 changes: 0 additions & 32 deletions ibis/backends/base/sql/__init__.py
Expand Up @@ -397,38 +397,6 @@ def compile(
def _to_sql(self, expr: ir.Expr, **kwargs) -> str:
return str(self.compile(expr, **kwargs))

def explain(
self,
expr: ir.Expr | str,
params: Mapping[ir.Expr, Any] | None = None,
) -> str:
"""Explain an expression.
Return the query plan associated with the indicated expression or SQL
query.
Returns
-------
str
Query plan
"""
if isinstance(expr, ir.Expr):
context = self.compiler.make_context(params=params)
query_ast = self.compiler.to_ast(expr, context)
if len(query_ast.queries) > 1:
raise Exception("Multi-query expression")

query = query_ast.queries[0].compile()
else:
query = expr

statement = f"EXPLAIN {query}"

with self._safe_raw_sql(statement) as cur:
result = self._get_list(cur)

return "\n".join(["Query:", util.indent(query, 2), "", *result])

@classmethod
@lru_cache
def _get_operations(cls):
Expand Down
22 changes: 17 additions & 5 deletions ibis/backends/base/sql/alchemy/__init__.py
Expand Up @@ -3,7 +3,6 @@
import abc
import atexit
import contextlib
import getpass
import warnings
from operator import methodcaller
from typing import TYPE_CHECKING, Any
Expand Down Expand Up @@ -136,18 +135,28 @@ def _compile_type(self, dtype) -> str:
self.compiler.translator_class.get_sqla_type(dtype)
).compile(dialect=dialect)

def _build_alchemy_url(self, url, host, port, user, password, database, driver):
def _build_alchemy_url(
self,
url: str | None,
host: str | None,
port: int | None,
user: str | None,
password: str | None,
database: str | None,
driver: str | None,
query: Mapping[str, Any] | None = None,
) -> sa.engine.URL:
if url is not None:
return sa.engine.url.make_url(url)

user = user or getpass.getuser()
return sa.engine.url.URL.create(
driver,
host=host,
port=port,
username=user,
password=password,
database=database,
query=query or {},
)

@property
Expand Down Expand Up @@ -875,8 +884,11 @@ def _get_compiled_statement(
compiled = definition.compile(
dialect=self.con.dialect, compile_kwargs=compile_kwargs
)
lines = self._get_temp_view_definition(name, definition=compiled)
return lines, compiled.params
create_view = self._get_temp_view_definition(name, definition=compiled)
params = compiled.params
if compiled.positional:
params = tuple(params.values())
return create_view, params

def _create_temp_view(self, view: sa.Table, definition: sa.sql.Selectable) -> None:
raw_name = view.name
Expand Down
13 changes: 13 additions & 0 deletions ibis/backends/base/sql/alchemy/datatypes.py
Expand Up @@ -41,6 +41,19 @@ def compiles_array(element, compiler, **kw):
return f"ARRAY({compiler.process(element.value_type, **kw)})"


@compiles(sat.FLOAT, "duckdb")
def compiles_float(element, compiler, **kw):
precision = element.precision
if precision is None or 1 <= precision <= 24:
return "FLOAT"
elif 24 < precision <= 53:
return "DOUBLE"
else:
raise ValueError(
"FLOAT precision must be between 1 and 53 inclusive, or `None`"
)


class StructType(sat.UserDefinedType):
cache_ok = True

Expand Down
4 changes: 3 additions & 1 deletion ibis/backends/base/sql/alchemy/geospatial.py
Expand Up @@ -3,6 +3,8 @@
from importlib.util import find_spec as _find_spec

geospatial_supported = (
_find_spec("geoalchemy2") is not None and _find_spec("geopandas") is not None
_find_spec("geoalchemy2") is not None
and _find_spec("geopandas") is not None
and _find_spec("shapely") is not None
)
__all__ = ["geospatial_supported"]
5 changes: 5 additions & 0 deletions ibis/backends/base/sql/alchemy/query_builder.py
Expand Up @@ -144,6 +144,11 @@ def _format_table(self, op):
elif ctx.is_extracted(op):
if isinstance(orig_op, ops.SelfReference):
result = ctx.get_ref(op)
elif isinstance(alias, str):
result = sa.table(
alias,
*translator._schema_to_sqlalchemy_columns(orig_op.schema),
)
else:
result = alias
else:
Expand Down
6 changes: 3 additions & 3 deletions ibis/backends/base/sql/compiler/translator.py
Expand Up @@ -113,12 +113,12 @@ def get_ref(self, node, search_parents=False):
"""Return the alias used to refer to an expression."""
assert isinstance(node, ops.Node), type(node)

if self.is_extracted(node):
return self.top_context.table_refs.get(node)

if (ref := self.table_refs.get(node)) is not None:
return ref

if self.is_extracted(node):
return self.top_context.table_refs.get(node)

if search_parents and (parent := self.parent) is not None:
return parent.get_ref(node, search_parents=search_parents)

Expand Down
1 change: 1 addition & 0 deletions ibis/backends/base/sql/registry/helpers.py
Expand Up @@ -67,6 +67,7 @@ def needs_parens(op: ops.Node):
"boolean": "boolean",
"timestamp": "timestamp",
"decimal": "decimal",
"date": "date",
}


Expand Down
2 changes: 1 addition & 1 deletion ibis/backends/base/sql/registry/literal.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,7 @@ def _interval_literal_format(translator, op):
def _date_literal_format(translator, op):
value = op.value
if isinstance(value, datetime.date):
value = value.strftime("%Y-%m-%d")
value = value.isoformat()

return repr(value)

Expand Down
31 changes: 24 additions & 7 deletions ibis/backends/base/sql/registry/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -165,15 +165,18 @@ def exists_subquery(translator, op):

# XXX this is not added to operation_registry, but looks like impala is
# using it in the tests, and it works, even if it's not imported anywhere
def round(translator, op):
def _round(translator, op):
arg, digits = op.args

arg_formatted = translator.translate(arg)

if digits is not None:
digits_formatted = translator.translate(digits)
return f"round({arg_formatted}, {digits_formatted})"
return f"round({arg_formatted})"
rounded = f"round({arg_formatted})"
if op.dtype.is_integer():
return f"cast({rounded} AS BIGINT)"
return rounded


def concat(translator, op):
Expand All @@ -196,6 +199,20 @@ def count_star(translator, op):
)


def _ceil(t, op):
ceil = f"ceil({t.translate(op.arg)})"
if op.dtype.is_integer():
return f"cast({ceil} AS BIGINT)"
return ceil


def _floor(t, op):
floor = f"floor({t.translate(op.arg)})"
if op.dtype.is_integer():
return f"cast({floor} AS BIGINT)"
return floor


binary_infix_ops = {
# Binary operations
ops.Add: binary_infix.binary_infix_op("+"),
Expand Down Expand Up @@ -237,10 +254,10 @@ def count_star(translator, op):
ops.NullIf: fixed_arity("nullif", 2),
ops.Abs: unary("abs"),
ops.BaseConvert: fixed_arity("conv", 3),
ops.Ceil: unary("ceil"),
ops.Floor: unary("floor"),
ops.Ceil: _ceil,
ops.Floor: _floor,
ops.Exp: unary("exp"),
ops.Round: round,
ops.Round: _round,
ops.Sign: sign,
ops.Sqrt: unary("sqrt"),
ops.HashBytes: hashbytes,
Expand Down Expand Up @@ -324,8 +341,8 @@ def count_star(translator, op):
ops.ExtractHour: timestamp.extract_field("hour"),
ops.ExtractMinute: timestamp.extract_field("minute"),
ops.ExtractSecond: timestamp.extract_field("second"),
ops.ExtractMicrosecond: timestamp.extract_field("microsecond"),
ops.ExtractMillisecond: timestamp.extract_field("millisecond"),
ops.ExtractMicrosecond: timestamp.extract_microsecond,
ops.ExtractMillisecond: timestamp.extract_millisecond,
ops.TimestampTruncate: timestamp.truncate,
ops.DateTruncate: timestamp.truncate,
ops.IntervalFromInteger: timestamp.interval_from_integer,
Expand Down
16 changes: 16 additions & 0 deletions ibis/backends/base/sql/registry/timestamp.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,22 @@ def extract_field_formatter(translator, op):
return extract_field_formatter


def extract_millisecond(translator, op):
arg = translator.translate(op.args[0])

# This is pre-2.0 Impala-style, which did not used to support the
# SQL-99 format extract($FIELD from expr)
return f"extract({arg}, 'millisecond') % 1000"


def extract_microsecond(translator, op):
arg = translator.translate(op.args[0])

# This is pre-2.0 Impala-style, which did not used to support the
# SQL-99 format extract($FIELD from expr)
return f"extract({arg}, 'microsecond') % 1000000"


def extract_epoch_seconds(t, op):
return f"unix_timestamp({t.translate(op.arg)})"

Expand Down
6 changes: 3 additions & 3 deletions ibis/backends/base/sqlglot/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -82,9 +82,9 @@ def interval(value, *, unit):

C = ColGen()
F = FuncGen()
NULL = sg.exp.NULL
FALSE = sg.exp.FALSE
TRUE = sg.exp.TRUE
NULL = sg.exp.Null()
FALSE = sg.exp.false()
TRUE = sg.exp.true()
STAR = sg.exp.Star()


Expand Down
67 changes: 56 additions & 11 deletions ibis/backends/base/sqlglot/datatypes.py
@@ -1,11 +1,11 @@
from __future__ import annotations

import abc
from functools import partial

import sqlglot as sg
import sqlglot.expressions as sge

import ibis.common.exceptions as com
import ibis.expr.datatypes as dt
from ibis.common.collections import FrozenDict
from ibis.formats import TypeMapper
Expand Down Expand Up @@ -122,10 +122,8 @@


class SqlglotType(TypeMapper):
@property
@abc.abstractmethod
def dialect(self) -> str:
"""The dialect this parser is for."""
dialect: str | None = None
"""The dialect this parser is for."""

default_nullable = True
"""Default nullability when not specified."""
Expand Down Expand Up @@ -230,11 +228,20 @@ def _from_sqlglot_TIMESTAMPLTZ(cls, scale=None) -> dt.Timestamp:

@classmethod
def _from_sqlglot_INTERVAL(
cls, precision: sge.DataTypeParam | None = None
cls, precision_or_span: sge.IntervalSpan | None = None
) -> dt.Interval:
if precision is None:
precision = cls.default_interval_precision
return dt.Interval(str(precision), nullable=cls.default_nullable)
nullable = cls.default_nullable
if precision_or_span is None:
precision_or_span = cls.default_interval_precision

if isinstance(precision_or_span, str):
return dt.Interval(precision_or_span, nullable=nullable)
elif isinstance(precision_or_span, sge.IntervalSpan):
return dt.Interval(unit=precision_or_span.this.this, nullable=nullable)
elif precision_or_span is None:
raise com.IbisTypeError("Interval precision is None")
else:
raise com.IbisTypeError(precision_or_span)

@classmethod
def _from_sqlglot_DECIMAL(
Expand All @@ -254,6 +261,22 @@ def _from_sqlglot_DECIMAL(

return dt.Decimal(precision, scale, nullable=cls.default_nullable)

@classmethod
def _from_sqlglot_GEOMETRY(cls) -> sge.DataType:
return dt.GeoSpatial(geotype="geometry", nullable=cls.default_nullable)

@classmethod
def _from_sqlglot_GEOGRAPHY(cls) -> sge.DataType:
return dt.GeoSpatial(geotype="geography", nullable=cls.default_nullable)

@classmethod
def _from_ibis_Interval(cls, dtype: dt.Interval) -> sge.DataType:
assert dtype.unit is not None, "interval unit cannot be None"
return sge.DataType(
this=typecode.INTERVAL,
expressions=[sge.IntervalSpan(this=sge.Var(this=dtype.unit.name))],
)

@classmethod
def _from_ibis_Array(cls, dtype: dt.Array) -> sge.DataType:
value_type = cls.from_ibis(dtype.value_type)
Expand All @@ -277,11 +300,17 @@ def _from_ibis_Struct(cls, dtype: dt.Struct) -> sge.DataType:

@classmethod
def _from_ibis_Decimal(cls, dtype: dt.Decimal) -> sge.DataType:
if (precision := dtype.precision) is None:
precision = cls.default_decimal_precision

if (scale := dtype.scale) is None:
scale = cls.default_decimal_scale

return sge.DataType(
this=typecode.DECIMAL,
expressions=[
sge.DataTypeParam(this=sge.Literal.number(dtype.precision)),
sge.DataTypeParam(this=sge.Literal.number(dtype.scale)),
sge.DataTypeParam(this=sge.Literal.number(precision)),
sge.DataTypeParam(this=sge.Literal.number(scale)),
],
)

Expand All @@ -294,6 +323,20 @@ def _from_ibis_Timestamp(cls, dtype: dt.Timestamp) -> sge.DataType:
else:
return sge.DataType(this=code)

@classmethod
def _from_ibis_GeoSpatial(cls, dtype: dt.GeoSpatial):
if (geotype := dtype.geotype) is not None:
return sge.DataType(this=getattr(typecode, geotype.upper()))
return sge.DataType(this=typecode.GEOMETRY)

_from_ibis_Point = (
_from_ibis_LineString
) = (
_from_ibis_Polygon
) = (
_from_ibis_MultiLineString
) = _from_ibis_MultiPoint = _from_ibis_MultiPolygon = _from_ibis_GeoSpatial


class PostgresType(SqlglotType):
dialect = "postgres"
Expand Down Expand Up @@ -354,6 +397,8 @@ class DuckDBType(SqlglotType):
default_decimal_scale = 3
default_interval_precision = "us"

unknown_type_strings = FrozenDict({"wkb_blob": dt.binary})

@classmethod
def _from_sqlglot_TIMESTAMP(cls) -> dt.Timestamp:
return dt.Timestamp(scale=6, nullable=cls.default_nullable)
Expand Down
25 changes: 24 additions & 1 deletion ibis/backends/base/sqlglot/tests/test_datatypes.py
Expand Up @@ -2,11 +2,13 @@

import hypothesis as h
import hypothesis.strategies as st
import pytest
import sqlglot.expressions as sge

import ibis.common.exceptions as com
import ibis.expr.datatypes as dt
import ibis.tests.strategies as its
from ibis.backends.base.sqlglot.datatypes import SqlglotType
from ibis.backends.base.sqlglot.datatypes import DuckDBType, PostgresType, SqlglotType


def assert_dtype_roundtrip(ibis_type, sqlglot_expected=None):
Expand Down Expand Up @@ -39,6 +41,10 @@ def assert_dtype_roundtrip(ibis_type, sqlglot_expected=None):
| its.array_dtypes(roundtripable_types, nullable=true)
| its.map_dtypes(roundtripable_types, roundtripable_types, nullable=true)
| its.struct_dtypes(roundtripable_types, nullable=true)
| its.geometry_dtypes(nullable=true)
| its.geography_dtypes(nullable=true)
| its.decimal_dtypes(nullable=true)
| its.interval_dtype(nullable=true)
)
)

Expand All @@ -51,3 +57,20 @@ def assert_dtype_roundtrip(ibis_type, sqlglot_expected=None):
@h.given(roundtripable_types)
def test_roundtripable_types(ibis_type):
assert_dtype_roundtrip(ibis_type)


@h.given(its.specific_geometry_dtypes(nullable=true))
def test_specific_geometry_types(ibis_type):
sqlglot_result = SqlglotType.from_ibis(ibis_type)
assert isinstance(sqlglot_result, sge.DataType)
assert sqlglot_result == sge.DataType(this=sge.DataType.Type.GEOMETRY)
assert SqlglotType.to_ibis(sqlglot_result) == dt.GeoSpatial(
geotype="geometry", nullable=ibis_type.nullable
)


def test_interval_without_unit():
with pytest.raises(com.IbisTypeError, match="precision is None"):
SqlglotType.from_string("INTERVAL")
assert PostgresType.from_string("INTERVAL") == dt.Interval("s")
assert DuckDBType.from_string("INTERVAL") == dt.Interval("us")
40 changes: 7 additions & 33 deletions ibis/backends/bigquery/__init__.py
Expand Up @@ -7,7 +7,6 @@
import glob
import os
import re
import warnings
from functools import partial
from typing import TYPE_CHECKING, Any, Callable, Optional
from urllib.parse import parse_qs, urlparse
Expand Down Expand Up @@ -495,14 +494,9 @@ def table(
self, name: str, database: str | None = None, schema: str | None = None
) -> ir.TableExpr:
if database is not None and schema is None:
util.warn_deprecated(
"database",
instead=(
f"The {self.name} backend cannot return a table expression using only a `database` specifier. "
"Include a `schema` argument."
),
as_of="7.1",
removed_in="8.0",
raise com.IbisInputError(
f"The {self.name} backend cannot return a table expression using only a "
"`database` specifier. Include a `schema` argument."
)

table = sg.parse_one(name, into=sg.exp.Table, read=self.name)
Expand Down Expand Up @@ -654,13 +648,7 @@ def raw_sql(self, query: str, results=False, params=None):

@property
def current_database(self) -> str:
warnings.warn(
"current_database will return the current *data project* in ibis 7.0.0; "
"use current_schema for the current BigQuery dataset",
category=FutureWarning,
)
# TODO: return self.data_project in ibis 7.0.0
return self.dataset
return self.data_project

@property
def current_schema(self) -> str | None:
Expand Down Expand Up @@ -812,12 +800,6 @@ def list_schemas(
]
return self._filter_with_like(results, like)

@ibis.util.deprecated(
instead="use `list_schemas()`", as_of="6.1.0", removed_in="8.0.0"
)
def list_databases(self, like=None):
return self.list_schemas(like=like)

def list_tables(
self,
like: str | None = None,
Expand All @@ -843,18 +825,10 @@ def list_tables(
:::
"""
if database is not None and schema is None:
util.warn_deprecated(
"database",
instead=(
f"{self.name} cannot list tables only using `database` specifier. "
"Include a `schema` argument."
),
as_of="7.1",
removed_in="8.0",
raise com.IbisInputError(
f"{self.name} cannot list tables only using `database` specifier. "
"Include a `schema` argument."
)
database = sg.parse_one(database, into=sg.exp.Table, read=self.name)
database.args["quoted"] = False
database = database.sql(dialect=self.name)
elif database is None and schema is not None:
database = sg.parse_one(schema, into=sg.exp.Table, read=self.name)
database.args["quoted"] = False
Expand Down
4 changes: 2 additions & 2 deletions ibis/backends/bigquery/registry.py
Expand Up @@ -129,7 +129,7 @@ def _array_concat(translator, op):


def _array_column(translator, op):
return "[{}]".format(", ".join(map(translator.translate, op.cols)))
return "[{}]".format(", ".join(map(translator.translate, op.exprs)))


def _array_index(translator, op):
Expand Down Expand Up @@ -912,7 +912,7 @@ def _timestamp_range(translator, op):
ops.StructColumn: _struct_column,
ops.ArrayCollect: _array_agg,
ops.ArrayConcat: _array_concat,
ops.ArrayColumn: _array_column,
ops.Array: _array_column,
ops.ArrayIndex: _array_index,
ops.ArrayLength: unary("ARRAY_LENGTH"),
ops.ArrayRepeat: _array_repeat,
Expand Down
43 changes: 10 additions & 33 deletions ibis/backends/bigquery/tests/system/test_client.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,14 +35,8 @@ def test_list_tables(con):
assert set(tables) == {"functional_alltypes", "functional_alltypes_parted"}


def test_current_database(con, dataset_id):
with pytest.warns(FutureWarning, match="data project"):
db = con.current_database
assert db == dataset_id
assert db == con.dataset_id
assert con.list_tables(schema=db, like="alltypes") == con.list_tables(
like="alltypes"
)
def test_current_database(con):
assert con.current_database == con.billing_project


def test_array_collect(struct_table):
Expand Down Expand Up @@ -244,37 +238,20 @@ def test_exists_table_different_project(con):


def test_multiple_project_queries(con, snapshot):
with pytest.warns(FutureWarning, match="`database` is deprecated as of v7.1"):
so = con.table("posts_questions", database="bigquery-public-data.stackoverflow")
with pytest.warns(FutureWarning, match="`database` is deprecated as of v7.1"):
trips = con.table("trips", database="nyc-tlc.yellow")
so = con.table(
"posts_questions", database="bigquery-public-data", schema="stackoverflow"
)
trips = con.table("trips", database="nyc-tlc", schema="yellow")
join = so.join(trips, so.tags == trips.rate_code)[[so.title]]
result = join.compile()
snapshot.assert_match(result, "out.sql")


def test_multiple_project_queries_database_api(con, snapshot):
stackoverflow = con.database("bigquery-public-data.stackoverflow")
with pytest.warns(FutureWarning, match="`database` is deprecated as of v7.1"):
posts_questions = stackoverflow.posts_questions
yellow = con.database("nyc-tlc.yellow")
with pytest.warns(FutureWarning, match="`database` is deprecated as of v7.1"):
trips = yellow.trips
predicate = posts_questions.tags == trips.rate_code
join = posts_questions.join(trips, predicate)[[posts_questions.title]]
result = join.compile()
snapshot.assert_match(result, "out.sql")


def test_multiple_project_queries_execute(con):
stackoverflow = con.database("bigquery-public-data.stackoverflow")
with pytest.warns(FutureWarning, match="`database` is deprecated as of v7.1"):
posts_questions = stackoverflow.posts_questions
posts_questions = posts_questions.limit(5)
yellow = con.database("nyc-tlc.yellow")
with pytest.warns(FutureWarning, match="`database` is deprecated as of v7.1"):
trips = yellow.trips
trips = trips.limit(5)
posts_questions = con.table(
"posts_questions", database="bigquery-public-data", schema="stackoverflow"
).limit(5)
trips = con.table("trips", database="nyc-tlc", schema="yellow").limit(5)
predicate = posts_questions.tags == trips.rate_code
cols = [posts_questions.title]
join = posts_questions.left_join(trips, predicate)[cols]
Expand Down
@@ -1,7 +1,8 @@
SELECT
t0.`rowindex`,
IF(pos = pos_2, `repeated_struct_col`, NULL) AS `repeated_struct_col`
FROM array_test AS t0, UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(t0.`repeated_struct_col`)) - 1)) AS pos
FROM array_test AS t0
CROSS JOIN UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(t0.`repeated_struct_col`)) - 1)) AS pos
CROSS JOIN UNNEST(t0.`repeated_struct_col`) AS `repeated_struct_col` WITH OFFSET AS pos_2
WHERE
pos = pos_2
Expand Down
Expand Up @@ -4,7 +4,8 @@ FROM (
SELECT
t1.`rowindex`,
IF(pos = pos_2, `level_one`, NULL).`nested_struct_col` AS `level_one`
FROM array_test AS t1, UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(t1.`repeated_struct_col`)) - 1)) AS pos
FROM array_test AS t1
CROSS JOIN UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(t1.`repeated_struct_col`)) - 1)) AS pos
CROSS JOIN UNNEST(t1.`repeated_struct_col`) AS `level_one` WITH OFFSET AS pos_2
WHERE
pos = pos_2
Expand All @@ -16,7 +17,8 @@ FROM (
ARRAY_LENGTH(t1.`repeated_struct_col`) - 1
)
)
) AS t0, UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(t0.`level_one`)) - 1)) AS pos
) AS t0
CROSS JOIN UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(t0.`level_one`)) - 1)) AS pos
CROSS JOIN UNNEST(t0.`level_one`) AS `level_two` WITH OFFSET AS pos_2
WHERE
pos = pos_2
Expand Down
19 changes: 10 additions & 9 deletions ibis/backends/clickhouse/compiler/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,8 @@
from ibis.backends.clickhouse.compiler.relations import translate_rel
from ibis.backends.clickhouse.compiler.values import translate_val
from ibis.common.deferred import _
from ibis.expr.analysis import c, find_first_base_table, p, x, y
from ibis.common.patterns import replace
from ibis.expr.analysis import c, find_first_base_table, p, x
from ibis.expr.rewrites import rewrite_dropna, rewrite_fillna, rewrite_sample

if TYPE_CHECKING:
Expand All @@ -42,6 +43,14 @@ def _translate_node(node, **kwargs):
return translate_rel(node, **kwargs)


@replace(ops.InColumn)
def replace_in_column_with_table_array_view(_):
# replace the right side of InColumn into a scalar subquery for sql backends
base = find_first_base_table(_.options)
options = ops.TableArrayView(ops.Selection(table=base, selections=(_.options,)))
return _.copy(options=options)


def translate(op: ops.TableNode, params: Mapping[ir.Value, Any]) -> sg.exp.Expression:
"""Translate an ibis operation to a sqlglot expression.
Expand Down Expand Up @@ -88,14 +97,6 @@ def fn(node, _, **kwargs):
lambda _, x: ops.Literal(value=params[_], dtype=x)
)

# replace the right side of InColumn into a scalar subquery for sql
# backends
replace_in_column_with_table_array_view = p.InColumn(options=y) >> _.copy(
options=c.TableArrayView(
c.Selection(table=lambda _, y: find_first_base_table(y), selections=(y,))
),
)

# replace any checks against an empty right side of the IN operation with
# `False`
replace_empty_in_values_with_false = p.InValues(options=()) >> c.Literal(
Expand Down
41 changes: 24 additions & 17 deletions ibis/backends/clickhouse/compiler/values.py
Original file line number Diff line number Diff line change
Expand Up @@ -138,10 +138,17 @@ def _array_slice_op(op, *, arg, start, stop, **_):
return F.arraySlice(arg, start_correct)


@translate_val.register(ops.Count)
def _count(op, *, arg, where, **_):
if where is not None:
return sg.exp.Anonymous(this="countIf", expressions=[arg, where])
return sg.exp.Count(this=arg)


@translate_val.register(ops.CountStar)
def _count_star(op, *, where, **_):
if where is not None:
return F.countIf(where)
return sg.exp.Anonymous(this="countIf", expressions=[where])
return sg.exp.Count(this=STAR)


Expand Down Expand Up @@ -273,6 +280,8 @@ def _hash(op, *, arg, **_):

@translate_val.register(ops.HashBytes)
def _hash_bytes(op, *, arg, how, **_):
if how in ("md5", "sha1", "sha224", "sha256"):
how = how.upper()
if how not in _SUPPORTED_ALGORITHMS:
raise com.UnsupportedOperationError(f"Unsupported hash algorithm {how}")

Expand Down Expand Up @@ -346,24 +355,18 @@ def _literal(op, *, value, dtype, **kw):

return interval(value, unit=dtype.resolution.upper())
elif dtype.is_timestamp():
funcname = "makeDateTime"
funcname = "parseDateTime"

if micros := value.microsecond:
funcname += "64"

args = [
value.year,
value.month,
value.day,
value.hour,
value.minute,
value.second,
]
funcname += "BestEffort"

args = [value.isoformat()]

if micros % 1000:
args.append(micros)
args.append(6)
elif millis := micros // 1000:
args.append(millis)
elif micros // 1000:
args.append(3)

if (timezone := dtype.timezone) is not None:
Expand Down Expand Up @@ -553,9 +556,9 @@ def _translate(op, *, arg, where, **_):
return _translate


@translate_val.register(ops.ArrayColumn)
def _array_column(op, *, cols, **_):
return F.array(*cols)
@translate_val.register(ops.Array)
def _array_column(op, *, exprs, **_):
return F.array(*exprs)


@translate_val.register(ops.StructColumn)
Expand Down Expand Up @@ -750,7 +753,6 @@ def formatter(op, *, left, right, **_):
ops.ArgMin: "argMin",
ops.ArgMax: "argMax",
ops.ArrayCollect: "groupArray",
ops.Count: "count",
ops.CountDistinct: "uniq",
ops.First: "any",
ops.Last: "anyLast",
Expand Down Expand Up @@ -1050,3 +1052,8 @@ def _timestamp_range(op, *, start, stop, step, **_):
func, F.range(0, F.timestampDiff(unit, start, stop), step_value)
)
return result


@translate_val.register(ops.RegexSplit)
def _regex_split(op, *, arg, pattern, **_):
return F.splitByRegexp(pattern, cast(arg, dt.String(nullable=False)))
@@ -1,2 +1,2 @@
SELECT
toDate(makeDateTime(2009, 5, 17, 12, 34, 56)) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56))"
toDate(parseDateTimeBestEffort('2009-05-17T12:34:56')) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56))"
@@ -1,2 +1,2 @@
SELECT
toStartOfHour(makeDateTime(2009, 5, 17, 12, 34, 56)) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56))"
toStartOfHour(parseDateTimeBestEffort('2009-05-17T12:34:56')) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56))"
@@ -1,2 +1,2 @@
SELECT
toStartOfMinute(makeDateTime(2009, 5, 17, 12, 34, 56)) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56))"
toStartOfMinute(parseDateTimeBestEffort('2009-05-17T12:34:56')) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56))"
@@ -1,2 +1,2 @@
SELECT
toStartOfMinute(makeDateTime(2009, 5, 17, 12, 34, 56)) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56))"
toStartOfMinute(parseDateTimeBestEffort('2009-05-17T12:34:56')) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56))"
@@ -1,2 +1,2 @@
SELECT
toMonday(makeDateTime(2009, 5, 17, 12, 34, 56)) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56))"
toMonday(parseDateTimeBestEffort('2009-05-17T12:34:56')) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56))"
@@ -1,2 +1,2 @@
SELECT
toStartOfYear(makeDateTime(2009, 5, 17, 12, 34, 56)) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56))"
toStartOfYear(parseDateTimeBestEffort('2009-05-17T12:34:56')) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56))"
@@ -1,2 +1,2 @@
SELECT
makeDateTime64(2015, 1, 1, 12, 34, 56, 789321, 6) AS "datetime.datetime(2015, 1, 1, 12, 34, 56, 789321)"
parseDateTime64BestEffort('2015-01-01T12:34:56.789321', 6) AS "datetime.datetime(2015, 1, 1, 12, 34, 56, 789321)"
@@ -1,2 +1,2 @@
SELECT
makeDateTime64(2015, 1, 1, 12, 34, 56, 789321, 6, 'UTC') AS "datetime.datetime(2015, 1, 1, 12, 34, 56, 789321, tzinfo=tzutc())"
parseDateTime64BestEffort('2015-01-01T12:34:56.789321+00:00', 6, 'UTC') AS "datetime.datetime(2015, 1, 1, 12, 34, 56, 789321, tzinfo=tzutc())"
@@ -1,2 +1,2 @@
SELECT
makeDateTime64(2015, 1, 1, 12, 34, 56, 789, 3) AS "datetime.datetime(2015, 1, 1, 12, 34, 56, 789000)"
parseDateTime64BestEffort('2015-01-01T12:34:56.789000', 3) AS "datetime.datetime(2015, 1, 1, 12, 34, 56, 789000)"
@@ -1,2 +1,2 @@
SELECT
makeDateTime64(2015, 1, 1, 12, 34, 56, 789, 3, 'UTC') AS "datetime.datetime(2015, 1, 1, 12, 34, 56, 789000, tzinfo=tzutc())"
parseDateTime64BestEffort('2015-01-01T12:34:56.789000+00:00', 3, 'UTC') AS "datetime.datetime(2015, 1, 1, 12, 34, 56, 789000, tzinfo=tzutc())"
@@ -1,2 +1,2 @@
SELECT
makeDateTime(2015, 1, 1, 12, 34, 56) AS "datetime.datetime(2015, 1, 1, 12, 34, 56)"
parseDateTimeBestEffort('2015-01-01T12:34:56') AS "datetime.datetime(2015, 1, 1, 12, 34, 56)"
@@ -1,2 +1,2 @@
SELECT
makeDateTime(2015, 1, 1, 12, 34, 56) AS "datetime.datetime(2015, 1, 1, 12, 34, 56)"
parseDateTimeBestEffort('2015-01-01T12:34:56') AS "datetime.datetime(2015, 1, 1, 12, 34, 56)"
@@ -1,2 +1,2 @@
SELECT
makeDateTime(2015, 1, 1, 12, 34, 56) AS "datetime.datetime(2015, 1, 1, 12, 34, 56)"
parseDateTimeBestEffort('2015-01-01T12:34:56') AS "datetime.datetime(2015, 1, 1, 12, 34, 56)"
5 changes: 2 additions & 3 deletions ibis/backends/clickhouse/tests/test_functions.py
@@ -92,9 +92,8 @@ def test_timestamp_truncate(con, unit, snapshot):


@pytest.mark.parametrize(("value", "expected"), [(0, None), (5.5, 5.5)])
def test_nullifzero(con, value, expected):
with pytest.warns(FutureWarning):
result = con.execute(L(value).nullifzero())
def test_nullif_zero(con, value, expected):
result = con.execute(L(value).nullif(0))
if expected is None:
assert pd.isnull(result)
else:
2 changes: 1 addition & 1 deletion ibis/backends/clickhouse/tests/test_operators.py
@@ -260,7 +260,7 @@ def test_array_index(con, arr, gen_idx):
)
def test_array_concat(con, arrays):
expr = L([]).cast("!array<int8>")
expected = sum(arrays, [])
expected = sum(arrays, []) # noqa: RUF017
for arr in arrays:
expr += L(arr, type="!array<int8>")

1 change: 0 additions & 1 deletion ibis/backends/clickhouse/tests/test_select.py
@@ -392,7 +392,6 @@ def test_count_name(snapshot):
A=t.count(where=~t.b).fillna(0), B=t.count(where=t.b).fillna(0)
)

ibis.show_sql(expr, dialect="clickhouse")
snapshot.assert_match(str(ibis.to_sql(expr, dialect="clickhouse")), "out.sql")


10 changes: 4 additions & 6 deletions ibis/backends/conftest.py
@@ -25,6 +25,8 @@
if TYPE_CHECKING:
from collections.abc import Iterable

from ibis.backends.tests.base import BackendTest

TEST_TABLES = {
"functional_alltypes": ibis.schema(
{
@@ -399,11 +401,6 @@ def pytest_runtest_call(item):

backend = next(iter(backend))

if tpch_markers := list(item.iter_markers(name="tpch")):
assert len(tpch_markers) == 1
# TODO: there has to be a better way than hacking `_fixtureinfo`
item._fixtureinfo.argnames += ("backend", "snapshot")

# Ibis hasn't exposed existing functionality
# This xfails so that you know when it starts to pass
for marker in item.iter_markers(name="notimpl"):
@@ -472,7 +469,7 @@ def pytest_runtest_call(item):


@pytest.fixture(params=_get_backends_to_test(), scope="session")
def backend(request, data_dir, tmp_path_factory, worker_id):
def backend(request, data_dir, tmp_path_factory, worker_id) -> BackendTest:
"""Return an instance of BackendTest, loaded with data."""

cls = _get_backend_conf(request.param)
@@ -544,6 +541,7 @@ def ddl_con(ddl_backend):
"mysql",
"oracle",
"postgres",
"risingwave",
"snowflake",
"sqlite",
"trino",
22 changes: 19 additions & 3 deletions ibis/backends/dask/execution/arrays.py
@@ -6,6 +6,7 @@
import dask.dataframe as dd
import dask.dataframe.groupby as ddgb
import numpy as np
import pandas as pd

import ibis.expr.operations as ops
from ibis.backends.dask.core import execute
@@ -34,10 +35,25 @@
)


@execute_node.register(ops.ArrayColumn, tuple)
@execute_node.register(ops.Array, tuple)
def execute_array_column(op, cols, **kwargs):
cols = [execute(arg, **kwargs) for arg in cols]
df = dd.concat(cols, axis=1)
vals = [execute(arg, **kwargs) for arg in cols]

length = next((len(v) for v in vals if isinstance(v, dd.Series)), None)
if length is None:
return vals

n_partitions = next((v.npartitions for v in vals if isinstance(v, dd.Series)), None)

def ensure_series(v):
if isinstance(v, dd.Series):
return v
else:
return dd.from_pandas(pd.Series([v] * length), npartitions=n_partitions)

# dd.concat() can only handle array-likes.
# If we're given a scalar, we need to broadcast it as a Series.
df = dd.concat([ensure_series(v) for v in vals], axis=1)
return df.apply(
lambda row: np.array(row, dtype=object), axis=1, meta=(None, "object")
)
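Note: the broadcasting logic above only matters when `ops.Array` mixes column-backed values with scalar literals. A rough standalone sketch of the same idea using plain dask objects (illustrative, not the Ibis dispatch path):

```python
# Standalone sketch: dd.concat rejects bare scalars, so a literal has to be
# promoted to a Series with matching length and partitioning first.
import dask.dataframe as dd
import numpy as np
import pandas as pd

col = dd.from_pandas(pd.Series([1, 2, 3]), npartitions=2)  # column-backed value
scalar = 10                                                # literal value

scalar_series = dd.from_pandas(
    pd.Series([scalar] * len(col)), npartitions=col.npartitions
)

df = dd.concat([col, scalar_series], axis=1)
arrays = df.apply(
    lambda row: np.array(row, dtype=object), axis=1, meta=(None, "object")
)
print(arrays.compute().tolist())
# roughly: [array([1, 10], ...), array([2, 10], ...), array([3, 10], ...)]
```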
5 changes: 2 additions & 3 deletions ibis/backends/dask/tests/execution/test_operations.py
@@ -353,9 +353,8 @@ def test_boolean_aggregation(t, df, reduction):


@pytest.mark.parametrize("column", ["float64_with_zeros", "int64_with_zeros"])
def test_null_if_zero(t, df, column):
with pytest.warns(FutureWarning):
expr = t[column].nullifzero()
def test_nullif_zero(t, df, column):
expr = t[column].nullif(0)
result = expr.compile()
expected = df[column].replace(0, np.nan)
tm.assert_series_equal(result.compute(), expected.compute(), check_index=False)
7 changes: 7 additions & 0 deletions ibis/backends/datafusion/__init__.py
@@ -46,6 +46,12 @@
_exclude_exp = (exp.Pow, exp.ArrayContains)


def _lower_unit(self, expr):
value = expr.this.sql(dialect=self.dialect)
unit = expr.unit.this.lower()
return f"INTERVAL '{value} {unit}'"


# the DataFusion dialect was created to skip the power function to operator transformation
# in the future this could be used to optimize sqlglot for datafusion
class DataFusion(Postgres):
@@ -61,6 +67,7 @@ class Generator(Postgres.Generator):
]
),
exp.IsNan: rename_func("isnan"),
exp.Interval: _lower_unit,
}


7 changes: 4 additions & 3 deletions ibis/backends/datafusion/compiler/values.py
@@ -109,6 +109,7 @@ def translate_val(op, **_):
ops.ArrayContains: "array_contains",
ops.ArrayLength: "array_length",
ops.ArrayRemove: "array_remove_all",
ops.RegexSplit: "regex_split",
}

for _op, _name in _simple_ops.items():
@@ -732,9 +733,9 @@ def _not_null(op, *, arg, **_):
return sg.not_(arg.is_(NULL))


@translate_val.register(ops.ArrayColumn)
def array_column(op, *, cols, **_):
return F.make_array(*cols)
@translate_val.register(ops.Array)
def array_column(op, *, exprs, **_):
return F.make_array(*exprs)


@translate_val.register(ops.ArrayRepeat)
18 changes: 17 additions & 1 deletion ibis/backends/datafusion/udfs.py
@@ -7,7 +7,8 @@
import pyarrow.compute as pc
import pyarrow_hotfix # noqa: F401

import ibis.expr.datatypes as dt # noqa: TCH001
import ibis.common.exceptions as com
import ibis.expr.datatypes as dt


def _extract_epoch_seconds(array) -> dt.int32:
@@ -113,3 +114,18 @@ def extract_minute_timestamp(array: dt.Timestamp(scale=9)) -> dt.int32:

def extract_hour_time(array: dt.time) -> dt.int32:
return pc.cast(pc.hour(array), pa.int32())


def regex_split(s: str, pattern: str) -> list[str]:
# TODO: pretty inefficient, but this is a stopgap until we can get an
# upstream version of this function
#
# unique is necessary because when `s` is coming from a column, `pattern`
# is repeated to match the length of `s`
patterns = pattern.unique()
if len(patterns) != 1:
raise com.IbisError(
"Only a single scalar pattern is supported for DataFusion re_split"
)
pattern = patterns[0].as_py()
return pc.split_pattern_regex(s, pattern)
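Note: a rough usage sketch of the pyarrow primitive this UDF wraps; the uniqueness check above exists because DataFusion hands the scalar pattern in as a column of repeated values (names here are illustrative):

```python
# Sketch using plain pyarrow arrays, outside DataFusion's UDF machinery.
import pyarrow as pa
import pyarrow.compute as pc

s = pa.array(["a,b;c", "x;y"])
pattern = pa.array([";", ";"])  # scalar pattern, repeated to match len(s)

patterns = pattern.unique()
assert len(patterns) == 1
print(pc.split_pattern_regex(s, patterns[0].as_py()).to_pylist())
# [['a,b', 'c'], ['x', 'y']]
```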
37 changes: 33 additions & 4 deletions ibis/backends/duckdb/registry.py
@@ -9,6 +9,7 @@
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.functions import GenericFunction

import ibis.backends.base.sql.registry.geospatial as geo
import ibis.expr.operations as ops
from ibis.backends.base.sql import alchemy
from ibis.backends.base.sql.alchemy import unary
@@ -61,6 +62,11 @@ def _centroid(t, op):
return sa.func.st_centroid(arg, type_=Geometry_WKB)


def _geo_flip_coordinates(t, op):
arg = t.translate(op.arg)
return sa.func.st_flipcoordinates(arg, type_=Geometry_WKB)


def _geo_end_point(t, op):
arg = t.translate(op.arg)
return sa.func.st_endpoint(arg, type_=Geometry_WKB)
@@ -193,6 +199,8 @@ def _literal(t, op):

if dtype.is_interval():
return getattr(sa.func, f"to_{dtype.unit.plural}")(value)
elif dtype.is_geospatial():
return sa.literal_column(geo.translate_literal(op, inline_metadata=True))
elif dtype.is_array():
values = value.tolist() if isinstance(value, np.ndarray) else value
return sa.cast(sa.func.list_value(*values), sqla_type)
@@ -389,11 +397,21 @@ def _array_remove(t, op):
)


def _hexdigest(translator, op):
how = op.how

arg_formatted = translator.translate(op.arg)
if how in ("md5", "sha256"):
return getattr(sa.func, how)(arg_formatted)
else:
raise NotImplementedError(how)


operation_registry.update(
{
ops.ArrayColumn: (
ops.Array: (
lambda t, op: sa.cast(
sa.func.list_value(*map(t.translate, op.cols)),
sa.func.list_value(*map(t.translate, op.exprs)),
t.get_sqla_type(op.dtype),
)
),
@@ -489,8 +507,16 @@ def _array_remove(t, op):
ops.JSONGetItem: fixed_arity(_json_get_item, 2),
ops.RowID: lambda *_: sa.literal_column("rowid"),
ops.StringToTimestamp: _strptime,
ops.Quantile: reduction(sa.func.quantile_cont),
ops.MultiQuantile: reduction(sa.func.quantile_cont),
ops.Quantile: lambda t, op: (
reduction(sa.func.quantile_cont)(t, op)
if op.arg.dtype.is_numeric()
else reduction(sa.func.quantile_disc)(t, op)
),
ops.MultiQuantile: lambda t, op: (
reduction(sa.func.quantile_cont)(t, op)
if op.arg.dtype.is_numeric()
else reduction(sa.func.quantile_disc)(t, op)
),
ops.TypeOf: unary(sa.func.typeof),
ops.IntervalAdd: fixed_arity(operator.add, 2),
ops.IntervalSubtract: fixed_arity(operator.sub, 2),
@@ -517,6 +543,7 @@ def _array_remove(t, op):
ops.MapValues: unary(sa.func.map_values),
ops.MapMerge: fixed_arity(sa.func.map_concat, 2),
ops.Hash: unary(sa.func.hash),
ops.HexDigest: _hexdigest,
ops.Median: reduction(sa.func.median),
ops.First: reduction(sa.func.first),
ops.Last: reduction(sa.func.last),
@@ -560,8 +587,10 @@ def _array_remove(t, op):
ops.GeoX: unary(sa.func.ST_X),
ops.GeoY: unary(sa.func.ST_Y),
ops.GeoConvert: _geo_convert,
ops.GeoFlipCoordinates: _geo_flip_coordinates,
# other ops
ops.TimestampRange: fixed_arity(sa.func.range, 3),
ops.RegexSplit: fixed_arity(sa.func.str_split_regex, 2),
}
)
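Note: the `Quantile`/`MultiQuantile` dispatch above picks DuckDB's `quantile_cont` for numeric input and falls back to `quantile_disc` otherwise, since the continuous variant interpolates and is not defined for non-numeric columns. A hedged sketch of the difference in the raw DuckDB client:

```python
# Sketch using the duckdb Python client directly (not through Ibis).
import duckdb

con = duckdb.connect()

print(con.execute(
    "SELECT quantile_cont(x, 0.5) FROM (VALUES (1), (2), (3), (4)) AS t(x)"
).fetchone())
# (2.5,) -- interpolates between the two middle values

print(con.execute(
    "SELECT quantile_disc(x, 0.5) FROM (VALUES ('a'), ('b'), ('c'), ('d')) AS t(x)"
).fetchone())
# returns one of the actual elements; quantile_cont would reject VARCHAR input
```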

31 changes: 16 additions & 15 deletions ibis/backends/duckdb/tests/conftest.py
@@ -97,33 +97,34 @@ def con(data_dir, tmp_path_factory, worker_id):


@pytest.fixture(scope="session")
def zones(con, data_dir):
zones = con.read_geo(data_dir / "geojson" / "zones.geojson")
return zones
def gpd():
pytest.importorskip("shapely")
pytest.importorskip("geoalchemy2")
return pytest.importorskip("geopandas")


@pytest.fixture(scope="session")
def lines(con, data_dir):
lines = con.read_geo(data_dir / "geojson" / "lines.geojson")
return lines
def zones(con, data_dir, gpd):
return con.read_geo(data_dir / "geojson" / "zones.geojson")


@pytest.fixture(scope="session")
def zones_gdf(data_dir):
gpd = pytest.importorskip("geopandas")
zones_gdf = gpd.read_file(data_dir / "geojson" / "zones.geojson")
return zones_gdf
def lines(con, data_dir, gpd):
return con.read_geo(data_dir / "geojson" / "lines.geojson")


@pytest.fixture(scope="session")
def lines_gdf(data_dir):
gpd = pytest.importorskip("geopandas")
lines_gdf = gpd.read_file(data_dir / "geojson" / "lines.geojson")
return lines_gdf
def zones_gdf(data_dir, gpd):
return gpd.read_file(data_dir / "geojson" / "zones.geojson")


@pytest.fixture(scope="session")
def geotable(con):
def lines_gdf(data_dir, gpd):
return gpd.read_file(data_dir / "geojson" / "lines.geojson")


@pytest.fixture(scope="session")
def geotable(con, gpd):
return con.table("geo")


@@ -0,0 +1,2 @@
SELECT
CAST('1.0' AS REAL) AS "Cast('1.0', float32)"
@@ -0,0 +1,2 @@
SELECT
CAST('1.0' AS DOUBLE) AS "Cast('1.0', float64)"
@@ -1,3 +1,3 @@
SELECT
ST_DWITHIN(t0.geom, t0.geom, CAST(3.0 AS REAL(53))) AS tmp
ST_DWITHIN(t0.geom, t0.geom, CAST(3.0 AS DOUBLE)) AS tmp
FROM t AS t0
23 changes: 15 additions & 8 deletions ibis/backends/duckdb/tests/test_datatypes.py
@@ -1,11 +1,13 @@
from __future__ import annotations

import duckdb_engine
import numpy as np
import pytest
import sqlalchemy as sa
from packaging.version import parse as vparse
from pytest import param

import ibis
import ibis.backends.base.sql.alchemy.datatypes as sat
import ibis.common.exceptions as exc
import ibis.expr.datatypes as dt
@@ -66,17 +68,13 @@ def test_parser(typ, expected):

@pytest.mark.parametrize("uint_type", ["uint8", "uint16", "uint32", "uint64"])
def test_cast_uints(uint_type, snapshot):
import ibis

t = ibis.table(dict(a="int8"), name="t")
snapshot.assert_match(
str(ibis.to_sql(t.a.cast(uint_type), dialect="duckdb")), "out.sql"
)


def test_null_dtype():
import ibis

con = ibis.connect("duckdb://:memory:")

t = ibis.memtable({"a": [None, None]})
@@ -110,10 +108,6 @@ def test_generate_quoted_struct():
reason="mapping from UINTEGER query metadata fixed in 0.9.2",
)
def test_read_uint8_from_parquet(tmp_path):
import numpy as np

import ibis

con = ibis.duckdb.connect()

# There is an incorrect mapping in duckdb-engine from UInteger -> UInt8
@@ -129,3 +123,16 @@ def test_read_uint8_from_parquet(tmp_path):
t2 = con.read_parquet(parqpath)

assert t2.schema() == t.schema()


@pytest.mark.parametrize("typ", ["float32", "float64"])
def test_cast_to_floating_point_type(con, snapshot, typ):
expected = 1.0
value = ibis.literal(str(expected))
expr = value.cast(typ)

result = con.execute(expr)
assert result == expected

sql = str(ibis.to_sql(expr, dialect="duckdb"))
snapshot.assert_match(sql, "out.sql")
85 changes: 85 additions & 0 deletions ibis/backends/duckdb/tests/test_geospatial.py
@@ -1,5 +1,7 @@
from __future__ import annotations

from operator import methodcaller

import numpy.testing as npt
import pandas.testing as tm
import pyarrow as pa
@@ -11,6 +13,7 @@
gpd = pytest.importorskip("geopandas")
gtm = pytest.importorskip("geopandas.testing")
shapely = pytest.importorskip("shapely")
pytest.importorskip("geoalchemy2")


def test_geospatial_point(zones, zones_gdf):
@@ -210,3 +213,85 @@ def test_geospatial_convert(geotable, gdf):
gtm.assert_geoseries_equal(
geo_ll.to_pandas(), gdf_ll, check_less_precise=True, check_crs=False
)


def test_geospatial_flip_coordinates(geotable):
flipped = geotable.geom.flip_coordinates()

# flipped coords
point = shapely.geometry.Point(40, -100)
line_string = shapely.geometry.LineString([[0, 0], [1, 1], [1, 2], [2, 2]])
polygon = shapely.geometry.Polygon(((0, 0), (0, 1), (1, 1), (1, 0), (0, 0)))

d = {
"name": ["Point", "LineString", "Polygon"],
"geometry": [point, line_string, polygon],
}

flipped_gdf = gpd.GeoDataFrame(d)

gtm.assert_geoseries_equal(
flipped.to_pandas(), flipped_gdf.geometry, check_crs=False
)


def test_create_table_geospatial_types(geotable, con):
name = ibis.util.gen_name("geotable")

# con = ibis.get_backend(geotable)

t = con.create_table(name, geotable, temp=True)

assert t.op().name in con.list_tables()
assert any(map(methodcaller("is_geospatial"), t.schema().values()))


# geo literals declaration
point = ibis.literal((1, 0), type="point").name("p")
point_geom = ibis.literal((1, 0), type="point:geometry").name("p")


@pytest.mark.parametrize(
("expr", "expected"),
[
(point, "'POINT (1.0 0.0)'"),
(point_geom, "'POINT (1.0 0.0)'::geometry"),
],
)
def test_literal_geospatial_explicit(con, expr, expected):
result = str(con.compile(expr))
assert result == f"SELECT {expected} AS p"


# test input data with shapely geometries
shp_point_0 = shapely.Point(0, 0)
shp_point_1 = shapely.Point(1, 1)
shp_point_2 = shapely.Point(2, 2)

shp_linestring_0 = shapely.LineString([shp_point_0, shp_point_1, shp_point_2])
shp_linestring_1 = shapely.LineString([shp_point_2, shp_point_1, shp_point_0])
shp_polygon_0 = shapely.Polygon(shp_linestring_0)
shp_multilinestring_0 = shapely.MultiLineString([shp_linestring_0, shp_linestring_1])
shp_multipoint_0 = shapely.MultiPoint([shp_point_0, shp_point_1, shp_point_2])
shp_multipolygon_0 = shapely.MultiPolygon([shp_polygon_0])


@pytest.mark.parametrize(
("shp", "expected"),
[
(shp_point_0, "(0 0)"),
(shp_point_1, "(1 1)"),
(shp_point_2, "(2 2)"),
(shp_linestring_0, "(0 0, 1 1, 2 2)"),
(shp_linestring_1, "(2 2, 1 1, 0 0)"),
(shp_polygon_0, "((0 0, 1 1, 2 2, 0 0))"),
(shp_multipolygon_0, "(((0 0, 1 1, 2 2, 0 0)))"),
(shp_multilinestring_0, "((0 0, 1 1, 2 2), (2 2, 1 1, 0 0))"),
(shp_multipoint_0, "(0 0, 1 1, 2 2)"),
],
)
def test_literal_geospatial_inferred(con, shp, expected):
result = str(con.compile(ibis.literal(shp).name("result")))
name = type(shp).__name__.upper()
pair = f"{name} {expected}"
assert result == f"SELECT {pair!r} AS result"
48 changes: 22 additions & 26 deletions ibis/backends/duckdb/tests/test_register.py
@@ -14,18 +14,17 @@
import sqlalchemy as sa

import ibis
import ibis.common.exceptions as exc
import ibis.expr.datatypes as dt
from ibis.conftest import LINUX, SANDBOXED


def test_read_csv(data_dir):
t = ibis.read_csv(data_dir / "csv" / "functional_alltypes.csv")
def test_read_csv(con, data_dir):
t = con.read_csv(data_dir / "csv" / "functional_alltypes.csv")
assert t.count().execute()


def test_read_csv_with_columns(data_dir):
t = ibis.read_csv(
def test_read_csv_with_columns(con, data_dir):
t = con.read_csv(
data_dir / "csv" / "awards_players.csv",
header=True,
columns={
Expand All @@ -41,8 +40,8 @@ def test_read_csv_with_columns(data_dir):
assert t.count().execute()


def test_read_parquet(data_dir):
t = ibis.read_parquet(data_dir / "parquet" / "functional_alltypes.parquet")
def test_read_parquet(con, data_dir):
t = con.read_parquet(data_dir / "parquet" / "functional_alltypes.parquet")
assert t.count().execute()


@@ -92,25 +91,24 @@ def test_load_spatial_when_geo_column(tmpdir):
assert "spatial" in con.sql(query).name.to_pandas().values


@pytest.mark.usefixtures("gpd")
def test_read_geo_to_pyarrow(con, data_dir):
pytest.importorskip("geopandas")
shapely = pytest.importorskip("shapely")

t = con.read_geo(data_dir / "geojson" / "zones.geojson")
raw_geometry = t.head().to_pyarrow()["geom"].to_pandas()
assert len(shapely.from_wkb(raw_geometry))


def test_read_geo_to_geopandas(con, data_dir):
gpd = pytest.importorskip("geopandas")
def test_read_geo_to_geopandas(con, data_dir, gpd):
t = con.read_geo(data_dir / "geojson" / "zones.geojson")
gdf = t.head().to_pandas()
assert isinstance(gdf, gpd.GeoDataFrame)


def test_read_geo_from_url(con, monkeypatch):
loaded_exts = []
monkeypatch.setattr(con, "_load_extensions", lambda x, **kw: loaded_exts.extend(x))
monkeypatch.setattr(con, "_load_extensions", lambda x, **_: loaded_exts.extend(x))

with pytest.raises((sa.exc.OperationalError, sa.exc.ProgrammingError)):
# The read will fail, either because the URL is bogus (which it is) or
@@ -122,16 +120,13 @@ def test_read_geo_from_url(con, monkeypatch):
assert "httpfs" in loaded_exts


@pytest.mark.xfail_version(
duckdb=["duckdb<0.7.0"], reason="read_json_auto doesn't exist", raises=exc.IbisError
)
def test_read_json(data_dir, tmp_path):
pqt = ibis.read_parquet(data_dir / "parquet" / "functional_alltypes.parquet")
def test_read_json(con, data_dir, tmp_path):
pqt = con.read_parquet(data_dir / "parquet" / "functional_alltypes.parquet")

path = tmp_path.joinpath("ft.json")
path.write_text(pqt.execute().to_json(orient="records", lines=True))

jst = ibis.read_json(path)
jst = con.read_json(path)

nrows = pqt.count().execute()
assert nrows
@@ -301,7 +296,7 @@ def test_re_read_in_memory_overwrite(con):
assert table.schema() == ibis.schema([("a", "int"), ("c", "float")])


def test_memtable_with_nullable_dtypes():
def test_memtable_with_nullable_dtypes(con):
data = pd.DataFrame(
{
"a": pd.Series(["a", None, "c"], dtype="string"),
@@ -317,19 +312,19 @@ def test_memtable_with_nullable_dtypes():
}
)
expr = ibis.memtable(data)
res = expr.execute()
res = con.execute(expr)
assert len(res) == len(data)


def test_memtable_with_nullable_pyarrow_string():
def test_memtable_with_nullable_pyarrow_string(con):
pytest.importorskip("pyarrow")
data = pd.DataFrame({"a": pd.Series(["a", None, "c"], dtype="string[pyarrow]")})
expr = ibis.memtable(data)
res = expr.execute()
res = con.execute(expr)
assert len(res) == len(data)


def test_memtable_with_nullable_pyarrow_not_string():
def test_memtable_with_nullable_pyarrow_not_string(con):
pytest.importorskip("pyarrow")

data = pd.DataFrame(
@@ -346,7 +341,7 @@ def test_memtable_with_nullable_pyarrow_not_string():
}
)
expr = ibis.memtable(data)
res = expr.execute()
res = con.execute(expr)
assert len(res) == len(data)


@@ -433,7 +428,8 @@ def test_csv_with_slash_n_null(con, tmp_path):
reason=("nix can't hit GCS because it is sandboxed."),
)
def test_register_filesystem_gcs(con):
import fsspec
fsspec = pytest.importorskip("fsspec")
pytest.importorskip("gcsfs")

gcs = fsspec.filesystem("gcs")

@@ -447,7 +443,7 @@ def test_register_filesystem_gcs(con):

def test_memtable_null_column_parquet_dtype_roundtrip(con, tmp_path):
before = ibis.memtable({"a": [None, None, None]}, schema={"a": "string"})
before.to_parquet(tmp_path / "tmp.parquet")
after = ibis.read_parquet(tmp_path / "tmp.parquet")
con.to_parquet(before, tmp_path / "tmp.parquet")
after = con.read_parquet(tmp_path / "tmp.parquet")

assert before.a.type() == after.a.type()
296 changes: 281 additions & 15 deletions ibis/backends/flink/__init__.py

Large diffs are not rendered by default.

58 changes: 55 additions & 3 deletions ibis/backends/flink/datatypes.py
@@ -1,11 +1,16 @@
from __future__ import annotations

from pyflink.table.types import DataType, DataTypes, RowType
from typing import TYPE_CHECKING

from pyflink.table.types import DataType, DataTypes, RowType, _from_java_data_type

import ibis.expr.datatypes as dt
import ibis.expr.schema as sch
from ibis.formats import SchemaMapper, TypeMapper

if TYPE_CHECKING:
from pyflink.table import TableSchema


class FlinkRowSchema(SchemaMapper):
@classmethod
@@ -48,7 +53,10 @@ def to_ibis(cls, typ: DataType, nullable=True) -> dt.DataType:
elif typ == DataTypes.TIME():
return dt.Time(nullable=nullable)
elif typ == DataTypes.TIMESTAMP():
return dt.Timestamp(nullable=nullable)
return dt.Timestamp(
scale=typ.precision,
nullable=nullable,
)
else:
return super().to_ibis(typ, nullable=nullable)

@@ -88,10 +96,54 @@ def from_ibis(cls, dtype: dt.DataType) -> DataType:
elif dtype.is_time():
return DataTypes.TIME(nullable=dtype.nullable)
elif dtype.is_timestamp():
return DataTypes.TIMESTAMP(nullable=dtype.nullable)
# Note (mehmet): If `precision` is None, set it to 6.
# This is because `DataTypes.TIMESTAMP` throws TypeError
# if `precision` is None, and assumes `precision = 6`
# if it is not provided.
return DataTypes.TIMESTAMP(
precision=dtype.scale if dtype.scale is not None else 6,
nullable=dtype.nullable,
)
else:
return super().from_ibis(dtype)

@classmethod
def to_string(cls, dtype: dt.DataType) -> str:
return cls.from_ibis(dtype).type_name()


def get_field_data_types(pyflink_schema: TableSchema) -> list[DataType]:
"""Returns all field data types in `pyflink_schema` as a list.

This is a re-implementation of `get_field_data_types()` available for PyFlink
schemas. PyFlink's implementation currently supports only `precision = 3` for
`TimestampType` (for some reason that we could not figure out -- might be just
a bug). The lack of precision support led to an error due to unmatched schemas
for batches and the file to write in `to_csv()` and `to_parquet()`.

:return: A list of all field data types.
"""
from pyflink.java_gateway import get_gateway
from pyflink.util.java_utils import is_instance_of

gateway = get_gateway()

data_type_list = []
for j_data_type in pyflink_schema._j_table_schema.getFieldDataTypes():
if not is_instance_of(j_data_type, gateway.jvm.AtomicDataType):
data_type = _from_java_data_type(j_data_type)

else:
logical_type = j_data_type.getLogicalType()
if is_instance_of(logical_type, gateway.jvm.TimestampType):
data_type = DataTypes.TIMESTAMP(
precision=logical_type.getPrecision(),
nullable=logical_type.isNullable(),
)

else:
data_type = _from_java_data_type(j_data_type)

data_type_list.append(data_type)

return data_type_list
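Note: with the precision handling above, an Ibis timestamp's `scale` should survive the round trip through the Flink type mapper instead of collapsing to a default. A small hedged sketch (assuming the mapper class shown here is the backend's `FlinkType`):

```python
# Sketch; requires pyflink, and assumes FlinkType is the TypeMapper above.
import ibis.expr.datatypes as dt
from ibis.backends.flink.datatypes import FlinkType

print(FlinkType.from_ibis(dt.Timestamp(scale=3)))  # a TIMESTAMP type with precision 3
print(FlinkType.from_ibis(dt.Timestamp()))         # scale=None falls back to precision 6
```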
46 changes: 41 additions & 5 deletions ibis/backends/flink/ddl.py
@@ -4,6 +4,7 @@

import sqlglot as sg

import ibis.common.exceptions as exc
import ibis.expr.schema as sch
from ibis.backends.base.sql.ddl import (
CreateTable,
@@ -19,15 +20,19 @@
)
from ibis.backends.base.sql.registry import quote_identifier
from ibis.backends.flink.registry import type_to_sql_string
from ibis.util import promote_list

if TYPE_CHECKING:
from collections.abc import Sequence

from ibis.api import Watermark


def format_schema(schema):
def format_schema(schema: sch.Schema):
elements = [
_format_schema_element(name, t) for name, t in zip(schema.names, schema.types)
]

return "({})".format(",\n ".join(elements))


@@ -37,7 +42,7 @@ def _format_schema_element(name, t):

def type_to_flink_sql_string(tval):
if tval.is_timestamp():
return f"TIMESTAMP({tval.scale})"
return f"timestamp({tval.scale})" if tval.scale is not None else "timestamp"
else:
sql_string = type_to_sql_string(tval)
if not tval.nullable:
@@ -54,15 +59,31 @@ def _format_watermark_strategy(watermark: Watermark) -> str:


def format_schema_with_watermark(
schema: sch.Schema, watermark: Watermark | None = None
schema: sch.Schema,
watermark: Watermark | None = None,
primary_keys: Sequence[str] | None = None,
) -> str:
elements = [
_format_schema_element(name, t) for name, t in zip(schema.names, schema.types)
]

if watermark is not None:
elements.append(
f"WATERMARK FOR {watermark.time_col} AS {_format_watermark_strategy(watermark)}"
)

if primary_keys is not None and primary_keys:
# Note (mehmet): Currently supports "NOT ENFORCED" only. For the reason
# of this choice, the following quote from Flink docs is self-explanatory:
# "SQL standard specifies that a constraint can either be ENFORCED or
# NOT ENFORCED. This controls if the constraint checks are performed on
# the incoming/outgoing data. Flink does not own the data therefore the
# only mode we want to support is the NOT ENFORCED mode. It is up to the
# user to ensure that the query enforces key integrity."
# Ref: https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/table/sql/create/#primary-key
comma_separated_keys = ", ".join(f"`{key}`" for key in primary_keys)
elements.append(f"PRIMARY KEY ({comma_separated_keys}) NOT ENFORCED")

return "({})".format(",\n ".join(elements))


@@ -88,6 +109,7 @@ def __init__(
schema: sch.Schema,
tbl_properties: dict,
watermark: Watermark | None = None,
primary_key: str | Sequence[str] | None = None,
database: str | None = None,
catalog: str | None = None,
temporary: bool = False,
@@ -107,6 +129,16 @@ def __init__(
self.temporary = temporary
self.watermark = watermark

self.primary_keys = promote_list(primary_key)

# Check if `primary_keys` is a subset of the columns in `schema`.
if self.primary_keys and not set(self.primary_keys) <= set(schema.names):
raise exc.IbisError(
"`primary_key` must be a subset of the columns in `schema`. \n"
f"\t primary_key= {primary_key} \n"
f"\t schema.names= {schema.names}"
)

def _storage(self) -> str:
return f"STORED AS {self.format}" if self.format else None

@@ -142,10 +174,14 @@ def _pieces(self):
}
main_schema = sch.Schema(fields)

yield format_schema_with_watermark(main_schema, self.watermark)
yield format_schema_with_watermark(
main_schema, self.watermark, self.primary_keys
)
yield f"PARTITIONED BY {format_schema(part_schema)}"
else:
yield format_schema_with_watermark(self.schema, self.watermark)
yield format_schema_with_watermark(
self.schema, self.watermark, self.primary_keys
)

yield self._format_tbl_properties()
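Note: threading `primary_key` through the DDL means the rendered column block gains a `PRIMARY KEY (...) NOT ENFORCED` clause alongside any watermark. A hedged sketch of the formatting helper in isolation (type spellings and whitespace may differ slightly by version):

```python
# Sketch; exercises only the schema formatter above, not a live Flink session.
import ibis
from ibis.backends.flink.ddl import format_schema_with_watermark

schema = ibis.schema({"id": "!int64", "name": "string"})
print(format_schema_with_watermark(schema, watermark=None, primary_keys=["id"]))
# Expected to resemble:
# (`id` bigint NOT NULL,
#   `name` string,
#   PRIMARY KEY (`id`) NOT ENFORCED)
```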

152 changes: 142 additions & 10 deletions ibis/backends/flink/registry.py
@@ -45,6 +45,28 @@ def _count_star(translator: ExprTranslator, op: ops.Node) -> str:
return f"COUNT(*){condition}"


def _string_concat(translator: ExprTranslator, op: ops.StringConcat) -> str:
joined_args = ", ".join(map(translator.translate, op.arg))
return f"CONCAT({joined_args})"


def _strftime(translator: ExprTranslator, op: ops.Strftime) -> str:
import sqlglot as sg

import ibis.expr.datatypes as dt

hive_dialect = sg.dialects.hive.Hive
if (time_mapping := getattr(hive_dialect, "TIME_MAPPING", None)) is None:
time_mapping = hive_dialect.time_mapping
reverse_hive_mapping = {v: k for k, v in time_mapping.items()}

format_str = translator.translate(op.format_str)
transformed_format_str = sg.time.format_time(format_str, reverse_hive_mapping)
arg = translator.translate(ops.Cast(op.arg, to=dt.string))

return f"FROM_UNIXTIME(UNIX_TIMESTAMP({arg}), {transformed_format_str})"


def _date(translator: ExprTranslator, op: ops.Node) -> str:
(arg,) = op.args
return f"CAST({translator.translate(arg)} AS DATE)"
@@ -68,7 +90,8 @@ def _cast(translator: ExprTranslator, op: ops.generic.Cast) -> str:
if to.timezone:
return f"TO_TIMESTAMP(CONVERT_TZ(CAST({arg_translated} AS STRING), 'UTC+0', '{to.timezone}'))"
else:
return f"TO_TIMESTAMP({arg_translated})"
return f"TO_TIMESTAMP({arg_translated}, 'yyyy-MM-dd HH:mm:ss.SSS')"

elif to.is_date():
return f"CAST({arg_translated} AS date)"
elif to.is_json():
@@ -249,13 +272,21 @@ def _clip(translator: ExprTranslator, op: ops.Node) -> str:
return f"CAST({arg} AS {FlinkType.from_ibis(op.dtype)!s})"


def _ntile(translator: ExprTranslator, op: ops.NTile) -> str:
return f"NTILE({op.buckets.value})"


def _floor_divide(translator: ExprTranslator, op: ops.Node) -> str:
left = translator.translate(op.left)
right = translator.translate(op.right)
return f"FLOOR(({left}) / ({right}))"


def _array_index(translator: ExprTranslator, op: ops.arrays.ArrayIndex):
def _array(translator: ExprTranslator, op: ops.Array) -> str:
return f"ARRAY[{', '.join(map(translator.translate, op.exprs))}]"


def _array_index(translator: ExprTranslator, op: ops.ArrayIndex):
table_column = op.arg
index = op.index

@@ -265,10 +296,35 @@ def _array_index(translator: ExprTranslator, op: ops.arrays.ArrayIndex):
return f"{table_column_translated} [ {index_translated} + 1 ]"


def _array_length(translator: ExprTranslator, op: ops.arrays.ArrayLength) -> str:
def _array_length(translator: ExprTranslator, op: ops.ArrayLength) -> str:
return f"CARDINALITY({translator.translate(op.arg)})"


def _array_position(translator: ExprTranslator, op: ops.ArrayPosition) -> str:
arg = translator.translate(op.arg)
other = translator.translate(op.other)
return f"ARRAY_POSITION({arg}, {other}) - 1"


def _array_slice(translator: ExprTranslator, op: ops.ArraySlice) -> str:
array = translator.translate(op.arg)
start = op.start.value
# The offsets are 1-based for ARRAY_SLICE.
# Ref: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/systemfunctions
if start >= 0:
start += 1

if op.stop is None:
return f"ARRAY_SLICE({array}, {start})"

stop = op.stop.value
if stop >= 0:
return f"ARRAY_SLICE({array}, {start}, {stop})"
else:
# To imitate the behavior of pandas array slicing.
return f"ARRAY_SLICE({array}, {start}, CARDINALITY({array}) - {abs(stop)})"


def _json_get_item(translator: ExprTranslator, op: ops.json.JSONGetItem) -> str:
arg_translated = translator.translate(op.arg)
if op.index.dtype.is_integer():
@@ -294,17 +350,40 @@ def _map_get(translator: ExprTranslator, op: ops.maps.MapGet) -> str:
return f"{map_} [ {key} ]"


def _struct_field(translator: ExprTranslator, op: ops.StructField) -> str:
arg = translator.translate(op.arg)
return f"{arg}.`{op.field}`"


def _day_of_week_index(
translator: ExprTranslator, op: ops.temporal.DayOfWeekIndex
) -> str:
arg = translator.translate(op.arg)
return f"MOD(DAYOFWEEK({arg}) + 5, 7)"


def _day_of_week_name(
translator: ExprTranslator, op: ops.temporal.DayOfWeekName
) -> str:
arg = translator.translate(op.arg)
map_str = "1=Sunday,2=Monday,3=Tuesday,4=Wednesday,5=Thursday,6=Friday,7=Saturday"
return f"STR_TO_MAP('{map_str}')[CAST(DAYOFWEEK(CAST({arg} AS DATE)) AS STRING)]"


def _date_add(translator: ExprTranslator, op: ops.temporal.DateAdd) -> str:
return _left_op_right(translator=translator, op_node=op, op_sign="+")


def _date_delta(translator: ExprTranslator, op: ops.temporal.DateDelta) -> str:
left = translator.translate(op.left)
right = translator.translate(op.right)
unit = op.part.value.upper()

return (
f"TIMESTAMPDIFF({unit}, CAST({right} AS TIMESTAMP), CAST({left} AS TIMESTAMP))"
)


def _date_diff(translator: ExprTranslator, op: ops.temporal.DateDiff) -> str:
raise com.UnsupportedOperationError("DATE_DIFF is not supported in Flink.")

@@ -338,12 +417,22 @@ def _string_to_timestamp(
def _time(translator: ExprTranslator, op: ops.temporal.Time) -> str:
if op.arg.dtype.is_timestamp():
datetime = op.arg.value
return f"TIME '{datetime.hour}:{datetime.minute}:{datetime.second}'"
return f"TIME '{datetime.hour}:{datetime.minute}:{datetime.second}.{datetime.microsecond}'"

else:
raise com.UnsupportedOperationError(f"Does NOT support dtype= {op.arg.dtype}")


def _time_delta(translator: ExprTranslator, op: ops.temporal.TimeDiff) -> str:
left = translator.translate(op.left)
right = translator.translate(op.right)
unit = op.part.value.upper()

return (
f"TIMESTAMPDIFF({unit}, CAST({right} AS TIMESTAMP), CAST({left} AS TIMESTAMP))"
)


def _time_from_hms(translator: ExprTranslator, op: ops.temporal.TimeFromHMS) -> str:
hours, minutes, seconds = (
f"CAST({translator.translate(e)} AS STRING)"
@@ -357,6 +446,33 @@ def _timestamp_add(translator: ExprTranslator, op: ops.temporal.TimestampAdd) ->
return _left_op_right(translator=translator, op_node=op, op_sign="+")


def _timestamp_bucket(
translator: ExprTranslator, op: ops.temporal.TimestampBucket
) -> str:
arg_translated = translator.translate(op.arg)

unit = op.interval.dtype.unit.name
unit_for_mod = "DAYOFMONTH" if unit == "DAY" else unit
bucket_width = op.interval.value
offset = op.offset.value if op.offset else 0

arg_offset = f"TIMESTAMPADD({unit}, -({offset}), {arg_translated})"
num = f"{unit_for_mod}({arg_offset})"
mod = f"{num} % {bucket_width}"

return f"TIMESTAMPADD({unit}, -({mod}) + {offset}, FLOOR({arg_offset} TO {unit}))"


def _timestamp_delta(
translator: ExprTranslator, op: ops.temporal.TimestampDelta
) -> str:
left = translator.translate(op.left)
right = translator.translate(op.right)
unit = op.part.value.upper()

return f"TIMESTAMPDIFF({unit}, {right}, {left})"


def _timestamp_diff(translator: ExprTranslator, op: ops.temporal.TimestampDiff) -> str:
return _left_op_right(translator=translator, op_node=op, op_sign="-")
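Note: `_timestamp_bucket` above emulates bucketing by flooring to the unit and then subtracting `unit(arg) % bucket_width`, shifted by any offset. The same arithmetic in plain Python, for a 5-minute bucket with no offset (illustrative only):

```python
from datetime import datetime

ts = datetime(2009, 5, 17, 12, 34, 56)
bucket_width = 5  # minutes

floored = ts.replace(second=0, microsecond=0)            # FLOOR(arg TO MINUTE)
mod = floored.minute % bucket_width                      # MINUTE(arg) % 5
bucketed = floored.replace(minute=floored.minute - mod)  # TIMESTAMPADD(MINUTE, -mod, ...)
print(bucketed)  # 2009-05-17 12:30:00
```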

Expand All @@ -370,18 +486,18 @@ def _timestamp_sub(translator: ExprTranslator, op: ops.temporal.TimestampSub) ->
return f"{table_column_translated} - {interval_translated}"


def _timestamp_from_unix(translator: ExprTranslator, op: ops.Node) -> str:
arg, unit = op.args
def _timestamp_from_unix(translator: ExprTranslator, op: ops.TimestampFromUNIX) -> str:
arg, unit = op.arg, op.unit

numeric = helpers.quote_identifier(arg.name, force=True)
if unit == TimestampUnit.MILLISECOND:
precision = 3
elif unit == TimestampUnit.SECOND:
precision = 0
else:
raise ValueError(f"{unit!r} unit is not supported!")

return f"TO_TIMESTAMP_LTZ({numeric}, {precision})"
arg = translator.translate(op.arg)
return f"CAST(TO_TIMESTAMP_LTZ({arg}, {precision}) AS TIMESTAMP)"


def _timestamp_from_ymdhms(
@@ -412,9 +528,11 @@ def _struct_field(translator, op):
ops.ApproxCountDistinct: aggregate.reduction("approx_count_distinct"),
ops.CountStar: _count_star,
# String operations
ops.RegexSearch: fixed_arity("regexp", 2),
ops.StringConcat: _string_concat,
ops.Strftime: _strftime,
ops.StringLength: unary("char_length"),
ops.StrRight: fixed_arity("right", 2),
ops.RegexSearch: fixed_arity("regexp", 2),
# Timestamp operations
ops.Date: _date,
ops.ExtractEpochSeconds: _extract_epoch_seconds,
@@ -439,25 +557,39 @@ def _struct_field(translator, op):
ops.IfElse: _filter,
ops.Window: _window,
ops.Clip: _clip,
ops.NTile: _ntile,
# Binary operations
ops.Power: fixed_arity("power", 2),
ops.FloorDivide: _floor_divide,
# Collection functions
# Collection operations
ops.Array: _array,
ops.ArrayContains: fixed_arity("ARRAY_CONTAINS", 2),
ops.ArrayDistinct: fixed_arity("ARRAY_DISTINCT", 1),
ops.ArrayIndex: _array_index,
ops.ArrayLength: _array_length,
ops.ArrayPosition: _array_position,
ops.ArrayRemove: fixed_arity("ARRAY_REMOVE", 2),
ops.ArraySlice: _array_slice,
ops.ArrayUnion: fixed_arity("ARRAY_UNION", 2),
ops.JSONGetItem: _json_get_item,
ops.Map: _map,
ops.MapGet: _map_get,
ops.StructField: _struct_field,
# Temporal functions
ops.DateAdd: _date_add,
ops.DateDelta: _date_delta,
ops.DateDiff: _date_diff,
ops.DateFromYMD: _date_from_ymd,
ops.DateSub: _date_sub,
ops.DayOfWeekIndex: _day_of_week_index,
ops.DayOfWeekName: _day_of_week_name,
ops.StringToTimestamp: _string_to_timestamp,
ops.Time: _time,
ops.TimeDelta: _time_delta,
ops.TimeFromHMS: _time_from_hms,
ops.TimestampAdd: _timestamp_add,
ops.TimestampBucket: _timestamp_bucket,
ops.TimestampDelta: _timestamp_delta,
ops.TimestampDiff: _timestamp_diff,
ops.TimestampFromUNIX: _timestamp_from_unix,
ops.TimestampFromYMDHMS: _timestamp_from_ymdhms,