358 changes: 358 additions & 0 deletions docs/posts/better-pypi-stats/index.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,358 @@
---
title: "Better PyPI stats with Python"
author: "Cody Peterson"
date: "2024-09-03"
image: thumbnail.png
categories:
- clickhouse
- shiny
---

***Ibis + ClickHouse + Shiny for Python = better PyPI stats.***

## Overview

[PyPI Stats](https://pypistats.org/about) is a great resource for Python package
download statistics from PyPI. However, it only retains 180 days of data and
lacks some of the more detailed analysis we might be interested in. In this
post, we'll build a dynamic Python application for better PyPI stats using
[ClickHouse](https://github.com/clickhouse/clickhouse) as our data platform,
[Ibis](https://github.com/ibis-project/ibis) as our Python data interface, and
[Shiny for Python](https://github.com/posit-dev/py-shiny) as our dashboarding
tool.

::: {.callout-note title="What about ClickPy?"}
[ClickPy](https://github.com/ClickHouse/clickpy) is an existing open source and
reproducible project built on the same data with ClickHouse. The primary
difference is that ClickPy uses SQL and JavaScript whereas this project is in
Python. We also focus on different visualizations and metrics.
:::

## Prerequisites

Install the required dependencies:

```bash
pip install 'ibis-framework[clickhouse]' plotly
```

Then run imports and setup:

```{python}
import ibis
import plotly.express as px
import clickhouse_connect
px.defaults.template = "plotly_dark"
ibis.options.interactive = True
```

## Connecting to ClickHouse

You can connect to the public ClickHouse playground's PyPI database:

```{python}
host = "clickpy-clickhouse.clickhouse.com"
port = 443
user = "play"
database = "pypi"
con = ibis.clickhouse.connect(
host=host,
port=port,
user=user,
database=database,
)
con.list_tables()
```

## Top packages by downloads

Let's start by looking at the most downloaded packages:

```{python}
overall_t = con.table("pypi_downloads")
top_k = 10_000
overall_t = (
overall_t.order_by(ibis.desc("count"))
.limit(top_k)
.mutate(rank=1 + ibis.row_number().over(order_by=ibis.desc("count")))
.rename({"downloads": "count"})
.relocate("rank")
.order_by("rank")
)
overall_t
```

## Analyzing downloads for a package

Let's choose a package to analyze:

```{python}
project = "clickhouse-connect"
```

And see where it ranks in the top downloads:

```{python}
overall_t.filter(overall_t["project"] == project)
```

Let's look at downloads per day by various categories for this package:

```{python}
downloads_t = con.table(
"pypi_downloads_per_day_by_version_by_installer_by_type_by_country"
).filter(ibis._["project"] == project)
downloads_t
```

We might be interested in the day-of-week seasonality of downloads:

```{python}
def day_of_week_bar(t):
t = t.mutate(day_of_week=t["date"].day_of_week.full_name())
t = t.group_by("day_of_week").agg(downloads=ibis._["count"].sum())
c = px.bar(
t,
x="day_of_week",
y="downloads",
category_orders={
"day_of_week": [
"Sunday",
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
]
},
)
return c
day_of_week_bar(downloads_t)
```

Or the rolling 28-day downloads metric:

```{python}
def rolling_downloads(t, days=28):
t = t.mutate(
timestamp=t["date"].cast("timestamp"),
)
t = t.group_by("timestamp").agg(downloads=ibis._["count"].sum())
t = t.select(
"timestamp",
rolling_downloads=ibis._["downloads"]
.sum()
.over(
ibis.window(
order_by="timestamp",
preceding=days,
following=0,
)
),
).order_by("timestamp")
c = px.line(
t,
x="timestamp",
y="rolling_downloads",
)
return c
rolling_downloads(downloads_t)
```

Or rolling 28-day downloads by version, with a few options for how to group
versions:

```{python}
def rolling_downloads_by_version(t, days=28, version_style="major.minor"):
    t = t.mutate(
        timestamp=t["date"].cast("timestamp"),
    )
    match version_style:
        case "major":
            t = t.mutate(version=t["version"].split(".")[0])
        case "major.minor":
            t = t.mutate(
                version=t["version"].split(".")[0] + "." + t["version"].split(".")[1]
            )
        case _:
            pass
    t = t.group_by("timestamp", "version").agg(downloads=ibis._["count"].sum())
    t = t.select(
        "timestamp",
        "version",
        rolling_downloads=ibis._["downloads"]
        .sum()
        .over(
            ibis.window(
                order_by="timestamp",
                group_by="version",
                preceding=days,  # use the `days` argument rather than a hard-coded 28
                following=0,
            )
        ),
    ).order_by("timestamp")
    c = px.line(
        t,
        x="timestamp",
        y="rolling_downloads",
        color="version",
        category_orders={
            "version": list(
                reversed(
                    sorted(
                        t.distinct(on="version")["version"].to_pyarrow().to_pylist(),
                        key=lambda x: tuple(
                            int(y) for y in x.split(".") if y.isdigit()
                        ),
                    )
                )
            )
        },
    )
    return c
rolling_downloads_by_version(downloads_t)
```

Or a bar chart of downloads grouped by a category:

```{python}
def group_bar(t, group_by="installer", log_y=True):
t = t.mutate(timestamp=t["date"].cast("timestamp"))
t = t.group_by(group_by).agg(downloads=ibis._["count"].sum())
t = t.order_by(ibis.desc("downloads"))
c = px.bar(
t,
x=group_by,
y="downloads",
log_y=log_y,
)
return c
group_bar(downloads_t)
```

::: {.callout-tip title="More examples" collapse="true"}

Since we're just writing Python, we've already organized our code into
functions for reuse. We can rerun the analytics above on a different package by
changing the `project` variable and adjusting our table accordingly. We'll
demonstrate this with a few more packages below.

Notice you could also pass in Ibis tables from different backends, not just
ClickHouse, to these functions!
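
For instance, here's a quick sketch with DuckDB. It is not executed in this
post and assumes a hypothetical local `pypi_downloads.parquet` file with the
same `project`, `date`, and `count` columns:

```python
# not executed here -- assumes a hypothetical local Parquet file with
# `project`, `date`, and `count` columns matching the schema above
ddb = ibis.duckdb.connect()
local_t = ddb.read_parquet("pypi_downloads.parquet").filter(
    ibis._["project"] == project
)

# the plotting helpers above work unchanged on the DuckDB-backed table
day_of_week_bar(local_t)
rolling_downloads(local_t)
```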

::: {.panel-tabset}

## PyArrow

```{python}
package = "pyarrow"
t = con.table(
"pypi_downloads_per_day_by_version_by_installer_by_type_by_country"
).filter(ibis._["project"] == package)
```

```{python}
day_of_week_bar(t)
```

```{python}
rolling_downloads(t)
```

```{python}
rolling_downloads_by_version(t, version_style="major")
```

```{python}
group_bar(t, group_by="installer")
```

## chDB

```{python}
package = "chdb"
t = con.table(
"pypi_downloads_per_day_by_version_by_installer_by_type_by_country"
).filter(ibis._["project"] == package)
```

```{python}
day_of_week_bar(t)
```

```{python}
rolling_downloads(t)
```

```{python}
rolling_downloads_by_version(t, version_style="major.minor")
```

```{python}
group_bar(t, group_by="installer")
```

## Ibis

```{python}
package = "ibis-framework"
t = con.table(
"pypi_downloads_per_day_by_version_by_installer_by_type_by_country"
).filter(ibis._["project"] == package)
```

```{python}
day_of_week_bar(t)
```

```{python}
rolling_downloads(t)
```

```{python}
rolling_downloads_by_version(t, version_style="major")
```

```{python}
group_bar(t, group_by="installer")
```

:::

:::

## Shiny for Python application

We can use the code above to build an interactive Shiny for Python application
that serves as a dashboard for better PyPI stats:

::: {.callout-tip}
See [the GitHub repository](https://github.com/ibis-project/better-pypi-stats)
for the most up-to-date code.
:::
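
As a rough sketch of the structure (the names and layout here are illustrative
and not the actual app code -- see the repository above for that), the helper
functions defined earlier can be wrapped in a Shiny app:

```python
# illustrative sketch only -- see the repository for the real application
from shiny import App, ui
from shinywidgets import output_widget, render_widget

app_ui = ui.page_fluid(
    ui.input_text("project", "PyPI project", value="clickhouse-connect"),
    output_widget("rolling"),
)


def server(input, output, session):
    @render_widget
    def rolling():
        # `con`, `ibis._`, and `rolling_downloads` are the objects defined earlier
        t = con.table(
            "pypi_downloads_per_day_by_version_by_installer_by_type_by_country"
        ).filter(ibis._["project"] == input.project())
        return rolling_downloads(t)


app = App(app_ui, server)
```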

{{< video https://youtu.be/jkdWaL8CbK4 >}}

## Reproducing and contributing

The code is [available on
GitHub](https://github.com/ibis-project/better-pypi-stats). Feel free to open an
issue or pull request if you have any suggested improvements.
Binary file added docs/posts/better-pypi-stats/thumbnail.png
104 changes: 104 additions & 0 deletions docs/posts/farewell-pandas/index.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
---
title: Farewell pandas, and thanks for all the fish.
author: Gil Forsyth
date: 2024-08-26
categories:
- blog
- pandas
- community
---

**TL;DR**: we are deprecating the `pandas` and `dask` backends and will remove
them in version 10.0.

There is no feature gap between the `pandas` backend and our default DuckDB
backend, and DuckDB is _much_ more performant. `pandas` DataFrames will still
be available as a _format_ for getting data to and from Ibis; we just won't
support using `pandas` to execute queries.
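
In other words, you can still hand Ibis a `pandas` DataFrame and get one back;
the execution just happens on a real engine (DuckDB by default). A minimal
sketch:

```python
# pandas as an input/output format; DuckDB (the default backend) executes
import ibis
import pandas as pd

df = pd.DataFrame({"species": ["a", "a", "b"], "mass": [1.0, 2.0, 3.0]})

t = ibis.memtable(df)  # wrap the DataFrame in an Ibis table expression
expr = t.group_by("species").agg(mean_mass=t.mass.mean())
result = expr.to_pandas()  # ...and get a pandas DataFrame back out
```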

Most of the rationale below applies to the Dask backend since it has so much in
common with pandas. Dask is a great project and people should continue to use
it outside the Ibis context.

## Why `pandas`? And a bit of Ibis history

Way back in the early days of Ibis, there was only one backend: Impala. Not
everyone used Impala (mindblowing, we know), and so it wasn't too long until the
Postgres backend was added (by the inimitable Phillip Cloud).

These two backends were both featureful, but there was a big problem with adoption:
Want to try out Ibis? You need to install Impala or Postgres first.

Not an insurmountable problem, but a LOT more work than "just `pip install
<newthing>`" -- which prompted the question, how can a prospective Ibis user
take the API for a spin without requiring a DBA or extra infrastructure beyond a
laptop?

The obvious answer (at the time) was to use the only in-memory DataFrame engine
around and wire up a `pandas` backend.

## The agony and the agony

`pandas` was the best option at the time, and it allowed new users to try out
Ibis. But it never fit well into the model of data analysis that Ibis strives
for. The `pandas` backend has more specialized code than any other backend
because it is so fundamentally different from all the other systems Ibis works
with.

### Deferred vs Eager

`pandas` is inherently an eager engine -- every time you hit Enter you are
computing an intermediate result. Ibis uses a deferred execution model, similar
to the one nearly all SQL backends use, which enables query planning and
optimization passes.
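
A rough illustration of the difference (a sketch, assuming the default DuckDB
backend):

```python
import ibis

t = ibis.memtable({"a": [1, 2, 3]})

# building the expression does no work -- it's just a description of a query
expr = t.filter(t.a > 1).a.sum()

# work happens only when you explicitly ask for a result
expr.to_pandas()  # 5
```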

Trying to make a `pandas` interface that behaves in a deferred way is hard.

One of the unfortunate effects of this mismatch is that, unlike our other
backends, the `pandas` backend is often _much_ slower than just using `pandas`
directly.

And to provide this suboptimal experience, we have a few thousand lines of code
that are only used in the `pandas` backend.

### `NaN` vs `NULL`

The choice was made long ago to use `NaN` as the marker for missing values in
`pandas`. This is because NumPy has a notion of `NaN`, but a Python `None`
would lead to an `object` dtype and poor performance.

Practicality beats purity, but this is a horrible decision to have to make.
Ibis _doesn't_ have to make it with any other backend, because NULL indicates a
missing value, and NaN is Not a Number.

Those are fundamentally different ideas and it is an ongoing headache for Ibis
to try to pretend that they aren't.
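
Ibis keeps the two ideas separate at the expression level -- a small sketch:

```python
import ibis

t = ibis.table({"x": "float64"}, name="t")

# two distinct operations for two distinct concepts
t.x.isnull()  # is the value missing (NULL)?
t.x.isnan()   # is the value the floating-point NaN?

# the pandas backend had to pretend these were the same thing;
# SQL backends don't
```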

### Data types

The new Arrow-backed types in `pandas` are a great improvement and we'll leave
it at that.

## Misleading new users

People reach for what is familiar. When you try Ibis for the first time, we're
asking you to do two things at once: a) try Ibis and b) pick a backend. We have
defaults to try to help with this, but it can be confusing at first.

We have many reports from new users that "Ibis is slow". What this almost
always means is that they tried the `pandas` backend (because they know
`pandas`) and they are having a less-than-great time.

If they tried DuckDB or Polars instead, they would have a much easier time
getting things going.

## Feature parity

This is one of the strongest reasons to drop the `pandas` backend -- it is
redundant. The DuckDB backend can seamlessly query pandas DataFrames, supports
several flavors of UDF, and can read and write parquet, CSV, JSON, and other
formats.
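
A quick sketch of what that looks like (the file and column names here are
hypothetical):

```python
import ibis

con = ibis.duckdb.connect()  # in-memory DuckDB

# read common formats directly...
t = con.read_csv("events.csv")  # hypothetical file with a `word` string column

# ...define a UDF in plain Python...
@ibis.udf.scalar.python
def shout(s: str) -> str:
    return s.upper()

# ...and write results back out
con.to_parquet(t.mutate(loud=shout(t.word)), "events.parquet")
```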

There is a reason DuckDB is our default backend: it's easy to install, it runs
locally, it's blazing fast, and it interacts well with the Python ecosystem.
Those are the same reasons we added `pandas` as a backend in the first place --
but now with blazing-fast results and no type-system headaches.
928 changes: 928 additions & 0 deletions docs/posts/ibisml/index.qmd

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/presentations/linkedin-meetup-2024-04-24.qmd
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
---
title: "Ibis @ LinkedIn"
subtitle: "Portable Python DataFrames"
date: "2024-04-24"
author:
- Chloe He
- Phillip Cloud
Expand Down
1 change: 1 addition & 0 deletions docs/presentations/overview/index.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
title: "Ibis: an overview"
author:
- Cody Peterson
date: "2024-07-24"
execute:
echo: true
format:
Expand Down
12 changes: 12 additions & 0 deletions docs/presentations/positconf2024/custom.scss
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
/*-- scss:rules --*/
.reveal div.sourceCode {
font-size: 2.4rem !important;
}

.cell-output-display {
font-size: 2.2rem !important;
display: block;
margin-left: 30%;
margin-right: 25%;
margin-top: 2.5%;
}
Binary file added docs/presentations/positconf2024/fine.jpg
312 changes: 312 additions & 0 deletions docs/presentations/positconf2024/talk.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,312 @@
---
title: "Testing 20 databases on every commit"
author: Phillip Cloud
date: "2024-08-14"
execute:
echo: true
format:
revealjs:
theme: [default, custom.scss]
footer: <https://ibis-project.org/presentations/positconf2024/talk>
---

# Let's all stand!

## Sit if you work with…

::: {.incremental}
- 0 DBs ✅
- 1 DB 😇
- 2 DBs 😬
- 3+ DBs 😱
:::

::: {.fragment}
::: {.r-fit-text}
_I feel your pain._
:::
:::

## Who?

:::: {.columns}

::: {.column width="50%"}
### Me

- Phillip Cloud
- Ibis project
- Voltron Data
- Data tools for 10+ years
:::

::: {.column width="50%"}
### Where

- {{< fa brands github >}} [`@cpcloud`](https://github.com/cpcloud)
- {{< fa brands youtube >}} [Phillip in the Cloud](https://www.youtube.com/@cpcloud)
- {{< fa brands twitter >}} [`@cpcloudy`](https://x.com/cpcloudy)
:::

::::

# Ever needed to test a complex system?

## Maybe this is you

![](../pycon2024/docker-eye-roll.gif){fig-align="center"}

## Or this

![](../pycon2024/wonka.png){fig-align="center"}

## Or maybe even this

![](https://storage.googleapis.com/posit-conf-2024/fine.jpg){fig-align="center"}

# A complex system: Ibis

![](../../logo.svg){fig-align="center" width="50%" height="50%"}

## What's Ibis?

- Python library
- Exploratory data analysis
- Data engineering
- ML preprocessing

::: {.fragment}
::: {.r-fit-text}
_dbplyr, but Python_
:::
:::

## One API, 20+ backends {.smaller .scrollable}

```{python}
#| code-fold: true
#| echo: false
import ibis
ibis.options.interactive = True
t = ibis.examples.penguins.fetch()
t.to_parquet("penguins.parquet")
```

::: {.panel-tabset}

## DuckDB

```{python}
con = ibis.connect("duckdb://")
t = con.read_parquet("penguins.parquet")
t.group_by("species", "island").agg(count=t.count()).order_by("count")
```

## Polars

```{python}
#| code-line-numbers: "1,1"
con = ibis.connect("polars://")
t = con.read_parquet("penguins.parquet")
t.group_by("species", "island").agg(count=t.count()).order_by("count")
```

## DataFusion

```{python}
#| code-line-numbers: "1,1"
con = ibis.connect("datafusion://")
t = con.read_parquet("penguins.parquet")
t.group_by("species", "island").agg(count=t.count()).order_by("count")
```

## PySpark

```{python}
#| code-line-numbers: "1,1"
con = ibis.connect("pyspark://")
t = con.read_parquet("penguins.parquet")
t.group_by("species", "island").agg(count=t.count()).order_by("count")
```

## 16+ other DBs

![](../pycon2024/machine.gif){fig-align="center" width="100%" height="100%"}

:::

# Why is this hard to test?

## By the numbers {.smaller}

:::: {.columns}
::: {.column width="50%"}
### Backends
- **17** SQL
- **3** non-SQL
- **2** cloud
:::

::: {.column width="50%"}
### Engines + APIs
- **9** distributed SQL
- **3** dataframe
- oldest: **~45** years 👀
- youngest: **~2** years
:::
::::

### Other facts

- Latency is variable
- Deployment models vary

::: {.fragment}
::: {.r-fit-text}
_**Feature development**_
:::
:::

## Bit of a pickle

![](../pycon2024/picklerick.png)

# How

## High level

### Goal: fast iteration

- fast env setup (dependency management)
- fast(ish) tests (test-running library)
- high **job** concurrency (ci/provider)
- **easy to run**: dev speed ([`just`](https://github.com/casey/just))

::: {.fragment}
::: {.r-fit-text}
_CI must complete "quickly"_
:::
:::

## Tools: overview

- 📦 **deps**: _poetry_
- 🖥️ **ci**: _GitHub Actions_
- 🦁 **"big" backends**: _docker_
- 🐱 **"small" backends**: _no special treatment (duckdb, polars)_
- 🏃 **tasks**: [`just`](https://github.com/casey/just) (e.g.: `just up postgres`)

## Tools: poetry

- **Env setup must be _fast_**: no constraint solving
- Poetry is one way; there are others
- Get yourself a lockfile
- Downsides?

::: {.fragment}
::: {.r-fit-text}
_Are you doing that **now**?_
:::
:::

## Tools: docker

- Do you use it locally?
- Use health checks; "dumb" ones are fine
- Make it easy for devs to use

![](https://storage.googleapis.com/posit-conf-2024/terminal.png){fig-align="center"}

## Tools: GitHub Actions {.smaller}

- Pay for [the Teams plan](https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits) to get more concurrency
- Automate dependency updates

::: {.columns}
::: {.column width="50%"}
### GHA limits

![](../pycon2024/gha.png)
:::

::: {.column width="50%"}
### Ibis CI cost

![](../pycon2024/bill.png)
:::
:::

# How does this stack up?

## Terminology {auto-animate=true}

::: {.fragment}
Job
: a set of commands

```yaml
my_job:
- run: pip install ibis-framework
- run: just ci-check -m ${{ matrix.backend.name }}
- run: coverage upload
```
:::
::: {.fragment}
Workflow
: A collection of jobs, one `.yml` file

```yaml
name: Backends
my_job:
- run: ...
my_other_job:
- run: ...
```
:::

## Job metrics {auto-animate=true}

![](https://storage.googleapis.com/posit-conf-2024/jobs.svg){fig-align="center"}

::: {.fragment}
::: {.r-fit-text}
_We've added 3 or 4 new backends since the switch_
:::
:::

## Workflow metrics {auto-animate=true}

![Queue time and workflow duration](https://storage.googleapis.com/posit-conf-2024/workflows.svg){fig-align="center"}

## Workflow metrics {auto-animate=true}

![](https://storage.googleapis.com/posit-conf-2024/workflowscorr.svg){fig-align="center"}

## Workflow metrics {auto-animate=true}

![](https://storage.googleapis.com/posit-conf-2024/workflowscorr.svg){fig-align="center"}

- 🟢 Queues + workflows correlated
- 🟡 Queues slow + workflows fast: not enough concurrency
- 🟡 Queues fast + workflows slow: jobs doing too much
- 🔴 Queues slow + workflows slow: hard to say

# Summary

- Testing complex projects is possible
- Use docker for dev **and** prod
- Don't SAT solve in CI
- Track CI run durations, workflow metrics
- Spend time on dev ex

# Questions?

![](../../logo.svg){fig-align="center"}

::: {.r-fit-text}
<https://ibis-project.org>
:::
Binary file added docs/presentations/positconf2024/terminal.png
3 changes: 2 additions & 1 deletion docs/presentations/pycon2024/maintainers.qmd
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
---
title: "Test 20 databases on every commit"
title: "Test 20 databases on every commit @ PyCon 2024"
subtitle: "It's not hyperbole"
date: "2024-05-19"
author:
- Phillip Cloud
execute:
Expand Down
106 changes: 106 additions & 0 deletions docs/release_notes_generated.qmd

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions flake.lock
6 changes: 3 additions & 3 deletions flake.nix
Original file line number Diff line number Diff line change
Expand Up @@ -122,11 +122,11 @@
in
rec {
packages = {
inherit (pkgs) ibisCore310 ibisCore311 ibisCore312 ibisLocal310 ibisLocal311 ibisLocal312;
inherit (pkgs) ibis310 ibis311 ibis312;

default = pkgs.ibisCore312;
default = pkgs.ibis312;

inherit (pkgs) update-lock-files gen-examples check-release-notes-spelling;
inherit (pkgs) update-lock-files check-release-notes-spelling;
};

devShells = rec {
Expand Down
2 changes: 1 addition & 1 deletion ibis/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

from __future__ import annotations

__version__ = "9.3.0"
__version__ = "9.4.0"

import warnings
from typing import Any
Expand Down
171 changes: 111 additions & 60 deletions ibis/backends/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,17 +6,18 @@
import importlib.metadata
import keyword
import re
import sys
import urllib.parse
import weakref
from pathlib import Path
from typing import TYPE_CHECKING, Any, ClassVar
from typing import TYPE_CHECKING, Any, ClassVar, NamedTuple

import ibis
import ibis.common.exceptions as exc
import ibis.config
import ibis.expr.operations as ops
import ibis.expr.types as ir
from ibis import util
from ibis.common.caching import RefCountedCache

if TYPE_CHECKING:
from collections.abc import Iterable, Iterator, Mapping, MutableMapping
Expand All @@ -41,10 +42,9 @@ class TablesAccessor(collections.abc.Mapping):
>>> con = ibis.sqlite.connect("example.db")
>>> people = con.tables["people"] # access via index
>>> people = con.tables.people # access via attribute
"""

def __init__(self, backend: BaseBackend):
def __init__(self, backend: BaseBackend) -> None:
self._backend = backend

def __getitem__(self, name) -> ir.Table:
Expand Down Expand Up @@ -473,6 +473,43 @@ def to_parquet(
for batch in batch_reader:
writer.write_batch(batch)

@util.experimental
def to_parquet_dir(
self,
expr: ir.Table,
directory: str | Path,
*,
params: Mapping[ir.Scalar, Any] | None = None,
**kwargs: Any,
) -> None:
"""Write the results of executing the given expression to a parquet file in a directory.
This method is eager and will execute the associated expression
immediately.
Parameters
----------
expr
The ibis expression to execute and persist to parquet.
directory
The data source. A string or Path to the directory where the parquet file will be written.
params
Mapping of scalar parameter expressions to value.
**kwargs
Additional keyword arguments passed to pyarrow.dataset.write_dataset
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
"""
self._import_pyarrow()
import pyarrow.dataset as ds

# by default write_dataset creates the directory
with expr.to_pyarrow_batches(params=params) as batch_reader:
ds.write_dataset(
batch_reader, base_dir=directory, format="parquet", **kwargs
)

@util.experimental
def to_csv(
self,
Expand Down Expand Up @@ -578,11 +615,6 @@ def list_catalogs(self, like: str | None = None) -> list[str]:
"""

@property
@abc.abstractmethod
def current_catalog(self) -> str:
"""The current catalog in use."""


class CanCreateCatalog(CanListCatalog):
@abc.abstractmethod
Expand Down Expand Up @@ -669,11 +701,6 @@ def list_databases(
"""

@property
@abc.abstractmethod
def current_database(self) -> str:
"""The current database in use."""


class CanCreateDatabase(CanListDatabase):
@abc.abstractmethod
Expand Down Expand Up @@ -751,7 +778,74 @@ def drop_schema(
self.drop_database(name=name, catalog=database, force=force)


class BaseBackend(abc.ABC, _FileIOHandler):
class CacheEntry(NamedTuple):
orig_op: ops.Relation
cached_op_ref: weakref.ref[ops.Relation]
finalizer: weakref.finalize


class CacheHandler:
"""A mixin for handling `.cache()`/`CachedTable` operations."""

def __init__(self):
self._cache_name_to_entry = {}
self._cache_op_to_entry = {}

def _cached_table(self, table: ir.Table) -> ir.CachedTable:
"""Convert a Table to a CachedTable.
Parameters
----------
table
Table expression to cache
Returns
-------
Table
Cached table
"""
entry = self._cache_op_to_entry.get(table.op())
if entry is None or (cached_op := entry.cached_op_ref()) is None:
cached_op = self._create_cached_table(util.gen_name("cached"), table).op()
entry = CacheEntry(
table.op(),
weakref.ref(cached_op),
weakref.finalize(
cached_op, self._finalize_cached_table, cached_op.name
),
)
self._cache_op_to_entry[table.op()] = entry
self._cache_name_to_entry[cached_op.name] = entry
return ir.CachedTable(cached_op)

def _finalize_cached_table(self, name: str) -> None:
"""Release a cached table given its name.
This is a no-op if the cached table is already released.
Parameters
----------
name
The name of the cached table.
"""
if (entry := self._cache_name_to_entry.pop(name, None)) is not None:
self._cache_op_to_entry.pop(entry.orig_op)
entry.finalizer.detach()
try:
self._drop_cached_table(name)
except Exception:
# suppress exceptions during interpreter shutdown
if not sys.is_finalizing():
raise

def _create_cached_table(self, name: str, expr: ir.Table) -> ir.Table:
return self.create_table(name, expr, schema=expr.schema(), temp=True)

def _drop_cached_table(self, name: str) -> None:
self.drop_table(name, force=True)


class BaseBackend(abc.ABC, _FileIOHandler, CacheHandler):
"""Base backend class.
All Ibis backends must subclass this class and implement all the
Expand All @@ -768,12 +862,7 @@ def __init__(self, *args, **kwargs):
self._con_args: tuple[Any] = args
self._con_kwargs: dict[str, Any] = kwargs
self._can_reconnect: bool = True
# expression cache
self._query_cache = RefCountedCache(
populate=self._load_into_cache,
lookup=lambda name: self.table(name).op(),
finalize=self._clean_up_cached_table,
)
super().__init__()

@property
@abc.abstractmethod
Expand Down Expand Up @@ -968,7 +1057,7 @@ def table(
"""

@functools.cached_property
@property
def tables(self):
"""An accessor for tables in the database.
Expand Down Expand Up @@ -1199,44 +1288,6 @@ def has_operation(cls, operation: type[ops.Value]) -> bool:
f"{cls.name} backend has not implemented `has_operation` API"
)

def _cached(self, expr: ir.Table):
"""Cache the provided expression.
All subsequent operations on the returned expression will be performed on the cached data.
Parameters
----------
expr
Table expression to cache
Returns
-------
Expr
Cached table
"""
op = expr.op()
if (result := self._query_cache.get(op)) is None:
result = self._query_cache.store(expr)
return ir.CachedTable(result)

def _release_cached(self, expr: ir.CachedTable) -> None:
"""Releases the provided cached expression.
Parameters
----------
expr
Cached expression to release
"""
self._query_cache.release(expr.op().name)

def _load_into_cache(self, name, expr):
raise NotImplementedError(self.name)

def _clean_up_cached_table(self, name):
raise NotImplementedError(self.name)

def _transpile_sql(self, query: str, *, dialect: str | None = None) -> str:
# only transpile if dialect was passed
if dialect is None:
Expand Down
13 changes: 4 additions & 9 deletions ibis/backends/bigquery/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -156,10 +156,6 @@ class Backend(SQLBackend, CanCreateDatabase, CanCreateSchema):
def __init__(self, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.__session_dataset: bq.DatasetReference | None = None
self._query_cache.lookup = lambda name: self.table(
name,
database=(self._session_dataset.project, self._session_dataset.dataset_id),
).op()

@property
def _session_dataset(self):
Expand Down Expand Up @@ -923,7 +919,7 @@ def create_table(
| pl.LazyFrame
| None = None,
*,
schema: ibis.Schema | None = None,
schema: sch.SchemaLike | None = None,
database: str | None = None,
temp: bool = False,
overwrite: bool = False,
Expand Down Expand Up @@ -972,6 +968,8 @@ def create_table(
"""
if obj is None and schema is None:
raise com.IbisError("One of the `schema` or `obj` parameter is required")
if schema is not None:
schema = ibis.schema(schema)

if isinstance(obj, ir.Table) and schema is not None:
if not schema.equals(obj.schema()):
Expand Down Expand Up @@ -1135,10 +1133,7 @@ def drop_view(
)
self.raw_sql(stmt.sql(self.name))

def _load_into_cache(self, name, expr):
self.create_table(name, expr, schema=expr.schema(), temp=True)

def _clean_up_cached_table(self, name):
def _drop_cached_table(self, name):
self.drop_table(
name,
database=(self._session_dataset.project, self._session_dataset.dataset_id),
Expand Down
6 changes: 3 additions & 3 deletions ibis/backends/bigquery/tests/system/test_client.py
Original file line number Diff line number Diff line change
Expand Up @@ -237,19 +237,19 @@ def test_multiple_project_queries_execute(con):
assert len(result) == 5


def test_string_to_timestamp(con):
def test_string_as_timestamp(con):
timestamp = pd.Timestamp(
datetime.datetime(year=2017, month=2, day=6), tz=pytz.timezone("UTC")
)
expr = ibis.literal("2017-02-06").to_timestamp("%F")
expr = ibis.literal("2017-02-06").as_timestamp("%F")
result = con.execute(expr)
assert result == timestamp

timestamp_tz = pd.Timestamp(
datetime.datetime(year=2017, month=2, day=6, hour=5),
tz=pytz.timezone("UTC"),
)
expr_tz = ibis.literal("2017-02-06 America/New_York").to_timestamp("%F %Z")
expr_tz = ibis.literal("2017-02-06 America/New_York").as_timestamp("%F %Z")
result_tz = con.execute(expr_tz)
assert result_tz == timestamp_tz

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
SELECT
[
approx_quantiles(`t0`.`double_col`, 4 IGNORE NULLS)[1],
approx_quantiles(`t0`.`double_col`, 4 IGNORE NULLS)[2],
approx_quantiles(`t0`.`double_col`, 4 IGNORE NULLS)[3]
] AS `qs`
FROM `functional_alltypes` AS `t0`
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
SELECT
approx_quantiles(`t0`.`double_col`, 4 IGNORE NULLS) AS `qs`
FROM `functional_alltypes` AS `t0`
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
SELECT
approx_quantiles(`t0`.`double_col`, 2 IGNORE NULLS)[1] AS `qs`
FROM `functional_alltypes` AS `t0`
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
SELECT
[
approx_quantiles(`t0`.`double_col`, 4 IGNORE NULLS)[2],
approx_quantiles(`t0`.`double_col`, 4 IGNORE NULLS)[1],
approx_quantiles(`t0`.`double_col`, 4 IGNORE NULLS)[3]
] AS `qs`
FROM `functional_alltypes` AS `t0`
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
SELECT
approx_quantiles(`t0`.`double_col`, 100000 IGNORE NULLS)[33333] AS `qs`
FROM `functional_alltypes` AS `t0`
31 changes: 28 additions & 3 deletions ibis/backends/bigquery/tests/unit/test_compiler.py
Original file line number Diff line number Diff line change
Expand Up @@ -111,7 +111,7 @@ def test_hashbytes(case, how, dtype, snapshot):
),
)
def test_integer_to_timestamp(case, unit, snapshot):
expr = ibis.literal(case, type=dt.int64).to_timestamp(unit=unit).name("tmp")
expr = ibis.literal(case, type=dt.int64).as_timestamp(unit=unit).name("tmp")
snapshot.assert_match(to_sql(expr), "out.sql")


Expand Down Expand Up @@ -424,12 +424,12 @@ def test_identical_to(alltypes, snapshot):


def test_to_timestamp_no_timezone(alltypes, snapshot):
expr = alltypes.date_string_col.to_timestamp("%F")
expr = alltypes.date_string_col.as_timestamp("%F")
snapshot.assert_match(to_sql(expr), "out.sql")


def test_to_timestamp_timezone(alltypes, snapshot):
expr = (alltypes.date_string_col + " America/New_York").to_timestamp("%F %Z")
expr = (alltypes.date_string_col + " America/New_York").as_timestamp("%F %Z")
snapshot.assert_match(to_sql(expr), "out.sql")


Expand Down Expand Up @@ -677,3 +677,28 @@ def test_time_from_hms_with_micros(snapshot):
literal = ibis.literal(datetime.time(12, 34, 56))
result = ibis.to_sql(literal, dialect="bigquery")
snapshot.assert_match(result, "no_micros.sql")


@pytest.mark.parametrize(
"quantiles",
[
param(0.5, id="scalar"),
param(1 / 3, id="tricky-scalar"),
param([0.25, 0.5, 0.75], id="array"),
param([0.5, 0.25, 0.75], id="shuffled-array"),
param([0, 0.25, 0.5, 0.75, 1], id="complete-array"),
],
)
def test_approx_quantiles(alltypes, quantiles, snapshot):
query = alltypes.double_col.approx_quantile(quantiles).name("qs")
result = ibis.to_sql(query, dialect="bigquery")
snapshot.assert_match(result, "out.sql")


def test_unreasonably_long_name():
expr = ibis.literal("hello, world!").name("a" * 301)
with pytest.raises(
com.IbisError,
match="BigQuery does not allow column names longer than 300 characters",
):
ibis.to_sql(expr, dialect="bigquery")
41 changes: 25 additions & 16 deletions ibis/backends/clickhouse/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -308,7 +308,7 @@ def to_pyarrow_batches(
params: Mapping[ir.Scalar, Any] | None = None,
external_tables: Mapping[str, Any] | None = None,
chunk_size: int = 1_000_000,
**_: Any,
**kwargs: Any,
) -> pa.ipc.RecordBatchReader:
"""Execute expression and return an iterator of pyarrow record batches.
Expand All @@ -328,6 +328,8 @@ def to_pyarrow_batches(
External data
chunk_size
Maximum number of row to return in a single chunk
kwargs
Extra arguments passed directly to clickhouse-connect
Returns
-------
Expand Down Expand Up @@ -357,14 +359,17 @@ def to_pyarrow_batches(
external_tables = self._collect_in_memory_tables(expr, external_tables)
external_data = self._normalize_external_tables(external_tables)

def batcher(sql: str, *, schema: pa.Schema) -> Iterator[pa.RecordBatch]:
settings = {}
settings = kwargs.pop("settings", {})

# readonly != 1 means that the server setting is writable
if self.con.server_settings["max_block_size"].readonly != 1:
settings["max_block_size"] = chunk_size
# readonly != 1 means that the server setting is writable
if self.con.server_settings["max_block_size"].readonly != 1:
settings["max_block_size"] = chunk_size

def batcher(
sql: str, *, schema: pa.Schema, settings, **kwargs
) -> Iterator[pa.RecordBatch]:
with self.con.query_column_block_stream(
sql, external_data=external_data, settings=settings
sql, external_data=external_data, settings=settings, **kwargs
) as blocks:
yield from map(
partial(pa.RecordBatch.from_arrays, schema=schema), blocks
Expand All @@ -373,29 +378,34 @@ def batcher(sql: str, *, schema: pa.Schema) -> Iterator[pa.RecordBatch]:
self._log(sql)
schema = table.schema().to_pyarrow()
return pa.ipc.RecordBatchReader.from_batches(
schema, batcher(sql, schema=schema)
schema, batcher(sql, schema=schema, settings=settings, **kwargs)
)

def execute(
self,
expr: ir.Expr,
limit: str | None = "default",
params: Mapping[ir.Scalar, Any] | None = None,
external_tables: Mapping[str, pd.DataFrame] | None = None,
**kwargs: Any,
) -> Any:
"""Execute an expression."""
import pandas as pd

table = expr.as_table()
sql = self.compile(table, limit=limit, **kwargs)
sql = self.compile(table, params=params, limit=limit)

schema = table.schema()
self._log(sql)

external_tables = self._collect_in_memory_tables(expr, external_tables)
external_data = self._normalize_external_tables(external_tables)
df = self.con.query_df(
sql, external_data=external_data, use_na_values=False, use_none=True
sql,
external_data=external_data,
use_na_values=False,
use_none=True,
**kwargs,
)

if df.empty:
Expand Down Expand Up @@ -601,7 +611,7 @@ def create_table(
| pl.LazyFrame
| None = None,
*,
schema: ibis.Schema | None = None,
schema: sch.SchemaLike | None = None,
database: str | None = None,
temp: bool = False,
overwrite: bool = False,
Expand Down Expand Up @@ -656,21 +666,20 @@ def create_table(

if obj is None and schema is None:
raise com.IbisError("The `schema` or `obj` parameter is required")
if schema is not None:
schema = ibis.schema(schema)

if obj is not None and not isinstance(obj, ir.Expr):
obj = ibis.memtable(obj, schema=schema)

if schema is None:
schema = obj.schema()

this = sge.Schema(
this=sg.table(name, db=database),
this=sg.table(name, db=database, quoted=self.compiler.quoted),
expressions=[
sge.ColumnDef(
this=sg.to_identifier(name, quoted=self.compiler.quoted),
kind=self.compiler.type_mapper.from_ibis(typ),
)
for name, typ in schema.items()
for name, typ in (schema or obj.schema()).items()
],
)
properties = [
Expand Down
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
SELECT
CAST("t0"."string_col" AS Nullable(DATETIME)) AS "Cast(string_col, timestamp)"
CAST("t0"."string_col" AS Nullable(DateTime)) AS "Cast(string_col, timestamp)"
FROM "functional_alltypes" AS "t0"
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
SELECT
CAST("t0"."timestamp_col" AS DATETIME) AS "Cast(timestamp_col, !timestamp)"
CAST("t0"."timestamp_col" AS DateTime) AS "Cast(timestamp_col, !timestamp)"
FROM "functional_alltypes" AS "t0"
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
SELECT
CAST("t0"."int_col" AS DATETIME) AS "Cast(int_col, !timestamp)"
CAST("t0"."int_col" AS DateTime) AS "Cast(int_col, !timestamp)"
FROM "functional_alltypes" AS "t0"
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
SELECT
toDate(parseDateTimeBestEffort('2009-05-17T12:34:56')) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56), DAY)"
toStartOfDay(parseDateTimeBestEffort('2009-05-17T12:34:56')) AS "TimestampTruncate(datetime.datetime(2009, 5, 17, 12, 34, 56), DAY)"
2 changes: 1 addition & 1 deletion ibis/backends/clickhouse/tests/test_aggregations.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,7 @@ def test_reduction_invalid_where(alltypes, reduction):
),
(
lambda t, cond: t.int_col.approx_median(),
lambda df, cond: np.int32(df.int_col.median()),
lambda df, cond: df.int_col.median(),
),
(
lambda t, cond: t.double_col.min(),
Expand Down
33 changes: 33 additions & 0 deletions ibis/backends/clickhouse/tests/test_client.py
Original file line number Diff line number Diff line change
Expand Up @@ -409,3 +409,36 @@ def test_subquery_with_join(con):
expr = s.join(w, "a").select(a=w.a).select(b=lambda t: t.a + 1)
result = expr.to_pandas()
assert set(result["b"].tolist()) == {2, 3, 4}


def test_alias_column_ref(con):
data = {"user_id": [1, 2, 3], "account_id": [4, 5, 6]}
t = con.create_table(gen_name("clickhouse_temp_table"), data, temp=True)
expr = t.alias("df").sql("select *, halfMD5(account_id) as id_md5 from df")

result = expr.execute()

assert len(result) == 3

assert result.columns.tolist() == ["user_id", "account_id", "id_md5"]

assert result.user_id.notnull().all()
assert result.account_id.notnull().all()
assert result.id_md5.notnull().all()


@pytest.mark.parametrize("method_name", ["to_pandas", "to_pyarrow"])
def test_query_cache(con, method_name):
t = con.table("functional_alltypes")
expr = t.count()

method = getattr(expr, method_name)

expected = method()
result = method(settings={"use_query_cache": True})

# test a bogus setting
with pytest.raises(ClickHouseDatabaseError):
method(settings={"ooze_query_cash": True})

assert result == expected
2 changes: 1 addition & 1 deletion ibis/backends/clickhouse/tests/test_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -438,7 +438,7 @@ def test_literal_none_to_nullable_column(alltypes):

def test_timestamp_from_integer(con, alltypes, assert_sql):
# timestamp_col has datetime type
expr = alltypes.int_col.to_timestamp()
expr = alltypes.int_col.as_timestamp("s")
assert_sql(expr, "out.sql")
assert len(con.execute(expr))

Expand Down
54 changes: 3 additions & 51 deletions ibis/backends/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,6 @@
from packaging.version import parse as vparse

import ibis
import ibis.common.exceptions as com
from ibis import util
from ibis.backends import (
CanCreateCatalog,
Expand Down Expand Up @@ -151,7 +150,7 @@ def is_older_than(module_name, given_version):
# For now, many of our tests don't do this, and we're working to change this situation
# by improving all tests file by file. All files that have already been improved are
# added to this list to prevent regression.
FIlES_WITH_STRICT_EXCEPTION_CHECK = [
FILES_WITH_STRICT_EXCEPTION_CHECK = [
"ibis/backends/tests/test_api.py",
"ibis/backends/tests/test_array.py",
"ibis/backends/tests/test_aggregation.py",
Expand Down Expand Up @@ -337,7 +336,7 @@ def _filter_none_from_raises(kwargs):
for marker in item.iter_markers(name="notimpl"):
if backend in marker.args[0]:
if (
item.location[0] in FIlES_WITH_STRICT_EXCEPTION_CHECK
item.location[0] in FILES_WITH_STRICT_EXCEPTION_CHECK
and "raises" not in marker.kwargs.keys()
):
raise ValueError("notimpl requires a raises")
Expand All @@ -351,7 +350,7 @@ def _filter_none_from_raises(kwargs):
for marker in item.iter_markers(name="notyet"):
if backend in marker.args[0]:
if (
item.location[0] in FIlES_WITH_STRICT_EXCEPTION_CHECK
item.location[0] in FILES_WITH_STRICT_EXCEPTION_CHECK
and "raises" not in marker.kwargs.keys()
):
raise ValueError("notyet requires a raises")
Expand Down Expand Up @@ -486,12 +485,6 @@ def udf_backend(request, data_dir, tmp_path_factory, worker_id):
return cls.load_data(data_dir, tmp_path_factory, worker_id)


@pytest.fixture(scope="session")
def udf_con(udf_backend):
"""Instance of Client, already connected to the db (if applies)."""
return udf_backend.connection


@pytest.fixture(scope="session")
def alltypes(backend):
return backend.functional_alltypes
Expand All @@ -507,11 +500,6 @@ def struct(backend):
return backend.struct


@pytest.fixture(scope="session")
def sorted_alltypes(alltypes):
return alltypes.order_by("id")


@pytest.fixture(scope="session")
def udf_alltypes(udf_backend):
return udf_backend.functional_alltypes
Expand All @@ -527,11 +515,6 @@ def awards_players(backend):
return backend.awards_players


@pytest.fixture
def analytic_alltypes(alltypes):
return alltypes


@pytest.fixture(scope="session")
def df(alltypes):
return alltypes.execute()
Expand Down Expand Up @@ -562,13 +545,6 @@ def awards_players_df(awards_players):
return awards_players.execute(limit=None)


@pytest.fixture(scope="session")
def geo_df(geo):
if geo is not None:
return geo.execute(limit=None)
return None


@pytest.fixture
def temp_table(con):
"""Return a temporary table name.
Expand Down Expand Up @@ -623,30 +599,6 @@ def temp_view(ddl_con):
ddl_con.drop_view(name, force=True)


@pytest.fixture
def alternate_current_database(ddl_con, ddl_backend):
"""Create a temporary database and yield its name. Drops the created
database upon completion.
Parameters
----------
ddl_con : ibis.backends.Client
Yields
------
str
"""
name = util.gen_name("database")
try:
ddl_con.create_database(name)
except AttributeError:
pytest.skip(f"{ddl_backend.name()} doesn't have a `create_database` method.")
yield name

with contextlib.suppress(com.UnsupportedOperationError):
ddl_con.drop_database(name, force=True)


@pytest.fixture
def assert_sql(con, snapshot):
def checker(expr, file_name="out.sql"):
Expand Down
10 changes: 4 additions & 6 deletions ibis/backends/dask/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,16 +48,14 @@ def do_connect(
>>> ibis.dask.connect(data)
"""
if dictionary is None:
dictionary = {}
super().do_connect(dictionary)

for k, v in dictionary.items():
for k, v in self.dictionary.items():
if not isinstance(v, (dd.DataFrame, pd.DataFrame)):
raise TypeError(
f"Expected an instance of 'dask.dataframe.DataFrame' for {k!r},"
f" got an instance of '{type(v).__name__}' instead."
)
super().do_connect(dictionary)

def disconnect(self) -> None:
pass
Expand Down Expand Up @@ -181,5 +179,5 @@ def _convert_object(self, obj) -> dd.DataFrame:
pandas_df = super()._convert_object(obj)
return dd.from_pandas(pandas_df, npartitions=1)

def _load_into_cache(self, name, expr):
self.create_table(name, self.compile(expr).persist())
def _create_cached_table(self, name, expr):
return self.create_table(name, self.compile(expr).persist())
5 changes: 0 additions & 5 deletions ibis/backends/dask/tests/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -352,11 +352,6 @@ def time_right(client):
return client.table("time_df2")


@pytest.fixture(scope="module")
def time_table(client):
return client.table("time_df3")


@pytest.fixture(scope="module")
def time_keyed_left(client):
return client.table("time_keyed_df1")
Expand Down
25 changes: 0 additions & 25 deletions ibis/backends/dask/tests/test_window.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
from __future__ import annotations

from datetime import date
from operator import methodcaller

import dask.dataframe as dd
Expand Down Expand Up @@ -500,27 +499,3 @@ def count_both(v1, v2):
t[control].count().over(row_window).execute(),
check_names=False,
)


@pytest.fixture
def events(npartitions) -> dd.DataFrame:
df = pd.DataFrame(
{
"event_id": [1] * 4 + [2] * 6 + [3] * 2,
"measured_on": map(
pd.Timestamp,
map(
date,
[2021] * 12,
[6] * 4 + [5] * 6 + [7] * 2,
range(1, 13),
),
),
"measurement": np.nan,
}
)
df.at[1, "measurement"] = 5.0
df.at[4, "measurement"] = 42.0
df.at[5, "measurement"] = 42.0
df.at[7, "measurement"] = 11.0
return dd.from_pandas(df, npartitions=npartitions)
24 changes: 14 additions & 10 deletions ibis/backends/datafusion/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,11 @@
except ImportError:
SessionConfig = None

try:
from datafusion import RuntimeConfig
except ImportError:
RuntimeConfig = None

if TYPE_CHECKING:
import pandas as pd
import polars as pl
Expand Down Expand Up @@ -105,7 +110,12 @@ def do_connect(
).with_information_schema(True)
else:
df_config = None
self.con = SessionContext(df_config)
if RuntimeConfig is None:
self.con = SessionContext(df_config)
else:
# datafusion 40.1.0 has a bug where SessionContext requires
# both SessionConfig and RuntimeConfig be provided.
self.con = SessionContext(df_config, RuntimeConfig())

self._register_builtin_udfs()

Expand Down Expand Up @@ -225,14 +235,6 @@ def raw_sql(self, query: str | sge.Expression) -> Any:
self._log(query)
return self.con.sql(query)

@property
def current_catalog(self) -> str:
raise NotImplementedError()

@property
def current_database(self) -> str:
return NotImplementedError()

def list_catalogs(self, like: str | None = None) -> list[str]:
code = (
sg.select(C.table_catalog)
Expand Down Expand Up @@ -586,7 +588,7 @@ def create_table(
| pl.LazyFrame
| None = None,
*,
schema: sch.Schema | None = None,
schema: sch.SchemaLike | None = None,
database: str | None = None,
temp: bool = False,
overwrite: bool = False,
Expand Down Expand Up @@ -615,6 +617,8 @@ def create_table(
"""
if obj is None and schema is None:
raise ValueError("Either `obj` or `schema` must be specified")
if schema is not None:
schema = ibis.schema(schema)

properties = []

Expand Down
4 changes: 0 additions & 4 deletions ibis/backends/datafusion/tests/test_select.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,10 +7,6 @@
pytest.importorskip("datafusion")


@pytest.mark.xfail_version(
datafusion=["datafusion==28.0.0"],
reason="datafusion panics with with the float_col * 2 filter",
)
def test_where_multiple_conditions(alltypes, alltypes_df):
expr = alltypes.filter(
[
Expand Down
5 changes: 0 additions & 5 deletions ibis/backends/datafusion/tests/test_udf.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,6 @@

import pandas.testing as tm
import pytest
from packaging.version import parse as vparse

import ibis.expr.datatypes as dt
import ibis.expr.types as ir
Expand Down Expand Up @@ -69,10 +68,6 @@ def median(a: float) -> float:
assert result == con.tables.batting.G.execute().median()


@pytest.mark.xfail(
condition=vparse(datafusion.__version__) >= vparse("38.0.1"),
reason="internal error about MEDIAN(G) naming",
)
def test_builtin_agg_udf_filtered(con):
@udf.agg.builtin
def median(a: float, where: bool = True) -> float:
Expand Down
8 changes: 1 addition & 7 deletions ibis/backends/druid/tests/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,6 @@
from itertools import chain, repeat
from typing import TYPE_CHECKING, Any

import pytest
from requests import Session

import ibis
Expand Down Expand Up @@ -116,7 +115,7 @@ def functional_alltypes(self) -> ir.Table:
# tool that calls itself a time series database or "good for
# working with time series", that lacks a first-class timestamp
# type.
timestamp_col=t.timestamp_col.to_timestamp(unit="ms"),
timestamp_col=t.timestamp_col.as_timestamp(unit="ms"),
)

@property
Expand Down Expand Up @@ -148,8 +147,3 @@ def _load_data(self, **_: Any) -> None:
@staticmethod
def connect(*, tmpdir, worker_id, **kw):
return ibis.connect(DRUID_URL, **kw)


@pytest.fixture(scope="session")
def con(data_dir, tmp_path_factory, worker_id):
return TestConf.load_data(data_dir, tmp_path_factory, worker_id).connection
24 changes: 12 additions & 12 deletions ibis/backends/duckdb/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@
from ibis.backends import CanCreateDatabase, CanCreateSchema, UrlFromPath
from ibis.backends.duckdb.converter import DuckDBPandasData
from ibis.backends.sql import SQLBackend
from ibis.backends.sql.compilers.base import STAR, C
from ibis.backends.sql.compilers.base import STAR, AlterTable, C
from ibis.common.dispatch import lazy_singledispatch
from ibis.expr.operations.udf import InputType
from ibis.util import deprecated
Expand All @@ -41,6 +41,8 @@
import torch
from fsspec import AbstractFileSystem

from ibis.expr.schema import SchemaLike


_UDF_INPUT_TYPE_MAPPING = {
InputType.PYARROW: duckdb.functional.ARROW,
Expand Down Expand Up @@ -103,7 +105,7 @@ def create_table(
| pl.LazyFrame
| None = None,
*,
schema: ibis.Schema | None = None,
schema: SchemaLike | None = None,
database: str | None = None,
temp: bool = False,
overwrite: bool = False,
Expand Down Expand Up @@ -147,6 +149,8 @@ def create_table(

if obj is None and schema is None:
raise ValueError("Either `obj` or `schema` must be specified")
if schema is not None:
schema = ibis.schema(schema)

properties = []

Expand Down Expand Up @@ -187,10 +191,8 @@ def create_table(
else:
temp_name = name

initial_table = sge.Table(
this=sg.to_identifier(temp_name, quoted=self.compiler.quoted),
catalog=catalog,
db=database,
initial_table = sg.table(
temp_name, catalog=catalog, db=database, quoted=self.compiler.quoted
)
target = sge.Schema(this=initial_table, expressions=column_defs)

Expand All @@ -201,10 +203,8 @@ def create_table(
)

# This is the same table as initial_table unless overwrite == True
final_table = sge.Table(
this=sg.to_identifier(name, quoted=self.compiler.quoted),
catalog=catalog,
db=database,
final_table = sg.table(
name, catalog=catalog, db=database, quoted=self.compiler.quoted
)
with self._safe_raw_sql(create_stmt) as cur:
if query is not None:
Expand Down Expand Up @@ -236,7 +236,7 @@ def create_table(
)
else:
cur.execute(
sge.AlterTable(
AlterTable(
this=initial_table,
actions=[sge.RenameTable(this=final_table)],
).sql(self.name)
Expand Down Expand Up @@ -894,7 +894,7 @@ def _read_parquet_pyarrow_dataset(
# explicitly.

@util.deprecated(
instead="Pass in-memory data to `create_table` instead.",
instead="Pass in-memory data to `memtable` instead.",
as_of="9.1",
removed_in="10.0",
)
Expand Down
9 changes: 5 additions & 4 deletions ibis/backends/duckdb/tests/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -112,16 +112,17 @@ def connect(*, tmpdir, worker_id, **kw) -> BaseBackend:
def _load_tpc(self, *, suite, scale_factor):
con = self.connection
schema = f"tpc{suite}"
con.con.execute(f"CREATE OR REPLACE SCHEMA {schema}")
con.create_database(schema, force=True)
parquet_dir = self.data_dir.joinpath(schema, f"sf={scale_factor}", "parquet")
assert parquet_dir.exists(), parquet_dir
for path in parquet_dir.glob("*.parquet"):
table_name = path.with_suffix("").name
# duckdb automatically infers the sf= as a hive partition so we
# need to disable it
con.con.execute(
f"CREATE OR REPLACE VIEW {schema}.{table_name} AS "
f"FROM read_parquet({str(path)!r}, hive_partitioning=false)"
con.create_table(
table_name,
con.read_parquet(path, hive_partitioning=False),
database=schema,
)

def _transform_tpc_sql(self, parsed, *, suite, leaves):
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
import ibis


lineitem = ibis.table(
name="lineitem",
schema={
"l_orderkey": "int32",
"l_partkey": "int32",
"l_suppkey": "int32",
"l_linenumber": "int32",
"l_quantity": "decimal(15, 2)",
"l_extendedprice": "decimal(15, 2)",
"l_discount": "decimal(15, 2)",
"l_tax": "decimal(15, 2)",
"l_returnflag": "string",
"l_linestatus": "string",
"l_shipdate": "date",
"l_commitdate": "date",
"l_receiptdate": "date",
"l_shipinstruct": "string",
"l_shipmode": "string",
"l_comment": "string",
},
)
lit = ibis.literal(1)
f = lineitem.filter((lineitem.l_shipdate <= ibis.literal("1998-09-02").cast("date")))
multiply = f.l_extendedprice * ((lit - f.l_discount))
agg = f.aggregate(
[
f.l_quantity.sum().name("sum_qty"),
f.l_extendedprice.sum().name("sum_base_price"),
multiply.sum().name("sum_disc_price"),
((multiply) * ((lit + f.l_tax))).sum().name("sum_charge"),
f.l_quantity.mean().name("avg_qty"),
f.l_extendedprice.mean().name("avg_price"),
f.l_discount.mean().name("avg_disc"),
f.count().name("count_order"),
],
by=[f.l_returnflag, f.l_linestatus],
)

result = agg.order_by(
agg.l_returnflag.asc(nulls_first=True), agg.l_linestatus.asc(nulls_first=True)
)
Original file line number Diff line number Diff line change
@@ -0,0 +1,106 @@
import ibis


customer = ibis.table(
name="customer",
schema={
"c_custkey": "int64",
"c_name": "string",
"c_address": "string",
"c_nationkey": "int16",
"c_phone": "string",
"c_acctbal": "decimal",
"c_mktsegment": "string",
"c_comment": "string",
},
)
lit = ibis.literal(True)
orders = ibis.table(
name="orders",
schema={
"o_orderkey": "int64",
"o_custkey": "int64",
"o_orderstatus": "string",
"o_totalprice": "decimal(12, 2)",
"o_orderdate": "date",
"o_orderpriority": "string",
"o_clerk": "string",
"o_shippriority": "int32",
"o_comment": "string",
},
)
lineitem = ibis.table(
name="lineitem",
schema={
"l_orderkey": "int32",
"l_partkey": "int32",
"l_suppkey": "int32",
"l_linenumber": "int32",
"l_quantity": "decimal(15, 2)",
"l_extendedprice": "decimal(15, 2)",
"l_discount": "decimal(15, 2)",
"l_tax": "decimal(15, 2)",
"l_returnflag": "string",
"l_linestatus": "string",
"l_shipdate": "date",
"l_commitdate": "date",
"l_receiptdate": "date",
"l_shipinstruct": "string",
"l_shipmode": "string",
"l_comment": "string",
},
)
cast = ibis.literal("1995-03-15").cast("date")
joinchain = (
customer.inner_join(
orders,
[(customer.c_custkey == orders.o_custkey), lit, (orders.o_orderdate < cast)],
)
.inner_join(
lineitem,
[(orders.o_orderkey == lineitem.l_orderkey), lit, (lineitem.l_shipdate > cast)],
)
.select(
customer.c_custkey,
customer.c_name,
customer.c_address,
customer.c_nationkey,
customer.c_phone,
customer.c_acctbal,
customer.c_mktsegment,
customer.c_comment,
orders.o_orderkey,
orders.o_custkey,
orders.o_orderstatus,
orders.o_totalprice,
orders.o_orderdate,
orders.o_orderpriority,
orders.o_clerk,
orders.o_shippriority,
orders.o_comment,
lineitem.l_orderkey,
lineitem.l_partkey,
lineitem.l_suppkey,
lineitem.l_linenumber,
lineitem.l_quantity,
lineitem.l_extendedprice,
lineitem.l_discount,
lineitem.l_tax,
lineitem.l_returnflag,
lineitem.l_linestatus,
lineitem.l_shipdate,
lineitem.l_commitdate,
lineitem.l_receiptdate,
lineitem.l_shipinstruct,
lineitem.l_shipmode,
lineitem.l_comment,
)
)
f = joinchain.filter((joinchain.c_mktsegment == "BUILDING"))
agg = f.aggregate(
[(f.l_extendedprice * ((1 - f.l_discount))).sum().name("revenue")],
by=[f.l_orderkey, f.o_orderdate, f.o_shippriority],
)
s = agg.order_by(agg.revenue.desc(), agg.o_orderdate.asc(nulls_first=True))

result = s.select(s.l_orderkey, s.revenue, s.o_orderdate, s.o_shippriority).limit(10)
14 changes: 14 additions & 0 deletions ibis/backends/duckdb/tests/test_client.py
Original file line number Diff line number Diff line change
Expand Up @@ -326,6 +326,20 @@ def test_connect_named_in_memory_db():
assert "ork" not in default_memory_db.list_tables()


@pytest.mark.parametrize(
"database_file",
[
"with spaces.ddb",
"space catalog.duckdb.db",
],
)
def test_create_table_quoting(database_file, tmp_path):
conn = ibis.duckdb.connect(tmp_path / database_file)
t = conn.create_table("t", {"a": [0, 1, 2]})
result = set(conn.execute(t.a))
assert result == {0, 1, 2}


@pytest.mark.parametrize(
("url", "method_name"),
[
Expand Down
1 change: 1 addition & 0 deletions ibis/backends/duckdb/tests/test_datatypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@
("UUID", dt.uuid),
("VARCHAR", dt.string),
("INTEGER[]", dt.Array(dt.int32)),
("INTEGER[3]", dt.Array(dt.int32)),
("MAP(VARCHAR, BIGINT)", dt.Map(dt.string, dt.int64)),
(
"STRUCT(a INTEGER, b VARCHAR, c MAP(VARCHAR, DOUBLE[])[])",
Expand Down
101 changes: 101 additions & 0 deletions ibis/backends/duckdb/tests/test_decompile_tpch.py
@@ -0,0 +1,101 @@
from __future__ import annotations

import importlib
from contextlib import contextmanager
from pathlib import Path

import pytest
from pytest import param

import ibis
from ibis.backends.tests.tpc.conftest import compare_tpc_results
from ibis.formats.pandas import PandasData

tpch_catalog = {
"lineitem": {
"l_orderkey": "int32",
"l_partkey": "int32",
"l_suppkey": "int32",
"l_linenumber": "int32",
"l_quantity": "decimal(15, 2)",
"l_extendedprice": "decimal(15, 2)",
"l_discount": "decimal(15, 2)",
"l_tax": "decimal(15, 2)",
"l_returnflag": "string",
"l_linestatus": "string",
"l_shipdate": "date",
"l_commitdate": "date",
"l_receiptdate": "date",
"l_shipinstruct": "string",
"l_shipmode": "string",
"l_comment": "string",
},
"customer": [
("c_custkey", "int64"),
("c_name", "string"),
("c_address", "string"),
("c_nationkey", "int16"),
("c_phone", "string"),
("c_acctbal", "decimal"),
("c_mktsegment", "string"),
("c_comment", "string"),
],
"orders": [
("o_orderkey", "int64"),
("o_custkey", "int64"),
("o_orderstatus", "string"),
("o_totalprice", "decimal(12,2)"),
("o_orderdate", "date"),
("o_orderpriority", "string"),
("o_clerk", "string"),
("o_shippriority", "int32"),
("o_comment", "string"),
],
}

root = Path(__file__).absolute().parents[3]

SQL_QUERY_PATH = root / "backends" / "tests" / "tpc" / "queries" / "duckdb" / "h"


@contextmanager
def set_database(con, db):
olddb = con.current_database
con.raw_sql(f"USE {db}")
yield
con.raw_sql(f"USE {olddb}")


@pytest.mark.parametrize(
"tpch_query",
[
param(1, id="tpch01"),
param(3, id="tpch03"),
],
)
def test_parse_sql_tpch(tpch_query, snapshot, con, data_dir):
tpch_query_file = SQL_QUERY_PATH / f"{tpch_query:02d}.sql"
with open(tpch_query_file) as f:
sql = f.read()

expr = ibis.parse_sql(sql, tpch_catalog)
code = ibis.decompile(expr, format=True)
snapshot.assert_match(code, "out_tpch.py")

# Import just-created snapshot
SNAPSHOT_MODULE = f"ibis.backends.duckdb.tests.snapshots.test_decompile_tpch.test_parse_sql_tpch.tpch{tpch_query:02d}.out_tpch"
module = importlib.import_module(SNAPSHOT_MODULE)

with set_database(con, "tpch"):
# Get results from executing SQL directly on DuckDB
expected_df = con.con.execute(sql).df()
# Get results from decompiled ibis query
result_df = con.to_pandas(module.result)

# Then set the expected columns so we can coerce the datatypes
# of the pandas dataframe correctly
expected_df.columns = result_df.columns

expected_df = PandasData.convert_table(expected_df, module.result.schema())

compare_tpc_results(result_df, expected_df)
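The core round-trip the test exercises can be reproduced on a toy query; the table name and catalog below are made up for illustration:

```python
import ibis

catalog = {"t": {"a": "int64", "b": "string"}}
sql = "SELECT b, SUM(a) AS total FROM t GROUP BY b"

expr = ibis.parse_sql(sql, catalog)        # SQL -> ibis expression
code = ibis.decompile(expr, format=True)   # expression -> Python source
print(code)
```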
10 changes: 5 additions & 5 deletions ibis/backends/duckdb/tests/test_register.py
@@ -321,12 +321,12 @@ def test_re_read_in_memory_overwrite(con):
df_pandas_1 = pd.DataFrame({"a": ["a"], "b": [1], "d": ["hi"]})
df_pandas_2 = pd.DataFrame({"a": [1], "c": [1.4]})

with pytest.warns(FutureWarning, match="create_table"):
with pytest.warns(FutureWarning, match="memtable"):
table = con.read_in_memory(df_pandas_1, table_name="df")
assert len(table.columns) == 3
assert table.schema() == ibis.schema([("a", "str"), ("b", "int"), ("d", "str")])

with pytest.warns(FutureWarning, match="create_table"):
with pytest.warns(FutureWarning, match="memtable"):
table = con.read_in_memory(df_pandas_2, table_name="df")
assert len(table.columns) == 2
assert table.schema() == ibis.schema([("a", "int"), ("c", "float")])
@@ -417,7 +417,7 @@ def test_s3_403_fallback(con, httpserver, monkeypatch):

def test_register_numpy_str(con):
data = pd.DataFrame({"a": [np.str_("xyz"), None]})
with pytest.warns(FutureWarning, match="create_table"):
with pytest.warns(FutureWarning, match="memtable"):
result = con.read_in_memory(data)
tm.assert_frame_equal(result.execute(), data)

@@ -431,7 +431,7 @@ def test_register_recordbatchreader_warns(con):
)
reader = table.to_reader()
sol = table.to_pandas()
with pytest.warns(FutureWarning, match="create_table"):
with pytest.warns(FutureWarning, match="memtable"):
t = con.read_in_memory(reader)

# First execute is fine
@@ -444,7 +444,7 @@ def test_register_recordbatchreader_warns(con):

# Re-registering over the name with a new reader is fine
reader = table.to_reader()
with pytest.warns(FutureWarning, match="create_table"):
with pytest.warns(FutureWarning, match="memtable"):
t = con.read_in_memory(reader, table_name=t.get_name())
res = t.execute()
tm.assert_frame_equal(res, sol)
4 changes: 3 additions & 1 deletion ibis/backends/exasol/__init__.py
@@ -314,7 +314,7 @@ def create_table(
| pl.LazyFrame
| None = None,
*,
schema: sch.Schema | None = None,
schema: sch.SchemaLike | None = None,
database: str | None = None,
overwrite: bool = False,
temp: bool = False,
@@ -342,6 +342,8 @@
"""
if obj is None and schema is None:
raise ValueError("Either `obj` or `schema` must be specified")
if schema is not None:
schema = ibis.schema(schema)

if temp:
raise com.UnsupportedOperationError(
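The `SchemaLike` annotation plus the `ibis.schema(schema)` normalization means `create_table` now accepts anything `ibis.schema` understands, not just a `Schema` instance. A hypothetical call (the connection and table name are assumptions):

```python
# `con` is an assumed Exasol connection; a plain dict is coerced via ibis.schema().
t = con.create_table("events", schema={"id": "int64", "name": "string"})
```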
9 changes: 9 additions & 0 deletions ibis/backends/exasol/converter.py
@@ -13,6 +13,15 @@ def convert_String(cls, s, dtype, pandas_type):
else:
return s

@classmethod
def convert_Int64(cls, s, dtype, pandas_dtype):
if s.dtype == "object":
# Exasol returns BIGINT types as strings (or None for NULL).
# s.astype("int64") would fail in that case, so `Series.map`
# is the best we can do.
return s.map(int, na_action="ignore")
return s if s.dtype == pandas_dtype else s.astype(pandas_dtype)

@classmethod
def convert_Interval(cls, s, dtype, pandas_dtype):
def parse_timedelta(value):
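A standalone pandas illustration (synthetic data, not part of the backend) of why `convert_Int64` maps with `int` instead of calling `astype`:

```python
import pandas as pd

# Exasol can return a BIGINT column as object-dtype strings with None for NULLs.
s = pd.Series(["1", "2", None], dtype="object")

# s.astype("int64") raises on the None; map(int) skips nulls and converts the rest.
converted = s.map(int, na_action="ignore")
print(converted)
```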
5 changes: 0 additions & 5 deletions ibis/backends/flink/tests/conftest.py
@@ -174,11 +174,6 @@ def functional_alltypes_schema():
return TEST_TABLES["functional_alltypes"]


@pytest.fixture
def batting_schema():
return TEST_TABLES["batting"]


@pytest.fixture
def csv_source_configs():
def generate_csv_configs(csv_file):
2 changes: 1 addition & 1 deletion ibis/backends/flink/tests/test_compiler.py
@@ -22,7 +22,7 @@ def test_count_star(simple_table, assert_sql):
],
)
def test_timestamp_from_unix(simple_table, unit, assert_sql):
expr = simple_table.d.to_timestamp(unit=unit)
expr = simple_table.d.as_timestamp(unit=unit)
assert_sql(expr)


11 changes: 9 additions & 2 deletions ibis/backends/impala/__init__.py
@@ -463,14 +463,15 @@ def create_table(
| pl.LazyFrame
| None = None,
*,
schema=None,
schema: sch.SchemaLike | None = None,
database=None,
temp: bool | None = None,
overwrite: bool = False,
external: bool = False,
format="parquet",
location=None,
partition=None,
tbl_properties: Mapping[str, Any] | None = None,
like_parquet=None,
) -> ir.Table:
"""Create a new table using an Ibis table expression or in-memory data.
@@ -501,12 +502,16 @@
partition
Must pass a schema to use this. Cannot partition from an
expression.
tbl_properties
Table properties to set on table creation.
like_parquet
Can specify instead of a schema
"""
if obj is None and schema is None:
raise com.IbisError("The schema or obj parameter is required")
if schema is not None:
schema = ibis.schema(schema)

if temp is not None:
raise NotImplementedError(
@@ -534,6 +539,7 @@
format=format,
external=True if location is not None else external,
partition=partition,
tbl_properties=tbl_properties,
path=location,
)
)
@@ -543,12 +549,13 @@
self._safe_exec_sql(
CreateTableWithSchema(
name,
schema if schema is not None else obj.schema(),
schema or obj.schema(),
database=database or self.current_database,
format=format,
external=external,
path=location,
partition=partition,
tbl_properties=tbl_properties,
)
)
return self.table(name, database=database or self.current_database)
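A hypothetical use of the new keyword, mirroring the partition tests further down (the connection and table name are assumptions):

```python
import ibis

schema = ibis.schema({"foo": "string", "year": "int32", "month": "int16"})

# `con` is an assumed Impala connection; tbl_properties are emitted as
# a TBLPROPERTIES clause in the generated DDL.
con.create_table(
    "events",
    schema=schema,
    partition=["year", "month"],
    tbl_properties={"transactional": "false"},
)
```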
11 changes: 11 additions & 0 deletions ibis/backends/impala/ddl.py
@@ -105,6 +105,13 @@ def _create_line(self):
def _location(self):
return f"LOCATION '{self.path}'" if self.path else None

def _tbl_properties(self):
return (
self.format_tblproperties(self.tbl_properties)
if self.tbl_properties
else None
)

def _storage(self):
# By the time we're here, we have a valid format
return f"STORED AS {self.format}"
@@ -152,6 +159,7 @@ def _pieces(self):
yield self._storage()

yield self._location()
yield self._tbl_properties()


class AlterTable(ImpalaBase, DDL):
@@ -258,6 +266,7 @@ def __init__(
can_exist=False,
path=None,
partition=None,
tbl_properties=None,
):
super().__init__(
table_name,
@@ -267,6 +276,7 @@
can_exist=can_exist,
path=path,
partition=partition,
tbl_properties=tbl_properties,
)
self.select = select

@@ -275,6 +285,7 @@ def _pieces(self):
yield self._partitioned_by()
yield self._storage()
yield self._location()
yield self._tbl_properties()
yield "AS"
yield self.select

7 changes: 0 additions & 7 deletions ibis/backends/impala/tests/test_client.py
@@ -192,13 +192,6 @@ def test_attr_name_conflict(temp_parquet_table, temp_parquet_table2):
assert left.join(right, ["id", "files"]) is not None


@pytest.fixture
def con2(env):
return ibis.impala.connect(
host=env.impala_host, port=env.impala_port, auth_mechanism=env.auth_mechanism
)


def test_day_of_week(con):
date_var = ibis.literal(datetime.date(2017, 1, 1), type=dt.date)
expr_index = date_var.day_of_week.index()
14 changes: 7 additions & 7 deletions ibis/backends/impala/tests/test_exprs.py
@@ -55,10 +55,10 @@ def test_builtins(con, alltypes):
i1.fill_null(0),
i4.fill_null(0),
i8.fill_null(0),
i4.to_timestamp("s"),
i4.to_timestamp("ms"),
i4.to_timestamp("us"),
i8.to_timestamp(),
i4.as_timestamp("s"),
i4.as_timestamp("ms"),
i4.as_timestamp("us"),
i8.as_timestamp("s"),
d.abs(),
d.cast("decimal(12, 2)"),
d.cast("int32"),
@@ -191,9 +191,9 @@ def test_column_types(alltypes_df, col, expected):
@pytest.mark.parametrize(
("expr", "expected"),
[
(L(50000).to_timestamp("s"), pd.to_datetime(50000, unit="s")),
(L(50000).to_timestamp("ms"), pd.to_datetime(50000, unit="ms")),
(L(5 * 10**8).to_timestamp(), pd.to_datetime(5 * 10**8, unit="s")),
(L(50000).as_timestamp("s"), pd.to_datetime(50000, unit="s")),
(L(50000).as_timestamp("ms"), pd.to_datetime(50000, unit="ms")),
(L(5 * 10**8).as_timestamp("s"), pd.to_datetime(5 * 10**8, unit="s")),
(
ibis.timestamp("2009-05-17 12:34:56").truncate("y"),
pd.Timestamp("2009-01-01"),
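The `to_timestamp` to `as_timestamp` rename appears throughout these tests; a minimal sketch of the renamed API on an integer literal:

```python
import ibis

# Interpret an integer as seconds since the Unix epoch.
expr = ibis.literal(50_000).as_timestamp("s")
print(expr)
```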
25 changes: 18 additions & 7 deletions ibis/backends/impala/tests/test_partition.py
@@ -5,7 +5,6 @@
import pandas as pd
import pandas.testing as tm
import pytest
from impala.error import HiveServer2Error

import ibis
from ibis import util
@@ -142,10 +141,14 @@ def test_create_partitioned_table_from_expr(con, alltypes, tmp_parted):
tm.assert_frame_equal(result, expected)


@pytest.mark.xfail(raises=HiveServer2Error)
def test_add_drop_partition_no_location(con, temp_table):
schema = ibis.schema([("foo", "string"), ("year", "int32"), ("month", "int16")])
con.create_table(temp_table, schema=schema, partition=["year", "month"])
con.create_table(
temp_table,
schema=schema,
partition=["year", "month"],
tbl_properties={"transactional": "false"},
)
table = con.table(temp_table)

part = {"year": 2007, "month": 4}
Expand All @@ -159,10 +162,14 @@ def test_add_drop_partition_no_location(con, temp_table):
assert len(table.partitions()) == 1


@pytest.mark.xfail(raises=HiveServer2Error)
def test_add_drop_partition_owned_by_impala(con, temp_table):
schema = ibis.schema([("foo", "string"), ("year", "int32"), ("month", "int16")])
con.create_table(temp_table, schema=schema, partition=["year", "month"])
con.create_table(
temp_table,
schema=schema,
partition=["year", "month"],
tbl_properties={"transactional": "false"},
)

table = con.table(temp_table)

Expand All @@ -181,10 +188,14 @@ def test_add_drop_partition_owned_by_impala(con, temp_table):
assert len(table.partitions()) == 1


@pytest.mark.xfail(raises=HiveServer2Error)
def test_add_drop_partition_hive_bug(con, temp_table):
schema = ibis.schema([("foo", "string"), ("year", "int32"), ("month", "int16")])
con.create_table(temp_table, schema=schema, partition=["year", "month"])
con.create_table(
temp_table,
schema=schema,
partition=["year", "month"],
tbl_properties={"transactional": "false"},
)

table = con.table(temp_table)

6 changes: 3 additions & 3 deletions ibis/backends/impala/tests/test_value_exprs.py
@@ -235,9 +235,9 @@ def test_timestamp_day_of_week(method_name, snapshot):
@pytest.mark.parametrize(
"expr_fn",
[
lambda col: col.to_timestamp(),
lambda col: col.to_timestamp("ms"),
lambda col: col.to_timestamp("us"),
lambda col: col.as_timestamp("s"),
lambda col: col.as_timestamp("ms"),
lambda col: col.as_timestamp("us"),
],
ids=["default", "ms", "us"],
)
80 changes: 59 additions & 21 deletions ibis/backends/mssql/__init__.py
@@ -8,6 +8,7 @@
from contextlib import closing
from operator import itemgetter
from typing import TYPE_CHECKING, Any
from urllib.parse import unquote_plus

import pyodbc
import sqlglot as sg
@@ -21,12 +22,13 @@
import ibis.expr.schema as sch
import ibis.expr.types as ir
from ibis import util
from ibis.backends import CanCreateCatalog, CanCreateDatabase, CanCreateSchema, NoUrl
from ibis.backends import CanCreateCatalog, CanCreateDatabase, CanCreateSchema
from ibis.backends.sql import SQLBackend
from ibis.backends.sql.compilers.base import STAR, C

if TYPE_CHECKING:
from collections.abc import Iterable, Mapping
from urllib.parse import ParseResult

import pandas as pd
import polars as pl
@@ -73,7 +75,7 @@ def datetimeoffset_to_datetime(value):
# Databases: sys.schemas


class Backend(SQLBackend, CanCreateCatalog, CanCreateDatabase, CanCreateSchema, NoUrl):
class Backend(SQLBackend, CanCreateCatalog, CanCreateDatabase, CanCreateSchema):
name = "mssql"
compiler = sc.mssql.compiler
supports_create_or_replace = False
@@ -169,6 +171,40 @@ def _post_connect(self):
with closing(self.con.cursor()) as cur:
cur.execute("SET DATEFIRST 1")

def _from_url(self, url: ParseResult, **kwargs):
database, *_ = url.path[1:].split("/", 1)
kwargs.update(
{
"user": url.username,
"password": unquote_plus(url.password or ""),
"host": url.hostname,
"database": database or "",
"port": url.port or None,
}
)

self._convert_kwargs(kwargs)

if "host" in kwargs and not kwargs["host"]:
del kwargs["host"]

if "user" in kwargs and not kwargs["user"]:
del kwargs["user"]

if "password" in kwargs and kwargs["password"] is None:
del kwargs["password"]

if "port" in kwargs and kwargs["port"] is None:
del kwargs["port"]

if "database" in kwargs and not kwargs["database"]:
del kwargs["database"]

if "driver" in kwargs and not kwargs["driver"]:
del kwargs["driver"]

return self.connect(**kwargs)

def get_schema(
self, name: str, *, catalog: str | None = None, database: str | None = None
) -> sch.Schema:
@@ -244,24 +280,21 @@ def _get_schema_using_query(self, query: str) -> sch.Schema:
# us to pre-filter the columns we want back.
# The syntax is:
# `sys.dm_exec_describe_first_result_set(@tsql, @params, @include_browse_information)`
query = f"""SELECT name,
is_nullable AS nullable,
system_type_name,
precision,
scale
FROM
sys.dm_exec_describe_first_result_set({tsql}, NULL, 0)"""
query = f"""
SELECT
name,
is_nullable,
system_type_name,
precision,
scale
FROM sys.dm_exec_describe_first_result_set({tsql}, NULL, 0)
ORDER BY column_ordinal
"""
with self._safe_raw_sql(query) as cur:
rows = cur.fetchall()

schema = {}
for (
name,
nullable,
system_type_name,
precision,
scale,
) in sorted(rows, key=itemgetter(1)):
for name, nullable, system_type_name, precision, scale in rows:
newtyp = self.compiler.type_mapper.from_string(
system_type_name, nullable=nullable
)
@@ -531,7 +564,7 @@ def create_table(
| pl.LazyFrame
| None = None,
*,
schema: sch.Schema | None = None,
schema: sch.SchemaLike | None = None,
database: str | None = None,
temp: bool = False,
overwrite: bool = False,
@@ -572,6 +605,8 @@
"""
if obj is None and schema is None:
raise ValueError("Either `obj` or `schema` must be specified")
if schema is not None:
schema = ibis.schema(schema)

if temp and overwrite:
raise ValueError(
@@ -657,15 +692,18 @@ def create_table(
new = raw_this.sql(self.dialect)
cur.execute(f"EXEC sp_rename '{old}', '{new}'")

if temp:
# If a temporary table, amend the output name/catalog/db accordingly
name = "##" + name
catalog = "tempdb"
db = "dbo"

if schema is None:
# Clean up temporary memtable if we've created one
# for in-memory reads
if temp_memtable_view is not None:
self.drop_table(temp_memtable_view)
return self.table(
"##" * temp + name,
database=("tempdb" * temp or catalog, "dbo" * temp or db),
)
return self.table(name, database=(catalog, db))

# preserve the input schema if it was provided
return ops.DatabaseTable(
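With `_from_url` in place, the MSSQL backend can be reached through `ibis.connect` with a URL; host, credentials, database, and driver below are placeholders, and extra query parameters such as `driver` are forwarded as keyword arguments:

```python
from urllib.parse import urlencode

import ibis

params = urlencode({"driver": "ODBC Driver 18 for SQL Server"})
con = ibis.connect(f"mssql://ibis:secret@localhost:1433/ibis_testing?{params}")
print(con.list_tables())
```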
108 changes: 92 additions & 16 deletions ibis/backends/mssql/tests/test_client.py
@@ -1,13 +1,25 @@
from __future__ import annotations

from urllib.parse import urlencode

import pytest
import sqlglot as sg
import sqlglot.expressions as sge
from pytest import param

import ibis
import ibis.expr.datatypes as dt
from ibis import udf

DB_TYPES = [
from ibis.backends.mssql.tests.conftest import (
IBIS_TEST_MSSQL_DB,
MSSQL_HOST,
MSSQL_PASS,
MSSQL_PORT,
MSSQL_PYODBC_DRIVER,
MSSQL_USER,
)

RAW_DB_TYPES = [
# Exact numbers
("BIGINT", dt.int64),
("BIT", dt.boolean),
@@ -36,23 +48,9 @@
("DATETIME", dt.Timestamp(scale=3)),
# Characters strings
("CHAR", dt.string),
param(
"TEXT",
dt.string,
marks=pytest.mark.notyet(
["mssql"], reason="Not supported by UTF-8 aware collations"
),
),
("VARCHAR", dt.string),
# Unicode character strings
("NCHAR", dt.string),
param(
"NTEXT",
dt.string,
marks=pytest.mark.notyet(
["mssql"], reason="Not supported by UTF-8 aware collations"
),
),
("NVARCHAR", dt.string),
# Binary strings
("BINARY", dt.binary),
@@ -67,6 +65,23 @@
("GEOGRAPHY", dt.geography),
("HIERARCHYID", dt.string),
]
PARAM_TYPES = [
param(
"TEXT",
dt.string,
marks=pytest.mark.notyet(
["mssql"], reason="Not supported by UTF-8 aware collations"
),
),
param(
"NTEXT",
dt.string,
marks=pytest.mark.notyet(
["mssql"], reason="Not supported by UTF-8 aware collations"
),
),
]
DB_TYPES = RAW_DB_TYPES + PARAM_TYPES


@pytest.mark.parametrize(("server_type", "expected_type"), DB_TYPES, ids=str)
@@ -81,6 +96,40 @@ def test_get_schema(con, server_type, expected_type, temp_table):
assert con.sql(f"SELECT * FROM [{temp_table}]").schema() == expected_schema


def test_schema_type_order(con, temp_table):
columns = []
pairs = {}

quoted = con.compiler.quoted
dialect = con.dialect
table_id = sg.to_identifier(temp_table, quoted=quoted)

for i, (server_type, expected_type) in enumerate(RAW_DB_TYPES):
column_name = f"col_{i}"
columns.append(
sge.ColumnDef(
this=sg.to_identifier(column_name, quoted=quoted), kind=server_type
)
)
pairs[column_name] = expected_type

query = sge.Create(
kind="TABLE", this=sge.Schema(this=table_id, expressions=columns)
)
stmt = query.sql(dialect)

with con.begin() as c:
c.execute(stmt)

expected_schema = ibis.schema(pairs)

assert con.get_schema(temp_table) == expected_schema
assert con.table(temp_table).schema() == expected_schema

raw_sql = sg.select("*").from_(table_id).sql(dialect)
assert con.sql(raw_sql).schema() == expected_schema


def test_builtin_scalar_udf(con):
@udf.scalar.builtin
def difference(a: str, b: str) -> int:
@@ -165,3 +214,30 @@ def test_create_temp_table_from_obj(con):
assert persisted_from_temp.to_pyarrow().equals(t2.to_pyarrow())

con.drop_table("fuhreal")


@pytest.mark.parametrize("explicit_schema", [False, True])
def test_create_temp_table_from_expression(con, explicit_schema, temp_table):
t = ibis.memtable(
{"x": [1, 2, 3], "y": ["a", "b", "c"]}, schema={"x": "int64", "y": "str"}
)
t2 = con.create_table(
temp_table, t, temp=True, schema=t.schema() if explicit_schema else None
)
res = con.to_pandas(t.order_by("y"))
sol = con.to_pandas(t2.order_by("y"))
assert res.equals(sol)


def test_from_url():
user = MSSQL_USER
password = MSSQL_PASS
host = MSSQL_HOST
port = MSSQL_PORT
database = IBIS_TEST_MSSQL_DB
driver = MSSQL_PYODBC_DRIVER
new_con = ibis.connect(
f"mssql://{user}:{password}@{host}:{port}/{database}?{urlencode(dict(driver=driver))}"
)
result = new_con.sql("SELECT 1 AS [a]").to_pandas().a.iat[0]
assert result == 1
4 changes: 3 additions & 1 deletion ibis/backends/mysql/__init__.py
@@ -388,13 +388,15 @@ def create_table(
| pl.LazyFrame
| None = None,
*,
schema: ibis.Schema | None = None,
schema: sch.SchemaLike | None = None,
database: str | None = None,
temp: bool = False,
overwrite: bool = False,
) -> ir.Table:
if obj is None and schema is None:
raise ValueError("Either `obj` or `schema` must be specified")
if schema is not None:
schema = ibis.schema(schema)

properties = []

31 changes: 15 additions & 16 deletions ibis/backends/oracle/__init__.py
@@ -192,8 +192,16 @@ def _from_url(self, url: ParseResult, **kwargs):
return self

@property
def current_database(self) -> str:
def current_catalog(self) -> str:
with self._safe_raw_sql(sg.select(STAR).from_("global_name")) as cur:
[(catalog,)] = cur.fetchall()
return catalog

@property
def current_database(self) -> str:
# databases correspond to users, other than that there's
# no notion of a database inside a catalog for oracle
with self._safe_raw_sql(sg.select("user").from_("dual")) as cur:
[(database,)] = cur.fetchall()
return database

@@ -374,7 +382,7 @@ def create_table(
| pl.LazyFrame
| None = None,
*,
schema: ibis.Schema | None = None,
schema: sch.SchemaLike | None = None,
database: str | None = None,
temp: bool = False,
overwrite: bool = False,
@@ -403,6 +411,8 @@
"""
if obj is None and schema is None:
raise ValueError("Either `obj` or `schema` must be specified")
if schema is not None:
schema = ibis.schema(schema)

properties = []

@@ -621,19 +631,8 @@ def _fetch_from_cursor(self, cursor, schema: sch.Schema) -> pd.DataFrame:

from ibis.backends.oracle.converter import OraclePandasData

try:
df = pd.DataFrame.from_records(
cursor, columns=schema.names, coerce_float=True
)
except Exception:
# clean up the cursor if we fail to create the DataFrame
#
# in the sqlite case failing to close the cursor results in
# artificially locked tables
cursor.close()
raise
df = OraclePandasData.convert_table(df, schema)
return df
df = pd.DataFrame.from_records(cursor, columns=schema.names, coerce_float=True)
return OraclePandasData.convert_table(df, schema)

def _clean_up_tmp_table(self, name: str) -> None:
with self.begin() as bind:
@@ -648,5 +647,5 @@ def _clean_up_tmp_table(self, name: str) -> None:
with contextlib.suppress(oracledb.DatabaseError):
bind.execute(f'DROP TABLE "{name}"')

def _clean_up_cached_table(self, name):
def _drop_cached_table(self, name):
self._clean_up_tmp_table(name)
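A sketch of what the split between the two properties means for Oracle (connection parameters are placeholders): the catalog comes from `global_name`, while the "database" is the connected user:

```python
import ibis

con = ibis.oracle.connect(
    user="ibis", password="ibis", host="localhost", port=1521, database="IBIS_TESTING"
)
print(con.current_catalog)   # global database name, e.g. "IBIS_TESTING"
print(con.current_database)  # the connected user, e.g. "IBIS"
```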
5 changes: 3 additions & 2 deletions ibis/backends/oracle/tests/test_client.py
@@ -79,10 +79,11 @@ def test_list_tables_schema_warning_refactor(con):
assert con.list_tables(database="SYS", like="EXU8OPT") == ["EXU8OPT"]


def test_from_url(con):
def test_from_url():
new_con = ibis.connect("oracle://ibis:ibis@localhost:1521/IBIS_TESTING")

assert new_con.list_tables()
result = new_con.sql('SELECT 1 AS "a"').to_pandas()
assert result.a.iat[0] == 1


def test_invalid_port(con):
13 changes: 10 additions & 3 deletions ibis/backends/pandas/__init__.py
@@ -1,5 +1,6 @@
from __future__ import annotations

import warnings
from functools import lru_cache
from typing import TYPE_CHECKING, Any

@@ -51,6 +52,10 @@ def do_connect(
<ibis.backends.pandas.Backend at 0x...>

"""
warnings.warn(
f"The {self.name} backend is slated for removal in 10.0.",
DeprecationWarning,
)
self.dictionary = dictionary or {}
self.schemas: MutableMapping[str, sch.Schema] = {}

@@ -205,6 +210,8 @@ def create_table(
)
if obj is None and schema is None:
raise com.IbisError("The schema or obj parameter is required")
if schema is not None:
schema = ibis.schema(schema)

if obj is not None:
df = self._convert_object(obj)
@@ -257,7 +264,7 @@ def _get_operations(cls):
def has_operation(cls, operation: type[ops.Value]) -> bool:
return operation in cls._get_operations()

def _clean_up_cached_table(self, name):
def _drop_cached_table(self, name):
del self.dictionary[name]

def to_pyarrow(
@@ -321,8 +328,8 @@ def execute(self, query, params=None, limit="default", **kwargs):

return PandasExecutor.execute(query.op(), backend=self, params=params)

def _load_into_cache(self, name, expr):
self.create_table(name, expr.execute())
def _create_cached_table(self, name, expr):
return self.create_table(name, expr.execute())


@lazy_singledispatch
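A small sketch of the new warning behavior (the in-memory table is made up): constructing the pandas backend now emits a `DeprecationWarning`, so callers can catch it explicitly until they migrate to another backend:

```python
import warnings

import ibis
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    con = ibis.pandas.connect({"t": df})

assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```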
5 changes: 0 additions & 5 deletions ibis/backends/pandas/tests/conftest.py
@@ -277,11 +277,6 @@ def time_right(client):
return client.table("time_df2")


@pytest.fixture(scope="module")
def time_table(client):
return client.table("time_df3")


@pytest.fixture(scope="module")
def time_keyed_left(client):
return client.table("time_keyed_df1")