Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctness issue: Having clause not honored in SQL statement #15950

Open
2 tasks done
timvw opened this issue Apr 29, 2024 · 4 comments
Open
2 tasks done

Correctness issue: Having clause not honored in SQL statement #15950

timvw opened this issue Apr 29, 2024 · 4 comments
Labels
A-sql Area: Polars SQL functionality bug Something isn't working python Related to Python Polars

Comments

@timvw
Copy link

timvw commented Apr 29, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

if __name__ == "__main__":

    df = pl.DataFrame(
        {'region': [1, 1, 2]},
    )
    #print(df)

    ctx = pl.SQLContext(register_globals=False, eager_execution=True)
    ctx.register("df", df)
    print(ctx.execute("select region, count(*) from df group by region having count(*) > 1"))

To my surprise this prints out both regions (1 and 2.. Even though region 2 has only 1 element)

shape: (2, 2)
┌────────┬─────┐
│ region ┆ len │
│ ---    ┆ --- │
│ i64    ┆ u32 │
╞════════╪═════╡
│ 1      ┆ 2   │
│ 2      ┆ 1   │
└────────┴─────┘

Log output

No response

Issue description

The having clause is not honored.

Expected behavior

Only region 1 is printed

Installed versions

--------Version info---------
Polars:               0.20.22
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.9.15 (main, Apr 20 2024, 22:00:10) 
[Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              <not installed>
pydantic:             2.7.0
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.29
xlsx2csv:             <not installed>
xlsxwriter:           3.2.0
@timvw timvw added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 29, 2024
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Apr 29, 2024

Ahh... we require the having clause to reference named select columns here as we apply the constraint as a post-aggregation step (this is in line with MySQL's take on the "having" clause, though we prefer targeting PostgreSQL behaviour wherever possible).

This formulation will work correctly:

select region, count(*) as n from df group by region having n > 1
# ┌────────┬─────┐
# │ region ┆ n   │
# │ ---    ┆ --- │
# │ i64    ┆ u32 │
# ╞════════╪═════╡
# │ 1      ┆ 2   │
# └────────┴─────┘

I'll have to see if we can improve this (either by raising a suitable error, or automatically substituting to achieve the same effect) 🤔

@alexander-beedie alexander-beedie added A-sql Area: Polars SQL functionality and removed needs triage Awaiting prioritization by a maintainer labels Apr 29, 2024
@timvw
Copy link
Author

timvw commented Apr 29, 2024

I assume that adding this to the documentation would be helpful as well

@alexander-beedie
Copy link
Collaborator

I assume that adding this to the documentation would be helpful as well

Yup; we also need a total overhaul of our SQL documentation :))

@timvw
Copy link
Author

timvw commented Apr 30, 2024

Just ran the following on a mysql (mysql/8.3.0_1) instance:

with data as (
  select 1 as region
  union all
  select 1 as region
  union all 
  select 2 as region
)
select region, count(*) as region_count
from data
group by region
having count(*) > 1;
+--------+--------------+
| region | region_count |
+--------+--------------+
|      1 |            2 |
+--------+--------------+
1 row in set (0.00 sec)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql Area: Polars SQL functionality bug Something isn't working python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

2 participants