Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: topk() always says a column contains 0 NULLs #8518

Closed
1 task done
NickCrews opened this issue Mar 1, 2024 · 7 comments · Fixed by #8531
Closed
1 task done

bug: topk() always says a column contains 0 NULLs #8518

NickCrews opened this issue Mar 1, 2024 · 7 comments · Fixed by #8531
Assignees
Labels
bug Incorrect behavior inside of ibis
Milestone

Comments

@NickCrews
Copy link
Contributor

What happened?

I would expect this to say that there is 1 NULL:

import ibis

ibis.options.interactive = True

t = ibis.memtable({"x": [1, 1, 2, None]})
tk = t.x.topk(10)
tk
# ┏━━━━━━━━━┳━━━━━━━━━━┓
# ┃ x       ┃ Count(x) ┃
# ┡━━━━━━━━━╇━━━━━━━━━━┩
# │ float64 │ int64    │
# ├─────────┼──────────┤
# │     1.0 │        2 │
# │     2.0 │        1 │
# │    NULL │        0 │
# └─────────┴──────────┘
ibis.to_sql(tk)
# SELECT
#   "t1"."x",
#   "t1"."Count(x)"
# FROM (
#   SELECT
#     "t0"."x",
#     COUNT("t0"."x") AS "Count(x)"
#   FROM "ibis_pandas_memtable_46aexxpm3fdfhdexrstqhyuvyi" AS "t0"
#   GROUP BY
#     1
# ) AS "t1"
# ORDER BY
#   "t1"."Count(x)" DESC
# LIMIT 10

What version of ibis are you using?

main

What backend(s) are you using, if any?

duckdb

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the bug Incorrect behavior inside of ibis label Mar 1, 2024
@cpcloud
Copy link
Member

cpcloud commented Mar 3, 2024

This is expected behavior due to the count(x) in the group by. count(expr), like most other aggregates ignores NULL inputs.

If you want to count NULL you can pass t.count() to by:

tk = t.x.topk(10, by=t.count())  # or by=_.count()

@cpcloud cpcloud closed this as not planned Won't fix, can't repro, duplicate, stale Mar 3, 2024
@NickCrews
Copy link
Contributor Author

NickCrews commented Mar 3, 2024

Could we consider changing what is expected behavior to what I expect? I could see two main uses for topk():

  1. Making a leaderboard for a dashboard or something. There, I bet some/most/all users would want nulls NOT included.
  2. Exploratory data analysis, just getting a sense of the data. Here, all users would want nulls included.

Messing up the expectations for usecase 1 reveals itself much earlier than for case 2. With case 2, your error might lurk for months without you realizing it. I was in this second boat.

If we keep current behavior, the we definitely should remove the NULL: 0 row from the output, since that is actively wrong.

@NickCrews
Copy link
Contributor Author

Whatever we decide, value_counts() should also have the same semantics

@cpcloud
Copy link
Member

cpcloud commented Mar 3, 2024

Hm, yeah the NULL is confusing in the output. I see what you mean. I think it makes sense to change this to be _.count() by default. This is already the behavior of value_counts() IIRC.

@cpcloud cpcloud reopened this Mar 3, 2024
@NickCrews
Copy link
Contributor Author

@cpcloud just to be sure you saw it, I edited my comment with some more info/context

@cpcloud
Copy link
Member

cpcloud commented Mar 3, 2024

Thanks, got it!

I think avoiding a loss of information is probably the way to go here. You can't recover uncounted nulls, whereas you can always discard the counted ones after the fact.

@cpcloud cpcloud added this to the 9.0 milestone Mar 3, 2024
@NickCrews
Copy link
Contributor Author

related: #8540

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants