
Return all nulls from map_groups causes panic #15260

Open
drhagen opened this issue Mar 24, 2024 · 2 comments
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments


drhagen commented Mar 24, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

def foo(x):
    return pl.Series([x[0][0]], dtype=x[0].dtype)

pl.DataFrame({"key": [0, 0, 1], "a": [None, None, None]}).group_by("key").agg(
    pl.map_groups(exprs=["a"], function=foo)
)

Log output

keys/aggregates are not partitionable: running default HASH AGGREGATION
thread 'polars-2' panicked at crates/polars-lazy/src/physical_plan/expressions/apply.rs:166:22:
called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("invalid series dtype: expected `List`, got `null`"))
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: rayon_core::thread_pool::ThreadPool::install::{{closure}}
   4: <polars_lazy::physical_plan::expressions::apply::ApplyExpr as polars_lazy::physical_plan::expressions::PhysicalExpr>::evaluate_on_groups
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: rayon_core::thread_pool::ThreadPool::install::{{closure}}
   7: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
   8: rayon_core::registry::WorkerThread::wait_until_cold
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/tabeline/.venv/lib/python3.11/site-packages/polars/dataframe/group_by.py", line 250, in agg
    .collect(no_optimization=True)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/tabeline/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1943, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("invalid series dtype: expected `List`, got `null`"))

Issue description

This is a regression from v0.19 to v0.20. It appears that map_groups cannot handle a Series that is entirely null.

Expected behavior

shape: (2, 2)
┌─────┬──────┐
│ key ┆ a    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 0   ┆ null │
│ 1   ┆ null │
└─────┴──────┘

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Linux-5.15.0-101-generic-x86_64-with-glibc2.31
Python:               3.11.3 | packaged by conda-forge | (main, Apr  6 2023, 08:57:19) [GCC 11.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
drhagen added the bug, needs triage, and python labels on Mar 24, 2024
cmdlineluser (Contributor)

Can reproduce.

Just some notes:

It seems like the panic actually happens on the way into the function.

df.select(pl.map_groups("a", lambda x: print(x)).over("key"))
# thread 'polars-1' panicked at crates/polars-lazy/src/physical_plan/expressions/apply.rs:166:22:

.GroupBy.map_groups() doesn't have the issue.

def foo(x): 
    print(x)
    return x

df = pl.DataFrame({"key": [0,0,1], "a": [None, None, None]})

df.group_by("key").map_groups(foo)

# shape: (2, 2)
# ┌─────┬──────┐
# │ key ┆ a    │
# │ --- ┆ ---  │
# │ i64 ┆ null │
# ╞═════╪══════╡
# │ 0   ┆ null │
# │ 0   ┆ null │
# └─────┴──────┘
# ...


drhagen commented Jun 2, 2024

This issue now manifests in a slightly different way. The original example no longer crashes, but the dtype of the resulting column is wrong.

shape: (2, 2)
┌─────┬────────────┐
│ key ┆ a          │
│ --- ┆ ---        │
│ i64 ┆ list[null] │
╞═════╪════════════╡
│ 1   ┆ null       │
│ 0   ┆ null       │
└─────┴────────────┘

The dtype of column a should be null, not list[null].
