map with struct return_dtype errors if return is single dict with correct keys & None #10398

Closed
2 tasks done
desmond-dsouza opened this issue Aug 9, 2023 · 7 comments
Labels
bug Something isn't working P-medium Priority: medium python Related to Python Polars

Comments

@desmond-dsouza

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

S = pl.Struct([pl.Field("x", pl.Int64)])

df = pl.DataFrame({"a": [1]})

def f_id(ds: pl.Series) -> pl.Series:
    return pl.Series("r", [{"x": d["a"] + 1} for d in ds])

def f_null(ds: pl.Series) -> pl.Series:
    return pl.Series("r", [{"x": None} for d in ds])

# return_dtype works, f_id returns dicts with correct keys & Ints
df.select(pl.struct("a").map(f_id, return_dtype=S))

# return_dtype crashes, f_null returns dicts with correct keys & None
df.select(pl.struct("a").map(f_null, return_dtype=S))

Issue description

I believe both map calls should work, since there is an explicit return_dtype and the shape and order of the dicts match that dtype. Changing the return value to just None, without the enclosing dict, still results in a SchemaError.

Expected behavior

No SchemaError

Installed versions

In [29]: pl.show_versions()
--------Version info---------
Polars:              0.18.11
Index type:          UInt32
Platform:            macOS-11.7.8-x86_64-i386-64bit
Python:              3.11.3 (v3.11.3:f3909b8bc8, Apr  4 2023, 20:12:10) [Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         2.2.1
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.6.0
matplotlib:          3.7.2
numpy:               1.25.0
pandas:              2.0.3
pyarrow:             12.0.1
pydantic:            1.10.11
sqlalchemy:          1.4.49
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

@desmond-dsouza desmond-dsouza added bug Something isn't working python Related to Python Polars labels Aug 9, 2023
@alexander-beedie
Collaborator

alexander-beedie commented Aug 10, 2023

@orlp, @ritchie46, a quick coffee-break triage for you:

I think the comparison check here needs to be slightly more forgiving than ==, handling Null fields in nested dtypes (we handle the flat/scalar dtype case fine). In the example above, the comparison fails because...

Struct([Field {name: "x",dtype: Null}]) !=
Struct([Field {name: "x",dtype: Int64}])

...which is true, but should still pass the check as all-null data is acceptable for any dtype in this context. Perhaps we already have such a comparison function hiding somewhere? If not, guess we need one :)

@alexander-beedie alexander-beedie added the accepted Ready for implementation label Aug 10, 2023
@ritchie46 ritchie46 changed the title map with struct return_dtype crashes if return is single dict with correct keys & None map with struct return_dtype errors if return is single dict with correct keys & None Aug 10, 2023
@DeflateAwning
Contributor

I encountered this bug. What would it take to get it fixed?

@CanglongCl
Contributor

I would consider this the correct behavior, since the inferred dtype of pl.Series("r", [{"x": None} for d in ds]) really is null.

The solution is to tell the Series which dtype you want, e.g. pl.Series("r", [{"x": None} for d in ds], dtype=S).

@DeflateAwning
Contributor

DeflateAwning commented Apr 7, 2024

Tbh, I have lost track of what this error means. Would appreciate it if someone could post a minimal reproducible example.

Would be happy to take a stab at fixing it.

@cmdlineluser
Contributor

@DeflateAwning

When using a flat/scalar null, there is no error and it is "upcast":

df = pl.DataFrame({"a": 1})

df.with_columns(
   pl.all().map_elements(lambda x:
      None,
      return_dtype = pl.Float64
   )
)

# shape: (1, 1)
# ┌──────┐
# │ a    │
# │ ---  │
# │ f64  │ # <- dtype `null` "upcast" to `f64` as per return_dtype
# ╞══════╡
# │ null │
# └──────┘

But with a list/dict it errors instead:

df.with_columns(
   pl.all().map_elements(lambda x:
      {"x": None},
      return_dtype = pl.Struct({"x": pl.Float64})
   )
)

# SchemaError: expected output type ...

The expectation appears to be that the inner null would upcast:

s = pl.Series([{"x": None}])

s.dtype
# Struct({'x': Null})

s.cast(pl.Struct({"x": pl.Float64})).dtype
# Struct({'x': Float64})

Side note: I've also just noticed that on a Series there is no error, but the inner dtype remains null:

pl.Series(["a"]).map_elements(lambda x:
   {"x": None},
   return_dtype = pl.Struct({"x": pl.Float64})
).dtype

# Struct({'x': Null})

@DeflateAwning
Contributor

Any idea where to start looking in the codebase?

@stinodego
Member

Should be fixed by #15699
