Support empty structs #9216

Open
stinodego opened this issue Jun 4, 2023 · 5 comments
Labels
A-dtype-struct (Area: struct data type)
accepted (Ready for implementation)
enhancement (New feature or an improvement of an existing feature)

Comments

@stinodego
Member

stinodego commented Jun 4, 2023

Problem description

Although perhaps not extremely useful, we should allow structs without any fields for the sake of consistency.

In the current behaviour, Polars conjures up a single unnamed field of type Null:

>>> pl.Series(dtype=pl.Struct())
shape: (1,)
Series: '' [struct[1]]
[
        {null}
]

Trying to create an empty struct through the struct expression results in a PanicException:

>>> pl.select(pl.struct())
thread '<unnamed>' panicked at 'index out of bounds: the len is 0 but the index is 0', /home/stijn/code/polars/polars/polars-lazy/polars-plan/src/dsl/functions.rs:1296:48
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stijn/code/polars/py-polars/polars/functions/lazy.py", line 2391, in select
    return pl.DataFrame().select(exprs, *more_exprs, **named_exprs)
  File "/home/stijn/code/polars/py-polars/polars/dataframe/frame.py", line 7117, in select
    self.lazy()
  File "/home/stijn/code/polars/py-polars/polars/lazyframe/frame.py", line 2040, in select
    return self._from_pyldf(self._ldf.select(exprs))
pyo3_runtime.PanicException: index out of bounds: the len is 0 but the index is 0

Desired behaviour would be:

>>> pl.Series(dtype=pl.Struct())
shape: (0,)
Series: '' [struct[0]]
[
]
>>> pl.select(pl.struct())
shape: (1, 1)
┌───────────┐
│ struct    │
│ ---       │
│ struct[0] │
╞═══════════╡
│ {}        │
└───────────┘
stinodego added the enhancement label on Jun 4, 2023
stinodego added the accepted label on Aug 28, 2023
stinodego added the A-dtype-struct label on Feb 18, 2024
@sibarras

Hi, does the team have a plan to support this? In a lot of cases, when parsing empty json columns from DB, the function panics.

@stinodego
Member Author

stinodego commented Apr 7, 2024

> Hi, does the team have a plan to support this? In a lot of cases, when parsing empty json columns from DB, the function panics.

@sibarras Could you give a reproducible example of that panic?

@sibarras

sibarras commented Apr 8, 2024

> > Hi, does the team have a plan to support this? In a lot of cases, when parsing empty json columns from DB, the function panics.
>
> @sibarras Could you give a reproducible example of that panic?

Sure. Using SQLite, when you read a JSON column, Polars parses it as a str. Then, when you try to decode that string into a struct, it panics.

from sqlite3 import connect
import polars as pl


def main():
    with connect(":memory:") as con:
        df = pl.read_database(
            "SELECT JSON('{}') as json_col;", con
        )  # it works fine, but it's parsed as a string
        print(df)
        df.select(pl.col("json_col").str.json_decode())  # panics here


if __name__ == "__main__":
    main()

This is the output using Python 3.9.18 on WSL2.

shape: (1, 1)
┌──────────┐
│ json_col │
│ ---      │
│ str      │
╞══════════╡
│ {}       │
└──────────┘
thread 'python' panicked at crates/polars-arrow/src/array/struct_/mod.rs:117:52:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("a StructArray must contain at least one field"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/samuel_e_ibarra/coding/python/targeting_lib/example.py", line 15, in <module>
    main()
  File "/home/samuel_e_ibarra/coding/python/targeting_lib/example.py", line 11, in main
    df.select(pl.col("json_col").str.json_decode())  # panics here
  File "/home/samuel_e_ibarra/coding/python/targeting_lib/.venv/lib/python3.9/site-packages/polars/dataframe/frame.py", line 8124, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
  File "/home/samuel_e_ibarra/coding/python/targeting_lib/.venv/lib/python3.9/site-packages/polars/lazyframe/frame.py", line 1943, in collect
    return wrap_df(ldf.collect())
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("a StructArray must contain at least one field"))

@stinodego
Member Author

I looked into this and empty structs just don't really make much sense. An empty struct column would have to behave somewhat like a Null column as it doesn't contain any Series/values.

We should probably first address #3462 before implementing this.

str.json_decode should either error or return a Null column here. I will make a separate issue for that.
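
In the meantime, one possible workaround is to screen the raw strings before they reach str.json_decode. The sketch below uses only Python's standard json module. null_out_empty_objects is a hypothetical helper, not a Polars API, and it assumes that turning an empty JSON object into a null is acceptable for your data.

```python
import json
from typing import Optional


def null_out_empty_objects(values: list[Optional[str]]) -> list[Optional[str]]:
    """Replace JSON strings that decode to an empty object with None.

    Hypothetical helper: every other value is passed through unchanged.
    """
    cleaned = []
    for raw in values:
        if raw is None:
            cleaned.append(None)
            continue
        parsed = json.loads(raw)
        # An empty dict is the case that would require a zero-field struct.
        if isinstance(parsed, dict) and not parsed:
            cleaned.append(None)
        else:
            cleaned.append(raw)
    return cleaned


print(null_out_empty_objects(['{}', '{"a": 1}', None]))
# [None, '{"a": 1}', None]
```

The cleaned list can then be used to build the string column that gets decoded, so the decoder never sees a bare {}.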

@jcmuel

jcmuel commented May 29, 2024

The empty struct also creates issues in read_ndjson and json_decode:

Polars already handles empty structs, but in an inconsistent way. And the inconsistency causes panic exceptions in more complex situations.

import io
import polars as pl

frame = pl.read_ndjson(io.StringIO('{"id": 1, "empty_struct": {}, "list_of_empty_struct": [{}]}'))
print(frame)

for col_name, col_type in frame.schema.items():
    print(f'{col_name:>20}   {col_type}')

Output:

shape: (1, 3)
┌─────┬──────────────┬──────────────────────┐
│ id  ┆ empty_struct ┆ list_of_empty_struct │
│ --- ┆ ---          ┆ ---                  │
│ i64 ┆ struct[1]    ┆ list[struct[0]]      │
╞═════╪══════════════╪══════════════════════╡
│ 1   ┆ {null}       ┆ []                   │
└─────┴──────────────┴──────────────────────┘
                  id   Int64
        empty_struct   Struct({'': Null})
list_of_empty_struct   List(Struct({}))

The expected type of the "empty_struct" column would be pl.Struct({}), but it is pl.Struct({pl.Field('', pl.Null)}).
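
To find which fields in a record would hit this inconsistency, a small standard-library sketch can list every empty object in the parsed JSON. empty_object_paths is a hypothetical helper and the "$"-style path format is just an illustrative choice; nothing here calls into Polars.

```python
import json


def empty_object_paths(value, path="$"):
    """Yield a path for every empty JSON object nested inside value."""
    if isinstance(value, dict):
        if not value:
            yield path
        for key, child in value.items():
            yield from empty_object_paths(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from empty_object_paths(child, f"{path}[{index}]")


record = json.loads('{"id": 1, "empty_struct": {}, "list_of_empty_struct": [{}]}')
print(list(empty_object_paths(record)))
# ['$.empty_struct', '$.list_of_empty_struct[0]']
```

Both reported paths correspond to the two columns above whose inferred types disagree (struct[1] versus list[struct[0]]).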
