infer_schema_length=None fails with unexpected type for nested data #16607

Open
2 tasks done
theelderbeever opened this issue May 30, 2024 · 1 comment
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

example_data = [
    {
        "customer": "customer_1",
        "summaries": [
            {
                "id": "summary_1",
                "object": "object_1",
                "aggregated_value": 1000.0,
                "end_time": 1625155200,
                "livemode": True,
                "meter": "meter_1",
                "start_time": 1625078800,
            },
            {
                "id": "summary_2",
                "object": "object_2",
                "aggregated_value": 2000,
                "end_time": 1625241600,
                "livemode": False,
                "meter": "meter_2",
                "start_time": 1625165200,
            }
        ]
    },
    {
        "customer": "customer_2",
        "summaries": [
            {
                "id": "summary_3",
                "object": "object_3",
                "aggregated_value": 3000,
                "end_time": 1625328000,
                "livemode": True,
                "meter": "meter_3",
                "start_time": 1625251600,
            }
        ]
    }
]

pl.DataFrame(example_data, infer_schema_length=None)

Log output

❯ POLARS_VERBOSE=1 python notebooks/test.py
Traceback (most recent call last):
  File "/Users/taylorbeever/git/quiknode-labs/billing/billing-platform-pipelines/notebooks/test.py", line 43, in <module>
    pl.DataFrame(example_data, infer_schema_length=None)
  File "/Users/taylorbeever/.pyenv/versions/billing-platform-pipelines/lib/python3.11/site-packages/polars/dataframe/frame.py", line 366, in __init__
    self._df = sequence_to_pydf(
               ^^^^^^^^^^^^^^^^^
  File "/Users/taylorbeever/.pyenv/versions/billing-platform-pipelines/lib/python3.11/site-packages/polars/_utils/construction/dataframe.py", line 437, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/taylorbeever/.pyenv/versions/3.11.8/lib/python3.11/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/taylorbeever/.pyenv/versions/billing-platform-pipelines/lib/python3.11/site-packages/polars/_utils/construction/dataframe.py", line 678, in _sequence_of_dict_to_pydf
    pydf = PyDataFrame.from_dicts(
           ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unexpected value while building Series of type Float64; found value of type Int64: 2000

Hint: Try setting `strict=False` to allow passing data with mixed types.

Issue description

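Polars fails to infer the correct datatype for a nested struct field, even with infer_schema_length=None. The failing field in the example is aggregated_value inside the summaries column, which has type List(Struct( ... )): the first struct holds 1000.0 (a float) while the later ones hold 2000 and 3000 (ints).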
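
To see which nested fields mix int and float values before handing data to Polars, a small pure-Python check along these lines may help. This is only a sketch: the helper names `collect_numeric_types` and `mixed_numeric_paths` are hypothetical and not part of Polars, and the sample is a trimmed copy of the reproducible example.

```python
def collect_numeric_types(value, path="", seen=None):
    """Record which Python numeric types occur at each nested field path."""
    if seen is None:
        seen = {}
    if isinstance(value, dict):
        for key, item in value.items():
            child = f"{path}.{key}" if path else key
            collect_numeric_types(item, child, seen)
    elif isinstance(value, list):
        for item in value:
            collect_numeric_types(item, f"{path}[]", seen)
    elif isinstance(value, (int, float)) and not isinstance(value, bool):
        # bool is a subclass of int in Python, so it is excluded explicitly.
        seen.setdefault(path, set()).add(type(value).__name__)
    return seen


def mixed_numeric_paths(records):
    """Return the field paths where both int and float values appear."""
    seen = collect_numeric_types(records)
    return sorted(path for path, types in seen.items() if len(types) > 1)


# Trimmed copy of the data from the reproducible example.
sample = [
    {
        "customer": "customer_1",
        "summaries": [
            {"id": "summary_1", "aggregated_value": 1000.0, "end_time": 1625155200, "livemode": True},
            {"id": "summary_2", "aggregated_value": 2000, "end_time": 1625241600, "livemode": False},
        ],
    },
]

print(mixed_numeric_paths(sample))  # ['[].summaries[].aggregated_value']
```

Here `[]` in a path stands for "any element of a list", so the reported path points at aggregated_value inside each struct of the summaries list.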

Expected behavior

infer_schema_length should apply to nested types as well.
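
Until nested inference behaves this way, one possible workaround is to normalize the offending values in plain Python before constructing the frame. The sketch below assumes you already know which field names should be floats; `coerce_float_fields` is a hypothetical helper, not a Polars API, and the sample is a trimmed copy of the reproducible example.

```python
def coerce_float_fields(value, float_fields):
    """Return a copy of nested dicts/lists with ints in the named fields cast to float."""
    if isinstance(value, dict):
        result = {}
        for key, item in value.items():
            is_plain_int = isinstance(item, int) and not isinstance(item, bool)
            if key in float_fields and is_plain_int:
                result[key] = float(item)
            else:
                result[key] = coerce_float_fields(item, float_fields)
        return result
    if isinstance(value, list):
        return [coerce_float_fields(item, float_fields) for item in value]
    return value


# Trimmed copy of the data from the reproducible example.
sample = [
    {
        "customer": "customer_1",
        "summaries": [
            {"id": "summary_1", "aggregated_value": 1000.0, "end_time": 1625155200, "livemode": True},
            {"id": "summary_2", "aggregated_value": 2000, "end_time": 1625241600, "livemode": False},
        ],
    },
]

cleaned = coerce_float_fields(sample, {"aggregated_value"})
print(cleaned[0]["summaries"][1]["aggregated_value"])  # 2000.0
```

The cleaned records contain only floats in aggregated_value, so passing them to pl.DataFrame should no longer present mixed types for that field. That last step has not been verified here against Polars 0.20.30.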

Installed versions

--------Version info---------
Polars:               0.20.30
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.11.8 (main, Apr 27 2024, 07:50:56) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.12.2
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              <not installed>
pydantic:             2.5.3
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.29
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

@theelderbeever added the bug, needs triage, and python labels May 30, 2024
@theelderbeever changed the title from "infer_schema_length=None faile with unexpected type" to "infer_schema_length=None fails with unexpected type for nested data" May 30, 2024

@cmdlineluser
Contributor

Can reproduce.

import polars as pl

pl.DataFrame({"A": [[[1.0], [2]]]})
# shape: (1, 1)
# ┌─────────────────┐
# │ A               │
# │ ---             │
# │ list[list[f64]] │
# ╞═════════════════╡
# │ [[1.0], [2.0]]  │
# └─────────────────┘
pl.DataFrame({"A": [[{"B":1.0}, {"B":2}]]})
# TypeError: unexpected value while building Series of type Float64; found value of type Int64: 2

INFER_SCHEMA_LENGTH is hardcoded to 25 here, but it doesn't seem to come into play:

The issue seems to be that structs are treated differently to other types.

e.g. inside to_list there is an explicit cast:

But to_struct ends up calling from_any_values_and_dtype again on the inner values:

So in this case, we end up with a strict call on the inner values that fails.

Series::from_any_values_and_dtype("name", [1.0, 2], Float64, true)
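
The difference between the two paths can be mimicked with a toy model in plain Python. This is not Polars' implementation: `build_float_column` is a hypothetical stand-in that only imitates the strict versus non-strict behavior described above for a Float64 target.

```python
def build_float_column(values, strict):
    """Toy stand-in for building a Float64 column from Python values."""
    out = []
    for v in values:
        if isinstance(v, float):
            out.append(v)
        elif strict:
            # Strict mode rejects anything that is not already a float.
            raise TypeError(
                "unexpected value while building column of type Float64; "
                f"found value of type {type(v).__name__}: {v!r}"
            )
        else:
            # Non-strict mode casts where possible and falls back to None.
            try:
                out.append(float(v))
            except (TypeError, ValueError):
                out.append(None)
    return out


print(build_float_column([1.0, 2], strict=False))  # [1.0, 2.0]

try:
    build_float_column([1.0, 2], strict=True)
except TypeError as exc:
    print(f"strict build failed: {exc}")
```

In this model the list case corresponds to casting first (the non-strict branch), while the struct case corresponds to the strict branch being applied to the inner values [1.0, 2].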
