pl.DataFrame loads in 2D lists in unexpected way #16818

Closed
2 tasks done
zacmon opened this issue Jun 7, 2024 · 5 comments · Fixed by #16976
Labels: A-api (Area: changes to the public API), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars)

Comments

@zacmon

zacmon commented Jun 7, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
x1 = [[1, 2]]
df0 = pl.DataFrame(x1, schema=['c0', 'c1'])

x2 = [[1, 2], [3, 4]]
df1 = pl.DataFrame(x2, schema=['c0', 'c1'])
df2 = pl.DataFrame(x2 + [[]], schema=['c0', 'c1'])

x3 = [[1, 2], [3, 4], [5, 6]]
df3 = pl.DataFrame(x3, schema=['c0', 'c1'])

Log output

No response

Issue description

I would expect a Polars DataFrame to treat each inner list of a 2D list as a row. Judging by the output of df1, this does not happen: each inner list is loaded as a column instead.

Expected behavior

In the above example, df2 and df3 behave as I would expect.

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-4.18.0-513.18.1.el8_9.x86_64-x86_64-with-glibc2.28
Python:               3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.24.4
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@zacmon zacmon added the bug, needs triage, and python labels Jun 7, 2024
@cjackal
Contributor

cjackal commented Jun 8, 2024

You can control how Polars interprets 2D data input with the orient keyword:

x2 = [[1, 2], [3, 4]]
df1 = pl.DataFrame(x2, schema=['c0', 'c1'], orient="row")

Note that for a square matrix like yours, both orientations are valid and make sense. That said, the default orientation in Polars here (orient="col") differs from that of pandas, which may confuse users coming from pandas.

@alexander-beedie
Collaborator

alexander-beedie commented Jun 8, 2024

Yup, this is not a bug. Polars/Arrow are column-oriented by design, so when there is ambiguity (same number of rows/columns and the schema types don't help and you have not set the "orient" parameter), "col" will be the default.

This is detailed in the DataFrame docstring:

orient : {'col', 'row'}, default None
    Whether to interpret two-dimensional data as columns or as rows. If None, 
    the orientation is inferred by matching the columns and data dimensions.
    If this does not yield conclusive results, column orientation is used.

Note that, if you have the option, column data will load more efficiently; otherwise, set orient="row" explicitly to avoid the need for any data-level inference 👍

@alexander-beedie alexander-beedie added the invalid label and removed the bug and needs triage labels Jun 8, 2024
@cjackal
Contributor

cjackal commented Jun 8, 2024

@alexander-beedie Another (related) point of confusion for me as a Polars newbie: when the 2D list is converted to a NumPy array, its default orientation becomes "row", and this behavior change does not seem to be documented:

import numpy as np
import polars as pl

x2 = [[1, 2], [3, 4]]
pl.DataFrame(x2, schema=['c0', 'c1'])
# shape: (2, 2)
# ┌─────┬─────┐
# │ c0  ┆ c1  │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 3   │
# │ 2   ┆ 4   │
# └─────┴─────┘
pl.DataFrame(np.asarray(x2), schema=['c0', 'c1'])
# shape: (2, 2)
# ┌─────┬─────┐
# │ c0  ┆ c1  │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# │ 3   ┆ 4   │
# └─────┴─────┘

If I may suggest: a silent behavior change like this is often dangerous. As a full-time ML engineer, I have spent far more time debugging silent behavior changes than fixing noisy warning spam; a silent killer is a true evil. The behavior should be consistent across input classes, or at least well documented.

@zacmon
Author

zacmon commented Jun 8, 2024

Thank you both for clarifying! I'm leaving the issue open due to @cjackal's observation about the inconsistency. Once that is resolved, whoever would like can close the issue.

@stinodego stinodego closed this as not planned Jun 9, 2024
@stinodego stinodego reopened this Jun 9, 2024
@stinodego
Member

stinodego commented Jun 9, 2024

People bump their heads on this one all the time. This must be the 5th issue with this exact complaint.

I think it's time to flip the switch and use row-orientation by default for sequence-of-sequences (if we cannot infer that it should be column-oriented). It just makes sense to parse these as rows - we have the dict format for column-oriented input. And we do the same for NumPy inputs.

@ritchie46 What do you think?

@stinodego stinodego added the A-api, enhancement, and needs decision labels and removed the invalid label Jun 9, 2024
@stinodego stinodego added this to the 1.0.0 milestone Jun 10, 2024
@stinodego stinodego removed the needs decision label Jun 16, 2024
@c-peters c-peters added the accepted label Jun 24, 2024
5 participants