Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pivot can create dataframe with duplicate columns #13994

Closed
2 tasks done
mcrumiller opened this issue Jan 25, 2024 · 4 comments
Closed
2 tasks done

pivot can create dataframe with duplicate columns #13994

mcrumiller opened this issue Jan 25, 2024 · 4 comments
Labels
bug Something isn't working python Related to Python Polars

Comments

@mcrumiller
Copy link
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({
    "index": ["one"],
    "columns1": ["x"],
    "columns2": ["x"],
    "values": [1],
})

out = df.pivot(
    values="values",
    columns=["columns1", "columns2"],
    index="index"
)

print(out)

Log output

shape: (1, 3)
┌───────┬─────┬─────┐
│ index ┆ x   ┆ x   │
│ ---   ┆ --- ┆ --- │
│ str   ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ one   ┆ 1   ┆ 1   │
└───────┴─────┴─────┘

Issue description

When multiple column columns are supplied and they contain overlapping values, and there is only one values column, then duplicate columns will arise.

Note that when there are multiple values columns, the values and columns names are included as part of the output column names:

import polars as pl

df = pl.DataFrame({
    "index": ["one"],
    "columns1": ["x"],
    "columns2": ["x"],
    "values1": [1],
    "values2": [1],
})

out = df.pivot(
    values=["values1", "values2"],
    columns=["columns1", "columns2"],
    index="index"
)

print(out)
shape: (1, 5)
┌───────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┐
│ index ┆ values1_columns1_x ┆ values2_columns1_x ┆ values1_columns2_x ┆ values2_columns2_x │
│ ---   ┆ ---                ┆ ---                ┆ ---                ┆ ---                │
│ str   ┆ i64                ┆ i64                ┆ i64                ┆ i64                │
╞═══════╪════════════════════╪════════════════════╪════════════════════╪════════════════════╡
│ one   ┆ 1                  ┆ 1                  ┆ 1                  ┆ 1                  │
└───────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┘

However, if there is only a single values column, then the new column names use the unique values found within the columns only. Thus, if there is any value that is common to the columns columns, we end up with a duplicate column name.

Expected behavior

We should probably simply always include the values and columns names, and perhaps in a separate issue allow for a name generator.

Installed versions

--------Version info---------
Polars:               0.20.5
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.7 (tags/v3.11.7:fa7a6f2, Dec  4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.4
pyarrow:              14.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.23
xlsx2csv:             0.8.2
xlsxwriter:           3.1.9
@cmdlineluser
Copy link
Contributor

Haven't checked if it is the exact same cause, but there has been some previous on this topic which is probably relevant:

@MarcoGorelli
Copy link
Collaborator

yup, I'd say that this is a duplicate of #11663

FWIW tomorrow I have all day free to do Polars work so I'll do a full-immersion into this, hope I can come up with a way forwards 🤞

@mcrumiller
Copy link
Contributor Author

@MarcoGorelli great. I think first we should agree on a path forward though.

pivot currently auto-determines the output column names. If there are multiple values parameters, it comes up with column names with the format [values_column1]_[columns_column1]_[value]. If there is only a single values parameter, it comes up with column names that are simply [value].

I think the best way forward is to use the first format for everything (which avoids duplicate column names, but creates messy looking names always) by default, but also provide an output naming strategy. I'm not entirely sure what the best way to approach implementing the naming strategy would be, so that is something to think about.

@stinodego
Copy link
Member

Duplicate of #11663

@stinodego stinodego marked this as a duplicate of #11663 Jan 26, 2024
@stinodego stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Jan 26, 2024
@stinodego stinodego removed the needs triage Awaiting prioritization by a maintainer label Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

4 participants