`pivot` can create dataframe with duplicate columns #13994

mcrumiller · 2024-01-25T16:06:36Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({
    "index": ["one"],
    "columns1": ["x"],
    "columns2": ["x"],
    "values": [1],
})

out = df.pivot(
    values="values",
    columns=["columns1", "columns2"],
    index="index"
)

print(out)

Log output

shape: (1, 3)
┌───────┬─────┬─────┐
│ index ┆ x   ┆ x   │
│ ---   ┆ --- ┆ --- │
│ str   ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ one   ┆ 1   ┆ 1   │
└───────┴─────┴─────┘

Issue description

When multiple column columns are supplied and they contain overlapping values, and there is only one values column, then duplicate columns will arise.

Note that when there are multiple values columns, the values and columns names are included as part of the output column names:

import polars as pl

df = pl.DataFrame({
    "index": ["one"],
    "columns1": ["x"],
    "columns2": ["x"],
    "values1": [1],
    "values2": [1],
})

out = df.pivot(
    values=["values1", "values2"],
    columns=["columns1", "columns2"],
    index="index"
)

print(out)

shape: (1, 5)
┌───────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┐
│ index ┆ values1_columns1_x ┆ values2_columns1_x ┆ values1_columns2_x ┆ values2_columns2_x │
│ ---   ┆ ---                ┆ ---                ┆ ---                ┆ ---                │
│ str   ┆ i64                ┆ i64                ┆ i64                ┆ i64                │
╞═══════╪════════════════════╪════════════════════╪════════════════════╪════════════════════╡
│ one   ┆ 1                  ┆ 1                  ┆ 1                  ┆ 1                  │
└───────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┘

However, if there is only a single values column, then the new column names use the unique values found within the columns only. Thus, if there is any value that is common to the columns columns, we end up with a duplicate column name.

Expected behavior

We should probably simply always include the values and columns names, and perhaps in a separate issue allow for a name generator.

Installed versions

--------Version info---------
Polars:               0.20.5
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.7 (tags/v3.11.7:fa7a6f2, Dec  4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.4
pyarrow:              14.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.23
xlsx2csv:             0.8.2
xlsxwriter:           3.1.9

The text was updated successfully, but these errors were encountered:

cmdlineluser · 2024-01-25T16:26:35Z

Haven't checked if it is the exact same cause, but there has been some previous on this topic which is probably relevant:

MarcoGorelli · 2024-01-25T18:39:28Z

yup, I'd say that this is a duplicate of #11663

FWIW tomorrow I have all day free to do Polars work so I'll do a full-immersion into this, hope I can come up with a way forwards 🤞

mcrumiller · 2024-01-25T18:47:04Z

@MarcoGorelli great. I think first we should agree on a path forward though.

pivot currently auto-determines the output column names. If there are multiple values parameters, it comes up with column names with the format [values_column1]_[columns_column1]_[value]. If there is only a single values parameter, it comes up with column names that are simply [value].

I think the best way forward is to use the first format for everything (which avoids duplicate column names, but creates messy looking names always) by default, but also provide an output naming strategy. I'm not entirely sure what the best way to approach implementing the naming strategy would be, so that is something to think about.

stinodego · 2024-01-26T05:20:26Z

Duplicate of #11663

mcrumiller added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 25, 2024

mcrumiller mentioned this issue Jan 25, 2024

feat(python): auto-determine index/columns/values columns in pivot if one is left out, deprecate passing arguments positionally #12125

Closed

stinodego marked this as a duplicate of #11663 Jan 26, 2024

stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Jan 26, 2024

stinodego removed the needs triage Awaiting prioritization by a maintainer label Jan 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`pivot` can create dataframe with duplicate columns #13994

`pivot` can create dataframe with duplicate columns #13994

mcrumiller commented Jan 25, 2024

cmdlineluser commented Jan 25, 2024

MarcoGorelli commented Jan 25, 2024

mcrumiller commented Jan 25, 2024

stinodego commented Jan 26, 2024

pivot can create dataframe with duplicate columns #13994

pivot can create dataframe with duplicate columns #13994

Comments

mcrumiller commented Jan 25, 2024

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions

cmdlineluser commented Jan 25, 2024

MarcoGorelli commented Jan 25, 2024

mcrumiller commented Jan 25, 2024

stinodego commented Jan 26, 2024

`pivot` can create dataframe with duplicate columns #13994

`pivot` can create dataframe with duplicate columns #13994