CSV reader column projection does not respect order #10572

Closed
2 tasks done
orlp opened this issue Aug 17, 2023 · 3 comments
Labels
A-io (Area: reading and writing data), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@orlp
Collaborator

orlp commented Aug 17, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import io

df = pl.read_csv(io.StringIO("A,B\n1,2"), columns=["B", "A"], new_columns=["b", "a"])
print(df)

gives output

shape: (1, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘

Issue description

The columns parameter of pl.read_csv does not project the columns in the right order in the resulting dataframe.

This is especially jarring when specifying both columns and new_columns: new_columns still renames the columns in the order they appear in the CSV, rather than the order specified in columns. I would expect columns to do a projection first, after which new_columns renames the result in the order given in columns.

Expected behavior

shape: (1, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 1   │
└─────┴─────┘
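
The expected semantics (project in the order given by columns, then rename positionally with new_columns) can be modelled without Polars. This is only a sketch of the behavior requested in this issue, not Polars' implementation; `read_projected` is a hypothetical helper built on the standard-library csv module, and it keeps values as strings rather than inferring dtypes.

```python
import csv
import io


def read_projected(text, columns, new_columns=None):
    """Hypothetical model: project in the order of `columns`, then rename."""
    rows = list(csv.DictReader(io.StringIO(text)))
    names = new_columns if new_columns is not None else columns
    if len(names) != len(columns):
        raise ValueError("new_columns must have the same length as columns")
    # Column-oriented result keyed by the new names, in the requested order.
    return {
        new: [row[old] for row in rows]
        for old, new in zip(columns, names)
    }


result = read_projected("A,B\n1,2", columns=["B", "A"], new_columns=["b", "a"])
print(result)  # {'b': ['2'], 'a': ['1']}
```

Under this model, b holds the values of B and a holds the values of A, which matches the expected table above.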

Installed versions

I can reproduce on main.

@orlp orlp added the bug and python labels Aug 17, 2023
@mcrumiller
Contributor

mcrumiller commented Aug 17, 2023

Ahh yes I meant to submit this one a few months ago. Thank you. What's even weirder is that if you supply dtypes, the ordering seems to make even less sense:

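import io
import polars as pl

df = pl.read_csv(
    io.StringIO(
        "A,B,C\n"
        "a,1,3\n"
        "b,2,4"
    ),
    columns=["B", "C", "A"],
    new_columns=["b", "c", "a"],
    # one dtype per entry in `columns`, intended to be in that order
    dtypes=[pl.UInt8, pl.Int32, pl.Utf8],
)
print(df)

shape: (2, 3)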
┌─────┬─────┬─────┐
│ b   ┆ c   ┆ a   │
│ --- ┆ --- ┆ --- │
│ str ┆ u8  ┆ i32 │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ 3   │
│ b   ┆ 2   ┆ 4   │
└─────┴─────┴─────┘

This can be explained by the following order of operations, which may not be correct and needs more testing:

  1. dtypes are mapped positionally onto columns: B to UInt8, C to Int32, A to Utf8.
  2. Those columns are read in their original file order (A, B, C), each with the dtype mapped above.
  3. The columns are then renamed positionally with the new_columns parameter: A becomes b, B becomes c, C becomes a.
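
The three steps above can be sketched in plain Python to check that they reproduce the table. This is a model of the hypothesis only, not Polars' actual code path; `observed_read` is a hypothetical helper, and the dtypes are plain string placeholders rather than Polars types.

```python
import csv
import io


def observed_read(text, columns, new_columns, dtypes):
    """Hypothetical model of the three hypothesized steps."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(reader)
    # Step 1: pair each dtype with the name at the same position in `columns`.
    dtype_of = dict(zip(columns, dtypes))
    # Step 2: keep the selected columns, but in original file order.
    kept = [name for name in header if name in dtype_of]
    # Step 3: rename positionally, ignoring the order given in `columns`.
    out = {}
    for new_name, old_name in zip(new_columns, kept):
        idx = header.index(old_name)
        out[new_name] = (dtype_of[old_name], [row[idx] for row in rows])
    return out


model = observed_read(
    "A,B,C\na,1,3\nb,2,4",
    columns=["B", "C", "A"],
    new_columns=["b", "c", "a"],
    dtypes=["UInt8", "Int32", "Utf8"],  # placeholder names, not pl types
)
for name, (dtype, values) in model.items():
    print(name, dtype, values)
```

The model yields b as Utf8 with values a, b; c as UInt8 with values 1, 2; and a as Int32 with values 3, 4, which is the same arrangement as the output shown above.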

@ritchie46
Member

Maybe we should simply dispatch via the lazy API in this case. That solves all of these.

@stinodego stinodego added the needs triage and A-io labels Jan 13, 2024
@stinodego
Member

> Maybe we should simply dispatch via the lazy API in this case. That solves all of these.

Yes, that's what we should do for columns.

new_columns is actually working correctly here. Closing in favor of #13066 which mentions just the columns issue.

@stinodego stinodego closed this as not planned Jun 8, 2024