BUG: read_csv not respecting object dtype when option is set #56047

Merged
merged 14 commits into pandas-dev:main from csv_dtype_string_option on Dec 9, 2023

phofl
Member

@phofl phofl commented Nov 18, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

We are not honouring object dtype here. Thoughts on performance, @jbrockmendel?

@mroeschke mroeschke added the IO CSV read_csv, to_csv label Nov 26, 2023
@phofl
Member Author

phofl commented Nov 29, 2023

Can we get this one in?

@@ -1846,7 +1851,29 @@ def read(self, nrows: int | None = None) -> DataFrame:
    else:
        new_rows = len(index)

    df = DataFrame(col_dict, columns=columns, index=index)
    if hasattr(self, "orig_options"):
        dtype_arg = self.orig_options.get("dtype", None)
Member

Is the dtype option normally applied in _engine.read? Just curious why it needs to be done here.

Member Author

Yes, but when the option is set, the DataFrame constructor infers object columns as string again, which would discard the dtype that was originally requested.

Member

OK, makes sense.

Could we avoid looping over col_dict when dtype isn't specified to be object-like?

Member Author

Updated. We now only do this if dtype is a dict or an object dtype.
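
The guard discussed in this thread can be sketched roughly as follows. This is a minimal, pure-Python illustration and not the merged pandas code: `columns_to_pin` is a hypothetical helper, and dtypes are represented by plain Python values (`object`, `"object"`, `"int64"`) so the snippet runs without pandas. It returns the columns whose object dtype would need to be re-applied before the DataFrame is built, and an empty dict when `dtype` is neither a dict nor an object dtype, in which case the per-column loop can be skipped.

```python
from __future__ import annotations

# Assumption: these spellings stand in for "object dtype" in this sketch.
OBJECT_LIKE = {object, "object", "O"}


def columns_to_pin(dtype_arg, columns: list[str]) -> dict[str, str]:
    """Return {column: "object"} for columns the caller asked to keep as object.

    Hypothetical helper: an empty dict means there is nothing to re-apply.
    """
    if isinstance(dtype_arg, dict):
        # Per-column spec: pin only the columns explicitly requested as object.
        return {
            col: "object"
            for col in columns
            if col in dtype_arg and dtype_arg[col] in OBJECT_LIKE
        }
    if dtype_arg is not None and dtype_arg in OBJECT_LIKE:
        # A single object dtype applies to every column.
        return {col: "object" for col in columns}
    # No dtype, or a non-object dtype: nothing to pin.
    return {}


print(columns_to_pin({"a": object, "b": "int64"}, ["a", "b"]))
print(columns_to_pin("int64", ["a", "b"]))
```

The early return of an empty dict is the point of the review suggestion: the common case (no dtype, or a numeric dtype) pays no extra per-column cost.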

@@ -295,18 +295,8 @@ def read(self) -> DataFrame:
    dtype_mapping[pa.null()] = pd.Int64Dtype()
    frame = table.to_pandas(types_mapper=dtype_mapping.get)
elif using_pyarrow_string_dtype():

Member Author

These mappers don't work: Arrow's types_mapper maps type -> type, not column -> type.
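
To illustrate the distinction: a `types_mapper` callable is given an Arrow data type, never a column name, so a mapping keyed by column cannot take effect. The toy below mimics that contract in plain Python so it runs without pyarrow. The `convert` function, the string type tags, and the column names are all assumptions made for illustration, not pyarrow or pandas APIs.

```python
from __future__ import annotations


def convert(table: dict[str, tuple[str, list]], types_mapper) -> dict[str, str]:
    """Toy stand-in for a types_mapper-style conversion (hypothetical).

    `table` maps column name -> (type tag, values). The mapper only ever
    sees the type tag, mirroring a type -> type contract.
    """
    result = {}
    for name, (type_tag, _values) in table.items():
        mapped = types_mapper(type_tag)  # the column name is never passed
        result[name] = mapped if mapped is not None else type_tag
    return result


table = {"a": ("string", ["x", "y"]), "b": ("string", ["1", "2"])}

# Keyed by type: applies to every column of that type.
by_type = {"string": "object"}
# Keyed by column: never matches, because the mapper receives types.
by_column = {"a": "object"}

print(convert(table, by_type.get))    # both columns are remapped
print(convert(table, by_column.get))  # no column is remapped
```

Under this contract, a per-column dtype request has to be applied in a separate step after conversion rather than through the mapper.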

@phofl
Member Author

phofl commented Dec 8, 2023

cc @mroeschke gentle ping

@mroeschke mroeschke added this to the 2.2 milestone Dec 9, 2023
@mroeschke mroeschke merged commit fb05cc7 into pandas-dev:main Dec 9, 2023
44 checks passed
@mroeschke
Member

Thanks @phofl

@phofl phofl deleted the csv_dtype_string_option branch December 9, 2023 19:39
@@ -1846,7 +1853,40 @@ def read(self, nrows: int | None = None) -> DataFrame:
    else:
        new_rows = len(index)

    df = DataFrame(col_dict, columns=columns, index=index)
    if hasattr(self, "orig_options"):
Member

Can we do something more explicit than a hasattr check?

Member Author

No. The reader can be subclassed, so we have no control over whether the attribute exists.

Member

Does anybody actually do this? I judge those people, their ethics, and their hygiene.

Member Author

That's something I can't answer. We might want to deprecate subclassing, but we are stuck with hasattr here until then.
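
For context, the situation being described looks roughly like the sketch below. It is a generic Python illustration with hypothetical class names (`Reader`, `CustomReader`), not the pandas reader itself: a subclass that overrides `__init__` without calling the parent never sets `orig_options`, so the base-class method has to tolerate the missing attribute.

```python
class Reader:
    """Hypothetical base class that normally records its options."""

    def __init__(self, **options):
        self.orig_options = options

    def requested_dtype(self):
        # A subclass may have skipped Reader.__init__, so the attribute
        # is not guaranteed to exist on every instance.
        if hasattr(self, "orig_options"):
            return self.orig_options.get("dtype", None)
        return None


class CustomReader(Reader):
    """Hypothetical subclass that does not call the parent constructor."""

    def __init__(self):
        self.source = "in-memory"


print(Reader(dtype=object).requested_dtype())
print(CustomReader().requested_dtype())
```

Without the `hasattr` guard, the second call would raise `AttributeError`, which is the failure mode the check is protecting against.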
