
BUG: Fix some more arrow CSV tests #52087

Merged (14 commits into pandas-dev:main, Apr 10, 2023)
Conversation

lithomas1 (Member):

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@lithomas1 added the "IO CSV" (read_csv, to_csv) and "Arrow" (pyarrow functionality) labels on Mar 20, 2023
@@ -158,6 +160,19 @@ def read(self) -> DataFrame:
parse_options=pyarrow_csv.ParseOptions(**self.parse_options),
convert_options=pyarrow_csv.ConvertOptions(**self.convert_options),
)

# Convert all pa.null() cols -> float64
# TODO: There has to be a better way... right?

lithomas1 (Member, Author):
What I'm looking for here is something like a combination of select_dtypes and astype.

I couldn't find anything in the pyarrow docs, but maybe I'm just bad at Googling.

Sadly, types_mapper doesn't work, since you can only feed it extension array (EA) dtypes.
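
(For illustration, a minimal sketch of that limitation; not the PR's code:)

    import pyarrow as pa
    import pandas as pd

    table = pa.table({"x": pa.nulls(3)})  # one all-null column of type pa.null()

    # types_mapper must return pandas ExtensionDtypes (or None), so you can map
    # pa.null() to the nullable Float64Dtype, but not to plain numpy float64.
    df = table.to_pandas(types_mapper={pa.null(): pd.Float64Dtype()}.get)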

lithomas1 (Member, Author):

Maybe @jorisvandenbossche knows a better way?

Member:

Does Table.select not work?

Member:

I don't think there is an easier way. We currently don't have something like select_dtypes, so what you do (iterating over the schema, checking each type, and adapting the new schema where needed) seems the best option.
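
(A minimal sketch of that schema-iteration approach, with a hypothetical helper name:)

    import pyarrow as pa

    def cast_null_cols_to_float64(table: pa.Table) -> pa.Table:
        # Walk the schema, swap every pa.null() field for float64, then cast once.
        new_schema = table.schema
        for i, field in enumerate(table.schema):
            if field.type == pa.null():
                new_schema = new_schema.set(i, field.with_type(pa.float64()))
        return table.cast(new_schema)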

Review thread on pandas/io/_util.py (outdated, resolved)
@lithomas1 lithomas1 marked this pull request as ready for review March 24, 2023 14:59
dtype_backend = self.kwds["dtype_backend"]

# Convert all pa.null() cols -> float64 (non nullable)
# else Int64 (nullable case)

lithomas1 (Member, Author):

This was really confusing for me, but apparently even convert_dtypes on an all-null (float64) column will change it to Int64.
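
(A quick check of that behavior in a REPL:)

>>> import numpy as np
>>> import pandas as pd
>>> pd.Series([np.nan, np.nan]).convert_dtypes().dtype  # float64 in, all null
Int64Dtype()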

Member:

I am not sure we should replicate that here.
In convert_dtypes, I suppose that is the consequence of trying to cast a float column where all non-null values can faithfully be represented as integers, which gives this corner case when that set of values is empty.

But if you don't know the dtype you are starting with (like here), I think using float is safer (it won't give problems later on if you then want to use it for floats and not just integers). Actually, object dtype might be safest (it can represent everything), but of course we don't (yet) have a nullable object dtype, so that's probably not really an option.

lithomas1 (Member, Author):

The problem is that the C and Python parsers already convert to Int64. Is this something we need to fix before 2.0?

>>> import pandas as pd
>>> from io import StringIO
>>> f = StringIO("a\n1,")
>>> pd.read_csv(f, dtype_backend="numpy_nullable")
      a
1  <NA>
>>> f.seek(0)
0
>>> pd.read_csv(f, dtype_backend="numpy_nullable").dtypes
a    Int64
dtype: object
>>> f.seek(0)
0
>>> pd.read_csv(f)
    a
1 NaN
>>> f.seek(0)
0
>>> pd.read_csv(f).dtypes
a    float64
dtype: object

lithomas1 (Member, Author):

cc @phofl.

Member:

We do this everywhere, so I am +1 on doing Int64 here as well. Initially this came from read_parquet, I think.

@@ -8,6 +8,9 @@
def _arrow_dtype_mapping() -> dict:
pa = import_optional_dependency("pyarrow")
return {
# All nulls should still give Float64 not object
# TODO: This breaks parquet
# pa.null(): pd.Float64Dtype(),

Member:

In what way does it break parquet?

lithomas1 (Member, Author):

Well, first I hit this error.

>>> import pandas as pd
>>> import pyarrow as pa
>>> a = pa.nulls(0)
>>> pd.Float64Dtype().__from__arrow__(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Float64Dtype' object has no attribute '__from__arrow__'

After patching that (another PR incoming), an empty DataFrame of object dtype (pd.DataFrame({"value": pd.array([], dtype=object)})) now comes back as Float64Dtype when roundtripped.

Best guess is that the types_mapper is somehow overriding the pandas metadata.
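
(A rough sketch of the roundtrip in question; the Float64 result described here is with the commented-out pa.null() mapping enabled, not released pandas:)

    import pandas as pd

    df = pd.DataFrame({"value": pd.array([], dtype=object)})  # empty object column
    df.to_parquet("empty.parquet")  # stored as an arrow null column
    back = pd.read_parquet("empty.parquet", dtype_backend="numpy_nullable")
    # with the mapping above, back["value"] reportedly comes out as Float64
    # instead of roundtripping to object via the pandas metadata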

lithomas1 (Member, Author):

Anyone want to have another look?
Assuming we are OK with read_csv returning Int64 when dtype_backend is numpy_nullable, this should be good to go now.

if dtype_backend != "pyarrow":
    new_schema = table.schema
    if dtype_backend == "numpy_nullable":
        new_type = pa.int64()

Member:

Can't you just patch the _arrow_dtype_mapping?

lithomas1 (Member, Author):

Will need to wait for #52223 then.
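
(A minimal sketch of the suggested patching approach, assuming the shared mapping from pandas/io/_util.py shown above:)

    import pyarrow as pa
    import pandas as pd
    from pandas.io._util import _arrow_dtype_mapping

    table = pa.table({"x": pa.nulls(2)})  # stand-in for the parsed CSV table

    # Start from the shared arrow -> nullable-dtype mapping, override the
    # null entry locally, and hand the lookup to pyarrow as a types_mapper.
    mapping = _arrow_dtype_mapping()
    mapping[pa.null()] = pd.Int64Dtype()
    frame = table.to_pandas(types_mapper=mapping.get)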

Member:

LGTM, but yeah, since #52223 is close, let's wait for that one and then merge this.

"h": pd.Series(
[pd.NA if parser.engine != "pyarrow" else "", "a"], dtype="string"
),
"h": pd.Series([pd.NA, "a"], dtype="string"),

Member:

Thx for these

@lithomas1 lithomas1 added this to the 2.1 milestone Apr 10, 2023
@lithomas1 lithomas1 requested a review from phofl April 10, 2023 11:53
@phofl phofl merged commit f29ef30 into pandas-dev:main Apr 10, 2023

phofl (Member) commented Apr 10, 2023:

thx @lithomas1

lithomas1 (Member, Author):

@meeseeksdev backport 2.0.x.

@lithomas1 lithomas1 modified the milestones: 2.1, 2.0.2 May 20, 2023

lumberbot-app (bot) commented May 20, 2023:

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Checkout the backport branch and update it:
     git checkout 2.0.x
     git pull
  2. Cherry-pick the first parent of this PR on top of the older branch:
     git cherry-pick -x -m1 f29ef30e9939468e1b866b6a554ee6b69b8322c5
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
     git commit -am 'Backport PR #52087: BUG: Fix some more arrow CSV tests'
  4. Push to a named branch:
     git push YOURFORK 2.0.x:auto-backport-of-pr-52087-on-2.0.x
  5. Create a PR against branch 2.0.x; I would have named this PR:

"Backport PR #52087 on branch 2.0.x (BUG: Fix some more arrow CSV tests)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.
