ENH: Add option to use nullable dtypes in read_csv #48776

Merged (10 commits) on Oct 7, 2022


phofl
Member

@phofl phofl commented Sep 25, 2022

@phofl phofl added IO CSV read_csv, to_csv NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Sep 25, 2022
@phofl phofl marked this pull request as draft September 25, 2022 20:46
@phofl phofl marked this pull request as ready for review September 29, 2022 08:14
"""
Infer types of values, possibly casting

Parameters
----------
values : ndarray
na_values : set
cast_type: Specifies if we want to cast explicitly
Member

@mroeschke mroeschke Sep 30, 2022

Could we make this bool? Looks like we only need to check that it's not None?

Member Author

Changed

bool_mask = np.zeros(result.shape, dtype=np.bool_)
result = BooleanArray(result, bool_mask)
elif result.dtype == np.object_ and use_nullable_dtypes:
result = StringDtype().construct_array_type()._from_sequence(values)
Member

Could you test what happens when the string pyarrow global config is true?

Member Author

Done
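
For readers following the snippet under review, here is a minimal sketch of the two conversions it performs, written against pandas' public API rather than the internal helpers used in the PR. The variable names are illustrative, and the use of `pd.arrays.BooleanArray` and `pd.array(..., dtype="string")` as stand-ins for the internal calls is an assumption.

```python
import numpy as np
import pandas as pd

# A masked boolean array pairs a values ndarray with a same-shaped mask,
# where True in the mask marks a missing entry.
values = np.array([True, False, True])
mask = np.zeros(values.shape, dtype=np.bool_)  # nothing is missing here
bool_arr = pd.arrays.BooleanArray(values, mask)

# Object (string) values are converted to the nullable string dtype;
# None becomes pd.NA.
str_arr = pd.array(["a", "b", None], dtype="string")

print(bool_arr.dtype, str_arr.dtype)
```

Which storage backs the string array (Python objects or pyarrow) depends on the global string storage option, which is what the review comment above asks to have tested.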

parser = all_parsers

data = """a,b,c,d,e,f,g,h,i
1,2.5,True,a,,,,,12-31-2019
Member

Could you add a column here where both rows have an empty value?

Member Author

Added; it casts to Int64 now in both cases. The better question is what we actually want here, because an all-empty column could be any dtype.
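
To illustrate why the choice is arbitrary, the sketch below builds the same all-missing column under several nullable dtypes using `pd.array`. It does not call the parser; the list of candidate dtypes is an assumption chosen for illustration.

```python
import pandas as pd

# An all-missing column carries no type information: each of these nullable
# dtypes can hold two missing values equally well.
candidates = {
    dtype: pd.array([pd.NA, pd.NA], dtype=dtype)
    for dtype in ("Int64", "Float64", "boolean", "string")
}

# Every candidate is entirely missing, so the data cannot disambiguate them.
all_missing = {dtype: bool(arr.isna().all()) for dtype, arr in candidates.items()}
print(all_missing)
```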

@mroeschke
Member

Probably worth discussing in the issue, but I want to mention here since this will be the first instance of use_nullable_dtypes.

Motivation: It would be cool to see a state where read_*(engine="pyarrow") can result in a DataFrame that is backed by ArrowExtensionArray (trying to avoid the conversion to numpy)

Understandably, engine="pyarrow" doesn't do that today, and it may be fairly difficult to change (deprecate) that behavior in the future.

An alternative and easier path to my goal would be to have use_nullable_dtypes="pandas"|"pyarrow" (default "pandas") to allow the user to pick the "nullable representation". Thoughts?

@phofl
Member Author

phofl commented Sep 30, 2022

Wouldn’t it be easier to do this via an option like we are currently inferring for string? I guess this should also apply to our constructors etc?

That would be similar to what the final state of nullable dtypes is supposed to be: provide a global option to opt into them.

@mroeschke
Member

Wouldn’t it be easier to do this via an option like we are currently inferring for string?

Hmm, so the end-state idea could be to have a global option like mode.nullable=None|"pandas"|"pyarrow", where the dtype/array representation is numpy, pd.array & pd.NA, or pa.array & pd.NA respectively, more or less consistently?

I am taking a particular focus on IO methods since I am hoping to avoid the jump from pa.Table -> np.array -> pa.ChunkedArray (in theory) with engine="pyarrow" and just have pa.Table -> pa.ChunkedArray

@phofl
Member Author

phofl commented Sep 30, 2022

Currently the idea is to make a global option to opt into nullable dtypes, yes. I think we can most certainly make this into a three way option to allow arrow too.

But wouldn’t this cause problems on the first operation done with an object backed by a numpy array?

@mroeschke
Member

But wouldn’t this cause problems on the first operation done with an object backed by a numpy array?

Are you referring to the IO conversion I mentioned?

@phofl
Member Author

phofl commented Sep 30, 2022

Ah no, sorry. More like if you get a pyarrow-backed object from IO but an ndarray-backed object from a constructor, and you want to combine them somehow (concat, merge, ...)

Basically what I wanted to ask: Wouldn't it make more sense if everything could be backed by arrow if a single flag is set to avoid these inconsistencies?

@mroeschke
Member

Wouldn't it make more sense if everything could be backed by arrow if a single flag is set to avoid these inconsistencies?

Ah yeah, most definitely. I haven't really encountered/thought too hard about the op(arrow-backed, ndarray-backed) outcomes yet, but a global option would hopefully avoid this.

As long as use_nullable_dtypes=True + a global config can lead to maintaining arrow objects from parsing, that would satisfy my goal.
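
The mixed-backing concern raised in this thread can be sketched without pyarrow. The example below is an assumption-laden stand-in: it uses a masked `Int64` Series in place of an arrow-backed IO result and combines it with a plain numpy-backed Series, so it only shows the general shape of the problem, not arrow-specific behavior.

```python
import pandas as pd

# Stand-in for a nullable result coming out of an IO method.
from_io = pd.Series([1, None], dtype="Int64")

# A Series built by the constructor defaults to a numpy-backed int64.
from_ctor = pd.Series([3, 4])

# Combining the two forces pandas to pick a common dtype for the result.
combined = pd.concat([from_io, from_ctor], ignore_index=True)
print(from_io.dtype, from_ctor.dtype, combined.dtype)
```

Here the nullable dtype wins, but each such combination relies on a common-dtype rule; a single global option would avoid depending on those rules case by case.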

Member

@mroeschke mroeschke left a comment

Looked pretty good. Could you merge in main once more?

@@ -385,3 +390,79 @@ def test_dtypes_defaultdict_invalid(all_parsers):
parser = all_parsers
with pytest.raises(TypeError, match="not understood"):
parser.read_csv(StringIO(data), dtype=dtype)


def test_use_nullabla_dtypes(all_parsers):
Member

nit: typo here and below.

Member Author

Thx, fixed

3,4.5,False,b,6,7.5,True,a,12-31-2019,
"""
result = parser.read_csv(
StringIO(data), use_nullable_dtypes=True, parse_dates=["i"]
Member

Can you parametrize for use_nullable_dtypes = True/False here and for the other tests?

Member Author

No, this would be impossible to understand if parametrized; expected looks completely different. I could add a new test in theory, but it would not bring much value, since we are already testing all possible cases with numpy dtypes.

Member

OK, thanks for checking.
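
To show how far the expected frames diverge, here is a small sketch that reads the same CSV both ways. One assumption to flag loudly: released pandas (2.0 and later) exposes this behavior through `dtype_backend="numpy_nullable"` rather than the `use_nullable_dtypes` keyword added in this PR, so the sketch uses the released spelling. The CSV contents are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Illustrative data: column "d" has one empty cell.
data = "a,b,c,d\n1,2.5,True,\n3,4.5,False,6\n"

# Default parsing: numpy dtypes, with the missing integer forcing float64.
default = pd.read_csv(StringIO(data))

# Nullable parsing (assumes pandas >= 2.0, where the option is dtype_backend).
nullable = pd.read_csv(StringIO(data), dtype_backend="numpy_nullable")

print(default.dtypes)
print(nullable.dtypes)
```

Column `d` comes back as `float64` with `NaN` by default and as `Int64` with `pd.NA` in the nullable case, which is the kind of difference that makes a shared parametrized expectation hard to read.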

@mroeschke mroeschke added this to the 1.6 milestone Oct 7, 2022
@mroeschke mroeschke merged commit 7f24bff into pandas-dev:main Oct 7, 2022
@mroeschke
Member

Thanks @phofl

zain581 added a commit to zain581/pandas that referenced this pull request Oct 7, 2022
ENH: Add option to use nullable dtypes in read_csv (pandas-dev#48776)
@phofl phofl deleted the use_nullable_dtypes branch October 7, 2022 17:09
@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* ENH: Add option to use nullable dtypes in read_csv

* Finish implementation

* Update

* Fix mypy

* Add tests and fix call

* Fix typo
Development

Successfully merging this pull request may close these issues.

ENH: add option to get nullable dtypes to pd.read_csv