-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add option to use nullable dtypes in read_csv #48776
Conversation
d79a552
to
af6056b
Compare
pandas/io/parsers/base_parser.py
Outdated
""" | ||
Infer types of values, possibly casting | ||
|
||
Parameters | ||
---------- | ||
values : ndarray | ||
na_values : set | ||
cast_type: Specifies if we want to cast explicitly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make this bool? Looks like we only need to check that it's not None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
bool_mask = np.zeros(result.shape, dtype=np.bool_) | ||
result = BooleanArray(result, bool_mask) | ||
elif result.dtype == np.object_ and use_nullable_dtypes: | ||
result = StringDtype().construct_array_type()._from_sequence(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you test what happens when the string pyarrow global config is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
parser = all_parsers | ||
|
||
data = """a,b,c,d,e,f,g,h,i | ||
1,2.5,True,a,,,,,12-31-2019 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a column here where both rows have an empty value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, casts to Int64 now in both cases. Better question is what we actually want here, because this could be everything
Probably worth discussing in the issue, but I want to mention here since this will be the first instance of Motivation: It would be cool to see a state where Understandably An alternative and easier path to my goal would be to have |
Wouldn’t it be easier to do this via an option like we are currently inferring for string? I guess this should also apply to our constructors etc? Similar to what the final state of nullable is supposed to be. Provide a global option to opt into them |
Hmm so the end state idea could be have like a global option like I am taking a particular focus on IO methods since I am hoping to avoid the jump from pa.Table -> np.array -> pa.ChunkedArray (in theory) with |
Currently the idea is to make a global option to opt into nullable dtypes, yes. I think we can most certainly make this into a three way option to allow arrow too. But wouldn’t this cause problems on the first operation done with a object backed by a numpy array? |
Are you referring to the IO conversion I mentioned? |
Ah no, sorry. More like if you get a pyarrow backed object from IO but a NDArray backed object from a constructor and you want to combine them somehow (concat, merge, ...) Basically what I wanted to ask: Wouldn't it make more sense if everything could be backed by arrow if a single flag is set to avoid these inconsistencies? |
Ah yeah, most definitely. I haven't really encountered/thought too hard about the As long as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked pretty good. Could you merge in main once more?
@@ -385,3 +390,79 @@ def test_dtypes_defaultdict_invalid(all_parsers): | |||
parser = all_parsers | |||
with pytest.raises(TypeError, match="not understood"): | |||
parser.read_csv(StringIO(data), dtype=dtype) | |||
|
|||
|
|||
def test_use_nullabla_dtypes(all_parsers): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: typo here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, fixed
3,4.5,False,b,6,7.5,True,a,12-31-2019, | ||
""" | ||
result = parser.read_csv( | ||
StringIO(data), use_nullable_dtypes=True, parse_dates=["i"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you parametrize for use_nullable_dtypes = True/False here and for the other tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is impossible to understand if paramterized. Expected looks completely different. I could add a new test in theory, but would not bring much value, we are testing all possible cases already with numpy dtypes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks for checking.
Thanks @phofl |
ENH: Add option to use nullable dtypes in read_csv (pandas-dev#48776)
* ENH: Add option to use nullable dtypes in read_csv * Finish implementation * Update * Fix mypy * Add tests and fix call * Fix typo
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.