ENH: Add option to use nullable dtypes in read_csv #48776

phofl · 2022-09-25T20:45:54Z

closes ENH: add option to get nullable dtypes to pd.read_csv #36712 (Replace xxxx with the Github issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

mroeschke · 2022-09-30T00:57:25Z

pandas/io/parsers/base_parser.py

        """
        Infer types of values, possibly casting

        Parameters
        ----------
        values : ndarray
        na_values : set
+        cast_type: Specifies if we want to cast explicitly


Could we make this bool? Looks like we only need to check that it's not None?

mroeschke · 2022-09-30T01:02:06Z

pandas/io/parsers/base_parser.py

+                    bool_mask = np.zeros(result.shape, dtype=np.bool_)
+                result = BooleanArray(result, bool_mask)
+            elif result.dtype == np.object_ and use_nullable_dtypes:
+                result = StringDtype().construct_array_type()._from_sequence(values)


Could you test what happens when the string pyarrow global config is true?

mroeschke · 2022-09-30T01:03:21Z

pandas/tests/io/parser/dtypes/test_dtypes_basic.py

+    parser = all_parsers
+
+    data = """a,b,c,d,e,f,g,h,i
+1,2.5,True,a,,,,,12-31-2019


Could you add a column here where both rows have an empty value?

Added, casts to Int64 now in both cases. Better question is what we actually want here, because this could be everything

mroeschke · 2022-09-30T17:08:32Z

Probably worth discussing in the issue, but I want to mention here since this will be the first instance of use_nullable_dtypes.

Motivation: It would be cool to see a state where read_*(engine="pyarrow") can result in a DataFrame that is backed by ArrowExtensionArray (trying to avoid the conversion to numpy)

Understandably engine="pyarrow" doesn't do that today and may be fairly difficult to change (deprecate) that behavior in the future.

An alternative and easier path to my goal would be to have use_nullable_dtypes="pandas"|"pyarrow" (default "pandas") to allow the user to pick the "nullable representation". Thoughts?

phofl · 2022-09-30T17:12:06Z

Wouldn’t it be easier to do this via an option like we are currently inferring for string? I guess this should also apply to our constructors etc?

Similar to what the final state of nullable is supposed to be. Provide a global option to opt into them

mroeschke · 2022-09-30T17:30:57Z

Wouldn’t it be easier to do this via an option like we are currently inferring for string?

Hmm so the end state idea could be have like a global option like mode.nullable=None|"pandas"|"pyarrow" where dtype/array representation is either numpy/pd.array & pd.NA/ pa.array & pd.NA more or less consistently?

I am taking a particular focus on IO methods since I am hoping to avoid the jump from pa.Table -> np.array -> pa.ChunkedArray (in theory) with engine="pyarrow" and just have pa.Table -> pa.ChunkedArray

phofl · 2022-09-30T17:33:34Z

Currently the idea is to make a global option to opt into nullable dtypes, yes. I think we can most certainly make this into a three way option to allow arrow too.

But wouldn’t this cause problems on the first operation done with a object backed by a numpy array?

mroeschke · 2022-09-30T18:29:04Z

But wouldn’t this cause problems on the first operation done with a object backed by a numpy array?

Are you referring to the IO conversion I mentioned?

phofl · 2022-09-30T18:30:44Z

Ah no, sorry. More like if you get a pyarrow backed object from IO but a NDArray backed object from a constructor and you want to combine them somehow (concat, merge, ...)

Basically what I wanted to ask: Wouldn't it make more sense if everything could be backed by arrow if a single flag is set to avoid these inconsistencies?

mroeschke · 2022-09-30T18:40:50Z

Wouldn't it make more sense if everything could be backed by arrow if a single flag is set to avoid these inconsistencies?

Ah yeah, most definitely. I haven't really encountered/thought too hard about the op(arrow-backed, ndarray-backed) outcomes yet, but a global option would hopefully avoid this.

As long as use_nullable_types=True + a global config can lead to maintaining arrow objects from parsing, that would satisfy my goal.

mroeschke

Looked pretty good. Could you merge in main once more?

lithomas1 · 2022-10-01T23:23:24Z

pandas/tests/io/parser/dtypes/test_dtypes_basic.py

@@ -385,3 +390,79 @@ def test_dtypes_defaultdict_invalid(all_parsers):
    parser = all_parsers
    with pytest.raises(TypeError, match="not understood"):
        parser.read_csv(StringIO(data), dtype=dtype)
+
+
+def test_use_nullabla_dtypes(all_parsers):


nit: typo here and below.

lithomas1 · 2022-10-01T23:24:25Z

pandas/tests/io/parser/dtypes/test_dtypes_basic.py

+3,4.5,False,b,6,7.5,True,a,12-31-2019,
+"""
+    result = parser.read_csv(
+        StringIO(data), use_nullable_dtypes=True, parse_dates=["i"]


Can you parametrize for use_nullable_dtypes = True/False here and for the other tests?

No this is impossible to understand if paramterized. Expected looks completely different. I could add a new test in theory, but would not bring much value, we are testing all possible cases already with numpy dtypes

OK, thanks for checking.

mroeschke · 2022-10-07T16:41:40Z

Thanks @phofl

ENH: Add option to use nullable dtypes in read_csv (pandas-dev#48776)

* ENH: Add option to use nullable dtypes in read_csv * Finish implementation * Update * Fix mypy * Add tests and fix call * Fix typo

phofl added 3 commits September 21, 2022 21:13

ENH: Add option to use nullable dtypes in read_csv

d1c0b51

Finish implementation

d7a7eca

Update

4f05540

phofl added IO CSV read_csv, to_csv NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Sep 25, 2022

phofl marked this pull request as draft September 25, 2022 20:46

Merge remote-tracking branch 'upstream/main' into use_nullable_dtypes

afe9ca5

vuule mentioned this pull request Sep 28, 2022

[BUG] cudf.read_csv should not cast to floating types if there are null entries in csv rapidsai/cudf#6313

Open

Fix mypy

af6056b

phofl force-pushed the use_nullable_dtypes branch from d79a552 to af6056b Compare September 29, 2022 08:13

Merge remote-tracking branch 'upstream/main' into use_nullable_dtypes

8b0dc2f

phofl marked this pull request as ready for review September 29, 2022 08:14

Merge remote-tracking branch 'upstream/main' into use_nullable_dtypes

291e761

mroeschke reviewed Sep 30, 2022

View reviewed changes

Add tests and fix call

8a4d206

mroeschke reviewed Oct 7, 2022

View reviewed changes

lithomas1 reviewed Oct 7, 2022

View reviewed changes

phofl added 2 commits October 7, 2022 13:45

Fix typo

30d68a8

Merge remote-tracking branch 'upstream/main' into use_nullable_dtypes

6139d87

lithomas1 approved these changes Oct 7, 2022

View reviewed changes

mroeschke added this to the 1.6 milestone Oct 7, 2022

mroeschke approved these changes Oct 7, 2022

View reviewed changes

mroeschke merged commit 7f24bff into pandas-dev:main Oct 7, 2022

zain581 added a commit to zain581/pandas that referenced this pull request Oct 7, 2022

Merge pull request #2 from pandas-dev/main

a8737bf

ENH: Add option to use nullable dtypes in read_csv (pandas-dev#48776)

phofl deleted the use_nullable_dtypes branch October 7, 2022 17:09

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022

coroa mentioned this pull request Oct 17, 2022

BUG: pd.read_csv with use_nullable_dtypes incompatible with dtype coercion #49146

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Add option to use nullable dtypes in read_csv #48776

ENH: Add option to use nullable dtypes in read_csv #48776

phofl commented Sep 25, 2022

mroeschke Sep 30, 2022 •

edited

Loading

phofl Sep 30, 2022

mroeschke Sep 30, 2022

phofl Sep 30, 2022

mroeschke Sep 30, 2022

phofl Sep 30, 2022

mroeschke commented Sep 30, 2022

phofl commented Sep 30, 2022

mroeschke commented Sep 30, 2022

phofl commented Sep 30, 2022

mroeschke commented Sep 30, 2022

phofl commented Sep 30, 2022

mroeschke commented Sep 30, 2022

mroeschke left a comment

lithomas1 Oct 1, 2022

phofl Oct 7, 2022

lithomas1 Oct 1, 2022

phofl Oct 7, 2022

lithomas1 Oct 7, 2022

mroeschke commented Oct 7, 2022

ENH: Add option to use nullable dtypes in read_csv #48776

ENH: Add option to use nullable dtypes in read_csv #48776

Conversation

phofl commented Sep 25, 2022

mroeschke Sep 30, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mroeschke commented Sep 30, 2022

phofl commented Sep 30, 2022

mroeschke commented Sep 30, 2022

phofl commented Sep 30, 2022

mroeschke commented Sep 30, 2022

phofl commented Sep 30, 2022

mroeschke commented Sep 30, 2022

mroeschke left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mroeschke commented Oct 7, 2022

mroeschke Sep 30, 2022 •

edited

Loading