convert numeric column to dedicated `pd.StringDtype()` #31204

vadella · 2020-01-22T13:01:30Z

Code Sample, a copy-pastable example if possible

pd.Series(range(5, 10), dtype="Int64").astype("string")

raises TypeError: data type not understood

while

pd.Series(range(5, 10)).astype("string")

raises ValueError: StringArray requires a sequence of strings or missing values.

If you first do astype(str):

pd.Series(range(5, 10)).astype(str).astype("string")

and

pd.Series(range(5, 10), dtype="Int64").astype(str).astype("string")

work as expected:

0    5
1    6
2    7
3    8
4    9
dtype: string

Going through astype(object) instead of astype(str) raises in both cases: ValueError: StringArray requires a sequence of strings or missing values.

Problem description

I can understand the ValueError, since no strings are fed to the StringArray. Ideally, astype("string") would convert the values to strings, or astype(str) would return a StringArray. In any case, I would expect both pd.Series(range(5, 10), dtype="Int64").astype("string") and pd.Series(range(5, 10)).astype("string") to raise the same error.

Expected Output

0    5
1    6
2    7
3    8
4    9
dtype: string

or

ValueError: StringArray requires a sequence of strings or missing values.

Output of `pd.show_versions()`

[paste the output of pd.show_versions() here below this line]


TomAugspurger · 2020-01-22T19:42:08Z

Overlaps with #22384, which is trying to solve this problem in general.

In the meantime, we can add support for this in IntegerArray.astype.

Dr-Irv · 2020-01-24T16:19:01Z

@TomAugspurger Was skimming issues and saw this one.

I'm wondering if Series.astype('string') should be treated as a special case, independent of the dtype of the underlying Series. That's because the following code should always work, assuming s is a Series:

pd.Series([str(x) if not pd.isna(x) else pd.NA for x in s], dtype="string")

So since we know that the underlying objects in the EA have to support str(), there is a straightforward way of doing that conversion.

If you agree, I can look into doing a PR
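For readers following along, the element-wise conversion proposed above can be wrapped in a small helper. This is only a sketch: `to_string_dtype` is a hypothetical name, not a pandas function, and it loops in Python, so it says nothing about how pandas itself should implement the conversion.

```python
import pandas as pd


def to_string_dtype(s: pd.Series) -> pd.Series:
    """Element-wise conversion to the nullable "string" dtype.

    Hypothetical helper, not part of pandas: every non-missing value goes
    through str(), and every missing value becomes pd.NA.
    """
    values = [pd.NA if pd.isna(x) else str(x) for x in s]
    return pd.Series(values, index=s.index, name=s.name, dtype="string")


# A nullable integer column with one missing value.
ints = pd.Series([5, 6, None], dtype="Int64")
converted = to_string_dtype(ints)
print(converted)
```

Because each value is passed through str() individually, the missing entry stays missing instead of turning into a literal string.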

TomAugspurger · 2020-01-24T16:23:35Z

I don't think we should preempt the array from having a chance to perform the conversion. And code-wise, I don't think we'd want a special case for this in NDFrame.astype.


Dr-Irv · 2020-01-24T16:40:29Z

> And code-wise, I don't think we'd want a special case for this in NDFrame.astype.

Well, I'm suggesting that we do want a special case just for StringDtype in NDFrame.astype. It seems natural that the 2 operations below should produce the same result, modulo having an object vs string dtype, for a given Series s (and returning np.nan vs. pd.NA for missing values), independent of the dtype of the Series s:

s.astype(str)
s.astype('string')

TomAugspurger · 2020-01-24T17:35:40Z

Right, that's definitely desirable. But I don't think NDFrame.astype is the place for the fix.

TomAugspurger · 2020-01-24T17:39:45Z

For example, consider SparseArray. It implements astype such that the result is also a SparseArray and preserves the sparsity. So Series[SparseArray].astype("string") would return a Series[SparseArray[string]]. But NDFrame.astype has no awareness of that.

Dr-Irv · 2020-01-24T18:47:50Z

> For example, consider SparseArray. It implements astype such that the result is also a SparseArray and preserves the sparsity. So Series[SparseArray].astype("string") would return a Series[SparseArray[string]]. But NDFrame.astype has no awareness of that.

I guess this depends on the semantics of astype() in the following sense.

If I have an EA of type "current_dtype" and I write s.astype("target_dtype"), is it either:

1. the responsibility of the EA with type "current_dtype" to know how to convert to every possible "target_dtype", or
2. the responsibility of EAs of type "target_dtype" to know how to convert whatever types they can to "target_dtype"?

I think you are saying that the design we have supports (1), and I'm suggesting a design corresponding to (2).

Now, the reason that I prefer (2) is that when I construct a Series and provide the dtype as an argument, pandas figures out how to convert the data passed to the Series to the corresponding dtype if it can. That is behavior corresponding to (2). So s = pd.Series([1, 2, 3], dtype="category") and s = pd.Series([1, 0, pd.NA], dtype="boolean") both work, but s = pd.Series([1, 2, 3], dtype="string") does not.

Another possible design would be to have a property of EAs called something like can_convert_anydtype, being True or False. If True, then astype knows it can ask the EA to convert any dtype, and if False, it then asks the target dtype to do the conversion. So, for StringDtype, we would set can_convert_anydtype to True, and for other dtypes set it to False.
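The difference between the two designs, and the flag-based variant, may be easier to see in a toy model. The sketch below uses plain Python classes that are entirely hypothetical (`ToyDtype`, `ToyStringDtype`, `ToyIntArray` and the free function `astype` do not exist in pandas). It also assumes one reading of the proposal: the flag lives on the target dtype, and when it is False the conversion falls back to the source array.

```python
class ToyDtype:
    """Hypothetical stand-in for an extension dtype; not a pandas class."""

    name = "toy"
    can_convert_anydtype = False  # flag name taken from the proposal above

    def convert_from(self, values):
        raise TypeError(f"{self.name} cannot convert arbitrary values")


class ToyStringDtype(ToyDtype):
    """Target dtype that claims it can convert anything, via str()."""

    name = "string"
    can_convert_anydtype = True

    def convert_from(self, values):
        # None plays the role of a missing value in this toy model.
        return [None if v is None else str(v) for v in values]


class ToyIntArray:
    """Source array that does not know about the string target (design 1)."""

    def __init__(self, values):
        self.values = list(values)

    def astype(self, dtype):
        raise TypeError(f"data type not understood: {dtype.name}")


def astype(array, target):
    """Dispatch: ask the target first if it opts in, else ask the source."""
    if target.can_convert_anydtype:
        return target.convert_from(array.values)  # design (2)
    return array.astype(target)  # design (1)


print(astype(ToyIntArray([5, None]), ToyStringDtype()))
```

With the flag set, the source array never needs to know that a string target exists. With the flag unset, the source array stays in control, which is the property the SparseArray example earlier in the thread relies on.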

TomAugspurger · 2020-01-24T23:55:27Z

We have another issue for an astype dispatch mechanism.

tritemio · 2020-02-10T14:23:36Z

I want to chime in with another use case from the duplicate issue #31839.

In addition to conversion to "string", converting "string" to "Int8/16/64" when the initial series contains pd.NA is currently quite tricky:

s = pd.Series(['0', pd.NA], dtype='string')
s.astype('object').replace(pd.NA, np.nan).astype('float64').astype('Int8')

It should be possible to do simply s.astype('Int8')

vadella · 2020-02-12T08:44:47Z

> I want to chime in just to give another use-case from the duplicated issue #31839.
>
> In addition to conversion to "string", converting "string" to "Int8/16/64" when the initial series contains pd.NA is currently quite tricky:
>
> s = pd.Series(['0', pd.NA], dtype='string')
> s.astype('object').replace(pd.NA, np.nan).astype('float64').astype('Int8')
>
> It should be possible to do simply s.astype('Int8')

Your solution can give rounding errors when dealing with large integers. (I've been bitten by this when importing production data. The batch numbers were too large to fit exactly in a float64)

s = pd.Series(["0", pd.NA, str(2 ** 60 + 2)], dtype="string")
s.to_frame().assign(
    a=s.astype("object")
    .replace(pd.NA, np.nan)
    .astype("float64")
    .astype("Int64"),
    b=s.apply(lambda x: int(x) if pd.notnull(x) else x).astype("Int64"),
)

                     0                    a                    b
0                    0                    0                    0
1                 <NA>                 <NA>                 <NA>
2  1152921504606846978  1152921504606846976  1152921504606846978

The second approach (column b) explicitly loops over the column, so it is not ideal performance-wise.
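The rounding seen in the float64 detour can be reproduced without pandas. A float64 has a 53-bit significand, so near 2 ** 60 adjacent representable values are 256 apart and 2 ** 60 + 2 cannot be stored exactly. A minimal sketch in plain Python:

```python
big = 2 ** 60 + 2

# Detour through float64, as in the astype("float64") chain.
via_float = int(float(big))

# Direct string-to-int parsing keeps every digit.
via_int = int(str(big))

print(via_float)        # 1152921504606846976
print(via_int)          # 1152921504606846978
print(big - via_float)  # 2
```

Any conversion path that passes through float64 is exposed to this, which is why parsing the strings directly as integers matters for large identifiers.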

tritemio · 2020-02-12T11:17:04Z

@vadella, thanks for the code example. I had the same problem too and I currently side-step it by loading the data directly in string format. Another reason why it is important that this conversion is handled by pandas internally.

TomAugspurger added the Dtype Conversions and ExtensionArray labels Jan 22, 2020

TomAugspurger added this to the Contributions Welcome milestone Jan 22, 2020

Dr-Irv mentioned this issue Jan 27, 2020

API: astype mechanism for extension arrays #22384

Open

TomAugspurger mentioned this issue Feb 10, 2020

Convert type nullable int <-> nullable string #31839

Closed

jorisvandenbossche mentioned this issue Feb 16, 2020

Should I be able to initialise a Series with a pandas array and pandas dtype? #32028

Closed

jorisvandenbossche added the NA - MaskedArrays label Mar 14, 2020

jorisvandenbossche mentioned this issue Apr 9, 2020

ENH: More permissable conversion to StringDtype #33412

Closed

This was referenced Apr 9, 2020

API: More permissive conversion to StringDtype #33421

Closed

API: more permissive conversion to StringDtype #33465

Merged

mroeschke added the Bug label Apr 28, 2020

jreback modified the milestones: Contributions Welcome, 1.1 May 25, 2020

jreback closed this as completed in #33465 May 26, 2020

xinrong-meng mentioned this issue Jul 8, 2021

[SPARK-36035][PYTHON] Adjust test_astype, test_neg for old pandas versions apache/spark#33250

Closed
