convert numeric column to dedicated pd.StringDtype() #31204

Closed
vadella opened this issue Jan 22, 2020 · 11 comments · Fixed by #33465
Labels
Bug; Dtype Conversions (Unexpected or buggy dtype conversions); ExtensionArray (Extending pandas with custom dtypes or arrays); NA - MaskedArrays (Related to pd.NA and nullable extension arrays)

@vadella

vadella commented Jan 22, 2020

Code Sample, a copy-pastable example if possible

pd.Series(range(5, 10), dtype="Int64").astype("string")

raises TypeError: data type not understood

while

pd.Series(range(5, 10)).astype("string")

raises ValueError: StringArray requires a sequence of strings or missing values.

If you first do astype(str):

pd.Series(range(5, 10)).astype(str).astype("string")

and

pd.Series(range(5, 10), dtype="Int64").astype(str).astype("string")

work as expected:

0    5
1    6
2    7
3    8
4    9
dtype: string

Using astype(object) as the intermediate step instead raises, in both cases, ValueError: StringArray requires a sequence of strings or missing values.
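The conversions reported above can be gathered into one small script. This is only a sketch for reproducing the report: `try_convert` is a hypothetical helper, not part of pandas, and the outcome of the direct and `object` routes depends on the installed pandas version (the errors quoted above are from the version the issue was filed against).

```python
import pandas as pd


def try_convert(series, *dtypes):
    """Apply astype() for each dtype in turn; report the final dtype or the error."""
    try:
        for dtype in dtypes:
            series = series.astype(dtype)
    except (TypeError, ValueError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return f"ok, dtype={series.dtype}"


plain = pd.Series(range(5, 10))                    # numpy int64
nullable = pd.Series(range(5, 10), dtype="Int64")  # nullable integer

outcomes = {}
for name, series in [("int64", plain), ("Int64", nullable)]:
    outcomes[name, "string"] = try_convert(series, "string")
    outcomes[name, "object -> string"] = try_convert(series, object, "string")
    outcomes[name, "str -> string"] = try_convert(series, str, "string")

# One line per starting dtype and conversion route.
for (name, route), outcome in outcomes.items():
    print(f"{name:6} {route:18} {outcome}")
```

Only the `str -> string` route is reported as working in the issue; the example data contains no missing values, so it says nothing about how each route treats them.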

Problem description

I can understand the ValueError, since you don't feed strings to the StringArray. Best for me would be if the astype("string") converts it to strings, or if the astype(str) would return a StringArray, but in any case, I would expect both pd.Series(range(5, 10), dtype="Int64").astype("string") and pd.Series(range(5, 10)).astype("string") to raise the same error.

Expected Output

0    5
1    6
2    7
3    8
4    9
dtype: string

or

ValueError: StringArray requires a sequence of strings or missing values.

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

@TomAugspurger
Contributor

Overlaps with #22384, which is trying to solve this problem in general.

In the meantime, we can add support for this in IntegerArray.astype.

TomAugspurger added the "Dtype Conversions" and "ExtensionArray" labels on Jan 22, 2020
TomAugspurger added this to the Contributions Welcome milestone on Jan 22, 2020
@Dr-Irv
Contributor

Dr-Irv commented Jan 24, 2020

@TomAugspurger Was skimming issues and saw this one.

I'm wondering if Series.astype('string') should be treated as a special case, independent of the dtype of the underlying Series. That's because the following code should always work, assuming s is a Series:

pd.Series([str(x) if not pd.isna(x) else pd.NA for x in s], dtype="string")

So since we know that the underlying objects in the EA have to support str(), there is a straightforward way of doing that conversion.

If you agree, I can look into doing a PR
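The one-liner in this comment can be wrapped in a small helper to try it against a nullable integer column. This is a minimal sketch: `to_string_dtype` is a hypothetical name, and keeping the index and name of the input is an addition that the original expression does not make.

```python
import pandas as pd


def to_string_dtype(s):
    """Elementwise str(), with missing values mapped to pd.NA (hypothetical helper)."""
    values = [pd.NA if pd.isna(x) else str(x) for x in s]
    return pd.Series(values, index=s.index, name=s.name, dtype="string")


ints = pd.Series([5, None, 7], dtype="Int64")
converted = to_string_dtype(ints)
print(converted)
```

Because it iterates in Python and only relies on `str()` and `pd.isna()`, the helper does not need to know anything about the source dtype, which is the point being made in the comment.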

@TomAugspurger
Contributor

TomAugspurger commented Jan 24, 2020 via email

@Dr-Irv
Contributor

Dr-Irv commented Jan 24, 2020

> And code-wise, I don't think we'd want a special case for this in NDFrame.astype.

Well, I'm suggesting that we do want a special case just for StringDtype in NDFrame.astype. It seems natural that the 2 operations below should produce the same result, modulo having an object vs string dtype, for a given Series s (and returning np.nan vs. pd.NA for missing values), independent of the dtype of the Series s:

s.astype(str)
s.astype('string')

@TomAugspurger
Contributor

Right, that's definitely desirable. But I don't think NDFrame.astype is the place for the fix.

@TomAugspurger
Contributor

For example, consider SparseArray. It implements astype such that the result is also a SparseArray and preserves the sparsity. So Series[SparseArray].astype("string") would return a Series[SparseArray[string]]. But NDFrame.astype has no awareness of that.

@Dr-Irv
Contributor

Dr-Irv commented Jan 24, 2020

> For example, consider SparseArray. It implements astype such that the result is also a SparseArray and preserves the sparsity. So Series[SparseArray].astype("string") would return a Series[SparseArray[string]]. But NDFrame.astype has no awareness of that.

I guess this depends on the semantics of astype() in the following sense.

If I have an EA of type "current_dtype" and I write s.astype("target_dtype"), is it either:

  1. the responsibility of the EA with type "current_dtype" to know how to convert to every possible "target_dtype", or
  2. the responsibility of EA's of type "target_dtype" to know how to convert whatever types it can to "target_dtype"?

I think you are saying that the design we have supports (1), and I'm suggesting a design corresponding to (2).

Now, the reason that I prefer (2) is that when I construct a Series and provide the dtype as an argument, pandas figures out how to convert the data passed to the Series to the corresponding dtype if it can. That is behavior corresponding to (2). So s=pd.Series([1,2,3], dtype="category") and s=pd.Series([1,0,pd.NA], dtype="boolean") both work, but s=pd.Series([1,2,3], dtype="string") does not.

Another possible design would be to give EAs a property called something like can_convert_anydtype, which is either True or False. If True, astype knows it can ask the EA to convert any dtype; if False, it asks the target dtype to do the conversion. So for StringDtype we would set can_convert_anydtype to True, and for other dtypes to False.
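The two designs and the proposed flag can be illustrated with a toy dispatcher in plain Python. Nothing here is pandas API: `ToyDtype`, `ToyStringDtype`, `ToyIntArray`, `convert_from` and the module-level `astype` are all hypothetical stand-ins, and the sketch follows one reading of the proposal, in which the flag lives on the target dtype and a True value means the target does the conversion.

```python
class ToyDtype:
    """Stand-in for an extension dtype; not a pandas class."""

    # Hypothetical flag named in the comment above.
    can_convert_anydtype = False

    def __init__(self, name):
        self.name = name

    def convert_from(self, values):
        raise TypeError(f"{self.name} cannot convert arbitrary values")


class ToyStringDtype(ToyDtype):
    can_convert_anydtype = True

    def __init__(self):
        super().__init__("string")

    def convert_from(self, values):
        # Design (2): the target dtype converts whatever it is given.
        return [None if v is None else str(v) for v in values]


class ToyIntArray:
    """Stand-in for an integer extension array, with None as the missing value."""

    def __init__(self, values):
        self.values = list(values)

    def astype(self, target):
        # Design (1): the source array has to know each target it supports.
        if target.name == "float":
            return [None if v is None else float(v) for v in self.values]
        raise TypeError(f"ToyIntArray cannot convert to {target.name}")


def astype(array, target):
    """Use the target's converter if it opts in, else defer to the source array."""
    if target.can_convert_anydtype:
        return target.convert_from(array.values)
    return array.astype(target)


arr = ToyIntArray([1, None, 3])
print(astype(arr, ToyStringDtype()))   # handled by the target dtype
print(astype(arr, ToyDtype("float")))  # handled by the source array
```

With the flag, a source array such as `ToyIntArray` never needs a string branch of its own; without it, every array type would have to add one.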

@TomAugspurger
Contributor

We have another issue for an astype dispatch mechanism.

@tritemio

I want to chime in just to give another use-case from the duplicated issue #31839.

In addition to conversion to "string", converting "string" to "Int8/16/64" when the initial series contains pd.NA is currently quite tricky:

s = pd.Series(['0', pd.NA], dtype='string')
s.astype('object').replace(pd.NA, np.nan).astype('float64').astype('Int8')

It should be possible to do simply s.astype('Int8')

@vadella
Author

vadella commented Feb 12, 2020

> I want to chime in just to give another use-case from the duplicated issue #31839.
>
> In addition to conversion to "string", converting "string" to "Int8/16/64" when the initial series contains pd.NA is currently quite tricky:
>
> s = pd.Series(['0', pd.NA], dtype='string')
> s.astype('object').replace(pd.NA, np.nan).astype('float64').astype('Int8')
>
> It should be possible to do simply s.astype('Int8')

Your solution can give rounding errors when dealing with large integers. (I've been bitten by this when importing production data: the batch numbers were too large to be represented exactly in a float64.)

s = pd.Series(["0", pd.NA, str(2 ** 60 + 2)], dtype="string")
s.to_frame().assign(
    a=s.astype("object")
    .replace(pd.NA, np.nan)
    .astype("float64")
    .astype("Int64"),
    b=s.apply(lambda x: int(x) if pd.notnull(x) else x).astype("Int64"),
)
                     0                    a                    b
0                    0                    0                    0
1                 <NA>                 <NA>                 <NA>
2  1152921504606846978  1152921504606846976  1152921504606846978

The apply-based version (column b) avoids the rounding, but it explicitly loops over the column, so it is not ideal performance-wise.
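The rounding comes from float64 having a 53-bit significand, so integers near 2 ** 60 are only representable in steps of 256. The sketch below shows that effect in isolation and then one way to parse the strings without a float64 step, by building the integer values and the missing-value mask separately. `string_to_int64` is a hypothetical helper, not pandas API, and it still loops in Python, so it does not address the performance concern.

```python
import numpy as np
import pandas as pd

big = 2 ** 60 + 2
# The round trip through float64 lands on the nearest representable value.
print(int(np.float64(big)))  # 1152921504606846976, i.e. 2 ** 60


def string_to_int64(s):
    """Parse a "string" Series into Int64 without going through float64."""
    mask = s.isna().to_numpy(dtype=bool)
    # Missing slots get a placeholder 0; the mask marks them as NA.
    values = np.array(
        [0 if missing else int(x) for x, missing in zip(s, mask)],
        dtype="int64",
    )
    return pd.Series(pd.arrays.IntegerArray(values, mask), index=s.index, name=s.name)


s = pd.Series(["0", pd.NA, str(big)], dtype="string")
result = string_to_int64(s)
print(result)
```

Passing the mask explicitly to `pd.arrays.IntegerArray` means pandas does not have to infer anything from a mixed object array, so the large value is kept exactly.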

@tritemio

@vadella, thanks for the code example. I had the same problem too, and I currently side-step it by loading the data directly in string format. This is another reason why it is important that this conversion is handled by pandas internally.

7 participants