API: SparseArray.astype behaviour to always preserve sparseness #34457

Closed
jorisvandenbossche opened this issue May 29, 2020 · 4 comments · Fixed by #45339
Labels
Astype Enhancement Needs Discussion Requires discussion from core team before further action Sparse Sparse Data Type
Comments

@jorisvandenbossche
Member

Currently, the SparseArray.astype function will always convert the specified target dtype to a sparse dtype, if it is not one. For example, this gives:

In [64]: arr = pd.arrays.SparseArray([1, 0, 0, 2])  

In [65]: arr   
Out[65]: 
[1, 0, 0, 2]
Fill: 0
IntIndex
Indices: array([0, 3], dtype=int32)

In [66]: arr.astype(float)  
Out[66]: 
[1.0, 0.0, 0.0, 2.0]
Fill: 0.0
IntIndex
Indices: array([0, 3], dtype=int32)

This ensures that a simple astype doesn't densify the sparse array (and you don't need to do astype(pd.SparseDtype(float, fill_value))).
And note this also gives this behaviour to Series.astype(..)
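For comparison, here is a minimal sketch of the explicit spelling mentioned above, where the target is already a SparseDtype. The variable names are illustrative only, and the fill value of 0.0 is an assumption chosen to match the example array.

```python
import pandas as pd

arr = pd.arrays.SparseArray([1, 0, 0, 2])

# Spell out the sparse target dtype explicitly (fill value 0.0 is an assumption
# that matches the integer fill value 0 of the example array).
target = pd.SparseDtype(float, fill_value=0.0)

converted = arr.astype(target)
ser_converted = pd.Series(arr).astype(target)

print(converted.dtype)      # sparse dtype with float subtype
print(ser_converted.dtype)  # same dtype when going through Series.astype
print(converted.sp_values)  # only the non-fill values are stored
```

With an explicit SparseDtype target, the result dtype equals the requested dtype, so this spelling does not depend on any implicit conversion.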

But this also introduces the inconsistency that arr.astype(target_dtype).dtype != target_dtype, so you cannot rely on getting back an array of the actual dtype that you specified.
See eg the workaround I need to add for this in #34338
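A small sketch of how one might check this invariant. The outcome depends on the installed pandas version (this issue was later closed by #45339), so the snippet reports what it finds rather than assuming either result.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([1, 0, 0, 2])
target = np.dtype("float64")

result = arr.astype(target)

# True only if astype returned exactly the dtype that was requested.
roundtrips = result.dtype == target
# True if the requested dtype was silently wrapped in a SparseDtype.
was_sparsified = isinstance(result.dtype, pd.SparseDtype)

print(type(result).__name__, result.dtype)
print("astype(dtype).dtype == dtype:", roundtrips)

# An explicit dense conversion always yields the requested NumPy dtype.
dense = np.asarray(arr).astype(target)
print(dense.dtype == target)
```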

@TomAugspurger
Contributor

Do you think this is worth deprecating? It seems that this would require a keyword to control the behavior, like

SparseArray.astype(dtype, sparsify=True)

which would cast non-sparse dtypes to SparseDtypes. So we can introduce that keyword with a default of None and warn that in the future it will change from True to False?
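To make the proposal concrete, here is a rough sketch written as a standalone helper. The function name astype_with_sparsify and the warning text are hypothetical and not part of pandas; only the sparsify keyword and its None/True/False semantics come from the comment above.

```python
import warnings

import numpy as np
import pandas as pd


def astype_with_sparsify(arr, dtype, sparsify=None):
    """Hypothetical helper (not pandas API) sketching the proposed keyword."""
    # A SparseDtype target is unambiguous, so the keyword does not apply.
    if isinstance(dtype, pd.SparseDtype):
        return arr.astype(dtype)

    if sparsify is None:
        # Keep the old behaviour for now, but announce the future default.
        warnings.warn(
            "The default of 'sparsify' will change from True to False.",
            FutureWarning,
            stacklevel=2,
        )
        sparsify = True

    np_dtype = np.dtype(dtype)
    if sparsify:
        # Wrap the target in a SparseDtype, casting the fill value as well.
        fill_value = np_dtype.type(arr.fill_value)
        return arr.astype(pd.SparseDtype(np_dtype, fill_value))

    # Densify so that the result has exactly the requested dtype.
    return np.asarray(arr).astype(np_dtype)


arr = pd.arrays.SparseArray([1, 0, 0, 2])
sparse_result = astype_with_sparsify(arr, float, sparsify=True)
dense_result = astype_with_sparsify(arr, float, sparsify=False)
print(sparse_result.dtype)
print(dense_result.dtype)
```

Casting the fill value with the NumPy scalar type is a simplification; it would fail for combinations such as a NaN fill value and an integer target.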

@jorisvandenbossche
Member Author

I think the current behaviour is too widely exposed (also through Series.astype) to just change, and it was a deliberate behaviour, so we can't treat this as a bug fix.
So in that case a keyword like that might be a good solution. The only catch is that we would need to expose it in Series.astype as well and pass it through to SparseArray.astype.

@TomAugspurger
Contributor

Hmmm having to expose it through Series.astype (and DataFrame as well?) is unfortunate.

@mroeschke mroeschke added Enhancement Needs Discussion Requires discussion from core team before further action and removed API Design labels Aug 7, 2021
@jbrockmendel
Member

this also gives the inconsistency that arr.astype(target_dtype).dtype != target_dtype

I just ran into this. Not a strong opinion, but on the margin I'd prefer to deprecate so that obj.astype(dtype).dtype == dtype
