BUG: Collection of inconsistencies in .astype conversions #37626

Open
3 tasks done
mlondschien opened this issue Nov 4, 2020 · 5 comments
Labels
Astype Bug Dtype Conversions Unexpected or buggy dtype conversions NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@mlondschien
Contributor

mlondschien commented Nov 4, 2020

I have a use case where (automatic) casting between the following pandas dtypes is necessary:

bool, boolean, int64, Int64, float64, object and string.

Note that boolean, Int64 and string are the nullable extension dtypes (boolean and string are new in pandas 1.0; Int64 dates from 0.24).

The default approach for this would be series.astype(target_dtype), with target_dtype given as the string name of one of the dtypes above. This works (as long as missing values cause no problems), except for the inconsistencies below:
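To get an overview of which casts succeed on a given pandas version, the pairwise conversions can be tried in a loop. The following is only a sketch: the helper name cast_matrix and the sample values are assumptions made for illustration, and the result will differ between pandas versions.

```python
import pandas as pd

# The seven dtypes named in the post, as the strings passed to .astype.
DTYPES = ["bool", "boolean", "int64", "Int64", "float64", "object", "string"]


def cast_matrix(values):
    """Try every source -> target cast and record 'ok' or the exception name."""
    table = []
    for source in DTYPES:
        try:
            series = pd.Series(values, dtype=source)
        except Exception as exc:  # the sample data may not fit the source dtype
            table.append([f"n/a ({type(exc).__name__})"] * len(DTYPES))
            continue
        row = []
        for target in DTYPES:
            try:
                series.astype(target)
                row.append("ok")
            except Exception as exc:
                row.append(type(exc).__name__)
        table.append(row)
    return pd.DataFrame(table, index=DTYPES, columns=DTYPES)


# Rows are source dtypes, columns are target dtypes.
matrix = cast_matrix([0, 1, 1])
print(matrix)
```

Running it with other sample values (for example strings, or data containing pd.NA) shows which casts depend on the data rather than on the dtype pair.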

In [1]: import pandas as pd
   ...: import numpy as np

pd.NA to float:

Summary: float(pd.NA) raises a TypeError (as does float(None)). While np.array([None, "1"]).astype("float") works (and is what gets called here), the same call with pd.NA fails.

In [2]: pd.Series(["1", "2", pd.NA], dtype="object").astype("float")
TypeError: float() argument must be a string or a number, not 'NAType'

In [3]: pd.Series(["1", "2", pd.NA], dtype="string").astype("float")
TypeError: float() argument must be a string or a number, not 'NAType'

In [4]: pd.Series(["1", "2", pd.NA], dtype="string").astype("category").astype("float")  # magic workaround
Out[4]: 
0    1.0
1    2.0
2    NaN
dtype: float64

Edit: Fixed with #37974.
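On pandas versions that predate that fix, one possible workaround that avoids the detour through category is to replace missing values with np.nan before casting. This is a minimal sketch; the helper name na_to_float is hypothetical, and it assumes every non-missing value can be parsed by float().

```python
import numpy as np
import pandas as pd


def na_to_float(series):
    """Cast to float64 after swapping every missing value for np.nan."""
    as_object = series.astype(object)
    # float(np.nan) is valid, whereas float(pd.NA) raises a TypeError.
    return as_object.where(as_object.notna(), np.nan).astype("float64")


result_object = na_to_float(pd.Series(["1", "2", pd.NA], dtype="object"))
result_string = na_to_float(pd.Series(["1", "2", pd.NA], dtype="string"))
```

The cast to object comes first because the string dtype would otherwise turn np.nan back into pd.NA.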

object / string to Int64 (nullable)

Summary: object columns cannot be cast to Int64 directly. Casting

  • object -> string -> Int64 works.
  • object -> float -> Int64 works if the data does not contain pd.NA (see above).
  • object -> int64 -> Int64 works if the data does not contain missing values.

In [5]: pd.Series(["1", "2", "3"]).astype("Int64")
TypeError: object cannot be converted to an IntegerDtype

In [5]: pd.Series(["1", "2", "3"], dtype="string").astype("Int64")
Out[5]: 
0    1
1    2
2    3
dtype: Int64

In [6]: pd.Series(["1", "2", "3"]).astype("int").astype("Int64")
Out[6]:
0    1
1    2
2    3
dtype: Int64

In [7]: pd.Series(["1", "2", None]).astype("int").astype("Int64")  # int columns cannot contain missing values
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

In [8]: pd.Series(["1", "2", None]).astype("float").astype("Int64")  # magic workaround
Out[8]:
0       1
1       2
2    <NA>
dtype: Int64

In [9]: pd.Series(["1", "2", pd.NA], dtype="object").astype("float").astype("Int64")  # see pd.NA -> float
TypeError: float() argument must be a string or a number, not 'NAType'

In [10]: pd.Series(["1", "2", pd.NA], dtype="object").astype("string").astype("Int64")
Out[10]: 
0       1
1       2
2    <NA>
dtype: Int64

Related: #25472 (comment)
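Of the routes listed in this section, the one through string is the only one shown to cope with both None and pd.NA, so it can be wrapped in a small helper. This is a sketch only: the name to_nullable_int is hypothetical, and it assumes every non-missing value is a string holding an integer literal.

```python
import pandas as pd


def to_nullable_int(series):
    """Cast via the nullable string dtype, the route shown in In [10]."""
    return series.astype("string").astype("Int64")


result = to_nullable_int(pd.Series(["1", "2", None], dtype="object"))
```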

string / object to bool or boolean

Summary: Casting string or object columns to bool or boolean behaves inconsistently. I am not sure what the expected behaviour for string / object to bool / boolean should be, but it would be nice for it to be consistent.

  • string / object -> bool works if there are no missing values, but yields only True.
  • string / object -> boolean raises.

In [11]: pd.Series(["True", "False", "bogus"], dtype="string").astype("bool")  # everything is True
Out[11]:
0    True
1    True
2    True
dtype: bool

In [12]: pd.Series(["True", "False", "bogus"], dtype="object").astype("bool")
Out[12]:
0    True
1    True
2    True
dtype: bool

In [13]: pd.Series(["True", "False", "bogus"], dtype="string").astype("boolean")
TypeError: data type not understood

In [14]: pd.Series(["True", "False", "bogus"], dtype="object").astype("boolean")
TypeError: Need to pass bool-like values
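Until the expected behaviour is settled, the parsing can be made explicit in user code. The sketch below is one possible interpretation, not what pandas does: the vocabulary BOOL_STRINGS and the helper parse_boolean are assumptions, and unknown strings such as "bogus" are rejected rather than silently turned into True.

```python
import pandas as pd

# Assumed vocabulary; the issue does not define which strings count as booleans.
BOOL_STRINGS = {"True": True, "False": False}


def parse_boolean(series):
    """Map known strings to booleans, keep missing values, reject the rest."""
    values = []
    for value in series.astype(object):
        if pd.isna(value):
            values.append(pd.NA)
        elif value in BOOL_STRINGS:
            values.append(BOOL_STRINGS[value])
        else:
            raise ValueError(f"cannot parse {value!r} as boolean")
    parsed_array = pd.array(values, dtype="boolean")
    return pd.Series(parsed_array, index=series.index, name=series.name)


parsed = parse_boolean(pd.Series(["True", "False", pd.NA], dtype="string"))
```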

int (non-nullable) to boolean

Summary: Casting from (non-nullable) int64 to (nullable) boolean raises.

  • int64 -> Int64 -> boolean works.
  • int64 -> bool -> boolean works as long as there are no missing values.

In [15]: pd.Series([-1, 0, 1], dtype="int").astype("boolean")
TypeError: Need to pass bool-like values

In [16]: pd.Series([-1, 0, 1], dtype="Int64").astype("boolean")
Out[16]:
0     True
1    False
2     True
dtype: boolean

In [17]: pd.Series([-1, 0, 1], dtype="int").astype("bool").astype("boolean")
Out[17]:
0     True
1    False
2     True
dtype: boolean

Related: #37614

While separate issues exist for the first and last report, I thought it would be useful to have a collection of these in one place, which I did not find.

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 15f843ab102d7a0cd7f1c7870dfec72d0e28d252
python           : 3.8.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-52-generic
Version          : #57~18.04.1-Ubuntu SMP Thu Oct 15 14:04:49 UTC 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.2.0.dev0+1059.g15f843ab1.dirty
numpy            : 1.18.5
pytz             : 2020.4
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 49.6.0.post20201009
Cython           : 0.29.21
pytest           : 5.4.3
hypothesis       : 5.41.1
sphinx           : 3.1.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.7
lxml.etree       : 4.6.1
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 0.8.4
fastparquet      : 0.4.1
gcsfs            : 0.7.1
matplotlib       : 3.2.2
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.5
pandas_gbq       : None
pyarrow          : 2.0.0
pyxlsb           : None
s3fs             : 0.4.2
scipy            : 1.5.3
sqlalchemy       : 1.3.20
tables           : 3.6.1
tabulate         : 0.8.7
xarray           : 0.16.1
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.51.2
@mlondschien mlondschien added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 4, 2020
@jreback
Contributor

jreback commented Nov 4, 2020

pls run these in master and not an older version

@mlondschien
Contributor Author

> pls run these in master and not an older version

My mistake. I updated the post.

@mlondschien mlondschien changed the title BUG: Collection of inconsitencies in .astype conversions BUG: Collection of inconsistencies in .astype conversions Nov 6, 2020
@mlondschien
Contributor Author

What are the expected behaviours for string / object -> bool / boolean and int -> boolean?

For the former I would have assumed that this raises unless the object values are actually numeric and can be cast, as done here for object -> boolean.

For the latter, this line ensures that only zeros and ones can be cast from int -> boolean. As bool(2) returns True, does it make sense to allow this conversion in pandas?
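The two candidate semantics for int -> boolean can be written out side by side. This is a sketch for discussion, not a description of the pandas implementation: the helper int_to_boolean and its strict flag are hypothetical, and it assumes a non-nullable integer Series as input.

```python
import pandas as pd


def int_to_boolean(series, strict=True):
    """Cast a non-nullable integer Series to the nullable boolean dtype.

    strict=True accepts only 0 and 1; strict=False follows Python truthiness,
    where every non-zero integer is True.
    """
    if strict and not series.isin([0, 1]).all():
        raise TypeError("strict mode accepts only the integers 0 and 1")
    return (series != 0).astype("boolean")


truthy = int_to_boolean(pd.Series([-1, 0, 2], dtype="int64"), strict=False)
```

With strict=True the same input raises, which mirrors the restriction described above; with strict=False it matches what bool(2) does in plain Python.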

@jreback jreback added NA - MaskedArrays Related to pd.NA and nullable extension arrays and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 20, 2020
@david-cortes
Contributor

Another one:

pd.Series([1, 2, None], dtype="Int64").to_numpy().astype(float)

(throws error)
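For this particular case, Series.to_numpy accepts dtype and na_value arguments, so the missing value can be encoded as NaN during the conversion instead of in a later astype call. A minimal sketch, assuming NaN is an acceptable stand-in for the missing entry:

```python
import numpy as np
import pandas as pd

nullable = pd.Series([1, 2, None], dtype="Int64")

# Ask to_numpy for the float dtype directly and say how <NA> should be encoded.
as_float = nullable.to_numpy(dtype="float64", na_value=np.nan)
```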

@mroeschke mroeschke added the Dtype Conversions Unexpected or buggy dtype conversions label Aug 14, 2021
@lorenzwalthert

Another one

>>> import pandas as pd # version 1.3.2 (current release)
>>> pd.DataFrame({"x": [1, pd.NA]}, index = [1])['x'].astype(float)
TypeError: float() argument must be a string or a number, not 'NAType'

@mzeitlin11 mzeitlin11 mentioned this issue Nov 6, 2021
3 tasks