NaNs in Float64Index are converted to silly integers using index.astype('int') #13149

Closed
ch41rmn opened this Issue May 12, 2016 · 6 comments


ch41rmn commented May 12, 2016

Code Sample, a copy-pastable example if possible

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.index = [None, 1]
>>> df
      a  b
NaN   1  3
 1.0  2  4
>>> df.index = df.index.astype('int')
>>> df
                      a  b
-9223372036854775808  1  3
 1                    2  4

output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.13-100.fc21.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.utf8

pandas: 0.18.1
nose: None
pip: 8.1.1
setuptools: 20.2.2
Cython: None
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
xarray: 0.7.2
IPython: 4.2.0
sphinx: 1.3.5
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

This is numpy behaviour:

In [22]: np.array([np.nan, 1.]).astype(int)
Out[22]: array([-2147483648,           1])

But we should probably check for the occurrence of NaNs, just as we do for Series:

In [29]: df.iloc[0,0] = np.nan

In [30]: df.a
Out[30]:
NaN   NaN
 1      2
Name: a, dtype: float64

In [31]: df.a.astype(int)
...

C:\Anaconda\lib\site-packages\pandas\core\common.pyc in _astype_nansafe(arr, dtype, copy)
   2726
   2727         if np.isnan(arr).any():
-> 2728             raise ValueError('Cannot convert NA to integer')
   2729     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
   2730         # work around NumPy brokenness, #1987

ValueError: Cannot convert NA to integer
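
The guard visible in the traceback can be sketched with plain NumPy. This is only an illustration, not the pandas implementation: the helper name astype_int_nansafe is made up here, and only the error message is borrowed from the traceback.

```python
import numpy as np


def astype_int_nansafe(values, dtype=np.int64):
    """Hypothetical helper: cast floats to an integer dtype, refusing NaN."""
    arr = np.asarray(values, dtype=np.float64)
    # A float NaN has no integer counterpart, so refuse instead of
    # letting the cast produce an arbitrary integer.
    if np.isnan(arr).any():
        raise ValueError('Cannot convert NA to integer')
    return arr.astype(dtype)


# Clean input converts as usual.
clean_result = astype_int_nansafe([1.0, 2.0])
print(clean_result)

# Input containing NaN is rejected.
try:
    astype_int_nansafe([np.nan, 1.0])
    nan_rejected = False
except ValueError as exc:
    nan_rejected = True
    print('refused:', exc)
```

Raising early keeps the caller from silently getting a platform-dependent integer, such as the values shown in the outputs above.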

jorisvandenbossche added this to the Next Major Release milestone May 12, 2016


pijucha commented May 17, 2016

I wanted to fix this bug but noticed a similar behaviour of other objects: DatetimeIndex, TimedeltaIndex, Categorical, CategoricalIndex. Namely (all four of them behave identically):

A = pd.DatetimeIndex([1e10,2e10,None])
A
Out[76]: DatetimeIndex(['1970-01-01 00:00:10', '1970-01-01 00:00:20', 'NaT'], dtype='datetime64[ns]', freq=None)
A.astype(int)
Out[77]: array([         10000000000,          20000000000, -9223372036854775808])

However, unlike with Float64Index, this is invertible:

pd.DatetimeIndex(A.astype(int))
Out[78]: DatetimeIndex(['1970-01-01 00:00:10', '1970-01-01 00:00:20', 'NaT'], dtype='datetime64[ns]', freq=None)
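
The round trip works because NaT is stored as the smallest 64-bit integer, which serves as a sentinel, whereas a float NaN has no integer counterpart. A small NumPy-only sketch of that representation, with array contents chosen to mirror the example above:

```python
import numpy as np

INT64_MIN = np.iinfo(np.int64).min

# Nanosecond timestamps, with a missing value as the last element.
stamps = np.array(
    ['1970-01-01T00:00:10', '1970-01-01T00:00:20', 'NaT'],
    dtype='datetime64[ns]',
)

# Reinterpret the same 8-byte values as integers (no conversion happens).
as_ints = stamps.view('int64')
print(as_ints)

# Reinterpreting the integers as datetimes restores NaT.
round_trip = as_ints.view('datetime64[ns]')
print(round_trip)

nat_is_sentinel = as_ints[-1] == INT64_MIN
nat_restored = bool(np.isnat(round_trip[-1]))
```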

My question: is this behaviour also a bug, and should it be fixed the same way (by raising a ValueError)? If so, should all the fixes go into one commit/pull request?

By the way, there might be other objects with the same issue, which call numpy.ndarray.astype(). And numpy is also a bit inconsistent here:

np.array([1,np.nan]).astype(int)
Out[84]: array([                   1, -9223372036854775808])
np.array([1,np.nan], dtype = int)
Traceback...
ValueError: cannot convert float NaN to integer

jreback commented May 17, 2016

@ch41rmn these are all as expected. converting to int converts to the underlying integer based representation.

The only issue is that Float64Index.astype(int) should raise (as it's effectively non-convertible).

@jreback I actually think we should raise in the datetimeindex case as well (ideally). A NaT cannot be converted to int (just as float nan cannot be converted). There is the asi8 attribute if you want this.
But, of course, that is not really back compat. Internally I think we consistently use asi8? But I'm not sure about external use, of course.

Raising for CategoricalIndex seems less of a problem (not a common thing to do)


jreback commented May 17, 2016

This is exactly what should be returned (and is useful). Yes, it's equivalent to the internal .asi8, but I don't see a good reason to NOT do this.

In [20]: pd.DatetimeIndex([1e10,2e10,None]).astype(int)
Out[20]: array([         10000000000,          20000000000, -9223372036854775808])

pijucha added a commit to pijucha/pandas that referenced this issue May 23, 2016

BUG: Fix #13149 and ENH: 'copy' param in Index.astype()
1. Float64Index.astype(int) raises ValueError if a NaN is present.
Previously, it converted NaN's to the smallest negative integer.

2. TimedeltaIndex.astype(int) and DatetimeIndex.astype(int) return
Int64Index, which is consistent with behavior of other Indexes.
Previously, they returned a numpy.array of ints.

3. Added:
  - bool parameter 'copy' to Index.astype()
  - shared doc string to .astype()
  - tests on .astype() (consolidated and added new)
  - bool parameter 'copy' to Categorical.astype()

4. Internals:
  - Fixed core.common.is_timedelta64_ns_dtype().
  - Set a default NaT representation to a string type in a parameter
    of DatetimeIndex._format_native_types().
    Previously, it produced a unicode u'NaT' in Python2.
8b29902
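
Assuming a pandas version that includes this change, item 1 of the commit message could be checked roughly as follows. The sketch builds the index with pd.Index rather than naming Float64Index directly. It catches ValueError, the exception type named in the commit message, and does not assume the exact error text.

```python
import numpy as np
import pandas as pd

# A float index that contains a missing value.
float_index = pd.Index([np.nan, 1.0])

try:
    float_index.astype('int64')
    nan_cast_raised = False
except ValueError as exc:
    # Expected on versions with the fix described in the commit message.
    nan_cast_raised = True
    print('astype refused:', exc)

# A float index without NaN still converts.
converted = pd.Index([1.0, 2.0]).astype('int64')
print(list(converted))
```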

jreback closed this in afde718 May 23, 2016

nps added a commit to nps/pandas that referenced this issue May 30, 2016

BUG: Fix #13149 and ENH: 'copy' param in Index.astype() (pijucha + nps)
closes #13149

1. Float64Index.astype(int) raises ValueError if a NaN is present.
   Previously, it converted NaN's to the smallest negative integer.
2. TimedeltaIndex.astype(int) and DatetimeIndex.astype(int) return Int64Index, which is consistent
   with behavior of other Indexes.  Previously, they returned a numpy.array of ints.
3. Added bool parameter 'copy' to Index.astype()
4. Fixed core.common.is_timedelta64_ns_dtype().
5. Set a default NaT representation to a string type in a parameter of
   DatetimeIndex._format_native_types().  Previously, it produced a
   unicode u'NaT' in Python2.

Author: pijucha <pi.jucha@gmail.com>

Closes #13209 from pijucha/bug13149 and squashes the following commits:

8b29902 [pijucha] BUG: Fix #13149 and ENH: 'copy' param in Index.astype()
9a4b5d5