Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in 0.24: to_timedelta handling of float values #25077

Closed
Sup3rGeo opened this issue Feb 1, 2019 · 15 comments

Comments

@Sup3rGeo
Copy link
Contributor

commented Feb 1, 2019

Code Sample, a copy-pastable example if possible

Creating an TimedeltaIndex with 1 microsecond steps:

import pandas as pd
import scipy as sp
t = sp.arange(0,1,1e-6)
index = pd.to_timedelta(t, unit='s')

Problem description

Output in 0.24:

image

We have repeated index values (e.g. 99992 in the image) and missing ones (e.g. 99994 in the image).

This looks like some sort of rounding issue.

Expected Output

Output in 0.23.4:

image

In this version everything is okay.

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.0
pytest: 4.1.0
pip: 18.1
setuptools: 39.0.1
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.2
IPython: 7.2.0
sphinx: 1.8.3
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: 1.7.0
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml.etree: 4.3.0
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.2.0
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@Sup3rGeo Sup3rGeo changed the title Regression in 0.24: Timedeltaindex frk has missing/repeated values Regression in 0.24: Timedeltaindex from to_timedelta has missing/repeated values Feb 1, 2019

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Feb 1, 2019

@Sup3rGeo

This comment has been minimized.

Copy link
Contributor Author

commented Feb 1, 2019

Sorry, was in the process of doing it!

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Feb 1, 2019

Thanks. Likely some rounding issue. Are floating point values allowed in to_timedelta?

@gfyoung gfyoung added this to the 0.24.2 milestone Feb 7, 2019

@gfyoung

This comment has been minimized.

Copy link
Member

commented Feb 7, 2019

cc @jreback @mroeschke

Feel free to change milestone depending on what your thoughts on this are.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Feb 7, 2019

Are floating point values allowed in to_timedelta?

It's not really documented very clearly (under unit there is a mention of "float number"), but in any case it works in practice and is also tested behaviour:

# Tests with fractional seconds as input:
expected = np.array(
[0, 500000000, 800000000, 1200000000], dtype='timedelta64[ns]')
result = to_timedelta([0., 0.5, 0.8, 1.2], unit='s', box=False)
tm.assert_numpy_array_equal(expected, result)

So clearly a regression I would say.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Feb 7, 2019

this is just a precision issue.

In [7]: index.asi8
Out[7]: 
array([        0,      1000,      2000, ..., 999996999, 999998000,
       999999000])

In [8]: np.diff(index.asi8)
Out[8]: array([1000, 1000, 1000, ...,  999, 1001, 1000])

try this to see full precision printing

In [17]: [t._repr_base(format='all') for t in index[0:10]]
Out[17]: 
['0 days 00:00:00.000000000',
 '0 days 00:00:00.000001000',
 '0 days 00:00:00.000002000',
 '0 days 00:00:00.000003000',
 '0 days 00:00:00.000004000',
 '0 days 00:00:00.000005000',
 '0 days 00:00:00.000006000',
 '0 days 00:00:00.000007000',
 '0 days 00:00:00.000008000',
 '0 days 00:00:00.000009000']

not a bug / regression at all; I suppose docs could be enhanced

@jreback jreback removed the Regression label Feb 7, 2019

@jreback jreback modified the milestones: 0.24.2, Contributions Welcome Feb 7, 2019

@jreback jreback added the Docs label Feb 7, 2019

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Feb 7, 2019

It's not just about display. The actual integer values have changed between 0.23 and 0.24. Before 0.24, the floats were converted to ints with a better precison.

0.23.4:

In [16]: arr = np.arange(0, 1, 1e-6)

In [17]: pd.to_timedelta(arr, unit='s').asi8
Out[17]: 
array([        0,      1000,      2000, ..., 999997000, 999998000,
       999999000])

0.24.1:

In [10]: arr = np.arange(0, 1, 1e-6)    

In [11]: pd.to_timedelta(arr, unit='s').asi8   
Out[11]: 
array([        0,      1000,      2000, ..., 999996999, 999998000,
       999999000])

Notice the 999997000 vs 999996999

So in any case, it is a regression in behaviour (whether we think it was supported behaviour or not can be another discussion)

@jreback

This comment has been minimized.

Copy link
Contributor

commented Feb 7, 2019

are these exactly the same numpy version? this is at the precision limit

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Feb 7, 2019

Yes, the same numpy version apparently. But, it's only the 6th decimal, that's a precision we should be able to handle.

@Sup3rGeo

This comment has been minimized.

Copy link
Contributor Author

commented Feb 7, 2019

Just to add to the discussion from a user application perspective:

>>> import pandas as pd
>>> import scipy as sp
>>> t = sp.arange(0,1,1e-6)
>>> index = pd.to_timedelta(t, unit='s')
>>> y = sp.sin(t)
>>> series = pd.Series(y, index=index)

>>> # This works
... series["300ms"]
0.29552020666133955


>>> # This don't
... series["400ms"]

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Program Files\Python36\lib\site-packages\pandas\core\series.py", line 868, in __getitem__
    result = self.index.get_value(self, key)
  File "C:\Program Files\Python36\lib\site-packages\pandas\core\indexes\timedeltas.py", line 501, in get_value
    return self.get_value_maybe_box(series, key)
  File "C:\Program Files\Python36\lib\site-packages\pandas\core\indexes\timedeltas.py", line 521, in get_value_maybe_box
    values = self._engine.get_value(com.values_from_object(series), key)
  File "pandas\_libs\index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 430, in pandas._libs.index.DatetimeEngine.get_loc
KeyError: Timedelta('0 days 00:00:00.400000')

>>> series["399.99ms":"400.01ms"]
00:00:00.399991    0.389410
00:00:00.399991    0.389411
00:00:00.399993    0.389412
00:00:00.399993    0.389413
00:00:00.399995    0.389414
00:00:00.399995    0.389415
00:00:00.399997    0.389416
00:00:00.399997    0.389417
00:00:00.399999    0.389417
00:00:00.399999    0.389418
00:00:00.400001    0.389419
00:00:00.400001    0.389420
00:00:00.400003    0.389421
00:00:00.400004    0.389422
00:00:00.400005    0.389423
00:00:00.400006    0.389424
00:00:00.400007    0.389425
00:00:00.400008    0.389426
00:00:00.400009    0.389427
00:00:00.400010    0.389428
dtype: float64
@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Feb 8, 2019

I think this is caused by #23539. In that PR, a new code path for floats was added in sequence_to_td64:

elif is_float_dtype(data.dtype):
# treat as multiples of the given unit. If after converting to nanos,
# there are fractional components left, these are truncated
# (i.e. NOT rounded)
mask = np.isnan(data)
coeff = np.timedelta64(1, unit) / np.timedelta64(1, 'ns')
data = (coeff * data).astype(np.int64).view('timedelta64[ns]')
data[mask] = iNaT
copy = False

While before that PR, floats were handled by (the slower) array_to_timedelta64 (which is now exposed in objects_to_td64ns). That one is converting the example data from this issue correctly:

In [12]: data = np.arange(0, 1, 1e-6)  

In [13]: pd.core.arrays.timedeltas.sequence_to_td64ns(data, unit='s')  
Out[13]: 
(array([        0,      1000,      2000, ..., 999996999, 999998000,
        999999000], dtype='timedelta64[ns]'), None)

In [14]: pd.core.arrays.timedeltas.objects_to_td64ns(data, unit='s')
Out[14]: 
array([        0,      1000,      2000, ..., 999997000, 999998000,
       999999000], dtype='timedelta64[ns]')

In the array_to_timedelta64 code, there is some special handling for float precision:

# cast the unit, multiply base/frace separately
# to avoid precision issues from float -> int
base = <int64_t>ts
frac = ts - base
if p:
frac = round(frac, p)
return <int64_t>(base * m) + <int64_t>(frac * m)

cc @jbrockmendel

@jorisvandenbossche jorisvandenbossche changed the title Regression in 0.24: Timedeltaindex from to_timedelta has missing/repeated values Regression in 0.24: to_timedelta handling of float values Feb 8, 2019

@jbrockmendel

This comment has been minimized.

Copy link
Member

commented Feb 8, 2019

@Sup3rGeo it looks like @jorisvandenbossche has found the source of the issue. The quoted code in arrays.timedeltas needs to use the quoted code in tslibs.timedeltas instead of numpy's astype. Want to make a PR with a fix?

@Sup3rGeo

This comment has been minimized.

Copy link
Contributor Author

commented Feb 18, 2019

@jbrockmendel maybe! Will check if I can sneak this into the weekend.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2019

I think we're planning a 0.24.2 release for sometime next week. @Sup3rGeo do you have time to submit a PR in the next few days?

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Mar 6, 2019

I can also do it tomorrow afternoon, if that might help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.