BUG: dt.total_seconds() gives the wrong number of seconds #48521

Status: Closed · Fixed by #52731

randolf-scholz opened this issue Sep 13, 2022 · 5 comments
Labels: Needs Tests (unit test(s) needed to prevent regressions)

Comments

randolf-scholz (Contributor) commented Sep 13, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
starttime = pd.Series(["2145-11-02 06:00:00"]).astype("datetime64[ns]")
endtime = pd.Series(["2145-11-02 07:06:00"]).astype("datetime64[ns]")
diff = endtime - starttime
assert diff.values.item() == 3960000000000
a = (endtime - starttime).dt.total_seconds().values
b = (endtime - starttime).values.astype(int) / 1_000_000_000
c = (endtime - starttime).values / np.timedelta64(1, "s")
assert b == c, f"{c-b}"  # ✔
assert a == c, f"{a-c}"  # ✘ AssertionError: [4.54747351e-13]

Issue Description

I noticed this when I was trying to reproduce a preprocessing pipeline for some dataset. (Don't mind the weird dates, they just come from some de-identified data).

It seems that dt.total_seconds yields a too large value, probably due to a rounding issue.

In this example,

  • starttime = 5_548_888_800_000_000_000
  • endtime = 5_548_892_760_000_000_000
  • diff = 3_960_000_000_000

Since 1_000_000_000 divides the diff, the result should be precisely 3960 seconds, which is exactly representable as a float, however the dt.total_seconds seems to accidentally round up:

np.set_printoptions(precision=100)
print(np.frexp(a))   # mantissa: 0.9667968750000001
print(np.frexp(b))   # mantissa: 0.966796875
print(np.frexp(c))   # mantissa: 0.966796875
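
Indeed, 3960 is an integer below 2**53, so float64 represents it exactly; a quick check:

assert float(3960) == 3960           # exact: 3960 is an integer below 2**53
print((3960.0).hex())                # 0x1.ef00000000000p+11, a terminating binary fraction
assert 0.966796875 * 2**12 == 3960   # the frexp mantissa above recovers 3960 exactly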

However, curiously:

endtime = pd.Timestamp("2145-11-02 07:06:00")
starttime = pd.Timestamp("2145-11-02 06:00:00")
np.frexp((endtime - starttime).total_seconds())  # mantissa: 0.966796875

So the issue might be specific to the array code path behind the .dt accessor, since the scalar Timedelta.total_seconds() is exact?

Expected Behavior

dt.total_seconds() should agree with the NumPy result, i.e. exactly 3960.0 in this example.

Installed Versions

INSTALLED VERSIONS

commit : ca60aab
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-40-generic
Version : #45~20.04.1-Ubuntu SMP Mon Apr 4 09:38:31 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.4
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 65.3.0
pip : 22.2.2
Cython : 0.29.32
pytest : 7.1.3
hypothesis : None
sphinx : 5.1.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli :
fastparquet : 0.8.3
fsspec : 2022.7.1
gcsfs : None
markupsafe : 2.1.1
matplotlib : 3.5.3
numba : None
numexpr : 2.8.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.1
snappy : None
sqlalchemy : 1.4.40
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.6.0
xlrd : None
xlwt : None
zstandard : None

randolf-scholz added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Sep 13, 2022
randolf-scholz (Contributor, Author) commented:

Potentially relevant code sections:

  • pps = periods_per_second(self._reso)
    return self._maybe_mask_results(self.asi8 / pps, fill_value=None)

  • cpdef int64_t periods_per_second(NPY_DATETIMEUNIT reso) except? -1:
        return get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_s, reso)

  • @cython.overflowcheck(True)
    cdef int64_t get_conversion_factor(NPY_DATETIMEUNIT from_unit, NPY_DATETIMEUNIT to_unit) except? -1:
        """
        Find the factor by which we need to multiply to convert from from_unit to to_unit.
        """
        if (
            from_unit == NPY_DATETIMEUNIT.NPY_FR_GENERIC
            or to_unit == NPY_DATETIMEUNIT.NPY_FR_GENERIC
        ):
            raise ValueError("unit-less resolutions are not supported")
        if from_unit > to_unit:
            raise ValueError
        if from_unit == to_unit:
            return 1

        if from_unit == NPY_DATETIMEUNIT.NPY_FR_W:
            return 7 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_D, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_D:
            return 24 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_h, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_h:
            return 60 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_m, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_m:
            return 60 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_s, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_s:
            return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_ms, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_ms:
            return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_us, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_us:
            return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_ns, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_ns:
            return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_ps, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_ps:
            return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_fs, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_fs:
            return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_as, to_unit)

seljaks (Contributor) commented Sep 20, 2022

Hi @randolf-scholz, I ran your example on 1.4.4 and got the same error. However, it doesn't appear on 1.5.0 or on the main branch, so it seems this bug has already been fixed.

MarcoGorelli (Member) commented Sep 20, 2022

Thanks for the report! From git bisect it first looked like this was fixed in #46828, but that didn't seem right, so I took another look.

Looks like it was #47421.

git bisect: https://www.kaggle.com/code/marcogorelli/pandas-regression?scriptVersionId=106198797

MarcoGorelli (Member) commented:

Looks like this is due to using 1_000_000_000 instead of 1e9

There is

https://github.com/jbrockmendel/pandas/blob/2f6116f259b88f9a6684fdf0ffe0e45078439602/pandas/core/arrays/timedeltas.py#L821-L823

which exercises this in a doctest, but it's probably better to have a proper unit test rather than relying on a doctest.
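
The arithmetic behind that: 1e9 is an integer below 2**53 and hence an exact float64, while 1e-9 has no finite binary representation, so converting via 1e-9 can be off by one ulp. A standalone sketch of the three conversions (illustrative only, not the pandas source):

import numpy as np

ns = np.int64(3_960_000_000_000)     # the 66-minute diff in nanoseconds

print(ns / np.int64(1_000_000_000))  # 3960.0 -- both operands exact in float64
print(ns / 1e9)                      # 3960.0 -- 1e9 is exactly representable
print(ns * 1e-9)                     # 3960.0000000000005 -- 1e-9 is not: one-ulp error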

@randolf-scholz fancy submitting a PR?

MarcoGorelli added the Needs Tests (unit test(s) needed to prevent regressions) label and removed the Bug and Needs Triage labels on Sep 20, 2022
vagechirkov added a commit to vagechirkov/pandas that referenced this issue on Apr 18, 2023
mroeschke pushed a commit that referenced this issue on Apr 19, 2023: "add test to reproduce bug in the issue #48521"