BUG: dt.total_seconds() gives the wrong number of seconds #48521

Status: Closed · Fixed by #52731

randolf-scholz opened this issue Sep 13, 2022 · 5 comments
Labels: Needs Tests (unit test(s) needed to prevent regressions)

Comments

randolf-scholz (Contributor) commented Sep 13, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
starttime = pd.Series(["2145-11-02 06:00:00"]).astype("datetime64[ns]")
endtime = pd.Series(["2145-11-02 07:06:00"]).astype("datetime64[ns]")
diff = endtime - starttime
assert diff.values.item() == 3960000000000
a = (endtime - starttime).dt.total_seconds().values
b = (endtime - starttime).values.astype(int) / 1_000_000_000
c = (endtime - starttime).values / np.timedelta64(1, "s")
assert b == c, f"{c-b}"  # ✔
assert a == c, f"{a-c}"  # ✘ AssertionError: [4.54747351e-13]

Issue Description

I noticed this when I was trying to reproduce a preprocessing pipeline for some dataset. (Don't mind the weird dates, they just come from some de-identified data).

It seems that dt.total_seconds yields a too large value, probably due to a rounding issue.

In this example,

  • starttime = 5_548_888_800_000_000_000
  • endtime = 5_548_892_760_000_000_000
  • diff = 3_960_000_000_000

Since 1_000_000_000 divides the diff, the result should be precisely 3960 seconds, which is exactly representable as a float, however the dt.total_seconds seems to accidentally round up:

np.set_printoptions(precision=100)
print(np.frexp(a))   # mantissa: 0.9667968750000001
print(np.frexp(b))   # mantissa: 0.966796875
print(np.frexp(c))   # mantissa: 0.966796875
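
Indeed, 3960 is an integer below 2**53, so float64 represents it exactly; a quick check:

assert float(3960) == 3960           # exact: 3960 is an integer below 2**53
print((3960.0).hex())                # 0x1.ef00000000000p+11, a terminating binary fraction
assert 0.966796875 * 2**12 == 3960   # the frexp mantissa above recovers 3960 exactly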

However, curiously:

endtime = pd.Timestamp("2145-11-02 07:06:00")
starttime = pd.Timestamp("2145-11-02 06:00:00")
np.frexp((endtime - starttime).total_seconds())  # mantissa: 0.966796875

So the issue might be specific to the array code path behind the .dt accessor, since the scalar Timedelta.total_seconds() is exact?

Expected Behavior

dt.total_seconds() should agree with the NumPy result, i.e. exactly 3960.0 in this example.

Installed Versions

INSTALLED VERSIONS

commit : ca60aab
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-40-generic
Version : #45~20.04.1-Ubuntu SMP Mon Apr 4 09:38:31 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.4
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 65.3.0
pip : 22.2.2
Cython : 0.29.32
pytest : 7.1.3
hypothesis : None
sphinx : 5.1.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli :
fastparquet : 0.8.3
fsspec : 2022.7.1
gcsfs : None
markupsafe : 2.1.1
matplotlib : 3.5.3
numba : None
numexpr : 2.8.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.1
snappy : None
sqlalchemy : 1.4.40
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.6.0
xlrd : None
xlwt : None
zstandard : None

randolf-scholz added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Sep 13, 2022
randolf-scholz (Contributor, Author) commented:

Potentially relevant code sections:

  • pps = periods_per_second(self._reso)
    return self._maybe_mask_results(self.asi8 / pps, fill_value=None)

  • cpdef int64_t periods_per_second(NPY_DATETIMEUNIT reso) except? -1:
        return get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_s, reso)

  • @cython.overflowcheck(True)
    cdef int64_t get_conversion_factor(NPY_DATETIMEUNIT from_unit, NPY_DATETIMEUNIT to_unit) except? -1:
        """
        Find the factor by which we need to multiply to convert from from_unit to to_unit.
        """
        if (
            from_unit == NPY_DATETIMEUNIT.NPY_FR_GENERIC
            or to_unit == NPY_DATETIMEUNIT.NPY_FR_GENERIC
        ):
            raise ValueError("unit-less resolutions are not supported")
        if from_unit > to_unit:
            raise ValueError
        if from_unit == to_unit:
            return 1

        if from_unit == NPY_DATETIMEUNIT.NPY_FR_W:
            return 7 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_D, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_D:
            return 24 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_h, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_h:
            return 60 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_m, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_m:
            return 60 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_s, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_s:
            return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_ms, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_ms:
            return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_us, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_us:
            return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_ns, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_ns:
            return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_ps, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_ps:
            return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_fs, to_unit)
        elif from_unit == NPY_DATETIMEUNIT.NPY_FR_fs:
            return 1000 * get_conversion_factor(NPY_DATETIMEUNIT.NPY_FR_as, to_unit)

seljaks (Contributor) commented Sep 20, 2022

Hi @randolf-scholz, I ran your example on 1.4.4 and got the same error. However, it doesn't appear on 1.5.0 or on the main branch, so it seems this bug has already been fixed.

MarcoGorelli (Member) commented Sep 20, 2022

Thanks for the report! From git bisect it first looked like this was fixed in #46828, but that didn't seem right, so I took another look.

Looks like it was #47421.

git bisect: https://www.kaggle.com/code/marcogorelli/pandas-regression?scriptVersionId=106198797

MarcoGorelli (Member) commented:

Looks like this is due to using 1_000_000_000 instead of 1e9

There is

https://github.com/jbrockmendel/pandas/blob/2f6116f259b88f9a6684fdf0ffe0e45078439602/pandas/core/arrays/timedeltas.py#L821-L823

which exercises this in a doctest, but it's probably better to have a proper unit test rather than relying on a doctest.
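
The arithmetic behind that: 1e9 is an integer below 2**53 and hence an exact float64, while 1e-9 has no finite binary representation, so converting via 1e-9 can be off by one ulp. A standalone sketch of the three conversions (illustrative only, not the pandas source):

import numpy as np

ns = np.int64(3_960_000_000_000)     # the 66-minute diff in nanoseconds

print(ns / np.int64(1_000_000_000))  # 3960.0 -- both operands exact in float64
print(ns / 1e9)                      # 3960.0 -- 1e9 is exactly representable
print(ns * 1e-9)                     # 3960.0000000000005 -- 1e-9 is not: one-ulp error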

@randolf-scholz fancy submitting a PR?

MarcoGorelli added the Needs Tests (unit test(s) needed to prevent regressions) label and removed the Bug and Needs Triage labels on Sep 20, 2022
vagechirkov added a commit to vagechirkov/pandas that referenced this issue on Apr 18, 2023
mroeschke pushed a commit that referenced this issue on Apr 19, 2023: "add test to reproduce bug in the issue #48521"