Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Data corruption when mode is performed on timedelta64 dtype in pandas-2.0 #53497

Closed
2 of 3 tasks
galipremsagar opened this issue Jun 2, 2023 · 1 comment · Fixed by #53505
Closed
2 of 3 tasks
Labels
Bug Non-Nano datetime64/timedelta64 with non-nanosecond resolution

Comments

@galipremsagar
Copy link

galipremsagar commented Jun 2, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [22]: df = pd.DataFrame(
    ...:             {
    ...:                 "a": ["hello", "world", "pandas", "numpy", "df"],
    ...:                 "b": pd.Series(
    ...:                     np.array([1, 21, 21, 11, 11],
    ...:                     dtype="timedelta64[s]"),
    ...:                     index=["a", "b", "c", "d", " e"],
    ...:                 ),
    ...:             },
    ...:             index=["a", "b", "c", "d", " e"],
    ...:         )

In [23]: df
Out[23]: 
         a               b
a    hello 0 days 00:00:01
b    world 0 days 00:00:21
c   pandas 0 days 00:00:21
d    numpy 0 days 00:00:11
 e      df 0 days 00:00:11

In [24]: df.mode()
Out[24]: 
        a                    b
0      df      0 days 00:00:11
1   hello      0 days 00:00:21
2   numpy 106751 days 23:47:15 <--  # Expected NaT
3  pandas 106751 days 23:47:15 <--  # Expected NaT
4   world 106751 days 23:47:15 <--  # Expected NaT

Issue Description

In pandas-2.0, when we perform a mode operation on timedelta64 dtype, it looks like NaT values are not being returned correctly. For example see the pandas-1.5.x behavior below:

In [27]: df = pd.DataFrame(
    ...:             {
    ...:                 "a": ["hello", "world", "pandas", "numpy", "df"],
    ...:                 "b": pd.Series(
    ...:                     [1, 21, 21, 11, 11],
    ...:                     dtype="timedelta64[ns]",
    ...:                     index=["a", "b", "c", "d", " e"],
    ...:                 ),
    ...:             },
    ...:             index=["a", "b", "c", "d", " e"],
    ...:         )

In [28]: df.mode()
Out[28]: 
        a                         b
0      df 0 days 00:00:00.000000011
1   hello 0 days 00:00:00.000000021
2   numpy                       NaT
3  pandas                       NaT
4   world                       NaT

Expected Behavior

In [24]: df.mode()
Out[24]: 
        a                    b
0      df      0 days 00:00:11
1   hello      0 days 00:00:21
2   numpy                  NaT
3  pandas                  NaT
4   world                  NaT

Installed Versions

INSTALLED VERSIONS

commit : 965ceca
python : 3.10.11.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
Version : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.2
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 0.29.35
pytest : 7.3.1
hypothesis : 6.75.7
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.13.2
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.5.0
gcsfs : None
matplotlib : None
numba : 0.57.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.5.0
scipy : 1.10.1
snappy :
sqlalchemy : 2.0.15
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@galipremsagar galipremsagar added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 2, 2023
@asishm-wk
Copy link

Simpler repro:

import pandas as pd
df = pd.DataFrame(
    {
        '0': pd.Series(['a', 'b', 'c']),
        '1': pd.Series(
            np.array([1, 21],
            dtype="timedelta64[s]"),
        )
    }
)
df
# outputs
   0                    1
0  a      0 days 00:00:01
1  b      0 days 00:00:21
2  c 106751 days 23:47:15

@mroeschke mroeschke added Non-Nano datetime64/timedelta64 with non-nanosecond resolution and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 2, 2023
galipremsagar added a commit to rapidsai/cudf that referenced this issue Jun 6, 2023
This PR xfails a condition that is failing due to a pandas bug: pandas-dev/pandas#53497
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Non-Nano datetime64/timedelta64 with non-nanosecond resolution
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants