Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np

df = pd.DataFrame(
    data={
        "date_time": pd.to_datetime(
            ["2020-01-11 23:59:59.999999", "2020-01-01", np.nan],
            errors="coerce",
            format="%Y-%m-%d %H:%M:%S.%f",
        ),
        "string": ["should_fail", "1999-11-03 15:52:48.123456", ""],
        "junk": ["", "", ""],
        "item_type": ["A", "B", "C"],
    }
)
# Using coerce so we get some NaT values to reproduce the error
df["string"] = pd.to_datetime(df["string"], errors="coerce", format="%Y-%m-%d %H:%M:%S.%f")
df["junk"] = pd.to_datetime(df["junk"], errors="coerce", format="%Y-%m-%d %H:%M:%S.%f")
df["date_time"][0].nanosecond
# 0

# Yields `max` values, both at the microsecond grain
df[["date_time", "string"]].max()
# date_time   2020-01-11 23:59:59.999999
# string      1999-11-03 15:52:48.123456
# dtype: datetime64[ns]

# Yields `max` values at the nanosecond grain. Expected the nanoseconds to be
# zero-filled (.999999000) but received 2020-01-11 23:59:59.999998976,
# which is 24 ns below the original value
df[["date_time", "string"]].max(axis=1)
# 0   2020-01-11 23:59:59.999998976
# 1   2020-01-01 00:00:00.000000000

df[["date_time", "junk"]].max()
# date_time   2020-01-11 23:59:59.999998976
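The specific value 2020-01-11 23:59:59.999998976 can be reproduced by round-tripping the timestamp's underlying int64 nanosecond count through float64, which suggests (my interpretation, not confirmed against the pandas internals) that the NaT-skipping reduction path converts to float64 along the way:

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp("2020-01-11 23:59:59.999999")
print(ts.value)                   # 1578787199999999000, ns since the epoch
print(int(np.float64(ts.value)))  # 1578787199999998976, matching the max(axis=1) output
```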
Issue Description
Hey pandas maintainers, I found what feels like an edge case in Timestamp nanosecond behavior when calling DataFrame max() on timestamp columns that contain NaT values.
The example above has a microsecond-grained timestamp, 2020-01-11 23:59:59.999999, whose nanosecond attribute is 0 when retrieved. But when a max() aggregation over a timestamp row or column produces output at nanosecond precision, the result is suddenly 24 nanoseconds off. This seems unexpected given that the attribute was previously zero.
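The size of the error is consistent with a float64 precision limit: nanosecond epoch values in 2020 are around 1.58e18, far beyond float64's 2^53 exact-integer range, and at that magnitude adjacent doubles are 256 ns apart. A minimal check (the float64 involvement is my assumption about the cause, not something confirmed in the pandas source):

```python
import numpy as np

v = 1578787199999999000  # 2020-01-11 23:59:59.999999 in ns since the epoch
print(2**53 < v)                  # True: beyond float64's exact-integer range
print(np.spacing(np.float64(v)))  # 256.0: adjacent float64 values are 256 ns apart here
```

So any microsecond-precision timestamp in this era that passes through float64 can land up to 128 ns away from its true value, which covers the observed 24 ns discrepancy.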
Expected Behavior
If a timestamp's nanosecond attribute is zero, I would expect that to remain the case when the value is expanded to full nanosecond precision (nine fractional-second digits).
Installed Versions
Reproduced in two conda environments
INSTALLED VERSIONS
commit : 2cb9652
python : 3.7.13.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.2.4
numpy : 1.20.3
pytz : 2022.2.1
dateutil : 2.8.2
pip : 22.1.2
setuptools : 59.8.0
Cython : None
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 2022.8.2
fastparquet : None
gcsfs : None
matplotlib : 3.5.3
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 9.0.0
pyxlsb : None
s3fs : 0.4.2
scipy : 1.6.3
sqlalchemy : None
tables : None
tabulate : 0.8.10
xarray : None
xlrd : 2.0.1
xlwt : None
numba : 0.56.2
pandas : 1.5.1
numpy : 1.21.5
pytz : 2022.1
dateutil : 2.8.2
setuptools : 57.5.0
pip : 21.2.4
Cython : None
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : None
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli :
fastparquet : None
fsspec : 2022.7.1
gcsfs : None
matplotlib : 3.4.3
numba : 0.55.1
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.8.0
snappy :
sqlalchemy : 1.4.32
tables : None
tabulate : 0.8.9
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None
tzdata : None