Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix wrong save of datetime64[s] in HDFStore #59018

Merged
merged 9 commits into from
Jun 18, 2024

Conversation

chaoyihu
Copy link
Contributor

@chaoyihu chaoyihu commented Jun 14, 2024

pandas/io/pytables.py Outdated Show resolved Hide resolved
@mroeschke mroeschke added IO HDF5 read_hdf, HDFStore Non-Nano datetime64/timedelta64 with non-nanosecond resolution labels Jun 14, 2024
@chaoyihu
Copy link
Contributor Author

chaoyihu commented Jun 14, 2024

Thanks for reviewing my code. I've accepted your suggestions and made new commits.

Why did the CI checks fail?

pandas/io/pytables.py Outdated Show resolved Hide resolved
pandas/io/pytables.py Outdated Show resolved Hide resolved
@jbrockmendel
Copy link
Member

Might also close #57085?

@chaoyihu
Copy link
Contributor Author

Might also close #57085?

Will look into that.

@chaoyihu
Copy link
Contributor Author

chaoyihu commented Jun 15, 2024

The code passed new tests but has regression failure on an old test case in tests/io/pytables/test_round_trip.py::test_table_values_dtypes_roundtrip.

Digging deeper, the current code base seems to be consistently using ns as the time unit, and lots of code are written on top of this assumption. I'm concerned that changing dtype of the _set_tz return value may break this consistency and produce more unwanted effects.

So the alternative I'm thinking is to do a forceful conversion on user input to nanosecond, or issue a warning when user tries to use other dtypes besides datetime64[ns] in a DataFrame. Any thoughts?

@chaoyihu
Copy link
Contributor Author

Might also close #57085?

I see, thanks @jbrockmendel for linking this issue. If I understand the discussion correctly, non-nano resolutions are not yet fully integrated with the current code base, but changes are intended.

And yes, this fix might also resolve 57085. I think I'll keep working to add support for non-nano time units in HDFStore.

@@ -465,7 +465,7 @@ Timedelta

Timezones
^^^^^^^^^
-
- :meth:`io::pytables::_set_tz` was failing to parse datetime64 dtypes correctly (:issue:`59004`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use your older whatsnew note instead of this one? Also could you put this in the IO section?

@@ -345,3 +347,22 @@ def test_read_with_where_tz_aware_index(tmp_path, setup_path):
store.append(key, expected, format="table", append=True)
result = pd.read_hdf(path, key, where="DATE > 20151130")
tm.assert_frame_equal(result, expected)


def test_set_tz_datetime64_unit(tmp_path, setup_path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use your older test instead of this one? We prefer tests that exercise the public API

@mroeschke
Copy link
Member

Any thoughts?

If you're up for it, I think the entire HDF code needs to be updated to support non-nanosecond resolutions i.e. not always coerce things to nanoseconds

@chaoyihu chaoyihu force-pushed the hdfstore-datetime64-wrong-unit branch from 0e4abde to bf3ee62 Compare June 15, 2024 21:49
@chaoyihu
Copy link
Contributor Author

chaoyihu commented Jun 15, 2024

I think the code is working now. The CI test xfails on test_to_xarray.py and the reason points to pydata/xarray#9026, which has been fixed in the latest xarray release.

I upgraded to xarray==2024.6 in my local dev environment and the results from pytest pandas/tests/generic/test_to_xarray.py look fine:

================================================= test session starts =================================================
platform linux -- Python 3.10.14, pytest-8.0.2, pluggy-1.4.0
PyQt5 5.15.9 -- Qt runtime 5.15.8 -- Qt compiled 5.15.8
rootdir: /home/pandas
configfile: pyproject.toml
plugins: xvfb-3.0.0, cov-5.0.0, timeout-2.3.1, rerunfailures-14.0, anyio-4.2.0, xdist-3.6.1, cython-0.3.1, localserver-0.0.0, hypothesis-6.103.0, qt-4.4.0
collected 64 items                                                                                                    

pandas/tests/generic/test_to_xarray.py ......................s.........................................

-------------------------------- generated xml file: /home/pandas/test-data.xml --------------------------------
...
============================================ 63 passed, 1 skipped in 0.83s ============================================

@mroeschke mroeschke added this to the 3.0 milestone Jun 18, 2024
@mroeschke mroeschke merged commit a4c9446 into pandas-dev:main Jun 18, 2024
45 of 47 checks passed
@mroeschke
Copy link
Member

Thanks @chaoyihu

@chaoyihu chaoyihu deleted the hdfstore-datetime64-wrong-unit branch June 19, 2024 00:05
@chaoyihu chaoyihu restored the hdfstore-datetime64-wrong-unit branch June 19, 2024 00:06
@chaoyihu chaoyihu deleted the hdfstore-datetime64-wrong-unit branch June 19, 2024 00:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HDF5 read_hdf, HDFStore Non-Nano datetime64/timedelta64 with non-nanosecond resolution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: HDFStore doesn't save datetime64[s] right
3 participants