Timestamp incorrectly handles datetime64 with exotic units #25611

Open
cbarrick opened this issue Mar 8, 2019 · 5 comments
Labels
Bug · Error Reporting (Incorrect or improved errors from pandas) · Non-Nano (datetime64/timedelta64 with non-nanosecond resolution) · Timeseries

Comments

cbarrick commented Mar 8, 2019

Code Sample, a copy-pastable example if possible

>>> pd.Timestamp(np.datetime64('2019-01-01', '6h'))
Timestamp('1978-03-02 20:00:00')

Problem description

The pd.Timestamp constructor gives the wrong result when converting a np.datetime64 that uses a multiple of a standard unit. In the example above, I create a datetime64 with units of 6h, but the conversion appears to drop the multiplier and interpret the underlying count as plain h.

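The underlying integer of the datetime64 is the count of 6-hour periods since the epoch; reinterpreting that count as plain hours reproduces the reported value (an illustrative check, not a claim about pandas internals):

>>> import numpy as np, pandas as pd
>>> raw = int(np.datetime64('2019-01-01', '6h').astype('int64'))
>>> raw  # number of 6-hour periods since 1970-01-01
71588
>>> pd.Timestamp(raw, unit='h')
Timestamp('1978-03-02 20:00:00')
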
This happens with units of 6h but not with units of h:

>>> pd.Timestamp(np.datetime64('2019-01-01', 'h'))
Timestamp('2019-01-01 00:00:00')

Expected Output

Pandas should either perform the conversion correctly or raise a ValueError:

>>> pd.Timestamp(np.datetime64('2019-01-01', '6h'))
Timestamp('2019-01-01 00:00:00')

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-134-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: 0.11.3
IPython: None
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.2
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

cbarrick added a commit to cbarrick/apollo that referenced this issue Mar 8, 2019
This is a workaround for an issue in Pandas (pandas-dev/pandas#25611).
Plus the Timestamp API is easier to use than datetime64 anyway.

This does not touch the solar loader or models since those APIs are
rapidly evolving. I'd like to deprecate all uses of datetime64 though.
cbarrick added a commit to cbarrick/apollo that referenced this issue Mar 9, 2019
This is a workaround for an issue in Pandas (pandas-dev/pandas#25611).
Plus the Timestamp API is easier to use than datetime64 anyway.

This does not touch the solar loader or models since those APIs are
rapidly evolving. I'd like to deprecate all uses of datetime64 though.
chris-b1 added the Bug, Timeseries, and Error Reporting (Incorrect or improved errors from pandas) labels Mar 10, 2019
chris-b1 (Contributor) commented Mar 10, 2019

Not sure we want to do a lot to support these multiple units, but at a minimum we should raise an informative error - thanks for the report!

chris-b1 added this to the Contributions Welcome milestone Mar 10, 2019
cbarrick (Author) commented Mar 10, 2019

FWIW, I'll describe my use case that led to the bug.

I deal with weather forecasts that are released every six hours. Originally, our code base used np.datetime64 for timestamps, and the easiest way to truncate to the six hour mark was to use 6h units. When we switched to pd.Timestamp incrementally, we passed numpy datetimes to the constructor, and then discovered the bug.

The two features provided by the numpy behavior are truncation and type safety. For both cases, the Pandas way is to just call Timestamp.floor. So the exception message should probably mention Timestamp.floor as a workaround.

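For example, the floor-based workaround could look like this (a sketch; '6H' is the pandas offset alias for six hours):

>>> pd.Timestamp('2019-01-01 04:30').floor('6H')
Timestamp('2019-01-01 00:00:00')
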
Alternatively, I think we could support the exotic units without too much trouble. I'm not familiar with Pandas internals, but presumably we could use numpy to perform a conversion to the nearest supported unit, e.g. 6h to h, then proceed as usual. The edge case here is handling overflow.

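numpy can already do that cast, so (as a rough sketch of the idea, not of pandas internals) the constructor could normalize the unit up front and only then convert, raising if the cast would overflow:

>>> np.datetime64('2019-01-01', '6h').astype('datetime64[h]')
numpy.datetime64('2019-01-01T00','h')
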
mukundm19 commented:

Does this issue still need work? I would be happy to look into this further and work on the error message as previously mentioned.

darynwhite commented May 23, 2019

I'm running into a similar situation with 10-minute data from our buoys.

Here is the input numpy array:

In [21]: atIndex                                                                                                                            
Out[21]: 
array(['2018-03-23T15:10', '2018-03-23T15:20', '2018-03-23T15:30', ...,
       '2019-03-17T10:30', '2019-03-17T10:40', '2019-03-17T10:50'],
      dtype='datetime64[10m]')

And here is what happens when I attempt to make a pandas.DatetimeIndex with it:

In [23]: pandas.DatetimeIndex(atIndex)                                                                                                      
Out[23]: 
DatetimeIndex(['1974-10-28 08:43:00', '1974-10-28 08:44:00',
               '1974-10-28 08:45:00', '1974-10-28 08:46:00',
               '1974-10-28 08:47:00', '1974-10-28 08:48:00',
               '1974-10-28 08:49:00', '1974-10-28 08:50:00',
               '1974-10-28 08:51:00', '1974-10-28 08:52:00',
               ...
               '1974-12-03 05:44:00', '1974-12-03 05:45:00',
               '1974-12-03 05:46:00', '1974-12-03 05:47:00',
               '1974-12-03 05:48:00', '1974-12-03 05:49:00',
               '1974-12-03 05:50:00', '1974-12-03 05:51:00',
               '1974-12-03 05:52:00', '1974-12-03 05:53:00'],
              dtype='datetime64[ns]', length=51671, freq=None)

Is there a workaround for this that has been discovered yet?

Answered my own question with some trial and error. Workaround:

In [44]: pandas.DatetimeIndex(atIndex.astype('datetime64[ns]'))                                                                             
Out[44]: 
DatetimeIndex(['2018-03-23 15:10:00', '2018-03-23 15:20:00',
               '2018-03-23 15:30:00', '2018-03-23 15:40:00',
               '2018-03-23 15:50:00', '2018-03-23 16:00:00',
               '2018-03-23 16:10:00', '2018-03-23 16:20:00',
               '2018-03-23 16:30:00', '2018-03-23 16:40:00',
               ...
               '2019-03-17 09:20:00', '2019-03-17 09:30:00',
               '2019-03-17 09:40:00', '2019-03-17 09:50:00',
               '2019-03-17 10:00:00', '2019-03-17 10:10:00',
               '2019-03-17 10:20:00', '2019-03-17 10:30:00',
               '2019-03-17 10:40:00', '2019-03-17 10:50:00'],
              dtype='datetime64[ns]', length=51671, freq=None)

Perhaps this sort of type casting could be used if/when the input datetime array has a multiple of a standard unit?

TomAugspurger (Contributor) commented:

Casting seems fine if it's lossless. But if the values can't be represented correctly as datetime64[ns] then we should raise (I suspect we do).

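A minimal sketch of such a losslessness check (a hypothetical helper, not pandas API) could round-trip through the original dtype and raise when the values cannot survive the cast:

import numpy as np

def to_ns_checked(arr):
    # Cast to nanosecond precision, then cast back to the original dtype.
    # If the round trip changes any value, the data cannot be represented
    # as datetime64[ns] (e.g. it overflowed), so raise instead of wrapping.
    # NaT handling is ignored for brevity.
    out = arr.astype('datetime64[ns]')
    if not (out.astype(arr.dtype) == arr).all():
        raise OverflowError('values cannot be represented as datetime64[ns]')
    return out
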
jbrockmendel added the Non-Nano (datetime64/timedelta64 with non-nanosecond resolution) label Jan 10, 2022
mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022