
API/BUG: hashing of datetimes is based on UTC values #16372

Open
jreback opened this issue May 16, 2017 · 6 comments
Labels
Bug · Datetime (Datetime data dtype) · hashing (hash_pandas_object) · Needs Discussion (Requires discussion from core team before further action)

Comments

@jreback
Contributor

jreback commented May 16, 2017

These are 3 different 'views' of the same time. We DO disambiguate these elsewhere in pandas. So we should do so when hashing as well.

xref #16346

In [1]: from pandas.util import hash_pandas_object

In [8]: hash_pandas_object(pd.date_range('20130101', periods=3, tz='UTC').tz_convert('US/Eastern'))
Out[8]: 
2012-12-31 19:00:00-05:00     4326795898974544501
2013-01-01 19:00:00-05:00     2833560015380952180
2013-01-02 19:00:00-05:00    14913883737423839247
Freq: D, dtype: uint64

In [9]: hash_pandas_object(pd.date_range('20130101', periods=3, tz='UTC'))
Out[9]: 
2013-01-01 00:00:00+00:00     4326795898974544501
2013-01-02 00:00:00+00:00     2833560015380952180
2013-01-03 00:00:00+00:00    14913883737423839247
Freq: D, dtype: uint64

In [10]: hash_pandas_object(pd.date_range('20130101', periods=3))
Out[10]: 
2013-01-01     4326795898974544501
2013-01-02     2833560015380952180
2013-01-03    14913883737423839247
Freq: D, dtype: uint64
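The collision is easy to check directly. A minimal sketch (assuming the behavior reported here, where hashing is based only on the underlying UTC i8 values; `index=False` so the index itself is not hashed):

```python
import pandas as pd
from pandas.util import hash_pandas_object

utc = pd.Series(pd.date_range('20130101', periods=3, tz='UTC'))
eastern = utc.dt.tz_convert('US/Eastern')  # same instants, different tz
naive = pd.Series(pd.date_range('20130101', periods=3))

# Hash only the values, not the index.
h_utc = hash_pandas_object(utc, index=False)
h_eastern = hash_pandas_object(eastern, index=False)
h_naive = hash_pandas_object(naive, index=False)

# All three 'views' collide because only the UTC i8 values are hashed.
print((h_utc == h_eastern).all())
print((h_utc == h_naive).all())
```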

@jreback jreback added API Design, IO (Data IO issues that don't fit into a more specific label), Difficulty Intermediate, Numeric Operations (Arithmetic, Comparison, and Logical operations), Datetime (Datetime data dtype) labels May 16, 2017
@jreback jreback added this to the 0.21.0 milestone May 16, 2017
@jreback
Contributor Author

jreback commented May 16, 2017

cc @mrocklin

@TomAugspurger @jorisvandenbossche

from a practical perspective I don't think this makes a whole lot of difference, but should fix to be correct.

@jorisvandenbossche
Member

jorisvandenbossche commented May 17, 2017

What alternative do you think of to hash on? Hash the timezone separately and combine the hashes?

@jreback
Contributor Author

jreback commented May 17, 2017

I think you could do something like this.

In [1]: df = pd.DataFrame({'tz': pd.date_range('20130101', periods=3, tz='UTC').tz_convert('US/Eastern'),
   ...: 'utc': pd.date_range('20130101', periods=3, tz='UTC'),
   ...: 'naive': pd.date_range('20130101', periods=3)})

In [2]: df
Out[2]: 
       naive                        tz                       utc
0 2013-01-01 2012-12-31 19:00:00-05:00 2013-01-01 00:00:00+00:00
1 2013-01-02 2013-01-01 19:00:00-05:00 2013-01-02 00:00:00+00:00
2 2013-01-03 2013-01-02 19:00:00-05:00 2013-01-03 00:00:00+00:00

In [3]: from pandas.util import hash_pandas_object

In [6]: hash_pandas_object(pd.DataFrame({'tz':df['tz'],'zone':df['tz'].dt.tz}), index=False)
Out[6]: 
0    11960632900184590671
1    17909201100930397932
2      244240496600445005
dtype: uint64

In [7]: hash_pandas_object(pd.DataFrame({'utc':df['tz'],'zone':df['utc'].dt.tz}), index=False)
Out[7]: 
0     557885042773898185
1    1996380570925580138
2    5435501107539799243
dtype: uint64

In [8]: hash_pandas_object(pd.DataFrame({'naivec':df['naive']}), index=False)
Out[8]: 
0    14376405836841727586
1     1052390041072582175
2    12596642793234779168
dtype: uint64

IOW, hash the tz as an additional column and combine (which is what we do with a DataFrame with index=False).

This would break backward compat for tz-aware values, but (and maybe we should document this more) this is version-to-version hashing; it is not (necessarily) designed to be backward compatible.
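A rough sketch of that proposal as a helper (hypothetical name `hash_with_tz`, not pandas API): add the tz name as a second column and let the DataFrame machinery combine the per-column hashes row-wise, so the three views no longer collide:

```python
import pandas as pd
from pandas.util import hash_pandas_object

def hash_with_tz(ser):
    # Hypothetical helper: pair the values with their timezone (as a
    # string, empty for naive) and let DataFrame hashing combine the
    # per-column hashes row-wise.
    tz = ser.dt.tz
    zone = str(tz) if tz is not None else ''
    return hash_pandas_object(
        pd.DataFrame({'values': ser, 'zone': zone}), index=False
    )

utc = pd.Series(pd.date_range('20130101', periods=3, tz='UTC'))
eastern = utc.dt.tz_convert('US/Eastern')
naive = pd.Series(pd.date_range('20130101', periods=3))

# The value hashes are identical, but the zone column differs,
# so the combined hashes differ.
print((hash_with_tz(utc) != hash_with_tz(eastern)).all())
```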

@jreback jreback modified the milestones: 0.21.0, Interesting Issues Sep 23, 2017
@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
@TomAugspurger
Contributor

I suppose it depends on what people are using the hashing for. Suppose I have hashed values a and a pandas object x.

  1. If hash_pandas_object(x) == a, then x may be the same as the object that originally hashed to a.
  2. If hash_pandas_object(x) != a, then x is not the same as the object that originally hashed to a.

To me, the most common use case is likely storing hashed values somewhere and wanting to answer "are these new values the same as what I have hashed?", so a stronger form of 1. (is the same instead of may be the same).

So I think it's on pandas to either mix the dtype information into the hash somehow, or provide guidance that you should store the original dtype along with the hashed values.

IOW, hash the tz as an additional column and combine (which is what we do with a DataFrame with index=False).

Hashing an extra column seems wasteful. I'd rather have some kind of stable map of each type and do a bit-shift on each type after hashing.

# sketch: a stable integer code for each dtype
type_map = {
    int: 0,
    float: 1,
    ...
}

h = hash_array(obj.values, encoding, hash_key,
               categorize).astype('uint64', copy=False)
# shift by the dtype's code so equal bit patterns with
# different dtypes come out different
h >>= type_map[obj.dtype]

Building that type map is tricky (impossible?), because of parametrized types, 3rd-party extension types...

version-to-version hashing, it is not (necessarily) designed to be backward compat.

We should explicitly state that hashing can change between versions. Maintaining that seems like it would be a nightmare.

@jorisvandenbossche
Member

To me, the most common use case is likely storing hashed values somewhere and wanting to answer "are these new values the same as what I have hashed?", so a stronger form of 1. (is the same instead of may be the same).

Yes, I agree with that (that is e.g. what joblib uses hashing for).

Building that type map is tricky (impossible?), because of parameterize types, 3rd party extension types...

I am not familiar with how hash values are calculated. But would it be possible to somehow combine the hash of the dtype with the hash of the values?
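One way that could look, without hashing an extra full column: hash the dtype's string repr once and mix that single uint64 into every row's value hash. A sketch (hypothetical helper, not pandas API; XOR is just one possible mixing step):

```python
import numpy as np
import pandas as pd
from pandas.util import hash_array

def hash_with_dtype(ser):
    # Per-row value hashes (uint64); for datetimes these are based on
    # the underlying UTC i8 values, so the three views collide here.
    h = hash_array(ser.array)
    # One extra hash for the dtype's string repr, e.g.
    # 'datetime64[ns, US/Eastern]' vs 'datetime64[ns]'.
    dtype_h = hash_array(np.array([str(ser.dtype)], dtype=object))[0]
    # XOR the single dtype hash into every row hash; equal value
    # hashes with different dtypes now come out different.
    return h ^ dtype_h

utc = pd.Series(pd.date_range('20130101', periods=3, tz='UTC'))
eastern = utc.dt.tz_convert('US/Eastern')
naive = pd.Series(pd.date_range('20130101', periods=3))

print((hash_with_dtype(utc) != hash_with_dtype(eastern)).all())
```

This only costs one scalar hash per column rather than a whole extra column, though it still leaves the question of whether `str(dtype)` is stable enough across versions.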

@jorisvandenbossche jorisvandenbossche modified the milestones: Contributions Welcome, 0.24.0 Oct 18, 2018
@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Nov 6, 2018
@jbrockmendel jbrockmendel removed the IO (Data IO issues that don't fit into a more specific label) label Dec 11, 2019
@mroeschke mroeschke added the Bug label Mar 31, 2020
@mroeschke mroeschke added hashing (hash_pandas_object), Needs Discussion (Requires discussion from core team before further action) and removed Numeric Operations (Arithmetic, Comparison, and Logical operations) labels Jun 12, 2021
@jbrockmendel
Member

Was a satisfactory solution ever found for this? Looks like this is about hash_pandas_object and DatetimeIndex, but I'm looking at Timestamp.__hash__ and it ignores the tz too. Current motivation is to adapt for non-nano.
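For `Timestamp.__hash__` specifically, ignoring the tz is arguably forced by stdlib semantics: two aware datetimes denoting the same instant compare equal, and Python requires equal objects to hash equal. A quick check:

```python
import pandas as pd

ts_utc = pd.Timestamp('2013-01-01', tz='UTC')
ts_est = ts_utc.tz_convert('US/Eastern')

# Same instant, different tz: equality (and therefore hashing) must
# agree, matching datetime.datetime behaviour for aware values.
print(ts_utc == ts_est)              # True
print(hash(ts_utc) == hash(ts_est))  # True
```

So any tz-sensitive scheme would likely have to live in `hash_pandas_object` / `hash_array` rather than in `Timestamp.__hash__` itself.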

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
5 participants