
BUG: hash_pandas_object ignores column name values #46705

Open · lmeyerov opened this issue Apr 8, 2022 · 13 comments
Labels: Enhancement · hashing hash_pandas_object · Needs Discussion (requires discussion from core team before further action)

Comments

@lmeyerov commented Apr 8, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import hashlib, pandas as pd

df = pd.DataFrame({'s': ['a', 'b', 'c', 'd'], 'd': ['b', 'c', 'd', 'e'], 'i': [0, 2, 4, 6]})

df_renamed = df.rename(columns={'s': 'ss'})

hash_df = hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest()
hash_df_renamed = hashlib.sha256(pd.util.hash_pandas_object(df_renamed, index=True).values).hexdigest()

assert hash_df != hash_df_renamed

Issue Description

When hashing a DataFrame, the column names are ignored.

Expected Behavior

I expected DataFrames with different column names to hash differently. If ignoring column names is desired, I'd expect a default-off flag for excluding them from the hash calculation.

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.7.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.144+
Version : #1 SMP Tue Dec 7 09:58:10 PST 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.21.5
pytz : 2018.9
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : 0.29.28
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.6
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 5.5.0
pandas_datareader: 0.9.0
bs4 : 4.6.3
bottleneck : 1.3.4
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.13.3
pyarrow : 6.0.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.4.32
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 1.1.0
xlwt : 1.3.0
numba : 0.51.2

lmeyerov added the Bug and Needs Triage labels on Apr 8, 2022
rhshadrach added the hashing hash_pandas_object label on Apr 9, 2022
@rhshadrach (Member) commented

By doing .values you are getting the underlying NumPy array. This doesn't have column names.

@rhshadrach (Member) commented

E.g.

import pandas as pd

df = pd.DataFrame({'s': ['a', 'b', 'c', 'd'], 'd': ['b', 'c', 'd', 'e'], 'i': [0, 2, 4, 6]})
print(pd.util.hash_pandas_object(df, index=True).values)

gives

[ 7494370580092021340  5889295906979463579  1670643539686186963
 10982910506778653110]

@lmeyerov (Author) commented Apr 9, 2022

That makes sense

We're currently doing a second explicit round of hashing to include df.columns -- is there a more natural way to get a full hash? Ultimately the goal is to get a hash of the df, not the individual .values
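For reference, a minimal sketch of that two-round approach (the helper name hash_df_full is introduced here just for illustration):

import hashlib

import pandas as pd

def hash_df_full(df: pd.DataFrame) -> str:
    # Round 1: per-row hashes covering values and index.
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    digest = hashlib.sha256(row_hashes.tobytes())
    # Round 2: fold the column names into the digest as well.
    for col in df.columns:
        digest.update(str(col).encode())
    return digest.hexdigest()

With this, the df and df_renamed frames from the original report hash differently.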

@rhshadrach (Member) commented

Perhaps this is an odd solution, but one thought is to use to_parquet with a byte stream.

import hashlib
from io import BytesIO

buffer = BytesIO()
df.to_parquet(buffer)
hash_df = hashlib.sha256(buffer.getbuffer()).hexdigest()

This will fail if the DataFrame contains complex objects (e.g. lists) as those can't be handled by parquet.

@lmeyerov (Author) commented Apr 9, 2022

Yeah, we already have a try/except here (for arrow conversion) :)

Interestingly, on a quick microbenchmark, the parquet path is about 10% faster too!

A hedged sketch of that fallback pattern follows below.
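A possible shape for that try/except, combining the parquet fast path with the two-round value hash sketched earlier (the exception handling is kept deliberately broad, since arrow conversion errors vary by pyarrow version):

import hashlib
from io import BytesIO

import pandas as pd

def hash_df(df: pd.DataFrame) -> str:
    try:
        # Fast path: serialize to parquet bytes and hash those.
        buffer = BytesIO()
        df.to_parquet(buffer)
        return hashlib.sha256(buffer.getbuffer()).hexdigest()
    except Exception:
        # Arrow conversion can fail on complex/object columns;
        # fall back to per-row hashing plus the column names.
        digest = hashlib.sha256(
            pd.util.hash_pandas_object(df, index=True).values.tobytes()
        )
        for col in df.columns:
            digest.update(str(col).encode())
        return digest.hexdigest()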

@rhshadrach (Member) commented

@lmeyerov - Can you describe your use case for hashing a DataFrame? In general one typically doesn't hash mutable objects (especially for use in e.g. dict keys).

Certainly DataFrames should not be hashable (i.e. hash(df) fails), but perhaps there is a use case for pandas to provide a hash value for a DataFrame, or at least make obtaining such values more straightforward.

rhshadrach added the Enhancement, Needs Discussion, and Needs Info labels and removed the Bug and Needs Triage labels on Apr 10, 2022
@Gabriel-ROBIN commented

Interesting. In my use case, I need to hash the dataframe and then save it as parquet. Can I save the buffer to disk, or do I need to call to_parquet again?

@lmeyerov (Author) commented Apr 11, 2022

Our standard practice is to avoid mutations, so less of a concern here ;-)

Our rough scenario is around memoization optimizations where we can't rely on pointer equality, such as someone rerunning a notebook cell and getting a new df but with old values, and we want to avoid large uploads when that happens: https://github.com/graphistry/pygraphistry/blob/4e71239e177d76a9b8bdca64c6995b7727afc5a5/graphistry/PlotterBase.py#L1684
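A stripped-down sketch of that memoization pattern (_upload_cache and _do_upload are hypothetical stand-ins for the real cache and network call):

import hashlib
from io import BytesIO

import pandas as pd

_upload_cache = {}  # content hash -> server-side dataset id

def _do_upload(df: pd.DataFrame) -> str:
    # Stand-in for the real upload; returns a fake dataset id.
    return f"dataset-{len(_upload_cache)}"

def upload_df(df: pd.DataFrame) -> str:
    # The upload is skipped when an identical frame was already sent,
    # even if it is a different object (no pointer equality needed).
    buffer = BytesIO()
    df.to_parquet(buffer)
    key = hashlib.sha256(buffer.getbuffer()).hexdigest()
    if key not in _upload_cache:
        _upload_cache[key] = _do_upload(df)
    return _upload_cache[key]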

@rhshadrach (Member) commented

@Gabriel-ROBIN: You can use the BytesIO object directly.

from io import BytesIO

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1], "b": ["x", "y", "z"], "c": [1, 2, 3]})
buffer = BytesIO()
df.to_parquet(buffer)
with open("temp.parquet", mode="wb") as f:
    f.write(buffer.getvalue())
print(pd.read_parquet("temp.parquet"))

rhshadrach removed the Needs Info label on Apr 11, 2022
@rhshadrach (Member) commented

It does seem to be difficult to get an accurate hash value that encompasses all the components of a DataFrame (values, index, columns, flags, metadata, maybe others?). I'm +1 on supporting this.

@MarcSkovMadsen commented

Hashing of DataFrames is often done in data app frameworks like Streamlit, Panel, etc. It's used for caching purposes to speed up the apps.

@hagenw commented Jun 13, 2024

> Can you describe your use case for hashing a DataFrame?

Besides caching, it might also be useful for checking whether parts of a dataset have changed.
E.g. when storing a dataframe as a CSV file, you can calculate an MD5 sum of the file, which will only change if its content changes. But if you store the dataframe in a parquet file, it will have a different MD5 sum every time you store it, even if the underlying data is the same (that is by design of the parquet format). In this case, you could calculate a hash sum based on the actual dataframe, store it inside the parquet file, and use that to track whether the content of the file has changed.
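A hedged sketch of that idea using pyarrow's schema metadata (the content_hash key is just a convention chosen here):

import hashlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"a": [1, 2, 3]})

# Hash the dataframe's contents, not the file bytes.
content_hash = hashlib.sha256(
    pd.util.hash_pandas_object(df, index=True).values.tobytes()
).hexdigest()

# Store the hash in the parquet schema metadata.
table = pa.Table.from_pandas(df)
meta = dict(table.schema.metadata or {})
meta[b"content_hash"] = content_hash.encode()
pq.write_table(table.replace_schema_metadata(meta), "data.parquet")

# Later: check for changes without rehashing the raw file bytes.
stored = pq.read_schema("data.parquet").metadata[b"content_hash"].decode()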

@hagenw commented Jun 13, 2024

> Perhaps this is an odd solution, but one thought is to use to_parquet with a byte stream.
>
> buffer = BytesIO()
> df.to_parquet(buffer)
> hash_df = hashlib.sha256(buffer.getbuffer()).hexdigest()
>
> This will fail if the DataFrame contains complex objects (e.g. lists) as those can't be handled by parquet.

This returns different results on different machines for me.
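One plausible cause: parquet files embed writer metadata (e.g. the format's created_by field), so byte-identical output across environments isn't guaranteed. If cross-machine stability matters, hashing the values directly may be preferable, since hash_pandas_object uses a fixed default hash_key rather than Python's per-process string hashing. A small sketch:

import hashlib

import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "z"]})

# Per-row hashes are reproducible across machines and sessions.
stable = hashlib.sha256(
    pd.util.hash_pandas_object(df, index=True).values.tobytes()
).hexdigest()
print(stable)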
