
BUG: hash_pandas_object ignores column name values #46705

Open · lmeyerov opened this issue Apr 8, 2022 · 13 comments
Labels: Enhancement · hashing hash_pandas_object · Needs Discussion (requires discussion from core team before further action)

Comments

@lmeyerov commented Apr 8, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import hashlib, pandas as pd

df = pd.DataFrame({'s': ['a', 'b', 'c', 'd'], 'd': ['b', 'c', 'd', 'e'], 'i': [0, 2, 4, 6]})

df_renamed = df.rename(columns={'s': 'ss'})

hash_df = hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest()
hash_df_renamed = hashlib.sha256(pd.util.hash_pandas_object(df_renamed, index=True).values).hexdigest()

assert hash_df != hash_df_renamed

Issue Description

When hashing a DataFrame, the column names are ignored.

Expected Behavior

I expected DataFrames with different column names to hash differently. If ignoring column names is desired, I'd expect a default-off flag for excluding them from the hash calculation.

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.7.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.144+
Version : #1 SMP Tue Dec 7 09:58:10 PST 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.21.5
pytz : 2018.9
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : 0.29.28
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.6
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 5.5.0
pandas_datareader: 0.9.0
bs4 : 4.6.3
bottleneck : 1.3.4
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.13.3
pyarrow : 6.0.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.4.32
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 1.1.0
xlwt : 1.3.0
numba : 0.51.2

lmeyerov added the Bug and Needs Triage labels on Apr 8, 2022
rhshadrach added the hashing hash_pandas_object label on Apr 9, 2022
@rhshadrach (Member) commented

By doing .values you are getting the underlying NumPy array. This doesn't have column names.

@rhshadrach (Member) commented

E.g.

import pandas as pd

df = pd.DataFrame({'s': ['a', 'b', 'c', 'd'], 'd': ['b', 'c', 'd', 'e'], 'i': [0, 2, 4, 6]})
print(pd.util.hash_pandas_object(df, index=True).values)

gives

[ 7494370580092021340  5889295906979463579  1670643539686186963
 10982910506778653110]

@lmeyerov (Author) commented Apr 9, 2022

That makes sense

We're currently doing a second explicit round of hashing to include df.columns -- is there a more natural way to get a full hash? Ultimately the goal is to get a hash of the df, not the individual .values
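For reference, a minimal sketch of that two-round approach (the helper name hash_df_full is introduced here just for illustration):

import hashlib

import pandas as pd

def hash_df_full(df: pd.DataFrame) -> str:
    # Round 1: per-row hashes covering values and index.
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    digest = hashlib.sha256(row_hashes.tobytes())
    # Round 2: fold the column names into the digest as well.
    for col in df.columns:
        digest.update(str(col).encode())
    return digest.hexdigest()

With this, the df and df_renamed frames from the original report hash differently.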

@rhshadrach (Member) commented

Perhaps this is an odd solution, but one thought is to use to_parquet with a byte stream.

import hashlib
from io import BytesIO

buffer = BytesIO()
df.to_parquet(buffer)
hash_df = hashlib.sha256(buffer.getbuffer()).hexdigest()

This will fail if the DataFrame contains complex objects (e.g. lists) as those can't be handled by parquet.

@lmeyerov (Author) commented Apr 9, 2022

Yeah, we already have a try/except here (for arrow conversion) :)

Interestingly, on a quick microbenchmark, the parquet path is about 10% faster too!

A hedged sketch of that fallback pattern follows below.
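A possible shape for that try/except, combining the parquet fast path with the two-round value hash sketched earlier (the exception handling is kept deliberately broad, since arrow conversion errors vary by pyarrow version):

import hashlib
from io import BytesIO

import pandas as pd

def hash_df(df: pd.DataFrame) -> str:
    try:
        # Fast path: serialize to parquet bytes and hash those.
        buffer = BytesIO()
        df.to_parquet(buffer)
        return hashlib.sha256(buffer.getbuffer()).hexdigest()
    except Exception:
        # Arrow conversion can fail on complex/object columns;
        # fall back to per-row hashing plus the column names.
        digest = hashlib.sha256(
            pd.util.hash_pandas_object(df, index=True).values.tobytes()
        )
        for col in df.columns:
            digest.update(str(col).encode())
        return digest.hexdigest()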

@rhshadrach (Member) commented

@lmeyerov - Can you describe your use case for hashing a DataFrame? In general one typically doesn't hash mutable objects (especially for use in e.g. dict keys).

Certainly DataFrames should not be hashable (i.e. hash(df) fails), but perhaps there is a use case for pandas to provide a hash value for a DataFrame, or at least make obtaining such values more straightforward.

rhshadrach added the Enhancement, Needs Discussion, and Needs Info labels and removed the Bug and Needs Triage labels on Apr 10, 2022
@Gabriel-ROBIN commented

Interesting. In my use case, I need to hash the dataframe and then save it as parquet. Can I save the buffer to disk, or do I need to call to_parquet again?

@lmeyerov (Author) commented Apr 11, 2022

Our standard practice is to avoid mutations, so less of a concern here ;-)

Our rough scenario is around memoization optimizations where we can't rely on pointer equality, such as someone rerunning a notebook cell and getting a new df but with old values, and we want to avoid large uploads when that happens: https://github.com/graphistry/pygraphistry/blob/4e71239e177d76a9b8bdca64c6995b7727afc5a5/graphistry/PlotterBase.py#L1684
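A stripped-down sketch of that memoization pattern (_upload_cache and _do_upload are hypothetical stand-ins for the real cache and network call):

import hashlib
from io import BytesIO

import pandas as pd

_upload_cache = {}  # content hash -> server-side dataset id

def _do_upload(df: pd.DataFrame) -> str:
    # Stand-in for the real upload; returns a fake dataset id.
    return f"dataset-{len(_upload_cache)}"

def upload_df(df: pd.DataFrame) -> str:
    # The upload is skipped when an identical frame was already sent,
    # even if it is a different object (no pointer equality needed).
    buffer = BytesIO()
    df.to_parquet(buffer)
    key = hashlib.sha256(buffer.getbuffer()).hexdigest()
    if key not in _upload_cache:
        _upload_cache[key] = _do_upload(df)
    return _upload_cache[key]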

@rhshadrach (Member) commented

@Gabriel-ROBIN: You can use the BytesIO object directly.

from io import BytesIO

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1], "b": ["x", "y", "z"], "c": [1, 2, 3]})
buffer = BytesIO()
df.to_parquet(buffer)
with open("temp.parquet", mode="wb") as f:
    f.write(buffer.getvalue())
print(pd.read_parquet("temp.parquet"))

rhshadrach removed the Needs Info label on Apr 11, 2022
@rhshadrach (Member) commented

It does seem to be difficult to get an accurate hash value that encompasses all the components of a DataFrame (values, index, columns, flags, metadata, maybe others?). I'm +1 on supporting this.

@MarcSkovMadsen commented

Hashing of DataFrames is often done in data app frameworks like Streamlit, Panel, etc. It's used for caching purposes to speed up the apps.

@hagenw commented Jun 13, 2024

> Can you describe your use case for hashing a DataFrame?

Besides caching, it might also be useful for checking whether parts of a dataset have changed.
E.g. when storing a dataframe as a CSV file, you can calculate an MD5 sum of the file, which will only change if its content changes. But if you store the dataframe in a parquet file, it will have a different MD5 sum every time you store it, even if the underlying data is the same (that is by design of the parquet format). In this case, you could calculate a hash sum based on the actual dataframe, store it inside the parquet file, and use that to track whether the content of the file has changed.
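A hedged sketch of that idea using pyarrow's schema metadata (the content_hash key is just a convention chosen here):

import hashlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"a": [1, 2, 3]})

# Hash the dataframe's contents, not the file bytes.
content_hash = hashlib.sha256(
    pd.util.hash_pandas_object(df, index=True).values.tobytes()
).hexdigest()

# Store the hash in the parquet schema metadata.
table = pa.Table.from_pandas(df)
meta = dict(table.schema.metadata or {})
meta[b"content_hash"] = content_hash.encode()
pq.write_table(table.replace_schema_metadata(meta), "data.parquet")

# Later: check for changes without rehashing the raw file bytes.
stored = pq.read_schema("data.parquet").metadata[b"content_hash"].decode()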

@hagenw commented Jun 13, 2024

> Perhaps this is an odd solution, but one thought is to use to_parquet with a byte stream.
>
> buffer = BytesIO()
> df.to_parquet(buffer)
> hash_df = hashlib.sha256(buffer.getbuffer()).hexdigest()
>
> This will fail if the DataFrame contains complex objects (e.g. lists) as those can't be handled by parquet.

This returns different results on different machines for me.
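One plausible cause: parquet files embed writer metadata (e.g. the format's created_by field), so byte-identical output across environments isn't guaranteed. If cross-machine stability matters, hashing the values directly may be preferable, since hash_pandas_object uses a fixed default hash_key rather than Python's per-process string hashing. A small sketch:

import hashlib

import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "z"]})

# Per-row hashes are reproducible across machines and sessions.
stable = hashlib.sha256(
    pd.util.hash_pandas_object(df, index=True).values.tobytes()
).hexdigest()
print(stable)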
