BUG: hash_pandas_object ignores column name values #46705
Comments
By doing |
E.g.
gives
|
That makes sense. We're currently doing a second explicit round of hashing to include |
Perhaps this is an odd solution, but one thought is to use
This will fail if the DataFrame contains complex objects (e.g. lists) as those can't be handled by parquet. |
Yeah we already have a a Interestingly, on a quick microbenchmark, the parquet path is about 10% faster too! |
@Imeyerov - Can you describe your use case for hashing a DataFrame? In general, one typically doesn't hash mutable objects (especially for use in, e.g., dict keys). Certainly DataFrames should not be hashable (i.e. |
Interesting. In my use case, I need to hash the DataFrame and then save it as parquet. Can I save the buffer to disk, or do I need to call to_parquet again? |
Our standard practice is to avoid mutations, so less of a concern here ;-) Our rough scenario is around memoization optimizations where we can't rely on pointer equality, such as someone rerunning a notebook cell and getting a new df but with old values, and we want to avoid large uploads when that happens: https://github.com/graphistry/pygraphistry/blob/4e71239e177d76a9b8bdca64c6995b7727afc5a5/graphistry/PlotterBase.py#L1684 |
@Gabriel-ROBIN: You can use the BytesIO object directly.
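A minimal sketch of reusing the buffer, assuming the pyarrow engine is installed. The helper names `parquet_bytes_and_digest` and `save_bytes` are illustrative and not from this thread. The same bytes that are hashed get written to disk, so `to_parquet` only runs once:

```python
import hashlib
from io import BytesIO

import pandas as pd


def parquet_bytes_and_digest(df: pd.DataFrame):
    """Serialize df to parquet in memory and return (bytes, sha256 hex digest)."""
    buf = BytesIO()
    # Assumption: pyarrow is available; column names are part of the parquet
    # schema, so they end up in the serialized bytes.
    df.to_parquet(buf, engine="pyarrow")
    data = buf.getvalue()
    return data, hashlib.sha256(data).hexdigest()


def save_bytes(data: bytes, path: str) -> None:
    """Write the already-serialized parquet bytes to disk without re-encoding."""
    with open(path, "wb") as f:
        f.write(data)
```

Calling `data, digest = parquet_bytes_and_digest(df)` followed by `save_bytes(data, "df.parquet")` gives both the hash and the file. Note that parquet files embed writer metadata (for example, library versions), so the digest is only comparable between environments that use the same pandas and pyarrow versions.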
|
It does seem to be difficult to get an accurate hash value that encompasses all the components of a DataFrame (values, index, columns, flags, metadata, maybe others?). I'm +1 on supporting this. |
Hashing of DataFrames is often done in data app frameworks like Streamlit, Panel, etc. It's used for caching purposes to speed up the apps. |
Besides caching it might also be useful to check if parts of a dataset have changed. |
This returns different results on different machines for me. |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When hashing a df, the column names are ignored
Expected Behavior
I expected dfs with diff column names to hash differently. If ignoring col names is desired, I'd expect a default-off flag for ignoring col names in the hash calc.
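Until something like that flag exists, one possible workaround is a second round of hashing over the column labels. The sketch below is an assumption about how that could look, not pandas' own behavior; `hash_frame_with_columns` is a hypothetical helper name. It feeds the per-row hashes and the hashes of the column labels into a single SHA-256 digest:

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def hash_frame_with_columns(df: pd.DataFrame) -> str:
    """Return a hex digest covering values, index, and column labels."""
    digest = hashlib.sha256()

    # First round: one uint64 per row, built from the values and the index.
    # Per this issue, the column labels do not contribute to these hashes.
    row_hashes = hash_pandas_object(df, index=True)
    digest.update(row_hashes.to_numpy().tobytes())

    # Second round: hash the column labels themselves and mix them in.
    col_hashes = hash_pandas_object(df.columns)
    digest.update(col_hashes.to_numpy().tobytes())

    return digest.hexdigest()


df_a = pd.DataFrame({"a": [1, 2, 3]})
df_b = pd.DataFrame({"b": [1, 2, 3]})

# Same values, different column names: the combined digests differ.
print(hash_frame_with_columns(df_a) != hash_frame_with_columns(df_b))
```

This sketch does not cover dtypes-only differences in labels, `attrs`, or flags; those would need further rounds if they matter for the use case.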
Installed Versions
INSTALLED VERSIONS
commit : 66e3805
python : 3.7.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.144+
Version : #1 SMP Tue Dec 7 09:58:10 PST 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.5
numpy : 1.21.5
pytz : 2018.9
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : 0.29.28
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.6
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 5.5.0
pandas_datareader: 0.9.0
bs4 : 4.6.3
bottleneck : 1.3.4
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.13.3
pyarrow : 6.0.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.4.32
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 1.1.0
xlwt : 1.3.0
numba : 0.51.2