
BUG: use of partition_cols raises incompatibility between fastparquet & pyarrow #39499

Open
yohplala opened this issue Jan 31, 2021 · 2 comments
Labels
Bug, IO Parquet

Comments


yohplala commented Jan 31, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import fastparquet as fp
from os import path as os_path

df = pd.DataFrame({'ab': [1, 2, 3], 'a':['a', 'b', 'c']})
file = os_path.expanduser('~/Documents/code/data/test.parquet')

# Case 1: write: pandas/pyarrow      || read: pandas/pyarrow
# OK    : 'ab' column is read back with type 'category'
# NOOK  : does not naturally overwrite (file names differ between writes);
#         as a consequence, the df that is read back contains twice as much data.
# OK    : with snappy not installed, naturally does not compress data.
df.to_parquet(file, partition_cols=['ab'])
df.to_parquet(file, partition_cols=['ab'])
df_rec_pd1 = pd.read_parquet(file)
print(df_rec_pd1['ab'])

# Case 2: write: pandas/fastparquet  || read: pandas/fastparquet
# OK    : 'ab' column is read back with type 'category'
# OK    : naturally overwrites (same file names)
# NOOK  : with snappy not installed, it does not naturally fall back to writing
#         uncompressed data; the 'compression' keyword has to be set to 'uncompressed'.
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression='uncompressed')
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression='uncompressed')
df_rec_pd2 = pd.read_parquet(file, engine='fastparquet')
print(df_rec_pd2['ab'])

# Case 3: write: pandas/fastparquet  || read: pandas/pyarrow
# OK    : 'ab' column is read back with type 'category'
# OK    : naturally overwrites (same file names)
# NOOK  : with snappy not installed, it does not naturally fall back to writing
#         uncompressed data; the 'compression' keyword has to be set to 'uncompressed'.
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression='uncompressed')
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression='uncompressed')
df_rec_pd3 = pd.read_parquet(file)
print(df_rec_pd3['ab'])

# Case 4: write: pandas/pyarrow      || read: pandas/fastparquet
# NOOK  : reading does not work. pyarrow does not generate the common metadata
#         file which fastparquet is looking for.
df.to_parquet(file, partition_cols=['ab'])
df.to_parquet(file, partition_cols=['ab'])
df_rec_pd4 = pd.read_parquet(file, engine='fastparquet')
print(df_rec_pd4['ab'])

# Case 5: write: fastparquet         || read: pandas/pyarrow
# OK    : 'ab' column is read back with type 'category'
# OK    : naturally overwrites (same file names)
fp.write(file, df, file_scheme='hive', partition_on=['ab'], compression='BROTLI')
fp.write(file, df, file_scheme='hive', partition_on=['ab'], compression='BROTLI')
df_rec_fp_pd = pd.read_parquet(file)

# Case 6: write: fastparquet         || read: fastparquet
# OK    : nothing to say, perfect :)
fp.write(file, df, file_scheme='hive', partition_on=['ab'])
fp.write(file, df, file_scheme='hive', partition_on=['ab'])
df_rec_fp = fp.ParquetFile(file).to_pandas()

# Case 7: write: pandas/pyarrow      || read: fastparquet
# NOOK  : reading does not work. It seems that even if snappy is not available
#         at writing time, it is recorded in the metadata as the compression that
#         has been used. As fastparquet does not find it, it raises an error.
df.to_parquet(file)
df_rec_pd_fp = fp.ParquetFile(file).to_pandas()

# Case 8: write: pandas/pyarrow      || read: fastparquet
# NOOK  : bug already reported: https://github.com/pandas-dev/pandas/issues/39480
#         It seems pandas/pyarrow is not writing categories of int as categories.
df.to_parquet(file, compression='BROTLI')
df_rec_pd_fp = fp.ParquetFile(file).to_pandas()
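
# As a side note on case 7, the codec that pyarrow actually recorded can be checked
# from the file footer. A small diagnostic sketch ('test_single.parquet' is a
# hypothetical path used only for this check, kept separate from the partitioned dataset):
import pyarrow.parquet as pq

single_file = 'test_single.parquet'
df.to_parquet(single_file)  # single-file write through pyarrow, as in case 7

# Codec recorded for the first column chunk of the first row group, e.g. 'SNAPPY'.
md = pq.ParquetFile(single_file).metadata
print(md.row_group(0).column(0).compression)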

Problem description

Hi. As far as I can see, I report here 4 different problems (a 5th related one was reported earlier in #39480). Trying to summarize:

  • when using partition_cols in to_parquet, fastparquet and pyarrow manage the partitioned dataset differently. Most notably, fastparquet writes a common metadata file in the root directory, while pyarrow, as far as I can tell, does not write this 'common metadata' file. This causes the bug in case 4: writing with pyarrow, reading with fastparquet; fastparquet does not find the common metadata file and refuses to read the data (a possible workaround is sketched after this list).

  • when 'snappy' is not installed, no error is raised at writing time with pyarrow. But when the data is read back with fastparquet, the reader believes it has been compressed with 'snappy' and complains that it cannot find it. This raises the error in case 7.

  • when 'snappy' is not installed, could the writer and reader simply fall back to 'uncompressed' for the writing and reading steps? With fastparquet, this currently forces the user to set the compression parameter to 'uncompressed', as in cases 2 and 3.

  • when using partition_cols with pyarrow, the names of the parquet files are always different. Hence, writing the same dataset twice duplicates the data (in contrast, fastparquet always uses the same file names, which ensures natural overwriting), as in case 1.
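
For the first point (case 4), a possible user-side workaround, rather than a pandas fix, is to write the missing metadata file at the dataset root yourself with pyarrow after the partitioned write. This is only a sketch: I am assuming a schema-only '_common_metadata' file is what fastparquet 0.5.0 looks for (not verified), and it reuses the df and file variables from the code sample above.

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Partitioned write through pyarrow, roughly what
# df.to_parquet(file, partition_cols=['ab']) does under the hood.
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path=file, partition_cols=['ab'])

# Write a schema-only '_common_metadata' file at the dataset root so that a
# reader looking for it (fastparquet, in case 4) can open the directory.
pq.write_metadata(table.schema, os.path.join(file, '_common_metadata'))

df_back = pd.read_parquet(file, engine='fastparquet')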

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 9d598a5
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-41-generic
Version : #46~20.04.1-Ubuntu SMP Mon Jan 18 17:52:23 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 1.2.1
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : 0.29.21
pytest : 6.1.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : 0.5.0
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2

@yohplala yohplala added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 31, 2021
@jorisvandenbossche jorisvandenbossche added the IO Parquet parquet, feather label Feb 16, 2021
@jbrockmendel (Member) commented

@martindurant not sure if this is a pandas issue or fastparquet or something else?

@jbrockmendel jbrockmendel removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jun 6, 2021
@martindurant (Contributor) commented

OK, there are a few things here.

  • the default compression in fastparquet is in fact none. When called from pandas, though, it's snappy (because pandas wanted to pick a single default for the two engines), causing the problem. I don't believe arrow uses python-snappy, so it can compress without it. Note that the newest fastparquet now has a hard dependency on cramjam, which includes snappy, so it will always be available.
  • indeed pyarrow does not write the metadata by default, but I believe it can. Fastparquet always does. As of the latest release, fastparquet can be passed a directory and it will find the data files just as pyarrow does. Previously, you had to use glob and pass a list of data files.
  • in dask we had a lot of discussion about overwriting. It seems best to explicitly delete the contents of a directory before writing to it, unless "append" is specified (a user-side sketch of that follows this list). Dask now allows a filename template (perhaps still in a PR) to specify the names of the data files; but this is not done in the backend libraries themselves. There's probably no pressing need for pandas to implement this.
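
To illustrate that overwrite policy on the pandas side, here is a minimal user-side sketch; overwrite_partitioned is a hypothetical helper (not a pandas, pyarrow or dask API), assuming plain "replace" semantics rather than append:

import shutil
from pathlib import Path

import pandas as pd

def overwrite_partitioned(df, path, **to_parquet_kwargs):
    # Remove any previous dataset first, so that pyarrow's uniquely named
    # part files cannot accumulate across repeated writes (case 1 above).
    p = Path(path)
    if p.exists():
        shutil.rmtree(p)
    df.to_parquet(path, **to_parquet_kwargs)

df = pd.DataFrame({'ab': [1, 2, 3], 'a': ['a', 'b', 'c']})
overwrite_partitioned(df, 'test.parquet', partition_cols=['ab'])
overwrite_partitioned(df, 'test.parquet', partition_cols=['ab'])
# the second call replaces the first, so the data read back is not duplicated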
