
BUG: use of partition_cols raises incompatibility between fastparquet & pyarrow #39499

Open
yohplala opened this issue Jan 31, 2021 · 2 comments
Labels
Bug, IO Parquet

Comments


yohplala commented Jan 31, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import fastparquet as fp
from os import path as os_path

df = pd.DataFrame({'ab': [1, 2, 3], 'a':['a', 'b', 'c']})
file = os_path.expanduser('~/Documents/code/data/test.parquet')

# Case 1: write: pandas/pyarrow      || read: pandas/pyarrow
# OK    : 'ab' column is read back with type 'category'
# NOOK  : does not naturally overwrite (file names differ between writes);
#         as a consequence, the df that is read back contains twice as much data.
# OK    : with snappy not installed, naturally does not compress data.
df.to_parquet(file, partition_cols=['ab'])
df.to_parquet(file, partition_cols=['ab'])
df_rec_pd1 = pd.read_parquet(file)
print(df_rec_pd1['ab'])

# Case 2: write: pandas/fastparquet  || read: pandas/fastparquet
# OK    : 'ab' column is read back with type 'category'
# OK    : naturally overwrites (same file names)
# NOOK  : with snappy not installed, it does not naturally fall back to writing
#         uncompressed data; the 'compression' keyword has to be set to 'uncompressed'.
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression='uncompressed')
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression='uncompressed')
df_rec_pd2 = pd.read_parquet(file, engine='fastparquet')
print(df_rec_pd2['ab'])

# Case 3: write: pandas/fastparquet  || read: pandas/pyarrow
# OK    : 'ab' column is read back with type 'category'
# OK    : naturally overwrites (same file names)
# NOOK  : with snappy not installed, it does not naturally fall back to writing
#         uncompressed data; the 'compression' keyword has to be set to 'uncompressed'.
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression='uncompressed')
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression='uncompressed')
df_rec_pd3 = pd.read_parquet(file)
print(df_rec_pd3['ab'])

# Case 4: write: pandas/pyarrow      || read: pandas/fastparquet
# NOOK  : reading does not work. pyarrow does not generate the common metadata
#         file which fastparquet is looking for.
df.to_parquet(file, partition_cols=['ab'])
df.to_parquet(file, partition_cols=['ab'])
df_rec_pd4 = pd.read_parquet(file, engine='fastparquet')
print(df_rec_pd4['ab'])

# Case 5: write: fastparquet         || read: pandas/pyarrow
# OK    : 'ab' column is read back with type 'category'
# OK    : naturally overwrites (same file names)
fp.write(file, df, file_scheme='hive', partition_on=['ab'], compression='BROTLI')
fp.write(file, df, file_scheme='hive', partition_on=['ab'], compression='BROTLI')
df_rec_fp_pd = pd.read_parquet(file)

# Case 6: write: fastparquet         || read: fastparquet
# OK    : nothing to say, perfect :)
fp.write(file, df, file_scheme='hive', partition_on=['ab'])
fp.write(file, df, file_scheme='hive', partition_on=['ab'])
df_rec_fp = fp.ParquetFile(file).to_pandas()

# Case 7: write: pandas/pyarrow      || read: fastparquet
# NOOK  : reading does not work. It seems that even if snappy is not available
#         at writing time, it is recorded in the metadata as the compression that
#         has been used. As fastparquet does not find it, it raises an error.
df.to_parquet(file)
df_rec_pd_fp = fp.ParquetFile(file).to_pandas()

# Case 8: write: pandas/pyarrow      || read: fastparquet
# NOOK  : bug already reported: https://github.com/pandas-dev/pandas/issues/39480
#         It seems pandas/pyarrow is not writing categories of int as categories.
df.to_parquet(file, compression='BROTLI')
df_rec_pd_fp = fp.ParquetFile(file).to_pandas()
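
# As a side note on case 7, the codec that pyarrow actually recorded can be checked
# from the file footer. A small diagnostic sketch ('test_single.parquet' is a
# hypothetical path used only for this check, kept separate from the partitioned dataset):
import pyarrow.parquet as pq

single_file = 'test_single.parquet'
df.to_parquet(single_file)  # single-file write through pyarrow, as in case 7

# Codec recorded for the first column chunk of the first row group, e.g. 'SNAPPY'.
md = pq.ParquetFile(single_file).metadata
print(md.row_group(0).column(0).compression)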

Problem description

Hi. As far as I can see, I report here 4 different problems (a 5th related one was reported earlier in #39480). Trying to summarize:

  • when using partition_cols in to_parquet, fastparquet and pyarrow manage the partitioned dataset differently. Most notably, fastparquet writes a common metadata file in the root directory, while pyarrow, as far as I can tell, does not write this 'common metadata' file. This causes the bug in case 4: writing with pyarrow, reading with fastparquet; fastparquet does not find the common metadata file and refuses to read the data (a possible workaround is sketched after this list).

  • when 'snappy' is not installed, no error is raised at writing time with pyarrow. But when the data is read back with fastparquet, the reader believes it has been compressed with 'snappy' and complains that it cannot find it. This raises the error in case 7.

  • when 'snappy' is not installed, could the writer and reader simply fall back to 'uncompressed' for the writing and reading steps? With fastparquet, this currently forces the user to set the compression parameter to 'uncompressed', as in cases 2 and 3.

  • when using partition_cols with pyarrow, the names of the parquet files are always different. Hence, writing the same dataset twice duplicates the data (in contrast, fastparquet always uses the same file names, which ensures natural overwriting), as in case 1.
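
For the first point (case 4), a possible user-side workaround, rather than a pandas fix, is to write the missing metadata file at the dataset root yourself with pyarrow after the partitioned write. This is only a sketch: I am assuming a schema-only '_common_metadata' file is what fastparquet 0.5.0 looks for (not verified), and it reuses the df and file variables from the code sample above.

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Partitioned write through pyarrow, roughly what
# df.to_parquet(file, partition_cols=['ab']) does under the hood.
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path=file, partition_cols=['ab'])

# Write a schema-only '_common_metadata' file at the dataset root so that a
# reader looking for it (fastparquet, in case 4) can open the directory.
pq.write_metadata(table.schema, os.path.join(file, '_common_metadata'))

df_back = pd.read_parquet(file, engine='fastparquet')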

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 9d598a5
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-41-generic
Version : #46~20.04.1-Ubuntu SMP Mon Jan 18 17:52:23 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 1.2.1
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : 0.29.21
pytest : 6.1.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : 0.5.0
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2

@yohplala yohplala added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 31, 2021
@jorisvandenbossche jorisvandenbossche added the IO Parquet parquet, feather label Feb 16, 2021
@jbrockmendel (Member) commented

@martindurant not sure if this is a pandas issue or fastparquet or something else?

@jbrockmendel jbrockmendel removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jun 6, 2021
@martindurant (Contributor) commented

OK, there are a few things here.

  • the default compression in fastparquet is in fact none. When called from pandas, though, it's snappy (because pandas wanted to pick a single default for the two engines), causing the problem. I don't believe arrow uses python-snappy, so it can compress without it. Note that the newest fastparquet now has a hard dependency on cramjam, which includes snappy, so it will always be available.
  • indeed pyarrow does not write the metadata by default, but I believe it can. Fastparquet always does. As of the latest release, fastparquet can be passed a directory and it will find the data files just as pyarrow does. Previously, you had to use glob and pass a list of data files.
  • in dask we had a lot of discussion about overwriting. It seems best to explicitly delete the contents of a directory before writing to it, unless "append" is specified (a user-side sketch of that follows this list). Dask now allows a filename template (perhaps still in a PR) to specify the names of the data files; but this is not done in the backend libraries themselves. There's probably no pressing need for pandas to implement this.
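
To illustrate that overwrite policy on the pandas side, here is a minimal user-side sketch; overwrite_partitioned is a hypothetical helper (not a pandas, pyarrow or dask API), assuming plain "replace" semantics rather than append:

import shutil
from pathlib import Path

import pandas as pd

def overwrite_partitioned(df, path, **to_parquet_kwargs):
    # Remove any previous dataset first, so that pyarrow's uniquely named
    # part files cannot accumulate across repeated writes (case 1 above).
    p = Path(path)
    if p.exists():
        shutil.rmtree(p)
    df.to_parquet(path, **to_parquet_kwargs)

df = pd.DataFrame({'ab': [1, 2, 3], 'a': ['a', 'b', 'c']})
overwrite_partitioned(df, 'test.parquet', partition_cols=['ab'])
overwrite_partitioned(df, 'test.parquet', partition_cols=['ab'])
# the second call replaces the first, so the data read back is not duplicated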
