
BUG: read_parquet Google Cloud Storage (gcs) dir support #36743

Open · 2 of 3 tasks · NahsiN opened this issue Sep 30, 2020 · 4 comments
Labels
Enhancement · IO Network · IO Parquet

Comments

@NahsiN commented Sep 30, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

df = pd.read_parquet("gs://bucket/path")

Problem description

In the same vein as issue #26388, it would be nice for pandas to read multiple parquet files stored under a GCS path, in this case gs://bucket/path. Issue #26388 hints that this should already work in pandas 1.x.x; however, I still get one of the following errors:

ArrowInvalid: Parquet file size is 0 bytes

or

OSError: Passed non-file path: bucket/path

Expected Output

Code runs without error.
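A workaround until this is supported natively is to expand the directory listing yourself and concatenate the pieces. A minimal sketch, assuming gcsfs is installed and reusing the placeholder bucket/path from above:

import gcsfs
import pandas as pd

# List the parquet files under the prefix; gcsfs returns paths without the gs:// scheme.
fs = gcsfs.GCSFileSystem()
paths = ["gs://" + p for p in fs.glob("bucket/path/*.parquet")]

# Read each file individually and stack the results.
df = pd.concat((pd.read_parquet(p) for p in paths), ignore_index=True)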

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2a7d332
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
Version : Darwin Kernel Version 18.7.0: Thu Jun 18 20:50:10 PDT 2020; root:xnu-4903.278.43~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.2
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.6.0.post20200917
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.3
fastparquet : None
gcsfs : 0.7.1
matplotlib : 3.3.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@NahsiN added the Bug and Needs Triage labels Sep 30, 2020
@fangchenli added the IO Parquet and Enhancement labels and removed the Bug and Needs Triage labels Oct 2, 2020
@joshtemple (Contributor)

The expected syntax for multiple parquet files should probably follow the wildcard approach you can use on GCS, e.g. gs://bucket/path/* instead of gs://bucket/path.

For example, the wildcard would be required if you were loading those files from Parquet on GCS into BigQuery.
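For illustration, a sketch of how such a wildcard can be expanded today with fsspec before handing the concrete paths to pyarrow; gs://bucket/path/* is a placeholder, and this assumes gcsfs is installed:

import fsspec
import pyarrow.parquet as pq

# get_fs_token_paths expands the glob into concrete object paths
# and returns a matching filesystem instance.
fs, _, paths = fsspec.get_fs_token_paths("gs://bucket/path/*.parquet")

# ParquetDataset accepts a list of paths together with an fsspec filesystem.
df = pq.ParquetDataset(paths, filesystem=fs).read().to_pandas()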

@mroeschke added the IO Network label Aug 13, 2021
@borislitvak

Still waiting for this...

@jreback (Contributor) commented Jun 15, 2022

@borislitvak this is an all-volunteer project

you are welcome to contribute a patch

@Kornel commented Jul 28, 2022

@borislitvak It seems to work now (with pyarrow), at least partially: it still fails when there is only one file in the directory, but that seems more like a pyarrow bug?

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 5f648bf
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-1080-gcp
Version : #87~18.04.1-Ubuntu SMP Fri Jun 10 18:50:42 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.2
numpy : 1.19.5
pytz : 2022.1
dateutil : 2.8.0
pip : 22.0.4
setuptools : 59.8.0
Cython : 0.29.28
pytest : 7.1.1
hypothesis : None
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 8.2.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fsspec : 2022.3.0
fastparquet : 0.5.0
gcsfs : 2022.3.0
matplotlib : 3.4.3
numexpr : 2.8.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : 1.4.35
tables : 3.6.1
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1

Traceback with one file only:

In [12]: X = pd.read_parquet(path)
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 X = pd.read_parquet(path)

File ~/.local/lib/python3.8/site-packages/pandas/io/parquet.py:495, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
    442 """
    443 Load a parquet object from the file path, returning a DataFrame.
    444
   (...)
    491 DataFrame
    492 """
    493 impl = get_engine(engine)
--> 495 return impl.read(
    496     path,
    497     columns=columns,
    498     storage_options=storage_options,
    499     use_nullable_dtypes=use_nullable_dtypes,
    500     **kwargs,
    501 )

File ~/.local/lib/python3.8/site-packages/pandas/io/parquet.py:239, in PyArrowImpl.read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
    232 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
    233     path,
    234     kwargs.pop("filesystem", None),
    235     storage_options=storage_options,
    236     mode="rb",
    237 )
    238 try:
--> 239     result = self.api.parquet.read_table(
    240         path_or_handle, columns=columns, **kwargs
    241     ).to_pandas(**to_pandas_kwargs)
    242     if manager == "array":
    243         result = result._as_manager("array", copy=False)

File ~/.local/lib/python3.8/site-packages/pyarrow/parquet.py:1859, in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit)
   1852     raise ValueError(
   1853         "The 'metadata' keyword is no longer supported with the new "
   1854         "datasets-based implementation. Specify "
   1855         "'use_legacy_dataset=True' to temporarily recover the old "
   1856         "behaviour."
   1857     )
   1858 try:
-> 1859     dataset = _ParquetDatasetV2(
   1860         source,
   1861         filesystem=filesystem,
   1862         partitioning=partitioning,
   1863         memory_map=memory_map,
   1864         read_dictionary=read_dictionary,
   1865         buffer_size=buffer_size,
   1866         filters=filters,
   1867         ignore_prefixes=ignore_prefixes,
   1868         pre_buffer=pre_buffer,
   1869         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
   1870     )
   1871 except ImportError:
   1872     # fall back on ParquetFile for simple cases when pyarrow.dataset
   1873     # module is not available
   1874     if filters is not None:

File ~/.local/lib/python3.8/site-packages/pyarrow/parquet.py:1694, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, **kwargs)
   1690 if partitioning == "hive":
   1691     partitioning = ds.HivePartitioning.discover(
   1692         infer_dictionary=True)
-> 1694 self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
   1695                            format=parquet_format,
   1696                            partitioning=partitioning,
   1697                            ignore_prefixes=ignore_prefixes)

File ~/.local/lib/python3.8/site-packages/pyarrow/dataset.py:655, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    644 kwargs = dict(
    645     schema=schema,
    646     filesystem=filesystem,
   (...)
    651     selector_ignore_prefixes=ignore_prefixes
    652 )
    654 if _is_path_like(source):
--> 655     return _filesystem_dataset(source, **kwargs)
    656 elif isinstance(source, (tuple, list)):
    657     if all(_is_path_like(elem) for elem in source):

File ~/.local/lib/python3.8/site-packages/pyarrow/dataset.py:410, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    402 options = FileSystemFactoryOptions(
    403     partitioning=partitioning,
    404     partition_base_dir=partition_base_dir,
    405     exclude_invalid_files=exclude_invalid_files,
    406     selector_ignore_prefixes=selector_ignore_prefixes
    407 )
    408 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
--> 410 return factory.finish(schema)

File ~/.local/lib/python3.8/site-packages/pyarrow/_dataset.pyx:2402, in pyarrow._dataset.DatasetFactory.finish()

File ~/.local/lib/python3.8/site-packages/pyarrow/error.pxi:143, in pyarrow.lib.pyarrow_internal_check_status()

File ~/.local/lib/python3.8/site-packages/pyarrow/_fs.pyx:1086, in pyarrow._fs._cb_open_input_file()

File ~/.local/lib/python3.8/site-packages/pyarrow/fs.py:319, in FSSpecHandler.open_input_file(self, path)
    316 from pyarrow import PythonFile
    318 if not self.fs.isfile(path):
--> 319     raise FileNotFoundError(path)
    321 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: my_bucket/some_path/
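A defensive workaround for this single-file case, sketched with the hypothetical my_bucket/some_path/ prefix and again assuming gcsfs, is to fall back to reading the lone file directly:

import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem()

# ls() returns the objects under the prefix without the gs:// scheme.
files = [f for f in fs.ls("my_bucket/some_path/") if f.endswith(".parquet")]

if len(files) == 1:
    # A directory holding a single file trips the FileNotFoundError above,
    # so read that one file directly instead of passing the directory.
    df = pd.read_parquet("gs://" + files[0])
else:
    df = pd.read_parquet("gs://my_bucket/some_path/")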
