
BUG: read_parquet no longer supports file-like objects #34467

Closed
claytonlemons opened this issue May 29, 2020 · 26 comments
Labels: Blocker (Blocking issue or pull request for an upcoming release), IO Parquet (parquet, feather), Regression (Functionality that used to work in a prior pandas version)
Milestone: 1.0.5
Assignees: alimcmaster1

Comments

claytonlemons commented May 29, 2020

Code Sample, a copy-pastable example

from io import BytesIO
import pandas as pd

buffer = BytesIO()

df = pd.DataFrame([1,2,3], columns=["a"])
df.to_parquet(buffer)

df2 = pd.read_parquet(buffer)  # raises TypeError on pandas 1.0.4; see traceback below

Problem description

On pandas 1.0.4, read_parquet(buffer) raises the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./working_dir/tvenv/lib/python3.7/site-packages/pandas/io/parquet.py", line 315, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
    path, filesystem=get_fs_for_path(path), **kwargs
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/parquet.py", line 1162, in __init__
    self.paths = _parse_uri(path_or_paths)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/parquet.py", line 47, in _parse_uri
    path = _stringify_path(path)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/util.py", line 67, in _stringify_path
    raise TypeError("not a path-like object")
TypeError: not a path-like object

Expected Output

Instead, read_parquet(buffer) should return a new DataFrame with the same contents as the serialized DataFrame stored in buffer.
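For contrast, a minimal sketch of the expected round trip, as it behaves on pandas 1.0.3 with the same pyarrow (the assert is illustrative):

# Expected behavior: the buffer round-trips cleanly.
from io import BytesIO

import pandas as pd

buffer = BytesIO()
pd.DataFrame({"a": [1, 2, 3]}).to_parquet(buffer)

df2 = pd.read_parquet(buffer)  # no seek(0) needed: parquet readers use random access
assert df2["a"].tolist() == [1, 2, 3]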

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-99-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 9.0.1
setuptools : 39.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

claytonlemons added the Bug and Needs Triage labels May 29, 2020
jreback (Contributor) commented May 29, 2020

this is almost certainly a regression in pyarrow itself

please report there

claytonlemons (Author) commented:

this is almost certainly a regression in pyarrow itself
please report there

I mean no offense, but this sounds like the responsibility of pandas maintainers, since pandas directly consumes pyarrow.

jreback (Contributor) commented May 29, 2020

@claytonlemons you are missing the point

it’s calling pyarrow code - see the traceback

jreback closed this as completed May 29, 2020
austospumanto commented May 30, 2020

@claytonlemons I am encountering the same issue.

If I downgrade from 1.0.4 to 1.0.3 (while keeping the pyarrow version the same), I can again read from BytesIO buffers without issue. Since upgrading pandas from 1.0.3 to 1.0.4 is both necessary and sufficient to trigger the file-like object reading issue, it does seem correct to consider this an issue with pandas, not pyarrow.

@jreback Would you consider reopening this issue?

jreback reopened this May 30, 2020
jreback (Contributor) commented May 30, 2020

OK, sure, something must have gone wrong in the backport.

It would be helpful to know exactly where.

austospumanto commented May 30, 2020

Thanks, @jreback.

For starters, it looks like the implementation of io/parquet.py#PyArrowImpl.read changed significantly between 1.0.3 and 1.0.4.

1.0.3

def read(self, path, columns=None, **kwargs):
    # get_filepath_or_buffer accepts paths, URLs, and file-like objects alike
    path, _, _, should_close = get_filepath_or_buffer(path)

    kwargs["use_pandas_metadata"] = True
    result = self.api.parquet.read_table(
        path, columns=columns, **kwargs
    ).to_pandas()
    if should_close:
        path.close()

    return result

1.0.4

def read(self, path, columns=None, **kwargs):
    # ParquetDataset stringifies `path` up front, so file-like objects raise TypeError
    parquet_ds = self.api.parquet.ParquetDataset(
        path, filesystem=get_fs_for_path(path), **kwargs
    )
    kwargs["columns"] = columns
    result = parquet_ds.read_pandas(**kwargs).to_pandas()
    return result

claytonlemons (Author) commented May 30, 2020

@claytonlemons you are missing the point

@jreback I understood your point, but I was referring to reporting the issue to pyarrow, not the fact that pyarrow is causing the traceback.

That said, it's still an assumption that pyarrow caused the regression. That's why I'm reluctant to report anything to pyarrow in the first place.

Let's dig into the issue some more:

  1. As shown by the stack trace, the first entry point into pyarrow 0.17.1 is pyarrow/parquet.py:1162. This is the ParquetDataset class, which pandas now uses in the new implementation of pandas.read_parquet.
  2. Running git blame on parquet.py:1162, I see no recent changes to the ParquetDataset class that would have caused this regression. In fact, neither the current documentation nor the documentation for previous versions says anything about supporting file-like objects, only paths.
  3. Even the _ParquetDatasetV2 class, which uses pyarrow's dataset implementation, does not and has not supported (for at least several months back) file-like objects.

So there are three possibilities:

  1. ParquetDataset supported file-like objects in the past but did not document it.
  2. ParquetDataset supported file-like objects in the past but recently removed support.
  3. ParquetDataset never supported file-like objects, but pandas made an incorrect assumption about the interface and did not properly test against it (see the quick check below).
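A minimal sketch confirming possibility 3 against pyarrow 0.17.x directly, mirroring the traceback above:

from io import BytesIO

import pyarrow.parquet as pq

try:
    pq.ParquetDataset(BytesIO())
except TypeError as exc:
    print(exc)  # "not a path-like object": the buffer is rejected before any bytes are read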

claytonlemons (Author) commented:

It would be helpful to know exactly where.

Please see the referenced pull request above.

jorisvandenbossche added the IO Parquet and Regression labels and removed the Bug and Needs Triage labels May 30, 2020
jorisvandenbossche (Member) commented:

@claytonlemons Thanks for the report!
It would have helped if you had mentioned that it is #33632 that broke this, to make it clear it's a pandas issue and not a pyarrow one (you actually commented first on that PR; GitHub shows that it is linked here, but that's easy to miss).

cc @simonjayhawkins @alimcmaster1

So in hindsight, we apparently should not have backported this (#34173). Anyway, that can happen.

What do we do with this one?
@simonjayhawkins would you have time to do a quick follow-up bugfix release?

jorisvandenbossche (Member) commented:

On the actual issue: as far as I know, pyarrow.parquet.ParquetDataset never supported file-like objects (while pyarrow.parquet.read_table does). So we clearly didn't have test coverage for this for the parquet format, unfortunately.

I think using the ParquetDataset was needed to get ...

@alimcmaster1 do you remember why you switched to ParquetDataset? I think that read_table should also support the filesystem keyword (which is what you added)

simonjayhawkins (Member) commented:

What do we do with this one?
@simonjayhawkins would you have time to do a quick follow-up bugfix release?

This also fails on master, so we can either

  1. fix on master and backport
  2. revert #34173 (Backport PR #33645, #33632 and #34087 on branch 1.0.x) on 1.0.x

From the previous discussion in #33970 (comment), it's probably best to create a 1.0.5 tag and set up the whatsnew. I'll open a PR for that.

simonjayhawkins added this to the 1.0.5 milestone May 30, 2020
simonjayhawkins (Member) commented:

probably best to create a 1.0.5 tag and setup the whatsnew.. I'll open a PR for that.

#34481

alimcmaster1 (Member) commented:

On the actual issue: as far as I know, pyarrow.parquet.ParquetDataset never supported file-like objects (while pyarrow.parquet.read_table does). So we clearly didn't have test coverage for this for the parquet format, unfortunately.

I think using the ParquetDataset was needed to get ...

@alimcmaster1 do you remember why you switched to ParquetDataset? I think that read_table should also support the filesystem keyword (which is what you added)

Yes, agreed, this is related to #33632. There is actually also an open PR to properly document the reading/writing behaviour for file-like objects: #33709.

@jorisvandenbossche we switched to ParquetDataset as it allows a directory path and hence we can read a partitioned dataset. I can submit a PR to fix and properly test the file-like behaviour that's broken here.
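For context, a minimal sketch of the partitioned-dataset use case that motivated the switch (the directory name is illustrative):

import pandas as pd

df = pd.DataFrame({"year": [2019, 2019, 2020], "a": [1, 2, 3]})
df.to_parquet("dataset_dir", partition_cols=["year"])  # writes dataset_dir/year=2019/... and year=2020/...
df2 = pd.read_parquet("dataset_dir")  # reads all partitions back into a single frame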

alimcmaster1 self-assigned this May 30, 2020
jorisvandenbossche (Member) commented:

we switched to ParquetDataset as it allows a directory path and hence we can read a partitioned dataset

read_table should also allow that (basically, read_table checks whether its path argument is a string path, in which case it dispatches to ParquetDataset under the hood, and otherwise uses an implementation that can read from a file-like object)
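In other words, both of these should work through read_table (a sketch assuming pyarrow 0.17.x; the directory path is illustrative):

from io import BytesIO

import pandas as pd
import pyarrow.parquet as pq

buf = BytesIO()
pd.DataFrame({"a": [1, 2, 3]}).to_parquet(buf)

table = pq.read_table(buf)  # file-like object: read directly
# table = pq.read_table("dataset_dir")  # string path: dispatched to ParquetDataset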

alimcmaster1 (Member) commented:

we switched to ParquetDataset as it allows a directory path and hence we can read a partitioned dataset

read_table should also allow that (basically, read_table checks whether its path argument is a string path, in which case it dispatches to ParquetDataset under the hood, and otherwise uses an implementation that can read from a file-like object)

Understood, thanks @jorisvandenbossche! I think I missed the fact that read_table supports a filesystem kwarg - I'll fix that in Arrow, as there is currently no description: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html.

The fix for this is here: #34500

jorisvandenbossche pushed a commit to apache/arrow that referenced this issue Jun 3, 2020
…able docstring

Use same doc string as ParquetDataset. https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html.

Looks like the arg currently has no docs. https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html

cc. @jorisvandenbossche our discussion here: pandas-dev/pandas#34467 (comment)

Closes #7332 from alimcmaster1/patch-1

Authored-by: alimcmaster1 <alimcmaster1@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
DuncanCasteleyn commented:

I also have a similar issue since version 1.0.4 and had to downgrade to 1.0.3. I was able to read files from Azure Blob Storage by providing the https URL with the SAS token as a query parameter directly; since 1.0.4 this is completely broken.

Example on 1.0.4

pd.read_parquet("https://*REDACTED*.blob.core.windows.net/raw/*REDACTED*/12.parquet?sv=*REDACTED*&ss=*REDACTED*&srt=*REDACTED*&sp=*REDACTED*&se=*REDACTED*&st=*REDACTED*&spr=https&sig=*REDACTED*")

Raises OSError: Passed non-file path: https://*REDACTED*.blob.core.windows.net/raw/*REDACTED*/12.parquet?sv=*REDACTED*&ss=*REDACTED*&srt=*REDACTED*&sp=*REDACTED*&se=*REDACTED*&st=*REDACTED*&spr=https&sig=*REDACTED*

This works perfectly on 1.0.3; the breakage forced us to roll back to pandas 1.0.3.

jorisvandenbossche (Member) commented:

We are reverting the original change for 1.0.5 (#34632), and will then need to make sure this is fixed for 1.1.0 on master.

kepler commented Jun 8, 2020

Another consequence of using ParquetDataset instead of read_table is that additional keyword arguments are passed both to the constructor and the read method:

parquet_ds = self.api.parquet.ParquetDataset(
    path, filesystem=get_fs_for_path(path), **kwargs
)
kwargs["columns"] = columns
result = parquet_ds.read_pandas(**kwargs).to_pandas()

But since ParquetDataset.read doesn't support all arguments supported in ParquetDataset.__init__, this leads to TypeErrors:

    df = pd.read_parquet(file_path, memory_map=True)
  File ".venv/lib/python3.7/site-packages/pandas/io/parquet.py", line 315, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File ".venv/lib/python3.7/site-packages/pandas/io/parquet.py", line 134, in read
    result = parquet_ds.read_pandas(**kwargs).to_pandas()
  File ".venv/lib/python3.7/site-packages/pyarrow/parquet.py", line 1304, in read_pandas
    return self.read(use_pandas_metadata=True, **kwargs)
TypeError: read() got an unexpected keyword argument 'memory_map'

austospumanto commented:

@kepler I wonder if explicitly separating the kwargs into two parameters might be a solution. Something like

from typing import Any, Dict, Optional

import pandas as pd

KW = Optional[Dict[str, Any]]

def read_parquet(..., init_kw: KW = None, read_kw: KW = None) -> pd.DataFrame:
    ...

Though I imagine it would be more "pandas-like" to just implicitly construct init_kw and read_kw according to what we know the allowed init/read parameters to be.
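A hedged sketch of that implicit approach (the helper name and wiring are hypothetical, not pandas API): inspect both signatures and route each kwarg to whichever callable declares it.

import inspect
from typing import Any, Callable, Dict, Tuple

def split_kwargs(
    kwargs: Dict[str, Any], init: Callable, read: Callable
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Partition kwargs by the parameter names each callable declares."""
    init_names = set(inspect.signature(init).parameters)
    read_names = set(inspect.signature(read).parameters)
    init_kw = {k: v for k, v in kwargs.items() if k in init_names}
    read_kw = {k: v for k, v in kwargs.items() if k in read_names}
    return init_kw, read_kw

pandas could then call ParquetDataset(path, **init_kw) and parquet_ds.read_pandas(**read_kw), so that neither call sees an argument it doesn't accept.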

kepler commented Jun 8, 2020

Yes, I'd agree it's better to keep consistency with <=1.0.3 and do the magic inside the read_parquet method. I'm not sure this will be easy to document clearly, though, so it might actually make things more confusing for the user. Then again, I'm not familiar with fastparquet, so maybe separating kwargs would not be consistent with it.

So, you see, I don't have a strong opinion about how it's designed. :) Except that it would have been great if a patch version hadn't broken compatibility.

alimcmaster1 (Member) commented:

I also have a similar issue since version 1.0.4 and had to downgrade to 1.0.3. I was able to read files from Azure Blob Storage by providing the https URL with the SAS token as a query parameter directly; since 1.0.4 this is completely broken.

@DuncanCasteleyn

Thanks for highlighting this. As Joris mentioned, we are reverting for 1.0.5. Could you confirm whether the fix targeting master (#34500) resolves your issue?

alimcmaster1 (Member) commented:

@kepler I wonder if explicitly separating the kwargs into two parameters might be a solution.

@austospumanto

The fix for master pandas 1.1 is https://github.com/pandas-dev/pandas/pull/34500/files#diff-cbd427661c53f1dcde6ec5fb9ab0effaR134

We can potentially add tests that cover a few more of the kwargs, since we clearly don't have coverage here at the moment.

nmerket added a commit to NREL/buildstockbatch that referenced this issue Jun 9, 2020
There was a bug introduced in pandas 1.0.4 that caused pd.read_parquet to no longer be able to handle file-like objects. They're fixing it in 1.0.5. This change will skip 1.0.4 and use the newer version when available.
pandas-dev/pandas#34467
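A hedged sketch of that kind of pin (the exact requirement string used by buildstockbatch may differ):

# setup.py fragment (illustrative): skip the broken 1.0.4 release.
from setuptools import setup

setup(
    name="example-package",  # hypothetical package name
    install_requires=["pandas>=1.0.0,!=1.0.4"],
)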
jorisvandenbossche added the Blocker label Jun 10, 2020
DuncanCasteleyn commented Jun 10, 2020

I also have a similar issue since version 1.0.4 and had to downgrade to 1.0.3. I was able to read files from Azure Blob Storage by providing the https URL with the SAS token as a query parameter directly; since 1.0.4 this is completely broken.

@DuncanCasteleyn

Thanks for highlighting this. As Joris mentioned, we are reverting for 1.0.5. Could you confirm whether the fix targeting master (#34500) resolves your issue?

Hi, I built the branch from the PR you mentioned and ran the code that uses Azure Data Lake/Blob Storage, and my data is being loaded again using direct URLs in the format I mentioned before.

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 92a883d48a9e8d184a0b6f2fcac3f79239c17040
python           : 3.7.6.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.18362
machine          : AMD64
processor        : AMD64 Family 23 Model 49 Stepping 0, AuthenticAMD
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 1.1.0.dev0+1742.g92a883d48
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.19
pytest           : 5.3.5
hypothesis       : 5.5.4
sphinx           : 2.4.0
blosc            : None
feather          : None
xlsxwriter       : 1.2.7
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : 1.3.2
fastparquet      : 0.3.2
gcsfs            : None
matplotlib       : 3.1.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : 0.15.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.16
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.48.0

lmmentel commented:

I'm in the same situation as @DuncanCasteleyn, using parquet files stored in an Azure Data Lake (blob). When I upgraded to pandas 1.0.4 without changing any code, I got:

   series = pd.read_parquet(instance.name)
2020-06-14T20:11:13.132239599Z   File "/usr/local/lib/python3.8/site-packages/pandas/io/parquet.py", line 315, in read_parquet
2020-06-14T20:11:13.132242599Z     return impl.read(path, columns=columns, **kwargs)
2020-06-14T20:11:13.132245499Z   File "/usr/local/lib/python3.8/site-packages/pandas/io/parquet.py", line 130, in read
2020-06-14T20:11:13.132248499Z     parquet_ds = self.api.parquet.ParquetDataset(
2020-06-14T20:11:13.132251499Z   File "/usr/local/lib/python3.8/site-packages/pyarrow/parquet.py", line 1171, in __init__
2020-06-14T20:11:13.132254499Z     self.metadata_path) = _make_manifest(
2020-06-14T20:11:13.132257399Z   File "/usr/local/lib/python3.8/site-packages/pyarrow/parquet.py", line 1367, in _make_manifest
2020-06-14T20:11:13.132260399Z     raise OSError('Passed non-file path: {}'

OSError: Passed non-file path: ****

Everything works as expected with version 1.0.3.

jorisvandenbossche (Member) commented:

Closing this, as the change has been reverted in 1.0.5 (#34632, #34787) and fixed in master (#34500).

ldacey commented Jun 23, 2020

Ahh - I was bitten by this, but ended up using pq.read_table() with pyarrow and removed the pandas.read_parquet() portions of the code. Upgrading to 1.0.5 now.
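For anyone stuck on 1.0.4, a minimal sketch of that workaround:

from io import BytesIO

import pandas as pd
import pyarrow.parquet as pq

buf = BytesIO()
pd.DataFrame({"a": [1, 2, 3]}).to_parquet(buf)

df = pq.read_table(buf).to_pandas()  # works even where pd.read_parquet(buf) fails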
