
BUG: read_parquet no longer supports file-like objects #34467

Closed
claytonlemons opened this issue May 29, 2020 · 26 comments
Labels: Blocker (Blocking issue or pull request for an upcoming release), IO Parquet (parquet, feather), Regression (Functionality that used to work in a prior pandas version)
Milestone: 1.0.5
Assignees: alimcmaster1

Comments

claytonlemons commented May 29, 2020

Code Sample, a copy-pastable example

from io import BytesIO
import pandas as pd

buffer = BytesIO()

df = pd.DataFrame([1,2,3], columns=["a"])
df.to_parquet(buffer)

df2 = pd.read_parquet(buffer)  # raises TypeError on pandas 1.0.4; see traceback below

Problem description

On pandas 1.0.4, read_parquet(buffer) raises the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./working_dir/tvenv/lib/python3.7/site-packages/pandas/io/parquet.py", line 315, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
    path, filesystem=get_fs_for_path(path), **kwargs
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/parquet.py", line 1162, in __init__
    self.paths = _parse_uri(path_or_paths)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/parquet.py", line 47, in _parse_uri
    path = _stringify_path(path)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/util.py", line 67, in _stringify_path
    raise TypeError("not a path-like object")
TypeError: not a path-like object

Expected Output

Instead, read_parquet(buffer) should return a new DataFrame with the same contents as the serialized DataFrame stored in buffer.
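For contrast, a minimal sketch of the expected round trip, as it behaves on pandas 1.0.3 with the same pyarrow (the assert is illustrative):

# Expected behavior: the buffer round-trips cleanly.
from io import BytesIO

import pandas as pd

buffer = BytesIO()
pd.DataFrame({"a": [1, 2, 3]}).to_parquet(buffer)

df2 = pd.read_parquet(buffer)  # no seek(0) needed: parquet readers use random access
assert df2["a"].tolist() == [1, 2, 3]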

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-99-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 9.0.1
setuptools : 39.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

claytonlemons added the Bug and Needs Triage labels May 29, 2020
jreback (Contributor) commented May 29, 2020

this is almost certainly a regression in pyarrow itself

please report there

claytonlemons (Author) commented:

this is almost certainly a regression in pyarrow itself
please report there

I mean no offense, but this sounds like the responsibility of pandas maintainers, since pandas directly consumes pyarrow.

jreback (Contributor) commented May 29, 2020

@claytonlemons you are missing the point

it’s calling pyarrow code - see the traceback

jreback closed this as completed May 29, 2020
austospumanto commented May 30, 2020

@claytonlemons I am encountering the same issue.

If I downgrade from 1.0.4 to 1.0.3 (while keeping the pyarrow version the same), I can again read from BytesIO buffers without issue. Since upgrading pandas from 1.0.3 to 1.0.4 is both necessary and sufficient to trigger the file-like object reading issue, it does seem correct to consider this an issue with pandas, not pyarrow.

@jreback Would you consider reopening this issue?

jreback reopened this May 30, 2020
jreback (Contributor) commented May 30, 2020

OK, sure, something must have gone wrong in the backport.

It would be helpful to know exactly where.

austospumanto commented May 30, 2020

Thanks, @jreback.

For starters, it looks like the implementation of io/parquet.py#PyArrowImpl.read changed significantly between 1.0.3 and 1.0.4.

1.0.3

def read(self, path, columns=None, **kwargs):
    # get_filepath_or_buffer accepts paths, URLs, and file-like objects alike
    path, _, _, should_close = get_filepath_or_buffer(path)

    kwargs["use_pandas_metadata"] = True
    result = self.api.parquet.read_table(
        path, columns=columns, **kwargs
    ).to_pandas()
    if should_close:
        path.close()

    return result

1.0.4

def read(self, path, columns=None, **kwargs):
    # ParquetDataset stringifies `path` up front, so file-like objects raise TypeError
    parquet_ds = self.api.parquet.ParquetDataset(
        path, filesystem=get_fs_for_path(path), **kwargs
    )
    kwargs["columns"] = columns
    result = parquet_ds.read_pandas(**kwargs).to_pandas()
    return result

claytonlemons (Author) commented May 30, 2020

@claytonlemons you are missing the point

@jreback I understood your point, but I was referring to reporting the issue to pyarrow, not the fact that pyarrow is causing the traceback.

That said, it's still an assumption that pyarrow caused the regression. That's why I'm reluctant to report anything to pyarrow in the first place.

Let's dig into the issue some more:

  1. As shown by the stack trace, the first entry point into pyarrow 0.17.1 is pyarrow/parquet.py:1162. This is the ParquetDataset class, which pandas now uses in the new implementation of pandas.read_parquet.
  2. Running git blame on parquet.py:1162, I see no recent changes to the ParquetDataset class that would have caused this regression. In fact, neither the current documentation nor the documentation for previous versions says anything about supporting file-like objects, only paths.
  3. Even the _ParquetDatasetV2 class, which uses pyarrow's dataset implementation, does not and has not supported (for at least several months back) file-like objects.

So there are three possibilities:

  1. ParquetDataset supported file-like objects in the past but did not document it.
  2. ParquetDataset supported file-like objects in the past but recently removed support.
  3. ParquetDataset never supported file-like objects, but pandas made an incorrect assumption about the interface and did not properly test against it (see the quick check below).
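A minimal sketch confirming possibility 3 against pyarrow 0.17.x directly, mirroring the traceback above:

from io import BytesIO

import pyarrow.parquet as pq

try:
    pq.ParquetDataset(BytesIO())
except TypeError as exc:
    print(exc)  # "not a path-like object": the buffer is rejected before any bytes are read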

claytonlemons (Author) commented:

It would be helpful to know exactly where.

Please see the referenced pull request above.

jorisvandenbossche added the IO Parquet and Regression labels and removed the Bug and Needs Triage labels May 30, 2020
jorisvandenbossche (Member) commented:

@claytonlemons Thanks for the report!
It would have helped if you had mentioned that it is #33632 that broke this, to make it clear it's a pandas issue and not a pyarrow one (you actually commented first on that PR; GitHub shows that it is linked here, but that's easy to miss).

cc @simonjayhawkins @alimcmaster1

So in hindsight, we apparently should not have backported this (#34173). Anyway, that can happen.

What do we do with this one?
@simonjayhawkins would you have time to do a quick follow-up bugfix release?

jorisvandenbossche (Member) commented:

On the actual issue: as far as I know, pyarrow.parquet.ParquetDataset never supported file-like objects (while pyarrow.parquet.read_table does). So we clearly didn't have test coverage for this for the parquet format, unfortunately.

I think using the ParquetDataset was needed to get ...

@alimcmaster1 do you remember why you switched to ParquetDataset? I think that read_table should also support the filesystem keyword (which is what you added)

simonjayhawkins (Member) commented:

What do we do with this one?
@simonjayhawkins would you have time to do a quick follow-up bugfix release?

This also fails on master, so we can either

  1. fix on master and backport
  2. revert #34173 (Backport PR #33645, #33632 and #34087 on branch 1.0.x) on 1.0.x

From the previous discussion in #33970 (comment), it's probably best to create a 1.0.5 tag and set up the whatsnew. I'll open a PR for that.

simonjayhawkins added this to the 1.0.5 milestone May 30, 2020
simonjayhawkins (Member) commented:

probably best to create a 1.0.5 tag and setup the whatsnew.. I'll open a PR for that.

#34481

alimcmaster1 (Member) commented:

On the actual issue: as far as I know, pyarrow.parquet.ParquetDataset never supported file-like objects (while pyarrow.parquet.read_table does). So we clearly didn't have test coverage for this for the parquet format, unfortunately.

I think using the ParquetDataset was needed to get ...

@alimcmaster1 do you remember why you switched to ParquetDataset? I think that read_table should also support the filesystem keyword (which is what you added)

Yes, agreed, this is related to #33632. There is actually also an open PR to properly document the reading/writing behaviour for file-like objects: #33709.

@jorisvandenbossche we switched to ParquetDataset as it allows a directory path and hence we can read a partitioned dataset. I can submit a PR to fix and properly test the file-like behaviour that's broken here.
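For context, a minimal sketch of the partitioned-dataset use case that motivated the switch (the directory name is illustrative):

import pandas as pd

df = pd.DataFrame({"year": [2019, 2019, 2020], "a": [1, 2, 3]})
df.to_parquet("dataset_dir", partition_cols=["year"])  # writes dataset_dir/year=2019/... and year=2020/...
df2 = pd.read_parquet("dataset_dir")  # reads all partitions back into a single frame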

alimcmaster1 self-assigned this May 30, 2020
jorisvandenbossche (Member) commented:

we switched to ParquetDataset as it allows a directory path and hence we can read a partitioned dataset

read_table should also allow that (basically, read_table checks whether its path argument is a string path, in which case it dispatches to ParquetDataset under the hood, and otherwise uses an implementation that can read from a file-like object)
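In other words, both of these should work through read_table (a sketch assuming pyarrow 0.17.x; the directory path is illustrative):

from io import BytesIO

import pandas as pd
import pyarrow.parquet as pq

buf = BytesIO()
pd.DataFrame({"a": [1, 2, 3]}).to_parquet(buf)

table = pq.read_table(buf)  # file-like object: read directly
# table = pq.read_table("dataset_dir")  # string path: dispatched to ParquetDataset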

alimcmaster1 (Member) commented:

we switched to ParquetDataset as it allows a directory path and hence we can read a partitioned dataset

read_table should also allow that (basically, read_table checks whether its path argument is a string path, in which case it dispatches to ParquetDataset under the hood, and otherwise uses an implementation that can read from a file-like object)

Understood, thanks @jorisvandenbossche! I think I missed the fact that read_table supports a filesystem kwarg - I'll fix that in Arrow, as there is currently no description: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html.

The fix for this is here: #34500

jorisvandenbossche pushed a commit to apache/arrow that referenced this issue Jun 3, 2020
…able docstring

Use same doc string as ParquetDataset. https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html.

Looks like the arg currently has no docs. https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html

cc. @jorisvandenbossche our discussion here: pandas-dev/pandas#34467 (comment)

Closes #7332 from alimcmaster1/patch-1

Authored-by: alimcmaster1 <alimcmaster1@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
DuncanCasteleyn commented:

I also have a similar issue since version 1.0.4 and had to downgrade to 1.0.3. I was able to read files from Azure Blob Storage by providing the https URL with the SAS token as a query parameter directly; since 1.0.4 this is completely broken.

Example on 1.0.4

pd.read_parquet("https://*REDACTED*.blob.core.windows.net/raw/*REDACTED*/12.parquet?sv=*REDACTED*&ss=*REDACTED*&srt=*REDACTED*&sp=*REDACTED*&se=*REDACTED*&st=*REDACTED*&spr=https&sig=*REDACTED*")

Raises OSError: Passed non-file path: https://*REDACTED*.blob.core.windows.net/raw/*REDACTED*/12.parquet?sv=*REDACTED*&ss=*REDACTED*&srt=*REDACTED*&sp=*REDACTED*&se=*REDACTED*&st=*REDACTED*&spr=https&sig=*REDACTED*

This works perfectly on 1.0.3; the breakage forced us to roll back to pandas 1.0.3.

jorisvandenbossche (Member) commented:

We are reverting the original change for 1.0.5 (#34632), and will then need to make sure this is fixed for 1.1.0 on master.

kepler commented Jun 8, 2020

Another consequence of using ParquetDataset instead of read_table is that additional keyword arguments are passed both to the constructor and the read method:

parquet_ds = self.api.parquet.ParquetDataset(
    path, filesystem=get_fs_for_path(path), **kwargs
)
kwargs["columns"] = columns
result = parquet_ds.read_pandas(**kwargs).to_pandas()

But since ParquetDataset.read doesn't support all arguments supported in ParquetDataset.__init__, this leads to TypeErrors:

    df = pd.read_parquet(file_path, memory_map=True)
  File ".venv/lib/python3.7/site-packages/pandas/io/parquet.py", line 315, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File ".venv/lib/python3.7/site-packages/pandas/io/parquet.py", line 134, in read
    result = parquet_ds.read_pandas(**kwargs).to_pandas()
  File ".venv/lib/python3.7/site-packages/pyarrow/parquet.py", line 1304, in read_pandas
    return self.read(use_pandas_metadata=True, **kwargs)
TypeError: read() got an unexpected keyword argument 'memory_map'

austospumanto commented:

@kepler I wonder if explicitly separating the kwargs into two parameters might be a solution. Something like

from typing import Any, Dict, Optional

import pandas as pd

KW = Optional[Dict[str, Any]]

def read_parquet(..., init_kw: KW = None, read_kw: KW = None) -> pd.DataFrame:
    ...

Though I imagine it would be more "pandas-like" to just implicitly construct init_kw and read_kw according to what we know the allowed init/read parameters to be.
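A hedged sketch of that implicit approach (the helper name and wiring are hypothetical, not pandas API): inspect both signatures and route each kwarg to whichever callable declares it.

import inspect
from typing import Any, Callable, Dict, Tuple

def split_kwargs(
    kwargs: Dict[str, Any], init: Callable, read: Callable
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Partition kwargs by the parameter names each callable declares."""
    init_names = set(inspect.signature(init).parameters)
    read_names = set(inspect.signature(read).parameters)
    init_kw = {k: v for k, v in kwargs.items() if k in init_names}
    read_kw = {k: v for k, v in kwargs.items() if k in read_names}
    return init_kw, read_kw

pandas could then call ParquetDataset(path, **init_kw) and parquet_ds.read_pandas(**read_kw), so that neither call sees an argument it doesn't accept.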

kepler commented Jun 8, 2020

Yes, I'd agree it's better to keep consistency with <=1.0.3 and do the magic inside the read_parquet method. I'm not sure this will be easy to document clearly, though, so it might actually make things more confusing for the user. Then again, I'm not familiar with fastparquet, so maybe separating kwargs would not be consistent with it.

So, you see, I don't have a strong opinion about how it's designed. :) Except that it would have been great if a patch version hadn't broken compatibility.

alimcmaster1 (Member) commented:

I also have a similar issue since version 1.0.4 and had to downgrade to 1.0.3. I was able to read files from Azure Blob Storage by providing the https URL with the SAS token as a query parameter directly; since 1.0.4 this is completely broken.

@DuncanCasteleyn

Thanks for highlighting this. As Joris mentioned, we are reverting for 1.0.5. Could you confirm whether the fix targeting master (#34500) resolves your issue?

alimcmaster1 (Member) commented:

@kepler I wonder if explicitly separating the kwargs into two parameters might be a solution.

@austospumanto

The fix for master pandas 1.1 is https://github.com/pandas-dev/pandas/pull/34500/files#diff-cbd427661c53f1dcde6ec5fb9ab0effaR134

We can potentially add tests that cover a few more of the kwargs, since we clearly don't have coverage here at the moment.

nmerket added a commit to NREL/buildstockbatch that referenced this issue Jun 9, 2020
There was a bug introduced in pandas 1.0.4 that caused pd.read_parquet to no longer be able to handle file-like objects. They're fixing it in 1.0.5. This change will skip 1.0.4 and use the newer version when available.
pandas-dev/pandas#34467
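A hedged sketch of that kind of pin (the exact requirement string used by buildstockbatch may differ):

# setup.py fragment (illustrative): skip the broken 1.0.4 release.
from setuptools import setup

setup(
    name="example-package",  # hypothetical package name
    install_requires=["pandas>=1.0.0,!=1.0.4"],
)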
jorisvandenbossche added the Blocker label Jun 10, 2020
DuncanCasteleyn commented Jun 10, 2020

I also have a similar issue since version 1.0.4 and had to downgrade to 1.0.3. I was able to read files from Azure Blob Storage by providing the https URL with the SAS token as a query parameter directly; since 1.0.4 this is completely broken.

@DuncanCasteleyn

Thanks for highlighting this. As Joris mentioned, we are reverting for 1.0.5. Could you confirm whether the fix targeting master (#34500) resolves your issue?

Hi, I built the branch from the PR you mentioned and ran the code that uses Azure Data Lake/Blob Storage, and my data is being loaded again using direct URLs in the format I mentioned before.

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 92a883d48a9e8d184a0b6f2fcac3f79239c17040
python           : 3.7.6.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.18362
machine          : AMD64
processor        : AMD64 Family 23 Model 49 Stepping 0, AuthenticAMD
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 1.1.0.dev0+1742.g92a883d48
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.19
pytest           : 5.3.5
hypothesis       : 5.5.4
sphinx           : 2.4.0
blosc            : None
feather          : None
xlsxwriter       : 1.2.7
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : 1.3.2
fastparquet      : 0.3.2
gcsfs            : None
matplotlib       : 3.1.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : 0.15.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.16
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.48.0

lmmentel commented:

I'm in the same situation as @DuncanCasteleyn, using parquet files stored in an Azure Data Lake (blob). When I upgraded to pandas 1.0.4 without changing any code, I got:

   series = pd.read_parquet(instance.name)
2020-06-14T20:11:13.132239599Z   File "/usr/local/lib/python3.8/site-packages/pandas/io/parquet.py", line 315, in read_parquet
2020-06-14T20:11:13.132242599Z     return impl.read(path, columns=columns, **kwargs)
2020-06-14T20:11:13.132245499Z   File "/usr/local/lib/python3.8/site-packages/pandas/io/parquet.py", line 130, in read
2020-06-14T20:11:13.132248499Z     parquet_ds = self.api.parquet.ParquetDataset(
2020-06-14T20:11:13.132251499Z   File "/usr/local/lib/python3.8/site-packages/pyarrow/parquet.py", line 1171, in __init__
2020-06-14T20:11:13.132254499Z     self.metadata_path) = _make_manifest(
2020-06-14T20:11:13.132257399Z   File "/usr/local/lib/python3.8/site-packages/pyarrow/parquet.py", line 1367, in _make_manifest
2020-06-14T20:11:13.132260399Z     raise OSError('Passed non-file path: {}'

OSError: Passed non-file path: ****

Everything works as expected with version 1.0.3.

jorisvandenbossche (Member) commented:

Closing this, as the change has been reverted in 1.0.5 (#34632, #34787) and fixed in master (#34500).

ldacey commented Jun 23, 2020

Ahh - I was bitten by this, but ended up using pq.read_table() with pyarrow and removed the pandas.read_parquet() portions of the code. Upgrading to 1.0.5 now.
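For anyone stuck on 1.0.4, a minimal sketch of that workaround:

from io import BytesIO

import pandas as pd
import pyarrow.parquet as pq

buf = BytesIO()
pd.DataFrame({"a": [1, 2, 3]}).to_parquet(buf)

df = pq.read_table(buf).to_pandas()  # works even where pd.read_parquet(buf) fails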
