ComputeError: parquet: File out of specification: Invalid thrift: protocol error #17346

Closed
2 tasks done
eromoe opened this issue Jul 2, 2024 · 2 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

eromoe commented Jul 2, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
df0 = pl.scan_parquet(r'C:\Datasets\cn_data\ods_date\tushare\fina_indicator_vip\1d\year=2009\*\*.parquet', hive_partitioning=False)
df0 = df0.collect().to_pandas()

Log output

ComputeError                              Traceback (most recent call last)
Cell In[201], line 3
      1 import polars as pl
      2 df0 = pl.scan_parquet(r'C:\Datasets\cn_data\ods_date\tushare\fina_indicator_vip\1d\year=2009\*\*.parquet', hive_partitioning=False)
----> 3 df0 = df0.collect().to_pandas()

File c:\envs\quant\lib\site-packages\polars\lazyframe\frame.py:1967, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1964 # Only for testing purposes atm.
   1965 callback = _kwargs.get("post_opt_callback")
-> 1967 return wrap_df(ldf.collect(callback))

ComputeError: parquet: File out of specification: Invalid thrift: protocol error

Issue description

Reading the files fails with `ComputeError: parquet: File out of specification: Invalid thrift: protocol error`.
All of my data comes from the same data API, and I only see this problem in the year=2009 partition.
Test files:
year=2009.zip
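
Because the glob covers many files, it can help to find out which ones actually fail. The following is a minimal sketch, not part of the original report: `find_unreadable` is a hypothetical helper that tries each file separately and collects the failures. By default it assumes `pl.read_parquet_schema` is a good enough probe, since it only needs to decode the file's metadata; any other reader can be passed in.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional


def find_unreadable(
    paths: Iterable[Path],
    reader: Optional[Callable[[Path], object]] = None,
) -> list[tuple[Path, str]]:
    """Try to read each path on its own and return (path, error message) for failures.

    `reader` is any callable that raises when a file cannot be read.
    When omitted, Polars' schema reader is used (assumption: Polars is installed).
    """
    if reader is None:
        import polars as pl  # imported lazily so the helper itself has no hard dependency

        reader = pl.read_parquet_schema

    failures = []
    for path in paths:
        try:
            reader(path)
        except Exception as exc:  # deliberately broad: any failure marks the file as suspect
            failures.append((Path(path), f"{type(exc).__name__}: {exc}"))
    return failures


# Example usage (the directory below is the one from the report):
# root = Path(r"C:\Datasets\cn_data\ods_date\tushare\fina_indicator_vip\1d\year=2009")
# for path, err in find_unreadable(sorted(root.glob("*/*.parquet"))):
#     print(path, "->", err)
```

Checking files one at a time turns a single opaque `ComputeError` into a list of specific paths that can be inspected or regenerated.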

Expected behavior

no error

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Windows-10-10.0.19041-SP0
Python:               3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:40:08) [MSC v.1938 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.24.4
openpyxl:             3.1.2
pandas:               1.5.3
pyarrow:              15.0.2
pydantic:             2.6.4
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.29
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
eromoe added the bug, needs triage, and python labels Jul 2, 2024
ritchie46 (Member) commented

Sounds like an invalid file?
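
One quick way to test that hypothesis without any Parquet library is to check the file's outer layout. A Parquet file with a plaintext footer starts with the 4-byte magic `PAR1` and ends with a 4-byte little-endian footer length followed by `PAR1` again. The sketch below uses a hypothetical helper, `parquet_layout_problem`, which checks only that envelope. It can catch truncated or overwritten files, but a file can pass it and still contain footer metadata that fails Thrift decoding, so treat a clean result as "not obviously broken" rather than "valid".

```python
import struct
from pathlib import Path
from typing import Optional

MAGIC = b"PAR1"  # magic bytes for Parquet files with an unencrypted footer


def parquet_layout_problem(path) -> Optional[str]:
    """Return a description of the first envelope problem found, or None if none is found."""
    size = Path(path).stat().st_size
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if size < 12:
        return "file too small to be Parquet"

    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, 2)  # last 8 bytes: footer length (4) + trailing magic (4)
        tail = f.read(8)

    if head != MAGIC:
        return "missing leading PAR1 magic"
    if tail[4:] != MAGIC:
        return "missing trailing PAR1 magic (file may be truncated)"

    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len == 0 or footer_len > size - 12:
        return f"implausible footer length {footer_len}"
    return None
```

A file that was only partially written typically fails the trailing-magic check, because the footer is the last thing a writer emits.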

eromoe (Author) commented Jul 2, 2024

After some investigating, I found that the cause may be pandas. Strangely, the file can be written every time without an error, but the result can't be read by either pandas or Polars.
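
If a writer can succeed while producing an unreadable file, one defensive pattern is to write to a temporary file, read it back, and only then move it into place. The sketch below is an assumption about how that could be done, not something from this thread: `write_verified` is a hypothetical helper, and the `write` and `verify` callables are supplied by the caller.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def write_verified(
    dest,
    write: Callable[[Path], object],
    verify: Callable[[Path], object],
) -> None:
    """Write to a temp file next to `dest`, verify it, then replace `dest`.

    `write(path)` must create the file; `verify(path)` must raise if it is unreadable.
    If either step raises, `dest` is left untouched and the temp file is removed.
    """
    dest = Path(dest)
    # Create the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(suffix=dest.suffix, dir=dest.parent)
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        write(tmp)
        verify(tmp)
        os.replace(tmp, dest)
    finally:
        if tmp.exists():
            tmp.unlink()


# Example usage with pandas (assumes `df` is an existing DataFrame):
# import pandas as pd
# write_verified("part.parquet", lambda p: df.to_parquet(p), lambda p: pd.read_parquet(p))
```

The read-back costs an extra pass over the data, but it surfaces a bad write at the moment it happens instead of when the partition is scanned later.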

@eromoe eromoe closed this as completed Jul 2, 2024