ComputeError: unable to parse Hive partition value: "TRUE" #16381

proever · 2024-05-21T20:37:20Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import datetime
from pathlib import Path

import polars as pl
import pyarrow.parquet as pq

dataset_path = Path("./test_dataset")
test_df = (
    pl.DataFrame(
        {
            "timestamp": [datetime(2021, 1, 1), datetime(2021, 2, 1)],
            "data": [1, 2],
            "ticker": ["AAPL", "TRUE"],
        }
    )
    .with_columns(pl.col("ticker").cast(pl.Enum(["AAPL", "TRUE"])))
    .with_columns(
        pl.col("timestamp").dt.year().alias("year"), pl.col("timestamp").dt.month().alias("month")
    )
)

pq.write_to_dataset(test_df.to_arrow(), dataset_path, partition_cols=["year", "month", "ticker"])

df_in = pl.scan_parquet(dataset_path.rglob("*.parquet"))

df_in.filter(pl.col("ticker") == "AAPL").collect()

Log output

No logs, it seems; the error is below:

Traceback (most recent call last):
  File "/home/proever/code/data_pipeline/mve.py", line 26, in <module>
    df_in.filter(pl.col("ticker") == "AAPL").collect()
  File "/home/proever/code/data_pipeline/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 1816, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: unable to parse Hive partition value: "TRUE"

This error occurred with the following context stack:
        [1] 'parquet scan' failed
        [2] 'filter' input failed to resolve

Issue description

I have a large dataset of stock data that includes ticker names. One of those tickers is "TRUE" (apparently it's the company TrueCar). I am trying to partition the dataset by ticker, among other columns. Writing seems to work fine, but when I try to load the dataset back in with scan_parquet, Polars fails to parse the "TRUE" partition (see above).

Changing the value "TRUE" to "TEST" (in both the data and the Enum) fixes the issue. Changing it to "FALSE" instead causes the error to occur again, which leads me to believe some logic is attempting to parse partition values as booleans.
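To illustrate the suspicion above, here is a minimal pure-Python sketch of how a naive type-inference pass over Hive partition directory names could turn the ticker "TRUE" into a boolean while leaving "AAPL" as a string. This is not Polars' actual implementation: the function names `infer_partition_value` and `parse_hive_path` and the inference order (bool, int, float, string) are assumptions made purely for illustration.

```python
def infer_partition_value(raw: str):
    """Hypothetical inference order: bool, then int, then float, else string."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        # A ticker such as "TRUE" is caught here and becomes a boolean.
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_hive_path(path: str) -> dict:
    """Collect key=value segments from a Hive-style path into a dict."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = infer_partition_value(value)
    return parts


# The same "ticker" column ends up with a string in one partition
# and a boolean in another, which is the kind of conflict suspected here.
print(parse_hive_path("ticker=AAPL/year=2021/month=1/part-0.parquet"))
print(parse_hive_path("ticker=TRUE/year=2021/month=2/part-0.parquet"))
```

Under this assumed logic, "TEST" would stay a string while "TRUE" and "FALSE" would not, which matches the behavior observed above.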

Expected behavior

The code above should run with no errors, as it does when "TRUE" is replaced with "TEST" (for example).

Installed versions

--------Version info---------
Polars:               0.20.27
Index type:           UInt32
Platform:             Linux-6.1.0-20-amd64-x86_64-with-glibc2.36
Python:               3.12.3 (main, Apr 15 2024, 18:25:56) [Clang 17.0.6 ]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              16.1.0
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.30
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

proever · 2024-05-21T20:40:26Z

partitioning only on "ticker" causes the issue too, I just checked

proever · 2024-05-21T20:50:19Z

pq.ParquetDataset(dataset_path).read().to_pandas() works fine as well

proever · 2024-07-09T10:41:28Z

this seems fixed in 1.0, at least with the following changes:

from datetime import datetime
from pathlib import Path

import polars as pl

dataset_path = Path("./test_dataset")
dataset_path.mkdir(exist_ok=True)

test_df = (
    pl.DataFrame(
        {
            "timestamp": [datetime(2021, 1, 1), datetime(2022, 2, 1)],
            "data": [1, 2],
            "ticker": ["AAPL", "TRUE"],
        }
    )
    .with_columns(pl.col("ticker").cast(pl.Enum(["AAPL", "TRUE"])))
    .with_columns(
        pl.col("timestamp").dt.year().alias("year"), pl.col("timestamp").dt.month().alias("month")
    )
)

test_df.write_parquet(
    dataset_path, use_pyarrow=True, pyarrow_options={"partition_cols": ["ticker", "year", "month"]}
)

df_in = pl.scan_parquet(dataset_path, hive_partitioning=True)

print(df_in.filter(pl.col("ticker") == "AAPL").collect())

proever added the bug, needs triage, and python labels on May 21, 2024

proever closed this as completed Jul 9, 2024
