LazyFrame() not omitting hive partition columns #16404

Open
2 tasks done
j0bekt01 opened this issue May 22, 2024 · 0 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import datetime

import polars as pl
import pyarrow.parquet as pq
import s3fs  # required for the S3FileSystem used below

bucket = "my-bucket"  # placeholder: the actual S3 bucket is not shown in the report

dt = datetime.datetime(2024, 5, 17)
path = f"{bucket}/folder-to-files/year={dt.year}/month={dt.month:02d}/"
dataset = pq.ParquetDataset(path, partitioning='hive', filesystem=s3fs.S3FileSystem())

# This fails with ColumnNotFoundError
(
    pl.LazyFrame(dataset.read())
    .select(pl.all())
    .head(100)
    .collect()
)

# Remove the partition columns and it works
cols = dataset.schema.names
for item in ('year', 'month', 'day', 'hour'):
    if item in cols:
        cols.remove(item)

(
    pl.LazyFrame(dataset.read(cols))
    .select(pl.all())
    .head(100)
    .collect()
)
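
For comparison, a sketch that bypasses pyarrow and lets Polars scan the hive-partitioned dataset itself. This assumes this Polars version's scan_parquet() supports the hive_partitioning flag and that S3 credentials are available to Polars (e.g. via environment variables); the glob pattern is illustrative, not taken from the original report.

# Hypothetical alternative: Polars' native reader parses the
# year=/month=/day=/hour= directories into columns on its own.
lazy = pl.scan_parquet(
    f"s3://{bucket}/folder-to-files/**/*.parquet",
    hive_partitioning=True,
)
print(lazy.head(100).collect())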

Log output

1 # Use Lazyframe
      2 # This is like spark nothing is read until you call an action i.e. collect() or count() etc.
      4 (
----> 5     pl.LazyFrame(dataset.read()) 
      6     .select(pl.all()) 
      7     .filter(pl.col('id') == 12701477)
      8     .head(100)
      9     .collect()
     10 )
     12 # Not lazy
     13 # df = pl.from_arrow(dataset.read(cols)) 

File c:\Users\jaybe\AppData\Local\Programs\Python\Python310\lib\site-packages\polars\lazyframe\frame.py:303, in LazyFrame.__init__(self, data, schema, schema_overrides, strict, orient, infer_schema_length, nan_to_null)
    289 def __init__(
    290     self,
    291     data: FrameInitTypes | None = None,
   (...)
    298     nan_to_null: bool = False,
    299 ):
    300     from polars.dataframe import DataFrame
    302     self._ldf = (
--> 303         DataFrame(
    304             data=data,
    305             schema=schema,
    306             schema_overrides=schema_overrides,
    307             strict=strict,
    308             orient=orient,
    309             infer_schema_length=infer_schema_length,
    310             nan_to_null=nan_to_null,
    311         )
    312         .lazy()
    313         ._ldf
    314     )

File c:\Users\jaybe\AppData\Local\Programs\Python\Python310\lib\site-packages\polars\dataframe\frame.py:409, in DataFrame.__init__(self, data, schema, schema_overrides, strict, orient, infer_schema_length, nan_to_null)
    399     self._df = numpy_to_pydf(
    400         data,
    401         schema=schema,
   (...)
    405         nan_to_null=nan_to_null,
    406     )
    408 elif _check_for_pyarrow(data) and isinstance(data, pa.Table):
--> 409     self._df = arrow_to_pydf(
    410         data, schema=schema, schema_overrides=schema_overrides, strict=strict
    411     )
    413 elif _check_for_pandas(data) and isinstance(data, pd.DataFrame):
    414     self._df = pandas_to_pydf(
    415         data, schema=schema, schema_overrides=schema_overrides, strict=strict
    416     )

File c:\Users\jaybe\AppData\Local\Programs\Python\Python310\lib\site-packages\polars\_utils\construction\dataframe.py:1161, in arrow_to_pydf(data, schema, schema_overrides, strict, rechunk)
   1158     reset_order = True
   1160 if reset_order:
-> 1161     df = df[names]
   1162     pydf = df._df
   1164 if column_names != original_schema and (schema_overrides or original_schema):

File c:\Users\jaybe\AppData\Local\Programs\Python\Python310\lib\site-packages\polars\dataframe\frame.py:1171, in DataFrame.__getitem__(self, item)
   1166         return self._from_pydf(self._df.select(item))
   1168 if is_str_sequence(item, allow_str=False):
   1169     # select multiple columns
   1170     # df[["foo", "bar"]]
-> 1171     return self._from_pydf(self._df.select(item))
   1172 elif is_int_sequence(item):
   1173     item = pl.Series("", item)  # fall through to next if isinstance

ColumnNotFoundError: day

Issue description

I'm trying to read Parquet files from S3 that use a Hive partition layout ('/year=YYYY/month=MM/day=DD/hour=HH/') via pyarrow's .read() method, but constructing a Polars LazyFrame from the result fails, stating that one of the partition columns doesn't exist. However, if I exclude the partition columns and pass .read() a list of only the columns actually present in the files, it works without any issues. According to the pyarrow documentation, the read() method ignores Hive partition columns; Polars' LazyFrame(), however, still attempts to select them.
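
For reference, a minimal workaround sketch that stays on the pyarrow path: drop whichever hive partition columns actually appear in the materialized table before handing it to Polars. This relies on pyarrow's Table.drop_columns(), which raises KeyError for unknown names, hence the filter; the column names are the ones from the reproducible example above.

table = dataset.read()
# Keep only partition columns that made it into the table;
# drop_columns() raises KeyError for names that are absent.
present = [c for c in ('year', 'month', 'day', 'hour') if c in table.column_names]
lazy = pl.LazyFrame(table.drop_columns(present))
print(lazy.head(100).collect())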

Expected behavior

LazyFrame() should omit the hive partition columns instead of raising ColumnNotFoundError when constructed from the table returned by dataset.read().

Installed versions

--------Version info---------
Polars: 0.20.26
Index type: UInt32
Platform: Windows-10-10.0.22000-SP0
Python: 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager: 0.10.0
cloudpickle: 2.2.1
connectorx: 0.3.2
deltalake: <not installed>
fastexcel: <not installed>
fsspec: 2024.5.0
gevent: 23.7.0
hvplot: 0.9.2
matplotlib: 3.7.2
nest_asyncio: 1.5.6
numpy: 1.23.4
openpyxl: 3.0.10
pandas: 2.1.3
pyarrow: 16.1.0
pydantic: 2.6.3
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 1.4.51
torch: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
