Skip to content

direct polars scan is broken #754

@changhiskhan

Description

@changhiskhan

Polars scan_arrow works if you convert lance into pyarrow Table first.

but direct pl.scan_pyarrow_dataset does not work.

Symptom:

File ~/.venv/lance/lib/python3.10/site-packages/polars/io/pyarrow_dataset/anonymous_scan.py:36, in _scan_pyarrow_dataset(ds, allow_pyarrow_filter)
     19 """
     20 Pickle the partially applied function `_scan_pyarrow_dataset_impl`.
     21
   (...)
     33
     34 """
     35 func = partial(_scan_pyarrow_dataset_impl, ds)
---> 36 func_serialized = pickle.dumps(func)
     37 return pli.LazyFrame._scan_python_function(
     38     ds.schema, func_serialized, allow_pyarrow_filter
     39 )

File stringsource:2, in pyarrow._dataset.Dataset.__reduce_cython__()

TypeError: self.dataset,self.wrapped cannot be converted to a Python object for pickling

Cause:

  1. Polars pickles the dataset for lazy frame (https://github.com/pola-rs/polars/blob/master/py-polars/polars/io/pyarrow_dataset/anonymous_scan.py#L36)
  2. LanceDataset does not override pickle behavior from pyarrow dataset.
  3. The error is caused by pa.dataset.Dataset trying to pickle .wrapped which doesn't exist in LanceDataset

Proposed fix:

  1. LanceDataset should provide its own pickling behavior to avoid this error

Uncertainty:

  1. Does the internal pyo3 objects support pickling well?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions