Description
Encountered this while reviewing this PR on the scikit-learn side, xref: scikit-learn/scikit-learn#28804 (comment)
Basically, if the environment doesn't have pyarrow, conversion from pandas seems to require pyarrow even though the pandas.DataFrame isn't using pyarrow.
Minimal reproducible example:
python -m venv /tmp/.venv
source /tmp/.venv/bin/activate
pip install pandas polars
python
>>> import pandas as pd
>>> import polars as pl
>>> pl.DataFrame(pd.DataFrame(['a', 'b']))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 406, in __init__
    self._df = pandas_to_pydf(
               ^^^^^^^^^^^^^^^
  File "/tmp/.venv/lib/python3.11/site-packages/polars/_utils/construction/dataframe.py", line 1032, in pandas_to_pydf
    arrow_dict[str(col)] = plc.pandas_series_to_arrow(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/.venv/lib/python3.11/site-packages/polars/_utils/construction/other.py", line 92, in pandas_series_to_arrow
    return pa.array(values, pa.large_utf8(), from_pandas=nan_to_null)
           ^^^^^^^^
  File "/tmp/.venv/lib/python3.11/site-packages/polars/dependencies.py", line 97, in __getattr__
    raise ModuleNotFoundError(msg) from None
ModuleNotFoundError: pa.array requires 'pyarrow' module to be installed
>>> pd.DataFrame(pl.DataFrame(['a', 'b']))
   0
0  a
1  b
Note that in the above example the other way around (conversion from polars to pandas) works fine.
The PR on the scikit-learn side introduced this line:
co2_data = pl.DataFrame({col: co2.frame[col].to_numpy() for col in co2.frame.columns})
which seems very odd: the data has to go through numpy before it reaches polars. Also, if the above line is correct, polars could do almost the same thing internally and not require pyarrow for the conversion.
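For illustration, here is a rough sketch of the fallback described above, written as a user-side helper. The names has_module and pandas_to_polars are hypothetical and not part of polars or pandas. It assumes the per-column to_numpy() route from the scikit-learn line works for the column dtypes involved, which I have only seen reported for that PR's data and have not checked for string/object columns. Column names go through str(), mirroring the str(col) visible in the traceback.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def pandas_to_polars(pdf):
    """Convert a pandas DataFrame to a polars DataFrame.

    Hypothetical helper (assumption, not a polars API): uses the regular
    constructor when pyarrow is installed, otherwise passes one NumPy array
    per column, like the line from the scikit-learn PR.
    """
    import polars as pl  # imported lazily so has_module() is usable on its own

    if has_module("pyarrow"):
        # Regular path: let polars handle the pandas input itself.
        return pl.DataFrame(pdf)

    # Fallback path: go through NumPy column by column.
    # Assumption: every column's dtype is one polars can build from a NumPy array.
    return pl.DataFrame({str(col): pdf[col].to_numpy() for col in pdf.columns})
```

Usage would be pandas_to_polars(co2.frame) in place of the dict comprehension. This is only a workaround on the caller's side; whether polars should do something similar internally is the question raised by this issue.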