BUG: interchange categorical_column_to_series() should not accept only PandasColumn #49889

Closed
3 tasks done
AlenkaF opened this issue Nov 24, 2022 · 10 comments · Fixed by #52763
Labels: Bug, Interchange (Dataframe Interchange Protocol)

AlenkaF commented Nov 24, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pyarrow as pa
import pandas as pd

arr = ["Mon", "Tue", "Mon", "Wed", "Mon", "Thu", "Fri", "Sat", "Sun"]
table = pa.table(
    {"weekday": pa.array(arr).dictionary_encode()}
)
exchange_df = table.__dataframe__()

from pandas.core.interchange.from_dataframe import from_dataframe
from_dataframe(exchange_df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 53, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 74, in _from_dataframe
    pandas_df = protocol_df_chunk_to_pandas(chunk)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 124, in protocol_df_chunk_to_pandas
    columns[name], buf = categorical_column_to_series(col)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 185, in categorical_column_to_series
    assert isinstance(cat_column, PandasColumn), "categories must be a PandasColumn"
AssertionError: categories must be a PandasColumn

Issue Description

I am currently working on implementing a dataframe interchange protocol for pyarrow.Table in Apache Arrow project (apache/arrow#14613).

I am using pandas implementation to test that the produced __dataframe__ object can be correctly consumed.

When consuming a pyarrow.Table with categorical column I get an error from pandas that the categories must be a PandasColumn and not a general __dataframe__ column defined by the interchange protocol. There is a check on line 185 for PandasColumn instance:

def categorical_column_to_series(col: Column) -> tuple[pd.Series, Any]:
    """
    Convert a column holding categorical data to a pandas Series.

    Parameters
    ----------
    col : Column

    Returns
    -------
    tuple
        Tuple of pd.Series holding the data and the memory owner object
        that keeps the memory alive.
    """
    categorical = col.describe_categorical
    if not categorical["is_dictionary"]:
        raise NotImplementedError("Non-dictionary categoricals not supported yet")
    cat_column = categorical["categories"]
    # for mypy/pyright
    assert isinstance(cat_column, PandasColumn), "categories must be a PandasColumn"
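
For illustration, continuing the reproducible example above, the categories come back through the protocol as a pyarrow interchange column rather than a PandasColumn (a rough sketch assuming the interchange object exposes get_column_by_name as the protocol specifies; the printed values are only indicative):

col = exchange_df.get_column_by_name("weekday")
categorical = col.describe_categorical
# The protocol describes the column as dictionary-encoded ...
print(categorical["is_dictionary"])      # True
# ... and hands back the categories as a column of the producing library.
print(type(categorical["categories"]))   # e.g. pyarrow.interchange.column._PyArrowColumn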

Expected Behavior

categorical_column_to_series() function should accept a general dataframe_protocol column for the categories in the categorical column.

Installed Versions

INSTALLED VERSIONS

commit : 87cfe4e
python : 3.9.14.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:46 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.5.0
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.28
pytest : 7.1.3
hypothesis : 6.39.4
sphinx : 4.3.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.02.0
gcsfs : 2022.02.0
matplotlib : 3.6.2
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0.dev117+geeca8a4e3.d20221122
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : None
xlwt : None
zstandard : None
tzdata : None

AlenkaF added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Nov 24, 2022

AlenkaF commented Nov 24, 2022

cc @jorisvandenbossche @mroeschke

jorisvandenbossche added the Interchange (Dataframe Interchange Protocol) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Nov 24, 2022

ronwho commented Nov 27, 2022

take


ronwho commented Nov 30, 2022

Hello @AlenkaF ,

I am fairly new to open source contribution, and I am trying to take this task for a school project.

I wanted to reproduce your error; however, I see that to do so, I must use your version of pyarrow.

I tried a couple of things to pip install your pyarrow:

  1. pip3 install git+https://github.com/AlenkaF/arrow.git@ARROW-18152#subdirectory=python
  2. I also tried cloning your version and then running pip install . in the python directory.

I always run into the error:

"Could not find a package configuration file provided by "Arrow" with any of
        the following names:
          ArrowConfig.cmake
          arrow-config.cmake"

I tried looking into the pyarrow documentation for building the C++ libraries on macOS; however, I still could not figure out how to get past the error.

I have a feeling it should not be this difficult to do; perhaps I am missing something? Please let me know :)

Thanks,
Ron


AlenkaF commented Dec 1, 2022

Hi @ronwho ,

thank you for working on this issue!

You are correct. To be able to reproduce the error you will have to work from the branch in my Apache Arrow fork.
I will send you a link to a new branch I will create today (need to change)!

But first you will need to build PyArrow from source, following the Python Development section in our documentation.

You will need to build Arrow C++ and then PyArrow. I suggest doing this on the Apache Arrow master branch first; then you can check out my branch with the dataframe interchange protocol code.

Hope that makes sense. Feel free to ping me for questions in case you run into any difficulties!


ronwho commented Dec 1, 2022

Hey @AlenkaF,

Thanks for your response, it has really helped me!

After carefully following the Python development guide I was able to successfully build and pip install the package on main. After checking out the ARROW-18152 branch I noticed a compile error while compiling Arrow C++, but I saw that you created a new branch, "ARROW-18152-second". I tried building that, and it built and pip-installed successfully!

However, for some reason, when testing the code in Python, I am getting an import error from pyarrow.

File "/Users/ron/Projects/open-source/testing-pandas/test-case.py", line 1, in <module>
    import pyarrow as pa
  File "/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/__init__.py", line 65, in <module>
    import pyarrow.lib as _lib

ImportError: dlopen(/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-darwin.so, 0x0002): Library not loaded: '@rpath/libarrow_dataset.1000.dylib'

  Referenced from: '/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-darwin.so'

  Reason: tried: '/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/libarrow_dataset.1000.dylib' (no such file), '/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/libarrow_dataset.1000.dylib' (no such file), '/opt/homebrew/lib/libarrow_dataset.1000.dylib' (no such file), '/opt/homebrew/lib/libarrow_dataset.1000.dylib' (no such file), '/usr/local/lib/libarrow_dataset.1000.dylib' (no such file), '/usr/lib/libarrow_dataset.1000.dylib' (no such file)

I also tested building the main branch; however, I ran into the same error. I also tried git clean, removing the build folder, and rebuilding. I think I must be doing something wrong. Have you ever encountered an error like this? Please let me know!

Thanks so much again!


AlenkaF commented Dec 1, 2022

Great, you found the new branch! 👍

Can you check if you built Arrow C++ with -DARROW_DATASET=ON? Can you also check if you can find libarrow_dataset.* in the folder where the libraries are installed (/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/)?

From the error I would think that the Arrow C++ dataset package is not installed.
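
For example, something along these lines (just a quick sketch; the path is only the one from your traceback) should show whether the dataset library is there at all:

# Quick check: list the Arrow shared libraries bundled with the installed
# pyarrow package. The path below is just an example taken from the traceback.
from pathlib import Path

pkg_dir = Path(
    "/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow"
)
print(sorted(p.name for p in pkg_dir.glob("libarrow*")))
# If no libarrow_dataset.*.dylib shows up here, Arrow C++ was most likely
# built without -DARROW_DATASET=ON.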


AlenkaF commented Dec 2, 2022

On second thought, if it built successfully then there should be some other reason for the error. Somehow it is searching for the Arrow C++ dataset module in the wrong place. There are a few things I would check/try:

  • Does the import of PyArrow work in the Python console if opened from arrow/python?
  • Run the tests with python -m pytest ...
  • If you are on macOS, try setting this before building PyArrow:

    export R_LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH

    or try building Arrow C++ with the added flag

    -DARROW_INSTALL_NAME_RPATH=OFF

  • You could try to build PyArrow with pip install -e . --no-build-isolation or maybe with build_ext:

    pushd arrow/python
    export PYARROW_WITH_PARQUET=1
    export PYARROW_WITH_DATASET=1
    export PYARROW_PARALLEL=4
    python setup.py build_ext --inplace
    popd


ronwho commented Dec 5, 2022

Hello Alenka, sorry for the late response, and thanks for helping me track down the build problem I was facing. I was able to successfully build your version of pyarrow. Adding "-DARROW_INSTALL_NAME_RPATH=OFF" solved it.

My project team and I tried to change the assert statement from PandasColumn to Column, like so:

print(cat_column)
print(type(cat_column))
# for mypy/pyright
assert isinstance(cat_column, Column), "categories must abide by __dataframe__ protocol API"

This is because we should be checking Column instead of PandasColumn, since we believe Column should be the general dataframe_protocol column type. Please correct us if we are mistaken.

However, after building and running, it still seems that the column passed in the example code showcasing the edge case is a _PyArrowColumn type. This still gives us an assertion error. We printed cat_column and its type above to check. Here is the output:

<pyarrow.interchange.column._PyArrowColumn object at 0x124663400>
<class 'pyarrow.interchange.column._PyArrowColumn'>
Traceback (most recent call last):
  File "/Users/ron/Projects/open-source/testing-pandas/test-case.py", line 13, in <module>
    from_dataframe(exchange_df)
  File "/Users/ron/Projects/open-source/pandas-ronwho/pandas/core/interchange/from_dataframe.py", line 53, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/ron/Projects/open-source/pandas-ronwho/pandas/core/interchange/from_dataframe.py", line 74, in _from_dataframe
    pandas_df = protocol_df_chunk_to_pandas(chunk)
  File "/Users/ron/Projects/open-source/pandas-ronwho/pandas/core/interchange/from_dataframe.py", line 124, in protocol_df_chunk_to_pandas
    columns[name], buf = categorical_column_to_series(col)
  File "/Users/ron/Projects/open-source/pandas-ronwho/pandas/core/interchange/from_dataframe.py", line 187, in categorical_column_to_series
    assert isinstance(cat_column, Column), "categories must abide by __dataframe__ protocol API"
AssertionError: categories must abide by __dataframe__ protocol API

Should the function be able to accept the _PyArrowColumn type as well?


AlenkaF commented Dec 5, 2022

This is because we should be checking Column instead of PandasColumn, since we believe Column should be the general dataframe_protocol column type. Please correct us if we are mistaken.

That is how I understand it also.

However after building and running, it still seems that the column passed in the example code to showcase the edge case is a PyArrowColumn type. This still gives us an assertion error. We printed the cat_column and type of cat_column above to check.

The issue here is that the parent class Column (and likewise Buffer and DataFrame) is defined by the data-apis consortium and I haven't implemented it in PyArrow. For now I am defining ColumnObject as Any, see:

https://github.com/apache/arrow/blob/3b27cf2e2546efefac224748de16b44fa326689a/python/pyarrow/interchange/from_dataframe.py#L38-L42

But in pandas, PandasColumn seems to be checked only for the categories, and I am not sure why. The check against the Column parent class would currently also be a bit strict, as most of the implementations are not using it; see also:
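
One possible direction (just a sketch, not what pandas does today) would be to duck-type the check, accepting anything that looks like a protocol column instead of one concrete implementation:

# Hypothetical helper, not part of pandas: check for a few attributes every
# interchange-protocol column is required to provide.
def _is_protocol_column(obj) -> bool:
    return all(hasattr(obj, attr) for attr in ("dtype", "describe_null", "get_buffers"))

# Hypothetical use inside categorical_column_to_series:
# if not _is_protocol_column(cat_column):
#     raise TypeError("categories must implement the __dataframe__ column API")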


ronwho commented Dec 7, 2022

Hello Alenka, I made a PR for this issue that relaxes the assertion from PandasColumn to Column, so that once you have implemented the parent class you will be able to use this function with no issues. Hope this helps!

AlenkaF changed the title from "BUG: categorical_column_to_series() should not accept only PandasColumn" to "BUG: interchange categorical_column_to_series() should not accept only PandasColumn" Jan 16, 2023