BUG: interchange categorical_column_to_series() should not accept only PandasColumn #49889

Closed
3 tasks done
AlenkaF opened this issue Nov 24, 2022 · 10 comments · Fixed by #52763
Labels: Bug, Interchange (Dataframe Interchange Protocol)

AlenkaF commented Nov 24, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pyarrow as pa
import pandas as pd

arr = ["Mon", "Tue", "Mon", "Wed", "Mon", "Thu", "Fri", "Sat", "Sun"]
table = pa.table(
    {"weekday": pa.array(arr).dictionary_encode()}
)
exchange_df = table.__dataframe__()

from pandas.core.interchange.from_dataframe import from_dataframe
from_dataframe(exchange_df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 53, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 74, in _from_dataframe
    pandas_df = protocol_df_chunk_to_pandas(chunk)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 124, in protocol_df_chunk_to_pandas
    columns[name], buf = categorical_column_to_series(col)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 185, in categorical_column_to_series
    assert isinstance(cat_column, PandasColumn), "categories must be a PandasColumn"
AssertionError: categories must be a PandasColumn

Issue Description

I am currently working on implementing a dataframe interchange protocol for pyarrow.Table in Apache Arrow project (apache/arrow#14613).

I am using pandas implementation to test that the produced __dataframe__ object can be correctly consumed.

When consuming a pyarrow.Table with categorical column I get an error from pandas that the categories must be a PandasColumn and not a general __dataframe__ column defined by the interchange protocol. There is a check on line 185 for PandasColumn instance:

def categorical_column_to_series(col: Column) -> tuple[pd.Series, Any]:
    """
    Convert a column holding categorical data to a pandas Series.

    Parameters
    ----------
    col : Column

    Returns
    -------
    tuple
        Tuple of pd.Series holding the data and the memory owner object
        that keeps the memory alive.
    """
    categorical = col.describe_categorical
    if not categorical["is_dictionary"]:
        raise NotImplementedError("Non-dictionary categoricals not supported yet")
    cat_column = categorical["categories"]
    # for mypy/pyright
    assert isinstance(cat_column, PandasColumn), "categories must be a PandasColumn"
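
For illustration, continuing the reproducible example above, the categories come back through the protocol as a pyarrow interchange column rather than a PandasColumn (a rough sketch assuming the interchange object exposes get_column_by_name as the protocol specifies; the printed values are only indicative):

col = exchange_df.get_column_by_name("weekday")
categorical = col.describe_categorical
# The protocol describes the column as dictionary-encoded ...
print(categorical["is_dictionary"])      # True
# ... and hands back the categories as a column of the producing library.
print(type(categorical["categories"]))   # e.g. pyarrow.interchange.column._PyArrowColumn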

Expected Behavior

categorical_column_to_series() function should accept a general dataframe_protocol column for the categories in the categorical column.

Installed Versions

INSTALLED VERSIONS

commit : 87cfe4e
python : 3.9.14.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:46 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.5.0
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.28
pytest : 7.1.3
hypothesis : 6.39.4
sphinx : 4.3.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.02.0
gcsfs : 2022.02.0
matplotlib : 3.6.2
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0.dev117+geeca8a4e3.d20221122
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : None
xlwt : None
zstandard : None
tzdata : None

AlenkaF added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Nov 24, 2022

AlenkaF commented Nov 24, 2022

cc @jorisvandenbossche @mroeschke

jorisvandenbossche added the Interchange (Dataframe Interchange Protocol) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Nov 24, 2022

ronwho commented Nov 27, 2022

take


ronwho commented Nov 30, 2022

Hello @AlenkaF ,

I am fairly new to open source contribution, and I am trying to take this task for a school project.

I wanted to reproduce your error; however, I see that to do so, I must use your version of pyarrow.

I tried a couple of things to pip install your pyarrow:

  1. pip3 install git+https://github.com/AlenkaF/arrow.git@ARROW-18152#subdirectory=python
  2. I also tried cloning your version and then running pip install . in the python directory.

I always run into the error:

"Could not find a package configuration file provided by "Arrow" with any of
        the following names:
          ArrowConfig.cmake
          arrow-config.cmake"

I tried looking into the pyarrow documentation for building the C++ libraries on macOS; however, I still could not figure out how to get past the error.

I have a feeling it should not be this difficult to do; perhaps I am missing something? Please let me know :)

Thanks,
Ron


AlenkaF commented Dec 1, 2022

Hi @ronwho ,

thank you for working on this issue!

You are correct. To be able to reproduce the error you will have to work from the branch in my Apache Arrow fork.
I will send you a link to a new branch I will create today (need to change)!

But first you will need to build PyArrow from source, following the Python Development section in our documentation.

You will need to build Arrow C++ and then PyArrow. I suggest doing this on the Apache Arrow master branch first; then you can check out my branch with the dataframe interchange protocol code.

Hope that makes sense. Feel free to ping me for questions in case you run into any difficulties!


ronwho commented Dec 1, 2022

Hey @AlenkaF,

Thanks for your response, it has really helped me!

After carefully following the Python development guide I was able to successfully build and pip install the package on main. After checking out the ARROW-18152 branch I noticed a compile error while compiling Arrow C++, but I saw that you created a new branch, "ARROW-18152-second". I tried building that, and it built and pip-installed successfully!

However, for some reason, when testing the code in Python, I am getting an import error from pyarrow.

File "/Users/ron/Projects/open-source/testing-pandas/test-case.py", line 1, in <module>
    import pyarrow as pa
  File "/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/__init__.py", line 65, in <module>
    import pyarrow.lib as _lib

ImportError: dlopen(/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-darwin.so, 0x0002): Library not loaded: '@rpath/libarrow_dataset.1000.dylib'

  Referenced from: '/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-darwin.so'

  Reason: tried: '/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/libarrow_dataset.1000.dylib' (no such file), '/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/libarrow_dataset.1000.dylib' (no such file), '/opt/homebrew/lib/libarrow_dataset.1000.dylib' (no such file), '/opt/homebrew/lib/libarrow_dataset.1000.dylib' (no such file), '/usr/local/lib/libarrow_dataset.1000.dylib' (no such file), '/usr/lib/libarrow_dataset.1000.dylib' (no such file)

I also tested building the main branch; however, I ran into the same error. I also tried git clean, removing the build folder, and rebuilding. I think I must be doing something wrong. Have you ever encountered an error like this? Please let me know!

Thanks so much again!


AlenkaF commented Dec 1, 2022

Great, you found the new branch! 👍

Can you check if you built Arrow C++ with -DARROW_DATASET=ON? Can you also check if you can find libarrow_dataset.* in the folder where the libraries are installed (/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/)?

From the error I would think that the Arrow C++ dataset package is not installed.
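
For example, something along these lines (just a quick sketch; the path is only the one from your traceback) should show whether the dataset library is there at all:

# Quick check: list the Arrow shared libraries bundled with the installed
# pyarrow package. The path below is just an example taken from the traceback.
from pathlib import Path

pkg_dir = Path(
    "/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow"
)
print(sorted(p.name for p in pkg_dir.glob("libarrow*")))
# If no libarrow_dataset.*.dylib shows up here, Arrow C++ was most likely
# built without -DARROW_DATASET=ON.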


AlenkaF commented Dec 2, 2022

On second thought, if it built successfully then there should be some other reason for the error. Somehow it is searching for the Arrow C++ dataset module in the wrong place. There are a few things I would check/try:

  • Does the import of PyArrow work in the Python console if opened from arrow/python?
  • Run the tests with python -m pytest ...
  • If you are on macOS, try setting this before building PyArrow:

    export R_LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH

    or try building Arrow C++ with the added flag

    -DARROW_INSTALL_NAME_RPATH=OFF

  • You could try to build PyArrow with pip install -e . --no-build-isolation or maybe with build_ext:

    pushd arrow/python
    export PYARROW_WITH_PARQUET=1
    export PYARROW_WITH_DATASET=1
    export PYARROW_PARALLEL=4
    python setup.py build_ext --inplace
    popd


ronwho commented Dec 5, 2022

Hello Alenka, sorry for the late response, and thanks for helping me track down the build problem I was facing. I was able to successfully build your version of pyarrow. Adding "-DARROW_INSTALL_NAME_RPATH=OFF" solved it.

My project team and I tried to change the assert statement from PandasColumn to Column, like so:

print(cat_column)
print(type(cat_column))
# for mypy/pyright
assert isinstance(cat_column, Column), "categories must abide by __dataframe__ protocol API"

This is because we should be checking Column instead of PandasColumn, since we believe Column should be the general dataframe_protocol column type. Please correct us if we are mistaken.

However, after building and running, it still seems that the column passed in the example code showcasing the edge case is a _PyArrowColumn type. This still gives us an assertion error. We printed cat_column and its type above to check. Here is the output:

<pyarrow.interchange.column._PyArrowColumn object at 0x124663400>
<class 'pyarrow.interchange.column._PyArrowColumn'>
Traceback (most recent call last):
  File "/Users/ron/Projects/open-source/testing-pandas/test-case.py", line 13, in <module>
    from_dataframe(exchange_df)
  File "/Users/ron/Projects/open-source/pandas-ronwho/pandas/core/interchange/from_dataframe.py", line 53, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/ron/Projects/open-source/pandas-ronwho/pandas/core/interchange/from_dataframe.py", line 74, in _from_dataframe
    pandas_df = protocol_df_chunk_to_pandas(chunk)
  File "/Users/ron/Projects/open-source/pandas-ronwho/pandas/core/interchange/from_dataframe.py", line 124, in protocol_df_chunk_to_pandas
    columns[name], buf = categorical_column_to_series(col)
  File "/Users/ron/Projects/open-source/pandas-ronwho/pandas/core/interchange/from_dataframe.py", line 187, in categorical_column_to_series
    assert isinstance(cat_column, Column), "categories must abide by __dataframe__ protocol API"
AssertionError: categories must abide by __dataframe__ protocol API

Should the function be able to accept the _PyArrowColumn type as well?


AlenkaF commented Dec 5, 2022

This is because we should be checking Column instead of PandasColumn, since we believe Column should be the general dataframe_protocol column type. Please correct us if we are mistaken.

That is how I understand it also.

However after building and running, it still seems that the column passed in the example code to showcase the edge case is a PyArrowColumn type. This still gives us an assertion error. We printed the cat_column and type of cat_column above to check.

The issue here is that the parent class Column (and likewise Buffer and DataFrame) is defined by the data-apis consortium and I haven't implemented it in PyArrow. For now I am defining ColumnObject as Any, see:

https://github.com/apache/arrow/blob/3b27cf2e2546efefac224748de16b44fa326689a/python/pyarrow/interchange/from_dataframe.py#L38-L42

But in pandas, PandasColumn seems to be checked only for the categories, and I am not sure why. The check against the Column parent class would currently also be a bit strict, as most of the implementations are not using it; see also:
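
One possible direction (just a sketch, not what pandas does today) would be to duck-type the check, accepting anything that looks like a protocol column instead of one concrete implementation:

# Hypothetical helper, not part of pandas: check for a few attributes every
# interchange-protocol column is required to provide.
def _is_protocol_column(obj) -> bool:
    return all(hasattr(obj, attr) for attr in ("dtype", "describe_null", "get_buffers"))

# Hypothetical use inside categorical_column_to_series:
# if not _is_protocol_column(cat_column):
#     raise TypeError("categories must implement the __dataframe__ column API")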


ronwho commented Dec 7, 2022

Hello Alenka, I made a PR for this issue that relaxes the assertion from PandasColumn to Column, so that once you have implemented the parent class you will be able to use this function with no issues. Hope this helps!

AlenkaF changed the title from "BUG: categorical_column_to_series() should not accept only PandasColumn" to "BUG: interchange categorical_column_to_series() should not accept only PandasColumn" Jan 16, 2023