BUG: interchange categorical_column_to_series() should not accept only PandasColumn #49889
Comments
take |
Hello @AlenkaF , I am fairly new to open source contribution, and I am trying to take this task for a school project. I wanted to reproduce your error; however, I see that to do so, I must use your version of pyarrow. I tried a couple of things to pip install your pyarrow:
I always run into the error:
I tried looking into the pyarrow documentation for cpp building on OSX; however, I still could not figure out how to get past the error. I have a feeling it should not be this difficult to do; perhaps I am missing something? Please let me know :) Thanks, |
Hi @ronwho , thank you for working on this issue! You are correct. To be able to reproduce the error you will have to work from the branch in my Apache Arrow fork. But first you will need to build PyArrow from source following Python Development section in our documentation. You will need to build Arrow C++ and then PyArrow, I suggest on Apache Arrow master branch first, and then you can checkout my branch with dataframe interchange protocol code. Hope that makes sense. Feel free to ping me for questions in case you run into any difficulties! |
Hey @AlenkaF, Thanks for your response, it has really helped me! After carefully following the Python development guide I was successfully able to build and pip install the package in main. After checking out the ARROW-18152 branch I noticed there was a compile error while compiling Arrow C++, but I saw that you created a new branch "ARROW-18152-second." I tried building that and it built and pip installed successfully! However, for some reason when testing the code on python, I am getting an import error from pyarrow.
I also tested by building the main branch; however, I ran into the same error. I also tried git clean and removing the build folder and rebuilding. I think I must be doing something wrong. Have you ever encountered an error like this? Please let me know! Thanks so much again! |
Great, you found the new branch! 👍 Can you check if you built Arrow C++ with From the error I would think that the Arrow C++ dataset package is not installed. |
On second thought, if it built successfully then there should be some other reason for the error. Somehow it is searching for the Arrow C++ dataset module in the wrong place. There are a few things I would check/try:
or try building Arrow C++ with adding
|
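One quick way to see whether the dataset bindings are importable from a freshly built package is to try the import directly and print the underlying error. The snippet below is only a sketch: `check_import` is a hypothetical helper, not part of PyArrow, and it assumes the modules of interest are `pyarrow` and `pyarrow.dataset`.

```python
import importlib


def check_import(module_name):
    """Try to import a module and report the outcome.

    Hypothetical helper for illustration; returns (ok, detail).
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # ImportError covers both a missing Python module and a shared
        # library that the dynamic loader cannot locate.
        return False, f"{type(exc).__name__}: {exc}"
    # On success, report where the module was loaded from.
    return True, getattr(module, "__file__", None) or "<no file>"


if __name__ == "__main__":
    for name in ("pyarrow", "pyarrow.dataset"):
        ok, detail = check_import(name)
        print(f"{name}: {'ok' if ok else 'FAILED'} ({detail})")
```

If the second import fails while the first succeeds, the error text usually names the library that could not be found, which helps tell a missing component apart from a wrong search path.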
Hello Alenka, sorry for the late response, thanks for helping me detect the build problem I was facing. I was successfully able to build your version of pyarrow. Adding "-DARROW_INSTALL_NAME_RPATH=OFF" solved it. My project team and I tried to change the assert statement by changing from
This is because we should be checking However after building and running, it still seems that the column passed in the example code to showcase the edge case is a
Should the function be able to accept |
That is how I understand it also.
The issue here is that the parent class But in pandas, only for the categories, |
Hello Alenka, I made a PR for this issue by relaxing the assertion statement from |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I am currently working on implementing a dataframe interchange protocol for pyarrow.Table in Apache Arrow project (apache/arrow#14613). I am using pandas implementation to test that the produced __dataframe__ object can be correctly consumed.
When consuming a pyarrow.Table with categorical column I get an error from pandas that the categories must be a PandasColumn and not a general __dataframe__ column defined by the interchange protocol. There is a check on line 185 for PandasColumn instance:
pandas/pandas/core/interchange/from_dataframe.py
Lines 164 to 185 in 70121c7
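To make the failure mode concrete, here is a minimal sketch that does not use pandas or pyarrow at all. Every name in it (`Column`, `PandasLikeColumn`, `ArrowLikeColumn`, `values`, `strict_categories`, `relaxed_categories`) is a hypothetical stand-in invented for illustration, not the real pandas code. It contrasts a check against one concrete class with a check against the shared abstract column.

```python
from abc import ABC, abstractmethod


class Column(ABC):
    """Hypothetical stand-in for the protocol's abstract column."""

    @abstractmethod
    def values(self):
        """Return the column's values as a list (invented for this sketch)."""


class PandasLikeColumn(Column):
    def __init__(self, data):
        self._col = list(data)  # private storage of one implementation

    def values(self):
        return list(self._col)


class ArrowLikeColumn(Column):
    def __init__(self, data):
        self._chunks = [list(data)]  # a different private layout

    def values(self):
        return [v for chunk in self._chunks for v in chunk]


def strict_categories(column):
    # Accepts only one concrete class and reads its private attribute,
    # so a column produced by another library is rejected.
    assert isinstance(column, PandasLikeColumn), "categories must be a PandasLikeColumn"
    return list(column._col)


def relaxed_categories(column):
    # Accepts any implementation of the abstract column and relies only
    # on the shared public method.
    if not isinstance(column, Column):
        raise TypeError("categories must implement Column")
    return column.values()
```

With these definitions, `strict_categories(ArrowLikeColumn(["a", "b"]))` raises an `AssertionError`, while `relaxed_categories` returns the categories for both implementations. How the real function should read the category values from a foreign column is a separate question that this sketch leaves open.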
Expected Behavior
categorical_column_to_series() function should accept a general dataframe_protocol column for the categories in the categorical column.
Installed Versions
INSTALLED VERSIONS
commit : 87cfe4e
python : 3.9.14.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:46 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.5.0
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.28
pytest : 7.1.3
hypothesis : 6.39.4
sphinx : 4.3.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.02.0
gcsfs : 2022.02.0
matplotlib : 3.6.2
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0.dev117+geeca8a4e3.d20221122
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : None
xlwt : None
zstandard : None
tzdata : None