BUG: "S" dtype.kind not supported by pandas.core.internals.concat._dtype_to_na_value. #53525

Open
3 tasks done
garciampred opened this issue Jun 5, 2023 · 1 comment
Labels: Bug, Needs Triage (Issue that has not been reviewed by a pandas team member)

garciampred commented Jun 5, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

# Column "a" holds fixed-width byte strings (dtype "|S5").
df = pd.DataFrame(
    dict(
        a=np.concatenate([np.repeat([b"Hello"], 200), np.repeat([b"bye"], 200)]),
        b=np.repeat([2.3223], 400),
        c=np.repeat([np.nan], 400),
    ),
    index=range(400),
    copy=False,
)
df.copy()

Issue Description

The issue looks simple: the "S" dtype.kind, used for byte-string (character) arrays, is not handled in pandas.core.internals.concat._dtype_to_na_value, so the function raises a NotImplementedError.
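
To make the failure mode concrete, here is a minimal sketch of a function that picks an NA value by branching on dtype.kind and falls through to NotImplementedError for any kind it has no branch for. The name na_value_for_kind and its branches are hypothetical and simplified; this is not the pandas source, only an illustration of the pattern described above, using NumPy alone.

```python
import numpy as np


def na_value_for_kind(dtype: np.dtype):
    """Hypothetical, simplified stand-in: pick an NA value from dtype.kind.

    This is NOT the pandas implementation. It only mirrors the pattern
    described in the issue, where a kind without a branch hits the final raise.
    """
    if dtype.kind in "mM":  # timedelta64 / datetime64
        return dtype.type("NaT")
    if dtype.kind == "f":  # floating point
        return dtype.type(np.nan)
    if dtype.kind == "O":  # Python objects
        return np.nan
    # There is no branch for "S" (byte strings), so such dtypes end up here.
    raise NotImplementedError(f"unhandled dtype kind: {dtype.kind!r}")


# Same construction as column "a" in the reproducible example.
a = np.concatenate([np.repeat([b"Hello"], 200), np.repeat([b"bye"], 200)])
print(a.dtype, a.dtype.kind)

try:
    na_value_for_kind(a.dtype)
except NotImplementedError as exc:
    print("raised:", exc)
```

Whether an "S" branch should exist, and which NA value it should return, is exactly the open question in this report.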

What I found very hard was writing the MCVE; I gave up after more than one hour. I don't know how to make the code go down that path. I wrote a DataFrame with a "|S5" dtype column long enough to require truncation when printed, but that is not enough. So please note that the MCVE I wrote does not actually reproduce the error.

I can reliably reproduce it with my data, even after saving it to HDF5 and reading it back, but it does not seem appropriate to upload it here.

Fixing this looks very easy, but I wonder if there was a reason for leaving "S" out of that function.

Also, note that I was not able to install the version from the main branch (my CPU got stuck at 100% usage during "Preparing metadata (pyproject.toml)"), but I checked and the function is unchanged there, so I think the bug is present on main too.

Regards

Expected Behavior

Print the data frame normally, without raising errors.

Installed Versions

INSTALLED VERSIONS

commit : 965ceca
python : 3.10.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-73-generic
Version : #80-Ubuntu SMP Mon May 15 15:18:26 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.2
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : None
pytest : 7.3.1
hypothesis : None
sphinx : 6.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli :
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : 2.0.12
tables : 3.8.0
tabulate : None
xarray : 2023.4.2
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

garciampred (Author) commented

Could someone please look at this? I can write a PR to handle this dtype in .internals.con_dtype_ but I need someone to confirm that this makes sense. Thanks.
