BUG: Pyarrow engine doesn't seem to support \s+ but error message implies it does? #52554

Open
2 of 3 tasks
IgnacioJPickering opened this issue Apr 9, 2023 · 1 comment
Labels
Arrow pyarrow functionality Bug Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv

IgnacioJPickering commented Apr 9, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas
# Raw string so that "\s" is not treated as a Python escape sequence.
pandas.read_csv("", sep=r"\s+", engine="pyarrow")

Issue Description

This fails with the following error:

ValueError: the 'pyarrow' engine does not support regex separators
(separators > 1 char and different from '\s+' are interpreted as regex)

Expected Behavior

I'm not sure whether pyarrow is meant to support \s+. If it is, this call should not fail. If it is not, the error message should be changed: as written, it implies that \s+ is not interpreted as a regex, and therefore that pyarrow should accept it.

Update: I looked in the main branch and it seems that pyarrow does not support \s+, so changing the error message should be enough.
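
Until the message is fixed, a possible workaround is to read whitespace-separated data with one of the other engines. The snippet below is a minimal sketch, assuming the C and Python engines accept `\s+` as the error text suggests; the sample data is made up for illustration and is not part of the original report.

```python
import io

import pandas as pd

# Illustrative whitespace-separated data (an assumption, not from the report).
data = "a  b   c\n1  2   3\n4  5   6\n"

# Both the C and the Python engine are expected to accept r"\s+".
df_c = pd.read_csv(io.StringIO(data), sep=r"\s+", engine="c")
df_py = pd.read_csv(io.StringIO(data), sep=r"\s+", engine="python")

# Compare the two parses to confirm they agree.
print(df_c.equals(df_py))
```

If parsing speed matters, the C engine is probably the closer substitute for pyarrow here.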

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 478d340667831908b5b4bf09a2787a11a14560c9
python           : 3.11.0.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.15.0-69-generic
Version          : #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.0
numpy            : 1.24.2
pytz             : 2023.2
dateutil         : 2.8.2
setuptools       : 67.6.0
pip              : 23.0.1
Cython           : None
pytest           : 7.2.2
hypothesis       : None
sphinx           : 6.1.3
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.12.0
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : 
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 11.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : None
pyqt5            : None
@IgnacioJPickering IgnacioJPickering added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 9, 2023
@lithomas1 lithomas1 added Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 9, 2023
lithomas1 (Member) commented

Thanks for the report. This is probably due to erroneously sharing that line of code with the C parser.

If you're interested in making a PR, we can probably add a check for pyarrow inside the elif block here:

elif sep is not None and len(sep) > 1:

Otherwise, I'll take care of this in hopefully a week or so.
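
The suggested check could look roughly like the standalone sketch below. This is not pandas' actual code: the function name `separator_fallback_reason`, the `REGEX_ENGINES` tuple, and the pyarrow-specific wording are all hypothetical, and the branch structure only mirrors what the `elif` line and the error message imply (the C engine special-cases `\s+`, the Python engines accept regexes).

```python
from typing import Optional

# Engines assumed to accept regex separators (hypothetical constant).
REGEX_ENGINES = ("python", "python-fwf")


def separator_fallback_reason(sep: Optional[str], engine: str) -> Optional[str]:
    """Return an error message if `engine` cannot handle `sep`, else None."""
    if sep is None or len(sep) <= 1:
        return None  # this sketch only covers multi-character separators
    if engine == "c" and sep == r"\s+":
        return None  # assumed: the C parser treats r"\s+" as whitespace splitting
    if engine in REGEX_ENGINES:
        return None
    if engine == "pyarrow":
        # Hypothetical pyarrow-specific message with no '\s+' exception implied.
        return (
            "the 'pyarrow' engine does not support regex separators "
            "(separators > 1 char are interpreted as regex)"
        )
    # Shared wording, as quoted in the report, for the remaining engines.
    return (
        f"the '{engine}' engine does not support regex separators "
        r"(separators > 1 char and different from '\s+' are interpreted as regex)"
    )


print(separator_fallback_reason(r"\s+", "pyarrow"))
```

Keeping the pyarrow branch separate means the `\s+` exception is only mentioned for the engine where it actually applies.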
