BUG: pd.read_csv error when parse_dates used on index_col for dtype_backend="pyarrow" #57930

MCRE-BE · 2024-03-20T14:26:22Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [datetime.datetime.now()]})
df.to_csv('test.csv', index=None)


a = pd.read_csv(
    'test.csv',
    parse_dates=['b'],
    index_col="b",
    dtype_backend="pyarrow",
    engine="pyarrow",
)

pd.read_csv(
    'test.csv',
    parse_dates=['b'],
    index_col="b",
)

The following fails:

pd.read_csv(
    'test.csv',
    parse_dates=['b'],
    index_col="b",
    dtype_backend="pyarrow",
)

Issue Description

returns
ValueError: not all elements from date_cols are numpy arrays

ValueError Traceback (most recent call last)
Cell In[78], line 1
----> 1 pd.read_csv(
2 'test.csv',
3 parse_dates=['b'],
4 index_col="b",
5 dtype_backend="pyarrow",
6 )

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\readers.py:948, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
935 kwds_defaults = _refine_defaults_read(
936 dialect,
937 delimiter,
(...)
944 dtype_backend=dtype_backend,
945 )
946 kwds.update(kwds_defaults)
--> 948 return _read(filepath_or_buffer, kwds)

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\readers.py:617, in _read(filepath_or_buffer, kwds)
614 return parser
616 with parser:
--> 617 return parser.read(nrows)

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\readers.py:1748, in TextFileReader.read(self, nrows)
1741 nrows = validate_integer("nrows", nrows)
1742 try:
1743 # error: "ParserBase" has no attribute "read"
1744 (
1745 index,
1746 columns,
1747 col_dict,
-> 1748 ) = self._engine.read( # type: ignore[attr-defined]
1749 nrows
1750 )
1751 except Exception:
1752 self.close()

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\c_parser_wrapper.py:333, in CParserWrapper.read(self, nrows)
330 data = {k: v for k, (i, v) in zip(names, data_tups)}
332 names, date_data = self._do_date_conversions(names, data)
--> 333 index, column_names = self._make_index(date_data, alldata, names)
335 return index, column_names, date_data

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\base_parser.py:371, in ParserBase._make_index(self, data, alldata, columns, indexnamerow)
369 elif not self._has_complex_date_col:
370 simple_index = self._get_simple_index(alldata, columns)
--> 371 index = self._agg_index(simple_index)
372 elif self._has_complex_date_col:
373 if not self._name_processed:

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\base_parser.py:468, in ParserBase._agg_index(self, index, try_parse_dates)
466 for i, arr in enumerate(index):
467 if try_parse_dates and self._should_parse_dates(i):
--> 468 arr = self._date_conv(
469 arr,
470 col=self.index_names[i] if self.index_names is not None else None,
471 )
473 if self.na_filter:
474 col_na_values = self.na_values

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\base_parser.py:1142, in _make_date_converter.<locals>.converter(col, *date_cols)
1139 return date_cols[0]
1141 if date_parser is lib.no_default:
-> 1142 strs = parsing.concat_date_cols(date_cols)
1143 date_fmt = (
1144 date_format.get(col) if isinstance(date_format, dict) else date_format
1145 )
1147 with warnings.catch_warnings():

File parsing.pyx:1155, in pandas._libs.tslibs.parsing.concat_date_cols()
ValueError: not all elements from date_cols are numpy arrays

Expected Behavior

As with dtype_backend="numpy_nullable", the index column should be parsed correctly.

It works with the other set-ups shown in the reproducible example (for instance with engine="pyarrow"), so this is likely an issue with the C parser (?)

Installed Versions

INSTALLED VERSIONS

commit : a671b5a
python : 3.11.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Dutch_Belgium.1252

pandas : 2.1.4
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.2.0
pip : 24.0
Cython : None
pytest : 8.1.1
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
bs4 : 4.12.3
bottleneck : 1.3.8
dataframe-api-compat: None
fastparquet : None
fsspec : 2024.3.0
gcsfs : None
matplotlib : 3.8.3
numba : 0.59.0
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.12.0
sqlalchemy : 2.0.28
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None


lithomas1 · 2024-03-20T22:47:53Z

Hm, this works on main for me.

Can you try a newer version of pandas (e.g. 2.2.1)?

pmhatre1 · 2024-03-20T23:45:12Z

@lithomas1 I could reproduce this error on pandas version 2.3.0

pmhatre1 · 2024-03-20T23:45:47Z

@lithomas1 Do you need any help triaging other defects/bugs?

lithomas1 · 2024-03-21T00:34:49Z

My bad, forgot to run the second command.

Nice catch.

Looks like parsing.concat_date_cols (which concatenates date_cols) is getting passed an ArrowExtensionArray.

That behavior is deprecated (to be removed in 3.0), so it should be easy to make this work for 3.0.

MCRE-BE · 2024-03-21T09:41:03Z

> Can you try a newer version of pandas (e.g. 2.2.1)?

My bad. I should have checked rather than assumed that the kernel I was using had the latest version. Thanks for the work.

MCRE-BE added the Bug and Needs Triage labels Mar 20, 2024

lithomas1 added the IO CSV, Arrow, and Closing Candidate labels and removed the Needs Triage label Mar 20, 2024

lithomas1 removed the Closing Candidate label Mar 21, 2024

lithomas1 self-assigned this Mar 21, 2024
