
BUG: read_csv with engine pyarrow parsing multiple date columns #50056

Merged (14 commits) on May 18, 2023
doc/source/whatsnew/v2.1.0.rst (1 addition, 0 deletions)
@@ -354,6 +354,7 @@ I/O
^^^
- :meth:`DataFrame.to_orc` now raising ``ValueError`` when non-default :class:`Index` is given (:issue:`51828`)
- :meth:`DataFrame.to_sql` now raising ``ValueError`` when the name param is left empty while using SQLAlchemy to connect (:issue:`52675`)
- Bug in :func:`read_csv` where it would error when ``parse_dates`` was set to a list or dictionary with ``engine="pyarrow"`` (:issue:`47961`)
- Bug in :func:`read_hdf` not properly closing store after a ``IndexError`` is raised (:issue:`52781`)
- Bug in :func:`read_html`, style elements were read into DataFrames (:issue:`52197`)
- Bug in :func:`read_html`, tail texts were removed together with elements containing ``display:none`` style (:issue:`51629`)
pandas/io/parsers/arrow_parser_wrapper.py (17 additions, 1 deletion)
@@ -61,6 +61,21 @@ def _get_pyarrow_options(self) -> None:
            if pandas_name in self.kwds and self.kwds.get(pandas_name) is not None:
                self.kwds[pyarrow_name] = self.kwds.pop(pandas_name)

        # Date format handling.
        # If we get a string, we need to convert it into a list for pyarrow.
        # If we get a dict, we want to parse those columns separately.
        date_format = self.date_format
        if isinstance(date_format, str):
            date_format = [date_format]
        else:
            # In the dict case, we don't want to propagate it through,
            # so just set it to the pyarrow default of None.

            # Ideally, in the future we disable pyarrow dtype inference
            # (read in as string)
Member: We have to create a conversion option that is Arrow-only for this; otherwise we incur a big performance penalty.

Member Author: Since there's no way to disable the parsing, it'll only get parsed once, as a pyarrow timestamp/date.

I think in your other PR you set it so that date parsing is bypassed for Arrow timestamp columns, so we won't double-parse the input.

So there won't be a perf penalty, just a wrong result if you didn't want pyarrow to parse the date.

Member: Maybe I'm misunderstanding this, but we have to convert to NumPy to parse with to_datetime? That's what I meant by slow.

What happens if dtype_backend is set to pyarrow in this case?

Member Author: OK, I went back and checked the output, and it looks right, except for the one case where parse_dates is a dict that maps a column to itself (a no-op). That's a very uncommon case, though (I'm not sure why you would want to do that instead of passing a list).

Do you think it'd be better to fix the root cause (#52545) than to special-case here? (I can try to take a look at that sometime soon.)

Here's what I get in the REPL, btw.

>>> import pandas as pd
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow")
       index         A         B         C         D
0 2000-01-03  0.980269  3.685731 -0.364217 -1.159738
1 2000-01-04  1.047916 -0.041232 -0.161812  0.212549
2 2000-01-05  0.498581  0.731168 -0.537677  1.346270
3 2000-01-06  1.120202  1.567621  0.003641  0.675253
4 2000-01-07 -0.487094  0.571455 -1.611639  0.103469
5 2000-01-10  0.836649  0.246462  0.588543  1.062782
6 2000-01-11 -0.157161  1.340307  1.195778 -1.097007
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow")
                 index         A         B         C         D
0  2000-01-03 00:00:00  0.980269  3.685731 -0.364217 -1.159738
1  2000-01-04 00:00:00  1.047916 -0.041232 -0.161812  0.212549
2  2000-01-05 00:00:00  0.498581  0.731168 -0.537677  1.346270
3  2000-01-06 00:00:00  1.120202  1.567621  0.003641  0.675253
4  2000-01-07 00:00:00 -0.487094  0.571455 -1.611639  0.103469
5  2000-01-10 00:00:00  0.836649  0.246462  0.588543  1.062782
6  2000-01-11 00:00:00 -0.157161  1.340307  1.195778 -1.097007
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow").dtypes # Default, OK
index    timestamp[s][pyarrow]
A              double[pyarrow]
B              double[pyarrow]
C              double[pyarrow]
D              double[pyarrow]
dtype: object
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow", parse_dates=["index"]).dtypes # Parse dates as list OK
index    timestamp[s][pyarrow]
A              double[pyarrow]
B              double[pyarrow]
C              double[pyarrow]
D              double[pyarrow]
dtype: object
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow", parse_dates=False).dtypes # The bug I was talking about, no way to disable dates from being parsed
index    timestamp[s][pyarrow]
A              double[pyarrow]
B              double[pyarrow]
C              double[pyarrow]
D              double[pyarrow]
dtype: object
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow", parse_dates={"a": ["index"]}).dtypes # Buggy, returns datetime64[ns]
a     datetime64[ns]
A    double[pyarrow]
B    double[pyarrow]
C    double[pyarrow]
D    double[pyarrow]
dtype: object

Member Author: @phofl OK to punt on this and fix the remaining issues in a follow-up?

            # to prevent misreads.
            date_format = None
        self.kwds["timestamp_parsers"] = date_format

        self.parse_options = {
            option_name: option_value
            for option_name, option_value in self.kwds.items()
@@ -79,6 +94,7 @@ def _get_pyarrow_options(self) -> None:
                "true_values",
                "false_values",
                "decimal_point",
                "timestamp_parsers",
            )
        }
        self.convert_options["strings_can_be_null"] = "" in self.kwds["null_values"]
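The string-vs-dict normalization in ``_get_pyarrow_options`` can be sketched in isolation; ``normalize_date_format`` is a hypothetical helper name used here purely for illustration:

```python
def normalize_date_format(date_format):
    # A single strptime format becomes a one-element list, which is the
    # shape pyarrow's timestamp_parsers option expects.
    if isinstance(date_format, str):
        return [date_format]
    # Per-column formats (a dict) can't be expressed via timestamp_parsers,
    # so fall back to pyarrow's default of None.
    if isinstance(date_format, dict):
        return None
    return date_format
```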
@@ -119,7 +135,7 @@ def _finalize_pandas_output(self, frame: DataFrame) -> DataFrame:
            multi_index_named = False
        frame.columns = self.names
        # we only need the frame, not the names
        frame.columns, frame = self._do_date_conversions(frame.columns, frame)
        _, frame = self._do_date_conversions(frame.columns, frame)
Member: This gives us back the frame with the already-changed column names?

Member Author (@lithomas1, Dec 5, 2022): Yeah, _do_date_conversions changes the names in the data_dict/frame too.

I'm actually not too sure why names is returned from this function again (I guess it might date from before dicts were ordered?).
EDIT: It's probably related to making a MultiIndex from the columns for the other engines. _do_date_conversions can always fix the frame directly, so this isn't relevant for the pyarrow engine.

        if self.index_col is not None:
            for i, item in enumerate(self.index_col):
                if is_integer(item):
pandas/io/parsers/base_parser.py (6 additions, 3 deletions)
@@ -62,8 +62,10 @@

from pandas import (
    ArrowDtype,
    DataFrame,
    DatetimeIndex,
    StringDtype,
    concat,
)
from pandas.core import algorithms
from pandas.core.arrays import (
@@ -93,8 +95,6 @@
    Scalar,
)

from pandas import DataFrame


class ParserBase:
    class BadLineHandleMethod(Enum):
@@ -1280,7 +1280,10 @@ def _isindex(colspec):
                new_cols.append(new_name)
            date_cols.update(old_names)

        data_dict.update(new_data)
        if isinstance(data_dict, DataFrame):
            data_dict = concat([DataFrame(new_data), data_dict], axis=1)
        else:
            data_dict.update(new_data)
        new_cols.extend(columns)

        if not keep_date_col:
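The ``DataFrame`` branch in this hunk relies on ``concat`` aligning on the row index and placing the freshly parsed date columns first. A toy sketch of that behavior (column names and data made up for illustration):

```python
import pandas as pd

# An already-materialized frame, as the pyarrow engine produces.
frame = pd.DataFrame({"A": [10.0, 11.0]})

# A freshly parsed combined-date column, as _process_date_conversion builds.
new_data = {"ymd": pd.to_datetime(["2001-01-10", "2001-02-01"])}

# Parsed date columns go in front; existing columns keep their order.
result = pd.concat([pd.DataFrame(new_data), frame], axis=1)
print(list(result.columns))  # ['ymd', 'A']
```

A dict, by contrast, is simply updated in place, which is why the code branches on the container type.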
pandas/tests/io/parser/test_parse_dates.py (20 additions, 16 deletions)
@@ -139,9 +139,8 @@ def test_separator_date_conflict(all_parsers):
    tm.assert_frame_equal(df, expected)


@xfail_pyarrow
@pytest.mark.parametrize("keep_date_col", [True, False])
def test_multiple_date_col_custom(all_parsers, keep_date_col):
def test_multiple_date_col_custom(all_parsers, keep_date_col, request):
    data = """\
KORD,19990127, 19:00:00, 18:56:00, 0.8100, 2.8100, 7.2000, 0.0000, 280.0000
KORD,19990127, 20:00:00, 19:56:00, 0.0100, 2.2100, 7.2000, 0.0000, 260.0000
@@ -152,6 +151,14 @@ def test_multiple_date_col_custom(all_parsers, keep_date_col):
"""
    parser = all_parsers

    if keep_date_col and parser.engine == "pyarrow":
        # For this to pass, we need to disable auto-inference on the date
        # columns in parse_dates. We have no way of doing this, though.
        mark = pytest.mark.xfail(
            reason="pyarrow doesn't support disabling auto-inference on column numbers."
        )
        request.node.add_marker(mark)

    def date_parser(*date_cols):
        """
        Test date parser.
@@ -301,9 +308,8 @@ def test_concat_date_col_fail(container, dim):
    parsing.concat_date_cols(date_cols)


@xfail_pyarrow
@pytest.mark.parametrize("keep_date_col", [True, False])
def test_multiple_date_col(all_parsers, keep_date_col):
def test_multiple_date_col(all_parsers, keep_date_col, request):
    data = """\
KORD,19990127, 19:00:00, 18:56:00, 0.8100, 2.8100, 7.2000, 0.0000, 280.0000
KORD,19990127, 20:00:00, 19:56:00, 0.0100, 2.2100, 7.2000, 0.0000, 260.0000
@@ -313,6 +319,15 @@ def test_multiple_date_col(all_parsers, keep_date_col):
KORD,19990127, 23:00:00, 22:56:00, -0.5900, 1.7100, 4.6000, 0.0000, 280.0000
"""
    parser = all_parsers

    if keep_date_col and parser.engine == "pyarrow":
        # For this to pass, we need to disable auto-inference on the date
        # columns in parse_dates. We have no way of doing this, though.
        mark = pytest.mark.xfail(
            reason="pyarrow doesn't support disabling auto-inference on column numbers."
        )
        request.node.add_marker(mark)

    kwds = {
        "header": None,
        "parse_dates": [[1, 2], [1, 3]],
@@ -469,7 +484,6 @@ def test_date_col_as_index_col(all_parsers):
    tm.assert_frame_equal(result, expected)


@xfail_pyarrow
def test_multiple_date_cols_int_cast(all_parsers):
    data = (
        "KORD,19990127, 19:00:00, 18:56:00, 0.8100\n"
@@ -530,7 +544,6 @@ def test_multiple_date_cols_int_cast(all_parsers):
    tm.assert_frame_equal(result, expected)


@xfail_pyarrow
def test_multiple_date_col_timestamp_parse(all_parsers):
    parser = all_parsers
    data = """05/31/2012,15:30:00.029,1306.25,1,E,0,,1306.25
@@ -1170,7 +1183,6 @@ def test_multiple_date_cols_chunked(all_parsers):
    tm.assert_frame_equal(chunks[2], expected[4:])


@xfail_pyarrow
def test_multiple_date_col_named_index_compat(all_parsers):
    parser = all_parsers
    data = """\
@@ -1194,7 +1206,6 @@ def test_multiple_date_col_named_index_compat(all_parsers):
    tm.assert_frame_equal(with_indices, with_names)


@xfail_pyarrow
def test_multiple_date_col_multiple_index_compat(all_parsers):
    parser = all_parsers
    data = """\
@@ -1410,7 +1421,6 @@ def test_parse_date_time_multi_level_column_name(all_parsers):
    tm.assert_frame_equal(result, expected)


@xfail_pyarrow
@pytest.mark.parametrize(
    "data,kwargs,expected",
    [
@@ -1500,9 +1510,6 @@ def test_parse_date_time(all_parsers, data, kwargs, expected):
    tm.assert_frame_equal(result, expected)


@xfail_pyarrow
# From date_parser fallback behavior
@pytest.mark.filterwarnings("ignore:elementwise comparison:FutureWarning")
def test_parse_date_fields(all_parsers):
    parser = all_parsers
    data = "year,month,day,a\n2001,01,10,10.\n2001,02,1,11."
@@ -1512,7 +1519,7 @@ def test_parse_date_fields(all_parsers):
        StringIO(data),
        header=0,
        parse_dates={"ymd": [0, 1, 2]},
        date_parser=pd.to_datetime,
        date_parser=lambda x: x,
    )

    expected = DataFrame(
@@ -1522,7 +1529,6 @@ def test_parse_date_fields(all_parsers):
    tm.assert_frame_equal(result, expected)


@xfail_pyarrow
@pytest.mark.parametrize(
    ("key", "value", "warn"),
    [
@@ -1559,7 +1565,6 @@ def test_parse_date_all_fields(all_parsers, key, value, warn):
    tm.assert_frame_equal(result, expected)


@xfail_pyarrow
@pytest.mark.parametrize(
    ("key", "value", "warn"),
    [
@@ -1596,7 +1601,6 @@ def test_datetime_fractional_seconds(all_parsers, key, value, warn):
    tm.assert_frame_equal(result, expected)


@xfail_pyarrow
def test_generic(all_parsers):
    parser = all_parsers
    data = "year,month,day,a\n2001,01,10,10.\n2001,02,1,11."