-
-
Notifications
You must be signed in to change notification settings - Fork 17.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: read_csv with engine pyarrow parsing multiple date columns #50056
Changes from 9 commits
321aaa5
eec548c
062426b
aad1c55
fc5a750
ba3e3be
3e41179
077abfd
3347dc9
4bb00ac
5cbe608
5ae2dbd
2bb9659
7c52e3d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -61,6 +61,21 @@ def _get_pyarrow_options(self) -> None: | |
if pandas_name in self.kwds and self.kwds.get(pandas_name) is not None: | ||
self.kwds[pyarrow_name] = self.kwds.pop(pandas_name) | ||
|
||
# Date format handling | ||
# If we get a string, we need to convert it into a list for pyarrow | ||
# If we get a dict, we want to parse those separately | ||
date_format = self.date_format | ||
if isinstance(date_format, str): | ||
date_format = [date_format] | ||
else: | ||
# In case of dict, we don't want to propagate through, so | ||
# just set to pyarrow default of None | ||
|
||
# Ideally, in future we disable pyarrow dtype inference (read in as string) | ||
# to prevent misreads. | ||
date_format = None | ||
self.kwds["timestamp_parsers"] = date_format | ||
|
||
self.parse_options = { | ||
option_name: option_value | ||
for option_name, option_value in self.kwds.items() | ||
|
@@ -79,6 +94,7 @@ def _get_pyarrow_options(self) -> None: | |
"true_values", | ||
"false_values", | ||
"decimal_point", | ||
"timestamp_parsers", | ||
) | ||
} | ||
self.convert_options["strings_can_be_null"] = "" in self.kwds["null_values"] | ||
|
@@ -119,7 +135,7 @@ def _finalize_pandas_output(self, frame: DataFrame) -> DataFrame: | |
multi_index_named = False | ||
frame.columns = self.names | ||
# we only need the frame not the names | ||
frame.columns, frame = self._do_date_conversions(frame.columns, frame) | ||
_, frame = self._do_date_conversions(frame.columns, frame) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This gives us back the frame with already changed column names? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, _do_date_conversions changes the names in the data_dict/frame too.
|
||
if self.index_col is not None: | ||
for i, item in enumerate(self.index_col): | ||
if is_integer(item): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to create a conversion option that is arrow only for this, otherwise we incur a big performance penalty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there's no way to disable the parsing, it'll only get parsed once as a pyarrow timestamp/date.
I think in your other PR, you set it so that date parsing will be bypassed for Arrow Timestamp cols so we won't have double parsing of the input.
So there won't be a perf penalty, just a wrong result if you didn't want pyarrow to parse the date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I am misunderstanding this, but we have to convert to NumPy to parse with to_datetime? This is what I meant with slow.
What happens if dtype_backend is set to pyarrow in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I went back and checked the output, and it looks right, except for the one case where parse_dates is a dict (that does a no-op, e.g. it maps the column to itself). This is a very uncommon case, though (I'm not sure why you would want to do that, instead of passing a list).
Do you think it'd be better to fix the root cause #52545, than to special case here (I can try to take a look at that sometime soon)?
Here's what I get in the REPL btw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phofl OK to punt on this and fix the remaining issues in a follow-up?