BUG: read_csv raising for arrow engine and parse_dates #53295

Merged: 2 commits merged into pandas-dev:main on May 19, 2023

Conversation

phofl
Member

@phofl phofl commented May 18, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Not sure if this is an actual regression, but it ties a bit into the dtype backend and it would be nice if this worked, since this raises for every parse_dates case with the NumPy dtype backend (and I want to advertise the engine in my pyarrow blog...)

@phofl phofl added the IO CSV (read_csv, to_csv) and Arrow (pyarrow functionality) labels May 18, 2023
@@ -1137,6 +1137,9 @@ def unpack_if_single_element(arg):
        return arg

    def converter(*date_cols, col: Hashable):
        if len(date_cols) == 1 and date_cols[0].dtype.kind in "Mm":
Member


Not sure if I understood your comment in the OP correctly, but does this also fix parse_dates for numpy dtypes too?

Member Author


Only NumPy dtypes, as far as I know; we fixed this for Arrow dtypes a couple of weeks ago.

@mroeschke mroeschke added this to the 2.0.2 milestone May 19, 2023
@lithomas1
Member

Can you share an MRE of the issue you're having?

I'm wondering if #50056 fixes/causes the issue.

@phofl
Member Author

phofl commented May 19, 2023

from io import StringIO

import pandas as pd

data = """a,b
2000-01-01 00:00:00,1
2000-01-01 00:00:01,1"""

# Raises before this PR.
result = pd.read_csv(StringIO(data), parse_dates=["a"], engine="pyarrow")

Nope, still raises after your PR is in

Edit: The main issue is that Arrow already infers the column as datetime and we then try to parse it again, which raises.
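
The idea behind the check in the diff can be sketched outside pandas: if a column already has a datetime-like dtype, hand it back unchanged instead of parsing it a second time. `convert_dates` below is a hypothetical helper for illustration only, not the pandas internals; the only part taken from the diff is the `dtype.kind in "Mm"` test.

```python
import numpy as np
import pandas as pd


def convert_dates(column: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a date converter (not pandas' own code)."""
    # dtype.kind is "M" for datetime64 and "m" for timedelta64.
    if column.dtype.kind in "Mm":
        # Already parsed upstream (e.g. by the pyarrow engine): return as-is.
        return column
    # Otherwise the column still holds strings, so parse it once.
    return pd.to_datetime(column).to_numpy()


already_parsed = np.array(
    ["2000-01-01T00:00:00", "2000-01-01T00:00:01"], dtype="datetime64[ns]"
)
raw_strings = np.array(
    ["2000-01-01 00:00:00", "2000-01-01 00:00:01"], dtype=object
)

same = convert_dates(already_parsed)   # short-circuits, no second parse
parsed = convert_dates(raw_strings)    # goes through pd.to_datetime
```

With this guard the already-parsed array is returned as the same object, while string input still goes through the normal parsing path.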

@mroeschke mroeschke merged commit aaf5037 into pandas-dev:main May 19, 2023
40 checks passed
@mroeschke
Member

Thanks @phofl

@lumberbot-app

lumberbot-app bot commented May 19, 2023

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.0.x
git pull
  2. Cherry pick the first parent branch of this PR on top of the older branch:
git cherry-pick -x -m1 aaf503784a7cbf6425d681551d1e1e686ce14815
  3. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #53295: BUG: read_csv raising for arrow engine and parse_dates'
  4. Push to a named branch:
git push YOURFORK 2.0.x:auto-backport-of-pr-53295-on-2.0.x
  5. Create a PR against branch 2.0.x, I would have named this PR:

"Backport PR #53295 on branch 2.0.x (BUG: read_csv raising for arrow engine and parse_dates)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@phofl
Member Author

phofl commented May 19, 2023

Will backport later today or tomorrow

@lithomas1
Member

Might want to consider backporting #50056 and #52087 too.

@phofl
Member Author

phofl commented May 20, 2023

The first one is a bit less common, can think about the second one if you like

@lithomas1
Member

That'd be great, I think there were a couple of issues opened about NaNs in string columns not being read properly, related to the second one.

lithomas1 pushed a commit that referenced this pull request May 20, 2023
…ngine and parse_dates) (#53317)

BUG: read_csv raising for arrow engine and parse_dates (#53295)

(cherry picked from commit aaf5037)
topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 22, 2023
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
Labels
Arrow (pyarrow functionality), IO CSV (read_csv, to_csv)
3 participants