Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX preserve dtype with datetime columns of different resolution when merging #53213

Merged
merged 6 commits into from
May 14, 2023

Conversation

glemaitre
Copy link
Contributor

@glemaitre glemaitre commented May 13, 2023

Avoid coercing to object dtype when merging datetime columns with different resolutions. Instead, use the most fine-grain resolution.

@jorisvandenbossche jorisvandenbossche added Non-Nano datetime64/timedelta64 with non-nanosecond resolution Bug labels May 13, 2023
@jorisvandenbossche jorisvandenbossche added this to the 2.0.2 milestone May 13, 2023
}
)
df2 = df1.copy()
df2["t"] = df2["t"].dt.as_unit("s")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we set the resolution for df1 explicitly as well? We might change the inference at some point which would make the test useless

@phofl phofl merged commit 935244a into pandas-dev:main May 14, 2023
@phofl
Copy link
Member

phofl commented May 14, 2023

thx @glemaitre

@lumberbot-app
Copy link

lumberbot-app bot commented May 14, 2023

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.0.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 935244a9b9c0fe796de315cd381d6364719bfdc1
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #53213: FIX preserve dtype with datetime columns of different resolution when merging'
  1. Push to a named branch:
git push YOURFORK 2.0.x:auto-backport-of-pr-53213-on-2.0.x
  1. Create a PR against branch 2.0.x, I would have named this PR:

"Backport PR #53213 on branch 2.0.x (FIX preserve dtype with datetime columns of different resolution when merging)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

phofl pushed a commit to phofl/pandas that referenced this pull request May 14, 2023
mroeschke pushed a commit that referenced this pull request May 15, 2023
…columns of different resolution when merging) (#53228)

FIX preserve dtype with datetime columns of different resolution when merging (#53213)

(cherry picked from commit 935244a)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@jorisvandenbossche
Copy link
Member

One aspect we were still looking into was that the reason this is correctly working for tz-naive datetimes with different resolutions is only because the merging uses the object factorizer (at which point the different resolution doesn't matter). We should also call _ensure_matching_resos for tz-naive data somewhere as well.

This also seems to point to a performance regression from 1.5 to 2.0, as looking a bit into it, it seems we are actually also using the object factorizer even for datetimes with the same resolution.

Will do a follow-up PR

@jorisvandenbossche
Copy link
Member

Performance fix -> #53231

There was also one more situation I noticed where we incorrectly cast to object dtype, and that's for an outer/right join if the join key column needs to be combined. Possible short-term workaround -> #53233

Rylie-W pushed a commit to Rylie-W/pandas that referenced this pull request May 19, 2023
topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 22, 2023
topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 27, 2023
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
mroeschke pushed a commit that referenced this pull request Jul 13, 2023
* ENH: better dtype inference when doing DataFrame reductions

* precommit issues

* fix failures

* fix failures

* mypy + some docs

* doc linting linting

* refactor to use _reduce_with_wrap

* docstring linting

* pyarrow failure + linting

* pyarrow failure + linting

* linting

* doc stuff

* linting fixes

* fix fix doc string

* remove _wrap_na_result

* doc string example

* pyarrow + categorical

* silence bugs

* silence errors

* silence errors II

* fix errors III

* various fixups

* various fixups

* delay fixing windows and 32bit failures

* BUG: Adding a columns to a Frame with RangeIndex columns using a non-scalar key (#52877)

* DOC: Update whatsnew (#52882)

* CI: Change development python version to 3.10 (#51133)

* CI: Change development python version to 3.10

* Update checks

* Remove strict

* Remove strict

* Fixes

* Add dt

* Switch python to 3.9

* Remove

* Fix

* Try attribute

* Adjust

* Fix mypy

* Try fixing doc build

* Fix mypy

* Fix stubtest

* Remove workflow file

* Rename back

* Update

* Rename

* Rename

* Change python version

* Remove

* Fix doc errors

* Remove pypy

* Update ci/deps/actions-pypy-39.yaml

Co-authored-by: Thomas Li <47963215+lithomas1@users.noreply.github.com>

* Revert pypy removal

* Remove again

* Fix

* Change to 3.9

* Address

---------

Co-authored-by: Thomas Li <47963215+lithomas1@users.noreply.github.com>

* update

* update

* add docs

* fix windows tests

* fix windows tests

* remove guards for 32bit linux

* add bool tests + fix 32-bit failures

* fix pre-commit failures

* fix mypy failures

* rename _reduce_with -> _reduce_and_wrap

* assert missing attributes

* reduction dtypes on windows and 32bit systems

* add tests for min_count=0

* PERF:median with axis=1

* median with axis=1 fix

* streamline Block.reduce

* fix comments

* FIX preserve dtype with datetime columns of different resolution when merging (#53213)

* BUG Merge not behaving correctly when having `MultiIndex` with a single level (#53215)

* fix merge when MultiIndex with single level

* resolved conversations

* fixed code style

* BUG: preserve dtype for right/outer merge of datetime with different resolutions (#53233)

* remove special BooleanArray.sum method

* remove BooleanArray.prod

* fixes

* Update doc/source/whatsnew/v2.1.0.rst

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

* Update pandas/core/array_algos/masked_reductions.py

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

* small cleanup

* small cleanup

* only reduce 1d

* fix after #53418

* update according to comments

* revome note

* update _minmax

* REF: add keepdims parameter to ExtensionArray._reduce + remove ExtensionArray._reduce_and_wrap

* REF: add keepdims parameter to ExtensionArray._reduce + remove ExtensionArray._reduce_and_wrap

* fix whatsnew

* fix _reduce call

* simplify test

* add tests for any/all

---------

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Co-authored-by: Thomas Li <47963215+lithomas1@users.noreply.github.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Yao Xiao <108576690+Charlie-XIAO@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Non-Nano datetime64/timedelta64 with non-nanosecond resolution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.merge on Timestamp column broken if when timestamps are timezone-aware and have different units
3 participants