Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: preserve dtype for right/outer merge of datetime with different resolutions #53233

Merged

Conversation

jorisvandenbossche
Copy link
Member

Follow-up on #53213, fixing one more corner case for the right/outer joins making the key column object dtype.

Expanded the test added in that PR to test the different join types.
The whatsnew added in #53213 should also already cover this.

@jorisvandenbossche jorisvandenbossche added Bug Non-Nano datetime64/timedelta64 with non-nanosecond resolution labels May 15, 2023
@jorisvandenbossche jorisvandenbossche added this to the 2.0.2 milestone May 15, 2023
Comment on lines +997 to +1004
if (
lvals.dtype.kind == "M"
and rvals.dtype.kind == "M"
and result_dtype.kind == "O"
):
# TODO(non-nano) Workaround for common_type not dealing
# with different resolutions
result_dtype = key_col.dtype
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an ugly band-aid solution: the proper fix is to handle this in find_common_type. But I am not sure if changing find_common_type would be in scope for a bug fix release? (xref #46587 (comment))

cc @jbrockmendel

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing find_common_type would affect concat, so im hesitant to call it a bugfix. But avoiding object may be nicer behavior, so id be open to it as an api change longer-term.

Do we know that both are tznaive at this point? If one is tzaware then object is correct (though i guess we should be able to determine that the merge will be empty?)

I've given a little bit of thought to something similar in the context of Index.get_indexer with mismatched resos. Instead of converting both to object, could find a fill_value not present in left and convert right to left.unit, filling non-convertible entries with that fill_value. The "finding a fill_value" part seems really hacky though. (update: when i wrote this paragraph I thought this chunk of code was in get_merge_keys)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know that both are tznaive at this point? If one is tzaware then object is correct (though i guess we should be able to determine that the merge will be empty?)

I think so yes, normally we should raise an error earlier in the code if you try to merge on a tz-aware and tz-naive column:

In [1]: df1 = pd.DataFrame({"key": [pd.Timestamp("1970-01-01")], "a": [1]})

In [2]: df2 = pd.DataFrame({"key": [pd.Timestamp("1970-01-01", tz="Europe/Brussels")], "a": [1]})

In [3]: pd.merge(df1, df2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 pd.merge(df1, df2)

File ~/scipy/pandas/pandas/core/reshape/merge.py:146, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
    129 @Substitution("\nleft : DataFrame or named Series")
    130 @Appender(_merge_doc, indents=0)
    131 def merge(
   (...)
    144     validate: str | None = None,
    145 ) -> DataFrame:
--> 146     op = _MergeOperation(
    147         left,
    148         right,
    149         how=how,
    150         on=on,
    151         left_on=left_on,
    152         right_on=right_on,
    153         left_index=left_index,
    154         right_index=right_index,
    155         sort=sort,
    156         suffixes=suffixes,
    157         indicator=indicator,
    158         validate=validate,
    159     )
    160     return op.get_result(copy=copy)

File ~/scipy/pandas/pandas/core/reshape/merge.py:735, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate)
    727 (
    728     self.left_join_keys,
    729     self.right_join_keys,
    730     self.join_names,
    731 ) = self._get_merge_keys()
    733 # validate the merge keys dtypes. We may need to coerce
    734 # to avoid incompatible dtypes
--> 735 self._maybe_coerce_merge_keys()
    737 # If argument passed to validate,
    738 # check if columns specified as unique
    739 # are in fact unique.
    740 if validate is not None:

File ~/scipy/pandas/pandas/core/reshape/merge.py:1397, in _MergeOperation._maybe_coerce_merge_keys(self)
   1393     raise ValueError(msg)
   1394 elif not isinstance(lk.dtype, DatetimeTZDtype) and isinstance(
   1395     rk.dtype, DatetimeTZDtype
   1396 ):
-> 1397     raise ValueError(msg)
   1398 elif (
   1399     isinstance(lk.dtype, DatetimeTZDtype)
   1400     and isinstance(rk.dtype, DatetimeTZDtype)
   1401 ) or (lk.dtype.kind == "M" and rk.dtype.kind == "M"):
   1402     # allows datetime with different resolutions
   1403     continue

ValueError: You are trying to merge on datetime64[ns] and datetime64[ns, Europe/Brussels] columns for key 'key'. If you wish to proceed you should use pd.concat

@mroeschke mroeschke merged commit 93f9ae0 into pandas-dev:main May 17, 2023
38 checks passed
@mroeschke
Copy link
Member

Thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the bug-merge-datetime-unit-outer branch May 17, 2023 16:15
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request May 17, 2023
mroeschke pushed a commit that referenced this pull request May 17, 2023
…er merge of datetime with different resolutions) (#53275)

Backport PR #53233: BUG: preserve dtype for right/outer merge of datetime with different resolutions

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 22, 2023
topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 27, 2023
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
mroeschke pushed a commit that referenced this pull request Jul 13, 2023
* ENH: better dtype inference when doing DataFrame reductions

* precommit issues

* fix failures

* fix failures

* mypy + some docs

* doc linting linting

* refactor to use _reduce_with_wrap

* docstring linting

* pyarrow failure + linting

* pyarrow failure + linting

* linting

* doc stuff

* linting fixes

* fix fix doc string

* remove _wrap_na_result

* doc string example

* pyarrow + categorical

* silence bugs

* silence errors

* silence errors II

* fix errors III

* various fixups

* various fixups

* delay fixing windows and 32bit failures

* BUG: Adding a columns to a Frame with RangeIndex columns using a non-scalar key (#52877)

* DOC: Update whatsnew (#52882)

* CI: Change development python version to 3.10 (#51133)

* CI: Change development python version to 3.10

* Update checks

* Remove strict

* Remove strict

* Fixes

* Add dt

* Switch python to 3.9

* Remove

* Fix

* Try attribute

* Adjust

* Fix mypy

* Try fixing doc build

* Fix mypy

* Fix stubtest

* Remove workflow file

* Rename back

* Update

* Rename

* Rename

* Change python version

* Remove

* Fix doc errors

* Remove pypy

* Update ci/deps/actions-pypy-39.yaml

Co-authored-by: Thomas Li <47963215+lithomas1@users.noreply.github.com>

* Revert pypy removal

* Remove again

* Fix

* Change to 3.9

* Address

---------

Co-authored-by: Thomas Li <47963215+lithomas1@users.noreply.github.com>

* update

* update

* add docs

* fix windows tests

* fix windows tests

* remove guards for 32bit linux

* add bool tests + fix 32-bit failures

* fix pre-commit failures

* fix mypy failures

* rename _reduce_with -> _reduce_and_wrap

* assert missing attributes

* reduction dtypes on windows and 32bit systems

* add tests for min_count=0

* PERF:median with axis=1

* median with axis=1 fix

* streamline Block.reduce

* fix comments

* FIX preserve dtype with datetime columns of different resolution when merging (#53213)

* BUG Merge not behaving correctly when having `MultiIndex` with a single level (#53215)

* fix merge when MultiIndex with single level

* resolved conversations

* fixed code style

* BUG: preserve dtype for right/outer merge of datetime with different resolutions (#53233)

* remove special BooleanArray.sum method

* remove BooleanArray.prod

* fixes

* Update doc/source/whatsnew/v2.1.0.rst

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

* Update pandas/core/array_algos/masked_reductions.py

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

* small cleanup

* small cleanup

* only reduce 1d

* fix after #53418

* update according to comments

* revome note

* update _minmax

* REF: add keepdims parameter to ExtensionArray._reduce + remove ExtensionArray._reduce_and_wrap

* REF: add keepdims parameter to ExtensionArray._reduce + remove ExtensionArray._reduce_and_wrap

* fix whatsnew

* fix _reduce call

* simplify test

* add tests for any/all

---------

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Co-authored-by: Thomas Li <47963215+lithomas1@users.noreply.github.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Yao Xiao <108576690+Charlie-XIAO@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Non-Nano datetime64/timedelta64 with non-nanosecond resolution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants