Allow merging on object / non-object column #21681

jorisvandenbossche · 2018-06-29T13:38:42Z

Follow-up on discussion in #21310 (cc @jreback )

What this does: allow merge on an object and a non-object column (eg object and int, object and categorical, ..), by coercing the non-object column to object dtype.

Reasoning:

object dtype can be anything, so IMO pandas should not try to be smart and think to know the merge is not valid
all other, certainly invalid dtype combinations (eg int and datetime64) keep raising an error

The main disadvantage is that we no longer detect the case of merging object strings with eg integer column, but IMO we cannot avoid such cases as long as we don't have a proper string dtype but use object dtype for this (you can also mix string and ints in a column, we also don't raise for that for a similar reason).

Additional option would be to issue a warning in this case.

Closes #23733

codecov · 2018-06-29T15:58:49Z

Codecov Report

Merging #21681 into master will not change coverage.
The diff coverage is 88%.

@@           Coverage Diff           @@
##           master   #21681   +/-   ##
=======================================
  Coverage   92.31%   92.31%           
=======================================
  Files         166      166           
  Lines       52391    52404   +13     
=======================================
+ Hits        48363    48375   +12     
- Misses       4028     4029    +1

Flag	Coverage Δ
#multiple	`90.73% <88%> (ø)`	⬆️
#single	`43.03% <0%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/reshape/merge.py	`94.26% <88%> (-0.04%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c4ac0d6...647f2de. Read the comment docs.

jreback · 2018-07-01T15:20:50Z

-1 in this - if u want to merge an object column with a non object column then coerce to object

this will propagate objects columns very easily - against the principle of strongly dtyping things
what is the use case that is motivating this?

jorisvandenbossche · 2018-07-02T12:26:15Z

what is the use case that is motivating this?

The concrete case was a regression in the released version of geopandas because of the change in pandas to start raising errors on merging with unequal dtypes. This actual case was caused because we first merge an empty frame with another one, which creates object columns of NaN, and then merging the result yet with another, where the object column of NaN is merged on a float column (integers with NaNs), raising the error.
Yes, we can workaround this, by for example coercing both key columns to object as you mentioned (or coercing both to float). But that adds some complexity to the code, and given that I had it in geopandas could indicate that also other people might run into it.

But my main reason is what I said in the top post: object dtype can be anything => pandas should not make assumptions on what it thinks is valid/invalid for operating on such data. That's up to the user.
But given that merging by accident strings on integers (where the strings should have been coerced to ints before) silently gives a not useful result, we could raise a warning as well.

It also boils down a bit to that the original change to start erroring should ideally have been done with a deprecation warning instead of directly erroring, so it would not cause regression directly.

jorisvandenbossche · 2018-07-02T20:39:58Z

There was actually a question on SO today reporting a problem in geopandas caused by this: https://stackoverflow.com/questions/51138156/how-can-i-use-sjoin-to-merge-two-geodataframes-without-an-error.

In a certain way, you could also see this PR as a regression fix. We probably should have made such a change with a deprecation warning first (if we want the original change).

jreback · 2018-07-02T23:37:50Z

I disagree here. merging on non-alike dtypes is simply wrong. objects cannot merge with non-object, full stop. Sure we have a special case for boolean, but allowing object with integer is just a disaster. '

If you want to try a bit more inference I am ok with that. e.g. say you have an object column, then am ok with running infer_dtype on it. and allowing that to merge with a similar type (e.g. integer). That is a trivial coercion.

jorisvandenbossche · 2018-07-02T23:47:56Z

objects cannot merge with non-object, full stop.

They can perfectly well, depending on what is inside the object dtype column. It's not about "cannot", it is about "do we allow it or not". And for the record, until one month ago pandas has always allowed it.

If you want to try a bit more inference I am ok with that.

I have thought about that as well, but inference can be costly, so I would rather not do it.
Also, it will not work in all cases, eg with an empty frame.

Are you OK with raising an informative warning instead? Where we clearly indicate to the user what is happening and this typically points to a problem with their types? (but at least it allow people to ignore the warning, or to fix it only later) (I thought you hinted on being OK with that in the previous PR, but not sure)

jorisvandenbossche · 2018-07-02T23:48:56Z

BTW, input from some other devs is certainly welcome, Jeff and I clearly disagree on this .. :-)

jreback · 2018-07-02T23:56:57Z

Are you OK with raising an informative warning instead? Where we clearly indicate to the user what is happening and this typically points to a problem with their types? (but at least it allow people to ignore the warning, or to fix it only later) (I thought you hinted on being OK with that in the previous PR, but not sure)

I guess that is ok. The original reason that this was changed is to avoid the all-NaN problem (e.g. you merge with something that has no matches because of the dtype mismatch). I don't want to go back to that as its worse from a user POV.

open to compromise but would rather be strict here. (and inference is not generally expensive compared to a borked merge :>)

WillAyd · 2018-07-03T00:02:04Z

I'm +1 on being strict. At the very least if the end user is explicit about types they stand to benefit performance- and resource-wise instead of pandas falling back to object.

jorisvandenbossche · 2018-07-03T07:21:13Z

(and inference is not generally expensive compared to a borked merge :>)

yeah, that is true .. :)
Thinking this a bit more through, then the main problem seems to how to handle the result. For example then need to add logic that eg 'integer' and 'mixed-integer-float' might be OK, which is repeating the logic (and also need to special case the 'empty').
Of course, I could also do the full existing check based on the inferred types hierarchy (to avoid such duplicated logic), assuming that for non-object dtypes this inferring of the type is cheap.

The original reason that this was changed is to avoid the all-NaN problem .. I don't want to go back to that as its worse from a user POV.

I just realized that I actually think a warning can be better from a user POV. Instead of the error, you would then see the NaNs, together with the informative warning explaining why the NaNs are there and how to solve it. I think this could actually be better to get what is happening than only the error.

And this also alleviates the direct code breakages caused by the change in 0.23.0 to be more strict.

At the very least if the end user is explicit about types they stand to benefit performance- and resource-wise instead of pandas falling back to object.

Yes, but note that there can be good reasons to want or to end up with object data, and not all use cases of pandas are performance sensitive (eg small data).

jreback · 2018-10-23T02:50:42Z

@jorisvandenbossche can you rebase this

jreback · 2018-11-01T01:23:50Z

pandas/core/reshape/merge.py

@@ -946,11 +947,14 @@ def _maybe_coerce_merge_keys(self):
                   "you should use pd.concat".format(lk_dtype=lk.dtype,
                                                     rk_dtype=rk.dtype))

+            coerce_to_object = False
+            if is_object_dtype(lk) or is_object_dtype(rk):


is this actually necessary here? e.g. it would hit the else clause if this is true as well.

Then can just do the conversions in the else. (e.g. stuff on line 1011)

Without this we hit

956 elif ((is_numeric_dtype(lk) and not is_bool_dtype(lk)) 957 and not is_numeric_dtype(rk)): --> 958 raise ValueError(msg)

and raise.

jreback · 2018-11-18T22:08:19Z

@jorisvandenbossche can you merge master

h-vetinari · 2018-11-19T17:35:06Z

I've had colleages come and complain to me about this recently - especially since this used to work.

I would advocate to be strict, but based on the inferred type. The case where it's an all-integer index hiding behind an object dtype would be easily solved through that, and the other cases would continue to emit the warning.

jreback · 2018-12-02T16:48:22Z

@jorisvandenbossche can you merge master

…e-merge-object-columns

pep8speaks · 2018-12-18T18:42:56Z

Hello @jorisvandenbossche! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 27, 2018 at 12:56 Hours UTC

TomAugspurger · 2018-12-18T18:42:57Z

Fixed the merge conflicts. Haven't taken a look at the changes yet.

jreback · 2018-12-26T01:24:42Z

I pushed a commit which fixes this up in a reasonably clean way. if folks can look at the cases and see if anything glaring pops out. We have a few odd cases already, e.g. bools are inferred if possible and we actually allow stringlike non-unicode strings to infer, but these worked before.

cc @pandas-dev/pandas-core

pandas/tests/reshape/merge/test_merge.py

pandas/core/reshape/merge.py

jreback · 2018-12-27T00:18:48Z

@TomAugspurger

TomAugspurger

LGTM.

May leave this open for a day in case @jorisvandenbossche is around to look tomorrow, but I think this is good to go.

jorisvandenbossche · 2018-12-27T08:25:18Z

@jreback So basically you reverted to allow again all combinations of an object dtype with a non-object dtype?

jreback · 2018-12-27T12:12:26Z

yes anything with object just forces conversion to object

we
could be more specific and disallow strings vs non strings

jreback · 2018-12-27T12:56:44Z

added a commit to be more restrictive.

jreback · 2018-12-27T22:46:44Z

@jorisvandenbossche if you have a chance

jreback · 2019-01-01T16:29:56Z

@TomAugspurger can you look

TomAugspurger

LGTM

jreback · 2019-01-03T00:45:09Z

thanks @jorisvandenbossche !

jorisvandenbossche · 2019-01-03T22:56:31Z

@jreback and thanks for finishing the PR!

One thing I note now, after your last comment to be a bit more strict on merging object and numeric columns, the whatsnew note is not fully correct any more.

I am also wondering a bit how costly this inferring of the type for object columns is.

jreback · 2019-01-03T23:07:35Z

One thing I note now, after your last comment to be a bit more strict on merging object and numeric columns, the whatsnew note is not fully correct any more.

yeah ok we are a bit more strict so whatsnew could be sligthly better.

inference is << cost of actually merging things, its a no-brainer and we do it all over the place.

kevalshah90 · 2020-04-23T17:56:57Z

I am running into this issue and found this thread. Was this change pushed to pandas or can we not merge object/non-object type.

Here is my code:

res_leads = pd.merge(df_temp,
                         df_market[['Tenant_Industry','Tenant','num_employees','size_calc']],
                         left_on = df_temp.index,
                         right_on = 'Tenant',
                         how='left')

Error traceback:

Traceback (most recent call last):
  File "/Users/....../market.py", line 267, in render_table
    result = top_leads(company, market)
  File "/Users/...../CRE/Platform/Deploy/return_leads.py", line 77, in top_leads
    how='left')
  File "/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 61, in merge
    validate=validate)
  File "/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 555, in __init__
    self._maybe_coerce_merge_keys()
  File "/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 983, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat

Allow merging on object / non-object column

f0f3a82

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions labels Jul 2, 2018

jorisvandenbossche added this to the 0.23.2 milestone Jul 2, 2018

jorisvandenbossche modified the milestones: 0.23.2, 0.23.3 Jul 5, 2018

jreback modified the milestones: 0.23.4, 0.23.5 Aug 6, 2018

jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018

jreback requested changes Nov 1, 2018

View reviewed changes

TomAugspurger mentioned this pull request Nov 19, 2018

Merging on int64 and object columns, pandas 0.23.4 #23733

Closed

jreback mentioned this pull request Nov 27, 2018

When merging on boolean index in right dataframe, merge key becomes an object #23884

Closed

Merge remote-tracking branch 'upstream/master' into jorisvandenbossch…

5d9c2f4

…e-merge-object-columns

jreback added 2 commits December 25, 2018 19:28

Merge branch 'master' into PR_TOOL_MERGE_PR_21681

ed8e8ca

all but bools

8ff0f35

jreback approved these changes Dec 26, 2018

View reviewed changes

jreback added 2 commits December 26, 2018 14:56

Merge branch 'master' into PR_TOOL_MERGE_PR_21681

7cf75e1

add whatsnew note

9fc37be

TomAugspurger reviewed Dec 26, 2018

View reviewed changes

pandas/tests/reshape/merge/test_merge.py Outdated Show resolved Hide resolved

TomAugspurger reviewed Dec 26, 2018

View reviewed changes

pandas/core/reshape/merge.py Show resolved Hide resolved

jreback added 2 commits December 26, 2018 16:18

fix with numeric-like object types

32d737f

simplify

ed94284

TomAugspurger reviewed Dec 27, 2018

View reviewed changes

disallow inferred numeric with non-numeric

c805634

Merge branch 'master' into PR_TOOL_MERGE_PR_21681

647f2de

TomAugspurger approved these changes Jan 2, 2019

View reviewed changes

jreback merged commit 819418e into pandas-dev:master Jan 3, 2019

jorisvandenbossche deleted the merge-object-columns branch January 3, 2019 22:50

jreback mentioned this pull request Jan 4, 2019

CLN: use idiomatic pandas_dtypes in pandas/dtypes/common.py #24541

Merged

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Allow merging on object / non-object column (pandas-dev#21681)

7adcb77

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Allow merging on object / non-object column (pandas-dev#21681)

f98ab8d

fdosani mentioned this pull request Sep 25, 2019

tests are failing due to test_decimal_with_joins capitalone/datacompy#47

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow merging on object / non-object column #21681

Allow merging on object / non-object column #21681

jorisvandenbossche commented Jun 29, 2018 •

edited by TomAugspurger

Loading

codecov bot commented Jun 29, 2018 •

edited

Loading

jreback commented Jul 1, 2018 •

edited

Loading

jorisvandenbossche commented Jul 2, 2018

jorisvandenbossche commented Jul 2, 2018

jreback commented Jul 2, 2018

jorisvandenbossche commented Jul 2, 2018

jorisvandenbossche commented Jul 2, 2018

jreback commented Jul 2, 2018

WillAyd commented Jul 3, 2018

jorisvandenbossche commented Jul 3, 2018

jreback commented Oct 23, 2018

jreback Nov 1, 2018

TomAugspurger Dec 21, 2018

jreback commented Nov 18, 2018

h-vetinari commented Nov 19, 2018

jreback commented Dec 2, 2018

pep8speaks commented Dec 18, 2018 •

edited

Loading

TomAugspurger commented Dec 18, 2018

jreback commented Dec 26, 2018

jreback commented Dec 27, 2018

TomAugspurger left a comment

jorisvandenbossche commented Dec 27, 2018

jreback commented Dec 27, 2018

jreback commented Dec 27, 2018

jreback commented Dec 27, 2018

jreback commented Jan 1, 2019

TomAugspurger left a comment

jreback commented Jan 3, 2019

jorisvandenbossche commented Jan 3, 2019

jreback commented Jan 3, 2019 •

edited

Loading

kevalshah90 commented Apr 23, 2020

Allow merging on object / non-object column #21681

Allow merging on object / non-object column #21681

Conversation

jorisvandenbossche commented Jun 29, 2018 • edited by TomAugspurger Loading

codecov bot commented Jun 29, 2018 • edited Loading

Codecov Report

jreback commented Jul 1, 2018 • edited Loading

jorisvandenbossche commented Jul 2, 2018

jorisvandenbossche commented Jul 2, 2018

jreback commented Jul 2, 2018

jorisvandenbossche commented Jul 2, 2018

jorisvandenbossche commented Jul 2, 2018

jreback commented Jul 2, 2018

WillAyd commented Jul 3, 2018

jorisvandenbossche commented Jul 3, 2018

jreback commented Oct 23, 2018

jreback Nov 1, 2018

Choose a reason for hiding this comment

TomAugspurger Dec 21, 2018

Choose a reason for hiding this comment

jreback commented Nov 18, 2018

h-vetinari commented Nov 19, 2018

jreback commented Dec 2, 2018

pep8speaks commented Dec 18, 2018 • edited Loading

Comment last updated on December 27, 2018 at 12:56 Hours UTC

TomAugspurger commented Dec 18, 2018

jreback commented Dec 26, 2018

jreback commented Dec 27, 2018

TomAugspurger left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Dec 27, 2018

jreback commented Dec 27, 2018

jreback commented Dec 27, 2018

jreback commented Dec 27, 2018

jreback commented Jan 1, 2019

TomAugspurger left a comment

Choose a reason for hiding this comment

jreback commented Jan 3, 2019

jorisvandenbossche commented Jan 3, 2019

jreback commented Jan 3, 2019 • edited Loading

kevalshah90 commented Apr 23, 2020

jorisvandenbossche commented Jun 29, 2018 •

edited by TomAugspurger

Loading

codecov bot commented Jun 29, 2018 •

edited

Loading

jreback commented Jul 1, 2018 •

edited

Loading

pep8speaks commented Dec 18, 2018 •

edited

Loading

jreback commented Jan 3, 2019 •

edited

Loading