Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow merging on object / non-object column #21681

Merged
merged 10 commits into from
Jan 3, 2019

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Jun 29, 2018

Follow-up on discussion in #21310 (cc @jreback )

What this does: allow merge on an object and a non-object column (eg object and int, object and categorical, ..), by coercing the non-object column to object dtype.

Reasoning:

  • object dtype can be anything, so IMO pandas should not try to be smart and think to know the merge is not valid
  • all other, certainly invalid dtype combinations (eg int and datetime64) keep raising an error

The main disadvantage is that we no longer detect the case of merging object strings with eg integer column, but IMO we cannot avoid such cases as long as we don't have a proper string dtype but use object dtype for this (you can also mix string and ints in a column, we also don't raise for that for a similar reason).

Additional option would be to issue a warning in this case.

Closes #23733

@codecov
Copy link

codecov bot commented Jun 29, 2018

Codecov Report

Merging #21681 into master will not change coverage.
The diff coverage is 88%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21681   +/-   ##
=======================================
  Coverage   92.31%   92.31%           
=======================================
  Files         166      166           
  Lines       52391    52404   +13     
=======================================
+ Hits        48363    48375   +12     
- Misses       4028     4029    +1
Flag Coverage Δ
#multiple 90.73% <88%> (ø) ⬆️
#single 43.03% <0%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/merge.py 94.26% <88%> (-0.04%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c4ac0d6...647f2de. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Jul 1, 2018

-1 in this - if u want to merge an object column with a non object column then coerce to object

this will propagate objects columns very easily - against the principle of strongly dtyping things
what is the use case that is motivating this?

@jorisvandenbossche
Copy link
Member Author

what is the use case that is motivating this?

The concrete case was a regression in the released version of geopandas because of the change in pandas to start raising errors on merging with unequal dtypes. This actual case was caused because we first merge an empty frame with another one, which creates object columns of NaN, and then merging the result yet with another, where the object column of NaN is merged on a float column (integers with NaNs), raising the error.
Yes, we can workaround this, by for example coercing both key columns to object as you mentioned (or coercing both to float). But that adds some complexity to the code, and given that I had it in geopandas could indicate that also other people might run into it.

But my main reason is what I said in the top post: object dtype can be anything => pandas should not make assumptions on what it thinks is valid/invalid for operating on such data. That's up to the user.
But given that merging by accident strings on integers (where the strings should have been coerced to ints before) silently gives a not useful result, we could raise a warning as well.

It also boils down a bit to that the original change to start erroring should ideally have been done with a deprecation warning instead of directly erroring, so it would not cause regression directly.

@jorisvandenbossche
Copy link
Member Author

There was actually a question on SO today reporting a problem in geopandas caused by this: https://stackoverflow.com/questions/51138156/how-can-i-use-sjoin-to-merge-two-geodataframes-without-an-error.

In a certain way, you could also see this PR as a regression fix. We probably should have made such a change with a deprecation warning first (if we want the original change).

@jreback
Copy link
Contributor

jreback commented Jul 2, 2018

I disagree here. merging on non-alike dtypes is simply wrong. objects cannot merge with non-object, full stop. Sure we have a special case for boolean, but allowing object with integer is just a disaster. '

If you want to try a bit more inference I am ok with that. e.g. say you have an object column, then am ok with running infer_dtype on it. and allowing that to merge with a similar type (e.g. integer). That is a trivial coercion.

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions labels Jul 2, 2018
@jorisvandenbossche
Copy link
Member Author

objects cannot merge with non-object, full stop.

They can perfectly well, depending on what is inside the object dtype column. It's not about "cannot", it is about "do we allow it or not". And for the record, until one month ago pandas has always allowed it.

If you want to try a bit more inference I am ok with that.

I have thought about that as well, but inference can be costly, so I would rather not do it.
Also, it will not work in all cases, eg with an empty frame.

Are you OK with raising an informative warning instead? Where we clearly indicate to the user what is happening and this typically points to a problem with their types? (but at least it allow people to ignore the warning, or to fix it only later) (I thought you hinted on being OK with that in the previous PR, but not sure)

@jorisvandenbossche
Copy link
Member Author

BTW, input from some other devs is certainly welcome, Jeff and I clearly disagree on this .. :-)

@jorisvandenbossche jorisvandenbossche added this to the 0.23.2 milestone Jul 2, 2018
@jreback
Copy link
Contributor

jreback commented Jul 2, 2018

Are you OK with raising an informative warning instead? Where we clearly indicate to the user what is happening and this typically points to a problem with their types? (but at least it allow people to ignore the warning, or to fix it only later) (I thought you hinted on being OK with that in the previous PR, but not sure)

I guess that is ok. The original reason that this was changed is to avoid the all-NaN problem (e.g. you merge with something that has no matches because of the dtype mismatch). I don't want to go back to that as its worse from a user POV.

open to compromise but would rather be strict here. (and inference is not generally expensive compared to a borked merge :>)

@WillAyd
Copy link
Member

WillAyd commented Jul 3, 2018

I'm +1 on being strict. At the very least if the end user is explicit about types they stand to benefit performance- and resource-wise instead of pandas falling back to object.

@jorisvandenbossche
Copy link
Member Author

(and inference is not generally expensive compared to a borked merge :>)

yeah, that is true .. :)
Thinking this a bit more through, then the main problem seems to how to handle the result. For example then need to add logic that eg 'integer' and 'mixed-integer-float' might be OK, which is repeating the logic (and also need to special case the 'empty').
Of course, I could also do the full existing check based on the inferred types hierarchy (to avoid such duplicated logic), assuming that for non-object dtypes this inferring of the type is cheap.

The original reason that this was changed is to avoid the all-NaN problem .. I don't want to go back to that as its worse from a user POV.

I just realized that I actually think a warning can be better from a user POV. Instead of the error, you would then see the NaNs, together with the informative warning explaining why the NaNs are there and how to solve it. I think this could actually be better to get what is happening than only the error.

And this also alleviates the direct code breakages caused by the change in 0.23.0 to be more strict.

At the very least if the end user is explicit about types they stand to benefit performance- and resource-wise instead of pandas falling back to object.

Yes, but note that there can be good reasons to want or to end up with object data, and not all use cases of pandas are performance sensitive (eg small data).

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.23.2, 0.23.3 Jul 5, 2018
@jreback jreback modified the milestones: 0.23.4, 0.23.5 Aug 6, 2018
@jreback jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018
@jreback
Copy link
Contributor

jreback commented Oct 23, 2018

@jorisvandenbossche can you rebase this

@@ -946,11 +947,14 @@ def _maybe_coerce_merge_keys(self):
"you should use pd.concat".format(lk_dtype=lk.dtype,
rk_dtype=rk.dtype))

coerce_to_object = False
if is_object_dtype(lk) or is_object_dtype(rk):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this actually necessary here? e.g. it would hit the else clause if this is true as well.

Then can just do the conversions in the else. (e.g. stuff on line 1011)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this we hit

    956             elif ((is_numeric_dtype(lk) and not is_bool_dtype(lk))
    957                     and not is_numeric_dtype(rk)):
--> 958                 raise ValueError(msg)

and raise.

@jreback
Copy link
Contributor

jreback commented Nov 18, 2018

@jorisvandenbossche can you merge master

@h-vetinari
Copy link
Contributor

I've had colleages come and complain to me about this recently - especially since this used to work.

I would advocate to be strict, but based on the inferred type. The case where it's an all-integer index hiding behind an object dtype would be easily solved through that, and the other cases would continue to emit the warning.

@jreback
Copy link
Contributor

jreback commented Dec 2, 2018

@jorisvandenbossche can you merge master

@pep8speaks
Copy link

pep8speaks commented Dec 18, 2018

Hello @jorisvandenbossche! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 27, 2018 at 12:56 Hours UTC

@TomAugspurger
Copy link
Contributor

Fixed the merge conflicts. Haven't taken a look at the changes yet.

@jreback
Copy link
Contributor

jreback commented Dec 26, 2018

I pushed a commit which fixes this up in a reasonably clean way. if folks can look at the cases and see if anything glaring pops out. We have a few odd cases already, e.g. bools are inferred if possible and we actually allow stringlike non-unicode strings to infer, but these worked before.

cc @pandas-dev/pandas-core

@jreback
Copy link
Contributor

jreback commented Dec 27, 2018

@TomAugspurger

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

May leave this open for a day in case @jorisvandenbossche is around to look tomorrow, but I think this is good to go.

@jorisvandenbossche
Copy link
Member Author

@jreback So basically you reverted to allow again all combinations of an object dtype with a non-object dtype?

@jreback
Copy link
Contributor

jreback commented Dec 27, 2018

yes anything with object just forces conversion to object

we
could be more specific and disallow strings vs non strings

@jreback
Copy link
Contributor

jreback commented Dec 27, 2018

added a commit to be more restrictive.

@jreback
Copy link
Contributor

jreback commented Dec 27, 2018

@jorisvandenbossche if you have a chance

@jreback
Copy link
Contributor

jreback commented Jan 1, 2019

@TomAugspurger can you look

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jreback jreback merged commit 819418e into pandas-dev:master Jan 3, 2019
@jreback
Copy link
Contributor

jreback commented Jan 3, 2019

thanks @jorisvandenbossche !

@jorisvandenbossche jorisvandenbossche deleted the merge-object-columns branch January 3, 2019 22:50
@jorisvandenbossche
Copy link
Member Author

@jreback and thanks for finishing the PR!

One thing I note now, after your last comment to be a bit more strict on merging object and numeric columns, the whatsnew note is not fully correct any more.

I am also wondering a bit how costly this inferring of the type for object columns is.

@jreback
Copy link
Contributor

jreback commented Jan 3, 2019

One thing I note now, after your last comment to be a bit more strict on merging object and numeric columns, the whatsnew note is not fully correct any more.

yeah ok we are a bit more strict so whatsnew could be sligthly better.

inference is << cost of actually merging things, its a no-brainer and we do it all over the place.

@kevalshah90
Copy link

I am running into this issue and found this thread. Was this change pushed to pandas or can we not merge object/non-object type.

Here is my code:

res_leads = pd.merge(df_temp,
                         df_market[['Tenant_Industry','Tenant','num_employees','size_calc']],
                         left_on = df_temp.index,
                         right_on = 'Tenant',
                         how='left')

Error traceback:

Traceback (most recent call last):
  File "/Users/....../market.py", line 267, in render_table
    result = top_leads(company, market)
  File "/Users/...../CRE/Platform/Deploy/return_leads.py", line 77, in top_leads
    how='left')
  File "/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 61, in merge
    validate=validate)
  File "/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 555, in __init__
    self._maybe_coerce_merge_keys()
  File "/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 983, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merging on int64 and object columns, pandas 0.23.4
7 participants