Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow merging on object / non-object column #21681

Merged
merged 10 commits into from Jan 3, 2019

Conversation

Projects
None yet
6 participants
@jorisvandenbossche
Copy link
Member

commented Jun 29, 2018

Follow-up on discussion in #21310 (cc @jreback )

What this does: allow merge on an object and a non-object column (eg object and int, object and categorical, ..), by coercing the non-object column to object dtype.

Reasoning:

  • object dtype can be anything, so IMO pandas should not try to be smart and think to know the merge is not valid
  • all other, certainly invalid dtype combinations (eg int and datetime64) keep raising an error

The main disadvantage is that we no longer detect the case of merging object strings with eg integer column, but IMO we cannot avoid such cases as long as we don't have a proper string dtype but use object dtype for this (you can also mix string and ints in a column, we also don't raise for that for a similar reason).

Additional option would be to issue a warning in this case.

Closes #23733

@codecov

This comment has been minimized.

Copy link

commented Jun 29, 2018

Codecov Report

Merging #21681 into master will not change coverage.
The diff coverage is 88%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21681   +/-   ##
=======================================
  Coverage   92.31%   92.31%           
=======================================
  Files         166      166           
  Lines       52391    52404   +13     
=======================================
+ Hits        48363    48375   +12     
- Misses       4028     4029    +1
Flag Coverage Δ
#multiple 90.73% <88%> (ø) ⬆️
#single 43.03% <0%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/merge.py 94.26% <88%> (-0.04%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c4ac0d6...647f2de. Read the comment docs.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 1, 2018

-1 in this - if u want to merge an object column with a non object column then coerce to object

this will propagate objects columns very easily - against the principle of strongly dtyping things
what is the use case that is motivating this?

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Jul 2, 2018

what is the use case that is motivating this?

The concrete case was a regression in the released version of geopandas because of the change in pandas to start raising errors on merging with unequal dtypes. This actual case was caused because we first merge an empty frame with another one, which creates object columns of NaN, and then merging the result yet with another, where the object column of NaN is merged on a float column (integers with NaNs), raising the error.
Yes, we can workaround this, by for example coercing both key columns to object as you mentioned (or coercing both to float). But that adds some complexity to the code, and given that I had it in geopandas could indicate that also other people might run into it.

But my main reason is what I said in the top post: object dtype can be anything => pandas should not make assumptions on what it thinks is valid/invalid for operating on such data. That's up to the user.
But given that merging by accident strings on integers (where the strings should have been coerced to ints before) silently gives a not useful result, we could raise a warning as well.

It also boils down a bit to that the original change to start erroring should ideally have been done with a deprecation warning instead of directly erroring, so it would not cause regression directly.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Jul 2, 2018

There was actually a question on SO today reporting a problem in geopandas caused by this: https://stackoverflow.com/questions/51138156/how-can-i-use-sjoin-to-merge-two-geodataframes-without-an-error.

In a certain way, you could also see this PR as a regression fix. We probably should have made such a change with a deprecation warning first (if we want the original change).

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 2, 2018

I disagree here. merging on non-alike dtypes is simply wrong. objects cannot merge with non-object, full stop. Sure we have a special case for boolean, but allowing object with integer is just a disaster. '

If you want to try a bit more inference I am ok with that. e.g. say you have an object column, then am ok with running infer_dtype on it. and allowing that to merge with a similar type (e.g. integer). That is a trivial coercion.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Jul 2, 2018

objects cannot merge with non-object, full stop.

They can perfectly well, depending on what is inside the object dtype column. It's not about "cannot", it is about "do we allow it or not". And for the record, until one month ago pandas has always allowed it.

If you want to try a bit more inference I am ok with that.

I have thought about that as well, but inference can be costly, so I would rather not do it.
Also, it will not work in all cases, eg with an empty frame.

Are you OK with raising an informative warning instead? Where we clearly indicate to the user what is happening and this typically points to a problem with their types? (but at least it allow people to ignore the warning, or to fix it only later) (I thought you hinted on being OK with that in the previous PR, but not sure)

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Jul 2, 2018

BTW, input from some other devs is certainly welcome, Jeff and I clearly disagree on this .. :-)

@jorisvandenbossche jorisvandenbossche added this to the 0.23.2 milestone Jul 2, 2018

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 2, 2018

Are you OK with raising an informative warning instead? Where we clearly indicate to the user what is happening and this typically points to a problem with their types? (but at least it allow people to ignore the warning, or to fix it only later) (I thought you hinted on being OK with that in the previous PR, but not sure)

I guess that is ok. The original reason that this was changed is to avoid the all-NaN problem (e.g. you merge with something that has no matches because of the dtype mismatch). I don't want to go back to that as its worse from a user POV.

open to compromise but would rather be strict here. (and inference is not generally expensive compared to a borked merge :>)

@WillAyd

This comment has been minimized.

Copy link
Member

commented Jul 3, 2018

I'm +1 on being strict. At the very least if the end user is explicit about types they stand to benefit performance- and resource-wise instead of pandas falling back to object.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2018

(and inference is not generally expensive compared to a borked merge :>)

yeah, that is true .. :)
Thinking this a bit more through, then the main problem seems to how to handle the result. For example then need to add logic that eg 'integer' and 'mixed-integer-float' might be OK, which is repeating the logic (and also need to special case the 'empty').
Of course, I could also do the full existing check based on the inferred types hierarchy (to avoid such duplicated logic), assuming that for non-object dtypes this inferring of the type is cheap.

The original reason that this was changed is to avoid the all-NaN problem .. I don't want to go back to that as its worse from a user POV.

I just realized that I actually think a warning can be better from a user POV. Instead of the error, you would then see the NaNs, together with the informative warning explaining why the NaNs are there and how to solve it. I think this could actually be better to get what is happening than only the error.

And this also alleviates the direct code breakages caused by the change in 0.23.0 to be more strict.

At the very least if the end user is explicit about types they stand to benefit performance- and resource-wise instead of pandas falling back to object.

Yes, but note that there can be good reasons to want or to end up with object data, and not all use cases of pandas are performance sensitive (eg small data).

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.23.2, 0.23.3 Jul 5, 2018

@jreback jreback modified the milestones: 0.23.4, 0.23.5 Aug 6, 2018

@jreback jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018

@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 23, 2018

@jorisvandenbossche can you rebase this

@@ -946,11 +947,14 @@ def _maybe_coerce_merge_keys(self):
"you should use pd.concat".format(lk_dtype=lk.dtype,
rk_dtype=rk.dtype))

coerce_to_object = False
if is_object_dtype(lk) or is_object_dtype(rk):

This comment has been minimized.

Copy link
@jreback

jreback Nov 1, 2018

Contributor

is this actually necessary here? e.g. it would hit the else clause if this is true as well.

Then can just do the conversions in the else. (e.g. stuff on line 1011)

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Dec 21, 2018

Contributor

Without this we hit

    956             elif ((is_numeric_dtype(lk) and not is_bool_dtype(lk))
    957                     and not is_numeric_dtype(rk)):
--> 958                 raise ValueError(msg)

and raise.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Nov 18, 2018

@jorisvandenbossche can you merge master

@h-vetinari

This comment has been minimized.

Copy link
Contributor

commented Nov 19, 2018

I've had colleages come and complain to me about this recently - especially since this used to work.

I would advocate to be strict, but based on the inferred type. The case where it's an all-integer index hiding behind an object dtype would be easily solved through that, and the other cases would continue to emit the warning.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Dec 2, 2018

@jorisvandenbossche can you merge master

@pep8speaks

This comment has been minimized.

Copy link

commented Dec 18, 2018

Hello @jorisvandenbossche! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 27, 2018 at 12:56 Hours UTC
@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Dec 18, 2018

Fixed the merge conflicts. Haven't taken a look at the changes yet.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Dec 26, 2018

I pushed a commit which fixes this up in a reasonably clean way. if folks can look at the cases and see if anything glaring pops out. We have a few odd cases already, e.g. bools are inferred if possible and we actually allow stringlike non-unicode strings to infer, but these worked before.

cc @pandas-dev/pandas-core

jreback added some commits Dec 26, 2018

@jreback

This comment has been minimized.

Copy link
Contributor

commented Dec 27, 2018

@TomAugspurger
Copy link
Contributor

left a comment

LGTM.

May leave this open for a day in case @jorisvandenbossche is around to look tomorrow, but I think this is good to go.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Dec 27, 2018

@jreback So basically you reverted to allow again all combinations of an object dtype with a non-object dtype?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Dec 27, 2018

yes anything with object just forces conversion to object

we
could be more specific and disallow strings vs non strings

@jreback

This comment has been minimized.

Copy link
Contributor

commented Dec 27, 2018

added a commit to be more restrictive.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Dec 27, 2018

@jorisvandenbossche if you have a chance

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 1, 2019

@TomAugspurger can you look

@TomAugspurger
Copy link
Contributor

left a comment

LGTM

@jreback jreback merged commit 819418e into pandas-dev:master Jan 3, 2019

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20181230.61 succeeded
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 3, 2019

@jorisvandenbossche jorisvandenbossche deleted the jorisvandenbossche:merge-object-columns branch Jan 3, 2019

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Jan 3, 2019

@jreback and thanks for finishing the PR!

One thing I note now, after your last comment to be a bit more strict on merging object and numeric columns, the whatsnew note is not fully correct any more.

I am also wondering a bit how costly this inferring of the type for object columns is.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 3, 2019

One thing I note now, after your last comment to be a bit more strict on merging object and numeric columns, the whatsnew note is not fully correct any more.

yeah ok we are a bit more strict so whatsnew could be sligthly better.

inference is << cost of actually merging things, its a no-brainer and we do it all over the place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.