New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: option to check merge is one-to-one, many-to-one, one-to-many, or many-to-many #16270

Closed
nickeubank opened this Issue May 6, 2017 · 4 comments

Comments

Projects
None yet
3 participants
@nickeubank
Contributor

nickeubank commented May 6, 2017

In my experience, there are few places in data work where problems with data are more evident than when merging datasets -- something that is both a problem (if you think you're doing a one-to-one merge and one of the keys isn't unique in one dataset, you can introduce huge problems) and an opportunity (checking that a merge works as expected is a great way to catch problems).

With that in mind, I'd like to propose adding a check_merge argument that accepts strings one_to_one, one_to_many, may_to_one, and many_to_many. If one of those strings is passed, the merge function would then check uniqueness of merge variables before running the merge, and if for example the merge key for the first dataset has duplicates in a one_to_many merge, it would throw an (informative) exception.

(Stata made a similar move to bake this functionality into its merge command around Stata 12 using 1:1, 1:m, m:1 syntax, which I'm also open to).

Though this functionality can be replicated with user tests, it gets tiring to write them every time...

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback May 6, 2017

Contributor

see also #15068

Contributor

jreback commented May 6, 2017

see also #15068

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback May 6, 2017

Contributor

I like the stat syntax, so maybe for readability accept both. Also call this check= I think. or maybe validate=

Contributor

jreback commented May 6, 2017

I like the stat syntax, so maybe for readability accept both. Also call this check= I think. or maybe validate=

@jreback jreback added this to the Next Major Release milestone May 6, 2017

@nickeubank

This comment has been minimized.

Show comment
Hide comment
@nickeubank

nickeubank May 6, 2017

Contributor

Sounds good. Will pull together a PR and ping for input.

Contributor

nickeubank commented May 6, 2017

Sounds good. Will pull together a PR and ping for input.

@nickeubank

This comment has been minimized.

Show comment
Hide comment
@nickeubank

nickeubank May 7, 2017

Contributor

Thoughts on whether it should check that keys AREN'T unique for the many tests?

i.e. if someone runs merge(left, right, on='a', validate="many_to_one") and it turns out that a is unique in left, should that throw an exception? I don't think Stata does, and my inclination is to say no, but open to input

Contributor

nickeubank commented May 7, 2017

Thoughts on whether it should check that keys AREN'T unique for the many tests?

i.e. if someone runs merge(left, right, on='a', validate="many_to_one") and it turns out that a is unique in left, should that throw an exception? I don't think Stata does, and my inclination is to say no, but open to input

@nickeubank nickeubank referenced this issue May 7, 2017

Merged

ENH: add validate argument to merge #16275

4 of 4 tasks complete

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.21.0, Next Major Release May 22, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment