Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding (Insert or update if key exists) option to .to_sql #14553 #29636

Closed
wants to merge 125 commits into from

Conversation

cvonsteg
Copy link

This PR adds SQL upsert functionality to the to_sql method. As outlined in the issue discussion, there will be 2 types of upsert:

  1. upsert_delete - prioritizes new data from incoming dataframe. Deletes clashing records in the database and then inserts new dataframe.
  2. upsert_ignore - prioritizes what is already in the database, over what is coming in from the new dataframe. Deletes records which already exist, from existing dataframe before inserting the rest into the db.

Both upserts currently only check for primary key duplicates.

Please note 1 important change over the design proposal made in #14553 - upsert_delete and upsert_ignore have now been made options for the if_exists arg, rather than the method argument. This was revised, because on second consideration, upsert was deemed mutually exclusive to append and replace, but not to the method argument, so making it part of if_exists seemed the more sound design.

…t 88a5a481b

git-subtree-dir: vendor/github.com/V0RT3X4/python_utils
git-subtree-split: 88a5a481b5dbec610e762df862fd69918c1b77d4
git-vendor-name: python_utils
git-vendor-dir: vendor/github.com/V0RT3X4/python_utils
git-vendor-repository: git@github.com:V0RT3X4/python_utils.git
git-vendor-ref: master
git-vendor-name: python_utils
git-vendor-dir: vendor/github.com/V0RT3X4/python_utils
git-vendor-repository: git@github.com:V0RT3X4/python_utils.git
git-vendor-ref: master
…tch function written down for deleting pkeys
…tch function written down for deleting pkeys
…t 88a5a481b

git-subtree-dir: vendor/github.com/V0RT3X4/python_utils
git-subtree-split: 88a5a481b5dbec610e762df862fd69918c1b77d4
git-vendor-name: python_utils
git-vendor-dir: vendor/github.com/V0RT3X4/python_utils
git-vendor-repository: git@github.com:V0RT3X4/python_utils.git
git-vendor-ref: master
git-vendor-name: python_utils
git-vendor-dir: vendor/github.com/V0RT3X4/python_utils
git-vendor-repository: git@github.com:V0RT3X4/python_utils.git
git-vendor-ref: master
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cvonsteg

you are changing so much that this will need extensive review and likley will take an extended time to merge if that even happens.

I would suggest starting with a really simple change and not try to do everything. you are likely to have better results.

on_conflict : {None, 'do_nothing', 'do_update'}, optional
Determine insertion behaviour in case of a primary key clash.
- None: Do nothing to handle primary key clashes, will raise an Error.
- 'do_nothing': Ignore incoming rows with primary key clashes, and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is completely different than the other arguments above.

@cvonsteg
Copy link
Author

cvonsteg commented Oct 7, 2021

@jreback I appreciate that a change of this size is difficult to review properly. Elaborating on your suggestion of starting simple - do you mean breaking this functionality down into smaller components which can be added to master one at a time e.g. a first PR to add a method which checks if there will be primary key clashes before calling to_sql, then one to return the actual keys, then one to split a dataframe by those keys, etc. working our way up to the full insert or update functionality? Or do you mean branching off this PR so smaller, more easily reviewable changes can be merged into this branch 1 by 1, and then merging everything into master after?

@jreback
Copy link
Contributor

jreback commented Oct 7, 2021

@cvonsteg well, if you are refactoring at all, then push this as a separate PR that can be merged first.
do just a single upsert option to avoid confusion.

@mzeitlin11 mzeitlin11 mentioned this pull request Oct 13, 2021
@maveec
Copy link

maveec commented Nov 17, 2021

Any chance this will be shipped in the coming 1.4 or 1.5 release?

@jreback
Copy link
Contributor

jreback commented Nov 18, 2021

this PR has lots of comments which need to be addressed

@cvonsteg
Copy link
Author

@jreback - taking your feedback around smaller changes into consideration, I propose splitting this into the following as separate PRs:

  1. on_conflict do_nothing - add the on_conflict parameter to the to_sql method. This will only check for conflicts in primary keys and only provide a do_nothing option, as this requires fewer changes changes than the do_update functionality. I still plan to use None as default for on_conflict as we want to bypass this behaviour altogether if doing table-wide operations such as append, replace, etc.
  2. on_conflict do_update - Add the do_update option. This will still look solely at primary keys, but will allow users to overwrite the entire row for cases where they have a conflict.

If the above sounds good for you, I will raise these as new PRs instead of trying to untangle what has been done on this branch.

@jreback
Copy link
Contributor

jreback commented Nov 28, 2021

@cvonsteg sounds ok. key is well tested & documented changes.

@b0lle
Copy link

b0lle commented Jan 20, 2022

Hi Folks, is there any update on this?

@pranjal-joshi
Copy link

This needs to be merged and shipped with the latest update! Why it is kept on hold even after passing all of the tests?

@frafra
Copy link

frafra commented Feb 13, 2022

This needs to be merged and shipped with the latest update! Why it is kept on hold even after passing all of the tests?

The branch cannot be merged as there are conflicts, and some changes have been requested.

@TheGustafson
Copy link

I'd like to add that I am really rooting for this change. I would immediately make use of the new functionality in my projects.

@AkaLucasYu
Copy link

Hi folks, really looking forward to this change! I know this change hasn't been merged yet, but any chance we will see this functionality being also added to the GeoPandas (.to_postgis)?

@antoscha
Copy link

Hi,
This is the most exciting feature pandas may have. Please fix&merge it as soon as possible.

@jreback
Copy link
Contributor

jreback commented Feb 22, 2022

i'll repeat the same statement as above

a simple, well tested and documented change that implements this would be accepted

this PR does way too much and is not mergable

if someone in the community wants to read the comments and implement the requested changes the core team will review and merge

@maveec
Copy link

maveec commented Feb 23, 2022

Could some of the capacity of the pandas core devs be allocated to the implementation of these changes? I believe this is one of the most wanted features in pandas.

@jreback
Copy link
Contributor

jreback commented Feb 23, 2022

Could some of the capacity of the pandas core devs be allocated to the implementation of these changes? I believe this is one of the most wanted features in pandas.

pandas is all volunteer

you can certainly ask but it is up to individuals to work in things

@cvonsteg
Copy link
Author

I wanted to add that I fully understand @jreback 's comments, and I agree that this first attempt was doing too much. Unfortunately, due to a change in circumstances over the past few months, I simply don't have time to work on this at the moment. If anybody wants to pick this feature up, please do! I will be more than happy to give feedback and input on what's been done so far, or to clarify any code that is unclear.

@JustinGuese
Copy link

Would sponsoring for this specific issue work? And what do you think is a fair donation amount for the test-fixes?

@jreback
Copy link
Contributor

jreback commented Mar 2, 2022

i don't think anyone has capacity

code contributions are the most helpful

@LSturtew
Copy link

LSturtew commented Mar 3, 2022

@cvonsteg If someone else would want to pick up this feature, should they create separate PRs for the on_conflict do_nothing and on_conflict do_update as proposed above, or continue from this PR?

@ebailin
Copy link

ebailin commented Jan 18, 2023

Is this still in limbo? This is an import enough topic for me, I'd like to help get it out, if possible. Not sure how to go about it, though.

@ManPython
Copy link

I can confirm that most of know solution as external not working due bugs in relation to new pandas and especially sqlalchemy.
https://github.com/ThibTrip/pangres/wiki/Upsert
https://github.com/LawrentChen/pandas_upsert_to_mysql
Is something working let me know, more other solution.

@ThibTrip
Copy link
Contributor

ThibTrip commented Feb 6, 2023

I can confirm that most of know solution as external not working due bugs in relation to new pandas and especially sqlalchemy. https://github.com/ThibTrip/pangres/wiki/Upsert https://github.com/LawrentChen/pandas_upsert_to_mysql Is something working let me know, more other solution.

Yes, sorry about that @ManPython, I made a new release for pangres which supports sqlalchemy>=2.0.0. Please mind the requirements listed in the release notes.

@simonjayhawkins
Copy link
Member

Unfortunately, due to a change in circumstances over the past few months, I simply don't have time to work on this at the moment.

Thanks @cvonsteg for the work on this. closing this PR.

For all those who have been following this, please comment in the issue instead. #14553

@sarathsairam
Copy link

which version of pandas contains this implemented and ready to use?

@gy-mate
Copy link

gy-mate commented Dec 31, 2023

@sarathsairam This was never implemented as this PR was closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding (Insert or update if key exists) option to .to_sql