Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding bulk inserts or updates for PostgreSQL #226

Merged
merged 12 commits into from Jan 2, 2021

Conversation

PedroD
Copy link

@PedroD PedroD commented Dec 28, 2020

Adding support for PostgreSQL bulk insert or update (also allowing to bulk ignore conflicts).

         database.bulkInsert(Employees) {
                item {
                    set(it.id, 1)
                    set(it.name, "vince")
                    set(it.job, "engineer")
                    set(it.salary, 1000)
                    set(it.hireDate, LocalDate.now())
                    set(it.departmentId, 1)
                }
                item {
                    set(it.id, 5)
                    set(it.name, "vince")
                    set(it.job, "engineer")
                    set(it.salary, 1000)
                    set(it.hireDate, LocalDate.now())
                    set(it.departmentId, 1)
                }

                onDuplicateKey(Employees.id) {
                        set(it.salary, it.salary + 900)
                }
            }

(Will add +900 to each already existing employee salary)

         database.bulkInsert(Employees) {
                item {
                    set(it.id, 1)
                    set(it.name, "vince")
                    set(it.job, "engineer")
                    set(it.salary, 1000)
                    set(it.hireDate, LocalDate.now())
                    set(it.departmentId, 1)
                }
                item {
                    set(it.id, 5)
                    set(it.name, "vince")
                    set(it.job, "engineer")
                    set(it.salary, 1000)
                    set(it.hireDate, LocalDate.now())
                    set(it.departmentId, 1)
                }

                onDuplicateKey(Employees.id) {
                }
            }

(Will ignore any already existing employee and continue the query)

* }
* ```
*
* @since 2.7
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vincentlauvlwj what version should go here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next version 3.3.0

@vincentlauvlwj
Copy link
Member

Thank you for your contribution!!

One question: What's the difference between calling onDuplicateKey with an empty closure and not calling it? What if a user wants to throw an exception when any key conflict exists? Seems it is a more common case than just ignoring conflicts.

@vincentlauvlwj
Copy link
Member

Also, please update the build.gradle file, add your GitHub ID to the developers info (line 90), that will let more people know your contributions.

@PedroD
Copy link
Author

PedroD commented Dec 29, 2020

Thank you for your contribution!!

One question: What's the difference between calling onDuplicateKey with an empty closure and not calling it? What if a user wants to throw an exception when any key conflict exists? Seems it is a more common case than just ignoring conflicts.

Good question, I haven't though about that scenario...

But thinking about it now, I only see two possibilities:

Given that this method is an Insert or Update, it will potentially never throw an error on finding duplicates with the given key columns (by design). And this behavior is probably enforced by PostgreSQL itself. So the developer using a query like:

INSERT INTO $table ($col1, $col2, ...)
VALUES (...), (...), (...), ...
ON CONFLICT ($col1, $col2, ...)
DO UPDATE SET ....

Must be aware that this will never return an error for the first-level "collision" of ($col1, $col2, ...).

So, my conceptual line of though was the following; Given that when I use "upserts" I either want to:

  1. Insert as new record if not exists, or update the existing record matching the conflict columns I mention in the query
  2. Insert as new record if not exists, and ignore record if already exists matching the conflict columns I mention in the query
  3. (maybe) Insert as new record if not exists, and receive an error if it conflicts

So, to account for scenario 1 we can use the form:

// Generates a DO UPDATE SET
onDuplicateKey(Employees.id) {
      set(it.salary, it.salary + 900)
}

For scenario 2 we can use the form:

// Generates a DO NOTHING
onDuplicateKey(Employees.id) {
}

For scenario 3, it cannot be done in this type of query as far as I understand by design of PostgreSQL. There's no way to throw an error if a record already exists matching the conflict columns I mention in the query.
However, it will throw an error from the DBMS if:

  • it finds a collision with any column constraint not matching the conflict ones mentioned in the query
  • if during the DO UPDATE operation one or more SETs create a new collision

This is all behavior that is native from PostgreSQL, given that there are only two actions we can do in these type of queries (NOTHING and UPDATE) according to their documentation.

So, for scenario 3, the user should probably use the default batchInsert which, by not performing any update/upsert, will return an error on collisions.

Does this make sense to you?

@PedroD
Copy link
Author

PedroD commented Dec 29, 2020

PS. I made some commits renaming these methods to make more clear what they do and what operation they represent (Insert or Update, i.e. Upsert).

Also, what is the difference between a batchInsert and a bulkInsert in the case of MySQL, why not simply overwrite the default batchInsert?
Should I give the name of bulkInsertOrUpdate or batchInsertOrUpdate to the method I just implemented on this PR?

@vincentlauvlwj
Copy link
Member

vincentlauvlwj commented Dec 31, 2020

Thank you for the explanation, I fully understand what you mean. But would it be possible to generate an insert SQL without on conflict clause when onDuplicateKey is not called? For example:

database.bulkInsert(Employees) {
    item {
        set(it.name, "jerry")
        set(it.job, "trainee")
        set(it.managerId, 1)
        set(it.hireDate, LocalDate.now())
        set(it.salary, 50)
        set(it.departmentId, 1)
    }
    item {
        set(it.name, "linda")
        set(it.job, "assistant")
        set(it.managerId, 3)
        set(it.hireDate, LocalDate.now())
        set(it.salary, 100)
        set(it.departmentId, 2)
    }
}

Then generated SQL:

INSERT INTO t_employee (name, job, manager_id, hire_date, salary, department_id) 
VALUES (?, ?, ?, ?, ?, ?), (?, ?, ?, ?, ?, ?) 

This is exactly the scenario 3 you mentioned, it inserts new records if not exists, and throws an error on collisions.

@vincentlauvlwj
Copy link
Member

If it is possible to make onDuplicateKey optional, then we don't need to rename the function to bulkInsertOrUpdate. bulkInsert is good.

@PedroD
Copy link
Author

PedroD commented Jan 1, 2021

If it is possible to make onDuplicateKey optional, then we don't need to rename the function to bulkInsertOrUpdate. bulkInsert is good.

Ok, I get the idea, I didn't go that way because it didn't seem to be the way the code was designed initially, meaning that I will need to change this behavior to allow an empty "conflictTarget" to exist (remove that ifEmpty part):

image

Is this behavior also intended on insertOrUpdate()? I am gonna assume "yes", so please let me know.

I applied these changes, so that both insertOrUpdate and bulkInsert now have their "update" part as optional, so that we can cover scenario 3 as well with both methods.


I also added a TODO note on the tests, because I think some tests have overlapping ids, so maybe we should make sure each test generates unique (thread-safe) entry ids? Let me know what you think.

image


I am still wondering if bulkInsert is the best name for this method, or if it should be batchInsertOrUpdate, so it is consistent with the pair (default) insert method and (pgsql) insertOrUpdate. Because as a user, if my IDE shows me batchInsert and bulkInsert I am left wondering what difference does bulk and batch reflect, and after reading the docs I see that batchInsert has no Update part and bulkInsert has. bulk does not convey this difference between these 2 methods IMO.

image
VS
image
or even VS
image

@vincentlauvlwj vincentlauvlwj changed the base branch from master to v3.3.x January 2, 2021 06:25
@vincentlauvlwj vincentlauvlwj merged commit 72c24ed into kotlin-orm:v3.3.x Jan 2, 2021
@vincentlauvlwj
Copy link
Member

Merged, I will release v3.3.0 these days.

@vincentlauvlwj
Copy link
Member

batchInsert is implemented based on JDBC's addBatch and executeBatch, it generates multiple SQL statements, while bulkInsert generates only one SQL statement, that's the difference.

@vincentlauvlwj
Copy link
Member

You are right, the name bulkInsert is not good enough, with this name it should not have the Update part.

Maybe it is better to have another function bulkInsertOrUpdate, I will do a refactoring based on your code.

@PedroD
Copy link
Author

PedroD commented Jan 4, 2021

@vincentlauvlwj awesome!

Keep up the great work! 💪

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants