Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgres upsert behavior inconsistent with "do nothing" #4242

Closed
noahw3 opened this issue Apr 20, 2023 · 8 comments · Fixed by #4370
Closed

Postgres upsert behavior inconsistent with "do nothing" #4242

noahw3 opened this issue Apr 20, 2023 · 8 comments · Fixed by #4370
Labels
bug Something isn't working

Comments

@noahw3
Copy link

noahw3 commented Apr 20, 2023

Describe the bug
This is somewhere between a bug report and a feature request. When performing an upsert call using only the unique key(s) of the entity, the underlying postgres call gets created with an "on conflict do nothing" clause. The problem with this is that, if there is a conflict, postgres will not return anything in the "returning" clause.

This causes a few issues:

  1. Confusion on what the expected return type of upsert is. If the expectation is that it should always return a value, that expectation is broken.
  2. Inconsistency in the upsert API. If used with entity properties in addition to the unique key(s), it will always perform an update and always return a result. If only used with the unique key(s), it may not return the full result.
  3. This poisons the identity map. If an entity is upserted and nothing is returned, that value is cached in the identity map. If you then try to populate a relationship later (without doing any other explicit lookup to refresh the value), it will reuse the partially hydrated value without doing a full lookup.

To Reproduce
Steps to reproduce the behavior:

  1. Create a basic entity with a primary key and another optional/default column
  2. Add a row to the table
  3. Attempt to upsert a value with the same primary key (and only the primary key)
  4. Inspect the result, and see that the return value only includes the primary key
  5. Create another entity with a ManyToOne relationship to the first entity
  6. Attempt to query for a value of the second entity and populate the first entity. Note that if this is done without first attempting to upsert the first entity, it will include all information. If done after attempting to upsert the first entity, it will only include the provided primary key.

Expected behavior
Upsert should always return a full value.

Additional context
I imagine the simplest way to achieve this would be to always perform the update (just using the provided unique key(s)). I suppose this could potentially introduce more writes in situations where it wasn't previously, and it could trigger automatic column updates, but that could already be happening if users were specifying non-unique keys in the query.

Versions

Dependency Version
node 18.15.0
typescript 4.9
mikro-orm 5.6.15
your-driver postgres@5.6.15
@B4nan
Copy link
Member

B4nan commented Apr 21, 2023

Can you share the reproduction as code, please?

@noahw3
Copy link
Author

noahw3 commented Apr 21, 2023

Should look something like this.

Entities:

@Entity()
export class B extends BaseEntity<B, "id"> {
  @PrimaryKey({ type: "uuid", defaultRaw: "uuid_generate_v4()" })
  id!: string;

  @ManyToOne(() => D, { onDelete: "cascade", ref: true })
  d!: Ref<D>;
}

@Entity()
@Unique({
  properties: ["tenantWorkflowId"],
})
export class D extends BaseEntity<D, "id"> {
  @PrimaryKey({ type: "uuid", defaultRaw: "uuid_generate_v4()" })
  id!: string;

  @Property()
  tenantWorkflowId: number;

  @Property({ length: 6, defaultRaw: "now()", onUpdate: () => new Date() })
  updatedAt: Date = new Date();
}

Application code:

const ds = [1];
const loadedDs = await this.em.upsertMany(
  D,
  ds.map(({ tenantWorkflowId }) => ({
    tenantWorkflowId,
  }))
);
await this.em.flush();
// loadedDs should have an id, tenantWorkflowId, and updatedAt. With debug statements on, the query issued was
`insert into "d" ("tenant_workflow_id") values (1) on conflict ("tenant_workflow_id") do nothing returning "id", "updated_at"`

const loadedDs2 = await this.em.upsertMany(
  D,
  ds.map(({ tenantWorkflowId }) => ({
    tenantWorkflowId,
  }))
);
await this.em.flush();
// loadedDs2 only has tenantWorkflowId set. With debug statements on, the query issued was
`insert into "d" ("tenant_workflow_id") values (1) on conflict ("tenant_workflow_id") do nothing returning "id", "updated_at"`
`select "w0"."id" from "d" as "w0" where ("w0"."tenant_workflow_id" = 1)`

const loadedDs3 = await this.em.upsertMany(
  D,
  ds.map(({ tenantWorkflowId }) => ({
    tenantWorkflowId,
    updatedAt: new Date(),
  }))
);
await this.em.flush();
// loadedDs3 has id, tenantWorkflowId, and updatedAt set. With debug statements on, the query issued was
`insert into "d" ("tenant_workflow_id", "updated_at") values (1, '2023-04-21T22:01:07.995Z') on conflict ("tenant_workflow_id") do update set "updated_at" = excluded."updated_at" returning "id", "updated_at"`

Populating the relation:

// Assuming that tenantWorkflowId 1 already exists in the database
const ds = [1];
const loadedDs = await this.em.upsertMany(
  D,
  ds.map(({ tenantWorkflowId }) => ({
    tenantWorkflowId,
  }))
);
await this.em.flush();

await const b = await this.em.upsert(B, {
  d: loadedDs[0],
  order: 0,
  updatedAt: new Date(),
});
await this.em.flush();

// await b.d.load() This only includes the id and tenantWorkflowId, it's missing updatedAt (or any other additional columns on the entity)

@B4nan
Copy link
Member

B4nan commented May 14, 2023

Confusion on what the expected return type of upsert is. If the expectation is that it should always return a value, that expectation is broken.

By "a value" you are referring to what exactly? The updatedAt? The return value is an array of managed entities, and that works as expected - they don't have to be fully initialized, the only invariant is that they will be managed, therefore they will have a primary key.

Maybe we can have a flag to enforce reloading of the entity fully? Maybe it would be even a good default? I really don't know, I was hesitant to add this API because I never needed it myself, so I don't have any first-class experience with the problems it should be solving. From what I understood, people often provide all the values explicitly, and for that it

Inconsistency in the upsert API. If used with entity properties in addition to the unique key(s), it will always perform an update and always return a result. If only used with the unique key(s), it may not return the full result.

This method is dynamic by definition, I don't see this as a problem, e.g. when you use upsert based on PK and the entity is already part of the identity map, you just get an assign on it, without any queries being made. This method is there to ensure some entity is created in the database, it is not meant as a replacement for find. But as said above, maybe that is something to improve.

This poisons the identity map. If an entity is upserted and nothing is returned, that value is cached in the identity map. If you then try to populate a relationship later (without doing any other explicit lookup to refresh the value), it will reuse the partially hydrated value without doing a full lookup.

If this part is what your "Populating the relation:" code snippet should be reproducing, it seems to be working fine on my end (with latest version).

I wasn't really sure what you mean by the snippets, is that supposed to be a sequence, so you literally call upsert + flush 3 times or is that supposed to be a comparison of 3 different versions? I'd appreciate a test case so its clear. Here is what I compiled from your snippets so far, note that I replaced the flush calls with clear as that feels more appropriate if you want to compare how things differ. Note that this is also dependent on the database state, e.g. if I wipe the database between the calls, it all behaves the same.

import { Entity, ManyToOne, PrimaryKey, Property, Ref, Unique } from '@mikro-orm/core';
import { MikroORM } from '@mikro-orm/postgresql';

@Entity()
class B {

  @PrimaryKey({ type: 'uuid', defaultRaw: 'uuid_generate_v4()' })
  id!: string;

  @ManyToOne(() => D, { onDelete: 'cascade', ref: true })
  d!: Ref<D>;

  @Property({ unique: true })
  order!: number;

  @Property({ length: 6, defaultRaw: 'now()', onUpdate: () => new Date() })
  updatedAt: Date = new Date();

}

@Entity()
@Unique({ properties: ['tenantWorkflowId'] })
class D {

  @PrimaryKey({ type: 'uuid', defaultRaw: 'uuid_generate_v4()' })
  id!: string;

  @Property()
  tenantWorkflowId!: number;

  @Property({ length: 6, defaultRaw: 'now()', onUpdate: () => new Date() })
  updatedAt: Date = new Date();

}

let orm: MikroORM;

beforeAll(async () => {
  orm = await MikroORM.init({
    entities: [B, D],
    dbName: `4242`,
  });

  await orm.schema.ensureDatabase();
  await orm.schema.execute('CREATE EXTENSION IF NOT EXISTS "uuid-ossp"');
  await orm.schema.refreshDatabase();
});

afterAll(async () => {
  await orm.close(true);
});

beforeEach(async () => {
  await orm.schema.clearDatabase();
});

test('4242 1/2', async () => {
  const loadedDs = await orm.em.upsertMany(D, [
    { tenantWorkflowId: 1 },
  ]);
  console.log(loadedDs);
  orm.em.clear();
  // loadedDs should have an id, tenantWorkflowId, and updatedAt. With debug statements on, the query issued was
  //   `insert into "d" ("tenant_workflow_id") values (1) on conflict ("tenant_workflow_id") do nothing returning "id", "updated_at"`

  const loadedDs2 = await orm.em.upsertMany(D, [
    { tenantWorkflowId: 1 },
  ]);
  console.log(loadedDs2);
  orm.em.clear();
  // loadedDs2 only has tenantWorkflowId set. With debug statements on, the query issued was
  //   `insert into "d" ("tenant_workflow_id") values (1) on conflict ("tenant_workflow_id") do nothing returning "id", "updated_at"`
  //     `select "w0"."id" from "d" as "w0" where ("w0"."tenant_workflow_id" = 1)`

  const loadedDs3 = await orm.em.upsertMany(D, [
    { tenantWorkflowId: 1, updatedAt: new Date() },
  ]);
  console.log(loadedDs3);
  orm.em.clear();
  // loadedDs3 has id, tenantWorkflowId, and updatedAt set. With debug statements on, the query issued was
  //   `insert into "d" ("tenant_workflow_id", "updated_at") values (1, '2023-04-21T22:01:07.995Z') on conflict ("tenant_workflow_id") do update set "updated_at" = excluded."updated_at" returning "id", "updated_at"`
});

test('4242 2/2', async () => {
  // Assuming that tenantWorkflowId 1 already exists in the database
  const loadedDs4 = await orm.em.upsertMany(D, [
    { tenantWorkflowId: 1 },
  ]);
  console.log(loadedDs4);
  await orm.em.flush();

  const b = await orm.em.upsert(B, {
    d: loadedDs4[0],
    order: 0,
    updatedAt: new Date(),
  });
  await orm.em.flush();
  console.log(b);

  await b.d.load(); // this corrertly populates the relation, it is not marked as initialized before the load() call
  console.log(b.d);
});

@noahw3
Copy link
Author

noahw3 commented May 15, 2023

At a high level, the way that I'd want to use upsert is as a "findAndUpdateOrCreate" method. Because the return type is the entity, my assumption is that it should always return a fully initialized entity - otherwise if it isn't a fully loaded entity I'd expect it to be a reference. However, given that the entire point of using upsert is to reduce the number of database calls under the hood, I don't think it makes a ton of sense to return a partial result that then needs to hit the database again to fully initialize.

The implication of this then is that:

  1. All columns necessary to fully initialize the entity should be included in the "returning" clause.
  2. Upon failure to update, instead of only querying to populate the primary key it should query on all columns. Alternatively, instead of specifying "do nothing" in the conflict clause, it could always perform a no-op update to force a return value in the original query.

I think what I was trying to show in the example code was the differing behavior across subsequent requests. On the first call, because nothing exists yet it does a full create and returns the full entity. On the second call the same code returned a partial result and fired an extra query because the entity had already been created. The third call demonstrates how always forcing an update (for example, by always updating updatedAt) also forces a return value.

The other snippet shows how calling load on the relation doesn't return fully initialized values, because the prior upsert call put partially initialized values into the identity map.

@B4nan
Copy link
Member

B4nan commented May 20, 2023

Thanks for the details, it was very helpful! I completely agree with the points, already working on a PoC.

Some interesting reading about the topic can be found here. So I guess it will be better to keep the separate query for reloading the values, in the end, it will be always there for MySQL that does not have returning statements.

Already got some promising results (#4370):

[query] insert into "d" ("tenant_workflow_id") values (1) on conflict ("tenant_workflow_id") do nothing returning * [took 3 ms]
[
  D {
    tenantWorkflowId: 1,
    updatedAt: 2023-05-20T13:14:15.783Z,
    id: 'ef103b8a-2143-43f4-8736-7e0b4582ff2c'
  }
]

[query] insert into "d" ("tenant_workflow_id") values (1) on conflict ("tenant_workflow_id") do nothing returning * [took 2 ms]
[query] select "d0"."id", "d0"."tenant_workflow_id", "d0"."updated_at" from "d" as "d0" where "d0"."tenant_workflow_id" = 1 [took 1 ms, 1 result]
[
  D {
    tenantWorkflowId: 1,
    updatedAt: 2023-05-20T13:14:15.783Z,
    id: 'ef103b8a-2143-43f4-8736-7e0b4582ff2c'
  }
]

[query] insert into "d" ("tenant_workflow_id", "updated_at") values (1, '2023-05-20T13:14:15.806Z') on conflict ("tenant_workflow_id") do update set "updated_at" = excluded."updated_at" returning * [took 1 ms]
[
  D {
    tenantWorkflowId: 1,
    updatedAt: 2023-05-20T13:14:15.806Z,
    id: 'ef103b8a-2143-43f4-8736-7e0b4582ff2c'
  }
]

Not sure if it makes more sense to whitelist the colums in returning clause, I guess I will do that in the end, so its in line with the select clause.

It will need more tests and polishing, let's see if I can find some time for that tomorrow, need to go AFK now.

@B4nan
Copy link
Member

B4nan commented May 21, 2023

Released as 5.7.8, let me know how it works, maybe there are more cases that will need improvements.

@B4nan
Copy link
Member

B4nan commented May 21, 2023

What do you think about how the upsert works for already managed entities (loaded in context)? Now it does just the assign call and you need to flush explicitly for that, but maybe we could fire update query instead? Also if the entity is just a reference, it will still return a reference (with the data being assigned on it, after flushing it will become initialized but still might be missing optional data that were not part of the upsert call).

@B4nan
Copy link
Member

B4nan commented Aug 29, 2023

FYI next version will improve the behavior of "no nothing" in postgres, and make things explicitly configurable, see #4669

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants