Optimize RDAP entity event query #1635

weiminyu · 2022-05-18T06:24:23Z

For each EPP entity, directly load the latest HistoryEntry per event type
instead of loading all events through the HistoryEntryDao.

Although most entities have a small number of history entries, there are
a few entities with many entries, enough to cause OutOfMemory error.

This change is

For each EPP entity, directly load the latest HistoryEntry per event type instead of loading all events through the HistoryEntryDao. Although most entities have a small number of history entries, there are a few entities with many entries, enough to cause OutOfMemory error.

gbrodman

Reviewable status: 0 of 3 files reviewed, 4 unresolved discussions (waiting on @CydeWeys, @mindhog, and @weiminyu)

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 165 at r1 (raw file):

   * DomainHistory where domainRepoId = '17-Q9JYB4C' and type is not null group by type)}
   */
  private static final String GET_LAST_HISTORY_BY_TYPE_JPQL_TEMPLATE =

I'm not 100% sure how this would work in HQL, but in raw Postgres you can do

SELECT DISTINCT ON (history_type) * FROM "%entityName%" WHERE %repoIdField = '%repoIdValue' ORDER BY history_type, history_modification_time DESC;

and it gets the same results without having nested statements

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 890 at r1 (raw file):

    // domain name has not been transferred since it was created.
    Iterable<? extends HistoryEntry> historyEntries;
    if (tm().isOfy()) {

We don't need the isOfy check anymore, right?

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 897 at r1 (raw file):

          HistoryEntryDao.getHistoryClassFromParent(resourceVkey.getKind());
      String entityName = historyClass.getAnnotation(Entity.class).name();
      if (Strings.isNullOrEmpty(entityName)) {

I don't think this is the case for any of our history classes right?

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 902 at r1 (raw file):

      String repoIdFieldName = HistoryEntryDao.getRepoIdFieldNameFromHistoryClass(historyClass);
      String jpql =
          GET_LAST_HISTORY_BY_TYPE_JPQL_TEMPLATE

thoughts on using String::format?

weiminyu

Reviewable status: 0 of 3 files reviewed, 4 unresolved discussions (waiting on @CydeWeys, @gbrodman, and @mindhog)

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 165 at r1 (raw file):

Previously, gbrodman wrote…

I'm not 100% sure how this would work in HQL, but in raw Postgres you can do

SELECT DISTINCT ON (history_type) * FROM "%entityName%" WHERE %repoIdField = '%repoIdValue' ORDER BY history_type, history_modification_time DESC;

and it gets the same results without having nested statements

select distinct ON (column) is postgresql-specific, so unlikely to be supported by JPQL.

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 890 at r1 (raw file):

Previously, gbrodman wrote…

We don't need the isOfy check anymore, right?

Until we remove TestOfyAndSql tags from all Rdap tests, we need it.

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 897 at r1 (raw file):

Previously, gbrodman wrote…

I don't think this is the case for any of our history classes right?

It is. All of them are annotated with a plain Entity annotation without name.

I can call getSimpleName directly, but feel that calling getAnnotation first is a good habit.

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 902 at r1 (raw file):

Previously, gbrodman wrote…

thoughts on using String::format?

I think this makes the template more readable. Even if we use argument indexes, the template itself can be confusing unless it is put next to the place when the actual query is generated.

mindhog

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @CydeWeys and @gbrodman)

CydeWeys

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @gbrodman, @mindhog, and @weiminyu)

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 165 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

select distinct ON (column) is postgresql-specific, so unlikely to be supported by JPQL.

If the SELECT DISTINCT ON (column) is more performant, then we should probably just use that even though it's PostgreSQL-specific. (Is it more performant?)

We're already using SQL arrays, so we're already locked into PostgreSQL anyway.

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 897 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

It is. All of them are annotated with a plain Entity annotation without name.

I can call getSimpleName directly, but feel that calling getAnnotation first is a good habit.

Is there anything in TypeUtils that is useful here? getClassFromString() perhaps?

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 902 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

I think this makes the template more readable. Even if we use argument indexes, the template itself can be confusing unless it is put next to the place when the actual query is generated.

Any reason this replacement isn't happening using built-in SQL

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 907 at r1 (raw file):

              .replace("%repoIdValue%", resourceVkey.getSqlKey().toString());
      historyEntries =
          jpaTm().transact(() -> jpaTm().getEntityManager().createQuery(jpql).getResultList());

Why not use setParameter() here instead of String munging from above? Does it not work on column names? repoIdValue at least is not a column name -- it's actual data.

gbrodman

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @CydeWeys, @gbrodman, @mindhog, and @weiminyu)

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 165 at r1 (raw file):

Previously, CydeWeys (Ben McIlwain) wrote…

If the SELECT DISTINCT ON (column) is more performant, then we should probably just use that even though it's PostgreSQL-specific. (Is it more performant?)

We're already using SQL arrays, so we're already locked into PostgreSQL anyway.

I didn't actually know which one was superior before but I just did a test using separate repo IDs on prod (if I don't use separate repo IDs, whichever query runs last is faster due to caching):

explain analyze select distinct on (history_type) * FROM "DomainHistory" WHERE domain_repo_id = '339578985-DEV' ORDER BY history_type, history_modification_time DESC; Planning Time: 0.177 ms
Execution Time: 0.364 ms
(8 rows)

explain analyze select e from "DomainHistory" e where domain_repo_id = '436E9C8CE-DEV' and history_modification_time in (select max(history_modification_time) from "DomainHistory" where domain_repo_id = '436E9C8CE-DEV' and history_type is not null group by history_type);
Planning Time: 0.240 ms
Execution Time: 8.925 ms
(16 rows)

Interestingly, it seems like the second one (with the select(max)) has a lot more variance when testing it with various domains. The highest execution time I saw with a few different choices was 35ms, and the lowest was about 1ms

I'm not sure if this is a meaningful difference. There is a tangible benefit to writing HQL (what already exists here), but there's also probably a benefit in a simpler shorter SQL statement.

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 897 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

It is. All of them are annotated with a plain Entity annotation without name.

I can call getSimpleName directly, but feel that calling getAnnotation first is a good habit.

what I meant was that our classes don't have a name in the Entity annotation, so this doesn't seem necessary

gbrodman

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @CydeWeys, @gbrodman, @mindhog, and @weiminyu)

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 890 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

Until we remove TestOfyAndSql tags from all Rdap tests, we need it.

boo :/

CydeWeys

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @CydeWeys, @gbrodman, @mindhog, and @weiminyu)

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 911 at r1 (raw file):

    for (HistoryEntry historyEntry : historyEntries) {
      EventAction rdapEventAction =
          HISTORY_ENTRY_TYPE_TO_RDAP_EVENT_ACTION_MAP.get(historyEntry.getType());

Looking at the contents of this map, it seems like we can make our query above smarter, by only querying for the relevant types from the DB, rather than querying for all distinct ones and ignroing them. In particular, HISTORY_ENTRY_TYPE_TO_RDAP_EVENT_ACTION_MAP does not even contain UPDATEs, which are the types causing us all these problems in the first place!

weiminyu

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @gbrodman, @mindhog, and @weiminyu)

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 165 at r1 (raw file):

Previously, gbrodman wrote…

I didn't actually know which one was superior before but I just did a test using separate repo IDs on prod (if I don't use separate repo IDs, whichever query runs last is faster due to caching):

explain analyze select distinct on (history_type) * FROM "DomainHistory" WHERE domain_repo_id = '339578985-DEV' ORDER BY history_type, history_modification_time DESC; Planning Time: 0.177 ms
Execution Time: 0.364 ms
(8 rows)

explain analyze select e from "DomainHistory" e where domain_repo_id = '436E9C8CE-DEV' and history_modification_time in (select max(history_modification_time) from "DomainHistory" where domain_repo_id = '436E9C8CE-DEV' and history_type is not null group by history_type);
Planning Time: 0.240 ms
Execution Time: 8.925 ms
(16 rows)

Interestingly, it seems like the second one (with the select(max)) has a lot more variance when testing it with various domains. The highest execution time I saw with a few different choices was 35ms, and the lowest was about 1ms

I'm not sure if this is a meaningful difference. There is a tangible benefit to writing HQL (what already exists here), but there's also probably a benefit in a simpler shorter SQL statement.

In my measurements comparing domains with similar number of history entries, the sub-select query is usually faster. E.g., 334745225-DEV (select distint) took 170ms while 34040F2D0-DEV (sub select) took 130ms, on cold starts. When cache is loaded, it's 50ms vs 40ms, still in subselect's favor. Note that explain analyze does not execute the query. The Execution Time reported is the analysis cost.

Performance is a smaller matter. Only the sub-select query can be executed in JPQL. If we use select distinct with native query, we will have to
cast string to Enum and DateTime.

PS: Here is someone else's report of select distinct on being a bit slower: https://stackoverflow.com/questions/54810250/postgresql-latest-row-in-distinct-on-less-performant-than-max-row-in-group-by

weiminyu

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @CydeWeys, @gbrodman, and @mindhog)

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 902 at r1 (raw file):

Previously, CydeWeys (Ben McIlwain) wrote…

Any reason this replacement isn't happening using built-in SQL

If only parameter values are undetermined, we can use Query.setParameter to set them.
Here the table and column names are variables. We have to replace them before creation the query.

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 907 at r1 (raw file):

Previously, CydeWeys (Ben McIlwain) wrote…

Why not use setParameter() here instead of String munging from above? Does it not work on column names? repoIdValue at least is not a column name -- it's actual data.

Entityname and repoIdFieldName cannot be set with setParameter(). It's an extra mental cost to repoIdValue separately.

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 911 at r1 (raw file):

Previously, CydeWeys (Ben McIlwain) wrote…

Looking at the contents of this map, it seems like we can make our query above smarter, by only querying for the relevant types from the DB, rather than querying for all distinct ones and ignroing them. In particular, HISTORY_ENTRY_TYPE_TO_RDAP_EVENT_ACTION_MAP does not even contain UPDATEs, which are the types causing us all these problems in the first place!

The query as it stands takes < 10s for the million-row repoId on cold start, and 1s if cache is loaded. For the others it is at millisecond level.
We don't need to spend too much time on it right now.

weiminyu

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @CydeWeys, @gbrodman, and @mindhog)

core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 897 at r1 (raw file):

Previously, gbrodman wrote…

what I meant was that our classes don't have a name in the Entity annotation, so this doesn't seem necessary

There is a getEntityName() utility method in some test utils. We should move it to production and replace the few places entity name is needed. That can be left for later.

CydeWeys

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @gbrodman and @mindhog)

gbrodman

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mindhog)

weiminyu requested review from gbrodman, CydeWeys and mindhog May 18, 2022 06:30

gbrodman requested changes May 18, 2022

View reviewed changes

weiminyu commented May 18, 2022

View reviewed changes

mindhog approved these changes May 18, 2022

View reviewed changes

weiminyu requested review from gbrodman and mindhog May 19, 2022 02:00

CydeWeys requested changes May 19, 2022

View reviewed changes

gbrodman requested changes May 19, 2022

View reviewed changes

gbrodman reviewed May 19, 2022

View reviewed changes

CydeWeys requested changes May 19, 2022

View reviewed changes

weiminyu commented May 19, 2022

View reviewed changes

CydeWeys approved these changes May 19, 2022

View reviewed changes

gbrodman approved these changes May 20, 2022

View reviewed changes

weiminyu merged commit eeca516 into google:master May 20, 2022

weiminyu deleted the rdap branch May 20, 2022 03:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize RDAP entity event query #1635

Optimize RDAP entity event query #1635

weiminyu commented May 18, 2022 •

edited by mindhog

Loading

gbrodman left a comment

weiminyu left a comment

mindhog left a comment

CydeWeys left a comment

gbrodman left a comment

gbrodman left a comment

CydeWeys left a comment

weiminyu left a comment

weiminyu left a comment

weiminyu left a comment

CydeWeys left a comment

gbrodman left a comment

Optimize RDAP entity event query #1635

Optimize RDAP entity event query #1635

Conversation

weiminyu commented May 18, 2022 • edited by mindhog Loading

gbrodman left a comment

Choose a reason for hiding this comment

weiminyu left a comment

Choose a reason for hiding this comment

mindhog left a comment

Choose a reason for hiding this comment

CydeWeys left a comment

Choose a reason for hiding this comment

gbrodman left a comment

Choose a reason for hiding this comment

gbrodman left a comment

Choose a reason for hiding this comment

CydeWeys left a comment

Choose a reason for hiding this comment

weiminyu left a comment

Choose a reason for hiding this comment

weiminyu left a comment

Choose a reason for hiding this comment

weiminyu left a comment

Choose a reason for hiding this comment

CydeWeys left a comment

Choose a reason for hiding this comment

gbrodman left a comment

Choose a reason for hiding this comment

weiminyu commented May 18, 2022 •

edited by mindhog

Loading