Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize RDAP entity event query #1635

Merged
merged 1 commit into from
May 20, 2022
Merged

Optimize RDAP entity event query #1635

merged 1 commit into from
May 20, 2022

Conversation

weiminyu
Copy link
Collaborator

@weiminyu weiminyu commented May 18, 2022

For each EPP entity, directly load the latest HistoryEntry per event type
instead of loading all events through the HistoryEntryDao.

Although most entities have a small number of history entries, there are
a few entities with many entries, enough to cause OutOfMemory error.


This change is Reviewable

For each EPP entity, directly load the latest HistoryEntry per event type
instead of loading all events through the HistoryEntryDao.

Although most entities have a small number of history entries, there are
a few entities with many entries, enough to cause OutOfMemory error.
Copy link
Collaborator

@gbrodman gbrodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 4 unresolved discussions (waiting on @CydeWeys, @mindhog, and @weiminyu)


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 165 at r1 (raw file):

   * DomainHistory where domainRepoId = '17-Q9JYB4C' and type is not null group by type)}
   */
  private static final String GET_LAST_HISTORY_BY_TYPE_JPQL_TEMPLATE =

I'm not 100% sure how this would work in HQL, but in raw Postgres you can do

SELECT DISTINCT ON (history_type) * FROM "%entityName%" WHERE %repoIdField = '%repoIdValue' ORDER BY history_type, history_modification_time DESC;

and it gets the same results without having nested statements


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 890 at r1 (raw file):

    // domain name has not been transferred since it was created.
    Iterable<? extends HistoryEntry> historyEntries;
    if (tm().isOfy()) {

We don't need the isOfy check anymore, right?


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 897 at r1 (raw file):

          HistoryEntryDao.getHistoryClassFromParent(resourceVkey.getKind());
      String entityName = historyClass.getAnnotation(Entity.class).name();
      if (Strings.isNullOrEmpty(entityName)) {

I don't think this is the case for any of our history classes right?


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 902 at r1 (raw file):

      String repoIdFieldName = HistoryEntryDao.getRepoIdFieldNameFromHistoryClass(historyClass);
      String jpql =
          GET_LAST_HISTORY_BY_TYPE_JPQL_TEMPLATE

thoughts on using String::format?

Copy link
Collaborator Author

@weiminyu weiminyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 4 unresolved discussions (waiting on @CydeWeys, @gbrodman, and @mindhog)


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 165 at r1 (raw file):

Previously, gbrodman wrote…

I'm not 100% sure how this would work in HQL, but in raw Postgres you can do

SELECT DISTINCT ON (history_type) * FROM "%entityName%" WHERE %repoIdField = '%repoIdValue' ORDER BY history_type, history_modification_time DESC;

and it gets the same results without having nested statements

select distinct ON (column) is postgresql-specific, so unlikely to be supported by JPQL.


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 890 at r1 (raw file):

Previously, gbrodman wrote…

We don't need the isOfy check anymore, right?

Until we remove TestOfyAndSql tags from all Rdap tests, we need it.


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 897 at r1 (raw file):

Previously, gbrodman wrote…

I don't think this is the case for any of our history classes right?

It is. All of them are annotated with a plain Entity annotation without name.

I can call getSimpleName directly, but feel that calling getAnnotation first is a good habit.


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 902 at r1 (raw file):

Previously, gbrodman wrote…

thoughts on using String::format?

I think this makes the template more readable. Even if we use argument indexes, the template itself can be confusing unless it is put next to the place when the actual query is generated.

Copy link
Member

@mindhog mindhog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @CydeWeys and @gbrodman)

@weiminyu weiminyu requested review from gbrodman and mindhog May 19, 2022 02:00
Copy link
Member

@CydeWeys CydeWeys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @gbrodman, @mindhog, and @weiminyu)


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 165 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

select distinct ON (column) is postgresql-specific, so unlikely to be supported by JPQL.

If the SELECT DISTINCT ON (column) is more performant, then we should probably just use that even though it's PostgreSQL-specific. (Is it more performant?)

We're already using SQL arrays, so we're already locked into PostgreSQL anyway.


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 897 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

It is. All of them are annotated with a plain Entity annotation without name.

I can call getSimpleName directly, but feel that calling getAnnotation first is a good habit.

Is there anything in TypeUtils that is useful here? getClassFromString() perhaps?


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 902 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

I think this makes the template more readable. Even if we use argument indexes, the template itself can be confusing unless it is put next to the place when the actual query is generated.

Any reason this replacement isn't happening using built-in SQL


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 907 at r1 (raw file):

              .replace("%repoIdValue%", resourceVkey.getSqlKey().toString());
      historyEntries =
          jpaTm().transact(() -> jpaTm().getEntityManager().createQuery(jpql).getResultList());

Why not use setParameter() here instead of String munging from above? Does it not work on column names? repoIdValue at least is not a column name -- it's actual data.

Copy link
Collaborator

@gbrodman gbrodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @CydeWeys, @gbrodman, @mindhog, and @weiminyu)


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 165 at r1 (raw file):

Previously, CydeWeys (Ben McIlwain) wrote…

If the SELECT DISTINCT ON (column) is more performant, then we should probably just use that even though it's PostgreSQL-specific. (Is it more performant?)

We're already using SQL arrays, so we're already locked into PostgreSQL anyway.

I didn't actually know which one was superior before but I just did a test using separate repo IDs on prod (if I don't use separate repo IDs, whichever query runs last is faster due to caching):

explain analyze select distinct on (history_type) * FROM "DomainHistory" WHERE domain_repo_id = '339578985-DEV' ORDER BY history_type, history_modification_time DESC; Planning Time: 0.177 ms
Execution Time: 0.364 ms
(8 rows)

explain analyze select e from "DomainHistory" e where domain_repo_id = '436E9C8CE-DEV' and history_modification_time in (select max(history_modification_time) from "DomainHistory" where domain_repo_id = '436E9C8CE-DEV' and history_type is not null group by history_type);
Planning Time: 0.240 ms
Execution Time: 8.925 ms
(16 rows)

Interestingly, it seems like the second one (with the select(max)) has a lot more variance when testing it with various domains. The highest execution time I saw with a few different choices was 35ms, and the lowest was about 1ms

I'm not sure if this is a meaningful difference. There is a tangible benefit to writing HQL (what already exists here), but there's also probably a benefit in a simpler shorter SQL statement.


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 897 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

It is. All of them are annotated with a plain Entity annotation without name.

I can call getSimpleName directly, but feel that calling getAnnotation first is a good habit.

what I meant was that our classes don't have a name in the Entity annotation, so this doesn't seem necessary

Copy link
Collaborator

@gbrodman gbrodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @CydeWeys, @gbrodman, @mindhog, and @weiminyu)


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 890 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

Until we remove TestOfyAndSql tags from all Rdap tests, we need it.

boo :/

Copy link
Member

@CydeWeys CydeWeys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @CydeWeys, @gbrodman, @mindhog, and @weiminyu)


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 911 at r1 (raw file):

    for (HistoryEntry historyEntry : historyEntries) {
      EventAction rdapEventAction =
          HISTORY_ENTRY_TYPE_TO_RDAP_EVENT_ACTION_MAP.get(historyEntry.getType());

Looking at the contents of this map, it seems like we can make our query above smarter, by only querying for the relevant types from the DB, rather than querying for all distinct ones and ignroing them. In particular, HISTORY_ENTRY_TYPE_TO_RDAP_EVENT_ACTION_MAP does not even contain UPDATEs, which are the types causing us all these problems in the first place!

Copy link
Collaborator Author

@weiminyu weiminyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @gbrodman, @mindhog, and @weiminyu)


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 165 at r1 (raw file):

Previously, gbrodman wrote…

I didn't actually know which one was superior before but I just did a test using separate repo IDs on prod (if I don't use separate repo IDs, whichever query runs last is faster due to caching):

explain analyze select distinct on (history_type) * FROM "DomainHistory" WHERE domain_repo_id = '339578985-DEV' ORDER BY history_type, history_modification_time DESC; Planning Time: 0.177 ms
Execution Time: 0.364 ms
(8 rows)

explain analyze select e from "DomainHistory" e where domain_repo_id = '436E9C8CE-DEV' and history_modification_time in (select max(history_modification_time) from "DomainHistory" where domain_repo_id = '436E9C8CE-DEV' and history_type is not null group by history_type);
Planning Time: 0.240 ms
Execution Time: 8.925 ms
(16 rows)

Interestingly, it seems like the second one (with the select(max)) has a lot more variance when testing it with various domains. The highest execution time I saw with a few different choices was 35ms, and the lowest was about 1ms

I'm not sure if this is a meaningful difference. There is a tangible benefit to writing HQL (what already exists here), but there's also probably a benefit in a simpler shorter SQL statement.

In my measurements comparing domains with similar number of history entries, the sub-select query is usually faster. E.g., 334745225-DEV (select distint) took 170ms while 34040F2D0-DEV (sub select) took 130ms, on cold starts. When cache is loaded, it's 50ms vs 40ms, still in subselect's favor. Note that explain analyze does not execute the query. The Execution Time reported is the analysis cost.

Performance is a smaller matter. Only the sub-select query can be executed in JPQL. If we use select distinct with native query, we will have to
cast string to Enum and DateTime.

PS: Here is someone else's report of select distinct on being a bit slower: https://stackoverflow.com/questions/54810250/postgresql-latest-row-in-distinct-on-less-performant-than-max-row-in-group-by

Copy link
Collaborator Author

@weiminyu weiminyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @CydeWeys, @gbrodman, and @mindhog)


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 902 at r1 (raw file):

Previously, CydeWeys (Ben McIlwain) wrote…

Any reason this replacement isn't happening using built-in SQL

If only parameter values are undetermined, we can use Query.setParameter to set them.
Here the table and column names are variables. We have to replace them before creation the query.


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 907 at r1 (raw file):

Previously, CydeWeys (Ben McIlwain) wrote…

Why not use setParameter() here instead of String munging from above? Does it not work on column names? repoIdValue at least is not a column name -- it's actual data.

Entityname and repoIdFieldName cannot be set with setParameter(). It's an extra mental cost to repoIdValue separately.


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 911 at r1 (raw file):

Previously, CydeWeys (Ben McIlwain) wrote…

Looking at the contents of this map, it seems like we can make our query above smarter, by only querying for the relevant types from the DB, rather than querying for all distinct ones and ignroing them. In particular, HISTORY_ENTRY_TYPE_TO_RDAP_EVENT_ACTION_MAP does not even contain UPDATEs, which are the types causing us all these problems in the first place!

The query as it stands takes < 10s for the million-row repoId on cold start, and 1s if cache is loaded. For the others it is at millisecond level.
We don't need to spend too much time on it right now.

Copy link
Collaborator Author

@weiminyu weiminyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @CydeWeys, @gbrodman, and @mindhog)


core/src/main/java/google/registry/rdap/RdapJsonFormatter.java line 897 at r1 (raw file):

Previously, gbrodman wrote…

what I meant was that our classes don't have a name in the Entity annotation, so this doesn't seem necessary

There is a getEntityName() utility method in some test utils. We should move it to production and replace the few places entity name is needed. That can be left for later.

Copy link
Member

@CydeWeys CydeWeys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @gbrodman and @mindhog)

Copy link
Collaborator

@gbrodman gbrodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mindhog)

@weiminyu weiminyu merged commit eeca516 into google:master May 20, 2022
@weiminyu weiminyu deleted the rdap branch May 20, 2022 03:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants