
adding tuned indexes for query plan analysis #446

Merged
merged 2 commits into kentcdodds:main on May 20, 2023

Conversation

rtbenfield
Contributor

Hi Kent 👋🏻 After our discussion about indexes after Remix Conf, I took a closer look at the queries you pointed out. Here are some indexes I would recommend trying. I verified these using EXPLAIN QUERY PLAN, which I'll go into detail about below. That said, I don't think my local environment is representative of your production system, so I'd keep an eye on your timing numbers and see how things compare. The full write-up is quite long, as I walked through how I tested these.

Two things to keep in mind if you choose to implement these would be extra disk usage on your SQLite database and additional write overhead. I don't expect either of these to be concerning for your workload, but I wanted to call out the trade-off.

Let me know if you have any questions and thanks for all the fantastic work!

Analysis

getActiveMembers

Prisma query

Based on the Prisma operation, we can see that User.team and PostRead.createdAt are index candidates. A count in SQLite (and many other engines) will always result in some scanning, but our goal is to reduce that to only relevant data such that every row scanned is part of the final count. In other words, we don't want to read a row that doesn't match the where here.

prisma.user.count({
  where: {
    team,
    postReads: {
      some: {
        createdAt: {gt: withinTheLastYear},
      },
    },
  },
})

SQL query

The SQL aligns with our assessment above. Nothing additional to note here.

SELECT COUNT(*)
FROM (
  SELECT `main`.`User`.`id`
  FROM `main`.`User`
  WHERE (`main`.`User`.`team` = ?
    AND (`main`.`User`.`id`) IN (SELECT `t0`.`id`
      FROM `main`.`User` AS `t0`
      INNER JOIN `main`.`PostRead` AS `j0` ON (`j0`.`userId`) = (`t0`.`id`)
      WHERE (`j0`.`createdAt` > ? AND `t0`.`id` IS NOT NULL)
)) LIMIT ? OFFSET ?) AS `sub`

EXPLAIN QUERY PLAN before

The original query plan showed that PostRead (j0) was read in full with a SCAN. This would imply that the createdAt filter was applied during the scan. PostRead outputs userId, which is then used to match User twice: once for the JOIN and again for the final COUNT. While User appears to be a SEARCH (rather than SCAN), the query plan omits the team filter and we can infer that it is applied during the outer SEARCH.

Why does it show a SEARCH if we have yet to filter by team? Because the query loops over the userId values returned by the subquery and performs an index search for each one. A SEARCH does not always imply an optimal query. In the SQL Server world we call this a residual scan.

QUERY PLAN
|--CO-ROUTINE sub
|  |--SEARCH main.User USING INDEX sqlite_autoindex_User_1 (id=?)
|  `--LIST SUBQUERY 1
|     |--SCAN j0
|     `--SEARCH t0 USING COVERING INDEX sqlite_autoindex_User_1 (id=?)
`--SCAN sub

Relevant Indexes

We'll add two indexes. One on User.team will improve the outer query's performance by limiting to only relevant records for the count. One on PostRead.createdAt will similarly improve the inner query, but we'll also include PostRead.userId in that index to make it a covering index. We need the userId to match on the outer query and including it allows the database engine to leverage the value in the results without an additional read to the PostRead table data.

model User {
  @@index([team])
}

model PostRead {
  @@index([createdAt, userId])
}

EXPLAIN QUERY PLAN after

Now our query plan has eliminated the SCAN with a targeted covering index, as well as a targeted index for the outer count. While SQLite doesn't show the number of read ops in its plans, we can deduce that we are now only touching records that contribute to the count 🙌🏻

QUERY PLAN
|--CO-ROUTINE sub
|  |--SEARCH main.User USING INDEX User_team_idx (team=?)
|  `--LIST SUBQUERY 1
|     |--SEARCH j0 USING COVERING INDEX PostRead_createdAt_userId_idx (createdAt>?)
|     `--SEARCH t0 USING COVERING INDEX sqlite_autoindex_User_1 (id=?)
`--SCAN sub
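The before/after effect is easy to reproduce in a toy sqlite3 session. This is just a sketch, not your real schema: the tables below are simplified stand-ins with only the columns that matter to the plan.

```python
import sqlite3

# Simplified stand-ins for the real tables (sketch only)
con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE User (id TEXT PRIMARY KEY, team TEXT);
  CREATE TABLE PostRead (id TEXT PRIMARY KEY, userId TEXT, createdAt INTEGER);
""")

query = """
  SELECT COUNT(*) FROM User
  WHERE team = ?
    AND id IN (SELECT userId FROM PostRead WHERE createdAt > ?)
"""

def plan(sql, params):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql, params)]

before = plan(query, ("blue", 0))  # PostRead is read with a full SCAN

con.executescript("""
  CREATE INDEX User_team_idx ON User(team);
  CREATE INDEX PostRead_createdAt_userId_idx ON PostRead(createdAt, userId);
""")

after = plan(query, ("blue", 0))   # the SCAN becomes a covering-index SEARCH
print(before)
print(after)
```

Running this shows the inner SCAN of PostRead disappear once the covering index exists, without needing any rows in the tables: the planner output depends only on schema and indexes.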

getRecentReads

Prisma query

Based on the Prisma operation, we can see that PostRead.postSlug, PostRead.createdAt and User.team are index candidates. Note that PostRead.createdAt is an inequality, so there will be some residual scanning. Our goal once again is to reduce that scan to only the relevant records.

prisma.postRead.count({
  where: {
    postSlug: slug,
    createdAt: {gt: withinTheLastSixMonths},
    user: {team},
  },
})

SQL query

The SQL aligns with our assessment above. Nothing additional to note here.

SELECT COUNT(*)
FROM (
  SELECT `main`.`PostRead`.`id`
  FROM `main`.`PostRead`
  WHERE (`main`.`PostRead`.`postSlug` = ?
    AND `main`.`PostRead`.`createdAt` > ?
    AND (`main`.`PostRead`.`id`) IN (
      SELECT `t0`.`id`
      FROM `main`.`PostRead` AS `t0`
      INNER JOIN `main`.`User` AS `j0` ON (`j0`.`id`) = (`t0`.`userId`)
      WHERE (`j0`.`team` = ? AND `t0`.`id` IS NOT NULL)
)) LIMIT ? OFFSET ?) AS `sub`

EXPLAIN QUERY PLAN before

The original query plan showed that User (j0) was read in full with a SCAN. This would imply that the team filter was applied during the scan. User.id is then used to match PostRead (t0) using the existing index @@index(userId). This match is then used to filter PostRead again in the outer query, but we see that the outer SEARCH does not include postSlug or createdAt in the criteria. We can again infer that these are applied in memory with a residual scan.

QUERY PLAN
|--CO-ROUTINE sub
|  |--SEARCH main.PostRead USING INDEX sqlite_autoindex_PostRead_1 (id=?)
|  `--LIST SUBQUERY 1
|     |--SCAN j0
|     `--SEARCH t0 USING INDEX PostRead_userId_idx (userId=?)
`--SCAN sub

Relevant Indexes

We'll add three indexes to optimize this query, but note that one is a repeat of above. One on User.team will improve the inner query's performance by limiting to only relevant records for the join. One on PostRead.userId is used by the inner query's JOIN. One on PostRead.postSlug and PostRead.createdAt will narrow the outer query, but in this case order is important.

If we flip the order to @@index([createdAt, postSlug]), then the query plan will not utilize the index. Why is that? Remember the callout that PostRead.createdAt is filtered on an inequality? Inequalities can only be partially utilized, as described in the SQLite query planner documentation. Index columns must be utilized from left to right, so having createdAt first with partial utilization eliminates our ability to utilize postSlug as an equality match (a SEARCH).
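The left-to-right rule can be checked directly with sqlite3. Again a sketch against a stand-in table, comparing the flipped column order against the recommended one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PostRead (id TEXT PRIMARY KEY, postSlug TEXT, createdAt INTEGER)")

query = "SELECT COUNT(*) FROM PostRead WHERE postSlug = ? AND createdAt > ?"

def detail(params):
    # Join the plan's detail strings into one line for easy inspection
    return " | ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + query, params))

# Inequality column first: postSlug cannot be used as an equality constraint
con.execute("CREATE INDEX flipped ON PostRead(createdAt, postSlug)")
flipped = detail(("some-slug", 0))

con.execute("DROP INDEX flipped")

# Equality column first: both columns constrain the SEARCH
con.execute("CREATE INDEX better ON PostRead(postSlug, createdAt)")
better = detail(("some-slug", 0))
print(flipped)
print(better)
```

With the equality column first, the plan shows both constraints, something like (postSlug=? AND createdAt>?); with the flipped order, postSlug never appears as a search constraint.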

model User {
  @@index([team])
}

model PostRead {
  // we only need @@index([userId]) here, but postSlug will be used later
  // since index columns are utilized left to right, it works here too
  // indexes impact write performance and take disk space, so reuse is ideal
  @@index([userId, postSlug])
  @@index([postSlug, createdAt])
}

EXPLAIN QUERY PLAN after

Our new query plan has replaced the inner SCAN with a specific SEARCH on User.team, followed by another SEARCH to join with PostRead.userId. Finally, the outer query is also a SEARCH on postSlug and createdAt. We've narrowed the results down to the absolute minimum to perform the count 🎉

QUERY PLAN
|--CO-ROUTINE sub
|  |--SEARCH main.PostRead USING INDEX PostRead_postSlug_createdAt_idx (postSlug=? AND createdAt>?)
|  `--LIST SUBQUERY 1
|     |--SEARCH j0 USING INDEX User_team_idx (team=?)
|     `--SEARCH t0 USING INDEX PostRead_userId_postSlug_idx (userId=?)
`--SCAN sub

getBlogReadRankings

getBlogReadRankings is nearly identical to optimizing getRecentReads, but without the PostRead.createdAt filter. This is great, because it will leverage the same indexes we created above. Since PostRead.createdAt was the rightmost column in our index, the index is still fully utilized 🎉
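To sanity-check the reuse claim, here's a sketch showing that the composite index still serves a postSlug-only equality filter, since postSlug is the leftmost column (stand-in table, not the real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PostRead (id TEXT PRIMARY KEY, postSlug TEXT, createdAt INTEGER)")
con.execute("CREATE INDEX PostRead_postSlug_createdAt_idx ON PostRead(postSlug, createdAt)")

# postSlug is the leftmost index column, so an equality filter on it
# alone can still SEARCH the composite index
details = [
    row[3]
    for row in con.execute(
        "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM PostRead WHERE postSlug = ?",
        ("some-slug",),
    )
]
print(details)
```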

Rather than repeat the whole process from above, I'll keep this one short 😄

getBlogRecommendations

This one is particularly interesting because there are two paths to optimize separately: user and clientId. I'll do my best to consolidate them here 🙂

Prisma query

Based on the Prisma operation, we can see that PostRead.postSlug is a good index candidate. Depending on the ternary, either PostRead.userId or PostRead.clientId would be useful as well.

const where = user
  ? {user: {id: user.id}, postSlug: {notIn: exclude.filter(Boolean)}}
  : {clientId, postSlug: {notIn: exclude.filter(Boolean)}}
prisma.postRead.groupBy({
  by: ['postSlug'],
  where,
})

SQL query

The SQL changes with the ternary, but follows this pattern.

SELECT `main`.`PostRead`.`postSlug`
FROM `main`.`PostRead`
WHERE (
  `main`.`PostRead`.`clientId` = ?
  AND `main`.`PostRead`.`postSlug` NOT IN (?,?,?,?,?,?,?,?,?,?,?))
GROUP BY `main`.`PostRead`.`postSlug`
LIMIT ? OFFSET ?

EXPLAIN QUERY PLAN before

The original query plan showed that the existing indexes for PostRead.clientId and PostRead.userId are being utilized. Awesome! But notice the PostRead.postSlug filter is missing. Let's see what happens if we fix that.

QUERY PLAN
|--SEARCH main.PostRead USING INDEX PostRead_clientId_idx (clientId=?)
`--USE TEMP B-TREE FOR GROUP BY

Relevant Indexes

We'll update the existing PostRead indexes to include postSlug as the right-most column. Order is important here, since any query that was previously utilizing the existing indexes (like getReaderCount and getSlugReadsByUser) can still use the new ones this way. Once again, we'll leverage the fact that index columns are utilized left-to-right.

model PostRead {
  // we'll replace the original indexes
  // @@index(userId)
  // @@index(clientId)

  // remember the callout for getRecentReads about including postSlug? this was why 😁
  @@index([userId, postSlug])
  @@index([clientId, postSlug])
}

EXPLAIN QUERY PLAN after

🤯 Our new query plan not only replaced the SEARCH with a covering index, but also eliminated the USE TEMP B-TREE FOR GROUP BY operation. I suppose this is because indexes are ordered, so the optimizer knows that equal postSlug values will be adjacent and can use that to avoid in-memory grouping.

Why is the postSlug not shown in the index filters? Probably because it's an inverted condition (NOT IN) so it's more efficient to match it in-memory instead. It's still valuable to include postSlug in the index because of the change to a covering index, which eliminates the additional table lookup and orders the values as mentioned above.

QUERY PLAN
`--SEARCH main.PostRead USING COVERING INDEX PostRead_clientId_postSlug_idx (clientId=?)
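The temp B-tree elimination can also be reproduced in a toy sqlite3 session (stand-in table, sketch only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PostRead (id TEXT PRIMARY KEY, clientId TEXT, postSlug TEXT)")

query = "SELECT postSlug FROM PostRead WHERE clientId = ? GROUP BY postSlug"

def details():
    return [r[3] for r in con.execute("EXPLAIN QUERY PLAN " + query, ("c1",))]

# Original single-column index: GROUP BY needs an in-memory sort
con.execute("CREATE INDEX PostRead_clientId_idx ON PostRead(clientId)")
before = details()   # includes USE TEMP B-TREE FOR GROUP BY

# Widened index: rows arrive ordered by postSlug within each clientId
con.execute("DROP INDEX PostRead_clientId_idx")
con.execute("CREATE INDEX PostRead_clientId_postSlug_idx ON PostRead(clientId, postSlug)")
after = details()    # covering index, no temp B-tree
print(before)
print(after)
```

The before plan includes the USE TEMP B-TREE FOR GROUP BY step; after widening the index it disappears and the lookup becomes a covering-index SEARCH.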

@@ -29,6 +29,7 @@ translations:
author:
name: Noelia Donato
link: https://twitter.com/vamoacodear
- language: Español
Contributor Author


I was having trouble populating the blog post cache without this change. Maybe it was just me though 🙂 It seemed consistent with the structure here, so I kept it for reference in case it is needed.

Owner


Thanks! That's definitely making the cache update fail.

Owner

@kentcdodds kentcdodds left a comment


Thanks a lot! I'm excited to see how this improves query performance.


@kentcdodds kentcdodds merged commit c756b74 into kentcdodds:main May 20, 2023
@kentcdodds
Owner

Also, amazing analysis. Thanks for going through all the explanation!

@rtbenfield
Contributor Author

Happy to help! I enjoyed diving in further. If you get the time, let me know how they impact the times you're logging. It's difficult to measure the real impact without the same data and infrastructure, so I'm curious how it turns out 😄
