
adding tuned indexes for query plan analysis #446

Merged
merged 2 commits into kentcdodds:main on May 20, 2023

Conversation

rtbenfield
Contributor

Hi Kent 👋🏻 After our discussion about indexes after Remix Conf, I took a closer look at the queries you pointed out. Here are some indexes I would recommend trying. I verified these using EXPLAIN QUERY PLAN, which I'll go into detail about below. That said, I don't think my local environment is representative of your production system, so I'd keep an eye on your timing numbers and see how things compare. The full write-up is quite long, as I walked through how I tested these.

Two things to keep in mind if you choose to implement these would be extra disk usage on your SQLite database and additional write overhead. I don't expect either of these to be concerning for your workload, but I wanted to call out the trade-off.

Let me know if you have any questions and thanks for all the fantastic work!

Analysis

getActiveMembers

Prisma query

Based on the Prisma operation, we can see that User.team and PostRead.createdAt are index candidates. A count in SQLite (and many other engines) will always result in some scanning, but our goal is to reduce that to only relevant data such that every row scanned is part of the final count. In other words, we don't want to read a row that doesn't match the where here.

prisma.user.count({
  where: {
    team,
    postReads: {
      some: {
        createdAt: {gt: withinTheLastYear},
      },
    },
  },
})

SQL query

The SQL aligns with our assessment above. Nothing additional to note here.

SELECT COUNT(*)
FROM (
  SELECT `main`.`User`.`id`
  FROM `main`.`User`
  WHERE (`main`.`User`.`team` = ?
    AND (`main`.`User`.`id`) IN (SELECT `t0`.`id`
      FROM `main`.`User` AS `t0`
      INNER JOIN `main`.`PostRead` AS `j0` ON (`j0`.`userId`) = (`t0`.`id`)
      WHERE (`j0`.`createdAt` > ? AND `t0`.`id` IS NOT NULL)
)) LIMIT ? OFFSET ?) AS `sub`

EXPLAIN QUERY PLAN before

The original query plan showed that PostRead (j0) was read in full with a SCAN. This would imply that the createdAt filter was applied during the scan. PostRead outputs userId, which is then used to match User twice: once for the JOIN and again for the final COUNT. While User appears to be a SEARCH (rather than SCAN), the query plan omits the team filter and we can infer that it is applied during the outer SEARCH.

Why does it show a SEARCH if we have yet to filter by team? Because the query loops over the userId values returned by the subquery and performs an index search for each one. A SEARCH does not always imply an optimal query. In the SQL Server world we call this a residual scan.

QUERY PLAN
|--CO-ROUTINE sub
|  |--SEARCH main.User USING INDEX sqlite_autoindex_User_1 (id=?)
|  `--LIST SUBQUERY 1
|     |--SCAN j0
|     `--SEARCH t0 USING COVERING INDEX sqlite_autoindex_User_1 (id=?)
`--SCAN sub

Relevant Indexes

We'll add two indexes. One on User.team will improve the outer query's performance by limiting to only relevant records for the count. One on PostRead.createdAt will similarly improve the inner query, but we'll also include PostRead.userId in that index to make it a covering index. We need the userId to match on the outer query and including it allows the database engine to leverage the value in the results without an additional read to the PostRead table data.

model User {
  @@index([team])
}

model PostRead {
  @@index([createdAt, userId])
}

EXPLAIN QUERY PLAN after

Now our query plan has eliminated the SCAN with a targeted covering index, as well as a targeted index for the outer count. While SQLite doesn't show the number of read ops in its plans, we can deduce that we are now only touching records that contribute to the count 🙌🏻

QUERY PLAN
|--CO-ROUTINE sub
|  |--SEARCH main.User USING INDEX User_team_idx (team=?)
|  `--LIST SUBQUERY 1
|     |--SEARCH j0 USING COVERING INDEX PostRead_createdAt_userId_idx (createdAt>?)
|     `--SEARCH t0 USING COVERING INDEX sqlite_autoindex_User_1 (id=?)
`--SCAN sub
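The before/after effect is easy to reproduce in a toy sqlite3 session. This is just a sketch, not your real schema: the tables below are simplified stand-ins with only the columns that matter to the plan.

```python
import sqlite3

# Simplified stand-ins for the real tables (sketch only)
con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE User (id TEXT PRIMARY KEY, team TEXT);
  CREATE TABLE PostRead (id TEXT PRIMARY KEY, userId TEXT, createdAt INTEGER);
""")

query = """
  SELECT COUNT(*) FROM User
  WHERE team = ?
    AND id IN (SELECT userId FROM PostRead WHERE createdAt > ?)
"""

def plan(sql, params):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql, params)]

before = plan(query, ("blue", 0))  # PostRead is read with a full SCAN

con.executescript("""
  CREATE INDEX User_team_idx ON User(team);
  CREATE INDEX PostRead_createdAt_userId_idx ON PostRead(createdAt, userId);
""")

after = plan(query, ("blue", 0))   # the SCAN becomes a covering-index SEARCH
print(before)
print(after)
```

Running this shows the inner SCAN of PostRead disappear once the covering index exists, without needing any rows in the tables: the planner output depends only on schema and indexes.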

getRecentReads

Prisma query

Based on the Prisma operation, we can see that PostRead.postSlug, PostRead.createdAt and User.team are index candidates. Note that PostRead.createdAt is an inequality, so there will be some residual scanning. Our goal once again is to reduce that scan to only the relevant records.

prisma.postRead.count({
  where: {
    postSlug: slug,
    createdAt: {gt: withinTheLastSixMonths},
    user: {team},
  },
})

SQL query

The SQL aligns with our assessment above. Nothing additional to note here.

SELECT COUNT(*)
FROM (
  SELECT `main`.`PostRead`.`id`
  FROM `main`.`PostRead`
  WHERE (`main`.`PostRead`.`postSlug` = ?
    AND `main`.`PostRead`.`createdAt` > ?
    AND (`main`.`PostRead`.`id`) IN (
      SELECT `t0`.`id`
      FROM `main`.`PostRead` AS `t0`
      INNER JOIN `main`.`User` AS `j0` ON (`j0`.`id`) = (`t0`.`userId`)
      WHERE (`j0`.`team` = ? AND `t0`.`id` IS NOT NULL)
)) LIMIT ? OFFSET ?) AS `sub`

EXPLAIN QUERY PLAN before

The original query plan showed that User (j0) was read in full with a SCAN. This would imply that the team filter was applied during the scan. User.id is then used to match PostRead (t0) using the existing index @@index(userId). This match is then used to filter PostRead again in the outer query, but we see that the outer SEARCH does not include postSlug or createdAt in the criteria. We can again infer that these are applied in memory with a residual scan.

QUERY PLAN
|--CO-ROUTINE sub
|  |--SEARCH main.PostRead USING INDEX sqlite_autoindex_PostRead_1 (id=?)
|  `--LIST SUBQUERY 1
|     |--SCAN j0
|     `--SEARCH t0 USING INDEX PostRead_userId_idx (userId=?)
`--SCAN sub

Relevant Indexes

We'll add three indexes to optimize this query, but note that one is a repeat of above. One on User.team will improve the inner query's performance by limiting to only relevant records for the join. One on PostRead.userId is used by the inner query's JOIN. One on PostRead.postSlug and PostRead.createdAt will narrow the outer query, but in this case order is important.

If we flip the order to @@index([createdAt, postSlug]), then the query plan will not utilize the index. Why is that? Remember the callout that PostRead.createdAt is filtered on an inequality? Inequalities can only be partially utilized, as described in the SQLite query planner documentation. Index columns must be utilized from left to right, so having createdAt first with partial utilization eliminates our ability to utilize postSlug as an equality match (a SEARCH).
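The left-to-right rule can be checked directly with sqlite3. Again a sketch against a stand-in table, comparing the flipped column order against the recommended one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PostRead (id TEXT PRIMARY KEY, postSlug TEXT, createdAt INTEGER)")

query = "SELECT COUNT(*) FROM PostRead WHERE postSlug = ? AND createdAt > ?"

def detail(params):
    # Join the plan's detail strings into one line for easy inspection
    return " | ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + query, params))

# Inequality column first: postSlug cannot be used as an equality constraint
con.execute("CREATE INDEX flipped ON PostRead(createdAt, postSlug)")
flipped = detail(("some-slug", 0))

con.execute("DROP INDEX flipped")

# Equality column first: both columns constrain the SEARCH
con.execute("CREATE INDEX better ON PostRead(postSlug, createdAt)")
better = detail(("some-slug", 0))
print(flipped)
print(better)
```

With the equality column first, the plan shows both constraints, something like (postSlug=? AND createdAt>?); with the flipped order, postSlug never appears as a search constraint.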

model User {
  @@index([team])
}

model PostRead {
  // we only need @@index([userId]) here, but postSlug will be used later
  // since index columns are utilized left to right, it works here too
  // indexes impact write performance and take disk space, so reuse is ideal
  @@index([userId, postSlug])
  @@index([postSlug, createdAt])
}

EXPLAIN QUERY PLAN after

Our new query plan has replaced the inner SCAN with a specific SEARCH on User.team, followed by another SEARCH to join with PostRead.userId. Finally, the outer query is also a SEARCH on postSlug and createdAt. We've narrowed the results down to the absolute minimum to perform the count 🎉

QUERY PLAN
|--CO-ROUTINE sub
|  |--SEARCH main.PostRead USING INDEX PostRead_postSlug_createdAt_idx (postSlug=? AND createdAt>?)
|  `--LIST SUBQUERY 1
|     |--SEARCH j0 USING INDEX User_team_idx (team=?)
|     `--SEARCH t0 USING INDEX PostRead_userId_postSlug_idx (userId=?)
`--SCAN sub

getBlogReadRankings

getBlogReadRankings is nearly identical to optimizing getRecentReads, but without the PostRead.createdAt filter. This is great, because it will leverage the same indexes we created above. Since PostRead.createdAt was the rightmost column in our index, the index is still fully utilized 🎉
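To sanity-check the reuse claim, here's a sketch showing that the composite index still serves a postSlug-only equality filter, since postSlug is the leftmost column (stand-in table, not the real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PostRead (id TEXT PRIMARY KEY, postSlug TEXT, createdAt INTEGER)")
con.execute("CREATE INDEX PostRead_postSlug_createdAt_idx ON PostRead(postSlug, createdAt)")

# postSlug is the leftmost index column, so an equality filter on it
# alone can still SEARCH the composite index
details = [
    row[3]
    for row in con.execute(
        "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM PostRead WHERE postSlug = ?",
        ("some-slug",),
    )
]
print(details)
```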

Rather than repeat the whole process from above, I'll keep this one short 😄

getBlogRecommendations

This one is particularly interesting because there are two paths to optimize separately: user and clientId. I'll do my best to consolidate them here 🙂

Prisma query

Based on the Prisma operation, we can see that PostRead.postSlug is a good index candidate. Depending on the ternary, either PostRead.userId or PostRead.clientId would be useful as well.

const where = user
  ? {user: {id: user.id}, postSlug: {notIn: exclude.filter(Boolean)}}
  : {clientId, postSlug: {notIn: exclude.filter(Boolean)}}
prisma.postRead.groupBy({
  by: ['postSlug'],
  where,
})

SQL query

The SQL changes with the ternary, but follows this pattern.

SELECT `main`.`PostRead`.`postSlug`
FROM `main`.`PostRead`
WHERE (
  `main`.`PostRead`.`clientId` = ?
  AND `main`.`PostRead`.`postSlug` NOT IN (?,?,?,?,?,?,?,?,?,?,?))
GROUP BY `main`.`PostRead`.`postSlug`
LIMIT ? OFFSET ?

EXPLAIN QUERY PLAN before

The original query plan showed that the existing indexes for PostRead.clientId and PostRead.userId are being utilized. Awesome! But notice the PostRead.postSlug filter is missing. Let's see what happens if we fix that.

QUERY PLAN
|--SEARCH main.PostRead USING INDEX PostRead_clientId_idx (clientId=?)
`--USE TEMP B-TREE FOR GROUP BY

Relevant Indexes

We'll update the existing PostRead indexes to include postSlug as the right-most column. Order is important here, since any query that was previously utilizing the existing indexes (like getReaderCount and getSlugReadsByUser) can still use the new ones this way. Once again, we'll leverage the fact that index columns are utilized left-to-right.

model PostRead {
  // we'll replace the original indexes
  // @@index(userId)
  // @@index(clientId)

  // remember the callout for getRecentReads about including postSlug? this was why 😁
  @@index([userId, postSlug])
  @@index([clientId, postSlug])
}

EXPLAIN QUERY PLAN after

🤯 Our new query plan not only replaced the SEARCH with a covering index, but also eliminated the USE TEMP B-TREE FOR GROUP BY operation. I suppose this is because indexes are ordered, so the optimizer knows that equal postSlug values will be adjacent and can use that to avoid in-memory grouping.

Why is the postSlug not shown in the index filters? Probably because it's an inverted condition (NOT IN) so it's more efficient to match it in-memory instead. It's still valuable to include postSlug in the index because of the change to a covering index, which eliminates the additional table lookup and orders the values as mentioned above.

QUERY PLAN
`--SEARCH main.PostRead USING COVERING INDEX PostRead_clientId_postSlug_idx (clientId=?)
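The temp B-tree elimination can also be reproduced in a toy sqlite3 session (stand-in table, sketch only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PostRead (id TEXT PRIMARY KEY, clientId TEXT, postSlug TEXT)")

query = "SELECT postSlug FROM PostRead WHERE clientId = ? GROUP BY postSlug"

def details():
    return [r[3] for r in con.execute("EXPLAIN QUERY PLAN " + query, ("c1",))]

# Original single-column index: GROUP BY needs an in-memory sort
con.execute("CREATE INDEX PostRead_clientId_idx ON PostRead(clientId)")
before = details()   # includes USE TEMP B-TREE FOR GROUP BY

# Widened index: rows arrive ordered by postSlug within each clientId
con.execute("DROP INDEX PostRead_clientId_idx")
con.execute("CREATE INDEX PostRead_clientId_postSlug_idx ON PostRead(clientId, postSlug)")
after = details()    # covering index, no temp B-tree
print(before)
print(after)
```

The before plan includes the USE TEMP B-TREE FOR GROUP BY step; after widening the index it disappears and the lookup becomes a covering-index SEARCH.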

@@ -29,6 +29,7 @@ translations:
author:
name: Noelia Donato
link: https://twitter.com/vamoacodear
- language: Español
Contributor Author


I was having trouble populating the blog post cache without this change. Maybe it was just me though 🙂 It seemed consistent with the structure here, so I kept it for reference in case it is needed.

Owner


Thanks! That's definitely making the cache update fail.

Owner

@kentcdodds kentcdodds left a comment


Thanks a lot! I'm excited to see how this improves query performance.


@kentcdodds kentcdodds merged commit c756b74 into kentcdodds:main May 20, 2023
@kentcdodds
Owner

Also, amazing analysis. Thanks for going through all the explanation!

@rtbenfield
Contributor Author

Happy to help! I enjoyed diving in further. If you get the time, let me know how they impact the times you're logging. It's difficult to measure the real impact without the same data and infrastructure, so I'm curious how it turns out 😄
