Add cache on the indexes stats #3541

Merged: 6 commits merged into release-v1.1.0 from add-cache-on-the-index-stats on Mar 9, 2023

irevoire
Member

Fix #3540

@irevoire irevoire added the enhancement New feature or improvement label Feb 23, 2023
@irevoire irevoire added this to the v1.1.0 milestone Feb 23, 2023
@github-actions

github-actions bot commented Feb 23, 2023

Uffizzi Preview deployment-17252 was deleted.

Contributor

@dureuill dureuill left a comment

Thank you for this PR, I think it is a very good first step towards caching the stats of the indexes! 🎉

I have a few concerns to resolve before we can accept this PR:

  1. upgrade path: my understanding from reading the code is that running this code on an instance that has indexes from before this PR will lead to IndexNotFound errors until an update is executed on the index. Similarly, failing a stats write on index creation is not a cause for error (only a log), yet it will cause an IndexNotFound error in the stats route.
  2. support for index swapping: before this PR, since the stats are computed eagerly, swapping two indexes will also swap their stats. Because the new DB uses the name of the index as key, my understanding is that swapping the indexes won't update their stats, and so won't swap the stats at the same time as the indexes.
  3. persistent database vs in-memory cache: this PR implements the cache as a persistent DB. @Kerollmops expressed a preference for an in-memory cache, to avoid storing this redundant information on disk.

Regarding (1), I think the DB should be made optional, such that if an index has no entry in the cache, its stats are fetched eagerly and added to the cache. This will make upgrading seamless.

Regarding (2), I think this can be fixed by having the new DB use UUIDs as keys rather than the index names, and by using the existing name -> UUID table as an intermediary to retrieve the current correspondence between an index name and its stats.
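To illustrate the UUID-keyed idea for point (2), here is a minimal in-memory sketch. It is not the PR's code: `Mapper`, `Stats`, the `u128` stand-in for a UUID, and the use of `HashMap` instead of database tables are all assumptions made for the example. It only shows why swapping the name -> UUID entries is enough for the stats to follow the indexes.

```rust
use std::collections::HashMap;

/// Stand-in for a real UUID type (assumption: any unique, copyable id works here).
type IndexUuid = u128;

/// Hypothetical, simplified stats; the real `IndexStats` holds more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Stats {
    pub number_of_documents: u64,
}

#[derive(Default)]
pub struct Mapper {
    /// name -> UUID: plays the role of the existing mapping table.
    name_to_uuid: HashMap<String, IndexUuid>,
    /// UUID -> stats: plays the role of the new stats table.
    uuid_to_stats: HashMap<IndexUuid, Stats>,
}

impl Mapper {
    pub fn insert(&mut self, name: &str, uuid: IndexUuid, stats: Stats) {
        self.name_to_uuid.insert(name.to_string(), uuid);
        self.uuid_to_stats.insert(uuid, stats);
    }

    /// Swap two indexes by exchanging the UUIDs their names point to.
    /// The stats table is left untouched. Returns false if a name is unknown.
    pub fn swap(&mut self, lhs: &str, rhs: &str) -> bool {
        let (l, r) = match (self.name_to_uuid.get(lhs), self.name_to_uuid.get(rhs)) {
            (Some(&l), Some(&r)) => (l, r),
            _ => return false,
        };
        self.name_to_uuid.insert(lhs.to_string(), r);
        self.name_to_uuid.insert(rhs.to_string(), l);
        true
    }

    /// Resolve the name to a UUID first, then look up the stats by UUID.
    pub fn stats_of(&self, name: &str) -> Option<&Stats> {
        let uuid = self.name_to_uuid.get(name)?;
        self.uuid_to_stats.get(uuid)
    }
}
```

With stats keyed by name instead, `swap` would also have to rewrite two stats entries, which is the step the review suspects is missing.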

I don't have a strong opinion one way or the other regarding (3). The main advantage of the disk representation is that there won't be a "cold start" due to the need of populating the cache on startup (or first call to the stats route). The main drawback is that if we get the implementation wrong, any mistake appearing in edge cases will be more persistent and harder to correct. On the other hand, the lifetime of any entry being one index update, I don't think this is too much of a drawback.

For now, I intend to modify this PR to implement solutions for points (1) and (2). I don't intend to switch the DB to an in-memory representation. I feel like this can be easily changed anyway.

@dureuill
Contributor

dureuill commented Feb 28, 2023

> I think the DB should be made optional, such that if an index has no entry in the cache, its stats are fetched eagerly and added to the cache.

Actually, the "and added to the cache" part is going to be harder than I surmised, as the stats route is a reader of the DB and so can't typically write to it. This is a motivation for switching to an in-memory database that can be set behind a RwLock so it can be shared.
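For reference, the in-memory alternative mentioned here could look roughly like the sketch below. This is not what the PR implements: `StatsCache`, `get_or_compute`, `invalidate`, and the `u128` key are hypothetical names chosen for the example. It only shows how a `std::sync::RwLock` lets a caller that mostly reads still populate the cache on a miss.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Stand-in for a real UUID type (assumption for this sketch).
type IndexUuid = u128;

/// Hypothetical, simplified stats.
#[derive(Debug, Clone, PartialEq)]
pub struct Stats {
    pub number_of_documents: u64,
}

/// In-memory cache shared between readers and writers.
#[derive(Default)]
pub struct StatsCache {
    inner: RwLock<HashMap<IndexUuid, Stats>>,
}

impl StatsCache {
    /// Return the cached stats, or compute them and remember the result.
    pub fn get_or_compute<F>(&self, uuid: IndexUuid, compute: F) -> Stats
    where
        F: FnOnce() -> Stats,
    {
        // Fast path: a shared read lock, released at the end of this statement.
        let cached = self.inner.read().unwrap().get(&uuid).cloned();
        if let Some(stats) = cached {
            return stats;
        }
        // Slow path: compute without holding any lock, then store the result.
        let computed = compute();
        let mut guard = self.inner.write().unwrap();
        // If another thread filled the entry in the meantime, keep its value.
        let stats = guard.entry(uuid).or_insert(computed).clone();
        stats
    }

    /// Drop the entry, for example after an update changed the index.
    pub fn invalidate(&self, uuid: IndexUuid) {
        self.inner.write().unwrap().remove(&uuid);
    }
}
```

The trade-off discussed in the review still applies: such a cache starts empty after every restart, so the first call to the stats route pays the full computation cost.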

@dureuill
Contributor

dureuill commented Feb 28, 2023

Pushed an update with the following changes:

  • Rebase on main
  • Switch the DB to UUID -> Stats instead of str -> Stats
  • Fall back to eager computation of the stats if the stats DB is missing/incomplete: this eager computation is not cached, but since this is only for update purposes I don't think it is an issue.
  • Refactor all around to avoid reopening indexes
  • Fix an issue where we could reopen an index while already holding it, resulting in a deadlock if the index is evicted from the cache between its updates and the stat computation.
  • Finish fixing snapshots
  • Document the introduced functions and structures
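The fallback described in the list above can be sketched as follows. This is a simplified illustration rather than the PR's code: `stats_of`, `Source`, and the `HashMap` standing in for the stats DB are assumptions. The function only receives a shared reference to the store, mirroring a reader that cannot write, so an eagerly computed value is returned but never stored.

```rust
use std::collections::HashMap;

/// Stand-in for a real UUID type (assumption for this sketch).
type IndexUuid = u128;

/// Hypothetical, simplified stats.
#[derive(Debug, Clone, PartialEq)]
pub struct Stats {
    pub number_of_documents: u64,
}

/// Where the returned stats came from (illustration only).
#[derive(Debug, PartialEq)]
pub enum Source {
    Stored,
    ComputedEagerly,
}

/// Read path: use the stored entry when there is one, otherwise compute eagerly.
/// `stored` is borrowed immutably, so the computed value cannot be cached here.
pub fn stats_of<F>(
    stored: &HashMap<IndexUuid, Stats>,
    uuid: IndexUuid,
    compute_eagerly: F,
) -> (Stats, Source)
where
    F: FnOnce() -> Stats,
{
    match stored.get(&uuid) {
        Some(stats) => (stats.clone(), Source::Stored),
        None => (compute_eagerly(), Source::ComputedEagerly),
    }
}
```

This matches the behaviour reported in the tests that follow: indexes without an entry stay slow until an update writes their stats.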

Also performed some tests:

  • On a DB created before this change, running the stats route with >1000 indexes no longer returns index_not_found, but takes 8s to complete
  • After sending an update to most indexes, the stats route returns almost instantly.

@dureuill dureuill dismissed their stale review February 28, 2023 14:48

Performed some changes according to my review

@dureuill
Contributor

@irevoire doesn't look like I can request your review (which is a bit silly since I changed your PR, but...).

Also requesting @Kerollmops's review

@@ -1,6 +1,5 @@
---
source: index-scheduler/src/lib.rs
assertion_line: 1755
Contributor

I don't really have any idea why there was an assertion_line here and what it means and why it is not here anymore.

Member Author

Don't know either but if that makes insta happy I'm 100% for it 😂

@curquiza
Member

curquiza commented Mar 6, 2023

@irevoire can you rebase the branch to point to release-v1.1.0?

Member Author

@irevoire irevoire left a comment

Nice

index-scheduler/src/index_mapper/mod.rs
@irevoire irevoire force-pushed the add-cache-on-the-index-stats branch from 60e26cf to 76288fa Compare March 6, 2023 15:57
index-scheduler/src/index_mapper/mod.rs
Comment on lines 431 to 444
    pub fn store_stats_of(
        &self,
        wtxn: &mut RwTxn,
        index_uid: &str,
        stats: IndexStats,
    ) -> Result<()> {
        let uuid = self
            .index_mapping
            .get(wtxn, index_uid)?
            .ok_or_else(|| Error::IndexNotFound(index_uid.to_string()))?;

        self.index_stats.put(wtxn, &uuid, &stats)?;
        Ok(())
    }
Member

Suggested change
      pub fn store_stats_of(
          &self,
          wtxn: &mut RwTxn,
          index_uid: &str,
    -     stats: IndexStats,
    +     stats: &IndexStats,
      ) -> Result<()> {
          let uuid = self
              .index_mapping
              .get(wtxn, index_uid)?
              .ok_or_else(|| Error::IndexNotFound(index_uid.to_string()))?;
    -     self.index_stats.put(wtxn, &uuid, &stats)?;
    +     self.index_stats.put(wtxn, &uuid, stats)?;
          Ok(())
      }
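
A small self-contained sketch of why taking the stats by reference is convenient. Everything here is hypothetical: `Store` is an in-memory stand-in for the stats database, and the 8-byte encoding only imitates a store that serializes from a borrowed value. The point is that the caller keeps ownership of `stats` after storing it.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified version of the stats type.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexStats {
    pub number_of_documents: u64,
}

/// In-memory stand-in for the stats database: it keeps encoded bytes,
/// so storing only needs to read the stats, never to own them.
#[derive(Default)]
pub struct Store {
    entries: HashMap<String, [u8; 8]>,
}

impl Store {
    /// Borrowing `stats` is enough because the value is only encoded.
    pub fn store_stats_of(&mut self, index_uid: &str, stats: &IndexStats) {
        let bytes = stats.number_of_documents.to_le_bytes();
        self.entries.insert(index_uid.to_string(), bytes);
    }

    /// Decode the stored bytes back into stats, if the index is known.
    pub fn load_stats_of(&self, index_uid: &str) -> Option<IndexStats> {
        let bytes = self.entries.get(index_uid)?;
        Some(IndexStats { number_of_documents: u64::from_le_bytes(*bytes) })
    }
}

/// The caller can keep using `stats` after storing it, with no clone.
pub fn store_then_report(store: &mut Store, index_uid: &str, stats: IndexStats) -> u64 {
    store.store_stats_of(index_uid, &stats);
    stats.number_of_documents
}
```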

Contributor

changed

meilisearch/src/routes/mod.rs
@curquiza
Member

curquiza commented Mar 6, 2023

(In case you missed it: this is still not the right branch to merge into, it should be release-v1.1.0 😇)

@irevoire irevoire changed the base branch from main to release-v1.1.0 March 7, 2023 09:28
@dureuill
Contributor

dureuill commented Mar 7, 2023

Updated:

  • Restore contribution of the index sizes to the db size
    • the index size now contributes to the db size even if the index is not authorized
  • Pass IndexStats by ref in store_stats_of
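
The first item above can be illustrated with a short sketch. The types and the `global_stats` function are assumptions made for the example, not the actual route code: every index adds its size to the database total, while per-index details are only reported for indexes the caller is authorized to see.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical, simplified per-index stats.
pub struct IndexStats {
    pub database_size: u64,
    pub number_of_documents: u64,
}

/// Hypothetical aggregate returned to the caller.
pub struct GlobalStats {
    /// Sum of the sizes of all indexes, authorized or not.
    pub database_size: u64,
    /// Document counts, only for the authorized indexes.
    pub indexes: BTreeMap<String, u64>,
}

pub fn global_stats(all: &[(String, IndexStats)], authorized: &HashSet<String>) -> GlobalStats {
    let mut database_size = 0;
    let mut indexes = BTreeMap::new();
    for (name, stats) in all {
        // The size always counts toward the total.
        database_size += stats.database_size;
        // Details are exposed only when the index is authorized.
        if authorized.contains(name) {
            indexes.insert(name.clone(), stats.number_of_documents);
        }
    }
    GlobalStats { database_size, indexes }
}
```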

@dureuill dureuill requested a review from Kerollmops March 9, 2023 08:48
Member

@Kerollmops Kerollmops left a comment

Thank you for the PR!

However, could you please create an issue describing that the stats are now computing the index disk size too? And make this PR close it?

Thank you 🦪

@dureuill
Contributor

dureuill commented Mar 9, 2023

This PR also fixes #3578

@dureuill dureuill linked an issue Mar 9, 2023 that may be closed by this pull request
@dureuill
Contributor

dureuill commented Mar 9, 2023

bors merge

@bors
Contributor

bors bot commented Mar 9, 2023

@bors bors bot merged commit 667bb87 into release-v1.1.0 Mar 9, 2023
@bors bors bot deleted the add-cache-on-the-index-stats branch March 9, 2023 14:35
@meili-bot meili-bot added the v1.1.0 PRs/issues solved in v1.1.0 released on 2023-04-03 label Apr 6, 2023
Development

Successfully merging this pull request may close these issues.

  • Compute the database size from all indexes in stats
  • Cache the result of the indexes stats