Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Reintroduce in-flight state caching changes (backed out because of poor performance) #12116

Closed
reivilibre opened this issue Mar 1, 2022 · 3 comments
Assignees
Labels
T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@reivilibre
Copy link
Contributor

reivilibre commented Mar 1, 2022

Not much information to speak of at current, but this graph seems bad:

image

CPU usage has been poor on client readers, as well.

(m.org is running the commit on matrix-org-hotfixes from 7 days ago)

#10870 and friends are my first suspect. (They strip out batching when fetching state groups), but it would be worth confirming that...

@reivilibre reivilibre added X-Release-Blocker Must be resolved before making a release X-Regression Something broke which worked on a previous release labels Mar 1, 2022
@reivilibre reivilibre self-assigned this Mar 1, 2022
@reivilibre
Copy link
Contributor Author

reivilibre commented Mar 2, 2022

After backing it out for a while (since 15h40 yest'day), we can see:
image

DB scheduling looks healthier, though I note that since backing it out there has been a higher peak of DB txn time...

@reivilibre reivilibre removed X-Release-Blocker Must be resolved before making a release X-Regression Something broke which worked on a previous release labels Mar 2, 2022
@reivilibre reivilibre changed the title _get_state_groups_for_groups seems to cause poor performance / is harmful to the database pool scheduler Reintroduce in-flight state caching changes (backed out because of poor performance) Mar 2, 2022
@reivilibre
Copy link
Contributor Author

The changes were backed out by #12126 — will need to think about how to reintroduce a fix for the original issue (#10301) safely.

@reivilibre reivilibre added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Mar 2, 2022
@reivilibre
Copy link
Contributor Author

We still had some OOMs after using the updated branch. I'm not sure it's giving us all that much. I'm going to close this issue and see how #12522 does in addressing #10301 — in our most recent case it seemed like backfill was the main culprit.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

No branches or pull requests

1 participant