storage hotfix release 2024-09-20 #9085
Merged
Conversation
Signed-off-by: Tristan Partin <tristan@neon.tech>
This also reduces the GRANT statements to one per created `_reset` function.
This should generally be faster when running tests, especially those that run with higher scales. test_lfc_resize is ignored since it seems like we are hitting a query timeout for some reason that I have yet to investigate. A little bit of improvement is better than none.

Signed-off-by: Tristan Partin <tristan@neon.tech>
This is in preparation for replacing neon_fixtures.get_dir_size with neon_fixtures.utils.get_dir_size() in the next commit.
There was another copy of it in utils.py. The only difference is that the version in utils.py tolerates files that are concurrently removed. That seems fine for the few callers in neon_fixtures.py too.
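The helper itself lives in the Python test suite, but the "tolerate concurrently removed files" behaviour is the interesting part. Purely as an illustration of that idea, here is a minimal sketch in Rust (the function name and error handling are illustrative, not the actual fixture code):

```rust
use std::{fs, io, path::Path};

/// Sum the sizes of all files under `path`, skipping entries that
/// disappear while we are walking the tree (e.g. temp files being
/// cleaned up concurrently by another process).
fn dir_size(path: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(path)? {
        let entry = match entry {
            Ok(e) => e,
            // Entry vanished between listing and inspecting it: ignore it.
            Err(e) if e.kind() == io::ErrorKind::NotFound => continue,
            Err(e) => return Err(e),
        };
        match entry.metadata() {
            Ok(md) if md.is_dir() => total += dir_size(&entry.path())?,
            Ok(md) => total += md.len(),
            Err(e) if e.kind() == io::ErrorKind::NotFound => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(total)
}
```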
pg_distrib_dir doesn't include the Postgres version and only depends on env variables which cannot change during a test run, so it can be marked as session-scoped. Similarly, the platform cannot change during a test run.
…eader (#8954)

## Problem

It turns out that we can't rely on external orchestration to promptly route traffic to the new leader. This is downtime inducing. Forwarding provides a safe way out.

## Safety

We forward when:
1. Request is not one of ["/control/v1/step_down", "/status", "/ready", "/metrics"]
2. Current instance is in [`LeadershipStatus::SteppedDown`] state
3. There is a leader in the database to forward to
4. Leader from step (3) is not the current instance

If a storcon instance is persisted in the database, then we know that it is the current leader. There's one exception: the time between handling the step-down request and the new leader updating the database. Let's treat the happy case first. The stepped-down node does not produce any side effects, since all request handling happens on the leader. As for the edge case, we are guaranteed to always have a maximum of two running instances. Hence, if we are in the edge-case scenario, the leader persisted in the database is the stepped-down instance that received the request. Condition (4) above covers this scenario.

## Summary of changes

* Conversion utilities for reqwest <-> hyper. I'm not happy with these, but I don't see a better way. Open to suggestions.
* Add request forwarding logic
* Update each request handler. Again, not happy with this. If anyone knows a nice way to wrap the handlers, let me know. Joonas and I tried :/
* Update each handler to maybe forward
* Tweak tests to showcase the new behaviour
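A hedged sketch of the forwarding decision described in the four conditions above. The names (`LeadershipStatus`, `should_forward`, the address parameters) are illustrative; the real handler wiring and the reqwest/hyper conversions live in the storage controller's HTTP layer:

```rust
/// Paths that must always be served locally, even by a stepped-down node.
const LOCAL_ONLY_PATHS: &[&str] =
    &["/control/v1/step_down", "/status", "/ready", "/metrics"];

#[derive(PartialEq)]
enum LeadershipStatus {
    Leader,
    SteppedDown,
}

/// Decide whether a request should be forwarded to the leader persisted in
/// the database. Returns the leader's address if all four conditions hold.
fn should_forward(
    request_path: &str,
    status: &LeadershipStatus,
    leader_address: Option<&str>,
    own_address: &str,
) -> Option<String> {
    // (1) Never forward the control/health/metrics endpoints.
    if LOCAL_ONLY_PATHS.contains(&request_path) {
        return None;
    }
    // (2) Only a stepped-down instance forwards.
    if *status != LeadershipStatus::SteppedDown {
        return None;
    }
    // (3) A leader must be recorded in the database, and
    // (4) it must not be this instance (covers the window between the
    //     step-down request and the new leader updating the database).
    match leader_address {
        Some(leader) if leader != own_address => Some(leader.to_string()),
        _ => None,
    }
}
```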
## Problem

We've got 2 non-blocking failures on the release pipeline:
- `promote-compatibility-data` job got skipped _presumably_ because one of the dependencies of the `deploy` job (`push-to-acr-dev`) got skipped (#8940)
- `coverage-report` job fails because we don't build debug artifacts in the release branch (#8561)

## Summary of changes

- Always run the `push-to-acr-dev` / `push-to-acr-prod` jobs, but add a `skip_if` parameter to the reusable workflow, which can skip the job internally, without skipping it externally
- Do not run `coverage-report` on release branches
Found by "cargo machete"
We added another migration in 5876c44, but didn't bump this value. This had no effect, but best to fix it anyway.

Signed-off-by: Tristan Partin <tristan@neon.tech>
In user-facing messages, such as log messages, we mentioned "S3", which is not entirely accurate now that we also support Azure blob storage.
## Problem

We use `actions/checkout` with `fetch-depth: 0` even when it's not required.

## Summary of changes

- Remove unneeded `fetch-depth: 0`
- Add a comment where `fetch-depth: 0` is required
Immediate benefit: easier to spot what's going on. Later benefit: use the extracted method in PR #8952, which adds a `ping` command to walredo. Found this useful during the investigation of neondatabase/cloud#16886.
… tests (#8948)

There's currently no way to just start/stop the broker from `neon_local`. This PR
* adds a sub-command
* uses that sub-command from the test suite instead of the pre-existing Python `subprocess`-based approach.

Found this useful during the investigation of neondatabase/cloud#16886.
Commit ca5390a made a similar change to DeltaLayerWriter. Stas and I bumped into this with our hackathon project: a standalone program that creates image layers directly from a Postgres data directory. It needs to create image layers without having a Timeline and other pageserver machinery.

This downgrades the "created image layer {}" message from INFO to TRACE level. TRACE is used for the corresponding message on delta layer creation too. The path logged in the message is now the temporary path, before the file is renamed to its final name. Again, commit ca5390a made the same change for the message on delta layer creation.
Signed-off-by: Tristan Partin <tristan@neon.tech>
(Found this useful during the investigation of neondatabase/cloud#16886.)

Problem
-------

Before this PR, `neon_local` sequentially does the following:
1. launch the storcon process
2. wait for storcon to signal readiness [here](https://github.com/neondatabase/neon/blob/75310fe441b87d399213e365f1364aa9f08aa40d/control_plane/src/storage_controller.rs#L804-L808)
3. start the pageserver
4. wait for the pageserver to become ready [here](https://github.com/neondatabase/neon/blob/c43e664ff577d4568722e4e7a2b2c6267b609607/control_plane/src/pageserver.rs#L343-L346)
5. etc.

The problem is that storcon's readiness waits for the [`startup_reconcile`](https://github.com/neondatabase/neon/blob/cbcd4058edb7a2c2bb3bfe1a6fc1ffb0d820b870/storage_controller/src/service.rs#L520-L523) to complete. But pageservers aren't started at this point. So, in the worst case we wait for `STARTUP_RECONCILE_TIMEOUT/2`, i.e., 15s. This is more than the 10s default timeout allowed by neon_local. The result is that `neon_local start` fails to start storcon and stops everything.

Solution
--------

In this PR I chose the radical solution of starting everything in parallel. It junks up the output because we do stuff like `print!(".")` to indicate progress. We should just abandon that and switch to `utils::logging` + `tracing` with separate spans for each component. I can do that in this PR or we can leave it as a follow-up.

Alternatives Considered
-----------------------

The Pageserver's `/v1/status`, or in fact any endpoint of the mgmt API, will not `accept()` on the mgmt API socket until after the `re-attach` call to storcon has returned success. So, it's insufficient to change the startup order to start pageservers first. We cannot easily change the pageserver startup order because `init_tenant_mgr` must complete before we start serving the mgmt API. Otherwise tenant detach calls et al. can race with `init_tenant_mgr`. We'd have to add a "loading" state to the tenant mgr and make all API endpoints except `/v1/status` wait for _that_ to complete.

Related
-------

- #6475
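A minimal sketch of the "start everything in parallel" approach described in the Solution section, assuming tokio and anyhow; the `start_*` helpers are hypothetical stand-ins, the actual `neon_local` code differs:

```rust
// Hypothetical async start helpers standing in for the real component launchers.
async fn start_storage_controller() -> anyhow::Result<()> { Ok(()) }
async fn start_pageserver(_id: u32) -> anyhow::Result<()> { Ok(()) }
async fn start_safekeeper(_id: u32) -> anyhow::Result<()> { Ok(()) }

async fn start_all() -> anyhow::Result<()> {
    // Launch every component concurrently, so that storcon's
    // startup_reconcile can observe the pageservers coming up instead of
    // storcon's readiness wait timing out while the pageservers sit behind it.
    let (storcon, ps, sk) = tokio::join!(
        start_storage_controller(),
        start_pageserver(1),
        start_safekeeper(1),
    );
    storcon?;
    ps?;
    sk?;
    Ok(())
}
```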
When I checked the log in Grafana I couldn't find the scrubber version. Then I realized that it should be logged after the logger gets initialized.

## Summary of changes

Log the version after initializing the logger for the scrubber.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Dead code is generally useless, but with Postgres constants in particular, I'm also worried that if they're not used anywhere, we might fail to update them at a Postgres version update, and get very confused later when they have wrong values.
…is blocked on sleep (#9072)

## Problem

It seems that the pageserver might be too eager in reporting throttled tasks.

## Summary of changes

Introduce a sleep counter. If the sleep counter increases, then the acquire task was actually throttled.
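A sketch of the sleep-counter idea, assuming a throttle that sometimes makes the acquire path sleep; the names (`SLEEP_COUNT`, `acquire_with_throttle`) are illustrative, not the pageserver's actual throttle code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Counts how many times an acquire actually had to sleep.
static SLEEP_COUNT: AtomicU64 = AtomicU64::new(0);

/// Hypothetical acquire path: bump the counter only when we really sleep,
/// so a task is reported as throttled only if the counter increased,
/// not merely because it passed through the throttle.
async fn acquire_with_throttle(wait: Option<Duration>) {
    if let Some(wait) = wait {
        SLEEP_COUNT.fetch_add(1, Ordering::Relaxed);
        tokio::time::sleep(wait).await;
    }
}

/// Caller-side check: compare counter snapshots around the acquire.
async fn was_throttled(wait: Option<Duration>) -> bool {
    let before = SLEEP_COUNT.load(Ordering::Relaxed);
    acquire_with_throttle(wait).await;
    SLEEP_COUNT.load(Ordering::Relaxed) > before
}
```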
## Problem

When layer visibility was added, an info log was included for the situation where actual access to a layer disagrees with the visibility calculation. This situation is safe, but I was interested in seeing when it happens. The log is pretty high volume, so this PR refines it to fire less often.

## Summary of changes

- For cases where accessing non-visible layers is normal, don't log at all.
- Extend a unit test to increase confidence that the updates to visibility on access are working as expected.
- During compaction, only call the visibility calculation routine if some image layers were created: previously, frequent calls resulted in the visibility of layers getting reset every time we passed through create_image_layers (see the sketch below).
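A hedged sketch of the compaction-side guard described in the last bullet; the function names are illustrative, not the real compaction code:

```rust
/// Illustrative only: skip the visibility recalculation when compaction
/// produced no new image layers, so per-layer visibility is not reset on
/// every pass through image layer creation.
fn maybe_update_visibility(new_image_layers: usize, update_visibility: impl FnOnce()) {
    if new_image_layers > 0 {
        update_visibility();
    }
}
```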
…erstand throttle queue depth (#9077)
…/2024-09-20-hotfix
problame requested review from hlinnaka, arssher, cloneable, Omrigan and nikitakalyanov and removed the request for a team on September 20, 2024 at 18:52
problame requested review from arpad-m and removed the request for hlinnaka, arssher, cloneable, Omrigan and nikitakalyanov on September 20, 2024 at 18:56
koivunej approved these changes on Sep 20, 2024
This storage hotfix release adds valuable metrics to the pageserver.
We will only deploy this hotfix manually to a dedicated pageserver that is currently empty.
Context https://neondb.slack.com/archives/C07MU9ES6NP/p1726827244185729
Created using