Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage hotfix release 2024-09-20 #9085

Merged
merged 50 commits into from
Sep 20, 2024
Merged

Conversation

problame
Copy link
Contributor

@problame problame commented Sep 20, 2024

This storage hotfix release adds valuable metrics to pageserver.

We will only deploy this hotfix manually to a dedicated pageserver that is currently empty.

Context https://neondb.slack.com/archives/C07MU9ES6NP/p1726827244185729

Created using

git switch -c releases/2024-09-20-hotfix
git reset --hard origin/release
git merge ec5dce04ebfa51b727dfc9bc04ebb1e68aef6434

tristan957 and others added 30 commits September 16, 2024 16:38
Signed-off-by: Tristan Partin <tristan@neon.tech>
This also reduces the GRANT statements to one per created _reset
function
This should generally be faster when running tests, especially those
that run with higher scales.

Ignoring test_lfc_resize since it seems like we are hitting a query
timeout for some reason that I have yet to investigate. A little bit of
improvemnt is better than none.

Signed-off-by: Tristan Partin <tristan@neon.tech>
This is in preparation of replacing neon_fixtures.get_dir_size with
neon_fixtures.utils.get_dir_size() in next commit.
There was another copy of it in utils.py. The only difference is that
the version in utils.py tolerates files that are concurrently
removed. That seems fine for the few callers in neon_fixtures.py too.
pg_distrib_dir doesn't include the Postgres version and only depends
on env variables which cannot change during a test run, so it can be
marked as session-scoped. Similarly, the platform cannot change during
a test run.
…eader (#8954)

## Problem
It turns out that we can't rely on external orchestration to promptly
route trafic to the new leader. This is downtime inducing.
Forwarding provides a safe way out.

## Safety
We forward when:
1. Request is not one of ["/control/v1/step_down", "/status", "/ready",
"/metrics"]
2. Current instance is in [`LeadershipStatus::SteppedDown`] state
3. There is a leader in the database to forward to
4. Leader from step (3) is not the current instance

If a storcon instance is persisted in the database, then we know that it
is the current leader.
There's one exception: time between handling step-down request and the
new leader updating the
database.

Let's treat the happy case first. The stepped down node does not produce
any side effects,
since all request handling happens on the leader.

As for the edge case, we are guaranteed to always have a maximum of two
running instances.
Hence, if we are in the edge case scenario the leader persisted in the
database is the
stepped down instance that received the request. Condition (4) above
covers this scenario.

## Summary of changes
* Conversion utilities for reqwest <-> hyper. I'm not happy with these,
but I don't see a better way. Open to suggestions.
* Add request forwarding logic
* Update each request handler. Again, not happy with this. If anyone
knows a nice to wrap the handlers, lmk. Me and Joonas tried :/
* Update each handler to maybe forward
* Tweak tests to showcase new behaviour
## Problem

We've got 2 non-blocking failures on the release pipeline:
- `promote-compatibility-data` job got skipped _presumably_ because one
of the dependencies of `deploy` job (`push-to-acr-dev`) got skipped
(#8940)
- `coverage-report` job fails because we don't build debug artifacts in
the release branch (#8561)

## Summary of changes
- Always run `push-to-acr-dev` / `push-to-acr-prod` jobs, but add
`skip_if` parameter to the reusable workflow, which can skip the job
internally, without skipping externally
- Do not run `coverage-report` on release branches
We added another migration in 5876c44,
but didn't bump this value. This had no effect, but best to fix it
anyway.

Signed-off-by: Tristan Partin <tristan@neon.tech>
In exposed messages like log messages we mentioned "S3", which is not
entirely accurate as we support Azure blob storage now as well.
## Problem
We do use `actions/checkout` with `fetch-depth: 0` when it's not
required

## Summary of changes
- Remove unneeded `fetch-depth: 0`
- Add a comment if `fetch-depth: 0` is required
Immediate benefit: easier to spot what's going on.

Later benefit: use the extracted method in PR

- #8952

which adds a `ping` command to walredo.

Found this useful during investigation
neondatabase/cloud#16886.
… tests (#8948)

There's currently no way to just start/stop broker from `neon_local`.

This PR
* adds a sub-command
* uses that sub-command from the test suite instead of the pre-existing
Python `subprocess` based approach.

Found this useful during investigation
neondatabase/cloud#16886.
)

Commit ca5390a made a similar change to DeltaLayerWriter.

We bumped into this with Stas with our hackathon project, to create a
standalong program to create image layers directly from a Postgres data
directory. It needs to create image layers without having a Timeline and
other pageserver machinery.

This downgrades the "created image layer {}" message from INFO to TRACE
level. TRACE is used for the corresponding message on delta layer
creation too. The path logged in the message is now the temporary path,
before the file is renamed to its final name. Again commit ca5390a
made the same change for the message on delta layer creation.
## Problem

It turns out the previous approach (with `skip_if` input) doesn't work
(from #9017).
Revert it and use more straightforward if-conditions

## Summary of changes
- Revert efbe8db
- Add if-condition to`promote-compatibility-data` job and relevant
comments
Signed-off-by: Tristan Partin <tristan@neon.tech>
(Found this useful during investigation
neondatabase/cloud#16886.)

Problem
-------

Before this PR, `neon_local` sequentially does the following:
1. launch storcon process
2. wait for storcon to signal readiness
[here](https://github.com/neondatabase/neon/blob/75310fe441b87d399213e365f1364aa9f08aa40d/control_plane/src/storage_controller.rs#L804-L808)
3. start pageserver
4. wait for pageserver to become ready
[here](https://github.com/neondatabase/neon/blob/c43e664ff577d4568722e4e7a2b2c6267b609607/control_plane/src/pageserver.rs#L343-L346)
5. etc

The problem is that storcon's readiness waits for the
[`startup_reconcile`](https://github.com/neondatabase/neon/blob/cbcd4058edb7a2c2bb3bfe1a6fc1ffb0d820b870/storage_controller/src/service.rs#L520-L523)
to complete.

But pageservers aren't started at this point.

So, worst case we wait for `STARTUP_RECONCILE_TIMEOUT/2`, i.e., 15s.

This is more than the 10s default timeout allowed by neon_local.

So, the result is that `neon_local start` fails to start storcon and
stops everything.

Solution
--------

In this PR I choose the the radical solution to start everything in
parallel.

It junks up the output because we do stuff like `print!(".")` to
indicate progress.
We should just abandon that.
And switch to `utils::logging` + `tracing` with separate spans for each
component.
I can do that in this PR or we leave it as a follow-up.

Alternatives Considered
-----------------------

The Pageserver's `/v1/status` or in fact any endpoint of the mgmt API
will not `accept()` on the mgmt API socket until after the `re-attach`
call to storcon returned success.

So, it's insufficient to change the startup order to start Pageservers
first.

We cannot easily change Pageserver startup order because
`init_tenant_mgr` must complete before we start serving the mgmt API.
Otherwise tenant detach calls et al can race with `init_tenant_mgr`.

We'd have to add a "loading" state to tenant mgr and make all API
endpoints except `/v1/status` wait for _that_ to complete.


Related
-------

- #6475
When I checked the log in Grafana I couldn't find the scrubber version.
Then I realized that it should be logged after the logger gets
initialized.

## Summary of changes

Log after initializing the logger for the scrubber.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Dead code is generally useless, but with Postgres constants in
particular, I'm also worried that if they're not used anywhere, we
might fail to update them at a Postgres version update, and get very
confused later when they have wrong values.
conradludgate and others added 4 commits September 20, 2024 16:09
…is blocked on sleep (#9072)

## Problem

Seems that PS might be too eager in reporting throttled tasks

## Summary of changes

Introduce a sleep counter. If the sleep counter increases, then the
acquire tasks was throttled.
## Problem

When layer visibility was added, an info log was included for the
situation where actual access to a layer disagrees with the visibility
calculation. This situation is safe, but I was interested in seeing when
it happens.

The log is pretty high volume, so this PR refines it to fire less often.

## Summary of changes

- For cases where accessing non-visible layers is normal, don't log at
all.
- Extend a unit test to increase confidence that the updates to
visibility on access are working as expected
- During compaction, only call the visibility calculation routine if
some image layers were created: previously, frequent calls resulted in
the visibility of layers getting reset every time we passed through
create_image_layers.
@problame problame requested review from a team as code owners September 20, 2024 18:52
@problame problame requested review from hlinnaka, arssher, cloneable, Omrigan and nikitakalyanov and removed request for a team September 20, 2024 18:52
@problame problame changed the title storage hotfix 2024-09-20 storage hotfix release 2024-09-20 Sep 20, 2024
@problame problame requested review from arpad-m and removed request for hlinnaka, arssher, cloneable, Omrigan and nikitakalyanov September 20, 2024 18:56
@problame problame merged commit ec0550e into release Sep 20, 2024
30 of 46 checks passed
@problame problame deleted the releases/2024-09-20-hotfix branch September 20, 2024 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.