tests: start adding tests for secondary mode, live migration #5842

Merged
jcsp merged 9 commits into main from jcsp/secondary-locations-tests
Dec 11, 2023

Contributor

@jcsp jcsp commented Nov 9, 2023

These tests have been loitering on a branch of mine for a while: they already provide value even without all the secondary mode bits landed yet, and the Workload helper is handy for other tests too.

  • Workload is a re-usable test workload that replaces some of the arbitrary "write a few rows" SQL that I've found myself repeating, and adds a systematic way to append data and check that reads reflect the changes. This append+validate approach is important when doing migrations, as we want to detect situations where we might be reading from a pageserver that has not seen the latest changes.
  • test_multi_attach is a validation of how the pageserver handles attaching the same tenant to multiple pageservers, from a safety point of view. This is intentionally separate from the larger testing of migration, to provide an isolated environment for multi-attachment.
  • test_location_conf_churn is a pseudo-random walk through the various states that TenantSlot can be put into, with validation that attached tenants remain externally readable when they should, and as a side effect validating that the compute endpoint's online configuration changes work as expected.
  • test_live_migration is the reference implementation of how to drive a pair of pageservers through a zero-downtime migration of a tenant.
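
To illustrate the append+validate idea behind Workload, here is a minimal sketch. It is not the actual helper from this PR: the class name `AppendOnlyWorkload`, its methods, and the in-memory list standing in for the database are all assumptions made for illustration.

```python
from typing import Callable, List


class AppendOnlyWorkload:
    """Hypothetical stand-in for an append+validate workload (not the real fixture)."""

    def __init__(self, write_fn: Callable[[List[int]], None]) -> None:
        self._write_fn = write_fn
        self.expect_rows = 0  # number of rows we believe have been written so far

    def write_rows(self, n: int) -> None:
        # Append rows with consecutive ids so a later read can be checked exactly.
        start = self.expect_rows
        self._write_fn(list(range(start, start + n)))
        self.expect_rows += n

    def validate(self, read_fn: Callable[[], List[int]]) -> None:
        # A reader that has not seen the latest appends returns fewer rows and fails here.
        rows = read_fn()
        assert rows == list(range(self.expect_rows)), (
            f"stale or corrupt read: expected {self.expect_rows} rows, got {len(rows)}"
        )


# Usage, with an in-memory list standing in for the database.
store: List[int] = []
workload = AppendOnlyWorkload(store.extend)
workload.write_rows(10)
workload.validate(lambda: list(store))

stale_snapshot = list(store)  # models a reader that stops seeing new writes
workload.write_rows(5)
workload.validate(lambda: list(store))  # an up-to-date reader passes
```

Validating against `stale_snapshot` after the second append would raise an `AssertionError`, which is the kind of stale read a migration test wants to catch.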

@jcsp jcsp added the c/storage/pageserver (Component: storage: pageserver), a/test (Area: related to testing), and a/tech_debt (Area: related to tech debt) labels Nov 9, 2023

github-actions bot commented Nov 9, 2023

2178 tests run: 2093 passed, 0 failed, 85 skipped (full report)


Flaky tests (4)

Postgres 16

Postgres 14

  • test_crafted_wal_end[last_wal_record_xlog_switch_ends_on_page_boundary]: debug
  • test_statvfs_pressure_min_avail_bytes: debug
  • test_statvfs_pressure_usage: debug

Code coverage (full report)

  • functions: 55.2% (9431 of 17080 functions)
  • lines: 82.3% (54575 of 66296 lines)

The comment gets automatically updated with the latest test results
9a7f20d at 2023-12-11T16:40:52.732Z :recycle:

jcsp added a commit that referenced this pull request Nov 30, 2023
- During migration of tenants, it is useful for callers to
`/location_conf` to flush a tenant's layers while transitioning to
AttachedStale: this optimization reduces the redundant WAL replay work
that the tenant's new attached pageserver will have to do. Test coverage
for this will come as part of the larger tests for live migration in
#5745 #5842
- Flushing is controlled with the `flush_ms` query parameter: it is the
caller's job to decide how long they want to wait for a flush to
complete. If the flush does not complete within the time limit, the
pageserver returns success anyway: flushing is only an optimization.
- Add swagger definitions for all this: the location_config API is the
primary interface for driving tenant migration as described in
docs/rfcs/028-pageserver-migration.md, and will eventually replace the
various /attach /detach /load /ignore APIs.
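
The "wait up to `flush_ms`, then succeed regardless" behaviour described above can be sketched roughly as follows. This is an illustration only, not the pageserver's implementation (which is not Python): the function name `flush_with_deadline` and the use of a background thread are assumptions.

```python
import threading
import time
from typing import Callable


def flush_with_deadline(flush: Callable[[], None], flush_ms: int) -> bool:
    """Start `flush` in the background and wait at most `flush_ms` milliseconds.

    Returns True if the flush finished in time, False otherwise. Either way the
    caller carries on: the flush is treated as an optimization, not a requirement.
    """
    done = threading.Event()

    def run() -> None:
        flush()
        done.set()

    # Daemon thread so an unfinished flush does not block interpreter exit.
    threading.Thread(target=run, daemon=True).start()
    return done.wait(timeout=flush_ms / 1000.0)


# Usage: a quick flush completes within its budget, a slow one does not.
fast = flush_with_deadline(lambda: time.sleep(0.01), flush_ms=1000)
slow = flush_with_deadline(lambda: time.sleep(0.5), flush_ms=50)
```

In both cases the caller would go on to report success; the return value only says whether the optimization took effect.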

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Previously we were rapid-retrying, which worked when
doing local loads for activating tenants, but gave us
very little time for tenants that actually take
some time to activate, e.g. in tests that use fault
injection for remote storage, or tenants with lots of layers.
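
As a rough sketch of the alternative to rapid retrying, the loop below polls with a pause between attempts, so the total wait is `iterations * interval_s` rather than however long a tight loop happens to take. The helper name `wait_until_active` and the state strings are assumptions for illustration, not the test suite's actual API.

```python
import time
from typing import Callable


def wait_until_active(get_state: Callable[[], str], iterations: int, interval_s: float) -> int:
    """Poll `get_state` until it returns "Active"; return the number of polls used."""
    for attempt in range(1, iterations + 1):
        if get_state() == "Active":
            return attempt
        # Pause between polls so a slow activation gets real wall-clock time.
        time.sleep(interval_s)
    raise TimeoutError(f"tenant not active after {iterations} polls")


# Usage: a fake tenant that reports two intermediate states before becoming active.
states = iter(["Loading", "Attaching", "Active"])
polls = wait_until_active(lambda: next(states), iterations=10, interval_s=0.01)
```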
@jcsp jcsp force-pushed the jcsp/secondary-locations-tests branch from e6f6735 to f21103a Compare December 6, 2023 09:46
@jcsp jcsp requested a review from arpad-m December 6, 2023 10:29
@jcsp jcsp marked this pull request as ready for review December 6, 2023 10:29
@jcsp jcsp mentioned this pull request Dec 6, 2023
5 tasks
This test could fail due to future layers getting deleted
on restart, if we have not waited for all uploads before restart.

Related: #6092
test_runner/regress/test_pageserver_generations.py (outdated, resolved)
test_runner/regress/test_pageserver_generations.py (outdated, resolved)
test_runner/fixtures/workload.py (resolved)
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
@jcsp jcsp enabled auto-merge (squash) December 11, 2023 15:26
@jcsp jcsp merged commit 6a922b1 into main Dec 11, 2023
39 of 41 checks passed
@jcsp jcsp deleted the jcsp/secondary-locations-tests branch December 11, 2023 16:55
]
)

# these can happen, if we shutdown at a good time. to be fixed as part of #5172.
Member

#5172 is now closed. @jcsp is this still needed?

Contributor Author

Shouldn't be -- I'll clean up in the next PR that touches these tests.

jcsp added a commit that referenced this pull request Dec 14, 2023
Dependency (commits inline):
#5842

## Problem

Secondary mode tenants need a manifest of what to download. Ultimately
this will be some kind of heat-scored set of layers, but as a robust
first step we will simply use the set of resident layers: secondary
tenant locations will aim to match the on-disk content of the attached
location.

## Summary of changes

- Add heatmap types representing the remote structure
- Add hooks to Tenant/Timeline for generating these heatmaps
- Create a new `HeatmapUploader` type that is external to `Tenant`, and
responsible for walking the list of attached tenants and scheduling
heatmap uploads.

Notes to reviewers:
- Putting the logic for uploads (and later, secondary mode downloads)
outside of `Tenant` is an opinionated choice, motivated by:
  - Enable future smarter scheduling of operations, e.g. uploading the
    stalest tenant first, rather than having all tenants compete for a fair
    semaphore on a first-come-first-served basis. Similarly for downloads,
    we may wish to schedule the tenants with the hottest un-downloaded
    layers first.
  - Enable accessing upload-related state without synchronization (it
    belongs to HeatmapUploader, rather than being some Mutex<>'d part of
    Tenant)
  - Avoid further expanding the scope of Tenant/Timeline types, which are
    already among the largest in the codebase
- You might reasonably wonder how much of the uploader code could be a
generic job manager thing. Probably some of it: but let's defer pulling
that out until we have at least two users (perhaps secondary downloads
will be the second one) to highlight which bits are really generic.
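
The "stalest tenant first" scheduling mentioned in the notes above could look something like the sketch below. It is written in Python for illustration only; the function name `upload_order` and the timestamp map are assumptions, and nothing here reflects how `HeatmapUploader` is actually implemented.

```python
import heapq
from typing import Dict, List, Tuple


def upload_order(last_upload: Dict[str, float]) -> List[str]:
    """Return tenant ids ordered stalest-first (oldest last-upload time first)."""
    # A min-heap keyed on last-upload time pops the stalest tenant first.
    heap: List[Tuple[float, str]] = [(ts, tenant) for tenant, ts in last_upload.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


# Usage: tenant "b" was uploaded longest ago, so it is scheduled first.
order = upload_order({"a": 30.0, "b": 10.0, "c": 20.0})
```

A single owner of this queue can pick the next tenant without tenants contending on a shared semaphore, which is the contrast the notes draw with first-come-first-served.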

Compromises:
- Later, instead of using digests of heatmaps to decide whether anything
changed, I would prefer to avoid walking the layers in tenants that
don't have changes: tracking that will be a bit invasive, as it needs
input from both remote_timeline_client and Layer.
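
The digest-based change detection mentioned under Compromises can be sketched as follows. This is a hypothetical illustration in Python: the heatmap shape (timeline id mapped to a list of layer names), the choice of SHA-256 over canonical JSON, and the `UploadTracker` class are all assumptions, not the format or code used in the commit.

```python
import hashlib
import json
from typing import Dict, List


def heatmap_digest(heatmap: Dict[str, List[str]]) -> str:
    # Canonical JSON (sorted keys, sorted layer names) so equal content gives equal digests.
    canonical = json.dumps({k: sorted(v) for k, v in heatmap.items()}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class UploadTracker:
    """Remembers the digest of the last uploaded heatmap per tenant."""

    def __init__(self) -> None:
        self._last: Dict[str, str] = {}

    def needs_upload(self, tenant_id: str, heatmap: Dict[str, List[str]]) -> bool:
        return self._last.get(tenant_id) != heatmap_digest(heatmap)

    def mark_uploaded(self, tenant_id: str, heatmap: Dict[str, List[str]]) -> None:
        # Record the digest only after the upload has actually succeeded.
        self._last[tenant_id] = heatmap_digest(heatmap)


# Usage: an unchanged set of layers is skipped; a new layer triggers an upload.
tracker = UploadTracker()
first = {"timeline-1": ["layer-b", "layer-a"]}
needs_first = tracker.needs_upload("tenant-1", first)
tracker.mark_uploaded("tenant-1", first)
needs_same = tracker.needs_upload("tenant-1", {"timeline-1": ["layer-a", "layer-b"]})
needs_changed = tracker.needs_upload("tenant-1", {"timeline-1": ["layer-a", "layer-b", "layer-c"]})
```

Note that this approach still builds the heatmap for every tenant on each pass, which is the cost the compromise above refers to.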
Labels
a/tech_debt (Area: related to tech debt), a/test (Area: related to testing), c/storage/pageserver (Component: storage: pageserver)
2 participants