Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent auth from starting when an incompatible major upgrade is detected #43520

Merged
merged 14 commits into from
Jul 12, 2024

Conversation

vapopov
Copy link
Contributor

@vapopov vapopov commented Jun 25, 2024

Added KV for storing teleport version in backend. If the current major version of teleport bigger than previous one more than one we must refuse to start.

Closes #41782

changelog: Teleport auth prevents upgrade major version more than one

Copy link

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Copy link

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Copy link

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Copy link
Contributor

@rosstimothy rosstimothy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of high level thoughts:

  1. I don't think we need the version quasi resource in api/types. This should only be internal to Auth and not something exposed to users.
  2. We should use the semver package instead of regular expressions to parse versions.
  3. ValidateAndUpdateTeleportVersion is currently very racy. Multiple Auth processes that start at the same time will both see no initial version in the backend and overwrite the others version. We should make use of ConditionalUpdate or AtomicWrite to guarantee that we are always acting on the most recent data stored in the backend and cannot accidentally overwrite the version of another process.
  4. The error returned when the differences in versions should prevent starting Auth should be more informative and actionable. Let's make sure the users know that what they are trying to do is not supported, and inform them of the proper major version to upgrade to in an effort to reduce support load.
  5. We should also consider in the future that an absence of a version in the backend might also mean the upgrade should not proceed. If this is introduced in v17, any updates from < v17 to >= v18 will not have a persisted version but would be an attempt to skip major versions.

lib/services/local/version.go Outdated Show resolved Hide resolved
api/types/version.go Outdated Show resolved Hide resolved
lib/services/local/version.go Outdated Show resolved Hide resolved
lib/services/version.go Outdated Show resolved Hide resolved
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*/

package local
Copy link
Collaborator

@zmb3 zmb3 Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought - but did we consider using the local process storage (SQLite) instead of plumbing this through all of the backend interfaces?

This would also solve the raciness that @rosstimothy mentioned.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that mean that all a user needs to do to violate the upgrade guidance is rm -rf /var/lib/teleport ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but we're not trying to make this bulletproof. You could make the same argument for the backend - "couldn't the user just remove the version entry from the backend to violate the upgrade guidance?"

Copy link
Collaborator

@r0mant r0mant Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think keeping it in local process storage would also not work in cases when auth runs in Kubernetes and pods recycle (which is every upgrade basically).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, ignore my previous comment, we use secrets for local storage in Kubernetes so it would work. In fact, it looks like @fspmarshall added something called InitialLocalVersion to process storage a few weeks ago in #42189. Is this something we can piggy-back on, or add the version we need to record alongside it? @fspmarshall Is "InitialLocalVersion" supposed to be updated when teleport upgrades?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@r0mant I've launched latest v17, so we keep initial version "initial_local_version":"15.4.5"
what about to extend this state object to add version history field as array to keep track all version launch (as usually migration logic does)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we need to keep the version history necessarily but adding another process state field for tracking the latest and current version (unless we can reuse initial_local_version) seems reasonable.

lib/services/version.go Outdated Show resolved Hide resolved
- replaced to use process storage and state itme
- regexp replaced with semver
- additional check for v18 for empty version record in storage
Copy link

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Copy link
Collaborator

@r0mant r0mant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the comments, can you add test coverage to the PR?

lib/service/version.go Outdated Show resolved Hide resolved
lib/service/version.go Outdated Show resolved Hide resolved
lib/service/connect.go Outdated Show resolved Hide resolved
lib/service/version.go Outdated Show resolved Hide resolved
return trace.Wrap(err)
}
if currentMajor-localMajor > 1 {
return trace.Wrap(errMajorVersionUpgrade, "Teleport version %s is going to upgrade to %s, but first you need to upgrade to version %d.x.x at least",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments:

  • Here you don't need to wrap an error, I would just return a single trace.BadParameter(<error message>) so you don't have to define a separate errMajorVersionUpgrade.
  • This error message is better than the previous one but I would tweak it a bit more. For example:
Suggested change
return trace.Wrap(errMajorVersionUpgrade, "Teleport version %s is going to upgrade to %s, but first you need to upgrade to version %d.x.x at least",
return trace.BadParameter("Unsupported upgrade path detected: from %v to %v. Teleport supports direct upgrades to the next major version only. Please upgrade your cluster to version %d.x.x first. See compatibility guarantees for details: https://goteleport.com/docs/upgrading/overview/#component-compatibility.",

lib/service/version.go Outdated Show resolved Hide resolved
lib/service/version.go Outdated Show resolved Hide resolved
Comment on lines 51 to 54
if trace.IsNotFound(err) {
// If the Instance state has not been found, it means this is the first launch,
// and we don't need to check the version since this is the initial launch.
return nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to test a scenario when local process state gets removed, for whatever reason (shouldn't normally happen but people can get their clusters in all sorts of weird states). What would happen in this case? This will not be considered "first time start" because that is determined based on the presence of CA IIRC, will we re-init the process state with correct version? I'm mostly worried about scenario where it would get into some state which we'll have to recover manually which we need to avoid.

Copy link
Contributor Author

@vapopov vapopov Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance if we remove process database completely or specifically this /states/instance/state item, first we should go in this version check so /states/instance/state not created yet (and we skip version check). This item managed by process.connect and going to be created after validation with latest version set to local_version. Currently looking to in integration tests, to cover this case

Worst case scenario might be, if the user manually change local_version of /states/instance/state item in database

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was testing couple of scenarios with integration tests, found that in configuration when we enable only auth service

	cfg.Auth.Enabled = true
	cfg.Proxy.Enabled = false
	cfg.SSH.Enabled = false

We never register the instance state in the process storage

if process.Config.Auth.Enabled && !hasNonAuthRole {
// if we have a local auth server and no other services, we cannot create an instance client without breaking HSM rotation.
// instance control stream will be created via in-memory pipe, but until this limitation is resolved
// or a fully in-memory instance client is implemented, we cannot rely on the instance client existing
// for purposes other than the control stream.
// TODO(fspmarshall): implement one of the two potential solutions listed above.
process.setInstanceConnector(nil)
process.BroadcastEvent(Event{Name: InstanceReady, Payload: nil})
return nil
}
process.RegisterWithAuthServer(types.RoleInstance, InstanceIdentityEvent)

this leads to skipping version check in auth service

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fspmarshall Any advice here?

@zmb3 zmb3 changed the title Teleport auth should refuse to start when jumping major versions (#41782) Prevent auth from starting when an incompatible major upgrade is detected Jun 27, 2024
@zmb3
Copy link
Collaborator

zmb3 commented Jun 27, 2024

@vapopov I've changed the title a bit so that it describes what the change does, not what the change should do.

I also removed the corresponding issue from the PR title. It's better to link the issue in the PR body, not the title.

- add integration test
- store latest version in separate key
integration/version_test.go Outdated Show resolved Hide resolved
integration/version_test.go Outdated Show resolved Hide resolved
lib/auth/storage/storage.go Outdated Show resolved Hide resolved
lib/service/version.go Outdated Show resolved Hide resolved
return trace.BadParameter("Unsupported upgrade path detected: to %v. "+
"Teleport supports direct upgrades to the next major version only.\n Please upgrade "+
"your cluster to version %d.x.x first. See compatibility guarantees for details: "+
"https://goteleport.com/docs/upgrading/overview/#component-compatibility.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this error message is incorrect.

If you were on Teleport 15 and you tried to upgrade straight to Teleport 18, wouldn't it say "Please upgrade your cluster to version 17.x.x first" ??

If so, that's not right, we should be telling them to upgrade to v16 first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this specific case we couldn't know exact version of previous setup, I would probably change the message with example from docs instead of this version suggestion

lib/utils/ver.go Outdated Show resolved Hide resolved
- table tests
- error message
@vapopov vapopov force-pushed the vapopov/major-version-check branch from bfd4951 to c9159b0 Compare June 28, 2024 19:54
lib/auth/storage/storage.go Outdated Show resolved Hide resolved
lib/auth/storage/storage.go Outdated Show resolved Hide resolved
lib/auth/storage/storage.go Outdated Show resolved Hide resolved
lib/service/version.go Outdated Show resolved Hide resolved
- minor changes
lib/auth/storage/storage.go Outdated Show resolved Hide resolved
lib/service/version.go Outdated Show resolved Hide resolved
- reuse context for write operation
Copy link
Collaborator

@r0mant r0mant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're getting there but left a few more comments.

if trace.IsNotFound(err) {
// We have to ensure that the process database is newly created before
//applying the majorVersionConstraint check.
_, err := process.storage.GetState(process.GracefulExitContext(), types.RoleAdmin)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why RoleAdmin and not RoleAuth?

Also, feels like the role should be a parameter accepted by the validate function so it can be extended to other components in future if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default we create state RoleAdmin only (in setup where we configure Teleport with auth service and disable everything else)

@@ -203,6 +207,28 @@ func (p *ProcessStorage) WriteIdentity(name string, id state.Identity) error {
return trace.Wrap(err)
}

// GetTeleportVersion reads the last known Teleport version from storage.
func (p *ProcessStorage) GetTeleportVersion(ctx context.Context) (string, error) {
item, err := p.stateStorage.Get(ctx, backend.Key(teleportPrefix, lastKnownVersion))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this include a system role in the key, like for other things that are kept in the process storage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fspmarshall proposed to make it global in context of the roles, should I change it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably fine as is.

lib/service/version.go Outdated Show resolved Hide resolved
lib/service/version.go Outdated Show resolved Hide resolved
lib/service/version.go Outdated Show resolved Hide resolved
lib/service/version.go Outdated Show resolved Hide resolved
if err != nil {
return trace.Wrap(err)
}
if currentMajor-localMajor > 1 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just occurred to me: we should probably prevent accidental downgrades by more than 1 major apart too.

Comment on lines 128 to 139
timeout := time.After(time.Second * 30)
select {
case err := <-authRunErrCh:
if test.expectError {
assert.Error(t, err)
} else {
require.NoError(t, err)
}
case <-timeout:
t.Fatal("timed out waiting for fetching identity")
case <-signal:
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a pretty major problem with flaky tests in Teleport, due to tests starting and waiting for various services which we largely have under control now but still, is there a way to make this test not depend on time?

For example, why do we need to start auth? Can we just instantiate auth of version X, call "initAuthService" on it, then start another auth of version X+1, init, make sure it works, then another of version X+3, init, make sure it fails, something like that? There would be no potential flakiness and the test would probably be simpler as well.

Copy link
Contributor Author

@vapopov vapopov Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to additionally ensure that state for RoleAdmin created and this logic might be changed in time in initialization flow, thats why initially I choose to use integration test for such case. I check this state to identify if we have new install with empty database rather than existing one.

For instance if I add tests which trigger initAuthService only, we will never cover the case with majorVersionConstraint (this case covers situation: let say we have version 15, we know that storing information about version added in v17, and we are trying to update to v18, so if we check that we don't have any version in storage, means that we have older version than <v17)

I Misinterpreted comment regarding initAuthService, I believe initAuthService expecting to be called once we might get some issues, need to check this one

Copy link
Contributor Author

@vapopov vapopov Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote the test to use initAuthService instead

@vapopov vapopov force-pushed the vapopov/major-version-check branch from 58d6304 to c8d43f3 Compare July 2, 2024 03:51
- simplify integration test
- add downgrade check
- minor changes
@vapopov vapopov force-pushed the vapopov/major-version-check branch from c8d43f3 to fbc4636 Compare July 2, 2024 13:26
@vapopov vapopov force-pushed the vapopov/major-version-check branch from 424990a to 6ac40b8 Compare July 9, 2024 22:13
lib/auth/auth_test.go Outdated Show resolved Hide resolved
lib/auth/helpers.go Outdated Show resolved Hide resolved
@vapopov vapopov force-pushed the vapopov/major-version-check branch 2 times, most recently from 88db3e0 to ae96438 Compare July 12, 2024 21:20
lib/auth/helpers.go Outdated Show resolved Hide resolved
@vapopov vapopov force-pushed the vapopov/major-version-check branch 2 times, most recently from c9d5b8b to 4fa7038 Compare July 12, 2024 21:26
@vapopov vapopov force-pushed the vapopov/major-version-check branch from 4fa7038 to da6e46c Compare July 12, 2024 21:41
@vapopov vapopov added this pull request to the merge queue Jul 12, 2024
Merged via the queue into master with commit 894761b Jul 12, 2024
37 checks passed
@vapopov vapopov deleted the vapopov/major-version-check branch July 12, 2024 22:19
vapopov added a commit that referenced this pull request Jul 17, 2024
…cted (#43520)

* Teleport auth should refuse to start when jumping major versions (#41782)

* linter fixes

* code review changes:
- replaced to use process storage and state itme
- regexp replaced with semver
- additional check for v18 for empty version record in storage

* code review changes:
- add integration test
- store latest version in separate key

* code review changes:
- table tests
- error message

* code review changes:
- minor changes

* code review changes:
- reuse context for write operation

* code review changes:
- simplify integration test
- add downgrade check
- minor changes

* code review changes:
- move logic to the auth init

* code review changes:
- add development env to skip version check
- init the process storage if it is not set

* code review changes:
- add interface for local process storage
- use semver in storage instead

* change interface to version storage

* add fake version storage for other tests
vapopov added a commit that referenced this pull request Jul 17, 2024
…cted (#43520)

* Teleport auth should refuse to start when jumping major versions (#41782)

* linter fixes

* code review changes:
- replaced to use process storage and state itme
- regexp replaced with semver
- additional check for v18 for empty version record in storage

* code review changes:
- add integration test
- store latest version in separate key

* code review changes:
- table tests
- error message

* code review changes:
- minor changes

* code review changes:
- reuse context for write operation

* code review changes:
- simplify integration test
- add downgrade check
- minor changes

* code review changes:
- move logic to the auth init

* code review changes:
- add development env to skip version check
- init the process storage if it is not set

* code review changes:
- add interface for local process storage
- use semver in storage instead

* change interface to version storage

* add fake version storage for other tests
vapopov added a commit that referenced this pull request Jul 17, 2024
…cted (#43520)

* Teleport auth should refuse to start when jumping major versions (#41782)

* linter fixes

* code review changes:
- replaced to use process storage and state itme
- regexp replaced with semver
- additional check for v18 for empty version record in storage

* code review changes:
- add integration test
- store latest version in separate key

* code review changes:
- table tests
- error message

* code review changes:
- minor changes

* code review changes:
- reuse context for write operation

* code review changes:
- simplify integration test
- add downgrade check
- minor changes

* code review changes:
- move logic to the auth init

* code review changes:
- add development env to skip version check
- init the process storage if it is not set

* code review changes:
- add interface for local process storage
- use semver in storage instead

* change interface to version storage

* add fake version storage for other tests
github-merge-queue bot pushed a commit that referenced this pull request Jul 17, 2024
…cted (#43520) (#44360)

* Teleport auth should refuse to start when jumping major versions (#41782)

* linter fixes

* code review changes:
- replaced to use process storage and state itme
- regexp replaced with semver
- additional check for v18 for empty version record in storage

* code review changes:
- add integration test
- store latest version in separate key

* code review changes:
- table tests
- error message

* code review changes:
- minor changes

* code review changes:
- reuse context for write operation

* code review changes:
- simplify integration test
- add downgrade check
- minor changes

* code review changes:
- move logic to the auth init

* code review changes:
- add development env to skip version check
- init the process storage if it is not set

* code review changes:
- add interface for local process storage
- use semver in storage instead

* change interface to version storage

* add fake version storage for other tests
github-merge-queue bot pushed a commit that referenced this pull request Jul 17, 2024
…cted (#43520) (#44361)

* Teleport auth should refuse to start when jumping major versions (#41782)

* linter fixes

* code review changes:
- replaced to use process storage and state itme
- regexp replaced with semver
- additional check for v18 for empty version record in storage

* code review changes:
- add integration test
- store latest version in separate key

* code review changes:
- table tests
- error message

* code review changes:
- minor changes

* code review changes:
- reuse context for write operation

* code review changes:
- simplify integration test
- add downgrade check
- minor changes

* code review changes:
- move logic to the auth init

* code review changes:
- add development env to skip version check
- init the process storage if it is not set

* code review changes:
- add interface for local process storage
- use semver in storage instead

* change interface to version storage

* add fake version storage for other tests
github-merge-queue bot pushed a commit that referenced this pull request Jul 18, 2024
…cted (#43520) (#44362)

* Teleport auth should refuse to start when jumping major versions (#41782)

* linter fixes

* code review changes:
- replaced to use process storage and state itme
- regexp replaced with semver
- additional check for v18 for empty version record in storage

* code review changes:
- add integration test
- store latest version in separate key

* code review changes:
- table tests
- error message

* code review changes:
- minor changes

* code review changes:
- reuse context for write operation

* code review changes:
- simplify integration test
- add downgrade check
- minor changes

* code review changes:
- move logic to the auth init

* code review changes:
- add development env to skip version check
- init the process storage if it is not set

* code review changes:
- add interface for local process storage
- use semver in storage instead

* change interface to version storage

* add fake version storage for other tests
Comment on lines +73 to +79
if currentVersion.Major-lastKnownVersion.Major > 1 {
return trace.BadParameter("Unsupported upgrade path detected: from %v to %v. "+
"Teleport supports direct upgrades to the next major version only.\n Please upgrade "+
"your cluster to version %d.x.x first. See compatibility guarantees for details: "+
"https://goteleport.com/docs/upgrading/overview/#component-compatibility.",
lastKnownVersion, currentVersion.String(), lastKnownVersion.Major+1)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did what Roman described in another thread (https://github.com/gravitational/teleport/pull/43520/files#r1667060081): I switched to branch/v14, ran make build/teleport, started teleport and then switched back to master. Now I cannot start my cluster from master on 4aa48db, because I keep getting this error:

initialization failed
	Unsupported upgrade path detected: from 14.3.21 to 17.0.0-dev. Teleport supports direct upgrades to the next major version only.
 Please upgrade your cluster to version 15.x.x first. See compatibility guarantees for details: https://goteleport.com/docs/upgrading/overview/#component-compatibility.

So I downloaded 15.4.9 and 16.1.0, started them one after the other, but it didn't fix the problem. How does it know that my last version was 14.3.21 but cannot figure out that I ran 15.4.9 and 16.1.0 in the meantime? 🤔 Do I need to trigger a particular code path while 15.x and 16.x are running?

Copy link
Contributor Author

@vapopov vapopov Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ravicious you need to switch to branch/v15, because seems like 15.4.9 #44051 has no these changes in this release, also you can use env TELEPORT_UNSTABLE_SKIP_VERSION_UPGRADE_CHECK=yes to skip this check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Teleport auth should refuse to start when jumping major versions
7 participants