-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent auth from starting when an incompatible major upgrade is detected #43520
Conversation
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of high level thoughts:
- I don't think we need the version quasi resource in api/types. This should only be internal to Auth and not something exposed to users.
- We should use the semver package instead of regular expressions to parse versions.
- ValidateAndUpdateTeleportVersion is currently very racy. Multiple Auth processes that start at the same time will both see no initial version in the backend and overwrite the others version. We should make use of ConditionalUpdate or AtomicWrite to guarantee that we are always acting on the most recent data stored in the backend and cannot accidentally overwrite the version of another process.
- The error returned when the differences in versions should prevent starting Auth should be more informative and actionable. Let's make sure the users know that what they are trying to do is not supported, and inform them of the proper major version to upgrade to in an effort to reduce support load.
- We should also consider in the future that an absence of a version in the backend might also mean the upgrade should not proceed. If this is introduced in v17, any updates from < v17 to >= v18 will not have a persisted version but would be an attempt to skip major versions.
lib/services/local/version.go
Outdated
* along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
*/ | ||
|
||
package local |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought - but did we consider using the local process storage (SQLite) instead of plumbing this through all of the backend interfaces?
This would also solve the raciness that @rosstimothy mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that mean that all a user needs to do to violate the upgrade guidance is rm -rf /var/lib/teleport
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but we're not trying to make this bulletproof. You could make the same argument for the backend - "couldn't the user just remove the version entry from the backend to violate the upgrade guidance?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think keeping it in local process storage would also not work in cases when auth runs in Kubernetes and pods recycle (which is every upgrade basically).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, ignore my previous comment, we use secrets for local storage in Kubernetes so it would work. In fact, it looks like @fspmarshall added something called InitialLocalVersion to process storage a few weeks ago in #42189. Is this something we can piggy-back on, or add the version we need to record alongside it? @fspmarshall Is "InitialLocalVersion" supposed to be updated when teleport upgrades?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@r0mant I've launched latest v17, so we keep initial version "initial_local_version":"15.4.5"
what about to extend this state object to add version history field as array to keep track all version launch (as usually migration logic does)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we need to keep the version history necessarily but adding another process state field for tracking the latest and current version (unless we can reuse initial_local_version) seems reasonable.
- replaced to use process storage and state itme - regexp replaced with semver - additional check for v18 for empty version record in storage
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the comments, can you add test coverage to the PR?
lib/service/version.go
Outdated
return trace.Wrap(err) | ||
} | ||
if currentMajor-localMajor > 1 { | ||
return trace.Wrap(errMajorVersionUpgrade, "Teleport version %s is going to upgrade to %s, but first you need to upgrade to version %d.x.x at least", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of comments:
- Here you don't need to wrap an error, I would just return a single
trace.BadParameter(<error message>)
so you don't have to define a separateerrMajorVersionUpgrade
. - This error message is better than the previous one but I would tweak it a bit more. For example:
return trace.Wrap(errMajorVersionUpgrade, "Teleport version %s is going to upgrade to %s, but first you need to upgrade to version %d.x.x at least", | |
return trace.BadParameter("Unsupported upgrade path detected: from %v to %v. Teleport supports direct upgrades to the next major version only. Please upgrade your cluster to version %d.x.x first. See compatibility guarantees for details: https://goteleport.com/docs/upgrading/overview/#component-compatibility.", |
lib/service/version.go
Outdated
if trace.IsNotFound(err) { | ||
// If the Instance state has not been found, it means this is the first launch, | ||
// and we don't need to check the version since this is the initial launch. | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to test a scenario when local process state gets removed, for whatever reason (shouldn't normally happen but people can get their clusters in all sorts of weird states). What would happen in this case? This will not be considered "first time start" because that is determined based on the presence of CA IIRC, will we re-init the process state with correct version? I'm mostly worried about scenario where it would get into some state which we'll have to recover manually which we need to avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For instance if we remove process database completely or specifically this /states/instance/state
item, first we should go in this version check so /states/instance/state
not created yet (and we skip version check). This item managed by process.connect
and going to be created after validation with latest version set to local_version
. Currently looking to in integration tests, to cover this case
Worst case scenario might be, if the user manually change local_version
of /states/instance/state
item in database
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was testing couple of scenarios with integration tests, found that in configuration when we enable only auth service
cfg.Auth.Enabled = true
cfg.Proxy.Enabled = false
cfg.SSH.Enabled = false
We never register the instance state in the process storage
teleport/lib/service/service.go
Lines 2564 to 2574 in dc11996
if process.Config.Auth.Enabled && !hasNonAuthRole { | |
// if we have a local auth server and no other services, we cannot create an instance client without breaking HSM rotation. | |
// instance control stream will be created via in-memory pipe, but until this limitation is resolved | |
// or a fully in-memory instance client is implemented, we cannot rely on the instance client existing | |
// for purposes other than the control stream. | |
// TODO(fspmarshall): implement one of the two potential solutions listed above. | |
process.setInstanceConnector(nil) | |
process.BroadcastEvent(Event{Name: InstanceReady, Payload: nil}) | |
return nil | |
} | |
process.RegisterWithAuthServer(types.RoleInstance, InstanceIdentityEvent) |
this leads to skipping version check in auth service
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fspmarshall Any advice here?
@vapopov I've changed the title a bit so that it describes what the change does, not what the change should do. I also removed the corresponding issue from the PR title. It's better to link the issue in the PR body, not the title. |
- add integration test - store latest version in separate key
lib/service/version.go
Outdated
return trace.BadParameter("Unsupported upgrade path detected: to %v. "+ | ||
"Teleport supports direct upgrades to the next major version only.\n Please upgrade "+ | ||
"your cluster to version %d.x.x first. See compatibility guarantees for details: "+ | ||
"https://goteleport.com/docs/upgrading/overview/#component-compatibility.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this error message is incorrect.
If you were on Teleport 15 and you tried to upgrade straight to Teleport 18, wouldn't it say "Please upgrade your cluster to version 17.x.x first" ??
If so, that's not right, we should be telling them to upgrade to v16 first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this specific case we couldn't know exact version of previous setup, I would probably change the message with example from docs instead of this version suggestion
- table tests - error message
bfd4951
to
c9159b0
Compare
- minor changes
- reuse context for write operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're getting there but left a few more comments.
lib/service/version.go
Outdated
if trace.IsNotFound(err) { | ||
// We have to ensure that the process database is newly created before | ||
//applying the majorVersionConstraint check. | ||
_, err := process.storage.GetState(process.GracefulExitContext(), types.RoleAdmin) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why RoleAdmin
and not RoleAuth
?
Also, feels like the role should be a parameter accepted by the validate function so it can be extended to other components in future if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default we create state RoleAdmin
only (in setup where we configure Teleport with auth service and disable everything else)
@@ -203,6 +207,28 @@ func (p *ProcessStorage) WriteIdentity(name string, id state.Identity) error { | |||
return trace.Wrap(err) | |||
} | |||
|
|||
// GetTeleportVersion reads the last known Teleport version from storage. | |||
func (p *ProcessStorage) GetTeleportVersion(ctx context.Context) (string, error) { | |||
item, err := p.stateStorage.Get(ctx, backend.Key(teleportPrefix, lastKnownVersion)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this include a system role in the key, like for other things that are kept in the process storage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fspmarshall proposed to make it global in context of the roles, should I change it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably fine as is.
lib/service/version.go
Outdated
if err != nil { | ||
return trace.Wrap(err) | ||
} | ||
if currentMajor-localMajor > 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just occurred to me: we should probably prevent accidental downgrades by more than 1 major apart too.
integration/version_test.go
Outdated
timeout := time.After(time.Second * 30) | ||
select { | ||
case err := <-authRunErrCh: | ||
if test.expectError { | ||
assert.Error(t, err) | ||
} else { | ||
require.NoError(t, err) | ||
} | ||
case <-timeout: | ||
t.Fatal("timed out waiting for fetching identity") | ||
case <-signal: | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a pretty major problem with flaky tests in Teleport, due to tests starting and waiting for various services which we largely have under control now but still, is there a way to make this test not depend on time?
For example, why do we need to start auth? Can we just instantiate auth of version X, call "initAuthService" on it, then start another auth of version X+1, init, make sure it works, then another of version X+3, init, make sure it fails, something like that? There would be no potential flakiness and the test would probably be simpler as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to additionally ensure that state for RoleAdmin
created and this logic might be changed in time in initialization flow, thats why initially I choose to use integration test for such case. I check this state to identify if we have new install with empty database rather than existing one.
For instance if I add tests which trigger initAuthService
only, we will never cover the case with majorVersionConstraint
(this case covers situation: let say we have version 15, we know that storing information about version added in v17, and we are trying to update to v18, so if we check that we don't have any version in storage, means that we have older version than <v17)
I Misinterpreted comment regarding initAuthService
, I believe initAuthService
expecting to be called once we might get some issues, need to check this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote the test to use initAuthService
instead
58d6304
to
c8d43f3
Compare
- simplify integration test - add downgrade check - minor changes
c8d43f3
to
fbc4636
Compare
424990a
to
6ac40b8
Compare
88db3e0
to
ae96438
Compare
c9d5b8b
to
4fa7038
Compare
4fa7038
to
da6e46c
Compare
…cted (#43520) * Teleport auth should refuse to start when jumping major versions (#41782) * linter fixes * code review changes: - replaced to use process storage and state itme - regexp replaced with semver - additional check for v18 for empty version record in storage * code review changes: - add integration test - store latest version in separate key * code review changes: - table tests - error message * code review changes: - minor changes * code review changes: - reuse context for write operation * code review changes: - simplify integration test - add downgrade check - minor changes * code review changes: - move logic to the auth init * code review changes: - add development env to skip version check - init the process storage if it is not set * code review changes: - add interface for local process storage - use semver in storage instead * change interface to version storage * add fake version storage for other tests
…cted (#43520) * Teleport auth should refuse to start when jumping major versions (#41782) * linter fixes * code review changes: - replaced to use process storage and state itme - regexp replaced with semver - additional check for v18 for empty version record in storage * code review changes: - add integration test - store latest version in separate key * code review changes: - table tests - error message * code review changes: - minor changes * code review changes: - reuse context for write operation * code review changes: - simplify integration test - add downgrade check - minor changes * code review changes: - move logic to the auth init * code review changes: - add development env to skip version check - init the process storage if it is not set * code review changes: - add interface for local process storage - use semver in storage instead * change interface to version storage * add fake version storage for other tests
…cted (#43520) * Teleport auth should refuse to start when jumping major versions (#41782) * linter fixes * code review changes: - replaced to use process storage and state itme - regexp replaced with semver - additional check for v18 for empty version record in storage * code review changes: - add integration test - store latest version in separate key * code review changes: - table tests - error message * code review changes: - minor changes * code review changes: - reuse context for write operation * code review changes: - simplify integration test - add downgrade check - minor changes * code review changes: - move logic to the auth init * code review changes: - add development env to skip version check - init the process storage if it is not set * code review changes: - add interface for local process storage - use semver in storage instead * change interface to version storage * add fake version storage for other tests
…cted (#43520) (#44360) * Teleport auth should refuse to start when jumping major versions (#41782) * linter fixes * code review changes: - replaced to use process storage and state itme - regexp replaced with semver - additional check for v18 for empty version record in storage * code review changes: - add integration test - store latest version in separate key * code review changes: - table tests - error message * code review changes: - minor changes * code review changes: - reuse context for write operation * code review changes: - simplify integration test - add downgrade check - minor changes * code review changes: - move logic to the auth init * code review changes: - add development env to skip version check - init the process storage if it is not set * code review changes: - add interface for local process storage - use semver in storage instead * change interface to version storage * add fake version storage for other tests
…cted (#43520) (#44361) * Teleport auth should refuse to start when jumping major versions (#41782) * linter fixes * code review changes: - replaced to use process storage and state itme - regexp replaced with semver - additional check for v18 for empty version record in storage * code review changes: - add integration test - store latest version in separate key * code review changes: - table tests - error message * code review changes: - minor changes * code review changes: - reuse context for write operation * code review changes: - simplify integration test - add downgrade check - minor changes * code review changes: - move logic to the auth init * code review changes: - add development env to skip version check - init the process storage if it is not set * code review changes: - add interface for local process storage - use semver in storage instead * change interface to version storage * add fake version storage for other tests
…cted (#43520) (#44362) * Teleport auth should refuse to start when jumping major versions (#41782) * linter fixes * code review changes: - replaced to use process storage and state itme - regexp replaced with semver - additional check for v18 for empty version record in storage * code review changes: - add integration test - store latest version in separate key * code review changes: - table tests - error message * code review changes: - minor changes * code review changes: - reuse context for write operation * code review changes: - simplify integration test - add downgrade check - minor changes * code review changes: - move logic to the auth init * code review changes: - add development env to skip version check - init the process storage if it is not set * code review changes: - add interface for local process storage - use semver in storage instead * change interface to version storage * add fake version storage for other tests
if currentVersion.Major-lastKnownVersion.Major > 1 { | ||
return trace.BadParameter("Unsupported upgrade path detected: from %v to %v. "+ | ||
"Teleport supports direct upgrades to the next major version only.\n Please upgrade "+ | ||
"your cluster to version %d.x.x first. See compatibility guarantees for details: "+ | ||
"https://goteleport.com/docs/upgrading/overview/#component-compatibility.", | ||
lastKnownVersion, currentVersion.String(), lastKnownVersion.Major+1) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just did what Roman described in another thread (https://github.com/gravitational/teleport/pull/43520/files#r1667060081): I switched to branch/v14, ran make build/teleport
, started teleport and then switched back to master. Now I cannot start my cluster from master on 4aa48db, because I keep getting this error:
initialization failed
Unsupported upgrade path detected: from 14.3.21 to 17.0.0-dev. Teleport supports direct upgrades to the next major version only.
Please upgrade your cluster to version 15.x.x first. See compatibility guarantees for details: https://goteleport.com/docs/upgrading/overview/#component-compatibility.
So I downloaded 15.4.9 and 16.1.0, started them one after the other, but it didn't fix the problem. How does it know that my last version was 14.3.21 but cannot figure out that I ran 15.4.9 and 16.1.0 in the meantime? 🤔 Do I need to trigger a particular code path while 15.x and 16.x are running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ravicious you need to switch to branch/v15, because seems like 15.4.9 #44051 has no these changes in this release, also you can use env TELEPORT_UNSTABLE_SKIP_VERSION_UPGRADE_CHECK=yes
to skip this check
Added KV for storing teleport version in backend. If the current major version of teleport bigger than previous one more than one we must refuse to start.
Closes #41782
changelog: Teleport auth prevents upgrade major version more than one