Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Machine ID: Share a single bot identity auth client rather than creating multiple clients #38398

Merged
merged 2 commits into from Mar 4, 2024

Conversation

strideynet
Copy link
Contributor

@strideynet strideynet commented Feb 19, 2024

For a while, the way we have handled the fact that the tbot bot identity renews regularly is by creating short-lived clients with the current identity when a client is needed. This reduces performance and reliability as many new connections must be established and maintained to the Auth Server. This PR switches most usages of the Bot identity to share a single client which handles the rotation of the identity internally - there remains one place where the client must be short lived and that is for the bot identity renewal itself, due to the generation counter stored within the bot identity.

This should reduce the resource consumption of tbot and improve the overall performance.

changelog: Improved reliability and performance of tbot.

@@ -292,7 +299,10 @@ func (s *identityService) renew(
if s.cfg.Onboarding.RenewableJoinMethod() {
// When using a renewable join method, we use GenerateUserCerts to
// request a new certificate using our current identity.
authClient, err := clientForIdentity(ctx, s.log, s.cfg, currentIdentity, s.resolver)
// We explicitly create a new client here to ensure that the latest
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the one place where we actually need to ensure we have a client which is using the latest identity - so we explicitly create the client. This is because of the generation counter stored in the x509 cert which is compared with the one on the user during the certificate renewal process. If an older identity is still in use by a client, the bot will be locked.

@strideynet strideynet changed the title Machine ID: Share a single auth client as much as possible in tbot Machine ID: Share a single bot identity auth client rather than creating multiple clients Feb 20, 2024
@strideynet strideynet marked this pull request as ready for review February 20, 2024 11:10
Copy link
Contributor

@rosstimothy rosstimothy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me just one question about client ownership: Should we prevent the shared client from being closable? In the tsh client handling code we have a psuedo implementation of auth.ClientI that looks something like the following to prevent this:

type sharedClient struct {
    auth.ClientI
}

func (s sharedClient) Close() error {
    return nil
} 

@strideynet
Copy link
Contributor Author

Overall this looks good to me just one question about client ownership: Should we prevent the shared client from being closable? In the tsh client handling code we have a psuedo implementation of auth.ClientI that looks something like the following to prevent this:

Hmm - I'm not sure I'm the biggest fan of this. It feels like if something is calling .Close when it shouldn't - that's a bug that should be caught and addressed - not one that silently is swallowed and hidden, leaving a .Close call that does nothing. This seems the opposite of a predictable behaviour for an engineer and could lead to resource leaks etc in future when someone expects .Close to close the client, but it doesn't.

@rosstimothy
Copy link
Contributor

Hmm - I'm not sure I'm the biggest fan of this. It feels like if something is calling .Close when it shouldn't - that's a bug that should be caught and addressed - not one that silently is swallowed and hidden, leaving a .Close call that does nothing. This seems the opposite of a predictable behaviour for an engineer and could lead to resource leaks etc in future when someone expects .Close to close the client, but it doesn't.

I agree the proposed solution is hacky, but, in practice catching places where a shared client is unintentionally closed isn't something the compiler can detect. We should at least add a comment to GetClient that the shared client should not be closed.

@strideynet
Copy link
Contributor Author

strideynet commented Feb 27, 2024

Going to spin up the tbot test ground to stability and performance test this vs the latest 15 release - I'll leave this cooking for about a week before merging.

@strideynet
Copy link
Contributor Author

Going to merge a pr to fix the otel memory leak first before resuming this one.

@strideynet
Copy link
Contributor Author

strideynet commented Mar 3, 2024

image

With the leak introduced on master patched, we see a really nice grouping of the three tbot running this PR (the lower cluster) and the three tbot running the latest v15. About a 3MiB (10%) decrease in heap size.

@strideynet
Copy link
Contributor Author

I'm happy with how this has run over the weekend so I've rebased out the release commits and added some godocs. PTAL @rosstimothy

@gravitational gravitational deleted a comment from strideynet Mar 4, 2024
@strideynet strideynet added this pull request to the merge queue Mar 4, 2024
Merged via the queue into master with commit 7175938 Mar 4, 2024
34 checks passed
@strideynet strideynet deleted the strideynet/central-client branch March 4, 2024 15:26
@public-teleport-github-review-bot

@strideynet See the table below for backport results.

Branch Result
branch/v14 Create PR
branch/v15 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants