Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement auto certificate rotation function for karmada-agent #2596

Merged
merged 2 commits into from
Oct 27, 2022

Conversation

lonelyCZ
Copy link
Member

@lonelyCZ lonelyCZ commented Sep 29, 2022

Signed-off-by: lonelyCZ 531187475@qq.com

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:
Part of #2282

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

karmada-agent: introduce auto certificate rotation function
karmadactl: introduce --enable-cert-rotation option to register command

@karmada-bot karmada-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Sep 29, 2022
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 29, 2022
@codecov-commenter
Copy link

codecov-commenter commented Sep 29, 2022

Codecov Report

Merging #2596 (241f722) into master (be0462c) will increase coverage by 1.69%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #2596      +/-   ##
==========================================
+ Coverage   27.48%   29.17%   +1.69%     
==========================================
  Files         190      190              
  Lines       19061    19049      -12     
==========================================
+ Hits         5239     5558     +319     
+ Misses      13461    13140     -321     
+ Partials      361      351      -10     
Flag Coverage Δ
unittests 29.17% <0.00%> (+1.69%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
cmd/agent/app/options/options.go 0.00% <0.00%> (ø)
pkg/controllers/context/context.go 48.57% <ø> (ø)
pkg/karmadactl/register.go 0.00% <0.00%> (ø)
pkg/karmadactl/get.go 0.00% <0.00%> (ø)
pkg/karmadactl/logs.go 0.00% <0.00%> (ø)
pkg/karmadactl/describe.go 0.00% <0.00%> (ø)
pkg/util/serviceaccount.go 90.27% <0.00%> (+40.27%) ⬆️
pkg/util/clusterrole.go 100.00% <0.00%> (+54.16%) ⬆️
pkg/util/secret.go 86.66% <0.00%> (+63.45%) ⬆️
pkg/util/worker.go 71.42% <0.00%> (+71.42%) ⬆️
... and 3 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

pkg/controllers/cert/cert_rotating_controller.go Outdated Show resolved Hide resolved
pkg/controllers/cert/cert_rotating_controller.go Outdated Show resolved Hide resolved
pkg/controllers/cert/cert_rotating_controller.go Outdated Show resolved Hide resolved
pkg/controllers/cert/cert_rotating_controller.go Outdated Show resolved Hide resolved
pkg/controllers/cert/cert_rotating_controller.go Outdated Show resolved Hide resolved
pkg/controllers/cert/cert_rotating_controller.go Outdated Show resolved Hide resolved
@lonelyCZ lonelyCZ force-pushed the pr-agent-cert-rotation branch 2 times, most recently from 19115dc to 3ec282f Compare October 13, 2022 16:05
@lonelyCZ lonelyCZ changed the title [WIP]Implement auto certificate rotation function for karmada-agent Implement auto certificate rotation function for karmada-agent Oct 13, 2022
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 13, 2022
@lonelyCZ
Copy link
Member Author

Deploy karmada-agent in member cluster by below command

./karmadactl register 10.10.103.67:32443 --token tajqle.q025effjnsg9ihsq --discovery-token-ca-cert-hash sha256:28b41535b5c5102f0d4d6ef9478f4d6e9f9ebdb1d47d7735133aee80d21b98eb --karmada-agent-image swr.ap-southeast-1.myhuaweicloud.com/karmada/karmada-agent:v1.3.0-146-g4977115 --cert-expiration-seconds=1800

When the certificate is still valid and not to threshold, it is not rotated.

I1013 12:10:59.563069       1 cert_rotation_controller.go:69] Rotating the certificate of karmada-agent for the member cluster: member68
I1013 12:10:59.584867       1 cert_rotation_controller.go:250] The certificate of karmada-agent: time total=35m0s, remaining=8m28.415150573s, remaining/total=0.24210245265380953
I1013 12:10:59.584963       1 cert_rotation_controller.go:254] The certificate of karmada-agent is valid and has more than 20.00% of its life remaining

When the certificate is about to expire, it will be rotated.

I1013 12:15:59.585793       1 cert_rotation_controller.go:69] Rotating the certificate of karmada-agent for the member cluster: member68
I1013 12:15:59.604375       1 cert_rotation_controller.go:250] The certificate of karmada-agent: time total=35m0s, remaining=3m28.395803201s, remaining/total=0.09923609676238096
I1013 12:15:59.683012       1 cert_rotation_controller.go:196] Waiting for the client certificate to be issued
I1013 12:16:00.703082       1 cert_rotation_controller.go:204] signing certificate successfully
I1013 12:16:00.766080       1 cert_rotation_controller.go:237] The certificate has been rotated successfully, new expiration time is 2022-10-13 12:50:59 +0000 UTC

When the certificate is expired, the current pod will be killed and to be restarted.

I1013 12:50:59.490626       1 cluster_status_controller.go:108] Syncing cluster status: member68
E1013 12:51:00.007468       1 leaderelection.go:330] error retrieving resource lock karmada-system/karmada-agent-member68: Unauthorized
E1013 12:51:02.011907       1 leaderelection.go:330] error retrieving resource lock karmada-system/karmada-agent-member68: Unauthorized
I1013 12:51:03.227447       1 reflector.go:536] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Watch close - *unstructured.Unstructured total 6 items received
E1013 12:51:03.229458       1 reflector.go:138] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to watch *unstructured.Unstructured: the server has asked for the client to provide credentials
E1013 12:51:04.011004       1 leaderelection.go:330] error retrieving resource lock karmada-system/karmada-agent-member68: Unauthorized
I1013 12:51:04.127390       1 reflector.go:255] Listing and watching *unstructured.Unstructured from k8s.io/client-go/dynamic/dynamicinformer/informer.go:91
W1013 12:51:04.130333       1 reflector.go:324] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: failed to list *unstructured.Unstructured: Unauthorized
E1013 12:51:04.130384       1 reflector.go:138] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Unauthorized
E1013 12:51:06.011097       1 leaderelection.go:330] error retrieving resource lock karmada-system/karmada-agent-member68: Unauthorized
I1013 12:51:06.777919       1 reflector.go:255] Listing and watching *unstructured.Unstructured from k8s.io/client-go/dynamic/dynamicinformer/informer.go:91
W1013 12:51:06.780190       1 reflector.go:324] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: failed to list *unstructured.Unstructured: Unauthorized
E1013 12:51:06.780254       1 reflector.go:138] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Unauthorized
E1013 12:51:08.012456       1 leaderelection.go:330] error retrieving resource lock karmada-system/karmada-agent-member68: Unauthorized
E1013 12:51:08.447898       1 controller.go:187] failed to update lease, error: Unauthorized
E1013 12:51:08.450606       1 controller.go:187] failed to update lease, error: Unauthorized
E1013 12:51:08.452548       1 controller.go:187] failed to update lease, error: Unauthorized
E1013 12:51:08.455806       1 controller.go:187] failed to update lease, error: Unauthorized
E1013 12:51:08.458192       1 controller.go:187] failed to update lease, error: Unauthorized
I1013 12:51:08.458345       1 controller.go:114] failed to update lease using latest lease, fallback to ensure lease, err: failed 5 attempts to update lease
E1013 12:51:08.460511       1 controller.go:144] failed to ensure lease exists, will retry in 200ms, error: Unauthorized
E1013 12:51:08.663371       1 controller.go:144] failed to ensure lease exists, will retry in 400ms, error: Unauthorized
E1013 12:51:09.066623       1 controller.go:144] failed to ensure lease exists, will retry in 800ms, error: Unauthorized
I1013 12:51:09.565880       1 cluster_status_controller.go:108] Syncing cluster status: member68
E1013 12:51:09.869421       1 controller.go:144] failed to ensure lease exists, will retry in 1.6s, error: Unauthorized
I1013 12:51:10.005702       1 leaderelection.go:283] failed to renew lease karmada-system/karmada-agent-member68: timed out waiting for the condition
E1013 12:51:10.005872       1 run.go:74] "command failed" err="controller manager exits unexpectedly: leader election lost"

The number of restarts is increased by 1. The valid period of the certificate is generally 1 year, so the number of restarts will not to increase fastly.

[root@master68 lonelyCZ]# kubectl get pod -n karmada-system
NAME                             READY   STATUS    RESTARTS      AGE
karmada-agent-6f8cd7567f-5kdl6   1/1     Running   6 (46m ago)   4h27m

@lonelyCZ
Copy link
Member Author

/cc @RainbowMango @prodanlabs @jwcesign

@RainbowMango
Copy link
Member

/assign
I'll take a look soon. Thanks for your hard work. Take care.

@RainbowMango
Copy link
Member

Hi @lonelyCZ Would you add an agenda to the community meeting and introduce this feature to us? I'm not familiar with the certificate rotation, it'll need some time for me to go through it.

@lonelyCZ
Copy link
Member Author

Hi @lonelyCZ Would you add an agenda to the community meeting and introduce this feature to us? I'm not familiar with the certificate rotation, it'll need some time for me to go through it.

Sure, I will demostrate it at next Tuesday meeting.

@lonelyCZ
Copy link
Member Author

I will update a new version today.

@lonelyCZ
Copy link
Member Author

I have updated it. PTAL~

/assign @RainbowMango

@RainbowMango
Copy link
Member

Working on it now.

pkg/karmadactl/register.go Show resolved Hide resolved
@@ -205,6 +206,10 @@ func run(ctx context.Context, karmadaConfig karmadactl.KarmadaConfig, opts *opti
return err
}

if opts.EnableCertRotation {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we don't need the option to control the controller, why don't we take this as a disabled by default controller. How do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a good idea.

@lonelyCZ
Copy link
Member Author

/cc @RainbowMango

Copy link
Member

@RainbowMango RainbowMango left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good now.
Just a small touch.

Comment on lines 89 to 91
// Fix that resourceVersion should not be set on objects to be created
secret.ResourceVersion = ""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first issue I found is the CreateOrUpdateSecret probably does not work as expected, I mean we expect it can fallback to Update once encountering the AlreadyExist error during creating, but since it doesn't take the latest resource version, so the Update would fail again.

For example, the CreateOrUpdate implemented by controller-runtime, it always deep-copy the existing object(take latest resource version) and then make changes(mutate) before try updation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but since it doesn't take the latest resource version, so the Update would fail again.

It seems not to happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but since it doesn't take the latest resource version, so the Update would fail again

Yeah, Update will fail because of old resourceVersion. Caller shall refetch the object and retry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems not to happen.

After client fetch the object, other client update it. Then update will fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. I created #2687 for tracking this.

pkg/controllers/certificate/cert_rotation_controller.go Outdated Show resolved Hide resolved
@lonelyCZ
Copy link
Member Author

/cc @RainbowMango

Copy link
Member

@RainbowMango RainbowMango left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 27, 2022
@karmada-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 27, 2022
@karmada-bot karmada-bot merged commit 5d42e82 into karmada-io:master Oct 27, 2022
@lonelyCZ lonelyCZ deleted the pr-agent-cert-rotation branch October 27, 2022 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants