
[FEATURE] mTLS authentication for manager <-> instance-manager communication #3839

Closed
joshimoo opened this issue Apr 11, 2022 · 7 comments
Assignees
Labels
area/security System or volume data access security component/longhorn-instance-manager Longhorn instance manager (interface between control and data plane) component/longhorn-manager Longhorn manager (control plane) highlight Important feature/issue to highlight kind/feature Feature request, new feature priority/0 Must be fixed in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation require/lep Require adding/updating enhancement proposal
Milestone

Comments

@joshimoo
Contributor

Is your feature request related to a problem? Please describe

We want to add authentication to the gRPC control communication; this can be done at the manager <-> instance-manager level, since we are switching the engine binary invocation to a proxy client that also talks directly to the instance-manager.

Previously the mTLS auth was tracked and discussed at #1983, but that isn't really the best place for it, since that feature request is concerned with authentication from the user's perspective (UI).

Describe the solution you'd like

  • evaluate auth schemes (per-call auth token, mTLS)
  • implement the chosen auth scheme in the communication between manager and instance-manager
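For the mTLS option, the server side comes down to a tls.Config that presents the instance-manager certificate and requires client certificates signed by a shared CA. A minimal sketch in Go (stdlib only; the /tls-files/ paths are an assumption based on the secret mount discussed later in this issue, not the actual Longhorn code):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// serverMTLSConfig builds a TLS config that serves the given key pair and
// requires clients to present a certificate signed by the CA in caFile.
// Illustrative sketch only, not the Longhorn implementation.
func serverMTLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, fmt.Errorf("failed to load TLS key pair: %w", err)
	}
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, fmt.Errorf("failed to read CA cert: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no valid CA certificates in %s", caFile)
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,
		ClientAuth:   tls.RequireAndVerifyClientCert, // this is what makes the TLS mutual
		MinVersion:   tls.VersionTLS12,
	}, nil
}

func main() {
	// With no secret mounted, the files are absent and the server would
	// fall back to a non-TLS listener; here we just exercise the error path.
	_, err := serverMTLSConfig("/tls-files/tls.crt", "/tls-files/tls.key", "/tls-files/ca.crt")
	fmt.Println("mTLS config unavailable:", err != nil)
}
```

On the client side the same CA pool would go into RootCAs together with the manager's own key pair.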

Additional context

@joshimoo joshimoo added kind/feature Feature request, new feature component/longhorn-manager Longhorn manager (control plane) component/longhorn-instance-manager Longhorn instance manager (interface between control and data plane) area/security System or volume data access security labels Apr 11, 2022
@joshimoo joshimoo added this to the v1.3.0 milestone Apr 11, 2022
@joshimoo joshimoo self-assigned this Apr 11, 2022
@longhorn-io-github-bot

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

  • Which areas/issues this PR might have potential impacts on?
    Area
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@innobead innobead added the priority/0 Must be fixed in this release (managed by PO) label Apr 12, 2022
@innobead innobead added require/doc Require updating the longhorn.io documentation require/lep Require adding/updating enhancement proposal labels Apr 24, 2022
@innobead innobead changed the title [FEATURE] MTLS authentication for manager <-> instance-manager communication [FEATURE] mTLS authentication for manager <-> instance-manager communication Apr 26, 2022
joshimoo added a commit to joshimoo/longhorn that referenced this issue May 6, 2022
This image includes MTLS support for the GRPC server and im client
refactorings to enable persistent connection support.

Longhorn longhorn#3839

Signed-off-by: Joshua Moody <joshua.moody@suse.com>
joshimoo added a commit to joshimoo/longhorn that referenced this issue May 6, 2022
The MTLS secret is currently hardcoded to longhorn-grpc-tls we can improve
this further in the future when necessary to also enable auto cert generation

Longhorn longhorn#3839

Signed-off-by: Joshua Moody <joshua.moody@suse.com>
innobead pushed a commit that referenced this issue May 6, 2022
This image includes MTLS support for the GRPC server and im client
refactorings to enable persistent connection support.

Longhorn #3839

Signed-off-by: Joshua Moody <joshua.moody@suse.com>
innobead pushed a commit that referenced this issue May 6, 2022
The MTLS secret is currently hardcoded to longhorn-grpc-tls we can improve
this further in the future when necessary to also enable auto cert generation

Longhorn #3839

Signed-off-by: Joshua Moody <joshua.moody@suse.com>
@joshimoo
Contributor Author

joshimoo commented May 6, 2022

Below is a test secret that can be used to enable mTLS auth on a cluster.
The different test cases:

absent TLS secret (default setup):

  • manager + IM are deployed while the TLS secret is not present (i.e. non-TLS client, non-TLS server)
  • normal volume functionality should work

Test IM deployed without TLS secret (non-TLS server)

  • same as default setup
  • create the TLS secret
    • the manager will pick up the TLS secret when it's updated; this should be quick and can be configured (check /tls-files/ in the manager container)
    • from this point forward, the manager will always try a TLS connection first, then fall back to a non-TLS connection if it's unable to communicate with the server
  • normal volume functionality should work
  • delete the im-e/im-r deployment on a random node X
    • after the IMs on node X are recreated, they will have TLS enabled
    • the manager will communicate with them via a TLS client
    • the manager will continue to use the fallback client for non-TLS servers
  • normal volume functionality should work on node X as well as on any other node

Old instance-managers (pre-TLS) should just continue to work, since they use the same client fallback as in the above test case

  • keep an old running volume (without engine updates) around from before the upgrade to the new Longhorn version, with a previously set up TLS secret

Expected case that will not work, due to one-time setup on the IM (TLS server)

  • if the TLS secret is removed after the IMs were started while the TLS secret was present
  • the manager will no longer be able to initiate a TLS connection with the IM
    • since the secret's TLS files will no longer be present
    • it will be able to reuse the previously created persistent client, but further refactoring on the controllers is needed to ensure there are no one-shot clients
  • in the future we can consider adding gRPC server reloads + an alternative TLS provisioning mechanism, but this isn't currently supported
  • the reason for refreshing the secret on the manager side is to support certificate rotation, i.e. as long as the base CA cert doesn't change, the regular cert should be updatable even if it's only updated on the manager (client) side
apiVersion: v1
kind: Secret
metadata:
  name: longhorn-grpc-tls
  namespace: longhorn-system
type: kubernetes.io/tls
data:
  ca.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUREakNDQWZhZ0F3SUJBZ0lVU2tNdUVEOC9XYXphNmpkb1NiTE1qalFqb3JFd0RRWUpLb1pJaHZjTkFRRUwKQlFBd0h6RWRNQnNHQTFVRUF4TVViRzl1WjJodmNtNHRaM0p3WXkxMGJITXRZMkV3SGhjTk1qSXdNVEV4TWpFeApPVEF3V2hjTk1qY3dNVEV3TWpFeE9UQXdXakFmTVIwd0d3WURWUVFERXhSc2IyNW5hRzl5YmkxbmNuQmpMWFJzCmN5MWpZVENDQVNJd0RRWUpLb1pJaHZjTkFRRUJCUUFEZ2dFUEFEQ0NBUW9DZ2dFQkFNY2grbTJhUndnNEtBa0EKT0xzdzdScWlWb1VqL2VPbVhuSE9HVE5nWE4rcFh5bDlCdzVDM1J4UDYzU29qaTVvNEhkU1htVmpwZmhmNjh1YwpvNVJJeUtXM1p6cndteDhXZldEc0dNNEtnYXBvMy84N3pVQ00vdGltOHllTzFUbTZlWVhXcWdlZ2JpM1Q1WnlvCmkzRjdteFg3QlU3Z25uWGthVmJ5UU1xRkEyMDJrK25jaVhaUE9iU0tlc1NvZ20wdWsrYXFvY3N1SjJ6dk9tZG0KMXd0a3ZTUklhL3l6T25JRGlmbFRteXNhZ3oxQy9VM1JxbzJ6TjIwbWJNYUJhMmx5anVZWkdWSnNyNGh4dGhqUApIR2x1UUh2QTlKTE9kc2J0T2xmbjRZNlZpUktCSzZWMVpOeVROMVJpN3ArTXZlaWQ3cE9rNHYweC9qVTc1a0N6Clo1cGJHbGtDQXdFQUFhTkNNRUF3RGdZRFZSMFBBUUgvQkFRREFnRUdNQThHQTFVZEV3RUIvd1FGTUFNQkFmOHcKSFFZRFZSME9CQllFRlBGc0xRbmQxOHFUTVd5djh1STk3Z2hnR2djR01BMEdDU3FHU0liM0RRRUJDd1VBQTRJQgpBUUNMcnk5a2xlSElMdDRwbzd4N0hvSldsMEswYjdwV2Y0Y3ZVeHh1bUdTYUpoQmFHNTVlZFNFSVAzajhsRGg1Cm94ZXJlbjNrRUtzeGZiQVQ0RzU3KzBaeExQSkZQcjFMM3JvcmxUVE1DS1QyY2Z1UDJ3SEIzZndWNDJpSHZSUDgKSUVqU041bFNkWjZnN1NjWFZ2RnpZNzlrbVZDQ2RNYlpGcEFuOElyTkh3L0tTUGZUajNob2VyV3ZGL3huaEo3bQpmSzUrcE5TeWR6QTA1K1Y0ODJhWGlvV2NWcWY2UHpSVndmT0tIalUrbUVDQXZMbitNSzRvN1l2VW1iN2tSUGs5CnBjU1A4N2lpN0hwRVhqZUtRaVJhZElXKzMySXp1UTFiOXRYc3BNTGF0UFA5TXNvWmY0M1EyZWw4bWd1RjRxOUcKVmVUZFZaU2hBNWNucmNRZTEySUs1MzAvCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
  tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVqekNDQTNlZ0F3SUJBZ0lVUjZWcGR5U1Z0MGp6bDcwQnIxMmdZOTB0QVNBd0RRWUpLb1pJaHZjTkFRRUwKQlFBd0h6RWRNQnNHQTFVRUF4TVViRzl1WjJodmNtNHRaM0p3WXkxMGJITXRZMkV3SGhjTk1qSXdNVEV4TWpFeApPVEF3V2hjTk1qTXdNVEV4TWpFeE9UQXdXakFiTVJrd0Z3WURWUVFERXhCc2IyNW5hRzl5YmkxaVlXTnJaVzVrCk1Ga3dFd1lIS29aSXpqMENBUVlJS29aSXpqMERBUWNEUWdBRUxOQVVJZUsvdnppaGc3a1Q5d3E4anU4VU51c24Kc2FzdmlpS1VHQnpkblZndlNSdzNhYzd4RTRSQjlmZytjRnVUenpGaFNHRlVLVUpYaVh5d0FXZ0o4YU9DQXBBdwpnZ0tNTUE0R0ExVWREd0VCL3dRRUF3SUZvREFkQmdOVkhTVUVGakFVQmdnckJnRUZCUWNEQVFZSUt3WUJCUVVICkF3SXdEQVlEVlIwVEFRSC9CQUl3QURBZEJnTlZIUTRFRmdRVXgvaDVCOUFMSExuYWJaNjBzT2dvbnA3YlN0VXcKSHdZRFZSMGpCQmd3Rm9BVThXd3RDZDNYeXBNeGJLL3k0ajN1Q0dBYUJ3WXdnZ0lMQmdOVkhSRUVnZ0lDTUlJQgovb0lRYkc5dVoyaHZjbTR0WW1GamEyVnVaSUlnYkc5dVoyaHZjbTR0WW1GamEyVnVaQzVzYjI1bmFHOXliaTF6CmVYTjBaVzJDSkd4dmJtZG9iM0p1TFdKaFkydGxibVF1Ykc5dVoyaHZjbTR0YzNsemRHVnRMbk4yWTRJUmJHOXUKWjJodmNtNHRabkp2Ym5SbGJtU0NJV3h2Ym1kb2IzSnVMV1p5YjI1MFpXNWtMbXh2Ym1kb2IzSnVMWE41YzNSbApiWUlsYkc5dVoyaHZjbTR0Wm5KdmJuUmxibVF1Ykc5dVoyaHZjbTR0YzNsemRHVnRMbk4yWTRJWGJHOXVaMmh2CmNtNHRaVzVuYVc1bExXMWhibUZuWlhLQ0oyeHZibWRvYjNKdUxXVnVaMmx1WlMxdFlXNWhaMlZ5TG14dmJtZG8KYjNKdUxYTjVjM1JsYllJcmJHOXVaMmh2Y200dFpXNW5hVzVsTFcxaGJtRm5aWEl1Ykc5dVoyaHZjbTR0YzNsegpkR1Z0TG5OMlk0SVliRzl1WjJodmNtNHRjbVZ3YkdsallTMXRZVzVoWjJWeWdpaHNiMjVuYUc5eWJpMXlaWEJzCmFXTmhMVzFoYm1GblpYSXViRzl1WjJodmNtNHRjM2x6ZEdWdGdpeHNiMjVuYUc5eWJpMXlaWEJzYVdOaExXMWgKYm1GblpYSXViRzl1WjJodmNtNHRjM2x6ZEdWdExuTjJZNElNYkc5dVoyaHZjbTR0WTNOcGdoeHNiMjVuYUc5eQpiaTFqYzJrdWJHOXVaMmh2Y200dGMzbHpkR1Z0Z2lCc2IyNW5hRzl5YmkxamMya3ViRzl1WjJodmNtNHRjM2x6CmRHVnRMbk4yWTRJUWJHOXVaMmh2Y200dFltRmphMlZ1WkljRWZ3QUFBVEFOQmdrcWhraUc5dzBCQVFzRkFBT0MKQVFFQWV5UlhCWnI5Z1RmTGlsNGMvZElaSlVYeFh4ckFBQmtJTG55QkdNdkFqaFJoRndLZ09VU0MvMGUyeDYvTQpoTi9SWElYVzdBYUF0a25ZSHFLa3piMDZsbWhxczRHNWVjNkZRZDViSGdGbnFPOHNWNEF6WVFSRWhDZjlrWWhUClVlRnJLdDdOQllHNFNXSnNYK2M0ZzU5RlZGZkIzbTZscStoR3JaY085T2NIQ1NvVDM2SVRPeERDT3lrV002WHcKVW5zYWtaaHRwQ3lxdHlwQXZqaURNM3ZTY2txVTFNSWxLSnA1Z3lGT3k2VHVwQ01tYnRiWlRpSEtaN0ZlcmlmcwoyYng4Z0JmaldFQnEwMEhVWTdyY3RFNzFpVk11WURTczAwYTB2c1ZGQ240akppeWFnM0lHWkdud0FHQk1zR2h3ClFJcndjRHgwdy91NGR1VWRNMzBpaU1WZ0pnPT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
  tls.key: LS0tLS1CRUdJTiBFQyBQUklWQVRFIEtFWS0tLS0tCk1IY0NBUUVFSUwzbjZVZzlhZU1Day9XbkZ2L1pmSTlxMkIyakxnbjFRWGQwcjhIL3k2QkhvQW9HQ0NxR1NNNDkKQXdFSG9VUURRZ0FFTE5BVUllSy92emloZzdrVDl3cThqdThVTnVzbnNhc3ZpaUtVR0J6ZG5WZ3ZTUnczYWM3eApFNFJCOWZnK2NGdVR6ekZoU0dGVUtVSlhpWHl3QVdnSjhRPT0KLS0tLS1FTkQgRUMgUFJJVkFURSBLRVktLS0tLQo=
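For reference, a secret shaped like the one above (EC keys, matching CNs) can be generated programmatically. This Go sketch is illustrative only; the SAN list here is a small subset of the real one, which covers every in-cluster service name (longhorn-backend, longhorn-engine-manager, longhorn-replica-manager, ...):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// genCerts generates a CA plus a server certificate/key shaped like the
// test secret above. Illustrative sketch, not the tooling used for the
// actual secret.
func genCerts() (caPEM, certPEM, keyPEM []byte, err error) {
	caKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, nil, err
	}
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "longhorn-grpc-tls-ca"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().AddDate(5, 0, 0),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
	}
	caDER, err := x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, &caKey.PublicKey, caKey)
	if err != nil {
		return nil, nil, nil, err
	}
	caCert, err := x509.ParseCertificate(caDER)
	if err != nil {
		return nil, nil, nil, err
	}
	srvKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, nil, err
	}
	srvTmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "longhorn-backend"},
		DNSNames: []string{ // subset of the real SAN list, for illustration
			"longhorn-backend",
			"longhorn-backend.longhorn-system",
			"longhorn-backend.longhorn-system.svc",
		},
		NotBefore:   time.Now().Add(-time.Hour),
		NotAfter:    time.Now().AddDate(1, 0, 0),
		KeyUsage:    x509.KeyUsageDigitalSignature,
		ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
	}
	srvDER, err := x509.CreateCertificate(rand.Reader, srvTmpl, caCert, &srvKey.PublicKey, caKey)
	if err != nil {
		return nil, nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(srvKey)
	if err != nil {
		return nil, nil, nil, err
	}
	caPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: caDER})
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: srvDER})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return caPEM, certPEM, keyPEM, nil
}

func main() {
	caPEM, certPEM, keyPEM, err := genCerts()
	if err != nil {
		panic(err)
	}
	// Write these to ca.crt / tls.crt / tls.key, then (requires a cluster):
	//   kubectl -n longhorn-system create secret generic longhorn-grpc-tls \
	//     --type=kubernetes.io/tls --from-file=ca.crt --from-file=tls.crt --from-file=tls.key
	fmt.Println(len(caPEM) > 0, len(certPEM) > 0, len(keyPEM) > 0)
}
```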

@innobead innobead added the highlight Important feature/issue to highlight label May 6, 2022
@chriscchien
Contributor

chriscchien commented May 10, 2022

Verified on v1.3.0-preview1
Result Pass

Followed the test steps for this feature:

1 . absent TLS secret (default setup):

  • Without the TLS secret present in the cluster, Longhorn deploys successfully and normal volume functionality works properly

2 . Test im deployed without TLS secret (non tls server)

  • Continuing after step 1: after the TLS secret is created in the cluster, the TLS secret files can be seen in the /tls-files/ folder of longhorn-manager and instance-manager (r/e)
  • Delete the im-e/im-r pods and wait for them to restart and become ready; volume functionality still works properly (the log Creating grpc server with mtls auth appears without errors after im-r/im-e restart)
tgtd: work_timer_start(146) use timer_fd based scheduler
tgtd: bs_init(387) use signalfd notification
time="2022-05-10T02:37:28Z" level=info msg="Storing process logs at path: /var/log/instances"
[longhorn-instance-manager] time="2022-05-10T02:37:28Z" level=info msg="Creating grpc server with mtls auth"
[longhorn-instance-manager] time="2022-05-10T02:37:28Z" level=info msg="Instance Manager listening to 0.0.0.0:8500"
[longhorn-instance-manager] time="2022-05-10T02:37:28Z" level=debug msg="started new process manager update watch"

3 . Old instance-managers (pre-TLS) should just continue to work, since they use the same client fallback as in the above test case

  • Deploy Longhorn v1.2.4, then create the TLS secret; basic volume functionality works properly
  • Then upgrade Longhorn to v1.3.0-preview1 without a volume engine upgrade; basic volume functionality works properly
  • Upgrade the volume engine; basic volume functionality still works properly

4 . Expected case that will not work, due to one-time setup on the IM (TLS server)

  • After deleting the secret with kubectl delete secret longhorn-grpc-tls -n longhorn-system, create a new volume and attach it to a node; the volume gets stuck at attaching, and the error below appears in longhorn-manager, which matches the expected one-time setup behavior
time="2022-05-10T03:20:08Z" level=warning msg="Error syncing Longhorn replica longhorn-system/vol3-r-df12be2d" controller=longhorn-replica error="fail to sync replica for longhorn-system/vol3-r-df12be2d: failed to get Version of Instance Manager Client for instance-manager-r-c672e060, state: running, IP: 10.42.0.44, TLS: false, Error: failed to get version: rpc error: code = Unavailable desc = connection closed" node=worker3
E0510 03:20:08.965378 1 replica_controller.go:179] fail to sync replica for longhorn-system/vol3-r-df12be2d: failed to get Version of Instance Manager Client for instance-manager-r-c672e060, state: running, IP: 10.42.0.44, TLS: false, Error: failed to get version: rpc error: code = Unavailable desc = connection closed
time="2022-05-10T03:20:08Z" level=warning msg="Dropping Longhorn replica longhorn-system/vol3-r-df12be2d out of the queue" controller=longhorn-replica error="fail to sync replica for longhorn-system/vol3-r-df12be2d: failed to get Version of Instance Manager Client for instance-manager-r-c672e060, state: running, IP: 10.42.0.44, TLS: false, Error: failed to get version: rpc error: code = Unavailable desc = connection closed" node=worker3

@innobead
Member

@joshimoo Please help work on the doc at #3953.

@chriscchien
Contributor

Verified on master-head 20220513(73f1914b988daa121599f7ff2c4c9bbf59a40967)
Result Pass

1 . absent TLS secret (default setup):

  • Without the TLS secret present in the cluster, Longhorn deploys successfully and normal volume functionality works properly

2 . Test im deployed without TLS secret (non tls server)

  • Continuing after step 1: after the TLS secret is created in the cluster, the TLS secret files can be seen in the /tls-files/ folder of longhorn-manager and instance-manager (r/e)

  • Delete the im-e/im-r pods and wait for them to restart and become ready; volume functionality still works properly (the log Creating grpc server with mtls auth appears without errors after im-r/im-e restart)

  • instance-manager log before restart

time="2022-05-13T02:03:01Z" level=info msg="Storing process logs at path: /var/log/instances"
[longhorn-instance-manager] time="2022-05-13T02:03:01Z" level=warning msg="Failed to addd TLS key pair from /tls-files/: open /tls-files/tls.crt: no such file or directory"
[longhorn-instance-manager] time="2022-05-13T02:03:01Z" level=info msg="Creating grpc server with no auth"
[longhorn-instance-manager] time="2022-05-13T02:03:01Z" level=info msg="Instance Manager listening to 0.0.0.0:8500"
[longhorn-instance-manager] time="2022-05-13T02:03:03Z" level=debug msg="started new process manager update watch"
  • instance-manager log after restart
time="2022-05-13T02:25:52Z" level=info msg="Storing process logs at path: /var/log/instances"
[longhorn-instance-manager] time="2022-05-13T02:25:53Z" level=info msg="Creating grpc server with mtls auth"
[longhorn-instance-manager] time="2022-05-13T02:25:53Z" level=info msg="Instance Manager listening to 0.0.0.0:8500"
tgtd: work_timer_start(146) use timer_fd based scheduler
tgtd: bs_init(387) use signalfd notification
[longhorn-instance-manager] time="2022-05-13T02:25:53Z" level=debug msg="started new process manager update watch"
3 . Old instance-managers (pre-TLS) should just continue to work, since they use the same client fallback as in the above test case

  • Deploy Longhorn v1.2.4, then create the TLS secret; basic volume functionality works properly
  • Then upgrade Longhorn to master-head without a volume engine upgrade; basic volume functionality works properly
  • Upgrade the volume engine; basic volume functionality still works properly

4 . Expected case that will not work, due to one-time setup on the IM (TLS server)

  • After deleting the secret with kubectl delete secret longhorn-grpc-tls -n longhorn-system, create a new volume and attach it to a node; the volume gets stuck at attaching, and the same error as above appears in longhorn-manager, which matches the expected one-time setup behavior

@innobead
Member

cc @c3y1huang

@innobead innobead added the require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated label May 13, 2022
@innobead
Member

@cchien816 just added an automation label, could you help create a ticket to track the e2e test implementation later? Thanks.
