
[FEATURE] mTLS authentication for manager <-> instance-manager communication #3839

Closed
joshimoo opened this issue Apr 11, 2022 · 7 comments
Assignees
Labels
area/security System or volume data access security component/longhorn-instance-manager Longhorn instance manager (interface between control and data plane) component/longhorn-manager Longhorn manager (control plane) highlight Important feature/issue to highlight kind/feature Feature request, new feature priority/0 Must be fixed in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation require/lep Require adding/updating enhancement proposal
Milestone

Comments

@joshimoo
Contributor

Is your feature request related to a problem? Please describe

We want to add authentication to the gRPC control communication; this can be done at the manager <-> instance-manager level, since we are switching the engine binary invocation to a proxy client that also talks directly to the instance-manager.

Previously the mTLS auth was tracked and discussed at #1983, but that isn't really the best place for it, since that feature request is concerned with authentication from the user's perspective (UI).

Describe the solution you'd like

  • evaluate auth schemes (per-call auth token, mTLS)
  • implement the chosen auth scheme in the communication between manager and instance-manager
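For the mTLS option, the server side comes down to a tls.Config that presents the instance-manager certificate and requires client certificates signed by a shared CA. A minimal sketch in Go (stdlib only; the /tls-files/ paths are an assumption based on the secret mount discussed later in this issue, not the actual Longhorn code):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// serverMTLSConfig builds a TLS config that serves the given key pair and
// requires clients to present a certificate signed by the CA in caFile.
// Illustrative sketch only, not the Longhorn implementation.
func serverMTLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, fmt.Errorf("failed to load TLS key pair: %w", err)
	}
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, fmt.Errorf("failed to read CA cert: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no valid CA certificates in %s", caFile)
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,
		ClientAuth:   tls.RequireAndVerifyClientCert, // this is what makes the TLS mutual
		MinVersion:   tls.VersionTLS12,
	}, nil
}

func main() {
	// With no secret mounted, the files are absent and the server would
	// fall back to a non-TLS listener; here we just exercise the error path.
	_, err := serverMTLSConfig("/tls-files/tls.crt", "/tls-files/tls.key", "/tls-files/ca.crt")
	fmt.Println("mTLS config unavailable:", err != nil)
}
```

On the client side the same CA pool would go into RootCAs together with the manager's own key pair.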

Additional context

@joshimoo joshimoo added kind/feature Feature request, new feature component/longhorn-manager Longhorn manager (control plane) component/longhorn-instance-manager Longhorn instance manager (interface between control and data plane) area/security System or volume data access security labels Apr 11, 2022
@joshimoo joshimoo added this to the v1.3.0 milestone Apr 11, 2022
@joshimoo joshimoo self-assigned this Apr 11, 2022
@longhorn-io-github-bot

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

  • Which areas/issues this PR might have potential impacts on?
    Area
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@innobead innobead added the priority/0 Must be fixed in this release (managed by PO) label Apr 12, 2022
@innobead innobead added require/doc Require updating the longhorn.io documentation require/lep Require adding/updating enhancement proposal labels Apr 24, 2022
@innobead innobead changed the title [FEATURE] MTLS authentication for manager <-> instance-manager communication [FEATURE] mTLS authentication for manager <-> instance-manager communication Apr 26, 2022
joshimoo added a commit to joshimoo/longhorn that referenced this issue May 6, 2022
This image includes MTLS support for the GRPC server and im client
refactorings to enable persistent connection support.

Longhorn longhorn#3839

Signed-off-by: Joshua Moody <joshua.moody@suse.com>
joshimoo added a commit to joshimoo/longhorn that referenced this issue May 6, 2022
The MTLS secret is currently hardcoded to longhorn-grpc-tls we can improve
this further in the future when necessary to also enable auto cert generation

Longhorn longhorn#3839

Signed-off-by: Joshua Moody <joshua.moody@suse.com>
innobead pushed a commit that referenced this issue May 6, 2022
This image includes MTLS support for the GRPC server and im client
refactorings to enable persistent connection support.

Longhorn #3839

Signed-off-by: Joshua Moody <joshua.moody@suse.com>
innobead pushed a commit that referenced this issue May 6, 2022
The MTLS secret is currently hardcoded to longhorn-grpc-tls we can improve
this further in the future when necessary to also enable auto cert generation

Longhorn #3839

Signed-off-by: Joshua Moody <joshua.moody@suse.com>
@joshimoo
Contributor Author

joshimoo commented May 6, 2022

Below is a test secret that can be used to enable mTLS auth on a cluster.
The different test cases:

absent TLS secret (default setup):

  • manager + IM are deployed while the TLS secret is not present (i.e. non-TLS client, non-TLS server)
  • normal volume functionality should work

Test IM deployed without TLS secret (non-TLS server)

  • same as default setup
  • create the TLS secret
    • the manager will pick up the TLS secret when it's updated; this should be quick and can be configured (check /tls-files/ in the manager container)
    • from this point forward, the manager will always try a TLS connection first, then fall back to a non-TLS connection if it's unable to communicate with the server
  • normal volume functionality should work
  • delete the im-e/im-r deployment on a random node X
    • after the IMs on node X are recreated, they will have TLS enabled
    • the manager will communicate with them via a TLS client
    • the manager will continue to use the fallback client for non-TLS servers
  • normal volume functionality should work on node X as well as on any other node

Old instance-managers (pre-TLS) should just continue to work, since they use the same client fallback as in the above test case

  • keep an old running volume (without engine updates) around from before the upgrade to the new Longhorn version, with a previously set up TLS secret

Expected case that will not work, due to one-time setup on the IM (TLS server)

  • if the TLS secret is removed after the IMs were started while the TLS secret was present
  • the manager will no longer be able to initiate a TLS connection with the IM
    • since the secret's TLS files will no longer be present
    • it will be able to reuse the previously created persistent client, but further refactoring on the controllers is needed to ensure there are no one-shot clients
  • in the future we can consider adding gRPC server reloads + an alternative TLS provisioning mechanism, but this isn't currently supported
  • the reason for refreshing the secret on the manager side is to support certificate rotation, i.e. as long as the base CA cert doesn't change, the regular cert should be updatable even if it's only updated on the manager (client) side
apiVersion: v1
kind: Secret
metadata:
  name: longhorn-grpc-tls
  namespace: longhorn-system
type: kubernetes.io/tls
data:
  ca.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUREakNDQWZhZ0F3SUJBZ0lVU2tNdUVEOC9XYXphNmpkb1NiTE1qalFqb3JFd0RRWUpLb1pJaHZjTkFRRUwKQlFBd0h6RWRNQnNHQTFVRUF4TVViRzl1WjJodmNtNHRaM0p3WXkxMGJITXRZMkV3SGhjTk1qSXdNVEV4TWpFeApPVEF3V2hjTk1qY3dNVEV3TWpFeE9UQXdXakFmTVIwd0d3WURWUVFERXhSc2IyNW5hRzl5YmkxbmNuQmpMWFJzCmN5MWpZVENDQVNJd0RRWUpLb1pJaHZjTkFRRUJCUUFEZ2dFUEFEQ0NBUW9DZ2dFQkFNY2grbTJhUndnNEtBa0EKT0xzdzdScWlWb1VqL2VPbVhuSE9HVE5nWE4rcFh5bDlCdzVDM1J4UDYzU29qaTVvNEhkU1htVmpwZmhmNjh1YwpvNVJJeUtXM1p6cndteDhXZldEc0dNNEtnYXBvMy84N3pVQ00vdGltOHllTzFUbTZlWVhXcWdlZ2JpM1Q1WnlvCmkzRjdteFg3QlU3Z25uWGthVmJ5UU1xRkEyMDJrK25jaVhaUE9iU0tlc1NvZ20wdWsrYXFvY3N1SjJ6dk9tZG0KMXd0a3ZTUklhL3l6T25JRGlmbFRteXNhZ3oxQy9VM1JxbzJ6TjIwbWJNYUJhMmx5anVZWkdWSnNyNGh4dGhqUApIR2x1UUh2QTlKTE9kc2J0T2xmbjRZNlZpUktCSzZWMVpOeVROMVJpN3ArTXZlaWQ3cE9rNHYweC9qVTc1a0N6Clo1cGJHbGtDQXdFQUFhTkNNRUF3RGdZRFZSMFBBUUgvQkFRREFnRUdNQThHQTFVZEV3RUIvd1FGTUFNQkFmOHcKSFFZRFZSME9CQllFRlBGc0xRbmQxOHFUTVd5djh1STk3Z2hnR2djR01BMEdDU3FHU0liM0RRRUJDd1VBQTRJQgpBUUNMcnk5a2xlSElMdDRwbzd4N0hvSldsMEswYjdwV2Y0Y3ZVeHh1bUdTYUpoQmFHNTVlZFNFSVAzajhsRGg1Cm94ZXJlbjNrRUtzeGZiQVQ0RzU3KzBaeExQSkZQcjFMM3JvcmxUVE1DS1QyY2Z1UDJ3SEIzZndWNDJpSHZSUDgKSUVqU041bFNkWjZnN1NjWFZ2RnpZNzlrbVZDQ2RNYlpGcEFuOElyTkh3L0tTUGZUajNob2VyV3ZGL3huaEo3bQpmSzUrcE5TeWR6QTA1K1Y0ODJhWGlvV2NWcWY2UHpSVndmT0tIalUrbUVDQXZMbitNSzRvN1l2VW1iN2tSUGs5CnBjU1A4N2lpN0hwRVhqZUtRaVJhZElXKzMySXp1UTFiOXRYc3BNTGF0UFA5TXNvWmY0M1EyZWw4bWd1RjRxOUcKVmVUZFZaU2hBNWNucmNRZTEySUs1MzAvCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
  tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVqekNDQTNlZ0F3SUJBZ0lVUjZWcGR5U1Z0MGp6bDcwQnIxMmdZOTB0QVNBd0RRWUpLb1pJaHZjTkFRRUwKQlFBd0h6RWRNQnNHQTFVRUF4TVViRzl1WjJodmNtNHRaM0p3WXkxMGJITXRZMkV3SGhjTk1qSXdNVEV4TWpFeApPVEF3V2hjTk1qTXdNVEV4TWpFeE9UQXdXakFiTVJrd0Z3WURWUVFERXhCc2IyNW5hRzl5YmkxaVlXTnJaVzVrCk1Ga3dFd1lIS29aSXpqMENBUVlJS29aSXpqMERBUWNEUWdBRUxOQVVJZUsvdnppaGc3a1Q5d3E4anU4VU51c24Kc2FzdmlpS1VHQnpkblZndlNSdzNhYzd4RTRSQjlmZytjRnVUenpGaFNHRlVLVUpYaVh5d0FXZ0o4YU9DQXBBdwpnZ0tNTUE0R0ExVWREd0VCL3dRRUF3SUZvREFkQmdOVkhTVUVGakFVQmdnckJnRUZCUWNEQVFZSUt3WUJCUVVICkF3SXdEQVlEVlIwVEFRSC9CQUl3QURBZEJnTlZIUTRFRmdRVXgvaDVCOUFMSExuYWJaNjBzT2dvbnA3YlN0VXcKSHdZRFZSMGpCQmd3Rm9BVThXd3RDZDNYeXBNeGJLL3k0ajN1Q0dBYUJ3WXdnZ0lMQmdOVkhSRUVnZ0lDTUlJQgovb0lRYkc5dVoyaHZjbTR0WW1GamEyVnVaSUlnYkc5dVoyaHZjbTR0WW1GamEyVnVaQzVzYjI1bmFHOXliaTF6CmVYTjBaVzJDSkd4dmJtZG9iM0p1TFdKaFkydGxibVF1Ykc5dVoyaHZjbTR0YzNsemRHVnRMbk4yWTRJUmJHOXUKWjJodmNtNHRabkp2Ym5SbGJtU0NJV3h2Ym1kb2IzSnVMV1p5YjI1MFpXNWtMbXh2Ym1kb2IzSnVMWE41YzNSbApiWUlsYkc5dVoyaHZjbTR0Wm5KdmJuUmxibVF1Ykc5dVoyaHZjbTR0YzNsemRHVnRMbk4yWTRJWGJHOXVaMmh2CmNtNHRaVzVuYVc1bExXMWhibUZuWlhLQ0oyeHZibWRvYjNKdUxXVnVaMmx1WlMxdFlXNWhaMlZ5TG14dmJtZG8KYjNKdUxYTjVjM1JsYllJcmJHOXVaMmh2Y200dFpXNW5hVzVsTFcxaGJtRm5aWEl1Ykc5dVoyaHZjbTR0YzNsegpkR1Z0TG5OMlk0SVliRzl1WjJodmNtNHRjbVZ3YkdsallTMXRZVzVoWjJWeWdpaHNiMjVuYUc5eWJpMXlaWEJzCmFXTmhMVzFoYm1GblpYSXViRzl1WjJodmNtNHRjM2x6ZEdWdGdpeHNiMjVuYUc5eWJpMXlaWEJzYVdOaExXMWgKYm1GblpYSXViRzl1WjJodmNtNHRjM2x6ZEdWdExuTjJZNElNYkc5dVoyaHZjbTR0WTNOcGdoeHNiMjVuYUc5eQpiaTFqYzJrdWJHOXVaMmh2Y200dGMzbHpkR1Z0Z2lCc2IyNW5hRzl5YmkxamMya3ViRzl1WjJodmNtNHRjM2x6CmRHVnRMbk4yWTRJUWJHOXVaMmh2Y200dFltRmphMlZ1WkljRWZ3QUFBVEFOQmdrcWhraUc5dzBCQVFzRkFBT0MKQVFFQWV5UlhCWnI5Z1RmTGlsNGMvZElaSlVYeFh4ckFBQmtJTG55QkdNdkFqaFJoRndLZ09VU0MvMGUyeDYvTQpoTi9SWElYVzdBYUF0a25ZSHFLa3piMDZsbWhxczRHNWVjNkZRZDViSGdGbnFPOHNWNEF6WVFSRWhDZjlrWWhUClVlRnJLdDdOQllHNFNXSnNYK2M0ZzU5RlZGZkIzbTZscStoR3JaY085T2NIQ1NvVDM2SVRPeERDT3lrV002WHcKVW5zYWtaaHRwQ3lxdHlwQXZqaURNM3ZTY2txVTFNSWxLSnA1Z3lGT3k2VHVwQ01tYnRiWlRpSEtaN0ZlcmlmcwoyYng4Z0JmaldFQnEwMEhVWTdyY3RFNzFpVk11WURTczAwYTB2c1ZGQ240akppeWFnM0lHWkdud0FHQk1zR2h3ClFJcndjRHgwdy91NGR1VWRNMzBpaU1WZ0pnPT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
  tls.key: LS0tLS1CRUdJTiBFQyBQUklWQVRFIEtFWS0tLS0tCk1IY0NBUUVFSUwzbjZVZzlhZU1Day9XbkZ2L1pmSTlxMkIyakxnbjFRWGQwcjhIL3k2QkhvQW9HQ0NxR1NNNDkKQXdFSG9VUURRZ0FFTE5BVUllSy92emloZzdrVDl3cThqdThVTnVzbnNhc3ZpaUtVR0J6ZG5WZ3ZTUnczYWM3eApFNFJCOWZnK2NGdVR6ekZoU0dGVUtVSlhpWHl3QVdnSjhRPT0KLS0tLS1FTkQgRUMgUFJJVkFURSBLRVktLS0tLQo=
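For reference, a secret shaped like the one above (EC keys, matching CNs) can be generated programmatically. This Go sketch is illustrative only; the SAN list here is a small subset of the real one, which covers every in-cluster service name (longhorn-backend, longhorn-engine-manager, longhorn-replica-manager, ...):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// genCerts generates a CA plus a server certificate/key shaped like the
// test secret above. Illustrative sketch, not the tooling used for the
// actual secret.
func genCerts() (caPEM, certPEM, keyPEM []byte, err error) {
	caKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, nil, err
	}
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "longhorn-grpc-tls-ca"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().AddDate(5, 0, 0),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
	}
	caDER, err := x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, &caKey.PublicKey, caKey)
	if err != nil {
		return nil, nil, nil, err
	}
	caCert, err := x509.ParseCertificate(caDER)
	if err != nil {
		return nil, nil, nil, err
	}
	srvKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, nil, err
	}
	srvTmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "longhorn-backend"},
		DNSNames: []string{ // subset of the real SAN list, for illustration
			"longhorn-backend",
			"longhorn-backend.longhorn-system",
			"longhorn-backend.longhorn-system.svc",
		},
		NotBefore:   time.Now().Add(-time.Hour),
		NotAfter:    time.Now().AddDate(1, 0, 0),
		KeyUsage:    x509.KeyUsageDigitalSignature,
		ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
	}
	srvDER, err := x509.CreateCertificate(rand.Reader, srvTmpl, caCert, &srvKey.PublicKey, caKey)
	if err != nil {
		return nil, nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(srvKey)
	if err != nil {
		return nil, nil, nil, err
	}
	caPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: caDER})
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: srvDER})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return caPEM, certPEM, keyPEM, nil
}

func main() {
	caPEM, certPEM, keyPEM, err := genCerts()
	if err != nil {
		panic(err)
	}
	// Write these to ca.crt / tls.crt / tls.key, then (requires a cluster):
	//   kubectl -n longhorn-system create secret generic longhorn-grpc-tls \
	//     --type=kubernetes.io/tls --from-file=ca.crt --from-file=tls.crt --from-file=tls.key
	fmt.Println(len(caPEM) > 0, len(certPEM) > 0, len(keyPEM) > 0)
}
```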

@innobead innobead added the highlight Important feature/issue to highlight label May 6, 2022
@chriscchien
Contributor

chriscchien commented May 10, 2022

Verified on v1.3.0-preview1
Result Pass

Followed the test steps for this feature:

1 . absent TLS secret (default setup):

  • Without the TLS secret present in the cluster, Longhorn deploys successfully and normal volume functionality works properly

2 . Test im deployed without TLS secret (non tls server)

  • Continuing after step 1: after the TLS secret is created in the cluster, the TLS secret files can be seen in the /tls-files/ folder of longhorn-manager and instance-manager (r/e)
  • Delete the im-e/im-r pods and wait for them to restart and become ready; volume functionality still works properly (the log Creating grpc server with mtls auth appears without errors after im-r/im-e restart)
tgtd: work_timer_start(146) use timer_fd based scheduler
tgtd: bs_init(387) use signalfd notification
time="2022-05-10T02:37:28Z" level=info msg="Storing process logs at path: /var/log/instances"
[longhorn-instance-manager] time="2022-05-10T02:37:28Z" level=info msg="Creating grpc server with mtls auth"
[longhorn-instance-manager] time="2022-05-10T02:37:28Z" level=info msg="Instance Manager listening to 0.0.0.0:8500"
[longhorn-instance-manager] time="2022-05-10T02:37:28Z" level=debug msg="started new process manager update watch"

3 . Old instance-managers (pre-TLS) should just continue to work, since they use the same client fallback as in the above test case

  • Deploy Longhorn v1.2.4, then create the TLS secret; basic volume functionality works properly
  • Then upgrade Longhorn to v1.3.0-preview1 without a volume engine upgrade; basic volume functionality works properly
  • Upgrade the volume engine; basic volume functionality still works properly

4 . Expected case that will not work, due to one-time setup on the IM (TLS server)

  • After deleting the secret with kubectl delete secret longhorn-grpc-tls -n longhorn-system, create a new volume and attach it to a node; the volume gets stuck at attaching, and the error below appears in longhorn-manager, which matches the expected one-time setup behavior
time="2022-05-10T03:20:08Z" level=warning msg="Error syncing Longhorn replica longhorn-system/vol3-r-df12be2d" controller=longhorn-replica error="fail to sync replica for longhorn-system/vol3-r-df12be2d: failed to get Version of Instance Manager Client for instance-manager-r-c672e060, state: running, IP: 10.42.0.44, TLS: false, Error: failed to get version: rpc error: code = Unavailable desc = connection closed" node=worker3
E0510 03:20:08.965378 1 replica_controller.go:179] fail to sync replica for longhorn-system/vol3-r-df12be2d: failed to get Version of Instance Manager Client for instance-manager-r-c672e060, state: running, IP: 10.42.0.44, TLS: false, Error: failed to get version: rpc error: code = Unavailable desc = connection closed
time="2022-05-10T03:20:08Z" level=warning msg="Dropping Longhorn replica longhorn-system/vol3-r-df12be2d out of the queue" controller=longhorn-replica error="fail to sync replica for longhorn-system/vol3-r-df12be2d: failed to get Version of Instance Manager Client for instance-manager-r-c672e060, state: running, IP: 10.42.0.44, TLS: false, Error: failed to get version: rpc error: code = Unavailable desc = connection closed" node=worker3

@innobead
Member

@joshimoo Please help work on the doc at #3953.

@chriscchien
Contributor

Verified on master-head 20220513(73f1914b988daa121599f7ff2c4c9bbf59a40967)
Result Pass

1 . absent TLS secret (default setup):

  • Without the TLS secret present in the cluster, Longhorn deploys successfully and normal volume functionality works properly

2 . Test im deployed without TLS secret (non tls server)

  • Continuing after step 1: after the TLS secret is created in the cluster, the TLS secret files can be seen in the /tls-files/ folder of longhorn-manager and instance-manager (r/e)

  • Delete the im-e/im-r pods and wait for them to restart and become ready; volume functionality still works properly (the log Creating grpc server with mtls auth appears without errors after im-r/im-e restart)

  • instance-manager log before restart

time="2022-05-13T02:03:01Z" level=info msg="Storing process logs at path: /var/log/instances"
[longhorn-instance-manager] time="2022-05-13T02:03:01Z" level=warning msg="Failed to addd TLS key pair from /tls-files/: open /tls-files/tls.crt: no such file or directory"
[longhorn-instance-manager] time="2022-05-13T02:03:01Z" level=info msg="Creating grpc server with no auth"
[longhorn-instance-manager] time="2022-05-13T02:03:01Z" level=info msg="Instance Manager listening to 0.0.0.0:8500"
[longhorn-instance-manager] time="2022-05-13T02:03:03Z" level=debug msg="started new process manager update watch"
  • instance-manager log after restart
time="2022-05-13T02:25:52Z" level=info msg="Storing process logs at path: /var/log/instances"
[longhorn-instance-manager] time="2022-05-13T02:25:53Z" level=info msg="Creating grpc server with mtls auth"
[longhorn-instance-manager] time="2022-05-13T02:25:53Z" level=info msg="Instance Manager listening to 0.0.0.0:8500"
tgtd: work_timer_start(146) use timer_fd based scheduler
tgtd: bs_init(387) use signalfd notification
[longhorn-instance-manager] time="2022-05-13T02:25:53Z" level=debug msg="started new process manager update watch"
3 . Old instance-managers (pre-TLS) should just continue to work, since they use the same client fallback as in the above test case

  • Deploy Longhorn v1.2.4, then create the TLS secret; basic volume functionality works properly
  • Then upgrade Longhorn to master-head without a volume engine upgrade; basic volume functionality works properly
  • Upgrade the volume engine; basic volume functionality still works properly

4 . Expected case that will not work, due to one-time setup on the IM (TLS server)

  • After deleting the secret with kubectl delete secret longhorn-grpc-tls -n longhorn-system, create a new volume and attach it to a node; the volume gets stuck at attaching, and the same error as above appears in longhorn-manager, which matches the expected one-time setup behavior

@innobead
Member

cc @c3y1huang

@innobead innobead added the require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated label May 13, 2022
@innobead
Member

@cchien816 just added an automation label, could you help create a ticket to track the e2e test implementation later? Thanks.
