NSM interface deleting periodically #1929

@denis-tingaikin
Member

denis-tingaikin commented Jun 21, 2021

Logs

stderr F Jun 17 10:50:48.232 [ERRO] [cmd:[/bin/app]] [healServer:processHeal] Failed to heal connection alpine-cl-0: Error returned from sdk/pkg/networkservice/common/authorize/authorizeClient.Request: rpc error: code = PermissionDenied desc = no sufficient privileges

Steps to reproduce

  1. Run webhook example: https://github.com/networkservicemesh/deployments-k8s/tree/main/examples/features/webhook
  2. Do not clean up.
  3. Wait 20-30 minutes.

Actual:

nsm-toggling.webm.zip

Expected:
The NSM interface should not be deleted while the data plane and control plane are healthy.

@denis-tingaikin denis-tingaikin added bug Something isn't working ASAP The issue should be completed as soon as possible labels Jun 21, 2021
@denis-tingaikin denis-tingaikin added this to Backlog in Issue/PR tracking via automation Jun 21, 2021
@denis-tingaikin denis-tingaikin moved this from Backlog to To do in Issue/PR tracking Jun 21, 2021
@denis-tingaikin
Member Author

Tested on kind. The issue is reproducible.

@denis-tingaikin
Member Author

Logs from my local run:

nsc-init.txt
nsc.txt

@denis-tingaikin
Member Author

@edwarnicke Can we consider this issue ASAP?

@edwarnicke
Member

@denis-tingaikin Yes

@denis-tingaikin denis-tingaikin self-assigned this Jun 21, 2021
@Mixaster995 Mixaster995 moved this from To do to In progress in Issue/PR tracking Jun 23, 2021
@denis-tingaikin
Member Author

denis-tingaikin commented Jun 27, 2021

Root cause:

  1. SPIRE issues a certificate that is valid for 1h.
  2. NSM schedules a refresh after 1h * 1/3 (about 20 minutes).
  3. On refresh, SPIRE updates the certificates for all applications.
  4. At the time of the refresh request, nsmgr has a new certificate from SPIRE, but the gRPC authInfo still holds the old certificate from step 1.
  5. nsmgr updates the token with the new certificate.
  6. The client cannot validate the token from nsmgr because the gRPC authInfo still holds the old certificate from step 1 (the failure occurs here).

I have not found a good solution for this issue yet; I have started looking into the gRPC source code.
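
The sequence above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: `source`, `conn`, `token`, `dial`, `sign`, `validate` and `refreshDelay` are hypothetical names invented for illustration, and certificates are reduced to integer serials. It is not the NSM, SPIRE or gRPC implementation; it only shows why a connection that remembers the certificate from its handshake rejects a token signed after rotation.

```go
package main

import (
	"fmt"
	"time"
)

// source is a hypothetical stand-in for a certificate source that rotates
// certificates. A certificate is modelled as a plain serial number.
type source struct{ serial int }

func (s *source) rotate() { s.serial++ }

// conn is a hypothetical long-lived connection. It records the peer
// certificate once, at dial (handshake) time, and never updates it.
type conn struct{ peerSerial int }

// token is a hypothetical token that remembers which certificate signed it.
type token struct{ signedWith int }

func dial(s *source) *conn { return &conn{peerSerial: s.serial} }

func sign(s *source) token { return token{signedWith: s.serial} }

// validate mimics a check that the token was signed with the certificate
// the connection saw during its handshake.
func validate(c *conn, t token) error {
	if c.peerSerial != t.signedWith {
		return fmt.Errorf("permission denied: token signed with cert %d, connection holds cert %d",
			t.signedWith, c.peerSerial)
	}
	return nil
}

// refreshDelay reproduces the "TTL * 1/3" schedule from step 2.
func refreshDelay(ttl time.Duration) time.Duration { return ttl / 3 }

func main() {
	fmt.Println("refresh after:", refreshDelay(time.Hour))

	src := &source{serial: 1}
	c := dial(src) // step 1: handshake captures cert 1
	fmt.Println("before rotation:", validate(c, sign(src)))

	src.rotate() // step 3: certificates are rotated
	fmt.Println("after rotation:", validate(c, sign(src))) // steps 4-6: fails

	c = dial(src) // a new handshake picks up the new certificate
	fmt.Println("after redial:", validate(c, sign(src)))
}
```

In this model the failure disappears as soon as the connection is re-established, because the new handshake captures the rotated certificate.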

@denis-tingaikin
Member Author

Today I tested two workarounds:

  1. https://golang.org/pkg/crypto/tls/#RenegotiationSupport: this did not help.
  2. Removing connection caching in connect: this works.

I am still looking for other solutions.

@Mixaster995 Mixaster995 moved this from In progress to Open questions in Issue/PR tracking Jun 29, 2021
@denis-tingaikin
Member Author

denis-tingaikin commented Jun 29, 2021

@edwarnicke

I asked the SPIRE maintainers about the issue and got the following answer:

Andrew Harding 14 hours ago
This is expected. gRPC will reuse the existing connection when you issue RPCs unless you redial. Since no new TLS handshake takes place, the new client credential is never communicated.

Andrew Harding 14 hours ago
The x509source returns a channel from Updated() that callers can use to know when the SVID has been updated so they can re-establish a connection with the new credential.

Question: Can we modify the connect chain elements to wait for an SVID update and then redial?

Note: we can simply pass the channel to wait on as an option, so that connect does not depend on SPIRE functions.
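
A minimal sketch of that idea, assuming the update signal arrives as a plain `<-chan struct{}` passed in by the caller (as the note suggests), so the code does not depend on SPIRE. The names `fakeConn`, `redialer`, `newRedialer` and `Conn` are hypothetical and are not part of the NSM SDK; `fakeConn` stands in for a real gRPC client connection. The sketch is not safe for concurrent use and is not the fix that was eventually merged.

```go
package main

import "fmt"

// fakeConn is a hypothetical stand-in for a real client connection.
type fakeConn struct {
	id     int
	closed bool
}

// redialer caches one connection and replaces it once an update signal
// has been received on the updated channel.
type redialer struct {
	updated <-chan struct{}  // fires when the credential has changed
	dial    func() *fakeConn // performs a fresh dial (new handshake)
	conn    *fakeConn
}

func newRedialer(updated <-chan struct{}, dial func() *fakeConn) *redialer {
	return &redialer{updated: updated, dial: dial, conn: dial()}
}

// Conn returns the cached connection. If an update is pending, it first
// closes the old connection and dials a new one.
func (r *redialer) Conn() *fakeConn {
	select {
	case <-r.updated:
		r.conn.closed = true
		r.conn = r.dial()
	default:
		// No update pending: keep using the cached connection.
	}
	return r.conn
}

func main() {
	updated := make(chan struct{}, 1)
	next := 0
	dial := func() *fakeConn { next++; return &fakeConn{id: next} }

	r := newRedialer(updated, dial)
	first := r.Conn()

	updated <- struct{}{} // simulate a credential update
	second := r.Conn()

	fmt.Println(first.id, first.closed, second.id, second.closed)
	// prints: 1 true 2 false
}
```

Checking the channel lazily on each call keeps the example deterministic; a real chain element could equally watch the channel in a goroutine and would need locking around the cached connection.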

@denis-tingaikin
Member Author

Currently we have the following options to fix the issue:

  1. Redial when the SVID is updated, as the SPIRE maintainers suggested (see #1929 (comment)).
  2. Remove the "last token signed" policy.
  3. Keep the initial client and server certificates and use them when generating tokens.
  4. Another option that you suggest.

Option 1 looks good to me.

@edwarnicke Please share your thoughts on these options.

@edwarnicke
Member

I'm curious... are they saying that gRPC won't close existing connections whose TLS certificate has expired after the connection was established?

@denis-tingaikin
Member Author

Yes, as I understand it, the handshake is performed once per dial.

@edwarnicke
Member

@denis-tingaikin It looks like we need to do something that involves option 1 above... but let's try to keep it simple and natural :)

@denis-tingaikin
Member Author

The root cause is fixed in networkservicemesh/sdk#1005

However, I found that the issue can still be reproduced because of unstable healing; it reproduces intermittently.
I tested the fix networkservicemesh/sdk#1005 without heal, and it works in 100% of cases.
