Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vm: support health checks for VMs that are not using auto-registration #44712

Conversation

yskopets
Copy link
Member

@yskopets yskopets commented May 2, 2023

Please provide a description of this PR:

This PR is a rebased version of former #37086.

The PR implements support for health checking of those VMs that are not using auto-registration.

The notion of WorkloadEntry "cleanup" has been extended to accommodate 2 types VMs, namely:

  • "cleanup" for VMs that do use auto-registration means "delete WorkloadEntry"
  • "cleanup" for VMs that do NOT use auto-registration means "delete Health condition" (as health status is no longer known)

The overall purpose of this PR is to make WorkloadEntry the fundamental building block, so that it can be used even without higher level concepts such as auto-registration.

Please check any characteristics that apply to this pull request.

This PR does not introduce any API changes, however it makes possible a new use case that might need to be documented.

E.g., if a user wants to enable health-checking for a WorkloadEntry that is not using auto-registration, he/she needs to make the following changes to the configuration:

apiVersion: networking.istio.io/v1alpha3
kind: WorkloadEntry
metadata:
  annotations:
    # indicate that `istiod` should exclude this workload from load balancing
    # until `Healthy` condition is set to `True`
    proxy.istio.io/health-checks-enabled: "true"
spec:
  ...  

Additional notes:

This PR adds a new feature while keeping the changes to the existing code to the minimum.

As a result, file pilot/pkg/autoregistration/controller.go is now a home to all sorts of logic:

  • [existing logic] auto-registration
  • [existing logic] health checking
  • [existing logic] persistence of internal state as part of WorkloadEntry resource
  • [existing logic] WorkloadEntry cleanup on Istio Proxy disconnect
  • [existing logic] WorkloadEntry cleanup on expiration of the grace period
  • [new logic] changes to various places above

To improve this, I've also prepared a refactoring PR where pilot/pkg/autoregistration/controller.go is split into smaller sub-components.

Please let me know if you would prefer the refactored PR instead of this one.

@istio-testing
Copy link
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label May 2, 2023
@istio-testing istio-testing added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 2, 2023
Signed-off-by: Yaroslav Skopets <yaroslav@tetrate.io>
@yskopets yskopets force-pushed the feature/istio-1.18/vm-health-check-for-workload-entry branch from 2b59607 to a189378 Compare May 2, 2023 20:58
Signed-off-by: Yaroslav Skopets <yaroslav@tetrate.io>
@yskopets
Copy link
Member Author

yskopets commented May 2, 2023

/test all

@yskopets yskopets marked this pull request as ready for review May 2, 2023 22:19
@yskopets yskopets requested review from a team as code owners May 2, 2023 22:19
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label May 2, 2023
@yskopets
Copy link
Member Author

yskopets commented May 2, 2023

/test unit-tests-arm64

if features.WorkloadEntryAutoRegistration && proxy.Metadata.AutoRegisterGroup != "" {
entryName = autoregisteredWorkloadEntryName(proxy)
autoCreate = true
} else if features.WorkloadEntryHealthChecks && proxy.Metadata.WorkloadEntry != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where Metadata.WorkloadEntry is first set

Copy link
Member Author

@yskopets yskopets May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of proxy.Metadata.WorkloadEntry comes from the ISTIO_META_WORKLOAD_ENTRY environment variable that a user can set for an Istio Proxy (SystemD Unit) on a VM.

proxy.Metadata.WorkloadEntry is meant to be symmetrical to proxy.Metadata.AutoRegisterGroup:

  • if a user wants auto-registration behaviour, he can
    • run istioctl x workload entry --autoregister command, which will generate a mesh.yaml file with the ISTIO_META_AUTO_REGISTER_GROUP value set
  • if a user does not want auto-registration behaviour, he can
    • run istioctl x workload entry command (notice absence of --autoregister flag)
    • create a WorkloadEntry resource explicitly
    • set environment variable ISTIO_META_WORKLOAD_ENTRY for the Istio Proxy (SystemD Unit) on a VM

E.g., as a power user of Istio, we see a value in ability to create WorkloadEntry through external automation.
This way we can get full control over the content of WorkloadEntry, including:

  • WorkloadEntry name
  • WorkloadEntry labels
  • WorkloadEntry locality
  • WorkloadEntry address

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of proxy.Metadata.WorkloadEntry comes from the ISTIO_META_WORKLOAD_ENTRY environment variable that a user can set for an Istio Proxy (SystemD Unit) on a VM.

Is it a must for healthcheck, right

I can understand the need to do healthcheck for WLE, but could not for setting the above Spec for users.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ISTIO_META_WORKLOAD_ENTRY is used for the purposes of health checks.

When Istio Proxy (running on a VM) connects to istiod and starts sending health status updates, we need to infer a WorkloadEntry to save the health updates to.

In the case of auto-registration, WorkloadEntry name is generated based on metadata of the Istio Proxy.

In the case of externally created WorkloadEntry, we need to infer which WorkloadEntry corresponds to the connected Istio Proxy.

One solution would be to find WorkloadEntry by IP address. However, it wouldn't cover all the use cases:

  • there might be multiple WorkloadEntry(s) per IP address
  • WorkloadEntry might be defined for a different IP address than the one sent in the metadata of the Istio Proxy

That's why this PR proposes a solution where Istio Proxy must explicitly name the WorkloadEntry it corresponds to.

@hzxuzhonghu
Copy link
Member

@yskopets I prefer refactoring the interface, currently it is a bit confusing

Copy link
Contributor

@costinm costinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a very good idea.

One small suggestion - I don't know what we do for the Service/Deployment created by Istiod for gateway, but would be good to have a consistent way to track who created
a resource. I think there are well-established patterns ( managed-by labels ), we should
use them to make it clear who created the workload entry.
It may also be helpful to add an annotation with the actual istiod instance where the
connection is held, for debug.

The 'auto created' name is a bit confusing - you really mean created by istiod, other
automation tools will also auto-created WEs.

if wle == nil {
return false, fmt.Errorf("failed updating WorkloadEntry %s/%s: WorkloadEntry not found", proxy.Metadata.Namespace, entryName)
}
lastConTime, _ := time.Parse(timeFormat, wle.Annotations[ConnectedAtAnnotation])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's handle the error here appropriately

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the existing code that was moved into a separate function to make it reusable.

I think the current behaviour makes sense - if annotation value is corrupted then recover automatically.

What error handling logic do you have in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd at least like logging for when this happens since I don't think it should ever happen in a typical flow

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will address it in a follow-up PR

@linsun
Copy link
Member

linsun commented May 5, 2023

This seems very useful. From a UX perspective, thinking loud here - would this be possible without user needing to manually annotate the workload entry resource to indicate health check enabled? e.g. can istiod calculate that info based on the fact that WorkloadEntry is auto vs manual generated?

@yskopets
Copy link
Member Author

yskopets commented May 5, 2023

@yskopets I prefer refactoring the interface, currently it is a bit confusing

@hzxuzhonghu Do I understand correctly that you want me to split pilot/pkg/autoregistration/controller.go into smaller sub-components?

I've prepared a #44786 for that. Right now it's Draft. Let me know if that's what you mean.

@yskopets
Copy link
Member Author

yskopets commented May 5, 2023

would this be possible without user needing to manually annotate the workload entry resource to indicate health check enabled?

@linsun AFAIU, the primary consumer of the proxy.istio.io/health-checks-enabled annotation is the logic that generates EDS responses.

It is preferable to keep that EDS logic free from dependencies on ProxyConfig (where ReadinessProbe is defined) or knowledge about auto-registered/non-auto-registered WorkloadEntry.

If we let a user to create WorkloadEntry resource without the proxy.istio.io/health-checks-enabled annotation, we would need to add it automatically later when Istio Proxy connects to istiod.

However, within the time span when proxy.istio.io/health-checks-enabled annotation is absent on a WorkloadEntry, EDS logic considers such a WorkloadEntry healthy. Which would be wrong.

@hzxuzhonghu
Copy link
Member

@hzxuzhonghu Do I understand correctly that you want me to split pilot/pkg/autoregistration/controller.go into smaller sub-components?
I've prepared a #44786 for that. Right now it's Draft. Let me know if that's what you mean.

Splitting looks good, the more important should be that the logic in this pr should not couple with register

store := memory.NewController(memory.Make(collections.All))
createOrFail(t, store, weB)

stop := make(chan struct{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: test.NewStop(). Or abstract L318-325 to a NewTestController function, since its repeated many times

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will address it in a follow-up PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #44825

@istio-testing istio-testing merged commit 6cc55f8 into istio:master May 8, 2023
27 checks passed
@yskopets yskopets deleted the feature/istio-1.18/vm-health-check-for-workload-entry branch May 9, 2023 14:57
@yskopets
Copy link
Member Author

yskopets commented May 9, 2023

Thank you all for the feedback. I will address it in the follow-up PR(s).

@istio-testing
Copy link
Collaborator

In response to a cherrypick label: new pull request created: #44866

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/networking cherrypick/release-1.18 Set this label on a PR to auto-merge it to the release-1.18 branch size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants