Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multi-container probing #14853

Merged
merged 2 commits into from Mar 12, 2024
Merged

Add multi-container probing #14853

merged 2 commits into from Mar 12, 2024

Conversation

ReToCode
Copy link
Member

@ReToCode ReToCode commented Jan 31, 2024

Fixes #8949

Proposed Changes

Release Note

A new feature (`multi-container-probing`) is introduced to enable liveness and readiness probes for Knative Services with multiple containers

Documentation
knative/docs#5855

@knative-prow knative-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jan 31, 2024
@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/API API objects and controllers area/autoscale area/networking labels Jan 31, 2024
@ReToCode
Copy link
Member Author

cc @skonto

Copy link
Member Author

@ReToCode ReToCode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some inline comments to help with reviews

#
# WARNING: Cannot safely be disabled once enabled.
# See: https://knative.dev/docs/serving/configuration/feature-flags/#multiple-container-probing
multi-container-probing: "disabled"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might not just want to enable this when multi-container is enabled, as it will change the QP spec (array instead of single probe) which causes a redeploy of all Knative Services, thus a new flag.

Copy link
Contributor

@skonto skonto Feb 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it is a good idea for people to test this and then we should remove the flag once this is mature enough. We need to add this to the docs under https://knative.dev/docs/serving/configuration/feature-flags.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added docs in knative/docs#5855

}
} else if cfg.Features.MultiContainerProbing == config.Enabled {
// Liveness Probes
errs = errs.Also(validateProbe(container.LivenessProbe, nil, true).ViaField("livenessProbe"))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if enabled, we validate & default the probes as on the user-container

if getPort.StrVal != "" && getPort.StrVal != port.Name {
errs = errs.Also(apis.ErrInvalidValue(getPort.String(), "httpGet.port", "Probe port must match container port"))
if !isSidecar {
getPort := h.HTTPGet.Port
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on sidecars we do not want to rewrite the port to the container-port.

@@ -132,9 +132,10 @@ func (rs *RevisionSpec) applyDefault(ctx context.Context, container *corev1.Cont
// If there are multiple containers then default probes will be applied to the container where user specified PORT
// default probes will not be applied for non serving containers
if len(rs.PodSpec.Containers) == 1 || len(container.Ports) != 0 {
rs.applyProbesWithDefaults(container)
rs.applyGRPCProbeDefaults(container)
rs.applyUserContainerDefaultReadinessProbe(container)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we only enforce a readiness probe on the user-container, we do not enforce it on sidecars.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this ServingContainer

}

func (*RevisionSpec) applyReadinessProbeDefaults(container *corev1.Container) {
if container.ReadinessProbe == nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if a readiness probe is defined on sidecars, we want the same defaulting behaviour as for the UC

@@ -256,6 +265,19 @@ func makeContainer(container corev1.Container, rev *v1.Revision) corev1.Containe
if container.TerminationMessagePolicy == "" {
container.TerminationMessagePolicy = corev1.TerminationMessageFallbackToLogsOnError
}

if container.ReadinessProbe != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want this on on sidecars as well, thus moved from makeServingContainer

userProbe := container.ReadinessProbe.DeepCopy()
applyReadinessProbeDefaultsForExec(userProbe, probePort)

var err error
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved this block out, as we now want to assemble multiple probes

case sc.ReadinessProbe.GRPC != nil && sc.ReadinessProbe.GRPC.Port > 0:
probePort = sc.ReadinessProbe.GRPC.Port
default:
return nil, fmt.Errorf("sidecar readiness probe does not define probe port on container: %s", sc.Name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sidecars explicitly need to define on what port, we cannot just rely on defaulting to user-port on sidecars.

}

} else {
readinessProbeJSON, err = readiness.EncodeSingleProbe(userContainerReadinessProbe)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if disabled, we keep the env var as is

Ports: ports,
StartupProbe: execProbe,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

execProbe was never set - it confused me so I removed it here

@knative-prow knative-prow bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 1, 2024
Copy link

codecov bot commented Feb 1, 2024

Codecov Report

Attention: Patch coverage is 82.15962% with 38 lines in your changes are missing coverage. Please review.

Project coverage is 85.74%. Comparing base (f4ef3c8) to head (7b30193).
Report is 117 commits behind head on main.

Files Patch % Lines
pkg/queue/readiness/probe.go 75.00% 12 Missing ⚠️
pkg/reconciler/revision/resources/queue.go 78.94% 10 Missing and 2 partials ⚠️
pkg/queue/readiness/probe_encoding.go 73.91% 4 Missing and 2 partials ⚠️
pkg/queue/sharedmain/main.go 0.00% 5 Missing ⚠️
pkg/apis/serving/k8s_validation.go 92.30% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14853      +/-   ##
==========================================
- Coverage   85.81%   85.74%   -0.07%     
==========================================
  Files         198      198              
  Lines       15117    15226     +109     
==========================================
+ Hits        12972    13055      +83     
- Misses       1824     1844      +20     
- Partials      321      327       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@knative-prow knative-prow bot added the area/test-and-release It flags unit/e2e/conformance/perf test issues for product features label Feb 1, 2024
@ReToCode ReToCode changed the title [wip] Add multi-container probing Add multi-container probing Feb 2, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 2, 2024
@ReToCode
Copy link
Member Author

ReToCode commented Feb 2, 2024

/unhold

PTAL:

/assign @dprotaso
/assign @skonto

@knative-prow knative-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2024
}),
data: map[string]string{
"multi-container-probing": "Disabled",
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency reasons I think you need another test with name: "multi-container-probing Enabled".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have it here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically what we have for other features is this:

 {
		name:    "kubernetes.podspec-topologyspreadconstraints Allowed",
		wantErr: false,
		wantFeatures: defaultWith(&Features{
			PodSpecTopologySpreadConstraints: Allowed,
		}),
		data: map[string]string{
			"kubernetes.podspec-topologyspreadconstraints": "Allowed",
		},
	}, {
		name:    "kubernetes.podspec-topologyspreadconstraints Enabled",
		wantErr: false,
		wantFeatures: defaultWith(&Features{
			PodSpecTopologySpreadConstraints: Enabled,
		}),
		data: map[string]string{
			"kubernetes.podspec-topologyspreadconstraints": "Enabled",
		},
	}, {
		name:    "kubernetes.podspec-topologyspreadconstraints Disabled",
		wantErr: false,
		wantFeatures: defaultWith(&Features{
			PodSpecTopologySpreadConstraints: Disabled,
		}),
		data: map[string]string{
			"kubernetes.podspec-topologyspreadconstraints": "Disabled",
		

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied the "multi-container" one. It has only two cases. I think because it already tests Enabled in the features Enabled test. But I'm fine with adding that as well.

// Liveness Probes
errs = errs.Also(validateProbe(container.LivenessProbe, port).ViaField("livenessProbe"))
errs = errs.Also(validateProbe(container.LivenessProbe, &port, true).ViaField("livenessProbe"))
Copy link
Contributor

@skonto skonto Feb 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also pass context to validateProbe and define a simple func IsUserContainer(ctx). We use that pattern elsewhere in this file. In this case it is simple to pass the flag but in general it might be better in cases where we have multiple conditions. 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe a bit overkill here for one private only bool param 🤷

@@ -798,7 +807,7 @@ func validateReadinessProbe(p *corev1.Probe, port corev1.ContainerPort) *apis.Fi
return errs
}

func validateProbe(p *corev1.Probe, port corev1.ContainerPort) *apis.FieldError {
func validateProbe(p *corev1.Probe, port *corev1.ContainerPort, isUserContainer bool) *apis.FieldError {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few functions we could use to do some early checking from "k8s.io/apimachinery/pkg/util/validation", we already import that:

	validation.IsValidPortNum()
	validation.IsValidPortName()

validation is the same pkg K8s project uses to do validation. So I am wondering if we want to fail fast and if so which part of the logic here: https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/validation/validation.go#L3057 we want to re-use or copy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better addressed in a separate issue/PR? This one is about adding the same logic to multiple containers. When we use K8s functionality, we change also the existing validation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}},
},
},
},
Copy link
Contributor

@skonto skonto Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add another test case to compare against liveness probes for which we don't do any defaulting:

{
		name: "multiple containers with probes and overrides",
		in: &Revision{
			Spec: RevisionSpec{
				PodSpec: corev1.PodSpec{
					Containers: []corev1.Container{{
						ReadinessProbe: &corev1.Probe{
							ProbeHandler: corev1.ProbeHandler{
								Exec: &corev1.ExecAction{
									Command: []string{"echo", "hi"},
								},
							},
						},
					}, {
						LivenessProbe: &corev1.Probe{
							PeriodSeconds: 10,
							ProbeHandler: corev1.ProbeHandler{
								TCPSocket: &corev1.TCPSocketAction{
									Host: "127.0.0.2",
								},
							},
						},
					}},
				},
			},
		},
		want: &Revision{
			Spec: RevisionSpec{
				TimeoutSeconds:       ptr.Int64(config.DefaultRevisionTimeoutSeconds),
				ContainerConcurrency: ptr.Int64(config.DefaultContainerConcurrency),
				PodSpec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:      config.DefaultUserContainerName + "-0",
						Resources: defaultResources,
						ReadinessProbe: &corev1.Probe{
							SuccessThreshold: 1,
							ProbeHandler: corev1.ProbeHandler{
								Exec: &corev1.ExecAction{
									Command: []string{"echo", "hi"},
								},
							},
						},
					}, {
						Name:      config.DefaultUserContainerName + "-1",
						Resources: defaultResources,
						LivenessProbe: &corev1.Probe{
							SuccessThreshold: 0, // Not touched only for readiness probes
							PeriodSeconds:    10,
							ProbeHandler: corev1.ProbeHandler{
								TCPSocket: &corev1.TCPSocketAction{
									Host: "127.0.0.2",
								},
							},
						},
					}},
				},
			},
		},
	}

Also should we care about proper defaulting to avoid issues reported in: #13204?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also argue to tackle this in a separate issue/PR. I'd rather not change existing behaviour in this PR.

t.Fatalf("Failed to create initial Service: %v: %v", names.Service, err)
}

url := resources.Route.Status.URL.URL()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use something similar to:

	// Validate State after Creation
	if err := validateControlPlane(t, clients, names, "1" /*1 is the expected generation value*/); err != nil {
		t.Error(err)
	}

	if err := validateDataPlane(t, clients, names, test.MultiContainerResponse); err != nil {
		t.Error(err)
	}

for example you can get the url from names.URL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is basically this with multiple probes.

I don't think I get this comment. Can you elaborate a bit more?

Copy link
Contributor

@skonto skonto Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code there testing about the skvc response is already provided elsewhere. You don't need to call pkgTest.CheckEndpointState explicitly. Check the above functions for example you can get the url from names.URL no need to have resources.Route.Status.URL.URL() .

v1test "knative.dev/serving/test/v1"
)

func TestMultiContainerReadiness(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add more scenarios like triggering a container's readiness failure (sidecar failure), assuming we have scale 1, check service status and then get it back to health status.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we can, but we should add the same without the feature enabled. Our current e2e-testing of probes in general is lacking, so we should tackle this in a separate issue as it exceeds the scope of this change (We also need to extend the readiness web-server with endpoints for this and deploy a proxy to do it for non containerPort ports, like what I have on my GH).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--> #14869

@ReToCode
Copy link
Member Author

@dprotaso gentle ping.

@rhuss
Copy link
Contributor

rhuss commented Mar 6, 2024

Thanks @ReToCode ! Looks good to me on a first sight, I'm wondering what still needs to be done to get it merged ? 🤔

Copy link
Member

@dprotaso dprotaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small question and non-blocking nits

Also note mutli-container error reporting is broken. See this related PR/issue #14744

pkg/apis/serving/k8s_validation.go Show resolved Hide resolved
},
},
},
want: apis.ErrDisallowedFields("livenessProbe", "readinessProbe", "readinessProbe.failureThreshold", "readinessProbe.periodSeconds", "readinessProbe.successThreshold", "readinessProbe.timeoutSeconds"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why this list isn't just livenessProbe and readiness probe? The extra fields seem in the error seem noisy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -132,9 +132,10 @@ func (rs *RevisionSpec) applyDefault(ctx context.Context, container *corev1.Cont
// If there are multiple containers then default probes will be applied to the container where user specified PORT
// default probes will not be applied for non serving containers
if len(rs.PodSpec.Containers) == 1 || len(container.Ports) != 0 {
rs.applyProbesWithDefaults(container)
rs.applyGRPCProbeDefaults(container)
rs.applyUserContainerDefaultReadinessProbe(container)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this ServingContainer

@ReToCode
Copy link
Member Author

@dprotaso I started renaming to ServingContainer but it seems the UserContainer thing is used everywhere:

serving git:multi-probe*  
❯ grep -rni "usercontainer" | wc -l
81

I think it might be better to stick to that - or rename all of these things in a separate PR.

@dprotaso
Copy link
Member

sounds good let's do it in a follow up

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Mar 12, 2024
Copy link

knative-prow bot commented Mar 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, ReToCode

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 12, 2024
@knative-prow knative-prow bot merged commit a194cb2 into knative:main Mar 12, 2024
49 checks passed
@ReToCode
Copy link
Member Author

@dprotaso #15007 think that could be done as a GFI.

ReToCode added a commit to ReToCode/serving that referenced this pull request Mar 13, 2024
* Add multi-container probing

* Add e2e test for multi container probing

(cherry picked from commit a194cb2)
openshift-merge-bot bot pushed a commit to openshift-knative/serving that referenced this pull request Mar 13, 2024
* Add multi-container probing (knative#14853)

* Add multi-container probing

* Add e2e test for multi container probing

(cherry picked from commit a194cb2)

* Run generate-release.sh
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/autoscale area/networking area/test-and-release It flags unit/e2e/conformance/perf test issues for product features lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow sidecars to specify probe port
4 participants