
Add ability to set gateway container's lifecycle hooks #48956

Closed
wants to merge 1 commit

Conversation

@den-is den-is commented Jan 24, 2024

Please provide a description of this PR:

Fulfills #47265 #47779 kubernetes-sigs/aws-load-balancer-controller#2131
In my company, we do not have any kustomize workflows or the ability to add it to our CI/CD.
The addition of the ability to set container lifecycle hooks does not interfere with any existing setups.

Tests
Test values yaml

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh","-c","sleep 300"]

Render template

helm template \
./istio/manifests/charts/gateway \
--set name=istio-ingressgateway \
-f values-gw-test.yaml \
-s templates/deployment.yaml \
--dry-run --debug

Output with the preStop lifecycle hook enabled:

# Source: gateway/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-ingressgateway
  namespace: kube-system
  labels:
    helm.sh/chart: gateway-1.0.0
    app: istio-ingressgateway
    istio: ingressgateway
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: istio-ingressgateway
  annotations:
    {}
spec:
  selector:
    matchLabels:
      app: istio-ingressgateway
      istio: ingressgateway
  template:
    metadata:
      annotations:
        inject.istio.io/templates: gateway
        prometheus.io/path: /stats/prometheus
        prometheus.io/port: "15020"
        prometheus.io/scrape: "true"
        sidecar.istio.io/inject: "true"
      labels:
        sidecar.istio.io/inject: "true"
        app: istio-ingressgateway
        istio: ingressgateway
    spec:
      serviceAccountName: istio-ingressgateway
      securityContext:
        # Safe since 1.22: https://github.com/kubernetes/kubernetes/pull/103326
        sysctls:
        - name: net.ipv4.ip_unprivileged_port_start
          value: "0"
      containers:
        - name: istio-proxy
          # "auto" will be populated at runtime by the mutating webhook. See https://istio.io/latest/docs/setup/additional-setup/sidecar-injection/#customizing-injection
          image: auto
          securityContext:
            # Safe since 1.22: https://github.com/kubernetes/kubernetes/pull/103326
            capabilities:
              drop:
              - ALL
            allowPrivilegeEscalation: false
            privileged: false
            readOnlyRootFilesystem: true
            runAsUser: 1337
            runAsGroup: 1337
            runAsNonRoot: true
          env:
          lifecycle: 
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - sleep 300
          ports:
          - containerPort: 15090
            protocol: TCP
            name: http-envoy-prom
          resources:
            limits:
              cpu: 2000m
              memory: 1024Mi
            requests:
              cpu: 100m
              memory: 128Mi
      terminationGracePeriodSeconds: 30

Output with an empty lifecycle: {} value:

# Source: gateway/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-ingressgateway
  namespace: kube-system
  labels:
    helm.sh/chart: gateway-1.0.0
    app: istio-ingressgateway
    istio: ingressgateway
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: istio-ingressgateway
  annotations:
    {}
spec:
  selector:
    matchLabels:
      app: istio-ingressgateway
      istio: ingressgateway
  template:
    metadata:
      annotations:
        inject.istio.io/templates: gateway
        prometheus.io/path: /stats/prometheus
        prometheus.io/port: "15020"
        prometheus.io/scrape: "true"
        sidecar.istio.io/inject: "true"
      labels:
        sidecar.istio.io/inject: "true"
        app: istio-ingressgateway
        istio: ingressgateway
    spec:
      serviceAccountName: istio-ingressgateway
      securityContext:
        # Safe since 1.22: https://github.com/kubernetes/kubernetes/pull/103326
        sysctls:
        - name: net.ipv4.ip_unprivileged_port_start
          value: "0"
      containers:
        - name: istio-proxy
          # "auto" will be populated at runtime by the mutating webhook. See https://istio.io/latest/docs/setup/additional-setup/sidecar-injection/#customizing-injection
          image: auto
          securityContext:
            # Safe since 1.22: https://github.com/kubernetes/kubernetes/pull/103326
            capabilities:
              drop:
              - ALL
            allowPrivilegeEscalation: false
            privileged: false
            readOnlyRootFilesystem: true
            runAsUser: 1337
            runAsGroup: 1337
            runAsNonRoot: true
          env:
          ports:
          - containerPort: 15090
            protocol: TCP
            name: http-envoy-prom
          resources:
            limits:
              cpu: 2000m
              memory: 1024Mi
            requests:
              cpu: 100m
              memory: 128Mi
      terminationGracePeriodSeconds: 30

Signed-off-by: Denis Iskandarov <d.iskandarov@gmail.com>
@istio-policy-bot istio-policy-bot added the area/environments and release-notes-none (Indicates a PR that does not require release notes.) labels Jan 24, 2024
@istio-policy-bot

😊 Welcome @den-is! This is either your first contribution to the istio/istio repo, or it's been
a while since you've been here.

You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

@istio-testing istio-testing added the size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.) and needs-ok-to-test labels Jan 24, 2024
@istio-testing
Collaborator

Hi @den-is. Thanks for your PR.

I'm waiting for an Istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ymesika
Member

ymesika commented Jan 24, 2024

/ok-to-test

You will probably want to add a release note for this.

@istio-testing istio-testing added the ok-to-test label (Set this label to allow normal testing to take place for a PR not submitted by an Istio org member.) and removed the needs-ok-to-test label Jan 24, 2024
@den-pluto

@ymesika
I can't find anything about it in the contribution notes.
Can you show me an example of how to add release notes?

Member

@howardjohn howardjohn left a comment


See my comments in #47779; this is not the right approach. If the current way doesn't meet use cases, we should enhance it.

@den-pluto

den-pluto commented Jan 24, 2024

@howardjohn

See my comments in #47779; this is not the right approach. If the current way doesn't meet use cases, we should enhance it.

I'm not going to argue.
"I want this feature."
This is a very native Kubernetes container feature, which could be present in the default template in the official chart.

It is up to me whether to enable it or not, and it is my responsibility, the same as if I inserted it using kustomize, or cloned your chart and hosted an altered version myself.

And yes, I'm using terminationGracePeriodSeconds + terminationDrainDuration to solve the issue.

Also, my PR does not set any lifecycle by default; it stays blank/unset.
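
For illustration, that workaround could look roughly like this in a values file. This is only a sketch: the terminationGracePeriodSeconds key for the gateway chart is an assumption, while the proxy.istio.io/config annotation is the standard per-pod proxy config override.

# Sketch: longer pod grace period, with the proxy draining for most of it.
terminationGracePeriodSeconds: 60
podAnnotations:
  proxy.istio.io/config: |
    terminationDrainDuration: 50s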

@howardjohn
Member

Thanks, I understand your point of view, but the project's position (after much discussion) is to keep the Helm charts opinionated.

@linsun
Member

linsun commented Jan 25, 2024

Adding a few @istio/wg-environments-maintainers for input.

@den-is is the requirement/pain point to not have the gateway ready too fast, so you add a pre-stop hook? In the past, we have tended to approve helm values PRs when multiple people/vendors ask for them.

@akamac

akamac commented Feb 13, 2024

I used to have a preStop hook configured for Nginx Ingress + AWS ALB for zero-downtime ingress updates. On receiving SIGTERM, I make the health check fail for the ALB while Nginx keeps accepting connections. Only once the target is deregistered in AWS can the pod be safely terminated.
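
Roughly, that pattern looks like the sketch below. The /healthz-disabled marker file and the timings are illustrative assumptions, not the actual config; the idea is just to fail the LB health check first, then wait out deregistration before the container exits.

# Sketch only: fail the health check, then wait for the LB to deregister.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "touch /tmp/healthz-disabled && sleep 120"]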

@istio-policy-bot istio-policy-bot added the lifecycle/stale (Indicates a PR or issue hasn't been manipulated by an Istio team member for a while) label Feb 24, 2024
@istio-policy-bot istio-policy-bot added the lifecycle/automatically-closed (Indicates a PR or issue that has been closed automatically.) label Mar 10, 2024
@costinm costinm reopened this Mar 10, 2024
@istio-policy-bot istio-policy-bot removed the lifecycle/stale (Indicates a PR or issue hasn't been manipulated by an Istio team member for a while) label Mar 10, 2024
@costinm
Contributor

costinm commented Mar 10, 2024

I think the actual feature, the ability to configure drain duration, is a pretty important one.
I agree with John that adding the lifecycle hook is not the right approach, but I think adding an option (or an annotation on the gateway) to indicate the desired drain duration/grace period as a first-class setting is not bad.

The implementation should also take into account the new way of creating Gateways, managed by Istiod, which is what other non-Istio gateways are using; we should not add features that only work on the helm self-managed install.

@istio-testing
Collaborator

@den-is: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: integ-telemetry_istio
Commit: 8c6248c
Required: true
Rerun command: /test integ-telemetry

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@istio-policy-bot istio-policy-bot added the lifecycle/stale (Indicates a PR or issue hasn't been manipulated by an Istio team member for a while) label Apr 10, 2024
@istio-policy-bot

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2024-03-10. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

@clayvan

clayvan commented May 1, 2024

Please re-open and let this PR through.

Thanks, I understand your point of view, but the project's position (after much discussion) is to keep the Helm charts opinionated.

Yet, the helm chart allows configuration of several other fields such as tolerations, topologySpreadConstraints, affinity, etc.

we tend to approve helm values PRs when there are multiple people/vendors asking for it.

There are tons of people in #47779 and #47265 asking for this. Having lifecycle hook support is critical for any ingress controller Helm chart; they all have it except the Istio gateway.

See my comments in #47779; this is not the right approach. If the current way doesn't meet use cases, we should enhance it.

This issue is still open because people can't agree that the current way does not work. I would love it if there were native Istio configuration to prevent this, but from my testing I don't believe there is. Whereas this lifecycle hook 100% works and is the simplest way for any ingress controller to avoid downtime.

is the requirement/pain point to not have gateway ready too fast so you add pre-stop hook? In the past

@linsun , no this preStop helps avoid downtime during gateway terminations. Please see #47265 (comment)

This PR will help your end users avoid downtime during rolling restarts. @howardjohn please reconsider the project's position on this PR, as it has no downsides and only major upsides.

@howardjohn
Member

@clayvan I can assure you there is no PR with no downsides 🙂.

This PR or similar would make a lot more sense if someone could explain why terminationDrainDuration is not an acceptable solution; the only explanation I have seen is #47779 (comment), which in my understanding is not technically accurate.

FWIW comments may be more visible on #47779 (open issue vs closed PR)

@ryanmac8

ryanmac8 commented May 1, 2024

@howardjohn terminationDrainDuration isn't a viable solution because we need a preStop hook so that we have a chance to inform the load balancer to remove the node from service and mark it as unhealthy before Envoy receives a SIGTERM. With just terminationDrainDuration and no preStop hook, the load balancer isn't made aware that Envoy is no longer receiving active connections, which causes client-side errors.

@howardjohn
Member

@ryanmac8 can you help walk me through why a preStop informs the LB but terminationDrainDuration does not?

When Envoy gets a SIGTERM it does not stop serving traffic. It immediately marks itself (or really, k8s does it for us) as NotReady, which should stop the LB from sending it traffic. It also starts telling the LB to go away directly, with connection: close headers and GOAWAY messages.

I would expect preStop to be worse because it doesn't send connection: close or GOAWAY.

@ryanmac8

ryanmac8 commented May 1, 2024

@howardjohn terminationDrainDuration and SIGTERM cause new connections to close. This means that connections are now being denied. We are trying to avoid denying connections because that causes downtime. A preStop hook can resolve this because when the preStop is executed, we are telling the container to just wait before the SIGTERM is delivered. We need the additional time so that a LB can notice the health checks failing and mark the node as unhealthy. New connections are then routed away from the node that's terminating, and we don't experience any connections getting a connection-close message. It's all about ensuring uptime and making sure new connections are routed properly.
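
As a sketch of that sequence (the durations are illustrative assumptions; the one firm rule is that in Kubernetes the preStop hook runs before SIGTERM is sent, and terminationGracePeriodSeconds covers both the hook and the post-SIGTERM shutdown, so it must exceed the sleep):

# Sketch: LB health checks fail and the target deregisters during the sleep,
# then Envoy still gets time to drain after SIGTERM.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 30"]
terminationGracePeriodSeconds: 45   # 30s preStop + 15s post-SIGTERM drain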

@howardjohn
Member

terminationDrainDuration and SIGTERM causes new connections to close. This means that connections now are being denied.

Can you provide more details? That is neither how it was designed nor how it works in our testing. If it is, it is a bug.
