Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-11888: handle daemonSet pods restart #347

Merged
merged 1 commit into from
May 18, 2023

Conversation

msherif1234
Copy link
Contributor

@msherif1234 msherif1234 commented May 18, 2023

when delete daemonSet or daemon pods manually
pods will get recreated but the interface will
have older xdp attached to it

- What this PR does and why is it needed
fix an issue where we have stale XDP attached to interface(s) when daemonset restarts

- How to verify it

  • bring up OCP cluster
  • configure sctp and deploy sctp client and server pods
 oc get pods -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE                                       NOMINATED NODE   READINESS GATES
sctpclient   1/1     Running   0          17s   10.128.2.24   ci-ln-6rzv9ik-72292-ctfxk-worker-c-ls2sf   <none>           <none>
sctpserver   1/1     Running   0          17s   10.129.2.26   ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r   <none>           <none>
  • configure INFW and add rule to drop sctp traffic
$ oc get ingressnodefirewalls.ingressnodefirewall.openshift.io ingressnodefirewall-sctp -o yaml
apiVersion: ingressnodefirewall.openshift.io/v1alpha1
kind: IngressNodeFirewall
metadata:
  creationTimestamp: "2023-05-18T16:12:05Z"
  generation: 1
  name: ingressnodefirewall-sctp
  resourceVersion: "43804"
  uid: c6136909-60fe-4d73-ad46-68468399894b
spec:
  ingress:
  - rules:
    - action: Deny
      order: 10
      protocolConfig:
        protocol: SCTP
        sctp:
          ports: 30102-33000
    sourceCIDRs:
    - 10.128.2.24/24
  interfaces:
  - genev_sys_6081
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
status:
  syncStatus: Synchronized
  • run ncat traffic make sure traffic is blocked and event is generated
$ oc rsh sctpserver
sh-4.2# ncat -l 30102 --sctp 

$ oc rsh sctpclient
sh-4.2# nc -v 10.129.2.26 30102 --sctp
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection timed out.

$ oc logs -n openshift-ingress-node-firewall ingress-node-firewall-daemon-lvq5x -c events --follow
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r ruleId 10 action Drop len 82 if genev_sys_6081
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	ipv4 src addr 10.128.2.24 dst addr 10.129.2.26
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	sctp srcPort 45827 dstPort 30102
  • delete ingress-node-firewall daemon set
oc delete ds -n openshift-ingress-node-firewall  ingress-node-firewall-daemon
  • run traffic again make sure its blocked and events are generated
$ oc logs -n openshift-ingress-node-firewall ingress-node-firewall-daemon-wwh6b -c events --follow
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r ruleId 10 action Drop len 82 if genev_sys_6081
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	ipv4 src addr 10.128.2.24 dst addr 10.129.2.26
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	sctp srcPort 36697 dstPort 30102

when delete daemonSet or daemon pods manually
pods will get recreated but the interface will
have older xdp attached to it

Signed-off-by: msherif1234 <mmahmoud@redhat.com>
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 18, 2023
@openshift-ci-robot
Copy link

@msherif1234: This pull request references Jira Issue OCPBUGS-11888, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

when delete daemonSet or daemon pods manually
pods will get recreated but the interface will
have older xdp attached to it

- What this PR does and why is it needed

- Special notes for reviewers

- How to verify it

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 18, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msherif1234

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2023
@openshift-ci-robot
Copy link

@msherif1234: This pull request references Jira Issue OCPBUGS-11888, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

In response to this:

when delete daemonSet or daemon pods manually
pods will get recreated but the interface will
have older xdp attached to it

- What this PR does and why is it needed

- Special notes for reviewers

- How to verify it

  • bring up OCP cluster
  • configure sctp and deploy sctp client and server pods
oc get pods -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE                                       NOMINATED NODE   READINESS GATES
sctpclient   1/1     Running   0          17s   10.128.2.24   ci-ln-6rzv9ik-72292-ctfxk-worker-c-ls2sf   <none>           <none>
sctpserver   1/1     Running   0          17s   10.129.2.26   ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r   <none>           <none>
  • configure INFW and add rule to drop sctp traffic
$ oc get ingressnodefirewalls.ingressnodefirewall.openshift.io ingressnodefirewall-sctp -o yaml
apiVersion: ingressnodefirewall.openshift.io/v1alpha1
kind: IngressNodeFirewall
metadata:
 creationTimestamp: "2023-05-18T16:12:05Z"
 generation: 1
 name: ingressnodefirewall-sctp
 resourceVersion: "43804"
 uid: c6136909-60fe-4d73-ad46-68468399894b
spec:
 ingress:
 - rules:
   - action: Deny
     order: 10
     protocolConfig:
       protocol: SCTP
       sctp:
         ports: 30102-33000
   sourceCIDRs:
   - 10.128.2.24/24
 interfaces:
 - genev_sys_6081
 nodeSelector:
   matchLabels:
     node-role.kubernetes.io/worker: ""
status:
 syncStatus: Synchronized
  • run ncat traffic make sure traffic is blocked and event is generated
$ oc rsh sctpserver
sh-4.2# ncat -l 30102 --sctp 

$ oc rsh sctpclient
sh-4.2# nc -v 10.129.2.26 30102 --sctp
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection timed out.

$ oc logs -n openshift-ingress-node-firewall ingress-node-firewall-daemon-lvq5x -c events --follow
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r ruleId 10 action Drop len 82 if genev_sys_6081
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	ipv4 src addr 10.128.2.24 dst addr 10.129.2.26
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	sctp srcPort 45827 dstPort 30102
  • delete ingress-node-firewall daemon set
oc delete ds -n openshift-ingress-node-firewall  ingress-node-firewall-daemon
  • run traffic again make sure its blocked and events are generated
$ oc logs -n openshift-ingress-node-firewall ingress-node-firewall-daemon-wwh6b -c events --follow
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r ruleId 10 action Drop len 82 if genev_sys_6081
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	ipv4 src addr 10.128.2.24 dst addr 10.129.2.26
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	sctp srcPort 36697 dstPort 30102

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link

@msherif1234: This pull request references Jira Issue OCPBUGS-11888, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

In response to this:

when delete daemonSet or daemon pods manually
pods will get recreated but the interface will
have older xdp attached to it

- What this PR does and why is it needed
fix an issue where we have stale XDP attached to interface(s) when daemonset restarts

- How to verify it

  • bring up OCP cluster
  • configure sctp and deploy sctp client and server pods
oc get pods -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE                                       NOMINATED NODE   READINESS GATES
sctpclient   1/1     Running   0          17s   10.128.2.24   ci-ln-6rzv9ik-72292-ctfxk-worker-c-ls2sf   <none>           <none>
sctpserver   1/1     Running   0          17s   10.129.2.26   ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r   <none>           <none>
  • configure INFW and add rule to drop sctp traffic
$ oc get ingressnodefirewalls.ingressnodefirewall.openshift.io ingressnodefirewall-sctp -o yaml
apiVersion: ingressnodefirewall.openshift.io/v1alpha1
kind: IngressNodeFirewall
metadata:
 creationTimestamp: "2023-05-18T16:12:05Z"
 generation: 1
 name: ingressnodefirewall-sctp
 resourceVersion: "43804"
 uid: c6136909-60fe-4d73-ad46-68468399894b
spec:
 ingress:
 - rules:
   - action: Deny
     order: 10
     protocolConfig:
       protocol: SCTP
       sctp:
         ports: 30102-33000
   sourceCIDRs:
   - 10.128.2.24/24
 interfaces:
 - genev_sys_6081
 nodeSelector:
   matchLabels:
     node-role.kubernetes.io/worker: ""
status:
 syncStatus: Synchronized
  • run ncat traffic make sure traffic is blocked and event is generated
$ oc rsh sctpserver
sh-4.2# ncat -l 30102 --sctp 

$ oc rsh sctpclient
sh-4.2# nc -v 10.129.2.26 30102 --sctp
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection timed out.

$ oc logs -n openshift-ingress-node-firewall ingress-node-firewall-daemon-lvq5x -c events --follow
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r ruleId 10 action Drop len 82 if genev_sys_6081
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	ipv4 src addr 10.128.2.24 dst addr 10.129.2.26
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	sctp srcPort 45827 dstPort 30102
  • delete ingress-node-firewall daemon set
oc delete ds -n openshift-ingress-node-firewall  ingress-node-firewall-daemon
  • run traffic again make sure its blocked and events are generated
$ oc logs -n openshift-ingress-node-firewall ingress-node-firewall-daemon-wwh6b -c events --follow
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r ruleId 10 action Drop len 82 if genev_sys_6081
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	ipv4 src addr 10.128.2.24 dst addr 10.129.2.26
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	sctp srcPort 36697 dstPort 30102

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msherif1234
Copy link
Contributor Author

/retest

@martinkennelly
Copy link
Contributor

/assign @martinkennelly

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 18, 2023

@msherif1234: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@@ -66,15 +66,15 @@ func (r *IngressNodeFirewallNodeStateReconciler) Reconcile(ctx context.Context,
// Request object not found, could have been deleted after reconcile request.
// Owned objects are automatically garbage collected. For additional cleanup logic use finalizers.
// Return and don't requeue
return r.reconcileResource(ctx, req, nodeState, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The code changes in ingressnofirewallnodestate_controller.go and in ebpfsyncer.go are not related, correct? So perhaps break this into 2 commits?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that was just unused arg

@@ -82,6 +87,15 @@ func (e *ebpfSingleton) SyncInterfaceIngressRules(
}()
}

signal.Notify(sigc, os.Interrupt, syscall.SIGTERM)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test if you can still sig TERM the process (if this is needed)?
In a standalone test process, I can't ctrl-c / sig TERM the process any more after doing this, but it might behave differently with the Operator SDK


I tested this quickly and kill -9 obviously works still, but the TERM signal doesn't shut down the process any more:

$ cat main.go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

type eStruct struct {
	c interface{}
}

func (e eStruct) resetAll() {
	fmt.Println("Reset: ", e.c)
}

func foo(e eStruct) {
	fmt.Println("foo", e)

	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, os.Interrupt, syscall.SIGTERM)
	go func(c chan os.Signal) {
		// Wait for a SIGTERM
		<-c
		if e.c != nil {
			e.resetAll()
		}
	}(sigc)
}

func main() {
	e := eStruct{
		c: "test",
	}
	foo(e)
	time.Sleep(15 * time.Second)
	fmt.Println("15 seconds are over, normal shutdown")
}
$ ./m & pid=$! ; sleep 2; kill $pid ; wait $pid
[1] 81616
foo {test}
Reset:  test
15 seconds are over, normal shutdown
[1]+  Done                    ./m

When I add os.Exit(0), then the program will shut down upon receiving the signal:

 19 func foo(e eStruct) {
   20     fmt.Println("foo", e)
   21 
   22     sigc := make(chan os.Signal, 1)
   23     signal.Notify(sigc, os.Interrupt, syscall.SIGTERM)
   24     go func(c chan os.Signal) {
   25         // Wait for a SIGTERM
   26         <-c
   27         if e.c != nil {
   28             e.resetAll()
   29         }
   30         os.Exit(0)
   31     }(sigc)
   32 }
   33 
[akaris@linux test-signal]$ ./m & pid=$! ; sleep 2; kill $pid ; wait $pid
[1] 82337
foo {test}
Reset:  test
[1]+  Done 

This is not a suggestion to add os.Exit, it's merely a question from my side

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when u delete ds or delete the pods with something like
oc delete pod -l=app=ingress-node-firewall-daemon -n openshift-ingress-node-firewall SIGTERM will be sent and the controller runtime will be bring down all processes so I don't need to exit the process when the signal is detected I just need to cleanup and left the normal bring down flow continue did this answer ur question ? or I missunderstood ur question ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah my question is for the opposite case: when you run kill $pid, will the application be torn down and will the pod restart. Or if you test the process standalone, can you still CTRL-C and it will shut down

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will try it out and see if I can do much in separate PR thanks for the review!!!

@andreaskaris
Copy link
Contributor

Also: an E2E test for this perhaps?

@msherif1234
Copy link
Contributor Author

msherif1234 commented May 18, 2023

Also: an E2E test for this perhaps?

Marin has an updated e2e that specially test this condition so will wait on e2e till his PR merged

@andreaskaris
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 18, 2023
@openshift-merge-robot openshift-merge-robot merged commit 9e9f369 into openshift:master May 18, 2023
@openshift-ci-robot
Copy link

@msherif1234: Jira Issue OCPBUGS-11888: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-11888 has been moved to the MODIFIED state.

In response to this:

when delete daemonSet or daemon pods manually
pods will get recreated but the interface will
have older xdp attached to it

- What this PR does and why is it needed
fix an issue where we have stale XDP attached to interface(s) when daemonset restarts

- How to verify it

  • bring up OCP cluster
  • configure sctp and deploy sctp client and server pods
oc get pods -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE                                       NOMINATED NODE   READINESS GATES
sctpclient   1/1     Running   0          17s   10.128.2.24   ci-ln-6rzv9ik-72292-ctfxk-worker-c-ls2sf   <none>           <none>
sctpserver   1/1     Running   0          17s   10.129.2.26   ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r   <none>           <none>
  • configure INFW and add rule to drop sctp traffic
$ oc get ingressnodefirewalls.ingressnodefirewall.openshift.io ingressnodefirewall-sctp -o yaml
apiVersion: ingressnodefirewall.openshift.io/v1alpha1
kind: IngressNodeFirewall
metadata:
 creationTimestamp: "2023-05-18T16:12:05Z"
 generation: 1
 name: ingressnodefirewall-sctp
 resourceVersion: "43804"
 uid: c6136909-60fe-4d73-ad46-68468399894b
spec:
 ingress:
 - rules:
   - action: Deny
     order: 10
     protocolConfig:
       protocol: SCTP
       sctp:
         ports: 30102-33000
   sourceCIDRs:
   - 10.128.2.24/24
 interfaces:
 - genev_sys_6081
 nodeSelector:
   matchLabels:
     node-role.kubernetes.io/worker: ""
status:
 syncStatus: Synchronized
  • run ncat traffic make sure traffic is blocked and event is generated
$ oc rsh sctpserver
sh-4.2# ncat -l 30102 --sctp 

$ oc rsh sctpclient
sh-4.2# nc -v 10.129.2.26 30102 --sctp
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection timed out.

$ oc logs -n openshift-ingress-node-firewall ingress-node-firewall-daemon-lvq5x -c events --follow
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r ruleId 10 action Drop len 82 if genev_sys_6081
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	ipv4 src addr 10.128.2.24 dst addr 10.129.2.26
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	sctp srcPort 45827 dstPort 30102
  • delete ingress-node-firewall daemon set
oc delete ds -n openshift-ingress-node-firewall  ingress-node-firewall-daemon
  • run traffic again make sure its blocked and events are generated
$ oc logs -n openshift-ingress-node-firewall ingress-node-firewall-daemon-wwh6b -c events --follow
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r ruleId 10 action Drop len 82 if genev_sys_6081
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	ipv4 src addr 10.128.2.24 dst addr 10.129.2.26
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	sctp srcPort 36697 dstPort 30102

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msherif1234 msherif1234 deleted the fix_ds_restart branch May 30, 2023 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants