OCPBUGS-11888: handle daemonSet pods restart #347

msherif1234 · 2023-05-18T15:04:57Z

When the daemonSet or its daemon pods are deleted manually,
the pods get recreated, but the interface
still has the older XDP program attached to it.

- What this PR does and why is it needed
Fixes an issue where a stale XDP program stays attached to the interface(s) when the daemonset restarts.

- How to verify it

Bring up an OCP cluster.
Configure SCTP and deploy the SCTP client and server pods.

$ oc get pods -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE                                       NOMINATED NODE   READINESS GATES
sctpclient   1/1     Running   0          17s   10.128.2.24   ci-ln-6rzv9ik-72292-ctfxk-worker-c-ls2sf   <none>           <none>
sctpserver   1/1     Running   0          17s   10.129.2.26   ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r   <none>           <none>

Configure INFW and add a rule to drop SCTP traffic.

$ oc get ingressnodefirewalls.ingressnodefirewall.openshift.io ingressnodefirewall-sctp -o yaml
apiVersion: ingressnodefirewall.openshift.io/v1alpha1
kind: IngressNodeFirewall
metadata:
  creationTimestamp: "2023-05-18T16:12:05Z"
  generation: 1
  name: ingressnodefirewall-sctp
  resourceVersion: "43804"
  uid: c6136909-60fe-4d73-ad46-68468399894b
spec:
  ingress:
  - rules:
    - action: Deny
      order: 10
      protocolConfig:
        protocol: SCTP
        sctp:
          ports: 30102-33000
    sourceCIDRs:
    - 10.128.2.24/24
  interfaces:
  - genev_sys_6081
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
status:
  syncStatus: Synchronized

Run ncat traffic and make sure the traffic is blocked and an event is generated.

$ oc rsh sctpserver
sh-4.2# ncat -l 30102 --sctp 

$ oc rsh sctpclient
sh-4.2# nc -v 10.129.2.26 30102 --sctp
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection timed out.

$ oc logs -n openshift-ingress-node-firewall ingress-node-firewall-daemon-lvq5x -c events --follow
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r ruleId 10 action Drop len 82 if genev_sys_6081
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	ipv4 src addr 10.128.2.24 dst addr 10.129.2.26
2023-05-18 16:13:02 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	sctp srcPort 45827 dstPort 30102

Delete the ingress-node-firewall daemon set.

$ oc delete ds -n openshift-ingress-node-firewall ingress-node-firewall-daemon

Run traffic again and make sure it is still blocked and events are generated.

$ oc logs -n openshift-ingress-node-firewall ingress-node-firewall-daemon-wwh6b -c events --follow
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r ruleId 10 action Drop len 82 if genev_sys_6081
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	ipv4 src addr 10.128.2.24 dst addr 10.129.2.26
2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r 	sctp srcPort 36697 dstPort 30102
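
The check in the last step can also be automated instead of eyeballing the log. The following is a minimal sketch that parses an event header line such as the ones above; the field layout is inferred only from these sample lines, and `dropEvent` and `parseEventHeader` are hypothetical names that are not part of the daemon:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// dropEvent holds the fields of an event header line such as
// "... ruleId 10 action Drop len 82 if genev_sys_6081".
type dropEvent struct {
	RuleID int
	Action string
	Len    int
	Iface  string
}

// parseEventHeader scans whitespace-separated fields for the "key value"
// pairs seen in the sample log lines. The layout is an assumption taken
// from those samples, not from the daemon's source code.
func parseEventHeader(line string) (dropEvent, error) {
	var ev dropEvent
	fields := strings.Fields(line)
	seen := 0
	for i := 0; i+1 < len(fields); i++ {
		var err error
		switch fields[i] {
		case "ruleId":
			ev.RuleID, err = strconv.Atoi(fields[i+1])
		case "action":
			ev.Action = fields[i+1]
		case "len":
			ev.Len, err = strconv.Atoi(fields[i+1])
		case "if":
			ev.Iface = fields[i+1]
		default:
			continue
		}
		if err != nil {
			return dropEvent{}, fmt.Errorf("field %q: %w", fields[i], err)
		}
		seen++
		i++ // skip the value that was just consumed
	}
	if seen != 4 {
		return dropEvent{}, fmt.Errorf("not an event header: %q", line)
	}
	return ev, nil
}

func main() {
	line := "2023-05-18 16:14:20 +0000 UTC ci-ln-6rzv9ik-72292-ctfxk-worker-b-n2x9r ruleId 10 action Drop len 82 if genev_sys_6081"
	ev, err := parseEventHeader(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", ev)
}
```

A verification script could then assert that, after the daemon set is recreated, a new event still reports `Action == "Drop"` for rule 10 on `genev_sys_6081`.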

Signed-off-by: msherif1234 <mmahmoud@redhat.com>

openshift-ci-robot · 2023-05-18T15:05:05Z

@msherif1234: This pull request references Jira Issue OCPBUGS-11888, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.14.0) matches configured target version for branch (4.14.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2023-05-18T15:06:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msherif1234

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [msherif1234]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2023-05-18T16:20:51Z

@msherif1234: This pull request references Jira Issue OCPBUGS-11888, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.14.0) matches configured target version for branch (4.14.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

openshift-ci-robot · 2023-05-18T16:26:44Z

@msherif1234: This pull request references Jira Issue OCPBUGS-11888, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.14.0) matches configured target version for branch (4.14.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

msherif1234 · 2023-05-18T16:32:16Z

/retest

martinkennelly · 2023-05-18T16:40:54Z

/assign @martinkennelly

openshift-ci · 2023-05-18T17:55:41Z

@msherif1234: all tests passed!

Full PR test history. Your PR dashboard.

andreaskaris · 2023-05-18T18:21:41Z

controllers/ingressnodefirewallnodestate_controller.go

@@ -66,15 +66,15 @@ func (r *IngressNodeFirewallNodeStateReconciler) Reconcile(ctx context.Context,
 			// Request object not found, could have been deleted after reconcile request.
 			// Owned objects are automatically garbage collected. For additional cleanup logic use finalizers.
 			// Return and don't requeue
-			return r.reconcileResource(ctx, req, nodeState, true)


nit: The code changes in ingressnodefirewallnodestate_controller.go and in ebpfsyncer.go are not related, correct? So perhaps break this into 2 commits?

Yeah, that was just an unused arg.

andreaskaris · 2023-05-18T18:41:58Z

pkg/ebpfsyncer/ebpfsyncer.go

@@ -82,6 +87,15 @@ func (e *ebpfSingleton) SyncInterfaceIngressRules(
 		}()
 	}

+	signal.Notify(sigc, os.Interrupt, syscall.SIGTERM)


Can you test whether you can still SIGTERM the process (if this is needed)?
In a standalone test process, I can't CTRL-C / SIGTERM the process any more after doing this, but it might behave differently with the Operator SDK.

I tested this quickly and kill -9 obviously works still, but the TERM signal doesn't shut down the process any more:

$ cat main.go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

type eStruct struct {
	c interface{}
}

func (e eStruct) resetAll() {
	fmt.Println("Reset: ", e.c)
}

func foo(e eStruct) {
	fmt.Println("foo", e)

	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, os.Interrupt, syscall.SIGTERM)
	go func(c chan os.Signal) {
		// Wait for a SIGTERM
		<-c
		if e.c != nil {
			e.resetAll()
		}
	}(sigc)
}

func main() {
	e := eStruct{
		c: "test",
	}
	foo(e)
	time.Sleep(15 * time.Second)
	fmt.Println("15 seconds are over, normal shutdown")
}

$ ./m & pid=$! ; sleep 2; kill $pid ; wait $pid
[1] 81616
foo {test}
Reset: test
15 seconds are over, normal shutdown
[1]+ Done ./m

When I add os.Exit(0), then the program will shut down upon receiving the signal:

func foo(e eStruct) {
	fmt.Println("foo", e)

	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, os.Interrupt, syscall.SIGTERM)
	go func(c chan os.Signal) {
		// Wait for a SIGTERM
		<-c
		if e.c != nil {
			e.resetAll()
		}
		os.Exit(0)
	}(sigc)
}

[akaris@linux test-signal]$ ./m & pid=$! ; sleep 2; kill $pid ; wait $pid
[1] 82337
foo {test}
Reset: test
[1]+ Done

This is not a suggestion to add os.Exit, it's merely a question from my side

When you delete the ds, or delete the pods with something like
oc delete pod -l=app=ingress-node-firewall-daemon -n openshift-ingress-node-firewall
SIGTERM will be sent and the controller runtime will bring down all processes, so I don't need to exit the process when the signal is detected. I just need to clean up and let the normal shutdown flow continue. Did this answer your question, or did I misunderstand it?

Yeah, my question is about the opposite case: when you run kill $pid, will the application be torn down and will the pod restart? Or, if you test the process standalone, can you still press CTRL-C so that it shuts down?

Will try it out and see if I can do much in a separate PR. Thanks for the review!

andreaskaris · 2023-05-18T18:44:07Z

Also: an E2E test for this perhaps?

msherif1234 · 2023-05-18T19:03:05Z

Also: an E2E test for this perhaps?

Marin has an updated e2e that specifically tests this condition, so I will wait on the e2e until his PR is merged.

andreaskaris · 2023-05-18T19:20:04Z

/lgtm

openshift-ci-robot · 2023-05-18T19:24:49Z

@msherif1234: Jira Issue OCPBUGS-11888: All pull requests linked via external trackers have merged:

openshift/ingress-node-firewall#347

Jira Issue OCPBUGS-11888 has been moved to the MODIFIED state.

OCPBUGS-11888: handle daemonSet pods restart

76bd82f

when delete daemonSet or daemon pods manually pods will get recreated but the interface will have older xdp attached to it Signed-off-by: msherif1234 <mmahmoud@redhat.com>

msherif1234 requested a review from andreaskaris May 18, 2023 15:04

openshift-ci bot requested review from anuragthehatter, dcbw and dougbtv May 18, 2023 15:05

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2023

openshift-ci bot assigned martinkennelly May 18, 2023

andreaskaris reviewed May 18, 2023

openshift-ci bot assigned andreaskaris May 18, 2023

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 18, 2023

openshift-merge-robot merged commit 9e9f369 into openshift:master May 18, 2023

msherif1234 deleted the fix_ds_restart branch May 30, 2023 14:36
