
WIP: Test custom ovn images #1288 (Closed)

tssurya wants to merge 1 commit
Conversation

tssurya commented Jan 21, 2022

No description provided.

tssurya commented Jan 21, 2022

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 21, 2022

openshift-ci bot commented Jan 21, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tssurya
To complete the pull request process, please assign danwinship after the PR has been reviewed.
You can assign the PR to them by writing /assign @danwinship in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tssurya commented Jan 21, 2022

{"component":"entrypoint","error":"wrapped process failed: exit status 124","file":"prow/entrypoint/run.go:80","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2022-01-21T16:41:35Z"}

E0121 16:07:47.013389 1 pods.go:529] unable to parse node L3 gw annotation: k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-173-96.us-west-1.compute.internal"

I0121 16:09:16.577441 5366 kube.go:98] Setting annotations map[k8s.ovn.org/l3-gateway-config:{"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-200-88.us-west-1.compute.internal","mac-address":"06:11:53:c0:ee:67","ip-addresses":["10.0.200.88/18"],"ip-address":"10.0.200.88/18","next-hops":["10.0.192.1"],"next-hop":"10.0.192.1","node-port-enable":"true","vlan-id":"0"}} k8s.ovn.org/node-chassis-id:c2f9a7b1-db99-4f21-8024-b42e2bb3611c k8s.ovn.org/node-mgmt-port-mac-address:6a:91:92:a0:92:68 k8s.ovn.org/node-primary-ifaddr:{"ipv4":"10.0.200.88/18"}] on node ip-10-0-200-88.us-west-1.compute.internal

I0121 16:09:18.750376 5851 kube.go:98] Setting annotations map[k8s.ovn.org/l3-gateway-config:{"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-175-43.us-west-1.compute.internal","mac-address":"02:95:6c:f0:54:03","ip-addresses":["10.0.175.43/18"],"ip-address":"10.0.175.43/18","next-hops":["10.0.128.1"],"next-hop":"10.0.128.1","node-port-enable":"true","vlan-id":"0"}} k8s.ovn.org/node-chassis-id:667d58ab-9650-4ba2-86ed-f89eddf8ee52 k8s.ovn.org/node-mgmt-port-mac-address:3e:80:a6:3d:44:af k8s.ovn.org/node-primary-ifaddr:{"ipv4":"10.0.175.43/18"}] on node ip-10-0-175-43.us-west-1.compute.internal

I0121 16:09:19.354018 3603 kube.go:98] Setting annotations map[k8s.ovn.org/l3-gateway-config:{"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-208-163.us-west-1.compute.internal","mac-address":"06:5a:b9:f4:d3:bb","ip-addresses":["10.0.208.163/18"],"ip-address":"10.0.208.163/18","next-hops":["10.0.192.1"],"next-hop":"10.0.192.1","node-port-enable":"true","vlan-id":"0"}} k8s.ovn.org/node-chassis-id:6fa424ab-d3ae-4b99-bf4e-be82f856ab41 k8s.ovn.org/node-mgmt-port-mac-address:96:86:fb:47:6d:fa k8s.ovn.org/node-primary-ifaddr:{"ipv4":"10.0.208.163/18"}] on node ip-10-0-208-163.us-west-1.compute.internal

I0121 16:09:19.030080 3727 kube.go:98] Setting annotations map[k8s.ovn.org/l3-gateway-config:{"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-144-189.us-west-1.compute.internal","mac-address":"02:4e:49:e5:fd:33","ip-addresses":["10.0.144.189/18"],"ip-address":"10.0.144.189/18","next-hops":["10.0.128.1"],"next-hop":"10.0.128.1","node-port-enable":"true","vlan-id":"0"}} k8s.ovn.org/node-chassis-id:1dd23f35-7eb9-4f77-b2a4-4557527a4900 k8s.ovn.org/node-mgmt-port-mac-address:fa:bb:1f:87:56:23 k8s.ovn.org/node-primary-ifaddr:{"ipv4":"10.0.144.189/18"}] on node ip-10-0-144-189.us-west-1.compute.internal

I0121 16:09:16.917455 6988 kube.go:98] Setting annotations map[k8s.ovn.org/l3-gateway-config:{"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-173-96.us-west-1.compute.internal","mac-address":"02:10:61:cf:51:6d","ip-addresses":["10.0.173.96/18"],"ip-address":"10.0.173.96/18","next-hops":["10.0.128.1"],"next-hop":"10.0.128.1","node-port-enable":"true","vlan-id":"0"}} k8s.ovn.org/node-chassis-id:f9c494ac-93d0-4ba5-af11-b74a8a2b93fe k8s.ovn.org/node-mgmt-port-mac-address:52:a8:20:b4:c9:20 k8s.ovn.org/node-primary-ifaddr:{"ipv4":"10.0.173.96/18"}] on node ip-10-0-173-96.us-west-1.compute.internal

I0121 16:09:16.005180 4116 kube.go:98] Setting annotations map[k8s.ovn.org/l3-gateway-config:{"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-184-51.us-west-1.compute.internal","mac-address":"02:c9:f1:8b:ed:c9","ip-addresses":["10.0.184.51/18"],"ip-address":"10.0.184.51/18","next-hops":["10.0.128.1"],"next-hop":"10.0.128.1","node-port-enable":"true","vlan-id":"0"}} k8s.ovn.org/node-chassis-id:f9ac92ef-6851-496c-8279-95221a084028 k8s.ovn.org/node-mgmt-port-mac-address:3e:a3:01:37:75:27 k8s.ovn.org/node-primary-ifaddr:{"ipv4":"10.0.184.51/18"}] on node ip-10-0-184-51.us-west-1.compute.internal

We should see the SNAT adds happening after this step, once the l3-gateway-config annotation is in place.
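
For reference, here's a minimal sketch of what parsing that annotation amounts to. The struct fields are taken from the log lines above, but the type and function names are made up for illustration and are not the actual ovn-kubernetes helpers:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative only: a minimal shape for the per-gateway entries carried in the
// k8s.ovn.org/l3-gateway-config annotation, using the fields visible in the
// logs above. The real ovn-kubernetes types differ.
type l3GatewayConfig struct {
	Mode           string   `json:"mode"`
	InterfaceID    string   `json:"interface-id"`
	MACAddress     string   `json:"mac-address"`
	IPAddresses    []string `json:"ip-addresses"`
	NextHops       []string `json:"next-hops"`
	NodePortEnable string   `json:"node-port-enable"`
	VLANID         string   `json:"vlan-id"`
}

const l3GatewayConfigAnnotation = "k8s.ovn.org/l3-gateway-config"

// parseL3GatewayAnnotation returns the "default" gateway config from a node's
// annotations. When ovnkube-node has not written the annotation yet, it fails
// the same way as the "unable to parse node L3 gw annotation" error above.
func parseL3GatewayAnnotation(nodeName string, annotations map[string]string) (*l3GatewayConfig, error) {
	raw, ok := annotations[l3GatewayConfigAnnotation]
	if !ok {
		return nil, fmt.Errorf("%s annotation not found for node %q", l3GatewayConfigAnnotation, nodeName)
	}
	perGateway := map[string]l3GatewayConfig{}
	if err := json.Unmarshal([]byte(raw), &perGateway); err != nil {
		return nil, fmt.Errorf("failed to unmarshal %s for node %q: %v", l3GatewayConfigAnnotation, nodeName, err)
	}
	cfg, ok := perGateway["default"]
	if !ok {
		return nil, fmt.Errorf("no default gateway config in %s for node %q", l3GatewayConfigAnnotation, nodeName)
	}
	return &cfg, nil
}

func main() {
	// Annotation missing: this is the state the pods above hit.
	if _, err := parseL3GatewayAnnotation("ip-10-0-173-96.us-west-1.compute.internal", map[string]string{}); err != nil {
		fmt.Println(err)
	}
}
```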

tssurya commented Jan 21, 2022

[surya@hidden-temple Downloads]$ omg get pods -n openshift-ingress -owide
NAME                            READY  STATUS   RESTARTS  AGE    IP           NODE
router-default-55fcb47b5-46pmb  0/1    Running  14        1h10m  10.129.0.13  ip-10-0-208-163.us-west-1.compute.internal
router-default-55fcb47b5-rnshr  0/1    Running  14        1h12m  10.128.2.5   ip-10-0-184-51.us-west-1.compute.internal

I don't see any SNATs being added for pod 10.129.0.13 :)

tssurya commented Jan 21, 2022

The way I see it right now:

1 - The L3 gateway annotation is nil.
2 - Pod creation happens without us creating the SNAT.
3 - The L3 gateway annotation is added; this triggers the annotation-change handling.
4 - We should add the pod SNATs there - currently this isn't happening (see the sketch below).
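
A rough sketch of what step 3 -> 4 should look like. The types and helper names here (node, pod, addPodSNAT, onNodeUpdate) are hypothetical and only illustrate the intended flow, not the actual ovn-kubernetes code:

```go
package main

import "fmt"

const l3GatewayConfigAnnotation = "k8s.ovn.org/l3-gateway-config"

// Hypothetical, simplified types for illustration only; the real handler works
// on *corev1.Node objects inside ovn-kubernetes.
type node struct {
	Name        string
	Annotations map[string]string
}

type pod struct{ Namespace, Name string }

// addPodSNAT stands in for the NB transaction that inserts the SNAT row for a
// pod (like the {Op:insert Table:NAT ...} transact logged further down).
func addPodSNAT(n node, p pod) error {
	fmt.Printf("adding SNAT for %s/%s on node %s\n", p.Namespace, p.Name, n.Name)
	return nil
}

// onNodeUpdate is step 3 -> 4: once the gateway annotation appears, re-add the
// SNATs for pods that were created while the annotation was still missing.
func onNodeUpdate(oldNode, newNode node, podsOnNode []pod) error {
	oldCfg := oldNode.Annotations[l3GatewayConfigAnnotation]
	newCfg := newNode.Annotations[l3GatewayConfigAnnotation]
	if newCfg == "" || oldCfg == newCfg {
		return nil // annotation still missing, or nothing changed
	}
	for _, p := range podsOnNode {
		if err := addPodSNAT(newNode, p); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	before := node{Name: "worker-0", Annotations: map[string]string{}}
	after := node{Name: "worker-0", Annotations: map[string]string{
		l3GatewayConfigAnnotation: `{"default":{"mode":"shared"}}`,
	}}
	pods := []pod{{Namespace: "openshift-ingress", Name: "router-default-xyz"}}
	if err := onNodeUpdate(before, after, pods); err != nil {
		fmt.Println("error:", err)
	}
}
```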

tssurya commented Jan 21, 2022

I0121 16:09:16.590644 1 transact.go:41] Configuring OVN: [{Op:update Table:Logical_Router Row:map[external_ids:{GoMap:map[physical_ip:10.0.184.51 physical_ips:10.0.184.51]} load_balancer_group:{GoSet:[{GoUUID:4241465e-03f3-46b4-9379-f2e39228018d}]} options:{GoMap:map[always_learn_from_arp_request:false chassis:f9ac92ef-6851-496c-8279-95221a084028 dynamic_neigh_routers:true lb_force_snat_ip:router_ip snat-ct-zone:0]}] Rows:[] Columns:[] Mutations:[] Timeout:0 Where:[where column _uuid == {c9e1eec8-b0b1-4b0a-b98a-d4d36be052a5}] Until: Durable: Comment: Lock: UUIDName:}]

tssurya commented Jan 21, 2022

/test e2e-network-migration

tssurya commented Jan 21, 2022

Actually, https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1288/pull-ci-openshift-cluster-network-operator-master-e2e-network-migration/1484625495885615104 fixed the error we were facing before. This time the run failed because the apiserver didn't come up:

[surya@hidden-temple Downloads]$ omg get pods -n openshift-apiserver -owide
NAME                        READY  STATUS   RESTARTS  AGE    IP           NODE
apiserver-865c74cbcd-2c2wt  2/2    Running  1         1h5m   10.129.2.14  ip-10-0-136-71.us-west-2.compute.internal
apiserver-865c74cbcd-cb7p8  2/2    Running  3         55m    10.131.0.17  ip-10-0-137-79.us-west-2.compute.internal
apiserver-865c74cbcd-rfcts  0/2    Running  1         1h10m               ip-10-0-237-148.us-west-2.compute.internal

  containerStatuses:
  - image: registry.build02.ci.openshift.org/ci-op-d73p5nct/stable@sha256:7423f693623170ea9f3d3b653fff92b43c022ea607c97417f57c433d26d49dc2
    imageID: ''
    lastState:
      terminated:
        exitCode: 137
        finishedAt: null
        message: The container could not be located when the pod was deleted.  The
          container used to be Running
        reason: ContainerStatusUnknown
        startedAt: null
    name: openshift-apiserver
    ready: false
    restartCount: 1
    started: false
    state:
      waiting:
        reason: PodInitializing

tssurya commented Jan 22, 2022

/test e2e-aws-ovn-windows
/test e2e-gcp
/test e2e-metal-ipi-ovn-ipv6

tssurya commented Jan 22, 2022

I0121 21:37:03.826560 1 pods.go:332] [openshift-apiserver/apiserver-865c74cbcd-rfcts] creating logical port for pod on switch ip-10-0-237-148.us-west-2.compute.internal

E0121 21:37:03.826791 1 ovn.go:646] unable to parse node L3 gw annotation: k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-237-148.us-west-2.compute.internal"

I0121 21:37:03.826622 1 pods.go:322] [openshift-apiserver/apiserver-865c74cbcd-rfcts] addLogicalPort took 82.634µs

I0121 21:37:03.826961 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-apiserver", Name:"apiserver-865c74cbcd-rfcts", UID:"2a18bedb-9ac3-4796-8d2f-acd01996f251", APIVersion:"v1", ResourceVersion:"46393", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' unable to parse node L3 gw annotation: k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-237-148.us-west-2.compute.internal"

======we start adding NATs from:
[surya@hidden-temple scripts]$ omg get pods -owide -A | grep 10.128.2.20
openshift-service-ca service-ca-f8b67997d-wb9h8 1/1 Running 1 1h6m 10.128.2.20 ip-10-0-237-148.us-west-2.compute.internal

I0121 21:37:58.341727 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-service-ca", Name:"service-ca-f8b67997d-wb9h8", UID:"440f66b3-0f51-41e2-b1b4-93961bd46b67", APIVersion:"v1", ResourceVersion:"46394", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' unable to parse node L3 gw annotation: k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-237-148.us-west-2.compute.internal"

I0121 21:37:58.341860 1 transact.go:41] Configuring OVN: [{Op:insert Table:NAT Row:map[external_ip:10.0.137.79 logical_ip:10.131.0.11 options:{GoMap:map[stateless:false]} type:snat] Rows:[] Columns:[] Mutations:[] Timeout:0 Where:[] Until: Durable: Comment: Lock: UUIDName:u2596996928} {Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] Mutations:[{Column:nat Mutator:insert Value:{GoSet:[{GoUUID:u2596996928}]}}] Timeout:0 Where:[where column _uuid == {36052781-b665-499a-8da1-fc2eb4d822c4}] Until: Durable: Comment: Lock: UUIDName:}]

============
However, I don't see a retry for the other pods that are stuck in the retry loop due to the error.

tssurya commented Jan 22, 2022

Conclusion: Some pods are not getting added to the retry queue properly.

I0121 21:37:03.827806 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-network-diagnostics", Name:"network-check-target-mkdws", UID:"47b68e1a-e9e2-4e90-aa91-0622a3b258db", APIVersion:"v1", ResourceVersion:"46460", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' unable to parse node L3 gw annotation: k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-237-148.us-west-2.compute.internal"

[surya@hidden-temple scripts]$ omg get pods -n openshift-network-diagnostics -owide
NAME                                   READY  STATUS   RESTARTS  AGE    IP           NODE
network-check-source-7f9c6888fc-v4kd7  1/1    Running  1         1h7m   10.129.0.3   ip-10-0-176-223.us-west-2.compute.internal
network-check-target-4vfxw             1/1    Running  2         1h29m  10.131.0.13  ip-10-0-137-79.us-west-2.compute.internal
network-check-target-6jvkj             1/1    Running  2         1h24m  10.129.0.5   ip-10-0-176-223.us-west-2.compute.internal
network-check-target-dh599             1/1    Running  2         1h23m  10.130.0.12  ip-10-0-172-188.us-west-2.compute.internal
network-check-target-m6h5k             1/1    Running  2         1h29m  10.129.2.10  ip-10-0-136-71.us-west-2.compute.internal
network-check-target-mkdws             0/1    Running  2         1h29m               ip-10-0-237-148.us-west-2.compute.internal
network-check-target-vqtvk             1/1    Running  2         1h23m  10.128.0.6   ip-10-0-205-200.us-west-2.compute.internal

Those that do get added to the queue get scheduled correctly.

@tssurya tssurya force-pushed the surya branch 2 times, most recently from 4a3a998 to 1640306 Compare January 22, 2022 16:44

tssurya commented Jan 22, 2022

It seems like there is something very wrong with the retryPods map.
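
For context, this is roughly the pattern under suspicion - a minimal, hypothetical sketch of a retryPods-style map (my own names and structure, not the actual ovn-kubernetes implementation). The point is that every failed pod add has to land in the map, otherwise the periodic retry pass never sees it:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// retryEntry records a pod whose logical-port/SNAT setup failed.
type retryEntry struct {
	podKey    string    // namespace/name
	timestamp time.Time // when the failure was recorded
}

// retryPods is a hypothetical stand-in for the map being debugged here.
type retryPods struct {
	sync.Mutex
	entries map[string]retryEntry
}

func newRetryPods() *retryPods {
	return &retryPods{entries: make(map[string]retryEntry)}
}

// addRetry must be called on every failed pod add; if a code path returns an
// error without recording the pod here, it is never retried - which matches
// the symptom above where some pods stay stuck after the annotation appears.
func (r *retryPods) addRetry(podKey string) {
	r.Lock()
	defer r.Unlock()
	r.entries[podKey] = retryEntry{podKey: podKey, timestamp: time.Now()}
}

// iterateRetries re-attempts every recorded pod and drops the ones that succeed.
func (r *retryPods) iterateRetries(addLogicalPort func(string) error) {
	r.Lock()
	defer r.Unlock()
	for key := range r.entries {
		if err := addLogicalPort(key); err != nil {
			fmt.Printf("retry for %s failed again: %v\n", key, err)
			continue
		}
		delete(r.entries, key)
	}
}

func main() {
	r := newRetryPods()
	r.addRetry("openshift-apiserver/apiserver-865c74cbcd-rfcts")
	r.iterateRetries(func(key string) error {
		fmt.Println("re-adding logical port for", key)
		return nil
	})
}
```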

@tssurya tssurya changed the title WIP:blah WIP: Test disable-SNATs Jan 22, 2022
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 22, 2022

tssurya commented Jan 22, 2022

/test e2e-network-migration
/test e2e-aws-sdn-multi
/test e2e-metal-ipi-ovn-ipv6
/test e2e-ovn-step-registry
/test e2e-ovn-ipsec-step-registry

tssurya commented Jan 23, 2022

/retest

tssurya commented Feb 1, 2022

Going to re-use this PR to test ovn-org/ovn-kubernetes#2787. Let's see if add/del retries create problems for us.

@danwinship danwinship removed their request for review April 5, 2022 12:33
@tssurya tssurya changed the title WIP: Test disable-SNATs WIP: Test custom ovn images Apr 6, 2022

tssurya commented Apr 6, 2022

Testing https://github.com/ovn-org/ovn-kubernetes/pull/2870/files; let's see how the migration jobs look.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2022

openshift-ci bot commented Oct 19, 2022

@tssurya: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-agnostic-upgrade | f2904e1 | link | true | /test e2e-agnostic-upgrade |
| ci/prow/e2e-aws-ovn-windows | dae4888 | link | true | /test e2e-aws-ovn-windows |
| ci/prow/e2e-openstack-ovn | dae4888 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-vsphere-ovn | dae4888 | link | false | /test e2e-vsphere-ovn |
| ci/prow/e2e-ovn-ipsec-step-registry | dae4888 | link | false | /test e2e-ovn-ipsec-step-registry |
| ci/prow/verify | dae4888 | link | true | /test verify |
| ci/prow/e2e-vsphere-windows | dae4888 | link | false | /test e2e-vsphere-windows |
| ci/prow/e2e-ovn-hybrid-step-registry | dae4888 | link | false | /test e2e-ovn-hybrid-step-registry |
| ci/prow/e2e-gcp | dae4888 | link | true | /test e2e-gcp |
| ci/prow/e2e-azure-ovn | dae4888 | link | false | /test e2e-azure-ovn |
| ci/prow/e2e-network-migration | dae4888 | link | true | /test e2e-network-migration |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | dae4888 | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-ovn-step-registry | dae4888 | link | false | /test e2e-ovn-step-registry |
| ci/prow/e2e-metal-ipi-ovn-ipv6-ipsec | dae4888 | link | false | /test e2e-metal-ipi-ovn-ipv6-ipsec |
| ci/prow/e2e-gcp-ovn | dae4888 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-sdn-network-migration-rollback | dae4888 | link | true | /test e2e-sdn-network-migration-rollback |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 19, 2022
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Dec 19, 2022

openshift-ci bot commented Dec 19, 2022

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
