[DownstreamMerge] 03-12-2021 #863
Conversation
Run full e2e tests on upgrade jobs instead of only the Networking Granular Checks. Signed-off-by: Andreas Karis <ak.karis@gmail.com>
This will ensure stale pod OVS interfaces are purged by the OVS service during a hard reboot on a node. It leaves in place the mechanism we have in ovnkube-node to periodically search for and remove stale interfaces. Signed-off-by: astoycos <astoycos@redhat.com>
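For illustration, a port can be marked transient at creation time so the OVS init scripts clean it up on the next service start; the bridge and port names below are placeholders, not taken from this PR.

```bash
# Sketch: create the container port as transient (placeholder names).
ovs-vsctl add-port br-int veth-pod1 \
  -- set port veth-pod1 other_config:transient=true
# On an OVS service restart (e.g. after a hard reboot), the init
# scripts delete every port carrying other_config:transient=true,
# so stale pod interfaces no longer linger.
```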
Batch the periodic interface scrubbing to increase the efficiency of this expensive operation. Signed-off-by: astoycos <astoycos@redhat.com>
Signed-off-by: Martin Kennelly <mkennell@redhat.com>
Prometheus recommends using only base units (seconds, bytes). Also corrected the OpenFlow probe interval metric, which is consumed in seconds [1] while its description said milliseconds. [1] https://www.ovn.org/support/dist-docs/ovn-controller.8.html Signed-off-by: Martin Kennelly <mkennell@redhat.com>
When a Pod add/update event is handled for the selected Pods of a given policy, the portInfo in the cache may still be that of the stale Pod. To avoid adding the stale port to the portGroup, stop processing and wait for the next Pod update event. Signed-off-by: Yun Zhou <yunz@nvidia.com>
… handling The server needs to be shut down after the client, otherwise the shutdown triggers the client's reconnect handling and blocks tests unnecessarily for 10 seconds. Also make sure we actually clean everything up, which we weren't doing, leading to some hard-to-reproduce deadlocks in the tests with upcoming changes. Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Dan Williams <dcbw@redhat.com>
Many testcases don't use the full configuration and don't need/want to use a urfave/cli/v2 App object, but that's what InitConfig() needs to parse CLI options. The only thing that uses the CLI options is building up the config from its different parts, which testcases don't need to do if they just use the defaults. Unfortunately, the default raw values (RawServiceCIDRs, RawClusterSubnets, etc.) get parsed into their final form by InitConfig() too. If we split the final parsing/completion out of InitConfig(), testcases can use it via PrepareTestConfig(), skip the full config init, and not require a urfave/cli/v2 App. Signed-off-by: Dan Williams <dcbw@redhat.com>
Most tests don't actually need a full config setup since they just use the default values and PrepareTestConfig() handles that. The config setup requires the urfave/cli/v2 App's Context object, so now we can remove that from testcases that don't change the configuration. Signed-off-by: Dan Williams <dcbw@redhat.com>
They don't need it and the one that does can just override the config options directly rather than going through the CLI arguments. Signed-off-by: Dan Williams <dcbw@redhat.com>
Two unrelated CI cleanups
LBs are only created on the node local switches in LGW, while they are created on both the GR and the switch in SGW. Let's create the LB on the GR for LGW as well, making this the first step towards handling service traffic in LGW exactly the same way as in SGW. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
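As a rough sketch of the intended end state (the router, switch, and load balancer names below are placeholders, not taken from this PR), the same load balancer ends up referenced from both the node switch and the gateway router:

```bash
# Sketch only: look up an existing service load balancer and attach it
# to both the node switch and the node's gateway router.
lb=$(ovn-nbctl --bare --columns=_uuid find load_balancer name=Service_default/kubernetes)
ovn-nbctl ls-lb-add node1 "$lb"     # node local switch (both modes)
ovn-nbctl lr-lb-add GR_node1 "$lb"  # gateway router (now also in LGW)
```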
Since service traffic will now ingress and egress via the GR/external switch in LGW, let's create the GR with the ct_zone and lb_force_snat_ip options like we do for SGW. This ensures the right flows for SNATing packets to the node IP and the join switch subnet are present, and also removes another topology difference between the two gateway modes. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
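Both settings are standard Logical_Router options in ovn-nb; a hedged example of setting them on a gateway router (the router name and IP are placeholders):

```bash
# Sketch: dedicated SNAT conntrack zone plus force-SNAT of
# load-balanced traffic on the gateway router (placeholder values).
ovn-nbctl set logical_router GR_node1 \
  options:snat-ct-zone=0 \
  options:lb_force_snat_ip="100.64.0.2"
```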
Since service traffic needs to be handled the same way as in SGW, we need to masquerade host-networked endpoints to support the hairpinning case of host->svc->host, like we do in SGW. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
There is no reason to only do this for SGW. Let's make the node-tracker update this for LGW as well. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
This commit edits the flows created on breth0 in LGW to be exactly the same as for SGW. This way any service traffic hitting breth0 will be sent to the external switch instead of sending it to the host. We remove the localPortWatcher that was specific to LGW and use the nodePortWatcher instead that is used for SGW. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
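A hedged sketch of the kind of flow involved; the real flows are generated by the nodePortWatcher, and the node IP, NodePort, and patch-port name here are placeholders:

```bash
# Sketch: steer NodePort traffic arriving on breth0 into OVN through
# the patch port to br-int instead of delivering it to the host.
ovs-ofctl add-flow breth0 \
  "priority=100,tcp,nw_dst=172.18.0.3,tp_dst=30950,actions=output:patch-breth0_node1-to-br-int"
```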
We do not need the 1003 and 1005 logical router policies on the ovn_cluster_router anymore, since they were mainly used for the hairpin traffic scenario to reach the DGP. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
In LGW we had: 10.96.0.0/16 via 10.244.2.1 dev ovn-k8s-mp0. Convert this to: 10.96.0.0/16 via 172.18.0.1 dev breth0 mtu 1400, like in shared gateway mode. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
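Spelled out as the equivalent ip commands, using the addresses quoted above:

```bash
# Before (LGW): service CIDR routed via the management port
ip route replace 10.96.0.0/16 via 10.244.2.1 dev ovn-k8s-mp0
# After: routed via the external gateway on breth0, as in SGW
ip route replace 10.96.0.0/16 via 172.18.0.1 dev breth0 mtu 1400
```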
We don't need the DGP, gw0, and the special NATs on ovn_cluster_router for the node-local IP anymore, since hairpinning will be handled the same way as in SGW, via OpenFlow rules. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
This annotation is not needed anymore since we don't need the DGP/gw0 logic. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Fixes commented-out tests and rearranges existing tests to separate the ones common to both modes from the ones that differ. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
dSNAT has been false on LGW since egress traffic doesn't use the GR path. In SGW it could be true, in which case we'd add individual NATs for pods not belonging to the routing-external-gws pods. Let's remove the check for gateway modes and add the clusterSubnet SNAT irrespective of the mode when dSNAT=false. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
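A hedged sketch of the resulting NB entry (router name, node IP, and cluster subnet are placeholder values):

```bash
# Sketch: one SNAT of the whole cluster subnet to the node IP on the
# gateway router, applied in both gateway modes when dSNAT=false.
ovn-nbctl lr-nat-add GR_node1 snat 172.18.0.3 10.244.0.0/16
```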
We add a new topology version called routingViaHost, where we take a major step in bridging the gap between LGW and SGW modes. This new topology will not have br-local, DGP, or gw0. It will have all the new OpenFlow rules on breth0, as in SGW, to handle ingress/service traffic. We handle upgrades from:
1) old LGW (with br-local) -> SGW (w/o br-local)
2) old LGW (with br-local) -> new LGW (w/o br-local)
3) SGW (w/o br-local) -> SGW (w/o br-local) - already works
4) SGW (w/o br-local) -> new LGW (w/o br-local)
At this point, the only differences between LGW and SGW are:
1) egress traffic from a pod still leaves via mp0 in LGW, while it leaves via the GR in SGW.
2) ACLs for egress-firewall are on the node switches in LGW, while they are on the join switches in SGW.
3) the 501 hybrid policy is used to steer traffic towards the GR in LGW, while it's not needed in SGW.
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Currently we use the disableSNATMultipleGWs option for the external-gateways feature to ensure we do not SNAT pod IPs to the node IP. This means we enable a per-pod SNAT for the pods not managed by external gateways. The EgressIPs feature can only work if the SNAT for an egressIP-managed pod is to the egress IP, not the node IP, as is the case when disableSNATMultipleGWs is set. So far we declared these features mutually exclusive. Let's stop that, and instead add logic to simply delete the SNAT to the node IP for an egressIP-managed pod and add the SNAT to the node IP for an egressIP-unmanaged pod. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
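A hedged sketch of the per-pod handover (router name and addresses are placeholders):

```bash
# Sketch: pod 10.244.2.5 becomes managed by egress IP 172.18.0.100;
# drop its SNAT to the node IP and SNAT it to the egress IP instead.
ovn-nbctl --if-exists lr-nat-del GR_node1 snat 10.244.2.5
ovn-nbctl lr-nat-add GR_node1 snat 172.18.0.100 10.244.2.5
# When the pod later loses the egress IP, the reverse applies:
# re-add the SNAT to the node IP (172.18.0.3 in this sketch).
```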
The previously used 'ovn-nbctl lsp-get-addresses stor-<node_name>' no longer works, as it returns 'router' for LSPs of type router. The change retrieves the MAC address from the LRP instead. Signed-off-by: Patryk Diak <pdiak@redhat.com>
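For illustration, the before/after lookup; rtos-<node> is assumed here to be the router-side peer of stor-<node>, and node1 is a placeholder:

```bash
# Old approach: prints the literal string "router" for ports of type
# router, not a MAC address.
ovn-nbctl lsp-get-addresses stor-node1
# New approach: read the MAC straight from the logical router port.
ovn-nbctl --bare --columns=mac find logical_router_port name=rtos-node1
```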
Fix getting router port MAC address
Make egressIPs compatible with externalgws
Some minor fixups; ensure we return its error, if any, in RegisterMasterMetrics, and ensure we monitor the sbdb's SB_Global table since it is fetched for metrics purposes. Signed-off-by: Andrew Stoycos <astoycos@redhat.com>
Fixup FindSBGlobal and monitor sb_global
Make container OVS interfaces Transient, Batch Periodic interface scrubbing
CI: Run full e2e tests on upgrade jobs
When we lose leadership we exit to trigger a restart of the container so that it comes back up as a follower. That is not an error condition, so exit(0) makes more sense. With exit(1), some CI jobs note it as an error condition indicative of some problem; this change removes that scary report from our CI. With exit(1) our monitor logs look like this:

Nov 19 20:25:14.241 E ns/openshift-ovn-kubernetes pod/ovnkube-master-bpq94 node/ip-10-0-165-169.us-east-2.compute.internal container/ovnkube-master reason/ContainerExit code/1 cause/Error
2021-11-19T20:24:43.215Z|02253|reconnect|INFO|ssl:10.0.241.143:9641: connected
2021-11-19T20:24:43.221Z|02254|fatal_signal|WARN|terminating with signal 15 (Terminated)
2021-11-19T20:24:43.221Z|00002|daemon_unix(monitor)|INFO|pid 13 died, killed (Terminated), exiting
I1119 20:24:43.232860 1 client.go:484] [OVN_Southbound] disconnected from ssl:10.0.165.169:9642; reconnecting ...
I1119 20:24:43.241844 1 ovnkube.go:123] Received signal terminated. Shutting down
I1119 20:24:43.242142 1 reflector.go:225] Stopping reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
I1119 20:24:43.242178 1 reflector.go:225] Stopping reflector *v1beta1.EndpointSlice (0s) from k8s.io/client-go/informers/factory.go:134
I1119 20:24:43.242200 1 reflector.go:225] Stopping reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
I1119 20:24:43.242222 1 services_controller.go:173] Shutting down controller ovn-lb-controller
I1119 20:24:43.242283 1 reflector.go:225] Stopping reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:134
I1119 20:24:43.242310 1 reflector.go:225] Stopping reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
I1119 20:24:43.242338 1 reflector.go:225] Stopping reflector *v1.NetworkPolicy (0s) from k8s.io/client-go/informers/factory.go:134
I1119 20:24:43.242361 1 reflector.go:225] Stopping reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
I1119 20:24:43.242383 1 reflector.go:225] Stopping reflector *v1.EgressIP (0s) from github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:117
I1119 20:24:43.242397 1 reflector.go:225] Stopping reflector *v1.Endpoints (0s) from k8s.io/client-go/informers/factory.go:134
I1119 20:24:43.242418 1 reflector.go:225] Stopping reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:134
I1119 20:24:43.253048 1 master.go:107] No longer leader;

With exit(0) the monitor log is less alarming:

Nov 19 22:25:20.567 I ns/openshift-ovn-kubernetes pod/ovnkube-master-pz287 node/ip-10-0-129-41.us-west-1.compute.internal container/ovnkube-master reason/ContainerExit code/0 cause/Completed

Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
/hold
fedora: Bump ovn version to 21.09.0-4.
exit 0 instead of 1
Wait, I see unit tests passing against upstream, but something is not working downstream. Investigating...
OK, I need to update the downstream-specific unit tests since we have b99bcc1.
Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
PodRequest's IsDPU is currently being used to capture whether ovnkube-node is running in DPU host mode or not. However, IsDPU could mean two things: the DPU host, or the DPU itself. Since IsDPU here is meant to track DPU host mode, let's change the name to IsDPUHostMode. Also, there is no need for the Server's mode field, so remove it as well. Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
Fix the tests to include the extra iptables rules that are added in OCP_HACKS.go. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
In ovn-org/ovn-kubernetes@fe963fd we fixed the upstream tests by splitting the test harness handling for the client; this PR ensures we do the same for the downstream-specific tests. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Force-pushed from 6a85822 to 744c49a (Compare)
All unit tests have passed.
@tssurya: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/lgtm
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: trozet, tssurya. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
- What this PR does and why is it needed
This is the downstream merge PR for this week.
- Special notes for reviewers
Conflicts:
There was one tiny conflict in the tests, in ovn-kubernetes/go-controller/pkg/node/gateway_init_linux_test.go (line 727 in 92e3fa2), around the expectedIPTablesRules function, which was removed since the new tests don't use it.
Two downstream-only commits are present at the end:
/assign @trozet @dcbw