
[DownstreamMerge] 03-12-2021 #863

Merged

Conversation

tssurya
Contributor

@tssurya tssurya commented Dec 2, 2021

- What this PR does and why is it needed

This is the downstream merge PR for this week.

- Special notes for reviewers

Conflicts:
There was one tiny conflict in the tests around the expectedIPTablesRules function:

func expectedIPTablesRules(gatewayIP string) map[string]util.FakeTable {

which was removed since the new tests don't use that function.

Two downstream only commits are present at the end:

  • 822fd5a to fix the tests for the MCS-specific iptables rules.
  • 744c49a to fix the test harness for the downstream-specific hybrid overlay tests.

/assign @trozet @dcbw

andreaskaris and others added 30 commits November 11, 2021 13:17
Run full e2e tests on upgrade jobs instead of only the Networking
Granular Checks.

Signed-off-by: Andreas Karis <ak.karis@gmail.com>
This will ensure stale pod OVS interfaces are purged by
the OVS service during a hard reboot of a node.

This leaves in place the mechanism we have in ovnkube-node to
periodically search for and remove stale interfaces.

Signed-off-by: astoycos <astoycos@redhat.com>
Increase the efficiency of this expensive operation

Signed-off-by: astoycos <astoycos@redhat.com>
Signed-off-by: Martin Kennelly <mkennell@redhat.com>
Prometheus recommends using only base units (seconds/bytes).

Also corrected the openflow probe interval, which was consumed in
seconds [1] although the description said milliseconds.

[1] https://www.ovn.org/support/dist-docs/ovn-controller.8.html

Signed-off-by: Martin Kennelly <mkennell@redhat.com>
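For reference, a minimal sketch of a base-unit metric with the Prometheus Go client; the metric name and help text below are illustrative, not necessarily the ones this commit touches:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Illustrative gauge exposed in base units (seconds) per the Prometheus
// naming conventions; any value recorded here is already converted to seconds.
var ovnControllerProbeIntervalSeconds = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "ovn",
	Subsystem: "controller",
	Name:      "openflow_probe_interval_seconds",
	Help:      "OpenFlow inactivity probe interval in seconds (not milliseconds).",
})

func init() {
	prometheus.MustRegister(ovnControllerProbeIntervalSeconds)
}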
When a Pod add/update event is handled for selected Pods of a given
policy, the portInfo in the cache may still be that of the stale Pod.
To avoid adding the stale port to the portGroup, stop processing
and wait for the next Pod update event.

Signed-off-by: Yun Zhou <yunz@nvidia.com>
… handling

The server needs to be shut down after the client; otherwise it will trigger
the client's reconnect handling and block tests unnecessarily for 10 seconds.

Also make sure we actually clean everything up, which we weren't doing,
leading to some hard-to-reproduce deadlocks in the tests with upcoming changes.

Signed-off-by: Dan Williams <dcbw@redhat.com>
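A rough sketch of the shutdown ordering this implies; startTestServer and connectTestClient are hypothetical helpers, and t.Cleanup's LIFO ordering does the rest:

package libovsdbtest

import "testing"

func setupTestOVSDB(t *testing.T) {
	server := startTestServer(t)           // hypothetical helper
	client := connectTestClient(t, server) // hypothetical helper

	// Registered first, runs last: the server stays up until the client is gone.
	t.Cleanup(func() { server.Close() })
	// Registered last, runs first: closing the client before the server avoids
	// triggering its reconnect handling and the 10 second wait.
	t.Cleanup(func() { client.Close() })
}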
Signed-off-by: Dan Williams <dcbw@redhat.com>
Many testcases don't use the full configuration and don't need/want
to use a urfave/cli/v2 App object, but that's what InitConfig()
needs to parse CLI options. But the only thing that uses the
CLI options is building up the config from the different parts,
which testcases don't need to do if they just use the defaults.

Unfortunately the default raw values (RawServiceCIDRs,
RawClusterSubnets, etc) get parsed into their final form by
InitConfig too.

If we split the final parsing/completion out of InitConfig()
then testcases can use it via PrepareTestConfig(), not have to
call the full config init, and not require a urfave/cli/v2 App.

Signed-off-by: Dan Williams <dcbw@redhat.com>
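A minimal sketch of what a testcase then looks like, assuming PrepareTestConfig() resets the config to fully parsed defaults as described; the Gateway.Mode override below is just an illustration of setting config fields directly:

package ovn

import (
	"testing"

	"github.com/ovn-org/ovn-kubernetes/go-controller/pkg/config"
)

func TestSomething(t *testing.T) {
	// No urfave/cli/v2 App or Context needed; just reset to parsed defaults.
	if err := config.PrepareTestConfig(); err != nil {
		t.Fatalf("failed to prepare test config: %v", err)
	}
	// Tests needing non-default values can set config fields directly
	// instead of going through CLI arguments.
	config.Gateway.Mode = config.GatewayModeShared
}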
Most tests don't actually need a full config setup since they
just use the default values and PrepareTestConfig() handles that.
The config setup requires the urfave/cli/v2 App's Context object,
so now we can remove that from testcases that don't change
the configuration.

Signed-off-by: Dan Williams <dcbw@redhat.com>
They don't need it and the one that does can just override
the config options directly rather than going through the CLI
arguments.

Signed-off-by: Dan Williams <dcbw@redhat.com>
LBs are only created on the node-local switches
in LGW, while they are created on both the GR and the switch
in SGW. Let's create the LB on the GR for LGW as well,
making this the first step towards handling service
traffic in LGW exactly the same as in SGW.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Since service traffic will now ingress and egress
via the GR/external switch in LGW, let's create
the GR with the ct_zone and lb_force_snat_ip options
like we do for SGW. This ensures the right flows
for SNATing packets to the nodeIP and join-switch subnet
are present, and also removes another topology
difference between the two gateway modes.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Since service traffic needs to be handled the
same way as in SGW, we need to masquerade host-networked
endpoints to support the hairpinning
case of host->svc->host, like we do in SGW.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
There is no reason to only do this for SGW.
Let's make the node-tracker update this for LGW as well.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
This commit edits the flows created on breth0 in LGW
to be exactly the same as for SGW. This way any service
traffic hitting breth0 will be sent to the external
switch instead of sending it to the host.

We remove the localPortWatcher that was specific to LGW
and use the nodePortWatcher instead that is used for SGW.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
We do not need the 1003 and 1005 logical policies on the
ovn_cluster_router anymore since they were mainly used
for the hairpin traffic scenario to reach the DGP.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
In LGW we had:
10.96.0.0/16 via 10.244.2.1 dev ovn-k8s-mp0

Convert this to:
10.96.0.0/16 via 172.18.0.1 dev breth0 mtu 1400
like in shared.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
We don't need the DGP, gw0, and the special NATs on
ovn_cluster_router for the node-local-ip anymore,
since hairpinning will be handled the same way
as in SGW via openflows.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
This annotation is not needed anymore
since we don't need the DGP/gw0 logic.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Fixes commented-out tests and rearranges
existing tests to segregate tests common
to both modes from the ones covering the
differences.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
dSNAT has been false on LGW since egress traffic
doesn't use the GR path. In SGW, this could be true,
in which case we'd add individual NATs for pods
not belonging to the routing-external-gws pod.

Let's remove the check for gateway modes and
add the clusterSubnet SNAT irrespective of the mode
when dSNAT=false.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
We add a new topology version called
routingViaHost, where we take a major
step in bridging the gap between LGW and
SGW modes. This new topology will not have
br-local, DGP, gw0. It will have all the
new openflows on breth0 as for SGW to
handle ingress/service traffic.

We handle upgrade from:
1) old lgw (with br-local) -> sgw (w/o br-local)
2) old lgw (with br-local) -> new lgw (w/o br-local)
3) shared (w/o br-local) -> shared (w/o br-local) - already works
4) shared (w/o br-local) -> new lgw (w/o br-local)

At this point, the only differences between
LGW and SGW are:

1) egress traffic from a pod will still leave
via mp0 in LGW while it leaves via the GR in SGW.
2) ACLs for egress-firewall are on the node switches
in LGW while they are on the join switches in SGW.
3) the 501 hybrid policy is used to steer traffic
towards the GR in LGW while it's not needed in SGW.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Currently we use the disableSNATMultipleGWs
option for the externalgws feature to ensure we
do not SNAT podIPs to the nodeIP. This means
we enable a per-pod SNAT for the pods not
managed by externalgws.

The EgressIPs feature can work only if the SNAT
for an egressIP-managed pod is to the egressIP
and not to the nodeIP (which is what the per-pod
SNAT does when disableSNATMultipleGWs is set).

So far we were declaring these features as
mutually exclusive. Let's stop that and instead
add the logic to simply delete the SNAT to nodeIP
for an egressIP-managed pod and add the SNAT to nodeIP
for an egressIP-unmanaged pod.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
The previously used 'ovn-nbctl lsp-get-addresses stor-<node_name>' no longer works, as it returns 'router' for lsp ports of type router.
The change retrieves the MAC address of the lrp port instead.

Signed-off-by: Patryk Diak <pdiak@redhat.com>
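A hedged sketch of the approach: query the mac column of the rtos-<node> logical router port via ovn-nbctl instead of lsp-get-addresses on stor-<node>. The helper name and its placement are assumptions for illustration; RunOVNNbctl is the usual ovn-kubernetes nbctl wrapper.

package util

import (
	"fmt"
	"strings"
)

// getNodeRouterPortMAC (hypothetical helper) reads the MAC of the node's
// cluster-router port rtos-<node>, since lsp-get-addresses on the peer
// stor-<node> port now just returns "router".
func getNodeRouterPortMAC(nodeName string) (string, error) {
	stdout, stderr, err := RunOVNNbctl("--if-exists", "get",
		"logical_router_port", "rtos-"+nodeName, "mac")
	if err != nil {
		return "", fmt.Errorf("failed to get MAC of rtos-%s: stderr %q, err: %v",
			nodeName, stderr, err)
	}
	// ovn-nbctl prints the column value quoted, e.g. "0a:58:0a:f4:02:01".
	return strings.Trim(strings.TrimSpace(stdout), `"`), nil
}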
Fix getting router port MAC address
Some minor fixups, and ensure we return its error, if any,
in RegisterMasterMetrics

Ensure we monitor the sbdb's sb_global table since
it is fetched for metrics purposes

Signed-off-by: Andrew Stoycos <astoycos@redhat.com>
Fixup FindSBGlobal and monitor sb_global
dcbw and others added 5 commits December 1, 2021 15:13
Make container OVS interfaces Transient, Batch Periodic interface scrubbing
CI: Run full e2e tests on upgrade jobs
When we lose leadership we exit to trigger a restart of the container
so it comes back up as a follower. That is not an error condition, so
exit(0) makes more sense. With exit(1), some CI jobs note it as an
error condition indicative of some problem. This change removes that
scary report in our CI.

with exit(1) our monitor logs look like this:

Nov 19 20:25:14.241 E ns/openshift-ovn-kubernetes pod/ovnkube-master-bpq94 node/ip-10-0-165-169.us-east-2.compute.internal container/ovnkube-master reason/ContainerExit code/1 cause/Error -11-19T20:24:43.215Z|02253|reconnect|INFO|ssl:10.0.241.143:9641: connected\n2021-11-19T20:24:43.221Z|02254|fatal_signal|WARN|terminating with signal 15 (Terminated)\n2021-11-19T20:24:43.221Z|00002|daemon_unix(monitor)|INFO|pid 13 died, killed (Terminated), exiting\nI1119 20:24:43.232860       1 client.go:484] [OVN_Southbound] disconnected from ssl:10.0.165.169:9642; reconnecting ... \nI1119 20:24:43.241844       1 ovnkube.go:123] Received signal terminated. Shutting down\nI1119 20:24:43.242142       1 reflector.go:225] Stopping reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134\nI1119 20:24:43.242178       1 reflector.go:225] Stopping reflector *v1beta1.EndpointSlice (0s) from k8s.io/client-go/informers/factory.go:134\nI1119 20:24:43.242200       1 reflector.go:225] Stopping reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134\nI1119 20:24:43.242222       1 services_controller.go:173] Shutting down controller ovn-lb-controller\nI1119 20:24:43.242283       1 reflector.go:225] Stopping reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:134\nI1119 20:24:43.242310       1 reflector.go:225] Stopping reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134\nI1119 20:24:43.242338       1 reflector.go:225] Stopping reflector *v1.NetworkPolicy (0s) from k8s.io/client-go/informers/factory.go:134\nI1119 20:24:43.242361       1 reflector.go:225] Stopping reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134\nI1119 20:24:43.242383       1 reflector.go:225] Stopping reflector *v1.EgressIP (0s) from github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:117\nI1119 20:24:43.242397       1 reflector.go:225] Stopping reflector *v1.Endpoints (0s) from k8s.io/client-go/informers/factory.go:134\nI1119 20:24:43.242418       1 reflector.go:225] Stopping reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:134\nI1119 20:24:43.253048       1 master.go:107] No longer leader;\n

with exit(0) the monitor log is less alarming:

Nov 19 22:25:20.567 I ns/openshift-ovn-kubernetes pod/ovnkube-master-pz287 node/ip-10-0-129-41.us-west-1.compute.internal container/ovnkube-master reason/ContainerExit code/0 cause/Completed

Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
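A minimal sketch of the exit-code choice using client-go's leader election callbacks; this is an assumption about where the exit happens, not the exact ovnkube-master wiring:

package ovn

import (
	"context"
	"os"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/klog/v2"
)

// masterLeaderCallbacks (illustrative) exits with status 0 when leadership is
// lost, so the container restarts as a follower without CI flagging an error.
func masterLeaderCallbacks(runMaster func(context.Context)) leaderelection.LeaderCallbacks {
	return leaderelection.LeaderCallbacks{
		OnStartedLeading: runMaster,
		OnStoppedLeading: func() {
			klog.Info("No longer leader; exiting")
			os.Exit(0) // clean exit instead of exit(1): not an error condition
		},
	}
}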
@tssurya
Contributor Author

tssurya commented Dec 2, 2021

/hold
I might have to fix the iptables rules in the tests to accommodate the OCP_HACKS.go special rules

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 2, 2021
@tssurya
Contributor Author

tssurya commented Dec 2, 2021

--- PASS: TestNodeSuite (2.34s)
PASS
coverage: 51.7% of statements
go test -mod=vendor -test.v  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn -ginkgo.v  -ginkgo.reportFile ./_artifacts/junit-pkg_ovn.xml
# github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn [github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.test]
pkg/ovn/pods_test.go:718:29: cannot use ctx (type *cli.Context) as type libovsdb.TestSetup in argument to fakeOvn.startWithDBSetup
pkg/ovn/pods_test.go:718:29: cannot use initialDB (type libovsdb.TestSetup) as type "k8s.io/apimachinery/pkg/runtime".Object in argument to fakeOvn.startWithDBSetup:
	libovsdb.TestSetup does not implement "k8s.io/apimachinery/pkg/runtime".Object (missing DeepCopyObject method)
FAIL	github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn [build failed]
FAIL

wait I see unit tests passing against upstream, but something is not working downstream. Investigating....

@tssurya
Contributor Author

tssurya commented Dec 2, 2021

OK, I need to update the downstream-specific unit tests since we have b99bcc1

girishmg and others added 5 commits December 2, 2021 10:07
Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
PodRequest's IsDPU field is currently being used to capture whether ovnkube node
is running in DPUHostMode or not. However, IsDPU could mean two things:
the DPU host or the DPU itself. Since IsDPU here is meant to track
DPUHostMode, let's change the name to IsDPUHostMode.

Also, there is no need for the Server's mode field, so remove it as well.

Signed-off-by: Girish Moodalbail <gmoodalbail@nvidia.com>
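For illustration, the shape of the rename; the struct contents here are assumptions rather than the exact CNI types:

package cni

// PodRequest sketch: the old boolean was named IsDPU, which was ambiguous
// between "running on the DPU host" and "running on the DPU itself".
type PodRequest struct {
	// IsDPUHostMode is true when ovnkube node runs in DPU-host mode.
	IsDPUHostMode bool
	// (other fields unchanged; the Server struct no longer carries a mode field)
}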
Fix the tests to include the extra iptables rules that are
added in OCP_HACKS.go

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
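A rough sketch of the shape of such a test expectation, assuming util.FakeTable is the per-table map of chains to rules used by the node tests; the helper name, chains, and rules below are illustrative, not the actual OCP_HACKS.go additions:

package node

import "github.com/ovn-org/ovn-kubernetes/go-controller/pkg/util"

// expectedDownstreamIPTablesRules is a hypothetical helper sketching how the
// expected rule set would include downstream-only entries alongside the
// upstream ones.
func expectedDownstreamIPTablesRules(gatewayIP string) map[string]util.FakeTable {
	return map[string]util.FakeTable{
		"nat": {
			"PREROUTING": []string{"-j OVN-KUBE-NODEPORT"},
			"OUTPUT":     []string{"-j OVN-KUBE-NODEPORT"},
			// downstream-only rules from OCP_HACKS.go would be appended here
			"OVN-KUBE-NODEPORT": []string{},
		},
		"filter": {},
	}
}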
In ovn-org/ovn-kubernetes@fe963fd
we fixed the upstream tests by splitting the test harness handling
for the client, and this PR ensures we do the same for the
downstream-specific tests.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
@tssurya
Contributor Author

tssurya commented Dec 2, 2021

All unit tests have passed:

JUnit path was configured: ./_artifacts/junit-hybrid-overlay_pkg_controller.xml

JUnit report was created: /home/surya/go/src/github.com/openshift/ovn-kubernetes/go-controller/_artifacts/junit-hybrid-overlay_pkg_controller.xml

Ran 14 of 14 Specs in 3.022 seconds
SUCCESS! -- 14 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestHybridOverlayControllerSuite (3.03s)
PASS
coverage: 58.9% of statements

@openshift-ci
Contributor

openshift-ci bot commented Dec 2, 2021

@tssurya: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-windows 744c49a link false /test e2e-vsphere-windows
ci/prow/e2e-vsphere-ovn 744c49a link false /test e2e-vsphere-ovn
ci/prow/okd-e2e-gcp-ovn 744c49a link false /test okd-e2e-gcp-ovn
ci/prow/e2e-openstack-ovn 744c49a link false /test e2e-openstack-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 2, 2021
Contributor

@trozet trozet left a comment


/lgtm
/hold cancel

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 2, 2021
@openshift-ci
Contributor

openshift-ci bot commented Dec 2, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: trozet, tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 2, 2021
@openshift-merge-robot openshift-merge-robot merged commit 9742426 into openshift:master Dec 2, 2021