Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DownstreamMerge] 9-23-22 b - dualstack fixed #1289

Merged
merged 51 commits into from Oct 5, 2022

Conversation

trozet
Copy link
Contributor

@trozet trozet commented Oct 3, 2022

- What this PR does and why is it needed

- Special notes for reviewers

- How to verify it

- Description for the changelog

adrianchiris and others added 30 commits January 4, 2022 15:54
- Add design doc which depicts the current implementation
  in OVN-Kubernetes
- Add DPU section in ovs_offload docs

Note:

Currently only a general overview is provided
in the in-project documents plus links to google doc.

This is a first stage to have some initial documentation
in the project. Later on these docs should be converted
to Markdown and be part of the project.

Signed-off-by: Adrian Chiris <adrianc@nvidia.com>
NetworkPolicyNotCreated error only returned from
updateACLLoggingForPolicy when network policy that is passed as an
argument have np.created = false.
updateACLLoggingForPolicy called from 2 places:
func (oc *Controller) setNetworkPolicyACLLoggingForNamespace(ns string,
nsInfo *namespaceInfo) error {
   ...
   for name, policy := range nsInfo.networkPolicies { }
   ...
}

func (oc *Controller) addNetworkPolicy(policy *knet.NetworkPolicy)
error {
    ...
    if err := oc.createNetworkPolicy(np, policy, aclLogDeny,
    aclLogAllow,portGroupIngressDenyName, portGroupEgressDenyName);
    ...
    nsInfo.networkPolicies[policy.Name] = np
    if err := oc.updateACLLoggingForPolicy(np, nsInfo.aclLogging.Allow);
    err != nil {
    ...
}

np.created is set to true in createNetworkPolicy,
and createNetworkPolicy is only called from addNetworkPolicy.

To make sure that np.created = true for every updateACLLoggingForPolicy
call, we need to check that
1. every policy from nsInfo.networkPolicies has .created = true,
which is true since network policy is added to nsInfo.networkPolicies
only when createNetworkPolicy successfully completed
2. addNetworkPolicy only calls updateACLLoggingForPolicy after
createNetworkPolicy successfully completed, which i=s also true, since
error will be returned earlier if createNetworkPolicy fails

Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
Commit 16ea40b changed the logical switch other-config to be overwriten
this needs to be changed so that the hostsubnet values are appended.

Signed-off-by: Jacob Tanenbaum <jtanenba@redhat.com>
Signed-off-by: Yun Zhou <yunz@nvidia.com>
GetConfigDurationRecorder is called from two paths.
1) nodeTracker detects a switch or GR has been added for the node
and does a full RequestFullSync which calls onServiceAdd
2) onServiceAdd is called from the service handler for adds.

If we don't lock the cdr variable we access it from two
routines while one routine edits it; the other tries to read it
for example. Let's lock this variable to prevent races.

2022-09-09T14:33:01.6982688Z WARNING: DATA RACE
2022-09-09T14:33:01.6982976Z Read at 0x000004402a78 by goroutine 1044:
2022-09-09T14:33:01.6983629Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/metrics.GetConfigDurationRecorder()
2022-09-09T14:33:01.6984387Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/metrics/master.go:847 +0x29b
2022-09-09T14:33:01.6985239Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).onServiceAdd()
2022-09-09T14:33:01.6986069Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/controller/services/services_controller.go:366 +0x33f
2022-09-09T14:33:01.6986732Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).RequestFullSync()
2022-09-09T14:33:01.6988638Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/controller/services/services_controller.go:351 +0x2b2
2022-09-09T14:33:01.6989314Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).RequestFullSync-fm()
2022-09-09T14:33:01.6989705Z       <autogenerated>:1 +0x39
2022-09-09T14:33:01.6990287Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*nodeTracker).updateNodeInfo()
2022-09-09T14:33:01.6991204Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/controller/services/node_tracker.go:150 +0x67b
2022-09-09T14:33:01.6993403Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*nodeTracker).updateNode()
2022-09-09T14:33:01.6994200Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/controller/services/node_tracker.go:201 +0x709
2022-09-09T14:33:01.6996051Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.newNodeTracker.func1()
2022-09-09T14:33:01.6996855Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/controller/services/node_tracker.go:78 +0x55
2022-09-09T14:33:01.6998163Z   k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd()

versus

2022-09-09T14:33:01.7016687Z Previous write at 0x000004402a78 by goroutine 1070:
2022-09-09T14:33:01.7017329Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/metrics.GetConfigDurationRecorder()
2022-09-09T14:33:01.7018197Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/metrics/master.go:848 +0x310
2022-09-09T14:33:01.7018912Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).onServiceAdd()
2022-09-09T14:33:01.7019866Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/controller/services/services_controller.go:366 +0x33f
2022-09-09T14:33:01.7020529Z   github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).onServiceAdd-fm()
2022-09-09T14:33:01.7020900Z       <autogenerated>:1 +0x4d
2022-09-09T14:33:01.7021431Z   k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd()
2022-09-09T14:33:01.7022328Z       /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:232 +0x74
2022-09-09T14:33:01.7022857Z   k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnAdd()

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
The `err` variable was not being set to the result of the endpoint
generation, which meant that if - for whatever reason - creating those
endpoints failed, the test would continue and fail further down the
execution.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
…when-creating-endpoints

e2e tests, services: assert endpoints were created
The `checkPingOnContainer` function is duplicated in the tests. It also
returns an error, that is both already asserted within the function, and
not checked by the caller. Thus, we can safely remove that call.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Signed-off-by: Mohamed Mahmoud <mmahmoud@redhat.com>
BZ2117255 append instead of overwrite logicalSwitch other-config
Add retry logic to Namespace resources
Due to the way how IPv6 multipath is implemented, it is not possible
to filter by ifindex as the ifindex for multihop routes will always be
0. Instead, implement a custom filter.

Reported-at: https://issues.redhat.com/browse/OCPBUGS-1318
Signed-off-by: Andreas Karis <ak.karis@gmail.com>
In environments where PodSecurity is enforced we can't use the default ns unless
certain labels are set. Namespaces we create with the framework have these labels
which let us create pods on them without restrictions.
Since we're using disposable namespaces that are deleted in AfterEach there is
no longer need to delete the gw pods.

Signed-off-by: Ori Braunshtein <obraunsh@redhat.com>
EnsureAddressSet may be called by multiple go threads around the same time.
While transactions are serialized, the operations that populate the transactions
are not and that is where the determination on whether a row in ovsdb is to
be created or updated via libovsdbops. The result of that is that the creation
fails due to the constraints on the table index.

This change is a placebo. It simply detects and logs this case since a
higher level retry logic is expected to succeed on the following invocation.

Co-authored-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Signed-off-by: Flavio Fernandes <flaviof@redhat.com>
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2108026
getDefaultGatewayInterfaceByFamily: custom filter for MultiHop
Since the `br-ex` bridge already has the node masquerade IP address, it
is no longer required to add the masquerade route to the nodes.

The SNAT/unSNAT traffic to/from the node to use the node masquerade IP
address will **still** be used, to specifically support the use case of
packets sent from the node having a different source IP address.

The `V4MasqueradeSubnet` had to be updated, since follow-up commits will
require the `.4` address on that subnet to use as a dummy next-hop for
deployments where there isn't a next-hop available.

netlink utils: rename  method

This `LinkRoutesAddOrUpdateMTU` function can now be used to update the
route MTU *or* the route source. Thus, rename the method accordingly.

A follow up commit should try to come up with a better abstraction,
since this function already does too much:
It either:
- adds a route **or**
- updates a route
  - MTU **or**
  - source

I think we should have 3 different functions, each doing a single thing,
and leave to the caller to know which of those to invoke.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Co-authored-by: Tim Rozet <trozet@redhat.com>

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
For simplicity, the `.4` address on the node masquerade subnet was
chosen - this one was the next available IP address on that subnet.

Finally, also provision the MAC binding tables for each of the gateway
routers, thus providing ARP resolution for the aforementioned IP
address.

The resolved MAC address is computed from the IP address, and is the
same on each of the nodes, since traffic is not supposed to leave the
node - and as such, there is no risk for collisions.

The OVN nodes are also updates to use this dummy next-hop, and a
permanent neighbor is added for said IP / MAC.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Co-authored-by: Tim Rozet <trozet@redhat.com>
Now that host -> service can work without using a default gateway there
is no need to require it. With this change no next hops will be written
to the l3 gateway config annotation for the node if none are detected.
However, this only will work if --gateway-interface config is provided
at ovnkube node start time. If that is not provided, we still rely on
default gateway lookup to determine which interface should be used to
communicate with the external bridge.

Signed-off-by: Tim Rozet <trozet@redhat.com>
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
…ute on br-ex

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
E2E (exgw): create ns for the serving gw pods instead of using default
e2e, external gw test: remove duplicate funcs
This patch makes route filtering generic - it is up to the caller to
specify which route to look for, along with the filter mask specifying
the relevant attributes.

With this, we can now look for a simple route - match stricly on the dst
route - when the node is being created, but also have a finer match,
also providing the GW, for the services controller.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
We were not creating healthchecks for service updates
when the etp changed from local to cluster or back. This
PR fixes this. It also fixes some endpoint slices code
that earlier assumed svc:endpointslice was 1:1 mapping.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
trozet and others added 7 commits September 27, 2022 09:50
Don't assume policy type when evaluating isolation
Trim ACL names according to RFC1123
…brid

remove the requirement that hybrid overlay needs subnet .3 address
49f284e introduced an issue where it
never release the nsInfo lock. Fix this.

Signed-off-by: Andreas Karis <ak.karis@gmail.com>
Fix lock issue in ensureNamespaceLocked
Conflicts due to hybrid overlay OCP HACK

Signed-off-by: Andreas Karis <ak.karis@gmail.com>

Conflicts:
	go-controller/hybrid-overlay/pkg/types/types.go
        go-controller/pkg/ovn/pods_test.go
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2022
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 3, 2022
@trozet
Copy link
Contributor Author

trozet commented Oct 4, 2022

/test e2e-metal-ipi-ovn-dualstack

Now that the special masquerade IPs are assigned to the interface, they
default flows that end up DNAT'ing return traffic from the GR->host from
masquerade ip -> host ip were being overridden with secondary ip flows
for extra ips. This fixes it by skipping special ips when evaluting the
extra ips on a node.

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit 5a056e7)
@trozet trozet changed the title [WIP] Trozet merge 9 23 22 b - debugging dualstack failure merge 9 23 22 b - dualstack fixed Oct 4, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 4, 2022
@trozet trozet changed the title merge 9 23 22 b - dualstack fixed [DownstreamMerge] 9-23-22 b - dualstack fixed Oct 4, 2022
@trozet
Copy link
Contributor Author

trozet commented Oct 4, 2022

data race issue for unit tests: ovn-org/ovn-kubernetes#3203

@trozet
Copy link
Contributor Author

trozet commented Oct 4, 2022

/retest-required

@trozet
Copy link
Contributor Author

trozet commented Oct 5, 2022

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 5, 2022

@trozet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-e2e-gcp-ovn ead8543 link false /test okd-e2e-gcp-ovn
ci/prow/e2e-ovn-hybrid-step-registry ead8543 link false /test e2e-ovn-hybrid-step-registry
ci/prow/e2e-openstack-ovn ead8543 link false /test e2e-openstack-ovn
ci/prow/e2e-vsphere-windows ead8543 link false /test e2e-vsphere-windows
ci/prow/e2e-aws-ovn-windows ead8543 link false /test e2e-aws-ovn-windows
ci/prow/e2e-vsphere-ovn ead8543 link false /test e2e-vsphere-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@andreaskaris
Copy link
Contributor

/lgtm

Copy link
Contributor

@andreaskaris andreaskaris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 5, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 5, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreaskaris, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1 similar comment
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 5, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreaskaris, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit bb86d85 into openshift:master Oct 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet