UT Flake: handles a HO node is switched to a OVN node is flaking #4387

Open
tssurya opened this issue May 23, 2024 · 6 comments · Fixed by #4583
Labels
kind/ci-flake Flakes seen in CI

Comments

@tssurya
Contributor

tssurya commented May 23, 2024

Which jobs are flaking?

Unit Test

Which tests are flaking?

  • handles a HO node is switched to a OVN node

Since when has it been flaking?

It started flaking this week.

Reason for failure (if possible)

The test fails because of a race in how we compare the libovsdb objects:

I0521 23:40:23.312702   25592 reflector.go:295] Stopping reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:159
I0521 23:40:23.312726   25592 reflector.go:295] Stopping reflector *v1.EndpointSlice (0s) from k8s.io/client-go/informers/factory.go:159
• Failure [2.584 seconds]
Hybrid SDN Master Operations
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/hybrid_test.go:132
I0521 23:40:23.312733   25592 watch.go:183] Stopping fake watcher.
  handles a HO node is switched to a OVN node [It]
I0521 23:40:23.312862   25592 reflector.go:295] Stopping reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:159
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/hybrid_test.go:1212

I0521 23:40:23.312942   25592 reflector.go:295] Stopping reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:159
I0521 23:40:23.312925   25592 handler.go:217] Removed *v1.Node event handler 3
  Timed out after 2.000s.
  Expected
      <[]*nbdb.LogicalRouterStaticRoute | len:2, cap:2>: [
          {
              UUID: "7bc8e72a-5f2c-42e3-bb3f-28548e62048f",
              BFD: nil,
              ExternalIDs: {
                  "name": "hybrid-subnet-node1:node-windows",
              },
              IPPrefix: "10.1.3.0/24",
              Nexthop: "10.1.1.3",
              Options: nil,
              OutputPort: nil,
              Policy: nil,
              RouteTable: "",
          },
          {
              UUID: "05656f3b-320d-467c-9b86-909b01a8eb0f",
              BFD: nil,
              ExternalIDs: {
                  "name": "hybrid-subnet-node1-gr",
              },
              IPPrefix: "10.1.3.0/24",
              Nexthop: "100.64.0.1",
              Options: nil,
              OutputPort: nil,
              Policy: nil,
              RouteTable: "",
          },
      ]
  to have length 0

  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/hybrid_test.go:1375
------------------------------
Hybrid SDN Master Operations
  cleans up a Linux node when the OVN hostsubnet annotation is removed
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/hybrid_test.go:1401
I0521 23:40:23.317157   25592 config.go:2290] Default config: {MTU:1400 RoutableMTU:0 ConntrackZone:64000 HostMasqConntrackZone:64001 OVNMasqConntrackZone:64002 HostNodePortConntrackZone:64003 ReassemblyConntrackZone:64004 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 OfctrlWaitBeforeClear:0 MonitorAll:true LFlowCacheEnable:true LFlowCacheLimit:0 LFlowCacheLimitKb:0 RawClusterSubnets:10.1.0.0/16 ClusterSubnets:[10.1.0.0/16/24] EnableUDPAggregation:false Zone:global}
I0521 23:40:23.317231   25592 config.go:2291] Logging config: {File: CNIFile:/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log LibovsdbFile: Level:5 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:5 ACLLoggingRateLimit:20}
I0521 23:40:23.317293   25592 config.go:2292] Monitoring config: {RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]}
I0521 23:40:23.317333   25592 config.go:2293] IPFIX config: {Sampling:400 CacheActiveTimeout:60 CacheMaxFlows:0}
I0521 23:40:23.317370   25592 config.go:2294] CNI config: {ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay}

Anything else we need to know?

Sample failure Runs:

  1. https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/2176/pull-ci-openshift-ovn-kubernetes-master-unit/1793061734513119232
  2. Waiting for a link from @kyrtapz; it also failed on "Enable global IPv6 forwarding" #4376.
@tssurya tssurya added the kind/ci-flake Flakes seen in CI label May 23, 2024
@kyrtapz
Contributor

kyrtapz commented May 23, 2024

@tssurya
Contributor Author

tssurya commented Jul 8, 2024

Happening again:

2024-07-08T12:10:06.2183525Z • Failure [2.562 seconds]
2024-07-08T12:10:06.2184129Z Hybrid SDN Master Operations
2024-07-08T12:10:06.2185282Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/hybrid_test.go:132
2024-07-08T12:10:06.2186667Z   handles a HO node is switched to a OVN node [It]
2024-07-08T12:10:06.2188119Z   /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/hybrid_test.go:1212
2024-07-08T12:10:06.2189000Z 
2024-07-08T12:10:06.2189269Z   Timed out after 2.000s.
2024-07-08T12:10:06.2189765Z   Expected
2024-07-08T12:10:06.2190405Z       <[]*nbdb.LogicalRouterStaticRoute | len:2, cap:2>: [
2024-07-08T12:10:06.2191091Z           {
2024-07-08T12:10:06.2191917Z               UUID: "72b6171e-ab8d-4ad9-97c7-772e105fd96b",
2024-07-08T12:10:06.2192626Z               BFD: nil,
2024-07-08T12:10:06.2193169Z               ExternalIDs: {
2024-07-08T12:10:06.2194179Z                   "name": "hybrid-subnet-node1:node-windows",
2024-07-08T12:10:06.2195199Z               },
2024-07-08T12:10:06.2195748Z               IPPrefix: "10.1.3.0/24",
2024-07-08T12:10:06.2196431Z               Nexthop: "10.1.1.3",
2024-07-08T12:10:06.2197030Z               Options: nil,
2024-07-08T12:10:06.2197639Z               OutputPort: nil,
2024-07-08T12:10:06.2198237Z               Policy: nil,
2024-07-08T12:10:06.2198814Z               RouteTable: "",
2024-07-08T12:10:06.2199303Z           },
2024-07-08T12:10:06.2199672Z           {
2024-07-08T12:10:06.2200487Z               UUID: "30af3db2-9577-4d4a-b819-e11b727d41ec",
2024-07-08T12:10:06.2201348Z               BFD: nil,
2024-07-08T12:10:06.2201897Z               ExternalIDs: {
2024-07-08T12:10:06.2202777Z                   "name": "hybrid-subnet-node1-gr",
2024-07-08T12:10:06.2203408Z               },
2024-07-08T12:10:06.2203974Z               IPPrefix: "10.1.3.0/24",
2024-07-08T12:10:06.2204824Z               Nexthop: "100.64.0.1",
2024-07-08T12:10:06.2205439Z               Options: nil,
2024-07-08T12:10:06.2206032Z               OutputPort: nil,
2024-07-08T12:10:06.2206613Z               Policy: nil,
2024-07-08T12:10:06.2207198Z               RouteTable: "",
2024-07-08T12:10:06.2207678Z           },
2024-07-08T12:10:06.2208036Z       ]
2024-07-08T12:10:06.2208669Z   to have length 0
2024-07-08T12:10:06.2208958Z 
2024-07-08T12:10:06.2209798Z   /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/hybrid_test.go:1375

https://github.com/ovn-org/ovn-kubernetes/actions/runs/9839033631/job/27160221770?pr=4507

@trozet
Contributor

trozet commented Aug 4, 2024

also hit here: #4496 (comment)

I have a fix

@trozet trozet closed this as completed in 6e3bf49 Aug 8, 2024
dceara pushed a commit to tssurya/ovn-kubernetes that referenced this issue Aug 15, 2024
The issue was a race where the hybrid overlay node was being updated to
remove the windows label for testing. However, the update action itself
was with a blank original copy of the node which would overwrite l3
gateway config and other OVNK annotations with empty values, causing a
bunch of errors.

This changes the code to just patch and remove the labels, in order to
not corrupt any of the other aspects of the node object itself.

Fixes: ovn-kubernetes#4387

Signed-off-by: Tim Rozet <trozet@redhat.com>
@maiqueb
Contributor

maiqueb commented Oct 2, 2024

Seen again on #4750 (comment).

@maiqueb maiqueb reopened this Oct 2, 2024
@trozet
Contributor

trozet commented Oct 30, 2024

Seen here too: #4810 (comment)

The failure looks different iirc. Need to investigate it.
