
[FAIL] Kubevirt Virtual Machines when live migrated [It] with pre-copy should keep connectivity #3986

Open
trozet opened this issue Oct 30, 2023 · 18 comments · May be fixed by #4416
trozet commented Oct 30, 2023

Seeing failures with this test still:

Kubevirt Virtual Machines when live migrated [It] with pre-copy should keep connectivity
/home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:656

  [FAILED] Oct 25 23:45:54.083: Timed out after 60.010s.
  worker1: after live migrate, migration #1: Check tcp connection is not broken
  Expected success, but got an error:
      <*errors.errorString | 0xc0003bb4a0>: 
      http connection to virtual machine was broken
      {
          s: "http connection to virtual machine was broken",
      }
  In [It] at: /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:296 @ 10/25/23 23:45:54.084

  There were additional failures detected.  To view them in detail run ginkgo -vv

https://github.com/ovn-org/ovn-kubernetes/actions/runs/6646915605/job/18062050736?pr=3979

@trozet trozet added the kind/ci-flake Flakes seen in CI label Oct 30, 2023
npinaeva commented Nov 2, 2023

ricky-rav added a commit to ricky-rav/network-tools that referenced this issue Nov 14, 2023
Ovn-kubernetes upstream issues labeled with "ci-flake" are not tracked in JIRA: let's create a jira user story under the SDN-4175 epic for each such open issue:
- the assignee is kept whenever possible, otherwise bbennett is used.
- the summary (title) of these jira cards will start with upstream-$GITHUB_ISSUE_ID, where $GITHUB_ISSUE_ID is the ID found in the URL of the issue itself (e.g. upstream-3986 for ovn-org/ovn-kubernetes#3986)

A list of what jira stories have been created is printed to stdout, as well as a list of stories in the SDN-4175 epic whose status doesn't match the github issue they're tracking.

The new "-g" ("--process-github-issues") input parameter executes this new functionality, or the script alone without any input parameters.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
(The same commit was referenced five more times on Nov 14–15, 2023.)
@ricky-rav ricky-rav self-assigned this Nov 29, 2023
martinkennelly commented Dec 7, 2023

tssurya commented Feb 7, 2024

one more here: https://github.com/ovn-org/ovn-kubernetes/actions/runs/7789676840/job/21242664047?pr=4100 :/

#4100

@maiqueb or @qinqon : can one of you please fix this?

qinqon commented Feb 7, 2024

> one more here: https://github.com/ovn-org/ovn-kubernetes/actions/runs/7789676840/job/21242664047?pr=4100 :/
>
> #4100
>
> @maiqueb or @qinqon : can one of you please fix this?

@tssurya we are preparing a fix for it, it will be ready to merge today.

tssurya commented Feb 7, 2024

qinqon commented Feb 7, 2024

@tssurya we are going to start small #4140

We have some other ideas but we will go step by step.

qinqon commented Feb 8, 2024

> Thanks @qinqon
>
> seen again here: https://github.com/ovn-org/ovn-kubernetes/actions/runs/7814875289/job/21318048354?pr=4061

Adding the error here so I can track whether it's always the same one.

 [FAILED] Timed out after 60.001s.
  worker1: after live migration to node owning the subnet: Check tcp connection is not broken
  Expected success, but got an error:
      <*errors.errorString | 0xc000f52b70>: 
      http connection to virtual machine was broken
      {
          s: "http connection to virtual machine was broken",
      }
  In [It] at: /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:330 @ 02/07/24 13:22:07.982

qinqon commented Feb 8, 2024

I am reproducing this error on my fork, running 12 jobs in parallel with a simpler TCP server so it is easier to debug:
qinqon#10

qinqon#10 (comment)

qinqon commented Feb 9, 2024

After around 25 runs, this failure showed up with the stabilization PR:
https://github.com/ovn-org/ovn-kubernetes/actions/runs/7840556162/job/21395793644?pr=4145

2024-02-09T07:20:48.1135328Z   Latency metrics for node ovn-worker3
2024-02-09T07:20:48.1136685Z   STEP: Destroying namespace "kv-live-migration-923" for this suite. @ 02/09/24 07:20:48.113
2024-02-09T07:20:48.1181879Z • [FAILED] [419.375 seconds]
2024-02-09T07:20:48.1184612Z Kubevirt Virtual Machines when live migration [It] with pre-copy succeeds, should keep connectivity
2024-02-09T07:20:48.1187469Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:836
2024-02-09T07:20:48.1188240Z 
2024-02-09T07:20:48.1190075Z   [FAILED] worker1: after live migration for the second time to node not owning subnet: Check connectivity is restored after delete deny all network policy
2024-02-09T07:20:48.1191653Z   Expected success, but got an error:
2024-02-09T07:20:48.1192369Z       <*fmt.wrapError | 0xc000763d20>: 
2024-02-09T07:20:48.1193746Z       failed Write to server: write tcp 172.18.0.1:41446->172.18.0.3:31702: write: broken pipe
2024-02-09T07:20:48.1194723Z       {
2024-02-09T07:20:48.1196036Z           msg: "failed Write to server: write tcp 172.18.0.1:41446->172.18.0.3:31702: write: broken pipe",
2024-02-09T07:20:48.1197218Z           err: <*net.OpError | 0xc00123f9f0>{
2024-02-09T07:20:48.1197908Z               Op: "write",
2024-02-09T07:20:48.1198471Z               Net: "tcp",
2024-02-09T07:20:48.1199748Z               Source: <*net.TCPAddr | 0xc0006ed9b0>{IP: [172, 18, 0, 1], Port: 41446, Zone: ""},
2024-02-09T07:20:48.1201354Z               Addr: <*net.TCPAddr | 0xc0006edaa0>{IP: [172, 18, 0, 3], Port: 31702, Zone: ""},
2024-02-09T07:20:48.1202601Z               Err: <*os.SyscallError | 0xc000763d00>{
2024-02-09T07:20:48.1203335Z                   Syscall: "write",
2024-02-09T07:20:48.1203895Z                   Err: <syscall.Errno>0x20,
2024-02-09T07:20:48.1204307Z               },
2024-02-09T07:20:48.1204642Z           },
2024-02-09T07:20:48.1204937Z       }
2024-02-09T07:20:48.1206276Z   In [It] at: /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:401 @ 02/09/24 07:20:46.845

I will try to reproduce this too with the testing PR.

qinqon commented Feb 12, 2024

I have disabled interconnect and removed the network policy part of the test to make it simpler, and I am also printing all the echoes. This is the result.
Server logs

Server is running on: :9900
  2024/02/12 12:26:15 Handling connection 100.64.0.5:32870
  2024/02/12 12:26:15 Handling connection [fd98::5]:58614
  2024/02/12 12:28:33 failed copying data: readfrom tcp 10.244.1.8:9900->100.64.0.5:32870: splice: connection reset by peer
  2024/02/12 12:28:33 Closing connection 100.64.0.5:32870

Client logs

STEP: worker1: after live migration for the second time to node not owning subnet: Check tcp connection is not broken @ 02/12/24 13:07:39.455
  STEP: Writing 'Halo' @ 02/12/24 13:07:39.455
  STEP: Reading 'Halo' @ 02/12/24 13:07:39.455
  STEP: failed reading: read tcp 172.18.0.1:41882->172.18.0.4:30247: read: connection reset by peer  @ 02/12/24 13:07:39.465
  STEP: Writing 'Halo' @ 02/12/24 13:07:54.465
  STEP: failed writing: write tcp 172.18.0.1:41882->172.18.0.4:30247: write: broken pipe  @ 02/12/24 13:07:54.466
  STEP: Writing 'Halo' @ 02/12/24 13:08:09.466
  STEP: failed writing: write tcp 172.18.0.1:41882->172.18.0.4:30247: write: broken pipe  @ 02/12/24 13:08:09.467
  STEP: Writing 'Halo' @ 02/12/24 13:08:24.467
  STEP: failed writing: write tcp 172.18.0.1:41882->172.18.0.4:30247: write: broken pipe  @ 02/12/24 13:08:24.467

The test should not retry if the first error is "connection reset by peer", since retrying will not fix anything.

I am going to attach a tcpdump to the client to see if we are receiving an RST packet that triggers the "connection reset by peer" on the first echo.

qinqon commented Feb 26, 2024

This should appear far less often after #4174.

trozet commented Mar 29, 2024

qinqon commented Apr 1, 2024

https://github.com/ovn-org/ovn-kubernetes/actions/runs/8478907825/job/23232584804?pr=4246

@trozet adding this link to #4237 since it is the specific error related to the nameserver.

@qinqon qinqon linked a pull request Jun 4, 2024 that will close this issue
@flavio-fernandes
Also seen in #4468
