Update vpp version #9

Closed
glazychev-art opened this issue Jul 21, 2023 · 19 comments

@glazychev-art

glazychev-art commented Jul 21, 2023

https://github.com/networkservicemesh/govpp/blob/main/Dockerfile#L1

TODO

  1. Update vpp to the latest main version. Because it contains commits that were not merged into the latest release (af_xdp for example)
  2. The work with memif interfaces (abstract sockets) has been changed. We need to figure out how to configure it now.
  3. It makes sense to update git.fd.io/govpp.git to the latest release (https://github.com/FDio/govpp). This can help with unstable calico-vpp tests. (Calico-vpp updates govpp regularly)
  4. Pass cmd-forwarder-vpp docker-tests
  5. Pass integration-tests
@szvincze

szvincze commented Aug 4, 2023

We have used the artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f image in our tests.
The AF_PACKET related issue described in VPP-2081 occurred again but this time right after deployment.
Forwarder restart has not improved the situation but made it worse since more forwarders started suffering from the same problem.

@bellycat77

> We have used the artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f image in our tests. The AF_PACKET related issue described in VPP-2081 occurred again but this time right after deployment. Forwarder restart has not improved the situation but made it worse since more forwarders started suffering from the same problem.

Do you have SELinux enabled? Could you try to disable it and test if the issue appears again?

@denis-tingaikin
Member

@szvincze I think we might need to check other versions of forwarder-vpp. It could help to detect problems.

v1.9.0
v1.8.0
v1.7.1
v1.6.2

If you get a chance, please check whether the problem also occurs with these versions ☝️

@glazychev-art
Author

glazychev-art commented Aug 8, 2023

@szvincze @ljkiraly
I have a few questions:

  1. I'm not sure that SLES is fully supported in VPP - https://s3-docs.fd.io/vpp/23.10/aboutvpp/supported.html
    Is it possible to check the problem on openSUSE?

  2. I have rebuilt the forwarder-vpp with af-packet v3 - artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_af_packet_v3
    Please take a look

  3. Can you please tell if the problems are observed on the NSC interfaces? How many clients (approximately) are required?

UPDATE:

  1. Can the issue be reproduced on other OSes (e.g. Ubuntu?)

  2. Could the problem be related to the data size? Maybe there is a problem with MTU...
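
One way to test the MTU hypothesis is a ping with the "don't fragment" flag set, sized to fill a given MTU exactly. The sketch below assumes IPv4 and the Linux iputils `ping` (`-M do` forbids fragmentation); the helper names are hypothetical and not part of NSM or VPP.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, IPv4 is assumed.

# ICMP payload size that exactly fills an IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
  local mtu=$1
  echo $(( mtu - 28 ))
}

# Send three pings with fragmentation forbidden. If they fail while smaller
# sizes succeed, the path MTU is likely below the value being probed.
probe_mtu() {
  local target=$1 mtu=$2
  ping -c 3 -M do -s "$(icmp_payload_for_mtu "$mtu")" "$target"
}
```

For example, `probe_mtu <peer-ip> 1500` from inside the NSC, repeated with smaller values, would show whether large packets are being dropped on the path.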

@glazychev-art
Author

Additional questions:

  1. Could you please share your configurations? How do you run multiple forwarders? Are they on the same node?

  2. Could you share your environment? Is SLES a node or container operating system?

@robaganrab

Hello,
SELinux is disabled on the impacted cluster.
Various versions cannot be easily tested due to limitations on hardware and manpower (packaging a given version is not necessarily trivial, and the testing cycle is roughly one build per day, sometimes two).

@glazychev-art

  1. Yes, the issue was there with openSUSE as well.
  2. We are working on testing this.
  3. I think we have observed it on both the NSE and NSC sides. Up to 10 clients.
  4. The issue was reproduced with an Ubuntu base a month or so ago.
  5. We need to check the MTU settings. I will let you know what we find.
  6. VPP is deployed as a DaemonSet, with one pod on each eligible k8s node. On top of NSM/VPP we use Meridio, which utilizes the extra networking and creates a complete NSC-NSE cross connect to handle higher-level load balancing (e.g. a sticky-session-like setup).
  7. The issue manifests on a "CNIS" cluster. This is a bare metal k8s setup using Ericsson's CCD. SLES is used as the operating system on the bare metal compute nodes, which are joined into the k8s cluster. By default, the NSM/VPP/Spire components are packaged with SLES as the base Docker image.

Best Regards,
Gábor Barna

@robaganrab

Hello,
The issue was reproduced on a relatively new kernel: 5.14.21-150400.24.63-default. Search for 5.14.21-150400.24.63 on the page.
Best Regards,
Gábor Barna

@glazychev-art
Author

Thanks @robaganrab !

If you have a chance, could you also check this forwarder image: artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3

Also, when you see the problem, please attach the command output from vpp: show hardware-interfaces

@denis-tingaikin
Member

Hello, @robaganrab , @szvincze

I'd like to suggest three ways to diagnose the problem.

  1. Test artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3
  2. I think we could also collect some diagnostics (before/after reproducing):

2.1. vppctl show int
2.2. vppctl show hardware
2.3. vppctl trace add af-packet-input 1000 (do it once)
2.4. vppctl show trace (after reproducing)

  3. Also, as @edwarnicke pointed out, we could try to disable tap and use veth interfaces instead (just cut this and rebuild the forwarder: https://github.com/networkservicemesh/sdk-vpp/blob/main/pkg/networkservice/mechanisms/kernel/client.go#L36)
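
The before/after diagnostics could be collected with a small wrapper like the one below. This is a sketch, not an official tool: the function names and output file names are hypothetical, and the `VPPCTL` variable is an assumption that lets the same script run inside the forwarder container or through something like `kubectl exec <forwarder-pod> -- vppctl`. Only the four vppctl commands listed above are used.

```shell
#!/usr/bin/env bash
# Sketch only: function and file names are hypothetical.
# VPPCTL may be overridden, e.g. VPPCTL="kubectl exec <forwarder-pod> -- vppctl".
VPPCTL=${VPPCTL:-vppctl}

# Save interface and hardware state under a label ("before" / "after").
vpp_snapshot() {
  local label=$1 outdir=${2:-.}
  $VPPCTL show int      > "$outdir/$label-show-int.txt"
  $VPPCTL show hardware > "$outdir/$label-show-hardware.txt"
}

# Arm the af-packet-input trace for 1000 packets (run once, before reproducing).
vpp_trace_arm() {
  $VPPCTL trace add af-packet-input 1000
}

# Dump whatever the trace captured (run after reproducing).
vpp_trace_dump() {
  local outdir=${1:-.}
  $VPPCTL show trace > "$outdir/show-trace.txt"
}
```

A typical run would be `vpp_snapshot before`, `vpp_trace_arm`, reproduce the failure, then `vpp_snapshot after` and `vpp_trace_dump`, and attach the resulting text files to the issue.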

@robaganrab

Hello,
We managed to get everything built and set up for a test run with artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3. The test will include the vppctl commands and will hopefully provide useful output.
Best Regards,
Gábor Barna

@ljkiraly

Hello @denis-tingaikin and @glazychev-art,
The test results with artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3 are quite good; the AF_PACKET issue wasn't seen during the test run.
What is the way forward? Which govpp is the base of this image? Or is the SHA c3f505fe7b7f a vpp commit ID? We would like to build a SUSE based image from this version of vpp/govpp.
Regards,
Laszlo

@denis-tingaikin
Member

Hello @ljkiraly

That's great news!

We used this patch: #11

It's based on this vpp commit: https://gerrit.fd.io/r/gitweb?p=vpp.git;a=commit;h=c3f505fe7b7fbecb35494863a6c9de3cad6e6d2d

@szvincze

@denis-tingaikin, @glazychev-art: There were more attempts with the image that Artem provided and the issue did not occur at all. So, it seems to be fine.

However, as @ljkiraly mentioned above, there was a test with a newly built image based on SUSE and the issue happened immediately. We will give it another try with an openSUSE based image as well, to see whether SUSE is the culprit or openSUSE suffers from the same thing. We will keep you informed.

@glazychev-art
Author

@szvincze
Got it, thanks!

Did you use this PR to build your image - #11?

Now there are two main suspects:

  1. Calico-vpp patches
  2. You are using tap interfaces, not af_packet.

Therefore, it would be very cool to see vppctl show hardware-interfaces output from your forwarder (on failed tests)

@szvincze

@glazychev-art: Yes, we used that PR for building the image.
For the next trial I requested the output you mentioned.

@glazychev-art
Author

The fact is that for artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3 image, I locally removed calico-patches from this PR #11, and also removed the tap support (as Ed said and Denis described here)

@ljkiraly

Hi @glazychev-art,
By calico patches you mean: patch/0004-capo-Calico-Policies-plugin.patch?
Can you elaborate on how the tap support was removed? This would help us to build an openSUSE/SUSE based image similar to yours.
BR/Laszlo

@glazychev-art
Author

glazychev-art commented Aug 31, 2023

@ljkiraly
I mean these patches:

govpp/patch/patch.sh

Lines 15 to 22 in 5482a9a

# Calico cherry picks
git_cherry_pick refs/changes/26/34726/3 # 34726: interface: add buffer stats api | https://gerrit.fd.io/r/c/vpp/+/34726
if [ "$(ls ./patch/*.patch 2> /dev/null)" ]; then
    git apply patch/*.patch
    git add --all
    git commit -m "misc patches"
fi

There is 1 cherry-pick and 5 patch files.
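
For readers unfamiliar with Gerrit refs: a ref such as refs/changes/26/34726/3 encodes the change number (34726), its last two digits (26), and the patchset (3). The sketch below shows how such a ref maps to the review URL quoted in the script comment, and how a cherry-pick helper might fetch it. The function names are hypothetical, and `git_cherry_pick_sketch` is an assumption about what the `git_cherry_pick` helper in patch.sh does, not a copy of it.

```shell
#!/usr/bin/env bash
# Sketch only: names are hypothetical; the fetch URL is an assumption.
GERRIT_BASE=${GERRIT_BASE:-https://gerrit.fd.io/r}

# refs/changes/26/34726/3 -> 34726
gerrit_change_number() {
  local ref=$1
  local rest=${ref#refs/changes/*/}   # drop "refs/changes/26/", leaving "34726/3"
  echo "${rest%%/*}"                  # drop the "/3" patchset suffix
}

# Review page for the change behind a ref.
gerrit_change_url() {
  echo "$GERRIT_BASE/c/vpp/+/$(gerrit_change_number "$1")"
}

# Fetch one Gerrit patchset and apply it on top of the current branch.
git_cherry_pick_sketch() {
  local ref=$1
  git fetch "$GERRIT_BASE/vpp" "$ref" && git cherry-pick FETCH_HEAD
}
```

Dropping the Calico patches then amounts to removing the `git_cherry_pick` line and the patch files from the build, which is what the PRs mentioned below do in one variant.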

By the way, I've prepared new PRs today (they contain the latest main vpp version) and removed the calico-patches in one of them:
#12
#13
You can use it

To disable tap interfaces just delete these lines:
https://github.com/networkservicemesh/sdk-vpp/blob/main/pkg/networkservice/mechanisms/kernel/client.go#L35-L37
and
https://github.com/networkservicemesh/sdk-vpp/blob/main/pkg/networkservice/mechanisms/kernel/server.go#L35-L37
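
Since both links point at the main branch, the line numbers may have shifted by the time you read this, so it is worth printing the lines before deleting them. A minimal sketch with hypothetical helper names, assuming a local checkout of sdk-vpp:

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Print lines START..END of FILE so they can be reviewed first.
show_lines() {
  sed -n "${2},${3}p" "$1"
}

# Delete lines START..END of FILE, going through a temp file so the
# same code works with both GNU and BSD sed.
delete_lines() {
  local file=$1 start=$2 end=$3 tmp
  tmp=$(mktemp) || return 1
  if sed "${start},${end}d" "$file" > "$tmp"; then
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}
```

For example, run `show_lines pkg/networkservice/mechanisms/kernel/client.go 35 37`, confirm the output is the tap-related block, then call `delete_lines` with the same arguments and repeat for server.go.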

@glazychev-art
Author

I think we can close the issue, because all PRs have been merged.
The discussion has moved to networkservicemesh/cmd-forwarder-vpp#927
