Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[calico] don't enable ipip encapsulation by default and use vxlan in CI #8434

Merged

Conversation

cristicalin
Copy link
Contributor

@cristicalin cristicalin commented Jan 15, 2022

What type of PR is this?

/kind bug
/kind failing-test

What this PR does / why we need it:
While debugging calico ebpf failures in CI for AlmaLinux 8 we discovered an issue with Alma/Centos 8.5 and ipip encapsulation preventing pod-to-pod communication. Upon further investigation and discussions with the Calico team, their recomandation is to use vxlan encapsulation on newer kernels instead of ipip due to performance reasons even though both should work in both iptables and ebpf mode.

For now this PR changes the default to ipip: false disabling ipip encapsulation and we continue to investigate the reason for the ipip failure.

During the development of this fix I discovered that the kubespray CI needs encapsulation for multi-node deployments and I had to enable vxlan for calico based CI tests.

Additionally to properly troubleshoot the CI failures I had to modify the network connectivity test and actually make it fail when netchecker agents could not connect to the netchecker-server. This new change exposed connectivity issues with the kube-router CNI which were worked around by moving these tests to vagrant instead of packet. There is still a know failure in running kube-router in service proxy mode which I was not able to fully address, given that vagrant jobs in general are allowed to tail and this test is left on manual this will not block future CI jobs.

Which issue(s) this PR fixes:

Fixes #8456

Special notes for your reviewer:

Given previous experience with changing defaults such as when changing the container_manager to containerd, I'm somewhat conflicted on the actual change and an alternative would be to just change the settings in the CI but, for now the change as is seems like the best course of action to me but I welcome input here.

Does this PR introduce a user-facing change?:

[calico] Use vxlan instead of ipip as the default calico encapsulation mode. This change impacts existing deployments that don't explicitly set the encapsulation mode and will need to set calico_ipip_mode: Always and calico_network_backend: bird to avoid the upgrade process breaking.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 15, 2022
@champtar
Copy link
Contributor

At my job we switched to vxlan crosssubnet by default a long time ago because IPIP offload is broken with many driver / firmware combination. I think RHEL started to enable IPIP offload by default (if supported) in 8.3 or 8.4 and I found at least 3 different broken cases. Nobody uses IPIP except Calico I think, so it's not tested, when vxlan is used by OpenStack and many cloud. Also CrossSubnet because why encapsulate when not needed.

@champtar
Copy link
Contributor

@cristicalin
Copy link
Contributor Author

Thanks for the extra information @champtar ! it seems to put one more nail on the ipip encapsulation coffin. We could consider enabling vxlan for CrossSubnet by default to ensure a smooth deployment out of the box on public clouds. That thought actually crossed my mind for this PR but I'm waiting for the CI results of this run first before adding more changes to the mix.

@cristicalin cristicalin marked this pull request as draft January 15, 2022 08:45
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 15, 2022
@cristicalin
Copy link
Contributor Author

I opened an issue on calico project side to track this: projectcalico/calico#5449

@cristicalin cristicalin marked this pull request as ready for review January 16, 2022 07:13
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 16, 2022
@cristicalin
Copy link
Contributor Author

cristicalin commented Jan 16, 2022

Since it appears we can run the CI without encapsulation, I'm not going to switch the vxlan encapsulation on by default in this PR to keep it clean.

The eBPF test case is actually manual so I wrote the above conclusion too soon. Keeping this back in draft.

@cristicalin cristicalin marked this pull request as draft January 16, 2022 18:14
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 16, 2022
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 17, 2022
@cristicalin cristicalin force-pushed the calico_encapsulation branch 7 times, most recently from 00f8e3a to 6a2a48f Compare January 19, 2022 10:45
@cristicalin
Copy link
Contributor Author

/cc @floryut @oomichi

@oomichi
Copy link
Contributor

oomichi commented Mar 18, 2022

Sorry for late response and thanks for updating.
This makes good enough test coverage and the change seems good for me.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 18, 2022
@k8s-ci-robot k8s-ci-robot merged commit dd2d95e into kubernetes-sigs:master Mar 18, 2022
@floryut floryut added kind/network Network section in the release note and removed kind/bug Categorizes issue or PR as related to a bug. labels Mar 18, 2022
@mazdakn
Copy link

mazdakn commented Apr 8, 2022

The reason IPIP was not working in Calico was due to an internal change to IPIP code in kernel. We identified the change, and fixed our code to respect the new behaviour. The fix (projectcalico/calico#5846) will be available in Calico v3.23.0.

sakuraiyuta pushed a commit to sakuraiyuta/kubespray that referenced this pull request Apr 16, 2022
…CI (kubernetes-sigs#8434)

* [calico] make vxlan encapsulation the default

* don't enable ipip encapsulation by default
* set calico_network_backend by default to vxlan
* update sample inventory and documentation

* [CI] pin default calico parameters for upgrade tests to ensure proper upgrade

* [CI] improve netchecker connectivity testing

* [CI] show logs for tests

* [calico] tweak task name

* [CI] Don't run the provisioner from vagrant since we run it in testcases_run.sh

* [CI] move kube-router tests to vagrant to avoid network connectivity issues during netchecker check

* service proxy mode still fails connectivity tests so keeping it manual mode

* [kube-router] account for containerd use-case
sathieu added a commit to sathieu/kubespray that referenced this pull request Apr 26, 2022
sathieu added a commit to sathieu/kubespray that referenced this pull request Apr 26, 2022
@oomichi oomichi mentioned this pull request May 28, 2022
LuckySB pushed a commit to southbridgeio/kubespray that referenced this pull request Jun 29, 2023
…CI (kubernetes-sigs#8434)

* [calico] make vxlan encapsulation the default

* don't enable ipip encapsulation by default
* set calico_network_backend by default to vxlan
* update sample inventory and documentation

* [CI] pin default calico parameters for upgrade tests to ensure proper upgrade

* [CI] improve netchecker connectivity testing

* [CI] show logs for tests

* [calico] tweak task name

* [CI] Don't run the provisioner from vagrant since we run it in testcases_run.sh

* [CI] move kube-router tests to vagrant to avoid network connectivity issues during netchecker check

* service proxy mode still fails connectivity tests so keeping it manual mode

* [kube-router] account for containerd use-case
LuckySB pushed a commit to southbridgeio/kubespray that referenced this pull request Jun 30, 2023
LuckySB pushed a commit to southbridgeio/kubespray that referenced this pull request Oct 23, 2023
…CI (kubernetes-sigs#8434)

* [calico] make vxlan encapsulation the default

* don't enable ipip encapsulation by default
* set calico_network_backend by default to vxlan
* update sample inventory and documentation

* [CI] pin default calico parameters for upgrade tests to ensure proper upgrade

* [CI] improve netchecker connectivity testing

* [CI] show logs for tests

* [calico] tweak task name

* [CI] Don't run the provisioner from vagrant since we run it in testcases_run.sh

* [CI] move kube-router tests to vagrant to avoid network connectivity issues during netchecker check

* service proxy mode still fails connectivity tests so keeping it manual mode

* [kube-router] account for containerd use-case
LuckySB pushed a commit to southbridgeio/kubespray that referenced this pull request Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/network Network section in the release note lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kube-router CI jobs are failing to create a working cluster
7 participants