linkerd cni plugin blocks pods initialisation on GKE #10849

Closed
kpraveersingh opened this issue May 3, 2023 · 7 comments
@kpraveersingh

What is the issue?

The linkerd CNI plugin behaves inconsistently when installed on a GKE cluster. During autoscaling, some linkerd-cni daemon pods run fine and do not block pod initialisation, because the plugin is installed in chained mode. However, when a linkerd-cni pod is created before the GKE CNI plugin is installed, i.e. before 10-gke-ptp.conflist is created, it does not find that file and creates its own file, 01-linkerd-cni.conf, instead. Pods on that node are then stuck in the Init state with the following error:

(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "": plugin type="linkerd-cni" name="linkerd-cni" failed (add): cannot convert: no valid IP addresses.

Is there a way for the daemon to wait until it finds the k8s CNI conf file, and only then add linkerd-cni in chained mode?

How can it be reproduced?

It is intermittent. It happens when linkerd-cni is installed before the k8s CNI conf file is created in /etc/cni/net.d.

Logs, error output, etc

(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "": plugin type="linkerd-cni" name="linkerd-cni" failed (add): cannot convert: no valid IP addresses.

When k8s cni conf is not found:
No active CNI configuration files found; installing in "interface" mode in /host/etc/cni/net.d/01-linkerd-cni.conf

When found:
Installing CNI configuration in "chained" mode for /host/etc/cni/net.d/10-gke-ptp.conflist
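
To check which mode a given node ended up in, the CNI directory can be inspected directly. The following is a minimal sketch, not part of linkerd: `linkerd_cni_mode` is a hypothetical helper, and it assumes the directory contents match the file names in the logs above (a standalone `01-linkerd-cni.conf` for "interface" mode, or a `linkerd-cni` entry inside another plugin's conf file for "chained" mode).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of linkerd): report how linkerd-cni appears
# to be installed in a CNI config directory, judging only by file contents.
linkerd_cni_mode() {
  local dir="${1:-/etc/cni/net.d}"

  # "interface" mode leaves a standalone linkerd config file behind.
  if [ -f "$dir/01-linkerd-cni.conf" ]; then
    echo "interface"
    return 0
  fi

  # "chained" mode: another plugin's conf/conflist mentions linkerd-cni.
  local file
  for file in "$dir"/*.conflist "$dir"/*.conf; do
    [ -f "$file" ] || continue   # skip unmatched glob patterns
    if grep -q 'linkerd-cni' "$file"; then
      echo "chained"
      return 0
    fi
  done

  echo "none"
}
```

Run on an affected node as `linkerd_cni_mode /etc/cni/net.d`; the string match on `linkerd-cni` is a heuristic and does not validate the JSON.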

output of linkerd check -o short

linkerd-version

‼ can determine the latest version
unexpected versioncheck response: 403 Forbidden
see https://linkerd.io/2.13/checks/#l5d-version-latest for hints
‼ cli is up-to-date
unsupported version channel: stable-2.13.2
see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version

‼ control plane is up-to-date
unsupported version channel: stable-2.13.2
see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy

‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-dd8f7cc48-hs8d5 (stable-2.13.2)
* linkerd-identity-fd6b4d8b7-tv2qk (stable-2.13.2)
* linkerd-proxy-injector-7d79958b59-4jrlw (stable-2.13.2)
see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints

I have disabled version check cron.

Environment

  • K8s version: 1.24.10
  • GKE
  • container optimised (COS) GKE
  • Linkerd 2.13.2

Possible solution

daemon should wait till it finds the k8s cni conf file

Additional context

No response

Would you like to work on fixing this bug?

None

@kpraveersingh
Author

For now, I have got it working by adding an init container to the daemon that sleeps for a second. However, I am not sure whether this has any ramifications.

@kpraveersingh
Author

The network validator init container gets stuck, which is expected behaviour given its implementation. Restarting the pod manually or rolling out the workload resolves the issue. However, a comprehensive solution would be for the controller to restart the pod, rather than requiring a manual restart.

@alpeb
Member

alpeb commented May 12, 2023

Thanks for the detailed description and the follow-ups. We're currently working on a solution for this. We'll let you know when we have something that you can test. OTOH, if you manage to find a way to reproduce this consistently, it'd be of great help! :-)

@kpraveersingh
Author

Thanks @alpeb for responding. I can reproduce it consistently when scaling the nodes from a dozen to dozens of dozens: the issue then affects 25-40% of the workload. Since I introduced a delay of 5 seconds on the CNI daemon, the first race condition no longer appears. Basically, I wait for GKE to create its own CNI config, and then the daemon appends the linkerd configuration.

The network validator is a great way to make sure pods don't go crazy. However, these pods get scheduled as soon as the node is up, and the network validator is an init container in each of them. Unlike my workaround for the linkerd-cni plugin, they do not wait.

If the linkerd CNI plugin waited until GKE creates its CNI config, without needing an explicit fixed delay, this issue would be solved. Alternatively, restarting the entire pod when network validation fails would fix it as well.

@kpraveersingh
Author

kpraveersingh commented Jun 9, 2023

I need some clarification:
https://github.com/linkerd/linkerd2/blob/main/cni-plugin/deployment/scripts/install-cni.sh#L332
Here, if the k8s CNI conf file doesn't exist, the linkerd-cni file is created. Why can't it wait here until the CNI conf is created? I am speaking specifically from experience with GKE. Here is a change that has fixed the issue for me:

# Wait until another CNI plugin has dropped its config file.
config_file_count=0
retry_count=0
while [ "$config_file_count" -eq 0 ]; do
  sleep 2
  # Count conf/conflist files that do not belong to linkerd.
  config_file_count=$(find "${HOST_CNI_NET}" -maxdepth 1 -type f \( -iname '*conflist' -o -iname '*conf' \) | grep -v linkerd | sort | wc -l)

  retry_count=$((retry_count + 1))
  if [ "$retry_count" -eq 60 ]; then
    echo "Max retry limit exceeded"
    exit 1
  fi
done

sleep 2

# Install linkerd-cni in chained mode into every config file found.
find "${HOST_CNI_NET}" -maxdepth 1 -type f \( -iname '*conflist' -o -iname '*conf' \) -print0 |
  while read -r -d $'\0' file; do
    echo "Installing CNI configuration in \"chained\" mode for $file"
    install_cni_conf "$file"
  done

I want to understand if there is any harm with it.
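
For comparison, the same waiting idea can be wrapped in a function that takes the directory and retry budget as arguments, which makes it easier to try outside the installer. This is only a sketch under assumptions: `wait_for_cni_conf` and its parameters are hypothetical names, not part of `install-cni.sh`. It checks before the first sleep rather than after, and it filters on the file name rather than on the full path.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from install-cni.sh): wait until a CNI config that
# does not belong to linkerd shows up in a directory, or give up.
#   $1 = directory to watch, $2 = max attempts, $3 = seconds between attempts
wait_for_cni_conf() {
  local dir="$1" max_retries="${2:-60}" interval="${3:-2}"
  local attempt=0

  while [ "$attempt" -lt "$max_retries" ]; do
    # Succeed as soon as any *.conf / *.conflist file exists whose name
    # does not contain "linkerd"; grep -q . is true if find printed anything.
    if find "$dir" -maxdepth 1 -type f \
         \( -iname '*.conflist' -o -iname '*.conf' \) \
         ! -iname '*linkerd*' | grep -q .; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done

  echo "Max retry limit exceeded waiting for a CNI config in $dir" >&2
  return 1
}
```

Calling `wait_for_cni_conf "${HOST_CNI_NET}" 60 2` would approximate the limits used in the loop above. Returning a status instead of calling `exit` leaves the decision about what to do on timeout to the caller.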

@alpeb
Member

alpeb commented Jun 29, 2023

if the k8s cni conf file doesn't exist, linkerd-cni file is created. Why can't it wait here till cni conf is created?

Right, the "interface" mode, where an empty linkerd-cni config file was created even if no other CNI plugin had a chance to drop its config, has been abandoned. Please try with a more recent linkerd version to test that out. However, there remains a corner case described in #11073

@risingspiral
Contributor

This was fixed with linkerd/linkerd2-proxy-init#242 and released as part of edge-23.6.1

@risingspiral risingspiral added this to the stable-2.13.6 milestone Jul 17, 2023
hawkw added a commit that referenced this issue Aug 9, 2023
This stable release fixes a regression introduced in stable-2.13.0 which
resulted in proxies shedding load too aggressively while under moderate
request load to a single service ([#11055]). In addition, it updates the
base image for the `linkerd-cni` initcontainer to resolve a CVE in
`libdb` ([#11196]), fixes a race condition in the Destination controller
that could cause it to crash ([#11163]), as well as fixing a number of
other issues.

* Control Plane
  * Fixed a race condition in the destination controller that could
    cause it to panic ([#11169]; fixes [#11193])
  * Improved the granularity of logging levels in the control plane
    ([#11147])
  * Replaced incorrect `server_port_subscribers` gauge in the
    Destination controller's metrics with `server_port_subscribes` and
    `server_port_unsubscribes` counters ([#11206]; fixes [#10764])

* Proxy
  * Changed the default HTTP request queue capacities for the inbound
    and outbound proxies back to 10,000 requests ([#11198]; fixes
    [#11055])

* CLI
  * Updated extension CLI commands to prefer the `--registry` flag over
    the `LINKERD_DOCKER_REGISTRY` environment variable, making the
    precedence more consistent (thanks @harsh020!) (see [#11144])

* CNI
  * Updated `linkerd-cni` base image to resolve [CVE-2019-8457] in
    `libdb` ([#11196])
  * Changed the CNI plugin installer to always run in 'chained' mode;
    the plugin will now wait until another CNI plugin is installed
    before appending its configuration ([#10849])
  * Removed `hostNetwork: true` from linkerd-cni Helm chart templates
    ([#11158]; fixes [#11141]) (thanks @abhijeetgauravm!)

* Multicluster
  * Fixed the `linkerd multicluster check` command failing in the
    presence of lots of mirrored services ([#10764])

[#10764]: #10764
[#10849]: #10849
[#11055]: #11055
[#11141]: #11141
[#11144]: #11144
[#11147]: #11147
[#11158]: #11158
[#11163]: #11163
[#11169]: #11169
[#11196]: #11196
[#11198]: #11198
[#11206]: #11206
[CVE-2019-8457]: https://avd.aquasec.com/nvd/2019/cve-2019-8457/
@hawkw hawkw mentioned this issue Aug 9, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 17, 2023