
Multiple runs of istio-init container leads to crashloopbackoff status #42792

Closed
RaiAnandKr opened this issue Jan 12, 2023 · 9 comments
Assignees: dhawton
Labels: area/networking, area/test and release, area/user experience, lifecycle/automatically-closed, lifecycle/stale

Comments


RaiAnandKr commented Jan 12, 2023

Bug Description

We are testing Istio in a small test setup before expanding it into our production cluster. With sidecar injection enabled, the istio-init container in a pod completes just fine the first time, sets up all the iptables rules, and the sidecar proxy thereafter works fine too.
However, for reasons that we are "aware" of, the istio-init containers re-run without the pod restarting or anything: we run docker system prune -f as a cron job, which removes the exited istio-init container, and that tips off the kubelet to start the init container again. This situation has been discussed in the past, e.g. in kubernetes/kubernetes#67261.

Note: even though the container and the pod show Init:CrashLoopBackOff status after that, the app container and proxy container keep working fine, but the status is throwing us off.

However, I was under the impression that istio-init runs are idempotent, and a comment from around 3 years ago indicates the same: #18159 (comment)
Is that not the case anymore? We are still running iptables-restore with the --noflush flag, which makes this whole thing idempotent according to that comment.

The first run of the istio-init container in every pod completes just fine, but every run from the second onward fails. Here is the log of the istio-init container from a failed run:

2023-01-12T07:42:54.205490Z	info	Istio iptables environment:
ENVOY_PORT=
INBOUND_CAPTURE_PORT=
ISTIO_INBOUND_INTERCEPTION_MODE=
ISTIO_INBOUND_TPROXY_ROUTE_TABLE=
ISTIO_INBOUND_PORTS=
ISTIO_OUTBOUND_PORTS=
ISTIO_LOCAL_EXCLUDE_PORTS=
ISTIO_EXCLUDE_INTERFACES=
ISTIO_SERVICE_CIDR=
ISTIO_SERVICE_EXCLUDE_CIDR=
ISTIO_META_DNS_CAPTURE=
INVALID_DROP=

2023-01-12T07:42:54.205569Z	info	Istio iptables variables:
PROXY_PORT=15001
PROXY_INBOUND_CAPTURE_PORT=15006
PROXY_TUNNEL_PORT=15008
PROXY_UID=1337
PROXY_GID=1337
INBOUND_INTERCEPTION_MODE=REDIRECT
INBOUND_TPROXY_MARK=1337
INBOUND_TPROXY_ROUTE_TABLE=133
INBOUND_PORTS_INCLUDE=*
INBOUND_PORTS_EXCLUDE=15090,15021,15020
OUTBOUND_OWNER_GROUPS_INCLUDE=*
OUTBOUND_OWNER_GROUPS_EXCLUDE=
OUTBOUND_IP_RANGES_INCLUDE=*
OUTBOUND_IP_RANGES_EXCLUDE=
OUTBOUND_PORTS_INCLUDE=
OUTBOUND_PORTS_EXCLUDE=
KUBE_VIRT_INTERFACES=
ENABLE_INBOUND_IPV6=false
DNS_CAPTURE=false
DROP_INVALID=false
CAPTURE_ALL_DNS=false
DNS_SERVERS=[],[]
OUTPUT_PATH=
NETWORK_NAMESPACE=
CNI_MODE=false
HOST_NSENTER_EXEC=false
EXCLUDE_INTERFACES=

2023-01-12T07:42:54.205993Z	info	Writing following contents to rules file: /tmp/iptables-rules-1673509374205647672.txt397886627
* nat
-N ISTIO_INBOUND
-N ISTIO_REDIRECT
-N ISTIO_IN_REDIRECT
-N ISTIO_OUTPUT
-A ISTIO_INBOUND -p tcp --dport 15008 -j RETURN
-A ISTIO_REDIRECT -p tcp -j REDIRECT --to-ports 15001
-A ISTIO_IN_REDIRECT -p tcp -j REDIRECT --to-ports 15006
-A PREROUTING -p tcp -j ISTIO_INBOUND
-A ISTIO_INBOUND -p tcp --dport 15090 -j RETURN
-A ISTIO_INBOUND -p tcp --dport 15021 -j RETURN
-A ISTIO_INBOUND -p tcp --dport 15020 -j RETURN
-A ISTIO_INBOUND -p tcp -j ISTIO_IN_REDIRECT
-A OUTPUT -p tcp -j ISTIO_OUTPUT
-A ISTIO_OUTPUT -o lo -s 127.0.0.6/32 -j RETURN
-A ISTIO_OUTPUT -o lo ! -d 127.0.0.1/32 -m owner --uid-owner 1337 -j ISTIO_IN_REDIRECT
-A ISTIO_OUTPUT -o lo -m owner ! --uid-owner 1337 -j RETURN
-A ISTIO_OUTPUT -m owner --uid-owner 1337 -j RETURN
-A ISTIO_OUTPUT -o lo ! -d 127.0.0.1/32 -m owner --gid-owner 1337 -j ISTIO_IN_REDIRECT
-A ISTIO_OUTPUT -o lo -m owner ! --gid-owner 1337 -j RETURN
-A ISTIO_OUTPUT -m owner --gid-owner 1337 -j RETURN
-A ISTIO_OUTPUT -d 127.0.0.1/32 -j RETURN
-A ISTIO_OUTPUT -j ISTIO_REDIRECT
COMMIT
2023-01-12T07:42:54.206062Z	info	Running command: iptables-restore --noflush /tmp/iptables-rules-1673509374205647672.txt397886627
2023-01-12T07:42:54.208964Z	error	Command error output: xtables other problem: line 2 failed
2023-01-12T07:42:54.209009Z	error	Failed to execute: iptables-restore --noflush /tmp/iptables-rules-1673509374205647672.txt397886627, exit status 1

On our side, we can change the docker system prune -f job so that it does not prune istio-init containers, or even get rid of that task entirely, but I believe Kubernetes itself suggests making init containers idempotent:

Because init containers can be restarted, retried, or re-executed, init container code should be idempotent.

from https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#detailed-behavior
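
For completeness, one way we could scope the prune job (assuming Docker is the container runtime, the kubelet labels containers with io.kubernetes.container.name, and the installed Docker version supports label filters on prune) would be:

# Prune everything except objects labeled as istio-init containers.
docker system prune -f --filter "label!=io.kubernetes.container.name=istio-init"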

Can I get some help/direction here?

Version

$ istioctl version
client version: 1.16.1
control plane version: 1.16.1
data plane version: 1.16.1 (9 proxies)

$ kubectl version --short
Client Version: v1.21.1
Server Version: v1.20.13

Additional Information

No response

dhawton (Member) commented Jan 12, 2023

It appears to be erroring when attempting to recreate a chain that already exists. Ideally, you shouldn't be deleting containers externally, which is why the kubelet is rescheduling the init container. There's a problem here because iptables is not declarative: there's no way to say "the rules should look like X" and only change what needs changing... but if we flush and rebuild, we risk breaking existing or new connections while the rules are being rebuilt.
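
For context, the closest iptables itself comes to "only change what needs changing" is a per-rule existence check with -C; roughly (an illustration of that workaround only, not what istio-init does today):

# Create the chain if missing, then append a rule only if it is not already present.
iptables -t nat -N ISTIO_INBOUND 2>/dev/null || true
iptables -t nat -C ISTIO_INBOUND -p tcp --dport 15008 -j RETURN 2>/dev/null || \
  iptables -t nat -A ISTIO_INBOUND -p tcp --dport 15008 -j RETURN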

howardjohn (Member) commented:

Definitely cannot flush them. If someone has a simple proposal to make it work without flushing, I would be open to it. Maybe split "create chain" and "apply rules".

logic:

_ = CreateChains() // ignore error; second run would fail
return ApplyRules()

?

What happens when it fails? Do we apply no rules, all of the valid rules, or only the valid rules before the error?
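
To make the proposed split concrete in iptables terms, a rough sketch (not Istio's actual code; the append-only rules file name here is made up):

# Step 1: create the chains, ignoring "chain already exists" errors.
for chain in ISTIO_INBOUND ISTIO_REDIRECT ISTIO_IN_REDIRECT ISTIO_OUTPUT; do
  iptables -t nat -N "$chain" 2>/dev/null || true
done
# Step 2: apply only the -A rules (no -N lines) with --noflush so existing chains are kept.
# Caveat: a re-run still appends duplicate copies of every rule, so this alone is not fully idempotent.
iptables-restore --noflush /tmp/istio-append-only-rules.txt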

dhawton (Member) commented Jan 12, 2023

I believe at present it hits the failure to create the chains and stops. There's a config option to not use the restore method, but the other path relies on RunOrFail, which would likewise stop applying rules as soon as a command fails. I can take a look at this; I don't think there's an easy way around it for the restore method, but RunOrFail should be doable.

dhawton self-assigned this on Jan 12, 2023
howardjohn (Member) commented Jan 12, 2023 via email

RaiAnandKr (Author) commented:

you shouldn't be deleting containers externally, which is why kubelet is rescheduling the init-container.

That's true. We already have a task in the pipeline to stop doing that. Irrespective of that, I wanted to kick off a discussion about whether the init container is idempotent and, if not, whether it's straightforward to make it so.

but if we flush and rebuild, we risk breaking existing or new connections while the rules are being rebuilt.

Oh, we definitely shouldn't be doing that. In fact, #16768 moved us away from that deletion + re-creation combination as part of the init run, so we wouldn't want to go back there.

There was a proposal at #18159 to add a step that checks whether the iptables rules are already present and, if they are, just exits. I like option #3 proposed there, and @howardjohn, you seem to have been part of that conversation too. We backed out of the change, but if the reason for backing out no longer applies (I am not sure what it was), can we revive the same solution? (I am not sure what has changed with istio-init and the CNI plugin since then, and whether the same solution would still help.)
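
Something like this check-and-exit guard is what I have in mind (a sketch only, assuming REDIRECT mode, where the ISTIO_OUTPUT chain in the nat table is a reliable marker of a previous successful run):

# If the Istio chains are already in place, treat the run as a no-op.
if iptables -t nat -S ISTIO_OUTPUT >/dev/null 2>&1; then
  echo "Istio iptables rules already present; nothing to do"
  exit 0
fi
# ...otherwise fall through to the normal chain creation and rule application.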

howardjohn (Member) commented Jan 13, 2023 via email

RaiAnandKr (Author) commented Jan 30, 2023

@dhawton have you already decided how we want to tackle this (assuming we want to tackle it)? Let me know if you need my bandwidth for any implementation or testing work.

istio-policy-bot added the lifecycle/stale label on Apr 14, 2023
istio-policy-bot commented:

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2023-01-13. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

istio-policy-bot added the lifecycle/automatically-closed label on Apr 29, 2023
davidxia commented Jun 21, 2024

I also ran into this issue.

$ istioctl version
client version: 1.17.1
control plane version: 1.17.1
data plane version: 1.17.1 (352 proxies)

$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-gke.1447000
