inpod redirection mode #747
Conversation
part of istio/istio#48212
@hzxuzhonghu or @howardjohn can you please review this so we can make progress on the PR? cc @stevenctl
@linsun yes I am looking, but this is among the largest PRs (especially when considered alongside istio/istio#48253), so please set expectations accordingly. It will not be done within a week.
Also, I have been reviewing the istio/istio side first, FWIW.
OK, just wanted to make sure folks are reviewing, as we don't see much feedback so far.
pub struct WorkloadProxyManager {
    state: super::statemanager::WorkloadProxyManagerState,
    networking: WorkloadProxyNetworkHandler,
    // readiness - we are only ready when we are connected. if we get disconnected, we become not ready.
What is the meaning/use of transitioning from ready -> not ready?
Currently it's only not ready -> ready.
My thinking was that the user wants to be aware every time the ztunnel is not connected to the CNI agent (as new pods will not get traffic).
The way to signal that is to make the ztunnel not-ready when it loses connectivity to the node agent.
So the meaning of "Ready" here is that ztunnel is ready to operate on new pods (i.e. connected to the node agent).
WDYT?
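For illustration, a minimal sketch of that ready/not-ready toggling, using a plain atomic flag as a stand-in for ztunnel's actual readiness type (the names here are hypothetical):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical readiness flag: set while connected to the node agent,
/// cleared on disconnect so it is visible that new pods won't be served.
#[derive(Clone, Default)]
struct Readiness(Arc<AtomicBool>);

impl Readiness {
    fn set_ready(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn set_not_ready(&self) {
        self.0.store(false, Ordering::SeqCst);
    }
    fn is_ready(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

fn main() {
    let readiness = Readiness::default();
    // Connected to the node agent: ready to set up proxies for new pods.
    readiness.set_ready();
    // Lost the connection: new pods can't be picked up, so report not-ready.
    readiness.set_not_ready();
    assert!(!readiness.is_ready());
}
```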
src/inpod/admin.rs
// using reference counts to account for a possible race between the proxy task that notifies us
// that a proxy is down, and the proxy factory task that notifies us when it is up.
#[serde(skip_serializing, skip_deserializing)]
Suggested change:
- #[serde(skip_serializing, skip_deserializing)]
+ #[serde(skip)]
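For reference, `#[serde(skip)]` skips the field in both directions (on deserialize the field is filled from `Default`), so the suggestion should be behavior-preserving here. A small illustrative struct (field names are made up for the example):

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, Default)]
struct ProxyEntry {
    workload_uid: String,
    // Equivalent to #[serde(skip_serializing, skip_deserializing)]:
    // omitted when serializing, populated via Default when deserializing.
    #[serde(skip)]
    refcount: usize,
}

fn main() -> serde_json::Result<()> {
    let entry: ProxyEntry = serde_json::from_str(r#"{"workload_uid":"uid-1234"}"#)?;
    assert_eq!(entry.refcount, 0);
    println!("{}", serde_json::to_string(&entry)?); // refcount is not emitted
    Ok(())
}
```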
src/inpod/metrics.rs
);
registry.register(
    "inpod_proxies_stopped",
    "The current number of active inpod proxies",
Is this description correct? Should it be inactive / stopped?
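A possible fix, sketched with the prometheus-client crate; the metric name matches the snippet above, but the help text and helper function are illustrative, not necessarily what the PR ends up with:

```rust
use prometheus_client::metrics::counter::Counter;
use prometheus_client::registry::Registry;

fn register_inpod_metrics(registry: &mut Registry) -> Counter {
    let proxies_stopped = Counter::default();
    registry.register(
        "inpod_proxies_stopped",
        // Describe stopped proxies rather than active ones.
        "The total number of inpod proxies that have been stopped",
        proxies_stopped.clone(),
    );
    proxies_stopped
}

fn main() {
    let mut registry = Registry::default();
    let stopped = register_inpod_metrics(&mut registry);
    stopped.inc();
}
```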
src/proxyfactory.rs
Some(metrics) => Some(Arc::new(metrics)),
None => {
    if config.proxy {
        error!("dns proxy configured but no dns metrics provided")
error!("dns proxy configured but no dns metrics provided") | |
error!("proxy configured but no metrics provided") |
It is sent as JSON, so why not define the struct directly?
This struct is sent as binary proto. Where do you see it sent as JSON?
OK, I see the proto API in the istio PR too.
Any particular reason for using protobuf for a small same-machine message? gRPC and proto are great, but may be overkill - lots of debugging and possible integrations would be simpler with plain JSON.
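To make the trade-off concrete, here is roughly what the JSON variant could look like with serde; the message shape is hypothetical for illustration and is not the actual proto API referenced in the istio PR:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical message shape; the real protocol is defined in a .proto file
// shared between ztunnel and the node agent.
#[derive(Serialize, Deserialize, Debug)]
#[serde(tag = "kind", rename_all = "snake_case")]
enum NodeAgentMessage {
    AddWorkload { workload_uid: String },
    DelWorkload { workload_uid: String },
    SnapshotSent,
}

fn main() -> serde_json::Result<()> {
    let msg = NodeAgentMessage::AddWorkload { workload_uid: "uid-1234".into() };
    // Human-readable on the wire, which is what makes ad-hoc debugging easier.
    let wire = serde_json::to_string(&msg)?;
    println!("{wire}"); // {"kind":"add_workload","workload_uid":"uid-1234"}
    let parsed: NodeAgentMessage = serde_json::from_str(&wire)?;
    println!("{parsed:?}");
    Ok(())
}
```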
@hzxuzhonghu @howardjohn any further comments? It has been out for review for 2 weeks - is this ready to go? It should be minimal risk - the ztunnel PR should be safe to merge, and it is not used/exercised until the corresponding istio PR is merged. After the PR is merged, ztunnel can still work with the old CNI. This blocks Yuval from doing other work for L4 ambient:
@yuval-k should we remove the hold label?
@howardjohn placed the label, so I'll direct the question to him. As far as I am concerned this can be merged once reviewed.
Hold was just to make sure it was reviewed by interested parties and didn't get 1 approval and accidentally merge if others were still in review.
…ess (so we can set it to 127.0.0.1)
- move admin handler from metrics
- feed metrics to inpod instead of registry
- remove inpod prefix from metrics
- change mark to 1337
- better error message when mark fails
- minor clean-ups
@@ -0,0 +1,114 @@
// Copyright Istio Authors |
We have two data sources for "local workloads": WDS and the new UDS model.
Once we get the UID from the UDS, we start the proxy in the network namespace.
Now we can let the pod start (TBH not sure how we block it, but I think you said we do?).
Pod starts, sends an outbound request.
The proxy that is running is ONLY running for that workload, but we still (1) need to have the source workload from WDS and (2) try to guess certificates, etc. based on source IP.
Inbound has similar logic.
This feels fishy in a few ways (but likely more I haven't thought of):
- If we are still doing IP-based checks, we haven't really resolved the IP spoofing attacks, right?
- We block the pod running until the proxy starts, but it may not actually be ready to handle traffic since we haven't gotten the WDS response
I feel like I had more concerns, but I was thinking about this late last night and didn't write them down...
WDYT? I get this PR allows both inpod and the old model, and the changes I am proposing are large and invasive. So I don't necessarily want to tackle them in this PR, if at all -- but it would be good to understand the gaps now and the path to resolving them in the future?
here are my thoughts:
https://github.com/yuval-k/ztunnel/tree/inpod-dest-id
I planned to do this as a follow-up; LMK what you think.
> We block the pod running until the proxy starts, but it may not actually be ready to handle traffic since we haven't gotten the WDS response
I think the answer for this concern is on_demand
Note that this also requires a change on the node agent side, as the UID from WDS is different from the UID in UDS - we may want to re-think this part.
Pod delete cleanup is not time sensitive or critical - the node agent can just send a list of namespaces that shouldn't exist. Or the CNI plugin could do this - I recall it is notified.
…On Wed, Jan 3, 2024, 17:40 Yuval Kohavi wrote:
In src/inpod/statemanager.rs <#747 (comment)>:
> + info!(
+ "pod {} received netns, starting proxy",
+ poddata.info.workload_uid
+ );
+ if !self.snapshot_received {
+ self.snapshot_names
+ .insert(poddata.info.workload_uid.clone());
+ }
+ let netns = InpodNetns::new(self.inpod_config.cur_netns(), poddata.netns)
+ .map_err(|e| Error::ProxyError(crate::proxy::Error::Io(e)))?;
+
+ self.add_workload(poddata.info, netns)
+ .await
+ .map_err(Error::ProxyError)
+ }
+ WorkloadMessage::KeepWorkload(workload_uid) => {
re @costinm's comment here <#747 (comment)>:
> The problem is that 'complex' is rarely safe - and most of the time
Sorry, I'm aware.
> and not at all clear what 'remove' is supposed to do, the pod will be deleted anyways
Let me clarify.
When a pod is deleted, its network namespace doesn't get deleted. That's
because network namespaces in Linux can't be explicitly deleted. They
automatically get cleaned up when nothing references them. Therefore,
ztunnel needs some mechanism to know that a pod is deleted, so it can clean
up its resources.
Our current approach is that the node agent tells ztunnel that pods were
deleted. Therefore, if the ztunnel was disconnected from the node
agent, when it reconnects we need to reconcile the full state of local pods
on the node, to account for pods that may have been deleted while the
ztunnel was disconnected.
We do this by sending a bunch of AddWorkload messages, followed by a
SnapshotSent message on the initial connection. The ztunnel will clean up
any workloads that it didn't see in one of these AddWorkload messages
sent prior to the SnapshotSent message.
After the SnapshotSent message, the protocol becomes delta-like, with
AddWorkload / DelWorkload sent as needed.
If/When the node agent restarts, it needs to reconstruct its local state.
That state is composed of local pods -> netns mapping. The node agent uses
heuristics to reconstruct said state.
The (potentially non-existent [see above]) problem that KeepWorkload aims
to solve is to account for a possible temporary issue with that heuristic,
by allowing the node agent to tell the ztunnel when it reconnects that,
even though it doesn't have the netns for a certain pod, the ztunnel
shouldn't remove it after the SnapshotSent message is sent.
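A rough sketch of the reconcile-on-reconnect flow described above, with hypothetical stand-in types rather than ztunnel's actual state manager:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for the real protocol and state types.
enum WorkloadMessage {
    AddWorkload { uid: String /* netns fd elided */ },
    KeepWorkload(String),
    DelWorkload(String),
    SnapshotSent,
}

#[derive(Default)]
struct State {
    proxies: HashMap<String, ()>,   // uid -> running proxy (elided)
    snapshot_uids: HashSet<String>, // uids replayed before SnapshotSent
    snapshot_received: bool,
}

impl State {
    fn handle(&mut self, msg: WorkloadMessage) {
        match msg {
            WorkloadMessage::AddWorkload { uid } => {
                if !self.snapshot_received {
                    self.snapshot_uids.insert(uid.clone());
                }
                self.proxies.entry(uid).or_insert(()); // start proxy in the pod netns
            }
            // KeepWorkload: part of the snapshot, but the node agent couldn't
            // recover the netns; don't drain this workload at reconcile time.
            WorkloadMessage::KeepWorkload(uid) => {
                if !self.snapshot_received {
                    self.snapshot_uids.insert(uid);
                }
            }
            WorkloadMessage::DelWorkload(uid) => {
                self.proxies.remove(&uid); // drop the last netns reference
            }
            // After the snapshot, drop any proxy whose uid was not replayed:
            // its pod may have been deleted while we were disconnected.
            WorkloadMessage::SnapshotSent => {
                self.snapshot_received = true;
                let keep = std::mem::take(&mut self.snapshot_uids);
                self.proxies.retain(|uid, _| keep.contains(uid));
            }
        }
    }
}

fn main() {
    let mut state = State::default();
    state.handle(WorkloadMessage::AddWorkload { uid: "pod-a".into() });
    state.handle(WorkloadMessage::KeepWorkload("pod-b".into()));
    state.handle(WorkloadMessage::SnapshotSent);
    assert!(state.proxies.contains_key("pod-a"));
}
```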
I agree that pod delete clean-up is not time sensitive. What is the problem you are trying to solve? What part of the delete flow do you think is too complex? Could you perhaps sketch out your proposal in code, or in more technical detail that includes the proposed changes to the code in both the node agent and the ztunnel?
On Fri, Jan 5, 2024 at 5:29 PM Yuval Kohavi wrote:
In src/inpod/workloadmanager.rs <#747 (comment)>:
> +}
+
+struct WorkloadProxyManagerProcessor<'a> {
+ state: &'a mut super::statemanager::WorkloadProxyManagerState,
+ readiness: &'a mut WorkloadProxyReadinessHandler,
+
+ next_pending_retry: Option<std::pin::Pin<Box<tokio::time::Sleep>>>,
+}
+
+impl WorkloadProxyReadinessHandler {
+ fn new(ready: readiness::Ready) -> Self {
+ let mut r = Self {
+ ready,
+ block_ready: None,
+ };
+ r.not_ready();
> Re upgrade, supporting SxS upgrades will involve more than solving the
> ports. So probably worth a separate discussion. I was under the impression
> that our initial upgrade strategy was to cordon/drain node before an upgrade
Simple cordon/drain is currently the only safe option we have.
We may find ways to upgrade without node drain - but it would be ideal to
implement that in a separate binary/module - and keep the
core simple. We already have far too much complexity - cordon just works
with no extra code and will likely remain the safest option.
Adding a lot of code to deal with some other mode is likely to add risks to
the simple mode and regular operation.
IMO it is time to do whatever needs to be done to get an initial version
working - with cordon and roughly this PR - and
start working on a clean CNI without the legacy and complexity. Perhaps
with others - i.e. define the requirements ('a CNI should
allow a Pod to send/receive UDS file descriptors') and work with K8S
upstream to extend the core specs, with a minimal reference
implementation free of Istio-specific logic.
The SDS/certificate UDS socket is another example involving creating UDS
listeners in the pod namespace - and it will be a powerful
general IPC mechanism, since it enables passing file descriptors (and
memory regions) between pods. There is vast evidence from /dev/binder that it is
a powerful and viable mechanism.
So let's get this in, test it - and start the cleanup/de-legacyfication as
a long-term separate effort. Even if this PR was super simple -
the rest of Istio CNI is full of complexity and far too hard.
Note that SO_REUSEPORT only works if both the old and new ztunnel have the same UID.
@howardjohn added SO_REUSEPORT, which "solves" the upgrade problem (and it should actually drain connections gracefully - though I didn't test this thoroughly yet). The main limitation is that the new ztunnel and the old ztunnel need to have the same UID (enforced by Linux for port reuse). LMK what you think. Note that I do still think that upgrading is a bigger story than just this change.
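For reference, a minimal sketch of binding a listener with SO_REUSEPORT, assuming the socket2 crate (the address and port are illustrative); the kernel only lets sockets owned by the same effective UID share a port this way, which is the limitation mentioned above:

```rust
use socket2::{Domain, Protocol, Socket, Type};
use std::net::{SocketAddr, TcpListener};

fn reuseport_listener(addr: SocketAddr) -> std::io::Result<TcpListener> {
    let socket = Socket::new(Domain::for_address(addr), Type::STREAM, Some(Protocol::TCP))?;
    // Allow the new ztunnel to bind the same port while the old one still
    // holds it; the kernel spreads new connections across both sockets.
    socket.set_reuse_port(true)?;
    socket.set_reuse_address(true)?;
    socket.bind(&addr.into())?;
    socket.listen(128)?;
    Ok(socket.into())
}

fn main() -> std::io::Result<()> {
    let listener = reuseport_listener("127.0.0.1:15001".parse().unwrap())?;
    println!("listening on {}", listener.local_addr()?);
    Ok(())
}
```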
+1 on this. The fact that we have a single version-controlled file declaring the schema and documenting the protocol which both the client and server explicitly refer to is far and away worth more than the trivial encoding cost, for the purposes of intelligibility and developer discoverability. I'm not against picking a simpler or more efficient serialization protocol/schema format than protobuf (e.g. Cap'n Proto) - but there aren't many, and not having one at all makes this much harder to understand and use.
Co-authored-by: John Howard <howardjohn@google.com>
Yeah, +1 on what Yuval suggested - the node-level CNI plugin that already has full node privs could definitely talk to the CRI and get the FD, and propagate it directly. That would be a good follow-up optimization, and IMO very preferable to giving ztunnel those privs.
If ztunnel had those privs, it wouldn't need to communicate with the CNI plugin for the netns, but with the CRI instead, right?
Yes, but it would still need to communicate with the node agent, so all that would really do in practice is spread elevated permissions among more components, rather than fewer, which is not desirable.
Giving the ztunnel access to the CRI is giving it privileges: if ztunnel communicates with the CRI instead of the node agent, that gives it the power of root. Because you can use the CRI to spin up privileged containers with host mounts, anyone that can access the CRI is effectively root.
Got it, that is what I think too. Recently I talked to a guy from containerd. He said the CRI can be used to get the netns; I think the tricky point is how to order the startup without a race.
I believe I responded to all comments; let me know if I missed anything.
@howardjohn @hzxuzhonghu - is this one ready to be approved? It would be great to get it into release-1.21 before branching.
A friendly ping to @howardjohn when he gets a min, since he said it should be good to go yesterday. This is blocking a few other PRs as well.
Looks like a good start to iterate on
In-pod redirection mode.
For context see: https://docs.google.com/document/d/1dynKlnNgIOywv3cwuw_2RCk_SKFRs7IrESaC8_r-sm0/edit
This adds a new optional redirection mode to ztunnel - redirecting traffic from within the pod namespace.
A few details on the code:
- An instance of `proxy::Proxy` that starts its sockets in the pod's netns
- A `trait SocketFactory` to abstract the netns details from the `proxy::Proxy`
Note that ztunnel would still work with the old redirection mode. The new mode kicks in when `INPOD_ENABLED` is set to `true`.
Summary of changes:
- `run()`
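For illustration only, roughly how such an opt-in flag could be read; ztunnel's actual config parsing may differ, and the helper name here is hypothetical:

```rust
use std::env;

// The new in-pod redirection mode is opt-in: it only kicks in when
// INPOD_ENABLED is set to "true"; otherwise the existing redirection
// mode keeps working unchanged.
fn inpod_enabled() -> bool {
    env::var("INPOD_ENABLED")
        .map(|v| v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}

fn main() {
    if inpod_enabled() {
        println!("starting in-pod redirection mode");
    } else {
        println!("starting legacy redirection mode");
    }
}
```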