
inpod redirection mode #747

Merged
merged 28 commits into from Jan 18, 2024

Conversation

yuval-k
Contributor

@yuval-k yuval-k commented Dec 8, 2023

In-pod redirection mode.
For context see: https://docs.google.com/document/d/1dynKlnNgIOywv3cwuw_2RCk_SKFRs7IrESaC8_r-sm0/edit

This adds a new optional redirection mode to ztunnel - redirecting traffic from within the pod namespace.

A few details on the code:

  • ztunnel communicates with a new node agent (@bleggett will open a PR for it soon) over a unix socket
  • The node agent provides ztunnel with information and the net-ns for pods local to the node
  • In the code we call this protocol ZDS
  • Each pod gets its own instance of a ztunnel proxy::Proxy, which starts its sockets in the pod's netns
  • We've created a SocketFactory trait to abstract the netns details away from proxy::Proxy (a sketch follows below)
  • Most of the code is in the new "inpod" module

Note that ztunnel still works with the old redirection mode; the new mode kicks in when INPOD_ENABLED is set to true.
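For illustration, here is a minimal sketch of what the socket-factory abstraction could look like; the actual trait in the PR may differ, and the method names here are assumptions:

use std::io;
use std::net::SocketAddr;
use tokio::net::{TcpListener, TcpSocket};

// Hypothetical shape of the abstraction: the proxy asks a factory for sockets
// instead of creating them directly, so an inpod factory can enter the pod's
// netns before binding, while the default factory stays in ztunnel's own netns.
trait SocketFactory {
    fn new_tcp_v4(&self) -> io::Result<TcpSocket>;
    fn tcp_bind(&self, addr: SocketAddr) -> io::Result<TcpListener>;
}

struct DefaultSocketFactory;

impl SocketFactory for DefaultSocketFactory {
    fn new_tcp_v4(&self) -> io::Result<TcpSocket> {
        TcpSocket::new_v4()
    }
    fn tcp_bind(&self, addr: SocketAddr) -> io::Result<TcpListener> {
        let sock = self.new_tcp_v4()?;
        sock.bind(addr)?;
        sock.listen(128)
    }
}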

Summary of changes:

File Change
proto/zds.proto ZDS protocol messages
scripts/ztunnel-redirect-inpod.sh used in namespaced tests to redirect traffic
src/app.rs start inpod machinery if inpod mode is enabled
src/admin.rs support dynamic admin handler (inpod mode only has a handler if it is enabled)
src/dns/server.rs wire-in socket factory
src/config.rs config options for inpod
src/inpod.rs initialize the workload-proxy manager
src/inpod/config.rs inpod socket factory and its config
src/inpod/admin.rs admin endpoint for inpod - showing currently managed pods
src/inpod/netns.rs specialized netns code suited for inpod
src/inpod/metrics.rs inpod metrics
src/inpod/protocol.rs ZDS implementation that parses the protocol
src/inpod/packet.rs SEQPACKET UDS impl
src/inpod/workloadmanager.rs manages workload lifecycle. driven by events coming from protocol.rs
src/inpod/statemanager.rs manages the running proxies
src/proxy.rs wire-in socket factory
src/proxy/inbound_passthrough.rs wire-in socket factory
src/proxy/inbound.rs wire-in socket factory
src/proxy/socks5.rs wire-in socket factory
src/proxy/outbound.rs wire-in socket factory
src/socket.rs add methods to set the packet mark (see the sketch after this table)
src/proxyfactory.rs factory to create proxies using a socket factory
src/test_helpers/app.rs support admin_request inside the ztunnel netns
src/test_helpers/linux.rs deploy ztunnel in inpod mode
src/test_helpers/inpod.rs start a ZDS server, simulating the node agent
tests/namespaced.rs add a test for in-pod mode
src/test_helpers/netns.rs support returning value from run()
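For the src/socket.rs row above ("add methods to set the packet mark"), here is a rough, hedged sketch of how a mark could be applied using the socket2 crate; the real helper in the PR may be implemented differently (the mark value 1337 comes from a later commit in this PR):

use std::io;
use socket2::{Domain, Protocol, Socket, Type};

// Illustration only: create a TCP socket and set a packet mark on it so
// redirection rules inside the pod can tell ztunnel's own traffic apart from
// application traffic. Linux-specific and requires CAP_NET_ADMIN.
fn new_marked_tcp_socket(mark: u32) -> io::Result<Socket> {
    let sock = Socket::new(Domain::IPV4, Type::STREAM, Some(Protocol::TCP))?;
    sock.set_mark(mark)?; // SO_MARK; the PR later settles on mark 1337
    Ok(sock)
}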

@yuval-k yuval-k requested review from a team as code owners December 8, 2023 19:38
@istio-testing istio-testing added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 8, 2023
@yuval-k
Contributor Author

yuval-k commented Dec 8, 2023

part of istio/istio#48212

proto/zds.proto: outdated review thread, resolved
@linsun
Member

linsun commented Dec 11, 2023

@hzxuzhonghu or @howardjohn can you please review this so we can make progress on the PR?

cc @stevenctl

@howardjohn
Member

@hzxuzhonghu or @howardjohn can you please review this so we can make progress on the PR?

cc @stevenctl

@linsun yes I am looking, but this is among the largest PRs (especially when considered together with istio/istio#48253), so please set expectations accordingly. It will not be done within a week.

@howardjohn howardjohn added the do-not-merge/hold Block automatic merging of a PR. label Dec 11, 2023
@howardjohn
Member

howardjohn commented Dec 11, 2023

Also I have been reviewing the istio/istio side first fwiw

@linsun
Member

linsun commented Dec 11, 2023

Ok, just wanted to make sure folks are reviewing, as we don't see much feedback so far.

pub struct WorkloadProxyManager {
state: super::statemanager::WorkloadProxyManagerState,
networking: WorkloadProxyNetworkHandler,
// readiness - we are only ready when we are connected. if we get disconnected, we become not ready.
Member

what is the meaning/use of transitioning from ready -> not ready?

Currently it's only not ready -> ready.

Contributor Author

My thinking was that the user wants to be aware every time ztunnel is not connected to the CNI agent (as new pods will not get traffic).
The way to signal that is to make ztunnel not-ready when it loses connectivity to the node agent.

So the meaning of "Ready" here is that ztunnel is ready to operate on new pods (i.e. connected to the node agent).
WDYT?
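To make those semantics concrete, a minimal sketch (not the PR's actual code; names are assumptions) of a readiness flag that follows the ZDS connection state:

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Ready only while the ZDS connection to the node agent is up; flips back to
// not-ready if that connection drops, since new pods cannot be set up until
// ztunnel reconnects.
#[derive(Clone, Default)]
struct ZdsReadiness(Arc<AtomicBool>);

impl ZdsReadiness {
    fn set_connected(&self, connected: bool) {
        self.0.store(connected, Ordering::SeqCst);
    }
    fn is_ready(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}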


// using reference counts to account for possible race between the proxy task that notifies us
// that a proxy is down, and the proxy factory task that notifies us when it is up.
#[serde(skip_serializing, skip_deserializing)]
Member

Suggested change
#[serde(skip_serializing, skip_deserializing)]
#[serde(skip)]

);
registry.register(
"inpod_proxies_stopped",
"The current number of active inpod proxies",
Member

Is this description correct? Should it be inactive / stopped?
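A small sketch of the fix implied by this comment, assuming the prometheus-client crate's Registry API (a guess based on the registry.register(...) call shown above); the metric type and wording are not the PR's actual code:

use prometheus_client::metrics::counter::Counter;
use prometheus_client::registry::Registry;

fn register_stopped_proxies(registry: &mut Registry) -> Counter {
    let proxies_stopped = Counter::default();
    // Help text now describes stopped proxies rather than active ones.
    registry.register(
        "inpod_proxies_stopped",
        "The total number of inpod proxies that have been stopped",
        proxies_stopped.clone(),
    );
    proxies_stopped
}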

Some(metrics) => Some(Arc::new(metrics)),
None => {
if config.proxy {
error!("dns proxy configured but no dns metrics provided")
Member

Suggested change
error!("dns proxy configured but no dns metrics provided")
error!("proxy configured but no metrics provided")

Member

It is sent in JSON, so why not define the struct directly?

Contributor Author

This struct is sent as a binary proto. Where do you see it sent as JSON?

Member

Ok, I see the proto API in the istio PR too.

Contributor

Any particular reason for using protobuf for small same-machine messages? gRPC and proto are great, but may be overkill - lots of debugging and possible integrations would be simpler with just JSON.

@linsun
Member

linsun commented Dec 19, 2023

@hzxuzhonghu @howardjohn any further comments? It has been out for review for 2 weeks - is this ready to go? It should be minimal risk - the ztunnel PR should be safe to merge, and it is not used/exercised until the corresponding istio PR is merged. After the PR is merged, ztunnel can still work with the old CNI.

This blocks Yuval from doing other work for L4 ambient:

@linsun
Member

linsun commented Dec 19, 2023

@yuval-k should we remove the hold label?

@linsun linsun removed the do-not-merge/hold Block automatic merging of a PR. label Dec 19, 2023
@yuval-k
Contributor Author

yuval-k commented Dec 19, 2023

@yuval-k should we remove the hold label?

@howardjohn placed the label, so I'll direct the question to him.

As far as I am concerned this can be merged once reviewed.

@howardjohn
Member

howardjohn commented Dec 19, 2023 via email

- move admin handler from metrics
- feed metrics to inpod instead of registry
- remove inpod prefix from metrics
- change mark to 1337
- better error message when mark fails
- minor clean-ups
@@ -0,0 +1,114 @@
// Copyright Istio Authors
Member

We have two data sources for "local workloads": WDS and the new UDS model.

Once we get the UID from the UDS, we start the proxy in the network namespace.
Now we can let the pod start (TBH not sure how we block it, but I think you said we do?).
Pod starts, sends an outbound request.
The proxy that is running is ONLY running for that workload, but we still (1) need to have the source workload from WDS and (2) try to guess certificates, etc based on source IP.

Inbound has similar logic.

This feels fishy in a few ways (but likely more I haven't thought of):

  • If we are still doing IP based checks, we haven't really resolved the ip spoofing attacks, right?
  • We block the pod running until the proxy starts, but it may not actually be ready to handle traffic since we haven't gotten the WDS response

I feel like I had more concerns, but I was thinking about this late last night and didn't write them down...

WDYT? I get that this PR allows both inpod and the old model, and the changes I am proposing are large and invasive. So I don't necessarily want to tackle them in this PR, if at all -- but it would be good to understand the gaps now and the path to resolving them in the future.

Contributor Author

I planned to do this as a follow-up; LMK what you think.

Contributor Author

We block the pod running until the proxy starts, but it may not actually be ready to handle traffic since we haven't gotten the WDS response

I think the answer for this concern is on_demand

Contributor Author

Note that this also requires a change on the node agent side, as the UID from WDS is different from the UID in UDS - we may want to re-think this part.

@costinm
Contributor

costinm commented Jan 5, 2024 via email

@yuval-k
Contributor Author

yuval-k commented Jan 5, 2024

Pod delete cleanup is not time sensitive or critical - node agent can just send a list of namespaces that shouldn't exist. Or the cni plugin could do this - I remember it is notified.

I agree that pod delete clean-up is not time sensitive.

What is the problem you are trying to solve? What part of the delete flow do you think is too complex?

Could you perhaps sketch out your proposal in code, or in more technical detail, including the proposed changes to the code in both the node agent and ztunnel?
These details would let me better understand and evaluate your proposal, and see whether it simplifies the current state.

@costinm
Contributor

costinm commented Jan 6, 2024 via email

note that SO_REUSEPORT only works if both old and new ztunnel have the same UID
@yuval-k
Contributor Author

yuval-k commented Jan 8, 2024

@howardjohn added SO_REUSEPORT, which "solves" the upgrade problem (and it should actually drain connections gracefully - though I didn't test this thoroughly yet). The main limitation is that the new ztunnel and old ztunnel need to have the same UID (enforced by Linux for port reuse). LMK what you think.

Note that I do still think that upgrading is a bigger story than just this change.
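A minimal sketch of the binding approach described above, using the socket2 crate (illustrative only; the PR's actual code may differ):

use std::io;
use std::net::SocketAddr;
use socket2::{Domain, Protocol, Socket, Type};

// Both the old and the new ztunnel bind the same port with SO_REUSEPORT, so the
// new process can start accepting connections while the old one drains. The
// kernel only allows the shared bind when both sockets are owned by the same UID.
fn bind_reuseport(addr: SocketAddr) -> io::Result<Socket> {
    let sock = Socket::new(Domain::for_address(addr), Type::STREAM, Some(Protocol::TCP))?;
    sock.set_reuse_port(true)?;
    sock.bind(&addr.into())?;
    sock.listen(128)?;
    Ok(sock)
}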

@bleggett
Contributor

bleggett commented Jan 8, 2024

I'm not sure what the big problem with using protobufs is. As I mentioned, we plan to expand these messages to have more fields. And we need to read them from multiple languages. Why work hard and hand-write serialization? I haven't needed to do that since my college days.

+1 on this.

The fact that we have a single version-controlled file declaring the schema and documenting the protocol which both the client and server explicitly refer to is far and away worth more than the trivial encoding cost, for the purposes of intelligibility and developer discoverability.

I'm not against picking a simpler or more efficient serialization protocol/schema format than protobuf (e.g. Cap'n Proto) - but there aren't many, and not having one at all makes this much harder to understand and use.
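To illustrate the trade-off, a tiny hedged example of a prost-derived message; the field here is made up, and the real messages are defined in proto/zds.proto:

use prost::Message;

// Made-up message: the point is that one checked-in .proto schema gives typed,
// versionable messages that the Rust ztunnel and a node agent in another
// language can both generate code from, instead of hand-written JSON handling.
#[derive(Clone, PartialEq, ::prost::Message)]
struct ExamplePodSnapshot {
    #[prost(string, tag = "1")]
    uid: String,
}

fn roundtrip() -> Result<ExamplePodSnapshot, prost::DecodeError> {
    let msg = ExamplePodSnapshot { uid: "example-pod-uid".to_string() };
    let bytes = msg.encode_to_vec(); // one datagram on the SEQPACKET UDS
    ExamplePodSnapshot::decode(bytes.as_slice())
}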

Co-authored-by: John Howard <howardjohn@google.com>
@bleggett
Contributor

bleggett commented Jan 8, 2024

FYI: containerd/containerd#8085. Note it's not implemented, so it's more just a reference.

Yeah +1 on what yuval suggested - the node-level CNI plugin that already has full node privs could definitely talk to the CRI and get the FD, and propagate it directly. That would be a good followup optimization, and IMO very preferable to giving ztunnel those privs.

@hzxuzhonghu
Member

If ztunnel has those privs, it won't need to communicate with the CNI plugin for the netns but with the CRI, right?

@bleggett
Contributor

bleggett commented Jan 9, 2024

If ztunnel has those privs, it won't need to communicate with the CNI plugin for the netns but with the CRI, right?

Yes, but it would still need to communicate with the node agent, so all that would really do in practice is spread elevated permissions among more components, rather than fewer, which is not desirable.

@yuval-k
Contributor Author

yuval-k commented Jan 9, 2024

If ztunnel has those privs, it won't need to communicate with the CNI plugin for the netns but with the CRI, right?

Giving ztunnel access to the CRI is giving it privileges: if ztunnel communicates with the CRI instead of the node agent, that gives it the power of root. Because you can use the CRI to spin up privileged containers with host mounts, anyone that can access the CRI is effectively root.

@hzxuzhonghu
Member

Got it, that is what I think. Recently I talked to a guy from containerd. He said the CRI can be used to get the netns; I think the tricky point is how to make the startup ordered without a race.

@yuval-k
Contributor Author

yuval-k commented Jan 9, 2024

I believe I responded to all comments; let me know if I missed anything.
cc @howardjohn @keithmattix @costinm

@linsun
Member

linsun commented Jan 12, 2024

@howardjohn @hzxuzhonghu - is this one ready to be approved? It would be great to get it into release-1.21 before branching.

@linsun
Member

linsun commented Jan 18, 2024

A friendly ping to @howardjohn when he gets a minute, since he said it should be good to go yesterday. This is blocking a few other PRs as well.

Member

@howardjohn howardjohn left a comment

Looks like a good start to iterate on

@istio-testing istio-testing merged commit bc68182 into istio:master Jan 18, 2024
3 checks passed