inpod redirection mode #747
Conversation
part of istio/istio#48212
@hzxuzhonghu or @howardjohn can you please review this so we can make progress on the PR? cc @stevenctl
@linsun yes I am looking, but this is among the largest PRs (especially when considered alongside istio/istio#48253), so please set expectations accordingly. It will not be done within a week.
Also, I have been reviewing the istio/istio side first, FWIW.
OK, just wanted to make sure folks are reviewing, as we don't see much feedback so far.
pub struct WorkloadProxyManager {
    state: super::statemanager::WorkloadProxyManagerState,
    networking: WorkloadProxyNetworkHandler,
    // readiness - we are only ready when we are connected. if we get disconnected, we become not ready.
What is the meaning/use of transitioning from ready -> not ready?
Currently it's only not ready -> ready.
My thinking was that the user wants to be aware every time the ztunnel is not connected to the CNI agent (as new pods will not get traffic).
The way to signal that is to make the ztunnel not-ready when it loses connectivity to the node agent.
So the meaning of "Ready" here is that ztunnel is ready to operate on new pods (i.e. connected to the node agent).
WDYT?
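For illustration, a minimal sketch of that ready/not-ready toggling, using a plain atomic flag as a stand-in for ztunnel's actual readiness type (the names here are hypothetical):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical readiness flag: set while connected to the node agent,
/// cleared on disconnect so it is visible that new pods won't be served.
#[derive(Clone, Default)]
struct Readiness(Arc<AtomicBool>);

impl Readiness {
    fn set_ready(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn set_not_ready(&self) {
        self.0.store(false, Ordering::SeqCst);
    }
    fn is_ready(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

fn main() {
    let readiness = Readiness::default();
    // Connected to the node agent: ready to set up proxies for new pods.
    readiness.set_ready();
    // Lost the connection: new pods can't be picked up, so report not-ready.
    readiness.set_not_ready();
    assert!(!readiness.is_ready());
}
```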
src/inpod/admin.rs
// using reference counts to account for a possible race between the proxy task that notifies us
// that a proxy is down, and the proxy factory task that notifies us when it is up.
#[serde(skip_serializing, skip_deserializing)]
Suggested change:
- #[serde(skip_serializing, skip_deserializing)]
+ #[serde(skip)]
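For reference, `#[serde(skip)]` skips the field in both directions (on deserialize the field is filled from `Default`), so the suggestion should be behavior-preserving here. A small illustrative struct (field names are made up for the example):

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, Default)]
struct ProxyEntry {
    workload_uid: String,
    // Equivalent to #[serde(skip_serializing, skip_deserializing)]:
    // omitted when serializing, populated via Default when deserializing.
    #[serde(skip)]
    refcount: usize,
}

fn main() -> serde_json::Result<()> {
    let entry: ProxyEntry = serde_json::from_str(r#"{"workload_uid":"uid-1234"}"#)?;
    assert_eq!(entry.refcount, 0);
    println!("{}", serde_json::to_string(&entry)?); // refcount is not emitted
    Ok(())
}
```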
src/inpod/metrics.rs
);
registry.register(
    "inpod_proxies_stopped",
    "The current number of active inpod proxies",
Is this description correct? Should it be inactive / stopped?
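A possible fix, sketched with the prometheus-client crate; the metric name matches the snippet above, but the help text and helper function are illustrative, not necessarily what the PR ends up with:

```rust
use prometheus_client::metrics::counter::Counter;
use prometheus_client::registry::Registry;

fn register_inpod_metrics(registry: &mut Registry) -> Counter {
    let proxies_stopped = Counter::default();
    registry.register(
        "inpod_proxies_stopped",
        // Describe stopped proxies rather than active ones.
        "The total number of inpod proxies that have been stopped",
        proxies_stopped.clone(),
    );
    proxies_stopped
}

fn main() {
    let mut registry = Registry::default();
    let stopped = register_inpod_metrics(&mut registry);
    stopped.inc();
}
```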
src/proxyfactory.rs
Some(metrics) => Some(Arc::new(metrics)),
None => {
    if config.proxy {
        error!("dns proxy configured but no dns metrics provided")
error!("dns proxy configured but no dns metrics provided") | |
error!("proxy configured but no metrics provided") |
It is sent as JSON, so why not define the struct directly?
This struct is sent as binary proto. Where do you see it sent as JSON?
OK, I see the proto API in the istio PR too.
Any particular reason for using protobuf for a small same-machine message? gRPC and proto are great, but may be overkill - lots of debugging and possible integrations would be simpler with plain JSON.
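To make the trade-off concrete, here is roughly what the JSON variant could look like with serde; the message shape is hypothetical for illustration and is not the actual proto API referenced in the istio PR:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical message shape; the real protocol is defined in a .proto file
// shared between ztunnel and the node agent.
#[derive(Serialize, Deserialize, Debug)]
#[serde(tag = "kind", rename_all = "snake_case")]
enum NodeAgentMessage {
    AddWorkload { workload_uid: String },
    DelWorkload { workload_uid: String },
    SnapshotSent,
}

fn main() -> serde_json::Result<()> {
    let msg = NodeAgentMessage::AddWorkload { workload_uid: "uid-1234".into() };
    // Human-readable on the wire, which is what makes ad-hoc debugging easier.
    let wire = serde_json::to_string(&msg)?;
    println!("{wire}"); // {"kind":"add_workload","workload_uid":"uid-1234"}
    let parsed: NodeAgentMessage = serde_json::from_str(&wire)?;
    println!("{parsed:?}");
    Ok(())
}
```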
@hzxuzhonghu @howardjohn any further comments? It has been out for review for 2 weeks - is this ready to go? It should be minimal risk - the ztunnel PR should be safe to merge, and it is not used/exercised until the corresponding istio PR is merged. After the PR is merged, ztunnel can still work with the old CNI. This blocks Yuval from doing other work for L4 ambient:
@yuval-k should we remove the hold label?
@howardjohn placed the label, so I'll direct the question to him. As far as I am concerned this can be merged once reviewed.
Hold was just to make sure it was reviewed by interested parties and didn't get 1 approval and accidentally merge if others were still in review.
…ess (so we can set it to 127.0.0.1)
- move admin handler from metrics
- feed metrics to inpod instead of registry
- remove inpod prefix from metrics
- change mark to 1337
- better error message when mark fails
- minor clean-ups
@@ -0,0 +1,114 @@
// Copyright Istio Authors |
We have two data sources for "local workloads": WDS and the new UDS model.
Once we get the UID from the UDS, we start the proxy in the network namespace.
Now we can let the pod start (TBH not sure how we block it, but I think you said we do?).
Pod starts, sends an outbound request.
The proxy that is running is ONLY running for that workload, but we still (1) need to have the source workload from WDS and (2) try to guess certificates, etc. based on source IP.
Inbound has similar logic.
This feels fishy in a few ways (but likely more I haven't thought of):
- If we are still doing IP-based checks, we haven't really resolved the IP spoofing attacks, right?
- We block the pod running until the proxy starts, but it may not actually be ready to handle traffic since we haven't gotten the WDS response
I feel like I had more concerns, but I was thinking about this late last night and didn't write them down...
WDYT? I get this PR allows both inpod and the old model, and the changes I am proposing are large and invasive. So I don't necessarily want to tackle them in this PR, if at all -- but it would be good to understand the gaps now and the path to resolving them in the future?
here are my thoughts:
https://github.com/yuval-k/ztunnel/tree/inpod-dest-id
I planned to do this as a follow-up; LMK what you think.
> We block the pod running until the proxy starts, but it may not actually be ready to handle traffic since we haven't gotten the WDS response
I think the answer for this concern is on_demand
Note that this also requires a change on the node agent side, as the UID from WDS is different from the UID in UDS - we may want to re-think this part.
Pod delete cleanup is not time sensitive or critical - the node agent can just send a list of namespaces that shouldn't exist. Or the CNI plugin could do this - I recall it is notified.
…On Wed, Jan 3, 2024, 17:40 Yuval Kohavi wrote:
In src/inpod/statemanager.rs <#747 (comment)>:
> + info!(
+ "pod {} received netns, starting proxy",
+ poddata.info.workload_uid
+ );
+ if !self.snapshot_received {
+ self.snapshot_names
+ .insert(poddata.info.workload_uid.clone());
+ }
+ let netns = InpodNetns::new(self.inpod_config.cur_netns(), poddata.netns)
+ .map_err(|e| Error::ProxyError(crate::proxy::Error::Io(e)))?;
+
+ self.add_workload(poddata.info, netns)
+ .await
+ .map_err(Error::ProxyError)
+ }
+ WorkloadMessage::KeepWorkload(workload_uid) => {
re @costinm's comment here <#747 (comment)>:
> The problem is that 'complex' is rarely safe - and most of the time
Sorry, I'm aware.
> and not at all clear what 'remove' is supposed to do, the pod will be deleted anyways
Let me clarify.
When a pod is deleted, its network namespace doesn't get deleted. That's
because network namespaces in Linux can't be explicitly deleted. They
automatically get cleaned up when nothing references them. Therefore,
ztunnel needs some mechanism to know that a pod is deleted, so it can clean
up its resources.
Our current approach is that the node agent tells ztunnel that pods were
deleted. Therefore, if the ztunnel was disconnected from the node
agent, when it reconnects we need to reconcile the full state of local pods
on the node, to account for pods that may have been deleted while the
ztunnel was disconnected.
We do this by sending a bunch of AddWorkload messages, followed by a
SnapshotSent message on the initial connection. The ztunnel will clean up
any workloads that it didn't see in one of these AddWorkload messages
sent prior to the SnapshotSent message.
After the SnapshotSent message, the protocol becomes delta-like, with
AddWorkload / DelWorkload sent as needed.
If/When the node agent restarts, it needs to reconstruct its local state.
That state is composed of local pods -> netns mapping. The node agent uses
heuristics to reconstruct said state.
The (potentially non-existent [see above]) problem that KeepWorkload aims
to solve is to account for a possible temporary issue with that heuristic,
by allowing the node agent to tell the ztunnel when it reconnects that,
even though it doesn't have the netns for a certain pod, the ztunnel
shouldn't remove it after the SnapshotSent message is sent.
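A rough sketch of the reconcile-on-reconnect flow described above, with hypothetical stand-in types rather than ztunnel's actual state manager:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for the real protocol and state types.
enum WorkloadMessage {
    AddWorkload { uid: String /* netns fd elided */ },
    KeepWorkload(String),
    DelWorkload(String),
    SnapshotSent,
}

#[derive(Default)]
struct State {
    proxies: HashMap<String, ()>,   // uid -> running proxy (elided)
    snapshot_uids: HashSet<String>, // uids replayed before SnapshotSent
    snapshot_received: bool,
}

impl State {
    fn handle(&mut self, msg: WorkloadMessage) {
        match msg {
            WorkloadMessage::AddWorkload { uid } => {
                if !self.snapshot_received {
                    self.snapshot_uids.insert(uid.clone());
                }
                self.proxies.entry(uid).or_insert(()); // start proxy in the pod netns
            }
            // KeepWorkload: part of the snapshot, but the node agent couldn't
            // recover the netns; don't drain this workload at reconcile time.
            WorkloadMessage::KeepWorkload(uid) => {
                if !self.snapshot_received {
                    self.snapshot_uids.insert(uid);
                }
            }
            WorkloadMessage::DelWorkload(uid) => {
                self.proxies.remove(&uid); // drop the last netns reference
            }
            // After the snapshot, drop any proxy whose uid was not replayed:
            // its pod may have been deleted while we were disconnected.
            WorkloadMessage::SnapshotSent => {
                self.snapshot_received = true;
                let keep = std::mem::take(&mut self.snapshot_uids);
                self.proxies.retain(|uid, _| keep.contains(uid));
            }
        }
    }
}

fn main() {
    let mut state = State::default();
    state.handle(WorkloadMessage::AddWorkload { uid: "pod-a".into() });
    state.handle(WorkloadMessage::KeepWorkload("pod-b".into()));
    state.handle(WorkloadMessage::SnapshotSent);
    assert!(state.proxies.contains_key("pod-a"));
}
```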
I agree that pod delete clean-up is not time sensitive. What is the problem you are trying to solve? What part of the delete flow do you think is too complex? Could you perhaps sketch out your proposal in code, or in more technical detail that includes the proposed changes to the code in both the node agent and the ztunnel?
On Fri, Jan 5, 2024 at 5:29 PM Yuval Kohavi wrote:
In src/inpod/workloadmanager.rs <#747 (comment)>:
> +}
+
+struct WorkloadProxyManagerProcessor<'a> {
+ state: &'a mut super::statemanager::WorkloadProxyManagerState,
+ readiness: &'a mut WorkloadProxyReadinessHandler,
+
+ next_pending_retry: Option<std::pin::Pin<Box<tokio::time::Sleep>>>,
+}
+
+impl WorkloadProxyReadinessHandler {
+ fn new(ready: readiness::Ready) -> Self {
+ let mut r = Self {
+ ready,
+ block_ready: None,
+ };
+ r.not_ready();
> Re upgrade, supporting SxS upgrades will involve more than solving the
> ports. So probably worth a separate discussion. I was under the impression
> that our initial upgrade strategy was to cordon/drain node before an upgrade
Simple cordon/drain is currently the only safe option we have.
We may find ways to upgrade without node drain - but it would be ideal to
implement that in a separate binary/module - and keep the
core simple. We already have far too much complexity - cordon just works
with no extra code and will likely remain the safest option.
Adding a lot of code to deal with some other mode is likely to add risks to
the simple mode and regular operation.
IMO it is time to do whatever needs to be done to get an initial version
working - with cordon and roughly this PR - and
start working on a clean CNI without the legacy and complexity. Perhaps
with others - i.e. define the requirements ('a CNI should
allow a Pod to send/receive UDS file descriptors') and work with K8S
upstream to extend the core specs, with a minimal reference
implementation free of Istio-specific logic.
The SDS/certificate UDS socket is another example involving creating UDS
listeners in the pod namespace - and it will be a powerful
general IPC mechanism, since it enables passing file descriptors (and
memory regions) between pods. There is vast evidence from /dev/binder that it is
a powerful and viable mechanism.
So let's get this in, test it - and start the cleanup/de-legacyfication as
a long-term separate effort. Even if this PR was super simple -
the rest of Istio CNI is full of complexity and far too hard.
Note that SO_REUSEPORT only works if both the old and new ztunnel have the same UID.
@howardjohn added SO_REUSEPORT, which "solves" the upgrade problem (and it should actually drain connections gracefully - though I didn't test this thoroughly yet). The main limitation is that the new ztunnel and the old ztunnel need to have the same UID (enforced by Linux for port reuse). LMK what you think. Note that I do still think that upgrading is a bigger story than just this change.
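For reference, a minimal sketch of binding a listener with SO_REUSEPORT, assuming the socket2 crate (the address and port are illustrative); the kernel only lets sockets owned by the same effective UID share a port this way, which is the limitation mentioned above:

```rust
use socket2::{Domain, Protocol, Socket, Type};
use std::net::{SocketAddr, TcpListener};

fn reuseport_listener(addr: SocketAddr) -> std::io::Result<TcpListener> {
    let socket = Socket::new(Domain::for_address(addr), Type::STREAM, Some(Protocol::TCP))?;
    // Allow the new ztunnel to bind the same port while the old one still
    // holds it; the kernel spreads new connections across both sockets.
    socket.set_reuse_port(true)?;
    socket.set_reuse_address(true)?;
    socket.bind(&addr.into())?;
    socket.listen(128)?;
    Ok(socket.into())
}

fn main() -> std::io::Result<()> {
    let listener = reuseport_listener("127.0.0.1:15001".parse().unwrap())?;
    println!("listening on {}", listener.local_addr()?);
    Ok(())
}
```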
+1 on this. The fact that we have a single version-controlled file declaring the schema and documenting the protocol which both the client and server explicitly refer to is far and away worth more than the trivial encoding cost, for the purposes of intelligibility and developer discoverability. I'm not against picking a simpler or more efficient serialization protocol/schema format than protobuf (e.g. Cap'n Proto) - but there aren't many, and not having one at all makes this much harder to understand and use.
Co-authored-by: John Howard <howardjohn@google.com>
Yeah, +1 on what Yuval suggested - the node-level CNI plugin that already has full node privs could definitely talk to the CRI and get the FD, and propagate it directly. That would be a good follow-up optimization, and IMO very preferable to giving ztunnel those privs.
If ztunnel had those privs, it wouldn't need to communicate with the CNI plugin for the netns, but with the CRI instead, right?
Yes, but it would still need to communicate with the node agent, so all that would really do in practice is spread elevated permissions among more components, rather than fewer, which is not desirable.
Giving the ztunnel access to the CRI is giving it privileges: if ztunnel communicates with the CRI instead of the node agent, that gives it the power of root. Because you can use the CRI to spin up privileged containers with host mounts, anyone that can access the CRI is effectively root.
Got it, that is what I think too. Recently I talked to a guy from containerd. He said the CRI can be used to get the netns; I think the tricky point is how to order the startup without a race.
I believe I responded to all comments; let me know if I missed anything.
@howardjohn @hzxuzhonghu - is this one ready to be approved? It would be great to get it into release-1.21 before branching.
A friendly ping to @howardjohn when he gets a min, since he said it should be good to go yesterday. This is blocking a few other PRs as well.
Looks like a good start to iterate on
In-pod redirection mode.
For context see: https://docs.google.com/document/d/1dynKlnNgIOywv3cwuw_2RCk_SKFRs7IrESaC8_r-sm0/edit
This adds a new optional redirection mode to ztunnel - redirecting traffic from within the pod namespace.
A few details on the code:
- An instance of `proxy::Proxy` that starts its sockets in the pod's netns
- A `trait SocketFactory` to abstract the netns details from the `proxy::Proxy`
Note that ztunnel would still work with the old redirection mode. The new mode kicks in when `INPOD_ENABLED` is set to `true`.
Summary of changes:
- `run()`
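For illustration only, roughly how such an opt-in flag could be read; ztunnel's actual config parsing may differ, and the helper name here is hypothetical:

```rust
use std::env;

// The new in-pod redirection mode is opt-in: it only kicks in when
// INPOD_ENABLED is set to "true"; otherwise the existing redirection
// mode keeps working unchanged.
fn inpod_enabled() -> bool {
    env::var("INPOD_ENABLED")
        .map(|v| v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}

fn main() {
    if inpod_enabled() {
        println!("starting in-pod redirection mode");
    } else {
        println!("starting legacy redirection mode");
    }
}
```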