
Enable true mesh-only, remove hard dependency on the knative-local-gateway #1371

Closed
adamkgray opened this issue Feb 10, 2021 · 18 comments

@adamkgray
Contributor

adamkgray commented Feb 10, 2021

/kind feature

Describe the solution you'd like

Remove the hard dependency on the Knative local gateway and give KFServing the option to run 100% mesh-only with Istio.

Anything else you would like to add:

KFServing should be able to run in true mesh-only mode on Istio with sidecars enabled. In doing so, we ought to be able to remove the hard dependency on the Knative local gateway, which is currently used to route top-level requests (name.ns.svc.cluster.local) to the other Knative services that KFServing creates (name-predictor-default.ns.svc.cluster.local). Vanilla Knative services support this, so KFServing ought to as well.

I'm not sure how the routing should be done in this case... Perhaps we could start by making the predictor ksvc the top-level service, so that name.ns.svc.cluster.local resolves to its ClusterIP instead of just being a CNAME for the Knative local gateway.
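A very rough sketch of that direction (service names are hypothetical, and this still uses a CNAME, just pointed at the predictor instead of the local gateway):

apiVersion: v1
kind: Service
metadata:
  name: sklearn-iris
  namespace: default
spec:
  # hypothetical: CNAME to the predictor's own service instead of to the
  # knative-local-gateway Service in istio-system
  type: ExternalName
  externalName: sklearn-iris-predictor-default.default.svc.cluster.local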

@adamkgray changed the title from "Enabled true mesh-only, remove hard dependency on the knative-local-gateway" to "Enable true mesh-only, remove hard dependency on the knative-local-gateway" on Feb 10, 2021
@yuzisun
Member

yuzisun commented Mar 20, 2021

@adamkgray I think I have figured it out: for a mesh-only setup, the top-level virtual service should use the mesh gateway instead.
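A minimal sketch of what I mean (names are only illustrative, not the generated resource):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sklearn-iris-mesh
  namespace: default
spec:
  gateways:
  - mesh  # sidecar-to-sidecar routing only, no gateway hop
  hosts:
  - sklearn-iris.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: sklearn-iris-predictor-default.default.svc.cluster.local
        port:
          number: 80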

@yuzisun
Member

yuzisun commented Jun 12, 2021

This is the current setup: we have to use the Knative local gateway to route from the top-level virtual service sklearn-iris to the Knative-generated virtual service sklearn-iris-transformer-default, since the top-level virtual service does not route to the service directly.

spec:
  gateways:
  - knative-local-gateway.knative-serving
  - knative-ingress-gateway.knative-serving
  - mesh
  hosts:
  - sklearn-iris.default.svc.cluster.local
  - sklearn-iris.default.35.237.217.209.xip.io
  http:
  - match:
    - authority:
        regex: ^sklearn-iris\.default(\.svc(\.cluster\.local)?)?(?::\d{1,5})?$
      gateways:
      - knative-local-gateway.knative-serving
    - authority:
        regex: ^sklearn-iris\.default\.35\.237\.217\.209\.xip\.io(?::\d{1,5})?$
      gateways:
      - knative-ingress-gateway.knative-serving
    route:
    - destination:
        host: knative-local-gateway.istio-system.svc.cluster.local
        port:
          number: 80
      headers:
        request:
          set:
            Host: sklearn-iris-predictor-default.default.svc.cluster.local
      weight: 100

[Screenshot: Screen Shot 2021-06-12 at 11.58.23 AM]

@yuzisun
Member

yuzisun commented Jun 13, 2021

To avoid going through the local gateway, I tried the Istio virtual service delegate feature, which should forward the traffic to the destination service when using the top-level host sklearn-iris. However, this does not seem to work and I get a 404 instead.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sklearn-iris
  namespace: default
spec:
  gateways:
  - knative-serving/knative-ingress-gateway
  hosts:
  - sklearn-iris.default.example.com
  http:
  - delegate:
      name: sklearn-iris-predictor-default-ingress
      namespace: default
curl -v -H "Host: sklearn-iris.default.example.com" http://localhost:8080/v1/models/sklearn-iris:predict -d @./docs/samples/v1beta1/sklearn/v1/iris-input.json
*   Trying ::1:8080...
* Connected to localhost (::1) port 8080 (#0)
> POST /v1/models/sklearn-iris:predict HTTP/1.1
> Host: sklearn-iris.default.example.com
> User-Agent: curl/7.73.0
> Accept: */*
> Content-Length: 76
> Content-Type: application/x-www-form-urlencoded
> 
* upload completely sent off: 76 out of 76 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< date: Sun, 13 Jun 2021 12:07:23 GMT
< server: istio-envoy
< connection: close
< content-length: 0
< 
* Closing connection 0
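For reference, as I understand Istio delegates, the referenced VirtualService carries no hosts or gateways of its own, only http match/route rules. A rough guess at what a self-contained delegate target would look like (purely an assumption; the Knative-generated sklearn-iris-predictor-default-ingress resource is not shown here):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sklearn-iris-predictor-default-ingress
  namespace: default
spec:
  # no hosts/gateways on a delegate target, only http rules
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        host: sklearn-iris-predictor-default.default.svc.cluster.local
        port:
          number: 80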

@yuzisun
Member

yuzisun commented Jun 15, 2021

@yanniszark @theofpa I think there are two options here:

  1. With an Istio delegate we can remove the dependency on the Knative local gateway, but I could not get this working (not sure if it is an Istio bug), and Istio would still remain a hard dependency for KFServing if we use delegates.
  2. Reconcile the Knative virtual service and update the top-level virtual service to resolve to the actual Kubernetes service, which points to the right revision pods. With this approach we could potentially switch to the Kubernetes Ingress CR instead, so Istio is no longer a hard dependency (a rough Ingress sketch follows).
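A rough sketch of option 2 with a plain Kubernetes Ingress (host and service names are only illustrative; the actual reconciliation is not worked out here):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sklearn-iris
  namespace: default
spec:
  rules:
  - host: sklearn-iris.default.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            # the plain Kubernetes service backing the current revision
            name: sklearn-iris-predictor-default
            port:
              number: 80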

@yanniszark
Contributor

yanniszark commented Jun 16, 2021

@yuzisun thanks for taking the time to write all of this down. I tried to summarize all the information that we talked about in the meeting at a high level, to sync our understanding on where we are now and where we want to go.

Current Scenario that Kubeflow supports out-of-the-box

Please keep in mind that everything refers to sidecar-injected Pods, which is the default in Kubeflow Profiles.
Non-injected Pods don't have any of the following issues.

With the work in kubeflow/kubeflow#5848, this is the currently supported scenario:

[Diagram: external-user-valid-scenario]

Current State - Invalid Scenarios

The following scenarios do not work with sidecar-injected Pods: traffic FROM the cluster-local gateway TO Pods in the user's namespace is not allowed.

[Diagram: kfserving - External User]

[Diagram: kfserving - Internal User]

Desired State

Since we have sidecars, we should not need the cluster-local-gateway. Istio sidecars can do all the complex routing themselves.

[Diagram: kfserving - External User - Mesh]

[Diagram: kfserving - Internal User - Mesh]

Why this is important

By eliminating the cluster-local-gateway, we can write very straightforward Istio AuthorizationPolicies.
For example, allow traffic from Pods in the same namespace.
The cluster-local-gateway currently hides the identity of the calling sidecar, making it difficult to express sensible policies.
By opening traffic to the cluster-local-gateway, which is global, we are effectively letting every user make requests to every other user's models.
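Concretely, the same-namespace policy mentioned above could be as small as this sketch (namespace name is a placeholder):

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-same-namespace
  namespace: kubeflow-user
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        # only workloads whose identity lives in this namespace
        namespaces: ["kubeflow-user"]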

However, since creating InferenceServices without sidecars already allows that, we could create a permissive AuthorizationPolicy for Knative Pods right? WRONG! Because we can't express a selector that targets only Knative Pods:
istio/istio#30598

Next Steps

I understand from our discussion and your comment above that:

  • This is possible and Knative supports a mesh-mode. My question then is:
    • How does one enable this mesh-mode? What is the configuration required in Knative and KFServing? I understand that in Kubeflow 1.3, this is not the default.
  • KFServing creates some top-level VirtualServices / Services that point to the first Knative Service in the chain (Transformer or Predictor). We need to find a way to route this traffic straight to the Pod, instead of having it go through the cluster-local-gateway first.

P.S. Here are the .drawio files for the diagrams, in case you want to use / edit them.
[Attachment: kfserving.zip]

@yuzisun
Member

yuzisun commented Jun 19, 2021

To enable mesh mode, all you need to do is inject the Istio sidecar, which I assume is done by default with the Kubeflow setup. @yanniszark I think the diagram isn't correct for the transformer-to-predictor routing: Knative creates virtual services for both the mesh and the local gateway, so the Istio sidecar should choose the mesh gateway and route the traffic from the transformer to the predictor directly, without going through the local gateway.

Routing the initial request to the transformer is still an issue, because the top-level virtual service introduces a level of indirection that forces the use of the local gateway; we are trying to figure out a way to do this without the local gateway.

You can see this in the following access logs: the initial request to the transformer goes through the local gateway, BUT the call from the transformer to the predictor does NOT go through the local gateway when the Istio sidecar is injected. Instead, the traffic goes directly from the transformer's Envoy proxy to the predictor's Envoy proxy, so you do not find that request in the gateway's access log.

kubectl logs istio-ingressgateway-6d8565884d-bgqt5  -n istio-system

[2021-06-19T09:50:12.794Z] "POST /v1/models/mnist:predict HTTP/1.1" 200 - via_upstream - "-" 401 20 147 147 "10.244.0.6" "curl/7.73.0" "1e560b8f-c53d-45ec-9b8e-67eb0461cd0d" "torchserve-transformer-transformer-default.default.svc.cluster.local" "10.244.0.6:8081" outbound|80||knative-local-gateway.istio-system.svc.cluster.local 10.244.0.6:45404 127.0.0.1:8080 127.0.0.1:44140 - -
[2021-06-19T09:50:12.795Z] "POST /v1/models/mnist:predict HTTP/2" 200 - via_upstream - "-" 401 20 132 131 "10.244.0.6,10.244.0.6" "curl/7.73.0" "0da498ae-b76c-458d-bcd1-f8aaf3636fee" "torchserve-transformer-transformer-default.default.svc.cluster.local" "10.244.0.9:8012" outbound|80||torchserve-transformer-transformer-default-00001.default.svc.cluster.local 10.244.0.6:52428 10.244.0.6:8081 10.244.0.6:45404 - -

You can find the predictor call in the transformer's Envoy proxy log:

kubectl logs torchserve-transformer-transformer-default-00001-deploymen65khl istio-proxy |grep predictor

[2021-06-19T09:50:12.806Z] "POST /v1/models/mnist:predict HTTP/1.1" 200 - via_upstream - "-" 16340 20 85 85 "-" "Tornado/6.1" "901984d6-ef9e-4561-8ae1-eaabb7d64caa" "torchserve-transformer-predictor-default.default" "10.244.0.9:8012" outbound|80||torchserve-transformer-predictor-default-00001.default.svc.cluster.local 10.244.0.20:59722 10.108.77.40:80 10.244.0.20:56670 - -

@yanniszark
Contributor

@yuzisun I tried to reproduce but bumped into something very strange. First, requests were proxied through the activator Pod. I tried injecting it with a sidecar and allowing its traffic. However, I noticed that the inferenceservice sidecar was still blocking traffic.

I tried to debug with:
k exec -it <transformer-pod> -c istio-proxy -- pilot-agent request POST 'logging?rbac=debug'

and saw that the activator was not connecting with mTLS (the IP of activator is 172.17.0.21):

2021-06-23T19:11:33.753693Z     debug   envoy rbac      checking request: requestedServerName: , sourceIP: 172.17.0.21:39884, directRemoteIP: 172.17.0.21:39884, remoteIP: 127.0.0.1:0,localAddress: 172.17.0.75:8012, ssl: none, headers: ':authority', '172.17.0.75:8012'
':path', '/v1/models/serving-test-kfserving-0-bjqn1:predict'
':method', 'POST'
'user-agent', 'curl/7.58.0'
'content-length', '0'
'accept', '*/*'
'k-original-host', 'serving-test-kfserving-0-bjqn1-transformer-default.kubeflow-user.svc.cluster.local'
'k-proxy-request', 'activator'
'x-envoy-attempt-count', '1'
'x-forwarded-for', '127.0.0.1'
'x-forwarded-proto', 'http'
'x-request-id', 'c94b59cf-6a4e-9d46-9bd5-ac8332ebe718'
'accept-encoding', 'gzip'
'x-envoy-internal', 'true'
, dynamicMetadata: filter_metadata {
  key: "istio_authn"
  value {
  }
}

I tried creating a DestinationRule (roughly of the shape sketched below) to no avail. Was this not the case for you?
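For context, the DestinationRule I mean is roughly of this shape, forcing mTLS towards the transformer (host taken from the k-original-host header above, other details illustrative):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: transformer-mtls
  namespace: kubeflow-user
spec:
  host: serving-test-kfserving-0-bjqn1-transformer-default.kubeflow-user.svc.cluster.local
  trafficPolicy:
    tls:
      # tell client sidecars to originate Istio mTLS to this host
      mode: ISTIO_MUTUAL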

@davidspek
Contributor

@yanniszark I was seeing the same error while I was busy with #1688. In that PR, the AuthorizationPolicy for the namespace only allows access to the /metrics, /ready, /healthz, /wait-for-drain and /v1/models/* paths (not sure if this should be needed) from the knative-serving service account (or namespace; I tested that separately). The requests from the Activator pod to the inference service are blocked with the same RBAC logs you show above.
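Roughly, the policy in that PR has this shape (the exact principal and namespace below are illustrative; see the PR for the real resource):

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-knative
  namespace: kubeflow-user
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        # knative-serving identity; I also tested matching on the namespace
        principals: ["cluster.local/ns/knative-serving/sa/controller"]
    to:
    - operation:
        paths: ["/metrics", "/ready", "/healthz", "/wait-for-drain", "/v1/models/*"]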
The requests are failing because the following section is empty:

dynamicMetadata: filter_metadata {
  key: "istio_authn"
  value {
  }
}

This section should include some of the values found here, such as source.namespace and source.principal. Because these are missing, the requests from the Activator fail.

I was able to replicate this behavior using a notebook server, by making requests directly to the IP address of a Pod rather than the service. This is also backed up by the fact that the Activator logs should show the requests going through PassthroughCluster. The Kiali docs have a good section describing this behavior:

Requests going to PassthroughCluster (or BlackHoleCluster) are requests that did not get routed to a defined service or service entry, and instead end up at one of these built-in Istio request handlers. See Monitoring Blocked and Passthrough External Service Traffic for more information.

@yuzisun
Member

yuzisun commented Jun 30, 2021

@davidspek Do you have the Istio proxy injected for the Knative activator pod? I think those values also require mTLS to be enabled.
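For example, mTLS could be required namespace-wide with something like this sketch (namespace is illustrative):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: knative-serving
spec:
  mtls:
    # require Istio mTLS so the peer identity shows up in istio_authn
    mode: STRICT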

@davidspek
Contributor

davidspek commented Jun 30, 2021

@yuzisun Yes, the Activator pod is injected with the sidecar proxy.

@yuzisun
Member

yuzisun commented Jun 30, 2021

> @yuzisun Yes, the Activator pod is injected with the sidecar proxy.

Also, if you see PassthroughCluster, that means the request is not routed through the mesh; is the destination service not on the service mesh? We can try debugging with istioctl proxy-config and examining the mesh configuration.
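For example (pod names are placeholders):

# what routes and clusters the activator's sidecar knows about
istioctl proxy-config routes <activator-pod> -n knative-serving
istioctl proxy-config clusters <activator-pod> -n knative-serving
# what the destination sidecar is listening on
istioctl proxy-config listeners <transformer-pod> -n kubeflow-user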

@yuzisun
Member

yuzisun commented Jun 30, 2021

@davidspek Actually, the Knative activator might be directly addressing the pod IPs, bypassing the virtual service routes.

@davidspek
Contributor

@yuzisun My conclusion is indeed that the Knative Activator is addressing the pods directly and not using the virtual service routes, which is causing the problems.

@yuzisun
Member

yuzisun commented Jun 30, 2021

> @yuzisun My conclusion is indeed that the Knative Activator is addressing the pods directly and not using the virtual service routes, which is causing the problems.

Let's bring this issue to the Knative community? It looks like Knative has so far put the emphasis on the no-mesh scenario or basic mTLS (no policies).

@davidspek
Contributor

@yuzisun That's what I was also thinking. The authorization policy they provide does work, but it's very broad and thus not really best practice, I'd say. You want to limit access to the necessary paths to either the service account Knative uses, or to the knative-serving namespace. In situations like Kubeflow, where you want to isolate namespaces, this is a must, and these fine-grained permissions (controlling which service can talk to which other service) are part of the reason for using a service mesh.

@yuzisun
Member

yuzisun commented Jun 30, 2021

See this relevant issue: knative-extensions/net-istio#554

@yanniszark
Contributor

This Istio issue also seems relevant: istio/istio#23494

@rachitchauhan43
Contributor

@yuzisun @Suresh-Nakkeran: Hasn't PR #2380 solved this already? If yes, then we can close this.
