
Enable true mesh-only, remove hard dependency on the knative-local-gateway #1371

Closed
adamkgray opened this issue Feb 10, 2021 · 18 comments

@adamkgray
Contributor

adamkgray commented Feb 10, 2021

/kind feature

Describe the solution you'd like

Remove the hard dependency on the Knative local gateway and give KFServing the option to run 100% mesh-only with Istio.

Anything else you would like to add:

KFServing should be able to run in true mesh-only mode on Istio with sidecars enabled. In doing so, we ought to be able to remove the hard dependency on the Knative local gateway, which is currently used to route top-level requests (name.ns.svc.cluster.local) to the other Knative services that KFServing creates (name-predictor-default.ns.svc.cluster.local). Vanilla Knative services support this, so KFServing ought to as well.

I'm not sure how the routing should be done in this case... Perhaps we could start by making the predictor ksvc the top-level service, so that name.ns.svc.cluster.local resolves to its ClusterIP instead of just being a CNAME for the Knative local gateway.
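A very rough sketch of that direction (service names are hypothetical, and this still uses a CNAME, just pointed at the predictor instead of the local gateway):

apiVersion: v1
kind: Service
metadata:
  name: sklearn-iris
  namespace: default
spec:
  # hypothetical: CNAME to the predictor's own service instead of to the
  # knative-local-gateway Service in istio-system
  type: ExternalName
  externalName: sklearn-iris-predictor-default.default.svc.cluster.local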

@adamkgray changed the title from "Enabled true mesh-only, remove hard dependency on the knative-local-gateway" to "Enable true mesh-only, remove hard dependency on the knative-local-gateway" on Feb 10, 2021
@yuzisun
Member

yuzisun commented Mar 20, 2021

@adamkgray I think I have figured it out: for a mesh-only setup, the top-level virtual service should use the mesh gateway instead.
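A minimal sketch of what I mean (names are only illustrative, not the generated resource):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sklearn-iris-mesh
  namespace: default
spec:
  gateways:
  - mesh  # sidecar-to-sidecar routing only, no gateway hop
  hosts:
  - sklearn-iris.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: sklearn-iris-predictor-default.default.svc.cluster.local
        port:
          number: 80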

@yuzisun
Member

yuzisun commented Jun 12, 2021

This is the current setup: we have to use the Knative local gateway to route from the top-level virtual service sklearn-iris to the Knative-generated virtual service sklearn-iris-transformer-default, since the top-level virtual service does not route to the service directly.

spec:
  gateways:
  - knative-local-gateway.knative-serving
  - knative-ingress-gateway.knative-serving
  - mesh
  hosts:
  - sklearn-iris.default.svc.cluster.local
  - sklearn-iris.default.35.237.217.209.xip.io
  http:
  - match:
    - authority:
        regex: ^sklearn-iris\.default(\.svc(\.cluster\.local)?)?(?::\d{1,5})?$
      gateways:
      - knative-local-gateway.knative-serving
    - authority:
        regex: ^sklearn-iris\.default\.35\.237\.217\.209\.xip\.io(?::\d{1,5})?$
      gateways:
      - knative-ingress-gateway.knative-serving
    route:
    - destination:
        host: knative-local-gateway.istio-system.svc.cluster.local
        port:
          number: 80
      headers:
        request:
          set:
            Host: sklearn-iris-predictor-default.default.svc.cluster.local
      weight: 100

[Screenshot: Screen Shot 2021-06-12 at 11.58.23 AM]

@yuzisun
Member

yuzisun commented Jun 13, 2021

To avoid going through the local gateway, I tried the Istio virtual service delegate feature, which should forward the traffic to the destination service when using the top-level host sklearn-iris. However, this does not seem to work and I get a 404 instead.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sklearn-iris
  namespace: default
spec:
  gateways:
  - knative-serving/knative-ingress-gateway
  hosts:
  - sklearn-iris.default.example.com
  http:
  - delegate:
      name: sklearn-iris-predictor-default-ingress
      namespace: default
curl -v -H "Host: sklearn-iris.default.example.com" http://localhost:8080/v1/models/sklearn-iris:predict -d @./docs/samples/v1beta1/sklearn/v1/iris-input.json
*   Trying ::1:8080...
* Connected to localhost (::1) port 8080 (#0)
> POST /v1/models/sklearn-iris:predict HTTP/1.1
> Host: sklearn-iris.default.example.com
> User-Agent: curl/7.73.0
> Accept: */*
> Content-Length: 76
> Content-Type: application/x-www-form-urlencoded
> 
* upload completely sent off: 76 out of 76 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< date: Sun, 13 Jun 2021 12:07:23 GMT
< server: istio-envoy
< connection: close
< content-length: 0
< 
* Closing connection 0
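For reference, as I understand Istio delegates, the referenced VirtualService carries no hosts or gateways of its own, only http match/route rules. A rough guess at what a self-contained delegate target would look like (purely an assumption; the Knative-generated sklearn-iris-predictor-default-ingress resource is not shown here):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sklearn-iris-predictor-default-ingress
  namespace: default
spec:
  # no hosts/gateways on a delegate target, only http rules
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        host: sklearn-iris-predictor-default.default.svc.cluster.local
        port:
          number: 80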

@yuzisun
Member

yuzisun commented Jun 15, 2021

@yanniszark @theofpa I think there are two options here:

  1. With an Istio delegate we can remove the dependency on the Knative local gateway, but I could not get this working (not sure if it is an Istio bug), and Istio would still remain a hard dependency for KFServing if we use delegates.
  2. Reconcile the Knative virtual service and update the top-level virtual service to resolve to the actual Kubernetes service, which points to the right revision pods. With this approach we could potentially switch to the Kubernetes Ingress CR instead, so Istio is no longer a hard dependency (a rough Ingress sketch follows).
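A rough sketch of option 2 with a plain Kubernetes Ingress (host and service names are only illustrative; the actual reconciliation is not worked out here):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sklearn-iris
  namespace: default
spec:
  rules:
  - host: sklearn-iris.default.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            # the plain Kubernetes service backing the current revision
            name: sklearn-iris-predictor-default
            port:
              number: 80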

@yanniszark
Contributor

yanniszark commented Jun 16, 2021

@yuzisun thanks for taking the time to write all of this down. I tried to summarize all the information that we talked about in the meeting at a high level, to sync our understanding on where we are now and where we want to go.

Current Scenario that Kubeflow supports out-of-the-box

Please keep in mind that everything refers to sidecar-injected Pods, which is the default in Kubeflow Profiles.
Non-injected Pods don't have any of the following issues.

With the work in kubeflow/kubeflow#5848, this is the currently supported scenario:

[Diagram: external-user-valid-scenario]

Current State - Invalid Scenarios

The following scenarios do not work with sidecar-injected Pods: traffic FROM the cluster-local gateway TO Pods in the user's namespace is not allowed.

[Diagram: kfserving - External User]

[Diagram: kfserving - Internal User]

Desired State

Since we have sidecars, we should not need the cluster-local-gateway. Istio sidecars can do all the complex routing themselves.

[Diagram: kfserving - External User - Mesh]

[Diagram: kfserving - Internal User - Mesh]

Why this is important

By eliminating the cluster-local-gateway, we can write very straightforward Istio AuthorizationPolicies.
For example, allow traffic from Pods in the same namespace.
The cluster-local-gateway currently hides the identity of the calling sidecar, making it difficult to express sensible policies.
By opening traffic to the cluster-local-gateway, which is global, we are effectively letting every user make requests to every other user's models.
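Concretely, the same-namespace policy mentioned above could be as small as this sketch (namespace name is a placeholder):

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-same-namespace
  namespace: kubeflow-user
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        # only workloads whose identity lives in this namespace
        namespaces: ["kubeflow-user"]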

However, since creating InferenceServices without sidecars already allows that, we could create a permissive AuthorizationPolicy for Knative Pods right? WRONG! Because we can't express a selector that targets only Knative Pods:
istio/istio#30598

Next Steps

I understand from our discussion and your comment above that:

  • This is possible and Knative supports a mesh-mode. My question then is:
    • How does one enable this mesh-mode? What is the configuration required in Knative and KFServing? I understand that in Kubeflow 1.3, this is not the default.
  • KFServing creates some top-level VirtualServices / Services that point to the first Knative Service in the chain (Transformer or Predictor). We need to find a way to route this traffic straight to the Pod, instead of having it go through the cluster-local-gateway first.

P.S. Here are the .drawio files for the diagrams, in case you want to use / edit them.
[Attachment: kfserving.zip]

@yuzisun
Member

yuzisun commented Jun 19, 2021

To enable mesh mode, all you need to do is inject the Istio sidecar, which I assume is done by default with the Kubeflow setup. @yanniszark I think the diagram isn't correct for the transformer-to-predictor routing: Knative creates virtual services for both the mesh and the local gateway, so the Istio sidecar should choose the mesh gateway and route the traffic from the transformer to the predictor directly, without going through the local gateway.

Routing the initial request to the transformer is still an issue, because the top-level virtual service introduces a level of indirection that forces the use of the local gateway; we are trying to figure out a way to do this without the local gateway.

You can see this in the following access logs: the initial request to the transformer goes through the local gateway, BUT the call from the transformer to the predictor does NOT go through the local gateway when the Istio sidecar is injected. Instead, the traffic goes directly from the transformer's Envoy proxy to the predictor's Envoy proxy, so you do not find that request in the gateway's access log.

kubectl logs istio-ingressgateway-6d8565884d-bgqt5  -n istio-system

[2021-06-19T09:50:12.794Z] "POST /v1/models/mnist:predict HTTP/1.1" 200 - via_upstream - "-" 401 20 147 147 "10.244.0.6" "curl/7.73.0" "1e560b8f-c53d-45ec-9b8e-67eb0461cd0d" "torchserve-transformer-transformer-default.default.svc.cluster.local" "10.244.0.6:8081" outbound|80||knative-local-gateway.istio-system.svc.cluster.local 10.244.0.6:45404 127.0.0.1:8080 127.0.0.1:44140 - -
[2021-06-19T09:50:12.795Z] "POST /v1/models/mnist:predict HTTP/2" 200 - via_upstream - "-" 401 20 132 131 "10.244.0.6,10.244.0.6" "curl/7.73.0" "0da498ae-b76c-458d-bcd1-f8aaf3636fee" "torchserve-transformer-transformer-default.default.svc.cluster.local" "10.244.0.9:8012" outbound|80||torchserve-transformer-transformer-default-00001.default.svc.cluster.local 10.244.0.6:52428 10.244.0.6:8081 10.244.0.6:45404 - -

You can find the predictor call in the transformer's Envoy proxy log:

kubectl logs torchserve-transformer-transformer-default-00001-deploymen65khl istio-proxy |grep predictor

[2021-06-19T09:50:12.806Z] "POST /v1/models/mnist:predict HTTP/1.1" 200 - via_upstream - "-" 16340 20 85 85 "-" "Tornado/6.1" "901984d6-ef9e-4561-8ae1-eaabb7d64caa" "torchserve-transformer-predictor-default.default" "10.244.0.9:8012" outbound|80||torchserve-transformer-predictor-default-00001.default.svc.cluster.local 10.244.0.20:59722 10.108.77.40:80 10.244.0.20:56670 - -

@yanniszark
Contributor

@yuzisun I tried to reproduce but bumped into something very strange. First, requests were proxied through the activator Pod. I tried injecting it with a sidecar and allowing its traffic. However, I noticed that the inferenceservice sidecar was still blocking traffic.

I tried to debug with:
k exec -it <transformer-pod> -c istio-proxy -- pilot-agent request POST 'logging?rbac=debug'

and saw that the activator was not connecting with mTLS (the IP of activator is 172.17.0.21):

2021-06-23T19:11:33.753693Z     debug   envoy rbac      checking request: requestedServerName: , sourceIP: 172.17.0.21:39884, directRemoteIP: 172.17.0.21:39884, remoteIP: 127.0.0.1:0,localAddress: 172.17.0.75:8012, ssl: none, headers: ':authority', '172.17.0.75:8012'
':path', '/v1/models/serving-test-kfserving-0-bjqn1:predict'
':method', 'POST'
'user-agent', 'curl/7.58.0'
'content-length', '0'
'accept', '*/*'
'k-original-host', 'serving-test-kfserving-0-bjqn1-transformer-default.kubeflow-user.svc.cluster.local'
'k-proxy-request', 'activator'
'x-envoy-attempt-count', '1'
'x-forwarded-for', '127.0.0.1'
'x-forwarded-proto', 'http'
'x-request-id', 'c94b59cf-6a4e-9d46-9bd5-ac8332ebe718'
'accept-encoding', 'gzip'
'x-envoy-internal', 'true'
, dynamicMetadata: filter_metadata {
  key: "istio_authn"
  value {
  }
}

I tried creating a DestinationRule (roughly of the shape sketched below) to no avail. Was this not the case for you?
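For context, the DestinationRule I mean is roughly of this shape, forcing mTLS towards the transformer (host taken from the k-original-host header above, other details illustrative):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: transformer-mtls
  namespace: kubeflow-user
spec:
  host: serving-test-kfserving-0-bjqn1-transformer-default.kubeflow-user.svc.cluster.local
  trafficPolicy:
    tls:
      # tell client sidecars to originate Istio mTLS to this host
      mode: ISTIO_MUTUAL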

@davidspek
Contributor

@yanniszark I was seeing the same error while I was busy with #1688. In that PR, the AuthorizationPolicy for the namespace only allows access to the /metrics, /ready, /healthz, /wait-for-drain and /v1/models/* paths (not sure if this should be needed) from the knative-serving service account (or namespace; I tested that separately). The requests from the Activator pod to the inference service are blocked with the same RBAC logs you show above.
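Roughly, the policy in that PR has this shape (the exact principal and namespace below are illustrative; see the PR for the real resource):

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-knative
  namespace: kubeflow-user
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        # knative-serving identity; I also tested matching on the namespace
        principals: ["cluster.local/ns/knative-serving/sa/controller"]
    to:
    - operation:
        paths: ["/metrics", "/ready", "/healthz", "/wait-for-drain", "/v1/models/*"]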
The requests are failing because the following section is empty:

dynamicMetadata: filter_metadata {
  key: "istio_authn"
  value {
  }
}

This section should include some of the values found here, such as source.namespace and source.principal. Because these are missing, the requests from the Activator fail.

I was able to replicate this behavior using a notebook server, by making requests directly to the IP address of a Pod rather than the service. This is also backed up by the fact that the Activator logs should show the requests going through PassthroughCluster. The Kiali docs have a good section describing this behavior:

Requests going to PassthroughCluster (or BlackHoleCluster) are requests that did not get routed to a defined service or service entry, and instead end up at one of these built-in Istio request handlers. See Monitoring Blocked and Passthrough External Service Traffic for more information.

@yuzisun
Member

yuzisun commented Jun 30, 2021

@davidspek Do you have the Istio proxy injected for the Knative activator pod? I think those values also require mTLS to be enabled.
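For example, mTLS could be required namespace-wide with something like this sketch (namespace is illustrative):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: knative-serving
spec:
  mtls:
    # require Istio mTLS so the peer identity shows up in istio_authn
    mode: STRICT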

@davidspek
Contributor

davidspek commented Jun 30, 2021

@yuzisun Yes, the Activator pod is injected with the sidecar proxy.

@yuzisun
Member

yuzisun commented Jun 30, 2021

> @yuzisun Yes, the Activator pod is injected with the sidecar proxy.

Also, if you see PassthroughCluster, that means the request is not routed through the mesh; is the destination service not on the service mesh? We can try debugging with istioctl proxy-config and examining the mesh configuration.
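For example (pod names are placeholders):

# what routes and clusters the activator's sidecar knows about
istioctl proxy-config routes <activator-pod> -n knative-serving
istioctl proxy-config clusters <activator-pod> -n knative-serving
# what the destination sidecar is listening on
istioctl proxy-config listeners <transformer-pod> -n kubeflow-user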

@yuzisun
Member

yuzisun commented Jun 30, 2021

@davidspek Actually, the Knative activator might be directly addressing the pod IPs, bypassing the virtual service routes.

@davidspek
Contributor

@yuzisun My conclusion is indeed that the Knative Activator is addressing the pods directly and not using the virtual service routes, which is causing the problems.

@yuzisun
Member

yuzisun commented Jun 30, 2021

> @yuzisun My conclusion is indeed that the Knative Activator is addressing the pods directly and not using the virtual service routes, which is causing the problems.

Let's bring this issue to the Knative community? It looks like Knative has so far put the emphasis on the no-mesh scenario or basic mTLS (no policies).

@davidspek
Contributor

@yuzisun That's what I was also thinking. The authorization policy they provide does work, but it's very broad and thus not really best practice, I'd say. You want to limit access to the necessary paths to either the service account Knative uses, or to the knative-serving namespace. In situations like Kubeflow, where you want to isolate namespaces, this is a must, and these fine-grained permissions (controlling which service can talk to which other service) are part of the reason for using a service mesh.

@yuzisun
Member

yuzisun commented Jun 30, 2021

See this relevant issue: knative-extensions/net-istio#554

@yanniszark
Contributor

This Istio issue also seems relevant: istio/istio#23494

@rachitchauhan43
Contributor

@yuzisun @Suresh-Nakkeran: Hasn't PR #2380 solved this already? If yes, then we can close this.
