Validate Kubeflow GCP deployment using E2E example in https://github.com/kubeflow/pipelines/pull/5433 #271
KFServing step has failed. Error message:

```
{'apiVersion': 'serving.kubeflow.org/v1beta1', 'kind': 'InferenceService', 'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'}, 'creationTimestamp': '2021-05-11T21:34:33Z', 'finalizers': ['inferenceservice.finalizers'], 'generation': 1, 'managedFields': [{'apiVersion': 'serving.kubeflow.org/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'.': {}, 'f:sidecar.istio.io/inject': {}}}, 'f:spec': {'.': {}, 'f:predictor': {'.': {}, 'f:tensorflow': {'.': {}, 'f:storageUri': {}}}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2021-05-11T21:34:31Z'}, {'apiVersion': 'serving.kubeflow.org/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {}}, 'f:spec': {'f:predictor': {'f:containers': {}, 'f:tensorflow': {'f:args': {}, 'f:command': {}, 'f:image': {}}}}, 'f:status': {'.': {}, 'f:components': {'.': {}, 'f:predictor': {'.': {}, 'f:latestCreatedRevision': {}, 'f:latestReadyRevision': {}, 'f:latestRolledoutRevision': {}, 'f:traffic': {}}}, 'f:conditions': {}}}, 'manager': 'manager', 'operation': 'Update', 'time': '2021-05-11T21:35:05Z'}], 'name': 'mnist-e2e', 'namespace': 'profile', 'resourceVersion': '651911', 'selfLink': '/apis/serving.kubeflow.org/v1beta1/namespaces/profile/inferenceservices/mnist-e2e', 'uid': '69783a07-cb35-41ae-8809-6c2efa15b954'}, 'spec': {'predictor': {'tensorflow': {'name': 'kfserving-container', 'resources': {'limits': {'cpu': '1', 'memory': '2Gi'}, 'requests': {'cpu': '1', 'memory': '2Gi'}}, 'runtimeVersion': '1.14.0', 'storageUri': 'pvc://end-to-end-pipeline-ztsm4-model-volume/'}}}, 'status': {'components': {'predictor': {'latestCreatedRevision': 'mnist-e2e-predictor-default-00002', 'latestReadyRevision': 'mnist-e2e-predictor-default-00001', 'latestRolledoutRevision': 'mnist-e2e-predictor-default-00001', 'traffic': [{'latestRevision': True, 'percent': 100, 'revisionName': 'mnist-e2e-predictor-default-00001'}]}},
'conditions': [{'lastTransitionTime': '2021-05-11T21:35:05Z', 'severity': 'Info', 'status': 'True', 'type': 'PredictorConfigurationReady'}, {'lastTransitionTime': '2021-05-11T21:35:05Z', 'status': 'Unknown', 'type': 'PredictorReady'}, {'lastTransitionTime': '2021-05-11T21:34:36Z', 'severity': 'Info', 'status': 'Unknown', 'type': 'PredictorRouteReady'}, {'lastTransitionTime': '2021-05-11T21:35:05Z', 'status': 'Unknown', 'type': 'Ready'}]}}
```

InferenceService Event Message:

```
Warning InternalError 22m v1beta1Controllers fails to reconcile predictor: fails to update knative service: Operation cannot be fulfilled on services.serving.knative.dev "mnist-e2e-predictor-default": the object has been modified; please apply your changes to the latest version and try again
```
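For anyone debugging a similar failure: the `conditions` list in the status above tells you which component is blocking readiness. A minimal pure-Python sketch (the dict below mirrors the shape of the status printed above, with only the fields needed here):

```python
# Sketch: report which InferenceService conditions are not ready.
# The condition types/values mirror the status dump above.
status = {
    "conditions": [
        {"type": "PredictorConfigurationReady", "status": "True"},
        {"type": "PredictorReady", "status": "Unknown"},
        {"type": "PredictorRouteReady", "status": "Unknown"},
        {"type": "Ready", "status": "Unknown"},
    ]
}

def not_ready(status):
    """Return the condition types whose status is anything other than 'True'."""
    return [c["type"] for c in status.get("conditions", []) if c.get("status") != "True"]

print(not_ready(status))  # ['PredictorReady', 'PredictorRouteReady', 'Ready']
```

Here the predictor configuration is fine but the route and overall readiness never converge, which matches the reconcile error in the event message.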
Thank you for validating this example @zijianjoy!

Thank you, @andreyvelich!
Thank you for providing this information @zijianjoy. It seems that

Also, I tried to deploy the Kubeflow manifest on GCP with the Dex setup: https://github.com/kubeflow/manifests/blob/master/example/kustomization.yaml. @yuzisun Any ideas why

Also cc @theofpa for the KFServing question.
Thank you @andreyvelich for helping to debug. I confirm that the KFServing manager version is v0.5.1. Providing the following logs:
Additional information:
@zijianjoy Sounds like the model volume can't be attached to the KFServing container. Try to remove

In the meantime, please try to describe the PVC to check which resources are using this volume:
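One way to do that check programmatically is to filter pod specs by the PVC's claim name (the same information `kubectl describe pvc` surfaces). A hedged pure-Python sketch over pod dicts in the shape returned by `kubectl get pods -o json`; the pod data below is illustrative, not from the actual cluster:

```python
# Sketch: find pods whose volumes reference a given PersistentVolumeClaim.
def pods_using_pvc(pods, claim_name):
    """Return names of pods that mount the PVC with the given claim name."""
    users = []
    for pod in pods:
        for vol in pod.get("spec", {}).get("volumes", []):
            pvc = vol.get("persistentVolumeClaim")
            if pvc and pvc.get("claimName") == claim_name:
                users.append(pod["metadata"]["name"])
    return users

# Illustrative pod list (names are hypothetical):
pods = [
    {"metadata": {"name": "mnist-e2e-predictor"},
     "spec": {"volumes": [{"persistentVolumeClaim": {"claimName": "model-volume"}}]}},
    {"metadata": {"name": "unrelated-pod"},
     "spec": {"volumes": [{"emptyDir": {}}]}},
]
print(pods_using_pvc(pods, "model-volume"))  # ['mnist-e2e-predictor']
```

If more than one pod holds a ReadWriteOnce volume, that alone can explain an attach failure.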
If I remove the

Logs are:
Sharing the result for this PVC:
Sharing the compiled pipeline definition of this sample: https://gist.github.com/zijianjoy/e96c8cc6533e21e48cf193746831b089
Can you double-check why the TFJob is in the
List of pods in

Describe TFJob pod:

Pod logs:
Could you also describe the TFJob pods, not the Kubeflow Pipelines task?
I see, thanks for the clarification! Sharing the TFJob pod:

And the TFJob:
It seems the problem is with attaching the volume to your pods; it takes a long time. Also, check this: kubernetes/kubernetes#67014 (comment).
I am not aware of any specific constraint on my GCP cluster. I am using https://www.kubeflow.org/docs/distributions/gke/deploy/deploy-cli/ to deploy the Kubeflow cluster. I am not sure whether upgrading the Kubernetes version can help with this issue. Could you help me understand? My current Kubernetes version is 1.18.17-gke.100. Which Kubernetes version are you using that can finish volume mounting successfully?
I am using a GKE cluster with 4 nodes. The version is:
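As a small aside for the version comparison being discussed here: GKE version strings like `1.18.17-gke.100` carry the Kubernetes major/minor version before the `-gke` suffix, so they can be compared with a tiny helper. This is only an illustrative sketch, not anything from the actual deployment:

```python
def k8s_minor(version):
    """Parse (major, minor) from a GKE version string like '1.18.17-gke.100'."""
    core = version.split("-")[0]        # drop the '-gke.100' suffix
    major, minor = core.split(".")[:2]  # keep only major and minor
    return (int(major), int(minor))

# The cluster above runs 1.18.17-gke.100, i.e. below Kubernetes 1.19:
print(k8s_minor("1.18.17-gke.100") >= (1, 19))  # False
```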
Thank you @andreyvelich, those are useful commands to clean up past resources; maybe we can add them to the E2E example documentation. I would like to share an update from my side:

**Environment**

I had a Kubernetes cluster with version

Based on the recommendation, I have upgraded the cluster master version and nodepool node version to

**Action**

Instead of calling KFP to create a run directly, I used the following command to compile to a yaml file, and uploaded it to KFP from the UI. (I set

With this approach, I was able to finish the pipeline run and create the inference service successfully! However, I am not able to build a connection to

I guess it is about the IAP policy requirement for sending requests to the GCP endpoint.

**Limitation**

Note that Kubeflow Pipelines actually doesn't work on Kubernetes 1.19: kubeflow/pipelines#5714. The reason the pipeline can run in the

Another caveat is that calling KFP to create a run directly from the notebook still fails with the same message as described at the beginning of this thread (#271 (comment)), even after I upgraded to k8s 1.19. Below is the command I use to initialize the run:

What is the difference that caused the pipeline run to fail when uploading from the UI vs. from the in-cluster notebook?
Thank you for providing this information @zijianjoy.

Yes, for my GCP cluster I am also using

I believe that if you want to call KFServing inference from outside the K8s cluster (not from Kubeflow Notebooks), you should use another URL. @yuzisun Do we have an example of how to call KFServing inference deployed on GCP?
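For context on what that "other URL" usually looks like: KFServing's v1 protocol exposes `/v1/models/{name}:predict`, and external calls typically go through the cluster's ingress gateway with a `Host` header of the form `{name}.{namespace}.{domain}`. The gateway address and domain below are placeholders, not values from this deployment:

```python
# Hypothetical helper: build the target URL and Host header for a KFServing v1
# predict call made through an ingress gateway from outside the cluster.
def predict_request(name, namespace, gateway_ip, domain="example.com"):
    url = f"http://{gateway_ip}/v1/models/{name}:predict"
    headers = {"Host": f"{name}.{namespace}.{domain}"}
    return url, headers

# Using the InferenceService name and namespace from this thread, with a
# placeholder gateway IP:
url, headers = predict_request("mnist-e2e", "profile", "1.2.3.4")
print(url)              # http://1.2.3.4/v1/models/mnist-e2e:predict
print(headers["Host"])  # mnist-e2e.profile.example.com
```

On a GCP deployment fronted by IAP, the request would additionally need an IAP-authorized identity token, which is consistent with the IAP policy issue mentioned above.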
There should not be any differences. Did you give your Kubeflow Notebook access to KFP (see this comment: kubeflow/pipelines#5138 (comment)) and extend the KFP SDK with the token credentials changes from this PR: kubeflow/pipelines#5287? Do we have any differences in the Volume creation step (
kubeflow/pipelines#5433