Cannot run pipeline samples in GCP IAP Deployment #2773

Closed
bruce3557 opened this issue Dec 25, 2019 · 17 comments

@bruce3557

What happened:
We cannot run the pipeline samples. It seems that gcloud-related commands cannot obtain workload identity credentials correctly. The error messages are:

ERROR: (gsutil) timed out
This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.

What did you expect to happen:
The pipeline samples should run successfully.

What steps did you take:
Created a run and an experiment.

Anything else you would like to add:
I tried this sample and still could not get a correct result:
https://github.com/kubeflow/pipelines/blob/master/samples/core/secret/secret.py

@bruce3557 (Author)

When I retried the deployment, the message changed to:

AccessDeniedException: 403 Primary: /namespaces/dcard-data.svc.id.goog with additional claims does not have storage.objects.list access to dcard--bruce.

@bruce3557 (Author)

Not sure whether this is related: when I run the secret sample, I get the messages below. It seems that the Cloud SDK cannot connect to metadata.google.internal.

List of buckets:
Traceback (most recent call last):
  File "<string>", line 6, in <module>
  File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 212, in _items_iter
    for page in self._page_iter(increment=False):
  File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 243, in _page_iter
    page = self._next_page()
  File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 369, in _next_page
    response = self._get_next_page_response()
  File "/usr/local/lib/python2.7/dist-packages/google/api_core/page_iterator.py", line 419, in _get_next_page_response
    method=self._HTTP_METHOD, path=self.path, query_params=params
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 417, in api_request
    timeout=timeout,
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 275, in _make_request
    method, url, headers, data, target_object, timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/_http.py", line 313, in _do_request
    url=url, method=method, headers=headers, data=data, timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/google/auth/transport/requests.py", line 277, in request
    self.credentials.before_request(auth_request, method, url, request_headers)
  File "/usr/local/lib/python2.7/dist-packages/google/auth/credentials.py", line 124, in before_request
    self.refresh(request)
  File "/usr/local/lib/python2.7/dist-packages/google/auth/compute_engine/credentials.py", line 102, in refresh
    six.raise_from(new_exc, caught_exc)
  File "/usr/lib/python2.7/dist-packages/six.py", line 737, in raise_from
    raise value
google.auth.exceptions.RefreshError: HTTPConnectionPool(host='metadata.google.internal', port=80): Read timed out. (read timeout=120)

@bruce3557 (Author)

I also found this related issue: googleapis/google-auth-library-python#211

@bruce3557 (Author)

After binding workload identity to the pipeline-runner service account in the kubeflow namespace, I can read data via gcloud commands, but they still time out after a few minutes.
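
For reference, the binding described above looks roughly like this (a minimal sketch; PROJECT_ID and GSA_NAME are placeholders, not values from this thread):

# Allow the kubeflow/pipeline-runner Kubernetes service account to
# impersonate the Google service account via Workload Identity.
gcloud iam service-accounts add-iam-policy-binding \
  GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[kubeflow/pipeline-runner]"

# Annotate the Kubernetes service account so GKE knows which
# Google service account it maps to.
kubectl annotate serviceaccount pipeline-runner -n kubeflow \
  iam.gke.io/gcp-service-account=GSA_NAME@PROJECT_ID.iam.gserviceaccount.com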

@parthmishra (Contributor)

> I can read data via gcloud commands, but they still time out after a few minutes.

Maybe this has to do with how gcloud obtains/refreshes credentials? Even when using the old secret method (e.g. .apply(gcp.use_gcp_secret("user-gcp-sa"))), I still get the timeouts and have to rely on setting the retry attempts for the component.

@bruce3557 (Author)

bruce3557 commented Dec 28, 2019

About the timeout problem, I think that is a GKE problem. It uses the default credential client, and the credentials time out after around 1 hour.
But I think binding workload identity to pipeline-runner works for Kubeflow.

@parthmishra I tried that, but it didn't work because of the gcloud SDK implementation.

@wronk

wronk commented Jan 2, 2020

@bruce3557, I'm also running into this on some training experiments (using Katib outside pipelines). I end up with the same error when trying to download training data:

google.auth.exceptions.TransportError: HTTPConnectionPool(host='metadata.google.internal', port=80): Read timed out. (read timeout=120)

Please post back if you find a fix.

@bruce3557 (Author)

bruce3557 commented Jan 3, 2020

@wronk I found a workaround that prevents this problem, via kubeflow issue #4607: restart the metadata pods regularly (around every half hour).
The command is:
kubectl delete pods -n kube-system --selector=k8s-app=gke-metadata-server

Until GCP fixes the issue, I don't think we can do anything else.
The related GCP issue is here:
https://issuetracker.google.com/issues/146622472
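
If you want to automate the workaround above, a simple loop is enough (a rough sketch, not from the original thread; 1800 seconds matches the half-hour interval mentioned above):

# Restart the GKE metadata server pods every 30 minutes as a stopgap.
while true; do
  kubectl delete pods -n kube-system --selector=k8s-app=gke-metadata-server
  sleep 1800
done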

@Bobgy (Contributor)

Bobgy commented Jan 20, 2020

As mentioned in the GCP issue, did you try the workarounds?

There are 2 workarounds:

1) Disable workload identity
2) Downgrade GKE to a version that uses 0.2.13 of the GKE metadata server (1.14.8-gke.18)

The downgrade workaround has been working well for me, using the following command:
    gcloud container clusters upgrade <cluster-name> --master --cluster-version 1.14.8-gke.17

@yantriks-edi-bice

@Bobgy I get the following error when trying to downgrade:

Master of cluster [xxxxx] will be upgraded from version [1.14.9-gke.2] to version [1.14.8-gke.17]. This operation is long-running and will block other operations on the cluster (including delete) until it has run to completion.
Do you want to continue (Y/n)?
ERROR: (gcloud.container.clusters.upgrade) ResponseError: code=400, message=Master version "1.14.8-gke.17" is unsupported.
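
As an aside (not from the original thread): you can list the master versions your zone currently supports before attempting a downgrade; the zone below is only an example.

# Show valid master and node versions for the given zone.
gcloud container get-server-config --zone us-central1-a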

@yantriks-edi-bice

> But I think binding workload identity to pipeline-runner works for Kubeflow.

I don't yet understand how all of Kubeflow is set up, but I am wondering about the effect such a change would have on the other components. Would they continue to work, assuming pipelines work?

@numerology

AFAIK there is an ongoing issue related to a recent GKE release. Will keep this thread updated.

@Bobgy (Contributor)

Bobgy commented Jan 28, 2020

> ERROR: (gcloud.container.clusters.upgrade) ResponseError: code=400, message=Master version "1.14.8-gke.17" is unsupported.

It means a new patch version has been released. The new 1.14.8-gke.x probably already has the fix.

@yantriks-edi-bice

yantriks-edi-bice commented Jan 28, 2020

@Bobgy thanks - I found that the latest in the 1.14.8 series is 1.14.8-gke.33 and used your command to upgrade from the earlier Kubeflow 0.7 default version. I am still getting this error, though, and cluster-user has the Storage Admin role:

  File "kfp_component/google/dataflow/_launch_python.py", line 58, in launch_python
    job_id, location = read_job_id_and_location(storage_client, staging_location)
  File "kfp_component/google/dataflow/_common_ops.py", line 99, in read_job_id_and_location
    if job_blob.exists():
  File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 404, in exists
    _target_object=None,
  File "/usr/local/lib/python2.7/site-packages/google/cloud/_http.py", line 319, in api_request
    raise exceptions.from_http_response(response)
google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/storage/v1/b/edi_bice/o/kubeflow%2Fpipelines%2F378a9083ca79da0fc8b315b96dd965d8%2Fkfp%2Fdataflow%2Flaunch_python%2Fjob.txt?fields=name: Primary: /namespaces/xxx-xx-xxx.svc.id.goog with additional claims does not have storage.objects.get access to edi_bice/kubeflow/pipelines/378a9083ca79da0fc8b315b96dd965d8/kfp/dataflow/launch_python/job.txt.

@Bobgy (Contributor)

Bobgy commented Mar 5, 2020

@yantriks-edi-bice Sorry for the late notice; you probably also need to upgrade your google/cloud-sdk client versions, as mentioned in #3069 (comment).
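
In practice (a rough sketch, not from the original thread; the image tag is only an example), upgrading the clients can mean rebuilding component images on a newer google/cloud-sdk base and updating the Python auth libraries:

# Pull a newer Cloud SDK base image for component containers (example tag).
docker pull google/cloud-sdk:279.0.0

# Or, inside an existing image, update the SDK and the Python clients.
gcloud components update --quiet
pip install --upgrade google-auth google-cloud-storage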

@Bobgy (Contributor)

Bobgy commented Mar 5, 2020

It seems the original issue is a GKE workload identity problem, closing now.
/close

@k8s-ci-robot (Contributor)

@Bobgy: Closing this issue.

In response to this:

> It seems the original issue is a GKE workload identity problem, closing now.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

magdalenakuhn17 pushed a commit to magdalenakuhn17/pipelines that referenced this issue Oct 22, 2023