Error running pipeline: cannot create tfjobs.kubeflow.org 403 #294

Closed
lluunn opened this issue Nov 15, 2018 · 16 comments
lluunn commented Nov 15, 2018

Steps:

  1. Followed the wiki to deploy pipelines.
  2. Took the code from @amygdala's blog and ran it in a Jupyter notebook to build the tarball.
  3. Uploaded the tarball; the graph looks correct:

[screenshot: pipeline graph]

  4. Ran the pipeline and got this error:
INFO:root:Getting credentials for GKE cluster kfp1.
Fetching cluster endpoint and auth data.
kubeconfig entry generated for kfp1.
INFO:root:Generating training template.
INFO:root:Start training.
ERROR:root:Exception when calling DefaultApi->apis_fqdn_v1_namespaces_namespace_resource_post: tfjobs.kubeflow.org is forbidden: User "system:serviceaccount:kubeflow:pipeline-runner" cannot create tfjobs.kubeflow.org in the namespace "kubeflow"
Traceback (most recent call last):
  File "/ml/train.py", line 230, in <module>
    main()
  File "/ml/train.py", line 188, in main
    create_response = tf_job_client.create_tf_job(api_client, content_yaml, version=kf_version)
  File "/tf-operator/py/tf_job_client.py", line 56, in create_tf_job
    raise e
kubernetes.client.rest.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Date': 'Thu, 15 Nov 2018 22:32:05 GMT', 'Audit-Id': '316cc166-62f2-43aa-926e-8014f66f4b2e', 'Content-Length': '318', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"tfjobs.kubeflow.org is forbidden: User \"system:serviceaccount:kubeflow:pipeline-runner\" cannot create tfjobs.kubeflow.org in the namespace \"kubeflow\"","reason":"Forbidden","details":{"group":"kubeflow.org","kind":"tfjobs"},"code":403
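
A quick way to confirm the denied permission from the command line (assuming the default pipeline-runner service account) is:

# should print "no" before any RBAC fix and "yes" once a suitable binding exists
kubectl auth can-i create tfjobs.kubeflow.org -n kubeflow \
  --as=system:serviceaccount:kubeflow:pipeline-runner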

[screenshot: pipeline run error]

amygdala commented Nov 16, 2018

I think the issue is that you need to run this command or similar before running any pipelines:
kubectl create clusterrolebinding sa-admin --clusterrole=cluster-admin --serviceaccount=kubeflow:pipeline-runner
(there is another issue filed related to this... let me find it).
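
If you'd rather not grant full cluster-admin, a narrower binding along these lines should also work (an untested sketch; the role and binding names are arbitrary):

# cluster role limited to TFJob objects
kubectl create clusterrole tfjob-runner \
  --verb=create,get,list,watch,delete --resource=tfjobs.kubeflow.org
# bind it to the pipeline-runner service account in the kubeflow namespace
kubectl create rolebinding pipeline-runner-tfjobs -n kubeflow \
  --clusterrole=tfjob-runner --serviceaccount=kubeflow:pipeline-runner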

@amygdala

Here's the related issue: #220

lluunn commented Nov 16, 2018

Thanks!
I created the clusterrolebinding.
Now the error is:
py.util.JobTimeoutError: Timeout waiting for job trainer-s46w4 in namespace kubeflow to enter one of the conditions [{u'status': u'True', u'lastUpdateTime': u'2018-11-16T04:11:13Z', u'lastTransitionTime': u'2018-11-16T04:11:13Z', u'reason': u'TFJobCreated', u'message': u'TFJob trainer-s46w4 is created.', u'type': u'Created'}, {u'status': u'True', u'lastUpdateTime': u'2018-11-16T04:11:59Z', u'lastTransitionTime': u'2018-11-16T04:11:13Z', u'reason': u'TFJobRunning', u'message': u'TFJob trainer-s46w4 is running.', u'type': u'Running'}]

[screenshot attached]

@amygdala

Hmm, I have not seen this (and can't repro). :(
Do the logs of the tf-job-operator* pod have any clues?
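
Something along these lines should pull them up (the operator pod name and the TFJob name come from the run above; adjust as needed):

kubectl -n kubeflow get pods | grep tf-job-operator
kubectl -n kubeflow logs <tf-job-operator-pod-name>
# the TFJob object itself may also report a useful status/condition
kubectl -n kubeflow describe tfjob trainer-s46w4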

Jeremy, I have a vague memory of your saying there was some recent issue with tf-job -- could this be related? @jlewi

jlewi commented Nov 26, 2018

@amygdala Only issues I'm aware of are RBAC and IAM.

@IronPan might know more.

jlewi commented Dec 3, 2018

@IronPan @vicaire Any update on this?

vicaire commented Dec 7, 2018

@qimingj @gaoning777, could you please have a look?

jlewi commented Dec 17, 2018

@qimingj @gaoning777 any update on this?

IronPan commented Jan 1, 2019

The RBAC issue should be resolved with the latest version.
https://github.com/kubeflow/kubeflow/blob/master/kubeflow/pipeline/pipeline-apiserver.libsonnet#L322

@lluunn Could you give it another try? If the issue still persists, could you share the pipeline definition and the Kubeflow version with me?
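
To check whether a given deployment already has the updated permissions, inspecting the role bound to pipeline-runner should show the kubeflow.org API group in its rules (the ClusterRole name here is an assumption and may differ across versions):

kubectl get clusterrole pipeline-runner -o yaml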

jlewi commented Jan 7, 2019

@IronPan Is there a pipelines test that covers firing off a TFJob from pipelines? If not, can we open an issue to add such a test and use it to verify the fix is working?

qimingj commented Jan 7, 2019

We currently don't have any samples covering tf-job except for @amygdala's sample. The sample's trainer container (gcr.io/google-samples/ml-pipeline-kubeflow-tf-taxi) was contributed by Amy and also lives outside the pipelines repo.

@amygdala, does your sample still work in latest kubeflow deployment?

I am up for covering tf-job, since it is a key component in Kubeflow.

amygdala commented Jan 7, 2019

Barbara reported that there is a credentials issue (see below) with running my tf-job step in her setup, which was created using the launcher (in contrast to my original instructions, in which I created the GKE cluster nodes with cloud-platform scope, then ran the bootstrapper.yaml).
I'm about to try to repro.
She reports that the issue is NOT fixed by adding the GCP credentials wrapper to the step, but it might be a cluster role binding issue.

(I remember that previously, there was some situation where I needed to run:
kubectl create clusterrolebinding sa-admin --clusterrole=cluster-admin --serviceaccount=kubeflow:pipeline-runner
to fix something, but I can't remember the context.)
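
(A related sanity check on her cluster would be whether the GCP credentials secret that wrapper expects even exists; on the deployments I've used it's named user-gcp-sa, but that's an assumption here:)

kubectl -n kubeflow get secret user-gcp-sa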

[image attached]

amygdala commented Jan 14, 2019

Update part 1: I had no problems with tf-job using a cluster created with 'cloud-platform' scoped nodes and this pipelines bootstrapper: gcr.io/ml-pipeline/bootstrapper:0.1.7.
So this suggests that it's some credentials/scope setup issue rather than some recent release change, which I wanted to rule out first.

Next I'll play around with a launcher-created cluster. Maybe it needs some additional cluster role during setup.

(cc @BasiaFusinska )

IronPan commented Jan 14, 2019

I think the issue is that the tf-job pod fails to talk to GCP services, because the tf-job pod doesn't have the right GCP service account set up in the launcher-based setup.

We need to mount the kubeflow-user GCP service account here.
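
Roughly, the plumbing would look like this (a sketch only; the secret name, key file name, and mount path follow the usual Kubeflow conventions and are assumptions here):

# create the secret holding the GCP service-account key (names are illustrative)
kubectl -n kubeflow create secret generic user-gcp-sa \
  --from-file=user-gcp-sa.json=/path/to/key.json
# the TFJob pod spec then mounts that secret as a volume and sets
# GOOGLE_APPLICATION_CREDENTIALS=/secret/user-gcp-sa.json for the trainer container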

amygdala commented Jan 14, 2019

You're probably right, I'll try that next.

vicaire commented Mar 26, 2019

Closing this issue as a duplicate of #677

vicaire closed this as completed Mar 26, 2019