Can't deploy Pach to GKE default k8s version #2787

Closed
dwhitena opened this issue Mar 16, 2018 · 18 comments

Comments

@dwhitena
Contributor

Pachyderm won't deploy to the latest default version of k8s in GKE. This was reported by a user, and I reproduced the issue. The default version in GKE is 1.8.8-gke.0. For this version or greater, the pachd pod errors out and goes into CrashLoopBackOff with the following service-account-related errors:

time="2018-03-16T20:16:44Z" level=error msg="unable to access kubernetes nodeslist, Pachyderm will continue to work but it will not be possible to use COEFFICIENT parallelism. error: nodes is forbidden: User "system:serviceaccount:default:pachyderm" cannot list nodes at the cluster scope: Unknown user "system:serviceaccount:default:pachyderm""
time="2018-03-16T20:16:44Z" level=error msg="unable to access kubernetes pods, Pachyderm will continue to work but certain pipeline errors will result in pipelines being stuck indefinitely in "starting" state. error: unknown (get pods)"
time="2018-03-16T20:16:44Z" level=error msg="unable to access kubernetes pods, Pachyderm will continue to work but get-logs will not work. error: pods is forbidden: User "system:serviceaccount:default:pachyderm" cannot list pods in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm""
time="2018-03-16T20:16:44Z" level=error msg="unable to create kubernetes replication controllers, Pachyderm will not function properly until this is fixed. error: replicationcontrollers is forbidden: User "system:serviceaccount:default:pachyderm" cannot create replicationcontrollers in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm""
time="2018-03-16T20:16:44Z" level=error msg="unable to delete kubernetes replication controllers, Pachyderm function properly but pipeline cleanup will not work. error: replicationcontrollers "ceb8a1da36ad4700811aa32da3ea8c29" is forbidden: User "system:serviceaccount:default:pachyderm" cannot delete replicationcontrollers in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm""
2018-03-16T20:16:44Z INFO authclient.API.GetCapability {"request":{}}
2018-03-16T20:16:44Z INFO authclient.API.GetCapability {"duration":0.001143887,"request":{},"response":{"capability":"5273272262ac4b06a76752cce2582e35"}}
endpoints "pachd" is forbidden: User "system:serviceaccount:default:pachyderm" cannot get endpoints in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm"

However, if you use --cluster-version 1.7.14-gke.1 or earlier, everything seems to be ok.

To reproduce, follow the GCP deployment docs with Pach version 1.7.0rc2.
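To see at a glance which permissions are being denied, the `forbidden` fragments in logs like the ones above can be condensed into a short list. The following is a minimal sketch, not part of Pachyderm: the function name `summarize_denials` and the regular expression are illustrative assumptions based only on the message format shown in this issue.

```python
import re

# Assumed message shape, taken from the log lines in this issue:
#   User "<user>" cannot <verb> <resource> in the namespace "<ns>"
#   User "<user>" cannot <verb> <resource> at the cluster scope
FORBIDDEN_RE = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) (?P<resource>[\w/.]+) '
    r'(?:in the namespace "(?P<namespace>[^"]+)"|at the cluster scope)'
)


def summarize_denials(log_text):
    """Return unique (user, verb, resource, namespace) tuples found in log_text.

    namespace is None for cluster-scoped denials. The result is sorted by
    user, then resource, then verb.
    """
    denials = set()
    for match in FORBIDDEN_RE.finditer(log_text):
        denials.add((
            match.group("user"),
            match.group("verb"),
            match.group("resource"),
            match.group("namespace"),
        ))
    return sorted(denials, key=lambda d: (d[0], d[2], d[1], d[3] or ""))


# Two fragments copied from the pachd log output in this issue.
SAMPLE = '''
error: nodes is forbidden: User "system:serviceaccount:default:pachyderm" cannot list nodes at the cluster scope: Unknown user "system:serviceaccount:default:pachyderm"
error: pods is forbidden: User "system:serviceaccount:default:pachyderm" cannot list pods in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm"
'''

for user, verb, resource, namespace in summarize_denials(SAMPLE):
    scope = 'namespace "{}"'.format(namespace) if namespace else "cluster scope"
    print("{}: cannot {} {} ({})".format(user, verb, resource, scope))
```

Each distinct denial maps to one verb/resource pair that a role bound to the service account would have to grant.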

@dwhitena dwhitena added the user label Mar 16, 2018
@dwhitena
Contributor Author

Note that in the cases where Pachyderm fails to deploy and I get the above error, I have checked that the service account exists, and it does:

$ kubectl get serviceaccounts
NAME        SECRETS   AGE
default     1         16m
pachyderm   1         26s

@DSchmidtDev

DSchmidtDev commented Mar 20, 2018

This issue is not GKE-specific. It's related to the Kubernetes version and RBAC settings.
The pachyderm service account is missing a role binding with enough permissions to list and create the resources.
If you use the Helm chart, an additional ClusterRole and binding resource with enough permissions are needed.

@dwhitena
Contributor Author

Thanks for the additional info @DSchmidtDev. Just to clarify for the notes here, I didn't deploy with a Helm chart. I used pachctl deploy ..., so I would imagine this is a problem with both routes to deployment @jdoliner.

@jdoliner
Member

@DSchmidtDev do you know which Kubernetes version and RBAC settings create this issue? The ClusterRole has, among other rules, this:

{
      "verbs": [
        "get",
        "list",
        "watch"
      ],
      "apiGroups": [
        ""
      ],
      "resources": [
        "nodes",
        "pods",
        "pods/log",
        "endpoints"
      ]
}

which seems like it should give the required permissions. In addition this works on both 1.8.0 and 1.9.0 kubernetes clusters with rbac enabled.

@DSchmidtDev

DSchmidtDev commented Mar 21, 2018

I'm not sure. I thought RBAC was officially introduced (stable) with Kubernetes v1.8, but it depends on the deployment args. GKE disabled the old authorization method when v1.8 became the default, so since then you have to configure your roles when needed. With v1.6 and v1.7, RBAC and the old authorization could be used in parallel.
Unfortunately I haven't invested enough time in general RBAC settings yet to tell you more.

Your snippet does not cover all permissions.
The ClusterRole needs at least create, update and delete permissions on ReplicationControllers for the pipeline workers.
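For reference, the full rule set can be read off the `attempt to grant extra privileges` error quoted later in this thread. The sketch below rebuilds a ClusterRole manifest from those rules. It is an illustration under assumptions, not the manifest that `pachctl deploy` actually generates: the function name `pachyderm_cluster_role` and the grouping of rules are made up here, and only the resources, verbs, role name and secret name come from that error message.

```python
import json

READ = ["get", "list", "watch"]
READ_WRITE = READ + ["create", "update", "delete"]


def pachyderm_cluster_role(name="pachyderm",
                           api_version="rbac.authorization.k8s.io/v1"):
    """Build a ClusterRole manifest mirroring the rules named in the error output.

    Hypothetical helper: the layout is an assumption; the resources and verbs
    are the ones listed in the "attempt to grant extra privileges" message.
    """
    return {
        "apiVersion": api_version,
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            # Read-only access (includes the snippet quoted above).
            {"apiGroups": [""],
             "resources": ["nodes", "pods", "pods/log", "endpoints"],
             "verbs": READ},
            # Pipeline workers are managed through these resources.
            {"apiGroups": [""],
             "resources": ["replicationcontrollers", "services"],
             "verbs": READ_WRITE},
            # Access limited to the single storage secret.
            {"apiGroups": [""],
             "resources": ["secrets"],
             "resourceNames": ["pachyderm-storage-secret"],
             "verbs": READ_WRITE},
        ],
    }


# JSON is valid input for `kubectl apply -f -`.
print(json.dumps(pachyderm_cluster_role(), indent=2))
```

The role only takes effect once a ClusterRoleBinding attaches it to the pachyderm service account.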

@brycemcanally
Contributor

This issue comes up if you use the --no-rbac flag when deploying to GKE version 1.8 or later. Using this flag removes the role bindings for the pachyderm service account from the manifest. The docs can be a little misleading in this regard, so we are going to update them to reflect this information.

@isabella

isabella commented Apr 1, 2018

I get the error described above with the --no-rbac option.

$ pachctl deploy google ${BUCKET_NAME} ${STORAGE_SIZE} --dynamic-etcd-nodes=1 --no-rbac
time="2018-04-01T20:45:03Z" level=error msg="unable to access kubernetes nodeslist, Pachyderm will continue to work but it will not be possible to use COEFFICIENT parallelism. error: nodes is forbidden: User "system:serviceaccount:default:pachyderm" cannot list nodes at the cluster scope: Unknown user "system:serviceaccount:default:pachyderm""
time="2018-04-01T20:45:03Z" level=error msg="unable to access kubernetes pods, Pachyderm will continue to work but certain pipeline errors will result in pipelines being stuck indefinitely in "starting" state. error: unknown (get pods)"
time="2018-04-01T20:45:03Z" level=error msg="unable to access kubernetes pods, Pachyderm will continue to work but get-logs will not work. error: pods is forbidden: User "system:serviceaccount:default:pachyderm" cannot list pods in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm""
time="2018-04-01T20:45:03Z" level=error msg="unable to create kubernetes replication controllers, Pachyderm will not function properly until this is fixed. error: replicationcontrollers is forbidden: User "system:serviceaccount:default:pachyderm" cannot create replicationcontrollers in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm""
time="2018-04-01T20:45:03Z" level=error msg="unable to delete kubernetes replication controllers, Pachyderm function properly but pipeline cleanup will not work. error: replicationcontrollers "df15acbfa5644dd49d684d1796fa1921" is forbidden: User "system:serviceaccount:default:pachyderm" cannot delete replicationcontrollers in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm""
2018/04/01 20:45:03 INFO: Listening on addr: :999 path: /v1/handle/push
endpoints "pachd" is forbidden: User "system:serviceaccount:default:pachyderm" cannot get endpoints in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm"

However, without the --no-rbac option, I get the following error.

$ pachctl deploy google ${BUCKET_NAME} ${STORAGE_SIZE} --dynamic-etcd-nodes=1
Error from server (Forbidden): error when creating "STDIN": clusterroles.rbac.authorization.k8s.io "pachyderm" is forbidden: attempt to grant extra privileges: [PolicyRule{Resources:["nodes"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["nodes"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["nodes"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["pods"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["pods"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["pods"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["pods/log"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["pods/log"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["pods/log"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["endpoints"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["endpoints"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["endpoints"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["replicationcontrollers"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["replicationcontrollers"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["replicationcontrollers"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["replicationcontrollers"], APIGroups:[""], Verbs:["create"]} PolicyRule{Resources:["replicationcontrollers"], APIGroups:[""], Verbs:["update"]} PolicyRule{Resources:["replicationcontrollers"], APIGroups:[""], Verbs:["delete"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["create"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["update"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["delete"]} PolicyRule{Resources:["secrets"], ResourceNames:["pachyderm-storage-secret"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["secrets"], 
ResourceNames:["pachyderm-storage-secret"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["secrets"], ResourceNames:["pachyderm-storage-secret"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["secrets"], ResourceNames:["pachyderm-storage-secret"], APIGroups:[""], Verbs:["create"]} PolicyRule{Resources:["secrets"], ResourceNames:["pachyderm-storage-secret"], APIGroups:[""], Verbs:["update"]} PolicyRule{Resources:["secrets"], ResourceNames:["pachyderm-storage-secret"], APIGroups:[""], Verbs:["delete"]}] user=&{Dagnytaggartindustrialist@gmail.com  [system:authenticated] map[authenticator:[GKE]]} ownerrules=[PolicyRule{Resources:["selfsubjectaccessreviews"], APIGroups:["authorization.k8s.io"], Verbs:["create"]} PolicyRule{NonResourceURLs:["/api" "/api/*" "/apis" "/apis/*" "/healthz" "/swagger-2.0.0.pb-v1" "/swagger.json" "/swaggerapi" "/swaggerapi/*" "/version"], Verbs:["get"]}] ruleResolutionErrors=[]
kubectl apply -f - --validate=false: exit status 1
$ pachctl version --client-only
1.7.0
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T10:09:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.8-gke.0", GitCommit:"6e5b33a290a99c067003632e0fd6be0ead48b233", GitTreeState:"clean", BuildDate:"2018-02-16T18:26:58Z", GoVersion:"go1.8.3b4", Compiler:"gc", Platform:"linux/amd64"}

Then, when setting the Kubernetes cluster version with --cluster-version 1.7.14-gke.1, I get:

Error from server (BadRequest): error when creating "STDIN": ClusterRole in version "v1" cannot be handled as a ClusterRole: no kind "ClusterRole" is registered for version "rbac.authorization.k8s.io/v1"
Error from server (BadRequest): error when creating "STDIN": ClusterRoleBinding in version "v1" cannot be handled as a ClusterRoleBinding: no kind "ClusterRoleBinding" is registered for version "rbac.authorization.k8s.io/v1"

@brycemcanally
Contributor

This looks like you do not have the permissions to create roles in your cluster. You can make yourself cluster admin with this command:

kubectl create clusterrolebinding cluster-admin-binding \
--clusterrole cluster-admin --user $(gcloud config get-value account) 

You will probably want to avoid the --no-rbac flag.
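Before and after creating the binding, `kubectl auth can-i` can confirm both your own permissions and those of the pachyderm service account. The sketch below only builds the command lines. The helper names `service_account_user` and `can_i_command` are hypothetical; `kubectl auth can-i`, `--namespace` and `--as` are standard kubectl, and impersonating with `--as` assumes your own account is allowed to impersonate (cluster-admin is).

```python
import shlex


def service_account_user(namespace, name):
    """Return the RBAC user name of a service account, as seen in the errors."""
    return "system:serviceaccount:{}:{}".format(namespace, name)


def can_i_command(verb, resource, namespace=None, as_user=None):
    """Build a `kubectl auth can-i` command line as an argument list.

    Hypothetical helper: it only assembles arguments and runs nothing.
    """
    cmd = ["kubectl", "auth", "can-i", verb, resource]
    if namespace is not None:
        cmd += ["--namespace", namespace]
    if as_user is not None:
        cmd.append("--as={}".format(as_user))
    return cmd


checks = [
    # Can the deploying user create the ClusterRole at all?
    can_i_command("create", "clusterroles"),
    # Can the pachyderm service account do what pachd needs?
    can_i_command("list", "pods", namespace="default",
                  as_user=service_account_user("default", "pachyderm")),
    can_i_command("create", "replicationcontrollers", namespace="default",
                  as_user=service_account_user("default", "pachyderm")),
]

for cmd in checks:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each printed command answers `yes` or `no` when run against the cluster. A `no` for the first check corresponds to the `attempt to grant extra privileges` failure, and a `no` for the others to the `forbidden` errors in the pachd logs.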

@isabella

isabella commented Apr 2, 2018

That didn't fix it either.
The pachd logs are:

Unknown user "system:serviceaccount:default:pachyderm""
time="2018-04-02T04:08:04Z" level=error msg="unable to access kubernetes pods,
Pachyderm will continue to work but certain pipeline errors will result in pipelines
being stuck indefinitely in "starting" state. error: unknown (get pods)"

@brycemcanally
Contributor

Did you switch to using a 1.7 version of GKE? If you did, you would need the --no-rbac flag because role based access control is not the default for that version. The error message you were getting in your second deployment attempt was because the deployment was trying to create the pachyderm service account and grant it privileges that you did not have.

@isabella

isabella commented Apr 2, 2018

I'm using the 1.8.8-gke.0 default version. I'm following these steps: http://docs.pachyderm.io/en/latest/deployment/google_cloud_platform.html

With the following kubectl create rolebinding pach-admin --clusterrole=cluster-admin --serviceaccount=default:pachyderm --namespace=default, I was able to get pachd to start.

@brycemcanally
Contributor

Okay. If you have already tried a clean deployment, you might want to jump into our Slack users channel and lay out your situation. Making yourself cluster admin should have gotten you past the issue you were having in the original post.

@alanz

alanz commented Apr 2, 2018

See also #2787

I resorted to starting against 1.7.14-gke.1

@JoeyZwicker JoeyZwicker reopened this Apr 2, 2018
@najibninaba

This is the workaround that worked for me with the current default version 1.8.8-gke.0 in GCP and Pachyderm version 1.7.0. I also installed Pachyderm in a separate namespace. The installation uses the default RBAC setup as per the official documentation[1].

[1] http://pachyderm.readthedocs.io/en/latest/deployment/google_cloud_platform.html

The workaround is as follows, after running the Pachyderm deployment steps:

$ kubectl delete clusterrolebinding pachyderm
$ kubectl create clusterrolebinding pachyderm --clusterrole=cluster-admin --serviceaccount=pachyderm:pachyderm --namespace=pachyderm --user=system:serviceaccount:default:pachyderm
$ kubectl delete pods --all

The key thing is that the user is set to system:serviceaccount:default:pachyderm and given the cluster-admin role. Is this something that can be set for the serviceaccount settings somewhere?
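On the closing question: in Kubernetes RBAC, permissions are attached to a service account through the `subjects` of a binding rather than on the ServiceAccount object itself, and each ServiceAccount subject names a namespace. The sketch below builds a ClusterRoleBinding whose subject is the service account in the namespace where Pachyderm actually runs, and binds the narrower `pachyderm` ClusterRole instead of `cluster-admin`. This is an assumption-laden illustration: the helper name `cluster_role_binding` is hypothetical, and the thread does not confirm what the binding generated by `pachctl deploy` contained or that a namespace mismatch was the cause.

```python
import json


def cluster_role_binding(binding_name, cluster_role, sa_name, sa_namespace):
    """Build a ClusterRoleBinding manifest for one service account.

    Hypothetical helper. The resulting subject corresponds to the RBAC user
    system:serviceaccount:<sa_namespace>:<sa_name>.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": binding_name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": cluster_role,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": sa_name, "namespace": sa_namespace},
        ],
    }


# Assumed values: Pachyderm deployed into a namespace called "pachyderm",
# with the ClusterRole and service account both named "pachyderm".
binding = cluster_role_binding("pachyderm", "pachyderm", "pachyderm", "pachyderm")
print(json.dumps(binding, indent=2))
```

The JSON can be piped to `kubectl apply -f -`. As the errors earlier in this thread show, `rbac.authorization.k8s.io/v1` is not registered on the 1.7.14-gke.1 cluster, so this form applies to 1.8 and later.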

@stevef1uk

stevef1uk commented Apr 30, 2018

I got past this stage on GKE by following the above. Now, when following the tutorial, I can create a repo, but when attempting to add the blah.txt file I get:

$ pachctl put-file myrepo master -c -f blah.txt
googleapi: Error 403: Insufficient Permission, insufficientPermissions

From the web GUI, I have manually added a file to the bucket I used when deploying Pachyderm.

How do I update the permissions for the service account?

Hmm. This may be my problem as I configured the Kubernetes cluster from the GUI and didn't set any scopes :-(

https://stackoverflow.com/questions/29837531/changing-permissions-of-google-container-engine-cluster

So I had to delete the previous K8s cluster and create another one. This time I clicked on the More tab and set Full access to the Storage API, and all is OK :-)

@jdoliner
Member

@stevef1uk the most likely cause here is that the IAM role for the cluster doesn't have the proper permissions. You can change that in the GCE web console.

@Nick-Harvey
Contributor

We should update the docs with the steps needed to ensure the GCP service account has the appropriate permissions for Pachyderm to deploy properly.

@mindthevirt mindthevirt added this to To Do in Documentation via automation Feb 10, 2021
@npepin-hub npepin-hub moved this from Backlog to To Do in Documentation May 16, 2021
Documentation automation moved this from To Do to Done May 24, 2021
@mindthevirt
Contributor

This is no longer an issue
