pachd fails: panic: failed to initialize pach client: context deadline exceeded #4432
Comments
@benwbooth sorry you're hitting this. Can you double check to make sure etcd is healthy and running as well as the perms to your object store bucket? Being unable to connect to one of those is the most common place where this error happens as in #4424 |
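For anyone following along, here is a minimal sketch of the etcd half of that check. It assumes etcd's HTTP `/health` endpoint is reachable from wherever the script runs (for example through a port-forward); the function name `etcd_is_healthy` and the example URL are hypothetical, and the sketch says nothing about object store permissions.

```python
import json
import urllib.error
import urllib.request


def etcd_is_healthy(base_url, timeout=3.0):
    """Return True if GET <base_url>/health reports a healthy etcd member."""
    url = base_url.rstrip("/") + "/health"
    # Ignore proxy environment variables; the check targets a cluster-local address.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable, timed out, HTTP error status, or a non-JSON reply.
        return False
    # etcd reports health as the string "true"; accept a boolean as well.
    return isinstance(body, dict) and body.get("health") in ("true", True)


# Hypothetical usage, assuming etcd is port-forwarded to localhost:2379:
# print(etcd_is_healthy("http://localhost:2379"))
```

A `False` result only says the endpoint did not answer as healthy within the timeout; the etcd logs are still the place to look for the reason.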
The etcd server looks good as far as I can tell. Here are the etcd logs:
|
I am using an on-premises Rook/Ceph cluster as the object store. I have the access key, secret key, endpoint, bucket name all set correctly. I'm not sure what else to try. |
I still haven't been able to get The RBAC permissions of the helm chart did not allow pachd to query the Kubernetes API to get the nodes when using the helm chart, so I first installed using |
+1 Running into exactly the same problem with the same version of pachyderm. However, I am trying to use minio as the object store instead of Rook/ceph on-prem. I posted the logs and other details on the slack channel. I used the following command to generate the deployment manifest.
I noticed that pachctl generates the deployment manifest with STORAGE_BACKEND set to AMAZON, so I updated it to MINIO. I also updated the secret generated by pachctl to use minio-* keys instead of amazon-* keys. I'm not sure what is missing at this point. Can you please refer us to some documentation on how to generate a deployment manifest using pachctl for on-prem use cases where the object store is MINIO or ceph? It appears |
Scratch that-- the helm chart still has RBAC issues when attempting to run a pipeline:
|
I've tested the object store using s3cmd with the same accesskey/secretkey/endpoint that pachyderm is configured to use. I was able to get and retrieve an object without any issues. |
@suman724 I think you are supposed to use |
@benwbooth I fixed the typo in my previous command. Yes, I am using object-store as s3 and then modifying the generated yaml file. No, it is still not working. I see the same stack trace that you have posted on this thread. |
The problem seems to be here in
It's timing out waiting for the error group. |
Some better error messages would really be helpful to diagnose the issue. #4424 has the same error message but is really no help for diagnosing the problem |
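To illustrate the kind of message that would help, here is a small Python analogy (not pachd's Go code). It runs several named startup checks under one shared deadline and reports which check failed or never finished, instead of a single "context deadline exceeded". The function name `run_startup_checks` and the check names in the usage comment are hypothetical.

```python
import concurrent.futures


def run_startup_checks(checks, timeout):
    """Run named callables concurrently and return {name: outcome}.

    Each outcome is "ok", "failed: <error>", or "timed out".
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    # One shared deadline for all checks, similar to a single context timeout.
    concurrent.futures.wait(futures.values(), timeout=timeout)
    results = {}
    for name, fut in futures.items():
        if not fut.done():
            results[name] = "timed out"
        elif fut.exception() is not None:
            results[name] = "failed: %r" % (fut.exception(),)
        else:
            results[name] = "ok"
    # Threads cannot be cancelled; timed-out checks keep running in the background.
    pool.shutdown(wait=False)
    return results


# Hypothetical usage with user-supplied check functions:
# print(run_startup_checks({"etcd": check_etcd, "object store": check_bucket}, timeout=30))
```

With per-check outcomes, a hung object store connection would show up as `"object store": "timed out"` rather than as an anonymous deadline error.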
@benwbooth @suman724 We are taking a closer look at this issue. |
Thank you @nitinjainsj . I created the issue #4437 with detailed logs and deployment manifest I used. |
@suman724 @benwbooth We are attempting to recreate this. Does this work in any older pachyderm version like 1.9.8? |
@benwbooth in the original issue, is that the full set of panic tracebacks? Did you see a traceback with |
@ysimonson I ran it again on pachd v1.9.9 and got this traceback. Don't see a main.go:
If I change the
I'm using a self-signed certificate on the object store. The common name is set to the URL that I'm using to access the object store. Does pachyderm not allow self-signed certs? |
I've tried this with SSL disabled on the object store, and got a different error with 1.9.8 saying that pachd was trying to speak HTTPS but the object store was trying to speak HTTP. It looks like pachd does not allow unencrypted object stores, is that correct? On v1.9.9 I didn't get any useful error messages, just the stack trace. |
What do I need to do to get pachyderm to connect to my object store using a self-signed cert? Or is there a way to allow pachyderm to connect to an unencrypted object store? |
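While waiting for an answer, a quick probe can at least show how the endpoint responds to a normally verified TLS handshake. This is a minimal sketch using only Python's standard library, independent of pachd; the function name `classify_endpoint`, the result labels, and the host in the usage comment are hypothetical.

```python
import socket
import ssl


def classify_endpoint(host, port, timeout=3.0):
    """Classify how host:port answers a TLS handshake with normal verification."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return "tls-verified"
    except ssl.SSLCertVerificationError:
        # Self-signed or otherwise untrusted certificate, or a hostname mismatch.
        return "tls-untrusted-cert"
    except ssl.SSLError:
        # The peer answered, but not with TLS (for example plain HTTP).
        return "no-tls"
    except OSError:
        # Connection refused, DNS failure, or timeout.
        return "unreachable"


# Hypothetical usage:
# print(classify_endpoint("objectstore.example.com", 443))
```

A result of `tls-untrusted-cert` on port 443 would be consistent with a client that verifies certificates rejecting the self-signed one; `no-tls` would match the HTTPS-to-HTTP mismatch seen with SSL disabled.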
I tried version 1.9.8 as well. I disabled HTTPS on Minio. I therefore ran pachctl as below to generate the deployment manifest and make pachd work with non-secure minio.
I am now passing --isS3V2 to make a non-secure connection to minio. If I do not pass --isS3V2, I run into the same error @benwbooth mentioned here. With that extra parameter, pachd started successfully.
So, the original problem (pachd service startup failure) does appear to be only in 1.9.9 in my environment. |
@suman724 @benwbooth 1.9.10 has additional logging to help debug this further. @suman724 I want to clarify: if you pass the --isS3V2 flag, does it work in 1.9.8 and fail in 1.9.9? |
@nitinjainsj In version 1.9.9, pachd was failing with the following stack trace.
When I used version 1.9.8, I think it got past the failure that occurred in 1.9.9 and actually tried to connect to minio, which exposes only http in my test environment. However, pachd was attempting to make an https connection. In 1.9.8 an error message about this was printed in the logs; no such errors were printed in 1.9.9. To resolve this handshake problem, I used |
I was able to get pachyderm to deploy correctly and connect to my object storage using pachyderm 1.9.10:
However, I had to use an unencrypted object store on port 80. If I tried to use port 443 instead, then pachd would never become ready. Is this because of the self-signed cert? It would be nice if there was a way for pachyderm to support object stores that use self-signed certificates. |
I think it might make sense to address the set of issues you two (@benwbooth and @suman724) are hitting in this one issue instead of splitting it between this issue and #4437 and #4466.
Just to give a little bit of background on the object storage client stuff: we are trying, in general, to move away from using the minio client libraries for custom deployments and instead use the aws client libraries across the board. The only time the minio client libraries are used is when V2 signatures are needed (the
So, do you guys need V2 signing or not? If not, then let's just focus on getting the aws client libraries working (I am going to assume V2 signing is not needed for the following stuff, but we can try and figure something out if V2 signing is needed).
If your object storage provider has ssl disabled then you should be able to prefix the endpoint with |
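As a rough illustration of the endpoint-prefix idea, here is a sketch of how a client could derive "use TLS or not" from an endpoint string. It is not pachd's or the aws client libraries' actual parsing; the function name `parse_endpoint`, the default of TLS when no scheme is given, and the default ports are all assumptions.

```python
from urllib.parse import urlsplit


def parse_endpoint(endpoint, default_secure=True):
    """Split an endpoint string into (host, port, secure)."""
    has_scheme = "://" in endpoint
    # Without a scheme, prepend "//" so urlsplit treats the text as host[:port].
    parts = urlsplit(endpoint if has_scheme else "//" + endpoint)
    if has_scheme and parts.scheme not in ("http", "https"):
        raise ValueError("unsupported scheme: %r" % parts.scheme)
    if parts.hostname is None:
        raise ValueError("no host in endpoint: %r" % endpoint)
    # An explicit scheme decides; otherwise fall back to the assumed default.
    secure = (parts.scheme == "https") if has_scheme else default_secure
    port = parts.port if parts.port is not None else (443 if secure else 80)
    return parts.hostname, port, secure


# Hypothetical usage:
# parse_endpoint("http://minio.example:9000")
```

Under these assumptions, an endpoint given as `http://...` is treated as unencrypted, while a bare `host:port` is treated as TLS.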
I tried deploying pachyderm again, this time using http:// and omitting the --isS3V2 flag. This is the command-line I used:
Here are the logs from pachd:
|
I'm having the same issue. I also tried using http:// and omitting the --isS3V2 flag and still received: INFO error starting githook server context deadline exceeded |
Hey @benwbooth @suman724 @SolitaryThinker , we have a new release (1.9.11) that contains improved logging and adds more configuration that may help address your issues (such as disabling ssl or skipping certificate verification). It would be great to get an update on whether you are/were able to resolve the issue(s) or whether this helps identify the root issue(s). |
Closing this since there has not been a response for a while. Feel free to reopen with logs from a release >=1.9.12. |
What happened?:
Ran pachctl deploy to create an on-premises pachyderm cluster:
pachd is failing to start up and is reporting the following in the logs:
What you expected to happen?:
pachd should load successfully
How to reproduce it (as minimally and precisely as possible)?:
Anything else we need to know?:
Environment?:
kubectl version:
pachctl version: