Getting error while running workflow on kubernetes #244
Hi, I'm using minikube to set up a single-node Kubernetes cluster. What installations are required? I have followed the instructions in https://github.com/mlrun/mlrun/blob/master/hack/local/README.md
Hi @yaronha and @tebeka, do I need to set up and install Kubeflow on the Kubernetes cluster separately? It is not mentioned anywhere, and I'm getting the above-mentioned error while deploying the workflow pipeline on the Kubernetes cluster. Can you also please help me with the questions asked in my previous comment?
@narendra36 you can start without Kubeflow. Kubeflow is needed for the Pipelines and for CRDs like MPIJob, and it should be installed on the cluster.
Hi @yaronha, thanks for the reply. I have installed Kubeflow Pipelines on my GCP Kubernetes cluster and was able to solve the above error, but I'm still getting 2 major issues while deploying the sample sklearn-pipeline.
Issue 1: The cell shown in the image above runs successfully, and I have checked that the artifact dataset file was created, but when I run the describe function cell it gives me the following error. Can you please tell me what I am missing here?
Issue 2: When I deploy the Kubeflow pipeline there is no error message, but the pod that executes the pipeline fails. The error logs are here. The URL mentioned in the output of the cell returns 404. Cell output: Here are the details of the error logs of the failed pod for the pipeline deployment.
Please help me resolve these errors, because I'm unable to explore MLRun further without resolving them. Thanks in advance 😄 @yaronha and @tebeka, can anyone point me in the right direction to solve these issues, and what could be the cause?
for issue 1 try (if you followed the instructions):
Hi @yjb-ds, yes, issue 1 is fixed with the changes you mentioned. I think there should be more documentation about functions like mount_pvc. Thank you for your help :) I have also applied the same fix in the pipeline workflow code, but issue 2 still persists. I have followed the instructions but don't know what else I'm missing. This is the error message from the failed pod: Detailed logs of the pipeline workflow pod are here. @yjb-ds, will the above-mentioned changes work for the pipeline workflow as well?
@narendra36 looks like the builder pod (the first step) failed to store the |
If it is the docker setup, then you might try the following:
Hi @yaronha, as you rightly said, the problem is with the docker registry. I have already set up a Docker Hub account and added an access token while creating the my-docker secret, but the error remains the same. I have also followed the instructions mentioned by @yjb-ds, but the issue is still there. @yjb-ds, should the value of DEFAULT_DOCKER_SECRET be the name of the created docker secret ('my-docker') or the access token from Docker Hub?
@yaronha, here are the logs of the first step of the pipeline: Here is my mlrun-local.yaml
@narendra36 you should set both |
@yaronha, thanks for your quick reply. I tried again, setting the registry per function as you suggested, and the first step works fine. The 2nd step, 'get-data', fails with an error whose reason is "forbidden". I think it is trying to create a resource pod in the default-tenant namespace, but my complete deployment is under the kubeflow namespace, and I have also set the namespace parameter to kubeflow while running the pipeline, as below.
Error logs from Kubeflow pipeline UI:
@yjb-ds @yaronha |
@narendra36 we are getting there :) I'm adding more docs based on those issues. You should set the default mlrun namespace
Hi @yaronha, yes, please add more docs regarding these issues and the setup. I will also share the installation details once everything has been set up completely. Hopefully we will be there soon :) I have configured the namespace as you mentioned, but no luck: the same error still persists. What could be the other reason? I searched for the namespace value 'default-tenant' and found that one of the occurrences is inside the runtimes/mpijob.py file.
@narendra36 I found an issue with kfp namespaces and issued a fix. We have a new version with many new features coming in 1-2 days that will include it. You can try using the dev branch:
Hi @yaronha and @yjb-ds, that's good to know. For now, I have installed mlrun from the development branch as suggested by @yaronha. When I start the workflow pipeline, the Kubeflow Pipelines UI and the Kubernetes pod status show the progress below: Here are the logs for the first pod, which completed without any error. Step 2 of the workflow pipeline (get-data): pod error: Further steps are failing with the errors below:
I checked Docker Hub for the requested container image with the latest tag, and the image does not exist there. I also checked the pod logs for the first step: they do not show any docker command pushing the image to Docker Hub, but they do include some minio-service calls (minio-service was set up while configuring the Kubeflow Pipelines service). Pod logs are here. How can I fix it? Thanks for helping! :)
@narendra36 u can see in the log that your server is 0.4.6, i suggest updating the docker to |
@narendra36 I see the error, you should re-pull the image.
@narendra36 mlrun 0.4.7 is now released. I suggest re-installing the packages and containers.
Hi @yaronha, thanks for the help.
@narendra36 the serving part is done with the Nuclio serverless engine. Did you install Nuclio on your cluster? You would need to specify the Nuclio service (dashboard) URL in the serving function.
Hi @yaronha, @yjb-ds Observations:
Issues:
@narendra36 good to see we are in the final step, the problem is that Nuclio or by adding the env var to the dashboard deployment. BTW, did you set the dashboard address in the |
Hi @yaronha, thanks for the quick reply. 😄 So do I still have to set the above-mentioned environment variable, and how should I add it: while installing Nuclio or mlrun? I didn't understand.
@narendra36 it's part of the Nuclio installation (it tells Nuclio the external address of the dashboard and ingress). You can change/set the env var in the |
Issue seems to be resolved, closing |
I have these set but I am still facing this issue.
Hi,
I'm using minikube to run Kubernetes on my local system and am trying to run the workflow defined in demos/sklearn-pipe/sklearn-project.ipynb, but I'm getting the error message below.
Jupyter Cell:
Error message:
MaxRetryError: HTTPConnectionPool(host='ml-pipeline.default.svc.cluster.local', port=8888): Max retries exceeded with url: /apis/v1beta1/experiments (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fea36705a90>: Failed to establish a new connection: [Errno -2] Name or service not known'))
I have followed the instructions in the README file below:
https://github.com/mlrun/mlrun/blob/master/hack/local/README.md
Can anyone help me in resolving the error?
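The error above ends with "Name or service not known", which means the hostname `ml-pipeline.default.svc.cluster.local` could not be resolved from where the notebook runs. Kubernetes Service DNS names follow the pattern `<service>.<namespace>.svc.<cluster-domain>`, so this name only resolves inside a cluster that has a Service called `ml-pipeline` in the `default` namespace. A minimal diagnostic sketch, using only the Python standard library, might look like the following. The helper names `service_dns_name` and `can_resolve` are hypothetical and are not part of mlrun or kfp, and `cluster.local` is assumed as the cluster domain because it appears in the error message.

```python
import socket


def service_dns_name(service: str, namespace: str,
                     cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name Kubernetes gives to a Service.

    Hypothetical helper for illustration; the cluster domain default is
    an assumption taken from the hostname in the error message.
    """
    return f"{service}.{namespace}.svc.{cluster_domain}"


def can_resolve(host: str) -> bool:
    """Return True if the hostname resolves from the current environment."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        # Same failure class as "[Errno -2] Name or service not known".
        return False
```

Running `can_resolve(service_dns_name("ml-pipeline", "default"))` from the Jupyter environment would show whether the pipeline API host is reachable by name. If it returns `False`, either Kubeflow Pipelines is not installed on the cluster, it is installed in a different namespace, or the notebook is running outside the cluster's DNS.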