Cannot run TFJob pod #944

Closed
Chris-Paul-Li opened this issue Feb 21, 2019 · 2 comments


Chris-Paul-Li commented Feb 21, 2019

I cannot reach gcr.io, so I pulled cschen/tf-mnist-with-summaries from Docker Hub instead.
However, my TFJob pod status is "CrashLoopBackOff". `kubectl describe` reports "Error syncing pod", and the logs show "python: can't open file '/root/kubeflow/mnist-with-summaries.py': [Errno 2] No such file or directory".

All other files are the same as in the examples.

My tf_job_mnist.yaml:
apiVersion: "kubeflow.org/v1beta1"
kind: "TFJob"
metadata:
  name: "mnist"
  namespace: kubeflow
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: cschen/tf-mnist-with-summaries
              command:
                - "python"
                - "/root/kubeflow/mnist-with-summaries.py"
                - "--log_dir=/train"
                - "--learning_rate=0.01"
                - "--batch_size=150"
              volumeMounts:
                - mountPath: "/train"
                  name: tfevent-volume
          volumes:
            - name: tfevent-volume
              persistentVolumeClaim:
                claimName: "tfevent-volume"

kubectl -n kubeflow get pods
mnist-worker-0 0/1 CrashLoopBackOff 2 1m

kubectl -n kubeflow describe pod mnist-worker-0
Events:
Type Reason Age From Message
Normal Scheduled 3m default-scheduler Successfully assigned mnist-worker-0 to k8s-node-vm8o4o-ev15h9pjqd
Normal SuccessfulMountVolume 3m kubelet, k8s-node-vm8o4o-ev15h9pjqd MountVolume.SetUp succeeded for volume "default-token-pt7pz"
Normal SuccessfulMountVolume 3m kubelet, k8s-node-vm8o4o-ev15h9pjqd MountVolume.SetUp succeeded for volume "tfevent-volume"
Normal Pulled 2m (x3 over 2m) kubelet, k8s-node-vm8o4o-ev15h9pjqd Successfully pulled image "cschen/tf-mnist-with-summaries"
Normal Created 2m (x3 over 2m) kubelet, k8s-node-vm8o4o-ev15h9pjqd Created container
Normal Started 2m (x3 over 2m) kubelet, k8s-node-vm8o4o-ev15h9pjqd Started container
Warning BackOff 2m (x5 over 2m) kubelet, k8s-node-vm8o4o-ev15h9pjqd Back-off restarting failed container
Warning FailedSync 2m (x5 over 2m) kubelet, k8s-node-vm8o4o-ev15h9pjqd Error syncing pod
Normal Pulling 2m (x4 over 3m) kubelet, k8s-node-vm8o4o-ev15h9pjqd pulling image "cschen/tf-mnist-with-summaries"

kubectl -n kubeflow logs pod/mnist-worker-0
python: can't open file '/root/kubeflow/mnist-with-summaries.py': [Errno 2] No such file or directory

kubectl -n kubeflow describe tfjob mnist
Events:
Type Reason Age From Message
Normal SuccessfulCreatePod 7m tf-operator Created pod: mnist-worker-0
Normal SuccessfulCreateService 7m tf-operator Created service: mnist-worker-0
Normal ExitedWithCode 1m (x10 over 7m) tf-operator Pod: kubeflow.mnist-worker-0 exited with code 2

kubectl -n kubeflow logs tfjob/mnist
error: no kind "TFJob" is registered for version "kubeflow.org/v1beta1"

@gaocegege
Member

Does cschen/tf-mnist-with-summaries contain the file /root/kubeflow/mnist-with-summaries.py?
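
One way to check this is to list the directory inside the image without starting its default command. The following is a minimal sketch, assuming Docker is available locally and that the image ships `ls`; the helper name `list_image_path` is made up for illustration.

```shell
# Hypothetical helper: list a path inside an image without running the
# image's default command. Assumes a local Docker daemon and that the
# image contains `ls`.
list_image_path() {
  image="$1"
  path="$2"
  # --rm removes the container afterwards; --entrypoint overrides the
  # image's ENTRYPOINT so that `ls -l <path>` runs instead.
  docker run --rm --entrypoint ls "$image" -l "$path"
}

# Usage (not run here):
#   list_image_path cschen/tf-mnist-with-summaries /root/kubeflow
```

If `ls` reports that the path does not exist, the `command` in the TFJob spec points at a file the image does not contain, which would match the error in the pod logs.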

@ChanYiLin
Member

I suggest building your own Docker image from our examples/v1beta1/dist-mnist example.
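
A rough sketch of what that could look like, assuming a local checkout of the repository, that the example directory contains a Dockerfile, and that you can push to a registry your cluster can pull from. The function name `build_and_push` and the image tag in the usage line are placeholders, not names from the repository.

```shell
# Hypothetical helper: build an image from a directory that contains a
# Dockerfile, then push it. Both arguments are supplied by the caller.
build_and_push() {
  context_dir="$1"   # e.g. examples/v1beta1/dist-mnist in a local checkout
  image="$2"         # placeholder tag in a registry the cluster can reach
  # Push only if the build succeeded.
  docker build -t "$image" "$context_dir" &&
    docker push "$image"
}

# Usage (not run here; the tag is a placeholder):
#   build_and_push examples/v1beta1/dist-mnist <your-registry>/tf-dist-mnist:1.0
```

After pushing, the `image:` field in the TFJob spec would need to reference the new tag, and `command` would need to point at wherever that Dockerfile actually places the training script.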
