
learner pod failed #165

Closed
Eric-Zhang1990 opened this issue Feb 22, 2019 · 19 comments

@Eric-Zhang1990

@Tomcli I am running the examples FfDL provides, but they all fail. I don't know how to find where the error is; can you help me? Thanks.
First, FfDL is running correctly.
[screenshot]
I created a bucket and uploaded data to the S3 bucket.
[screenshots]
I run "$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model" or "$CLI_CMD train etc/examples/tf-model/gpu-manifest.yml etc/examples/tf-model", and both end with status FAILED.
[screenshot]
I also notice the data path in the Caffe training is source: "/mnt/data/tf_training_data/train/" (in lenet_train_test.prototxt). Is this path right?
Thank you!

@Tomcli
Contributor

Tomcli commented Feb 22, 2019

Hi @Eric-Zhang1990, did you update the manifest file with the correct object storage endpoint? Per the instructions, if you are using the local object storage, the following script should help you set up the right endpoint:

if [ "$(uname)" = "Darwin" ]; then
  sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/tf-model/manifest.yml
else
  sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/tf-model/manifest.yml
fi

Note that you need to have the node_ip and s3_port environment variables set up in your shell.
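
For example, a minimal sketch of setting those variables, assuming the quickstart's local object storage is exposed as a NodePort service named "s3" in the "default" namespace (adjust to your deployment):

# Grab the first node's internal IP and the NodePort of the local S3 service.
node_ip=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
s3_port=$(kubectl get service s3 -o jsonpath='{.spec.ports[0].nodePort}')
echo "$node_ip:$s3_port"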

@Eric-Zhang1990
Author

Eric-Zhang1990 commented Feb 25, 2019

@Tomcli I ran the command 'sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/tf-model/manifest.yml' to change the manifest file, and the job submits OK, but '$CLI_CMD show training-_S1ixBrmR' still shows status Pending:
[screenshots]
When I describe the learner pod, it shows:
[screenshot]
What does the message "pod has unbound immediate PersistentVolumeClaims (repeated 3 times)" mean?
Also, I installed the s3fs driver and ran 'helm install storage-plugin --set cloud=false' as you described in issue #101.
Thank you.

@Eric-Zhang1990
Author

@Tomcli It seems I found the error behind the above issue, but I don't know whether my fix is right.
I ran 'kubectl get storageclass' and got:
[screenshot]
So I changed the file FfDL/bin/create_static_volumes.sh (I don't know whether the change is right):
[screenshot]
I ran 'kubectl get pvc --all-namespaces' and the status is always Pending, and 'kubectl describe pvc -n kube-system static-volume-1' shows the message 'failed ***':
[screenshot]
Can you help me analyze where the problem is?
Thank you.

Eric-Zhang1990 changed the title from "job failed" to "learner pod failed" on Feb 26, 2019
@Eric-Zhang1990
Author

Eric-Zhang1990 commented Feb 26, 2019

@Tomcli After I ran 'FfDL/bin/create_static_pv.sh', the status of 'static-volume-1' is now Bound:
[screenshot]
but I don't know how to modify the file tf-model/manifest.yml. Can you give me a detailed example, or explain how to use an S3 bucket for training?
Thank you.

@Eric-Zhang1990
Author

@Tomcli Now I can run the tf-model example (the CPU manifest.yml). However, I am currently uploading training data to AWS S3 storage; can you provide a method like NFS to store data for training?
Thank you very much.

@Tomcli
Contributor

Tomcli commented Feb 27, 2019

Hi @Eric-Zhang1990, sorry for the late reply. Regarding the errors you see with the mount_cos mode: since you deployed the storage plugin with the flag cloud=false, you need to install the s3fs driver and the kubelet plugin (replace apt-get with yum if you are using CentOS):

sudo apt-get install s3fs
sudo mkdir -p /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo cp <FfDL repo>/bin/ibmc-s3fs /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo chmod +x /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/ibmc-s3fs
sudo systemctl restart kubelet
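
To verify the driver is in place after the restart, something like the following should work (a hedged check; the exact kubelet log text for FlexVolume plugin probing varies by Kubernetes version):

# The driver binary should exist and be executable.
ls -l /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/ibmc-s3fs
# Look for FlexVolume plugin probe messages in the kubelet logs.
journalctl -u kubelet --since "5 minutes ago" | grep -i flexvolume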

@sboagibm can you describe how to use NFS as the data storage for an FfDL training job? Thanks.

@Eric-Zhang1990
Author

@Tomcli Thank you for your kind reply. I have installed the s3fs driver according to your steps, and now I can run the tf-model examples on CPU and GPU on one node using S3 data storage. I will try it on multiple nodes.

@sboagibm @Tomcli I want to know how to use NFS on FfDL to use local data on our servers, thanks.

@sboagibm
Contributor

@Eric-Zhang1990 asked:

I want to know how to use NFS on FfDL to use local data on our servers, thanks.

NFS is used internally for data sharing between the learner, helper, and job monitor pods. It's not directly involved in accessing training data or storing results. In fact, we're working on a re-architecture that removes the need for NFS altogether.

To use local data directly, you could try enabling a host mount at this time. See the manifest at https://github.com/IBM/FfDL/blob/master/etc/examples/tf-model/manifest-hostmount.yml .

Maybe @fplk has other ideas on this front. I'm not sure whether there are any functioning options using S3FS over Minio or something similar.

There's hopefully some new code coming soon that will allow PVC mounts. That's really what you want.

@Eric-Zhang1990
Author

@sboagibm Host mount is also OK for me, but I don't know how to set the variables such as 'container:', 'connection:', etc. Can you give me a detailed example? Thank you.
[screenshot]

@sboagibm
Contributor

sboagibm commented Mar 1, 2019

@Eric-Zhang1990 connection/path is the name of the directory you want to mount. training_data/container is the name of a subdirectory under the mount that the data will be fetched from. training_results/container is the name of a subdirectory under the mount that the training results will be written to. You can see where these directories are set up for our test, and the permissions they need to have, at https://github.com/IBM/FfDL/blob/master/Makefile#L489.
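
For illustration, a hedged sketch of what that data_stores section might look like; the key names (connection/path, training_data/container, training_results/container) come from the description above, while the id and type values are placeholders, so compare against etc/examples/tf-model/manifest-hostmount.yml for the exact schema:

data_stores:
  - id: hostmountstore              # placeholder name
    type: mount_volume              # assumption; check manifest-hostmount.yml for the real value
    training_data:
      container: tf_training_data   # subdirectory under the mount holding the input data
    training_results:
      container: tf_trained_model   # subdirectory under the mount where results are written
    connection:
      path: /home/mount_test        # host directory to mount (illustrative)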

Hope this helps!

@Eric-Zhang1990
Author

Eric-Zhang1990 commented Mar 1, 2019

@sboagibm @Tomcli I set these variables according to what you said above,
[screenshot]
my data directory looks like this:
[screenshot]
but the job still fails and the log shows 'Failed: load_model_exit_code: 1'. I don't know which step is wrong.
[screenshot]
Sometimes a failed job logs 'Failed: load_data_exit_code: 1'; could that be caused by the data size? (My data is about 3 GB.)
[screenshot]
Can you help me solve these problems?
Thank you.

@sboagibm
Contributor

sboagibm commented Mar 1, 2019

@Eric-Zhang1990 Debugging this remotely is difficult. My debugging strategy would be as follows. Run watch kubectl get pods so you can see if/when the learner, helper, and job-monitor pods appear. If they do not appear, check the logs of the lcm service and see if it's showing an error. If the learner pod does appear, do a kubectl exec -it learner-podname sh, go into the /cosdata directory, and see if everything is as you expect. Also run kubectl logs learner-podname and see if the learner is showing any interesting logs directly.
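
Roughly, as a command-line sketch of those steps (pod names are placeholders; the label selector for the lcm pod is an assumption, so adjust to however your deployment names it):

# Watch for the learner, helper, and job-monitor pods to appear.
watch kubectl get pods

# If they never appear, check the LCM logs for errors.
kubectl logs -l service=ffdl-lcm

# If the learner pod appears, inspect the mounted data and the learner logs.
kubectl exec -it <learner-podname> sh    # then: ls /cosdata
kubectl logs <learner-podname>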

@Eric-Zhang1990
Author

@sboagibm I ran 'kubectl exec -it learner-podname sh' and went into the '/mnt' directory; I can find the data, and it is the same as in my host directory:
[screenshot]
but the log shows an error like this:
[screenshot]
I did not find any files in the directory mentioned above. I used '$CLI_CMD train etc/examples/tf-model/manifest-hostmount.yml etc/examples/tf-model.zip' to run the training job.
Can you tell me some details about the above error?
Thank you.

@Eric-Zhang1990
Author

@sboagibm I created the directory '_submitted_code' and copied 'model.zip' into it, and now the job runs correctly. Thank you for your reply.
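
For reference, a hedged reconstruction of that fix (the mount path is taken from later comments in this thread and may differ in other setups):

# The learner expects the submitted model code under a _submitted_code/
# directory inside the mounted host path (the path below is an assumption).
mkdir -p /home/mount_test/caffe-mnist/_submitted_code
cp model.zip /home/mount_test/caffe-mnist/_submitted_code/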

@Eric-Zhang1990
Author

@Tomcli I am confused about the relationship between the number of GPUs and the number of learners. My understanding is that each learner can use the number of GPUs I set, e.g.:
[screenshot]
gpus is set to 2 and learners is also set to 2, so each learner uses 2 GPUs and the 2 learners use 4 GPUs in total. Is that right? If it is, I'd like to know whether each learner runs the same training code independently and saves 2 different models (one per learner).
Or is that wrong, and although there are 2 learners with 2 GPUs each, the job saves just one model, i.e. the training job uses 4 GPUs for distributed training?
I don't know which interpretation is right, or whether both are wrong.
Can you explain the relationship between gpus and learners?
Thank you.

@Eric-Zhang1990
Author

Eric-Zhang1990 commented Mar 5, 2019

@sboagibm Now I can run the job correctly, but after it completes I can't find the Caffe model it saved. I only set 'snapshot_prefix: "./lenet"', and I don't know which path I should set. Can you tell me where the Caffe model is? Thank you.
My config files:
[screenshots]

@sboagibm
Contributor

sboagibm commented Mar 5, 2019

Should be in /home/mount_test/caffe-mnist/results?
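
One thing that might help (a hedged suggestion: the FfDL example manifests suggest the learner environment exposes a $RESULT_DIR variable pointing at the results mount, and a relative './lenet' resolves against the learner's working directory rather than the results mount) is to rewrite the snapshot prefix at job start, e.g. in the manifest's command:

# Point Caffe's snapshot prefix at the results mount before training
# ($RESULT_DIR is an assumption based on the FfDL example manifests).
sed -i "s|snapshot_prefix: \".*\"|snapshot_prefix: \"$RESULT_DIR/lenet\"|" lenet_solver.prototxt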

@Eric-Zhang1990
Author

Eric-Zhang1990 commented Mar 6, 2019

@sboagibm This is my directory content; there is a 'learner-1' directory holding 'training-log.txt'. I viewed the log and found the following info:
[screenshot]
(I changed the directory 'result' to 'results'.)
[screenshot]
But I still can't find the Caffe model. One path I don't know how to set is 'snapshot_prefix: "./lenet"' in 'lenet_solver.prototxt'. Can you provide an example that you have used successfully? Thank you.
[screenshot]
Also, I find that when I use host mount for training, it takes a long time to train just a few steps (e.g. 2000 iterations). Looking at the log, it seems to spend most of the time reading data. Why is that?
[screenshots]

@Eric-Zhang1990
Author

Eric-Zhang1990 commented Mar 7, 2019

Should be in /home/mount_test/caffe-mnist/results?

@sboagibm I find that when I use S3 storage or host mount for Caffe training, I cannot find where the Caffe model is.

  1. S3 storage:
    [screenshot]
  2. host mount:
    [screenshot]

But when I use S3 storage or host mount for TensorFlow training, I can find the TF model in 's3://tf_trained_model' or in '/home/mount_test/tf-train/result/model' (host mount).

  1. S3 storage:
    [screenshot]
  2. host mount:
    [screenshot]

Can you tell me where the Caffe model is saved? Thank you.
