learner pod failed #165
Comments
Hi @Eric-Zhang1990, did you update the manifest file with the correct object storage endpoint? In the instructions, if you are using the local object storage, the following script should help you set up the right endpoint.
Note you need to have
@Tomcli I ran the command 'sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/tf-model/manifest.yml' to change the manifest file, and its status is OK, but '$CLI_CMD show training-_S1ixBrmR' still shows status Pending.
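The endpoint substitution can be sanity-checked locally before touching the real manifest. A minimal sketch, using a throwaway snippet file and made-up endpoint values:

```shell
# Illustrative values only -- use your real node IP and NodePort
node_ip=192.168.1.10
s3_port=30303

# Write a sample manifest line containing the default endpoint
printf 'auth_url: http://s3.default.svc.cluster.local\n' > /tmp/manifest-snippet.yml

# Same substitution as in the thread, applied in place
sed -i "s/s3.default.svc.cluster.local/$node_ip:$s3_port/" /tmp/manifest-snippet.yml

cat /tmp/manifest-snippet.yml
# → auth_url: http://192.168.1.10:30303
```

If the substitution ran, the default service hostname should no longer appear anywhere in the file.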
@Tomcli Seems I found the error about the above issue, but I don't know whether it is right or not.
@Tomcli After I ran 'FfDL/bin/create_static_pv.sh', the status of 'static-volume-1' is now Bound.
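For anyone following along, binding can be confirmed from the cluster side as well; a quick check (the volume name comes from this thread, and this obviously needs access to the running cluster):

```shell
# Show the PV and its claim; STATUS should read "Bound"
kubectl get pv static-volume-1
kubectl describe pv static-volume-1 | grep -i status
```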
@Tomcli Now I can run the example tf-model (manifest.yml on CPU). However, I use AWS S3 storage to upload data for training; can you provide a method like NFS to store data for training?
Hi @Eric-Zhang1990, sorry for the late reply. For the errors you are seeing, install the s3fs driver:

```shell
sudo apt-get install s3fs
sudo mkdir -p /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo cp <FfDL repo>/bin/ibmc-s3fs /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo chmod +x /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/ibmc-s3fs
sudo systemctl restart kubelet
```

@sboagibm can you describe how to use NFS as the data storage for an FfDL training job? Thanks.
@Tomcli Thank you for your kind reply. I have installed the s3fs driver according to your steps; now I can run the tf-model example on CPU and GPU on one node using S3 data storage, and I will try it on multiple nodes. @sboagibm @Tomcli I want to know how to use NFS on FfDL so that we can use local data on our servers, thanks.
@Eric-Zhang1990 asked:
NFS is used internally for data sharing between the learner, helper, and job monitor pods. It's not primarily involved in data access or results. In fact, we're working on a re-architecture that gets rid of the need for NFS altogether. To use local data directly, you could try to enable a host mount at this time. See the manifest at https://github.com/IBM/FfDL/blob/master/etc/examples/tf-model/manifest-hostmount.yml . Maybe @fplk has other ideas on this front. Not sure if there are any functioning options with using S3FS over minio or something. There's hopefully some new code coming soon that will allow PVC mounts. That's really what you want.
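The linked manifest-hostmount.yml is the authoritative example. Purely as a hypothetical sketch of the general shape of a `data_stores` entry (every key name and value here is an assumption and should be checked against the linked file, not copied as-is):

```yaml
# HYPOTHETICAL sketch -- verify all key names against manifest-hostmount.yml
data_stores:
  - id: hostmount
    type: mount_volume
    training_data:
      container: tf_training_data      # subdirectory holding the input data
    training_results:
      container: tf_trained_model      # subdirectory where results are written
    connection:
      type: host_mount
      name: host-mount-1
      path: /home/mount_test/tf-train  # directory on the host node to mount
```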
@sboagibm Host mount is also OK for me, but I don't know how to set the variables, like 'container:', 'connection:', etc. Can you give me a detailed example? Thank you.
@Eric-Zhang1990 Hope this helps!
@sboagibm @Tomcli I set these variables according to what you said above.
@Eric-Zhang1990 Debugging this remotely is difficult. My debugging strategy would be as follows. Do a
@sboagibm I ran 'kubectl exec -it learner-podname sh' and went into the '/mnt' directory; I can find the data, which is the same as in my host directory.
@sboagibm I created the directory '_sbumitted_code' and copied 'model.zip' into it; now it can run correctly. Thank you for your reply.
@Tomcli I am confused about the number of GPUs and learners. What I understand is that each learner can use the number of GPUs I set, e.g.:
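If that understanding is right, the total GPU count would be learners × gpus per learner. A rough illustration of the two manifest fields involved (field placement and values are illustrative; check against the example manifests in the repo):

```yaml
# Illustrative fragment: 2 learner pods with 1 GPU each => 2 GPUs in total
gpus: 1
learners: 2
```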
@sboagibm Now I can run the job correctly, but after it completed, I can't find the caffe model it saved. I just set 'snapshot_prefix: "./lenet"'; I don't know which path I should set. Can you tell me where the caffe model is? Thank you.
Should be in
@sboagibm This is my directory content; there is a 'learner-1' dir for saving 'training-log.txt', and when I view the log I find the following info:
@sboagibm I find that when I use S3 storage or a host mount for caffe training, I cannot find where the caffe model is. But when I use S3 storage or a host mount for tensorflow training, I can find the tf model in 's3://tf_trained_model' or in '/home/mount_test/tf-train/result/model' (host mount). Can you tell me where the caffe model is saved? Thank you.
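One possible explanation, offered as a guess: a relative 'snapshot_prefix: "./lenet"' makes Caffe write snapshots into the learner's working directory, which is not the mounted results store, so they vanish with the pod. Pointing the prefix at the mounted results path in the solver prototxt would keep them; a sketch, assuming the results volume is mounted at /mnt/results (the path is illustrative and must match your own mount):

```
# lenet_solver.prototxt fragment -- /mnt/results is an assumed mount point
snapshot: 5000
snapshot_prefix: "/mnt/results/lenet"
```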
@Tomcli I am running the examples FfDL provides, but they all failed. I don't know how to find where the error is; can you help me? Thanks.
First, FfDL is running correctly.
I created a bucket and uploaded data to the s3 bucket.
I run "$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model" or "$CLI_CMD train etc/examples/tf-model/gpu-manifest.yml etc/examples/tf-model"; both get status FAILED.
And I find the data path is "source: "/mnt/data/tf_training_data/train/"" (lenet_train_test.prototxt) in the caffe train. Is this path right?
Thank you!
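For failures like this, the usual Kubernetes-level checks give the first clues; a sketch (pod names are placeholders, and this requires kubectl access to the cluster running FfDL):

```shell
# Find the learner pod belonging to the failed job
kubectl get pods | grep learner

# Inspect pod events: image pull errors, volume mount failures, scheduling
kubectl describe pod <learner-pod-name>

# Read the learner's own logs for the training error
kubectl logs <learner-pod-name>
```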