[doc] request parameters doc when the -init-version=-1 #302
Comments
it seems a bug of
# get self ipv4 of given nic
get_self_ip() {
    local nic=$1
    # keep the IPv4 line only (drop inet6) and print its address field
    ifconfig "$nic" | grep inet | grep -v inet6 | awk '{print $2}'
}
# construct kungfu-run flags
kungfu_run_flags() {
    local nic=$1
    local IP=$(get_self_ip "$nic")
    echo -H $IP # workaround
    echo -init-version -1
    echo -w
    echo -nic $nic
}
kungfu_run_with_nic() {
    local nic=$1
    shift # drop the nic so that only the training command is forwarded
    kungfu-run $(kungfu_run_flags "$nic") "$@"
}
kungfu_run_with_nic ib0 python3 train-xxx.py
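The workaround above depends on get_self_ip printing exactly one address. A minimal sketch of a stricter variant follows, assuming the NIC reports its address either as `inet 10.0.0.5` (newer net-tools) or `inet addr:10.0.0.5` (older net-tools). The names parse_ipv4 and get_self_ip_checked are hypothetical helpers for illustration, not part of KungFu.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of KungFu): read ifconfig-style text on stdin
# and print the first IPv4 address. Lines starting with "inet6" are skipped
# because the first field must be exactly "inet".
parse_ipv4() {
    awk '$1 == "inet" { sub(/^addr:/, "", $2); print $2; exit }'
}

# Same role as get_self_ip, but returns non-zero when no address is found,
# so that an empty value is never passed on to -H.
get_self_ip_checked() {
    local nic=$1
    local ip
    ip=$(ifconfig "$nic" 2>/dev/null | parse_ipv4)
    if [ -z "$ip" ]; then
        echo "no IPv4 address found on $nic" >&2
        return 1
    fi
    echo "$ip"
}
```

With this variant, a wrong NIC name fails before kungfu-run is started instead of producing an empty -H value.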
Thanks for the reply. I have tested kungfu-run with
Should I also set the correct value of np (the number of GPUs in the newly added kungfu-run container)?
To clarify: currently we have a machine with 8 * V100 GPUs, and the newly added kungfu-run should be started with
This is my test case: start a job with 1 container (i.e. the A.1 container), then scale up to 2 containers (i.e. the A.2 container is added). It turns out that both containers hang after the
init A.1 container
newly added A.2 container
init A.1 container
newly added A.2 container
By the way, it works well in this way (
init A.1 container
newly added A.2 container
I'd like KungFu to provide a brief doc about the parameters. I have tried setting -init-version=-1 and omitting the -H param, but it seems kungfu-run can't handle it well.