Having trouble with nni and frameworkcontroller on k8s #4588
Comments
Hi, do we have a solution for this?
Faced a similar issue (in the Kubeflow training service); I fixed it by hacking `trialDispatcher.ts`, `kubeflowEnvironmentService.ts`, and `kubernetesEnvironmentService.ts`.

Suppose the experiment id is ABCDE. The generated trial command looks like `sh /tmp/mount/nni/ABCDE/run.sh && ...`, so it raises a "can't open /tmp/mount/nni/ABCDE/run.sh" error. P.S. In the container, the NFS path is mounted to a different location, so that script path doesn't exist there.

I'll create a PR ASAP to fix this issue. Related issues: microsoft/frameworkcontroller#75, #4874, #5026.
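To make the failure mode concrete, here is a minimal, hypothetical Python sketch (not NNI's actual code): the trial command is generated against the host-side NFS layout, but that path does not exist inside the container.

```python
import os

# Hypothetical illustration of the reported failure: the dispatcher builds
# the trial command against the host-side NFS layout, but the container
# mounts the NFS share elsewhere, so the script path is missing at runtime.
experiment_id = "ABCDE"  # example id from the comment above
expected_script = f"/tmp/mount/nni/{experiment_id}/run.sh"

if not os.path.exists(expected_script):
    # This mirrors the error the shell reports:
    # "can't open /tmp/mount/nni/ABCDE/run.sh"
    print(f"can't open {expected_script}")
```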
Thanks, I figured that out too. So you also modified the ts files and built nni again from source? I didn't know about the overwriting; the overwriting of run.sh won't be a problem, right? Because, as I remember, it always creates a new env folder for each trial and runs the code there?
Sorry for the late reply.
Yes, I modified the TS files and built the NNI wheel from source.
No, I think it only creates as many envs as the configured concurrency.

As for the overwriting problem: every env will actually run the same (latest generated) run.sh script, and https://github.com/microsoft/nni/blob/v2.8/nni/tools/trial_tool/trial_runner.py#L164 will use the dir name as the runner id, so it will eventually raise an error.

An example run.sh:

```
cd /tmp/mount/nni/5nfd2kzc && mkdir -p envs/ZKtWr && cd envs/ZKtWr && sh ../install_nni.sh && python3 -m nni.tools.trial_tool.trial_runner 1>/tmp/mount/nni/5nfd2kzc/envs/ZKtWr/trialrunner_stdout 2>/tmp/mount/nni/5nfd2kzc/envs/ZKtWr/trialrunner_stderr
```

Every env will use the same `envs/ZKtWr` directory, and therefore the same runner id.
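For illustration, a hedged sketch of the collision described above (this is not the actual trial_runner.py code, just the reported behavior): if the runner id is derived from the name of the current working directory, every environment that executes the same latest-generated run.sh lands in the same `envs/` directory and reports the same id.

```python
import os

# Illustrative only: derive a runner id from the working directory name,
# as trial_runner.py reportedly does. Because every environment runs the
# same (latest) run.sh, they all cd into the same envs/<id> directory and
# end up with identical runner ids, which later triggers an error.
runner_id = os.path.basename(os.getcwd())  # e.g. "ZKtWr" for every env
print(f"runner id: {runner_id}")
```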
@amznero At first I moved the run.sh file to the correct experiment folder, but then the trials didn't seem to run concurrently, and yes, as you say, I also found out that the last config is applied to every worker (environment).
@vincenthp2603
Does "time" mean training duration? If so, this scenario didn't happen to me, and I don't think concurrency should affect the training duration. You can freeze the random seeds (NumPy, torch, cuda, cudnn, etc.) and set worker=1 to record an experiment baseline (batch size, epochs, model parameters, training duration), then use concurrent mode to train the model and compare it against that baseline; see the seeding sketch below. Maybe the training duration is related to model complexity or to training strategies (like genetic algorithms)? You can see my changes here: #5045.
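For reference, a minimal seed-freezing sketch (assuming PyTorch and NumPy; adapt to your stack):

```python
import random

import numpy as np
import torch

SEED = 42  # arbitrary, but fixed across runs

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Make cuDNN deterministic; this can slow training down.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```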
NNI v2.9 has been released. |
Describe the issue:
When I tried nni with frameworkcontroller on k8s, I used these yaml files:

- for the nni config: config_framework.yml
- for the frameworkcontroller StatefulSet: frameworkcontroller-with-default-config.yaml

I executed the command below for the k8s StatefulSet, and frameworkcontroller-0 was set to Run. Then I executed the nnictl command, and a new experiment worker pod was created, but it failed to run.

I checked the logs with `kubectl logs nniexp~` and then looked at the NFS mount directory: there is no `nni` directory, only an `envs` directory and a `run.sh` file. I think it should create `nni/experiment_id/run.sh` in the mount folder.

Here is the describe output of the `nniexp-worker-0` pod.

Please let me know how to solve this problem. Thanks!
Environment: