Fix reusable k8s training service bug #5045

Merged
ultmaster merged 11 commits into microsoft:master on Aug 15, 2022

Conversation

amznero
Contributor

@amznero amznero commented Aug 3, 2022

Description

Currently, the reusable k8s training services (FrameworkController and Kubeflow) save the entry-point file (run.sh) and the env dir at the root path of the NFS, but the content of run.sh states that it should be located at "envs/exp_id". As a result, a "cannot find xxxx" error is raised when the worker pod runs this entry point.

Second, when users submit multiple experiments, the experiments interfere with each other, because all of their contents are stored at the root of the NFS.

Third, when trialConcurrency is greater than 1, the entry point is overwritten by each environment in turn, so every environment actually runs the same (most recently generated) run.sh script. Because https://github.com/microsoft/nni/blob/v2.8/nni/tools/trial_tool/trial_runner.py#L164 uses the directory name as the runner id, this eventually raises an error.

More discussion can be found in #4588.


  • fix upload dir bug, use a separate working directory for each experiment
  • fix trialConcurrency bug, use ${envId}_run.sh to replace run.sh
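
The two fixes above can be illustrated with a minimal sketch. It is not the code from this PR: the helper name `entry_point_path`, the mount point `/mnt/nfs`, and the `<nfs_root>/<experiment_id>/<env_id>_run.sh` layout are assumptions made only for illustration. It shows why a single shared run.sh collides across environments, while a per-experiment directory combined with a per-environment file name does not.

```python
from pathlib import PurePosixPath


def entry_point_path(nfs_root: str, experiment_id: str, env_id: str) -> PurePosixPath:
    """Build an entry-point path that is unique per experiment and environment.

    Hypothetical helper: the directory layout is an assumption, not NNI's API.
    """
    # Separate working directory for each experiment (first fix).
    experiment_dir = PurePosixPath(nfs_root) / experiment_id
    # ${envId}_run.sh instead of a shared run.sh (second fix).
    return experiment_dir / f"{env_id}_run.sh"


env_ids = ("env1", "env2")

# Old behaviour as described: every environment writes <nfs_root>/run.sh,
# so the set of distinct paths collapses to a single file.
old_paths = {str(PurePosixPath("/mnt/nfs") / "run.sh") for _ in env_ids}

# Sketched new behaviour: one script per environment, inside the experiment dir.
new_paths = {str(entry_point_path("/mnt/nfs", "expA", env)) for env in env_ids}

print(sorted(old_paths))
print(sorted(new_paths))
```

With this layout, two experiments that reuse the same environment id still get distinct paths, because the experiment id is part of the directory.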

Test Options

  • fast test
  • full test - HPO
  • full test - NAS
  • full test - compression

Checklist

  • test case
  • doc

How to test

Related Issues: microsoft/frameworkcontroller#75, #4588, #4874, #5026.

@ghost

ghost commented Aug 3, 2022

CLA assistant check
All CLA requirements met.

@amznero
Contributor Author

amznero commented Aug 4, 2022

The commit (0c9c151) has a side effect when the storage is not NFS (e.g., Azure). To be refined.

@amznero amznero marked this pull request as ready for review August 10, 2022 03:47
@ultmaster ultmaster merged commit 125ec21 into microsoft:master Aug 15, 2022
@ultmaster
Contributor

Needs to be tested on the K8s pipeline. Tracked in #4954.
