NNI on Windows for NNI Remote mode #1073
src/nni_manager/common/utils.ts
Outdated
 * Use '/' to join path instead of '\' for all kinds of platform
 * @param path
 */
function pathJoin(...paths: any[]): string{
mark
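A minimal sketch of the behaviour the doc comment describes, assuming Node's built-in `path.posix.join` is acceptable for this purpose. The name `joinWithForwardSlash` is hypothetical and is not the PR's `pathJoin`; note also that `path.posix.join` does not convert backslashes that are already inside a segment.

```typescript
import * as path from 'path';

// Hypothetical helper (not the PR's implementation): join path segments
// with '/' regardless of the platform the manager runs on.
function joinWithForwardSlash(...segments: string[]): string {
    // path.posix.join always uses '/' as the separator and collapses
    // duplicate separators, so a leading '/' root needs no special case.
    return path.posix.join(...segments);
}

console.log(joinWithForwardSlash('/tmp', 'nni', 'scripts'));
console.log(joinWithForwardSlash('/', 'nni'));
console.log(joinWithForwardSlash('/tmp/', 'nni'));
```

Using the `posix` variant keeps the separator fixed even when the manager itself runs on Windows, which is the case this PR targets.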
            this.criticalError(new NNIError('Job metrics error', `Job metrics error: ${err.message}`, err));
        });
    });
this.trainingService.addTrialJobMetricListener(this.trialJobMetricListener);
mark
@@ -438,7 +437,7 @@ class RemoteMachineTrainingService implements TrainingService {
 */
private getLocalGpuMetricCollectorDir(): string {
    let userName: string = path.basename(os.homedir()); //get current user name of os
    return `${os.tmpdir()}/${userName}/nni/scripts/`;
    return path.join(os.tmpdir(), userName, 'nni', 'scripts');
mark
@@ -481,7 +480,7 @@ class RemoteMachineTrainingService implements TrainingService {
private async initRemoteMachineOnConnected(rmMeta: RemoteMachineMeta, conn: Client): Promise<void> {
    // Create root working directory after ssh connection is ready
    await this.generateGpuMetricsCollectorScript(rmMeta.username); //generate gpu script in local machine first, will copy to remote machine later
    const nniRootDir: string = `${os.tmpdir()}/nni`;
    const nniRootDir: string = getRemoteTmpDir(this.remoteOS) + '/nni';
mark
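For context on the change above: `os.tmpdir()` reports the temp directory of the machine running the manager, so on a Windows host it yields a Windows path that means nothing on a remote Linux machine. The sketch below illustrates the idea of resolving the temp directory from the remote OS instead. `remoteTmpDirSketch` is a hypothetical stand-in; the actual `getRemoteTmpDir` in the PR may be implemented differently, and the `/tmp` value and the set of supported OS names are assumptions.

```typescript
// Hypothetical stand-in for the PR's getRemoteTmpDir; the real helper may differ.
function remoteTmpDirSketch(remoteOS: string): string {
    // Assumption: remote machines are POSIX-like and use /tmp.
    if (remoteOS === 'linux' || remoteOS === 'darwin') {
        return '/tmp';
    }
    throw new Error(`Unsupported remote OS: ${remoteOS}`);
}

// Build the remote root directory with '/' because the path is used on the remote side.
const nniRootDirSketch: string = remoteTmpDirSketch('linux') + '/nni';
console.log(nniRootDirSketch);
```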
//create tmp trial working folder locally.
await cpp.exec(`cp -r ${this.trialConfig.codeDir}/* ${trialLocalTempFolder}`);
await execCopydir(this.trialConfig.codeDir+'/*',trialLocalTempFolder);
mark
src/nni_manager/common/utils.ts
Outdated
if(dir === '/'){
    dir = dir + path;
}else{
    dir = dir + '/' + path;
potential bug
85303ed to a2b9538
if(typeof cuda_visible_device === 'string' && cuda_visible_device.length > 0) {
    command = `CUDA_VISIBLE_DEVICES=${cuda_visible_device} ${this.trialConfig.command}`;
} else {
    command = `CUDA_VISIBLE_DEVICES=" " ${this.trialConfig.command}`;
    command = `CUDA_VISIBLE_DEVICES='-1' ${this.trialConfig.command}`;
Suggest keeping the original CUDA_VISIBLE_DEVICES=" ", which has already been tested.
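A minimal sketch of the branch under discussion, written as a standalone function and using the value the reviewer prefers. `buildTrialCommand` and its parameter names are hypothetical; in the PR this logic sits inline and reads `this.trialConfig.command`. Both `" "` and `'-1'` are presumably meant to hide all GPUs from the trial, and the thread does not settle which is more robust.

```typescript
// Hypothetical standalone version of the inline logic shown in the diff.
function buildTrialCommand(trialCommand: string, cudaVisibleDevice?: string): string {
    if (typeof cudaVisibleDevice === 'string' && cudaVisibleDevice.length > 0) {
        // Expose only the GPUs that were assigned to this trial.
        return `CUDA_VISIBLE_DEVICES=${cudaVisibleDevice} ${trialCommand}`;
    }
    // No GPU assigned: keep the original, already tested value.
    return `CUDA_VISIBLE_DEVICES=" " ${trialCommand}`;
}

console.log(buildTrialCommand('python3 mnist.py', '0,1'));
console.log(buildTrialCommand('python3 mnist.py'));
```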
test/pipelines-it-remote-windows.yml
Outdated
  displayName: 'build nni bdsit_wheel'
- task: SSH@0
  inputs:
    sshEndpoint: remote_nni-ci-gpu-01
This endpoint, remote_nni-ci-gpu-01, should not be hard-coded.
test/pipelines-it-remote-windows.yml
Outdated
  displayName: 'Start docker'
- task: DownloadSecureFile@1
  inputs:
    secureFile: remote_ci_private_key
Suggest making this key file configurable from the pipeline settings.
src/nni_manager/core/nnimanager.ts
Outdated
@@ -76,6 +77,11 @@ class NNIManager implements Manager {
    status: 'INITIALIZED',
    errors: []
};
this.trialJobMetricListener = (metric: TrialJobMetric) => {
    this.onTrialJobMetrics(metric).catch((err: Error) => {
        this.criticalError(new NNIError('Job metrics error', `Job metrics error: ${err.message}`, err));
To handle an err of string type, please refer to PR #1064 and change it like this: this.criticalError(NNIError.FromError(err, 'Job metrics error: '));
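To illustrate why a string-typed err matters: a rejected promise can carry any value, and reading `.message` on a string gives `undefined`, so the template in the diff would produce "Job metrics error: undefined". The sketch below shows one way a conversion helper could handle both cases. `SketchError` and `fromUnknown` are hypothetical names for illustration only; this is not the actual `NNIError.FromError` from PR #1064, whose signature and behaviour may differ.

```typescript
// Hypothetical error wrapper, not NNI's NNIError.
class SketchError extends Error {
    public innerError?: Error;

    constructor(name: string, message: string, innerError?: Error) {
        super(message);
        this.name = name;
        this.innerError = innerError;
    }

    // Accept either an Error or a plain string and prepend a message prefix.
    public static fromUnknown(err: Error | string, messagePrefix: string = ''): SketchError {
        if (typeof err === 'string') {
            // A string has no .message, so use the string itself.
            return new SketchError('Error', messagePrefix + err);
        }
        return new SketchError(err.name, messagePrefix + err.message, err);
    }
}

const fromString = SketchError.fromUnknown('metric parse failed', 'Job metrics error: ');
const fromError = SketchError.fromUnknown(new TypeError('bad value'), 'Job metrics error: ');
console.log(fromString.message);
console.log(fromError.name, fromError.message);
```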
Please update the doc accordingly.
#1053
Install NNI on Windows and run experiments on remote Linux machines.
Pipeline:
Copy the NNI project to the remote machine and build the wheel there, then install the NNI wheel in Docker and complete the integration test.