Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NNI on Windows for NNI Remote mode #1073

Merged
merged 392 commits into from May 27, 2019

Conversation

demianzhang
Copy link
Contributor

@demianzhang demianzhang commented May 13, 2019

#1053
Install nni on Windows, run experiments in remote linux machines

  1. Use python tarfile to replace the tar command when compress the file
  2. Change the windows path joined with '\' into linux path joined with '/'
  3. Fix nnictl stop using 'SIGBREAK' and remove the trialJobMetricListener to clean the experiments.

Pipeline:
copy nni project to remote and build wheel in remote, install the nni wheel in docker and finish the integration test.

  1. change the pypi Makefile, using downloaded node and yarn to build wheel
  2. use putty pscp command to get docker port

* Use '/' to join path instead of '\' for all kinds of platform
* @param path
*/
function pathJoin(...paths: any[]): string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mark

this.criticalError(new NNIError('Job metrics error', `Job metrics error: ${err.message}`, err));
});
});
this.trainingService.addTrialJobMetricListener(this.trialJobMetricListener);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mark

@@ -438,7 +437,7 @@ class RemoteMachineTrainingService implements TrainingService {
*/
private getLocalGpuMetricCollectorDir(): string {
let userName: string = path.basename(os.homedir()); //get current user name of os
return `${os.tmpdir()}/${userName}/nni/scripts/`;
return path.join(os.tmpdir(), userName, 'nni', 'scripts');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mark

@@ -481,7 +480,7 @@ class RemoteMachineTrainingService implements TrainingService {
private async initRemoteMachineOnConnected(rmMeta: RemoteMachineMeta, conn: Client): Promise<void> {
// Create root working directory after ssh connection is ready
await this.generateGpuMetricsCollectorScript(rmMeta.username); //generate gpu script in local machine first, will copy to remote machine later
const nniRootDir: string = `${os.tmpdir()}/nni`;
const nniRootDir: string = getRemoteTmpDir(this.remoteOS) + '/nni';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mark


//create tmp trial working folder locally.
await cpp.exec(`cp -r ${this.trialConfig.codeDir}/* ${trialLocalTempFolder}`);
await execCopydir(this.trialConfig.codeDir+'/*',trialLocalTempFolder);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mark

if(dir === '/'){
dir = dir + path;
}else{
dir = dir + '/' + path;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potential bug

if(typeof cuda_visible_device === 'string' && cuda_visible_device.length > 0) {
command = `CUDA_VISIBLE_DEVICES=${cuda_visible_device} ${this.trialConfig.command}`;
} else {
command = `CUDA_VISIBLE_DEVICES=" " ${this.trialConfig.command}`;
command = `CUDA_VISIBLE_DEVICES='-1' ${this.trialConfig.command}`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to keep the original CUDA_VISIBLE_DEVICES=" " which already tested.

displayName: 'build nni bdsit_wheel'
- task: SSH@0
inputs:
sshEndpoint: remote_nni-ci-gpu-01
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this endpoint remote_nni-ci-gpu-01 should not be hard coded.

displayName: 'Start docker'
- task: DownloadSecureFile@1
inputs:
secureFile: remote_ci_private_key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to change this key file configurable from pipeline settings.

@@ -76,6 +77,11 @@ class NNIManager implements Manager {
status: 'INITIALIZED',
errors: []
};
this.trialJobMetricListener = (metric: TrialJobMetric) => {
this.onTrialJobMetrics(metric).catch((err: Error) => {
this.criticalError(new NNIError('Job metrics error', `Job metrics error: ${err.message}`, err));
Copy link
Contributor

@chicm-ms chicm-ms May 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to handle string type err, please reference this PR #1064 to change it like this: this.criticalError(NNIError.FromError(err, 'Job metrics error: '));

@QuanluZhang
Copy link
Contributor

please update doc accordingly

@xuehui1991 xuehui1991 merged commit a1f9266 into microsoft:master May 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants