Add SupportTrainingService.md #1401

Merged
merged 10 commits into from
Aug 6, 2019

SparkSnail
Contributor

@SparkSnail SparkSnail commented Aug 2, 2019

# Support TrainingService

TrainingService is a concept of a training platform that runs trial jobs on the corresponding platform. NNI supports [local](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md), [remote](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/RemoteMachineMode.md), [pai](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md), [kubeflow](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md) and [frameworkcontroller](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/FrameworkControllerMode.md) training services.

Member

NNI not only provides a few built-in training service options, but also provides a method for customers to build their own training service easily.

Contributor Author

fixed.

@@ -0,0 +1,36 @@
# Support TrainingService

TrainingService is a concept of service that matain the training platform on which trial jobs run. NNI support [local](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md), [remote](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/RemoteMachineMode.md), [pai](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md), [kubeflow](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md) and [frameworkcontroller](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/FrameworkControllerMode.md) built-in training service.
Contributor

shall we use a relative path here?

Contributor Author

fixed.

|[__Kubeflow__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md)|NNI supports running experiments on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service (AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as a Kubeflow job in the Kubernetes cluster.|
|[__FrameworkController__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/FrameworkControllerMode.md)|NNI supports running experiments using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so you don't need to install Kubeflow for a specific deep learning framework, such as tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiments.|

## Implement TrainingService
Contributor

TrainingService Implementation?

Contributor Author

fixed.

@SparkSnail SparkSnail requested review from QuanluZhang and chicm-ms and removed request for QuanluZhang August 2, 2019 08:18
@@ -0,0 +1,36 @@
# Support TrainingService

TrainingService is a concept of service that matain the training platform on which trial jobs run. NNI support [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md), [kubeflow](./KubeflowMode.md) and [frameworkcontroller](./FrameworkControllerMode.md) built-in training service.
Contributor

matain?

Contributor Author

changed to maintain

@@ -0,0 +1,36 @@
# Support TrainingService

TrainingService is a concept of service that maintain the training platform on which trial jobs run. NNI support [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md), [kubeflow](./KubeflowMode.md) and [frameworkcontroller](./FrameworkControllerMode.md) built-in training service.
Contributor

"TrainingService is a concept of service that maintain the training platform on which trial jobs run." -->"TrainingService is a concept of service that used to maintain a training platform for running trial jobs." Maybe it could be changed like this? At first I read "run. NNI" as if the two were connected.

Contributor Author

fixed.

@@ -0,0 +1,36 @@
# Support TrainingService

TrainingService is a concept of service that maintain the training platform on which trial jobs run. NNI support [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md), [kubeflow](./KubeflowMode.md) and [frameworkcontroller](./FrameworkControllerMode.md) built-in training service.
Contributor

"NNI support [local]" -->" NNI supports [local] "

Contributor

"built-in training service" --> "built-in training services"

Contributor Author

fixed.

## Built-in TrainingService
|TrainingService|Brief Introduction|
|---|---|
|[__local__](./LocalMode.md)|NNI supports running an experiment on local machine, called local mode. Local mode means that NNI will run the trial jobs and nniManager process in same machine, and support gpu schedule function for trial jobs.|
Contributor

local --> Local

Contributor Author

fixed.

|TrainingService|Brief Introduction|
|---|---|
|[__local__](./LocalMode.md)|NNI supports running an experiment on local machine, called local mode. Local mode means that NNI will run the trial jobs and nniManager process in same machine, and support gpu schedule function for trial jobs.|
|[__remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through an SSH channel, called remote mode. NNI assumes that you have access to those machines and have already set up the environment for running deep learning training code. NNI will submit the trial jobs to the remote machines and, if specified, schedule them on a machine with enough GPU resources.|
Contributor

remote --> Remote
pai --> PAI

Contributor Author

fixed.

Member

@scarlett2018 scarlett2018 left a comment

approved with minor suggestions.

@@ -0,0 +1,36 @@
# Support TrainingService
Member

Change the title to TrainingService, as this is the root page for Training Service.

Contributor Author

fixed.

@@ -0,0 +1,36 @@
# Support TrainingService

TrainingService is a concept of service that used to maintain a training platform for running trial jobs. NNI supports [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md), [kubeflow](./KubeflowMode.md) and [frameworkcontroller](./FrameworkControllerMode.md) built-in training services.
Member

NNI TrainingService provides the training platform for running NNI trial jobs.

Contributor Author

fixed

@suiguoxin suiguoxin changed the base branch from master to v1.0 August 6, 2019 09:19
@SparkSnail SparkSnail merged commit 6c9d370 into microsoft:v1.0 Aug 6, 2019
liuzhe-lz pushed a commit that referenced this pull request Aug 14, 2019
* Update filter description and fix typo

* fix comments

* change node to result

* Add SupportTrainingService.md (#1401)

* fix nnictl schema

* Eject from react-scripts-ts-antd and bump webui dependencies version (#1412)

* Eject from react-scripts-ts-antd

* test whether it can pass CI without ugilfy

* temporarily disable uglify

* Try to fix security alert (#1429)

* fix bug of hyper-parameter broken when have not succeeded trial

* update

* update
liuzhe-lz pushed a commit that referenced this pull request Aug 20, 2019
…lt (#1472)

* Update filter description and fix typo

* fix comments

* change node to result

* Add SupportTrainingService.md (#1401)

* fix nnictl schema

* Eject from react-scripts-ts-antd and bump webui dependencies version (#1412)

* Eject from react-scripts-ts-antd

* test whether it can pass CI without ugilfy

* temporarily disable uglify

* Try to fix security alert (#1429)

* fix bug of detail page broken when trial is succeed but not report final result

5 participants