-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SupportTrainingService.md #1401
Changes from 8 commits
03e9cc5
29773e0
db2b304
df1016c
d921694
fb8ab8e
714c159
a88915d
5fe5680
2268fe3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
# Support TrainingService | ||
|
||
TrainingService is a concept of service that maintain the training platform on which trial jobs run. NNI support [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md), [kubeflow](./KubeflowMode.md) and [frameworkcontroller](./FrameworkControllerMode.md) built-in training service. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "TrainingService is a concept of service that maintain the training platform on which trial jobs run." -->"TrainingService is a concept of service that used to maintain a training platform for running trial jobs." Maybe there can change like this? I first thought that run.NNI was connected. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "NNI support [local]" -->" NNI supports [local] " There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "built-in training service" --> "built-in training services" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed. |
||
NNI not only provides few built-in training service options, but also provides a method for customers to build their own training service easily. | ||
|
||
## Built-in TrainingService | ||
|TrainingService|Brief Introduction| | ||
|---|---| | ||
|[__local__](./LocalMode.md)|NNI supports running an experiment on local machine, called local mode. Local mode means that NNI will run the trial jobs and nniManager process in same machine, and support gpu schedule function for trial jobs.| | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. local --> Local There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed. |
||
|[__remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enouth gpu resource if specified.| | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. remote --> Remote There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed. |
||
|[__pai__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.| | ||
|[__Kubeflow__](./KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.| | ||
|[__FrameworkController__](./FrameworkControllerMode.md)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.| | ||
|
||
## TrainingService Implementation | ||
|
||
TrainingService is designed to be easily implemented, we define an abstract class TrainingService as the parent class of all kinds of TrainingService, users just need to inherit the parent class and complete their own child class if they want to implement customized TrainingService. | ||
The abstract function in TrainingService is shown below: | ||
``` | ||
abstract class TrainingService { | ||
public abstract listTrialJobs(): Promise<TrialJobDetail[]>; | ||
public abstract getTrialJob(trialJobId: string): Promise<TrialJobDetail>; | ||
public abstract addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void; | ||
public abstract removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void; | ||
public abstract submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail>; | ||
public abstract updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail>; | ||
public abstract get isMultiPhaseJobSupported(): boolean; | ||
public abstract cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean): Promise<void>; | ||
public abstract setClusterMetadata(key: string, value: string): Promise<void>; | ||
public abstract getClusterMetadata(key: string): Promise<string>; | ||
public abstract cleanUp(): Promise<void>; | ||
public abstract run(): Promise<void>; | ||
} | ||
``` | ||
The parent class of TrainingService has a few abstract functions, users need to inherit the parent class and implement all of these abstract functions. | ||
For more information about how to write your own TrainingService, please [refer](https://github.com/SparkSnail/nni/blob/dev-trainingServiceDoc/docs/en_US/TrainingService/HowToImplementTrainingService.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change the title to TrainingService, as this is a page for Training Service root.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.