From 03e9cc5b672d456dc4ab92d63cb6f4ed75183716 Mon Sep 17 00:00:00 2001
From: Shinai Yang
Date: Thu, 1 Aug 2019 22:08:15 +0800
Subject: [PATCH 1/9] add doc

---
 docs/en_US/TrainingService/SupportTrainingService.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)
 create mode 100644 docs/en_US/TrainingService/SupportTrainingService.md

diff --git a/docs/en_US/TrainingService/SupportTrainingService.md b/docs/en_US/TrainingService/SupportTrainingService.md
new file mode 100644
index 0000000000..cd42507639
--- /dev/null
+++ b/docs/en_US/TrainingService/SupportTrainingService.md
@@ -0,0 +1,12 @@
+# Supported TrainingService
+
+TrainingService is a concept of training platform that run trial jobs on the corresponding platform. NNI support [local](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md), [remote](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/RemoteMachineMode.md), [pai](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md), [kubeflow](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md) and [frameworkcontroller](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/FrameworkControllerMode.md) training service.
+
+
+|TrainingService|Brief Introduction|
+|---|---|
+|[__local__](#local)|Local mode means that NNI will run the trial jobs and nniManager process in local machine.|
+|[__remote__](#remote)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enouth gpu resource if specified.|
+|[__pai__](#pai)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.|
+|[__Kubeflow__](#Kubeflow)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
+|[__FrameworkController__](#FrameworkController)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
\ No newline at end of file

From 29773e0e7ce7720b144e5a5156d02ef6783201af Mon Sep 17 00:00:00 2001
From: Shinai Yang
Date: Thu, 1 Aug 2019 22:14:40 +0800
Subject: [PATCH 2/9] update

---
 .../TrainingService/SupportTrainingService.md | 25 ++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/docs/en_US/TrainingService/SupportTrainingService.md b/docs/en_US/TrainingService/SupportTrainingService.md
index cd42507639..599f34b445 100644
--- a/docs/en_US/TrainingService/SupportTrainingService.md
+++ b/docs/en_US/TrainingService/SupportTrainingService.md
@@ -9,4 +9,27 @@ TrainingService is a concept of training platform that run trial jobs on the cor
 |[__remote__](#remote)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enouth gpu resource if specified.|
 |[__pai__](#pai)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.|
 |[__Kubeflow__](#Kubeflow)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
-|[__FrameworkController__](#FrameworkController)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
\ No newline at end of file
+|[__FrameworkController__](#FrameworkController)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
+
+## Implement TrainingService
+
+TrainingService is designed to be easily implemented, we define an abstract class TrainingService as the parent class of all kinds of TrainingService, users just need to inherit the parent class and complete their own child class if they want to implement customized TrainingService.
+The abstract function in TrainingService is shown below:
+```
+abstract class TrainingService {
+    public abstract listTrialJobs(): Promise<TrialJobDetail[]>;
+    public abstract getTrialJob(trialJobId: string): Promise<TrialJobDetail>;
+    public abstract addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
+    public abstract removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
+    public abstract submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail>;
+    public abstract updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail>;
+    public abstract get isMultiPhaseJobSupported(): boolean;
+    public abstract cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean): Promise<void>;
+    public abstract setClusterMetadata(key: string, value: string): Promise<void>;
+    public abstract getClusterMetadata(key: string): Promise<string>;
+    public abstract cleanUp(): Promise<void>;
+    public abstract run(): Promise<void>;
+}
+```
+The parent class of TrainingService has a few abstract functions, users need to inherit the parent class and implement all of these abstract functions.
+For more information about how to write your own TrainingService, please [refer](https://github.com/SparkSnail/nni/blob/dev-trainingServiceDoc/docs/en_US/TrainingService/HowToImplementTrainingService.md).
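
To make the "inherit the parent class and implement all of these abstract functions" step in PATCH 2/9 concrete, here is a minimal sketch of a subclass. It is not NNI's implementation: `InMemoryTrainingService`, its `emitMetric` helper, the status strings, and the three simplified interfaces are hypothetical stand-ins so the snippet compiles on its own, and the `Promise` type arguments on the abstract class are assumed.

```typescript
// Simplified stand-ins for NNI's real types (assumptions, for illustration only).
interface JobApplicationForm { jobType: string; }
interface TrialJobDetail { id: string; status: string; form: JobApplicationForm; }
interface TrialJobMetric { id: string; data: string; }

// The abstract class from the patch; the Promise type arguments are assumed.
abstract class TrainingService {
    public abstract listTrialJobs(): Promise<TrialJobDetail[]>;
    public abstract getTrialJob(trialJobId: string): Promise<TrialJobDetail>;
    public abstract addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
    public abstract removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
    public abstract submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail>;
    public abstract updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail>;
    public abstract get isMultiPhaseJobSupported(): boolean;
    public abstract cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean): Promise<void>;
    public abstract setClusterMetadata(key: string, value: string): Promise<void>;
    public abstract getClusterMetadata(key: string): Promise<string>;
    public abstract cleanUp(): Promise<void>;
    public abstract run(): Promise<void>;
}

// Hypothetical toy service: records jobs in memory instead of launching them.
class InMemoryTrainingService extends TrainingService {
    private readonly jobs: Map<string, TrialJobDetail> = new Map();
    private readonly metadata: Map<string, string> = new Map();
    private listeners: ((metric: TrialJobMetric) => void)[] = [];
    private nextId: number = 0;

    public async listTrialJobs(): Promise<TrialJobDetail[]> {
        return Array.from(this.jobs.values());
    }
    public async getTrialJob(trialJobId: string): Promise<TrialJobDetail> {
        const job = this.jobs.get(trialJobId);
        if (job === undefined) { throw new Error(`Trial job ${trialJobId} not found`); }
        return job;
    }
    public addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void {
        this.listeners.push(listener);
    }
    public removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void {
        this.listeners = this.listeners.filter((l) => l !== listener);
    }
    public async submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail> {
        const job: TrialJobDetail = { id: `trial-${this.nextId}`, status: 'WAITING', form };
        this.nextId += 1;
        this.jobs.set(job.id, job);
        return job;
    }
    public async updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail> {
        const job = await this.getTrialJob(trialJobId);
        job.form = form;
        return job;
    }
    public get isMultiPhaseJobSupported(): boolean {
        return false;
    }
    public async cancelTrialJob(trialJobId: string, isEarlyStopped: boolean = false): Promise<void> {
        const job = await this.getTrialJob(trialJobId);
        job.status = isEarlyStopped ? 'EARLY_STOPPED' : 'USER_CANCELED';
    }
    public async setClusterMetadata(key: string, value: string): Promise<void> {
        this.metadata.set(key, value);
    }
    public async getClusterMetadata(key: string): Promise<string> {
        const value = this.metadata.get(key);
        if (value === undefined) { throw new Error(`No cluster metadata for key ${key}`); }
        return value;
    }
    public async cleanUp(): Promise<void> {
        this.jobs.clear();
        this.listeners = [];
    }
    public async run(): Promise<void> {
        // A real service would poll or watch its platform here; the toy has nothing to do.
    }
    // Hypothetical helper (not part of the abstract class): push a metric to listeners.
    public emitMetric(metric: TrialJobMetric): void {
        for (const listener of this.listeners) { listener(metric); }
    }
}
```

A caller would construct the subclass, `await` `submitTrialJob(...)` to obtain a job record, and register a metric listener before any metrics arrive. A real training service would replace the in-memory maps with calls to its platform.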
From db2b3047a043bb2f962976d9cd4b88ebda72b365 Mon Sep 17 00:00:00 2001
From: Shinai Yang
Date: Fri, 2 Aug 2019 10:29:35 +0800
Subject: [PATCH 3/9] fix dead link

---
 docs/en_US/TrainingService/SupportTrainingService.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/en_US/TrainingService/SupportTrainingService.md b/docs/en_US/TrainingService/SupportTrainingService.md
index 599f34b445..fc3eba1251 100644
--- a/docs/en_US/TrainingService/SupportTrainingService.md
+++ b/docs/en_US/TrainingService/SupportTrainingService.md
@@ -1,15 +1,15 @@
-# Supported TrainingService
+# Support TrainingService
 
 TrainingService is a concept of training platform that run trial jobs on the corresponding platform. NNI support [local](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md), [remote](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/RemoteMachineMode.md), [pai](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md), [kubeflow](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md) and [frameworkcontroller](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/FrameworkControllerMode.md) training service.
 
 
 |TrainingService|Brief Introduction|
 |---|---|
-|[__local__](#local)|Local mode means that NNI will run the trial jobs and nniManager process in local machine.|
-|[__remote__](#remote)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enouth gpu resource if specified.|
-|[__pai__](#pai)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.|
-|[__Kubeflow__](#Kubeflow)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
-|[__FrameworkController__](#FrameworkController)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
+|[__local__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md)|Local mode means that NNI will run the trial jobs and nniManager process in local machine.|
+|[__remote__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enouth gpu resource if specified.|
+|[__pai__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.|
+|[__Kubeflow__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
+|[__FrameworkController__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/FrameworkControllerMode.md)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
 
 ## Implement TrainingService
 

From df1016cf480a302a96664f0acaac830f040a393b Mon Sep 17 00:00:00 2001
From: Shinai Yang
Date: Fri, 2 Aug 2019 11:26:35 +0800
Subject: [PATCH 4/9] fix comments

---
 docs/en_US/TrainingService/SupportTrainingService.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/en_US/TrainingService/SupportTrainingService.md b/docs/en_US/TrainingService/SupportTrainingService.md
index fc3eba1251..2da3727bb0 100644
--- a/docs/en_US/TrainingService/SupportTrainingService.md
+++ b/docs/en_US/TrainingService/SupportTrainingService.md
@@ -1,8 +1,9 @@
 # Support TrainingService
 
-TrainingService is a concept of training platform that run trial jobs on the corresponding platform. NNI support [local](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md), [remote](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/RemoteMachineMode.md), [pai](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md), [kubeflow](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md) and [frameworkcontroller](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/FrameworkControllerMode.md) training service.
-
+TrainingService is a concept of training platform that run trial jobs on the corresponding platform. NNI support [local](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md), [remote](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/RemoteMachineMode.md), [pai](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md), [kubeflow](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md) and [frameworkcontroller](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/FrameworkControllerMode.md) training service.
+NNI not only provides few built-in training service options, but also provides a method for customers to build their own training service easily.
+## Built-in TrainingService
 
 |TrainingService|Brief Introduction|
 |---|---|
 |[__local__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md)|Local mode means that NNI will run the trial jobs and nniManager process in local machine.|

From d9216944d3944fe2eae24aae887f6533b49f3799 Mon Sep 17 00:00:00 2001
From: Shinai Yang
Date: Fri, 2 Aug 2019 11:33:36 +0800
Subject: [PATCH 5/9] fix comments

---
 docs/en_US/TrainingService/SupportTrainingService.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/en_US/TrainingService/SupportTrainingService.md b/docs/en_US/TrainingService/SupportTrainingService.md
index 2da3727bb0..b3f637af19 100644
--- a/docs/en_US/TrainingService/SupportTrainingService.md
+++ b/docs/en_US/TrainingService/SupportTrainingService.md
@@ -1,12 +1,12 @@
 # Support TrainingService
 
-TrainingService is a concept of training platform that run trial jobs on the corresponding platform. NNI support [local](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md), [remote](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/RemoteMachineMode.md), [pai](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md), [kubeflow](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md) and [frameworkcontroller](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/FrameworkControllerMode.md) training service.
+TrainingService is a concept of service that matain the training platform on which trial jobs run. NNI support [local](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md), [remote](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/RemoteMachineMode.md), [pai](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md), [kubeflow](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md) and [frameworkcontroller](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/FrameworkControllerMode.md) built-in training service.
 NNI not only provides few built-in training service options, but also provides a method for customers to build their own training service easily.
 ## Built-in TrainingService
 
 |TrainingService|Brief Introduction|
 |---|---|
-|[__local__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md)|Local mode means that NNI will run the trial jobs and nniManager process in local machine.|
+|[__local__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md)|NNI supports running an experiment on local machine, called local mode. Local mode means that NNI will run the trial jobs and nniManager process in same machine, and support gpu schedule function for trial jobs.|
 |[__remote__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enouth gpu resource if specified.|
 |[__pai__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.|
 |[__Kubeflow__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|

From fb8ab8ed4ce67271ef5e4ab75851e00cda660a0c Mon Sep 17 00:00:00 2001
From: Shinai Yang
Date: Fri, 2 Aug 2019 16:10:37 +0800
Subject: [PATCH 6/9] fix comments

---
 .../TrainingService/SupportTrainingService.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/en_US/TrainingService/SupportTrainingService.md b/docs/en_US/TrainingService/SupportTrainingService.md
index b3f637af19..461f7279a4 100644
--- a/docs/en_US/TrainingService/SupportTrainingService.md
+++ b/docs/en_US/TrainingService/SupportTrainingService.md
@@ -1,18 +1,18 @@
 # Support TrainingService
 
-TrainingService is a concept of service that matain the training platform on which trial jobs run. NNI support [local](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md), [remote](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/RemoteMachineMode.md), [pai](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md), [kubeflow](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md) and [frameworkcontroller](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/FrameworkControllerMode.md) built-in training service.
+TrainingService is a concept of service that matain the training platform on which trial jobs run. NNI support [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md), [kubeflow](./KubeflowMode.md) and [frameworkcontroller](./FrameworkControllerMode.md) built-in training service.
 NNI not only provides few built-in training service options, but also provides a method for customers to build their own training service easily.
 ## Built-in TrainingService
 
 |TrainingService|Brief Introduction|
 |---|---|
-|[__local__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/LocalMode.md)|NNI supports running an experiment on local machine, called local mode. Local mode means that NNI will run the trial jobs and nniManager process in same machine, and support gpu schedule function for trial jobs.|
-|[__remote__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enouth gpu resource if specified.|
-|[__pai__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.|
-|[__Kubeflow__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
-|[__FrameworkController__](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/FrameworkControllerMode.md)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
+|[__local__](./LocalMode.md)|NNI supports running an experiment on local machine, called local mode. Local mode means that NNI will run the trial jobs and nniManager process in same machine, and support gpu schedule function for trial jobs.|
+|[__remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enouth gpu resource if specified.|
+|[__pai__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.|
+|[__Kubeflow__](./KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
+|[__FrameworkController__](./FrameworkControllerMode.md)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
 
-## Implement TrainingService
+## TrainingService Implementation
 
 TrainingService is designed to be easily implemented, we define an abstract class TrainingService as the parent class of all kinds of TrainingService, users just need to inherit the parent class and complete their own child class if they want to implement customized TrainingService.
 The abstract function in TrainingService is shown below:

From a88915d351b5914360dbedc82bb83d522dbc9253 Mon Sep 17 00:00:00 2001
From: Shinai Yang
Date: Mon, 5 Aug 2019 15:54:07 +0800
Subject: [PATCH 7/9] fix comments

---
 docs/en_US/TrainingService/SupportTrainingService.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/en_US/TrainingService/SupportTrainingService.md b/docs/en_US/TrainingService/SupportTrainingService.md
index 461f7279a4..a6396c0f58 100644
--- a/docs/en_US/TrainingService/SupportTrainingService.md
+++ b/docs/en_US/TrainingService/SupportTrainingService.md
@@ -1,6 +1,6 @@
 # Support TrainingService
 
-TrainingService is a concept of service that matain the training platform on which trial jobs run. NNI support [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md), [kubeflow](./KubeflowMode.md) and [frameworkcontroller](./FrameworkControllerMode.md) built-in training service.
+TrainingService is a concept of service that maintain the training platform on which trial jobs run. NNI support [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md), [kubeflow](./KubeflowMode.md) and [frameworkcontroller](./FrameworkControllerMode.md) built-in training service.
 NNI not only provides few built-in training service options, but also provides a method for customers to build their own training service easily.
 ## Built-in TrainingService
 

From 5fe5680c7011b98629f5a2d4658d5079352c051b Mon Sep 17 00:00:00 2001
From: Shinai Yang
Date: Mon, 5 Aug 2019 19:19:07 +0800
Subject: [PATCH 8/9] fix comments

---
 docs/en_US/TrainingService/SupportTrainingService.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/en_US/TrainingService/SupportTrainingService.md b/docs/en_US/TrainingService/SupportTrainingService.md
index a6396c0f58..e66a231911 100644
--- a/docs/en_US/TrainingService/SupportTrainingService.md
+++ b/docs/en_US/TrainingService/SupportTrainingService.md
@@ -1,14 +1,14 @@
 # Support TrainingService
 
-TrainingService is a concept of service that maintain the training platform on which trial jobs run. NNI support [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md), [kubeflow](./KubeflowMode.md) and [frameworkcontroller](./FrameworkControllerMode.md) built-in training service.
+TrainingService is a concept of service that used to maintain a training platform for running trial jobs. NNI supports [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md), [kubeflow](./KubeflowMode.md) and [frameworkcontroller](./FrameworkControllerMode.md) built-in training services.
 NNI not only provides few built-in training service options, but also provides a method for customers to build their own training service easily.
 ## Built-in TrainingService
 
 |TrainingService|Brief Introduction|
 |---|---|
-|[__local__](./LocalMode.md)|NNI supports running an experiment on local machine, called local mode. Local mode means that NNI will run the trial jobs and nniManager process in same machine, and support gpu schedule function for trial jobs.|
-|[__remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enouth gpu resource if specified.|
-|[__pai__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.|
+|[__Local__](./LocalMode.md)|NNI supports running an experiment on local machine, called local mode. Local mode means that NNI will run the trial jobs and nniManager process in same machine, and support gpu schedule function for trial jobs.|
+|[__Remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enouth gpu resource if specified.|
+|[__Pai__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.|
 |[__Kubeflow__](./KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
 |[__FrameworkController__](./FrameworkControllerMode.md)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
 

From 2268fe3d497ca547273ce741a155533ac6c4b666 Mon Sep 17 00:00:00 2001
From: Shinai Yang
Date: Tue, 6 Aug 2019 17:59:33 +0800
Subject: [PATCH 9/9] fix comments

---
 docs/en_US/TrainingService/SupportTrainingService.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/en_US/TrainingService/SupportTrainingService.md b/docs/en_US/TrainingService/SupportTrainingService.md
index e66a231911..50c91173e2 100644
--- a/docs/en_US/TrainingService/SupportTrainingService.md
+++ b/docs/en_US/TrainingService/SupportTrainingService.md
@@ -1,6 +1,6 @@
-# Support TrainingService
+# TrainingService
 
-TrainingService is a concept of service that used to maintain a training platform for running trial jobs. NNI supports [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md), [kubeflow](./KubeflowMode.md) and [frameworkcontroller](./FrameworkControllerMode.md) built-in training services.
+NNI TrainingService provides the training platform for running NNI trial jobs. NNI supports [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md), [kubeflow](./KubeflowMode.md) and [frameworkcontroller](./FrameworkControllerMode.md) built-in training services.
 NNI not only provides few built-in training service options, but also provides a method for customers to build their own training service easily.
 ## Built-in TrainingService
 