2 changes: 1 addition & 1 deletion docs/source/components/distributed.rst
@@ -4,4 +4,4 @@ Distributed
.. automodule:: torchx.components.dist
.. currentmodule:: torchx.components.dist

.. autofunction:: torchx.components.dist.ddp.get_app_spec
.. autofunction:: torchx.components.dist.ddp
2 changes: 1 addition & 1 deletion docs/source/components/serve.rst
@@ -5,5 +5,5 @@ Serve
.. currentmodule:: torchx.components.serve


.. currentmodule:: torchx.components.serve.serve
.. currentmodule:: torchx.components.serve
.. autofunction:: torchserve
3 changes: 2 additions & 1 deletion docs/source/components/utils.rst
@@ -4,4 +4,5 @@ Utils
.. automodule:: torchx.components.utils
.. currentmodule:: torchx.components.utils

.. autofunction:: torchx.components.utils.echo.get_app_spec
.. autofunction:: torchx.components.utils.echo
.. autofunction:: torchx.components.utils.touch
96 changes: 85 additions & 11 deletions docs/source/quickstart.rst
@@ -17,15 +17,16 @@ For now lets take a look at the builtins

$ torchx builtins
Found <n> builtin configs:
1. echo
2. touch
...
i. utils.echo
j. utils.touch
...

Echo looks familiar and simple. Lets understand how to run ``echo``.
Echo looks familiar and simple. Let's understand how to run ``utils.echo``.

.. code-block:: shell-session

$ torchx run --scheduler local echo --help
$ torchx run --scheduler local utils.echo --help
usage: torchx run echo [-h] [--msg MSG]

Echos a message
@@ -38,7 +39,7 @@ We can see that it takes a ``--msg`` argument. Lets try running it locally

.. code-block:: shell-session

$ torchx run --scheduler local echo --msg "hello world"
$ torchx run --scheduler local utils.echo --msg "hello world"

.. note:: ``echo`` in this context is just an app spec. It is not the application
          logic itself but rather just the "job definition" for running `/bin/echo`.
@@ -58,16 +59,16 @@ This is just a regular python file where we define the app spec.

.. code-block:: shell-session

$ touch ~/echo_torchx.py
$ touch ~/test.py

Now copy paste the following into echo_torchx.py
Now copy and paste the following into ``test.py``

::

import torchx.specs as specs


def get_app_spec(num_replicas: int, msg: str = "hello world") -> specs.AppDef:
def echo(num_replicas: int, msg: str = "hello world") -> specs.AppDef:
    """
    Echos a message to stdout (calls /bin/echo)

@@ -83,8 +84,8 @@ Now copy paste the following into echo_torchx.py
name="echo",
entrypoint="/bin/echo",
image="/tmp",
args=[f"replica #{specs.macros.replica_id}: msg"],
num_replicas=1,
args=[f"replica #{specs.macros.replica_id}: {msg}"],
num_replicas=num_replicas,
)
],
)
@@ -103,13 +104,86 @@ Now lets try running our custom ``echo``

.. code-block:: shell-session

$ torchx run --scheduler local ~/echo_torchx.py --num_replicas 4 --msg "foobar"
$ torchx run --scheduler local ~/test.py:echo --num_replicas 4 --msg "foobar"

replica #0: foobar
replica #1: foobar
replica #2: foobar
replica #3: foobar

Running on Other Images
-----------------------------
So far we've run ``utils.echo`` with ``image=/tmp``. This means that the
``entrypoint`` we specified is relative to ``/tmp``. That did not matter for us
since we specified an absolute path as the entrypoint (``entrypoint=/bin/echo``).
Had we specified ``entrypoint=echo``, the local scheduler would have tried to invoke
``/tmp/echo``.
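
For illustration, a role like the following (a hypothetical variant of the echo spec
above, not part of this PR) would make the local scheduler invoke ``/tmp/echo``, since
the relative ``entrypoint`` is resolved against the ``image`` directory::

    specs.Role(
        name="echo",
        entrypoint="echo",  # relative path; resolved against image="/tmp" -> /tmp/echo
        image="/tmp",
        args=["hello"],
        num_replicas=1,
    )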

If you have a pre-built application binary, setting the image to a local directory is a
quick way to validate the application and the ``specs.AppDef``. But it's not all
that useful if you want to run the application on a remote scheduler
(see :ref:`Running On Other Schedulers`).

.. note:: The ``image`` string in ``specs.Role`` is an identifier to a container image
          supported by the scheduler. Refer to the scheduler documentation to find out
          what container image is supported by the scheduler you want to use.

For the ``local`` scheduler we can see that it supports both a local directory
and docker as the image type:

.. code-block:: shell-session

$ torchx runopts local

{ 'image_type': { 'default': 'dir',
                  'help': 'image type. One of [dir, docker]',
                  'type': 'str'},
  ... <omitted for brevity> ...


.. note:: Before proceeding, you will need docker installed. If you have not done so already,
          follow the install instructions at https://docs.docker.com/get-docker/

Now let's try running ``echo`` from a docker container. Modify echo's ``AppDef``
in the ``~/test.py`` you created in the previous section so that ``image="ubuntu:latest"``.

::

import torchx.specs as specs


def echo(num_replicas: int, msg: str = "hello world") -> specs.AppDef:
    """
    Echos a message to stdout (calls /bin/echo)

    Args:
        num_replicas: number of copies (in parallel) to run
        msg: message to echo

    """
    return specs.AppDef(
        name="echo",
        roles=[
            specs.Role(
                name="echo",
                entrypoint="/bin/echo",
                image="ubuntu:latest",  # IMAGE NOW POINTS TO THE UBUNTU DOCKER IMAGE
                args=[f"replica #{specs.macros.replica_id}: {msg}"],
                num_replicas=num_replicas,
            )
        ],
    )

Try running the echo app:

.. code-block:: shell-session

$ torchx run --scheduler local \
--scheduler_args image_type=docker \
~/test.py:echo \
--num_replicas 4 \
--msg "foobar from docker!"

Running On Other Schedulers
-----------------------------
So far we've launched components locally. Let's take a look at how to run this on
4 changes: 1 addition & 3 deletions examples/pipelines/kfp/kfp_pipeline.py
@@ -16,7 +16,6 @@
something that can be used within KFP.
"""


# %%
# Input Arguments
# ###############
@@ -105,7 +104,6 @@

args: argparse.Namespace = parser.parse_args(sys.argv[1:])


# %%
# Creating the Components
# #######################
@@ -166,7 +164,7 @@

import os.path

from torchx.components.serve.serve import torchserve
from torchx.components.serve import torchserve

serve_app: specs.AppDef = torchserve(
model_path=os.path.join(args.output_path, "model.mar"),
1 change: 1 addition & 0 deletions torchx/cli/cmd_run.py
@@ -195,6 +195,7 @@ def add_arguments(self, subparser: argparse.ArgumentParser) -> None:
"--scheduler",
type=str,
help="Name of the scheduler to use",
default="default",
)
subparser.add_argument(
"--scheduler_args",
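
With this default in place, ``--scheduler`` can be omitted on the command line, e.g.
(a sketch; this assumes the runner resolves the ``default`` scheduler name to a
configured scheduler such as ``local``)::

    $ torchx run utils.echo --msg "hello world"
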
65 changes: 65 additions & 0 deletions torchx/components/dist.py
@@ -0,0 +1,65 @@
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""
Components for applications that run as distributed jobs. Many of the
components in this section are simply topological, meaning that they define
the layout of the nodes in a distributed setting and take the actual
binaries that each group of nodes (``specs.Role``) runs.
"""

from typing import Dict, Optional

import torchx.specs as specs
from torchx.components.base import torch_dist_role


def ddp(
    image: str,
    entrypoint: str,
    resource: Optional[str] = None,
    nnodes: int = 1,
    nproc_per_node: int = 1,
    base_image: Optional[str] = None,
    name: str = "test_name",
    role: str = "worker",
    env: Optional[Dict[str, str]] = None,
    *script_args: str,
) -> specs.AppDef:
"""
Distributed data parallel style application (one role, multi-replica).

Args:
image: container image.
entrypoint: script or binary to run within the image.
resource: Registered named resource.
nnodes: Number of nodes.
nproc_per_node: Number of processes per node.
name: Name of the application.
base_image: container base image (not required) .
role: Name of the ddp role.
script: Main script.
env: Env variables.
script_args: Script arguments.

Returns:
specs.AppDef: Torchx AppDef
"""

    ddp_role = torch_dist_role(
        name=role,
        image=image,
        base_image=base_image,
        entrypoint=entrypoint,
        resource=resource or specs.NULL_RESOURCE,
        script_args=list(script_args),
        script_envs=env,
        nproc_per_node=nproc_per_node,
        nnodes=nnodes,
        max_restarts=0,
    ).replicas(nnodes)

    return specs.AppDef(name).of(ddp_role)
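
A minimal usage sketch for the new ``dist.ddp`` component (``my/image:latest`` and
``train.py`` are hypothetical placeholders, not part of this PR)::

    from torchx.components.dist import ddp

    # Builds an AppDef with a single "worker" role replicated across 2 nodes,
    # 8 processes per node.
    app = ddp(
        image="my/image:latest",
        entrypoint="train.py",
        nnodes=2,
        nproc_per_node=8,
        name="my_ddp_job",
    )
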
12 changes: 0 additions & 12 deletions torchx/components/dist/__init__.py

This file was deleted.

15 changes: 0 additions & 15 deletions torchx/components/dist/ddp.py

This file was deleted.

51 changes: 0 additions & 51 deletions torchx/components/distributed.py

This file was deleted.

File renamed without changes.
@@ -4,6 +4,11 @@
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""
These components aim to make it easier to interact with inference and serving
tools such as `torchserve <https://pytorch.org/serve/>`_.
"""

from typing import Dict, Optional

import torchx.specs as specs
@@ -19,7 +24,7 @@ def torchserve(
"""Deploys the provided model to the given torchserve management API
endpoint.
>>> from torchx.components.serve.serve import torchserve
>>> from torchx.components.serve import torchserve
>>> torchserve(
... model_path="s3://your-bucket/your-model.pt",
... management_api="http://torchserve:8081",
10 changes: 0 additions & 10 deletions torchx/components/serve/__init__.py

This file was deleted.

Empty file.
@@ -4,11 +4,10 @@
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

import torchx.components.distributed as distributed_components

from .component_test_base import ComponentTestCase
import torchx.components.dist as dist
from torchx.components.test.component_test_base import ComponentTestCase


class DistributedComponentTest(ComponentTestCase):
    def test_ddp(self) -> None:
        self._validate(distributed_components, "ddp")
        self._validate(dist, "ddp")
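
The same pattern should also work for validating user-defined components (a sketch,
assuming ``ComponentTestCase._validate`` takes a module and the name of a component
function, as in the test above)::

    import my_components  # hypothetical module containing an `echo` component


    class MyComponentTest(ComponentTestCase):
        def test_echo(self) -> None:
            self._validate(my_components, "echo")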