2 changes: 1 addition & 1 deletion docs/source/components/distributed.rst
@@ -4,4 +4,4 @@ Distributed
.. automodule:: torchx.components.dist
.. currentmodule:: torchx.components.dist

.. autofunction:: torchx.components.dist.ddp.get_app_spec
.. autofunction:: torchx.components.dist.ddp
2 changes: 1 addition & 1 deletion docs/source/components/serve.rst
@@ -5,5 +5,5 @@ Serve
.. currentmodule:: torchx.components.serve


.. currentmodule:: torchx.components.serve.serve
.. currentmodule:: torchx.components.serve
.. autofunction:: torchserve
3 changes: 2 additions & 1 deletion docs/source/components/utils.rst
@@ -4,4 +4,5 @@ Utils
.. automodule:: torchx.components.utils
.. currentmodule:: torchx.components.utils

.. autofunction:: torchx.components.utils.echo.get_app_spec
.. autofunction:: torchx.components.utils.echo
.. autofunction:: torchx.components.utils.touch
96 changes: 85 additions & 11 deletions docs/source/quickstart.rst
@@ -17,15 +17,16 @@ For now lets take a look at the builtins

$ torchx builtins
Found <n> builtin configs:
1. echo
2. touch
...
i. utils.echo
j. utils.touch
...

Echo looks familiar and simple. Lets understand how to run ``echo``.
Echo looks familiar and simple. Let's understand how to run ``utils.echo``.

.. code-block:: shell-session

$ torchx run --scheduler local echo --help
$ torchx run --scheduler local utils.echo --help
usage: torchx run echo [-h] [--msg MSG]

Echos a message
@@ -38,7 +39,7 @@ We can see that it takes a ``--msg`` argument. Lets try running it locally

.. code-block:: shell-session

$ torchx run --scheduler local echo --msg "hello world"
$ torchx run --scheduler local utils.echo --msg "hello world"

.. note:: ``echo`` in this context is just an app spec. It is not the application
          logic itself but rather just the "job definition" for running `/bin/echo`.
@@ -58,16 +59,16 @@ This is just a regular python file where we define the app spec.

.. code-block:: shell-session

$ touch ~/echo_torchx.py
$ touch ~/test.py

Now copy paste the following into echo_torchx.py
Now copy and paste the following into ``test.py``

::

import torchx.specs as specs


def get_app_spec(num_replicas: int, msg: str = "hello world") -> specs.AppDef:
def echo(num_replicas: int, msg: str = "hello world") -> specs.AppDef:
    """
    Echos a message to stdout (calls /bin/echo)

@@ -83,8 +84,8 @@ Now copy paste the following into echo_torchx.py
name="echo",
entrypoint="/bin/echo",
image="/tmp",
args=[f"replica #{specs.macros.replica_id}: msg"],
num_replicas=1,
args=[f"replica #{specs.macros.replica_id}: {msg}"],
num_replicas=num_replicas,
)
],
)
@@ -103,13 +104,86 @@ Now lets try running our custom ``echo``

.. code-block:: shell-session

$ torchx run --scheduler local ~/echo_torchx.py --num_replicas 4 --msg "foobar"
$ torchx run --scheduler local ~/test.py:echo --num_replicas 4 --msg "foobar"

replica #0: foobar
replica #1: foobar
replica #2: foobar
replica #3: foobar

Running on Other Images
-----------------------------
So far we've run ``utils.echo`` with ``image=/tmp``. This means that the
``entrypoint`` we specified is relative to ``/tmp``. That did not matter for us
since we specified an absolute path as the entrypoint (``entrypoint=/bin/echo``).
Had we specified ``entrypoint=echo``, the local scheduler would have tried to invoke
``/tmp/echo``.
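
For illustration, a role like the following (a hypothetical variant of the echo spec
above, not part of this PR) would make the local scheduler invoke ``/tmp/echo``, since
the relative ``entrypoint`` is resolved against the ``image`` directory::

    specs.Role(
        name="echo",
        entrypoint="echo",  # relative path; resolved against image="/tmp" -> /tmp/echo
        image="/tmp",
        args=["hello"],
        num_replicas=1,
    )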

If you have a pre-built application binary, setting the image to a local directory is a
quick way to validate the application and the ``specs.AppDef``. But it's not all
that useful if you want to run the application on a remote scheduler
(see :ref:`Running On Other Schedulers`).

.. note:: The ``image`` string in ``specs.Role`` is an identifier to a container image
          supported by the scheduler. Refer to the scheduler documentation to find out
          what container image is supported by the scheduler you want to use.

For the ``local`` scheduler we can see that it supports both a local directory
and docker as the image type:

.. code-block:: shell-session

$ torchx runopts local

{ 'image_type': { 'default': 'dir',
                  'help': 'image type. One of [dir, docker]',
                  'type': 'str'},
  ... <omitted for brevity> ...


.. note:: Before proceeding, you will need docker installed. If you have not done so already,
          follow the install instructions at https://docs.docker.com/get-docker/

Now let's try running ``echo`` from a docker container. Modify echo's ``AppDef``
in the ``~/test.py`` you created in the previous section so that ``image="ubuntu:latest"``.

::

import torchx.specs as specs


def echo(num_replicas: int, msg: str = "hello world") -> specs.AppDef:
    """
    Echos a message to stdout (calls /bin/echo)

    Args:
        num_replicas: number of copies (in parallel) to run
        msg: message to echo

    """
    return specs.AppDef(
        name="echo",
        roles=[
            specs.Role(
                name="echo",
                entrypoint="/bin/echo",
                image="ubuntu:latest",  # IMAGE NOW POINTS TO THE UBUNTU DOCKER IMAGE
                args=[f"replica #{specs.macros.replica_id}: {msg}"],
                num_replicas=num_replicas,
            )
        ],
    )

Try running the echo app:

.. code-block:: shell-session

$ torchx run --scheduler local \
--scheduler_args image_type=docker \
~/test.py:echo \
--num_replicas 4 \
--msg "foobar from docker!"

Running On Other Schedulers
-----------------------------
So far we've launched components locally. Let's take a look at how to run this on
4 changes: 1 addition & 3 deletions examples/pipelines/kfp/kfp_pipeline.py
@@ -16,7 +16,6 @@
something that can be used within KFP.
"""


# %%
# Input Arguments
# ###############
@@ -105,7 +104,6 @@

args: argparse.Namespace = parser.parse_args(sys.argv[1:])


# %%
# Creating the Components
# #######################
@@ -166,7 +164,7 @@

import os.path

from torchx.components.serve.serve import torchserve
from torchx.components.serve import torchserve

serve_app: specs.AppDef = torchserve(
model_path=os.path.join(args.output_path, "model.mar"),
1 change: 1 addition & 0 deletions torchx/cli/cmd_run.py
@@ -195,6 +195,7 @@ def add_arguments(self, subparser: argparse.ArgumentParser) -> None:
"--scheduler",
type=str,
help="Name of the scheduler to use",
default="default",
)
subparser.add_argument(
"--scheduler_args",
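
With this default in place, ``--scheduler`` can be omitted on the command line, e.g.
(a sketch; this assumes the runner resolves the ``default`` scheduler name to a
configured scheduler such as ``local``)::

    $ torchx run utils.echo --msg "hello world"
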
65 changes: 65 additions & 0 deletions torchx/components/dist.py
@@ -0,0 +1,65 @@
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""
Components for applications that run as distributed jobs. Many of the
components in this section are simply topological, meaning that they define
the layout of the nodes in a distributed setting and take the actual
binaries that each group of nodes (``specs.Role``) runs.
"""

from typing import Dict, Optional

import torchx.specs as specs
from torchx.components.base import torch_dist_role


def ddp(
    image: str,
    entrypoint: str,
    resource: Optional[str] = None,
    nnodes: int = 1,
    nproc_per_node: int = 1,
    base_image: Optional[str] = None,
    name: str = "test_name",
    role: str = "worker",
    env: Optional[Dict[str, str]] = None,
    *script_args: str,
) -> specs.AppDef:
"""
Distributed data parallel style application (one role, multi-replica).

Args:
image: container image.
entrypoint: script or binary to run within the image.
resource: Registered named resource.
nnodes: Number of nodes.
nproc_per_node: Number of processes per node.
name: Name of the application.
base_image: container base image (not required) .
role: Name of the ddp role.
script: Main script.
env: Env variables.
script_args: Script arguments.

Returns:
specs.AppDef: Torchx AppDef
"""

    ddp_role = torch_dist_role(
        name=role,
        image=image,
        base_image=base_image,
        entrypoint=entrypoint,
        resource=resource or specs.NULL_RESOURCE,
        script_args=list(script_args),
        script_envs=env,
        nproc_per_node=nproc_per_node,
        nnodes=nnodes,
        max_restarts=0,
    ).replicas(nnodes)

    return specs.AppDef(name).of(ddp_role)
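
A minimal usage sketch for the new ``dist.ddp`` component (``my/image:latest`` and
``train.py`` are hypothetical placeholders, not part of this PR)::

    from torchx.components.dist import ddp

    # Builds an AppDef with a single "worker" role replicated across 2 nodes,
    # 8 processes per node.
    app = ddp(
        image="my/image:latest",
        entrypoint="train.py",
        nnodes=2,
        nproc_per_node=8,
        name="my_ddp_job",
    )
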
12 changes: 0 additions & 12 deletions torchx/components/dist/__init__.py

This file was deleted.

15 changes: 0 additions & 15 deletions torchx/components/dist/ddp.py

This file was deleted.

51 changes: 0 additions & 51 deletions torchx/components/distributed.py

This file was deleted.

File renamed without changes.
@@ -4,6 +4,11 @@
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""
These components aim to make it easier to interact with inference and serving
tools such as `torchserve <https://pytorch.org/serve/>`_.
"""

from typing import Dict, Optional

import torchx.specs as specs
@@ -19,7 +24,7 @@ def torchserve(
"""Deploys the provided model to the given torchserve management API
endpoint.
>>> from torchx.components.serve.serve import torchserve
>>> from torchx.components.serve import torchserve
>>> torchserve(
... model_path="s3://your-bucket/your-model.pt",
... management_api="http://torchserve:8081",
10 changes: 0 additions & 10 deletions torchx/components/serve/__init__.py

This file was deleted.

Empty file.
@@ -4,11 +4,10 @@
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

import torchx.components.distributed as distributed_components

from .component_test_base import ComponentTestCase
import torchx.components.dist as dist
from torchx.components.test.component_test_base import ComponentTestCase


class DistributedComponentTest(ComponentTestCase):
    def test_ddp(self) -> None:
        self._validate(distributed_components, "ddp")
        self._validate(dist, "ddp")
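
The same pattern should also work for validating user-defined components (a sketch,
assuming ``ComponentTestCase._validate`` takes a module and the name of a component
function, as in the test above)::

    import my_components  # hypothetical module containing an `echo` component


    class MyComponentTest(ComponentTestCase):
        def test_echo(self) -> None:
            self._validate(my_components, "echo")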