Commit be327b3

add full cosmoflow file set
1 parent 35c77e0 commit be327b3

6 files changed: +663 −15 lines changed


cosmoflow/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# Build image on top of NVidia MXnet image
+# Build image on top of NVidia TF1 image
 ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.12-tf1-py3
 FROM ${FROM_IMAGE_NAME}

cosmoflow/README.md

Lines changed: 24 additions & 14 deletions
@@ -2,9 +2,7 @@
 
 #### Author: Phil Tooley - [phil.tooley@nag.co.uk](mailto:phil.tooley@nag.co.uk)
 
-# ***DRAFT: For Partner Review***
-
-# Tutorial: HPC-Scale AI on AzureML: Training CosmoFlow
+# Tutorial: HPC-Scale AI with NVIDIA GPUs on AzureML: Training CosmoFlow
 
 *In our previous tutorials we have shown you how to [run training workloads on the AzureML
 platform] and [set up an HPC-class high performance filesystem]. Now we will put everything
@@ -19,6 +17,12 @@ CosmoFlow on AzureML using a BeeOND filesystem for storage and demonstrate the [
 speedup](#performance-comparison-beeond-vs-blobfuse) this gives over Azure-blob based Dataset
 storage.
 
+[set up an HPC-class high performance filesystem]: https://www.nag.com/blog/tutorial-beeond-azureml-high-performance-filesystem-hpc-scale-machine-learning-nvidia-gpus
+[adding the BeeOND filesystem to AzureML]: https://www.nag.com/blog/tutorial-beeond-azureml-high-performance-filesystem-hpc-scale-machine-learning-nvidia-gpus
+[BeeOND tutorial]: https://www.nag.com/blog/tutorial-beeond-azureml-high-performance-filesystem-hpc-scale-machine-learning-nvidia-gpus
+[BeeOND filesystem tutorial]: https://www.nag.com/blog/tutorial-beeond-azureml-high-performance-filesystem-hpc-scale-machine-learning-nvidia-gpus
+
+
 ## CosmoFlow - The Model and Dataset
 
 [CosmoFlow](https://arxiv.org/abs/1808.04728) is a scientific machine learning model for
@@ -102,14 +106,15 @@ $ azcopy copy --recursive --include-pattern="*.tfrecord.gz" --content-encoding="
 ./cosmo_data https://${storage_acct}.blob.core.windows.net/${container}?sv=${sas}
 ```
 
-You should substitute your own storage account, container and shared access signature (SAS) when
+You should substitute your own storage account, container and [shared access signature] (SAS) when
 uploading your data with AzCopy. Shared access signatures are a recommended method for
 authenticating to storage accounts from scripts without revealing sensitive credentials such as an
 account key. They provide control of access permissions (read, write, etc.), start and end times
 for allowed access and fine-grained scoping down to the level of individual blobs. The [Azure
 Storage Docs] contain all the information you need to know to create and manage shared access
 signatures.
 
+[shared access signature]: https://docs.microsoft.com/en-us/azure/storage/common/storage-sas-overview
 [Azure Storage Docs]: https://docs.microsoft.com/en-us/azure/storage/common/storage-sas-overview
 
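As an aside (not part of this commit), the same kind of short-lived, read/list-only SAS can also be minted programmatically with the `azure-storage-blob` Python SDK rather than the `az` CLI. A minimal sketch, in which the account name, container name and key are placeholders:

```python
# Sketch only: mint a short-lived, read/list-only SAS for the dataset container with the
# azure-storage-blob SDK. Account, container and key values below are placeholders.
from datetime import datetime, timedelta

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

sas = generate_container_sas(
    account_name="mystorageacct",        # placeholder storage account name
    container_name="cosmoflow-data",     # placeholder container name
    account_key="<account-key>",         # never hard-code or commit a real key
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=1),  # token valid for one hour
)
print(sas)  # append to the container URL as its query string when calling azcopy
```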

@@ -177,8 +182,10 @@ AzureML remains the same. Unlike Mask R-CNN, CosmoFlow does not require install
 package as a result the required Docker file is quite simple and can be used as a basis for any
 TensorFlow based ML workload:
 
+[TensorFlow]: https://www.tensorflow.org/
+
 ```Dockerfile
-# Build image on top of NVidia MXnet image
+# Build image on top of NVidia TF1 image
 ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.12-tf1-py3
 FROM ${FROM_IMAGE_NAME}
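Also as an aside (not part of this commit): before a run can use the image built from this Dockerfile, it has to be made available as an AzureML Environment. The repo's scripts do this via the `create_or_update_environment` helper in `common.py`; a minimal standalone sketch with the `azureml-core` SDK, using a placeholder environment name and a placeholder registry/image tag, might look like:

```python
# Sketch only: register a custom Docker image as an AzureML Environment (azureml-core SDK v1).
# The environment name and image reference are placeholders; the repo's scripts use the
# create_or_update_environment helper from common.py instead.
from azureml.core import Environment, Workspace

ws = Workspace.from_config()  # assumes a local config.json describing the workspace

env = Environment(name="cosmoflow-tf1")
env.docker.base_image = "myregistry.azurecr.io/cosmoflow:latest"  # image built from the Dockerfile above
env.python.user_managed_dependencies = True  # Python dependencies are baked into the image
env.register(workspace=ws)
```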

@@ -313,15 +320,18 @@ minimize both time- and cost-to-solution.
 
 ## Summary
 
-High performance storage is a critical component of modern AI training deployments - without
-sufficient I/O bandwidth to supply them with data the computing power of modern GPUs is wasted.
-Using CosmoFlow as an example we find that a BeeOND high-performance filesystem backed
-implementation is over 5x faster than using Premium Blob and nearly 10x faster than using Hot Blob
-for storage. BeeOND runs on the compute instances and requires no additional cloud resources,
-meaning cost to solution is reduced by the same factor of 5 or 10x. As GPUs become ever more
-powerful, ensuring you have storage that can meet their demands is crucial to deliver best AI
-training performance at best cost.
-
+As GPUs become ever more powerful, ensuring you have the right I/O configuration to meet their data
+demands is crucial to delivering the best AI training performance at the best cost. Using CosmoFlow as an
+example, we demonstrate how a BeeOND high-performance filesystem allows us to fully unlock the
+computational power of the NVIDIA V100 GPUs in our training cluster. With the BeeOND filesystem,
+data bottlenecks are eliminated and training runs 5-10x faster than when using network-attached
+Azure Blob storage. BeeOND runs on the compute instances and requires no additional cloud
+resources. This makes it an extremely cost-effective way to unlock the maximum performance of your
+NVIDIA GPU-enabled clusters on AzureML, and can bring cost savings of up to 90% for multi-GPU and
+multi-node training workloads.
+
+The work demonstrated here was funded by Microsoft in partnership with NVIDIA. The authors would like
+to thank Microsoft and NVIDIA employees for their contributions to this tutorial.
 
 ### Find out more:
 
cosmoflow/beeond_create_cluster.py

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
#!/usr/bin/env python3

import argparse
import subprocess
import sys
from datetime import timedelta, datetime
from time import sleep

from termcolor import cprint

from azureml.core import Experiment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

from common import (
    get_or_create_workspace,
    create_or_update_environment,
    create_or_update_cluster,
)

import sharedconfig


# Load the SSH public key that will be installed on the cluster nodes
with open("clusterkey.pub", "rt") as fh:
    sharedconfig.ssh_key = fh.readline()


def generate_training_opts():
    """Populate common CosmoFlow command line options"""
    opts = ["--output-dir", "./outputs"]
    opts.extend(["--rank-gpu"])
    opts.extend(["--distributed"])
    opts.extend(["--verbose"])
    opts.extend(["--stage-dir", "/data"])

    return opts


def generate_sas():
    """Generate a short-lived sas for dataset download via az cli"""
    exp = (datetime.utcnow() + timedelta(hours=1)).isoformat("T", "minutes")
    # fmt: off
    sas_gen_cmd = [
        "az", "storage", "account", "generate-sas",
        "--account-name", sharedconfig.storage_account,
        "--services", "b",
        "--permissions", "rl",
        "--resource-types", "co",
        "--expiry", exp,
        "--output", "tsv"
    ]
    # fmt: on

    sasres = subprocess.run(sas_gen_cmd, capture_output=True)

    return sasres.stdout.strip()


def main():

    parser = argparse.ArgumentParser(
        description="Create BeeOND enabled cluster"
    )

    parser.add_argument("num_nodes", type=int, help="Number of nodes")
    parser.add_argument(
        "--keep-cluster",
        action="store_true",
        help="Don't autoscale cluster down when idle (after run completed)",
    )

    args = parser.parse_args()

    # Connect to (or create) the AzureML workspace described in sharedconfig
    workspace = get_or_create_workspace(
        sharedconfig.subscription_id,
        sharedconfig.resource_group_name,
        sharedconfig.workspace_name,
        sharedconfig.location,
    )

    # Provision the BeeOND-enabled compute cluster
    try:
        clusterconnector = create_or_update_cluster(
            workspace,
            sharedconfig.cluster_name,
            args.num_nodes,
            sharedconfig.ssh_key,
            sharedconfig.vm_type,
            terminate_on_failure=False,
            use_beeond=True,
        )
    except RuntimeError:
        cprint("Fatal Error - exiting", "red", attrs=["bold"])
        sys.exit(-1)


if __name__ == "__main__":
    main()
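Usage note: assuming the repo's `common.py` and `sharedconfig.py` modules sit alongside this script and a `clusterkey.pub` SSH public key has been generated, a four-node BeeOND-enabled cluster could be brought up with something like `python beeond_create_cluster.py 4 --keep-cluster`.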
Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
#!/usr/bin/env python3

import argparse
import subprocess
import sys
from datetime import timedelta, datetime

from termcolor import cprint

from azureml.core import Experiment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

from common import (
    get_or_create_workspace,
    create_or_update_environment,
    create_or_update_cluster,
)

import sharedconfig

k_runclass = "BeeOND"
k_beeond_map = "/data"


# Load the SSH public key that will be installed on the cluster nodes
with open("clusterkey.pub", "rt") as fh:
    sharedconfig.ssh_key = fh.readline()


def generate_training_opts(sas, beeond_map, stage):
    """Populate common CosmoFlow command line options"""
    opts = ["--output-dir", "./outputs"]
    opts.extend(["--data-dir", beeond_map + "/cosmoflow/cosmoUniverse_2019_05_4parE_tf"])
    opts.extend(["--rank-gpu"])
    opts.extend(["--distributed"])
    opts.extend(["--verbose"])
    opts.extend(["--account", sharedconfig.storage_account])
    opts.extend(["--container", sharedconfig.storage_container])
    opts.extend(["--sas", sas])
    if stage:
        opts.extend(["--beeond-stage"])

    opts.extend(["configs/cosmo_runs_gpu.yaml"])

    return opts


def generate_sas():
    """Generate a short-lived sas for dataset download via az cli"""
    exp = (datetime.utcnow() + timedelta(hours=1)).isoformat("T", "minutes")
    # fmt: off
    sas_gen_cmd = [
        "az", "storage", "account", "generate-sas",
        "--account-name", sharedconfig.storage_account,
        "--services", "b",
        "--permissions", "rl",
        "--resource-types", "co",
        "--expiry", exp + 'Z',
        "--output", "tsv"
    ]
    # fmt: on

    sasres = subprocess.run(sas_gen_cmd, capture_output=True)

    return sasres.stdout.strip()


def main():

    parser = argparse.ArgumentParser(
        description="Submit Cosmoflow to BeeOND enabled cluster"
    )

    parser.add_argument("num_nodes", type=int, help="Number of nodes")
    parser.add_argument("--follow", action="store_true", help="Follow run output")
    parser.add_argument(
        "--keep-cluster",
        action="store_true",
        help="Don't autoscale cluster down when idle (after run completed)",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=sharedconfig.default_epochs,
        help="Number of training iterations",
    )
    parser.add_argument(
        "--keep-failed-cluster", dest="terminate_on_failure", action="store_false"
    )
    parser.add_argument("--skip-staging", action="store_false", dest="stage")

    args = parser.parse_args()

    # Connect to (or create) the AzureML workspace described in sharedconfig
    workspace = get_or_create_workspace(
        sharedconfig.subscription_id,
        sharedconfig.resource_group_name,
        sharedconfig.workspace_name,
        sharedconfig.location,
    )

    # Provision the BeeOND-enabled compute cluster
    try:
        clusterconnector = create_or_update_cluster(
            workspace,
            sharedconfig.cluster_name,
            args.num_nodes,
            sharedconfig.ssh_key,
            sharedconfig.vm_type,
            terminate_on_failure=args.terminate_on_failure,
            use_beeond=True,
        )
    except RuntimeError:
        cprint("Fatal Error - exiting", "red", attrs=["bold"])
        sys.exit(-1)

    # Bind-mount the node-local BeeOND mountpoint into the training container
    docker_args = ["-v", "{}:{}".format(clusterconnector.beeond_mnt, k_beeond_map)]

    # Get and update the AzureML Environment object
    environment = create_or_update_environment(
        workspace, sharedconfig.environment_name, sharedconfig.docker_image, docker_args
    )

    # Get/Create an experiment object
    experiment = Experiment(workspace=workspace, name=sharedconfig.experiment_name)

    # Configure the distributed compute settings
    parallelconfig = MpiConfiguration(
        node_count=args.num_nodes, process_count_per_node=sharedconfig.gpus_per_node
    )

    # Collect arguments to be passed to training script
    script_args = generate_training_opts(
        generate_sas().decode(), k_beeond_map, args.stage
    )

    # Define the configuration for running the training script
    script_conf = ScriptRunConfig(
        source_directory="cosmoflow-benchmark",
        script="train.py",
        compute_target=clusterconnector.cluster,
        environment=environment,
        arguments=script_args,
        distributed_job_config=parallelconfig,
    )

    # We can use these tags to make a note of run parameters (avoids grepping the logs)
    runtags = {
        "class": k_runclass,
        "vmtype": sharedconfig.vm_type,
        "num_nodes": args.num_nodes,
        "ims_per_gpu": sharedconfig.ims_per_gpu,
        "epochs": args.epochs,
    }

    # Submit the run
    run = experiment.submit(config=script_conf, tags=runtags)

    # Can optionally choose to follow the output on the command line
    if args.follow:
        run.wait_for_completion(show_output=True)


if __name__ == "__main__":
    main()
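Usage note: in addition to `common.py`, `sharedconfig.py` and `clusterkey.pub`, this submission script needs an authenticated `az` CLI session (used by `generate_sas`) and the `cosmoflow-benchmark` source directory checked out next to it. A four-node run could then be submitted with something like `python <this script> 4 --follow`, with `--epochs` to override the default epoch count and `--skip-staging` to omit the `--beeond-stage` flag passed to `train.py`.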
