Commit be327b3

add full cosmoflow file set
1 parent 35c77e0 commit be327b3

6 files changed: +663 −15 lines changed


cosmoflow/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# Build image on top of NVidia MXnet image
+# Build image on top of NVidia TF1 image
 ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.12-tf1-py3
 FROM ${FROM_IMAGE_NAME}

cosmoflow/README.md

Lines changed: 24 additions & 14 deletions
@@ -2,9 +2,7 @@
 
 #### Author: Phil Tooley - [phil.tooley@nag.co.uk](mailto:phil.tooley@nag.co.uk)
 
-# ***DRAFT: For Partner Review***
-
-# Tutorial: HPC-Scale AI on AzureML: Training CosmoFlow
+# Tutorial: HPC-Scale AI with NVIDIA GPUs on AzureML: Training CosmoFlow
 
 *In our previous tutorials we have shown you how to [run training workloads on the AzureML
 platform] and [set up an HPC-class high performance filesystem]. Now we will put everything
@@ -19,6 +17,12 @@ CosmoFlow on AzureML using a BeeOND filesystem for storage and demonstrate the [
 speedup](#performance-comparison-beeond-vs-blobfuse) this gives over Azure-blob based Dataset
 storage.
 
+[set up an HPC-class high performance filesystem]: https://www.nag.com/blog/tutorial-beeond-azureml-high-performance-filesystem-hpc-scale-machine-learning-nvidia-gpus
+[adding the BeeOND filesystem to AzureML]: https://www.nag.com/blog/tutorial-beeond-azureml-high-performance-filesystem-hpc-scale-machine-learning-nvidia-gpus
+[BeeOND tutorial]: https://www.nag.com/blog/tutorial-beeond-azureml-high-performance-filesystem-hpc-scale-machine-learning-nvidia-gpus
+[BeeOND filesystem tutorial]: https://www.nag.com/blog/tutorial-beeond-azureml-high-performance-filesystem-hpc-scale-machine-learning-nvidia-gpus
+
+
 ## CosmoFlow - The Model and Dataset
 
 [CosmoFlow](https://arxiv.org/abs/1808.04728) is a scientific machine learning model for
@@ -102,14 +106,15 @@ $ azcopy copy --recursive --include-pattern="*.tfrecord.gz" --content-encoding="
 ./cosmo_data https://${storage_acct}.blob.core.windows.net/${container}?sv=${sas}
 ```
 
-You should substitute your own storage account, container and shared access signature (SAS) when
+You should substitute your own storage account, container and [shared access signature] (SAS) when
 uploading your data with AzCopy. Shared access signatures are a recommended method for
 authenticating to storage accounts from scripts without revealing sensitive credentials such as an
 account key. They provide control of access permissions (read, write, etc.), start and end times
 for allowed access and fine-grained scoping down to the level of individual blobs. The [Azure
 Storage Docs] contain all the information you need to know to create and manage shared access
 signatures.
 
+[shared access signature]: https://docs.microsoft.com/en-us/azure/storage/common/storage-sas-overview
 [Azure Storage Docs]: https://docs.microsoft.com/en-us/azure/storage/common/storage-sas-overview
 
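As an aside (not part of this commit), the same kind of short-lived, read/list-only SAS can also be minted programmatically with the `azure-storage-blob` Python SDK rather than the `az` CLI. A minimal sketch, in which the account name, container name and key are placeholders:

```python
# Sketch only: mint a short-lived, read/list-only SAS for the dataset container with the
# azure-storage-blob SDK. Account, container and key values below are placeholders.
from datetime import datetime, timedelta

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

sas = generate_container_sas(
    account_name="mystorageacct",        # placeholder storage account name
    container_name="cosmoflow-data",     # placeholder container name
    account_key="<account-key>",         # never hard-code or commit a real key
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=1),  # token valid for one hour
)
print(sas)  # append to the container URL as its query string when calling azcopy
```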

@@ -177,8 +182,10 @@ AzureML remains the same. Unlike Mask R-CNN, CosmoFlow does not require install
 package as a result the required Docker file is quite simple and can be used as a basis for any
 TensorFlow based ML workload:
 
+[TensorFlow]: https://www.tensorflow.org/
+
 ```Dockerfile
-# Build image on top of NVidia MXnet image
+# Build image on top of NVidia TF1 image
 ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.12-tf1-py3
 FROM ${FROM_IMAGE_NAME}
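Also as an aside (not part of this commit): before a run can use the image built from this Dockerfile, it has to be made available as an AzureML Environment. The repo's scripts do this via the `create_or_update_environment` helper in `common.py`; a minimal standalone sketch with the `azureml-core` SDK, using a placeholder environment name and a placeholder registry/image tag, might look like:

```python
# Sketch only: register a custom Docker image as an AzureML Environment (azureml-core SDK v1).
# The environment name and image reference are placeholders; the repo's scripts use the
# create_or_update_environment helper from common.py instead.
from azureml.core import Environment, Workspace

ws = Workspace.from_config()  # assumes a local config.json describing the workspace

env = Environment(name="cosmoflow-tf1")
env.docker.base_image = "myregistry.azurecr.io/cosmoflow:latest"  # image built from the Dockerfile above
env.python.user_managed_dependencies = True  # Python dependencies are baked into the image
env.register(workspace=ws)
```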

@@ -313,15 +320,18 @@ minimize both time- and cost-to-solution.
 
 ## Summary
 
-High performance storage is a critical component of modern AI training deployments - without
-sufficient I/O bandwidth to supply them with data the computing power of modern GPUs is wasted.
-Using CosmoFlow as an example we find that a BeeOND high-performance filesystem backed
-implementation is over 5x faster than using Premium Blob and nearly 10x faster than using Hot Blob
-for storage. BeeOND runs on the compute instances and requires no additional cloud resources,
-meaning cost to solution is reduced by the same factor of 5 or 10x. As GPUs become ever more
-powerful, ensuring you have storage that can meet their demands is crucial to deliver best AI
-training performance at best cost.
-
+As GPUs become ever more powerful, ensuring you have the right I/O configuration to meet their data
+demands is crucial to delivering the best AI training performance at the best cost. Using CosmoFlow as an
+example, we demonstrate how a BeeOND high-performance filesystem allows us to fully unlock the
+computational power of the NVIDIA V100 GPUs in our training cluster. With the BeeOND filesystem,
+data bottlenecks are eliminated and training runs 5-10x faster than when using network-attached
+Azure Blob storage. BeeOND runs on the compute instances and requires no additional cloud
+resources. This makes it an extremely cost-effective way to unlock the maximum performance of your
+NVIDIA GPU-enabled clusters on AzureML, and can bring cost savings of up to 90% for multi-GPU and
+multi-node training workloads.
+
+The work demonstrated here was funded by Microsoft in partnership with NVIDIA. The authors would like
+to thank Microsoft and NVIDIA employees for their contributions to this tutorial.
 
 ### Find out more:
 
cosmoflow/beeond_create_cluster.py

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
#!/usr/bin/env python3

import argparse
import subprocess
import sys
from datetime import timedelta, datetime
from time import sleep

from termcolor import cprint

from azureml.core import Experiment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

from common import (
    get_or_create_workspace,
    create_or_update_environment,
    create_or_update_cluster,
)

import sharedconfig


# Load the SSH public key that will be installed on the cluster nodes
with open("clusterkey.pub", "rt") as fh:
    sharedconfig.ssh_key = fh.readline()


def generate_training_opts():
    """Populate common CosmoFlow command line options"""
    opts = ["--output-dir", "./outputs"]
    opts.extend(["--rank-gpu"])
    opts.extend(["--distributed"])
    opts.extend(["--verbose"])
    opts.extend(["--stage-dir", "/data"])

    return opts


def generate_sas():
    """Generate a short-lived sas for dataset download via az cli"""
    exp = (datetime.utcnow() + timedelta(hours=1)).isoformat("T", "minutes")
    # fmt: off
    sas_gen_cmd = [
        "az", "storage", "account", "generate-sas",
        "--account-name", sharedconfig.storage_account,
        "--services", "b",
        "--permissions", "rl",
        "--resource-types", "co",
        "--expiry", exp,
        "--output", "tsv"
    ]
    # fmt: on

    sasres = subprocess.run(sas_gen_cmd, capture_output=True)

    return sasres.stdout.strip()


def main():

    parser = argparse.ArgumentParser(
        description="Create BeeOND enabled cluster"
    )

    parser.add_argument("num_nodes", type=int, help="Number of nodes")
    parser.add_argument(
        "--keep-cluster",
        action="store_true",
        help="Don't autoscale cluster down when idle (after run completed)",
    )

    args = parser.parse_args()

    # Connect to (or create) the AzureML workspace described in sharedconfig
    workspace = get_or_create_workspace(
        sharedconfig.subscription_id,
        sharedconfig.resource_group_name,
        sharedconfig.workspace_name,
        sharedconfig.location,
    )

    # Provision the BeeOND-enabled compute cluster
    try:
        clusterconnector = create_or_update_cluster(
            workspace,
            sharedconfig.cluster_name,
            args.num_nodes,
            sharedconfig.ssh_key,
            sharedconfig.vm_type,
            terminate_on_failure=False,
            use_beeond=True,
        )
    except RuntimeError:
        cprint("Fatal Error - exiting", "red", attrs=["bold"])
        sys.exit(-1)


if __name__ == "__main__":
    main()
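Usage note: assuming the repo's `common.py` and `sharedconfig.py` modules sit alongside this script and a `clusterkey.pub` SSH public key has been generated, a four-node BeeOND-enabled cluster could be brought up with something like `python beeond_create_cluster.py 4 --keep-cluster`.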
Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
#!/usr/bin/env python3

import argparse
import subprocess
import sys
from datetime import timedelta, datetime

from termcolor import cprint

from azureml.core import Experiment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

from common import (
    get_or_create_workspace,
    create_or_update_environment,
    create_or_update_cluster,
)

import sharedconfig

k_runclass = "BeeOND"
k_beeond_map = "/data"


# Load the SSH public key that will be installed on the cluster nodes
with open("clusterkey.pub", "rt") as fh:
    sharedconfig.ssh_key = fh.readline()


def generate_training_opts(sas, beeond_map, stage):
    """Populate common CosmoFlow command line options"""
    opts = ["--output-dir", "./outputs"]
    opts.extend(["--data-dir", beeond_map + "/cosmoflow/cosmoUniverse_2019_05_4parE_tf"])
    opts.extend(["--rank-gpu"])
    opts.extend(["--distributed"])
    opts.extend(["--verbose"])
    opts.extend(["--account", sharedconfig.storage_account])
    opts.extend(["--container", sharedconfig.storage_container])
    opts.extend(["--sas", sas])
    if stage:
        opts.extend(["--beeond-stage"])

    opts.extend(["configs/cosmo_runs_gpu.yaml"])

    return opts


def generate_sas():
    """Generate a short-lived sas for dataset download via az cli"""
    exp = (datetime.utcnow() + timedelta(hours=1)).isoformat("T", "minutes")
    # fmt: off
    sas_gen_cmd = [
        "az", "storage", "account", "generate-sas",
        "--account-name", sharedconfig.storage_account,
        "--services", "b",
        "--permissions", "rl",
        "--resource-types", "co",
        "--expiry", exp + 'Z',
        "--output", "tsv"
    ]
    # fmt: on

    sasres = subprocess.run(sas_gen_cmd, capture_output=True)

    return sasres.stdout.strip()


def main():

    parser = argparse.ArgumentParser(
        description="Submit Cosmoflow to BeeOND enabled cluster"
    )

    parser.add_argument("num_nodes", type=int, help="Number of nodes")
    parser.add_argument("--follow", action="store_true", help="Follow run output")
    parser.add_argument(
        "--keep-cluster",
        action="store_true",
        help="Don't autoscale cluster down when idle (after run completed)",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=sharedconfig.default_epochs,
        help="Number of training iterations",
    )
    parser.add_argument(
        "--keep-failed-cluster", dest="terminate_on_failure", action="store_false"
    )
    parser.add_argument("--skip-staging", action="store_false", dest="stage")

    args = parser.parse_args()

    # Connect to (or create) the AzureML workspace described in sharedconfig
    workspace = get_or_create_workspace(
        sharedconfig.subscription_id,
        sharedconfig.resource_group_name,
        sharedconfig.workspace_name,
        sharedconfig.location,
    )

    # Provision the BeeOND-enabled compute cluster
    try:
        clusterconnector = create_or_update_cluster(
            workspace,
            sharedconfig.cluster_name,
            args.num_nodes,
            sharedconfig.ssh_key,
            sharedconfig.vm_type,
            terminate_on_failure=args.terminate_on_failure,
            use_beeond=True,
        )
    except RuntimeError:
        cprint("Fatal Error - exiting", "red", attrs=["bold"])
        sys.exit(-1)

    # Bind-mount the node-local BeeOND mountpoint into the training container
    docker_args = ["-v", "{}:{}".format(clusterconnector.beeond_mnt, k_beeond_map)]

    # Get and update the AzureML Environment object
    environment = create_or_update_environment(
        workspace, sharedconfig.environment_name, sharedconfig.docker_image, docker_args
    )

    # Get/Create an experiment object
    experiment = Experiment(workspace=workspace, name=sharedconfig.experiment_name)

    # Configure the distributed compute settings
    parallelconfig = MpiConfiguration(
        node_count=args.num_nodes, process_count_per_node=sharedconfig.gpus_per_node
    )

    # Collect arguments to be passed to training script
    script_args = generate_training_opts(
        generate_sas().decode(), k_beeond_map, args.stage
    )

    # Define the configuration for running the training script
    script_conf = ScriptRunConfig(
        source_directory="cosmoflow-benchmark",
        script="train.py",
        compute_target=clusterconnector.cluster,
        environment=environment,
        arguments=script_args,
        distributed_job_config=parallelconfig,
    )

    # We can use these tags to make a note of run parameters (avoids grepping the logs)
    runtags = {
        "class": k_runclass,
        "vmtype": sharedconfig.vm_type,
        "num_nodes": args.num_nodes,
        "ims_per_gpu": sharedconfig.ims_per_gpu,
        "epochs": args.epochs,
    }

    # Submit the run
    run = experiment.submit(config=script_conf, tags=runtags)

    # Can optionally choose to follow the output on the command line
    if args.follow:
        run.wait_for_completion(show_output=True)


if __name__ == "__main__":
    main()
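Usage note: in addition to `common.py`, `sharedconfig.py` and `clusterkey.pub`, this submission script needs an authenticated `az` CLI session (used by `generate_sas`) and the `cosmoflow-benchmark` source directory checked out next to it. A four-node run could then be submitted with something like `python <this script> 4 --follow`, with `--epochs` to override the default epoch count and `--skip-staging` to omit the `--beeond-stage` flag passed to `train.py`.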
