From 8cac3864ad1e9e071b551bad27a9c4517507cffc Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Mon, 9 Oct 2023 18:23:01 +0000
Subject: [PATCH 01/21] add disclaimer for double quotes

---
 README.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 1be096c2e..3ad226414 100644
--- a/README.md
+++ b/README.md
@@ -126,7 +126,15 @@ To use the Docker container as an interactive virtual environment, you can run a
       <docker_image_name> \
       --keep_container_alive true
    ```
-2. Open a bash terminal
+   Note: You may have to use double quotes around `algorithmic-efficiency` in the mounting `-v` flag. If the above command fails try replacing the following line:
+   ```bash
+   -v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
+   ``` 
+   with 
+   ```
+   -v $HOME"/algorithmic-efficiency:/algorithmic-efficiency" \
+   ```
+2. Open a bash terminal
    ```bash
    docker exec -it <container_id> /bin/bash
    ```

From 8482b6a294ef1f9025f399c45c5fc3a8aa75cbd5 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Mon, 9 Oct 2023 23:00:58 +0000
Subject: [PATCH 02/21] init change log

---
 CHANGELOG.md | 3 +++
 1 file changed, 3 insertions(+)
 create mode 100644 CHANGELOG.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 000000000..b69b92969
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,3 @@
+# Change log
+
+# TODO: algorithmic-efficiency 0.1.0
\ No newline at end of file

From 18a24575858b205ad8d564f295eb558f8efed2f2 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Mon, 9 Oct 2023 23:20:31 +0000
Subject: [PATCH 03/21] add some faqs

---
 README.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/README.md b/README.md
index 3ad226414..2dea5d39d 100644
--- a/README.md
+++ b/README.md
@@ -235,3 +235,28 @@ The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT
 
 Since we use PyTorch's [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details.
 While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example.
+
+# FAQS
+## Setup 
+### Why do I get a warning that GPU is not found?
+If running with pytorch, we intentionally hide the GPUs from jax. So please disregard the following warning:
+```
+W1003 ... xla_bridge.py:463] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
+```
+
+## Platform
+### My machine only has one GPU. How can I use this repo?
+You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines/` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. To solve this
+please reduce the batchsizes for the submission.
+
+### How do I run this on my SLURM cluster?
+You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker).
+### How can I run this on my AWS/GCP/Azure cloud project?
+Yes you can use this repo on your cloud project. As noted above we recommend
+Docker or Apptainer to ensure a similar environment as our scoring environment.
+## Submissions
+### Can submission be structured using multiple files?
+
+### How can I install custom dependencies?
+### How can I know if my code can be run on benchmarking hardware?
+### Are we allowed to use our own hardware to self-report the results?
\ No newline at end of file

From 2f5069193e12d2fe310e6ad578e7b6edf04aed0b Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Mon, 9 Oct 2023 23:22:55 +0000
Subject: [PATCH 04/21] edit

---
 README.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/README.md b/README.md
index 2dea5d39d..6c643f7fe 100644
--- a/README.md
+++ b/README.md
@@ -256,7 +256,6 @@ Yes you can use this repo on your cloud project. As noted above we recommend
 Docker or Apptainer to ensure a similar environment as our scoring environment.
 ## Submissions
 ### Can submission be structured using multiple files?
-
 ### How can I install custom dependencies?
 ### How can I know if my code can be run on benchmarking hardware?
 ### Are we allowed to use our own hardware to self-report the results?
\ No newline at end of file

From 07c1d100e19435347802c71fa69a913cac4c4c0c Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Wed, 11 Oct 2023 22:47:24 +0000
Subject: [PATCH 05/21] add some more faqs

---
 README.md          | 18 ++++++++++++++----
 getting_started.md |  2 +-
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 6c643f7fe..444bf2827 100644
--- a/README.md
+++ b/README.md
@@ -252,10 +252,20 @@ please reduce the batchsizes for the submission.
 ### How do I run this on my SLURM cluster?
 You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker).
 ### How can I run this on my AWS/GCP/Azure cloud project?
-Yes you can use this repo on your cloud project. As noted above we recommend
-Docker or Apptainer to ensure a similar environment as our scoring environment.
+Below instructions are for GCP. Depending on your virtual machine, you may have to install the correct GPU drivers and the NVIDIA Docker toolkit.
+1. If you don't have a VM instance yet, we recommend using the "Deep Learning on Linux" Image in Boot disk options. 
+2. To install the NVIDIA Docker toolkit, you can use `scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit.
+
 ## Submissions
 ### Can submission be structured using multiple files?
-### How can I install custom dependencies?
+Yes, your submission can be structured using multiple files. 
+### Can I install custom dependencies?
+You may use custom dependencies as long as they do not conflict with any of the pinned packages in setup.cfg. 
+To include your custom dependencies in your submission, please include them in a requirements.txt file. 
 ### How can I know if my code can be run on benchmarking hardware?
-### Are we allowed to use our own hardware to self-report the results?
\ No newline at end of file
+The benchmarking hardware specifications are documented in the [Getting Started Document](./getting_started.md).
+Please monitor your submission's memory usage so that it does not exceed the available memory 
+on the competition hardware. 
+### Are we allowed to use our own hardware to self-report the results?
+No. However you are allowed to use your own hardware to report the best hyperparameter point to qualify for 
+a compute sponsorship offering a free evaluation on the full benchmark set, see [Rules](./RULES.md#qualification-set)
\ No newline at end of file
diff --git a/getting_started.md b/getting_started.md
index 2942e632b..96e58edab 100644
--- a/getting_started.md
+++ b/getting_started.md
@@ -13,7 +13,7 @@ To get started you will have to make a few decisions and install the repository
 1. Decide if you would like to develop your submission in either Pytorch or Jax.
 2. Set up your workstation or VM. We recommend to use a setup similar to the [benchmarking hardware](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#benchmarking-hardware). 
 The specs on the benchmarking machines are:
-    -  8 V100 GPUs
+    - 8 V100 GPUs
     - 240 GB in RAM
     - 2 TB in storage (for datasets). 
 3. Install the algorithmic package and dependencies, see [Installation](./README.md#installation).

From 2c85991b4dda059dad436b0f13b59e880c6cba97 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Wed, 11 Oct 2023 23:15:53 +0000
Subject: [PATCH 06/21] add disclaimer on conformer pytorch workload

---
 CHANGELOG.md |  8 +++++++-
 README.md    | 11 ++++++++++-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index b69b92969..76dfa2b9a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,3 +1,9 @@
 # Change log
 
-# TODO: algorithmic-efficiency 0.1.0
\ No newline at end of file
+# TODO: algorithmic-efficiency 0.1.0
+First release of AlgoPerf benchmarking code. 
+Disclaimer: The Conformer Pytorch workload has memory fragmentation issue after upgrading to 
+Pytorch 2.0.1. To circumvent this issues we have tuned the pytorch memory allocation configuration,
+which slows down the workload by a factor of 2x. For submitters, this means that the Conformer Pytorch 
+submission times will be about 2x compared to an identical jax submissions. 
+Tracking issue here: see issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
\ No newline at end of file
diff --git a/README.md b/README.md
index 444bf2827..0d1953ff7 100644
--- a/README.md
+++ b/README.md
@@ -229,13 +229,22 @@ The rules for the MLCommons Algorithmic Efficency benchmark can be found in the
 If you are interested in contributing to the work of the working group, feel free to [join the weekly meetings](https://mlcommons.org/en/groups/research-algorithms/), open issues. See our [CONTRIBUTING.md](CONTRIBUTING.md) for MLCommons contributing guidelines and setup and workflow instructions.
 
 
-# Note on shared data pipelines between JAX and PyTorch
+# Disclaimers
+
+# Shared data pipelines between JAX and PyTorch
 
 The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT workloads are using the same TensorFlow input pipelines. Due to differences in how Jax and PyTorch distribute computations across devices, the PyTorch workloads have an additional overhead for these workloads.
 
 Since we use PyTorch's [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details.
 While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example.
 
+# Conformer workload 2x slower in Pytorch vs Jax
+The Conformer Pytorch workload has memory fragmentation issue after upgrading to 
+Pytorch 2.0.1, which led to out of memory errors. To circumvent this issue we have tuned the pytorch 
+memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this 
+means that the Conformer Pytorch submission times will be roughly 2x compared to an identical jax submissions. 
+Tracking issue here: see issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
+
 # FAQS
 ## Setup 
 ### Why do I get a warning that GPU is not found?

From f6ba321b95cbcad6e068000a0f0667f80243ee5b Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Wed, 11 Oct 2023 23:20:12 +0000
Subject: [PATCH 07/21] update readme

---
 CHANGELOG.md | 10 ++++++----
 README.md    | 12 +++++++++---
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 76dfa2b9a..2a06c043f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,9 +1,11 @@
 # Change log
 
 # TODO: algorithmic-efficiency 0.1.0
-First release of AlgoPerf benchmarking code. 
-Disclaimer: The Conformer Pytorch workload has memory fragmentation issue after upgrading to 
+First release of AlgoPerf benchmarking code.
+
+**Disclaimer**: The Conformer Pytorch workload has memory fragmentation issue after upgrading to 
 Pytorch 2.0.1. To circumvent this issues we have tuned the pytorch memory allocation configuration,
 which slows down the workload by a factor of 2x. For submitters, this means that the Conformer Pytorch 
-submission times will be about 2x compared to an identical jax submissions. 
-Tracking issue here: see issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
\ No newline at end of file
+submission times will be about 2x compared to an identical jax submission. 
+
+Tracking issue: [issue/497](https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
\ No newline at end of file
diff --git a/README.md b/README.md
index 0d1953ff7..e849bfca0 100644
--- a/README.md
+++ b/README.md
@@ -29,9 +29,12 @@
 - [Getting Started](#getting-started)
 - [Rules](#rules)
 - [Contributing](#contributing)
+- [Disclaimers](#disclaimers)
+- [FAQS](#faqs)
 - [Citing AlgoPerf Benchmark](#citing-algoperf-benchmark)
 
 
+
 ## Installation
 You can install this package and dependences in a [python virtual environment](#virtual-environment) or use a [Docker/Singularity/Apptainer container](#install-in-docker) (recommended).
 
@@ -231,14 +234,14 @@ If you are interested in contributing to the work of the working group, feel fre
 
 # Disclaimers
 
-# Shared data pipelines between JAX and PyTorch
+## Shared data pipelines between JAX and PyTorch
 
 The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT workloads are using the same TensorFlow input pipelines. Due to differences in how Jax and PyTorch distribute computations across devices, the PyTorch workloads have an additional overhead for these workloads.
 
 Since we use PyTorch's [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details.
 While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example.
 
-# Conformer workload 2x slower in Pytorch vs Jax
+## Conformer workload 2x slower in Pytorch vs Jax
 The Conformer Pytorch workload has memory fragmentation issue after upgrading to 
 Pytorch 2.0.1, which led to out of memory errors. To circumvent this issue we have tuned the pytorch 
 memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this 
@@ -277,4 +280,7 @@ Please monitor your submission's memory usage so that it does not exceed the ava
 on the competition hardware. 
 ### Are we allowed to use our own hardware to self-report the results?
 No. However you are allowed to use your own hardware to report the best hyperparameter point to qualify for 
-a compute sponsorship offering a free evaluation on the full benchmark set, see [Rules](./RULES.md#qualification-set)
\ No newline at end of file
+a compute sponsorship offering a free evaluation on the full benchmark set, see [Rules](./RULES.md#qualification-set)
+
+# Citing AlgoPerf Benchmark
+Todo: how to cite the algoperf benchmark?
\ No newline at end of file

From cc953cf005e91aa83ae2ec7d1a8bf4bb932fde22 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Wed, 11 Oct 2023 23:37:07 +0000
Subject: [PATCH 08/21] typo

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e849bfca0..cb3c09c83 100644
--- a/README.md
+++ b/README.md
@@ -280,7 +280,7 @@ Please monitor your submission's memory usage so that it does not exceed the ava
 on the competition hardware. 
 ### Are we allowed to use our own hardware to self-report the results?
 No. However you are allowed to use your own hardware to report the best hyperparameter point to qualify for 
-a compute sponsorship offering a free evaluation on the full benchmark set, see [Rules](./RULES.md#qualification-set)
+a compute sponsorship offering a free evaluation on the full benchmark set, see the [Rules](./RULES.md#qualification-set).
 
 # Citing AlgoPerf Benchmark
 Todo: how to cite the algoperf benchmark?
\ No newline at end of file

From 3760f28a2764fdc6aba5fbbce4f741390d85cf69 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Wed, 11 Oct 2023 23:37:43 +0000
Subject: [PATCH 09/21] format

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index cb3c09c83..99f159a03 100644
--- a/README.md
+++ b/README.md
@@ -272,7 +272,7 @@ Below instructions are for GCP. Depending on your virtual machine, you may have
 ### Can submission be structured using multiple files?
 Yes, your submission can be structured using multiple files. 
 ### Can I install custom dependencies?
-You may use custom dependencies as long as they do not conflict with any of the pinned packages in setup.cfg. 
+You may use custom dependencies as long as they do not conflict with any of the pinned packages in `algorithmic-efficiency/setup.cfg`. 
 To include your custom dependencies in your submission, please include them in a requirements.txt file. 
 ### How can I know if my code can be run on benchmarking hardware?
 The benchmarking hardware specifications are documented in the [Getting Started Document](./getting_started.md).

From b7bf1ca2b289a8cc942d9460a0328a97a1cceedc Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Wed, 11 Oct 2023 23:38:37 +0000
Subject: [PATCH 10/21] formatting

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 99f159a03..c7f34f5bf 100644
--- a/README.md
+++ b/README.md
@@ -258,7 +258,7 @@ W1003 ... xla_bridge.py:463] No GPU/TPU found, falling back to CPU. (Set TF_CPP_
 
 ## Platform
 ### My machine only has one GPU. How can I use this repo?
-You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines/` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. To solve this
+You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. To solve this
 please reduce the batchsizes for the submission.
 
 ### How do I run this on my SLURM cluster?

From 03c48230fea525b806295cb49ba1d08dbededba4 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Wed, 11 Oct 2023 23:38:59 +0000
Subject: [PATCH 11/21] clarify

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index c7f34f5bf..9af8fa445 100644
--- a/README.md
+++ b/README.md
@@ -258,7 +258,7 @@ W1003 ... xla_bridge.py:463] No GPU/TPU found, falling back to CPU. (Set TF_CPP_
 
 ## Platform
 ### My machine only has one GPU. How can I use this repo?
-You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. To solve this
+You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. To solve this
 please reduce the batchsizes for the submission.
 
 ### How do I run this on my SLURM cluster?

From f32444c05b1e99ba5a8f7d875329bf1a0dca30cd Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Wed, 11 Oct 2023 23:42:53 +0000
Subject: [PATCH 12/21] fix

---
 README.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 9af8fa445..ae81aba64 100644
--- a/README.md
+++ b/README.md
@@ -258,8 +258,10 @@ W1003 ... xla_bridge.py:463] No GPU/TPU found, falling back to CPU. (Set TF_CPP_
 
 ## Platform
 ### My machine only has one GPU. How can I use this repo?
-You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. To solve this
-please reduce the batchsizes for the submission.
+You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batchsizes for the submission. Note that your final submission must 'fit'
+on the benchmarking hardware, so if you are using fewer
+GPUs with higher per-GPU memory, please monitor your memory usage 
+to make sure it will fit on 8 16GB V100 GPUs.
 
 ### How do I run this on my SLURM cluster?
 You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker).

From e6ef936e93447124c444248f3381c72d01d03059 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Wed, 11 Oct 2023 23:52:22 +0000
Subject: [PATCH 13/21] gcp instructions

---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index ae81aba64..348e963f4 100644
--- a/README.md
+++ b/README.md
@@ -266,8 +266,9 @@ to make sure it will fit on 8 16GB V100 GPUs.
 ### How do I run this on my SLURM cluster?
 You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker).
 ### How can I run this on my AWS/GCP/Azure cloud project?
-Below instructions are for GCP. Depending on your virtual machine, you may have to install the correct GPU drivers and the NVIDIA Docker toolkit.
-1. If you don't have a VM instance yet, we recommend using the "Deep Learning on Linux" Image in Boot disk options. 
+ Depending on your virtual machine, you may have to install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you can:
+1. If you don't have a VM instance yet, we recommend creating a
+new Compute Instance with the "Deep Learning on Linux" Image in Boot disk options. 
 2. To install the NVIDIA Docker toolkit, you can use `scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit.
 
 ## Submissions

From 6d2ba7e2ed3560fb7e7bc97af51d54ba79fa5ab2 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Wed, 11 Oct 2023 23:54:27 +0000
Subject: [PATCH 14/21] typo

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 348e963f4..990d094bf 100644
--- a/README.md
+++ b/README.md
@@ -245,8 +245,8 @@ While this issue might not affect all setups, we currently implement a different
 The Conformer Pytorch workload has memory fragmentation issue after upgrading to 
 Pytorch 2.0.1, which led to out of memory errors. To circumvent this issue we have tuned the pytorch 
 memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this 
-means that the Conformer Pytorch submission times will be roughly 2x compared to an identical jax submissions. 
-Tracking issue here: see issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
+means that the Conformer Pytorch submission times will be roughly 2x compared to an identical jax submission. 
+Tracking in [issue/497](https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
 
 # FAQS
 ## Setup 
@@ -266,7 +266,7 @@ to make sure it will fit on 8 16GB V100 GPUs.
 ### How do I run this on my SLURM cluster?
 You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker).
 ### How can I run this on my AWS/GCP/Azure cloud project?
- Depending on your virtual machine, you may have to install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you can:
+ Depending on your virtual machine, you may have to install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you will have to do the following.
 1. If you don't have a VM instance yet, we recommend creating a
 new Compute Instance with the "Deep Learning on Linux" Image in Boot disk options. 
 2. To install the NVIDIA Docker toolkit, you can use `scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit.

From 687c9a28fa32d9cc3aea98f53bdf4c09bbe7c49b Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Thu, 12 Oct 2023 19:36:51 +0000
Subject: [PATCH 15/21] update

---
 README.md | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 990d094bf..adea6b574 100644
--- a/README.md
+++ b/README.md
@@ -249,14 +249,8 @@ means that the Conformer Pytorch submission times will be roughly 2x compared to
 Tracking in [issue/497](https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
 
 # FAQS
-## Setup 
-### Why do I get a warning that GPU is not found?
-If running with pytorch, we intentionally hide the GPUs from jax. So please disregard the following warning:
-```
-W1003 ... xla_bridge.py:463] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
-```
 
-## Platform
+## Setup and Platform
 ### My machine only has one GPU. How can I use this repo?
 You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batchsizes for the submission. Note that your final submission must 'fit'
 on the benchmarking hardware, so if you are using fewer

From 9d7e05db67a1dd392ee6546885dd21e13131be9f Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Thu, 12 Oct 2023 20:09:30 +0000
Subject: [PATCH 16/21] update conformer

---
 CHANGELOG.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 2a06c043f..ea3c7c046 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,6 @@ First release of AlgoPerf benchmarking code.
 **Disclaimer**: The Conformer Pytorch workload has memory fragmentation issue after upgrading to 
 Pytorch 2.0.1. To circumvent this issues we have tuned the pytorch memory allocation configuration,
 which slows down the workload by a factor of 2x. For submitters, this means that the Conformer Pytorch 
-submission times will be about 2x compared to an identical jax submission. 
+submission times will be about 2x slower. 
 
 Tracking issue: [issue/497](https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
\ No newline at end of file

From f1e3bb09d31bed4391efd6600c11a0a93bba51d1 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Thu, 12 Oct 2023 20:13:18 +0000
Subject: [PATCH 17/21] update

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index adea6b574..5a0f146a8 100644
--- a/README.md
+++ b/README.md
@@ -241,11 +241,11 @@ The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT
 Since we use PyTorch's [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details.
 While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example.
 
-## Conformer workload 2x slower in Pytorch vs Jax
+## Conformer Pytorch OOM 
 The Conformer Pytorch workload has memory fragmentation issue after upgrading to 
 Pytorch 2.0.1, which led to out of memory errors. To circumvent this issues we have tuned the pytorch 
 memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this 
-means that the Conformer Pytorch submission times will be roughly 2x compared to an identical jax submission. 
+means that the Conformer Pytorch submission times will be roughly 2x slower. 
 Tracking in issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
 
 # FAQS

From 442a152e2dd71a2ce0135aed2841b1ec33d5fd09 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Fri, 13 Oct 2023 18:59:33 +0000
Subject: [PATCH 18/21] fix

---
 CHANGELOG.md | 7 ++++---
 README.md    | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index ea3c7c046..c3112b9d7 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,8 +4,9 @@
 First release of AlgoPerf benchmarking code.
 
 **Disclaimer**: The Conformer Pytorch workload has memory fragmentation issue after upgrading to 
-Pytorch 2.0.1. To circumvent this issues we have tuned the pytorch memory allocation configuration,
-which slows down the workload by a factor of 2x. For submitters, this means that the Conformer Pytorch 
-submission times will be about 2x slower. 
+Pytorch 2.0.1, which led to out of memory errors. To circumvent this issue we have tuned the pytorch 
+memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this 
+means that the Conformer Pytorch submission times will be roughly 2x slower. 
+Tracking in issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497).  
 
 Tracking issue: [issue/497](https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
\ No newline at end of file
diff --git a/README.md b/README.md
index 5a0f146a8..3dce37924 100644
--- a/README.md
+++ b/README.md
@@ -243,7 +243,7 @@ While this issue might not affect all setups, we currently implement a different
 
 ## Conformer Pytorch OOM 
 The Conformer Pytorch workload has memory fragmentation issue after upgrading to 
-Pytorch 2.0.1, which led to out of memory errors. To circumvent this issues we have tuned the pytorch 
+Pytorch 2.0.1, which led to out of memory errors. To circumvent this issue we have tuned the pytorch 
 memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this 
 means that the Conformer Pytorch submission times will be roughly 2x slower. 
 Tracking in issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). 

From b2af85a0307b798f165c042f9b96cfcbda1d8b28 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Fri, 20 Oct 2023 22:58:00 +0000
Subject: [PATCH 19/21] readme fixes

---
 CHANGELOG.md | 11 ++---------
 README.md    | 37 ++++++++++++++++++++++---------------
 2 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index c3112b9d7..b71e42e01 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,12 +1,5 @@
 # Change log
 
-# TODO: algorithmic-efficiency 0.1.0
-First release of AlgoPerf benchmarking code.
-
-**Disclaimer**: The Conformer Pytorch workload has memory fragmentation issue after upgrading to 
-Pytorch 2.0.1, which led to out of memory errors. To circumvent this issue we have tuned the pytorch 
-memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this 
-means that the Conformer Pytorch submission times will be roughly 2x slower. 
-Tracking in issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497).  
+## TODO: algorithmic-efficiency 0.1.0
 
-Tracking issue: [issue/497](https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
\ No newline at end of file
+First release of AlgoPerf benchmarking code.
diff --git a/README.md b/README.md
index 3dce37924..9803f38be 100644
--- a/README.md
+++ b/README.md
@@ -129,7 +129,7 @@ To use the Docker container as an interactive virtual environment, you can run a
       <docker_image_name> \
       --keep_container_alive true
    ```
-   Note: You may have to use double quotes around `algorithmic-efficiency` in the mounting `-v` flag. If the above command fails try replacing the following line:
+   Note: You may have to use double quotes around `algorithmic-efficiency` [path] in the mounting `-v` flag. If the above command fails try replacing the following line:
    ```bash
    -v $HOME/algorithmic-efficiency:/algorithmic-efficiency2 \
    ``` 
@@ -241,13 +241,6 @@ The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT
 Since we use PyTorch's [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details.
 While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example.
 
-## Conformer Pytorch OOM 
-The Conformer Pytorch workload has memory fragmentation issue after upgrading to 
-Pytorch 2.0.1, which led to out of memory errors. To circumvent this issue we have tuned the pytorch 
-memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this 
-means that the Conformer Pytorch submission times will be roughly 2x slower. 
-Tracking in issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
-
 # FAQS
 
 ## Setup and Platform
@@ -261,7 +254,7 @@ to make make sure it will fit on a 8 16GB V100 GPUs.
 You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker).
 ### How can I run this on my AWS/GCP/Azure cloud project?
  Depending on your virtual machine, you may have to install install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you will have to do the following.
-1. If you don't have an VM instance yet, we recommmend creating a
+1. If you don't have a VM instance yet, we recommend creating a
 new Compute Instance with the "Deep Learning on Linux" Image in Boot disk options. 
 2. To install the NVIDIA Docker toolkit, you can use `scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit.
 
@@ -270,14 +263,28 @@ new Compute Instance with the "Deep Learning on Linux" Image in Boot disk option
 Yes, your submission can be structured using multiple files. 
 ### Can I install custom dependencies?
 You may use custom dependencies as long as they do not conflict with any of the pinned packages in `algorithmic-efficiency/setup.cfg`. 
-To include your custom dependencies in your submission, please include them in a requirements.txt file. 
+To include your custom dependencies in your submission, please include them in a requirements.txt file. Please refer to the [Software dependencies](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#software-dependencies) section of our rules. 
 ### How can I know if my code can be run on benchmarking hardware?
 The benchmarking hardware specifications are documented in the [Getting Started Document](./getting_started.md).
-Please monitor your submission's memory usage so that it does not exceed the available memory 
-on the competition hardware. 
+We recommend monitoring your submission's memory usage so that it does not exceed the available memory 
+on the competition hardware. We also recommend doing a dry run using a cloud instance.
 ### Are we allowed to use our own hardware to self-report the results?
-No. However you are allowed to use your own hardware to report the best hyperparameter point to qualify for 
-a compute sponsorship offering a free evaluation on the full benchmark set, see the [Rules](./RULES.md#qualification-set).
+You only have to use the competition hardware for runs that are directly involved in the scoring procedure. This includes all runs for the self-tuning ruleset, but only the runs of the best hyperparameter configuration in each study for the external tuning ruleset. For example, you could use your own (different) hardware to tune your submission and identify the best hyperparameter configuration (in each study) and then only run this configuration (i.e. 5 runs, one for each study) on the competition hardware.
 
 # Citing AlgoPerf Benchmark
-Todo: how to cite the algoperf benchmark?
\ No newline at end of file
+If you use the **AlgoPerf** Benchmark in your work, please consider citing:
+
+> [George E. Dahl, Frank Schneider, Zachary Nado, et al.<br/>
+> **Benchmarking Neural Network Training Algorithms**<br/>
+> *arXiv 2306.07179*](http://arxiv.org/abs/2306.07179)
+
+```bibtex
+@misc{dahl2023algoperf,
+   title={{Benchmarking Neural Network Training Algorithms}},
+   author={Dahl, George E. and Schneider, Frank and Nado, Zachary and Agarwal, Naman and Sastry, Chandramouli Shama and Hennig, Philipp and Medapati, Sourabh and Eschenhagen, Runa and Kasimbeg, Priya and Suo, Daniel and Bae, Juhan and Gilmer, Justin and Peirson, Abel L. and Khan, Bilal and Anil, Rohan and Rabbat, Mike and Krishnan, Shankar and Snider, Daniel and Amid, Ehsan and Chen, Kongtao and Maddison, Chris J. and Vasudev, Rakshith and Badura, Michal and Garg, Ankush and Mattson, Peter},
+   year={2023},
+   eprint={2306.07179},
+   archivePrefix={arXiv},
+   primaryClass={cs.LG}
+}
+```
\ No newline at end of file

From 2e1ffd40c9d3fc7935b693e245c5928b53963c7c Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Mon, 23 Oct 2023 18:10:02 +0000
Subject: [PATCH 20/21] add space after heading

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 9803f38be..f4c467665 100644
--- a/README.md
+++ b/README.md
@@ -244,6 +244,7 @@ While this issue might not affect all setups, we currently implement a different
 # FAQS
 
 ## Setup and Platform
+
 ### My machine only has one GPU. How can I use this repo?
 You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batchsizes for the submission. Note that your final submission must 'fit'
 on the benchmarking hardware, so if you are using fewer

From 9cd436282ac34a7c3b2d33ff6574b59767e69466 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg <kasimbeg@google.com>
Date: Mon, 23 Oct 2023 18:13:30 +0000
Subject: [PATCH 21/21] fix typos

---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index f4c467665..6ffbab6f7 100644
--- a/README.md
+++ b/README.md
@@ -246,15 +246,15 @@ While this issue might not affect all setups, we currently implement a different
 ## Setup and Platform
 
 ### My machine only has one GPU. How can I use this repo?
-You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batchsizes for the submission. Note that your final submission must 'fit'
+You can run this repo on a machine with an arbitrary number of GPUs. However, the default batch sizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batch sizes for the submission. Note that your final submission must 'fit'
 on the benchmarking hardware, so if you are using fewer
-GPUs with higher per gpu memory, please monitor your memory usage 
-to make make sure it will fit on a 8 16GB V100 GPUs.
+GPUs with higher per-GPU memory, please monitor your memory usage 
+to make sure it will fit on 8xV100 GPUs with 16GB of VRAM per card.
 
 ### How do I run this on my SLURM cluster?
 You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker).
 ### How can I run this on my AWS/GCP/Azure cloud project?
- Depending on your virtual machine, you may have to install install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you will have to do the following.
+ Depending on your virtual machine, you may have to install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you will have to do the following.
 1. If you don't have a VM instance yet, we recommend creating a
 new Compute Instance with the "Deep Learning on Linux" Image in Boot disk options. 
 2. To install the NVIDIA Docker toolkit, you can use `scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit.