diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 000000000000..70f39bd7f81f
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,67 @@
+# Contributing
+DeepSpeed welcomes your contributions!
+
+## Prerequisites
+DeepSpeed uses [pre-commit](https://pre-commit.com/) to ensure that formatting is
+consistent across DeepSpeed. First, ensure that `pre-commit` is installed, either by
+installing DeepSpeed or via `pip install pre-commit`. Next, the pre-commit hooks must be
+installed once before commits can be made:
+```bash
+pre-commit install
+```
+
+Afterwards, our suite of formatting tests runs automatically before each `git commit`. You
+can also run these manually:
+```bash
+pre-commit run --all-files
+```
+If a formatting test fails, it will fix the modified code in place and abort
+the `git commit`. After looking over the changes, you can `git add <modified files>`
+and then repeat the previous `git commit` command.
+
+
+## Testing
+DeepSpeed tracks two types of tests: unit tests and more costly model convergence tests.
+The model convergence tests train
+[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) and measure
+end-to-end convergence and related metrics. Unit tests are found in `tests/unit/` and
+the model convergence tests are found in `tests/model/`.
+
+### Unit Tests
+[PyTest](https://docs.pytest.org/en/latest/) is used to execute tests. PyTest can be
+installed from PyPI via `pip install pytest`. Simply invoke `pytest --forked` to run the
+unit tests:
+```bash
+pytest --forked tests/unit/
+```
+You can also provide the `-v` flag to `pytest` to see additional information about the
+tests. Note that [pytest-forked](https://github.com/pytest-dev/pytest-forked) and the
+`--forked` flag are required to test CUDA functionality in distributed tests.
+
+### Model Tests
+To execute model tests, first [install DeepSpeed](#installation). The
+[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) repository is cloned
+as part of this process. Next, execute the model test driver:
+```bash
+cd tests/model/
+pytest run_sanity_check.py
+```
+Note that the `--forked` flag is not necessary for the model tests.
+
+## Contributor License Agreement
+This project welcomes contributions and suggestions. Most contributions require you to
+agree to a Contributor License Agreement (CLA) declaring that you have the right to, and
+actually do, grant us the rights to use your contribution. For details, visit
+https://cla.opensource.microsoft.com.
+
+When you submit a pull request, a CLA bot will automatically determine whether you need
+to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply
+follow the instructions provided by the bot. You will only need to do this once across
+all repos using our CLA.
+
+## Code of Conduct
+This project has adopted the [Microsoft Open Source Code of
+Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the
+[Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact
+[opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or
+comments.
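Taken together, the contributor setup described in the new CONTRIBUTING.md above amounts to a short shell session. The following is a minimal sketch, not part of the patch itself; it assumes the commands are run from the root of a DeepSpeed checkout and that `pytest-forked` is installed from PyPI alongside `pytest`:

```bash
# Install the formatting and testing tooling.
pip install pre-commit pytest pytest-forked

# Register the pre-commit hooks once; they then run on every `git commit`.
pre-commit install

# Optionally run the full formatting suite by hand before committing.
pre-commit run --all-files

# Run the unit tests; --forked is required for CUDA/distributed test cases.
pytest --forked tests/unit/
```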
diff --git a/Dockerfile b/Dockerfile index 1153145760bd..3274ff10ab25 100644 --- a/Dockerfile +++ b/Dockerfile @@ -101,12 +101,11 @@ USER deepspeed ############################################################################## # DeepSpeed -# TODO: once repo is public we can install latest deepspeed via this command -############################################################################## -#RUN git clone https://github.com/microsoft/DeepSpeed.git ${STAGE_DIR}/DeepSpeed -#RUN cd ${STAGE_DIR}/DeepSpeed && \ -# git checkout . && \ -# git checkout master && \ -# sudo ./install.sh -#RUN rm -rf ${STAGE_DIR}/DeepSpeed -#RUN python -c "import deepspeed; print(deepspeed.__version__)" +############################################################################## +RUN git clone https://github.com/microsoft/DeepSpeed.git ${STAGE_DIR}/DeepSpeed +RUN cd ${STAGE_DIR}/DeepSpeed && \ + git checkout . && \ + git checkout master && \ + ./install.sh +RUN rm -rf ${STAGE_DIR}/DeepSpeed +RUN python -c "import deepspeed; print(deepspeed.__version__)" diff --git a/README.md b/README.md index f26c9cac3422..c046c5e19fd7 100755 --- a/README.md +++ b/README.md @@ -100,7 +100,7 @@ combination. ZeRO boosts the scaling capability and efficiency further. ![DeepSpeed-vs-Megatron](./docs/figures/DeepSpeed-vs-Megatron.png)

-The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of Nvidia Megatron-LM) over using Megatron-LM alone. +The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM) over using Megatron-LM alone.

@@ -119,7 +119,7 @@ convergence to desired accuracy. --> ## Good Usability -Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed and ZeRO. Compared to current model parallelism libraries, DeepSpeed does not require a code redesign or model refactoring. It also does not put limitations on model dimensions (such as number of attention heads, hidden sizes, and others), batch size, or any other training parameters. For models of up to six billion parameters, you can use ZeRO-powered data parallelism conveniently without requiring model parallelism, while in contrast, standard data parallelism will run out of memory for models with more than 1.3 billion parameters. In addition, DeepSpeed conveniently supports flexible combination of ZeRO-powered data parallelism with custome model parallelisms, such as tensor slicing of Nvidia Megatron-LM. +Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed and ZeRO. Compared to current model parallelism libraries, DeepSpeed does not require a code redesign or model refactoring. It also does not put limitations on model dimensions (such as number of attention heads, hidden sizes, and others), batch size, or any other training parameters. For models of up to six billion parameters, you can use ZeRO-powered data parallelism conveniently without requiring model parallelism, while in contrast, standard data parallelism will run out of memory for models with more than 1.3 billion parameters. In addition, DeepSpeed conveniently supports flexible combination of ZeRO-powered data parallelism with custom model parallelisms, such as tensor slicing of NVIDIA's Megatron-LM. ## Features @@ -265,7 +265,7 @@ the `step` value is stored as part of the `client_sd`. ## DeepSpeed Configuration -DeepSpeed featureds can be enabled, disabled, or configured using a config JSON +DeepSpeed features can be enabled, disabled, or configured using a config JSON file that should be specified as `args.deepspeed_config`. A sample config file is shown below. For a full set of features see [core API doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html). @@ -377,56 +377,9 @@ as the hostname. # Contributing -DeepSpeed welcomes your contributions! - - -## Prerequisites -DeepSpeed uses [pre-commit](https://pre-commit.com/) to ensure that formatting is -consistent across DeepSpeed. First, ensure that `pre-commit` is installed from either -installing DeepSpeed or `pip install pre-commit`. Next, the pre-commit hooks must be -installed once before commits can be made: -```bash -pre-commit install -``` - -Afterwards, our suite of formatting tests run automatically before each `git commit`. You -can also run these manually: -```bash -pre-commit run --all-files -``` -If a formatting test fails, it will fix the modified code in place and abort -the `git commit`. After looking over the changes, you can `git add ` -and then repeat the previous `git commit` command. - - -## Testing -DeepSpeed tracks two types of tests: unit tests and more costly model convergence tests. -The model convergence tests train -[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) and measure -end-to-end convergence and related metrics. Unit tests are found in `tests/unit/` and -the model convergence tests are found in `tests/model/`. - -### Unit Tests -[PyTest](https://docs.pytest.org/en/latest/) is used to execute tests. PyTest can be -installed from PyPI via `pip install pytest`. 
Simply invoke `pytest --forked` to run the -unit tests: -```bash -pytest --forked tests/unit/ -``` -You can also provide the `-v` flag to `pytest` to see additional information about the -tests. Note that [pytest-forked](https://github.com/pytest-dev/pytest-forked) and the -`--forked` flag are required to test CUDA functionality in distributed tests. - -### Model Tests -To execute model tests, first [install DeepSpeed](#installation). The -[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) repository is cloned -as part of this process. Next, execute the model test driver: -```bash -cd tests/model/ -pytest run_sanity_check.py -``` -Note that the `--forked` flag is not necessary for the model tests. - +DeepSpeed welcomes your contributions! Please see our +[contributing](CONTRIBUTING.md) guide for more details on formatting, testing, +etc. ## Contributor License Agreement This project welcomes contributions and suggestions. Most contributions require you to @@ -445,3 +398,6 @@ Conduct](https://opensource.microsoft.com/codeofconduct/). For more information [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. + +## Publications +1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: Memory Optimization Towards Training A Trillion Parameter Models. [ArXiv:1910.02054](https://arxiv.org/abs/1910.02054) diff --git a/docs/azure.md b/docs/azure.md index 74ca6fd526f9..f37d8f4a6abe 100644 --- a/docs/azure.md +++ b/docs/azure.md @@ -122,7 +122,7 @@ the first DeepSpeed container: ## Megatron-LM GPT2 DeepSpeed includes an example model using Megatron-LM's GPT2. Please refer to the full -[Megatron tutorial](tutorials/MegatronGPT2Tutorial.md) for more details. +[Megatron tutorial](../docs/tutorials/MegatronGPT2Tutorial.md) for more details. * In order to fully train GPT2 with DeepSpeed and ZeRO we recommend using 8 instances of Azure's Standard_ND40rs_v2 SKU for a total of 64 NVIDIA V100 GPUs. With this setup and a batch size of 1536 you should be able to complete 100k training steps (153.6 million diff --git a/docs/features.md b/docs/features.md index 6b0cdca74326..b9e94e53f75d 100644 --- a/docs/features.md +++ b/docs/features.md @@ -73,9 +73,8 @@ mpu.get_data_parallel_group() mpu.get_data_parallel_world_size() ``` ### Integration with Megatron-LM -**TODO: port tutorial to its own page** DeepSpeed is fully compatible with [Megatron](https://github.com/NVIDIA/Megatron-LM). -Please see the [Megatron-LM tutorial](docs/tutorials/MegatronGPT2Tutorial.md) for details. +Please see the [Megatron-LM tutorial](tutorials/MegatronGPT2Tutorial.md) for details. @@ -89,8 +88,8 @@ over 6 billion parameters without any model parallelism, and up to 100 billion parameter models with model parallelism on current generation hardware. For more details see the [ZeRO paper](https://arxiv.org/abs/1910.02054), [GPT -tutorial](../../Tutorials/Megatron_GPT2/MegatronGPT2Tutorial.md) on integration with -DeepSpeed. Additional tutorals including *BERT Tutorial*: Coming Soon. +tutorial](tutorials/MegatronGPT2Tutorial.md) on integration with +DeepSpeed. Additional tutorials including *BERT Tutorial*: Coming Soon. @@ -157,7 +156,7 @@ high memory bandwidth. **TODO: port tutorial** DeepSpeed makes it easy to train with large batch sizes by enabling the LAMB Optimizer. 
For more details on LAMB, see the [BERT -tutorial](../../Tutorials/BingBertSquad/BingBertSquadTutorial.md) and the [LAMB +tutorial](tutorials/BingBertSquadTutorial.md) and the [LAMB paper](https://arxiv.org/pdf/1904.00962.pdf). ### Memory-Efficient Training with ZeRO Optimizer @@ -181,10 +180,10 @@ DeepSpeed supports multiple Learning Rate Schedules to enable faster convergence large batch scaling. ### Learning Rate Range Test -Please refer to [Learning Rate Range Test](../../Tutorials/lrrt/lrrt.md). +Please refer to the [Learning Rate Range Test](tutorials/lrrt.md) tutorial. ### 1Cycle Learning Rate Schedule -Please refer to [1Cycle Learning Rate Schedule](../../Tutorials/1cycle/1Cycle.md). +Please refer to the [1Cycle Learning Rate Schedule](tutorials/1Cycle.md) tutorial. ## Simplified Data Loader diff --git a/docs/tutorials/CIFAR-10.md b/docs/tutorials/CIFAR-10.md index 099557f311f1..1f8874f133d0 100755 --- a/docs/tutorials/CIFAR-10.md +++ b/docs/tutorials/CIFAR-10.md @@ -4,7 +4,7 @@ If you haven't already stepped through [DeepSpeed Model Training](../../Onboard/ In this tutorial we will be adding DeepSpeed to CIFAR-10 model, which is small image classification model. -First we will go over how to run original CIRAR-10. Then we will proceed step-by-step in enabling this model to run with DeepSpeed. +First we will go over how to run original CIFAR-10. Then we will proceed step-by-step in enabling this model to run with DeepSpeed. diff --git a/docs/tutorials/MegatronGPT2Tutorial.md b/docs/tutorials/MegatronGPT2Tutorial.md index 0cc57c370374..724a58221f3d 100755 --- a/docs/tutorials/MegatronGPT2Tutorial.md +++ b/docs/tutorials/MegatronGPT2Tutorial.md @@ -1,9 +1,8 @@ # Tutorial: Megatron-LM GPT2 with DeepSpeed -**TODO: these two links are broken (not yet implemented).** -We advise you to first read through the guides for [Setup and -Onboarding](../../Onboard/onboard/onboard.md) and [Model -Training](../../Onboard/model_training/deepspeed_model_training.md). +If you haven't already, we advise you to first read through the [Getting +Started](../../README.md#getting-started) guide before stepping through this +tutorial. In this tutorial we will be adding DeepSpeed to Megatron-LM GPT2 model, which is a large, powerful transformer. Megatron-LM supports model-parallel and multi-node @@ -30,9 +29,6 @@ git submodule update --init --recursive ### 1.1 Training Data Setup * Follow Megatron's [instructions](https://github.com/NVIDIA/Megatron-LM#collecting-gpt2-webtext-data) to download the webtext data and place a symbolic link under `DeepSpeedExamples/Megatron-LM/data`: - * (*Microsoft*:) Raw and pre-processed data has already been downloaded on - all DLTS clusters: `/data/Megatron-LM/data/`. You can simply execute - `ln -s /data/Megatron-LM/data DeepSpeedExamples/Megatron-LM/`. ### 1.2 Running Unmodified Megatron-LM GPT2 model