# Distributed Training using MPI on Amazon SageMaker

***This notebook should be deployed in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel in the Oregon (us-west-2) region for the hyperlinks to the console to work correctly.***

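[MPI](https://mpi4py.readthedocs.io/en/stable/overview.html) (Message Passing Interface) is a standard for passing messages between processes. With an implementation such as Open MPI, it allows us to run a single-process Python program as a set of cooperating parallel processes.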

SageMaker can use MPI to launch multiple Python processes on a single instance, or to set up multiple instances with multiple processes each. In this demo we will deploy **multiple instances** that each run 2 parallel processes.

The `mpi_demo.py` Python script provided in this demo ensures that each spawned process looks for the other processes and communicates with them.

If a process cannot communicate with the other spawned processes, it fails, and the training job fails as well.

This functionality demonstrates that a single-threaded program is being parallelized, which means you can bring your existing single-threaded training scripts and parallelize them with ease.
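The contents of `mpi_demo.py` are not shown in this notebook, so the following is only a minimal sketch of how such a peer check might look, assuming the `mpi4py` package is available in the training container. The names `check_peers` and `main` are hypothetical and are not taken from the actual script.

```python
def check_peers(comm):
    """Return the world size after confirming every rank responded.

    `comm` is expected to behave like an mpi4py communicator
    (Get_rank, Get_size, allgather).
    """
    rank = comm.Get_rank()
    size = comm.Get_size()
    # allgather is a collective call: every rank contributes its own rank
    # number and receives the list of all contributions.
    ranks = comm.allgather(rank)
    if sorted(ranks) != list(range(size)):
        raise RuntimeError(f"Rank {rank} did not hear from all {size} processes")
    return size


def main():
    # Assumption: mpi4py is installed. It is imported here so that
    # check_peers can be exercised without an MPI runtime.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    size = check_peers(comm)
    if comm.Get_rank() == 0:
        print(f"Number of MPI processes that will talk to each other: {size}")
```

Under these assumptions, calling `main()` from a script launched by MPI would run the check once in every process, and an exception in any process would surface as a failed run.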


## Let's deploy MPI! (Note: this is a simulation)

In the code below, note that the **`distribution`** parameter enables the use of MPI. Here we can define the number of processes per host, which is set to 2.

The number of instances is defined in the training configuration with **`instance_count`**, which is set to 5.

We are thus looking for 10 processes (2 on each of the 5 instances) to be spawned and to start communicating with each other.
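The arithmetic behind that total can be written out as a short sketch. The helper name `expected_world_size` is hypothetical; the values mirror the configuration used in this notebook.

```python
# Values taken from the estimator configuration in this notebook.
distribution = {"mpi": {"enabled": True, "processes_per_host": 2}}
instance_count = 5


def expected_world_size(processes_per_host, instance_count):
    """Total MPI processes = processes per host multiplied by the number of hosts."""
    return processes_per_host * instance_count


world_size = expected_world_size(
    distribution["mpi"]["processes_per_host"], instance_count
)
print(f"Expected MPI processes: {world_size}")
```

Changing either value and re-running the snippet gives the process count to look for in the training log.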

To confirm that all 10 processes are communicating, look for the following message in the output of the next code cell:

**`[1,0]<stdout>:Number of MPI processes that will talk to each other: 10[1,0]<stdout>:`**
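If you would rather check the log programmatically than by eye, a small sketch along these lines could extract the reported count. It assumes the message text is exactly as shown above; the function name `reported_process_count` is hypothetical, and the `[1,0]<stdout>:` prefix is treated as an output tag added by the MPI launcher rather than as part of the message.

```python
import re

# Matches the message printed by the demo script and captures the count.
PATTERN = re.compile(
    r"Number of MPI processes that will talk to each other: (\d+)"
)


def reported_process_count(log_text):
    """Return the process count found in the log text, or None if absent."""
    match = PATTERN.search(log_text)
    return int(match.group(1)) if match else None


sample = (
    "[1,0]<stdout>:Number of MPI processes that will talk to each other: 10"
    "[1,0]<stdout>:"
)
print(reported_process_count(sample))
```

Comparing the returned value with `processes_per_host * instance_count` gives a quick sanity check on the run.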

Feel free to also inspect the [training job](https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs) once you have run the code below.

Feel free to change **`processes_per_host`**, **`instance_type`** and **`instance_count`** if you would like to test a different number of distributed processes. Please keep **`instance_count`** within your [account quotas](https://us-west-2.console.aws.amazon.com/servicequotas/home/services/sagemaker/quotas).

The following code should take 3-4 minutes to complete.

In [None]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

role = get_execution_role()

# Enable MPI with 2 processes per host; with instance_count=5 below, this gives 10 processes in total.

distribution = {"mpi": {"enabled": True, "processes_per_host": 2}}

tfest = TensorFlow(
    entry_point="mpi_demo.py",
    role=role,
    framework_version="2.3.0",
    distribution=distribution,
    py_version="py37",
    instance_count=5,
    instance_type="ml.c5.xlarge",  # 4 cores
    output_path=f"s3://{sagemaker.Session().default_bucket()}/mpi",
)

tfest.fit()