# Exercise 6. Distributed Training

### Distributed training

Now that the setup is working, we can move on to the full dataset with 120 classes. We only need to point to a different path on the datastore.

In [None]:
full_dataset = ds.path('breeds')
print(full_dataset)

In [None]:
## AML Compute
from azureml.train.dnn import PyTorch

script_params = {
    '--data_dir': full_dataset.as_mount(),
    '--num_epochs': 25,
    '--output_dir': './outputs',
    '--log_dir': './logs',
    '--mode': 'fine_tune'
}

estimator120 = PyTorch(source_directory='.', 
                        script_params=script_params,
                        compute_target=compute_target, 
                        entry_script='pytorch_train.py',
                        pip_packages=['tensorboardX'],
                        node_count=1,
                        use_gpu=True)

run120 = experiment.submit(estimator120)

from azureml.widgets import RunDetails
RunDetails(run120).show()

Training on the full dataset now takes a long time (more than 1 hour), so let's see if we can run this job on multiple GPUs to cut down on training time.

In [None]:
# First, cancel the single-node job submitted above
run120.cancel()

Running the model on multiple nodes is simple. In this case we use Horovod with the MPI backend, running on 4 nodes:

In [None]:
## AML Compute
from azureml.train.dnn import PyTorch

script_params = {
    '--data_dir': full_dataset.as_mount(),
    '--num_epochs': 25,
    '--output_dir': './outputs',
    '--log_dir': './logs',
    '--mode': 'fine_tune'
}

estimator120 = PyTorch(source_directory='.', 
                        script_params=script_params,
                        compute_target=compute_target, 
                        pip_packages=['tensorboardX'],
                        entry_script='pytorch_train_horovod.py',
                        node_count=4,
                        distributed_backend='mpi',
                        use_gpu=True)

run120 = experiment.submit(estimator120)
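
The contents of `pytorch_train_horovod.py` are not shown in this exercise, so the following is only a conceptual sketch of the data-parallel idea behind it: each worker trains on its own slice of the dataset and the gradients are averaged across workers. The helper `shard_indices` and the toy dataset size are hypothetical names and values chosen for illustration. In a real Horovod script, this role is typically played by `torch.utils.data.distributed.DistributedSampler` together with `hvd.rank()` and `hvd.size()`.

```python
def shard_indices(num_samples, world_size, rank):
    """Return the sample indices assigned to one worker (strided split).

    Hypothetical helper for illustration; it does not pad the shards
    to equal length the way a production sampler usually does.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(range(rank, num_samples, world_size))


world_size = 4    # matches node_count=4, assuming one worker process per node
num_samples = 10  # toy dataset size, for illustration only

shards = [shard_indices(num_samples, world_size, r) for r in range(world_size)]
for rank, shard in enumerate(shards):
    print(f"worker {rank}: {shard}")
```

Every sample lands on exactly one worker, so each epoch still covers the whole dataset while each worker processes only about a quarter of it.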

In [None]:
from azureml.widgets import RunDetails
RunDetails(run120).show()

In [None]:
from azureml.contrib.tensorboard import Tensorboard

# The Tensorboard constructor takes a list of runs, so be sure to pass the run in as a single-element list here
tb = Tensorboard([run120])

# If successful, start() returns a string with the URI of the instance.
tb.start()

In [None]:
tb.stop()

Training on 4 nodes completes in about 25 minutes and achieves 81% accuracy, which is similar to the accuracy of single-node training. This is a substantial reduction in training time.
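
To put these numbers in perspective, here is a rough sketch of the speedup and parallel efficiency. It assumes a single-node time of exactly 60 minutes, which is only a lower bound (the single-node run took more than 1 hour and was cancelled), so the results should be read as "at least" values. The function names are illustrative.

```python
def speedup(baseline_minutes, parallel_minutes):
    """Ratio of single-node time to multi-node time."""
    return baseline_minutes / parallel_minutes


def parallel_efficiency(baseline_minutes, parallel_minutes, num_nodes):
    """Speedup divided by node count (1.0 would be perfect linear scaling)."""
    return speedup(baseline_minutes, parallel_minutes) / num_nodes


single_node_minutes = 60  # assumed lower bound: the run took "> 1 hour"
four_node_minutes = 25    # approximate time reported for the 4-node run
num_nodes = 4

s = speedup(single_node_minutes, four_node_minutes)
e = parallel_efficiency(single_node_minutes, four_node_minutes, num_nodes)
print(f"speedup: at least {s:.1f}x")
print(f"parallel efficiency: at least {e:.0%}")
```

Efficiency below 100% is expected, because the workers spend part of each step exchanging gradients rather than computing.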