### Exercise 4. Prepare training script
Now you will need to create your training script. In this tutorial, the training script is already provided for you at `pytorch_train.py`. In practice, you should be able to take any custom training script as is and run it with AML without having to modify your code.

However, if you would like to use AML's [tracking and metrics](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#metrics) capabilities, you will have to add a small amount of AML code inside your training script. 

In `pytorch_train.py`, we will log some metrics to our AML run. To do so, we will access the AML run object within the script:
```Python
from azureml.core.run import Run
run = Run.get_context()
```
Further within `pytorch_train.py`, we log the learning rate and momentum parameters, the best validation accuracy the model achieves, and the number of classes in the model:
```Python
# Built-in float() replaces np.float, which was removed in NumPy 1.24.
run.log('lr', float(learning_rate))
run.log('momentum', float(momentum))
run.log('num_classes', num_classes)

run.log('best_val_acc', float(best_acc))
```
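
If you want the same script to run where the AML SDK is not installed, one option is to wrap the run object in a small helper. The following is a minimal sketch, not part of `pytorch_train.py`: `MetricLogger` and `get_aml_run` are hypothetical names, and the logged values are placeholders.

```Python
class MetricLogger:
    """Keeps metrics in memory and forwards them to an AML run if one is given."""

    def __init__(self, run=None):
        self.run = run          # any object with a .log(name, value) method, or None
        self.history = {}       # metric name -> list of logged values

    def log(self, name, value):
        self.history.setdefault(name, []).append(value)
        if self.run is not None:
            self.run.log(name, value)


def get_aml_run():
    """Return the AML run context, or None when the SDK is not installed."""
    try:
        from azureml.core.run import Run
    except ImportError:
        return None
    return Run.get_context()


logger = MetricLogger(get_aml_run())
logger.log('lr', float(0.001))       # placeholder value
logger.log('momentum', float(0.9))   # placeholder value
```

With this arrangement the training loop calls `logger.log(...)` everywhere, and the AML dependency stays in one place.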

If you downloaded the data, you can train the model locally. Note that training is slow without a GPU: about 21 minutes on a Core i7 CPU.

**This step requires PyTorch to be installed locally. See the [installation instructions](https://pytorch.org/#pip-install-pytorch).**


In [None]:
!mkdir outputs
!python pytorch_train.py --data_dir breeds-10 --num_epochs 10 --output_dir outputs 
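
The command above passes three flags to the script. As a rough sketch of how a training script could accept them, here is an `argparse` parser. The actual parsing code in `pytorch_train.py` is not shown in this tutorial, so the helper name `build_parser`, the default values, and the help strings are assumptions.

```Python
import argparse


def build_parser():
    """Build a parser for the flags used in the local training command."""
    parser = argparse.ArgumentParser(description='Transfer learning sketch')
    parser.add_argument('--data_dir', type=str, required=True,
                        help='directory that contains the training data')
    parser.add_argument('--num_epochs', type=int, default=25,   # assumed default
                        help='number of training epochs')
    parser.add_argument('--output_dir', type=str, default='outputs',
                        help='directory where the trained model is saved')
    return parser


# Parse the same arguments as the local command above.
args = build_parser().parse_args(
    ['--data_dir', 'breeds-10', '--num_epochs', '10', '--output_dir', 'outputs'])
print(args.data_dir, args.num_epochs, args.output_dir)
```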

## Train model on the remote compute
Now that you have your data and training script prepared, you are ready to train on your remote compute cluster. Azure compute gives you access to GPUs, which cut down your training time.

### Create an experiment
Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this transfer learning PyTorch tutorial. 

In [None]:
from azureml.core import Experiment

experiment_name = 'pytorch-dogs' 
experiment = Experiment(ws, name=experiment_name)

### Create a PyTorch estimator
The AML SDK's PyTorch estimator enables you to easily submit PyTorch training jobs for both single-node and distributed runs. The following code will define a single-node PyTorch job.

The estimator also takes a `framework_version` parameter. If no version is provided, the estimator defaults to the latest version supported by AzureML. Use `PyTorch.get_supported_versions()` to list the versions supported by your current SDK version, or see the [SDK documentation](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.dnn?view=azure-ml-py) for the versions supported in the most recent release. For more information on the PyTorch estimator, see the [PyTorch training guide](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch).

In [None]:
from azureml.train.dnn import PyTorch

script_params = {
    '--data_dir': ds_data.as_mount(),
    '--num_epochs': 10,
    '--output_dir': './outputs',
    '--log_dir': './logs',
    '--mode': 'fine_tune'
}

estimator10 = PyTorch(source_directory='.', 
                      script_params=script_params,
                      compute_target=compute_target, 
                      entry_script='pytorch_train.py',
                      pip_packages=['tensorboardX'],
                      use_gpu=True,
                      framework_version='1.0',
                      _use_framework_images=True)

The `script_params` parameter is a dictionary containing the command-line arguments to your training script `entry_script`. Please note the following:
- We passed our training data reference `ds_data` to our script's `--data_dir` argument. This will 1) mount our datastore on the remote compute and 2) provide the path to the training data `breeds` on our datastore.
- We specified the output directory as `./outputs`. AML treats the `outputs` directory specially: everything in it is uploaded to your workspace as part of your run history, so the files written there remain accessible after your remote run is over. In this tutorial, we will save our trained model to this output directory.
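
To make the mapping from dictionary to command line concrete, the sketch below flattens a `script_params`-style dictionary into an argument list. It approximates what the script receives and is not the estimator's actual implementation. `to_argv` is a hypothetical helper, and the mount path is a placeholder for whatever path `ds_data.as_mount()` resolves to on the remote compute.

```Python
def to_argv(script_params):
    """Flatten a {flag: value} dictionary into a list of command-line tokens."""
    argv = []
    for flag, value in script_params.items():
        argv.append(flag)
        argv.append(str(value))
    return argv


script_params = {
    '--data_dir': '/mnt/hypothetical/breeds',  # placeholder for the mounted datastore path
    '--num_epochs': 10,
    '--output_dir': './outputs',
    '--log_dir': './logs',
    '--mode': 'fine_tune'
}

argv = to_argv(script_params)
print(' '.join(['python', 'pytorch_train.py'] + argv))
```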

The `_use_framework_images` parameter is in private preview. When `_use_framework_images=True`, AML uses pre-built framework images, either to run AMLCompute jobs directly or as intermediate images. This eliminates or greatly reduces image build time.

To leverage the Azure VM's GPU for training, we set `use_gpu=True`.

### Submit job
Run your experiment by submitting your estimator object. Note that this call is asynchronous.

In [None]:
run10 = experiment.submit(estimator10)
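
Because the call returns immediately, a script that needs the results has to wait for the run explicitly. The sketch below uses the run object's `wait_for_completion` and `get_metrics` methods. `wait_and_get_metrics` is a hypothetical helper, and `StubRun` is a stand-in with placeholder metric values so the helper can be tried without a workspace.

```Python
def wait_and_get_metrics(run, show_output=False):
    """Block until `run` finishes, then return the metrics it logged."""
    run.wait_for_completion(show_output=show_output)
    return run.get_metrics()


class StubRun:
    """Hypothetical stand-in for a submitted run, for trying the helper offline."""

    def __init__(self, metrics):
        self._metrics = metrics
        self.finished = False

    def wait_for_completion(self, show_output=False):
        self.finished = True
        return {'status': 'Completed'}

    def get_metrics(self):
        return dict(self._metrics)


stub = StubRun({'lr': 0.001, 'num_classes': 10})  # placeholder metrics
metrics = wait_and_get_metrics(stub)
print(metrics)
```

In the notebook you would call `wait_and_get_metrics(run10, show_output=True)` instead of using the stub.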

### Monitor your run
You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [None]:
from azureml.widgets import RunDetails
RunDetails(run10).show()
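
Outside a notebook, where the widget is not available, you can poll the run status instead. The following is a minimal sketch that relies on the run object's `get_status` method. `poll_status`, its parameters, and the `StubRun` class are hypothetical, and the 15-second default simply mirrors the widget's refresh interval.

```Python
import time

# States after which a run no longer changes.
TERMINAL_STATES = {'Completed', 'Failed', 'Canceled'}


def poll_status(run, interval_seconds=15, max_polls=240):
    """Poll `run.get_status()` until a terminal state or `max_polls` is reached.

    Returns the sequence of distinct statuses that were observed.
    """
    seen = []
    for _ in range(max_polls):
        status = run.get_status()
        if not seen or seen[-1] != status:
            print('Run status:', status)
            seen.append(status)
        if status in TERMINAL_STATES:
            break
        time.sleep(interval_seconds)
    return seen


class StubRun:
    """Hypothetical stand-in that replays a fixed list of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)
        self._last = None

    def get_status(self):
        self._last = next(self._statuses, self._last)
        return self._last


history = poll_status(StubRun(['Queued', 'Running', 'Running', 'Completed']),
                      interval_seconds=0)
```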

### What happens during a run?
If you are running this for the first time, the compute target will need to pull the Docker image, which will take about 2 minutes. This gives us the time to go over how a **Run** is executed in Azure Machine Learning.

Note: had we not created the workspace with an existing ACR, we would have also had to wait for the image creation to be performed -- that takes an extra 10-20 minutes for big GPU images like this one. This is a one-time cost for a given Python configuration, and subsequent runs will then be faster. We are working on ways to make this image creation faster.

![](aml-run.png)