# PIDNet Training on Railways SemSegm in AzureML Pipeline

## 0. Connect to your workspace
Load the Azure ML workspace from the saved configuration file, then get the default datastore and the current run context.

In [1]:
import azureml.core
from azureml.core import Workspace, Run

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

# Get the default datastore
default_ds = ws.get_default_datastore()

# Get the experiment run context
run = Run.get_context()

Ready to use Azure ML 1.48.0 to work with aml-con-fer-cvrail-euwe-dev


## 1. Create scripts for pipeline steps
First, let's create a folder for the script files we'll use in the pipeline steps.

In [2]:
import os
# Create a folder for the pipeline step files
pidnet_experiment_folder = 'pidnet_experiment_folder'
os.makedirs(pidnet_experiment_folder, exist_ok=True)

print(pidnet_experiment_folder)

pidnet_experiment_folder


### 1.1. Copy necessary files into experiment folder

In [3]:
import shutil

# configs folder
configs_path = os.path.join(pidnet_experiment_folder, 'configs')
os.makedirs(configs_path, exist_ok=True)
shutil.copy('../configs/__init__.py', configs_path)
shutil.copy('../configs/default.py', configs_path)
#shutil.copytree('../configs/railways', configs_path + '/railways', dirs_exist_ok=True)
shutil.copytree('../configs/railways_5classes', configs_path + '/railways_5classes', dirs_exist_ok=True)

# utils
utils_path = os.path.join(pidnet_experiment_folder, 'utils')
shutil.copytree('../utils', utils_path, dirs_exist_ok=True)

# tools
tools_path = os.path.join(pidnet_experiment_folder, 'tools')
shutil.copytree('../tools', tools_path, dirs_exist_ok=True)

# models
models_path = os.path.join(pidnet_experiment_folder, 'models')
shutil.copytree('../models', models_path, dirs_exist_ok=True)

# datasets
datasets_path = os.path.join(pidnet_experiment_folder, 'datasets')
shutil.copytree('../datasets', datasets_path, dirs_exist_ok=True)

# pretrained models
pretrained_models_path = os.path.join(pidnet_experiment_folder, 'pretrained_models/imagenet')
os.makedirs(pretrained_models_path, exist_ok=True)
#shutil.copy('../pretrained_models/imagenet/PIDNet_S_ImageNet.pth.tar', pretrained_models_path)
shutil.copy('../pretrained_models/imagenet/PIDNet_M_ImageNet.pth.tar', pretrained_models_path)
#shutil.copy('../pretrained_models/imagenet/PIDNet_L_ImageNet.pth.tar', pretrained_models_path)

'pidnet_experiment_folder/pretrained_models/imagenet/PIDNet_M_ImageNet.pth.tar'
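The copy calls above all follow the same pattern, so they can be driven from two lists instead of being written out one by one. The following is a minimal sketch under that assumption; `stage_sources` is a hypothetical helper (it is not part of the PIDNet repository), and the demo runs against a temporary directory tree so that nothing in the real repository is touched.

```python
import shutil
import tempfile
from pathlib import Path


def stage_sources(repo_root, experiment_dir, files=(), folders=()):
    """Copy selected files and folders from repo_root into experiment_dir.

    Hypothetical helper: relative paths are preserved under experiment_dir.
    """
    repo_root, experiment_dir = Path(repo_root), Path(experiment_dir)
    copied = []
    for rel in files:
        target = experiment_dir / rel
        # Create the parent folder first, as os.makedirs(..., exist_ok=True) does above
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(repo_root / rel, target)
        copied.append(target)
    for rel in folders:
        target = experiment_dir / rel
        # dirs_exist_ok (Python 3.8+) makes re-running the cell safe
        shutil.copytree(repo_root / rel, target, dirs_exist_ok=True)
        copied.append(target)
    return copied


# Demo on a throwaway tree that mimics a small part of the repository layout
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo"
    experiment = Path(tmp) / "experiment"
    (repo / "configs" / "railways_5classes").mkdir(parents=True)
    (repo / "configs" / "default.py").write_text("# defaults\n")
    (repo / "configs" / "railways_5classes" / "demo.yaml").write_text("k: v\n")

    staged = stage_sources(
        repo,
        experiment,
        files=["configs/default.py"],
        folders=["configs/railways_5classes"],
    )
    staged_names = sorted(p.relative_to(experiment).as_posix() for p in staged)
    demo_yaml_exists = (experiment / "configs" / "railways_5classes" / "demo.yaml").is_file()
    print(staged_names)
```

In the notebook, the same helper could be called with `repo_root='..'` and `experiment_dir=pidnet_experiment_folder`; switching between the S, M and L pretrained checkpoints would then only require editing the `files` list.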

## 2. Prepare a compute environment for the pipeline steps
### 2.1. Environment Setup
Import the environment that will be used to run the experiment. Because the environment is built from a Docker image, the run does not depend on updates applied to the Azure machines.

If you need additional packages, you can register your own environment and run the experiment with it. To stay isolated from Azure VM updates, it is strongly recommended to register the environment from a pre-built Docker image. Below is an example of how to register your environment in this way:

#### 2.1.1. Training Step
First, set up the compute target.

In [4]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "mrc-railways-nc6"

try:
    # Check for existing compute target
    pipeline_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If it doesn't already exist, create it
    try:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', max_nodes=4)
        pipeline_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
        pipeline_cluster.wait_for_completion(show_output=True)
    except Exception as ex:
        print(ex)

Found existing cluster, use it.


Then create the compute environment from a Dockerfile.


In [5]:
from azureml.core import Environment

# Create a Python environment for the experiment (from Dockerfile)
pidnet_env = Environment("railways_pidnet_training_env")

dockerfile = r"""
FROM mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04:20211221.v1

ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/pidnet

# Create conda environment
RUN conda create -p $AZUREML_CONDA_ENVIRONMENT_PATH \
   python=3.9 pip=22.1.2

# Prepend path to AzureML conda environment
ENV PATH $AZUREML_CONDA_ENVIRONMENT_PATH/bin:$PATH

# Solve pub key problem (https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212772)
RUN apt-key del 7fa2af80
RUN apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

# Solve opencv dependencies
RUN apt-get update
RUN apt-get install ffmpeg libsm6 libxext6 -y

# Install Pytorch dependencies
RUN pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

# Install pip dependencies
RUN pip install 'azureml-core' \
                'azureml-dataprep' \
                'azureml-defaults' \
                'azureml-pipeline' \
                'azureml-pipeline-core' \
                'azureml-pipeline-steps' \
                'azureml-telemetry'

# Install pip dependencies
RUN pip install 'EasyDict==1.7' \
                'shapely' \
                'Cython' \
                'scipy' \
                'pandas' \
                'pyyaml' \
                'json_tricks' \
                'scikit-image' \
                'yacs>=0.1.5' \
                'tensorboardX>=1.6' \
                'tqdm' \
                'ninja' \
                'opencv-python'

# This is needed for mpi to locate libpython
ENV LD_LIBRARY_PATH $AZUREML_CONDA_ENVIRONMENT_PATH/lib:$LD_LIBRARY_PATH
"""

pidnet_env.docker.enabled = True
pidnet_env.docker.base_image = None
pidnet_env.python.user_managed_dependencies = True
pidnet_env.docker.base_dockerfile = dockerfile
pidnet_env.save_to_directory(path="./railways_pidnet_training_env", overwrite=True)
# Register the environment
pidnet_env.register(workspace=ws)

'enabled' is deprecated. Please use the azureml.core.runconfig.DockerConfiguration object with the 'use_docker' param instead.


{
    "assetId": "azureml://locations/westeurope/workspaces/3147bafb-f565-4a86-b7c2-e84fb1ba22c4/environments/railways_pidnet_training_env/versions/7",
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": "\nFROM mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04:20211221.v1\n\nENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/pidnet\n\n# Create conda environment\nRUN conda create -p $AZUREML_CONDA_ENVIRONMENT_PATH \\\n   python=3.9 pip=22.1.2\n\n# Prepend path to AzureML conda environment\nENV PATH $AZUREML_CONDA_ENVIRONMENT_PATH/bin:$PATH\n\n# Solve pub key problem (https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212772)\nRUN apt-key del 7fa2af80\nRUN apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863c

In [6]:
from azureml.core import Environment
from azureml.core.runconfig import RunConfiguration

# Get the environment 
pidnet_env = Environment.get(workspace=ws,name="railways_pidnet_training_env")

# Create a new runconfig object for the pipeline
pidnet_run_config = RunConfiguration()

# Use the compute you created above. 
pidnet_run_config.target = pipeline_cluster

# Assign the environment to the run configuration
pidnet_run_config.environment = pidnet_env

print ("PIDNet run configuration created.")

PIDNet run configuration created.


## 4. Create and run a pipeline
Now you're ready to create and run a pipeline.

First you need to define the steps for the pipeline and any data references that need to be passed between them. In a multi-step pipeline, the first step writes the prepared data to a folder that the second step can read. Because the steps run on remote compute (and could each run on different compute), the folder path must be passed as a data reference to a location in a datastore within the workspace. The OutputFileDatasetConfig object is a special kind of data reference for interim storage locations that can be passed between pipeline steps: you create one and use it as the output of the first step and the input of the second. Note that you need to pass it as a script argument so your code can access the datastore location it refers to.

In this notebook there is a single training script, so no interim storage is needed: the registered training dataset is passed to the script as a named input, and the script run is wrapped in a HyperDrive configuration for hyperparameter tuning.
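On the script side, everything passed through `arguments`, plus the hyperparameter flags that HyperDrive appends for each child run, arrives as ordinary command-line arguments. The notebook does not show `tools/train_azureml.py`, so the following is only a hypothetical sketch of how such an entry point might parse them with `argparse`. The flag names match the ones used in this section; the default values and the placeholder dataset string are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags this notebook passes to the training script (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Hypothetical PIDNet training entry point")
    parser.add_argument("--cfg", required=True, help="Path to the experiment YAML config")
    # The dataset reference is treated as an opaque string here
    parser.add_argument("--training-dataset", required=True, help="Dataset reference passed in by Azure ML")
    # Defaults below are illustrative assumptions, not values from the real script
    parser.add_argument("--batch-size", type=int, default=8)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    # Quantized samples may arrive formatted as floats (e.g. "120.0"), so go through float first
    parser.add_argument("--max-num-epochs", type=lambda s: int(float(s)), default=100)
    parser.add_argument("--momentum", type=float, default=0.9)
    parser.add_argument("--weight-decay", type=float, default=0.0005)
    return parser.parse_args(argv)


# Simulate the command line of one child run
args = parse_args([
    "--cfg", "configs/railways_5classes/pidnet_medium_railways_5classes.yaml",
    "--training-dataset", "dataset-reference-placeholder",
    "--batch-size", "12",
    "--learning-rate", "0.005",
    "--max-num-epochs", "120.0",
])
print(args.batch_size, args.learning_rate, args.max_num_epochs)
```

For HyperDrive to compare runs, the script also has to log the primary metric under the exact name given in `primary_metric_name` (`mean_IoU` in this notebook).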

In [7]:
from azureml.pipeline.steps import PythonScriptStep
from azureml.core import Dataset, ScriptRunConfig
from azureml.train.hyperdrive import RandomParameterSampling, HyperDriveConfig, PrimaryMetricGoal, choice, uniform, quniform, MedianStoppingPolicy

# Get the training dataset
training_ds = Dataset.get_by_name(ws, name='railways_semsegm_5classes_dataset')

# Step 1 (disabled): PIDNet training as a PythonScriptStep
#pidnet_training_step = PythonScriptStep(name = "Railways SemSegm PIDNet Training",
#                                source_directory = pidnet_experiment_folder,
#                                script_name = "./tools/train_azureml.py",
#                                arguments = ['--cfg', 'configs/railways/pidnet_medium_railways.yaml',
#                                            '--training-dataset', training_ds.as_named_input('railways_semsegm_dataset')],
#                                compute_target = pipeline_cluster,
#                                runconfig = pidnet_run_config,
#                                allow_reuse = True)

pidnet_training_step = ScriptRunConfig(source_directory = pidnet_experiment_folder,
                                script = "./tools/train_azureml.py",
                                arguments = ['--cfg', 'configs/railways_5classes/pidnet_medium_railways_5classes.yaml',
                                            '--training-dataset', training_ds.as_named_input('railways_semsegm_5classes_dataset')],
                                compute_target = pipeline_cluster,
                                environment = pidnet_env)

# Sample a range of parameter values
params = RandomParameterSampling(
    {
        '--batch-size': choice(8, 12, 16),
        '--learning-rate' : uniform(0.001, 0.01),
        '--max-num-epochs' : quniform(100, 150, 1),
        '--momentum' : uniform(0.85, 0.95),
        '--weight-decay' : uniform(0.00005, 0.005)
    }
)

# Establish an early termination policy to save training costs
early_termination_policy = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=10)

# Configure hyperdrive settings
hyperdrive = HyperDriveConfig(run_config=pidnet_training_step, 
                          hyperparameter_sampling=params, 
                          policy=early_termination_policy, # Median stopping policy defined above
                          primary_metric_name='mean_IoU', # Find the highest mean_IoU metric
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          max_total_runs=30, # Restrict the experiment to 30 runs
                          max_concurrent_runs=2) # Run up to 2 iterations in parallel

print("Pipeline steps defined")

Pipeline steps defined
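To get a feel for the search space before spending GPU hours, the sampling can be imitated locally with the standard library. This is a rough sketch, not the HyperDrive implementation: it assumes `choice` picks uniformly from the listed values, `uniform` draws from a continuous range, and `quniform(low, high, q)` rounds a uniform draw to a multiple of `q`. The seed and the helper name `sample_config` are arbitrary choices for the example.

```python
import random


def sample_config(rng):
    """Draw one configuration from a search space shaped like the one above."""

    def quniform(low, high, q):
        # Round a uniform draw to the nearest multiple of q
        return round(rng.uniform(low, high) / q) * q

    return {
        "--batch-size": rng.choice([8, 12, 16]),
        "--learning-rate": rng.uniform(0.001, 0.01),
        "--max-num-epochs": quniform(100, 150, 1),
        "--momentum": rng.uniform(0.85, 0.95),
        "--weight-decay": rng.uniform(0.00005, 0.005),
    }


rng = random.Random(0)  # fixed seed so the local preview is repeatable
configs = [sample_config(rng) for _ in range(30)]  # 30 mirrors max_total_runs

# Summarize how the 30 draws spread over the discrete batch sizes
batch_counts = {b: sum(c["--batch-size"] == b for c in configs) for b in (8, 12, 16)}
print(batch_counts)
```

With only 30 random draws over five dimensions the space is covered sparsely, which is worth keeping in mind when reading the best run's parameters.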


In [8]:
# Run the experiment
#experiment = Experiment(workspace=ws, name='mslearn-diabetes-hyperdrive')
#run = experiment.submit(config=hyperdrive)

# Show the status in the notebook as the experiment runs
#RunDetails(run).show()
#run.wait_for_completion()
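The `MedianStoppingPolicy` configured above cancels runs whose best primary metric is worse than the median of the running averages across runs. The sketch below illustrates that rule in plain Python for a metric that is maximized, such as `mean_IoU`. It is a simplified approximation written for this notebook, not the service's implementation: the function name, the data layout, and the exact handling of `delay_evaluation` are assumptions.

```python
from statistics import median


def should_stop(run_id, histories, interval, evaluation_interval=1, delay_evaluation=10):
    """Decide whether run_id should be cancelled after `interval` reported metric values.

    histories maps a run id to its list of reported metric values (higher is better).
    """
    # No decision is taken during the initial delay or between evaluation intervals
    if interval <= delay_evaluation or interval % evaluation_interval != 0:
        return False
    # Running average of every run that has reported at least `interval` values
    averages = [
        sum(values[:interval]) / interval
        for values in histories.values()
        if len(values) >= interval
    ]
    if not averages:
        return False
    best_so_far = max(histories[run_id][:interval])
    return best_so_far < median(averages)


# Three made-up runs with constant mean_IoU curves, 12 reported values each
histories = {
    "a": [0.5] * 12,
    "b": [0.6] * 12,
    "c": [0.2] * 12,
}
decisions = {run_id: should_stop(run_id, histories, interval=12) for run_id in histories}
print(decisions)
```

In this toy data only run `c` falls below the median and would be cancelled. Before the delay has elapsed, no run is stopped regardless of its metric, which protects slow starters during the first epochs.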

OK, you're ready to run the configuration you've defined as an experiment. The pipeline construction is left commented out below; the HyperDrive configuration is submitted directly.

In [None]:
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.widgets import RunDetails
import pandas as pd

# Construct the pipeline
#pipeline_steps = [pidnet_training_step]
#pipeline = Pipeline(workspace=ws, steps=pipeline_steps)
#print("Pipeline is built.")

# Define experiment tags
# Include some tags to register with the experiment
output_tags = {}
output_tags['Area']='Computer Vision, Railway Inspection'
output_tags['Target'] = 'Railway Semantic Segmentation'
output_tags['Subscription ID'] = ws.subscription_id
output_tags['Workspace'] = ws.name
output_tags['Resource Group'] = ws.resource_group
output_tags['Location'] = ws.location
output_tags['ADL'] = ws.tags['ADL']
output_tags['Company'] = ws.tags['Company']
output_tags['DevOps'] = ws.tags['DevOps']
output_tags['Environment'] = ws.tags['Environment']
output_tags['OU'] = ws.tags['OU']
output_tags['Project'] = ws.tags['Project']
output_tags['Responsible'] = ws.tags['Responsible']
output_tags['area'] = "railway inspection"
output_tags['type'] = "railway semantic segmentation PIDNet model training"
output_tags['Project'] = "Railway Inspection"
pd.set_option('display.max_colwidth', None)
outputDf = pd.DataFrame(data = output_tags, index = [''])
print(outputDf.T)

# Create an experiment, set some tags and run the pipeline
experiment = Experiment(workspace=ws, name = 'railway-pidnet-model-training')
experiment.set_tags(output_tags)
pipeline_run = experiment.submit(config=hyperdrive)
#pipeline_run = experiment.submit(pipeline, regenerate_outputs = True)
#pipeline_run = experiment.submit(pipeline)
print("Pipeline submitted for execution.")
RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion(show_output=True)

                                                                    
Area                             Computer Vision, Railway Inspection
Target                                 Railway Semantic Segmentation
Subscription ID                 764a3c74-3d65-4a9c-bc90-661b00666572
Workspace                                aml-con-fer-cvrail-euwe-dev
Resource Group                            rg-con-fer-cvrail-euwe-dev
Location                                                  westeurope
ADL                                      andres.santos@ferrovial.com
Company                                                  Corporacion
DevOps                                                CON-FER-CVRAIL
Environment                                              Development
OU                                                     Ferrocarriles
Project                                           Railway Inspection
Responsible                                 mcastrillo@ferrovial.com
area                              

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

RunId: HD_c4fabe37-8835-473c-8647-824413410f00
Web View: https://ml.azure.com/runs/HD_c4fabe37-8835-473c-8647-824413410f00?wsid=/subscriptions/764a3c74-3d65-4a9c-bc90-661b00666572/resourcegroups/rg-con-fer-cvrail-euwe-dev/workspaces/aml-con-fer-cvrail-euwe-dev&tid=a9a8e375-fac1-4ec2-820a-cfb6eb5cf01b

Streaming azureml-logs/hyperdrive.txt

[2023-01-19T16:39:05.507582][GENERATOR][INFO]Trying to sample '2' jobs from the hyperparameter space
[2023-01-19T16:39:06.4306644Z][SCHEDULER][INFO]Scheduling job, id='HD_c4fabe37-8835-473c-8647-824413410f00_1' 
[2023-01-19T16:39:06.3481073Z][SCHEDULER][INFO]Scheduling job, id='HD_c4fabe37-8835-473c-8647-824413410f00_0' 
[2023-01-19T16:39:06.381402][GENERATOR][INFO]Successfully sampled '2' jobs, they will soon be submitted to the execution target.
[2023-01-19T16:39:06.5913864Z][SCHEDULER][INFO]Successfully scheduled a job. Id='HD_c4fabe37-8835-473c-8647-824413410f00_0' 
[2023-01-19T16:39:06.6712993Z][SCHEDULER][INFO]Successfully scheduled a job. Id='