## WML-A Job Submission via WML-A CLI for DDP training

Official examples can be found here: https://wmla-console-cpd-wmla.apps.cpd.mskcc.org/ui/#/cliTools

This example uses a shared storage volume between the CPD project and the training job in WMLA.

### Setup

Let's first set up some important paths.

*HOST* is the WMLA console endpoint that the CLI talks to, and *BASE_URL* is the CPD base URL.

*dlicmd* holds the path to the locally downloaded WMLA CLI tool.

In [1]:
%env HOST=wmla-console-cpd.apps.cpd.mskcc.org
%env BASE_URL=https://cpd-cpd.apps.cpd.mskcc.org

%env dlicmd=../wmla-utils/dlicmd.py

env: HOST=wmla-console-cpd.apps.cpd.mskcc.org
env: BASE_URL=https://cpd-cpd.apps.cpd.mskcc.org
env: dlicmd=../wmla-utils/dlicmd.py
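
Every later cell relies on these variables, so it can help to check them before submitting anything. The following is a minimal sketch, not part of the original notebook: `missing_env` is a hypothetical helper, and it assumes `USER_ACCESS_TOKEN` is already present in the notebook environment, since the later cells use it without defining it.

```python
import os

def missing_env(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

# Names used by the cells in this notebook; USER_ACCESS_TOKEN is assumed
# to be provided by the environment rather than set here.
required = ["HOST", "BASE_URL", "dlicmd", "USER_ACCESS_TOKEN"]
missing = missing_env(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```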


Next we set up the volume that we will use to store data and training artifacts.

We will need its display name, the path to the data, and the path where the trained model will be saved.

The paths are the same in the CPD project environment and in the WMLA environment where the training runs.

In [6]:
VOLUME_DISPLAY_NAME='cpd::demo-project-pvc'

%env DRIVE_DATA_PATH=/mnts/demo_project_pvc/data
%env DRIVE_MODEL_PATH=/mnts/demo_project_pvc/model

env: DRIVE_DATA_PATH=/mnts/demo_project_pvc/data
env: DRIVE_MODEL_PATH=/mnts/demo_project_pvc/model
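
Because the job will read and write these same paths, a quick check that they exist on the mounted volume can catch a missing mount early. This is a hedged sketch; `check_dirs` is a hypothetical helper, and it only tests that the directories exist, not their permissions.

```python
import os

def check_dirs(paths):
    """Map each path to True if it is an existing directory, else False."""
    return {path: os.path.isdir(path) for path in paths}

# Read the two paths set above; skip any that are not defined.
candidates = (os.getenv("DRIVE_DATA_PATH"), os.getenv("DRIVE_MODEL_PATH"))
paths = [p for p in candidates if p]
for path, ok in check_dirs(paths).items():
    print(f"{path}: {'ok' if ok else 'not found'}")
```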


Now let's download the data.

In [7]:
import torchvision
import os
from torchvision import transforms

transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root=os.getenv('DRIVE_DATA_PATH'), train=True,
                                            download=True, transform=transform)
testset = torchvision.datasets.CIFAR10(root=os.getenv('DRIVE_DATA_PATH'), train=False,
                                       download=True, transform=transform)

Files already downloaded and verified
Files already downloaded and verified
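
Since the training job reads the dataset from the shared volume rather than downloading it, it may be worth confirming that the extracted files are where torchvision put them. The sketch below assumes the standard layout that `torchvision.datasets.CIFAR10` extracts (`cifar-10-batches-py` under the root); `cifar10_files_present` is a hypothetical helper.

```python
import os

def cifar10_files_present(root):
    """Check for the files torchvision's CIFAR10 extracts under <root>/cifar-10-batches-py."""
    folder = os.path.join(root, "cifar-10-batches-py")
    expected = [f"data_batch_{i}" for i in range(1, 6)] + ["test_batch", "batches.meta"]
    return all(os.path.isfile(os.path.join(folder, name)) for name in expected)

root = os.getenv("DRIVE_DATA_PATH")
if root:
    print("CIFAR-10 files present:", cifar10_files_present(root))
```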


### Submit Jobs

First we select the folder to submit and the script to be executed from it.

In [8]:
%env DIR_job_submission=/userfs/ddp-tutorial/job_submission
%env file_exec=train_wmla.py

env: DIR_job_submission=/userfs/ddp-tutorial/job_submission
env: file_exec=train_wmla.py


In [9]:
# JSON description of the volume connection
data_source = '[{"type":"fs","location":{"volume":"%s"}}]' % VOLUME_DISPLAY_NAME

%env DATA_SOURCE = $data_source

env: DATA_SOURCE=[{"type":"fs","location":{"volume":"cpd::demo-project-pvc"}}]
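
Hand-written JSON strings are easy to get wrong when a volume name contains quotes or backslashes. As an alternative sketch, the same string can be produced with `json.dumps`; `build_data_source` is a hypothetical helper, and the compact separators are chosen only to keep the value free of spaces, as in the output above.

```python
import json

def build_data_source(volume_display_name):
    """Serialize the volume description in the same shape as the cell above."""
    spec = [{"type": "fs", "location": {"volume": volume_display_name}}]
    # Compact separators avoid spaces, so the value stays a single shell word.
    return json.dumps(spec, separators=(",", ":"))

print(build_data_source("cpd::demo-project-pvc"))
```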


And now we send the job.

In [10]:
!python $dlicmd --exec-start distPyTorch --rest-host $HOST --rest-port -1 --jwt-token $USER_ACCESS_TOKEN \
                  --msd-env USER_ACCESS_TOKEN=$USER_ACCESS_TOKEN --msd-env BASE_URL=$BASE_URL \
                  --msd-env DRIVE_DATA_PATH=$DRIVE_DATA_PATH --msd-env DRIVE_MODEL_PATH=$DRIVE_MODEL_PATH \
                  --numWorker 6 --workerMemory 8g \
                  --model-dir $DIR_job_submission --model-main $file_exec \
                  --data-source $DATA_SOURCE \
                  --appName "DDP Tutorial"

Copying files and directories ...
Content size: 2.8K
{
  "execId": "cpd-111",
  "appId": "cpd-111"
}
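
The submission above can also be assembled as an argument list, which avoids shell quoting issues and makes the worker count and memory easy to change. This is a sketch under assumptions: `build_submit_args` is a hypothetical helper, it uses only the flags that appear in the command above, and the values in the example are placeholders rather than real credentials.

```python
import shlex

def build_submit_args(dlicmd, host, token, msd_env, model_dir, model_main,
                      data_source, app_name, num_workers=6, worker_memory="8g"):
    """Build the dlicmd submission command as a list of arguments."""
    args = ["python", dlicmd, "--exec-start", "distPyTorch",
            "--rest-host", host, "--rest-port", "-1", "--jwt-token", token]
    # One --msd-env flag per variable forwarded to the training job.
    for key, value in msd_env.items():
        args += ["--msd-env", f"{key}={value}"]
    args += ["--numWorker", str(num_workers), "--workerMemory", worker_memory,
             "--model-dir", model_dir, "--model-main", model_main,
             "--data-source", data_source, "--appName", app_name]
    return args

# Placeholder values for illustration; in the notebook these would come from os.environ.
example = build_submit_args(
    dlicmd="../wmla-utils/dlicmd.py",
    host="wmla-console-cpd.apps.cpd.mskcc.org",
    token="<token>",
    msd_env={"DRIVE_DATA_PATH": "/mnts/demo_project_pvc/data",
             "DRIVE_MODEL_PATH": "/mnts/demo_project_pvc/model"},
    model_dir="/userfs/ddp-tutorial/job_submission",
    model_main="train_wmla.py",
    data_source='[{"type":"fs","location":{"volume":"cpd::demo-project-pvc"}}]',
    app_name="DDP Tutorial",
)
print(shlex.join(example))
```

The resulting list can be passed to `subprocess.run(example, capture_output=True, text=True)` to run the submission and capture its output.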


In [11]:
# Set APP_ID to the appId returned by the submission above
%env APP_ID=cpd-111

env: APP_ID=cpd-111
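
Copying the ID by hand is easy to forget when resubmitting. As a sketch, the ID can be read from the submission output instead; `extract_app_id` is a hypothetical helper that assumes the output ends with a single JSON object, as in the output shown above. In a notebook, the output can be captured with `output = !python $dlicmd ...` and joined with newlines.

```python
import json

def extract_app_id(output_text):
    """Return the appId from the JSON object that ends the submission output."""
    start = output_text.index("{")
    end = output_text.rindex("}") + 1
    return json.loads(output_text[start:end])["appId"]

# Sample copied from the submission output above.
sample = """Copying files and directories ...
Content size: 2.8K
{
  "execId": "cpd-111",
  "appId": "cpd-111"
}
"""
print(extract_app_id(sample))
```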



### Delete Jobs (and associated results/logs)
#### Delete one job

In [13]:
# Uncomment to delete a job by its ID (here the literal ID cpd-38, without a leading $)
# !python $dlicmd --exec-delete cpd-38 --rest-host $HOST --rest-port -1 --jwt-token $USER_ACCESS_TOKEN

### Get Job Status

In [14]:
!python $dlicmd --exec-get $APP_ID --rest-host $HOST --rest-port -1 --jwt-token $USER_ACCESS_TOKEN

{
  "id": "cpd-111",
  "args": "--exec-start distPyTorch --msd-env USER_ACCESS_TOKEN=<token redacted; output truncated here>
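
The `args` field echoes the submission arguments, including the value passed as `USER_ACCESS_TOKEN`, so this output should not be shared as-is. A minimal sketch for masking such tokens before saving or sharing output follows; `redact_tokens` is a hypothetical helper, and the pattern assumes the usual JWT shape of three dot-separated base64url segments whose first segment starts with `eyJ`.

```python
import re

# Three base64url segments separated by dots; a JSON header typically encodes to "eyJ...".
JWT_PATTERN = re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")

def redact_tokens(text, placeholder="<redacted>"):
    """Replace anything that looks like a JWT with a placeholder."""
    return JWT_PATTERN.sub(placeholder, text)

# Made-up token-shaped string for illustration only.
line = "--msd-env USER_ACCESS_TOKEN=eyJhbGci.eyJ1aWQi.UQbBSX4 --numWorker 6"
print(redact_tokens(line))
```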

### Get Job Log
#### Last 10 lines

In [15]:
!python $dlicmd --exec-outlogs $APP_ID --rest-host $HOST --rest-port -1 --jwt-token $USER_ACCESS_TOKEN

Executor 1 stdout
*Task <1> SubProcess*: drwxrwx---. 3 1000670000 103000 4096 Oct 27 12:17 data
*Task <1> SubProcess*: drwxr-x---. 8 1000670000 root   4096 Oct 27 16:18 model
*Task <1> SubProcess*: 2023-10-27 16:27:57.761186 450 INFO Save log files under /gpfs/myresultfs/kharlad/batchworkdir/cpd-111/log/app.cpd-111-task12n-fdrrn
*Task <1> SubProcess*: 2023-10-27 16:27:57.777017 450 INFO Start running user model
*Task <1> SubProcess*: /gpfs/myresultfs/kharlad/batchworkdir/cpd-111/_submitted_code/job_submission/train_wmla.py Tutorial
*Task <1> SubProcess*: ------ initiate process group... ------
*Task <1> SubProcess*: RANK: 0 0
*Task <1> SubProcess*: Training...
*Task <1> SubProcess*: Validation set: Average loss: 2.3032	Accuracy 0.1014
*Task <1> SubProcess*: ** Validation: 0.101400 (best) - 0.101400 (current)


Executor 2 stdout
*Task <2> SubProcess*: total 2
*Task <2> SubProcess*: drwxr-x---. 7 1000670000 root   4096 Oct 25 14:44 checkpoints
*Task <2> SubProcess*: drwxrwx---. 3 1000670
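
To follow training progress without reading the raw log, the validation lines can be parsed into numbers. The sketch below assumes the line format printed by this tutorial's training script, as seen in the Executor 1 output above; `parse_validation` is a hypothetical helper.

```python
import re

# Matches lines such as "Validation set: Average loss: 2.3032<TAB>Accuracy 0.1014".
VALIDATION_PATTERN = re.compile(
    r"Validation set: Average loss: ([0-9.]+)\s+Accuracy ([0-9.]+)"
)

def parse_validation(log_text):
    """Return a list of (average_loss, accuracy) tuples in log order."""
    return [(float(loss), float(acc)) for loss, acc in VALIDATION_PATTERN.findall(log_text)]

# Sample lines copied from the log output in this notebook.
sample_log = (
    "*Task <1> SubProcess*: Training...\n"
    "*Task <1> SubProcess*: Validation set: Average loss: 2.3032\tAccuracy 0.1014\n"
    "*Task <1> SubProcess*: ** Validation: 0.101400 (best) - 0.101400 (current)\n"
)
print(parse_validation(sample_log))
```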

### Full logs

In [16]:
!python $dlicmd --exec-trainoutlogs $APP_ID --rest-host $HOST --rest-port -1 --jwt-token $USER_ACCESS_TOKEN 

------ initiate process group... ------
RANK: 3 0
Training...
------ initiate process group... ------
RANK: 1 0
Training...
------ initiate process group... ------
RANK: 4 0
Training...
------ initiate process group... ------
RANK: 5 0
Training...
------ initiate process group... ------
RANK: 0 0
Training...
Validation set: Average loss: 2.3032	Accuracy 0.1014
** Validation: 0.101400 (best) - 0.101400 (current)
Validation set: Average loss: 2.3019	Accuracy 0.1121
** Validation: 0.112100 (best) - 0.112100 (current)
------ initiate process group... ------
RANK: 2 0
Training...
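
The job was submitted with `--numWorker 6`, and the log above shows one `RANK:` line per worker, in arbitrary order. A small sketch for confirming that every rank started is shown here; `started_ranks` and `missing_ranks` are hypothetical helpers, and the code treats the first number on each `RANK:` line as the global rank, which is an assumption about the training script's print statement.

```python
import re

def started_ranks(log_text):
    """Collect the first number of each 'RANK: <a> <b>' line, sorted and de-duplicated."""
    return sorted({int(m.group(1)) for m in re.finditer(r"RANK: (\d+) (\d+)", log_text)})

def missing_ranks(log_text, num_workers):
    """Return the ranks in range(num_workers) that never appear in the log."""
    return sorted(set(range(num_workers)) - set(started_ranks(log_text)))

# RANK lines copied from the full log above.
sample_log = "\n".join(f"RANK: {r} 0" for r in (3, 1, 4, 5, 0, 2))
print("started:", started_ranks(sample_log))
print("missing:", missing_ranks(sample_log, 6))
```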



#### Errors

In [17]:
!python $dlicmd --exec-trainerrlogs $APP_ID --rest-host $HOST --rest-port -1 --jwt-token $USER_ACCESS_TOKEN


