# WML-A Model Deployment via CLI
Official examples can be found here: https://wmla-console-cpd-wmla.apps.cpd.mskcc.org/ui/#/cliTools

In [1]:
%env DIR=/userfs/deployment-tutorial
%env REST_SERVER=https://wmla-console-cpd.apps.cpd.mskcc.org/dlim/v1/

%env dlim=../wmla-utils/dlim

env: DIR=/userfs/deployment-tutorial
env: REST_SERVER=https://wmla-console-cpd.apps.cpd.mskcc.org/dlim/v1/
env: dlim=../wmla-utils/dlim


In [2]:
import os
os.environ['auth'] = f"--rest-server {os.environ['REST_SERVER']} --jwt-token {os.environ['USER_ACCESS_TOKEN']}"
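
If `REST_SERVER` or `USER_ACCESS_TOKEN` is unset, the cell above fails with a bare `KeyError`. The sketch below builds the same flag string but reports which variable is missing. The helper name `build_auth` is hypothetical and not part of the `dlim` tooling.

```python
import os

def build_auth(env=os.environ):
    """Return the dlim auth flags, failing early if a variable is missing."""
    required = ("REST_SERVER", "USER_ACCESS_TOKEN")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return f"--rest-server {env['REST_SERVER']} --jwt-token {env['USER_ACCESS_TOKEN']}"
```

In the notebook this could be used as `os.environ['auth'] = build_auth()`.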

## 1. Create Model Deployment
The model (deployment) name is specified in `model.json` and must not contain whitespace.
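
A quick check before deploying can catch a name that contains whitespace. This is a minimal sketch that assumes the name is stored under a top-level `"name"` key in `model.json`; verify that against your own file. The helper `check_model_name` is hypothetical.

```python
import json
import re

def check_model_name(model_json_text):
    """Return the deployment name, or raise if it is missing or contains whitespace."""
    # Assumption: the deployment name lives under the top-level "name" key.
    name = json.loads(model_json_text).get("name")
    if not name:
        raise ValueError("model.json has no 'name' entry")
    if re.search(r"\s", name):
        raise ValueError(f"deployment name {name!r} contains whitespace")
    return name
```

For example, `check_model_name(open('model.json').read())` would return the name or raise a `ValueError`.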

In [3]:
os.environ['REDHARE_MODEL_PATH']

KeyError: 'REDHARE_MODEL_PATH'

In [4]:
%env DIR_submission=/userfs/deployment-tutorial/deployment_submission
%env file_kernel=kernel.py

!rm -rf $DIR_submission
!mkdir -p $DIR_submission

!cp /userfs/training-tutorial/cifar-visdom/model/model.pt $DIR_submission
!cp $DIR/kernel.py $DIR_submission
!cp $DIR/model.json $DIR_submission
!cp $DIR/README.md $DIR_submission

env: DIR_submission=/userfs/deployment-tutorial/deployment_submission
env: file_kernel=kernel.py
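
The staging step above can also be written in Python, which makes it easy to confirm that every file exists before the submission folder is wiped. This is a sketch rather than part of the original workflow; `stage_submission` is a hypothetical helper.

```python
import shutil
from pathlib import Path

def stage_submission(sources, dest):
    """Recreate dest and copy each source file into it; return the copied file names."""
    dest = Path(dest)
    # Check sources first so nothing is deleted when a file is missing.
    missing = [str(s) for s in sources if not Path(s).is_file()]
    if missing:
        raise FileNotFoundError(f"missing files: {missing}")
    shutil.rmtree(dest, ignore_errors=True)     # same effect as rm -rf
    dest.mkdir(parents=True, exist_ok=True)     # same effect as mkdir -p
    for source in sources:
        shutil.copy(source, dest)
    return sorted(p.name for p in dest.iterdir())
```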


In [5]:
!$dlim model deploy -p $DIR_submission $auth -f

Uploading...
</userfs/deployment-tutorial/deployment_submission/README.md> uploaded to server.
</userfs/deployment-tutorial/deployment_submission/kernel.py> uploaded to server.
</userfs/deployment-tutorial/deployment_submission/model.json> uploaded to server.
</userfs/deployment-tutorial/deployment_submission/model.pt> uploaded to server.
Registering...
Model <cifar-model-wendy-demo> is deployed successfully


A newly created deployment is not active until it is started; in the listing below it appears with `-` in the REST URI column.

In [6]:
!$dlim model list $auth

NAME                      REST URI
alpaca-65b                https://wmla-inference-cpd.apps.cpd.mskcc.org/dlim/v1/inference/alpaca-65b
alpaca-7b                 https://wmla-inference-cpd.apps.cpd.mskcc.org/dlim/v1/inference/alpaca-7b
cifar-model-wendy         -
cifar-model-wendy-demo    -
deepliif-deployment       -
deepliif-deployment-five  -
deepliif-deployment-four  -
hat-nn-level0-v1          -
hat-nn-level05-v1         -
hat-nn-level1-v1          -
msk-benefits-qa           -
msk-benefits-qa-test      -
nlp-curation              https://wmla-inference-cpd.apps.cpd.mskcc.org/dlim/v1/inference/nlp-curation
redcap-nlp                -
sample-test-02            -
test1                     -
toy-app-petrides          https://wmla-inference-cpd.apps.cpd.mskcc.org/dlim/v1/inference/toy-app-petrides
toymodel                  -
w2-extract-qa-v2          -
w2-extract-qa-v3          -
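
To work with this listing programmatically, the two-column text can be parsed into a dictionary. The sketch assumes the layout shown above (a header line, then a name and either a URI or `-`) and treats `-` as "no REST URI". The function `parse_model_list` is hypothetical.

```python
def parse_model_list(text):
    """Map each deployment name to its REST URI, or None when the column shows '-'."""
    models = {}
    for line in text.strip().splitlines()[1:]:   # skip the NAME / REST URI header
        parts = line.split()
        if len(parts) != 2:
            continue                              # ignore blank or unexpected lines
        name, uri = parts
        models[name] = None if uri == "-" else uri
    return models

sample = """NAME                      REST URI
alpaca-7b                 https://wmla-inference-cpd.apps.cpd.mskcc.org/dlim/v1/inference/alpaca-7b
cifar-model-wendy-demo    -
"""
models = parse_model_list(sample)
with_uri = [name for name, uri in models.items() if uri]
print(with_uri)  # ['alpaca-7b']
```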


In [7]:
%env model_name=cifar-model-wendy-demo

env: model_name=cifar-model-wendy-demo


## 2. Modify Configurations

Some settings can **only** be specified or modified after the deployment has been created.

These settings can be changed on an existing deployment: stop the deployment, update the setting, and start it again. The new value takes effect immediately, with no need to redeploy.

One example is resource usage.

The full list of configurable parameters in this category is in the IBM documentation: https://www.ibm.com/docs/en/wmla/2.3?topic=inference-edit-service
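
In the profile printed by `viewprofile` in this notebook, resource usage is stored as a comma-separated string such as `ncpus=0.5,ncpus_limit=2,mem=1024,mem_limit=4096`. The sketch below edits one field of such a string without retyping the rest. The helper names are hypothetical, and the new value is only an illustration; valid ranges and units are not stated here.

```python
def parse_resources(spec):
    """Split 'key=value,key=value' into a dict of strings, keeping the order."""
    return dict(item.split("=", 1) for item in spec.split(",") if item)

def format_resources(values):
    """Join a dict back into the 'key=value,key=value' form."""
    return ",".join(f"{key}={value}" for key, value in values.items())

spec = "ncpus=0.5,ncpus_limit=2,mem=1024,mem_limit=4096"
resources = parse_resources(spec)
resources["mem"] = "2048"          # illustrative value only
print(format_resources(resources))  # ncpus=0.5,ncpus_limit=2,mem=2048,mem_limit=4096
```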

In [8]:
!$dlim model viewprofile $model_name -j $auth > model_profile.json

In [9]:
import json

with open('model_profile.json') as f:
    profile = json.load(f)
profile

{'schema_version': '1.2',
 'type': 'inference',
 'name': 'cifar-model-wendy-demo',
 'create_time': 'Wed Sep  6 19:59:14 2023 GMT',
 'last_update_time': 'Wed Sep  6 19:59:14 2023 GMT',
 'replica': 1,
 'policy': {'name': 'capacity',
  'schedule_interval': 3,
  'kernel_min': 1,
  'kernel_max': 100,
  'kernel_delay_release_time': 60,
  'task_execution_timeout': 60,
  'task_batch_size': 1,
  'task_pipe_size': 1,
  'task_parallel_size': 1,
  'stream_number_per_group': 0,
  'stream_discard_slow_tasks': True},
 'security': {'ssl': {'enable': True,
   'server_crt': '${REDHARE_TOP}/security/tls.crt',
   'server_key': '${REDHARE_TOP}/security/tls.key'}},
 'resource_allocation': {'service': {'type': 'k8s',
   'namespace': '',
   'image_name': '',
   'node_selector': ''},
  'kernel': {'type': 'msd',
   'namespace': '',
   'image_name': '',
   'resource_plan': 'sample-project/inference',
   'resources': 'ncpus=0.5,ncpus_limit=2,mem=1024,mem_limit=4096',
   'accelerator_resources': '',
   'gpu_pack_i

In [10]:
profile['kernel']['gpu'] = 'exclusive'
profile['policy']['kernel_max'] = 3
# profile['kernel']['envs'] = [{'name': 'DLIM_MK_LOG_LEVEL', 'value': 'DEBUG'}] # switch to DEBUG to view debug level logs

In [11]:
with open('model_profile.json','w') as f:
    json.dump(profile, f)

In [12]:
!$dlim model updateprofile $model_name -f model_profile.json $auth

Model is updated successfully


## 3. Start Deployment

In [13]:
!$dlim model start $model_name $auth 

Starting model "cifar-model-wendy-demo", run "dlim model view cifar-model-wendy-demo -s" to ensure startup.


In [14]:
# Without $auth (which carries --rest-server) the CLI cannot find the cluster, so this call fails
!$dlim model view cifar-model-wendy -s

2023/09/06 19:59:47 Cluster URL is not set. Contact EDI cluster administrator to get it and set it as following example:
  "dlim config -c https://localhost:9000/dlim/v1/"
Or use "--rest-server https://localhost:9000/dlim/v1/" option.


In [15]:
!$dlim model view $model_name $auth

Name:		cifar-model-wendy-demo
Tag:		-
Model path:	/opt/wml-edi/repo/cifar-model-wendy-demo/cifar-model-wendy-demo-20230906-195914
Size:		248.39KB
Weight path:	./
Runtime:	dlipy3
Kernel path:	kernel.py
Creator:	kharlad
Create time:	Wed Sep  6 19:59:14 UTC 2023
Update time:	Wed Sep  6 19:59:14 UTC 2023
REST URI:	https://wmla-inference-cpd.apps.cpd.mskcc.org/dlim/v1/inference/cifar-model-wendy-demo
Attributes:	No attribute defined
Environments:	No environment variable defined
Schema version:	1


In [16]:
!$dlim model view $model_name -s $auth

Name:             cifar-model-wendy-demo
State:            Started
Serving replica:  1
Serving service ID:   babc0502-541d-4a4e-93d0-a819ee361aa0
Service JobID:        edi-cifar-model-wendy-demo-bf9b4cbdc-vxftw
GPU Mode:             exclusive
Served clients:       0
Pending requests:     0
Requests per second:  0.00
Data per second:      0.00
Kernel started:       0


### Test Deployment

In [17]:
from PIL import Image
from io import BytesIO
import base64

img = Image.open('camion_s_000148.png')

buffer = BytesIO()
img.save(buffer, 'PNG')

img_bytes = base64.b64encode(buffer.getvalue()).decode('utf-8')

In [18]:
img_bytes

'iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAIAAAD8GO2jAAAKGklEQVR4nAXBWXNcZ0IA0G+7e/e9va9qqSVZUmTLWxYTJyGhamDMLNRADVDwwh+gigd+Do8UD/BA1cxUUinGkCGTwfG44nG8yIusfe1Wb7fvfr+Vc+C//PL+6evHo4NXQpDm4juLq5vl1qJpkZ3tB0e7z1gYYUHcskdM+87Hn15ZfyebT7dfPJGSUpa93H4e+OOc5ozi6SSJkoyLvF6vlCsFoULOQJYqEsym1VJF1ZuKuO3FFSEZkolMeDabqDTr1hqLvSu9K0ud7kKj0dQ0g5fs3kKLc5plqT+LxuMp0U0AcblqmE46D2aGSaTiGjGCuU9zRQBjNGdJQvvr3SiOKcsqNY9oaG1t/aMP3+82FzyvzoiwTYMoADlP4yhnzLbscqmxunL11as3ALI8Tzy3rOlgHgwVoFKq2SxOk1wpQHiWQi4M3ZqPx9XWwuK1K41eR9N0wBnj2euLSbI/Yoi+ef70g82rn975QCkVBPPjo3NdM3XdrdW7xydvddOO0jgIxkSDrmunaSI44Fwahk7yJC5Yplupv3vzVm9lLeT8zf5JkCSR70/8ycVg5np1gPLP/+M/tb9Fn939RNNYq9UBauzPwj88eUY0wym6XCga+RiBer0iBJ1MxwjYhJBSySOGoTFcTK3CQZB+/7tH00l0dj7UMNSQzDnNMtquk8vBkWvooR/sHBy02zVNI+1eq9NrHQ9O3jw/abTrh8djwKSkUhBh6oZBtDQTrusSYhDbbl76fPfk5OX2C6QRkbM0jDGSaR74YRDG0eHpK8cqbqxuAE7/75v/XVpeXt9Yr1Y9wySeayA+j3OUJnnqh0JkpqVFQegWXcPElLIkSUipUts92bk4PLC1fB7PouASSumHkZ9mxNBqzYZV9Lr9mz0THzz9FkPKhBiNJ9evb15ZW+m164UPbz97fZxnZq5JCVyp+GBwrhuGV24AEKd

In [19]:
import urllib3
urllib3.disable_warnings()

#### Test: A Single Request

In [22]:
%%time
import requests

url = 'https://wmla-inference-cpd.apps.cpd.mskcc.org/dlim/v1/inference/cifar-model-wendy-demo'
headers = {'Authorization': 'Bearer ' + os.environ['USER_ACCESS_TOKEN']}

res = requests.post(url=url, headers=headers, json={'img':img_bytes}, verify=False)
res

CPU times: user 21.1 ms, sys: 1.39 ms, total: 22.5 ms
Wall time: 46.5 ms


<Response [503]>

In [23]:
res.text

'Inference service for model <cifar-model-wendy-demo> is currently unavailable.'
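
A 503 like the one above can appear when the request arrives before the inference service is ready. Assuming the condition is transient, a small retry loop with exponential backoff avoids re-running the cell by hand. `post_with_retry` is a hypothetical helper that accepts any zero-argument callable returning an object with a `status_code` attribute.

```python
import time

def post_with_retry(send, attempts=5, delay=2.0, sleep=time.sleep):
    """Call send() until it returns a non-503 response or attempts run out."""
    response = None
    for attempt in range(attempts):
        response = send()
        if response.status_code != 503:
            break
        if attempt < attempts - 1:
            sleep(delay * (2 ** attempt))   # wait 2s, 4s, 8s, ... between tries
    return response
```

In this notebook it could wrap the request as `post_with_retry(lambda: requests.post(url=url, headers=headers, json={'img': img_bytes}, verify=False))`.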

In [92]:
res = requests.post(url=url, headers=headers, json={'img':''}, verify=False)
res

<Response [200]>

In [93]:
res.text

'{"msg": "cannot identify image file <_io.BytesIO object at 0x7fd14f8b6520>"}'
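
Note that the service returned HTTP 200 even though the kernel reported an error in the body. A client therefore needs to inspect the payload as well as the status code. The sketch below assumes the two body shapes seen in this notebook, `{"pred_class": ...}` for success and `{"msg": ...}` for a kernel-side error. `interpret_response` is a hypothetical helper.

```python
import json

def interpret_response(status_code, body):
    """Return ('ok', predicted_class) or ('error', message) for an inference reply."""
    if status_code != 200:
        return ("error", body)
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return ("error", body)
    if isinstance(payload, dict) and "pred_class" in payload:
        return ("ok", payload["pred_class"])
    message = payload.get("msg", body) if isinstance(payload, dict) else body
    return ("error", message)
```

It would be called as `interpret_response(res.status_code, res.text)`.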

#### Test: Multiple Concurrent Requests
You can use `requests-futures` to issue asynchronous calls that mimic concurrent requests. Under the hood it uses multi-threading. More details are in the git repository: https://github.com/ross/requests-futures

Sending requests this way pushes the WMLA EDI deployment to create new kernel pods to handle the load, so you can measure and evaluate its performance.

*If an inference request is very short, say 20 ms, timings estimated this way can be far off because of overhead such as starting threads.*

*This is mainly a task for the MLOps team. A researcher does not need to worry about concurrent requests.*
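
The same idea can be sketched with only the standard library. This is a swapped-in alternative using `concurrent.futures`, not the `requests-futures` approach this notebook uses. Each call is timed inside its worker thread, so the measurement excludes time spent waiting in the pool queue. `fake_request` is a stand-in for the real `requests.post(...)` call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed_call(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

def run_concurrently(fn, n_requests=20, max_workers=8):
    """Submit fn n_requests times to a thread pool; return per-call (result, seconds)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(timed_call, fn) for _ in range(n_requests)]
        return [future.result() for future in futures]

def fake_request():
    # Stand-in for requests.post(...); sleeps instead of calling the service.
    time.sleep(0.01)
    return 200

timings = run_concurrently(fake_request)
latencies = sorted(seconds for _, seconds in timings)
print(f"median latency: {latencies[len(latencies) // 2] * 1000:.1f} ms")
```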

In [26]:
!pip install requests-futures



In [27]:
import datetime

In [28]:
%%time
from requests_futures.sessions import FuturesSession

session = FuturesSession()
params = {'url':url,  # same endpoint as the single-request test above
          'headers':headers, 
          'json':{'img':img_bytes},
          'verify':False}

l_sessions = []
for i in range(100):
    future_session = session.post(**params)
    l_sessions.append(future_session)
    print(datetime.datetime.now(),i)

2022-05-11 03:14:30.546037 0
2022-05-11 03:14:30.548319 1
2022-05-11 03:14:30.551203 2
2022-05-11 03:14:30.559417 3
2022-05-11 03:14:30.564920 4
2022-05-11 03:14:30.573180 5
2022-05-11 03:14:30.582728 6
2022-05-11 03:14:30.584893 7
2022-05-11 03:14:30.585241 8
2022-05-11 03:14:30.585745 9
2022-05-11 03:14:30.585796 10
2022-05-11 03:14:30.586030 11
2022-05-11 03:14:30.586055 12
2022-05-11 03:14:30.586079 13
2022-05-11 03:14:30.586106 14
2022-05-11 03:14:30.586135 15
2022-05-11 03:14:30.586172 16
2022-05-11 03:14:30.586585 17
2022-05-11 03:14:30.586804 18
2022-05-11 03:14:30.589804 19
2022-05-11 03:14:30.589961 20
2022-05-11 03:14:30.592041 21
2022-05-11 03:14:30.592341 22
2022-05-11 03:14:30.592774 23
2022-05-11 03:14:30.592830 24
2022-05-11 03:14:30.593263 25
2022-05-11 03:14:30.593294 26
2022-05-11 03:14:30.593320 27
2022-05-11 03:14:30.593344 28
2022-05-11 03:14:30.593375 29
2022-05-11 03:14:30.593397 30
2022-05-11 03:14:30.593836 31
2022-05-11 03:14:30.593875 32
2022-05-11 03:14:30.

In [29]:
%%time
l_res = []
for i in range(100):
    l_res.append(l_sessions[i].result())
    print(datetime.datetime.now(),i)

2022-05-11 03:14:31.584639 0
2022-05-11 03:14:31.584691 1
2022-05-11 03:14:32.936480 2
2022-05-11 03:14:32.936568 3
2022-05-11 03:14:33.235104 4
2022-05-11 03:14:33.235211 5
2022-05-11 03:14:33.235229 6
2022-05-11 03:14:33.235244 7
2022-05-11 03:14:33.235260 8
2022-05-11 03:14:33.235284 9
2022-05-11 03:14:34.731991 10
2022-05-11 03:14:34.732079 11
2022-05-11 03:14:34.732096 12
2022-05-11 03:14:34.732112 13
2022-05-11 03:14:34.833380 14
2022-05-11 03:14:34.833470 15
2022-05-11 03:14:34.833488 16
2022-05-11 03:14:34.833504 17
2022-05-11 03:14:37.130729 18
2022-05-11 03:14:37.130838 19
2022-05-11 03:14:37.130865 20
2022-05-11 03:14:37.130880 21
2022-05-11 03:14:38.140140 22
2022-05-11 03:14:38.140251 23
2022-05-11 03:14:38.140269 24
2022-05-11 03:14:38.140284 25
2022-05-11 03:14:38.833944 26
2022-05-11 03:14:38.834080 27
2022-05-11 03:14:38.834133 28
2022-05-11 03:14:38.834173 29
2022-05-11 03:14:41.035306 30
2022-05-11 03:14:41.035418 31
2022-05-11 03:14:41.035434 32
2022-05-11 03:14:41.

In [30]:
for res in l_res:
    print(res.text)

{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_class": 0}
{"pred_cla
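
With many responses, a tally is easier to read than the raw printout. The sketch below counts `pred_class` values and groups any reply without that key, or with a non-JSON body, under `'error'`. `count_predictions` is a hypothetical helper.

```python
import json
from collections import Counter

def count_predictions(bodies):
    """Count pred_class values; replies without that key are tallied under 'error'."""
    counts = Counter()
    for body in bodies:
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            payload = {}
        if isinstance(payload, dict):
            key = payload.get("pred_class", "error")
        else:
            key = "error"
        counts[key] += 1
    return counts
```

In this notebook it could be called as `count_predictions(res.text for res in l_res)`.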

## 4. Stop Deployment
You need to stop a deployment when
- you want to change configurable parameters for the existing deployment, or
- you want to delete this deployment
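
For the first case, the stop, update, start cycle can be written down as an ordered list of commands. The sketch only builds argument lists from the subcommands and flags already used in this notebook; it does not run anything. `reconfigure_commands` is a hypothetical helper.

```python
def reconfigure_commands(dlim, model_name, profile_path, auth_args):
    """Return the dlim commands, in order, for changing a deployment's profile."""
    return [
        [dlim, "model", "stop", model_name, *auth_args, "-f"],
        [dlim, "model", "updateprofile", model_name, "-f", profile_path, *auth_args],
        [dlim, "model", "start", model_name, *auth_args],
    ]

commands = reconfigure_commands(
    "../wmla-utils/dlim",
    "cifar-model-wendy-demo",
    "model_profile.json",
    ["--rest-server", "<REST_SERVER>", "--jwt-token", "<TOKEN>"],  # placeholders
)
for command in commands:
    print(" ".join(command))
```

Since `stop` and `start` return before the state change completes, check `dlim model view <name> -s` between steps when running them.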

In [24]:
!$dlim model stop $model_name $auth -f

Stopping model "cifar-model-wendy-demo", run "dlim model view cifar-model-wendy-demo -s" to ensure stop.


## 5. Remove Deployment

In [25]:
!$dlim model undeploy $model_name $auth -f

Undeployed model "cifar-model-wendy-demo", run "dlim model list" to ensure deletion.


In [26]:
!$dlim model list $auth

NAME                      REST URI
alpaca-65b                https://wmla-inference-cpd.apps.cpd.mskcc.org/dlim/v1/inference/alpaca-65b
alpaca-7b                 https://wmla-inference-cpd.apps.cpd.mskcc.org/dlim/v1/inference/alpaca-7b
cifar-model-wendy         -
deepliif-deployment       -
deepliif-deployment-five  -
deepliif-deployment-four  -
hat-nn-level0-v1          -
hat-nn-level05-v1         -
hat-nn-level1-v1          -
msk-benefits-qa           -
msk-benefits-qa-test      -
nlp-curation              https://wmla-inference-cpd.apps.cpd.mskcc.org/dlim/v1/inference/nlp-curation
redcap-nlp                -
sample-test-02            -
test1                     -
toy-app-petrides          https://wmla-inference-cpd.apps.cpd.mskcc.org/dlim/v1/inference/toy-app-petrides
toymodel                  -
w2-extract-qa-v2          -
w2-extract-qa-v3          -
