Update to YOLO. Add Sagemaker deployment #229

daavoo · 2023-07-28T19:55:45Z

Migrate to YOLO

Made the decision because:

It was the framework I was already using on a similar project 😅
It feels like it is more popular these days compared to fast.ai.
Not only in general: in our particular case, more than one customer/prospect has explicitly reported using YOLO, but (afaik) we have not heard the same about fast.ai.
The framework logic for training felt simpler
If we remove the custom callback, it is YOLO.train vs ~60 lines of fast.ai

Sagemaker deployment

I am not sure the approach here is the best, but at least we can start the discussion.
Created https://www.notion.so/iterative/Sagemaker-Deployment-fce94fa0d81c44f395c10fc32940ab82#f3773819ebce4659ad397ed92686d56c to discuss.

TODO:

Update README
Update dvc.org references to the repo

shcheklein · 2023-08-02T19:09:24Z

@daavoo could you please point me to some details / a summary, or update the description - e.g. why YOLO (do we want / do we use the callback there?), what else has changed and why, etc. That would make it easier for me to review. Thanks!

daavoo · 2023-08-02T19:16:26Z

example-get-started-experiments/code/src/train.py

+from ultralytics import YOLO
+
+
+def add_callbacks(live, yolo):


@shcheklein I am not using the built-in callback because:

I couldn't find a way to customize arguments (i.e. pass a custom dir, disable report)
It seems the other loggers in ultralytics rely on env vars.

Most of the image plots didn't feel useful / were redundant with existing plots.
I was planning to send a patch to ultralytics repo to skip some of the images (i.e. the confusion matrix, a big image with all the linear plots, etc)

This code below looks complicated for the example repo. I don't think it belongs there.

Most of the image plots didn't feel useful / were redundant with existing plots.
I was planning to send a patch to ultralytics repo to skip some of the images (i.e. the confusion matrix, a big image with all the linear plots, etc)

I would let users decide on that in the UX / UI (that should handle that case)

I couldn't find a way to customize arguments (i.e. pass a custom dir, disable report)
It seems the other loggers in ultralytics rely on env vars.

If we go with YOLO, I think we have no option but to figure this out: we can't maintain that level of duplication, and the code below is not simple at all to my taste (fine for an internal callback, but not for the example repo, especially if we need to put it into notebooks)
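For context, a custom callback along the lines discussed above could be sketched as follows. Only the `add_callbacks(live, yolo)` signature appears in the diff; the body here is an assumption, not the PR's code. It relies on `YOLO.add_callback`, the `on_fit_epoch_end` event and `trainer.metrics` from ultralytics, plus `log_metric` / `next_step` from DVCLive.

```python
def add_callbacks(live, yolo):
    """Forward per-epoch metrics from a YOLO trainer to a DVCLive logger.

    `live` is expected to behave like `dvclive.Live` and `yolo` like
    `ultralytics.YOLO`. The body is an illustrative guess, not the PR's code.
    """

    def on_fit_epoch_end(trainer):
        # ultralytics exposes the latest metrics as a name -> value dict.
        for name, value in trainer.metrics.items():
            live.log_metric(name, float(value))
        # Advance the DVCLive step once per epoch.
        live.next_step()

    yolo.add_callback("on_fit_epoch_end", on_fit_epoch_end)
    return yolo
```

Because the caller owns the `Live` instance, it can be created with a custom directory and `report=None`, which are the two arguments the built-in callback reportedly could not customize.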

daavoo · 2023-08-02T19:17:47Z

example-get-started-experiments/code/.github/workflows/deploy-model.yml

+    - run: dvc pull results/train/artifacts/best.pt.dvc
+
+    - env:
+        VERSION: ${{steps.clean_version_name.outputs.version}}
+      run: |
+        bash sagemaker/bundle_and_upload_model.sh \
+        results/train/artifacts/best.pt \
+        s3://pool-segmentation/$VERSION/model.tar.gz


This is option A of How to make the bundled model accessible to Sagemaker.

Explained here: https://www.notion.so/iterative/Sagemaker-Deployment-fce94fa0d81c44f395c10fc32940ab82#735578764d964f9d9c6d7ab9d8c5fa39

can we avoid moving and rebundling the model?
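To make the bundling step concrete, here is a rough Python equivalent of what a script like `bundle_and_upload_model.sh` might do. The shell script itself is not shown in this thread, so the layout is an assumption based on the SageMaker PyTorch convention of placing the weights at the archive root and the serving script under `code/`. The upload to S3 is left out.

```python
import tarfile
from pathlib import Path


def bundle_model(model_path, inference_script, output_path="model.tar.gz"):
    """Pack the weights and the serving script into a SageMaker-style archive."""
    model_path = Path(model_path)
    with tarfile.open(output_path, "w:gz") as tar:
        # Weights sit at the archive root.
        tar.add(model_path, arcname=model_path.name)
        # The serving script goes under code/, following the PyTorch
        # container convention (assumed to apply here).
        tar.add(inference_script, arcname="code/inference.py")
    return Path(output_path)
```

The resulting archive would then be copied to the `s3://.../$VERSION/model.tar.gz` location used in the workflow, for example with `aws s3 cp`.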

daavoo · 2023-08-02T19:19:25Z

example-get-started-experiments/code/.github/workflows/deploy-model.yml

+        VERSION: ${{steps.clean_version_name.outputs.version}}
+      run: |
+        python sagemaker/deploy_model.py \
+        --name $VERSION \
+        --model_data s3://pool-segmentation/$VERSION/model.tar.gz \
+        --role ${{ secrets.AWS_ROLE_TO_ASSUME }} \
+        --instance_type 'ml.c4.xlarge'


This is option A of How to create endpoints based on MR events.

Explained here: https://www.notion.so/iterative/Sagemaker-Deployment-fce94fa0d81c44f395c10fc32940ab82#f3773819ebce4659ad397ed92686d56c

daavoo · 2023-08-02T19:20:40Z

example-get-started-experiments/code/.github/workflows/deploy-model.yml

+on:
+  push:
+    # When a new version is registered in Studio Model Registry
+    tags:


Did not use the gto action because it did not simplify / add much value for the deployment pattern used
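Without the gto action, the workflow has to derive a version name from the pushed tag itself (the `clean_version_name` step referenced elsewhere in this workflow, whose body is not shown). A minimal sketch, assuming tags follow GTO's `name@vX.Y.Z` convention and that the result must be safe for a SageMaker endpoint name (alphanumerics and hyphens only); the example tag is hypothetical:

```python
import re


def clean_version_name(git_ref):
    """Turn a ref like 'refs/tags/pool-segmentation@v1.0.0' into a hyphenated name."""
    prefix = "refs/tags/"
    tag = git_ref[len(prefix):] if git_ref.startswith(prefix) else git_ref
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^A-Za-z0-9]+", "-", tag).strip("-")


print(clean_version_name("refs/tags/pool-segmentation@v1.0.0"))
# pool-segmentation-v1-0-0
```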

shcheklein · 2023-08-02T19:20:56Z

.devcontainer.json

@@ -1,6 +1,7 @@
 {
  "name": "example-repos-dev",
  "image": "mcr.microsoft.com/devcontainers/python:3.10",
+  "runArgs": ["--ipc=host"],


could you remind me please, why was it needed?

Without it, the PyTorch dataloaders ran out of memory
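Some background on that flag: PyTorch DataLoader worker processes hand batches back through shared memory, and Docker containers get a small `/dev/shm` (64 MB) by default, so `--ipc=host` lets the container use the host's shared memory instead. A quick way to check what a container actually has, as a sketch (the helper name is made up for illustration):

```python
import shutil
from pathlib import Path


def shared_memory_mib(path="/dev/shm"):
    """Return the size of the shared-memory mount in MiB, or None if it is absent."""
    if not Path(path).exists():
        return None
    return shutil.disk_usage(path).total // (1024 * 1024)


print(shared_memory_mib())
```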

daavoo · 2023-08-02T19:21:50Z

example-get-started-experiments/code/sagemaker/deploy_model.py

@@ -0,0 +1,46 @@
+import logging


This is just standard code suggested by the Sagemaker docs:

https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#deploy-pytorch-models
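For readers who have not opened that page, the pattern looks roughly like the sketch below. It is not the contents of `deploy_model.py`: the CLI flags mirror the ones used in the workflow, but `argparse` stands in for whatever argument handling the script really uses, and the `framework_version` / `py_version` values are assumptions. It also assumes the serving script is already bundled under `code/` in `model.tar.gz`, so no `entry_point` is passed.

```python
import argparse


def parse_args(argv=None):
    """Parse the same flags the workflow passes to the deploy script."""
    parser = argparse.ArgumentParser(
        description="Deploy a bundled model to a SageMaker endpoint."
    )
    parser.add_argument("--name", required=True)
    parser.add_argument("--model_data", required=True)
    parser.add_argument("--role", required=True)
    parser.add_argument("--instance_type", default="ml.c4.xlarge")
    return parser.parse_args(argv)


def deploy(name, model_data, role, instance_type="ml.c4.xlarge"):
    """Create a PyTorch model from an S3 archive and deploy it to an endpoint."""
    # Imported lazily so the argument parsing works without the SageMaker SDK.
    from sagemaker.pytorch import PyTorchModel

    model = PyTorchModel(
        model_data=model_data,
        role=role,
        framework_version="1.12.1",  # assumption, not taken from the PR
        py_version="py38",  # assumption, not taken from the PR
    )
    return model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        endpoint_name=name,
    )
```

A script built this way would call `deploy(args.name, args.model_data, args.role, args.instance_type)` after `args = parse_args()`.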

daavoo · 2023-08-02T19:22:46Z

example-get-started-experiments/code/src/create_yolo_dataset.py

This file could be removed if we create a version of the dataset in YOLO format instead of the current format with png masks

daavoo · 2023-08-02T19:23:20Z

example-get-started-experiments/code/src/inference.py

Just a simple script to showcase how to query the endpoint
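Querying a SageMaker endpoint from the client side usually goes through the `sagemaker-runtime` client in boto3. A minimal sketch, not the script from this PR: the content type and the JSON response shape are assumptions, and the `client` parameter exists only so the function can be exercised without AWS credentials.

```python
import json


def query_endpoint(image_bytes, endpoint_name, client=None,
                   content_type="application/x-image"):
    """Send raw image bytes to an endpoint and decode a JSON response."""
    if client is None:
        # Imported lazily so the function can be tested with a stub client.
        import boto3

        client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=image_bytes,
    )
    # The response body is a stream that has to be read before decoding.
    return json.loads(response["Body"].read())
```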

shcheklein · 2023-08-02T19:37:08Z

@daavoo

Made the decision because:

thanks for clarifying this. does it make it easier to deploy? I wonder if those changes are related to each other at all? could we have done them one by one? (easier to review, etc)

daavoo · 2023-08-02T19:40:15Z

thanks for clarifying this. does it make it easier to deploy?

I found a working example of YOLO + Sagemaker that I could adapt but I did not find that for fast.ai

I wonder if those changes are related to each other at all? could we have done them one by one? (easier to review, etc)

I can split into 2 parts. Yolo and then sagemaker

shcheklein · 2023-08-02T19:43:02Z

I found a working example of YOLO + Sagemaker that I could adapt but I did not find that for fast.ai

makes sense. Nw, let's keep it as is.

shcheklein · 2023-08-02T19:48:01Z

example-get-started-experiments/code/.github/workflows/deploy-model.yml

+    - name: Set up Python
+      uses: actions/setup-python@v4
+      with:
+        python-version: '3.8'


doesn't it work with a newer one?

shcheklein · 2023-08-02T19:50:44Z

example-get-started-experiments/code/.github/workflows/deploy-model.yml

+    - uses: aws-actions/configure-aws-credentials@v2
+      with:
+        aws-region: us-east-1
+        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}


can we use openid please?

shcheklein · 2023-08-02T19:56:52Z

example-get-started-experiments/code/README.md

@@ -11,8 +8,6 @@ This is an auto-generated repository for use in [DVC](https://dvc.org)
 This is a Computer Vision (CV) project that solves the problem of segmenting out 
 swimming pools from satellite images. 

-[Example results](./results/evaluate/plots/images/)


should we still try to include some examples?

shcheklein · 2023-08-02T19:57:26Z

example-get-started-experiments/code/data/.gitignore

@@ -1,3 +1,2 @@
 /pool_data
-/test_data
-/train_data
+/yolo_dataset


does it need to be prefixed with yolo_ - is that dataset yolo specific?

It is not needed. It is just a name that indicates the format

example-get-started-experiments/code/notebooks/TrainSegModel.ipynb

example-get-started-experiments/code/sagemaker/bundle_and_upload_model.sh

shcheklein · 2023-08-02T20:00:30Z

example-get-started-experiments/code/requirements.txt

what do fire and shapely do?

fire removes the boilerplate of adding argparse: it takes a function signature and makes it work with CLI arguments
shapely is needed to convert the png masks to yolo format, which expects polygons in a .txt
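To illustrate the target format: each line of a YOLO segmentation label file holds a class index followed by the polygon's vertices, normalized to the image size. The sketch below covers only that last formatting step; extracting polygons from the png masks (the part shapely is used for) is not shown, and the function name is made up for illustration.

```python
def polygon_to_yolo_line(points, width, height, class_id=0):
    """Format pixel-space (x, y) vertices as one YOLO segmentation label line."""
    coords = []
    for x, y in points:
        # YOLO expects coordinates normalized to the [0, 1] range.
        coords.append(f"{x / width:.6f}")
        coords.append(f"{y / height:.6f}")
    return " ".join([str(class_id), *coords])


print(polygon_to_yolo_line([(0, 0), (50, 0), (50, 100)], width=100, height=200))
# 0 0.000000 0.000000 0.500000 0.000000 0.500000 0.500000
```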

shcheklein · 2023-08-02T22:18:35Z

example-get-started-experiments/code/src/inference.py

+    output_path: str = "predictions",
+):
+
+    if endpoint_name is not None and model_path is not None:


why do we need two inference scripts (two python scripts)? Both of them seem to be related to endpoints

(sagemaker/code/inference.py) is bundled alongside the model file and uploaded to Sagemaker, it can't be used from a user perspective.
It includes methods following conventions required by Sagemaker.
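The conventions in question are the four hooks that the SageMaker PyTorch serving container looks up in the bundled script: `model_fn`, `input_fn`, `predict_fn` and `output_fn`. A hedged sketch of how such a script could look for a YOLO segmentation model follows; it is not the file from this PR. The `best.pt` file name, the accepted content type and the JSON output shape are all assumptions.

```python
import json


def model_fn(model_dir):
    """Load the weights that SageMaker unpacked from model.tar.gz."""
    # Imported lazily: ultralytics only needs to exist in the serving container.
    from ultralytics import YOLO

    return YOLO(f"{model_dir}/best.pt")  # file name is an assumption


def input_fn(request_body, request_content_type):
    """Accept raw image bytes and reject anything else."""
    if request_content_type != "application/x-image":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return request_body


def predict_fn(input_data, model):
    """Run the model and return predicted polygons as plain lists."""
    import io

    from PIL import Image

    image = Image.open(io.BytesIO(input_data))
    result = model(image)[0]
    if result.masks is None:
        return []
    return [polygon.tolist() for polygon in result.masks.xy]


def output_fn(prediction, accept):
    """Serialize the prediction as JSON."""
    return json.dumps({"polygons": prediction})
```

This also shows why the script is of little use on the client side: it is only ever called by the serving container, never by a user.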

example-get-started-experiments/code/src/train.py

shcheklein · 2023-08-02T22:22:45Z

example-get-started-experiments/generate.sh

 pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118
 pip install jupyter
+yolo settings datasets_dir=/workspaces/example-repos-dev/example-get-started-experiments/build/example-get-started-experiments/data


so, what will it look like in docs - will we have to explain all this in the get started?

No, this is only needed during generation inside codespaces because YOLO resolves the paths incorrectly inside generate.sh.
For someone cloning the repo, it works without running the command

shcheklein

Maybe it's worth splitting this indeed to unblock it faster. I like that we are trying yolo, it makes it more useful for actual customers and scenarios, but it seems we need to do some homework for it to look decent / simple to use and be something that can fit into the get started docs for experiments. I would love to hear @dberenbaum's opinion as well. Maybe we should take a leap of faith and just push through it, learn from it and keep making improvements, etc. Let's decide on this. Duplication and complexity (e.g. custom logger) bother me in this case.

On the deployment:

I think we need an app that we could drop an image, so that we can demo it
Do we keep instances / deployment running 24/7 for the demo purposes? What will be the workflow there? Can it be deployed in a serverless / on-demand mode?
Good questions on the lifecycle - which endpoint we keep alive, etc. Don't know the answer yet to this.
What is the lifecycle for code + models - can I change code and deploy with an older version of a model - how is it determined?

daavoo · 2023-08-03T13:47:32Z

Maybe it's worth splitting this indeed to unblock it faster. I like that we are trying yolo, it makes it more useful for actual customers and scenarios, but it seems we need to do some homework for it to look decent / simple to use and be something that can fit into the get started docs for experiments. I would love to hear @dberenbaum's opinion as well. Maybe we should take a leap of faith and just push through it, learn from it and keep making improvements, etc. Let's decide on this. Duplication and complexity (e.g. custom logger) bother me in this case.

Moved the YOLO part to #232 .
Tried to simplify (removed custom callback; sent patch upstream).

Update to use YOLO. Add Sagemaker deployment

7f99094

daavoo added the A: example-get-started-experiments DVC Experiment, DVCLive examples label Jul 28, 2023

daavoo self-assigned this Jul 28, 2023

daavoo linked an issue Jul 28, 2023 that may be closed by this pull request

Add deployment of example-get-started-experiments model #222

Closed

daavoo added 2 commits July 28, 2023 21:56

Update deploy_model.py

ec92cdf

Update deploy_model.py

f3be398

daavoo requested review from shcheklein and dberenbaum July 28, 2023 19:58

daavoo mentioned this pull request Aug 1, 2023

Add deployment of example-get-started-experiments model #222

Closed

Tibor Mach and others added 2 commits August 2, 2023 18:21

fixed typo in README

dd08272

gitignore pretrained weights

60a2a85
