Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to YOLO. Add Sagemaker deployment #229

Conversation

daavoo
Copy link
Contributor

@daavoo daavoo commented Jul 28, 2023

Migrate to YOLO

Make the decision because:

  • It was the framework I was already using on a similar project 😅
  • It feels like it is more popular these days compared to fast.ai.
    Not only in general, but in our particular case more than 1 customer/prospect has explicitly reported using YOLO but (afaik) we did not hear that from fast.ai.
  • The framework logic for training felt simpler
    If we remove the custom callback, it is YOLO.train vs ~60 lines of fast.ai

Sagemaker deployment

Not that I am sure the approach here is the best but at least we could start the discussion.
Create https://www.notion.so/iterative/Sagemaker-Deployment-fce94fa0d81c44f395c10fc32940ab82#f3773819ebce4659ad397ed92686d56c to discuss.


TODO:

  • Update README
  • Update dvc.org references to the repo

@daavoo daavoo added the A: example-get-started-experiments DVC Experiment, DVCLive examples label Jul 28, 2023
@daavoo daavoo self-assigned this Jul 28, 2023
@daavoo daavoo linked an issue Jul 28, 2023 that may be closed by this pull request
@shcheklein
Copy link
Member

@daavoo could you point me please to some details / summary or update the description please - e.g. why YOLO (do we want / do we use the callback there?), what else has changed any why, etc. That make it easier for me to review. Thanks!

from ultralytics import YOLO


def add_callbacks(live, yolo):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shcheklein I am not using the built-in callback because:

  • I couldn't find a way to customize arguments (i.e. pass a custom dir, disable report)
    It seems the other loggers in ultralytics rely on env vars.

  • Most of the image plots didn't feel useful / were redundant with existing plots.
    I was planning to send a patch to ultralytics repo to skip some of the images (i.e. the confusion matrix, a big image with all the linear plots, etc)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code below looks complicated for the example repo . It should not belong there I think.

Most of the image plots didn't feel useful / were redundant with existing plots.
I was planning to send a patch to ultralytics repo to skip some of the images (i.e. the confusion matrix, a big image with all the linear plots, etc)

I would let users decide on that in the UX / UI (that should handle that case)

I couldn't find a way to customize arguments (i.e. pass a custom dir, disable report)
It seems the other loggers in ultralytics rely on env vars.

If we go with YOLO I think we don't have other options but figure this out - we can't maintain that level of duplication + code below is not simple at all to my taste (fine for an internal callback, but not for the example repo, especially not if we need to put it into notebooks)

Comment on lines +42 to +49
- run: dvc pull results/train/artifacts/best.pt.dvc

- env:
VERSION: ${{steps.clean_version_name.outputs.version}}
run: |
bash sagemaker/bundle_and_upload_model.sh \
results/train/artifacts/best.pt \
s3://pool-segmentation/$VERSION/model.tar.gz
Copy link
Contributor Author

@daavoo daavoo Aug 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is option A of How to make the bundled model accessible to Sagemaker.

Explained here : https://www.notion.so/iterative/Sagemaker-Deployment-fce94fa0d81c44f395c10fc32940ab82#735578764d964f9d9c6d7ab9d8c5fa39

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid moving and rebundling the model?

Comment on lines +52 to +58
VERSION: ${{steps.clean_version_name.outputs.version}}
run: |
python sagemaker/deploy_model.py \
--name $VERSION \
--model_data s3://pool-segmentation/$VERSION/model.tar.gz \
--role ${{ secrets.AWS_ROLE_TO_ASSUME }} \
--instance_type 'ml.c4.xlarge'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is option A of How to create endpoints based on MR events.

Explained here : https://www.notion.so/iterative/Sagemaker-Deployment-fce94fa0d81c44f395c10fc32940ab82#f3773819ebce4659ad397ed92686d56c

on:
push:
# When a new version is registered in Studio Model Registry
tags:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not use gto action because it did simplify / add much value for the deployment pattern used

@@ -1,6 +1,7 @@
{
"name": "example-repos-dev",
"image": "mcr.microsoft.com/devcontainers/python:3.10",
"runArgs": ["--ipc=host"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could remind please, why it was needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without it, the PyTorch dataloaders ran out of memory

@@ -0,0 +1,46 @@
import logging
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file could be removed if we create a version of the dataset in YOLO format instead of the current format with png masks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a simple script to showcase how to query the endpoint

@shcheklein
Copy link
Member

@daavoo

Make the decision because:

thanks for clarifying this. does it make it easier to deploy? I wonder if those changes are related at all to each other? could have we done them one my one? (easier to review, etc)

@daavoo
Copy link
Contributor Author

daavoo commented Aug 2, 2023

thanks for clarifying this. does it make it easier to deploy?

I found a working example of YOLO + Sagemaker that I could adapt but I did not find that for fast.ai

I wonder if those changes are related at all to each other? could have we done them one my one? (easier to review, etc)

I can split into 2 parts. Yolo and then sagemaker

@shcheklein
Copy link
Member

I found a working example of YOLO + Sagemaker that I could adapt but I did not find that for fast.ai

make sense. Nw, let's keep it as is.

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.8'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn' work with the newer one?

- uses: aws-actions/configure-aws-credentials@v2
with:
aws-region: us-east-1
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use openid please?

@@ -11,8 +8,6 @@ This is an auto-generated repository for use in [DVC](https://dvc.org)
This is a Computer Vision (CV) project that solves the problem of segmenting out
swimming pools from satellite images.

[Example results](./results/evaluate/plots/images/)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we still try to include some examples?

@@ -1,3 +1,2 @@
/pool_data
/test_data
/train_data
/yolo_dataset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it need to be prefixed with yolo_ - is that dataset yolo specific?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is not needed. just to give a name that indicates the format

fire
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does fire and shapely do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fire is to remove the boilerplate of adding argparse. takes function signature and makes it work from CLI arguments
shapely needed to convert the png masks to yolo format, which expect polygons in a .txt

output_path: str = "predictions",
):

if endpoint_name is not None and model_path is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need two inferences (two python scripts) ... both of them something related to endpoints

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(sagemaker/code/inference.py) is bundled alongside the model file and uploaded to Sagemaker, it can't be used from a user perspective.
It includes methods following conventions required by Sagemaker.

pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118
pip install jupyter
yolo settings datasets_dir=/workspaces/example-repos-dev/example-get-started-experiments/build/example-get-started-experiments/data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, how will it look like in docs - we'll we have to explain all this in the get started?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is only needed during generation inside codespaces because YOLO resolves the paths wrong inside generate.sh.
For someone cloning the repo it works without running the command

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be it's worth splitting this indeed to unblock it faster. I like that we are trying yolo, it make it more useful for actual customers and scenarios, but it seems we need to do some homework for that to look decent / simple to use and something that can fit into get started docs for the experiments. I would love to hear @dberenbaum opinion as well. May be we should do a leap of faith and just push though it and learn from it and keep doing improvements, etc. Let's decide on this. Duplication and complexity (e.g. custom logger) bothers me in this case.

On the deployment:

  • I think we need an app that we could drop an image, so that we can demo it
  • Do we keep instances / deployment running 24/7 for the demo purposes? What will be the workflow there? Can it be deployed in a serverless / on-demand mode?
  • Good questions on the lifecycle - which endpoint we keep alive, etc. Don't know the answer yet to this.
  • What is the lifecycle for code + models - can I change code and deploy with an older version of a model - how is it determined?

@daavoo daavoo closed this Aug 3, 2023
@daavoo
Copy link
Contributor Author

daavoo commented Aug 3, 2023

May be it's worth splitting this indeed to unblock it faster. I like that we are trying yolo, it make it more useful for actual customers and scenarios, but it seems we need to do some homework for that to look decent / simple to use and something that can fit into get started docs for the experiments. I would love to hear @dberenbaum opinion as well. May be we should do a leap of faith and just push though it and learn from it and keep doing improvements, etc. Let's decide on this. Duplication and complexity (e.g. custom logger) bothers me in this case.

Moved the YOLO part to #232 .
Tried to simplify (removed custom callback; sent patch upstream).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: example-get-started-experiments DVC Experiment, DVCLive examples
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add deployment of example-get-started-experiments model
2 participants