
feat(ml): CUDA acceleration and ONNX compilation #2574

Closed · wants to merge 11 commits

Conversation

@mertalev (Contributor) commented May 26, 2023

Following the great work from #2563, I compiled the models to ONNX and cleaned up the Dockerfile and Python dependencies. I haven't done any serious profiling yet, but it seems to do around 20-25 images/s on a 3080 for image classification and CLIP.

Some things to note:

  • The first launch will be slower than successive ones. This is because each model is first compiled into ONNX and saved into the cache directory, at which point it can be loaded very quickly (see the sketch after this list).
  • The image size is optimized by using a non-CUDA base image and instead relying on the built-in CUDA libraries packaged with torch. This means that torch must be imported before onnxruntime when using CUDA.
  • CPU and CUDA can be targeted by setting the device build arg to cpu and cuda, respectively.
  • This PR also removes the object detection model and endpoint per the discussion in [Feature] ML improvements #2524. The default image classification model is also now significantly better.
  • The score threshold for classification is now an environment variable (ML_MIN_TAG_SCORE).
  • The score threshold for facial recognition is now an environment variable (ML_MIN_FACE_SCORE).
  • The lack of batching is a performance handicap since it means more overhead. GPU utilization fluctuates instead of staying consistently high, especially for facial recognition.
  • The image sizes are about 3.2 GB for the CPU version and about 8.5 GB for the CUDA version.
  • VRAM utilization is still unoptimized. Ideally, one model would be loaded at a time and unloaded from the GPU when the job is complete.
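
As a rough illustration of the compile-and-cache flow described above (not the PR's actual code; `CACHE_DIR`, its default, and `load_session` are hypothetical names):

```python
import os

# torch must be imported before onnxruntime so that torch's bundled CUDA libraries are found first
import torch
import onnxruntime as ort

CACHE_DIR = os.getenv("MACHINE_LEARNING_CACHE_FOLDER", "/cache")  # hypothetical env var and default


def load_session(model: torch.nn.Module, name: str, dummy_input: torch.Tensor) -> ort.InferenceSession:
    """Export the model to ONNX on first launch; later launches load the cached file directly."""
    onnx_path = os.path.join(CACHE_DIR, f"{name}.onnx")
    if not os.path.exists(onnx_path):
        # First launch: compile the PyTorch model to ONNX and save it to the cache directory.
        torch.onnx.export(model, dummy_input, onnx_path)
    providers = (
        ["CUDAExecutionProvider", "CPUExecutionProvider"]
        if torch.cuda.is_available()
        else ["CPUExecutionProvider"]
    )
    return ort.InferenceSession(onnx_path, providers=providers)
```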

Edit: Now with automatic model unloading! The default is to unload a model if it hasn't been used in 300 seconds, and this number can be overridden with the ML_MODEL_TTL env variable. Along with this is a flag to disable eager startup (ML_EAGER_STARTUP) so models are only loaded when needed.
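
A minimal sketch of what the TTL-based unloading could look like, assuming the ML_MODEL_TTL and ML_EAGER_STARTUP variables mentioned above; the class and its defaults are illustrative, not the PR's implementation:

```python
import asyncio
import os
import time

MODEL_TTL = int(os.getenv("ML_MODEL_TTL", "300"))  # idle seconds before a model is unloaded
EAGER_STARTUP = os.getenv("ML_EAGER_STARTUP", "true") == "true"  # assumed default


class CachedModel:
    """Illustrative wrapper: lazily loads a model and drops it after MODEL_TTL idle seconds."""

    def __init__(self, load_fn) -> None:
        self._load_fn = load_fn
        self._model = None
        self._last_used = 0.0
        if EAGER_STARTUP:
            self._model = self._load_fn()

    def get(self):
        if self._model is None:
            self._model = self._load_fn()  # lazy load when eager startup is disabled
        self._last_used = time.monotonic()
        return self._model

    async def expire_loop(self) -> None:
        # Periodically drop the reference so (V)RAM can be reclaimed once the job queue goes quiet.
        while True:
            await asyncio.sleep(MODEL_TTL)
            if self._model is not None and time.monotonic() - self._last_used > MODEL_TTL:
                self._model = None
```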


@alextran1502 (Contributor) commented:

When the PR is ready, please feel free to request review

@alextran1502 (Contributor) commented:

I believe this PR would supersede #2563, correct?

@mertalev (Contributor, Author) commented:

Sure thing! I'm working on unloading models automatically, should be ready very soon.

I believe this PR would supersede #2563, correct?

Basically yes, since the scope of this PR is wider and makes other improvements.

@mertalev (Contributor, Author) commented:

Hi @alextran1502, everything seems to work!

@bo0tzz (Member) commented May 31, 2023

You renamed a bunch of env vars, which is a backwards incompatible change. Can you revert that part?

@mertalev (Contributor, Author) commented:

The names were getting a bit long 😅, but yes I can revert that.

@jrasm91 (Contributor) commented May 31, 2023

Man, I think the direction this is going is great, but the number of things changing here is a lot, and many of them aren't even related to CUDA support, which is what I thought this PR was for.

Specifically, it looks like we've:

  • Added types to the python code
  • Worked on model unloading
  • Changed machine learning url endpoints
  • Changed the environment variable names
  • Changed how models are instantiated, presumably to enable HW support
  • Changed how logging works (not using the immich available LOG_LEVEL env variable either)
  • Added lots of comments, which I'm usually not a fan of, since that often means the code isn't very well written
  • Added new env variables to control the face score threshold
  • Defaulted the logger to use colors, which probably is not ideal. Does it support the NO_COLOR env?
  • Deleted the detect objects job.
  • Changed the machine learning interface from thumbnailPath to imagePath

I'm not necessarily against any one of these changes, but every single one could also be a separate PR, which would be easier to review, test, and revert if necessary. Put them all together, and it just ends up being a lot.

@mertalev (Contributor, Author) commented May 31, 2023

Hmm, you do have a point that it's a lot of changes. I might have gotten a bit overzealous. Since this PR makes major changes to how models are loaded and run, I wanted to make it easier to maintain and understand why something is done. But some of the changes aren't necessarily related to CUDA acceleration, just things I noticed and decided to change.

I can shrink the PR a bit. For the things you listed, I'll share my thought process and bucket them into things that are directly within scope and things that could be extracted into a separate PR (or just removed).

Within scope:

Changed how models are instantiated, presumably to enable HW support

  • Namesake of the PR.

Worked on model unloading

  • Not critical, but it's effectively a memory leak to never unload the models. With GPU support added and VRAM being a limited resource, it's a good time to address it.

Could be removed:

Added types to the python code

  • This is to make it more explicit what the models are expected to work with and output, but this can be moved to a separate PR, possibly synchronizing it with the API generator.

Changed machine learning url endpoints

  • This change is because sentence-transformers is removed as a dependency, but it's inconsequential to keep the endpoints as-is.

Changed how logging works (not using the immich available LOG_LEVEL env variable either)
Defaulted the logger to use colors, which probably is not ideal. Does it support the NO_COLOR env?

  • I actually didn't know about the LOG_LEVEL and NO_COLOR env variables; it makes sense to use them. This can be refined and made into a PR later (see the sketch below).
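
For reference, honoring both variables with Python's standard logging could look roughly like this (a sketch, not immich's actual setup; the format strings are placeholders):

```python
import logging
import os

# LOG_LEVEL follows immich's existing convention; NO_COLOR follows https://no-color.org
# (any non-empty value disables colored output).
level = os.getenv("LOG_LEVEL", "INFO").upper()
use_color = not os.environ.get("NO_COLOR")

fmt = "\x1b[36m%(levelname)s\x1b[0m %(message)s" if use_color else "%(levelname)s %(message)s"
logging.basicConfig(level=level, format=fmt)
```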

Added lots of comments, which I'm usually not a fan of, since that often means the code isn't very well written

  • Haha, there are definitely some aspects of the code I'd like to improve, but overall the comments give more context about not just what the code does but why it's doing it. The docstrings are more of a formality, so I could remove those. I could also add documentation to a README.md or somewhere else instead.

Added new env variables to control the face score threshold

  • I'm not a fan of hardcoded values, but you're totally right that this has nothing to do with the PR topic.

Changed the machine learning interface from thumbnailPath to imagePath

  • From the perspective of the server, it doesn't care if an image is a thumbnail or not as long as it's an image. Again though, you're right that this isn't necessarily on-topic.

Debatable:

Deleted the detect objects job.

  • This one is tricky. I could re-add the object detection model, but it seems the direction we're going is to remove it anyway so I'd rather not spend time on getting it to the same standard as the other endpoints.

Let me know what you think and which changes you think I should keep for this PR, and which are good but should be part of a different PR.

@jrasm91 (Contributor) commented May 31, 2023

I totally get getting carried away. I almost always end up refactoring, cleaning up, and renaming things as I add new features. I'm a big fan of splitting them into separate PRs as much as possible.

Ideally (1), I would recommend making separate PRs for the logging changes, the cache / model unloading changes, the thumbnail path change, deleting the object detection job, and the new env variables. You can rebase this branch after they're merged and add the HW support last.

If that sounds like too much work, then (2) we can just limit the scope of this PR and merge it mostly as is. I'd prefer (1) and think those PRs could be tested and merged pretty quickly. Otherwise, with (2), we will probably have a bit of back and forth until everything is sorted out and ready to be merged.

@mertalev (Contributor, Author) commented:

I'm fine with (1) since those changes are easy to extract out into separate PRs.

@samip5 (Contributor) commented Jun 11, 2023

Could something similar be done for non-CUDA hardware, or does this work with AMD GPUs too?

@mertalev (Contributor, Author) commented:

This PR only supports CUDA, so no AMD support yet. But ONNX Runtime supports AMD's ROCm, so it can be added in another PR without too many changes.
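
ONNX Runtime selects its hardware backend through execution providers, so an AMD build would mainly need the ROCm-enabled onnxruntime package and a different provider list. A sketch under those assumptions (the device strings and helper name are illustrative; ROCMExecutionProvider is ONNX Runtime's provider name):

```python
import onnxruntime as ort


def providers_for(device: str) -> list[str]:
    # Always list CPUExecutionProvider last as a fallback.
    if device == "cuda":
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    if device == "rocm":
        return ["ROCMExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]


# "model.onnx" stands in for a previously exported model file.
session = ort.InferenceSession("model.onnx", providers=providers_for("rocm"))
```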

@mertalev (Contributor, Author) commented:

Closing this as it's very stale. This addition can be reintroduced with a new PR that's much smaller.

@mertalev closed this Jul 19, 2023
@MrModest commented Jan 3, 2024

@mertalev if you created new PRs, could you please link them here, so people can follow the progress of the feature? :)

@mertalev (Contributor, Author) commented Jan 3, 2024

Hi! The current PR for it is #5619.
