
Consider Cheaper Deployment model alternative #754

Open
patmagee opened this issue Mar 4, 2024 · 1 comment

Comments

@patmagee
Contributor

patmagee commented Mar 4, 2024

Note: CoA uses TES as a submodule. Please consider creating this issue in the TES repository if you identify the root cause is there: https://github.com/microsoft/ga4gh-tes

Problem:
The new(ish) deployment model of CoA on Kubernetes is really great when you already have a presence on AKS, or are planning to use AKS for other reasons. Unfortunately, if you do not and deploying CoA is your first AKS use case, it becomes a very expensive option for hosting a Workflow Execution server. Unless you are running large batches of workflows, the cost of the infrastructure drives up the cost per workflow substantially.

Using numbers pulled from running instances of both the new and the old deployment models, the cost impact is clear:

  • Single VM Approach: ~$275 USD / month
  • AKS Approach: ~$620 USD / month
    • Non-AKS Resources: ~$185 USD / month
    • AKS Resources: ~$425 USD / month

This represents greater than a 2x increase in cost for hosting the same services. Of course, I agree that the new approach is much more flexible, easier to maintain and troubleshoot, and provides a host of other benefits. But in some scenarios it ends up being just a numbers game, and ~$7,500 / year is a lot for just hosting the infrastructure needed to run workflows on Azure.

Solution:

There are a number of ways to solve this, requiring varying degrees of engineering and first-class support from Azure:

  1. Provide a single-VM option once again and allow the user to tune the VM size / DB size. This is probably the simplest short-term solution. Some workloads are small, and there is little justification for such a large infrastructure when running < 10 (or even 100) workflows a month.
  2. Use a serverless execution engine, i.e. Cromwell with the run command or miniwdl deployed in a similar way as Nextflow, and then deploy TES on a small VM.
  3. Make TES a first-class Microsoft API and deploy a small VM to host Cromwell.
  4. Make TES a first-class Microsoft API and use a serverless execution engine.
@MattMcL4475
Contributor

MattMcL4475 commented Apr 12, 2024

@patmagee just had a good team discussion on this. The quickest, most impactful solution might be:

  1. Modify the Trigger Engine to stop AKS if there are no "new" or "inprogress" workflows AND no workflows have completed within the past 1 hour (configurable). This would ensure that AKS is shut down whenever no workflows are running and none have completed within the past hour (first sketch below).

  2. Create an Azure Function that uses a blob trigger so it's executed when a new blob is created in the workflows container. It should check if AKS is stopped, and if so, it should start it (second sketch below).
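
For step 1, a minimal sketch of what the idle check and shutdown could look like, assuming the Azure.ResourceManager.ContainerService SDK; the workflow counts and completion timestamp would come from the Trigger Engine's own state, which isn't reproduced here:

```csharp
// Hypothetical sketch only: the Trigger Engine's real types, trigger-state
// names, and configuration plumbing are not shown here.
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Core;
using Azure.Identity;
using Azure.ResourceManager;
using Azure.ResourceManager.ContainerService;

public class AksIdleShutdown
{
    // "Past 1 hour" window from the proposal; intended to be configurable.
    private readonly TimeSpan _idleWindow = TimeSpan.FromHours(1);

    public async Task StopAksIfIdleAsync(
        string aksResourceId,
        int newWorkflowCount,
        int inProgressWorkflowCount,
        DateTimeOffset? lastWorkflowCompletedUtc)
    {
        bool idle = newWorkflowCount == 0
            && inProgressWorkflowCount == 0
            && (lastWorkflowCompletedUtc is null
                || DateTimeOffset.UtcNow - lastWorkflowCompletedUtc > _idleWindow);

        if (!idle)
        {
            return;
        }

        // Stop the managed cluster; the blob-triggered function (step 2) starts it again.
        var armClient = new ArmClient(new DefaultAzureCredential());
        ContainerServiceManagedClusterResource cluster =
            armClient.GetContainerServiceManagedClusterResource(new ResourceIdentifier(aksResourceId));

        await cluster.StopAsync(WaitUntil.Completed);
    }
}
```

Stopping the cluster deallocates the agent node VMs, which should account for the bulk of the ~$425 USD / month AKS line item above.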

This should hypothetically reduce the cost of AKS significantly, with the only downside being that cold start will likely take an additional few minutes, which seems like a perfectly fine tradeoff.
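
For step 2, a minimal sketch of the blob-triggered starter as an isolated-worker Azure Function; the workflows container name, the AzureWebJobsStorage connection, and the AKS_RESOURCE_ID app setting are illustrative assumptions:

```csharp
// Hypothetical sketch only: container name, connection name, and the
// AKS_RESOURCE_ID setting are placeholders for illustration.
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Core;
using Azure.Identity;
using Azure.ResourceManager;
using Azure.ResourceManager.ContainerService;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Logging;

public class StartAksOnNewWorkflow
{
    private readonly ILogger<StartAksOnNewWorkflow> _logger;

    public StartAksOnNewWorkflow(ILogger<StartAksOnNewWorkflow> logger) => _logger = logger;

    [Function(nameof(StartAksOnNewWorkflow))]
    public async Task Run(
        [BlobTrigger("workflows/{name}", Connection = "AzureWebJobsStorage")] string blobContent,
        string name)
    {
        string aksResourceId = Environment.GetEnvironmentVariable("AKS_RESOURCE_ID")
            ?? throw new InvalidOperationException("AKS_RESOURCE_ID is not set.");

        var armClient = new ArmClient(new DefaultAzureCredential());
        ContainerServiceManagedClusterResource cluster =
            armClient.GetContainerServiceManagedClusterResource(new ResourceIdentifier(aksResourceId));

        try
        {
            _logger.LogInformation("New workflow blob {Name}; starting AKS if it is stopped.", name);
            await cluster.StartAsync(WaitUntil.Started);
        }
        catch (RequestFailedException ex) when (ex.Status == 409)
        {
            // Starting an already-running (or already-starting) cluster typically
            // fails with a conflict; treat that as a no-op.
            _logger.LogInformation("AKS start skipped: {Message}", ex.Message);
        }
    }
}
```

Starting a cluster that is already running typically fails with a conflict, so the sketch treats that case as a no-op rather than querying the power state first.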

To go even further, we could also make the Postgres database optionally deployable as a container in AKS instead of using the managed Azure Postgres Flexible Server.

Any thoughts on this?
