
Consider Cheaper Deployment model alternative #754

Open
patmagee opened this issue Mar 4, 2024 · 1 comment

Comments

@patmagee
Contributor

patmagee commented Mar 4, 2024

Note: CoA uses TES as a submodule. Please consider creating this issue in the TES repository if you identify the root cause is there: https://github.com/microsoft/ga4gh-tes

Problem:
The new(ish) deployment model of CoA on Kubernetes is really great when you already have a presence on AKS, or are planning to use AKS for other reasons. Unfortunately, if you do not and deploying CoA is your first AKS use case, it becomes a very expensive option for hosting a Workflow Execution server. Unless you are running large batches of workflows, the cost of the infrastructure drives up the cost per workflow substantially.

Using numbers pulled from running instances of both the new and the old deployment models, the cost impact is clear:

  • Single VM Approach: ~$275 USD / month
  • AKS Approach: ~$620 USD / month
    • Non-AKS Resources: ~$185 USD / month
    • AKS Resources: ~$425 USD / month

This represents greater than a 2x increase in cost for hosting the same services. Of course, I agree that the new approach is much more flexible, easier to maintain and troubleshoot, and provides a host of other benefits. But in some scenarios it ends up being just a numbers game, and ~$7,500 / year is a lot for just hosting the infrastructure needed to run workflows on Azure.

Solution:

There are a number of ways to solve this, requiring varying degrees of engineering and first-class support from Azure:

  1. Provide a single-VM option once again and allow the user to tune the VM size / DB size. This is probably the simplest short-term solution. Some workloads are small, and there is little justification for such a large infrastructure when running < 10 (or even 100) workflows a month.
  2. Use a serverless execution engine, i.e. Cromwell with the run command or miniwdl deployed in a similar way as Nextflow, and then deploy TES on a small VM.
  3. Make TES a first-class Microsoft API and deploy a small VM to host Cromwell.
  4. Make TES a first-class Microsoft API and use a serverless execution engine.
@MattMcL4475
Contributor

MattMcL4475 commented Apr 12, 2024

@patmagee just had a good team discussion on this. The quickest, most impactful solution might be:

  1. Modify the Trigger Engine to stop AKS if there are no "new" or "inprogress" workflows AND no workflows have completed within the past 1 hour (configurable). This would ensure that AKS is shut down whenever no workflows are running and none have completed within the past hour (first sketch below).

  2. Create an Azure Function that uses a blob trigger so it's executed when a new blob is created in the workflows container. It should check if AKS is stopped, and if so, it should start it (second sketch below).
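
For step 1, a minimal sketch of what the idle check and shutdown could look like, assuming the Azure.ResourceManager.ContainerService SDK; the workflow counts and completion timestamp would come from the Trigger Engine's own state, which isn't reproduced here:

```csharp
// Hypothetical sketch only: the Trigger Engine's real types, trigger-state
// names, and configuration plumbing are not shown here.
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Core;
using Azure.Identity;
using Azure.ResourceManager;
using Azure.ResourceManager.ContainerService;

public class AksIdleShutdown
{
    // "Past 1 hour" window from the proposal; intended to be configurable.
    private readonly TimeSpan _idleWindow = TimeSpan.FromHours(1);

    public async Task StopAksIfIdleAsync(
        string aksResourceId,
        int newWorkflowCount,
        int inProgressWorkflowCount,
        DateTimeOffset? lastWorkflowCompletedUtc)
    {
        bool idle = newWorkflowCount == 0
            && inProgressWorkflowCount == 0
            && (lastWorkflowCompletedUtc is null
                || DateTimeOffset.UtcNow - lastWorkflowCompletedUtc > _idleWindow);

        if (!idle)
        {
            return;
        }

        // Stop the managed cluster; the blob-triggered function (step 2) starts it again.
        var armClient = new ArmClient(new DefaultAzureCredential());
        ContainerServiceManagedClusterResource cluster =
            armClient.GetContainerServiceManagedClusterResource(new ResourceIdentifier(aksResourceId));

        await cluster.StopAsync(WaitUntil.Completed);
    }
}
```

Stopping the cluster deallocates the agent node VMs, which should account for the bulk of the ~$425 USD / month AKS line item above.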

This should hypothetically reduce the cost of AKS significantly, with the only downside being that cold start will likely take an additional few minutes, which seems like a perfectly fine tradeoff.
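
For step 2, a minimal sketch of the blob-triggered starter as an isolated-worker Azure Function; the workflows container name, the AzureWebJobsStorage connection, and the AKS_RESOURCE_ID app setting are illustrative assumptions:

```csharp
// Hypothetical sketch only: container name, connection name, and the
// AKS_RESOURCE_ID setting are placeholders for illustration.
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Core;
using Azure.Identity;
using Azure.ResourceManager;
using Azure.ResourceManager.ContainerService;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Logging;

public class StartAksOnNewWorkflow
{
    private readonly ILogger<StartAksOnNewWorkflow> _logger;

    public StartAksOnNewWorkflow(ILogger<StartAksOnNewWorkflow> logger) => _logger = logger;

    [Function(nameof(StartAksOnNewWorkflow))]
    public async Task Run(
        [BlobTrigger("workflows/{name}", Connection = "AzureWebJobsStorage")] string blobContent,
        string name)
    {
        string aksResourceId = Environment.GetEnvironmentVariable("AKS_RESOURCE_ID")
            ?? throw new InvalidOperationException("AKS_RESOURCE_ID is not set.");

        var armClient = new ArmClient(new DefaultAzureCredential());
        ContainerServiceManagedClusterResource cluster =
            armClient.GetContainerServiceManagedClusterResource(new ResourceIdentifier(aksResourceId));

        try
        {
            _logger.LogInformation("New workflow blob {Name}; starting AKS if it is stopped.", name);
            await cluster.StartAsync(WaitUntil.Started);
        }
        catch (RequestFailedException ex) when (ex.Status == 409)
        {
            // Starting an already-running (or already-starting) cluster typically
            // fails with a conflict; treat that as a no-op.
            _logger.LogInformation("AKS start skipped: {Message}", ex.Message);
        }
    }
}
```

Starting a cluster that is already running typically fails with a conflict, so the sketch treats that case as a no-op rather than querying the power state first.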

To go even further, we could also make the Postgres database optionally deployable as a container in AKS instead of using the managed Azure Postgres Flexible Server.

Any thoughts on this?
