
Automate deployment and simplify infrastructure #913

Open
lucassz opened this issue Jun 25, 2018 · 7 comments
@lucassz
Contributor

lucassz commented Jun 25, 2018

Per a discussion with @hdoupe, I plan to work on automating the deployment that is currently done manually for both the webapp and the backend worker nodes. Here are the changes I am planning, ordered from least to most different from the current setup:

  1. Automate the deployment of Heroku and AWS instances through Terraform, a tool for describing infrastructure declaratively as code and orchestrating its deployment. This deployment, which is currently done manually by Hank, would thus be described in an open-source way, possibly in a separate repo.

  2. Move the backend worker containers from being self-managed Docker containers on EC2 instances to Amazon Elastic Container Service (ECS) + AWS Fargate, allowing for containers to be directly managed by AWS.

  3. (This one is a bit farther off.) Use an Amazon Elastic Load Balancer (ELB) of the Application Load Balancer variety in order to distribute compute jobs to different worker nodes.

    • This would potentially replace much of the functionality in compute.py, as the webapp would no longer have to keep track of a list of worker nodes and distribute the load among them. Instead, it would query a single, constant address and receive a cookie allowing it to ping the worker assigned to each job, without having to know which worker it is (see the sketch after this list).
    • This would make it easier to eventually expose the backend as an API accessible to end-users, as the load balancer can do things like rate-limiting.
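
Here is a rough sketch of what that could look like from the webapp side. The ALB address and the /submit_job and /job_status routes are placeholders rather than existing endpoints; the point is just that a requests.Session keeps the load balancer's sticky-session cookie, so follow-up requests land on the worker that accepted the job.

```python
import requests

# Placeholder ALB DNS name -- the real one would come out of the Terraform/ECS setup.
ALB_URL = "http://taxbrain-workers-example.us-east-1.elb.amazonaws.com"

def submit_and_poll(payload):
    # A Session persists cookies, so the sticky-session cookie the ALB sets
    # on the first response pins later requests to the same worker.
    session = requests.Session()

    resp = session.post(ALB_URL + "/submit_job", json=payload, timeout=10)
    resp.raise_for_status()
    job_id = resp.json()["job_id"]

    # Poll the same constant address; the cookie handles the routing.
    status = session.get(ALB_URL + "/job_status/" + job_id, timeout=10)
    return status.json()
```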

I look forward to hearing feedback on these proposed changes, and I am very open to suggestions.

@hdoupe
Collaborator

hdoupe commented Jun 26, 2018

@lucassz Thanks for laying out this plan. I'm not very familiar with Terraform, but from what I've read, it looks like a pretty cool tool that will address our needs (here's a blog series that I found helpful for getting up to speed).

As we've discussed, ECS seems like a good option for setting up servers, deploying docker containers to them, and scaling them up and down.

In my head, it seems like (1) and (2) are substitutes. What are the advantages of using both instead of only Terraform or only ECS?

Step (3) is beyond my knowledge level. My initial thought is that moving complexity from our code into someone else's more thoroughly tested and more widely used code is usually a good thing. I'm happy and excited to learn more about this option.

Would the load balancer distribute the requests in a round-robin fashion?

Another option, besides using a cookie to remember which server to query once the job is submitted, is to use a shared Redis server for all of the AWS worker nodes. Then you could have a small Flask server that fields requests, queries the Redis server for the job status, and returns either the job status or the results. I'm not sure which option is better, though.
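
Roughly what I have in mind, as a sketch only (the Redis host name and key names are made up, not anything in the codebase): workers write job status and results into the shared Redis, and a small Flask app just reads them back out.

```python
from flask import Flask, jsonify
import redis

app = Flask(__name__)

# Placeholder host: in practice this would be the shared Redis instance
# (e.g. a managed Redis on AWS) used by all of the worker nodes.
store = redis.StrictRedis(host="shared-redis.internal", port=6379,
                          decode_responses=True)

@app.route("/job_status/<job_id>")
def job_status(job_id):
    # Assumes workers write a status string under "status:<job_id>" and the
    # serialized results under "result:<job_id>" -- illustrative key names.
    status = store.get("status:" + job_id)
    if status == "SUCCESS":
        return jsonify(status=status, result=store.get("result:" + job_id))
    return jsonify(status=status or "UNKNOWN")
```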

@lucassz
Contributor Author

lucassz commented Jun 27, 2018

@hdoupe I would say (1) and (2) are complements, not substitutes. Terraform is just a tool for describing and deploying infrastructure, which can be anything, including ECS. Using only Terraform would mean just moving the exact structure we have now to an automated deployment on EC2, whereas using only ECS would mean moving our self-managed EC2 instances to ECS but doing so in a non-automated way, directly through the AWS web interface or CLI. So we can (and, I think, should) do both.

As for (3), the load balancer can incorporate a number of factors when distributing requests. All else being equal, it will do so in round-robin fashion, but it can also do things like checking instance/container health. The option that you mention is very interesting to me -- do you mean that the job result could be stored directly on Redis, and retrieved from there by the Flask server? Or would it still be "passed through"/accessed via HTTP on the worker node by the coordinating Flask node?

@hdoupe
Collaborator

hdoupe commented Jun 27, 2018

@lucassz Thanks for your explanation. I think it makes sense to do both now, too.

Hmmm, the load balancer sounds pretty good. I like the idea of having it take care of the round-robin-ing (or similar) and doing health checks.

> do you mean that the job result could be stored directly on Redis, and retrieved from there by the Flask server?

Yes, this is what I mean. There's a managed Redis option on AWS if we go this route. This is an option that we can explore regardless of whether we use the load balancer or not.

@hdoupe
Collaborator

hdoupe commented Jun 27, 2018

@talumbau do you have any thoughts on the approach outlined by @lucassz?

@lucassz
Contributor Author

lucassz commented Jun 27, 2018

> Yes, this is what I mean. There's a managed Redis option on AWS if we go this route. This is an option that we can explore regardless of whether we use the load balancer or not.

@hdoupe This would be a substitute for the load balancer, would it not? But I really like that option. Then we could possibly even use something like AWS Lambda, where you only pay for the time needed to run the compute jobs.

But incrementally speaking, I'm going to start working towards (1) and (2) with the possibility for adjustments to be made.

@hdoupe
Collaborator

hdoupe commented Jun 27, 2018

Perhaps, but the load balancer could still help with the round-robin-ing. We would just already know which server to query for the job status and results. Again, I'm out of my depth and looking forward to learning about this.

> But incrementally speaking, I'm going to start working towards (1) and (2) with the possibility for adjustments to be made.

Agreed.

@lucassz
Contributor Author

lucassz commented Jul 2, 2018

(1) and (2) are almost done as far as ECS goes, and I will commit them publicly soon. One thing that will be left to do is to add the current Heroku configuration as well, so that it can be managed more cleanly and reproduced easily.

As for the architecture, one thing that came to my attention is that we don't need one Redis instance and one Flask server for each Celery worker… The most resource-efficient setup would probably be a single Redis instance and a single Flask server, plus a bunch of real "workers" (i.e. job runners) that fetch jobs from the Celery queue. Possibly the number of Celery workers could be auto-scaled as well.
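
In Celery terms, the layout would look roughly like this (a sketch only; the Redis host and task name are placeholders, not existing code):

```python
# celery_app.py -- one shared Redis broker/result store, N identical workers.
from celery import Celery

app = Celery(
    "taxbrain",
    broker="redis://shared-redis.internal:6379/0",   # single shared job queue
    backend="redis://shared-redis.internal:6379/1",  # single shared result store
)

@app.task
def run_compute_job(params):
    ...  # the actual model run would go here


# Each worker node (e.g. each ECS task) would then run an identical process:
#   celery -A celery_app worker --concurrency=1
# and scaling out is just a matter of starting more of them.
```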
