Set up US rendering server on AWS #682

Closed
8 of 9 tasks
pnorman opened this issue Jul 15, 2022 · 19 comments
Labels
location:aws Services hosted on AWS service:tiles The raster map on tile.openstreetmap.org

Comments

@pnorman
Collaborator

pnorman commented Jul 15, 2022

Ref #637

  • Create new AWS account, linked to others
  • Finalize cost estimate, based on specs similar to new Europe render servers and traffic equivalent to Pyrene when it handled the US + y/y growth
  • Get credits from AWS
  • Get Elastic IP address
  • Create EC2 instance with Ubuntu 22.04, EBS storage, using elastic IP
  • Set up Chef on new instance and assign roles
  • Import DB, load test with z0-z12 background render
  • Set up endpoint in Fastly and slowly move traffic to it
  • When we're happy, get an EC2 instance savings plan to reduce costs

Outstanding questions

  • Do we bother with CFn for one always-on instance?
  • We're looking at m6g.16xlarge. We haven't run an ARM rendering server before, so we might need to go for m6a instead
  • We're assuming EBS will have the performance characteristics we need. If not, a d-type instance with ephemeral storage might be required
@MarkRose

MarkRose commented Jul 27, 2022

EBS, if using GP3, is generally great. If you're looking for inexpensive high IOPS, cheapest is to make a bunch of small GP3 volumes as each has a 3000 IOPS baseline (and RAID 0/LVM them or whatever). ST1 gives consistent hard drive like performance but latency is a bit higher than locally attached hard drives. If mod_tile is using blocking IO for reads, and I imagine mod_tile is, you may find you need fewer Apache threads/processes to get the same request throughput with GP3.
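
The striping suggestion comes down to simple arithmetic. A minimal sketch, assuming the 3,000 IOPS per-volume baseline quoted above and a purely illustrative target (the function name and the 20,000 figure are not from this thread); it ignores any instance-level EBS limits:

```python
import math

GP3_BASELINE_IOPS = 3000  # per-volume baseline quoted in the comment above


def volumes_for_target(target_iops: int, per_volume_iops: int = GP3_BASELINE_IOPS) -> int:
    """Return how many striped volumes are needed to reach target_iops.

    Assumes IOPS add up linearly across a RAID 0 / LVM stripe, which is an
    idealisation; real results depend on stripe size and access pattern.
    """
    if target_iops <= 0:
        raise ValueError("target_iops must be positive")
    return math.ceil(target_iops / per_volume_iops)


if __name__ == "__main__":
    # 20,000 IOPS is an illustrative target, not a measured requirement.
    print(volumes_for_target(20_000))
```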

m6a/c6a/r6a isn't always a savings over m6i/c6i/r6i due to performance differences in memory. I've seen the Xeon chips work out significantly cheaper than Epyc in some situations. I'd benchmark both if the Gravitons don't work out.

Be prepared for your instance to fail. It just happens. Most instances will stay up for years, others will have hardware issues. Sometimes you'll get a warning in the Events of the EC2 console (and sent by email) where you'll have a few weeks to stop and start the instance. In other cases the recovery process will start the instance on new hardware. Any data on ephemeral storage would of course be gone, so that's a big negative for using locally attached storage beyond a cache.

Just some thoughts from someone who has been using EC2 for over a decade.

@Firefishy
Member

@MarkRose Thank you for the helpful insights.

@pnorman
Collaborator Author

pnorman commented Jul 27, 2022

> ST1 gives consistent hard drive like performance but latency is a bit higher than locally attached hard drives.

Our sustained IOPS is 10k-20k, with peaks of 50k, so st1 isn't an option. My inclination is to start with a single maxed out gp3 and if necessary, split the tiles into their own volume.

The big unknown to me is latency, not iops. I don't know how that's going to impact performance.
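
One way to reason about the latency question is Little's law: the number of I/Os that must be in flight equals throughput times latency. A rough sketch; the IOPS figures come from the comment above, but the latency values are placeholder assumptions, not measurements of EBS or of the existing servers:

```python
def required_concurrency(iops: float, latency_s: float) -> float:
    """Little's law: average in-flight requests = throughput * time per request."""
    return iops * latency_s


def max_iops(concurrency: int, latency_s: float) -> float:
    """Upper bound on IOPS when `concurrency` blocking workers each wait latency_s per I/O."""
    return concurrency / latency_s


if __name__ == "__main__":
    # Latencies are assumed for illustration only.
    scenarios = [("assumed 0.1 ms latency", 0.0001), ("assumed 1 ms latency", 0.001)]
    for label, latency_s in scenarios:
        for iops in (10_000, 20_000, 50_000):
            needed = required_concurrency(iops, latency_s)
            print(f"{label}: {iops} IOPS needs ~{needed:.0f} I/Os in flight")
```

The point of the sketch is that with blocking reads, a tenfold increase in latency needs roughly ten times as many concurrent workers to sustain the same IOPS.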

@grischard
Collaborator

Solving this will also solve #637

@grischard
Collaborator

Depends on #660?

@pnorman
Collaborator Author

pnorman commented Aug 2, 2022

> Solving this will also solve #637

We're looking at replacing pyrene independent of this.

> Depends on #660?

No, although they have some common parts for changing our account management

@pnorman pnorman added service:tiles The raster map on tile.openstreetmap.org location:aws Services hosted on AWS labels Aug 2, 2022
@Firefishy
Member

Account has been created. Accessible via assumed role from master account.

@Firefishy
Member

Firefishy commented Nov 8, 2022

Design Considerations

  1. AWS Region
  2. Intel, AMD or ARM (Graviton)
  3. CPU cores
  4. RAM size
  5. PostgreSQL storage: Instance Store (local NVMe), EBS (GP2), EBS (GP3)
  6. PostgreSQL storage size
  7. PostgreSQL storage speed: IOPs and MiB/s (EBS GP3 mainly)
  8. Local tile cache storage: Instance Store (local NVMe), EBS (GP2), EBS (GP3)
  9. Local tile cache storage size
  10. Local tile cache storage speed: IOPs and MiB/s (EBS GP3 mainly)
  11. AWS Billing Alerts

Desired but not for initial launch:

  • Reserved Instance? No upfront / Partial upfront / Full upfront. Convertible?
  • Autoscaling using spot instances (Likely requires EBS snapshots with launch configuration)

@Firefishy
Member

Decisions:

  1. AWS Region: us-west-2
  2. TBC
  3. CPU cores: 64+
  4. RAM size: 250GB+

@iandees

iandees commented Nov 8, 2022

> AWS Region: us-west-2

That's in Oregon, probably very near the existing OSUOSL servers. Can I suggest us-east-2 (Ohio) or us-east-1 (Virginia) instead?

@Firefishy
Member

Firefishy commented Nov 8, 2022

Instance choice for initial experiment: m6gd.16xlarge
Instance Store (local NVMe)

@grischard
Collaborator

The reasoning for us-west-2 was carbon neutrality, but https://sustainability.aboutamazon.com/environment/the-cloud?energyType=true says us-east-1 and us-east-2 are 95% powered by renewables too. Initial choice for AWS region: us-east-2.

@Firefishy
Member

Firefishy commented Nov 8, 2022

AWS Region: us-east-2
Elastic IP: 3.144.0.72
Instance Name: palulukon
Instance Type: m6gd.16xlarge

@Firefishy
Member

Initial basic AWS billing Budget created. $1000/month. Alerts me, ops and @grischard

@grischard
Collaborator

DNS records created for palulukon.openstreetmap.org

@Firefishy
Member

Base chef is done, we're adding in arm64 for prometheus exporters.

@Firefishy
Member

Firefishy commented Nov 10, 2022

Import is now running... Thank you to @pnorman

@pnorman
Collaborator Author

pnorman commented Nov 11, 2022

Import completed. Pre-render took 1h47m, putting it at comparable performance to culebre and nidhogg, which have 2x 28-core AMD EPYC 7453. The new server is currently taking 38% of west-coast US load without issue, and as its tile store gets populated, I'll be adding more load to it.
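
For a rough sense of scale, the pre-render time can be turned into a throughput figure. This sketch assumes the pre-render covered z0-z12 as planned in the task list and that tiles are rendered in 8x8 metatiles; neither is confirmed in this comment, and the helper names are illustrative:

```python
METATILE = 8  # tiles per metatile side; assumed, not stated in this thread


def tiles_at_zoom(z: int) -> int:
    """Number of tiles in a full world pyramid level."""
    return 4 ** z


def metatiles_at_zoom(z: int, size: int = METATILE) -> int:
    """Number of size x size metatiles needed to cover zoom z."""
    per_side = max(1, (2 ** z) // size)
    return per_side ** 2


if __name__ == "__main__":
    zooms = range(0, 13)  # z0-z12, from the task list
    tiles = sum(tiles_at_zoom(z) for z in zooms)
    metatiles = sum(metatiles_at_zoom(z) for z in zooms)
    elapsed_s = 1 * 3600 + 47 * 60  # 1h47m, from the comment above
    print(f"{tiles} tiles in {metatiles} metatiles")
    print(f"~{metatiles / elapsed_s:.1f} metatiles/s under these assumptions")
```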

@Firefishy
Member

Firefishy commented Nov 16, 2022

AWS credits cannot be used to buy Savings Plans or Reserved Instances (Partial Upfront or All Upfront). It looks like No Upfront Reserved Instances are allowed, but that would leave OSMF exposed for potentially the last 2 months of the 12-month minimum reserved period (12 months is the minimum period offered for this instance type). Reserved Instance pricing is available here.
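
The exposure window can be checked with date arithmetic. A sketch assuming a 12-month (365-day) reservation that starts on the date of this comment, which is a hypothetical start date, and the credit expiry of 30 September 2023 stated in this comment:

```python
from datetime import date, timedelta


def uncovered_days(start: date, credits_expire: date, term_days: int = 365) -> int:
    """Days of a reservation term that fall after the credits expire."""
    term_end = start + timedelta(days=term_days)
    return max(0, (term_end - credits_expire).days)


if __name__ == "__main__":
    start = date(2022, 11, 16)          # assumed start: date of this comment
    credits_expire = date(2023, 9, 30)  # expiry date given in this comment
    print(uncovered_days(start, credits_expire))
```

A later purchase date would push the end of the term further past the credit expiry and lengthen the uncovered period.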

EC2 + bandwidth costs are currently around $115 per weekday. This is sufficiently covered by the credits, which expire on 30 September 2023, and leaves some headroom for a bandwidth increase.
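
As a back-of-the-envelope check, the daily figure can be projected to the credit expiry date. The sketch assumes weekends cost the same as weekdays, which likely overstates spend, and takes the date of this comment as the starting point; the credit balance itself is not stated in this thread, so no comparison against it is attempted:

```python
from datetime import date


def projected_spend(start: date, end: date, daily_cost: float) -> float:
    """Total spend from start to end (both inclusive) at a flat daily cost."""
    days = (end - start).days + 1
    if days < 0:
        raise ValueError("end must not be before start")
    return days * daily_cost


if __name__ == "__main__":
    start = date(2022, 11, 16)  # assumed start: date of this comment
    end = date(2023, 9, 30)     # credit expiry given in this comment
    # $115/day is the weekday figure from this comment, applied to every day.
    print(f"${projected_spend(start, end, 115):,.0f} upper-bound estimate")
```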

A remaining cost saving option (to allow more capacity) is to move to Spot Instances (~70% lower instance cost), but this would require additional DevOps investment to turn the "pet server" into "cattle", which is best handled by a separate ticket.

@Firefishy Firefishy removed their assignment Nov 16, 2022

5 participants