
Enable spot instances #11

Merged
merged 8 commits into pangeo-forge:main on Jan 27, 2024

Conversation

@thodson-usgs (Contributor) commented Jan 14, 2024

This PR adds optional support for spot instances. Ref #10.

NOTE: This deploys without error, but I haven't started testing the checkpointing yet.

Ideas for checkpoint tuning

References

@thodson-usgs thodson-usgs marked this pull request as ready for review January 21, 2024 15:21
@thodson-usgs (Contributor, Author) commented Jan 21, 2024

@ranchodeluxe, @yuvipanda
I've come to understand that Flink does not use checkpointing in batch mode, so that extra configuration is unnecessary. Running the cluster on spot instances may be as simple as setting capacity_type="SPOT". However, I have left the default as "ON_DEMAND" and will continue testing "SPOT" on the USGS runner.
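
For concreteness, a minimal sketch of how that toggle could be exposed in the Terraform config; the variable name, default, and validation here are assumptions, not necessarily what this repo uses:

    # Hypothetical variable; name and default are illustrative only.
    variable "capacity_type" {
      description = "Capacity type for the EKS node group: ON_DEMAND or SPOT"
      type        = string
      default     = "ON_DEMAND"

      validation {
        condition     = contains(["ON_DEMAND", "SPOT"], var.capacity_type)
        error_message = "capacity_type must be either ON_DEMAND or SPOT."
      }
    }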

One thing that is still a little unclear is whether we need to configure the job manager and autoscaler to always run "ON_DEMAND". Some blogs suggest this as a best practice, but that fault tolerance may already be covered by the managed Kubernetes service (EKS).

P.S. One more thought
My limited understanding is that in batch mode, Flink will restart the job (task?) on failure. So, if a recipe includes a particularly long and costly job, like a big rechunk and transfer, it might be advisable to stick with "ON_DEMAND".

@thodson-usgs (Contributor, Author) commented Jan 21, 2024

Digging in a little further, I think we would:

  1. leave the core node group as on-demand,
  2. create a second node group of spot instances, and
  3. configure the autoscaler to scale each node group: job managers to core and task managers to spot. I'll need to investigate this last piece further; see the rough sketch below.
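
A rough Terraform sketch of step 2; the resource references (cluster, IAM role, subnets), labels, and scaling numbers are assumptions and may not match this repo:

    # Hypothetical spot node group for task managers; values are illustrative.
    resource "aws_eks_node_group" "spot_nodes" {
      cluster_name    = aws_eks_cluster.cluster.name   # placeholder reference
      node_group_name = "${var.cluster_name}-spot"
      node_role_arn   = aws_iam_role.node.arn          # placeholder reference
      subnet_ids      = aws_subnet.private[*].id       # placeholder reference

      instance_types = [var.instance_type]
      capacity_type  = "SPOT"

      scaling_config {
        desired_size = 0
        min_size     = 0
        max_size     = 10
      }

      # Task manager pods could target this group via a nodeSelector on this label,
      # while job manager pods stay on the on-demand core group.
      labels = {
        "capacity-type" = "spot"
      }
    }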

@thodson-usgs (Contributor, Author) commented Jan 22, 2024

Learning more, we might want to take advantage of Flink's High Availability (HA) Kubernetes Services:
"The Operator supports both Kubernetes HA Services and Zookeeper HA Services for providing High-availability for Flink jobs. The HA solution can benefit form using additional Standby replicas, it will result in a faster recovery time, but Flink jobs will still restart when the Leader JobManager goes down."

Essentially, we would deploy an all-spot cluster and use standby replicas in case an instance terminates. So flink-config.yaml would include something like:

      # Enable HA cluster
      "high-availability.type" : "kubernetes",
      "high-availability.storageDir" : "s3://${aws_s3_bucket.flink_store.id}/recovery",
      "kubernetes.cluster-id" :  <cluster_id>,

where <cluster_id> is the id of the job manager (?). I don't think we need to worry about flink-operator because it's on the control plane, which is managed by AWS.
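
To actually get a standby replica, one more entry would likely be needed alongside the keys above. This is an assumption on my part: for native Kubernetes deployments the option is kubernetes.jobmanager.replicas, and the operator also exposes the same knob as spec.jobManager.replicas on the FlinkDeployment.

      # Assumption: request 2 JobManagers so a standby can take over
      # if the leader's spot node is reclaimed.
      "kubernetes.jobmanager.replicas" : "2",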

@@ -38,6 +38,8 @@ resource "aws_eks_node_group" "core_nodes" {

instance_types = [var.instance_type]

capacity_type = "SPOT"
Review comment (Collaborator):

I assume this should in fact be using the variable var.capacity_type?

Reply (@thodson-usgs, author):

Yes. Fixed.
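
For reference, the corrected block presumably ends up along these lines (a sketch, assuming a capacity_type variable like the one sketched earlier):

    resource "aws_eks_node_group" "core_nodes" {
      # ...
      instance_types = [var.instance_type]
      capacity_type  = var.capacity_type   # "ON_DEMAND" by default, "SPOT" to opt in
      # ...
    }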

@@ -1,3 +1,8 @@
resource "aws_s3_bucket" "flink_store" {
bucket = "${var.cluster_name}-flink-store"
force_destroy = true
Review comment (Collaborator):

I generally suggest leaving this out, and hand-editing it in when you want to actually destroy it.

Reply (@thodson-usgs, author):

Removing this bucket for now, as the cluster seemed unable to write to it. Eventually, we'll want to add a shared filesystem to the cluster where Flink can cache metadata for restarting failed jobs. https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/

@thodson-usgs thodson-usgs changed the title Switch instance type to spot Enable spot instances Jan 27, 2024
@thodson-usgs (Contributor, Author) commented:
This PR will enable spot instances for testing. Before this is used in production, we'll want to make some additional changes to handle failures (these changes will benefit "on demand" clusters too).

@yuvipanda yuvipanda merged commit 306c5d4 into pangeo-forge:main Jan 27, 2024
@yuvipanda (Collaborator) commented:
Thanks @thodson-usgs

@thodson-usgs thodson-usgs mentioned this pull request Jan 28, 2024