
"'subnetworks/default' is not ready" error thrown sporadically due to google_container_cluster adding ranges? #10972

Open
gygitlab opened this issue Jan 25, 2022 · 6 comments
Labels
persistent-bug (Hard to diagnose or long-lived bugs for which resolutions are more like feature work than bug work) · service/terraform · size/s · test-failure

Comments

gygitlab commented Jan 25, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

We recently noticed some "'subnetworks/default' is not ready" errors in our Terraform runs on new environments:

Error: Error creating instance: googleapi: Error 400: The resource 'projects/<redacted>/regions/us-east1/subnetworks/default' is not ready, resourceNotReady

We haven't seen this before, but I think it's happening because we have both VMs and a GKE cluster being created at the same time.

When the GKE cluster starts to be created, it adds its additional ranges to the target subnet, which in turn "locks" the subnet for some time, preventing other resources from being created against it.

Rerunning the apply works fine, so it feels like this could be handled more gracefully, either with some retries or by holding the cluster build until other resources are done. We could work around this on our end (see the sketch below), but it doesn't seem unreasonable to deploy both a cluster and other resources to the same subnet, so I thought it was worth raising.
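
As a sketch of the workaround mentioned above (assuming the cluster and the instances live in the same module, using the resource names from the configuration below), an explicit depends_on serializes the VMs behind the cluster, so the subnet is no longer being modified when the instances are inserted:

resource "google_compute_instance" "node" {
  # ... existing arguments as in the configuration below ...

  # Wait until the GKE cluster has finished adding its secondary
  # ranges to the shared subnet before creating the VMs.
  depends_on = [google_container_cluster.cluster]
}

The trade-off is that applies creating both resources lose their parallelism, which is why a provider-side retry still seems like the better fix.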

Terraform Version

1.1.4

Affected Resource(s)

  • google_compute_instance
  • google_container_cluster

Terraform Configuration Files

resource "google_container_cluster" "cluster" {
  count = min(local.total_node_pool_count, 1)
  name  = var.prefix

  remove_default_node_pool = true
  initial_node_count                = 1
  enable_shielded_nodes       = true

  network    = local.vpc_name # Default
  subnetwork = local.subnet_name # Default

  # Require VPC Native cluster
  # https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/using_gke_with_terraform#vpc-native-clusters
  # Blank block enables this and picks at random
  ip_allocation_policy {}

  release_channel {
    channel = "STABLE"
  }

  node_config {
    shielded_instance_config {
      enable_secure_boot = var.machine_secure_boot
    }
  }
}

resource "google_compute_instance" "node" {
  count        = var.node_count
  name         = "${local.name_prefix}-${count.index + 1}"
  machine_type = var.machine_type

  allow_stopping_for_update = var.allow_stopping_for_update

  shielded_instance_config {
    enable_secure_boot = var.machine_secure_boot
  }

  boot_disk {
    initialize_params {
      image = var.machine_image
      size  = var.disk_size
      type  = var.disk_type
    }
  }

  metadata = {
    enable-oslogin = "TRUE"
  }

  network_interface {
    network    = var.vpc
    subnetwork = var.subnet
  }

  service_account {
    scopes = concat(["storage-rw"], var.scopes)
  }

  lifecycle {
    ignore_changes = [
      min_cpu_platform
    ]
  }
}

Expected Behavior

The provider should gracefully handle any timing clashes caused by the cluster being deployed to the same subnet.

Actual Behavior

The provider creates the cluster at the same time as the VMs. As a result, the VMs get a 400 error from the API while the cluster edits the subnet to add more ranges.

Steps to Reproduce

  1. Configure a VPC-native cluster and several VMs to deploy on the same subnet
  2. Attempt an apply and notice that the VMs sometimes fail to deploy with the 400 error above

References

b/300616739

gygitlab added the bug label Jan 25, 2022
shuyama1 (Collaborator) commented Jan 27, 2022

Hi @grantyoung. We should already retry when APIs return this error. Would you mind sharing your debug log?

gygitlab (Author) commented Feb 2, 2022

Hi @shuyama1. The log can be seen here, thanks.

github-actions bot added the service/container and forward/review labels Aug 17, 2023
edwardmedia removed the forward/review label Sep 14, 2023
rileykarson (Collaborator) commented

We run into this in our nightly tests a lot. It's definitely not service-specific, so I'm gonna reclassify it as provider-wide.

rileykarson added the service/terraform, persistent-bug, and test-failure labels and removed the service/container, bug, and forward/linked labels Sep 29, 2023
SarahFrench (Member) commented

Discussion from triage: a possible way to fix this issue is to implement a retry in the provider.

SarahFrench added this to the Goals milestone Oct 2, 2023
melinath (Collaborator) commented Oct 5, 2023

We already have a retry in the provider for exactly this case:

func isSubnetworkUnreadyError(err error) (bool, string) {

Based on test logs, it looks like the retry gets called repeatedly throughout a test, presumably until some limit is hit (hopefully not a timeout). I'll look into whether it's possible to add backoff and jitter if those aren't already present, or to increase the number of retries / timeout.
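
For context, here is a minimal sketch of what a predicate like this plausibly looks like; the real implementation lives in the provider's retry transport and may differ, and the exact error-body matching below is an assumption based on the 400 resourceNotReady error reported above:

package transport

import (
	"strings"

	"google.golang.org/api/googleapi"
)

// Sketch only: retry predicates take an error and report whether the
// request should be retried, plus a short reason for logging. The
// substrings matched here are an assumption based on this issue's error.
func isSubnetworkUnreadyError(err error) (bool, string) {
	gerr, ok := err.(*googleapi.Error)
	if !ok {
		return false, ""
	}
	if gerr.Code == 400 && strings.Contains(gerr.Body, "resourceNotReady") &&
		strings.Contains(gerr.Body, "subnetworks") {
		return true, "Subnetwork not ready"
	}
	return false, ""
}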

rileykarson (Collaborator) commented Oct 5, 2023

I noticed in TestAccComputeInstance_resourcePolicyUpdate (in this execution) we're hitting a context deadline really early: 2m15s instead of 20m or so. Maybe we're attaching a short one to the retry transport?
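
As a generic illustration of that hypothesis (not the provider's actual code): once the context carries a short deadline, a retry loop bails out with ctx.Err() regardless of how many retries would otherwise remain.

package example

import (
	"context"
	"time"
)

// retryWithBackoff keeps retrying attempt() while shouldRetry says so, but a
// short deadline on ctx cuts it off early (e.g. at 2m15s) even though the
// resource-level timeout would allow around 20m.
func retryWithBackoff(ctx context.Context, attempt func() error, shouldRetry func(error) bool) error {
	backoff := time.Second
	for {
		err := attempt()
		if err == nil || !shouldRetry(err) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // surfaces as "context deadline exceeded"
		case <-time.After(backoff):
			backoff *= 2 // simple exponential backoff; jitter omitted
		}
	}
}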

@shuyama1 shuyama1 self-assigned this Oct 9, 2023
modular-magician added a commit to modular-magician/terraform-provider-google that referenced this issue Jun 18, 2024
[upstream:ade0a1ec36b97ef2853044110fea0cdd4bec6383]

Signed-off-by: Modular Magician <magic-modules@google.com>
modular-magician added a commit that referenced this issue Jun 18, 2024
[upstream:ade0a1ec36b97ef2853044110fea0cdd4bec6383]

Signed-off-by: Modular Magician <magic-modules@google.com>