
"'subnetworks/default' is not ready" error thrown sporadically due to google_container_cluster adding ranges? #10972

Open
gygitlab opened this issue Jan 25, 2022 · 6 comments
Labels
persistent-bug (Hard to diagnose or long-lived bugs for which resolutions are more like feature work than bug work) · service/terraform · size/s · test-failure

Comments

gygitlab commented Jan 25, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

We recently noticed some "'subnetworks/default' is not ready" errors in our Terraform runs on new environments:

Error: Error creating instance: googleapi: Error 400: The resource 'projects/<redacted>/regions/us-east1/subnetworks/default' is not ready, resourceNotReady

We haven't seen this before, but I think it's happening because we have both VMs and a GKE cluster being created at the same time.

When the GKE cluster starts to be created, it adds its additional ranges to the target subnet, which in turn "locks" the subnet for some time, preventing other resources from being created against it.

Rerunning the apply works fine, so it feels like this could be handled more gracefully, either with some retries or by holding the cluster build until other resources are done. We could work around this on our end (see the sketch below), but it doesn't seem unreasonable to deploy both a cluster and other resources to the same subnet, so I thought it was worth raising.
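
As a sketch of the workaround mentioned above (assuming the cluster and the instances live in the same module, using the resource names from the configuration below), an explicit depends_on serializes the VMs behind the cluster, so the subnet is no longer being modified when the instances are inserted:

resource "google_compute_instance" "node" {
  # ... existing arguments as in the configuration below ...

  # Wait until the GKE cluster has finished adding its secondary
  # ranges to the shared subnet before creating the VMs.
  depends_on = [google_container_cluster.cluster]
}

The trade-off is that applies creating both resources lose their parallelism, which is why a provider-side retry still seems like the better fix.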

Terraform Version

1.1.4

Affected Resource(s)

  • google_compute_instance
  • google_container_cluster

Terraform Configuration Files

resource "google_container_cluster" "cluster" {
  count = min(local.total_node_pool_count, 1)
  name  = var.prefix

  remove_default_node_pool = true
  initial_node_count                = 1
  enable_shielded_nodes       = true

  network    = local.vpc_name # Default
  subnetwork = local.subnet_name # Default

  # Require VPC Native cluster
  # https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/using_gke_with_terraform#vpc-native-clusters
  # Blank block enables this and picks at random
  ip_allocation_policy {}

  release_channel {
    channel = "STABLE"
  }

  node_config {
    shielded_instance_config {
      enable_secure_boot = var.machine_secure_boot
    }
  }
}

resource "google_compute_instance" "node" {
  count        = var.node_count
  name         = "${local.name_prefix}-${count.index + 1}"
  machine_type = var.machine_type

  allow_stopping_for_update = var.allow_stopping_for_update

  shielded_instance_config {
    enable_secure_boot = var.machine_secure_boot
  }

  boot_disk {
    initialize_params {
      image = var.machine_image
      size  = var.disk_size
      type  = var.disk_type
    }
  }

  metadata = {
    enable-oslogin = "TRUE"
  }

  network_interface {
    network    = var.vpc
    subnetwork = var.subnet
  }

  service_account {
    scopes = concat(["storage-rw"], var.scopes)
  }

  lifecycle {
    ignore_changes = [
      min_cpu_platform
    ]
  }
}

Expected Behavior

The provider should gracefully handle any timing clashes caused by the cluster being deployed to the same subnet.

Actual Behavior

The provider creates the cluster at the same time as the VMs. As a result, the VMs get a 400 error from the API while the cluster edits the subnet to add more ranges.

Steps to Reproduce

  1. Configure a VPC-native cluster and several VMs to deploy on the same subnet
  2. Attempt an apply and notice that the VMs sometimes fail to deploy with the 400 error above

References

b/300616739

gygitlab added the bug label Jan 25, 2022
shuyama1 (Collaborator) commented Jan 27, 2022

Hi @grantyoung. We should already retry when APIs return this error. Would you mind sharing your debug log?

gygitlab (Author) commented Feb 2, 2022

Hi @shuyama1. The log can be seen here, thanks.

github-actions bot added the service/container and forward/review labels Aug 17, 2023
edwardmedia removed the forward/review label Sep 14, 2023
rileykarson (Collaborator) commented

We run into this in our nightly tests a lot. It's definitely not service-specific, so I'm gonna reclassify it as provider-wide.

rileykarson added the service/terraform, persistent-bug, and test-failure labels and removed the service/container, bug, and forward/linked labels Sep 29, 2023
SarahFrench (Member) commented

Discussion from triage: a possible way to fix this issue is to implement a retry in the provider.

SarahFrench added this to the Goals milestone Oct 2, 2023
melinath (Collaborator) commented Oct 5, 2023

We already have a retry in the provider for exactly this case:

func isSubnetworkUnreadyError(err error) (bool, string) {

Based on test logs, it looks like the retry gets called repeatedly throughout a test, presumably until some limit is hit (hopefully not a timeout). I'll look into whether it's possible to add backoff and jitter if those aren't already present, or to increase the number of retries / timeout.
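
For context, here is a minimal sketch of what a predicate like this plausibly looks like; the real implementation lives in the provider's retry transport and may differ, and the exact error-body matching below is an assumption based on the 400 resourceNotReady error reported above:

package transport

import (
	"strings"

	"google.golang.org/api/googleapi"
)

// Sketch only: retry predicates take an error and report whether the
// request should be retried, plus a short reason for logging. The
// substrings matched here are an assumption based on this issue's error.
func isSubnetworkUnreadyError(err error) (bool, string) {
	gerr, ok := err.(*googleapi.Error)
	if !ok {
		return false, ""
	}
	if gerr.Code == 400 && strings.Contains(gerr.Body, "resourceNotReady") &&
		strings.Contains(gerr.Body, "subnetworks") {
		return true, "Subnetwork not ready"
	}
	return false, ""
}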

rileykarson (Collaborator) commented Oct 5, 2023

I noticed in TestAccComputeInstance_resourcePolicyUpdate (in this execution) we're hitting a context deadline really early: 2m15s instead of 20m or so. Maybe we're attaching a short one to the retry transport?
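
As a generic illustration of that hypothesis (not the provider's actual code): once the context carries a short deadline, a retry loop bails out with ctx.Err() regardless of how many retries would otherwise remain.

package example

import (
	"context"
	"time"
)

// retryWithBackoff keeps retrying attempt() while shouldRetry says so, but a
// short deadline on ctx cuts it off early (e.g. at 2m15s) even though the
// resource-level timeout would allow around 20m.
func retryWithBackoff(ctx context.Context, attempt func() error, shouldRetry func(error) bool) error {
	backoff := time.Second
	for {
		err := attempt()
		if err == nil || !shouldRetry(err) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // surfaces as "context deadline exceeded"
		case <-time.After(backoff):
			backoff *= 2 // simple exponential backoff; jitter omitted
		}
	}
}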

@shuyama1 shuyama1 self-assigned this Oct 9, 2023
modular-magician added a commit to modular-magician/terraform-provider-google that referenced this issue Jun 18, 2024
[upstream:ade0a1ec36b97ef2853044110fea0cdd4bec6383]

Signed-off-by: Modular Magician <magic-modules@google.com>
modular-magician added a commit that referenced this issue Jun 18, 2024
[upstream:ade0a1ec36b97ef2853044110fea0cdd4bec6383]

Signed-off-by: Modular Magician <magic-modules@google.com>