'kops create cluster' with public topology and terraform output fails to add route 53 terraform resource records for the api end point #16455

Closed
dkwgit opened this issue Apr 6, 2024 · 1 comment
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


dkwgit commented Apr 6, 2024

/kind bug

1. What kops version are you running? The command kops version, will display
this information.

1.28.4 (git-v1.28.4)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

v1.28.8

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

# Edit following four variables values to suit.
# You probably only need to replace the parent domain and do something to make the state bucket totally new and unique.
#
BUG_REPORT_AWS_REGION=us-east-1
BUG_REPORT_PARENT_DOMAIN=<YOUR_PARENT_DOMAIN_HERE> # Parent domain is a domain you control, with its hosted zone in Route53
BUG_REPORT_CLUSTER_BASE_NAME=bug-report-cluster
BUG_REPORT_KOPS_STATE_STORE_BUCKET="kops-state-store-public-terraform-missing-route53-api-endpoint" # A brand new bucket. Let's not tamper with real cluster info in real buckets

echo "Creating kops state bucket $BUG_REPORT_KOPS_STATE_STORE_BUCKET via 'aws s3api create-bucket'"
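# Note: regions other than us-east-1 also need --create-bucket-configuration LocationConstraint=$BUG_REPORT_AWS_REGION
aws s3api create-bucket --bucket "$BUG_REPORT_KOPS_STATE_STORE_BUCKET" --region "$BUG_REPORT_AWS_REGION" > /dev/null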

TOPOLOGY=public
SUB_DOMAIN="${BUG_REPORT_CLUSTER_BASE_NAME}-${TOPOLOGY}"  # Example: bug-report-cluster-public
CLUSTER_NAME="$SUB_DOMAIN.$BUG_REPORT_PARENT_DOMAIN"

# get parent domain's hosted zone id
PARENT_ZONE=$(aws route53 list-hosted-zones | jq -r ".HostedZones[] | select(.Name==\"$BUG_REPORT_PARENT_DOMAIN.\") | .Id")

printf "\nCreating child hosted zone and retrieving zone's name servers\n"
SUB_NAMESERVERS=($(aws route53 create-hosted-zone --name "$CLUSTER_NAME" --caller-reference "$(uuidgen)" | \
  jq -r '.DelegationSet.NameServers[]'))

# create delegation NS records for parent zone
cat > "bug_report_subdomain_ns_records_${CLUSTER_NAME}.json" << EOF
{
"Comment": "Create a subdomain NS record",
"Changes": [
  {
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "$CLUSTER_NAME",
      "Type": "NS",
      "TTL": 300,
      "ResourceRecords": [
        {"Value": "${SUB_NAMESERVERS[0]}"},
        {"Value": "${SUB_NAMESERVERS[1]}"},
        {"Value": "${SUB_NAMESERVERS[2]}"},
        {"Value": "${SUB_NAMESERVERS[3]}"}
      ]
    }
  }
]
}
EOF

printf "\nApplying child hosted zone name servers to parent zone for delegation\n"
aws route53 change-resource-record-sets \
  --hosted-zone-id $PARENT_ZONE \
  --change-batch "file://bug_report_subdomain_ns_records_${CLUSTER_NAME}.json"

printf "\nGenerating 'kops create cluster' terraform output for cluster $CLUSTER_NAME\n"
kops create cluster \
  --cloud aws \
  --name $CLUSTER_NAME \
  --zones ${BUG_REPORT_AWS_REGION}a \
  --control-plane-size t4g.small \
  --control-plane-count 1 \
  --topology $TOPOLOGY \
  --state "s3://$BUG_REPORT_KOPS_STATE_STORE_BUCKET" \
  --out "terraform/$SUB_DOMAIN" \
  --target terraform \
  -v 10 | tee create_cluster.log

printf "\nDone generating terraform output. CLUSTER_NAME for TOPOLOGY $TOPOLOGY is: $CLUSTER_NAME. Terraform files are at ./terraform/$SUB_DOMAIN\n"

5. What happened after the commands executed?

If you go to ./terraform/bug-report-cluster-public (or wherever the Terraform output was written, depending on your settings) and open kubernetes.tf in an editor, you can search for aws_route53_record resources for the cluster's API endpoint. None exist.
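The same check can be done from the command line instead of an editor. This is a minimal sketch, not part of kops: the function name count_route53_records is hypothetical, and it assumes kubernetes.tf uses the usual `resource "<type>" "<name>"` block syntax.

```shell
# Hypothetical helper: count aws_route53_record resource blocks in a Terraform file.
count_route53_records() {
  # grep -c prints the number of matching lines; it exits 1 when that number is 0,
  # so "|| true" keeps the function from reporting a failure in that case.
  grep -cE '^[[:space:]]*resource[[:space:]]+"aws_route53_record"' "$1" || true
}

# Path taken from the reproduction above; adjust it if --out pointed elsewhere.
TF_FILE="terraform/bug-report-cluster-public/kubernetes.tf"
if [ -f "$TF_FILE" ]; then
  echo "aws_route53_record resources in $TF_FILE: $(count_route53_records "$TF_FILE")"
else
  echo "No Terraform output found at $TF_FILE"
fi
```

A count of 0 matches what is described above; any other count means the records are present in the generated Terraform.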

This means that when the cluster is spun up with terraform apply, the A/AAAA records for the API endpoint are not created.

From what I observed, nothing else publishes these records during spin-up, so the cluster never comes up and spin-up hangs indefinitely.

Running validation during the failed spin-up prints the following every ten seconds, for as long as the validation is running:

W0405 14:20:36.710742 261561 validate_cluster.go:184] (will retry): unexpected error during validation: unable to resolve Kubernetes cluster API URL dns: lookup on 8.8.8.8:53: no such host
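The "no such host" error can be checked independently of validation by querying the same resolver directly. A rough sketch, assuming dig is installed and that the API name is api.<cluster name>, as masterPublicName in the manifest suggests; the helper names api_hostname and api_resolves are hypothetical.

```shell
# Hypothetical helpers, not part of kops.
api_hostname() {
  # Assumption: the API endpoint is api.<cluster name> (see masterPublicName in the manifest).
  printf 'api.%s\n' "$1"
}

api_resolves() {
  # dig exits 0 even for NXDOMAIN, so test for a non-empty answer instead.
  [ -n "$(dig +short @8.8.8.8 "$1" A 2>/dev/null)" ]
}

# Only runs when CLUSTER_NAME is set, e.g. by the reproduction script above.
if [ -n "${CLUSTER_NAME:-}" ]; then
  host=$(api_hostname "$CLUSTER_NAME")
  if api_resolves "$host"; then
    echo "$host resolves"
  else
    echo "$host does not resolve yet"
  fi
fi
```

If the name resolves here while validation still fails, the problem is more likely in reaching the API server than in the missing records.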

6. What did you expect to happen?

Successful cluster spin up with the API endpoint A/AAAA records published in Route 53.

If you run the exact same reproducing commands with TOPOLOGY=private and then inspect the Terraform output, the aws_route53_record resources for the API endpoint ARE present. When terraform apply is run, the cluster spins up successfully.
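To compare the two outputs without reading both files, one option is to list the resource types that appear only in the private-topology Terraform. A sketch under the same assumptions about the file layout; resource_types and only_in_second are hypothetical names.

```shell
# Hypothetical helper: print the distinct Terraform resource types declared in a file.
resource_types() {
  grep -oE '^[[:space:]]*resource[[:space:]]+"[^"]+"' "$1" \
    | sed -E 's/.*"([^"]+)"$/\1/' \
    | sort -u
}

# Hypothetical helper: print resource types present in the second file but not the first.
only_in_second() (
  first=$(mktemp)
  second=$(mktemp)
  resource_types "$1" > "$first"
  resource_types "$2" > "$second"
  # comm -13 suppresses lines unique to the first file and lines common to both.
  comm -13 "$first" "$second"
  rm -f "$first" "$second"
)

# Paths follow the naming used in the reproduction; adjust them to your --out settings.
PUBLIC_TF="terraform/bug-report-cluster-public/kubernetes.tf"
PRIVATE_TF="terraform/bug-report-cluster-private/kubernetes.tf"
if [ -f "$PUBLIC_TF" ] && [ -f "$PRIVATE_TF" ]; then
  only_in_second "$PUBLIC_TF" "$PRIVATE_TF"
fi
```

A private topology is expected to add other resource types as well, so aws_route53_record would be one entry among several in the output.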

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2024-04-06T19:25:15Z"
  name: bug-report-cluster-public.<DOMAIN_REDACTED>
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://kops-state-store-public-terraform-missing-route53-api-endpoint/bug-report-cluster-public.<DOMAIN_REDACTED>
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-us-east-1a
      name: a
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-us-east-1a
      name: a
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeProxy:
    enabled: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.28.6
  masterPublicName: api.bug-report-cluster-public.<DOMAIN_REDACTED>
  networkCIDR: 172.20.0.0/16
  networking:
    cilium:
      enableNodePort: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  subnets:
  - cidr: 172.20.0.0/16
    name: us-east-1a
    type: Public
    zone: us-east-1a
  topology:
    dns:
      type: Public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-04-06T19:25:16Z"
  labels:
    kops.k8s.io/cluster: bug-report-cluster-public.<DOMAIN_REDACTED>
  name: control-plane-us-east-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-arm64-server-20240126
  machineType: t4g.small
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-04-06T19:25:16Z"
  labels:
    kops.k8s.io/cluster: bug-report-cluster-public.<DOMAIN_REDACTED>
  name: nodes-us-east-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240126
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - us-east-1a

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

https://gist.github.com/dkwgit/65ef9ac485a1ee8579279a0816915cc0

9. Anything else do we need to know?

I put a bash script on the same gist, https://gist.github.com/dkwgit/65ef9ac485a1ee8579279a0816915cc0#file-demonstrate_bug-bash, at the bottom of the gist.

This script makes it easy to run everything first for a public topology and then for a private topology, so that you can inspect and compare the Terraform output for each. Everything is packaged in a bash function, so you can run:

  1. bug_report_do_terraform_output "public"
  2. bug_report_do_terraform_output "private"

Running it for private shows that the Route 53 records are created. The script also contains a function, bug_report_clean_all, that cleans up everything related to the Route 53 child zones for both the public and private runs and deletes the kops state bucket.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 6, 2024

dkwgit commented Apr 6, 2024

I am now seeing that the clusters (public or private) are creating records in Route 53, but I am still having problems accessing them. I think I made a mistake somewhere in diagnosing the problem, most likely an error on my end, so I am closing this pending further diagnosis. Currently declaring PEBCAK.

@dkwgit dkwgit closed this as completed Apr 6, 2024