CORS-3260: CAPI: Create GCP Internal LB #8151

Merged
merged 5 commits into from Apr 23, 2024

Conversation

patrickdillon
Contributor

Adds an internal passthrough LB using the instance groups created by CAPG.

Currently uses a patch in vendored CAPG as a shortcut; see the commit message for more detail. To use this, make sure to delete the capg binary from cluster-api/bin so that the provider will be rebuilt with the patch.

It should be possible to move this functionality into the installer; I will test that ASAP.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2024
Contributor

openshift-ci bot commented Mar 13, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@bfournie
Contributor

/label platform/google

@patrickdillon patrickdillon changed the title WIP: CAPI: Create GCP Internal LB CAPI: Create GCP Internal LB Mar 18, 2024
@@ -5,6 +5,7 @@ import (
"os"
"path/filepath"

"github.com/Azure/go-autorest/autorest/to"
Contributor Author

I will change this to a different package

Contributor

Can this one be used?
"k8s.io/utils/pointer"
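
For context on the suggestion above: both packages are typically used for the same small job, taking the address of a literal so it can be assigned to an optional (pointer) field. The sketch below shows that pattern without either dependency. The generic helper ptrTo and the healthCheck struct are hypothetical names for illustration only and are not taken from this PR or from either package.

```go
package main

import "fmt"

// ptrTo returns a pointer to a copy of v.
// Hypothetical helper: it plays the role that a pointer-helper package
// would play, without pulling in a cloud-specific dependency.
func ptrTo[T any](v T) *T {
	return &v
}

// healthCheck is a simplified stand-in for an API struct whose optional
// fields are pointers. It is not the real GCP type.
type healthCheck struct {
	Description *string
	Port        *int64
}

func main() {
	hc := healthCheck{
		Description: ptrTo("Created By OpenShift Installer"),
		Port:        ptrTo(int64(6443)),
	}
	fmt.Println(*hc.Description, *hc.Port)
}
```

Whichever helper package is chosen, the call sites look the same; the review question is only about which dependency a GCP code path should import.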

@patrickdillon patrickdillon changed the title CAPI: Create GCP Internal LB WIP: CAPI: Create GCP Internal LB Mar 19, 2024
@patrickdillon
Contributor Author

The MAO is incompatible with the external proxy load balancer (MAO requires target pools, which only work with passthrough LB).

I just pushed some changes that try a few workarounds, to see whether we can get a non-production-ready PoC working.

24143b2 has some temporary workarounds that rename the CAPG instance group in an attempt to make it compatible with the MAO instance group (an instance can only be in a single instance group).

Only the first master is provisioning correctly. It is still unclear why the other machines are failing.

Master-1 and master-2 are failing with:

Status:
  Conditions:
    Last Transition Time:  2024-03-19T18:55:59Z
    Status:                True
    Type:                  Drainable
    Last Transition Time:  2024-03-19T18:55:59Z
    Message:               Instance has not been created
    Reason:                InstanceNotCreated
    Severity:              Warning
    Status:                False
    Type:                  InstanceExists
    Last Transition Time:  2024-03-19T18:55:59Z
    Status:                True
    Type:                  Terminable
  Error Message:           error launching instance: googleapi: Error 409: The resource 'projects/openshift-dev-installer/zones/us-east1-b/instances/padillon-03191440-2hlmb-master-1' already exists, alreadyExists
  Error Reason:            InvalidConfiguration
  Last Updated:            2024-03-19T18:56:09Z
  Phase:                   Failed
  Provider Status:
    Conditions:
      Last Transition Time:  2024-03-19T18:56:09Z
      Message:               googleapi: Error 409: The resource 'projects/openshift-dev-installer/zones/us-east1-b/instances/padillon-03191440-2hlmb-master-1' already exists, alreadyExists
      Reason:                MachineCreationFailed
      Status:                False
      Type:                  MachineCreated
    Metadata:
Events:
  Type     Reason        Age   From           Message
  ----     ------        ----  ----           -------
  Warning  FailedCreate  39m   gcpcontroller  padillon-03191440-2hlmb-master-1: reconciler failed to Create machine: error launching instance: googleapi: Error 409: The resource 'projects/openshift-dev-installer/zones/us-east1-b/instances/padillon-03191440-2hlmb-master-1' already exists, alreadyExists

Workers appear to be created, but then cannot be found (?):

Status:
  Addresses:
    Address:  10.0.128.2
    Type:     InternalIP
    Address:  padillon-03191440-2hlmb-worker-b-b6msr.us-east1-b.c.openshift-dev-installer.internal
    Type:     InternalDNS
    Address:  padillon-03191440-2hlmb-worker-b-b6msr.c.openshift-dev-installer.internal
    Type:     InternalDNS
    Address:  padillon-03191440-2hlmb-worker-b-b6msr
    Type:     InternalDNS
  Conditions:
    Last Transition Time:  2024-03-19T18:55:58Z
    Status:                True
    Type:                  Drainable
    Last Transition Time:  2024-03-19T18:56:11Z
    Message:               Instance not found on provider
    Reason:                InstanceMissing
    Severity:              Warning
    Status:                False
    Type:                  InstanceExists
    Last Transition Time:  2024-03-19T18:55:58Z
    Status:                True
    Type:                  Terminable
  Error Message:           can't find created instance
  Last Updated:            2024-03-19T18:56:11Z
  Phase:                   Failed
  Provider Status:
    Conditions:
      Last Transition Time:  2024-03-19T18:56:02Z
      Message:               machine successfully created
      Reason:                MachineCreationSucceeded
      Status:                True
      Type:                  MachineCreated
    Instance Id:             padillon-03191440-2hlmb-worker-b-b6msr
    Instance State:          Unknown
    Metadata:
Events:
  Type     Reason        Age   From           Message
  ----     ------        ----  ----           -------
  Warning  FailedCreate  40m   gcpcontroller  padillon-03191440-2hlmb-worker-b-b6msr: reconciler failed to Create machine: requeue in: 20s
  Warning  FailedUpdate  40m   gcpcontroller  padillon-03191440-2hlmb-worker-b-b6msr: reconciler failed to Update machine: requeue in: 20s

@patrickdillon patrickdillon force-pushed the gcp-capi-int-lb branch 4 times, most recently from 980c36b to 9dbd867 Compare March 27, 2024 15:33
@patrickdillon patrickdillon force-pushed the gcp-capi-int-lb branch 2 times, most recently from ff26926 to e29ed72 Compare April 6, 2024 12:24
@patrickdillon patrickdillon marked this pull request as ready for review April 6, 2024 12:24
@openshift-ci openshift-ci bot requested review from bfournie and r4f4 April 6, 2024 12:25
@patrickdillon patrickdillon changed the title WIP: CAPI: Create GCP Internal LB CORS-3260: CAPI: Create GCP Internal LB Apr 6, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 6, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 6, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 6, 2024

@patrickdillon: This pull request references CORS-3260 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.


@@ -85,9 +109,74 @@ func createInternalLBAddress(ctx context.Context, in clusterapi.InfraReadyInput)
},
}

-if _, err := service.HealthChecks.Insert(in.InstallConfig.Config.GCP.ProjectID, healthCheck).Context(ctx).Do(); err != nil {
+_, err = service.RegionHealthChecks.Insert(projectID, region, healthCheck).Context(ctx).Do()
+if err != nil {
Contributor

I'm wondering if we need to augment the delete code. I see this health check getting created, but when I destroy the cluster I don't see it being deleted. Only this health check resource remains, e.g.:

$ gcloud compute health-checks describe https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/regions/us-east1/healthChecks/bfournie-capg-test-gn6g7-api-internal
checkIntervalSec: 2
creationTimestamp: '2024-04-06T10:28:41.909-07:00'
description: Created By OpenShift Installer
healthyThreshold: 3
httpsHealthCheck:
  port: 6443
  proxyHeader: NONE
  requestPath: /readyz
id: '7659072829660084390'
kind: compute#healthCheck
name: bfournie-capg-test-gn6g7-api-internal
region: https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/regions/us-east1
selfLink: https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/regions/us-east1/healthChecks/bfournie-capg-test-gn6g7-api-internal
timeoutSec: 2
type: HTTPS
unhealthyThreshold: 3

Contributor Author

Added 7de950d to delete HTTPS health checks. Before we were only handling HTTP health checks. Have not tested yet, but will now.

Contributor Author

Hm, no. I tested 7de950d and I don't think this is the right approach. The CAPG-created HTTPS health check is getting deleted, but the internal one is still not. I will debug that soon.

Also my install failed with:

ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed preparing ignition data: ignition failed to provision storage: failed to create storage: failed to create bucket: googleapi: Error 409: Your previous request to create the named bucket succeeded and you already own it., conflict 

I suspect #8056 broke the capg flow. #8056 adds ignition bucket creation in tfvars, which is still executed in the capi flow, so I think the same logic is being run twice and we hit this error. cc @barbacbd

Contributor Author

I left further comments on how to fix this in #8056 (comment).


// TODO: the subnet is only relevant for internal load balancer
op, err := service.BackendServices.Patch(projectID, extBesvcName, extBesvc).Context(ctx).Do()
Contributor

I know this code will probably go away when we create the load balancers in CAPG, but until then it would be useful to have a comment here explaining why this patch is necessary.

@patrickdillon
Contributor Author

Added f91a7cb, which handles not setting the instance group on masters, and created #8238 with the workaround for MAPI/CAPG instance group compatibility.

By default, CAPG creates a LB forwarding rule for port 443. Update
to 6443 for OpenShift.

Creates an internal passthrough LB to serve the API and machine
configs to the cluster. Utilizes the instance groups created by
CAPG.

CAPG sets the APIServerPort to 443 unless it is explicitly set.
Explicitly set it to 6443, the default for OpenShift.

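
The defaulting behavior described in this commit message can be sketched as follows. This is a minimal illustration, assuming the port is carried as an optional pointer field; the clusterNetwork struct, the constants, and effectiveAPIPort are hypothetical names for this sketch, not CAPG or installer code.

```go
package main

import "fmt"

const (
	// Port assumed when no API server port is set (per the commit message).
	capgDefaultAPIPort int32 = 443
	// Port the OpenShift API server listens on.
	openshiftAPIPort int32 = 6443
)

// clusterNetwork is a simplified stand-in for a cluster network spec
// with an optional API server port.
type clusterNetwork struct {
	APIServerPort *int32
}

// effectiveAPIPort returns the port that would be used for the
// forwarding rule: the explicit value if present, else the default.
func effectiveAPIPort(n clusterNetwork) int32 {
	if n.APIServerPort == nil {
		return capgDefaultAPIPort
	}
	return *n.APIServerPort
}

func main() {
	unset := clusterNetwork{}

	port := openshiftAPIPort
	explicit := clusterNetwork{APIServerPort: &port}

	fmt.Println("unset:", effectiveAPIPort(unset))
	fmt.Println("explicit:", effectiveAPIPort(explicit))
}
```

Leaving the field unset silently yields 443, which is why the manifest has to carry 6443 explicitly.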
CAPG installs will only use backend services, not target pools, so
we need to remove target pools from the machine and control plane
machineset manifests; otherwise the machine-api operator throws
an error looking for a target pool that doesn't exist.

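
As a rough illustration of the manifest change described in this commit message, the sketch below clears target pool references from a set of provider specs. The providerSpec struct and stripTargetPools function are hypothetical and heavily simplified; the real manifests and the installer's handling of them are not shown in this PR conversation.

```go
package main

import "fmt"

// providerSpec is a simplified, hypothetical stand-in for the provider
// spec embedded in a machine or control plane machineset manifest.
type providerSpec struct {
	Name        string
	TargetPools []string
}

// stripTargetPools clears target pool references in place and returns
// how many specs were changed.
func stripTargetPools(specs []*providerSpec) int {
	changed := 0
	for _, s := range specs {
		if len(s.TargetPools) > 0 {
			s.TargetPools = nil
			changed++
		}
	}
	return changed
}

func main() {
	specs := []*providerSpec{
		{Name: "master-0", TargetPools: []string{"demo-api"}},
		{Name: "worker-a"}, // workers in this sketch have no target pools
	}
	n := stripTargetPools(specs)
	fmt.Println("specs changed:", n)
	fmt.Println("master-0 target pools:", len(specs[0].TargetPools))
}
```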
@bfournie
Contributor

/approve
I tested this and verified that the healthcheck is now being deleted.

Contributor

openshift-ci bot commented Apr 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bfournie


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 15, 2024
This commit adds the capability to delete regional health checks
along with the global health checks. To do this, it refactors the
health-check code so that the logic essentially remains the same,
but different services (health check or regional health check) can
be injected.
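
The refactor described in this commit message, where one deletion routine works against either the global or the regional service, can be sketched with a small interface. Everything here is an assumption made for illustration: the healthCheckService interface, the in-memory fakeService, and the name-prefix filter are hypothetical and do not reproduce the installer's destroy code or the GCP client API.

```go
package main

import (
	"fmt"
	"strings"
)

// healthCheckService is a hypothetical abstraction over the global and
// regional health check APIs: both can list and delete by name.
type healthCheckService interface {
	List() ([]string, error)
	Delete(name string) error
}

// fakeService is an in-memory stand-in used instead of a real API client.
type fakeService struct {
	scope string
	names []string
}

// List returns a copy so callers can delete while iterating.
func (f *fakeService) List() ([]string, error) {
	return append([]string(nil), f.names...), nil
}

func (f *fakeService) Delete(name string) error {
	for i, n := range f.names {
		if n == name {
			f.names = append(f.names[:i], f.names[i+1:]...)
			return nil
		}
	}
	return fmt.Errorf("%s health check %q not found", f.scope, name)
}

// deleteHealthChecks removes every health check whose name starts with
// the cluster's infra ID (an assumed ownership filter), using whichever
// service was injected. It returns the number of deleted checks.
func deleteHealthChecks(svc healthCheckService, infraID string) (int, error) {
	names, err := svc.List()
	if err != nil {
		return 0, err
	}
	deleted := 0
	for _, n := range names {
		if !strings.HasPrefix(n, infraID+"-") {
			continue
		}
		if err := svc.Delete(n); err != nil {
			return deleted, err
		}
		deleted++
	}
	return deleted, nil
}

func main() {
	global := &fakeService{scope: "global", names: []string{"demo-abc12-api", "other-xyz99-api"}}
	regional := &fakeService{scope: "regional", names: []string{"demo-abc12-api-internal"}}

	// The same logic runs against both services.
	for _, svc := range []healthCheckService{global, regional} {
		n, err := deleteHealthChecks(svc, "demo-abc12")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("deleted:", n)
	}
}
```

Injecting the service keeps the listing and filtering logic in one place, so a leftover regional check like the one reported earlier in this thread is handled by the same code path as the global ones.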
Contributor

openshift-ci bot commented Apr 18, 2024

@patrickdillon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-agent-compact-ipv4 | 12bfe75 | link | true | /test e2e-agent-compact-ipv4
ci/prow/okd-e2e-aws-ovn-upgrade | 4765e43 | link | false | /test okd-e2e-aws-ovn-upgrade

@patrickdillon
Contributor Author

/uncc @r4f4
/cc @barbacbd

@openshift-ci openshift-ci bot requested review from barbacbd and removed request for r4f4 April 18, 2024 22:51
Contributor

@barbacbd barbacbd left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 0bbbb02 into openshift:master Apr 23, 2024
26 of 27 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-installer-altinfra-container-v4.16.0-202404222343.p0.g0bbbb02.assembly.stream.el8 for distgit ose-installer-altinfra.
All builds following this will include this PR.
