
Degraded status when starting an OCP private cluster deployed on AWS #467

Closed
htkmts opened this issue Sep 25, 2020 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


htkmts commented Sep 25, 2020

When starting an OCP 4.3 private cluster deployed on AWS, the cluster ingress operator stays in a "degraded" state.
(By "private cluster", I mean the OCP cluster cannot access the internet.)

The operator appears to be trying to reach "https://tagging.us-east-1.amazonaws.com", and this seems to be what causes the problem.

Q1. Is there any workaround for this?
Q2. Is it MANDATORY for the operator to be able to access the internet? (That would make it impossible for any OpenShift cluster to be truly private...)

Thanks.

oc get dnsrecords -n openshift-ingress-operator -o yaml
  - message: |-
      The DNS provider failed to ensure the record: failed to find hosted zone for record: failed to get tagged resources: RequestError: send request failed
      caused by: Post https://tagging.us-east-1.amazonaws.com/: dial tcp 52.94.224.124:443: i/o timeout
    reason: ProviderError
    status: "True"
    type: Failed

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.25    True        False         59d     Error while reconciling 4.3.25: the cluster operator ingress is degraded
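
One possible mitigation for Q1 (a sketch only, assuming an internal egress proxy reachable from the cluster nodes exists; the hostname and port below are placeholders) is to route the operator's AWS API calls through OpenShift's cluster-wide proxy resource (proxies.config.openshift.io/cluster):

oc patch proxy/cluster --type=merge -p '{
  "spec": {
    "httpProxy":  "http://proxy.internal.example.com:3128",
    "httpsProxy": "http://proxy.internal.example.com:3128",
    "noProxy":    ".cluster.local,.svc,169.254.169.254"
  }
}'

With this in place, the ingress operator's requests to tagging.us-east-1.amazonaws.com would egress through the proxy instead of needing a direct Internet path, so the cluster subnets themselves can stay private.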

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label on Jun 2, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 2, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci bot closed this as completed on Aug 1, 2021

openshift-ci bot commented Aug 1, 2021

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rarguello

I have the same issue; this is the message in the installation log:

2021-09-22T15:21:08.439Z ERROR operator.init.controller controller/controller.go:218 Reconciler error
{"controller": "dns_controller", "name": "default-wildcard", "namespace": "openshift-ingress-operator",
 "error": "failed to create DNS provider: failed to create AWS DNS manager: failed to validate aws provider service endpoints: [
   failed to list route53 hosted zones: RequestError: send request failed
   caused by: Get "https://route53.amazonaws.com/2013-04-01/hostedzone?maxitems=1": dial tcp 52.46.154.111:443: i/o timeout,
   failed to get group tagging resources: RequestError: send request failed
   caused by: Post "https://tagging.us-east-1.amazonaws.com/": dial tcp 52.94.233.76:443: i/o timeout]"}

I'm trying to do an air-gapped IPI installation on AWS, using OKD 4.7.0-0.okd-2021-09-19-013247.

The machine that runs openshift-install needs Internet access, but all the instances in the VPC are on a private subnet without a NAT gateway, so they have no Internet access at all. I'm using an internal server as a registry mirror. I have configured EC2, S3, and ELB VPC endpoints; the S3 endpoint is a Gateway endpoint, and the other two are Interface endpoints.
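
For reference, endpoints like the ones described above can be created along these lines (a sketch only; the VPC, subnet, security-group, and route-table IDs are placeholders, and the service names are for us-east-1):

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ec2 \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.elasticloadbalancing \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0

Note that --private-dns-enabled is what makes the default AWS hostnames (e.g. ec2.us-east-1.amazonaws.com) resolve to the endpoint's private IPs from inside the VPC, which is what the operators rely on.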

I don't think you can have a VPC endpoint for "tagging", so that's what's failing at the end of the installation:

Post "https://tagging.us-east-1.amazonaws.com/": dial tcp 52.94.233.76:443: i/o timeout

Any ideas for a workaround?
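
One way to check whether AWS exposes an interface endpoint service for the tagging API in a given region (rather than assuming it doesn't) is to list the available endpoint services; the JMESPath filter below simply greps the service names for "tagging":

aws ec2 describe-vpc-endpoint-services \
  --region us-east-1 \
  --query 'ServiceNames[?contains(@, `tagging`)]'

If nothing comes back, the remaining options are along the lines of the proxy sketch earlier in this thread, or a firewall/NAT exception scoped to tagging.us-east-1.amazonaws.com only.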
