
Hive Cluster Relocation is stuck on kube-root-ca.crt and openshift-service-ca.crt already exists #1932

Closed
TehreemNisa opened this issue Dec 20, 2022 · 5 comments · Fixed by #1937


@TehreemNisa

Hive Cluster Relocation is stuck because kube-root-ca.crt and openshift-service-ca.crt already exist. These ConfigMaps are created by default when a namespace is created, and the Hive ClusterDeployment fails with this error:

    - lastProbeTime: '2022-12-20T11:03:48Z'
      lastTransitionTime: '2022-12-20T10:08:51Z'
      message: >-
        failed to copy *v1.ConfigMapList: could not copy *v1.ConfigMap resource
        "kube-root-ca.crt": failed to sync existing resource: configmaps
        "kube-root-ca.crt" already exists
      reason: MoveFailed
      status: 'True'
      type: RelocationFailed

Is there a way to ignore these ConfigMaps? We are stuck in a migration and looking forward to any help!

@2uasimojo
Member

2uasimojo commented Dec 22, 2022

Howdy! I'm looking into this, but lots of people have lots of vacation until January, so it may be a little bit.

In the meantime, can you please tell me how you are installing hive, and at what version? See the more complete list of questions below.

@2uasimojo
Member

Couple of things I've discovered/surmised:

  • This is a timing problem. The code path that attempts to copy ConfigMaps has logic like: Create() => if the error is AlreadyExists, Delete() and Create() again => if that errors too, bail (see the sketch after this list). The kube-controller-manager reconciles these ConfigMaps, so the error only happens if it manages to recreate them between that Delete() and Create(). I was able to do a relocate in my test bed, so it doesn't always happen. My logs show that it did indeed replace the objects, but the ConfigMaps contain the proper (destination) data, so the k-c-m did revert them afterward.
  • Why hasn't this come up before? Well, the relocate feature was merged in June 2020, when the supported OpenShift versions were 4.2 to 4.4 (source). The k-c-m code that creates these ConfigMaps was introduced in k8s 1.21 (citation needed -- this is hearsay from a Slack thread @wking found for me). The first OpenShift version to use k8s 1.21 was 4.8 (source), which GAed in July 2021 (same source as above). I'm not going to assert that y'all are the first to try using relocation in the past 1.5 years and/or 4-5 releases, but it's not impossible :) It's also possible others just didn't hit the timing problem.
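
For concreteness, here's a minimal client-go sketch of that create => delete => re-create flow; the function name and structure are mine for illustration, not Hive's actual source:

    package main

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // copyConfigMap (hypothetical name) copies cm into the destination
    // cluster c. If the object already exists, it is deleted and created
    // again; a second AlreadyExists means something recreated it in
    // between, and we bail.
    func copyConfigMap(ctx context.Context, c kubernetes.Interface, cm *corev1.ConfigMap) error {
        _, err := c.CoreV1().ConfigMaps(cm.Namespace).Create(ctx, cm, metav1.CreateOptions{})
        if err == nil || !apierrors.IsAlreadyExists(err) {
            return err
        }
        // The object already exists on the destination: delete it...
        if err := c.CoreV1().ConfigMaps(cm.Namespace).Delete(ctx, cm.Name, metav1.DeleteOptions{}); err != nil {
            return err
        }
        // ...and race the kube-controller-manager to create it again.
        // Losing the race surfaces here as AlreadyExists, which fails
        // the relocate.
        _, err = c.CoreV1().ConfigMaps(cm.Namespace).Create(ctx, cm, metav1.CreateOptions{})
        return err
    }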

I'm afraid I don't have a workaround for you. I'm looking into code changes that would help. In the meantime, the below would really help me out:

  • How are you installing hive? From OperatorHub or $other? (asked above, copied here for completeness)
  • What version of hive is running? Output of oc logs -n hive deploy/hive-operator | head -1 (s/-n hive/appropriate namespace/ if needed). On both sides of the relocate, please. (ditto)
  • OpenShift/k8s versions on both sides. Output of oc version.
  • Logs from the clusterrelocate controller on both sides. Output of oc logs -n hive deploy/hive-controllers | grep controller=clusterRelocate (s/// namespace again if necessary).
  • Does the ClusterDeployment exist on the destination? Long shot; but if it does, it may just mean that we have a code path that does the right things, but doesn't clean up status properly.

@2uasimojo
Member

We track bugs in Jira, so I've created https://issues.redhat.com/browse/HIVE-2080 for this one. Please feel free to follow along there.

@TehreemNisa
Author

  • Hive was installed via the RHACM MultiClusterHub: RHACM 2.4 was installed via OperatorHub on the source cluster, and RHACM 2.6 via OperatorHub on the target cluster.
  • The source cluster is turned off, but there was no hive-operator in the hive namespace (the hive-controllers and hiveadmission deployments were present).
  • The OpenShift version of both clusters is 4.9.
  • We turned off the source cluster, so we don't have the logs anymore.
  • The ClusterDeployment did not move; the relocation stopped at the ConfigMap error.

From the source cluster we successfully moved 4 OpenShift clusters (ClusterDeployments/ConfigMaps/Secrets) to the new hive, but one got stuck. I tried many times, but the last one stayed stuck because it kept saying the ConfigMaps already exist. I tried updating the ConfigMap and deleting the namespace to beat the race condition, but nothing helped.

Yes, I know the per-namespace ConfigMap creation feature came in 1.20, but I strongly feel it should be handled. Since we had to turn off the old hub, we just imported the cluster into RHACM for now and turned the old hub off without moving the ClusterDeployment and MachinePools.

It's great that you created a bug; please give us updates on it so we can follow up here.

@newtonheath

Tracking via https://issues.redhat.com/browse/HIVE-2080

2uasimojo added a commit to 2uasimojo/hive that referenced this issue Jan 12, 2023
Starting in k8s 1.21, the kube-controller-manager (and presumably
OpenShift's fork thereof) creates/reconciles `kube-root-ca.crt` and
`openshift-service-ca.crt` ConfigMaps in every namespace. On relocate,
as we're copying objects, we'll try to copy these from the source to the
destination cluster, fail because the object (with the same name)
already exists, try to delete it, and then race with the kcm trying to
create it again. If the relocate controller loses the race, it'll fail
the relocate and try again. It might succeed eventually if it wins the
race, but in at least one case (see associated issue/card) it never won.

With this commit, we ignore these two ConfigMaps, based on their names
(hardcoded).

HIVE-2080

closes openshift#1932
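
For illustration, a minimal sketch of the name-based skip the commit message describes; the variable and helper names are hypothetical, and the actual change is in openshift#1937:

    package main

    import corev1 "k8s.io/api/core/v1"

    // The two kube-controller-manager-reconciled ConfigMaps, matched by
    // their hardcoded names per the commit message above.
    var ignoredConfigMaps = map[string]bool{
        "kube-root-ca.crt":         true,
        "openshift-service-ca.crt": true,
    }

    // shouldCopyConfigMap (hypothetical helper) reports whether the
    // relocate controller should attempt to copy the given ConfigMap
    // to the destination cluster.
    func shouldCopyConfigMap(cm *corev1.ConfigMap) bool {
        return !ignoredConfigMaps[cm.Name]
    }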