If openshift-infra is in terminating state when restarting server, nothing works #3274

derekwaynecarr · 2015-06-17T17:16:32Z

I had a population with the following:

500 projects, each with 1 rc, 3 pods, and the associated system stuff

I then fat-fingered the following:

osc delete projects --all

This marked all the projects as terminating, but it also marked openshift-infra as terminating.

At this point, there were a lot of replication controllers trying to create pods but being rejected by admission control, and that was fine. BUT when the openshift-infra namespace was being purged, it deleted the service accounts. The replication controller needs those service accounts to function, so this results in log messages that say the replication controller needs to present credentials.

I then was wondering what was going on, and saw that openshift-infra was terminating. It may have eventually terminated if I left it running, but there was a lot of client traffic contention, so I restarted openshift.

At this point, openshift had 498 projects terminating, and the openshift-infra project had no service accounts. You then get the following message in the log and openshift fails to start:

Jun 17 17:09:34 openshiftdev.local openshift[30991]: F0617 17:09:34.334335   30991 start_master.go:403] Could not get client for replication controller: Could not get token for openshift-infra/re...on-controller
Ju

I realize this was an operator error, but there is no way to recover, and I am strongly inclined to believe that I will not be the first operator to make this error.

I think one of the following:

1. policy should never allow that namespace to be deleted
2. admission control should never allow that namespace to be deleted (since it's essential to system function)

@smarterclayton @liggitt @deads2k - opinions on preferred route? I vote for 2.

liggitt · 2015-06-17T17:17:20Z

policy can't express denies, so you couldn't prevent a cluster admin from deleting via policy

derekwaynecarr · 2015-06-17T17:19:10Z

I think all of our controller clients will cease to work if these service accounts are deleted, no?

Also, it's not possible to make a 'terminating' namespace stop terminating unless there is a way to nil out a DeletionTimestamp that I am missing.

liggitt · 2015-06-17T17:20:25Z

> I think all of our controller clients will cease to work if these service accounts are deleted, no?

not all, but important ones

deads2k · 2015-06-17T17:21:32Z

I vote for option 2 as well. It might be useful to support a list of "protected" resources in the master-config.

liggitt · 2015-06-17T17:22:01Z

should probably move openshiftConfig.RunOriginNamespaceController() up to the special list of controllers that get started first

liggitt · 2015-06-17T17:22:31Z

that would have helped clean things up, though it still might have taken a couple restarts

deads2k · 2015-06-17T17:23:50Z

I don't think we have a controller ensuring that our serviceaccounts are always present (I think it's only on startup), so we might want to express "don't delete these serviceaccounts" as well.

liggitt · 2015-06-17T17:25:54Z

not sure I care down to that level... a restart would fix that and that's unlikely to happen as a mass delete

derekwaynecarr · 2015-06-17T19:06:03Z

I need to modify upstream to expose Name on admission control since an object being deleted has no 'object' on input.
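
A minimal sketch of why the name matters for this kind of check. The `Attributes` type below is a simplified, hypothetical stand-in (not the real upstream admission interface), and the contents of the `immortal` set are illustrative. It assumes a DELETE request carries no object body, so the plugin can only decide based on the resource name.

```go
package main

import (
	"errors"
	"fmt"
)

// Attributes is a hypothetical, simplified stand-in for the request
// attributes an admission plugin receives. It is not the upstream type.
type Attributes struct {
	Operation string      // "CREATE", "UPDATE", "DELETE", ...
	Resource  string      // e.g. "namespaces"
	Name      string      // name of the target object, set even on DELETE
	Object    interface{} // nil on DELETE, since no body is sent
}

// immortal lists namespaces that must never be deleted (illustrative values).
var immortal = map[string]bool{
	"default":         true,
	"openshift-infra": true,
}

var errImmortal = errors.New("namespace is protected and cannot be deleted")

// Admit rejects DELETE requests that target a protected namespace.
// It looks only at Name, because Object is nil for deletes.
func Admit(a Attributes) error {
	if a.Operation == "DELETE" && a.Resource == "namespaces" && immortal[a.Name] {
		return fmt.Errorf("%q: %w", a.Name, errImmortal)
	}
	return nil
}

func main() {
	reqs := []Attributes{
		{Operation: "DELETE", Resource: "namespaces", Name: "openshift-infra"},
		{Operation: "DELETE", Resource: "namespaces", Name: "project-42"},
	}
	for _, r := range reqs {
		fmt.Printf("delete %s -> %v\n", r.Name, Admit(r))
	}
}
```

Without a name on the attributes, the check above would have nothing to compare against, since `Object` is nil for a delete.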

smarterclayton · 2015-08-18T13:11:45Z

As a simple fix, let's have the project command ignore deletes for a set of whitelisted project names. An admin can still delete them with the namespace REST API, but this prevents stupid stuff.

liggitt · 2015-08-18T14:30:12Z

> the project command ignore deletes

the project API, you mean?

liggitt · 2015-08-18T14:31:33Z

also, we should move RunOriginNamespaceController() up to the special list of controllers that get started first, so it can clean things up even if the service account token fetcher has issues

smarterclayton · 2015-08-18T15:25:13Z

The API.

liggitt · 2015-08-18T15:39:16Z

Moving relevant discussion points from #4228:

- Move RunOriginNamespaceController() to first controller group
- Prevent delete project --all from deleting "default", "openshift-infra" (configurable), "kube-system" (maybe?), and others? Possible mechanisms:
  - annotation on a namespace. requires fetch before delete
  - special list of projects to disallow deleting via the API
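
To make the trade-off between the two mechanisms concrete, here is a small sketch. Everything in it is assumed for illustration: the annotation key `example.com/protected`, the `Namespace` struct, and the `getter` lookup function are hypothetical and not part of any real API.

```go
package main

import "fmt"

// protectedAnnotation is a hypothetical annotation key used for illustration.
const protectedAnnotation = "example.com/protected"

// Namespace is a simplified stand-in for a namespace object.
type Namespace struct {
	Name        string
	Annotations map[string]string
}

// getter fetches a namespace by name; a stand-in for an API lookup.
type getter func(name string) (*Namespace, bool)

// canDeleteByAnnotation must fetch the namespace before it can decide.
func canDeleteByAnnotation(get getter, name string) bool {
	ns, found := get(name)
	if !found {
		return true // nothing to protect
	}
	return ns.Annotations[protectedAnnotation] != "true"
}

// canDeleteByList decides from the name alone, with no fetch.
func canDeleteByList(disallowed map[string]bool, name string) bool {
	return !disallowed[name]
}

func main() {
	store := map[string]*Namespace{
		"openshift-infra": {
			Name:        "openshift-infra",
			Annotations: map[string]string{protectedAnnotation: "true"},
		},
		"project-1": {Name: "project-1"},
	}
	get := func(name string) (*Namespace, bool) {
		ns, ok := store[name]
		return ns, ok
	}
	disallowed := map[string]bool{"default": true, "openshift-infra": true}

	for _, name := range []string{"openshift-infra", "project-1"} {
		fmt.Println(name, canDeleteByAnnotation(get, name), canDeleteByList(disallowed, name))
	}
}
```

The annotation approach keeps the protection on the object itself but costs a read per delete; the list approach is cheaper but has to be configured separately.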

derekwaynecarr · 2015-08-28T18:46:13Z

We do have the concept of immortal namespaces in the NamespaceLifecycle admission controller.

liggitt · 2016-05-10T15:15:25Z

added openshift-infra to immortal namespaces list in #4318, I think we should close this

derekwaynecarr · 2016-06-09T01:07:55Z

I concur. Closing

derekwaynecarr · 2016-06-09T01:08:44Z

Long may openshift-infra live!

danmcp assigned derekwaynecarr Jun 17, 2015

danmcp added kind/bug Categorizes issue or PR as related to a bug. priority/P2 labels Jun 17, 2015

derekwaynecarr mentioned this issue Jun 17, 2015

Admission control attributes has access to resource name kubernetes/kubernetes#9975

Merged

smarterclayton added priority/P1 and removed priority/P2 labels Jun 19, 2015

bparees added the upcoming-release label Jul 20, 2015

bparees removed the upcoming-release label Aug 10, 2015

liggitt mentioned this issue Aug 18, 2015

Bug with deleting lots of namespaces and then restarting #4228

Closed

danmcp added priority/P2 and removed priority/P1 labels Aug 19, 2015

liggitt mentioned this issue Aug 21, 2015

Protect infra/shared resource namespaces #4318

Merged

timothysc mentioned this issue Feb 4, 2016

label deletion of all namespaces (openshift & openshift-infra) - cluster inoperable restart required. #7048

Closed

derekwaynecarr closed this as completed Jun 9, 2016
