Issue 41 #131

Merged 11 commits into master on Mar 8, 2018

Conversation

@tbarnes-us commented on Mar 2, 2018

Status:

Dev complete and all acceptance tests pass

Changes:

(1) Add the weblogic.domainUID label to the various k8s resources that weren't already labeled.

(2) Add a new weblogic.createdByOperator label for operator-created resources, and modify the operator's label searches to look for weblogic.createdByOperator in addition to a specific weblogic.domainUID. This prevents the operator from changing, deleting, or watching resources that it doesn't own.
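
For example, with both labels in place, the operator-owned resources for a single domain can be listed with one label-selector query. A sketch only; the 'true' value for weblogic.createdByOperator is an assumption here, not necessarily the literal value the operator sets:

  kubectl get pods,services --all-namespaces \
    -l weblogic.domainUID=domain1,weblogic.createdByOperator=true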

(3) Add a new kubernetes/delete-domain.sh script that takes advantage of the labels to delete everything associated with the domain-uid(s) supplied on the command line. Usage:

[adc01jjm weblogic-kubernetes-operator]$ kubernetes/delete-domain.sh       
  Usage:

    delete-domain.sh -d domain-uid,domain-uid,... [-s max-seconds] [-t]
    delete-domain.sh -d all [-s max-seconds] [-t]
    delete-domain.sh -h

  Perform a best-effort delete of the kubernetes resources for
  the given domain(s), and retry until either max-seconds is reached
  or all resources were deleted (default 120 seconds).

  The domains can be specified as a comma-separated list of 
  domain-uids (no spaces), or the keyword 'all'.  The domains can be
  located in any kubernetes namespace.

  Specify '-t' to run the script in a test mode which will
  show kubernetes commands but not actually perform them.

  The script runs in three phases:  

    Phase 1:  Set the startupControl of each domain to NONE if
              it's not already NONE.  This should cause each
              domain's operator to initiate a controlled shutdown
              of the domain.  Immediately proceed to phase 2.

    Phase 2:  Wait up to half of max-seconds for WebLogic
              Server pods to exit normally, and then proceed
              to phase 3.

    Phase 3:  Periodically delete all remaining kubernetes resources
              for the specified domains, including any pods
              leftover from phase 2.  Exit and fail if max-seconds
              is exceeded and there are any leftover kubernetes
              resources.

  This script exits with a zero status on success, and a 
  non-zero status on failure.
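
For reference, the phase 1 step is roughly what the following kubectl command would do by hand (the domain1-ns namespace is illustrative; the script locates each domain by its labels):

  kubectl patch domain domain1 -n domain1-ns \
    --type merge -p '{"spec":{"startupControl":"NONE"}}'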

Sample run:

[adc01jjm weblogic-kubernetes-operator]$ kubernetes/delete-domain.sh -d all
@@ Deleting kubernetes resources with label weblogic.domainUID 'all'.
@@ 24 resources remaining after 2 seconds, including 4 WebLogic Server pods. Max wait is 120 seconds.
@@ Setting startupControl to NONE on each domain (this should cause operator(s) to initiate a controlled shutdown of the domain's pods.)
domain "domain1" patched
@@ Waiting for operator to shutdown pods (will wait for no more than half of max wait seconds before directly deleting them).
@@ 24 resources remaining after 7 seconds, including 4 WebLogic Server pods. Max wait is 120 seconds.
@@ Waiting for operator to shutdown pods (will wait for no more than half of max wait seconds before directly deleting them).
@@ 24 resources remaining after 13 seconds, including 4 WebLogic Server pods. Max wait is 120 seconds.
@@ Waiting for operator to shutdown pods (will wait for no more than half of max wait seconds before directly deleting them).
@@ 24 resources remaining after 19 seconds, including 4 WebLogic Server pods. Max wait is 120 seconds.
@@ Waiting for operator to shutdown pods (will wait for no more than half of max wait seconds before directly deleting them).
@@ 24 resources remaining after 25 seconds, including 4 WebLogic Server pods. Max wait is 120 seconds.
@@ Waiting for operator to shutdown pods (will wait for no more than half of max wait seconds before directly deleting them).
@@ 24 resources remaining after 31 seconds, including 4 WebLogic Server pods. Max wait is 120 seconds.
@@ Waiting for operator to shutdown pods (will wait for no more than half of max wait seconds before directly deleting them).
@@ 24 resources remaining after 35 seconds, including 4 WebLogic Server pods. Max wait is 120 seconds.
@@ Waiting for operator to shutdown pods (will wait for no more than half of max wait seconds before directly deleting them).
@@ 23 resources remaining after 40 seconds, including 3 WebLogic Server pods. Max wait is 120 seconds.
@@ Waiting for operator to shutdown pods (will wait for no more than half of max wait seconds before directly deleting them).
@@ 23 resources remaining after 44 seconds, including 3 WebLogic Server pods. Max wait is 120 seconds.
@@ Waiting for operator to shutdown pods (will wait for no more than half of max wait seconds before directly deleting them).
@@ 20 resources remaining after 48 seconds, including 0 WebLogic Server pods. Max wait is 120 seconds.
@@ All pods shutdown, about to directly delete remaining resources.
domain "domain1" deleted
pod "domain1-cluster-1-traefik-59998fb86d-gzt5j" deleted
job "domain-domain1-job" deleted
deployment "domain1-cluster-1-traefik" deleted
persistentvolumeclaim "domain1-pv-claim" deleted
configmap "domain-domain1-scripts" deleted
configmap "domain1-cluster-1-traefik" deleted
serviceaccount "domain1-cluster-1-traefik" deleted
secret "domain1-weblogic-credentials" deleted
persistentvolume "domain1-pv" deleted
clusterrole "domain1-cluster-1-traefik" deleted
clusterrolebinding "domain1-cluster-1-traefik" deleted
@@ 0 resources remaining after 63 seconds, including 0 WebLogic Server pods. Max wait is 120 seconds.
@@ Success.
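
The phase 3 cleanup amounts to deleting by label. A rough hand-written equivalent, using the resource kinds seen in the run above (the domain1-ns namespace is illustrative; the script itself handles multiple domains and namespaces automatically):

  # namespaced resources for domain1
  kubectl delete domain,pod,job,deploy,pvc,cm,sa,secret -n domain1-ns \
    -l weblogic.domainUID=domain1
  # cluster-scoped resources for domain1
  kubectl delete pv,clusterrole,clusterrolebinding \
    -l weblogic.domainUID=domain1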

Note:

BTW, it turns out Mark had started on a change for this issue via WIP branch issue-41. I discovered this only after I'd written the script and tried to push. So I'm using "issue--41" for this branch instead of "issue-41".

@rjeberhard (Member)

The script needs to handle the case where the operator is running. Once you delete the domain resource (assuming the operator is running), the operator will begin shutting down servers and specifically removing the pods, services, and Ingress entries.

One option is to work with the operator by editing the domain to set domain.spec.startupControl = "NONE". If the operator is running, then it will shut down all of the pods gracefully. You can watch the domain.status.conditions array for Progressing and Available conditions. For startupControl = "NONE", the operator will set a condition of type = "Available" and reason = "AllServersStopped".

After this, the script could safely delete the domain and other resources.
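
A sketch of how such a check might look from the command line (field names taken from the comment above; the exact reason string and polling logic used by the script may differ):

  kubectl get domain domain1 -n domain1-ns \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].reason}'
  # expect: AllServersStopped once the operator has stopped all servers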

Tom Barnes added 2 commits March 5, 2018 11:48
…cts, set startupControl on each domain to NONE and wait up to half of max wait seconds for operator to shutdown its WLS pods normally. (2) Increase default max wait seconds to 120 seconds.

@rjeberhard (Member)

This looks really good, with maybe one readability comment: getDomain surprised me by getting all of the objects associated with the domain. When it fails intermittently, what happens?

Tom Barnes added 5 commits March 6, 2018 10:21
…edByOperator label to operator-owned domain resources, and modify its selectors to look for this label). Plus modify run.sh domain lifecycle test to verify webapp is still OK after a cycling.
@tbarnes-us changed the title from "WIP: Issue 41" to "Issue 41" on Mar 7, 2018

@rjeberhard (Member)

Resolves issue #41

@rjeberhard merged commit c5d5f60 into master on Mar 8, 2018
@rjeberhard deleted the issue--41 branch on March 30, 2018 17:42