Reaper khjob #619

Closed · wants to merge 54 commits
907414f
Create khjobs crd
joshulyne Jul 31, 2020
16f6beb
Initial commmit -- apply khjob monitor, khjob trigger, khjob run
joshulyne Aug 1, 2020
cda57bb
edit
joshulyne Aug 1, 2020
fcd2aab
Make jobs run concurrently, modify khState reaper, set execution erro…
joshulyne Aug 4, 2020
66b0505
Merge branch 'master' into khjob
joshulyne Aug 11, 2020
d7bce3d
Refactor khjob trigger
joshulyne Aug 14, 2020
f654faa
simplified func main and integrating reap of khjob
Aug 19, 2020
5e0cc1f
updated delete khjob functionality
Aug 19, 2020
9bac5b9
fixed err verbage
Aug 19, 2020
2c1722e
included env variable for minutes for khjob to be alive before being …
Aug 19, 2020
8d9ab86
added more comments to functions
Aug 19, 2020
061d61d
corrected verbiage
Aug 19, 2020
c6f81af
updated from feedback from Josh to return error in khJobDelete
Aug 19, 2020
1ccf056
creating jobClient in main and passing into func
Aug 19, 2020
829878d
updated versions
Aug 19, 2020
e5a563f
Add JobDetails status
joshulyne Aug 20, 2020
43615ad
Add client function to khjobscrd
joshulyne Aug 20, 2020
6af0eb7
Remove khworkload field from json strcut
joshulyne Aug 21, 2020
0d061f0
added more log info for khjob and updated yaml
Aug 24, 2020
bb0cc63
jobs use main kh context. getter for khworkload type on WorkloadDetails
Aug 24, 2020
141d266
added workloadType to NewWorkloadDetails constructor
Aug 24, 2020
ea54b8a
Merge branch 'master' into khjob
Aug 26, 2020
ffd7c9d
Remove khworkload from json
joshulyne Sep 1, 2020
ded0059
Merge branch 'khjob' of https://github.com/Comcast/kuberhealthy into …
joshulyne Sep 1, 2020
bb26ce5
remove comments
joshulyne Sep 1, 2020
8451050
workload type panic
Sep 3, 2020
04adda0
Merge branch 'khjob' of https://github.com/Comcast/kuberhealthy into …
joshulyne Sep 3, 2020
ecfea77
Merge branch 'master' into khjob
joshulyne Sep 8, 2020
5447c8b
Determine khworkload once; workload cannot be retrieved from spec
joshulyne Sep 10, 2020
e149c2b
Merge branch 'master' into khjob
joshulyne Sep 10, 2020
9e1718b
update khjob pkg
joshulyne Sep 10, 2020
345f916
merge conflict
joshulyne Oct 1, 2020
eb71e06
jobDetails fix, currentMaster bug fix
joshulyne Oct 1, 2020
0eac4c0
Add documentation
joshulyne Oct 6, 2020
2209eb9
Update cmd/kuberhealthy/Makefile
joshulyne Oct 6, 2020
cf43396
Merge branch 'master' into khjob
joshulyne Oct 6, 2020
fd09f12
Merge branch 'khjob' of https://github.com/Comcast/kuberhealthy into …
joshulyne Oct 6, 2020
57283dc
add determine khworkload
joshulyne Oct 7, 2020
021931b
Add example yamls to docs
joshulyne Oct 7, 2020
a36030f
Add empty name check and error in PodNameExists func
joshulyne Oct 7, 2020
2cd2d17
fix error message
joshulyne Oct 7, 2020
166a165
add exp backoff in reporting failure/success to kh, remove returning …
joshulyne Oct 7, 2020
9bae7db
Edit exp backoff in checkclient, update check builds to use new check…
joshulyne Oct 9, 2020
a9c16ad
Remove upstream context from sub select statements, fix returning Err…
joshulyne Oct 9, 2020
9a2b878
Merge branch 'master' into khjob
joshulyne Oct 12, 2020
04685b7
Merge branch 'master' into khjob
joshulyne Oct 13, 2020
f9c5bda
Merge branch 'master' into khjob
joshulyne Oct 15, 2020
5b009e8
Clean up
joshulyne Oct 15, 2020
a653840
trying to fix merge conflict
Oct 15, 2020
0ad0532
added clusterRole for check-reaper to grant access to list khjobs
Oct 15, 2020
7d54cc0
clusterrolebinding to admin...configured admin
Oct 15, 2020
d2319a9
using kuberhealthy clusterrole
Oct 15, 2020
a67896d
added khjob count logs
Oct 15, 2020
dfb8065
changed logging
Oct 15, 2020
4 changes: 2 additions & 2 deletions cmd/check-reaper/Makefile
@@ -1,5 +1,5 @@
build:
docker build -t kuberhealthy/check-reaper:v1.4.0 -f Dockerfile ../../
docker build -t kuberhealthy/check-reaper:v1.5.0 -f Dockerfile ../../

push:
docker push kuberhealthy/check-reaper:v1.4.0
docker push kuberhealthy/check-reaper:v1.5.0
10 changes: 7 additions & 3 deletions cmd/check-reaper/README.md
@@ -1,8 +1,6 @@


## Checker pod reaper

This container deletes kuberhealthy checker pods when they are no longer useful. Checker pods are identified by having a label with the key `kh-check-name`.
This container deletes kuberhealthy checker pods when they are no longer useful. Checker pods are identified by having a label with the key `kh-check-name`.

If the key `kh-check-name` is found on a pod, then it will be deleted when any of the following are true:

@@ -13,3 +11,9 @@ If the key `kh-check-name` is found on a pod, then it will be deleted when any o
- If the checker pod is `Failed` and there are more than 5 `Failed` checker pods of the same type which were created more recently

- If the checker pod is `Failed` and was created more than 5 days ago

This container deletes kuberhealthy jobs when they are no longer useful.

A khjob will be deleted when the following is true:

- If the khjob is older than 15 minutes and is `Completed`
14 changes: 9 additions & 5 deletions cmd/check-reaper/check-reaper.yaml
@@ -11,11 +11,15 @@ spec:
spec:
containers:
- name: check-reaper
image: kuberhealthy/check-reaper:v1.4.0
image: kuberhealthy/check-reaper:v1.5.0
imagePullPolicy: IfNotPresent
# env:
# - name: SINGLE_NAMESPACE
# value: kuberhealthy
env:
- name: SINGLE_NAMESPACE
value: kuberhealthy
- name: MAX_PODS_THRESHOLD
value: "4"
- name: JOB_DELETE_TIME_DURATION
value: 15m
restartPolicy: OnFailure
serviceAccountName: check-reaper
concurrencyPolicy: Forbid
@@ -27,7 +31,7 @@ metadata:
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: admin
name: kuberhealthy
subjects:
- kind: ServiceAccount
name: check-reaper
86 changes: 83 additions & 3 deletions cmd/check-reaper/main.go
@@ -4,12 +4,15 @@ import (
"context"
"os"
"path/filepath"
"strconv"
"time"

log "github.com/sirupsen/logrus"

khjobcrd "github.com/Comcast/kuberhealthy/v2/pkg/apis/khjob/v1"
v1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

"k8s.io/client-go/kubernetes"
_ "k8s.io/client-go/plugin/pkg/client/auth/oidc"

@@ -22,8 +25,14 @@ var kubeConfigFile = filepath.Join(os.Getenv("HOME"), ".kube", "config")
// ReapCheckerPods is a variable mapping all reaper pods
var ReapCheckerPods map[string]v1.Pod

// MaxPodsThreshold is a variable limiting how many reaper pods can exist in a cluster
var MaxPodsThreshold = 4
// MaxPodsThresholdEnv is a variable limiting how many checker pods can exist in a cluster
var MaxPodsThresholdEnv = os.Getenv("MAX_PODS_THRESHOLD")

// JobDeleteTimeDurationEnv is a variable limiting how long a khjob can be alive before it can be deleted
var JobDeleteTimeDurationEnv = os.Getenv("JOB_DELETE_TIME_DURATION")

// instantiate kuberhealthy job CRD client
var khJobClient *khjobcrd.KHJobV1Client

// Namespace is a variable to allow code to target all namespaces or a single namespace
var Namespace string
@@ -46,6 +55,11 @@ func main() {
log.Fatalln("Unable to create kubernetes client", err)
}

jobClient, err := khjobcrd.Client(kubeConfigFile)
if err != nil {
log.Fatalln("Unable to create khJob client", err)
}

podList, err := listCheckerPods(ctx, client, Namespace)
if err != nil {
log.Fatalln("Failed to list and delete old checker pods", err)
Expand All @@ -62,6 +76,14 @@ func main() {
}

log.Infoln("Finished reaping checker pods.")
log.Infoln("Beginning to search for khjobs.")

// fetch and delete khjobs that meet criteria
err = khJobDelete(jobClient)
if err != nil {
log.Errorln("Failed to reap khjobs with error: ", err)
}
log.Infoln("Finished reaping khjobs.")
}

// listCheckerPods returns a list of pods with the khcheck name label
@@ -91,7 +113,10 @@ func listCheckerPods(ctx context.Context, client *kubernetes.Clientset, namespac
// deleteFilteredCheckerPods goes through map of all checker pods and deletes older checker pods
func deleteFilteredCheckerPods(ctx context.Context, client *kubernetes.Clientset, reapCheckerPods map[string]v1.Pod) error {

var err error
MaxPodsThreshold, err := strconv.Atoi(MaxPodsThresholdEnv)
if err != nil {
log.Errorln("Error converting MAX_PODS_THRESHOLD to int:", err)
}

for k, v := range reapCheckerPods {

@@ -195,3 +220,58 @@ func deletePod(ctx context.Context, client *kubernetes.Clientset, pod v1.Pod) er
options := metav1.DeleteOptions{PropagationPolicy: &propagationForeground}
return client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, options)
}

// podConditions returns true if the checker pod meets the conditions for deletion
func podConditions(pod v1.Pod, duration float64, phase v1.PodPhase) bool {
if time.Now().Sub(pod.CreationTimestamp.Time).Hours() > duration && pod.Status.Phase == phase {
log.Infoln("Found pod older than", duration, "hours in status", phase, ". Deleting pod:", pod.Name)
return true
}
return false
}

// jobConditions returns true if the khjob meets the conditions for deletion
func jobConditions(job khjobcrd.KuberhealthyJob, duration time.Duration, phase khjobcrd.JobPhase) bool {
if time.Now().Sub(job.CreationTimestamp.Time) > duration && job.Spec.Phase == phase {
log.Infoln("Found job older than", duration, "in status", phase)
return true
}
return false
}

// khJobDelete fetches a list of khjobs in a namespace and deletes them if they meet the given criteria
func khJobDelete(client *khjobcrd.KHJobV1Client) error {

opts := metav1.ListOptions{}
del := metav1.DeleteOptions{}

// convert JobDeleteTimeDurationEnv into a time.Duration
JobDeleteTimeDuration, err := time.ParseDuration(JobDeleteTimeDurationEnv)
if err != nil {
log.Errorln("Error converting JobDeleteTimeDurationEnv to time.Duration:", err)
return err
}

// list khjobs in Namespace
list, err := client.KuberhealthyJobs(Namespace).List(opts)
if err != nil {
log.Errorln("Error: failed to retrieve khjob list with error", err)
return err
}

log.Infoln("Found", len(list.Items), "khjobs")

// Range over list and delete khjobs
for _, j := range list.Items {
if jobConditions(j, JobDeleteTimeDuration, "Completed") {
log.Infoln("Deleting khjob", j.Name)
err := client.KuberhealthyJobs(j.Namespace).Delete(j.Name, &del)
if err != nil {
log.Errorln("Failed to delete khjob", j.Name, "with error:", err)
return err
}
}
}
return nil
}
4 changes: 2 additions & 2 deletions cmd/daemonset-check/Makefile
@@ -1,5 +1,5 @@
build:
docker build -t kuberhealthy/daemonset-check:v3.2.0 -f Dockerfile ../../
docker build -t kuberhealthy/daemonset-check:v3.2.1 -f Dockerfile ../../

push:
docker push kuberhealthy/daemonset-check:v3.2.0
docker push kuberhealthy/daemonset-check:v3.2.1
2 changes: 1 addition & 1 deletion cmd/deployment-check/Makefile
@@ -1,5 +1,5 @@
IMAGE="kuberhealthy/deployment-check"
TAG="v1.6.1"
TAG="v1.6.2"

build:
docker build -t ${IMAGE}:${TAG} -f Dockerfile ../../
4 changes: 2 additions & 2 deletions cmd/dns-resolution-check/Makefile
@@ -1,5 +1,5 @@
build:
docker build -t kuberhealthy/dns-resolution-check:v1.4.0 -f Dockerfile ../../
docker build -t kuberhealthy/dns-resolution-check:v1.4.1 -f Dockerfile ../../

push:
docker push kuberhealthy/dns-resolution-check:v1.4.0
docker push kuberhealthy/dns-resolution-check:v1.4.1
5 changes: 2 additions & 3 deletions cmd/http-check/Makefile
@@ -1,6 +1,5 @@

build:
docker build -t kuberhealthy/http-check:v1.3.0 -f Dockerfile ../../
docker build -t kuberhealthy/http-check:v1.3.1 -f Dockerfile ../../

push:
docker push kuberhealthy/http-check:v1.3.0
docker push kuberhealthy/http-check:v1.3.1
4 changes: 2 additions & 2 deletions cmd/kiam-check/Makefile
@@ -1,5 +1,5 @@
build:
docker build -t kuberhealthy/kiam-check:v1.2.2 -f Dockerfile ../../
docker build -t kuberhealthy/kiam-check:v1.2.3 -f Dockerfile ../../

push:
docker push kuberhealthy/kiam-check:v1.2.2
docker push kuberhealthy/kiam-check:v1.2.3
2 changes: 1 addition & 1 deletion cmd/kuberhealthy/Makefile
@@ -1,5 +1,5 @@
IMAGE="kuberhealthy/kuberhealthy"
TAG="v2.3.2"
TAG="latest"

build:
docker build --no-cache --pull -t ${IMAGE}:${TAG} -f Dockerfile ../../
57 changes: 50 additions & 7 deletions cmd/kuberhealthy/crd.go
@@ -13,20 +13,22 @@ package main

import (
"errors"
k8sErrors "k8s.io/apimachinery/pkg/api/errors"
"strings"
"time"

k8sErrors "k8s.io/apimachinery/pkg/api/errors"

log "github.com/sirupsen/logrus"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

v1 "github.com/Comcast/kuberhealthy/v2/pkg/apis/khjob/v1"
"github.com/Comcast/kuberhealthy/v2/pkg/health"
"github.com/Comcast/kuberhealthy/v2/pkg/khstatecrd"
)

// setCheckStateResource puts a check state's state into the specified CRD resource. It sets the AuthoritativePod
// to the server's hostname and sets the LastUpdate time to now.
func setCheckStateResource(checkName string, checkNamespace string, state health.CheckDetails) error {
func setCheckStateResource(checkName string, checkNamespace string, state health.WorkloadDetails) error {

name := sanitizeResourceName(checkName)

@@ -64,15 +66,15 @@ func sanitizeResourceName(c string) string {
}

// ensureStateResourceExists checks for the existence of the specified resource and creates it if it does not exist
func ensureStateResourceExists(checkName string, checkNamespace string) error {
func ensureStateResourceExists(checkName string, checkNamespace string, workload health.KHWorkload) error {
name := sanitizeResourceName(checkName)

log.Debugln("Checking existence of custom resource:", name)
state, err := khStateClient.Get(metav1.GetOptions{}, stateCRDResource, name, checkNamespace)
if err != nil {
if k8sErrors.IsNotFound(err) || strings.Contains(err.Error(), "not found") {
log.Infoln("Custom resource not found, creating resource:", name, " - ", err)
initialDetails := health.NewCheckDetails()
initialDetails := health.NewWorkloadDetails(workload)
initialState := khstatecrd.NewKuberhealthyState(name, initialDetails)
_, err := khStateClient.Create(&initialState, stateCRDResource, checkNamespace)
if err != nil {
@@ -90,14 +92,14 @@ func ensureStateResourceExists(checkName string, checkNamespace string) error {

// getCheckState retrieves the check values from the kuberhealthy khstate
// custom resource
func getCheckState(c KuberhealthyCheck) (health.CheckDetails, error) {
func getCheckState(c KuberhealthyCheck) (health.WorkloadDetails, error) {

var state = health.NewCheckDetails()
var state = health.NewWorkloadDetails(health.KHCheck)
var err error
name := sanitizeResourceName(c.Name())

// make sure the CRD exists, even when checking status
err = ensureStateResourceExists(c.Name(), c.CheckNamespace())
err = ensureStateResourceExists(c.Name(), c.CheckNamespace(), health.KHCheck)
if err != nil {
return state, errors.New("Error validating CRD exists: " + name + " " + err.Error())
}
@@ -110,3 +112,44 @@ func getCheckState(c KuberhealthyCheck) (health.CheckDetails, error) {
log.Debugln("Successfully retrieved khstate resource:", name)
return khstate.Spec, nil
}

// getJobState retrieves the job values from the kuberhealthy khstate
// custom resource
func getJobState(j KuberhealthyCheck) (health.WorkloadDetails, error) {

var state = health.NewWorkloadDetails(health.KHJob)
var err error
name := sanitizeResourceName(j.Name())

// make sure the CRD exists, even when checking status
err = ensureStateResourceExists(j.Name(), j.CheckNamespace(), health.KHJob)
if err != nil {
return state, errors.New("Error validating CRD exists: " + name + " " + err.Error())
}

log.Debugln("Retrieving khstate custom resource for:", name)
khstate, err := khStateClient.Get(metav1.GetOptions{}, stateCRDResource, name, j.CheckNamespace())
if err != nil {
return state, errors.New("Error retrieving custom khstate resource: " + name + " " + err.Error())
}
log.Debugln("Successfully retrieved khstate resource:", name)
return khstate.Spec, nil
}

// setJobPhase updates the kuberhealthy job phase depending on the state of its run.
func setJobPhase(jobName string, jobNamespace string, jobPhase v1.JobPhase) error {

kj, err := khJobClient.KuberhealthyJobs(jobNamespace).Get(jobName, metav1.GetOptions{})
if err != nil {
log.Errorln("error getting khjob:", jobName, err)
return err
}
resourceVersion := kj.GetResourceVersion()
updatedJob := v1.NewKuberhealthyJob(jobName, jobNamespace, kj.Spec)
updatedJob.SetResourceVersion(resourceVersion)
log.Infoln("Setting khjob phase to:", jobPhase)
updatedJob.Spec.Phase = jobPhase

_, err = khJobClient.KuberhealthyJobs(jobNamespace).Update(&updatedJob)
return err
}
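`setJobPhase` copies the current `resourceVersion` onto the updated object before calling `Update` because the Kubernetes API server rejects updates carrying a stale resourceVersion. A toy in-memory illustration of that optimistic-concurrency rule (all names hypothetical, no real client-go calls):

```go
package main

import (
	"errors"
	"fmt"
)

// obj mimics a Kubernetes object that carries a resourceVersion.
type obj struct {
	resourceVersion int
	phase           string
}

// store is a toy stand-in for the API server: an update is accepted only
// when its resourceVersion matches the stored one, then the version bumps.
type store struct{ current obj }

func (s *store) update(o obj) error {
	if o.resourceVersion != s.current.resourceVersion {
		return errors.New("conflict: stale resourceVersion")
	}
	o.resourceVersion++
	s.current = o
	return nil
}

func main() {
	s := &store{current: obj{resourceVersion: 1, phase: "Running"}}

	fresh := s.current // read the object, carrying resourceVersion 1
	fresh.phase = "Completed"
	fmt.Println(s.update(fresh)) // <nil>: versions match, update accepted

	stale := obj{resourceVersion: 1, phase: "Failed"}
	fmt.Println(s.update(stale)) // conflict: the stored version already moved to 2
}
```

This is why `setJobPhase` does a fresh `Get` first: building the updated khjob from an old copy would trigger exactly this conflict.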