
add not-ready reap logic and unit test #58

Merged: 3 commits into keikoproj:master on Oct 25, 2021

Conversation

sahilbadla (Contributor):

pdb-reaper was not deleting PDBs whose pods are not fully ready (pods stuck in Pending, or pod containers in a not-ready state).

Changes:

  • Added ReapNotReady flag
  • Added ReapNotReadyThreshold flag
  • Added AllNotReady flag
  • Logic to reap PDBs whose pods are in a not-ready state (a sketch of how the flags might interact follows this list)
  • Unit tests for new logic
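
As a rough, hypothetical sketch of how these flags could interact: the thread does not spell out AllNotReady, so by analogy with the existing --all-crashloop flag ("Only deletes PDBs for crashlooping pods when all pods are in crashloop") a plausible gating rule might look like the following. shouldReapNotReady and its parameters are invented here for illustration and are not the merged code.

// Hypothetical helper sketching how the new flags might gate reaping;
// notReadyCount and totalPods would come from scanning the pods behind a PDB.
func shouldReapNotReady(reapNotReady, allNotReady bool, notReadyCount, totalPods int) bool {
	if !reapNotReady || notReadyCount == 0 {
		return false
	}
	if allNotReady {
		// By analogy with --all-crashloop: only reap when every pod
		// behind the PDB is past the not-ready threshold.
		return notReadyCount == totalPods
	}
	return true
}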

Comment on lines 434 to 439
if pod.Status.Phase == corev1.PodPending && isPodReadinessThresholdPast(pod.Status.StartTime, thresholdSeconds) {
	notReadyCount++
}

if pod.Status.Phase == corev1.PodRunning && isPodReadinessThresholdPast(pod.Status.StartTime, thresholdSeconds) {
	for _, condition := range pod.Status.Conditions {
		if condition.Type == "ContainersReady" && condition.Status == "False" {
			notReadyCount++
		}
	}
}
Member:

Instead of pod.Status.StartTime (which is the start time of the Pod), would it be possible to check whether the threshold has passed since the container last entered the NotReady state? For example, the lastTransitionTime of the Ready or ContainersReady condition?

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-15T22:27:29Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-15T22:27:15Z"
    message: 'containers with unready status: [app]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-15T22:27:15Z"
    message: 'containers with unready status: [app]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-15T22:27:15Z"
    status: "True"
    type: PodScheduled
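
As a sketch of this suggestion (not the code that eventually landed), a condition-based check could look roughly like the following. isNotReadyPastThreshold is a hypothetical helper name; the imports assumed are "time" and corev1 "k8s.io/api/core/v1".

// Hypothetical sketch: a Running pod counts as not ready only if its
// ContainersReady condition has been False for longer than the threshold,
// measured from the condition's lastTransitionTime rather than
// pod.Status.StartTime.
func isNotReadyPastThreshold(pod corev1.Pod, thresholdSeconds int) bool {
	if pod.Status.Phase != corev1.PodRunning {
		return false
	}
	for _, condition := range pod.Status.Conditions {
		if condition.Type == corev1.ContainersReady && condition.Status == corev1.ConditionFalse {
			return time.Since(condition.LastTransitionTime.Time) > time.Duration(thresholdSeconds)*time.Second
		}
	}
	return false
}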

Contributor Author (sahilbadla):

fixed

@@ -49,4 +49,7 @@ func init() {
pdbReaperCmd.Flags().BoolVar(&pdbReaperArgs.AllCrashLoop, "all-crashloop", true, "Only deletes PDBs for crashlooping pods when all pods are in crashloop")
pdbReaperCmd.Flags().IntVar(&pdbReaperArgs.CrashLoopRestartCount, "crashloop-restart-count", 5, "Minimum restart count to when considering pods in crashloop")
pdbReaperCmd.Flags().StringSliceVar(&pdbReaperArgs.ExcludedNamespaces, "excluded-namespaces", []string{}, "Namespaces excluded from scanning")
pdbReaperCmd.Flags().BoolVar(&pdbReaperArgs.ReapNotReady, "reap-not-ready", true, "Deletes PDBs which have pods in not-ready state")
pdbReaperCmd.Flags().IntVar(&pdbReaperArgs.ReapNotReadyThreshold, "not-ready-threshold-seconds", 400, "Minimum seconds to wait when considering pods in not-ready state")
Member:

I think the default threshold for deleting a NotReady PDB should probably match the upgrade-manager drainTimeout default, which is 1800 seconds (30 minutes). If a Pod is still not ready after 30 minutes, something is probably wrong. After 400 seconds (about 6.6 minutes) something may be wrong, but there are cases where containers legitimately take longer than that to start.
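
If applied, this suggestion changes only the default value in the flag registration shown in the diff above, roughly as follows (the exact final line is not shown in this thread):

pdbReaperCmd.Flags().IntVar(&pdbReaperArgs.ReapNotReadyThreshold, "not-ready-threshold-seconds", 1800, "Minimum seconds to wait when considering pods in not-ready state")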

Contributor Author (sahilbadla):

done

3 commits, each Signed-off-by: sbadla1 <sahil_badla@intuit.com>
@kevdowney left a comment:

LGTM

codecov bot commented Oct 22, 2021

Codecov Report

Merging #58 (added1e) into master (92b6912) will increase coverage by 0.14%.
The diff coverage is 66.66%.


@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   59.71%   59.85%   +0.14%     
==========================================
  Files           9        9              
  Lines        1390     1420      +30     
==========================================
+ Hits          830      850      +20     
- Misses        467      476       +9     
- Partials       93       94       +1     
Impacted Files                       Coverage Δ
pkg/reaper/pdbreaper/types.go        24.24% <0.00%> (-3.35%) ⬇️
pkg/reaper/pdbreaper/pdbreaper.go    73.09% <90.90%> (+1.94%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 92b6912...added1e.

@kevdowney kevdowney merged commit 0ccce94 into keikoproj:master Oct 25, 2021
shaoxt pushed a commit to shaoxt/governor that referenced this pull request Dec 21, 2021
* add not-ready reap logic and unit test

Signed-off-by: sbadla1 <sahil_badla@intuit.com>
Signed-off-by: shaoxt <shaoxt@gmail.com>