Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-73056] cancel queue Item that really matches the pod causing the cancelling #1539

Conversation

ysmaoui
Copy link
Contributor

@ysmaoui ysmaoui commented Apr 25, 2024

JENKINS-73056

Testing done

Submitter checklist

@ysmaoui ysmaoui requested a review from a team as a code owner April 25, 2024 13:43
@ysmaoui
Copy link
Contributor Author

ysmaoui commented Apr 25, 2024

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea seems right though implementation could be refined.

Deserves test coverage, which I think ought to be feasible here; it seems this method is called for several reasons, the most straightforward of which would be that the API request to create the pod in the first place was rejected, for example because you specified well-formed YAML which fails schema validation or some such constraint. So something like

parallel k8s: {
  catchError(buildResult: 'SUCCESS') {
    podTemplate(yaml: '…invalid…') {
      node(POD_LABEL) {
        sh 'false should never run'
      }
    }
    echo 'cancelled pod item by now'
  }
}, unrelated: {
  node('special-agent') {}
}

should pass if the test waits for cancelled pod item by now message and then creates a static agent named special-agent, if I follow the bug report correctly.

Comment on lines 104 to 105
String whyOnQueue = item.getWhy();
if(whyOnQueue.contains(metadata.getName())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fragile. We can probably match based on the label request instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried printing a couple of "assignedLabels" for items on the queue ( return value of Queue.Item.getAssignedLabel() )
image

some of these assigned labels match the exactly the pod label "jenkins/label"
but some other, have that - at the beginning that I don't really where it comes from

image

image

Copy link
Contributor Author

@ysmaoui ysmaoui Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image
but the assignedLabel value and the why String have the same value

and the issue where jenkins/label does not match the assignedLabel String is in the case when the pod Name is using the fallback DEFAULT_AGENT_PREFIX ( because the given label does not match the regex set here https://github.com/jenkinsci/kubernetes-plugin/blob/40d4fb21bd8046082ecd88fc5cf860f0db8e54ee/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesSlave.java#L313C14-L313C23 )

and the 'x' in the jenkins/label comes from here

input = input.replaceAll("[^_a-zA-Z0-9-]", "_").replaceFirst("^[^a-zA-Z0-9]", "x");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, the question I have is why the queue.item assignedLabel not the pod name but the value of the label in jenkins/label instead..

@ysmaoui
Copy link
Contributor Author

ysmaoui commented Apr 25, 2024

Hi @jglick

to be honest I am not very familiar of how to run tests for this plugin, But this is the pipeline that I use to reproduce the bug: ( the failing pod is where the image cannot be pulled - wrong registry )

String incorrectImage = "does.not.exist/jenkins/inbound-agent:3206.vb_15dcf73f6a_9-3-jdk11"
// on dev
String correctImage = "jenkins/inbound-agent:3206.vb_15dcf73f6a_9-3-jdk11"

String getPodYaml(String image){
  return """
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: jnlp
    image: ${image}
    imagePullPolicy: "Always"
    tty: false
    workingDir: "/home/jenkins/agent"
    resources:
      requests:
        memory: 100Mi
        cpu: 500m
  restartPolicy: "Never"
  volumes:
  - emptyDir:
      medium: ""
    name: "workspace-volume"
"""
}

parallel (
  failingPod: {
    stage("failing pod") {
      podTemplate(yaml: getPodYaml(incorrectImage)) {
        node(POD_LABEL) {
          sh 'false should never run'
        }
      }
    }
  },
  correctPods: {
   // keep spawning pods until eventually one of them catches the cancel queue item from the failing pod 
    List l = (1..10).toList()
    l.each { num ->
      stage("correctPod-${num}") {
        podTemplate(yaml: getPodYaml(correctImage)) {
          node(POD_LABEL) {
            echo("correctPod${num}")
          }
        }
      }
    }
  }
)

and when executed this is what happens:

image

I need probably some assistance in writing and executing such a test scenario for the plugin, can you help?

( excluding the failing pod branch from the pipeline results in no failures in the "correct-pods" branches )
image

@ysmaoui ysmaoui marked this pull request as draft April 27, 2024 10:31
@ysmaoui ysmaoui force-pushed the bugfix/JENKINS-73056-cancel-only-relevant-queue-item-when-a-pod-fails branch 7 times, most recently from 4c7d18c to 56c8cb3 Compare April 28, 2024 20:39
@ysmaoui ysmaoui force-pushed the bugfix/JENKINS-73056-cancel-only-relevant-queue-item-when-a-pod-fails branch from 56c8cb3 to c5889c6 Compare April 28, 2024 21:06
@Vlatombe Vlatombe changed the title cancel queue Item that really matches the pod causing the cancelling [JENKINS-73056] cancel queue Item that really matches the pod causing the cancelling Apr 29, 2024
* Add KubernetesPipelineTest#cancelOnlyRelevantQueueItem to cover the use case
* Remove duplicated sanitization method from `PodTemplate#sanitizeLabel`
* Extract labels and annotations to constant for reuse
@Vlatombe
Copy link
Member

Thank you for your contribution, made some follow-ups and took the opportunity to clean up sanitization. Implemented the test proposed by @jglick

@Vlatombe Vlatombe added the bug Bug Fixes label Apr 29, 2024
@ysmaoui
Copy link
Contributor Author

ysmaoui commented Apr 29, 2024

Thank you for your contribution, made some follow-ups and took the opportunity to clean up sanitization. Implemented the test proposed by @jglick

thanks for the help @Vlatombe

it seems spotbugs is still complaining, and I tried to fix it here: https://github.com/ysmaoui/kubernetes-plugin/pull/1/files
if you approve, I would like to merge the fix here

…-pointer-bug

spotbugs-detected possible nullpointer-exception
@ysmaoui ysmaoui marked this pull request as ready for review April 29, 2024 12:29
@Vlatombe Vlatombe requested a review from jglick April 29, 2024 12:37
Copy link
Member

@Vlatombe Vlatombe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pinging @jglick for a second pair of eyes

* @throws AssertionError if the generated label is not valid
*/
public static String sanitizeLabel(String input) {
int max = /* Kubernetes limit */ 63 - /* hyphen */ 1 - /* suffix */ 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is actually from labelify not from the function previous called sanitizeLabel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think labelify and the previous sanitizeLabel methods are now merged into this new sanitizeLabel method of the PodTemplateUtils class

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, both were doing very similar job, there is now only one method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually changes a behavior for labels that contains a .. For example label release-1.2.3 resulted in jenkins/label: release-1.2.3. Now it results in jenkins/label: release-1_2_3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed the previous version handled [^_.a-zA-Z0-9-] which is gone here. Seems like a regression; @Dohbedoh can you please make sure this is filed as an issue (if not a patch)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes a behavior

But is that breaking anything?

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, no, runUrl Javadoc does not seem to match reality.

@ysmaoui ysmaoui force-pushed the bugfix/JENKINS-73056-cancel-only-relevant-queue-item-when-a-pod-fails branch from b294ac7 to 1ce44e4 Compare April 29, 2024 13:45
@ysmaoui ysmaoui requested review from jglick and Vlatombe April 29, 2024 13:52
@ysmaoui ysmaoui force-pushed the bugfix/JENKINS-73056-cancel-only-relevant-queue-item-when-a-pod-fails branch from 1ce44e4 to baa1004 Compare April 29, 2024 13:54
@ysmaoui
Copy link
Contributor Author

ysmaoui commented Apr 30, 2024

thank you @Vlatombe and @jglick for reviewing and all the help :)

can we merge?

@Vlatombe Vlatombe merged commit c646b71 into jenkinsci:master Apr 30, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants