
Log events in build log during pod provisioning #742

Merged
Vlatombe merged 11 commits into jenkinsci:master from log-events-during-pod-provisioning on Mar 18, 2020

Conversation

@Vlatombe (Member) commented Mar 16, 2020

This listens to pod events during the agent launch and streams them into the build log. Streaming stops once the agent is up and running.
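
For illustration, a watcher along these lines can be built with the fabric8 client. This is only a sketch of the idea, not the plugin's actual TaskListenerEventWatcher; the class and field names here are made up:

```java
import hudson.model.TaskListener;
import io.fabric8.kubernetes.api.model.Event;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watcher;

// Sketch only: forward each pod event to the build log in the
// [Type][namespace/pod][Reason] message format shown below.
class EventToBuildLogWatcher implements Watcher<Event> {
    private final String podName;
    private final TaskListener listener;

    EventToBuildLogWatcher(String podName, TaskListener listener) {
        this.podName = podName;
        this.listener = listener;
    }

    @Override
    public void eventReceived(Action action, Event event) {
        listener.getLogger().printf("[%s][%s/%s][%s] %s%n",
                event.getType(),
                event.getMetadata().getNamespace(), podName,
                event.getReason(), event.getMessage());
    }

    @Override
    public void onClose(KubernetesClientException cause) {
        // The watch is closed once the agent is up and running (or on error).
    }
}
```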

Happy case

‘Jenkins’ doesn’t have label ‘pp_20-pm7l0’

Created Pod: kubernetes-plugin-test/pp-20-pm7l0-chmxk-m4pww
[Normal][kubernetes-plugin-test/pp-20-pm7l0-chmxk-m4pww][Scheduled] Successfully assigned kubernetes-plugin-test/pp-20-pm7l0-chmxk-m4pww to docker-desktop
[Normal][kubernetes-plugin-test/pp-20-pm7l0-chmxk-m4pww][Pulled] Container image "jenkins/jnlp-slave:4.0.1-1" already present on machine
[Normal][kubernetes-plugin-test/pp-20-pm7l0-chmxk-m4pww][Created] Created container jnlp
[Normal][kubernetes-plugin-test/pp-20-pm7l0-chmxk-m4pww][Started] Started container jnlp

Agent pp-20-pm7l0-chmxk-m4pww is provisioned from template pp_20-pm7l0-chmxk

Unhappy case

Created Pod: kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8
[Normal][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][Scheduled] Successfully assigned kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8 to docker-desktop

[Normal][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][Pulling] Pulling image "invalid:image"

[Warning][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][Failed] Failed to pull image "invalid:image": rpc error: code = Unknown desc = Error response from daemon: pull access denied for invalid, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
[Warning][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][Failed] Error: ErrImagePull
[Normal][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][Pulled] Container image "jenkins/jnlp-slave:4.0.1-1" already present on machine
[Normal][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][Created] Created container jnlp
[Normal][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][Started] Started container jnlp
[Normal][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][Pulled] Container image "maven:3.3.9-jdk-8-alpine" already present on machine
[Normal][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][Created] Created container maven

[Normal][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][Started] Started container maven
[Normal][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][BackOff] Back-off pulling image "invalid:image"
[Warning][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][Failed] Error: ImagePullBackOff

[Warning][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][Failed] Error: ImagePullBackOff
ERROR: Unable to pull Docker image "invalid:image". Check if image tag name is spelled correctly.
[Normal][kubernetes-plugin-test/wrong-image-10-0x2gp-cqgzf-pdrq8][BackOff] Back-off pulling image "invalid:image"

ERROR: Unable to pull Docker image "invalid:image". Check if image tag name is spelled correctly.

@Vlatombe added the enhancement (Improvements) label Mar 16, 2020
@Vlatombe (Member Author) commented

Need to degrade gracefully if the Jenkins service account is not able to watch pod events.

@jglick self-requested a review March 16, 2020 15:15
@Vlatombe (Member Author) commented

Looks like SelfSubjectAccessReview is broken in kubernetes-client. Haven't been able to use it.

@jglick (Member) left a comment

One thing that has long bothered me: the build log receives a dump of the YAML of the pod if and when the agent launches successfully, but if the launch fails you have to either enable fine logging or have kubectl access in order to see what pod definition Jenkins had tried to create. Is that still true here; and where does the YAML appear relative to these other messages?

  template.getWorkspaceVolume().createVolume(client, pod.getMetadata());
  watcher = new AllContainersRunningPodWatcher(client, pod, runListener);
- try (Watch _w = client.pods().inNamespace(namespace1).withName(podName).watch(watcher)) {
+ try (Watch w1 = client.pods().inNamespace(namespace).withName(podName).watch(watcher);
+      Watch w2 = client.events().inNamespace(namespace).withField("involvedObject.name", podName).watch(new TaskListenerEventWatcher(podName, runListener))) {
      watcher.await(template.getSlaveConnectTimeout(), TimeUnit.SECONDS);
  } catch (InvalidDockerImageException e) {
Member

Is this special case still needed? I am not really sure what it was doing to begin with; looks suspicious.

Member Author

It fails the build if one of the docker images is invalid (#497). Ideally I think it would deserve a rework but it works.

Member

But the build would fail anyway, without this clause; the exception would be thrown up out of launch. Right? And with this PR, you would see the appropriate diagnostics right in the log. But the current code here does more, by canceling some queue items (which seem to be any node blocks associated with this build, which is wrong by the way—should only be canceling the node block associated with this pod). If that is needed in order to prevent the build from hanging, then it is needed in many other error cases too. So this does not add up for me.

@Vlatombe (Member Author) commented Mar 17, 2020

No, the launcher would just fail, but this wouldn't affect the build, so the build would wait forever.

> should only be canceling the node block associated with this pod

Probably.

Member

I do not think this is anything blocking the current PR but it looks like something to revisit.

Member Author

@jglick I'm revisiting this, not sure how to find out exactly which item should be cancelled.
The only correlation I can make seems to be based on label, since at this point the queue item is not yet assigned to the node. There could be several items waiting for a node if they are inside a parallel block or using a static pod template. So, cancel the first one only?
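
For illustration, label-based cancellation of only the first matching item could look roughly like this (a sketch using Jenkins core Queue APIs; the class, method, and parameter names are made up):

```java
import hudson.model.Label;
import hudson.model.Queue;
import jenkins.model.Jenkins;

// Sketch only: cancel the first queued item whose assigned label matches the
// pod template's label. labelString is a hypothetical parameter.
class QueueCancellationSketch {
    static void cancelFirstItemWithLabel(String labelString) {
        Queue queue = Jenkins.get().getQueue();
        Label label = Jenkins.get().getLabel(labelString);
        for (Queue.Item item : queue.getItems()) {
            if (label != null && label.equals(item.getAssignedLabel())) {
                queue.cancel(item); // only the first matching item, per the question above
                break;
            }
        }
    }
}
```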

Member

Not sure.

@Vlatombe force-pushed the log-events-during-pod-provisioning branch from 226b97b to a60d030 on March 17, 2020 08:42
test-in-k8s.yaml: outdated review thread (resolved)
@Vlatombe (Member Author) commented Mar 17, 2020

Next build should be ok. https://ci.jenkins.io/job/Plugins/job/kubernetes-plugin/job/PR-742/8/display/redirect shows 3 failures without privilege to watch events, pretty much what I expected.

@Vlatombe (Member Author) commented Mar 17, 2020

> One thing that has long bothered me: the build log receives a dump of the YAML of the pod if and when the agent launches successfully, but if the launch fails you have to either enable fine logging or have kubectl access in order to see what pod definition Jenkins had tried to create. Is that still true here; and where does the YAML appear relative to these other messages?

That part isn't touched by this PR, so the YAML is printed once you have an executor. But when using the podTemplate DSL, it could now be dumped before.

@Vlatombe requested a review from jglick March 17, 2020 14:06
try {
return client.events().inNamespace(namespace).withField("involvedObject.name", podName).watch(new TaskListenerEventWatcher(podName, runListener));
} catch (KubernetesClientException e) {
LOGGER.log(Level.FINE, e, () -> "Cannot watch events on " + namespace + "/" +podName);

Is it reasonable to ask that this log line be written more readily (less restricted than FINE) and include a note/link to instructions for the required level of access needed to watch pod events?

Member

Yes, this seems like it could be at INFO.

Member

Or printed to the runListener.
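
For illustration, the catch block with both suggestions applied might read roughly as follows (a sketch continuing the snippet quoted above, not the merged code; the build-log message wording and the null return are assumptions):

```java
} catch (KubernetesClientException e) {
    // Raise the level from FINE to INFO so the failure shows up in the system log...
    LOGGER.log(Level.INFO, e, () -> "Cannot watch events on " + namespace + "/" + podName);
    // ...and also surface it in the build log, since the service account likely
    // lacks permission to watch events in this namespace.
    runListener.getLogger().println("Cannot watch events on " + namespace + "/" + podName
            + "; check that the Jenkins service account is allowed to watch events.");
    return null; // the excerpt above does not show what the real method returns on failure
}
```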

@jglick (Member) left a comment

> Vlatombe force-pushed

Broke incremental review, so I guess I will need to review from scratch.

@jglick (Member) left a comment

Still have some questions but looks good.

watcher.await(template.getSlaveConnectTimeout(), TimeUnit.SECONDS);
try (Watch w1 = client.pods().inNamespace(namespace).withName(podName).watch(watcher);
Watch w2 = eventWatch(client, podName, namespace, runListener)) {
if (watcher != null) {
Member

How could it be null? You just assigned it via a constructor. Does SpotBugs not reject this?

Member Author

I guess Spotbugs thinks eventWatch could have a side-effect.

Member

AFAICT the watcher field is only written to in the one place above, so it should be definitely non-null here. You can always

Suggested change:
- if (watcher != null) {
+ assert watcher != null;

to make it behave.

Member Author

🤷‍♂

@Vlatombe merged commit 74f261b into jenkinsci:master Mar 18, 2020
@Vlatombe deleted the log-events-during-pod-provisioning branch March 18, 2020 09:24
@mickaelistria commented

Thank you so much!
