Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fedmsg - some messages are not registered by the plugin #206

Closed
vojtechsokol opened this issue Sep 9, 2020 · 19 comments
Closed

fedmsg - some messages are not registered by the plugin #206

vojtechsokol opened this issue Sep 9, 2020 · 19 comments

Comments

@vojtechsokol
Copy link

Hi,

our team uses JMS plugin to consume fedmsg messages with org.fedoraproject.prod.github.* topics (forwarded to fedmsg by github2fedmsg app).

Everything works fine, but after a while no jobs are triggered by the plugin. One hour after the last message was received, the plugin resubscribes the job to the topic and it works again for a while.

When this happens, Jenkins log contains these entries but no errors:

INFO com.redhat.jenkins.plugins.ci.messaging.FedMsgMessagingWorker receive
No message received for the past 3600000 ms, re-subscribing for job 'fedmsg-github-opened'.

INFO com.redhat.jenkins.plugins.ci.messaging.FedMsgMessagingWorker unsubscribe
Un-subscribing job 'fedmsg-github-opened' from org.fedoraproject.prod.github.pull_request.opened topic.

INFO com.redhat.jenkins.plugins.ci.messaging.FedMsgMessagingWorker subscribe
Subscribing job 'fedmsg-github-opened' to org.fedoraproject.prod.github.pull_request.opened topic.

INFO com.redhat.jenkins.plugins.ci.messaging.FedMsgMessagingWorker subscribe
Successfully subscribed job 'fedmsg-github-opened' to topic 'org.fedoraproject.prod.github.pull_request.opened'.
@ggallen
Copy link
Collaborator

ggallen commented Sep 9, 2020

@vojtechsokol , this is normal behavior for the plugin.

Can I get access to the Jenkins instance to take a look?

@vojtechsokol
Copy link
Author

Thank you for quick reply @ggallen
By 'normal behavior' you mean that it's expected to miss some messages?
The jenkins instance with affected job is here, you should be able to login there using your kerberos ticket.

@ggallen
Copy link
Collaborator

ggallen commented Sep 9, 2020

@vojtechsokol , no it shouldn't miss messages. I mean that resubscribing after 1 hour of no messages is normal, and those are the messages you are seeing in the log. That behavior and those messages are expected.

Let me take a look.

@ggallen
Copy link
Collaborator

ggallen commented Sep 9, 2020

@vojtechsokol , can you give me an example of a message that was handled and one that was missed? I don't see anything obvious yet.

@vojtechsokol
Copy link
Author

@ggallen I've created short script that dumps builds of the Jenkins job and messages from datagrepper: https://gist.github.com/vojtechsokol/e427c65d541ac294d3170df7a133ac9c. When you compare the resulting files (builds-pull-only.txt and messages.txt), you can see that there are a lot of builds missing (ATM more than 50%), the job was not triggered for e.g. these pull requests (lines are in the format owner:repo:pull_nr:commit_sha):

containers:podman:7578:cf14140fee185a4441acad148a971dbeabe52076
coreos:coreos-assembler:1706:af074e097bb5b859a4fea0bf12b0212f5ee3c004
operator-framework:community-operators:2303:8181d6944e2083fe35c71d915d5a27f45db27c6a
operator-framework:community-operators:2302:9cf7ae072d4ef423147a7010ade95d72a0fa8614
operator-framework:community-operators:2301:2830757fe5022059357e7a3d28a462f2ae59761d
ralphbean:bugwarrior:749:025f76ca31e72acf3426e06501576feca055f53a
mesonbuild:meson:7716:01b9f0af08fb813f68cec6edd795f3488a78728c
containers:podman:7577:b64f93f9c5cb4fc1870d7b830b2a23a4390c6940
ceph:ceph:37072:d6d73a649d89e44890f925278433e9ab55ffa68c

@ggallen
Copy link
Collaborator

ggallen commented Sep 9, 2020

@vojtechsokol , information overload!

All I want it a job, a message that triggered, and a message that didn't. I don't see how to find the actual messages in all the data the script produces.

@vojtechsokol
Copy link
Author

@ggallen
examples of correctly triggered builds and corresponding messages:
message: https://apps.fedoraproject.org/datagrepper/id?id=2020-7a1b1511-ed51-4657-93be-b2dc86cc25e0&size=extra-large
Jenkins job build: <JENKINS_URL>/job/ci-dispatch-github_pull_request_opened/2186

examples of messages that didn't trigger build
message: https://apps.fedoraproject.org/datagrepper/id?id=2020-036dd572-b42f-422c-a714-92030c33a7d4&size=extra-large
Jenkins job build: <JENKINS_URL>/job/ci-dispatch-github_pull_request_opened/2185

message: https://apps.fedoraproject.org/datagrepper/id?id=2020-b572e50a-857a-4ef1-8bec-d9dfb75b3f89&size=extra-large
Jenkins job build: <JENKINS_URL>/job/ci-dispatch-github_pull_request_opened/2184

message: https://apps.fedoraproject.org/datagrepper/id?id=2020-b5efe6bc-7e0b-4742-a726-3579362bf303&size=extra-large
Jenkins job build: <JENKINS_URL>/job/ci-dispatch-github_pull_request_opened/2183

message: https://apps.fedoraproject.org/datagrepper/id?id=2020-84cf66e9-4953-4096-be06-cc7c46f75694&size=extra-large
Jenkins job build: <JENKINS_URL>/job/ci-dispatch-github_pull_request_opened/2182

https://apps.fedoraproject.org/datagrepper/id?id=2020-08e6eece-4ed6-4a8c-9a06-3a49c28ca3f2&size=extra-large
https://apps.fedoraproject.org/datagrepper/id?id=2020-b82632dc-22ec-4a50-9104-6f6a64aa510a&size=extra-large
https://apps.fedoraproject.org/datagrepper/id?id=2020-fbc45335-409b-4656-99a3-7e517e48681c&size=extra-large
https://apps.fedoraproject.org/datagrepper/id?id=2020-2cc57b49-ebc2-481a-83be-15baa49f81b2&size=extra-large
https://apps.fedoraproject.org/datagrepper/id?id=2020-a5c9491b-2bad-4866-962a-939c3643a01c&size=extra-large

If the jenkins builds are not available any more due to retention policy, you can run the updated script, just don't forget to provide the jenkins url in the script.

@ggallen
Copy link
Collaborator

ggallen commented Sep 10, 2020

@vojtechsokol , I just checked your jms-messaging-plugin version and you are running 1.1.9. The current version is 1.1.18. Could you please update to that version and see if things improve? 1.9 was released over a year ago, and there have been a lot of changes since then.

@thrix
Copy link

thrix commented Sep 25, 2020

@ggallen should we provide more information here pls? This is becoming a blocker for migrating to these new jobs which use github -> fedsmg as triggers ...

@ggallen
Copy link
Collaborator

ggallen commented Sep 25, 2020

There's nothing obvious I can see here. This is going to require some more in-depth investigation.

Can you establish any sort of a pattern for when messages are missed vs. when they are received?

@scoheb , any thoughts?

@scoheb
Copy link
Contributor

scoheb commented Sep 28, 2020

@vojtechsokol contact me offline to discuss.

@scoheb
Copy link
Contributor

scoheb commented Sep 28, 2020

actually can you please check the status of the webhook deliveries at https://github.com/ceph/ceph/settings/hooks for https://apps.fedoraproject.org/github2fedmsg/webhook ?

It is possible there may be some 404 errors.

There might be some continual data centre migration happening.

@scoheb
Copy link
Contributor

scoheb commented Sep 28, 2020

It does seem that there may be an issue with your missed messages script. I found that build 2520, 2519, 2518 were all triggered but do not show up as matched in output.

@vojtechsokol
Copy link
Author

@scoheb Indeed, there was a bug in the script, It should work properly now.

Regarding the status of the webhook - I have access only to the oamg/leapp and oam/leapp-repository and the are two failures (500 Internal Server Error) in last 150 deliveries - however I don't think it relates to our problem - those events didn't even make it to the fedmsg.

@ggallen No idea at all if there is any pattern. The only hint I have is that maybe it affects more topics with smaller volume of messages - job ci-dispatch-github_pull_request_opened that consumes messages with topic org.fedoraproject.prod.github.pull_request.opened (722 messages in last 10 days) is affected more than job ci-dispatch-github_pull_request_commit that consumes messages with topic org.fedoraproject.prod.github.pull_request.synchronize (1213 messages in last 10 days).

@scoheb
Copy link
Contributor

scoheb commented Oct 1, 2020

@scoheb
Copy link
Contributor

scoheb commented Oct 2, 2020

@vojtechsokol @thrix

The problem was traced to some proxies behind the fedora hub that were not working correctly. This caused the issues you were seeing since it all depended on which of the 10 proxies you connected to in Jenkins when the job subscribed to a topic. Only 2 of the 10 had been working correctly.

This has since been fixed on the fedora infra side. All proxies are working well now.

Please see https://pagure.io/fedora-infrastructure/issue/9363

@scoheb scoheb closed this as completed Oct 2, 2020
@thrix
Copy link

thrix commented Oct 5, 2020

@scoheb thanks a lot for hunting this down!

@vojtechsokol
Copy link
Author

@scoheb \o/ Thanks for helping to fix that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants