
Approve jobs if at least older jobs passed #168

Merged (1 commit into openSUSE:master, Mar 20, 2024)

Conversation

@michaelgrifalconi commented Feb 21, 2024

https://progress.opensuse.org/issues/97118

  • if aggregate update failed, do not give up immediately
  • look at previous openQA jobs: if one is present, green, not too old, and includes the update under test, ignore that failure

This is to reduce the impact of one test being broken one day and a different test another day, with the update not being approved even though the combined results are all green, just not at the same time.

I did not touch the tests yet. Before investing more, I would like to discuss the new logic and its implementation.
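
For illustration, a minimal sketch of the check described above (not the actual qem-bot code; the contains_update helper and the 3-day age limit are assumptions made only to keep the sketch self-contained):

from datetime import datetime, timedelta

def can_ignore_failure(older_jobs, contains_update, max_age_days=3):
    """Return True if an older green aggregate job can stand in for the failed one.

    older_jobs: newest-first list of dicts with "id", "build" (e.g. "20240221-1")
    and "result"; contains_update is a callable injected to keep the sketch small.
    """
    oldest_usable = datetime.now() - timedelta(days=max_age_days)
    for job in older_jobs:
        build_date = datetime.strptime(job["build"][:-2], "%Y%m%d")
        if build_date < oldest_usable:
            return False  # remaining jobs are even older, give up
        if job["result"] not in ("passed", "softfailed"):
            continue  # this older run is not green, look at the next one
        if not contains_update(job["id"]):
            return False  # older runs likely never included this update, give up
        return True  # green, recent and includes the update: ignore the failure
    return False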

@michaelgrifalconi changed the title "Enhance bot approval logic" to "[WIP] Enhance bot approval logic" on Feb 21, 2024
@okurz (Member) left a comment

Neat idea! I wonder if this can actually work, because so far the approver workflow, AFAIK, never contacts openQA directly.

@michaelgrifalconi (Author)

Neat idea! I wonder if this can actually work, because so far the approver workflow, AFAIK, never contacts openQA directly.

I can assure you it works; I ran a lot of dry runs on my machine with live dashboard and openQA data!
I stole some openQA calls that the bot does when syncing jobs and created some more using the same client.
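
For context, a rough sketch of the kind of call meant here, assuming the client in question is python-openqa_client; the wrapper name and server below are illustrative, not the exact bot code:

from openqa_client.client import OpenQA_Client

client = OpenQA_Client(server="openqa.suse.de")

def get_single_job(job_id):
    # GET /api/v1/jobs/<id> returns {"job": {...}} with settings, result, build, ...
    return client.openqa_request("GET", "jobs/%s" % job_id)["job"]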

@michaelgrifalconi (Author)

If the code looks good enough, I will move on to adapting the tests!

okurz: This comment was marked as resolved.

@michaelgrifalconi (Author)

As you stated you already tested manually, can you provide the logs from such a run so that we can see the flow of execution from the log messages?

https://gist.github.com/michaelgrifalconi/4d07eee0197c929db7c2a11b85759edb

Note that lines like

2024-02-22 08:13:59 INFO     20240221
2024-02-22 08:13:59 INFO     Failed job date: 2024-02-21

will be removed, as I mentioned in the other conversations.

@michaelgrifalconi (Author)

New log with less frequent/redundant and more descriptive messages: https://gist.github.com/michaelgrifalconi/41b6451b36cdfb87814b9ce9635a9459

I would not reduce the logging too much, since it is something you would look at only in case of problems and it does not disturb anyone (of course, as long as it is not so huge as to be expensive on resources or to just clutter debugging).

@okurz (Member) left a comment

My feedback after the QE Tools workshop about this topic:

  1. What are alternatives that we can consider? How about enabling automatic retry again?
  2. How about shifting more aggregate tests to incidents, especially the tricky ones more prone to failing?
  3. There is no feedback to openQA so it will make it even harder for reviewers to find which jobs are actually blocking -> Hence my suggestion is instead to cover this logic in a separate application which simply writes "approvable_for" comments on openQA directly. This way there is direct feedback to reviewers, the same as to qem-bot. Thinking even further, such external applications could be triggered from openQA job hooks like the scripts in https://github.com/os-autoinst/scripts/ do. I suggest you take a look at https://github.com/os-autoinst/scripts/blob/master/openqa-label-known-issues-and-investigate-hook for this.
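
For illustration, posting such a comment from a hook script or separate application could look roughly like the sketch below. Only the POST jobs/<id>/comments route with a text parameter is standard openQA API; the comment format and server name are assumptions:

from openqa_client.client import OpenQA_Client

client = OpenQA_Client(server="openqa.suse.de")  # picks up the usual API key/secret config

def leave_approvable_for_comment(job_id, incident, request_id):
    # POST /api/v1/jobs/<id>/comments with a "text" parameter adds a job comment
    text = "approvable_for: SUSE:Maintenance:%s:%s" % (incident, request_id)
    client.openqa_request("POST", "jobs/%s/comments" % job_id, params={"text": text})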


@okurz (Member) commented Mar 13, 2024

Before too much development and review effort is invested, please keep #168 (review) in mind, which I think we agreed to follow up on.

@michaelgrifalconi (Author)

1. What are alternatives that we can consider? How about enabling automatic retry again?

I agree on making sure automatic retries are set to at least RETRY=1 globally. Where would be a good place to set that? Medium types?
That is in any case separate and does not replace this PR, which is for reproducible issues (either test or product issues, unrelated to the update that shall be approved and that was green the day before).

2. How about shifting more aggregate tests to incidents, especially the tricky ones more prone to failing?

Sure, there are tickets open for that. Still, this is a different topic and I believe both things help.

3. There is no feedback to openQA so it will make it even harder for reviewers to find which jobs are actually blocking -> Hence my suggestion is instead to cover this logic in a separate application.[...]

No, introducing a new thing that changes the behavior of something else from a different place makes it even more frustrating for an engineer to understand "what is happening and why".
I would agree to moving out the entire approval logic and handling it in a simpler way (no hand-crafted caching and syncing of data), but that is a future topic.

Right now this is the smallest tweak possible to increase the quality of life for maintenance test developers.

About the visibility issue, I think that all current test failures should be fixed/softfailed at some point, since they surely block updates that were not tested the day before (like newly released ones).
In addition, this is just about aggregates, and they run every day. I see no reason why anyone would spend time fixing a failure from the previous day instead of focusing on the present ones.

I see that it would still be nice to have visibility into what is really blocking an update and what could be ignored for a specific update request (via an "approvable for" comment). Too bad the dashboard does not show that either, AFAIK.

For this topic I would either:

  • switch from ignoring the failure to first writing an "Approvable for" comment (and then look at such comments before ignoring the failure)
  • merge it as it is to get the benefit of getting stuff approved, and then start discussing a better solution, like the one in the line above or something else

I have no strong opinion on either option, as long as we don't spend too much time discussing cosmetics and new cosmetic changes do not require me to do more rebases of this PR (which I believe to be more important than suddenly changing code styles).

@okurz (Member) commented Mar 13, 2024

switch from ignoring the failure to first writing an "Approvable for" comment (and then look at such comments before ignoring the failure)
Yes, that could be done in here as well. I just think that this would be easier to implement, and also to run, in a separate script. Or, maybe as a compromise, run that pre-approval, writing the comments, as a separate command within qem-bot? I am fine with either approach, just pointing out ideas. You can choose and we will support you either way :)

@asmorodskyi: This comment was marked as off-topic.

@michaelgrifalconi: This comment was marked as off-topic.

@okurz (Member) left a comment

One minor phrasing issue left; the rest is fine.

@michaelgrifalconi (Author)

Considering that:

  • we see more and more real situations where this logic would help
  • changing the design to writing comments would require more time and make dry-run testing more difficult

I would like to proceed as it is, add the necessary tests, and then, as soon as it is merged, start a discussion on how to improve visibility and consider the various options like commenting and/or moving to a different bot job, etc.

@okurz (Member) commented Mar 14, 2024

CI failures in https://github.com/openSUSE/qem-bot/actions/runs/8276775359/job/22649198458?pr=168#step:5:221

I would like to proceed as it is, add the necessary tests, and then, as soon as it is merged, start a discussion on how to improve visibility and consider the various options like commenting and/or moving to a different bot job, etc.

Yes, I can accept that, although be aware that concerns about the overall approach were raised by others, not just me. So besides the CI failures I see only two points missing before I can approve:

  1. The PR is marked as "WIP" and I won't approve before you confirm that the PR is ready to be approved, i.e. remove the WIP
  2. Squash the commits that just fixup the original commit

@codecov-commenter commented Mar 14, 2024

Codecov Report

Attention: Patch coverage is 84.05797% with 11 lines in your changes missing coverage. Please review.

Project coverage is 67.74%. Comparing base (7b921a0) to head (806cbc0).

Files                   Patch %   Lines
openqabot/approver.py   85.41%    7 Missing ⚠️
openqabot/openqa.py     80.00%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #168      +/-   ##
==========================================
+ Coverage   67.00%   67.74%   +0.73%     
==========================================
  Files          25       25              
  Lines        1664     1730      +66     
==========================================
+ Hits         1115     1172      +57     
- Misses        549      558       +9     

☔ View full report in Codecov by Sentry.

@michaelgrifalconi (Author)

Right when I was feeling ready with this, I found confirmation of the issue caused by using SMELT IDs instead of RR IDs.
Moving the discussion about it here: #174

@michaelgrifalconi (Author) commented Mar 14, 2024

Right when I was feeling ready with this, I found confirmation of the issue caused by using SMELT IDs instead of RR IDs. Moving the discussion about it here: #174

Having lunch brought me some wisdom. We can make use of the same workaround described in the open issue to avoid falling into the rabbit hole of fixing everything in the system at once.
I can add a check here to make sure the selected job is still present in the qem-dashboard. If it is not, that means it was removed by qem-dashboard/#78 and should not be used.
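
A possible shape of that check, as a sketch only; the qem-dashboard route and auth header are assumptions, not the real dashboard API:

import requests

def validate_job_qam(job_id, dashboard_url, token):
    """Return True if the openQA job is still known to qem-dashboard."""
    response = requests.get(
        "%s/api/jobs/%s" % (dashboard_url, job_id),  # assumed route
        headers={"Authorization": "Token %s" % token},  # assumed auth scheme
        timeout=30,
    )
    return response.status_code == 200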

@michaelgrifalconi (Author)

I think I am ready for the last review. If it's all fine, I will do one last consolidation of commits and we can merge!
By the way, this is what would be approved by the new logic:

2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32461:324048
2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32575:324054
2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32795:324057
2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32898:324045
2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32912:324047
2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32934:324023
2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32940:323930
2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32951:323996
2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32956:324041
2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32959:324042
2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32960:324072
2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32967:324105
2024-03-19 08:14:56 INFO     * SUSE:Maintenance:32971:324109

@okurz (Member) left a comment

+1

@michaelgrifalconi changed the title "[WIP] Enhance bot approval logic" to "Enhance bot approval logic" on Mar 19, 2024
@okurz (Member) left a comment

OK, now we need to find a better subject line than "Enhance bot approval logic". How about "Approve jobs if at least older jobs passed"?

Comment on lines 174 to 208
job = older_jobs["data"][i]
job_build = job["build"][:-2]
job_build_date = datetime.strptime(job_build, "%Y%m%d")

# Check the job is not too old
if job_build_date < oldest_build_usable:
    log.info(
        "Cannot ignore aggregate failure %s for update %s because: Older jobs are too old to be considered"
        % (failed_job_id, inc)
    )
    return False

if job["result"] == "passed" or job["result"] == "softfailed":
    # Check the job contains the update under test
    job_settings = self.client.get_single_job(job["id"])
    if not regex.match(str(job_settings)):
        # Likely older jobs don't have it either. Giving up
        log.info(
            "Cannot ignore aggregate failure %s for update %s because: Older passing jobs do not have update under test"
            % (failed_job_id, inc)
        )
        return False

    if not self.validate_job_qam(job["id"]):
        log.info(
            "Cannot ignore failed aggregate %s using %s for update %s because it is not present in qem-dashboard. It's likely about an older release request"
            % (failed_job_id, job["id"], inc)
        )
        return False

    log.info(
        "Ignoring failed aggregate %s and using instead %s for update %s"
        % (failed_job_id, job["id"], inc)
    )
    return True
Member

This for-body is now indented a bit too much. Can you extract a method here?

@michaelgrifalconi (Author)

Not a huge fan of creating methods for things that get called only in one place in the code, making me jump around in the file to follow the flow, but I guess it's a personal preference. Will do as requested

@michaelgrifalconi (Author)

I refactored the code a bit: without needing a new method, I removed a nested if to go down one level of indentation and make it more readable.
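
Illustrative only, not the actual diff: the de-nesting described here amounts to replacing the wrapping result check with an early continue, e.g.:

def first_green(jobs):
    for job in jobs:
        # guard clause: skip non-green jobs instead of nesting the remaining checks
        if job["result"] not in ("passed", "softfailed"):
            continue
        # ...the remaining checks stay one indentation level flatter...
        return job
    return None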

@okurz changed the title "Enhance bot approval logic" to "Approve jobs if at least older jobs passed" on Mar 20, 2024
@okurz (Member) left a comment

oh, sorry. You added another commit. Please squash, then I can approve

@michaelgrifalconi (Author)

Yeah, I usually try to show what I change instead of force-pushing on every change, since that makes it more difficult to check and might hide some stuff! Rebased now.

- if aggregate update failed, do not give up immediately
- look at openQA previous jobs, if present, green, not too old,
  still present in the qem-dashboard (to avoid using tests about
  different Release Requests) and it includes the update under test:
  ignore that failure

This is to reduce the impact of one test being broken one day and a
different test another day, with the update not being approved even
though the combined results are all green, just not at the same time.
@michaelgrifalconi (Author)

Thanks for the approval! I have no rights to merge, so someone else will have to do it :)

@kalikiana merged commit 00b4000 into openSUSE:master on Mar 20, 2024
3 checks passed