Skip to content
This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

Fix bors problems or decide to remove it #6139

Closed
ekager opened this issue Oct 19, 2019 · 12 comments
Closed

Fix bors problems or decide to remove it #6139

ekager opened this issue Oct 19, 2019 · 12 comments
Assignees
Labels
eng:automation Build automation, Continuous integration, .. eng:qa:not-needed Added by QA to issues that cannot be tested

Comments

@ekager
Copy link
Contributor

ekager commented Oct 19, 2019

Because I don't like it 馃懜馃徏

鈹咺ssue is synchronized with this Jira Task

@ekager ekager added the eng:health Improve code health label Oct 19, 2019
@ekager
Copy link
Contributor Author

ekager commented Oct 19, 2019

My arguments:

  • Have had more issues with it than time it has saved us
  • Flakiness / weird error messages that aren't easily resolved (at least with my level of bors understanding)
  • When it autocloses the PR no one has the immediate responsibility to "QA-Needed" the issue that was linked or go and add the milestone - so it's still a manual process you just have to remember to go back and do it

@ekager
Copy link
Contributor Author

ekager commented Oct 19, 2019

If it would be possible to make bors an optional check so that non-admins could still merge issues if bors fails or without bors at all (once all tests pass on the PR like before) I would consider keeping it around if people have strong opinions.

@ekager
Copy link
Contributor Author

ekager commented Oct 19, 2019

In the small amount of time we've had bors, I've run into multiple times where bors refuses to merge a PR (even a small change) and after a million "bors retry"s I'm pretty over it. I've wasted enough time coming back and checking on it it would have been 1000x faster to just merge it myself

@NotWoods
Copy link
Contributor

It's also impossible (as far as I can tell) to see the problems now if a build fails. This might be due to configuration, since A-C shows taskcluster checks still.

@NotWoods NotWoods added the eng:automation Build automation, Continuous integration, .. label Oct 20, 2019
@ekager ekager removed the eng:health Improve code health label Oct 20, 2019
@sblatz
Copy link
Contributor

sblatz commented Oct 21, 2019

When it autocloses the PR no one has the immediate responsibility to "QA-Needed" the issue that was linked or go and add the milestone - so it's still a manual process you just have to remember to go back and do it

I think we originally thought we could automate more of this so it鈥檇 not be manual at all. Since we can鈥檛, I think this is one of the stronger arguments for removing it. Things are more likely to fall through the cracks.

I鈥檓 not sure how AC hasn鈥檛 run into all of these merging problems. I wonder if @pocmo can shine some light on that? We consistently get PR鈥檚 that get in an impossible-to-merge state no matter how many times we retry.

If we can鈥檛 resolve these issues I鈥檓 definitely in @ekager鈥檚 camp of removing it.

sblatz added a commit to sblatz/fenix that referenced this issue Oct 21, 2019
I'm sorry, Dave, I'm afraid I can't do that.
@pocmo
Copy link
Contributor

pocmo commented Oct 22, 2019

I鈥檓 not sure how AC hasn鈥檛 run into all of these merging problems. I wonder if @pocmo can shine some light on that? We consistently get PR鈥檚 that get in an impossible-to-merge state no matter how many times we retry.

bors has been very stable for us. If there were problems then they were caused by taskcluster tasks failing, freezing, not reporting status etc. - even without bors that seems to be something we should get resolved.

I'm happy to help fix and debug that. What errors did bors report (build failed, timeout, not responding with anything at all..)?

I usually start by looking at the taskcluster checks and tasks for the merge commit bors creates. One of them is usually the culprit.

In general bors is super helpful to keep a healthy green master since it merges PRs that have been tested against the latest master change. I wouldn't give that up if we can fix it. :)

@pocmo
Copy link
Contributor

pocmo commented Oct 22, 2019

What I just saw on my PR is:

  • UI tests are taking ages and at the end fail (-> Long-term: Lets see if we can make them faster, short-term: Fix the ui tests and/or Consider removing ui tests from the checks bors waits for)
  • Bors did not fail immediately when a task failed and was instead still waiting on the other tasks (and eventually timed out). I'm not sure whether we can configure that but in AC we only let bors wait on one "complete" task and that worked.

@sblatz sblatz changed the title Get rid of bors Fix bors problems or decide to remove it Oct 23, 2019
@JohanLorenzo
Copy link
Contributor

Hey there!

Thanks for raising concerns about bors! I don't personally have any preference about using bors itself. Although I should say a group of Mozillians from different teams came to the consensus that having an autoland bot is the best solution we can have at the moment. The context and the decision are gathered in this document. Bors is one autoland bot, I'm open to switching to another bot.

Flakiness / weird error messages that aren't easily resolved (at least with my level of bors understanding)

I confess I'm disappointed in bors try. It intermittently times out, without any explanatory error message. This issue has been reported at bors-ng/bors-ng#739. So far, no active developer interacted with us.

An example of weird error message I got is this one: #6117 (comment).

I think no matter what autoland bot we use, we will run into setup issues. Bors has been proven good enough to suit https://github.com/servo/servo, which has a similar complexity as Gecko. What changes between servo and Fenix+AC is the way we use taskcluster-github. Fenix+AC have used Github Checks (the new API) for the past month, while Servo remains on Github Statuses (the old API). I believe the situation can be improved if bors provides a better support for Github Statuses.

It's also impossible (as far as I can tell) to see the problems now if a build fails. This might be due to configuration, since A-C shows taskcluster checks still.

I might not have the full context, but Fenix has showed Taskcluster checks (aka Github Checks) for the past month. Do you have an example of a failed build where you cannot see what the problem is?

UI tests are taking ages and at the end fail (-> Long-term: Lets see if we can make them faster, short-term: Fix the ui tests and/or Consider removing ui tests from the checks bors waits for)

馃憤
Another short term fix I can help with:

# TODO Generate APKs in a build task instead
This way the UI tests just run the tests and doesn't build a dedicated APK in the same APK task.

Bors did not fail immediately when a task failed and was instead still waiting on the other tasks (and eventually timed out). I'm not sure whether we can configure that but in AC we only let bors wait on one "complete" task and that worked.

We can be at parity with AC, if I manage to land #6117 (let's discuss the technical details of why I cannot in the PR itself).

Please let me know if this helps addressing your major concerns 馃檪

@liuche
Copy link
Contributor

liuche commented Oct 29, 2019

For now, we've decided:

  • On Friday, we'll turn on bors 11/1 but not make it a merge requirement
  • If bors is running (bc someone started a bors task), DO NOT MANUALLY MERGE
  • otherwise, can merge manually if things look fine

@sblatz
Copy link
Contributor

sblatz commented Nov 12, 2019

I'll take this away and investigate if it's working now :)

@sblatz
Copy link
Contributor

sblatz commented Dec 17, 2019

After some more testing with this, I don't think bors solves a use case for us as we still need to babysit it. Because of that I'm going to move ahead with removing it.

sblatz added a commit to sblatz/fenix that referenced this issue Dec 17, 2019
I'm sorry, Dave, I'm afraid I can't do that.
@JohanLorenzo
Copy link
Contributor

馃憤 I haven't managed to get it working on Fenix since then. I think I tried like half a dozen times. It has always timed out. I'm fine removing bors from Fenix.

@sblatz sblatz closed this as completed in 99f3804 Dec 17, 2019
@sblatz sblatz added the eng:qa:not-needed Added by QA to issues that cannot be tested label Dec 17, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
eng:automation Build automation, Continuous integration, .. eng:qa:not-needed Added by QA to issues that cannot be tested
Projects
None yet
Development

No branches or pull requests

7 participants