Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: remove the test-infra release team role going forward #631

Closed
Katharine opened this issue May 15, 2019 · 20 comments

Comments

@Katharine
Copy link
Member

commented May 15, 2019

I propose that we work to eliminate the test-infra role after the 1.15 release concludes.

Historically, the test-infra role has had three responsibilities: some (very tedious) config mangling to create a new release branch and drop the oldest one, some config editing to turn code freeze on and off, and a broader "keep the test infrastructure healthy" responsibility. These responsibilities were once quite time-consuming. Additionally, performing the role required e.g. pushing images at the right time, which required special privileges.

Today these tasks are less burdensome:

  • Through significant work on automation, the release-branch config-mangling job is mostly just invoking a script, and perhaps cleaning up some trailing TestGrid config. Similarly, pushing the required images is now done automatically as soon as a reference to them is committed. Special powers are no longer required.
  • Enabling / disabling code freeze is a trivial config file edit
  • Test infrastructure health is a broader concern than the release team, and primarily falls to sig-testing's test-infra oncall rotation. Healthy test infrastructure is important for everyone, and that is reflected in the test-infra oncall's work.

This reduction in load is reflected in the recently revised and slimmed down test-infra handbook. As a data point, it took me less than half an hour to work though the new "Create CI/Presubmit jobs for the new release" steps.

Given that the total responsibilities for the role are now well under an hour of total work with no real pre-reqs over the course of three months, I think it's reasonable to drop the role going forward. The amount of overhead involved in having this role will substantially outweigh the actual work the role performs,

This does still leave the two responsibilities to be carried out by someone. I would initially suggest that config editing should fall to the branch manager as part of the branch cut process.

As an additional task, the test-infra lead reports the number of PRs merged to master each week at the release team meeting. If people find this number useful, I would imagine it could be folded into the CI signal update or similar.

/area release-team

@justaugustus

This comment has been minimized.

Copy link
Member

commented May 15, 2019

Echoing my sentiments from #sig-release:

Strong agree with @Katharine and @spiffxp, which I think we've mentioned on a few calls.
The end goal should be to obviate the need for most of these Release Team roles with automation. So if we're getting close with test-infra, that's a good thing!

I would say if there seems to be not a whole bunch for the Test Infra role to do this cycle, then helping out CI Signal with their tasks would probably dovetail nicely.

Thanks to everyone who had a hand in automating this away!

Dhawal/Jorge/Cheryl -- please work together to redelegate the remaining test-infra responsibilities to the CI Signal and Branch Manager roles for 1.16, as Katharine mentioned above.

/assign @imkin @alejandrox1 @Bubblemelon
/priority important-soon
/milestone v1.15

@k8s-ci-robot k8s-ci-robot added this to the v1.15 milestone May 15, 2019
@justaugustus justaugustus added this to Backlog in Release Team via automation May 15, 2019
@justaugustus justaugustus moved this from Backlog to To do in Release Team May 15, 2019
@justaugustus justaugustus moved this from To do to In progress in Release Team May 25, 2019
@alejandrox1

This comment has been minimized.

Copy link
Contributor

commented May 27, 2019

I believe the duty of monitoring and reporting on PRs can fall under the scope of CI signal. Performing such a task aids CI signal obtain more context about the work being done upstream by all contributors which is ultimately useful at the moment of “debugging” a test failure.

Additionally, the test-infra role handbook stipulates that another responsibility of the test-infra role is to “create CI/Presubmit jobs for new releases, and populate the Testgrid dashboard.” This additional task can also be taken by CI signal, and would make sense to do so as CI signal is primarily concerned with monitoring.

These additional responsibilities, I believe, will help CI signal be more involved and will provide the CI signal team with more background to work with other maintainers and contributors. Historically, CI signal members (past and present) have been involved in tracking the degree of coverage of jobs to decide how influential these should be in determining the stability of a release; clean up unmaintained tests; advice contributors and SIGs about the type of tests we need for the sake of advising the release team that it is appropriate to make a release.

@Bubblemelon

This comment has been minimized.

Copy link
Member

commented May 28, 2019

Re:

... I would initially suggest that config editing should fall to the branch manager as part of the branch cut process.

Having read the Configure merge automation for code freeze, and thaw section of the test-infra handbook, it makes sense to include it as part of the release manager role. It gives more context of what's goes on and as stated in the release manager handbook:

... The Test Infra Lead, Branch Manager and Release Lead coordinate checking in whichever config changes are required to enable and disable merge restrictions.

In other words, yes to adding config editing as part of the branch manager role.

Are there permissions needed to this that the branch manager role wouldn't have at this point?

cc: @imkin

@Katharine

This comment has been minimized.

Copy link
Member Author

commented May 29, 2019

No special permissions are needed to do any of this work.

Given the files being edited, you'll need approval from someone in sig-testing in that OWNERS file, but having that is not a requirement for the current test-infra position anyway.

@imkin

This comment has been minimized.

Copy link
Contributor

commented May 29, 2019

@alejandrox1 I agree that the configuration of the jobs could be something the CI-signal could shoulder on. As someone who got into the test-infra role first and then tried to understand the significance of the dashboards, I always was lost. And I heard from the CI signal folks that they were always lost on where these dashboards came from. May be this is something that should be baked into the CI-Signal docs.

However I think that as a part of release process this task is tied to creation of a branch and should be triggered immediately once a branch is available which is handled by the branch manager. May be CI-Signal should be responsible for approving the forked jobs.

@Bubblemelon I do not know the load on release manager role. @claurence any comments?

@alejandrox1

This comment has been minimized.

Copy link
Contributor

commented May 30, 2019

ah thank you for the clarification @imkin ! I wasn't aware of the process but that makes a lot of sense.
CI signal wise, we would be definitely interested in being on top of the PR that creates the jobs for a new release.
So with that in mind, i guess we are set?

@claurence

This comment has been minimized.

Copy link
Contributor

commented May 30, 2019

this all sounds good to me - in terms of succession plans for 1.16 then we would dissolve the Test Infra role is what it is sounding like...

@tpepper

This comment has been minimized.

Copy link
Contributor

commented May 30, 2019

In general I agree with Katherine's proposal.

The one gap I see (and arguably it is a gap already): the velocity and type of merges on k/test-infra and test-infra's overall health. I think release team needs a liaison covering this area. I think it's been implicit in the role.

@imkin

This comment has been minimized.

Copy link
Contributor

commented May 30, 2019

@tpepper this gap is currently filled by testinfra-on-call and would probably be filled by the on-call in the future.

@jberkus

This comment has been minimized.

Copy link
Contributor

commented May 31, 2019

+1 in general.

I'm surprised that PR reports wouldn't fall under Bug Triage, though. Back in the 1.10 cycle, this was actually a responsibility of Bug Triage; I'm not sure when it got to be a test-infra duty.

In general, I'm not keen to add more regular tasks to the CI Signal workload, which is already one of the busiest jobs for the release.

@imkin

This comment has been minimized.

Copy link
Contributor

commented May 31, 2019

@tpepper

This comment has been minimized.

Copy link
Contributor

commented Jun 7, 2019

@tpepper this gap is currently filled by testinfra-on-call and would probably be filled by the on-call in the future.

Would the testinfra-on-call proactively be in the regular release team meetings, proactively reporting on test-infra health regularly? Especially at the end of the release, we don't have the luxury of time for release team to debug alone to decide maybe there's an underlying test-infra issue and then reach out to the on-call, who quite possibly then says oh yeah this is known and we're working on it. If all of this falls to the CI Signal lead, we're making their role notably larger and also potentially wasting their time.

@jberkus

This comment has been minimized.

Copy link
Contributor

commented Jun 7, 2019

The test-infra reported it as a mechanism to report prow and tide health.

This feels like it should be wholly automated

@tpepper

This comment has been minimized.

Copy link
Contributor

commented Jun 7, 2019

Some areas in docs we will want to update IMO:

@lachie83

This comment has been minimized.

Copy link
Member

commented Jun 11, 2019

+1 to this proposal. I'm happy to commence the removal of the this role at the end of 1.15 so long as the following commitments are met (in order to ensure a smooth transition into 1.16):

  • All responsibilities of the test-infra role are allocated amongst the other roles
  • Appropriate handbook updates are made to reflect the change (in a timely manner)
  • Leads take extra care to hand over these specific responsibilities to the incoming leads
  • We have a way to contact the 1.15 test-infra team to clarify any issues throughout the 1.16 release cycle
  • A clearly defined escalation path to the test-infra team should issues be encountered
@jberkus

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

Hey, I'm gonna step in and say that I am 100% opposed to any division of test-infra duties that adds workload to CI signal.

We are already having problems recruiting a CI Signal lead because of the amount of time the role requires. Adding anything to that would be crazy.

Let's add any additional duties to Bug Triage, which is a less heavily loaded role.

@tpepper

This comment has been minimized.

Copy link
Contributor

commented Jun 20, 2019

@tpepper this gap is currently filled by testinfra-on-call and would probably be filled by the on-call in the future.

Would the testinfra-on-call proactively be in the regular release team meetings, proactively reporting on test-infra health regularly? Especially at the end of the release, we don't have the luxury of time for release team to debug alone to decide maybe there's an underlying test-infra issue and then reach out to the on-call, who quite possibly then says oh yeah this is known and we're working on it. If all of this falls to the CI Signal lead, we're making their role notably larger and also potentially wasting their time.

To cover this it is important that the portions of the test-infra README.md documentation around how to contact testing folks (Slack #testing-ops, @test-infra-oncall; http://gcsweb.k8s.io/gcs/test-infra-oncall/) and that release team CI Signal staff must proactively discuss possible issues with the broader set of test invested folks.

@imkin

This comment has been minimized.

Copy link
Contributor

commented Jul 8, 2019

I believe we have made the transition of the roles. During 1.16 release as we learn new gaps we should fill them. For now I believe the docs reflect the transfer of responsibilities.

@spiffxp

This comment has been minimized.

Copy link
Member

commented Jul 23, 2019

/close
Please reopen if there is any work that remains to be done here

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jul 23, 2019

@spiffxp: Closing this issue.

In response to this:

/close
Please reopen if there is any work that remains to be done here

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Release Team automation moved this from In progress to Done Jul 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.