
Develop a plan for how maintainers can own their release schedule (e.g., not depend on other teams' deployments) #9574

Closed
BigLep opened this issue Jan 21, 2023 · 10 comments

@BigLep
Contributor

BigLep commented Jan 21, 2023

Done Criteria

There is an agreed-upon plan (e.g., document, issue) with the Kubo maintainers for how we can adjust the release process so that our schedule is not dependent on other teams' schedules (in particular, the PL EngRes Bifrost team's ipfs.io Gateway deployments). This plan should make clear what the acceptance criteria are and the specific steps we're going to take so that engineers can pick it up and execute on it.

Why Important

Delays in the ipfs.io gateway deployment have been the main cause of release slips. As discussed during the 0.18 retrospective, the delay starts with the deploy slipping, which opens the door for other bug fixes, improvements, etc. to creep in, which can push the date out even further.

In addition, by not owning the production service that is using the software, maintainers are shielded from seeing firsthand how the software performs in production.

User/Customer

Kubo maintainers

Notes

  1. This isn't personal towards the PL EngRes Bifrost team. We understand they have a lot to juggle. This is about the Kubo maintainers owning their destiny.
  2. We have to be careful not to slow down progress toward owning our destiny just because we have raised the bar on the rigor of our release validation. I'm all for upping our rigor, but I want to make sure we get the win from decoupling first. For example, if the standard in the past was to deploy to a production service and squint at some graphs comparing it to the previous version, then we can keep that.
@galargh
Contributor

galargh commented Jan 23, 2023

I want to suggest exploring how we could improve/get more involved in the current process before developing a new release verification solution.

During the 0.17 release, @dharmapunk82 started documenting the release process at https://www.notion.so/pl-strflt/Release-Notes-6e0efff28ee540be9ccb8f2b85104c42 🙇.

Now the question is: can we automate it?

  1. From what I understand, the infrastructure is described as code already in https://github.com/protocol/bifrost-infra, which is a great starting point.
  2. To update the Kubo version, a number of YAML files must be changed. This includes bumping the version number and, sometimes, modifying the Kubo configuration that comes with it. Because of that second part (the configuration changes), the update will likely need human review. Could a Kubo maintainer perform it?
  3. The basic deployment-success verification is pretty straightforward: run ipfs version and check Pingdom. This sounds like an easy job for a computer (see the sketch after this list).
  4. We first deploy Kubo to a single canary and then proceed to update a single bank if it is a success. Can the hosts for these operations be predetermined?
  5. The actual deployment involves applying Ansible playbooks. Can we make the required credentials available in the cloud so that this wouldn't require human intervention?
  6. What if it fails? Do Kubo maintainers have enough expertise to fix a broken deployment? Should they have it?
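
On point 3, here is a minimal sketch of what an automated post-deploy check could look like, in Go (the addresses, expected version, and probe CID below are placeholders, not the real Bifrost endpoints):

```go
// deploycheck is a minimal sketch of the deployment-success verification from
// point 3: confirm the node reports the expected Kubo version and the gateway
// serves a known CID. Addresses, version, and CID are placeholders.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

const (
	apiAddr         = "http://127.0.0.1:5001" // Kubo RPC API on the canary host (assumed reachable)
	gatewayAddr     = "http://127.0.0.1:8080" // gateway port on the same host
	expectedVersion = "0.19.0"                // version being rolled out
	probeCID        = "bafkqaaa"              // placeholder; use a known-pinned CID in practice
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}

	// 1. Check the running version via the RPC API (the same info as `ipfs version`).
	resp, err := client.Post(apiAddr+"/api/v0/version", "", nil)
	if err != nil {
		fail("version request failed: %v", err)
	}
	defer resp.Body.Close()
	var v struct{ Version string }
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		fail("decoding version response: %v", err)
	}
	if v.Version != expectedVersion {
		fail("version mismatch: got %s, want %s", v.Version, expectedVersion)
	}

	// 2. Check the gateway actually serves content (a stand-in for the Pingdom check).
	gw, err := client.Get(gatewayAddr + "/ipfs/" + probeCID)
	if err != nil {
		fail("gateway probe failed: %v", err)
	}
	gw.Body.Close()
	if gw.StatusCode != http.StatusOK {
		fail("gateway probe returned HTTP %d", gw.StatusCode)
	}

	fmt.Println("deployment check passed: running", v.Version)
}

func fail(format string, args ...any) {
	fmt.Fprintf(os.Stderr, format+"\n", args...)
	os.Exit(1)
}
```

Something like this could run against the canary right after the playbook finishes and gate the bank rollout on its exit code.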

As @guseggert correctly pointed out, it is also crucial to formally determine what Kubo metrics we care about.

The doc mentions:

  • resource limit exception frequency
  • TTFB

In my opinion, such a list is a prerequisite for any further work. It might also be a good idea to make a distinction between success and performance metrics. If we want to further develop the current deployment process, it'd be helpful when thinking about things like automatic rollbacks or shadow deployments. If we decide to utilise any other solution, it'd be easier to reason about whether the replacement lets us answer the questions we need answered.

Finally, it'd be interesting to see Thunderdome integrated more closely with Kubo verification. I see it as a massive opportunity for shifting issue discovery left. However, I'd be cautious as to what extent it can "replace" a real deployment.

@guseggert
Contributor

guseggert commented Jan 25, 2023

So just to restate the goal and motivation: we want to be in a place where Kubo maintainers are confident in Kubo releases without requiring coordination with specific external groups. We'd like to treat the ipfs.io gateways as just another gateway operator, so that it is the operators of ipfs.io who are responsible for timely testing of RC releases, and if no feedback comes in during the RC cycle, then we move forward with the release. Currently we block releases on ipfs.io testing, which requires a lot of coordination due to Kubo maintainers (rightfully) not having direct access to production systems that they don't own. We do this because it is the best mechanism we have for load testing Kubo on real workloads.

I think there are broadly two strategies we could take here:

  1. Kubo maintainers own and operate their own production infrastructure
  2. Kubo maintainers own load testing infrastructure

I think 1) is an ideal long-term direction, because I believe in the benefits of "owning" the code you write/merge. The people pressing the merge button should feel the pain of bad decisions, not just the users...this "skin in the game" not only drives motivation for high quality standards, but gives maintainers much more leverage to push back on bad code/designs.

Many long-standing performance issues and features for gateway operators also continue to languish because maintainers are not feeling the constant pain of them, and Kubo maintainers are in the best position to make the significant changes to Kubo to fix them or add the necessary features. The ipfs.io gateway operators have papered over many issues ("bad bits" blocking, figuring out when to manually scale, dashboards, excessive resource usage, etc.), which provides a fix for ipfs.io but for nobody else, which is very unfortunate for the ecosystem.

There's also a product angle to 1) that I've always been interested in. Currently the ipfs.io infrastructure is closed source and private, because the cost of extracting it is too high. Also there are numerous design deficiencies with it that make operating it more painful than it needs to be (top of mind: lack of autoscaling). Providing a solid "out-of-the-box" gateway product that fixes these issues would be beneficial for the community IMO. (related: https://github.com/ipfs-cluster/ipfs-operator)

That being said, I think 2) is a good incremental step towards 1) anyway, and will not be throwaway work even if 1) is not pursued. The crux is: can we get 2) to a point where Kubo maintainers are confident cutting a release without feedback from ipfs.io gateways? I think the answer is "yes", and I think we're almost there already.

So I propose the following concrete actions:

  • Agree on "launch blocker" metrics
    • Just crib from Bifrost here: the metrics and thresholds they use for paging
    • These are what Bifrost uses:
      • CPU per-core load > 1.2 for 75 min
      • Mem util >= 95% for 10 min
      • No traffic for 1 min
      • Wantlists > 20k
      • 5xx's > 25 over 5 min
      • Goroutines > 2 million over 75 min
    • My suggested additional metrics
      • TTFB
      • number of libp2p resource manager throttles
    • It would also be helpful to have "diffs" of critical metrics (connected peers, routing table health, etc.) between the old and new versions. I don't think we can alarm on these, but we can browse through them for changes and make sure they make sense given the code changes. (A sketch of automating the threshold checks above follows at the end of this comment.)
  • Get buy-in from Bifrost (ipfs.io operators)
  • Set up Thunderdome to be run by the release engineer
    • Talk to @iand
    • Set up all the metrics and monitors
    • Write a section in the release runbook for running the Thunderdome tests (how long, passing criteria, etc.)

I'd prefer we start with the release engineer just running it ad hoc, see how it goes, and then determine if we want to add it to CI and what that would look like.
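
To make the launch-blocker idea concrete, here is a minimal sketch of checking a few of the thresholds above against a Prometheus instance instead of eyeballing dashboards. The Prometheus address and the non-Go metric names are assumptions for illustration, not whatever the gateway hosts actually export; the thresholds mirror the list above:

```go
// blockercheck is a sketch of evaluating the "launch blocker" metrics against
// a Prometheus instance. The Prometheus address and most metric names are
// illustrative placeholders (go_goroutines is a standard Go collector metric).
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// Each PromQL expression returns series only when its threshold is breached,
// so an empty result means the check passes.
var checks = map[string]string{
	"goroutines > 2 million over 75 min": `avg_over_time(go_goroutines[75m]) > 2e6`,
	"mem util >= 95% for 10 min":         `avg_over_time(mem_utilisation_ratio[10m]) >= 0.95`,
	"5xx's > 25 over 5 min":              `sum(increase(gateway_http_requests_total{code=~"5.."}[5m])) > 25`,
}

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"}) // placeholder
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	prom := promv1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	blocked := false
	for name, query := range checks {
		result, _, err := prom.Query(ctx, query, time.Now())
		if err != nil {
			fmt.Fprintf(os.Stderr, "query for %q failed: %v\n", name, err)
			blocked = true
			continue
		}
		if result.String() != "" { // non-empty result: threshold breached
			fmt.Printf("LAUNCH BLOCKER: %s\n%s\n", name, result)
			blocked = true
		}
	}
	if blocked {
		os.Exit(1)
	}
	fmt.Println("all launch-blocker checks passed")
}
```

If something like this ran at the end of a soak period, "no launch blockers fired" becomes a concrete, recordable pass/fail signal for the release engineer.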

@guseggert
Contributor

@galargh I think we should avoid tying releases to ipfs.io deployments if possible. If something goes wrong it's quite difficult (and dangerous) for us to debug; it usually involves coordinating with a Bifrost engineer, which dramatically increases turnaround time. I really think we'd be in a better world if the infrastructure running the release acceptance tests were owned by us and were not production systems.

> Finally, it'd be interesting to see Thunderdome integrated more closely with Kubo verification. I see it as a massive opportunity for shifting issue discovery left. However, I'd be cautious as to what extent it can "replace" a real deployment.

Can you elaborate on the specific deficient areas of Thunderdome?

@BigLep
Contributor Author

BigLep commented Feb 5, 2023

@guseggert : the 0.19 release is going to sneak up on us quickly. Can you please engage with @iand to see if there's any way we can get help here on the Thunderdome side so we don't need to rely on Bifrost deployments for the next release?

@lidel
Member

lidel commented Feb 6, 2023

Just a meta-note/flag that PL is moving towards switching ipfs.io from the Kubo binary to bifrost-gateway instead.

We will lose the ability to dogfood Kubo unless we explicitly own a % of gateway traffic and route it to gateways backed by Kubo instead of Saturn.

@guseggert
Contributor

For the 0.19 release, the release engineer should work with @iand to run Thunderdome on Kubo to mimic the validation we perform on ipfs.io gateways. @iand has added a lot of documentation around this; it will probably take a release or two to work through the process here, and then when we have the manual process working okay we can think about automating it.

@galargh
Contributor

galargh commented Feb 28, 2023

> @galargh I think we should avoid tying releases to ipfs.io deployments if possible. If something goes wrong it's quite difficult (and dangerous) for us to debug; it usually involves coordinating with a Bifrost engineer, which dramatically increases turnaround time. I really think we'd be in a better world if the infrastructure running the release acceptance tests were owned by us and were not production systems.
>
> > Finally, it'd be interesting to see Thunderdome integrated more closely with Kubo verification. I see it as a massive opportunity for shifting issue discovery left. However, I'd be cautious as to what extent it can "replace" a real deployment.
>
> Can you elaborate on the specific deficient areas of Thunderdome?

Sorry I missed this earlier. I didn't have any specific Thunderdome "deficiencies" in mind. I was only speaking to the fact that it is not a "real" deployment that end users interact with. So using it can give us more confidence, but the final verification will still happen only when the code reaches ipfs.io. I was really trying to gauge how far off we might be from setting up Continuous Deployment to ipfs.io. Given the answers, it seems to me we're not really there yet.

> For the 0.19 release, the release engineer should work with @iand to run Thunderdome on Kubo to mimic the validation we perform on ipfs.io gateways. @iand has added a lot of documentation around this; it will probably take a release or two to work through the process here, and then when we have the manual process working okay we can think about automating it.

I strongly believe someone from the core Kubo maintainers team should drive the validation setup work. I'm happy to get involved and help out but I think it'd be hard for me or Ian to definitively conclude what set of experiments would give Kubo maintainers enough confidence to proceed with a release.

@iand
Contributor

iand commented Feb 28, 2023

> For the 0.19 release, the release engineer should work with @iand to run Thunderdome on Kubo to mimic the validation we perform on ipfs.io gateways. @iand has added a lot of documentation around this; it will probably take a release or two to work through the process here, and then when we have the manual process working okay we can think about automating it.
>
> I strongly believe someone from the core Kubo maintainers team should drive the validation setup work. I'm happy to get involved and help out but I think it'd be hard for me or Ian to definitively conclude what set of experiments would give Kubo maintainers enough confidence to proceed with a release.

I agree, but I would like to build a list of the kinds of validation we need to see. What are the success criteria?

Also, Thunderdome can simulate the Bifrost environment fairly well and captures the same metrics, but it's missing some diagnostics that the Kubo team might want. What level of logging is needed? Are there additional metrics we should add, or existing ones that should be tracked? I'm also thinking about things like taking profiles at various points, goroutine traces, and OpenTelemetry-style tracing.
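
On the profiles and goroutine traces: Kubo already serves the standard Go pprof handlers on its RPC API port, so a test run could snapshot them at chosen points without extra instrumentation. A minimal sketch, assuming the test node's API port is reachable from wherever this runs (the address below is just the local default):

```go
// snapshot is a sketch of grabbing diagnostics from a Kubo node during a test
// run. Kubo serves the standard net/http/pprof handlers on its RPC API port;
// the address below is the local default and stands in for the test host.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

const apiAddr = "http://127.0.0.1:5001"

var profiles = map[string]string{
	"goroutines.txt": "/debug/pprof/goroutine?debug=2",  // full goroutine stack dump
	"heap.pprof":     "/debug/pprof/heap",               // heap profile
	"cpu.pprof":      "/debug/pprof/profile?seconds=30", // 30s CPU profile
}

func main() {
	// The CPU profile endpoint blocks for its sampling duration, so allow for it.
	client := &http.Client{Timeout: 2 * time.Minute}
	for name, path := range profiles {
		if err := fetch(client, apiAddr+path, name); err != nil {
			fmt.Fprintf(os.Stderr, "fetching %s: %v\n", name, err)
			os.Exit(1)
		}
		fmt.Println("wrote", name)
	}
}

func fetch(client *http.Client, url, dest string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}
```

If shelling out is easier, ipfs diag profile bundles similar data into a single archive.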

@galargh
Contributor

galargh commented May 15, 2023

@BigLep Do we have a write-up of the testing process we should be following now?

@BigLep
Contributor Author

BigLep commented May 15, 2023

@galargh : yeah, the "setup/using" Thunderdome docs got handled in #9872. The acceptance criteria are still fuzzy, but it at least documents the steps we should take. I think we can resolve this now.

@BigLep BigLep closed this as completed May 15, 2023