Initial workflow (OWNERS, nodes, node channels, etc.) #1

Merged
wking merged 48 commits into openshift:master from wking:initial-workflow on Jan 3, 2020


wking
Member

wking commented Sep 6, 2019

Still need to work out edge handling, but here's what I have so far for folks to look at. CC @lucab, @steveej

openshift-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Sep 6, 2019
wking force-pushed the initial-workflow branch 3 times, most recently from b694135 to 3ba7267 on September 6, 2019 04:37
@lucab
Contributor

lucab commented Sep 6, 2019

Two high level questions on this:

  1. do you plan for Cincinnati to consume all the JSON blobs directly, or do you plan to have an additional aggregation/transformation step in-between? (I'll have further followups based on this)
  2. to avoid confusion, it may be useful to replace the "cincinnati-" prefix in this repo name with something OpenShift-specific. The equivalent of this for Fedora CoreOS (FCOS) is in a repo named "fedora-coreos-streams", under updates/.

@wking
Member Author

wking commented Sep 6, 2019

do you plan for Cincinnati to consume all the JSON blobs directly, or do you plan to have an additional aggregation/transformation step in-between? (I'll have further followups based on this)

The latter. A postmerge Prow hook will collect all of this, massage it into a single JSON blob, and publish it somewhere outside of GitHub for Cincinnati to consume (S3? An app-sre ConfigMap? Quay labels?).
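A minimal sketch of that massaging step, in Python to match the repo's `hack/` scripts. It assumes the layout visible in this PR (`nodes/<major>.<minor>/<version>.json`, channel membership as symlinks under `channels/<channel>/`, and one file per edge at `edges/<major>.<minor>/<to>/<from>.json`) and emits Cincinnati's graph shape: a `nodes` array plus `edges` as `[from_index, to_index]` pairs. The `build_graph` name, the `payload` key read from node files, and the `channels` metadata key are hypothetical placeholders, not the repo's actual format.

```python
import json
from pathlib import Path


def build_graph(repo):
    """Collapse nodes/, channels/ and edges/ into one Cincinnati-style graph.

    Assumed layout (inferred from this PR, not a documented format):
      nodes/<major>.<minor>/<version>.json
      channels/<channel>/<version>.json   (symlinks to node files)
      edges/<major>.<minor>/<to>/<from>.json
    """
    repo = Path(repo)

    # Load every node file; the file name (minus .json) is the version.
    nodes = {}
    for path in sorted(repo.glob("nodes/*/*.json")):
        with open(path) as f:
            nodes[path.stem] = json.load(f)

    # Channel membership comes from which channel directories link to a node.
    channels = {version: set() for version in nodes}
    for link in repo.glob("channels/*/*.json"):
        if link.stem in channels:
            channels[link.stem].add(link.parent.name)

    versions = sorted(nodes)
    index = {version: i for i, version in enumerate(versions)}

    # Each edges/<minor>/<to>/<from>.json file declares one from->to edge.
    edges = []
    for path in sorted(repo.glob("edges/*/*/*.json")):
        to_version, from_version = path.parent.name, path.stem
        if from_version in index and to_version in index:
            edges.append([index[from_version], index[to_version]])

    return {
        "nodes": [
            {
                "version": version,
                "payload": nodes[version].get("payload", ""),  # hypothetical key
                "metadata": {"channels": ",".join(sorted(channels[version]))},
            }
            for version in versions
        ],
        "edges": sorted(edges),
    }
```

Versions are sorted as plain strings here, which gives stable indices but is not semver ordering.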

to avoid confusion, it may be useful to replace the "cincinnati-" prefix in this repo name with something OpenShift-specific.

The org name is OpenShift-specific, and folks who need to distinguish between multiple Cincinnati data repos can rename when they fork (e.g. my fork of openshift/release is wking/openshift-release).

But on both points, the repo is new. Push back if I'm not convincing ;).

@lucab
Contributor

lucab commented Sep 6, 2019

All good.

On the former, it's nice to have a massaged format. It'd be good to keep it close to the way we query the key-value Quay API.
(Perhaps you can even think of making this format human-friendlier if people are supposed to write entries by hand, like YAML or TOML)

On the latter, it's just a minor naming concern. Cincinnati was originally the name of the protocol replacing Omaha. We are slowly conflating the implementation AND the metadata format under that name, but only for the OpenShift API. However, there are already other implementations of the protocol (e.g. the endpoint for nightlies) and other metadata formats (e.g. the FCOS one).

@wking
Member Author

wking commented Sep 6, 2019

Perhaps you can even think of making this format human-friendlier if people are supposed to write entries by hand, like YAML or TOML

JSON looks like Python dicts and JavaScript objects, so it can't be too human-unfriendly ;). But sure, as long as a YAML/TOML parser is an acceptable dependency for the massaging/CI tools, I'd be ok with that. I'm fine deferring to the release managers; it should be easy to change later.

However, there are already other implementations of the protocol (e.g. the endpoint for nightlies)...

My long-term goal is to get nightlies into this flow and serve a graph for nightly-4.1, etc., out of this repo and the production OpenShift Cincinnati.

... and other metadata formats (e.g. the FCOS one).

I'd expect FCOS metadata to flow out of repos under other orgs. But no sense in diverging data formats. I'll take a closer look at the format in that repo.

@wking
Member Author

wking commented Sep 6, 2019

Checking the FCOS node definition, it looks like most of it is spent distinguishing between multiple arches and formats. Those distinctions will eventually be managed in OpenShift with multi-arch images, so I don't think we need to be quite so elaborate, especially out of the gate, when we are amd64-only. Do you think the FCOS feeder and this Cincinnati feeder are consuming different-enough artifacts to motivate different input formats? Or do you think my current nodes/ structure should be adjusted somehow?

openshift-ci-robot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) and removed the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Sep 6, 2019
@wking
Member Author

wking commented Sep 6, 2019

I've pushed up bbcf2d8 adding edge handling and doing some other polishing.

README.md (outdated; review thread resolved)
edges/4.1/4.1.0-rc.6/4.1.0-rc.5.json (review thread resolved)
openshift-ci-robot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) Sep 7, 2019
@wking
Member Author

wking commented Sep 7, 2019

I've pushed a WIP script for publishing this data to Quay labels, which we can use until we work out a better way to feed it into Cincinnati. It doesn't actually have POST/DELETE support yet, but you can run it to see what it would change:

  $ hack/push-to-quay.py
  changing 4.1.0-rc.6 previous.add from [] to [u'4.1.0-rc.5']
  (u'4.1.0-rc.6', 'FIXME post', 'https://quay.io/api/v1/repository/openshift-release-dev/ocp-release/manifest/sha256:1dabe42b5c94841fd8736d8f3a80afeaf5f5ad3833cef8d304c419a97b0efbc3/labels')
  (u'4.1.0-rc.5', 'FIXME delete', 'https://quay.io/api/v1/repository/openshift-release-dev/ocp-release/manifest/sha256:dc67ad5edd91ca48402309fe0629593e5ae3333435ef8d0bc52c2b62ca725021/labels/io.openshift.upgrades.graph.next.add')
  changing 4.1.0-rc.8 previous.add from [u'4.1.0-rc.7'] to []
  (u'4.1.0-rc.8', 'FIXME delete', 'https://quay.io/api/v1/repository/openshift-release-dev/ocp-release/manifest/sha256:8250bbf79d4f567e24a7b85795aba97f9e75a9df18738891a0cb6ba42e422584/labels/io.openshift.upgrades.graph.previous.add')

That's replacing the next.add 4.1.0-rc.5 -> 4.1.0-rc.6 with a previous.add and removing the redundant 4.1.0-rc.7->4.1.0-rc.8 label, which makes sense.
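The change set in that dry run follows a simple rule: express every edge as a `previous.add` label on its target release and drop `next.add` labels. A hypothetical planner for that rule, assuming the repo's edge list is the complete source of truth and that label values are comma-separated version lists (both are assumptions, not confirmed by the script):

```python
# Label keys as they appear in the push-to-quay.py output above.
PREVIOUS_ADD = "io.openshift.upgrades.graph.previous.add"
NEXT_ADD = "io.openshift.upgrades.graph.next.add"


def plan_label_changes(edges, current_labels):
    """Return (method, version, label, value) actions that move edges onto previous.add.

    edges: iterable of (from_version, to_version) pairs this repo declares.
    current_labels: {version: {label_key: [versions, ...]}} as read from Quay.
    """
    wanted = {}
    for from_version, to_version in edges:
        wanted.setdefault(to_version, set()).add(from_version)

    actions = []
    for version in sorted(set(wanted) | set(current_labels)):
        labels = current_labels.get(version, {})
        want = sorted(wanted.get(version, ()))
        have = sorted(labels.get(PREVIOUS_ADD, []))
        if want != have:
            if want:
                # Comma-separated value is an assumption about the label format.
                # A real client may need to DELETE an existing label first.
                actions.append(("POST", version, PREVIOUS_ADD, ",".join(want)))
            else:
                actions.append(("DELETE", version, PREVIOUS_ADD, None))
        # Every edge now lives on its target, so next.add labels are dropped.
        if labels.get(NEXT_ADD):
            actions.append(("DELETE", version, NEXT_ADD, None))
    return actions
```

With `edges=[("4.1.0-rc.5", "4.1.0-rc.6")]` and the current labels from the dry run, this plans the same three changes the script printed: a POST on 4.1.0-rc.6 and DELETEs on 4.1.0-rc.5 (`next.add`) and 4.1.0-rc.8 (`previous.add`).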

@wking
Member Author

wking commented Sep 8, 2019

Ok, I fixed up the POST and DELETE handling (although I don't have creds to actually run those, so there are a handful of untested lines in the script). That should be all we need to start using this repo to manage the production graph metadata, vs. directly editing Quay tags. I'll start a follow-up branch to add CI checks to catch things like "listed version does not exist", impossible edges, cyclic graphs, versions removed from channels, etc. And once this branch lands, I'll add a Prow postsubmit job to automatically publish after a PR merges.
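Two of those CI checks are easy to sketch: edges that reference a version the repo doesn't know about, and cycles. The following is a hypothetical `check_graph` helper, not the follow-up branch's code; impossible-edge and channel-removal checks would need data this sketch does not model.

```python
def check_graph(versions, edges):
    """Return human-readable problems for a proposed graph.

    versions: set of versions that exist (e.g. node files in the repo).
    edges: iterable of (from_version, to_version) pairs.
    """
    problems = []
    successors = {}
    for from_version, to_version in edges:
        for version in (from_version, to_version):
            if version not in versions:
                problems.append("listed version {} does not exist ({}->{})".format(
                    version, from_version, to_version))
        successors.setdefault(from_version, []).append(to_version)

    # Depth-first search with three colors: reaching a GRAY node is a back edge, i.e. a cycle.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node, path):
        color[node] = GRAY
        for nxt in successors.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                cycle = path[path.index(nxt):] + [nxt]
                problems.append("cyclic graph: " + "->".join(cycle))
            elif state == WHITE:
                visit(nxt, path + [nxt])
        color[node] = BLACK

    for node in sorted(successors):
        if color.get(node, WHITE) == WHITE:
            visit(node, [node])
    return problems
```

For example, `check_graph({"a", "b"}, [("a", "b"), ("b", "a")])` returns `["cyclic graph: a->b->a"]`.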

wking force-pushed the initial-workflow branch 2 times, most recently from cd30087 to 7d73466 on September 8, 2019 17:32
wking force-pushed the initial-workflow branch 2 times, most recently from 3f93e46 to 1dafd8c on September 9, 2019 21:35
@lucab
Contributor

lucab commented Sep 10, 2019

@wking the metadata you were looking at for FCOS is only used for boot-images. The relevant ones for Cincinnati are (example from the testing stream):

Do you think the FCOS feeder and this Cincinnati feeder are consuming different-enough artifacts to motivate different input formats? Or do you think my current nodes/ structure should be adjusted somehow?

I think it's fine if they diverge; let's just try to keep the overall approach similar. The relevant semantic differences in FCOS are:

  • there is a graph per stream. Clients cannot automatically jump across streams via auto-updates.
  • there is a single graph for all architectures in a stream. That is, update paths are not architecture dependent. Some nodes may be missing for a specific arch, though.
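One way to read the second point: each architecture sees the stream's shared graph restricted to the releases actually built for it. A sketch under that assumption (the `graph_for_arch` helper is hypothetical):

```python
def graph_for_arch(edges, available):
    """Restrict a stream's shared update graph to the releases built for one arch.

    edges: iterable of (from_version, to_version) pairs for the whole stream.
    available: set of versions that actually exist for the architecture.
    """
    # An edge survives only if both of its endpoints exist for this arch.
    return [
        (from_version, to_version)
        for from_version, to_version in edges
        if from_version in available and to_version in available
    ]
```

Dropping a node also drops its edges, so if the shared graph only routes A->B->C and B is missing for some arch, that arch has no path from A to C unless an explicit A->C edge exists.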

@wking
Member Author

wking commented Sep 10, 2019

  • updates-metadata: describes (cacheable) edges in the graph

How does this represent edges? It has a releases array, but each entry in that array only mentions one version. Is that the to version? What is the from version?

@lucab
Contributor

lucab commented Sep 10, 2019

Correct, a rollout only specifies the to version.
The from set is composed of all the releases that come before it (in the release-index array) and are not deadends.
(There is another special case for enforcing barriers/chokepoints, but we haven't finalized/tested it yet).
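The implicit scheme can be sketched as: every release earlier in the index that is not a dead-end gets an edge to each later release. A hypothetical helper that ignores the unfinished barrier/chokepoint case:

```python
def implicit_edges(release_index, dead_ends=()):
    """Derive edges when each rollout names only its target ("to") release.

    release_index: releases in order, oldest first (like FCOS's release-index array).
    dead_ends: releases nobody may update away from.
    """
    dead_ends = set(dead_ends)
    edges = []
    for i, to_version in enumerate(release_index):
        # The "from" set: all earlier releases that are not dead-ends.
        for from_version in release_index[:i]:
            if from_version not in dead_ends:
                edges.append((from_version, to_version))
    return edges
```

For example, `implicit_edges(["A", "B", "C"], dead_ends={"B"})` returns `[("A", "B"), ("A", "C")]`: users can still land on B, but nothing leaves it.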

You can observe the current FCOS testing graph here:

  curl -H 'Accept: application/json' 'https://updates.coreos.stg.fedoraproject.org/v1/graph?basearch=x86_64&stream=testing&rollout_wariness=0'

Note: all of these are moving pieces as of today; full details will be in a public repo once settled.
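The response to that query is a Cincinnati graph document: a `nodes` array (each entry carrying a `version`) and `edges` given as `[from_index, to_index]` pairs. A small standard-library sketch for turning it into readable version pairs; `fetch_graph` and `edge_versions` are hypothetical helper names, and the staging URL may not still be live.

```python
import json
import urllib.request


def fetch_graph(url):
    """GET a Cincinnati graph document, asking for JSON as the curl example does."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def edge_versions(graph):
    """Turn index-based edges ([from_index, to_index]) into (from, to) version pairs."""
    versions = [node["version"] for node in graph["nodes"]]
    return [(versions[src], versions[dst]) for src, dst in graph["edges"]]
```

For example, `for a, b in edge_versions(fetch_graph(url)): print(a, "->", b)` with the URL from the curl command.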

@lucab
Contributor

lucab commented Sep 10, 2019

Additionally, we just triggered a rollout spread over 24h; the procedure is in coreos/fedora-coreos-streams#17.

@wking
Member Author

wking commented Sep 10, 2019

The from set is composed of all the releases that come before it (in the release-index array) and are not deadends.

Say a user upgrades from A->B, and then some flaw is discovered in B and you mark it a dead-end. How do they get off? With explicit from/to you can tune this sort of thing directly, and tooling can automate the tedium (e.g. see 1dafd8c).

Luke pushed this up to Quay.  Generated with:

  $ hack/graph-util.py update-node quay.io/openshift-release-dev/ocp-release:4.2.0-rc.0

Generated with:

  $ mkdir channels/prerelease-4.2
  $ ln -rs nodes/4.2/4.2.0-rc.0.json channels/prerelease-4.2/
  $ hack/graph-util.py extract-edges 4.2.0-rc.0

Luke pushed this up to Quay earlier today.  Generated with:

  $ hack/graph-util.py update-node quay.io/openshift-release-dev/ocp-release:4.1.19

We aren't going to tag it into 4.1.z because it failed its
4.2.0-rc.3->4.1.19 promotion gate [1,2].  Possibly because the
Kubernetes API server melted down:

  $ ERRORS="$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8217/artifacts/e2e-aws-upgrade/pods/openshift-cluster-version_cluster-version-operator-7c5476564-hznw4_cluster-version-operator.log)"
  $ echo "${ERRORS}" | grep 'server is currently unable to handle the request' | head -n2
  E1010 16:01:57.406046       1 memcache.go:135] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
  E1010 16:09:57.352725       1 memcache.go:135] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
  $ echo "${ERRORS}" | grep 'server is currently unable to handle the request' | tail -n2
  E1010 17:17:57.303418       1 memcache.go:135] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
  E1010 17:17:57.352110       1 memcache.go:135] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
  $ echo "${ERRORS}" | grep 'server is currently unable to handle the request' | wc -l
  255

I think this is just a flake in the 4.2->4.1 rollback, but we're going
to skip the release just in case.

[1]: https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.1.19
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8217
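The same triage can be scripted so the outage window is explicit rather than read off `head`/`tail`. A sketch assuming the klog line prefix seen above (severity letter, `MMDD`, then `HH:MM:SS.micros`); `summarize_unavailable` is a hypothetical name:

```python
import re

# klog-style prefix: severity letter, MMDD, then HH:MM:SS.microseconds.
KLOG_PREFIX = re.compile(r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)")
NEEDLE = "the server is currently unable to handle the request"


def summarize_unavailable(log_text):
    """Count matching lines and report the first/last timestamps, like the grep pipeline above."""
    stamps = []
    for line in log_text.splitlines():
        if NEEDLE not in line:
            continue
        match = KLOG_PREFIX.match(line.strip())
        stamps.append(match.group(3) if match else None)
    return {
        "count": len(stamps),
        "first": stamps[0] if stamps else None,
        "last": stamps[-1] if stamps else None,
    }
```

On the lines shown above, the first and last matches are at 16:01:57 and 17:17:57, so the API server was intermittently unavailable for about 76 minutes.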
Luke pushed this up to Quay on the 4th.  Generated with:

  $ hack/graph-util.py update-node quay.io/openshift-release-dev/ocp-release:4.2.0-rc.1

Avoid:

  $ hack/graph-util.py extract-edges 4.2.0-rc.1
  Traceback (most recent call last):
    File "hack/graph-util.py", line 360, in <module>
      extract_edges_for_versions(directory='.', versions=args.version)
    File "hack/graph-util.py", line 319, in extract_edges_for_versions
      extract_edges(node=node, directory=os.path.join(directory, 'edges', '{major}.{minor}'.format(**match.groupdict()), version))
    File "hack/graph-util.py", line 303, in extract_edges
      with open(os.path.join(directory, '{}.json'.format(previous)), 'w+') as f:
  IOError: [Errno 2] No such file or directory: './edges/4.2/4.2.0-rc.1/4.1.18.json'
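That `IOError` is `open(..., 'w+')` failing because the `edges/4.2/4.2.0-rc.1/` directory did not exist yet; opening a file for writing never creates missing parent directories. One way to avoid it is to create them before writing. A Python 3 sketch: the traceback confirms graph-util.py builds `'{major}.{minor}'` from a regex match's `groupdict()`, but the exact pattern, the `write_edge` name, and the JSON content are assumptions.

```python
import json
import os
import re

# Hypothetical version pattern with the named groups the traceback relies on.
VERSION = re.compile(r"^(?P<major>\d+)\.(?P<minor>\d+)\.")


def write_edge(directory, version, previous, content):
    """Write edges/<major>.<minor>/<version>/<previous>.json, creating parents first."""
    match = VERSION.match(version)
    if not match:
        raise ValueError("unrecognized version {!r}".format(version))
    edge_dir = os.path.join(
        directory, "edges", "{major}.{minor}".format(**match.groupdict()), version)
    os.makedirs(edge_dir, exist_ok=True)  # no-op if the directory already exists
    path = os.path.join(edge_dir, "{}.json".format(previous))
    with open(path, "w") as f:
        json.dump(content, f, indent=2, sort_keys=True)
        f.write("\n")
    return path
```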
Generated with:

  $ ln -rs nodes/4.2/4.2.0-rc.1.json channels/prerelease-4.2/
  $ hack/graph-util.py extract-edges 4.2.0-rc.1

Luke pushed this up to Quay on the 8th.  Generated with:

  $ hack/graph-util.py update-node quay.io/openshift-release-dev/ocp-release:4.2.0-rc.2

But we're not going to tag it into 4.2.z, because its
machine-os-content was from a failed build, the RHCOS pipeline will
recycle its tag for its next build, and the machine-os-content
referenced from rc.2 will be garbage-collected.

Luke pushed this up to Quay on the 9th.  Generated with:

  $ hack/graph-util.py update-node quay.io/openshift-release-dev/ocp-release:4.2.0-rc.3

But that hit [1], so we're not going to tag it into 4.2.z.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1760103

Luke pushed this up to Quay on the 10th.  Generated with:

  $ hack/graph-util.py update-node quay.io/openshift-release-dev/ocp-release:4.2.0-rc.4

But that hit [1], so we're not going to tag it into 4.2.z.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1721583

Luke pushed this up to Quay a few hours ago.  Generated with:

  $ hack/graph-util.py update-node quay.io/openshift-release-dev/ocp-release:4.2.0-rc.5

Only mark edges into channels to which both the previous and next
nodes belong.  Also skip edges where the previous node doesn't exist
at all.
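The rule in that commit can be sketched as a set intersection: an edge belongs to exactly the channels both of its nodes belong to, and is skipped if its previous node is unknown. A hypothetical helper (not graph-util.py's code) that also drops edges with no shared channel:

```python
def channel_edges(edges, node_channels):
    """Assign each edge the channels shared by both endpoints.

    edges: iterable of (previous, next) version pairs.
    node_channels: {version: set of channel names} for nodes that exist.
    """
    result = {}
    for previous, next_version in edges:
        if previous not in node_channels:
            continue  # the previous node doesn't exist at all
        shared = node_channels[previous] & node_channels.get(next_version, set())
        if shared:
            result[(previous, next_version)] = sorted(shared)
    return result
```

With 4.1.18 in candidate-4.2 but not prerelease-4.2, as in the commits here, 4.1.18->4.2.0-rc.1 gets no channel while 4.1.18->4.2.0-rc.5 lands in candidate-4.2.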
4.1.18 is not in prerelease-4.2.  Pick up the fix from the previous
graph-util commit.  Generated with:

  $ hack/graph-util.py extract-edges 4.2.0-rc.1

Also tagging 4.2.0-rc.5 into prerelease-4.2.

Generated with:

  $ mkdir channels/candidate-4.2
  $ ln -rs nodes/4.2/4.2.0-rc.5.json channels/prerelease-4.2/
  $ ln -rs nodes/4.2/4.2.0-rc.5.json channels/candidate-4.2/
  $ ln -rs nodes/4.1/4.1.18.json channels/candidate-4.2/
  $ hack/graph-util.py extract-edges 4.2.0-rc.5

Because:

  edge channels [u'prerelease-4.2'] for 4.1.18->4.2.0-rc.1 differ from node channels []

is more actionable than the previous:

  edge channels [u'prerelease-4.2'] differ from node channels []
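A sketch of the more actionable check, assuming "node channels" means the channels both endpoints share (the `check_edge` name and signature are hypothetical). Under Python 3 the lists print without the `u''` prefixes shown above:

```python
def check_edge(previous, next_version, edge_channels, node_channels):
    """Raise if an edge's recorded channels differ from what its nodes imply.

    edge_channels: channels recorded for the previous->next edge.
    node_channels: {version: set of channels}; the expected edge channels are
    the channels both nodes belong to.
    """
    expected = sorted(
        node_channels.get(previous, set()) & node_channels.get(next_version, set()))
    if sorted(edge_channels) != expected:
        # Naming both endpoints tells the reader which edge file to fix.
        raise ValueError(
            "edge channels {} for {}->{} differ from node channels {}".format(
                sorted(edge_channels), previous, next_version, expected))
```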
Luke pushed this up to Quay a few hours ago.  Generated with:

  $ hack/graph-util.py update-node quay.io/openshift-release-dev/ocp-release:4.1.20

It failed its upgrade from 4.2.0-rc.5 [1], but we may decide to
promote it into channels anyway (at least the 4.1 channels).

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8325

Luke pushed this up to Quay 27 hours ago.  Generated with:

  $ hack/graph-util.py update-node quay.io/openshift-release-dev/ocp-release:4.2.0

Generated with:

  $ ln -rs nodes/4.1/4.1.20.json channels/prerelease-4.1/
  $ ln -rs nodes/4.1/4.1.20.json channels/stable-4.1/
  $ hack/graph-util.py extract-edges 4.1.20

Generated with:

  $ mkdir channels/fast-4.2 channels/stable-4.2
  $ ln -rs nodes/4.2/4.2.0.json channels/candidate-4.2/
  $ ln -rs nodes/4.2/4.2.0.json channels/fast-4.2/
  $ ln -rs nodes/4.2/4.2.0.json channels/stable-4.2/
  $ hack/graph-util.py extract-edges 4.2.0

Thiago pushed this up to Quay yesterday.  Generated with:

  $ hack/graph-util.py update-node quay.io/openshift-release-dev/ocp-release:4.1.21

Thiago pushed this up to Quay yesterday.  Generated with:

  $ hack/graph-util.py update-node quay.io/openshift-release-dev/ocp-release:4.2.1

Thiago pushed this up to Quay today, when Jessica noticed that we left
4.2.0->4.2.1 out of the 4.2.1 metadata.  4.2.2 has the same payload
excepting the fixed previous-version metadata.  Generated with:

  $ hack/graph-util.py update-node quay.io/openshift-release-dev/ocp-release:4.2.2

Generated with:

  $ ln -rs nodes/4.1/4.1.21.json channels/prerelease-4.1/
  $ hack/graph-util.py extract-edges 4.1.21

Generated with:

  $ ln -rs nodes/4.2/4.2.1.json channels/prerelease-4.2/
  $ hack/graph-util.py extract-edges 4.2.1

We left 4.2.0->4.2.1 off the 4.2.1 metadata by mistake, and fixed it by
cutting 4.2.2 as described in c9e45b0 (nodes/4.2/4.2.2: Add a new
release, 2019-10-25).

Generated with:

  $ ln -rs nodes/4.2/4.2.2.json channels/prerelease-4.2/
  $ hack/graph-util.py extract-edges 4.2.2

wking added a commit to wking/cincinnati-graph-data that referenced this pull request Dec 6, 2019
Each time before updating Quay.  This is slow (tens of seconds), but
means release admins don't have to bother tracking the nodes and their
default edges in this repository (personally, I'd rather track nodes
and edges [1], but I'm not a release admin ;).  The delay is mitigated
by keeping a local cache (outside of version control) of the node
metadata.
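A sketch of such a cache, assuming node metadata is keyed by release image digest: because a digest names immutable content, a cached entry never goes stale, so only new releases pay the slow lookup. The `cached_node_metadata` name, the cache file naming, and the `fetch` callback are hypothetical:

```python
import json
import os


def cached_node_metadata(cache_dir, digest, fetch):
    """Return node metadata for a release digest, calling fetch only on a cache miss.

    cache_dir: local directory kept outside version control.
    digest: e.g. "sha256:..." for the release image; used as the cache key.
    fetch: callable(digest) -> dict that does the slow registry lookup.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, digest.replace(":", "-") + ".json")
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        pass
    metadata = fetch(digest)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(metadata, f)
    os.replace(tmp, path)  # rename is atomic on POSIX, so readers never see a partial file
    return metadata
```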

[1]: openshift#1
wking merged commit 99736a0 into openshift:master Jan 3, 2020
wking deleted the initial-workflow branch January 3, 2020 19:53
Labels
size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files)
5 participants