hab pkg download file format is insufficiently expressive. #7057
Comments
We want a format that is
Probably the simplest option would be to use TOML; we could do something like:
The '[[section]]' is a little clunky, but necessary (I think) to get a list of hash entries. The actual keyword 'section' is arbitrary, and perhaps a better one could be chosen. It still seems likely to cause occasional user confusion. The advantage of this is that we already use TOML files as part of Habitat plans, so users are likely to be familiar with that format. The downside is that TOML isn't as mainstream as YAML, and the structure we want (a list of tables) is a little clunky to express in TOML, because the 'root' of a TOML file is a map, and we probably want a list. A similar file in YAML would look like a list of maps
A simple file could be just a map
YAML provides a more flexible format, and at least appears more concise and lacks the somewhat arbitrary `[[section]]` constructs. However, it has us introducing a new file format into Habitat, with some usability costs of its own. Alternatively, we could come up with our own DSL to express this. I have trouble seeing this being worth the effort, unless there were a huge usability gain and we had great confidence that we were going to want this workflow long term. |
I think there is probably a use case that I am not considering, but I am having difficulty understanding why a download file would want multiple targets. I would think it safe to assume that the target should match the architecture calling the command. If that machine downloaded harts for other architectures, what could it do with them? Perhaps the use case is to be able to download a "bundle" for a particular target that will be deployed to other supervisors of that target architecture, but one may perform the download from any target. Is that correct? |
There is a use case around grabbing packages for on-prem-builders, which may be behind an air-gap or firewall. The architecture(s) running hab behind the firewall may be different than the architecture that has access for download. For example, a user may have a mixed linux and windows environment, and will want to be able to grab packages for both in one go. |
Got it! Thanks. I personally prefer TOML since it is more ubiquitous in Habitat and so far we don't use YAML, but I don't feel incredibly strongly about this either. If going with TOML, we could use the target as the table name. |
Discussion note: had a conversation with @markan about the idea. We prefer to stick to using TOML because both Automate 1 and 2 use it as well, besides Habitat. The ideas below are not valid TOML, and we will think more about how to make the naming of the table both human- and machine-readable. Throwing out some ideas for discussion. I don't have strong opinions on the file format; my idea is mostly around how the information is structured. The comments could use more work. Here is a template example if they are in the TOML format:
If the list is fully defined, it could look like this:
|
A few things from the discussion with @apriofrost: it seems like it might be nice to have a more hierarchical structure, probably some nesting of architecture, channel, package. The TOML example I provided is relatively flat. Perhaps we want something with a nesting structure more like:
Another issue is whether we want to make default assumptions as to channel; should we assume stable unless it is specified? Do we want to version this file format? Probably would have something like |
One possibility would be to make the nesting implicit:
The chief advantage is that the file format is very simple. I see a few downsides:
|
While I can see some benefits of hierarchy, it seems like a flatter structure does not lose much but gains back more in clarity:
Also I like the idea of making |
I'm wondering how an end-user would really think about this information. Would the sysadmin or developer ask "What architectures have which packages installed" or "What packages are installed for which architectures"? I don't have the answer. Individual lists for each architecture potentially have significant redundancy in the file, which means a lot of room for typos, bad copy-pasta, and the package lists could get hard to read, which means that users could easily scan
Or is an end-user more likely to think of the information:
|
@kagarmoe good call! There is currently a download file, which is user-defined and serves as an instruction for Habitat to decide exactly which packages to download. There will be another manifest file (temporary name), which is Habitat-generated and serves as output showing users exactly which packages were downloaded. In my current understanding, the download file format serves these major functions: How could we help users easily define the packages they want to download from the target Builder? It has to be flexible to cover the use cases where:
For the manifest file, it could reuse the same format as the download file, or it could have its own format. We probably need to weigh the effort of creating two separate file formats vs. satisfying different needs from different use cases. The manifest file format serves these major functions: How could we help users easily understand which packages they have downloaded from the target Builder? I think this question is more related to the example you demonstrated. I don't have a good answer to this question now without a good understanding of the users' perspective. For the download file format, it reminded me of a time when I was trying to redesign the Currently the From talking with @smacfarlane and others, my understanding is that when looking for a package, users are most interested in the platform and the release. For example, they want to look for the "latest Linux packages" for x package. So I'd support using the platform name as the primary grouping mechanism for the download file since it resembles the use case of "looking for a package". Food for thought! 😸 |
Another thing to note, is the channel probably belongs on an origin level, rather than target. You typically want to ensure that, within an origin, you're pulling from the same channel. If not, you're likely to end up with graph conflicts when you try to build. I would probably express this something like:
That makes the format more complex, but describing channel at the target level really makes my spideysense tingle. To echo @apriofrost, in my experience users are typically thinking of target as the top-level construct: "For linux, I need core/foo and core/bar, for windows I need core/bar, etc." In addition, I realize the above are just examples, but we want to be careful not to change our terminology. I see
|
Thinking through the comments above, a few thoughts
Here's a slightly different format proposal based on the above:
|
Some off the cuff thoughts to the above @markan
My preference would also be to not decouple origin from name, as that goes against how we typically describe idents*, and (depending on implementation of course) makes 3 harder to implement. I'd think that if we go that route, the only difference between the input and output of
|
@smacfarlane I'm not quite understanding why the packages from the same origin have to come from the same channel, could you explain a bit more? :) In the That means the channels where the artifacts are downloaded from the target Builder specified in For your point about the graph conflicts when building, does it only matter in the target builder specified in the |
Had a bunch of discussions online and by Zoom today, trying to sum up where we left off.
Versioning: This file format may well need to change in the future, and we should
Traceability: It would eventually be helpful to produce some information on the
Expressiveness: The full combinatorial space is required right now. Some seed lists An important nuance of the current implementation of This provides completeness, at the cost of excess downloads, and After some discussion, allowing combinatorial specifiers (this is the
Structure and hierarchy: Much of this discussion boils down to whether we have a flat The flat structure ends up having an array of tables specifying target, For hierarchical structures, so far we've had target, channel, origin, @smacfarlane has convinced me that origin makes little sense. It A package-centric approach would tend to be verbose; we usually have Target would work fine; there's not much overlap between package names
Channel could also work as a starting point. The disadvantage is that That issue also arises if we choose some hierarchical scheme where we
Use of the file format for package manifests: None of the formats would preclude using them as package manifest That would work to direct an upload; the bulk upload command could That would also work to allow a manifest to be used as a seed One interesting subcase is using a list of manifests as a 'filter', to One could imagine a manifest format that provides more detail than the While this is worth thinking on, it's probably best to defer this for
Potential formats: In the end I think this reduces us to two possible formats for v1, with Simple format: a flat file with no groupings, just a channel, target, One of the following: 'flat' grouping: A grouping tag 'section' would separate a set of channel, target, Universal to all formats:
|
Thumbs up/down if you think this should be included or not. |
Thumbs up/down if you think this should be the format or not. |
Thumbs up/down if you think this should be the format or not. |
From my perspective, I think of artifacts in the following top down hierarchy:
Therefore, I think your format suggestion here is 🚀 Considering that format, it will be important for the user to understand that if they specify multiple tables of the same architecture, double brackets |
@apriofrost Coming from the build side, consistency in the source channel is important to avoid dependency graph errors on builds. In the context of this flow, I still think mixing channels is a bad idea, and I would also advise users on upload to ensure that the groups they download move together on the consuming end. After talking to @markan, though, I do see the use for mixing channels. When packages in the channel are "leaf" packages, i.e. nothing depends on them and they are, at most, going to be wrapped once, then it's a relatively safe operation, though it still makes me a bit nervous. Specifically with the |
The current file format (package idents separated by newlines) isn't expressive enough to easily treat as an independent artifact.
Many origins have different channel names and promotion policies than core. When building a 'starter list' of packages, such as we are doing in #6902, we end up wanting to include multiple origins and channels. For example we pretty much always want stable packages from core, but might accept unstable from effortless or other third party origins.
If the wrong channel is used, many packages may not be at the right versions, or even found, and so a list of package idents isn't really complete without a channel specified. Unfortunately, the channel can only be specified on the command line, and applies to all the files provided. There isn't a way to specify multiple channels in the same input file, or designate the channel as part of the file.
A similar problem exists for target architecture: again, packages might not exist for all architectures, and so the input file is in practice architecture-specific as well.
So sharing 'starter list' files divorced from the context of channel and target is likely to be error prone, and in some cases tedious, as the download command will need to be run multiple times.
One possibility would be to stay with the simple text file format but expand the ident syntax to allow some sort of qualifier. Something like
/ORIGIN/NAME?target=TARGET&channel=CHANNEL
might work, but there's almost certainly a better UX there. We would have a kludgy workaround if we were able to find a fully qualified ident without needing the channel ident, but currently we only look in the current channel even if the package is otherwise fully specified (see #7039). It might make more sense to switch to a human-friendly structured text file format. Most likely you would want to represent a list of tuples {target, channel, [package_idents]} or a list of hashes {target: TARGET, channel: CHANNEL, package_idents: [ALL_THE_IDENTS]}.
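For reference, the qualifier-style ident syntax floated above could be parsed with very little code, since it borrows URL query-string conventions. This is only a sketch: `QualifiedIdent` and `parse_qualified_ident` are hypothetical names, and treating a missing `target` or `channel` as `None` is an assumption about how defaults might be handled.

```python
from typing import NamedTuple, Optional
from urllib.parse import parse_qs, urlsplit


class QualifiedIdent(NamedTuple):
    """Hypothetical record for one line of the expanded text format."""
    ident: str
    target: Optional[str]
    channel: Optional[str]


def parse_qualified_ident(line: str) -> QualifiedIdent:
    """Split '/ORIGIN/NAME?target=TARGET&channel=CHANNEL' into its parts."""
    parts = urlsplit(line.strip())
    # Drop leading and trailing slashes so '/core/redis' becomes 'core/redis'.
    ident = parts.path.strip("/")
    query = parse_qs(parts.query)
    # parse_qs maps each key to a list of values; take the first if present.
    target = query.get("target", [None])[0]
    channel = query.get("channel", [None])[0]
    return QualifiedIdent(ident, target, channel)


example = parse_qualified_ident("/core/redis?target=x86_64-linux&channel=stable")
print(example)
```

A plain ident with no qualifiers still parses, which would keep existing newline-separated files valid under this scheme.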
JSON is quite expressive, but the mainstream varieties don't have comments.
TOML is less expressive (my initial attempt was pretty clunky), but is commonly used in habitat, and allows comments.
YAML is popular, expressive and allows comments, but currently isn't used inside habitat, and might be too much.
ASN.1 ... just kidding.
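As a rough illustration of the list-of-hashes shape described above, here is how it might round-trip through JSON and flatten into individual download requests. The field names follow the `{target, channel, package_idents}` sketch from the text; the concrete values and the `flatten` helper are assumptions for illustration.

```python
import json

# Example seed list using the {target, channel, package_idents} shape.
# The targets, channels, and idents are illustrative values only.
seed = [
    {
        "target": "x86_64-linux",
        "channel": "stable",
        "package_idents": ["core/redis", "core/nginx"],
    },
    {
        "target": "x86_64-windows",
        "channel": "unstable",
        "package_idents": ["core/nginx"],
    },
]


def flatten(entries):
    """Expand each entry into (target, channel, ident) tuples."""
    return [
        (entry["target"], entry["channel"], ident)
        for entry in entries
        for ident in entry["package_idents"]
    ]


# JSON can carry the structure, but there is no place to put comments.
text = json.dumps(seed, indent=2)
downloads = flatten(json.loads(text))
for target, channel, ident in downloads:
    print(target, channel, ident)
```

The same in-memory structure could be produced from TOML or YAML, so the choice between the formats is mostly about how the file reads to a person.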