Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clear Architecture field in platform constraint for arm architectures #34021

Merged

Conversation

@nishanttotla
Copy link
Contributor

commented Jul 7, 2017

This is a potential fix for docker/swarmkit#2294

This is a temporary requirement.

Also related: opencontainers/image-spec#661

@nishanttotla nishanttotla force-pushed the nishanttotla:dont-set-architecture-constraint branch from 3cbd838 to 444fdca Jul 7, 2017

@aaronlehmann

This comment has been minimized.

Copy link
Contributor

commented Jul 7, 2017

I wonder if we could append all known ARM variants to platforms in this case.

@nishanttotla

This comment has been minimized.

Copy link
Contributor Author

commented Jul 7, 2017

Ping @stevvooe

@nishanttotla

This comment has been minimized.

Copy link
Contributor Author

commented Jul 7, 2017

@aaronlehmann based on the discussion in the SwarmKit issue, I don't think we should be doing that. This will eventually be fixed for images.

@aaronlehmann

This comment has been minimized.

Copy link
Contributor

commented Jul 7, 2017

Wouldn't this PR cause ARM images to be scheduled on x86 and other platforms? I suppose it's better than not working at all, but I'm wondering if we can do better.

@nishanttotla

This comment has been minimized.

Copy link
Contributor Author

commented Jul 7, 2017

@aaronlehmann yes, that will happen. I think until the manifest is fixed, it is better to err on the side of putting lesser constraints.

@stevvooe

This comment has been minimized.

Copy link
Contributor

commented Jul 7, 2017

This fix needs to happen server-side:

  1. If the platform comes from a manifest list or oci index, it is fine.
  2. If it has to be resolved from the image config, it should undergo similar logic.

Wouldn't this PR cause ARM images to be scheduled on x86 and other platforms? I suppose it's better than not working at all, but I'm wondering if we can do better.

Yes, but right now, we have no way of resolving which version of arm an image belongs to.

@nishanttotla nishanttotla force-pushed the nishanttotla:dont-set-architecture-constraint branch from 444fdca to 5fa6df3 Jul 10, 2017

@estesp

This comment has been minimized.

Copy link
Contributor

commented Jul 11, 2017

It is strange that I'm just learning this, but a weird mismatch that we actually query the Linux UTS subsystem for the daemon's architecture setting, but all image handling is done with GOOS and GOARCH. It was confusing to me that an image was trying to match on something that Go doesn't even acknowledge (armv7l) which to me is more the root of the problem than the well understood issue with ARM variants.

Seems we need to at some point resolve matching to all use GOOS/GOARCH (with the added variant processing that is already available in OCI and also available when building a manifest list) so that we don't have some resolver of a bunch of random distro uname choices back to something "understood".

@thaJeztah

This comment has been minimized.

Copy link
Member

commented Jul 12, 2017

I agree with all of the above; we should fix the current situation, because it's a regression from 17.03;

  • use "OS", if it's a known value
  • use architecture, if it's a known value
  • if not, ignore, and use the same behaviour as docker 17.03

Wouldn't this PR cause ARM images to be scheduled on x86 and other platforms? I suppose it's better than not working at all, but I'm wondering if we can do better.

This is what 17.03 did, so not a regression. Mixed architecture Swarm clusters are probably rare, and if someone currently uses a mixed architecture (or operating system) cluster, they probably use constraints already to influence scheduling.

In the reporter's case docker/swarmkit#2294, things worked fine in the old situation; the service used an image specific to the architecture, and could just be deployed.

@justincormack

This comment has been minimized.

Copy link
Contributor

commented Jul 12, 2017

@estesp

This comment has been minimized.

Copy link
Contributor

commented Jul 12, 2017

@justincormack true, the connection to Go is not to take their view of OS/architecture as gospel, but realize that as Go programs, we have a few options for determining our current runtime architecture and OS; either ask the Go runtime, or try to divine it from Linux syscalls/environment data.

FWIW, IMHO, YMMV :) .. Go is a much more stable source of this information than Linux as distros have shown to be quite open to variance and finding their own way when it comes to embedded architectures and uname -a responses :)

To make this more concrete, there is an issue (I think it was over in the OCI spec repo) that lists about 60 different variants (strings) of arch-os tuples on ARM platforms running Linux. Go provides us 2: "arm" and "arm64". That leaves us having to find another method (using the proposed variant field in manifest list/OCI indexes) when building images for v6, v7, v8 specific ARM families, but at least it means we aren't trying to harmonize 60 different strings down to "arm" or "arm64" + a variant list of 3 items. Of course we will have to find a way to query the runtime "variant" when arch = {arm,arm64}, which IMO is the only missing piece, but that only applies to manifest list/index-based image resolving, not single manifest images which only provide us the two pieces of info: OS and arch.

If I haven't rambled on too long (don't answer that), another way to look at this bug is that you will never have a stable comparison between runtime.GOARCH and uname -a; and images are created using runtime.GOARCH but they are being compared to output from uname -a. It was already shown to be a problem in the old days because x86_64 (a Linux-ism) was being hardcoded in images, but runtime.GOARCH in Go is called "amd64" for the same platform.

@stevvooe

This comment has been minimized.

Copy link
Contributor

commented Jul 13, 2017

@justincormack UTS doesn't exactly help here. Right now, it returns armv7l in certain cases, which I cannot find any information about, anywhere. UTS (or gnu triples) seem identify a toolchain, which can be very different for identical architectures.

Go's approach is at least pragmatic: they try to define a real machine target.

If you can identify a standard approach that actually works, then, by all means, suggest one. Laying this on the footsteps of Go's toolchain is fairly dismissive.

@alexellis

This comment has been minimized.

Copy link
Contributor

commented Jul 27, 2017

I also tested 17.07.0-ce-rc1 this morning on armv7l and after upgrading from 17.05 (fully working with stacks and services) - I get 0/1 replicas after deploying a stack.

In addition there are no error messages on service inspect/ls/ps.

@andrewhsu

This comment has been minimized.

Copy link
Contributor

commented Jul 27, 2017

@alexellis fyi the code changes in this PR was not included in the 17.07.0-ce-rc1 release because it has not been merged yet. Curious to know if you ran your test with a custom compile of the 17.07.0-ce-rc1 codebase with this PR patch applied?

@alexellis

This comment has been minimized.

Copy link
Contributor

commented Jul 27, 2017

I think Stefan and myself tried the same install script from Eli @ Docker Inc via the #arm Slack channel-

curl -fsSL
https://raw.githubusercontent.com/seemethere/docker-install/update_for_170701rc1/install.sh
| sh
 Version:      17.07.0-ce-rc1
 API version:  1.31
 Go version:   go1.8.3
 Git commit:   8c4be39
 Built:        Wed Jul 26 21:00:23 2017
 OS/Arch:      linux/arm

All services from docker-compose.armhf.yml are pinned at 0/1 replicas.
https://github.com/alexellis/faas

DEBU[0116] no suitable node available for task module=node node.id=a82troq1z5gmjwnpdlw76k5v0 task.id=dcxqoaohw872432weilc9voc8

However with 17.05 all come out at 1/1 replicas.

 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:30:54 2017
 OS/Arch:      linux/arm
@thaJeztah

This comment has been minimized.

Copy link
Member

commented Jul 27, 2017

@alexellis what does

docker service inspect -f '{{json .Spec.TaskTemplate.Placement}}' <service-id>

show for those services?

@alexellis

This comment has been minimized.

Copy link
Contributor

commented Aug 11, 2017

{"Platforms":[{"Architecture":"arm","OS":"linux"}]}
@alexellis

This comment has been minimized.

Copy link
Contributor

commented Aug 11, 2017

@thaJeztah ... it's still broken in the following version
@andrewhsu

Client:
 Version:      17.07.0-ce-rc2
 API version:  1.31
 Go version:   go1.8.3
 Git commit:   36ce605
 Built:        Mon Aug  7 23:53:53 2017
 OS/Arch:      linux/arm

Server:
 Version:      17.07.0-ce-rc2
 API version:  1.31 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   36ce605
 Built:        Mon Aug  7 23:47:08 2017
 OS/Arch:      linux/arm
 Experimental: false
@justincormack

This comment has been minimized.

Copy link
Contributor

commented Aug 11, 2017

@alexellis nothing has changed since #34021 (comment) "fyi the code changes in this PR was not included in the 17.07.0-ce-rc1 release because it has not been merged yet. Curious to know if you ran your test with a custom compile of the 17.07.0-ce-rc1 codebase with this PR patch applied?"

Have you tested it? Does it fix the issue?

@alexellis

This comment has been minimized.

Copy link
Contributor

commented Aug 11, 2017

I interpreted the comment to mean "tell me what build you're using?" and I replied.

Do you have a "custom compile" of Docker CE available for testing?

@alexellis

This comment has been minimized.

Copy link
Contributor

commented Aug 20, 2017

I can confirm the fix works.. I deployed a full application (FaaS).

I tested it along with rallying some of the community to help with the ARM testing and opened additional issues in the docker-arm repo with the tag community-supports-docker.

alexellis/docker-arm#14 (comment)

@nishanttotla @estesp @StefanScherer

@vattybear

This comment has been minimized.

Copy link

commented Aug 20, 2017

Concur with @alexellis above - have similarly tested the fix today - commented with details on alexellis/docker-arm#17

@andrewhsu

This comment has been minimized.

Copy link
Contributor

commented Aug 31, 2017

Is this PR a suitable stepping-stone step in the interim?

@@ -123,12 +123,19 @@ func (s *distributionRouter) getDistributionInfo(ctx context.Context, w http.Res
if err == nil {
err := json.Unmarshal(configJSON, &platform)
if err == nil && (platform.OS != "" || platform.Architecture != "") {
if platform.Architecture == "arm" || platform.Architecture == "Arm" {

This comment has been minimized.

Copy link
@stevvooe

stevvooe Sep 1, 2017

Contributor

I was looking at this approach and doing it on the server side will create future problems for this endpoint. Clearing the architecture should be done on the client side, when setting up the constraint. That way, we can fix this in the future and still be able to support this version of docker.

@@ -98,6 +98,11 @@ func imageDigestAndPlatforms(ctx context.Context, cli DistributionAPIClient, ima
if len(distributionInspect.Platforms) > 0 {
platforms = make([]swarm.Platform, 0, len(distributionInspect.Platforms))
for _, p := range distributionInspect.Platforms {
// clear architecture field for arm. This is a temporary fix.
arch := p.Architecture
if arch == "arm" || arch == "Arm" {

This comment has been minimized.

Copy link
@cpuguy83

cpuguy83 Sep 1, 2017

Contributor

if stringsToLower(arch) == "arm"

This comment has been minimized.

Copy link
@cpuguy83

cpuguy83 Sep 1, 2017

Contributor

Also should this be HasPrefix?

This comment has been minimized.

Copy link
@nishanttotla

nishanttotla Sep 1, 2017

Author Contributor

In the images I've seen, it's only been "arm". We could use HasPrefix to be extra careful but I don't think it's necessary.

@@ -98,6 +98,11 @@ func imageDigestAndPlatforms(ctx context.Context, cli DistributionAPIClient, ima
if len(distributionInspect.Platforms) > 0 {
platforms = make([]swarm.Platform, 0, len(distributionInspect.Platforms))
for _, p := range distributionInspect.Platforms {
// clear architecture field for arm. This is a temporary fix.

This comment has been minimized.

Copy link
@cpuguy83

cpuguy83 Sep 1, 2017

Contributor

Can you add more detail here as to why it's being cleared?

Clear Architecture field in platform constraint for arm architectures
Signed-off-by: Nishant Totla <nishanttotla@gmail.com>

@nishanttotla nishanttotla force-pushed the nishanttotla:dont-set-architecture-constraint branch from f172404 to 772af60 Sep 1, 2017

@nishanttotla

This comment has been minimized.

Copy link
Contributor Author

commented Sep 1, 2017

@cpuguy83 addressed your comments.

@cpuguy83

This comment has been minimized.

Copy link
Contributor

commented Sep 1, 2017

Lint errors are being dealt with here: #34706

@nishanttotla

This comment has been minimized.

Copy link
Contributor Author

commented Sep 1, 2017

This commit will also need vendoring into the docker/cli.

@cpuguy83
Copy link
Contributor

left a comment

LGTM

@andrewhsu

This comment has been minimized.

Copy link
Contributor

commented Sep 5, 2017

Is this one ready to go or are there any other types of verification that is needed before merging?

@vdemeester
Copy link
Member

left a comment

LGTM 🐮

@vdemeester vdemeester merged commit ea220e7 into moby:master Sep 7, 2017

5 of 6 checks passed

windowsRS1 Jenkins build Docker-PRs-WoW-RS1 16712 has failed
Details
dco-signed All commits are signed
experimental Jenkins build Docker-PRs-experimental 36605 has succeeded
Details
janky Jenkins build Docker-PRs 45244 has succeeded
Details
powerpc Jenkins build Docker-PRs-powerpc 5635 has succeeded
Details
z Jenkins build Docker-PRs-s390x 5411 has succeeded
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.