
Multi-stage Build Issues #4246

Open
WhisperingChaos opened this issue May 15, 2017 · 13 comments
Comments

@WhisperingChaos

WhisperingChaos commented May 15, 2017

TL;DR

The current semantics of --from intrinsically induce pathological coupling between build stages. Its intimate binding to build stage implementation opposes the principle of encapsulation necessary to permit reuse, as well as to reason, in isolation, about an individual stage's behavior. By defeating encapsulation, --from thwarts current Dockerfile reuse features, such as ONBUILD, and inhibits the introduction of future reuse mechanisms.

To avoid the harmful traits associated with --from, the existing Build Context abstraction should be adapted so its content can be extended by mounting a stage's image file path into it, instead of introducing the new stage/image reference concept to Dockerfile development. By extending the existing Build Context abstraction's content and introducing a mapping mechanism for it, the --from syntax can be eliminated, current reuse features restored, and the introduction of new reuse mechanisms unencumbered.

TOC

Issue: Tight, Pathological Coupling

The design of --from ensures the COPY instruction tightly couples itself to the implementation of other build stages. Tight coupling results from --from’s purposely crafted facility to directly reference artifacts of other build stages, within a given Dockerfile, by stage names/positions and their physical locations (paths) in those other images.

This pathological coupling, encouraging the internals of any build stage to intimately bind themselves to any other stage within a Dockerfile, eliminates the interface boundary between stages. This absence of an interface boundary negates encapsulation, prohibiting human developers and algorithms from considering an individual build stage as a “black box” when defining or analyzing its behavior.

The issue expresses itself by:

  • Increasing the difficulty of implementing future features that encourage Dockerfile reuse, due to the absence of encapsulation, and discouraging the use of existing ones (ONBUILD).
  • Dramatically increasing the amount of code a developer must write by hand, because existing "boilerplate" code cannot be reused due to its direct, rigid binding to a particular artifact (file) instance.
  • Amplifying the impact of simple changes: renaming a directory containing a set of artifacts, or inserting/removing a build stage, can ripple through the entire set of Dockerfile commands that reference that directory or build stage.
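The ripple effect can be seen in a minimal sketch (hypothetical stage name and paths, assuming the current COPY --from semantics): renaming the stage or relocating its output forces edits to every downstream reference.

```dockerfile
# Hypothetical two-stage Dockerfile illustrating the coupling.
FROM golang:1.7 AS builder
COPY . /go/src/app
RUN go build -o /go/bin/app app

FROM alpine
# Bound to both the stage name "builder" and its internal layout;
# renaming the stage or moving /go/bin/app breaks this instruction.
COPY --from=builder /go/bin/app /usr/local/bin/app
```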

Issue: Precludes ONBUILD Trigger Support

ONBUILD trigger support enables a developer to declaratively encode an image’s transform behavior: the operations responsible for converting a set of input artifacts to output ones. This declarative code includes a specification of an input interface followed by command(s) that execute a transform. The input interface definition emerges from the union of the source file artifact (directory/filename) references specified by the triggered ADD/COPY Dockerfile commands and is statically defined during the construction of the ONBUILD image, while the transform consists of one or more RUN commands.

Example

Create a golang compiler image that executes ONBUILD commands to automatically produce a golang executable image but not run it. Define the input interface, the path within the compiler image's Build Context from which golang source file(s) are copied, as /golang/app. Name the compiler image exgolang. Create the Dockerfile for this image by modifying a copy of the Docker Hub golang:1.7-onbuild image Dockerfile.

Dockerfile Contents:

FROM golang:1.7
RUN mkdir -p /go/src/app
WORKDIR /go/src/app
# Union the source argument of each COPY/ADD to determine the trigger's 'input interface'.
# Only one COPY instruction with a single source argument of '/golang/app'.  Therefore,
# this trigger's 'input interface' is '/golang/app'.
ONBUILD COPY /golang/app /go/src/app
ONBUILD RUN go-wrapper download
ONBUILD RUN go-wrapper install

To reuse the defined trigger behavior, simply encode a FROM statement that references the image name (FROM exgolang) configured with ONBUILD commands. By promoting the DRY principle, ONBUILD triggers dramatically increase an image’s build time utility, reliability, and adaptability while simultaneously eliminating or greatly decreasing the code required to employ this image in other Dockerfiles by other developers. Given this understanding, an ONBUILD trigger definition is remarkably akin to a function definition.

Example

Using the exgolang image created above, generate a golang server executable from source server.go located in /golang/app/.

Build Context

Dockerfile
golang/app/
  server.go

Dockerfile

FROM exgolang

Docker build command:

> docker build -t server .

The single-instruction Dockerfile above, when executed by docker build:

  • Copies golang source from the Build Context's /golang/app directory into the image directory /golang/app.
  • Downloads any dependent golang packages.
  • Runs the compiler, generating the executable file /go/bin/app from server.go; the executable resides in the resultant image's file system.

As described and demonstrated by example, images incorporating ONBUILD statements are analogous to function definitions. This similarity extends to the equivalence of an ONBUILD image's input interface to a function's parameter list. Just as statements within a function body bind to its parameters, an ONBUILD image's body (the series of ONBUILD statements) binds (couples) to the file paths referenced by each instruction. For example, the COPY issued by the trigger statement ONBUILD COPY /golang/app /go/src/app binds to the source file path /golang/app. This file path is equivalent to a parameter defined for a function and performs a similar role, as it represents an interface element. Given this equivalence, why isn't there a mapping mechanism, like the one implemented for functions, that maps arguments specified by an invocation statement to parameters?

When formulating ONBUILD support, the design avoided implementing an argument to parameter mapping mechanism on the trigger invocation statement: FROM. Although this mapping mechanism is intrinsic to function invocation, I speculate, at the time when trigger support was implemented, the multistage build feature was a distant, future consideration. Meanwhile, the limitation of a single stage Dockerfile masked this issue, as the Build Context could be structured to mirror the input interface required by a single stage's ONBUILD triggers. In other words, the Build Context file path (argument) names exactly match the (parameter) names required by the ONBUILD ADD/COPY instructions. However, introducing multistage builds starkly silhouettes the absence of an argument to parameter mapping mechanism.

Multistage support forces the once "elemental" Build Context, whose content and structure were dictated by the needs of a single FROM, to become a composite one that must comply with the dependencies of two or more FROM statements. Since the problems inherent in the transformation from an elemental to a composite Build Context not only diminish trigger support but also affect non-trigger statements that follow a FROM, their discussion occurs in the topic Issue: Ignores Aggregate Build Context below. Besides this issue of composite Build Contexts, the pathological coupling introduced by --from impedes applying ONBUILD triggers.

COPY trigger instructions are currently bound, at the time of their creation, to a Build Context file path. If COPY were to include --from, which stage name/position should it bind to, given that the stage name must resolve within the context of all other existing and future Dockerfiles? Unfortunately, without introducing another mechanism to rebind the source file path references specified by ONBUILD COPY instructions within the scope of their invocation, it's very difficult within a multistage Dockerfile to reuse existing trigger-enabled images once, let alone twice.
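The binding dilemma can be sketched with a hypothetical ONBUILD image (whether or not the builder would even accept such a trigger, the resolution question remains):

```dockerfile
# Hypothetical ONBUILD image whose trigger references a stage by name.
# Every downstream Dockerfile that says FROM this image would have to
# define a stage named "builder" containing /out -- the trigger has no
# way to rebind the stage name or path at its point of invocation.
FROM alpine
ONBUILD COPY --from=builder /out /input
```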

Issue: Ignores Aggregate Build Context

Since the Dockerfile semantics before incorporating multistage assumed a single FROM statement, the expected Build Context reflected only those source artifacts located in the directory structure required by the ADD/COPY commands immediately following FROM. Incorporating many FROM statements within a single Dockerfile requires a means to initially compose/aggregate the Build Context from the more elemental ones needed by each FROM, then partition this composite/aggregate to supply the specific (elemental) Build Context expected by an individual FROM (stage).

Example

Using the exgolang image created above, attempt to generate three golang server executables from an Aggregate Build Context. Note, issues related to partitioning the Aggregate Build Context are broadly applicable to any multistage Dockerfile without regard to its use of ONBUILD.

Build Context

Dockerfile
golang/app/
  server.go
golang/app2/
  server.go
golang/app3/
  server.go

Dockerfile

FROM exgolang
# the following stage will simply recompile golang/app/server.go instead of golang/app2/server.go
FROM exgolang
# the following stage will simply recompile golang/app/server.go instead of golang/app3/server.go
FROM exgolang

Docker build command:

/server > docker build -t servers .

Unfortunately, the multistage build design ignores these Aggregate Build Context issues, failing to provide a mechanism that both partitions and restructures the Aggregate Build Context to supply the elemental Build Context needed by a specific FROM. Therefore, executing the above docker build command copies the same golang source /server/golang/app/server.go into three distinct images, runs the compiler, and generates the same server executable, writing it to each image's /go/bin directory.

Additionally, when incorporating stages referencing ONBUILD triggers, current multistage Dockerfile support not only inhibits their use, but when "it works" the outcome can be dangerous, especially when the trigger assumes a Build Context interface of "." (the "everything interface"), as in COPY . /go/src. In this situation, the entire Aggregate Build Context would be accessible to any stage, thereby polluting an individual stage's source artifact set with artifacts from all other stages.
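The "everything interface" hazard reduces to a short sketch (hypothetical ONBUILD image):

```dockerfile
# Hypothetical ONBUILD image with an "everything interface".
FROM golang:1.7
ONBUILD WORKDIR /go/src
# Any stage reusing this image receives the ENTIRE Aggregate Build
# Context -- including every other stage's sources -- in /go/src.
ONBUILD COPY . /go/src
ONBUILD RUN go install ./...
```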

Issue: Complexity due to added Dockerfile abstractions

Any worthwhile program must apply coupling to map its abstractions to an implementation. However, it's important to minimize coupling whenever possible. One method to reduce coupling relies on limiting the abstractions required to only the essential ones applicable to realize the encoded algorithm's objective.

The purpose of a Dockerfile is to provide the scaffolding needed to deliver source artifact(s) to a transform that then produces output artifact(s). Since the transforms, executed by the RUN command, rely on reading and writing files within a file system, the source artifacts must eventually be mapped as files within a file system. Perhaps due to a desire to align with this necessity, the Build Context abstraction responsible for providing source artifacts was also designed to represent source artifacts as files within a file system. This design choice, matching the representation of the Build Context with the one required by the underlying transforms (files in a file system), resulted in Dockerfile commands, like COPY, whose syntax and behavior nearly mirror those of a corresponding OS command, such as cp, and facilitated Dockerfile adoption by leveraging a developer's existing understanding of it.

The introduction of COPY --from adds a new abstraction, the stage/image reference, to Dockerfile coding. This additional abstraction necessitated changing COPY's interface and weaving the resolution of stage/image references into its implementation so COPY's binding mechanisms could differentiate between Build Context and other stage/image sources. Besides adding some complexity to applying COPY, introducing the stage/image reference abstraction has implications for features that rely on COPY's behavior. When assessing these implications, one hopes for beneficial or neutral outcomes. However, in this situation, the rigid binding of --from to a particular stage/image precludes the use of COPY --from in any current reuse mechanism, such as ONBUILD, or any future one. This negative outcome not only prevents reuse mechanisms, like ONBUILD, from referencing other stages/images but also diminishes the utility of --from, as it can't be applied in all valid contexts of the COPY instruction.

An often cited strength of Unix-derivative OSes is their insistence on mapping various abstractions, like hard drives, IPC, ..., to a file. Therefore, instead of adding complexity by creating a corresponding concrete OS concept for each supported device/abstraction, which in many cases would only offer a slightly different interface, Unix designers mapped new abstractions (especially devices) to a single one: the file. Once mapped, the majority of the code written to manage/manipulate this single abstraction (the file) immediately applies to the new one. Since image/stage references are essentially file path references, perhaps, in lieu of explicitly exposing --from's stage/image reference abstraction, it should be mapped to an existing abstraction: the Build Context.

Recasting the stage/image references as file paths in the Build Context confers the following benefits:

  • Reduces complexity by eliminating the explicit stage/image reference abstraction and the --from option. COPY reverts to its prior, simpler syntax.
  • Limits artifact coupling to only Build Context file paths which existed before multi-stage support.
  • Existing or future mechanisms that apply to a Build Context within a Dockerfile, like partitioning, renaming, and restructuring, also immediately apply to artifacts contributed by other stages within a Dockerfile without writing additional code.
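Under this recast, a consuming stage would look like the following sketch, assuming a hypothetical mechanism (discussed later in the recommendations) has already mounted a prior stage's output into the Build Context at /final/bin:

```dockerfile
# Hypothetical: /final/bin was contributed to this stage's Build Context
# by an earlier stage's mount, so the plain, --from-free COPY suffices;
# no stage name or image-internal path appears here.
FROM alpine
COPY /final/bin/webserver /usr/local/bin/webserver
```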

Issue: Extra Build Stage & Redundant COPYing

If the objective of a multistage build is the creation of a single layer representing a runtime image, the current semantics of COPY --from require an extra build stage and redundant COPYing when the resultant build artifacts must be assembled from more than one build stage or image.

Example

Applying the current semantics of COPY --from, create a golang webserver whose stdout and stderr are redirected to a remote logging facility, as a single layer in the resulting image.
```
FROM golang:nanoserver as webserver
COPY /web /code
WORKDIR /code
RUN go build webserver.go

FROM golang:nanoserver as remotelogger
COPY /remotelogger /code
WORKDIR /code
RUN go build remotelogger.go

# extra build stage and physical copying, due to the semantics of COPY --from,
# in order to generate a single layer in the next build stage
FROM scratch as extra_redundant_copying
COPY --from=webserver    /code/webserver.exe    /redundant/webserver.exe
COPY --from=remotelogger /code/remotelogger.exe /redundant/remotelogger.exe
COPY /script/pipem.ps1 /redundant

FROM microsoft/nanoserver
COPY --from=extra_redundant_copying /redundant /
CMD ["\\pipem.ps1"]
EXPOSE 8080

```
The above situation generalizes to N extra build stages and X redundant copy operations when there's a desire to create a resultant image of N layers where each layer requires artifacts from more than a single stage.

Recommendations:

  • Eliminate direct coupling to artifacts within images from other build stages by removing --from as an option to COPY.
  • Support a mapping mechanism that partitions, restructures, and renames file paths defined in the Aggregate (Global) Build Context so the resulting mapped version matches the (Local) Build Context required by an individual stage. A mapping mechanism satisfying these qualities has already been proposed and explored by #12072. In a nutshell, the mechanism, implemented by the keyword CONTEXT, mounts the desired Aggregate Build Context file paths, similar to docker run -v option, into the Build Context created for an individual stage.
  • Support a mechanism to allow a build stage to extend the Aggregate Build Context with the output artifacts produced by that stage. Proposal #12415 offers a solution, MOUNT, that's analogous to CONTEXT. However, MOUNT mounts an image's file path into the Aggregate Build Context instead of mounting it into the stage's Local Build Context.

Applying the recommendations above, when compared to the currently implemented multistage design, would:

  • Promote encoding Dockerfiles with current and future reusable build mechanisms.
  • Seamlessly integrate with existing Dockerfile abstractions, such as Build Context and ONBUILD triggers.
  • Dramatically reduce the Dockerfile code required to reuse an image when building a new one.
  • Eliminate the necessity of encoding extra build stages and the overhead of redundant copying.
  • Foster the innately declarative mechanisms CONTEXT and MOUNT proposed in the issues referenced above.

Comparison: Current Multistage Design vs. Recommended

The examples below concretely contrast, through the encoding of the same scenario, the benefits offered by the recommended approach when compared to the existing multistage design.

Scenario

Using already available Docker Hub images, construct a container composed of three independent golang executables. One executable implements a webserver, another a logging device that relays messages to a remote server, while the third reports on the webserver's health.

Initial Build Context

The initial Build Context common to both examples.

Build Context (initial aggregate/global context)

  Dockerfile
  script.sh
  go/src/webserver/
    server.go
  go/src/logger/
    server.go
  go/src/health/
    server.go

Example: Current Multistage Design

FROM golang:1.7 AS webserver
COPY /go/src/webserver /go/src/webserver
WORKDIR /go/src/webserver
RUN go-wrapper download              \
 && export GOBIN=/go/bin             \
 && go-wrapper install server.go

FROM golang:1.7 AS logger
COPY /go/src/logger /go/src/logger
WORKDIR /go/src/logger
RUN go-wrapper download              \
 && export GOBIN=/go/bin             \
 && go-wrapper install server.go

FROM golang:1.7 AS health
COPY /go/src/health /go/src/health
WORKDIR /go/src/health
RUN go-wrapper download              \
 && export GOBIN=/go/bin             \
 && go-wrapper install server.go

FROM scratch AS requiredExtra
COPY --from=webserver /go/bin/server /final/bin/webserver
COPY --from=logger    /go/bin/server /final/bin/logger
COPY --from=health    /go/bin/server /final/bin/health
COPY /script.sh /start.sh

FROM alpine
COPY --from=requiredExtra /final /start.sh  /
ENTRYPOINT /start.sh
EXPOSE 8080

Example: Recommended Multistage Design

FROM golang:1.7-onbuild CONTEXT /go/src/webserver/:/  MOUNT /go/bin/app:/final/bin/webserver  #1
FROM golang:1.7-onbuild CONTEXT /go/src/logger/:/     MOUNT /go/bin/app:/final/bin/logger     #2
FROM golang:1.7-onbuild CONTEXT /go/src/health/:/     MOUNT /go/bin/app:/final/bin/health     #3
FROM alpine CONTEXT /final/bin:/bin  /script.sh:/start.sh   #4
COPY . /   #5
ENTRYPOINT /start.sh
EXPOSE 8080

Differences

Recommended Multistage Design when compared to Current Multistage Design:

  • Encourages more declarative solutions by:
    • leveraging reuse features, such as ONBUILD, that minimize developer-produced code and
    • declaring external data dependencies via CONTEXT & MOUNT separately from Dockerfile operations like COPY.
  • Seamlessly leverages current ONBUILD images.
  • Eliminates harmful coupling by replacing direct, rigid physical stage/image references with Build Context file paths that can be rebound, through a standard mapping mechanism, when running the Dockerfile.
  • Addresses issue of partitioning, structuring, and renaming Aggregate Build Context artifacts using a syntax and behavior similar to docker run -v.
  • Eliminates complexity of --from and stage/image reference support by replacing both with a mapping mechanism that encourages encapsulation.
  • Eliminates encoding extra build stage(s) and redundant copying.
  • Clearly delineates the input and output artifacts aiding developer comprehension.
  • Simplifies DAG analysis, as only FROM instructions need be parsed to reveal the data dependencies between stages.

Example: Recommended Multistage Design: Explained

  1. CONTEXT partitions the initial Aggregate Build Context to present the Local Build Context required by the FROM. For this stage, the webserver's golang source named server.go is the only file that appears in the "root" dir of the Local Build Context. Once this stage finishes, MOUNT associates the file /go/bin/app located in the last container created by this stage to the Aggregate Build Context as /final/bin/webserver.

Local Build Context

 server.go

Aggregate Build Context

Dockerfile
script.sh
go/src/webserver/
  server.go
go/src/logger/
  server.go
go/src/health/
  server.go
final/bin/
  webserver

  2. CONTEXT partitions the initial Aggregate Build Context to present the Local Build Context required by the FROM image. For this stage, the logger's golang source named server.go is the only file that appears in the "root" dir of the Local Build Context. Once this stage finishes, MOUNT associates the file /go/bin/app located in the last container created by this stage to the Aggregate Build Context as /final/bin/logger.

Local Build Context

 server.go

Aggregate Build Context

Dockerfile
script.sh
go/src/webserver/
  server.go
go/src/logger/
  server.go
go/src/health/
  server.go
final/bin/
  webserver
  logger

  3. CONTEXT partitions the initial Aggregate Build Context to present the Local Build Context required by the FROM image. For this stage, the health's golang source named server.go is the only file that appears in the "root" dir of the Local Build Context. Once this stage finishes, MOUNT associates the file /go/bin/app located in the last container created by this stage to the Aggregate Build Context as /final/bin/health.

Local Build Context

 server.go

Aggregate Build Context

Dockerfile
script.sh
go/src/webserver/
  server.go
go/src/logger/
  server.go
go/src/health/
  server.go
final/bin/
  webserver
  logger
  health

  4. CONTEXT partitions the Aggregate Build Context extended by stages 1-3 by isolating the contents of the /final/bin/ directory and projecting (renaming) it as /bin/. Additionally, the shell script script.sh is renamed to start.sh.

Local Build Context

  start.sh
  bin/
    webserver
    logger
    health

  5. Create a single layer by COPYing the Local Build Context into the root directory of alpine.
@dnephin
Member

dnephin commented May 15, 2017

Issue: Tight, Pathological Coupling

I think moby/moby#32100 would fix this

Issue: Extra Build Stage & Redundant COPYing

Would be fixed by moby/moby#32507 and moby/moby#32904 . Copy and many other metadata operations can be implemented without creating any layers.

Comparison: Current Multistage Design vs. Recommended

This seems like a really specific use case, and I don't think this reflects the general problem that is solved by multi stage builds.

I would personally put those 4 into 4 separate Dockerfiles. They are building different applications, not a single one. You could use docker-compose build to build them all at once.

@WhisperingChaos
Author

I think moby/moby#32100 would fix this

I'll look into this.

Would be fixed by moby/moby#32507 and moby/moby#32904 . Copy and many other metadata operations can be implemented without creating any layers.

As far as I can tell from exploring --from using the Docker Hub image 17.05.0-ce, only a single stage/image can be referenced by COPY --from, as --from cannot be specified more than once for a given COPY. I imagine this constraint also applies to RUN's --mount option, as it's difficult to discern from reading moby/moby#32507. Therefore, attempts to coalesce files sourced from two different stages/images into a single layer cannot be accomplished with either a single RUN or COPY instruction. Also, once COPY --from completes, it creates a new layer; therefore, please explain how "Copy and many other metadata operations can be implemented without creating any layers."

Notice the creation of a new layer for each COPY --from executed below:

Step 1/10 : FROM scratch as sp1
 ---> 
Step 2/10 : COPY afile /test/
 ---> 762a6ad71240
Removing intermediate container d14eb99a8af9
Step 3/10 : FROM scratch as sp2
 ---> 
Step 4/10 : COPY bfile /test/
 ---> f1971116fd80
Removing intermediate container 51a88fa672ce
Step 5/10 : FROM scratch as sp3
 ---> 
Step 6/10 : COPY cfile /test/
 ---> 1ee3cebffbff
Removing intermediate container 9894fe8756ef
Step 7/10 : FROM alpine
latest: Pulling from library/alpine
cfc728c1c558: Pull complete 
Digest: sha256:c0537ff6a5218ef531ece93d4984efc99bbf3f7497c0a7726c88e2bb7584dc96
Status: Downloaded newer image for alpine:latest
 ---> 02674b9cb179
Step 8/10 : COPY --from=sp1 /test /test
 ---> f12c4787c339
Removing intermediate container 3c0b5995b66f
Step 9/10 : COPY --from=sp2 /test /test
 ---> 6e620387b951
Removing intermediate container 9ac28d25abc4
Step 10/10 : COPY --from=sp3 /test /test
 ---> 0b84156ccf30

This seems like a really specific use case, and I don't think this reflects the general problem that is solved by multi stage builds.

I disagree. One of the primary objectives of Multistage build is the separation of build time concerns from the run time image. Essentially, the example manufactures three different artifacts needed by the run time image using three different stages that focus exclusively on providing the environments needed to construct each artifact. Once finished, the last stage transfers the artifacts (golang executables) from their no longer necessary build environments, combining them to create the final run time image.

At a minimum, two stages are required when a run time artifact must be built, instead of simply copied from the Build Context. In this situation, the first stage is polluted by the build environment needed to construct the run time artifact, while the second stage extracts the constructed artifact from its build environment by transferring it into the run time image. Therefore, it's not unreasonable to expect scenarios where more than one run time artifact must be built to satisfy the run time image requirements.
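The minimal two-stage pattern described above looks like this (a sketch with assumed file names and paths):

```dockerfile
# Stage 1: polluted by the build environment (compiler, sources).
FROM golang:1.7 AS build
COPY server.go /go/src/app/
RUN go build -o /go/bin/server /go/src/app/server.go

# Stage 2: extracts only the runtime artifact from stage 1.
FROM alpine
COPY --from=build /go/bin/server /usr/local/bin/server
ENTRYPOINT ["/usr/local/bin/server"]
```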

Finally, one could easily create another example involving the building of two dynamic C++ libraries, with a third stage creating an executable that depends on them. This can be accomplished by furnishing the appropriate source code and substituting the ONBUILD golang images with corresponding ONBUILD C++ ones.

Any feedback regarding the Issue: Ignores Aggregate Build Context?

@dnephin
Member

dnephin commented May 16, 2017

Therefore, attempts to coalesce files sourced from two different stages/images into a single layer cannot be accomplished with either a single RUN or COPY instruction

Why are intermediate layers a problem? Layers from a previous stage are not in the final stage, so it shouldn't matter how many layers you have in intermediate stages. You can grab them all as a single layer in the final stage.

Also moby/moby#32904 will allow for COPY to work without creating any image or container. So effectively no layers, or at least none of the problems caused by extra layers

it's not unreasonable to expect scenarios where more than one run time artifact must be built to satisfy the run time image requirements.

This does seem reasonable, and I believe that works fine, as you demonstrate in your example.

Any feedback regarding the Issue: Ignores Aggregate Build Context?

I don't really see the issue. You can do something like this to merge contexts:

FROM alpine as base
COPY . .

FROM base as app
# this stage now has everything from the original context

You can filter by starting from a fresh base and using COPY --from=base instead of FROM base.
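The filtering variant described above might look like this sketch (with /go/src/webserver standing in for whatever subtree the stage actually needs):

```dockerfile
FROM alpine AS base
# "base" absorbs the entire original context.
COPY . .

# Start from a fresh base so nothing leaks in by default...
FROM alpine AS app
# ...then pull over only the subtree this stage requires.
COPY --from=base /go/src/webserver /go/src/webserver
```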

Also EXPORT from moby/moby#32100 would make this a little more declarative.

@WhisperingChaos
Author

Why are intermediate layers a problem? Layers from a previous stage are not in the final stage, so it shouldn't matter how many layers you have in intermediate stages. You can grab them all as a single layer in the final stage.

Agreed, there shouldn't be a problem. However, I currently don't know how to "grab them all as a single layer". As far as I can tell, two COPY --from commands would be required to transfer artifacts from two different intermediate stages, and currently a commit is performed after each COPY. After reading your comments and moby/moby#32904 a couple more times, I now believe I understand your replies: once moby/moby#32904 becomes available, a commit won't be issued after a COPY instruction; therefore, multiple COPY --from instructions can be executed without generating additional layers.

I don't really see the issue.

How would one write the golang example without rewriting the golang ONBUILD image?

@dnephin
Member

dnephin commented May 16, 2017

I currently don't know how to "grab them all as a single layer"

This line from your example should accomplish that. It will be a single layer in the final image:

COPY --from=requiredExtra /final /start.sh  /

How would one write the golang example without rewriting the golang ONBUILD image?

That golang example is already working, right? So is the problem that it's so verbose? and each stage seems to be very similar?

@WhisperingChaos
Author

I currently don't know how to "grab them all as a single layer"

This line from your example should accomplish that. It will be a single layer in the final image:

COPY --from=requiredExtra /final /start.sh /

Yes, of course, within the example COPY --from created only a single layer in the run time image. It works as intended due to the example's design. The stage named requiredExtra, referenced by the COPY --from above, issued a series of four COPY operations to locate all the artifacts in the (same) file system allocated to requiredExtra, and this stage executes before the final one containing COPY --from=requiredExtra /final /start.sh /

However, according to what I now understand from our posts, once moby/moby#32904 is merged, one should be able to eliminate requiredExtra and simply issue four independent COPY --from instructions in the final stage, as they should generate only a single layer:

FROM alpine
COPY --from=webserver /go/bin/server /bin/webserver
COPY --from=logger    /go/bin/server /bin/logger
COPY --from=health    /go/bin/server /bin/health
COPY /script.sh /start.sh
ENTRYPOINT /start.sh

Therefore, the above should generate exactly 2 layers:

  • 1 from the single layer "alpine" image,
  • 1 generated by executing the four COPY instructions and ENTRYPOINT.

Let me know if my understanding above is incorrect.

That golang example is already working, right?

Yes. The Example: Current Multistage Design should work.

So is the problem that it's so verbose? and each stage seems to be very similar?

Yes & Yes. Due to the Aggregate Build Context and the lack of mechanisms to partition/map it so each stage can be defined with its own Local Build Context, one can't use the current golang ONBUILD trigger image to implement any stage. There are essentially two reasons for the repetitive code:

  • COPY has to perform this partitioning/mapping from the Aggregate Build Context to the stage's file system.
  • Since the source COPY file path has to differ for each stage in order to perform this mapping/partitioning, the remaining commands, which could have been encapsulated in ONBUILD triggers, cannot be encoded that way: the golang source must be COPYed before the remaining commands run, yet ONBUILD triggers always execute before the stage's own instructions.
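To make the conflict concrete, the stock golang onbuild image assumes the entire Build Context is one application's source (trigger bodies paraphrased from memory, so treat them as approximate):

```dockerfile
# Triggers roughly as defined by the golang onbuild image:
#   ONBUILD COPY . /go/src/app     <-- assumes context == one app's source
#   ONBUILD RUN go-wrapper download
#   ONBUILD RUN go-wrapper install
#
# In a multi-stage build, the Aggregate Build Context holds every stage's
# source, so each stage must repeat the partitioning by hand (subdirectory
# names here are assumptions):
FROM golang:1.8 AS webserver
COPY webserver/ /go/src/app     # stage-specific slice of the aggregate context
WORKDIR /go/src/app
RUN go install
```

Because the triggered COPY . would fire before the stage's own instructions, it can never select the per-stage subdirectory, so every stage must spell out the same COPY/WORKDIR/RUN boilerplate itself.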

@tonistiigi
Member

Let me know if my understanding above is incorrect.

It makes it possible for the builder to squash these layers, but we do not want to do that.

You should not care about the number of layers, and in the future you may not even know how many layers there were. Multiple layers that don't share contents do not perform any worse than a single one. In fact, multiple layers perform much better, as they can reuse the data from previous builds. Checking for deduplication is a separate issue. If these copies share sources, then that is not how multi-stage builds should be used.

ONBUILD

What you are asking for is basically FROM foo WITH bar AS baz, which would allow setting any source (dir, image, stage, git) as the main context for the stage. I'm open to considering this, although it may be hard to justify the new syntax if it only helps the ONBUILD case.
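For illustration only, that hypothetical syntax might let each stage reuse the onbuild image unchanged. Neither WITH nor the subdirectory names below exist anywhere; this is purely a sketch of the proposal:

```dockerfile
# Hypothetical -- "WITH <source>" would set the stage's entire Build Context,
# so the image's "ONBUILD COPY . /go/src/app" trigger would see only that
# source, with no per-stage partitioning code required.
FROM golang:onbuild WITH webserver/ AS webserver
FROM golang:onbuild WITH logger/    AS logger
FROM golang:onbuild WITH health/    AS health
```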

@WhisperingChaos
Author

You should not care about the number of layers...

Thanks for reminding me; the original reason for eliminating layers was to flush build time artifacts from the run time image. Since multistage builds properly separate build and run time concerns, you're right: layer count doesn't matter.

FROM foo WITH bar AS baz

I believe I understand this reference. bar becomes the stage's Build Context.

For me, the Build Context represents the essential abstraction for resolving a stage's file path references. A stage acquires input artifacts for its transforms and shares output artifacts via its Build Context. It's just a simple file system whose content and structure are unique to a given stage. To remain simple, its file paths do not directly expose concepts such as an image or stage reference. Therefore, in order to include other abstractions like image or stage file paths, these abstractions must be mapped to Build Context file paths. This is analogous to how the Unix file system works.

In Unix, network files, in-memory file systems, RAID arrays, and so on can be mounted into the local file system, permitting processes to read and write to these hidden abstractions using simple file path references to the local file system, concealing the complexity of where and how these files are actually stored. Additionally, the simple file path references present a static interface that can be rebound to a different hidden abstraction. For example, a file path reference can be bound to a RAID array, then rebound to another hidden abstraction, like an in-memory file system. After rebinding, the processes referencing this file path wouldn't know or care about the change.

So what's my point? I would suggest eliminating the notion of stage/image references from COPY --from and RUN --mount and limiting their binding to the file path references offered by a stage's Build Context. This provides a simple, static interface that developers create and code to when designing a stage. Then, separately, provide an ability to rebind these simple file path references to the necessary abstraction when running a stage. I don't believe this suggestion is new to you. Your initial proposal suggested docker build docker://image-reference[::/subdir], which bound an image file path reference to the Build Context. This feature allowed seamless rebinding to a different image, as long as that image reflected the required Build Context.

A couple final points:

  • If you solve the issues precluding the use of current ONBUILD triggers within a multistage build, I think it will focus your attention on the intricacies of binding abstractions, and hopefully provide a model for exploring mechanisms that result in a cohesive solution to ONBUILD and other reuse features.
  • Concerning RUN --mount: I'm not suggesting eliminating --mount; instead, simply limit its ability to reference file paths to those provided by the stage's Build Context.
  • The CONTEXT/MOUNT mapping features that I've suggested are a bit more flexible than a typical Unix mount. They support the assembly of a file path's contents from multiple sources. For example, given a directory x, one could contribute files from more than one file path (directory) to create x's content. Since MOUNT can be associated with any FROM, it provides a facility to include any stage or image reference. If you wish to explore this further, let me know.
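One possible reading of the CONTEXT/MOUNT idea, with syntax invented here purely for illustration (neither instruction exists in any Dockerfile dialect, and the stage names are assumptions):

```dockerfile
# Hypothetical sketch of the mapping behavior described above: MOUNT reflects
# external sources (stages, images, context subdirs) as Build Context paths,
# so the stage body binds only to plain file paths, never to stage names.
FROM alpine
MOUNT stage://webserver/bin/server  /bin/webserver   # map one stage artifact in
MOUNT stage://logger/bin/server     /bin/logger
MOUNT stage://libsA/lib             /x               # two sources assemble
MOUNT stage://libsB/lib             /x               # the content of /x
# The instructions below see only ordinary Build Context paths:
COPY /bin/webserver /bin/logger /bin/
```

Rebinding a stage to different sources would then mean editing only the MOUNT lines, leaving the stage's COPY/RUN instructions untouched.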

@tonistiigi
Member

tonistiigi commented May 17, 2017

What you are calling context isn't really any different from any of the other sources that a build can use, like images, stages, tar archives, and git repos. It is just the source that happens to contain the files from the working dir of the client. An important property of these sources, one that makes the core of the builder work, is that they are all immutable.

@WhisperingChaos
Author

What you are calling context isn't really any different from any of the other sources that build can use like images, stages, tar archives, git repos.

Exactly the point! A Build Context is an abstraction, just like the *nix file system is an abstraction allowing various kinds of resources to present themselves as simple file paths that can be traversed, read, renamed, and so on using a standard interface. Therefore, instead of limiting the notion of a "Build Context" to a concrete definition (the "source that happens to contain the files from the working dir of the client"), extend it to include images, stages, tar archives, and git repos by reflecting these things as Build Context file paths.

Prior to the introduction of --from, the Build Context was the sole abstraction used to provide source artifacts to a Dockerfile. What's being suggested by this thread is to remain faithful to this purpose by mapping other file path resources, such as images, stages, tar archives, and git repos, to Build Context file paths.

An important property of these sources that makes the core of builder to work is that they are all immutable.

What's suggested by MOUNT is a mechanism that extends the Build Context without "writing" to it. Therefore, the Aggregate Build Context becomes analogous to an insert-only database. In fact, it behaves similarly to RUN --mount, which permits the extension of an (immutable) image's file system. Of course, MOUNT can suffer from the same write issue that RUN --mount introduces: the mount destination directory may contain artifacts contributed by the image, which are logically deleted (hidden) while the build runs and become visible again once it completes. However, precautions can be encoded to avoid possible side effects introduced by this behavior, should it be considered dangerous.

Finally, CONTEXT produces a view (perspective) derived from the Aggregate Build Context that satisfies a given step's source artifact needs. A content hash can be calculated for this view to monitor its immutability separately from the Aggregate Build Context. Therefore, extending the Aggregate Build Context won't cause a cache miss for a particular step unless the step's view is itself altered, its CONTEXT definition is updated, or one of its necessary source artifacts changes.

@WhisperingChaos
Author

Any feedback regarding the Issue: Ignores Aggregate Build Context?
I don't really see the issue.

In addition to my initial reply, the below discusses a few more reasons why the suggested workaround is problematic:

  • The COPY . . encoded by the workaround converts the Aggregate Build Context file system into an image file system in order to utilize the extended binding mechanisms, COPY --from and RUN --mount, which offer the ability to rename a single resource per invocation. Although limited, this ability is important, as it allows the Aggregate Build Context to be "reshaped" into a form consumable by other build stages that bind themselves to their own specific set of Build Context file paths. Unfortunately, unless there's a mechanism to convert the image file system references back to the Build Context abstraction, code relying on the Build Context file system can't be reused and would have to be rewritten to employ COPY --from/RUN --mount. At one time, docker build docker://image-reference[::/subdir] was proposed, which would allow an image file system to be reflected as a Build Context, but I'm uncertain of its implementation status.

  • Although RUN --mount permits binding the top-level directory to a different name, or allows an individual file to be renamed, this ability is limited to a single rename operation per RUN --mount command. Unfortunately, if more than one source image/stage must contribute artifacts to a specific transform (RUN), or more than one rename operation is required to compose the proper interface for it, then a series of other RUN --mount or COPY --from operations must first execute before running the desired transform. These additional RUN --mount/COPY --from operations essentially "reshape" the source artifacts contributed by other images/stages to reflect the input interface required by the consuming transform. Therefore, the simplicity of --from and --mount encourages the shape (file path names) of artifacts offered by stages/images to mirror those required by the transforms (stages) that consume them, in order to reduce redundant COPY --from and/or RUN --mount operations. This promotes tight coupling between stages/images, reduces the ability to reuse code, and increases its rigidity. Essentially, --from and --mount encode mapping mechanisms that are too simple.

All the issues above apply to the provided workaround due to the inflexible, non-declarative mapping mechanisms offered by --from/--mount when compared to the ones provided by CONTEXT/MOUNT.
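The "reshaping" overhead described above can be sketched as follows; every stage name and path here is hypothetical, and the point is only the extra steps forced in before the real transform:

```dockerfile
# Two stages contribute inputs that the transform expects under specific
# names, so extra COPY --from steps must reshape them first (all names and
# paths are invented for illustration):
FROM alpine AS build
COPY --from=toolchain /out/compiler /tools/cc   # rename #1: expected path
COPY --from=linker    /out/binary   /tools/ld   # rename #2: second source
RUN /tools/cc main.c -o main.o && /tools/ld main.o -o /bin/app
```

Avoiding these preparatory steps tempts authors to make the producing stages emit artifacts already shaped for a particular consumer, which is exactly the tight coupling the comment describes.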

@macropin

macropin commented Dec 7, 2017

Came here to complain about lack of global args w/ multistage builds... and WhisperingChaos critique did not disappoint! 5/5 will subscribe.

@WhisperingChaos
Author

@macropin

Thank you for your kind compliment of my critique, although I'm not quite sure what you expect to experience by subscribing to this thread.

I do appreciate that the core maintainers/developers were polite enough to respond to my arguments, given the competition for their time between responding to other community posts and their driving desire to improve Docker by actually writing code. However, it's evident to me that even if they suspect the validity of some of the technical arguments presented above, they believe the already encoded multi-stage mechanisms address the concerns well enough for the common use cases.

@thaJeztah thaJeztah transferred this issue from moby/moby Sep 16, 2023