Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 5 additions & 5 deletions pkg/build/strategies/sti/sti.go
Original file line number Diff line number Diff line change
Expand Up @@ -560,7 +560,6 @@ func (builder *STI) Execute(command string, user string, config *api.Config) err
}()

opts.Stdin = r
defer wg.Wait()
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very tricky defer statement, there is no clean way to "cancel" it.

Calling wg.Done() earlier introduces a concurrency problem. As @bparees and @gabemontero noted, wg.Done() could end up being called multiple times sending the counter below 0, causing a panic.

Correct code arranges to call wg.Done() as many times as wg.Add(n) in all cases.


go func(reader io.Reader) {
Expand Down Expand Up @@ -591,16 +590,17 @@ func (builder *STI) Execute(command string, user string, config *api.Config) err
go dockerpkg.StreamContainerIO(errReader, &errOutput, func(a ...interface{}) { glog.Info(a...) })

err := builder.docker.RunContainer(opts)
if util.IsTimeoutError(err) {
// Cancel waiting for source input if the container timeouts
wg.Done()
}
if e, ok := err.(errors.ContainerError); ok {
// even with deferred close above, close errReader now so we avoid data race condition on errOutput;
// closing will cause StreamContainerIO to exit, thus releasing the writer in the equation
errReader.Close()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, this is not a correct mechanism for cancellation, instead we will only call wg.Wait() if and when we want to block waiting for the goroutine that uploads source code to terminate.

return errors.NewContainerError(config.BuilderImage, e.ErrorCode, errOutput)
}
// Do not wait for source input if the container times out.
// FIXME: this potentially leaks a goroutine.
if !util.IsTimeoutError(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing the !config.LayeredBuild check of the prior fix seems incorrect to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://play.golang.org/p/kpKJh9Prei

If you never call wg.Add, it is perfectly ok and correct to call wg.Wait(), so we don't need to add that extra condition, it is irrelevant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Thursday, September 1, 2016, Rodolfo Carvalho notifications@github.com
wrote:

In pkg/build/strategies/sti/sti.go
#581 (comment)
:

if e, ok := err.(errors.ContainerError); ok {
    // even with deferred close above, close errReader now so we avoid data race condition on errOutput;
    // closing will cause StreamContainerIO to exit, thus releasing the writer in the equation
    errReader.Close()
    return errors.NewContainerError(config.BuilderImage, e.ErrorCode, errOutput)
}
  • // Do not wait for source input if the container times out.
  • // FIXME: this potentially leaks a goroutine.
  • if !util.IsTimeoutError(err) {

See https://play.golang.org/p/kpKJh9Prei

If you never call wg.Add, it is perfectly ok and correct to call wg.Wait(),
so we don't need to add that extra condition, it is irrelevant.

Ah - ok yeah that makes sense now - the count is still 0.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/openshift/source-to-image/pull/581/files/0b63dd20f38c0ab2c07a571708273cdb43d79f9c#r77222225,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ADbadNxDN11QXZcFyt1lsRHS9kGvIPLYks5qlxGigaJpZM4Jy8x3
.

wg.Wait()
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we'd drop a wg.Wait() here, without the if, but the if here is trying to cover whatever use case it had before, which I am not really sure is the correct thing to do, but we can probably defer to a rewrite. Assign that rewrite to me :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we need the if-check around the wg.Wait() because the bad scenario would be:

  1. injections start (wg.Add(1) occurs)
  2. call to start container times out (container never starts)
  3. injections never finish because the container never started
  4. wg.Done() never gets called because the injections never finish
  5. wg.Wait() never completes

which is the original hang we were trying to fix.

that's not to say a rewrite couldn't improve it, but at least as it stands now i believe it's necessary.

Copy link
Contributor Author

@rhcarvalho rhcarvalho Sep 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bparees thanks for challenging my tired mind :)
Really, good to have another pair of eyes and good brain to reason about it.

  1. injections start (wg.Add(1) occurs)
  2. call to start container times out (container never starts)

This means that after

err := builder.docker.RunContainer(opts)

util.IsTimeoutError(err) == true, right?

  1. injections never finish because the container never started

Yes, and this causes a goroutine leak.

  1. wg.Done() never gets called because the injections never finish

Yes.

  1. wg.Wait() never completes

When util.IsTimeoutError(err) == true, wg.Wait() is not called. In all other cases, i.e. !util.IsTimeoutError(err), wg.Wait() is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bparees does this make sense to you after all?

In my mind we're trading a panic for a leak, and based on previous discussions this is a "less critical" leak because this code is running in the builder pod and the process is short-lived.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're in agreement, my point was that we need the "if !util.IsTimeoutError(err)" check.

It sounded to me like you were implying we might not need it, but on rereading I guess all you were saying is "ideally we'd refactor this code so this check is not needed, but until we do that, it is needed, so here it is"

so i think we're good (and yeah, goroutine leak. oh well. luckily this code is running in a short lived container in openshift, or a short lived process in s2i)

return err
}

Expand Down