Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/playground: very large output is sometimes corrupted #38751

Open
dmitshur opened this issue Apr 29, 2020 · 22 comments
Open

x/playground: very large output is sometimes corrupted #38751

dmitshur opened this issue Apr 29, 2020 · 22 comments

Comments

@dmitshur
Copy link
Contributor

@dmitshur dmitshur commented Apr 29, 2020

Today and yesterday, I've observed instances where a playground execution seems to be successful (there's no error), but the output is truncated rather than complete.

Filing this issue for tracking purposes, will fill in more information later.

Edit: A workaround that is available is to modify the program (e.g., add whitespace) and re-run it until it works.

/cc @toothrot @bradfitz

@dmitshur
Copy link
Contributor Author

@dmitshur dmitshur commented Apr 29, 2020

Can be reproduced fairly reliably with https://play.golang.org/p/ctu9bUY9ufb.

See the constant comment there:

// Change this number to control amount of output.
//
// A value of 100 works okay always.
//
// Values between 1000-2000 sometimes work okay (a complete PNG image is shown),
// but result in truncated output other times (displayed as a broken PNG image).
size = 100
@toothrot
Copy link
Contributor

@toothrot toothrot commented May 4, 2020

While investigating, to remove the browser as a variable, I wrote the following script:

curl 'https://play.golang.org/compile' --data 'version=3&body=package+main%0A%0Aimport+(%0A%09%22image%22%0A%09%22image%2Fcolor%22%0A%09%22image%2Fdraw%22%0A%09%22math%2Frand%22%0A%0A%09%22golang.org%2Fx%2Ftour%2Fpic%22%0A)%0A%0Aconst+(%0A%09%2F%2F+Change+this+number+to+control+amount+of+output.%0A%09%2F%2F%0A%09%2F%2F+A+value+of+100+works+okay+always.%0A%09%2F%2F%0A%09%2F%2F+Values+between+1000-2000+sometimes+work+okay+(a+complete+PNG+image+is+shown)%2C%0A%09%2F%2F+but+result+in+truncated+output+other+times+(displayed+as+a+broken+PNG+image).%0A%09size+%3D+1200%0A%0A%09randomSeed+%3D+1+%2F%2F+Change+this+number+to+force+new+executions.%0A)%0A%0Afunc+main()+%7B%0A%09m+%3A%3D+image.NewNRGBA(image.Rect(0%2C+0%2C+size%2C+size))%0A%09draw.Draw(m%2C+m.Bounds()%2C+Noise%7Brand.New(rand.NewSource(randomSeed))%7D%2C+image.Point%7B%7D%2C+draw.Src)%0A%09pic.ShowImage(m)%0A%7D%0A%0Atype+Noise+struct%7B+r+*rand.Rand+%7D%0A%0Afunc+(Noise)+ColorModel()+color.Model+%7B+return+color.Gray16Model+%7D%0Afunc+(Noise)+Bounds()+image.Rectangle+%7B%0A%09return+image.Rectangle%7Bimage.Point%7B-1e9%2C+-1e9%7D%2C+image.Point%7B1e9%2C+1e9%7D%7D%0A%7D%0Afunc+(n+Noise)+At(x%2C+y+int)+color.Color+%7B%0A%09return+color.Gray16%7Buint16(n.r.Intn(0xFFFF))%7D%0A%7D%0A&withVet=true'   --compressed > raw.json

cat raw.json | jq .Events[0].Message > raw.image

cat raw.image | sed -e s/\"IMAGE:// | sed -e s/=.*$/=/ > whatever.base64

base64 -d whatever.base64 > foo.png

The final few characters of the base64 looks like this, in a bad example:

BgYVFVVGxsbgfr2fwEAAP//+vzpnB150HsAAAAASUVORK5CYII=\n"

The trailing \n makes it look like we've read and returned everything the program had to offer, and that looks a lot like a base64 encoded string to me.

It does fail base64 decoding though, so I am wondering if there's a bug in the base64 encoding that we're hitting?

@toothrot
Copy link
Contributor

@toothrot toothrot commented May 4, 2020

It's also possible that we could be corrupting output read from the container somehow, but we might need to create a separate test to verify that.

@toothrot
Copy link
Contributor

@toothrot toothrot commented May 4, 2020

I've verified that this can't be reproduced locally, so it's almost certain that output is getting corrupted somewhere in the chain.

@dmitshur dmitshur changed the title x/playground: output is sometimes truncated x/playground: very large output is sometimes corrupted May 4, 2020
@bradfitz
Copy link
Contributor

@bradfitz bradfitz commented May 4, 2020

Race in gvisor sandbox reading from stdout + Process.Wait or something? Haven't looked, but that sort of bug is common enough.

@toothrot
Copy link
Contributor

@toothrot toothrot commented May 4, 2020

/cc @bcmills and I spent a lot of time making the Process.Wait pretty bulletproof here. I'd first thought that was an issue when the issue title was that the output was truncated, but the fact that we're reading the full base64 string, but with invalid characters in it makes me think it's something else.

In a failure scenario, the middle of the base64 string looks like this:

DhKSkq+f/+upKRkYmKiqqr6/v\u0000\u0000PB\u0011t���\u0018`\u0000\u0000FЛ17GRmZ6upqZ2fn0NDQR48

So we're either having trouble reading something from the container, or something else is writing concurrently to stdout and junking up our output.

@dmitshur
Copy link
Contributor Author

@dmitshur dmitshur commented May 4, 2020

I could reproduce it (sometimes, not 100%) by re-running the same curl command (size 1500, some random seed) two times and getting different output:

$ curl -s 'https://play.golang.org/compile' --data 'version=3&body=package+main%0A%0Aimport+(%0A%09%22image%22%0A%09%22image%2Fcolor%22%0A%09%22image%2Fdraw%22%0A%09%22math%2Frand%22%0A%0A%09%22golang.org%2Fx%2Ftour%2Fpic%22%0A)%0A%0Aconst+(%0A%09%2F%2F+Change+this+number+to+control+amount+of+output.%0A%09%2F%2F%0A%09%2F%2F+A+value+of+100+works+okay+always.%0A%09%2F%2F%0A%09%2F%2F+Values+between+1000-2000+sometimes+work+okay+(a+complete+PNG+image+is+shown)%2C%0A%09%2F%2F+but+result+in+truncated+output+other+times+(displayed+as+a+broken+PNG+image).%0A%09size+%3D+1500%0A%0A%09randomSeed+%3D+175432523+%2F%2F+Change+this+number+to+force+new+executions.%0A)%0A%0Afunc+main()+%7B%0A%09m+%3A%3D+image.NewNRGBA(image.Rect(0%2C+0%2C+size%2C+size))%0A%09draw.Draw(m%2C+m.Bounds()%2C+Noise%7Brand.New(rand.NewSource(randomSeed))%7D%2C+image.Point%7B%7D%2C+draw.Src)%0A%09pic.ShowImage(m)%0A%7D%0A%0Atype+Noise+struct%7B+r+*rand.Rand+%7D%0A%0Afunc+(Noise)+ColorModel()+color.Model+%7B+return+color.Gray16Model+%7D%0Afunc+(Noise)+Bounds()+image.Rectangle+%7B%0A%09return+image.Rectangle%7Bimage.Point%7B-1e9%2C+-1e9%7D%2C+image.Point%7B1e9%2C+1e9%7D%7D%0A%7D%0Afunc+(n+Noise)+At(x%2C+y+int)+color.Color+%7B%0A%09return+color.Gray16%7Buint16(n.r.Intn(0xFFFF))%7D%0A%7D%0A&withVet=true' --compressed > compile-output-1.txt
$ curl -s 'https://play.golang.org/compile' --data 'version=3&body=package+main%0A%0Aimport+(%0A%09%22image%22%0A%09%22image%2Fcolor%22%0A%09%22image%2Fdraw%22%0A%09%22math%2Frand%22%0A%0A%09%22golang.org%2Fx%2Ftour%2Fpic%22%0A)%0A%0Aconst+(%0A%09%2F%2F+Change+this+number+to+control+amount+of+output.%0A%09%2F%2F%0A%09%2F%2F+A+value+of+100+works+okay+always.%0A%09%2F%2F%0A%09%2F%2F+Values+between+1000-2000+sometimes+work+okay+(a+complete+PNG+image+is+shown)%2C%0A%09%2F%2F+but+result+in+truncated+output+other+times+(displayed+as+a+broken+PNG+image).%0A%09size+%3D+1500%0A%0A%09randomSeed+%3D+175432523+%2F%2F+Change+this+number+to+force+new+executions.%0A)%0A%0Afunc+main()+%7B%0A%09m+%3A%3D+image.NewNRGBA(image.Rect(0%2C+0%2C+size%2C+size))%0A%09draw.Draw(m%2C+m.Bounds()%2C+Noise%7Brand.New(rand.NewSource(randomSeed))%7D%2C+image.Point%7B%7D%2C+draw.Src)%0A%09pic.ShowImage(m)%0A%7D%0A%0Atype+Noise+struct%7B+r+*rand.Rand+%7D%0A%0Afunc+(Noise)+ColorModel()+color.Model+%7B+return+color.Gray16Model+%7D%0Afunc+(Noise)+Bounds()+image.Rectangle+%7B%0A%09return+image.Rectangle%7Bimage.Point%7B-1e9%2C+-1e9%7D%2C+image.Point%7B1e9%2C+1e9%7D%7D%0A%7D%0Afunc+(n+Noise)+At(x%2C+y+int)+color.Color+%7B%0A%09return+color.Gray16%7Buint16(n.r.Intn(0xFFFF))%7D%0A%7D%0A&withVet=true' --compressed > compile-output-2.txt
$ ls -la compile-output*
-rw-r--r--  1 dmitshur  primarygroup  7639457 May  4 12:16 compile-output-1.txt
-rw-r--r--  1 dmitshur  primarygroup  7639505 May  4 12:16 compile-output-2.txt

The JSON field "Message" contains some characters that are clearly not base64:

[...]
Hx8fBwbGzs0NJSbm9va2gqJDF\u0000\u0000PB\u0011t���\u0018`\u0000\u0000s��VV1fb29nv37p2dnamoqPz+/VtRUXFsbKympmZ
[...]
mZlBQkKqrq\u0000\u0000PB\u0011t���\u0018`\u0000\u0000r\u0000�6GF7ampubZ2VlXVxcKCoqLi4uenh
[...]
ERHBx8+/\u0000\u0000PB\u0011t���\u0018`\u0000\u0000q\u0000�ZtVlZWY2NjGxsbLS0taB
[...]

It happened around offsets 69k, 168k, and 233k in the execution above.

Edit: In another run, I measured the deltas between the "\u0000\u0000PB" fragments more precisely, and they were exactly 65536 bytes (64 KiB) apart in most cases, sometimes a bit more (69632 bytes).

@andybons
Copy link
Member

@andybons andybons commented May 4, 2020

I’m curious if we should provide a method to avoid the cache to help with these types of issues. Uncertain whether it would be worth the effort but it would help in debugging at times.

@dmitshur
Copy link
Contributor Author

@dmitshur dmitshur commented May 4, 2020

a method to avoid the cache

I've been relying on adding whitespace in a program when I want to explicitly avoid the cache.

@andybons
Copy link
Member

@andybons andybons commented May 4, 2020

I've been relying on adding whitespace in a program when I want to explicitly avoid the cache.

Yes but that’s pretty tedious when you’re attempting to debug an issue.

@bcmills
Copy link
Member

@bcmills bcmills commented May 4, 2020

the \u0000\u0000PB prefix in the corrupted segments indicates that the interrupting blob is generated by the faketime patch:

// Playback header: 0 0 P B <8-byte time> <4-byte data length> (big endian)
var buf [4 + 8 + 4]byte
buf[2] = 'P'
buf[3] = 'B'

@toothrot
Copy link
Contributor

@toothrot toothrot commented May 4, 2020

@bcmills Thanks! That's consistent with the theory that something is writing to STDOUT in parallel. It seems to indicate that it's the program itself that is printing something in parallel (as only the snippet binary is built with the faketime patch).

Could it be a warning or error from the runtime? These images are "large", in that they're many megabytes, and we have hard constraints set on our sandbox containers.

By modifying the program to print to Stdout more efficiently, I haven't been able to trigger a broken image. I have been able to trigger a 500 with some large images, which means we may not handle capped output gracefully.

https://play.golang.org/p/sXRhLoQeITW

@toothrot
Copy link
Contributor

@toothrot toothrot commented May 4, 2020

Another side thought: random noise doesn't compress well to the PNG output (or anywhere else for that matter). I've been able to generate larger images using the inefficient print with a random-offset gradient: https://play.golang.org/p/Ca7xtHuq7RX

@bcmills
Copy link
Member

@bcmills bcmills commented May 4, 2020

The faketime framing happens at the syscall.Write level:

n = faketimeWrite(fd, p)

func faketimeWrite(fd int, p []byte) int {
var pp *byte
if len(p) > 0 {
pp = &p[0]
}
return int(runtimeWrite(uintptr(fd), unsafe.Pointer(pp), int32(len(p))))
}

TEXT runtime·write1(SB),NOSPLIT,$0-28
MOVQ fd+0(FP), DI
MOVQ p+8(FP), SI
MOVL n+16(FP), DX
MOVL $SYS_write, AX
SYSCALL
MOVL AX, ret+24(FP)
RET

So if the OS happens to split the write, as it is allowed to do, we can end up writing multiple headers within a given payload.

@bcmills
Copy link
Member

@bcmills bcmills commented May 4, 2020

Where is the Playground-side code that parses this output? (Is it assuming any particular framing?)

I'm guessing that's where our bug will be found...

@bcmills
Copy link
Member

@bcmills bcmills commented May 4, 2020

Oh-ho-ho! This is a fundamental problem with the faketime patch.

faketime writes a header followed by a payload length. However, nothing ensures that the OS will actually write that same length: it may write an arbitrarily truncated packet, in which case the header for the next packet will appear midway through what was declared as the previous payload.

@prattmic
Copy link
Member

@prattmic prattmic commented May 4, 2020

I'm not very familiar with NaCL, but it is quite possible that NaCL always did complete writes. gVisor, like Linux, on the other hand, with certainly do partial writes.

@bcmills
Copy link
Member

@bcmills bcmills commented May 4, 2020

I think the solution may be to rework faketime so that the locking (and header injection) occurs around the loop in (*internal/poll.FD).Write:

// Write implements io.Writer.
func (fd *FD) Write(p []byte) (int, error) {
if err := fd.writeLock(); err != nil {
return 0, err
}
defer fd.writeUnlock()
if err := fd.pd.prepareWrite(fd.isFile); err != nil {
return 0, err
}
var nn int
for {

If we move the header-injection there, then we can guarantee that a the payload is written to completion atomically after each header, even if it happens to require multiple write syscalls.

@gopherbot
Copy link

@gopherbot gopherbot commented May 4, 2020

Change https://golang.org/cl/232177 mentions this issue: pic: avoid triggering write issue in faketime

@toothrot
Copy link
Contributor

@toothrot toothrot commented May 4, 2020

Here's a dead simple program to reproduce the issue:

https://play.golang.org/p/qiktfHPSqAy

That also explains why the updated program to have the png to base64 encoder print directly to stdout did not trigger this, as it's less likely to happen with smaller writes.

@gopherbot
Copy link

@gopherbot gopherbot commented May 8, 2020

Change https://golang.org/cl/232867 mentions this issue: pic: don't make a string copy when writing image

gopherbot pushed a commit to golang/tour that referenced this issue May 8, 2020
Previously, the entire image was encoded into a buffer,
then encoded to a base64 string, which was concatenated
with another string, and written to stdout in one swoop.

We can do this more efficiently by writing the PNG image
to a base64 encoder directly, and using buffered I/O to
batch writes to stdout into moderately-sized chunks.

As an added bonus, though really the main motivation for
doing this optimization now, this helps avoid triggering
the golang/go#38751 issue on the Go Playground.

Switch to the best compression level instead of the default,
reducing encoded image size at the cost of extra computation.
Playground snippets need to be transferred over the network
and kept in storage, so it should be a favorable trade-off.

This CL was based on CL 232177.

For golang/go#38751.

Change-Id: I565fe538aa15910caaff98be156ac64b0d35fff4
Co-authored-by: Alexander Rakoczy <alex@golang.org>
Reviewed-on: https://go-review.googlesource.com/c/tour/+/232867
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Alexander Rakoczy <alex@golang.org>
gopherbot pushed a commit to golang/website that referenced this issue May 26, 2021
Previously, the entire image was encoded into a buffer,
then encoded to a base64 string, which was concatenated
with another string, and written to stdout in one swoop.

We can do this more efficiently by writing the PNG image
to a base64 encoder directly, and using buffered I/O to
batch writes to stdout into moderately-sized chunks.

As an added bonus, though really the main motivation for
doing this optimization now, this helps avoid triggering
the golang/go#38751 issue on the Go Playground.

Switch to the best compression level instead of the default,
reducing encoded image size at the cost of extra computation.
Playground snippets need to be transferred over the network
and kept in storage, so it should be a favorable trade-off.

This CL was based on CL 232177.

For golang/go#38751.

Change-Id: I565fe538aa15910caaff98be156ac64b0d35fff4
Co-authored-by: Alexander Rakoczy <alex@golang.org>
Reviewed-on: https://go-review.googlesource.com/c/tour/+/232867
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Alexander Rakoczy <alex@golang.org>
X-Tour-Commit: 0608babe047def227de553b6538ed3bd6277acff
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants