Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text: Problem with charset 'hz-gb-2312', program stuck in infinite loop and gets killed! #35118

Open
ben-sab opened this issue Oct 23, 2019 · 8 comments
Labels
Milestone

Comments

@ben-sab
Copy link

@ben-sab ben-sab commented Oct 23, 2019

What version of Go are you using (go version)?

$ go version
go1.12.7 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build514111842=/tmp/go-build -gno-record-gcc-switches"

What did you do?

https://play.golang.org/p/V5OAmSmE9-r

package main

import (
	"bytes"
	"golang.org/x/text/encoding/ianaindex"
)
func main() {
	content := []byte{126}
	enc, _ := ianaindex.MIME.Encoding("hz-gb-2312")
	decoder := enc.NewDecoder()
	decoded, _ := decoder.Bytes(content)
	bytes.NewReader(decoded)
}

What did you expect to see?

For the program to exit normally.

What did you see instead?

Gets stuck in an infinite loop and gets killed. In playground you'll see: "Program exited: process took too long."

The culprit is the Tilde (~ or ascii=126) which if shows up at the end of a line, makes decoder.Bytes(content) get stuck in an infinite loop for this particular encoding.

@gopherbot gopherbot added this to the Unreleased milestone Oct 23, 2019
@dmitshur

This comment has been minimized.

Copy link
Member

@dmitshur dmitshur commented Oct 23, 2019

/cc @mpvl per owners.

@ben-sab

This comment has been minimized.

Copy link
Author

@ben-sab ben-sab commented Oct 23, 2019

Not sure if this is going to help in tracking down the issue or not but have a look at the following tests (just changing the value of content in the above code):

  1. content := []byte{126}: fails
  2. content := []byte{126,126}: OK
  3. content := []byte{126,126,126}: fails
  4. content := []byte{126,126,126,126}: OK
  5. content := []byte{126,126,126,126,126}: OK!

So it's not the odd number of tildes that's breaking it either, hope this helps!

@smasher164

This comment has been minimized.

Copy link
Member

@smasher164 smasher164 commented Oct 24, 2019

  1. content := []byte{126,126,126,126,126}: OK!

In that last case, shouldn't the decoder return an error? Since ~ triggers an escape sequence, any input that ends with an incomplete escape sequence is definitely invalid.

The simplifiedchinese transformer seems to be returning ErrShortDst when it should be returning ErrShortSrc instead, since an incomplete escape sequence is insufficient data to complete a transformation. This causes doAppend to continuously grow the destination buffer, re-running the same transformation in an infinite loop.

@josharian

This comment has been minimized.

Copy link
Contributor

@josharian josharian commented Oct 24, 2019

Aside: This sounds like the kind of bug fuzzing would catch. Might be worth writing a few fuzz functions for this repo. cc @yevgenypats

@yevgenypats

This comment has been minimized.

Copy link

@yevgenypats yevgenypats commented Oct 24, 2019

Thanks @josharian , I'll look into this.

@ben-sab

This comment has been minimized.

Copy link
Author

@ben-sab ben-sab commented Nov 4, 2019

Any news about this issue at all?

@garrmcnu

This comment has been minimized.

Copy link

@garrmcnu garrmcnu commented Feb 24, 2020

Hi, I would like to give this one a shot.

As mentioned, in this encoding '~' is an escape character which must be followed by another character, from RFC 1843:
"In ASCII mode, a byte is interpreted as an ASCII character, unless a
'~' is encountered. The character '~' is an escape character. By
convention, it must be immediately followed ONLY by '~', '{' or '\n'
(<LF>), with the following special meaning."

As far as I can tell, Transform() only returns ErrShortSrc if atEOF is false. If atEOF is true (as is the case for a fixed buffer/string) then a single '~' is replaced with the unicode replacement character (U+FFFD / utf8.RuneError), this character requires 3 bytes to encode [239 191 189].
Would be good if someone more familiar with this area could confirm if that is the expected behaviour?

The reason for the infinite loop is when the replacement character is inserted, the size variable (used to iterate through the src buffer) is not updated, so this loop doesn't make progress.
Setting this variable to 1 when writing the replacement character (to indicate that a single source byte has been read) fixes the issue.

The reason for the different behaviour on the number of '~' characters is due to the size of the destination buffer. If this buffer is not big enough to store the replacement character, Transform() returns ErrShortDst (and nSrc to indicate how many source bytes have been read so far), Transform() is then called again with a larger buffer to continue where it left off, however on this call size starts at 0 again and is never updated (because the next character is a single '~') and it gets stuck in an infinite loop.
On the other hand, if the destination buffer is large enough to store the replacement character, it will correctly write the destination buffer, but return an incorrect nSrc (1 byte larger than len(src)).

For example, []byte("~~~~~") requires 5 bytes for the destination buffer (1 byte each for the 2 pairs of '~' + 3 bytes for the replacement character).
The Bytes() method initially sizes the destination buffer to match the source buffer (so 5 bytes in this case). The first iteration through the Transform() loop reads '~~' this is valid, so size is 2 and 1 byte is written to the destination (nSrc=2 nDst=1), next iteration is the same (nSrc=4 nDst=2), next iteration for the final '~' the replacement character is written to the destination buffer but size is not updated and is still 2 from the previous iteration (nSrc=6 nDst=5). This is sufficient to exit the loop but an incorrect nSrc is returned.

The incorrect nSrc doesn't seem to cause a problem when calling Bytes(), however calling String() this results in a panic (slice bounds out of range).

So to summarise behaviour
decoder.Bytes([]byte("~")) // infinite loop (initial destination buffer too small)
decoder.Bytes([]byte("~~")) // valid sequence, no problem
decoder.Bytes([]byte("~~~")) // infinite loop (initial destination buffer too small)
decoder.Bytes([]byte("~~~~~")) // correct transform but incorrect nSrc=6
decoder.Bytes([]byte("abcd~")) // infinite loop (initial destination buffer too small)
decoder.String("~") // infinite loop (size never updated)
decoder.String("~~~") // panic due to incorrect nSrc=4

Perhaps an additional improvement in String() would be a sanity check if nSrc > len(src) to return an error in this case?

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Feb 24, 2020

Change https://golang.org/cl/220460 mentions this issue: encoding/simplifiedchinese: fix incorrect transform count to avoid infinite loop

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.