Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace non-utf8 characters in bodies #228

Merged
merged 8 commits into from
May 9, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
18 changes: 16 additions & 2 deletions sdk/instrumentation/bodyattribute/bodyattribute.go
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,14 @@ package bodyattribute // import "github.com/hypertrace/goagent/sdk/instrumentati
import (
"encoding/base64"
"fmt"
"strings"
"unicode/utf8"

"github.com/hypertrace/goagent/sdk"
)

const utf8Replacement = "�"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I am not sure if substituting this sequence of bytes is sufficient to fix these errors.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if substituting this sequence of bytes is sufficient to fix these errors.

Hmm, it should resolve any errors from invalid bytes being in the request or response body from the protobuf perspective? (at least I see this sample export successfully now)
Or are you thinking there are other cases where the replace wouldn't actually replace all invalid byte sequence?

If they appear in headers we'd still run into exception
(but not sure how likely that is for us to be concerned of right now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to understand what causes this invalid byte sequence.


// SetTruncatedBodyAttribute truncates the body and sets the body as a span attribute.
// When body is being truncated, we also add a second attribute suffixed by `.truncated` to
// make it clear to the user, body has been modified.
Expand Down Expand Up @@ -48,7 +52,12 @@ func SetBodyAttribute(attrName string, body []byte, truncated bool, span sdk.Spa
return
}

span.SetAttribute(attrName, string(body))
bodyStr := string(body)
if !utf8.ValidString(bodyStr) {
bodyStr = strings.ToValidUTF8(bodyStr, utf8Replacement)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a string conversion from bytes and then a utf8 replace. Might become an expensive one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check perf

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation should atleast use the right method without the conversion.
unicode/utf8 has Validate([]byte) bool -> https://pkg.go.dev/unicode/utf8#Valid
strings/utf8 has methods for string.
@ryanericson should we drop instead of correcting these?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, updating to use utf8.Valid method shows similar perf. (unfortunately not good in either case)
if !utf8.Valid(body) {
bodyStr = strings.ToValidUTF8(string(body), utf8Replacement)
}

BenchmarkSetBodyWithValidUtf8-12      	 1947141	       618.5 ns/op
BenchmarkSetBodyWithInvalidUtf8-12    	   70920	     18097 ns/op
func BenchmarkSetBodyWithValidUtf8(b *testing.B) {
	validUTF8 := bytes.Repeat([]byte("hello world "), 500)
	span := mock.NewSpan()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		SetBodyAttribute("http.request.body", validUTF8, false, span)
	}
}

func BenchmarkSetBodyWithInvalidUtf8(b *testing.B) {
	baseString := "hello world "
	invalidBytes := []byte{0xff, 0xfe, 0xfd}
	largeInvalidUTF8 := make([]byte, 0, 6000)

	for len(largeInvalidUTF8) < 5970 {
		largeInvalidUTF8 = append(largeInvalidUTF8, baseString...)
		largeInvalidUTF8 = append(largeInvalidUTF8, invalidBytes...)
	}
	span := mock.NewSpan()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		SetBodyAttribute("http.request.body", largeInvalidUTF8, false, span)
	}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this, the non-conversion one is still faster if we just look at valid utf's case. I think due the number of replacements needed in case of invalid one's, we do not see any advantage. This is on arm though.

import (
	"bytes"
	"strings"
	"testing"
	"unicode/utf8"
)

var length int

const utf8Replacement = "�"

func SetBodyAttribute1(body []byte) int {
	if len(body) == 0 {
		return 0
	}

	bodyStr := string(body)
	if !utf8.ValidString(bodyStr) {
		bodyStr = strings.ToValidUTF8(bodyStr, utf8Replacement)
	}
	return len(body)
}

func SetBodyAttribute2(body []byte) int {
	if len(body) == 0 {
		return 0
	}

	var bodyStr string
	if !utf8.Valid(body) {
		bodyStr = strings.ToValidUTF8(string(body), utf8Replacement)
	}

	return len(bodyStr)
}

func BenchmarkSetBodyAttribute1WithInvalidUtf8(b *testing.B) {
	baseString := "hello world "
	invalidBytes := []byte{0xff, 0xfe, 0xfd}
	largeInvalidUTF8 := make([]byte, 0, 6000)

	for len(largeInvalidUTF8) < 5970 {
		largeInvalidUTF8 = append(largeInvalidUTF8, baseString...)
		largeInvalidUTF8 = append(largeInvalidUTF8, invalidBytes...)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		length += SetBodyAttribute1(largeInvalidUTF8)
	}
}

func BenchmarkSetBodyAttribute2WithInvalidUtf8(b *testing.B) {
	baseString := "hello world "
	invalidBytes := []byte{0xff, 0xfe, 0xfd}
	largeInvalidUTF8 := make([]byte, 0, 6000)

	for len(largeInvalidUTF8) < 5970 {
		largeInvalidUTF8 = append(largeInvalidUTF8, baseString...)
		largeInvalidUTF8 = append(largeInvalidUTF8, invalidBytes...)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		length += SetBodyAttribute2(largeInvalidUTF8)
	}
}

func BenchmarkSetBodyAttribute1WithValidUtf8(b *testing.B) {
	validUTF8 := bytes.Repeat([]byte("hello world "), 500)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		length += SetBodyAttribute1(validUTF8)
	}
}

func BenchmarkSetBodyAttribute2WithValidUtf8(b *testing.B) {
	validUTF8 := bytes.Repeat([]byte("hello world "), 500)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		length += SetBodyAttribute2(validUTF8)
	}
}
ubuntu@ub22:~/ws_go/samples/goagent$ go test -bench=.
goos: linux
goarch: arm64
pkg: samples/goagent
BenchmarkSetBodyAttribute1WithInvalidUtf8-3        94282             13102 ns/op
BenchmarkSetBodyAttribute2WithInvalidUtf8-3        94063             12888 ns/op
BenchmarkSetBodyAttribute1WithValidUtf8-3        1000000              1380 ns/op
BenchmarkSetBodyAttribute2WithValidUtf8-3        2686004               445.3 ns/op
PASS
ok      samples/goagent 5.767s

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm thinking about this now, I don't like the perf implication that we need to do the validation at the cost of few cases where invalid bytes are there. @prodion23 let's just try to fix this at the source in case we somehow corrupted it ourselves. Truncation is one part and seems like otel SDK has a way to do UTF8 aware truncation now: open-telemetry/opentelemetry-go#3156.

Did you also observe that for valid UTF-8 chars (with foreign or multibyte chars) we're also seeing this error like @tim-mwangi says?

}

span.SetAttribute(attrName, bodyStr)
// if already truncated then set attribute
if truncated {
span.SetAttribute(fmt.Sprintf("%s.truncated", attrName), true)
Expand All @@ -63,7 +72,12 @@ func SetEncodedBodyAttribute(attrName string, body []byte, truncated bool, span
return
}

span.SetAttribute(attrName+".base64", base64.RawStdEncoding.EncodeToString(body))
bodyStr := string(body)
if !utf8.ValidString(bodyStr) {
bodyStr = strings.ToValidUTF8(bodyStr, utf8Replacement)
}

span.SetAttribute(attrName+".base64", base64.RawStdEncoding.EncodeToString([]byte(bodyStr)))
// if already truncated then set attribute
if truncated {
span.SetAttribute(fmt.Sprintf("%s.truncated", attrName), true)
Expand Down
18 changes: 18 additions & 0 deletions sdk/instrumentation/bodyattribute/bodyattribute_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -199,3 +199,21 @@ func TestSetEncodedBodyAttribute(t *testing.T) {
})
}
}

func TestSetBodyWithoutUtf8(t *testing.T) {
invalidUTF8 := []byte{'h', 'e', 'l', 'l', 'o', ' ', 0xff, 0xfe, 0xfd}
span := mock.NewSpan()
SetBodyAttribute("http.request.body", invalidUTF8, false, span)
value := span.ReadAttribute("http.request.body")
assert.Equal(t, value.(string), "hello �")
}

func TestSetB64BodyWithoutUtf8(t *testing.T) {
invalidUTF8 := []byte{'h', 'e', 'l', 'l', 'o', ' ', 0xff, 0xfe, 0xfd}
span := mock.NewSpan()
SetEncodedBodyAttribute("http.request.body", invalidUTF8, false, span)
value := span.ReadAttribute("http.request.body.base64")
decodedBytes, err := base64.StdEncoding.DecodeString(value.(string))
assert.NoError(t, err)
assert.Equal(t, string(decodedBytes), "hello �")
}