Replace non-utf8 characters in bodies #228
Conversation
span.SetAttribute(attrName, string(body))
bodyStr := string(body)
if !utf8.ValidString(bodyStr) {
	bodyStr = strings.ToValidUTF8(bodyStr, utf8Replacement)
There is a string conversion from bytes and then a UTF-8 replace. That might become expensive.
I'll check perf
Validation should at least use the right method, without the conversion.
unicode/utf8 has Valid(p []byte) bool -> https://pkg.go.dev/unicode/utf8#Valid
The same package has ValidString for strings.
@ryanericson should we drop these instead of correcting them?
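For illustration, a minimal sketch of that no-conversion check, keeping only the validation/replacement logic (the surrounding span/attribute plumbing is left out, and the function name is just a placeholder):

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const utf8Replacement = "\uFFFD" // U+FFFD, the same replacement character used in the PR

// sanitizeBody checks the raw bytes with utf8.Valid (no string conversion needed
// for the check) and only runs the replacement when the body is actually invalid.
func sanitizeBody(body []byte) string {
	if utf8.Valid(body) {
		return string(body) // single conversion on the common, valid path
	}
	return strings.ToValidUTF8(string(body), utf8Replacement)
}

func main() {
	fmt.Println(sanitizeBody([]byte("hello")))                   // hello
	fmt.Println(sanitizeBody([]byte{'h', 'i', 0xff, 0xfe, '!'})) // hi�! (the invalid run is collapsed)
}
```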
Okay, updating to use the utf8.Valid method shows similar perf (unfortunately not good in either case):
if !utf8.Valid(body) {
	bodyStr = strings.ToValidUTF8(string(body), utf8Replacement)
}
BenchmarkSetBodyWithValidUtf8-12 1947141 618.5 ns/op
BenchmarkSetBodyWithInvalidUtf8-12 70920 18097 ns/op
func BenchmarkSetBodyWithValidUtf8(b *testing.B) {
	validUTF8 := bytes.Repeat([]byte("hello world "), 500)
	span := mock.NewSpan()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		SetBodyAttribute("http.request.body", validUTF8, false, span)
	}
}

func BenchmarkSetBodyWithInvalidUtf8(b *testing.B) {
	baseString := "hello world "
	invalidBytes := []byte{0xff, 0xfe, 0xfd}
	largeInvalidUTF8 := make([]byte, 0, 6000)
	for len(largeInvalidUTF8) < 5970 {
		largeInvalidUTF8 = append(largeInvalidUTF8, baseString...)
		largeInvalidUTF8 = append(largeInvalidUTF8, invalidBytes...)
	}
	span := mock.NewSpan()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		SetBodyAttribute("http.request.body", largeInvalidUTF8, false, span)
	}
}
I tried this; the non-conversion one is still faster if we only look at the valid UTF-8 case. I think because of the number of replacements needed in the invalid case, we don't see any advantage there. This is on ARM though.
import (
	"bytes"
	"strings"
	"testing"
	"unicode/utf8"
)

// package-level sink; the benchmarks add into it so the calls can't be optimized away
var length int

const utf8Replacement = "�"

// variant 1: convert to string first, then validate with utf8.ValidString
func SetBodyAttribute1(body []byte) int {
	if len(body) == 0 {
		return 0
	}
	bodyStr := string(body)
	if !utf8.ValidString(bodyStr) {
		bodyStr = strings.ToValidUTF8(bodyStr, utf8Replacement)
	}
	return len(body)
}

// variant 2: validate the raw bytes with utf8.Valid and only convert when a replacement is needed
func SetBodyAttribute2(body []byte) int {
	if len(body) == 0 {
		return 0
	}
	var bodyStr string
	if !utf8.Valid(body) {
		bodyStr = strings.ToValidUTF8(string(body), utf8Replacement)
	}
	return len(bodyStr)
}

func BenchmarkSetBodyAttribute1WithInvalidUtf8(b *testing.B) {
	baseString := "hello world "
	invalidBytes := []byte{0xff, 0xfe, 0xfd}
	largeInvalidUTF8 := make([]byte, 0, 6000)
	for len(largeInvalidUTF8) < 5970 {
		largeInvalidUTF8 = append(largeInvalidUTF8, baseString...)
		largeInvalidUTF8 = append(largeInvalidUTF8, invalidBytes...)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		length += SetBodyAttribute1(largeInvalidUTF8)
	}
}

func BenchmarkSetBodyAttribute2WithInvalidUtf8(b *testing.B) {
	baseString := "hello world "
	invalidBytes := []byte{0xff, 0xfe, 0xfd}
	largeInvalidUTF8 := make([]byte, 0, 6000)
	for len(largeInvalidUTF8) < 5970 {
		largeInvalidUTF8 = append(largeInvalidUTF8, baseString...)
		largeInvalidUTF8 = append(largeInvalidUTF8, invalidBytes...)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		length += SetBodyAttribute2(largeInvalidUTF8)
	}
}

func BenchmarkSetBodyAttribute1WithValidUtf8(b *testing.B) {
	validUTF8 := bytes.Repeat([]byte("hello world "), 500)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		length += SetBodyAttribute1(validUTF8)
	}
}

func BenchmarkSetBodyAttribute2WithValidUtf8(b *testing.B) {
	validUTF8 := bytes.Repeat([]byte("hello world "), 500)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		length += SetBodyAttribute2(validUTF8)
	}
}
ubuntu@ub22:~/ws_go/samples/goagent$ go test -bench=.
goos: linux
goarch: arm64
pkg: samples/goagent
BenchmarkSetBodyAttribute1WithInvalidUtf8-3 94282 13102 ns/op
BenchmarkSetBodyAttribute2WithInvalidUtf8-3 94063 12888 ns/op
BenchmarkSetBodyAttribute1WithValidUtf8-3 1000000 1380 ns/op
BenchmarkSetBodyAttribute2WithValidUtf8-3 2686004 445.3 ns/op
PASS
ok samples/goagent 5.767s
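Since allocations are part of the concern here, the same benchmarks can also be run with memory stats so B/op and allocs/op show up next to the timings:

```
go test -bench=. -benchmem
```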
Hmm, thinking about this now, I don't like the perf implication of validating every body just for the few cases where invalid bytes are present. @prodion23 let's just try to fix this at the source, in case we somehow corrupted it ourselves. Truncation is one part, and it seems the OTel SDK now has a way to do UTF-8-aware truncation: open-telemetry/opentelemetry-go#3156.
Did you also observe that we see this error even for valid UTF-8 (with foreign or multibyte characters), as @tim-mwangi says?
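For context, a minimal hand-rolled sketch of what UTF-8-aware truncation means in practice (this is not the OTel SDK's implementation, just the idea of cutting only on rune boundaries):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateUTF8 cuts s to at most maxBytes bytes without splitting a rune:
// if the limit lands inside a multibyte character, that whole character is dropped.
func truncateUTF8(s string, maxBytes int) string {
	if len(s) <= maxBytes {
		return s
	}
	// Walk back from the cut point until we hit the start byte of a rune.
	cut := maxBytes
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	s := "héllo" // 'é' is 2 bytes, so the string is 6 bytes long
	fmt.Println(truncateUTF8(s, 2)) // "h" – cutting at 2 would split 'é'
	fmt.Println(truncateUTF8(s, 3)) // "hé"
}
```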
"github.com/hypertrace/goagent/sdk"
)

const utf8Replacement = "�"
Also, I am not sure if substituting this sequence of bytes is sufficient to fix these errors.
> I am not sure if substituting this sequence of bytes is sufficient to fix these errors.

Hmm, it should resolve any errors caused by invalid bytes in the request or response body, from the protobuf perspective (at least I see this sample export successfully now).
Or are you thinking there are other cases where the replace wouldn't actually catch all invalid byte sequences?
If they appear in headers we'd still run into the exception, but I'm not sure how likely that is to be a concern for us right now.
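If headers ever did need the same guard, the replacement could be applied to their values as well; a rough sketch, assuming a plain map[string][]string header shape rather than the SDK's actual types:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const utf8Replacement = "\uFFFD"

// sanitizeHeaders returns a copy of h with any invalid UTF-8 in the values replaced.
// Valid values pass through untouched, so the common case does no extra work.
func sanitizeHeaders(h map[string][]string) map[string][]string {
	out := make(map[string][]string, len(h))
	for k, vs := range h {
		clean := make([]string, len(vs))
		for i, v := range vs {
			if utf8.ValidString(v) {
				clean[i] = v
			} else {
				clean[i] = strings.ToValidUTF8(v, utf8Replacement)
			}
		}
		out[k] = clean
	}
	return out
}

func main() {
	h := map[string][]string{"X-Raw": {string([]byte{'o', 'k', 0xff})}}
	fmt.Println(sanitizeHeaders(h)) // map[X-Raw:[ok�]]
}
```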
I am trying to understand what causes this invalid byte sequence.
"github.com/hypertrace/goagent/sdk"
)

const utf8Replacement = ""
remove
lgtm
Need to look at that codecov upload issue
@prodion23
We are relying on the truncation flag. In eBPF we are not always able to say correctly whether truncation has happened. We may need to look at that to ensure the truncation flag is passed correctly.
}

for idx := startIndex; idx < maxBytes; {
	_, size := utf8.DecodeRune(b[idx:])
What is our assumption here? max-4 can land us in the middle of the previous rune, and when we try DecodeRune from that index it may return utf8.RuneError, which we are ignoring.
So, are we just trying to make sure that we correct the last rune if it got corrupted because of truncation?
I am fine with that, just trying to make sure I understand the behaviour well, and that we add comments around the assumption/behaviour.
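To make the assumption concrete, here is a standalone sketch of that forward scan (startIndex and maxBytes are stand-ins, not the PR's actual values): the loop stops before any rune that would cross the limit, so the last rune is never split, while a stray invalid byte decodes as (utf8.RuneError, 1) and is simply stepped over.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lastRuneBoundary scans forward rune by rune from startIndex and returns the
// offset where the scan stops: either maxBytes itself, or just before a rune
// that would cross it. A stray invalid byte decodes as (utf8.RuneError, 1)
// and only advances the index by one.
func lastRuneBoundary(b []byte, startIndex, maxBytes int) int {
	if maxBytes > len(b) {
		maxBytes = len(b)
	}
	idx := startIndex
	for idx < maxBytes {
		_, size := utf8.DecodeRune(b[idx:])
		if idx+size > maxBytes {
			break // this rune would cross the limit, so stop in front of it
		}
		idx += size
	}
	return idx
}

func main() {
	b := []byte("ab€cd") // '€' occupies bytes 2, 3 and 4
	cut := lastRuneBoundary(b, 0, 4)
	fmt.Println(cut, string(b[:cut])) // 2 "ab" – the '€' would end at byte 5, past the limit
}
```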
Hey @prodion23 we should also look into the codecov issue. I do not know how the other hypertrace repos handle it.
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
+ Coverage   57.53%   57.80%   +0.26%
==========================================
  Files          69       69
  Lines        2720     2737      +17
==========================================
+ Hits         1565     1582      +17
  Misses       1085     1085
  Partials       70       70

☔ View full report in Codecov by Sentry.
@tim-mwangi the codecov issue is resolved; we need to pass the token now.
@@ -1,5 +1,3 @@
//go:build ignore |
put this back since it's in the examples folder.
This handles the case where invalid UTF-8 is introduced because truncation occurs in the middle of a multibyte character. Truncation now takes the character at the cut point into account: if its length would put us over the truncation limit, it is omitted.
(The new unit test shows an example where a rune at bytes 21-24 would put us over the limit of 23, so its first 2 bytes are omitted and we only capture up to byte 21.)
Attached perf results of this approach vs the previous
someBodyString[:maxLength]
at the bottom of this PR.
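Roughly the scenario described for the unit test, as a self-contained sketch contrasting the plain slice with a rune-boundary-aware cut (offsets are chosen to mirror the description; the real test may differ):

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

func main() {
	// 21 one-byte characters followed by '€'; the 3-byte rune occupies bytes 21-23,
	// so keeping all of it would need 24 bytes.
	body := strings.Repeat("a", 21) + "€"
	const maxLength = 23

	// Naive truncation splits the rune and leaves invalid UTF-8 behind.
	naive := body[:maxLength]
	fmt.Println(len(naive), utf8.ValidString(naive)) // 23 false

	// Rune-aware truncation: walk back to the start of the rune the cut landed in.
	cut := maxLength
	for cut > 0 && !utf8.RuneStart(body[cut]) {
		cut--
	}
	fmt.Println(cut, utf8.ValidString(body[:cut])) // 21 true – the partial '€' is dropped
}
```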