Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace non-utf8 characters in bodies #228

Merged
merged 8 commits into from
May 9, 2024

Conversation

prodion23
Copy link
Collaborator

@prodion23 prodion23 commented Apr 26, 2024

This handles case where invalid utf8 is introduced because of truncation occuring in the middle of multibyte characters. Truncation now takes into account the character where truncation will occur, if it's length would put us over the truncation limit it is omitted.
(new unit test show example where rune at char 21-24 would put us over limit of 23, so the first 2 bytes are omitted and we only capture up to byte 21)

Attached perf results of this approach vs previous someBodyString[:maxLength] at bottom of this PR

span.SetAttribute(attrName, string(body))
bodyStr := string(body)
if !utf8.ValidString(bodyStr) {
bodyStr = strings.ToValidUTF8(bodyStr, utf8Replacement)
Copy link
Contributor

@puneet-traceable puneet-traceable Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a string conversion from bytes and then a utf8 replace. Might become an expensive one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check perf

Copy link
Contributor

@puneet-traceable puneet-traceable Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation should atleast use the right method without the conversion.
unicode/utf8 has Validate([]byte) bool -> https://pkg.go.dev/unicode/utf8#Valid
strings/utf8 has methods for string.
@ryanericson should we drop instead of correcting these?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, updating to use utf8.Valid method shows similar perf. (unfortunately not good in either case)
if !utf8.Valid(body) {
bodyStr = strings.ToValidUTF8(string(body), utf8Replacement)
}

BenchmarkSetBodyWithValidUtf8-12      	 1947141	       618.5 ns/op
BenchmarkSetBodyWithInvalidUtf8-12    	   70920	     18097 ns/op
func BenchmarkSetBodyWithValidUtf8(b *testing.B) {
	validUTF8 := bytes.Repeat([]byte("hello world "), 500)
	span := mock.NewSpan()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		SetBodyAttribute("http.request.body", validUTF8, false, span)
	}
}

func BenchmarkSetBodyWithInvalidUtf8(b *testing.B) {
	baseString := "hello world "
	invalidBytes := []byte{0xff, 0xfe, 0xfd}
	largeInvalidUTF8 := make([]byte, 0, 6000)

	for len(largeInvalidUTF8) < 5970 {
		largeInvalidUTF8 = append(largeInvalidUTF8, baseString...)
		largeInvalidUTF8 = append(largeInvalidUTF8, invalidBytes...)
	}
	span := mock.NewSpan()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		SetBodyAttribute("http.request.body", largeInvalidUTF8, false, span)
	}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this, the non-conversion one is still faster if we just look at valid utf's case. I think due the number of replacements needed in case of invalid one's, we do not see any advantage. This is on arm though.

import (
	"bytes"
	"strings"
	"testing"
	"unicode/utf8"
)

var length int

const utf8Replacement = "�"

func SetBodyAttribute1(body []byte) int {
	if len(body) == 0 {
		return 0
	}

	bodyStr := string(body)
	if !utf8.ValidString(bodyStr) {
		bodyStr = strings.ToValidUTF8(bodyStr, utf8Replacement)
	}
	return len(body)
}

func SetBodyAttribute2(body []byte) int {
	if len(body) == 0 {
		return 0
	}

	var bodyStr string
	if !utf8.Valid(body) {
		bodyStr = strings.ToValidUTF8(string(body), utf8Replacement)
	}

	return len(bodyStr)
}

func BenchmarkSetBodyAttribute1WithInvalidUtf8(b *testing.B) {
	baseString := "hello world "
	invalidBytes := []byte{0xff, 0xfe, 0xfd}
	largeInvalidUTF8 := make([]byte, 0, 6000)

	for len(largeInvalidUTF8) < 5970 {
		largeInvalidUTF8 = append(largeInvalidUTF8, baseString...)
		largeInvalidUTF8 = append(largeInvalidUTF8, invalidBytes...)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		length += SetBodyAttribute1(largeInvalidUTF8)
	}
}

func BenchmarkSetBodyAttribute2WithInvalidUtf8(b *testing.B) {
	baseString := "hello world "
	invalidBytes := []byte{0xff, 0xfe, 0xfd}
	largeInvalidUTF8 := make([]byte, 0, 6000)

	for len(largeInvalidUTF8) < 5970 {
		largeInvalidUTF8 = append(largeInvalidUTF8, baseString...)
		largeInvalidUTF8 = append(largeInvalidUTF8, invalidBytes...)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		length += SetBodyAttribute2(largeInvalidUTF8)
	}
}

func BenchmarkSetBodyAttribute1WithValidUtf8(b *testing.B) {
	validUTF8 := bytes.Repeat([]byte("hello world "), 500)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		length += SetBodyAttribute1(validUTF8)
	}
}

func BenchmarkSetBodyAttribute2WithValidUtf8(b *testing.B) {
	validUTF8 := bytes.Repeat([]byte("hello world "), 500)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		length += SetBodyAttribute2(validUTF8)
	}
}
ubuntu@ub22:~/ws_go/samples/goagent$ go test -bench=.
goos: linux
goarch: arm64
pkg: samples/goagent
BenchmarkSetBodyAttribute1WithInvalidUtf8-3        94282             13102 ns/op
BenchmarkSetBodyAttribute2WithInvalidUtf8-3        94063             12888 ns/op
BenchmarkSetBodyAttribute1WithValidUtf8-3        1000000              1380 ns/op
BenchmarkSetBodyAttribute2WithValidUtf8-3        2686004               445.3 ns/op
PASS
ok      samples/goagent 5.767s

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm thinking about this now, I don't like the perf implication that we need to do the validation at the cost of few cases where invalid bytes are there. @prodion23 let's just try to fix this at the source in case we somehow corrupted it ourselves. Truncation is one part and seems like otel SDK has a way to do UTF8 aware truncation now: open-telemetry/opentelemetry-go#3156.

Did you also observe that for valid UTF-8 chars (with foreign or multibyte chars) we're also seeing this error like @tim-mwangi says?


"github.com/hypertrace/goagent/sdk"
)

const utf8Replacement = "�"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I am not sure if substituting this sequence of bytes is sufficient to fix these errors.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if substituting this sequence of bytes is sufficient to fix these errors.

Hmm, it should resolve any errors from invalid bytes being in the request or response body from the protobuf perspective? (at least I see this sample export successfully now)
Or are you thinking there are other cases where the replace wouldn't actually replace all invalid byte sequence?

If they appear in headers we'd still run into exception
(but not sure how likely that is for us to be concerned of right now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to understand what causes this invalid byte sequence.

@prodion23
Copy link
Collaborator Author

func BenchmarkNewTruncateSetBodyWithValidUtf8(b *testing.B) {
	span := mock.NewSpan()
	b.ResetTimer()
	starter := []byte("世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界")
	var input []byte
	for len(input) < 5970 {
		input = append(input, starter...)
	}
	for i := 0; i < b.N; i++ {
		SetTruncatedBodyAttribute("http.request.body", input, 5960, span, true)
	}
}

func BenchmarkOldTruncateSetBodyWithValidUtf8(b *testing.B) {
	span := mock.NewSpan()
	b.ResetTimer()
	starter := []byte("世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界")
	var input []byte
	for len(input) < 5970 {
		input = append(input, starter...)
	}
	for i := 0; i < b.N; i++ {
		SetTruncatedBodyAttribute("http.request.body", input, 5960, span, false)
	}
}
BenchmarkNewTruncateSetBodyWithValidUtf8
BenchmarkNewTruncateSetBodyWithValidUtf8-12    	  740908	      2143 ns/op
BenchmarkOldTruncateSetBodyWithValidUtf8
BenchmarkOldTruncateSetBodyWithValidUtf8-12    	  679801	      1777 ns/op


"github.com/hypertrace/goagent/sdk"
)

const utf8Replacement = ""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Copy link
Collaborator

@tim-mwangi tim-mwangi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Collaborator

@tim-mwangi tim-mwangi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to look at that codecov upload issue

@puneet-traceable
Copy link
Contributor

puneet-traceable commented May 2, 2024

func BenchmarkNewTruncateSetBodyWithValidUtf8(b *testing.B) {
	span := mock.NewSpan()
	b.ResetTimer()
	starter := []byte("世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界")
	var input []byte
	for len(input) < 5970 {
		input = append(input, starter...)
	}
	for i := 0; i < b.N; i++ {
		SetTruncatedBodyAttribute("http.request.body", input, 5960, span, true)
	}
}

func BenchmarkOldTruncateSetBodyWithValidUtf8(b *testing.B) {
	span := mock.NewSpan()
	b.ResetTimer()
	starter := []byte("世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界")
	var input []byte
	for len(input) < 5970 {
		input = append(input, starter...)
	}
	for i := 0; i < b.N; i++ {
		SetTruncatedBodyAttribute("http.request.body", input, 5960, span, false)
	}
}
BenchmarkNewTruncateSetBodyWithValidUtf8
BenchmarkNewTruncateSetBodyWithValidUtf8-12    	  740908	      2143 ns/op
BenchmarkOldTruncateSetBodyWithValidUtf8
BenchmarkOldTruncateSetBodyWithValidUtf8-12    	  679801	      1777 ns/op

@prodion23 b.ResetTimer() should have be after the input is ready.

@puneet-traceable
Copy link
Contributor

We are relying on truncation flag. in eBPF we are not always able to correctly say that truncation has happened. We may need to look at that to ensure that truncation flag is passed correctly.

}

for idx := startIndex; idx < maxBytes; {
_, size := utf8.DecodeRune(b[idx:])
Copy link
Contributor

@puneet-traceable puneet-traceable May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is our assumption here? max-4 can land us in the middle of previous rune and when we try DecodeRune from that index, it may give runeError which we are ignoring.
So, are we just trying to make sure that we correct the last rune if it got corrupted because of truncation.
I am fine with that, just trying to make sure I understand the behaviour well and we add comments around the assumption/behaviour.

@tim-mwangi
Copy link
Collaborator

Hey @prodion23 we should also look into the codecov issue. I do not know how the other hypertrace repos handle it.

Copy link

codecov bot commented May 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 57.80%. Comparing base (87ab698) to head (37c9426).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
+ Coverage   57.53%   57.80%   +0.26%     
==========================================
  Files          69       69              
  Lines        2720     2737      +17     
==========================================
+ Hits         1565     1582      +17     
  Misses       1085     1085              
  Partials       70       70              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@prodion23
Copy link
Collaborator Author

@tim-mwangi codecov issue is resolved, need to pass token now

@@ -1,5 +1,3 @@
//go:build ignore
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this back since it's in the examples folder.

@prodion23 prodion23 merged commit 7a089ab into main May 9, 2024
6 of 7 checks passed
@tim-mwangi tim-mwangi deleted the prevent-non-utf8-compat-bodies branch May 17, 2024 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants