
[regression in master] Corrupted logs #43647

Closed · vvoland opened this issue May 26, 2022 · 13 comments
Labels: area/logging, kind/bug

vvoland (Contributor) commented May 26, 2022

Description
I found this while writing tests for getting the container logs:

The logs are corrupted when capturing both stdout and stderr at the same time.
This isn't an issue with reading the logs, because the corruption is already visible in the json log file.

Example of a corrupted JSON log file:
# cat /var/lib/docker/containers/babb35c51972046a575a42c414cf2ad7d58d270b9594a251fb6d59234ccc021a/babb35c51972046a575a42c414cf2ad7d58d270b9594a251fb6d59234ccc021a-json.log
{"log":"accidents happ"st,"stre"std"std,rr","time":"2022-05-26T07:43:19.448576002Z"}
{"log":"accidents happ"st,"stre"std"std,rr","time":"2022-05-26T07:43:19.44857621Z"}

First commit with the regression is ae5f664 (#43294)

The issue is that the marshalFunc used in LogFile uses a shared buffer. See:

With the mutex lock around WriteLogEntry removed in the linked commit, the marshaller can be called for stdout and stderr at the same time, which makes both write to the same buffer.

Allocating a new buffer inside the marshaller call fixes the problem, but that is probably not a good idea because it would cause too many memory allocations. Maybe we should restore the critical section around WriteLogEntry?
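
To illustrate the shape of the problem (a self-contained toy, not the actual LogFile code): two goroutines marshalling into one shared buffer. Running it with go run -race reports the data race, and the printed output can interleave much like the corrupted entries above.

package main

import (
	"fmt"
	"sync"
)

// sharedBuf stands in for the single buffer reused by the marshaller.
var sharedBuf = make([]byte, 0, 64)

// marshal writes a fake json-file entry into the shared buffer,
// mimicking a marshalFunc that reuses one buffer across calls.
func marshal(stream, line string) []byte {
	sharedBuf = sharedBuf[:0]
	sharedBuf = append(sharedBuf, `{"log":"`...)
	sharedBuf = append(sharedBuf, line...)
	sharedBuf = append(sharedBuf, `","stream":"`...)
	sharedBuf = append(sharedBuf, stream...)
	sharedBuf = append(sharedBuf, `"}`...)
	return sharedBuf
}

func main() {
	var wg sync.WaitGroup
	for _, stream := range []string{"stdout", "stderr"} {
		wg.Add(1)
		go func(stream string) { // both goroutines race on sharedBuf
			defer wg.Done()
			fmt.Printf("%s\n", marshal(stream, "accidents happen"))
		}(stream)
	}
	wg.Wait()
}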

Steps to reproduce the issue:
docker logs -f $(docker run -d busybox sh -c 'echo stdout; echo stderr >&2')

Describe the results you received:
Randomly one of:

  1. Correct output
  2. "stdout" printed twice
  3. "stderr" printed twice
  4. An error getting the logs, for example: error from daemon in stream: Error grabbing logs: json: cannot unmarshal string into Go value of type jsonlog.JSONLog

Describe the results you expected:
The command should print:

stdout
stderr
thaJeztah added this to the 22.06.0 milestone May 26, 2022
thaJeztah added the area/logging and kind/bug labels May 26, 2022
thaJeztah (Member) commented:

/cc @corhere @cpuguy83 ptal

thaJeztah (Member) commented:

@vvoland if you have a fix in mind, don't hesitate to open a draft PR (having some code / approach to discuss never hurts :-); worst case, the draft PR gets closed without merging.

corhere (Contributor) commented May 26, 2022

I'm leaning towards fixing it with a sync.Pool of buffers so that concurrent logging streams could be marshaled in parallel without having to allocate for every message. A sync.Pool is already being used for the logger.Message buffers so it wouldn't be a new addition to the hot path.
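
Roughly this shape (a minimal sketch, not the actual moby code): each call takes a *bytes.Buffer from a sync.Pool, marshals into it, and returns the buffer to the pool only after the write that consumes it has completed.

package main

import (
	"bytes"
	"fmt"
	"os"
	"sync"
)

// bufPool hands out reusable marshalling buffers so that concurrent
// streams neither share a single buffer nor allocate one per message.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// writeMu stands in for the lock LogFile takes around the actual file write.
var writeMu sync.Mutex

// logOne marshals one fake entry into a pooled buffer, writes it out under
// the lock, and only then returns the buffer to the pool.
func logOne(stream, line string) error {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()
	fmt.Fprintf(buf, `{"log":%q,"stream":%q}`+"\n", line, stream)

	writeMu.Lock()
	defer writeMu.Unlock()
	_, err := os.Stdout.Write(buf.Bytes())
	return err
}

func main() {
	var wg sync.WaitGroup
	for _, stream := range []string{"stdout", "stderr"} {
		wg.Add(1)
		go func(stream string) {
			defer wg.Done()
			_ = logOne(stream, "accidents happen")
		}(stream)
	}
	wg.Wait()
}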

vvoland (Contributor, author) commented May 26, 2022

Sounds good to me. I will prepare a PR if you don't mind :-).

corhere (Contributor) commented May 26, 2022

Sure, go ahead @vvoland. I was thinking about changing the signature of func (*LogFile) WriteLogEntry to take a pre-marshaled msg []byte (and a timestamp) and hoisting the marshaling into the individual log drivers' Log methods. That way the log drivers would be able to manage their pools internally, returning the buffers to their correct pools after the WriteLogEntry call returns.
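
A sketch of that shape with hypothetical, heavily simplified names (the real LogFile handles rotation, compression, readers, etc., all omitted here): WriteLogEntry takes the timestamp plus already-marshaled bytes, and the driver's Log method borrows a buffer from its own pool and returns it once the call comes back.

package logsketch

import (
	"bytes"
	"fmt"
	"io"
	"sync"
	"time"
)

// LogFile now only appends pre-marshaled entries; marshalling has been
// hoisted into the individual log drivers.
type LogFile struct {
	mu sync.Mutex
	w  io.Writer
}

// WriteLogEntry writes one already-marshaled entry under the file lock.
// (The timestamp would be used for rotation bookkeeping, omitted here.)
func (lf *LogFile) WriteLogEntry(timestamp time.Time, marshaled []byte) error {
	lf.mu.Lock()
	defer lf.mu.Unlock()
	_, err := lf.w.Write(marshaled)
	return err
}

// Message is a stand-in for logger.Message.
type Message struct {
	Line      []byte
	Source    string
	Timestamp time.Time
}

// driver is a stand-in log driver that owns its marshalling buffer pool.
type driver struct {
	file *LogFile
	pool sync.Pool // of *bytes.Buffer
}

func newDriver(w io.Writer) *driver {
	d := &driver{file: &LogFile{w: w}}
	d.pool.New = func() interface{} { return new(bytes.Buffer) }
	return d
}

// Log marshals into a pooled buffer and returns the buffer to its pool
// only after WriteLogEntry has consumed the bytes.
func (d *driver) Log(msg *Message) error {
	buf := d.pool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		d.pool.Put(buf)
	}()
	fmt.Fprintf(buf, `{"log":%q,"stream":%q,"time":%q}`+"\n",
		msg.Line, msg.Source, msg.Timestamp.UTC().Format(time.RFC3339Nano))
	return d.file.WriteLogEntry(msg.Timestamp, buf.Bytes())
}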

vvoland (Contributor, author) commented May 26, 2022

How about leaving the WriteLogEntry as is and changing the logger.MarshalFunc to additionally take *bytes.Buffer instead? The pool would be managed by the LogFile. That way we wouldn't have to implement pool management for every driver separately.
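
Something along these lines, as a hypothetical, heavily simplified sketch (the real LogFile also carries rotation state and timestamps, omitted here):

package logsketch

import (
	"bytes"
	"sync"
)

// Message is a stand-in for logger.Message.
type Message struct {
	Line   []byte
	Source string
}

// MarshalFunc now receives the destination buffer instead of returning a []byte.
type MarshalFunc func(msg *Message, buf *bytes.Buffer) error

// LogFile owns both the marshaller and the buffer pool, so individual
// drivers don't have to manage buffers at all.
type LogFile struct {
	mu      sync.Mutex
	out     *bytes.Buffer // stand-in for the underlying log file
	marshal MarshalFunc
	bufPool sync.Pool // of *bytes.Buffer
}

func NewLogFile(out *bytes.Buffer, marshal MarshalFunc) *LogFile {
	lf := &LogFile{out: out, marshal: marshal}
	lf.bufPool.New = func() interface{} { return new(bytes.Buffer) }
	return lf
}

// WriteLogEntry hands the marshaller a pooled buffer, writes the result
// under the lock, and returns the buffer to the pool afterwards.
func (lf *LogFile) WriteLogEntry(msg *Message) error {
	buf := lf.bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		lf.bufPool.Put(buf)
	}()
	if err := lf.marshal(msg, buf); err != nil {
		return err
	}
	lf.mu.Lock()
	defer lf.mu.Unlock()
	_, err := lf.out.Write(buf.Bytes())
	return err
}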

corhere (Contributor) commented May 26, 2022

How about leaving the WriteLogEntry as is and changing the logger.MarshalFunc to additionally take *bytes.Buffer instead?

That could work, but it would not be too friendly to the local implementation, which wants to work with a raw []byte buffer. Pushing management of the Pool into each driver separately affords the drivers the flexibility to allocate, manage, and use their Pools in whichever way works best for their particular needs.

func makeMarshaller() func(m *logger.Message) ([]byte, error) {
	buf := make([]byte, initialBufSize)
	// allocate the partial log entry separately, which allows for easier re-use
	proto := &logdriver.LogEntry{}
	md := &logdriver.PartialLogEntryMetadata{}
	return func(m *logger.Message) ([]byte, error) {
		resetProto(proto)
		messageToProto(m, proto, md)
		protoSize := proto.Size()
		writeLen := protoSize + (2 * encodeBinaryLen) // + len(messageDelimiter)
		if writeLen > len(buf) {
			buf = make([]byte, writeLen)
		} else {
			// shrink the buffer back down
			if writeLen <= initialBufSize {
				buf = buf[:initialBufSize]
			} else {
				buf = buf[:writeLen]
			}
		}
		binary.BigEndian.PutUint32(buf[:encodeBinaryLen], uint32(protoSize))
		n, err := proto.MarshalTo(buf[encodeBinaryLen:writeLen])
		if err != nil {
			return nil, errors.Wrap(err, "error marshaling log entry")
		}
		if n+(encodeBinaryLen*2) != writeLen {
			return nil, io.ErrShortWrite
		}
		binary.BigEndian.PutUint32(buf[writeLen-encodeBinaryLen:writeLen], uint32(protoSize))
		return buf[:writeLen], nil
	}
}

corhere (Contributor) commented May 26, 2022

One interesting observation is that func marshalMessage in the jsonfilelog driver doesn't really depend on *bytes.Buffer. It could work just as well with any value that satisfies:

interface {
	Write([]byte) (int, error)
	WriteString(string) (int, error)
	WriteByte(byte) error
}

*bufio.Writer also satisfies that interface, so there might be a variation on your idea of the log driver writing to a LogFile-provided location which would let the jsonfilelog driver write through to the file without necessarily having to buffer the whole message in memory first! That doesn't help the local driver, though; forcing it to use that interface would impose an extra buffer-to-buffer copy. Perhaps LogFile could have facilities for both buffer-oriented and writer-oriented drivers? E.g.:

func (*LogFile) WriteLogEntry(time.Time, func(ThatWriterInterfaceAbove) error) error
func (*LogFile) WriteLogEntryBytes(time.Time, []byte) error

Log drivers using WriteLogEntryBytes, like local, would still need to manage their own buffer pools, but ones using that hypothetical version of WriteLogEntry, like jsonfilelog, would not. LogFile wouldn't need its own *bufio.Writer pool; it could retain a private instance for each LogFile instance or take from the global pools.BufioWriter32KPool.
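
A rough sketch of what those two entry points could look like (all names are hypothetical; the sketch writes to a plain io.Writer and flushes per entry, with rotation and the shared buffer pools left out):

package logsketch

import (
	"bufio"
	"io"
	"sync"
	"time"
)

// entryWriter is the minimal surface the jsonfilelog marshaller needs;
// both *bytes.Buffer and *bufio.Writer satisfy it.
type entryWriter interface {
	Write([]byte) (int, error)
	WriteString(string) (int, error)
	WriteByte(byte) error
}

// LogFile keeps one private bufio.Writer per instance in this sketch.
type LogFile struct {
	mu sync.Mutex
	bw *bufio.Writer
}

func NewLogFile(w io.Writer) *LogFile {
	return &LogFile{bw: bufio.NewWriterSize(w, 32*1024)}
}

// WriteLogEntry is the writer-oriented entry point: the driver streams the
// marshaled entry straight into the buffered writer under the lock, without
// building the whole message in its own buffer first.
func (lf *LogFile) WriteLogEntry(timestamp time.Time, marshal func(entryWriter) error) error {
	lf.mu.Lock()
	defer lf.mu.Unlock()
	if err := marshal(lf.bw); err != nil {
		return err
	}
	return lf.bw.Flush()
}

// WriteLogEntryBytes is the buffer-oriented entry point for drivers like
// local that already hold a fully marshaled []byte from their own pool.
func (lf *LogFile) WriteLogEntryBytes(timestamp time.Time, marshaled []byte) error {
	lf.mu.Lock()
	defer lf.mu.Unlock()
	if _, err := lf.bw.Write(marshaled); err != nil {
		return err
	}
	return lf.bw.Flush()
}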

vvoland (Contributor, author) commented May 26, 2022

That sounds good. On the other hand, given that there are only 2 drivers that use WriteLogEntry, maybe we shouldn't overcomplicate the LogFile API :)
I will finish the code tomorrow, try it out, and see how it looks.
Thanks!

corhere (Contributor) commented May 30, 2022

Given that the root cause is a data race, it should be possible to write a unit test which would catch future regressions when run with go test -race. I think such a test would be a good addition to the LogReader test suite. Would you be interested in taking on that task @vvoland? Otherwise I can take it.
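
Something like this, as a sketch: newTestLogger here is a placeholder stub; a real test would construct the actual driver under test (e.g. a jsonfilelog logger writing into a temp dir) and read the entries back to check they are intact. The point is simply to drive Log from several goroutines so that go test -race can observe the old data race.

package loggertest

import (
	"fmt"
	"sync"
	"testing"
	"time"
)

// Message and Logger mirror the shape of the daemon's logger types just
// closely enough for this sketch.
type Message struct {
	Line      []byte
	Source    string
	Timestamp time.Time
}

type Logger interface {
	Log(*Message) error
}

// newTestLogger is a placeholder; a real test would construct the driver
// under test here.
func newTestLogger(t *testing.T) Logger {
	t.Helper()
	return nopLogger{}
}

type nopLogger struct{}

func (nopLogger) Log(*Message) error { return nil }

// TestConcurrentLogging logs from two goroutines (one per stream) so that
// `go test -race` can catch unsynchronized access in the driver.
func TestConcurrentLogging(t *testing.T) {
	l := newTestLogger(t)

	const perStream = 100
	var wg sync.WaitGroup
	for _, source := range []string{"stdout", "stderr"} {
		wg.Add(1)
		go func(source string) {
			defer wg.Done()
			for i := 0; i < perStream; i++ {
				msg := &Message{
					Line:      []byte(fmt.Sprintf("%s line %d", source, i)),
					Source:    source,
					Timestamp: time.Now(),
				}
				if err := l.Log(msg); err != nil {
					t.Error(err)
					return
				}
			}
		}(source)
	}
	wg.Wait()

	// A real test would now read everything back through the log reader
	// and assert that every entry is well-formed and matches what was sent.
}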

vvoland (Contributor, author) commented May 31, 2022

Sure, will make a PR soon.

vvoland (Contributor, author) commented May 31, 2022

@corhere I included this test with the ones that led me to discover this regression. I closed the old PR and opened a new one for clarity:

thaJeztah (Member) commented:

This should be resolved now. There's still an issue with specific Windows images, but that's not a bug in our code; it's an issue with those images. See #43669 (comment).
