Fix awslogs batch size calculation #35726

Merged
merged 1 commit into moby:master on Dec 19, 2017

Conversation

@jahkeup
Contributor

jahkeup commented Dec 7, 2017

Fixes #35725

- What I did

  • Added a type to encapsulate the batch and its associated size counter.
  • Cleaned up types and declarations for code cleanliness

- How I did it

The added type holds the batching mechanism and the counter used for determining whether or not to submit a batch of log events.
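
For illustration, here is a minimal sketch of what such a type can look like. The field names, constants, and the `wrappedEvent` stub below are illustrative assumptions, not the exact code from this PR:

```go
// Sketch only: pairs the buffered events with the running byte count so the
// counter can no longer drift out of sync with the batch contents.
package awslogs

const (
	perEventBytes      = 26      // per-event overhead CloudWatch Logs adds to the payload
	maximumBytesPerPut = 1048576 // upper bound for a single PutLogEvents payload
)

// wrappedEvent stands in for the driver's event wrapper in this sketch.
type wrappedEvent struct {
	message string
}

// eventBatch keeps the events and the byte counter together.
type eventBatch struct {
	events []wrappedEvent
	bytes  int
}

// add appends the event only if it still fits within the batch limit, and
// updates the byte counter in the same step.
func (b *eventBatch) add(event wrappedEvent, size int) bool {
	if b.bytes+size+perEventBytes > maximumBytesPerPut {
		return false
	}
	b.bytes += size + perEventBytes
	b.events = append(b.events, event)
	return true
}
```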

- How to verify it

The change can be verified by using the steps from #35725 to reproduce the issue and then comparing that behavior against a build from this changeset. The appropriate test and test mock have been updated to validate this behavior.

- Description for the changelog
awslogs: fix batch size calculation for large logs

- A picture of a cute animal (not mandatory but encouraged)
https://www.reddit.com/r/aww/comments/7hzq4m

cc @samuelkarp @adnxn

@thaJeztah

Member

thaJeztah commented Dec 7, 2017

ping @anusha-ragunathan PTAL

@jahkeup

Contributor

jahkeup commented Dec 7, 2017

I'm investigating these errors now; I didn't have these same errors when running tests locally.

@jahkeup

Contributor

jahkeup commented Dec 7, 2017

There's a race condition between the mock and the tests; I'm going to continue narrowing it down and fix it.

@jahkeup

Contributor

jahkeup commented Dec 8, 2017

Looks like Jenkins failed to get the janky and powerpc tests up and running. Can we restart those tests?

@thaJeztah

Member

thaJeztah commented Dec 8, 2017

Restarted 👍

@thaJeztah

Member

thaJeztah commented Dec 8, 2017

@jahkeup if you need to make additional changes, can you remove the `fixes #xxx` from the commit message and move it to the PR's description? GitHub can become noisy if such commits are cherry-picked or people fork the repo and do merges, etc. (I know, I know) 😅

@jahkeup

Contributor

jahkeup commented Dec 8, 2017

@thaJeztah no problem! I've removed the 'fixes' reference and made a mental note not to include it in the future 👍

@anusha-ragunathan PTAL, I tested the latest set of changes locally with ~500 rounds to see if the test flaked out; it seems stable now and is passing tests.

@samuelkarp

Contributor

samuelkarp commented Dec 11, 2017

I'm not wild about how you've reorganized the code here. Previously, the code was intended to be ordered such that the caller was above the callee, or most-abstract to most-concrete. Now you have concrete implementation details (like the `eventBatch` type) above the parts of the code that are the core logic of the log driver (like the exported methods for `logStream`). It might be worth breaking things out into separate files at this point.

@anusha-ragunathan

Contributor

anusha-ragunathan commented Dec 11, 2017

DockerSwarmSuite.TearDownTest fails to shut down the daemon.

```
FAIL: check_test.go:366: DockerSwarmSuite.TearDownTest
00:49:43 
00:49:43 check_test.go:371:
00:49:43     d.Stop(c)
00:49:43 daemon/daemon.go:395:
00:49:43     t.Fatalf("Error while stopping the daemon %s : %v", d.id, err)
00:49:43 ... Error: Error while stopping the daemon d7f3fe1f6d643 : exit status 130
```

I notice this in the logs:

```
time="2017-12-09T00:49:36.542132074Z" level=error msg="agent failed to clean up assignments" error="context deadline exceeded"
time="2017-12-09T00:49:36.542203730Z" level=error msg="failed to shut down cluster node: context deadline exceeded"
goroutine 1 [running]:
github.com/docker/docker/pkg/signal.DumpStacks(0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/pkg/signal/trap.go:83 +0xc5
github.com/docker/docker/daemon/cluster.(*Cluster).Cleanup(0xc420850000)
	/go/src/github.com/docker/docker/daemon/cluster/cluster.go:365 +0x1c2
main.(*DaemonCli).start(0xc4204c4c60, 0xc4204c6a10, 0x0, 0x0)
	/go/src/github.com/docker/docker/cmd/dockerd/daemon.go:314 +0x1b3b
main.runDaemon(0xc4204c6a10, 0xc42033eb40, 0x0)
	/go/src/github.com/docker/docker/cmd/dockerd/docker.go:78 +0x76
main.newDaemonCommand.func1(0xc4204646c0, 0xc420135d00, 0x0, 0x10, 0x0, 0x0)
	/go/src/github.com/docker/docker/cmd/dockerd/docker.go:29 +0x5b
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).execute(0xc4204646c0, 0xc420010130, 0x10, 0x11, 0xc4204646c0, 0xc420010130)
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:646 +0x44d
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4204646c0, 0x1accbc0, 0xc420464701, 0xc42047ce80)
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:742 +0x30e
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4204646c0, 0xc42047ce80, 0xc420000260)
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:695 +0x2b
main.main()
	/go/src/github.com/docker/docker/cmd/dockerd/docker.go:105 +0xe1
```

There is a follow-up panic in `TestAPISwarmServicesUpdateStartFirst`, due to several instances of dockerd running.

```
00:49:43 PANIC: docker_api_swarm_service_test.go:201: DockerSwarmSuite.TestAPISwarmServicesUpdateStartFirst
00:49:43 
00:49:43 [d7f3fe1f6d643] waiting for daemon to start
00:49:43 [d7f3fe1f6d643] daemon started
00:49:43 
00:49:43 [d7f3fe1f6d643] daemon started
00:49:43 Attempt #2: daemon is still running with pid 9963
00:49:43 Attempt #3: daemon is still running with pid 9963
00:49:43 Attempt #4: daemon is still running with pid 9963
00:49:43 [d7f3fe1f6d643] exiting daemon
00:49:43 ... Panic: Fixture has panicked (see related PANIC)
```
@anusha-ragunathan

Contributor

anusha-ragunathan commented Dec 11, 2017

@jahkeup: Can you look into these test failures? Do they happen when you run the integration tests locally?

@anusha-ragunathan

Contributor

anusha-ragunathan commented Dec 12, 2017

@jahkeup: This patch has code rearrangement that's not necessary, as mentioned in #35726 (comment). Can you update the change so that only the code relevant to the fix is touched? We want to keep code changes minimal whenever possible; this also matters for backporting/cross-porting to other releases.

@jahkeup

Contributor

jahkeup commented Dec 12, 2017

@anusha-ragunathan yes, I think that's prudent, and it's what I had in mind also. I will work with @samuelkarp offline to clean up and make some other improvements to this package.

I'm still trying to repro the above failures in the integration tests. I do have some failures, but they're in different tests from the ones observed on Jenkins, and they seem to happen irregularly. I'll try rebasing to pick up any test fixes and see if it makes a difference.

@thaJeztah

Member

thaJeztah commented Dec 12, 2017

The TearDownTest itself sometimes fails, so this is not necessarily related (I haven't checked the full output).

awslogs: Use batching type for ergonomics and correct counting

The previous bytes counter was moved out of scope and was no longer
counting the total number of bytes in the batch. The new type
encapsulates the counter and the batch together, for correct counting
and code ergonomics.

Signed-off-by: Jacob Vallejo <jakeev@amazon.com>
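
To make the failure mode concrete, here is a schematic illustration of the bug class described in the commit message. It is a simplified sketch of the pattern, not the driver's actual code: a counter declared inside the loop is reset on every event, so the size check never sees the running total.

```go
// Schematic only: the counter lives in the wrong scope and never accumulates.
func splitIntoBatchesBuggy(lines []string, maxBytes int) [][]string {
	var batches [][]string
	var current []string
	for _, line := range lines {
		total := 0 // BUG: re-declared per iteration, only ever counts one line
		if total+len(line) > maxBytes && len(current) > 0 {
			batches = append(batches, current)
			current = nil
		}
		current = append(current, line)
		total += len(line)
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches
}
```

Keeping the counter inside the batch type, as this commit does, removes that class of mistake because the slice and the counter can only be updated together.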
@jahkeup

Contributor

jahkeup commented Dec 15, 2017

@anusha-ragunathan PTAL, I've updated the code and the tests are now passing.

@samuelkarp

LGTM

// Warning: this type is not threadsafe and must not be used
// concurrently. This type is expected to be consumed in a single go
// routine and never concurrently.
type eventBatch struct {

@samuelkarp
samuelkarp Dec 18, 2017
Contributor

Can you move this down to be right above the methods for the type? Those start on line 636.

@anusha-ragunathan

Contributor

anusha-ragunathan commented Dec 19, 2017

LGTM

@thaJeztah

LGTM

Left some suggestions for a follow-up, but no showstoppers.

}
added := batch.add(event, lineBytes)
if added {

@thaJeztah
thaJeztah Dec 19, 2017
Member

Nit; this could be (no need to change)

if ok := batch.add(event, lineBytes); ok {
@@ -615,3 +620,70 @@ func unwrapEvents(events []wrappedEvent) []*cloudwatchlogs.InputLogEvent {
}
return cwEvents
}
func newEventBatch() *eventBatch {

@thaJeztah
thaJeztah Dec 19, 2017
Member

Perhaps in a future rewrite we could pass in maximumLogEventsPerPut and maximumBytesPerPut (they feel like a property of the eventBatch):

func newEventBatch(maxEvents uint, maxBytes uint) *eventBatch {
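
As a rough sketch of that suggestion (hypothetical, not part of this PR), the limits would become fields of the batch rather than package-level constants checked by callers:

```go
// Sketch only: illustrative field names, not the driver's actual code.
type eventBatch struct {
	maxEvents uint
	maxBytes  uint
	count     uint
	bytes     uint
}

// newEventBatch carries the limits with the batch it constructs, so add()
// can compare against b.maxEvents / b.maxBytes instead of package constants.
func newEventBatch(maxEvents, maxBytes uint) *eventBatch {
	return &eventBatch{
		maxEvents: maxEvents,
		maxBytes:  maxBytes,
	}
}
```

A caller would then construct it as `newEventBatch(maximumLogEventsPerPut, maximumBytesPerPut)`.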
}
func (b *eventBatch) isEmpty() bool {
zeroEvents := b.count() == 0

@thaJeztah
thaJeztah Dec 19, 2017
Member

nit: these variables are a bit redundant;

return b.count() == 0 && b.size() == 0
}
events = append(events, wrappedEvent{
event := wrappedEvent{

@thaJeztah
thaJeztah Dec 19, 2017
Member

Wrapping and unwrapping could be done in batch.add() / batch.events(), as they're really related to how the batch stores the events (i.e. just for preserving the event order).
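
A hedged sketch of how that could look (assumed names and fields; the actual driver code may differ): add() records the insertion order itself, and events() sorts and unwraps before the events are handed to PutLogEvents.

```go
package awslogs

import (
	"sort"

	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
)

// wrappedEvent pairs an event with the order in which it was added, so ties
// between identical timestamps can be broken deterministically.
type wrappedEvent struct {
	inputLogEvent *cloudwatchlogs.InputLogEvent
	insertOrder   int
}

type eventBatch struct {
	batch []wrappedEvent
	bytes int
}

// add wraps the event internally; callers never see wrappedEvent.
func (b *eventBatch) add(event *cloudwatchlogs.InputLogEvent, size int) {
	b.bytes += size
	b.batch = append(b.batch, wrappedEvent{
		inputLogEvent: event,
		insertOrder:   len(b.batch),
	})
}

// events unwraps the batch, sorted by timestamp with insertion order as the
// tie-breaker, ready to be passed to PutLogEvents.
func (b *eventBatch) events() []*cloudwatchlogs.InputLogEvent {
	sort.Slice(b.batch, func(i, j int) bool {
		a, c := b.batch[i], b.batch[j]
		if *a.inputLogEvent.Timestamp == *c.inputLogEvent.Timestamp {
			return a.insertOrder < c.insertOrder
		}
		return *a.inputLogEvent.Timestamp < *c.inputLogEvent.Timestamp
	})
	unwrapped := make([]*cloudwatchlogs.InputLogEvent, len(b.batch))
	for i, w := range b.batch {
		unwrapped[i] = w.inputLogEvent
	}
	return unwrapped
}
```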

@thaJeztah merged commit c8f7f44 into moby:master on Dec 19, 2017

6 checks passed

- dco-signed: All commits are signed
- experimental: Jenkins build Docker-PRs-experimental 38366 has succeeded
- janky: Jenkins build Docker-PRs 47103 has succeeded
- powerpc: Jenkins build Docker-PRs-powerpc 7486 has succeeded
- windowsRS1: Jenkins build Docker-PRs-WoW-RS1 18622 has succeeded
- z: Jenkins build Docker-PRs-s390x 7309 has succeeded