Remove and close pooled states when a model is destroyed #6408

Merged
merged 6 commits into from Oct 10, 2016

Conversation

Projects
None yet
5 participants
Member

babbageclunk commented Oct 7, 2016

Investigating http://pad.lv/1625774 revealed that there were a lot of state instances (with their attendant goroutines) lingering after a model had been destroyed. They turned out to be held by the API server's StatePool.

Add StatePool.Remove and change the API server to watch for model removals and call it. This also required adding reference counting to the items in the pool - the State instances are designed to be shared between multiple requests, but we need to be sure that they aren't in use when they are closed. StatePool.Put was added so the API server could indicate that it's finished with a State.

QA done:

  • bootstrapped to LXD
  • dumped goroutines using the following commands to query the introspection worker on the controller machine
socat tcp-listen:8080,fork,reuseaddr abstract-connect:jujud-machine-0 &
curl localhost:8080/debug/pprof/goroutine?debug=2 > goroutines.txt
# Using this format because it's easier to compare old and new.
  • added and destroyed a model
  • dumped the goroutines again and checked there weren't any new ones for state workers.

I like this PR a lot, please get a second pair on eyes on it just to be sure it is a significant change.

Looks nice. A few things to consider.

@@ -538,9 +544,33 @@ func (srv *Server) serveConn(wsConn *websocket.Conn, modelUUID string, apiObserv
conn := rpc.NewConn(codec, apiObserver)
- h, err := srv.newAPIHandler(conn, modelUUID, host)
+ // Note that we don't overwrite modelUUID here because
@wallyworld

wallyworld Oct 10, 2016

Owner

I think with 2.0 this notion of an empty model UUID is now moot.

@babbageclunk

babbageclunk Oct 10, 2016

Member

So should I change resolvedModelUUID to modelUUID and pull that section out of validateModelUUID?

+ st, err = srv.statePool.Get(resolvedModelUUID)
+ }
+
+ if err == nil {
@wallyworld

wallyworld Oct 10, 2016

Owner

I think this would have been a little cleaner to retain the newAPIHandler() func and have it return st as well has apiHandler.

@babbageclunk

babbageclunk Oct 10, 2016

Member

The reason I merged it back in was because it felt weird to have the Get and Put in different scopes. If newAPIHandler returned an error for some reason (even though the Get succeeded) the Put would never be done, and that state would be leaked.

apiserver/apiserver.go
+ case <-srv.tomb.Dying():
+ return tomb.ErrDying
+ case modelIDs := <-w.Changes():
+ for _, modelID := range modelIDs {
@wallyworld

wallyworld Oct 10, 2016

Owner

these are modelUUIDs, can we rename the vars to avoid the notion there maybe such a thing as a modelID

@mjs

mjs Oct 10, 2016

Contributor

+1 - just to be consistent with other code

@babbageclunk

babbageclunk Oct 10, 2016

Member

Ok, wilco.

apiserver/server_test.go
+ c.Assert(err, jc.ErrorIsNil)
+
+ // Make a request for the model API to check it puts
+ // state back into the pool once the connection is closed.
@wallyworld

wallyworld Oct 10, 2016

Owner

Where is the check that it puts state back into the pool done? Is that expected to be part of this test?

@babbageclunk

babbageclunk Oct 10, 2016

Member

Well, it couldn't be closed unless it had been put/released, so the test checks it indirectly - the refcount isn't exposed so it can't test it directly (although I could do that if you think it's worthwhile). Originally I had two tests, one that didn't hit the API and one that did, but the former was a subset of the latter so I deleted it.

state/pool.go
+// Put indicates that the client has finished using the State. If the
+// state has been marked for removal, it will be closed and removed
+// when the final Put is done.
+func (p *StatePool) Put(modelUUID string) error {
@wallyworld

wallyworld Oct 10, 2016

Owner

Put() seems like a poor choice of name for this? It implies it should be the opposite of Get(), which returns a State.
Could we call it Release() ?

@mjs

mjs Oct 10, 2016

Contributor

-1 to Release. That implies to me that the call to Get granted exclusive access to the returned State which isn't the case.

Put confused me at first, but on reflection I don't think it's too bad.

@wallyworld

wallyworld Oct 10, 2016

Owner

I disagree :-)
What's wrong with Release() ? That is what is being done.
Put() really should be the opposite to Get() and in the PR it is not. It needs a different name to better reflect the semantics. Release() does that IMHO.

@wallyworld

wallyworld Oct 10, 2016

Owner

Actually, I liken Release() to operating on a wait group or something like that. Multiple processes which have previously got a resource then Release() when done, each Release() decrements the ref count. In the theme of wait groups, maybe Done() would be better? ie "I am done with this model UUID"

@babbageclunk

babbageclunk Oct 10, 2016

Member

The reason I used put was to match the method names on sync.Pool. I don't mind Release (I think I toyed with that) - I can see your point that Put should really take the State. Happy to change it to Release if you guys agree. I don't think it implies exclusive access.

state/pool.go
+ item, ok := p.pool[modelUUID]
+ if !ok {
+ // Don't require the client to keep track of what we've seen -
+ // ignore unknown model ids.
@babbageclunk

babbageclunk Oct 10, 2016

Member

I mean, they are identifiers (which happen to be universally unique). I'm not sure every mention of them needs to highlight the universal part, given that people aren't digging around inside them. But I don't feel strongly about this, so I don't mind changing it if other people feel differently.

state/pool_test.go
c.Assert(err, jc.ErrorIsNil)
c.Assert(st2_, gc.Equals, st2)
}
func (s *statePoolSuite) TestGetWithControllerEnv(c *gc.C) {
- p := state.NewStatePool(s.State)
- defer p.Close()
-
// When a State for the controller env is requested, the same
@wallyworld

wallyworld Oct 10, 2016

Owner

controller model

@babbageclunk

babbageclunk Oct 10, 2016

Member

Do you mean the name of the test? Will change (and the comment too).

state/pool_test.go
c.Assert(err, jc.ErrorIsNil)
c.Assert(st2_, gc.Not(gc.Equals), st2)
}
+
+func (s *statePoolSuite) TestPutSystemState(c *gc.C) {
+ // Doesn't maintain a refcount for the system state.
@wallyworld

wallyworld Oct 10, 2016

Owner

how are we testing that no ref count is maintained?

@babbageclunk

babbageclunk Oct 10, 2016

Member

I guess I'm not - really this is testing that it doesn't blow up when someone Puts the system state. I'll add a check that it's not closed.

state/pool_test.go
+ c.Assert(err, gc.ErrorMatches, "unable to return unknown model deadbeef to the pool")
+}
+
+func (s *statePoolSuite) TestTooManyPuts(c *gc.C) {
@wallyworld

wallyworld Oct 10, 2016

Owner

yeah, I definitely think Put() should be Release()

mjs approved these changes Oct 10, 2016

+ go func() {
+ defer srv.wg.Done()
+ srv.tomb.Kill(srv.processModelRemovals())
+ }()
@mjs

mjs Oct 10, 2016

Contributor

Man, apiserver needs to be catacomb based. Not for this PR though.

@babbageclunk

babbageclunk Oct 10, 2016

Member

Yeah, I almost started to do that but then backed off.

apiserver/apiserver.go
+ case <-srv.tomb.Dying():
+ return tomb.ErrDying
+ case modelIDs := <-w.Changes():
+ for _, modelID := range modelIDs {
@wallyworld

wallyworld Oct 10, 2016

Owner

these are modelUUIDs, can we rename the vars to avoid the notion there maybe such a thing as a modelID

@mjs

mjs Oct 10, 2016

Contributor

+1 - just to be consistent with other code

@babbageclunk

babbageclunk Oct 10, 2016

Member

Ok, wilco.

+ // time slices that the watcher uses for coalescing
+ // events. Without it the model appears and disappears quickly
+ // enough that it never generates a change from WatchModels.
+ // Many Bothans died to bring us this information.
@mjs

mjs Oct 10, 2016

Contributor

ha!

apiserver/server_test.go
+ // events. Without it the model appears and disappears quickly
+ // enough that it never generates a change from WatchModels.
+ // Many Bothans died to bring us this information.
+ time.Sleep(coretesting.ShortWait)
@mjs

mjs Oct 10, 2016

Contributor

Would it be more reliable to have the test have it's own model watcher from the same State instance that the apiserver is using? It could then watch for the new model to appear itself. I'm worried on a heavily loaded test machine, the sleep might not be enough.

@babbageclunk

babbageclunk Oct 10, 2016

Member

That's definitely nicer than a sleep - I'll try it out.

state/pool.go
+// Put indicates that the client has finished using the State. If the
+// state has been marked for removal, it will be closed and removed
+// when the final Put is done.
+func (p *StatePool) Put(modelUUID string) error {
@wallyworld

wallyworld Oct 10, 2016

Owner

Put() seems like a poor choice of name for this? It implies it should be the opposite of Get(), which returns a State.
Could we call it Release() ?

@mjs

mjs Oct 10, 2016

Contributor

-1 to Release. That implies to me that the call to Get granted exclusive access to the returned State which isn't the case.

Put confused me at first, but on reflection I don't think it's too bad.

@wallyworld

wallyworld Oct 10, 2016

Owner

I disagree :-)
What's wrong with Release() ? That is what is being done.
Put() really should be the opposite to Get() and in the PR it is not. It needs a different name to better reflect the semantics. Release() does that IMHO.

@wallyworld

wallyworld Oct 10, 2016

Owner

Actually, I liken Release() to operating on a wait group or something like that. Multiple processes which have previously got a resource then Release() when done, each Release() decrements the ref count. In the theme of wait groups, maybe Done() would be better? ie "I am done with this model UUID"

@babbageclunk

babbageclunk Oct 10, 2016

Member

The reason I used put was to match the method names on sync.Pool. I don't mind Release (I think I toyed with that) - I can see your point that Put should really take the State. Happy to change it to Release if you guys agree. I don't think it implies exclusive access.

babbageclunk added some commits Oct 5, 2016

Add StatePool.Remove to close States
This will be called when the underlying model is removed to prevent
State instances from leaking. Since State instances can be shared
between requests (in the API server for example), this requires keeping
a count of the times a state has been requested from the pool with
Get. Also add StatePool.Put so clients can indicate when they've
finished with a State.

If Remove is called for a State and there are no outstanding references,
it will be closed and removed immediately. If there are outstanding
references, it's marked for removal and removed and closed when the last
use is Put back.

At the moment this relies on the client to watch for model removals and
remove States from the pool, which is not ideal. There are two places
using StatePool and they use different types for managing lifetimes
unfortunately - the API server uses a tomb (v1), and the AllModelWatcher
uses a catacomb. It seemed like I'd have to convert the API server to
use a catacomb to be able to do the watching inside the StatePool, so I
did this instead.

Part of fixing http://pad.lv/1625774
Added a test for the API server removing States
... from the StatePool once the corresponding model is removed.
Member

babbageclunk commented Oct 10, 2016

$$merge$$

Contributor

jujubot commented Oct 10, 2016

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit da5b96c into juju:master Oct 10, 2016

@babbageclunk babbageclunk deleted the babbageclunk:pool-memleak branch Oct 10, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment