
internal: remove transportMonitor, replace with callbacks #2032

Conversation

@jeanbza (Member) commented Apr 29, 2018

This removes a goroutine (in which transportMonitor was running) and the backoff and addrs loops. Instead we use callbacks to handle the deadline, goaway, and close logic that transport monitor handled before, and we descend into recursive createTransport calls instead of loops.

As a result, backoffIdx and addrIdx now live on addrConn. Also, transport.Close now blocks until the addrConn that initiated it is connected - this is because t.Close calls onClose, whose last step is to begin reconnecting. In the event of a shutdown, all reconnect logic ends and t.Close will return.

@@ -1515,6 +1517,20 @@ func (ac *addrConn) tearDown(err error) {
	if channelz.IsOn() {
		channelz.RemoveEntry(ac.channelzID)
	}

	// TODO(deklerk) is this bad?
	transports := map[int32]transport.ClientTransport{}
Member Author:

A naive way to get the transports outside the lock (GracefulClose takes the lock) without causing data races.

Is there a better way to do this?
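The copy-then-close pattern described here can be sketched as follows. This is a minimal standalone sketch with stand-in types (fakeTransport and the simplified addrConn are hypothetical, not the actual grpc-go code): copy the map while holding ac.mu, release the lock, then close the copies so GracefulClose can safely re-acquire the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeTransport stands in for transport.ClientTransport in this sketch.
type fakeTransport struct{ closed bool }

func (t *fakeTransport) GracefulClose() { t.closed = true }

type addrConn struct {
	mu         sync.Mutex
	transports map[int32]*fakeTransport
}

// tearDown copies the live transports while holding ac.mu, then releases
// the lock before calling GracefulClose, which re-acquires the lock.
func (ac *addrConn) tearDown() {
	ac.mu.Lock()
	transports := make(map[int32]*fakeTransport, len(ac.transports))
	for id, t := range ac.transports {
		transports[id] = t
	}
	ac.mu.Unlock()

	for _, t := range transports {
		t.GracefulClose() // safe: ac.mu is no longer held
	}
}

func main() {
	ac := &addrConn{transports: map[int32]*fakeTransport{1: {}, 2: {}}}
	ac.tearDown()
	fmt.Println(ac.transports[1].closed, ac.transports[2].closed)
}
```

The copy costs an allocation per teardown, but teardown is rare, so trading that for a smaller critical section seems reasonable.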

@@ -261,6 +261,11 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth
return nil, err
}

// TODO(deklerk) what??
if t == nil {
continue
Member Author:

I couldn't figure out why this is happening, but sometimes I would get nil, true, nil which blows up a little below when we call t.NewStream. This conditional fixes the problem, but I thought I'd bring it up. I suspect it's a race between the transport closing - and getting deleted - and getTransport looking for, finding, and returning it.

If so, is this fix appropriate or should I make getTransport and transport deletion synchronized? Alternatively, should I document getTransport's race with closing transport? (again, all this predicated on my guess above)

clientconn.go Outdated
}
if connected {
return nil
if t == nil {
Member Author:

I'm not sure this solves the problem appropriately. AFAICT there's a race between the reader goroutine closing and onDeadline. This puts a plaster over the problem, but I think there's still a chance t.Close just below could happen incorrectly. Is that accurate? Is there a better way to handle this? Do I need to do more synchronization?

clientconn.go Outdated
defer shutdownLock.Unlock()

if timer != nil {
timer.Stop()
Member Author:

Is this appropriate? I'm a bit confused by the godoc in timer.Stop; it seems to suggest that we need to drain the channel, too. Is that right?

Contributor:

This seems fine.


jeanbza commented Apr 29, 2018

One of the six Travis builds is failing. The reason is a little cryptic, but I think it's telling me there's a gofmt error in clientconn.go. Is that right? And if so, how do I repro that? I'm running go1.10 and gofmt -s -d -l . brings up nothing.

[screenshots of the Travis build failure output, Apr 29 2018]

@dfawley (Member) commented Apr 29, 2018

Yes, that's right. It's only failing in 1.10 because we don't run the vet/lint/etc. checks in the other builds.

@jeanbza (Member Author) commented Apr 29, 2018

@dfawley Is there a way to see what in clientconn.go caused it to fail, or to repro locally? gofmt -s -d -l . returns nothing for me, but I'm not sure if I missed a step.

edit: vet.sh also didn't bring up any errors. Confusing :/

@dfawley (Member) commented Apr 30, 2018

What version is your local go?

...Now I'm even more confused. The output shows it grepping for "unsafe", but if I look in your branch, your vet.sh doesn't have the change to do that yet.

I think we should remove the "-l" parameter from the command-line so we can see the full diffs. You can do that in your branch, anyway, to get travis to output it.

@dfawley (Member) commented Apr 30, 2018

> I'm running go1.10

Missed that. Are you not running 1.10.1? That's what travis used.

@jeanbza (Member Author) commented Apr 30, 2018

> Missed that. Are you not running 1.10.1? That's what travis used.

Updated and rebased with master, still not seeing anything with vet.sh. I took your advice and removed -l; we'll see what travis prints.

@jeanbza jeanbza force-pushed the transport_monitor_refactor_restart_2_embrace_callbacks_3 branch 4 times, most recently from 233a399 to 86498f2 Compare April 30, 2018 18:02
@MakMukhi (Contributor) left a comment

Reviewed it partially. Looks good, mostly. We might need to update the structure a little bit. Let's chat when you're in SVL again.

@@ -1174,152 +1179,217 @@ func (ac *addrConn) resetTransport() error {
ac.ready = nil
Contributor:

Update documentation for this function?

Member Author:

Done

clientconn.go Outdated
shutdownLock := &sync.Mutex{}
var timer *time.Timer
ac.mu.Lock()
if !ac.connectDeadline.IsZero() {
Contributor:

When can this be zero?

Member Author:

Ah, never. Removed the conditional.


clientconn.go Outdated
newTrID := atomic.AddInt32(&ac.transportIdx, 1)

onGoAway := func() {
shutdownLock.Lock()
Contributor:

What does this lock protect?

Member Author:

The timer. I'll tighten it up.

clientconn.go Outdated
shutdownLock.Lock()
defer shutdownLock.Unlock()

if timer != nil {
Contributor:

If connectDeadline is always non-zero, we can get rid of the timer != nil checks.

Member Author:

You're right. Done.

clientconn.go Outdated
}
t.Close()
@MakMukhi (Contributor) commented May 10, 2018

Instead of this closure, how about having the transport's reader goroutine wait on connectDeadline as well while it waits to read the settings frame?

Member Author:

Just so I'm clear, since the diff is a bit hard to read: you're suggesting that we remove onDeadline and move its logic into the transport reader goroutine by passing connectDeadline down to the transport?

clientconn.go Outdated
newTr, err := transport.NewClientTransport(connectCtx, ac.cc.ctx, target, copts, onPrefaceReceipt)

ac.mu.Lock()
delete(ac.transports, newTrID)
Contributor:

Should we also set ac.transport = nil?

Member Author:

Good call. Done

clientconn.go Outdated
case <-ac.ctx.Done():
}

err = ac.resetTransport()
Contributor:

resetTransport() always resets connectDeadline, but if we haven't exhausted our list of backends we'd like to keep using the same deadline.

Member Author:

I believe that is the current behavior, no?

clientconn.go Outdated
}
return ac.resetTransport()
Contributor:

I'd rather not replace loop with recursion.

Member Author:

Done
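For illustration, the loop shape this converges on, with addrIdx and backoffIdx living on addrConn and backoffIdx bumped when the address list wraps, might look like this toy sketch. All of it is hypothetical stand-in code (including the stub dialer), not the actual clientconn.go:

```go
package main

import "fmt"

// addrConn holds the iteration state that used to live in local loop
// variables: which address to try next, and how many full passes over
// the address list have failed (driving backoff).
type addrConn struct {
	addrs      []string
	addrIdx    int
	backoffIdx int
}

// nextAddr advances to the next address; when the list wraps around it
// increments backoffIdx, mirroring the behavior described in the PR.
func (ac *addrConn) nextAddr() string {
	addr := ac.addrs[ac.addrIdx]
	ac.addrIdx++
	if ac.addrIdx == len(ac.addrs) {
		ac.addrIdx = 0
		ac.backoffIdx++
	}
	return addr
}

// resetTransport uses a bounded loop instead of recursive calls, so
// repeated reconnects don't grow the stack.
func (ac *addrConn) resetTransport(maxTries int) (string, bool) {
	for i := 0; i < maxTries; i++ { // loop instead of recursion
		addr := ac.nextAddr()
		if tryConnect(addr) {
			return addr, true
		}
	}
	return "", false
}

// tryConnect is a stub dialer for the sketch: only "b" succeeds.
func tryConnect(addr string) bool { return addr == "b" }

func main() {
	ac := &addrConn{addrs: []string{"a", "b"}}
	addr, ok := ac.resetTransport(4)
	fmt.Println(addr, ok, ac.backoffIdx)
}
```

Go does not guarantee tail-call elimination, so the loop form also avoids unbounded stack growth for a connection that flaps for a long time, which is presumably part of the reviewer's objection to recursion here.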

clientconn.go Outdated
case <-connectCtx.Done():
// Didn't receive server preface, must kill this new transport now.
grpclog.Warningf("grpc: addrConn.createTransport failed to receive server preface before deadline.")
newTr.Close()
Contributor:

onDeadline() closure will do this for us now, so this can be removed.

Member Author:

Done

@jeanbza (Member Author) commented May 15, 2018

@MakMukhi Updated from recursion to loop

edit: I should be able to get to your other couple of comments in the next couple of days

@jeanbza (Member Author) commented May 16, 2018

Just realized I was pushing the wrong branch heh. The updates I promised above are now actually here :)

clientconn.go Outdated
@@ -1,3 +1,11 @@
/**
Member Author:

Whoops, I should and will remove this

@jeanbza jeanbza force-pushed the transport_monitor_refactor_restart_2_embrace_callbacks_3 branch from b0fd5be to 88d55c0 Compare June 13, 2018 00:35
@jeanbza (Member Author) commented Jun 13, 2018

@MakMukhi PTAL.

  • Refactor to use reader goroutine.
  • Rename back to resetTransport.
  • Various review comment fixes and cleanups.
  • Found a home for resolving (in nextAddr).
  • Rebased with master.

@jeanbza jeanbza force-pushed the transport_monitor_refactor_restart_2_embrace_callbacks_3 branch from 88d55c0 to 2725a51 Compare June 15, 2018 19:04
@jeanbza jeanbza force-pushed the transport_monitor_refactor_restart_2_embrace_callbacks_3 branch 3 times, most recently from b1b4ba9 to b1ba044 Compare June 16, 2018 01:45
@jeanbza (Member Author) commented Jul 13, 2018

Now that location differs, I'm closing this and re-opening in a new, clean PR.

@jeanbza jeanbza closed this Jul 13, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Jan 9, 2019