give a long gregor reconnect backoff to devices that don't send chats #9720

Merged
merged 2 commits into master from jack/CORE-6527/gregorbackoff on Dec 12, 2017

@oconnor663 oconnor663 commented Nov 29, 2017

This should mitigate some of the reconnect flood that the gregor servers
have to deal with when they restart. Most idle clients in the wild
aren't participating in chat, and don't need to reconnect very
aggressively.

There were a few different heuristics we could've used here, and others
we might want to use in the future. One in particular we almost chose
was "has this user ever received a message". However, we sometimes send
system-wide messages, like when Linux updates are broken, which could
confuse that heuristic. Chat sending is a surer sign of activity than
receiving, and it also has the benefit of being
individual-device-specific.

Two things had to change to make this work. First, we had to configure a
chat-activity-based backoff (using a couple new keys in LevelDB).
Second, we had to make sure that the backoff was respected on reconnect,
which required the new ForceInitialBackoff ConnectionOpts parameter
upstream, since we don't keep a persistent Connection object after
disconnects.

@oconnor663 oconnor663 force-pushed the jack/CORE-6527/gregorbackoff branch 4 times, most recently from d139a3a to 189afc9 Compare December 4, 2017 22:19
@oconnor663

Will replace the "dummy revendor" commit after keybase/go-framed-msgpack-rpc#129 lands.

@oconnor663

@mmaxim this is ready for your review.

@oconnor663

Upstream landed, and I've replaced the dummy revendor. Will add a corresponding KBFS PR now. (I expect Android CI will be broken until both of these land.)

@mmaxim mmaxim left a comment

Couple q's

@@ -928,6 +928,8 @@ func (h *Server) PostLocal(ctx context.Context, arg chat1.PostLocalArg) (res cha
		return chat1.PostLocalRes{}, err
	}

	RecordChatSend(h.G())

Could we put this in BlockingSender.Send?

func RecordChatSend(g *globals.Context) {
	err := g.LocalChatDb.PutObj(lastSendTimeDbKey(), nil, time.Now().Unix())
	if err != nil {
		g.Log.Warning("Failed to store chat last send time: %s", err)

Let's make all these prints debugs so they don't show up on the CLI.

	// given a long backoff.
	lastSendTime := chat.GetLastSendTime(g)
	if now.Sub(lastSendTime) < chat.ActiveIntervalAfterSend {
		return GregorConnectionShortRetryInterval

Why do we wait at all for active people?

@oconnor663 oconnor663 force-pushed the jack/CORE-6527/gregorbackoff branch 4 times, most recently from 668c717 to 09a7814 Compare December 6, 2017 18:15
@oconnor663

Ok, the current state of this PR:

  • The main change from before is that we now supply InitialReconnectBackoffWindow. Before we didn't supply it at all.
  • The initial window is zero for active clients, but rand(0-10s) for inactive clients.
  • We've also changed the retry delay between failed reconnects to 10s for inactive clients. It's 2s for active clients, which is no change from the previous behavior.
  • "active" means any of:
    • all mobile devices
    • any device within 24 hours of when it checks its "active" status for the first time
    • any device that has sent a chat message in the last month

@mmaxim

mmaxim commented Dec 6, 2017

General comment to use g.Debug and make these functions receivers on gregorHandler so that the debug output can get caught up in my standard greps in the logs.

@oconnor663

Made @mmaxim's suggested logging changes and rebased.

@mmaxim mmaxim left a comment

Looks good, thanks! Two points:

1.) Is there any way we can test this? I was thinking a unit test for active.go itself (without any connection business) could be useful.
2.) I think this should hold until everything is released.

@oconnor663

Added a test and rebased. Landing after this passes CI and some hand testing.

@oconnor663 oconnor663 closed this Dec 12, 2017
@oconnor663 oconnor663 deleted the jack/CORE-6527/gregorbackoff branch December 12, 2017 20:51
@oconnor663 oconnor663 merged commit 069968a into master Dec 12, 2017