This repository was archived by the owner on Feb 1, 2023. It is now read-only.

calculate message latency #386

Merged: 6 commits into master on May 2, 2020

Conversation

@dirkmc (Contributor) commented Apr 29, 2020

Fixes #385 (Accurate DONT_HAVE timeouts)

@Stebalien (Member) left a comment:

I'd double-check for possible deadlocks, but otherwise LGTM (except a few nits).

Review threads on:
  • bitswap.go
  • internal/messagequeue/donthavetimeoutmgr.go
  • internal/messagequeue/messagequeue.go
@dirkmc (Contributor, author) commented Apr 30, 2020

The request-response cases are:

  • broadcast want-have (send-dont-have = false)
    • HAVE
    • block (if the block is very small)
  • regular want-have (send-dont-have = true)
    • HAVE
    • DONT_HAVE
    • block (if the block is very small)
  • want-block (send-dont-have = true)
    • block
    • DONT_HAVE
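
The decision logic implied by these cases can be written as a small function. This is a sketch only: the names here (wantType, response, maxSmallBlockSize) are hypothetical, not the actual go-bitswap engine API.

package sketch

// wantType distinguishes the two kinds of want entries.
type wantType int

const (
	wantHave wantType = iota
	wantBlock
)

// response is what the peer sends back (if anything).
type response int

const (
	sendNothing response = iota
	sendHave
	sendDontHave
	sendBlock
)

// maxSmallBlockSize is a hypothetical threshold for a block "very small"
// enough to send directly in response to a want-have.
const maxSmallBlockSize = 1024

// respond maps a received want to the response described in the cases above.
func respond(wt wantType, sendDontHaveFlag, haveBlock bool, blockSize int) response {
	switch {
	case haveBlock && wt == wantBlock:
		return sendBlock
	case haveBlock && blockSize <= maxSmallBlockSize:
		return sendBlock // want-have, but the block is small enough to send directly
	case haveBlock:
		return sendHave
	case sendDontHaveFlag:
		return sendDontHave // regular want-have or want-block without the block
	default:
		return sendNothing // broadcast want-have and the peer lacks the block
	}
}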

There are a few scenarios to address:

  1. No response followed by late response

Peers running new Bitswap should respond immediately in all cases, except for broadcast want-have (send-dont-have = false) when the peer doesn't have the block. Peers running old Bitswap won't respond unless they have the block.

In either of these two cases it's possible that

  • the peer does not have the block, so it doesn't respond
  • the peer receives the block later
  • the peer sends HAVE / block

We can mitigate this scenario by ignoring outlier latencies (see the sketch at the end of this comment).

  2. DONT_HAVE then HAVE / block

It's possible that a peer sends DONT_HAVE and then subsequently sends HAVE / block. To address this I think we should clear out the sent-at time when we receive a response, so that subsequent responses will be ignored for the purpose of latency calculation.

  3. Simulated DONT_HAVE

If a peer doesn't respond to want-block, either because it's running old Bitswap or because it's overloaded, the timeout will fire an event that simulates receiving DONT_HAVE. We want to ignore this simulated DONT_HAVE for the purposes of latency tracking, so we should clear out the sent-at time before the event propagates upwards. Edit: Actually the simulated DONT_HAVE bypasses the message received event, so it will be ignored for the purposes of latency tracking.
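
As a concrete example of the mitigation for scenario 1, a latency tracker could simply discard samples far above its running average. This is a minimal sketch under assumed parameters (alpha, outlierMultiplier); it is not the code in this PR.

package sketch

import "time"

// latencyTracker keeps an exponential moving average of response latency
// and ignores implausibly large samples, e.g. a late HAVE sent long after
// the peer fetched the block elsewhere.
type latencyTracker struct {
	avg time.Duration
}

const (
	alpha             = 0.5 // weight given to the newest sample
	outlierMultiplier = 4   // samples above 4x the average are discarded
)

func (lt *latencyTracker) record(sample time.Duration) {
	if lt.avg == 0 {
		lt.avg = sample // first sample seeds the average
		return
	}
	if sample > lt.avg*outlierMultiplier {
		return // outlier: likely a late response, not representative latency
	}
	lt.avg = time.Duration(alpha*float64(sample) + (1-alpha)*float64(lt.avg))
}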

@dirkmc (Contributor, author) commented Apr 30, 2020

With regards to checking for deadlocking, the possible places for slowness / deadlock that I can see are:

  • Routing the received message notification through the PeerManager
    We need to take the peer queue lock, which is already quite highly contended.
    I considered keeping a separate "event listener" on the Network abstraction that the MessageQueue can subscribe to, but it seemed overly complex so I went with this simpler solution.
  • Informing the MessageQueue of incoming responses
    Because the peer queue lock is highly contended, I wanted to make sure this would not block, so the incoming response keys are sent on a channel; if the channel is full they are simply dropped (the keys are only used to approximate latency, so we don't need to measure every single response; see the sketch after this list).
  • Informing DontHaveTimeoutManager of latency calculation updates
    We need to acquire a lock, but the timeout-checking thread in DontHaveTimeoutManager only wakes up at the approximate moment that the oldest want is expected to time out, so I don't expect much contention.
  • DontHaveTimeoutManager firing timeouts
    This happens in a goroutine, so I think we should be OK.
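
The non-blocking hand-off in the second bullet is Go's standard select-with-default pattern. A sketch of it follows; the field name and buffer size here are assumptions, not necessarily the PR's exact shape.

package sketch

import cid "github.com/ipfs/go-cid"

// messageQueueSketch hands incoming response keys to the MessageQueue
// without ever blocking the caller.
type messageQueueSketch struct {
	responses chan []cid.Cid // e.g. make(chan []cid.Cid, 8)
}

func (mq *messageQueueSketch) onResponseReceived(ks []cid.Cid) {
	select {
	case mq.responses <- ks:
	default:
		// Channel full: drop the keys. Latency measurement is only an
		// approximation, so losing some samples is acceptable.
	}
}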

@Stebalien any other places we should be thinking about?

@dirkmc requested a review from @Stebalien on April 30, 2020 at 18:25.

for _, e := range peerEntries[:sentPeerEntries] {
	if e.Cid.Defined() { // Check if want was cancelled in the interim
		mq.peerWants.SentAt(e.Cid, now)
	}
}
@Stebalien (Member) commented:

What if this happens after we receive the block? Is that an issue?

@dirkmc (Contributor, author) replied:

It should be OK: when the Session receives a block it sends a cancel, and the cancel removes the want from the sent wantlist. SentAt() checks that the sent wantlist still contains the want before recording the time a response was received:

func (r *recallWantlist) SentAt(c cid.Cid, at time.Time) {
	// The want may have been cancelled in the interim
	if _, ok := r.sent.Contains(c); ok {
		if _, ok := r.sentAt[c]; !ok {
			r.sentAt[c] = at
		}
	}
}
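
For scenario 2 in the earlier comment (DONT_HAVE followed by HAVE / block), the counterpart that clears the sent-at time once a response has been counted might look like the sketch below; the exact method merged in this PR may differ.

// Clear the sent-at time for a CID once a response has been used for a
// latency sample, so any later response for the same CID is ignored.
func (r *recallWantlist) ClearSentAt(c cid.Cid) {
	delete(r.sentAt, c)
}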

@Stebalien (Member) replied:
Sounds right.

@Stebalien (Member) commented:

> It's possible that a peer sends DONT_HAVE and then subsequently sends HAVE / block. To address this I think we should clear out the sent-at time when we receive a response, so that subsequent responses will be ignored for the purpose of latency calculation.

👍

> With regards to checking for deadlocking, the possible places for slowness / deadlock that I can see are:

I think we're fine; I just haven't thought it through fully.

@Stebalien merged commit 165b154 into master on May 2, 2020.
@Stebalien mentioned this pull request on May 26, 2020.
Jorropo pushed a commit to Jorropo/go-libipfs that referenced this pull request on Jan 26, 2023.