Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add latency package #14

Merged
merged 35 commits into from
Jun 23, 2023
Merged

Add latency package #14

merged 35 commits into from
Jun 23, 2023

Conversation

robertodauria
Copy link
Contributor

@robertodauria robertodauria commented Jun 9, 2023

This PR introduces the latency package, implementing the UDP latency test.


This change is Reviewable

@coveralls
Copy link

coveralls commented Jun 9, 2023

Pull Request Test Coverage Report for Build 5357773285

  • 204 of 244 (83.61%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 83.038%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/latency1/model/result.go 40 43 93.02%
internal/latency1/latency1.go 162 199 81.41%
Totals Coverage Status
Change from base Build 5046749297: 0.2%
Covered Lines: 749
Relevant Lines: 902

💛 - Coveralls

Copy link
Contributor

@stephen-soltesz stephen-soltesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few comments to start.

Reviewed 2 of 6 files at r1.
Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)


internal/handler/handler.go line 51 at r2 (raw file):

)

type Handler struct {

This type is a handler.Handler. I didn't notice this before. See: https://google.github.io/styleguide/go/decisions#package-vs-exported-symbol-name

Consider renaming this package to provide context for its contents, e.g. ndt8.Handler.

This doesn't have to be in this PR.


internal/latency/latency.go line 22 at r2 (raw file):

// Handler is the handler for latency tests.
type Handler struct {

This is a latency.Handler -- this is self-describing name, and clear in the context where it's used. 👍


internal/latency/latency.go line 73 at r2 (raw file):

	log.Debug("session created", "id", mid)
	rw.Write([]byte(mid))

What must the client do with this value? If they must format this into a LatencyPacket for the first request, what do you think of sending the content the client should include in the first packet? e.g. a complete serialized "c2s" LatencyPacket.


internal/latency/latency.go line 90 at r2 (raw file):

		return
	}
	// TODO: mfence?

Are there memory errors to debug? What is this TODO?


internal/latency/latency.go line 125 at r2 (raw file):

		}
		n, writeErr := conn.WriteTo(b, remoteAddr)
		if err != nil {

The error handling here and above is unclear. marshalErr or writeErr may be non-nil but we'll continue?

Do these mean only to save the current error value in err?


internal/latency/latency.go line 135 at r2 (raw file):

		session.SendTimesMu.Lock()
		defer session.SendTimesMu.Unlock()
		session.SendTimes[seq] = time.Now()

I believe we want to record the timestamp before the call to conn.WriteTo. We cannot control or quantify delays between write operations and this timestamp; this could be a source of unquantified error. For the round trip time, the latency will include the syscall overhead for the recv and the write. That's okay. I don't think the timestamp here is a reliable way to avoid it.

If you want timestamps that avoid syscall overhead, there may be a way using SO_TIMESTAMPNS, but I don't have experience with this.


internal/latency/latency.go line 141 at r2 (raw file):

		log.Debug("packet sent", "len", n, "id", session.ID, "seq", seq)

	}, memoryless.Config{

This kind of scheduling limits the data to only network jitter calculations, and prevents application-level inter-arrival jitter calculations.


internal/latency/latency.go line 189 at r2 (raw file):

		if !session.Started {
			session.Started = true
			timeout, _ := context.WithTimeout(context.Background(), 5*time.Second)

Rather than ignore the cancel, can the timeout be created within the sendLoop instead?


pkg/latency/model/result.go line 45 at r2 (raw file):

	// Packets is a list of every measurement received or sent as part of
	// this latency measurement.
	Packets []LatencyPacket

Can we eliminate the redundancy in the Type/ID fields for archival data? I can imagine an ArchivePacket or similar that includes only Seq / LastRTT fields. What do you think?

Copy link
Contributor Author

@robertodauria robertodauria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


internal/latency/latency.go line 125 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

The error handling here and above is unclear. marshalErr or writeErr may be non-nil but we'll continue?

Do these mean only to save the current error value in err?

It wasn't doing what I wanted. I believe now it does. PTAL again?


internal/latency/latency.go line 135 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

I believe we want to record the timestamp before the call to conn.WriteTo. We cannot control or quantify delays between write operations and this timestamp; this could be a source of unquantified error. For the round trip time, the latency will include the syscall overhead for the recv and the write. That's okay. I don't think the timestamp here is a reliable way to avoid it.

If you want timestamps that avoid syscall overhead, there may be a way using SO_TIMESTAMPNS, but I don't have experience with this.

Done.


internal/latency/latency.go line 141 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

This kind of scheduling limits the data to only network jitter calculations, and prevents application-level inter-arrival jitter calculations.

OTOH, memoryless has the property that you can see if there are cyclic network patterns with regard to round-trip time, while with fixed-interval scheduling you could align to the cycle and never notice them. Which property is more desirable for us?


internal/latency/latency.go line 189 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Rather than ignore the cancel, can the timeout be created within the sendLoop instead?

Done.

Copy link
Contributor Author

@robertodauria robertodauria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


internal/handler/handler.go line 51 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

This type is a handler.Handler. I didn't notice this before. See: https://google.github.io/styleguide/go/decisions#package-vs-exported-symbol-name

Consider renaming this package to provide context for its contents, e.g. ndt8.Handler.

This doesn't have to be in this PR.

Ack. That hurts to read. I'll open a separate PR.

Copy link
Contributor

@stephen-soltesz stephen-soltesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)


internal/latency/latency.go line 125 at r2 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

It wasn't doing what I wanted. I believe now it does. PTAL again?

I still don't understand: why do we check err here instead of writeErr?


internal/latency/latency.go line 135 at r2 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

Done.

Should SendTimes be updated only on success? Or maybe it doesn't matter b/c that value is only used once the echo is received again?

My original thought was to take the timestamp just before the write and update SendTimes down here after everything else was successful.


internal/latency/latency.go line 141 at r2 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

OTOH, memoryless has the property that you can see if there are cyclic network patterns with regard to round-trip time, while with fixed-interval scheduling you could align to the cycle and never notice them. Which property is more desirable for us?

My desire is only that we document our choices and that our choices are fully considered. e.g. "We do not support application level inter arrival jitter calculations in favor of the potential for detecting cyclic network patterns". Or vice versa. Which property do you want to support or that this tool will be most applicable to?


internal/latency/latency.go line 134 at r3 (raw file):

		// include the ping packet's write time. This is intentional.
		session.SendTimesMu.Lock()
		defer session.SendTimesMu.Unlock()

Do we need to hold the lock open?

Copy link
Contributor Author

@robertodauria robertodauria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


internal/latency/latency.go line 73 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

What must the client do with this value? If they must format this into a LatencyPacket for the first request, what do you think of sending the content the client should include in the first packet? e.g. a complete serialized "c2s" LatencyPacket.

Done. We now send a full kickoff packet instead of just the measurement id. The design document has been updated accordingly.


internal/latency/latency.go line 90 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Are there memory errors to debug? What is this TODO?

Done.


internal/latency/latency.go line 125 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

I still don't understand: why do we check err here instead of writeErr?

Done. That was a mistake.


internal/latency/latency.go line 135 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Should SendTimes be updated only on success? Or maybe it doesn't matter b/c that value is only used once the echo is received again?

My original thought was to take the timestamp just before the write and update SendTimes down here after everything else was successful.

I don't think it matters since it's only used when a corresponding "pong" is received, as you observed. I moved the update to just after WriteTo returned successfully in the interest of saving a mutex lock/unlock.


internal/latency/latency.go line 141 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

My desire is only that we document our choices and that our choices are fully considered. e.g. "We do not support application level inter arrival jitter calculations in favor of the potential for detecting cyclic network patterns". Or vice versa. Which property do you want to support or that this tool will be most applicable to?

(discussed on VC) I have documented this on the design document under "Alternatives considered".


internal/latency/latency.go line 134 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Do we need to hold the lock open?

Done.


pkg/latency/model/result.go line 45 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Can we eliminate the redundancy in the Type/ID fields for archival data? I can imagine an ArchivePacket or similar that includes only Seq / LastRTT fields. What do you think?

Changed as discussed on VC.

Copy link
Contributor

@stephen-soltesz stephen-soltesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 6 files at r1.
Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)


internal/latency/latency.go line 141 at r2 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

(discussed on VC) I have documented this on the design document under "Alternatives considered".

Thank you. Please add a single sentence summary of the decision as a comment to motivate these expected delays.


internal/latency/latency.go line 63 at r7 (raw file):

		dataDir:    dir,
		sessions:   cache,
		sessionsMu: &sync.Mutex{},

nit: you're returning a *Handler, so the Handler sessionMu field could be a literal mutex rather than a pointer, and no mutex alloc is necessary. You do not have code that copies the Handler but depends on the mutex remaining the same? I suggest just changing the type of sessionsMu to sync.Mutex, and removing this line.

Part of what got my attention is that the embedded mutex pattern is typical in Go. This isn't wrong, but it's extra steps, so maybe a little more complex than needed. Simple is good as often as we can find it. The whole will always be complex.


internal/latency/latency.go line 103 at r7 (raw file):

	_, err = rw.Write(b)
	if err != nil {
		rw.WriteHeader(http.StatusInternalServerError)

nit: any write prompts the http library to send a status OK header if no other header has been sent yet. So, the write error is either b/c it couldn't send even the "status ok" header or it happened after that and this header isn't meaningful / would be ignored.

I'm not really sure what to do affirmatively if there is a write error here. Logs could be too verbose, metrics may not be seen. metrics are probably the right way given the options I see.


internal/latency/latency.go line 142 at r7 (raw file):

	_, err = rw.Write(b)
	if err != nil {
		rw.WriteHeader(http.StatusInternalServerError)

Same comment as above about write errors for rw.


internal/latency/latency.go line 239 at r7 (raw file):

		defer session.SendTimesMu.Unlock()
		if m.Seq >= len(session.SendTimes) {
			return errorInvalidSeqN

A valid mid but invalid sequence number would be a very interesting error case.


internal/latency/latency.go line 284 at r7 (raw file):

		// The receive time should be recorded as soon as possible after
		// reading the packet, to improve accuracy.
		recvTime := time.Now()

Based on the comment, move this above the log statement?


internal/latency/latency.go line 287 at r7 (raw file):

		err = h.processPacket(conn, addr, buf[:n], recvTime)
		if err != nil {
			log.Debug("failed to process packet", "err", err)

Who is the intended audience for this log message? Would it help to include some conn metadata to make these messages include diagnostic hints, e.g. remote IP? It looks like the common cases will be corrupt json, invalid seq, or unauthorized. But, maybe we'll only rarely have debug enabled, but what will that person want to see?


internal/persistence/file.go line 42 at r7 (raw file):

	var filename string
	if subtest != "" {
		filename = datatype + "-" + subtest + "-" +

nit: is there a subtest name for latency that can preserve the previous structure? e.g. "network" for the network latency definition you added to the design, and leave as an option for other definitions in the future? e.g. "application" or other?

I'm just trying to make this goop simpler :)


pkg/latency/model/result.go line 45 at r2 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

Changed as discussed on VC.

This is much better.


pkg/latency/model/result.go line 137 at r7 (raw file):

		ID:              s.ID,
		GitShortCommit:  prometheusx.GitShortCommit,
		Version:         "",

Add TODO?


pkg/latency/model/result.go line 143 at r7 (raw file):

		Results:         s.Results,
		PacketsSent:     len(s.SendTimes),
		PacketsReceived: len(s.Results),

Is this correct?

Copy link
Contributor Author

@robertodauria robertodauria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


internal/latency/latency.go line 141 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Thank you. Please add a single sentence summary of the decision as a comment to motivate these expected delays.

Done.


internal/latency/latency.go line 103 at r7 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

nit: any write prompts the http library to send a status OK header if no other header has been sent yet. So, the write error is either b/c it couldn't send even the "status ok" header or it happened after that and this header isn't meaningful / would be ignored.

I'm not really sure what to do affirmatively if there is a write error here. Logs could be too verbose, metrics may not be seen. metrics are probably the right way given the options I see.

Added a TODO (will add Prometheus metrics in a future PR).


internal/latency/latency.go line 142 at r7 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Same comment as above about write errors for rw.

Added a TODO.


internal/latency/latency.go line 239 at r7 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

A valid mid but invalid sequence number would be a very interesting error case.

Done. I added an INFO log and a TODO to add a Prometheus metric. Not completely sure about the log level, which assumes we'll run this code in production with an INFO minium log level.


internal/latency/latency.go line 284 at r7 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Based on the comment, move this above the log statement?

Done.


internal/latency/latency.go line 287 at r7 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Who is the intended audience for this log message? Would it help to include some conn metadata to make these messages include diagnostic hints, e.g. remote IP? It looks like the common cases will be corrupt json, invalid seq, or unauthorized. But, maybe we'll only rarely have debug enabled, but what will that person want to see?

The intended audience is a developer running this manually with loglevel=DEBUG on their local machine. Since the unconditional DEBUG message just before this one already includes the address (and the packet's payload) I did not want to repeat the same info twice. I added the remote address.

Copy link
Contributor Author

@robertodauria robertodauria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


pkg/latency/model/result.go line 137 at r7 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Add TODO?

Done.


pkg/latency/model/result.go line 143 at r7 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Is this correct?

Done. (It was not.)


internal/persistence/file.go line 42 at r7 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

nit: is there a subtest name for latency that can preserve the previous structure? e.g. "network" for the network latency definition you added to the design, and leave as an option for other definitions in the future? e.g. "application" or other?

I'm just trying to make this goop simpler :)

That works. Reverted to the original file.go and used "network" as the subtest name.

Copy link
Contributor

@stephen-soltesz stephen-soltesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 3 files at r10, 2 of 2 files at r11, all commit messages.
Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)


cmd/msak-latency/main.go line 1 at r11 (raw file):

package main

Please stop adding new files to an already large PR.

Large PRs are slower to review, more likely to let bugs through, and generally lower the quality of the end result.


internal/latency/latency.go line 103 at r7 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

Added a TODO (will add Prometheus metrics in a future PR).

Please ensure the TODO includes removing the extraneous writes to rw also.


internal/latency/latency.go line 287 at r7 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

The intended audience is a developer running this manually with loglevel=DEBUG on their local machine. Since the unconditional DEBUG message just before this one already includes the address (and the packet's payload) I did not want to repeat the same info twice. I added the remote address.

Okay. Please format the addr the same way in both log lines. Currently one is "addr=addr" in the second it's "src=addr". My intention is to help a future developer line up separate messages, possibly among a flood of messages to find the two they want easily. This is hard won instinct!


internal/latency/latency.go line 52 at r11 (raw file):

		archive := i.Value().Archive()
		archive.EndTime = time.Now()
		_, err := persistence.WriteDataFile(dir, "latency", "network", archive.ID, archive)

I know I suggested "network" - since we are discussing "transport throughput" and "application throughput" - how about calling this "transport" for "transport latency", as distinct from a future test that could be "application latency"?


pkg/latency/model/result.go line 58 at r11 (raw file):

}

type RTT struct {

Rather than RTT.RTT how about RoundTrip.RTT?

Then the fields currently named Results can be RoundTrips []RoundTrip.

What do you think?

Copy link
Contributor

@stephen-soltesz stephen-soltesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)


pkg/latency/model/result.go line 58 at r11 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Rather than RTT.RTT how about RoundTrip.RTT?

Then the fields currently named Results can be RoundTrips []RoundTrip.

What do you think?

Or more literally, Packet.RTT and Packet.Lost?

Packets []Packet


pkg/latency/model/result.go line 132 at r11 (raw file):

		Started:   false,
		StartedMu: &sync.Mutex{},

I agree with what you said in chat - these do not need to be pointers either.

Copy link
Contributor Author

@robertodauria robertodauria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


internal/latency/latency.go line 103 at r7 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Please ensure the TODO includes removing the extraneous writes to rw also.

Just to make sure I understand, you're suggesting to not even try changing the status code on a write error? If so, I can just do it in this PR.


internal/latency/latency.go line 52 at r11 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

I know I suggested "network" - since we are discussing "transport throughput" and "application throughput" - how about calling this "transport" for "transport latency", as distinct from a future test that could be "application latency"?

That works.

Copy link
Contributor Author

@robertodauria robertodauria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


internal/latency/latency.go line 287 at r7 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Okay. Please format the addr the same way in both log lines. Currently one is "addr=addr" in the second it's "src=addr". My intention is to help a future developer line up separate messages, possibly among a flood of messages to find the two they want easily. This is hard won instinct!

Done.


pkg/latency/model/result.go line 58 at r11 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Or more literally, Packet.RTT and Packet.Lost?

Packets []Packet

It could be confused with LatencyPacket which is the one sent over the wire. Maybe I like RoundTrips more...


pkg/latency/model/result.go line 132 at r11 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

I agree with what you said in chat - these do not need to be pointers either.

Done. They aren't anymore.


cmd/msak-latency/main.go line 1 at r11 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Please stop adding new files to an already large PR.

Large PRs are slower to review, more likely to let bugs through, and generally lower the quality of the end result.

It was meant for a separate PR, I added it by mistake.

Copy link
Contributor Author

@robertodauria robertodauria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


pkg/latency/model/result.go line 58 at r11 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

It could be confused with LatencyPacket which is the one sent over the wire. Maybe I like RoundTrips more...

Changed to RoundTrips.

Copy link
Contributor

@stephen-soltesz stephen-soltesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)


internal/latency/latency.go line 103 at r7 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

Just to make sure I understand, you're suggesting to not even try changing the status code on a write error? If so, I can just do it in this PR.

Yes, if the write fails writing again will also fail for one of two possible reasons.


internal/latency/latency.go line 52 at r11 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

That works.

We discussed on VC that "transport" is not correct actually - but "application" is.

Copy link
Contributor

@stephen-soltesz stephen-soltesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @robertodauria)

@robertodauria robertodauria merged commit 64bdc6e into main Jun 23, 2023
4 of 5 checks passed
@robertodauria robertodauria deleted the latency branch June 23, 2023 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants