Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: use canonical time operations instead of bare uint/float/string #76

Merged
merged 15 commits into from
May 18, 2023

Conversation

ravsii
Copy link
Contributor

@ravsii ravsii commented May 12, 2023

Describe your changes

Durations were previously handled differently in different places, refactor some (for this draft PR) of them just to discuss the changes. Tests for TestCalcTime have been passed

Example output (time.Duration as of ravsii@e865f5c):
image

Example output (master as of 8c3d3b7)
image

There are some bugs at the moment:

  • (time.Duration) total uptime+downtime != total duration (last line); Probably related to the bug below
  • (both) longest consecutive downtime's seconds probably should match the timestamps without adding/removing an extra second. (Also 0 seconds is probably not the wanted result as well)

(context #73) I think the reason time.Duration was avoided initially is that helper functions, like x.Hours(), x.Minutes() return results with decimals, meaning time.Duration(10*time.Minute).Hours() will return 0.16... but not 0.
Another guess is that time.Duration don't provide helper functions for calculating lower units without higher units, like x := time.Duration(70*time.Minutes) will return x.Hours = 1.16... and x.Minutes() = 70, but not x.Minutes() = 10

Issue ticket number and link

Closes #73

Checklist before requesting a review

  • I have performed a self-review of my code
  • If it is a core feature, I have added tests.
  • I have run make check and there are no failures.

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

tcping.go Show resolved Hide resolved
@pouriyajamshidi pouriyajamshidi added the enhancement New feature or request label May 12, 2023
@pouriyajamshidi
Copy link
Owner

Hi @ravsii,

I looked through your changes and I've got to say well done! This can go in whenever you are ready with it.

Also the findings and bugs you mentioned in the description are 100% correct. Those are the things that made the code a bit too complex to begin with.

Need to find a solution for that.

@ravsii
Copy link
Contributor Author

ravsii commented May 13, 2023

I looked through your changes and I've got to say well done! This can go in whenever you are ready with it.

Thanks @pouriyajamshidi.

I'll check the codebase for other durations, which I could have missed / not changed yet.

As for bugs, I suggest we fix them in a separate PR/issue

@pouriyajamshidi
Copy link
Owner

I looked through your changes and I've got to say well done! This can go in whenever you are ready with it.

Thanks @pouriyajamshidi.

I'll check the codebase for other durations, which I could have missed / not changed yet.

As for bugs, I suggest we fix them in a separate PR/issue

Totally agree.

ravsii and others added 3 commits May 14, 2023 16:48
Remove 2 consts (thousandMillisend and oneSecond) which are basically the exact copy of time.Second
@ravsii
Copy link
Contributor Author

ravsii commented May 14, 2023

Hey, @pouriyajamshidi.

Is there a reason that this function exists?

tcping/tcping.go

Lines 478 to 481 in 8d4aa31

/* Get current system time */
func getSystemTime() time.Time {
return time.Now()
}

I think I mostly finished with the codebase review in terms of time-related changes and this is the last one

@pouriyajamshidi
Copy link
Owner

Hey, @pouriyajamshidi.

Is there a reason that this function exists?

tcping/tcping.go

Lines 478 to 481 in 8d4aa31

/* Get current system time */
func getSystemTime() time.Time {
return time.Now()
}

I think I mostly finished with the codebase review in terms of time-related changes and this is the last one

Hey @ravsii,

It once was providing some value but I don't think it really adds anything anymore.

@ravsii
Copy link
Contributor Author

ravsii commented May 14, 2023

@pouriyajamshidi
Actually, before doing any more changes (because I think I'm going a bit outside of the initial issue) I'd like to ask, are you okay with me completely removing nullTimeValue as well and replace those checks with default isZero()? example here 8024cbb, I'll revert if not, no problems

@pouriyajamshidi
Copy link
Owner

@pouriyajamshidi Actually, before doing any more changes (because I think I'm going a bit outside of the initial issue) I'd like to ask, are you okay with me completely removing nullTimeValue as well and replace those checks with default isZero()? example here 8024cbb, I'll revert if not, no problems

Yea, it is cleaner your way.
No issues.

@ravsii ravsii changed the title DRAFT: refactor: use time.Duration instead of uint/float DRAFT: refactor: use time.Duration instead of uint/float, isZero() instead of string comparison May 14, 2023
@ravsii ravsii marked this pull request as ready for review May 14, 2023 16:02
@ravsii ravsii changed the title DRAFT: refactor: use time.Duration instead of uint/float, isZero() instead of string comparison refactor: use canonical time operations instead of bare uint/float/string May 14, 2023
@pouriyajamshidi
Copy link
Owner

While reviewing your PR, found another bug that got me side-tracked a bit to fix.

Will get back to yours. Need to check a small thing with your PR. Apart from that, awesome job 👍🏻

tcping.go Outdated Show resolved Hide resolved
@pouriyajamshidi
Copy link
Owner

Ok, I did some more testing and as expected, the small downtimes (less than 1 second) are shown as zero. This will propagate throughout the runtime and say after a few days/hours of it running, users will see very strange and "inaccurate" reports.

It is better to address it in this PR since these changes (although correct) are causing a bit of drift.

Below picture is just a reference:

image

@ravsii
Copy link
Contributor Author

ravsii commented May 14, 2023

@pouriyajamshidi
I see. I have an idea how we can fix inaccurate seconds (having 1 extra/less), but fixing 0 seconds could be a little more problematic.

If both time.Now() for start/end durations will return the same amount of seconds, displaying 1 second is incorrect, because we assume that 1 probe is 1 second, but displaying 0 seconds feels wrong as well.
i.e, one of the first screenshots here, same time, 0 seconds:
image
But here: same time, 1 second
image

I hope this will get magically fixed by one of the comments regarding now mentioned above, if not I wonder if this bug existed in previous versions

P. S. guess it's time to add more tests 😄

@pouriyajamshidi
Copy link
Owner

@pouriyajamshidi I see. I have an idea how we can fix inaccurate seconds (having 1 extra/less), but fixing 0 seconds could be a little more problematic.

I'm excited to hear about it. However, do not feel the need to rush things. Take your time.

If both time.Now() for start/end durations will return the same amount of seconds, displaying 1 second is incorrect, because we assume that 1 probe is 1 second, but displaying 0 seconds feels wrong as well. i.e, one of the first screenshots here, same time, 0 seconds: image But here: same time, 1 second image

Absolutely right. This is another thing that I am guilty of 😄. The inconsistencies of timing reports. In this case, I refrained from bumping up the time as well and resorted to just using int/uint which you are helping to get away from.

I hope this will get magically fixed by one of the comments regarding now mentioned above, if not I wonder if this bug existed in previous versions

I will get to that comment soon. This bug has indeed existed for a long time but I never get the time I need to dig deep and come up with a fix.

P. S. guess it's time to add more tests 😄

yes, yes and yes 😄 hopefully more people like you would step to lend a hand.

tcping.go Outdated Show resolved Hide resolved
@ravsii
Copy link
Contributor Author

ravsii commented May 15, 2023

So, after playing with it for a while, I actually think that the problem is somewhere else, to be exact, in these lines:

tcpStats.totalUptime += time.Second

tcpStats.totalDowntime += time.Second

I dont think we can 100% guarantee that this block of code (old version)

tcping/tcping.go

Lines 529 to 543 in 761ae0f

connStart := getSystemTime()
conn, err := net.DialTimeout("tcp", IPAndPort.String(), oneSecond)
connEnd := time.Since(connStart)
rtt := nanoToMillisecond(connEnd.Nanoseconds())
now := getSystemTime()
if err != nil {
tcpStats.handleConnError(now)
} else {
tcpStats.handleConnSuccess(rtt, now)
conn.Close()
}
time.Sleep(thousandMilliSecond - connEnd)

will take EXACTLY 1 * time.Second, even with time.Sleep, or any other similar solutions, like time.Round/time.Truncate on start/end timestamps for total/longest durations, due to how modern CPUs' task-switching works (as far as I know)

I suggest instead of incrementing it exactly by 1 second we can just pass a duration and increment it by that (actually precise) duration. Of course that doesn't guarantee us that 100 probes would equal to exactly 100 seconds, but the data total/longest timestamps and duration should be in sync with one another. Maybe I'm wrong.

Honestly, I think instead of time.Sleep we switch to time.Ticker in func tcping, although (as I said), I don't think it will ever guarantee a 100% match, it seems like it fits this task much better than sleep. Not sure if we should do this as a part of this PR

NewTicker returns a new Ticker containing a channel that will send the current time on the channel after each tick. The period of the ticks is specified by the duration argument. The ticker will adjust the time interval or drop ticks[1][2] to make up for slow receivers. The duration d must be greater than zero; if not, NewTicker will panic. Stop the ticker to release associated resources.

  1. docs
  2. SO

Let me know what you think @pouriyajamshidi

@pouriyajamshidi
Copy link
Owner

pouriyajamshidi commented May 16, 2023

First off, I appreciate your inputs and the effort you put in. 🤝

So, after playing with it for a while, I actually think that the problem is somewhere else, to be exact, in these lines:

tcpStats.totalUptime += time.Second

tcpStats.totalDowntime += time.Second

Yes, quite plausible that this is the case.

I dont think we can 100% guarantee that this block of code (old version)

tcping/tcping.go

Lines 529 to 543 in 761ae0f

connStart := getSystemTime()
conn, err := net.DialTimeout("tcp", IPAndPort.String(), oneSecond)
connEnd := time.Since(connStart)
rtt := nanoToMillisecond(connEnd.Nanoseconds())
now := getSystemTime()
if err != nil {
tcpStats.handleConnError(now)
} else {
tcpStats.handleConnSuccess(rtt, now)
conn.Close()
}
time.Sleep(thousandMilliSecond - connEnd)

will take EXACTLY 1 * time.Second, even with time.Sleep, or any other similar solutions, like time.Round/time.Truncate on start/end timestamps for total/longest durations, due to how modern CPUs' task-switching works (as far as I know)

There is really no need to go to the extreme to make sure we are right on the dot. The main goal is to keep the program simple, because it has a relatively basic functionality. As long as we are reporting a close enough result, I think we should be happy. Of course, as we progress, we can always reiterate and improve it.

I suggest instead of incrementing it exactly by 1 second we can just pass a duration and increment it by that (actually precise) duration. Of course that doesn't guarantee us that 100 probes would equal to exactly 100 seconds, but the data total/longest timestamps and duration should be in sync with one another. Maybe I'm wrong.

This sounds reasonable to me. Again, as long as the actual duration represents a close enough relation to the reports, it is fine.

Honestly, I think instead of time.Sleep we switch to time.Ticker in func tcping, although (as I said), I don't think it will ever guarantee a 100% match, it seems like it fits this task much better than sleep.

Ticker does sound like a good candidate. However, does it really bring more added value? Considering we cannot guarantee a 100% correct timing anyway with the cost of an additional thread? And in fact, we are roughly spending no more or less than a second in the tcping function intentionally.

Please note that I am not opposing the idea here, just brainstorming.

Not sure if we should do this as a part of this PR

That can be part of another PR. Thankfully, you have made great progress in tidying up and ironing out some of the technical debts. My main concern regarding this PR is showing 0, when we should show 1 or 2 when we should show 1. Because to the average user who is using this tool in their network or just monitoring their internet connection, that is more important. They could say "there was a 'second' of downtime but the tool says zero seconds". That is enough to deter them from using the tool.

@ravsii
Copy link
Contributor Author

ravsii commented May 16, 2023

My main concern regarding this PR is showing 0, when we should show 1 or 2 when we should show 1. Because to the average user who is using this tool in their network or just monitoring their internet connection, that is more important. They could say "there was a 'second' of downtime but the tool says zero seconds". That is enough to deter them from using the tool.

Makes sense, I'll focus on that in this PR then.

I dont think we can 100% guarantee that this block of code (old version)

tcping/tcping.go

Lines 529 to 543 in 761ae0f

connStart := getSystemTime()
conn, err := net.DialTimeout("tcp", IPAndPort.String(), oneSecond)
connEnd := time.Since(connStart)
rtt := nanoToMillisecond(connEnd.Nanoseconds())
now := getSystemTime()
if err != nil {
tcpStats.handleConnError(now)
} else {
tcpStats.handleConnSuccess(rtt, now)
conn.Close()
}
time.Sleep(thousandMilliSecond - connEnd)

will take EXACTLY 1 * time.Second, even with time.Sleep, or any other similar solutions, like time.Round/time.Truncate on start/end timestamps for total/longest durations, due to how modern CPUs' task-switching works (as far as I know)

There is really no need to go to the extreme to make sure we are right on the dot. The main goal is to keep the program simple, because it has a relatively basic functionality. As long as we are reporting a close enough result, I think we should be happy. Of course, as we progress, we can always reiterate and improve it.

My bad for not explaining it the right way. What I meant is the root of the problem is that in total uptime/downtime we adding exactly 1 second, but when calculating longest uptime/downtime, we're actually using time.Since(), that gives us time.Duration (which is just uint in nanosec, just a reminder). So in result we can do exactly 1 second to totalUptime, but the actual longest uptime/downtime timestamps could be from 10:00:00.0000 to 10:00:00.999999, meaning longest duration will be parsed as 0 seconds. (math.Ceil is not an option either, it could add an extra second)

That's why I said we needed exactly 1 sec.

@pouriyajamshidi
Copy link
Owner

My bad for not explaining it the right way. What I meant is the root of the problem is that in total uptime/downtime we adding exactly 1 second, but when calculating longest uptime/downtime, we're actually using time.Since(), that gives us time.Duration (which is just uint in nanosec, just a reminder). So in result we can do exactly 1 second to totalUptime, but the actual longest uptime/downtime timestamps could be from 10:00:00.0000 to 10:00:00.999999, meaning longest duration will be parsed as 0 seconds. (math.Ceil is not an option either, it could add an extra second)

That's why I said we needed exactly 1 sec.

Aha, now it makes sense. Thanks for clearing it up.

I trust in you and your expertise. Feel free to continue with what you have in mind, whether in this PR or another. I have no objection and in fact, welcome it.

@ravsii
Copy link
Contributor Author

ravsii commented May 17, 2023

So, if we're targeting specifically "good-looking" results to the end user in this PR, this is probably the only achievable way to do this, @pouriyajamshidi. (commits above)

To be honest, while trying many possible changes for this to work, I think some code needs to be reworked, especially the way we handle "probe timeout". It was (and still is) hard-coded as 1*time.Second, which is the main "bottleneck" to get this working as expected.

I don't think that there's anything criminal about reporting a total uptime of 6.7 seconds for 7 probes, which would actually be a lot more accurate (in both ways, calculations and data representing). When (if?) we get this to work (maybe even postpone/reject this PR) I think that part needs attention the most. Just some thoughts, but I guess that's for another issue.

tcping.go Outdated Show resolved Hide resolved
@pouriyajamshidi
Copy link
Owner

To be honest, while trying many possible changes for this to work, I think some code needs to be reworked, especially the way we handle "probe timeout". It was (and still is) hard-coded as 1*time.Second, which is the main "bottleneck" to get this working as expected.

That is correct but don't let that discourage you. This can be addressed here or in the future.

I don't think that there's anything criminal about reporting a total uptime of 6.7 seconds for 7 probes, which would actually be a lot more accurate (in both ways, calculations and data representing). When (if?) we get this to work (maybe even postpone/reject this PR) I think that part needs attention the most. Just some thoughts, but I guess that's for another issue.

No, definitely nothing criminal about that. Apart from the looks, there is a reality factor to it too. For instance, if a server was up for a minute and you probed it 60 times, regardless of CPU time, RTT, etc... that server was up for 60 seconds.

Although I do understand where you are coming from and have no objection in altering the current behavior as long as the drift is insignificant.

I quite like this PR and the way you have cleaned up many of the existing hacks. With that being said, I am eager to merge it in whenever you feel confident with it.

Maybe we can cosmetically fix some of the nuances but not sure how much of a headache/work that brings for now.

@ravsii
Copy link
Contributor Author

ravsii commented May 18, 2023

Thanks for your thoughts.

I'm fine with this PR as-is, but if want me to change something in particular or if you've found bugs, feel free to ask for a change, because I haven't found any, and as my experience says, it's usually a sign that I've missed something haha.

For instance, if a server was up for a minute and you probed it 60 times, regardless of CPU time, RTT, etc... that server was up for 60 seconds.

Yeah that totally makes sense. I just don't think we'll be able to produce 100% matching timestamps/durations, or at least not in 100% cases. Users are probably expecting such output, so it needs some time to think about possible solutions without hard-coding anything.

finally it's ready to be closed/merged, haha

@pouriyajamshidi
Copy link
Owner

Thanks for your thoughts.

I'm fine with this PR as-is, but if want me to change something in particular or if you've found bugs, feel free to ask for a change, because I haven't found any, and as my experience says, it's usually a sign that I've missed something haha.

You have a good eye for catching bugs 😄 . I will run a few tests as well in a while.

Yeah that totally makes sense. I just don't think we'll be able to produce 100% matching timestamps/durations, or at least not in 100% cases. Users are probably expecting such output, so it needs some time to think about possible solutions without hard-coding anything.

At this stage, I think we can be a bit lenient and be fine with not producing a definite exact result. We can always improve it in the future..

finally it's ready to be closed/merged, haha

Cool. Will get back to you after some tests.

@pouriyajamshidi
Copy link
Owner

Alright, I should say in overall it is quite good and can be merged.

How much work would it be to cosmetically show 1 instead of 0 when we have single probe failures?

image

P.s. I really do not mind addressing that in a separate PR. Let me know what you think

@ravsii
Copy link
Contributor Author

ravsii commented May 18, 2023

How much work would it be to cosmetically show 1 instead of 0 when we have single probe failures?

Like in longest consecutive downtime on the first screenshot? Make it 1 second, instead of 0 seconds, assuming that there was a very short downtime? that could be fixed very easily, but that will be just a visual change.

Also total downtime is 2 seconds off, which wasn't expected tbh, but that what I was talking about when saying

I haven't found any, and as my experience says, it's usually a sign that I've missed something haha.

@pouriyajamshidi
Copy link
Owner

Like in longest consecutive downtime on the first screenshot? Make it 1 second, instead of 0 seconds, assuming that there was a very short downtime? that could be fixed very easily, but that will be just a visual change.

Sorry had to be more specific. yes, the longest consecutive downtime on the first screenshot.

At this point I am even thinking of coming up with a different message if the downtime is ~1 second. But I will keep it for another PR. Don't wanna clutter this.

Also total downtime is 2 seconds off, which wasn't expected tbh, but that what I was talking about when saying

Yea, getting the timing right with the mess we had is not so easy 😄

@ravsii
Copy link
Contributor Author

ravsii commented May 18, 2023

Yea, getting the timing right with the mess we had is not so easy 😄

haha exactly.

Like in longest consecutive downtime on the first screenshot? Make it 1 second, instead of 0 seconds, assuming that there was a very short downtime? that could be fixed very easily, but that will be just a visual change.

Sorry had to be more specific. yes, the longest consecutive downtime on the first screenshot.

At this point I am even thinking of coming up with a different message if the downtime is ~1 second. But I will keep it for another PR. Don't wanna clutter this.

Sure. I'll fix this soon and let's merge it then.

@ravsii
Copy link
Contributor Author

ravsii commented May 18, 2023

Randomly found this bug:
image

(Should be fixed now)

This is probably why 0 seconds for 2 seconds downtime has happened btw. So I think we're good for this PR. Feel free to merge if you think everything is ok.

@pouriyajamshidi
Copy link
Owner

Randomly found this bug:

(Should be fixed now)

whoopsie! I missed that too :S

This is probably why 0 seconds for 2 seconds downtime has happened btw. So I think we're good for this PR. Feel free to merge if you think everything is ok.

Alrighty, let's merge it in.

Copy link
Owner

@pouriyajamshidi pouriyajamshidi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!

@pouriyajamshidi pouriyajamshidi merged commit 4032ed2 into pouriyajamshidi:master May 18, 2023
2 checks passed
@ravsii ravsii deleted the refactor/duration branch May 18, 2023 20:17
@ravsii ravsii mentioned this pull request May 19, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Use standard time.Duration for storing/working with durations
2 participants