NTP fails upon boot when too fast #2886
It could be that the wait between the "Got IP" event and the first connection attempts is a bit too short.
I managed to build from source. It looks like it may be some kind of rate-limiting issue, or other ntp-server-side issue. At boot, two NTP queries are sent in quick succession. (On settings updates, only one is sent.) In my experimenting, the first query works fine, but the second query sometimes yields a bad result. (In my case, it might be some kind of rate-limiting?) I don't know where that first query comes from or goes to; it doesn't appear to go through getNtpDate() -- at least, not according to the serial console logs. The response from the NTP server that yields a bad result looks like this:
It's clearly a bad packet. And the leap indicator (LI) is a two-bit variable. The original RFC 958 says 11 is Reserved. But later RFCs (1305, 5905) call this "alarm condition" or "unknown" and "clock unsynchronized". So, here is a solution (in getNtpTime(), after udp.read(), but before calculation of secsSince1900):
The false return will cause now() to avoid setting the time to an invalid value, and nextSyncTime will be unadjusted, so the next call to now() will try again.
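The snippet that belonged to the comment above did not survive on this page, so here is a minimal sketch of the kind of check it describes, not the original patch. It assumes a raw 48-byte NTP reply in a byte buffer; the helper names `ntpLeapIndicator` and `ntpReplyUsable` are hypothetical and not part of ESPEasy.

```cpp
#include <cstddef>
#include <cstdint>

// The leap indicator (LI) occupies the top two bits of the first byte of an
// NTP packet. RFC 5905 defines the value 3 (binary 11) as
// "unknown (clock unsynchronized)".
constexpr uint8_t NTP_LI_UNSYNCHRONIZED = 3;
constexpr size_t  NTP_MIN_PACKET_SIZE   = 48;

// Hypothetical helper: extract the two-bit LI field from a raw reply.
inline uint8_t ntpLeapIndicator(const uint8_t *packet) {
  return (packet[0] >> 6) & 0x03;
}

// Hypothetical helper: false when the reply is too short or the server
// reports that its clock is unsynchronized, true otherwise.
inline bool ntpReplyUsable(const uint8_t *packet, size_t len) {
  if (packet == nullptr || len < NTP_MIN_PACKET_SIZE) {
    return false;
  }
  return ntpLeapIndicator(packet) != NTP_LI_UNSYNCHRONIZED;
}
```

Inside getNtpTime() the idea would be to return false right after udp.read() when such a check fails, so the caller retries on the next sync instead of accepting the timestamp.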
So you mean right in between these lines:
    udp.read(packetBuffer, NTP_PACKET_SIZE); // read packet into the buffer
    // For more detailed info on improving accuracy, see:
    // https://github.com/lettier/ntpclient/issues/4#issuecomment-360703503
    // For now, we simply use half the reply time as delay compensation.
    unsigned long secsSince1900;
I've also noticed there is now an issue with NTP, but it looks to be more than just NTP. The test suite I'm seeing this on is running the (new and improved) controller for LoRaWAN I'm working on. That one already sends out its first messages over LoRa (software serial) 500 msec after boot, at which point the WiFi has not yet finished switching on. So your fix may help with the symptom you're seeing, but I think the problem may be somewhere else.
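For readers following the quoted code comment about using half the reply time as delay compensation, here is a rough sketch of that calculation. It is an illustration under assumptions, not the ESPEasy implementation: it reads only the whole-seconds part of the transmit timestamp (bytes 40 to 43 of the reply), ignores the fractional part, and the helper names are hypothetical.

```cpp
#include <cstdint>

// Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
constexpr uint32_t NTP_UNIX_OFFSET = 2208988800UL;

// Hypothetical helper: read the big-endian 32-bit seconds field of the
// transmit timestamp, stored in bytes 40..43 of an NTP reply.
inline uint32_t ntpTransmitSeconds(const uint8_t *packet) {
  return (uint32_t(packet[40]) << 24) |
         (uint32_t(packet[41]) << 16) |
         (uint32_t(packet[42]) << 8)  |
          uint32_t(packet[43]);
}

// Hypothetical helper: convert the reply to Unix time in milliseconds and add
// half of the measured round-trip time as a simple one-way delay estimate.
// Assumes the timestamp is not earlier than 1970-01-01.
inline uint64_t ntpToUnixMillis(const uint8_t *packet, uint32_t roundTripMs) {
  const uint32_t secsSince1900 = ntpTransmitSeconds(packet);
  const uint64_t unixSecs      = uint64_t(secsSince1900) - NTP_UNIX_OFFSET;
  return unixSecs * 1000ULL + roundTripMs / 2;
}
```

Halving the round trip assumes the request and the reply take about equally long, which is only an approximation.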
Yes, that was the location I meant. In my tests, the web interface was fine, but it did report the current time as 1970-01-01 until the good NTP packet reply. (Presumably, sysTime is initialized to 0 at boot.) This is different behaviour than running with NTP disabled, which has a message about no system time source. However, it is the same behaviour on my build when NTP is enabled, but an invalid/unresponsive host is given. Perhaps NTP should not be considered enabled until the first valid reply is received? How does your build respond with: a) NTP disabled, and b) an invalid/unresponsive ntp host?
Just tested and it is the same, so I guess the symptom you're seeing with NTP may have a different cause, but it doesn't hurt to add some sanity checks on the received NTP reply.
Can you test with a little change in code? Line 473 in a6031ce
Change it from 30 to 100 msec.
See also letscontrolit#2886
This does make sure the initial network connection can transfer data. On some nodes it took a lot of attempts to get a connection established right after boot.
Hmm, in my case, adding the delay did not seem to change the behaviour. It does look like the 2nd set of packets was delayed a bit more.
NTP communication, with initial 30ms delay:
With the delay at 100ms:
However, I did notice something else this evening, about why there are two attempts. The first attempt is to the DHCP server (or the NTP server specified by the DHCP server; in my network they are the same, so I can't tell). In particular, it is not the NTP host specified in the config. (Nor is it the default gateway, if that matters.) But this first response is wholly ignored; it does not appear to originate from getNtpTime(). In my network, this response always appears correct in the tcpdump output. If I query that NTP host again immediately (within getNtpTime()), that is when I get the unsynchronized result. This is why I think the error is some kind of rate limit. I do think that ignoring the bad response is a good step. Perhaps the initial NTP reply is ignored here in checkUDP()? (Actually, it doesn't appear to be, based on additional logging.) Lines 122 to 127 in 0c6e6ca
But I haven't found where that initial NTP request is coming from. (This is a potential source, but I tried commenting it out, and it was still sent.) ESPEasy/src/ESPEasyWifi_ProcessEvent.ino Lines 259 to 261 in 0c6e6ca
At this point, I expect the initial NTP request (and ignored response) is somewhere deep in the library code.
Just curious, does your DHCP server send a field populated with a suggested NTP server? So could you also test with an older core version to see if it makes any difference?
I'm now looking into this issue.
Do you have an idea of the interval between those 2 queries that are being sent?
When the ntp server is very close by (like on the local network), at boot the time is set incorrectly. If the ntp server is further away, the time is set correctly every time. If the ntp server is queried again, the time is set correctly, regardless of how long the query took. This appears to happen because the total_delay in getNtpTime() is too short. However, that doesn't explain why it only happens at boot -- sysTime is not involved here. This behaviour is very consistent though.
This is a reboot with a local LAN NTP server:
If I force an NTP query (by setting the ntp host to the same one it currently is, using the web UI):
And here is the same log with an NTP server set to my WAN address (physical distance ~2500mi):
And again, forcing a re-query of the NTP server: