Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NTP fails upon boot when too fast #2886

Closed
private-creator opened this issue Feb 15, 2020 · 9 comments
Closed

NTP fails upon boot when too fast #2886

private-creator opened this issue Feb 15, 2020 · 9 comments
Labels
Category: Wifi Related to the network connectivity Type: Bug Considered a bug

Comments

@private-creator
Copy link

When the ntp server is very close by (like on the local network), at boot the time is set incorrectly. If the ntp server is further away, the time is set correctly every time. If the ntp server is queried again, the time is set correctly, regardless of how long the query took. This appears to be caused because the total_delay in getNtpTime() is too short. However, that doesn't explain why it only happens at boot -- sysTime is not involved here. This behaviour is very consistent though.

This is a reboot with a local LAN NTP server:

>reboot
1203556 : Command: reboot
1204564 : MQTT : Connection lost, state: Disconnected

 ets Jan  8 2013,rst cause:1, boot mode:(3,6)

load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v8b899c12
~ld
�U59 : 

INIT : Booting version: mega-20191003 (ESP82xx Core 2_5_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.1.2 PUYA support)
60 : INIT : Free RAM:31072
61 : INIT : Warm boot #18 Last Task: Background Task - Restart Reason: Software/System restart
63 : FS   : Mounting...
87 : FS   : Mount successful, used 75802 bytes of 957314
540 : CRC  : program checksum       ...OK
577 : CRC  : SecuritySettings CRC   ...OK 
681 : INIT : Free RAM:27952
682 : INIT : I2C
683 : INIT : SPI not enabled

734 : INFO : Plugins: 78 [Normal] [Testing] (ESP82xx Core 2_5_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.1.2 PUYA support)fpm close 3 

839 : WIFI : Set WiFi to STAmode : sta(xx:xx:xx:xx:xx:xx)
add if0

871 : WIFI : Connecting XXXXXXXX attempt #0
1368 : SYS  : 0.00,24976.00,31.00,-1.00
1649 : SYS  : 100.00,1.00,3360.00,0.00
2291 : BMx280 : Detected BME280
2771 : WD   : Uptime 0 ConnectFailures 0 FreeMem 24336 WiFiStatus 6
scandone
state: 0 -> 2 (b0)
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 1
cnt 

connected with XXXXXXXX, channel 1
dhcp client start...
ip:192.168.xxx.xxx,mask:255.255.255.0,gw:192.168.xxx.xxx
4831 : WIFI : Connected! AP: XXXXXXXX (xx:xx:xx:xx:xx:xx) Ch: 1 Duration: 3762 ms
4833 : WIFI : DHCP IP: 192.168.xxx.xxx (hostname) GW: 192.168.xxx.xxx SN: 255.255.255.0   duration: 24 ms
4847 : NTP  : NTP replied: delay 10 mSec Accuracy increased by 0.005 seconds
4848 : Time adjusted by -2085978491172.00 msec. Wander: -579438469.77 msec/second
4850 : Current Time Zone: STD time start: 2036-11-02 03:00:00 offset: 0 min
4858 : Webserver: start
4860 : firstLoopConnectionsEstablished

If I force an NTP query (by setting the ntp host to the same one it currently is, using the web UI):

338926 : SaveToFile: free stack: 3296
338995 : FILE : Saved config.dat
338996 : SaveToFile: free stack after: 3296
339014 : NTP  : NTP replied: delay 10 mSec Accuracy increased by 0.719 seconds
339016 : Time adjusted by 504240926439.30 msec. Wander: 140066924.01 msec/second
339017 : Current Time Zone: STD time start: 2020-11-01 03:00:00 offset: 0 min

And here is the same log with an NTP server set to my WAN address (physical distance ~2500mi):

>reboot
451624 : Command: reboot
452630 : MQTT : Connection lost, state: Disconnected

 ets Jan  8 2013,rst cause:1, boot mode:(3,6)

load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v8b899c12
~ld
�U63 : 

INIT : Booting version: mega-20191003 (ESP82xx Core 2_5_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.1.2 PUYA support)
64 : INIT : Free RAM:31072
65 : INIT : Warm boot #19 Last Task: Background Task - Restart Reason: Software/System restart
67 : FS   : Mounting...
91 : FS   : Mount successful, used 75802 bytes of 957314
544 : CRC  : program checksum       ...OK
581 : CRC  : SecuritySettings CRC   ...OK 
685 : INIT : Free RAM:27952
686 : INIT : I2C
687 : INIT : SPI not enabled
738 : INFO : Plugins: 78 [Normal] [Testing] (ESP82xx Core 2_5_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.1.2 PUYA support)fpm close 3 

843 : WIFI : Set WiFi to STAmode : sta(xx:xx:xx:xx:xx:xx)
add if0

876 : WIFI : Connecting XXXXXXXX attempt #0
1373 : SYS  : 0.00,24976.00,31.00,-1.00
1654 : SYS  : 100.00,1.00,3360.00,0.00
2295 : BMx280 : Detected BME280
2776 : WD   : Uptime 0 ConnectFailures 0 FreeMem 24336 WiFiStatus 6
scandone
state: 0 -> 2 (b0)
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 1
cnt 

connected with XXXXXXXX, channel 11
dhcp client start...
ip:192.168.xxx.xxx,mask:255.255.255.0,gw:192.168.xxx.xxx
4836 : WIFI : Connected! AP: XXXXXXXX (xx:xx:xx:xx:xx:xx) Ch: 11 Duration: 3763 ms
4838 : WIFI : DHCP IP: 192.168.xxx.xxx (hostname) GW: 192.168.xxx.xxx SN: 255.255.255.0   duration: 20 ms
4912 : NTP  : NTP replied: delay 70 mSec Accuracy increased by 0.467 seconds
4913 : Time adjusted by -1581738017628.08 msec. Wander: -439371671.56 msec/second
4915 : Current Time Zone: STD time start: 2020-11-01 03:00:00 offset: 0 min
4922 : Webserver: start
4925 : firstLoopConnectionsEstablished

And again, forcing a re-query of the NTP server:

147669 : SaveToFile: free stack: 3296
147738 : FILE : Saved config.dat
147739 : SaveToFile: free stack after: 3296
147818 : NTP  : NTP replied: delay 70 mSec Accuracy increased by 0.374 seconds
147819 : Time adjusted by -75.32 msec. Wander: -0.02 msec/second
@TD-er
Copy link
Member

TD-er commented Feb 15, 2020

It could be we have a bit too short wait after the "Got IP" event and sending out connection attempts.
I think the very first thing we need to do after being connected, is sending out a Gratuitous ARP and wait for a few tens of msec and then consider it to be a "stable" connection.
If we send out data which does need to get an reply, then the switches may not yet know how to route these packets back.
I have already introduced a little wait after the connection was established, or else the unit would crash quite often at boot, so maybe this is another race condition but then on a network level.

@TD-er TD-er added Category: Wifi Related to the network connectivity Type: Bug Considered a bug labels Feb 15, 2020
@private-creator
Copy link
Author

I managed to build from source. It looks like it may be some kind of rate-limiting issue, or other ntp-server-side issue.

At boot, two NTP queries are sent in quick succession. (On settings updates, only one is sent.) In my experimenting, the first query works fine, but the second query sometimes yields a bad result. (In my case, it might be some kind of rate-limiting?) I don't know where that first query comes from or goes to; it doesn't appear to go through getNtpDate() -- at least, not according to the serial console logs.

The response from the NTP server that yields a bad result looks like this:

00:17:47.274490 IP (tos 0xb8, ttl 64, id 13592, offset 0, flags [DF], proto UDP (17), length 76)
    192.168.xxx.xxx.123 > 192.168.xxx.xxx.45684: NTPv4, length 48
        Server, Leap indicator: clock unsynchronized (192), Stratum 0 (unspecified), poll 6 (64s), precision -20
        Root Delay: 0.000000, Root dispersion: 0.000000, Reference-ID: (unspec)
          Reference Timestamp:  0.000000000
          Originator Timestamp: 0.000000000
          Receive Timestamp:    0.000000000
          Transmit Timestamp:   0.000000000
            Originator - Receive Timestamp:  0.000000000
            Originator - Transmit Timestamp: 0.000000000

It's clearly a bad packet. And the leap indicator (LI) is a two-bit variable. The original RFC 958 says 11 is Reserved. But later RFCs (1305, 5905) call this "alarm condition" or "unknown" and "clock unsynchronized".

So, here is a solution (in getNtpTime(), after udp.read(), but before calculation of secsSince1900):

if ((packetBuffer[0] & 0b11000000) == 0b11000000) {
  String log = F("NTP  : NTP host (");
  log += timeServerIP.toString();
  log += ") unsynchronized";
  addLog(LOG_LEVEL_ERROR, log);
  return false;
} 

The false return will cause now() to avoid setting the time to an invalid value, and nextSyncTime will be unadjusted, so the next call to now() will try again.

@TD-er
Copy link
Member

TD-er commented Feb 16, 2020

So you mean right in between these lines:

      udp.read(packetBuffer, NTP_PACKET_SIZE); // read packet into the buffer

      // For more detailed info on improving accuracy, see:
      // https://github.com/lettier/ntpclient/issues/4#issuecomment-360703503
      // For now, we simply use half the reply time as delay compensation.

      unsigned long secsSince1900;

I've also noticed there is now an issue with the NTP, but it does look to be more than just NTP.
The thing is, when it happens here on my test node, it may take 30 seconds to 300 seconds for the NTP to get synced after boot, but also the web interface is completely unreachable until the NTP reply comes in.
So not sure what is causing this.

The test suite I'm seeing this on is running the (new and improved) controller for LoRaWAN I'm working on and that one is already sending out its first messages over LoRa (software serial) after 500 msec from boot, but then the WiFi has not yet finished switching on.
So that controller may interfere a lot with the WiFi setup process on the ESP.

So your fix may help fixing the symptom you're seeing, but I think the problem may be somewhere else.
Will test it tomorrow, as it is already close to 3 am here and I did not mean to stay up so late :( (at least the LoRa stuff now works pretty swift :) )

@private-creator
Copy link
Author

private-creator commented Feb 16, 2020

Yes, that was the location I meant.

In my tests, the web interface was fine, but it did report the current time as 1970-01-01 until the good NTP packet reply. (Presumably, sysTime is initialized to 0 at boot.) This is different behaviour than running with NTP disabled, which has a message about no system time source. However, it is the same behaviour on my build when NTP is enabled, but an invalid/unresponsive host is given.

Perhaps NTP should not be considered enabled until the first valid reply is received?

How does your build respond with: a) NTP disabled, and b) an invalid/unresponsive ntp host?

@TD-er
Copy link
Member

TD-er commented Feb 16, 2020

How does your build respond with: a) NTP disabled,

Just tested and it is the same, so I guess the symptom you're seeing with NTP may have different cause, but it doesn't hurt to add some sanity checks in the received NTP reply.

@TD-er
Copy link
Member

TD-er commented Feb 16, 2020

Can you test with a little change in code?

delay(30); // Must allow for some time to init.

Change it from 30 to 100 msec.
It does make my node available almost immediately after boot.

TD-er added a commit to TD-er/ESPEasy that referenced this issue Feb 17, 2020
See also letscontrolit#2886
This does make sure the initial network connection can transfer data.
On some nodes it took a lot of attempts to get a connection established right after boot.
@private-creator
Copy link
Author

Hmm, in my case, adding the delay did not seem to change the behaviour. It does look like the 2nd set of packets were delayed a bit more:

NTP communication, with initial 30ms delay:

03:02:21.532203 IP (tos 0x0, ttl 255, id 2, offset 0, flags [none], proto UDP (17), length 76)
03:02:21.532305 IP (tos 0xb8, ttl 64, id 12636, offset 0, flags [DF], proto UDP (17), length 76)
03:02:21.534956 IP (tos 0x0, ttl 255, id 3, offset 0, flags [none], proto UDP (17), length 76)
03:02:21.535022 IP (tos 0xb8, ttl 64, id 59426, offset 0, flags [DF], proto UDP (17), length 76)

With the delay at 100ms:

03:06:12.877947 IP (tos 0x0, ttl 255, id 2, offset 0, flags [none], proto UDP (17), length 76)
03:06:12.878038 IP (tos 0xb8, ttl 64, id 56218, offset 0, flags [DF], proto UDP (17), length 76)
03:06:13.039471 IP (tos 0x0, ttl 255, id 3, offset 0, flags [none], proto UDP (17), length 76)
03:06:13.039579 IP (tos 0xb8, ttl 64, id 1643, offset 0, flags [DF], proto UDP (17), length 76)

However, I did notice something else this evening, about why there are two attempts. The first attempt is to the DHCP server (or the NTP server specified by the DHCP server; in my network they are the same, so I can't tell). In particular, it is not the NTP host specified in the config. (Nor is it the default gateway, if that matters.)

But this first response is wholly ignored; it does not appear to original from getNtpTime(). In my network, this response always appears correct in the tcpdump output. If I query that NTP host again immediately (within getNtpTime()), that is when I get the unsynchronized result. This is why I think the error is some kind of rate-limit.

I do think that ignoring the bad response is good step.

Perhaps the initial NTP reply is ignored here in checkUDP()? (Actually, it doesn't appear to be, based on additional logging.)

ESPEasy/src/Networking.ino

Lines 122 to 127 in 0c6e6ca

if (portUDP.remotePort() == 123)
{
// unexpected NTP reply, drop for now...
runningUPDCheck = false;
return;
}

But I haven't found where that initial NTP request is coming from. (This is a potential source, but I tried commenting it out, and it was still sent.)

if (systemTimePresent()) {
initTime();
}

At this point, I expect the initial NTP request (and ignored response) is somewhere deep in the library code.

@TD-er
Copy link
Member

TD-er commented Feb 17, 2020

Just curious, does your DHCP server send a field populated with a suggested NTP server?
It could be that core 2.6.3 does now in the background query a NTP server, based on what is supplied in the DHCP reply.
If so, then our own DHCP query may interfere with this and the core based NTP query may block other traffic?

So could you also test with an older core version to see if it makes any difference?

@TD-er
Copy link
Member

TD-er commented Feb 26, 2020

I'm now looking into this issue.

At boot, two NTP queries are sent in quick succession.

Do you have an idea on the interval between those 2 queries that are being sent?

@TD-er TD-er closed this as completed in c345b10 Feb 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Category: Wifi Related to the network connectivity Type: Bug Considered a bug
Projects
None yet
Development

No branches or pull requests

2 participants