ERRO Eek... Too late, we missed an event schedule, system time might be off? #1580

Closed
adascanBovich opened this issue Jan 16, 2020 · 14 comments · Fixed by #1780

Comments

@adascanBovich commented Jan 16, 2020

Describe the bug
My nodes keep missing blocks from time to time because of the "ERRO Eek... Too late, we missed an event schedule, system time might be off?" error. The system clock is in sync and ntp is installed. It happens on different nodes with no pattern: a node can lose a block to this error, then make 3 in a row just fine.
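
(For anyone trying to reproduce: a quick way to double-check the two things this error points at, the system clock and the leader schedule, is sketched below. The REST address and port are only an example; use whatever your node's rest.listen is set to.)

# Verify the system clock really is in sync (pick the tool that matches your setup)
timedatectl status     # look for "System clock synchronized: yes"
chronyc tracking       # or: ntpq -p, if classic ntpd is used instead of chrony

# Pull the leader schedule/logs from the node's REST API
jcli rest v0 leaders logs get -h http://127.0.0.1:3100/api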

Mandatory Information

  1. jcli 0.8.6 (HEAD-7686397, release, linux [x86_64]) - [rustc 1.40.0 (73528e339 2019-12-16)]
  2. jormungandr 0.8.6 (HEAD-7686397, release, linux [x86_64]) - [rustc 1.40.0 (73528e339 2019-12-16)]

To Reproduce
Steps to reproduce the behavior:

  1. Start a pool.
  2. Check logs every time you are missing your blocks.
  3. Try not to cry.

Expected behavior
Blocks should be created just fine with the system clock synced.

Additional context
A couple of examples:

Screenshot 2020-01-16 at 13 43 40

Screenshot 2020-01-16 at 18 01 49

@Proxiweb commented Jan 16, 2020

I had this happen twice with 0.8.6. I switched back to 0.8.5 and it has not happened again.

@mmahut (Contributor) commented Jan 16, 2020

Seeing this as well.

Jan 16 15:44:37.326 ERRO Eek... Too late, we missed an event schedule, system time might be off?, event_end: 2020-01-16T14:44:37+00:00, event_start: 2020-01-16T14:44:35+00:00, event_date: 33.35129, leader_id: 1, task: leadership
Jan 16 17:43:52.056 ERRO Eek... Too late, we missed an event schedule, system time might be off?, event_end: 2020-01-16T16:43:51+00:00, event_start: 2020-01-16T16:43:49+00:00, event_date: 33.38706, leader_id: 1, task: leadership

However, not all the time:

---
- created_at_time: "2020-01-16T07:20:28.087825432+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-01-16T15:58:29.011607694+00:00"
  scheduled_at_date: "33.37346"
  scheduled_at_time: "2020-01-16T15:58:29+00:00"
  status:
    Block:
      block: 5a60962d5592ff31bc9e89a3cedd167bbeab510ce0b9cc7cb749ec1248828c8d
      chain_length: 103869
  wake_at_time: "2020-01-16T15:58:29.001486315+00:00"
- created_at_time: "2020-01-16T07:20:28.087826193+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-01-16T16:41:43.009136451+00:00"
  scheduled_at_date: "33.38643"
  scheduled_at_time: "2020-01-16T16:41:43+00:00"
  status:
    Block:
      block: 27a03c579d2901787163c7e40c2966265ebf7007590766067c4deab1cc2bb0b8
      chain_length: 103962
  wake_at_time: "2020-01-16T16:41:43.005696007+00:00"
- created_at_time: "2020-01-16T07:20:28.087826973+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-01-16T16:43:52.057019303+00:00"
  scheduled_at_date: "33.38706"
  scheduled_at_time: "2020-01-16T16:43:49+00:00"
  status:
    Rejected:
      reason: Missed the deadline to compute the schedule
  wake_at_time: "2020-01-16T16:43:52.056439497+00:00"

NTP appears to be spot on according to monitoring.
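
(As a quick way to pull only the rejected entries out of this output, assuming the jcli invocation used elsewhere in this thread and that the YAML layout stays as above; the grep context width is just a guess at how many lines precede the reason field:)

jcli rest v0 leaders logs get -h http://127.0.0.1:3100/api | grep -B 9 "Missed the deadline"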

@hodlonaut commented Jan 17, 2020

Can confirm this is happening here too. All nodes on the machine are up and seemingly in sync. In my case the leader IDs were switched from node A to node B about 7 seconds prior to the leader event, but the leader logs should have been identical. Then, when the time arrived, the node without the leader IDs woke up for the leadership event on time while the other one was late.

The 3-second delay doesn't seem to be a coincidence; it feels like a bug somewhere.

greg2@myserver2-4gb-hel1-1:~/cardano/jormungandr$ ./logs8502.sh | grep -C 6 "34.8703"
scheduled_at_time: "2020-01-17T16:54:11+00:00"
status: Pending
wake_at_time: ~

- created_at_time: "2020-01-16T23:09:12.648007808+00:00"
  enclave_leader_id: 3
  finished_at_time: "2020-01-17T00:03:46.302390633+00:00"
  scheduled_at_date: "34.8703"
  scheduled_at_time: "2020-01-17T00:03:43+00:00"
  status:
    Rejected:
      reason: Missed the deadline to compute the schedule
  wake_at_time: "2020-01-17T00:03:46.302292649+00:00"

greg2@myserver2-4gb-hel1-1:~/cardano/jormungandr$ ./logs8504.sh | grep -C 6 "34.8703"
scheduled_at_time: "2020-01-17T12:26:59+00:00"
status: Pending
wake_at_time: ~

- created_at_time: "2020-01-16T23:25:18.369987314+00:00"
  enclave_leader_id: 3
  finished_at_time: "2020-01-17T00:03:43.007313512+00:00"
  scheduled_at_date: "34.8703"
  scheduled_at_time: "2020-01-17T00:03:43+00:00"
  status:
    Rejected:
      reason: "Cannot sign the block: This leader 3 is not in the enclave"
  wake_at_time: "2020-01-17T00:03:43.001179318+00:00"

@PegasusPool commented Jan 17, 2020

Same issue here on 0.8.6.

- created_at_time: "2020-01-17T08:11:41.864322871+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-01-17T09:16:42.243511068+00:00"
  scheduled_at_date: "34.25290"
  scheduled_at_time: "2020-01-17T09:16:37+00:00"
  status:
    Rejected:
      reason: Missed the deadline to compute the schedule
  wake_at_time: "2020-01-17T09:16:42.243471768+00:00"

Tried both chrony and NTP, but it doesn't seem to make any difference.

@NicolasDP NicolasDP self-assigned this Jan 17, 2020
@NicolasDP NicolasDP added this to Backlog (not in scope) in jormungandr-dmb via automation Jan 17, 2020
@NicolasDP NicolasDP added this to the Net improvement 2 milestone Jan 17, 2020
@NicolasDP NicolasDP moved this from Backlog (not in scope) to In scope to do in jormungandr-dmb Jan 17, 2020
@NicolasDP (Contributor) commented Jan 17, 2020

Thanks everyone. I will be looking into this for the next release.

@hodlonaut commented Jan 17, 2020

@NicolasDP And here are example INFO-level logs preceding the issue, showing that the node is busy doing other things instead of waking up for the very important event, followed by a snippet of logs from another node running on the same box that has the same leader logs and seems to wake up for the same event in a timely fashion (but, because it doesn't have the leader ID, it just does nothing, as expected):

Node 1

Jan 17 23:48:42.545 INFO connecting to peer, node_id: 1aef7c3bdd297f099577f527b1d9dd4c8cbabe717591ebf2, peer_addr: 138.68.109.144:3000, task: network
Jan 17 23:48:42.645 INFO connection to peer failed, reason: protocol handshake failed: unknown error, node_id: 63e779f013cd9bfe4838b5e4e7bd09f81525b3fe3cca76b0, peer_addr: 208.113.200.150:3000, task: network
Jan 17 23:48:42.664 INFO move node to quarantine, id: 63e779f013cd9bfe4838b5e4e7bd09f81525b3fe3cca76b0, sub_task: policy, task: poldercast
Jan 17 23:48:42.664 INFO connecting to peer, node_id: 46b7cc025dea756b1b5fd6891a3fd92d268076b592ede391, peer_addr: 52.214.209.130:3000, task: network
Jan 17 23:48:42.711 INFO failed to connect to peer, reason: Connection refused (os error 111), node_id: 46b7cc025dea756b1b5fd6891a3fd92d268076b592ede391, peer_addr: 52.214.209.130:3000, task: network
Jan 17 23:48:42.727 INFO move node to quarantine, id: 46b7cc025dea756b1b5fd6891a3fd92d268076b592ede391, sub_task: policy, task: poldercast
Jan 17 23:48:42.880 INFO connecting to peer, node_id: 2e209da744cc345bc31602a197269cb64221c0d22c172b6b, peer_addr: 95.179.219.33:8650, task: network
Jan 17 23:48:42.999 INFO connecting to peer, node_id: 4c829a23c738934b52acfe2a9c3958e3ee6a0156b42f8c28, peer_addr: 15.222.159.21:3000, task: network
Jan 17 23:48:43.255 INFO connecting to peer, node_id: 7cbf1bf5dfbfcd3b0e6c6b2a8a105d8c49644131a202e6ef, peer_addr: 79.114.244.247:3000, task: network
Jan 17 23:48:43.374 INFO accepted connection, peer_addr: 62.202.33.120:31132, local_addr: 0.0.0.0:4093, task: network

Jan 17 23:48:43.374 ERRO Eek... Too late, we missed an event schedule, system time might be off?, event_end: 2020-01-17T22:48:43+00:00, event_start: 2020-01-17T22:48:41+00:00, event_date: 35.6452, leader_id: 3, task: leadership

Node 2

Jan 17 23:48:41.001 INFO Leader event started, event_remaining_time: 1s 998ms 916us 717ns, event_end: 2020-01-17T22:48:43+00:00, event_start: 2020-01-17T22:48:41+00:00, event_date: 35.6452, leader_id: 3, task: leadership
Jan 17 23:48:41.287 INFO incoming connection closed, peer_addr: 85.94.71.60:7884, task: network

@Straightpool commented Jan 18, 2020

Just had the same error on my first and only slot assignment so far. Node was on 0.8.6 with chrony installed and perfectly in sync:

- created_at_time: "2020-01-18T15:10:16.147301467+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-01-18T15:20:00.787124095+00:00"
  scheduled_at_date: "35.36190"
  scheduled_at_time: "2020-01-18T15:19:57+00:00"
  status:
    Rejected:
      reason: Missed the deadline to compute the schedule
  wake_at_time: "2020-01-18T15:20:00.787037638+00:00"

The log shows the following error (I added the prior 10 lines of log):

194100896-Jan 18 15:20:00 <...> start_leader.sh[44874]: Jan 18 15:20:00.462 INFO connecting to peer, node_id: 9735e24b9e18ab730d876fb8df297264ef8b2291bdb73574, peer_addr: 172.104.250.194:7325, task: network
194101097-Jan 18 15:20:00 <...> start_leader.sh[44874]: Jan 18 15:20:00.471 INFO peer node ID differs from the expected 9735e24b9e18ab730d876fb8df297264ef8b2291bdb73574, node_id: 21ca2aae6445f5cbb826028dc28196af2502c5f935abe237, peer_addr: 172.104.250.194:7325, task: network
194101367-Jan 18 15:20:00 <...> start_leader.sh[44874]: Jan 18 15:20:00.510 INFO connection closed, peer_addr: 24.79.157.172:50886, local_addr: 51.68.189.68:3000, task: network
194101538-Jan 18 15:20:00 <...> start_leader.sh[44874]: Jan 18 15:20:00.570 INFO connection closed, peer_addr: 74.110.54.138:54457, local_addr: 51.68.189.68:3000, task: network
194101709-Jan 18 15:20:00 <...> start_leader.sh[44874]: Jan 18 15:20:00.573 INFO connecting to peer, node_id: c16303ca5fc42d9eef742e469a408a156f24da3179879eef, peer_addr: 45.153.186.57:3000, task: network
194101908-Jan 18 15:20:00 <...> start_leader.sh[44874]: Jan 18 15:20:00.597 INFO connection to peer failed, reason: protocol handshake failed: unknown error, node_id: c16303ca5fc42d9eef742e469a408a156f24da3179879eef, peer_addr: 45.153.186.57:3000, task: network
194102164-Jan 18 15:20:00 <...> start_leader.sh[44874]: Jan 18 15:20:00.672 INFO connecting to peer, node_id: fd1f545da317ed9c2a64a416b4730ef3e489a8967161505f, peer_addr: 66.42.93.187:8750, task: network
194102362-Jan 18 15:20:00 <...> start_leader.sh[44874]: Jan 18 15:20:00.724 INFO failed to connect to peer, reason: Connection timed out (os error 110), node_id: 35709a7fce210aa7e5b5947d66e5686e7059d454bfb5c63d, peer_addr: 23.240.204.173:3200, task: network
194102614-Jan 18 15:20:00 <...> start_leader.sh[44874]: Jan 18 15:20:00.728 INFO connection closed, peer_addr: 24.9.65.174:60658, local_addr: 51.68.189.68:3000, task: network
194102783-Jan 18 15:20:00 <...> start_leader.sh[44874]: Jan 18 15:20:00.787 INFO accepted connection, peer_addr: 159.203.173.148:49120, local_addr: 51.68.189.68:3000, task: network
194102958:Jan 18 15:20:00 <...> start_leader.sh[44874]: Jan 18 15:20:00.788 ERRO Eek... Too late, we missed an event schedule, system time might be off?, event_end: 2020-01-18T15:19:59+00:00, event_start: 2020-01-18T15:19:57+00:00, event_date: 35.36190, leader_id: 1, task: leadership

/etc/chrony : chronyc tracking
Reference ID : 05C787AA (<...>)
Stratum : 3
Ref time (UTC) : Sat Jan 18 15:37:37 2020
System time : 0.000284441 seconds slow of NTP time
Last offset : -0.000059839 seconds
RMS offset : 0.000054073 seconds
Frequency : 17.715 ppm slow
Residual freq : -0.005 ppm
Skew : 0.033 ppm
Root delay : 0.014574516 seconds
Root dispersion : 0.030481683 seconds
Update interval : 514.0 seconds
Leap status : Normal

I have now tuned my chrony settings according to https://github.com/lovelypool/cardano_stuff/blob/master/chrony.conf

/etc/chrony : chronyc tracking
Reference ID : D8EF2304 (time2.google.com)
Stratum : 2
Ref time (UTC) : Sat Jan 18 15:44:27 2020
System time : 0.000015743 seconds fast of NTP time
Last offset : +0.000000032 seconds
RMS offset : 0.000015550 seconds
Frequency : 17.715 ppm slow
Residual freq : +0.250 ppm
Skew : 0.033 ppm
Root delay : 0.011313335 seconds
Root dispersion : 0.000151100 seconds
Update interval : 2.0 seconds
Leap status : Normal
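
(The linked chrony.conf is not reproduced here; purely as an illustration, settings along the following lines would explain the update interval dropping from 514 s to 2 s. This is an assumed sketch, not necessarily what the linked file contains.)

# /etc/chrony/chrony.conf - illustrative sketch only (assumed, not the linked file)
server time1.google.com iburst minpoll 1 maxpoll 1   # 2^1 = 2 second polling interval
server time2.google.com iburst minpoll 1 maxpoll 1
server time3.google.com iburst minpoll 1 maxpoll 1
makestep 0.1 3   # step the clock instead of slewing if the offset is large early on
rtcsync          # keep the hardware clock in sync too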

@BitcoinMicroPaymentNetwork commented Jan 19, 2020

I am also getting it. Complete log below.

Rohits-MacBook-Pro:rohitchandan$ jcli rest v0 leaders logs get -h http://127.0.0.1:3100/api

- created_at_time: "2020-01-18T19:13:49.422786+00:00"
  enclave_leader_id: 1
  finished_at_time: ~
  scheduled_at_date: "36.37064"
  scheduled_at_time: "2020-01-19T15:49:05+00:00"
  status: Pending
  wake_at_time: ~

Rohits-MacBook-Pro:rohitchandan$ jcli rest v0 leaders logs get -h http://127.0.0.1:3100/api

- created_at_time: "2020-01-18T19:13:49.422786+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-01-19T15:49:07.483840+00:00"
  scheduled_at_date: "36.37064"
  scheduled_at_time: "2020-01-19T15:49:05+00:00"
  status:
    Rejected:
      reason: Missed the deadline to compute the schedule
  wake_at_time: "2020-01-19T15:49:07.482823+00:00"

@fedot-netot commented Jan 21, 2020

Got the same issue.

I am confused by the huge gap of 7 seconds between scheduled_at_time and wake_at_time. The node host's time is in sync and chrony is running. The host has quite enough resources: 4G RAM, an SSD disk, and CPU utilization rarely exceeds 30%. At the moment of the schedule the node had ~500 established network connections with others (according to netstat). The node's logging level is configured as INFO. Jormungandr v0.8.6-7686397 is the only service running on the host, except chrony, systemd and some homemade monitoring tools.

---
- created_at_time: "2020-01-21T02:09:45.709170842+00:00"
  enclave_leader_id: 2
  finished_at_time: "2020-01-21T17:00:56.079760136+00:00"
  scheduled_at_date: "38.39216"
  scheduled_at_time: "2020-01-21T17:00:49+00:00"
  status:
    Rejected:
      reason: Missed the deadline to compute the schedule
  wake_at_time: "2020-01-21T17:00:56.079723536+00:00"
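
(The ~500 established connections figure above came from netstat; for reference, a count like that can be reproduced with either of the following, where ss is simply the newer replacement for netstat:)

netstat -tn 2>/dev/null | grep -c ESTABLISHED
ss -tn state established | tail -n +2 | wc -l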

@vincenthz vincenthz removed this from the Net improvement 2 milestone Jan 22, 2020
@vincenthz vincenthz added this to the Net improvement 3 milestone Jan 22, 2020
@vincenthz vincenthz removed this from the Net improvement 3 milestone Jan 31, 2020
@vincenthz vincenthz added this to the Net improvement 4 milestone Jan 31, 2020
@cardadad commented Feb 8, 2020

Still happening in version 0.8.9 a couple of times per epoch 😢

@BitcoinMicroPaymentNetwork

Latest in 0.8.9:

- created_at_time: "2020-02-11T19:13:49.410644+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-02-12T16:53:31.527425+00:00"
  scheduled_at_date: "60.38996"
  scheduled_at_time: "2020-02-12T16:53:29+00:00"
  status:
    Rejected:
      reason: Missed the deadline to compute the schedule
  wake_at_time: "2020-02-12T16:53:31.524852+00:00"

@eugene-babichenko (Collaborator) commented Feb 13, 2020

Is the node able to recover after this log entry, or does only restarting it help?
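
(One way to answer this from the outside, assuming the node exposes its REST API as in the examples above: poll the node stats after the error and check whether the reported last block height/date keeps advancing without a restart.)

# If the last block height/date keeps moving after the ERRO line, the node recovered on its own
jcli rest v0 node stats get -h http://127.0.0.1:3100/api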

@cardadad commented Feb 13, 2020

@NicolasDP NicolasDP moved this from In scope to do to In progress in jormungandr-dmb Feb 13, 2020
jormungandr-dmb automation moved this from In progress to Done Feb 13, 2020
@NicolasDP (Contributor) commented Feb 13, 2020

We found a lock in the logging mechanism of the leadership logs, and also in the fragment logs and the fragment pool. The leader schedule is greatly affected by these 3 components. If the issue still comes up too often after that, please report here again and we will re-open.
