
NSD doesn't refresh zones after extended downtime #25

Closed
aabdnn opened this issue Jul 18, 2019 · 4 comments
@aabdnn

aabdnn commented Jul 18, 2019

I have a test server where I had last run NSD in March 2019; then I had shut it down. Today I installed a new version of NSD and started it. It has 3 slave zones configured. The log showed this:

[2019-07-18 10:02:56.198] nsd[1187]: notice: nsd starting (NSD 4.2.1)
[2019-07-18 10:02:56.765] nsd[1188]: info: zone . read with success
[2019-07-18 10:02:56.768] nsd[1188]: info: zone arpa. read with success
[2019-07-18 10:02:56.768] nsd[1188]: info: zone root-servers.net. read with success
[2019-07-18 10:02:56.768] nsd[1188]: notice: nsd started (NSD 4.2.1), pid 1187
[2019-07-18 10:06:10.608] nsd[1188]: warning: signal received, shutting down...

Notice that NSD did not refresh any of the zones, even though they are vastly out of date. This is caused by the timers in xfrd.state (shown below). I think the issue is that NSD isn't comparing the timestamps in the state file against the current system time, so it doesn't realise that the refresh timers are long past and that it should refresh these zones immediately. Even at exit, it saves the stale refresh timers again, so they won't be updated if I start it again. I think the value of next_timeout should take the current system time into account.

NSDXFRD2
# This file is written on exit by nsd xfr daemon.
# This file contains slave zone information:
#       * timeouts (when was zone data acquired)
#       * state (OK, refreshing, expired)
#       * which master transfer to attempt next
# The file is read on start (but not on reload) by nsd xfr daemon.
# You can edit; but do not change statement order
# and no fancy stuff (like quoted "strings").
#
# If you remove a zone entry, it will be refreshed.
# This can be useful for an expired zone; it revives
# the zone temporarily, from refresh-expiry time.
# If you delete the file all slave zones are updated.
#
# Note: if you edit this file while nsd is running,
#       it will be overwritten on exit by nsd.

filetime: 1563444370    # Thu Jul 18 10:06:10 2019

# The number of zone entries in this file
numzones: 3

zone:   name: .
        state: 0 # OK
        master: 0
        next_master: -1
        round_num: -1
        next_timeout: 1707      # = 28m 27s
        backoff: 0
        soa_nsd_acquired: 1553543862    # was 114d 14h 8m 28s ago
        soa_nsd: 6 1 86400 1792 a.root-servers.net. nstld.verisign-grs.com. 2019032501 1800 900 604800 86400
        # refresh = 30m retry = 15m expire = 7d minimum = 1d
        soa_disk_acquired: 1563444176   # was 3m 14s ago
        soa_disk: 6 1 86400 1792 a.root-servers.net. nstld.verisign-grs.com. 2019032501 1800 900 604800 86400
        # refresh = 30m retry = 15m expire = 7d minimum = 1d
        soa_notify_acquired: 0

zone:   name: arpa.
        state: 0 # OK
        master: 0
        next_master: -1
        round_num: -1
        next_timeout: 1572      # = 26m 12s
        backoff: 0
        soa_nsd_acquired: 1553543862    # was 114d 14h 8m 28s ago
        soa_nsd: 6 1 86400 1792 a.root-servers.net. nstld.verisign-grs.com. 2019032501 1800 900 604800 86400
        # refresh = 30m retry = 15m expire = 7d minimum = 1d
        soa_disk_acquired: 1563444176   # was 3m 14s ago
        soa_disk: 6 1 86400 1792 a.root-servers.net. nstld.verisign-grs.com. 2019032501 1800 900 604800 86400
        # refresh = 30m retry = 15m expire = 7d minimum = 1d
        soa_notify_acquired: 0

zone:   name: root-servers.net.
        state: 0 # OK
        master: 0
        next_master: -1
        round_num: -1
        next_timeout: 12555     # = 3h 29m 15s
        backoff: 0
        soa_nsd_acquired: 1553543861    # was 114d 14h 8m 29s ago
        soa_nsd: 6 1 3600000 1792 a.root-servers.net. nstld.verisign-grs.com. 2019031301 14400 7200 1209600 3600000
        # refresh = 4h retry = 2h expire = 14d minimum = 41d 16h
        soa_disk_acquired: 1563444176   # was 3m 14s ago
        soa_disk: 6 1 3600000 1792 a.root-servers.net. nstld.verisign-grs.com. 2019031301 14400 7200 1209600 3600000
        # refresh = 4h retry = 2h expire = 14d minimum = 41d 16h
        soa_notify_acquired: 0

@wcawijngaards wcawijngaards self-assigned this Jul 18, 2019
@wcawijngaards
Member

Hi Anand,
Thanks for the report! I added logic so that when NSD reads the xfrd state file, it checks whether the timeout is in the past. If so, it attempts to refetch the zone contents. Because that probably affects all of the zones in the file, it spreads the load over a couple of seconds with a random(10) second delay. That works in a test setup for me.
Best regards, Wouter

@aabdnn
Author

aabdnn commented Jul 18, 2019

Hi Wouter! Thanks for the fix. I have actually been aware of this issue for ages, but kept forgetting to open a report. Today, when I built 4.2.1 for testing and noticed it again, I decided to open the report before I forgot again. I will try to rebuild with this patch and report back, or just test when 4.2.2 comes out.

@aabdnn
Author

aabdnn commented Jul 22, 2019

Hi Wouter. I tried this today, and it works as you described. NSD starts, notices that the zones are way out of date, and schedules a refresh for them with a random delay of up to 10 seconds. I guess this is fine, but in reality there is probably no need for the delay; it would be fine if NSD just refreshed the zones immediately. If you start NSD without any zone files or xfrd.state, it XFRs in all the slave zones immediately. It doesn't make much sense to add a random 10s delay only for this specific case of extended downtime. For consistency, if you're going to add a random delay, it should apply in all cases, to avoid flooding a master server.

@wcawijngaards
Member

Hi Anand. Yes, that is right; I fixed it in commit 784600e, where the zone is fetched immediately, and if that fetch needs retries, the already existing retry logic spreads the load. This is also how it works when NSD is started without files. So with this fix all the zones get fetched immediately.
