feat!: replace DNS overlay with in-process DNS server #13

Merged: Frando merged 14 commits into main from Frando/dns-server on Apr 9, 2026
@Frando (Member) commented Apr 8, 2026

The old DNS system wrote per-device /etc/hosts files and bind-mounted them into every namespace thread. Lab-wide dns_entry() calls rewrote all files on every mutation, and hickory-resolver could miss late entries because it reads /etc/hosts at construction time.

This replaces it with an in-process DNS server on the IX bridge. Records live in a std::sync::RwLock<HashMap> and are served over UDP via hickory-proto on both v4 and v6. All mutations are synchronous and visible to queries before set_host() returns.
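To illustrate why mutations are visible as soon as the call returns, here is a minimal sketch of a record table behind `std::sync::RwLock`. The `Records` and `Kind` types are hypothetical stand-ins, not the PR's actual types; the real server keys records by hickory's `RecordType` and serves them over UDP, which this sketch leaves out.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::{Arc, RwLock};

/// Hypothetical record kinds; a stand-in for hickory's RecordType.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Kind {
    A,
    Aaaa,
}

/// Hypothetical shared record table. Clones share the same map.
#[derive(Clone, Default)]
pub struct Records {
    inner: Arc<RwLock<HashMap<(String, Kind), IpAddr>>>,
}

impl Records {
    /// Insert or replace the address for `name`. The kind follows the
    /// address family, so v4 and v6 records are stored independently.
    pub fn set_host(&self, name: &str, ip: IpAddr) {
        let kind = match ip {
            IpAddr::V4(_) => Kind::A,
            IpAddr::V6(_) => Kind::Aaaa,
        };
        // The write guard is dropped at the end of this statement, so any
        // lookup that starts after set_host() returns sees the new record.
        self.inner
            .write()
            .expect("poisoned")
            .insert((name.to_string(), kind), ip);
    }

    /// Look up the address stored for `name` and `kind`, if any.
    pub fn lookup(&self, name: &str, kind: Kind) -> Option<IpAddr> {
        self.inner
            .read()
            .expect("poisoned")
            .get(&(name.to_string(), kind))
            .copied()
    }
}
```

Because the map is updated under the write lock before `set_host()` returns, there is no window in which a later query can observe stale data, unlike a file-based scheme that depends on the resolver re-reading `/etc/hosts`.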

Per-device /etc/hosts overlays are kept for device-local isolation via Device::set_host(). Device::resolve() is now async, using tokio::net::lookup_host on the device's async worker so it does not block the runtime under link impairment. DNS names are normalized to FQDN internally, so both "relay.test" and "relay.test." work.
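The FQDN normalization can be pictured with a small string helper. This is only a sketch: `normalize_fqdn` is a hypothetical name, and the PR's own `parse_name()` may well work on hickory's name type rather than on plain strings.

```rust
/// Hypothetical helper: return `name` as an absolute name by appending
/// a trailing dot when one is missing.
pub fn normalize_fqdn(name: &str) -> String {
    if name.ends_with('.') {
        name.to_string()
    } else {
        format!("{name}.")
    }
}
```

Applying the same normalization on both insert and lookup means the stored key always matches the absolute name carried in a DNS wire query.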

Breaking changes

  • Lab::dns_entry() removed — use lab.dns_server()?.set_host() instead.
  • Lab::set_nameserver() removed — calling dns_server() auto-sets resolv.conf.
  • Device::dns_entry() renamed to Device::set_host().
  • Device::resolve() is now async.

Frando and others added 14 commits April 8, 2026 23:06
Add a minimal DNS server (hickory-proto + raw UDP) that runs on the IX
bridge. Lab-wide records are inserted synchronously into an in-memory
HashMap behind std::sync::RwLock — visible to queries the instant
set_host()/set_txt() returns. No inotify, no file watching, no races.

Breaking API changes:
- Lab::dns_entry() removed → use lab.dns_server().await?.set_host()
- Lab::set_nameserver() removed → dns_server() auto-sets resolv.conf
- Device::dns_entry() renamed → Device::set_host()

Device-level /etc/hosts overlay kept for per-device isolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the per_device HashMap from DnsEntries (now DnsOverlayDir).
set_host() appends directly to the hosts file. Device::resolve() uses
glibc getaddrinfo via run_sync + ToSocketAddrs, which reads both
/etc/hosts and resolv.conf — covers device-local entries and DNS server.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address review findings:
- set_host/set_txt now replace (upsert) instead of appending duplicates
- Bump UDP receive buffer from 512 to 4096 bytes (EDNS support)
- Return NOERROR (not NXDomain) when name exists but queried type doesn't
- Use TTL=1 instead of TTL=0 to reduce resolver re-query overhead
- Use expect("poisoned") on RwLock instead of unwrap
- Document limitations: no TCP, FQDN trailing-dot convention, TXT resolve

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
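The NOERROR versus NXDomain distinction from the review findings above can be sketched as follows. The `Kind`, `Rcode`, and `classify` names are hypothetical and the record set is reduced to a plain `HashSet`; the real server builds hickory response messages instead.

```rust
use std::collections::HashSet;

/// Hypothetical record kinds; a stand-in for hickory's RecordType.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Kind {
    A,
    Aaaa,
    Txt,
}

/// Hypothetical response codes; a stand-in for hickory's ResponseCode.
#[derive(Debug, PartialEq, Eq)]
pub enum Rcode {
    NoError,
    NxDomain,
}

/// Returns the response code and whether the response carries an answer.
pub fn classify(records: &HashSet<(String, Kind)>, name: &str, qtype: Kind) -> (Rcode, bool) {
    if records.contains(&(name.to_string(), qtype)) {
        return (Rcode::NoError, true);
    }
    // The name exists under another type: NOERROR with an empty answer
    // section (often called NODATA). Only unknown names get NXDomain.
    let name_exists = records.iter().any(|(n, _)| n == name);
    let code = if name_exists { Rcode::NoError } else { Rcode::NxDomain };
    (code, false)
}
```

Answering NXDomain for an existing name would tell a resolver that the name does not exist at all, which can suppress a follow-up query for the other address family.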
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Confirms that set_host() with v4 and v6 for the same name stores both
records independently (keyed by (name, RecordType)), and that replacing
one doesn't clobber the other.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bind to both ix_gw (198.18.0.1) and ix_gw_v6 (2001:db8::1) on port 53.
Write both nameservers to resolv.conf so v4-only, v6-only, and
dual-stack devices can all reach the DNS server.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
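As a rough illustration of the dual nameserver setup, the sketch below renders `resolv.conf` content from a list of addresses. The `resolv_conf` function is hypothetical; only the two gateway addresses come from the commit message.

```rust
use std::net::IpAddr;

/// Hypothetical helper: render one `nameserver` line per address,
/// in the order given.
pub fn resolv_conf(nameservers: &[IpAddr]) -> String {
    nameservers
        .iter()
        .map(|ip| format!("nameserver {ip}\n"))
        .collect()
}
```

Called with 198.18.0.1 and 2001:db8::1, this yields two `nameserver` lines, so a v4-only device can use the first entry and a v6-only device the second.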
- Remove redundant `run()` wrapper, inline `run_until_cancelled` in spawn
- Rename `serve_loop` → `serve`, `response_bytes` → `bytes`
- Remove `records2` pattern — shadow with `let records = records.clone()`
- Drop unnecessary `.clone()` on `name` before `Record::from_rdata`
- Use `SocketAddr::from((ip, port))` instead of `SocketAddr::new(IpAddr::V4(..))`
- Simplify `DnsOverlayDir`: replace `nameserver`/`nameserver_v6` with `Vec<IpAddr>`
- Collapse NXDomain/NODATA branch into single `let code = if ..`
- Remove redundant `let key` bindings where used once

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Device::resolve() used run_sync + glibc getaddrinfo, which blocks the
calling thread. Under link impairment this stalls the tokio runtime.
Switch to tokio::net::lookup_host on the device's async worker via spawn.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bind std::net::UdpSocket inside the root namespace via run_closure_in
(sync), then convert to tokio sockets for the async serve loops. This
removes the async requirement from dns_server() — callers no longer
need .await.

Lab stores the DnsServer in a Mutex<Option<Arc<DnsServer>>> since
OnceLock::get_or_try_init is unstable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
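The `Mutex<Option<Arc<..>>>` pattern mentioned above can be sketched as a small fallible lazy initializer. `Lazy` and its method are hypothetical names chosen for illustration, not the PR's code; the point is that a failed initialization leaves the slot empty so a later call can retry.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical lazily initialized slot with a fallible initializer.
pub struct Lazy<T> {
    slot: Mutex<Option<Arc<T>>>,
}

impl<T> Lazy<T> {
    pub fn new() -> Self {
        Self { slot: Mutex::new(None) }
    }

    /// Return the stored value, or run `init` once and store its result.
    /// On error nothing is stored, so the next call tries again.
    pub fn get_or_try_init<E>(
        &self,
        init: impl FnOnce() -> Result<T, E>,
    ) -> Result<Arc<T>, E> {
        // Holding the lock across `init` guarantees a single initializer.
        let mut slot = self.slot.lock().expect("poisoned");
        if let Some(existing) = slot.as_ref() {
            return Ok(Arc::clone(existing));
        }
        let value = Arc::new(init()?);
        *slot = Some(Arc::clone(&value));
        Ok(value)
    }
}
```

The trade-off compared with `OnceLock` is that every access takes the mutex, which is unlikely to matter for a handle that is fetched only occasionally.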
DnsServer's fields (Arc<RwLock<..>> + CancellationToken) are already
cheaply cloneable. Derive Clone instead of wrapping in Arc. Move
shutdown to an explicit method called from LabInner::drop, so clones
handed to callers don't cancel the server when dropped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
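The reasoning behind the explicit shutdown method can be shown with a minimal sketch. An `Arc<AtomicBool>` stands in for the `CancellationToken`, and `Handle` and `Owner` are hypothetical names standing in for `DnsServer` and `LabInner`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical cloneable server handle. It has no Drop impl, so
/// dropping a clone never stops the server.
#[derive(Clone, Default)]
pub struct Handle {
    cancelled: Arc<AtomicBool>,
}

impl Handle {
    /// Explicitly stop the server for every clone of this handle.
    pub fn shutdown(&self) {
        self.cancelled.store(true, Ordering::SeqCst);
    }

    pub fn is_running(&self) -> bool {
        !self.cancelled.load(Ordering::SeqCst)
    }
}

/// Hypothetical owner whose drop is the only place shutdown is called.
pub struct Owner {
    pub server: Handle,
}

impl Drop for Owner {
    fn drop(&mut self) {
        self.server.shutdown();
    }
}
```

If cancellation instead lived in a `Drop` impl on the handle itself, the first clone a caller discarded would stop the server for everyone else.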
DNS wire queries always use absolute names (trailing dot). If records
were stored without a trailing dot (e.g. "relay.test" vs "relay.test."),
lookups would fail because the keys didn't match.

Add parse_name() that ensures all names are FQDN before storage and
lookup. Both "relay.test" and "relay.test." now work identically.

This fixes DNS resolution failures in iroh's patchbay tests, which
call set_host("relay.test", ip) without a trailing dot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Frando Frando merged commit 699170f into main Apr 9, 2026
3 checks passed