feat!: replace DNS overlay with in-process DNS server #13
Merged
Add a minimal DNS server (hickory-proto + raw UDP) that runs on the IX bridge. Lab-wide records are inserted synchronously into an in-memory HashMap behind std::sync::RwLock, so they are visible to queries the instant set_host()/set_txt() returns. No inotify, no file watching, no races.
Breaking API changes:
- Lab::dns_entry() removed → use lab.dns_server().await?.set_host()
- Lab::set_nameserver() removed → dns_server() auto-sets resolv.conf
- Device::dns_entry() renamed → Device::set_host()
Device-level /etc/hosts overlay kept for per-device isolation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
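The "visible the instant set_host() returns" property follows from holding the write lock for the duration of the insert. A minimal sketch of that idea, assuming a simplified table of name to address; `Records` and `lookup` are hypothetical names for illustration, not the types in this PR:

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the lab-wide record table.
/// Clones share the same underlying map.
#[derive(Clone, Default)]
pub struct Records {
    inner: Arc<RwLock<HashMap<String, IpAddr>>>,
}

impl Records {
    /// Inserts under the write lock. Once this returns, any later lookup
    /// through any clone, on any thread, observes the record.
    pub fn set_host(&self, name: &str, ip: IpAddr) {
        self.inner
            .write()
            .expect("poisoned")
            .insert(name.to_string(), ip);
    }

    /// What a serve loop would call for each incoming query.
    pub fn lookup(&self, name: &str) -> Option<IpAddr> {
        self.inner.read().expect("poisoned").get(name).copied()
    }
}
```

Because there is no file or watcher between the writer and the reader, there is no window in which a query can race a pending update.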
Remove the per_device HashMap from DnsEntries (now DnsOverlayDir). set_host() appends directly to the hosts file. Device::resolve() uses glibc getaddrinfo via run_sync + ToSocketAddrs, which consults both /etc/hosts and resolv.conf, so it covers device-local entries as well as the DNS server.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address review findings:
- set_host/set_txt now replace (upsert) instead of appending duplicates
- Bump UDP receive buffer from 512 to 4096 bytes (EDNS support)
- Return NOERROR (not NXDomain) when name exists but queried type doesn't
- Use TTL=1 instead of TTL=0 to reduce resolver re-query overhead
- Use expect("poisoned") on RwLock instead of unwrap
- Document limitations: no TCP, FQDN trailing-dot convention, TXT resolve
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
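The NOERROR versus NXDomain distinction in the list above can be sketched as follows. This is an illustration with hypothetical local enums and a string payload, not hickory-proto's types or the PR's actual code:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum RecordType {
    A,
    Aaaa,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ResponseCode {
    NoError,
    NxDomain,
}

/// Returns the response code plus the matching record data, if any.
pub fn answer<'a>(
    records: &'a HashMap<(String, RecordType), String>,
    name: &str,
    qtype: RecordType,
) -> (ResponseCode, Option<&'a str>) {
    // Exact (name, type) hit: NOERROR with an answer.
    if let Some(data) = records.get(&(name.to_string(), qtype)) {
        return (ResponseCode::NoError, Some(data.as_str()));
    }
    // The name exists under another type: NOERROR with an empty answer
    // section (often called NODATA). Otherwise the name is unknown.
    let name_exists = records.keys().any(|(n, _)| n.as_str() == name);
    let code = if name_exists {
        ResponseCode::NoError
    } else {
        ResponseCode::NxDomain
    };
    (code, None)
}
```

Returning NXDomain for an AAAA query on a name that only has an A record would tell the resolver the name does not exist at all, which can cause it to give up on the A lookup too.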
Confirms that set_host() with v4 and v6 for the same name stores both records independently (keyed by (name, RecordType)), and that replacing one doesn't clobber the other.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
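A rough sketch of the behaviour this test confirms, assuming the record type is derived from the address family. `Table` and its methods are hypothetical names, and the real store holds DNS records rather than bare addresses:

```rust
use std::collections::HashMap;
use std::net::IpAddr;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum RecordType {
    A,
    Aaaa,
}

/// Hypothetical table keyed by (name, record type).
#[derive(Default)]
pub struct Table {
    records: HashMap<(String, RecordType), IpAddr>,
}

impl Table {
    /// Upsert: a second call for the same name and address family replaces
    /// the earlier value; the other family's entry is left untouched.
    pub fn set_host(&mut self, name: &str, ip: IpAddr) {
        let rtype = match ip {
            IpAddr::V4(_) => RecordType::A,
            IpAddr::V6(_) => RecordType::Aaaa,
        };
        self.records.insert((name.to_string(), rtype), ip);
    }

    pub fn get(&self, name: &str, rtype: RecordType) -> Option<IpAddr> {
        self.records.get(&(name.to_string(), rtype)).copied()
    }

    pub fn len(&self) -> usize {
        self.records.len()
    }
}
```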
Bind to both ix_gw (198.18.0.1) and ix_gw_v6 (2001:db8::1) on port 53. Write both nameservers to resolv.conf so v4-only, v6-only, and dual-stack devices can all reach the DNS server.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
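As an illustration of the dual-stack setup, the sketch below derives the listen addresses and the resolv.conf contents from one list of gateway IPs. The function names are hypothetical; only the two addresses and port 53 come from the commit message:

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr, SocketAddr};

/// The two gateway addresses quoted in the commit message.
pub fn nameservers() -> Vec<IpAddr> {
    vec![
        IpAddr::V4(Ipv4Addr::new(198, 18, 0, 1)),
        IpAddr::V6(Ipv6Addr::new(0x2001, 0xdb8, 0, 0, 0, 0, 0, 1)),
    ]
}

/// One UDP listen address per gateway, all on port 53.
pub fn listen_addrs(nameservers: &[IpAddr]) -> Vec<SocketAddr> {
    nameservers
        .iter()
        .map(|ip| SocketAddr::from((*ip, 53)))
        .collect()
}

/// One `nameserver` line per gateway, in resolv.conf syntax.
pub fn resolv_conf(nameservers: &[IpAddr]) -> String {
    nameservers
        .iter()
        .map(|ip| format!("nameserver {ip}\n"))
        .collect()
}
```

Listing both addresses lets a v6-only device skip the unreachable v4 nameserver and fall through to the one it can reach.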
- Remove redundant `run()` wrapper, inline `run_until_cancelled` in spawn
- Rename `serve_loop` → `serve`, `response_bytes` → `bytes`
- Remove `records2` pattern; shadow with `let records = records.clone()`
- Drop unnecessary `.clone()` on `name` before `Record::from_rdata`
- Use `SocketAddr::from((ip, port))` instead of `SocketAddr::new(IpAddr::V4(..))`
- Simplify `DnsOverlayDir`: replace `nameserver`/`nameserver_v6` with `Vec<IpAddr>`
- Collapse NXDomain/NODATA branch into single `let code = if ..`
- Remove redundant `let key` bindings where used once
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Device::resolve() used run_sync + glibc getaddrinfo, which blocks the calling thread. Under link impairment this stalls the tokio runtime. Switch to tokio::net::lookup_host on the device's async worker via spawn.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bind std::net::UdpSocket inside the root namespace via run_closure_in (sync), then convert to tokio sockets for the async serve loops. This removes the async requirement from dns_server(), so callers no longer need .await. Lab stores the DnsServer in a Mutex<Option<Arc<DnsServer>>> since OnceLock::get_or_try_init is unstable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
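The Mutex<Option<..>> workaround amounts to a lazily initialised slot whose initialiser may fail. A generic sketch of that pattern using only the standard library; `Lazy` is a hypothetical helper for illustration, not a type from this PR:

```rust
use std::sync::Mutex;

/// Lazily initialised slot with a fallible initialiser, standing in for
/// the unstable `OnceLock::get_or_try_init`.
pub struct Lazy<T> {
    slot: Mutex<Option<T>>,
}

impl<T: Clone> Lazy<T> {
    pub fn new() -> Self {
        Self {
            slot: Mutex::new(None),
        }
    }

    /// Returns the stored value, or runs `init` once and stores its result.
    /// If `init` fails, nothing is stored, so a later call can retry.
    pub fn get_or_try_init<E>(
        &self,
        init: impl FnOnce() -> Result<T, E>,
    ) -> Result<T, E> {
        let mut slot = self.slot.lock().expect("poisoned");
        if let Some(value) = slot.as_ref() {
            return Ok(value.clone());
        }
        let value = init()?;
        *slot = Some(value.clone());
        Ok(value)
    }
}
```

The lock is held while `init` runs, so two concurrent first calls cannot both bind the sockets. With `T = Arc<DnsServer>` the clone handed back to the caller is cheap.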
DnsServer's fields (Arc<RwLock<..>> + CancellationToken) are already cheaply cloneable. Derive Clone instead of wrapping in Arc. Move shutdown to an explicit method called from LabInner::drop, so clones handed to callers don't cancel the server when dropped.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DNS wire queries always use absolute names (trailing dot). If records
were stored without a trailing dot (e.g. "relay.test" vs "relay.test."),
lookups would fail because the keys didn't match.
Add parse_name() that ensures all names are FQDN before storage and
lookup. Both "relay.test" and "relay.test." now work identically.
This fixes DNS resolution failures in iroh's patchbay tests, which
call set_host("relay.test", ip) without a trailing dot.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
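The normalization step can be approximated on plain strings as below. This is a simplified sketch: `to_fqdn` is a hypothetical name, and the PR's parse_name() presumably produces a DNS name type rather than a String:

```rust
/// Appends the trailing dot if it is missing, so that stored keys and
/// wire-format query names compare equal.
pub fn to_fqdn(name: &str) -> String {
    let mut fqdn = name.to_string();
    if !fqdn.ends_with('.') {
        fqdn.push('.');
    }
    fqdn
}
```

Applying the same function on both the insert path and the lookup path is what makes set_host("relay.test", ip) and a wire query for "relay.test." hit the same key.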
The old DNS system wrote per-device `/etc/hosts` files and bind-mounted them into every namespace thread. Lab-wide `dns_entry()` calls rewrote all files on every mutation, and hickory-resolver could miss late entries because it reads `/etc/hosts` at construction time.

This replaces it with an in-process DNS server on the IX bridge. Records live in a `std::sync::RwLock<HashMap>` and are served over UDP via hickory-proto on both v4 and v6. All mutations are synchronous and visible to queries before `set_host()` returns.

Per-device `/etc/hosts` overlays are kept for device-local isolation via `Device::set_host()`. `Device::resolve()` is now async, using `tokio::net::lookup_host` on the device's async worker so it does not block the runtime under link impairment. DNS names are normalized to FQDN internally, so both `"relay.test"` and `"relay.test."` work.

Breaking changes
- `Lab::dns_entry()` removed: use `lab.dns_server()?.set_host()` instead.
- `Lab::set_nameserver()` removed: calling `dns_server()` auto-sets `resolv.conf`.
- `Device::dns_entry()` renamed to `Device::set_host()`.
- `Device::resolve()` is now async.