Skip to content

lsferreira42/nfsdiag

Repository files navigation

nfsdiag

Website: www.nfsdiag.org · Releases: github.com/lsferreira42/nfsdiag/releases/latest

CI Deploy website to Cloudflare

nfsdiag is a command-line NFS diagnostic tool written in C. You give it an IP or hostname, and it checks everything that usually breaks in NFS: network reachability, rpcbind, NFS versions, mountd, exports, permissions, root squash, locking, stale handles, and performance.

It is not magic, and it will not replace a good server-side analysis. But it narrows down whether the problem is network, NFS config, permissions, UID/GID mapping, or something stranger.


What this tool does

  • test if rpcbind TCP port 111 is reachable
  • test if NFS TCP port 2049 is reachable
  • query the RPC service map from rpcbind (rpcbind v3/v4 DUMP with native IPv6, plus legacy portmapper)
  • measure TCP connect latency and path MTU towards the server
  • verify that dynamically registered mountd/lockd/statd ports are reachable through firewalls
  • fingerprint the server implementation heuristically from its RPC service layout
  • detect registered NFS, mountd, lockd/NLM and statd/NSM services
  • test NFS v2, v3 and v4 with RPC NULLPROC (including v4.1 and v4.2 hints)
  • test mountd v1, v2 and v3; optionally probe RPC over UDP
  • enumerate exports using mountd
  • check client prerequisite daemons (nfs-client.target, rpc.gssd, nfs-idmapd)
  • detect Kerberos tickets and configuration with --krb5, and test which sec=krb5/krb5i/krb5p flavors actually mount
  • mount exports automatically, trying NFSv4.2 → 4.1 → 4 → 3 in cascade
  • test selected exports only with repeatable --export, or several exports concurrently with --parallel N
  • benchmark rsize/wsize/nconnect combinations with --sweep and suggest mount options
  • parse and verify effective mount options from /proc/self/mountinfo
  • capture RPC stats (retransmissions, auth refreshes) before and after tests
  • extract deep latency metrics from /proc/self/mountstats
  • read NFS server info from /proc/fs/nfsfs/servers (protocol version, active mount count)
  • run filesystem checks after mount: close-to-open consistency, special files, quotas
  • test read/traverse permission, directory listing
  • test POSIX ACLs, NFSv4 ACLs, generic xattrs, and SELinux contexts
  • test create/write/read/fsync; advanced I/O: copy_file_range, fallocate, O_DIRECT
  • test advisory locks with fcntl
  • detect practical root_squash behavior
  • simulate UID/GID access with supplemental groups
  • run metadata latency benchmark (create/rename/unlink)
  • run stale file handle loop looking for ESTALE
  • test long filenames (255-byte), special characters (spaces, colons, UTF-8 multibyte)
  • detect NFSv4 delegation activity via /proc/self/mountstats (DELEGRETURN operations)
  • check for pNFS layouts via /proc/self/mountstats
  • run external fio benchmarks alongside internal smoke tests
  • generate JSON and HTML reports; stream NDJSON; emit Prometheus metrics or JUnit XML
  • serve Prometheus metrics continuously over HTTP with --listen PORT
  • keep a per-host baseline and compare each run against it with --diff-baseline
  • emit event categories, stable check_id values and remediation text for automation
  • write evidence bundles with --output-dir
  • compare two JSON reports with nfsdiag diff
  • run local dependency/helper validation with --self-test
  • run Docker fixture tests for regression checks (14 scenarios)

By default the output is compact. Use --verbose to see all probe steps.


Important note

NFS problems are very environment-dependent. Results can change because of firewall rules, server export options, NFS version, kernel client state, UID/GID mapping, root squash, ACLs, SELinux/AppArmor on the server, server load, or stale file handles that only appear during real use.

If the tool says no ESTALE happened, it only means the tool did not reproduce it during the test window.


Quick start (OCI image)

No compilation needed:

docker run --rm --privileged ghcr.io/lsferreira42/nfsdiag 192.168.1.10

The image is published to ghcr.io/lsferreira42/nfsdiag as :latest and :vX.Y.Z on each release.


Build requirements

Debian / Ubuntu:

sudo apt-get install -y build-essential pkg-config libtirpc-dev nfs-common

Fedora / RHEL:

sudo dnf install -y gcc make pkgconf-pkg-config libtirpc-devel nfs-utils

Build

make                    # build
make check              # unit tests plus CLI self-check
make sbom               # minimal SPDX-style SBOM in build/
sudo make install       # install binary, man page and shell completions to /usr/local

Override prefix:

make PREFIX=/opt/nfsdiag install

Manual compile:

gcc -O2 -Wall -Wextra -D_GNU_SOURCE -I/usr/include/tirpc \
    src/main.c src/mount.c src/network.c src/report.c \
    src/rpc.c src/stats.c src/tests.c src/validation.c \
    -ltirpc -o nfsdiag

Packaging

make deb        # Debian/Ubuntu .deb → build/
make rpm        # Fedora/RHEL .rpm   → build/
make apk        # Alpine .apk (needs Docker) → build/
make packages   # all three

Additional packaging templates live under packaging/:

  • packaging/Dockerfile for the OCI image
  • packaging/homebrew/nfsdiag.rb
  • packaging/aur/PKGBUILD
  • packaging/nix/flake.nix

Pre-built binaries (amd64 and arm64), packages, SBOM, checksums and provenance are attached to GitHub releases.


Basic usage

sudo ./nfsdiag 192.168.1.10          # full diagnostic
./nfsdiag --verbose 192.168.1.10     # show all steps
./nfsdiag --no-mount 192.168.1.10    # network/RPC only, no mounts
sudo ./nfsdiag --export /data 192.168.1.10   # one export only
sudo ./nfsdiag --read-only 192.168.1.10      # skip write/create tests
sudo ./nfsdiag --dry-run 192.168.1.10        # print what would run, do nothing
./nfsdiag --self-test                         # local dependency/helper checks

Profiles provide safer presets:

sudo ./nfsdiag --profile quick 192.168.1.10
sudo ./nfsdiag --profile safe 192.168.1.10
sudo ./nfsdiag --profile full 192.168.1.10
sudo ./nfsdiag --profile performance 192.168.1.10
sudo ./nfsdiag --profile security 192.168.1.10

Output formats

Default output is tagged text ([OK], [WARN], [FAIL], [INFO]).

Summary table (box-drawing, per-export columns):

sudo ./nfsdiag --output-format=table 192.168.1.10

Streaming NDJSON (one JSON object per event — ideal for log pipelines):

sudo ./nfsdiag --output-format=ndjson 192.168.1.10 | jq 'select(.level=="fail")'

Prometheus / OpenMetrics (emitted at end of run):

sudo ./nfsdiag --output-format=prometheus 192.168.1.10

JSON and HTML reports

JSON to stdout (diagnostic text suppressed):

./nfsdiag --json 192.168.1.10

JSON to file (diagnostic text still on stdout):

./nfsdiag --json=report.json 192.168.1.10

HTML report:

./nfsdiag --html=report.html 192.168.1.10

Suppress stdout when writing to file:

./nfsdiag --quiet --json=report.json 192.168.1.10

Reports include tool version, host, timestamp, system info, per-export results (NFS version, latency, throughput, ACLs), global events, and recommendations. The JSON schema includes schema_version, timestamp_iso8601, duration_sec, event category, stable check_id, severity, and remediation text.

Evidence bundle:

sudo ./nfsdiag --output-dir ./nfsdiag-report 192.168.1.10

This writes JSON, HTML, evidence text and SHA256SUMS for the generated files.

Compare two JSON reports:

./nfsdiag diff before.json after.json

Watch mode

Re-run diagnostics every N seconds (Ctrl-C to stop):

sudo ./nfsdiag --watch 60 192.168.1.10

The terminal is cleared between iterations. All pending mounts are cleaned up on SIGINT.


Multi-host batch

Run against a list of hosts:

sudo ./nfsdiag --hosts-file /etc/nfs-servers.txt --json=audit.json

File format: one host per line; lines starting with # are comments. Use --delay-ms to rate-limit between hosts.


On-failure hook

Execute a script whenever any test fails:

sudo ./nfsdiag --on-fail-exec /usr/local/bin/alert.sh 192.168.1.10

The script receives: NFSDIAG_HOST, NFSDIAG_LEVEL, NFSDIAG_FAIL_COUNT, NFSDIAG_WARN_COUNT. It is invoked through a resolved trusted path with a minimal environment, never via a shell.


Config file

Persist options in a key=value file:

sudo ./nfsdiag --config /etc/nfsdiag.conf 192.168.1.10

Example nfsdiag.conf:

timeout = 10
bench_bytes = 8388608
uid = 1000
gid = 1000

CLI flags override config-file values.


UID/GID and permission tests

sudo ./nfsdiag --uid 1000 --gid 1000 192.168.1.10
sudo ./nfsdiag --uid 1000 --gid 1000 --uid 65534 --gid 65534 192.168.1.10
sudo ./nfsdiag --uid 1000 --gid 1000 --groups 10,20,30 192.168.1.10

Performance and stale handle tests

sudo ./nfsdiag --bench-bytes 167772160 192.168.1.10
sudo ./nfsdiag --bench-iterations 500 192.168.1.10
sudo ./nfsdiag --bench-type=fio 192.168.1.10       # requires fio installed
sudo ./nfsdiag --stale-iterations 1000 192.168.1.10

Safety options

sudo ./nfsdiag --command-timeout 15 192.168.1.10
sudo ./nfsdiag --delay-ms 500 192.168.1.10
sudo ./nfsdiag --mount-namespace 192.168.1.10      # explicit namespace
sudo ./nfsdiag --no-mount-namespace 192.168.1.10   # opt out of automatic namespace
sudo ./nfsdiag --dangerous-fs-tests 192.168.1.10   # enable symlink/hardlink/FIFO/device probes
sudo ./nfsdiag --allow-risky-mount-options -o exec 192.168.1.10

Network/protocol options

./nfsdiag --no-mount --udp 192.168.1.10
./nfsdiag --ipv4-only --no-mount 192.168.1.10
./nfsdiag --ipv6-only --no-mount nfs-server.example.com
sudo ./nfsdiag --no-nfs4-discovery 192.168.1.10

Shell completions

sudo make install already places the bash, zsh, and fish completions. To load them without installing, source them manually:

source completions/nfsdiag.bash          # bash
fpath=(completions $fpath)               # zsh (add to .zshrc before compinit)
cp completions/nfsdiag.fish ~/.config/fish/completions/

Man page

man docs/nfsdiag.8           # view locally

sudo make install installs the man page to the system man path.


Command line reference

Usage: nfsdiag [OPTIONS] <server-ip-or-hostname>

Diagnostic options:
  -e, --export PATH          Test only this export path (repeatable, up to 64)
  -o, --mount-options OPTS   Extra mount options passed to mount(8)
      --no-mount             Run network/RPC checks only; skip all mounts
      --dry-run              Print what would be done; skip mounts and fs tests
      --read-only            Do not create or write test files
      --uid UID              Simulate access as UID (repeatable, needs root)
      --gid GID              GID paired with last --uid
      --groups G1,G2         Supplemental GIDs for UID/GID simulation
      --krb5                 Check Kerberos prerequisites and test sec=krb5/krb5i/krb5p mounts
      --parallel N           Test up to N exports concurrently (1-32). Default: 1
      --sweep                Benchmark rsize/wsize/nconnect combos and suggest mount options
      --diff-baseline        Compare with the last saved run for this host, then update it
      --udp                  Also probe RPC NULLPROC over UDP
      --ipv4-only            Force IPv4 for direct TCP checks
      --ipv6-only            Force IPv6 for direct TCP checks
      --no-nfs4-discovery    Disable NFSv4 pseudo-root fallback
      --mount-namespace      Use private mount namespace (needs root/CAP_SYS_ADMIN)
      --no-mount-namespace   Disable automatic private mount namespace
      --dangerous-fs-tests   Enable symlink/hardlink/FIFO/device-node probes
      --allow-risky-mount-options
                              Permit risky mount options such as exec/suid/dev
                              and skip the default nosuid,nodev,noexec hardening
      --profile NAME         quick, safe, full, performance, security, readonly
      --hosts-file FILE      Read one host per line from FILE
      --watch SEC            Re-run diagnostics every SEC seconds until Ctrl-C
      --on-fail-exec SCRIPT  Execute SCRIPT via trusted path when any test fails
      --config FILE          Load options from FILE (key=value) before CLI args

Timeout options:
      --timeout SEC          Network/RPC connect timeout. Default: 5
      --command-timeout SEC  Timeout for mount/umount commands. Default: 30
      --fs-timeout SEC       Timeout for each filesystem test group. Default: 30
      --delay-ms MS          Delay between testing each export (rate limit). Default: 0

Benchmark options:
      --bench-bytes BYTES    Bytes for read/write benchmark. Default: 4194304
      --bench-iterations N   Metadata latency iterations. Default: 10
      --bench-type TYPE      Benchmark engine: 'internal' or 'fio'. Default: internal
      --stale-iterations N   ESTALE probe loop iterations. Default: 100

Output options:
      --json[=PATH]          Emit JSON report to PATH (use '-' or omit for stdout)
      --html[=PATH]          Emit HTML report to PATH (use '-' or omit for stdout)
      --output-dir DIR       Write JSON, HTML, evidence and checksums to DIR
      --output-format FMT    Terminal output format: text (default), table, ndjson, prometheus, junit
      --listen PORT          Serve Prometheus metrics over HTTP on PORT;
                              re-runs diagnostics every --watch SEC (default 60)
      --keep-temp            Keep temp workspace after tests
  -v, --verbose              Show all diagnostic steps
  -q, --quiet                Suppress stdout (combine with --json=FILE or --html=FILE)
  -V, --version              Print version and exit
      --self-test            Validate local dependencies and helper checks
  -h, --help                 Show this help

Exit codes: 0=pass  1=warn/fail  2=usage/runtime error

Stdout suppression: active only when --json=- or --html=- (report to stdout).
  Use --quiet to suppress stdout when writing a report to a file.

Exit codes

  • 0: no warnings or failures
  • 1: warning or failure found
  • 2: usage error or local runtime error

Warnings return 1 because in automation they usually need attention.


Docker fixtures

The project has Docker fixtures to reproduce bad NFS situations.

make docker-build-all          # build all fixture images
make test-fixtures             # run all fixture tests
make test-fixture-root-squash  # run one fixture
make test-fixtures-list        # list available fixtures

Some tests need root for real kernel NFS mounts. If the host kernel cannot run NFS inside Docker, those cases are skipped.

Warning: fixture configurations use wildcard clients, insecure, and no_root_squash intentionally. These settings are test-only and must never be used in production.

Available fixtures: rpcbind-unreachable, nfs-port-unreachable, rpc-map-missing-nfs, mountd-unavailable, empty-exports, mount-denied, permission-denied, acl-unsupported, identity-denied, read-only-export, root-squash, locking-missing, stale-handle, slow-performance.


Security notes

nfsdiag is designed to run as root. Key mitigations:

  • Mount operations run in a private mount namespace to avoid polluting the global namespace.
  • Exports are mounted with nosuid,nodev,noexec by default; disable only with --allow-risky-mount-options.
  • Host, export path and mount options are validated before network or mount activity.
  • Risky mount options require --allow-risky-mount-options.
  • Identity simulation always resets supplemental groups so results reflect the simulated user, not root.
  • --on-fail-exec scripts and --config files are refused if not owned by root/current user or if group/world-writable.
  • Output from external commands is sanitised for terminal escape sequences before display.
  • Symlink, hardlink, FIFO and device-node probes require --dangerous-fs-tests.
  • External commands are resolved from trusted directories and run with a minimal environment.
  • Report files are created with O_NOFOLLOW and mode 0600.
  • Test file paths include cryptographically random bytes (getrandom()) to prevent symlink attacks.
  • XDR strings from the server are sanitised for control characters before display.
  • HTML reports include a Content-Security-Policy header; all server-supplied strings are HTML-escaped.
  • TMPDIR is validated for ownership and world-writability before use.
  • Child processes that simulate UID/GID clear ambient capabilities before setuid().

To avoid creating test files in exports, use --read-only.


Version bumping

make bump-version-bugfix   # 0.5.0 → 0.5.1
make bump-version-minor    # 0.5.0 → 0.6.0
make bump-version-major    # 0.5.0 → 1.0.0

Each target updates VERSION, src/nfsdiag.h, and all packaging files atomically.


Limitations

  • ESTALE only appears if the handle becomes stale during the test window
  • SELinux/AppArmor problems can look like generic permission denied
  • ACL info depends on what the NFS client exposes
  • Performance numbers are smoke-test values, not full benchmarks
  • Docker fixture tests depend on host kernel and Docker privileges