storageServer: Shorten NFS timeouts

Change the default timeout to 10 seconds, and the number of
retransmissions to 3. This should result in a 60 second timeout before
an NFS request fails, similar to the behaviour in multipath.

According to nfs(5), NFS will retry a request after timeo deciseconds
(timeo / 10 seconds). After each retransmission, the timeout is
increased by the timeo value (up to a maximum of 600 seconds). After
retrans retries, the NFS client fails with a "server not responding"
message.

This is the expected failure flow:

    00:00   retry 1 (10 seconds timeout)
    00:10   retry 2 (20 seconds timeout)
    00:30   retry 3 (30 seconds timeout)
    01:00   request fail

In the past we were using timeout=600, retrans=6, which resulted in a
21 minute timeout:

    00:00   retry 1 (60 seconds timeout)
    01:00   retry 2 (120 seconds timeout)
    03:00   retry 3 (180 seconds timeout)
    06:00   retry 4 (240 seconds timeout)
    10:00   retry 5 (300 seconds timeout)
    15:00   retry 6 (360 seconds timeout)
    21:00   request fail

Testing shows that the storage monitor was blocked on storage for 270
seconds instead of the expected 60 seconds. This is still an
improvement compared with the 20-30 minutes seen with the previous
settings.

A VM running on the blocked NFS storage failed to resume after
unblocking the storage. This typically works with block storage.

So this looks like an improvement, but more work may be needed.

Change-Id: I29ad896e3e14e8a00edcdbec53226388281fec46
Bug-Url: https://bugzilla.redhat.com/1569926
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
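
The arithmetic behind the two failure flows can be reproduced with a short Python sketch. It models only the linear backoff as described in this commit message (each retry waits one more timeo, capped at 600 seconds); the function name nfs_retry_schedule and its max_timeout parameter are hypothetical and not part of vdsm. It does not model the real kernel client, which testing showed blocked for 270 seconds rather than 60.

```python
def nfs_retry_schedule(timeo, retrans, max_timeout=600):
    """
    Return (schedule, total_seconds) for the linear backoff described
    in the commit message.

    timeo is in deciseconds, as in the NFS mount option. schedule is a
    list of (start_seconds, timeout_seconds) tuples, one per retry.
    max_timeout is the assumed 600 second cap on a single timeout.
    """
    base = timeo / 10  # deciseconds to seconds
    schedule = []
    elapsed = 0
    for attempt in range(1, retrans + 1):
        # The timeout grows by one timeo per retry, up to the cap.
        timeout = min(base * attempt, max_timeout)
        schedule.append((elapsed, timeout))
        elapsed += timeout
    return schedule, elapsed


def format_schedule(timeo, retrans):
    """Render the schedule in the mm:ss style used in the message."""
    schedule, total = nfs_retry_schedule(timeo, retrans)
    lines = []
    for i, (start, timeout) in enumerate(schedule, 1):
        minutes, seconds = divmod(int(start), 60)
        lines.append("%02d:%02d   retry %d (%d seconds timeout)"
                     % (minutes, seconds, i, timeout))
    minutes, seconds = divmod(int(total), 60)
    lines.append("%02d:%02d   request fail" % (minutes, seconds))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_schedule(100, 3))   # new defaults
    print(format_schedule(600, 6))   # previous defaults
```

Under these assumptions, the new defaults (timeout=100, retrans=3) add up to 60 seconds and the previous defaults (timeout=600, retrans=6) add up to 1260 seconds, which is the 21 minutes quoted above.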
oVirt · Apr 23, 2020 · 672a98b
1 parent 3ff3049
commit 672a98b
Showing 1 changed file with 20 additions and 1 deletion.
diff --git a/lib/vdsm/storage/storageServer.py b/lib/vdsm/storage/storageServer.py
@@ -378,8 +378,27 @@ def version(self):
         # Return -1 to signify the version has not been negotiated yet
         return -1
 
-    def __init__(self, id, export, timeout=600, retrans=6, version=None,
+    def __init__(self, id, export, timeout=100, retrans=3, version=None,
                  extraOptions=""):
+        """
+        According to nfs(5), NFS will retry a request after 100 deciseconds (10
+        seconds). After each retransmission, the timeout is increased by timeo
+        value (up to maximum of 600 seconds). After retrans retries, the NFS
+        client will fail with "server not responding" message.
+
+        With the default configuration we expect failures in 60 seconds, which
+        is about 3 times longer than multipath timeout (20 seconds) for block
+        storage.
+
+        00:00   retry 1 (10 seconds timeout)
+        00:10   retry 2 (20 seconds timeout)
+        00:30   retry 3 (30 seconds timeout)
+        01:00   request fail
+
+        WARNING: timeout value must not be smaller than sanlock iotimeout (10
+        seconds). Using smaller value may cause sanlock to fail to renew
+        leases.
+        """
         self._remotePath = normpath(export)
         options = self.DEFAULT_OPTIONS[:]
         self._timeout = timeout