mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode()

Problem: There are certain scenarios in a degraded stretch cluster where the monitor will try to enter ``Monitor::go_recovery_stretch_mode()``, which leads to a `ceph_assert`. For example, enough OSDs can come back up to exceed ``mon_stretch_cluster_recovery_ratio`` while a monitor bucket is still marked dead.

Solution: Make sure ``dead_mon_buckets.size() == 0`` in ``OSDMonitor::update_from_paxos()`` before calling ``Monitor::go_recovery_stretch_mode()``.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2104207
Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit d95c41a)
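The failure mode can be pictured as a callee-side invariant that the caller did not respect. The sketch below is a heavily simplified model, not Ceph code: `MonitorModel` and `maybe_enter_recovery` are hypothetical names, and the assumption (inferred from the fix, not confirmed by this commit) is that `go_recovery_stretch_mode()` asserts when any monitor bucket is dead. A `std::logic_error` stands in for `ceph_assert` so the behavior can be tested without aborting.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for Monitor. It models only the assumed invariant:
// recovery stretch mode must not start while a monitor bucket is dead.
struct MonitorModel {
  std::set<std::string> dead_mon_buckets;  // e.g. {"site-b"}
  bool recovering = false;

  void go_recovery_stretch_mode() {
    // Stand-in for the ceph_assert mentioned in the commit message.
    if (!dead_mon_buckets.empty()) {
      throw std::logic_error("recovery stretch mode with dead mon buckets");
    }
    recovering = true;
  }
};

// Caller-side guard in the spirit of the fix: only call into recovery mode
// when the OSD condition holds AND no monitor bucket is dead.
// Returns true if recovery mode was entered.
bool maybe_enter_recovery(MonitorModel& mon, bool osd_ratio_ok) {
  if (osd_ratio_ok && mon.dead_mon_buckets.size() == 0) {
    mon.go_recovery_stretch_mode();
    return true;
  }
  return false;
}
```

Before the fix, the OSD condition alone decided whether the call happened, so a still-dead site reached the invariant check and tripped it; the guard moves that check to the call site.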
kamoltat · Nov 8, 2022 · 94dc970
1 parent 025d3fa
Showing 1 changed file with 3 additions and 1 deletion.
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
@@ -960,10 +960,12 @@ void OSDMonitor::update_from_paxos(bool *need_bootstrap)
   dout(20) << "mon_stretch_cluster_recovery_ratio: " << cct->_conf.get_val<double>("mon_stretch_cluster_recovery_ratio") << dendl;
 	if (prev_num_up_osd < osdmap.num_up_osd &&
 	    (osdmap.num_up_osd / (double)osdmap.num_osd) >
-	    cct->_conf.get_val<double>("mon_stretch_cluster_recovery_ratio")) {
+	    cct->_conf.get_val<double>("mon_stretch_cluster_recovery_ratio") &&
+      mon.dead_mon_buckets.size() == 0) {
 	  // TODO: This works for 2-site clusters when the OSD maps are appropriately
 	  // trimmed and everything is "normal" but not if you have a lot of out OSDs
 	  // you're ignoring or in some really degenerate failure cases
+
 	  dout(10) << "Enabling recovery stretch mode in this map" << dendl;
 	  mon.go_recovery_stretch_mode();
 	}
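The patched condition can be read as a pure predicate over three inputs from the map update. The following is a minimal sketch, not the actual Ceph implementation: `MapUpdate` and `should_enter_recovery` are hypothetical names, and the ratio is passed in rather than read from `mon_stretch_cluster_recovery_ratio` through `cct->_conf`.

```cpp
#include <cstddef>

// Hypothetical inputs gathered from one OSD map update.
struct MapUpdate {
  int prev_num_up_osd;
  int num_up_osd;
  int num_osd;
  std::size_t dead_mon_bucket_count;
};

// Mirrors the three clauses of the patched if-statement:
//  1. more OSDs are up than in the previous map,
//  2. the up fraction exceeds the recovery ratio,
//  3. no monitor bucket is dead (the clause this commit adds).
bool should_enter_recovery(const MapUpdate& u, double recovery_ratio) {
  // Convert before dividing, as the (double) cast in the diff does;
  // plain int division would make 5 / 6 equal 0.
  double up_fraction = u.num_up_osd / static_cast<double>(u.num_osd);
  return u.prev_num_up_osd < u.num_up_osd &&
         up_fraction > recovery_ratio &&
         u.dead_mon_bucket_count == 0;
}
```

For example, with an illustrative ratio of 0.6, going from 3 to 5 of 6 OSDs up gives 5/6 ≈ 0.83, which clears the threshold; the predicate still returns false while `dead_mon_bucket_count` is 1 and true only once it drops to 0.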