Add reason for unhealthiness to autopilot server state #12

mkeeler · 2021-08-10T19:49:37Z

Healthiness of a server is computed with this function:

Lines 110 to 128 in 2ba1616

    
    func (s *ServerState) isHealthy(lastTerm uint64, leaderLastIndex uint64, conf *Config) bool {
    	if s.Server.NodeStatus != NodeAlive {
    		return false
    	}
    	if s.Stats.LastContact > conf.LastContactThreshold || s.Stats.LastContact < 0 {
    		return false
    	}
    	if s.Stats.LastTerm != lastTerm {
    		return false
    	}
    	if leaderLastIndex > conf.MaxTrailingLogs && s.Stats.LastIndex < leaderLastIndex-conf.MaxTrailingLogs {
    		return false
    	}
    	return true
    }

We store the results of those function calls in this field:

raft-autopilot/types.go

Line 133 in 2ba1616

Healthy bool

It would help during periods of unhealthiness to be able to quickly identify why individual nodes are unhealthy. To that end, I think it would be useful to add a field to the autopilot ServerHealth type that holds the latest reason a server was considered unhealthy.


chermehdi · 2022-01-10T12:50:07Z

Would an enumeration that describes the unhealthy state, following the checks in the isHealthy method, be enough for this?

Will the reason be used for logging, or for something else?

mkeeler · 2022-01-10T15:03:37Z

I was originally thinking of a more human-readable message for the reason, maybe coupled with an enumeration. For example, if the terms are not equal you would want to know that, and an enumeration would suffice. However, it would also be nice to know what the term values were, as that could provide insight into a cluster where leader elections keep getting triggered. For that reason I think a formatted message is probably most useful.
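One way to couple the two, shown here only as a sketch: an enumeration identifies which check failed, and a formatted message carries the actual values. The names `UnhealthyReason`, `HealthDetail`, and `termCheck` are hypothetical and do not exist in the library.

```go
package main

import "fmt"

// UnhealthyReason is a hypothetical enumeration with one value per check
// performed by isHealthy.
type UnhealthyReason int

const (
	ReasonNone UnhealthyReason = iota
	ReasonNodeNotAlive
	ReasonLastContact
	ReasonTermMismatch
	ReasonTrailingLogs
)

// String gives each reason a stable, machine-friendly name.
func (r UnhealthyReason) String() string {
	switch r {
	case ReasonNone:
		return "none"
	case ReasonNodeNotAlive:
		return "node-not-alive"
	case ReasonLastContact:
		return "last-contact"
	case ReasonTermMismatch:
		return "term-mismatch"
	case ReasonTrailingLogs:
		return "trailing-logs"
	default:
		return "unknown"
	}
}

// HealthDetail pairs the enumeration with a formatted, human-readable message.
type HealthDetail struct {
	Reason  UnhealthyReason
	Message string
}

// termCheck illustrates the term comparison only. On a mismatch, the message
// includes both term values so that repeated elections are easier to spot.
func termCheck(serverTerm, expectedTerm uint64) HealthDetail {
	if serverTerm != expectedTerm {
		return HealthDetail{
			Reason:  ReasonTermMismatch,
			Message: fmt.Sprintf("server term %d does not match expected term %d", serverTerm, expectedTerm),
		}
	}
	return HealthDetail{Reason: ReasonNone}
}

func main() {
	d := termCheck(7, 9)
	fmt.Printf("%s: %s\n", d.Reason, d.Message)
}
```

The enumeration stays easy to match on in code or alerts, while the message keeps the values that matter during an incident.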

As for the reason, emitting logs at the debug level with the reason when marking a node as unhealthy would be nice. However, what I originally wanted was to keep track of the reason for unhealthiness in the autopilot state. Consul (the main consumer of this library for me) exposes this state through an HTTP API. Combining it all, if something seems to be acting up, a Consul user could query the autopilot state and see not only which nodes were considered unhealthy but also a quick indication of the cause. All of the information used to determine healthiness is already exported in the state, but during an incident it takes a little too much thought to piece all the bits together. This ticket is all about reducing the mental overhead of diagnosing problems in an on-call incident situation.
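To make the intended end result concrete, here is a sketch of how a reason carried in the state might look once serialized for an API response. The struct below is a simplified stand-in: the `UnhealthyReason` field, the JSON key names, and the `renderState` helper are assumptions for illustration and do not reflect the library's or Consul's actual types or response format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ServerHealth is a simplified stand-in for the health portion of the state.
// UnhealthyReason is the hypothetical new field; it is omitted when empty.
type ServerHealth struct {
	Healthy         bool   `json:"healthy"`
	UnhealthyReason string `json:"unhealthy_reason,omitempty"`
}

// renderState serializes per-server health keyed by server name.
// encoding/json emits map keys in sorted order, so the output is stable.
func renderState(servers map[string]ServerHealth) (string, error) {
	out, err := json.Marshal(servers)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	state := map[string]ServerHealth{
		"server-1": {Healthy: true},
		"server-2": {Healthy: false, UnhealthyReason: "server term 3 does not match expected term 4"},
	}
	out, err := renderState(state)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

An operator reading such a response would see the unhealthy server and its cause together, without cross-referencing terms, indexes, and thresholds by hand.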

mkeeler added the enhancement label Aug 10, 2021
