Reliability is a function of mean time to failure (MTTF) and mean time to recovery (MTTR).
MTTF indicates how long a process typically runs before it fails, so a low MTTF signals that the process is not reliable.
MTTR indicates how quickly you can resolve an issue and thereby limit its impact on a service.
Because humans add latency to recovery, automated systems are useful for reducing the time it takes to resolve an issue. Playbooks, or runbooks, also help reduce MTTR.
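
To make the relationship between MTTF, MTTR, and reliability concrete, here is a minimal sketch. It assumes the widely used steady-state availability formula, MTTF / (MTTF + MTTR), which the text above does not state explicitly. The function name and all of the numbers are hypothetical and chosen only for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return the fraction of time a service is up.

    Assumes the common steady-state formula MTTF / (MTTF + MTTR);
    this is an illustrative model, not a formula given in the text.
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service that fails about once every 30 days (720 hours).
MTTF_HOURS = 720.0

# Hypothetical recovery times: a human-driven response versus an
# automated one. Only the MTTR differs between the two cases.
manual = availability(MTTF_HOURS, mttr_hours=2.0)
automated = availability(MTTF_HOURS, mttr_hours=0.25)

print(f"manual recovery:    {manual:.4%}")
print(f"automated recovery: {automated:.4%}")
```

With MTTF held constant, shortening MTTR is the only thing that changes between the two cases, which is why automation and playbooks improve reliability under this model even when failures are no less frequent.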