
Phoenix architecture reading notes: Phoenix Server

landon edited this page Nov 3, 2021 · 2 revisions



One day I had this fantasy of starting a certification service for operations. The certification assessment would consist of a colleague and I turning up at the corporate data center and setting about critical production servers with a baseball bat, a chainsaw, and a water pistol. The assessment would be based on how long it would take for the operations team to get all the applications up and running again.


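  • Note: This is very similar to Netflix's Chaos Monkey. In distributed systems, if your system is robust enough, starting and stopping individual services at will does not affect the system as a whole, and users never notice the failure. But if failures never actually happen, there is no way to verify that the services can tolerate extreme failure scenarios. For this reason, Netflix introduced a series of destructive "monkeys" into its systems. They deliberately cause trouble for various parts of the system, for example by shutting down a machine at an arbitrary moment. The service must keep running, every failure must be recovered automatically, and users must not notice anything.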

This may be a daft fantasy, but there's a nugget of wisdom here. While you should forego the baseball bats, it is a good idea to virtually burn down your servers at regular intervals. A server should be like a phoenix, regularly rising from the ashes.


The primary advantage of using phoenix servers is to avoid configuration drift: ad hoc changes to a system's configuration that go unrecorded. Drift is the name of a street that leads to SnowflakeServers, and you don't want to go there without a big plough.


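  • Note: Configuration drift causes the servers in an infrastructure to become more and more different from one another over time. SnowflakeServers get their name from snowflakes: no two snowflakes in the world are exactly alike.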

One way to combat drift is to use software that automatically re-syncs servers with a known baseline. Tools like Puppet and Chef have facilities to do this, automatically re-applying their defined configuration. [2] The limitation is that re-applying configuration like this can only spot drift in areas that you've defined that the tools control. Configuration drift that occurs outside those areas doesn't get fixed. Since phoenixes start from scratch, however, they will pick up any drift from the source configuration.


This doesn't mean that re-applying configuration isn't useful since it's usually faster and less disruptive than burning down a server. But it's valuable to use both strategies to fight away the snowflakes.
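
The difference between re-syncing and rebuilding can be illustrated with a minimal sketch. Everything here is hypothetical: a server is modelled as a plain dict of settings, `BASELINE` stands in for the source configuration, and `MANAGED_KEYS` stands in for the areas a tool such as Puppet or Chef has been told to control. It is not how either tool is implemented; it only shows why drift outside the managed areas survives a re-sync but not a rebuild.

```python
# Hypothetical model: a server's configuration is a dict of setting -> value.
BASELINE = {
    "nginx_version": "1.20",
    "max_connections": 1024,
    "ssh_root_login": "no",
}

# Assumption: the configuration tool has only been told to manage these keys.
MANAGED_KEYS = {"nginx_version", "max_connections"}


def resync(server: dict) -> dict:
    """Re-apply the baseline, but only for the keys the tool manages."""
    fixed = dict(server)
    for key in MANAGED_KEYS:
        fixed[key] = BASELINE[key]
    return fixed


def rebuild() -> dict:
    """Burn the server down and build a fresh one from the baseline."""
    return dict(BASELINE)


def drift(server: dict) -> dict:
    """Return every setting that differs from the baseline as key -> (actual, expected)."""
    keys = set(server) | set(BASELINE)
    return {
        k: (server.get(k), BASELINE.get(k))
        for k in keys
        if server.get(k) != BASELINE.get(k)
    }


# A server with one managed change and two unmanaged, unrecorded changes.
drifted = {
    "nginx_version": "1.20",
    "max_connections": 4096,      # managed: a re-sync repairs this
    "ssh_root_login": "yes",      # unmanaged: a re-sync never looks at it
    "debug_tool": "installed",    # unmanaged: not in the baseline at all
}

print(drift(resync(drifted)))  # the two unmanaged changes remain
print(drift(rebuild()))        # empty: the phoenix starts from scratch
```

In this toy model the re-sync is cheap but blind to anything outside `MANAGED_KEYS`, while the rebuild is more disruptive but leaves no drift behind, which is why the text recommends using both.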


Further Reading

Netflix has a chaos monkey that randomly burns down servers in order to test that their system is resilient.



  • Note: Chaos Monkey is responsible for randomly terminating instances in production to ensure that engineers implement their services to be resilient to instance failures.
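
The idea of random termination can be sketched in a few lines. This is a toy illustration only, not Netflix's Chaos Monkey or its API: the fleet is a set of made-up instance names, and `chaos_round`, `service_is_up`, and `relaunch` are hypothetical helpers invented for this example.

```python
import random


def chaos_round(instances: set, rng: random.Random):
    """Terminate one randomly chosen instance; return (victim, survivors)."""
    victim = rng.choice(sorted(instances))  # sorted() makes the choice reproducible
    return victim, instances - {victim}


def service_is_up(instances: set, min_healthy: int) -> bool:
    """Assumption: the service is healthy while enough instances remain."""
    return len(instances) >= min_healthy


def relaunch(instances: set, victim: str) -> set:
    """Automatic recovery: start a replacement for the terminated instance."""
    return instances | {victim + "-replacement"}


rng = random.Random(0)
fleet = {"web-1", "web-2", "web-3"}

victim, survivors = chaos_round(fleet, rng)
# The resilience check: users should not notice the loss of one instance.
still_serving = service_is_up(survivors, min_healthy=2)
recovered = relaunch(survivors, victim)

print(victim, still_serving, len(recovered))
```

If `still_serving` were ever false after losing a single instance, that would be the kind of weakness such a tool is meant to expose before a real failure does.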


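  • To avoid ending up with large numbers of snowflake servers in production, one approach is to give your servers the phoenix property: each one can be rebuilt from scratch.
  • In addition, to make your services more resilient and reliable, you need tools that randomly create chaos.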

landon 2021.11.03 addendum

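  • This is really the same idea as immutable infrastructure. In traditional infrastructure, servers are continually updated and modified in place, which makes them mutable. Immutable infrastructure is never modified after deployment: when an update is needed, a new server is built to replace the old one. Once the new server passes validation it goes into service, and the old one is retired.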
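
The build, validate, swap, retire cycle described above can be sketched as follows. This is a minimal model under stated assumptions: `Server`, `build`, `validate`, and `replace_pool` are hypothetical names invented for illustration, and the validation step is reduced to a version check where a real pipeline would run health checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a deployed server is never modified in place
class Server:
    name: str
    image_version: str


def build(name: str, image_version: str) -> Server:
    """Build a brand-new server from an image (stand-in for real provisioning)."""
    return Server(name=name, image_version=image_version)


def validate(server: Server, expected_version: str) -> bool:
    """Assumption: validation is a simple version check in this sketch."""
    return server.image_version == expected_version


def replace_pool(pool, new_version: str):
    """Replace every server with a freshly built one; return (new_pool, retired)."""
    new_pool, retired = [], []
    for old in pool:
        candidate = build(f"{old.name}-v{new_version}", new_version)
        if not validate(candidate, new_version):
            raise RuntimeError(f"{candidate.name} failed validation; keeping {old.name}")
        new_pool.append(candidate)  # the new server goes into service
        retired.append(old)         # the old server is retired, never patched
    return new_pool, retired


pool = [build("web-1", "1"), build("web-2", "1")]
new_pool, retired = replace_pool(pool, "2")
print([s.name for s in new_pool], [s.name for s in retired])
```

Marking the dataclass as frozen mirrors the rule in code: any attempt to change a deployed `Server` raises an error, so the only way to update is to build a replacement.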