
phoenix architecture reading notes phoenix server

landon edited this page Nov 3, 2021 · 2 revisions

Phoenix Architecture reading notes: a translation of PhoenixServer

Original text

One day I had this fantasy of starting a certification service for operations. The certification assessment would consist of a colleague and I turning up at the corporate data center and setting about critical production servers with a baseball bat, a chainsaw, and a water pistol. The assessment would be based on how long it would take for the operations team to get all the applications up and running again.

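One day I had the odd idea of offering a certification service for operations teams. For the assessment, a colleague and I would show up at the company's data center and attack the critical production servers with a baseball bat, a chainsaw, and a water pistol. The result would depend on how long the operations team needed to get every application up and running again.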

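  • Note: this closely resembles Netflix's Chaos Monkey. In a distributed system that is robust enough, starting or stopping individual services should not affect the system as a whole, and users should not notice the failure. But if failures never occur, there is no way to verify that the service can tolerate such extreme cases. Netflix therefore introduced a set of disruptive "monkeys" that deliberately cause trouble for different parts of the system, for example by shutting down a machine at an arbitrary moment. The service has to keep running, every failure has to recover automatically, and users must not notice.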

This may be a daft fantasy, but there's a nugget of wisdom here. While you should forego the baseball bats, it is a good idea to virtually burn down your servers at regular intervals. A server should be like a phoenix, regularly rising from the ashes.

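The idea may be crazy, but there is some wisdom in it. You should leave the baseball bats at home, yet virtually destroying your servers on a regular schedule is a good idea. A server should be like a phoenix, reborn from the ashes.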

The primary advantage of using phoenix servers is to avoid configuration drift: ad hoc changes to a system's configuration that go unrecorded. Drift is the name of a street that leads to SnowflakeServers, and you don't want to go there without a big plough.

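The main benefit of phoenix servers is avoiding configuration drift: ad hoc changes to a system's configuration that are never recorded. Drift is the name of the street that leads to snowflake servers, and you would not want to go there without a big plough.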

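  • Note: configuration drift makes the servers in an infrastructure grow more and more different from one another over time. The name SnowflakeServers comes from snowflakes: no two snowflakes in the world are exactly alike.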

One way to combat drift is to use software that automatically re-syncs servers with a known baseline. Tools like Puppet and Chef have facilities to do this, automatically re-applying their defined configuration. [2] The limitation is that re-applying configuration like this can only spot drift in areas that you've defined that the tools control. Configuration drift that occurs outside those areas doesn't get fixed. Since phoenixes start from scratch, however, they will pick up any drift from the source configuration.

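One way to counter drift is to use software that automatically re-syncs servers against a known baseline. Tools such as Puppet and Chef can do this by automatically re-applying their defined configuration. The limitation is that re-applying configuration only detects drift in the areas you have placed under the tools' control; drift outside those areas is not fixed. Phoenix servers start from scratch, so they catch any drift away from the source configuration.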
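The difference between re-syncing and rebuilding can be illustrated with a minimal Python sketch. It is not how Puppet or Chef work internally; the configuration keys, the `managed` set, and the helper names `resync`, `rebuild`, and `drift` are all hypothetical and exist only for illustration.

```python
from typing import Dict, Set

Config = Dict[str, str]


def drift(actual: Config, baseline: Config) -> Set[str]:
    """Return every key whose value differs from the baseline,
    including keys that exist on only one side."""
    all_keys = actual.keys() | baseline.keys()
    return {k for k in all_keys if actual.get(k) != baseline.get(k)}


def resync(actual: Config, baseline: Config, managed: Set[str]) -> Config:
    """Re-apply the baseline, but only for the keys the tool manages."""
    fixed = dict(actual)
    for key in managed:
        if key in baseline:
            fixed[key] = baseline[key]
    return fixed


def rebuild(baseline: Config) -> Config:
    """Phoenix approach: discard the server and start from the baseline."""
    return dict(baseline)


# Hypothetical baseline and a server that has drifted in three places.
baseline = {"ntp_server": "ntp.example.internal", "max_open_files": "65536"}
managed = {"ntp_server"}  # the only area the tool has been told to control
actual = {
    "ntp_server": "10.0.0.9",   # drift inside the managed area
    "max_open_files": "1024",   # drift outside the managed area
    "debug_hotfix": "on",       # unrecorded ad hoc change
}

after_resync = resync(actual, baseline, managed)
after_rebuild = rebuild(baseline)

print(sorted(drift(after_resync, baseline)))   # drift the tool never sees
print(sorted(drift(after_rebuild, baseline)))  # nothing left after a rebuild
```

In this sketch the re-sync repairs only `ntp_server`, while the rebuilt server matches the baseline everywhere, which is the point the paragraph above makes about areas the tools do not control.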

This doesn't mean that re-applying configuration isn't useful since it's usually faster and less disruptive than burning down a server. But it's valuable to use both strategies to fight away the snowflakes.

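This does not mean re-applying configuration is useless, since it is usually faster and less disruptive than burning down a server. Using both strategies together is a valuable way to fight snowflakes.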

Further Reading

Netflix has a chaos monkey that randomly burns down servers in order to test that their system is resilient.

Further reading

Netflix has a Chaos Monkey that randomly burns down servers to test how resilient their system is.

  • Note: https://netflix.github.io/chaosmonkey/ Chaos Monkey is responsible for randomly terminating instances in production to ensure that engineers implement their services to be resilient to instance failures.
  • In other words, Chaos Monkey randomly terminates live production instances so that engineers build services that stay resilient when an instance fails.
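The idea of random instance termination can be sketched in a few lines of Python. This is a toy simulation, not the Chaos Monkey implementation or its API; the instance names, the `min_required` threshold, and the helper functions are assumptions made for illustration.

```python
import random
from typing import List


def terminate_random_instance(instances: List[str], rng: random.Random) -> str:
    """Remove one randomly chosen instance from the pool and return its name."""
    victim = rng.choice(instances)
    instances.remove(victim)
    return victim


def is_service_healthy(instances: List[str], min_required: int) -> bool:
    """The service counts as healthy while enough instances are still running."""
    return len(instances) >= min_required


# Hypothetical pool: three instances, of which two are needed to serve traffic.
pool = ["web-1", "web-2", "web-3"]
rng = random.Random(42)  # seeded so the simulation is repeatable

victim = terminate_random_instance(pool, rng)
healthy_after_failure = is_service_healthy(pool, min_required=2)

# Automatic recovery: a replacement instance joins the pool.
pool.append(victim + "-replacement")

print("terminated:", victim)
print("still healthy:", healthy_after_failure)
print("pool after recovery:", pool)
```

A pool sized with spare capacity survives the loss of any single instance, which is the property this kind of tool is meant to verify in production.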

landon

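  • To avoid a large number of snowflake servers in production, one approach is to give your servers the phoenix property: they can be rebuilt from scratch at any time.
  • In addition, to make your services more resilient and reliable, you need tools that create chaos at random.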

landon 2021.11.03 addendum

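  • This is essentially the same idea as immutable infrastructure. In traditional infrastructure, servers are continually updated and modified, that is, they are mutable. Immutable infrastructure is never modified after deployment: when an update is needed, a new server is rebuilt to replace the old one. Once the new server passes verification it goes into service, and the old one is retired.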
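The replace-instead-of-modify flow described above can be sketched in Python. The `Server` type, the `replace_server` helper, and the health-check callbacks are hypothetical names for illustration, and a real deployment would provision actual machines or images rather than in-memory objects.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)  # frozen: a server is never modified after creation
class Server:
    name: str
    image_version: str


def replace_server(
    pool: List[Server],
    old: Server,
    new_version: str,
    health_check: Callable[[Server], bool],
) -> Server:
    """Build a new server from a new image, verify it, then swap it in.
    The old server stays in service if verification fails."""
    candidate = Server(name=old.name, image_version=new_version)
    if not health_check(candidate):
        return old  # verification failed: keep the old server
    pool[pool.index(old)] = candidate  # the old server is retired
    return candidate


# Hypothetical usage: one upgrade that passes and one that fails verification.
pool = [Server("web-1", "v1")]
current = replace_server(pool, pool[0], "v2", health_check=lambda s: True)
rejected = replace_server(pool, current, "v3", health_check=lambda s: False)

print(pool)      # holds the v2 server only
print(rejected)  # still the v2 server, because v3 failed verification
```

Because `Server` is frozen, the only way to change what is running is to build a replacement, which mirrors the rule that immutable servers are never edited in place.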