
High-Availability (HA) cluster of Home Assistant instances for redundancy #152

Closed
Gamester17 opened this issue Feb 18, 2019 · 24 comments

Comments

@Gamester17

Gamester17 commented Feb 18, 2019

Would it be possible to make Home Assistant into a High-Availability (HA) application with automatic fail-over as an installation option for redundancy?

That is, run multiple Home Assistant instances not simply for forwarding/repeater or performance purposes, but as a High-Availability (HA) cluster with automatic failover functionality?

Home Assistant supports (or did support?) running multiple instances synchronized using a master-slave model. However, as I understand it, that is not a true High-Availability (HA) setup: the slave only has forwarding/repeater functionality, and there is no automatic fail-over where the slave holds a full replica of the database and is promoted to master in a failure scenario(?).

As requested/discussed in the forum here https://community.home-assistant.io/t/high-availability-ha/52785

As smart home devices/appliances become part of our everyday usage, we are starting to rely on the availability of our home automation controllers, especially if you are running Home Assistant on a computer with a Z-Wave and/or Zigbee dongle. It would therefore be very nice to have the option of achieving a higher degree of resilience with multiple installations of Home Assistant on the same home network working as one, in a true High-Availability (HA) setup to secure its uptime; either active-active with no master, or active-passive where the slave node can automatically be promoted to master in a fail-over scenario. I guess the most common failure scenario today is a corrupt or failed SD-card when running Home Assistant on a Raspberry Pi, but another common one is regressions introduced by upgrades, which are usually covered by not upgrading both High-Availability nodes at the same time; that would be possible as long as the High-Availability function is compatible across different versions of Home Assistant.

Home Assistant running as a High-Availability (HA) application would imply Home Assistant running on, for example, two Raspberry Pi computers (a.k.a. nodes) as one Home Assistant instance (a.k.a. HA-cluster). In an active-passive setup both Raspberry Pis would be running Home Assistant, but one node would sit in a kind of standby mode, just waiting for the other to stop responding before taking over (fail-over). If you have no physical Z-Wave or Zigbee USB/serial adapters connected, the second node could take over all functions directly; if you do have such adapters connected, it would notify and prompt you to move your adapters to that second node. In an active-active setup both Raspberry Pis would have the exact same hardware installation, meaning each node has the same type of physical Z-Wave or Zigbee USB/serial adapters connected.

Another very nice feature that many modern High-Availability clustered applications (especially network appliances) have today, and which Hass.io OS could gain with this, is a "Cluster Operating System Rolling Upgrade": automatic rolling updates of the full application and/or the whole operating system, one cluster node at a time. This normally allows continuous delivery of all functions without any downtime, as each node in the cluster takes on all functionality alone while the other node performs its upgrade, and then vice versa in an automatic fail-over and fail-back procedure, so such rolling upgrades do not interrupt services to the users.

If you are using Home Assistant to, for example, control all your heating and/or lights, then having it run in a High-Availability (HA) setup for higher reliability could certainly also help it reach a higher WAF (Wife Acceptance Factor). In my experience, tinkering too much with heating or lights on your one and only home automation controller is normally not acceptable when you have a wife/partner and young kids at home, because the controller is then in "production" and has to be working all the time.

@Gamester17
Author

Gamester17 commented Feb 19, 2019

From an architectural and technical point of view, High-Availability for Home Assistant and/or Hass.io could be achieved in many different ways. Most of them depend on whether you want to make Home Assistant itself High-Availability-aware/cluster-aware or not; an alternative would be to implement a High-Availability cluster only at the operating-system or container level in Hassio OS, where the Home Assistant application is not High-Availability-aware/cluster-aware. However, such an OS/container-level-only solution would probably imply an active-passive cluster where the second node is in standby, which means that a fail-over would still cause a short downtime.

I understand that one way such an OS/container-level-only solution could be achieved in Hass.io is by using Docker Swarm with Docker Machine, as described here: http://mmorejon.github.io/en/blog/docker-swarm-with-docker-machine-high-availability/
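
For illustration, here is a minimal sketch of that container-level approach using the Docker SDK for Python: a single-replica swarm service whose configuration lives on shared storage, so the swarm scheduler restarts it on a surviving node when the active node dies. The image tag, volume path and network name are placeholders, not part of any official Hass.io recipe.

```python
# Hypothetical sketch: run Home Assistant as a single-replica Docker Swarm
# service so the swarm reschedules it on another node after a node failure.
# Assumes a swarm is already initialized and /mnt/shared is replicated storage.
import docker
from docker.types import RestartPolicy, ServiceMode

client = docker.from_env()

client.services.create(
    image="ghcr.io/home-assistant/home-assistant:stable",
    name="homeassistant",
    mode=ServiceMode("replicated", replicas=1),     # exactly one active instance
    restart_policy=RestartPolicy(condition="any"),  # always restart/reschedule
    mounts=["/mnt/shared/hass-config:/config:rw"],  # config on shared storage
    networks=["hass-net"],
)
```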

Regardless of the level and type of High-Availability cluster you choose, one of the trickier problems is that you also need an external "cluster witness", and you have to solve that at a relatively low cost for a home environment, preferably without requiring a third Raspberry Pi. This "witness" acts as a quorum to help decide which cluster node should be the master if and when the cluster nodes lose connection to each other; otherwise you might end up with a "split-brain" scenario in which both nodes think they should have the role of master node. One easy way to solve this is with a witness file on a third device on your network; that witness file must then be stored on a file system which can be shared and written simultaneously from multiple nodes (such as an NFS share on a NAS). Another option could be to offer the Home Assistant Cloud subscription service as a cluster witness over the internet, but the downside is that users would then depend on the internet for the High-Availability cluster to work.
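
To make the witness-file idea concrete, here is a minimal sketch of how each node could use a heartbeat file on an NFS share as the quorum. The path, node names and timeout are made up for illustration, and a real implementation would also need file locking to be safe against concurrent writes:

```python
# Hypothetical NFS witness-file quorum check, not existing Home Assistant code.
import json
import time
from pathlib import Path

WITNESS = Path("/mnt/nas/ha-cluster/witness.json")  # NFS share writable by both nodes
STALE_AFTER = 30  # seconds without a heartbeat before a peer is considered dead

def heartbeat(node: str) -> None:
    """Record this node's heartbeat timestamp in the shared witness file."""
    data = json.loads(WITNESS.read_text()) if WITNESS.exists() else {}
    data[node] = time.time()
    WITNESS.write_text(json.dumps(data))

def may_promote(peer: str) -> bool:
    """Claim the master role only if the peer's witness heartbeat is stale too.

    If the peer is still heartbeating, we merely lost the direct link to it,
    and promoting ourselves would cause a split-brain.
    """
    data = json.loads(WITNESS.read_text()) if WITNESS.exists() else {}
    return time.time() - data.get(peer, 0) > STALE_AFTER

# Each node calls heartbeat("node-a") every few seconds; on losing contact
# with node-b it only takes over if may_promote("node-b") returns True.
```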

By the way, I do not know how Z-Wave would work with adapter controllers on multiple nodes in a Home Assistant High-Availability cluster, but for Zigbee adapters in the above suggested active-active cluster you should technically be able to set the Zigbee adapter on one node (the 'master node') to be the "Zigbee coordinator" and the adapter on the other node (the 'slave node') to be a "Zigbee router", so that the second node just routes/forwards/relays all Zigbee signals to the first node as a range extender. These "Zigbee coordinator" and "Zigbee router" roles would then have to change automatically in a cluster failover scenario, with the "Zigbee router" being promoted to "Zigbee coordinator" by the Home Assistant ZHA component.

@Shulyaka

If we promote router to coordinator, do we need to re-pair all devices?

@Hedda

Hedda commented Oct 21, 2019

I think that one of the objectives of a "High-Availability (HA) cluster of Home Assistant" should be quickly and automatically getting back to operation after a fail-over. If that is the agreed objective, then re-pairing all devices is not an option. HA fail-over from one node to the other (and back) should really be possible without any user operation whatsoever.

@embcla

embcla commented Sep 30, 2020

Given that everything is already containerized, running these containers in a swarm/k3s/k8s HA cluster should be feasible with moderate effort.

Cloud connections would adapt seamlessly as they are created by the running instance, so if one drops, the next instance recreates the connection and that service keeps running.

What needs a real solution is how to keep wired devices connected. For USB devices there could be solutions, but they require client software. Moreover, the machine physically connected to the wired device would become a single point of failure, so if that one goes down the device is unreachable to the whole HA cluster.
This kind of makes the whole HA point moot, in my opinion.

@MindFreeze

MindFreeze commented Sep 30, 2020

Everyone seems focused on wired devices. I have never used any wired devices, exactly because they cause this problem, and I can't be the only one. There are alternatives to wired devices, like a zigbee2mqtt bridge.
IMO there should be an easy High-Availability solution for Home Assistant that just states in bold text "This does not support wired devices".
And even if I did use wired stuff, I would prefer to have a fallback for the core Home Assistant server even when anything wired is toast. Home Assistant is not only Zigbee/Z-Wave, after all. Having anything working is better than nothing.

PS: My idea is to have separate servers (RPis maybe) in different rooms, so if one is on fire, the other one can take over (and maybe raise a fire alarm :D). This means that any attached USB devices will be on fire as well, so no USB switch or forwarding is going to help. OK, maybe instead of a fire it is just a blown fuse, but you get the idea.

@Adminiuga

Then nothing prevents you from running a Home Assistant instance in a swarm.

@Hedda

Hedda commented Sep 30, 2020

It should be noted that you could soon be able to back up a Zigbee coordinator and restore it to some other hardware; see these:

zigpy/zigpy#557

Koenkk/zigbee-herdsman#303

https://github.com/zigpy/open-coordinator-backup/

Also, note that backup and restore for some Z-Wave controllers and Zigbee coordinators of the same type is already possible, e.g.

The Aeotec Z-Stick Gen5 USB device has a backup utility and can be restored to a new Aeotec Z-Stick Gen5:

https://aeotec.freshdesk.com/support/solutions/articles/6000108806-z-stick-gen5-backup-software

The zigpy-znp library for zigpy, which ZHA uses, can back up and restore Texas Instruments Zigbee radios:

https://github.com/zha-ng/zigpy-znp

NVRAM Backup and restore

A complete NVRAM backup can be performed to migrate between different radios based on the same chip. Anything else will not work.

(venv) $ python -m zigpy_znp.tools.nvram_read /dev/serial/by-id/old_radio -o backup.json
(venv) $ python -m zigpy_znp.tools.nvram_write /dev/serial/by-id/new_radio -i backup.json

Tested migrations:

  • LAUNCHXL-CC26X2R1 running 3.30.00.03 to and from the zig-a-zig-ah! running 4.10.00.78.

The bellows radio library for zigpy, which ZHA uses, can back up and restore Silicon Labs Zigbee radios:

zigpy/bellows#295

Theoretically this allows backing up and restoring NCP state between hardware versions. Please note this is highly experimental. The restoration doesn't restore NCP children and relies on the children to find a new parent on their own.

To export the TC config, see `bellows backup --help`; usually it is just `bellows backup > your-backup-file.txt`.
The backup contains your network key, so you should probably keep that file secure.
To import, see `bellows restore --backup-file your-backup-file.txt`

Legacy RF protocols

Many older RF protocol standards that predate Zigbee and Z-Wave did not store or run anything on the USB device itself.

@Hedda

Hedda commented Sep 30, 2020

Personally, I would love to be able to easily set up two Zigbee coordinator bridges, like the Sonoff ZBBridge WiFi-to-Zigbee bridge or these DIY Ethernet-to-Zigbee bridges, as a pair of redundant Zigbee coordinators: one of them is the primary, which is actively used and regularly backed up, and in case of failure that backup can be restored to the secondary Zigbee coordinator bridge, which acts as a warm/hot standby device (always-on standby, just waiting for a Zigbee coordinator restore).

https://www.digiblur.com/2020/07/how-to-use-sonoff-zigbee-bridge-with.html

zigpy/zigpy#584

Again please see discussion about Zigbee coordinator backup and restore here:

zigpy/zigpy#557

Koenkk/zigbee-herdsman#303

https://github.com/zigpy/open-coordinator-backup/

PS: ZHA requires a Tasmota or ESPHome serial server and standard Silicon Labs EmberZNet EZSP Zigbee firmware for the EFR32 on these.

@Shulyaka

Can the second HA instance join the Zigbee network as a router? Will it still be able to bind to the end device clusters and access sensor data?

@Hedda
Copy link

Hedda commented Feb 10, 2021

Can the second HA instance join the ZigBee as a router?

Not yet, but IIRC I read a comment somewhere that Adminiuga wrote to dmulcahey which hinted it could be possible in the future.

@Shulyaka

Yes, I know it has not been developed yet. This is an architecture issue, to discuss what's possible.

@Shulyaka

Shulyaka commented Mar 3, 2021

Let me comment on this issue.

  1. Some kind of redundancy is a must-have, especially if you have smart switches to control your lighting. The hardware we are using is mostly not server-grade, so failures do happen. A single major failure of the lighting or other essential systems of your home reduces the WAF (wife acceptance factor) by a factor of two.
  2. There are two main types of redundancy: High Availability (aka Active-Active) and Disaster Recovery (aka a standby replica). It is possible to have both at the same time.
  3. Home Assistant consists of a Python application and a database. Most databases already have their own means to achieve redundancy, so that part is possible right now. The separation between the database and the hass application allows us to mix or combine different types of redundancy for the database and for the application. The one notable exception is SQLite (the default option for hass), which probably does not support redundancy of any kind.
  4. There is one more possibility to reduce the risk of failures: a separate 'supervisor' instance of hass with a separate database that would monitor critical functions like smart switches and lights and stand in if it detects an anomaly. This is not a type of redundancy and is therefore out of scope for this issue.

@dgomes

dgomes commented Mar 3, 2021

  1. You are missing the amount of information stored in .storage that needs to be replicated and synchronised

@Shulyaka

Shulyaka commented Mar 3, 2021

A few points on High Availability (i.e. active-active setup). This is about the application availability, not the database.

  1. The hass application must be cluster-aware, i.e. it needs modification for this to work. Luckily, hass is an event-based application, which simplifies things, and I believe that this awareness will not complicate the codebase or affect maintainability.
  2. There are two main challenges: propagating state changes between nodes (probably via the database) and ensuring that an event only happens on a single node. So, for example, a state change of an entity propagated from a different node should not generate a state change event or trigger any automation, because that event and that automation are already handled by the other node. Also, a time change event on each node must only fire every N'th second if there are N active nodes (i.e. only when time % N == K for the node with index K, if you forgive the C notation). Any other internally generated events must be treated in a similar way.
  3. For externally generated events (such as the frontend, zha or mqtt) we will need some sort of external load balancing. For the frontend/api, any HTTP server that can do load balancing will do. MQTT and zha are trickier, because several nodes could receive the same event but only one should proceed with it. Maybe the same trick as with time will do (i.e. something like hash(unique_id) % N == K; see the sketch after this list). Maybe someone can invent something cleverer that would also correctly handle connection failures to the MQTT server on one of the nodes. ZHA would need to support joining the same Zigbee network as a router.
  4. Most of the integrations will not need modification. Those that support scenes are already cluster-aware in a sense. There might be a couple, however (probably those that require special hardware connected), which will have to stick to a single node. Sticking integrations to a single node might be a good way to phase the implementation; we could allow only some of them to opt in via a whitelist for a start.
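
A rough sketch of the node-sharding idea from points 2 and 3, just to illustrate its shape; NODE_INDEX, NODE_COUNT and the hashing scheme are hypothetical placeholders, not existing Home Assistant code:

```python
# Hypothetical sharding of internally and externally generated events across
# N active nodes, so each event is handled by exactly one node.
import hashlib

NODE_INDEX = 0   # K: this node's position in the cluster (0 .. N-1)
NODE_COUNT = 2   # N: number of active nodes

def owns_time_tick(timestamp: int) -> bool:
    """Internally generated time events fire on only one node per second."""
    return timestamp % NODE_COUNT == NODE_INDEX

def owns_entity(unique_id: str) -> bool:
    """Externally generated events (MQTT, ZHA) are handled by exactly one node,
    chosen by a stable hash of the entity's unique_id."""
    digest = hashlib.sha256(unique_id.encode()).digest()
    return digest[0] % NODE_COUNT == NODE_INDEX

# A node would process an incoming MQTT message for an entity only when
# owns_entity(unique_id) is True; state changes replicated from the other
# node would be applied without re-firing state change events or automations.
```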

I would be willing to start working on the HA support. It looks like there is popular demand, and apart from availability it would also improve scalability and overall performance. But I would need a high-level approval from the core team on the approach. @balloob, what do you think: is HA a good idea, if we manage to do it with the proper care, with documentation and a staged approach, and without breaking or overcomplicating things?

@Shulyaka

Shulyaka commented Mar 3, 2021

1. You are missing the amount of information stored in `.storage` that needs to be replicated and synchronised

Yes, you are right. BTW, why don't we keep all of it in the DB?

@frenck
Member

frenck commented Mar 3, 2021

The database only contains history/stats. We don't keep any other data in the databases.

@allenporter

allenporter commented Mar 3, 2021

Baking replication/consistency/high availability into Home Assistant itself could be pretty complex, so the comments above about leaning towards a fast failover with active/standby using the containerization platform (outsourced entirely, such that only one instance is active at a time) would be much more feasible -- and would meet most users' requirements.

Maybe start by getting alignment on the external storage problem pointed out above. That could already be quite a challenge, and it seems an order of magnitude simpler. Edit: That reads like a feature request rather than an architecture issue with a proposed plan for comment, so maybe this specific issue isn't the right one.

(In the case of my cluster, I'd run external storage via Kubernetes and figure out some kind of leader election, I guess similar to the Docker Swarm proposal above. I would imagine we wouldn't want Home Assistant to have an opinion on how users achieve this, though there could be some common recipes for folks to share.)
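
One low-tech way to get the "only one instance active at a time" behaviour on shared storage is sketched below, with a made-up lock path and start command. Lock semantics over network filesystems vary, so treat this as an illustration of the leader-election idea rather than a production recipe:

```python
# Hypothetical active-passive sketch: whichever node grabs the exclusive lock
# on the shared volume starts Home Assistant; the standby node keeps retrying
# and takes over once the lock is released (e.g. when the active node dies).
import fcntl
import subprocess
import time

LOCK_PATH = "/mnt/shared/hass-config/.leader.lock"

def run_when_leader() -> None:
    lock_file = open(LOCK_PATH, "w")
    while True:
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            break                 # we hold the lock: this node becomes active
        except BlockingIOError:
            time.sleep(5)         # standby: poll until the active node is gone
    # Start Home Assistant against the shared configuration directory.
    subprocess.run(["hass", "--config", "/mnt/shared/hass-config"])

if __name__ == "__main__":
    run_when_leader()
```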

@embcla

embcla commented Mar 3, 2021 via email

@scop
Member

scop commented Mar 3, 2021

The database only contains history/stats. We don't keep any other data in the databases.

Not to criticize, but genuinely interested: that's a statement about the current state of affairs, not an answer to the question asked, i.e. why?

@frenck
Member

frenck commented Mar 3, 2021

So, I'm still always confused about this. You can make a reasonably high-availability system with HA just as it is today.

Replicate the storage (e.g., DRBD, GlusterFS, ZFS replication, whatever floats your boat), add some more into the mix: keepalived, HAProxy, Corosync, Pacemaker, or even live migration of VMs. Add an external DB, like, I dunno, MariaDB, and put it in a Galera cluster... Pick whatever you like in all these cases.

There are so many possibilities to solve all this outside of HA. I'm still completely confused why HA itself has to worry about a complex use case that will end up being used by just a few, especially considering the tons of tools already available that make this all possible.

@Shulyaka

Shulyaka commented Mar 3, 2021

Because a failover takes time: first to detect the failure condition and then to start another node. With an Active-Active setup you already have a running node, so there is no downtime at all. Can you imagine updates with no downtime and 100% availability of your smart lights? For starters, that's just cool!

@onedr0p

onedr0p commented Mar 3, 2021

@frenck those solutions sound like a nightmare of maintenance and cost; this is one example of "just because you can do something, should you?"

If a stateful application's architecture does not support being Highly Available, there will always be issues you will face, no matter how much you hack away at trying to solve it from outside the application code.

I run Home Assistant in Kubernetes, and there is no way I will try to architect a way around it to make it HA. I am completely fine with it being a SPOF. But it goes without saying that if Home Assistant ever did implement this type of architecture, I would be one of the first to jump on it. :)

@frenck
Member

frenck commented Mar 4, 2021

But with the Active-Active setup you will already have a running node, so it will take no downtime at all.

Even if we did the best job in the world, that cannot be achieved, simply because a lot of devices/services will not handle it right. Besides, be honest here: is a minute of downtime for a takeover really a problem? Heck, make it 10 minutes.

@frenck those solutions sound like a nightmare of maintenance and cost

Oh no, everything listed is open source, so in terms of cost that would be time. Doing High-Availability is never easy; any solution will cost maintenance.

We are not dealing with an application with a database that simply spits out pages like a website. We are dealing with an application that relies on a lot of external sources (devices/services). I bet a lot of devices and services will simply not be able to handle High-Availability cases in a way that matches any of the wishes from this thread; there is also an actively running state engine with an automation engine on top (which relies on runtime/memory).

Building Home Assistant into a true High-Availability application would be a nightmare. Especially considering all the tools already available for making a reasonable setup that could do this, the question becomes:

"Is the juice worth the squeeze?"

In my personal opinion: Most definitely not.

@balloob
Member

balloob commented Mar 4, 2021

We're not going to implement any form of high availability. The added complexity is not worth it.

@balloob balloob closed this as completed Mar 4, 2021