Careful servers is a daemon that runs on servers and safely powers them off during cooling-system power outages.
We had two racks of about 30 servers at our university for student use. They were housed in a room cooled by two air conditioners (A/Cs). Because UPS power was scarce, only the servers were powered through the UPS, while the A/Cs ran on raw power. This caused problems during power outages: the A/Cs would go off while the servers could keep running on UPS power for hours. The servers would then heat up, which could potentially lead to a fire.
Ideally, the A/Cs and the servers should share the same power source. Due to infrastructure limitations, however, we did not have that option (nor did we have temperature sensors or configurable thermal shutdown in the servers). Hence, the servers had to be shut down soon after the raw power went off. The raw power usually goes down for short intervals and then comes back. If it comes back within a short time, say 5 minutes, there is no need to shut down the servers and disrupt all running tasks. We therefore had to shut down the servers only if the raw power stayed off for more than 5 minutes.
So we first had to detect when the raw power goes off, then wait for 5 minutes, and, if the raw power is still off, shut down the machines. The steps to realize each of these are described below.
The raw power outages had to be detected by all the servers. We implemented a heartbeat mechanism in which every server periodically pings another machine (called the power node) that is connected directly to raw power, like the A/Cs. In our case, we used a small defunct wireless router as the power node. Its wireless functionality no longer worked, but it was configured with a static IP when connected to the wired LAN.
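A single heartbeat check could look roughly like the sketch below. It is only an illustration, not the code from careful.c: the function name `power_is_up` and the address `192.0.2.10` are placeholders, and the `-W` timeout flag assumes the Linux iputils version of `ping`.

```sh
#!/bin/sh
# Hypothetical address of the power node; use the static IP of your own.
POWER_NODE_IP="${POWER_NODE_IP:-192.0.2.10}"

# Return 0 if the host given as $1 answers a single ping, non-zero otherwise.
# -c 1: send one echo request
# -W 2: wait at most 2 seconds for the reply (Linux iputils ping)
power_is_up() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Example use:
#   if power_is_up "$POWER_NODE_IP"; then echo "raw power is on"; fi
```

Because the power node has no UPS, a missing reply is taken to mean that raw power is off. A network fault between a server and the power node would look the same to this check.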
Simply configuring every server to ping the power node every 5 minutes and shut down on a failed ping would not guarantee that a server waits at least 5 minutes before shutting down, so short-term outages would not be ignored. For example, a server whose ping falls 1 minute into a 2-minute outage would detect it and shut down immediately. To avoid this, we implemented a small long-term failure detector based on a finite state machine, as shown in the figure.
The initial state is NORMAL. Steps to be performed at each state are shown below.
NORMAL: wait for 1 minute and ping power node
PRE-FAIL: wait for 5 minutes and ping power node
SHUTDOWN: shut down the server
Thus, the failure detector pings frequently (every 1 minute) under normal conditions and, on detecting a failure, waits 5 minutes before checking again. If the power node is still not up after those 5 minutes, the server shuts itself down.
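The transitions described above can be sketched as a pair of small shell functions. This is an illustration only: the real daemon is written in C (careful.c), the function names are made up for this example, and the PRE-FAIL to NORMAL transition on a successful ping is inferred from the description rather than taken from the figure.

```sh
#!/bin/sh
# next_state CURRENT PING_RESULT
# Prints the state that follows CURRENT, where PING_RESULT is "ok" if the
# power node answered the ping and "fail" otherwise.
next_state() {
    case "$1:$2" in
        NORMAL:ok)     echo NORMAL ;;    # power is on, keep polling
        NORMAL:fail)   echo PRE-FAIL ;;  # first failure, start the grace period
        PRE-FAIL:ok)   echo NORMAL ;;    # power came back in time (assumed)
        PRE-FAIL:fail) echo SHUTDOWN ;;  # still off after the grace period
        *)             return 1 ;;       # unknown state or result
    esac
}

# wait_before_ping STATE
# Prints the number of seconds to wait before pinging in STATE,
# using the default wait times of 1 and 5 minutes.
wait_before_ping() {
    case "$1" in
        NORMAL)   echo 60 ;;
        PRE-FAIL) echo 300 ;;
        *)        echo 0 ;;
    esac
}
```

With these defaults, a server shuts down only after two consecutive failed pings that are 5 minutes apart, so an outage that ends before the second ping is ignored.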
The state machine has to be started as a daemon on startup. An issue with simply putting it in rc.local is that if the daemon gets killed accidentally (due to insufficient system memory, for example), safety is compromised. So we wrote a wrapper script that checks whether the daemon is up and starts it if not. This script is run as root every 5 minutes through an entry in /etc/crontab.
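A wrapper of this kind might look like the sketch below. It is not the actual checker.sh, which also holds the power node IP and the wait times. The function names are hypothetical, and whether careful takes any arguments is an assumption. `pgrep -x` matches the process name exactly.

```sh
#!/bin/sh
# Path of the daemon binary, as installed in the steps of this README.
DAEMON=/usr/bin/careful

# Return 0 if a process named exactly "careful" is already running.
daemon_is_running() {
    pgrep -x careful >/dev/null 2>&1
}

# Start the daemon in the background only if it is not running.
# Any arguments are passed through to the daemon (hypothetical).
ensure_daemon() {
    if ! daemon_is_running; then
        "$DAEMON" "$@" &
    fi
}

# Example use from cron:
#   ensure_daemon
```

Since cron runs the script every 5 minutes, a killed daemon would be restarted within at most 5 minutes.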
Set the IP of the power node in checker.sh. Note that the wait times are also configurable; by default they are set to 1 and 5 minutes.
Compile careful.c on the server.
cc careful.c -o careful
Move careful and checker.sh to /usr/bin.
sudo mv careful /usr/bin/careful
sudo mv checker.sh /usr/bin/checker.sh
Add checker.sh to /etc/crontab and restart the servers.
sudo nano /etc/crontab
The /etc/crontab should have an entry like:
# executes every 5 minutes
*/5 * * * * root /usr/bin/checker.sh
I guess the next step should be to randomize the waiting times a bit to avoid a synchronized high ping load on the power node. Pull requests welcome!
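One possible way to add such randomization is sketched below. This is not part of the project: `jittered_wait` is a hypothetical helper, and it assumes `/dev/urandom` and `od` are available, as they are on typical Linux systems.

```sh
#!/bin/sh
# jittered_wait BASE MAX_JITTER
# Prints BASE plus a random offset between 0 and MAX_JITTER seconds (inclusive).
jittered_wait() {
    base=$1
    max_jitter=$2
    # Read two random bytes from the kernel as an unsigned integer (0..65535).
    r=$(od -An -N2 -tu2 /dev/urandom | tr -d ' ')
    echo $(( base + r % (max_jitter + 1) ))
}

# Example use: wait between 60 and 70 seconds before the next ping.
#   sleep "$(jittered_wait 60 10)"
```

Reading from `/dev/urandom` rather than seeding from the clock matters here, because servers that boot at the same moment would otherwise tend to pick the same offsets.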