Intelligent Monitor and Restarter of SMF Services for Solaris and OpenSolaris
Ruby Perl
Switch branches/tags
Nothing to show
Clone or download
Pull request Compare This branch is 38 commits ahead, 2 commits behind victori:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
spec
test
LICENSE
README
satan
satan.cfg
satan.smf

README

Automated Process Reaper for Unix Systems

Satan does just one thing, and one thing only; triggers restart of failed SMF services. Satan was designed to work with Solaris’ SMF self-healing properties. Let Satan restart/kill while SMF revive. The Satan name is a play off of the God Monitor http://god.rubyforge.org/

The reason Satan was developed was because God overlaps too much in functionality with SMF so it is not practical to run on Solaris.

Features

  - No dependencies aside from Ruby
  - Email notification on reaped processes
  - Easy to use DSL to define reaping rules
  - HTTP checks to reap based on non-200 response code
  
INSTALLATION

  - Install satan on your run path: /opt/bin;/opt/sfw/bin;/usr/bin
  - Create a satan config file, look at satan.cfg as an example
  - Edit satan.smf to your liking and import: svccfg import satan.smf
    - be sure to set up the PATH in smf properly
    - run satan with credentials that allow it to call svcadm
    - make satan dependent on the application that it monitors, this is crucial!

HOW TO USE

  - /opt/bin/satan ~/satan.cfg
    OR
  - via SMF, see installation block.

The configuration is all done in Ruby, clean and simple.

Satan.watch do |s|
  s.name = "jvm instances"                # name of job
  s.fmri = "applications/myapp"           # solaris FMRI of the SMF service to watch
  s.debug = true                          # if to write out debug information
  s.safe_mode = false                     # If in safe mode, satan will not kill ;-(
  s.interval = 10.seconds                 # interval to run at to collect statistics
  s.contact = "victori@fabulously40.com"  # admin contact, optional if you want email alerts
  s.restart_grace = 30.seconds            # grace period for svcadm restart before kill -9 is used


  s.kill_if do |process|
    process.daemon = "java"               # (optional) name of the process that rules should be applied to
    process.args = "myapp"                # (optional) substring to match in the arguments string
    process.user = "webservd"             # (optional) effective user (owner) of the process

    process.condition(:cpu) do |cpu|      # on cpu condition
      cpu.above = 48.percent              # if above certain percentage
      cpu.times = 5                       # how many times we can hit this condition before killing
    end

    process.condition(:memory) do |memory|  # on memory condition
      memory.above = 850.megabytes          # limit for memory use
      memory.times = 5                      # how many times we can hit this condition before killing
    end

    # ActiveMQ tends to die on us under heavy load so we need the power of satan!
    process.condition(:http) do |http|                        # on http condition
      http.uri    = "http://localhost:8161/admin/queues.jsp"  # the URI
      http.timeout = 5.seconds                                # timeout
      http.times  = 5                                         # how many times before the kill
    end

    # Checks free space in the old generation of a JVM heap
    # requires $JDK_HOME/bin/jstat to be on the PATH
    process.condition(:jvm_free_heap) do |free_heap|
      free_heap.below = 200.megabytes           # minimum free old generation space
      free_heap.times = 5                       # how many times we can hit this condition before killing
    end
  end
end


HOW IT WORKS

  - satan is started by SMF after the monitored app starts (because of the smf dependency)
  - satan will periodically invoke all the configured rules/conditions
  - if a rule failure is detected, a counter for this rule is increased
  - if a rule subsequently succeeds, the counter is decreased (up until 0)
  - if the failure counter reaches the value defined via 'times' property, satan asks SMF to restart the monitored service
  - if SMF fails to restart the service within 'restart_grace' period, satan will use kill -9 to kill all processes of the service (to avoid SMF maintenance state)
  - once the monitored service shuts down, SMF will temporarily shut satan down as well (because of the dependency)
  - monitoring will be resumed once the monitored app is started again