commit 7a30841211d06ac301f891efa72b868f22f0e05b 1 parent 938f6ba
Peter Burkholder (@pburkholder) authored May 08, 2012

Showing 1 changed file with 213 additions and 19 deletions.

232  README.md
@@ -35,26 +35,11 @@ Trend data to predict and diagnose
 
 Key characteristics of monitoring
 
-- Integrate with Cloud Operations: CL
--- Instances should be automatically monitored
--- Cleanly decommissioned instances should go away
-- Integrate with Configuration Management (Puppet or Chef): CM
--- Adding and removing checks should be facilitated by Puppet
-- Provide 'sensible' alerting: AL
--- Partition urgency of alert by environment and/or time
-- Support Developer Integration: DV
--- Can we code our apps to monitor themselves, or provide metrics
-- Trending: TR
--- Gather data, present them and share them
-- Comprehensive Checks: CH
--- Make it easy to monitor _anything_
--- If possible, implement _service dependencies_
-- Available and Secure: AS
--- Also, we need to label our instances sensibly...
 
 Audax only: Replace RightScale: RS
 
-
+Talk
+====
 
 
 Good Evening
@@ -186,8 +171,102 @@ That was about as far as I got before I realized I needed to take my hands off
 the keyboard and step back.
 
 TKTKTK How sensu and I found each other.  
-
-
+Fortunately, there'd been some buzz on Twitter about Sensu, and over the
+course of a weekend I became convinced that I needed to abandon Nagios, or any
+other monolithic monitoring system, and try Sensu.
+
+Sensu started as an internal project at Sonian, an archive-as-a-service
+provider which runs on AWS.  Sean Porter and others had been using Nagios with
+Chef and ran into many of the same convergence issues that I had described,
+as well as scaling issues with Nagios's active check architecture.  So
+Sean and a teammate, TKTK, wrote Sensu for internal use, and open-sourced it
+in November of 2011.  Since then, it's seen a lot of uptake and has a very
+active community.  Lots of help is available on IRC if you bother to stop in.
+
+The backbone is the RabbitMQ message broker.
+
+Then we need at least one sensu-server, and typically on that box we'll run
+Redis to provide persistence.  Sensu-server is written in Ruby, and can be
+installed as a Gem, as an RPM, and soon as a .deb.
+
+All of the configuration is in JSON.  Here's a minimal configuration for a
+sensu-server:
+
+{
+  "rabbitmq": {
+    "host": "<%= rabbitmq_host %>",
+    "port": <%= rabbitmq_port %>
+  },
+  "redis": {
+    "host": "<%= redis_host %>",
+    "port": <%= redis_port %>
+  },
+  "api": {
+    "host": "<%= api_host %>",
+    "port": <%= api_port %>
+  }
+}
+
+What's missing from this is the definition of what checks to run.  Rather than
+define the checks in a single massive JSON file, we can drop JSON snippets
+into /etc/sensu/conf.d, like this:
+
+{
+  "checks": {
+    "system_disks": {
+      "handlers": ["irc", "mailer", "default" ],
+      "notification": "System disk space is being exhausted",
+      "command": "/etc/sensu/plugins/community/check-disk.rb -w 80 -c 90 -x tmpfs",
+      "subscribers": [ "generic" ],
+      "occurrences": 2,
+      "interval": 300
+    }
+  }
+}
+
+OR
+
+{
+  "checks": {
+    "careverge_api": {
+      "handlers": ["irc", "default", "mailer" ],
+      "notification": "Careverge API is not responding appropriately",
+      "command": "/etc/sensu/plugins/local/check_cvapi.sh -S",
+      "subscribers": [ "cvapi" ],
+      "interval": 30,
+      "refresh": 600
+    }
+  }
+}
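To make the conf.d idea concrete, here is a minimal Ruby sketch of how separate JSON snippets could be merged into one configuration hash. It is not Sensu's actual loader: `deep_merge`, `load_conf_d`, the file names, and the trimmed-down check bodies are all made up for illustration.

```ruby
require 'json'
require 'tmpdir'

# Recursively merge two hashes so that nested keys such as "checks"
# accumulate entries instead of overwriting each other.
def deep_merge(base, extra)
  base.merge(extra) do |_key, old_val, new_val|
    old_val.is_a?(Hash) && new_val.is_a?(Hash) ? deep_merge(old_val, new_val) : new_val
  end
end

# Load every *.json file in a directory (sorted by name) into one hash.
# Hypothetical helper, not part of Sensu.
def load_conf_d(dir)
  Dir.glob(File.join(dir, '*.json')).sort.reduce({}) do |config, path|
    deep_merge(config, JSON.parse(File.read(path)))
  end
end

# Demo with two hypothetical snippets written to a temporary directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'system_disks.json'),
             JSON.generate('checks' => { 'system_disks' => { 'interval' => 300 } }))
  File.write(File.join(dir, 'careverge_api.json'),
             JSON.generate('checks' => { 'careverge_api' => { 'interval' => 30 } }))
  puts load_conf_d(dir)['checks'].keys.sort.inspect
end
```

The point of the sketch is only that each snippet contributes one entry under "checks", so adding or removing a check is the same as adding or removing a file.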
  241
+
  242
+On the client side, we need only install sensu-client, and configure it.  The
  243
+configuration is pretty minimal: Specify the rabbitmq information and details
  244
+on this client:
  245
+
  246
+{
  247
+  "rabbitmq": {
  248
+    "host": "<%= rabbitmq_host %>",
  249
+    "port": <%= rabbitmq_port %>
  250
+  },
  251
+  "client": {
  252
+    "name": "<%= sensu_hostname %>",
  253
+    "address": "<%= ipaddress %>",
  254
+    "subscriptions": [ "generic", "cvapi" ]
  255
+  }
  256
+}
  257
+
  258
+All of the check details can go in conf.d/ again, and from a configuration
  259
+mgmt standpoint we have the added bonus that we can use the _exact same_
  260
+conf.d/ as we had on the server.  Now we can leverage the beauty of the
  261
+message queue 
  262
+
  263
+* Sensu-server publishes a 'check-disk' request to the 'generic' channel every
  264
+  300s.
  265
+* Sensu-client how subscribe to 'generic' run the check and publish the
  266
+  results
  267
+* Sensu-server processes the results, and passes any failures to the handlers
  268
+* Likewise for the 30s interval checks of the API, but then it's only for the
  269
+  nodes subscribed to 'cvapi'.
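The publish/subscribe flow in the bullets above can be mimicked in a few lines of Ruby. This is a toy, in-memory sketch under stated assumptions: there is no RabbitMQ involved, `ToyBroker` and `ToyClient` are invented names, and checks are lambdas returning an exit status (0 for OK, 2 for critical, following the Nagios plugin convention) rather than real commands.

```ruby
# A tiny in-memory stand-in for the message broker: it only maps a
# subscription name to the clients listening on it.
class ToyBroker
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  def subscribe(subscription, client)
    @subscribers[subscription] << client
  end

  # Deliver a check request to every client on the subscription and
  # collect the results they publish back.
  def publish(subscription, request)
    @subscribers[subscription].map { |client| client.run(request) }
  end
end

# A client "runs" a check by calling a lambda instead of a real command.
class ToyClient
  def initialize(name, checks)
    @name = name
    @checks = checks
  end

  def run(request)
    status = @checks.fetch(request[:check]).call
    { client: @name, check: request[:check], status: status }
  end
end

broker = ToyBroker.new
ok   = { 'system_disks' => -> { 0 } }   # exit status 0: OK
full = { 'system_disks' => -> { 2 } }   # exit status 2: critical
broker.subscribe('generic', ToyClient.new('web1', ok))
broker.subscribe('generic', ToyClient.new('db1', full))

# Server side: publish one request, then hand failures to a handler.
results  = broker.publish('generic', { check: 'system_disks' })
failures = results.reject { |result| result[:status].zero? }
failures.each { |f| puts "ALERT #{f[:client]} #{f[:check]} status=#{f[:status]}" }
```

Note that the server never needs a list of hosts: it only knows the subscription name, and whichever clients happen to be listening respond.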
 
 
 Sensu works so well that I had to make sure that the check scripts were
@@ -195,3 +274,118 @@ installed before the sensu service. Not because there's any logical
 dependency, but simply because otherwise the client would come up and
 start acting on published requests for, say, 'check\_disks' and fail because
 the 'check\_disk.rb' script wasn't there yet.
+
+Handlers
+--------
+
+In order for anything useful to happen with a failed check result, we need
+handlers to, say, notify us or even take action.  Let's take a look at IRC,
+for example:
+
+  irc.json
+
+  irc.rb
+
+That's about it.
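Since the two files are only named here, the following is a rough Ruby sketch of the core of what a notification handler has to do: take an event and turn it into a one-line message. It is not the real irc.rb. The function name `format_event` and the event field names ("client", "check", "name", "status", "output") are assumptions for illustration, and the step that actually posts to IRC is left out.

```ruby
require 'json'

# Turn an event (as a JSON string) into a one-line notification.
# The field names used here are assumptions for illustration.
def format_event(event_json)
  event  = JSON.parse(event_json)
  client = event.fetch('client').fetch('name')
  check  = event.fetch('check')
  "#{client}/#{check.fetch('name')} status=#{check.fetch('status')}: " \
    "#{check.fetch('output').strip}"
end

# A hand-written sample event standing in for a real failed check result.
sample = JSON.generate(
  'client' => { 'name' => 'web1' },
  'check'  => { 'name' => 'system_disks', 'status' => 2,
                'output' => "CheckDisk CRITICAL\n" }
)
puts format_event(sample)
```

A real handler would get its event from the server rather than from a hard-coded sample; assuming the event arrives as JSON on standard input, `format_event($stdin.read)` would be the equivalent call.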
+
+
+API
+---
+
+The API is a Thin/Sinatra service running on port 4567.  It can:
+
+* Read and update key/values in Redis
+* Publish check requests on RabbitMQ
+
+For example:
+
+
+KeepAlives
+----------
+
+Remember how we had issues with handling terminated nodes in Nagios?  In
+Sensu, the clients send keep-alives every 30s, so if a sensu-client
+service, or the node hosting it, dies unexpectedly, we will know about it.
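The keep-alive idea boils down to comparing each client's last-seen timestamp against a threshold. Here is a minimal Ruby sketch of that comparison. The function name `stale_clients`, the 120-second threshold, and the sample data are assumptions chosen for illustration, not Sensu's actual values or code.

```ruby
# Return the names of clients whose last keep-alive is older than the
# threshold. `last_seen` maps client name => Time of last keep-alive.
def stale_clients(last_seen, now:, threshold: 120)
  last_seen.select { |_name, seen_at| now - seen_at > threshold }.keys.sort
end

now = Time.at(1_000_000)
last_seen = {
  'web1' => now - 20,    # healthy: within one 30s keep-alive period
  'db1'  => now - 300    # silent for five minutes
}
puts stale_clients(last_seen, now: now).inspect
```

Because the clients push their keep-alives, a node that vanishes without warning shows up as stale with no per-host configuration on the server.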
+
+Upon an orderly system shutdown we can have the client de-register itself
+through the Sensu API.  Since we're currently on RightScale I've added this
+little script to the Termination sequence on RightScale:
+
+
+  #!/usr/bin/ruby
+  require 'json'
+  require 'net/http'
+  require 'uri'
+
+  # Look up this client's name and the API host in the local Sensu config.
+  config      = JSON.parse(File.read('/etc/sensu/client.json'))
+  client_name = config['client']['name']
+  api_host    = config['api']['host']
+
+  # Ask the Sensu API to delete (de-register) this client.
+  uri = URI.parse("http://#{api_host}/client/#{client_name}")
+  Net::HTTP.new(uri.host, uri.port).request(Net::HTTP::Delete.new(uri.path))
+
+Dashboard
+=========
+
+One place where Sensu really shows its youth is in the interactive WebUI.
+
+(Three screenshots)
+
+It is not yet PHB-compliant.
+
+CheckPoint
+==========
+
+* RabbitMQ
+* Redis
+* sensu-server:
+** Publishes check requests
+** Pushes results to Handler
+* sensu-client:
+** Listens for check-requests on its subscriptions
+** Runs check commands and publishes to MQ
+* sensu-api
+* sensu-dashboard
+
+More Features
+=============
+
+But wait, there's more:
+
+* Application integration
+* Sensu and Graphite
+* Standalone Checks
+* Puppet integration
+* Scheduling downtime
+* Parameter passing
+* Ideal monitoring system
+
+Missing Features
+================
+
+* Reporting dashboard
+* Service dependencies
+
+
+Monitoring Requirements
+-----------------------
+
+
+- Integrate with Cloud Operations: CL
+-- Instances should be automatically monitored
+-- Cleanly decommissioned instances should go away
+- Integrate with Configuration Management (Puppet or Chef): CM
+-- Adding and removing checks should be facilitated by Puppet
+- Provide 'sensible' alerting: AL
+-- Partition urgency of alert by environment and/or time
+- Support Developer Integration: DV
+-- Can we code our apps to monitor themselves, or provide metrics
+- Trending: TR
+-- Gather data, present them and share them
+- Comprehensive Checks: CH
+-- Make it easy to monitor _anything_
+-- If possible, implement _service dependencies_
+- Available and Secure: AS
+-- Also, we need to label our instances sensibly...
