
- fixed

1 parent f03d0db commit c780c5793202cd4476d76f07a521eb2c03ab6598 @jordansissel committed Dec 24, 2011
101 2008/01/day1-strace-and-tcpdump.html
@@ -0,0 +1,101 @@
+One of the staple quotes from the British sitcom <a
+href="http://www.imdb.com/title/tt0487831/">The IT Crowd</a> is <a
+href="http://www.google.com/search?hl=en&q=%22have+you+tried+turning+it+off+and+on+again%22">"Have
+you tried turning it off and on again?"</a> as a first response when one of the
+IT staff answers a call. My
+officemate (a fellow sysadmin) has his own generic first response when someone
+wanders in with a question: "Have you run tcpdump or strace?"
+
+<p>
+
+It's a good question partly because almost nobody answers "yes" and partly
+because these two tools are very useful in helping you debug.
+
+<p>
+
+When other tools fail to help you debug a system or network
+problem, strace or tcpdump might just be your salvation. Strace helps you trace
+system calls, while tcpdump helps you trace network activity. For the BSD and
+Solaris users, truss is a similar tool for tracing system calls. On
+Solaris, you also get snoop, which is similar to tcpdump.
+
+<p>
+
+These tools generally let you tune the output: high-precision absolute or
+relative timestamps, more or less verbosity, some
+filtering, etc. Timestamps are important if you have a <a href="http://www.ibiblio.org/harris/500milemail.html">mysterious time-related problem</a>.
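+
+<p>
+
+For example, here are a few of the timestamp-related flags (a quick sketch;
+check your local manpages, since options vary a bit between versions):
+
+<pre>
+# strace: -tt prefixes each line with wall-clock time (with microseconds),
+# -r shows the time between successive syscalls, -T shows time spent in each syscall.
+% strace -tt -T -p &lt;pid&gt;
+
+# tcpdump: -tttt prints full date and time, -ttt prints the delta between packets.
+% tcpdump -tttt 'port 80'
+</pre>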
+
+<p>
+
+Strace lets you trace a new process (strace &lt;command ...&gt;) or running
+processes (strace -p &lt;pid&gt;). Is apache acting strange? Use strace to
+attach to all of the httpd processes:
+
+<pre>
+% strace $(pgrep httpd | sed -e 's/^/-p /')
+Process 12571 attached - interrupt to quit
+Process 12573 attached - interrupt to quit
+Process 12574 attached - interrupt to quit
+Process 12575 attached - interrupt to quit
+[pid 12574] accept(4, &lt;unfinished ...&gt;
+[pid 12573] accept(4, &lt;unfinished ...&gt;
+[pid 12571] select(0, NULL, NULL, NULL, {0, 216000} &lt;unfinished ...&gt;
+[pid 12575] accept(4, &lt;unfinished ...&gt;
+[pid 12571] wait4(-1, 0x7fff8f7a2ba4, WNOHANG|WSTOPPED, NULL) = 0
+[pid 12571] select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
+(output continues, but I cut it for brevity)
+</pre>
+
+Now you have a good idea what each process is doing with respect to system
+calls: On this idle apache server, one process appears to be in a sleep loop
+waiting for children to die while the rest are waiting for accept() to return
+on the listening http socket.
+
+<p>
+
+Access a page on this webserver from your workstation and check strace's output - maybe you'll learn more about what your webserver does when it serves up a page?
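+
+<p>
+
+If the full trace is too noisy, you can narrow strace to just the system calls
+you care about with -e (a small sketch; the pid is a placeholder):
+
+<pre>
+# Only show file- and network-related calls from one httpd process.
+% strace -e trace=open,read,write,connect,accept -p &lt;pid&gt;
+</pre>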
+
+<p>
+
+To see the network traffic alone, use tcpdump. tcpdump shows you packet traces and can limit the trace to only packets matching a filter expression. To watch for http traffic, we would use this invocation:
+
+<pre>
+% tcpdump 'port 80'
+tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
+listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
+00:57:08.167785 IP 192.168.30.89.33471 > 192.168.30.19.http: S 3860627520:3860627520(0) win 5840 <mss 1460,sackOK,timestamp 1785372135 0,nop,wscale 7>
+00:57:08.167994 IP 192.168.30.19.http > 192.168.30.89.33471: S 1074530775:1074530775(0) ack 3860627521 win 5792 <mss 1460,sackOK,timestamp 2237995585 1785372135,nop,wscale 7>
+00:57:08.167905 IP 192.168.30.89.33471 > 192.168.30.19.http: . ack 1 win 46 <nop,nop,timestamp 1785372135 2237995585>
+00:57:08.169271 IP 192.168.30.89.33471 > 192.168.30.19.http: P 1:94(93) ack 1 win 46 <nop,nop,timestamp 1785372135 2237995585>
+(output continues, but I cut it for brevity)
+</pre>
+
+The above output might not be totally readable, but you should at least
+understand some of it: source and destination addresses and ports, timestamps,
+etc. Lastly, the filter language used for selecting only certain packets is
+documented well in the tcpdump manpage.
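+
+<p>
+
+A few more filter and capture examples (a sketch; the address and interface are
+placeholders for your own environment):
+
+<pre>
+# Only traffic to or from one host, and only http or https.
+% tcpdump 'host 192.168.30.89 and (port 80 or port 443)'
+
+# Capture on a specific interface and write packets to a file for later
+# inspection with tcpdump -r or wireshark.
+% tcpdump -i eth0 -w /tmp/http.pcap 'port 80'
+</pre>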
+
+<p>
+
+Keeping tcpdump, strace, and similar inspection tools close to your debugging practices should help you better debug and profile problems, and it just might save you the trip down the hall.
+
+<p>
+
+Further reading:
+<dl>
+ <dt> <a href="http://www.tcpdump.org/tcpdump_man.html">tcpdump manpage</a> </dt>
+ <dt> <a href="http://www.google.com/search?q=man strace">strace manpage</a> </dt>
+ <dt> <a href="http://opensolaris.org/os/community/dtrace/">DTrace</a>
+ (Solaris, FreeBSD, OS X) and <a
+ href="http://sourceware.org/systemtap/">SystemTap</a> (Linux) </dt>
+ <dd> These tools are much more advanced than strace or truss. They allow you
+ to scriptably inspect and instrument your system and processes in a wonderful
+ range of ways beyond just system calls. </dd>
+ <dt> <a href="http://www.wireshark.org/">Wireshark</a> (previously called Ethereal) </dt>
+ <dd> Wireshark (and tshark, the terminal version) provides much deeper
+ protocol inspection than tcpdump or snoop does. You'll find its benefits
+ beyond tcpdump include more advanced (and easier) filtering, stream tracking,
+ richer protocol decoding, and more.</dd>
+</dl>
+
+
142 2008/03/day3-babysitting.html
@@ -0,0 +1,142 @@
+Software just isn't as reliable as we want it to be. Sometimes a simple reboot
+(or task restart) will make a problem go away, and this kind of "fix" is so
+commonly tried that it made its way into the TV show mentioned in <a
+href="http://sysadvent.blogspot.com/2008/12/sysadmin-advent-day-1.html">day
+1</a>.
+
+<p>
+
+A blind fix that restores health to a down or busted service
+can be valuable. If there is a known set of conditions that indicates
+poor health of a service or device, and a restart can fix it, why not try it
+automatically? The restart probably doesn't fix the real problem, but automated health-repairs can still help you debug the root cause.
+
+<p>
+
+Restarting a service when it dies unexpectedly seems like a no-brainer, which is why mysql comes with "mysqld_safe" for babysitting mysqld. This script is basically:
+
+<pre>
+#!/bin/sh
+# roughly what mysqld_safe does:
+while true ; do
+  mysqld "$@"
+  # if mysqld exited normally (rather than crashing), stop babysitting
+  if [ $? -eq 0 ] ; then
+    exit
+  fi
+done
+</pre>
+
+<p>
+
+A process (or device) that watches and restarts another process goes by
+a few names: watchdog, babysitter, etc. There are a handful of free software projects that provide babysitting, including <a
+href="http://cr.yp.to/daemontools.html">daemontools</a>, <a
+href="http://mon.wiki.kernel.org/">mon</a>, and <a
+href="http://mmonit.com/monit/">Monit</a>. Monit was the first tool I looked at today, so let's talk Monit.
+
+<p>
+
+Focusing only on the process health check features, Monit seems pretty decent.
+You can have it monitor things other than processes, and even send you email alerts,
+but that's not the focus today. Each process in Monit can have multiple health checks
+that, upon failure, result in a service restart or other action. Here's an example
+config with a health check ensuring mysql connections are working and restarting it on failure:
+
+<pre>
+# Check every 5 seconds.
+set daemon 5
+
+# monit requires each process have a pidfile and does not create pidfiles for you.
+# this means the start script (or mysql itself, here) must maintain the pid file.
+check process mysqld with pidfile /var/run/mysqld/mysqld.pid
+ start "/etc/init.d/mysqld start"
+ stop "/etc/init.d/mysqld stop"
+ if failed port 3306 protocol mysql then restart
+</pre>
+
+This will cause mysqld to be restarted whenever the check fails, such as when mysql's max connections is reached.
+
+<p>
+
+While I consider an automatic quick-fix to be good, this alone isn't good
+enough. Automatic restarts can hinder your ability to debug, because the restart flushes away the cause of the problem (at least temporarily). A mysql check failed, but what caused it?
+
+<p>
+
+To start with, maybe we want to record who was doing what when mysql was having
+problems. Depending on the state of your database, some of this data may not be
+available (if mysql is frozen, you probably can't run 'show full processlist').
+Here's a short script to do that (we'll call it "get-mysql-debug-data.sh"):
+
+<pre>
+#!/bin/sh
+
+time="$(date +%Y%m%d.%H%M%S)"
+[ ! -d /var/log/debug ] && mkdir -p /var/log/debug
+exec &gt; "/var/log/debug/mysql.failure.$time"
+
+echo "=&gt; Status"
+mysqladmin status
+echo
+echo "=&gt; Active SQL queries"
+mysql -umonitor -e 'show full processlist\G'
+echo
+echo "=&gt; Hosts connected to mysql"
+lsof -i :3306
+</pre>
+
+We'll also need to tell Monit to run this script whenever mysql's check fails.
+
+<pre>
+check process mysqld with pidfile /var/run/mysqld/mysqld.pid
+ if failed port 3306 protocol mysql then
+ exec "get-mysql-debug-data.sh"
+</pre>
+
+However, now mysql doesn't get restarted when a health check fails; we only
+record data. I tried a few permutations to get both the data recorded and mysql restarted, and came up with this as working:
+
+<pre>
+check process mysqld with pidfile /var/run/mysqld/mysqld.pid
+ start "/etc/init.d/mysqld start"
+ stop "/bin/sh -c '/bin/get-mysql-debug-data.sh ; /etc/init.d/mysqld stop'"
+ if failed port 3306 protocol mysql then restart
+</pre>
+
+Now any time mysql is restarted by monit, we'll run the debug data script and
+then stop mysqld. A better solution is probably to combine the data collection
+and the stop invocation into one separate script and point the 'stop' setting
+at it, as sketched below.
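+
+<p>
+
+A minimal sketch of such a combined stop script (the script name and paths are
+hypothetical; adjust them to your layout):
+
+<pre>
+#!/bin/sh
+# /bin/mysqld-stop-with-debug.sh (hypothetical): grab debug data, then stop mysqld.
+/bin/get-mysql-debug-data.sh
+exec /etc/init.d/mysqld stop
+</pre>
+
+With that in place, the monit entry becomes simply: stop "/bin/mysqld-stop-with-debug.sh"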
+
+<p>
+
+If I run monit in the foreground (monit -I), I'll see this when mysql's check fails:
+
+<pre>
+MYSQL: login failed
+'mysqld' failed protocol test [MYSQL] at INET[localhost:3306] via TCP
+'mysqld' trying to restart
+'mysqld' stop: /bin/sh
+Stopping MySQL: [ OK ]
+'mysqld' start: /etc/init.d/mysqld
+Starting MySQL: [ OK ]
+'mysqld' connection succeeded to INET[localhost:3306] via TCP
+</pre>
+
+And in our debug log directory, a new file has been created with our debug
+output.
+
+<p>
+
+This kind of automation isn't a perfect solution, but it can be quite useful.
+How many times has a coworker accidentally caused a development service to
+crash and you've needed to go restart it? Applying the ideas presented above
+will keep you from sshing all over the place to restart broken services, and
+it will automatically record crash and bad-health information for you.
+
+<p>
+
+Further reading:
+
+<dl>
+ <dt> <a href="http://cr.yp.to/daemontools.html">daemontools</a> </dt>
+ <dt> <a href="http://mmonit.com/monit/">Monit</a> </dt>
+ <dt> <a href="http://mon.wiki.kernel.org">mon</a> </dt>
+ <dt> <a href="http://www.linuxdevcenter.com/pub/a/linux/2002/05/09/sysadminguide.html">Another discussion of daemon monitoring tools</a> </dt>
+ <dd> This article is old, but still makes good points about why you want your services to automatically restart when they die. </dd>
+</dl>
106 2008/04/day4-extending-snmpd.html
@@ -0,0 +1,106 @@
+Do you monitor your hosts with snmp? Ever wanted to add additional data sources
+to your snmp agent? Net-SNMP's snmpd lets you do this.
+
+<p>
+
+There are a few different options available to extend snmpd. The first is the
+most primitive, simply running a program and reporting the first line of output
+and the exit status. This is done with the 'exec' statement in snmpd.conf.
+
+<pre>
+# Format is
+# exec &lt;name&gt; &lt;command&gt; [args]
+exec googleping /bin/ping -c 1 -w 1 -q www.google.com
+</pre>
+
+You need to specify the full path for 'exec' commands. If you want to run your command in /bin/sh, swap in 'sh' for 'exec' and you get to avoid the full path requirement. The 'exec' and 'sh' extensions show their results through the UCD-SNMP-MIB::extTable table:
+
+<pre>
+% snmpwalk -v2c -c secret localhost UCD-SNMP-MIB::extTable
+UCD-SNMP-MIB::extIndex.1 = INTEGER: 1
+UCD-SNMP-MIB::extNames.1 = STRING: googleping
+UCD-SNMP-MIB::extCommand.1 = STRING: /bin/ping
+UCD-SNMP-MIB::extResult.1 = INTEGER: 0
+UCD-SNMP-MIB::extOutput.1 = STRING: PING www.l.google.com (74.125.19.147) 56(84) bytes of data.
+UCD-SNMP-MIB::extErrFix.1 = INTEGER: noError(0)
+UCD-SNMP-MIB::extErrFixCmd.1 = STRING:
+</pre>
+
+You can see that the first line of output is available in extOutput. This is
+nice, but the order of commands depends entirely on the order in snmpd.conf, so
+if you put another 'exec' above the googleping one, the googleping check
+becomes .2 instead of .1, which is not so stable with respect to adding new
+exec statements or moving them around. Boo.
+
+<p>
+
+The second option available is called 'extend,' and it works similarly to
+'exec,' but better. The 'extend' configuration accepts multiline output from
+your command and is indexed on the name (i.e. "googleping") instead of an index
+number (i.e. 1, 2, etc). Just change 'exec' to 'extend':
+
+<pre>
+extend googleping /bin/ping -c 1 -w 1 -qn www.google.com
+extend mysqlstatus /usr/bin/mysqladmin status
+</pre>
+
+These 'extend' commands show up in NET-SNMP-AGENT-MIB::nsExtensions. If you
+only want the output, you can walk NET-SNMP-EXTEND-MIB::nsExtendOutput1Table
+(or nsExtendOutput2Table). If you want only the exit code, you can walk
+nsExtendResult. If you want to view the output of walking nsExtensions (it's too long to post here), <a
+href="http://docs.google.com/View?docid=dckv5f97_0zx46q8fp">click here</a>.
+
+<p>
+
+Remember the benefit of 'extend' over 'exec' was that the indexing was on the
+name, so let's query for only the googleping result:
+
+<pre>
+% snmpget -v2c -c secret localhost 'NET-SNMP-EXTEND-MIB::nsExtendResult."googleping"'
+NET-SNMP-EXTEND-MIB::nsExtendResult."googleping" = INTEGER: 0
+
+# If I null route all www.google.com IPs, and requery:
+% snmpget -v2c -c secret localhost 'NET-SNMP-EXTEND-MIB::nsExtendResult."googleping"'
+NET-SNMP-EXTEND-MIB::nsExtendResult."googleping" = INTEGER: 2
+</pre>
+
+Take note above that the OID is in single quotes and "googleping" still needs
+to be sent quoted to snmpget; this is so snmpget understands that the index is
+really an octet-string. (See what "googleping" becomes with snmpget -On.)
+
+<p>
+
+The output and exit code of your 'extend' and 'exec' statements are cached for a short period of time. The exact cache duration is determined by the nsExtendCacheTime OID. If you have write access configured in snmp, you can issue a SET command to change the cache time.
+
+<pre>
+# Cache the googleping results for 15 seconds
+% snmpset -v2c -c secret localhost 'NET-SNMP-EXTEND-MIB::nsExtendCacheTime."googleping"' i 15
+NET-SNMP-EXTEND-MIB::nsExtendCacheTime."googleping" = INTEGER: 15
+
+% snmpwalk -v2c -c secret localhost NET-SNMP-EXTEND-MIB::nsExtendCacheTime
+NET-SNMP-EXTEND-MIB::nsExtendCacheTime."googleping" = INTEGER: 15
+NET-SNMP-EXTEND-MIB::nsExtendCacheTime."mysqlstatus" = INTEGER: 5
+</pre>
+
+Lastly, you can tell snmpd to 'pass' (that's the name of the config statement)
+handling of an entire OID subtree to an external program, which seems like a
+nice feature. This lets you write a subtree handler in your language of choice
+rather than being required (though it remains an option) to write your more complex
+handlers using snmpd's perl or C module support. For brevity, I'll skip
+coverage of that, but it works much like 'extend' and 'exec,' with its own
+(simple) text protocol for telling your subprocess what OID it wants data on
+(see further reading).
+
+<p>
+
+Extending SNMP with your own data sources is a good way to let your existing
+monitoring tools (nagios, etc) monitor remotely without needing local access
+such as ssh or NRPE.
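+
+<p>
+
+For example, here's a rough sketch of a nagios check against the googleping
+result using the check_snmp plugin (the host, community, and option spelling
+are placeholders; verify against your plugin version):
+
+<pre>
+# googleping exits 0 on success, so treat any nonzero value as critical.
+% check_snmp -H mywebserver -P 2c -C secret \
+    -o 'NET-SNMP-EXTEND-MIB::nsExtendResult."googleping"' -c 0
+</pre>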
+
+<p>
+
+Further reading:
+<dl>
+ <dt> <a href="http://www.net-snmp.org/docs/man/snmpd.conf.html#lbAW">Net-SNMP snmpd extension configuration and documentation </a> </dt>
+ <dd> See the "EXTENDING AGENT FUNCTIONALITY" section </dd>
+</dl>
151 2008/05/day5-capistrano.html
@@ -0,0 +1,151 @@
+Do you store and deploy configuration files from a revision control system? You
+should. If you don't, yet, this article aims to show you how to make that
+happen with very little effort using Capistrano.
+
+<p>
+
+Capistrano is a ruby-powered tool that acts like make (or rake, or any build
+tool), but it is designed for deploying data and running commands on remote
+machines. You can write tasks (like make targets) and even nest them in
+namespaces. Hosts can be grouped together into roles, and you can have a task
+affect any number of hosts and/or roles. Capistrano, like Rake, uses only Ruby
+for configuration. Capistrano files are named 'Capfile'.
+
+<p>
+
+Much of the documentation and buzz about Capistrano deals with deployment of
+Ruby on Rails, but it's not at all limited to Rails.
+
+<p>
+
+For a simple example, let's ask a few servers what kernel version they
+are running:
+
+<pre>
+# in 'Capfile'
+role :linux, "jls", "mywebserver"
+
+namespace :query do
+ task :kernelversion, :roles =&gt; "linux" do
+ run "uname -r"
+ end
+end
+</pre>
+
+Output:
+
+<pre>
+% cap query:kernelversion
+ * executing `query:kernelversion'
+ * executing "uname -r"
+ servers: ["jls", "mywebserver"]
+ [jls] executing command
+ ** [out :: jls] 2.6.25.11-97.fc9.x86_64
+ [mywebserver] executing command
+ ** [out :: mywebserver] 2.6.18-53.el5
+ command finished
+</pre>
+
+Back to the original problem: we want to download configuration
+files for any service on any host we care about and store them in revision control.
+For now, let's just grab apache configs from one server.
+
+<p>
+
+Learning how to do this in Capistrano proved to be a great exercise in learning
+a boatload of Capistrano's features. The Capfile is short, but too long to paste here, so <a href="http://docs.google.com/View?docid=dckv5f97_1hkh9dzfh">click here to view</a>; a rough sketch of it appears below.
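+
+<p>
+
+For a rough idea of its shape, here is a minimal sketch (this is not the actual
+Capfile from the link; the paths, the svn working copy layout, and the
+:recursive sftp download are assumptions, so treat it only as a starting
+point):
+
+<pre>
+# Capfile (sketch)
+role :web, "mywebserver"
+
+namespace :pull do
+  task :apache, :roles =&gt; :web do
+    workdir = "/home/configs/work/$CAPISTRANO:HOST$"
+    # fetch the apache config directories from each host over sftp
+    download "/etc/httpd/conf", workdir, :via =&gt; :sftp, :recursive =&gt; true
+    download "/etc/httpd/conf.d", workdir, :via =&gt; :sftp, :recursive =&gt; true
+  end
+
+  # after downloading, add and commit everything in the local svn working copy
+  task :sync do
+    system("svn add --force /home/configs/work")
+    system("svn commit -m 'config pull' /home/configs/work")
+  end
+end
+
+after "pull:apache", "pull:sync"
+</pre>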
+
+<p>
+
+If I run "cap pull:apache", Capistrano dutifully downloads my apache configs from 'mywebserver' and pushes them into a local svn repository. Here's what it looks like (I removed some output):
+
+<pre>
+% cap pull:apache
+ triggering start callbacks for `pull:apache'
+ * executing `ensure:workdir'
+At revision 8.
+ * executing `pull:apache'
+ triggering after callbacks for `pull:apache'
+ * executing `pull:sync'
+ * executing "echo -n $CAPISTRANO:HOST$"
+ servers: ["mywebserver"]
+ [mywebserver] executing command
+ servers: ["mywebserver"]
+ ** sftp download /etc/httpd/conf -&gt; /home/configs/work/mywebserver
+ [mywebserver] /etc/httpd/conf/httpd.conf
+ [mywebserver] /etc/httpd/conf/magic
+ [mywebserver] done
+ * sftp download complete
+ servers: ["mywebserver"]
+ ** sftp download /etc/httpd/conf.d -&gt; /home/configs/work/mywebserver
+ [mywebserver] /etc/httpd/conf.d/README
+ [mywebserver] /etc/httpd/conf.d/welcome.conf
+ [mywebserver] /etc/httpd/conf.d/proxy_ajp.conf
+ [mywebserver] done
+ * sftp download complete
+A /home/configs/work/mywebserver/README
+A /home/configs/work/mywebserver/httpd.conf
+A /home/configs/work/mywebserver/magic
+A /home/configs/work/mywebserver/welcome.conf
+A /home/configs/work/mywebserver/proxy_ajp.conf
+ command finished
+Adding configs/work/mywebserver/README
+Adding configs/work/mywebserver/httpd.conf
+Adding configs/work/mywebserver/magic
+Adding configs/work/mywebserver/proxy_ajp.conf
+Adding configs/work/mywebserver/welcome.conf
+Transmitting file data .....
+Committed revision 9.
+</pre>
+
+If I then modify 'httpd.conf' on the webserver, and rerun 'cap pull:apache':
+
+<pre>
+&lt;output edited for content&gt;
+% cap pull:apache
+ ** sftp download /etc/httpd/conf -&gt; /home/configs/work/mywebserver
+ [mywebserver] /etc/httpd/conf/httpd.conf
+ [mywebserver] /etc/httpd/conf/magic
+ [mywebserver] done
+ * sftp download complete
+Sending configs/work/mywebserver/httpd.conf
+Transmitting file data .
+Committed revision 10.
+</pre>
+
+Now, if I want to see the diff between the latest two revisions, to see what we
+changed on the server:
+
+<pre>
+% svn diff -r9:10 file:///home/configs/svn/mywebserver/httpd.conf
+Index: httpd.conf
+===================================================================
+--- httpd.conf (revision 9)
++++ httpd.conf (revision 10)
+@@ -1,3 +1,4 @@
++# Hurray for revision control!
+ #
+ # This is the main Apache server configuration file. It contains the
+ # configuration directives that give the server its instructions.
+</pre>
+
+This kind of solution is not necessarily ideal, but it's a good and simple way to
+get history tracking on your config files right now, until you have the time,
+energy, and need to improve the way you do config management.
+
+<p>
+
+Capistrano might just help you with deployment and other common remote-access
+tasks.
+
+<p>
+
+Further reading:
+<dl>
+ <dt> <a href="http://www.capify.org/">Capistrano homepage</a> </dt>
+ <dt> <a href="http://www.scribd.com/doc/1618/a-great-capistrano-cheatsheet"> Capistrano cheat-sheet </a></dt>
+ <dt> <a href="http://www.shrubbery.net/rancid/">RANCID</a> </dt>
+ <dd> A similar idea presented here (download config files and put them in revision control) but for network gear. </dd>
+</dl>
+
+<i> Coverage for this was suggested by Jon Heise, who helpfully provided me with an intro to Capistrano. &lt;3 </i>
102 2008/06/day6-tripwire.html
@@ -0,0 +1,102 @@
+We need more automation-minded people writing tools. While playing with tripwire today, I saw
+something that made me think: how am I supposed to automate this? I don't like feeling that a useful tool can't be automated, so let's figure out how.
+
+<p>
+
+I'd done the basics with tripwire so far: creating site and host keys,
+creating the encrypted config and policy files, and running 'tripwire --init'
+to get things started. After making some changes, I ran 'tripwire --check' to
+see what tripwire would tell me. Things were going well until I decided to
+update tripwire's idea of what the current system should be, with 'tripwire
+--update'.
+
+<p>
+
+The <a href="http://linuxgazette.net/106/odonovan.html">tripwire guide</a> I
+was following told me what would happen, but I hadn't read that far. Tripwire launched vi and let me edit a document that started like this:
+
+<pre>
+Tripwire(R) 2.3.0 Integrity Check Report
+
+Report generated by: root
+Report created on: Fri Dec 5 19:11:39 2008
+Database last updated on: Never
+</pre>
+
+The document was full of information about what had changed on the system.
+I hadn't a clue what I was supposed to do, since I was only skimming
+documentation when I got stuck or confused, so I went back to the guide and saw:
+
+<blockquote>
+"If any changes are found you will be presented with a "ballot-box" styled form that must be completed by placing an 'x' opposite the violations that are safe to be updated in the database."
+<br>
+<i>(link to the guide this quote came from under further reading)</i>
+</blockquote>
+
+I have to what? Carefully hand-edit some generated output so tripwire will know
+what to store back in its truth database? How the heck do you automate this? Was
+this a design decision meaning automation and security are mutually exclusive?
+I don't think they are.
+
+<p>
+
+The tripwire config you used when you ran 'tripwire --init' had a variable in it,
+"EDITOR," which was set to /usr/bin/vi. I changed it to '/bin/cat', regenerated
+the encrypted config file (tripwire --create-cfgfile), and reran the update
+command. Instead of launching vi, the report was simply written to stdout,
+meaning we might be able to automate this by replacing cat with some smart script.
+
+<p>
+
+The data format in the report file is very clearly meant for human reading, not
+for computer parsing. Tripwire can parse it for its own purposes, but are you
+up to writing a parser? Googling for <a
+href="http://www.google.com/search?q=tripwire+report+parser">tripwire report
+parser</a> doesn't show promise.
+
+<p>
+
+I replaced /usr/bin/nano with a shell script to see what output I should expect. Rerunning 'tripwire --check' and then 'tripwire --update', my nano change shows up like this:
+
+<pre>
+[x] "/usr/bin/nano"
+</pre>
+
+Leaving that box checked would mean "I know nano changed, it's ok." Writing a
+handler that automatically decides whether a file was knowingly modified might
+be simple. For example, if you upgraded a package recently, most or all of the
+files for that package will be reported as modified/added/removed. You might be
+able to ask your packaging system whether a file is valid. For instance, if a
+file is listed as modified and you use RPMs, you could check whether the file
+has changed since the RPM was installed:
+
+<pre>
+% rpm -Vf /usr/bin/nano
+S.5....T /usr/bin/nano
+</pre>
+
+According to rpm's manpage, the first column means that the size, md5 checksum,
+and modification time differ between the currently installed nano and the one
+the RPM originally installed.
+
+<p>
+
+I'd hate to have to answer these questions every time I did an upgrade on one
+of my servers. Doing it once would be annoying, but doing it across all of my
+servers after an upgrade (how many servers would that be for you?) would be an
+impossible nightmare.
+
+<p>
+
+Since tripwire is a useful tool, you could use the verification information from rpm
+to automatically answer tripwire's inquiry, with a script set as your EDITOR config variable. If you're especially on top of your sysadmin practices, your systems already have automated software rollouts, so if you want to use tripwire, you'll need to automate its management process as well.
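+
+<p>
+
+Here's an untested sketch of what such an EDITOR replacement could look like
+(the script name is hypothetical, and the report parsing is a guess based on
+the ballot format shown above; tripwire hands the report filename to the
+editor as its only argument):
+
+<pre>
+#!/bin/sh
+# hypothetical tripwire-autoanswer.sh; set EDITOR to this script.
+report="$1"
+
+# For each pre-checked ballot entry like: [x] "/usr/bin/nano"
+# ask rpm whether the file still differs from its package. If it does,
+# uncheck the box so the change is left for a human to review;
+# otherwise leave it checked (accepted).
+grep '^\[x\] "' "$report" | sed -e 's/^\[x\] "//' -e 's/"$//' \
+| while read file ; do
+  if rpm -Vf "$file" 2&gt;/dev/null | grep -q " $file\$" ; then
+    sed -i -e "s|^\[x\] \"$file\"|[ ] \"$file\"|" "$report"
+  fi
+done
+</pre>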
+
+<p>
+
+Further reading:
+<dl>
+ <dt> <a href="http://tripwire.sourceforge.net">Open source version of Tripwire </a> </dt>
+ <dt> <a href="http://linuxgazette.net/106/odonovan.html">Intrusion Detection with Tripwire</a> </dt>
+ <dd> The tripwire guide I was learning from. </dd>
+ <dt> <a href="http://www.tripwire.com/">www.tripwire.com</a> </dt>
+</dl>
124 2008/07/day7-host-vs-service.html
@@ -0,0 +1,124 @@
+When talking about servers and services, it's important to keep the two
+separate. Build automation in terms of configuration sets, not in
+terms of servers.
+
+<p>
+
+I tend to think of servers, machines, devices, whatever, as having labels or
+tags. Each label refers to a particular configuration set. Your automation
+tools should know what labels are on a host and only apply changes based on
+those labels. Modern administration tools such as Capistrano and Puppet are
+designed with this distinction in mind. Capistrano calls them 'roles' and
+puppet calls them 'classes,' but ultimately they're just some kind of name you
+apply to configuration or change.
+
+<p>
+
+Labels can be anything, but they should be meaningful. You might have
+"mysql-debug" and "mysql-production" service labels which both cause mysql to
+install but the debug version means you have heavier logging features enabled
+like full query logging, etc.
+
+<p>
+
+Configuring with labels instead of individual hosts helps you scale up.
+Managing configuration changes for a specific service lets you make one change
+to a service and have it deploy on any host having that service. Further, if
+you buy new server hardware, simply adding the appropriate labels to a host
+will let your automation system do the hard work of installation and
+configuration.
+
+<p>
+
+It helps you scale down, too. Here's a fictional example:
+
+<blockquote>
+Quality control requested a production-like environment to test release
+candidates before pushing to production, but the budget will only allow you to
+use two server hosts for this. Production uses many more than this. If you
+automate based on labels instead of hosts, you could easily spread the
+required services across your two servers by simply labelling them, and
+automation would take care of the installation and configuration.
+</blockquote>
+
+<p>
+
+Assuming you have the development time or the tools available, you can use
+labels all over your automation:
+
+<ul>
+ <li> Generate dns entries for all hosts with a specific label </li>
+ <li> Configure your monitoring system based on labels on a host </li>
+ <li> Configure firewall rules </li>
+ <li> Configure backup policy </li>
+ <li> etc... </li>
+</ul>
+
+A simple implementation of this would be a small yaml file with host:label
+mappings:
+
+<pre>
+host1.prod.yourdomain:
+- mysql-debug
+host2.prod.yourdomain:
+- memcache
+- frontend
+</pre>
+
+The deployment of these labels is up to you and the needs of your automation
+system. Keeping this in revision control gives you history with logs. Along
+with the other automation code and configuration you should be keeping in
+revision control, you might just be one step closer to being able to do more
+while working less.
+
+<dl>
+ <dt> With puppet </dt>
+ <dd>
+ If you're using puppet, telling each host what its labels (aka, puppet
+ classes) are is easy; you need only write a script that tells puppet what
+ classes to apply to a host (or node, in puppet's case). <a
+ href="http://reductivelabs.com/trac/puppet/wiki/ExternalNodes">This
+ document</a> will show you how in puppet. (A tiny sketch of such a script
+ appears after this list.)
+ </dd>
+ <dt> With capistrano </dt>
+ <dd>
+ You'll want some piece of code that turns your yaml file of host:label
+ entries into 'role &lt;label&gt;, &lt;host1, host2, ...&gt;. Something like
+ this may do (ruby): (I called our yaml file 'hostlabels.yaml')
+ <pre>
+# roles.rb
+require "yaml"
+labelmap = Hash.new { |h,k| h[k] = [] } # default hash value is empty array
+hosts = YAML::load(File.new("hostlabels.yaml"))
+hosts.each { |host,labels|
+ labels.each { |label| labelmap[label] &lt;&lt; host }
+}
+labelmap.each { |label,hosts|
+ role label, *hosts
+}
+</pre>
+
+ And in your Capfile:
+ <pre>
+load "roles" # use 'load' not 'require'
+
+task :uptime, :roles =&gt; "frontend" do
+ run "uptime"
+end
+</pre>
+ And now 'cap uptime' will only hit servers listed in your yaml file as
+ having the label 'frontend'. Cool.
+ </dd>
+</dl>
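+
+A minimal sketch of the puppet external nodes script mentioned above (the yaml
+path is a placeholder; puppet runs the script with a node name and expects
+YAML output containing a 'classes' key):
+
+<pre>
+#!/usr/bin/env ruby
+# hypothetical external_nodes script
+require "yaml"
+
+node = ARGV[0]
+labels = YAML::load(File.new("/etc/puppet/hostlabels.yaml"))[node] || []
+puts({ "classes" =&gt; labels }.to_yaml)
+</pre>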
+
+I wanted to provide an example with cfengine, too, but I'm not familiar enough
+with the tool and my time ran out learning how to do it.
+
+<p>
+
+The yaml file example is not totally ideal, but it's a start if you have
+nothing. The evolution beyond a simple host:services file is a configuration
+management tool backed by a store of truth - for every machine that exists:
+mac addresses, IPs, service labels, hardware type,
+etc. That class of tool also includes the "enterprise inventory management"
+suites by Oracle and others.
103 2008/09/day9-lockfiles.html
@@ -0,0 +1,103 @@
+The script started with a simple, small idea: some simple task like backing up
+a database or running rsync. You write a script matching your requirements
+and throw it into cron on some reasonable schedule.
+
+<p>
+
+Time passes, growth happens, and suddenly your server is croaking because 10
+simultaneous rsyncs are happening. The script runtime is now longer than your
+interval. Being the smart person you are, you add some kind of synchronization
+to prevent multiple instances from running at once, and it might look like
+this:
+
+<pre>
+#!/bin/sh
+
+lock="/tmp/cron_rsync.lock"
+if [ -f "$lock" ] ; then
+ echo "Lockfile exists, aborting."
+ exit 1
+fi
+
+touch $lock
+rsync ...
+rm $lock
+</pre>
+
+You have your cron job put the output of this script into a logfile so cron
+doesn't email you when the lockfile's stuck.
+
+<p>
+
+Looks good for now. A while later, you log in and need to do work that requires
+this script temporarily not run, so you disable the cron job and kill the
+running script. After you finish your work, you enable the cron job again.
+
+<p>
+
+Due to your luck, you killed the script while it was in the rsync process,
+which meant the 'rm $lock' never ran, which means your cron job isn't running
+now and is periodically updating your logfile with "Lockfile exists, aborting."
+It's easy to not watch logfiles, so you only notice this when something breaks
+that depends on your script. Realizing the edge case you forgot, you add
+handling for signals, just above your 'touch' statement:
+
+<pre>
+trap "rm -f $lock; exit" INT TERM EXIT
+</pre>
+
+Now normal termination and signals (safely rebooting, for example) will remove your
+lockfile. And there was once again peace among the land ...
+
+<p>
+
+... until a power outage causes your server to reboot, interrupting the rsync
+and leaving your lockfile around. If you're lucky, your lockfile is in /tmp and
+your platform happens to wipe /tmp on boot, clearing your lockfile. If you
+aren't lucky, you'll need to fix the bug (you should fix the bug anyway), but how?
+
+<p>
+
+The real fix means we'll have to reliably know whether or not a process is
+running. Recording the pid isn't totally reliable unless you check the pid's
+command arguments, and it doesn't survive some kinds of updates (name change,
+etc). A reliable way to do it with the least amount of change is to use
+flock(1) for lockfile tracking. The flock(1) tool uses the flock(2) interface
+to lock your file. Locks are released when the program holding the lock dies
+or unlocks it. A small update to our script will let us use flock instead:
+
+<pre>
+#!/bin/sh
+
+lockfile="/tmp/cron_rsync.lock"
+if [ -z "$flock" ] ; then
+ lockopts="-w 0 $lockfile"
+ exec env flock=1 flock $lockopts $0 "$@"
+fi
+
+rsync ...
+</pre>
+
+This change allows us to keep all of the locking logic in one small part of the
+script, which is a benefit alone. The trick here is that if '$flock' is not
+set, we will exec flock with this script and its arguments. The '-w 0' argument
+to flock tells it to exit immediately if the lock is already held. This
+solution provides locking that is released when the shell script exits under any
+conditions (normal, signal, sigkill, power outage).
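+
+<p>
+
+If you'd rather not modify the script itself, newer flock(1) versions can also
+wrap the command directly from the crontab (a sketch; the schedule and script
+path are placeholders, and the -n/-c options depend on your flock version):
+
+<pre>
+*/15 * * * * flock -n /tmp/cron_rsync.lock -c '/usr/local/bin/do-rsync.sh'
+</pre>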
+
+<p>
+
+You could also use something like daemontools for this. If you use daemontools, you'd be better off making a service specific to this script. To have cron start your process only once and let it die, you can use 'svc -o /service/yourservice' (a small sketch follows below).
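+
+<p>
+
+A minimal sketch of such a daemontools service (the service directory name and
+script path are placeholders): create a run script, make it executable, and let
+cron invoke 'svc -o /service/cronrsync' on your schedule.
+
+<pre>
+#!/bin/sh
+# /service/cronrsync/run (hypothetical)
+# redirect stderr to stdout so multilog (if used) captures everything
+exec 2&gt;&amp;1
+exec /usr/local/bin/do-rsync.sh
+</pre>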
+
+<p>
+
+Whatever solution you choose, it's important that all of your periodic scripts continue to run normally even after they've been interrupted.
+
+<p>
+
+Further reading:
+<ul>
+ <li> flock(2) syscall is available on solaris, freebsd, linux, and probably other platforms </li>
+ <li> FreeBSD port of a different flock implementation: sysutils/flock </li>
+ <li> <a href="http://cr.yp.to/daemontools.html">daemontools homepage</a> </li>
+</ul>
