a tool for continuously ingesting w/arc files into the archive

DRAINTASKER - clears filling disks

* Draintasker wiki:
* Draintasker bugs:

supports "draining" a running crawler along two paths:

  1) dtmon: IAS3-to-petabox (paired storage)
  2) th-dtmon: catalog-direct-to-thumpers (Santa Clara MD)

run like this:

  $ ssh home
  $ screen
  $ ssh -A crawler
  $ cd /path/draintasker
  $ svn up
  $ emacs dtmon.yml, save-as: /path/drain.yml
  $ /path/drain.yml | tee -a /path/drain.log


  monitor job and drain: while the DRAINME file exists,
  pack warcs (PACKED), make manifests (MANIFEST), launch transfers
  (TASK), verify transfers (TOMBSTONE), and finally delete verified
  warcs, then sleep before trying again.
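the loop just described can be sketched in shell. every path, command,
and the sleep interval below is an assumption for illustration, not
draintasker's actual code:

```shell
# sketch of the monitor/drain loop described above -- job_dir and
# the step commands are hypothetical stand-ins, not the real code
job_dir=/1/crawling/myjob     # hypothetical job dir
sleep_secs=300                # hypothetical retry interval

drain_once () {
    # stand-ins for the real steps: pack, manifest, transfer, verify, delete
    echo "pack warcs          => PACKED"
    echo "make manifests      => MANIFEST"
    echo "launch transfers    => TASK"
    echo "verify transfers    => TOMBSTONE"
    echo "delete verified warcs"
}

while [ -e "$job_dir/DRAINME" ]; do
    drain_once
    sleep "$sleep_secs"
done
```

removing the DRAINME file is what stops the loop, which matches the
"while DRAINME file exists" behavior above.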

under-the-hood:

  dtmon config
  '-> s3-drain-job job_dir xfer_job_dir max_size warc_naming
      '-> => PACKED
      '-> => MANIFEST
      '-> => poof, no w/arcs!

  th-dtmon config
  '-> drain-job job_dir xfer_job_dir thumper max_size warc_naming
      '-> => PACKED
      '-> => MANIFEST
      '-> => LAUNCH, TASK
      '-> => SUCCESS, TOMBSTONE
      '-> => poof, no w/arcs!

get status of prerequisites and disk capacity like this:

  $ crawldata_dir xfer_dir

some advice:

  1) if there are old draintasker procs, kill them.
  2) if files are in the way, investigate and move them aside,
     eg mv LAUNCH LAUNCH.1, mv ERROR ERROR.1
     (it is good to number each failure/error file)
  3) check the status of your disks
  4) (optional) test petabox-to-thumper path on single series
  5) log into home and open a screen session
  6) in screen, ssh crawler, cd /path/draintasker/, svn up
  7) run the daemon to continuously drain each job+disk
       cd /path/draintasker
       ./ /path/disk1.yml
       cd /path/draintasker
       ./ /path/disk3.yml
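advice item 2 above (number each failure/error file) can be scripted.
`rotate_aside` is a hypothetical helper name, not part of draintasker:

```shell
# move a failure/error file aside under the next free numeric suffix,
# per the advice above; rotate_aside is a hypothetical helper
rotate_aside () {
    f=$1
    n=1
    while [ -e "$f.$n" ]; do
        n=$((n+1))
    done
    mv "$f" "$f.$n"
}

# usage, in a warc_series dir with a stale ERROR file:
#   rotate_aside ERROR    # ERROR -> ERROR.1 (or ERROR.2 if .1 exists)
```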


directory structure

  crawldata     /{1,3}/crawling
  rsync_path    /{1,3}/incoming
  job_dir       /{crawldata}/{job_name}
  xfer_job_dir  /{rsync_path}/{job_name}
  warc_series   {xfer_job_dir}/{warc_series}
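to make the templates concrete, here is how they expand in shell for a
hypothetical job named "wide" on disk 1 (the job name is an example
only):

```shell
# expanding the path templates above; job_name "wide" is hypothetical
crawldata=/1/crawling
rsync_path=/1/incoming
job_name=wide

job_dir=$crawldata/$job_name          # /1/crawling/wide
xfer_job_dir=$rsync_path/$job_name    # /1/incoming/wide

echo "$job_dir"
echo "$xfer_job_dir"
```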

depending on config, your warcs might be written into {job_dir}
and be "packed" into {warc_series} dirs under {xfer_job_dir}.

DEPENDENCIES (IAS3-to-petabox)
    + $HOME/.ias3cfg (when using IAS3)
    + ~/.wgetrc with your user cookies (see wiki)

DEPENDENCIES (catalog-direct-to-thumper)
    + ensure petabox user exists: /home-local/petabox
    + PETABOX_HOME=/home/user/petabox (codebase from svn)
    + get petabox authorized_keys from "draintasking" crawler
      @crawling08:~$ scp /home-local/petabox/.ssh/authorized_keys\
    + add [incoming_x] stanzas to /etc/rsyncd.conf (see wiki)
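an [incoming_x] stanza of the kind mentioned above might look like the
following; the module name, path, and option values are assumptions, so
check the wiki for the real settings:

```
[incoming_1]
    path = /1/incoming
    comment = draintasker staging (example)
    read only = false
```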


  DRAINME       {job_dir}/DRAINME
  PACKED        {warc_series}/PACKED
  MANIFEST      {warc_series}/MANIFEST
  LAUNCH        {warc_series}/LAUNCH
  TASK          {warc_series}/TASK
  TOMBSTONE     {warc_series}/TOMBSTONE
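a quick way to see where a series stands is to test for these files in
order; `series_status` is a hypothetical helper, not a draintasker
command:

```shell
# report which pipeline state files a warc_series dir has reached;
# series_status is a hypothetical helper, not part of draintasker
series_status () {
    dir=$1
    for f in PACKED MANIFEST LAUNCH TASK TOMBSTONE; do
        if [ -e "$dir/$f" ]; then
            echo "$f: present"
        else
            echo "$f: missing"
        fi
    done
}

# usage: series_status /1/incoming/{job_name}/{warc_series}
```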

if you see a RETRY file, eg RETRY.1284217446, the suffix is the epoch
time when a non-blocking retry was scheduled. if this file exists,
then the retry was attempted at some time after that. you can get the
human-readable form of that time with the date cmd, like so:

  date -d @1284217446
  Sat Sep 11 15:04:06 UTC 2010
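in a script, the epoch suffix can be stripped off the filename with
shell parameter expansion before handing it to date (GNU date shown;
the filename is the example above):

```shell
# strip the epoch suffix from a RETRY file name and print it as a date
f=RETRY.1284217446
epoch=${f#RETRY.}     # -> 1284217446
date -ud "@$epoch"    # -u prints UTC; gives Sat Sep 11 15:04:06 UTC 2010
```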

DRAIN DAEMON
  + run s3-drain-job periodically
  + run drain-job periodically
  + run draintasker processes in single mode

DRAIN PROCESSING
  + delete original (verified) w/arcs from each series
  + report remote md5 and url for all files.xml in series
  + submit catalog task for series
  + wget remote w/arc and verify checksum for series
  + verify remote size of w/arc series
  + submit transfer tasks for series
  + compute md5s into series MANIFEST
  + create warc series when available
  + invoke curl for series
  + check and report task success by task_id
  + run task-check-success and item-verify for series

UTILS
  + report dtmons, prerequisites and disk usage
  + report count and total size of warcs
  + make tarball of crawldata for permastorage
  + report staged crawldata file count+size
  + report source crawldata file count+size
  + copy all crawldata preserving dir structure
  + make bundles and scp to staging

siznax 2010