Archiving URLs

From the command line

Once installed, you can start using storytracker's command-line tools immediately, like ``storytracker-archive``, which wraps :py:func:`storytracker.archive`.

$ storytracker-archive http://www.latimes.com

That should pour out a scary-looking stream of data to your console. That is the content of the page you requested, compressed using gzip. If you'd prefer to see the raw HTML, add the --do-not-compress option.

$ storytracker-archive http://www.latimes.com --do-not-compress
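
You can also leave compression on and unzip the stream as it arrives by piping it through gunzip, which decompresses gzip data from standard input. This assumes the stream is ordinary gzip output, as described above.

$ storytracker-archive http://www.latimes.com | gunzip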

You could save the raw HTML yourself using a standard UNIX pipeline.

$ storytracker-archive http://www.latimes.com --do-not-compress > archive.html
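
The same pipeline works for the compressed default; skip the flag and give the file a .gz extension.

$ storytracker-archive http://www.latimes.com > archive.html.gz

Python's standard gzip module can read that file back later. A minimal sketch, assuming the hypothetical filename above:

import gzip

# Open the gzipped archive saved by the pipeline above
with gzip.open("archive.html.gz", "rb") as f:
    html = f.read()

# Show the first few hundred bytes to verify it is HTML
print(html[:500])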

But why do that when :py:func:`storytracker.create_archive_filename` will work behind the scenes to automatically come up with a tidy name that includes both the URL and a timestamp?

$ storytracker-archive http://www.latimes.com --do-not-compress --output-dir="./"

Run that and you'll see the file right away in your current directory.

# Try opening the file you spot here with your browser
$ ls | grep .html
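
If you are curious what those generated names look like, you can call the naming function yourself. A quick sketch, assuming :py:func:`storytracker.create_archive_filename` accepts a URL and a timestamp:

import storytracker
from datetime import datetime

# Preview the filename that would be generated for this URL right now
print(storytracker.create_archive_filename(
    "http://www.latimes.com",
    datetime.now()
))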

Using Python

UNIX-like systems typically come equipped with a built-in method for scheduling tasks known as cron. To use it with storytracker, one approach is to write a Python script that archives a series of sites each time it is run.

import storytracker

SITE_LIST = [
    # A list of the sites to archive
    'http://www.latimes.com',
    'http://www.nytimes.com',
    'http://www.kansascity.com',
    'http://www.knoxnews.com',
    'http://www.indiatimes.com',
]
# The place on the filesystem where you want to save the files
OUTPUT_DIR = "/path/to/my/directory/"

# Runs when the script is called with the python interpreter
# a la "$ python cron.py"
if __name__ == "__main__":
    # Loop through the site list
    for s in SITE_LIST:
        # Spit out what you're doing
        print "Archiving %s" % s
        try:
            # Attempt to archive each site at the output directory
            # defined above
            storytracker.archive(s, output_dir=OUTPUT_DIR)
        except Exception as e:
            # And just move along and keep rolling if it fails.
            print(e)
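
Before wiring the script into cron, you can test the same call interactively from a Python shell. A quick sketch, assuming the output directory already exists:

import storytracker

# Archive a single site to confirm everything works before scheduling it
storytracker.archive("http://www.latimes.com", output_dir="/tmp/")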

Scheduling with cron

Then edit your crontab from the command line.

$ crontab -e

And use cron's scheduling expressions to run the job however you'd like. This example would run a script like the one above at the top of every hour. Note that it assumes storytracker is available to your global Python installation at /usr/bin/python. If you are using a virtualenv or a different Python configuration, begin the line with the path to that particular python executable, as in the second example below.

0 * * * *  /usr/bin/python /path/to/my/script/cron.py
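
If the script depends on a virtualenv, point the schedule at that environment's interpreter instead. The paths here are hypothetical:

0 * * * *  /path/to/my/virtualenv/bin/python /path/to/my/script/cron.py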