Mailman Archive Scraper
By Phil Gyford firstname.lastname@example.org
Latest version is available from http://github.com/philgyford/mailman-archive-scraper/
These scripts scrape the archive pages generated by the Mailman mailing list manager and republish them as files on the local file system. In addition, they can optionally:
- Create an RSS feed of recent messages.
- Scrape private Mailman archives (if you have a valid email address and password).
- Remove all email addresses from the files (both those in 'email@example.com' and 'phil at gyford dot com' format).
- Replace the URL for the 'more info on this list' links with another.
- Remove one or more levels of quoted emails.
- Search and replace any custom strings you specify.
- Add custom HTML into the <head></head> section of the re-published pages.
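To illustrate the email-removal option, here is a minimal sketch of how both address formats might be stripped from a page. The regular expressions, the `scrub_emails` name and the placeholder text are assumptions made for this example, not code taken from MailmanArchiveScraper.py.

```python
import re

# Hypothetical patterns, not taken from MailmanArchiveScraper.py.
# Matches plain addresses such as 'email@example.com'.
PLAIN_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
# Matches spelled-out addresses such as 'phil at gyford dot com'.
OBFUSCATED_EMAIL = re.compile(
    r"[A-Za-z0-9._%+-]+ at [A-Za-z0-9-]+(?: dot [A-Za-z0-9-]+)+"
)


def scrub_emails(text, replacement="[email removed]"):
    """Replace both address formats with a placeholder string."""
    text = PLAIN_EMAIL.sub(replacement, text)
    return OBFUSCATED_EMAIL.sub(replacement, text)
```

Simple patterns like these will miss unusual addresses and may occasionally match ordinary prose containing " at " and " dot ", so treat them as a starting point only.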
Why would you want to do this? Three reasons:
- You want to create your own HTML archive of a mailing list hosted elsewhere.
- You want to create a public version of a private archive. We hope you have permission to do this, of course. The tools mentioned above allow you to do things like anonymise names and phone numbers, remove email addresses, etc.
- You want an RSS feed of recent messages.
There may be more efficient ways to do this if you have access to the database in which the Mailman archive is stored. If you don't, and can only access the web pages, this script is for you.
This script doesn't store any state locally between sessions, so every time it's run it has to scrape several pages, even if nothing has changed (particularly if you want an RSS feed of the n most recent messages). There is a half-second delay between each fetch of a remote page, which slows things down but will hopefully prevent hammering web servers.
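The half-second delay can be pictured with a sketch like the one below. It is not the script's own code: `fetch_url`, `fetch_all` and the `FETCH_DELAY_SECONDS` constant are names invented for this example, and the standard library's `urllib.request` stands in for whatever HTTP client the script really uses.

```python
import time
import urllib.request

FETCH_DELAY_SECONDS = 0.5  # the half-second pause described above


def fetch_url(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_url, delay=FETCH_DELAY_SECONDS, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        pages[url] = fetch(url)
    return pages
```

Passing `fetch` and `sleep` in as parameters is a design choice for this sketch only; it lets the loop be tried out without touching the network.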
There are caveats. This seems to work with the few Mailman archives tried. I'm sure that some people will find problems with different installations -- unscrapeable HTML, different URLs and filepaths, etc. Feel free to suggest fixes.
Put the directory containing the MailmanArchiveScraper.py script somewhere you want to run it from.
Make a copy of the MailmanArchiveScraper-example.cfg file and rename it (by default the script looks for MailmanArchiveScraper.cfg in its own directory).
Set the configuration options in that file (see below).
Install the required Python modules, best done using pip:
$ pip install -r requirements.txt
Make sure the MailmanArchiveScraper.py script is executable (chmod +x), and the MailmanGzTextScraper.py script too if you need it.
There is help in the configuration file for each setting. The minimum things you'll need to set are:
- domain -- The domain name that your Mailman pages are on.
- list_name -- Name of your mailing list.
- password -- Required if your Mailman archive is password protected.
- publish_dir -- The path to the local directory the files should be republished to.
- publish_url -- If you're going to publish the messages to a website.
By default the script uses a single MailmanArchiveScraper.cfg file in the same directory as the script, but you can specify multiple files, in different locations, instead (see Usage).
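For orientation, the sketch below shows how a file with those minimum settings might be read. It assumes the .cfg file is INI-style and readable with Python's `configparser`. The `[Mailman]` section name, every value shown, the `load_settings` helper and the choice of which options count as required are all hypothetical; only the five option names come from this README. Check the example config file for the real layout.

```python
import configparser

# Hypothetical file contents: the section name and all values are
# assumptions; only the five option names come from the README.
EXAMPLE_CFG = """
[Mailman]
domain = lists.example.org
list_name = example-list
password = secret
publish_dir = /var/www/example-list
publish_url = http://www.example.org/example-list
"""

# Assumed for this sketch: these three must always be present.
REQUIRED = ("domain", "list_name", "publish_dir")


def load_settings(cfg_text, section="Mailman"):
    """Parse INI-style text and return one section's options as a dict."""
    # interpolation=None so that a '%' in a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    settings = dict(parser[section])
    missing = [name for name in REQUIRED if not settings.get(name)]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(missing))
    return settings
```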
Once configuration is done, run the script:
$ python ./MailmanArchiveScraper.py
All being well, the HTML archive files will be downloaded. Set the verbose setting in the configuration file to see a list of which files are being fetched.
If you want to download the plaintext files that Mailman saves for each month's messages (which may be gzipped), then run this script:
$ python ./MailmanGzTextScraper.py
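Because a monthly text file may or may not be gzipped, a downloader has to cope with both cases. The sketch below shows one way to do that by checking for the gzip magic bytes; the `decode_monthly_archive` name and the UTF-8 default are assumptions for this example, and MailmanGzTextScraper.py may handle this differently.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_monthly_archive(raw_bytes, encoding="utf-8"):
    """Return archive text, decompressing first if the bytes are gzipped."""
    if raw_bytes[:2] == GZIP_MAGIC:
        raw_bytes = gzip.decompress(raw_bytes)
    # Undecodable bytes are replaced rather than raising an error.
    return raw_bytes.decode(encoding, errors="replace")
```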
After an initial run, you can run the script via cron to keep an updated copy of the HTML and/or text files. Note the hours_to_go_back setting in the config file, which will probably need to be different for the first run compared to subsequent, regular runs.
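One plausible reading of hours_to_go_back is as a cutoff: only messages newer than "now minus that many hours" are of interest. The sketch below illustrates that reading. It is an assumption about the setting's meaning, and `is_recent` is a name invented for this example; the help text in the configuration file is the authority.

```python
from datetime import datetime, timedelta


def is_recent(message_time, hours_to_go_back, now=None):
    """Return True if message_time is within the last hours_to_go_back hours."""
    now = now or datetime.now()
    cutoff = now - timedelta(hours=hours_to_go_back)
    return message_time >= cutoff
```

Under this reading, a first run would need a value large enough to reach back over the whole archive, while a regular cron run would only need to cover the interval since the previous run, plus some margin.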
By default the script runs once, using MailmanArchiveScraper.cfg. You can specify multiple config files and the script will run once for each file. For example, run the script with:
$ python ./MailmanArchiveScraper.py ~/lists/*.cfg
and the script will run once for each of the *.cfg files matched by that pattern.
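The behaviour described here, one run per config file with a fallback to the default file, could be sketched as follows. The function names are hypothetical and `run_scraper` is only a placeholder. Note that the shell expands a wildcard such as *.cfg before Python starts, so the script simply receives a list of file paths.

```python
import os
import sys

DEFAULT_CFG_NAME = "MailmanArchiveScraper.cfg"


def config_paths(argv, script_dir):
    """Return the config files to process, one scraper run per path."""
    paths = list(argv[1:])
    if not paths:
        # No arguments given: fall back to the default file beside the script.
        paths = [os.path.join(script_dir, DEFAULT_CFG_NAME)]
    return paths


def run_scraper(cfg_path):
    """Placeholder for a single scrape; the real script does the work here."""
    return "scraped with " + cfg_path


def main(argv):
    """Run the (placeholder) scraper once for each config file."""
    script_dir = os.path.dirname(os.path.abspath(argv[0]))
    return [run_scraper(path) for path in config_paths(argv, script_dir)]
```

Calling `main(sys.argv)` from the bottom of a script would tie this to the command line.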
What would also be nice:
- Sending each message on as an email. I can't see how to do this simply, given that we retain no state between times the script is run, so can't tell which emails haven't previously been sent.
Many thanks to:
- CyberRodent for the text/gzip file archiving.
- Danny O'Brien for https support.
- Andrew Bibby for supporting multiple config files, and better handling missing gzip archives.
See all versions: https://github.com/philgyford/mailman-archive-scraper/releases
v1.4.1 2015-09-12 Better handle missing gzip archives.
v1.4 2015-03-15 Multiple config files can be specified.
Add support for non-English language installations.
The monthly text files, possibly gzipped, can be archived using a new script.
The script can now archive files served over https.
Various improvements to the RSS files generated.
The script can now generate an RSS feed of the most recent posts to the mailing list.