Use jQuery, PHP, and an archive.org API to check the HTTP status of a link and redirect to Wayback if it's 404.
# 404 Checker

This tool sniffs out the HTTP status of links on a page, and if the URL returns 404 (or if it returns no headers at all) it queries the Wayback Machine's API to see if a snapshot is available. If one is, we can choose whether or not to redirect users to the Wayback snapshot instead of a 404 result.
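The Wayback Machine's availability API (`https://archive.org/wayback/available?url=...`) answers with a JSON object describing the closest snapshot, if any. A minimal sketch of pulling the snapshot URL out of that response (the function name is illustrative; the response shape is from archive.org's API):

```javascript
// Extract the snapshot URL from a Wayback availability-API response,
// or return null when no snapshot is available.
function snapshotUrlFrom(apiResponse) {
  const closest = apiResponse.archived_snapshots &&
                  apiResponse.archived_snapshots.closest;
  return (closest && closest.available) ? closest.url : null;
}

// Abbreviated example of what the API returns for an archived page:
const example = {
  archived_snapshots: {
    closest: {
      available: true,
      url: "http://web.archive.org/web/20060303000000/http://example.com/",
      timestamp: "20060303000000"
    }
  }
};
```

An unarchived URL comes back with an empty `archived_snapshots` object, so the null case matters in practice.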

## Best approach?

At the moment, the script goes through three steps:

  1. On `document.ready`, the links are scanned and external links are flagged.
  2. On hovering over a link, we use PHP and AJAX to reach out and grab the headers for the URL in question.
  3. If the page returns a 404 header, or doesn't return a header at all, we query the Wayback Machine to see if it has a snapshot.
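Step 1 boils down to a predicate over each link's `href`. A sketch as a pure function (names are illustrative; the real script would run this over every `<a>` on `document.ready` and tag matches, then wire the hover handler to hit the PHP header-check endpoint via `$.ajax`):

```javascript
// A link is worth checking if it points at an http(s) URL on a
// different host than the page it appears on. Relative hrefs resolve
// against the page's own host, so they come back as internal.
function isExternal(href, pageHost) {
  let url;
  try {
    url = new URL(href, "http://" + pageHost + "/");
  } catch (e) {
    return false; // unparsable href: skip it
  }
  // Ignore mailto:, javascript:, etc. — nothing to HTTP-check there.
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  return url.host !== pageHost;
}
```

Filtering out non-HTTP schemes up front avoids wasting header checks on links that can never 404.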

A few things:

  1. It would be better if the initial link scan were limited to areas of the page known to contain links worth checking, like main content areas; there's no sense in scanning parts of the page we already know contain good links.
  2. In theory we could preemptively scan all the links, instead of on hover. This is certainly easier from a programming standpoint, but possibly not so good from a UX and resources standpoint, as we'll be making a bunch of (possibly unneeded) HTTP requests.
  3. Right now the script only checks for 404s and pages that don't resolve at all. There are a lot of other HTTP statuses we could be checking for.
  4. At the request of @waxpancake, the demo page has a fake pubdate of `20060303`, which the script is using to ask for a Wayback snapshot as close to this date as it will give us. If no pubdate is present (or if it's not in Wayback's preferred format: YYYYMMDD), Wayback will default to returning the most recent snapshot.
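The pubdate fallback in point 4 can be sketched as a query builder: include the `timestamp` parameter only when the pubdate looks like `YYYYMMDD`, and otherwise let Wayback default to the most recent snapshot. The endpoint and parameter names are archive.org's availability API; the function name is illustrative:

```javascript
// Build an availability-API query for a link. A valid YYYYMMDD
// pubdate is passed as `timestamp` so Wayback returns the snapshot
// closest to that date; anything else is omitted.
function availabilityUrl(link, pubdate) {
  let query = "https://archive.org/wayback/available?url=" +
              encodeURIComponent(link);
  if (/^\d{8}$/.test(pubdate)) {
    query += "&timestamp=" + pubdate;
  }
  return query;
}
```

A stricter version could also range-check the month and day digits, but Wayback tolerates loose timestamps, so the format check is probably enough here.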

If we go with the on-demand approach, we need to decide what to do:

  1. Do we try to replace the URL before the user clicks? Depending on how fast the HTTP check comes in, the user may click before we get a response.
  2. It's possible we've been able to flag the link as 404, but don't have a result from the Wayback API yet. So do we capture the click, make a note of it, then push the user to the snapshot URL when it comes in, assuming it arrives in a timely manner? If it doesn't, what then?
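One way to handle the race in option 2 is a small per-link state object: if the user clicks before the Wayback API answers, note the click and redirect when the answer arrives. A sketch with all names hypothetical; in the real script, setting `redirectTo` would be `window.location = url`:

```javascript
// Per-link state for a link already flagged as 404.
function makeLinkState() {
  return {
    snapshotUrl: null, // filled in when the Wayback API answers
    clicked: false,    // user clicked before we had an answer
    redirectTo: null   // where we ultimately send the user
  };
}

// Called when the availability API responds (url may be null if
// no snapshot exists — in that case the click falls through to
// the original, broken URL).
function onWaybackResult(state, url) {
  state.snapshotUrl = url;
  if (state.clicked && url) state.redirectTo = url; // late redirect
}

// Called when the user clicks the flagged link.
function onClick(state) {
  if (state.snapshotUrl) {
    state.redirectTo = state.snapshotUrl; // answer already in: go now
  } else {
    state.clicked = true; // capture the click, wait for the API
  }
}
```

This still needs a timeout so a slow or failed API call doesn't strand the user on a captured click; falling back to the original URL after a second or two seems like the least surprising behavior.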

Please note: I have not yet tested how this works on touch devices, but I intend to. We'll likely need a different approach to the link events, since there's no hover.

There's a demo over here: http://git.monkeydo.biz/404-checker/.