Web Archive APIs workshop

Scripts and such for workshops on web archive APIs

Scripts and descriptions of what they do, culled from my daily practice of work, fun, and too-much-time-on-my-hands. Credit to the Archive-It and Internet Archive folks who sometimes help me with the harder parts I can't find on Stack Overflow. Portions were delivered in an SAA Web Archiving Section webinar workshop and at the WARCshop at Penn State University (both in Spring 2017), and at iPRES in Kyoto, Japan, in Fall 2017. The APIs are mostly Internet Archive / Archive-It ones because, duh, that's where I work.

See also because all this stuff will eventually live together in a harmonious fashion.

API Examples

Wayback Machine CDX & Availability APIs

The Wayback CDX Server API has good documentation with examples.

The Availability API is useful for looking up whether something has been archived. To query it for a set of URLs, construct an input file like so:

[ {"url":""}, {"url":""}, {"url":""} ]

and store it in a text file such as "wayback_lookup.txt", then run this command

curl -X POST -d @wayback_lookup.txt --header "Wayback-Api-Version:2" --header "Content-Type:application/json"

and get a response in JSON.
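If the URLs start out as a plain list, the input file can be generated rather than typed by hand. This is a minimal sketch, not part of the original workshop scripts: `build_lookup` is a hypothetical helper name, the sample URLs are placeholders, and it assumes one URL per line with no double quotes or backslashes in them.

```shell
# Build the JSON array the Availability API expects from a plain list of URLs.
# Assumption: one URL per line, no double quotes or backslashes in the URLs.
build_lookup() {
  awk 'BEGIN { printf "[ " }
       NF    { printf "%s{\"url\":\"%s\"}", sep, $0; sep = ", " }
       END   { print " ]" }'
}

# Placeholder URLs; blank lines in the input are skipped.
printf '%s\n' 'example.com' 'example.org/page' | build_lookup > wayback_lookup.txt
cat wayback_lookup.txt
```

The resulting file has the same shape as the hand-written example above and can be passed to curl with `-d @wayback_lookup.txt`.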

Get a list of all hosts on a certain domain

domain=np; curl "$domain&matchType=domain&fl=urlkey,length&from=1996&to=2015" | cut -f1 -d '/' | uniq -c
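Because the query asks for both `urlkey` and `length`, the same response can also give a rough size per host. The sketch below is an illustration under assumptions: it takes space-separated "urlkey length" lines on stdin, `summarize_hosts` is a hypothetical helper name, and the sample lines are made up to stand in for a real CDX response.

```shell
# Summarize CDX lines of the form "<urlkey> <length>" per host key.
# Prints: host key, capture count, total compressed bytes.
summarize_hosts() {
  awk '{ split($1, parts, "/"); host = parts[1]
         count[host]++; bytes[host] += $2 }
       END { for (h in count) printf "%s %d %d\n", h, count[h], bytes[h] }' | sort
}

# Made-up sample input standing in for a CDX response.
printf '%s\n' \
  'np,gov,mof)/ 1200' \
  'np,gov,mof)/budget 800' \
  'np,com,example)/ 500' | summarize_hosts
```

Unlike `uniq -c`, the awk version does not depend on the input being grouped by host, since it keeps a running count per key.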

Here is a sample python notebook for playing with these APIs and charting results:

Host Details API

The new Wayback Machine search includes profile information for specific hosts. This visualization is served via an API that can be queried at multiple levels, including TLD, and these stats can be parsed and analyzed.

Sum the total new captures by mime for a host

curl | jq '.new_urls | .[] | .["application/pdf"]' | paste -sd+ - | bc

Sum the total size of captures by mime for a host and print it in kB, MB, GB, and TB

curl | jq '.urls_total_compressed_size | .[] | .["application/pdf"]' | paste -sd+ - | bc | awk '{ myfloat = "%.10f"; byte = sprintf(myfloat, $1 /1024) ; print byte " kB";byte = sprintf(myfloat, $1 /1024**2) ; print byte " MB"; byte = sprintf(myfloat, $1 /1024**3) ; print byte " GB"; byte = sprintf(myfloat, $1 /1024**4) ; print byte " TB" }'
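The byte-to-unit conversion at the end of that pipeline can be pulled into a small reusable function. This is a sketch rather than the original script: `human_sizes` is a hypothetical name, it takes the byte count as an argument instead of reading from `bc`, rounds to two decimals, and uses the POSIX `^` operator in place of `**` so it should run on awk implementations that lack `**`.

```shell
# Print a byte count in kB, MB, GB and TB (powers of 1024).
human_sizes() {
  awk -v bytes="$1" 'BEGIN {
    split("kB MB GB TB", unit, " ")
    for (i = 1; i <= 4; i++)
      printf "%.2f %s\n", bytes / (1024 ^ i), unit[i]
  }'
}

human_sizes 1073741824
```

For 1073741824 bytes the GB line reads `1.00 GB`.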

Media and Hash APIs

The new Wayback Search also supports searching for specific media types and by SHA1 checksum.

Lookup a specific checksum in the Wayback Machine

Feed in a list of checksums and get their associated URLs

Make a text file of SHA1 checksums with one checksum on each line; in this example the file is called 'hash_example2.txt'.

cat hash_example2.txt | while read url; do curl -L -s${url} | jq -r '.["hits"] | .[] | .["content_digest"] + " " + .["url"]'; done

Get puppy media

curl | jq -r '.[].link'

Full Text Search APIs

Archive-It does full-text indexing of its entire corpus (well over 1 PB total). The parameters for using this API via Open Search are at

View full text search results in JSON via the API

Archive-It Partner Data API

This API provides access to collection-specific data within Archive-It (but a similar API could provide such collection-related information held in other systems).

Get all of the seeds in a collection

curl '' | jq '.[] | .canonical_url' | cut -d '"' -f2

Using the collection identifier 2950 (for the Occupy Movement Collection), this parses the seed URLs out of the JSON response. Change the collection identifier to get seed lists for other collections (collection IDs can be found by searching in Archive-It).

Get collection description

curl | jq '.metadata.Description | .[].value'

Similar to the above, this query returns the curator-added description field for a specific collection, as can be seen in the public interface at


GifCities term frequency count

Our GifCities project is built on top of an open API. This is an example of a collection extracted from a web archive and recreated in a new portal with additional information. The API powers the search and replay but can also be used for some funtime poking around, such as Kittens vs. Puppies!

curl | jq -r '.[].url_text' | tr -s ' ' | tr ' ' '\n' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr | head -20

curl | jq -r '.[].url_text' | tr -s ' ' | tr ' ' '\n' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr | head -20
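The counting half of those pipelines does not depend on the API at all, so it can be wrapped in a function and tried on any text. A minimal sketch, assuming plain text on stdin; `term_freq` is a hypothetical name and the sample text is made up.

```shell
# Count term frequencies in text read from stdin.
# Optional first argument: how many top terms to print (default 20).
term_freq() {
  tr -s '[:space:]' '\n' |
    tr '[:upper:]' '[:lower:]' |
    grep -v '^$' |
    sort | uniq -c | sort -nr |
    head -n "${1:-20}"
}

# Made-up sample standing in for the url_text values from the API.
printf '%s\n' 'Cute puppy' 'PUPPY dance' 'kitten' | term_freq 3
```

Piping the `jq -r` output into `term_freq` should give the same ranking as the one-liners, with the difference that tabs and repeated whitespace are also treated as word breaks.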

WAT API examples

A WAT file is a lightweight derivative of a WARC: it contains only key data, not the full resource information (such as page text). More on WATs at:

Get all anchor text from a resource

curl | jq '.["Envelope"]["Payload-Metadata"]["HTTP-Response-Metadata"]["HTML-Metadata"]["Links"][]["text"]' | grep -v null | less

Get page title from seed

curl | jq -r '.["Envelope"]["Payload-Metadata"]["HTTP-Response-Metadata"]["HTML-Metadata"]["Head"]["Title"]'

Extract titles from a list of seeds

cat eot_pr.txt | while read url; do curl -L -s${url} | jq '.["Envelope"] | .["Payload-Metadata"] | .["HTTP-Response-Metadata"] | .["HTML-Metadata"] | .["Head"] | .["Title"]' 2> /dev/null ; done

Anchor terms (with frequencies) in links from these pages

cat eot_pr.txt | while read url; do curl -L -s${url} | jq '.["Envelope"] | .["Payload-Metadata"] | .["HTTP-Response-Metadata"] | .["HTML-Metadata"] | .["Links"][] | .["text"]' 2> /dev/null | sed "s@\"@@g" ;done | tr -s ' ' | tr ' ' '\n' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr | grep -v null | head -50

