Code from my work with the 80TB Wide Scrape of the World Wide Web, provided by the Internet Archive.
For more information, follow my blog. Code is written in Mathematica 8 or 9.
Domain Count for CDX Files (CDX-Analysis.nb): See http://ianmilligan.ca/2013/06/17/finding-ca-domains-in-the-80tb-wide-crawl/. This initial version, before edits, reads the CDX files from the Wide Scrape and reports, for each crawl, how many .ca domains it contains. This will help isolate my research sample.
Each output record contains a digit number (for internal purposes), the WARC file, and then the .ca count.
The records are written to a stream file.
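
The counting step described above can be sketched outside Mathematica. The following is a minimal Python illustration, not the notebook's code. It assumes each CDX line is space-separated with the original URL in the third field, counts unique hostnames ending in .ca (the notebook may instead count every matching capture), and writes tab-separated records. The function names, the field position, and the record layout are all hypothetical.

```python
from urllib.parse import urlsplit


def count_ca_hosts(cdx_lines, url_field=2):
    """Count unique hostnames ending in .ca among the given CDX lines.

    Assumption: fields are space-separated and the original URL sits at
    index `url_field`. Real CDX files declare their layout in a header line.
    """
    hosts = set()
    for line in cdx_lines:
        line = line.strip()
        # Skip blank lines and the header line that starts with "CDX".
        if not line or line.startswith("CDX"):
            continue
        fields = line.split(" ")
        if len(fields) <= url_field:
            continue
        try:
            host = urlsplit(fields[url_field]).hostname
        except ValueError:
            # Malformed URL; ignore this line.
            continue
        if host and host.endswith(".ca"):
            hosts.add(host)
    return len(hosts)


def write_record(stream, index, warc_name, ca_count):
    """Write one record: index, WARC file name, .ca count (tab-separated)."""
    stream.write(f"{index}\t{warc_name}\t{ca_count}\n")
```

To use it, open one CDX file, pass the file object to count_ca_hosts, and call write_record on an output file opened in append mode, once per CDX file processed.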