Code from my work with the 80TB Wide Scrape of the World Wide Web, provided by the Internet Archive.
Computer-Vision-Data
Postal-Codes
WaybackMachine.workflow/Contents
.DS_Store
CDX-Analysis.nb
CDX-sorting.nb
README.md
Wide-Scrape-WARC-Tools.sh


Exploring-Wide-Scrape

Code from my work with the 80TB Wide Scrape of the World Wide Web, provided by the Internet Archive.

For more information, follow my blog. The code is written for Mathematica 8 or 9.

Domain Count for CDX Files (CDX-Analysis.nb): See http://ianmilligan.ca/2013/06/17/finding-ca-domains-in-the-80tb-wide-crawl/. This version, before edits, reads the CDX files from the Wide Scrape and counts how many .ca domains appear in each crawl file. These counts will help isolate my research sample.

Each output record contains a digit number (for internal purposes), the WARC file name, and the .ca count. Records are written to a stream file.
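The counting step can be sketched outside Mathematica as well. The following Python is a minimal illustration, not the notebook's code. It assumes space-separated CDX lines in which the third field is the original URL and the last field is the WARC file name, and it counts captured URLs whose host ends in .ca (the notebook may count differently). The function names and the running index used as the "digit number" are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def is_ca_host(url):
    """Return True if the URL's host name ends in .ca."""
    try:
        # Prefix scheme-less values with // so urlsplit treats them as hosts.
        host = urlsplit(url if "://" in url else "//" + url).hostname or ""
    except ValueError:
        return False
    return host.lower().rstrip(".").endswith(".ca")


def count_ca_by_warc(cdx_lines, url_field=2, file_field=-1):
    """Count .ca captures per WARC file.

    Assumption: url_field indexes the original URL and file_field the
    WARC file name; adjust both to match the CDX header of your files.
    """
    counts = Counter()
    for line in cdx_lines:
        line = line.strip()
        if not line or line.startswith("CDX"):
            continue  # skip blank lines and the CDX header line
        fields = line.split()
        if len(fields) <= url_field:
            continue  # skip malformed lines
        if is_ca_host(fields[url_field]):
            counts[fields[file_field]] += 1
    return counts


def write_counts(counts, out):
    """Write one tab-separated record per WARC file: index, file name, count."""
    # The running index is a hypothetical stand-in for the "digit number".
    for index, (warc, n) in enumerate(sorted(counts.items()), start=1):
        out.write(f"{index}\t{warc}\t{n}\n")
```

To use it, open a CDX file, pass the file object to `count_ca_by_warc`, and pass the result to `write_counts` together with an output file opened for writing.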
