Skip to content

national or regional domain scope

Alex Osborne edited this page Jul 4, 2018 · 2 revisions

I'd like to crawl a nation's or region's web content. How should I set up Heritrix?

this is a complex topic possibly requiring deep meditation, however
for example - if you only want to capture "Hungarian" sites with the
"hu" TLD, then you could specify "http://(hu," as an allowable scope
SURT prefix using SurtPrefixedDecideRule in your crawler beans
CXML file. see:

however, this is a highly oversimplified notion of what might
constitute "Hungarian sites only". you'll want to think about what
that actually means for a class of URLs. for instance, what about
sites that are hosted elsewhere but are clearly Hungarian? or hosted
Hungarian blogs (livejournal, blogger, typepad, etc.)? or hungarian
social networking sites (facebook, myspace, bebo, flickr, etc.)?

other alternatives for addressing this scoping issue include
requesting a listing of sites from servers geographically located in
the country of interest from a service like Alexa Internet or a domain
registry.

it's also conceivable to write custom modules to try to detect the
language or geolocation of potential target sites to decide what to
capture.

Heritrix

Structured Guides:

Wiki index

FAQs

User Guide

Knowledge Base

Known Issues

Background Reading

Users of Heritrix

How To Crawl

Development

Clone this wiki locally