Crawling - html to import
=========================
A source blueprint for crawling content from a site or local html files.

Webcrawler imports HTML either from a live website, from a folder on disk, or from a
folder on disk containing html that was originally saved from a live website and may
still have absolute links referring to that website.

To crawl a live website, supply the crawler with a base http url to start crawling from.
This url must be a prefix of all the other urls you want to crawl on the site.

For example ::

 [crawler]
 blueprint = transmogrify.webcrawler
 url = http://www.whitehouse.gov
 max = 50

will restrict the crawler to the first 50 pages.

You can also crawl a local directory of html with relative links by just using a file: style url ::

 [crawler]
 blueprint = transmogrify.webcrawler
 url = file:///mydirectory
 
Alternatively, if the local directory contains html saved from a website, and so might have
absolute urls in it, you can set that directory as the cache. The crawler will always look
in the cache first ::

 [crawler]
 blueprint = transmogrify.webcrawler
 url = http://therealsite.com
 cache = mydirectory

The following will not crawl anything larger than 400000 bytes ::

  [crawler]
  blueprint = transmogrify.webcrawler
  url = http://www.whitehouse.gov
  maxsize=400000

To skip crawling links by regular expression ::
 
  [crawler]
  blueprint = transmogrify.webcrawler
  url=http://www.whitehouse.gov
  ignore = \.mp3
                   \.mp4

If webcrawler is having trouble parsing the html of some pages you can preprocess
the html before it is parsed, e.g. ::

  [crawler]
  blueprint = transmogrify.webcrawler
  patterns = (<script>)[^<]*(</script>)
  subs = \1\2
  
If you'd like to skip processing links with certain mimetypes you can use a
condition section such as the ``drop`` section below. Its TALES expression
determines which items will be processed further; see
http://pypi.python.org/pypi/collective.transmogrifier/#condition-section ::

 [drop]
 blueprint = collective.transmogrifier.sections.condition
 condition: python:item.get('_mimetype') not in ['application/x-javascript','text/css','text/plain','application/x-java-byte-code'] and item.get('_path','').split('.')[-1] not in ['class']


Options
-------

site_url
 - the top url to crawl (written as ``url`` in the examples above)

ignore
 - regular expressions, one per line, for urls not to crawl

cache
 - local directory to read crawled items from instead of accessing the site directly

patterns
 - Regular expressions to substitute before html is parsed. Newline separated

subs
 - Text to replace each item in patterns. Must be the same number of lines as patterns. Due to the way buildout handles empty lines, to replace a pattern with nothing (eg to remove the pattern), use ``<EMPTYSTRING>`` as a substitution.

maxsize
 - don't crawl anything larger than this

max
 - Limit crawling to this number of pages
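
For reference, a single crawler section combining several of the options above might
look like the following (a sketch only, reusing values from the earlier examples) ::

 [crawler]
 blueprint = transmogrify.webcrawler
 url = http://www.whitehouse.gov
 max = 50
 maxsize = 400000
 ignore = \.mp3
          \.mp4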

  
WebCrawler will emit items like ::

 item = dict(_site_url = "Original site_url used",
             _path = "The url crawled without _site_url",
             _content = "The raw content returned by the url",
             _content_info = "Headers returned with content",
             _backlinks = names,
             _sortorder = "An integer representing the order the url was found within the page/site",
             )
  

transmogrify.webcrawler.typerecognitor
======================================

A blueprint for assigning a content type based on the mime-type given by the
webcrawler.
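
In a pipeline the typerecognitor section would normally come straight after the
crawler, so that later sections can act on the assigned type. A minimal sketch
(the section name is illustrative) ::

 [typerecognitor]
 blueprint = transmogrify.webcrawler.typerecognitor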

transmogrify.webcrawler.cache
=============================

A blueprint that saves crawled content into a directory structure.
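
Putting the pieces together, a transmogrifier pipeline wiring these blueprints in the
usual collective.transmogrifier style might look like the sketch below. The section
names are illustrative, and any options the cache section may need are omitted here
because they are not documented above ::

 [transmogrifier]
 pipeline =
     crawler
     typerecognitor
     cache

 [crawler]
 blueprint = transmogrify.webcrawler
 url = http://www.whitehouse.gov

 [typerecognitor]
 blueprint = transmogrify.webcrawler.typerecognitor

 [cache]
 blueprint = transmogrify.webcrawler.cache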