CrawleMe! is an easy way to crawl image or link URLs from any web site.
Create your web page wrapper class.
    from crawleme.base import BasePage

    class MyPage(BasePage):
        url = 'http://www.mysite.com'
        item_path = '//*[@id="campaign_list"]/div/a'
        item_attribute = 'href'
Create an instance of the wrapper class and call its crawle method.
    crawler = MyPage()
    urls = crawler.crawle()
    for url in urls:
        print(url)
Result:
    http://www.mysite.com/id/5
    http://www.mysite.com/aboutus/
    http://www.mysite.com/foo/
    http://www.mysite.com/bar/
    http://www.mysite.com/baz/
You can also pass or override the url or item_path of the wrapper class when creating an instance.
    crawler = MyPage(url='http://www.mysite.com/id/112312')
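One way to picture the override is class-level defaults that are shadowed by keyword arguments on the instance. The sketch below is not CrawleMe!'s actual code: BasePageSketch is a hypothetical stand-in for crawleme.base.BasePage, written only to show the pattern, and it is assumed to run on Python 3.

```python
class BasePageSketch(object):
    """Hypothetical stand-in for crawleme.base.BasePage (assumption)."""
    url = None
    item_path = None
    item_attribute = None

    def __init__(self, **overrides):
        for name, value in overrides.items():
            # Only accept names that already exist as class attributes.
            if not hasattr(type(self), name):
                raise TypeError('unknown attribute: %s' % name)
            # Setting on the instance shadows the class-level default.
            setattr(self, name, value)


class MyPage(BasePageSketch):
    url = 'http://www.mysite.com'
    item_path = '//*[@id="campaign_list"]/div/a'
    item_attribute = 'href'


default_page = MyPage()
custom_page = MyPage(url='http://www.mysite.com/id/112312')

print(default_page.url)       # class default
print(custom_page.url)        # overridden for this instance only
print(custom_page.item_path)  # still inherited from the class
```

Under this assumption the class itself stays unchanged, so other instances of MyPage keep the default url.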
url:
URL of the page to crawl.
item_path:
XPath of the DOM element(s) to select.
item_attribute:
Attribute of the selected DOM element(s) whose value is returned.
has_only_single_item (default=False):
When True, the crawle method returns a single value instead of a list.
fix_urls (default=True):
DOM attributes sometimes contain only a path, without the protocol and hostname. When this attribute is True, the parsed value is expanded to a full URL.
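The attributes above can be illustrated with a minimal sketch built only on the Python 3 standard library. It is not how CrawleMe! is implemented: select_values and SAMPLE are hypothetical names, xml.etree.ElementTree supports only a subset of XPath and needs well-formed markup, and the real library may use a different parser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def select_values(markup, base_url, item_path, item_attribute,
                  has_only_single_item=False, fix_urls=True):
    """Hypothetical helper that mimics the attribute semantics."""
    root = ET.fromstring(markup.strip())
    # ElementTree only accepts relative paths, so '//...' becomes './/...'.
    if item_path.startswith('//'):
        item_path = '.' + item_path
    # Read the requested attribute from every matching element.
    values = [el.get(item_attribute) for el in root.findall(item_path)]
    values = [v for v in values if v is not None]
    if fix_urls:
        # Expand bare paths such as '/id/5' against the page URL.
        values = [urljoin(base_url, v) for v in values]
    if has_only_single_item:
        return values[0] if values else None
    return values


SAMPLE = """
<html><body>
  <div id="campaign_list">
    <div><a href="/id/5">Campaign</a></div>
    <div><a href="http://www.mysite.com/aboutus/">About</a></div>
  </div>
</body></html>
"""

urls = select_values(SAMPLE, 'http://www.mysite.com',
                     '//*[@id="campaign_list"]/div/a', 'href')
for url in urls:
    print(url)
```

With fix_urls=False the first value would stay as the bare path '/id/5', and with has_only_single_item=True only the first match is returned.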
crawle([timeout=crawleme.conf.REQUEST_TIMEOUT],[renew=False]):
Parses a list of values, or a single value, from the page according to the specified attributes.
get_filename([timeout=crawleme.conf.REQUEST_TIMEOUT]):
Returns the requested filename.
read([timeout=crawleme.conf.REQUEST_TIMEOUT]):
Reads data from the stream.