CrawleMe! is an easy way to crawl image or link URLs from any website.

ibrahimgunduz34/crawleme

What is CrawleMe!?

CrawleMe! is an easy way to crawl image or link URLs from any website.

How Does It Work?

Create a wrapper class for your web page.

from crawleme.base import BasePage

class MyPage(BasePage):
    # Page to crawl
    url = 'http://www.mysite.com'
    # XPath of the elements to select
    item_path = '//*[@id="campaign_list"]/div/a'
    # Attribute to read from each selected element
    item_attribute = 'href'

Create an instance of the wrapper class and call its crawle method.

crawler = MyPage()
urls = crawler.crawle()

for url in urls:
    print(url)

Result:

http://www.mysite.com/id/5
http://www.mysite.com/aboutus/
http://www.mysite.com/foo/
http://www.mysite.com/bar/
http://www.mysite.com/baz/

You can also pass or override the url or item_path of the wrapper class when creating an instance.

crawler = MyPage(url='http://www.mysite.com/id/112312')

Properties:

url:
URL of the page to crawl.

item_path:
XPath of the DOM element(s) to select.

item_attribute:
Attribute to read from the selected DOM element(s).

has_only_single_item (default=False):
When True, the crawle method returns a single value instead of a list.

fix_urls (default=True):
Sometimes a DOM element attribute contains only a path, without the protocol and hostname. When this property is True, each parsed value is converted to a full URL.
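
To make the properties above more concrete, here is a minimal Python 3 sketch of how they could interact. It is not CrawleMe!'s actual implementation: the function extract_values and the SAMPLE_PAGE markup are hypothetical, and it uses only the standard library (xml.etree.ElementTree and urllib.parse.urljoin), so it assumes well-formed markup and a relative XPath starting with `.//`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed sample markup standing in for a fetched page.
SAMPLE_PAGE = """
<html><body>
  <div id="campaign_list">
    <div><a href="/id/5">Campaign</a></div>
    <div><a href="http://www.mysite.com/aboutus/">About us</a></div>
    <div><a href="foo/">Foo</a></div>
  </div>
</body></html>
"""


def extract_values(markup, base_url, item_path, item_attribute,
                   has_only_single_item=False, fix_urls=True):
    """Illustrative stand-in for the behaviour the properties describe."""
    root = ET.fromstring(markup.strip())
    values = []
    # item_path selects the elements; item_attribute picks the value to read.
    for element in root.findall(item_path):
        value = element.get(item_attribute)
        if value is None:
            continue  # element lacks the requested attribute
        if fix_urls:
            # Resolve path-only values against the page URL.
            value = urljoin(base_url, value)
        values.append(value)
    if has_only_single_item:
        # Return the first match only (None if nothing matched).
        return values[0] if values else None
    return values


if __name__ == "__main__":
    # ElementTree supports only a subset of XPath and needs a relative path,
    # so the leading '//' becomes './/' here.
    path = ".//*[@id='campaign_list']/div/a"
    for url in extract_values(SAMPLE_PAGE, "http://www.mysite.com/", path, "href"):
        print(url)
```

With fix_urls disabled in this sketch, the path-only values such as `/id/5` would be returned unchanged, which is the situation the property is meant to address.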

Methods:

crawle([timeout=crawleme.conf.REQUEST_TIMEOUT],[renew=False]):
Parses a list of values, or a single value, from the page according to the specified properties.

get_filename([timeout=crawleme.conf.REQUEST_TIMEOUT]):
Returns the requested filename.

read([timeout=crawleme.conf.REQUEST_TIMEOUT]):
Reads data from the stream.
