# StreamSpider

A distributed spider built on the Apache Storm platform.

## Environment

## Features

### Incremental scraping and analysis

### Define allowed URL patterns

### Customize the scraping strategy for a given pattern

- limitation
- reset interval
- expire time
- parallelism

### Update settings dynamically

The system refetches settings after a configurable cache time, so settings can be updated at runtime without restarting the topology.
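As a rough illustration of that mechanism, the sketch below keeps a copy of a pattern's settings and refetches them from Redis once the cached copy is older than its own `expire` field (see the Configuration section). It assumes the Jedis client; none of this is the project's actual code.

```java
import redis.clients.jedis.Jedis;

import java.util.Map;

// Illustrative only: refetch url_pattern_setting_{pattern} from Redis
// once the cached copy is older than its own "expire" field (seconds).
class CachedPatternSettings {
    private final Jedis jedis = new Jedis("localhost", 6379);
    private final String pattern;
    private Map<String, String> cached;
    private long fetchedAt;

    CachedPatternSettings(String pattern) {
        this.pattern = pattern;
    }

    Map<String, String> get() {
        long ttlMillis = cached == null ? 0
                : Long.parseLong(cached.getOrDefault("expire", "60")) * 1000L;
        if (cached == null || System.currentTimeMillis() - fetchedAt > ttlMillis) {
            // Refetch: any HSET made in the meantime takes effect here.
            cached = jedis.hgetAll("url_pattern_setting_" + pattern);
            fetchedAt = System.currentTimeMillis();
        }
        return cached;
    }
}
```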

## Topology

The topology has one spout (URLReader) and five bolts: URLFilter, Downloader, HTMLParser, HTMLSaver, and URLSaver.

**URLReader**: pops URLs from the Redis waiting list.

**URLFilter**: determines which URLs will be downloaded.

This bolt is the controller, in charge of:

- dropping URLs that have already been seen
- counting downloads per pattern and ignoring URLs whose pattern has exceeded its limitation
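A minimal sketch of those two checks using plain Redis primitives (Jedis is assumed as the client; the keys `seen_urls` and `pattern_count_{pattern}` are hypothetical, not taken from the project):

```java
import redis.clients.jedis.Jedis;

// Sketch of the two URLFilter checks described above; Jedis and the
// key names "seen_urls" / "pattern_count_{pattern}" are assumptions.
class FilterSketch {
    static boolean shouldDownload(Jedis jedis, String url, String pattern,
                                  long limitation, int intervalSeconds) {
        // Drop URLs that were already seen (SADD returns 0 for duplicates).
        if (jedis.sadd("seen_urls", url) == 0) {
            return false;
        }
        // Count downloads per pattern and stop once the limit is exceeded;
        // the counter key expires so the count resets every interval.
        String countKey = "pattern_count_" + pattern;
        long count = jedis.incr(countKey);
        if (count == 1) {
            jedis.expire(countKey, intervalSeconds);
        }
        return count <= limitation;
    }
}
```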

**Downloader**: downloads the page at each URL.

**HTMLParser**: parses new URLs out of the downloaded page.

**HTMLSaver**: saves the page HTML to a message queue.

**URLSaver**: pushes newly discovered URLs onto the Redis waiting list.
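Putting the pieces together, the wiring might look roughly like the sketch below. The spout and bolt class names come from this README, but the component IDs, groupings, parallelism hints, and the `org.apache.storm` package (Storm 1.x+) are assumptions, not the project's actual setup:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

// Illustrative wiring of the spout and five bolts named above.
// Component IDs, groupings, and parallelism hints are assumptions.
public class StreamSpiderTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("url-reader", new URLReader(), 1);
        builder.setBolt("url-filter", new URLFilter(), 2)
               .shuffleGrouping("url-reader");
        builder.setBolt("downloader", new Downloader(), 4)
               .shuffleGrouping("url-filter");
        // Downloaded pages fan out to both the parser and the saver.
        builder.setBolt("html-parser", new HTMLParser(), 2)
               .shuffleGrouping("downloader");
        builder.setBolt("html-saver", new HTMLSaver(), 2)
               .shuffleGrouping("downloader");
        builder.setBolt("url-saver", new URLSaver(), 2)
               .shuffleGrouping("html-parser");

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("stream-spider", conf, builder.createTopology());
    }
}
```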

## Configuration

The following Redis keys must be (or may optionally be) configured:

**urls_to_download** (Redis list, required): the waiting list; each entry is an absolute URL.

**allowed_url_patterns** (Redis sorted set, required): the URL patterns allowed to be downloaded, read with ZREVRANGEBYSCORE so priority runs from the highest score (5) to the lowest (1).

**url_pattern_setting_{pattern}** (Redis hash, optional):

- **limitation**: maximum number of downloads per interval
- **interval**: duration after which the download count resets
- **expire**: how long these settings stay cached before being refetched
- **parallelism**: maximum number of workers working on this pattern (host) at once
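For example, seeding one URL, one allowed pattern, and its optional settings could look like this (Jedis is assumed as the client; the pattern string and all values are placeholders):

```java
import redis.clients.jedis.Jedis;

// Example of seeding the three configuration keys described above.
// The Jedis client, pattern string, and values are illustrative only.
public class SeedConfig {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // urls_to_download: the waiting list of absolute URLs.
            jedis.rpush("urls_to_download", "https://example.com/news/index.html");

            // allowed_url_patterns: sorted set, score 5 = highest priority.
            jedis.zadd("allowed_url_patterns", 5, "https://example\\.com/news/.*");

            // url_pattern_setting_{pattern}: optional per-pattern hash.
            String key = "url_pattern_setting_https://example\\.com/news/.*";
            jedis.hset(key, "limitation", "1000"); // downloads per interval
            jedis.hset(key, "interval", "3600");   // seconds until count resets
            jedis.hset(key, "expire", "300");      // seconds settings stay cached
            jedis.hset(key, "parallelism", "4");   // max concurrent workers
        }
    }
}
```

The same keys can be written with redis-cli; changes made with HSET take effect once the cached settings expire (see "Update settings dynamically" above).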

## TODO

- ignore non-text pages (binary files)
- consume faster