Skip to content

A collection of awesome web crawler,spider in different language

License

Notifications You must be signed in to change notification settings

lostmonk/awesome-crawler

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 

Repository files navigation

Awesome-crawler Awesome

A collection of awesome web crawler,spider and resources in different language

Python

  • Scrapy - A fast high-level screen scraping and web crawling framework.
  • cola - A distributed crawling framework.
  • Demiurge - PyQuery-based scraping micro-framework.
  • feedparser - Universal feed parser.
  • Grab - Site scraping framework.
  • MechanicalSoup - A Python library for automating interaction with websites.
  • portia - Visual scraping for Scrapy.
  • pyspider - A powerful spider system.
  • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
  • MSpider - A simple ,easy spider using gevent and js render.

JavaScript

Java

  • Apache Nutch - Highly extensible, highly scalable web crawler for production environment.
  • Crawler4j - Simple and lightweight web crawler.
  • JSoup - Scrapes, parses, manipulates and cleans HTML.
  • websphinx - Website-Specific Processors for HTML INformation eXtraction.
  • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
  • Gecco - A easy to use lightweight web crawler
  • WebCollector - Simple interfaces for crawling the Web,you can setup a multi-threaded web crawler in less than 5 minutes.
  • Webmagic - A scalable crawler framework.
  • Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
  • SeimiCrawler - An agile, distributed crawler framework.

C#

  • ccrawler - Built in C# 3.5 version. it contains a simple extention of web content categorizer, which can saparate between the web page depending on their content.
  • SimpleCrawler - Simple spider base on mutithreading, regluar expression.
  • Abot - C# web crawler built for speed and flexibility.

PHP

  • dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
  • pspider - Parallel web crawler written in PHP.
  • php-spider - A configurable and extensible PHP web spider.

C++

Ruby

  • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
  • RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.

Go

  • gocrawl - Polite, slim and concurrent web crawler.
  • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

Scala

  • crawler - Scala DSL for web crawling.
  • scrala - Scala crawler(spider) framework, inspired by scrapy.
  • ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.

About

A collection of awesome web crawler,spider in different language

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published