Scoopi Web Scraper is a heavy duty tool to extract data from HTML pages.
maithilish fix breakAfter
        when breakAfter is defined, but axis value is null then throw error,
          otherwise it is unable to apply breakAfter and parse never terminates
        Parser - do not store data if reuse
Latest commit 1be17d1 Dec 29, 2018
Type Name Latest commit message Commit time
notes
src fix breakAfter Dec 31, 2018
LICENSE add license and readme Jul 25, 2018
README.md updates Oct 20, 2018
pom.xml enable defs validation Dec 24, 2018

README.md


CodeTab Scoopi Guide Quickstart and Guide


Scoopi is a tool to extract and transform data from web pages.

Libraries such as JSoup and HtmlUnit make it quite easy to scrape web pages in Java, but things get complicated when the data comes from a large number of pages. They do well at scraping data from a limited set of pages, but they are not meant to handle thousands of pages. Scoopi is built on JSoup and HtmlUnit, and it offers the following features:

  • Scoopi is completely definition driven. Data structure, task workflow and pages to scrape are defined with a set of YML definition files, and no coding skills are required
  • It can be configured to use either JSoup or HtmlUnit as the scraper
  • Queries can be written using either Selectors with JSoup or XPath with HtmlUnit
  • Scoopi is a multithreaded application that processes pages in parallel for maximum throughput
    • even on a low-end system with a Core 2 Duo processor, it can load, parse and transform around 1000 pages in under two minutes
  • Scoopi ships as a Docker image so that it can run without any cumbersome installation
  • Scoopi persists pages and data to a database so that it can recover from a failed state without repeating the tasks already completed
  • For transparent persistence, Scoopi uses the JDO standard and DataNucleus AccessPlatform, so you can choose your datastore from a very wide range
  • It can transform, filter and sort the data
  • It comes with built-in appenders such as FileAppender, DBAppender and ListAppender
  • ScoopiEngine can be embedded in other programs, which can then access the scraped data through ListAppender
  • Flexible workflow allows one to change sequence of steps
  • Scoopi is extensible. Developers can extend the predefined base steps or even create new ones with different functionality and weave them in workflow
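
To give a feel for two of the ideas in the list above (processing pages in parallel, and skipping tasks that were already completed), here is a minimal Java sketch. It is not Scoopi's code and does not use Scoopi's API: the class `ParallelPagesSketch`, the methods `parse` and `processAll`, and the in-memory `done` set (standing in for the database that Scoopi persists to) are all hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPagesSketch {

    // Hypothetical stand-in for "load, parse and transform":
    // here it only returns the length of the page name.
    static int parse(String page) {
        return page.length();
    }

    // Processes every page not already in `done` on a fixed-size thread pool.
    // `done` is an in-memory assumption; a real tool would keep it in a datastore.
    static Map<String, Integer> processAll(List<String> pages, Set<String> done, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<Integer>> futures = new LinkedHashMap<>();
            for (String page : pages) {
                if (done.contains(page)) {
                    continue; // completed in an earlier run, do not repeat
                }
                Callable<Integer> task = () -> parse(page);
                futures.put(page, pool.submit(task));
            }
            Map<String, Integer> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Integer>> e : futures.entrySet()) {
                results.put(e.getKey(), e.getValue().get()); // wait for the task
                done.add(e.getKey());                        // mark as completed
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Set<String> done = new HashSet<>(Arrays.asList("page-2"));
        List<String> pages = Arrays.asList("page-1", "page-2", "page-3");
        System.out.println(processAll(pages, done, 2));
    }
}
```

In this sketch a second call to `processAll` with the same `done` set submits no work, which is the recovery behaviour in miniature. How Scoopi actually schedules and persists tasks is described in its guide, not here.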

Scoopi Installation

To install and run Scoopi, see the CodeTab Scoopi Guide. It also serves as a step-by-step guide to creating data definition files through a set of examples.