Announce Scoopi
maithilish committed Oct 15, 2018
1 parent 1a0c66a commit 2d48eec
Showing 1 changed file with 8 additions and 17 deletions.
25 changes: 8 additions & 17 deletions README.md
@@ -3,26 +3,17 @@
[![Build Status](https://travis-ci.org/maithilish/gotz.svg?branch=master)](https://travis-ci.org/maithilish/gotz)
[![Coverage Status](https://coveralls.io/repos/github/maithilish/gotz/badge.svg?branch=master)](https://coveralls.io/github/maithilish/gotz?branch=master&service=github)

Gotz Documentation <a href="http://www.codetab.org/gotz-etl/"> Gotz ETL Quickstart and Reference</a>

<hr>
#### Gotz ETL is replaced by the much faster and easier-to-use **Scoopi Web Scraper**

Gotz ETL is a tool to extract data from HTML pages. In Java, one can scrape web pages with libraries such as JSoup and HtmlUnit. But when one intends to extract a huge amount of data from hundreds of pages as a dataset, the task becomes daunting. Scraping libraries such as JSoup do well at scraping data, but they are not meant to handle large sets of data.

The XML definition files of Gotz are complex, and Scoopi replaces them with a much simpler YAML configuration

Gotz is built upon <a href="https://jsoup.org/">JSoup</a> and <a href="http://htmlunit.sourceforge.net/">HtmlUnit</a>, and some of the functionality offered by Gotz over and above the scraping libraries is:
- compare [Gotz XML def](https://github.com/maithilish/gotz/blob/master/src/main/resources/defs/examples/jsoup/ex-1/job.xml) with [Scoopi YAML def](https://github.com/maithilish/scoopi/blob/master/src/main/resources/defs/examples/jsoup/ex-1/job.yml)

- Gotz is completely model driven like a real ETL tool. Data structure, task workflow and pages to scrape are defined with a set of XML definition files and no coding is required
- It can be configured to use either JSoup or HtmlUnit as scraper
- Queries can be written using either Selectors with JSoup or XPath with HtmlUnit
- Gotz persists pages and data to a database so that it can recover from a failed state without repeating tasks already completed
- For transparent persistence, Gotz uses <a href="https://db.apache.org/jdo">JDO Standard</a> and <a href="http://www.datanucleus.org" >DataNucleus AccessPlatform</a>, and you can choose your datastore from a very wide range!
- Gotz is a multithreaded application which processes pages in parallel for maximum throughput. The number of threads allotted to each task pool is configurable based on workload
- Allows transforming, filtering and sorting the data
- Comes with built-in appenders such as FileAppender, DBAppender and ListAppender.
- GotzEngine can be embedded in other programs, which can then access scraped data with ListAppender
- Flexible workflow allows one to change the sequence of steps
- Gotz is extensible. Developers can extend the predefined base steps or even create new ones with different functionality and weave them into the workflow
Scoopi is 5x faster than Gotz.

## Gotz Installation
- Gotz takes around 10 minutes to scrape 1000 pages, whereas Scoopi completes it in under 2 minutes.

To install and run Gotz ETL see [CodeTab Gotz Reference](http://www.codetab.org/gotz-etl/). It is also a step-by-step guide to create data definition files through a set of examples.

## Go to [Scoopi Web Scraper](https://github.com/maithilish/scoopi)
