nutch-jsoup

Precise data extraction with Nutch and Jsoup CSS selectors, configured via XML

Extract data with simple, jQuery-like Jsoup selectors

Example: #nodeId tr:nth-child(2) td:nth-child(2) p

Add a custom field for each specific piece of data that you need
Elasticsearch query integration
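
To show what the example selector above picks out, here is a minimal standalone sketch that uses Jsoup directly, outside of Nutch. The class name `SelectorDemo`, the helper `extract`, and the sample HTML are made up for illustration and are not part of the plugin; the sketch assumes the Jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of the plugin.
class SelectorDemo {
    // Made-up sample page: a 2x2 table whose id matches the example selector.
    static final String HTML = "<table id=\"nodeId\"><tbody>"
            + "<tr><td><p>a1</p></td><td><p>a2</p></td></tr>"
            + "<tr><td><p>b1</p></td><td><p>b2</p></td></tr>"
            + "</tbody></table>";

    // Returns the text of the first element matching the CSS selector,
    // or null when nothing matches.
    static String extract(String html, String selector) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select(selector).first();
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        // Second row, second cell, paragraph inside it.
        System.out.println(extract(HTML, "#nodeId tr:nth-child(2) td:nth-child(2) p"));
    }
}
```

Trying a selector against a saved copy of the target page in this way can be a quick check before putting it into the XML configuration.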

Compatibility

Compatible with Nutch 2.2.1

Installation

Copy "plugin/index-domsjoup" into the plugin directory
Copy "conf/domjsoupconf.xml" into the conf directory
Copy "conf/example.xml" into the conf directory and edit it

Configuration

Add "index-domsjoup" to your plugin.includes property in nutch-site.xml

<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|index-domsjoup|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
 <description></description>
</property>
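
A typo in the plugin id is an easy mistake here, because the id is spelled "index-domsjoup" while the configuration files use "domjsoup". The sketch below checks ids against the value shown above, assuming that Nutch treats plugin.includes as a regular expression that must match the whole plugin id. The class name `PluginIncludesCheck` and the helper `isIncluded` are hypothetical and only illustrate the matching.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Nutch or of this plugin.
class PluginIncludesCheck {
    // The plugin.includes value from the property above.
    static final String INCLUDES =
            "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
            + "|index-domsjoup|urlnormalizer-(pass|regex|basic)|scoring-opic";

    // Assumption: a plugin is enabled when its id matches the whole expression.
    static boolean isIncluded(String pluginId) {
        return Pattern.matches(INCLUDES, pluginId);
    }

    public static void main(String[] args) {
        System.out.println(isIncluded("index-domsjoup")); // id as spelled in this README
        System.out.println(isIncluded("index-domjsoup")); // transposed letters
    }
}
```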

Edit conf/domjsoupconf.xml
Set up the URL filter that decides where precise data extraction applies (add or edit a rule section)

<rule>
    <!-- If the URL contains the "urlcontain" value, use the given file to extract data (here "conf/domjsoup-github.xml") -->
    <urlcontain>http://github.com</urlcontain>
    <file>conf/domjsoup-github.xml</file>
    <!-- true: add a field containing the HTML source -->
    <addhtmlsourcefield>true</addhtmlsourcefield>
</rule>
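
The rule comment describes a substring test on the URL. A minimal sketch of that lookup could look like the following; the class `RuleLookup`, the method `fileFor`, the sample URLs, and the first-match-wins behavior are assumptions made for illustration, not the plugin's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the rule lookup, not the plugin's implementation.
class RuleLookup {
    // Keys are "urlcontain" values, values are extraction files.
    // Assumption: rules are checked in order and the first match wins.
    static String fileFor(Map<String, String> rules, String url) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (url.contains(rule.getKey())) {
                return rule.getValue();
            }
        }
        return null; // no rule applies to this URL
    }

    public static void main(String[] args) {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("http://github.com", "conf/domjsoup-github.xml");

        // Made-up URLs for illustration.
        System.out.println(fileFor(rules, "http://github.com/some/repo"));
        System.out.println(fileFor(rules, "https://github.com/some/repo"));
    }
}
```

Under this substring assumption, a "urlcontain" value of "http://github.com" would not match an "https://" URL, so a shorter value such as "github.com" may be the safer choice.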

Create domjsoup-github.xml or copy "conf/example.xml" and edit it

See comments in "conf/example.xml"

How to test

You can test it using the "indexchecker" command

bin/nutch indexchecker http://yourUrl.html