jorgelbg/mimetype-filter

mimetype-filter

Note

This plugin has been merged into the trunk of the Nutch project. Starting with version 1.10, it will be bundled with the Nutch distribution.

Nutch 1.x plugin that filters the indexed documents by the MIME type of the crawled web pages. It lets you restrict which MIME types are stored in your Solr/Elasticsearch index without restricting the crawling/parsing process, so there is no need to use the URLFilter family of plugins. It also addresses one particular corner case: some URLs, such as certain RSS feeds (http://www.awesomesite.com/feed), carry no format suffix to filter on, so they would otherwise end up in your index mixed with all your HTML pages.

Configuration

The rules live in a custom configuration file, whose name is specified with the mimetype.filter.file property in your nutch-site.xml file:

<property>
    <name>mimetype.filter.file</name>
    <value>mimetype-filter.txt</value>
</property>

If no mimetype.filter.file key is found in your nutch-site.xml file, an allow policy is used instead, so all your crawled documents will be indexed.

The rules file uses the same format as the urlfilter-suffix plugin: it starts with a general policy (+ to allow everything, - to block everything), followed by a list of exceptions to the general rule. For example:

-
image

This blocks all MIME types except those that contain image in the MIME type string extracted by Tika, which allows all kinds of images to be indexed.

+
text/html

This second example, on the other hand, allows every document type to be indexed except text/html.
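The policy-plus-exceptions logic described above can be sketched in a few lines of Java. This is a minimal illustration, not the plugin's actual code: the class name `MimeRuleSketch` and the method `shouldIndex` are hypothetical, and the sketch assumes that an exception applies whenever its pattern is found anywhere in the MIME type string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative sketch only; names and matching strategy are assumptions. */
public class MimeRuleSketch {
    private final boolean allowByDefault;
    private final List<Pattern> exceptions = new ArrayList<>();

    /** Parses rules text: an optional first line "+" or "-", then one exception per line. */
    public MimeRuleSketch(String rulesText) {
        boolean allow = true;   // with no policy line, everything is allowed
        boolean first = true;
        for (String raw : rulesText.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (first) {
                first = false;
                if (line.equals("+") || line.equals("-")) {
                    allow = line.equals("+");
                    continue;
                }
            }
            exceptions.add(Pattern.compile(line));
        }
        this.allowByDefault = allow;
    }

    /** A matching exception inverts the general policy for that MIME type. */
    public boolean shouldIndex(String mimeType) {
        boolean matched = exceptions.stream()
                .anyMatch(p -> p.matcher(mimeType).find());
        return matched != allowByDefault;
    }

    public static void main(String[] args) {
        MimeRuleSketch onlyImages = new MimeRuleSketch("-\nimage");
        System.out.println(onlyImages.shouldIndex("image/png"));   // true
        System.out.println(onlyImages.shouldIndex("text/html"));   // false

        MimeRuleSketch noHtml = new MimeRuleSketch("+\ntext/html");
        System.out.println(noHtml.shouldIndex("text/html"));       // false
        System.out.println(noHtml.shouldIndex("application/pdf")); // true
    }
}
```

Passing an empty rules string to the sketch yields the allow-everything behaviour, which mirrors what the README describes when mimetype.filter.file is not set.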

NOTE: Keep in mind that the rules that follow the global policy are evaluated as regular expressions.
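If rules are indeed compiled as regular expressions, MIME types that contain regex metacharacters need some care. The sketch below uses plain `java.util.regex` to show the effect. The helper `ruleMatches` is hypothetical, and the unanchored "found anywhere" matching is an assumption that has not been checked against the plugin itself.

```java
import java.util.regex.Pattern;

/** Illustrative sketch only; not part of the plugin. */
public class MimeRegexPitfalls {
    /** True if the rule, compiled as a regex, is found anywhere in the MIME type. */
    static boolean ruleMatches(String rule, String mimeType) {
        return Pattern.compile(rule).matcher(mimeType).find();
    }

    public static void main(String[] args) {
        String rss = "application/rss+xml";

        // An unescaped '+' means "one or more of the preceding 's'",
        // so the literal MIME type is NOT matched.
        System.out.println(ruleMatches("application/rss+xml", rss));   // false

        // Escaping the '+' matches the literal string.
        System.out.println(ruleMatches("application/rss\\+xml", rss)); // true

        // Pattern.quote is an alternative when building patterns in code.
        System.out.println(ruleMatches(Pattern.quote(rss), rss));      // true

        // A short, unanchored rule matches many different MIME types.
        System.out.println(ruleMatches("xml", "text/xml"));            // true
        System.out.println(ruleMatches("xml", rss));                   // true
    }
}
```

The doubled backslash above is Java string escaping. Under the same assumption, the corresponding line in a rules file would be written with a single backslash: `application/rss\+xml`.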
