mimetype-filter

Note

This plugin has been merged in the trunk of the Nutch project. Starting from version 1.10 this plugin will be bundled with the Nutch distribution

Nutch 1.x plugin that allows to filter the indexed documents by the MIME type property of the crawled web pages. Basically this will allow you to restrict the MIME type of the contents that will be stored in Solr/Elasticsearch index without the need to restrict the crawling/parsing process, so no need to use URLFilter plugins family. Also this address one particular corner case when certain URLs doesn't have any format to filter such as some RSS feeds (http://www.awesomesite.com/feed) and it will end in your index mixed with all your HTML pages.

Configuration

A custom configuration file that will hold the rules that can be specified using the a property mimetype.filter.file in your nutch-site.xml file:

<property>
    <name>mimetype.filter.file</name>
    <value>mimetype-filter.txt</value>
</property>

If no mimetype.filter.file key is found in your nutch-site.xml file an allow policy is used instead, so all your crawled documents will be indexed.

The rules configuration file use the same format as the urlfilter-suffix plugin, basically it starts with a general policy (+ to allow everything and - to block everything) an it follows a list of exceptions to the general rule, so:

-
image

Will block all mimetypes except those that contain image in the MIME type string extracted by Tika, wich will allow the indexing of all kind of images.

+
text/html

On the other hand this second example allows the indexing of every document type but blocks the text/html.

NOTE: Keep in mind that the rules that follows the global policy are evaluated as regular expressions.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
sample		sample
src		src
.gitignore		.gitignore
README.md		README.md
build.xml		build.xml
ivy.xml		ivy.xml
plugin.xml		plugin.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sample

sample

src

src

.gitignore

.gitignore

README.md

README.md

build.xml

build.xml

ivy.xml

ivy.xml

plugin.xml

plugin.xml

Repository files navigation

mimetype-filter

Note

Configuration

About

Releases

Packages

Languages

jorgelbg/mimetype-filter

Folders and files

Latest commit

History

Repository files navigation

mimetype-filter

Note

Configuration

About

Resources

Stars

Watchers

Forks

Languages