RenameTagger only works with collector.http.* fields #12

kalhomoud · 2013-06-04T22:09:51Z

Hello,

For some reason, RenameTagger only works when the field name starts with collector.http.*, such as collector.http.MIMETYPE. It did not work for me when the field name was "support_url" or "dc:title".

Here is how I have it set up:
.
.
.
.

.
.
.
.

Please let me know if you need my config to reproduce the issue.

Thanks,
Khalid

essiembre · 2013-06-05T00:50:21Z

This one works as expected. For the crawler to know about a document's metadata, it has to parse the document first. If you simply change "preParseHandlers" to "postParseHandlers", it will work. The reason "some" metadata is available to pre-parse handlers is that whatever the crawler finds in the HTTP headers or while extracting URLs is added as extra metadata. So that this extra metadata is not mixed up with the actual document metadata once the document is parsed, its field names are prefixed with "collector.http.".
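To illustrate the ordering issue, here is a toy model in Python. It is not the crawler's code or its API: every function name and the sample values are made up for illustration, and only the "collector.http." prefix and the field names come from this thread. It just shows why a rename on "dc:title" finds nothing before parsing and succeeds afterwards.

```python
# Toy model only: all names below are hypothetical, not the Norconex API.
PREFIX = "collector.http."


def pre_parse_metadata(http_headers):
    """Metadata known before parsing: HTTP-derived values, prefixed."""
    return {PREFIX + name: value for name, value in http_headers.items()}


def post_parse_metadata(pre_parse, document_fields):
    """Metadata after parsing: the prefixed extras plus the document's own fields."""
    merged = dict(pre_parse)
    merged.update(document_fields)
    return merged


def rename_field(metadata, from_name, to_name):
    """Rename a field in place if it exists; return True when a rename happened."""
    if from_name not in metadata:
        return False
    metadata[to_name] = metadata.pop(from_name)
    return True


# Sample values, invented for the example.
headers = {"MIMETYPE": "text/html"}
parsed_fields = {"dc:title": "Example page", "support_url": "http://example.com/help"}

before = pre_parse_metadata(headers)
after = post_parse_metadata(before, parsed_fields)

# Before parsing, "dc:title" is not known yet, so the rename is a no-op.
renamed_before_parse = rename_field(dict(before), "dc:title", "title")

# After parsing, the field exists and can be renamed.
renamed_after_parse = rename_field(after, "dc:title", "title")
```

In this sketch a rename of a "collector.http.*" field would succeed at either stage, which matches the behaviour reported in the issue.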

essiembre closed this as completed Jun 5, 2013

essiembre added a commit that referenced this issue Apr 21, 2023

Merge pull request #12 from Norconex/pascal/crawler-filesystem

b58265b

Web crawler maven build now produces a fat jar.
