RenameTagger only works with collector.http.* fields #12

Closed
kalhomoud opened this issue Jun 4, 2013 · 1 comment

@kalhomoud

Hello,

For some reason, RenameTagger only works when the field name starts with collector.http.*, such as collector.http.MIMETYPE. It didn't work for me when the field name was "support_url" or "dc:title".

Here is how I have it set up:
.
.
.
.







.
.
.
.

Please let me know if you need my config to reproduce the issue.

Thanks,
Khalid

@essiembre
Contributor

This one works as expected. For the crawler to know about a document's metadata, it has to parse the document first. If you simply change "preParseHandlers" to "postParseHandlers", it will work. The reason you have "some" metadata available in pre-parse handlers is that whatever the crawler could gather from the HTTP headers or from URL extraction is added as extra metadata. To make sure that extra metadata is not mixed up with the actual document metadata once the document is parsed, those fields are prefixed with "collector.http.".
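To make the ordering concrete, here is a rough toy model in Python. It is not the Norconex API: `rename_field`, `crawl`, and the sample values are hypothetical names made up for illustration, and it assumes a rename on a missing field is a silent no-op. It only shows why a rename on "dc:title" finds nothing before parsing but works afterwards.

```python
# Toy model only: this is NOT the Norconex API. All names are hypothetical.

def rename_field(metadata, from_name, to_name):
    """Rename a field if present; assumed to be a silent no-op otherwise."""
    if from_name in metadata:
        metadata[to_name] = metadata.pop(from_name)
    return metadata


def crawl(http_headers, parsed_fields, pre_parse=(), post_parse=()):
    """Run rename rules before and after a simulated parse step."""
    # Before parsing, only crawler-supplied values exist, stored under the
    # "collector.http." prefix.
    metadata = {"collector.http." + k: v for k, v in http_headers.items()}
    for from_name, to_name in pre_parse:
        rename_field(metadata, from_name, to_name)

    # Parsing is what adds the document's own fields (dc:title, etc.).
    metadata.update(parsed_fields)
    for from_name, to_name in post_parse:
        rename_field(metadata, from_name, to_name)
    return metadata


headers = {"MIMETYPE": "text/html"}
fields = {"dc:title": "Example page", "support_url": "http://example.com/help"}

pre = crawl(headers, fields, pre_parse=[("dc:title", "title")])
post = crawl(headers, fields, post_parse=[("dc:title", "title")])

print("title" in pre)   # False: dc:title did not exist yet when the rule ran
print("title" in post)  # True: the rule ran after parsing
```

In this sketch a pre-parse rule can still rename "collector.http.MIMETYPE", because that field is already present before the parse step, which matches the behaviour described in the report.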

essiembre added a commit that referenced this issue Apr 21, 2023
Web crawler maven build now produces a fat jar.