-
Notifications
You must be signed in to change notification settings - Fork 755
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
Warc writer chain #285
Warc writer chain #285
Conversation
seems better/cleaner and we will want the single json form if/when we start writing it to warc
exercised only lightly at this point
doesn't test that much though
fixes NPE trying to write a dns revisit record
this way other classes that extend other classes can also implement WARCRecordBuilder
ExtractorYoutubeDL implements WARCRecordBuilder
Would like to get your eyes on this @anjackson and/or @ato, at your convenience :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a sensible idea.
The one possible problem that jumps out to me is that some of the record builders make assumptions about concurrentTo being present or not present. It looks fine with the default config but changing the order, disabling a builder or adding a custom one could either cause a NullPointerException or missing header. I think we should probably do the concurrentTo if-statement consistently in every builder.
modules/src/main/java/org/archive/modules/warc/FtpResponseRecordBuilder.java
Outdated
Show resolved
Hide resolved
and equivalent for other sites. Usually we find these links through normal link extraction, but we have the info here, so we may as well use it to make sure.
because opt() returns org.json.JSONObject.Null
I think this is ready for merge now. Only lightly documented, but maybe enough. |
Idea here is to make warc writing configurable such that a custom processor can write a special warc record. The proximate need is to write special youtube-dl json records. Really didn't want to hack something specific to that case into WARCWriterProcessor.
The default chain (see profile-crawler-beans.cxml):
As an example, to let ExtractorYoutubeDL write youtube-dl json warc records, add this to the chain:
(You will have an identical element in the fetch chain.)
Informative code snippet:
We've had this running successfully in QA for 4 months, to the point I had completely forgotten about it, so I think it's pretty much complete. (I had aspirations to refactor some of the messy stuff like managing the dedup metadata in
updateMetadataAfterWrite()
. But I don't have a specific approach in mind and there's no urgent need.) This code may inherit bugs from legacy WARCWriterProcessor. Probably just needs javadocs 馃槵