Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warc writer chain #285

Merged
merged 18 commits into from
Jan 14, 2020
Merged

Conversation

nlevitt
Copy link
Contributor

@nlevitt nlevitt commented Oct 8, 2019

Idea here is to make warc writing configurable such that a custom processor can write a special warc record. The proximate need is to write special youtube-dl json records. Really didn't want to hack something specific to that case into WARCWriterProcessor.

The default chain (see profile-crawler-beans.cxml):

  <property name="chain">
   <list>
    <bean class="org.archive.modules.warc.DnsResponseRecordBuilder"/>
    <bean class="org.archive.modules.warc.HttpResponseRecordBuilder"/>
    <bean class="org.archive.modules.warc.WhoisResponseRecordBuilder"/>
    <bean class="org.archive.modules.warc.FtpControlConversationRecordBuilder"/>
    <bean class="org.archive.modules.warc.FtpResponseRecordBuilder"/>
    <bean class="org.archive.modules.warc.RevisitRecordBuilder"/>
    <bean class="org.archive.modules.warc.HttpRequestRecordBuilder"/>
    <bean class="org.archive.modules.warc.MetadataRecordBuilder"/>
   </list>
  </property>

As an example, to let ExtractorYoutubeDL write youtube-dl json warc records, add this to the chain:

        <ref bean="extractorYoutubeDL"/>

(You will have an identical element in the fetch chain.)

Informative code snippet:

    protected void writeRecords(CrawlURI curi, WARCWriter writer) throws IOException {
        URI concurrentTo = null;
        for (WARCRecordBuilder recordBuilder: getChain()) {
            if (recordBuilder.shouldBuildRecord(curi)) {
                WARCRecordInfo record = recordBuilder.buildRecord(curi, concurrentTo);
                if (record != null) {
                    writer.writeRecord(record);
                    if (concurrentTo == null) {
                        concurrentTo = record.getRecordId();
                    }
                }
            }
        }
    }

We've had this running successfully in QA for 4 months, to the point I had completely forgotten about it, so I think it's pretty much complete. (I had aspirations to refactor some of the messy stuff like managing the dedup metadata in updateMetadataAfterWrite(). But I don't have a specific approach in mind and there's no urgent need.) This code may inherit bugs from legacy WARCWriterProcessor. Probably just needs javadocs 馃槵

nlevitt added 12 commits June 7, 2019 14:24
seems better/cleaner and we will want the single json form if/when we
start writing it to warc
exercised only lightly at this point
doesn't test that much though
fixes NPE trying to write a dns revisit record
this way other classes that extend other classes can also implement
WARCRecordBuilder
ExtractorYoutubeDL implements WARCRecordBuilder
@nlevitt nlevitt requested review from anjackson and ato October 8, 2019 20:04
@nlevitt
Copy link
Contributor Author

nlevitt commented Oct 8, 2019

Would like to get your eyes on this @anjackson and/or @ato, at your convenience :)

Copy link
Collaborator

@ato ato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a sensible idea.

The one possible problem that jumps out to me is that some of the record builders make assumptions about concurrentTo being present or not present. It looks fine with the default config but changing the order, disabling a builder or adding a custom one could either cause a NullPointerException or missing header. I think we should probably do the concurrentTo if-statement consistently in every builder.

and equivalent for other sites. Usually we find these links through
normal link extraction, but we have the info here, so we may as well use
it to make sure.
because opt() returns org.json.JSONObject.Null
@nlevitt nlevitt changed the title WIP: Warc writer chain Warc writer chain Dec 31, 2019
@nlevitt
Copy link
Contributor Author

nlevitt commented Dec 31, 2019

I think this is ready for merge now. Only lightly documented, but maybe enough.

@jkafader jkafader merged commit 3a8b589 into internetarchive:master Jan 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants