Warc writer chain #285

Merged 18 commits on Jan 14, 2020

Commits on Jun 7, 2019

  1. 25848e9
  2. use --dump-single-json

    seems better/cleaner and we will want the single json form if/when we
    start writing it to warc
    nlevitt committed Jun 7, 2019
    4d6314b

Commits on Jun 11, 2019

  1. configurable warc writer chain

    exercised only lightly at this point
    nlevitt committed Jun 11, 2019
    9435f76

Commits on Jun 12, 2019

  1. same test as for WARCWriterProcessor

    doesn't test that much though
    nlevitt committed Jun 12, 2019
    870d847
  2. oops, handle https too

    nlevitt committed Jun 12, 2019
    ece4358
  3. default chain in code

    nlevitt committed Jun 12, 2019
    8c4c443
  4. 7c31b07
  5. e928c21
  6. revisits only for http and ftp

    fixes NPE trying to write a dns revisit record
    nlevitt committed Jun 12, 2019
    9e67a8d
  7. make WARCRecordBuilder an interface

    this way other classes that extend other classes can also implement
    WARCRecordBuilder
    nlevitt committed Jun 12, 2019
    9ddd281
  8. 5177b2b
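
The Jun 11 and Jun 12 commits describe a configurable chain of record builders, a builder interface, and a revisit rule limited to http and ftp (with https handled too). A minimal sketch of how those pieces might fit together follows. Every name in it (`Fetch`, `RecordBuilder`, `defaultChain`, `writeRecords`) is a hypothetical stand-in, not the project's actual `WARCRecordBuilder` API, and records are plain strings for brevity.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WarcWriterChainSketch {
    /** Hypothetical stand-in for a fetched URI plus what the crawler knows about it. */
    static final class Fetch {
        final URI uri;
        final boolean duplicate; // true when the payload matches an earlier capture

        Fetch(String uri, boolean duplicate) {
            this.uri = URI.create(uri);
            this.duplicate = duplicate;
        }
    }

    /** An interface, so a class that already extends another class can still build records. */
    interface RecordBuilder {
        boolean shouldBuildRecord(Fetch fetch);
        String buildRecord(Fetch fetch);
    }

    static final List<String> REVISIT_SCHEMES = Arrays.asList("http", "https", "ftp");

    /** Writes revisit records only for http, https and ftp, so a dns fetch never reaches it. */
    static final class RevisitBuilder implements RecordBuilder {
        public boolean shouldBuildRecord(Fetch f) {
            return f.duplicate && REVISIT_SCHEMES.contains(f.uri.getScheme());
        }
        public String buildRecord(Fetch f) { return "revisit " + f.uri; }
    }

    static final class ResponseBuilder implements RecordBuilder {
        public boolean shouldBuildRecord(Fetch f) {
            return !f.duplicate && REVISIT_SCHEMES.contains(f.uri.getScheme());
        }
        public String buildRecord(Fetch f) { return "response " + f.uri; }
    }

    static final class DnsBuilder implements RecordBuilder {
        public boolean shouldBuildRecord(Fetch f) { return "dns".equals(f.uri.getScheme()); }
        public String buildRecord(Fetch f) { return "dns " + f.uri; }
    }

    /** A default chain defined in code; a configured list could replace it. */
    static List<RecordBuilder> defaultChain() {
        return Arrays.<RecordBuilder>asList(new DnsBuilder(), new ResponseBuilder(), new RevisitBuilder());
    }

    /** Asks each builder in order whether it applies and collects the records it builds. */
    static List<String> writeRecords(List<RecordBuilder> chain, Fetch fetch) {
        List<String> records = new ArrayList<>();
        for (RecordBuilder builder : chain) {
            if (builder.shouldBuildRecord(fetch)) {
                records.add(builder.buildRecord(fetch));
            }
        }
        return records;
    }
}
```

In this sketch a dns fetch flagged as a duplicate produces only a dns record, because the revisit builder declines it up front instead of failing while building one.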

Commits on Jun 13, 2019

  1. write youtube-dl json to the warc

    ExtractorYoutubeDL implements WARCRecordBuilder
    nlevitt committed Jun 13, 2019
    bff33e0

Commits on Oct 15, 2019

  1. 1b95453

Commits on Nov 15, 2019

  1. fix non-playlist case (oops!)

    nlevitt committed Nov 15, 2019
    4983e65
  2. extract watch page links from youtube playlists

    and equivalent for other sites. Usually we find these links through
    normal link extraction, but we have the info here, so we may as well use
    it to make sure.
    nlevitt committed Nov 15, 2019
    c4b0816

Commits on Nov 16, 2019

  1. 6e7bc54

Commits on Nov 18, 2019

  1. use JSONObject.isNull()

    because opt() returns org.json.JSONObject.Null
    nlevitt committed Nov 18, 2019
    ff8ebc4
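
The Nov 18 commit turns on a sentinel-object pitfall: when a JSON value is an explicit null, `opt()` hands back a sentinel object rather than a Java `null`, so a plain `!= null` check passes when it should not. The sketch below reproduces the pattern with a small stand-in class so that it runs without the org.json dependency. `MiniJsonObject`, `putNull`, and the `entries` key are assumptions made for illustration; only the `opt()` and `isNull()` behavior is modeled on `JSONObject`.

```java
import java.util.HashMap;
import java.util.Map;

public class NullSentinelSketch {
    /** Sentinel standing in for an explicit JSON null (hypothetical, mirrors the idea only). */
    static final Object NULL = new Object() {
        @Override public String toString() { return "null"; }
    };

    /** Tiny stand-in for a JSON object; not the org.json class. */
    static final class MiniJsonObject {
        private final Map<String, Object> values = new HashMap<>();

        MiniJsonObject put(String key, Object value) {
            values.put(key, value);
            return this;
        }

        /** Records an explicit JSON null, as a parser would for {"key": null}. */
        MiniJsonObject putNull(String key) {
            values.put(key, NULL);
            return this;
        }

        /** Returns Java null only when the key is absent; an explicit null yields the sentinel. */
        Object opt(String key) {
            return values.get(key);
        }

        /** True when the key is absent or holds the explicit-null sentinel. */
        boolean isNull(String key) {
            Object value = values.get(key);
            return value == null || value == NULL;
        }
    }

    /** Naive check: wrongly reports a value when the key holds an explicit null. */
    static boolean hasEntriesNaive(MiniJsonObject json) {
        return json.opt("entries") != null;
    }

    /** Safer check: treats both a missing key and an explicit null as "no entries". */
    static boolean hasEntries(MiniJsonObject json) {
        return !json.isNull("entries");
    }
}
```

With `entries` set to an explicit null, `hasEntriesNaive` returns true while `hasEntries` returns false, which is the difference the commit message points at.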

Commits on Dec 31, 2019

  1. basic level of documentation

    nlevitt committed Dec 31, 2019
    31b9960