Heritrix sometimes writes empty WARC records for redirects #204

Open
anjackson opened this issue May 1, 2018 · 1 comment
@anjackson
Collaborator

Just noticed an oddity in our crawls. We have a WARC response record with no HTTP response in it (see below). This seems to be due to the crawler getting an HTTP 204 response.

However, I only think that because @ikreymer's pywb cdx-indexer creates this CDX line:

com,facebook)/plugins/like.php?action=like&colorscheme=light&height=21&href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105 20180422171119 http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21 unk 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 383 10514026 BL-20180422170134461-00018-63~ukwa-h3-pulse-daily~8443.warc.gz

But frankly I don't understand where it's getting the 204 from!
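To make the columns of that CDX line easier to read, here is a minimal sketch that splits it into named fields. The field names assume the common 11-column CDX layout (`N b a m s k r M S V g`); that assumption rests only on the column count, and `parse_cdx_line` is a hypothetical helper, not part of pywb.

```python
# Field names follow the common 11-column CDX layout ("N b a m s k r M S V g").
# That this line uses that layout is an assumption based on its column count.
CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "length", "offset", "filename",
]


def parse_cdx_line(line):
    """Split one space-separated CDX line into a dict keyed by CDX_FIELDS."""
    parts = line.split()
    if len(parts) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CDX_FIELDS, parts))


# The CDX line quoted above, rebuilt from its pieces to keep lines short.
page = "http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/"
line = (
    "com,facebook)/plugins/like.php?action=like&colorscheme=light&height=21"
    f"&href={page}&layout=button_count&show_faces=false&width=105"
    " 20180422171119"
    f" http://www.facebook.com/plugins/like.php?href={page}"
    "&layout=button_count&show_faces=false&width=105"
    "&action=like&colorscheme=light&height=21"
    " unk 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 383 10514026"
    " BL-20180422170134461-00018-63~ukwa-h3-pulse-daily~8443.warc.gz"
)

record = parse_cdx_line(line)
print(record["mimetype"], record["statuscode"], record["length"], record["offset"])
```

Read this way, the 204 sits in the status column next to an `unk` MIME type, so both values come from the indexer rather than from anything stored in the record itself.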

Assuming it is really a 204 (I'll check the crawl log), the question is: What should Heritrix3 be writing to the WARC file?

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21
WARC-Date: 2018-04-22T17:11:19Z
WARC-IP-Address: 157.240.1.35
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Record-ID: <urn:uuid:bf7d95c5-0844-4778-9490-1af393b53204>
Content-Length: 0



WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21
WARC-Date: 2018-04-22T17:11:19Z
WARC-Concurrent-To: <urn:uuid:bf7d95c5-0844-4778-9490-1af393b53204>
WARC-Record-ID: <urn:uuid:1fa4ddfb-2285-48b3-a835-61378b29a1d4>
Content-Length: 0



WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21
WARC-Date: 2018-04-22T17:11:19Z
WARC-Concurrent-To: <urn:uuid:bf7d95c5-0844-4778-9490-1af393b53204>
WARC-Record-ID: <urn:uuid:dc148cb3-2c39-42c5-b1c6-02654fe428b7>
Content-Type: application/warc-fields
Content-Length: 564

via: http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/
hopsFromSeed: LLLE
sourceTag: http://newspig.co.uk/
fetchTimeMs: 12
charsetForLinkExtraction: ISO-8859-1
outlink: https://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fnewspig.co.uk%2F8-reasons-to-hold-cash-markets-are-rational-until-theyre-not%2F&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21 R Location:
outlink: http://www.facebook.com/favicon.ico I =INFERRED_MISC
outlink: http://www.facebook.com/ I =INFERRED_MISC
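One detail in the response record can be checked independently: the `WARC-Payload-Digest` value is the Base32-encoded SHA-1 of zero bytes, which is consistent with `Content-Length: 0`. A small sketch to confirm this (`warc_sha1_digest` is a hypothetical helper name):

```python
import base64
import hashlib


def warc_sha1_digest(payload: bytes) -> str:
    """Return the Base32-encoded SHA-1 of payload, as in 'sha1:...' digest headers."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


# Value copied from the WARC-Payload-Digest header of the response record above.
reported = "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"

# Compare the reported digest with the digest of an empty payload.
is_empty_payload = warc_sha1_digest(b"") == reported
print(is_empty_payload)
```

So the digest was computed over an empty payload; it does not say anything about what the server actually returned.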



@ato
Collaborator

ato commented Aug 2, 2018

From the extracted links, it seems to be a redirect, not a 204: the metadata record lists an outlink with hop type R and context "Location:".
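A rough sketch of how that check could be automated: scan the `application/warc-fields` body of the metadata record for `outlink:` entries whose hop type is `R`. The `<url> <hop-type> <context>` layout is inferred from the record quoted in this issue, and `redirect_outlinks` is a hypothetical helper.

```python
def redirect_outlinks(warc_fields: str):
    """Return the URLs of outlinks whose hop type is 'R' in a metadata body."""
    targets = []
    for line in warc_fields.splitlines():
        name, _, value = line.partition(":")
        if name.strip() != "outlink":
            continue
        parts = value.split()
        # Assumed layout of the value: "<url> <hop-type> <context>"
        if len(parts) >= 2 and parts[1] == "R":
            targets.append(parts[0])
    return targets


# Fields copied from the metadata record quoted in this issue.
body = "\n".join([
    "hopsFromSeed: LLLE",
    "fetchTimeMs: 12",
    "outlink: https://www.facebook.com/plugins/like.php"
    "?href=http%3A%2F%2Fnewspig.co.uk%2F8-reasons-to-hold-cash-markets-are-rational-until-theyre-not%2F"
    "&layout=button_count&show_faces=false&width=105"
    "&action=like&colorscheme=light&height=21 R Location:",
    "outlink: http://www.facebook.com/favicon.ico I =INFERRED_MISC",
    "outlink: http://www.facebook.com/ I =INFERRED_MISC",
])

for url in redirect_outlinks(body):
    print(url)
```

For this record the only `R` outlink is the `https://` version of the same like.php URL, so the crawler evidently saw a `Location` header even though no response headers ended up in the WARC.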

@ato changed the title from "Heritrix appears to write empty WARC records for HTTP 204 responses" to "Heritrix sometimes writes empty WARC records for redirects" on Aug 2, 2018
@ato added the bug label on Aug 2, 2018