Crawl job stats and reports misleading when excluding PDF-Files (follow up to issue #453) #455

Open
oschihin opened this issue Dec 20, 2021 · 3 comments

oschihin commented Dec 20, 2021

Using the advice in issue #453, I successfully excluded unwanted PDF documents from being fetched and written to the WARC. But this method seems to generate misleading reports and stats.
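For context, the exclusion was set up along the lines suggested in #453, roughly like the crawler-beans.cxml fragment below (a sketch; the decide-rule classes and the regex are illustrative, not copied verbatim from our job):

<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <property name="shouldFetchBodyRule">
    <bean class="org.archive.modules.deciderules.DecideRuleSequence">
      <property name="rules">
        <list>
          <!-- accept everything by default ... -->
          <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
          <!-- ... but abort the download after the headers when the
               response Content-Type matches the regex -->
          <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
            <property name="decision" value="REJECT" />
            <property name="regex" value="^application/(pdf|zip)" />
          </bean>
        </list>
      </property>
    </bean>
  </property>
</bean>

A similar rule on the warcWriter bean's shouldProcessRule is presumably what keeps the aborted records out of the WARC.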

mimetype-report

The report shows pdf and zip files with counts and bytes, although both are excluded:

[#urls] [#bytes] [mime-types]
6556 234271851 text/html
4193 8659344 application/pdf
42 1829002 image/jpeg
26 508206 text/css
23 239633 image/png
15 811627 application/javascript
14 1462995 application/vnd.openxmlformats-officedocument.wordprocessingml.document
9 18531 application/zip
7 1149664 image/svg+xml
4 49430 image/gif
2 97178 application/font-woff2
2 241457 application/vnd.ms-fontobject
2 240859 application/x-font-ttf
2 124253 application/x-font-woff
2 20934 text/xml
2 4400 unknown
1 212071 application/vnd.ms-excel
1 56 text/dns
1 2419 text/plain

Count of Content-Type from the WARC file

If I grep and count the Content-Type fields from the WARC, this is what I get. No pdf and no zip:

6702 Content-Type: application/warc-fields
6701 Content-Type: application/http; msgtype=response
6701 Content-Type: application/http; msgtype=request
6190 Content-Type: text/html;charset=UTF-8
 356 Content-Type: text/html; charset=iso-8859-1
  42 Content-Type: image/jpeg;charset=UTF-8
  26 Content-Type: text/css;charset=UTF-8
  23 Content-Type: image/png;charset=UTF-8
  15 Content-Type: application/javascript;charset=UTF-8
  14 Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=UTF-8
  10 Content-Type: text/html
   7 Content-Type: image/svg+xml;charset=UTF-8
   4 Content-Type: image/gif;charset=UTF-8
   2 Content-Type: text/xml;charset=UTF-8
   2 Content-Type: application/x-font-woff;charset=UTF-8
   2 Content-Type: application/x-font-ttf;charset=UTF-8
   2 Content-Type: application/vnd.ms-fontobject;charset=UTF-8
   2 Content-Type: application/font-woff2;charset=UTF-8
   1 Content-Type: text/plain
   1 Content-Type: text/dns
   1 Content-Type: application/vnd.ms-excel;charset=UTF-8
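For reference, a count like the one above can be produced with something along these lines (the WARC filename is illustrative; grep -a is needed because the WARC contains binary payloads):

  zcat crawl.warc.gz | grep -a '^Content-Type:' | sort | uniq -c | sort -rn

Note that this counts both the WARC-level Content-Type headers (application/warc-fields, application/http) and the Content-Type headers of the archived HTTP messages.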

Crawled Bytes

  • Total crawled bytes according to the crawl summary: 249943910 (238 MiB)
  • Size of the zipped WARC file is 70 MB, unzipped 333 MB

Problem

We use the reports and logs in our archive for an overview of the content, and in a case like this they are dangerously misleading. Is there an explanation, and maybe a fix for the problem?

ato added the bug label Dec 21, 2021

ato commented Dec 21, 2021

Since neither FetchHTTP choosing not to download the response body nor the WarcWriter choosing not to write the record changes the fetch status code of the CrawlURI, it is still considered a success for statistics purposes.

As for fixing it, well, WorkQueueFrontier.processFinish() is where the decision gets made: a URI is treated as a success, disregarded, or a failure. I suppose either the definitions of CrawlURI.isSuccess() and WorkQueueFrontier.isDisregarded() could be changed so that URIs with the midFetchAbort annotation are considered disregarded, or the abort itself could be changed to call setFetchStatus(S_OUT_OF_SCOPE).
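To sketch the first option (the annotation and method names are the ones mentioned above; the existing status-code cases are abbreviated and the FetchStatusCodes constants are assumed statically imported, so this is an illustration rather than a patch):

protected boolean isDisregarded(CrawlURI curi) {
    // newly treat mid-fetch-aborted URIs as disregarded so they
    // no longer inflate the success statistics and reports
    if (curi.getAnnotations().contains("midFetchAbort")) {
        return true;
    }
    switch (curi.getFetchStatus()) {
    case S_ROBOTS_PRECLUDED:   // existing disregard cases,
    case S_OUT_OF_SCOPE:       // abbreviated here
        return true;
    default:
        return false;
    }
}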

This would have some side-effects though: extractors wouldn't run, the record wouldn't be written to the WARC file, and the request wouldn't be charged to the queue's budget. In your case those are desirable, as the goal is for the PDF to be treated as out of scope. I guess the question is whether there are other use cases for FetchHTTP's shouldFetchBodyRule where those side-effects would be undesirable.


ato commented Dec 21, 2021

Another idea: perhaps the full scope should be re-evaluated after the response headers are received. That would mean putting a content-type decide rule in the normal scope would "just work", which might be less surprising to the operator.
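Under that idea, the operator-facing configuration could be an ordinary scope rule, something like the fragment below (a sketch assuming the existing content-type decide rule class; today this cannot work in scope because the content type is only known once the headers have arrived):

<!-- inside the scope's DecideRuleSequence -->
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
  <property name="decision" value="REJECT" />
  <property name="regex" value="^application/pdf" />
</bean>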

oschihin commented

Thanks for the information. This makes sense, and even if it is not a perfect situation for our use case, thinking about it, we can live with it. We do produce scope.log etc., and even though these logs tend to be pretty large, they show the effects of our "scoping" or appraisal decisions. We would need to explain that, but it makes for transparency.

I am rather sceptical about your second idea, if only for performance and runtime reasons.
