crawl level and host level limits on *novel* (not deduplicated) bytes and urls #138

Merged
merged 21 commits into internetarchive:master from nlevitt:novel-quotas on Jan 14, 2016

Conversation

nlevitt commented Dec 10, 2015

No description provided.

@nlevitt nlevitt changed the title from QuotaEnforcer support for quotas on novelBytes and novelUrls, with tests to crawl level and host level limits on *novel* bytes and urls Dec 10, 2015

@nlevitt nlevitt changed the title from crawl level and host level limits on *novel* bytes and urls to crawl level and host level limits on *novel* (not deduplicated) bytes and urls Dec 10, 2015
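
For context on the retitle: "novel" bytes are those actually stored for the first time, as opposed to bytes that deduplication writes off as a hash-identical duplicate or an unmodified revisit. A minimal sketch of that bookkeeping, loosely modeled on Heritrix's CrawledBytesHistotable (the bucket names here are assumptions, not the class's real constants):

```java
import java.util.HashMap;

// Loosely modeled on Heritrix's CrawledBytesHistotable; key names below are
// illustrative assumptions, not the real constants.
public class NovelBytesTally extends HashMap<String, Long> {

    /** Classify one finished fetch into novel vs. deduplicated buckets. */
    public void accumulate(long contentBytes, boolean dupByHash, boolean notModified) {
        if (notModified) {
            bump("notModified", contentBytes);  // 304: nothing new was stored
        } else if (dupByHash) {
            bump("dupByHash", contentBytes);    // payload identical to an earlier capture
        } else {
            bump("novel", contentBytes);        // genuinely new bytes: what the new quotas cap
            bump("novelCount", 1L);             // and the matching novel-URL count
        }
    }

    private void bump(String key, long delta) {
        merge(key, delta, Long::sum);
    }
}
```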

nlevitt added some commits Dec 11, 2015

add warc stats to CrawlURI.getData() dynamic attributes list after writing records; new method WARCWriterProcessor.getStats() to expose cumulative stats
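
An illustrative stand-in for the two-level bookkeeping this commit describes: per-URI stats dropped into the CrawlURI's dynamic data map after records are written, plus a processor-wide cumulative map behind getStats(). Apart from getStats() and CrawlURI.getData(), which the commit names, the keys and method names here are assumptions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the actual WARCWriterProcessor implementation.
public class WarcStatsSketch {

    // cumulative, e.g. {"response" -> {"numRecords" -> 42, "contentBytes" -> 1048576}}
    private final Map<String, Map<String, AtomicLong>> stats = new ConcurrentHashMap<>();

    /** Call once per WARC record written for a URI. */
    public void tally(Map<String, Object> curiData, String recordType, long bytesWritten) {
        Map<String, AtomicLong> perType =
                stats.computeIfAbsent(recordType, k -> new ConcurrentHashMap<>());
        perType.computeIfAbsent("numRecords", k -> new AtomicLong()).incrementAndGet();
        perType.computeIfAbsent("contentBytes", k -> new AtomicLong()).addAndGet(bytesWritten);

        // stash the same figure on the CrawlURI's data map so downstream
        // processors (e.g. a quota enforcer counting novel bytes) can see it
        curiData.put("warcStats:" + recordType + ":contentBytes", bytesWritten);
    }

    /** Expose the cumulative stats, as the new getStats() method does. */
    public Map<String, Map<String, AtomicLong>> getStats() {
        return stats;
    }
}
```
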
Merge branch 'seed-limits' into novel-quotas -- so I can add tests to StatisticsSelfTest.java in this branch

* seed-limits:
  change class originally known as SeedLimitsEnforcer to SourceQuotaEnforcer; make it a Processor instead of a DecideRule (because checking quota at link scoping time doesn't work, since many urls which would go over quota can be added to the frontier); support quotas on any of the fields tracked by CrawledBytesHistotable (see the sketch after this list)
  fix checkpointing problems with new statsBySource
  SeedLimitsEnforcer (contrib) DecideRule that rejects CrawlURI if source seed byte or document limit has been reached
  SourceSeedDecideRule applies the configured decision for any URI discovered from one of a set of seeds
  add support to StatisticsTracker to keep a CrawledBytesHistotable per source tag when trackSources is enabled; integration test for this functionality
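
A sketch of why the enforcement moved from a DecideRule to a Processor: at link-scoping time a URI's source seed may still be under quota, yet many such URIs get queued before any of them is fetched, so the definitive check has to run when each URI comes up for processing. All names here are illustrative, not the actual SourceQuotaEnforcer API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of processor-time quota enforcement per source seed.
public class SourceQuotaSketch {

    private final String sourceTag;   // the seed this quota applies to
    private final long maxNovelBytes; // quota on novel bytes traced to that seed
    private final Map<String, AtomicLong> novelBytesBySource = new ConcurrentHashMap<>();

    public SourceQuotaSketch(String sourceTag, long maxNovelBytes) {
        this.sourceTag = sourceTag;
        this.maxNovelBytes = maxNovelBytes;
    }

    /** Run as a processing step: block the fetch if the quota is exhausted. */
    public boolean shouldBlock(String uriSourceTag) {
        if (!sourceTag.equals(uriSourceTag)) {
            return false;  // quota only applies to URIs traced to the configured seed
        }
        AtomicLong tally =
                novelBytesBySource.computeIfAbsent(uriSourceTag, k -> new AtomicLong());
        return tally.get() >= maxNovelBytes;
    }

    /** Run after a successful fetch to record what was actually downloaded. */
    public void record(String uriSourceTag, long novelBytes) {
        novelBytesBySource.computeIfAbsent(uriSourceTag, k -> new AtomicLong())
                .addAndGet(novelBytes);
    }
}
```
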
new HostQuotaEnforcer option applyToSubdomains, to additionally apply the quotas (separately) to each subdomain of the configured host
avoid exception in case applyToSubdomains is enabled and CrawlURI host is a nonstandard host like "dns:" in HostQuotaEnforcer
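
These two commits together suggest host matching along the following lines: match the configured host or, when applyToSubdomains is on, any subdomain of it, while skipping URIs like "dns:example.com" whose authority is not a normal hostname. The names here are assumptions, not the HostQuotaEnforcer API:

```java
import java.net.URI;

// Illustrative sketch of applyToSubdomains-style host matching.
public class HostMatchSketch {

    static boolean appliesTo(String configuredHost, URI uri, boolean applyToSubdomains) {
        String host = uri.getHost();
        if (host == null || host.isEmpty()) {
            // e.g. new URI("dns:example.com").getHost() is null; skip, don't throw
            return false;
        }
        if (host.equalsIgnoreCase(configuredHost)) {
            return true;  // the configured host itself
        }
        return applyToSubdomains
                && host.toLowerCase().endsWith("." + configuredHost.toLowerCase());
    }
}
```

Each matching subdomain would then key its own tally, so the quotas apply separately per subdomain as the first commit describes.
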
fix stats in unusual case of "failed" fetch with response > 0 (only case where this can happen currently is if basic auth is configured for a url, but fails and url returns "401 Unauthorized")
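
The accounting condition this fix implies, sketched here with illustrative names: a fetch can be treated as failed overall (currently only when configured basic-auth credentials are rejected with 401 Unauthorized) while a real HTTP response, with a status code and body bytes, still came back and belongs in the byte tallies. In Heritrix a positive fetch status is an actual response code; negative codes mark internal failures:

```java
// Hypothetical sketch of the stats fix, not the actual Heritrix code.
public class FailedFetchStatsSketch {

    static boolean countsTowardCrawledBytes(int fetchStatus, long contentSize) {
        // count anything that produced a genuine response, even if the crawl
        // treats the attempt itself as unsuccessful (e.g. a 401 on basic auth)
        return fetchStatus > 0 && contentSize > 0;
    }
}
```
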
Merge branch 'master' into novel-quotas
* master:
  license header
  check that sourceTag of CrawlURI actually matches configured sourceTag
  remove already-outdated stuff from javadoc
  handle multiple clauses for same user agent in robots.txt (see the sketch after this list)
  Hook in submitted seeds properly.
  avoid spurious logging
  try very hard to start url consumer, and therefore bind the queue to the routing key, so that no messages are dropped, before crawling starts (should always work unless rabbitmq is down); some other tweaks for clarity and stability
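
The robots.txt item above refers to a well-known pitfall: when the same user-agent is named by more than one block, its directives should merge rather than the later block silently replacing the earlier one. A deliberately minimal sketch of that merging, not Heritrix's actual Robotstxt class:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: merge Disallow rules across repeated user-agent blocks.
public class RobotsMergeSketch {

    static Map<String, List<String>> disallowsByAgent(List<String> lines) {
        Map<String, List<String>> rules = new HashMap<>();
        List<String> currentAgents = new ArrayList<>();
        boolean inDirectives = false;

        for (String raw : lines) {
            String line = raw.split("#", 2)[0].trim();  // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (inDirectives) {      // a new group began after directives
                    currentAgents.clear();
                    inDirectives = false;
                }
                currentAgents.add(value.toLowerCase());
            } else if (field.equals("disallow")) {
                inDirectives = true;
                for (String agent : currentAgents) {
                    // computeIfAbsent appends to the agent's existing list, so a
                    // second "User-agent: foo" block adds to, rather than
                    // overwrites, the first block's rules
                    rules.computeIfAbsent(agent, k -> new ArrayList<>()).add(value);
                }
            }
        }
        return rules;
    }
}
```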

adam-miller added a commit that referenced this pull request Jan 14, 2016

Merge pull request #138 from nlevitt/novel-quotas
crawl level and host level limits on *novel* (not deduplicated) bytes and urls

@adam-miller adam-miller merged commit bf2a887 into internetarchive:master Jan 14, 2016
