Julien Nioche
jnioche

Organizations

@apache @commoncrawl @DigitalPebble
jnioche opened pull request apache/nutch#88
@jnioche
NUTCH-2213 : do not store the headers verbatim if the response was compressed
1 commit with 5 additions and 4 deletions
jnioche created branch NUTCH-2213 at jnioche/nutch
jnioche commented on pull request internetarchive/warctools#14
@jnioche

No idea how frequent this is, your guess is as good as mine.

jnioche commented on pull request internetarchive/warctools#14
@jnioche

Fixed in [commoncrawl/nutch@3551eb6]. The crawler stored the response headers including the content length even though Nutch stores the content dec…

@jnioche
  • @jnioche 3551eb6
    WarcExport : do not use verbatim response headers if content was comp…
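The problem described above (headers that still advertise the compressed payload while the stored content has been decompressed) can be illustrated with a small sketch. This is not the code from the commit: the class `HeaderSanitizer`, its method, and the choice to drop both `Content-Encoding` and `Content-Length` are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: returns a copy of the response
// headers that no longer describes the compressed form of the payload.
public class HeaderSanitizer {

    public static Map<String, String> sanitize(Map<String, String> headers) {
        // Detect whether the response was sent compressed.
        boolean compressed = false;
        for (Map.Entry<String, String> e : headers.entrySet()) {
            if ("content-encoding".equalsIgnoreCase(e.getKey()) && e.getValue() != null) {
                String value = e.getValue().toLowerCase(Locale.ROOT);
                compressed = value.contains("gzip") || value.contains("deflate");
            }
        }

        // Copy the headers, skipping the ones that only hold for the compressed bytes.
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : headers.entrySet()) {
            String key = e.getKey().toLowerCase(Locale.ROOT);
            boolean stale = key.equals("content-encoding") || key.equals("content-length");
            if (compressed && stale) {
                continue;
            }
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }
}
```

Headers of uncompressed responses pass through unchanged, so their original content length is preserved.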
jnioche commented on pull request internetarchive/warctools#14
jnioche deleted branch warc at DigitalPebble/storm-crawler
@jnioche
  • @jnioche 4ff86c7
    Applied formatting to files recently modified
jnioche deleted branch feature/non-blocking-fetcher at DigitalPebble/storm-crawler
@jnioche
Investigate use of async IO for Fetcher
@jnioche

Closing as there seems to be little interest in this.

@jnioche
Customisable/pluggable Scheduler instances
@jnioche

Implemented in #245

@jnioche
FetcherBolts : add counter for bytes fetched
@jnioche

[master ec233c1] FetcherBolts : add counter for bytes fetched #243

@jnioche
  • @jnioche ec233c1
    FetcherBolts : add counter for bytes fetched #243
@jnioche

We can now configure how often the metrics will be generated for both flavours of FetcherBolts. fetcher.metrics.time.bucket.secs: 10 @mattburns - …

@jnioche
  • @jnioche d9354d5
    Configurable time buckets for Fetcher(s) metrics #246
@jnioche

The counter could be shared by making it a static field; however, it would not be shared across all bots, just the ones that live in the same JVM. Moreover I …
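A minimal sketch of the static-field approach mentioned above, assuming a hypothetical class `SharedByteCounter` (not part of storm-crawler). Every instance in the same JVM updates the same counter, while instances running in another JVM keep a separate total, which is the limitation the comment points out.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical class for illustration only.
public class SharedByteCounter {

    // A static field exists once per JVM (strictly, once per class loader),
    // so all instances in that JVM share it. AtomicLong keeps updates thread-safe.
    private static final AtomicLong BYTES_FETCHED = new AtomicLong();

    // Called by any instance after fetching a document.
    public void record(long bytes) {
        BYTES_FETCHED.addAndGet(bytes);
    }

    // Total for this JVM only; other worker processes hold their own value.
    public static long total() {
        return BYTES_FETCHED.get();
    }
}
```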

@jnioche

@mattburns what about being able to configure the metrics frequency per component? Would this still be useful and if so should we reopen this issue?

@jnioche
Reduce metrics chatter
@jnioche
Fixed whitelist blacklist to allow filtering by parent. Fixes #246
1 commit with 59 additions and 13 deletions
@jnioche
  • @jnioche 469c184
    Fetcher : set initial # threads + rampup delay via config
@jnioche

Could this be passed as an argument on the command line? That would be easier than having to modify the script every time.

jnioche commented on pull request DigitalPebble/storm-crawler#250
@jnioche

thanks @mattburns