Commits on Dec 27, 2011
  1. Support for the class="robots-nocontent" attribute of HTML tags.

    authored
    Everything inside tags with this attribute will be skipped by the
    DOM parser.
    
    See http://www.ysearchblog.com/2007/05/02/introducing-robots-nocontent-for-page-sections/ for more details.
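
    The skipping behaviour described in this commit can be illustrated with a small standalone sketch. This is not the Nutch DOM parser code: it is a hypothetical Python example built on the standard library `html.parser`, and the names `NoContentSkippingExtractor` and `extract_text` are assumptions made up for illustration. It assumes reasonably well-formed markup in which every non-void element has a closing tag.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NoContentSkippingExtractor(HTMLParser):
    """Hypothetical text extractor that ignores robots-nocontent subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a robots-nocontent subtree
        self.chunks = []     # text collected outside skipped subtrees

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth > 0:
            # Already skipping: track nesting so the matching close is found.
            self.skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if "robots-nocontent" in classes:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when no enclosing element is marked robots-nocontent.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of `html`, minus robots-nocontent sections."""
    parser = NoContentSkippingExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

    For example, `extract_text('<p>Keep</p><div class="robots-nocontent"><p>Skip</p></div>')` keeps only the text of the first paragraph. The depth counter is one simple way to find the end of a skipped subtree; markup with omitted closing tags would need a more tolerant parser.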
Commits on Dec 20, 2011
  1. Renamed FetcherStatus to FetcherOutlinks for the new outlinks section of NUTCH-1184

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1221194 13f79535-47bb-0310-9956-ffa450edef68
  2. NUTCH-1129 Add freegenerator, domainstats and crawldbscanner to log4j

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1221185 13f79535-47bb-0310-9956-ffa450edef68
  3. NUTCH-1184 Fetcher to parse and follow Nth degree outlinks

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1221181 13f79535-47bb-0310-9956-ffa450edef68
Commits on Dec 19, 2011
  1. NUTCH-1225 Migrate CrawlDBScanner to MapReduce API

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1220788 13f79535-47bb-0310-9956-ffa450edef68
  2. NUTCH-1222 Upgrade to new Hadoop 0.22.0

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1220786 13f79535-47bb-0310-9956-ffa450edef68
Commits on Dec 16, 2011
  1. NUTCH-1221 Migrate DomainStatistics to MapReduce API

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1215090 13f79535-47bb-0310-9956-ffa450edef68
Commits on Dec 9, 2011
  1. commit to address NUTCH-1216 and update to CHANGES.txt

    Lewis John McGibbney authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1212458 13f79535-47bb-0310-9956-ffa450edef68
Commits on Nov 29, 2011
  1. NUTCH-1214 DomainStats tool should be named for what it's doing

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1207967 13f79535-47bb-0310-9956-ffa450edef68
Commits on Nov 28, 2011
  1. NUTCH-1213 Pass additional SolrParams when indexing to Solr.

    Andrzej Bialecki authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1207217 13f79535-47bb-0310-9956-ffa450edef68
Commits on Nov 25, 2011
  1. Oops, accidentally committed my local System.out debugging as part of NUTCH-1211: removing.

    chrismattmann authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1206039 13f79535-47bb-0310-9956-ffa450edef68
  2. fix for NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN requirements

    chrismattmann authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1206038 13f79535-47bb-0310-9956-ffa450edef68
Commits on Nov 24, 2011
  1. @chrismattmann
Commits on Nov 21, 2011
  1. NUTCH-1207 ParserChecker to output signature

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1204492 13f79535-47bb-0310-9956-ffa450edef68
Commits on Nov 15, 2011
  1. NUTCH-1090 InvertLinks should inform when ignoring internal links

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1202143 13f79535-47bb-0310-9956-ffa450edef68
Commits on Nov 11, 2011
  1. NUTCH-1174 Outlinks are not properly normalized

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1200917 13f79535-47bb-0310-9956-ffa450edef68
  2. NUTCH-1203 ParseSegment to show number of milliseconds per parse

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1200915 13f79535-47bb-0310-9956-ffa450edef68
  3. NUTCH-1155 Fixes failing test

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1200912 13f79535-47bb-0310-9956-ffa450edef68
  4. NUTCH-1185 Decrease solr.commit.size to 250

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1200833 13f79535-47bb-0310-9956-ffa450edef68
  5. NUTCH-1180 UpdateDB to backup previous CrawlDB

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1200830 13f79535-47bb-0310-9956-ffa450edef68
Commits on Nov 10, 2011
  1. NUTCH-1173 DomainStats doesn't count db_not_modified

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1200377 13f79535-47bb-0310-9956-ffa450edef68
  2. NUTCH-1155 Host/domain limit in generator is generate.max.count+1

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1200370 13f79535-47bb-0310-9956-ffa450edef68
  3. NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1200360 13f79535-47bb-0310-9956-ffa450edef68
  4. NUTCH-1178 Incorrect CSV header CrawlDatumCsvOutputFormat

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1200347 13f79535-47bb-0310-9956-ffa450edef68
  5. NUTCH-1142 Normalization and filtering in WebGraph

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1200346 13f79535-47bb-0310-9956-ffa450edef68
  6. NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file

    Markus Jelsma authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1200344 13f79535-47bb-0310-9956-ffa450edef68
Commits on Nov 9, 2011
  1. commit to assign a unique key to build.xml Ant Sonar task.

    Lewis John McGibbney authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1199863 13f79535-47bb-0310-9956-ffa450edef68
Commits on Nov 5, 2011
  1. prepare for next development iteration.

    chrismattmann authored
    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1197847 13f79535-47bb-0310-9956-ffa450edef68
  2. @chrismattmann
  3. @chrismattmann
  4. @chrismattmann
  5. @chrismattmann
  6. @chrismattmann
  7. @chrismattmann
  8. @chrismattmann