Nutch Change Log
Release 1.0 - 2009-03-23
1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)
2. NUTCH-443 - Allow parsers to return multiple Parse objects.
(Dogacan Guney et al, via ab)
3. NUTCH-393 - Indexer should handle null documents returned by filters.
(Eelco Lempsink via ab)
4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)
5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
bots in robots.txt (Dogacan Guney via siren)
6. NUTCH-482 - Remove redundant plugin lib-log4j (siren)
7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
(siren)
8. NUTCH-161 - Change Plain text parser to
use parser.character.encoding.default property for fall back encoding
(KuroSaka TeruHiko, siren)
9. NUTCH-61 - Support for adaptive re-fetch interval and detection of
unmodified content. (ab)
10. NUTCH-392 - OutputFormat implementations should pass on Progressable.
(cutting via ab)
11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)
12. NUTCH-443 - Allow parsers to return multiple Parse objects; this will speed
up the rss parser (dogacan via mattmann). This update is a fix and semantics
change from the original patch for NUTCH-443. The original patch did not tell
the Indexer to read crawl_parse too so that it can pick up sub-urls' fetch
datums. This patch addresses that issue. Now, if Fetcher gets a null content,
instead of pushing an empty content, it filters the null content.
13. NUTCH-485 - Change HtmlParseFilters to return a ParseResult object instead of
a Parse object. (Gal Nitzan via dogacan)
14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
some query parameters. (Emmanuel Joke via dogacan)
15. NUTCH-502 - Bug in SegmentReader causes infinite loop.
(Ilya Vishnevsky via dogacan)
16. NUTCH-444 - Possibly use a different library to parse RSS feed for improved
performance and compatibility. This patch introduced a new plugin, feed,
that includes an index filter and a parse plugin for feeds that uses ROME.
There was discussion to remove parse-rss, in light of the feed plugin,
however, this patch does not explicitly remove parse-rss. (dogacan, mattmann)
17. NUTCH-471 - Fix synchronization in NutchBean creation.
(Enis Soztutar via dogacan)
18. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab)
19. NUTCH-468 - Scoring filter should distribute score to all outlinks at
once. (dogacan)
20. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan)
21. NUTCH-497 - Extreme Nested Tags causes StackOverflowException in
DomContentUtils...Spider Trap. (kubes)
22. NUTCH-434 - Replace usage of ObjectWritable with something based on
GenericWritable. (dogacan)
23. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan)
24. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation.
(Espen Amble Kolstad via dogacan)
25. NUTCH-507 - lib-lucene-analyzers jar definition is wrong in plugin.xml.
(Emmanuel Joke via dogacan)
26. NUTCH-503 - Generator exits incorrectly for small fetchlists.
(Vishal Shah via dogacan)
27. NUTCH-505 - Outlink urls should be validated. (dogacan)
28. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan)
29. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan)
30. NUTCH-515 - Next fetch time is set incorrectly. (dogacan)
30. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan)
31. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan).
32. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining.
(Enis Soztutar via dogacan)
33. NUTCH-516 - Next fetch time is not set when it is a
CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan)
34. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException
when trying to rerun dedup on a segment. (Vishal Shah via dogacan)
35. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
(dogacan) Note: There is a bigger problem, i.e how to deal
with redirected pages, and this issue can be considered as a band-aid
for the time being. See NUTCH-273 and NUTCH-353 for more details.
36. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and
inlinks list. (Emmanuel Joke via dogacan)
37. NUTCH-535 - ParseData's contentMeta accumulates unnecessary values during
parse. (dogacan)
38. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan)
39. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan)
40. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds
domain-related utilities. (Enis Soztutar via dogacan)
41. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable
release (2.1). (Dawid Weiss via dogacan)
42. NUTCH-545 - Configuration and OnlineClusterer get initialized in every
request. (Dawid Weiss via dogacan)
43. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time.
(Emmanuel Joke via dogacan)
44. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan)
45. NUTCH-546 - file URLs are filtered out by the crawler. (dogacan)
46. NUTCH-554 - Generator throws IOException on invalid urls.
(Brian Whitman via ab)
47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
(Emmanuel Joke via dogacan)
48. NUTCH-25 - needs 'character encoding' detector.
(Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan)
49. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated
to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan)
50. NUTCH-562 - Port mime type framework to use Tika mime detection framework.
(mattmann)
51. NUTCH-488 - Avoid parsing unnecessary links and get a more relevant outlink
list. (Emmanuel Joke, Marcin Okraszewski via kubes)
52. NUTCH-501 - Implement a different caching mechanism for objects cached in
configuration. (dogacan)
53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)
54. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes)
55. NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
(dogacan, kubes via dogacan)
56. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat.
(Emmanuel Joke via dogacan)
57. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan)
58. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan)
59. NUTCH-574 - Including inlink anchor text in index can create irrelevant
search results. Created index-anchor plugin, removed functionality from
index-basic plugin. For backwards compatibility, add index-anchor plugin to
nutch-site.xml plugin.includes. (kubes)
60. NUTCH-581 - DistributedSearch does not update search servers added to
search-servers.txt on the fly. (Rohan Mehta via kubes)
61. NUTCH-586 - Add option to run compiled classes without job file
(enis via ab)
62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
server. (Susam Pal via dogacan)
63. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab)
64. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format
(Emmanuel Joke via ab)
65. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab)
66. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab)
67. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren)
68. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes)
69. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab)
70. NUTCH-602 - Allow configurable number of handlers for search servers
(hartbecke via kubes)
71. NUTCH-607 - Update build.xml to include tika jar when building war (kubes)
72. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann)
73. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes)
74. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes)
75. NUTCH-603 - Add more default url normalizations (kubes)
76. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes)
77. NUTCH-44 - Too many search results, limits max results returned from a
single search. (Emilijan Mirceski and Susam Pal via kubes)
78. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is
updated to 1.2 version. (dogacan)
79. NUTCH-613 - Empty summaries and cached pages (kubes via ab)
80. NUTCH-612 - URL filtering was disabled in Generator when invoked
from Crawl (Susam Pal via ab)
81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)
82. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab)
83. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab)
84. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval.
Guard against reprUrl being null. (Emmanuel Joke, ab)
85. NUTCH-616 - Reset Fetch Retry counter when fetch is successful (Emmanuel
Joke, ab)
86. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab)
87. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab)
88. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
(Emmanuel Joke, dogacan, ab)
89. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes with a
single slash. (Mark DeSpain via ab)
90. NUTCH-500 - Add hadoop masters configuration file into conf folder.
(Emmanuel Joke via kubes)
91. NUTCH-596 - ParseSegments parse content even if it's not
CrawlDatum.STATUS_FETCH_SUCCESS (dogacan)
92. NUTCH-618 - Tika error "Media type alias already exists" (mattmann,kubes)
93. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln
Ritter, ab)
94. NUTCH-641 - IndexSorter incorrectly copies stored fields (ab)
95. NUTCH-645 - Parse-swf unit test failing (ab)
96. NUTCH-642 - Unit tests fail when run in non-local mode (ab)
97. NUTCH-639 - Change LuceneDocumentWrapper visibility from
private to _public_ (Guillaume Smet via dogacan)
98. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn
tracking. (dogacan)
99. NUTCH-375 - Add support for Content-Encoding: deflated
(Pascal Beis, ab)
100. NUTCH-633 - ParseSegment no longer allows reparsing.
(dogacan)
101. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan)
102. NUTCH-621 - Nutch needs to declare its crypto usage (mattmann)
103. NUTCH-654 - urlfilter-regex's main does not work.
(dogacan)
104. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE".
(dogacan)
105. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes)
106. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes)
107. NUTCH-647 - Resolve URLs tool (kubes)
108. NUTCH-665 - Search Load Testing Tool (kubes)
109. NUTCH-667 - Input Format for working with Content in Hadoop Streaming
(kubes)
110. NUTCH-635 - LinkAnalysis Tool for Nutch. (kubes)
111. NUTCH-646 - New Indexing Framework for Nutch. (kubes)
112. NUTCH-668 - Domain URL Filter. (kubes)
113. NUTCH-594 - Serve Nutch search results in multiple formats including
XML and JSON. (kubes)
114. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren)
115. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
fetch interval correctly. (dogacan)
116. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic)
117. NUTCH-678 - Hadoop 0.19 requires an update of jets3t.
(julien nioche via dogacan)
118. NUTCH-681 - parse-mp3 compilation problem.
(Wildan Maulana via dogacan)
119. NUTCH-676 - MapWritable is written inefficiently and confusingly.
(dogacan)
120. NUTCH-579 - Feed plugin only indexes one post per feed due to identical
digest. (dogacan)
121. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3.
(Joseph Chen, dogacan)
122. NUTCH-682 - SOLR indexer does not set boost on the document.
(julien nioche via dogacan)
123. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab)
124. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab)
125. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab)
126. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE
(Curtis d'Entremont, ab)
127. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan)
128. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException
(Stefan Will, siren)
129. NUTCH-691 - Update jakarta poi jars to the most relevant version
(Dmitry Lihachev via siren)
130. NUTCH-563 - Include custom fields in BasicQueryFilter
(Julien Nioche via siren)
131. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin
(Dmitry Lihachev via siren)
132. NUTCH-694 - Distributed Search Server fails (siren)
133. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links
set at cross domain redirects (Remco Verhoef, dogacan via siren)
134. NUTCH-247 - Robot parser to restrict (kubes, siren)
135. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan
via siren)
136. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan,
Dmitry Lihachev via siren)
137. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab)
138. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann,
Doug Cook via ab)
139. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren)
140. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren)
141. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab)
142. NUTCH-684 - Dedup support for Solr. (dogacan)
143. NUTCH-715 - Subcollection plugin doesn't work with default
subcollections.xml file (Dmitry Lihachev via siren)
144. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute
Release 0.9 - 2007-04-02
1. Changed log4j configuration to log to stdout on commandline
tools (siren)
2. NUTCH-344 - Fix for thread blocking issue (Greg Kim via siren)
3. NUTCH-260 - Update hadoop version to 0.5.0 (Renaud Richardet,
siren)
4. Optionally skip pages with abnormally large values of Crawl-Delay
(Dennis Kubes via ab)
5. Change readdb -stats to use CombiningCollector (ab)
6. NUTCH-348 - Fix Generator to select highest scoring pages (Chris
Schneider and Stefan Groschupf via ab)
7. NUTCH-347 - Adjust plugin build script not to emit warnings when copying
dependent jars (siren)
8. NUTCH-338 - Remove the text parser as an option for parsing PDF files
in parse-plugins.xml (Chris A. Mattmann via siren)
9. NUTCH-105 - Network error during robots.txt fetch causes file to
be ignored (Greg Kim via siren)
10. NUTCH-367 - DistributedSearch throws ClassCastException (siren)
11. NUTCH-332 - Fix the problem of doubling scores caused by links pointing
to the current page (e.g. anchors). (Stefan Groschupf via ab)
12. NUTCH-365 - Flexible URL normalization (ab)
13. NUTCH-336 - Differentiate between newly discovered pages and newly
injected pages (Chris Schneider via ab) NOTE: this changes the
scoring API, filter implementations need to be updated.
14. NUTCH-337 - Fetcher ignores the fetcher.parse value (Stefan Groschupf
via ab)
15. NUTCH-350 - Urls blocked by http.max.delays incorrectly marked as GONE
(Stefan Groschupf via ab)
16. NUTCH-374 - When http.content.limit is set to -1 and
Response.CONTENT_ENCODING is gzip or x-gzip, nothing can be fetched
(King Kong via pkosiorowski)
17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)
****************************** WARNING !!! ********************************
* This upgrade breaks data format compatibility. A tool 'convertdb' *
* was added to migrate existing CrawlDb-s to the new format. Segment data *
* can be partially migrated using 'mergesegs', however segments will *
* require re-parsing (and consequently re-indexing). *
****************************** WARNING !!! ********************************
18. NUTCH-371 - DeleteDuplicates now correctly implements both parts of
the algorithm. (ab)
19. NUTCH-391 - ParseUtil logs file contents to log file when it cannot
find parser (siren)
20. NUTCH-379 - ParseUtil does not pass through the content's URL to the
ParserFactory (Chris A. Mattmann via siren)
21. NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one
partition. (ab)
22. NUTCH-399 - Change CommandRunner to use concurrent api from jdk (siren)
23. NUTCH-395 - Increase fetching speed (siren)
24. NUTCH-388 - nutch-default.xml has outdated example for urlfilter.order
(reported by Jared Dunne)
25. NUTCH-404 - Fix LinkDB Usage - implementation mismatch (siren)
26. NUTCH-403 - Make URL filtering optional in Generator (siren)
27. NUTCH-405 - Content object is not properly initialized in map method
of ParseSegment (siren)
28. NUTCH-362 - Remove parse-text from unsupported filetypes in
parse-plugins.xml (siren)
29. NUTCH-305 - Update crawl and url filter lists to exclude
jpeg|JPEG|bmp|BMP, suffix-urlfilter.txt (contributed by Stefan
Neufeind) is also updated (siren)
30. NUTCH-406 - Metadata tries to write null values (mattmann)
31. NUTCH-415 - Generator should mark selected records in CrawlDb.
Due to increased resource consumption this step is optional.
Application-level locking has been added to prevent concurrent
modification of databases. (ab)
32. NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
now possible to correctly update CrawlDb from multiple segments.
Introduce new status codes for temporary and permanent
redirection. (ab)
33. NUTCH-322 - Fix Fetcher to store redirected pages and to store
protocol-level status. This also should fix NUTCH-273. (ab)
34. Change default Fetcher behavior not to follow redirects immediately.
Instead Fetcher will record redirects as new pages to be added to CrawlDb.
This also partially addresses NUTCH-273. (ab)
35. Detect and report when Generator creates 0-sized segments. (ab)
36. Fix Injector to preserve already existing CrawlDatum if the seed list
being injected also contains such URL. (ab)
37. NUTCH-425, NUTCH-426 - Fix anchors pollution. Continue after
skipping bad URLs. (Michael Stack via ab)
38. NUTCH-325 - UrlFilters.java throws NPE in case urlfilter.order contains
Filters that are not in plugin.includes (Stefan Groschupf, siren)
39. NUTCH-421 - Allow predetermined running order of indexing filters
(Alan Tanaman, siren)
40. When indexing pages with redirection, drop all intermediate pages and
index only the final page. (ab)
41. Upgrade to Hadoop 0.10.1. (ab)
42. NUTCH-420 - Fix a bug in DeleteDuplicates where results depended on the
order in which IndexDoc-s are processed. (Dogacan Guney via ab)
43. NUTCH-428 - NullPointerException thrown when agent name is not
configured properly. Changed to throw RuntimeException instead.
(siren)
44. NUTCH-430 - Integer overflow in HashComparator.compare (siren)
45. NUTCH-68 - Add a tool to generate arbitrary fetchlists. (ab)
46. NUTCH-433 - java.io.EOFException in newer nightlies in mergesegs
or indexing from hadoop.io.DataOutputBuffer (siren)
47. NUTCH-339 - Fetcher2: a queue-based fetcher implementation. (ab)
48. NUTCH-390 - Javadoc warnings (mattmann)
49. NUTCH-449 - Make junit output format configurable. (nigel via cutting)
50. NUTCH-432 - Fix a bug where platform name with spaces would break the
bin/nutch script. (Brian Whitman via ab)
51. Upgrade to Hadoop 0.11.2 and Lucene 2.1.0 release. (ab)
52. NUTCH-167 - Observation of robots "noarchive" directive. (ab)
53. NUTCH-384 - Protocol-file plugin does not allow the parse plugins
framework to operate properly (Heiko Dietze via mattmann)
54. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan
Groschupf via kubes)
55. NUTCH-436 - Incorrect handling of relative paths when the embedded URL
path is empty (kubes)
56. Upgrade to Hadoop 0.12.1 release. (ab)
57. NUTCH-246 - Incorrect segment size being generated due to time
synchronization issue (Stefan Groschupf via ab)
58. Upgrade to Hadoop 0.12.2 release. (ab)
59. NUTCH-333 - SegmentMerger and SegmentReader should use NutchJob. (Michael
Stack and Dogacan Guney via kubes)
Release 0.8 - 2006-07-25
0. Totally new architecture, based on hadoop
[http://lucene.apache.org/hadoop] (cutting)
1. NUTCH-107 - Typo in plugin/urlfilter-*/plugin.xml. (Stephen Cross).
2. NUTCH-108 - Log hosts that exceed generate.max.per.host.
(Rod Taylor via cutting)
3. NUTCH-88 - Enhance ParserFactory plugin selection policy
(jerome)
4. NUTCH-124 - Protocol-httpclient does not follow redirects when
fetching robots.txt (cutting)
5. NUTCH-130 - Be explicit about target JVM when building (1.4.x?)
(stack@archive.org, cutting)
6. NUTCH-114 - Getting number of urls and links from crawldb
(Stefan Groschupf via ab)
7. NUTCH-112 - Link in cached.jsp page to cached content is an
absolute link (Chris A. Mattmann via jerome)
8. NUTCH-135 - Http header meta data are case insensitive in the
real world (Stefan Groschupf via jerome)
9. NUTCH-145 - Build of war file fails on Chinese (zh) .xml files due
to UTF-8 BOM (KuroSaka TeruHiko via siren)
10. NUTCH-121 - SegmentReader for mapred (Rod Taylor via ab)
11. Added support for OpenSearch (cutting)
12. NUTCH-142 - NutchConf should use the thread context classloader
(Mike Cannon-Brookes via pkosiorowski)
13. NUTCH-160 - Use standard Java Regex library rather than
org.apache.oro.text.regex (Rod Taylor via cutting)
14. NUTCH-151 - CommandRunner can hang after the main thread exec is
finished and has inefficient busy loop (Paul Baclace via cutting)
15. NUTCH-174 - Problem encountered with ant during compilation
16. NUTCH-190 - ParseUtil drops reason for failed parse
(stack@archive.org via ab)
17. NUTCH-169 - Remove static NutchConf (Marko Bauhardt via ab)
18. NUTCH-194 - NUTCH-169 introduced two tiny bugs (Marko Bauhardt via ab)
19. NUTCH-178 - Session creation must be set to "false" in search.jsp
(YourSoft via siren)
20. NUTCH-200 - OpenSearch Servlet is broken
(Marko Bauhardt via siren)
21. NUTCH-81 - Webapp only works when deployed in root
(AJ Banck, Michael Nebel via siren)
22. NUTCH-139 - Standard metadata property names in the ParseData
metadata (Chris A. Mattmann, jerome)
23. NUTCH-192 - Meta data support for CrawlDatum
(Stefan Groschupf via ab)
24. NUTCH-52 - Parser plugin for MS Excel files
(Rohit Kulkarni via jerome)
25. NUTCH-53 - Parser plugin for Zip files
(Rohit Kulkarni via jerome)
26. NUTCH-137 - footer is not displayed in search result page
(KuroSaka TeruHiko via siren)
27. NUTCH-118 - FAQ link points to invalid URL
(Steve Betts via siren)
28. NUTCH-184 - Serbian (sr, Cyrillic) and Serbo-Croatian (sh, Latin)
translation (Ivan Sekulovic via siren)
29. NUTCH-211 - FetchedSegments leave readers open (Stefan Groschupf
via cutting)
30. NUTCH-140 - Add alias capability in parse-plugins.xml file that
allows mimeType->extensionId mapping (Chris A. Mattmann via jerome)
31. NUTCH-214 - Added links to web site to search mailing list
(Jake Vanderdray via jerome)
32. NUTCH-204 - Multiple field values in HitDetails
(Stefan Groschupf via jerome)
33. NUTCH-219 - file.content.limit & ftp.content.limit should be changed
to -1 to be consistent with http (jerome)
34. NUTCH-221 - Prepare nutch for upcoming lucene 2.0 (siren)
35. NUTCH-91 - Empty encoding causes exception (Michael Nebel via
pkosiorowski)
36. NUTCH-228 - Clustering plugin descriptor broken (Dawid Weiss via
jerome)
37. NUTCH-229 - Improved handling of plugin folder configuration
(Stefan Groschupf via ab)
38. NUTCH-206 - Search server throws InstantiationException (ab)
39. NUTCH-203 - ParseSegment throws InstantiationException (Marko Bauhardt
via ab)
40. NUTCH-3 - Multi values of header discarded (Stefan Groschupf via ab)
41. Update to lucene 1.9.1 (cutting)
42. NUTCH-235 - Duplicate Inlink values (ab)
43. NUTCH-234 - Clustering extension code cleanups and a real
JUnit test case for the current implementation (Dawid Weiss via ab)
44. NUTCH-210 - Context.xml file for Nutch web application
(Chris A. Mattmann via jerome)
45. NUTCH-231 - Invalid CSS entries (AJ Banck via jerome)
46. NUTCH-232 - Search.jsp has multiple search forms creating
invalid html / incorrect focus function (jerome)
47. NUTCH-196 - lib-xml and lib-log4j plugins (ab, jerome)
48. NUTCH-244 - Inconsistent handling of property values
boundaries / unable to set db.max.outlinks.per.page to
infinite (jerome)
49. NUTCH-245 - DTD for plugin.xml configuration files
(Chris A. Mattmann via jerome)
50. NUTCH-250 - Generate to log truncation caused by
generate.max.per.host (Rod Taylor via cutting)
51. NUTCH-125 - OpenOffice Parser plugin (ab)
52. Switch from using java.io.File to org.apache.hadoop.fs.Path.
(cutting)
53. NUTCH-240 - Scoring API: extension point, scoring filters and
an OPIC plugin (ab)
54. NUTCH-134 - Summarizer doesn't select the best snippets (jerome)
55. NUTCH-268 - Generator and lib-http use different definitions of
"unique host" (ab)
56. NUTCH-280 - Url query causes NullPointerException (Grant Glouser
via siren)
57. NUTCH-285 - LinkDb Fails rename doesn't create parent directories
(Dennis Kubes via ab)
58. NUTCH-201 - Add support for subcollections
(siren)
59. NUTCH-298 - If a 404 for a robots.txt is returned a NPE is thrown
(Stefan Groschupf via jerome)
60. NUTCH-275 - Fetcher not parsing XHTML-pages at all (jerome)
61. NUTCH-301 - CommonGrams loads analysis.common.terms.file for each query
(Stefan Groschupf via jerome)
62. NUTCH-110 - OpenSearchServlet outputs illegal xml characters
(stack@archive.org via siren)
63. NUTCH-292 - OpenSearchServlet: OutOfMemoryError: Java heap space
(Stefan Neufeind via siren)
64. NUTCH-307 - Wrong configured log4j.properties (jerome)
65. NUTCH-303 - Logging improvements (jerome)
66. NUTCH-308 - Maximum search time limit (ab)
67. NUTCH-306 - DistributedSearch.Client liveAddresses concurrency
problem (Grant Glouser via siren)
68. Update to hadoop-0.4 (Milind Bhandarkar, cutting)
69. NUTCH-317 - Clarify what the queryLanguage argument of
Query.parse(...) means (jerome)
70. Added alternative experimental web gui in contrib containing
extensions like subcollection, keymatch, user preferences,
caching, implemented mainly using tiles and jstl (siren)
71. NUTCH-320 - DmozParser does not output list of urls to stdout
but to a log file instead. Original functionality restored.
72. NUTCH-271 - Add ability to limit crawling to the set of initially
injected hosts (db.ignore.external.links) (Philippe Eugene,
Stefan Neufeind via ab)
73. NUTCH-293 - Support for Crawl-Delay (Stefan Groschupf via ab)
74. NUTCH-327 - Fixed logging directory on cygwin (siren)
Release 0.7 - 2005-08-17
1. Added support for "type:" in queries. Search results are limited/qualified
by mimetype or its primary type or sub type. For example,
(1) searching with "type:application/pdf" restricts results
to pages which were identified to be of mimetype "application/pdf".
(2) with "type:application", nutch will return pages of
primary type "application".
(3) with "type:pdf", only pages of sub type "pdf" will be listed.
(John Xing, 20050120)
2. Added support for "date:" in queries. Last-Modified is indexed.
Search results are restricted by lower and upper date (inclusive)
as date:yyyymmdd-yyyymmdd. For example, date:20040101-20041231
only returns pages with Last-Modified in year 2004.
(John Xing, 20050122)
3. Add URLFilter plugin interface and convert existing url filters into
plugins. (John Xing, 20050206)
4. Add UpdateSegmentsFromDb tool, which updates the scores and
anchors of existing segments with the current values in the web
db. This is used by CrawlTool, so that pages are now only fetched
once per crawl. (Doug Cutting, 20050221)
5. Moved code into org.apache.nutch sub-packages. Changed license to
Apache 2.0. Removed jar files whose licenses do not permit
redistribution by Apache. Disabled compilation of plugins which
require these libraries. (Doug Cutting 20050301)
6. Index host and title in separate fields. Host was indexed
previously only as a part of the URL. Title was indexed as an
anchor. Now boosts for matching these fields may be adjusted
separately from boosts for matching anchors and url. Also: move
site indexing to index-basic plugin to minimize the number of
times the URL needs to be parsed; and, stop using anchor analyzer
for anything but anchors. (Piotr Kosiorowski via Doug Cutting
20050323)
7. Add servlet Cached.java that serves cached Content of any mime type.
Slightly modified are web.xml and cached.jsp.
(John Xing, 20050401)
8. Add skipCompressedByteArray() to WritableUtils.java.
(John Xing, 20050402)
9. Fixes to jsp and static web pages. These now use relative links,
so that the Nutch webapp file can be used in places other than at
the root. Also fixed links to the about and help pages. Bug #32.
(Jerome Charron via cutting, 20050404)
10. Added some features to DistributedSearch: new segments can be added
to searchservers without restarting the frontend, defective search
servers are not queried until they come back online, and a watchdog keeps
an eye on your searchservers and writes simple statistics.
(Sami Siren, 20050407)
11. Fix for bug #4 - Unbalanced quote in query eats all resources.
(Piotr Kosiorowski, Sami Siren, 20050407)
12. Close Issue #33 - MIME content type detector (using magic char sequences).
(Jerome Charron and Hari Kodungallur via John Xing, 20050416)
13. Add a servlet that implements A9's OpenSearch RSS web service.
(cutting, 20050418)
14. Remove references to link analysis from tutorial, and enable
scoring by link count when generating fetchlists and searching.
(cutting, 20040419)
15. Make query boosts for host, title, anchor and phrase matches
configurable. (Piotr Kosiorowski via cutting, 20050419)
16. Add support for sorting search results and search-time deduping by
fields other than site.
17. Automatically convert range queries into cached range filters.
This improves the performance and scalability of, e.g., date range
searching.
18. Several methods have been renamed due to misspellings. The old
methods have been deprecated and will be removed before the 1.0
release.
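
Illustration for items 1 and 2 above ("type:" and "date:" queries). The
following is a minimal sketch of the matching semantics those entries
describe, not the Nutch implementation (Nutch evaluates these clauses
against indexed fields). The class and method names are hypothetical and
exist only for this example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration only; none of these names come from Nutch.
public class QueryClauseSketch {

    // "type:" clause: a value containing '/' must equal the full mimetype;
    // otherwise it may equal either the primary type or the sub type.
    public static boolean matchesType(String clause, String mimeType) {
        if (!clause.startsWith("type:")) {
            throw new IllegalArgumentException("not a type: clause: " + clause);
        }
        String wanted = clause.substring("type:".length());
        if (wanted.contains("/")) {
            return wanted.equals(mimeType);
        }
        String[] parts = mimeType.split("/", 2);
        return parts[0].equals(wanted)
            || (parts.length == 2 && parts[1].equals(wanted));
    }

    // "date:" clause: yyyymmdd-yyyymmdd, with both bounds inclusive.
    public static boolean matchesDate(String clause, LocalDate lastModified) {
        if (!clause.startsWith("date:")) {
            throw new IllegalArgumentException("not a date: clause: " + clause);
        }
        String[] bounds = clause.substring("date:".length()).split("-");
        if (bounds.length != 2) {
            throw new IllegalArgumentException("expected yyyymmdd-yyyymmdd: " + clause);
        }
        DateTimeFormatter fmt = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd
        LocalDate lower = LocalDate.parse(bounds[0], fmt);
        LocalDate upper = LocalDate.parse(bounds[1], fmt);
        return !lastModified.isBefore(lower) && !lastModified.isAfter(upper);
    }

    public static void main(String[] args) {
        System.out.println(matchesType("type:pdf", "application/pdf"));
        System.out.println(matchesDate("date:20040101-20041231",
                                       LocalDate.of(2004, 12, 31)));
    }
}
```

Under these assumptions, date:20040101-20041231 accepts a Last-Modified of
2004-12-31 (the upper bound is inclusive) and rejects 2005-01-01.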
Release 0.6
1. Added clustering-carrot2 plugin, together with introduction of clustering
api and modification to search jsp. (Dawid Weiss via John Xing, 20040809)
2. Make a number of changes to NDFS (Nutch Distributed File System)
to fix bugs, add admin tools, etc.
Also, modify all command line tools so you can indicate whether to
use NDFS or the local filesystem. If you indicate nothing, then
it defaults to the local fs.
I've used this to do a 35m page crawl via NDFS, distributed over a
dozen machines. (Mike Cafarella)
3. Add support for BASE tags in HTML. Outlinks are now correctly
extracted when a BASE tag is present. (cutting)
4. Fix two bugs in result pagination. When the last hit on a page
was the last hit overall, the "next" button was sometimes shown
when the "show all" button should be shown instead. Also, in
certain cases, the "show all" button would be shown when the
"next" button should have been shown. (cutting)
5. Add config parameter "indexer.max.tokens" that determines the
maximum number of tokens indexed per field. (Andy Hedges via cutting)
6. Add parser for mp3 files. (Andy Hedges via cutting)
7. Add RegexUrlNormalizer. This is useful for things like stripping
out session IDs from URLs. To use it, add values for
urlnormalizer.class and urlnormalizer.regex.file to your
nutch-site.xml. The RegexUrlNormalizer class extends the
BasicUrlNormalizer, and does basic normalization as well.
(Luke Baker via cutting)
8. Added Swedish translation (Stefan Verzel via Sami Siren, 20040910)
9. Added Polish translation (Andrzej Bialecki, 20040911)
10. Added 3 more language profiles to language identifier (ru,hu,pl).
Other changes to language identifier: Profiles converted to utf8,
added some test cases, changed the similarity calculation.
(Sami Siren, 20040925)
11. Added plugin parse-rtf (Andy Hedges via John Xing, 20040929)
12. Added plugin index-more and more.jsp (John Xing, 20041003)
13. Added "View as Plain Text" feature. A new op OP_PARSETEXT is introduced
in DistributedSearch.java. text.jsp is added. (John Xing, 20041006)
14. Fixed a bug that caused cached.jsp, explain.jsp, anchors.jsp and text.jsp
(but not search.jsp) to fail with NullPointerException in distributed search.
The bug seems to have appeared after the "hits per site" code was added.
The fix is in Hit.java, which now makes sure String site is never null.
Hopefully this fix has no bad effect on the "hits per site" code.
(John Xing, 20041006)
15. Fixed a bug that caused fullyDelete() in FileUtil.java to fail for
LocalFileSystem.java. This bug also exposes possible incompleteness
of NDFSFile.java, where a few methods are not supported, including
delete(). Nothing was changed in NDFSFile.java, though; that is left
for future improvement. (John Xing, 20041022)
16. Introduced option -noParsing to Fetcher.java and added ParseSegment.java.
A new status code CANT_PARSE is added to FetcherOutput.java.
Without option -noParsing, fetcher behavior is unchanged. With
option -noParsing, the fetcher only crawls; no parsing is carried out.
ParseSegment.java should then be used to parse in a separate pass.
(John Xing, 20041025)
17. Added ontology plugin. Currently it is used for query refinement, as
exemplified in refine-query-init.jsp and refine-query.jsp. By default,
query refinement is disabled in search.jsp. Please check
./src/plugin/ontology/README.txt for further description.
Ontology plugin certainly can be used for many other things.
(Michael J. Pan via John Xing, 20041129)
18. Changed fetcher.server.delay to be a float, so that sub-second
delays can be specified. (cutting)
19. Added plugin.includes config parameter that determines which
plugins are included. By default now only http, html and basic
indexing and search plugins are enabled, rather than all plugins.
This should make default performance more predictable and reliable
going forward. (cutting)
20. Cleaned up some filesystem code, including:
- Replaced BufferedRandomAccessFile with two simpler utilities,
NFSDataInputStream and NFSDataOutputStream.
- Fixed the bug where SequenceFiles were no longer flushed when
created, so that, when fetches crashed, segments were
unreadable. Now segments are always readable after crashes.
Only the contents of the last buffer are lost.
- Simplified the FSOutputStream API to not include seek(). We
should never need that functionality.
- Simplified LocalFileSystem's implementations of FSInputStream
and FSOutputStream and optimized FSInputStream.seek().
(cutting)
21. Fixed BasicUrlNormalizer to better handle relative urls. The file
part of a URL is normalized in the following manner:
1. "/aa/../" will be replaced by "/". This is done step by step until
the url doesn't change anymore. This ensures that
"/aa/bb/../../" will be replaced by "/", too.
2. leading "/../" will be replaced by "/"
(Sven Wende via cutting)
22. Fix Page constructors so that next fetch date is less likely to be
misconstrued as a float. This patches a problem in WebDBInjector,
where new pages were added to the db with nextScore set to the
intended nextFetch date. This, in turn, confused link analysis.
23. In ndfs code, replace addLocalFile(), putToLocalFile() with
copyFromLocalFile(), moveFromLocalFile(), copyToLocalFile() and
moveToLocalFile(). (John Xing, 20041217)
24. Added new config parameter fetcher.threads.per.host. This is used
by the Http protocol. When this is one, behavior is as before.
When this is greater than one, multiple threads are permitted
to access a host at once. Note that fetcher.server.delay is no
longer consistently observed when this is greater than one.
(Luke Baker via Doug Cutting)
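
Illustration for item 21 above (BasicUrlNormalizer). The two rules listed
there can be sketched roughly as follows. This is an assumption-laden
example of the described behavior, not the code in BasicUrlNormalizer; the
class name, method name and regular expression are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the BasicUrlNormalizer source.
public class PathNormalizeSketch {

    // Matches "/segment/../" where the segment is not itself "..".
    private static final Pattern PARENT_STEP =
        Pattern.compile("/(?!\\.\\./)[^/]+/\\.\\./");

    // Normalizes the file part of a URL according to the two rules.
    public static String normalizeFile(String file) {
        // Rule 1: replace "/aa/../" by "/", one step at a time,
        // until the string no longer changes.
        String previous;
        do {
            previous = file;
            file = PARENT_STEP.matcher(file).replaceFirst("/");
        } while (!file.equals(previous));

        // Rule 2: replace a leading "/../" by "/".
        while (file.startsWith("/../")) {
            file = file.substring(3);
        }
        return file;
    }

    public static void main(String[] args) {
        System.out.println(normalizeFile("/aa/bb/../../"));
        System.out.println(normalizeFile("/../index.html"));
    }
}
```

Repeating the single-step replacement until nothing changes is what lets
"/aa/bb/../../" collapse fully: the first pass yields "/aa/../" and the
second yields "/".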
Release 0.5
1. Changed plugin directory to be a list of directories.
2. Permit Plugin to be the default plugin implementation.
3. Added pluggable interface for network protocols in new package
net.nutch.protocol. Moved http code from core into a plugin.
4. Added pluggable interface for content parsing in new package
net.nutch.parse. Moved html parsing code from core into a
plugin.
5. Fixed a bug in NutchAnalysis where 16-bit characters were not
processed correctly.