reworked tokenizer howto, as docstrings in tokenizer package

nltk · Nov 6, 2011 · 37aced7
1 parent 637d190

Showing 12 changed files with 407 additions and 361 deletions.
diff --git a/nltk/test/tag.errs b/nltk/test/tag.errs
@@ -0,0 +1,121 @@
+
+***************************************************************************
+File "/Users/sb/git/nltk/nltk/test/tag.doctest", line 167, in tag.doctest
+Failed example:
+    print 'Accuracy: %4.1f%%' % (
+        100.0 * unigram_tagger.evaluate(brown_test))
+Expected:
+    Accuracy: 85.4%
+Got:
+    Accuracy: 85.8%
+
+***************************************************************************
+File "/Users/sb/git/nltk/nltk/test/tag.doctest", line 178, in tag.doctest
+Failed example:
+    print 'Accuracy: %4.1f%%' % (
+        100.0 * unigram_tagger_2.evaluate(brown_test))
+Expected:
+    Accuracy: 88.0%
+Got:
+    Accuracy: 88.4%
+
+***************************************************************************
+File "/Users/sb/git/nltk/nltk/test/tag.doctest", line 205, in tag.doctest
+Failed example:
+    print bigram_tagger.size()
+Expected:
+    3394
+Got:
+    3386
+
+***************************************************************************
+File "/Users/sb/git/nltk/nltk/test/tag.doctest", line 207, in tag.doctest
+Failed example:
+    print 'Accuracy: %4.1f%%' % (
+        100.0 * bigram_tagger.evaluate(brown_test))
+Expected:
+    Accuracy: 89.4%
+Got:
+    Accuracy: 89.6%
+
+***************************************************************************
+File "/Users/sb/git/nltk/nltk/test/tag.doctest", line 222, in tag.doctest
+Failed example:
+    print trigram_tagger.size()
+Expected:
+    1493
+Got:
+    1502
+
+***************************************************************************
+File "/Users/sb/git/nltk/nltk/test/tag.doctest", line 224, in tag.doctest
+Failed example:
+    print 'Accuracy: %4.1f%%' % (
+        100.0 * trigram_tagger.evaluate(brown_test))
+Expected:
+    Accuracy: 88.8%
+Got:
+    Accuracy: 89.0%
+
+***************************************************************************
+File "/Users/sb/git/nltk/nltk/test/tag.doctest", line 251, in tag.doctest
+Failed example:
+    brill_tagger = trainer.train(brown_train, max_rules=10)  # doctest: +NORMALIZE_WHITESPACE
+Expected:
+    Training Brill tagger on 4523 sentences...
+    Finding initial useful rules...
+        Found 75359 useful rules.
+    <BLANKLINE>
+               B      |     
+       S   F   r   O  |        Score = Fixed - Broken
+       c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
+       o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
+       r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
+       e   d   n   r  |  e
+    ------------------+-------------------------------------------------------
+     354 354   0   3  | TO -> IN if the tag of the following word is 'AT'
+     111 173  62   3  | NN -> VB if the tag of the preceding word is 'TO'
+     110 110   0   4  | TO -> IN if the tag of the following word is 'NP'
+      83 157  74   4  | NP -> NP-TL if the tag of the following word is
+                      |   'NN-TL'
+      73  77   4   0  | VBD -> VBN if the tag of words i-2...i-1 is 'BEDZ'
+      71 116  45   3  | TO -> IN if the tag of words i+1...i+2 is 'NNS'
+      65  65   0   3  | NN -> VB if the tag of the preceding word is 'MD'
+      63  63   0   0  | VBD -> VBN if the tag of words i-3...i-1 is 'HVZ'
+      59  62   3   2  | CS -> QL if the text of words i+1...i+3 is 'as'
+      55  57   2   0  | VBD -> VBN if the tag of words i-3...i-1 is 'HVD'
+Got:
+    Training Brill tagger on 4523 sentences...
+    Finding initial useful rules...
+        Found 75299 useful rules.
+    <BLANKLINE>
+               B      |     
+       S   F   r   O  |        Score = Fixed - Broken
+       c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
+       o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
+       r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
+       e   d   n   r  |  e
+    ------------------+-------------------------------------------------------
+     354 354   0   3  | TO -> IN if the tag of the following word is 'AT'
+     110 110   0   3  | TO -> IN if the tag of the following word is 'NP'
+      91 127  36   6  | VB -> NN if the tag of words i-2...i-1 is 'AT'
+      82 143  61   3  | NN -> VB if the tag of the preceding word is 'TO'
+      71 116  45   2  | TO -> IN if the tag of words i+1...i+2 is 'NNS'
+      66  69   3   0  | VBN -> VBD if the tag of the preceding word is
+                      |   'NP'
+      64 131  67   6  | NP -> NP-TL if the tag of the following word is
+                      |   'NN-TL'
+      59  62   3   2  | CS -> QL if the text of words i+1...i+3 is 'as'
+      55  55   0   1  | NN -> VB if the tag of the preceding word is 'MD'
+      55  59   4   0  | VBD -> VBN if the tag of words i-2...i-1 is 'BEDZ'
+
+***************************************************************************
+File "/Users/sb/git/nltk/nltk/test/tag.doctest", line 274, in tag.doctest
+Failed example:
+    print 'Accuracy: %4.1f%%' % (
+        100.0 * brill_tagger.evaluate(brown_test))
+Expected:
+    Accuracy: 89.1%
+Got:
+    Accuracy: 89.5%
+.
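
The legend in the Brill trainer tables above defines Score = Fixed - Broken. The following is a minimal sketch, not part of this commit, that checks this relation on a few rows copied from the "Got" table. The helper name `parse_rule_row` and the regular expression are assumptions made for illustration; they are not NLTK API.

```python
import re

# Hypothetical helper (not part of NLTK): parse one row of the Brill
# trainer's rule table into (score, fixed, broken, other, rule_text).
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+\|\s+(.*)$")

def parse_rule_row(line):
    match = ROW.match(line)
    if match is None:
        return None  # header, separator, or wrapped continuation line
    score, fixed, broken, other = (int(g) for g in match.groups()[:4])
    return score, fixed, broken, other, match.group(5)

# Rows copied from the "Got" output above, including one wrapped line.
got_rows = [
    " 354 354   0   3  | TO -> IN if the tag of the following word is 'AT'",
    "  91 127  36   6  | VB -> NN if the tag of words i-2...i-1 is 'AT'",
    "  82 143  61   3  | NN -> VB if the tag of the preceding word is 'TO'",
    "                  |   'NP'",
]

parsed = [row for row in map(parse_rule_row, got_rows) if row is not None]
for score, fixed, broken, other, rule in parsed:
    # Score = Fixed - Broken, as stated in the table legend.
    assert score == fixed - broken, rule
```

Wrapped continuation lines (such as the `'NP'` line) carry no counts, so the sketch skips them rather than joining them back onto the preceding rule.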