= Stanford Natural Language Parser Wrapper

This module is a wrapper for the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].

The Stanford Natural Language Parser is a Java implementation of a probabilistic context-free grammar (PCFG) and dependency parser for English, German, Chinese, and Arabic. This module provides a thin wrapper around the Java code to make it accessible from Ruby, along with pure Ruby objects that enable standoff parsing.


= Installation and Configuration

In addition to the Ruby gems it requires, this module needs a manually installed copy of the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].

This module expects the parser to be installed in the <tt>/usr/local/stanford-parser/current</tt> directory on UNIX platforms and in the <tt>C:\stanford-parser\current</tt> directory on Windows platforms. This is the directory that contains the <tt>stanford-parser.jar</tt> file. When the module is loaded, it adds this directory to the Java classpath and launches the Java VM with the arguments <tt>-server -Xmx150m</tt>.

These defaults can be overridden by creating the configuration file <tt>/etc/ruby_stanford_parser.yaml</tt> on UNIX platforms and <tt>C:\stanford-parser\ruby-stanford-parser.yaml</tt> on Windows platforms. This file is in the Ruby YAML[http://ruby-doc.org/stdlib/libdoc/yaml/rdoc/index.html] format and may contain two values: <tt>root</tt> and <tt>jvmargs</tt>. For example, the file might look like the following:

    root: /usr/local/stanford-parser/other/location
    jvmargs: -Xmx100m -verbose


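The override behavior described above can be illustrated with a short sketch. This is not the module's own loading code: <tt>merge_parser_config</tt> and <tt>DEFAULTS</tt> are hypothetical names, and the sketch assumes that a value in the file simply replaces the corresponding default.

```ruby
require "yaml"

# Defaults quoted from the text above (UNIX values).
DEFAULTS = {
  "root"    => "/usr/local/stanford-parser/current",
  "jvmargs" => "-server -Xmx150m"
}.freeze

# Hypothetical helper: lay settings parsed from YAML text over the defaults.
# Keys missing from the YAML keep their default values.
def merge_parser_config(yaml_text, defaults = DEFAULTS)
  overrides = YAML.safe_load(yaml_text) || {}
  defaults.merge(overrides)
end

config = merge_parser_config(<<~YAML)
  root: /usr/local/stanford-parser/other/location
  jvmargs: -Xmx100m -verbose
YAML

puts config["root"]
puts config["jvmargs"]
```

A file that sets only <tt>root</tt> would, under this assumption, leave <tt>jvmargs</tt> at its default.
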
= Tokenization and Parsing

Use the StanfordParser::DocumentPreprocessor class to tokenize text and files into sentences and words.

    >> require "stanfordparser"
    => true
    >> preproc = StanfordParser::DocumentPreprocessor.new
    => <DocumentPreprocessor>
    >> puts preproc.getSentencesFromString("This is a sentence. So is this.")
    This is a sentence .
    So is this .

Use the StanfordParser::LexicalizedParser class to parse sentences.

    >> parser = StanfordParser::LexicalizedParser.new
    Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [5.5 sec].
    => edu.stanford.nlp.parser.lexparser.LexicalizedParser
    >> puts parser.apply("This is a sentence.")
    (ROOT
      (S [24.917]
        (NP [6.139] (DT [2.300] This))
        (VP [17.636] (VBZ [0.144] is)
          (NP [12.299] (DT [1.419] a) (NN [8.897] sentence)))
        (. [0.002] .)))

For complete details about the use of these classes, see the documentation on the Stanford Natural Language Parser website.


= Standoff Tokenization and Parsing

This module also contains support for standoff tokenization and parsing, in which the terminal nodes of parse trees contain information about the text that was used to generate them.

Use the StanfordParser::StandoffDocumentPreprocessor class to tokenize text and files into sentences and words.

    >> preproc = StanfordParser::StandoffDocumentPreprocessor.new
    => <StandoffDocumentPreprocessor>
    >> s = preproc.getSentencesFromString("This is a sentence.  So is this.")
    => [This is a sentence., So is this.]

The standoff preprocessor returns StanfordParser::StandoffToken objects, which contain character offsets into the original text along with information about spacing characters that came before and after the token. Each pair of offsets is half-open: the second offset is one past the token's last character, so an exclusive range recovers the token exactly.

    >> puts s
    This [0,4]
    is [5,7]
    a [8,9]
    sentence [10,18]
    . [18,19]
    So [21,23]
    is [24,26]
    this [27,31]
    . [31,32]
    >> "This is a sentence.  So is this."[27...31]
    => "this"

This is the same information contained in the <tt>edu.stanford.nlp.ling.FeatureLabel</tt> class in the Stanford Parser Java implementation.

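As a sketch of how these offsets relate to the text, the following pure Ruby snippet checks every token in the listing above against the input string. <tt>SketchToken</tt> is a hypothetical stand-in, not the module's StandoffToken class, and the input string is assumed to have two spaces after the first period, which is what the offset 21 for "So" implies.

```ruby
# Hypothetical stand-in for a standoff token: the word plus half-open
# character offsets [first, last) into the original text.
SketchToken = Struct.new(:word, :first, :last)

# Two spaces follow the first period, matching the offsets below.
text = "This is a sentence.  So is this."

# Offsets copied from the listing above.
tokens = [
  SketchToken.new("This", 0, 4), SketchToken.new("is", 5, 7),
  SketchToken.new("a", 8, 9), SketchToken.new("sentence", 10, 18),
  SketchToken.new(".", 18, 19), SketchToken.new("So", 21, 23),
  SketchToken.new("is", 24, 26), SketchToken.new("this", 27, 31),
  SketchToken.new(".", 31, 32)
]

# An exclusive range (three dots) recovers each token exactly.
tokens.each do |t|
  slice = text[t.first...t.last]
  raise "offset mismatch for #{t.word}" unless slice == t.word
end

# The gap between adjacent tokens is the spacing that separated them.
gaps = tokens.each_cons(2).map { |a, b| text[a.last...b.first] }
p gaps
```

The gaps are what make it possible to rebuild the original text from the tokens alone, which is the property the standoff parse trees below rely on.
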
Similarly, use the StanfordParser::StandoffParsedText object to parse a block of text into StanfordParser::StandoffNode parse trees whose terminal nodes are StanfordParser::StandoffToken objects.

    >> t = StanfordParser::StandoffParsedText.new("This is a sentence.  So is this.")
    Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [4.9 sec].
    => <StanfordParser::StandoffParsedText, 2 sentences>
    >> puts t.first
    (ROOT
      (S
        (NP (DT This [0,4]))
        (VP (VBZ is [5,7])
          (NP (DT a [8,9]) (NN sentence [10,18])))
        (. . [18,19])))

Standoff parse trees can reproduce the text from which they were generated verbatim.

    >> t.first.to_original_string
    => "This is a sentence.  "

They can also reproduce the original text with brackets inserted around the yields of specified parse nodes.

    >> t.first.to_bracketed_string([[0,0,0], [0,1,1]])
    => "[This] is [a sentence].  "

The format of the coordinates used to specify individual nodes is described in the documentation for the Ruby Treebank[http://rubyforge.org/projects/treebank/] gem.

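The bracketed output above can be approximated with plain character offsets. The sketch below is not how <tt>to_bracketed_string</tt> is implemented, and it takes character spans rather than tree coordinates; <tt>bracket_spans</tt> is a hypothetical helper that assumes the spans do not overlap.

```ruby
# Hypothetical helper: wrap each half-open [first, last) character span of
# the text in square brackets. Assumes the spans do not overlap.
def bracket_spans(text, spans)
  result = text.dup
  # Work from the rightmost span to the left so that inserting brackets
  # does not shift the offsets of spans still to be processed.
  spans.sort.reverse_each do |first, last|
    result.insert(last, "]")
    result.insert(first, "[")
  end
  result
end

# Spans for "This" [0,4] and "a sentence" [8,18], taken from the parse tree.
puts bracket_spans("This is a sentence.  ", [[0, 4], [8, 18]])
```

Because each terminal node carries its own offsets, the span of any parse node can be taken from its first and last terminals.
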
See the documentation of the individual classes in this module for more details.

Unlike their parent classes, StanfordParser::DocumentPreprocessor and StanfordParser::LexicalizedParser, which produce Ruby wrappers around Java objects, StanfordParser::StandoffDocumentPreprocessor and StanfordParser::StandoffParsedText produce pure Ruby objects. This facilitates serialization of these objects using tools like the Marshal module, which cannot serialize Java objects.

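The following sketch shows the kind of round trip that pure Ruby objects allow. It uses a hypothetical <tt>MarshalToken</tt> struct in place of the module's own classes, so it runs without the parser installed.

```ruby
# Hypothetical stand-in for a pure Ruby standoff token.
MarshalToken = Struct.new(:word, :first, :last)

tokens = [MarshalToken.new("This", 0, 4), MarshalToken.new("is", 5, 7)]

# Marshal.dump returns a binary String; Marshal.load rebuilds the objects.
data = Marshal.dump(tokens)
restored = Marshal.load(data)

puts restored == tokens
puts restored.first.word
```

The dumped string can be written to a file and loaded later, provided the same class definitions are available when it is loaded.
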
= History

1.0.0:: Initial release.
1.1.0:: Make module initialization function private. Add example code.
1.2.0:: Read Java VM arguments from the configuration file. Add Word class.
2.0.0:: Add support for standoff parsing. Change the way Rjb::JavaObjectWrapper wraps returned values: see wrap_java_object for details. Rjb::JavaObjectWrapper supports static members. Minor changes to stanford-sentence-parser script.
2.1.0:: Different default paths for Windows machines. Minor changes to StandoffToken definition.
2.2.0:: Add parent information to StandoffNode.

= Copyright

Copyright 2007-2008, William Patrick McNeill

This program is distributed under the GNU General Public License.


= Author

W.P. McNeill mailto:billmcn@gmail.com