First commit, this is the stock gem at v2.2.0

commit 1460c236b0a48a6f1fcb8f7aff9fccac6262797e 0 parents
John Wilkinson authored
123 README
... ... @@ -0,0 +1,123 @@
  1 += Stanford Natural Language Parser Wrapper
  2 +
  3 +This module is a wrapper for the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
  4 +
  5 +The Stanford Natural Language Parser is a Java implementation of probabilistic context-free grammar (PCFG) and dependency parsers for English, German, Chinese, and Arabic. This module provides a thin wrapper around the Java code to make it accessible from Ruby, along with pure Ruby objects that enable standoff parsing.
  6 +
  7 +
  8 += Installation and Configuration
  9 +
  10 +In addition to the Ruby gems it requires, to run this module you must manually install the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
  11 +
  12 +This module expects the parser to be installed in the <tt>/usr/local/stanford-parser/current</tt> directory on UNIX platforms and in the <tt>C:\stanford-parser\current</tt> directory on Windows platforms. This is the directory that contains the <tt>stanford-parser.jar</tt> file. When the module is loaded, it adds this directory to the Java classpath and launches the Java VM with the arguments <tt>-server -Xmx150m</tt>.
  13 +
  14 +These defaults can be overridden by creating the configuration file <tt>/etc/ruby-stanford-parser.yaml</tt> on UNIX platforms or <tt>C:\stanford-parser\ruby-stanford-parser.yaml</tt> on Windows platforms. This file is in the Ruby YAML[http://ruby-doc.org/stdlib/libdoc/yaml/rdoc/index.html] format and may contain two values: <tt>root</tt>, the parser installation directory, and <tt>jvmargs</tt>, a space-separated string of Java VM arguments. For example, the file might look like the following:
  15 +
  16 + root: /usr/local/stanford-parser/other/location
  17 + jvmargs: -Xmx100m -verbose
  18 +
  19 +
  20 += Tokenization and Parsing
  21 +
  22 +Use the StanfordParser::DocumentPreprocessor class to tokenize text and files into sentences and words.
  23 +
  24 + >> require "stanfordparser"
  25 + => true
  26 + >> preproc = StanfordParser::DocumentPreprocessor.new
  27 + => <DocumentPreprocessor>
  28 + >> puts preproc.getSentencesFromString("This is a sentence. So is this.")
  29 + This is a sentence .
  30 + So is this .
  31 +
  32 +Use the StanfordParser::LexicalizedParser class to parse sentences.
  33 +
  34 + >> parser = StanfordParser::LexicalizedParser.new
  35 + Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [5.5 sec].
  36 + => edu.stanford.nlp.parser.lexparser.LexicalizedParser
  37 + >> puts parser.apply("This is a sentence.")
  38 + (ROOT
  39 + (S [24.917]
  40 + (NP [6.139] (DT [2.300] This))
  41 + (VP [17.636] (VBZ [0.144] is)
  42 + (NP [12.299] (DT [1.419] a) (NN [8.897] sentence)))
  43 + (. [0.002] .)))
  44 +
  45 +For complete details about the use of these classes, see the documentation on the Stanford Natural Language Parser website.
  46 +
  47 +
  48 += Standoff Tokenization and Parsing
  49 +
  50 +This module also contains support for standoff tokenization and parsing, in which the terminal nodes of parse trees contain information about the text that was used to generate them.
  51 +
  52 +Use the StanfordParser::StandoffDocumentPreprocessor class to tokenize text and files into sentences and words.
  53 +
  54 + >> preproc = StanfordParser::StandoffDocumentPreprocessor.new
  55 + => <StandoffDocumentPreprocessor>
  56 + >> s = preproc.getSentencesFromString("This is a sentence. So is this.")
  57 + => [This is a sentence., So is this.]
  58 +
  59 +The standoff preprocessor returns StanfordParser::StandoffToken objects, which contain character offsets into the original text along with information about spacing characters that came before and after the token.
  60 +
  61 + >> puts s
  62 + This [0,4]
  63 + is [5,7]
  64 + a [8,9]
  65 + sentence [10,18]
  66 + . [18,19]
  67 + So [21,23]
  68 + is [24,26]
  69 + this [27,31]
  70 + . [31,32]
  71 + >> "This is a sentence.  So is this."[27...31]
  72 + => "this"
  73 +
  74 +This is the same information contained in the <tt>edu.stanford.nlp.ling.FeatureLabel</tt> class in the Stanford Parser Java implementation.
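To make the offset bookkeeping concrete, here is a small pure-Ruby sketch of standoff tokens. It is illustrative only: the regular-expression tokenizer stands in for the Stanford PTBTokenizer, which the sketch does not call, and `Token`, `standoff_tokens` and `original_text` are hypothetical names. The sample text assumes two spaces between the sentences, which is what the offsets listed above imply.

```ruby
# Illustrative token: its text, surrounding whitespace, and end-exclusive offsets.
Token = Struct.new(:current, :before, :after, :begin_position, :end_position)

# Split text into word and punctuation tokens, recording character offsets.
def standoff_tokens(text)
  tokens = []
  text.scan(/\w+|[^\w\s]/) do
    match = Regexp.last_match
    tokens << Token.new(match[0], "", "", match.begin(0), match.end(0))
  end
  # Fill in the spacing that precedes and follows each token.
  tokens.each_with_index do |token, i|
    prev_end = i.zero? ? 0 : tokens[i - 1].end_position
    next_begin = i == tokens.length - 1 ? text.length : tokens[i + 1].begin_position
    token.before = text[prev_end...token.begin_position]
    token.after = text[token.end_position...next_begin]
  end
  tokens
end

# Rebuild the source text verbatim from the tokens alone.
def original_text(tokens)
  return "" if tokens.empty?
  tokens.first.before + tokens.map { |t| t.current + t.after }.join
end

text = "This is a sentence.  So is this."
tokens = standoff_tokens(text)
tokens.each { |t| puts "#{t.current} [#{t.begin_position},#{t.end_position}]" }
puts original_text(tokens) == text
```

Because the end offset is exclusive, `text[begin_position...end_position]` returns exactly the token text.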
  75 +
  76 +Similarly, use the StanfordParser::StandoffParsedText object to parse a block of text into StanfordParser::StandoffNode parse trees whose terminal nodes are StanfordParser::StandoffToken objects.
  77 +
  78 + >> t = StanfordParser::StandoffParsedText.new("This is a sentence. So is this.")
  79 + Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [4.9 sec].
  80 + => <StanfordParser::StandoffParsedText, 2 sentences>
  81 + >> puts t.first
  82 + (ROOT
  83 + (S
  84 + (NP (DT This [0,4]))
  85 + (VP (VBZ is [5,7])
  86 + (NP (DT a [8,9]) (NN sentence [10,18])))
  87 + (. . [18,19])))
  88 +
  89 +Standoff parse trees can reproduce the text from which they were generated verbatim.
  90 +
  91 + >> t.first.to_original_string
  92 + => "This is a sentence. "
  93 +
  94 +They can also reproduce the original text with brackets inserted around the yields of specified parse nodes.
  95 +
  96 + >> t.first.to_bracketed_string([[0,0,0], [0,1,1]])
  97 + => "[This] is [a sentence]. "
  98 +
  99 +The format of the coordinates used to specify individual nodes is described in the documentation for the Ruby Treebank[http://rubyforge.org/projects/treebank/] gem.
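As a rough illustration of how such coordinates select a constituent, the sketch below assumes a coordinate is the path of child indices from the root, which is consistent with the <tt>[0,0,0]</tt> and <tt>[0,1,1]</tt> example above. The nested-array tree and the helper names are hypothetical and do not use the Treebank gem; the Treebank documentation remains the authority on the real format.

```ruby
# Illustrative tree as nested arrays: [label, child, child, ...].
# A leaf is a plain String. This is not the Treebank gem's node class.
TREE = ["ROOT",
        ["S",
         ["NP", ["DT", "This"]],
         ["VP", ["VBZ", "is"], ["NP", ["DT", "a"], ["NN", "sentence"]]],
         [".", "."]]]

# Collect [word, coordinate] pairs for every leaf, left to right.
def leaf_coordinates(node, coordinate = [])
  return [[node, coordinate]] if node.is_a?(String)
  node.drop(1).each_with_index.flat_map do |child, index|
    leaf_coordinates(child, coordinate + [index])
  end
end

# A leaf is in a node's yield when the node's coordinate is a prefix of the leaf's.
def in_yield?(node, terminal)
  terminal.first(node.length) == node
end

# The words dominated by the node at the given coordinate.
def yield_of(tree, coordinate)
  leaf_coordinates(tree).select { |_word, leaf| in_yield?(coordinate, leaf) }.map(&:first)
end

puts yield_of(TREE, [0, 0, 0]).join(" ")
puts yield_of(TREE, [0, 1, 1]).join(" ")
```

Under this assumption the empty coordinate denotes the root, so its yield is the whole sentence.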
  100 +
  101 +See the documentation of the individual classes in this module for more details.
  102 +
  103 +Unlike StanfordParser::DocumentPreprocessor and StanfordParser::LexicalizedParser, which produce Ruby wrappers around Java objects, StanfordParser::StandoffDocumentPreprocessor and StanfordParser::StandoffParsedText produce pure Ruby objects. This facilitates serialization of these objects with tools like the Marshal module, which cannot serialize Java objects.
  104 +
  105 += History
  106 +
  107 +1.0.0:: Initial release
  108 +1.1.0:: Make module initialization function private. Add example code.
  109 +1.2.0:: Read Java VM arguments from the configuration file. Add Word class.
  110 +2.0.0:: Add support for standoff parsing. Change the way Rjb::JavaObjectWrapper wraps returned values: see wrap_java_object for details. Rjb::JavaObjectWrapper supports static members. Minor changes to stanford-sentence-parser script.
  111 +2.1.0:: Different default paths for Windows machines; Minor changes to StandoffToken definition
  112 +2.2.0:: Add parent information to StandoffNode
  113 +
  114 += Copyright
  115 +
  116 +Copyright 2007-2008, William Patrick McNeill
  117 +
  118 +This program is distributed under the GNU General Public License.
  119 +
  120 +
  121 += Author
  122 +
  123 +W.P. McNeill mailto:billmcn@gmail.com
46 examples/stanford-sentence-parser.rb
... ... @@ -0,0 +1,46 @@
  1 +#!/usr/bin/env ruby
  2 +
  3 +#--
  4 +
  5 +# Copyright 2007-2008 William Patrick McNeill
  6 +#
  7 +# This file is part of the Stanford Parser Ruby Wrapper.
  8 +#
  9 +# The Stanford Parser Ruby Wrapper is free software; you can redistribute it
  10 +# and/or modify it under the terms of the GNU General Public License as
  11 +# published by the Free Software Foundation; either version 2 of the License,
  12 +# or (at your option) any later version.
  13 +#
  14 +# The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
  15 +# useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
  16 +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
  17 +# Public License for more details.
  18 +#
  19 +# You should have received a copy of the GNU General Public License along with
  20 +# this program; if not, write to the Free Software Foundation, Inc., 51 Franklin
  21 +# St, Fifth Floor, Boston, MA 02110-1301 USA
  22 +#
  23 +#++
  24 +
  25 +# == Synopsis
  26 +#
  27 +# Parse a sentence passed in on the command line.
  28 +#
  29 +# == Usage
  30 +#
  31 +# stanford-sentence-parser.rb [options] sentence
  32 +#
  33 +# options::
  34 +# See the Java Stanford Parser documentation for details
  35 +#
  36 +# sentence::
  37 +# A sentence to parse. This must appear after all the options and be quoted.
  38 +
  39 +
  40 +require "stanfordparser"
  41 +
  42 +# The last argument is the sentence. The rest of the command line is passed
  43 +# along to the parser object.
  44 +sentence = ARGV.pop
  45 +parser = StanfordParser::LexicalizedParser.new(StanfordParser::ENGLISH_PCFG_MODEL, ARGV)
  46 +puts parser.apply(sentence)
129 lib/java_object.rb
... ... @@ -0,0 +1,129 @@
  1 +# Copyright 2007-2008 William Patrick McNeill
  2 +#
  3 +# This file is part of the Stanford Parser Ruby Wrapper.
  4 +#
  5 +# The Stanford Parser Ruby Wrapper is free software; you can redistribute it
  6 +# and/or modify it under the terms of the GNU General Public License as
  7 +# published by the Free Software Foundation; either version 2 of the License,
  8 +# or (at your option) any later version.
  9 +#
  10 +# The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
  11 +# useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
  12 +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
  13 +# Public License for more details.
  14 +#
  15 +# You should have received a copy of the GNU General Public License along with
  16 +# this program; if not, write to the Free Software Foundation, Inc., 51 Franklin
  17 +# St, Fifth Floor, Boston, MA 02110-1301 USA
  18 +require "set"
  19 +# Extensions to the {Ruby-Java Bridge}[http://rjb.rubyforge.org/] module that
  20 +# add a generic Java object wrapper class.
  21 +module Rjb
  22 +
  23 + #--
  24 + # The documentation for this class appears next to its extension inside the
  25 + # StanfordParser module in stanfordparser.rb. This should be changed if Rjb
  26 + # is ever moved into its own gem. See the documentation in stanfordparser.rb
  27 + # for more details.
  28 + #++
  29 + class JavaObjectWrapper
  30 + include Enumerable
  31 +
  32 + # The underlying Java object.
  33 + attr_reader :java_object
  34 +
  35 + # Initialize with a Java object <em>obj</em>. If <em>obj</em> is a
  36 + # String, treat it as a Java class name and instantiate it. Otherwise,
  37 + # treat <em>obj</em> as an instance of a Java object.
  38 + def initialize(obj, *args)
  39 + @java_object = obj.class == String ?
  40 + Rjb::import(obj).send(:new, *args) : obj
  41 + end
  42 +
  43 + # Enumerate all the items in the object using its iterator. If the object
  44 + # has no iterator, this function yields nothing.
  45 + def each
  46 + if @java_object.getClass.getMethods.any? {|m| m.getName == "iterator"}
  47 + i = @java_object.iterator
  48 + while i.hasNext
  49 + yield wrap_java_object(i.next)
  50 + end
  51 + end
  52 + end # each
  53 +
  54 + # Reflect unhandled method calls to the underlying Java object and wrap
  55 + # the return value in the appropriate Ruby object.
  56 + def method_missing(m, *args)
  57 + begin
  58 + wrap_java_object(@java_object.send(m, *args))
  59 + rescue RuntimeError => e
  60 + # The instance method failed. See if this is a static method.
  61 + # Re-raise any other error instead of silently returning nil.
  62 + raise unless e.message =~ /^Fail: unknown method name/
  63 + getClass.send(m, *args)
  64 + end
  65 + end
  66 +
  67 + # Convert a value returned by a call to the underlying Java object to the
  68 + # appropriate Ruby object.
  69 + #
  70 + # If the value is a Ruby-Java Bridge Java object, convert it using a
  71 + # protected function with the name wrap_ followed by the underlying object's
  72 + # classname with the Java path delimiters converted to underscores. For
  73 + # example, a <tt>java.util.ArrayList</tt> would be converted by a function
  74 + # called wrap_java_util_ArrayList.
  75 + #
  76 + # If the value lacks the appropriate converter function, wrap it in a
  77 + # generic JavaObjectWrapper.
  78 + #
  79 + # If the value is not a Java object, return it unchanged.
  80 + #
  81 + # This function is called recursively for every element in an Array.
  82 + def wrap_java_object(object)
  83 + if object.kind_of?(Array)
  84 + object.collect {|item| wrap_java_object(item)}
  85 + elsif object.respond_to?(:_classname)
  86 + # Ruby-Java Bridge Java objects all have a _classname member which
  87 + # tells the name of their Java class. Convert this to the
  88 + # corresponding wrapper function name.
  89 + wrapper_name = ("wrap_" + object._classname.gsub(/\./, "_")).to_sym
  90 + respond_to?(wrapper_name, true) ? send(wrapper_name, object) : JavaObjectWrapper.new(object)
  91 + else
  92 + object
  93 + end
  94 + end
  95 +
  96 + # Convert <tt>java.util.ArrayList</tt> objects to Ruby Array objects.
  97 + def wrap_java_util_ArrayList(object)
  98 + array_list = []
  99 + object.size.times do
  100 + |i| array_list << wrap_java_object(object.get(i))
  101 + end
  102 + array_list
  103 + end
  104 +
  105 + # Convert <tt>java.util.HashSet</tt> objects to Ruby Set objects.
  106 + def wrap_java_util_HashSet(object)
  107 + set = Set.new
  108 + i = object.iterator
  109 + while i.hasNext
  110 + set << wrap_java_object(i.next)
  111 + end
  112 + set
  113 + end
  114 +
  115 + # Show the classname of the underlying Java object.
  116 + def inspect
  117 + "<#{@java_object._classname}>"
  118 + end
  119 +
  120 + # Use the underlying Java object's stringification.
  121 + def to_s
  122 + toString
  123 + end
  124 +
  125 + protected :wrap_java_object, :wrap_java_util_ArrayList, :wrap_java_util_HashSet
  126 +
  127 + end # JavaObjectWrapper
  128 +
  129 +end # Rjb
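The name-based dispatch used by wrap_java_object can be tried without a Java VM. The sketch below is a stand-in under stated assumptions: `FakeJavaObject` imitates only the `_classname` method of an Rjb proxy, and `SketchWrapper` is a hypothetical class, not part of this gem. It passes `true` as the second argument to `respond_to?` because Ruby 2.0 and later do not report protected methods otherwise.

```ruby
# Stand-in for an Rjb proxy object; only _classname is imitated.
class FakeJavaObject
  attr_reader :_classname, :payload

  def initialize(classname, payload = nil)
    @_classname = classname
    @payload = payload
  end
end

# Hypothetical wrapper showing dispatch on a method name derived from the class name.
class SketchWrapper
  def wrap(object)
    if object.is_a?(Array)
      object.map { |item| wrap(item) }
    elsif object.respond_to?(:_classname)
      name = ("wrap_" + object._classname.gsub(".", "_")).to_sym
      # The second argument makes respond_to? see protected converters.
      respond_to?(name, true) ? send(name, object) : [:generic, object._classname]
    else
      object
    end
  end

  protected

  # Converter chosen for objects whose class name is java.util.ArrayList.
  def wrap_java_util_ArrayList(object)
    object.payload.map { |item| wrap(item) }
  end
end

list = FakeJavaObject.new("java.util.ArrayList", [1, FakeJavaObject.new("java.lang.Object")])
p SketchWrapper.new.wrap(list)
```

Adding a converter for another Java class is then a matter of defining one more method that follows the naming convention.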
470 lib/stanfordparser.rb
... ... @@ -0,0 +1,470 @@
  1 +# Copyright 2007-2008 William Patrick McNeill
  2 +#
  3 +# This file is part of the Stanford Parser Ruby Wrapper.
  4 +#
  5 +# The Stanford Parser Ruby Wrapper is free software; you can redistribute it
  6 +# and/or modify it under the terms of the GNU General Public License as
  7 +# published by the Free Software Foundation; either version 2 of the License,
  8 +# or (at your option) any later version.
  9 +#
  10 +# The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
  11 +# useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
  12 +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
  13 +# Public License for more details.
  14 +#
  15 +# You should have received a copy of the GNU General Public License along with
  16 +# this program; if not, write to the Free Software Foundation, Inc., 51 Franklin
  17 +# St, Fifth Floor, Boston, MA 02110-1301 USA
  18 +
  19 +
  20 +require "pathname"
  21 +require "rjb"
  22 +require "singleton"
  23 +begin
  24 + require "treebank"
  25 + gem "treebank", ">= 3.0.0"
  26 +rescue LoadError
  27 + require "treebank"
  28 +end
  29 +require "yaml"
  30 +
  31 +require "java_object.rb"
  32 +
  33 +# Wrapper for the {Stanford Natural Language
  34 +# Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
  35 +module StanfordParser
  36 +
  37 + VERSION = "2.2.0"
  38 +
  39 + # The default sentence segmenter and tokenizer. This is an English-language
  40 + # tokenizer with support for Penn Treebank markup.
  41 + EN_PENN_TREEBANK_TOKENIZER = "edu.stanford.nlp.process.PTBTokenizer"
  42 +
  43 + # Path to an English PCFG model that comes with the Stanford Parser. The
  44 + # location is relative to the parser root directory. This is a valid value
  45 + # for the <em>grammar</em> parameter of the LexicalizedParser constructor.
  46 + ENGLISH_PCFG_MODEL = "$(ROOT)/englishPCFG.ser.gz"
  47 +
  48 + # This function is executed once when the module is loaded. It initializes
  49 + # the Java virtual machine in which the Stanford parser will run. By
  50 + # default, it adds the parser installation root to the Java classpath and
  51 + # launches the VM with the arguments <tt>-server -Xmx150m</tt>. Different
  52 + # values may be specified with the <tt>ruby-stanford-parser.yaml</tt>
  53 + # configuration file.
  54 + #
  55 + # This function determines which operating system we are running on and sets
  56 + # default pathnames accordingly:
  57 + #
  58 + # UNIX:: /usr/local/stanford-parser/current, /etc/ruby-stanford-parser.yaml
  59 + # Windows:: C:\stanford-parser\current,
  60 + # C:\stanford-parser\ruby-stanford-parser.yaml
  61 + #
  62 + # This function returns the path of the parser installation root.
  63 + def StanfordParser.initialize_on_load
  64 + if RUBY_PLATFORM =~ /(win|w)32$/
  65 + root = Pathname.new("C:\\stanford-parser\\current")
  66 + config = Pathname.new("C:\\stanford-parser\\ruby-stanford-parser.yaml")
  67 + else
  68 + root = Pathname.new("/usr/local/stanford-parser/current")
  69 + config = Pathname.new("/etc/ruby-stanford-parser.yaml")
  70 + end
  71 + jvmargs = ["-server", "-Xmx150m"]
  72 + if config.file?
  73 + configuration = open(config) {|f| YAML.load(f)}
  74 + if configuration.key?("root") and not configuration["root"].nil?
  75 + root = Pathname.new(configuration["root"])
  76 + end
  77 + if configuration.key?("jvmargs") and not configuration["jvmargs"].nil?
  78 + jvmargs = configuration["jvmargs"].split
  79 + end
  80 + end
  81 + Rjb::load(classpath = (root + "stanford-parser.jar").to_s, jvmargs)
  82 + root
  83 + end
  84 +
  85 + private_class_method :initialize_on_load
  86 +
  87 + # The root directory of the Stanford parser installation.
  88 + ROOT = initialize_on_load
  89 +
  90 + #--
  91 + # The documentation below is for the original Rjb::JavaObjectWrapper object.
  92 + # It is reproduced here because rdoc only takes the last document block
  93 + # defined. If Rjb is moved into its own gem, this documentation should go
  94 + # with it, and the following should be written as documentation for this
  95 + # class:
  96 + #
  97 + # Extension of the generic Ruby-Java Bridge wrapper object for the
  98 + # StanfordParser module.
  99 + #++
  100 + # A generic wrapper for a Java object loaded via the {Ruby-Java
  101 + # Bridge}[http://rjb.rubyforge.org/]. The wrapper class handles
  102 + # initialization and stringification, and passes other method calls down to
  103 + # the underlying Java object. Objects returned by the underlying Java
  104 + # object are converted to the appropriate Ruby object.
  105 + #
  106 + # Other modules may extend the list of Java objects that are converted by
  107 + # adding their own converter functions. See wrap_java_object for details.
  108 + #
  109 + # This object is enumerable, yielding items in the order defined by the
  110 + # underlying Java object's iterator.
  111 + class Rjb::JavaObjectWrapper
  112 + # FeatureLabel objects go inside a FeatureLabel wrapper.
  113 + def wrap_edu_stanford_nlp_ling_FeatureLabel(object)
  114 + StanfordParser::FeatureLabel.new(object)
  115 + end
  116 +
  117 + # Tree objects go inside a Tree wrapper. Various tree types are aliased
  118 + # to this function.
  119 + def wrap_edu_stanford_nlp_trees_Tree(object)
  120 + Tree.new(object)
  121 + end
  122 +
  123 + alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeLeaf :wrap_edu_stanford_nlp_trees_Tree
  124 + alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeNode :wrap_edu_stanford_nlp_trees_Tree
  125 + alias :wrap_edu_stanford_nlp_trees_SimpleTree :wrap_edu_stanford_nlp_trees_Tree
  126 + alias :wrap_edu_stanford_nlp_trees_TreeGraphNode :wrap_edu_stanford_nlp_trees_Tree
  127 +
  128 + protected :wrap_edu_stanford_nlp_trees_Tree, :wrap_edu_stanford_nlp_ling_FeatureLabel
  129 + end # Rjb::JavaObjectWrapper
  130 +
  131 +
  132 + # Lexicalized probabilistic parser.
  133 + #
  134 + # This is a wrapper for the
  135 + # <tt>edu.stanford.nlp.parser.lexparser.LexicalizedParser</tt> object.
  136 + class LexicalizedParser < Rjb::JavaObjectWrapper
  137 + # The grammar used by the parser
  138 + attr_reader :grammar
  139 +
  140 + # Create the parser given a grammar and options. The <em>grammar</em>
  141 + # argument is a path to a grammar file. This path may contain the string
  142 + # <tt>$(ROOT)</tt>, which will be replaced with the root directory of the
  143 + # Stanford Parser. By default, an English PCFG grammar is loaded.
  144 + #
  145 + # The <em>options</em> argument is a list of string arguments as they
  146 + # would appear on a command line. See the documentation of
  147 + # <tt>edu.stanford.nlp.parser.lexparser.Options.setOptions</tt> for more
  148 + # details.
  149 + def initialize(grammar = ENGLISH_PCFG_MODEL, options = [])
  150 + @grammar = Pathname.new(grammar.gsub(/\$\(ROOT\)/, ROOT.to_s))
  151 + super("edu.stanford.nlp.parser.lexparser.LexicalizedParser", @grammar.to_s)
  152 + @java_object.setOptionFlags(options)
  153 + end
  154 +
  155 + def to_s
  156 + "LexicalizedParser(#{grammar.basename})"
  157 + end
  158 + end # LexicalizedParser
  159 +
  160 +
  161 + # A singleton instance of the default Stanford Natural Language parser. A
  162 + # singleton is used because the parser can take a few seconds to load.
  163 + class DefaultParser < StanfordParser::LexicalizedParser
  164 + include Singleton
  165 + end
  166 +
  167 +
  168 + # This is a wrapper for
  169 + # <tt>edu.stanford.nlp.trees.Tree</tt> objects. It customizes
  170 + # stringification.
  171 + class Tree < Rjb::JavaObjectWrapper
  172 + def initialize(obj = "edu.stanford.nlp.trees.Tree")
  173 + super(obj)
  174 + end
  175 +
  176 + # Return the label along with the score if there is one.
  177 + def inspect
  178 + s = "#{label}" + (score.nan? ? "" : " [#{sprintf '%.2f', score}]")
  179 + "(#{s})"
  180 + end
  181 +
  182 + # The Penn treebank representation. This prints with indenting instead of
  183 + # putting everything on one line.
  184 + def to_s
  185 + "#{pennString}"
  186 + end
  187 + end # Tree
  188 +
  189 +
  190 + # This is a wrapper for
  191 + # <tt>edu.stanford.nlp.ling.Word</tt> objects. It customizes
  192 + # stringification and adds an equivalence operator.
  193 + class Word < Rjb::JavaObjectWrapper
  194 + def initialize(obj = "edu.stanford.nlp.ling.Word", *args)
  195 + super(obj, *args)
  196 + end
  197 +
  198 + # See the word values.
  199 + def inspect
  200 + to_s
  201 + end
  202 +
  203 + # Equivalence is defined relative to the word value.
  204 + def ==(other)
  205 + word == other
  206 + end
  207 + end # Word
  208 +
  209 +
  210 + # This is a wrapper for <tt>edu.stanford.nlp.ling.FeatureLabel</tt> objects.
  211 + # It customizes stringification.
  212 + class FeatureLabel < Rjb::JavaObjectWrapper
  213 + def initialize(obj = "edu.stanford.nlp.ling.FeatureLabel")
  214 + super
  215 + end
  216 +
  217 + # Stringify with just the token and its begin and end position.
  218 + def to_s
  219 + # BUGBUG The position values come back as java.lang.Integer though I
  220 + # would expect Rjb to convert them to Ruby integers.
  221 + begin_position = get(self.BEGIN_POSITION_KEY)
  222 + end_position = get(self.END_POSITION_KEY)
  223 + "#{current} [#{begin_position},#{end_position}]"
  224 + end
  225 +
  226 + # More verbose stringification with all the fields and their values.
  227 + def inspect
  228 + toString
  229 + end
  230 + end
  231 +
  232 +
  233 + # Tokenizes documents into words and sentences.
  234 + #
  235 + # This is a wrapper for the
  236 + # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> object.
  237 + class DocumentPreprocessor < Rjb::JavaObjectWrapper
  238 + def initialize(suppressEscaping = false)
  239 + super("edu.stanford.nlp.process.DocumentPreprocessor", suppressEscaping)
  240 + end
  241 +
  242 + # Returns a list of sentences in a string.
  243 + def getSentencesFromString(s)
  244 + s = Rjb::JavaObjectWrapper.new("java.io.StringReader", s)
  245 + _invoke(:getSentencesFromText, "Ljava.io.Reader;", s.java_object)
  246 + end
  247 +
  248 + def inspect
  249 + "<#{self.class.to_s.split('::').last}>"
  250 + end
  251 +
  252 + def to_s
  253 + inspect
  254 + end
  255 + end # DocumentPreprocessor
  256 +
  257 + # A text token that contains raw and normalized token identity (e.g., "(" and
  258 + # "-LRB-"), an offset span, and the characters immediately preceding and
  259 + # following the token. Given a list of these objects it is possible to
  260 + # recreate the text from which they came verbatim.
  261 + class StandoffToken < Struct.new(:current, :word, :before, :after,
  262 + :begin_position, :end_position)
  263 + def to_s
  264 + "#{current} [#{begin_position},#{end_position}]"
  265 + end
  266 + end
  267 +
  268 +
  269 + # A preprocessor that segments text into sentences and tokens that contain
  270 + # character offset and token context information that can be used for
  271 + # standoff annotation.
  272 + class StandoffDocumentPreprocessor < DocumentPreprocessor
  273 + def initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER)
  274 + # PTBTokenizer.factory is a static function, so use RJB to call it
  275 + # directly instead of going through a JavaObjectWrapper. We do it this
  276 + # way because the Stanford parser Java code does not provide a
  277 + # constructor that allows you to set the second parameter,
  278 + # invertible, to true, and we need this to write character offset
  279 + # information into the tokens.
  280 + ptb_tokenizer_class = Rjb::import(tokenizer)
  281 + # See the documentation for
  282 + # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> for a
  283 + # description of these parameters.
  284 + ptb_tokenizer_factory = ptb_tokenizer_class.factory(false, true, false)
  285 + super(ptb_tokenizer_factory)
  286 + end
  287 +
  288 + # Returns a list of sentences in a string. This wraps each returned
  289 + # sentence in a StandoffSentence object.
  290 + def getSentencesFromString(s)
  291 + super(s).map! {|sentence| StandoffSentence.new(sentence)}
  292 + end
  293 + end
  294 +
  295 +
  296 + # A sentence is an array of StandoffToken objects.
  297 + class StandoffSentence < Array
  298 + # Construct an array of StandoffToken objects from a Java list sentence
  299 + # object returned by the preprocessor.
  300 + def initialize(stanford_parser_sentence)
  301 + # Convert FeatureLabel wrappers to StandoffToken objects.
  302 + s = stanford_parser_sentence.to_a.collect do |fs|
  303 + current = fs.current
  304 + word = fs.word
  305 + before = fs.before
  306 + after = fs.after
  307 + # The to_s.to_i is necessary because the get function returns
  308 + # java.lang.Integer objects instead of Ruby integers.
  309 + begin_position = fs.get(fs.BEGIN_POSITION_KEY).to_s.to_i
  310 + end_position = fs.get(fs.END_POSITION_KEY).to_s.to_i
  311 + StandoffToken.new(current, word, before, after,
  312 + begin_position, end_position)
  313 + end
  314 + super(s)
  315 + end
  316 +
  317 + # Return the original string verbatim.
  318 + def to_s
  319 + self[0..-2].inject(""){|s, word| s + word.current + word.after} + last.current
  320 + end
  321 +
  322 + # Return the original string verbatim.
  323 + def inspect
  324 + to_s
  325 + end
  326 + end
  327 +
  328 +
  329 + # Standoff syntactic annotation of natural language text which may contain
  330 + # multiple sentences.
  331 + #
  332 + # This is an Array of StandoffNode objects, one for each sentence in the
  333 + # text.
  334 + class StandoffParsedText < Array
  335 + # Parse the text and create the standoff annotation.
  336 + #
  337 + # The default parser is a singleton instance of the English language
  338 + # Stanford Natural Language parser. There may be a delay of a few
  339 + # seconds for it to load the first time it is created.
  340 + def initialize(text, nodetype = StandoffNode,
  341 + tokenizer = EN_PENN_TREEBANK_TOKENIZER,
  342 + parser = DefaultParser.instance)
  343 + preprocessor = StandoffDocumentPreprocessor.new(tokenizer)
  344 + # Segment the text into sentences. Parse each sentence, writing
  345 + # standoff annotation information into the terminal nodes.
  346 + preprocessor.getSentencesFromString(text).each do |sentence|
  347 + parse = parser.apply(sentence.to_s)
  348 + push(nodetype.new(parse, sentence))
  349 + end
  350 + end
  351 +
  352 + # Print class name and number of sentences.
  353 + def inspect
  354 + "<#{self.class.name}, #{length} sentences>"
  355 + end
  356 +
  357 + # Print parses.
  358 + def to_s
  359 + flatten.join(" ")
  360 + end
  361 + end
  362 +
  363 +
  364 + # Standoff syntactic tree annotation of text. Terminal nodes are labeled
  365 + # with the appropriate StandoffToken objects. Standoff parses can reproduce
  366 + # the original string from which they were generated verbatim, optionally
  367 + # with brackets around the yields of specified non-terminal nodes.
  368 + class StandoffNode < Treebank::ParentedNode
  369 + # Create the standoff tree from a tree returned by the Stanford parser.
  370 + # For non-terminal nodes, the <em>tokens</em> argument will be a
  371 + # StandoffSentence containing the StandoffToken objects representing all
  372 + # the tokens beneath and after this node. For terminal nodes, the
  373 + # <em>tokens</em> argument will be a StandoffToken.
  374 + def initialize(stanford_parser_node, tokens)
  375 + # Annotate this node with a non-terminal label or a StandoffToken as
  376 + # appropriate.
  377 + super(tokens.instance_of?(StandoffSentence) ?
  378 + stanford_parser_node.value : tokens)
  379 + # Enumerate the children depth-first. Tokens are removed from the list
  380 + # left-to-right as terminal nodes are added to the tree.
  381 + stanford_parser_node.children.each do |child|
  382 + subtree = self.class.new(child, child.leaf? ? tokens.shift : tokens)
  383 + attach_child!(subtree)
  384 + end
  385 + end
  386 +
  387 + # Return the original text string dominated by this node.
  388 + def to_original_string
  389 + leaves.inject("") do |s, leaf|
  390 + s += leaf.label.current + leaf.label.after
  391 + end
  392 + end
  393 +
  394 + # Print the original string with brackets around word spans dominated by
  395 + # the specified constituents.
  396 + #
  397 + # The constituents to bracket are specified by passing a list of node
  398 + # coordinates, which are arrays of integers of the form returned by the
  399 + # tree enumerators of Treebank::Node objects.
  400 + #
  401 + # _coords_:: the coordinates of the nodes around which to place brackets
  402 + # _open_:: the open bracket symbol
  403 + # _close_:: the close bracket symbol
  404 + def to_bracketed_string(coords, open = "[", close = "]")
  405 + # Get a list of all the leaf nodes and their coordinates.
  406 + items = depth_first_enumerator(true).find_all {|n| n.first.leaf?}
  407 + # Enumerate over all the matching constituents inserting open and close
  408 + # brackets around their yields in the items list.
  409 + coords.each do |matching|
  410 + # Insert using a simple state machine with three states: :start,
  411 + # :open, and :close.
  412 + state = :start
  413 + # Enumerate over the items list looking for nodes that are the
  414 + # children of the matching constituent.
  415 + items.each_with_index do |item, index|
  416 + # Skip inserted bracket characters.
  417 + next if item.is_a? String
  418 + # Handle terminal node items with the state machine.
  419 + node, terminal_coordinate = item
  420 + if state == :start
  421 + next if not in_yield?(matching, terminal_coordinate)
  422 + items.insert(index, open)
  423 + state = :open
  424 + else # state == :open
  425 + next if in_yield?(matching, terminal_coordinate)
  426 + items.insert(index, close)
  427 + state = :close
  428 + break
  429 + end
  430 + end # items.each_with_index
  431 + # Handle the case where a matching constituent is flush with the end
  432 + # of the sentence.
  433 + items << close if state == :open
  434 + end # each
  435 + # Replace terminal nodes with their string representations. Insert
  436 + # spacing characters in the list.
  437 + items.each_with_index do |item, index|
  438 + next if item.is_a? String
  439 + text = item.first.label.current
  440 + spacing = item.first.label.after
  441 + # Replace the terminal node with its text.
  442 + items[index] = text
  443 + # Insert the spacing that comes after this text before the first
  444 + # non-close bracket character.
  445 + close_pos = find_index(items[index+1..-1]) {|entry| entry != close}
  446 + items.insert(index + close_pos + 1, spacing)
  447 + end
  448 + items.join
  449 + end # to_bracketed_string
  450 +
  451 + # Find the index of the first item in _list_ for which _block_ is true.
  452 + # Return 0 if no items are found.
  453 + def find_index(list, &block)
  454 + list.each_with_index do |item, index|
  455 + return index if block.call(item)
  456 + end
  457 + 0
  458 + end
  459 +
  460 + # Is the node at _terminal_ in the yield of the node at _node_?
  461 + def in_yield?(node, terminal)
  462 + # If node A's coordinates match the prefix of node B's coordinates, node
  463 + # B is in the yield of node A.
  464 + terminal.first(node.length) == node
  465 + end
  466 +
  467 + private :in_yield?, :find_index
  468 + end # StandoffNode
  469 +
  470 +end # StanfordParser
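The prefix test behind in_yield? can be tried in isolation. The sketch below is not part of the commit: it copies the one-line comparison into a top-level method and runs it against made-up leaf coordinates. The LEAVES table and the yield_of helper are hypothetical, and the coordinates are assumed to be arrays of child indexes, as the comments above describe.

```ruby
# Standalone copy of the prefix comparison used by in_yield?.
# A terminal is in the yield of a node when the node's coordinate is a
# prefix of the terminal's coordinate.
def in_yield?(node, terminal)
  terminal.first(node.length) == node
end

# Made-up leaf coordinates for illustration only; a real tree would
# supply these through its enumerators.
LEAVES = {
  "He"   => [0, 0, 0, 0],
  "John" => [0, 0, 1, 1, 0],
  "tall" => [0, 1, 1, 0, 0],
}

# Hypothetical helper: the words whose coordinates start with +node+.
def yield_of(node)
  LEAVES.select { |_word, coord| in_yield?(node, coord) }.keys
end

puts yield_of([0, 0]).inspect        # => ["He", "John"]
puts yield_of([0, 0, 1, 1]).inspect  # => ["John"]
puts yield_of([0, 1, 1]).inspect     # => ["tall"]
```

Because Array#first(n) returns the whole array when n exceeds its length, a terminal coordinate shorter than the node coordinate can never compare equal, so no explicit length check is needed.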
224 test/test_stanfordparser.rb
... ... @@ -0,0 +1,224 @@
  1 +#!/usr/bin/env ruby
  2 +
  3 +#--
  4 +
  5 +# Copyright 2007-2008 William Patrick McNeill
  6 +#
  7 +# This file is part of the Stanford Parser Ruby Wrapper.
  8 +#
  9 +# The Stanford Parser Ruby Wrapper is free software; you can redistribute it
  10 +# and/or modify it under the terms of the GNU General Public License as
  11 +# published by the Free Software Foundation; either version 2 of the License,
  12 +# or (at your option) any later version.
  13 +#
  14 +# The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
  15 +# useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
  16 +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
  17 +# Public License for more details.
  18 +#
  19 +# You should have received a copy of the GNU General Public License along with
  20 +# the Stanford Parser Ruby Wrapper; if not, write to the Free Software
  21 +# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
  22 +#
  23 +#++
  24 +
  25 +# Test cases for the Stanford Parser module
  26 +
  27 +require "test/unit"
  28 +require "set"
  29 +require "singleton"
  30 +require "stanfordparser"
  31 +
  32 +
  33 +class LexicalizedParserTestCase < Test::Unit::TestCase
  34 + def test_root_path
  35 + assert_equal Pathname, StanfordParser::ROOT.class
  36 + end
  37 +
  38 + def setup
  39 + @parser = StanfordParser::DefaultParser.instance
  40 + @tree = @parser.apply("This is a sentence.")
  41 + end
  42 +
  43 + def test_parser
  44 + assert_equal StanfordParser::ROOT + "englishPCFG.ser.gz", @parser.grammar
  45 + assert_equal StanfordParser::Tree, @tree.class
  46 + end
  47 +
  48 + def test_localTrees
  49 + # The following call exercises the conversion from java.util.HashSet
  50 + # objects to Ruby sets.
  51 + l = @tree.localTrees
  52 + assert_equal 5, l.size
  53 + assert_equal Set.new(["S", "NP", "VP", "ROOT", "NP"]),
  54 + Set.new(l.collect {|t| "#{t.label}"})
  55 + end
  56 +
  57 + def test_enumerable
  58 + # StanfordParser::LexicalizedParser is not an enumerable object.
  59 + assert_equal [], @parser.map
  60 + end
  61 +end # LexicalizedParserTestCase
  62 +
  63 +
  64 +class TreeTestCase < Test::Unit::TestCase
  65 + def setup
  66 + @parser = StanfordParser::DefaultParser.instance
  67 + @tree = @parser.apply("This is a sentence.")
  68 + end
  69 +
  70 + def test_enumerable
  71 + assert @tree.all? {|n| n.class == StanfordParser::Tree}
  72 + assert @tree.all? {|n|
  73 + n._classname == "edu.stanford.nlp.trees.LabeledScoredTreeNode" or
  74 + n._classname == "edu.stanford.nlp.trees.LabeledScoredTreeLeaf"
  75 + }
  76 + assert_equal ["ROOT", "S", "NP", "DT", "This", "VP", "VBZ", "is", "NP",
  77 + "DT", "a", "NN", "sentence", ".", "."],
  78 + @tree.map {|n| "#{n.label}"}
  79 + end
  80 +end # TreeTestCase
  81 +
  82 +
  83 +class FeatureLabelTestCase < Test::Unit::TestCase
  84 + def test_feature_label
  85 + f = StanfordParser::FeatureLabel.new
  86 + assert_equal "BEGIN_POS", f.BEGIN_POSITION_KEY
  87 + f.put(f.BEGIN_POSITION_KEY, 3)
  88 + assert_equal "END_POS", f.END_POSITION_KEY
  89 + f.put(f.END_POSITION_KEY, 7)
  90 + assert_equal "current", f.CURRENT_KEY
  91 + f.put(f.CURRENT_KEY, "word")
  92 + assert_equal "{BEGIN_POS=3, END_POS=7, current=word}", f.inspect
  93 + assert_equal "word [3,7]", f.to_s
  94 + end
  95 +end
  96 +
  97 +
  98 +class DocumentPreprocessorTestCase < Test::Unit::TestCase
  99 + def setup
  100 + @preproc = StanfordParser::DocumentPreprocessor.new
  101 + @standoff_preproc = StanfordParser::StandoffDocumentPreprocessor.new
  102 + end
  103 +
  104 + def test_get_sentences_from_string
  105 + # The following call exercises the conversion from java.util.ArrayList
  106 + # objects to Ruby arrays.
  107 + s = @preproc.getSentencesFromString("This is a sentence. So is this.")
  108 + assert_equal "This is a sentence .", "#{s[0]}"
  109 + assert_equal "So is this .", "#{s[1]}"
  110 + end
  111 +
  112 + def test_enumerable
  113 + # StanfordParser::DocumentPreprocessor is not an enumerable object.
  114 + assert_equal [], @preproc.map
  115 + end
  116 +
  117 + # Segment and tokenize text containing two sentences.
  118 + def test_standoff_document_preprocessor
  119 + sentences = @standoff_preproc.getSentencesFromString("He (John) is tall.  So is she.")
  120 + # Recognize two sentences.
  121 + assert_equal 2, sentences.length
  122 + assert sentences.all? {|sentence| sentence.instance_of? StanfordParser::StandoffSentence}
  123 + assert_equal "He (John) is tall.", sentences.first.to_s
  124 + assert_equal 7, sentences.first.length
  125 + assert sentences[0].all? {|token| token.instance_of? StanfordParser::StandoffToken}
  126 + assert_equal "So is she.", sentences.last.to_s
  127 + assert_equal 4, sentences.last.length
  128 + assert sentences[1].all? {|token| token.instance_of? StanfordParser::StandoffToken}
  129 + # Get the correct token information for the first sentence.
  130 + assert_equal ["He", "He"], [sentences[0][0].current(), sentences[0][0].word()]
  131 + assert_equal [0,2], [sentences[0][0].begin_position(), sentences[0][0].end_position()]
  132 + assert_equal ["(", "-LRB-"], [sentences[0][1].current(), sentences[0][1].word()]
  133 + assert_equal [3,4], [sentences[0][1].begin_position(), sentences[0][1].end_position()]
  134 + assert_equal ["John", "John"], [sentences[0][2].current(), sentences[0][2].word()]
  135 + assert_equal [4,8], [sentences[0][2].begin_position(), sentences[0][2].end_position()]
  136 + assert_equal [")", "-RRB-"], [sentences[0][3].current(), sentences[0][3].word()]
  137 + assert_equal [8,9], [sentences[0][3].begin_position(), sentences[0][3].end_position()]
  138 + assert_equal ["is", "is"], [sentences[0][4].current(), sentences[0][4].word()]
  139 + assert_equal [10,12], [sentences[0][4].begin_position(), sentences[0][4].end_position()]
  140 + assert_equal ["tall", "tall"], [sentences[0][5].current(), sentences[0][5].word()]
  141 + assert_equal [13,17], [sentences[0][5].begin_position(), sentences[0][5].end_position()]
  142 + assert_equal [".", "."], [sentences[0][6].current(), sentences[0][6].word()]
  143 + assert_equal [17,18], [sentences[0][6].begin_position(), sentences[0][6].end_position()]
  144 + # Get the correct token information for the second sentence.
  145 + assert_equal ["So", "So"], [sentences[1][0].current(), sentences[1][0].word()]
  146 + assert_equal [20,22], [sentences[1][0].begin_position(), sentences[1][0].end_position()]
  147 + assert_equal ["is", "is"], [sentences[1][1].current(), sentences[1][1].word()]
  148 + assert_equal [23,25], [sentences[1][1].begin_position(), sentences[1][1].end_position()]
  149 + assert_equal ["she", "she"], [sentences[1][2].current(), sentences[1][2].word()]
  150 + assert_equal [26,29], [sentences[1][2].begin_position(), sentences[1][2].end_position()]
  151 + assert_equal [".", "."], [sentences[1][3].current(), sentences[1][3].word()]
  152 + assert_equal [29,30], [sentences[1][3].begin_position(), sentences[1][3].end_position()]
  153 + end
  154 +
  155 + def test_stringification
  156 + assert_equal "<DocumentPreprocessor>", @preproc.inspect
  157 + assert_equal "<DocumentPreprocessor>", @preproc.to_s
  158 + assert_equal "<StandoffDocumentPreprocessor>", @standoff_preproc.inspect
  159 + assert_equal "<StandoffDocumentPreprocessor>", @standoff_preproc.to_s
  160 + end
  161 +
  162 +end # DocumentPreprocessorTestCase
  163 +
  164 +
  165 +class StandoffParsedTextTestCase < Test::Unit::TestCase
  166 + def setup
  167 + @text = "He (John) is tall.  So is she."
  168 + end
  169 +
  170 + def test_parse_text_default_nodetype
  171 + parsed_text = StanfordParser::StandoffParsedText.new(@text)
  172 + verify_parsed_text(parsed_text, StanfordParser::StandoffNode)
  173 + end
  174 +
  175 + # Verify correct parsing with variable node types for text containing two sentences.
  176 + def verify_parsed_text(parsed_text, nodetype)
  177 + # Verify that there are two sentences.
  178 + assert_equal 2, parsed_text.length
  179 + assert parsed_text.all? {|sentence| sentence.instance_of? nodetype}
  180 + # Verify the tokens in the leaf nodes of the first sentence.
  181 + leaves = parsed_text[0].leaves.collect {|node| node.label}
  182 + assert_equal ["He", "He"], [leaves[0].current(), leaves[0].word()]
  183 + assert_equal [0,2], [leaves[0].begin_position(), leaves[0].end_position()]
  184 + assert_equal ["(", "-LRB-"], [leaves[1].current(), leaves[1].word()]
  185 + assert_equal [3,4], [leaves[1].begin_position(), leaves[1].end_position()]
  186 + assert_equal ["John", "John"], [leaves[2].current(), leaves[2].word()]
  187 + assert_equal [4,8], [leaves[2].begin_position(), leaves[2].end_position()]
  188 + assert_equal [")", "-RRB-"], [leaves[3].current(), leaves[3].word()]
  189 + assert_equal [8,9], [leaves[3].begin_position(), leaves[3].end_position()]
  190 + assert_equal ["is", "is"], [leaves[4].current(), leaves[4].word()]
  191 + assert_equal [10,12], [leaves[4].begin_position(), leaves[4].end_position()]
  192 + assert_equal ["tall", "tall"], [leaves[5].current(), leaves[5].word()]
  193 + assert_equal [13,17], [leaves[5].begin_position(), leaves[5].end_position()]
  194 + assert_equal [".", "."], [leaves[6].current(), leaves[6].word()]
  195 + assert_equal [17,18], [leaves[6].begin_position(), leaves[6].end_position()]
  196 + # Verify the tokens in the leaf nodes of the second sentence.
  197 + leaves = parsed_text[1].leaves.collect {|node| node.label}
  198 + assert_equal ["So", "So"], [leaves[0].current(), leaves[0].word()]
  199 + assert_equal [20,22], [leaves[0].begin_position(), leaves[0].end_position()]
  200 + assert_equal ["is", "is"], [leaves[1].current(), leaves[1].word()]
  201 + assert_equal [23,25], [leaves[1].begin_position(), leaves[1].end_position()]
  202 + assert_equal ["she", "she"], [leaves[2].current(), leaves[2].word()]
  203 + assert_equal [26,29], [leaves[2].begin_position(), leaves[2].end_position()]
  204 + assert_equal [".", "."], [leaves[3].current(), leaves[3].word()]
  205 + assert_equal [29,30], [leaves[3].begin_position(), leaves[3].end_position()]
  206 + # Verify that the original string is recoverable.
  207 + assert_equal "He (John) is tall.  ", parsed_text[0].to_original_string
  208 + assert_equal "So is she.", parsed_text[1].to_original_string
  209 + # Draw < and > brackets around 3 constituents.
  210 + b = parsed_text[0].to_bracketed_string([[0,0], [0,0,1,1], [0,1,1]], "<", ">")
  211 + assert_equal "<He (<John>)> is <tall>.  ", b
  212 + end
  213 +end
  214 +
  215 +
  216 +class MiscPreprocessorTestCase < Test::Unit::TestCase
  217 + def test_model_location
  218 + assert_equal "$(ROOT)/englishPCFG.ser.gz", StanfordParser::ENGLISH_PCFG_MODEL
  219 + end
  220 +
  221 + def test_word
  222 + assert StanfordParser::Word.new("edu.stanford.nlp.ling.Word", "dog") == "dog"
  223 + end
  224 +end # MiscPreprocessorTestCase
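The begin_position and end_position values asserted in the tests above behave like half-open character offsets into the source text. The sketch below illustrates that reading without loading the parser: a plain Struct (hypothetical, standing in for StanfordParser::StandoffToken) carries the offsets copied from the assertions, and the sample text is assumed to have two spaces between the sentences, which is what an end offset of 18 followed by a begin offset of 20 implies.

```ruby
# Stand-in for a standoff token; not the gem's class.
Token = Struct.new(:current, :begin_position, :end_position)

# Assumed source text: two spaces separate the sentences.
TEXT = "He (John) is tall.  So is she."

# Offsets copied from the test assertions.
TOKENS = [
  Token.new("He", 0, 2),     Token.new("(", 3, 4),
  Token.new("John", 4, 8),   Token.new(")", 8, 9),
  Token.new("is", 10, 12),   Token.new("tall", 13, 17),
  Token.new(".", 17, 18),    Token.new("So", 20, 22),
  Token.new("is", 23, 25),   Token.new("she", 26, 29),
  Token.new(".", 29, 30),
]

# The token text is the half-open slice [begin, end) of the source.
def slice_for(text, token)
  text[token.begin_position...token.end_position]
end

# Rebuild the source from the tokens plus the gaps between them. The gap
# plays the role that label.after plays in to_original_string.
def rebuild(text, tokens)
  tokens.each_with_index.map do |token, i|
    following = tokens[i + 1]
    gap = following ? text[token.end_position...following.begin_position] : ""
    token.current + gap
  end.join
end

TOKENS.each { |t| puts "#{t.current.inspect} <- #{slice_for(TEXT, t).inspect}" }
puts rebuild(TEXT, TOKENS) == TEXT
```

Keeping offsets rather than normalized words is what lets a token such as "(" map back to the original character even when the parser sees it as -LRB-.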
