
First commit, this is the stock gem at v2.2.0

0 parents commit 1460c236b0a48a6f1fcb8f7aff9fccac6262797e John Wilkinson committed Jun 17, 2010
Showing with 992 additions and 0 deletions.
  1. +123 −0 README
  2. +46 −0 examples/stanford-sentence-parser.rb
  3. +129 −0 lib/java_object.rb
  4. +470 −0 lib/stanfordparser.rb
  5. +224 −0 test/test_stanfordparser.rb
@@ -0,0 +1,123 @@
+= Stanford Natural Language Parser Wrapper
+
+This module is a wrapper for the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
+
+The Stanford Natural Language Parser is a Java implementation of a probabilistic context-free grammar (PCFG) parser and dependency parser for English, German, Chinese, and Arabic. This module provides a thin wrapper around the Java code to make it accessible from Ruby, along with pure Ruby objects that enable standoff parsing.
+
+
+= Installation and Configuration
+
+In addition to the Ruby gems it requires, to run this module you must manually install the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
+
+This module expects the parser to be installed in the <tt>/usr/local/stanford-parser/current</tt> directory on UNIX platforms and in the <tt>C:\stanford-parser\current</tt> directory on Windows platforms. This is the directory that contains the <tt>stanford-parser.jar</tt> file. When the module is loaded, it adds this directory to the Java classpath and launches the Java VM with the arguments <tt>-server -Xmx150m</tt>.
+
+These defaults can be overridden by creating the configuration file <tt>/etc/ruby_stanford_parser.yaml</tt> on UNIX platforms and <tt>C:\stanford-parser\ruby-stanford-parser.yaml</tt> on Windows platforms. This file is in the Ruby YAML[http://ruby-doc.org/stdlib/libdoc/yaml/rdoc/index.html] format, and may contain two values: <tt>root</tt> and <tt>jvmargs</tt>. For example, the file might look like the following:
+
+ root: /usr/local/stanford-parser/other/location
+ jvmargs: -Xmx100m -verbose
+
+
+= Tokenization and Parsing
+
+Use the StanfordParser::DocumentPreprocessor class to tokenize text and files into sentences and words.
+
+ >> require "stanfordparser"
+ => true
+ >> preproc = StanfordParser::DocumentPreprocessor.new
+ => <DocumentPreprocessor>
+ >> puts preproc.getSentencesFromString("This is a sentence. So is this.")
+ This is a sentence .
+ So is this .
+
+Use the StanfordParser::LexicalizedParser class to parse sentences.
+
+ >> parser = StanfordParser::LexicalizedParser.new
+ Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [5.5 sec].
+ => edu.stanford.nlp.parser.lexparser.LexicalizedParser
+ >> puts parser.apply("This is a sentence.")
+ (ROOT
+ (S [24.917]
+ (NP [6.139] (DT [2.300] This))
+ (VP [17.636] (VBZ [0.144] is)
+ (NP [12.299] (DT [1.419] a) (NN [8.897] sentence)))
+ (. [0.002] .)))
+
+For complete details about the use of these classes, see the documentation on the Stanford Natural Language Parser website.
+
+
+= Standoff Tokenization and Parsing
+
+This module also contains support for standoff tokenization and parsing, in which the terminal nodes of parse trees contain information about the text that was used to generate them.
+
+Use the StanfordParser::StandoffDocumentPreprocessor class to tokenize text and files into sentences and words.
+
+ >> preproc = StanfordParser::StandoffDocumentPreprocessor.new
+ => <StandoffDocumentPreprocessor>
+ >> s = preproc.getSentencesFromString("This is a sentence. So is this.")
+ => [This is a sentence., So is this.]
+
+The standoff preprocessor returns StanfordParser::StandoffToken objects, which contain character offsets into the original text along with information about spacing characters that came before and after the token.
+
+ >> puts s
+ This [0,4]
+ is [5,7]
+ a [8,9]
+ sentence [10,18]
+ . [18,19]
+ So [21,23]
+ is [24,26]
+ this [27,31]
+ . [31,32]
+ >> "This is a sentence. So is this."[27..31]
+ => "this."
+
+This is the same information contained in the <tt>edu.stanford.nlp.ling.FeatureLabel</tt> class in the Stanford Parser Java implementation.
+
+Similarly, use the StanfordParser::StandoffParsedText object to parse a block of text into StanfordParser::StandoffNode parse trees whose terminal nodes are StanfordParser::StandoffToken objects.
+
+ >> t = StanfordParser::StandoffParsedText.new("This is a sentence. So is this.")
+ Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [4.9 sec].
+ => <StanfordParser::StandoffParsedText, 2 sentences>
+ >> puts t.first
+ (ROOT
+ (S
+ (NP (DT This [0,4]))
+ (VP (VBZ is [5,7])
+ (NP (DT a [8,9]) (NN sentence [10,18])))
+ (. . [18,19])))
+
+Standoff parse trees can reproduce the text from which they were generated verbatim.
+
+ >> t.first.to_original_string
+ => "This is a sentence. "
+
+They can also reproduce the original text with brackets inserted around the yields of specified parse nodes.
+
+ >> t.first.to_bracketed_string([[0,0,0], [0,1,1]])
+ => "[This] is [a sentence]. "
+
+The format of the coordinates used to specify individual nodes is described in the documentation for the Ruby Treebank[http://rubyforge.org/projects/treebank/] gem.
+
+See the documentation of the individual classes in this module for more details.
+
+Unlike their parents StanfordParser::DocumentPreprocessor and StanfordParser::LexicalizedParser, which produce Ruby wrappers around Java objects, StanfordParser::StandoffDocumentPreprocessor and StanfordParser::StandoffParsedText produce pure Ruby objects. This is to facilitate serialization of these objects using tools like the Marshal module, which cannot serialize Java objects.
+
+= History
+
+1.0.0:: Initial release
+1.1.0:: Make module initialization function private. Add example code.
+1.2.0:: Read Java VM arguments from the configuration file. Add Word class.
+2.0.0:: Add support for standoff parsing. Change the way Rjb::JavaObjectWrapper wraps returned values: see wrap_java_object for details. Rjb::JavaObjectWrapper supports static members. Minor changes to stanford-sentence-parser script.
+2.1.0:: Use different default paths on Windows machines. Minor changes to StandoffToken definition.
+2.2.0:: Add parent information to StandoffNode
+
+= Copyright
+
+Copyright 2007-2008, William Patrick McNeill
+
+This program is distributed under the GNU General Public License.
+
+
+= Author
+
+W.P. McNeill mailto:billmcn@gmail.com
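The README's standoff section describes tokens that carry character offsets into the original text. A minimal pure-Ruby sketch of that idea follows, with no Java or Rjb involved. `Token` and `standoff_tokens` are hypothetical names invented for illustration and are not the gem's StanfordParser::StandoffToken API; the sketch also assumes the offsets are half-open, so a token's text is recovered with an exclusive range.

```ruby
# Hypothetical stand-in for a standoff token: a word plus the half-open
# [begin_offset, end_offset) span it occupies in the source text.
# This is NOT the gem's StandoffToken class.
Token = Struct.new(:word, :begin_offset, :end_offset)

# Naive tokenizer (assumption: words are runs of word characters, and each
# punctuation character is its own token). Records offsets for every token.
def standoff_tokens(text)
  tokens = []
  text.scan(/\w+|[^\w\s]/) do |word|
    match = Regexp.last_match
    tokens << Token.new(word, match.begin(0), match.end(0))
  end
  tokens
end

text = "This is a sentence. So is this."
tokens = standoff_tokens(text)
tokens.each { |t| puts "#{t.word} [#{t.begin_offset},#{t.end_offset}]" }

# An exclusive range (three dots) recovers exactly the token's text.
last_word = tokens[-2]
puts text[last_word.begin_offset...last_word.end_offset]
```

Because each token only stores offsets into the text, the original string stays untouched and tokens remain plain Ruby objects that serialize cleanly.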
@@ -0,0 +1,46 @@
+#!/usr/bin/env ruby
+
+#--
+
+# Copyright 2007-2008 William Patrick McNeill
+#
+# This file is part of the Stanford Parser Ruby Wrapper.
+#
+# The Stanford Parser Ruby Wrapper is free software; you can redistribute it
+# and/or modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation; either version 2 of the License,
+# or (at your option) any later version.
+#
+# The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
+# useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
+# Public License for more details.
+#
+# You should have received a copy of the GNU General Public License along with
+# editalign; if not, write to the Free Software Foundation, Inc., 51 Franklin
+# St, Fifth Floor, Boston, MA 02110-1301 USA
+#
+#++
+
+# == Synopsis
+#
+# Parse a sentence passed in on the command line.
+#
+# == Usage
+#
+# stanford-sentence-parser.rb [options] sentence
+#
+# options::
+# See the Java Stanford Parser documentation for details
+#
+# sentence::
+# A sentence to parse. This must appear after all the options and be quoted.
+
+
+require "stanfordparser"
+
+# The last argument is the sentence. The rest of the command line is passed
+# along to the parser object.
+sentence = ARGV.pop
+parser = StanfordParser::LexicalizedParser.new(StanfordParser::ENGLISH_PCFG_MODEL, ARGV)
+puts parser.apply(sentence)
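The example script treats the last command-line argument as the sentence and forwards everything before it to the parser. The sketch below isolates that argument handling so it can be tried without the parser installed. `split_command_line` is a hypothetical helper, and `-someOption` is a placeholder rather than a real Stanford Parser flag.

```ruby
# Split an argument list into parser options and the sentence to parse.
# Works on a copy, so the caller's array (e.g. ARGV) is left unmodified.
# Hypothetical helper; the script above pops from ARGV directly.
def split_command_line(argv)
  options = argv.dup
  sentence = options.pop # nil when no arguments were given
  [options, sentence]
end

# "-someOption" and "value" are placeholders, not real parser options.
options, sentence = split_command_line(["-someOption", "value", "This is a sentence."])
puts "options:  #{options.inspect}"
puts "sentence: #{sentence.inspect}"
```

Returning `nil` for a missing sentence leaves it to the caller to print a usage message instead of handing `nil` to the parser.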
@@ -0,0 +1,129 @@
+# Copyright 2007-2008 William Patrick McNeill
+#
+# This file is part of the Stanford Parser Ruby Wrapper.
+#
+# The Stanford Parser Ruby Wrapper is free software; you can redistribute it
+# and/or modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation; either version 2 of the License,
+# or (at your option) any later version.
+#
+# The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
+# useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
+# Public License for more details.
+#
+# You should have received a copy of the GNU General Public License along with
+# editalign; if not, write to the Free Software Foundation, Inc., 51 Franklin
+# St, Fifth Floor, Boston, MA 02110-1301 USA
+
+# Extensions to the {Ruby-Java Bridge}[http://rjb.rubyforge.org/] module that
+# add a generic Java object wrapper class.
+module Rjb
+
+ #--
+ # The documentation for this class appears next to its extension inside the
+ # StanfordParser module in stanfordparser.rb. This should be changed if Rjb
+ # is ever moved into its own gem. See the documentation in stanfordparser.rb
+ # for more details.
+ #++
+ class JavaObjectWrapper
+ include Enumerable
+
+ # The underlying Java object.
+ attr_reader :java_object
+
+ # Initialize with a Java object <em>obj</em>. If <em>obj</em> is a
+ # String, treat it as a Java class name and instantiate it. Otherwise,
+ # treat <em>obj</em> as an instance of a Java object.
+ def initialize(obj, *args)
+ @java_object = obj.class == String ?
+ Rjb::import(obj).send(:new, *args) : obj
+ end
+
+ # Enumerate all the items in the object using its iterator. If the object
+ # has no iterator, this function yields nothing.
+ def each
+ if @java_object.getClass.getMethods.any? {|m| m.getName == "iterator"}
+ i = @java_object.iterator
+ while i.hasNext
+ yield wrap_java_object(i.next)
+ end
+ end
+ end # each
+
+ # Reflect unhandled method calls to the underlying Java object and wrap
+ # the return value in the appropriate Ruby object.
+ def method_missing(m, *args)
+ begin
+ wrap_java_object(@java_object.send(m, *args))
+ rescue RuntimeError => e
+ # The instance method failed. If the name is unknown, try it as a static
+ # method; re-raise any other error rather than silently returning nil.
+ raise unless e.message =~ /^Fail: unknown method name/
+ getClass.send(m, *args)
+ end
+ end
+
+ # Convert a value returned by a call to the underlying Java object to the
+ # appropriate Ruby object.
+ #
+ # If the value is a JavaObjectWrapper, convert it using a protected
+ # function with the name wrap_ followed by the underlying object's
+ # classname with the Java path delimiters converted to underscores. For
+ # example, a <tt>java.util.ArrayList</tt> would be converted by a function
+ # called wrap_java_util_ArrayList.
+ #
+ # If the value lacks the appropriate converter function, wrap it in a
+ # generic JavaObjectWrapper.
+ #
+ # If the value is not a JavaObjectWrapper, return it unchanged.
+ #
+ # This function is called recursively for every element in an Array.
+ def wrap_java_object(object)
+ if object.kind_of?(Array)
+ object.collect {|item| wrap_java_object(item)}
+ elsif object.respond_to?(:_classname)
+ # Ruby-Java Bridge Java objects all have a _classname member which
+ # tells the name of their Java class. Convert this to the
+ # corresponding wrapper function name.
+ wrapper_name = ("wrap_" + object._classname.gsub(/\./, "_")).to_sym
+ # The wrap_ functions are protected, so include non-public methods in the check.
+ respond_to?(wrapper_name, true) ? send(wrapper_name, object) : JavaObjectWrapper.new(object)
+ else
+ object
+ end
+ end
+
+ # Convert <tt>java.util.ArrayList</tt> objects to Ruby Array objects.
+ def wrap_java_util_ArrayList(object)
+ array_list = []
+ object.size.times do |i|
+ array_list << wrap_java_object(object.get(i))
+ end
+ array_list
+ end
+
+ # Convert <tt>java.util.HashSet</tt> objects to Ruby Set objects.
+ def wrap_java_util_HashSet(object)
+ set = Set.new
+ i = object.iterator
+ while i.hasNext
+ set << wrap_java_object(i.next)
+ end
+ set
+ end
+
+ # Show the classname of the underlying Java object.
+ def inspect
+ "<#{@java_object._classname}>"
+ end
+
+ # Use the underlying Java object's stringification.
+ def to_s
+ toString
+ end
+
+ protected :wrap_java_object, :wrap_java_util_ArrayList, :wrap_java_util_HashSet
+
+ end # JavaObjectWrapper
+
+end # Rjb
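The comments on wrap_java_object describe dispatching on a converter function named after the Java class, with the path delimiters turned into underscores. The sketch below reproduces only that dispatch rule in plain Ruby so it can be run without a JVM. `FakeJavaObject` and `WrapperSketch` are hypothetical stand-ins, and the `[:generic, ...]` fallback stands in for wrapping in a generic JavaObjectWrapper.

```ruby
# Hypothetical stand-in for an Rjb proxy object: it only exposes the
# _classname member and a Ruby array of contained items.
FakeJavaObject = Struct.new(:_classname, :items)

class WrapperSketch
  # Convert a value using a wrap_<classname> method when one exists.
  def wrap(object)
    if object.is_a?(Array)
      object.map { |item| wrap(item) }
    elsif object.respond_to?(:_classname)
      name = wrapper_name_for(object._classname)
      # Pass true so that protected converter methods are found as well.
      respond_to?(name, true) ? send(name, object) : [:generic, object._classname]
    else
      object
    end
  end

  # "java.util.ArrayList" becomes :wrap_java_util_ArrayList
  def wrapper_name_for(classname)
    ("wrap_" + classname.gsub(".", "_")).to_sym
  end

  protected

  # Converter for the fake ArrayList: wrap each contained item.
  def wrap_java_util_ArrayList(object)
    object.items.map { |item| wrap(item) }
  end
end

sketch = WrapperSketch.new
p sketch.wrapper_name_for("java.util.ArrayList")
p sketch.wrap(FakeJavaObject.new("java.util.ArrayList", [1, 2]))
p sketch.wrap(FakeJavaObject.new("java.util.LinkedList", []))
```

Supporting another Java class under this scheme only requires defining one more `wrap_` method; the dispatch code itself does not change.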