Merge 8e861be into 961d48b

inukshuk · Nov 10, 2017 · 1e3375a
2 parents 961d48b + 8e861be
commit 1e3375a
Showing 3 changed files with 65 additions and 22 deletions.
diff --git a/README.md b/README.md
@@ -79,30 +79,45 @@ Usage
 
 ### Parsing
 
-You can access the main Anystyle-Parser instance at `Anystyle.parser`;
-the `#parse` method is also available via `Anystyle.parse`. For more complex
-requirements (e.g., if you need multiple Parser instances simultaneously) you
-can create your own instances from the `Anystyle::Parser::Parser` class.
+You can access the default Anystyle-Parser instance at `Anystyle.parser`;
+the `#parse` method is also available via `Anystyle.parse`. A very
+simple example in IRB:
 
-The two fundamental methods you need to know about in order to use
-Anystyle-Parser are `#parse` and `#train` that both accept two arguments.
+    > require 'anystyle/parser'
+    > Anystyle.parse 'Poe, Edgar A. Essays and Reviews. New York: Library of America, 1984.'
+    => [{:author=>"Poe, Edgar A.", :title=>"Essays and Reviews",
+    :location=>"New York", :publisher=>"Library of America",
+    :year=>1984, :type=>:book}]
+
+This uses a default model trained on the example citation strings that
+ship with Anystyle (in `/resources/`). Anystyle also supports more
+complex requirements: for example, if Anystyle's default model does
+not parse some of your citations correctly, or if you need several
+different parsers, you can train, save, and load your own models; see
+"Training your own model" below.
+
+Whatever your use case, the fundamental method you need to know in
+order to use Anystyle-Parser is `#parse`.
 
     Parser#parse(input, format = :hash)
-    Parser#train(input = options[:training_data], truncate = true)
-
+
 `#parse` parses the passed-in input (either a filename, your reference strings,
 or an array of your reference strings; files are only opened if the string is
 not tainted) and returns the parsed data in the
 format specified as the second argument (supported formats include: *:hash*,
 *:bibtex*, *:citeproc*, *:tags*, and *:raw*).
 
+    Parser#train(input = options[:training_data], truncate = true)
+
 `#train` allows you to easily train the Parser's CRF model. The first argument
 is either a filename (if the string is not tainted) or your data as a string;
 the format of training data
 follows the XML-like syntax of the
 [CORA dataset](http://www.cs.umass.edu/~mccallum/data/cora-ie.tar.gz); the
 optional boolean argument lets you decide whether to train the existing
-model or to create an entirely new one.
+model or to create an entirely new one. (**Note**: adding new
+training data to an existing model may not currently work correctly;
+see https://github.com/inukshuk/anystyle-parser/issues/62.)
 
 The following irb sessions illustrates some parser goodness:
 
@@ -125,7 +140,7 @@ The following irb sessions illustrates some parser goodness:
     }
     => nil
 
-### Unhappy with the results?
+### Unhappy with the results? Training your own model
 
 Citation references come in many forms, so, inevitably, you will find data
 where Anystyle-Parser does not produce satisfying parsing results.
@@ -154,19 +169,22 @@ the Parser's model what names (labels) it knows about:
 Once you have tagged a few references that you want Anystyle-Parser to learn,
 you can train the model as follows:
 
-    > Anystyle.parser.train 'training.txt', false
+    > my_parser = Anystyle.train_parser 'training.txt'
+
+Training may take some time. You can save the trained model for
+future use:
+
+    > my_parser.model.save 'my_model.mod'
+
+And when you wish to re-use this model in the future, you can load it:
+
+    > my_parser = Anystyle.load_parser 'my_model.mod'
 
-By passing `true` as the second argument, you will discard Anystyle's default
-model; the resulting model will be based entirely on your own data. By default
-the new or altered model will not be saved, but you can do so at any time
-by calling `Anystyle.parser.model.save` to save the model to the default file.
-If you want to save the model to a different file, set the
-`Anystyle.parser.model.path` attribute accordingly.
 
 After teaching Anystyle-Parser with the tagged references, try to parse your
 data again:
 
-    > Anystyle.parse 'John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA.'
+    > my_parser.parse 'John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA.'
     => [{:author=>"John Lafferty and Andrew McCallum and Fernando Pereira", :title=>"Conditional random fields: probabilistic models for segmenting and labeling sequence data", :booktitle=>"Proceedings of the International Conference on Machine Learning", :pages=>"282--289", :publisher=>"Morgan Kaufmann", :location=>"San Francisco, CA", :year=>2001, :type=>:inproceedings}]
 
 If you want to make Anystyle-Parser smarter, please consider sending us your

diff --git a/lib/anystyle/parser/parser.rb b/lib/anystyle/parser/parser.rb
@@ -48,6 +48,21 @@ def language(string)
           return unless @language_detector
           @language_detector.detect string
         end
+
+        # Create a new parser with a new model trained on the passed
+        # data. The training data is passed as a String, not a file, and
+        # should be in UTF-8 encoding. If no data is passed, Anystyle's
+        # default training data will be used.
+        def train_new(training_data = nil)
+          parser = new(model: nil)
+          training_data ||= File.read(@defaults[:training_data],
+                                      encoding: 'UTF-8')
+          tokenised = parser.prepare(training_data, true)
+          parser.model = Wapiti::train(tokenised,
+                                       pattern: @defaults[:pattern])
+          parser
+        end
+
       end
 
       attr_reader :options
@@ -56,9 +71,7 @@ def language(string)
 
       def initialize(options = {})
         @options = Parser.defaults.merge(options)
-
-        reload
-
+        reload if @options[:model]
         @normalizer = Normalizer.instance
       end
 

diff --git a/lib/anystyle/parser/utility.rb b/lib/anystyle/parser/utility.rb
@@ -12,8 +12,20 @@ def self.dictionary
     Parser::Dictionary.instance
   end
 
+  # Create a new Anystyle parser, training it on the marked-up
+  # training data contained in _training_file_.
+  def self.train_parser(training_file)
+    training_data = File.read(training_file, encoding: "UTF-8")
+    Parser::Parser::train_new(training_data)
+  end
+
+  # Load an Anystyle parser, using the saved Wapiti model contained in
+  # model_file.
+  def self.load_parser(model_file)
+    Parser::Parser::load(model_file)
+  end
+
   module Parser
-
     def self.instance
       Parser.instance
     end