Merge 8e861be into 961d48b
a-fent committed Nov 10, 2017
2 parents 961d48b + 8e861be commit 1e3375a
Showing 3 changed files with 65 additions and 22 deletions.
54 changes: 36 additions & 18 deletions README.md
@@ -79,30 +79,45 @@ Usage

### Parsing

You can access the main Anystyle-Parser instance at `Anystyle.parser`;
the `#parse` method is also available via `Anystyle.parse`. For more complex
requirements (e.g., if you need multiple Parser instances simultaneously) you
can create your own instances from the `Anystyle::Parser::Parser` class.
You can access the default Anystyle-Parser instance at `Anystyle.parser`;
the `#parse` method is also available via `Anystyle.parse`. A very
simple example in IRB:

The two fundamental methods you need to know about in order to use
Anystyle-Parser are `#parse` and `#train` that both accept two arguments.
    > require 'anystyle/parser'
    > Anystyle.parse 'Poe, Edgar A. Essays and Reviews. New York: Library of America, 1984.'
    => [{:author=>"Poe, Edgar A.", :title=>"Essays and Reviews",
         :location=>"New York", :publisher=>"Library of America",
         :year=>1984, :type=>:book}]

This uses a default model trained on the example citation strings that
ship with Anystyle (in `/resources/`). Anystyle also supports more
complex requirements, for example when the default model does not parse
some of your citations correctly, or when you need several different
parsers. You can train, save and load your own models; see the section
on training your own model below.

For all usage, the fundamental method that you need to know about in
order to use Anystyle-Parser is `#parse`.

    Parser#parse(input, format = :hash)
    Parser#train(input = options[:training_data], truncate = true)


`#parse` parses the passed-in input (either a filename, your reference strings,
or an array of your reference strings; files are only opened if the string is
not tainted) and returns the parsed data in the
format specified as the second argument (supported formats include: *:hash*,
*:bibtex*, *:citeproc*, *:tags*, and *:raw*).

    Parser#train(input = options[:training_data], truncate = true)

`#train` allows you to easily train the Parser's CRF model. The first argument
is either a filename (if the string is not tainted) or your data as a string;
the format of training data
follows the XML-like syntax of the
[CORA dataset](http://www.cs.umass.edu/~mccallum/data/cora-ie.tar.gz); the
optional boolean argument lets you decide whether to train the existing
model or to create an entirely new one.
model or to create an entirely new one. (**Note**: the addition of new
training data to an existing model may not be working correctly at the
moment, see https://github.com/inukshuk/anystyle-parser/issues/62).
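
As a rough illustration of the tagged training format, the sketch below builds one training line from labelled segments. The helper `tag_reference` is hypothetical (it is not part of Anystyle), and the label names used here are assumptions; use the labels your model actually knows about.

```ruby
# Hypothetical helper, not part of Anystyle: wraps each labelled segment
# of a single reference in XML-like tags and joins them into one line.
# Each segment is a [label, text] pair; punctuation stays inside the segment.
def tag_reference(segments)
  segments.map { |label, text| "<#{label}> #{text.strip} </#{label}>" }.join(' ')
end

# Label names below are assumed for illustration only.
line = tag_reference([
  [:author,    'Poe, Edgar A.'],
  [:title,     'Essays and Reviews.'],
  [:location,  'New York:'],
  [:publisher, 'Library of America,'],
  [:date,      '1984.']
])

puts line
```

Writing one such line per reference to a UTF-8 text file gives a file of the kind that is passed to `#train` above. The helper does no escaping, so segments containing `<` or `>` would need extra care.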

The following irb session illustrates some parser goodness:

@@ -125,7 +140,7 @@ The following irb session illustrates some parser goodness:
}
=> nil

### Unhappy with the results?
### Unhappy with the results? Training your own model

Citation references come in many forms, so, inevitably, you will find data
where Anystyle-Parser does not produce satisfying parsing results.
@@ -154,19 +169,22 @@ the Parser's model what names (labels) it knows about:
Once you have tagged a few references that you want Anystyle-Parser to learn,
you can train the model as follows:

    > Anystyle.parser.train 'training.txt', false
    > my_parser = Anystyle.train_parser 'training.txt'

The training process may take some time. You can save the results of the
training process for future use:

    > my_parser.model.save 'my_model.mod'

And when you wish to re-use this model in the future, you can load it:

    > my_parser = Anystyle.load_parser 'my_model.mod'

By passing `true` as the second argument, you will discard Anystyle's default
model; the resulting model will be based entirely on your own data. By default
the new or altered model will not be saved, but you can do so at any time
by calling `Anystyle.parser.model.save` to save the model to the default file.
If you want to save the model to a different file, set the
`Anystyle.parser.model.path` attribute accordingly.

After teaching Anystyle-Parser with the tagged references, try to parse your
data again:

    > Anystyle.parse 'John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA.'
    > my_parser.parse 'John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA.'
    => [{:author=>"John Lafferty and Andrew McCallum and Fernando Pereira", :title=>"Conditional random fields: probabilistic models for segmenting and labeling sequence data", :booktitle=>"Proceedings of the International Conference on Machine Learning", :pages=>"282--289", :publisher=>"Morgan Kaufmann", :location=>"San Francisco, CA", :year=>2001, :type=>:inproceedings}]

If you want to make Anystyle-Parser smarter, please consider sending us your
19 changes: 16 additions & 3 deletions lib/anystyle/parser/parser.rb
@@ -48,6 +48,21 @@ def language(string)
return unless @language_detector
@language_detector.detect string
end

# Create a new parser with a new model trained on the passed
# data. The training data is passed as a String, not a file, and
# should be in UTF-8 encoding. If no data is passed, Anystyle's
# default training data will be used.
def train_new(training_data = nil)
  parser = new(model: nil)
  training_data ||= File.read(@defaults[:training_data],
                              encoding: 'UTF-8')
  tokenised = parser.prepare(training_data, true)
  parser.model = Wapiti::train(tokenised,
                               pattern: @defaults[:pattern])
  parser
end

end

attr_reader :options
@@ -56,9 +71,7 @@ def language(string)

def initialize(options = {})
@options = Parser.defaults.merge(options)

reload

reload if @options[:model]
@normalizer = Normalizer.instance
end

14 changes: 13 additions & 1 deletion lib/anystyle/parser/utility.rb
@@ -12,8 +12,20 @@ def self.dictionary
Parser::Dictionary.instance
end

# Create a new Anystyle parser, training it using the marked-up
# training data contained in _training_file_
def self.train_parser(training_file)
  training_data = File.read(training_file, encoding: "UTF-8")
  Parser::Parser::train_new(training_data)
end

# Load an Anystyle parser, using the saved Wapiti model contained in
# model_file.
def self.load_parser(model_file)
  Parser::Parser::load(model_file)
end

module Parser

def self.instance
Parser.instance
end
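
Taken together, the two helpers added in this file suggest a load-or-train pattern. The following is only a minimal sketch, assuming `train_parser`, `load_parser` and `parser.model.save` behave as the README changes in this commit describe. The method `cached_parser` and its `backend` parameter are hypothetical names introduced for illustration and are not part of Anystyle.

```ruby
# Hypothetical helper, not part of Anystyle: returns a parser backed by
# model_file, training it from training_file first if no saved model exists.
# `backend` must respond to load_parser and train_parser; in real use this
# would be the Anystyle module (assumption based on this commit's README).
def cached_parser(backend, model_file, training_file)
  # Reuse a previously saved model when one is available.
  return backend.load_parser(model_file) if File.exist?(model_file)

  # Otherwise train from scratch, which may take some time.
  parser = backend.train_parser(training_file)
  parser.model.save(model_file) # persist so the next run can skip training
  parser
end
```

With the gem loaded, a call might look like `cached_parser(Anystyle, 'my_model.mod', 'training.txt')`. Passing the backend in as an argument keeps the sketch independent of the gem and easy to exercise with a stub.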