Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Brill tagger #555

Closed
stevenbird opened this issue Nov 28, 2013 · 19 comments
Closed

Brill tagger #555

stevenbird opened this issue Nov 28, 2013 · 19 comments
Assignees
Milestone

Comments

@stevenbird
Copy link
Member

@muneson has contributed a new implementation of the Brill tagger in #549. I've merged it into a new feature branch brill and invite people to test it out and post comments here.

@muneson
Copy link
Contributor

muneson commented Dec 2, 2013

Suggestion: skip nltk2 compatibility

At present, in order not to break any tests, you can say

from nltk.tag.brill import (
    BrillTagger,
    ProximateTagsRule,
    ProximateWordsRule,
    ProximateTokensRule,
    ProximateTokensTemplate,
    SymmetricProximateTokensTemplate
)

and get the exact nltk2 API. This unnecessarily pollutes the
nltk.tag.brill namespace, however, so a moderate but
efficient improvement is to require the user to say

from nltk.tag.brill.nltk2 import (
    BrillTagger,
    ProximateTagsRule,
    ProximateWordsRule,
    ProximateTokensRule,
    ProximateTokensTemplate,
    SymmetricProximateTokensTemplate
)

Still, since there really is nothing which can be done with the older API only,
I would prefer to drop the nltk2 concerns entirely,
so that one need not worry about keeping legacy signatures synchronized
(say, with dummy arguments) when the signatures of their newer
counterparts are changed (e.g., by adding new keyword arguments).
If not when changing major version number, then when?

@muneson
Copy link
Contributor

muneson commented Dec 2, 2013

Warning: possibly too automated yaml serialization (but yaml is to disappear anyway)

The earlier nltk2 version required user-defined (application) subclasses (of Rule, with that API) to specify
a yaml tag. By contrast, with the suggested API, user-defined subclasses (of Feature) automatically constructs a
yaml tag from the class name, if not specifically given by the user.

There may be a reason for the original design which I have missed. Anyway, it seems yaml is scheduled for removal (#540) in favour of pickle and optional json, so I haven't put much effort into it.

@muneson
Copy link
Contributor

muneson commented Dec 2, 2013

A note on the place of 'brill tagging' in the namespace hierarchy

In my view, both of the terms "brill" and "tagging" get increasingly inaccurate when extensions to the basic transformation-based learning (TBL) paradigm are considered. Brill invented the basic paradigm and certainly deserves all credit for that. However, TBL has been much developed since. In an upcoming survey on TBL, I review two dozen papers on later extensions to basic TBL, and only one of them is written by Brill.

Also, I find TBL is a quite flexible method, a general paradigm for discrete structured prediction, rather than just "tagging" (however that is interpreted). Classification on sequences, yes; but also for instance predictions on iid data, on parse trees, or on 2+-dimensional grids; and also prediction of sets of or probability distributions on labels. TBL has even been applied to regression analysis.

Admittedly, the present implementation covers almost none of those, but hopefully it will grow over time. When it does, I find "tbl" a more neutral name than "brill"; and to the extent that the division between "tagging" and "classification" is tenable, I think tbl rather belongs in the second camp.

@stevenbird
Copy link
Member Author

There is no need for the new implementation to be backwards compatible.

@ghost ghost assigned stevenbird Dec 3, 2013
@stevenbird
Copy link
Member Author

There's several aspects of this package that do not fit NLTK's model for subpackages:

  • it's the first case of a subsubpackage -- I don't see a problem with this
  • it contains further subpackages (demo, application, trainer) -- I don't see the need for these to be subpackages of nltk.tag.brill. Can the demo and trainer be modules at the top level? Can the application live in nltk.app?
  • it contains code for a previous version -- can we remove nltk2?

Also, if this is a complete rewrite, I think the authorship block should reflect that by giving a single author and listing the others as authors of the previous version on which this is (loosely?) based.

It has a yaml dependency and we've just removed yaml from the rest of the toolkit, cf #540

@muneson would you have time to look into these?

@muneson
Copy link
Contributor

muneson commented Dec 12, 2013

@muneson would you have time to look into these?

Sure, although realistically not much will happen until second half of January.
Very briefly,

  • removing nltk2 is what I suggested above, so no problem
  • yaml replaced by pickle and optionally json: will do
  • authorship: a lot (probably most) code is new but some is lifted with few or even no
    changes. Maybe authorship can be reasonably described on the level of individual modules.
    I will put together some basic statistics for a more informed decision. No big deal as far
    as I am concerned, though.
  • subpackaging: Subpackage trainer and friends are in my view important for extensibility -- there
    should be an obvious place to put later extensions (the previous version had none of that,
    and consequently was difficult to build on). However, crucially Python makes it easy
    to define a nice API irrespective of the underlying module structure, by deciding which names should
    be importable from which package. So the nltk.tag.brill namespace could (should) be set up
    (in nltk.tag.brill.__init__.py, by from .<subpack> import <name>)
    to hold all names which are always needed, say Template,
    and also those for which there are obvious defaults, say TaggerTrainer.
    I think the actual division into modules is a concern for the implementor, however,
    and for any package which can be expected to grow, smaller modules are better than bigger.
    (As a sidenote, wc nltk/**/*.py shows a dozen modules with line counts in the thousands;
    they would almost certainly gain maintainability and extensibility if split into subpackages
    in that way, with no change in APIs.)
  • putting things under nltk.app: Not sure. nltk.app sounds like a bit like predefined, fixed functionality.
    I rather intended nltk.tag.brill.application as a place to showcase TBL flexibility. I will add more things
    there later, especially examples showing radically different problem encodings. For instance, dialogue act tagging (see http://www.ling.gu.se/~lager/Mutbl/Papers/da_ssomc.pdf) -- this is interesting as an example,
    but probably not as a ready-to-use application, and it would be odd to place it in nltk.app.
    What about instead merging nltk.tag.brill.demo and nltk.tag.brill.application into nltk.tag.brill.examples?
    Application-specific names are not always needed and thus should not pollute the top-level namespace.
  • If backwards compatibility is anyway to be sacrificed, I suggest (see above) that the entire thing is
    moved to nltk.classifier.tbl instead. Due to Brill and many others, TBL is a general ML technique,
    it applies to discrete classification of sequences, but also to other problems.

@kmike
Copy link
Member

kmike commented Dec 12, 2013

Hey @muneson,

I wanted to comment on nltk.classify vs nltk.tag: currently nltk.tag contains modules for structured prediction, while nltk.classify contains modules for classification of individual observations. Brill tagger works on sequences, so a better place for it is nltk.tag.

@stevenbird
Copy link
Member Author

  • @jskda recently replaced yaml with json in other packages and could do it here as well, if that helps
  • I don't have a strong opinion about whether this belongs in the classify or tag packages (note that sentence boundary detection is a classification task, classifying punctuation as sentence-breaking or not, and yet it lives in the tokenize package, not the classify package)
  • I don't understand the logic of embedding application code inside a library; we've done that in nltk.app as a special case; none of the other NLTK packages contain applications -- I would rather leave this out and move ahead with the library functionality, and return to the question later
  • the trainer subpackage contains two trainers, but one of them is the original one which I think should be removed, so we're left with one trainer and I think it can be moved up to the parent level for now
  • since this contribution changes the NLTK API, I'm keen to include it in the NLTK 3.0 beta which could go out any day now; I favour an imperfect solution that gets the new interface into NLTK and leaves unresolved some of the code structuring questions that have no visibility at the API level.

@heatherleaf
Copy link
Contributor

I guess that this means the following:

  • remove directories nltk2 and trainer
  • move trainer/fast.py to trainer.py
  • update __init__.py accordingly
  • replace yaml with json
  • keep brill in the tag package (not move to classify)

Regarding the application directory:

  • it does not fit in nltk.app, it's not a working application at all
  • currently it contains one file which defines the Word and Pos features
  • I suggest to rename application/postagging.py to feature.py
  • perhaps move the Feature class from template.py to feature.py
  • new features can be added to feature.py, even if they are not POS tagging features
    • if the file gets bloated we can split it in several, but that can wait until later

The demo directory is more similar to nltk.app, but it's not a working application, more like an extended version of the demo() functions that are in most NLTK files. It would be good to add something like this to demo/__init__.py:

if __name__ == '__main__':
    postagging.demo()

To make it possible to run this from the command line:

$ python -m nltk.tag.brill.demo

@heatherleaf
Copy link
Contributor

@muneson Do you have time to do these minor changes? Otherwise I can give it a try next week.

@heatherleaf
Copy link
Contributor

I had some time over, so I did some reorganization of the code, see my branch "brill-simplified", this commit: 46be033

The demo is run like this:

python -m nltk.tag.brill.demo.postagging

And the doctests are all passed:

python -m nltk.tag.brill.template
python -m nltk.tag.brill.rule
python -m nltk.tag.brill.trainer
python -m nltk.tag.brill.tagger

I've tested with python 2.7 and 3.3

@muneson Are you fine with these changes?

@muneson
Copy link
Contributor

muneson commented Dec 13, 2013

I have no time today, I will look more into it tomorrow. Briefly,

  • I still prefer nltk.tag.tbl to nltk.tag.brill (stronger preference
    than nltk.tag vs nltk.classify, which I don't think is possible
    to resolve with full consistency)

  • I understand that the slow trainer BrillOrig perhaps should not be
    exposed by default in the API but I don't understand the point of throwing
    it away. It has, if nothing else,
    a) pedagogical value in teaching (which may be the reason some people
    use nltk in the first place) and
    b) use as a reference implementation (for any future attempts at improving
    efficiency, say)

  • I see zero advantages of
    trainer.py:

    class Trainer()

as opposed to the identical API
trainer/ __init__.py:

from .impl1 import Trainer

trainer/impl1:

class Trainer()

trainer/impl2:

class Trainer()

Instead, I just see disadvantages. For instance,
while not checked in and not entirely finished, I already
have an impl2 of an alternative trainer with other space/time tradeoffs.

  • Subclasses of Feature are user-defined, application-specific code which
    depend on the input data format. They should be clearly
    marked as application-specific and not placed with Feature.

2013/12/13 Peter Ljunglöf notifications@github.com

I had some time over, so I did some reorganization of the code, see my
branch "brill-simplified", this commit: 46be03346be033

The demo is run like this:

python -m nltk.tag.brill.demo.postagging

And the doctests are all passed:

python -m nltk.tag.brill.template
python -m nltk.tag.brill.rule
python -m nltk.tag.brill.trainer
python -m nltk.tag.brill.tagger

I've tested with python 2.7 and 3.3

@muneson https://github.com/muneson Are you fine with these changes?


Reply to this email directly or view it on GitHubhttps://github.com//issues/555#issuecomment-30494567
.

@heatherleaf
Copy link
Contributor

  • nltk.tag.tbl is fine with me
  • keeping the trainer directory is also ok, and brillorig.py for the reasons you write
  • but I don't agree with you about the features:
    • just because some very common features (Word, Tag) happen to be defined in feature.py, that doesn't suggest that they are in some way better than other features
    • but the docstrings of Word and Tag (and other features) should state very clear what they can be used for
    • anyway application is not a good package name, since people can start believing that it contains an application
  • here's another suggestion:
    • a directory feature, with the files api.py (containing the Feature class), and postagging.py (containing Word and Tag)

@muneson @stevenbird @kmike What do you say?

(In some way I feel this is a Java vs Python issue: In Java, all classes have to be in different files, and people often create a very deep hierarcichal folder/module structure. In Python the modules are often flatter, and several classes can be put in the same file. But now I'm getting philosophical.)

@muneson
Copy link
Contributor

muneson commented Dec 13, 2013

I'm on it right now, will have something concrete tomorrow. An
application, in the sense I intended, is applying the technique to some
domain or some specific problem, not a ready-written program.
An TBL application generally requires the user to specify a) features
(inherited from Feature) and b) templates.

here's another suggestion:

a directory feature, with the files api.py (containing the Feature class),
and postagging.py (containing Word and Tag)

Actually, this was quite literally the design I had in an earlier version,
and
on the whole I like it. I changed "feature" to "application" when I didn't
know where
to put pre-defined templates for a given task. But I agree "application"
is a name bound to create confusion.

I'm happy with your suggestion, and would put predefined templates ifor pos
n
feature/postagging.py. Another possibility is to instead name it "task",
like so:

task/api.py #Feature
task/postagging.py #Word, Pos; sample template sets
task/dacttagging.py #SpeakerShift, SentLen, many others; sample template
sets
task/...

2013/12/14 Peter Ljunglöf notifications@github.com

  • nltk.tag.tbl is fine with me
  • keeping the trainer directory is also ok, and brillorig.py for the
    reasons you write
  • but I don't agree with you about the features:
    • just because some very common features (Word, Tag) happen to be
      defined in feature.py, that doesn't suggest that they are in some
      way better than other features
    • but the docstrings of Word and Tag (and other features) should
      state very clear what they can be used for
    • anyway application is not a good package name, since people can
      start believing that it contains an application
      • here's another suggestion:
    • a directory feature, with the files api.py (containing the
      Feature class), and postagging.py (containing Word and Tag)

@stevenbird https://github.com/stevenbird @kmikehttps://github.com/kmikeWhat do you say?

(In some way I feel this is a Java vs Python issue: In Java, all classes
have to be in different files, and people often create a very deep
hierarcichal folder/module structure. In Python the modules are often
flatter, and several classes can be put in the same file. But now I'm
getting philosophical.)


Reply to this email directly or view it on GitHubhttps://github.com//issues/555#issuecomment-30551802
.

@muneson
Copy link
Contributor

muneson commented Dec 13, 2013

2013/12/14 Marcus Uneson marcus.uneson@gmail.com

I'm happy with your suggestion, and would put predefined templates ifor
pos n
feature/postagging.py.

this got unreadable:
I'm happy with your suggestion, and would put predefined
templates for postagging in feature/postagging.py without worrying too much.

@stevenbird
Copy link
Member Author

I'm glad to see that we're making progress. Thanks very much @heatherleaf, @kmike, and @muneson.

I would like to be conservative about deviations from NLTK's established package structure, and so I am reluctant to add sub-sub-sub-packages to NLTK unless there's a strong case. There are so many reasonable ways we could do things, and many decisions about style are essentially arbitrary (cf PEP-8), but simplicity and consistency are important attributes and make our long-term maintenance task much easier. Where would we be if each contribution sought to innovate in different ways.

  • I'm happy with nltk.tag.tbl, but can we consider one other option: nltk.classify.tbl which is used by nltk.tag.brill? This assumes that we could accommodate tbl into (a possibly extended) ClassifierI. If not, then I would like to consider a new interface nltk.classify.SequenceClassifierI or even a new subpackage for sequence classification. I guess tag.hmm needs to be put in the same place. Thoughts?
  • I think we should stick with our existing nomenclature about "application", and not create modules or packages with this name.

@muneson
Copy link
Contributor

muneson commented Dec 14, 2013

A new PR with changes as per the API-relevant discussions (except JSON and
the move to nltk.classify
left undecided).

#564

2013/12/14 Steven Bird notifications@github.com

I'm glad to see that we're making progress. Thanks very much @heatherleafhttps://github.com/heatherleaf,
@kmike https://github.com/kmike, and @munesonhttps://github.com/muneson
.

I would like to be conservative about deviations from NLTK's established
package structure, and so I am reluctant to add sub-sub-sub-packages to
NLTK unless there's a strong case. There are so many reasonable ways we
could do things, and many decisions about style are essentially arbitrary
(cf PEP-8), but simplicity and consistency are important attributes and
make our long-term maintenance task much easier. Where would we be if each
contribution sought to innovate in different ways.

I'm happy with nltk.tag.tbl, but can we consider one other option:
nltk.classify.tbl which is used by nltk.tag.brill? This assumes that
we could accommodate tbl into (a possibly extended) ClassifierI. If
not, then I would like to consider a new interface
nltk.classify.SequenceClassifierI or even a new subpackage for
sequence classification. I guess tag.hmm needs to be put in the same
place. Thoughts?

I think we should stick with our existing nomenclature about
"application", and not create modules or packages with this name.


Reply to this email directly or view it on GitHubhttps://github.com//issues/555#issuecomment-30554117
.

@stevenbird stevenbird modified the milestones: nltk3beta2, nltk3beta Feb 6, 2014
@stevenbird stevenbird modified the milestones: nltk3 final, nltk3beta2 Apr 18, 2014
@stevenbird
Copy link
Member Author

Resolved. See #549, #564

@muneson
Copy link
Contributor

muneson commented Jun 25, 2014

For a number of reasons I haven't been able to touch this project for some time, but I will look into the remaining issues next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants