Brill tagger #555

stevenbird · 2013-11-28T22:03:18Z

@muneson has contributed a new implementation of the Brill tagger in #549. I've merged it into a new feature branch brill and invite people to test it out and post comments here.

The text was updated successfully, but these errors were encountered:

muneson · 2013-12-02T12:24:56Z

Suggestion: skip nltk2 compatibility

At present, in order not to break any tests, you can say

from nltk.tag.brill import (
    BrillTagger,
    ProximateTagsRule,
    ProximateWordsRule,
    ProximateTokensRule,
    ProximateTokensTemplate,
    SymmetricProximateTokensTemplate
)

and get the exact nltk2 API. This unnecessarily pollutes the
nltk.tag.brill namespace, however, so a moderate but
efficient improvement is to require the user to say

from nltk.tag.brill.nltk2 import (
    BrillTagger,
    ProximateTagsRule,
    ProximateWordsRule,
    ProximateTokensRule,
    ProximateTokensTemplate,
    SymmetricProximateTokensTemplate
)

Still, since there really is nothing which can be done with the older API only,
I would prefer to drop the nltk2 concerns entirely,
so that one need not worry about keeping legacy signatures synchronized
(say, with dummy arguments) when the signatures of their newer
counterparts are changed (e.g., by adding new keyword arguments).
If not when changing major version number, then when?

muneson · 2013-12-02T12:29:20Z

Warning: possibly too automated yaml serialization (but yaml is to disappear anyway)

The earlier nltk2 version required user-defined (application) subclasses (of Rule, with that API) to specify
a yaml tag. By contrast, with the suggested API, user-defined subclasses (of Feature) automatically constructs a
yaml tag from the class name, if not specifically given by the user.

There may be a reason for the original design which I have missed. Anyway, it seems yaml is scheduled for removal (#540) in favour of pickle and optional json, so I haven't put much effort into it.

muneson · 2013-12-02T12:38:26Z

A note on the place of 'brill tagging' in the namespace hierarchy

In my view, both of the terms "brill" and "tagging" get increasingly inaccurate when extensions to the basic transformation-based learning (TBL) paradigm are considered. Brill invented the basic paradigm and certainly deserves all credit for that. However, TBL has been much developed since. In an upcoming survey on TBL, I review two dozen papers on later extensions to basic TBL, and only one of them is written by Brill.

Also, I find TBL is a quite flexible method, a general paradigm for discrete structured prediction, rather than just "tagging" (however that is interpreted). Classification on sequences, yes; but also for instance predictions on iid data, on parse trees, or on 2+-dimensional grids; and also prediction of sets of or probability distributions on labels. TBL has even been applied to regression analysis.

Admittedly, the present implementation covers almost none of those, but hopefully it will grow over time. When it does, I find "tbl" a more neutral name than "brill"; and to the extent that the division between "tagging" and "classification" is tenable, I think tbl rather belongs in the second camp.

stevenbird · 2013-12-03T05:44:19Z

There is no need for the new implementation to be backwards compatible.

stevenbird · 2013-12-12T09:20:48Z

There's several aspects of this package that do not fit NLTK's model for subpackages:

it's the first case of a subsubpackage -- I don't see a problem with this
it contains further subpackages (demo, application, trainer) -- I don't see the need for these to be subpackages of nltk.tag.brill. Can the demo and trainer be modules at the top level? Can the application live in nltk.app?
it contains code for a previous version -- can we remove nltk2?

Also, if this is a complete rewrite, I think the authorship block should reflect that by giving a single author and listing the others as authors of the previous version on which this is (loosely?) based.

It has a yaml dependency and we've just removed yaml from the rest of the toolkit, cf #540

@muneson would you have time to look into these?

muneson · 2013-12-12T18:41:42Z

@muneson would you have time to look into these?

Sure, although realistically not much will happen until second half of January.
Very briefly,

removing nltk2 is what I suggested above, so no problem
yaml replaced by pickle and optionally json: will do
authorship: a lot (probably most) code is new but some is lifted with few or even no
changes. Maybe authorship can be reasonably described on the level of individual modules.
I will put together some basic statistics for a more informed decision. No big deal as far
as I am concerned, though.
subpackaging: Subpackage trainer and friends are in my view important for extensibility -- there
should be an obvious place to put later extensions (the previous version had none of that,
and consequently was difficult to build on). However, crucially Python makes it easy
to define a nice API irrespective of the underlying module structure, by deciding which names should
be importable from which package. So the nltk.tag.brill namespace could (should) be set up
(in nltk.tag.brill.__init__.py, by from .<subpack> import <name>)
to hold all names which are always needed, say Template,
and also those for which there are obvious defaults, say TaggerTrainer.
I think the actual division into modules is a concern for the implementor, however,
and for any package which can be expected to grow, smaller modules are better than bigger.
(As a sidenote, wc nltk/**/*.py shows a dozen modules with line counts in the thousands;
they would almost certainly gain maintainability and extensibility if split into subpackages
in that way, with no change in APIs.)
putting things under nltk.app: Not sure. nltk.app sounds like a bit like predefined, fixed functionality.
I rather intended nltk.tag.brill.application as a place to showcase TBL flexibility. I will add more things
there later, especially examples showing radically different problem encodings. For instance, dialogue act tagging (see http://www.ling.gu.se/~lager/Mutbl/Papers/da_ssomc.pdf) -- this is interesting as an example,
but probably not as a ready-to-use application, and it would be odd to place it in nltk.app.
What about instead merging nltk.tag.brill.demo and nltk.tag.brill.application into nltk.tag.brill.examples?
Application-specific names are not always needed and thus should not pollute the top-level namespace.
If backwards compatibility is anyway to be sacrificed, I suggest (see above) that the entire thing is
moved to nltk.classifier.tbl instead. Due to Brill and many others, TBL is a general ML technique,
it applies to discrete classification of sequences, but also to other problems.

kmike · 2013-12-12T19:09:47Z

Hey @muneson,

I wanted to comment on nltk.classify vs nltk.tag: currently nltk.tag contains modules for structured prediction, while nltk.classify contains modules for classification of individual observations. Brill tagger works on sequences, so a better place for it is nltk.tag.

stevenbird · 2013-12-13T02:16:18Z

@jskda recently replaced yaml with json in other packages and could do it here as well, if that helps
I don't have a strong opinion about whether this belongs in the classify or tag packages (note that sentence boundary detection is a classification task, classifying punctuation as sentence-breaking or not, and yet it lives in the tokenize package, not the classify package)
I don't understand the logic of embedding application code inside a library; we've done that in nltk.app as a special case; none of the other NLTK packages contain applications -- I would rather leave this out and move ahead with the library functionality, and return to the question later
the trainer subpackage contains two trainers, but one of them is the original one which I think should be removed, so we're left with one trainer and I think it can be moved up to the parent level for now
since this contribution changes the NLTK API, I'm keen to include it in the NLTK 3.0 beta which could go out any day now; I favour an imperfect solution that gets the new interface into NLTK and leaves unresolved some of the code structuring questions that have no visibility at the API level.

heatherleaf · 2013-12-13T07:50:01Z

I guess that this means the following:

remove directories nltk2 and trainer
move trainer/fast.py to trainer.py
update __init__.py accordingly
replace yaml with json
keep brill in the tag package (not move to classify)

Regarding the application directory:

it does not fit in nltk.app, it's not a working application at all
currently it contains one file which defines the Word and Pos features
I suggest to rename application/postagging.py to feature.py
perhaps move the Feature class from template.py to feature.py
new features can be added to feature.py, even if they are not POS tagging features
- if the file gets bloated we can split it in several, but that can wait until later

The demo directory is more similar to nltk.app, but it's not a working application, more like an extended version of the demo() functions that are in most NLTK files. It would be good to add something like this to demo/__init__.py:

if __name__ == '__main__':
    postagging.demo()

To make it possible to run this from the command line:

$ python -m nltk.tag.brill.demo

heatherleaf · 2013-12-13T07:51:59Z

@muneson Do you have time to do these minor changes? Otherwise I can give it a try next week.

heatherleaf · 2013-12-13T08:49:58Z

I had some time over, so I did some reorganization of the code, see my branch "brill-simplified", this commit: 46be033

The demo is run like this:

python -m nltk.tag.brill.demo.postagging

And the doctests are all passed:

python -m nltk.tag.brill.template
python -m nltk.tag.brill.rule
python -m nltk.tag.brill.trainer
python -m nltk.tag.brill.tagger

I've tested with python 2.7 and 3.3

@muneson Are you fine with these changes?

muneson · 2013-12-13T09:22:26Z

I have no time today, I will look more into it tomorrow. Briefly,

I still prefer nltk.tag.tbl to nltk.tag.brill (stronger preference
than nltk.tag vs nltk.classify, which I don't think is possible
to resolve with full consistency)
I understand that the slow trainer BrillOrig perhaps should not be
exposed by default in the API but I don't understand the point of throwing
it away. It has, if nothing else,
a) pedagogical value in teaching (which may be the reason some people
use nltk in the first place) and
b) use as a reference implementation (for any future attempts at improving
efficiency, say)
I see zero advantages of
trainer.py:

class Trainer()

as opposed to the identical API
trainer/ __init__.py:

from .impl1 import Trainer

trainer/impl1:

class Trainer()

trainer/impl2:

class Trainer()

Instead, I just see disadvantages. For instance,
while not checked in and not entirely finished, I already
have an impl2 of an alternative trainer with other space/time tradeoffs.

Subclasses of Feature are user-defined, application-specific code which
depend on the input data format. They should be clearly
marked as application-specific and not placed with Feature.

2013/12/13 Peter Ljunglöf notifications@github.com

I had some time over, so I did some reorganization of the code, see my
branch "brill-simplified", this commit: 46be03346be033

The demo is run like this:

python -m nltk.tag.brill.demo.postagging

And the doctests are all passed:

python -m nltk.tag.brill.template
python -m nltk.tag.brill.rule
python -m nltk.tag.brill.trainer
python -m nltk.tag.brill.tagger

I've tested with python 2.7 and 3.3

@muneson https://github.com/muneson Are you fine with these changes?

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/555#issuecomment-30494567
.

heatherleaf · 2013-12-13T23:19:40Z

nltk.tag.tbl is fine with me
keeping the trainer directory is also ok, and brillorig.py for the reasons you write
but I don't agree with you about the features:
- just because some very common features (Word, Tag) happen to be defined in feature.py, that doesn't suggest that they are in some way better than other features
- but the docstrings of Word and Tag (and other features) should state very clear what they can be used for
- anyway application is not a good package name, since people can start believing that it contains an application
here's another suggestion:
- a directory feature, with the files api.py (containing the Feature class), and postagging.py (containing Word and Tag)

@muneson @stevenbird @kmike What do you say?

(In some way I feel this is a Java vs Python issue: In Java, all classes have to be in different files, and people often create a very deep hierarcichal folder/module structure. In Python the modules are often flatter, and several classes can be put in the same file. But now I'm getting philosophical.)

muneson · 2013-12-13T23:41:12Z

I'm on it right now, will have something concrete tomorrow. An
application, in the sense I intended, is applying the technique to some
domain or some specific problem, not a ready-written program.
An TBL application generally requires the user to specify a) features
(inherited from Feature) and b) templates.

here's another suggestion:

a directory feature, with the files api.py (containing the Feature class),
and postagging.py (containing Word and Tag)

Actually, this was quite literally the design I had in an earlier version,
and
on the whole I like it. I changed "feature" to "application" when I didn't
know where
to put pre-defined templates for a given task. But I agree "application"
is a name bound to create confusion.

I'm happy with your suggestion, and would put predefined templates ifor pos
n
feature/postagging.py. Another possibility is to instead name it "task",
like so:

task/api.py #Feature
task/postagging.py #Word, Pos; sample template sets
task/dacttagging.py #SpeakerShift, SentLen, many others; sample template
sets
task/...

2013/12/14 Peter Ljunglöf notifications@github.com

nltk.tag.tbl is fine with me

keeping the trainer directory is also ok, and brillorig.py for the
reasons you write

but I don't agree with you about the features:

just because some very common features (Word, Tag) happen to be
defined in feature.py, that doesn't suggest that they are in some
way better than other features

but the docstrings of Word and Tag (and other features) should
state very clear what they can be used for

anyway application is not a good package name, since people can
start believing that it contains an application

here's another suggestion:

a directory feature, with the files api.py (containing the
Feature class), and postagging.py (containing Word and Tag)

@stevenbird https://github.com/stevenbird @kmikehttps://github.com/kmikeWhat do you say?

(In some way I feel this is a Java vs Python issue: In Java, all classes
have to be in different files, and people often create a very deep
hierarcichal folder/module structure. In Python the modules are often
flatter, and several classes can be put in the same file. But now I'm
getting philosophical.)

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/555#issuecomment-30551802
.

muneson · 2013-12-13T23:45:13Z

2013/12/14 Marcus Uneson marcus.uneson@gmail.com

I'm happy with your suggestion, and would put predefined templates ifor
pos n
feature/postagging.py.

this got unreadable:
I'm happy with your suggestion, and would put predefined
templates for postagging in feature/postagging.py without worrying too much.

stevenbird · 2013-12-14T00:12:54Z

I'm glad to see that we're making progress. Thanks very much @heatherleaf, @kmike, and @muneson.

I would like to be conservative about deviations from NLTK's established package structure, and so I am reluctant to add sub-sub-sub-packages to NLTK unless there's a strong case. There are so many reasonable ways we could do things, and many decisions about style are essentially arbitrary (cf PEP-8), but simplicity and consistency are important attributes and make our long-term maintenance task much easier. Where would we be if each contribution sought to innovate in different ways.

I'm happy with nltk.tag.tbl, but can we consider one other option: nltk.classify.tbl which is used by nltk.tag.brill? This assumes that we could accommodate tbl into (a possibly extended) ClassifierI. If not, then I would like to consider a new interface nltk.classify.SequenceClassifierI or even a new subpackage for sequence classification. I guess tag.hmm needs to be put in the same place. Thoughts?
I think we should stick with our existing nomenclature about "application", and not create modules or packages with this name.

muneson · 2013-12-14T19:56:00Z

A new PR with changes as per the API-relevant discussions (except JSON and
the move to nltk.classify
left undecided).

#564

2013/12/14 Steven Bird notifications@github.com

I'm glad to see that we're making progress. Thanks very much @heatherleafhttps://github.com/heatherleaf,
@kmike https://github.com/kmike, and @munesonhttps://github.com/muneson
.

I would like to be conservative about deviations from NLTK's established
package structure, and so I am reluctant to add sub-sub-sub-packages to
NLTK unless there's a strong case. There are so many reasonable ways we
could do things, and many decisions about style are essentially arbitrary
(cf PEP-8), but simplicity and consistency are important attributes and
make our long-term maintenance task much easier. Where would we be if each
contribution sought to innovate in different ways.

I'm happy with nltk.tag.tbl, but can we consider one other option:
nltk.classify.tbl which is used by nltk.tag.brill? This assumes that
we could accommodate tbl into (a possibly extended) ClassifierI. If
not, then I would like to consider a new interface
nltk.classify.SequenceClassifierI or even a new subpackage for
sequence classification. I guess tag.hmm needs to be put in the same
place. Thoughts?

I think we should stick with our existing nomenclature about
"application", and not create modules or packages with this name.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/555#issuecomment-30554117
.

stevenbird · 2014-04-20T22:50:44Z

Resolved. See #549, #564

muneson · 2014-06-25T21:37:08Z

For a number of reasons I haven't been able to touch this project for some time, but I will look into the remaining issues next week.

stevenbird mentioned this issue Nov 28, 2013

update to nltk.tag.brill #549

Closed

ghost assigned stevenxxiu Nov 28, 2013

ghost assigned stevenbird Dec 3, 2013

muneson mentioned this issue Dec 14, 2013

Transformation-based learning #564

Merged

stevenbird modified the milestones: nltk3beta2, nltk3beta Feb 6, 2014

stevenbird modified the milestones: nltk3 final, nltk3beta2 Apr 18, 2014

stevenbird closed this as completed Apr 20, 2014

This was referenced Apr 20, 2014

Brill tagger API #29

Closed

Error in Brill tagger docstrings #64

Closed

Brill Tagger on third-party corpora #27

Closed

stevenbird mentioned this issue Apr 20, 2014

BrillTagger fails on non-ascii text #129

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Brill tagger #555

Brill tagger #555

stevenbird commented Nov 28, 2013

muneson commented Dec 2, 2013

muneson commented Dec 2, 2013

muneson commented Dec 2, 2013

stevenbird commented Dec 3, 2013

stevenbird commented Dec 12, 2013

muneson commented Dec 12, 2013

kmike commented Dec 12, 2013

stevenbird commented Dec 13, 2013

heatherleaf commented Dec 13, 2013

heatherleaf commented Dec 13, 2013

heatherleaf commented Dec 13, 2013

muneson commented Dec 13, 2013

heatherleaf commented Dec 13, 2013

muneson commented Dec 13, 2013

muneson commented Dec 13, 2013

stevenbird commented Dec 14, 2013

muneson commented Dec 14, 2013

stevenbird commented Apr 20, 2014

muneson commented Jun 25, 2014

Brill tagger #555

Brill tagger #555

Comments

stevenbird commented Nov 28, 2013

muneson commented Dec 2, 2013

muneson commented Dec 2, 2013

muneson commented Dec 2, 2013

stevenbird commented Dec 3, 2013

stevenbird commented Dec 12, 2013

muneson commented Dec 12, 2013

kmike commented Dec 12, 2013

stevenbird commented Dec 13, 2013

heatherleaf commented Dec 13, 2013

heatherleaf commented Dec 13, 2013

heatherleaf commented Dec 13, 2013

muneson commented Dec 13, 2013

heatherleaf commented Dec 13, 2013

muneson commented Dec 13, 2013

muneson commented Dec 13, 2013

stevenbird commented Dec 14, 2013

muneson commented Dec 14, 2013

stevenbird commented Apr 20, 2014

muneson commented Jun 25, 2014