Brill tagger #555
Suggestion: skip nltk2 compatibility
At present, in order not to break any tests, you can say
and get the exact nltk2 API. This unnecessarily pollutes the
Still, since there really is nothing which can be done with the older API only, |
Warning: possibly too automated yaml serialization (but yaml is to disappear anyway)
The earlier nltk2 version required user-defined (application) subclasses (of Rule, with that API) to specify There may be a reason for the original design which I have missed. Anyway, it seems yaml is scheduled for removal (#540) in favour of pickle and optional json, so I haven't put much effort into it.
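For illustration only, here is a minimal sketch of what an explicit JSON round-trip for rules could look like once yaml is gone. `ToyRule`, its three fields, `dump_rules` and `load_rules` are hypothetical names made up for this example; they are not the Rule API of the PR or of nltk2.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ToyRule:
    """Hypothetical stand-in for a rule; not the Rule class discussed above."""
    original_tag: str
    replacement_tag: str
    previous_tag: str


def dump_rules(rules: List[ToyRule]) -> str:
    # Each rule is written as a plain dict of its declared fields, so the
    # serialized form is explicit rather than derived from object internals.
    return json.dumps([asdict(rule) for rule in rules], sort_keys=True)


def load_rules(text: str) -> List[ToyRule]:
    # Rebuild rules from the field dicts produced by dump_rules.
    return [ToyRule(**fields) for fields in json.loads(text)]
```

The point of the sketch is only that the set of serialized fields is declared in one place, which may be what the earlier design was after.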
A note on the place of 'brill tagging' in the namespace hierarchy
In my view, both of the terms "brill" and "tagging" get increasingly inaccurate when extensions to the basic transformation-based learning (TBL) paradigm are considered. Brill invented the basic paradigm and certainly deserves all credit for that. However, TBL has been much developed since. In an upcoming survey on TBL, I review two dozen papers on later extensions to basic TBL, and only one of them is written by Brill. Also, I find TBL to be quite a flexible method, a general paradigm for discrete structured prediction, rather than just "tagging" (however that is interpreted). Classification on sequences, yes; but also, for instance, predictions on iid data, on parse trees, or on grids of two or more dimensions; and also prediction of sets of labels or of probability distributions over labels. TBL has even been applied to regression analysis. Admittedly, the present implementation covers almost none of those, but hopefully it will grow over time. When it does, I find "tbl" a more neutral name than "brill"; and to the extent that the division between "tagging" and "classification" is tenable, I think tbl rather belongs in the second camp.
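For readers new to the paradigm, the basic TBL loop on a label sequence can be sketched as below. This is a toy illustration and not the implementation in the PR: `Rule`, `errors`, `train` and `tag` are hypothetical names, the only rule shape is "change one label to another when the previous label matches", and rule selection is a plain greedy search over a fixed candidate list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: rewrite `old` as `new` when the previous label is `prev`."""
    old: str
    new: str
    prev: str

    def apply(self, labels: List[str]) -> List[str]:
        out = list(labels)
        for i in range(1, len(labels)):
            # Conditions are checked against the labels as they were
            # before this rule started rewriting anything.
            if labels[i] == self.old and labels[i - 1] == self.prev:
                out[i] = self.new
        return out


def errors(predicted: List[str], gold: List[str]) -> int:
    return sum(p != g for p, g in zip(predicted, gold))


def train(initial: List[str], gold: List[str],
          candidates: List[Rule], max_rules: int = 10) -> List[Rule]:
    """Greedily learn an ordered rule list that reduces errors against `gold`."""
    current, learned = list(initial), []
    for _ in range(max_rules):
        best = min(candidates, key=lambda r: errors(r.apply(current), gold))
        if errors(best.apply(current), gold) >= errors(current, gold):
            break  # no candidate improves on the current labelling
        current = best.apply(current)
        learned.append(best)
    return learned


def tag(initial: List[str], rules: List[Rule]) -> List[str]:
    """Apply the learned rules, in order, to an initial labelling."""
    labels = list(initial)
    for rule in rules:
        labels = rule.apply(labels)
    return labels
```

Nothing in the loop is specific to part-of-speech tags; the labels are just strings, which is the sense in which the method is closer to general classification than to "tagging".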
There is no need for the new implementation to be backwards compatible.
There are several aspects of this package that do not fit NLTK's model for subpackages:
Also, if this is a complete rewrite, I think the authorship block should reflect that by giving a single author and listing the others as authors of the previous version on which this is (loosely?) based.
It has a yaml dependency and we've just removed yaml from the rest of the toolkit, cf. #540.
@muneson would you have time to look into these?
Sure, although realistically not much will happen until second half of January.
Hey @muneson, I wanted to comment on |
I guess that this means the following:
Regarding the
The
To make it possible to run this from the command line:
@muneson Do you have time to do these minor changes? Otherwise I can give it a try next week.
I had some time over, so I did some reorganization of the code; see my branch "brill-simplified", this commit: 46be033.
The demo is run like this:
And the doctests are all passed:
I've tested with Python 2.7 and 3.3.
@muneson Are you fine with these changes?
I have no time today, I will look more into it tomorrow. Briefly,
as opposed to the identical API
Instead, I just see disadvantages. For instance,
2013/12/13 Peter Ljunglöf notifications@github.com
@muneson @stevenbird @kmike What do you say? (In some way I feel this is a Java vs Python issue: in Java, all classes have to be in different files, and people often create a very deep hierarchical folder/module structure. In Python the modules are often flatter, and several classes can be put in the same file. But now I'm getting philosophical.)
I'm on it right now, will have something concrete tomorrow. An
Actually, this was quite literally the design I had in an earlier version, I'm happy with your suggestion, and would put predefined templates for pos task/api.py #Feature
2013/12/14 Peter Ljunglöf notifications@github.com
2013/12/14 Marcus Uneson marcus.uneson@gmail.com
this got unreadable: |
I'm glad to see that we're making progress. Thanks very much @heatherleaf, @kmike, and @muneson. I would like to be conservative about deviations from NLTK's established package structure, and so I am reluctant to add sub-sub-sub-packages to NLTK unless there's a strong case. There are so many reasonable ways we could do things, and many decisions about style are essentially arbitrary (cf. PEP-8), but simplicity and consistency are important attributes and make our long-term maintenance task much easier. Where would we be if each contribution sought to innovate in different ways?
A new PR with changes as per the API-relevant discussions (except JSON and
2013/12/14 Steven Bird notifications@github.com
For a number of reasons I haven't been able to touch this project for some time, but I will look into the remaining issues next week.
@muneson has contributed a new implementation of the Brill tagger in #549. I've merged it into a new feature branch, "brill", and invite people to test it out and post comments here.