Add a CRF POS tagger, replacing the old Mallet interface with CRFSuite #104

Closed
alexrudnick opened this Issue · 20 comments

9 participants

@alexrudnick
Collaborator

NLTK's mallet interface, nltk.classify.mallet, is for mallet 0.4. It
should be upgraded to the current version (which has a different API).

Migrated from http://code.google.com/p/nltk/issues/detail?id=523

@xim
Collaborator

It might be a good idea to document that we only support Mallet 0.4 (current version is 2.0.7) or drop support entirely.

Current master fails as well:

Traceback (most recent call last):
  File "nltk/tag/crf.py", line 760, in <module>
    demo()
  File "nltk/tag/crf.py", line 746, in demo
    acc = nltk.tag.accuracy(crf, brown_test)
AttributeError: 'module' object has no attribute 'accuracy'
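The demo calls a module-level `accuracy` helper that is no longer there. A minimal stand-in for token-level accuracy could look like the sketch below. It assumes only that the tagger has a `tag(tokens)` method returning `(word, tag)` pairs; the function name `tagging_accuracy` and the toy `AlwaysNN` tagger are illustrative and not part of NLTK.

```python
def tagging_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tagger.tag(words)
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            correct += gold_tag == pred_tag
            total += 1
    # Avoid dividing by zero on an empty test set.
    return correct / total if total else 0.0


class AlwaysNN:
    """Toy tagger (hypothetical) used only to exercise the function."""

    def tag(self, tokens):
        return [(token, "NN") for token in tokens]
```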
@duhaime

I know this is an old thread, but has this problem been fixed in the current master?

@kmike
Collaborator

No, this problem is not fixed.

@rmunn

This is a problem for packaging NLTK in Linux distributions, specifically Debian and Ubuntu. I worked on the Ubuntu package a while ago, and had to remove the mallet functionality entirely from the Ubuntu package because Ubuntu didn't carry Mallet 0.4 at all. And I was unable to get the NLTK package into Debian, for similar reasons: I started looking at whether I could create a mallet0 package for Debian, got bogged down in Java complexities I wasn't familiar with, and ended up abandoning the Debian NLTK package altogether.

I would like to create a Debian package for NLTK version 3 so that it could be available on both Debian and Ubuntu, but until this issue is resolved one way or the other, I doubt I'll be able to create that Debian package.

@ewan-klein
Owner

Much of the problem is that mallet.py calls internals.py, which adds NLTK_JAR to the Java class path. This is intended to provide more fine-grained access to parameters that were embedded in the Mallet 0.4 CRF Java code. It is not straightforward to update the Java code that compiles to NLTK_JAR, since it calls classes and fields that are no longer present in Mallet 2.x. I propose that we provide a much simpler interface to the Mallet 2.x CRF classifier which is just a wrapper for cc.mallet.fst.SimpleTagger. I have a crude version of this working now.
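A rough sketch of what such a thin wrapper might involve, not the crude version mentioned above. It assumes the Mallet 2.x `cc.mallet.fst.SimpleTagger` command line (`--train true`, `--model-file`) and its input format: one token per line, space-separated features followed by the label, with a blank line between sequences. The function names, jar paths, and the two features are hypothetical.

```python
import os


def to_simpletagger_lines(tagged_sents, with_labels=True):
    """Format tagged sentences as SimpleTagger input lines:
    features first, label last, blank line after each sentence."""
    lines = []
    for sent in tagged_sents:
        for word, tag in sent:
            # Hypothetical minimal features: lowercased word plus a capitalisation flag.
            feats = ["w=" + word.lower()]
            if word[:1].isupper():
                feats.append("cap")
            if with_labels:
                feats.append(tag)
            lines.append(" ".join(feats))
        lines.append("")  # blank line ends the sequence
    return lines


def simpletagger_command(classpath_jars, model_file, data_file, train=False):
    """Build (but do not run) the java command line for SimpleTagger."""
    cmd = ["java", "-cp", os.pathsep.join(classpath_jars),
           "cc.mallet.fst.SimpleTagger"]
    if train:
        cmd += ["--train", "true"]
    cmd += ["--model-file", model_file, data_file]
    return cmd
```

The returned list could be handed to `subprocess.run` once the formatted lines have been written to `data_file`; keeping command construction separate from execution makes the wrapper testable without a Java install.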

@stevenbird
Owner

If we need to modify the ClassifierI interface please move this to the nltk3beta milestone.

@jskda
Collaborator

It might be a good idea to document that we only support Mallet 0.4 (current version is 2.0.7) or drop support entirely.

I've documented this in the wiki: https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software

@stevenbird
Owner

@ewan-klein can we drop Mallet 0.4 support from NLTK 3? And then we can see about adding Mallet 2 support later? This would deal with @rmunn's issue finally.

@stevenbird stevenbird modified the milestone: nltk3beta, nltk3 final
@ewan-klein
Owner

Yes, this sounds sensible.

@kmike
Collaborator

How much simpler is this Mallet interface going to be? I haven't used Mallet myself, but it looks like its main advantage is flexibility - you can use higher-order transition features and stuff like that.

But if what is needed is a first-order linear chain CRF then there are much faster implementations available like http://wapiti.limsi.fr/ or http://www.chokkan.org/software/crfsuite/. CRFSuite has a "benchmark" page http://www.chokkan.org/software/crfsuite/benchmark.html where it claims to be 100x faster to train than Mallet 2.0.6 on the CoNLL 2000 chunking task.

@stevenbird
Owner

@kmike the idea is to remove the Mallet interface for now. We can obviously consider adding one (or more) CRF interfaces as soon as someone wants to contribute them.

@ewan-klein
Owner

The potential advantage of Mallet is that it is well known and also does topic detection. I agree that if people are looking for high performance, then other tools might be better. I did some work on fixing the wrapper code, but haven't had time to finish it off. I'd be happy to have someone take it out of my hands!

@jskda jskda was assigned by stevenbird
@jskda jskda referenced this issue from a commit in jskda/nltk
Steven Xu removed mallet interface, c.f. nltk#104 701468b
@stevenbird
Owner

@rmunn please open a new issue if there are any other obstacles to Linux packaging.

@stevenbird stevenbird closed this
@jskda jskda was unassigned by stevenbird
@stevenbird
Owner

@longdt219 please try to recreate this CRF tagger using the current version of Mallet.

@stevenbird stevenbird reopened this
@stevenbird stevenbird modified the milestone: End-January, nltk3beta
@longdt219
Collaborator

Hi @stevenbird,
Just to be clear: do you want only the linear-chain CRF POS tagger, with training and testing functions? Mallet also supports general CRFs. Should we let users experiment further (e.g. second-order CRFs, multi-layer CRFs)? As @kmike said, Mallet's main strength is its flexibility. To some extent, we should let users tweak the model, e.g. to jointly predict POS and NER.

@stevenbird
Owner

I like both suggestions. Definitely provide an interface to the linear chain CRF tagger. Before implementing anything more general, please propose the interface, for discussion here.

@longdt219
Collaborator

The user can specify which templates to use from a list of possible templates, including:
1. Bigram template
2. Trigram template
3. Pairwise template between layers.
A linear-chain CRF is then equivalent to a single-layer general CRF with a bigram template, and a second-order CRF to a single-layer general CRF with a trigram template. A pairwise template specifies the relation across layers. For example, to jointly predict POS and NER we would use a bigram template for the POS layer, a bigram template for the NER layer, and a pairwise template between the POS and NER layers.

This should be simple enough for users to play with.
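To make the proposal concrete, here is one way the template list could be written down. This is only an illustrative sketch: the `Template` class, the helper constructors, and `describe` are hypothetical names, not an existing NLTK or Mallet API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Template:
    kind: str                # "bigram", "trigram", or "pairwise"
    layers: Tuple[str, ...]  # one layer for bigram/trigram, two for pairwise


def bigram(layer: str) -> Template:
    return Template("bigram", (layer,))


def trigram(layer: str) -> Template:
    return Template("trigram", (layer,))


def pairwise(layer_a: str, layer_b: str) -> Template:
    return Template("pairwise", (layer_a, layer_b))


def describe(templates: List[Template]) -> str:
    """Name the model that a list of templates corresponds to."""
    layers = {layer for t in templates for layer in t.layers}
    if len(templates) == 1 and len(layers) == 1:
        if templates[0].kind == "bigram":
            return "linear-chain CRF"
        if templates[0].kind == "trigram":
            return "second-order CRF"
    return "general CRF over layers: " + ", ".join(sorted(layers))


# Joint POS + NER prediction, as in the example above.
joint = [bigram("POS"), bigram("NER"), pairwise("POS", "NER")]
```

Here `describe([bigram("POS")])` names the single-layer bigram case as a linear-chain CRF, while `describe(joint)` falls through to the general multi-layer case.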

@stevenbird
Owner

That sounds reasonable; please go ahead.

@longdt219
Collaborator

Hi @stevenbird,
It turns out that the general CRF does not support separate training and testing. I implemented the wrapper for the conventional linear-chain CRF.
I created a pull-request for this.

@stevenbird
Owner

In #855 we are discussing the idea of using CRFSuite instead of Mallet. There seems to be no reason to provide a Mallet interface. Unless there are any objections, I propose that we provide a CRF tagger using CRFSuite.

@stevenbird stevenbird changed the title from Upgrade mallet interface to Add a CRF POS tagger, replacing the old Mallet interface with CRFSuite
@stevenbird stevenbird closed this