Add a CRF POS tagger, replacing the old Mallet interface with CRFSuite #104

Closed
alexrudnick opened this Issue Jan 17, 2012 · 20 comments

Member

alexrudnick commented Jan 17, 2012

NLTK's mallet interface, nltk.classify.mallet, is for mallet 0.4. It
should be upgraded to the current version (which has a different API).

Migrated from http://code.google.com/p/nltk/issues/detail?id=523

Member

xim commented Apr 8, 2012

It might be a good idea to document that we only support Mallet 0.4 (current version is 2.0.7) or drop support entirely.

Current master fails as well:

Traceback (most recent call last):
  File "nltk/tag/crf.py", line 760, in <module>
    demo()
  File "nltk/tag/crf.py", line 746, in demo
    acc = nltk.tag.accuracy(crf, brown_test)
AttributeError: 'module' object has no attribute 'accuracy'
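
The traceback shows the demo calling a module-level `nltk.tag.accuracy` helper that is no longer there. Until the demo is updated, token-level accuracy can be computed by hand. A minimal sketch, assuming the tagger is any object with a `tag(tokens)` method that returns `(word, tag)` pairs; the names `tagging_accuracy` and `AlwaysNounTagger` are illustrative and not part of NLTK:

```python
def tagging_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tagger.tag(words)
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            correct += int(gold_tag == pred_tag)
            total += 1
    return correct / total if total else 0.0


class AlwaysNounTagger:
    """Toy stand-in for a real tagger (hypothetical, for illustration only)."""

    def tag(self, tokens):
        return [(tok, "NN") for tok in tokens]


gold = [
    [("the", "DT"), ("dog", "NN")],
    [("cats", "NNS"), ("sleep", "VB"), ("here", "NN")],
]
print(tagging_accuracy(AlwaysNounTagger(), gold))
```

The toy tagger gets two of the five tokens right, so the call reports an accuracy of 0.4.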

duhaime commented Jun 24, 2013

I know this is an old thread, but has this problem been fixed in the current master?

Member

kmike commented Jun 24, 2013

No, this problem is not fixed.

rmunn commented Jul 2, 2013

This is a problem for packaging NLTK in Linux distributions, specifically Debian and Ubuntu. I worked on the Ubuntu package a while ago, and had to remove the mallet functionality entirely from the Ubuntu package because Ubuntu didn't carry Mallet 0.4 at all. I was also unable to get the NLTK package into Debian, for similar reasons: I started looking at whether I could create a mallet0 package for Debian, got bogged down in Java complexities I wasn't familiar with, and ended up abandoning the Debian NLTK package altogether.

I would like to create a Debian package for NLTK version 3 so that it could be available on both Debian and Ubuntu, but until this issue is resolved one way or the other, I doubt I'll be able to create that Debian package.

Owner

ewan-klein commented Oct 31, 2013

Much of the problem is that mallet.py calls internals.py, which adds NLTK_JAR to the Java class path. This is intended to provide more fine-grained access to parameters that were embedded in the Mallet 0.4 CRF Java code. It is not straightforward to update the Java code that compiles to NLTK_JAR, since it calls classes and fields that are no longer present in Mallet 2.x. I propose that we provide a much simpler interface to the Mallet 2.x CRF classifier, which is just a wrapper for cc.mallet.fst.SimpleTagger. I have a crude version of this working now.
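
To make the proposal concrete, a thin wrapper of this kind could write the training data in the line-per-token format that SimpleTagger reads and then shell out to Java. This is only a rough sketch, not the crude version mentioned above: the function names are hypothetical, the classpath is a placeholder, and the `--train` and `--model-file` flags are written as I recall them from Mallet's sequence tagging guide, so they should be checked against the installed Mallet version.

```python
import subprocess


def to_simple_tagger_format(tagged_sents):
    """Serialize [(word, tag), ...] sentences as one 'feature label' line per
    token, with a blank line between sentences. Words containing whitespace
    would need escaping, which this sketch does not do."""
    blocks = []
    for sent in tagged_sents:
        blocks.append("\n".join("%s %s" % (word, tag) for word, tag in sent))
    return "\n\n".join(blocks) + "\n"


def simple_tagger_train_cmd(classpath, model_file, train_file):
    """Build (but do not run) the java command that trains a CRF model."""
    return [
        "java", "-cp", classpath,
        "cc.mallet.fst.SimpleTagger",
        "--train", "true",          # assumed flag; verify with your Mallet
        "--model-file", model_file,  # assumed flag; verify with your Mallet
        train_file,
    ]


def train(classpath, model_file, train_file):
    """Run the training command; requires Java and the Mallet jars."""
    cmd = simple_tagger_train_cmd(classpath, model_file, train_file)
    subprocess.run(cmd, check=True)


print(to_simple_tagger_format([[("the", "DT"), ("dog", "NN")], [("run", "VB")]]))
```

Keeping command construction separate from execution means the wrapper can be inspected and tested without Java installed.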

Owner

stevenbird commented Nov 28, 2013

If we need to modify the ClassifierI interface please move this to the nltk3beta milestone.

> It might be a good idea to document that we only support Mallet 0.4 (current version is 2.0.7) or drop support entirely.

I've documented this in the wiki: https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software

Owner

stevenbird commented Feb 10, 2014

@ewan-klein can we drop Mallet 0.4 support from NLTK 3? And then we can see about adding Mallet 2 support later? This would deal with @rmunn's issue finally.

@stevenbird stevenbird modified the milestones: nltk3beta, nltk3 final Feb 10, 2014

Owner

ewan-klein commented Feb 10, 2014

Yes, this sounds sensible.

Member

kmike commented Feb 10, 2014

How much simpler is this Mallet interface going to be? I haven't used Mallet myself, but it looks like its main advantage is flexibility - you can use higher-order transition features and stuff like that.

But if what is needed is a first-order linear chain CRF, then there are much faster implementations available, like http://wapiti.limsi.fr/ or http://www.chokkan.org/software/crfsuite/. CRFSuite has a "benchmark" page, http://www.chokkan.org/software/crfsuite/benchmark.html, where it claims to be 100x faster to train than Mallet 2.0.6 on the CoNLL 2000 chunking task.

Owner

stevenbird commented Feb 10, 2014

@kmike the idea is to remove the Mallet interface for now. We can obviously consider adding one (or more) CRF interfaces as soon as someone wants to contribute them.

Owner

ewan-klein commented Feb 10, 2014

The potential advantage of Mallet is that it is well known and also does topic detection. I agree that if people are looking for high performance, then other tools might be better. I did some work on fixing the wrapper code, but haven't had time to finish it off. I'd be happy to have someone take it out of my hands!

stevenxxiu pushed a commit to stevenxxiu/nltk that referenced this issue Feb 12, 2014

stevenbird added a commit that referenced this issue Feb 12, 2014

Owner

stevenbird commented Feb 12, 2014

@rmunn please open a new issue if there are any other obstacles to Linux packaging.

@stevenbird stevenbird closed this Feb 16, 2014

@stevenbird stevenbird assigned longdt219 and unassigned stevenxxiu Dec 22, 2014

Owner

stevenbird commented Dec 22, 2014

@longdt219 please try to recreate this CRF tagger using the current version of Mallet.

@stevenbird stevenbird reopened this Dec 22, 2014

@stevenbird stevenbird modified the milestones: End-January, nltk3beta Dec 22, 2014

Contributor

longdt219 commented Jan 18, 2015

Hi @stevenbird,
Just to be clear, do you just want a linear chain CRF POS tagger that supports training and testing? Mallet also supports general CRFs. Should we allow users to play with it more (e.g. second-order CRF, multi-layer CRF)? Like @kmike said, the best thing about Mallet is its flexibility. To some extent, we should allow users to tweak it a bit, e.g. to jointly predict POS and NER.

Owner

stevenbird commented Jan 18, 2015

I like both suggestions. Definitely provide an interface to the linear chain CRF tagger. Before implementing anything more general, please propose the interface, for discussion here.

Contributor

longdt219 commented Jan 21, 2015

The user can specify which templates to use from a list of possible templates, including:

  1. Bigram template
  2. Trigram template
  3. Pairwise template between layers

A linear chain CRF is therefore equivalent to a single-layer general CRF with a bigram template, and a second-order CRF is equivalent to a single-layer general CRF with a trigram template. A pairwise template specifies the relation across layers. For example, to jointly predict POS and NER we would use a bigram template for the POS layer, a bigram template for the NER layer, and a pairwise template between the POS and NER layers.

This should be simple enough for users to play with.
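
One way to picture the template-based interface is as a small spec object listing layers and the templates attached to them. This is purely a hypothetical sketch for discussion: `CRFSpec`, `TEMPLATE_ARITY`, and the `add` method are made-up names and do not correspond to anything in Mallet or NLTK.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Number of layers each (hypothetical) template kind spans.
TEMPLATE_ARITY = {"bigram": 1, "trigram": 1, "pairwise": 2}


@dataclass
class CRFSpec:
    """Hypothetical description of a general CRF: layers plus templates."""

    layers: List[str]
    templates: List[Tuple[str, Tuple[str, ...]]] = field(default_factory=list)

    def add(self, kind, *layers):
        """Attach a template to one layer (bigram/trigram) or two (pairwise)."""
        if kind not in TEMPLATE_ARITY:
            raise ValueError("unknown template kind: %r" % kind)
        if len(layers) != TEMPLATE_ARITY[kind]:
            raise ValueError(
                "%s template takes %d layer(s)" % (kind, TEMPLATE_ARITY[kind])
            )
        for layer in layers:
            if layer not in self.layers:
                raise ValueError("unknown layer: %r" % layer)
        self.templates.append((kind, layers))
        return self


# Linear chain CRF: single layer, bigram template.
linear_chain = CRFSpec(["pos"]).add("bigram", "pos")

# Second-order CRF: single layer, trigram template.
second_order = CRFSpec(["pos"]).add("trigram", "pos")

# Joint POS + NER: a bigram template per layer, plus a pairwise link.
joint = (
    CRFSpec(["pos", "ner"])
    .add("bigram", "pos")
    .add("bigram", "ner")
    .add("pairwise", "pos", "ner")
)
print(joint.templates)
```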

Owner

stevenbird commented Jan 22, 2015

That sounds reasonable; please go ahead.

Contributor

longdt219 commented Jan 25, 2015

Hi @stevenbird,
It turns out that the general CRF does not support separate training and testing, so I implemented the wrapper for the conventional linear chain CRF.
I created a pull request for this.

Owner

stevenbird commented Jan 26, 2015

In #855 we are discussing the idea of using CRFSuite instead of Mallet. There seems to be no reason to provide a Mallet interface. Unless there are any objections, I propose that we provide a CRF tagger using CRFSuite.
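
For reference, a CRFSuite-backed tagger could look roughly like the following. This is a sketch rather than the interface being proposed: it assumes the third-party `python-crfsuite` bindings (imported as `pycrfsuite`), and the feature set and the function names `token_features`, `sent_features`, `train_crf`, and `tag_tokens` are illustrative choices.

```python
def token_features(tokens, i):
    """Features for token i, as a list of strings (a format CRFSuite accepts)."""
    word = tokens[i]
    feats = [
        "bias",
        "lower=" + word.lower(),
        "suffix3=" + word[-3:],
        "is_title=%s" % word.istitle(),
        "is_digit=%s" % word.isdigit(),
    ]
    # Neighbouring words, with markers at the sentence boundaries.
    if i > 0:
        feats.append("prev=" + tokens[i - 1].lower())
    else:
        feats.append("BOS")
    if i < len(tokens) - 1:
        feats.append("next=" + tokens[i + 1].lower())
    else:
        feats.append("EOS")
    return feats


def sent_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]


def train_crf(tagged_sents, model_path):
    """Train a linear chain CRF; needs `pip install python-crfsuite`."""
    import pycrfsuite  # imported lazily so the feature code runs without it

    trainer = pycrfsuite.Trainer(verbose=False)
    for sent in tagged_sents:
        tokens = [word for word, _ in sent]
        tags = [tag for _, tag in sent]
        trainer.append(sent_features(tokens), tags)
    trainer.train(model_path)


def tag_tokens(tokens, model_path):
    """Tag one tokenized sentence with a previously trained model."""
    import pycrfsuite

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return list(zip(tokens, tagger.tag(sent_features(tokens))))


print(token_features(["The", "dog", "barks"], 0))
```

Only the feature extraction runs without CRFSuite installed; `train_crf` and `tag_tokens` show where the library calls would go.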

@stevenbird stevenbird changed the title from Upgrade mallet interface to Add a CRF POS tagger, replacing the old Mallet interface with CRFSuite Feb 6, 2015

stevenbird added a commit that referenced this issue Mar 6, 2015

@stevenbird stevenbird closed this Mar 12, 2015

kruskod pushed a commit to kruskod/nltk that referenced this issue Jul 15, 2015

kruskod pushed a commit to kruskod/nltk that referenced this issue Jul 15, 2015
