
Upgrade mallet interface #104

Closed
alexrudnick opened this Issue · 13 comments

8 participants

Alex Rudnick, Morten M. Neergaard, Douglas Duhaime, Mikhail Korobov, Robin Munn, Ewan Klein, Steven Bird, jskda
Alex Rudnick
Collaborator

NLTK's mallet interface, nltk.classify.mallet, is for mallet 0.4. It
should be upgraded to the current version (which has a different API).

Migrated from http://code.google.com/p/nltk/issues/detail?id=523

Morten M. Neergaard
Collaborator
xim commented

It might be a good idea to document that we only support Mallet 0.4 (current version is 2.0.7) or drop support entirely.

Current master fails as well:

Traceback (most recent call last):
  File "nltk/tag/crf.py", line 760, in <module>
    demo()
  File "nltk/tag/crf.py", line 746, in demo
    acc = nltk.tag.accuracy(crf, brown_test)
AttributeError: 'module' object has no attribute 'accuracy'

Douglas Duhaime

I know this is an old thread, but has this problem been fixed in the current master?

Mikhail Korobov
Collaborator
kmike commented

No, this problem is not fixed.

Robin Munn
rmunn commented

This is a problem for packaging NLTK in Linux distributions, specifically Debian and Ubuntu. I worked on the Ubuntu package a while ago, and had to remove the mallet functionality entirely from it because Ubuntu didn't carry Mallet 0.4 at all. I was also unable to get the NLTK package into Debian, for similar reasons: I started looking at whether I could create a mallet0 package for Debian, got bogged down in Java complexities I wasn't familiar with, and ended up abandoning the Debian NLTK package altogether.

I would like to create a Debian package for NLTK version 3 so that it could be available on both Debian and Ubuntu, but until this issue is resolved one way or the other, I doubt I'll be able to create that Debian package.

Ewan Klein
Owner

Much of the problem is that mallet.py calls internals.py, which adds NLTK_JAR to the Java class path. This is intended to provide more fine-grained access to parameters that were embedded in the Mallet 0.4 CRF Java code. It is not straightforward to update the Java code that compiles to NLTK_JAR, since it calls classes and fields that are no longer present in Mallet 2.x. I propose that we provide a much simpler interface to the Mallet 2.x CRF classifier, one that is just a wrapper for cc.mallet.fst.SimpleTagger. I have a crude version of this working now.
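
For readers wondering what "just a wrapper for cc.mallet.fst.SimpleTagger" might look like, here is a minimal sketch. It is not the crude version mentioned above, which was never posted in this thread. It assumes SimpleTagger's plain-text input (one token per line, features separated by spaces, label last, a blank line between sequences) and the `--train true` and `--model-file` options as described in Mallet's sequence tagging guide. The helper names and jar paths are hypothetical, and the code only builds the input text and the command line; it does not launch Java.

```python
import os
from typing import Iterable, List, Sequence, Tuple

# Fully qualified class named in the comment above.
SIMPLE_TAGGER = "cc.mallet.fst.SimpleTagger"

# One token is (features, label); one sequence is a list of tokens.
Token = Tuple[Sequence[str], str]


def format_sequences(sequences: Iterable[Sequence[Token]],
                     with_labels: bool = True) -> str:
    """Render sequences in the text format SimpleTagger is assumed to read.

    Each token becomes one line of space-separated features, with the
    label appended last when with_labels is True. Sequences are separated
    by a blank line.
    """
    blocks = []
    for seq in sequences:
        lines = []
        for features, label in seq:
            fields = list(features)
            if with_labels:
                fields.append(label)
            lines.append(" ".join(fields))
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


def build_command(jars: Sequence[str], model_file: str,
                  data_file: str, train: bool) -> List[str]:
    """Build the argument list for invoking SimpleTagger (hypothetical helper).

    jars are the Mallet 2.x jar paths; they are joined with the platform's
    class path separator instead of relying on a custom NLTK_JAR.
    """
    classpath = os.pathsep.join(jars)
    cmd = ["java", "-cp", classpath, SIMPLE_TAGGER]
    if train:
        cmd += ["--train", "true"]
    cmd += ["--model-file", model_file, data_file]
    return cmd
```

The resulting list could be handed to `subprocess.run` once the data file has been written. Keeping everything on the Java side inside stock Mallet classes is what would remove the dependency on NLTK_JAR, at the cost of the fine-grained parameter access described above.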

Steven Bird
Owner

If we need to modify the ClassifierI interface please move this to the nltk3beta milestone.

jskda
Collaborator

> It might be a good idea to document that we only support Mallet 0.4 (current version is 2.0.7) or drop support entirely.

I've documented this in the wiki: https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software

Steven Bird
Owner

@ewan-klein can we drop Mallet 0.4 support from NLTK 3? And then we can see about adding Mallet 2 support later? This would deal with @rmunn's issue finally.

Ewan Klein
Owner

Yes, this sounds sensible.

Mikhail Korobov
Collaborator

How much simpler is this Mallet interface going to be? I haven't used Mallet myself, but it looks like its main advantage is flexibility - you can use higher-order transition features and stuff like that.

But if what is needed is a first-order linear-chain CRF, then there are much faster implementations available, like http://wapiti.limsi.fr/ or http://www.chokkan.org/software/crfsuite/. CRFSuite has a "benchmark" page, http://www.chokkan.org/software/crfsuite/benchmark.html, where it claims to be 100x faster to train than Mallet 2.0.6 on the CoNLL 2000 chunking task.

Steven Bird
Owner

@kmike the idea is to remove the Mallet interface for now. We can obviously consider adding one (or more) CRF interfaces as soon as someone wants to contribute them.

Ewan Klein
Owner

The potential advantage of Mallet is that it is well known and also does topic detection. I agree that if people are looking for high performance, then other tools might be better. I did some work on fixing the wrapper code, but haven't had time to finish it off. I'd be happy to have someone take it off my hands!

jskda referenced this issue from a commit in jskda/nltk
Steven Xu removed mallet interface, c.f. nltk#104 701468b

Steven Bird
Owner

@rmunn please open a new issue if there are any other obstacles to Linux packaging.
