Stabilized MaltParser API #944

Merged: 38 commits, Aug 14, 2015

alvations
Contributor

From #943,

The old MaltParser API required several os.environ variables to be set so that it could find the binary, and it then called the jar file with the Java classpath taken from the environment.

  • The new API requires only the directory where the user has installed MaltParser. It finds the jar files with os.walk and calls MaltParser with the full classpath and org.maltparser.Malt instead of -jar.
  • generate_malt_command also makes it easier to update the API when MaltParser changes.
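
The two bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: find_jars and build_malt_command are hypothetical names, and the flag layout (-c, -i, -o, -m) follows MaltParser's command-line conventions as I understand them.

```python
import os


def find_jars(parser_dirname):
    """Collect every .jar file below parser_dirname (hypothetical helper)."""
    jars = []
    for root, _dirs, files in os.walk(parser_dirname):
        for name in files:
            if name.endswith('.jar'):
                jars.append(os.path.join(root, name))
    return sorted(jars)


def build_malt_command(jars, model, input_path, output_path=None, mode='parse'):
    """Build the argument list that calls MaltParser through its main class.

    The full classpath is passed with -cp, so no -jar flag and no
    environment variables are needed.
    """
    classpath = os.pathsep.join(jars)
    cmd = ['java', '-cp', classpath, 'org.maltparser.Malt']
    cmd += ['-c', model, '-i', input_path]
    if output_path is not None:
        cmd += ['-o', output_path]
    cmd += ['-m', mode]
    return cmd
```

Keeping the command construction in one function means a change in MaltParser's options only has to be reflected in a single place.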

I've tried with Maltparser-1.7.2 and Maltparser-1.8

@alvations
Contributor Author

However, there are still problems with DependencyGraph and how it reads the MaltParser output files.

Pre-trained models from http://www.maltparser.org/mco/mco.html output lowercase dependency labels, e.g. nsubj, null, dobj, poss:

1    I    _    PRP    PRP    _    2    nsubj    _    _
2    shot    _    VBD    VBD    _    0    null    _    _
3    an    _    DT    DT    _    4    det    _    _
4    elephant    _    NN    NN    _    2    dobj    _    _
5    in    _    IN    IN    _    2    prep    _    _
6    my    _    PRP$    PRP$    _    7    poss    _    _
7    pajamas    _    NN    NN    _    5    pobj    _    _

But DependencyGraph is expecting labels such as ROOT, SUBJ, SPEC, OBJ. E.g.

1    John    _    NNP   _    _    2    SUBJ    _    _
2    sees    _    VB    _    _    0    ROOT    _    _
3    a       _    DT    _    _    4    SPEC    _    _
4    dog     _    NN    _    _    2    OBJ     _    _

The demo works when we parse with a model trained through NLTK, so replacing the awkward find_binary and having NLTK call MaltParser to retrieve the output now works seamlessly.

But there is still a problem when reading the parses from a pre-trained model in NLTK:

import nltk
from nltk.parse import malt
from nltk import word_tokenize, sent_tokenize

indir = '/home/alvas/maltparser-1.7.2/dist/maltparser-1.7.2/'
modelfilepath = '/home/alvas/engmalt.linear-1.7.mco'
maltParser = malt.MaltParser(path_to_maltparser=indir, model=modelfilepath)

sentences = [word_tokenize(sent) for sent in sent_tokenize('I shot an elephant in my pajamas. This is a foobar sentence')]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in sentences]

maltParser.tagged_parse_sents(tagged_sentences)

[out]:

Traceback (most recent call last):
  File "/home/alvas/git/nltk/test_malt.py", line 15, in <module>
    print maltParser.tagged_parse_sents(tagged_sentences)
  File "/home/alvas/git/nltk/newermalt.py", line 132, in tagged_parse_sents
    DependencyGraph.load(output_file.name))
  File "/home/alvas/git/nltk/nltk/parse/dependencygraph.py", line 156, in load
    for tree_str in infile.read().split('\n\n')
  File "/home/alvas/git/nltk/nltk/parse/dependencygraph.py", line 72, in __init__
    cell_separator=cell_separator,
  File "/home/alvas/git/nltk/nltk/parse/dependencygraph.py", line 260, in _parse
    "The graph does'n contain a node "
nltk.parse.dependencygraph.DependencyGraphError: The graph does'n contain a node that depends on the root element.

MaltParser did create an output file, though, as we can see by adding print output_file.read() before https://github.com/alvations/nltk/blob/develop/nltk/parse/malt.py#L158

output_file.read() prints:

1   I   _   PRP PRP _   2   nsubj   _   _
2   shot    _   VBD VBD _   0   null    _   _
3   an  _   DT  DT  _   4   det _   _
4   elephant    _   NN  NN  _   2   dobj    _   _
5   in  _   IN  IN  _   2   prep    _   _
6   my  _   PRP$    PRP$    _   7   poss    _   _
7   pajamas _   NN  NN  _   5   pobj    _   _
8   .   _   .   .   _   2   punct   _   _

1   This    _   DT  DT  _   5   nsubj   _   _
2   is  _   VBZ VBZ _   5   cop _   _
3   a   _   DT  DT  _   5   det _   _
4   foobar  _   NN  NN  _   5   nn  _   _
5   sentence    _   NN  NN  _   0   null    _
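
The two outputs differ in the label given to the root row (null here, ROOT in the NLTK-trained example), but both mark it with head index 0. The sketch below is plain Python, not DependencyGraph's actual logic, and shows that the root can be located from the head column alone. The function names and the shortened sample are made up for illustration.

```python
def read_conll_sentences(text):
    """Split CoNLL-style text into sentences; each row is a list of columns."""
    sentences = []
    for block in text.strip().split('\n\n'):
        rows = [line.split() for line in block.splitlines() if line.strip()]
        if rows:
            sentences.append(rows)
    return sentences


def find_roots(rows, head_column=6, word_column=1):
    """Return the words whose head index is 0, whatever their relation label."""
    return [row[word_column] for row in rows if row[head_column] == '0']


# Shortened sample: one sentence labelled by a pre-trained model,
# one labelled in the style that NLTK's own demo uses.
SAMPLE = """\
1 I _ PRP PRP _ 2 nsubj _ _
2 shot _ VBD VBD _ 0 null _ _
3 an _ DT DT _ 4 det _ _
4 elephant _ NN NN _ 2 dobj _ _

1 John _ NNP _ _ 2 SUBJ _ _
2 sees _ VB _ _ 0 ROOT _ _
"""

for rows in read_conll_sentences(SAMPLE):
    print(find_roots(rows))
```

Both sentences yield exactly one root, so a reader that keys on the head index does not need to know which label a given model uses for it.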

@alvations
Contributor Author

@dhgarrette , @kmike, @heatherleaf , @stevenbird .

Any idea why the output of the pre-trained model is unreadable by DependencyGraph.load()?
Or are there some secret options in MaltParser that would make it readable to DependencyGraph?

I'll leave this as it is now and let someone else deal with the dependency parses. I'll go back to the translate, model and align packages =)

@Santosh-Gupta

Thanks Alvations!!

I was wondering if you could give an example of how to use it in Python.

@alvations
Contributor Author

@Santosh-Gupta, the demo() shows how you can train a parser and then use it. But loading a pre-trained model is still messy because of the DependencyGraph objects.

@stevenbird
Member

@alvations, that error message was introduced in e0f0630#diff-31ba76604fcce0dbd82cdfd1dba4233d.

@dimazest it looks like this change gets in the way of loading pre-trained models. Are you able to investigate please?

@stevenbird stevenbird self-assigned this Apr 18, 2015
@stevenbird
Member

Just pinging you again @dimazest

@dimazest
Contributor

Sorry, I somehow missed the first mention. I'll have a look at this right now...

dimazest added a commit to dimazest/nltk that referenced this pull request Apr 28, 2015
…ion.

This should resolve issues faced at nltk#944. However, there is code that
depends on a fake root node, for example the tree visualisation code reads this and FStructure.to_depgraph() sets it.
@stevenbird
Member

@dimazest thanks for the PR. @alvations, are you able to load pre-trained models now?

@alvations
Contributor Author

Sorry for the late reply. @dimazest thanks for the fix!! @stevenbird, the malt API now works with a pre-trained model.

I'm not sure why it only works with malt.MaltParser.parse_one(sentence):

>>> from nltk.parse.malt import MaltParser
>>> _path_to_maltparser = '/home/alvas/maltparser-1.8/dist/maltparser-1.8/'
>>> _path_to_model = '/home/alvas/engmalt.linear-1.7.mco'
>>> mp = MaltParser(path_to_maltparser=_path_to_maltparser, model=_path_to_model)
>>> sent = 'I shot an elephant in my pajamas'.split()
>>> print(mp.parse_one(sent).tree())
(pajamas (shot I) an elephant in my)

But when I tried malt.MaltParser.parse_sents(sentences) on multiple sentences, it returned listiterator objects rather than DependencyGraph objects:

>>> from nltk.parse.malt import MaltParser
>>> _path_to_maltparser = '/home/alvas/maltparser-1.8/dist/maltparser-1.8/'
>>> _path_to_model = '/home/alvas/engmalt.linear-1.7.mco'
>>> mp = MaltParser(path_to_maltparser=_path_to_maltparser, model=_path_to_model)
>>> sent = 'I shot an elephant in my pajamas'.split()
>>> sent2 = 'Time flies like banana'.split()
>>> print(mp.parse_one(sent).tree())
(pajamas (shot I) an elephant in my)
>>> print(next(mp.parse_sents([sent,sent2])))
<listiterator object at 0x7f0a2e4d3d90> 
>>> print(next(next(mp.parse_sents([sent,sent2]))))
[{u'address': 0,
  u'ctag': u'TOP',
  u'deps': [2],
  u'feats': None,
  u'lemma': None,
  u'rel': u'TOP',
  u'tag': u'TOP',
  u'word': None},
 {u'address': 1,
  u'ctag': u'NN',
  u'deps': [],
  u'feats': u'_',
  u'head': 2,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NN',
  u'word': u'I'},
 {u'address': 2,
  u'ctag': u'NN',
  u'deps': [1, 11],
  u'feats': u'_',
  u'head': 0,
  u'lemma': u'_',
  u'rel': u'null',
  u'tag': u'NN',
  u'word': u'shot'},
 {u'address': 3,
  u'ctag': u'AT',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'AT',
  u'word': u'an'},
 {u'address': 4,
  u'ctag': u'NN',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NN',
  u'word': u'elephant'},
 {u'address': 5,
  u'ctag': u'NN',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NN',
  u'word': u'in'},
 {u'address': 6,
  u'ctag': u'NN',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NN',
  u'word': u'my'},
 {u'address': 7,
  u'ctag': u'NNS',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NNS',
  u'word': u'pajamas'},
 {u'address': 8,
  u'ctag': u'NN',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NN',
  u'word': u'Time'},
 {u'address': 9,
  u'ctag': u'NNS',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NNS',
  u'word': u'flies'},
 {u'address': 10,
  u'ctag': u'NN',
  u'deps': [],
  u'feats': u'_',
  u'head': 11,
  u'lemma': u'_',
  u'rel': u'nn',
  u'tag': u'NN',
  u'word': u'like'},
 {u'address': 11,
  u'ctag': u'NN',
  u'deps': [3, 4, 5, 6, 7, 8, 9, 10],
  u'feats': u'_',
  u'head': 2,
  u'lemma': u'_',
  u'rel': u'dep',
  u'tag': u'NN',
  u'word': u'banana'}]
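
As far as I understand NLTK's parser interface, parse_sents() is meant to return an iterator of iterators: one inner iterator of candidate parses per input sentence. That would explain the listiterator above. The sketch below shows one way to consume such a nested result. fake_parse_sents is a hypothetical stand-in so that the example runs without MaltParser installed, and first_parse_per_sentence is an illustrative helper, not part of NLTK.

```python
def first_parse_per_sentence(nested_parses):
    """Yield the first parse from each inner iterator, skipping empty ones."""
    for parses in nested_parses:
        first = next(iter(parses), None)
        if first is not None:
            yield first


def fake_parse_sents(sentences):
    """Hypothetical stand-in for a parser: one inner iterator per sentence."""
    return iter(iter(['<parse of %s>' % ' '.join(sent)]) for sent in sentences)


sents = ['I shot an elephant'.split(), 'Time flies'.split()]
for parse in first_parse_per_sentence(fake_parse_sents(sents)):
    print(parse)
```

With a real parser, each yielded item would be a DependencyGraph, so calling .tree() on it would replace the nested next(next(...)) calls.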

@alvations
Contributor Author

With help from http://goo.gl/TpW1iY, I managed to get a tree from parse_sents() by calling print(next(next(mp.parse_sents([sent,sent2]))).tree()). However, parse_sents() looks broken: it combines the two sentences into one instead of parsing them separately.

    # Initialize a MaltParser object with a pre-trained model.
    mp = MaltParser(path_to_maltparser=path_to_maltparser, model=path_to_model) 
    sent = 'I shot an elephant in my pajamas'.split()
    sent2 = 'Time flies like banana'.split()
    # Parse a single sentence.
    print(mp.parse_one(sent).tree())
    # Parse multiple sentences and print the tree of the first result.
    print(next(next(mp.parse_sents([sent,sent2]))).tree())

[out]:

(pajamas (shot I) an elephant in my)
(shot I (banana an elephant in my pajamas Time flies like))
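
The thread does not say what caused the two sentences to merge. One plausible cause, offered here only as an assumption, is that the sentences reached MaltParser without a blank line between them. That would be consistent with the node addresses running on from 8 to 11 in the earlier output. The sketch below writes tagged sentences as 10-column CoNLL text with an explicit sentence separator. to_conll is a hypothetical helper, and the head 0 and relation a are arbitrary placeholders for the columns the parser fills in.

```python
def to_conll(tagged_sentences):
    """Format (word, tag) sentences as 10-column, tab-separated CoNLL text.

    Token numbering restarts at 1 for every sentence, and each sentence
    is terminated by a blank line.
    """
    lines = []
    for sentence in tagged_sentences:
        for index, (word, tag) in enumerate(sentence, start=1):
            # Columns: id, form, lemma, cpostag, postag, feats, head, deprel, phead, pdeprel
            columns = [str(index), word, '_', tag, tag, '_', '0', 'a', '_', '_']
            lines.append('\t'.join(columns))
        lines.append('')  # blank line marks the sentence boundary
    return '\n'.join(lines) + '\n'


print(to_conll([[('I', 'PRP'), ('shot', 'VBD')],
                [('Time', 'NN'), ('flies', 'VBZ')]]))
```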

@alvations
Contributor Author

@dimazest @stevenbird: Fixed at last. Now we can easily parse any sentences with the malt API, and I'll be able to use this for tree2string models in nltk.translate.

@stevenbird
Member

Thanks @alvations and @dimazest.

If either of you has time, it would be nice to include a doctest with a little demonstration in the docstring for the MaltParser class, cf: https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L120

Syncing with bleeding edge develop branch
(shot I (elephant an) (in (pajamas my)) .)
"""
def __init__(self, parser_dirname, model_filename=None, tagger=None,
             additional_java_args=[]):
Contributor

Please make additional_java_args=None and add this

if additional_java_args is None:
    additional_java_args = []

as having mutable default parameters might lead to obscure bugs.
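
A small standalone illustration of the problem the reviewer describes. The function names are made up and are not from the PR: a default list is created once, when the function is defined, so every call that omits the argument shares the same object.

```python
def add_flag_shared(flag, args=[]):
    """Buggy: the default list is created once and shared across calls."""
    args.append(flag)
    return args


def add_flag_safe(flag, args=None):
    """Safe: a fresh list is created on every call that omits args."""
    if args is None:
        args = []
    args.append(flag)
    return args


print(add_flag_shared('-Xmx512m'))   # ['-Xmx512m']
print(add_flag_shared('-Xmx1024m'))  # ['-Xmx512m', '-Xmx1024m'], state leaked
print(add_flag_safe('-Xmx512m'))     # ['-Xmx512m']
print(add_flag_safe('-Xmx1024m'))    # ['-Xmx1024m']
```

In a class like MaltParser, the shared default would mean that Java arguments appended through one instance could show up in another.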

@alvations
Contributor Author

@dimazest , @stevenbird It's all patched up.

@stevenbird
Member

Thanks @dimazest for the code review, and @alvations for all this work. It's looking good to me, so I'm going to merge.

stevenbird added a commit that referenced this pull request Aug 14, 2015
@stevenbird stevenbird merged commit 73fc655 into nltk:develop Aug 14, 2015
@alvations alvations deleted the patch-1 branch August 25, 2015 10:36
5 participants