Support for alignment to text tokens #19

Closed
danielhers opened this issue Jul 28, 2018 · 17 comments

@danielhers
Contributor

AMR supports annotations on nodes and edges, specifying alignment to one or more token indices. See an example in Nathan Schneider's AMR IO module: https://github.com/nschneid/amr-hackathon/blob/master/src/amr.py#L172
The module I mentioned supports reading AMRs, but building them from triples for the purpose of writing to file is more convenient with the penman package - except that it doesn't support alignments.
A simple solution would be to allow a suffix to a node/edge label containing "~e." and a number.

@goodmami
Owner

Hi Daniel,

In principle I'm happy to support the surface alignments, but I haven't seen much description of them. Nathan Schneider's PEG has this:

ALIGNMENT = "~" ~r"[A-Za-z0-9.,]+"

which looks pretty free-form to me (~.Ab,,,C.1 is valid?). In the examples in his amr.py, I only see two variants: ~e.N, where N is a token index (0-based), and ~e.M,N, where M,N is a list of token indices (also 0-based). Assuming the latter can have more than just 2 indices, this means there's really one form, something like:

ALIGNMENT = "~e." ~r"[0-9]+" ("," ~r"[0-9]+")*
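
To make the difference between the two rules concrete, here is a minimal Python sketch that rewrites both as regular expressions. The names LOOSE, STRICT, and is_strict_alignment are made up for illustration; they do not come from amr.py or penman.

```python
import re

# Loose rule from the PEG: "~" followed by any mix of letters, digits, "." and ","
LOOSE = re.compile(r"~[A-Za-z0-9.,]+")
# Stricter rule proposed above: "~e." followed by comma-separated integers
STRICT = re.compile(r"~e\.[0-9]+(?:,[0-9]+)*")


def is_strict_alignment(suffix: str) -> bool:
    """Return True if the whole suffix matches the stricter rule."""
    return STRICT.fullmatch(suffix) is not None


# Print, for each suffix, whether the loose and the strict rule accept it
for suffix in ["~e.1", "~e.6,7", "~.Ab,,,C.1"]:
    print(suffix, LOOSE.fullmatch(suffix) is not None, is_strict_alignment(suffix))
```

The odd-looking ~.Ab,,,C.1 is accepted by the loose pattern but rejected by the strict one, which is the gap in question.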

Finally, I'm confused as to what it means to align relations (edges) to tokens. E.g., in the example from the README, ~e.1 seems redundant on :polarity and I'm not sure why ~e.0 is on :mode:

>>> a = AMR("(h / hug-01~e.2 :polarity~e.1 -~e.1 :ARG0 (y / you~e.3) :ARG1 y \
             :mode~e.0 imperative~e.5 :result (s / silly-01~e.4 :ARG1 y))", \
            "Do n't hug yourself silly !".split())
>>> a
(h / hug-01~e.2[hug] :polarity~e.1[n't] -~e.1[n't]
    :ARG0 (y / you~e.3[yourself])
    :ARG1 y
    :mode~e.0[Do] imperative~e.5[!]
    :result (s / silly-01~e.4[silly]
        :ARG1 y))

But still, I have not seen these alignments in any gold data, papers about AMR, or the dozen or so AMR-related code projects I've browsed (besides the amr-hackathon one, of course). The only other notation for token alignments I've seen is the JAMR-style metadata string, e.g.:

# ::tok Saudi Arabia ( SA )
# ::alignments 0-2|0+0.0+0.0.0+0.0.1
(c / country
  :name (n / name
          :op1 "Saudi"
          :op2 "Arabia"))

@nschneid, can you please confirm if the more restrictive ALIGNMENT rule above is accurate/sufficient? And you call them "ISI-style alignments"; does that mean there is some ISI document that describes them?

Thank you for any clarification.

@nschneid

"ISI-style" because they are of the kind produced by the ISI aligner. The output of this aligner is included in the LDC release of AMR. However, I haven't used this code in a while, so I can't speak to the precise constraints on the output alignments.

@goodmami
Owner

goodmami commented Jul 28, 2018

Thanks for the explanation and link! Somehow that paper didn't turn up in my search, so I must've been using the wrong search terms. I don't have the LDC data, but I could not find alignments in the sample or documentation (although the DEFT corpus claims to have alignments). Nevertheless, your pointers helped me turn up some more info.

Here's what I found:

  • the paper you linked to used ~N 1-based alignments (no mention of ~M,N alignments)
  • I think this is the code corresponding to the paper, though there's no documentation
  • The dev data associated with the paper uses ::alignments metadata, though the format is different from JAMR's
  • I then found Chu and Kurohashi, 2017, which uses ~eN (0-based, no "." after "~e"), and seems to suggest that the "e" stands for "English"
  • The LDC README says there's a description of the alignment format at doc/AMR-alignment-format.txt, which may be a file in the LDC release (https://amr.isi.edu/doc/AMR-alignment-format.html is a broken URL)
  • The LDC README also mentions https://amr.isi.edu/doc/amr-alignment-guidelines.html, which has info about why relations are aligned (so does Chu and Kurohashi, 2017), although it has no mention of the alignment format
  • The BioAMR corpus uses ::alignments metadata as well as the ~e.N and ~e.M,N (0-based) annotations; the metadata format is the same as the Pourdamghani et al. 2014 data (different from JAMR's)

So now I have 3 in-graph alignment formats:

  • ~N
  • ~eN
  • ~e.N and ~e.M,N

And two metadata formats:

  • ::alignments M-N|i+i.j (M-N is a token span, i+i.j is the ith node and the jth node of the ith node, all 0-based)
  • ::alignments M-i.j (M is a token index (0-based), i.j is the jth node of the ith node (1-based))
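
As a rough illustration of the two metadata formats, the sketch below splits each one into token positions and node addresses. Both function names are hypothetical, the second sample string is invented, and the code only covers the shapes described in the two bullets above; real releases may contain variants it does not handle.

```python
from typing import List, Tuple


def parse_jamr_alignments(value: str) -> List[Tuple[int, int, List[str]]]:
    """Parse 'M-N|i+i.j ...' into (start, end, node addresses) entries."""
    entries = []
    for item in value.split():
        span, nodes = item.split("|")
        start, end = (int(n) for n in span.split("-"))
        # Whether `end` is inclusive or exclusive is left to the caller
        entries.append((start, end, nodes.split("+")))
    return entries


def parse_isi_alignments(value: str) -> List[Tuple[int, str]]:
    """Parse 'M-i.j ...' into (token index, node address) pairs."""
    pairs = []
    for item in value.split():
        token, address = item.split("-", 1)
        pairs.append((int(token), address))
    return pairs


# First string is the JAMR example from earlier in the thread
print(parse_jamr_alignments("0-2|0+0.0+0.0.0+0.0.1"))
# Second string is invented sample data in the M-i.j shape
print(parse_isi_alignments("0-1.1 2-1"))
```

The node addresses are kept as strings because resolving them requires walking the tree form of the AMR, which is outside the scope of this sketch.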

It seems the last variants of both kinds of annotations are the more "official" ones.

Also, none of these have character spans as an option, only tokens, so the input must be tokenized. This matters both for English (e.g., "didn't" -> "did" + "n't") and for Chinese (which generally doesn't use spaces). It means that either the tokenized string (e.g., ::tok metadata, I think) must be present for the annotations to make any sense, or the tokenizer must be specified and fixed to the data release.
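
A small sketch of why the tokenized string is needed: the indices only mean something once they are looked up in the token list. The helper tokens_for is hypothetical, and it assumes the tokens sit on their own "# ::tok" line and are separated by whitespace.

```python
from typing import List


def tokens_for(tok_line: str, indices: List[int]) -> List[str]:
    """Look up 0-based token indices in a '# ::tok ...' metadata line."""
    prefix = "# ::tok "
    if not tok_line.startswith(prefix):
        raise ValueError("expected a '# ::tok' metadata line")
    tokens = tok_line[len(prefix):].split()
    return [tokens[i] for i in indices]


# Sentence from the amr.py README example quoted earlier in the thread
tok = "# ::tok Do n't hug yourself silly !"
print(tokens_for(tok, [2]))  # indices from hug-01~e.2
print(tokens_for(tok, [1]))  # indices from :polarity~e.1
```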

@danielhers
Contributor Author

Thank you so much for the attention and time spent looking into this!

The official AMR data (LDC2017T10) contains both unaligned AMRs, and AMRs with alignments (automatically generated by the ISI aligner) in both formats you mentioned (the last variants), just the same as the freely-available Bio AMR Corpus.
Of course ::tok is necessary for them to make any sense.

Here is the code I use to parse this suffix from Nathan's AMR data structure. Basically just stripping the ~e. and splitting by commas, assuming I get integers. I suppose it's true that the PEG could be more restrictive to capture the actual supported syntax, but I think it's kept general to support arbitrary suffixes, which could include alignments to multiple languages etc., although I've never seen anything like this in practice. I've used my code successfully with the official AMR data.
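
For readers without access to the linked code, the strip-and-split approach described here might look roughly like the following. This is an illustrative sketch only, not the actual code from the link; split_alignment is a made-up name.

```python
from typing import List, Tuple


def split_alignment(label: str) -> Tuple[str, List[int]]:
    """Split 'hug-01~e.2' into ('hug-01', [2]); labels without '~e.' get []."""
    base, sep, suffix = label.partition("~e.")
    if not sep:
        return label, []
    # Assumes everything after '~e.' is a comma-separated list of integers
    return base, [int(i) for i in suffix.split(",")]


print(split_alignment("hug-01~e.2"))
print(split_alignment("cell-line~e.6,7"))
print(split_alignment("y"))
```

A suffix that is not a list of integers raises ValueError from int(), which matches the "assuming I get integers" caveat.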

Regarding alignment of relations, I'm not sure either why both :polarity and - are aligned, but I do understand why some relations are aligned. For example, in the first AMR in the Bio AMR dev set, ARG2 is aligned to "in" because this preposition expresses the semantic relation represented by ARG2.

@goodmami
Owner

goodmami commented Aug 3, 2018

@danielhers Ok so I sat down to look at this, and then I noticed that you had done a lot of work on the amr-hackathon implementation, so you've probably thought through some of the same problems. In general I kinda like how # ::alignments are separate from the graph, but since I parse the AMR as a graph and not a tree, they are fragile (they have to be recalculated if you alter the structure or select a new top). So I'll need to map normalized (deinverted) triples to tokens (as you do) rather than tree positions to tokens. I'm not really satisfied with having separate maps for regular alignments and role alignments, but it might be the most practical step for now. One complication is that there are three ways to create a graph (Graph(), codec.triples_to_graph(), and codec.decode()), where the latter two will normalize triples but the first will not.

How do you want to specify alignments when you create a graph from triples? Something like this?

penman.Graph(
    [triple1, triple2, triple3],
    alignments={triple1: [2], triple2: [4,5], triple3: [1]},
    role_alignments={triple2: [3]}
)
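
To show what triple-keyed maps like these could look like in practice, here is a standalone sketch. AlignedGraph is a hypothetical stand-in written for this illustration; it is not penman's Graph class and makes no claim about the API that was eventually released. The triples and indices are taken from the "Do n't hug yourself silly !" example.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]


class AlignedGraph:
    """Hypothetical container: triples plus two token-alignment maps."""

    def __init__(
        self,
        triples: Iterable[Triple],
        alignments: Optional[Dict[Triple, List[int]]] = None,
        role_alignments: Optional[Dict[Triple, List[int]]] = None,
    ) -> None:
        self.triples = list(triples)
        self.alignments = dict(alignments or {})
        self.role_alignments = dict(role_alignments or {})
        # Every aligned triple must actually be part of the graph
        unknown = (set(self.alignments) | set(self.role_alignments)) - set(self.triples)
        if unknown:
            raise ValueError(f"alignments for unknown triples: {sorted(unknown)}")


hug = ("h", "instance", "hug-01")
neg = ("h", "polarity", "-")
you = ("y", "instance", "you")
graph = AlignedGraph(
    [hug, neg, you],
    alignments={hug: [2], neg: [1], you: [3]},
    role_alignments={neg: [1]},
)
print(graph.alignments[neg], graph.role_alignments[neg])
```

Because tuples are hashable, the same triple can key both maps, which is how -~e.1 and :polarity~e.1 stay distinct.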

@danielhers
Contributor Author

danielhers commented Aug 4, 2018 via email

goodmami added a commit that referenced this issue Aug 6, 2018
@goodmami goodmami added this to the v0.7.0 milestone Aug 6, 2018
@goodmami
Owner

goodmami commented Aug 6, 2018

@danielhers I've pushed an implementation to the 'alignments' branch. Can you try it out?

I allow it to read any of the three in-line formats (~1, ~e1, ~e.1) with one or more integer alignments. The general pattern is given by the following:

    ALIGNMENT_RE = re.compile(r'~([a-zA-Z]\.?)?(\d+(?:,\d+)*)\s*')
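
A quick way to see what this pattern captures for each in-line format is the sketch below. The regular expression is copied from the line above; the helper read_alignment is made up for illustration and is not part of the branch.

```python
import re

ALIGNMENT_RE = re.compile(r'~([a-zA-Z]\.?)?(\d+(?:,\d+)*)\s*')


def read_alignment(text: str):
    """Return (prefix, indices) for an alignment at the start of text, else None."""
    m = ALIGNMENT_RE.match(text)
    if m is None:
        return None
    # group(1) is the optional letter prefix, group(2) the comma-separated indices
    return m.group(1), [int(i) for i in m.group(2).split(",")]


for s in ["~1", "~e1", "~e.1", "~e.6,7", "~e."]:
    print(s, read_alignment(s))
```

The first group keeps the prefix ("e", "e.", or nothing), so the input style can still be inspected even though only ~e. is written out.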

There are a number of limitations, currently:

  • It assumes 0-based token indices (even for ~1)
  • It only outputs the ~e. format, regardless of the input format
  • It neither reads nor writes the # ::alignments metadata (see Read metadata lines #23)
  • It now disallows ~ characters from appearing in relations and symbols, as it will confuse the pattern match for alignments
  • It does not print alignments on triples (e.g., triples=True in the API or --triples in the script), because commas are used as the delimiter for both alignments and triple parts (consider month(d, 12~e1,2,3)); it may be possible to do this unambiguously because the first argument inside the parentheses should never have an alignment (I think), but in any case I didn't include support for it just yet
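
The comma ambiguity in the last point, and the possible way around it, can be sketched as follows. parse_triple is a hypothetical helper, and it leans on the unverified assumption stated above that the first argument never carries an alignment (and so never contains a comma).

```python
from typing import Tuple


def parse_triple(text: str) -> Tuple[str, str, str]:
    """Parse 'role(source, target~eN,...)' by splitting on the first comma only."""
    role, _, rest = text.partition("(")
    args = rest.rstrip()
    if args.endswith(")"):
        args = args[:-1]
    source, target = args.split(",", 1)
    return role.strip(), source.strip(), target.strip()


text = "month(d, 12~e1,2,3)"
# Splitting on every comma breaks the alignment list into separate pieces
naive = [part.strip() for part in text[text.index("(") + 1:-1].split(",")]
print(naive)
print(parse_triple(text))
```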

This implementation should at least cover your use case. I have a simple unit test, and I checked some outputs on the auto-aligned Bio AMR corpus, but I'd like to know if you run into any problems. If all is good I'll merge it into develop and make a release soon after.

@goodmami
Owner

goodmami commented Aug 9, 2018

I did a more thorough comparison using the Bio AMR corpus, and I noticed a problem: if a reentrancy had an alignment but on serialization the node gets expanded (with its concept, etc.) at that point, the alignment is printed after the branch ends:

(h / harbor-01~e.12
      :ARG0 c3~e.11 
   ...

becomes

(h / harbor-01~e.12
      :ARG0 (c3 / cell-line~e.6,7
            :quant 6~e.4
            :mod (d / disease
                  :name (n2 / name
                        :op1 "CRC"~e.5)))~e.11
   ...

Unless I can fix #25, I think the best I can do is discard the alignments (e.11 above), since they are not in a valid location, and neither would ... :ARG0 (c3~e.11 / cell-line~e.6,7....

@danielhers
Contributor Author

I guess it wouldn't be that bad to drop the alignment if this phenomenon is rare.

I did start to try this out in https://github.com/danielhers/semstr/tree/alignments, but I still need to resolve issues unrelated to the new alignments feature (since I switched from amr-hackathon to penman for AMR reading too).

@goodmami
Owner

I'd like to get a fix for #25 so that hand-annotated AMRs (including alignments) are losslessly round-tripped through the graph representation, but that wouldn't fix the case where you serialize a new AMR from triples. One solution is to consider alignments on the concept/node-type to be analyzed as alignments on the node. That is, when you have:

(a / abc :ARG0 (b / bcd~e.1) ...

The alignment is placed on (a, ARG0, b). Then, whichever variable is expanded to a full node gets the alignment serialized on the concept. This will complicate parsing a bit (since the alignment is not assigned to the node currently being parsed, but to the previous), and I'd have to special-case the root node (or include an explicit triple to indicate the root).

@danielhers
Contributor Author

This does make sense. Since role alignments are separate from node alignments, there wouldn't be an ambiguity between the two.

@danielhers
Contributor Author

Not sure it's related to the alignments at all, but I had an issue when trying to use penman to read AMRs in my branch. When reading a_pmid_2094_2929.48 from the BioAMR training set (which contains the triple f / figure~e.17 :mod "1A"~e.19), I was asking if "1A" in amr.variables() and the answer was True. I think this is because of the :mod edge, which is an inverted :domain. In a way, it is true that the triple ("1A", "domain", "f") exists, but that doesn't mean "1A" is a variable, while f is.

@nschneid

Interesting. I was not aware that :mod "..." was allowed; see the above issue.

@nschneid

Perhaps others will weigh in on this, but I think the simplest solution for now is to stipulate that only :mod relations between variables should be considered inverted :domain relations.
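
One way to read that stipulation, as a rough sketch: swap a :mod triple into :domain only when its target is a known variable. The function normalize_mod is hypothetical and is not penman's implementation; the figure/"1A" triple comes from the report above, while the car/red triples are invented to show the variable-to-variable case.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def normalize_mod(triples: List[Triple], variables: Set[str]) -> List[Triple]:
    """Rewrite (x, 'mod', y) as (y, 'domain', x) only when y is a variable."""
    out = []
    for source, role, target in triples:
        if role == "mod" and target in variables:
            out.append((target, "domain", source))
        else:
            # Constants such as "1A" stay on the target side of :mod
            out.append((source, role, target))
    return out


triples = [
    ("f", "instance", "figure"),
    ("f", "mod", '"1A"'),
    ("c", "instance", "car"),
    ("c", "mod", "r"),
    ("r", "instance", "red"),
]
# Here a variable is anything that has an instance triple
variables = {source for source, role, _ in triples if role == "instance"}
normalized = normalize_mod(triples, variables)
print(normalized)
```

With this rule the string "1A" never ends up in source position, so a variables() check based on triple sources would no longer report it.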

@goodmami
Owner

@nschneid thanks! I was just about to write up the issue at amrisi/amr-guidelines. I added some more info in #26.

> stipulate that only :mod relations between variables should be considered inverted :domain relations.

Ok, I can probably manage something like that. At least, I'll try to not normalize :mod triples to :domain ones, but maybe still allow them to be inverted on serialization.

@danielhers
Contributor Author

@goodmami after adding a workaround for #26, using penman with alignments for both reading and writing AMRs seems to work great!

@goodmami
Owner

@danielhers thanks! I had been working on a more significant set of changes to parsing and serialization, but had to step away for a bit. I'll try to wrap those up, but in the meantime I'm glad you got something working. I'll take a look at your workarounds to see if they would be generally useful in Penman.
