Support for alignment to text tokens #19

danielhers · 2018-07-28T08:03:24Z

AMR supports annotations on nodes and edges, specifying alignment to one or more token indices. See an example in Nathan Schneider's AMR IO module: https://github.com/nschneid/amr-hackathon/blob/master/src/amr.py#L172
The module I mentioned supports reading AMRs, but building them from triples for the purpose of writing to file is more convenient with the penman package - except that it doesn't support alignments.
A simple solution would be to allow a suffix to a node/edge label containing "~e." and a number.

goodmami · 2018-07-28T18:05:48Z

Hi Daniel,

In principle I'm happy to support the surface alignments, but I haven't seen much description of them. Nathan Schneider's PEG has this:

ALIGNMENT = "~" ~r"[A-Za-z0-9.,]+"

which looks pretty free-form to me (~.Ab,,,C.1 is valid?). In the examples in his amr.py, I only see two variants: ~e.N, where N is a token index (0-based), and ~e.M,N, where M,N is a list of token indices (also 0-based). Assuming the latter can have more than just 2 indices, this means there's really one form, something like:

ALIGNMENT = "~e." ~r"[0-9]+" ("," ~r"[0-9]+")*
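
As a rough illustration of the difference between the two rules, both can be translated into Python regular expressions and tried on a few suffixes. This is only a sketch: the names FREE_FORM and RESTRICTED are made up here and do not come from amr.py or penman.

```python
import re

# Translation of the original PEG rule: "~" followed by any mix of
# letters, digits, periods, and commas.
FREE_FORM = re.compile(r'~[A-Za-z0-9.,]+')

# Translation of the more restrictive rule: "~e." followed by one or
# more comma-separated integers.
RESTRICTED = re.compile(r'~e\.[0-9]+(?:,[0-9]+)*')

for suffix in ['~e.2', '~e.4,5', '~e.1,2,3', '~.Ab,,,C.1']:
    # Print whether each rule accepts the whole suffix.
    print(suffix,
          bool(FREE_FORM.fullmatch(suffix)),
          bool(RESTRICTED.fullmatch(suffix)))
```

Both rules accept the ~e.N and ~e.M,N forms, but only the free-form rule accepts a string like ~.Ab,,,C.1.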

Finally, I'm confused as to what it means to align relations (edges) to tokens. E.g., in the example from the README, ~e.1 seems redundant on :polarity and I'm not sure why ~e.0 is on :mode:

>>> a = AMR("(h / hug-01~e.2 :polarity~e.1 -~e.1 :ARG0 (y / you~e.3) :ARG1 y \
             :mode~e.0 imperative~e.5 :result (s / silly-01~e.4 :ARG1 y))", \
            "Do n't hug yourself silly !".split())
>>> a
(h / hug-01~e.2[hug] :polarity~e.1[n't] -~e.1[n't]
    :ARG0 (y / you~e.3[yourself])
    :ARG1 y
    :mode~e.0[Do] imperative~e.5[!]
    :result (s / silly-01~e.4[silly]
        :ARG1 y))

But still, I have not seen these alignments in any gold data, papers about AMR, or the dozen or so AMR-related code projects I've browsed (besides the amr-hackathon one, of course). The only other notation for token alignments I've seen is the JAMR-style metadata string, e.g.:

# ::tok Saudi Arabia ( SA )
# ::alignments 0-2|0+0.0+0.0.0+0.0.1
(c / country
  :name (n / name
          :op1 "Saudi"
          :op2 "Arabia"))
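
For reference, a string like the one above could be taken apart along these lines. This is a minimal sketch, not JAMR's or penman's code: the function name parse_jamr_alignments is hypothetical, and it assumes that multiple alignments are separated by whitespace and that the end of a token span is exclusive (which is consistent with 0-2 covering "Saudi Arabia" here).

```python
def parse_jamr_alignments(value):
    """Parse the value of a JAMR-style ::alignments line.

    Returns a list of ((start, end), [node_address, ...]) pairs.
    """
    result = []
    for item in value.split():  # assumption: items are whitespace-separated
        span, _, nodes = item.partition('|')
        start, _, end = span.partition('-')
        result.append(((int(start), int(end)), nodes.split('+')))
    return result


tokens = 'Saudi Arabia ( SA )'.split()
for (start, end), nodes in parse_jamr_alignments('0-2|0+0.0+0.0.0+0.0.1'):
    # assumption: the span end is exclusive, so 0-2 covers tokens 0 and 1
    print(tokens[start:end], nodes)
```

The node addresses are kept as strings such as 0.0.1 because they are positions in the tree, not token indices.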

@nschneid, can you please confirm if the more restrictive ALIGNMENT rule above is accurate/sufficient? And you call them "ISI-style alignments"; does that mean there is some ISI document that describes them?

Thank you for any clarification.

nschneid · 2018-07-28T18:35:38Z

"ISI-style" because they are of the kind produced by the ISI aligner. The output of this aligner is included in the LDC release of AMR. However, I haven't used this code in a while, so I can't speak to the precise constraints on the output alignments.

goodmami · 2018-07-28T20:04:18Z

Thanks for the explanation and link! Somehow that paper didn't turn up in my search, so I must've been using the wrong search terms. I don't have the LDC data, but I could not find alignments in the sample or documentation (although the DEFT corpus claims to have alignments). Nevertheless, your pointers helped me turn up some more info.

Here's what I found:

the paper you linked to used ~N 1-based alignments (no mention about ~M,N alignments)
I think this is the code corresponding to the paper, though there's no documentation
The dev data associated with the paper uses ::alignments metadata, though the format is different from JAMR's
I then found Chu and Kurohashi, 2017, which uses ~eN (0-based, no "." after "~e"), and seems to suggest that the "e" stands for "English"
The LDC README says there's a description of the alignment format at doc/AMR-alignment-format.txt, which may be a file in the LDC release (https://amr.isi.edu/doc/AMR-alignment-format.html is a broken URL)
The LDC README also mentions https://amr.isi.edu/doc/amr-alignment-guidelines.html, which has info about why relations are aligned (so does Chu and Kurohashi, 2017), although it has no mention of the alignment format
The BioAMR corpus uses ::alignments metadata as well as the ~e.N and ~e.M,N (0-based) annotations; the metadata format is the same as the Pourdamghani et al. 2014 data (different from JAMR's)

So now I have 3 in-graph alignment formats:

~N
~eN
~e.N and ~e.M,N

And two metadata formats:

::alignments M-N|i+i.j (M-N is a token span, i+i.j is the ith node and the jth node of the ith node, all 0-based)
::alignments M-i.j (M is a token index (0-based), i.j is the jth node of the ith node (1-based))

It seems the last variants of both kinds of annotations are the more "official" ones.

Also, none of these have character spans as an option, only token indices, so the input must be tokenized. This matters both for English (e.g., "didn't" -> "did" + "n't") and for Chinese (which generally doesn't use spaces). It means the tokenized string (e.g., the ::tok metadata, I think) must be present for the annotations to make any sense, or the tokenizer must be specified and fixed for the data release.
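
To illustrate why the tokenization has to travel with the data, here is a small sketch that resolves token indices against a ::tok line. The helper tokens_from_metadata and the ::tok line itself are made up for this example (the sentence is the one from the amr-hackathon README); nothing here is penman API.

```python
def tokens_from_metadata(line):
    """Return the token list from a '# ::tok ...' metadata line."""
    prefix = '# ::tok '
    if not line.startswith(prefix):
        raise ValueError('not a ::tok metadata line')
    return line[len(prefix):].split()


tokens = tokens_from_metadata("# ::tok Do n't hug yourself silly !")

# 0-based token indices taken from the README example
aligned = {'hug-01': [2], '-': [1], 'you': [3], 'silly-01': [4]}
for label, indices in aligned.items():
    print(label, [tokens[i] for i in indices])

# The same indices point at different words if the untokenized
# sentence is naively split on whitespace instead.
naive = "Don't hug yourself silly!".split()
print(tokens[2], 'vs.', naive[2])
```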

danielhers · 2018-07-29T11:24:39Z

Thank you so much for the attention and time spent looking into this!

The official AMR data (LDC2017T10) contains both unaligned AMRs, and AMRs with alignments (automatically generated by the ISI aligner) in both formats you mentioned (the last variants), just the same as the freely-available Bio AMR Corpus.
Of course ::tok is necessary for them to make any sense.

Here is the code I use to parse this suffix from Nathan's AMR data structure. Basically just stripping the ~e. and splitting by commas, assuming I get integers. I suppose it's true that the PEG could be more restrictive to capture the actual supported syntax, but I think it's general to support arbitrary suffixes, which may include alignments to multiple languages etc., although I've never seen anything like this in practice. I've used my code successfully with the official AMR data.
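
The linked code is not reproduced in this thread. The approach described (strip the ~e. marker, split on commas, assume integers) might look roughly like the following sketch, where split_alignment is a hypothetical name and not a function from amr.py or semstr.

```python
def split_alignment(label, marker='~e.'):
    """Split a label such as 'hug-01~e.2' into ('hug-01', [2]).

    Labels without the marker are returned with an empty index list.
    int() raises ValueError if the suffix is not a list of integers.
    """
    base, sep, suffix = label.rpartition(marker)
    if not sep:  # marker not found
        return label, []
    return base, [int(index) for index in suffix.split(',')]


for label in ['hug-01~e.2', 'cell-line~e.6,7', 'y']:
    print(split_alignment(label))
```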

Regarding alignment of relations, I'm not sure either why both :polarity and - are aligned, but I do understand why some relations are aligned - for example, in the first AMR in the Bio AMR dev set, ARG2 is aligned to in because this preposition expresses the semantic relation represented by ARG2.

goodmami · 2018-08-03T18:28:42Z

@danielhers Ok so I sat down to look at this, and then I noticed that you had done a lot of work on the amr-hackathon implementation, so you've probably thought through some of the same problems. In general I kinda like how # ::alignments are separate from the graph, but since I parse the AMR as a graph and not a tree it is fragile (they have to be recalculated if you alter the structure or select a new top), so I'll need to map normalized (deinverted) triples to tokens (as you do) rather than tree positions to tokens. I'm not really satisfied with having separate maps for regular alignments and role alignments, but it might be the most practical step for now. One complication is that there are three ways to create a graph (Graph(), codec.triples_to_graph(), and codec.decode()), where the latter two will normalize triples but the first will not.

How do you want to specify alignments when you create a graph from triples? Something like this?

penman.Graph(
    [triple1, triple2, triple3],
    alignments={triple1: [2], triple2: [4,5], triple3: [1]},
    role_alignments={triple2: [3]}
)
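
To make the proposal a bit more concrete without relying on any actual penman API, here is a self-contained sketch of how two triple-keyed maps could be turned into ~e. suffixes. Everything in it is hypothetical: plain tuples stand in for triples, and suffix is an invented helper.

```python
# Hypothetical stand-ins: plain tuples play the role of triples, and the
# two dicts mirror the proposed alignments/role_alignments arguments.
triple1 = ('h', 'instance', 'hug-01')
triple2 = ('h', 'polarity', '-')
triple3 = ('y', 'instance', 'you')

alignments = {triple1: [2], triple2: [1], triple3: [3]}
role_alignments = {triple2: [1]}


def suffix(indices):
    """Format token indices as a ~e. suffix, or '' if there are none."""
    return '~e.' + ','.join(map(str, indices)) if indices else ''


for triple in (triple1, triple2, triple3):
    source, role, target = triple
    # Role alignments attach to the role, regular ones to the target.
    print(source,
          role + suffix(role_alignments.get(triple)),
          target + suffix(alignments.get(triple)))
```

Because tuples are hashable, the same triple can be a key in both maps, which is what keeps role alignments and regular alignments apart.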

danielhers · 2018-08-04T08:21:14Z

Nathan has done most of the work there, I just suggested some improvements. Yes, I think the API you suggested is appropriate. Of course it's already not much easier than appending the alignment suffixes to the relations and types manually, but this way is cleaner, and it can generate both suffixes and ::alignments comment perhaps, and it can support alignment of variables in reentrancy, which is also possible.



goodmami · 2018-08-06T20:29:23Z

@danielhers I've pushed an implementation to the 'alignments' branch. Can you try it out?

I allow it to read any of the three in-line formats (~1, ~e1, ~e.1) with one or more integer alignments. The general pattern is given by the following:

    ALIGNMENT_RE = re.compile(r'~([a-zA-Z]\.?)?(\d+(?:,\d+)*)\s*')
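
As a quick check of what this pattern accepts, the sketch below applies it to the three in-line formats. The pattern is copied from above; the helper read_alignment is invented for the example and is not part of penman.

```python
import re

# Pattern quoted from the comment above.
ALIGNMENT_RE = re.compile(r'~([a-zA-Z]\.?)?(\d+(?:,\d+)*)\s*')


def read_alignment(text):
    """Return (prefix, indices) for an alignment at the start of text, else None."""
    match = ALIGNMENT_RE.match(text)
    if match is None:
        return None
    prefix = match.group(1) or ''  # '', 'e', or 'e.' for the three formats
    return prefix, [int(i) for i in match.group(2).split(',')]


for text in ['~1', '~e1', '~e.1', '~e.6,7']:
    print(text, read_alignment(text))
```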

There are a number of limitations, currently:

It assumes 0-based token indices (even for ~1)
It only outputs the ~e. format, regardless of the input format
It doesn't read nor write the # ::alignments metadata (see Read metadata lines #23)
It now disallows ~ characters from appearing in relations and symbols, as it will confuse the pattern match for alignments
It does not print alignments on triples (e.g., triples=True in the API or --triples in the script), because commas are used as the delimiter for both alignments and triple parts (consider month(d, 12~e1,2,3)); it may be possible to do this unambiguously because the first argument inside the parentheses should never have an alignment (I think), but in any case I didn't include support for it just yet

This implementation should at least cover your use case. I have a simple unit test, and I checked some outputs on the auto-aligned Bio AMR corpus, but I'd like to know if you run into any problems. If all is good I'll merge it into develop and make a release soon after.

goodmami · 2018-08-09T23:10:58Z

I did a more thorough comparison using the Bio AMR corpus, and I noticed a problem: if a reentrancy had an alignment but on serialization the node gets expanded (with its concept, etc.) at that point, the alignment is printed after the branch ends:

(h / harbor-01~e.12     
      :ARG0 c3~e.11 
   ...

becomes

(h / harbor-01~e.12
      :ARG0 (c3 / cell-line~e.6,7
            :quant 6~e.4
            :mod (d / disease
                  :name (n2 / name
                        :op1 "CRC"~e.5)))~e.11
   ...

Unless I can fix #25, I think the best I can do is discard the alignments (e.11 above), since they are not in a valid location, and neither would ... :ARG0 (c3~e.11 / cell-line~e.6,7....

danielhers · 2018-08-10T05:41:41Z

I guess it wouldn't be that bad to drop the alignment if this phenomenon is rare.

I did start to try this out in https://github.com/danielhers/semstr/tree/alignments, but I still need to resolve issues unrelated to the new alignments feature (since I switched from amr-hackathon to penman for AMR reading too).

goodmami · 2018-08-10T16:50:06Z

I'd like to get a fix for #25 so that hand-annotated AMRs (including alignments) are losslessly round-tripped through the graph representation, but that wouldn't fix the case where you serialize a new AMR from triples. One solution is to consider alignments on the concept/node-type to be analyzed as alignments on the node. That is, when you have:

(a / abc :ARG0 (b / bcd~e.1) ...

The alignment is placed on (a, ARG0, b). Then, whichever variable is expanded to a full node gets the alignment serialized on the concept. This will complicate parsing a bit (since the alignment is not assigned to the node currently being parsed, but to the previous), and I'd have to special-case the root node (or include an explicit triple to indicate the root).

danielhers · 2018-08-10T17:05:37Z

This does make sense. Since role alignments are separate from node alignments, there wouldn't be an ambiguity between the two.

danielhers · 2018-08-13T19:54:24Z

Not sure it's related to the alignments at all, but I had an issue when trying to use penman to read AMRs in my branch. When reading a_pmid_2094_2929.48 from the BioAMR training set (which contains the triple f / figure~e.17 :mod "1A"~e.19), I was asking if "1A" in amr.variables() and the answer was True. I think this is because of the :mod edge, which is an inverted :domain. In a way, it is true that the triple ("1A", "domain", "f") exists, but that doesn't mean "1A" is a variable, while f is.

nschneid · 2018-08-13T23:37:12Z

Interesting. I was not aware that :mod "..." was allowed; see the above issue.

nschneid · 2018-08-13T23:42:58Z

Perhaps others will weigh in on this, but I think the simplest solution for now is to stipulate that only :mod relations between variables should be considered inverted :domain relations.
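
A sketch of what that stipulation could look like when normalizing triples is given below. It is only an illustration under the assumption that triples are (source, role, target) tuples and that the set of variables is known; normalize_mod is a hypothetical name, not penman's implementation.

```python
def normalize_mod(triple, variables):
    """Rewrite (x, 'mod', y) as (y, 'domain', x) only when y is a variable."""
    source, role, target = triple
    if role == 'mod' and target in variables:
        return (target, 'domain', source)
    return triple  # constants such as "1A" keep their :mod relation


variables = {'f', 'c'}
print(normalize_mod(('f', 'mod', '"1A"'), variables))  # constant target: unchanged
print(normalize_mod(('f', 'mod', 'c'), variables))     # variable target: inverted
```

With this rule the constant "1A" never ends up in the source position of a triple, so it would not be mistaken for a variable.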

goodmami · 2018-08-14T00:08:19Z

@nschneid thanks! I was just about to write up the issue at amrisi/amr-guidelines. I added some more info in #26.

> stipulate that only :mod relations between variables should be considered inverted :domain relations.

Ok, I can probably manage something like that. At least, I'll try to not normalize :mod triples to :domain ones, but maybe still allow them to be inverted on serialization.

danielhers · 2018-08-23T07:13:26Z

@goodmami after adding a workaround for #26, using penman with alignments for both reading and writing AMRs seems to work great!

goodmami · 2018-08-27T19:21:38Z

@danielhers thanks! I had been working on a more significant set of changes to parsing and serialization, but had to step away for a bit. I'll try to wrap those up, but in the meantime I'm glad you got something working. I'll take a look at your workarounds to see if they would be generally useful in Penman.

goodmami added a commit that referenced this issue Aug 6, 2018

Add support for surface alignments

37fad53

Resolves #19

goodmami added this to the v0.7.0 milestone Aug 6, 2018

goodmami mentioned this issue Aug 6, 2018

Read metadata lines #23

Closed

nschneid mentioned this issue Aug 13, 2018

:mod + constant in figure/table references: trouble with inverses amrisi/amr-guidelines#235

Open

goodmami mentioned this issue Aug 13, 2018

Inverted edges without a source variable #26

Closed

goodmami closed this as completed in d280077 Nov 21, 2019

goodmami mentioned this issue Apr 20, 2020

Position of a node #74

Closed

goodmami mentioned this issue Mar 8, 2021

Linking node back to input sentence. #98

Closed
