Support for alignment to text tokens #19
Comments
Hi Daniel, In principle I'm happy to support the surface alignments, but I haven't seen much description of them. Nathan Schneider's PEG has this:
which looks pretty free-form to me (
Finally, I'm confused as to what it means to align relations (edges) to tokens. E.g., in the example from the README,
But still, I have not seen these alignments in any gold data, papers about AMR, or the dozen or so AMR-related code projects I've browsed (besides the amr-hackathon one, of course). The only other notation for token alignments I've seen is the JAMR-style metadata string, e.g.:
@nschneid, can you please confirm if the more restrictive Thank you for any clarification. |
"ISI-style" because they are of the kind produced by the ISI aligner. The output of this aligner is included in the LDC release of AMR. However, I haven't used this code in a while, so I can't speak to the precise constraints on the output alignments. |
Thanks for the explanation and link! Somehow that paper didn't turn up in my search, so I must've been using the wrong search terms. I don't have the LDC data, but I could not find alignments in the sample or documentation (although the DEFT corpus claims to have alignments). Nevertheless, your pointers helped me turn up some more info. Here's what I found:
So now I have 3 in-graph alignment formats:
And two metadata formats:
It seems the last variants of both kinds of annotations are the more "official" ones. Also, none of these have character-spans as an option, only tokens, so the input must be tokenized (important both for English (e.g., "didn't" -> "did" + "n't") and for Chinese (which generally doesn't use spaces)), which means the tokenized string (e.g., |
Thank you so much for the attention and time spent looking into this! The official AMR data (LDC2017T10) contains both unaligned AMRs, and AMRs with alignments (automatically generated by the ISI aligner) in both formats you mentioned (the last variants), just the same as the freely-available Bio AMR Corpus. Here is the code I use to parse this suffix from Nathan's AMR data structure. Basically just stripping the Regarding alignment of relations, I'm not sure either why both |
@danielhers Ok so I sat down to look at this, and then I noticed that you had done a lot of work on the amr-hackathon implementation, so you've probably thought through some of the same problems. In general I kinda like how # ::alignments are separate from the graph, but since I parse the AMR as a graph and not a tree it is fragile (they have to be recalculated if you alter the structure or select a new top), so I'll need to map normalized (deinverted) triples to tokens (as you do) rather than tree positions to tokens. I'm not really satisfied with having separate maps for regular alignments and role alignments, but it might be the most practical step for now. One complication is that there are three ways to create a graph (Graph(), codec.triples_to_graph(), and codec.decode()), where the latter two will normalize triples but the first will not.
How do you want to specify alignments when you create a graph from triples? Something like this?
penman.Graph(
    [triple1, triple2, triple3],
    alignments={triple1: [2], triple2: [4,5], triple3: [1]},
    role_alignments={triple2: [3]}
) |
Nathan has done most of the work there, I just suggested some improvements.
Yes, I think the API you suggested is appropriate. Of course, it's not much easier than appending the alignment suffixes to the relations and types manually, but this way is cleaner, and it could perhaps generate both the suffixes and the ::alignments comment. It could also support alignment of variables in reentrancies, which is also possible. |
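To make the idea of keying alignments by triple (rather than by tree position) concrete, here is a minimal sketch. The `AlignedTriples` class, its `tokens_for` method, the example sentence, and the zero-based token indexing are all assumptions made for illustration. None of this is the penman API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (source, role, target)


@dataclass
class AlignedTriples:
    """Hypothetical container for illustration; NOT part of penman."""
    triples: List[Triple]
    # Node/concept alignments and role alignments are kept in separate
    # maps, mirroring the two keyword arguments proposed in the thread.
    alignments: Dict[Triple, List[int]] = field(default_factory=dict)
    role_alignments: Dict[Triple, List[int]] = field(default_factory=dict)

    def tokens_for(self, triple: Triple, tokens: List[str]) -> List[str]:
        """Return the tokens aligned to a triple (assumes 0-based indices)."""
        return [tokens[i] for i in self.alignments.get(triple, [])]


tokens = ['The', 'boy', 'wants', 'to', 'go']
t1 = ('w', 'instance', 'want-01')
t2 = ('w', 'ARG0', 'b')
t3 = ('b', 'instance', 'boy')

graph = AlignedTriples(
    [t1, t2, t3],
    alignments={t1: [2], t3: [1]},
)
# Reordering the triples does not invalidate the maps, because the
# keys are the triples themselves rather than positions in a tree.
reordered = AlignedTriples([t3, t2, t1], alignments=graph.alignments)
```

Because the maps are keyed by the triples themselves, `graph.tokens_for(t1, tokens)` and `reordered.tokens_for(t1, tokens)` give the same result, which is the property that tree-position alignments lack when the structure or top changes.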
@danielhers I've pushed an implementation to the 'alignments' branch. Can you try it out? I allow it to read any of the three in-line formats (
ALIGNMENT_RE = re.compile(r'~([a-zA-Z]\.?)?(\d+(?:,\d+)*)\s*')
There are a number of limitations, currently:
This implementation should at least cover your use case. I have a simple unit test, and I checked some outputs on the auto-aligned Bio AMR corpus, but I'd like to know if you run into any problems. If all is good I'll merge it into develop and make a release soon after |
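As a rough illustration of what the `ALIGNMENT_RE` pattern quoted above accepts, the following sketch splits an alignment suffix off a label. The pattern is copied from the comment; the helper `split_alignment` and the sample labels are hypothetical and are not taken from the 'alignments' branch.

```python
import re

# Pattern as quoted in the thread: '~', an optional one-letter prefix
# with an optional '.', then one or more comma-separated integers.
ALIGNMENT_RE = re.compile(r'~([a-zA-Z]\.?)?(\d+(?:,\d+)*)\s*')


def split_alignment(label):
    """Split a label into (bare label, prefix, token indices).

    Hypothetical helper for illustration only. Assumes the alignment
    suffix, if any, is the last thing in the label.
    """
    match = ALIGNMENT_RE.search(label)
    if match is None:
        return label, None, ()
    prefix = match.group(1)  # e.g. 'e.', 'e', or None
    indices = tuple(int(i) for i in match.group(2).split(','))
    return label[:match.start()], prefix, indices


with_prefix = split_alignment('want-01~e.2,3')
without_prefix = split_alignment('boy~1')
unaligned = split_alignment('boy')
```

Here `with_prefix` is `('want-01', 'e.', (2, 3))`, `without_prefix` is `('boy', None, (1,))`, and `unaligned` is `('boy', None, ())`.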
I did a more thorough comparison using the Bio AMR corpus, and I noticed a problem: if a reentrancy had an alignment but on serialization the node gets expanded (with its concept, etc.) at that point, the alignment is printed after the branch ends:
becomes
Unless I can fix #25, I think the best I can do is discard the alignments ( |
I guess it wouldn't be that bad to drop the alignment if this phenomenon is rare. I did start to try this out in https://github.com/danielhers/semstr/tree/alignments, but I still need to resolve issues unrelated to the new alignments feature (since I switched from amr-hackathon to penman for AMR reading too). |
I'd like to get a fix for #25 so that hand-annotated AMRs (including alignments) are losslessly round-tripped through the graph representation, but that wouldn't fix the case where you serialize a new AMR from triples. One solution is to consider alignments on the concept/node-type to be analyzed as alignments on the node. That is, when you have:
The alignment is placed on |
This does make sense. Since role alignments are separate from node alignments, there wouldn't be an ambiguity between the two. |
Not sure it's related to the alignments at all, but I had an issue when trying to use penman to read AMRs in my branch. When reading |
Interesting. I was not aware that |
Perhaps others will weigh in on this, but I think the simplest solution for now is to stipulate that only |
@nschneid thanks! I was just about to write up the issue at amrisi/amr-guidelines. I added some more info in #26.
Ok, I can probably manage something like that. At least, I'll try to not normalize |
@danielhers thanks! I had been working on a more significant set of changes to parsing and serialization, but had to step away for a bit. I'll try to wrap those up, but in the meantime I'm glad you got something working. I'll take a look at your workarounds to see if they would be generally useful in Penman. |
AMR supports annotations on nodes and edges, specifying alignment to one or more token indices. See an example in Nathan Schneider's AMR IO module: https://github.com/nschneid/amr-hackathon/blob/master/src/amr.py#L172
The module I mentioned supports reading AMRs, but building them from triples for the purpose of writing to file is more convenient with the penman package - except that it doesn't support alignments.
A simple solution would be to allow a suffix to a node/edge label containing "~e." and a number.
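A minimal sketch of that suffix idea, assuming multiple token indices are joined with commas. The function name `with_alignment` and its `prefix` parameter are hypothetical and do not refer to anything in penman.

```python
def with_alignment(label, indices, prefix='e.'):
    """Append an alignment suffix such as '~e.2' to a node or edge label.

    Hypothetical helper for illustration only. Labels with no aligned
    tokens are returned unchanged.
    """
    if not indices:
        return label
    suffix = ','.join(str(i) for i in indices)
    return '{}~{}{}'.format(label, prefix, suffix)


node_label = with_alignment('want-01', [2])
edge_label = with_alignment(':ARG0', [4, 5])
plain_label = with_alignment('boy', [])
```

This yields `want-01~e.2` for the node label and `:ARG0~e.4,5` for the edge label, while `boy` is left as it is.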