Layout engine may introduce some diffs #25

Closed
goodmami opened this issue Aug 9, 2018 · 4 comments

@goodmami
Owner

goodmami commented Aug 9, 2018

One goal of the project is to model PENMAN structures as graphs while retaining enough information from their serialization that the tree structure doesn't change on reserialization. Here is an example from the Bio-AMR corpus where a diff is introduced:

(e / enhance-01~e.11 :li 2~e.0 
      :ARG1 (a3 / and~e.6 
            :op1 (n6 / nucleic-acid 
                  :name (n / name :op1 "mRNA"~e.5) 
                  :ARG0-of (e2 / encode-01 
                        :ARG1 p)) 
            :op2 (p / protein~e.7 
                  :name (n2 / name :op1 "serpinE2"~e.4))) 
      :manner~e.10 (m / marked~e.10) 
      :mod (a2 / also~e.9) 
      :location~e.12 (c / cell~e.15 
            :ARG0-of (e3 / exhibit-01~e.16 
                  :ARG1 (m2 / mutate-01~e.17 
                        :ARG1 (a4 / and~e.22 
                              :op1 (g / gene 
                                    :name (n4 / name :op1 "KRAS"~e.20)) 
                              :op2 (g2 / gene 
                                    :name (n5 / name :op1 "BRAF"~e.24))))) 
            :mod (h / human~e.13) 
            :mod (d / disease 
                  :name (n3 / name :op1 "CRC"~e.14))) 
      :manner~e.2 (i / interesting~e.2))

Here is what is produced (with whitespace differences normalized):

(e / enhance-01~e.11 :li 2~e.0
      :ARG1 (a3 / and~e.6
            :op1 (n6 / nucleic-acid
                  :name (n / name :op1 "mRNA"~e.5)
                  :ARG0-of (e2 / encode-01
                        :ARG1 (p / protein~e.7
                              :name (n2 / name :op1 "serpinE2"~e.4))))
            :op2 p)
      :manner~e.10 (m / marked~e.10)
      :mod (a2 / also~e.9)
      :location~e.12 (c / cell~e.15
            :ARG0-of (e3 / exhibit-01~e.16
                  :ARG1 (m2 / mutate-01~e.17
                        :ARG1 (a4 / and~e.22
                              :op1 (g / gene
                                    :name (n4 / name :op1 "KRAS"~e.20))
                              :op2 (g2 / gene
                                    :name (n5 / name :op1 "BRAF"~e.24)))))
            :mod (h / human~e.13)
            :mod (d / disease
                  :name (n3 / name :op1 "CRC"~e.14)))
      :manner~e.2 (i / interesting~e.2))

Note how the reentrancy of the p node is reversed: p is now expanded under e2 and appears as a bare variable under a3. The layout engine prefers edges to appear in their original orientation, but in this case they already do, so that preference does not prevent the change. I could possibly prefer reentrancies to start from deeper nestings, or maybe I could embed some info about reentrancy in the triple (as I do with inversion).
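
To illustrate the effect, here is a minimal sketch of a toy depth-first serializer. It is not Penman's actual layout engine: the `layout` function and its `expand_shallowest` flag are hypothetical, the triples are a hand-reduced version of the graph above with alignments dropped, and `:ARG0-of` is kept as a plain edge label for simplicity. By default a variable is expanded the first time it is reached, which reproduces the reversal. With `expand_shallowest=True` it is expanded only at its shallowest occurrence, which is one possible reading of "prefer reentrancies to start from deeper nestings".

```python
from collections import deque

# Reduced, hand-written triples for the (a3 / and ...) subgraph above.
TRIPLES = [
    ("a3", ":instance", "and"),
    ("a3", ":op1", "n6"),
    ("n6", ":instance", "nucleic-acid"),
    ("n6", ":ARG0-of", "e2"),
    ("e2", ":instance", "encode-01"),
    ("e2", ":ARG1", "p"),
    ("a3", ":op2", "p"),
    ("p", ":instance", "protein"),
]


def layout(triples, top, expand_shallowest=False):
    """Serialize triples to a one-line PENMAN-like string (toy version)."""
    concepts = {s: t for s, r, t in triples if r == ":instance"}
    edges = {}
    for s, r, t in triples:
        if r != ":instance":
            edges.setdefault(s, []).append((r, t))

    # Breadth-first pass: shallowest depth at which each variable is reachable.
    depth = {top: 0}
    queue = deque([top])
    while queue:
        var = queue.popleft()
        for _, tgt in edges.get(var, []):
            if tgt in concepts and tgt not in depth:
                depth[tgt] = depth[var] + 1
                queue.append(tgt)

    seen = set()

    def visit(var, d):
        if var not in concepts or var in seen:
            return var  # constant, or variable already expanded elsewhere
        if expand_shallowest and d > depth[var]:
            return var  # leave a reentrancy here; expand at the shallower site
        seen.add(var)
        parts = [f"{var} / {concepts[var]}"]
        for role, tgt in edges.get(var, []):
            parts.append(f"{role} {visit(tgt, d + 1)}")
        return "(" + " ".join(parts) + ")"

    return visit(top, 0)


print(layout(TRIPLES, "a3"))                          # p expanded under e2
print(layout(TRIPLES, "a3", expand_shallowest=True))  # p expanded under a3
```

The depth-based rule happens to recover the original annotation for this graph, but it is only a heuristic and would still change graphs where the annotator expanded a node at the deeper site.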

@danielhers
Contributor

Note that in general, I think a rule of thumb is that in coreference or predicate conjunction or gapping, a variable is expanded where it is mentioned explicitly, and appears as a reentrancy where a pronoun is used or the argument is elided. So in the case of

In the panel of six CRC cell lines , all of them harboured a <i> KRAS </i> gene mutation that was located in codon 12 or 13 .

It makes sense to expand the variable c3 where it refers to the cell lines in the panel, and to use a reentrancy where it is referred to as "them", the argument of harboured.

As another example from the guidelines, in

The boy arrived and left on Tuesday.

boy is expanded as an argument of arrived but appears as a reentrancy as an argument of left (where it is elided).

I don't know how easy it is to take these issues into account, though.

@goodmami
Owner Author

Thanks for explaining. Those are good guidelines for hand-annotation, but I don't think they would help for serializing from triples, since we don't know the surface form. It may be possible to use the alignments and the ::tok annotation, if available, but as they are optional metadata and not part of the graph, it seems like a bad direction to go.
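
To make the limitation concrete, here is a small sketch under stated assumptions: the two triple lists are hand-written from a reduced form of the serializations above (alignments dropped, `:ARG0-of` kept as a plain label), and `expansion_site` is a hypothetical helper, not a Penman function. As sets the two lists are identical, so an unordered set of triples carries no record of where p was expanded. If the order of the triples is preserved, the position of the instance triple is enough to tell the two layouts apart.

```python
# Triples in the order they are read off the original annotation.
ORIGINAL = [
    ("a3", ":instance", "and"),
    ("a3", ":op1", "n6"),
    ("n6", ":instance", "nucleic-acid"),
    ("n6", ":ARG0-of", "e2"),
    ("e2", ":instance", "encode-01"),
    ("e2", ":ARG1", "p"),            # bare reentrancy
    ("a3", ":op2", "p"),
    ("p", ":instance", "protein"),   # p expanded under a3
]

# Triples in the order they are read off the reserialized output.
RESERIALIZED = [
    ("a3", ":instance", "and"),
    ("a3", ":op1", "n6"),
    ("n6", ":instance", "nucleic-acid"),
    ("n6", ":ARG0-of", "e2"),
    ("e2", ":instance", "encode-01"),
    ("e2", ":ARG1", "p"),
    ("p", ":instance", "protein"),   # p expanded under e2
    ("a3", ":op2", "p"),             # bare reentrancy
]


def expansion_site(triples, var):
    """Return the edge triple that immediately precedes var's instance triple.

    In a PENMAN string, "(var / concept" directly follows the role that
    introduces the node, so this edge is where var was expanded.
    Returns None for the top node or if var has no instance triple.
    """
    previous = None
    for triple in triples:
        source, role, _ = triple
        if source == var and role == ":instance":
            return previous
        previous = triple
    return None


print(set(ORIGINAL) == set(RESERIALIZED))  # same graph as a set of triples
print(expansion_site(ORIGINAL, "p"))
print(expansion_site(RESERIALIZED, "p"))
```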

As an aside, for the harboured example, it sounds like you're arguing for the rearranged output of the Penman module rather than the original annotation, but if the module's output is better it's surely just by chance. I would, however, like Penman to allow deterministic restructuring for normalization, which could help ML models trained on AMR by reducing unnecessary (?) variation.
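
As a rough idea of what deterministic restructuring could look like, here is a sketch that sorts each node's outgoing edges before a depth-first layout, so any ordering of the same triples serializes to the same string. The `canonical_layout` function is hypothetical and is not part of Penman. Sorting by role is just one arbitrary canonical order, and the triples are again a hand-reduced version of the example.

```python
import random

TRIPLES = [
    ("a3", ":instance", "and"),
    ("a3", ":op1", "n6"),
    ("n6", ":instance", "nucleic-acid"),
    ("n6", ":ARG0-of", "e2"),
    ("e2", ":instance", "encode-01"),
    ("e2", ":ARG1", "p"),
    ("a3", ":op2", "p"),
    ("p", ":instance", "protein"),
]


def canonical_layout(triples, top):
    """Serialize triples so the result does not depend on their input order."""
    concepts = {s: t for s, r, t in triples if r == ":instance"}
    edges = {}
    for s, r, t in triples:
        if r != ":instance":
            edges.setdefault(s, []).append((r, t))
    seen = set()

    def visit(var):
        if var not in concepts or var in seen:
            return var  # constant or reentrancy
        seen.add(var)
        parts = [f"{var} / {concepts[var]}"]
        # Sorting by (role, target) fixes the traversal order, and with it
        # the site at which each variable is expanded.
        for role, tgt in sorted(edges.get(var, [])):
            parts.append(f"{role} {visit(tgt)}")
        return "(" + " ".join(parts) + ")"

    return visit(top)


shuffled = list(TRIPLES)
random.Random(0).shuffle(shuffled)
print(canonical_layout(TRIPLES, "a3") == canonical_layout(shuffled, "a3"))
```

A normal form like this does not try to match the annotator's choice. For this graph it expands p under e2, as in the rearranged output above.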

@danielhers
Contributor

I'm "arguing" for the original annotation, actually. In a_pmid_2256_9000.150 the ARG0 of harbor-01 is a reentrancy (corresponding to them, the subject of harboured).

@goodmami
Owner Author

Oh, my mistake. I was misreading the graph. Thanks for pointing that out.

goodmami added this to the v0.7.0 milestone Nov 5, 2019