rleaman edited this page Nov 20, 2015 · 18 revisions

BLAH2 progress documentation

Design of annotation system for semi-structured (XML and SVG) documents in PubAnnotation

https://github.com/linkedannotation/BLAH2/wiki/Enabling-annotation-of-semi-structured-(XML)-documents-using-PubAnnotation

GATE2PubAnnotation

Integrating corpora into PubAnnotation

MetaMap2PubAnnotation

Annotation Tool Interoperability with PubAnnotation

PubAnnotation quality control & evaluation

https://github.com/linkedannotation/BLAH2/wiki/PubAnnotationQualityControlEval

Semantic Integration of Lexical Resources

  • Explored use of pairwise learning to rank in an active learning setting for learning a semantic model for whether two names are synonymous.
  • The idea is to see if a reusable tool employing this approach would be enough support to semi-automatically integrate large lexicons.
  • Initial focus was disease entities from the MEDIC vocabulary by CTD (which has 41,000 unique names) and a large subset of the UMLS Metathesaurus, including the Medcin, SNOMED-CT and Consumer Health vocabularies (with 345,000 unique names).
  • The implementation is in Java and written to be as generic as possible. Additional lexicons can be handled by implementing only a loader class. Names are sent through a processing pipeline that can be configured differently for other entity types (e.g. chemicals, genes).
  • Previous work with pairwise learning to rank (DNorm) trained with stochastic gradient descent, which is not feasible in this context. Instead we use the margin-infused relaxed algorithm (MIRA), which is incremental and online.
  • Introducing an additional feature for the cosine similarity frees the high-dimensional pairwise matrix from modeling basic similarity calculations, greatly speeding up learning.
  • Performance is high: it takes about 8 ms to find and classify the closest synonyms for each name. Training for each batch of 10 names averages less than 10 ms, with the initial batch taking about 100 ms.
  • There appears to be a significant amount of synonymy remaining in the focus lexicons; initial evaluation suggests at least several thousand candidate synonyms remain to be evaluated.
  • Approach seems to be a successful proof of concept, with the method quickly learning the appropriate weights for contrastive and non-contrastive variations it has seen.
  • Like an annotation project, guidelines will be needed for what constitutes a synonym.
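
The scoring and update steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the class name `SynonymRankSketch`, the dense arrays, and the cap parameter `c` are all assumptions made for the example. It scores a pair of name vectors as a weighted cosine similarity plus a pairwise matrix term, and applies a single-constraint MIRA-style update (in this form equivalent to a passive-aggressive step) whenever a known synonym does not outrank a non-synonym by a margin of 1.

```java
/** Hypothetical sketch of pairwise ranking with a cosine feature and a MIRA-style update. */
class SynonymRankSketch {
    final double[][] w;  // pairwise weight matrix, starts at zero
    double wCos = 0.0;   // weight of the separate cosine-similarity feature
    final double c;      // cap on the step size of each update (assumed parameter)

    SynonymRankSketch(int dim, double c) {
        this.w = new double[dim][dim];
        this.c = c;
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double cosine(double[] a, double[] b) {
        double na = Math.sqrt(dot(a, a)), nb = Math.sqrt(dot(b, b));
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot(a, b) / (na * nb);
    }

    /** score(m, n) = wCos * cos(m, n) + m^T W n */
    double score(double[] m, double[] n) {
        double s = wCos * cosine(m, n);
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < n.length; j++)
                s += m[i] * w[i][j] * n[j];
        return s;
    }

    /** One online update for (name, synonym, non-synonym); returns the hinge loss before the update. */
    double update(double[] m, double[] pos, double[] neg) {
        double loss = Math.max(0.0, 1.0 - (score(m, pos) - score(m, neg)));
        if (loss == 0.0) return 0.0;  // margin already satisfied, no change
        double dCos = cosine(m, pos) - cosine(m, neg);
        double[] d = new double[pos.length];
        for (int j = 0; j < d.length; j++) d[j] = pos[j] - neg[j];
        // Squared norm of the feature difference: cosine part plus outer product m * d^T.
        double normSq = dCos * dCos + dot(m, m) * dot(d, d);
        if (normSq == 0.0) return loss;  // identical candidates, nothing to learn
        double tau = Math.min(c, loss / normSq);
        wCos += tau * dCos;
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < d.length; j++)
                w[i][j] += tau * m[i] * d[j];
        return loss;
    }

    public static void main(String[] args) {
        SynonymRankSketch model = new SynonymRankSketch(3, 1.0);
        double[] name = {1, 0, 1}, synonym = {1, 0, 0}, other = {0, 1, 0};
        model.update(name, synonym, other);
        System.out.println(model.score(name, synonym) - model.score(name, other));
    }
}
```

Because each update touches only the weights involved in one triple, the model can be revised after every small batch of judgments, which is the property that makes an incremental, online learner attractive in an active learning loop. A real implementation would presumably use sparse vectors rather than the dense arrays shown here.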