BLAH2 documentation on progress
Design of annotation system for semi-structured (XML and SVG) documents in PubAnnotation
- Documentation page http://nikolamilosevic86.github.io/GATE2PubAnnotation/
- GitHub repo https://github.com/nikolamilosevic86/GATE2PubAnnotation
Integrating corpora into PubAnnotation
Annotation Tool Interoperability with PubAnnotation
- GitHub link: https://github.com/linkedannotation/BLAH2/wiki/AnnotationToolInteroperabilityWithPubAnnotation
PubAnnotation quality control & evaluation
Semantic Integration of Lexical Resources
- Explored use of pairwise learning to rank in an active learning setting for learning a semantic model for whether two names are synonymous.
- The idea is to see if a reusable tool employing this approach would be enough support to semi-automatically integrate large lexicons.
- Initial focus was on disease entities from the MEDIC vocabulary by CTD (which has 41,000 unique names) and a large subset of the UMLS Metathesaurus, including the Medcin, SNOMED-CT and Consumer Health vocabularies (with 345,000 unique names).
- The implementation is in Java and written to be as generic as possible. Additional lexicons can be handled by implementing only a loader class. Names are sent through a processing pipeline that can be configured differently for other entity types (e.g. chemicals, genes).
- Previous work with pairwise learning to rank (DNorm) trained with stochastic gradient descent, which is not feasible in this context. Instead we use the margin-infused relaxed algorithm (MIRA), which is incremental and online.
- Introducing an additional feature for the cosine similarity frees the high-dimensional pairwise matrix from modeling basic similarity calculations, greatly speeding up learning.
- The system is fast: it takes about 8 ms to find and classify the closest synonyms for each name. Training for each batch of 10 names averages less than 10 ms, with the initial batch taking about 100 ms.
- There appears to be a significant amount of synonymy remaining in the focus lexicons; initial evaluation reveals at least several thousand candidates to evaluate.
- Approach seems to be a successful proof of concept, with the method quickly learning the appropriate weights for contrastive and non-contrastive variations it has seen.
- Like an annotation project, guidelines will be needed for what constitutes a synonym.
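
The combination of a pairwise weight matrix, an explicit cosine-similarity feature, and an online margin-based update described above can be sketched roughly as follows. This is an illustration only, not the project's code: the class name `PairwiseSynonymRanker`, its methods, the dense array representation, and the initial cosine weight of 1.0 are all assumptions, and the update shown is a simplified single-constraint MIRA-style step (the same form as a passive-aggressive update), which may differ from the variant actually used.

```java
/**
 * Hypothetical sketch of a pairwise ranker with an explicit cosine feature.
 * Names are assumed to be already converted to fixed-length token vectors.
 */
public class PairwiseSynonymRanker {
    private final double[][] w;      // pairwise token-interaction weights, start at zero
    private double cosWeight = 1.0;  // assumption: start by ranking on cosine alone
    private final double maxStep;    // cap on the step size of a single update

    public PairwiseSynonymRanker(int dim, double maxStep) {
        this.w = new double[dim][dim];
        this.maxStep = maxStep;
    }

    /** Cosine similarity of two vectors; 0 if either vector is all zeros. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** score = cosWeight * cos(query, candidate) + query^T W candidate */
    public double score(double[] query, double[] candidate) {
        double s = cosWeight * cosine(query, candidate);
        for (int i = 0; i < query.length; i++) {
            if (query[i] == 0.0) continue;
            for (int j = 0; j < candidate.length; j++) {
                s += query[i] * w[i][j] * candidate[j];
            }
        }
        return s;
    }

    /**
     * One online update: the positive candidate should outscore the negative
     * one by a margin of 1. Returns the hinge loss measured before the update.
     */
    public double update(double[] query, double[] positive, double[] negative) {
        double loss = 1.0 - score(query, positive) + score(query, negative);
        if (loss <= 0.0) return 0.0; // margin already satisfied, no change

        // Gradient of (score(positive) - score(negative)) w.r.t. the parameters.
        double dCos = cosine(query, positive) - cosine(query, negative);
        double normSq = dCos * dCos;
        for (int i = 0; i < query.length; i++) {
            for (int j = 0; j < positive.length; j++) {
                double g = query[i] * (positive[j] - negative[j]);
                normSq += g * g;
            }
        }
        if (normSq == 0.0) return loss; // candidates are indistinguishable

        // Smallest step that fixes the violation, capped at maxStep.
        double tau = Math.min(maxStep, loss / normSq);
        cosWeight += tau * dCos;
        for (int i = 0; i < query.length; i++) {
            for (int j = 0; j < positive.length; j++) {
                w[i][j] += tau * query[i] * (positive[j] - negative[j]);
            }
        }
        return loss;
    }
}
```

In this sketch the cosine feature carries the basic lexical overlap, so the matrix only has to learn the residual associations between differing tokens. For example, with a toy vocabulary of (kidney, renal, failure, heart), a single update on the query "kidney failure" with "renal failure" as the positive and "heart failure" as the negative is enough to separate the two candidates, even though their cosine similarity to the query is identical.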