- Authors: Márcia Barros, Pedro Ruas, Diana Sousa, and Francisco M. Couto
- E-mail addresses: (mcbarros, psruas, dfsousa, fjcouto)@fc.ul.pt
- Institution: LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal
- Retrieval of a sample of COVID-related articles;
- Automatic annotation of the articles;
- Manual correction of the automatic annotations in a subset of the articles;
- Expansion of the annotations set in the subset;
- Automatic evaluation of the recommendation dataset;
- Manual evaluation of the recommendation dataset.
Initial meeting to present projects.
Sample upload of the 2. NER/NEL datasets (in EN and PT) and the 3. RE dataset (in EN) to PubAnnotation, in the collection LASIGE: Annotating a multilingual COVID-19-related corpus for BLAH7, here.
The 4. recommender system datasets (in EN and PT) are available here on GitHub.
Discussion of crowd validation pipeline for the available datasets.
1. Download the pubannotation_deliverables/ directory available on this GitHub page;
2. Go to PubAnnotation Projects and click create:
   - Give your project an intuitive name, such as ENG_RE_<yourname/username> or ENG_NER_NEL_<yourname/username>;
   - Choose the CC-BY License;
   - Set the Status to Developing;
   - Finally, click Create project.
3. Under the Annotations headline click Upload Annotations:
   - Select Choose File and choose one of the available deliverables (that you downloaded in step 1):
     - en_entities_pubannotation.tar.gz for English NER and NEL (20 documents);
     - en_relations_pubannotation.tar.gz for English NER, NEL, and RE (20 documents);
     - pt_entities_pubannotation.tar.gz for Portuguese NER and NEL (20 documents).
   - Click Upload and confirm that the following message pops up: "The task, 'Upload annotations', is created.";
   - Go to the previous page and refresh it (it can take a minute to update). Confirm that there is now a number next to the Annotations headline.
4. Click on the number 20 under the Documents headline:
   - Click on one document:
     - Next to the Annotations headline, click the TextAE option;
     - You can now delete or correct the existing annotations and relations, or add new ones.
   - Repeat for all documents.
5. To validate the annotations you can check the respective vocabularies:
6. After finishing the validation, you are ready to deliver the altered files. Go back to the main project page and, under the Annotations headline, click Create a downloadable file. Please share it with the team!
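Before sharing the downloaded file, a quick consistency check on the corrected annotations can catch accidental edits. The following is a minimal sketch, assuming each document in the download is a JSON object in the PubAnnotation annotation format (`text`, `denotations` with `span.begin`/`span.end`, and `relations` with `subj`/`obj`). The function name `check_annotations`, the predicate label, and the placeholder identifiers are illustrative and not part of the project.

```python
def check_annotations(doc):
    """Return a list of problems found in one PubAnnotation-style document."""
    problems = []
    text = doc.get("text", "")
    denotation_ids = set()

    # Every entity span must lie inside the document text.
    for den in doc.get("denotations", []):
        denotation_ids.add(den.get("id"))
        span = den.get("span", {})
        begin, end = span.get("begin", -1), span.get("end", -1)
        if not (0 <= begin < end <= len(text)):
            problems.append(f"{den.get('id')}: span {begin}-{end} is outside the text")

    # Every relation must connect two entities that still exist.
    for rel in doc.get("relations", []):
        for role in ("subj", "obj"):
            if rel.get(role) not in denotation_ids:
                problems.append(
                    f"{rel.get('id')}: {role} {rel.get(role)!r} is not a known denotation"
                )
    return problems


# Illustrative document; the identifiers and the predicate are placeholders.
example = {
    "text": "Fever is a symptom of COVID-19.",
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "PLACEHOLDER_ID_1"},
        {"id": "T2", "span": {"begin": 22, "end": 30}, "obj": "PLACEHOLDER_ID_2"},
    ],
    "relations": [
        {"id": "R1", "pred": "placeholder_relation", "subj": "T1", "obj": "T2"}
    ],
}

print(check_annotations(example))
```

For a real file, the same function could be applied to the result of `json.load` on each extracted document; an empty list means no problems were detected by these two checks.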
Final meeting to present project progress.
The global motivation is the creation of parallel multilingual datasets for text mining systems applied to COVID-19-related literature. Tracking the most recent advances in COVID-19-related research is essential given the novelty of the disease and its impact on society. However, the pace of publication calls for automatic approaches to access and organize the knowledge produced every day. Text mining pipelines can assist in that task, but developing them requires evaluation datasets. There is a lack of COVID-19-related datasets, and the shortage is even greater for languages other than English. The expected contribution of the project is the annotation of a multilingual parallel dataset (EN-ES and EN-PT), provided to the community to improve text mining research on COVID-19-related literature.
Find the video presentation of the project here.
We start by generating a silver standard by applying our COVID text mining pipeline, which includes three modules: entity extraction, relation extraction, and recommender system. The entity extraction module recognizes disease, chemical, and anatomical entities and links them to their respective MeSH identifiers. The relation extraction module recognizes candidate relations between those entities. We are particularly interested in negative relations, where there is already evidence of no association, since they keep researchers from pursuing already refuted hypotheses and help them focus their research. The recommender system module creates a dataset of (user, item, rating) triples, where the users are the authors of the research documents related to COVID-19, the items are the entities extracted in the entity extraction phase, and the ratings are the number of articles in which the author mentioned the entity. Lastly, we manually correct the annotations in a selected subset of the silver standard and add missing annotations.
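The rating definition above can be illustrated with a short sketch. It assumes each article is already reduced to a list of authors and a list of extracted entities; the input layout, the function name `build_ratings`, and the sample values are hypothetical and do not reproduce the project's actual code.

```python
from collections import Counter


def build_ratings(articles):
    """Count, for each (author, entity) pair, the articles in which they co-occur."""
    counts = Counter()
    for article in articles:
        # set() ensures an entity mentioned several times in one article
        # contributes a single count for that article.
        for author in set(article["authors"]):
            for entity in set(article["entities"]):
                counts[(author, entity)] += 1
    # One (user, item, rating) triple per pair, sorted for stable output.
    return sorted((user, item, rating) for (user, item), rating in counts.items())


# Hypothetical input: two articles with placeholder authors and entities.
articles = [
    {"authors": ["author_a", "author_b"], "entities": ["entity_x", "entity_y", "entity_x"]},
    {"authors": ["author_a"], "entities": ["entity_x"]},
]

for triple in build_ratings(articles):
    print(triple)
```

Here `author_a` receives a rating of 2 for `entity_x` because the entity appears in two of that author's articles, while the repeated mention inside the first article is counted once.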
The first part consists of document retrieval, using one of two approaches:
- Download the .tsv file with LitCovid citations. Take the PMIDs in the file, retrieve the respective abstracts from PubMed using the available API, and keep only the articles that have both English and Spanish abstracts (or both English and Portuguese).
Or instead:
- Retrieve PubMed articles with the available API using the search profile: new coronavirus* OR novel coronavirus* OR ncov OR sars-cov OR covid* OR cov-2 OR cov-19, and then keep only the articles that have both English and Spanish abstracts (or both English and Portuguese).
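The second approach could be sketched as follows, assuming the "available API" is NCBI E-utilities (the `esearch.fcgi` endpoint with the `db`, `term`, `retmax`, and `retmode` parameters). The record layout used for the language filter (an `abstracts` dictionary keyed by language code) and the helper names are assumptions made for illustration. The sketch only builds the request URL and filters already-parsed records; it does not perform the network call.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Search profile taken from the text above.
SEARCH_PROFILE = (
    "new coronavirus* OR novel coronavirus* OR ncov OR sars-cov "
    "OR covid* OR cov-2 OR cov-19"
)


def esearch_url(term=SEARCH_PROFILE, retmax=100):
    """Build an ESearch URL that returns matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS + "esearch.fcgi?" + urlencode(params)


def filter_parallel(records, second_language):
    """Keep records that have both an English abstract and one in second_language.

    Each record is assumed (hypothetically) to look like
    {"pmid": "...", "abstracts": {"en": "...", "es": "..."}}.
    """
    kept = []
    for record in records:
        abstracts = record.get("abstracts", {})
        if abstracts.get("en") and abstracts.get(second_language):
            kept.append(record)
    return kept


# Placeholder records, not real PubMed entries.
records = [
    {"pmid": "PMID_1", "abstracts": {"en": "English text", "es": "Texto en español"}},
    {"pmid": "PMID_2", "abstracts": {"en": "English text"}},
    {"pmid": "PMID_3", "abstracts": {"en": "English text", "pt": "Texto em português"}},
]

print(esearch_url(retmax=20))
print([r["pmid"] for r in filter_parallel(records, "es")])
```

Keeping URL construction and filtering as separate pure functions makes it possible to test the filter on local data before querying PubMed.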
The second part consists of the application of the COVID-19 text mining pipeline. The entity extraction module performs NER by applying the MER tool, which can recognize Disease, Chemical, and Anatomy entities and link them to the respective vocabulary identifiers. On English texts, the recognized entities will be linked to MeSH identifiers using the CTD Disease (MEDIC), CTD Chemical, and CTD Anatomy vocabularies or, alternatively, to the Coronavirus Infectious Disease Ontology (CIDO). On Spanish/Portuguese texts, the recognized entities will be linked to MeSH identifiers through the DeCS vocabulary.
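To make the idea of lexicon-based recognition and linking concrete, here is a minimal Python sketch. It is not MER itself (MER is a shell-script tool with its own matching strategy); it only illustrates the general technique of matching vocabulary terms in text and attaching an identifier to each match. The lexicon entries and identifiers below are placeholders, not real vocabulary data.

```python
import re


def annotate(text, lexicon):
    """Return sorted (begin, end, matched_text, identifier) tuples for lexicon terms."""
    annotations = []
    # Longer terms are tried first so that a multi-word term takes
    # precedence over shorter terms contained in it.
    for term in sorted(lexicon, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            overlaps = any(
                match.start() < end and begin < match.end()
                for begin, end, _, _ in annotations
            )
            if not overlaps:
                annotations.append(
                    (match.start(), match.end(), match.group(0), lexicon[term])
                )
    return sorted(annotations)


# Placeholder lexicon: term -> identifier (identifiers are NOT real MeSH IDs).
lexicon = {
    "respiratory distress syndrome": "PLACEHOLDER:1",
    "lung": "PLACEHOLDER:2",
    "syndrome": "PLACEHOLDER:3",
}

text = "Acute respiratory distress syndrome affects the lung."
for begin, end, mention, identifier in annotate(text, lexicon):
    print(begin, end, mention, identifier)
```

In this example the shorter term "syndrome" is not reported separately because it overlaps the longer match. For Spanish or Portuguese texts, the same approach would apply with a lexicon built from the corresponding vocabulary.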
The third part is the relation extraction module, which performs RE by applying the BiOnt system, built to extract relations between multiple biomedical entities with the support of ontologies. Using the MeSH identifiers or the CIDO terms linked to the recognized Disease, Chemical, and Anatomy entities, the BiOnt system can identify relations between those entities, provided we can use models pre-trained on available training data. Additionally, the BiOnt system must be adapted to allow the identification of negative relations.
Finally, in the fourth part, the recommender system dataset is created through the LIBRETTI methodology, which was developed to create scientific recommendation datasets by extracting implicit feedback from research literature. This dataset allows the recommendation of COVID-19-related entities of interest to a researcher, which could otherwise be lost among the large number of entities in the literature.
Subset of annotations in the corpus for crowd evaluation.
The recommendation dataset is evaluated in two phases: first automatically, by testing the dataset before and after the curation carried out in the previous phases (NER and RE), and then manually, with experts checking whether the recommendations suit the users, according to their previous preferences.
- Retrieval of a sample of COVID-related articles;
- Automatic annotation of the articles;
- Manual correction of the automatic annotations in a subset of the articles;
- Expansion of the annotations set in the subset;
- Automatic evaluation of the recommendation dataset;
- Manual evaluation of the recommendation dataset.
- Barros, M. A., Lamurias, A., Sousa, D., Ruas, P., & Couto, F. M. (2020). COVID-19: A Semantic-Based Pipeline for Recommending Biomedical Entities. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020.
- Barros, M., Moitinho, A., & Couto, F. M. (2019). Using research literature to generate datasets of implicit feedback for recommending scientific items. IEEE Access, 7, 176668-176680.
- Sousa, D., Lamurias, A., & Couto, F. M. (2020). A Hybrid Approach toward Biomedical Relation Extraction Training Corpora: Combining Distant Supervision with Crowdsourcing. Database 2020.
- Sousa, D., Lamurias, A., & Couto, F. M. (2020). Improving accessibility and distinction between negative results in biomedical relation extraction. Genomics & Informatics, 18(2).
- Sousa, D., Lamurias, A., & Couto, F. M. (2019). A Silver Standard Corpus of Human Phenotype-Gene Relations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1487–1492.
- Couto, F., & Lamurias, A. (2018). MER: a Shell Script and Annotation Server for Minimal Named Entity Recognition and Linking. Journal of Cheminformatics, 10:58.