Ping Lab Intern Project, Summer 2022

Title:

Building a Named Entity Recognition model by fine-tuning BioBERT - a data engineering approach to cardiovascular medicine

Details:

This project applies data engineering to cardiovascular documents, transforming raw text into a mapping of named entities through transfer learning [1][2]. The document-to-entity mapping can then be used to design knowledge graphs and machine learning models. To perform Named Entity Recognition (74 biomedical entity tags, e.g., anatomy, biomolecules, chemicals) on Cardiovascular Disease (CVD) documents from PubMed, we propose to build an NER model on top of the pre-trained BioBERT language model.

In brief, we start from the pre-trained BioBERT model and fine-tune it for NER with BioNLP data. The fine-tuned model is then used to identify biomedical entities in CVD documents.
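As a preview of the end product, the following is a minimal sketch of how the fine-tuned model could be applied to a CVD sentence with the Hugging Face pipeline API. The checkpoint path is a placeholder for the model produced by the fine-tuning described in this README.

```python
# Minimal sketch: tagging a CVD sentence with a fine-tuned BioBERT NER model.
# "path/to/fine-tuned-biobert-ner" is a placeholder for the checkpoint produced
# by the fine-tuning step described in this README.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/fine-tuned-biobert-ner",
    aggregation_strategy="simple",  # merge B-/I-/E-/S- pieces into whole entities
)

text = "Atorvastatin reduces LDL cholesterol and the risk of myocardial infarction."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```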

Data Sources:

Pretrained Model

  • BioBERT is a pre-trained biomedical language representation model for biomedical text mining from DMIS-Lab.
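A minimal loading sketch, assuming the dmis-lab/biobert-base-cased-v1.1 checkpoint published on the Hugging Face Hub (any compatible BioBERT variant can be substituted):

```python
# Minimal sketch: loading a pre-trained BioBERT checkpoint and its tokenizer.
from transformers import AutoModel, AutoTokenizer

checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # one publicly available BioBERT variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

print(model.config.hidden_size)  # 768 for the base-size model
```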

BioNLP Tags:

1. I-Cellular_component  2. E-Gene_or_gene_product  3. I-Organism_subdivision
4. I-Organism_substance  5. B-Gene_or_gene_product  6. B-Cancer
7. I-Cancer   8. E-Pathological_formation   9. I-Pathological_formation
10. S-Organism_substance  11. S-Organ  12. E-Organ
13. I-Immaterial_anatomical_entity  14. E-Cell  15. I-Simple_chemical
16. E-Tissue  17. B-Organism  18. S-Cellular_component
19. S-Pathological_formation  20. I-Amino_acid  21. E-Anatomical_system
22. S-Developing_anatomical_structure 23. B-Immaterial_anatomical_entity
24. B-Protein  25. I-Chemical  26. S-Organism  27. I-Gene_or_gene_product
28. I-Cell  29. E-Multi-tissue_structure  30. B-Organism_subdivision
31. E-Cellular_component  32. S-Chemical  33. S-Protein
34. B-Simple_chemical  35. E-Organism  36. B-Developing_anatomical_structure
37. S-Multi-tissue_structure  38. S-Immaterial_anatomical_entity
39. B-Organism_substance  40. E-Organism_substance  41. E-Simple_chemical
42. I-Tissue  43. E-Immaterial_anatomical_entity  44. I-Organism
45. I-Protein  46. S-Organism_subdivision  47. E-Cancer
48. I-Developing_anatomical_structure  49. S-Tissue
50. E-Chemical  51. S-Amino_acid  52. O
53. S-Gene_or_gene_product 54. E-Organism_subdivision
55. B-Anatomical_system  56. B-Chemical  57. B-Cell  
58. E-Developing_anatomical_structure  59. I-Multi-tissue_structure  
60. B-Pathological_formation  61. B-Cellular_component  62. B-Organ
63. I-Anatomical_system 64. S-Cell  65. E-Amino_acid
66. B-Tissue 67. S-Simple_chemical 68. E-Protein
69. B-Multi-tissue_structure 70. I-Organ 71. S-Cancer
72. B-Amino_acid 73. S-Anatomical_system 74. PAD
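The tags follow a BIOES-style scheme (B = beginning of an entity, I = inside, E = end, S = single-token entity, O = outside any entity) over the BioNLP entity types, plus a PAD label for padding positions. A small sketch of the tag/id mappings the model needs, with the full 74-tag list abbreviated for space:

```python
# Minimal sketch: tag <-> id mappings for the 74 BioNLP tags.
# The list is abbreviated here; in practice it is built from the BioNLP
# training data so ids stay consistent between training and inference.
TAGS = [
    "O", "PAD",
    "B-Gene_or_gene_product", "I-Gene_or_gene_product",
    "E-Gene_or_gene_product", "S-Gene_or_gene_product",
    "B-Cancer", "I-Cancer", "E-Cancer", "S-Cancer",
    # ... the remaining entity types follow the same B/I/E/S pattern ...
]

tag2id = {tag: i for i, tag in enumerate(TAGS)}
id2tag = {i: tag for tag, i in tag2id.items()}

print(tag2id["B-Cancer"], id2tag[0])
```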

Fine-Tuning

  • Fine-tuning will be conducted with BioNLP data so the model can identify the 74 biomedical entity tags in CVD documents.
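A hedged sketch of that fine-tuning step with the Hugging Face Trainer API; the hyperparameters are illustrative, and train_dataset / eval_dataset are assumed to hold tokenized BioNLP examples with aligned labels (see the alignment sketch under the project walkthrough):

```python
# Minimal sketch: fine-tuning BioBERT for token classification on BioNLP data.
# `train_dataset` / `eval_dataset` are assumed to be tokenized datasets with
# aligned `labels`; `tag2id` / `id2tag` come from the tag list above.
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

checkpoint = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(tag2id), id2label=id2tag, label2id=tag2id
)

args = TrainingArguments(
    output_dir="biobert-bionlp-ner",
    learning_rate=3e-5,              # illustrative hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```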

Project Walkthrough:

  1. Learn more about the Transformer architecture (e.g., embeddings and positional encoding, multi-head self-attention, residual connections, and layer normalization)
  2. Learn more about BERT (e.g., model architecture, how the pre-trained model is built)
  3. Learn more about how BioBERT was built with PubMed documents as data sources
  4. Explore advanced NLP libraries (e.g., Hugging Face Transformers, Simple Transformers) with their tutorials
  5. Prepare BioNLP data (sentences and tags) for fine-tuning the BioBERT model; see the alignment sketch after this list
  6. Fine-tune the NER model from the pre-trained BioBERT checkpoint using the sentences and tags obtained from BioNLP
  7. Prepare cardiovascular documents and run the NER model over them to produce the document-to-entity mapping
  8. Assemble the resulting document-to-entity mapping for downstream use (e.g., knowledge graph construction)
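Steps 5 and 6 hinge on aligning the word-level BioNLP tags with BioBERT's subword tokens. A common approach, assumed here rather than prescribed by this repository, is to keep the tag on the first subword of each word and mask the remaining subwords with -100 so the loss ignores them:

```python
# Minimal sketch: aligning word-level BioNLP tags to BioBERT subword tokens.
# Only the first subword of each word keeps its tag; the rest get -100 so the
# loss ignores them. `tokenizer` and `tag2id` are defined in the sketches above.
def tokenize_and_align(words, tags, tokenizer, tag2id, max_length=128):
    encoding = tokenizer(
        words,
        is_split_into_words=True,
        truncation=True,
        max_length=max_length,
    )
    labels = []
    previous_word = None
    for word_idx in encoding.word_ids():
        if word_idx is None:               # special tokens ([CLS], [SEP])
            labels.append(-100)
        elif word_idx != previous_word:    # first subword of a new word
            labels.append(tag2id[tags[word_idx]])
        else:                              # continuation subword
            labels.append(-100)
        previous_word = word_idx
    encoding["labels"] = labels
    return encoding

example = tokenize_and_align(
    ["BNP", "is", "elevated", "in", "heart", "failure", "."],
    ["S-Gene_or_gene_product", "O", "O", "O", "O", "O", "O"],
    tokenizer, tag2id,
)
```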

Application

The recognized named entities will be used to create a knowledge graph for CVD documents (a small construction sketch follows this list). The graph will help to

  • (1) run smart queries for data exploration, summarization, visualization, and analysis
  • (2) create graph embeddings and conduct dimensionality reduction and visualization (PCA, t-SNE)
  • (3) train Graph Neural Networks, especially for link prediction between CVD diseases and molecular mechanisms (genes, proteins, pathways)
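As an illustration only (the repository does not prescribe a graph library), a document-to-entity knowledge graph could be seeded from the NER output with networkx:

```python
# Minimal sketch: building a small document-to-entity knowledge graph from NER
# output with networkx. Entity predictions are assumed to come from the
# fine-tuned pipeline shown earlier; the PubMed id is a placeholder.
import networkx as nx

def add_document(graph, doc_id, entities):
    """Link a document node to each (entity text, entity type) it mentions."""
    graph.add_node(doc_id, kind="document")
    for ent in entities:
        node = (ent["word"], ent["entity_group"])
        graph.add_node(node, kind="entity")
        graph.add_edge(doc_id, node, relation="mentions")

kg = nx.Graph()
add_document(kg, "PMID:0000000", ner("Atorvastatin reduces LDL cholesterol."))
print(kg.number_of_nodes(), kg.number_of_edges())
```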

Educational Goal:

This project offers an opportunity to become familiar with advanced NLP models (e.g., the Transformer and BioBERT) and to implement named entity recognition over CVD documents.

Scientific Goal:

This project offers a new data engineering concept that could deliver high-quality data to assist other scientific explorations (e.g., knowledge graphs, graph neural networks, entity resolution, concept normalization).

References

  1. BioBERT: a pre-trained biomedical language representation model for biomedical text mining
  2. Tagging Genes and Proteins with BioBERT
