Master Thesis: Custom Named Entity Recognition and Topic Classification for Global Health Publications

Welcome to the GitHub repository for my Master's thesis project on Natural Language Processing (NLP) techniques applied to global health documents at the University of Geneva, Switzerland. This repository is organized into different sections, each focusing on a specific aspect of the project. Below is an overview of the project structure and instructions for navigating the repository effectively.

Project Structure

In an era where information flows ceaselessly and health crises span the globe, NLP can help extract meaningful insights from the vast volume of global health documents released every day. The repository is organized into the four main components that make up the thesis:

1. Word2Vec Tag Similarity Discovery

Folder: word2vec Tag Discovery

In the Word2Vec Tag Similarity Discovery component, the objective is to identify similar tags based on an initial set of tags using cosine similarity within Word2Vec models. The core idea is to compare tag similarity across models trained on corpora of varying size, ranging from 100,000 words to 2 million and even 8 billion words. The primary question this component addresses is whether words that are correlated in real-world health contexts are also close to each other as vectors in the Word2Vec representations.
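As a rough illustration of the tag lookup described above, the following sketch ranks tags by cosine similarity to a seed tag. It is not the thesis code: the three-dimensional vectors and the function names are hypothetical stand-ins for embeddings taken from a trained Word2Vec model (with gensim, a comparable ranking is typically obtained through the model's `wv.most_similar` method).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar_tags(seed, vectors, top_n=3):
    """Rank every other tag by cosine similarity to the seed tag."""
    seed_vec = vectors[seed]
    scored = [
        (tag, cosine_similarity(seed_vec, vec))
        for tag, vec in vectors.items()
        if tag != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy embeddings; real ones would come from a trained model.
toy_vectors = {
    "malaria": [0.9, 0.1, 0.0],
    "mosquito": [0.8, 0.2, 0.1],
    "vaccine": [0.3, 0.9, 0.1],
    "budget": [0.0, 0.1, 0.9],
}

ranked = most_similar_tags("malaria", toy_vectors, top_n=2)
for tag, score in ranked:
    print(f"{tag}: {score:.3f}")
```

Running the same ranking against models trained on different corpus sizes would show whether the nearest tags stay stable as the corpus grows.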

2. Named Entity Recognition (NER) of Global Health Documents

Folder: Named Entity Recognition

The Named Entity Recognition (NER) component involves a model comparison of all available spaCy models on a custom test set built from global health publications extracted from PubMed. This comparison evaluates the differences between CNN and Transformer-based models in order to find the model that best balances speed and accuracy. After careful per-label evaluation, the chosen CNN model is fine-tuned on a custom dataset that was annotated and curated using Prodigy. The resulting model identifies entity types such as ORG (Organizations), GPE (Geopolitical Entities), LAW (Public Health Laws), PERSON, and DISEASE. The DISEASE label was added to the NER model through additional training on the NCBI dataset.
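The per-label evaluation mentioned above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the thesis: it assumes exact-match scoring on `(start, end, label)` spans, and the function name and example spans are hypothetical.

```python
from collections import defaultdict


def per_label_scores(gold, predicted):
    """Exact-match precision, recall and F1 per entity label.

    gold and predicted are parallel lists with one set of
    (start, end, label) tuples per document.
    """
    tp = defaultdict(int)
    fp = defaultdict(int)
    fn = defaultdict(int)
    for gold_ents, pred_ents in zip(gold, predicted):
        for ent in pred_ents:
            if ent in gold_ents:
                tp[ent[2]] += 1
            else:
                fp[ent[2]] += 1
        for ent in gold_ents:
            if ent not in pred_ents:
                fn[ent[2]] += 1

    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical annotations for two documents.
gold = [{(0, 3, "ORG"), (10, 16, "GPE")}, {(5, 12, "DISEASE")}]
predicted = [{(0, 3, "ORG"), (10, 16, "ORG")}, {(5, 12, "DISEASE")}]

scores = per_label_scores(gold, predicted)
for label in sorted(scores):
    print(label, scores[label])
```

Comparing these figures label by label, alongside inference time, is one way to weigh a CNN model against a Transformer-based one.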

3. Topic Classification

Folder: Topic Classification

Topic classification is another natural language processing task: it labels a given text with the topic it represents, essentially text categorization. This section focuses on few-shot classification as well as zero-shot single-label and multi-label classification. A custom test set of 1000 global health sentences, categorized into 50 selected labels, is used for evaluation. The facebook/bart-large-mnli model from Hugging Face's model repository serves as the backbone for zero-shot classification and performs robustly in this context. However, its accuracy comes at a high computational cost, so it runs as its own pipeline rather than being merged with the fast CNN NER model.
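A minimal sketch of zero-shot classification with this model, assuming the `transformers` library is installed. The helper names, the 0.5 threshold and the example labels are hypothetical and not taken from the thesis. The model is loaded lazily inside `classify`, so the label-selection logic can be tried on a mocked result without downloading anything.

```python
def select_labels(result, threshold=0.5, multi_label=True):
    """Pick labels from a zero-shot result dict with 'labels' and 'scores'."""
    pairs = list(zip(result["labels"], result["scores"]))
    if not multi_label:
        # Single-label: keep only the highest-scoring topic.
        return [max(pairs, key=lambda pair: pair[1])[0]]
    # Multi-label: keep every topic at or above the (assumed) threshold.
    return [label for label, score in pairs if score >= threshold]


def classify(text, candidate_labels, multi_label=True, threshold=0.5):
    """Run zero-shot classification; downloads the model on first use."""
    from transformers import pipeline  # imported lazily, heavy dependency

    classifier = pipeline(
        "zero-shot-classification", model="facebook/bart-large-mnli"
    )
    result = classifier(text, candidate_labels, multi_label=multi_label)
    return select_labels(result, threshold, multi_label)


# Mocked result in the shape the pipeline returns for one input text.
mock_result = {
    "sequence": "Bed nets reduced malaria cases after the vaccination drive.",
    "labels": ["vaccination", "malaria", "financing"],
    "scores": [0.91, 0.62, 0.08],
}

print(select_labels(mock_result, threshold=0.5, multi_label=True))
print(select_labels(mock_result, multi_label=False))
```

Because each candidate label requires its own forward pass through the model, inference time grows with the number of labels, which is consistent with keeping this step separate from the NER pipeline.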

4. Integrated Pipeline

Folder: Integrated Pipeline

The Integrated Pipeline encapsulates the end-to-end process of annotating a given PDF containing a global health document. The pipeline spans PDF text extraction, text preprocessing, and tokenization, followed by NER predictions and sorting of the predicted entities by frequency. Entities that surpass a defined filtering threshold are selected for further annotation. The pipeline thus turns raw PDF content into an annotated global health document with named entities. The goal is to facilitate the indexing, searching, and retrieval of research literature, and to reduce the need for human annotation.
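The frequency sorting and threshold filtering step can be sketched like this. It is an illustration under stated assumptions, not the pipeline's actual code: the function name, the `min_count` parameter and the case-insensitive grouping of mentions are all hypothetical choices.

```python
from collections import Counter


def frequent_entities(entities, min_count=2):
    """Count (text, label) mentions and keep those seen at least min_count times.

    entities is a list of (text, label) pairs, for example taken from the
    NER predictions of one document. Mentions are grouped case-insensitively,
    which is an assumption made for this sketch.
    """
    counts = Counter((text.strip().lower(), label) for text, label in entities)
    # most_common() yields entries sorted by descending frequency.
    return [
        (text, label, n)
        for (text, label), n in counts.most_common()
        if n >= min_count
    ]


# Hypothetical NER output for a short document.
mentions = [
    ("WHO", "ORG"),
    ("Geneva", "GPE"),
    ("who", "ORG"),
    ("malaria", "DISEASE"),
    ("WHO", "ORG"),
    ("Malaria", "DISEASE"),
]

selected = frequent_entities(mentions, min_count=2)
for text, label, n in selected:
    print(f"{text} ({label}): {n}")
```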

Note: To fully leverage this repository, you need access to the fine-tuned NER model. For now, the repository is intended to be read for educational purposes only.

Contact and Feedback

For inquiries, feedback, or collaboration opportunities, please don't hesitate to send me a direct message.
