***
# <font color=red>Chapter 6: MedTALN inc.'s Case Study - Pre-Trained MLM Model Selection from Hugging Face</font>
<p style="margin-left:10%; margin-right:10%;">by <font color=teal> John Doe (typica.ai) </font></p>

***


## Overview:

This notebook walks through selecting a list of pre-trained models suitable for our case study: masked language models (MLMs) that cover the healthcare domain and the French language, and that can be fine-tuned into a healthcare NER model.

This notebook will help us identify top-performing MLM models based on specific objective criteria.

The notebook is structured into the following key steps:

- **Search for MLM Models**: We begin by searching for candidate MLM models on the Hugging Face Hub that are suitable for our needs.
- **Evaluate and Rank Models**: We evaluate the selected MLM models on a small, handcrafted dataset to determine which models best predict medical entities in the fill-mask task. The models are then ranked based on their performance.


Install the Hugging Face Transformers library (only needed the first time).

In [4]:
%%capture
!pip install transformers -U

Filter out warnings.

In [1]:
import warnings
warnings.filterwarnings('ignore')

# Identify a list of candidate MLM models from Hugging Face Hub

Search for masked language models (MLMs) suited to our case study.

We will use the Hugging Face Hub client library (`huggingface_hub`) to search and filter models programmatically:


1. Search models: search for the MLMs that support the French language.

2. Filter the results: retain only models that support French exclusively (monolingual) and carry at least one healthcare-related tag.
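
The filtering rule in step 2 can be written as a small standalone predicate. The sketch below is illustrative only: `is_candidate` and `HEALTHCARE_TAGS` are hypothetical names, and it works on plain Python values rather than on Hub objects. It assumes the `language` field of a model card may be a single string, a list, or missing, and it checks for exactly `["fr"]`, which is slightly stricter than counting the languages.

```python
# Hypothetical tag set mirroring the healthcare-related tags used in this notebook
HEALTHCARE_TAGS = {"healthcare", "medical", "clinical", "biomedical", "biology", "life science"}

def is_candidate(language, tags):
    """Return True for French-only models carrying at least one healthcare tag."""
    # Normalize the language field to a list (assumption: it may be str, list, or None)
    if language is None:
        languages = []
    elif isinstance(language, str):
        languages = [language]
    else:
        languages = list(language)

    is_french_only = languages == ["fr"]
    has_domain_tag = any(tag in HEALTHCARE_TAGS for tag in tags)
    return is_french_only and has_domain_tag

# Made-up metadata, for illustration only
print(is_candidate(["fr"], ["fill-mask", "medical"]))        # French-only with a medical tag
print(is_candidate(["fr", "en"], ["fill-mask", "medical"]))  # multilingual model
print(is_candidate("fr", ["fill-mask"]))                     # no healthcare tag
```

Keeping the rule in one function makes it easy to adjust the tag list or the language condition without touching the search call.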


## Search for MLM models

In [2]:
from huggingface_hub import list_models

# Fetch fill-mask models tagged as supporting French, including model card metadata
models = list_models(
    language="fr", task="fill-mask", library="pytorch", cardData=True
)

# Healthcare-related tags; a model must carry at least one of them
filter_tags = ["healthcare", "medical", "clinical", "biomedical", "biology", "life science"]

# Keep monolingual transformers models that carry at least one healthcare tag
included_models = []
for model in models:
    card = model.card_data
    # Skip models without card metadata or without a list of languages
    if card is None or not isinstance(card.language, list):
        continue
    if (
        len(card.language) == 1
        and card.library_name == "transformers"
        and any(tag in (model.tags or []) for tag in filter_tags)
    ):
        included_models.append(model.modelId)

included_models

['Dr-BERT/DrBERT-4GB',
 'Dr-BERT/DrBERT-7GB',
 'Dr-BERT/DrBERT-4GB-CP-PubMedBERT',
 'almanach/camembert-bio-base',
 'Dr-BERT/DrBERT-7GB-Large',
 'abazoge/DrLongformer',
 'abazoge/DrBERT-4096',
 'PantagrueLLM/jargon-general-base',
 'PantagrueLLM/jargon-general-biomed',
 'PantagrueLLM/jargon-biomed-4096',
 'PantagrueLLM/jargon-multidomain-base',
 'PantagrueLLM/jargon-biomed',
 'PantagrueLLM/jargon-NACHOS',
 'PantagrueLLM/jargon-NACHOS-4096']

## Check the Models Configuration

In this step, we validate that the selected models adhere to the architecture of the BERT base model, specifically with 12 hidden layers and 12 attention heads. This ensures consistency in terms of model size for our fine-tuned models.


In [3]:
from transformers import AutoConfig

# Models to check
model_ids = included_models
# List of model IDs whose configuration matches the BERT base layout
models_with_right_config = []

# Fetch the number of hidden layers and attention heads for a model
def get_model_details(model_id):
    try:
        # Load only the model configuration (no weights are downloaded)
        config = AutoConfig.from_pretrained(model_id, trust_remote_code=False)
        return config.num_hidden_layers, config.num_attention_heads
    except Exception:
        # A model whose configuration cannot be loaded is treated as non-matching
        return None, None

# Keep only models with 12 hidden layers and 12 attention heads
for model_id in model_ids:
    num_layers, num_heads = get_model_details(model_id)
    if num_layers == 12 and num_heads == 12:
        models_with_right_config.append(model_id)

models_with_right_config

['Dr-BERT/DrBERT-4GB',
 'Dr-BERT/DrBERT-7GB',
 'Dr-BERT/DrBERT-4GB-CP-PubMedBERT',
 'almanach/camembert-bio-base',
 'abazoge/DrLongformer',
 'abazoge/DrBERT-4096']
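
The size check can also be factored into a predicate that is easy to exercise without downloading anything. In this sketch, `has_bert_base_shape` is a hypothetical name and the `SimpleNamespace` objects are stand-ins for real configuration objects. It assumes only that a configuration exposes `num_hidden_layers` and `num_attention_heads`, as the cell above does.

```python
from types import SimpleNamespace

def has_bert_base_shape(config, layers=12, heads=12):
    """Return True if the config reports the expected layer and head counts."""
    # getattr with a default keeps the check safe for configs lacking these fields
    return (
        getattr(config, "num_hidden_layers", None) == layers
        and getattr(config, "num_attention_heads", None) == heads
    )

# Stand-in configuration objects (not real model configs)
base_like = SimpleNamespace(num_hidden_layers=12, num_attention_heads=12)
large_like = SimpleNamespace(num_hidden_layers=24, num_attention_heads=16)
incomplete = SimpleNamespace()

for name, cfg in [("base_like", base_like), ("large_like", large_like), ("incomplete", incomplete)]:
    print(name, has_bert_base_shape(cfg))
```

Only the first stand-in passes; a larger model or a configuration without these fields is rejected rather than raising an error.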

## Retrieve Mask Tokens

In this step, we retrieve the mask token for each of the models retained so far. Using the `AutoTokenizer` from the Hugging Face Transformers library, we load the tokenizer associated with each model and extract its mask token.

We then validate that the mask token matches the expected tokens, specifically `"[MASK]"` or `"<mask>"`, which are commonly used in models like BERT and its variants. Models that meet this criterion are stored in the `models_with_mask_tokens` dictionary for further processing.


In [4]:
from transformers import AutoTokenizer

# Initialize an empty dictionary to store model ID and mask token
models_with_mask_tokens = {}

# Fetch the mask token using the model's tokenizer
def get_mask_token_via_tokenizer(model_id):
    try:
        # Load the tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_id)

        # Get the mask token
        return tokenizer.mask_token
    except Exception:
        # A tokenizer that cannot be loaded yields no mask token, so the model is skipped
        return None

# Iterate through the models and populate the dictionary with mask tokens
for model_id in models_with_right_config:
    mask_token = get_mask_token_via_tokenizer(model_id)
    if mask_token in ["[MASK]", "<mask>"]:
        models_with_mask_tokens[model_id] = mask_token

# Print the constructed dictionary
models_with_mask_tokens

{'Dr-BERT/DrBERT-4GB': '<mask>',
 'Dr-BERT/DrBERT-7GB': '<mask>',
 'Dr-BERT/DrBERT-4GB-CP-PubMedBERT': '[MASK]',
 'almanach/camembert-bio-base': '<mask>',
 'abazoge/DrLongformer': '<mask>',
 'abazoge/DrBERT-4096': '<mask>'}
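
Because the mask token differs between models, an evaluation sentence can be written once as a template with a `{}` placeholder and filled in per model. A minimal illustration follows, using two of the tokens listed above and a made-up sentence; `template`, `mask_tokens`, and `masked_sentences` are names chosen for this sketch.

```python
# Illustrative sentence, not part of the evaluation set
template = "Le patient prend des {} contre la douleur."

# Two model IDs and their mask tokens, taken from the dictionary above
mask_tokens = {
    "Dr-BERT/DrBERT-4GB": "<mask>",
    "Dr-BERT/DrBERT-4GB-CP-PubMedBERT": "[MASK]",
}

# Build one masked sentence per model by substituting its own mask token
masked_sentences = {
    model_id: template.format(token) for model_id, token in mask_tokens.items()
}

for model_id, sentence in masked_sentences.items():
    print(f"{model_id}: {sentence}")
```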

# Evaluate and Rank Models Based on Entity Prediction

In this step, we evaluate a set of models to determine their effectiveness in predicting specific medical entities within masked sentences. Using the `fill-mask` pipeline from the Hugging Face Transformers library, each model is tested on a series of examples where key medical terms are masked.

The models are scored based on how well their predictions match a combination of generic and specific expected entities. These scores are then aggregated to produce a cumulative score for each model. Finally, the models are ranked based on their cumulative scores, helping us identify the most effective model for our healthcare NER task.
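
To make the scoring rule concrete, here is a small worked example for a single sentence. It is a sketch under stated assumptions: the predictions are invented rather than real model output, only a subset of the generic weights is used, and `weighted_match_score` is a hypothetical helper name.

```python
# Subset of generic entities and one specific entity, each with a weight
generic = {"médicaments": 0.3, "traitements": 0.3, "soins": 0.3}
specific = {"antibiotiques": 1.0}

# Specific entries override generic ones if the same entity appears in both
expected = {**generic, **specific}

# Hypothetical top-5 predictions for one masked sentence (not real model output)
predicted = ["antibiotiques", "médicaments", "antiviraux", "soins", "vaccins"]

def weighted_match_score(predicted_tokens, expected_weights):
    """Sum the weights of expected entities that appear among the predictions."""
    return sum(
        weight
        for entity, weight in expected_weights.items()
        if entity in predicted_tokens
    )

score = weighted_match_score(predicted, expected)
print(round(score, 2))  # 0.3 + 0.3 + 1.0, rounded for display
```

Three of the four expected entities appear among the predictions, so the sentence contributes 1.6 to the model's cumulative score; "traitements" is not predicted and adds nothing.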


In [8]:
from transformers import pipeline


# Define the generic expected entities with their weights
generic_expected_entities = [
    {'médicaments': 0.3},
    {'traitements': 0.3},
    {'soins': 0.3},
    {'remèdes': 0.3},
    {'conseils': 0.1},
    {'indications': 0.1},
    {'instructions': 0.05},
    {'interventions': 0.05},
    {'compléments': 0.05}
]


# Define the examples and their specific expected entities
examples = [
    {
        "text": "Le medecin donne des {} en cas d'infections des voies respiratoires.",
        "expected_entities": [{'antibiotiques': 1}]
    },
    {
        "text": "Le médecin recommande des {} pour réduire l'inflammation dans les poumons.",
        "expected_entities": [{'corticoïdes': 1}, {'anti-inflammatoires': 0.9}]
    },
    {
        "text": "Pour soulager les symptômes d'allergie, le médecin prescrit des {}.",
        "expected_entities": [{'antihistaminiques': 1}]
    },
    {
        "text": "Pour gérer le diabète, le médecin prescrit une {}.",
        "expected_entities": [{'insulinothérapie': 1}]
    },
    {
        "text": "Après une blessure musculaire, le patient doit suivre une {}.",
        "expected_entities": [{'physiothérapie': 1}, {'rééducation': 0.8}]
    },
    {
        "text": "En cas d'infection bactérienne, le médecin recommande une {}.",
        "expected_entities": [{'antibiothérapie': 1}]
    }
]

models = models_with_mask_tokens

# Flatten the generic entities into a single {entity: weight} dictionary
generic_weights = {k: v for d in generic_expected_entities for k, v in d.items()}

# Initialize a dictionary to store the cumulative scores for each model
model_scores = {model_name: 0 for model_name in models}

# Iterate over each model
for model_name, mask_token in models.items():
    print(f"Testing {model_name} ...")
    try:
        # Load the fill-mask pipeline for the current model
        fill_mask = pipeline("fill-mask", model=model_name, tokenizer=model_name, trust_remote_code=False)

        # Iterate over each example
        for example in examples:
            # Insert the model's own mask token into the example sentence
            masked_example = example["text"].format(mask_token)
            specific_weights = {k: v for d in example["expected_entities"] for k, v in d.items()}

            # Combine generic and specific entities; specific weights take priority
            combined_expected_entities = {**generic_weights, **specific_weights}

            # Get predictions (the pipeline returns the top 5 candidates by default)
            results = fill_mask(masked_example)
            predicted_tokens = [result['token_str'] for result in results]

            # Sum the weights of expected entities that exactly match a predicted token
            score = sum(
                weight
                for entity, weight in combined_expected_entities.items()
                if entity in predicted_tokens
            )

            # Add the score to the cumulative score for the model
            model_scores[model_name] += score

    except Exception as e:
        print(f"Error in {model_name}: {e}")

# Rank models based on their cumulative scores
ranked_models = sorted(model_scores.items(), key=lambda item: item[1], reverse=True)

# Print the top-5 models of the final ranking
print("\nModel Ranking based on Weighted Entity Match Scores (top-5): ")
for rank, (model_name, score) in enumerate(ranked_models[:5], 1):
    print(f"{rank}. {model_name}: Cumulative Score = {score}")

Testing Dr-BERT/DrBERT-4GB ...
Testing Dr-BERT/DrBERT-7GB ...
Testing Dr-BERT/DrBERT-4GB-CP-PubMedBERT ...
Testing almanach/camembert-bio-base ...
Testing abazoge/DrLongformer ...
Testing abazoge/DrBERT-4096 ...

Model Ranking based on Weighted Entity Match Scores (top-5): 
1. Dr-BERT/DrBERT-4GB: Cumulative Score = 3.65
2. abazoge/DrBERT-4096: Cumulative Score = 2.85
3. Dr-BERT/DrBERT-7GB: Cumulative Score = 2.75
4. almanach/camembert-bio-base: Cumulative Score = 2.5
5. Dr-BERT/DrBERT-4GB-CP-PubMedBERT: Cumulative Score = 0.2
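
One caveat with exact string matching: depending on the tokenizer, the `token_str` values returned by the pipeline may carry leading whitespace or different casing, in which case a correct prediction would not be counted. Whether this affects any of the models above has not been checked here. The sketch below shows one possible normalization step; `normalize_token` is a hypothetical helper and the predictions are made up.

```python
def normalize_token(token_str):
    """Strip surrounding whitespace and lowercase a predicted token."""
    return token_str.strip().lower()

# Made-up raw predictions with stray whitespace and capitalization
raw_predictions = [" antibiotiques", "Médicaments", "soins "]
expected_weights = {"antibiotiques": 1.0, "médicaments": 0.3, "soins": 0.3}

normalized_predictions = {normalize_token(token) for token in raw_predictions}

# Compare which expected entities are matched with and without normalization
exact_hits = [e for e in expected_weights if e in raw_predictions]
normalized_hits = [e for e in expected_weights if e in normalized_predictions]

print("exact:", exact_hits)
print("normalized:", normalized_hits)
```

With these inputs, exact matching finds nothing while the normalized comparison matches all three entities. Applying such a step would change the scores, so the ranking above should be read as the result of exact matching.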
