In [7]:
import json
import os
import config
import logging
import pandas as pd
from collections import Counter
from collections import defaultdict
logging.basicConfig(level=logging.INFO)

## Tokenization Insights and Comparison for RadGraph

This notebook explores how the RadGraph corpus is tokenized by:
- Custom tokenizers (BPE and WordPiece) trained from scratch.
- Pretrained tokenizers used in popular biomedical language models (e.g., Bio_ClinicalBERT).

### Objectives:
1. Analyze the RadGraph corpus to understand token distributions.
2. Compare tokenization behavior for common, rare, and domain-specific terms.
3. Highlight potential challenges with subword splits in pretrained tokenizers.
4. Draw insights on whether custom tokenization offers benefits for downstream tasks.

**Note**: This step is **not required** for training the Named Entity Recognition (NER) model. It is only included here to facilitate experimentation and understanding of how domain-specific tokenization works when training from scratch using RadGraph data.


In [8]:
from collections import Counter, defaultdict
import os
import json

# Load reports
def load_reports(file_path):
    """Load reports from a JSONL file."""
    with open(file_path, "r") as f:
        return [json.loads(line) for line in f]

# Count total tokens and unique tokens
def count_tokens_by_space(input_file):
    """Count the total number of space-separated tokens in the dataset and track unique tokens."""
    reports = load_reports(input_file) 
    total_tokens = 0
    tokens = set()  # Use a set to track unique tokens
    for report in reports:
        # A record is either a raw string or a dict with a "text" field
        if isinstance(report, str):
            text = report
        elif isinstance(report, dict) and "text" in report:
            text = report["text"]
        else:
            continue  # Skip records that carry no text
        words = text.split()
        total_tokens += len(words)
        tokens.update(words)
    return total_tokens, tokens

# Count occurrences of tokens across entities
def count_token_occurrences(input_file):
    """Count occurrences of individual token text (words/phrases) across all entities."""
    reports = load_reports(input_file)
    token_counter = Counter()
    for report in reports:
        for label_data in report["labels"]:
            token_counter[label_data["tokens"]] += 1  # Increment count for each token text
    return token_counter

# Extract entity text and labels
def extract_entity_text_and_labels_with_counts(input_file):
    """Extract entity text, labels, and their counts."""
    reports = load_reports(input_file)
    entity_text_and_labels = []
    entity_counter = defaultdict(int)  # Count occurrences of each entity label

    for report in reports:
        for label_data in report["labels"]:
            entity_text_and_labels.append({
                "text": label_data["tokens"],  # Entity text
                "label": label_data["label"]  # Entity label
            })
            entity_counter[label_data["label"]] += 1  # Increment the label count

    return entity_text_and_labels, entity_counter

# Input file
input_file = "../data/processed/radgraph.jsonl"






   


## Section 2: RadGraph Corpus Insights

In this section, I analyze the RadGraph corpus to understand the distribution of tokens and their frequencies.

### **Token Counts**
- **Total Tokens**: 15,815,652
- **Unique Tokens**: 33,855
- The large gap between total and unique tokens (about 467 occurrences per unique token on average) reflects the repetitive nature of the dataset, with frequent mentions of anatomical and diagnostic terms.

### **Token Frequency**
- **Top Tokens**:
  - Common terms include "RIGHT" (133,060 occurrences), "LEFT" (127,717), and "PLEURAL" (115,285).
- **Middle Tokens**:
  - Domain-specific phrases like "hyper - inflated" and "tumoral infiltration" appear twice.
- **Bottom Tokens**:
  - Rare terms like "FOCAL BULLAE" and "NEOPLASTIC FOCUS" appear only once.
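
The frequency list is case-sensitive, so "RIGHT" and "right" are counted as separate entries. The sketch below shows how the picture changes if counts are merged case-insensitively. The counts are copied from the "Top tokens by count" output in this notebook; the helper name `fold_case` is illustrative and not part of the original analysis.

```python
from collections import Counter

# Counts copied from the "Top tokens by count" output in this notebook.
entity_counts = Counter({
    "RIGHT": 133060,
    "right": 78237,
    "LEFT": 127717,
    "left": 71237,
    "PLEURAL": 115285,
    "pleural": 65192,
})


def fold_case(counts):
    """Merge the counts of entries that differ only in letter case."""
    folded = Counter()
    for text, count in counts.items():
        folded[text.lower()] += count
    return folded


folded_counts = fold_case(entity_counts)
for text, count in folded_counts.most_common():
    print(f"{text}: {count}")
```

Whether folding is appropriate depends on the tokenizer: an uncased tokenizer performs this merge implicitly, while a cased one treats the two forms as different strings.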

### **Vocabulary Size Considerations**
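- RadGraph has **33,855 unique space-separated tokens** (case-sensitive, so "RIGHT" and "right" count separately), which is the same order of magnitude as the vocabulary sizes of pretrained models like:
  - **Bio_ClinicalBERT** (~30,000 tokens)
  - **PubMedBERT** (~30,000 tokens)
- A vocabulary size of **30,000 tokens** is a reasonable choice for RadGraph:
  - It is close to the corpus’ unique token count, so most whole words can receive their own vocabulary entry and few need to be split.
  - Larger vocabularies (e.g., 50,000 tokens) would enlarge the embedding table while likely adding few useful entries, since the corpus has fewer than 34,000 distinct space-separated tokens.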
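
One rough way to sanity-check a vocabulary size is to ask what fraction of all word occurrences is covered by the `k` most frequent words. The sketch below is a minimal illustration on a toy corpus, not the RadGraph data; `coverage_at_k` and `toy_reports` are hypothetical names, and word-level coverage is only a proxy for how a subword vocabulary of the same size would behave.

```python
from collections import Counter


def coverage_at_k(word_counts, k):
    """Fraction of all word occurrences covered by the k most frequent words."""
    total = sum(word_counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in word_counts.most_common(k))
    return covered / total


# Toy corpus standing in for the RadGraph reports (illustrative text only).
toy_reports = [
    "no pleural effusion or pneumothorax",
    "small right pleural effusion",
    "no pneumothorax",
]
toy_counts = Counter(word for report in toy_reports for word in report.split())

for k in (1, 3, 7):
    print(f"top {k} words cover {coverage_at_k(toy_counts, k):.2f} of occurrences")
```

On the real corpus, the same function could be applied to a `Counter` built from the space-separated tokens to see how quickly coverage saturates as `k` approaches 30,000.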


In [9]:
# Count tokens
total_tokens, tokens = count_tokens_by_space(input_file)
print(f"Total number of space-separated tokens in the corpus: {total_tokens}")
print(f"Total unique space-separated tokens in the corpus: {len(tokens)}")


Total number of space-separated tokens in the corpus: 15815652
Total unique space-separated tokens in the corpus: 33855


In [10]:
# Count token occurrences
token_counts = count_token_occurrences(input_file)

# Sort tokens by count (descending)
sorted_tokens_by_count = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)

# Display token statistics
print("Top tokens by count:")
for token, count in sorted_tokens_by_count[:15]:  # Top 15
    print(f"{token}: {count}")

print("\nMiddle tokens by count:")
middle_index = len(sorted_tokens_by_count) // 2
for token, count in sorted_tokens_by_count[middle_index:middle_index + 5]:  # 5 starting at the midpoint
    print(f"{token}: {count}")

print("\nBottom tokens by count:")
for token, count in sorted_tokens_by_count[-5:]:  # Bottom 5
    print(f"{token}: {count}")

# Sample top, middle, and bottom tokens
top_tokens = [token for token, _ in sorted_tokens_by_count[:15]]  # Top 15 tokens
middle_tokens = [token for token, _ in sorted_tokens_by_count[middle_index:middle_index + 5]]  # 5 tokens starting at the midpoint
bottom_tokens = [token for token, _ in sorted_tokens_by_count[-5:]]  # Bottom 5 tokens

# Combine sampled tokens for tokenization comparison
sampled_tokens = top_tokens + middle_tokens + bottom_tokens

Top tokens by count:
RIGHT: 133060
LEFT: 127717
PLEURAL: 115285
STABLE: 85669
EFFUSION: 84829
LUNG: 83813
PULMONARY: 83238
right: 78237
EDEMA: 74638
left: 71237
PNEUMOTHORAX: 66651
pleural: 65192
UNCHANGED: 60725
ATELECTASIS: 58278
CONSOLIDATION: 53031

Middle tokens by count:
MOST PROMINENTLY: 2
VARIABLE WAXING AND WANING: 2
TUMORAL INFILTRATION: 2
MARKEDLY REDUCED: 2
hyper - inflated: 2

Bottom tokens by count:
FOCAL BULLAE: 1
RIGHT ASCENDING AORTA: 1
DISSECTING AIR: 1
ACHALASIA: 1
NEOPLASTIC FOCUS: 1


## Section 3: Tokenization Comparison

This section compares how sampled tokens from RadGraph are tokenized by:
- BPE and WordPiece (trained from scratch).
- Pretrained tokenizers in popular biomedical models (e.g., Bio_ClinicalBERT, ClinicalBERT).

### **Comparison Table**
The table below shows tokenization results for:
- **Top 15 frequent tokens**
- **Middle 5 tokens**
- **Bottom 5 rare tokens**

| Token                      |   Count | BPE                        | WordPiece                  | DistilBERT                       | Bio_ClinicalBERT                 | MedBERT                                         | ClinicalBERT                        | SapBERT                          | BioMistral                           | BiomedCLIP                       |
|:---------------------------|--------:|:---------------------------|:---------------------------|:---------------------------------|:---------------------------------|:------------------------------------------------|:------------------------------------|:---------------------------------|:-------------------------------------|:---------------------------------|
| RIGHT                      |  133060 | RIGHT                      | RIGHT                      | right                            | right                            | R ##IG ##HT                                     | right                               | right                            | ▁R IGHT                              | right                            |
| LEFT                       |  127717 | LEFT                       | LEFT                       | left                             | left                             | L ##EF ##T                                      | left                                | left                             | ▁LE FT                               | left                             |
| PLEURAL                    |  115285 | PLEURAL                    | PLEURAL                    | pl ##eur ##al                    | p ##le ##ural                    | P ##LE ##UR ##AL                                | pl ##eur ##al                       | pleural                          | ▁P LE UR AL                          | pleural                          |
| STABLE                     |   85669 | STABLE                     | STABLE                     | stable                           | stable                           | ST ##AB ##LE                                    | stable                              | stable                           | ▁ST ABLE                             | stable                           |
| EFFUSION                   |   84829 | EFFUSION                   | EFFUSION                   | e ##ff ##usion                   | e ##ff ##usion                   | E ##FF ##US ##ION                               | ef ##fus ##ion                      | effusion                         | ▁E FF US ION                         | effusion                         |
| LUNG                       |   83813 | LUNG                       | LUNG                       | lung                             | lung                             | L ##UN ##G                                      | lung                                | lung                             | ▁L UN G                              | lung                             |
| PULMONARY                  |   83238 | PULMONARY                  | PULMONARY                  | pulmonary                        | pulmonary                        | P ##U ##LM ##ON ##AR ##Y                        | pu ##lm ##onar ##y                  | pulmonary                        | ▁P UL MON ARY                        | pulmonary                        |
| right                      |   78237 | right                      | right                      | right                            | right                            | right                                           | right                               | right                            | ▁right                               | right                            |
| EDEMA                      |   74638 | EDEMA                      | EDEMA                      | ed ##ema                         | ed ##ema                         | E ##DE ##MA                                     | ede ##ma                            | edema                            | ▁E DE MA                             | edema                            |
| left                       |   71237 | left                       | left                       | left                             | left                             | left                                            | left                                | left                             | ▁left                                | left                             |
| PNEUMOTHORAX               |   66651 | PNEUMOTHORAX               | PNEUMOTHORAX               | p ##ne ##um ##otho ##ra ##x      | p ##ne ##um ##oth ##orax         | P ##NE ##UM ##OT ##H ##OR ##A ##X               | pne ##umo ##th ##orax               | pneumothorax                     | ▁P NE UM OT H OR AX                  | pneumothorax                     |
| pleural                    |   65192 | pleural                    | pleural                    | pl ##eur ##al                    | p ##le ##ural                    | p ##le ##ural                                   | pl ##eur ##al                       | pleural                          | ▁ple ural                            | pleural                          |
| UNCHANGED                  |   60725 | UNCHANGED                  | UNCHANGED                  | unchanged                        | unchanged                        | UN ##CH ##AN ##GE ##D                           | un ##chang ##ed                     | unchanged                        | ▁UN CHAN G ED                        | unchanged                        |
| ATELECTASIS                |   58278 | ATELECTASIS                | ATELECTASIS                | ate ##le ##cta ##sis             | ate ##lect ##asis                | AT ##EL ##EC ##TA ##SI ##S                      | at ##ele ##cta ##sis                | ate ##lec ##ta ##sis             | ▁A TE LECT AS IS                     | ate ##le ##ct ##asis             |
| CONSOLIDATION              |   53031 | CONSOLIDATION              | CONSOLIDATION              | consolidation                    | consolidation                    | CO ##NS ##OL ##ID ##AT ##ION                    | con ##sol ##idat ##ion              | consolidation                    | ▁CON S OL ID ATION                   | consolidation                    |
| MOST PROMINENTLY           |       2 | MOST PROMINENTLY           | MOST PROMINENTLY           | most prominently                 | most prominently                 | M ##OS ##T PR ##OM ##IN ##EN ##TL ##Y           | most prominent ##ly                 | most prominently                 | ▁MO ST ▁P ROM IN ENT LY              | most prominently                 |
| VARIABLE WAXING AND WANING |       2 | VARIABLE WAXING AND WANING | VARIABLE WAXING AND WANING | variable wax ##ing and wan ##ing | variable wax ##ing and wa ##ning | VA ##RI ##AB ##LE WA ##X ##ING AND WA ##NI ##NG | variable wa ##xin ##g and wa ##ning | variable wax ##ing and wan ##ing | ▁VAR I ABLE ▁W AX ING ▁AND ▁W AN ING | variable wax ##ing and wa ##ning |
| TUMORAL INFILTRATION       |       2 | TUMORAL INFILTRATION       | TUMORAL INFILTRATION       | tumor ##al in ##filtration       | tumor ##al in ##fi ##ltration    | T ##UM ##OR ##AL IN ##FI ##LT ##RA ##TI ##ON    | tumor ##al in ##fil ##tration       | tumoral infiltration             | ▁T UM OR AL ▁IN FIL TR ATION         | tumoral infiltration             |
| MARKEDLY REDUCED           |       2 | MARKEDLY REDUCED           | MARKEDLY REDUCED           | markedly reduced                 | marked ##ly reduced              | MA ##R ##KE ##D ##L ##Y R ##ED ##UC ##ED        | marked ##ly reduced                 | markedly reduced                 | ▁M ARK ED LY ▁R ED UC ED             | markedly reduced                 |
| hyper - inflated           |       2 | hyper - inflated           | hyper - inflated           | hyper - inflated                 | h ##yper - in ##f ##lated        | h ##yper - in ##f ##lated                       | hy ##per - in ##f ##lated           | hyper - inflated                 | ▁hyper ▁- ▁infl ated                 | hyper - infl ##ated              |
| FOCAL BULLAE               |       1 | FOCAL BULLAE               | FOCAL BULLAE               | focal bull ##ae                  | focal bull ##ae                  | F ##OC ##AL B ##U ##LL ##A ##E                  | focal bu ##lla ##e                  | focal bull ##ae                  | ▁F OC AL ▁B ULL AE                   | focal bull ##ae                  |
| RIGHT ASCENDING AORTA      |       1 | RIGHT ASCENDING AORTA      | RIGHT ASCENDING AORTA      | right ascending ao ##rta         | right ascending a ##ort ##a      | R ##IG ##HT AS ##CE ##ND ##ING A ##OR ##TA      | right as ##cend ##ing ao ##rta      | right ascending aorta            | ▁R IGHT ▁A SC END ING ▁A ORT A       | right ascending aorta            |
| DISSECTING AIR             |       1 | DISSECTING AIR             | DISSECTING AIR             | di ##sse ##cting air             | di ##sse ##cting air             | D ##IS ##SE ##CT ##ING AI ##R                   | disse ##cting air                   | dissecting air                   | ▁DIS SE CT ING ▁A IR                 | dissecting air                   |
| ACHALASIA                  |       1 | ACH AL ASIA                | ACH ##AL ##ASIA            | ac ##hala ##sia                  | a ##chal ##asi ##a               | AC ##HA ##LA ##SI ##A                           | ach ##alas ##ia                     | ach ##ala ##si ##a               | ▁A CH AL AS IA                       | ach ##ala ##si ##a               |
| NEOPLASTIC FOCUS           |       1 | NEOPLASTIC FOCUS           | NEOPLASTIC FOCUS           | neo ##pl ##astic focus           | neo ##p ##lastic focus           | NE ##OP ##LA ##ST ##IC F ##OC ##US              | neo ##pla ##stic focus              | neoplastic focus                 | ▁NE OP LAST IC ▁F OC US              | neoplastic focus                 |


### **Insights**
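- Short common terms such as "RIGHT" and "LEFT" stay whole in most tokenizers, but MedBERT and BioMistral, which preserve case in this comparison, break the uppercase forms into several pieces (e.g., `R ##IG ##HT`).
- Frequent clinical terms are not uniformly preserved: "PLEURAL" and "EFFUSION" remain single tokens only in the custom BPE and WordPiece tokenizers, SapBERT, and BiomedCLIP; the other pretrained tokenizers split them into subwords.
- Rare or domain-specific terms (e.g., "ACHALASIA", "hyper - inflated") are frequently split, potentially affecting downstream task performance.
- Despite these splits, pretrained models like Bio_ClinicalBERT are generally expected to adapt during fine-tuning; this notebook does not measure that effect.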
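
The amount of splitting can be summarized with a simple "fertility" score: the average number of subword pieces per whitespace-separated word. The sketch below computes it for the PNEUMOTHORAX row, with the tokenizations copied from the comparison table above. The function name `fertility` is an assumption for illustration; it is not part of the notebook's original code.

```python
def fertility(original, pieces):
    """Average number of subword pieces per whitespace-separated word."""
    n_words = len(original.split())
    n_pieces = len(pieces.split())
    return n_pieces / n_words if n_words else 0.0


# Tokenizations copied from the comparison table (PNEUMOTHORAX row).
pneumothorax_row = {
    "BPE": "PNEUMOTHORAX",
    "DistilBERT": "p ##ne ##um ##otho ##ra ##x",
    "Bio_ClinicalBERT": "p ##ne ##um ##oth ##orax",
    "MedBERT": "P ##NE ##UM ##OT ##H ##OR ##A ##X",
    "SapBERT": "pneumothorax",
}

scores = {
    name: fertility("PNEUMOTHORAX", pieces)
    for name, pieces in pneumothorax_row.items()
}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.1f} pieces per word")
```

A score of 1.0 means the term is kept whole; higher values mean the model sees the term as a longer sequence of pieces. Averaging this score over all sampled tokens would give one number per tokenizer for comparison.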


In [11]:
from transformers import AutoTokenizer

# Add more pretrained tokenizers
pretrained_models = {
    "DistilBERT": "distilbert-base-uncased",
    "Bio_ClinicalBERT": "emilyalsentzer/Bio_ClinicalBERT",
    "MedBERT": "Charangan/MedBERT",
    "ClinicalBERT": "medicalai/ClinicalBERT",
    "SapBERT": "cambridgeltl/SapBERT-from-PubMedBERT-fulltext",
    "BioMistral" : "BioMistral/BioMistral-7B",
    "BiomedCLIP": "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
}

examples = [
    "PICC line insertion noted with cardiomediastinal shift.",
    "Retrocardiac opacity observed in the lower lobe.",
    "Atelectasis seen in the left lung base."
]
# Combine top 15, middle 5, and bottom 5 tokens
middle_index = len(sorted_tokens_by_count) // 2
sampled_tokens_with_counts = (
    sorted_tokens_by_count[:15] +  # Top 15
    sorted_tokens_by_count[middle_index:middle_index + 5] +  # Middle 5
    sorted_tokens_by_count[-5:]  # Bottom 5
)

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Prepare corpus from RadGraph
def load_radgraph_text(file_path):
    """Load text from RadGraph reports for training tokenizers."""
    # Example implementation (adjust based on actual RadGraph format)
    with open(file_path, "r") as f:
        reports = f.readlines()
    return reports

# Load the corpus
corpus = load_radgraph_text("../data/radgraph_reports.txt")

# Train BPE tokenizer
bpe_tokenizer = Tokenizer(models.BPE())
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
bpe_trainer = trainers.BpeTrainer(vocab_size=30000, min_frequency=2)
bpe_tokenizer.train_from_iterator(corpus, trainer=bpe_trainer)

# Train WordPiece tokenizer
wp_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
wp_trainer = trainers.WordPieceTrainer(vocab_size=30000, special_tokens=["[UNK]"])
wp_tokenizer.train_from_iterator(corpus, trainer=wp_trainer)

# Save tokenizers in the notebook directory
os.makedirs("models", exist_ok=True)  # Save relative to the notebook location
bpe_tokenizer.save("models/bpe_tokenizer.json")
wp_tokenizer.save("models/wp_tokenizer.json")

# Load each pretrained tokenizer once, rather than once per sampled token
pretrained_tokenizers = {
    model_name: AutoTokenizer.from_pretrained(model_path)
    for model_name, model_path in pretrained_models.items()
}

# Tokenize sampled tokens
comparison_results = []
for token, count in sampled_tokens_with_counts:
    row = {"Token": token, "Count": count}
    row["BPE"] = " ".join(bpe_tokenizer.encode(token).tokens)
    row["WordPiece"] = " ".join(wp_tokenizer.encode(token).tokens)
    for model_name, tokenizer in pretrained_tokenizers.items():
        row[model_name] = " ".join(tokenizer.tokenize(token))
    comparison_results.append(row)

# Convert to DataFrame
comparison_df = pd.DataFrame(comparison_results)

# Display as Markdown (DataFrame.to_markdown requires the optional `tabulate` package)
print(comparison_df.to_markdown(index=False))











## Section 4: Conclusions and Next Steps

### **Takeaways**
1. **Corpus Analysis**:
   - RadGraph’s token distribution highlights significant repetition among anatomical and diagnostic terms.
   - Rare or domain-specific terms could benefit from improved tokenization.

2. **Tokenization Insights**:
   - Pretrained tokenizers keep short common terms intact but split many clinical terms into subwords, both frequent ones (e.g., "PLEURAL", "ATELECTASIS") and rare ones (e.g., "ACHALASIA").
   - While custom tokenization could improve representation for domain-specific terms, pretrained models are likely sufficient for most tasks.

### **Next Steps**
- Fine-tune pretrained models (e.g., BioBERT, ClinicalBERT) on RadGraph for tasks like NER and relation extraction.
- Evaluate the impact of token splits on model performance.
- Explore custom tokenization schemes as a future experiment.

