
# ICD-11 Code Similarity Analysis using BERT Embeddings

This notebook performs a comprehensive similarity analysis of ICD-11 code embeddings using multiple models. We:

1. Compare embeddings from seven different models: TF-IDF, FastText, BERT, BioBERT, BioClinicalBERT, PubMedBERT, and GatorTron
2. Analyze branch-level matching accuracy for each model
3. Evaluate hierarchical code relationships through:
   - Branch-level matching (first symbol)
   - Sub-branch matching (first two symbols)
   - Cosine similarity scores
4. Compare model performance using metrics like:
   - Percentage of same-branch matches
   - Mean similarity scores
   - Distribution of similarity values

The analysis helps identify which models best capture the hierarchical structure of ICD-11 codes, with PubMedBERT showing the strongest performance (97.42% mean similarity score) and highest branch-matching accuracy (55.29%).
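
As a rough illustration of the hierarchical checks listed above, the sketch below measures how many leading symbols two codes share and computes the cosine similarity of their vectors. The codes, the 2-D vectors, and the helper names `shared_prefix_len` and `cosine` are invented for this example and are not part of the notebook's pipeline.

```python
import math

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading symbols two codes have in common."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Placeholder codes and toy vectors, chosen only to show the mechanics.
toy = {
    "1A00": [1.0, 0.0],
    "1A01": [0.9, 0.1],
    "1B10": [0.6, 0.8],
    "2C25": [0.0, 1.0],
}

query = "1A00"
for code, vec in toy.items():
    if code == query:
        continue
    depth = shared_prefix_len(query, code)  # 1 = same branch, 2 = same sub-branch, ...
    print(f"{query} vs {code}: shared prefix = {depth}, cosine = {cosine(toy[query], vec):.3f}")
```

A shared prefix of at least 1 corresponds to a branch-level match and at least 2 to a sub-branch match. A model captures the hierarchy well when high cosine similarity tends to coincide with long shared prefixes.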

# Branch Matching Analysis

In [8]:
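import os

import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity  # used in analyze_prefix_matches_with_comparison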

def load_and_preprocess_embeddings(file_path):
    """Load embeddings and convert vector strings to numpy arrays"""
    df = pd.read_csv(file_path)
    def convert_to_array(x):
        if isinstance(x, str):
            try:
                return np.array([float(i) for i in x.strip('[]').split(',')])
            except ValueError:
                # Malformed vector strings become None and are dropped by dropna() below
                return None
        return x
    df['vector_array'] = df['Vector'].apply(convert_to_array)
    return df[['ICD11_code', 'vector_array']].dropna()

def normalize(vectors):
    """Normalize each row to unit length, leaving all-zero rows unchanged"""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # Avoid division by zero for all-zero rows
    return vectors / norms

In [9]:
def analyze_prefix_matches_with_comparison(models_dict):
    """Complete analysis function that creates comparison DataFrame for all models"""
    results = {}
    
    for model_name, (def_file, data_file) in models_dict.items():
        try:
            print(f"\nProcessing {model_name}...")
            # Load embeddings
            def_emb = load_and_preprocess_embeddings(os.path.join('definitions2vec', def_file))
            data_emb = load_and_preprocess_embeddings(os.path.join('data', data_file))
            
            # Find overlapping codes and prepare matrices
            overlap_codes = set(def_emb['ICD11_code']).intersection(set(data_emb['ICD11_code']))
            def_overlap = def_emb[def_emb['ICD11_code'].isin(overlap_codes)]
            data_overlap = data_emb[data_emb['ICD11_code'].isin(overlap_codes)]
            
            # Convert to matrices
            def_matrix = np.stack(def_overlap['vector_array'].values)
            data_matrix = np.stack(data_overlap['vector_array'].values)
            
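            # Handle dimension mismatch.
            # NOTE: only the wider matrix is projected, using an SVD fitted on that
            # matrix alone, so the two matrices no longer share a coordinate system.
            # Similarities for models that hit this branch should be read with caution.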
            if def_matrix.shape[1] != data_matrix.shape[1]:
                target_dim = min(def_matrix.shape[1], data_matrix.shape[1])
                print(f"Reducing dimensions to {target_dim}")
                
                # Reduce definition matrix if needed
                if def_matrix.shape[1] > target_dim:
                    svd_def = TruncatedSVD(n_components=target_dim)
                    def_matrix = svd_def.fit_transform(def_matrix)
                
                # Reduce data matrix if needed
                if data_matrix.shape[1] > target_dim:
                    svd_data = TruncatedSVD(n_components=target_dim)
                    data_matrix = svd_data.fit_transform(data_matrix)
            
            # Normalize matrices
            def_matrix = normalize(def_matrix)
            data_matrix = normalize(data_matrix)
            
            # Calculate similarities and find top matches
            similarities = cosine_similarity(def_matrix, data_matrix)
            model_results = []
            
            # Add debug counters
            total_comparisons = 0
            no_matches_count = 0
            debug_examples = []
            
            for i, sims in enumerate(similarities):
                # Get top 3 most similar codes
                top_idx = np.argsort(sims)[-3:][::-1]
                orig_code = str(def_overlap['ICD11_code'].iloc[i])
                top_codes = [str(data_overlap['ICD11_code'].iloc[idx]) for idx in top_idx]
                
                # Initialize prefix match counters
                prefix_matches = [0] * 5  # For 0 to 4 symbols
                
                # Check prefix matches for each top code
                has_any_match = False
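                for top_code in top_codes:
                    # NOTE: n = 0 compares two empty prefixes and always matches,
                    # so has_any_match is always True and no_matches_count stays 0.
                    # Use range(1, 5) to count codes with no real prefix match.
                    for n in range(5):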
                        if orig_code[:n] == top_code[:n]:
                            prefix_matches[n] += 1
                            has_any_match = True
                
                total_comparisons += 1
                if not has_any_match:
                    no_matches_count += 1
                    # Store example for debugging
                    debug_examples.append({
                        'orig_code': orig_code,
                        'top_codes': top_codes,
                        'similarities': [sims[idx] for idx in top_idx]
                    })

                model_results.append(prefix_matches)
            
            # Print debugging information
            print(f"\nDebug info for {model_name}:")
            print(f"Total comparisons: {total_comparisons}")
            print(f"No matches count: {no_matches_count}")
            print(f"No matches percentage: {(no_matches_count/total_comparisons)*100:.2f}%")
            
            if debug_examples:
                print("\nExamples of codes with no matches:")
                for i, example in enumerate(debug_examples[:5]):  # Show first 5 examples
                    print(f"\nExample {i+1}:")
                    print(f"Original code: {example['orig_code']}")
                    print(f"Top 3 matches: {example['top_codes']}")
                    print(f"Similarities: {[f'{sim:.4f}' for sim in example['similarities']]}")
            

            # Convert results to numpy array
            model_results = np.array(model_results)
            
            # Calculate metrics for this model
            metrics = {
                '% Match First 4 Symbols': (model_results[:, 4] > 0).mean() * 100,
                '% Match First 3 Symbols': (model_results[:, 3] > 0).mean() * 100,
                '% Match First 2 Symbols': (model_results[:, 2] > 0).mean() * 100,
                '% Match First Symbol': (model_results[:, 1] > 0).mean() * 100,
                '% No First Symbol Matches': 100 - (model_results[:, 1] > 0).mean() * 100 
            }
            
            results[model_name] = metrics
            
        except Exception as e:
            print(f"Error processing {model_name}: {str(e)}")
    
    # Create comparison DataFrame
    comparison_df = pd.DataFrame(results).T.round(2)
    
    # Ensure valid percentages
    comparison_df = comparison_df.apply(lambda x: np.clip(x, 0, 100))
    
    return comparison_df

In [10]:
models = {
    'tfidf': ('tfidf_encyclopedia_embeddings.csv', 'tfidf_ICD11_embeddings.csv'),
    'fasttext': ('fasttext_encyclopedia_embeddings.csv', 'fasttext_ICD11_embeddings.csv'),
    'bert': ('bert_encyclopedia_embeddings.csv', 'bert_ICD11_embeddings.csv'),
    'biobert': ('biobert_encyclopedia_embeddings.csv', 'biobert_ICD11_embeddings.csv'),
    'bioclinicalbert': ('bioclinicalbert_encyclopedia_embeddings.csv', 'bioclinicalbert_ICD11_embeddings.csv'),
    'pubmedbert': ('pubmedbert_encyclopedia_embeddings.csv', 'pubmedbert_ICD11_embeddings.csv'),
    'gatortron': ('gatortron_encyclopedia_embeddings.csv', 'gatortron_ICD11_embeddings.csv')
}

comparison_df = analyze_prefix_matches_with_comparison(models)




Processing tfidf...
Reducing dimensions to 378

Debug info for tfidf:
Total comparisons: 378
No matches count: 0
No matches percentage: 0.00%

Processing fasttext...

Debug info for fasttext:
Total comparisons: 378
No matches count: 0
No matches percentage: 0.00%

Processing bert...

Debug info for bert:
Total comparisons: 378
No matches count: 0
No matches percentage: 0.00%

Processing biobert...

Debug info for biobert:
Total comparisons: 378
No matches count: 0
No matches percentage: 0.00%

Processing bioclinicalbert...

Debug info for bioclinicalbert:
Total comparisons: 378
No matches count: 0
No matches percentage: 0.00%

Processing pubmedbert...

Debug info for pubmedbert:
Total comparisons: 378
No matches count: 0
No matches percentage: 0.00%

Processing gatortron...
Reducing dimensions to 378

Debug info for gatortron:
Total comparisons: 378
No matches count: 0
No matches percentage: 0.00%


In [11]:
# Display results
print("\nModel Comparison Results:")
display(comparison_df)

# Save results
comparison_df.to_csv('model_prefix_matching_comparison.csv')
print("\nResults saved to model_prefix_matching_comparison.csv")


Model Comparison Results:


| Model | % Match First 4 Symbols | % Match First 3 Symbols | % Match First 2 Symbols | % Match First Symbol | % No First Symbol Matches |
|---|---|---|---|---|---|
| tfidf | 1.06 | 8.47 | 15.61 | 31.48 | 68.52 |
| fasttext | 0.79 | 2.12 | 4.23 | 16.67 | 83.33 |
| bert | 51.85 | 64.81 | 72.49 | 83.07 | 16.93 |
| biobert | 66.14 | 77.51 | 82.01 | 89.95 | 10.05 |
| bioclinicalbert | 54.50 | 65.34 | 70.90 | 82.54 | 17.46 |
| pubmedbert | 75.13 | 83.33 | 86.24 | 92.33 | 7.67 |
| gatortron | 2.91 | 11.11 | 26.19 | 50.79 | 49.21 |



Results saved to model_prefix_matching_comparison.csv


In our evaluation of seven ICD-11 code embedding approaches, the **BERT**-based models (PubMedBERT, BioBERT, base BERT, and BioClinicalBERT) perform best, with first-symbol match rates between 82.54% and 92.33%. A match at a given prefix length means that at least one of the three most similar codes shares that prefix with the query code. PubMedBERT and BioBERT also retain the most accuracy as the prefix lengthens: PubMedBERT drops only from 92.33% (first symbol) to 75.13% (first four symbols) and BioBERT from 89.95% to 66.14%, while BioClinicalBERT falls from 82.54% to 54.50% and base BERT from 83.07% to 51.85%.

The decay pattern differs sharply for the remaining models. **GatorTron** drops from 50.79% for first-symbol matches to just 2.91% for four-symbol matches, so it preserves little of the deeper hierarchy. The non-transformer baselines perform even worse: **TF-IDF** drops from 31.48% to 1.06% and **FastText** from 16.67% to 0.79%, and both have already lost half or more of their first-symbol matches at two symbols (15.61% and 4.23% respectively).
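
One way to put the decay on a common scale is to divide each model's four-symbol match rate by its first-symbol match rate. The sketch below does this with the figures from the comparison table; the `retention` helper and the idea of ranking by this ratio are illustrative additions, not a metric computed in the notebook.

```python
# Match rates (%) copied from the comparison table:
# (first symbol, first four symbols).
rates = {
    "tfidf": (31.48, 1.06),
    "fasttext": (16.67, 0.79),
    "bert": (83.07, 51.85),
    "biobert": (89.95, 66.14),
    "bioclinicalbert": (82.54, 54.50),
    "pubmedbert": (92.33, 75.13),
    "gatortron": (50.79, 2.91),
}

def retention(first: float, fourth: float) -> float:
    """Fraction of first-symbol matches that are still matches at four symbols."""
    return fourth / first

# Rank models from best to worst preservation of deeper hierarchy levels.
ranked = sorted(rates, key=lambda m: retention(*rates[m]), reverse=True)

for model in ranked:
    first, fourth = rates[model]
    print(f"{model:16s} {first:6.2f}% -> {fourth:6.2f}%  retention = {retention(first, fourth):.2f}")
```

Because any code that matches on four symbols also matches on the first symbol, this ratio always lies between 0 and 1. On these figures PubMedBERT retains roughly 0.81 of its first-symbol matches, while GatorTron, FastText, and TF-IDF all retain less than 0.10.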

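These decay patterns suggest that matching the first symbol is achievable with several approaches, but maintaining accuracy across longer prefixes, and thus capturing the hierarchical structure of ICD-11 codes, was achieved here only by the BERT-family models, with the biomedical variants PubMedBERT and BioBERT performing best. Pre-training on medical text alone is not sufficient in this setup: GatorTron, although also a transformer trained on clinical text, falls well behind the BERT-family models at every prefix length. GatorTron and TF-IDF are also the two models whose embeddings went through the dimension-reduction step (see the "Reducing dimensions to 378" log lines), which may affect their scores.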