# Exploration of pathways

We assume `Lunguage.csv` and `Lunguage_vocab.csv` are available in the `lunguage` folder. Both files come from the PhysioNet project "Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation".

First, process the pathways into a standard machine-readable form.

In [None]:
import pandas as pd
from modules.processor import VocabularyProcessor, set_global_vocab_processor, PathwayFormatter, FirstPathwayProcessor

# Load and preprocess data -> vocab file and pathway file
vocab_df = pd.read_csv("../data_resources/lunguage/Lunguage_vocab.csv")
dx_pathway_df = pd.read_csv("../data_resources/dx_pathway.csv")

# Convert to lowercase
vocab_df = vocab_df.apply(lambda x: x.str.lower() if x.dtype == "object" else x)
dx_pathway_df = dx_pathway_df.apply(lambda x: x.str.lower() if x.dtype == "object" else x)

# Drop rows with missing disease information
dx_pathway_df.dropna(subset=['disease', 'specific disease'], inplace=True)

# Initialize processors
vocab_processor = VocabularyProcessor(vocab_df)
set_global_vocab_processor(vocab_processor)
pathway_formatter = PathwayFormatter()
pathway_processor = FirstPathwayProcessor(pathway_formatter)

# Process pathway data -> turn each pathway into a structured string and store it in the column "formatted_results"
processed_df = pathway_processor.process_dataframe(dx_pathway_df)
processed_df.drop_duplicates(inplace=True)
# Create grouped dictionary -> save all pathways per disease into a dict
grouped_dict = pathway_processor.create_grouped_dict(processed_df)

print(f"Processed {len(processed_df)} rows")
print(f"Found {len(pathway_formatter.get_unmatched_terms())} unmatched terms")

### Analyze the number of unique pathways and specific diseases for each disease group
- For each disease, count the unique pathways and the unique specific diseases.
- Display groups where the number of specific diseases does not match the number of rows in the group.
- Example: pleural effusion has 2 specific variations but 10 pathways in total, because of view-dependent variations.
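The second bullet (flagging groups whose counts differ) can be sketched as below. This is a minimal illustration, not the notebook's own code: the helper name `find_mismatched_groups` and the toy rows (including the values `simple` and `loculated`) are hypothetical, and only the column names `disease` and `specific disease` are taken from the pathway file.

```python
import pandas as pd


def find_mismatched_groups(df: pd.DataFrame) -> pd.DataFrame:
    """Return disease groups whose row count differs from their number of unique specific diseases."""
    stats = df.groupby("disease").agg(
        n_rows=("specific disease", "size"),         # rows (pathway options) in the group
        n_specific=("specific disease", "nunique"),  # distinct specific diseases in the group
    )
    return stats[stats["n_rows"] != stats["n_specific"]]


# Hypothetical toy rows, for illustration only.
toy_df = pd.DataFrame({
    "disease": ["pleural effusion"] * 3 + ["cardiomegaly"],
    "specific disease": ["simple", "simple", "loculated", "cardiomegaly"],
})
mismatched = find_mismatched_groups(toy_df)
print(mismatched)
```

On the real data the same helper could be called as `find_mismatched_groups(processed_df)`, assuming `processed_df` has been built as in the first cell.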


In [9]:
# 1. Basic Information
print("=== processed_df Basic Information ===")
print(f"Shape: {processed_df.shape}")
print(f"Columns: {list(processed_df.columns)}")
print()

# 2. Disease Statistics
print("=== Disease Statistics ===")
disease_stats = processed_df.groupby('disease').agg({
    'specific disease': 'nunique',
    'formatted_results': 'nunique'
}).rename(columns={
    'specific disease': 'unique_specific_diseases',
    'formatted_results': 'unique_pathways'
})
print(disease_stats)
print()


=== processed_df Basic Information ===
Shape: (106, 23)
Columns: ['disease', 'specific disease', 'view', 'observation 1', 'status 1', 'loc 1', 'attributes 1-1', 'attributes 1-2', 'attributes 1-3', 'observation 2', 'status 2', 'loc 2', 'attributes 2-1', 'attributes 2-2', 'observation 3', 'status 3', 'loc 3', 'attributes 3-1', 'observation 4', 'status 4', 'loc 4', 'attributes 4-1', 'formatted_results']

=== Disease Statistics ===
                          unique_specific_diseases  unique_pathways
disease                                                            
atelectasis                                      7                1
bronchitis                                       4                2
cardiomegaly                                     1                1
congestive heart failure                         5                1
consolidation                                    4                1
copd                                             3                1
emphysema               

In [10]:
# 3. Number of Pathway Options per Disease
print("=== Number of Pathway Options per Disease ===")
for disease in processed_df['disease'].unique():
    disease_data = processed_df[processed_df['disease'] == disease]
    print(f"{disease}: {len(disease_data)} pathway options")
print()

# 4. Pleural Effusion Detailed Analysis
print("=== Pleural Effusion Detailed Analysis ===")
pe_data = processed_df[processed_df['disease'] == 'pleural effusion']
print(f"Total {len(pe_data)} pathway options")
print("\nView condition distribution:")
print(pe_data['view'].value_counts())
print("\nObservation items:")
for i in range(1, 5):
    obs_col = f'observation {i}'
    if obs_col in pe_data.columns:
        obs_values = pe_data[obs_col].dropna().unique()
        if len(obs_values) > 0:
            print(f"  Observation {i}: {list(obs_values)}")
print()

# 5. CHF Detailed Analysis
print("=== Congestive Heart Failure Detailed Analysis ===")
chf_data = processed_df[processed_df['disease'] == 'congestive heart failure']
print(f"Total {len(chf_data)} pathway options")
print("\nObservation items:")
for i in range(1, 5):
    obs_col = f'observation {i}'
    if obs_col in chf_data.columns:
        obs_values = chf_data[obs_col].dropna().unique()
        if len(obs_values) > 0:
            print(f"  Observation {i}: {list(obs_values)}")
print()


=== Number of Pathway Options per Disease ===
pleural effusion: 10 pathway options
pneumothorax: 5 pathway options
cardiomegaly: 1 pathway options
emphysema: 6 pathway options
fracture: 15 pathway options
tuberculosis: 5 pathway options
atelectasis: 7 pathway options
consolidation: 4 pathway options
pneumonia: 22 pathway options
lung cancer: 7 pathway options
bronchitis: 4 pathway options
pulmonary edema: 12 pathway options
congestive heart failure: 5 pathway options
copd: 3 pathway options

=== Pleural Effusion Detailed Analysis ===
Total 10 pathway options

View condition distribution:
view
ap, pa, lateral                       2
erect, pa, ap, lateral                1
supine, pa, ap, lateral               1
supine, erect, pa, ap, lateral        1
supine, recumbent, pa, ap, lateral    1
recumbent, pa, ap, lateral            1
recumbent, erect, pa, ap, lateral     1
ap, lateral                           1
pa, lateral                           1
Name: count, dtype: int64

Observation i

In [11]:
# 6. Pathway Extension Possibility Analysis
print("=== Pathway Extension Possibility Analysis ===")
all_observations = set()
for i in range(1, 5):
    obs_col = f'observation {i}'
    if obs_col in processed_df.columns:
        obs_values = processed_df[obs_col].dropna().unique()
        all_observations.update(obs_values)

print(f"Total number of observation items: {len(all_observations)}")
print("Observation items:", sorted(list(all_observations)))
print()

# 7. View Condition Analysis
print("=== View Condition Analysis ===")
view_conditions = processed_df['view'].value_counts()
print("Most common view conditions:")
print(view_conditions.head(10))
print()

# 8. Formatted Results Sample
print("=== Formatted Results Sample ===")
sample_results = processed_df['formatted_results'].head(3)
for i, result in enumerate(sample_results, 1):
    print(f"Sample {i}: {result}")
print()

# 9. Direct Observations per Disease
print("=== Direct Observations per Disease ===")
for disease in processed_df['disease'].unique():
    disease_data = processed_df[processed_df['disease'] == disease]
    direct_obs = set()
    
    for i in range(1, 5):
        obs_col = f'observation {i}'
        if obs_col in disease_data.columns:
            obs_values = disease_data[obs_col].dropna().unique()
            direct_obs.update(obs_values)
    
    print(f"{disease}: {sorted(list(direct_obs))}")
print()

# 10. Pathway Extension Mapping Possibilities
print("=== Pathway Extension Mapping Possibilities ===")
disease_names = set(processed_df['disease'].unique())
all_obs = set()

for i in range(1, 5):
    obs_col = f'observation {i}'
    if obs_col in processed_df.columns:
        obs_values = processed_df[obs_col].dropna().unique()
        all_obs.update(obs_values)

# Check if observations match with other disease names
potential_expansions = {}
for obs in all_obs:
    matching_diseases = []
    for disease in disease_names:
        # Substring match in either direction (equality is covered by the substring test)
        if obs.lower() in disease.lower() or disease.lower() in obs.lower():
            matching_diseases.append(disease)
    if matching_diseases:
        potential_expansions[obs] = matching_diseases

print("Expandable Observations:")
for obs, diseases in potential_expansions.items():
    print(f"  {obs} → {diseases}")

=== Pathway Extension Possibility Analysis ===
Total number of observation items: 27
Observation items: ['air', 'architectural distortion', 'blunting', 'callus formation', 'cardiomegaly', 'consolidation', 'cough', 'deformity', 'destruction', 'disruption', 'dyspnea', 'emphysema', 'fever', 'fluid', 'heart size', 'hyperinflation', 'loss of height', 'lucency', 'marking', 'nodules', 'opacity', 'pleural effusion', 'pulmonary edema', 'pulmonary vascularity', 'shift', 'thickening', 'volume loss']

=== View Condition Analysis ===
Most common view conditions:
view
ap, pa, lateral                       97
erect, pa, ap, lateral                 1
supine, pa, ap, lateral                1
supine, recumbent, pa, ap, lateral     1
recumbent, pa, ap, lateral             1
supine, erect, pa, ap, lateral         1
recumbent, erect, pa, ap, lateral      1
ap, lateral                            1
pa, lateral                            1
ap, pa                                 1
Name: count, dtype: int64

==

### Pathway Depth & Width Analysis
 
This code analyzes disease pathways by:
- Calculating depth (max expansion steps) and width (total leaf nodes)
- Providing statistical summaries (avg/max depth & width)
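To make the two definitions concrete, here is a small sketch on a dict-based graph. The function `depth_and_leaves` and the node names are hypothetical and deliberately abstract; it follows the same convention as the analysis in this section (a disease whose findings cannot be expanded further has depth 1), but it is a simplified stand-in rather than the notebook's implementation.

```python
def depth_and_leaves(node, graph, path=()):
    """Return (depth, leaf set) for `node` in a dict-based expansion graph.

    graph maps an expandable node to the list of nodes it expands into.
    Nodes that are not keys of `graph` (or map to an empty list) are leaves.
    """
    if node in path:
        # Cycle: stop expanding and contribute no leaves.
        return 0, set()
    children = graph.get(node, [])
    if not children:
        return 0, {node}
    depth, leaves = 0, set()
    for child in children:
        child_depth, child_leaves = depth_and_leaves(child, graph, path + (node,))
        depth = max(depth, 1 + child_depth)  # one expansion step plus the deepest child
        leaves |= child_leaves
    return depth, leaves


# Hypothetical toy graph: disease_a expands into disease_b and a plain finding.
toy_graph = {
    "disease_a": ["disease_b", "finding_1"],
    "disease_b": ["finding_2", "finding_3"],
}
toy_depth, toy_leaves = depth_and_leaves("disease_a", toy_graph)
print(toy_depth, len(toy_leaves), sorted(toy_leaves))
```

Here the depth of `disease_a` is 2 (it expands to `disease_b`, which expands once more) and its width is 3, the number of distinct leaf findings.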

In [12]:
def calculate_depth_width(disease_name, processed_df, visited=None, current_depth=0):
    """
    Calculate the pathway depth and width for a disease.
    - depth: maximum number of pathway expansion steps
    - width: total number of final leaf nodes (returned as a set; width is its size)
    """
    if visited is None:
        visited = set()
    
    # Prevent circular references
    if disease_name in visited:
        return current_depth, set()
    
    visited.add(disease_name)
    
    # Extract direct observations for current disease
    disease_data = processed_df[processed_df['disease'] == disease_name]
    direct_obs = set()
    
    for i in range(1, 5):
        obs_col = f'observation {i}'
        if obs_col in disease_data.columns:
            obs_values = disease_data[obs_col].dropna().unique()
            direct_obs.update(obs_values)
    
    if not direct_obs:
        # If no more expansions possible, this is a leaf node
        visited.remove(disease_name)
        return current_depth, {disease_name}
    
    max_depth = current_depth + 1
    all_leaf_nodes = set()
    
    # Check if each observation can expand to other diseases
    for obs in direct_obs:
        # Check if obs matches other disease names
        matched_disease = None
        for other_disease in processed_df['disease'].unique():
            # Substring match in either direction (equality is covered by the substring test)
            if (obs.lower() in other_disease.lower() or
                other_disease.lower() in obs.lower()):
                matched_disease = other_disease
                break
        
        if matched_disease and matched_disease != disease_name:
            # Can expand to another disease - expand recursively
            sub_depth, sub_leaf_nodes = calculate_depth_width(
                matched_disease, processed_df, visited.copy(), current_depth + 1
            )
            max_depth = max(max_depth, sub_depth)
            all_leaf_nodes.update(sub_leaf_nodes)
        else:
            # No more expansions possible - this is a leaf node
            all_leaf_nodes.add(obs)
    
    visited.remove(disease_name)
    return max_depth, all_leaf_nodes

# Calculate depth and width for all diseases
print("=== Pathway Depth & Width Analysis ===")
disease_analysis = {}

for disease in processed_df['disease'].unique():
    max_depth, final_leaf_nodes = calculate_depth_width(disease, processed_df)
    
    disease_analysis[disease] = {
        'depth': max_depth,
        'width': len(final_leaf_nodes),
        'final_leaf_nodes': list(final_leaf_nodes)
    }

# Print results
print(f"{'Disease Name':<25} {'Depth':<8} {'Width':<8} {'Final Leaf Nodes'}")
print("-" * 80)

# Sort by width
sorted_by_width = sorted(disease_analysis.items(), key=lambda x: x[1]['width'], reverse=True)

for disease, analysis in sorted_by_width:
    leaf_nodes = ', '.join(analysis['final_leaf_nodes'][:5])  # Show first 5 only
    if len(analysis['final_leaf_nodes']) > 5:
        leaf_nodes += f" ... (+{len(analysis['final_leaf_nodes'])-5} more)"
    
    print(f"{disease:<25} {analysis['depth']:<8} {analysis['width']:<8} {leaf_nodes}")

# Statistical summary
depths = [analysis['depth'] for analysis in disease_analysis.values()]
widths = [analysis['width'] for analysis in disease_analysis.values()]

print(f"\n=== Statistical Summary ===")
print(f"Average Depth: {sum(depths)/len(depths):.2f}")
print(f"Maximum Depth: {max(depths)}")
print(f"Average Width: {sum(widths)/len(widths):.2f}")
print(f"Maximum Width: {max(widths)}")

# Width distribution
from collections import defaultdict
width_distribution = defaultdict(int)
for width in widths:
    width_distribution[width] += 1

print(f"\nWidth Distribution:")
for width in sorted(width_distribution.keys()):
    print(f"  Width {width}: {width_distribution[width]} diseases")

# Detailed analysis (top 5)
print(f"\n=== Detailed Pathway Expansion Analysis ===")
for disease, analysis in sorted_by_width[:5]:
    print(f"\n{disease} (Depth={analysis['depth']}, Width={analysis['width']}):")
    print(f"  Final Leaf Nodes: {', '.join(analysis['final_leaf_nodes'])}")

=== Pathway Depth & Width Analysis ===
Disease Name              Depth    Width    Final Leaf Nodes
--------------------------------------------------------------------------------
congestive heart failure  3        6        blunting, opacity, dyspnea, fluid, heart size ... (+1 more)
fracture                  1        5        disruption, opacity, deformity, loss of height, callus formation
copd                      2        5        pulmonary vascularity, hyperinflation, destruction, lucency, cough
pneumothorax              1        4        marking, shift, lucency, air
pulmonary edema           2        4        blunting, opacity, fluid, volume loss
pleural effusion          1        3        blunting, opacity, fluid
emphysema                 1        3        destruction, lucency, pulmonary vascularity
tuberculosis              1        3        architectural distortion, opacity, nodules
pneumonia                 2        3        opacity, volume loss, fever
bronchitis              

## Exploration of Lunguage++
1. Expanded Rows
2. Expandable Findings
3. Inferred Sub-findings


In [2]:
import pandas as pd
extended_lunguage = pd.read_csv("../data_resources/Lunguage++.csv")
# Count rows where 'sent' column is empty
empty_sent_count = extended_lunguage['sent'].isna().sum()
print(f"Total number of expanded rows: {empty_sent_count}")

Total number of expanded rows: 4761


- Distribution of definitive/tentative and positive/negative cases for the 14 diseases in the original data, plus probability statistics for the tentative cases
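As a rough sketch of how the two axes can be separated, the snippet below splits a two-letter status code into certainty and polarity and cross-tabulates them. It assumes that the codes `dp`, `dn`, `tp`, `tn` read as definitive/tentative followed by positive/negative, which is inferred from the description above rather than confirmed by the dataset documentation; the toy rows and the `CERTAINTY`/`POLARITY` dictionaries are hypothetical.

```python
import pandas as pd

# Assumed decoding of the two-letter status codes (an inference, not a documented spec).
CERTAINTY = {"d": "definitive", "t": "tentative"}
POLARITY = {"p": "positive", "n": "negative"}

# Hypothetical toy statuses, for illustration only.
toy_status = pd.DataFrame({"status": ["dp", "dn", "tp", "tp", "tn"]})
toy_status["certainty"] = toy_status["status"].str[0].map(CERTAINTY)
toy_status["polarity"] = toy_status["status"].str[1].map(POLARITY)

status_table = pd.crosstab(toy_status["certainty"], toy_status["polarity"])
print(status_table)
```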

In [12]:
representative_disease_mapping = {
    'congestive heart failure': [
        'congestive heart failure', 'chf', 'congestive failure', 'decompensated congestive heart failure'
    ],
    'pulmonary edema': [
        'pulmonary edema', 'edema', 'interstitial edema', 'interstitial pulmonary edema', 'reexpansion edema', 
        'alveolar edema', 'interstitial lung edema', 'reexpansion pulmonary edema', 'cardiogenic pulmonary edema', 
        'pulmonary interstitial edema', 'interstitial edema pattern', 'interstial edema', 'congestion/edema'
    ],
    'cardiomegaly': [
        'cardiomegaly'
    ],
    'copd': [
        'copd', 'chronic obstructive pulmonary disease', 'chronic small airways obstruction'
    ],
    'emphysema': [
        'emphysema', 'emphysematous changes', 'centrilobular emphysema', 
        'emphysematous change', 'paraseptal emphysema'
    ],
    'consolidation': [
        'consolidation', 'consolidations', 'pneumonic consolidation', 'consolidative process',
        'infectious consolidation', 'consolidated'
    ],
    'fracture': [
        'compression fractures', 'compression fracture', 'manubrium fracture', 
        'manubrial fracture'
    ],
    'pneumothorax': [
        'pneumothorax', 'hydropneumothorax', 'pneumothoraces', 'tension pneumothorax'
    ],
    'tuberculosis': [
        'tuberculosis', 'tb', 'nonactive tuberculosis'
    ],
    'pneumonia': [
        'pneumonia', 'post-obstructive pneumonia', 
        'postobstructive pneumonia', 'aspiration pneumonia', 'post obstructive pneumonia',
        'post-obstructive pneumonitis', 'lobar pneumonia', 'pneumocystis jiroveci pneumonia',
        'superimposed pneumonia', 'bronchial infection', 'atypical pneumonia',
        'community acquired lung infection', 'pna', 'aspiration pneumonitis',
        'infectious pneumonia', 'exogenous lipoid pneumonia', 'drug related pneumonitis',
        'organizing pneumonia'
    ],
    'atelectasis': [
        'atelectasis', 'atelectatic changes', 'atelectasis/scarring', 'atelectases',
        'atelectatic change', 'atelectasis/collapse'
    ],
    'lung cancer': [
        'malignancy', 'lung malignancy', 'lung cancer', 'primary lung cancer', 'non-small cell lung cancer'
    ],
    'bronchitis': [
        'bronchitis'
    ],
    'pleural effusion': [
        'pleural effusion', 'effusion', 'pleural effusions'
    ]
}

# Take an explicit copy so that adding a column below does not trigger SettingWithCopyWarning
original_rows = extended_lunguage[extended_lunguage['sent'].notna()].copy()

# Define mapping function
def map_entity_to_disease(entity):
    """Map an entity string to its representative disease, or 'other' if no term matches."""
    entity = str(entity).lower().strip()

    for disease, terms in representative_disease_mapping.items():
        # Exact (case-insensitive) match against the listed terms
        if entity in terms:
            return disease

    return 'other'
original_rows['mapped_disease'] = original_rows['ent'].apply(map_entity_to_disease)

# Check the status distribution for each mapped_disease
print("\n=== Distribution of mapped diseases by status in original rows ===")
print(f"Total rows: {len(original_rows)}")
print("\nMapped disease and status distribution:")
display(pd.crosstab(original_rows['mapped_disease'], original_rows['status']))

# Compute combined prob statistics for the tp and tn cases
tp_tn_stats = original_rows[original_rows['status'].isin(['tp', 'tn'])].groupby(['mapped_disease'])['prob'].agg(['mean', 'std', 'min', 'max']).round(3)

print("\nProbability statistics for TP and TN cases combined:")
display(tp_tn_stats)



=== Distribution of mapped diseases by status in original rows ===
Total rows: 14049

Mapped disease and status distribution:



status                      dn    dp  tn   tp
mapped_disease
atelectasis                 11   408   2  253
bronchitis                   0     1   0    1
cardiomegaly                 5   308   0    7
congestive heart failure    26    23   1    5
consolidation              316    90   3   74
copd                         0    18   0   12
emphysema                    0    40   0   10
fracture                     0    12   1    1
lung cancer                  1     6   0    7
other                     2094  6050  71  869



Probability statistics for TP and TN cases combined:


                           mean    std    min    max
mapped_disease
atelectasis               0.536  0.156  0.145  0.876
bronchitis                0.475    NaN  0.475  0.475
cardiomegaly              0.518  0.141  0.365  0.706
congestive heart failure  0.591  0.248  0.102  0.765
consolidation             0.447  0.182  0.113  0.808
copd                      0.537  0.050  0.453  0.594
emphysema                 0.518  0.134  0.314  0.660
fracture                  0.219  0.165  0.102  0.336
lung cancer               0.531  0.171  0.249  0.662
other                     0.436  0.179  0.102  0.892


- Distribution of expandable rows (probability > 0) among the original rows that were matched to a pathway
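One way to read the comparison is as a retention rate: of the rows matched to a pathway for a disease, what share has a probability above zero. The sketch below computes that on toy data; the column `retained_pct` and the toy rows are hypothetical, and only the column names `mapped_disease` and `prob` follow the notebook.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only.
toy_rows = pd.DataFrame({
    "mapped_disease": ["atelectasis", "atelectasis",
                       "pneumothorax", "pneumothorax", "pneumothorax"],
    "prob": [0.5, 0.7, 0.0, 0.0, 0.3],
})

all_counts = toy_rows["mapped_disease"].value_counts()
pos_counts = toy_rows.loc[toy_rows["prob"] > 0, "mapped_disease"].value_counts()

# Align the two counts on the disease index; diseases with no prob > 0 rows become 0.
summary = pd.concat(
    [all_counts.rename("All"), pos_counts.rename("Prob>0")], axis=1
).fillna(0).astype(int)
summary["retained_pct"] = (summary["Prob>0"] / summary["All"] * 100).round(1)
print(summary)
```

Filling missing counts with 0 avoids `NaN` entries for a disease that has pathway rows but none with `prob > 0`.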

In [9]:
# Keep original rows (non-empty 'sent') that have a pathway.
# Take an explicit copy so that adding a column below does not trigger SettingWithCopyWarning.
pathway_rows = extended_lunguage[(extended_lunguage['sent'].notna()) &
                                 (extended_lunguage['pathway'].notna())].copy()

# Map entities to diseases
pathway_rows['mapped_disease'] = pathway_rows['ent'].apply(map_entity_to_disease)
prob_rows = pathway_rows[pathway_rows['prob'] > 0]

# Get distribution for all rows
disease_distribution = pathway_rows['mapped_disease'].value_counts()
prob_disease_distribution = prob_rows['mapped_disease'].value_counts()

print("\n=== Distribution of diseases with valid pathways ===")
print(f"Total rows with pathways: {len(pathway_rows)}")
print(f"Total rows with prob > 0: {len(prob_rows)}")

print("\nDisease distribution:")
print("All rows vs Prob > 0 rows:")
comparison = pd.DataFrame({
    'All': disease_distribution,
    'Prob>0': prob_disease_distribution
})
print(comparison)

# Calculate percentage distribution
print("\nPercentage distribution:")
print("All rows vs Prob > 0 rows (%):")
all_pct = (disease_distribution / len(extended_lunguage[(extended_lunguage['sent'].notna())]) * 100).round(2)
prob_pct = (prob_disease_distribution / len(extended_lunguage[(extended_lunguage['sent'].notna())]) * 100).round(2)
pct_comparison = pd.DataFrame({
    'All %': all_pct,
    'Prob>0 %': prob_pct
})
print(pct_comparison)

# Print unmapped entities
unmapped = pathway_rows[pathway_rows['mapped_disease'] == 'other']['ent'].value_counts()
print("\nTop unmapped entities:")
print(unmapped.head(10))


=== Distribution of diseases with valid pathways ===
Total rows with pathways: 4965
Total rows with prob > 0: 2639

Disease distribution:
All rows vs Prob > 0 rows:
                           All  Prob>0
mapped_disease                        
atelectasis                674     663
bronchitis                   2       2
cardiomegaly               320     315
congestive heart failure    55      29
consolidation              483     167
copd                        30      30
emphysema                   50      50
fracture                    14      14
lung cancer                 14      13
pleural effusion          1351     683
pneumonia                  439     249
pneumothorax               908      54
pulmonary edema            618     366
tuberculosis                 7       4

Percentage distribution:
All rows vs Prob > 0 rows (%):
                          All %  Prob>0 %
mapped_disease                           
atelectasis                4.80      4.72
bronchitis                 


- Analyze which disease categories the final expanded entities came from
    - Check the distribution of disease categories for expanded entities (prob > 0)
    - Compare with the overall distribution to understand expansion patterns
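The counting step can also be sketched with an inverted lookup table, so each term is resolved with a single dictionary access instead of a scan over every disease. This is an alternative under stated assumptions, not the notebook's code: `toy_mapping` is a shortened, hypothetical stand-in for `representative_disease_mapping`, and `count_mapped` is a hypothetical helper. Note that if a term were listed under two diseases, the inverted table would keep the last one, whereas a first-match loop keeps the first.

```python
from collections import Counter

# Shortened, hypothetical stand-in for representative_disease_mapping.
toy_mapping = {
    "pleural effusion": ["pleural effusion", "effusion"],
    "pneumonia": ["pneumonia", "pna"],
}

# Invert once: term -> representative disease.
term_to_disease = {
    term: disease for disease, terms in toy_mapping.items() for term in terms
}


def count_mapped(values):
    """Count how many values map to each representative disease; unmapped values are skipped."""
    counts = Counter()
    for value in values:
        if value is None:
            continue
        disease = term_to_disease.get(str(value).lower().strip())
        if disease is not None:
            counts[disease] += 1
    return counts


toy_associates = ["Effusion", " pna ", "effusion", "scarring", None]
toy_counts = count_mapped(toy_associates)
print(toy_counts.most_common())
```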


In [10]:
# Get rows where sent is NaN and extract first element from associate column
nan_sent_rows = extended_lunguage[extended_lunguage['sent'].isna()]
first_associates = nan_sent_rows['associate'].str.split(',').str[0]

# Map each first associate to its representative disease and count occurrences
# (terms that match no disease are skipped)
mapped_associates = {}

for associate in first_associates:
    if pd.isna(associate):
        continue
    associate = str(associate).lower().strip()
    mapped = None
    for disease, terms in representative_disease_mapping.items():
        if associate in terms:
            mapped = disease
            break
    if mapped is not None:
        mapped_associates[mapped] = mapped_associates.get(mapped, 0) + 1

print("\n=== Distribution of mapped first associate values where sent is NaN ===") 
print(f"Total rows with NaN sent: {len(nan_sent_rows)}")
print("\nCounts by mapped disease:")
for disease, count in sorted(mapped_associates.items(), key=lambda x: x[1], reverse=True):
    print(f"{disease}: {count}")


=== Distribution of mapped first associate values where sent is NaN ===
Total rows with NaN sent: 4761

Counts by mapped disease:
atelectasis: 1215
pleural effusion: 915
consolidation: 801
pulmonary edema: 622
cardiomegaly: 340
pneumonia: 311
pneumothorax: 160
emphysema: 98
copd: 88
congestive heart failure: 79
fracture: 25
lung cancer: 12
bronchitis: 2
tuberculosis: 2
