# Quantification of Dataset Inconsistencies in MIMIC-CXR

## Notebook Setup

In [1]:
from collections import Counter

import pandas as pd
import numpy as np
from scipy.stats import entropy

## Load Data

In [2]:
df = pd.read_json('../data/processed/mimic-official/reports.train.json')
df.head()

,impression,findings,comparison,examination,history,technique,indication,id,findings+bg,chexpert_labels_findings,chexpert_labels_impression
0,Nonspecific retrocardiac and right middle lobe...,Lung volumes are low. Retrocardiac opacity wit...,None.,Chest radiograph,,Semi upright AP radiograph view of the chest.,_-year-old woman presenting with weakness. Eva...,50000014,Examination:\nChest radiograph\n\nIndication:\...,"{'No Finding': None, 'Enlarged Cardiomediastin...","{'No Finding': None, 'Enlarged Cardiomediastin..."
1,"Increased left lower lobe opacity, likely comb...",An opacity in the left lower lung base and ret...,_.,,,,,50000103,Comparison:\n_.\n\nFindings:\nAn opacity in th...,"{'No Finding': None, 'Enlarged Cardiomediastin...","{'No Finding': None, 'Enlarged Cardiomediastin..."
2,No change,Compared to the prior study the ET tube has be...,_,CHEST (PORTABLE AP),,Portable chest,_ y.o female with SCC and recent stent placeme...,50000173,Examination:\nCHEST (PORTABLE AP)\n\nIndicatio...,"{'No Finding': 1.0, 'Enlarged Cardiomediastinu...","{'No Finding': 1.0, 'Enlarged Cardiomediastinu..."
3,Basilar atelectasis with small pleural effusio...,The cardiomediastinal and hilar contours are n...,None.,,,,"Fever, sweats, abdominal pain, crackles at lun...",50000186,"Indication:\nFever, sweats, abdominal pain, cr...","{'No Finding': None, 'Enlarged Cardiomediastin...","{'No Finding': None, 'Enlarged Cardiomediastin..."
4,No acute cardiopulmonary abnormality.,Heart size is normal. The mediastinal and hila...,_,CHEST (PORTABLE AP),,Upright AP view of the chest,History: _F with shortness of breath,50000198,Examination:\nCHEST (PORTABLE AP)\n\nIndicatio...,"{'No Finding': 1.0, 'Enlarged Cardiomediastinu...","{'No Finding': 1.0, 'Enlarged Cardiomediastinu..."


## Number of Duplicates

In [3]:
N = len(df)

print('Without background\n' + '='*18)
duplicates = df['findings'].duplicated(keep=False) # keep=False marks all occurrences
N_dup = duplicates.sum()
N_impressions = df[duplicates]['impression'].nunique()

print(f'Total reports: {N:,}')
print(f'Duplicate findings: {N_dup:,} ({N_dup/N*100:.1f}%)')
print(f'Distinct impressions among duplicates: {N_impressions:,}')
print()

print('With background\n' + '='*18)
duplicates = df['findings+bg'].duplicated(keep=False) # keep=False marks all occurrences
N_dup = duplicates.sum()
N_impressions = df[duplicates]['impression'].nunique()
print(f'Total reports: {N:,}')
print(f'Duplicate findings: {N_dup:,} ({N_dup/N*100:.1f}%)')
print(f'Distinct impressions among duplicates: {N_impressions:,}')


Without background
Total reports: 122,500
Duplicate findings: 14,596 (11.9%)
Distinct impressions among duplicates: 1,036

With background
Total reports: 122,500
Duplicate findings: 797 (0.7%)
Distinct impressions among duplicates: 72
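
The `keep=False` argument matters for the counts above: it flags every member of a duplicate group, not only the repeats after the first occurrence. A minimal sketch on a toy Series (the values are made up for illustration and are not taken from MIMIC-CXR):

```python
import pandas as pd

# Toy data: "a" occurs three times, "b" once, "c" twice.
s = pd.Series(["a", "b", "a", "c", "a", "c"])

# Default keep='first' leaves the first occurrence of each value unmarked.
n_repeats = int(s.duplicated().sum())

# keep=False marks every row whose value occurs more than once.
n_in_groups = int(s.duplicated(keep=False).sum())

print(n_repeats, n_in_groups)
```

With `keep=False` the count is the number of reports that share their findings text with at least one other report, which is how `N_dup` is defined in the cell above.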


## Duplication Examples (Most Frequent and Highest Label Entropy)

In [4]:
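def label_entropy(samples):
    """Summarize how consistently one findings text maps to impressions."""
    n = len(samples)
    n_distinct = samples['impression'].nunique()

    vcs = samples['impression'].value_counts() # sorted by descending count
    e = entropy(vcs.values/n) # Shannon entropy (natural log)
    # normalize by the maximum entropy log(n_distinct); the epsilon avoids 0/0 when n_distinct == 1
    e_norm = e / (np.log(n_distinct) + 1e-10) # metric entropy (measure of randomness)
    return pd.Series({
        'n': n,
        'distinct': n_distinct,
        'e': e_norm,
        'counts': vcs.values,
        'impressions': vcs
    })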
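
To make the normalized entropy concrete, the following sketch recomputes it directly from count vectors. The helper name `normalized_entropy` is hypothetical and only mirrors the arithmetic in `label_entropy`; the count vectors are the kind that appear in the `counts` column.

```python
import numpy as np
from scipy.stats import entropy


def normalized_entropy(counts):
    """Shannon entropy of a count vector divided by log(number of categories)."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    # Same epsilon guard as label_entropy: a single category yields 0 / 1e-10 = 0.
    return entropy(p) / (np.log(len(counts)) + 1e-10)


balanced = normalized_entropy([14, 11])            # two impressions, nearly even split
skewed = normalized_entropy([118, 1, 1, 1, 1, 1])  # one dominant impression
single = normalized_entropy([5])                   # only one distinct impression

print(round(balanced, 2), round(skewed, 2), round(single, 2))
```

Values near 1 mean the same findings text is paired with several impressions at similar frequencies; values near 0 mean one impression dominates.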

In [5]:
top_findings = df['findings'].value_counts().head(100) # only calculate entropy for top 100 findings
df_filtered = df[df['findings'].isin(top_findings.index)]
df_stats = df_filtered.groupby('findings').apply(label_entropy)
df_stats['n_normalized'] = df_stats['n'].divide(len(df)).multiply(100).round(2)
df_stats['e'] = df_stats['e'].round(2)

In [6]:
print('Sorted by n')
display(df_stats.sort_values('n', ascending=False).head(10))

print('Sorted by e')
display(df_stats.sort_values('e', ascending=False).head(10))

Sorted by n


findings,n,distinct,e,counts,impressions,n_normalized
"PA and lateral views of the chest provided. There is no focal consolidation, effusion, or pneumothorax. The cardiomediastinal silhouette is normal. Imaged osseous structures are intact. No free air below the right hemidiaphragm is seen.",1141,26,0.12,"[1061, 45, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1,...",No acute intrathoracic process. ...,0.93
Heart size is normal. The mediastinal and hilar contours are normal. The pulmonary vasculature is normal. Lungs are clear. No pleural effusion or pneumothorax is seen. There are no acute osseous abnormalities.,1033,34,0.11,"[974, 24, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",No acute cardiopulmonary abnormality. ...,0.84
The lungs are clear without focal consolidation. No pleural effusion or pneumothorax is seen. The cardiac and mediastinal silhouettes are unremarkable.,753,47,0.2,"[665, 15, 8, 7, 4, 3, 3, 3, 2, 2, 2, 2, 2, 2, ...",No acute cardiopulmonary process. ...,0.61
Compared to the prior study there is no significant interval change.,461,16,0.15,"[430, 8, 5, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1...",No change. ...,0.38
The lungs are clear. The cardiomediastinal silhouette is within normal limits. No acute osseous abnormalities.,226,7,0.14,"[215, 3, 2, 2, 2, 1, 1]",No acute cardiopulmonary process. ...,0.18
The lungs are well expanded and clear. Cardiomediastinal and hilar contours are unremarkable. There is no pleural effusion or pneumothorax.,165,11,0.61,"[64, 58, 21, 13, 3, 1, 1, 1, 1, 1, 1]",Unremarkable chest radiographic examination. ...,0.13
"The heart size is normal. The hilar and mediastinal contours are within normal limits. There is no pneumothorax, focal consolidation, or pleural effusion.",139,8,0.18,"[129, 4, 1, 1, 1, 1, 1, 1]",No acute intrathoracic process. ...,0.11
Heart size is normal. The mediastinal and hilar contours are normal. The pulmonary vasculature is normal. Lungs are clear. No pleural effusion or pneumothorax is seen.,137,13,0.28,"[118, 6, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",No acute cardiopulmonary abnormality. ...,0.11
"The lungs are clear.The cardiac, hilar and mediastinal contours are normal.No pleural abnormality is seen.",123,6,0.13,"[118, 1, 1, 1, 1, 1]",No acute cardiopulmonary process. ...,0.1
Cardiomediastinal contours are normal. The lungs are clear. There is no pneumothorax or pleural effusion. The osseous structures are unremarkable,120,16,0.27,"[104, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...",No acute cardiopulmonary abnormalities ...,0.1


Sorted by e


findings,n,distinct,e,counts,impressions,n_normalized
The heart is normal in size. The mediastinal and hilar contours appear within normal limits. There is no pleural effusion or pneumothorax. The lungs appear clear. Bony structures appear within normal limits.,25,2,0.99,"[14, 11]",No evidence of acute cardiopulmonary disease. ...,0.02
The lungs are clear. There is no pneumothorax. The heart and mediastinum are within normal limits. Regional bones and soft tissues are unremarkable.,25,2,0.94,"[16, 9]",Clear lungs with no evidence of pneumonia. ...,0.02
The lungs are well expanded and clear. Hila and cardiomediastinal contours and pleural surfaces are normal.,23,15,0.92,"[6, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",Normal. No evidence of pneumonia. ...,0.02
The lungs are clear. Mediastinal and cardiac contours are normal. There is no pleural effusion or pneumothorax.,33,16,0.85,"[11, 4, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",There is no evidence of pneumonia. ...,0.03
Cardiomediastinal silhouette and hilar contours are normal. Lungs are clear. There is no pleural effusion or pneumothorax.,22,13,0.85,"[8, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","Normal chest radiograph; specifically, no evid...",0.02
"The lungs are clear without focal opacity, pulmonary edema, pleural effusion or pneumothorax. The cardiac and mediastinal contours are normal.",25,7,0.84,"[10, 6, 3, 2, 2, 1, 1]",No acute intrathoracic process. ...,0.02
"Frontal and lateral chest radiographs were obtained. The lungs are fully expanded and clear. The cardiomediastinal silhouette, hilar contours, and pleural surfaces are normal. There is no pleural effusion or pneumothorax.",23,13,0.81,"[10, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",No radiographic evidence for acute cardiopulmo...,0.02
"Frontal and lateral chest radiographs demonstrate a normal cardiomediastinal silhouette and well-aerated lungs which are clear. There is no focal consolidation, pleural effusion, or pneumothorax. The visualized upper abdomen is unremarkable.",26,4,0.79,"[14, 7, 4, 1]",No acute cardiopulmonary process. ...,0.02
"The lungs are clear. There is no evidence of pneumonia, pneumothorax, or pleural effusion. Cardiac silhouette is normal in size.",52,14,0.74,"[20, 9, 9, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]",Normal chest x-ray. ...,0.04
Lungs are fully expanded and clear. No pleural abnormalities. Heart size is normal. Cardiomediastinal and hilar silhouettes are normal.,39,11,0.74,"[14, 12, 3, 3, 1, 1, 1, 1, 1, 1, 1]",No acute cardiopulmonary abnormality. ...,0.03


Render as LaTeX (the examples may differ slightly from the tables above because the sort order can vary)
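
If the variation comes from ties in the sort key (an assumption; the notebook does not state the cause), adding a secondary key makes the order deterministic. A minimal sketch on made-up values, using `n` as a hypothetical tie-breaker:

```python
import pandas as pd

# Toy statistics table: two rows tie on 'e'.
toy = pd.DataFrame(
    {'e': [0.85, 0.92, 0.85], 'n': [22, 23, 33]},
    index=['finding A', 'finding B', 'finding C'],
)

# Sort by entropy first, then break ties by the number of reports.
ordered = toy.sort_values(['e', 'n'], ascending=[False, False])
print(list(ordered.index))
```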

In [7]:
def df_to_latex(stats):
    """Render each findings row and its five most frequent impressions as LaTeX table rows."""
    tex = ''
    for finding, row in stats.iterrows():
        n = row['n']
        n_normalized = row['n_normalized']
        distinct = row['distinct']
        e = row['e'] # named e to avoid shadowing scipy.stats.entropy
        impressions = list(row['impressions'].head(5).items())

        for i, (impression, count) in enumerate(impressions):
            impression = impression.replace('_', r'\_') # escape underscores for LaTeX

            if i == 0:
                # the first row carries the findings text and its summary statistics
                tex += f'{finding} & {n} & {n_normalized}\\% & {distinct} & {e} & {count} & {impression}\\\\' + '\n'
            else:
                tex += f'& & & & & {count} & {impression}\\\\' + '\n'
        tex += '\\addlinespace\n'
    return tex
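
The function above escapes only underscores, and only in the impressions. Report text could in principle contain other LaTeX special characters such as `%` or `&`. A minimal sketch of a more general escape, assuming a hypothetical helper `escape_latex` that is not part of the original notebook:

```python
# Hypothetical helper: maps each LaTeX special character to its escaped form.
LATEX_SPECIALS = {
    '\\': r'\textbackslash{}',
    '&': r'\&', '%': r'\%', '$': r'\$', '#': r'\#', '_': r'\_',
    '{': r'\{', '}': r'\}',
    '~': r'\textasciitilde{}', '^': r'\textasciicircum{}',
}


def escape_latex(text):
    # Translate character by character so inserted backslashes are never re-escaped.
    return ''.join(LATEX_SPECIALS.get(ch, ch) for ch in text)


print(escape_latex('50% of _ & x'))
```

Applying such a helper to both `finding` and `impression` before building the row would keep the generated table compilable for arbitrary report text.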

In [8]:
top = df_stats.sort_values('n', ascending=False).head(3)
print(df_to_latex(top))

PA and lateral views of the chest provided. There is no focal consolidation, effusion, or pneumothorax. The cardiomediastinal silhouette is normal. Imaged osseous structures are intact. No free air below the right hemidiaphragm is seen. & 1141 & 0.93\% & 26 & 0.12 & 1061 & No acute intrathoracic process.\\
& & & & & 45 & No acute intrathoracic process\\
& & & & & 3 & No evidence of pneumonia.\\
& & & & & 3 & No acute intrathoracic process. Specifically, no pneumothorax.\\
& & & & & 3 & No acute intrathoracic process. \_, MD\\
\addlinespace
Heart size is normal. The mediastinal and hilar contours are normal. The pulmonary vasculature is normal. Lungs are clear. No pleural effusion or pneumothorax is seen. There are no acute osseous abnormalities. & 1033 & 0.84\% & 34 & 0.11 & 974 & No acute cardiopulmonary abnormality.\\
& & & & & 24 & No evidence of pneumonia.\\
& & & & & 3 & No radiographic evidence of pneumonia.\\
& & & & & 2 & No acute cardiopulmonary abnormality. No displaced fract

In [9]:
top = df_stats.sort_values('e', ascending=False).head(3)
print(df_to_latex(top))

The heart is normal in size. The mediastinal and hilar contours appear within normal limits. There is no pleural effusion or pneumothorax. The lungs appear clear. Bony structures appear within normal limits. & 25 & 0.02\% & 2 & 0.99 & 14 & No evidence of acute cardiopulmonary disease.\\
& & & & & 11 & No evidence of acute disease.\\
\addlinespace
The lungs are clear. There is no pneumothorax. The heart and mediastinum are within normal limits. Regional bones and soft tissues are unremarkable. & 25 & 0.02\% & 2 & 0.94 & 16 & Clear lungs with no evidence of pneumonia.\\
& & & & & 9 & Clear lungs.\\
\addlinespace
The lungs are well expanded and clear. Hila and cardiomediastinal contours and pleural surfaces are normal. & 23 & 0.02\% & 15 & 0.92 & 6 & Normal. No evidence of pneumonia.\\
& & & & & 2 & Normal chest radiograph.\\
& & & & & 2 & No evidence of pneumonia.\\
& & & & & 2 & No pneumonia.\\
& & & & & 1 & Normal. No evidence of mass.\\
\addlinespace

