# Process Slicer Markup files

Description: 

This notebook processes the Slicer markup files that were generated by Joe & Daisuke from the raw Veolity outputs.

They removed non-nodules and added in nodules that were not picked up by Veolity.

These final LSUT nodule locations will need to be tied to the LSUT annotations file to add the additional nodule detail.

Some errors will also need resolving where this new nodule identification process disagrees with the original one carried out on LSUT.

<strong>Steps</strong>
1. Load the markup files into a dataframe
2. Compare the raw Veolity output with the adjusted markup files
3. Review metrics for Veolity
4. Merge in annotations file and assign characteristics to nodules where possible / check
5. Generate a spreadsheet with nodule data, including data entry capability to add in nodule-type and nodule-diameter-mm
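
Step 5 is not covered by the cells in this notebook. As a rough sketch of what the data-entry sheet could look like, assuming the two entry columns are named `nodule-type` and `nodule-diameter-mm` as in the step description, something like the following might do. The function name, the example row, and the output file name are hypothetical.

```python
import pandas as pd

# Column names taken from step 5; everything else here is illustrative
ENTRY_COLUMNS = ['nodule-type', 'nodule-diameter-mm']

def add_data_entry_columns(nodule_data):
    """Return a copy of the nodule table with empty columns for manual entry."""
    sheet = nodule_data.copy()
    for column in ENTRY_COLUMNS:
        if column not in sheet.columns:
            sheet[column] = pd.NA
    return sheet

example = pd.DataFrame({
    'patient_id': ['UCLH_00134949'],
    'label': ['F-1'],
    'Z': [-1449.6],
})
sheet = add_data_entry_columns(example)
# sheet.to_csv('nodule_data_entry.csv', index=False)  # hypothetical output file
```

Leaving the entry columns empty rather than pre-filling them keeps it obvious which rows still need a reader's input.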


Cases to exclude:
- UCLH_43663037 - too many nodules to annotate manually; Veolity identified 75 candidates, annotations.Total_no_nods: 25
- UCLH_45634500 - too many ground-glass nodules; Veolity identified only 10 and the annotations say 15, but in reality there were hundreds of non-solid nodules
- UCLH_59066126 - too many nodules to mark up and confirm them all
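
One way these exclusions could be applied is to filter on the patient ID before any metrics or merges are computed. This is a minimal sketch, not a step taken from the notebook: the helper name `drop_excluded` and the example rows are hypothetical, and only the three IDs come from the list above.

```python
import pandas as pd

# IDs copied from the exclusion list above
EXCLUDED_PATIENT_IDS = {'UCLH_43663037', 'UCLH_45634500', 'UCLH_59066126'}

def drop_excluded(data, id_column='patient_id'):
    """Drop every row that belongs to an excluded case."""
    keep = ~data[id_column].isin(EXCLUDED_PATIENT_IDS)
    return data[keep].reset_index(drop=True)

example = pd.DataFrame({
    'patient_id': ['UCLH_00134949', 'UCLH_43663037'],
    'label': ['F-0', 'F-0'],
})
kept = drop_excluded(example)
```

The same helper would work on the annotations table by passing `id_column='ScananonID'`.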



In [123]:
import json
import pandas as pd
from pathlib import Path

# 1-3. Load and combine markup files, compare, and generate metrics
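
The loader in the next cell reads `markups[0].controlPoints` and takes `label`, `position` and `orientation` from each control point. The sketch below shows that shape on an in-memory dictionary instead of a file. The two positions are taken from rows shown later in this notebook; the orientation values and the helper name `control_points_to_frame` are illustrative assumptions.

```python
import pandas as pd

# Minimal stand-in for the parts of a markup file that the loader uses
example_markup = {
    'markups': [{
        'controlPoints': [
            {'label': 'F-0', 'position': [28.125, -25.0, -1442.4],
             'orientation': [-1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 1.0]},
            {'label': 'F-1', 'position': [24.375, -15.625, -1449.6],
             'orientation': [-1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 1.0]},
        ]
    }]
}

def control_points_to_frame(markup, patient_id):
    """Flatten the first markup node into one row per control point."""
    rows = []
    for point in markup['markups'][0]['controlPoints']:
        x, y, z = point['position']
        rows.append({
            'patient_id': patient_id,
            'label': point['label'],
            'X': x, 'Y': y, 'Z': z,
            'orientation': point['orientation'],
        })
    return pd.DataFrame(rows)

frame = control_points_to_frame(example_markup, 'UCLH_00134949')
```

Only the first markup node is read, so a file holding several nodes would silently lose the rest.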

In [129]:

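def read_markup(file_path):
    """Read one Slicer markup JSON file into a dataframe of control points."""
    patient_id = Path(file_path).stem
    # Use a context manager so the file handle is closed after reading
    with open(file_path) as markup_file:
        markup_json = json.load(markup_file)

    # Only the first markup node in each file is read
    control_points_json = markup_json['markups'][0]['controlPoints']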

    control_points = []
    for control_point in control_points_json:
        control_points.append({
            'patient_id' : patient_id,
            'label' : control_point['label'],
            'X' : control_point['position'][0],
            'Y' : control_point['position'][1],
            'Z' : control_point['position'][2],
            'orientation' : control_point['orientation']
        })
    return pd.DataFrame(control_points)

all_patient_ids = [patient_id.stem for patient_id in Path('reader1').glob('*.json')] + [patient_id.stem for patient_id in Path('reader2').glob('*.json')]

print('Number of patients:', len(all_patient_ids))

reader1_original_markup_data = pd.concat([
    read_markup(original_markup_file)
    for original_markup_file in Path('reader1').glob('*.json')
])

reader2_original_markup_data = pd.concat([
    read_markup(original_markup_file)
    for original_markup_file in Path('reader2').glob('*.json')
])

original_markup_data = pd.concat([reader1_original_markup_data, reader2_original_markup_data]).reset_index(drop=True)

reader1_corrected_markup_data = pd.concat([
    read_markup(corrected_markup_file)
    for corrected_markup_file in Path('reader1/corrected').glob('*.json')
])

reader2_corrected_markup_data = pd.concat([
    read_markup(corrected_markup_file)
    for corrected_markup_file in Path('reader2/corrected').glob('*.json')
])

corrected_markup_data = pd.concat([reader1_corrected_markup_data, reader2_corrected_markup_data]).reset_index(drop=True)

scan_count = 0
tp_counts = []
fp_counts = []
fn_counts = []
for patient_id in all_patient_ids:

    original_patient_data = original_markup_data[original_markup_data.patient_id == patient_id]
    corrected_patient_data = corrected_markup_data[corrected_markup_data.patient_id == patient_id]
    scan_count += 1
    
    if original_patient_data.shape[0] > 0 or corrected_patient_data.shape[0] > 0:
        tp_cnt = original_patient_data.merge(corrected_patient_data, on=['label'], how='inner').shape[0]

        
        tp_counts.append(tp_cnt)
        fp_counts.append(original_patient_data.patient_id.count() - tp_cnt)
        fn_counts.append(corrected_patient_data.patient_id.count() - tp_cnt)

tp_counts = sum(tp_counts)
fp_counts = sum(fp_counts)
fn_counts = sum(fn_counts)

print('Scan count:', scan_count)
print('True positives:', tp_counts, 'False negatives:', fn_counts)
print('Sensitivity:', round(tp_counts / (tp_counts + fn_counts),1))
print('False positives:', fp_counts, 'False positive per scan rate:', round(fp_counts / scan_count,1))


Number of patients: 131
Scan count: 131
True positives: 459 False negatives: 106
Sensitivity: 0.8
False positives: 548 False positive per scan rate: 4.2
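
The counts above come from comparing labels per scan: a label present in both the original and corrected files is a true positive, one present only in the original is a false positive, and one present only in the corrected file is a false negative. The sketch below illustrates this on toy labels for a single scan, using an outer merge with `indicator=True` as a variant of the inner merge in the cell above, and then recomputes the two rates from the printed totals to more decimal places. The toy labels are hypothetical, and the approach assumes labels are unique within a scan and unchanged by the readers.

```python
import pandas as pd

# Toy labels for one scan (hypothetical): Veolity candidates vs reader-corrected markups
original = pd.DataFrame({'label': ['F-0', 'F-1', 'F-2']})
corrected = pd.DataFrame({'label': ['F-1', 'F-2', 'R-1']})

compared = original.merge(corrected, on='label', how='outer', indicator=True)
tp = int((compared['_merge'] == 'both').sum())        # kept by the readers
fp = int((compared['_merge'] == 'left_only').sum())   # removed by the readers
fn = int((compared['_merge'] == 'right_only').sum())  # added by the readers

# Rates recomputed from the totals printed above
sensitivity = 459 / (459 + 106)
fp_per_scan = 548 / 131

print(tp, fp, fn, round(sensitivity, 3), round(fp_per_scan, 2))
```

Rounding to one decimal place, as the cell above does, hides the difference between a sensitivity of 0.75 and 0.85, so reporting two or three decimals may be preferable.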


In [125]:
# 4. Load and merge annotations file

def pixel_to_real_world(offset, spacing, pixel_value):
    return round(offset + pixel_value * spacing, 2)

annotations = pd.read_csv('annotations.csv')
metaio_metadata = pd.read_csv('lung_metadata.csv').assign(scan_id=lambda x: x['scan_id'].str.replace('.mhd', '', regex=False))

annotations = pd.merge(
    metaio_metadata,
    annotations,
    left_on='scan_id',
    right_on='ScananonID',
    how='left'
)

annotations['Nod1_floc'] = annotations.apply(
    lambda row: row['slices'] - row['Nod1_loc'] if pd.notnull(row['Nod1_loc']) else None, axis=1
)

annotations['Nod2_floc'] = annotations.apply(
    lambda row: row['slices'] - row['Nod2_loc'] if pd.notnull(row['Nod2_loc']) else None, axis=1
)
    
annotations['Nod1_real_world'] = annotations.apply(
    lambda row: pixel_to_real_world(row['z-offset'], row['z-spacing'], row['Nod1_floc']) if pd.notnull(row['Nod1_floc']) else (None), axis=1
)

annotations['Nod2_real_world'] = annotations.apply(
    lambda row: pixel_to_real_world(row['z-offset'], row['z-spacing'], row['Nod2_floc']) if pd.notnull(row['Nod2_floc']) else (None), axis=1
)

nod1_recode = {
    'Nod1_diam' : 'Nod_diam',
    'Nod1_type' : 'Nod_type',
    'Nod1_type_other' : 'Nod_type_other',
    'Nod1_real_world' : 'Nod_real_world',
    'Nod1_pos' : 'Nod_pos',
    'Nod1_pos_other' : 'Nod_pos_other',
}

nod2_recode = {
    'Nod2_diam' : 'Nod_diam',
    'Nod2_type' : 'Nod_type',
    'Nod2_type_other' : 'Nod_type_other',
    'Nod2_real_world' : 'Nod_real_world',
    'Nod2_pos' : 'Nod_pos',
    'Nod2_pos_other' : 'Nod_pos_other',
}

nod1_data = annotations[['ScananonID', 'Total_no_nods'] + list(nod1_recode.keys())].rename(columns=nod1_recode).query('Nod_real_world.notnull()')
nod2_data = annotations[['ScananonID', 'Total_no_nods'] + list(nod2_recode.keys())].rename(columns=nod2_recode).query('Nod_real_world.notnull()')

nod_data = pd.concat([nod1_data, nod2_data]).reset_index(drop=True)

display(nod_data.head())

display(nod_data.Nod_type.value_counts())
display(nod_data.Nod_pos.value_counts())
display(nod_data.Nod_pos_other.value_counts())




Unnamed: 0,ScananonID,Total_no_nods,Nod_diam,Nod_type,Nod_type_other,Nod_real_world,Nod_pos,Nod_pos_other
0,UCLH_00134949,1.0,6.0,SN,,-1452.8,subpleural (<5mm from pleura),
1,UCLH_00239233,1.0,15.0,PSN,airspace,1786.1,parenchymal,
2,UCLH_07024905,10.0,22.0,SN,,1721.7,subpleural (<5mm from pleura),
3,UCLH_22801382,2.0,2.5,SN,,2118.1,parenchymal,
4,UCLH_23344772,1.0,6.0,SN,,1854.5,parenchymal,


SN       99
pGGN     25
PSN      18
Other     4
Name: Nod_type, dtype: int64

subpleural (<5mm from pleura)    70
parenchymal                      55
other                            20
Name: Nod_pos, dtype: int64

parenchymal             13
perifissural             2
pleural based            2
interfissural            1
central bronchogenic     1
parenchyma               1
Name: Nod_pos_other, dtype: int64
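
The Z coordinate of each annotated nodule is derived in two steps: the annotated slice number is flipped against the total slice count (`slices - Nod_loc`), and the flipped index is then scaled by the slice spacing and shifted by the Z offset. The flip presumably reflects that the annotated slice numbers count from the opposite end of the volume to the image array; that is an inference from the code, not something stated in this notebook. A worked example with hypothetical scan geometry:

```python
def flip_slice_index(total_slices, slice_index):
    """Count the slice from the other end of the volume."""
    return total_slices - slice_index

def pixel_to_real_world(offset, spacing, pixel_value):
    """Same formula as the cell above: offset + index * spacing."""
    return round(offset + pixel_value * spacing, 2)

# Hypothetical values, not taken from any scan in the dataset
total_slices = 300
annotated_slice = 36
z_offset = -1500.0
z_spacing = 0.5

flipped = flip_slice_index(total_slices, annotated_slice)
z_world = pixel_to_real_world(z_offset, z_spacing, flipped)
print(flipped, z_world)
```

With these numbers the flipped index is 264 and the resulting coordinate is -1500.0 + 264 * 0.5 = -1368.0.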

In [126]:
# Now match up the annotations with the corrected markup data

found = {idx : [] for idx in nod_data.index}
used = {mdx : None for mdx in corrected_markup_data.index}

for patient_id in corrected_markup_data.patient_id.unique():

    patient_annotation_data = nod_data[nod_data.ScananonID == patient_id]
    patient_markup_nodule_data = corrected_markup_data[corrected_markup_data.patient_id == patient_id]
    for idx, annotation_nodule in patient_annotation_data.iterrows():
        
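        for mdx, markup_nodule in patient_markup_nodule_data.iterrows():

            # Match on Z only: the markup must lie within 0.8 x the annotated diameter of the annotated Z position
            if abs(annotation_nodule['Nod_real_world'] - markup_nodule['Z']) <= annotation_nodule['Nod_diam'] * 0.8:
                found[idx].append(mdx)
                # A markup matched by several annotations keeps only the last annotation index
                used[mdx] = idx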

used_df = pd.DataFrame([(k, v) for k, v in used.items()], columns=['markup_idx', 'annotation_idx'])

lsut_nodule_data = (
    corrected_markup_data
    .merge(used_df, left_index=True, right_on='markup_idx', how='left')
    .merge(nod_data, left_on='annotation_idx', right_index=True, how='left')
    .drop(columns=['ScananonID','Total_no_nods'])
    .merge(annotations[['ScananonID','Total_no_nods']], left_on='patient_id', right_on='ScananonID', how='left')
    .filter(
        [
            'patient_id',
            'label',
            'X',
            'Y',
            'Z',
            'Total_no_nods',
            'orientation',
            'Nod_diam',
            'Nod_type',
            'Nod_type_other',
            'Nod_real_world',
            'Nod_pos',
            'Nod_pos_other'
        ]
    )
)

lsut_nodule_data.to_csv('lsut_nodule_data.csv', index=False)
lsut_nodule_data.head()

Unnamed: 0,patient_id,label,X,Y,Z,Total_no_nods,orientation,Nod_diam,Nod_type,Nod_type_other,Nod_real_world,Nod_pos,Nod_pos_other
0,UCLH_00134949,F-0,28.125,-25.0,-1442.4,1.0,"[-1.0, -0.0, -0.0, -0.0, -1.0, -0.0, 0.0, 0.0,...",,,,,,
1,UCLH_00134949,F-1,24.375,-15.625,-1449.6,1.0,"[-1.0, -0.0, -0.0, -0.0, -1.0, -0.0, 0.0, 0.0,...",6.0,SN,,-1452.8,subpleural (<5mm from pleura),
2,UCLH_00134949,F-2,24.375,-15.625,-1449.6,1.0,"[-1.0, -0.0, -0.0, -0.0, -1.0, -0.0, 0.0, 0.0,...",6.0,SN,,-1452.8,subpleural (<5mm from pleura),
3,UCLH_00134949,F-3,-70.0,61.25,-1566.4,1.0,"[-1.0, -0.0, -0.0, -0.0, -1.0, -0.0, 0.0, 0.0,...",,,,,,
4,UCLH_00134949,UCLH_00134949-2,-35.178335,17.050104,-1439.9,1.0,"[-1.0, -0.0, -0.0, -0.0, -1.0, -0.0, 0.0, 0.0,...",,,,,,
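
The matching rule can be checked by hand against the five rows shown above. The annotated nodule for UCLH_00134949 sits at Z = -1452.8 with a diameter of 6.0, so the tolerance is 6.0 * 0.8 = 4.8. The sketch below wraps the rule in a small helper (the name `z_matches` is hypothetical) and applies it to those five markups.

```python
def z_matches(annotation_z, markup_z, nodule_diameter, tolerance=0.8):
    """True when the markup lies within tolerance * diameter of the annotation along Z."""
    return abs(annotation_z - markup_z) <= nodule_diameter * tolerance

# Annotation and markup values copied from the outputs above
annotation_z, nodule_diameter = -1452.8, 6.0
markup_z_by_label = {
    'F-0': -1442.4,
    'F-1': -1449.6,
    'F-2': -1449.6,
    'F-3': -1566.4,
    'UCLH_00134949-2': -1439.9,
}

matched = [
    label for label, markup_z in markup_z_by_label.items()
    if z_matches(annotation_z, markup_z, nodule_diameter)
]
print(matched)
```

F-1 and F-2 are both 3.2 away from the annotation and match, while the other three are more than 10 away. Because X and Y are ignored, two markups at the same height both inherit the annotation, which is why F-1 and F-2 carry identical nodule details in the table.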


In [127]:

found_df = pd.DataFrame([(k, v) for k, v in found.items()], columns=['annotation_idx', 'markup_idx'])
nod_data.merge(found_df, left_index=True, right_on='annotation_idx', how='left')

Unnamed: 0,ScananonID,Total_no_nods,Nod_diam,Nod_type,Nod_type_other,Nod_real_world,Nod_pos,Nod_pos_other,annotation_idx,markup_idx
0,UCLH_00134949,1.0,6.0,SN,,-1452.8,subpleural (<5mm from pleura),,0,"[1, 2]"
1,UCLH_00239233,1.0,15.0,PSN,airspace,1786.1,parenchymal,,1,[430]
2,UCLH_07024905,10.0,22.0,SN,,1721.7,subpleural (<5mm from pleura),,2,"[490, 491, 497, 498]"
3,UCLH_22801382,2.0,2.5,SN,,2118.1,parenchymal,,3,[]
4,UCLH_23344772,1.0,6.0,SN,,1854.5,parenchymal,,4,"[391, 393]"
...,...,...,...,...,...,...,...,...,...,...
142,UCLH_92376642,2.0,3.0,pGGN,,-864.8,subpleural (<5mm from pleura),,142,[]
143,UCLH_99025861,2.0,4.0,SN,,-1004.6,other,perifissural,143,[]
144,UCLH_92436946,2.0,13.0,SN,,1765.8,subpleural (<5mm from pleura),,144,[]
145,UCLH_90527584,2.0,5.0,PSN,,1844.7,other,parenchymal,145,[]
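
Rows with an empty `markup_idx` list are annotations that found no markup, and rows with more than one index are ambiguous matches; both groups are candidates for manual review. A small sketch of how they could be tallied, using the first five rows above as example data (the helper name `match_category` is hypothetical):

```python
import pandas as pd

# First five rows of the table above
found_example = pd.DataFrame({
    'annotation_idx': [0, 1, 2, 3, 4],
    'markup_idx': [[1, 2], [430], [490, 491, 497, 498], [], [391, 393]],
})

def match_category(markup_indices):
    """Classify an annotation by how many markups it matched."""
    if len(markup_indices) == 0:
        return 'unmatched'
    return 'single match' if len(markup_indices) == 1 else 'multiple matches'

categories = found_example['markup_idx'].map(match_category)
summary = categories.value_counts().to_dict()
print(summary)
```

Run against the full `found_df`, the same tally would show how many of the 146 annotated nodules still need resolving.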


In [128]:
chk = lsut_nodule_data.patient_id.values

annotations[annotations.ScananonID.isin(chk)]

Unnamed: 0.1,Unnamed: 0,mhd_path,ObjectType,NDims,BinaryData,BinaryDataByteOrderMSB,CompressedData,CompressedDataSize,TransformMatrix,CenterOfRotation,...,feb_Path_T,feb_Path_N,feb_Path_M,feb_Path_PL,feb_Path_R,feb_Path_stage,Nod1_floc,Nod2_floc,Nod1_real_world,Nod2_real_world
0,0,/cluster/project0/lung-triage/lsut/LUNG/UCLH_0...,Image,3,True,False,True,107771947,[[1 0 0]\n [0 1 0]\n [0 0 1]],[0. 0. 0.],...,1a,0.0,0,0.0,0.0,Stage 1A,264.0,,-1452.8,
6,6,/cluster/project0/lung-triage/lsut/LUNG/UCLH_0...,Image,3,True,False,True,112077199,[[1 0 0]\n [0 1 0]\n [0 0 1]],[0. 0. 0.],...,,,,,,Unclassified,298.0,,1786.1,
7,7,/cluster/project0/lung-triage/lsut/LUNG/UCLH_0...,Image,3,True,False,True,139838181,[[1 0 0]\n [0 1 0]\n [0 0 1]],[0. 0. 0.],...,,,,,,Unclassified,271.0,254.0,1721.7,1708.1
13,13,/cluster/project0/lung-triage/lsut/LUNG/UCLH_0...,Image,3,True,False,True,126143019,[[1 0 0]\n [0 1 0]\n [0 0 1]],[0. 0. 0.],...,,,,,,,,,,
19,19,/cluster/project0/lung-triage/lsut/LUNG/UCLH_2...,Image,3,True,False,True,126519155,[[1 0 0]\n [0 1 0]\n [0 0 1]],[0. 0. 0.],...,,,,,,,252.0,252.0,2118.1,2118.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
593,593,/cluster/project0/lung-triage/lsut/LUNG/UCLH_7...,Image,3,True,False,True,137105011,[[1 0 0]\n [0 1 0]\n [0 0 1]],[0. 0. 0.],...,,,,,,Unclassified,,,,
608,608,/cluster/project0/lung-triage/lsut/LUNG/UCLH_8...,Image,3,True,False,True,116475204,[[1 0 0]\n [0 1 0]\n [0 0 1]],[0. 0. 0.],...,,,,,,,,,,
643,643,/cluster/project0/lung-triage/lsut/LUNG/UCLH_8...,Image,3,True,False,True,343093236,[[1 0 0]\n [0 1 0]\n [0 0 1]],[0. 0. 0.],...,1a,0.0,0,0.0,1.0,Stage 1A,393.0,,1815.4,
718,718,/cluster/project0/lung-triage/lsut/LUNG/UCLH_5...,Image,3,True,False,True,109473771,[[1 0 0]\n [0 1 0]\n [0 0 1]],[0. 0. 0.],...,,,,,,,155.0,,105.5,
