<a href="https://colab.research.google.com/github/matjesg/deepflash2/blob/master/paper/3-2_performance_challenge_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# deepflash2 - Performance comparison on challenge datasets

> This notebook calculates the performance metrics for the methods in the deepflash2 [paper](https://arxiv.org/abs/2111.06693).

- **Data and results**: The data and results of the different methods are available on [Google Drive](https://drive.google.com/drive/folders/1r9AqP9qW9JThbMIvT0jhoA5mPxWEeIjs?usp=sharing). To use the data in Google Colab, create a [shortcut](https://support.google.com/drive/answer/9700156?hl=en&co=GENIE.Platform%3DDesktop) of the data folder in your personal Google Drive.

*Source files created with this notebook*
- `semantic_segmentation_results_challenge.csv`
- `instance_segmentation_results_challenge.csv`
- `instance_segmentation_results_agg_challenge.csv`

The underlying segmentation results (predictions) can be reproduced using the `train-and-predict` notebooks on [GitHub](https://github.com/matjesg/deepflash2/paper).

*References*:

Griebel, M., Segebarth, D., Stein, N., Schukraft, N., Tovote, P., Blum, R., & Flath, C. M. (2021). Deep-learning in the bioimaging wild: Handling ambiguous data with deepflash2. arXiv preprint arXiv:2111.06693.


## Setup

- Install dependencies
- Connect to drive

In [None]:
!pip install -Uq deepflash2

In [None]:
# Imports
import imageio
import tifffile
import cv2
import pandas as pd
import numpy as np
from pathlib import Path
from fastprogress import progress_bar
from deepflash2.all import *
from deepflash2.data import _read_msk
from skimage.segmentation import relabel_sequential
check_cellpose_installation()

Installing cellpose. Please wait.


In [None]:
# Connect to drive
try:
  from google.colab import drive
  drive.mount('/gdrive')
except Exception:  # not running in Colab, or mounting failed
  print('Google Drive is not available.')

Mounted at /gdrive


## Settings

Settings for the semantic and instance segmentation evaluations.

In [None]:
DATASETS_SEMANTIC_SEG = ['gleason']
DATASETS_INSTANCE_SEG = ['monuseg', 'conic']
METHODS = [ 'deepflash2','nnunet']


# IoU thresholds 0.50, 0.55, ..., 0.95, following the COCO evaluation:
# https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py
thresholds = np.linspace(.5, 0.95, int(np.round((0.95 - .5) / .05)) + 1, endpoint=True)

OUTPUT_PATH = Path("/content/")
DATA_PATH = Path('/gdrive/MyDrive/deepflash2-paper')

SUBDIR = 'test'

# Datasets have different numbers of classes
num_classes_dict = {
    'conic':7,
    'gleason':4,
    'monuseg':2
}

# Not all datasets are based on ground truth estimations from multiple experts
mask_dir_dict = {
    'conic':'masks',
    'gleason':'masks_STAPLE',
    'monuseg':'masks'
}

# Minimum object size in px calculated from train set
min_pixel_dict = {
    'conic':{1: 3, 2: 3, 3: 3, 4: 3, 5: 6, 6: 3},
    'monuseg': {1: 38}
}

def clean_labels(label_msk, min_pixel):
  'Remove labeled areas smaller than `min_pixel` and relabel the rest sequentially'
  # remove areas with fewer than min_pixel pixels
  unique, counts = np.unique(label_msk, return_counts=True)
  label_msk[np.isin(label_msk, unique[counts<min_pixel])] = 0

  # re-label image
  label_msk, _ , _ = relabel_sequential(label_msk, offset=1)

  return label_msk
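
To make the two helpers from the settings cell concrete, the following NumPy-only sketch builds the same IoU thresholds and applies a size filter to a toy label mask. `remove_small_objects_sketch` and the toy array are hypothetical illustrations and not part of deepflash2. Unlike `clean_labels`, the sketch works on a copy, never removes the background label 0, and relabels with a plain loop instead of `relabel_sequential`.

```python
import numpy as np

# Same expression as in the settings cell: ten thresholds from 0.50 to 0.95
thresholds = np.linspace(.5, 0.95, int(np.round((0.95 - .5) / .05)) + 1, endpoint=True)
print(len(thresholds), thresholds)

def remove_small_objects_sketch(label_msk, min_pixel):
    """Hypothetical stand-in for clean_labels (assumption: background is 0)."""
    label_msk = label_msk.copy()  # leave the input untouched
    labels, counts = np.unique(label_msk, return_counts=True)
    # Labels with fewer than min_pixel pixels, excluding the background
    small = labels[(counts < min_pixel) & (labels != 0)]
    label_msk[np.isin(label_msk, small)] = 0
    # Relabel the remaining objects as 1, 2, 3, ...
    out = np.zeros_like(label_msk)
    remaining = np.unique(label_msk)
    remaining = remaining[remaining != 0]
    for new_label, old_label in enumerate(remaining, start=1):
        out[label_msk == old_label] = new_label
    return out

# Toy mask: object 1 has 4 px, object 2 has 1 px, object 3 has 3 px
toy = np.array([[1, 1, 0, 2],
                [1, 1, 0, 0],
                [0, 0, 3, 3],
                [0, 0, 3, 0]])
cleaned = remove_small_objects_sketch(toy, min_pixel=3)
print(cleaned)
```

With `min_pixel=3`, the single-pixel object is dropped and the former object 3 becomes object 2.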

## Metrics

We propose a two-step evaluation:

1. Calculation of performance metrics (method vs. estimated ground truth)
  - Dice score for semantic segmentation
  - Mean average precision for instance segmentation
  - Average precision at IoU_50 for detection (supplement only)
2. Comparison to expert performance (against estimated ground truth)
  - Accounts for the ambiguity in the data

All results are calculated on the hold-out test sets.
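
As a rough illustration of the Dice-based evaluation, the sketch below computes per-class Dice scores on a toy mask and averages them over the classes that occur in either the mask or the prediction. It assumes the standard definition 2·|A∩B| / (|A| + |B|); `dice_sketch` and `per_class_dice` are hypothetical helpers, and the `dice_score` function shipped with deepflash2 may be implemented differently.

```python
import numpy as np

def dice_sketch(a, b):
    """Dice coefficient 2*|A and B| / (|A| + |B|) for two boolean masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else np.nan

def per_class_dice(msk, pred, num_classes):
    """Score each class that is present in the mask or the prediction."""
    scores = {}
    for cl in range(num_classes):
        msk_bin, pred_bin = msk == cl, pred == cl
        if msk_bin.any() or pred_bin.any():
            scores[cl] = dice_sketch(msk_bin, pred_bin)
    return scores

# Toy example with classes 0 and 1 present; class 2 never occurs
msk = np.array([[0, 0, 1],
                [0, 1, 1]])
pred = np.array([[0, 1, 1],
                 [0, 1, 1]])

scores = per_class_dice(msk, pred, num_classes=3)
average = float(np.mean(list(scores.values())))
print(scores, average)
```

Class 2 is absent from both arrays, so it receives no score and does not enter the average.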

In [None]:
# Semantic segmentation
results_semantic = []
metric = 'dice_score'

for dataset in progress_bar(DATASETS_SEMANTIC_SEG):
  mask_dir = mask_dir_dict[dataset]
  path = DATA_PATH/'data'/dataset/SUBDIR
  gt_path = path/mask_dir
  gt_masks_paths = [f for f in gt_path.iterdir()]

  for method in progress_bar(METHODS, leave=False):
    method_path = DATA_PATH/'results'/'semantic_segmentation'/dataset/method
    results_method = []
    
    for repetition in range(1,4):
      repetition_name = str(repetition)
      pred_path = method_path/repetition_name

      for f in gt_masks_paths:
        idx = f.stem#.split('_')[0]
        if dataset=='gleason': idx = idx.replace('_classimg_nonconvex','')
        msk = imageio.imread(f)
        if dataset=='monuseg': msk = msk>0
        pred = imageio.imread(pred_path/f'{idx}.png')
        if pred.max()==255: pred = pred//255

        tmp = pd.Series({
          'dataset': dataset,
          'method': method,
          'repetition': str(repetition),
          'repetition_name': repetition_name,
          'idx': idx,
          'uncertainty_score': None
          })   
        
        # Calculate Dice scores
        scores = []
        for cl in range(0, num_classes_dict[dataset]):
            msk_bin = msk==cl
            pred_bin = pred==cl
            if np.any([msk_bin, pred_bin]):
                score = dice_score(msk_bin, pred_bin)
                tmp[f'dice_score_class{cl}'] = score
                scores.append(score)

        tmp['average_dice_score'] = np.mean(scores)

        if method=='deepflash2' and repetition==1:
            # Load uncertainty scores
            df_unc = pd.read_csv(method_path/f'1_uncertainty_scores.csv')
            df_unc['idx']=df_unc.file.str[:-4] 
            tmp['uncertainty_score'] = df_unc.loc[df_unc.idx==idx]['uncertainty_score'].values[0]
        
        results_method.append(tmp)

    # Collect the per-image results for this method
    df_method = pd.DataFrame(results_method)
    results_semantic.append(df_method)

df_semantic = pd.concat(results_semantic)
df_semantic.to_csv(OUTPUT_PATH/'semantic_segmentation_results_challenge.csv', index=False)
df_semantic.tail()

| | dataset | method | repetition | repetition_name | idx | uncertainty_score | dice_score_class0 | dice_score_class1 | dice_score_class2 | average_dice_score | dice_score_class3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | gleason | nnunet | 3 | 3 | slide007_core056 | | 0.894335 | 0.932365 | 0.0 | 0.6089 | |
| 143 | gleason | nnunet | 3 | 3 | slide005_core029 | | 0.698328 | 0.735006 | 0.0 | 0.477778 | |
| 144 | gleason | nnunet | 3 | 3 | slide001_core039 | | 0.940071 | 0.768881 | 0.954598 | 0.88785 | |
| 145 | gleason | nnunet | 3 | 3 | slide001_core011 | | 0.733722 | 0.19842 | 0.39246 | 0.441534 | |
| 146 | gleason | nnunet | 3 | 3 | slide001_core010 | | 0.943654 | 0.0 | 0.954248 | 0.474475 | 0.0 |


In [None]:
df_semantic.groupby(['dataset','method', 'repetition']).mean().round(3)

| dataset | method | repetition | uncertainty_score | dice_score_class0 | dice_score_class1 | dice_score_class2 | average_dice_score | dice_score_class3 |
|---|---|---|---|---|---|---|---|---|
| gleason | deepflash2 | 1 | 0.273 | 0.911 | 0.659 | 0.63 | 0.763 | 0.0 |
| gleason | deepflash2 | 2 | | 0.91 | 0.625 | 0.609 | 0.742 | 0.0 |
| gleason | deepflash2 | 3 | | 0.911 | 0.653 | 0.616 | 0.754 | 0.0 |
| gleason | nnunet | 1 | | 0.907 | 0.531 | 0.538 | 0.663 | 0.183 |
| gleason | nnunet | 2 | | 0.902 | 0.547 | 0.535 | 0.668 | 0.023 |
| gleason | nnunet | 3 | | 0.908 | 0.527 | 0.521 | 0.653 | 0.045 |
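
When reading the grouped means, note that a per-class Dice column is only filled for images where that class occurs in the mask or the prediction, so each column mean is taken over a different subset of images. The toy sketch below, with made-up values and hypothetical method names `a` and `b`, shows that a pandas group mean skips missing entries and that `count()` reveals how many images actually contributed.

```python
import numpy as np
import pandas as pd

# Made-up per-image scores; NaN marks images where the class does not occur
df = pd.DataFrame({
    "method": ["a", "a", "a", "b", "b"],
    "dice_score_class3": [np.nan, 0.0, 0.5, np.nan, np.nan],
})

grouped = df.groupby("method")["dice_score_class3"]
means = grouped.mean()    # NaN entries are skipped
counts = grouped.count()  # number of non-missing entries per group

print(pd.DataFrame({"mean": means, "n_images": counts}))
```

Reporting the count next to the mean makes it easier to judge how much weight a per-class average deserves.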


In [None]:
# Instance segmentation and detection
results_instance_agg = []

for dataset in progress_bar(DATASETS_INSTANCE_SEG):
  mask_dir = mask_dir_dict[dataset]
  path = DATA_PATH/'data'/dataset/SUBDIR
  gt_path = path/mask_dir
  gt_masks_paths = [f for f in gt_path.iterdir()]

  for method in progress_bar(METHODS, leave=False):
    method_path = DATA_PATH/'results'/'instance_segmentation'/dataset/method
    results_method_agg = []
    
    for repetition in range(1,4):
      repetition_name = str(repetition)
      pred_path = method_path/repetition_name

      for f in gt_masks_paths:
        idx = f.name if method=='deepflash2' else f.stem
        
        # Aggregated results
        tmp_agg = pd.Series({
          'dataset': dataset,
          'method': method,
          'repetition': str(repetition),
          'repetition_name': repetition_name,
          'idx': f.stem
          })   

        ap_list, ap50_list = [], []
        for cl in range(1, num_classes_dict[dataset]):
            iname = f'{idx}_class{cl}.tif'

            # Load and clean predicted mask
            label_pred = tifffile.imread(pred_path/iname)
            label_pred = clean_labels(label_pred, min_pixel=min_pixel_dict[dataset][cl])
            
            # Load and clean GT mask
            if dataset=='conic': 
                iname = f'{f.stem}.tif_class{cl}.tif'
                msk = imageio.imread(DATA_PATH/'data'/dataset/'masks_by_class'/iname)
            else:
                msk = imageio.imread(f)
            label_msk = clean_labels(msk, min_pixel=min_pixel_dict[dataset][cl])
            if np.any([label_msk>0, label_pred>0]):
                # Calculate instance segmentation metrics
                ap, tp, fp, fn = get_instance_segmentation_metrics(label_msk,
                                                                label_pred, 
                                                                is_binary=False, 
                                                                thresholds=thresholds,
                                                                )
                tmp_agg[f'AP50_class{cl}'] = ap[0]
                tmp_agg[f'mAP_class{cl}'] = ap.mean()
                ap_list.append(ap.mean())
                ap50_list.append(ap[0])
                
        tmp_agg[f'average_AP50'] = np.mean(ap50_list)
        tmp_agg[f'average_mAP'] =  np.mean(ap_list)
        
        results_method_agg.append(tmp_agg)

    # Collect the per-image results for this method
    df_method = pd.DataFrame(results_method_agg)
    results_instance_agg.append(df_method)

# Concat and save aggregated results
df_instance_agg = pd.concat(results_instance_agg)
df_instance_agg.to_csv(OUTPUT_PATH/'instance_segmentation_results_agg_challenge.csv', index=False)
df_instance_agg.tail()

| | dataset | method | repetition | repetition_name | idx | AP50_class1 | mAP_class1 | average_AP50 | average_mAP | AP50_class2 | mAP_class2 | AP50_class3 | mAP_class3 | AP50_class4 | mAP_class4 | AP50_class5 | mAP_class5 | AP50_class6 | mAP_class6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 139 | conic | nnunet | 3 | 3 | crag_44 | 0.17 | 0.102411 | 0.349483 | 0.229566 | 0.468072 | 0.268934 | 0.443439 | 0.31222 | 0.476923 | 0.356741 | 0.0 | 0.0 | 0.538462 | 0.337087 |
| 140 | conic | nnunet | 3 | 3 | crag_48 | 0.26506 | 0.13572 | 0.426228 | 0.257457 | 0.452493 | 0.246359 | 0.651934 | 0.445827 | 0.204545 | 0.155957 | 0.333333 | 0.147619 | 0.65 | 0.413262 |
| 141 | conic | nnunet | 3 | 3 | crag_54 | 0.0 | 0.0 | 0.153867 | 0.094837 | 0.445095 | 0.27137 | 0.114754 | 0.075036 | 0.0 | 0.0 | | | 0.209486 | 0.127778 |
| 142 | conic | nnunet | 3 | 3 | crag_58 | 0.0 | 0.0 | 0.242787 | 0.154433 | 0.52901 | 0.310466 | 0.264151 | 0.191315 | 0.185185 | 0.169101 | 0.0 | 0.0 | 0.478375 | 0.255716 |
| 143 | conic | nnunet | 3 | 3 | crag_64 | 0.083333 | 0.049333 | 0.333205 | 0.214954 | 0.337483 | 0.185672 | 0.643162 | 0.461449 | 0.259259 | 0.189919 | 0.111111 | 0.053749 | 0.564881 | 0.349601 |
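
The `AP50_class*` and `mAP_class*` columns are taken from `ap[0]` and `ap.mean()`, that is, the score at the first IoU threshold (0.5) and the mean over all ten thresholds. The sketch below shows one way such a threshold-wise score can be computed on toy label masks. It assumes that the score at a threshold is TP / (TP + FP + FN) with objects matched by IoU; this definition, the simplified matching, and the helpers `iou_matrix` and `average_precision_sketch` are assumptions for illustration, and `get_instance_segmentation_metrics` may match objects differently.

```python
import numpy as np

def iou_matrix(gt, pred):
    """Pairwise IoU between labeled objects (label 0 is background)."""
    gt_ids = np.unique(gt)[np.unique(gt) > 0]
    pred_ids = np.unique(pred)[np.unique(pred) > 0]
    ious = np.zeros((len(gt_ids), len(pred_ids)))
    for i, g in enumerate(gt_ids):
        for j, p in enumerate(pred_ids):
            inter = np.logical_and(gt == g, pred == p).sum()
            union = np.logical_or(gt == g, pred == p).sum()
            ious[i, j] = inter / union
    return ious

def average_precision_sketch(gt, pred, thresholds):
    """Assumed definition: TP / (TP + FP + FN) at each IoU threshold."""
    ious = iou_matrix(gt, pred)
    n_gt, n_pred = ious.shape
    ap = np.zeros(len(thresholds))
    for k, t in enumerate(thresholds):
        if n_gt and n_pred:
            # Simplified matching: count objects whose best IoU reaches t
            tp = min((ious.max(axis=1) >= t).sum(), (ious.max(axis=0) >= t).sum())
        else:
            tp = 0
        fp, fn = n_pred - tp, n_gt - tp
        denom = tp + fp + fn
        ap[k] = tp / denom if denom else 0.0
    return ap

# Toy masks: two ground truth objects, three predictions (one spurious)
gt = np.array([[1, 1, 1, 0, 0, 0, 0, 0],
               [2, 2, 2, 2, 2, 2, 2, 0]])
pred = np.array([[1, 1, 0, 0, 0, 0, 3, 3],
                 [2, 2, 2, 2, 2, 2, 0, 0]])

ap = average_precision_sketch(gt, pred, np.linspace(0.5, 0.95, 10))
print("AP50:", ap[0], "mAP:", ap.mean())
```

In this toy case the two true objects overlap their predictions with IoU 2/3 and 6/7, so the score drops stepwise as the threshold rises, which is why the mean over thresholds is lower than the value at 0.5.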


In [None]:
df_instance_agg.groupby(['dataset','method','repetition']).mean().round(3)

dataset,method,repetition,AP50_class1,mAP_class1,average_AP50,average_mAP,AP50_class2,mAP_class2,AP50_class3,mAP_class3,AP50_class4,mAP_class4,AP50_class5,mAP_class5,AP50_class6,mAP_class6
conic,deepflash2,1,0.112,0.064,0.45,0.272,0.665,0.352,0.618,0.433,0.36,0.236,0.256,0.121,0.563,0.338
conic,deepflash2,2,0.118,0.062,0.456,0.275,0.667,0.352,0.626,0.436,0.366,0.246,0.248,0.117,0.567,0.34
conic,deepflash2,3,0.113,0.061,0.452,0.275,0.67,0.354,0.633,0.44,0.352,0.24,0.234,0.111,0.572,0.345
conic,nnunet,1,0.088,0.056,0.391,0.254,0.455,0.252,0.594,0.43,0.343,0.241,0.194,0.114,0.515,0.327
conic,nnunet,2,0.096,0.062,0.396,0.258,0.456,0.253,0.592,0.43,0.348,0.247,0.211,0.121,0.521,0.331
conic,nnunet,3,0.091,0.05,0.39,0.253,0.456,0.253,0.592,0.43,0.364,0.262,0.179,0.102,0.519,0.326
monuseg,deepflash2,1,0.733,0.37,0.733,0.37,,,,,,,,,,
monuseg,deepflash2,2,0.742,0.382,0.742,0.382,,,,,,,,,,
monuseg,deepflash2,3,0.738,0.372,0.738,0.372,,,,,,,,,,
monuseg,nnunet,1,0.648,0.33,0.648,0.33,,,,,,,,,,
