<a href="https://colab.research.google.com/github/matjesg/deepflash2/blob/master/paper/3_performance_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# deepflash2 - Performance Comparison

> This notebook calculates the performance metrics for the methods in the deepflash2 [paper](https://arxiv.org/abs/2111.06693).

- **Data and results**: The data and results of the different methods are available on [Google Drive](https://drive.google.com/drive/folders/1r9AqP9qW9JThbMIvT0jhoA5mPxWEeIjs?usp=sharing). To use the data in Google Colab, create a [shortcut](https://support.google.com/drive/answer/9700156?hl=en&co=GENIE.Platform%3DDesktop) of the data folder in your personal Google Drive.

*Source files created with this notebook*
- `semantic_segmentation_results.csv`
- `instance_segmentation_results.csv`
- `instance_segmentation_results_agg.csv`

The segmentation results evaluated here can be reproduced using the `train-and-predict` notebooks on [GitHub](https://github.com/matjesg/deepflash2/paper).

*References*:

Griebel, M., Segebarth, D., Stein, N., Schukraft, N., Tovote, P., Blum, R., & Flath, C. M. (2021). Deep-learning in the bioimaging wild: Handling ambiguous data with deepflash2. arXiv preprint arXiv:2111.06693.


## Setup

- Install dependencies
- Connect to drive

In [None]:
!pip install -Uq deepflash2


In [None]:
# Imports
import imageio
import tifffile
import cv2
import pandas as pd
import numpy as np
from pathlib import Path
from fastprogress import progress_bar
from deepflash2.all import *
from deepflash2.data import _read_msk
from skimage.segmentation import relabel_sequential
check_cellpose_installation()

Installing cellpose. Please wait.


In [None]:
# Connect to drive
try:
  from google.colab import drive
  drive.mount('/gdrive')
except Exception:  # not running in Colab, or mounting failed
  print('Google Drive is not available.')

Mounted at /gdrive


## Settings

Settings shared by the semantic and instance segmentation evaluations.

In [None]:
DATASETS_SEMANTIC_SEG = ['PV_in_HC', 'cFOS_in_HC', 'mScarlet_in_PAG', 'YFP_in_CTX', 'GFAP_in_HC']
METHODS_SEMANTIC_SEG = ['otsu', 'cellpose', 'cellpose_single', 'cellpose_ensemble', 'unet_2019', 'nnunet', 'deepflash2']

DATASETS_INSTANCE_SEG = ['PV_in_HC', 'cFOS_in_HC', 'mScarlet_in_PAG', 'YFP_in_CTX']
METHODS_INSTANCE_SEG = ['cellpose', 'cellpose_single', 'cellpose_ensemble', 'unet_2019', 'nnunet', 'deepflash2']

# IoU thresholds 0.50, 0.55, ..., 0.95, as in the COCO evaluation:
# https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py
thresholds = np.linspace(.5, 0.95, int(np.round((0.95 - .5) / .05)) + 1, endpoint=True)

OUTPUT_PATH = Path("/content/")
DATA_PATH = Path('/gdrive/MyDrive/deepflash2-paper')

SUBDIR = 'test'

min_pixel_dict = {
    'PV_in_HC':61, 
    'cFOS_in_HC':30, 
    'mScarlet_in_PAG':385, 
    'YFP_in_CTX':193,
}

cellpose_dict = {
    'PV_in_HC':'cyto', 
    'cFOS_in_HC':'cyto2',
    'mScarlet_in_PAG':'cyto2',
    'YFP_in_CTX':'cyto',
    'GFAP_in_HC':'cyto2',
}

def repetition_mapper(x, method, dataset):
  'Returns correct subfolder for non-trainable methods'
  if  method=='otsu': x = 'default'
  if  method=='cellpose': x = cellpose_dict[dataset]
  return str(x)

def expert_comparison(df, metric):
  'Calculates expert comparison metrics on data frame'
  df['expert_comparison'] = 'in expert range'
  df.loc[df[metric]>df['expert_max'], 'expert_comparison'] = 'above best expert'
  df.loc[df[metric]<df['expert_min'], 'expert_comparison'] = 'below worst expert'
  return df

def clean_labels(label_msk, min_pixel):
  'Remove labeled areas smaller than `min_pixel` and relabel the remaining ones sequentially'
  # remove areas < min pixel
  unique, counts = np.unique(label_msk, return_counts=True)
  label_msk[np.isin(label_msk, unique[counts<min_pixel])] = 0

  # re-label image
  label_msk, _ , _ = relabel_sequential(label_msk, offset=1)

  return label_msk
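
To illustrate what `clean_labels` does, the following toy example removes objects smaller than a pixel threshold and renumbers the remaining ones. It is a NumPy-only sketch: the function name `clean_labels_sketch` and the toy array are assumptions for this example, and `np.unique` stands in for `relabel_sequential`.

```python
import numpy as np

def clean_labels_sketch(label_msk, min_pixel):
    """Drop labels with fewer than min_pixel pixels, then relabel as 1..n (hypothetical helper)."""
    label_msk = label_msk.copy()  # unlike the notebook version, do not modify the input
    ids, counts = np.unique(label_msk, return_counts=True)
    label_msk[np.isin(label_msk, ids[counts < min_pixel])] = 0

    # NumPy stand-in for relabel_sequential: map the remaining ids to consecutive integers
    remaining, inverse = np.unique(label_msk, return_inverse=True)
    relabeled = inverse.reshape(label_msk.shape)
    if remaining[0] != 0:
        # No background present: shift so that object labels start at 1
        relabeled = relabeled + 1
    return relabeled

toy_labels = np.array([[1, 1, 1, 0],
                       [0, 2, 0, 0],
                       [3, 3, 0, 0]])
cleaned = clean_labels_sketch(toy_labels, min_pixel=2)
print(cleaned)
```

With `min_pixel=2`, the single-pixel object 2 is removed and object 3 is renumbered to 2.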

## Metrics

We propose a two-step evaluation:

1. Calculation of performance metrics (method vs. estimated ground truth)
   - Dice score for semantic segmentation
   - Mean average precision for instance segmentation
   - Average precision at an IoU threshold of 0.50 for detection (supplement only)
2. Comparison to expert performance (against estimated ground truth)
   - Accounts for the ambiguity in the data

All results are calculated on the hold-out test sets.
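
For orientation, the Dice score computed in the semantic segmentation cell below can be sketched in a few lines of NumPy. This illustrates the standard definition 2|A∩B| / (|A| + |B|) and is not necessarily identical to the `dice_score` function imported from deepflash2; the helper name `dice_sketch`, the toy masks, and the handling of two empty masks are assumptions made for this example.

```python
import numpy as np

def dice_sketch(mask, pred):
    """Dice = 2*|A ∩ B| / (|A| + |B|) for two binary masks (hypothetical helper)."""
    mask = np.asarray(mask, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    total = mask.sum() + pred.sum()
    if total == 0:
        # Both masks empty: treated as perfect agreement (a convention assumed here)
        return 1.0
    return 2.0 * np.logical_and(mask, pred).sum() / total

toy_mask = np.array([[1, 1, 0],
                     [0, 1, 0]])
toy_pred = np.array([[1, 0, 0],
                     [0, 1, 1]])
score = dice_sketch(toy_mask, toy_pred)
print(round(score, 3))  # 2 shared pixels, 3 + 3 foreground pixels -> 0.667
```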

In [None]:
# Semantic segmentation
results_semantic = []
metric = 'dice_score'

for dataset in progress_bar(DATASETS_SEMANTIC_SEG):
  revised = '' if  dataset=='GFAP_in_HC' else '_revised'
  mask_dir = 'masks_STAPLE'+revised
  path = DATA_PATH/'data'/dataset/SUBDIR
  gt_path = path/mask_dir
  gt_masks_paths = [f for f in gt_path.iterdir()]

  df_exp = pd.read_csv(path/f'STAPLE_vs_experts{revised}.csv')
  df_exp['idx'] = df_exp['file'].str.split('_').str[0]
  df_exp = df_exp.groupby(['idx']).agg(expert_min=(metric, 'min'),
                                        expert_mean=(metric, 'mean'),
                                        expert_max=(metric, 'max'))

  for method in progress_bar(METHODS_SEMANTIC_SEG, leave=False):
    method_path = DATA_PATH/'results'/'semantic_segmentation'/dataset/method
    results_method = []
    
    for repetition in range(1,4):
      repetition_name = repetition_mapper(repetition, method, dataset)
      pred_path = method_path/repetition_name

      for f in gt_masks_paths:
        idx = f.stem.split('_')[0]
        msk = imageio.imread(f)//255
        pred = imageio.imread(pred_path/f'{idx}.png')//255

        # Calculate dice score
        ds = dice_score(msk, pred)

        tmp = pd.Series({
          'dataset': dataset,
          'method': method,
          'repetition': str(repetition),
          'repetition_name': repetition_name,
          'idx': idx,
           metric: ds,
          'uncertainty_score': None
          })   

        if method=='deepflash2' and repetition==1:
            # Load uncertainty scores
            df_unc = pd.read_csv(method_path/'1_uncertainty_scores.csv')
            df_unc['idx'] = df_unc.file.str[:-4]
            tmp['uncertainty_score'] = df_unc.loc[df_unc.idx==idx]['uncertainty_score'].values[0]
        
        results_method.append(tmp)

    # Relate to expert performance
    df_method = pd.DataFrame(results_method)
    df_method = df_method.set_index(['idx']).join(df_exp).reset_index()
    results_semantic.append(df_method)

df_semantic = pd.concat(results_semantic)
df_semantic = expert_comparison(df_semantic, metric)
df_semantic.to_csv(OUTPUT_PATH/'semantic_segmentation_results.csv', index=False)
df_semantic.tail()

| | idx | dataset | method | repetition | repetition_name | dice_score | uncertainty_score | expert_min | expert_mean | expert_max | expert_comparison |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 2378-2 | GFAP_in_HC | deepflash2 | 2 | 2 | 0.813112 | | 0.733295 | 0.801691 | 0.876338 | in expert range |
| 20 | 2378-2 | GFAP_in_HC | deepflash2 | 3 | 3 | 0.813892 | | 0.733295 | 0.801691 | 0.876338 | in expert range |
| 21 | 2378-3 | GFAP_in_HC | deepflash2 | 1 | 1 | 0.793719 | 0.316744 | 0.735275 | 0.79436 | 0.841553 | in expert range |
| 22 | 2378-3 | GFAP_in_HC | deepflash2 | 2 | 2 | 0.79545 | | 0.735275 | 0.79436 | 0.841553 | in expert range |
| 23 | 2378-3 | GFAP_in_HC | deepflash2 | 3 | 3 | 0.793684 | | 0.735275 | 0.79436 | 0.841553 | in expert range |


In [None]:
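# numeric_only=True skips string columns, which would raise an error in recent pandas versions
df_semantic.groupby(['dataset', 'method','repetition']).mean(numeric_only=True).round(3)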

| dataset | method | repetition | dice_score | expert_min | expert_mean | expert_max |
|---|---|---|---|---|---|---|
| GFAP_in_HC | cellpose | 1 | 0.167 | 0.792 | 0.835 | 0.874 |
| GFAP_in_HC | cellpose | 2 | 0.167 | 0.792 | 0.835 | 0.874 |
| GFAP_in_HC | cellpose | 3 | 0.167 | 0.792 | 0.835 | 0.874 |
| GFAP_in_HC | cellpose_ensemble | 1 | 0.536 | 0.792 | 0.835 | 0.874 |
| GFAP_in_HC | cellpose_ensemble | 2 | 0.521 | 0.792 | 0.835 | 0.874 |
| ... | ... | ... | ... | ... | ... | ... |
| mScarlet_in_PAG | otsu | 2 | 0.156 | 0.722 | 0.796 | 0.853 |
| mScarlet_in_PAG | otsu | 3 | 0.156 | 0.722 | 0.796 | 0.853 |
| mScarlet_in_PAG | unet_2019 | 1 | 0.748 | 0.722 | 0.796 | 0.853 |
| mScarlet_in_PAG | unet_2019 | 2 | 0.763 | 0.722 | 0.796 | 0.853 |
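
The next cell scores instance segmentation with average precision over the IoU thresholds defined in the settings. The sketch below shows the idea on a toy example, assuming average precision at each threshold is TP / (TP + FP + FN), the convention used by cellpose. It is not the implementation behind `get_instance_segmentation_metrics` and may differ from it in detail; the names `iou_matrix` and `average_precision_sketch` and the toy label images are hypothetical.

```python
import numpy as np

def iou_matrix(label_true, label_pred):
    """Pairwise IoU between all labeled objects (label 0 is background)."""
    true_ids = np.unique(label_true)
    true_ids = true_ids[true_ids > 0]
    pred_ids = np.unique(label_pred)
    pred_ids = pred_ids[pred_ids > 0]
    iou = np.zeros((len(true_ids), len(pred_ids)))
    for i, t in enumerate(true_ids):
        t_mask = label_true == t
        for j, p in enumerate(pred_ids):
            p_mask = label_pred == p
            inter = np.logical_and(t_mask, p_mask).sum()
            union = np.logical_or(t_mask, p_mask).sum()
            iou[i, j] = inter / union
    return iou

def average_precision_sketch(label_true, label_pred, thresholds):
    """AP = TP / (TP + FP + FN) per IoU threshold (assumed convention)."""
    iou = iou_matrix(label_true, label_pred)
    n_true, n_pred = iou.shape
    ap = np.zeros(len(thresholds))
    for k, th in enumerate(thresholds):
        # For thresholds >= 0.5, IoU > th holds for at most one partner per object,
        # so counting such pairs yields a one-to-one matching (strict '>' is a simplification)
        tp = int((iou > th).sum())
        fp = n_pred - tp
        fn = n_true - tp
        denom = tp + fp + fn
        ap[k] = tp / denom if denom > 0 else 0.0
    return ap

toy_true = np.array([[1, 1, 0, 0],
                     [1, 0, 0, 2],
                     [0, 0, 0, 2]])
toy_pred = np.array([[1, 1, 0, 0],
                     [0, 0, 0, 2],
                     [0, 3, 0, 2]])
toy_thresholds = np.linspace(0.5, 0.95, 10)
ap = average_precision_sketch(toy_true, toy_pred, toy_thresholds)
print(ap.mean(), ap[0])  # mean over thresholds, and AP at the lowest threshold (0.50)
```

In the toy example, object 1 is matched with IoU 2/3 and object 2 with IoU 1, while predicted object 3 has no counterpart. Averaging `ap` over all thresholds corresponds to the `mean_average_precision` column, and `ap[0]` to `average_precision_at_iou_50`.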


In [None]:
# Instance segmentation and detection
results_instance = []
results_instance_agg = []
metric = 'mean_average_precision'

for dataset in progress_bar(DATASETS_INSTANCE_SEG):
  revised = '_revised'
  mask_dir = 'masks_STAPLE'+revised
  path = DATA_PATH/'data'/dataset/SUBDIR
  gt_path = path/mask_dir
  gt_masks_paths = [f for f in gt_path.iterdir()]

  df_exp = pd.read_csv(path/f'STAPLE_vs_experts{revised}.csv')
  df_exp['idx'] = df_exp['file'].str.split('_').str[0]
  df_exp = df_exp.groupby(['idx']).agg(expert_min=(metric, 'min'),
                                        expert_mean=(metric, 'mean'),
                                        expert_max=(metric, 'max'))

  for method in progress_bar(METHODS_INSTANCE_SEG, leave=False):
    method_path = DATA_PATH/'results'/'instance_segmentation'/dataset/method
    results_method_agg = []
    
    for repetition in range(1,4):
      repetition_name = repetition_mapper(repetition, method, dataset)
      pred_path = method_path/repetition_name

      for f in gt_masks_paths:
        idx = f.stem.split('_')[0]

        # Load and clean gt mask
        msk = imageio.imread(f)//255
        _, label_msk = cv2.connectedComponents(msk.astype('uint8'), connectivity=4)
        label_msk = clean_labels(label_msk, min_pixel=min_pixel_dict[dataset])

        # Load and clean prediction
        label_pred = tifffile.imread(pred_path/f'{idx}.tif')
        label_pred = clean_labels(label_pred, min_pixel=min_pixel_dict[dataset])

        # Calculate instance segmentation metrics
        ap, tp, fp, fn = get_instance_segmentation_metrics(label_msk,
                                                           label_pred, 
                                                           is_binary=False, 
                                                           thresholds=thresholds,
                                                           )
        # Detailed results
        tmp = pd.DataFrame({
          'dataset': dataset,
          'method': method,
          'repetition': str(repetition),
          'repetition_name': repetition_name,
          'idx': idx,
          'threshold':thresholds,
          'average_precision':ap
          })   
        results_instance.append(tmp)

        # Aggregated results
        tmp_agg = pd.Series({
          'dataset': dataset,
          'method': method,
          'repetition': str(repetition),
          'repetition_name': repetition_name,
          'idx': idx,
           metric: ap.mean(),
          'average_precision_at_iou_50':ap[0]
          })   
        
        results_method_agg.append(tmp_agg)

    # Relate to expert performance
    df_method = pd.DataFrame(results_method_agg)
    df_method = df_method.set_index(['idx']).join(df_exp).reset_index()
    results_instance_agg.append(df_method)

df_instance = pd.concat(results_instance)
df_instance.to_csv(OUTPUT_PATH/'instance_segmentation_results.csv', index=False)
display(df_instance.tail())

# Concat and save aggregated results
df_instance_agg = pd.concat(results_instance_agg)
df_instance_agg = expert_comparison(df_instance_agg, metric)
df_instance_agg.to_csv(OUTPUT_PATH/'instance_segmentation_results_agg.csv', index=False)
df_instance_agg.tail()

creating new log file


| | dataset | method | repetition | repetition_name | idx | threshold | average_precision |
|---|---|---|---|---|---|---|---|
| 5 | YFP_in_CTX | deepflash2 | 3 | 3 | 2349 | 0.75 | 0.357143 |
| 6 | YFP_in_CTX | deepflash2 | 3 | 3 | 2349 | 0.8 | 0.30625 |
| 7 | YFP_in_CTX | deepflash2 | 3 | 3 | 2349 | 0.85 | 0.215116 |
| 8 | YFP_in_CTX | deepflash2 | 3 | 3 | 2349 | 0.9 | 0.060914 |
| 9 | YFP_in_CTX | deepflash2 | 3 | 3 | 2349 | 0.95 | 0.0 |


| | idx | dataset | method | repetition | repetition_name | mean_average_precision | average_precision_at_iou_50 | expert_min | expert_mean | expert_max | expert_comparison |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 2347 | YFP_in_CTX | deepflash2 | 2 | 2 | 0.590798 | 0.863014 | 0.482786 | 0.543476 | 0.586813 | above best expert |
| 20 | 2347 | YFP_in_CTX | deepflash2 | 3 | 3 | 0.557446 | 0.851351 | 0.482786 | 0.543476 | 0.586813 | in expert range |
| 21 | 2349 | YFP_in_CTX | deepflash2 | 1 | 1 | 0.389174 | 0.721311 | 0.377525 | 0.497888 | 0.616582 | in expert range |
| 22 | 2349 | YFP_in_CTX | deepflash2 | 2 | 2 | 0.387817 | 0.719008 | 0.377525 | 0.497888 | 0.616582 | in expert range |
| 23 | 2349 | YFP_in_CTX | deepflash2 | 3 | 3 | 0.383695 | 0.699187 | 0.377525 | 0.497888 | 0.616582 | in expert range |
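
The `expert_comparison` column follows from comparing each score with the per-image expert range. A minimal sketch in plain Python, using two rows from the aggregated results shown above; the function name `classify_against_experts` is hypothetical and only mirrors the logic of `expert_comparison` for a single value.

```python
def classify_against_experts(score, expert_min, expert_max):
    """Label one score relative to the expert range (hypothetical single-value helper)."""
    if score > expert_max:
        return 'above best expert'
    if score < expert_min:
        return 'below worst expert'
    return 'in expert range'

# Values taken from the aggregated results above (image 2347, deepflash2)
label_rep2 = classify_against_experts(0.590798, 0.482786, 0.586813)
label_rep3 = classify_against_experts(0.557446, 0.482786, 0.586813)
print(label_rep2)  # above best expert
print(label_rep3)  # in expert range
```

Scores exactly equal to the expert minimum or maximum count as within the expert range, because both comparisons are strict.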


In [None]:
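# numeric_only=True skips string columns, which would raise an error in recent pandas versions
df_instance_agg.groupby(['dataset', 'method','repetition']).mean(numeric_only=True).round(3)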

| dataset | method | repetition | mean_average_precision | average_precision_at_iou_50 | expert_min | expert_mean | expert_max |
|---|---|---|---|---|---|---|---|
| PV_in_HC | cellpose | 1 | 0.541 | 0.701 | 0.564 | 0.696 | 0.802 |
| PV_in_HC | cellpose | 2 | 0.541 | 0.701 | 0.564 | 0.696 | 0.802 |
| PV_in_HC | cellpose | 3 | 0.541 | 0.701 | 0.564 | 0.696 | 0.802 |
| PV_in_HC | cellpose_ensemble | 1 | 0.629 | 0.862 | 0.564 | 0.696 | 0.802 |
| PV_in_HC | cellpose_ensemble | 2 | 0.600 | 0.816 | 0.564 | 0.696 | 0.802 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| mScarlet_in_PAG | nnunet | 2 | 0.445 | 0.674 | 0.420 | 0.482 | 0.548 |
| mScarlet_in_PAG | nnunet | 3 | 0.442 | 0.668 | 0.420 | 0.482 | 0.548 |
| mScarlet_in_PAG | unet_2019 | 1 | 0.334 | 0.572 | 0.420 | 0.482 | 0.548 |
| mScarlet_in_PAG | unet_2019 | 2 | 0.341 | 0.577 | 0.420 | 0.482 | 0.548 |
