<a href="https://colab.research.google.com/github/matjesg/deepflash2/blob/master/paper/challenge_data/preprocess_monuseg.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data preprocessing for the MoNuSeg 2018 Challenge Dataset


![Monuseg Logo](https://rumc-gcorg-p-public.s3.amazonaws.com/i/2020/02/22/Snip20200222_7.png)


From https://monuseg.grand-challenge.org/:
- The train dataset (images and annotations) can be downloaded from https://drive.google.com/file/d/1ZgqFJomqQGNnsx7w7QBzQQMVA16lbVCA/view

- The test dataset can be downloaded from https://drive.google.com/file/d/1NKkSQ5T0ZNQ8aUhh0a8Dt2YKYCQXIViw/view


**Reference:** Kumar, Neeraj, et al. "A multi-organ nucleus segmentation challenge." IEEE Transactions on Medical Imaging 39.5 (2019): 1380-1391.


## 1. Download Data

In [None]:
!gdown 1ZgqFJomqQGNnsx7w7QBzQQMVA16lbVCA -O train.zip
!mkdir train && unzip -ju train.zip -d train/images

!gdown 1NKkSQ5T0ZNQ8aUhh0a8Dt2YKYCQXIViw -O test.zip
!mkdir test && unzip -ju test.zip -d test/images
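
After extraction, each image in `train/images` and `test/images` should sit next to its XML annotation, because the conversion step pairs files by name. A minimal sketch of such a check, assuming the flattened layout produced by `unzip -j` above. The helper name `find_unpaired` is hypothetical and not part of deepflash2.

```python
from pathlib import Path


def find_unpaired(image_dir):
    """Return the sorted stems of .tif images that have no matching .xml file."""
    image_dir = Path(image_dir)
    # Ignore hidden files, as the conversion loop in this notebook does
    visible = [p for p in image_dir.iterdir() if not p.name.startswith('.')]
    tif_stems = {p.stem for p in visible if p.suffix == '.tif'}
    xml_stems = {p.stem for p in visible if p.suffix == '.xml'}
    return sorted(tif_stems - xml_stems)


for split in ['train', 'test']:
    folder = Path(split) / 'images'
    if folder.is_dir():
        print(split, 'images without annotation:', find_unpaired(folder))
```

An empty list for both splits suggests that every image has an annotation file with the same stem.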

## 2. Imports and functions 

In [None]:
# Install deepflash2 (provides preprocess_mask); imagecodecs is required to read tif files
!pip install -U git+https://github.com/matjesg/deepflash2.git@master

In [None]:
# Imports
from pathlib import Path
import xml.etree.ElementTree as ET
import cv2
import numpy as np
import imageio
from deepflash2.data import preprocess_mask

In [None]:
# Function to convert xml to mask
# Adapted from https://github.com/vqdang/hover_net/blob/d743e633ed59e588af6113cae185d4db589b4368/src/misc/proc_kumar_ann.py#L39
def xml_to_mask(xml_path, hw):
    xml = ET.parse(xml_path)

    insts_list = []
    for idx, region_xml in enumerate(xml.findall('.//Region')):
        vertices = []
        for vertex_xml in region_xml.findall('.//Vertex'):
            attrib = vertex_xml.attrib
            vertices.append([float(attrib['X']), 
                             float(attrib['Y'])])
        vertices = np.array(vertices) + 0.5
        vertices = vertices.astype('int32')
        contour_blb = np.zeros(hw, np.uint8)
        # fill both the inner area and contour with idx+1 color
        cv2.drawContours(contour_blb, [vertices], 0, idx+1, -1)
        insts_list.append(contour_blb)

    # instance size = number of filled pixels (each blob is filled with idx+1,
    # so summing the pixel values would not give the area)
    insts_size_list = [int(np.count_nonzero(inst)) for inst in insts_list]

    pair_insts_list = zip(insts_list, insts_size_list)
    # sort by size in ascending order so that larger instances are painted last (on top)
    pair_insts_list = sorted(pair_insts_list, key=lambda x: x[1])
    insts_list, insts_size_list = zip(*pair_insts_list)

    ann = np.zeros(hw, np.int32)
    for idx, inst_map in enumerate(insts_list):
        ann[inst_map > 0] = idx + 1

    return ann
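
`xml_to_mask` relies on the annotation layout: every `Region` element holds one nucleus outline as a sequence of `Vertex` elements with `X` and `Y` attributes. The sketch below illustrates only that parsing and rounding step on a tiny hand-written XML string. The sample document and the helper name `read_polygons` are made up for illustration and are much simpler than a real MoNuSeg annotation file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified annotation with two regions
SAMPLE_XML = """
<Annotations>
  <Annotation>
    <Regions>
      <Region Id="1">
        <Vertices>
          <Vertex X="1.2" Y="1.7"/>
          <Vertex X="5.5" Y="1.0"/>
          <Vertex X="5.0" Y="4.4"/>
        </Vertices>
      </Region>
      <Region Id="2">
        <Vertices>
          <Vertex X="10.0" Y="10.0"/>
          <Vertex X="12.6" Y="10.0"/>
          <Vertex X="12.6" Y="13.49"/>
        </Vertices>
      </Region>
    </Regions>
  </Annotation>
</Annotations>
"""


def read_polygons(root):
    """Collect one list of integer (x, y) vertices per Region element."""
    polygons = []
    for region in root.findall('.//Region'):
        polygon = []
        for vertex in region.findall('.//Vertex'):
            # Add 0.5 and truncate, mirroring the rounding in xml_to_mask
            x = int(float(vertex.attrib['X']) + 0.5)
            y = int(float(vertex.attrib['Y']) + 0.5)
            polygon.append((x, y))
        polygons.append(polygon)
    return polygons


polygons = read_polygons(ET.fromstring(SAMPLE_XML))
print(len(polygons), 'regions:', polygons)
```

In `xml_to_mask`, each of these polygons is then filled with `cv2.drawContours` to form one instance blob.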

## 3. Convert and save masks (instance labels)

In [None]:
for split in ['train', 'test']:
    # Annotation files sit next to the images; skip hidden files
    files = [x for x in (Path(split)/'images').iterdir() if x.suffix=='.xml' and not x.name.startswith('.')]
    outpath = Path(split)/'masks'
    outpath.mkdir(exist_ok=True)
    outpath2 = Path(split)/'masks_preprocessed'
    outpath2.mkdir(exist_ok=True)

    for f in files:
        # Read the image only to get its height and width
        img = imageio.imread(f.with_suffix('.tif'))
        hw = img.shape[:2]
        # Instance labels: 0 is background, positive values are nucleus ids
        labels = xml_to_mask(f, hw)
        imageio.imsave(outpath/f'{f.stem}_mask.tif', labels)

        # Two-class mask derived from the instance labels, saved with values 0 and 255
        msk = preprocess_mask(clabels=None, instlabels=labels, remove_connectivity=True, num_classes=2)
        imageio.imsave(outpath2/f'{f.stem}_mask.png', msk.astype('uint8')*255)
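
The loop writes two files per annotation: `<stem>_mask.tif` with the instance labels in `masks` and `<stem>_mask.png` in `masks_preprocessed`. A small sketch for checking that both outputs exist for every annotation, assuming the directory names used above. `missing_outputs` is a hypothetical helper, not part of deepflash2.

```python
from pathlib import Path


def missing_outputs(split_dir):
    """List the expected mask files that do not exist for a split directory."""
    split_dir = Path(split_dir)
    missing = []
    for xml_file in sorted((split_dir / 'images').glob('*.xml')):
        if xml_file.name.startswith('.'):
            continue  # hidden files are skipped by the conversion loop as well
        expected = [
            split_dir / 'masks' / f'{xml_file.stem}_mask.tif',
            split_dir / 'masks_preprocessed' / f'{xml_file.stem}_mask.png',
        ]
        missing.extend(p for p in expected if not p.exists())
    return missing


for split in ['train', 'test']:
    if (Path(split) / 'images').is_dir():
        print(split, 'missing outputs:', missing_outputs(split))
```

An empty list for a split means every annotation produced both mask files.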