# Kaggle Nuclei Featurization


In this notebook, we explore the 2018 Kaggle Data Science Bowl dataset and describe how we used it to build a nuclei featurizer.

## The Dataset

The dataset we used for our nuclei featurizer comes from the [2018 Kaggle Data Science Bowl](https://www.kaggle.com/c/data-science-bowl-2018). It contains a large number of segmented nuclei images. Each *ImageId* contains the original image (4 channels) along with its overall segmentation and a separate segmentation for each instance.

Since the first three channels of the original image are identical and the 4th is just the alpha channel, we use only the first channel to train our featurizer.
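The channel redundancy can be checked before the extra channels are dropped. The following is a minimal sketch, not code from the repository: the helper name `first_channel` is hypothetical, the input is a synthetic stand-in, and a channel-last `(H, W, 4)` layout is assumed.

```python
import numpy as np


def first_channel(image):
    """Return channel 0 of an (H, W, 4) image after checking the redundancy."""
    if image.ndim != 3 or image.shape[-1] != 4:
        raise ValueError(f"expected an (H, W, 4) image, got {image.shape}")
    # The three colour channels should carry exactly the same values.
    same = (np.array_equal(image[..., 0], image[..., 1])
            and np.array_equal(image[..., 0], image[..., 2]))
    if not same:
        raise ValueError("colour channels differ; dropping them would lose data")
    return image[..., 0]


# Synthetic stand-in for a dataset image (assumption): grey values copied
# into three channels plus a constant alpha channel.
rng = np.random.default_rng(0)
grey = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
rgba = np.stack([grey, grey, grey, np.full_like(grey, 255)], axis=-1)
single = first_channel(rgba)
```

Raising an error when the channels differ keeps the shortcut from silently discarding information on images that do not follow the pattern.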

## Imports

In [11]:
%gui qt5
import napari
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Example from the dataset

Notice that the first 3 channels are identical, and the 4th channel is blank.

In [6]:
image = np.load("nuclei_example.npy")
viewer = napari.view(image)

## Data Processing

After downloading the dataset from Kaggle, we store all the training examples in a single HDF5 file. To simplify batch training, we kept only the images with the most common size (256x256x4), which is 334 of the 634 images.

The code we used to generate this file can be found [here](https://github.com/marshuang80/BioSegmentation/blob/master/process_data/create_hdf5.py).

You can specify the input folder of your data and your desired output directory as follows:

```
python create_hdf5.py --input_dir "/home/user/kaggle-dsbowl-2018-dataset-fixes/stage1_train/" \
                      --output_dir "/home/user/data/"
```
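The size filter described above can be sketched as follows. This is an illustration under stated assumptions rather than the logic of `create_hdf5.py`: the function name `keep_majority_shape` is hypothetical, the images are synthetic, and writing the stacked array to HDF5 is left out.

```python
from collections import Counter

import numpy as np


def keep_majority_shape(images):
    """Keep only the images whose shape is the most common one."""
    shape_counts = Counter(img.shape for img in images)
    majority_shape, _ = shape_counts.most_common(1)[0]
    kept = [img for img in images if img.shape == majority_shape]
    # Equal shapes allow stacking into one (N, H, W, C) array, which is
    # the layout a single array on disk or a training batch needs.
    return np.stack(kept), majority_shape


# Synthetic stand-ins (assumption): two images of the common size, one odd one.
images = [
    np.zeros((256, 256, 4), dtype=np.uint8),
    np.zeros((256, 320, 4), dtype=np.uint8),
    np.zeros((256, 256, 4), dtype=np.uint8),
]
stacked, shape = keep_majority_shape(images)
```

Keeping a single shape trades away part of the data in exchange for batches that can be stacked without padding or resizing.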

After the HDF5 file is generated, we implemented a PyTorch Dataset that prepares these inputs for the model. The code can be found [here](https://github.com/marshuang80/BioSegmentation/blob/master/dataset/dataset.py).
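To give a rough idea of what such a dataset class does, here is a simplified sketch. It is not the class from the repository: the name `NucleiArrays` is hypothetical, it reads from in-memory NumPy arrays instead of the HDF5 file, it does not subclass the PyTorch `Dataset`, and the scaling of 8-bit values to `[0, 1]` is an assumption. It only implements `__len__` and `__getitem__`, which is the interface a map-style dataset exposes.

```python
import numpy as np


class NucleiArrays:
    """Map-style dataset over in-memory arrays (hypothetical name)."""

    def __init__(self, images, masks):
        if len(images) != len(masks):
            raise ValueError("images and masks must have the same length")
        self.images = images  # (N, H, W, 4), uint8
        self.masks = masks    # (N, H, W), non-zero where there is a nucleus

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        # Keep only the first channel and add a leading channel axis,
        # giving the (1, H, W) layout that convolutional layers expect.
        image = self.images[index, ..., 0].astype(np.float32)[None] / 255.0
        mask = (self.masks[index] > 0).astype(np.float32)[None]
        return image, mask


# Synthetic stand-in data (assumption).
images = np.full((3, 256, 256, 4), 255, dtype=np.uint8)
masks = np.zeros((3, 256, 256), dtype=np.uint8)
masks[:, 100:120, 100:120] = 1
dataset = NucleiArrays(images, masks)
image, mask = dataset[0]
```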

## Training

To train our featurizer, we implemented a simple [UNet model](https://arxiv.org/pdf/1505.04597).

![](./figs/unet.png)

The implementation code can be found [here](https://github.com/marshuang80/BioSegmentation/blob/master/model/unet.py).

As mentioned above, only the first channel of the input images was used during training, since the other channels are redundant. Because we are training for object segmentation (nuclei versus background) rather than instance segmentation, only the whole-image mask is used as the network target.

| Input | Target | 
| --- | --- |
| ![](figs/input.png) |  ![](figs/target.png) |
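The dataset already provides the overall mask, so no merging is needed in practice. Still, the relationship between the two kinds of mask can be sketched as below, assuming the whole-image mask is simply the union of the per-nucleus masks. The function name `merge_instance_masks` and the toy arrays are hypothetical.

```python
import numpy as np


def merge_instance_masks(instance_masks):
    """Combine per-nucleus masks (K, H, W) into one binary mask (H, W)."""
    instance_masks = np.asarray(instance_masks)
    # A pixel is foreground if any instance covers it.
    return (instance_masks > 0).any(axis=0).astype(np.uint8)


# Toy example (assumption): two non-overlapping nuclei on an 8x8 grid.
instances = np.zeros((2, 8, 8), dtype=np.uint8)
instances[0, 1:3, 1:3] = 255   # first nucleus, 4 pixels
instances[1, 5:8, 5:8] = 255   # second nucleus, 9 pixels
target = merge_instance_masks(instances)
```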


To train the model, run **train.py** from [BioSegmentation](https://github.com/marshuang80/BioSegmentation) with the right parameters, for example: 

```
python train.py --num_kernel 8 \
                --kernel_size 3 \
                --lr 1e-3 \
                --epoch 200 \
                --train_data /home/user/Nuclei/train.hdf5 \
                --val_data /home/user/Nuclei/val.hdf5 \
                --save_dir ./ \
                --device cuda \
                --optimizer adam \
                --model unet \
                --shuffle False \
                --num_workers 16 \
                --vflip False \
                --hflip False \
                --zoom True \
                --rotate False \
                --batch_size 64 \
                --gpu_ids 0,1,2,3 \
                --experiment_name unet_k8_s3_adam
```

The saved model is called **UNet.pth**.

## Featurize

After the segmentation network is trained and saved, we can use it as a featurizer by removing its last layer, or simply by replacing the last layer with an identity function. The featurizing code can be found [here](https://github.com/transformify-plugins/segmentify/blob/master/segmentify/semantic/main.py).
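The idea of swapping the last layer for an identity function can be illustrated with a small PyTorch sketch. The `TinySegNet` class below is a hypothetical stand-in, not the UNet from the repository, and its weights are untrained; the attribute name `out` is likewise an assumption about how the final layer is exposed.

```python
import torch
from torch import nn


class TinySegNet(nn.Module):
    """Stand-in network: a feature extractor followed by a 1x1 output layer."""

    def __init__(self, num_kernel=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, num_kernel, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.out = nn.Conv2d(num_kernel, 1, kernel_size=1)

    def forward(self, x):
        return self.out(self.body(x))


model = TinySegNet(num_kernel=8)
model.out = nn.Identity()   # the last layer now passes the features through
model.eval()

with torch.no_grad():
    batch = torch.rand(1, 1, 64, 64)   # (N, C, H, W), single-channel input
    features = model(batch)            # (1, 8, 64, 64)

# Channel-last layout with one feature vector per pixel: (H, W, num_kernel).
feature_image = features[0].permute(1, 2, 0).numpy()
```

Replacing the layer rather than deleting it leaves the `forward` method untouched, which is why the identity variant is the simpler of the two options.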

The featurized image can be visualized with the following block:

In [15]:
nuclei_features = np.load("nuclei_features.npy")
nuclei_features = np.transpose(nuclei_features, (2,0,1))
viewer = napari.view(nuclei_features)

## Results

We use the following steps to evaluate our featurizers:
- Featurize all images (including train and test)
- Pick 20% of the pixels from the training data
- Train a Random Forest Classifier using the 20% selected pixels
- Predict binary segmentation on test set using trained RFC
- Remove small islands
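The steps above can be sketched roughly as follows, using scikit-learn and SciPy. This is not the evaluation code behind the reported results: the helper names, the synthetic features, the number of trees, and the `min_size` threshold are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import ndimage
from sklearn.ensemble import RandomForestClassifier


def sample_pixels(features, mask, fraction, rng):
    """Pick a random fraction of pixels; features is (H, W, C), mask is (H, W)."""
    X = features.reshape(-1, features.shape[-1])
    y = mask.reshape(-1)
    n = int(fraction * len(y))
    idx = rng.choice(len(y), size=n, replace=False)
    return X[idx], y[idx]


def remove_small_islands(mask, min_size):
    """Drop connected foreground components smaller than min_size pixels."""
    labels, _ = ndimage.label(mask)
    sizes = np.bincount(labels.ravel())
    keep = sizes >= min_size
    keep[0] = False   # label 0 is the background
    return keep[labels]


rng = np.random.default_rng(0)


def make_example():
    """Synthetic featurized image: channel 0 correlates with the mask."""
    mask = np.zeros((64, 64), dtype=bool)
    mask[20:40, 20:40] = True
    features = rng.normal(size=(64, 64, 4))
    features[..., 0] += 3.0 * mask
    return features, mask


train_features, train_mask = make_example()
test_features, test_mask = make_example()

# Train on 20% of the training pixels, then predict every test pixel.
X, y = sample_pixels(train_features, train_mask, fraction=0.2, rng=rng)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

flat = test_features.reshape(-1, test_features.shape[-1])
prediction = clf.predict(flat).reshape(test_mask.shape)
cleaned = remove_small_islands(prediction, min_size=20)
```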

Using this nuclei featurizer, we obtain the results below. The x-axis shows the number of training examples used, and the y-axis shows the performance metric (IoU, precision).

![](./figs/nuclei_featurize.png)
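For reference, the two metrics can be computed on binary masks as sketched below. The exact definitions used for the plot are not stated here, so pixel-wise IoU and pixel-wise precision are assumed, and the return values chosen for empty masks are conventions picked for this sketch.

```python
import numpy as np


def iou(pred, truth):
    """Intersection over union of two binary masks."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0   # both masks empty: treated as perfect agreement
    return np.logical_and(pred, truth).sum() / union


def precision(pred, truth):
    """Fraction of predicted foreground pixels that are truly foreground."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    predicted = pred.sum()
    if predicted == 0:
        return 0.0   # nothing predicted: precision set to 0 by convention
    return np.logical_and(pred, truth).sum() / predicted


# Toy masks: 4 true pixels, 4 predicted pixels, 2 of them overlapping.
truth = np.zeros((4, 4), dtype=bool)
truth[:2, :2] = True
pred = np.zeros((4, 4), dtype=bool)
pred[:2, 1:3] = True

print(iou(pred, truth))        # 2 / 6
print(precision(pred, truth))  # 2 / 4
```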