Cell Nuclei Segmentation in Microscopic Images
The overarching goal of this project is to create a computer model that can identify and segment out a range of nuclei across varied conditions. Identifying the cells' nuclei is the starting point for most analyses because the nucleus of a cell contains its full DNA. As a result, identifying nuclei allows researchers to identify each individual cell in a sample, and by measuring how cells react to various treatments, the researcher can understand the underlying biological processes at work. Consequently, automating the process of identifying nuclei will allow for more efficient drug testing, which can dramatically reduce the time that it takes for new drugs to come to market.
This task is made difficult by the diversity and limited size of the available training data. First, segmenting cell nuclei by hand is time consuming and difficult to do with high precision, which limits the number of annotated (i.e. labeled) examples. Second, there are a large number of cell types, image modalities, and magnification levels for which the proposed classifier must perform within parameters. Figure 1 shows several cell types, image modalities, and magnification levels present in the training set. Figure 2 shows the result of applying the ground truth segmentations (present in the training set) to highlight differences in the nuclei of different cell types.
Specifically, in this project, nuclei of 278 different cell types must be segmented across 3 different image modalities (Bright field, Stained/Histological, Fluorescence) and 12 different magnification levels. Given the small size of the dataset (670 images), several cell types, imaging modalities and magnification levels are dramatically underrepresented.
Since any single classifier will have a difficult time generalizing across such a diverse and biased dataset, two completely different approaches are explored and compared in this project: one based on semantic segmentation using U-Net [1] and another based on instance segmentation using Mask R-CNN [2]. This acts as a preliminary study that enables a methodology based on an ensemble or fusion of several different classifiers in future work. In the following, Section 2 (Data) discusses the details of the dataset; Section 3 (Methodology) describes the approaches taken in this project, including the theory behind the U-Net and Mask R-CNN models; Section 4 (Implementation) describes the implementation details of each model; Section 5 (Results) discusses the experimental results; Section 6 describes future work; Section 7 concludes the project; Section 8 contains the references; and Section 9 is the appendix.
2. Data Exploration
The data for this project comes from the 2018 Data Science Bowl and is split into 3 sets: one set of images along with their segmented masks (the labeled dataset), which is used for training, and two test sets that contain only images and no masks, which are used for validation and testing. The predictions of the trained models are submitted to Kaggle for validation on the first test set, and later for testing on the second test set. Only Kaggle has access to the segmented masks for the two test sets, which ensures that they are not used in the training process in any way.
All 3 datasets contain a large number of segmented nuclei images. The images were acquired under a variety of conditions and vary in the cell type, magnification, and imaging modality (Bright field, Stained/Histological, Fluorescence). The dataset is designed to challenge an algorithm's ability to generalize across these variations. More specifically, the training set contains 670 images, where each image contains 10-200 cells of the same type. Each image is represented by an associated ImageId. Files belonging to an image are contained in a folder with this ImageId.
Within this folder are two subfolders:
- Images directory contains the image file.
- Masks directory contains the segmented masks of each nucleus. This folder is only included in the training set. Each mask contains one nucleus. Masks are not allowed to overlap (no pixel belongs to two masks).
Figure 3 shows a single training example (image) from the training set along with its ground truth segmented mask. Please note that in the training dataset, we have a single mask for each cell nucleus (i.e. for each instance). In Figure 3, I have combined all the cells into a single mask to save space; in Figure 2, you can see the masks as they are in the dataset (one mask for each cell).
2.1 Clustering to detect image types
As you can see in Figure 2, there is a lot of diversity in the dataset. Some images are grayscale with a black background and nuclei in grayscale intensities, some images are in color, and some images seem to be black on white. As such, I performed k-means clustering on the pixel values (discarding the alpha channel and converting RGB to HSV) to detect different types of images based on the colors present in them (i.e. based on their dominant HSV color distributions). I ran k-means with several numbers of clusters but got the best clustering results using 3 clusters. This makes sense since there are 3 imaging modalities in the dataset, which implies at least 3 clusters in the dataset as well.
Running the clustering algorithm confirmed the assumption above. There are 3 clearly separated clusters corresponding to the 3 image modalities. In the training set, I found the following distribution:
- Cluster 1 Bright-field images: 2.4%
- Cluster 2 Fluorescent images: 81.5%
- Cluster 3 Histological images: 16.1%
Figure 4 shows some images taken using the Fluorescent imaging modality, Figure 5 shows some images taken using the Stained or Histological imaging modality, and Figure 6 shows some images taken using Bright-field imaging.
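The clustering step described in this section can be sketched as follows. This is a minimal illustration rather than the code used for the reported numbers: it assumes each image is summarized by its mean color converted to HSV (a simplification of "dominant HSV color distributions"), and the helper names `image_feature` and `kmeans` are hypothetical. Hue is treated as a plain number here, so its circular nature is ignored.

```python
import colorsys

import numpy as np


def image_feature(image):
    """Mean colour of an (H, W, 3 or 4) uint8 image, expressed in HSV."""
    rgb = image[..., :3].astype(np.float64) / 255.0  # drop the alpha channel if present
    r, g, b = rgb.reshape(-1, 3).mean(axis=0)
    return np.array(colorsys.rgb_to_hsv(r, g, b))


def kmeans(features, k=3, n_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-point initialisation."""
    centers = [features[0]]
    for _ in range(1, k):
        # Pick the sample that is farthest from all centres chosen so far.
        dist = np.min([np.linalg.norm(features - c, axis=1) for c in centers], axis=0)
        centers.append(features[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre; keep the old one if a cluster became empty.
        new_centers = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Given a list of loaded images, `kmeans(np.array([image_feature(im) for im in images]), k=3)` returns one cluster label per image; the fraction of images per label gives a breakdown like the one listed above.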
As far as the competition is concerned, the output of the overall system should be a binary mask where the pixels belonging to a cell nucleus are represented by 1 and background pixels are represented by 0. Consequently, at first glance, this might seem like a semantic segmentation problem (i.e. classifying each pixel as either background or a nucleus).
However, a valid submission requires that no two predicted masks for the same image overlap. In other words, as shown in Figure 8, for two overlapping nuclei, two individual masks must be returned. This implies that performing instance segmentation (i.e. multi-class classification of pixels) might outperform the models based on semantic segmentation in this competition (i.e. based on the evaluation metric specified below).
The following are the results of a preliminary comparison between semantic segmentation using U-Net, multi-class classification of each pixel using an improved U-Net, and instance segmentation using Mask R-CNN on a small portion of the training set (approximately 10% of examples selected at random). The evaluation metric is the mean average precision at different intersection over union (IoU) thresholds, as described in the next section.
- U-Net (single class, i.e. semantic segmentation) : 0.35
- Enhanced U-Net (multi-class with post processing) : 0.42
- Mask-RCNN (instance segmentation) : 0.50
In this project, two models are trained and compared in terms of their performance for this task: basic U-Net (semantic segmentation) and Mask R-CNN (instance segmentation). The problem is posed in two different frameworks (instance segmentation vs. semantic segmentation) in order to gauge the importance of each approach in the task of segmenting cell nuclei without any overlap, without differentiating between different instances.
In addition, in practice, multiple models might be used for this task in an ensemble method (such as non-max suppression) in order to take advantage of the independent errors made by each classifier. Basically, for nuclei that are not detected by Mask R-CNN, we can fall back on the semantic segmentation produced by U-Net. Also, since the results from different classifiers are cumulative, one can always convert an ensemble system to a single network using knowledge distillation (teacher-student learning) for practical applications.
3.1 Evaluation Metric
Experimental results are obtained by measuring the performance of the algorithm on the two test sets held by Kaggle. The prediction results of the algorithm on test sets are uploaded to Kaggle which compares them to the ground truth (segmented masks) available only to Kaggle and publishes the results.
The model's performance is evaluated on the mean average precision at different intersection over union (IoU) thresholds. The IoU of a proposed set of object pixels and a set of true object pixels is calculated as:
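In set notation, writing A for the proposed object pixels and B for the true object pixels (the symbols A and B are chosen here for illustration), this is the standard ratio of overlap to combined area:

```latex
\[
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]
```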
The metric sweeps over a range of IoU thresholds, at each point calculating an average precision value. The threshold values range from 0.5 to 0.95 with a step size of 0.05: (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95). In other words, at a threshold of 0.5, a predicted object is considered a "hit" if its intersection over union with a ground truth object is greater than 0.5.
At each threshold value t, a precision value is calculated based on the number of true positives (TP), false negatives (FN), and false positives (FP) resulting from comparing the predicted object to all ground truth objects:
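Assuming the usual definition for this competition, in which true negatives play no role because the background is not counted as an object, the precision at threshold t can be written as:

```latex
\[
\mathrm{Precision}(t) = \frac{TP(t)}{TP(t) + FP(t) + FN(t)}
\]
```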
A true positive is counted when a single predicted object matches a ground truth object with an IoU above the threshold. A false positive indicates a predicted object had no associated ground truth object. A false negative indicates a ground truth object had no associated predicted object. The average precision of a single image is then calculated as the mean of the above precision values at each IoU threshold:
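With T denoting the set of ten thresholds listed earlier (the symbol T is introduced here only for illustration), the per-image average precision described in this paragraph takes the form:

```latex
\[
\mathrm{AP} = \frac{1}{|T|} \sum_{t \in T} \frac{TP(t)}{TP(t) + FP(t) + FN(t)},
\qquad T = \{0.5, 0.55, \ldots, 0.95\}
\]
```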
Finally, the score returned by Kaggle is the mean taken over the individual average precisions of each image in the test dataset.
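The whole metric can be sketched in a few lines of NumPy. This is an illustrative re-implementation under stated assumptions, not Kaggle's scoring code: masks are passed as boolean arrays of shape (n, H, W), masks within one set do not overlap, and the function names are hypothetical.

```python
import numpy as np


def iou_matrix(true_masks, pred_masks):
    """IoU between every ground-truth mask (rows) and predicted mask (columns)."""
    def flatten(m):
        return m.reshape(m.shape[0], int(np.prod(m.shape[1:]))).astype(np.float64)

    t, p = flatten(true_masks), flatten(pred_masks)
    inter = t @ p.T
    union = t.sum(axis=1)[:, None] + p.sum(axis=1)[None, :] - inter
    return inter / np.maximum(union, 1.0)  # an empty union gives IoU 0


def image_average_precision(true_masks, pred_masks,
                            thresholds=np.linspace(0.5, 0.95, 10)):
    """Mean over IoU thresholds of TP / (TP + FP + FN) for a single image."""
    iou = iou_matrix(true_masks, pred_masks)
    precisions = []
    for t in thresholds:
        matches = iou > t
        tp = int(matches.any(axis=1).sum())     # ground-truth objects that were hit
        fn = iou.shape[0] - tp                  # ground-truth objects that were missed
        fp = int((~matches.any(axis=0)).sum())  # predictions matching nothing
        total = tp + fp + fn
        precisions.append(tp / total if total else 0.0)
    return float(np.mean(precisions))


def dataset_score(pairs):
    """Mean of the per-image average precisions over (true, pred) pairs."""
    return float(np.mean([image_average_precision(t, p) for t, p in pairs]))
```

Because every threshold is at least 0.5 and masks do not overlap, each ground-truth object can match at most one prediction, which is why counting rows and columns with any match is sufficient here.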
3.2 Semantic Segmentation using U-Net
U-Net [1] is used in this project to provide a baseline performance using a relatively simple model. In addition, it provides the means for comparing semantic segmentation with instance segmentation, as applied to this particular task. U-Net was chosen among several semantic segmentation methods because it won the Grand Challenge for Computer-Automated Detection of Caries in Bitewing Radiography at ISBI 2015, and it won the Cell Tracking Challenge at ISBI 2015 on the two most challenging transmitted light microscopy categories (phase contrast and DIC microscopy) by a large margin. The task and the datasets evaluated in these competitions are very similar to the task of segmenting cell nuclei and our given dataset, which means U-Net has a good chance of performing well in this application.
Similar to how Mask R-CNN is an extension of the R-CNN line of models for instance segmentation, U-Net can be thought of as an extension to the FCN (Fully Convolutional Network) line of models for semantic segmentation. Since FCNs do not use any fully connected layers, they have been heavily used for dense predictions because they are much faster compared to the patch classification approach. FCNs allow for segmentation maps to be generated for images of any size.
However, aside from the computational expense of fully connected layers (which FCNs aim to overcome), CNNs have a problem with pooling layers. Pooling layers increase the field of view and are able to aggregate the context while discarding the spatial information. However, semantic segmentation requires the exact alignment of class maps and thus needs the spatial information to be preserved. U-Net tries to solve this problem by utilizing an encoder-decoder architecture. The general idea is that the encoder gradually reduces the spatial dimension with pooling layers while the decoder gradually recovers the object details and spatial dimensions. Another method for solving this problem would be using dilated/atrous convolutions to replace pooling layers altogether.
The network architecture of basic U-Net is shown in Figure 9. It consists of a contracting path (left side) and an expansive path (right side). The contracting path follows the typical architecture of a convolutional network: it applies 3*3 convolutional layers (with no padding), each followed by a ReLU, and then down-samples the result with a max pooling operation with a stride of 2 (to cut the image dimensions by half) while increasing its depth (number of filters). The expansive path reduces the number of filters (again using convolutions) while increasing the spatial dimensions through up-convolutions, and concatenates each up-sampled feature map with the correspondingly cropped feature map from the contracting path. The cropping is necessary due to the loss of border pixels in convolutions. Finally, at the last layer, a 1*1 convolution is used to map each of the 64-dimensional feature vectors to the desired number of classes.
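To make the border-pixel loss concrete, the following sketch traces the spatial size of the feature maps through a U-Net built from unpadded 3*3 convolutions. It assumes the layout of the original paper (two convolutions per level, 2*2 pooling, four down-sampling steps); the function name is hypothetical, and the network built later in this report may use padding and therefore behave differently.

```python
def unet_output_size(input_size, depth=4):
    """Trace the spatial size through a U-Net with unpadded 3x3 convolutions.

    Returns (output_size, crops) where each entry of crops is
    (encoder_size, decoder_size) for one skip connection, deepest first.
    """
    size = input_size
    skips = []
    for _ in range(depth):
        size -= 4  # two unpadded 3x3 convolutions lose 2 pixels each
        skips.append(size)
        if size % 2:
            raise ValueError("feature map size must be even before 2x2 pooling")
        size //= 2  # 2x2 max pooling with stride 2
    size -= 4  # two convolutions at the bottom of the "U"
    crops = []
    for skip in reversed(skips):
        size *= 2  # the up-convolution doubles the spatial size
        crops.append((skip, size))  # encoder map is cropped from skip to size
        size -= 4  # two more unpadded convolutions
    return size, crops


output_size, crops = unet_output_size(572)
print(output_size)  # 388
print(crops)        # [(64, 56), (136, 104), (280, 200), (568, 392)]
```

For the 572*572 input used in the original paper, the output map is 388*388, and each encoder feature map has to be cropped (for example from 64 to 56 at the deepest level) before it can be concatenated with its decoder counterpart.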
As you can see in Figure 9, the architecture of U-Net consists of a few (2 to 4 in my experiments) encoding and the same number of decoding layers. The encoding layers are used to extract different levels of contextual feature maps. The decoding layers are designed to combine these feature maps produced by the encoding layers to generate the desired segmentation maps. Larger number of encoder/decoder stages with larger sizes has been shown to lead to better performance at the cost of higher computational requirements.
3.3 Instance Segmentation using Mask R-CNN
Figure 10 below shows the typical pipeline for instance segmentation. As you can see, instance segmentation combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances. Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance.
Mask R-CNN [2] is the next extension in the R-CNN line of models. It extends Faster R-CNN [3] (the latest R-CNN model) in two ways: (1) by adding a third branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branches for classification and bounding box regression; and (2) by adding a newly designed layer called RoIAlign, which fixes the misalignment between extracted features and the input caused by the quantization performed by the RoIPool layer used in Faster R-CNN [4].
Figure 11 shows the evolution of the R-CNN line of architectures from the original R-CNN [3] to Mask R-CNN [2]. As you can see, Mask R-CNN is a modular composition of several recent ideas and has the following major components, which are all individually modifiable to the specific task.
Feature Pyramid Network (FPN)
Region Proposal Network (RPN)
Region of interest feature alignment (RoIAlign)
Multi-task network head:
- Box classifier
- Box regressor
- Mask predictor
- Keypoint predictor
In this project, Mask R-CNN is used instead of Faster R-CNN because of its superior performance and the fact that it adds only a small overhead to Faster R-CNN, running at 5 fps.
The U-Net for this project is implemented using Keras with a TensorFlow backend. You can find the code in the Jupyter notebook contained at https://github.com/hooman67/Cell_Nuclei_Segmentation.
As mentioned in the Data section, this dataset contains images of 5 different dimensions: (128, 128), (256, 256), (360, 360), (520, 696), and (1024, 1024). Since U-Net (like most other models) requires images to be of a fixed canonical size, after loading the images and masks they are all down-sampled to 128*128. Working with smaller images makes computations much faster, at the cost of losing information during down-sampling.
Specifically in this task, one concern is that some cell nuclei are very small even in their original image size. After down sampling, these nuclei might become smaller than the detection threshold of U-Net. Note that since 128 by 128 is the smallest dimension in our dataset, using this as the canonical size makes sense because it avoids having to up sample any of the images (by interpolation for example). However, it should be noted that for the purposes of the competition, Kaggle expects all masks to be in the original image size. Consequently, after predicting a 128*128 mask, it should be up sampled to the original size in order to be run-length encoded and submitted to Kaggle for validation. For this task, I used the resize function contained in scikit-image, with an order 1 spline interpolation.
In addition, as mentioned in the Methodology section, since U-Net performs semantic segmentation, it requires the ground truth masks in a single image. In our dataset, for each image, we have a variable number of masks (one for each cell nucleus) of the same size as the image. Consequently, after resizing all images and masks to a canonical size, I combine all the masks available for each image (i.e. all the different nuclei in an image) into a single mask. Since the masks are binary, this is done through a max operation that replaces each pixel with the maximum value from all the masks. This is possible because the annotated masks are guaranteed not to overlap.
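The max-based merge described above is a one-liner in NumPy. The sketch below assumes the per-nucleus masks have already been resized and stacked into an array of shape (n, H, W); the function names are hypothetical.

```python
import numpy as np


def combine_masks(masks):
    """Merge per-nucleus binary masks of shape (n, H, W) into one (H, W) mask."""
    masks = np.asarray(masks)
    return masks.max(axis=0)  # pixel-wise maximum over all nuclei


def masks_overlap(masks):
    """True if any pixel is claimed by more than one mask."""
    return bool(((np.asarray(masks) > 0).sum(axis=0) > 1).any())
```

Checking `masks_overlap` before merging is a cheap way to confirm the non-overlap guarantee still holds after resizing, since interpolation can blur the borders of adjacent nuclei.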
4.1.2 Data Augmentation
I experimented with various data augmentations, but they generally did not improve the performance of the model. I tried rotations with an angle of 45 degrees, height and width shifts with a range of 0.1, shear deformation with a range of 0.2, different zoom levels with a range of 0.2, and both constant and reflect fill modes. Applying these augmentations reduced the performance of the model in comparison with an identically trained model without data augmentations. The only augmentations that had a very small positive effect on the performance of the model were horizontal and vertical flips.
4.1.3 Network Architecture
The architecture of the original U-Net is shown in Figure 9. The network built in this project is loosely based on the original paper. In addition to modifying the network to accommodate images of different size than the ones in the paper, I have made the following modifications:
The original U-Net paper does not include any dropout layers. In order to reduce overfitting, I have added a dropout layer with a probability of 0.1 to each stage. Due to the relatively small size of the training data, it would be easy to overfit a complex model. Consequently, I added dropout, which is a form of regularizing the network.
The original paper uses ReLU activations throughout the network. However, there is research that shows Exponential Linear Units (ELUs) outperformed ReLUs on object classification and localization tasks on ImageNet using various CNNs. Figure 16 shows the mathematical equation for ELUs and a plot comparison of ELUs with ReLUs and Leaky ReLUs (LReLU).
Like ReLUs, leaky ReLUs (LReLUs) and parameterized ReLUs (PReLUs), ELUs also avoid a vanishing gradient via the identity for positive values. However, ELUs have improved learning characteristics compared to the other activation functions. In contrast to ReLUs, ELUs have negative values, which allow them to push mean unit activations closer to zero. Mean activations closer to zero speed up learning because they bring the normal gradient closer to the unit natural gradient. Like batch normalization, ELUs push the mean towards zero, but with a significantly smaller computational footprint. While other activation functions like LReLUs and PReLUs also have negative values, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value for large negative inputs and thereby decrease the propagated variation and information. Therefore, ELUs code the degree of presence of particular phenomena in the input, while they do not quantitatively model the degree of their absence. Consequently, dependencies between ELU units are much easier to model and distinct concepts are less likely to interfere.
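For reference, the three activation functions compared above can be written directly in NumPy. The ELU follows its standard definition, f(x) = x for x > 0 and alpha * (exp(x) - 1) otherwise; the default values alpha = 1.0 and a leaky slope of 0.01 are common choices assumed here, not values taken from this project.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)


def elu(x, alpha=1.0):
    # expm1 is evaluated only on the non-positive part to avoid overflow.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))


x = np.array([-100.0, -1.0, 0.0, 2.0])
print(relu(x))        # zero for every negative input
print(leaky_relu(x))  # keeps shrinking linearly as inputs get more negative
print(elu(x))         # saturates near -alpha for large negative inputs
```

The last line illustrates the saturation property discussed in the text: unlike the leaky ReLU, the ELU output for a strongly negative input is bounded below by -alpha.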
Since U-Net is based on an encoder-decoder design, its architecture depends on input sizes. This is because the original image is first converted to a deep feature map of 8*8 images through a series of convolutional and max pooling layers, and is then converted back to the original size through a series of transpose convolutions. Consequently, to investigate the effects of different canonical sizes, I had to implement two different networks: one with 9 stages for 128*128 images, and another with 13 stages for 512*512 images. You can find the result of this investigation in the Results section.
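The stage counts quoted above follow from simple arithmetic if a "stage" is taken to mean one encoder block per halving, one bottleneck block at 8*8, and one decoder block per doubling. That reading is an assumption made for this sketch, and the function name is hypothetical.

```python
def unet_stage_count(input_size, bottleneck_size=8):
    """Number of encoder + bottleneck + decoder stages needed to go from
    input_size down to bottleneck_size and back, halving at each step."""
    n_pool = 0
    size = input_size
    while size > bottleneck_size:
        if size % 2:
            raise ValueError("input size must be bottleneck_size times a power of two")
        size //= 2
        n_pool += 1
    if size != bottleneck_size:
        raise ValueError("input size must be bottleneck_size times a power of two")
    return 2 * n_pool + 1  # n_pool encoder stages, 1 bottleneck, n_pool decoder stages


print(unet_stage_count(128))  # 9  (128 -> 64 -> 32 -> 16 -> 8 and back)
print(unet_stage_count(512))  # 13 (six halvings down to 8 and back)
```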
Figure 17 shows the model optimized for 128*128 images (using 9 stages to take the input from 128*128 to 8*8 and back to 128*128), while Figure 18 shows the model optimized for 512*512 images (using 13 stages to take the input from 512*512 to 8*8 and back to 512*512). Both models use the Adam optimizer with the binary cross-entropy loss. The accuracy metric monitored during training must be the mean average precision at different intersection over union (IoU) thresholds (10 values from 0.5 to 0.95). TensorFlow has a mean IoU implemented in tf.metrics.mean_iou, but has no method for finding the mean over multiple thresholds. As a result, I implemented my own method (which can be found in the U-Net notebook in the repo).
For the choice of optimizer, I used Adam with Binary Cross Entropy loss. The binary loss is acceptable here since the segmentation masks are binary. For the evaluation metrics, I am using the average over mean IoUs as described above. For the beta_1 and beta_2 parameters of the Adam optimizer, I am using the default values of 0.9 and 0.999, respectively.
The results presented here were obtained by starting from a learning rate of 0.001 and training the model for 50 epochs. However, I used an early stopping callback that ends training when the validation loss has not improved for 5 epochs. I use this because at this point, I have not applied any regularization to the model (no L2 or L1). Furthermore, I am using no weight decay in the Adam optimizer either (i.e. the value of weight decay was set to 0). With the early stopper and a learning rate of 0.001, the model ends training at around 25 epochs. In addition, 10% of the training set was chosen at random and used for validation. The validation loss was calculated five times at every epoch.
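The early stopping rule can be illustrated without any framework. The sketch below is a plain-Python restatement of a patience-based rule, assuming "no change" means "no improvement over the best validation loss seen so far"; it is not the callback used in the notebook, and the function name and the simulated loss curve are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None
    if the loss keeps improving often enough to finish all epochs."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss  # new best validation loss, reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Simulated curve: steady improvement for 20 epochs, then a plateau.
losses = [1.0 - 0.01 * i for i in range(20)] + [0.9] * 30
print(early_stopping_epoch(losses, patience=5))  # 25
```

With a 50-epoch budget and a loss that stops improving after epoch 20, the rule halts at epoch 25, which is the kind of behavior reported in the paragraph above.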
As future work, I plan to train the model more thoroughly by babysitting the learning rate and reducing it every time the early stopper ends the training process prematurely. If I get to the point where the model is overfitting, I might add L1/L2 regularization or weight decay to compensate. Right now, the only mechanisms by which I battle overfitting are early stopping and dropout layers in between all convolutional and transpose convolutional layers.
4.2 Mask R-CNN
The implementation of Mask R-CNN for this project is based on Keras with a TensorFlow backend and is largely influenced by the Matterport implementation of Mask R-CNN [2]. The code for this project can be found in the GitHub repo. I tried to follow the Mask R-CNN paper's recommendations for the most part, but I made the following changes to make the model more suitable for this particular task. For the backbone architecture, I am using ResNet101.
4.2.1 Image Resizing
As mentioned in the U-Net implementation section, we have multiple image dimensions in our dataset. To have a canonical size for the network (and to be able to use multiple images per batch), I resize all images to 512 by 512. In case of images that have different aspect ratios (i.e. are not squares), I pad them with zeros. Notice that, in the original paper, the input images are not resized because they resize the masks to a small fixed size. Here, I am resizing the images as well for consistency and to make training faster.
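The zero-padding step for non-square images can be sketched as follows. Centering the image inside the square is an assumption of this sketch (the text only states that zeros are used), the scaling to 512 by 512 is omitted, and the function name is hypothetical.

```python
import numpy as np


def pad_to_square(image):
    """Zero-pad an (H, W) or (H, W, C) array so that H == W, keeping it centred.

    Returns the padded array and the (top, left) offset of the original image,
    which is needed later to map predictions back to the original frame.
    """
    h, w = image.shape[:2]
    side = max(h, w)
    top = (side - h) // 2
    left = (side - w) // 2
    pad = [(top, side - h - top), (left, side - w - left)]
    pad += [(0, 0)] * (image.ndim - 2)  # never pad the channel axis
    return np.pad(image, pad, mode="constant", constant_values=0), (top, left)
```

For a (520, 696) image this produces a (696, 696) array with 88 rows of zeros above and below the original content; the masks must be padded with the same offsets so that they stay aligned with the image.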
However, I do resize the masks to a fixed size by extracting the bounding box of the object and resizing it (just the object) to a fixed size of 56 by 56. Since our dataset provides only the masks and no bounding boxes, I generate my own. In doing so, I pick the smallest box that encapsulates all the pixels of the mask as the bounding box. This simplifies the implementation and also makes it easy to apply certain image augmentations that would otherwise be really hard to apply to bounding boxes, such as image rotation. Nonetheless, in terms of data augmentation, I am only using horizontal flips. This is because, after trying various crops, rotations, translations, and rescalings, I noticed that they do not improve the results. For the ROI pooling layer, I use the same parameters as suggested by the paper: a pool size of 7 for ROIs and a pool size of 14 for the masks.
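Deriving the tightest bounding box from a binary mask takes only a few lines of NumPy. The (y1, x1, y2, x2) ordering with exclusive lower-right corner is a convention assumed for this sketch, and the function name is hypothetical.

```python
import numpy as np


def mask_to_box(mask):
    """Smallest box (y1, x1, y2, x2) containing all non-zero pixels of an
    (H, W) mask; y2 and x2 are exclusive. An empty mask gives (0, 0, 0, 0)."""
    rows = np.flatnonzero(mask.any(axis=1))  # rows that contain object pixels
    cols = np.flatnonzero(mask.any(axis=0))  # columns that contain object pixels
    if rows.size == 0:
        return (0, 0, 0, 0)
    return (int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1)
```

Because the box is recomputed from the mask, an augmentation such as a rotation only needs to be applied to the mask itself, and the box follows automatically.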
4.2.2 Anchors in the Region Proposal Network (RPN)
The original paper suggests anchor scales of 32, 64, 128, 256, and 512. These are the side lengths of the square anchors in pixels. However, since cell nuclei are much smaller than the objects that the Mask R-CNN in the original paper was trying to detect, in this project I am using anchor scales of 4, 8, 16, 128, and 256 instead. This is because in a preliminary study I found that most of the nuclei are in the 8 pixel by 8 pixel range, with some as big as the 100*100 pixel range. This is why I keep the three smallest scales (4, 8, and 16) but then jump to 128*128 pixels. In other words, there are very few nuclei in our dataset that are bigger than 16 by 16 and smaller than 128 by 128 pixels.
I use the same Aspect Ratios for Anchors as suggested in the paper (0.5, 1, and 2). For the Anchor stride, I use a value of one, which produces 1 Anchor for every position (i.e. Pixel) in the backbone feature map. For the strides of the FPN pyramid, I use the same values suggested in the paper (4, 8, 16, 32, 64) since these values were optimized for ResNet101 backbone architecture that I am using. Using above values, the lowest level of the pyramid has a stride of 4px relative to the image, so anchors are created at every 4 pixel intervals.
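The anchor settings above can be turned into a concrete set of boxes as follows. This is a sketch under stated assumptions: one scale is assigned to each pyramid level in order, a ratio r is interpreted as width/height with the anchor area kept at scale squared, boxes are (y1, x1, y2, x2) in image pixels, and the function names are hypothetical.

```python
import numpy as np


def level_anchors(scale, ratios, feature_size, feature_stride, anchor_stride=1):
    """Anchors for one pyramid level with a square feature map."""
    ratios = np.asarray(ratios, dtype=np.float64)
    heights = scale / np.sqrt(ratios)
    widths = scale * np.sqrt(ratios)
    # Anchor centres: every anchor_stride-th feature-map cell, in image pixels.
    centres = np.arange(0, feature_size, anchor_stride) * feature_stride
    cy, cx = np.meshgrid(centres, centres, indexing="ij")
    cy = cy.reshape(-1, 1)
    cx = cx.reshape(-1, 1)
    boxes = np.stack(
        [cy - heights / 2, cx - widths / 2, cy + heights / 2, cx + widths / 2],
        axis=-1,
    )
    return boxes.reshape(-1, 4)


def pyramid_anchors(scales, ratios, image_size, feature_strides, anchor_stride=1):
    """Concatenate the anchors of all pyramid levels for a square image."""
    levels = [
        level_anchors(s, ratios, image_size // stride, stride, anchor_stride)
        for s, stride in zip(scales, feature_strides)
    ]
    return np.concatenate(levels, axis=0)


anchors = pyramid_anchors((4, 8, 16, 128, 256), (0.5, 1, 2), 512, (4, 8, 16, 32, 64))
print(anchors.shape)  # (65472, 4)
```

For a 512 by 512 input the five feature maps have 128, 64, 32, 16, and 8 cells per side, so with 3 ratios per cell the total is 3 * (16384 + 4096 + 1024 + 256 + 64) = 65472 anchors, most of them coming from the finest level.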
For the threshold of the non-max suppression stage, the original paper suggests 0.3 so that they can generate a large number of proposals. However, I found that a higher value of 0.7 or 0.8 can increase the performance on the test set by having stricter criteria for proposals. For training, I use 256 anchors for each image.
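For readers unfamiliar with the threshold being tuned here, the following is a generic greedy non-max suppression in NumPy. It is a textbook version written for illustration, not the routine used inside the Mask R-CNN implementation; boxes are assumed to be (y1, x1, y2, x2) with positive area, and the function name is hypothetical.

```python
import numpy as np


def non_max_suppression(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the best-scoring box, drop every remaining box whose
    IoU with it exceeds iou_threshold, and repeat. Returns kept indices."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box.
        y1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        x1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        y2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        x2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]
    return keep
```

In this formulation the threshold is the IoU above which a lower-scoring box is discarded, so its value directly controls how many overlapping proposals survive.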
4.2.3 Training and Testing Regions of Interest (ROIs)
After non-max suppression, I am keeping 2000 ROIs for training and 1000 ROIs during test time. The original paper generates 1000 ROIs per image (initially), which I have reduced to 600 here. This is because the paper mentions that for the best results the sampling stage should pick up approximately 33% positive ROIs. Since my images are smaller and have fewer objects compared to the task for which Mask R-CNN was used in the original paper, I generate fewer ROIs per image initially, so that after down-sampling I am left with 33% positive ROIs.
For the number of ROIs per image to feed to the classifier and mask heads, the original paper suggests 512. However, I am using 200. Again, this is because when using 512, RPN generates too many positive proposals. So, I am generating fewer proposals to begin with in order to keep the positive to negative ROI ratio at 1/3 as per the paper's suggestion. The criteria for deciding between positive and negative ROIs are the same described in the Methodology section (and in the original paper). Also, for detection, I have set the minimum probability value to accept a detected instance to 0.7. Basically, ROIs with confidence levels below 0.7 are ignored during test time.
In the original Mask R-CNN paper, the maximum number of instances to use in training and to return in testing must be specified. The original paper uses a value of 100 instances per image in both; meaning regardless of how many instances there are in the image, we return only 100 objects. However, in this project, some images have up to 500 nuclei in them. Unfortunately, I cannot set the maximum number of ground truth instances to 512 during training due to memory limits. Consequently, I use a maximum of 256 ground truth instances for training (i.e. allowable objects in each image during training). For test time, however, I have set this number to 512 so that I can correctly classify images with large number of cell nuclei in them.
Finally, for the average RGB pixel values to differentiate different colors, I use the same values as suggested in the paper (123.7, 116.8, 103.9). At a future time, I might change these values to better reflect the colors present in our dataset (based on a more sophisticated clustering algorithm than the one I used in the data section). I am also using the same values as suggested in the paper for the standard deviations used in the mask refinement stage in both the RPN stage and the final refinement stage. For both the RPN bounding box refinement stage and the final refinement stage these were suggested to be set at 0.1, 0.1, 0.2, and 0.2.
Faster R-CNN was trained using a multi-step approach, training parts independently and merging the trained weights before a final full training approach. However, the Mask R-CNN paper recommends an end-to-end training strategy (i.e. joint training). In this project, after putting the complete model together, there are 4 different losses, two for the RPN and two for R-CNN. For this task, aside from training the layers in the RPN and in the R-CNN, I have decided to fine-tune the ResNet101 backbone structure as well; starting from the weights trained on the COCO dataset.
This is because the objects that the backbone model (i.e. ResNet101) was trained on (i.e. the images in the COCO dataset) are very different from the objects we are trying to detect (i.e. cells and their nuclei). Training the ResNet101 architecture is very computationally expensive, but it is required in order to squeeze all the possible performance out of the ResNet101. Consequently, starting from the COCO weights, I fine-tune the ResNet101 backbone structure for 50 epochs, and then I freeze the backbone weights and train only the heads (i.e. RPN and R-CNN) for 300 epochs.
During the end-to-end training, I combine the four different losses using a weighted sum; assigning the classification losses higher weight relative to regression ones and the R-CNN losses more power over the RPN losses. Apart from the regular losses, I also have the regularization losses which are defined both in RPN and in R-CNN. I use L2 regularization in all of the layers in the RPN and in the R-CNN, but no regularization in the backbone (ResNet101) network since the original training (on the COCO dataset) was performed without any regularization.
For the choice of optimizer, I am using Stochastic Gradient Descent with a momentum value of 0.9, which is the default. For the learning rate, the Mask R-CNN paper uses a value of 0.02, but I found that to be too high; it often causes the weights to explode, especially when using a small batch size. This might be related to differences between how Caffe and TensorFlow compute gradients (sum vs. mean across batches and GPUs). Or, maybe the official model uses gradient clipping to avoid this issue. I do use gradient clipping, but don't set it too aggressively. In terms of batch sizes, since I have images that are as large as 1024 by 1024, I can only fit 2 images on the GPU (which has 12GB) when training the full end-to-end model (i.e. including the ResNet101). When training just the heads (i.e. RPN and R-CNN), I use a batch size of 8. If I change the backbone architecture to ResNet50, I can have a batch size as large as 16. However, I found that using ResNet50 instead of ResNet101 has a relatively large effect on the accuracy, so I decided that the sacrifice in accuracy would not be worth the reduced training time.
Consequently, when training the entire network end-to-end (i.e. fine-tuning the backbone ResNet101 as well as the RPN and R-CNN heads), I start with a learning rate of 0.001 and set an early-stopping callback to end the training if the validation loss does not decrease within 5 epochs. Every time the model stops early, I divide the learning rate by 10 and resume the training. According to the documentation, the SGD optimizer automatically decreases the learning rate by a factor of 10 every 50000 steps, but I found that not to be enough. When training just the heads, I use a learning rate of 0.0001 and did not have to decrease it (i.e. the training ends when the network reaches the specified 300 epochs, not due to early stopping). As mentioned previously, I also use regularization, with a fixed regularization parameter of 0.0001. Finally, 10% of the training set was chosen at random and used for validation. I perform 5 validation steps per epoch.
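The stop-and-restart schedule above can be illustrated with a small framework-free sketch. The function names are hypothetical, and the actual training relies on the framework's early-stopping callback rather than on this code; the sketch only shows the rule being applied (a patience of 5 epochs and a division of the learning rate by 10 on every restart).

```python
def early_stop_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops early, or None.

    Training stops once `patience` consecutive epochs pass without a new
    best (lowest) validation loss.
    """
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None


def learning_rates_after_restarts(initial_lr, num_restarts, factor=10.0):
    """Learning rate of the first run followed by the rate of each restart."""
    return [initial_lr / factor ** i for i in range(num_restarts + 1)]
```

For example, starting from 0.001 with two restarts gives learning rates of roughly 0.001, 0.0001, and 0.00001.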
5. Discussion of Results and Related Works
Table 1 shows the results of the experiments performed using U-Net. All of the reported scores (both here and everywhere else in this report, unless explicitly mentioned otherwise) are the mean average precision at different intersection over union (IoU) thresholds, as described in the Evaluation section. Each score is calculated by Kaggle on the test set. I do not have access to this test set, which makes the score an unbiased estimate of the classifier's performance.
| Model and training description | Score |
| U-Net, 128*128 image sizes, 3 stages, early stopped at 25 epochs, no data augmentation | 0.245 |
| U-Net, 128*128 image sizes, 3 stages, early stopped at 25 epochs, with data augmentation | 0.243 |
| U-Net, 512*512 image sizes, 3 stages, early stopped at 25 epochs, no data augmentation | 0.230 |
| U-Net, 512*512 image sizes, 5 stages, early stopped at 25 epochs, no data augmentation | 0.262 |
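As a rough illustration of how a score of this kind can be computed for a single image, the sketch below assumes the commonly used competition definition: at each IoU threshold t from 0.50 to 0.95 in steps of 0.05, precision is TP / (TP + FP + FN), and the per-threshold values are averaged. The Evaluation section remains the authoritative definition. The mask representation (sets of pixel coordinates) and the function names are illustrative assumptions.

```python
def mask_iou(mask_a, mask_b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def image_score(true_masks, pred_masks):
    """Average over IoU thresholds 0.50..0.95 of TP / (TP + FP + FN)."""
    thresholds = [t / 100 for t in range(50, 100, 5)]
    ious = [[mask_iou(t, p) for p in pred_masks] for t in true_masks]
    precisions = []
    for threshold in thresholds:
        matched_predictions = set()
        true_positives = 0
        for row in ious:
            # Match each ground truth mask to at most one unused prediction.
            for j, value in enumerate(row):
                if value > threshold and j not in matched_predictions:
                    matched_predictions.add(j)
                    true_positives += 1
                    break
        false_positives = len(pred_masks) - true_positives
        false_negatives = len(true_masks) - true_positives
        total = true_positives + false_positives + false_negatives
        # Assumption: an image with no objects and no predictions scores 0.
        precisions.append(true_positives / total if total else 0.0)
    return sum(precisions) / len(precisions)
```

A prediction that covers two of the three pixels of a ground truth nucleus has an IoU of about 0.67, so it counts as a match at the four thresholds up to 0.65 and as a miss at the remaining six, giving an image score of 0.4.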
The first two rows of Table 1 show the results of training a U-Net with 3 encoder/decoder stages (i.e. the structure shown in Figure 17). The number of training epochs was set to 50 but, in both cases, the early-stopping callback terminated the training at 25 epochs (after 5 consecutive epochs in which the validation loss did not decrease). The input images are all resized to 128*128, as described in the implementation section. The optimizer was Adam with the default learning rate, beta 1, and beta 2 parameters and a decay rate of 0.
The only difference between the first two rows of Table 1 is data augmentation: the model in the first row was trained without any, while the model in the second row used all of the data augmentations mentioned in the implementation section. Data augmentation lowered the test score by 0.002 (as calculated by Kaggle), which is in line with observations made in other works on cell nuclei segmentation.
In 2018, Cui et al. [6] looked at several data augmentation techniques proposed for the task of cell nuclei segmentation and concluded that none of them significantly improves the performance. The histological (stained) images in our dataset are acquired using H&E staining, in which the nuclei of cells are stained blue by haematoxylin while the cytoplasm is coloured pink by eosin. Since in practice the color of H&E stained images can vary a lot due to variation in the H&E reagents, the staining process, the scanner, and the specialist who performs the staining, several H&E stain normalization methods have been proposed to eliminate the negative interference caused by color variation. However, Cui et al. tried several of these techniques and observed no considerable difference in performance.
Another augmentation that Cui et al. tried was using non-negative matrix factorization (NMF) to convert the color space of the images in both the training and test sets to a fixed color space, usually that of the best-stained H&E image in the training set. Again, their results show that this method has little effect on the performance of models in cell nuclei segmentation tasks. Finally, Cui et al. argue that the large body of nuclei-segmentation augmentation techniques that use deconvolution algorithms to extract the H-channel from the images and use it as the learning feature actually reduce the performance of deep fully convolutional networks (FCNs). In other words, according to Cui et al., deep FCNs learn better from raw RGB images than from H-channel gray-scale images (i.e. the information contained in the RGB values is actually relevant).
Aside from data augmentation techniques, several other changes have been suggested in recent years to improve the performance of U-Nets in nuclei segmentation. Zhao and Sun [7] proposed an architecture that is very similar to U-Net (i.e. it consists of a contracting path and an expansive path) but uses inception modules and batch normalization instead of ordinary convolutional layers, which reduces the number of parameters and accelerates training without loss of accuracy. I did not try this myself since my main focus is on maximizing accuracy rather than training speed.
Finally, Pena et al. [8] proposed a method for improving the performance of U-Net in segmenting cell nuclei in images with high cell density or clutter. Segmenting individual touching cell nuclei in cluttered regions is challenging because the feature distributions on shared borders and on the cell foreground are similar, which makes it difficult to correctly classify the pixels. To solve this, Pena et al. proposed extending U-Net's binary semantic segmentation predictions to instance segmentation using 3 classes (background, foreground, and touching). To support this, they also introduced a new multiclass weighted cross-entropy loss function that takes into account not only the cell geometry but also the class imbalance. Unfortunately, this technique is not applicable in this project since our training and test images do not have labels that convey whether they are cluttered or densely packed.
Table 2 shows the results of the experiments performed using Mask R-CNN. Again, the reported score is the mean average precision at different intersection over union (IoU) thresholds.
| Model and training description | Score |
| Mask R-CNN, ResNet50 backbone, all layers trained for 10 epochs | 0.285 |
| Mask R-CNN, ResNet101 backbone, all layers trained for 10 epochs | 0.291 |
| Mask R-CNN, ResNet101 backbone, tuned ROI and RPN hyper-parameters, all layers trained for 10 epochs | 0.344 |
| Mask R-CNN, ResNet101 backbone, with cleaned up training data, all layers trained for 10 epochs | 0.318 |
| Mask R-CNN, ResNet101 backbone, with cleaned up training data, and tuned ROI and RPN hyper-parameters, all layers trained for 10 epochs | 0.380 |
The first row of Table 2 shows the result of training all the layers of Mask R-CNN for 10 epochs using a ResNet50 backbone, starting from weights pre-trained on the COCO dataset. The second row shows the result when ResNet101 is used as the backbone architecture instead of ResNet50. In both cases, the SGD optimizer was used with a learning rate of 0.00001, a momentum of 0.9, and a weight decay of 0.0007. As expected, ResNet101 performs better than ResNet50, at the cost of taking almost twice as long to train and test.
Further exploration of the predictions made by the Mask R-CNN model with the ResNet101 backbone (second row of Table 2) showed that several ground truth nuclei masks in the training set have holes or missing pieces. I performed a simple pre-processing step using OpenCV's contour detection function to exclude any masks that contain second-level contours (i.e. a contour inside of a contour). This step increased the performance of the classifier, as shown in row 4 of Table 2. Figure 19 shows a few examples of masks with holes in them, while Figure 20 shows a few masks with missing pieces.
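The hole check can be illustrated without any dependencies. The sketch below is a stand-in for the OpenCV contour-hierarchy test described above, not the code used in the project: it flood-fills the background from the image border and flags a mask if any background pixel is left unreached, which is the situation in which an inner contour would exist. The function name and the list-of-rows mask format are assumptions made for illustration.

```python
from collections import deque


def has_hole(mask):
    """Return True if a binary mask (list of rows of 0/1) encloses background."""
    rows, cols = len(mask), len(mask[0])
    reached = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the flood fill with every background pixel on the image border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and mask[r][c] == 0:
                reached[r][c] = True
                queue.append((r, c))
    # Spread through 4-connected background pixels.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and not reached[nr][nc] and mask[nr][nc] == 0):
                reached[nr][nc] = True
                queue.append((nr, nc))
    # Background that the border fill never reached is enclosed by the mask.
    return any(
        mask[r][c] == 0 and not reached[r][c]
        for r in range(rows)
        for c in range(cols)
    )
```

Masks can then be filtered with an expression such as `[m for m in masks if not has_hole(m)]`.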
The nuclei shown in Figure 21 represent the average size of nuclei present in the training set. We have several cell types with nuclei much larger or much smaller than the ones shown in Figure 22. However, it makes sense to adjust the ROI scales and aspect ratios to match the average sizes of nuclei present in our dataset, which is what I am doing here. It is intuitive to assume that the closer the scales of the initial anchors are to the actual objects we are interested in, the better the results.
Consequently, as mentioned in the implementation section, since most cell nuclei are much smaller than the objects the original Mask R-CNN paper was interested in detecting, I have decreased the scales of my initial anchors by a power of 4. Figures 22 and 23 show the highest scoring anchors (i.e. the most positive anchors based on their objectness score) that are proposed by the RPN before the refinement stage. Figure 23 shows these anchors at the scale suggested by the Mask R-CNN paper, while Figure 22 shows the anchors after the scales are reduced by a power of 4. As you can see, the proposals generated with the reduced scales are much closer to the actual sizes of the objects we are interested in (i.e. cell nuclei).
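To make the effect of the anchor scales concrete, the sketch below generates anchor shapes from a set of scales and aspect ratios. It rests on two assumptions that are not confirmed by this report: that the starting scales are the 32 to 512 pixel values commonly used with Mask R-CNN on a feature pyramid backbone, and that the reduction amounts to dividing each scale by 4. The aspect ratios and function names are likewise illustrative.

```python
import math

# Assumption: commonly used starting scales (anchor side lengths in pixels).
DEFAULT_SCALES = (32, 64, 128, 256, 512)
# Assumption: illustrative width/height aspect ratios.
ASPECT_RATIOS = (0.5, 1.0, 2.0)


def reduced_scales(scales=DEFAULT_SCALES, divisor=4):
    """Shrink every anchor scale (assumed here to mean division by 4)."""
    return tuple(scale // divisor for scale in scales)


def anchor_shapes(scales, ratios=ASPECT_RATIOS):
    """Return one (height, width) pair per scale and ratio.

    Each shape keeps an area of scale**2 while its width/height equals ratio.
    """
    return [
        (scale / math.sqrt(ratio), scale * math.sqrt(ratio))
        for scale in scales
        for ratio in ratios
    ]
```

Under these assumptions the smallest anchors shrink from 32 to 8 pixels on a side, which is closer to the size of a small nucleus.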
6. Future Works
The scope of future work for this project can be divided into 3 broad categories: improving the performance of U-Net, improving the performance of Mask R-CNN, and combining the results from several classifiers in an ensemble in order to achieve competitive results.
6.1 Improving U-Net
As mentioned in the implementation section, I would like to train the U-Net model over a larger number of epochs. Currently, the early-stopping callback, which ends the training process prematurely once the validation loss has not decreased within five epochs, causes the training to stop after 25 epochs. I would like to avoid this (and thus train the model for more epochs) by increasing the decay rate of the learning rate used by the Adam optimizer. The documentation for the Adam optimizer mentions that it automatically controls the learning rate during training, so it is not necessary to manually set the momentum and decay, but in my experience a decay value above 0 achieves much better results.
Furthermore, since the original U-Net paper was published in 2015, several improvements have been suggested to make U-Net more suitable for the task of segmenting cell nuclei. Most recently (i.e. in 2018), Cui et al. [6] suggested several modifications to the architecture of U-Net to improve its performance for nuclei segmentation of histopathological images. They propose initializing the encoder/decoder weights using Glorot uniform initialization and using Scaled Exponential Linear Units (SELUs) instead of ReLU activations. SELUs are designed to give self-normalizing capability to feed-forward neural networks (FNNs), and Cui et al. [6] show that FNNs using SELUs outperform the ones using explicit normalization methods, such as batch normalization, layer normalization, and weight normalization.
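For reference, the two ingredients named above can be written out in a few lines. This is a sketch of the standard SELU activation and of the Glorot uniform sampling bound, not Cui et al.'s implementation; the constants are those of the standard SELU definition and the function names are illustrative.

```python
import math
import random

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """scale * x for x > 0, and scale * alpha * (exp(x) - 1) otherwise."""
    if x > 0:
        return SELU_SCALE * x
    return SELU_SCALE * SELU_ALPHA * math.expm1(x)


def glorot_uniform_limit(fan_in, fan_out):
    """Bound of the uniform range used by Glorot uniform initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def glorot_uniform(fan_in, fan_out, rng=random):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = glorot_uniform_limit(fan_in, fan_out)
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_out)]
        for _ in range(fan_in)
    ]
```

Unlike ReLU, SELU keeps a non-zero (negative, saturating) output for negative inputs, which is what allows activations to be pulled toward zero mean and unit variance without an explicit normalization layer.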
In addition to proposing an overlapped patch extraction and assembling method designed for seamless prediction of nuclei in large whole-slide images, Cui et al. propose a nuclei-boundary model that detects nuclei and their boundaries simultaneously in histopathology images. This is done in order to handle the case of overlapping nuclei that was mentioned at the beginning of the Methods section. Their paper shows that detecting boundaries improves the accuracy of nuclei detection and helps split touching and overlapping nuclei. As future work for the U-Net portion of this project, I'd like to experiment with the effects of the improvements suggested above.
6.2 Improving Mask R-CNN
To improve the results produced by Mask R-CNN, I am planning on experimenting with different values for the hyper-parameters controlling the sizes and locations of the RPN anchors. In addition, I'd like to experiment with different parameters for the ROIs in both training and testing. Ideally, I'd like to find the best confidence thresholds for selecting ROIs as positive or negative in order to decrease the number of false positives given by Mask R-CNN. I'd also like to optimize the model further by training both the backbone ResNet101 and the RPN, Mask, and R-CNN heads. I might experiment with using the Adam optimizer instead of SGD in order to have an adaptive learning rate (i.e. have the learning rate decay automatically).
In addition, I'd like to further experiment with the effects of data augmentation on the performance of Mask R-CNN on this task. Specifically, I'd like to experiment with the color normalization techniques that were described in the implementation section for U-Net (i.e. H&E stain normalization methods). Finally, I'd like to investigate the effects of image transformations such as Non-Negative Matrix Factorization (or even ordinary PCA) on the performance of Mask R-CNN instance segmentation. As mentioned previously, Cui et al. showed that converting one image's color into the target image's color space based on sparse non-negative matrix factorization did not improve the performance of U-Net. However, I suspect this is because they are already using SELU activations, which in effect perform normalization. It is possible that without SELU activations, color-normalization-based augmentation techniques might have a more significant effect on the performance of Mask R-CNN in this particular task.
6.3 Ensemble of Several Methods
As mentioned in the Data section, performing K-Means clustering on the training and test datasets showed that there are at least three clusters in our training dataset. At 3 clusters (i.e. k=3 in K-Means), K-Means produces clusters that perfectly match the 3 imaging modalities present in our dataset, even though there might be lower-level sub-clusters within each imaging modality as well. As future work, I'd like to further explore the presence of clusters in the dataset. Once a set of clusters is identified that is present in both the training set and the test set, one might imagine fitting several different classifiers, one on each cluster. By limiting the feature space to a particular imaging modality (i.e. cluster), classifiers might show a big increase in performance. However, the ultimate success of this approach largely depends on the clusters in the training set being similar to the ones present in the test set.
Aside from training models on individual clusters, one might consider fitting several models to bootstrapped samples from the training set and combining them in an ensemble. For the methods that perform semantic segmentation (i.e. U-Net), this can be done with a late fusion method. So far, I have tried averaging the predictions of several U-Net models and taking the maximum of the predictions among models. Both approaches showed promise in improving the performance, but averaging the predictions showed a higher improvement. Consequently, I'd like to experiment with stacking several semantic segmentation classifiers as well. One might imagine that feeding the results of several classifiers into a separate network, and allowing it to pick the best features from each prediction, could substantially improve the results.
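The two late fusion rules mentioned above (averaging and taking the maximum) can be sketched on per-pixel probability maps as follows. The function names, the list-of-rows map format, and the 0.5 binarization threshold are assumptions made for illustration.

```python
def fuse(prob_maps, mode="mean"):
    """Late fusion of per-pixel foreground probabilities from several models."""
    if mode not in ("mean", "max"):
        raise ValueError("mode must be 'mean' or 'max'")
    rows, cols = len(prob_maps[0]), len(prob_maps[0][0])
    fused = []
    for r in range(rows):
        fused_row = []
        for c in range(cols):
            values = [prob_map[r][c] for prob_map in prob_maps]
            if mode == "mean":
                fused_row.append(sum(values) / len(values))
            else:
                fused_row.append(max(values))
        fused.append(fused_row)
    return fused


def binarize(prob_map, threshold=0.5):
    """Turn a fused probability map into a binary foreground mask."""
    return [[1 if p > threshold else 0 for p in row] for row in prob_map]
```

Taking the maximum marks a pixel as foreground as soon as any one model is confident, whereas averaging requires agreement across models, which is one possible reason for the two rules to behave differently.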
On the other hand, ensembling instance segmentation methods (i.e. Mask R-CNN) is not as easy as ensembling semantic segmentation models and requires a more sophisticated method such as Non-Maximum Suppression (NMS). This is because, unlike semantic segmentation methods, which provide per-pixel probabilities, instance segmentation methods provide scores that are specific to each instance. I'd like to experiment with applying NMS to the bounding boxes returned by my two trained Mask R-CNN models to combine their predictions while discarding redundant and overlapping bounding boxes.
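A minimal sketch of this idea is given below, assuming each model returns a list of (box, score) pairs with boxes in (y1, x1, y2, x2) form. The box format, the 0.5 IoU threshold, and the function names are illustrative assumptions. The detections of both models are pooled and greedy NMS keeps the highest-scoring box in every group of overlapping boxes.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (y1, x1, y2, x2)."""
    inter_h = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_w = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_h * inter_w
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs pooled from several models."""
    kept = []
    # Visit detections from the most to the least confident.
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not overlap an already kept box too much.
        if all(box_iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

Usage: `non_max_suppression(detections_model_a + detections_model_b)`. This assumes that the confidence scores of the two models are comparable; if they are not, the scores would need to be calibrated before pooling.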