## Introduction

Machine learning algorithms can outperform some radiologists in cancer diagnosis, but can they also give sound reasoning to support their classifications? The main goal of our project is to increase the 'explainability' of the models we explore. Namely, our algorithms not only identify whether a given image contains an abnormality, but also localize the lesions they find.

To achieve this goal, we mainly experiment with ResNet and U-Net. ResNet allows us to train deep CNN models relatively efficiently. It outputs a binary classification (normal/abnormal) for each given image. To visualize where the model 'thinks' the trouble regions are in images predicted to be abnormal, we generate saliency maps from ResNet. U-Net, on the other hand, outputs a class label for each pixel in the given image, so it directly indicates the location of nodules.

We also propose two ways to combine the advantages of the two models. One is to restructure U-Net by adding shortcut connections so that a deeper network can be trained more efficiently, which may lead to better performance within limited training time. The other is to stack ResNet and U-Net to improve the overall diagnosis accuracy.

All of the models are tested on the CBIS-DDSM data set (Curated Breast Imaging Subset of DDSM) ([4]), an updated version of DDSM (Digital Database for Screening Mammography). The models are evaluated by three metrics: diagnosis accuracy, pixel accuracy, and intersection over union (IoU). The first measures whether the model gives the right binary classification (normal/abnormal) for an image, while the other two measure how well the model localizes tumors.
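
As a minimal sketch of these three metrics, assuming their standard definitions and binary masks with 1 marking abnormal pixels, the NumPy functions below could be used. The function names and the toy arrays are illustrative and are not taken from the project code.

```python
import numpy as np

def diagnosis_accuracy(pred_labels, true_labels):
    """Fraction of images whose normal/abnormal label is predicted correctly."""
    return float(np.mean(np.asarray(pred_labels) == np.asarray(true_labels)))

def pixel_accuracy(pred_mask, true_mask):
    """Fraction of pixels whose predicted class matches the ground truth."""
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    return float(np.mean(pred == true))

def intersection_over_union(pred_mask, true_mask):
    """Overlap between predicted and true abnormal regions, divided by their union."""
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return float(np.logical_and(pred, true).sum() / union)

# Toy 3x3 example (hypothetical data).
pred = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]])
true = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
print(pixel_accuracy(pred, true))           # 7 of 9 pixels agree
print(intersection_over_union(pred, true))  # 2 shared pixels out of 4 in the union
```

Pixel accuracy can look high when abnormal regions are small, because most pixels are normal; IoU ignores the correctly predicted normal pixels and is therefore a stricter measure of localization.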

Our final deliverable is an online prediction function that returns, for each given image, a binary classification and a heatmap localizing the tumor.

## Literature Review

Artificial neural networks are now broadly applied in the study and modeling of cancers, including susceptibility analysis and disease diagnosis. Their ability to model nonlinear relationships allows them to better predict treatment outcomes and thereby improve, and even design, individualized treatment plans. Experiments show that they can even outperform experienced experts (Steiner, 2017).

The lack of training data is a common problem in machine learning studies of mammography. To overcome this problem, U-Net applies data augmentations such as shift, rotation, grey value variation, and random elastic deformation (Ronneberger, Fischer & Brox, 2017). Class-conditional generative adversarial networks (GANs) are also used to synthesize lesions on healthy screening images and so produce new training samples (Wu, Wu, Cox & Lotter, 2018).

For cancer diagnosis, in addition to simple classification, we also care about the location of the cancer cells. This is the motivation for U-Net, an image segmentation algorithm proposed by Ronneberger, Fischer and Brox (2017). A U-Net consists of a series of pooling layers (the contracting section) followed by a series of upsampling layers, so the neural network as a whole has a symmetric U-shape. It uses the upsampling layers to increase the resolution, and therefore the accuracy, of the output, and it combines the upsampling section with feature maps from the contracting section in order to localize the output within context.
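
To illustrate how the feature-map shapes flow through one level of the "U", here is a toy NumPy sketch. It deliberately omits the learned convolutions of a real U-Net, uses nearest-neighbour repetition in place of learned up-convolutions, and its function names are hypothetical; it only shows pooling, upsampling, and the concatenation with the contracting-path features.

```python
import numpy as np

def max_pool_2x2(x):
    """Halve the height and width of an (H, W, C) feature map (H and W even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x(x):
    """Double the height and width by nearest-neighbour repetition."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

rng = np.random.default_rng(0)
features = rng.random((256, 256, 4))   # contracting-path feature map

down = max_pool_2x2(features)          # (128, 128, 4): coarser, more context
up = upsample_2x(down)                 # (256, 256, 4): back to input resolution

# Skip connection: concatenate high-resolution features from the contracting
# path with the upsampled features along the channel axis.
merged = np.concatenate([features, up], axis=-1)
print(down.shape, up.shape, merged.shape)
```

The concatenation is what lets the expanding path recover fine spatial detail that pooling discarded, which is why the per-pixel output can follow lesion boundaries.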

As convolutional networks become deeper, experiments show that they become harder to train: with the same number of iterations, they have higher training error than shallower networks. Deep residual learning is proposed to solve this degradation problem (He, Zhang, Ren & Sun, 2016). It adds shortcut connections so that certain layers explicitly approximate residual functions. This architecture brings two benefits: first, the network is easier to optimize (it reaches low training error faster); second, identity shortcut connections add no extra parameters or computational complexity.
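
In the notation of He et al. (2016), if the mapping a stack of layers should learn is $H(x)$, the layers are instead asked to fit the residual $F(x) = H(x) - x$, and the block output is formed by adding the input back through the identity shortcut:

$$
y = F(x, \{W_i\}) + x
$$

Here $x$ and $y$ are the input and output of the block and $\{W_i\}$ are the weights of the stacked layers. The shortcut term $x$ carries no weights of its own, which is why it adds no parameters.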

## Data

We use data from the CBIS-DDSM data set. We use this data set because it includes cleaned data from DDSM and all the images are reformatted to DICOM files. The whole data set covers two kinds of abnormalities: masses and calcifications. We only use the mass cases, because we think the two categories may differ enough that they need to be identified by separate models.

The original data set is already divided into training data and testing data, which were picked carefully by its curators so that the two sets contain similar proportions of each class and case type. The training data set contains 1318 cases, while the testing data set contains 378 cases.

For each case, a DICOM file of the X-ray image and a DICOM file of the region-of-interest segmentation with the same shape are provided. The region-of-interest segmentation files are masks annotated pixel by pixel with 0 (normal) and 1 (abnormal). The images are often arrays with both width and height larger than 4000, and their shapes vary, so we crop the images into $256 \times 256$ patches. To limit the size of the data set and avoid meaningless patches (all dark, or containing other parts of the body), we keep every patch whose mask contains a 1 (which means the patch contains an abnormality) and randomly pick patches with no trouble regions. This results in 19702 training images and 5835 test images. The proportion of abnormal data in both data sets is 0.3.

The code for data cleaning and data preprocessing is presented below.

In [None]:
import json
import random  # used for randomly sampling patches without abnormality
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pydicom

mass_train_description = pd.read_csv("mass_case_description_train_set.csv")

# Get the path of the image and store the array into the dictionary whose key is patient id, side and view

file_dic = {}
for i in range(len(mass_train_description)):
    patient_id = mass_train_description['patient_id'].iloc[i]
    side = mass_train_description['left or right breast'].iloc[i]
    view = mass_train_description['image view'].iloc[i]
    pid = patient_id + side + view

    p = Path("Mass_Training_image/"+(mass_train_description['image file path'].iloc[i]).split('/')[0])
    lst = list(p.glob('**/*.dcm'))
    dcm = pydicom.dcmread(str(lst[0]))
    file = "Mass_Training/" + pid + "train_X.npy"
    np.save(file, dcm.pixel_array)
    file_dic[pid] = [file]
    
    df = mass_train_description[
        (mass_train_description['patient_id']==patient_id) & (mass_train_description['left or right breast']==side) & (mass_train_description['image view']==view)]

    # The same image may have more than one overlay (ROI mask) file, so we sum them to obtain the final mask
    label = np.zeros(dcm.pixel_array.shape)
    for j in range(len(df)):
        p = Path("Mass_Training_label/"+(df['ROI mask file path'].iloc[j]).split('/')[0])
        lst = list(p.glob('**/*.dcm'))
        for pic in lst:
            mask_dcm = pydicom.dcmread(str(pic))
            # Only files with the same shape as the image are added; others are skipped
            if mask_dcm.pixel_array.shape == label.shape:
                label = label + mask_dcm.pixel_array
    file = "Mass_Training/" + pid + "train_y.npy"
    np.save(file, label)
    file_dic[pid].append(file)

mass_training_path = pd.DataFrame(file_dic)
mass_training_path = mass_training_path.transpose()
mass_training_path.columns = ['train_X', 'train_y']
mass_training_path.to_csv("mass_training_path.csv")

mass_test_description = pd.read_csv("mass_case_description_test_set.csv")

# Get the path of the image and store the array into the dictionary whose key is patient id, side and view

file_dic = {}
for i in range(len(mass_test_description)):
    patient_id = mass_test_description['patient_id'].iloc[i]
    side = mass_test_description['left or right breast'].iloc[i]
    view = mass_test_description['image view'].iloc[i]
    pid = patient_id + side + view

    p = Path("Mass_Testing_image/"+(mass_test_description['image file path'].iloc[i]).split('/')[0])
    lst = list(p.glob('**/*.dcm'))
    dcm = pydicom.dcmread(str(lst[0]))
    file = "Mass_Testing/" + pid + "test_X.npy"
    np.save(file, dcm.pixel_array)
    file_dic[pid] = [file]
    
    df = mass_test_description[
        (mass_test_description['patient_id']==patient_id) & (mass_test_description['left or right breast']==side) & (mass_test_description['image view']==view)]

    # The same image may have more than one overlay (ROI mask) file, so we sum them to obtain the final mask
    label = np.zeros(dcm.pixel_array.shape)
    for j in range(len(df)):
        p = Path("Mass_Testing_label/"+(df['ROI mask file path'].iloc[j]).split('/')[0])
        lst = list(p.glob('**/*.dcm'))
        for pic in lst:
            mask_dcm = pydicom.dcmread(str(pic))
            # Only files with the same shape as the image are added; others are skipped
            if mask_dcm.pixel_array.shape == label.shape:
                label = label + mask_dcm.pixel_array
    file = "Mass_Testing/" + pid + "test_y.npy"
    np.save(file, label)
    file_dic[pid].append(file)

mass_testing_path = pd.DataFrame(file_dic)
mass_testing_path = mass_testing_path.transpose()
mass_testing_path.columns = ['test_X', 'test_y']
mass_testing_path.to_csv("mass_testing_path.csv")

mass_training = pd.read_csv("mass_training_path.csv")
x_shape = 256
y_shape = 256
p = 0.1

k = 1
file_dic = {}
for num in range(len(mass_training)):
    p_X =Path(mass_training['train_X'].iloc[num])
    p_y = Path(mass_training['train_y'].iloc[num])
    train_X = np.load(p_X)
    train_y = np.load(p_y)
    # Number of tiles along each axis (partial tiles at the border are skipped below)
    x_dim = train_X.shape[0] // x_shape + 1
    y_dim = train_X.shape[1] // y_shape + 1

    # crop the images into 256*256 patches
    for i in range(x_dim):
        for j in range(y_dim):
            s = train_X[i*x_shape:(i+1)*x_shape,j*y_shape:(j+1)*y_shape]
            l = train_y[i*x_shape:(i+1)*x_shape,j*y_shape:(j+1)*y_shape]
            if(s.shape == (x_shape,y_shape)):
                if (sum(sum(l)) > 0):
                    path_X = Path("Train_X/"+str(k)+(mass_training['train_X'].iloc[num]).split('/')[1])
                    np.save(path_X, s)
                    path_y = Path("Train_y/"+str(k)+(mass_training['train_X'].iloc[num]).split('/')[1])
                    np.save(path_y, l)
                    file_dic[k] = [path_X, path_y]
                    k = k + 1
                # randomly choose images without abnormality
                elif (random.random() < p):
                    # filter all dark images and meaningless images
                    if (sum(sum(s)) > (x_shape*y_shape*110)) & (sum(sum(s)) < (x_shape*y_shape*140)):
                        path_X = Path("Train_X/"+str(k)+(mass_training['train_X'].iloc[num]).split('/')[1])
                        np.save(path_X, s)
                        path_y = Path("Train_y/"+str(k)+(mass_training['train_X'].iloc[num]).split('/')[1])
                        np.save(path_y, l)
                        file_dic[k] = [path_X, path_y]
                        k = k + 1
                        
mass_training_path = pd.DataFrame(file_dic)
mass_training_path = mass_training_path.transpose()
mass_training_path.columns = ['train_X', 'train_y']
mass_training_path.to_csv("mass_training_path.csv")

mass_testing = pd.read_csv("mass_testing_path.csv")
x_shape = 256
y_shape = 256
p = 0.1

k = 1
file_dic = {}
for num in range(len(mass_testing)):
    p_X =Path(mass_testing['test_X'].iloc[num])
    p_y = Path(mass_testing['test_y'].iloc[num])
    test_X = np.load(p_X)
    test_y = np.load(p_y)
    x_dim = int(test_X.shape[0] / x_shape) + 1
    y_dim = int(test_X.shape[1] / y_shape) + 1

    # crop the images into 256*256 patches
    for i in range(x_dim):
        for j in range(y_dim):
            s = test_X[i*x_shape:(i+1)*x_shape,j*y_shape:(j+1)*y_shape]
            l = test_y[i*x_shape:(i+1)*x_shape,j*y_shape:(j+1)*y_shape]
            if(s.shape == (x_shape,y_shape)):
                if (sum(sum(l)) > 0):
                    path_X = Path("Test_X/"+str(k)+(mass_testing['test_X'].iloc[num]).split('/')[1])
                    np.save(path_X, s)
                    path_y = Path("Test_y/"+str(k)+(mass_testing['test_X'].iloc[num]).split('/')[1])
                    np.save(path_y, l)
                    file_dic[k] = [path_X, path_y]
                    k = k + 1
                # randomly choose images without abnormality
                elif (random.random() < p):
                    # filter all dark images and meaningless images
                    if (sum(sum(s)) > (x_shape*y_shape*110)) & (sum(sum(s)) < (x_shape*y_shape*140)):
                        path_X = Path("Test_X/"+str(k)+(mass_testing['test_X'].iloc[num]).split('/')[1])
                        np.save(path_X, s)
                        path_y = Path("Test_y/"+str(k)+(mass_testing['test_X'].iloc[num]).split('/')[1])
                        np.save(path_y, l)
                        file_dic[k] = [path_X, path_y]
                        k = k + 1
                        
mass_testing_path = pd.DataFrame(file_dic)
mass_testing_path = mass_testing_path.transpose()
mass_testing_path.columns = ['test_X', 'test_y']
mass_testing_path.to_csv("mass_testing_path.csv")
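
The two cropping loops above apply the same rule to the training and test images. As a minimal sketch, that rule could be factored into one helper; the function name `crop_patches` and its parameters are hypothetical, and the mean-intensity check is assumed to be equivalent to the sum thresholds used above (110 and 140 per pixel).

```python
import numpy as np

def crop_patches(image, mask, size=256, p=0.1, low=110, high=140, rng=None):
    """Return (patch, patch_mask) pairs cut from non-overlapping size x size tiles.

    Tiles containing any abnormal pixel are always kept. Other tiles are kept
    with probability p, and only if their mean intensity lies strictly between
    low and high, which filters out dark or otherwise uninformative tiles.
    """
    if rng is None:
        rng = np.random.default_rng()
    patches = []
    # Stepping by `size` and stopping early skips partial tiles at the border.
    for top in range(0, image.shape[0] - size + 1, size):
        for left in range(0, image.shape[1] - size + 1, size):
            patch = image[top:top + size, left:left + size]
            patch_mask = mask[top:top + size, left:left + size]
            if patch_mask.sum() > 0:
                patches.append((patch, patch_mask))
            elif rng.random() < p and low < patch.mean() < high:
                patches.append((patch, patch_mask))
    return patches

# Toy example: a 512x600 image yields four full tiles; one contains a lesion.
image = np.full((512, 600), 120, dtype=np.uint16)
mask = np.zeros((512, 600))
mask[10, 10] = 1
print(len(crop_patches(image, mask, p=1.0)))  # all four tiles pass the filter
print(len(crop_patches(image, mask, p=0.0)))  # only the abnormal tile is kept
```

Passing a seeded generator, for example `rng=np.random.default_rng(0)`, would make the random selection of normal patches reproducible.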

## References

[1] Steiner. “Solving Cancer: The Use of Artificial Neural Networks in Cancer Diagnosis and Treatment.” Journal of Young Investigators, 33(5). 2017.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, doi:10.1109/cvpr.2016.90.

[3] Olaf Ronneberger, Philipp Fischer, Thomas Brox. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” Informatik Aktuell Bildverarbeitung Für Die Medizin 2017, 2017, pp. 3–3, doi:10.1007/978-3-662-54345-0_3.

[4] https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM

[5] Bo Dai, Sanja Fidler, Raquel Urtasun, Dahua Lin. “Towards Diverse and Natural Image Descriptions via a Conditional GAN.” 2017 IEEE International Conference on Computer Vision (ICCV), 2017, doi:10.1109/iccv.2017.323.

[6] Eric Wu, Kevin Wu, David Cox, William Lotter. “Conditional Infilling GANs for Data Augmentation in Mammogram Classification.” Image Analysis for Moving Organ, Breast, and Thoracic Images, Lecture Notes in Computer Science, 2018, pp. 98–106, doi:10.1007/978-3-030-00946-5_11.