# BIOMEDIN 260/RAD260: Problem Set 3 - Mammogram Project

## Spring 2021

## Name 1:

Aaron Sossin

## Name 2:

Vivian Zhu

## Introduction

Among cancers affecting women in the US, breast cancer has the highest incidence and the second-highest mortality rate.

Your task is to utilize machine learning to study mammograms in any way you want (e.g. classification, segmentation) as long as you justify why it is useful to do whatever it is you want to do. Turning in a deep dream assignment using mammograms might be amusing, for example, but not so useful to patients. That being said, choose something that interests you. As the adage goes, "do what you love, and you’ll never have to work another day in your life, at least in BMI 260."

Treat this as a mini-project. We highly encourage working with 1 other person, possibly someone in your main project team. 

In addition to the mammograms themselves, the dataset includes "ground-truth" segmentations and `mass_case_description_train_set.csv`, which contains metadata about mass shapes, mass margins, assessment numbers, pathology diagnoses, and subtlety in the data. Take some time to research what all of these different fields mean and how you might utilize them in your work. You don't need to use all of what is provided to you.

Some ideas:

1. Use the ROIs or segmentations to extract features, and then train a classifier based on those features using the algorithms presented to you in the machine learning lectures (doesn't need to use deep learning).

2. Use convolutional neural networks. Feel free to use any of the code we went over in class or use your own (custom code, sklearn, Keras, TensorFlow, etc.). If you don't want to place helper functions and classes into this notebook, place them in a `.py` file in the same folder called `helperfunctions.py` and import them into this notebook.

## Data

The data is here:

https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM

## Grading and Submission

This assignment has 3 components: code, figures (outputs/analyses of your code), and a write-up detailing your mini-project. You will be graded on these categories.

If you're OK with Python or R, please place all three parts into this notebook/.Rmd file that we have provided where indicated. We have written template sections for you to follow for simplicity/completeness. When you're done, save as a `.pdf` (please knit to `.pdf` if you are using `.Rmd`, or knit to `.html` and use a browser's "Print" function to convert to `.pdf`).

If you don't like Python OR R, we will allow you to use a different language, but please turn your assignment in with: 1) a folder with all your code, 2) a folder with all your figures, and 3) a `.tex`/`.doc`/`.pdf` file with a write-up.

## PROJECT TITLE HERE

**1. Describe what you are doing and why it matters to patients using at least one citation.**

Breast cancer is one of the most common cancers in women worldwide. 

We will use two deep learning approaches in this project: 

In the first approach, we plan to use the mammogram images and their masks to predict whether a tumor is benign or malignant. We plan to use a standard convolutional neural network that takes the image and its mask as a two-channel input and returns the diagnosis as a one-dimensional vector. This image-based method could help doctors identify patients with malignant tumors, and it could also serve as a second check on a doctor's diagnosis. 


In the second approach, we plan to use each mammogram image to predict its corresponding mask. We will use both convolutional neural networks and U-Nets for this image segmentation task. 

**2. Describe the relevant statistics of the data. How were the images taken? How were they labeled? What is the class balance and majority classifier accuracy? How will you divide the data into testing, training and validation sets?**

For our first approach, our data consist of X-ray mammograms and their masks. The dataset includes each patient's left and right sides in both the craniocaudal (CC) and mediolateral oblique (MLO) views. Each case is labeled by doctors as malignant, benign, or benign_without_callback. 

For the second approach, the input is the mammogram image and the target output is the corresponding mask image. 
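Question 2 asks for the class balance and the majority classifier accuracy. A minimal sketch of how both could be computed from a list of pathology labels follows. The helper names `class_balance` and `majority_classifier_accuracy` are hypothetical, and the toy label counts are made up for illustration; they are not the real dataset statistics.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: fraction of samples} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def majority_classifier_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    return max(class_balance(labels).values())


# Hypothetical toy labels (NOT the real class counts of the dataset).
toy_labels = (
    ["malignant"] * 5
    + ["benign"] * 4
    + ["benign_without_callback"] * 1
)

balance = class_balance(toy_labels)
baseline = majority_classifier_accuracy(toy_labels)
print(balance)
print(baseline)
```

With the real data, the same functions could be applied to the `pathology` column of the case description table. Any trained classifier should be compared against this baseline rather than against chance.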

In [152]:
# SPACE FOR CODE FOR QUESTION 2 HERE, SHOULD YOU NEED IT

In [3]:
import os

import cv2
import h5py
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pydicom as dicom
import skimage
import tensorflow as tf
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelBinarizer
# Import everything from tensorflow.keras so that layers, models, and
# optimizers all come from the same Keras implementation.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (
    Input, Dense, Flatten, Dropout, BatchNormalization,
    Conv2D, MaxPooling2D, UpSampling2D, concatenate,
)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.utils import plot_model

# First approach: using images to predict whether the outcome is malignant or not

In [14]:
# preprocess data
file_root = "data"
case = pd.read_csv("mass_case_description_train_set.csv")

def get_data(path):
    """Load every .h5 file under `path` and look up its pathology label in `case`."""
    X = []
    y = []
    for root, directories, files in os.walk(path):
        for file in files:
            stem, ext = os.path.splitext(file)
            if ext != ".h5":
                continue
            # Directory names are patient numbers, e.g. data/00051 -> P_00051
            patient_id = "P_" + os.path.basename(root)
            # File names look like LEFT_CC.h5, i.e. <side>_<view>
            side, view = stem.split("_")[:2]
            pathology = case[(case.patient_id == patient_id) & (case.side == side) & (case.view == view)]["pathology"].tolist()
            if len(pathology) > 0:
                # Read the array into memory so the file can be closed right away
                with h5py.File(os.path.join(root, file), 'r') as h5:
                    X.append(h5['data'][()])
                y.append(pathology[0])
    X = np.array(X)
    y = np.array(y)
    return X, y

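def normalize_data(x, y):
    """Standardize each image per channel and one-hot encode the labels."""
    # Cast to float so the in-place arithmetic is valid for integer inputs
    x = x.astype("float32")
    x -= x.mean(axis=(1, 2), keepdims=True)
    # The small epsilon avoids division by zero for constant channels
    x /= x.std(axis=(1, 2), keepdims=True) + 1e-7
    encoder = LabelBinarizer()
    y = encoder.fit_transform(y)
    return x, y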

In [15]:
X, y = get_data(file_root)
X, y = normalize_data(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.1)
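Question 2 also asks how the data are divided into training, validation, and testing sets, while the cell above only makes a single 90/10 split. A minimal sketch of a three-way split that keeps the class proportions roughly equal in each subset follows. The function `stratified_split`, its default fractions, and the toy labels are assumptions for illustration. It does not group images by patient, which may matter here because each patient can contribute several views.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac=0.1, test_frac=0.1, seed=42):
    """Return (train_idx, val_idx, test_idx), splitting each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx, test_idx = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = int(round(len(indices) * val_frac))
        n_test = int(round(len(indices) * test_frac))
        val_idx.extend(indices[:n_val])
        test_idx.extend(indices[n_val:n_val + n_test])
        train_idx.extend(indices[n_val + n_test:])
    return sorted(train_idx), sorted(val_idx), sorted(test_idx)


# Hypothetical toy labels (NOT the real class counts of the dataset).
toy_labels = (
    ["malignant"] * 50
    + ["benign"] * 40
    + ["benign_without_callback"] * 10
)
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

The returned index lists could then be used to slice `X` and `y`, so that the validation set guides training decisions and the test set is only touched once at the end.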


In [16]:
def model_1():
    model = Sequential()
    model.add(Conv2D(filters=16, kernel_size=(5, 5), activation="relu", input_shape=(256,256,2)))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Conv2D(filters=32, kernel_size=(5, 5), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Conv2D(filters=64, kernel_size=(5, 5), activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Conv2D(filters=64, kernel_size=(5, 5), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
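    # Softmax gives a probability distribution over the three mutually exclusive classes
    model.add(Dense(3, activation='softmax'))
    # summary() prints the table itself and returns None, so it is not wrapped in print()
    model.summary()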
    return model

In [17]:
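model = model_1()
# The labels are one-hot vectors over three mutually exclusive classes,
# so categorical cross-entropy is the matching loss.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Note: the held-out test split doubles as validation data here, so the reported
# validation accuracy is not an independent estimate of test performance.
model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test), batch_size=64)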

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 252, 252, 16)      816       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 126, 126, 16)      0         
_________________________________________________________________
dropout (Dropout)            (None, 126, 126, 16)      0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 122, 122, 32)      12832     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 61, 61, 32)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 61, 61, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 57, 57, 64)        5

<tensorflow.python.keras.callbacks.History at 0x7f334d329e10>
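The output shapes and parameter counts in the summary above follow from simple arithmetic for unpadded 5x5 convolutions and 2x2 max pooling. A minimal sketch that reproduces the visible rows follows; the helper names `conv_out`, `pool_out`, and `conv_params` are hypothetical, and the loop assumes the four convolution and pooling stages defined in `model_1`.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (size - kernel) // stride + 1


def pool_out(size, pool):
    """Output length of non-overlapping pooling (stride equal to pool size)."""
    return size // pool


def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter for a square 2D convolution."""
    return (kernel * kernel * in_channels + 1) * filters


size, channels = 256, 2  # input_shape=(256, 256, 2)
shapes = []
for filters in (16, 32, 64, 64):
    size = conv_out(size, 5)
    shapes.append(("conv", size, filters))
    size = pool_out(size, 2)
    shapes.append(("pool", size, filters))
    channels = filters

# Number of features that reach the first Dense layer after Flatten()
flattened = size * size * channels

for kind, side, depth in shapes:
    print(kind, (side, side, depth))
print("flattened features:", flattened)
print("first conv params:", conv_params(5, 2, 16))
print("second conv params:", conv_params(5, 16, 32))
```

The first two parameter counts match the 816 and 12832 reported in the summary, which is a quick check that the two-channel input was wired up as intended.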

# Second approach: Predicting masks from images

In [None]:
# filename = "data/00051/LEFT_CC.h5"
# h5 = h5py.File(filename, 'r')
# data= h5['data']
# label = h5['label']
# name = h5['name']
# X = data[:, :, 0]
# X2 = np.reshape(X, (256, 256, 1))

# X = cv2.merge((X, X, X))
# y = data[:, :, 1]
# plt.imshow(X)
# plt.show()
# plt.imshow(y)
# plt.show()
# y = np.reshape(y, (256, 256, 1))

In [None]:
# preprocess data
file_root = "data"

def get_data(path):
    """Load every .h5 file under `path` as (3-channel image, 1-channel mask) pairs."""
    X = []
    y = []
    for root, directories, files in os.walk(path):
        for file in files:
            if os.path.splitext(file)[1] != ".h5":
                continue
            # Read the array into memory so the file can be closed right away
            with h5py.File(os.path.join(root, file), 'r') as h5:
                data = h5['data'][()]
            # Channel 0 is used as the image; replicate it to three channels
            temp = data[:, :, 0]
            temp = cv2.merge((temp, temp, temp))
            X.append(temp)
            # Channel 1 is used as the target mask
            y.append(data[:, :, 1].reshape(256, 256, 1))
    X = np.array(X)
    y = np.array(y)
    return X, y

In [None]:
X, y = get_data(file_root)

In [None]:
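# Define the model functions

def model_3Layer_2D(learning_rate):
    """Three 1x1 convolutions that map each pixel's channels to one mask logit."""
    X_input = Input((256, 256, 3))
    X = BatchNormalization(name='bn0')(X_input)
    X = Conv2D(30, (1, 1), strides=(1, 1), name='conv0')(X)
    X = BatchNormalization(name='bn1')(X)
    X = Conv2D(10, (1, 1), strides=(1, 1), name='conv1')(X)
    X = BatchNormalization(name='bn2')(X)
    X = Conv2D(1, (1, 1), strides=(1, 1), name='conv2')(X)
    model = Model(inputs=X_input, outputs=X, name='ThreeLayer')
    opt = Adam(learning_rate=learning_rate)
    # 'accuracy' is a metric, not a loss. The last layer has no activation, so the
    # model outputs logits; binary cross-entropy on logits assumes a 0/1 mask.
    model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  optimizer=opt,
                  metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)])
    return model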



In [None]:
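model = model_3Layer_2D(0.001)
model.summary()
# The model has a single output channel, so sparse categorical cross-entropy would
# see only one class. Binary cross-entropy on logits is used instead, assuming the
# mask values are 0 or 1. A logit threshold of 0 corresponds to a probability of 0.5.
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)])
model.fit(X, y, epochs=3, batch_size=5)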
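Pixel accuracy can look high for segmentation when most of each image is background, so an overlap score is a common complement. A minimal sketch of the Dice coefficient for binary masks follows. The function `dice_coefficient` and the toy masks are hypothetical and are not part of the original notebook; the sketch assumes predicted masks have already been thresholded to 0/1.

```python
import numpy as np


def dice_coefficient(pred_mask, true_mask, eps=1e-7):
    """Dice overlap 2|A and B| / (|A| + |B|) between two binary masks."""
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    intersection = np.logical_and(pred, true).sum()
    # eps keeps the score defined (and equal to 1) when both masks are empty
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)


# Toy 4x4 example: the prediction covers the true 2x2 region plus two extra pixels.
true_mask = np.zeros((4, 4), dtype=np.uint8)
true_mask[1:3, 1:3] = 1
pred_mask = np.zeros((4, 4), dtype=np.uint8)
pred_mask[1:3, 1:4] = 1

score = dice_coefficient(pred_mask, true_mask)
print(round(float(score), 3))
```

For a model that outputs logits, one option would be to threshold the predictions at 0 before calling this function, then report the mean score over the held-out images.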


YOUR WRITTEN ANSWER TO QUESTION 2 HERE

**3. Describe your data pipeline (how is the data scrubbed, normalized, stored, and fed to the model for training?).**

The data were 

**4. Explain how the model you chose works alongside the code for it. Add at least one technical citation to give credit where credit is due.**

In [None]:
# YOUR CODE FOR QUESTION 4 HERE. USE ADDITIONAL CODE/MARKDOWN CELLS IF NEEDED

YOUR WRITTEN ANSWER TO QUESTION 4 HERE. USE ADDITIONAL CODE/MARKDOWN CELLS IF NEEDED

**5. There are many ways to do training. Take us through how you do it (e.g. "We used early stopping and stopped when validation loss increased twice in a row.").**

YOUR WRITTEN ANSWER TO QUESTION 5 HERE
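The rule quoted in question 5, stopping when validation loss increases twice in a row, can be sketched in plain Python as follows. The function `early_stopping_epoch` and the toy loss values are hypothetical and only illustrate the logic; they are not results from this project.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once the validation loss has increased for `patience`
    consecutive epochs.
    """
    consecutive_increases = 0
    for epoch in range(1, len(val_losses)):
        if val_losses[epoch] > val_losses[epoch - 1]:
            consecutive_increases += 1
        else:
            consecutive_increases = 0
        if consecutive_increases >= patience:
            return epoch
    return None


# Hypothetical validation losses, one per epoch (made up for illustration).
toy_val_losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.66, 0.70, 0.71]
stop_epoch = early_stopping_epoch(toy_val_losses)
print(stop_epoch)
```

In Keras, similar behavior is available through the `EarlyStopping` callback with arguments such as `monitor='val_loss'`, `patience`, and `restore_best_weights`. Its `patience` counts epochs without improvement over the best value so far, which is a slightly different criterion from the consecutive-increase rule sketched here.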

**6. Make a figure displaying your results.**

In [None]:
# YOUR CODE FOR QUESTION 6 HERE

YOUR WRITTEN ANSWER TO QUESTION 6 HERE

**7. Discuss pros and cons of your method and what you might have done differently now that you've tried or would try if you had more time.**

YOUR WRITTEN ANSWER TO QUESTION 7 HERE

**You will not be graded on the performance of your model. You'll only be graded on the scientific soundness of your claims, methodology, evaluation (i.e. fair but insightful statistics), and discussion of the strengths and shortcomings of what you tried. Feel free to reuse some of the code you are/will be using for your projects. The write-up doesn't need to be long (~1 page will suffice), but please cite at least one clinical paper and one technical paper (1 each in questions 1 and 4 at least, and more if needed).**