# BMI 260 ASSIGNMENT #3 | Mammogram Spring 2018

## Name 1: Joseph Nicolls

## Name 2: Alex Lu

Breast cancer has the highest incidence and second highest mortality rate for women in the US. 
Your task is to use machine learning to classify and/or segment mammograms, or to do neither, as long as you justify why your chosen approach is useful. If someone turns in a deep dream assignment using mammograms, this might be amusing but not so useful to patients. Consider this a mini-project; we highly suggest you work with one other person, who can be someone on your team.

In addition to the mammograms, the dataset includes segmentations and mass_case_description_train_set.csv, which contains information about the mass shape, mass margins, assessment number, pathology diagnosis, and subtlety. Take some time to research what all of these fields mean and whether you can use them in your algorithm. You don't need to use everything that is provided to you.

Some ideas:

1. Use the ROIs or segmentations to extract features, and then train a classifier on those features using the algorithms presented in the machine learning lectures; it does not need to be deep learning.

2. Use convolutional neural networks; feel free to use any of the code we went over in class or your own (custom code, sklearn, Keras, TensorFlow, etc.). If you don't want to place helper functions and classes into this notebook, place them in a .py file in the same folder called helperfunctions.py and import them into this notebook.

The data is here:

https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM

If you do not like Python, you can use a different language and turn your assignment in as a folder with all your code, a folder with all your figures, and a LaTeX or doc file with the writeup. The writeup doesn't need to be long; one page will do. Cite at least one clinical paper and one technical paper. If you like Python, please place the writeup and code into this notebook. Use the markdown to tell us what you are doing in each section. You will not be graded on the performance of your model, only on the scientific soundness of your claims, methodology, evaluation (i.e., fair but insightful statistics), and discussion of the shortcomings of what you tried.

The format for this homework is as follows:

1. Describe what you are doing and why it matters to patients, using at least one citation.

2. Describe the relevant statistics of the data. How were the images taken? How were they labeled? What is the class balance and the majority classifier accuracy? How will you divide the data into testing, training, and validation sets?

3. Describe your data pipeline, i.e., how the data is scrubbed, normalized, stored, and fed to the model for training.

4. Explain how the model you chose works alongside the code for it, with at least one technical citation to give credit where credit is due.

5. There are many ways to do training; take us through how you do it (e.g., "we used early stopping and decided when to stop based on validation loss").

6. Make a figure displaying your results.

7. Discuss the pros and cons of your method and what you might have done differently now that you've tried it, or what you would try if you had more time.

## Traditional ML techniques for Classification of Mammograms

### Approach and Relevance

Mammograms are difficult to classify because the identifying features of malignant tumors can be vague, and masses can appear anywhere in the breast tissue with any orientation. Though the setup for this problem suggests the use of CNNs, we want to use traditional ML techniques in this exploratory research. The principal reason is that traditional ML techniques allow a kind of feature analysis that is much harder to perform with CNNs. According to a study by Britton et al. (2012), the sensitivity of radiologists in classifying mammograms can vary between 53.1% and 74.1%. Through feature analysis, indicative features could be highlighted for radiologists to focus on, potentially raising sensitivities and decreasing the variability in sensitivity between radiologists.


### Data Source


The images were taken from the Digital Database for Screening Mammography (DDSM), a database which consists of 2620 mammogram studies. These images are X-rays.

We will first try to classify the masses using the following features:
* features in features_matrix.csv (ASM, Area, Centroid coordinates, ...)

We anticipate some potential issues with this approach: the number of features is large, and many of them are potentially correlated. Let's try running PCA on this data to figure out which features seem important.
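
The PCA step mentioned above is not implemented in this notebook. A minimal sketch of what it could look like follows, assuming a standardized feature matrix; the data here is synthetic and stands in for the contents of features_matrix.csv, and the 95% variance threshold is an arbitrary choice. Note that PCA ranks directions by variance, not by how well they predict malignancy.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the feature matrix (rows = masses, columns = features);
# the real values would come from features_matrix.csv.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# Six columns where the last three are noisy copies of the first three,
# mimicking correlated hand-crafted features.
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# A float n_components keeps just enough components to explain that
# fraction of the variance (this requires the full SVD solver).
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

print("components kept:", pca.n_components_)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Absolute loadings show which original features dominate each component.
loadings = np.abs(pca.components_)
print("dominant feature index per component:", loadings.argmax(axis=1))
```

If far fewer components than features are kept, that is a sign that many of the features are redundant.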

First, let's import the packages that we're going to use.

In [18]:
import h5py 
import numpy as np
import pandas as pd
import seaborn as sns


In [19]:
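import os 
import random

# seed both Python's RNG and NumPy's global RNG; scikit-learn falls back to
# NumPy's global state whenever no random_state is passed
random.seed(42)
np.random.seed(42)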

In [20]:
from sklearn.linear_model import Lasso
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc
from sklearn.metrics import roc_curve

from sklearn.model_selection import train_test_split

Defining global variables for paths, etc. 

In [21]:
data_parent_dir = "./data_fixed_crop_w_mask"
precomputed_path = "features_matrix.csv"
validation_prop = .1


In [29]:
def get_image_features(paths):
    """Load pixel data, label and name from every HDF5 file under each directory in paths."""
    data = []
    label = []
    name = []
    for first_path in paths:
        if first_path.startswith('.'):  # skip hidden dot files that show up in the listing
            continue
        local_dir = os.path.join(data_parent_dir, first_path)
        for image in os.listdir(local_dir):
            # the context manager closes each file handle once it has been read
            with h5py.File(os.path.join(local_dir, image), 'r') as hf:
                data.append(np.array(hf.get('data')))
                label.append(np.array(hf.get('label')))
                name.append(np.array(hf.get('name')))
    d = {'pixel_data': data, 'label': label, 'name': name}
    df = pd.DataFrame(data=d)
    return df
            
            

In [35]:
def preprocess():
    
    # read the image files and the precomputed feature table
    all_paths = os.listdir(data_parent_dir)
    image_df = get_image_features(all_paths)
    precomputed_df = pd.read_csv(precomputed_path)

    # zero-center and scale every feature column (the first column is skipped)
    precomputed_fts = precomputed_df.values[:,1:]
    precomputed_fts = np.array(precomputed_fts,dtype=np.float32)   
    precomputed_fts -= np.mean(precomputed_fts, axis = 0)
    precomputed_fts /= np.std(precomputed_fts, axis=0)
    
    # write the standardized values back into the dataframe
    feature_names = list(precomputed_df)[1:]
    precomputed_df[feature_names] = precomputed_fts
    
    # join on the row index; this assumes the rows of features_matrix.csv are in
    # the same order as the files returned by os.listdir, which is not guaranteed
    df = image_df.join(precomputed_df)

 #   feature_names.append("pixel_data")
    # hold out 20% of the rows for testing
    train, test = train_test_split(df, test_size=0.2)
    return train, test, feature_names
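
The preprocess function above standardizes the features with the mean and standard deviation of all rows before calling train_test_split, so the test rows contribute to the scaling statistics. A minimal sketch of the alternative order (split first, then fit the scaler on the training rows only) is shown here. The data is synthetic, and the variable names and the use of StandardScaler and stratify are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the precomputed feature matrix and binary labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
y = rng.integers(0, 2, size=100)

# Split first so the held-out rows do not influence the scaling statistics.
# stratify=y keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # mean/std estimated on train only
X_test_std = scaler.transform(X_test)        # the same statistics reused on test

print("train column means:", X_train_std.mean(axis=0))  # approximately zero
print("test column means:", X_test_std.mean(axis=0))    # near zero, not exactly
```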
    

In [31]:
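def train_models(train,feature_names):
    X = train[feature_names]
    y = train['label'].astype('int')

    # Lasso is an L1-penalized linear regressor, fit here on the 0/1 labels;
    # with the default alpha=1.0 and standardized features the penalty can
    # shrink every coefficient to zero, which gives constant predictions
    lassoClf = Lasso()
    lassoClf.fit(X, y)
    
    # support vector classifier with default settings (RBF kernel)
    svcClf = SVC()
    svcClf.fit(X, y)
    
    # random forest with default hyperparameters
    rfClf = RandomForestClassifier()
    rfClf.fit(X, y)
    
    return [lassoClf, svcClf, rfClf]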
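
Lasso is a regression model, so one alternative that keeps the L1 sparsity while producing class probabilities is an L1-penalized logistic regression. The sketch below is an illustration under assumptions, not the method used in this notebook: the data is synthetic, and the value C=0.1 is an arbitrary regularization strength that would normally be tuned on validation data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic standardized features: only column 0 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# penalty="l1" needs a solver that supports it, such as "liblinear" or "saga".
# C is the inverse regularization strength; smaller C means more shrinkage.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l1_clf.fit(X, y)

coefs = l1_clf.coef_.ravel()
probs = l1_clf.predict_proba(X[:3])[:, 1]
print("coefficients:", coefs)
print("P(class 1) for the first 3 rows:", probs)
```

Coefficients that are driven to exactly zero mark features the model ignores, which fits the feature-analysis goal stated in the Approach section.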
    
    
    
    
    

In [32]:
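def validate_models(models, test, feature_names):
    y_true = test['label'].astype('int')
    for model in models:
        # predict() returns continuous values for Lasso but hard 0/1 labels for
        # SVC and RandomForestClassifier, so the last two AUROCs are computed
        # from a single operating point rather than from a ranking of scores
        y_pred = model.predict(test[feature_names])
        print("auroc: " +  str(roc_auc_score(y_true, y_pred)))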
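
AUROC is meant to be computed from continuous scores, while predict returns hard class labels for SVC and RandomForestClassifier. A minimal sketch of score-based evaluation follows. The helper continuous_scores is hypothetical (it is not part of this notebook or of scikit-learn), and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def continuous_scores(model, X):
    """Hypothetical helper: return one ranking score per row."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]   # probability of class 1
    if hasattr(model, "decision_function"):
        return model.decision_function(X)     # signed distance to the boundary
    return model.predict(X)                   # regressors are already continuous


# Synthetic data with real signal in the first two columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] + rng.normal(size=400) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "svc": SVC().fit(X_train, y_train),
    "rf": RandomForestClassifier(random_state=0).fit(X_train, y_train),
}

results = {}
for name, model in models.items():
    hard = roc_auc_score(y_test, model.predict(X_test))
    soft = roc_auc_score(y_test, continuous_scores(model, X_test))
    results[name] = (hard, soft)
    print(f"{name}: AUROC from labels = {hard:.3f}, from scores = {soft:.3f}")
```

SVC exposes predict_proba only when it is constructed with probability=True, so the helper falls back to decision_function for the default SVC.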
    
    

In [33]:
def pipeline():
    train, test, feature_names = preprocess()
    models = train_models(train, feature_names)
    validate_models(models, test, feature_names)
    

In [36]:
pipeline()


auroc: 0.5
auroc: 0.44826828410689173
auroc: 0.48430907172995774
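
The three values above come from the Lasso, SVC, and random forest models, in that order. The notebook imports roc_curve and auc but never calls them; a sketch of how they could produce the points of an ROC curve for a results figure is given here. It runs on synthetic data with a single random forest, so the numbers it prints say nothing about the mammogram results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data: column 0 carries the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
scores = rf.predict_proba(X_test)[:, 1]  # probability of the positive class

# roc_curve sweeps the decision threshold and returns one (fpr, tpr) pair
# per threshold; auc integrates the curve with the trapezoidal rule.
fpr, tpr, thresholds = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)

print("number of ROC points:", len(fpr))
print("area under the curve:", roc_auc)
```

The fpr and tpr arrays can then be passed to a plotting library, for example matplotlib's plt.plot(fpr, tpr), with the diagonal drawn as the chance-level reference.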


# References

Britton P, Warwick J, Wallis MG, et al. Measuring the accuracy of diagnostic imaging in symptomatic breast patients: team and individual performance. The British Journal of Radiology. 2012;85(1012):415-422. doi:10.1259/bjr/32906819.