# BMI 260 ASSIGNMENT #3 | Mammogram Spring 2018

## Name 1: Joseph Nicolls

## Name 2: Alex Lu

Breast cancer has the highest incidence and second highest mortality rate for women in the US. 
Your task is to use machine learning to classify and/or segment mammograms. You may also pick a different task, as long as you justify why it is useful. If someone turns in a deep dream assignment using mammograms, this might be amusing but not so useful to patients. Consider this a mini-project; we highly suggest you work with one other person, who can be someone on your team.

In addition to the mammograms, the dataset includes segmentations and mass_case_description_train_set.csv, which contains information about the mass shape, mass margins, assessment number, pathology diagnosis, and subtlety. Take some time to research what all of these fields mean and whether you can use them in your algorithm. You don't need to use everything that is provided to you.

Some ideas:

1. Use the ROIs or segmentations to extract features, and then train a classifier on those features using the algorithms presented to you in the machine learning lectures. This does not need to be deep learning.

2. Use convolutional neural networks. Feel free to use any of the code we went over in class or your own (custom code, sklearn, Keras, TensorFlow, etc.). If you don't want to place helper functions and classes in this notebook, place them in a .py file called helperfunctions.py in the same folder and import them into this notebook.
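As a rough illustration of idea 1, the sketch below derives a few shape features from a binary segmentation mask with NumPy. The function name `mask_shape_features` is hypothetical, the feature names simply mirror the `Area`, `Centroid_x`, and `Centroid_y` columns that appear in the output later in this notebook, and normalizing the centroid by the image size is an assumption about how those columns were computed.

```python
import numpy as np

def mask_shape_features(mask):
    """Return pixel area and size-normalized centroid of a binary mask (illustrative helper)."""
    mask = np.asarray(mask).astype(bool)
    area = int(mask.sum())
    if area == 0:
        raise ValueError("mask contains no foreground pixels")
    rows, cols = np.nonzero(mask)          # coordinates of foreground pixels
    height, width = mask.shape
    return {
        "Area": area,
        # Assumption: centroid is expressed as a fraction of image width/height.
        "Centroid_x": float(cols.mean()) / width,
        "Centroid_y": float(rows.mean()) / height,
    }

# Toy 4x4 mask with a 2x2 block of foreground pixels.
toy_mask = np.zeros((4, 4), dtype=np.uint8)
toy_mask[1:3, 2:4] = 1
features = mask_shape_features(toy_mask)
print(features)
```

Features like these can be stacked into one row per image and passed to any of the classical classifiers from the lectures.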

The data is here:

https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM

If you do not like Python, you can use a different language and turn your assignment in as a folder with all your code, a folder with all your figures, and a LaTeX or doc file with the writeup. The writeup doesn't need to be long: one page will do. Cite at least one clinical paper and one technical paper. If you like Python, please place the writeup and code into this notebook. Use the markdown to tell us what you are doing in each section. You will not be graded on the performance of your model, only on the scientific soundness of your claims, methodology, evaluation (i.e., fair but insightful statistics), and discussion of the shortcomings of what you tried.

The format for this homework is as follows:

1. Describe what you are doing and why it matters to patients using at least one citation

2. Describe the relevant statistics of the data. How were the images taken? How were they labeled? What is the class balance and majority classifier accuracy? How will you divide the data into training, validation, and testing sets?

3. Describe your data pipeline, i.e., how the data is scrubbed, normalized, stored, and fed to the model for training.

4. Explain how the model you chose works alongside the code for it, with at least one technical citation to give credit where credit is due.

5. There are many ways to do training; take us through how you do it (e.g., "we used early stopping and decided when to stop based on validation loss").

6. Make a figure displaying your results

7. Discuss the pros and cons of your method and what you might have done differently now that you've tried it, or what you would try if you had more time.
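The class balance and majority classifier accuracy requested in item 2 can be computed directly from the label column. The following is a minimal sketch on made-up labels; `majority_baseline` and `toy_labels` are illustrative names and are not part of the dataset or this notebook's pipeline.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority label, its accuracy as a constant predictor, class proportions)."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    balance = {label: n / len(labels) for label, n in counts.items()}
    return majority_label, majority_count / len(labels), balance

# Toy labels: 0 = benign, 1 = malignant (made-up values for illustration only).
toy_labels = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
label, accuracy, balance = majority_baseline(toy_labels)
print(label, accuracy, balance)
```

Any trained model should be compared against this baseline, since a classifier that always predicts the majority class already reaches that accuracy.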

In [94]:
import h5py
import numpy as np
import sklearn
import pandas as pd
import os
import random
from sklearn.model_selection import train_test_split

from sklearn.linear_model import Lasso
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier


from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc
from sklearn.metrics import roc_curve


In [95]:
data_parent_dir = "./data_fixed_crop_w_mask"
semantic_path = "features_matrix.csv"
validation_prop = .1
random.seed(42)

In [127]:
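def validate_models(models, test, feature_names):
    for model in models:
        # Use continuous scores: hard labels reduce the ROC curve to a single point.
        if hasattr(model, 'predict_proba'):  # RandomForestClassifier
            y_score = model.predict_proba(test[feature_names])[:, 1]
        elif hasattr(model, 'decision_function'):  # SVC
            y_score = model.decision_function(test[feature_names])
        else:  # Lasso is a regressor, so predict() already returns a score
            y_score = model.predict(test[feature_names])
        print("auroc: " + str(roc_auc_score(test['label'].astype('int'), y_score)))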
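The imports at the top of this notebook include `roc_curve` and `auc`, which are not otherwise used here. A minimal sketch of how they could produce the points for a results figure is shown below; the labels and scores are toy values, not results from this dataset.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy ground-truth labels and continuous model scores (illustrative values only).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps the decision threshold and returns one (FPR, TPR) point per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# auc integrates the curve with the trapezoidal rule; it matches roc_auc_score.
area = auc(fpr, tpr)
print(area, roc_auc_score(y_true, y_score))
```

Plotting `tpr` against `fpr` for each model on the same axes gives a compact comparison figure.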
    
    

In [132]:
def train_models(train,feature_names):
    lassoClf = Lasso()
    lassoClf.fit(train[feature_names], train['label'].astype('int'))
    
    svcClf = SVC()
    svcClf.fit(train[feature_names], train['label'].astype('int'))
    
    rfClf = RandomForestClassifier()
    rfClf.fit(train[feature_names], train['label'].astype('int'))
    
    return [lassoClf, svcClf, rfClf]
    
    
    
    
    

In [98]:
def get_image_features(paths):
    data = []
    label = []
    name = []
    for first_path in paths:
        local_dir = os.path.join(data_parent_dir, first_path)
        for image in os.listdir(local_dir):
            # Open read-only and close the file once its arrays are copied into memory.
            with h5py.File(os.path.join(local_dir, image), 'r') as hf:
                data.append(np.array(hf.get('data')))
                label.append(np.array(hf.get('label')))
                name.append(np.array(hf.get('name')))
    d = {'pixel_data':data, 'label':label, 'name':name}
    df = pd.DataFrame(data=d)
    return df
            
            

In [101]:
def preprocess():
    all_paths = os.listdir(data_parent_dir)
    image_df = get_image_features(all_paths)
    semantic_df = pd.read_csv(semantic_path)

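    # join() aligns rows by index, so semantic_df must list images in the same order as image_df.
    df = image_df.join(semantic_df)
    feature_names = list(semantic_df)[1:]  # drop the first column of the CSV
 #   feature_names.append("pixel_data")
    # random.seed() does not control scikit-learn's shuffling, so fix the split with random_state.
    train, test = train_test_split(df, test_size=0.2, random_state=42)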
    return train, test, feature_names
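The configuration cell defines `validation_prop = .1`, but `preprocess()` only produces a train/test split. One possible way to add a validation set is sketched below. The helper `three_way_split` is hypothetical, and treating `validation_prop` as a fraction of the non-test data, as well as stratifying on the label, are assumptions rather than the authors' stated method.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(X, y, test_size=0.2, validation_prop=0.1, seed=42):
    """Split into train/validation/test, keeping class proportions in each part."""
    # First hold out the test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    # Then carve the validation set out of what remains (assumed meaning of validation_prop).
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=validation_prop, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

# Toy data: 100 samples, 60 of class 0 and 40 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)
train_part, val_part, test_part = three_way_split(X, y)
print(len(train_part[1]), len(val_part[1]), len(test_part[1]))
```

Model selection and early stopping would then use the validation part, leaving the test part untouched until the final evaluation.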
    

In [130]:
def pipeline():
    train, test, feature_names = preprocess()
    models = train_models(train, feature_names)
    validate_models(models, test, feature_names)
    

In [133]:
pipeline()


         ASM      Area  Centroid_x  Centroid_y  Contrast  Dissimilarity  \
1095  0.9905   67842.0    0.331235    0.617191    0.0015         0.0012   
787   0.9935   44132.0    0.651422    0.461004    0.0012         0.0008   
749   0.9968   38430.0    0.605663    0.493404    0.0008         0.0004   
951   0.9890  129393.0    0.548505    0.161380    0.0025         0.0010   
1505  0.9982   31090.0    0.513563    0.354078    0.0007         0.0002   
100   0.9961   44685.0    0.498383    0.062689    0.0011         0.0004   
1336  0.9962   25419.0    0.562396    0.800842    0.0013         0.0007   
1326  0.9882  136186.0    0.577631    0.086982    0.0026         0.0012   
505   0.9955   29826.0    0.845618    0.611984    0.0012         0.0008   
49    0.9919   64504.0    0.745665    0.262636    0.0017         0.0009   
1356  0.9911   56783.0    0.611938    0.562459    0.0024         0.0009   
290   0.9962   23883.0    0.528286    0.473205    0.0016         0.0005   
419   0.9847   89211.0   

# References