# BMI 260 ASSIGNMENT #3 | Mammogram Spring 2018

## Name 1: Joseph Nicolls

## Name 2: Alex Lu

Breast cancer has the highest incidence and second highest mortality rate for women in the US. 
Your task is to use machine learning to classify and/or segment mammograms, or to do something else entirely, as long as you justify why it is useful. If someone turns in a deep dream assignment using mammograms, this might be amusing, but not so useful to patients. Consider this a mini-project; we highly suggest you work with one other person, who can be someone on your team. 

In addition to the mammograms, the dataset includes segmentations and mass_case_description_train_set.csv, which contains information about the mass shape, mass margins, assessment number, pathology diagnosis, and subtlety. Take some time to research what all of these fields mean and whether you can use them in your algorithm. You don't need to use everything that is provided to you. 

Some ideas:

1. Use the ROIs or segmentations to extract features, then train a classifier on those features using the algorithms presented in the machine learning lectures; it does not need to be deep learning. 

2. Use convolutional neural networks; feel free to use any of the code we went over in class or your own (custom code, sklearn, Keras, TensorFlow, etc.). If you don't want to place helper functions and classes in this notebook, place them in a .py file in the same folder called helperfunctions.py and import them into this notebook. 

The data is here:

https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM

If you do not like Python, you can use a different language and turn your assignment in as a folder with all your code, a folder with all your figures, and a LaTeX or doc file with the writeup. The writeup doesn't need to be long; one page will do. Cite at least one clinical paper and one technical paper. If you like Python, please place the writeup and code in this notebook. Use the markdown cells to tell us what you are doing in each section. You will not be graded on the performance of your model, only on the scientific soundness of your claims, methodology, evaluation (i.e. fair but insightful statistics), and discussion of the shortcomings of what you tried. 

The format for this homework is as follows:

1. Describe what you are doing and why it matters to patients using at least one citation

2. Describe the relevant statistics of the data: how were the images taken? How were they labeled? What is the class balance and majority classifier accuracy? How will you divide the data into testing, training, and validation?

3. Describe your data pipeline, i.e. how the data is scrubbed, normalized, stored, and fed to the model for training. 

4. Explain how the model you chose works alongside the code for it, with at least one technical citation to give credit where credit is due. 

5. There are many ways to do training; take us through how you do it (e.g. we used early stopping and decided when to stop based on validation loss).  

6. Make a figure displaying your results.

7. Discuss the pros and cons of your method and what you might have done differently now that you've tried, or would try if you had more time. 

## Traditional ML techniques for Classification of Mammograms

###  Approach and Relevance

Mammograms are difficult to classify because the features that identify malignant tumors can be vague, and masses can appear anywhere in the breast tissue, in any orientation. Though the setup for this problem suggests using CNNs, we use traditional ML techniques in this exploratory research. The principal reason is that traditional ML techniques allow a kind of feature analysis that is much harder to do with CNNs. According to a study by Britton et al., the sensitivity of radiologists in classifying mammograms can vary between 53.1% and 74.1%. Through feature analysis, indicative features could be highlighted for radiologists to focus on, potentially raising sensitivities and decreasing the variability in sensitivity between radiologists. 




### Data Source


The images were taken from the Digital Database for Screening Mammography (DDSM), a database of 2620 mammogram studies. These images are X-rays, recorded as grayscale images. Previous work had been done to crop these images to the ROI containing the mass. The file features_matrix.csv records the results of calculations on these masses. We divided the data into training and testing sets at random, holding out 20% for testing (the `test_size=0.2` used in the preprocessing code below). 52% of these images are negatively labeled while 48% are positively labeled, so a classifier that always predicts the majority class would be about 52% accurate. 
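
As a quick check on the class balance quoted above, the majority-classifier baseline can be worked out directly from the label counts. This is a minimal sketch that assumes the counts 786 (label 0) and 722 (label 1) printed by the image-loading step in the run recorded later in this notebook; the helper name `majority_baseline` is ours, for illustration only.

```python
def majority_baseline(counts):
    """Return (majority_label, accuracy) for a classifier that always
    predicts the most common label."""
    total = sum(counts.values())
    majority_label = max(counts, key=counts.get)
    return majority_label, counts[majority_label] / total


# label counts as printed by get_image_features() in the recorded run
label_counts = {0: 786, 1: 722}
majority_label, baseline_acc = majority_baseline(label_counts)
print(f"always predicting {majority_label} gives accuracy {baseline_acc:.3f}")
```

Any model we train should be compared against this number rather than against 50%.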

In addition, semantic features associated with some of the images have also been included. These semantic features include the type of view in the image, which side of the patient the breast was on, and case descriptions with a limited vocabulary describing characteristics of the mass. We believe that semantic features such as breast density, mass shape, and mass margins can be highly useful. The limited vocabulary of these qualitative features allows one-hot encoding and incorporation into our feature vectors for traditional ML methods. 



We first attempt classification using the following features: 
* features in the features_matrix.csv (ASM, Area, Centroid coordinates, ...) 


First, let's import the packages that we're going to use 

In [173]:
import h5py 
import numpy as np
import pandas as pd
import seaborn as sns


# References

Britton P, Warwick J, Wallis MG, et al. Measuring the accuracy of diagnostic imaging in symptomatic breast patients: team and individual performance. The British Journal of Radiology. 2012;85(1012):415-422. doi:10.1259/bjr/32906819.

In [174]:
import os 
import random
random.seed(42)

In [175]:
from sklearn import preprocessing

from sklearn.linear_model import Lasso
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.metrics import confusion_matrix

from sklearn.decomposition import PCA

from sklearn.model_selection import train_test_split

Defining global variables for paths, etc. 

In [176]:
data_parent_dir = "./data_fixed_crop_w_mask"
precomputed_path = "features_matrix.csv"
validation_prop = .1


A function that reads in image data

In [177]:
def get_image_features(paths):
    '''
        input:
            * paths, a list of the paths that we need to input 
        output:
            * a dataframe containing image data, label, and name 
    '''
    print("preparing to read images")
    
    # initialize data structures 
    data = []
    label = []
    name = []
    
    # iterate over the subdirectories of the data directory
    for first_path in paths:
        if first_path[0] == '.': # skip hidden dot files that come up :(
            continue
        local_dir = os.path.join(data_parent_dir, first_path)
        for image in os.listdir(local_dir):
            # open read-only and close the file handle once the arrays are copied out
            with h5py.File(os.path.join(local_dir, image), 'r') as hf:
                data.append(np.array(hf.get('data')))
                label.append(np.array(hf.get('label')).item(0))
                name.append(np.array(hf.get('name')))
    d = {'pixel_data':data, 'label':label, 'name':name}
   
    df = pd.DataFrame(data=d)
    print(len(df[df['label'] == 0]))
    print(len(df[df['label'] == 1]))
    print(len(df))
    return df
            
            

We anticipate some issues with this approach: the number of features is large, and many are potentially correlated. For that reason, we apply dimensionality reduction (PCA) and look at the variance explained by each principal component, along with the loadings of the first component.

In [178]:
def reduce_dimensionality(raw_data, new_dims=3):
    '''
        input:
            * raw_data, the raw matrix that will be reduced in dimensionality 
            * new_dims, the number of leading principal components to keep
        output:
            * the dimensionality-reduced data 
        
    '''
    print("preparing to reduce dimensionality")
    pca = PCA()
    pca.fit(raw_data)
    print(">>> variance explained by each principal component")
    print(pca.explained_variance_ratio_)  
    print(">>> the first principal component")
    print(pca.components_[0])
    reduced = pca.transform(raw_data)[:,:new_dims]
    return reduced
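
The function above keeps a fixed `new_dims=3`. One alternative, sketched here as an assumption rather than what this notebook does, is to keep the smallest number of components whose cumulative explained variance reaches a target. The helper `components_for_variance` and the synthetic matrix are hypothetical stand-ins, not the real feature data.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(data, target=0.90):
    """Smallest number of leading principal components whose cumulative
    explained-variance ratio reaches `target` (hypothetical helper)."""
    pca = PCA().fit(data)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n_components = int(np.searchsorted(cumulative, target) + 1)
    # cap at the number of components in case rounding leaves the sum just under target
    return min(n_components, len(cumulative)), cumulative


# synthetic stand-in for the standardized feature matrix (NOT the real data):
# 8 correlated columns driven by 3 latent factors plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 8))
demo = latent @ mixing + 0.05 * rng.normal(size=(200, 8))

n_keep, cumulative = components_for_variance(demo, target=0.90)
print(n_keep, np.round(cumulative, 3))
```

Applied to the real features, the returned count could be passed as `new_dims`.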

The beginnings of our preprocessing pipeline

In [179]:
def preprocess(dim_reduc=None):
    print('Entering preprocessing')
    
    feature_names = []
    
    # reading things in 
    all_paths = os.listdir(data_parent_dir)
    image_df = get_image_features(all_paths)
    precomputed_df = pd.read_csv(precomputed_path)

    # zero-centering, normalization
    precomputed_fts = precomputed_df.values[:,1:]
    new_fts = np.array(precomputed_fts,dtype=np.float32)   
    new_fts -= np.mean(new_fts, axis = 0)
    new_fts /= np.std(new_fts, axis=0)
    
    total_patientIDs = precomputed_df.values[:,0]
    # dimensionality reduction 
    if dim_reduc is not None:
        new_fts = reduce_dimensionality(new_fts, dim_reduc)
        feature_names += ["pc# "+str(x) for x in range(1, dim_reduc+1)]
        precomputed_df = pd.DataFrame(new_fts, columns=feature_names, index=precomputed_df.index)
        
    # adding back to df 
    else:
        feature_names += list(precomputed_df)[1:]
        precomputed_df[feature_names] = new_fts
    
    # joining dfs
    df = image_df.join(precomputed_df)

 #   feature_names.append("pixel_data")
    # fixed random_state so the split is reproducible between runs
    train, test = train_test_split(df, test_size=0.2, random_state=42)
    
    return (train[feature_names], train['label'], test[feature_names], test['label'], total_patientIDs)
#    return train, test, feature_names
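
Note that `preprocess` standardizes and runs PCA on all rows before splitting, so the test rows influence the scaling and the components. A minimal sketch of one way to avoid that, assuming scikit-learn's `Pipeline` utilities and using random stand-in data (not the real features), is to split first and fit the transformations on the training rows only:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the precomputed feature matrix and labels (NOT the real data)
rng = np.random.default_rng(260)
X = rng.normal(size=(300, 18))
y = rng.integers(0, 2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# the scaler and PCA learn their parameters from the training rows only
reducer = make_pipeline(StandardScaler(), PCA(n_components=3))
X_train_red = reducer.fit_transform(X_train)
# the held-out rows are transformed with the already-fitted parameters
X_test_red = reducer.transform(X_test)

print(X_train_red.shape, X_test_red.shape)
```

With features this weakly predictive the leakage is unlikely to change the conclusion, but fitting on the training rows keeps the test estimate honest.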
    

Now that we've pretty much established how our data should be preprocessed, we want to introduce and train some basic machine learning models 

In [180]:
def train_models(train, labels):
    
    #lassoClf = Lasso()
    #lassoClf.fit(train, labels)
    
    svcClf = SVC(random_state = 260)
    svcClf.fit(train, labels)
    
    rfClf = RandomForestClassifier(random_state = 260)
    rfClf.fit(train, labels)
    
    return [svcClf, rfClf]


  #  return [lassoClf, svcClf, rfClf]
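
The commented-out `Lasso` is a regression model, which is presumably why it was dropped. Since feature analysis is the motivation for this approach, a possible substitute is an L1-penalized logistic regression, whose near-zero coefficients flag uninformative features, read alongside the random forest's `feature_importances_`. This is a sketch on synthetic data with hypothetical feature names, not a result from the mammogram features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# synthetic data (NOT the mammogram features): only column 0 carries signal
rng = np.random.default_rng(260)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=400) > 0).astype(int)
feature_names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical names

# the L1 penalty pushes coefficients of uninformative features towards zero
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l1_clf.fit(X, y)

rf_clf = RandomForestClassifier(n_estimators=100, random_state=260)
rf_clf.fit(X, y)

for name, coef, imp in zip(feature_names, l1_clf.coef_[0], rf_clf.feature_importances_):
    print(f"{name}: l1 coef = {coef:+.3f}, forest importance = {imp:.3f}")
```

Coefficients are only comparable across features when the inputs are standardized, as they are in `preprocess`.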
    
    
    
    
    

In [181]:
def validate_models(models, test, y_true):
    '''
        input: 
            * models, a list of trained models 
            * test, the test-set features on which models will be evaluated
            * y_true, the true labels for the test set
    '''
    print("Preparing to validate models")
    for model in models:
        y_pred = model.predict(test)
        print(confusion_matrix(y_true, y_pred))
        #print(y_pred)
        # note: y_pred holds hard 0/1 labels, so this value equals balanced accuracy
        print("auroc: " +  str(roc_auc_score(y_true, y_pred)))
        
    print("Done validating models")
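
`validate_models` passes hard 0/1 predictions to `roc_auc_score`. With binary predictions the ROC curve has a single interior point, so the reported value equals balanced accuracy. The sketch below, on synthetic data rather than the mammogram features, shows that identity and how a continuous score such as `decision_function` could be used instead.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic, weakly separable data (NOT the mammogram features)
rng = np.random.default_rng(260)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(random_state=260).fit(X_train, y_train)

hard_labels = clf.predict(X_test)        # 0/1 predictions
scores = clf.decision_function(X_test)   # signed distance to the decision boundary

auc_from_labels = roc_auc_score(y_test, hard_labels)
auc_from_scores = roc_auc_score(y_test, scores)
bal_acc = balanced_accuracy_score(y_test, hard_labels)

print(auc_from_labels, bal_acc, auc_from_scores)
```

For the random forest, the analogous continuous score would be `predict_proba(X_test)[:, 1]`.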
    
    

In [183]:
def pipeline():
    X_train, y_train, X_test, y_test, all_IDs = preprocess(dim_reduc=3)
    models = train_models(X_train, y_train)
    validate_models(models, X_test, y_test)
    

In [184]:
pipeline()


Entering preprocessing
preparing to read images
786
722
1508
preparing to reduce dimensionality
>>> variance explained by each principal component
[4.4473657e-01 1.6824159e-01 9.1169290e-02 5.7367690e-02 5.6748547e-02
 5.2320685e-02 3.6198188e-02 3.1822890e-02 2.2656877e-02 1.7601196e-02
 9.3034180e-03 7.1804966e-03 2.3674807e-03 1.1376145e-03 8.7159796e-04
 2.6753976e-04 6.5508975e-06 1.8152512e-06]
>>> the first principal component
[-0.33582258  0.3129383  -0.04206447 -0.01204051  0.31197217  0.3208049
  0.00518964 -0.33550972 -0.30273733  0.21556541  0.16609216  0.02282244
  0.00318173  0.30442756  0.21560022 -0.05786494  0.3394732   0.22484674]
Preparing to validate models
[[92 66]
 [83 61]]
auroc: 0.5029447960618847
[[106  52]
 [ 88  56]]
auroc: 0.529887482419128
Done validating models
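
Because sensitivity is the clinical quantity we care about, it helps to read it off the confusion matrices printed above. This sketch assumes scikit-learn's layout `[[TN, FP], [FN, TP]]`, that label 1 is the positive class, and that the first matrix belongs to the SVC and the second to the random forest (the order returned by `train_models`). The helper `rates` is ours.

```python
def rates(cm):
    """Sensitivity, specificity and their mean for a 2x2 matrix laid out
    as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2


# confusion matrices copied from the output above
printed = {
    "SVC": [[92, 66], [83, 61]],
    "RandomForest": [[106, 52], [88, 56]],
}
results = {name: rates(cm) for name, cm in printed.items()}
for name, (sens, spec, mean_rate) in results.items():
    print(f"{name}: sensitivity={sens:.3f} specificity={spec:.3f} mean={mean_rate:.3f}")
```

The mean of sensitivity and specificity reproduces the "auroc" values printed above, which confirms that those numbers are balanced accuracies computed from hard labels.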


Okay, so both models perform at roughly chance level (the scores printed above are about 0.50 and 0.53), which is no better than the majority-class baseline. Next we're going to try adding the semantic feature encoding to see if that helps. 

In [185]:
semantics_path = "mass_case_description_train_set.csv"
semantic_df = pd.read_csv(semantics_path)
semantic_df.dropna(inplace=True)
semantic_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1273 entries, 0 to 1317
Data columns (total 13 columns):
patient_id        1273 non-null object
breast_density    1273 non-null int64
side              1273 non-null object
view              1273 non-null object
abn_num           1273 non-null int64
mass_shape        1273 non-null object
mass_margins      1273 non-null object
assessment        1273 non-null int64
pathology         1273 non-null object
subtlety          1273 non-null int64
od_img_path       1273 non-null object
od_crop_path      1273 non-null object
mask_path         1273 non-null object
dtypes: int64(4), object(9)
memory usage: 139.2+ KB


In [186]:
semantic_feature_names = ['breast_density', 'abn_num', 'mass_shape', 'mass_margins', 'assessment']
semantic_label_name = ['pathology']

In [187]:
for feature in semantic_feature_names:
    le = preprocessing.LabelEncoder()
    #print(semantic_df[feature].values)
    #print(semantic_df[semantic_df[feature].isnull() == True])
    le.fit(list(semantic_df[feature].astype(str)))
    print(le.classes_)
    # transform the same string representation that the encoder was fitted on
    semantic_df[feature] = le.transform(semantic_df[feature].astype(str))

['1' '2' '3' '4']
['1' '2' '3' '4' '5' '6']
['ARCHITECTURAL_DISTORTION' 'ASYMMETRIC_BREAST_TISSUE'
 'FOCAL_ASYMMETRIC_DENSITY' 'IRREGULAR'
 'IRREGULAR-ARCHITECTURAL_DISTORTION' 'IRREGULAR-FOCAL_ASYMMETRIC_DENSITY'
 'LOBULATED' 'LOBULATED-ARCHITECTURAL_DISTORTION' 'LOBULATED-IRREGULAR'
 'LOBULATED-LYMPH_NODE' 'LOBULATED-OVAL' 'LYMPH_NODE' 'OVAL'
 'OVAL-LYMPH_NODE' 'ROUND' 'ROUND-IRREGULAR-ARCHITECTURAL_DISTORTION'
 'ROUND-LOBULATED' 'ROUND-OVAL']
['CIRCUMSCRIBED' 'CIRCUMSCRIBED-ILL_DEFINED'
 'CIRCUMSCRIBED-MICROLOBULATED' 'CIRCUMSCRIBED-OBSCURED' 'ILL_DEFINED'
 'ILL_DEFINED-SPICULATED' 'MICROLOBULATED' 'MICROLOBULATED-ILL_DEFINED'
 'MICROLOBULATED-ILL_DEFINED-SPICULATED' 'MICROLOBULATED-SPICULATED'
 'OBSCURED' 'OBSCURED-ILL_DEFINED' 'OBSCURED-ILL_DEFINED-SPICULATED'
 'OBSCURED-SPICULATED' 'SPICULATED']
['0' '1' '2' '3' '4' '5']



In [208]:
enc = preprocessing.OneHotEncoder(sparse=False)
semantic_one_hots = enc.fit_transform(semantic_df[semantic_feature_names])
has_semantic = [s1 + "_" + s2 + "_" + s3 for (s1, s2, s3) in zip(list(semantic_df['patient_id']), list(semantic_df['side']), list(semantic_df['view']))]
semantic_encoded_df = pd.DataFrame(semantic_one_hots, index=has_semantic)
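
The encoding above treats compound values such as `LOBULATED-OVAL` as categories of their own. A possible alternative, offered as an assumption and not what this notebook does, is a multi-hot encoding that splits on `-` so each atomic descriptor gets one named column. It relies on `-` appearing only as the separator between descriptors, which holds for the vocabulary printed above.

```python
import pandas as pd

# a few mass_shape values copied from the LabelEncoder output above
shapes = pd.Series(
    ["IRREGULAR-ARCHITECTURAL_DISTORTION", "OVAL", "LOBULATED-OVAL", "IRREGULAR"],
    name="mass_shape",
)

# split on '-' and emit one indicator column per atomic descriptor
multi_hot = shapes.str.get_dummies(sep="-").add_prefix("mass_shape_")
print(multi_hot)
```

Named columns also make later feature analysis easier to read than the integer column labels produced by the one-hot matrix.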

In [209]:
# precomputed_df = pd.read_csv(precomputed_path)
# total_patientIDs = precomputed_df.values[:,0]
# missing_semantics = []
# for id1 in total_patientIDs:
#     if id1 not in has_semantic:
#         missing_semantics.append(id1)

# print(len(missing_semantics))

In [210]:
precomputed_df = pd.read_csv(precomputed_path)
total_patientIDs = precomputed_df.values[:,0]
print(len(total_patientIDs))
print(len(has_semantic))
missing_semantics = set(total_patientIDs).difference(has_semantic)
print(len(missing_semantics))
_, one_hot_length = semantic_one_hots.shape
# sorted() gives the index a deterministic order (a set is unordered)
missing_semantics_df = pd.DataFrame(np.zeros((len(missing_semantics), one_hot_length)), index=sorted(missing_semantics))
missing_semantics_df.info()

1599
1273
669
<class 'pandas.core.frame.DataFrame'>
Index: 669 entries, P_00668_LEFT_MLO to P_00947_RIGHT_MLO
Data columns (total 49 columns):
0     669 non-null float64
1     669 non-null float64
2     669 non-null float64
3     669 non-null float64
4     669 non-null float64
5     669 non-null float64
6     669 non-null float64
7     669 non-null float64
8     669 non-null float64
9     669 non-null float64
10    669 non-null float64
11    669 non-null float64
12    669 non-null float64
13    669 non-null float64
14    669 non-null float64
15    669 non-null float64
16    669 non-null float64
17    669 non-null float64
18    669 non-null float64
19    669 non-null float64
20    669 non-null float64
21    669 non-null float64
22    669 non-null float64
23    669 non-null float64
24    669 non-null float64
25    669 non-null float64
26    669 non-null float64
27    669 non-null float64
28    669 non-null float64
29    669 non-null float64
30    669 non-null float64
31    669 non-null f
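
The counts above (1273 semantic rows plus 669 zero-filled rows) add up to more than the 1599 rows in the feature matrix, which suggests the semantic table has repeated image IDs or IDs that are absent from the feature matrix. We have not verified which. One way to get exactly one semantic row per feature row, sketched here on toy data with hypothetical IDs, is to collapse repeated IDs and then reindex to the feature-matrix IDs:

```python
import numpy as np
import pandas as pd

# toy stand-ins (hypothetical IDs): one-hot rows keyed by image ID, with a repeated key
semantic_rows = pd.DataFrame(
    np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
    index=["P_1_LEFT_CC", "P_1_LEFT_CC", "P_2_RIGHT_MLO"],
)
feature_ids = ["P_1_LEFT_CC", "P_2_RIGHT_MLO", "P_3_LEFT_MLO"]

# collapse repeated keys (e.g. several abnormalities in one image) with an element-wise max,
# then reorder to the feature-matrix IDs, filling images without semantics with zeros
aligned = semantic_rows.groupby(level=0).max().reindex(feature_ids, fill_value=0.0)
print(aligned)
```

Taking the element-wise max is a design choice: it records that a descriptor appears for at least one abnormality in the image.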

In [207]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
semantic_encoded_df = pd.concat([semantic_encoded_df, missing_semantics_df])
semantic_encoded_df.info()
        

<class 'pandas.core.frame.DataFrame'>
Index: 1942 entries, P_00001_LEFT_CC to P_00947_RIGHT_MLO
Data columns (total 49 columns):
0     1942 non-null float64
1     1942 non-null float64
2     1942 non-null float64
3     1942 non-null float64
4     1942 non-null float64
5     1942 non-null float64
6     1942 non-null float64
7     1942 non-null float64
8     1942 non-null float64
9     1942 non-null float64
10    1942 non-null float64
11    1942 non-null float64
12    1942 non-null float64
13    1942 non-null float64
14    1942 non-null float64
15    1942 non-null float64
16    1942 non-null float64
17    1942 non-null float64
18    1942 non-null float64
19    1942 non-null float64
20    1942 non-null float64
21    1942 non-null float64
22    1942 non-null float64
23    1942 non-null float64
24    1942 non-null float64
25    1942 non-null float64
26    1942 non-null float64
27    1942 non-null float64
28    1942 non-null float64
29    1942 non-null float64
30    1942 non-null float64
31 