# Assignment 3

## Predict whether a mammogram mass is benign or malignant

We'll use the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).

The dataset contains 961 instances of masses detected in mammograms, each with the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. Age, shape, margin, and density are the features we will build our model with, and "severity" is the class we will attempt to predict from them.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. Shape, for example, is coded in increasing order from round to irregular.
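If you would rather not rely on that ordering, one-hot encoding is a common alternative for nominal columns. The sketch below is only an illustration on a few made-up rows (the `toy` frame and its values are not taken from the dataset); it uses `pd.get_dummies` to expand `Shape` and `Margin` into one indicator column per code.

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names (illustration only).
toy = pd.DataFrame({
    "Age":    [67, 43, 58, 28],
    "Shape":  [3, 1, 4, 1],
    "Margin": [5, 1, 5, 1],
})

# Replace each nominal column with one indicator column per observed code.
encoded = pd.get_dummies(toy, columns=["Shape", "Margin"])

print(list(encoded.columns))
```

Encoding this way widens the input, so the first layer of any network trained on it would need an input size that matches the new column count.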

False positives in mammogram results cause a lot of unnecessary anguish and surgery. If supervised machine learning can give us a better way to interpret them, it could improve a lot of lives.

## Your assignment

Build a Multi-Layer Perceptron and train it to classify masses as benign or malignant based on their features.

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.

Remember to normalize your data first! And experiment with different topologies, optimizers, and hyperparameters.

I was able to achieve over 80% accuracy - can you beat that?


## Let's begin: prepare your data

Start by importing the mammographic_masses.data file into a Pandas dataframe (hint: use read_csv) and take a look at it.

In [393]:
import numpy as np 
import pandas as pd 

mm = pd.read_csv("mammographic_masses.data")
mm



Unnamed: 0,5,67,3,5.1,3.1,1
0,4,43,1,1,?,1
1,5,58,4,5,3,1
2,4,28,1,1,3,0
3,5,74,1,5,?,1
4,4,65,1,?,3,0
...,...,...,...,...,...,...
955,4,47,2,1,3,0
956,4,56,4,5,3,1
957,4,64,4,5,3,0
958,5,66,4,5,3,1


Make sure you use the optional parameters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI_RADS, age, shape, margin, density, and severity):

In [394]:
columns = ['BI-RADS assessment', 'Age', 'Shape', 'Margin', 'Density', 'Severity']
mm_dataset = pd.read_csv('mammographic_masses.data', names = columns, na_values=['?'])


Evaluate whether the data needs cleaning; your model is only as good as the data it's given. Hint: use describe() on the dataframe.

In [395]:
mm_dataset.describe()

Unnamed: 0,BI-RADS assessment,Age,Shape,Margin,Density,Severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


There are quite a few missing values in the data set. Before we just drop every row that's missing data, let's make sure we don't bias our data in doing so. Does there appear to be any sort of correlation to what sort of data has missing fields? If there were, we'd have to try and go back and fill that data in.

In [396]:
mm_dataset.loc[(mm_dataset['Age'].isnull())   |
               (mm_dataset['Shape'].isnull()) |
               (mm_dataset['Margin'].isnull()) |
               (mm_dataset['Density'].isnull())]

Unnamed: 0,BI-RADS assessment,Age,Shape,Margin,Density,Severity
1,4.0,43.0,1.0,1.0,,1
4,5.0,74.0,1.0,5.0,,1
5,4.0,65.0,1.0,,3.0,0
6,4.0,70.0,,,3.0,0
7,5.0,42.0,1.0,,3.0,0
...,...,...,...,...,...,...
778,4.0,60.0,,4.0,3.0,0
819,4.0,35.0,3.0,,2.0,0
824,6.0,40.0,,3.0,4.0,1
884,5.0,,4.0,4.0,3.0,1
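Eyeballing the rows is a start, but a quick numeric comparison makes the check more concrete. The sketch below is one possible approach, run on made-up rows so it stands on its own (the `toy` frame is hypothetical; swap in `mm_dataset` to run it on the real data). It compares the share of malignant cases between complete rows and rows with gaps, and counts the gaps per column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows using the same column names as mm_dataset.
toy = pd.DataFrame({
    "Age":      [67, 43, 58, 28, np.nan, 65, 70, 42],
    "Shape":    [3, 1, 4, 1, 4, 1, np.nan, 1],
    "Margin":   [5, 1, 5, 1, 4, np.nan, np.nan, np.nan],
    "Density":  [3, np.nan, 3, 3, 3, 3, 3, 3],
    "Severity": [1, 1, 1, 0, 1, 0, 0, 0],
})
feature_cols = ["Age", "Shape", "Margin", "Density"]

# True for every row that is missing at least one feature value.
has_missing = toy[feature_cols].isnull().any(axis=1)
group = has_missing.map({True: "has_missing", False: "complete"})

# Share of malignant cases in each group; a large gap would hint at bias.
malignant_rate = toy.groupby(group)["Severity"].mean()

# Number of missing values per feature column.
missing_per_column = toy[feature_cols].isnull().sum()

print(malignant_rate)
print(missing_per_column)
```

If the two rates are close and no single column dominates the gaps, dropping the incomplete rows is less likely to skew the class balance.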


If the missing data seems randomly distributed, go ahead and drop rows with missing data. Hint: use dropna().

In [397]:
mm_dataset = mm_dataset.dropna()
mm_dataset


Unnamed: 0,BI-RADS assessment,Age,Shape,Margin,Density,Severity
0,5.0,67.0,3.0,5.0,3.0,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
8,5.0,57.0,1.0,5.0,3.0,1
10,5.0,76.0,1.0,4.0,3.0,1
...,...,...,...,...,...,...
956,4.0,47.0,2.0,1.0,3.0,0
957,4.0,56.0,4.0,5.0,3.0,1
958,4.0,64.0,4.0,5.0,3.0,0
959,5.0,66.0,4.0,5.0,3.0,1


In [398]:
mm_dataset.describe()

Unnamed: 0,BI-RADS assessment,Age,Shape,Margin,Density,Severity
count,830.0,830.0,830.0,830.0,830.0,830.0
mean,4.393976,55.781928,2.781928,2.813253,2.915663,0.485542
std,1.888371,14.671782,1.242361,1.567175,0.350936,0.500092
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,46.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0
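The summary still shows a BI-RADS maximum of 55 and a minimum of 0, both outside the 1 to 5 range given in the attribute list, so dropping incomplete rows did not remove every suspicious value. BI-RADS is discarded anyway, but the same kind of check applies to the feature columns. Here is a minimal sketch of a range check on made-up rows; the `toy` frame is hypothetical, and the Age bounds are an assumption since the attribute list gives none.

```python
import pandas as pd

# Ranges from the attribute list; the Age bounds are an assumed sanity range.
valid_ranges = {
    "BI-RADS assessment": (1, 5),
    "Age": (0, 120),
    "Shape": (1, 4),
    "Margin": (1, 5),
    "Density": (1, 4),
}

# Made-up rows reusing the dataset's column names (illustration only).
toy = pd.DataFrame({
    "BI-RADS assessment": [5, 4, 55, 0, 4],
    "Age":                [67, 43, 58, 28, 74],
    "Shape":              [3, 1, 4, 1, 1],
    "Margin":             [5, 1, 5, 1, 5],
    "Density":            [3, 3, 3, 3, 3],
})

# One boolean column per attribute: True where the value is outside its range.
out_of_range = pd.DataFrame({
    col: ~toy[col].between(low, high)
    for col, (low, high) in valid_ranges.items()
})

print(out_of_range.sum())                 # violations per column
print(toy[out_of_range.any(axis=1)])      # the offending rows
```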


Next you'll need to convert the Pandas dataframes into numpy arrays that can be used by scikit-learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels.

In [399]:
features = mm_dataset[['Age', 'Shape', 'Margin', 'Density']].values
classes = mm_dataset['Severity'].values
feature_name = ['Age', 'Shape', 'Margin', 'Density']
features

array([[67.,  3.,  5.,  3.],
       [58.,  4.,  5.,  3.],
       [28.,  1.,  1.,  3.],
       ...,
       [64.,  4.,  5.,  3.],
       [66.,  4.,  5.,  3.],
       [62.,  3.,  3.,  3.]])

Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. Hint: use preprocessing.StandardScaler().

In [400]:

from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
features_scaled = scaler.fit_transform(features)
features_scaled

array([[ 0.7650629 ,  0.17563638,  1.39618483,  0.24046607],
       [ 0.15127063,  0.98104077,  1.39618483,  0.24046607],
       [-1.89470363, -1.43517241, -1.157718  ,  0.24046607],
       ...,
       [ 0.56046548,  0.98104077,  1.39618483,  0.24046607],
       [ 0.69686376,  0.98104077,  1.39618483,  0.24046607],
       [ 0.42406719,  0.17563638,  0.11923341,  0.24046607]])
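To see what the scaler is doing, it can help to reproduce it by hand: each column has its mean subtracted and is then divided by its population standard deviation. The sketch below checks that on a tiny made-up array (the values in `X` are illustrative, not a slice of the real data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four made-up rows in the order Age, Shape, Margin, Density.
X = np.array([
    [67., 3., 5., 3.],
    [58., 4., 5., 3.],
    [28., 1., 1., 3.],
    [57., 1., 5., 2.],
])

# Manual z-score: subtract the column mean, divide by the column std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

# Both routes should agree, and every scaled column has mean 0 and std 1.
print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```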

In [401]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(mm_dataset, classes, test_size=0.4)
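Note that the call above hands `mm_dataset` itself to `train_test_split`, so `x_train` carries all six columns, including `Severity` (the label) and `BI-RADS assessment`, which the introduction says to discard. A model that can see the label among its inputs will score far higher than it would on unseen cases. The sketch below shows one way to split only the four feature columns. It runs on made-up rows; the `toy` frame is hypothetical, and `stratify` and `random_state` are optional additions rather than part of the original cell.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up rows with the same six columns as mm_dataset (illustration only).
toy = pd.DataFrame({
    "BI-RADS assessment": [5, 4, 5, 4, 5, 4, 4, 5, 4, 5],
    "Age":      [67, 43, 58, 28, 74, 65, 70, 42, 57, 60],
    "Shape":    [3, 1, 4, 1, 1, 1, 2, 1, 1, 4],
    "Margin":   [5, 1, 5, 1, 5, 1, 1, 1, 5, 4],
    "Density":  [3, 3, 3, 3, 2, 3, 3, 3, 3, 1],
    "Severity": [1, 0, 1, 0, 1, 0, 0, 0, 1, 1],
})

# Only the predictive attributes go into X; the label stays in y.
feature_cols = ["Age", "Shape", "Margin", "Density"]
X = toy[feature_cols].to_numpy(dtype="float32")
y = toy["Severity"].to_numpy(dtype="float32")

# stratify keeps the benign/malignant ratio equal in both parts;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

With a split like this, the first `Dense` layer would take four inputs rather than six.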

## Now build your neural network.

Now set up an actual MLP model using Keras:

In [402]:
import keras
from keras.layers import Dense, Dropout, Activation
from keras.models import Sequential

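# One hidden layer of 4 ReLU units. input_dim=6 matches the six columns of
# mm_dataset that train_test_split passes through (BI-RADS and Severity included).
model = Sequential()
model.add(Dense(4, input_dim=6, activation='relu'))
model.add(Dropout(0.2))  # randomly zero 20% of the hidden activations during training
# A single sigmoid unit outputs the estimated probability of malignancy.
model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model.summary()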

Model: "sequential_25"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_80 (Dense)            (None, 4)                 28        
                                                                 
 dropout_25 (Dropout)        (None, 4)                 0         
                                                                 
 dense_81 (Dense)            (None, 1)                 5         
                                                                 
Total params: 33
Trainable params: 33
Non-trainable params: 0
_________________________________________________________________
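The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input per unit, plus one bias per unit. A small sketch of that arithmetic follows (`dense_params` is a helper name made up for this example).

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


hidden = dense_params(6, 4)  # 6 inputs feeding 4 hidden units
output = dense_params(4, 1)  # 4 hidden units feeding 1 output unit
total = hidden + output      # Dropout adds no parameters

print(hidden, output, total)
```

This prints 28, 5, and 33, matching the summary.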


In [403]:
#from sklearn.model_selection import cross_val_score
#from keras.wrappers.scikit_learn import KerasClassifier
#estimate = KerasClassifier(build_fn=create_model, nb_epoch = 100, verbose=0)
#cv_scores = cross_val_score(estimate, features_scaled, classes, cv=10)
#cv_scores.mean()
# Split again; no random_state is set, so the split differs on every run.
x_train, x_test, y_train, y_test = train_test_split(mm_dataset, classes, test_size=0.4)

# Convert everything to float32 arrays before handing it to Keras.
x_train = np.asarray(x_train).astype(np.float32)
y_train = np.asarray(y_train).astype(np.float32)
x_test = np.asarray(x_test).astype(np.float32)
y_test = np.asarray(y_test).astype(np.float32)
model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics=['accuracy'])

trained = model.fit(x_train, y_train, batch_size=16, epochs=100,
                    validation_data=(x_test, y_test), verbose=2)

Epoch 1/100
32/32 - 5s - loss: 0.6967 - accuracy: 0.4659 - val_loss: 0.6748 - val_accuracy: 0.5422 - 5s/epoch - 161ms/step
Epoch 2/100
32/32 - 0s - loss: 0.6914 - accuracy: 0.4317 - val_loss: 0.6815 - val_accuracy: 0.5512 - 360ms/epoch - 11ms/step
Epoch 3/100
32/32 - 0s - loss: 0.6880 - accuracy: 0.4900 - val_loss: 0.6762 - val_accuracy: 0.5422 - 391ms/epoch - 12ms/step
Epoch 4/100
32/32 - 0s - loss: 0.6849 - accuracy: 0.4980 - val_loss: 0.6698 - val_accuracy: 0.5422 - 415ms/epoch - 13ms/step
Epoch 5/100
32/32 - 0s - loss: 0.6849 - accuracy: 0.4819 - val_loss: 0.6705 - val_accuracy: 0.5572 - 363ms/epoch - 11ms/step
Epoch 6/100
32/32 - 0s - loss: 0.6824 - accuracy: 0.4920 - val_loss: 0.6682 - val_accuracy: 0.5663 - 345ms/epoch - 11ms/step
Epoch 7/100
32/32 - 0s - loss: 0.6807 - accuracy: 0.4839 - val_loss: 0.6720 - val_accuracy: 0.6566 - 352ms/epoch - 11ms/step
Epoch 8/100
32/32 - 0s - loss: 0.6824 - accuracy: 0.5763 - val_loss: 0.6688 - val_accuracy: 0.6386 - 347ms/epoch - 11ms/step
...
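The commented-out lines at the top of the cell point at 10-fold cross-validation, which gives a steadier estimate than a single train/test split. The Keras wrapper they import may not exist in your installed version, so the sketch below swaps in scikit-learn's own `MLPClassifier`, which is a different implementation from the Keras model above. It also runs on synthetic data from `make_classification` rather than the mammography features, and the layer size and `max_iter` are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four numeric features and a binary label.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# Scaling sits inside the pipeline, so each fold is scaled using only
# the rows it trains on.
pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000, random_state=0),
)

# For a classifier, an integer cv uses stratified folds.
cv_scores = cross_val_score(pipeline, X, y, cv=10)

print(cv_scores.mean())
```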

In [404]:
score = model.evaluate(x_train, y_train)
print("\n Training Accuracy:", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("\n Testing Accuracy:", score[1])


 Training Accuracy: 0.9959839582443237

 Testing Accuracy: 0.9879518151283264
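Accuracy alone does not show how many false positives the model produces, and false positives are the motivation given at the top of this assignment. A confusion matrix breaks the errors down. The sketch below uses made-up labels and probabilities; in the notebook, `y_prob` would presumably come from something like `model.predict(x_test).ravel()`, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels (0 = benign, 1 = malignant) and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.6, 0.2, 0.8, 0.3, 0.9, 0.7])

# Turn probabilities into class predictions at an assumed 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Share of benign masses that were wrongly flagged as malignant.
false_positive_rate = fp / (fp + tn)

print(tn, fp, fn, tp, false_positive_rate)
```

Raising the threshold trades false positives for false negatives, so the right setting depends on which error is more costly.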


## How did you do?

Which topology, and which choice of hyperparameters, performed the best? Feel free to share your results!