# Mammogram mass data analysis

## Predict whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)

This data contains 961 instances of masses detected in mammograms, and contains the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute and so we will discard it. The age, shape, margin, and density attributes are the features that we will build our model with, and "severity" is the classification we will attempt to predict based on those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.

A lot of unnecessary anguish and surgery results from false positives in mammogram readings. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

Let's apply several different supervised machine learning techniques to this data set, and see which one yields the highest accuracy as measured with K-Fold cross validation (K=10). Apply:

* Decision tree
* Random forest
* KNN
* Naive Bayes
* SVM
* Logistic Regression
* Neural network using Keras.



## Prepare our data

We start by importing the mammographic_masses.data.txt file into a Pandas dataframe using read_csv and take a look at it.

In [1]:
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
#import seaborn as sn


We make sure to convert missing data indicated by a ? into NaN, and to add the appropriate column names (BI_RADS, age, shape, margin, density, and severity):

In [4]:
Attribute_names=['BI-rads','Age','Shape','Margin','Density','Severity']
mammographic_masses=pd.read_csv("./mammographic_masses.data.txt", na_values=['?'], names= Attribute_names)
mammographic_masses


Unnamed: 0,BI-rads,Age,Shape,Margin,Density,Severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1
...,...,...,...,...,...,...
956,4.0,47.0,2.0,1.0,3.0,0
957,4.0,56.0,4.0,5.0,3.0,1
958,4.0,64.0,4.0,5.0,3.0,0
959,5.0,66.0,4.0,5.0,3.0,1


We evaluate whether the data needs cleaning by using describe() on the dataframe.

In [5]:
mammographic_masses.describe()


Unnamed: 0,BI-rads,Age,Shape,Margin,Density,Severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


There are quite a few missing values in the data set. Before we just drop every row that's missing data, let's make sure we don't bias our data in doing so. Does there appear to be any sort of correlation to what sort of data has missing fields? If there were, we'd have to try and go back and fill that data in.
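One rough way to check this numerically is to flag the rows that have any missing field and compare them with the complete rows. The sketch below uses a small made-up frame rather than the real file; the column names mirror the ones used above, but every value is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for mammographic_masses; the values are invented.
toy = pd.DataFrame({
    'Age':      [67, 43, 58, 28, 74, 65, 70, 42],
    'Shape':    [3, 1, 4, 1, 1, np.nan, 4, 2],
    'Margin':   [5, 1, 5, 1, 5, 4, np.nan, 1],
    'Density':  [3, np.nan, 3, 3, np.nan, 3, 3, 3],
    'Severity': [1, 1, 1, 0, 1, 0, 1, 0],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Flag rows with at least one missing field, then compare the two groups.
has_missing = toy.isna().any(axis=1)
comparison = toy.groupby(has_missing)[['Age', 'Severity']].mean()
print(comparison)
```

If the two groups differed sharply in average age or malignancy rate, dropping the incomplete rows would skew the sample.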

In [2]:
scatter_matrix(mammographic_masses)
plt.show()


If the missing data seems randomly distributed, we can go ahead and drop rows with missing data by using dropna().

In [7]:
data=mammographic_masses.dropna(axis=0)

Next we'll need to convert the Pandas dataframes into numpy arrays that can be used by scikit-learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). We'll also need an array of the feature name labels.

In [8]:
X_features=['Age','Shape','Margin','Density']
X_array=data[X_features].values
Y_array=data['Severity'].values
X_array

Some of our models, SVM among them, require the input data to be normalized, so we normalize the attribute data using preprocessing.StandardScaler().


In [1]:
from sklearn import preprocessing
Scalar=preprocessing.StandardScaler()
Scaled_X=Scalar.fit_transform(X_array)
Scaled_X

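To see what StandardScaler does to each column, here is a minimal sketch on a toy matrix. The rows are invented; the four columns only stand in for age, shape, margin, and density.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (age, shape, margin, density); values are invented.
X_toy = np.array([
    [67.0, 3.0, 5.0, 3.0],
    [58.0, 4.0, 5.0, 3.0],
    [28.0, 1.0, 1.0, 3.0],
    [57.0, 1.0, 5.0, 2.0],
    [76.0, 1.0, 4.0, 3.0],
])

scaled_toy = StandardScaler().fit_transform(X_toy)

# Each column is shifted to zero mean and rescaled to unit variance.
print(scaled_toy.mean(axis=0).round(6))
print(scaled_toy.std(axis=0).round(6))
```

After scaling, age no longer dominates distance-based models such as KNN and SVM just because its raw values are larger.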

## Decision Trees

Before moving to K-Fold cross validation and random forests, start by creating a single train/test split of our data. Set aside 75% for training, and 25% for testing.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn import tree
(Training_inputs,Test_inputs,Training_output,Test_output)=train_test_split(Scaled_X,Y_array,train_size=0.75,random_state=31)



Now create a DecisionTreeClassifier and fit it to our training data.

In [7]:
from sklearn.tree import DecisionTreeClassifier
DecTree=DecisionTreeClassifier(max_depth=5,random_state=31)
DesResult=DecTree.fit(Training_inputs,Training_output)


Display the resulting decision tree.

In [8]:
from IPython.display import Image  
from io import StringIO  # sklearn.externals.six is no longer part of scikit-learn
import pydotplus

dot_data = StringIO()  
tree.export_graphviz(DesResult, out_file=dot_data,  
                         feature_names=X_features)  
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())  


Measure the accuracy of the resulting decision tree model using your test data.

In [9]:
DesResult.score(Test_inputs,Test_output)

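For a classifier, score() returns plain accuracy, which does not say how many of the errors are false positives. The sketch below uses synthetic data from make_classification as a stand-in for the scaled mammogram features (an assumption made only so the example runs on its own), and counts the error types with a confusion matrix.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the scaled mammogram features.
X_demo, y_demo = make_classification(n_samples=400, n_features=4,
                                     n_informative=3, n_redundant=0,
                                     random_state=31)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=31)

clf = DecisionTreeClassifier(max_depth=5, random_state=31)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# Accuracy is the fraction of correct predictions, the same value score() gives.
accuracy = accuracy_score(y_test, predictions)

# With labels ordered [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_test, predictions, labels=[0, 1]).ravel()
print(accuracy, fp, fn)
```

Here fp counts benign cases flagged as malignant, the kind of error that leads to unnecessary follow-up procedures.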

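A single train/test split can be lucky or unlucky, so now use K-Fold cross validation (K=10) to get a better measure of the model's accuracy.

In [10]: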
from sklearn.model_selection import cross_val_score

DecTree= DecisionTreeClassifier(random_state=31)

scores=cross_val_score(DecTree,Scaled_X,Y_array,cv=10)
scores.mean()

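To make clear what cross_val_score is doing, here is a sketch of the same procedure written as an explicit loop. It runs on synthetic data (an assumption, so the example is self-contained) and relies on the fact that, for a classifier, an integer cv is treated as stratified K-fold without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the scaled mammogram features.
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=3, n_redundant=0,
                                     random_state=31)
model = DecisionTreeClassifier(random_state=31)

# Split into 10 folds; each fold serves once as the held-out test set.
folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X_demo, y_demo, cv=10)
print(manual_scores.mean(), library_scores.mean())
```

Every sample is used for testing exactly once, so the mean of the ten scores depends less on one particular split.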

Now try a RandomForestClassifier instead. Does it perform better?

In [11]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=10,random_state=31)
scores=cross_val_score(rf, Scaled_X,Y_array, cv=10)
scores.mean()


## SVM

Next try using svm.SVC with a linear kernel. How does it compare to the decision tree?

In [15]:
from sklearn import svm
SVM=svm.SVC(kernel='linear',C=1,gamma='scale')


In [16]:
scores=cross_val_score(SVM,Scaled_X,Y_array,cv=10)
scores.mean()

0.7964988875362076

## KNN
Now we would like to try K nearest neighbours. We start with a K of 10. K is an example of a hyperparameter - a parameter on the model itself which may need to be tuned for best results on your particular data set.

In [17]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=10)
scores=cross_val_score(knn,Scaled_X,Y_array,cv=10)
scores.mean()

0.7854795488574507

Choosing K is tricky, so we can't discard KNN until we've tried different values of K. Write a for loop to run KNN with K values ranging from 1 to 49 and see if K makes a substantial difference. Make a note of the best performance you could get out of KNN.

In [18]:
diffkscores=[]
for k in range(1,50):
    knn=KNeighborsClassifier(n_neighbors=k)
    diffk=cross_val_score(knn,Scaled_X,Y_array,cv=10)
    diffkscores.append(diffk.mean())
max(diffkscores)


0.7940595133145824
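The loop above reports only the best score, not which K produced it. A minimal sketch of recovering the best K is shown below; it runs on synthetic data (an assumption, so the example is self-contained), and the names k_values and best_k are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the scaled mammogram features.
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=3, n_redundant=0,
                                     random_state=31)

k_values = list(range(1, 50))  # K = 1 through 49, the same range as the loop above
mean_scores = []
for k in k_values:
    knn_demo = KNeighborsClassifier(n_neighbors=k)
    mean_scores.append(cross_val_score(knn_demo, X_demo, y_demo, cv=10).mean())

# argmax gives the position of the best score; map it back to its K.
best_k = k_values[int(np.argmax(mean_scores))]
best_score = max(mean_scores)
print(best_k, best_score)
```

Plotting mean_scores against k_values would also show whether the accuracy curve is flat, in which case the exact K matters little.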

## Naive Bayes

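Now we try naive_bayes.MultinomialNB. It requires non-negative feature values, so we rescale the raw features to the 0-1 range with MinMaxScaler instead of reusing the standardized data.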

In [19]:
from sklearn.naive_bayes import MultinomialNB

scaler = preprocessing.MinMaxScaler()
all_features_minmax = scaler.fit_transform(X_array)

nb = MultinomialNB()
scores = cross_val_score(nb, all_features_minmax, Y_array, cv=10)
scores.mean()

0.7844055665169388
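The reason for switching scalers in this cell can be checked directly: MultinomialNB rejects negative inputs, and standardized data always contains some. The sketch below uses a toy matrix with invented values.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy features and labels; the values are invented for illustration.
X_toy = np.array([[67.0, 3.0], [58.0, 4.0], [28.0, 1.0],
                  [57.0, 1.0], [76.0, 1.0], [42.0, 2.0]])
y_toy = np.array([1, 1, 0, 1, 1, 0])

# Standardized data contains negative values, which MultinomialNB rejects.
standardized = StandardScaler().fit_transform(X_toy)
try:
    MultinomialNB().fit(standardized, y_toy)
    rejected = False
except ValueError:
    rejected = True

# Min-max scaling maps every column into the 0-1 range, so fitting succeeds.
minmax = MinMaxScaler().fit_transform(X_toy)
nb_demo = MultinomialNB().fit(minmax, y_toy)
print(rejected, minmax.min(), minmax.max())
```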

## Revisiting SVM

svm.SVC may perform differently with different kernels. The choice of kernel is an example of a "hyperparameter." Try the rbf, sigmoid, and poly kernels and see what the best-performing kernel is.

In [20]:
SVM=svm.SVC(kernel='poly',C=1,gamma='scale')
scores=cross_val_score(SVM,Scaled_X,Y_array,cv=10)
scores.mean()

0.793973454794789

In [21]:
SVM=svm.SVC(kernel='sigmoid',C=1,gamma='scale')
scores=cross_val_score(SVM,Scaled_X,Y_array,cv=10)
scores.mean()

0.7374711389110449

In [22]:
SVM=svm.SVC(kernel='rbf',C=1,gamma='scale')
scores=cross_val_score(SVM,Scaled_X,Y_array,cv=10)
scores.mean()

0.8023928466479158
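The three cells above repeat the same code with only the kernel changed. One way to fold them into a single loop is sketched here, on synthetic data (an assumption, so the example is self-contained); kernel_scores and best_kernel are illustrative names.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic two-class data standing in for the scaled mammogram features.
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=3, n_redundant=0,
                                     random_state=31)

# Mean 10-fold accuracy for each kernel, keyed by kernel name.
kernel_scores = {}
for kernel in ['linear', 'poly', 'sigmoid', 'rbf']:
    svc_demo = svm.SVC(kernel=kernel, C=1, gamma='scale')
    kernel_scores[kernel] = cross_val_score(svc_demo, X_demo, y_demo, cv=10).mean()

# Pick the kernel with the highest mean accuracy.
best_kernel = max(kernel_scores, key=kernel_scores.get)
print(kernel_scores, best_kernel)
```

Keeping the scores in a dictionary makes it easy to compare all kernels side by side rather than reading them off separate cells.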

## Logistic Regression

We've tried all these fancy techniques, but fundamentally this is just a binary classification problem. Try logistic regression, which is a simple way of tackling this sort of thing.

In [23]:
from sklearn.linear_model import LogisticRegression

LR=LogisticRegression(solver='lbfgs')
scores=cross_val_score(LR,Scaled_X, Y_array,cv=10)
scores.mean()

0.8073583532737221

## Neural Networks

Let's see if a deep neural network can do even better.

In [11]:
from sklearn.model_selection import cross_val_score
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential


def NNmodel():
    model=Sequential()
    model.add(Dense(32, input_dim=4,kernel_initializer='normal',activation='relu'))
    model.add(Dense(16,kernel_initializer='normal',activation='relu'))
    model.add(Dense(4,kernel_initializer='normal',activation='relu'))
    model.add(Dense(1,kernel_initializer='normal',activation='sigmoid'))
    model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
    return model

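# Note: recent TensorFlow releases no longer include this wrapper module;
# the separate SciKeras package offers a similar KerasClassifier.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# epochs=1 trains for a single pass over the data, which is very little training.
estimator=KerasClassifier(build_fn=NNmodel, epochs=1, verbose=0)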

nn_scores=cross_val_score(estimator,Scaled_X,Y_array,cv=10)
print(nn_scores.mean())


0.5578313320875168


## Do we have a winner?


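Logistic regression performed best, with a mean 10-fold accuracy of about 0.807, narrowly ahead of the RBF-kernel SVM at about 0.802. Even though it is a very simple classification algorithm, it gave the best results we obtained. The neural network's 0.558 should be read with caution, since it was trained for only one epoch.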
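To compare the models at a glance, the mean accuracies printed in the cells above can be collected and sorted. The decision tree and random forest are left out because their cells show no score; the labels are illustrative.

```python
import pandas as pd

# Mean 10-fold accuracies copied from the cell outputs above.
reported = pd.Series({
    'Logistic regression': 0.8073583532737221,
    'SVM (rbf)':           0.8023928466479158,
    'SVM (linear)':        0.7964988875362076,
    'KNN (best K)':        0.7940595133145824,
    'SVM (poly)':          0.793973454794789,
    'KNN (K=10)':          0.7854795488574507,
    'Naive Bayes':         0.7844055665169388,
    'SVM (sigmoid)':       0.7374711389110449,
    'Neural network':      0.5578313320875168,
})

# Sort from best to worst.
ranking = reported.sort_values(ascending=False)
print(ranking.round(4))
```

Most of the models land within a couple of percentage points of each other, so the ranking among the top few should not be over-interpreted.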