# Final Project

## Predict whether a mammogram mass is benign or malignant



This data set contains 961 instances of masses detected in mammograms, with the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. The age, shape, margin, and density attributes are the features we will build our model with, and "severity" is the classification we will attempt to predict from those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.
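If treating the nominal codes as ordered ever turns out to hurt a model, one common alternative is one-hot encoding. The sketch below is only an illustration on made-up rows (the `toy` frame is hypothetical and not part of this notebook); it uses `pd.get_dummies` to turn each code into its own indicator column.

```python
import pandas as pd

# Toy rows using the same integer codes as the data set (values are made up).
toy = pd.DataFrame({
    "Age": [67, 43, 58],
    "Shape": [3, 1, 4],
    "Margin": [5, 1, 5],
})

# One indicator column per category instead of a single integer code.
encoded = pd.get_dummies(toy, columns=["Shape", "Margin"])
print(encoded.columns.tolist())
```

The trade-off is more columns in exchange for not imposing an order on the categories.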

A lot of unnecessary anguish and surgery results from false positives in mammogram readings. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

## Goals

Apply several different supervised machine learning techniques to this data set, and see which one yields the highest accuracy as measured with K-Fold cross validation (K=10). Apply:

* Decision tree
* Random forest
* KNN
* Naive Bayes
* SVM
* Logistic Regression
* And, as a bonus challenge, a neural network using Keras.
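The comparison described above can be sketched as a loop over candidate estimators, each scored with 10-fold cross validation. This is a minimal illustration on synthetic data: `X_demo`, `y_demo`, and the `candidates` dictionary are assumptions made for the example, not the notebook's variables, and only two of the listed models are shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with four features (not the mammogram data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Mean accuracy over 10 folds for each candidate.
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=10)
    results[name] = scores.mean()
    print(name, round(results[name], 3))
```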

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.




## Let's begin: preparing the data

Start by importing the mammographic_masses.data.txt file into a Pandas dataframe.

In [2]:
import pandas as pd
col_names = ['BI-RADS','Age','Shape','Margin','Density','Severity']
data = pd.read_csv("mammographic_masses.data.txt",names=col_names)
data.head(10)


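| | BI-RADS | Age | Shape | Margin | Density | Severity |
|---|---|---|---|---|---|---|
| 0 | 5 | 67 | 3 | 5 | 3 | 1 |
| 1 | 4 | 43 | 1 | 1 | ? | 1 |
| 2 | 5 | 58 | 4 | 5 | 3 | 1 |
| 3 | 4 | 28 | 1 | 1 | 3 | 0 |
| 4 | 5 | 74 | 1 | 5 | ? | 1 |
| 5 | 4 | 65 | 1 | ? | 3 | 0 |
| 6 | 4 | 70 | ? | ? | 3 | 0 |
| 7 | 5 | 42 | 1 | ? | 3 | 0 |
| 8 | 5 | 57 | 1 | 5 | 3 | 1 |
| 9 | 5 | 60 | ? | 5 | 1 | 1 |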


In [3]:
import numpy as np
df = data.replace('?',np.nan)
df.head(10)

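| | BI-RADS | Age | Shape | Margin | Density | Severity |
|---|---|---|---|---|---|---|
| 0 | 5 | 67 | 3.0 | 5.0 | 3.0 | 1 |
| 1 | 4 | 43 | 1.0 | 1.0 | NaN | 1 |
| 2 | 5 | 58 | 4.0 | 5.0 | 3.0 | 1 |
| 3 | 4 | 28 | 1.0 | 1.0 | 3.0 | 0 |
| 4 | 5 | 74 | 1.0 | 5.0 | NaN | 1 |
| 5 | 4 | 65 | 1.0 | NaN | 3.0 | 0 |
| 6 | 4 | 70 | NaN | NaN | 3.0 | 0 |
| 7 | 5 | 42 | 1.0 | NaN | 3.0 | 0 |
| 8 | 5 | 57 | 1.0 | 5.0 | 3.0 | 1 |
| 9 | 5 | 60 | NaN | 5.0 | 1.0 | 1 |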
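An alternative to replacing `'?'` after loading is to let pandas treat it as missing while parsing, via the `na_values` argument of `read_csv`. The sketch below assumes the same column layout and uses a small in-memory sample (three rows copied from the preview above) instead of the real file; the names `sample` and `demo` exist only for this example.

```python
import io
import pandas as pd

# Three rows in the same layout as mammographic_masses.data.txt.
sample = io.StringIO("5,67,3,5,3,1\n4,43,1,1,?,1\n4,70,?,?,3,0\n")
cols = ['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity']

# na_values='?' tells pandas to parse '?' as NaN, so columns stay numeric.
demo = pd.read_csv(sample, names=cols, na_values='?')
print(demo.dtypes)
print(demo.isna().sum())
```

With this approach the columns are parsed as numbers rather than strings, which also makes `describe()` report every attribute.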


## Analysing the data

In [4]:
print(df.shape)
print(df.size)
print(df.describe())

(961, 6)
5766
         Severity
count  961.000000
mean     0.463059
std      0.498893
min      0.000000
25%      0.000000
50%      0.000000
75%      1.000000
max      1.000000


In [5]:
import seaborn as sns
sns.countplot(x=df['Severity'])
# value_counts() is indexed by class label: 0 = benign, 1 = malignant
counts = df['Severity'].value_counts()
B, M = counts.loc[0], counts.loc[1]
print("Benign",B)
print("Malignant",M)

Benign 516
Malignant 445


In [6]:
dataset = df.dropna()
print(dataset.shape)
dataset.head()


(830, 6)


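| | BI-RADS | Age | Shape | Margin | Density | Severity |
|---|---|---|---|---|---|---|
| 0 | 5 | 67 | 3 | 5 | 3 | 1 |
| 2 | 5 | 58 | 4 | 5 | 3 | 1 |
| 3 | 4 | 28 | 1 | 1 | 3 | 0 |
| 8 | 5 | 57 | 1 | 5 | 3 | 1 |
| 10 | 5 | 76 | 1 | 4 | 3 | 1 |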


### Converting the data to NumPy arrays

In [7]:
x = dataset.iloc[:,1:5]
X = np.array(x)
print(X.shape)
X

(830, 4)


array([['67', '3', '5', '3'],
       ['58', '4', '5', '3'],
       ['28', '1', '1', '3'],
       ...,
       ['64', '4', '5', '3'],
       ['66', '4', '5', '3'],
       ['62', '3', '3', '3']], dtype=object)
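The output above shows `dtype=object`: the values are still strings because the raw columns contained `'?'`. A small sketch of converting such an array to floats follows; `X_obj` and `X_num` are hypothetical names used only here, with two rows copied from the output above.

```python
import numpy as np

# Same kind of array as X above: numbers stored as strings, dtype=object.
X_obj = np.array([['67', '3', '5', '3'],
                  ['58', '4', '5', '3']], dtype=object)

# astype(float) parses each string into a float64 value.
X_num = X_obj.astype(float)
print(X_num.dtype)  # float64
print(X_num)
```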

In [8]:
y = dataset.iloc[:,5]
Y = np.array(y)  
print(Y.shape)
#print(Y)

(830,)


In [9]:
label_names= np.array(col_names[1:])
label_names

array(['Age', 'Shape', 'Margin', 'Density', 'Severity'], dtype='<U8')

### Normalizing the attribute data

Some of the models require the input data to be normalized, so we standardize the attribute data here.

In [10]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_norm = sc.fit_transform(X)
x_norm



array([[ 0.7650629 ,  0.17563638,  1.39618483,  0.24046607],
       [ 0.15127063,  0.98104077,  1.39618483,  0.24046607],
       [-1.89470363, -1.43517241, -1.157718  ,  0.24046607],
       ...,
       [ 0.56046548,  0.98104077,  1.39618483,  0.24046607],
       [ 0.69686376,  0.98104077,  1.39618483,  0.24046607],
       [ 0.42406719,  0.17563638,  0.11923341,  0.24046607]])
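One way to make sure scaling is actually applied whenever a model is scored is to bundle the scaler and the estimator in a pipeline, so that each cross-validation fold fits the scaler on its own training portion only. The sketch below is a minimal illustration on synthetic data; `X_demo`, `y_demo`, and the choice of KNN are assumptions made for the example, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with four features (not the mammogram data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_demo[:, 0] = X_demo[:, 0] * 50 + 60  # put one column on a much larger scale

# The scaler is refit inside every fold, then KNN sees standardized features.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=10)
print(scores.mean())
```

Distance-based models such as KNN and SVM tend to be the most sensitive to feature scale, which is why the example wraps KNN.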

## Decision Trees

Before moving to K-Fold cross validation and random forests, start by creating a single train/test split of our data, setting aside 75% for training and 25% for testing.

In [11]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
print("train fraction", round(len(x_train) / len(X), 2))
print("test fraction", round(len(x_test) / len(X), 2))

(622, 4)
(622,)
(208, 4)
(208,)
train fraction 0.75
test fraction 0.25
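Because the split above is random, the accuracies in the rest of this notebook will vary from run to run. A hedged sketch of a reproducible variant: passing `random_state` fixes the shuffle, and `stratify` keeps the benign/malignant ratio similar in both parts. The arrays below are made-up stand-ins with the same number of rows as the cleaned data, and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in with the same number of rows as the cleaned data set.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(830, 4))
y_demo = rng.integers(0, 2, size=830)

# random_state makes the split repeatable; stratify preserves class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
```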


Now create a DecisionTreeClassifier and fit it to the training data.

In [12]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=1)
model.fit(x_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')

Measure the accuracy of the resulting decision tree model using test data.

In [14]:
from sklearn.metrics import accuracy_score, mean_absolute_error
prediction = model.predict(x_test)
# scikit-learn metrics take (y_true, y_pred) in that order
print("Accuracy of dct is",accuracy_score(y_test,prediction))
print("Error", mean_absolute_error(y_test,prediction))

Accuracy of dct is 0.7403846153846154
Error 0.25961538461538464
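Accuracy alone does not say how many of the errors are false positives, which is the cost this project cares about. A small sketch with made-up labels (`y_true` and `y_pred` are invented for illustration, not results from the model above) shows how `confusion_matrix`, `precision_score`, and `recall_score` separate the two kinds of error.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 0 = benign, 1 = malignant.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives:", fp, "false negatives:", fn)
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
```

Here one benign case is flagged as malignant (a false positive) and one malignant case is missed (a false negative), so precision and recall are both 0.75.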


In [15]:
from sklearn.model_selection import cross_val_score
cv = cross_val_score(model,x_train,y_train,cv=10)
cv.mean()

0.7398533571722361

In [16]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10,random_state=2)
rfc.fit(x_train,y_train)
prediction = rfc.predict(x_test)
print("Accuracy of rfc is",accuracy_score(y_test,prediction))
print("Error", mean_absolute_error(y_test,prediction))
cv = cross_val_score(rfc,x_train,y_train,cv=10)
cv.mean()

Accuracy of rfc is 0.7451923076923077
Error 0.2548076923076923


0.75915615320692

## SVM

In [18]:
from sklearn.svm import SVC
# Name the estimator svc so it does not shadow the sklearn.svm module
svc = SVC(kernel='linear')
svc.fit(x_train,y_train)
prediction = svc.predict(x_test)
print("Accuracy of svm is",accuracy_score(y_test,prediction))
print("Error", mean_absolute_error(y_test,prediction))


Accuracy of svm is 0.8221153846153846
Error 0.1778846153846154


## KNN 

In [19]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(x_train,y_train)
prediction = model.predict(x_test)
print("Accuracy of knn is",accuracy_score(y_test,prediction))
print("Error", mean_absolute_error(y_test,prediction))

cv = cross_val_score(model,x_train,y_train,cv=10)
cv.mean()


Accuracy of knn is 0.7596153846153846
Error 0.2403846153846154


0.7943046846801474

Choosing K is tricky, so try several values. The loop below runs KNN with K ranging from 1 to 49 to see whether K makes a substantial difference.

In [20]:
for i in range(1,50):
    model = KNeighborsClassifier(n_neighbors=i)
    cv = cross_val_score(model,x_train,y_train,cv=10)
    print(i,cv.mean())

1 0.7141211922808962
2 0.7139898264964367
3 0.7667556428529458
4 0.7669885757934409
5 0.7943046846801474
6 0.7862393291531313
7 0.7927685863698556
8 0.7782524573375974
9 0.7847561129158167
10 0.7846784686023184
11 0.7846537063617973
12 0.7701879412085652
13 0.7830408031359909
14 0.7717232001208734
15 0.7846545457597811
16 0.7862145669126103
17 0.7942799224396262
18 0.7878018684999119
19 0.7814014588736957
20 0.7700334919795523
21 0.7845496210118104
22 0.7781236097470894
23 0.7829887604609974
24 0.7780724064700797
25 0.7782276950970763
26 0.7717232001208734
27 0.7684709526327718
28 0.7716455558073749
29 0.770084695256562
30 0.7699277278335978
31 0.7683924689212897
32 0.7699797705085912
33 0.7699797705085912
34 0.7747937179454895
35 0.7731799753216992
36 0.7635017165688768
37 0.769901286797109
38 0.769926888435614
39 0.766726683622506
40 0.7618623723065817
41 0.7602494690807753
42 0.7570492642676672
43 0.76027507071928
44 0.7602494690807753
45 0.7618879739450864
46 0.7570748659061721
47 
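Rather than reading the best K off the printed list, the same search can be delegated to `GridSearchCV`, which records the best parameter and its cross-validated score. This is a sketch on synthetic data (`X_demo`, `y_demo`, and `search` are names assumed for the example), not a rerun of the loop above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with four features (not the mammogram data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Try K = 1..49 with 10-fold cross validation, as in the loop above.
search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": list(range(1, 50))}, cv=10
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print(best_k, round(search.best_score_, 3))
```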

## Naive Bayes


In [21]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(x_train,y_train)
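prediction = nb.predict(x_test)
print("Accuracy of nb is",accuracy_score(y_test,prediction))
print("Error", mean_absolute_error(y_test,prediction))

# Caution: `model` is still the last KNeighborsClassifier from the loop above,
# so the score below is for KNN, not Naive Bayes; pass `nb` to cross-validate it.
cv = cross_val_score(model,x_train,y_train,cv=10)
cv.mean()

Accuracy of nb is 0.7548076923076923
Error 0.24519230769230768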


0.7603535544307622


## Revisiting SVM


In [22]:
from sklearn.svm import SVC
# Name the estimator svc so it does not shadow the sklearn.svm module
svc = SVC(kernel='rbf',C=1)
svc.fit(x_train,y_train)
prediction = svc.predict(x_test)
print("Accuracy of svm is",accuracy_score(y_test,prediction))
print("Error", mean_absolute_error(y_test,prediction))

Accuracy of svm is 0.7980769230769231
Error 0.20192307692307693




In [23]:
from sklearn.svm import SVC
# Name the estimator svc so it does not shadow the sklearn.svm module
svc = SVC(kernel='sigmoid',C=1)
svc.fit(x_train,y_train)
prediction = svc.predict(x_test)
print("Accuracy of svm is",accuracy_score(y_test,prediction))
print("Error", mean_absolute_error(y_test,prediction))

Accuracy of svm is 0.47115384615384615
Error 0.5288461538461539




In [None]:
from sklearn.svm import SVC
# Name the estimator svc so it does not shadow the sklearn.svm module
svc = SVC(kernel='poly',C=1)
svc.fit(x_train,y_train)
prediction = svc.predict(x_test)
print("Accuracy of svm is",accuracy_score(y_test,prediction))
print("Error", mean_absolute_error(y_test,prediction))

## Logistic Regression

In [26]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
prediction = lr.predict(x_test)
print("Accuracy of lr is",accuracy_score(y_test,prediction))
print("Error", mean_absolute_error(y_test,prediction))

cv = cross_val_score(lr,x_train,y_train,cv=10)
cv.mean()



Accuracy of lr is 0.7980769230769231
Error 0.20192307692307693




0.797556932168249


## Neural Network (Keras)


In [27]:
import tensorflow
import keras
from keras.models import Sequential
from keras.layers import Dense


Using TensorFlow backend.


In [28]:
classifier = Sequential()
# Keras 2 argument names: units / kernel_initializer replace output_dim / init
classifier.add(Dense(units=8, kernel_initializer='normal', activation='relu', input_dim=4))
classifier.add(Dense(units=4, kernel_initializer='normal', activation='relu'))
classifier.add(Dense(units=1, kernel_initializer='normal', activation='sigmoid'))
classifier.compile(optimizer = 'Adam',loss='binary_crossentropy',metrics=['accuracy'])
 

  




In [29]:
classifier.fit(x_train,y_train,batch_size=60,epochs=200)

Epoch 1/200
...
Epoch 200/200


<keras.callbacks.callbacks.History at 0x276c7cace80>

In [30]:
score=classifier.evaluate(x_test,y_test,verbose=0)
score

[0.5063044520524832, 0.7836538553237915]
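The two numbers returned by `evaluate` are, in order, the loss and the metric named in `compile`: here binary cross-entropy and accuracy. As a rough illustration of what those quantities measure, the sketch below computes both by hand with NumPy on made-up probabilities; the helper `binary_crossentropy` and the sample arrays are assumptions for this example, not Keras internals.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under y_prob."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

# Made-up labels and predicted probabilities of the malignant class.
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.6])

loss = binary_crossentropy(y_true, y_prob)
# Threshold at 0.5 to turn probabilities into class predictions.
accuracy = float(np.mean((y_prob > 0.5).astype(int) == y_true))
print(loss, accuracy)
```

So the result above reads as a test loss of about 0.51 and a test accuracy of about 0.78.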