## Predict whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)

This dataset contains 961 instances of masses detected in mammograms, with the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. The age, shape, margin, and density attributes are the features we will build our model with, and "severity" is the classification we will attempt to predict from those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.
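If the ordinal approximation ever looks doubtful, a common alternative is one-hot encoding, which gives each category its own 0/1 column and imposes no order. The following is a minimal plain-Python sketch of the idea, not part of the original analysis; the `one_hot` helper and the sample values are hypothetical. In practice `pandas.get_dummies` or sklearn's `OneHotEncoder` would do the same job.

```python
# Category codes for "Shape" as listed in the attribute description above.
SHAPE_CODES = {1: "round", 2: "oval", 3: "lobular", 4: "irregular"}


def one_hot(code, codes=SHAPE_CODES):
    """Return a 0/1 indicator list with one slot per known category code."""
    return [1 if code == known else 0 for known in sorted(codes)]


# Hypothetical sample of Shape values (not taken from the dataset).
sample_shapes = [3, 1, 4]
encoded = [one_hot(shape) for shape in sample_shapes]
print(encoded)  # [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
```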

False positives in mammogram results cause a lot of unnecessary anguish and surgery. If supervised machine learning gives us a better way to interpret those results, it could improve a lot of lives.

In [38]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

In [30]:
# Load the data and assign column names; '?' marks a missing value
features = ['BI-RADS','Age','Shape','Margin','Density','Severity']
data = pd.read_csv('mammographic_masses.data.txt',names=features,na_values=['?'])
data.head()

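| | BI-RADS | Age | Shape | Margin | Density | Severity |
|---|---|---|---|---|---|---|
| 0 | 5.0 | 67.0 | 3.0 | 5.0 | 3.0 | 1 |
| 1 | 4.0 | 43.0 | 1.0 | 1.0 | NaN | 1 |
| 2 | 5.0 | 58.0 | 4.0 | 5.0 | 3.0 | 1 |
| 3 | 4.0 | 28.0 | 1.0 | 1.0 | 3.0 | 0 |
| 4 | 5.0 | 74.0 | 1.0 | 5.0 | NaN | 1 |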


In [33]:
# Count null values per column (this cell was run after the fill in In [32] below, hence all zeros)
data.isnull().sum()

BI-RADS     0
Age         0
Shape       0
Margin      0
Density     0
Severity    0
dtype: int64

In [32]:
# Fill null values with the mean of their column
for i in data.columns:
    data[i] = data[i].fillna(data[i].mean())
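Filling with the mean works for Age, but for coded categories such as Density it produces values that are not valid codes (the 2.91073446 visible in the feature array further down). One possible alternative, sketched here in plain Python on hypothetical values, is to fill numeric columns with the mean and category codes with the most frequent value. The `fill_missing` helper is an illustration only, not the notebook's method.

```python
from statistics import mean, mode

# Hypothetical column values; None marks a missing entry.
age = [67, 43, None, 28]
density = [3, None, 3, 2]


def fill_missing(values, fill_value):
    """Replace every None in values with fill_value."""
    return [fill_value if v is None else v for v in values]


# Numeric column: fill with the mean of the observed values.
age_filled = fill_missing(age, mean(v for v in age if v is not None))

# Category codes: fill with the most frequent observed code instead.
density_filled = fill_missing(density, mode(v for v in density if v is not None))

print(age_filled)      # [67, 43, 46, 28]
print(density_filled)  # [3, 3, 3, 2]
```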

In [34]:
data.describe()

| | BI-RADS | Age | Shape | Margin | Density | Severity |
|---|---|---|---|---|---|---|
| count | 961.0 | 961.0 | 961.0 | 961.0 | 961.0 | 961.0 |
| mean | 4.348279 | 55.487448 | 2.721505 | 2.796276 | 2.910734 | 0.463059 |
| std | 1.781173 | 14.442373 | 1.222561 | 1.52688 | 0.365074 | 0.498893 |
| min | 0.0 | 18.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 4.0 | 45.0 | 2.0 | 1.0 | 3.0 | 0.0 |
| 50% | 4.0 | 57.0 | 3.0 | 3.0 | 3.0 | 0.0 |
| 75% | 5.0 | 66.0 | 4.0 | 4.0 | 3.0 | 1.0 |
| max | 55.0 | 96.0 | 4.0 | 5.0 | 4.0 | 1.0 |


In [46]:
input_features = data[['Age','Shape','Margin','Density']].values
label = data['Severity'].values
feature_names = ['Age', 'Shape', 'Margin', 'Density']

In [40]:
input_features

array([[67.        ,  3.        ,  5.        ,  3.        ],
       [43.        ,  1.        ,  1.        ,  2.91073446],
       [58.        ,  4.        ,  5.        ,  3.        ],
       ...,
       [64.        ,  4.        ,  5.        ,  3.        ],
       [66.        ,  4.        ,  5.        ,  3.        ],
       [62.        ,  3.        ,  3.        ,  3.        ]])

In [41]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
input_features_scaled = scaler.fit_transform(input_features)
input_features_scaled

array([[ 0.79755224,  0.22791465,  1.44403739,  0.24464071],
       [-0.86508983, -1.40884695, -1.17704837,  0.        ],
       [ 0.17406146,  1.04629545,  1.44403739,  0.24464071],
       ...,
       [ 0.58972198,  1.04629545,  1.44403739,  0.24464071],
       [ 0.72827549,  1.04629545,  1.44403739,  0.24464071],
       [ 0.45116848,  0.22791465,  0.13349451,  0.24464071]])
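As a sanity check on the scaled values, the first Age entry can be reproduced by hand from the `describe()` statistics above. This sketch assumes only the standard z-score formula; the one subtlety is that pandas reports the sample standard deviation (ddof=1), while `StandardScaler` divides by the population standard deviation (ddof=0).

```python
import math

# Summary statistics for Age copied from the data.describe() output above.
n = 961
age_mean = 55.487448
age_std_sample = 14.442373  # pandas reports the sample std (ddof=1)

# StandardScaler divides by the population std (ddof=0), so convert first.
age_std_population = age_std_sample * math.sqrt((n - 1) / n)

# z-score of the first patient's age (67 years).
z = (67.0 - age_mean) / age_std_population
print(round(z, 4))  # 0.7976, matching 0.79755224 in the scaled array
```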

# Decision Tree

In [42]:
from sklearn.model_selection import train_test_split
np.random.seed(1234)
x_train,x_test,y_train,y_test=train_test_split(input_features_scaled,label,train_size=0.75,random_state=1)

In [43]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=1)
clf.fit(x_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1, splitter='best')

In [52]:
clf.score(x_test,y_test)

0.7344398340248963
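Accuracy alone does not show how many of the errors are false positives, which is the concern raised in the introduction. A minimal sketch of counting them follows; the labels are hypothetical (not model output) and `confusion_counts` is an illustrative helper. sklearn's `confusion_matrix` in `sklearn.metrics` provides the same counts.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (0 = benign, 1 = malignant)."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1  # benign mass flagged as malignant
        elif truth == 1 and pred == 0:
            fn += 1  # malignant mass missed
        else:
            tn += 1
    return tn, fp, fn, tp


# Hypothetical true labels and predictions for eight masses.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
false_positive_rate = fp / (fp + tn)
print(tn, fp, fn, tp)                 # 3 1 1 3
print(accuracy, false_positive_rate)  # 0.75 0.25
```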


# Cross-validation for a more reliable accuracy estimate

In [53]:
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier(random_state=1)
cross_val = cross_val_score(clf,input_features_scaled,label,cv=10)
cross_val.mean()

0.740902627057334
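One caveat: the scaler above was fitted on the full dataset before cross-validation, so each held-out fold has already influenced the scaling. A way to avoid this is to put the scaler and the classifier in a pipeline, so the scaler is refitted on the training part of every fold. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; it is not the mammogram data, so the score will differ). Decision trees are insensitive to feature scaling, so the effect here is negligible, but the same pattern applies to the SVM, KNN, and logistic regression models that follow.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four features (assumption, not the real data).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=1
)

# The scaler is fitted inside each training fold, never on the held-out fold.
pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))
scores = cross_val_score(pipeline, X, y, cv=10)
print(scores.mean())
```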

# Random forest classifier

In [56]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10,max_depth=10,random_state=10)
cv_score = cross_val_score(clf,input_features_scaled,label,cv=10)
cv_score.mean()

0.7719609106529208

In [66]:
import math
# Rule-of-thumb k for the KNN section below: square root of the test-set size (241 rows)
print(math.sqrt(241))

15.524174696260024


# SVM

In [59]:
from sklearn import svm
c = 1.0
svc = svm.SVC(kernel = 'linear',C=c)

In [60]:
cv_scores = cross_val_score(svc,input_features_scaled,label,cv=10)
cv_scores.mean()

0.7919067643335141

In [67]:
svc = svm.SVC(kernel = 'rbf',C=c)
cv_scores = cross_val_score(svc,input_features_scaled,label,cv=10)
cv_scores.mean()



0.8020961068909388

In [72]:
svc = svm.SVC(kernel = 'sigmoid',C=c)
cv_scores = cross_val_score(svc,input_features_scaled,label,cv=10)
cv_scores.mean()



0.7358681497558328

In [73]:
svc = svm.SVC(kernel = 'poly',C=c)
cv_scores = cross_val_score(svc,input_features_scaled,label,cv=10)
cv_scores.mean()



0.7948688732139627

# KNN

In [78]:
from sklearn import neighbors
clf = neighbors.KNeighborsClassifier(n_neighbors=15)
cv_scores = cross_val_score(clf,input_features_scaled,label,cv=10)
cv_scores.mean()

0.7885855263157895

In [62]:
# Search for the k that gives the best cross-validated accuracy.
# Rule of thumb used for the starting guess: k close to the square root of the test-set size (sqrt(241) ~ 15)
for i in range(1,100):
    clf = neighbors.KNeighborsClassifier(n_neighbors=i)
    cv_scores = cross_val_score(clf,input_features_scaled,label,cv=10)
    print(i,'---->',cv_scores.mean())

1 ----> 0.7010444926749864
2 ----> 0.6904857343100018
3 ----> 0.7542176478567553
4 ----> 0.7470013790920601
5 ----> 0.7688354584915899
6 ----> 0.7823243353228432
7 ----> 0.7896274190631216
8 ----> 0.7854929688913004
9 ----> 0.7864587854946644
10 ----> 0.7937945378911195
11 ----> 0.7884989374208717
12 ----> 0.7832899258455417
13 ----> 0.7854388225718937
14 ----> 0.7864910019895098
15 ----> 0.7885855263157895
16 ----> 0.7875763022246337
17 ----> 0.7895840115753301
18 ----> 0.7822372942665944
19 ----> 0.783322594501718
20 ----> 0.7842991499366974
21 ----> 0.7885313799963827
22 ----> 0.7895623078314342
23 ----> 0.7864804892385603
24 ----> 0.789540830168204
25 ----> 0.7895842376559956
26 ----> 0.7895842376559956
27 ----> 0.7916568321577139
28 ----> 0.7906042005787666
29 ----> 0.7927094637366612
30 ----> 0.7927094637366613
31 ----> 0.7906476080665581
32 ----> 0.7926877599927653
33 ----> 0.7906259043226622
34 ----> 0.7895842376559956
35 ----> 0.7958456547296076
36 ----> 0.7947713194067643
37 
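The manual loop over k can also be expressed with sklearn's `GridSearchCV`, which runs the same cross-validated search and records the best setting. This is a sketch on synthetic stand-in data (an assumption; the best k found here says nothing about the mammogram data), with the search range limited to 1 through 30 to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled features (assumption, not the real data).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=1
)

# Try every k from 1 to 30 with 10-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 31))},
    cv=10,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```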

# Naive Bayes

In [71]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
input_features_scaled_minmax = scaler.fit_transform(input_features)

clf = MultinomialNB()
cv_score = cross_val_score(clf,input_features_scaled_minmax,label,cv=10)
cv_score.mean()

0.7502033595586906

# Logistic Regression

In [74]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
cv_score = cross_val_score(clf,input_features_scaled,label,cv=10)
cv_score.mean()



0.8001649258455418

# Neural Networks

In [75]:
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
def create_model():
    model = Sequential()
    model.add(Dense(6,input_dim=4,kernel_initializer='normal',activation='relu'))
    model.add(Dense(4,kernel_initializer='normal',activation='relu'))
    model.add(Dense(1,kernel_initializer='normal',activation='sigmoid'))
    model.compile(loss='binary_crossentropy',optimizer='adam',metrics =['accuracy'])
    return model

In [77]:
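# Note: this wrapper is deprecated in newer TensorFlow releases; the SciKeras package offers a replacement
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier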

estimator = KerasClassifier(build_fn = create_model,epochs=100,verbose=2)
cv_scores = cross_val_score(estimator,input_features_scaled,label,cv=10)
cv_scores.mean()

Train on 864 samples
Epoch 1/100
864/864 - 1s - loss: 0.6930 - accuracy: 0.5220
Epoch 2/100
864/864 - 0s - loss: 0.6917 - accuracy: 0.5370
...
[Per-epoch training log for the remaining epochs and folds omitted; the original output is truncated. In the surviving fragments, training accuracy levels off at about 0.79 to 0.80.]

0.7950279176235199

# Finally

In [79]:
# Mean 10-fold cross-validation accuracy (%):
# Decision tree   ---> 74.09
# Naive Bayes     ---> 75.02
# Logistic        ---> 80.02
# Neural network  ---> 79.50
# SVM (linear)    ---> 79.19  (rbf kernel: 80.21)
# Random forest   ---> 77.20
# KNN (k=15)      ---> 78.86
# Any of the above models is a reasonable choice except the decision tree and Naive Bayes classifiers, which score noticeably lower.
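To make the comparison easier to read, the cross-validation means reported in the cells above can be ranked in a few lines. The numbers are copied from the outputs in this notebook; only the dictionary and its labels are introduced here. The gaps between the top models are small, so the ordering among them should be read as indicative rather than conclusive.

```python
# Mean 10-fold cross-validation accuracies copied from the cell outputs above.
cv_accuracy = {
    "Decision tree": 0.740902627057334,
    "Random forest": 0.7719609106529208,
    "SVM (linear)": 0.7919067643335141,
    "SVM (rbf)": 0.8020961068909388,
    "SVM (sigmoid)": 0.7358681497558328,
    "SVM (poly)": 0.7948688732139627,
    "KNN (k=15)": 0.7885855263157895,
    "Naive Bayes": 0.7502033595586906,
    "Logistic regression": 0.8001649258455418,
    "Neural network": 0.7950279176235199,
}

# Sort from highest to lowest accuracy and print as percentages.
ranked = sorted(cv_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:<20} {score * 100:6.2f}%")
```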