## Predict whether a mammogram mass is benign or malignant

We'll be using the public "mammographic masses" dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).

This dataset contains 961 instances of masses detected in mammograms, each with the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. We will build our model on the age, shape, margin, and density attributes, and "severity" is the class we will attempt to predict from them.

Although "shape" and "margin" are nominal data types, which sklearn estimators generally treat as plain numbers, their codes are close enough to ordinal that we shouldn't just discard them. "Shape", for example, is coded in increasing order from round to irregular.
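If treating the codes as ordinal turns out to be too coarse, one common alternative is one-hot encoding. The following is only a minimal sketch: the frame `toy` and its rows are hypothetical, and only the column names and value codes are borrowed from the attribute list above.

```python
import pandas as pd

# Hypothetical rows using the dataset's coding for Shape (1-4) and Margin (1-5).
toy = pd.DataFrame({
    "Age": [67, 43, 58],
    "Shape": [3, 1, 4],
    "Margin": [5, 1, 5],
})

# One indicator column per observed category, so no ordering between codes is implied.
encoded = pd.get_dummies(toy, columns=["Shape", "Margin"], dtype=int)
print(encoded.columns.tolist())
```

The trade-off is more columns in exchange for not assuming that, say, "lobular" sits exactly halfway between "oval" and "irregular".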

A lot of unnecessary anguish and surgery arises from false positives in mammogram results. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

***

In [27]:
#Ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
#Importing the dataset (first attempt: the file has no header row, so the first record is read as column names)

import pandas as pd

pd.set_option('display.max_rows',10)

dataset = pd.read_csv('mammographic_masses.data.txt')
dataset.head()

,5,67,3,5.1,3.1,1
0,4,43,1,1,?,1
1,5,58,4,5,3,1
2,4,28,1,1,3,0
3,5,74,1,5,?,1
4,4,65,1,?,3,0


In [3]:
#Re-reading the file with explicit column names, and replacing ? by NaN so that it is treated as a missing value

dataset = pd.read_csv(
    'mammographic_masses.data.txt',
    na_values=['?'],
    names=['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity'],
)
dataset.head()

,BI-RADS,Age,Shape,Margin,Density,Severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1


In [4]:
dataset.tail()

,BI-RADS,Age,Shape,Margin,Density,Severity
956,4.0,47.0,2.0,1.0,3.0,0
957,4.0,56.0,4.0,5.0,3.0,1
958,4.0,64.0,4.0,5.0,3.0,0
959,5.0,66.0,4.0,5.0,3.0,1
960,4.0,62.0,3.0,3.0,3.0,0


In [5]:
dataset.describe()

,BI-RADS,Age,Shape,Margin,Density,Severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


***
* Let's look at the rows with missing values in the dataset
***

In [6]:
dataset.loc[dataset[['Age', 'Shape', 'Margin', 'Density']].isnull().any(axis=1)]

,BI-RADS,Age,Shape,Margin,Density,Severity
1,4.0,43.0,1.0,1.0,,1
4,5.0,74.0,1.0,5.0,,1
5,4.0,65.0,1.0,,3.0,0
6,4.0,70.0,,,3.0,0
7,5.0,42.0,1.0,,3.0,0
...,...,...,...,...,...,...
778,4.0,60.0,,4.0,3.0,0
819,4.0,35.0,3.0,,2.0,0
824,6.0,40.0,,3.0,4.0,1
884,5.0,,4.0,4.0,3.0,1
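Listing the affected rows shows where the gaps are; a per-column count makes it easier to see how much each attribute is affected. A small sketch, assuming an in-memory sample in the same format as the data file (the four records are taken from the outputs above; the names `sample`, `cols`, and `df` are illustrative):

```python
import io
import pandas as pd

# A few records in the file's format; "?" marks a missing value.
sample = io.StringIO("5,67,3,5,3,1\n4,43,1,1,?,1\n4,65,1,?,3,0\n4,70,?,?,3,0\n")
cols = ["BI-RADS", "Age", "Shape", "Margin", "Density", "Severity"]
df = pd.read_csv(sample, na_values=["?"], names=cols)

missing_per_column = df.isnull().sum()             # gaps per attribute
rows_with_missing = df.isnull().any(axis=1).sum()  # rows with at least one gap
print(missing_per_column)
print(rows_with_missing)
```

On the full notebook, the same two expressions could be applied to `dataset` directly.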


***
* Filling the missing values in the Age series with its mode.
***

In [7]:
from termcolor import cprint

# Fill missing ages with the most frequent age
dataset["Age"] = dataset["Age"].fillna(dataset["Age"].mode()[0])
cprint(f'Number of missing values in Age series : {dataset["Age"].isnull().sum()}','yellow')
dataset['Age']

Number of missing values in Age series : 0


0      67.0
1      43.0
2      58.0
3      28.0
4      74.0
       ... 
956    47.0
957    56.0
958    64.0
959    66.0
960    62.0
Name: Age, Length: 961, dtype: float64
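The mode is one possible fill value; for a roughly continuous attribute such as age, the median is another common choice. The sketch below compares the two on a hypothetical series (the values in `ages` are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
ages = pd.Series([67.0, 43.0, 58.0, np.nan, 43.0, 74.0], name="Age")

filled_mode = ages.fillna(ages.mode()[0])    # most frequent value
filled_median = ages.fillna(ages.median())   # middle value of the observed ages
print(filled_mode[3], filled_median[3])
```

With only a handful of missing ages out of 961 rows, either choice is unlikely to change the results much.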

***
* As the other missing values appear to be randomly distributed, we will drop those rows.
***

In [8]:
dataset.dropna(inplace=True)
dataset.describe()

,BI-RADS,Age,Shape,Margin,Density,Severity
count,835.0,835.0,835.0,835.0,835.0,835.0
mean,4.396407,55.801198,2.788024,2.819162,2.916168,0.488623
std,1.883218,14.629845,1.241508,1.565398,0.349943,0.50017
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,46.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0
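The mean of Severity shifts slightly between the two `describe()` outputs (about 0.463 before dropping rows, about 0.489 after), so it may be worth checking whether missingness is related to the class. One way to sketch such a check, using a small hypothetical frame in place of the real data (all values below are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; in the notebook this would run on `dataset` before dropna().
df = pd.DataFrame({
    "Shape":    [3, 1, np.nan, 4, np.nan, 2],
    "Margin":   [5, 1, 3, np.nan, 4, 1],
    "Density":  [3, np.nan, 3, 3, 3, 3],
    "Severity": [1, 1, 0, 0, 0, 0],
})

has_missing = df[["Shape", "Margin", "Density"]].isnull().any(axis=1)
status = has_missing.map({True: "incomplete", False: "complete"})

# Share of malignant cases among complete rows vs. rows with at least one gap.
malignant_rate = df.groupby(status)["Severity"].mean()
print(malignant_rate)
```

A large gap between the two rates would suggest that dropping rows biases the class balance.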


***
* Now splitting our dataset into independent variables (features) and the dependent variable (target).
***

In [9]:
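# Columns 1-4 are Age, Shape, Margin and Density; column 0 (BI-RADS) is left out
X = dataset.iloc[:,[1,2,3,4]].values
# Column 5 is Severity, the class to predict
y = dataset.iloc[:,5].values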

***
* Feature Scaling
***

In [10]:
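from sklearn.preprocessing import StandardScaler

# Note: the scaler is fitted on all rows before the train/test split,
# so test-set statistics influence the scaling of the training data.
sc = StandardScaler()
X = sc.fit_transform(X)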
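Scaling the whole feature matrix before splitting lets the held-out rows influence the mean and standard deviation used for training. A stricter variant puts the scaler inside a pipeline so it is re-fitted on the training part of each fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the four features (an assumption for illustration, not the mammography data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four features (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# The scaler is fitted only on the training part of every fold,
# so the validation part never affects the scaling statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
print(scores.mean())
```

On a dataset of this size the difference in scores is usually small, but the pipeline form keeps the evaluation honest.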

In [11]:
#Splitting the dataset into train and test sets.
import numpy as np
from sklearn.model_selection import train_test_split

X_train , X_test , y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state = 0)
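The split above is purely random. `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The classes here are close to balanced, so the effect is likely small; the sketch below uses hypothetical, deliberately imbalanced labels to make the behaviour visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70 benign (0) and 30 malignant (1).
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Both parts keep the 30% share of the positive class.
print(y_tr.mean(), y_te.mean())
```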

In [12]:
#Importing necessary libraries
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB

***
* Testing out different algorithms to predict severity and comparing their cross-validation scores.
***
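The comparison can also be written as a single loop over a dictionary of models, which keeps every score in one place. This is only a sketch on synthetic data (`X_demo`, `y_demo`, and the `models` dictionary are illustrative names, and only three of the classifiers are included):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled training data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": GaussianNB(),
}

# Mean 10-fold accuracy per model.
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=10).mean()
    for name, model in models.items()
}

# Print from best to worst.
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```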

In [13]:
# Decision Tree Classifier

classifier = DecisionTreeClassifier()

# cross_val_score fits a fresh clone on each fold, so no separate fit() call is needed
scr_dec = cross_val_score(classifier,X_train,y_train,cv = 10)
cprint(f"The k-fold average score of Decision Tree Classifier is : {scr_dec.mean()}",'yellow')

The k-fold average score of Decision Tree Classifier is : 0.7637461090270572


In [14]:
#Support Vector Machine

classifier = SVC(C = 1,kernel='linear')

scr_svc = cross_val_score(classifier,X_train,y_train,cv = 10)
cprint(f"The k-fold average score of Support Vector Classifier is : {scr_svc.mean()}",'yellow')

The k-fold average score of Support Vector Classifier is : 0.8040964961289807


In [15]:
#Using Grid Search to find the best kernel for support vector machine

parameters = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.3, 0.5, 0.7, 0.9]},
    {'C': [1, 10, 100, 1000], 'kernel': ['sigmoid'], 'gamma': [0.1, 0.3, 0.5, 0.7, 0.9]},
    {'C': [1, 10, 100, 1000], 'kernel': ['poly'], 'degree': [2, 3, 4, 5]},
]
grid_search = GridSearchCV(estimator=classifier,param_grid=parameters,n_jobs=-1,cv=10,scoring='accuracy')
grid_search.fit(X_train,y_train)
grid_search.best_params_



{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}

In [16]:
#Running more detailed Grid Search

parameters = [{'C':[0.5,1,2],'kernel':['rbf'],'gamma':[0.05,0.1,0.15,0.2,0.25]}]
grid_search = GridSearchCV(estimator=classifier,param_grid=parameters,n_jobs=-1,cv=10,scoring='accuracy')
grid_search.fit(X_train,y_train)
grid_search.best_params_



{'C': 1, 'gamma': 0.05, 'kernel': 'rbf'}
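Besides `best_params_`, a fitted `GridSearchCV` also exposes `best_score_` and `best_estimator_`, which saves re-creating the classifier by hand. Since the refined search again picked the smallest `gamma` on offer, it may also be worth extending the grid downward. A minimal sketch on synthetic data (the grid values and the `X_demo`, `y_demo` names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

param_grid = {"C": [0.5, 1, 2], "gamma": [0.01, 0.05, 0.1], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid=param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)           # mean cross-validated accuracy of the best combination
best_svc = search.best_estimator_   # refitted on all of X_demo (refit=True is the default)
```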

In [17]:
#Support Vector Machine (Improved)

classifier = SVC(kernel='rbf',gamma=0.05,C= 1)

scr_svc = cross_val_score(classifier,X_train,y_train,cv = 10)
cprint(f"The k-fold average score of Improved Support Vector Classifier is : {scr_svc.mean()}",'yellow')

The k-fold average score of Improved Support Vector Classifier is : 0.8146120999281667


In [18]:
#Random Forest Classifier

classifier = RandomForestClassifier(n_estimators=300)

scr_ran = cross_val_score(classifier,X_train,y_train,cv = 10)
cprint(f"The k-fold average score of Random Forest Classifier is : {scr_ran.mean()}",'yellow')

The k-fold average score of Random Forest Classifier is : 0.7921981004070556


In [19]:
#K-Nearest Neighbors

classifier = KNeighborsClassifier(n_neighbors=5)

scr_knn = cross_val_score(classifier,X_train,y_train,cv = 10)
cprint(f"The k-fold average score of K-Nearest Neighbor Classifier is : {scr_knn.mean()}",'yellow')

The k-fold average score of K-Nearest Neighbor Classifier is : 0.7935130497246388


In [20]:
#Looking at the score with different numbers of neighbors.
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for n in range(1, 50):
    clf = KNeighborsClassifier(n_neighbors=n)
    cv_scores = cross_val_score(clf, X_train, y_train, cv=10)
    rows.append({'No. of Neighbors': n, 'K-fold Average Score': cv_scores.mean()})
knn = pd.DataFrame(rows).set_index('No. of Neighbors')
knn.sort_values(by='K-fold Average Score', ascending=False)

No. of Neighbors,K-fold Average Score
41,0.810111
42,0.810044
44,0.810000
15,0.809956
45,0.808574
...,...
8,0.772437
6,0.771056
4,0.771054
1,0.726095


In [21]:
#K-Nearest Neighbors(Improved)

classifier = KNeighborsClassifier(n_neighbors=41)

scr_knn = cross_val_score(classifier,X_train,y_train,cv = 10)
cprint(f"The k-fold average score of K-Nearest Neighbor Classifier is : {scr_knn.mean()}",'yellow')

The k-fold average score of K-Nearest Neighbor Classifier is : 0.8101112086625696


In [28]:
#Logistic Regression

classifier = LogisticRegression()

scr_log = cross_val_score(classifier,X_train,y_train,cv = 10)
cprint(f"The k-fold average score of Logistic Regression is : {scr_log.mean()}",'yellow')

The k-fold average score of Logistic Regression is : 0.8146333838827253


In [23]:
#Naive Bayes Classifier

classifier = GaussianNB()

scr_nav = cross_val_score(classifier,X_train,y_train,cv=10)
cprint(f"The k-fold average score of Naive Bayes Classifier is : {scr_nav.mean()}",'yellow')

The k-fold average score of Naive Bayes Classifier is : 0.795117992923085


In [24]:
#XGBoost Classifier

classifier = XGBClassifier(use_label_encoder=False,eval_metric='error',learning_rate = 0.1,max_depth = 3)

scr_xgb = cross_val_score(classifier,X_train,y_train,cv=10)
cprint(f"The k-fold average score of XGBoost Classifier is : {scr_xgb.mean()}",'yellow')

The k-fold average score of XGBoost Classifier is : 0.804006704445686


In [25]:
#Artificial Neural Network
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Note: this wrapper module was removed in recent TensorFlow releases;
# there, the separate scikeras package provides a KerasClassifier replacement.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

def create_model():
    model = Sequential()
    # 4 feature inputs going into a 6-unit layer (more does not seem to help; in fact you can go down to 4)
    model.add(Dense(6, input_dim=4, kernel_initializer='normal', activation='relu'))
    # Output layer with a binary classification (benign or malignant)
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Wrap our Keras model in an estimator compatible with scikit-learn
classifier = KerasClassifier(build_fn=create_model, epochs=100, verbose=0)

scr_ann = cross_val_score(classifier, X_train, y_train, cv=10)
cprint(f"The k-fold average score of Artificial Neural Network is : {scr_ann.mean()}",'yellow')

The k-fold average score of Artificial Neural Network is : 0.8068747222423553


***
* Logistic Regression has the best mean 10-fold cross-validation accuracy at **81.463%**, essentially tied with the tuned RBF Support Vector Classifier (81.461%).
***
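All scores above are cross-validation averages on the training split; the held-out `X_test` and `y_test` are not used in the cells shown. A natural last step is to fit the chosen model on the training split and score it once on the held-out part, ideally with a confusion matrix, since false positives are the stated concern. The sketch below shows the shape of that step on synthetic data (the `X_demo`, `y_demo` stand-in and the variable names are illustrative, not results for the mammography data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled features and labels (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
y_pred = model.predict(X_te)

print(accuracy_score(y_te, y_pred))
# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_te, y_pred, labels=[0, 1])
print(cm)
```

In the notebook itself, the same calls would use `X_train`, `y_train`, `X_test`, and `y_test` in place of the synthetic arrays.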