# Finding out the most accurate model to predict a malignant tumor using mammogram data

This project aims to find the most accurate and effective model for predicting whether a mammographic mass is malignant. I am going to use a decision tree classifier, a random forest classifier, an XGBoost classifier, K nearest neighbors, Naive Bayes, support vector machines, and finally a neural network built with Keras. We will measure accuracy with K-fold cross validation where it can be applied; otherwise we will use the provided accuracy score methods.

# Loading and cleaning Data

The most important part of any data science project is cleaning the data. New frameworks make it really easy to implement the algorithms, but the data cleaning is still up to us. We need to decide which data we actually need, which data to drop, which data is relevant, and whether there is some implicit bias in the data. All of these things need to be factored in so the model can work efficiently. Remember, the model won't do everything for you.

In [7]:
import pandas as pd
from pandas import DataFrame
import numpy as np

In [8]:
##Let's Load in our data
feature_names = ['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']
df = pd.read_csv("mammographic_masses.data.txt", na_values = ['?'], names = feature_names)
df.head()

Unnamed: 0,BI_RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1


In [9]:
df.describe()

Unnamed: 0,BI_RADS,age,shape,margin,density,severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


In [10]:
# Drop all the rows with null values. The null values appear to be randomly distributed,
# so dropping them should not add bias to our model.
df.dropna(inplace=True)
df.describe()


Unnamed: 0,BI_RADS,age,shape,margin,density,severity
count,830.0,830.0,830.0,830.0,830.0,830.0
mean,4.393976,55.781928,2.781928,2.813253,2.915663,0.485542
std,1.888371,14.671782,1.242361,1.567175,0.350936,0.500092
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,46.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0
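Whether the missing values really are randomly distributed can be checked rather than assumed. The following is a minimal sketch on a made-up toy frame (the `toy` values are hypothetical, not rows from the dataset): it counts missing values per column and compares the malignancy rate of complete rows against rows with at least one missing value. A large gap between the two rates would suggest that dropping rows could bias the model.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the mammographic data (values are made up).
toy = pd.DataFrame({
    "age": [67, 43, 58, 28, 74, 65],
    "density": [3, np.nan, 3, 3, np.nan, 3],
    "severity": [1, 1, 1, 0, 1, 0],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Label each row by whether it has any missing value, then compare malignancy rates.
row_status = toy.isna().any(axis=1).map({True: "missing", False: "complete"})
rate_by_status = toy.groupby(row_status)["severity"].mean()
print(rate_by_status)
```

On the real data, the same two steps could be run on `df` before the `dropna` call.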


In [11]:
#Now let's filter for outliers. If we look at the description of our data we can see that the maximum BI_RADS value is 55
#while it is supposed to be on a scale of 1-5. That is clearly a bit fishy, so we need to take out the outliers
#so our data doesn't get skewed in any way.
df_filtered = df[df['BI_RADS'] < 6]
df_filtered = df_filtered[df_filtered['BI_RADS']>0]
df_filtered.describe()

Unnamed: 0,BI_RADS,age,shape,margin,density,severity
count,815.0,815.0,815.0,815.0,815.0,815.0
mean,4.341104,55.694479,2.770552,2.801227,2.915337,0.480982
std,0.579304,14.69589,1.244197,1.570536,0.352524,0.499945
min,2.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.5,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,5.0,96.0,4.0,5.0,4.0,1.0
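The two chained comparisons above can also be written as one range check. This is a small sketch on a hypothetical toy frame, assuming the same 1-5 range the notebook uses for BI_RADS; `Series.between` is inclusive on both ends by default.

```python
import pandas as pd

# Hypothetical toy frame with two out-of-range BI_RADS values (55 and 0) and one 6.
toy = pd.DataFrame({
    "BI_RADS": [5, 4, 55, 0, 3, 6],
    "age": [67, 43, 58, 28, 74, 50],
})

# Keep only the rows whose BI_RADS value lies in the 1-5 range (inclusive).
in_range = toy["BI_RADS"].between(1, 5)
toy_filtered = toy[in_range]
print(toy_filtered)
```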


In [19]:
from sklearn.preprocessing import StandardScaler
from sklearn import tree

In [20]:
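#Now we need to define our features. We are using age, shape, margin, and density of the tumor because BI_RADS is not a predictive stat.
#Note: this cell reads from df, not df_filtered, so the rows removed by the BI_RADS filter are still included (830 rows).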
scaler = StandardScaler()
features = list(df.columns[[1,2,3,4]])
print(features)
labels = df['severity']

['age', 'shape', 'margin', 'density']


In [21]:
scaler = scaler.fit(df[features])

In [22]:
#Here we standardize each feature to zero mean and unit variance with StandardScaler.
#This keeps features on different scales (such as age and shape) comparable for the models.
feats = df[features]
df[features] = scaler.transform(df[features])
df.describe()

Unnamed: 0,BI_RADS,age,shape,margin,density,severity
count,830.0,830.0,830.0,830.0,830.0,830.0
mean,4.393976,2.6351080000000002e-17,-1.958273e-16,2.037694e-16,1.180448e-16,0.485542
std,1.888371,1.000603,1.000603,1.000603,1.000603,0.500092
min,0.0,-2.576695,-1.435172,-1.157718,-5.462015,0.0
25%,4.0,-0.6671191,-0.629768,-1.157718,0.2404661,0.0
50%,4.0,0.08307148,0.1756364,0.1192334,0.2404661,0.0
75%,5.0,0.6968638,0.9810408,0.7577091,0.2404661,1.0
max,55.0,2.742838,0.9810408,1.396185,3.091707,1.0
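One caveat with the scaling step above: the scaler is fit on every row before the train/test split, so the held-out rows influence the scaling statistics. A possible alternative, sketched here on synthetic data from `make_classification` (an assumption, standing in for the four mammogram features), is to put the scaler and the classifier in one pipeline so the scaler is refit on the training part of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four features and the binary severity label.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# The pipeline fits StandardScaler on each training fold only,
# then applies that fitted scaler to the matching validation fold.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=6))
pipeline_scores = cross_val_score(pipeline, X, y, cv=10)
print(pipeline_scores.mean())
```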


# Decision Trees

In [23]:
from sklearn.model_selection import train_test_split, cross_val_score

In [24]:
x_train, x_test, y_train, y_test = train_test_split(df[features], labels, train_size = 0.75, random_state=0)

In [25]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)

In [26]:
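# Note: cross_val_score clones clf and refits it on each fold of the test split,
# so these scores do not come from the model fit on x_train above.
scores = cross_val_score(clf, x_test, y_test, cv=10)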
print(scores)
print(scores.mean())

[0.52380952 0.71428571 0.71428571 0.76190476 0.71428571 0.71428571
 0.66666667 0.66666667 0.8        0.7       ]
0.6976190476190476


As we can see, the decision tree is not that accurate (about 70% on average), but there are still a lot of methods left to try, so let's keep going!
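Because `cross_val_score` fits its own copies of the estimator, it can be given an unfitted model and the full dataset, which uses more rows per fold than the 25% test split. A minimal sketch, assuming synthetic data in place of the mammogram features, that also confirms the original estimator object is left unfitted:

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import check_is_fitted

# Synthetic stand-in for the four features and the binary severity label.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Pass an unfitted tree; cross_val_score fits a fresh clone on each fold.
tree_clf = DecisionTreeClassifier(random_state=0)
full_data_scores = cross_val_score(tree_clf, X, y, cv=10)
print(full_data_scores.mean())

# The estimator we passed in was never fitted itself.
try:
    check_is_fitted(tree_clf)
    still_unfitted = False
except NotFittedError:
    still_unfitted = True
print(still_unfitted)
```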

# Random Forest Classifier

In [27]:
from sklearn.ensemble import RandomForestClassifier

clf2 = RandomForestClassifier(n_estimators=10)
clf2 = clf2.fit(x_train, y_train)

In [28]:
scores = cross_val_score(clf2, x_test, y_test, cv=10)
print(scores)
print(scores.mean())

[0.61904762 0.71428571 0.61904762 0.85714286 0.76190476 0.66666667
 0.61904762 0.71428571 0.85       0.8       ]
0.7221428571428571


Again, not a very high accuracy, but still better than the single decision tree. Of course, 72% accuracy isn't good enough for something as important as cancer detection, so let's go on to the next method.

# XGBoost

In [30]:
import xgboost as xgb

In [31]:
param = {
    'max_depth': 4,
    'eta': 0.01,
    'objective': 'multi:softmax',
    'num_class':2} 
epochs = 100
train = xgb.DMatrix(x_train, label=y_train)
test = xgb.DMatrix(x_test, label=y_test)

In [32]:
xgbmodel = xgb.train(param, train, epochs)



In [33]:
from sklearn.metrics import accuracy_score
predictions = xgbmodel.predict(test)
accuracy_score(y_test, predictions)

0.7788461538461539

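OK, so the accuracy is trending upwards as we keep going. Keep in mind that this 78% comes from a single train/test split rather than a cross-validated mean, so it is not measured in quite the same way as the scores above. Let's move on to other techniques.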

# K Nearest Neighbors

In [34]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors =6)

K Nearest Neighbors treats every sample as a point in feature space and classifies a new point by majority vote among the K (some integer) training points closest to it. This is a very simple method, but a very effective one if you can find the sweet spot for K. In this case that sweet spot is 6 neighbors; you can try different values to see the difference in accuracy.
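Trying different values of K can be automated with a small loop. The sketch below assumes synthetic data from `make_classification` instead of the mammogram features, so the best K it reports will not necessarily be 6; it only illustrates the search itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the four features and the binary severity label.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Record the mean 10-fold accuracy for each candidate K.
mean_accuracy_by_k = {}
for k in range(1, 16):
    candidate = KNeighborsClassifier(n_neighbors=k)
    mean_accuracy_by_k[k] = cross_val_score(candidate, X, y, cv=10).mean()

# Pick the K with the highest mean accuracy.
best_k = max(mean_accuracy_by_k, key=mean_accuracy_by_k.get)
print(best_k, mean_accuracy_by_k[best_k])
```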

In [35]:
knn = knn.fit(x_train, y_train)

In [36]:
scores = cross_val_score(knn, x_test, y_test, cv=10)
print(scores)
print(scores.mean())

[0.66666667 0.71428571 0.71428571 0.85714286 0.80952381 0.80952381
 0.80952381 0.80952381 0.75       0.75      ]
0.7690476190476191


It is still getting better, but not quite at the level we want yet. We are aiming for at least 85% accuracy.

# Naive Bayes

Here we need to rescale our features to a min-max (0 to 1) range, because the multinomial Naive Bayes classifier does not accept negative inputs.

In [37]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Rescale the standardized features to [0, 1]; MultinomialNB rejects negative values
scale = MinMaxScaler()
scale = scale.fit(x_train)
x_train2 = scale.transform(x_train)

# Fit the classifier on the rescaled training features
classifier = MultinomialNB()
counts = x_train2
targets = y_train.values
classifier.fit(counts, targets)

MultinomialNB()

In [38]:

x_test2 = scale.transform(x_test)
scores = cross_val_score(classifier, x_test2, y_test.values, cv=10)
print(scores)
print(scores.mean())

[0.76190476 0.76190476 0.71428571 0.95238095 0.80952381 0.80952381
 0.71428571 0.80952381 0.7        0.85      ]
0.7883333333333333
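To see why the rescaling is needed, here is a small sketch on randomly generated data (an assumption, not the mammogram features). It shows `MultinomialNB` raising a `ValueError` on standardized features, which contain negative values, and then fitting without error once a `MinMaxScaler` is placed in front of it.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Random stand-in features and a simple binary label derived from them.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Standardized features are centered on zero, so some of them are negative.
X_standardized = StandardScaler().fit_transform(X)
try:
    MultinomialNB().fit(X_standardized, y)
    rejected_negative = False
except ValueError:
    rejected_negative = True
print(rejected_negative)

# Min-max scaling maps every feature into [0, 1], which MultinomialNB accepts.
nb_pipeline = make_pipeline(MinMaxScaler(), MultinomialNB())
nb_pipeline.fit(X, y)
train_accuracy = nb_pipeline.score(X, y)
print(train_accuracy)
```

If rescaling is not desirable, `GaussianNB` accepts real-valued features, including negative ones.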


So we can see again that it is definitely getting more accurate, but it is still staying around that 77-79% range.

# Support Vector Machines

In [39]:
from sklearn import svm

C = 1.0
svc = svm.SVC(kernel='rbf', C=C).fit(x_train, y_train)

Now we can play around with the different types of kernels, but in this case the rbf and polynomial kernels seem to work the best.
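Playing around with the kernels can be done in one loop. This is a minimal sketch, assuming synthetic data in place of the mammogram features and keeping `C=1.0` as above; which kernel comes out on top depends on the data, so the ranking here is not a claim about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the four features and the binary severity label.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Score each kernel with 10-fold cross validation on standardized features.
mean_accuracy_by_kernel = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    mean_accuracy_by_kernel[kernel] = cross_val_score(model, X, y, cv=10).mean()

for kernel, accuracy in mean_accuracy_by_kernel.items():
    print(f"{kernel:8s} {accuracy:.3f}")
```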

In [40]:
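# Note: x_test2 is the min-max scaled test set from the Naive Bayes section, while svc above
# was fit on the standardized x_train. cross_val_score refits a clone of svc on each fold of x_test2.
scores = cross_val_score(svc, x_test2, y_test.values, cv=10)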
print(scores)
print(scores.mean())

[0.71428571 0.80952381 0.76190476 0.95238095 0.80952381 0.80952381
 0.71428571 0.80952381 0.75       0.85      ]
0.7980952380952381


We are still in that 78-80% accuracy range and seem to have hit a plateau. So now let's try out a deep learning neural network.

# Neural Network with Keras

In this neural network we are going to use the Keras API, which runs on top of the TensorFlow framework. Keras makes it really easy to define the network and takes care of the gradient descent updates for us. The model below has three hidden layers of 512, 256, and 128 neurons, followed by a single sigmoid output neuron. We train for 42 epochs and apply 20% dropout to the inputs and to the outputs of the first two hidden layers to reduce overfitting. Accuracy is measured on the held-out test split rather than with K-fold cross validation.

In [41]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def create_model():
    model = Sequential()
    model.add(Dropout(0.2))
    model.add(Dense(512, input_dim=4, kernel_initializer='normal',activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(256, kernel_initializer='normal',activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(128, kernel_initializer='normal',activation='relu'))
    model.add(Dense(1, kernel_initializer='normal',activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
    return model
model = create_model()
history = model.fit(x_train, y_train, batch_size=75, epochs=42, verbose=2, validation_data = (x_test, y_test))

Epoch 1/42
9/9 - 4s - loss: 0.6449 - accuracy: 0.6945 - val_loss: 0.5616 - val_accuracy: 0.7885
Epoch 2/42
9/9 - 0s - loss: 0.5123 - accuracy: 0.7990 - val_loss: 0.4728 - val_accuracy: 0.7788
Epoch 3/42
9/9 - 0s - loss: 0.4748 - accuracy: 0.8039 - val_loss: 0.4843 - val_accuracy: 0.7740
Epoch 4/42
9/9 - 0s - loss: 0.4612 - accuracy: 0.7878 - val_loss: 0.4685 - val_accuracy: 0.7788
Epoch 5/42
9/9 - 0s - loss: 0.4621 - accuracy: 0.7830 - val_loss: 0.4677 - val_accuracy: 0.7837
Epoch 6/42
9/9 - 0s - loss: 0.4670 - accuracy: 0.7846 - val_loss: 0.4731 - val_accuracy: 0.7837
Epoch 7/42
9/9 - 0s - loss: 0.4548 - accuracy: 0.7926 - val_loss: 0.4800 - val_accuracy: 0.7837
Epoch 8/42
9/9 - 0s - loss: 0.4474 - accuracy: 0.8087 - val_loss: 0.4723 - val_accuracy: 0.7981
Epoch 9/42
9/9 - 0s - loss: 0.4556 - accuracy: 0.7958 - val_loss: 0.4738 - val_accuracy: 0.7885
Epoch 10/42
9/9 - 0s - loss: 0.4697 - accuracy: 0.7942 - val_loss: 0.4734 - val_accuracy: 0.7837
Epoch 11/42
9/9 - 0s - loss: 0.4685 - ... (output truncated)

In [42]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss: ", score[0])
print("Test accuracy:", score[1])

Test loss:  0.47111397981643677
Test accuracy: 0.7980769276618958


Here we can see again that the accuracy is very close to what it was before, so the neural network doesn't make a big difference compared to most of the other techniques that we used.

# So which method is the best??

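There is no clear best method here: XGBoost, K nearest neighbors, Naive Bayes, the SVM, and the neural network all land between roughly 77% and 80% accuracy, with the random forest at about 72%. There is a clear worst method though, and that is a basic decision tree at about 70%. That makes sense: a single unpruned decision tree tends to overfit its training data, while methods like XGBoost and a random forest average over many trees to reduce that effect.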
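The scores in this notebook were measured in slightly different ways (cross validation on the test split for some models, a single split for others), which makes small gaps hard to interpret. One way to put the models on equal footing is to score them all on the same folds. The sketch below assumes synthetic data and covers only the scikit-learn models; the model settings mirror the ones used above, but the numbers it prints say nothing about the mammogram data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four features and the binary severity label.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=6)),
    "svm (rbf)": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
}

# A fixed, shuffled, stratified split so every model sees identical folds.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=folds)
    results[name] = (fold_scores.mean(), fold_scores.std())

# Print the models from highest to lowest mean accuracy.
for name, (mean, std) in sorted(results.items(), key=lambda item: -item[1][0]):
    print(f"{name:15s} {mean:.3f} +/- {std:.3f}")
```

Reporting the standard deviation next to the mean helps show whether a gap of one or two points between models is larger than the fold-to-fold noise.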

# Basically anything except for a basic decision tree works about equally well here, which is often the case