Frank, our instructor, challenged us to work through this problem on our own first and compare with his answers afterward. I worked through it by myself first and then followed his step-by-step guidance. These are my solutions. This final project was a nice way to tie everything together and reinforce the concepts learned in the class. I highly recommend that students go through the exercise on their own before looking at the solutions!

# Final Project

## Predict whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)

This data contains 961 instances of masses detected in mammograms, and contains the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute and so we will discard it. The age, shape, margin, and density attributes are the features that we will build our model with, and "severity" is the classification we will attempt to predict based on those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.
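
If treating "shape" and "margin" as ordinal turns out to be too strong an assumption, one common alternative is one-hot encoding. The sketch below is only an illustration of that idea, not part of the original solution: it uses pandas' get_dummies on a few made-up rows (the values are hypothetical, not taken from the dataset) to turn each category into its own indicator column.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "age": [67, 43, 58, 28],
    "shape": [3, 1, 4, 1],
    "margin": [5, 1, 5, 1],
})

# Expand each nominal column into one indicator column per observed category.
encoded = pd.get_dummies(sample, columns=["shape", "margin"])
print(encoded.columns.tolist())
```

Each row ends up with exactly one "shape" indicator and one "margin" indicator set, so the model no longer assumes any ordering between categories. The cost is more columns, which may or may not help on a data set this small.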

A lot of unnecessary anguish and surgery results from false positives in mammogram readings. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

## Your assignment

Apply several different supervised machine learning techniques to this data set, and see which one yields the highest accuracy as measured with K-Fold cross validation (K=10). Apply:

* Decision tree
* Random forest
* KNN
* Naive Bayes
* SVM
* Logistic Regression
* And, as a bonus challenge, a neural network using Keras.

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.

Remember some techniques such as SVM also require the input data to be normalized first.

Many techniques also have "hyperparameters" that need to be tuned. Once you identify a promising approach, see if you can make it even better by tuning its hyperparameters.

I was able to achieve over 80% accuracy - can you beat that?

Below I've set up an outline of a notebook for this project, with some guidance and hints. If you're up for a real challenge, try doing this project from scratch in a new, clean notebook!


## Let's begin: prepare your data

Start by importing the mammographic_masses.data.txt file into a Pandas dataframe (hint: use read_csv) and take a look at it.

In [1]:
import pandas as pd

feature_names = ['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']

df = pd.read_csv('C:/Users/sicel/Desktop/Portland Data Science/DataScience/DataScience-Python3/mammographic_masses.data.txt',
                na_values=['?'], names = feature_names)
df.head()

,BI_RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1


In [2]:
df.describe()

,BI_RADS,age,shape,margin,density,severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 961 entries, 0 to 960
Data columns (total 6 columns):
BI_RADS     959 non-null float64
age         956 non-null float64
shape       930 non-null float64
margin      913 non-null float64
density     885 non-null float64
severity    961 non-null int64
dtypes: float64(5), int64(1)
memory usage: 45.1 KB


Make sure you use the optional parameters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI_RADS, age, shape, margin, density, and severity).

Evaluate whether the data needs cleaning; your model is only as good as the data it's given. Hint: use describe() on the dataframe.

In [4]:
df.describe()

,BI_RADS,age,shape,margin,density,severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0
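
The describe() output shows a BI_RADS maximum of 55, well outside the documented 1 to 5 range, which looks like the kind of erroneous value the assignment warns about. BI_RADS is discarded here anyway, but the same check is worth running on the columns we keep. The sketch below is one possible way to flag out-of-range values; the rows are made up for illustration, and the age bounds are my own assumption rather than something stated in the dataset description.

```python
import pandas as pd

# Made-up rows for illustration; the 55 mimics the BI_RADS maximum seen in describe().
sample = pd.DataFrame({
    "BI_RADS": [5, 4, 55, 4],
    "age": [67, 43, 58, 28],
    "shape": [3, 1, 4, 1],
})

# Ranges from the attribute list; the age bounds are an assumption.
valid_ranges = {"BI_RADS": (1, 5), "age": (0, 120), "shape": (1, 4)}

# True where a value is present but falls outside its allowed range.
out_of_range = pd.DataFrame({
    col: sample[col].notna() & ~sample[col].between(low, high)
    for col, (low, high) in valid_ranges.items()
})

print(out_of_range.sum())                    # count of suspicious values per column
print(sample[out_of_range.any(axis=1)])      # the rows that contain them
```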


There are quite a few missing values in the data set. Before we just drop every row that's missing data, let's make sure we don't bias our data in doing so. Does there appear to be any sort of correlation to what sort of data has missing fields? If there were, we'd have to try and go back and fill that data in.

In [5]:
df.isnull()[:6]

,BI_RADS,age,shape,margin,density,severity
0,False,False,False,False,False,False
1,False,False,False,False,True,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,True,False
5,False,False,False,True,False,False


In [6]:
# The density column has a lot of missing values
missing = df.isna().sum()
missing

BI_RADS      2
age          5
shape       31
margin      48
density     76
severity     0
dtype: int64
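
Counting missing values per column does not by itself show whether the missingness is related to the outcome. One rough check, sketched below on made-up rows (not the real data), is to compare the malignant rate between complete and incomplete rows and to look at the missing fraction per column within each class. In the describe() outputs of this notebook the severity mean moves from about 0.463 before dropping rows to about 0.486 after, so a quick look like this seems worthwhile.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "age": [67, 43, 58, 28, 74, 65],
    "density": [3, np.nan, 3, 3, np.nan, 3],
    "severity": [1, 1, 1, 0, 1, 0],
})

# Label each row by whether any field is missing.
row_status = sample.isnull().any(axis=1).map({True: "has missing", False: "complete"})

# Share of malignant cases among complete rows vs. rows with a missing field.
rates = sample.groupby(row_status)["severity"].mean()
print(rates)

# Fraction of missing values per feature, split by class.
by_class = sample.drop(columns="severity").isnull().groupby(sample["severity"]).mean()
print(by_class)
```

A large gap between the two groups would suggest that dropping rows changes the class balance, in which case filling in the missing values may be the safer route.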

If the missing data seems randomly distributed, go ahead and drop rows with missing data. Hint: use dropna().

In [7]:
df.dropna(inplace=True)
df.describe()

,BI_RADS,age,shape,margin,density,severity
count,830.0,830.0,830.0,830.0,830.0,830.0
mean,4.393976,55.781928,2.781928,2.813253,2.915663,0.485542
std,1.888371,14.671782,1.242361,1.567175,0.350936,0.500092
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,46.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


Next you'll need to convert the Pandas dataframe into numpy arrays that can be used by scikit-learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels.

In [8]:
all_features = df[['age', 'shape', 'margin', 'density']].values
all_classes = df['severity'].values
all_features

array([[67.,  3.,  5.,  3.],
       [58.,  4.,  5.,  3.],
       [28.,  1.,  1.,  3.],
       ...,
       [64.,  4.,  5.,  3.],
       [66.,  4.,  5.,  3.],
       [62.,  3.,  3.,  3.]])

Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. Hint: use preprocessing.StandardScaler().

In [9]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()
features_sc = scaler.fit_transform(all_features)
features_sc

array([[ 0.7650629 ,  0.17563638,  1.39618483,  0.24046607],
       [ 0.15127063,  0.98104077,  1.39618483,  0.24046607],
       [-1.89470363, -1.43517241, -1.157718  ,  0.24046607],
       ...,
       [ 0.56046548,  0.98104077,  1.39618483,  0.24046607],
       [ 0.69686376,  0.98104077,  1.39618483,  0.24046607],
       [ 0.42406719,  0.17563638,  0.11923341,  0.24046607]])
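
One caveat: fitting the scaler on the full data set before splitting or cross-validating lets the held-out rows influence the scaling statistics. On a data set like this the effect is probably small, but a common way to avoid it is to put the scaler and the model in a Pipeline so the scaler is re-fit on each training fold. The sketch below assumes synthetic stand-in data from make_classification rather than the real file.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the four mammogram features (not the real data).
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# The scaler is fit only on the training part of each fold.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
pipeline_scores = cross_val_score(pipeline, X, y, cv=10)
print(pipeline_scores.mean())
```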

## Decision Trees

Before moving to K-Fold cross validation and random forests, start by creating a single train/test split of our data. Set aside 75% for training, and 25% for testing.

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features_sc, all_classes, test_size=0.25, random_state=5)

Now create a DecisionTreeClassifier and fit it to your training data.

In [11]:
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)

Display the resulting decision tree.
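
I did not include a display cell in my solution. As a minimal sketch of one way to do it, scikit-learn's export_text prints a fitted tree as indented rules. The example below fits a deliberately shallow tree on synthetic stand-in data (not the mammogram file) so the output stays short; the feature names are reused only as labels.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["age", "shape", "margin", "density"]

# Synthetic stand-in data; the columns do not carry the real features' meaning.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# max_depth=2 keeps the printed tree small enough to read.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

tree_text = export_text(small_tree, feature_names=feature_names)
print(tree_text)
```

For a graphical view, sklearn.tree.plot_tree draws the same structure with matplotlib.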

Measure the accuracy of the resulting decision tree model using your test data.

In [12]:
clf.score(X_test, y_test)

0.7596153846153846

In [13]:
y_predict = clf.predict(X_test)

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)

0.7596153846153846

Now instead of a single train/test split, use K-Fold cross validation to get a better measure of your model's accuracy (K=10). Hint: use model_selection.cross_val_score

In [14]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, features_sc, all_classes, cv=10)

print(scores)

print(scores.mean())

[0.71428571 0.75       0.76190476 0.72289157 0.77108434 0.68674699
 0.72289157 0.76829268 0.75609756 0.68292683]
0.7337122007192532


Now try a RandomForestClassifier instead. Does it perform better?

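-Yes, it performs slightly better: 0.774 versus 0.760 for the single decision tree on the same train/test split. (Both numbers are single-split scores, not the 10-fold average.)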

In [15]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.7740384615384616

## SVM

Next try using svm.SVC with a linear kernel. How does it compare to the decision tree?

In [16]:
from sklearn import svm

C = 1.0
svc = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)

In [17]:
svc.score(X_test, y_test)

0.7980769230769231

In [18]:
# SVC stands for support vector classification, a supervised technique
# It finds the hyperplane that separates the classes with the widest margin;
# the support vectors are the training points closest to that boundary
# SVMs cope well with data that has a lot of features
# Training can be computationally expensive, especially for kernels beyond linear

## KNN
How about K-Nearest-Neighbors? Hint: use neighbors.KNeighborsClassifier - it's a lot easier than implementing KNN from scratch like we did earlier in the course. Start with a K of 10. K is an example of a hyperparameter - a parameter on the model itself which may need to be tuned for best results on your particular data set.

In [19]:
from sklearn.neighbors import KNeighborsClassifier

from sklearn import metrics

knn = KNeighborsClassifier(n_neighbors=10)

knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.7884615384615384


Choosing K is tricky, so we can't discard KNN until we've tried different values of K. Write a for loop to run KNN with K values ranging from 1 to 50 and see if K makes a substantial difference. Make a note of the best performance you could get out of KNN.

In [20]:
# This is pretty cool. K = 11, 14, and 16 tie for the best accuracy (about 0.808) among the values shown below
for k in range(1, 51):  # the upper bound is exclusive, so 51 covers K = 1 through 50
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print('Accuracy for K value = ', k , 'is ', metrics.accuracy_score(y_test, y_pred))

Accuracy for K value =  1 is  0.7548076923076923
Accuracy for K value =  2 is  0.7403846153846154
Accuracy for K value =  3 is  0.7692307692307693
Accuracy for K value =  4 is  0.75
Accuracy for K value =  5 is  0.7980769230769231
Accuracy for K value =  6 is  0.7788461538461539
Accuracy for K value =  7 is  0.7932692307692307
Accuracy for K value =  8 is  0.7932692307692307
Accuracy for K value =  9 is  0.8028846153846154
Accuracy for K value =  10 is  0.7884615384615384
Accuracy for K value =  11 is  0.8076923076923077
Accuracy for K value =  12 is  0.7980769230769231
Accuracy for K value =  13 is  0.8028846153846154
Accuracy for K value =  14 is  0.8076923076923077
Accuracy for K value =  15 is  0.8028846153846154
Accuracy for K value =  16 is  0.8076923076923077
Accuracy for K value =  17 is  0.8028846153846154
Accuracy for K value =  18 is  0.8028846153846154
Accuracy for K value =  19 is  0.7932692307692307
Accuracy for K value =  20 is  0.7836538461538461
...
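
Picking K by its score on the test split risks tuning to that particular split. A hedged alternative is to let GridSearchCV choose K using cross-validation on the training data. The sketch below assumes synthetic stand-in data from make_classification, so the K it reports says nothing about the mammogram data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the real file).
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Try K = 1..50 and score each value with 10-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 51))},
    cv=10,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```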

## Naive Bayes

Now try naive_bayes.MultinomialNB. How does its accuracy stack up? Hint: you'll need to use MinMaxScaler to get the features in the range MultinomialNB requires.

In [21]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

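# MultinomialNB requires non-negative features, so rescale the raw features to [0, 1]
scaler = MinMaxScaler()
train_data = scaler.fit_transform(all_features)

# Note: this replaces the StandardScaler-based split, so later cells also use the MinMax-scaled data
X_train, X_test, y_train, y_test = train_test_split(train_data, all_classes, test_size=0.25, random_state=5)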


classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

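# r-squared is a regression metric and is of limited use for a classifier; accuracy is the score to compare
print("r-squared:",metrics.r2_score(y_test,y_pred))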
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))


r-squared: 0.1847704367301234
Accuracy: 0.7980769230769231
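
Since the motivation for this project is reducing false positives, a confusion matrix is arguably more informative than r-squared for a classifier. The sketch below uses scikit-learn's confusion_matrix and classification_report on small made-up label arrays (hypothetical values, not predictions from the model above).

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration: 0 = benign, 1 = malignant.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print("false positives:", fp, "false negatives:", fn)

# Per-class precision and recall.
print(classification_report(y_true, y_hat, target_names=["benign", "malignant"]))
```

With the notebook's own variables, the same calls would take y_test and y_pred.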


## Revisiting SVM

svm.SVC may perform differently with different kernels. The choice of kernel is an example of a "hyperparameter." Try the rbf, sigmoid, and poly kernels and see what the best-performing kernel is. Do we have a new winner?

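-Using the rbf kernel improves the accuracy score slightly (0.803 versus 0.798 for the linear kernel earlier). Note that X_train and X_test in the cells below are the MinMax-scaled split created in the Naive Bayes section, not the StandardScaler split used for the first SVM.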

In [22]:
C = 1.0
svc = svm.SVC(kernel='rbf', C=C).fit(X_train, y_train)
svc.score(X_test, y_test)



0.8028846153846154

In [23]:
C = 1.0
svc = svm.SVC(kernel='sigmoid', C=C).fit(X_train, y_train)
svc.score(X_test, y_test)



0.7980769230769231

In [24]:
C = 1.0
svc = svm.SVC(kernel='poly', C=C).fit(X_train, y_train)
svc.score(X_test, y_test)



0.7980769230769231
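
The three cells above differ only in the kernel name, so the comparison can also be written as a loop. The sketch below is one way to do that with 10-fold cross-validation instead of a single split; it assumes synthetic stand-in data from make_classification, so the winning kernel it prints is not a result for the mammogram data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the real file).
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

kernel_scores = {}
for kernel in ["linear", "rbf", "sigmoid", "poly"]:
    # Scale inside the pipeline so every kernel sees identically prepared folds.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    kernel_scores[kernel] = cross_val_score(model, X, y, cv=10).mean()
    print(kernel, kernel_scores[kernel])

best_kernel = max(kernel_scores, key=kernel_scores.get)
print("best kernel:", best_kernel)
```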

## Logistic Regression

We've tried all these fancy techniques, but fundamentally this is just a binary classification problem. Try Logistic Regression, which is a simple way of tackling this sort of thing.

In [25]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

In [26]:
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)



In [27]:
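# r-squared is a regression metric and is of limited use for a classifier; accuracy is the score to compare
print("r-squared:",metrics.r2_score(y_test, y_pred))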
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

r-squared: 0.24300111982082884
Accuracy: 0.8125


## Neural Networks

As a bonus challenge, let's see if an artificial neural network can do even better. You can use Keras to set up a neural network with 1 binary output neuron and see how it performs. Don't be afraid to run a large number of epochs to train the model if necessary.

In [28]:
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
from sklearn.model_selection import cross_val_score

In [29]:
def create_model():
    model = Sequential()
    # 4 feature inputs going into a 64-unit layer
    model.add(Dense(64, input_dim=4, kernel_initializer='normal', activation='relu'))
    # Another hidden layer of 32 units
    model.add(Dense(32, kernel_initializer='normal', activation='relu'))
    # Output layer with a binary classification (benign or malignant mammogram)
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

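# Note: recent TensorFlow releases no longer include this wrapper; the separate scikeras package offers a similar KerasClassifier
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier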

In [30]:
# Wrap our Keras model in an estimator compatible with scikit-learn
estimator = KerasClassifier(build_fn=create_model, epochs=100, verbose=0)

In [31]:
# Now we can use scikit-learn's cross_val_score to evaluate this model identically to the others
cv_scores = cross_val_score(estimator, features_sc, all_classes, cv=10)
cv_scores.mean()

0.813253011545503

## Do we have a winner?

Which model, and which choice of hyperparameters, performed the best? Feel free to share your results!

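Logistic Regression (0.8125 on the single train/test split) and the artificial neural network (0.813 with 10-fold cross validation) performed the best! K-nearest neighbors (0.808 at best) and SVM with an rbf kernel (0.803) are not too far off either. Keep in mind that most of these scores come from a single train/test split rather than 10-fold cross validation, so differences this small are not conclusive.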
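
To compare the scikit-learn models on equal footing, as the assignment asks, each one can be scored with the same 10-fold cross-validation. The sketch below is one possible way to set that up, assuming synthetic stand-in data from make_classification, so the ranking it prints is not a result for the mammogram data. The model settings mirror the ones used above; clip=True on MinMaxScaler is my own addition to keep held-out values non-negative for MultinomialNB.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the real file).
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "KNN (K=10)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10)),
    # clip=True keeps held-out values inside [0, 1], which MultinomialNB needs.
    "Naive Bayes": make_pipeline(MinMaxScaler(clip=True), MultinomialNB()),
    "SVM (rbf)": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "Logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
}

summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean gives a sense of whether a gap between two models is larger than the fold-to-fold noise.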