## Programming Assignment #6

**Machine Learning Validation**

100 points possible.

This assignment asks you to (1) compare validation strategies for an ML model and (2) get experience trying different models.

## The Setting -- Breast Cancer Prediction

The code below loads a publicly available data set for breast cancer; the target is a cancer status (benign or malignant) and the features are measurements of the observed mass. The data set description (DESCR) is printed below for additional context.

In [1]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer(as_frame=True)
cancer_df = cancer.frame
print(cancer.DESCR)


.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [2]:
# The measurements are numerical values
cancer_df.head(n=20)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,...,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,0
6,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,0.05742,...,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,0
7,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,0.07451,...,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151,0
8,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,0.07389,...,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072,0
9,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,0.08243,...,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,0


In [3]:
# the target is a 0 (malignant) or 1 (benign).
cancer_df.target.value_counts()

1    357
0    212
Name: target, dtype: int64
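
The mapping of class codes to labels can be checked against the dataset itself rather than taken from the comment. A minimal sketch, assuming the `target_names` attribute of the loaded bunch is indexed by class code; `code_to_name` and `named_counts` are illustrative names, not part of the assignment:

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer(as_frame=True)

# target_names is indexed by class code: position 0 -> code 0, position 1 -> code 1
code_to_name = {code: str(name) for code, name in enumerate(cancer.target_names)}

# Replace the numeric codes with their labels before counting
named_counts = cancer.frame['target'].map(code_to_name).value_counts()
print(code_to_name)
print(named_counts)
```

The counts should line up with the numeric `value_counts()` output above, with the larger class being benign.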

# Holdout Set Example

The code below splits the data set 50/50 into training and testing data.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Split 50/50 for train and test, set random_state to an int for a static cut of the data
data_train, data_test = train_test_split(cancer_df, test_size=.5, random_state=54321)

# Separate training/test data into feature matrix and target
X_train = data_train.drop(columns='target')
X_test = data_test.drop(columns='target')
y_train = data_train.target
y_test = data_test.target

# Create a Gaussian Naive-Bayes model and train it on the 50% training split
gaussiannb = GaussianNB()
model = gaussiannb.fit(X_train, y_train)

# Generate predictions using the test data
y_pred = gaussiannb.predict(X_test)

# Print/compare predictions
print(y_pred)
print(accuracy_score(y_test, y_pred))


[1 0 0 1 1 0 1 0 0 1 1 0 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0 1 0 0
 1 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 0
 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 1 0 1 1 1 0 1 1 0 1 0 1 0 1 1 0 1 1 0
 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 0 1 1 1 0 1
 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 1 1 0 1 0 1 0 0 1 1 1 1 1
 1 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 0 1
 0 1 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 0]
0.9298245614035088
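
A single holdout score depends on which rows happen to land in each half. As a rough sketch (the seed list and the names `X_all`, `y_all`, and `holdout_scores` are illustrative assumptions, not part of the assignment), the same 50/50 split can be repeated with several `random_state` values to see how much the accuracy moves:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)

holdout_scores = []
for seed in [0, 1, 2, 3, 4]:  # illustrative seeds, chosen arbitrarily
    # Same 50/50 split as above, but with a different shuffle each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_all, y_all, test_size=0.5, random_state=seed)
    clf = GaussianNB().fit(X_tr, y_tr)
    holdout_scores.append(accuracy_score(y_te, clf.predict(X_te)))

print(['{:.3f}'.format(s) for s in holdout_scores])
print('spread: {:.3f}'.format(max(holdout_scores) - min(holdout_scores)))
```

The spread between the best and worst seed gives a feel for how much of a holdout score is due to the particular split.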


# Part 0 -- Submission Details


(10 points) Please enter your name and the date below. Submit your answers as a completed notebook by the deadline posted on Canvas.  Late submissions will not get credit for this section.

Name: ***Matt Massey***

Date: ***12/6/2022***


# Part 1 -- Cross Validation

(15 points) Repeat the GaussianNB example, but use 5-fold cross validation. (Hint: *cv=5* when using *cross_val_score*). Print the mean accuracy score and standard deviation.  You may combine parts 1-3 into a single block of code if you wish to use iteration.

In [78]:
# import cross validation module
from sklearn.model_selection import cross_val_score

# create feature matrix with all data
X = cancer_df.drop('target', axis=1)

# create target vector of all data
y = cancer_df['target']

# PARTS 1-3
# list of cross validation folds to use for parts 1-3
cv_folds = [5,10,20]

# loop over the three fold counts (5, 10, 20) with a Gaussian naive Bayes classifier
for cvf in cv_folds:  
  score = cross_val_score(GaussianNB(), X, y, cv=cvf)     # per-fold accuracy scores of model
  print('Gaussian Naive Bayes, {}-fold cross validation:'.format(cvf))    # print title of model
  print('  {0:.2f} mean accuracy'.format(score.mean()))   # print mean accuracy of model
  print('  {0:.2f} standard deviation'.format(score.std()))    # print standard deviation of model accuracy scores
  print('\n')   # print new line before next loop

Gaussian Naive Bayes, 5-fold cross validation:
  0.94 mean accuracy
  0.01 standard deviation


Gaussian Naive Bayes, 10-fold cross validation:
  0.94 mean accuracy
  0.03 standard deviation


Gaussian Naive Bayes, 20-fold cross validation:
  0.94 mean accuracy
  0.04 standard deviation
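
To make explicit what `cross_val_score` is doing, the 5-fold case can be written out by hand. This is a sketch under the assumption that, for a classifier and an integer `cv`, scikit-learn uses unshuffled `StratifiedKFold` splits and the estimator's default accuracy score; the names `X_all`, `y_all`, `manual_scores`, and `library_scores` are illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)

# Unshuffled stratified folds: each fold keeps roughly the same class ratio
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X_all, y_all):
    # Fit a fresh model on four folds, score accuracy on the remaining fold
    clf = GaussianNB().fit(X_all.iloc[train_idx], y_all.iloc[train_idx])
    manual_scores.append(clf.score(X_all.iloc[test_idx], y_all.iloc[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(GaussianNB(), X_all, y_all, cv=5)

print(manual_scores.round(3))
print(np.allclose(manual_scores, library_scores))
```

Every row is used for testing exactly once and for training in the other four iterations, which is the key difference from a single holdout split.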




# Part 2 -- Cross Validation

(10 points) Repeat the GaussianNB example, but use 10-fold cross validation. Print the mean accuracy score and standard deviation.

In [18]:
# see Part 1 above

# Part 3 -- Cross Validation

(10 points) Repeat the GaussianNB example, but use 20-fold cross validation. Print the mean accuracy score and standard deviation.

In [17]:
# see Part 1 above

# Part 4 -- Critical Thinking

(15 points) Answer the following questions (as code comments below):

1.   Which validation scheme, 50/50 holdout or cross-validation, reported better accuracy? Why?
2.   Do your cross validation results appear stable when varying how many folds? What does this suggest about your results?



In [89]:
# 4.1 --> The cross-validation runs all reported a slightly better mean accuracy (0.94)
# than the 50/50 holdout (0.93). With cross-validation each model trains on 80-95% of
# the data (depending on the fold count) and every sample is tested exactly once,
# whereas the holdout model trains on only 50% of the dataset and is scored on a
# single split.


# 4.2 --> The mean accuracy for 5, 10, and 20 folds is 0.94 in every case; however,
# the standard deviation increases with additional folds (0.01, 0.03, 0.04). This
# suggests that the overall accuracy estimate is stable, and that the added variance
# comes from the validation sets getting smaller as the number of folds increases,
# so small local variations in the dataset move each fold's score more.
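
The effect of fold count on validation-set size can be checked directly. With 569 samples, a 20-fold test set holds roughly 28 rows, so a single misclassification shifts that fold's accuracy by about 3.6 percentage points, versus under 1 point for a 5-fold test set of roughly 114 rows. A minimal sketch, assuming stratified folds as used for classifiers; `fold_sizes` is an illustrative name:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)

fold_sizes = {}
for n_folds in [5, 10, 20]:
    splitter = StratifiedKFold(n_splits=n_folds)
    # Count how many rows end up in each test fold
    sizes = [len(test_idx) for _, test_idx in splitter.split(X_all, y_all)]
    fold_sizes[n_folds] = (min(sizes), max(sizes))
    print('{}-fold: test folds hold {} to {} samples'.format(
        n_folds, min(sizes), max(sizes)))
```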

# Part 5 -- Model Swap

(15 points) Repeat your 10-fold cross validation, but swap in a model different than GaussianNB. Choose a model appropriate for a classification task.

Print the mean accuracy score and standard deviation. Leave as a comment which model performed better and speculate why there may or may not exist a performance difference.

Hint: Become familiar with and browse the scikit-learn user guide: https://scikit-learn.org/stable/user_guide.html The "appropriate" model is one suitable for this type of task and this type of data.

In [87]:
# import k-neighbors classifier
from sklearn.neighbors import KNeighborsClassifier

# use the k-neighbors classifier model with all default parameters; n_neighbors=5
kn5 = KNeighborsClassifier()

# run 10-fold cross validation once and reuse the per-fold accuracy scores
kn5_scores = cross_val_score(kn5, X, y, cv=10)

# print model title, mean accuracy, and standard deviation for 10-fold cross validation
print('k-nearest neighbors (5 neighbors), 10-fold cross validation:')
print('  {0:.2f} mean accuracy'.format(kn5_scores.mean()))
print('  {0:.2f} standard deviation'.format(kn5_scores.std()))

k-nearest neighbors (5 neighbors), 10-fold cross validation:
  0.93 mean accuracy
  0.03 standard deviation


In [80]:
# The k-neighbors model (0.93 +/-0.03) doesn't perform quite as well as the Gaussian naive Bayes
# model (0.94 +/-0.03), although they are very close. The k-nearest neighbors model uses the default
# 5 nearest neighbors and uniform weighting to classify each point. Since the targets are not evenly
# distributed in the dataset, the k-nearest neighbors model may not see uniform training data across
# the different folds. The features are also unscaled, so large-valued features such as area likely
# dominate the distance calculation.
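
Because k-nearest neighbors classifies by distance, one way to probe the small gap is to standardize the features first and rerun the same 10-fold cross validation. This is only a sketch of that experiment, not part of the assignment; the pipeline and the names `candidates` and `knn_results` are assumptions introduced for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)

candidates = {
    'kNN, raw features': KNeighborsClassifier(),
    # The scaler is fit on the training folds only, inside each CV iteration
    'kNN, standardized features': make_pipeline(StandardScaler(),
                                                KNeighborsClassifier()),
}

knn_results = {}
for label, estimator in candidates.items():
    scores = cross_val_score(estimator, X_all, y_all, cv=10)
    knn_results[label] = (scores.mean(), scores.std())
    print('{}: {:.2f} +/- {:.2f}'.format(label, scores.mean(), scores.std()))
```

If the standardized variant scores higher, that would point to feature scale rather than class balance as the main limitation of the default model.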

# Part 6 -- Model Swap

(15 points) Repeat your 10-fold cross validation, but swap in a model different than GaussianNB (and different than part 5). Choose a model appropriate for a classification task.

Print the mean accuracy score and standard deviation. Leave as a comment which model performed better and speculate why there may or may not exist a performance difference.

In [86]:
# import make_pipeline function, standardscaler to scale dataset for svc classification;
# and support vector machine classifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import svm

# support vector classifier standardizes features in cancer dataset (X), then uses the 
# svm model by using the make_pipeline function; default parameters for model
svc = make_pipeline(StandardScaler(), svm.SVC())

# run 10-fold cross validation once and reuse the per-fold accuracy scores
svc_scores = cross_val_score(svc, X, y, cv=10)

# print results of model; print title, mean accuracy, and standard deviation of the
# fold accuracies for 10-fold cross validation of the support vector machine
print('C-Support Vector Classification, 10-fold cross validation:')
print('  {0:.2f} mean accuracy'.format(svc_scores.mean()))
print('  {0:.2f} standard deviation'.format(svc_scores.std()))

C-Support Vector Classification, 10-fold cross validation:
  0.98 mean accuracy
  0.03 standard deviation


In [88]:
# The support vector machine model has better accuracy than the Gaussian
# naive Bayes or k-neighbors models above, with an accuracy of 0.98 +/-0.03. The features
# of the cancer dataset are scaled, then used in the model, where the SVM (default RBF
# kernel) fits a hyperplane in a kernel-induced feature space that separates the data into
# two classes, which gives a nonlinear boundary in the original features. This seems like a
# much more realistic model than assuming a Gaussian distribution and/or using nearest neighbors.
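
scikit-learn's `SVC` defaults to an RBF kernel, so the fitted boundary is not necessarily a flat hyperplane in the original 30 features. A sketch that compares the default against a linear kernel with the same scaling pipeline and 10-fold cross validation; the comparison and the names `default_kernel` and `svc_results` are illustrative assumptions, not something the assignment requires:

```python
from sklearn import svm
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)

# Inspect the default kernel rather than assuming it
default_kernel = svm.SVC().kernel
print('default kernel:', default_kernel)

svc_results = {}
for kernel in ['linear', 'rbf']:
    # Scale inside the pipeline so each CV iteration scales from its own training folds
    model = make_pipeline(StandardScaler(), svm.SVC(kernel=kernel))
    scores = cross_val_score(model, X_all, y_all, cv=10)
    svc_results[kernel] = (scores.mean(), scores.std())
    print('{} kernel: {:.2f} +/- {:.2f}'.format(kernel, scores.mean(), scores.std()))
```

If both kernels score similarly, the classes are close to linearly separable after scaling, and the gain over the earlier models likely comes more from scaling and margin maximization than from the kernel.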

# Part 7 -- Documentation and Correctness

(10 points) Please document your code with human-readable messages explaining what the code is doing; at a minimum, every function and control structure should be documented.  If your response is a 1-liner, explain how it works.

Additionally, please error check your code; partial credit will be given to answers that do not fully address the requirements. For example, if it says write a function, please make sure your code provides a function.

Please make sure your submission has everything completed.