Holden Davis

### Lab - Machine Learning
In this notebook, you will use the Gaussian Naive Bayes estimator to perform and evaluate a binary classification.  You will also compare the performance of this algorithm to other classification algorithms.

This is the Breast Cancer Wisconsin Diagnostic dataset that is bundled with scikit-learn:

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

The dataset contains 569 samples, each with 30 features and a label indicating whether a tumor was malignant (0) or benign (1). There are only two labels, so this dataset is commonly used to perform binary classification. 

Using this dataset, reimplement the steps of this chapter’s classification case study in Sections 15.2–15.3.
* Use the GaussianNB (short for Gaussian Naive Bayes) estimator. 
* When you execute multiple classifiers (as in Section 15.3.3) to determine which one is best for the Breast Cancer Wisconsin Diagnostic dataset, include a LogisticRegression classifier in the estimators dictionary. Logistic regression is another popular algorithm for binary classification.  Use the following parameters: 

    ```
    solver='lbfgs', multi_class='ovr', max_iter=10000
    ```

**Implement the following steps/tasks. Clearly document each step with markup descriptions. (HINT: look at the steps in the book from 15.2.2-15.3.3)**
* Load the data
* Display the data description
* Check the sample and target sizes
* Split the data for training and testing
* Create the model (GaussianNB)
* Train the model
* Predict
* Determine accuracy with score
* Determine accuracy with confusion matrix
* Determine accuracy with classification report
* Visualize the confusion matrix
* Perform k-fold cross validation
* Run multiple models to find the best one, include GaussianNB, KNeighborsClassifier, LogisticRegression, and SVC.
* Which classifier performs the best?







In [135]:
import sklearn.model_selection as ms
import sklearn.metrics as metrics
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn import svm, neural_network, neighbors, tree, linear_model

In [136]:
# Load the data
dataset = load_breast_cancer()

In [137]:
# Display the data description
print(dataset.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [138]:
# Check the sample and target sizes
print(dataset.data.shape)
print(dataset.target.shape)

(569, 30)
(569,)
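
Beyond the array shapes, it can help to see how the 569 samples divide between the two labels. The following is a minimal sketch, not part of the original lab steps, that counts the samples per class with NumPy's `bincount`; the variable name `counts` is chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Count how many samples carry each label (0 = malignant, 1 = benign).
counts = np.bincount(dataset.target)
for name, count in zip(dataset.target_names, counts):
    print(f"{name}: {count} samples ({count / counts.sum():.1%})")
```

The classes are not evenly balanced, which is one reason to look at per-class precision and recall later instead of relying on accuracy alone.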


In [139]:
# Split the data for training and testing
xtrain, xtest, ytrain, ytest = ms.train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=7274)
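
As a quick sanity check on the split, the sketch below repeats the same `train_test_split` call and prints the resulting shapes. With `test_size=0.2` on 569 samples, this should leave 455 training rows and 114 test rows, which matches the total support of 114 in the classification report further down.

```python
import sklearn.model_selection as ms
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Same split as in the notebook: 20% of the samples are held out for testing.
xtrain, xtest, ytrain, ytest = ms.train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=7274
)

print(xtrain.shape, ytrain.shape)  # training features and labels
print(xtest.shape, ytest.shape)    # test features and labels
```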

In [140]:
# Create the model (GaussianNB)
gauss = GaussianNB()

In [141]:
# Train the model
gauss.fit(xtrain, ytrain)

In [142]:
# Predict
ypred = gauss.predict(xtest)

In [143]:
# Determine accuracy with score
accscore = gauss.score(xtest, ytest)
print(accscore)

0.956140350877193
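
For a classifier, `score` reports mean accuracy, that is, the fraction of test samples whose predicted label equals the true label. The sketch below recomputes that fraction by hand and lists the misclassified samples as (predicted, expected) pairs. It rebuilds the model so that it runs on its own; `manual_accuracy` and `wrong` are names introduced here for illustration.

```python
import numpy as np
import sklearn.model_selection as ms
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

dataset = load_breast_cancer()
xtrain, xtest, ytrain, ytest = ms.train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=7274
)
gauss = GaussianNB().fit(xtrain, ytrain)
ypred = gauss.predict(xtest)

# Fraction of test samples where the prediction matches the true label.
manual_accuracy = np.mean(ypred == ytest)
print(manual_accuracy, gauss.score(xtest, ytest))

# (predicted, expected) pairs for the misclassified test samples.
wrong = [(p, e) for p, e in zip(ypred, ytest) if p != e]
print(len(wrong), "of", len(ytest), "test samples misclassified")
```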


In [144]:
# Determine accuracy with confusion matrix
accconf = metrics.confusion_matrix(ytest, ypred)

In [145]:
# Determine accuracy with classification report
acc = metrics.classification_report(ytest, ypred)
print(acc)


              precision    recall  f1-score   support

           0       0.98      0.91      0.94        44
           1       0.95      0.99      0.97        70

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
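
The numbers in this report can be re-derived from the confusion matrix computed in the previous step (its values, `[[40, 4], [1, 69]]`, are displayed in the next cell). The sketch below hard-codes those counts from this run and applies the usual definitions of precision, recall, and F1; it is an illustration of the arithmetic, not a replacement for `classification_report`.

```python
import numpy as np

# Confusion matrix from this notebook's run: rows are true labels,
# columns are predicted labels (0 = malignant, 1 = benign).
cm = np.array([[40, 4],
               [1, 69]])

correct = np.diag(cm)              # correctly classified samples per class
support = cm.sum(axis=1)           # true samples per class (row totals)
predicted_totals = cm.sum(axis=0)  # predictions per class (column totals)

precision = correct / predicted_totals
recall = correct / support
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

for label in range(2):
    print(f"{label}: precision={precision[label]:.2f} "
          f"recall={recall[label]:.2f} f1={f1[label]:.2f} "
          f"support={support[label]}")
print(f"accuracy={accuracy:.2f}")
```

For example, precision for class 0 is 40 / (40 + 1) and recall for class 0 is 40 / (40 + 4), which round to the 0.98 and 0.91 shown in the report.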



In [146]:
# Visualize the confusion matrix
accconf

array([[40,  4],
       [ 1, 69]])
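
The raw array can also be drawn as a labeled heat map. The following is a minimal sketch using scikit-learn's `ConfusionMatrixDisplay`, assuming matplotlib is installed; the counts are hard-coded from the output above, and the label strings and title are chosen here for illustration.

```python
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay

# Counts copied from the confusion matrix above
# (rows: true labels, columns: predicted labels).
cm = np.array([[40, 4],
               [1, 69]])

disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=["malignant (0)", "benign (1)"],
)
disp.plot(cmap="Blues")  # draws the matrix with the counts as cell text
disp.ax_.set_title("GaussianNB confusion matrix (test set)")
```

In a notebook the figure should render inline. The diagonal cells hold the correct predictions, and the off-diagonal cells show 4 malignant samples predicted as benign and 1 benign sample predicted as malignant.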

In [147]:
# Perform k-fold cross validation: define 5 shuffled folds (applied to each model in the next cell)
kf = ms.KFold(n_splits=5, shuffle=True, random_state=7274)
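
The `KFold` object only describes how to split the data; the scores come from passing it to `cross_val_score`. As a sketch of that step for the GaussianNB model alone, the block below prints the accuracy of each of the five folds together with their mean and standard deviation. It reloads the data so that it runs on its own.

```python
import sklearn.model_selection as ms
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

dataset = load_breast_cancer()
kf = ms.KFold(n_splits=5, shuffle=True, random_state=7274)

# One accuracy value per fold: train on four folds, test on the fifth.
scores = ms.cross_val_score(GaussianNB(), dataset.data, dataset.target, cv=kf)
print(scores)
print(f"mean accuracy: {scores.mean():.4f}, std: {scores.std():.4f}")
```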

In [148]:
# Run multiple models to find the best one: GaussianNB, KNeighborsClassifier, LogisticRegression, and SVC,
# plus DecisionTreeClassifier and MLPClassifier as extra comparisons.
models = [
    GaussianNB(), neighbors.KNeighborsClassifier(),
    linear_model.LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=10000),
    svm.SVC(), tree.DecisionTreeClassifier(), neural_network.MLPClassifier(),
]
for model in models:
    accscore = ms.cross_val_score(model, dataset.data, dataset.target, cv=kf)
    print(model, accscore.mean())

GaussianNB() 0.9402266728768824
KNeighborsClassifier() 0.9332091290172333
LogisticRegression(max_iter=10000, multi_class='ovr') 0.956047197640118
SVC() 0.9139108834031984
DecisionTreeClassifier() 0.924468250271697
MLPClassifier() 0.924406148113647


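# Which classifier performs the best?
Every model's mean cross-validation accuracy falls between 0.91 and 0.96. Logistic regression performs best at about 0.956, followed by GaussianNB (0.940), KNeighborsClassifier (0.933), DecisionTreeClassifier and MLPClassifier (about 0.924 each), and SVC (0.914). Its lead over GaussianNB is roughly 1.6 percentage points, a modest rather than dramatic margin.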
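
To judge how meaningful the gaps between models are, it can help to report the spread across folds next to each mean. The sketch below uses an `estimators` dictionary, as the lab instructions suggest, for the four required classifiers. One assumption to note: `multi_class='ovr'` is left out of `LogisticRegression` here because newer scikit-learn releases deprecate that parameter, and for a two-class target the default is expected to behave the same way. The `results` and `best` names are introduced for illustration.

```python
import sklearn.model_selection as ms
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

dataset = load_breast_cancer()
kf = ms.KFold(n_splits=5, shuffle=True, random_state=7274)

estimators = {
    "GaussianNB": GaussianNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    # Assumption: multi_class omitted (deprecated in newer releases).
    "LogisticRegression": LogisticRegression(solver="lbfgs", max_iter=10000),
    "SVC": SVC(),
}

results = {}
for name, estimator in estimators.items():
    scores = ms.cross_val_score(estimator, dataset.data, dataset.target, cv=kf)
    results[name] = (scores.mean(), scores.std())
    print(f"{name:>20}: mean accuracy={scores.mean():.2%}; "
          f"standard deviation={scores.std():.2%}")

# Name of the estimator with the highest mean accuracy.
best = max(results, key=lambda name: results[name][0])
print("Highest mean accuracy:", best)
```

If the standard deviations are of similar size to the differences between the means, the ranking of the models should be read as suggestive, not conclusive.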