## Instructions
In your final project – time to shine! – you'll use machine learning to predict whether a tumor is benign or malignant. 

NOTE: These data are not the same as the data we used before - those were a toy version and these are the real deal.

The data have a bunch of potential predictor variables and one target variable. The file "FP_breast_cancer_data.csv" is the raw data, with one target variable column coded as 0 or 1. This is best for machine learning.

The file "FP_breast_cancer_data_catcol.csv" has an additional column I added that codes the target variable as "benign" or "malignant". This is easier to use when playing around with, for example, seaborn's pairplot() function.

Your goal is to compare two machine learning algorithms for classifying tumor type. You can use two of the three we covered in class, or try one we haven't covered (such as k-means).

For each algorithm, try two feature sets: two variables you pick yourself as potentially useful, and the "best" two variables (the first two principal components) identified by PCA. In other words, you'll end up with 4 sets of results, as in the table below.



| Algorithm 1 | Algorithm 2 |
|-|-|
| 2 good variables by eye | 2 good variables by eye |
| Best two variables via PCA | Best two variables via PCA |
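
The four cells of the table can be filled by one loop over algorithms and feature sets. The sketch below is a stand-in, not the assignment solution: it assumes scikit-learn's bundled `load_breast_cancer` data instead of the course CSV, and the two "by eye" columns (`mean radius`, `mean concave points`) are an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): scikit-learn's bundled set, not the course CSV.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Feature set 1: two columns picked by eye (hypothetical choice).
by_eye = ["mean radius", "mean concave points"]

# Feature set 2: first two principal components, fitted on training rows only.
pca = PCA(n_components=2).fit(X_train)

feature_sets = {
    "2 by eye": (X_train[by_eye].to_numpy(), X_test[by_eye].to_numpy()),
    "2 via PCA": (pca.transform(X_train), pca.transform(X_test)),
}
algorithms = {
    "KNN (k=3)": lambda: KNeighborsClassifier(n_neighbors=3),
    "Gaussian NB": lambda: GaussianNB(),
}

# Fit every algorithm on every feature set and record test accuracy.
results = {}
for feat_name, (Xtr, Xte) in feature_sets.items():
    for algo_name, make_model in algorithms.items():
        model = make_model().fit(Xtr, y_train)
        results[(algo_name, feat_name)] = accuracy_score(y_test, model.predict(Xte))

for (algo_name, feat_name), acc in results.items():
    print(f"{algo_name:12s} | {feat_name:10s} | accuracy = {acc:.2f}")
```

Keeping one train/test split for all four runs makes the numbers directly comparable.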

The final project should be in the form of a brief report to your (busy) boss recommending which way we should go for an automated screening algorithm.

In [131]:
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

In [132]:
# Get data
raw = pd.read_csv('./data/FP_breast_cancer_data.csv')
df = pd.read_csv('./data/FP_breast_cancer_data_catcol.csv')

In [133]:
# Split data into training and testing segments
X = df.iloc[:, :-2]  # predictor matrix: all columns except the last two (the tumor type, as 0/1 and as text)
y = df.target        # target vector (0 or 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Algorithm 1 (K-Nearest Neighbors classifier with k=3)

In [134]:
# Make algorithm
from sklearn.neighbors import KNeighborsClassifier
k = 3
knn = KNeighborsClassifier(n_neighbors=k)

# Fit algorithm
knn.fit(X_train, y_train)

# Predict new values
y_pred = knn.predict(X_test)

# Compare test data with predicted values
acc_score = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {acc_score:.2f}")  # a fraction between 0 and 1, not a percentage

# Make classification report
cls_report = classification_report(y_test, y_pred)
print(cls_report)


# TODO: find best 2 variables

Accuracy Score: 0.93
              precision    recall  f1-score   support

           0       0.93      0.88      0.90        43
           1       0.93      0.96      0.94        71

    accuracy                           0.93       114
   macro avg       0.93      0.92      0.92       114
weighted avg       0.93      0.93      0.93       114
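
K-nearest neighbors classifies by distance, so columns with large numeric ranges can dominate the vote. Whether that matters for this data is not shown above. The sketch below is one way to check, assuming scikit-learn's bundled `load_breast_cancer` data as a stand-in for the course CSV. It compares k=3 with and without a `StandardScaler` step, which the original cell does not use.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled set, not the course CSV.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same classifier twice: once on raw columns, once after standardizing them.
models = {
    "raw": KNeighborsClassifier(n_neighbors=3),
    "scaled": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"KNN (k=3), {name} features: accuracy = {scores[name]:.2f}")
```

The pipeline learns the scaling from the training rows only and reuses it on the test rows.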



## Algorithm 2 (Gaussian Naive Bayes Classifier)

In [140]:
# Make algorithm
gnb = GaussianNB()

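# Fit algorithm
# Caution: X_train/X_test must come from the full-feature split (In [133]). The output shown
# below has support 171, matching the 30% PCA split, so this cell was run after the PCA cell.
gnb.fit(X_train, y_train)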

# Predict new values
y_pred = gnb.predict(X_test)

# Compare test data with predicted values
acc_score = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {acc_score:.2f}")  # a fraction between 0 and 1, not a percentage

# Make classification report
cls_report = classification_report(y_test, y_pred)
print(cls_report)

# TODO: find best 2 variables

Accuracy Score: 0.92
              precision    recall  f1-score   support

           0       1.00      0.78      0.88        63
           1       0.89      1.00      0.94       108

    accuracy                           0.92       171
   macro avg       0.94      0.89      0.91       171
weighted avg       0.93      0.92      0.92       171



## Perform PCA

In [136]:
# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

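# Split the PCA-transformed data into training and testing segments.
# Note: this overwrites the earlier X_train/X_test and uses a 30% test split (the earlier split used 20%).
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)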
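
The cell above fits PCA on all rows before splitting, so the test rows influence the components. A more conservative variant splits first and fits the transform on the training rows only. The sketch below assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the course CSV, and it adds a `StandardScaler` step that the original does not use. The `_pca` variable names are hypothetical, chosen so the earlier split is not overwritten.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled set, not the course CSV.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Learn the scaling and the components from the training rows only.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

# Apply the same fitted transforms to both segments.
X_train_pca = pca.transform(scaler.transform(X_train))
X_test_pca = pca.transform(scaler.transform(X_test))

print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("Train shape:", X_train_pca.shape, "Test shape:", X_test_pca.shape)
```

`explained_variance_ratio_` shows how much of the variance the two components keep, which is worth reporting next to the accuracy.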

### Algorithm 1 (K-Nearest Neighbors classifier with k=3) on PCA

In [137]:
# Make algorithm 1
from sklearn.neighbors import KNeighborsClassifier
k = 3
knn = KNeighborsClassifier(n_neighbors=k)

# Fit algorithm
knn.fit(X_train, y_train)

# Predict new values
y_pred = knn.predict(X_test)

# Compare test data with predicted values
acc_score = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {acc_score:.2f}")  # a fraction between 0 and 1, not a percentage

# Make classification report
cls_report = classification_report(y_test, y_pred)
print(cls_report)

Accuracy Score: 0.94
              precision    recall  f1-score   support

           0       0.92      0.92      0.92        63
           1       0.95      0.95      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.94      0.94       171
weighted avg       0.94      0.94      0.94       171



### Algorithm 2 (Gaussian Naive Bayes Classifier) on PCA

In [139]:
# Create a Naive Bayes classifier
gnb = GaussianNB()

# Fit algorithm
gnb.fit(X_train, y_train)

# Predict new values
y_pred = gnb.predict(X_test)

# Compare test data with predicted values
acc_score = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {acc_score:.2f}")  # a fraction between 0 and 1, not a percentage

# Make classification report
cls_report = classification_report(y_test, y_pred)
print(cls_report)

Accuracy Score: 0.92
              precision    recall  f1-score   support

           0       1.00      0.78      0.88        63
           1       0.89      1.00      0.94       108

    accuracy                           0.92       171
   macro avg       0.94      0.89      0.91       171
weighted avg       0.93      0.92      0.92       171
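
For a screening recommendation, the per-class numbers matter more than overall accuracy. As a worked example, the rounded report above (support 63 and 108, class 0 precision 1.00 and recall 0.78, class 1 recall 1.00) is consistent with the counts below. These counts are reconstructed from the rounded figures, not read from a `confusion_matrix` call, so treat them as a sketch.

```python
# Counts reconstructed from the rounded classification report (an assumption,
# not output from the notebook).
support = {0: 63, 1: 108}

tp0 = 49                 # class 0 rows predicted as 0 (49/63 rounds to 0.78)
fn0 = support[0] - tp0   # class 0 rows predicted as 1
fp0 = 0                  # class 1 rows predicted as 0 (class 0 precision is 1.00)
tp1 = support[1] - fp0   # class 1 rows predicted as 1

# Recompute the report's metrics from the counts to check consistency.
recall0 = tp0 / (tp0 + fn0)
precision0 = tp0 / (tp0 + fp0)
recall1 = tp1 / support[1]
precision1 = tp1 / (tp1 + fn0)
accuracy = (tp0 + tp1) / (support[0] + support[1])

print(f"class 0: precision={precision0:.2f} recall={recall0:.2f} missed={fn0}")
print(f"class 1: precision={precision1:.2f} recall={recall1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Under this reconstruction, 14 of the 63 class 0 tumors are labeled class 1. Which class is malignant depends on how the CSV codes the target, so check that before reading this as a miss rate for malignant tumors.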

