# Comparing the accuracy of classification models for breast cancer prediction using confusion matrices

The data is the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository: [Link](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm

In [None]:
breast_cancer_data = load_breast_cancer()
X = pd.DataFrame(breast_cancer_data.data, columns=breast_cancer_data.feature_names)
# Keep only two features for this comparison
X = X[['mean texture', 'mean perimeter']]
# Encode the target as a single indicator column: 1 = benign, 0 = malignant
y = pd.Categorical.from_codes(breast_cancer_data.target, breast_cancer_data.target_names)
y = pd.get_dummies(y, drop_first=True)
print(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# KNN implementation
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
# Flatten the one-column target to 1-D to avoid a column-vector warning
knn.fit(X_train, y_train.values.ravel())

In [None]:
# Decision tree implementation
dtree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
dtree.fit(X_train, y_train.values.ravel())


In [None]:
# Logistic regression implementation
model_lr = LogisticRegression(C=0.0001, solver='liblinear')
model_lr.fit(X_train, y_train.values.ravel())
model_lr

In [None]:
#SVM implementation 
svm_m = svm.SVC(kernel='linear')
svm_m.fit(X_train, y_train.values.ravel())
svm_m

In [None]:
# Predict on the test set with each fitted model
y_pred_knn = knn.predict(X_test)
y_pred_tree = dtree.predict(X_test)
y_pred_log = model_lr.predict(X_test)
y_pred_svm = svm_m.predict(X_test)

In [None]:
# Confusion matrix for each model (rows: true labels, columns: predicted labels)
print("KNN: ", confusion_matrix(y_test, y_pred_knn))
print("Decision Tree: ", confusion_matrix(y_test, y_pred_tree))
print("Logistic Regression: ", confusion_matrix(y_test, y_pred_log))
print("SVM: ", confusion_matrix(y_test, y_pred_svm))
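
A confusion matrix can be reduced to a single accuracy figure, which makes the four models easier to rank. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels), so correct predictions sit on the diagonal. `accuracy_from_confusion` is a hypothetical helper, not part of the original notebook, and the counts in `example_cm` are made up for illustration.

```python
def accuracy_from_confusion(cm):
    """Return overall accuracy: correct predictions (the diagonal) over all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Made-up counts for illustration only; these are NOT results from this notebook.
example_cm = [[8, 2],
              [1, 9]]
print(accuracy_from_confusion(example_cm))  # (8 + 9) / 20 = 0.85
```

The same helper should accept the arrays returned by `confusion_matrix`, for example `accuracy_from_confusion(confusion_matrix(y_test, y_pred_svm))`.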

From these confusion matrices, the SVM model appears to be the most accurate, with the KNN model coming in second. An interesting extension would be to change the features used for prediction and then re-examine each model, for example using smoothness and compactness instead of the features used here (texture and perimeter).
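
One way to try a different feature pair is to wrap the workflow in a function that takes the column names. This is a minimal sketch, not the notebook's own code: `confusion_matrices_for` is a hypothetical helper, it reuses the same model settings and `random_state=1` as the cells above, and it assumes the columns `'mean smoothness'` and `'mean compactness'` are the intended smoothness and compactness features.

```python
import pandas as pd
from sklearn import svm
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def confusion_matrices_for(features, random_state=1):
    """Fit the four models on the given feature columns (hypothetical helper).

    Returns a dict mapping model name to its test-set confusion matrix.
    """
    data = load_breast_cancer()
    X = pd.DataFrame(data.data, columns=data.feature_names)[list(features)]
    y = data.target  # 0 = malignant, 1 = benign
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state
    )
    # Same hyperparameters as the cells above
    models = {
        "KNN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
        "Decision Tree": DecisionTreeClassifier(criterion="entropy", max_depth=4),
        "Logistic Regression": LogisticRegression(C=0.0001, solver="liblinear"),
        "SVM": svm.SVC(kernel="linear"),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # labels=[0, 1] keeps every matrix 2x2 with a fixed class order
        results[name] = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])
    return results


for name, cm in confusion_matrices_for(["mean smoothness", "mean compactness"]).items():
    print(name)
    print(cm)
```

Calling the function again with `["mean texture", "mean perimeter"]` gives matrices for the original feature pair, so the two sets can be compared side by side.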

Thank you for viewing! 