### Capstone 3: Breast Cancer Prediction

*Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Breast cancer is usually investigated when an abnormal lump is found (through self-examination or an x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor performs a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body.*

*This breast cancer dataset was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.*

**Goal**: Build a classification model that can accurately identify the diagnosis of breast cancer based on the measurements and attributes of a tumor.


*Source: https://www.kaggle.com/datasets/merishnasuwal/breast-cancer-prediction-dataset*

### Modeling

In [27]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

from math import sqrt
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve, cross_val_score

from sklearn.preprocessing import StandardScaler, MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.feature_selection import SelectKBest
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [2]:
#import breast cancer risk dataset
df = pd.read_csv('/Users/joyopsvig/github/springboard/3-Capstone/Notebooks/breastcancerrisk_cleaned.csv')

In [4]:
#view head of data
df.head()

| | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | diagnosis |
|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0 |


In [3]:
#view info for data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568 entries, 0 to 567
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   mean_radius      568 non-null    float64
 1   mean_texture     568 non-null    float64
 2   mean_perimeter   568 non-null    float64
 3   mean_area        568 non-null    float64
 4   mean_smoothness  568 non-null    float64
 5   diagnosis        568 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 26.8 KB


In [5]:
#separate the features (X) from the response variable (y)
X = df.drop('diagnosis', axis = 1)
y = df['diagnosis']

In [6]:
#split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

#scale features to the 0-1 range; the scaler is fit on the training data only and then applied to the test data
scaler = MinMaxScaler()
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)
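
The scaling step above can be sketched without scikit-learn to show what fitting on the training set and then transforming the test set amounts to. This is a minimal illustration with made-up values and hypothetical helper names (`fit_min_max`, `apply_min_max`), not the library's implementation. It assumes every column has at least two distinct training values.

```python
def fit_min_max(rows):
    """Learn per-column (min, max) pairs from the training rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, bounds):
    """Scale each value as (x - min) / (max - min) using the training bounds."""
    return [
        [(x - lo) / (hi - lo) for x, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]


# Toy values for illustration only, not taken from the dataset.
train_rows = [[10.0, 300.0], [15.0, 700.0], [20.0, 1100.0]]
test_rows = [[12.5, 500.0], [22.0, 1300.0]]

bounds = fit_min_max(train_rows)
train_scaled = apply_min_max(train_rows, bounds)
test_scaled = apply_min_max(test_rows, bounds)

print(train_scaled)  # training values always land in [0, 1]
print(test_scaled)   # test values can fall outside [0, 1]
```

Because the bounds come from the training rows, a test value larger than the training maximum scales to a number above 1. That is expected and avoids leaking test-set information into the preprocessing.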

### Logistic Regression Model

In [16]:
#create model
model = LogisticRegression()

#train the model
model.fit(X_train_norm, y_train)

#create the predictions
predictions = model.predict(X_test_norm)

#create classification report, confusion matrix
logclassreport = classification_report(y_test, predictions)
logconfmatrix = confusion_matrix(y_test, predictions)

#print the results
print(logclassreport)
print(logconfmatrix)


              precision    recall  f1-score   support

           0       0.97      0.70      0.81        50
           1       0.86      0.99      0.92        92

    accuracy                           0.89       142
   macro avg       0.92      0.84      0.87       142
weighted avg       0.90      0.89      0.88       142

[[35 15]
 [ 1 91]]
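
To make the link between the confusion matrix and the classification report explicit, the per-class figures can be recomputed by hand. The sketch below copies the matrix printed above and assumes scikit-learn's layout (rows are true labels, columns are predicted labels). The helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix copied from the logistic regression output above.
# Rows are true labels, columns are predicted labels.
conf_matrix = [[35, 15],
               [1, 91]]


def per_class_metrics(matrix, label):
    """Return (precision, recall, f1) for one class label."""
    true_pos = matrix[label][label]
    actual = sum(matrix[label])                     # all samples truly in this class
    predicted = sum(row[label] for row in matrix)   # all samples predicted as this class
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


results = {label: per_class_metrics(conf_matrix, label) for label in (0, 1)}
for label, (precision, recall, f1) in results.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For class 0 this gives a precision of 35/36 and a recall of 35/50, which round to the 0.97 and 0.70 shown in the report: the model rarely predicts class 0 incorrectly, but it misses 15 of the 50 true class-0 cases.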


### Random Forest Model

In [21]:
#create model
clf = RandomForestClassifier()

#train the model
clf.fit(X_train_norm, y_train)

#create the predictions
predictions2 = clf.predict(X_test_norm)

#create classification report, confusion matrix
clfclassreport = classification_report(y_test, predictions2)
clfconfmatrix = confusion_matrix(y_test, predictions2)

#print the results
print(clfclassreport)
print(clfconfmatrix)

              precision    recall  f1-score   support

           0       0.96      0.88      0.92        50
           1       0.94      0.98      0.96        92

    accuracy                           0.94       142
   macro avg       0.95      0.93      0.94       142
weighted avg       0.94      0.94      0.94       142

[[44  6]
 [ 2 90]]


In [31]:
features = pd.DataFrame(clf.feature_importances_, index=X_train.columns, columns=['importance']).sort_values('importance', ascending=False)
features

| | importance |
|---|---|
| mean_area | 0.325251 |
| mean_perimeter | 0.255591 |
| mean_radius | 0.199271 |
| mean_texture | 0.112006 |
| mean_smoothness | 0.107881 |
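
One way to read these importances is as shares of a total that sums to 1. The sketch below copies the values from the table and accumulates them in ranked order. It is only a reading aid and does not retrain the model.

```python
# Importances copied from the random forest output above.
importances = {
    "mean_area": 0.325251,
    "mean_perimeter": 0.255591,
    "mean_radius": 0.199271,
    "mean_texture": 0.112006,
    "mean_smoothness": 0.107881,
}

# Rank features from most to least important and track the running total.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
running = 0.0
cumulative = []
for name, value in ranked:
    running += value
    cumulative.append((name, running))
    print(f"{name:<16} {value:.3f}  cumulative={running:.3f}")
```

The three size measurements (area, perimeter, radius) together account for roughly 78% of the total. Since perimeter and area are geometrically tied to radius, the forest may be splitting one underlying signal across the three columns, so their individual rankings should be interpreted with some caution.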


### K-Nearest Neighbors Model

In [23]:
#create model
knn = KNeighborsClassifier()

#train the model
knn.fit(X_train_norm, y_train)

#create the predictions
predictions3 = knn.predict(X_test_norm)

#create classification report, confusion matrix
knnclassreport = classification_report(y_test, predictions3)
knnconfmatrix = confusion_matrix(y_test, predictions3)

#print the results
print(knnclassreport)
print(knnconfmatrix)

              precision    recall  f1-score   support

           0       0.93      0.80      0.86        50
           1       0.90      0.97      0.93        92

    accuracy                           0.91       142
   macro avg       0.91      0.88      0.90       142
weighted avg       0.91      0.91      0.91       142

[[40 10]
 [ 3 89]]


In [32]:
#print accuracy for each model
#accuracy is the number of correct predictions divided by the total number of predictions
print("Logistic Regression Accuracy:", accuracy_score(y_test, predictions))
print("Random Forest Accuracy:", accuracy_score(y_test, predictions2))
print("KNN Accuracy:", accuracy_score(y_test, predictions3))

Logistic Regression Accuracy: 0.8873239436619719
Random Forest Accuracy: 0.9436619718309859
KNN Accuracy: 0.9084507042253521
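
The accuracy figures can be cross-checked against the three confusion matrices printed earlier, and the same matrices show how much the models differ on class 0, which a single accuracy number hides. The sketch below copies those matrices. The helper names `accuracy` and `recall` are hypothetical, and the comparison rests on one random train/test split, so the ranking may shift on a different split.

```python
# Confusion matrices copied from the outputs above
# (rows are true labels, columns are predicted labels).
matrices = {
    "Logistic Regression": [[35, 15], [1, 91]],
    "Random Forest": [[44, 6], [2, 90]],
    "KNN": [[40, 10], [3, 89]],
}


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall(matrix, label):
    """Share of the true samples of a class that the model identified."""
    return matrix[label][label] / sum(matrix[label])


summary = {name: (accuracy(m), recall(m, 0)) for name, m in matrices.items()}
for name, (acc, rec0) in summary.items():
    print(f"{name}: accuracy={acc:.4f}, class-0 recall={rec0:.2f}")
```

The recomputed accuracies match the `accuracy_score` output, while class-0 recall ranges from 0.70 for logistic regression to 0.88 for the random forest, a wider gap than the accuracy figures alone suggest.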
