# Lesson 05 Assignment

## Background

    Rooney's client is a tech-manufacturing startup working on a number of automated detection devices for the medical and construction industries. Among the auto-detection devices is a reader that examines possible carcinoma tissue samples and classifies each sample as either benign or malignant. Rooney asks you for help in developing a better algorithm than the current classifier; perhaps a decision tree can help.

    For this assignment, you will design an experiment using decision tree classifiers for the detection of breast cancer and compare their accuracy.

## Data Details

    The Breast Cancer Wisconsin Data Set was obtained from the University of Wisconsin Hospitals, Madison. Donors:

    (1) Dr. William H. Wolberg, General Surgery Dept.
    (2) W. Nick Street, Computer Sciences Dept.
    (3) Olvi L. Mangasarian, Computer Sciences Dept.

    The data contain the simplified and normalized attributes used to detect breast cancer.

    Attributes:
    (1) Sample code number: id number
    (2) Class (4 for malignant, 2 for benign)
    (3) Clump Thickness: 1 - 10
    (4) Uniformity of Cell Size: 1 - 10
    (5) Uniformity of Cell Shape: 1 - 10
    (6) Marginal Adhesion: 1 - 10
    (7) Single Epithelial Cell Size: 1 - 10
    (8) Bare Nuclei: 1 - 10
    (9) Bland Chromatin: 1 - 10
    (10) Normal Nucleoli: 1 - 10
    (11) Mitosis

## Instructions

    It is recommended you complete the lab exercises for this lesson before beginning the assignment.

    Using the WI_Breast_Cancer csv file, create a new notebook to build a decision tree classifier that would be able to detect whether a tumor is benign or malignant. Complete the following tasks and answer the questions:

    -Read Data
    -Test both entropy and the gini coefficient. Which performs better and why?
    -What are the best hyperparameter settings for both?
    -Determine the AUC for the best model you can achieve. What are the precision and recall values, and which one might you want to maximize?
    -What are the implications of using this type of machine learning algorithm for breast cancer analysis?

In [None]:
# Import packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from collections import OrderedDict
import datetime as dt
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.model_selection import train_test_split 

#Plot styling

import seaborn as sns; sns.set()  # for plot styling
%matplotlib inline

## Read Data

In [None]:
# Read the CSV file from a local path

data = pd.read_csv("/Users/matt.denko/Downloads/WI_Breast_Cancer.csv") 
data.columns = ['sample_id','class','clump_thickness', 'cell_size', 'cell_shape','adhesion',
                'epithelial','nuclei','chromatin','nucleoli','mitosis'] 
(nrows, ncols) = data.shape
print(data.columns)
data.describe()
data.head()

In [None]:
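#Flag missing data ("?" in the raw file) as NaN and count it per column

data = data.replace(to_replace= "?", value=float("NaN"))
data_null = data.isnull().sum()
print(data_null)
# Report the actual number of affected columns instead of a hard-coded 0
print("There are {} columns with missing data".format((data_null > 0).sum()))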
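
    The cell above only marks and counts missing values; it does not drop them. A minimal sketch of one way to actually remove incomplete cases is shown below. It assumes a small made-up frame (`toy`) in place of the assignment CSV, with "?" standing for a missing reading as in the real file; the column values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the assignment CSV;
# "?" marks a missing bare-nuclei reading.
toy = pd.DataFrame({
    "sample_id": [1001, 1002, 1003, 1004],
    "class": [2, 4, 2, 4],
    "nuclei": ["1", "10", "?", "3"],
})

# Coerce the column to numbers; anything unparseable ("?") becomes NaN.
toy["nuclei"] = pd.to_numeric(toy["nuclei"], errors="coerce")

# Count missing values per column, then keep only complete rows.
missing_per_column = toy.isnull().sum()
toy_clean = toy.dropna().reset_index(drop=True)

print(missing_per_column)
print("Columns with missing data:", int((missing_per_column > 0).sum()))
print("Rows kept:", len(toy_clean), "of", len(toy))
```

    Dropping rows is only one option; imputing a value would keep more samples, at the cost of inventing a reading that was never taken.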

## Test both entropy and the gini coefficient. Which performs better and why?

### Convert string features to integers

In [None]:
# Convert string features to integers

colnames = list(data.columns.values)
string_encoding = {}
data_encoded = data.copy()
for i in range(ncols):
    levels = list(set(data.iloc[:, i]))
    num_levels = len(levels)
    string_encoding_i = dict(zip(levels, range(num_levels)))
    string_encoding[colnames[i]] = string_encoding_i
    for j in range(nrows):
        data_encoded.iloc[j, i] = string_encoding_i[data.iloc[j, i]]

print(string_encoding)
print(data_encoded.head())

### One Hot Encoding Categorical Variables

In [None]:
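#One Hot Encoding Categorical Variables

target_label = 'class'
feature_labels = [x for x in data_encoded.columns if x not in [target_label]]
x = data_encoded[feature_labels]

# Note: with 'class' removed, columns 0:4 of x are sample_id, clump_thickness,
# cell_size and cell_shape, so only these four columns are one-hot encoded.
enc = preprocessing.OneHotEncoder()
enc.fit(x.iloc[:,0:4])
data_onehotencoded = enc.transform(x.iloc[:,0:4])

# Column names of the raw data, kept for reference (not used below)
feature_names = ['sample_id','class','clump_thickness', 'cell_size', 'cell_shape','adhesion',
                'epithelial','nuclei','chromatin','nucleoli','mitosis']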
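
    The slice `0:4` in the cell above covers the sample id and the first three measurements. A hedged sketch of selecting every measurement column while leaving out the identifier and the target is given below. The frame `toy` and its values are made up for illustration, and only two of the nine measurement columns are included to keep it short.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature frame; the real data has nine measurement columns.
toy = pd.DataFrame({
    "sample_id": [1001, 1002, 1003, 1004],
    "class": [2, 4, 2, 4],
    "clump_thickness": [5, 8, 1, 10],
    "cell_size": [1, 10, 1, 7],
})

# Select features by name: everything except the identifier and the target.
feature_cols = [c for c in toy.columns if c not in ("sample_id", "class")]

# One-hot encode each distinct level of each selected column.
encoder = OneHotEncoder(handle_unknown="ignore")
X_onehot = encoder.fit_transform(toy[feature_cols]).toarray()

print(feature_cols)
print(X_onehot.shape)
```

    Selecting by name avoids depending on column order. Because the measurements are ordinal scores from 1 to 10, a decision tree can also split on them directly without one-hot encoding; which representation works better here is an empirical question.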

### Assign the X (feature) and Y (class) Arrays and Split into Train and Test Data

In [None]:
#Train and Test Data
# Ensure the decision tree is deterministic

np.random.seed(101)

X = data_onehotencoded.toarray()
Y = data_encoded[target_label]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.1, random_state = 99)
print(y_test)

### Generate and Evaluate the Model - Entropy

In [None]:
# Generate the Classification model

dec_tree_ent = DecisionTreeClassifier(criterion='entropy', max_depth=3)
model = dec_tree_ent.fit(X_train,y_train)

# Validate the model

y_predict_ent = model.predict(X_test)

In [None]:
# Generate the accuracy score

acc_ent = accuracy_score(y_test, y_predict_ent) * 100
print("Entropy Accuracy is : {}%".format(acc_ent))

### Create a Confusion Matrix

In [None]:
# Create a Confusion Matrix

pd.DataFrame(
    confusion_matrix(y_test, y_predict_ent),
    columns=['Predicted Benign', 'Predicted Malignant'],
    index=['True Benign', 'True Malignant']
)

### Gini Score

In [None]:
# Use Gini impurity (default) instead of Information Gain (entropy)

dec_tree_gini = DecisionTreeClassifier().fit(X_train,y_train)  

# Validate the model

y_predict_gini = dec_tree_gini.predict(X_test)

# Generate the accuracy score

acc_gini = accuracy_score(y_test, y_predict_gini) * 100
print("Gini Accuracy is : {}%".format(acc_gini))

#### Comments:

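    The accuracy of the entropy model was 94%, while the accuracy of the gini model was 97%. The two runs are not a like-for-like comparison: the entropy tree was limited to max_depth=3, while the gini tree used the default settings with no depth limit, so the gap may reflect tree depth as much as the split criterion.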
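
    To see why the two criteria often produce similar trees, it can help to compute both impurity measures for the same class proportions. The sketch below uses the standard definitions (entropy as the negative sum of p * log2(p), gini as one minus the sum of p squared); the helper names `entropy` and `gini` and the example proportions are chosen here for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log2(p)).sum())

def gini(p):
    """Gini impurity of a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - (p ** 2).sum())

# Compare the two measures on a balanced node and two less mixed nodes.
for probs in ([0.5, 0.5], [0.65, 0.35], [0.9, 0.1]):
    print(probs, round(entropy(probs), 3), round(gini(probs), 3))
```

    Both measures are largest for a balanced node and shrink as the node becomes purer, which is why they usually rank candidate splits in a similar order.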
    

## What are the best hyperparameter settings for both?

#### Comments: 

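    Both the gini and entropy models have extremely high accuracy scores, each above 90%. No hyperparameter search was run here: the entropy tree used max_depth=3 and the gini tree used the scikit-learn defaults. The choice of features and the handling of missing values are preprocessing decisions rather than hyperparameters, so the accuracy scores alone do not identify the best settings; that would require comparing values of parameters such as max_depth for each criterion.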
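
    One common way to look for good settings is a cross-validated grid search run separately for each criterion. The sketch below is an illustration under stated assumptions, not the assignment's result: it uses synthetic data from `make_classification` in place of the breast cancer features, and the grid values are arbitrary starting points.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the nine-feature data set (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=99
)

# Arbitrary illustrative grid of tree settings.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

# Search the same grid once per split criterion with 5-fold cross-validation.
best = {}
for criterion in ("gini", "entropy"):
    search = GridSearchCV(
        DecisionTreeClassifier(criterion=criterion, random_state=101),
        param_grid,
        cv=5,
        scoring="accuracy",
    )
    search.fit(X_tr, y_tr)
    best[criterion] = (search.best_params_, search.best_score_)
    print(criterion, search.best_params_, round(search.best_score_, 3))
```

    Searching on the training folds only, and scoring the chosen model once on the held-out test set, keeps the test accuracy from being inflated by the tuning itself.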

## Determine the AUC for the best model you can achieve. What are the precision and recall values and which might be the one you want to maximize?

In [None]:
#AUC (computed here from the gini model's hard class predictions)

auc_score = roc_auc_score(y_test, y_predict_gini)
print("AUC score: ",auc_score)

#Precision-Recall

prfs = precision_recall_fscore_support(y_test, y_predict_gini, average='macro')
print("Precision, Recall, Fscore: ", prfs)

### Comments: 

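    The AUC score is .98, the precision score is .94, and the recall score is .98. The accuracy of this model is .94. These are all high scores. Both precision and recall are already close to their maximum, so neither needs much further tuning here; if one had to be prioritized, recall would be the natural choice for cancer detection, since a missed malignant sample (a false negative) is the costlier error.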
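
    AUC is usually computed from predicted probabilities rather than hard labels, so that it reflects how well the model ranks samples across all thresholds. A minimal sketch follows, again on synthetic data (an assumption), with class 1 standing in for the malignant class.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; class 1 plays the role of "malignant" (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=99, stratify=y_demo
)

clf = DecisionTreeClassifier(max_depth=3, random_state=101).fit(X_tr, y_tr)

# AUC from the predicted probability of the positive class.
proba_pos = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba_pos)

# Precision and recall for the positive class specifically.
y_pred = clf.predict(X_te)
precision = precision_score(y_te, y_pred, pos_label=1)
recall = recall_score(y_te, y_pred, pos_label=1)

print("AUC:", round(auc, 3))
print("Precision:", round(precision, 3), "Recall:", round(recall, 3))
```

    Reporting precision and recall for the positive class, instead of a macro average over both classes, makes it easier to see how many malignant samples are being caught.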

## What are the implications of using this type of machine learning algorithm for breast cancer analysis?

### Comments: 

     Because this analysis concerns a delicate medical issue, breast cancer, the takeaways from any study can have huge implications. Since we are predicting whether or not a person has breast cancer, we would ideally want to reduce the number of false negatives, because a false negative means missing a diagnosis and potentially leaving a patient with breast cancer untreated. At the same time, machine learning can help breast cancer treatment by making it easier to detect cancerous tumors at an earlier stage.
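
    One way to act on the false-negative concern is to lower the probability threshold at which a sample is flagged as malignant. The sketch below illustrates the idea on synthetic data (an assumption); the helper `false_negatives` and the threshold of 0.2 are hypothetical choices, not values derived from the assignment data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; class 1 plays the role of "malignant" (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=9, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=99, stratify=y_demo
)

clf = DecisionTreeClassifier(max_depth=3, random_state=101).fit(X_tr, y_tr)
proba_pos = clf.predict_proba(X_te)[:, 1]

def false_negatives(threshold):
    """Count positive samples predicted negative at a given threshold."""
    y_pred = (proba_pos >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred, labels=[0, 1]).ravel()
    return int(fn)

fn_default = false_negatives(0.5)  # the usual cut-off
fn_lowered = false_negatives(0.2)  # flag samples more readily

print("False negatives at 0.5:", fn_default)
print("False negatives at 0.2:", fn_lowered)
```

    Lowering the threshold can only keep or reduce the false-negative count, but it typically raises the number of false positives, so the trade-off would need to be weighed with clinical input.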