# Lesson 05
# Peter Lorenz

## 0. Preliminaries

Import the required libraries:

In [8]:
import matplotlib as mpl
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Set global options:

In [2]:
# Display plots inline
%matplotlib inline

# Display multiple cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Suppress scientific notation
np.set_printoptions(suppress=True)
np.set_printoptions(precision=3)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

Declare utility functions:

## Read data
First we import the data set:

In [10]:
# Internet location of the data set
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"

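# Download the data into a dataframe object.
# Note: the file has no header row, so without header=None the first record
# is consumed as column names, leaving 698 of the 699 records
cancer_data = pd.read_csv(url, comment='#')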

cancer_data.columns = ['ID', 'Clump Thickness', 
                       'Uniformity of Cell Size', 'Uniformity of Cell Shape', 
                       'Marginal Adhesion', 'Single Epithelial Cell Size', 
                       'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 
                       'Mitosis', 'Class']

# Display shape and initial data
cancer_data.shape
cancer_data.head()

# Examine column types
cancer_data.info()

(698, 11)

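| | ID | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitosis | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 1 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 2 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 3 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 4 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |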


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698 entries, 0 to 697
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   ID                           698 non-null    int64 
 1   Clump Thickness              698 non-null    int64 
 2   Uniformity of Cell Size      698 non-null    int64 
 3   Uniformity of Cell Shape     698 non-null    int64 
 4   Marginal Adhesion            698 non-null    int64 
 5   Single Epithelial Cell Size  698 non-null    int64 
 6   Bare Nuclei                  698 non-null    object
 7   Bland Chromatin              698 non-null    int64 
 8   Normal Nucleoli              698 non-null    int64 
 9   Mitosis                      698 non-null    int64 
 10  Class                        698 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.1+ KB


The 'Bare Nuclei' column is read as a string (object) column because its missing values are encoded as question marks. We replace the missing values with the column median and convert the column to int:

In [23]:
# Treat '?' as missing and parse the remaining values as numbers
bare_nuclei = pd.to_numeric(cancer_data['Bare Nuclei'], errors='coerce')

# Impute missing values using the column median, then convert to integer
cancer_data['Bare Nuclei'] = bare_nuclei.fillna(bare_nuclei.median()).astype(int)

Next we prepare the data set for modeling by removing and setting aside the 'ID' column:

In [6]:
# Extract and reserve the ID column
id_values = np.array(cancer_data['ID'].values)
cancer_data = cancer_data.drop(['ID'], axis=1)

Now extract the 'Class' column as the target variable, encoding malignant (class value 4) as 1 and benign (class value 2) as 0:

In [25]:
# Extract and encode target variable
# Compare the Series directly (wrapping it in a list would give shape (1, n))
is_malignant = (cancer_data['Class'] == 4).astype(int).to_numpy()

# Remove class from the data set
cancer_data = cancer_data.drop(['Class'], axis=1)

Because the target is binary, a single indicator suffices: a value of 0 in `is_malignant` implies the benign class.

## 1. Test both entropy and the gini coefficient
In this section we compare two split criteria, entropy and Gini impurity (scikit-learn's `criterion='gini'`), to determine which performs better and why. First we partition our data set into training and test sets:

In [None]:
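# Features are the remaining columns; the target is the flattened malignancy indicator
X, y = cancer_data, np.ravel(is_malignant)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.1, 
                                                    random_state = 99)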
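For reference, the two split criteria compared in this section can be computed by hand. The sketch below is not part of the original notebook, and the helper names `class_proportions`, `entropy` and `gini` are illustrative. For class proportions p_k, it uses the standard definitions: entropy is the sum of p_k * log2(1 / p_k), and Gini impurity is 1 minus the sum of p_k squared.

```python
import numpy as np

def class_proportions(labels):
    """Return the proportion of each class present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    """Shannon entropy in bits: sum(p * log2(1 / p))."""
    p = class_proportions(labels)
    return float(np.sum(p * np.log2(1.0 / p)))

def gini(labels):
    """Gini impurity: 1 - sum(p ** 2)."""
    p = class_proportions(labels)
    return float(1.0 - np.sum(p ** 2))

# An evenly mixed binary node is maximally impure under both measures
mixed = np.array([0, 0, 1, 1])
# A node containing a single class has zero impurity under both measures
pure = np.array([1, 1, 1, 1])

print("mixed:", entropy(mixed), gini(mixed))
print("pure: ", entropy(pure), gini(pure))
```

Both measures are zero for a pure node and peak for an even class mix, which is one reason the two criteria often choose similar splits.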

Next we build a decision tree classifier that uses the entropy criterion:

In [None]:
# Fit a depth-limited tree that splits on information gain (entropy)
tree_onehot_ent = DecisionTreeClassifier(criterion = 'entropy', max_depth = 3)
tree_onehot_ent.fit(X_train, y_train)

# Predict on the held-out test set
y_pred_test_ent = tree_onehot_ent.predict(X_test)

We also build a decision tree classifier that uses the Gini impurity criterion:

In [None]:
# Fit a tree that splits on Gini impurity (no depth limit is set here)
tree_onehot_gini = DecisionTreeClassifier(criterion = 'gini')
tree_onehot_gini.fit(X_train, y_train)

# Predict on the held-out test set
y_pred_test_gini = tree_onehot_gini.predict(X_test)
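To answer which criterion performs better, the two models need to be scored on the same held-out data. The following is a minimal sketch of one way to do that, not the notebook's own evaluation. It uses synthetic stand-in data from `make_classification` (an assumption, so that the snippet runs on its own) and gives both trees the same depth limit and seed so that only the criterion differs.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption) so the snippet runs on its own;
# in the notebook, reuse the existing train/test split instead
X_demo, y_demo = make_classification(n_samples=698, n_features=9, random_state=99)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.1, random_state=99)

results = {}
for criterion in ("entropy", "gini"):
    # Same depth limit and seed for both trees so only the criterion differs
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=99)
    tree.fit(X_tr, y_tr)
    results[criterion] = accuracy_score(y_te, tree.predict(X_te))

for criterion, accuracy in results.items():
    print(f"{criterion}: test accuracy = {accuracy:.3f}")
```

With a test set this small, differences of a few points in accuracy may be noise, so a cross-validated comparison is likely to be more reliable.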

## 2. Find best hyperparameter settings
In this section we find the best hyperparameter settings for the entropy and Gini impurity criteria.
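One common way to run such a search is a cross-validated grid search. The sketch below is an illustration under stated assumptions, not the notebook's method: the grid values are hypothetical, the data is a synthetic stand-in, and `roc_auc` is chosen as the scoring metric only because AUC is the metric discussed later in this lesson.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption) so the snippet runs on its own
X_demo, y_demo = make_classification(n_samples=698, n_features=9, random_state=99)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.1, random_state=99)

# Hypothetical grid: the candidate values are chosen for illustration only
param_grid = {
    "criterion": ["entropy", "gini"],
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross-validation on the training data only; the test set stays untouched
search = GridSearchCV(DecisionTreeClassifier(random_state=99),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated AUC: {search.best_score_:.3f}")
```

Searching over `criterion` together with the other parameters compares each criterion at its own best settings rather than at one arbitrary depth.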

## 3. Visualize both models and see which feature is selected for each criterion
In this section we visualize both models and see which feature each criterion selects. Our goal is to determine whether both models select the same feature and to explain why.
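A lightweight way to inspect the fitted trees is scikit-learn's `export_text`, which prints the split rules, together with the `tree_.feature` array, whose first entry is the column index used at the root. The sketch below assumes synthetic stand-in data with generic feature names, so the feature it reports says nothing about the cancer data itself.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with generic feature names (both are assumptions)
X_demo, y_demo = make_classification(n_samples=698, n_features=9, random_state=99)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

root_features = {}
for criterion in ("entropy", "gini"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=99)
    tree.fit(X_demo, y_demo)
    # tree_.feature[0] is the column index used by the root split
    root_features[criterion] = feature_names[tree.tree_.feature[0]]
    print(f"--- {criterion} ---")
    print(export_text(tree, feature_names=feature_names))

print("Root split feature per criterion:", root_features)
```

For a graphical rendering, `sklearn.tree.plot_tree` draws the same structure with matplotlib.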

## 4. Determine the AUC for the best model you can achieve
In this section we determine the AUC for the best model we can achieve. We also examine the precision and recall values and consider which of the two we would want to maximize.
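The three metrics named above can be computed with `sklearn.metrics`. The following is a minimal sketch on synthetic stand-in data (an assumption), using an entropy tree of depth 3 as a placeholder for whatever the best model turns out to be. AUC is computed from predicted probabilities, while precision and recall are computed from hard class predictions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption) so the snippet runs on its own
X_demo, y_demo = make_classification(n_samples=698, n_features=9, random_state=99)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.1, random_state=99)

# Placeholder model; substitute the best model found during tuning
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=99)
tree.fit(X_tr, y_tr)

# AUC needs scores rather than hard labels: use the probability of class 1
y_score = tree.predict_proba(X_te)[:, 1]
y_pred = tree.predict(X_te)

auc = roc_auc_score(y_te, y_score)
precision = precision_score(y_te, y_pred)
recall = recall_score(y_te, y_pred)

print(f"AUC:       {auc:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```

With malignant encoded as 1, recall measures the share of malignant cases the model catches. In a screening setting a missed malignancy is usually considered costlier than a false alarm, which is an argument for prioritizing recall.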

## 5. Implications of using this type of machine learning algorithm for breast cancer analysis
Here we examine the implications of using this type of machine learning algorithm for breast cancer analysis.