## Assignment 6

Instructions: use what you learned about evaluation and ensembles this week to evaluate the performance of a set of classifiers on the breast cancer data set (https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data (Links to an external site.)). In particular, you will need to download the data and perform any necessary preprocessing, then use 5-fold cross-validation to evaluate the performance of two classifiers of your choice. Report the results in terms of precision, recall, and area under the ROC curve (AUC), across the 5 folds (both mean and standard deviation). Finally, perform a two-sample t-test on the AUC scores to report whether there is a statistically significant difference between the performance of both classifiers. If there is a difference, give at least one reason why you think that is the case. If there is no difference, you should also explain why you think that is the case.

Please note that in this data set, the class is the first column. You can find the feature names and details at https://archive.ics.uci.edu/ml/datasets/breast+cancer (Links to an external site.).

The assignment will be graded as follows:

- Data understanding and preprocessing (be sure to check for missing values) (5 points)
- Correct application of 5-fold cross validation (10 points)
- Correctly computing precision, recall, and AUC (you may use sklearn or other libraries) (10 points)
- Correctly applying the t-test and justifying why one classifier may have outperformed the other (10 points)

In [56]:
import pandas as pd
import numpy as np
from sklearn.mixture import GaussianMixture

# download data
feature_names = ['class','age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat']
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data'
df = pd.read_csv(data_url, names=feature_names)
df

Unnamed: 0,class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
0,no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
1,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
2,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
3,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
4,no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no
...,...,...,...,...,...,...,...,...,...,...
281,recurrence-events,30-39,premeno,30-34,0-2,no,2,left,left_up,no
282,recurrence-events,30-39,premeno,20-24,0-2,no,3,left,left_up,yes
283,recurrence-events,60-69,ge40,20-24,0-2,no,1,right,left_up,no
284,recurrence-events,40-49,ge40,30-34,3-5,no,3,left,left_low,no


In [57]:
####Data Pre-Processing
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

#Null Values
print("MISSING VALUES: \n",df.isnull().sum(), "\n") #No Null Values

#Value count summary
for col in df.columns:
    print("FEATURE {}: {}\n".format(col, df[col].value_counts()))

#Found '?' on node-caps and breast-quad feature, We consider these to be missing and thus remove them from the dataset.
missingValueColumns = ['node-caps','breast-quad']
for col in missingValueColumns:
    df = df[df[col] != '?']
    print("Remove '?' from {}: {}\n".format(col, df[col].value_counts()))

#Encode categorical variables to numeric
encoder = LabelEncoder()
for column in df.columns:
    df[column] = encoder.fit_transform(df[column])

MISSING VALUES: 
 class          0
age            0
menopause      0
tumor-size     0
inv-nodes      0
node-caps      0
deg-malig      0
breast         0
breast-quad    0
irradiat       0
dtype: int64 

FEATURE class: class
no-recurrence-events    201
recurrence-events        85
Name: count, dtype: int64

FEATURE age: age
50-59    96
40-49    90
60-69    57
30-39    36
70-79     6
20-29     1
Name: count, dtype: int64

FEATURE menopause: menopause
premeno    150
ge40       129
lt40         7
Name: count, dtype: int64

FEATURE tumor-size: tumor-size
30-34    60
25-29    54
20-24    50
15-19    30
10-14    28
40-44    22
35-39    19
0-4       8
50-54     8
5-9       4
45-49     3
Name: count, dtype: int64

FEATURE inv-nodes: inv-nodes
0-2      213
3-5       36
6-8       17
9-11      10
15-17      6
12-14      3
24-26      1
Name: count, dtype: int64

FEATURE node-caps: node-caps
no     222
yes     56
?        8
Name: count, dtype: int64

FEATURE deg-malig: deg-malig
2    130
3     85
1  

In [58]:
#View final dataset structure.
df.head()

Unnamed: 0,class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
0,0,1,2,5,0,0,2,0,1,0
1,0,2,2,3,0,0,1,1,4,0
2,0,2,2,3,0,0,1,0,1,0
3,0,4,0,2,0,0,1,1,2,0
4,0,2,2,0,0,0,1,1,3,0


In [59]:
#View Distributions of features
df.describe()

Unnamed: 0,class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
count,277.0,277.0,277.0,277.0,277.0,277.0,277.0,277.0,277.0,277.0
mean,0.292419,2.642599,1.093863,4.068592,1.01444,0.202166,1.057762,0.476534,1.787004,0.223827
std,0.455697,1.010125,0.988264,2.178366,1.876574,0.402342,0.729989,0.500353,1.097483,0.417562
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,3.0,0.0,0.0,1.0,0.0,1.0,0.0
50%,0.0,3.0,2.0,4.0,0.0,0.0,1.0,0.0,2.0,0.0
75%,1.0,3.0,2.0,5.0,0.0,0.0,2.0,1.0,2.0,0.0
max,1.0,5.0,2.0,10.0,6.0,1.0,2.0,1.0,4.0,1.0


In [62]:
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.preprocessing import StandardScaler

###Running classifiers: Decision Tree and nB 
#Seperate features and class
X = df.drop('class', axis=1)
Y = df['class']

#Perform Kfold cross validation
kfold = KFold(n_splits=5, shuffle=True) # initialize the KFold object
splits = kfold.split(X) # call the split method on our feature set

#Initialize models
dtClassifier = DecisionTreeClassifier()
nBClassifier = GaussianNB()


dt_results = []
nB_results = []

for i, split in enumerate(splits):
    train_idx, test_idx = split
    X_train = X.iloc[train_idx]
    y_train = Y.iloc[train_idx]
    X_test = X.iloc[test_idx]
    y_test = Y.iloc[test_idx]
    
    
    #Train Model
    dtClassifier.fit(X_train, y_train)
    nBClassifier.fit(X_train, y_train)
    
    # Predict with models
    dt_pred = dtClassifier.predict(X_test)
    nB_pred = nBClassifier.predict(X_test)
    
    # Compute metrics for Decision Tree
    dt_precision = precision_score(y_test, dt_pred, zero_division=0)
    dt_recall = recall_score(y_test, dt_pred, zero_division=0)
    dt_auc = roc_auc_score(y_test, dtClassifier.predict_proba(X_test)[:, 1])
    
    dt_results.append((dt_precision, dt_recall, dt_auc))
    
    # Compute metrics for Gaussian Mixture Model
    nB_precision = precision_score(y_test, nB_pred, zero_division=0)
    nB_recall = recall_score(y_test, nB_pred, zero_division=0)
    nB_auc = roc_auc_score(y_test, nBClassifier.predict_proba(X_test).max(axis=1))
    
    nB_results.append((nB_precision, nB_recall, nB_auc))


# Convert results to numpy arrays for easier mean and std calculation
dt_results = np.array(dt_results)
nB_results = np.array(nB_results)

# Calculate mean and standard deviation for Decision Tree
dt_precision_mean = dt_results[:, 0].mean()
dt_precision_std = dt_results[:, 0].std()
dt_recall_mean = dt_results[:, 1].mean()
dt_recall_std = dt_results[:, 1].std()
dt_auc_mean = dt_results[:, 2].mean()
dt_auc_std = dt_results[:, 2].std()

# Calculate mean and standard deviation for Gaussian Mixture Model
nB_precision_mean = nB_results[:, 0].mean()
nB_precision_std = nB_results[:, 0].std()
nB_recall_mean = nB_results[:, 1].mean()
nB_recall_std = nB_results[:, 1].std()
nB_auc_mean = nB_results[:, 2].mean()
nB_auc_std = nB_results[:, 2].std()

# Print results for Decision Tree
print("Decision Tree Classifier - Precision: Mean =", dt_precision_mean, "STD =", dt_precision_std)
print("Decision Tree Classifier - Recall: Mean =", dt_recall_mean, "STD =", dt_recall_std)
print("Decision Tree Classifier - AUC: Mean =", dt_auc_mean, "STD =", dt_auc_std)

# Print results for Gaussian Mixture Model
print("\nGaussian Naive Bayes Classifier - Precision: Mean =", nB_precision_mean, "STD =", nB_precision_std)
print("Gaussian Naive Bayes Classifier - Recall: Mean =", nB_recall_mean, "STD =", nB_recall_std)
print("Gaussian Naive Bayes Classifier - AUC: Mean =", nB_auc_mean, "STD =", nB_auc_std)
    
print(dt_results)
print(nB_results)


Decision Tree Classifier - Precision: Mean = 0.4072549019607844 STD = 0.15563934690581044
Decision Tree Classifier - Recall: Mean = 0.36846405228758167 STD = 0.13023300195642323
Decision Tree Classifier - AUC: Mean = 0.5763275735729108 STD = 0.09719146838855597

Gaussian Naive Bayes Classifier - Precision: Mean = 0.573015873015873 STD = 0.07016519398508046
Gaussian Naive Bayes Classifier - Recall: Mean = 0.5339379084967321 STD = 0.0787244591971602
Gaussian Naive Bayes Classifier - AUC: Mean = 0.48791812483491104 STD = 0.08408370739841893
[[0.5        0.53333333 0.67886179]
 [0.58333333 0.41176471 0.65987934]
 [0.46666667 0.38888889 0.56981982]
 [0.13333333 0.13333333 0.40416667]
 [0.35294118 0.375      0.56891026]]
[[0.55555556 0.66666667 0.42276423]
 [0.64285714 0.52941176 0.54751131]
 [0.66666667 0.44444444 0.61636637]
 [0.5        0.46666667 0.46833333]
 [0.5        0.5625     0.38461538]]


In [63]:
#Perform two-sample t-test on both classifiers
from scipy.stats import ttest_ind

# Perform two-sample t-test for AUC
dt_auc = dt_results[:, 2]
nB_auc = nB_results[:, 2]

t_stat_auc, p_value_auc = ttest_ind(dt_auc, nB_auc)
print("Two-sample t-test for AUC: t-statistic =", t_stat_auc, ", p-value =", p_value_auc)


Two-sample t-test for AUC: t-statistic = 1.3758558544810988 , p-value = 0.20615196929891294


After having computed the two sample t test for the AUC, we note that the p-value of 0.206 is significantly higher than our confidence value of p=0.05, thus we cannot reject the null hypothesis that these samples are not statistically significant, thus meaning that there is not sufficient evidence for the auc performance of the decision tree classifier and the gaussian naive bayes classifier to significantly outperform another.