In this assignment I build and compare a decision tree classifier and a Naive Bayes classifier on a breast cancer prediction dataset.

The Wisconsin Diagnostic Breast Cancer (WDBC) dataset is a binary classification dataset whose task is to predict whether a breast tumor is malignant or benign. Each observation corresponds to one patient case and contains measurements computed from a digitized image of a fine needle aspirate of a breast mass. The measurements describe properties of the cell nuclei in the sample.

There are thirty numerical attributes per case, derived from ten properties of the nuclei: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension. For each property, the mean, the standard error and a "worst" value (the mean of the three largest values) are recorded, which gives ten times three, or thirty, features. Aside from these features, there is a single identifier column that only labels the samples and carries no predictive information. That column is typically dropped during preprocessing because it does not help with classification.
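The thirty feature names follow a regular pattern of ten base measurements times three statistics. As a minimal sketch, the names could be generated instead of typed out. The variables `base_features` and `suffixes` are illustrative and not part of the dataset files; the spelling is chosen to match the column names assigned in Step 1.

```python
# Ten base measurements of the cell nuclei (spelling chosen to match Step 1).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# Each measurement is reported three times: mean, standard error, worst.
suffixes = ["mean", "se", "worst"]

# Group by statistic first, so all means come before all standard errors.
feature_names = [f"{name}_{suffix}" for suffix in suffixes for name in base_features]
colnames = ["id", "diagnosis"] + feature_names

print(len(feature_names))                       # 30
print(colnames[2], colnames[12], colnames[22])  # radius_mean radius_se radius_worst
```

The last line agrees with the note in wdbc.names that fields 3, 13 and 23 (counting from 1) hold the mean, standard error and worst radius.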

The target is the diagnosis column, which marks whether the tumor is malignant or benign. Malignant tumors are labeled M and benign tumors B. The objective of a machine learning algorithm trained on this data is therefore to learn a decision function that, from the thirty numeric features, correctly determines whether a tumor is cancerous. The dataset is intended to support the development and evaluation of classification methods that can assist early diagnosis of breast cancer and guide clinical decision making.


Step 1: Loading the data

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report



#loading data
path = r"C:\Users\gumas\OneDrive\Skrivebord\Dataingenior\Programmer\Visual Studio Code\Mandatory assignment 1\data\wdbc.data"
df = pd.read_csv(path, header=None)

#I assign the column names
# From the wdbc.names file: "field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.":
colnames = (
    ["id", "diagnosis", 
        "radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean",
        "compactness_mean","concavity_mean","concave_points_mean","symmetry_mean","fractal_dimension_mean",
        "radius_se","texture_se","perimeter_se","area_se","smoothness_se",
        "compactness_se","concavity_se","concave_points_se","symmetry_se","fractal_dimension_se",
        "radius_worst","texture_worst","perimeter_worst","area_worst","smoothness_worst",
        "compactness_worst","concavity_worst","concave_points_worst","symmetry_worst","fractal_dimension_worst"
    ]
)
df.columns = colnames

print("Shape:", df.shape)
print("Diagnosis counts:\n", df["diagnosis"].value_counts())

# The number of missing values is 0, which matches the description in the dataset info file
print("Missing values:", df.isna().sum().sum())
df.head()


Shape: (569, 32)
Diagnosis counts:
 diagnosis
B    357
M    212
Name: count, dtype: int64
Missing values: 0


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In this step, I perform the necessary pre-processing of the data: removing the unstable columns. An unstable column is a column that holds no useful or predictive information for the classification task. In this dataset, the id column is an example. It contains only an identifier for each sample, which is unique for each row and has no meaningful relationship to whether the tumor is malignant or benign. If I included it, the model could fit arbitrary patterns in the identifiers instead of learning from the measurements. Therefore, I chose to drop the id column.

Step 2: Pre-processing - removing the unstable columns

In [None]:
# dropping non-predictive ID
df_clean = df.drop(columns=["id"]).copy()

#  target: malignant=1, benign=0
df_clean["target"] = (df_clean["diagnosis"] == "M").astype(int)
df_clean = df_clean.drop(columns=["diagnosis"])

# features/labels
X = df_clean.drop(columns=["target"]).values
y = df_clean["target"].values

print("X shape:", X.shape, "| y shape:", y.shape)
print("Class balance (0=benign, 1=malignant):", np.bincount(y))


X shape: (569, 30) | y shape: (569,)
Class balance (0=benign, 1=malignant): [357 212]


To test the models on unseen data, I divided the data into a training set and a test set. I used an 80/20 split, which is common practice: it leaves enough data for the models to learn from while keeping a hold-out set large enough for an honest test. A fixed random state makes the split, and therefore my results, reproducible.

I partitioned the data using stratified random sampling. Random sampling avoids any ordering bias that may be present in the original file. Stratifying on the target variable preserves the original ratio of malignant to benign cases in both the training set and the test set. Both sets therefore represent the overall population, and the performance measures are not distorted by one set containing a different class mix than the other.


Step 3: Stratified random split (defense: 80/20, stratified)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

def counts(arr):
    return {"neg(0)": int((arr==0).sum()), "pos(1)": int((arr==1).sum())}

print("Train:", counts(y_train), "size:", len(y_train))
print("Test :", counts(y_test),  "size:", len(y_test))


Train: {'neg(0)': 285, 'pos(1)': 170} size: 455
Test : {'neg(0)': 72, 'pos(1)': 42} size: 114
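The effect of stratification can be checked directly from the printed counts. The short sketch below recomputes the share of malignant cases in the full data, the training set and the test set; the helper name `malignant_share` is made up for this illustration.

```python
# Class counts copied from the printed output of Steps 2 and 3.
full = {"benign": 357, "malignant": 212}
train = {"benign": 285, "malignant": 170}
test = {"benign": 72, "malignant": 42}

def malignant_share(counts):
    """Return the fraction of malignant cases in a set of class counts."""
    return counts["malignant"] / (counts["benign"] + counts["malignant"])

for name, counts in [("full", full), ("train", train), ("test", test)]:
    print(f"{name:5s} malignant share: {malignant_share(counts):.3f}")
# full  malignant share: 0.373
# train malignant share: 0.374
# test  malignant share: 0.368
```

All three shares agree to within about half a percentage point, which is as close as whole cases allow.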


Step 4: Helper - evaluation on the test set

This function is a small helper that is used after training a model to evaluate how well it performs on the test set. The function takes four inputs: the trained model, the test features Xte, the test labels yte and an optional name.

The model makes predictions on the test data with model.predict(Xte), which gives yhat, the predicted labels. The function then calculates several performance metrics. Accuracy is the proportion of all predictions that were correct. Precision measures how many of the samples predicted as malignant were actually malignant, so it drops when the model produces false positives. Recall measures how many of the truly malignant cases were successfully detected by the model, so it drops when the model produces false negatives. The F1-score is the harmonic mean of precision and recall. The function also prints the confusion matrix and a per-class classification report.
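To make the definitions concrete, here is a minimal sketch that computes the same metrics by hand. The labels are toy values invented for illustration (1 = malignant, 0 = benign) and do not come from the dataset.

```python
# Toy labels for illustration only (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # malignant, predicted malignant
fp = sum(t == 0 and p == 1 for t, p in pairs)  # benign, predicted malignant
fn = sum(t == 1 and p == 0 for t, p in pairs)  # malignant, predicted benign
tn = sum(t == 0 and p == 0 for t, p in pairs)  # benign, predicted benign

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)                    # 3 1 1 5
print(accuracy, precision, recall, f1)   # 0.8 0.75 0.75 0.75
```

The helper in Step 4 obtains the same quantities from scikit-learn's metric functions instead of counting by hand.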

In [4]:
def evaluate(model, Xte, yte, name="model"):
    yhat = model.predict(Xte)
    acc = accuracy_score(yte, yhat)
    prec = precision_score(yte, yhat, zero_division=0)
    rec = recall_score(yte, yhat, zero_division=0)
    f1  = f1_score(yte, yhat, zero_division=0)
    cm  = confusion_matrix(yte, yhat)
    print(f"\n{name} — Test metrics")
    print(f"Accuracy : {acc:.4f}")
    print(f"Precision: {prec:.4f} (positive=malignant)")
    print(f"Recall   : {rec:.4f} (positive=malignant)")
    print(f"F1-score : {f1:.4f} (positive=malignant)")
    print("Confusion matrix [[TN, FP],[FN, TP]]:\n", cm)
    print("\nClassification report:\n", classification_report(yte, yhat, digits=4))
    return {"acc":acc,"prec":prec,"rec":rec,"f1":f1,"cm":cm}


Instead of trying hyperparameter values one at a time by hand, I used nested for loops to evaluate every combination and store the results in a list.

This code performs manual hyperparameter tuning for the decision tree classifier. It first creates a StratifiedKFold object that splits the training set into five folds with the same ratio of malignant to benign cases in every fold. It then loops over all combinations of four hyperparameters: the splitting criterion, the maximum tree depth, the minimum number of samples required to split a node, and the minimum number of samples in a leaf. For each combination, it builds a decision tree and cross-validates it, calculating the mean and standard deviation of the F1-score across the five folds. The result of each combination is stored as a dictionary in a list.

Once the loops are done, the result list is converted to a pandas DataFrame and sorted by the mean F1-score so that the best-performing combinations come first. The top ten rows of the sorted DataFrame are then displayed to show which hyperparameter settings performed best. In the next step, the best combination (the first row of the sorted DataFrame) is used to train the final decision tree on the entire training set before it is evaluated on the test set.
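To make the fold structure concrete, the arithmetic below sketches how a stratified five-fold split divides the 455 training cases (285 benign, 170 malignant). It only reproduces the bookkeeping of spreading each class as evenly as possible over the folds and is not scikit-learn's implementation; the helper name `fold_sizes` is made up for this illustration.

```python
def fold_sizes(n_samples, n_splits):
    """Spread n_samples over n_splits folds as evenly as possible."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

# Training-set class counts from the Step 3 output.
benign_per_fold = fold_sizes(285, 5)
malignant_per_fold = fold_sizes(170, 5)

for i, (b, m) in enumerate(zip(benign_per_fold, malignant_per_fold), start=1):
    # Each fold serves once as validation data; the other four are used for fitting.
    print(f"fold {i}: {b} benign + {m} malignant = {b + m} validation cases")
# Every fold: 57 benign + 34 malignant = 91 validation cases
```

Each candidate tree is therefore fitted on about 364 cases and scored on about 91, five times over.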

Step 5: Manual tuning using nested for loops

In [5]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

dt_results = []
for criterion in ["gini", "entropy"]:
    for max_depth in [3, 4, 5, 6, 8, None]:
        for min_samples_split in [2, 5, 10]:
            for min_samples_leaf in [1, 2, 4]:
                dt_temp = DecisionTreeClassifier(
                    criterion=criterion,
                    max_depth=max_depth,
                    min_samples_split=min_samples_split,
                    min_samples_leaf=min_samples_leaf,
                    random_state=42
                )
                # using cross-validated F1 on training data only
                scores = cross_val_score(dt_temp, X_train, y_train, cv=cv, scoring="f1")
                dt_results.append({
                    "criterion": criterion,
                    "max_depth": max_depth,
                    "min_samples_split": min_samples_split,
                    "min_samples_leaf": min_samples_leaf,
                    "cv_f1_mean": scores.mean(),
                    "cv_f1_std":  scores.std()
                })

dt_results_df = pd.DataFrame(dt_results).sort_values("cv_f1_mean", ascending=False)
dt_results_df.head(10)


Unnamed: 0,criterion,max_depth,min_samples_split,min_samples_leaf,cv_f1_mean,cv_f1_std
67,entropy,4.0,5,2,0.916303,0.034097
64,entropy,4.0,2,2,0.916303,0.034097
63,entropy,4.0,2,1,0.914479,0.025547
70,entropy,4.0,10,2,0.914191,0.026295
79,entropy,5.0,10,2,0.912127,0.034256
66,entropy,4.0,5,1,0.910777,0.032087
106,entropy,,10,2,0.909411,0.03493
88,entropy,6.0,10,2,0.909411,0.03493
97,entropy,8.0,10,2,0.909411,0.03493
69,entropy,4.0,10,1,0.90868,0.026739
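The row labels in the table are the positions at which each combination was appended, because sorting keeps the original index. As a sketch of the search space, the same combinations can be enumerated with `itertools.product`, which is an alternative to the nested loops and not what the cell above uses.

```python
from itertools import product

criteria = ["gini", "entropy"]
max_depths = [3, 4, 5, 6, 8, None]
min_samples_splits = [2, 5, 10]
min_samples_leaves = [1, 2, 4]

# product() varies the last list fastest, which matches the nested loop order.
grid = list(product(criteria, max_depths, min_samples_splits, min_samples_leaves))

print(len(grid))      # 108 combinations
print(len(grid) * 5)  # 540 model fits with five-fold cross-validation
print(grid[67])       # ('entropy', 4, 5, 2)
print(grid[106])      # ('entropy', None, 10, 2)
```

Position 67 is the top row of the table, and position 106 is the row whose max_depth is blank, which corresponds to `None` (no depth limit).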


Step 6: Train the chosen Decision Tree and evaluate on test set

In [6]:
#Using the first row in the sorted dataframe because it contains the best combination of hyperparameters
best_dt = dt_results_df.iloc[0]
best_dt


criterion             entropy
max_depth                 4.0
min_samples_split           5
min_samples_leaf            2
cv_f1_mean           0.916303
cv_f1_std            0.034097
Name: 67, dtype: object
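The max_depth entry is displayed as 4.0 rather than 4 because the column also holds `None` (no depth limit), which pandas stores as NaN in a floating-point column. DecisionTreeClassifier expects an integer or `None`, so the value has to be converted back before use. A minimal sketch of that conversion follows; the helper name `to_max_depth` is hypothetical and not part of the notebook.

```python
import math

def to_max_depth(value):
    """Convert a stored max_depth (float, NaN or None) to an int or None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None  # NaN stands for "no depth limit"
    return int(value)

print(to_max_depth(4.0))           # 4
print(to_max_depth(float("nan")))  # None
print(to_max_depth(None))          # None
```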

This code creates the final decision tree using the best hyperparameters found in the previous tuning step, trains it on the entire training set, and then evaluates its performance on the test set using the evaluate function.

In [7]:
dt_best_manual = DecisionTreeClassifier(
    criterion=best_dt["criterion"],
    max_depth=None if pd.isna(best_dt["max_depth"]) else int(best_dt["max_depth"]),
    min_samples_split=int(best_dt["min_samples_split"]),
    min_samples_leaf=int(best_dt["min_samples_leaf"]),
    random_state=42
)
dt_best_manual.fit(X_train, y_train)
dt_test = evaluate(dt_best_manual, X_test, y_test, name="Decision Tree (manual tuned)")



Decision Tree (manual tuned) — Test metrics
Accuracy : 0.9298
Precision: 1.0000 (positive=malignant)
Recall   : 0.8095 (positive=malignant)
F1-score : 0.8947 (positive=malignant)
Confusion matrix [[TN, FP],[FN, TP]]:
 [[72  0]
 [ 8 34]]

Classification report:
               precision    recall  f1-score   support

           0     0.9000    1.0000    0.9474        72
           1     1.0000    0.8095    0.8947        42

    accuracy                         0.9298       114
   macro avg     0.9500    0.9048    0.9211       114
weighted avg     0.9368    0.9298    0.9280       114
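For comparison, Steps 5 and 6 could also be expressed with scikit-learn's `GridSearchCV`, which runs the same kind of cross-validated search and refits the best model on the full training set. This is a sketch under stated assumptions, not the method used in this notebook: it loads scikit-learn's bundled copy of the dataset instead of wdbc.data, and its scores are not guaranteed to match the table above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for wdbc.data.
# It encodes malignant as 0, so the labels are flipped to get malignant=1.
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Same candidate values as the nested loops in Step 5.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 6, 8, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",  # F1 for the positive (malignant) class
    cv=cv,
)
search.fit(X_train, y_train)  # refits the best combination on all of X_train

print("Best parameters:", search.best_params_)
print("Best mean CV F1:", round(search.best_score_, 4))
print("Test accuracy  :", round(search.best_estimator_.score(X_test, y_test), 4))
```

The nested loops make every step explicit, which suits an assignment; `GridSearchCV` is shorter and less error-prone once the procedure is understood.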



Step 7: Manual tuning of GaussianNB

In this code, an empty list is created and the program loops over several values of the var_smoothing parameter of Gaussian Naive Bayes. For each value, a temporary model is cross-validated on the training set and its mean F1-score is recorded.

Each result is stored in the list, which is then converted into a DataFrame and sorted so that the best-performing value is at the top.

In [8]:
nb_results = []
for vs in np.logspace(-12, -7, 10):
    nb_temp = GaussianNB(var_smoothing=vs)
    scores = cross_val_score(nb_temp, X_train, y_train, cv=cv, scoring="f1")
    nb_results.append({
        "var_smoothing": vs,
        "cv_f1_mean": scores.mean(),
        "cv_f1_std":  scores.std()
    })

nb_results_df = pd.DataFrame(nb_results).sort_values("cv_f1_mean", ascending=False)
nb_results_df.head(10)


Unnamed: 0,var_smoothing,cv_f1_mean,cv_f1_std
5,5.994843e-10,0.918514,0.036146
0,1e-12,0.917039,0.034203
1,3.593814e-12,0.917039,0.034203
2,1.29155e-11,0.917039,0.034203
3,4.641589e-11,0.913809,0.037602
4,1.668101e-10,0.913724,0.035581
7,7.742637e-09,0.910204,0.036487
6,2.154435e-09,0.907261,0.04118
9,1e-07,0.901795,0.033775
8,2.782559e-08,0.896033,0.03792
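The var_smoothing parameter adds a fraction of the largest feature variance to every variance for numerical stability, so the candidates are spread on a logarithmic scale. The sketch below reproduces the ten candidate values in plain Python to show what `np.logspace(-12, -7, 10)` generates; the variable names are illustrative.

```python
# Ten exponents evenly spaced from -12 to -7, as in np.logspace(-12, -7, 10).
exponents = [-12 + 5 * i / 9 for i in range(10)]
candidates = [10 ** e for e in exponents]

for i, value in enumerate(candidates):
    print(i, f"{value:.6e}")

# Neighbouring candidates differ by a constant factor of 10**(5/9), about 3.59.
print(round(candidates[1] / candidates[0], 2))  # 3.59
```

Row label 5 in the table corresponds to the exponent -12 + 25/9, which gives the selected value of about 5.99e-10.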


Step 8: Train the chosen GaussianNB and evaluate on test set

In [9]:
# Using the first row in the sorted dataframe because it contains the best var_smoothing value
best_nb = nb_results_df.iloc[0]
best_nb


var_smoothing    5.994843e-10
cv_f1_mean       9.185135e-01
cv_f1_std        3.614565e-02
Name: 5, dtype: float64

In [10]:
nb_best_manual = GaussianNB(var_smoothing=float(best_nb["var_smoothing"]))
nb_best_manual.fit(X_train, y_train)
nb_test = evaluate(nb_best_manual, X_test, y_test, name="GaussianNB (manual tuned)")



GaussianNB (manual tuned) — Test metrics
Accuracy : 0.9386
Precision: 0.9730 (positive=malignant)
Recall   : 0.8571 (positive=malignant)
F1-score : 0.9114 (positive=malignant)
Confusion matrix [[TN, FP],[FN, TP]]:
 [[71  1]
 [ 6 36]]

Classification report:
               precision    recall  f1-score   support

           0     0.9221    0.9861    0.9530        72
           1     0.9730    0.8571    0.9114        42

    accuracy                         0.9386       114
   macro avg     0.9475    0.9216    0.9322       114
weighted avg     0.9408    0.9386    0.9377       114



Step 9: Side-by-side comparison table (test set)

In [11]:
comp = pd.DataFrame({
    "Algorithm": ["Decision Tree", "GaussianNB"],
    "Accuracy":  [dt_test["acc"], nb_test["acc"]],
    "Precision": [dt_test["prec"], nb_test["prec"]],
    "Recall":    [dt_test["rec"], nb_test["rec"]],
    "F1":        [dt_test["f1"], nb_test["f1"]],
})
comp.sort_values("F1", ascending=False).reset_index(drop=True)


Unnamed: 0,Algorithm,Accuracy,Precision,Recall,F1
0,GaussianNB,0.938596,0.972973,0.857143,0.911392
1,Decision Tree,0.929825,1.0,0.809524,0.894737


**Discussing the results**

Both classifiers (GaussianNB and the decision tree) performed well on the breast cancer dataset. Both exceeded 92% accuracy on the test set: GaussianNB reached 93.9% and the decision tree 93.0%. GaussianNB also had the higher recall (0.857 against 0.810) and F1-score (0.911 against 0.895), while the decision tree achieved a perfect precision of 1.0. A precision of 1.0 means that every tumor it classified as malignant was indeed malignant. This came together with a lower recall, however: the decision tree missed 8 of the 42 malignant test cases, while GaussianNB missed 6.

The main goal of the model is to correctly identify as many malignant cases as possible, which makes recall a crucial metric for this task. GaussianNB had the higher recall, which in turn gave it a better F1-score, reflecting a more favorable balance between precision and recall. On this test split, GaussianNB is therefore the better algorithm: it offers slightly higher overall performance and produces fewer false negatives, which matters greatly in a medical context. The margin is small, though, amounting to two malignant cases in a test set of 114.
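The figures used in this discussion can be recomputed from the two confusion matrices alone. The sketch below does so in plain Python; the helper name `summarize` is made up for this illustration, and the counts are copied from the test output of Steps 6 and 8.

```python
# Confusion-matrix counts copied from the test output of Steps 6 and 8.
results = {
    "Decision Tree": {"tn": 72, "fp": 0, "fn": 8, "tp": 34},
    "GaussianNB": {"tn": 71, "fp": 1, "fn": 6, "tp": 36},
}

def summarize(c):
    """Derive precision, recall and F1 for the malignant class from counts."""
    precision = c["tp"] / (c["tp"] + c["fp"])
    recall = c["tp"] / (c["tp"] + c["fn"])
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1, "missed": c["fn"]}

for name, counts in results.items():
    s = summarize(counts)
    print(
        f"{name:13s} precision={s['precision']:.4f} recall={s['recall']:.4f} "
        f"f1={s['f1']:.4f} missed malignant={s['missed']}"
    )
# Decision Tree precision=1.0000 recall=0.8095 f1=0.8947 missed malignant=8
# GaussianNB    precision=0.9730 recall=0.8571 f1=0.9114 missed malignant=6
```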