In [3]:
# Importing dataset from UCI ML Repository
import pandas as pd 

# Using the UCI file because load_breast_cancer in sklearn.datasets returns the label as a numeric target (0 = malignant, 1 = benign) and has no 'diagnosis' column
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
column_names = ["id", "diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean", "compactness_mean", "concavity_mean", "concave points_mean", "symmetry_mean", "fractal_dimension_mean", "radius_se", "texture_se", "perimeter_se", "area_se", "smoothness_se", "compactness_se", "concavity_se", "concave points_se", "symmetry_se", "fractal_dimension_se", "radius_worst", "texture_worst", "perimeter_worst", "area_worst", "smoothness_worst", "compactness_worst", "concavity_worst", "concave points_worst", "symmetry_worst", "fractal_dimension_worst"]

df = pd.read_csv(url, names=column_names)

df.head()
df.columns
df['diagnosis'].value_counts()

diagnosis
B    357
M    212
Name: count, dtype: int64

In [13]:
# 1. Logistic regression
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

features = ['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']
X = df[features]
y = df['diagnosis']

# Encoding the target variable (LabelEncoder sorts the classes alphabetically, so B = 0 and M = 1)
le = LabelEncoder()
y = le.fit_transform(y)
print(le.classes_)

# 80% for training 
X_train_lgr, X_test_lgr, y_train_lgr, y_test_lgr = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=10000)

# Starting the timer to measure training and prediction time
start_time = time.time()

# Training
model.fit(X_train_lgr, y_train_lgr)

y_pred_lgr = model.predict(X_test_lgr)

end_time = time.time()

tnp_time_lgr = end_time - start_time

accuracy_lgr = accuracy_score(y_test_lgr, y_pred_lgr)

print(f'Model accuracy: {accuracy_lgr}')
print(f'Logistic Regression training and prediction time: {tnp_time_lgr} seconds')

# Generating classification report
report = classification_report(y_test_lgr, y_pred_lgr)
print(report)

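# Note: y_pred_lgr holds hard 0/1 labels, so this AUC equals the mean of the two class recalls;
# passing model.predict_proba(X_test_lgr)[:, 1] would score the predicted probabilities instead
roc_auc = roc_auc_score(y_test_lgr, y_pred_lgr)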
print(f'ROC AUC score: {roc_auc}\n')

# Getting the feature importance (coefficients of unscaled features, so their magnitudes are not directly comparable)
importance = model.coef_[0]

# Summarizing feature importance
for i, j in enumerate(importance):
    print(f'Feature: {features[i]}, Score: {j}')

print()

max_positive_index = importance.argmax()
max_negative_index = importance.argmin()

print(f'Most positively influential feature on diagnosis with malignant mass: {features[max_positive_index]}')
print(f'Most negatively influential feature on diagnosis with malignant mass: {features[max_negative_index]}')

['B' 'M']
Model accuracy: 0.9298245614035088
Logistic Regression training and prediction time: 0.018082857131958008 seconds
              precision    recall  f1-score   support

           0       0.96      0.93      0.94        71
           1       0.89      0.93      0.91        43

    accuracy                           0.93       114
   macro avg       0.92      0.93      0.93       114
weighted avg       0.93      0.93      0.93       114

ROC AUC score: 0.9299050114641335

Feature: radius_mean, Score: -2.39880795219185
Feature: texture_mean, Score: 0.23377629281415488
Feature: perimeter_mean, Score: 0.5665869622072763
Feature: area_mean, Score: -0.0047376948950455
Feature: smoothness_mean, Score: 0.4228157935522469
Feature: compactness_mean, Score: 0.7167695646330422
Feature: concavity_mean, Score: 1.2669748455037906
Feature: concave points_mean, Score: 0.6788174183189047
Feature: symmetry_mean, Score: 0.5992161591717076
Feature: fractal_dimension_mean, Score: 0.122095269281132
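
The coefficients printed above come from a model fitted on unscaled features, so a feature measured in large units (such as area) gets a small coefficient regardless of how much it matters. A minimal sketch of one common alternative is to standardise the features first and then compare coefficient magnitudes. It is an illustration only: it uses `load_breast_cancer` from scikit-learn (the same measurements, taking the first ten "mean" columns) so that it runs offline, and the names `X_demo`, `y_demo` and `pipe` are assumptions that do not appear in the notebook. Its numbers will not match the output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_demo = data.data[:, :10]            # the ten "mean" features
names = list(data.feature_names[:10])
y_demo = 1 - data.target              # sklearn uses 0 = malignant; flip so malignant = 1

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fitted on the training data only, inside the pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
pipe.fit(X_tr, y_tr)

# Coefficients now refer to features with unit variance, so magnitudes are comparable
coefs = pipe.named_steps["logisticregression"].coef_[0]
for name, coef in sorted(zip(names, coefs), key=lambda t: abs(t[1]), reverse=True):
    print(f"{name:25s} {coef:+.3f}")
```

Even with scaling, strongly correlated features (radius, perimeter and area) share their effect between them, so the ranking should be read with care.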

**Model accuracy = Number of Correct Predictions / Total Number of Predictions.** (This value was calculated separately with `accuracy_score`.)
The model has a high accuracy (0.93).

**Precision = True Positives / (True Positives + False Positives)** is the ratio of true positives to the sum of true positives and false positives. It indicates how many of the positive predictions were actually correct. 
High precision for both classes indicates that the model is performing well. 
The model predicts both classes with high precision, slightly better for class 0 (0.96) than for class 1 (0.89).

**Recall = True Positives / (True Positives + False Negatives)**, also known as sensitivity or true positive rate, measures the proportion of actual positive cases that were correctly identified by the model. 
It answers the question: "Of all the actual positives, how many did the model correctly predict?" 
High recall means the model misses few actual positives; pushing recall higher usually costs precision, because more negatives get flagged as positive. 
Recall is 0.93 for both classes.

**F1-score = 2 x (Precision x Recall) / (Precision + Recall)** is the harmonic mean of precision and recall, providing a single metric that balances both. 
High F1-scores for both classes indicate a good balance between precision and recall and good model performance. 
The model has a high F1-score for both classes, with class 0 (0.94) slightly higher than class 1 (0.91).
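
To make the three formulas concrete, the sketch below recomputes them for class 1 (malignant). The counts are not printed anywhere in the notebook; they are an assumption reconstructed from the supports (71 and 43) and the rounded scores in the report above.

```python
# Counts for class 1 (M), reconstructed from the report (assumption):
# 43 actual positives, of which 40 were found and 3 were missed;
# 71 actual negatives, of which 5 were wrongly flagged as positive.
tp, fp, fn = 40, 5, 3

precision = tp / (tp + fp)   # 40 / 45
recall = tp / (tp + fn)      # 40 / 43
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.89 recall=0.93 f1=0.91
```

These values agree with the class 1 row of the report, which suggests the reconstructed counts are at least consistent with it.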

**Support** is the number of actual occurrences of the classes. It gives context to the other metrics. It is 71 for class 0 and 43 for class 1.

**Accuracy** is the ratio of correctly predicted instances to the total instances. (This is the same quantity as the model accuracy above, here as reported by the classification report.)
It provides an overall indication of how often the model is correct.
The model's overall accuracy is 0.93.

**Macro avg** is the average of the precision, recall, and F1-score across all classes, weighted equally. It provides an overall metric of the model's performance. The model has high macro avg values (0.92 for precision, 0.93 for recall and F1 score). 

**The weighted average** takes into account the support (number of true instances) for each class, providing a better overall metric for imbalanced datasets. The model also has high weighted average values (0.93 for all metrics).
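
The difference between the two averages can be sketched with the precision column of the report. The inputs below are the rounded values printed above, so the last digit of the result can differ from what `classification_report` computes from unrounded values.

```python
# Per-class precision and support taken from the Logistic Regression report
precision = {0: 0.96, 1: 0.89}
support = {0: 71, 1: 43}

# Macro average: every class counts the same
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support
total = sum(support.values())
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

print(f"macro avg:    {macro_avg:.3f}")
print(f"weighted avg: {weighted_avg:.3f}")
```

The weighted average sits closer to the class 0 value because class 0 has the larger support.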

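**ROC AUC** is a measure of the model's ability to discriminate between positive and negative classes.
It is the area under the ROC curve, which plots the true positive rate against the false positive rate across decision thresholds.
A value of 0.93 is not an accuracy: it is the probability that a randomly chosen positive case is ranked above a randomly chosen negative one (ties counting one half). Because `roc_auc_score` was given the hard predictions from `predict` rather than probabilities, the value reported here equals the mean of the two class recalls (balanced accuracy). 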
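
The pairwise reading of AUC can be checked with a small pure-Python sketch. The helper `pairwise_auc` is hypothetical (it is not a scikit-learn function), and the prediction counts are an assumption reconstructed from the report (66 of 71 benign and 40 of 43 malignant cases classified correctly).

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hard 0/1 predictions rebuilt from the report (assumed counts)
y_true = [0] * 71 + [1] * 43
y_pred = [0] * 66 + [1] * 5 + [0] * 3 + [1] * 40

# Balanced accuracy: mean of the recall of class 0 and the recall of class 1
balanced_accuracy = (66 / 71 + 40 / 43) / 2

auc_hard = pairwise_auc(y_true, y_pred)
print(f"AUC from hard labels: {auc_hard:.4f}")
print(f"balanced accuracy:    {balanced_accuracy:.4f}")
```

Both lines print the same number, which is why an AUC computed from hard labels carries no ranking information beyond the confusion matrix. Passing `model.predict_proba(X_test_lgr)[:, 1]` to `roc_auc_score` would score the ranking of the predicted probabilities instead; that value is not reported in this notebook.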

In [5]:
# 2. Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, roc_auc_score
import time
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(44)

# Creating a Decision Tree classifier
clf = DecisionTreeClassifier()

# Note: the timer starts before the split, so the measured time also includes train_test_split
start_time = time.time()

X_train_dt, X_test_dt, y_train_dt, y_test_dt = train_test_split(X, y, test_size=0.2, random_state=42)

clf.fit(X_train_dt, y_train_dt)

y_pred_dt = clf.predict(X_test_dt) # Making predictions on the test set

end_time = time.time()

tnp_time_dt = end_time - start_time
print(f'Decision Tree training and prediction time: {tnp_time_dt} seconds\n')

report = classification_report(y_test_dt, y_pred_dt)
print(report)

roc_auc_dt = roc_auc_score(y_test_dt, y_pred_dt)
print(f'ROC AUC score: {roc_auc_dt}\n')

# Feature importances
importance_dt = clf.feature_importances_

for i, j in enumerate(importance_dt):
    print(f'Feature: {features[i]}, Score: {j}')

print()

max_positive_index_dt = importance_dt.argmax()
min_positive_index_dt = importance_dt.argmin()

print(f'Most influential feature on diagnosis with malignant mass: {features[max_positive_index_dt]}')
print(f'Least influential feature on diagnosis with malignant mass: {features[min_positive_index_dt]}')

# In the decision tree, these values are not interpreted as coefficients.
# Instead, they indicate the importance of each feature in determining the model's decisions.
# The influence of features is not separated as positive or negative.
# Therefore, the least influential feature may still have an influence,
# but this influence is less than that of other features.

Decision Tree training and prediction time: 0.006905078887939453 seconds

              precision    recall  f1-score   support

           0       0.93      0.93      0.93        71
           1       0.88      0.88      0.88        43

    accuracy                           0.91       114
   macro avg       0.91      0.91      0.91       114
weighted avg       0.91      0.91      0.91       114

ROC AUC score: 0.9066491975106452

Feature: radius_mean, Score: 0.0018210283843930922
Feature: texture_mean, Score: 0.07410236185108188
Feature: perimeter_mean, Score: 0.023858157823691654
Feature: area_mean, Score: 0.05970311938155739
Feature: smoothness_mean, Score: 0.019377589823598745
Feature: compactness_mean, Score: 0.01775147928994082
Feature: concavity_mean, Score: 0.04275409598376106
Feature: concave points_mean, Score: 0.7496495603255221
Feature: symmetry_mean, Score: 0.010982607136453288
Feature: fractal_dimension_mean, Score: 0.0

Most influential feature on diagnosis with maligna

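The Decision Tree model has a shorter measured training and prediction time (0.0069 s against 0.0181 s for Logistic Regression), although on a dataset of 569 rows both are in the millisecond range and a single timing run is noisy. 
It may still be advantageous to use a Decision Tree when working with large data sets or in projects where fast prediction is required. 
Nevertheless, according to the classification report, the Logistic Regression model is equal or slightly better on every reported metric (accuracy 0.93 against 0.91). 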
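
Since the timings being compared are a few milliseconds each, a steadier measurement helps. The sketch below is one possible approach, not what the notebook does: it uses `time.perf_counter` and takes the median over several runs. The helper name `median_runtime` and the stand-in workload are hypothetical.

```python
import statistics
import time

def median_runtime(fn, repeats=20):
    """Run fn() `repeats` times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload (hypothetical); in the notebook this could be something like
# lambda: DecisionTreeClassifier().fit(X_train_dt, y_train_dt).predict(X_test_dt)
elapsed = median_runtime(lambda: sum(i * i for i in range(10_000)))
print(f"median over 20 runs: {elapsed:.6f} s")
```

The median is less sensitive than a single run to one-off delays such as a first-call import or a background process.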

In [14]:
# 3. Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
import time
from sklearn.model_selection import train_test_split

np.random.seed(555)

# Creating a Random Forest classifier
clf_rf = RandomForestClassifier()

start_time_rf = time.time()

X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(X, y, test_size=0.2, random_state=42)

clf_rf.fit(X_train_rf, y_train_rf)

y_pred_rf = clf_rf.predict(X_test_rf)

end_time_rf = time.time()

tnp_time_rf = end_time_rf - start_time_rf
print(f'Random Forest training and prediction time: {tnp_time_rf} seconds\n')

report_rf = classification_report(y_test_rf, y_pred_rf)
print(report_rf)

roc_auc_rf = roc_auc_score(y_test_rf, y_pred_rf)
print(f'Random Forest ROC AUC score: {roc_auc_rf}\n')

importance_rf = clf_rf.feature_importances_

for i, j in enumerate(importance_rf):
    print(f'Feature: {features[i]}, Score: {j}')

print()

max_importance_index_rf = importance_rf.argmax()
min_importance_index_rf = importance_rf.argmin()

print(f'Most influential feature on diagnosis with malignant mass: {features[max_importance_index_rf]}')
print(f'Least influential feature on diagnosis with malignant mass: {features[min_importance_index_rf]}')

Random Forest training and prediction time: 0.1474318504333496 seconds

              precision    recall  f1-score   support

           0       0.96      0.97      0.97        71
           1       0.95      0.93      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

Random Forest ROC AUC score: 0.9510317720275139

Feature: radius_mean, Score: 0.1277548816015485
Feature: texture_mean, Score: 0.05761965781797203
Feature: perimeter_mean, Score: 0.1419629655393132
Feature: area_mean, Score: 0.1462915048546199
Feature: smoothness_mean, Score: 0.02584276927626497
Feature: compactness_mean, Score: 0.04889031192818518
Feature: concavity_mean, Score: 0.14695627090912988
Feature: concave points_mean, Score: 0.26585529642294686
Feature: symmetry_mean, Score: 0.019716685733047416
Feature: fractal_dimension_mean, Score: 0.01910965591697205

Most influential feature on 

Although the Random Forest model has a longer training and prediction time than the other two models (0.1474 s), it reaches a higher accuracy (0.96) and equal or higher values on the other performance metrics.

In [7]:
# 4. Support Vector Machine
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report
import time
from sklearn.metrics import roc_auc_score

start_time_svm = time.time()

X_train_svm, X_test_svm, y_train_svm, y_test_svm = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the features to ensure that all features have the same scales
scaler = StandardScaler()
X_train_svm = scaler.fit_transform(X_train_svm)
X_test_svm = scaler.transform(X_test_svm)

'''
fit + transform; fit calculates the parameters (mean and standard deviation) needed to standardise the data. 
transform standardises the TRAINING DATA using these parameters.
fit_transform performs these two operations in a single step. 

transform; standardises the TEST DATA using the parameters calculated from the training data.

The model learns based on the distribution of the Training Data and 
therefore the Test Data should use the same distribution when evaluating the model.

In distance-based algorithms like KNN or SVM, if features are not on the same scale, 
features with larger scales may dominate the model, leading to suboptimal performance.
'''

# Creating an SVM model
svm_model = SVC()

svm_model.fit(X_train_svm, y_train_svm)

y_pred_svm = svm_model.predict(X_test_svm)

end_time_svm = time.time()

tnp_time_svm = end_time_svm - start_time_svm
print(f'Support Vector Machine training and prediction time: {tnp_time_svm} seconds\n')

print("\nClassification Report:")
print(classification_report(y_test_svm, y_pred_svm))

roc_auc_svm = roc_auc_score(y_test_svm, y_pred_svm)
print(f'SVM ROC AUC score: {roc_auc_svm}\n')

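# Note: SVC with its default RBF kernel exposes neither coef_ nor feature_importances_;
# the lines below re-print the Random Forest importances from the previous cell, not SVM results
importance_rf = clf_rf.feature_importances_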

for i, j in enumerate(importance_rf):
    print(f'Feature: {features[i]}, Score: {j}')

print()

max_importance_index_rf = importance_rf.argmax()
min_importance_index_rf = importance_rf.argmin()

print(f'Most influential feature on diagnosis with malignant mass: {features[max_importance_index_rf]}')
print(f'Least influential feature on diagnosis with malignant mass: {features[min_importance_index_rf]}')

Support Vector Machine training and prediction time: 0.008525609970092773 seconds


Classification Report:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        71
           1       1.00      0.93      0.96        43

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

SVM ROC AUC score: 0.9651162790697674

Feature: radius_mean, Score: 0.1277548816015485
Feature: texture_mean, Score: 0.05761965781797203
Feature: perimeter_mean, Score: 0.1419629655393132
Feature: area_mean, Score: 0.1462915048546199
Feature: smoothness_mean, Score: 0.02584276927626497
Feature: compactness_mean, Score: 0.04889031192818518
Feature: concavity_mean, Score: 0.14695627090912988
Feature: concave points_mean, Score: 0.26585529642294686
Feature: symmetry_mean, Score: 0.019716685733047416
Feature: fractal_dimension_mean, Score: 0.01910965591697205

Mos

The Support Vector Machine model combines a short training and prediction time (0.0085 s, second only to the Decision Tree) with the highest accuracy of the four models so far (0.97).
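
The feature scores printed in the SVM cell are the Random Forest importances, because an RBF-kernel `SVC` has no built-in importance attribute. A minimal sketch of one model-agnostic alternative is permutation importance, the same technique the notebook applies to KNN. It is an illustration under assumptions: it loads the data through `load_breast_cancer` so that it runs offline, wraps scaling and the model in a pipeline, and the names `X_demo`, `y_demo` and `svm_pipe` do not appear in the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_demo, y_demo = data.data[:, :10], data.target   # the ten "mean" features
names = list(data.feature_names[:10])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training data only and reuses it at prediction time
svm_pipe = make_pipeline(StandardScaler(), SVC())
svm_pipe.fit(X_tr, y_tr)

# Shuffle one feature at a time on the test set and record the drop in accuracy
result = permutation_importance(svm_pipe, X_te, y_te, n_repeats=10, random_state=42)

for i in result.importances_mean.argsort()[::-1]:
    print(f"{names[i]:25s} {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```

With correlated features such as radius, perimeter and area, permuting one of them can be partly compensated by the others, so low scores do not necessarily mean a feature is uninformative.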

In [8]:
# 5. K-Nearest Neighbors
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, roc_auc_score
import time

start_time_knn = time.time()

X_train_knn, X_test_knn, y_train_knn, y_test_knn = train_test_split(X, y, test_size=0.2, random_state=1)

# Scaling the features
scaler = StandardScaler()
X_train_knn = scaler.fit_transform(X_train_knn)
X_test_knn = scaler.transform(X_test_knn)

# Creating and training the KNN classifier
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_knn, y_train_knn)

# Making predictions on the test set
y_pred_knn = knn_model.predict(X_test_knn)

end_time_knn = time.time()

# Evaluating the model
tnp_time_knn = end_time_knn - start_time_knn
print(f'K-Nearest Neighbors training and prediction time: {tnp_time_knn} seconds\n')

print("\nClassification Report:")
print(classification_report(y_test_knn, y_pred_knn))

roc_auc_knn = roc_auc_score(y_test_knn, y_pred_knn)
print(f'KNN ROC AUC score: {roc_auc_knn}\n')

from sklearn.inspection import permutation_importance # To get the feature importance

result = permutation_importance(knn_model, X_test_knn, y_test_knn, n_repeats=10, random_state=42, n_jobs=-1)

sorted_idx = result.importances_mean.argsort()

importance_scores = {}

for i in sorted_idx:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        importance_scores[features[i]] = result.importances_mean[i]
        print(f"Feature: {features[i]}, score: {result.importances_mean[i]}")

# Sorting the dictionary by value in descending order
importance_scores_sorted = dict(sorted(importance_scores.items(), key=lambda item: item[1], reverse=True))

# Getting the most and least influential features
most_influential_feature = list(importance_scores_sorted.keys())[0]
least_influential_feature = list(importance_scores_sorted.keys())[-1]

print(f"\nMost influential feature on diagnosis: {most_influential_feature}")
print(f"Least influential feature on diagnosis: {least_influential_feature}")

K-Nearest Neighbors training and prediction time: 0.011308431625366211 seconds


Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.97      0.96        72
           1       0.95      0.90      0.93        42

    accuracy                           0.95       114
   macro avg       0.95      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114

KNN ROC AUC score: 0.9384920634920634



Feature: symmetry_mean, score: 0.021052631578947323
Feature: fractal_dimension_mean, score: 0.03508771929824557
Feature: compactness_mean, score: 0.036842105263157864
Feature: texture_mean, score: 0.04298245614035082
Feature: smoothness_mean, score: 0.047368421052631525
Feature: concavity_mean, score: 0.049122807017543804
Feature: concave points_mean, score: 0.06315789473684205

Most influential feature on diagnosis: concave points_mean
Least influential feature on diagnosis: symmetry_mean


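The K-Nearest Neighbours model is almost as fast as the Support Vector Machine model, but scores lower than the Random Forest model on accuracy (0.95 against 0.96) and on recall and F1 for the malignant class. Note that this cell splits the data with `random_state=1` while the other four models use `random_state=42`, so KNN is evaluated on a different test set (support 72/42 instead of 71/43) and the comparison is not exactly like-for-like. 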
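
The KNN output lists only seven of the ten features because the loop keeps a feature only when its mean permutation importance exceeds twice its standard deviation. The sketch below shows that filter on hypothetical numbers (the feature names and values are made up for illustration, not taken from the notebook).

```python
# Hypothetical (mean, std) pairs, not values from the notebook
importances = {
    "feature_a": (0.060, 0.010),   # 0.060 - 2 * 0.010 > 0, so it is kept
    "feature_b": (0.020, 0.015),   # 0.020 - 2 * 0.015 < 0, so it is dropped
    "feature_c": (0.000, 0.005),   # no measurable effect, so it is dropped
}

# Same rule as in the cell above: keep a feature only if mean - 2 * std > 0
kept = {name: mean for name, (mean, std) in importances.items() if mean - 2 * std > 0}
print(kept)
# prints: {'feature_a': 0.06}
```

One consequence is that the "least influential feature" reported by the cell is the weakest among the features that passed the filter, not the weakest overall.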

# Overall ranking

### Accuracy
1.  SVM (0.97)
2.  Random Forest (0.96)
3.  KNN (0.95)
4.  Logistic Regression (0.93)
5.  Decision Tree (0.91)

### Training and Prediction Time
1.  Decision Tree (0.0069 seconds)
2.  SVM (0.0085 seconds)
3.  KNN (0.0113 seconds)
4.  Logistic Regression (0.0181 seconds)
5.  Random Forest (0.1474 seconds)

### ROC AUC Scores
1.  SVM (0.9651)
2.  Random Forest (0.9510)
3.  KNN (0.9385)
4.  Logistic Regression (0.9299)
5.  Decision Tree (0.9066)

# Final results

The feature-importance results show that **the features related to the concavity of the tumour** (concave points_mean, concavity_mean) have **the greatest impact on diagnosis**: concave points_mean ranks first for the Decision Tree, Random Forest and KNN, and concavity_mean has the largest positive Logistic Regression coefficient. 

**SVM** emerges as the **best-performing algorithm.** It achieves the highest accuracy (0.97) and the highest ROC AUC score (0.9651) while keeping a competitive training and prediction time (0.0085 seconds), the second fastest of the five models.
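
The ranking above rests on a single 80/20 split of 569 rows, where one or two test cases can swap adjacent positions. A hedged sketch of how the comparison could be checked with 5-fold stratified cross-validation follows. It is not part of the original analysis: it loads the data through `load_breast_cancer` so that it runs offline, scales the inputs of Logistic Regression as well (unlike the notebook), and the names `X_demo`, `y_demo`, `models` and `cv_scores` are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_demo, y_demo = data.data[:, :10], data.target   # the ten "mean" features

# Scaling lives inside each pipeline, so it is refitted on every training fold
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Every model is scored on the same five folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name:20s} {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the spread across folds is as large as the gaps between models, the ordering of neighbouring models should be treated as tentative.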