# Tumor Diagnosis

In this project, we apply data analysis, dimensionality reduction, and clustering techniques to better understand tumor characteristics and their classification.

### 01. Import libraries

In [14]:
# Data manipulation and analysis
import numpy as np
import pandas as pd
import time
import json
import math

# Modeling
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, \
    classification_report, confusion_matrix
from sklearn.tree import plot_tree

# Visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [15]:
# Set option to display all columns
pd.set_option('display.max_columns', None)

### 02. Import data

__Class distribution:__
- __diagnosis__: Target column containing the class $labels$
  - $M$ - $Malignant$ $\rightarrow$ Tending to invade normal tissue, indicating a more harmful nature.
  - $B$ - $Benign$ $\rightarrow$ Not harmful, indicating a non-invasive and less concerning form.

__Columns in the dataset:__

0. __id__ Contains unique identifiers for each record. As a unique identifier, it cannot be used for classification purposes.
1. __radius__ Mean of distances from the center to points on the perimeter of the nucleus.
2. __texture__ Standard deviation of the gray-scale values.
3. __perimeter__ Total distance around the boundary of the nucleus.
4. __area__ Total area of the nucleus.
5. __smoothness__ Local variation in radius lengths, indicating the smoothness of the boundary.
6. __compactness__ Indicating how compact the nucleus is. (perimeter^2 / area - 1.0) 
7. __concavity__ Severity of concave portions of the contour, measuring how inward the boundary is.
8. __concave points__ Number of concave portions on the contour of the nucleus.
9. __symmetry__ Measures the symmetry of the nucleus.
10. __fractal dimension__ A measure of the complexity of the nucleus boundary, defined in the dataset description as the "coastline approximation" minus 1.

In [None]:
# Locate the project root (the parent of the current working directory)
from pathlib import Path
project_root = Path.cwd().parents[0]
static_dir = project_root / 'static'

# Load the column names from the JSON file
with open(static_dir / 'column_names.json', 'r') as json_file:
    saved_column_names = json.load(json_file)

# Read the .data file into a Pandas DataFrame
df = pd.read_csv(static_dir / 'wdbc.data', header=None, names=saved_column_names)

# Display the first few rows of the DataFrame
df.head()

In [None]:
# Summarize the dataset dimensions and feature names
print(f"Data Points (Rows)   : {df.shape[0]:,}")
print(f"Features (Columns)   : {df.shape[1]}")
print(f"Feature Names        : {', '.join(df.columns)}")

### 03. Data preprocessing

In this step, the unique identifier column will be removed from the dataset, as it does not contribute to the classification task. Additionally, we will perform label encoding on the $diagnosis$ column to transform the categorical labels into numerical values, making them suitable for machine learning models.

The LabelEncoder from sklearn.preprocessing will be used to convert the labels in the diagnosis column ('B' for Benign and 'M' for Malignant) to binary values (0 and 1).
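
As a minimal sketch of the encoding step, the snippet below runs `LabelEncoder` on a handful of toy labels (made up for illustration, not rows from the dataset). The encoder numbers the distinct labels in sorted order, which is why 'B' ends up as 0 and 'M' as 1.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only (not rows from the dataset)
toy_labels = ['M', 'B', 'B', 'M', 'B']

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ holds the sorted labels; the position of each label is its code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

print(mapping)           # {'B': 0, 'M': 1}
print(encoded.tolist())  # [1, 0, 0, 1, 0]
```

Calling `encoder.inverse_transform` on the codes recovers the original letters, which is handy when reporting predictions.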

In [18]:
# The 'id' column is an arbitrary identifier with no meaningful contribution 
# to pattern analysis, correlations, or clustering, and may introduce noise into the model.
df_cleaned = df.drop(columns=['id'])

In [19]:
# Creating an instance of LabelEncoder to perform label encoding
label_encoder = LabelEncoder()

# Convert 'diagnosis' column from 'B' and 'M' to 0 and 1
df_cleaned['diagnosis'] = label_encoder.fit_transform(df_cleaned['diagnosis'])

In [None]:
# See preprocessing result
df_cleaned.head()

### 04. Correlation Analysis

The purpose of this analysis is to examine the correlation between the features in the dataset. Identifying highly correlated features is important because multicollinearity can negatively impact model performance. However, in this case, we will not remove any features, as we want to evaluate their influence on the performance of PCA (Principal Component Analysis) and clustering in later stages of the analysis.

In [21]:
# Create correlation matrix
corr_matrix = df_cleaned.corr()

In [None]:
# Create a Seaborn pairplot to see the relationship between individual features and diagnosis
g = sns.pairplot(df_cleaned, palette='coolwarm', hue='diagnosis')

# pairplot draws a figure-level legend, so relabel that legend directly
# (plt.legend would only add a legend to the last subplot)
for text, label in zip(g.legend.texts, ['Benign (0)', 'Malignant (1)']):
    text.set_text(label)

# Move the legend to the upper left corner (needs seaborn >= 0.11.2)
sns.move_legend(g, 'upper left', title='Diagnosis', fontsize=12)

# Show the plot
plt.show()

In [None]:
# Distribution of features

# Automatically get numerical columns
numerical_features = df_cleaned.select_dtypes(include=['number']).columns

# Number of rows and columns for subplots
cols = 3
rows = math.ceil(len(numerical_features) / cols)

# Create the subplot grid (a separate plt.figure call would only add an empty figure)
fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(15, 15))

# Flatten the axis array for easy access
ax = ax.flatten()

# Plot one histogram with a KDE overlay per numerical feature
for i, feature in enumerate(numerical_features):
    sns.histplot(df_cleaned[feature], color='crimson', kde=True, ax=ax[i])
    ax[i].set_title(f'Distribution: {feature}')

# Remove unused subplots if the grid has more cells than features
for j in range(len(numerical_features), rows * cols):
    fig.delaxes(ax[j])

plt.tight_layout()
plt.show()

In [None]:
# Plot the correlation matrix
plt.figure(figsize=(19, 17))
sns.heatmap(corr_matrix, xticklabels=corr_matrix.columns, yticklabels=corr_matrix.columns, annot=True, annot_kws={"size": 8}, fmt=".2f")
plt.show()

In [None]:
# Keep only the upper triangle (k=1 excludes the diagonal) so that
# self-correlations and mirrored duplicate pairs are dropped
upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper_corr = corr_matrix.where(upper_mask)

# Unstack the matrix into feature pairs, drop the masked entries, and sort
corr_pairs = upper_corr.unstack().dropna().sort_values(ascending=False)

# Show the pairs with the highest correlation
corr_pairs.head(10)

### 05. Decision Tree

A Decision Tree is a supervised learning algorithm for classification and regression that splits the dataset into subsets based on feature values. It recursively partitions the data, creating a tree-like structure where each internal node represents a feature test, each branch represents the outcome of the test, and each leaf node represents a predicted class or value.

- __Input Features:__ Decision trees work well with both numerical and categorical features. The model recursively splits the data based on feature values to create the tree structure. The most informative features (based on some splitting criterion) are used at the top of the tree.

- __Splitting Criterion:__ Decision trees use criteria such as Gini impurity or Entropy (for classification) to evaluate how well a feature splits the data. A split is made when it minimizes impurity in the child nodes.

    - Gini Impurity: Measures the likelihood of incorrect classification of a new instance.

        $Gini = 1 - \sum_{i=1}^{C} p_i^2$

        where $p_i$ is the probability of class $i$ in the current node.

    - Entropy: Measures the amount of uncertainty or disorder in the data.

        $Entropy = - \sum_{i=1}^{C} p_i \log_2(p_i)$

        where $p_i$ is the probability of class $i$ in the current node.

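- __Thresholding:__ For binary classification, the probability a tree outputs for an observation is the fraction of positive-class training samples in the leaf it falls into. A threshold (commonly 0.5) is then applied to that probability to classify the observation into one of the two categories.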
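
The two impurity measures from the splitting criterion can be computed by hand, which makes their behavior easier to see. The sketch below is illustrative only: the helper names and the node counts are hypothetical, and scikit-learn computes these values internally when it evaluates candidate splits.

```python
import math

def class_probabilities(counts):
    """Convert per-class sample counts in a node into probabilities p_i."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]

def gini(counts):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(counts))

def entropy(counts):
    """Entropy in bits: -sum(p_i * log2(p_i)); empty classes contribute 0."""
    return sum(-p * math.log2(p) for p in class_probabilities(counts))

# Hypothetical nodes: [benign count, malignant count]
nodes = {"mixed": [50, 50], "pure": [100, 0], "skewed": [90, 10]}

for name, counts in nodes.items():
    print(f"{name}: gini={gini(counts):.3f}, entropy={entropy(counts):.3f}")
```

Both measures are zero for a pure node and reach their maximum for a 50/50 node (0.5 for Gini, 1 bit for entropy), so a good split is one whose child nodes score lower than the parent.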


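__Combining Trees with Other Models:__
Decision trees split on thresholds of individual features, so they do not need scaled inputs themselves. If you plan to compare them with scale-sensitive models like Logistic Regression, SVMs, or Gradient Descent-based models, standardizing ensures that all models work on similarly scaled data.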
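
One way to standardize without leaking test-set statistics is to wrap the scaler and the model in a pipeline. The sketch below is an assumption-laden illustration rather than part of the original workflow: it uses scikit-learn's bundled `load_breast_cancer` data as a stand-in for the notebook's `wdbc.data` file, and the variable names are hypothetical. The bundled copy encodes the target the other way around (0 = malignant, 1 = benign), which does not affect the scaling logic.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled copy of the breast cancer dataset
# (an assumption for this sketch; the notebook loads its own wdbc.data file)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training split only, then applies
# the same transformation to the test split, which avoids leakage
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

test_accuracy = scaled_logreg.score(X_te, y_te)
print(f"Scaled logistic regression accuracy: {test_accuracy:.2f}")
```

The same pipeline object can be passed to `GridSearchCV`, in which case the scaler is refit inside every cross-validation fold.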

In [23]:
# Features and target
X = df_cleaned.iloc[:, 1:]  # Features (skip diagnosis)
y = df_cleaned['diagnosis']  # Target

In [24]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

The baseline establishes an initial performance level that serves as a minimum standard for comparing more complex models.

In [None]:
# Initialize the model with default parameters
dt_baseline = DecisionTreeClassifier(random_state=42)

# Train the model
dt_baseline.fit(X_train, y_train)

# Predictions
y_pred = dt_baseline.predict(X_test)
y_pred_prob = dt_baseline.predict_proba(X_test)[:, 1]  # Probabilities for ROC-AUC

# Calculate Evaluation Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_prob)

# Print Metrics
print("Baseline Decision Tree Performance:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")

In [None]:
# Visualize the tree
plt.figure(figsize=(9, 7))
plot_tree(dt_baseline, feature_names=X.columns, class_names=["Benign", "Malignant"], filled=True)
plt.show()

### 06. Evaluate the Tuned Model

In [None]:
# Define hyperparameters to tune
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy'],
    'max_features': [None, 'sqrt', 'log2']
}

# Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print("Best Hyperparameters:", best_params)

In [None]:
# Evaluate the best model
y_pred = best_model.predict(X_test)
y_pred_prob = best_model.predict_proba(X_test)[:, 1]  # Probability for the positive class

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_prob)

# Print metrics
print("\n# Evaluation Metrics for Tuned Decision Tree:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")

### 07. Compare with Other Models ("Fight of Models")

In [31]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Define models
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42, **best_params),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42)
}

# Train and evaluate models
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_prob = model.predict_proba(X_test)[:, 1]  # Get probabilities for ROC-AUC
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_prob)

    # Store results
    results[name] = {
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1,
        "ROC-AUC": roc_auc
    }

    # Print results
    print(f"{name}:")
    print(f"  Accuracy: {accuracy:.2f}")
    print(f"  Precision: {precision:.2f}")
    print(f"  Recall: {recall:.2f}")
    print(f"  F1 Score: {f1:.2f}")
    print(f"  ROC-AUC: {roc_auc:.2f}")
    print()

# Convert results to DataFrame for better visualization
results_df = pd.DataFrame(results).T

# Plot results for Accuracy, Precision, Recall, and F1 Score (ROC-AUC is printed above)
fig, axes = plt.subplots(2, 2, figsize=(14, 5))

# Plot Accuracy
axes[0, 0].bar(results_df.index, results_df['Accuracy'], color='skyblue')
axes[0, 0].set_title('Accuracy')
axes[0, 0].set_ylabel('Score')

# Plot Precision
axes[0, 1].bar(results_df.index, results_df['Precision'], color='lightcoral')
axes[0, 1].set_title('Precision')
axes[0, 1].set_ylabel('Score')

# Plot Recall
axes[1, 0].bar(results_df.index, results_df['Recall'], color='lightgreen')
axes[1, 0].set_title('Recall')
axes[1, 0].set_ylabel('Score')

# Plot F1 Score
axes[1, 1].bar(results_df.index, results_df['F1 Score'], color='orange')
axes[1, 1].set_title('F1 Score')
axes[1, 1].set_ylabel('Score')

# Adjust layout
plt.tight_layout()
plt.show()

### 08. Model evaluation and interpretation

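In this step, we evaluate the performance of the logistic regression model, the last model fitted in the comparison loop, by examining its ability to classify diagnoses accurately.

In [None]:
# Evaluation of the logistic regression model
# Recompute its predictions explicitly rather than relying on the
# variables left over from the last iteration of the comparison loop
final_model = models["Logistic Regression"]
y_pred = final_model.predict(X_test)
y_pred_prob = final_model.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_prob)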

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")

In [None]:
# Plotting the confusion matrix as a heatmap
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues', xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'], cbar=False)
plt.title('Confusion Matrix: Logistic Regression vs Diagnosis', fontsize=14)
plt.xlabel('Predicted Diagnosis', fontsize=12)
plt.ylabel('True Diagnosis', fontsize=12)
plt.show()

In [None]:
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

In [None]:
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"Logistic Regression (AUC = {roc_auc:.2f})", color="blue")
plt.plot([0, 1], [0, 1], color="gray", linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

__Metrics__
- Accuracy (0.97): The model correctly classifies 97% of the instances. This is a strong indication of its overall reliability in distinguishing between malignant and benign cases.

- Precision (0.98): Out of all the cases the model predicted as malignant, 98% were indeed malignant. This metric highlights the model's ability to avoid false positives (incorrectly predicting benign cases as malignant).

- Recall (0.95): The model successfully identified 95% of the actual malignant cases, which shows its capability to minimize false negatives (failing to detect malignant cases).

- F1 Score (0.96): The F1 score balances precision and recall, providing a harmonic mean. A score of 0.96 confirms the model's robustness in handling both false positives and false negatives.

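- ROC-AUC (1.00): A score that rounds to 1.00 means the predicted probabilities rank malignant cases above benign ones almost without exception. Since the model still misclassifies some cases at the default 0.5 threshold (accuracy 0.97), this reflects ranking quality across all thresholds rather than error-free predictions.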
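
To make the relationship between these metrics and the confusion matrix concrete, the sketch below derives accuracy, precision, recall, and F1 from the four cell counts. The function name and the counts are hypothetical and are not the notebook's results. Malignant is treated as the positive class, matching the label encoding used earlier.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four threshold-based metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of predicted malignant, how many are malignant
    recall = tp / (tp + fn)     # of actual malignant, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, not the notebook's results
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, the 5 false negatives lower recall and the 10 false positives lower precision, while F1 sits between the two as their harmonic mean.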


__Conclusion__

In summary, __false positives (FP)__ and __false negatives (FN)__ are critical in cancer diagnosis because they have direct implications on patient health and treatment. Managing these cases requires thorough review, additional testing, continuous monitoring, and refining the models to minimize these errors. Collaborating with medical specialists and integrating advanced diagnostic technologies are essential to improving accuracy and reducing the risks associated with diagnostic errors.

The model’s performance underscores the effectiveness of logistic regression in this context, demonstrating its potential as a reliable tool for aiding in cancer diagnosis. However, validation with domain experts and real-world data is essential to ensure clinical applicability.

### End