# Model Comparison Assignment

In this notebook, you will:
1. **Load and preprocess** one of the provided datasets (e.g., `breast-cancer-wisconsin.data`).
2. **Split** the data into training and testing sets.
3. **Train** and **evaluate** a **Decision Tree** classifier side by side with a **Logistic Regression** model.
4. **Compare** their performance using accuracy, the classification report, the confusion matrix, and ROC AUC.
5. **Extend** your analysis by building a **Random Forest** classifier on the same data and comparing its results to the previous models.


## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, confusion_matrix


## 2. Load and Preprocess Data

Load the **Breast Cancer Wisconsin** dataset, drop rows with missing values (recorded as `?` in the raw file), and encode the target variable as 0 (benign) or 1 (malignant).

In [None]:
# Column names from the UCI repository
columns = [
    'ID', 'ClumpThickness', 'UniformityCellSize', 'UniformityCellShape',
    'MarginalAdhesion', 'SingleEpithelialCellSize', 'BareNuclei',
    'BlandChromatin', 'NormalNucleoli', 'Mitoses', 'Class'
]
# Read the data, treating '?' as missing
df = pd.read_csv('breast-cancer-wisconsin.data', names=columns, na_values='?')
# Drop rows with missing values
df.dropna(inplace=True)
# Features and target
X = df.drop(['ID', 'Class'], axis=1)
# Map Class: 2 -> benign (0), 4 -> malignant (1)
y = df['Class'].map({2: 0, 4: 1})
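
Before modelling, it can help to confirm that the `?` placeholders were actually parsed as missing values and to see how many rows are lost by dropping them. The sketch below is one way to check this. It uses a small in-memory sample with made-up rows (not real records from the dataset) so that it runs without the data file; the names `sample`, `sample_df`, and `clean_df` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same column layout; values are illustrative only
sample = io.StringIO(
    "1001,5,1,1,1,2,1,3,1,1,2\n"
    "1002,8,10,10,8,7,?,9,7,1,4\n"
    "1003,3,1,1,1,2,2,3,1,1,2\n"
)
sample_columns = [
    'ID', 'ClumpThickness', 'UniformityCellSize', 'UniformityCellShape',
    'MarginalAdhesion', 'SingleEpithelialCellSize', 'BareNuclei',
    'BlandChromatin', 'NormalNucleoli', 'Mitoses', 'Class'
]
sample_df = pd.read_csv(sample, names=sample_columns, na_values='?')

# Count missing values per column before dropping anything
missing_per_column = sample_df.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Drop incomplete rows, then check the class balance that remains
clean_df = sample_df.dropna()
class_counts = clean_df['Class'].map({2: 0, 4: 1}).value_counts()
print(class_counts)
```

Running the same two checks on `df` right after `pd.read_csv` shows which columns contain missing entries and whether dropping them shifts the class balance.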


## 3. Train-Test Split

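Split the data into training and test sets. The cell below holds out 30% of the rows for testing and passes `stratify=y` so that both sets keep the same benign/malignant ratio.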

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
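
To see what `stratify` does, the following sketch compares the positive-class rate before and after a stratified split. The labels are synthetic, with an assumed 65/35 class ratio chosen only for illustration, so the numbers are not those of the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumed 65% / 35%), standing in for the real target
labels = np.array([0] * 65 + [1] * 35)
features = np.arange(len(labels)).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

# With stratification, the three rates should be close to one another
print("Positive rate overall:", labels.mean())
print("Positive rate in train:", l_train.mean())
print("Positive rate in test:", l_test.mean())
```

The same comparison on the notebook's data is `y.mean()`, `y_train.mean()`, and `y_test.mean()`.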


## 4. Decision Tree vs. Logistic Regression

Train both a Decision Tree classifier and a Logistic Regression model, then compare their performance.

In [None]:
# Initialize models
dt = DecisionTreeClassifier(random_state=42)
lr = LogisticRegression(max_iter=1000, solver='liblinear')

# Train
dt.fit(X_train, y_train)
lr.fit(X_train, y_train)

# Predictions
y_pred_dt = dt.predict(X_test)
y_pred_lr = lr.predict(X_test)

# Predicted probability of the positive class (malignant), used for ROC AUC
y_prob_dt = dt.predict_proba(X_test)[:, 1]
y_prob_lr = lr.predict_proba(X_test)[:, 1]

# Evaluate
for name, y_pred, y_prob in [
    ('Decision Tree', y_pred_dt, y_prob_dt),
    ('Logistic Regression', y_pred_lr, y_prob_lr)
]:
    print(f"=== {name} ===")
    print("Accuracy:\t", accuracy_score(y_test, y_pred))
    print("ROC AUC:\t", roc_auc_score(y_test, y_prob))
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print()


### Discussion

- Which model achieved higher accuracy? Higher AUC?
- Examine the confusion matrices: does each model produce more false positives (benign predicted as malignant) or more false negatives (malignant predicted as benign)?
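
One way to read a binary confusion matrix is to unpack it into its four cells and derive sensitivity and specificity from them. The sketch below does this on toy labels and predictions that are invented for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predictions for illustration (not results from this notebook)
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels {0, 1}, rows are true classes and columns are predicted classes,
# so ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

sensitivity = tp / (tp + fn)  # share of positive (malignant) cases that were caught
specificity = tn / (tn + fp)  # share of negative (benign) cases correctly cleared
print(f"FP={fp}, FN={fn}, sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Applying the same unpacking to `confusion_matrix(y_test, y_pred_dt)` and `confusion_matrix(y_test, y_pred_lr)` makes the two models' error types directly comparable.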

## 5. Random Forest Classifier

Build a Random Forest on the same data and compare its performance with the Decision Tree and Logistic Regression.

In [None]:
# Initialize Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Train
rf.fit(X_train, y_train)
# Predict
y_pred_rf = rf.predict(X_test)
y_prob_rf = rf.predict_proba(X_test)[:, 1]
# Evaluate
print("=== Random Forest ===")
print("Accuracy:\t", accuracy_score(y_test, y_pred_rf))
print("ROC AUC:\t", roc_auc_score(y_test, y_prob_rf))
print(classification_report(y_test, y_pred_rf))
print(confusion_matrix(y_test, y_pred_rf))


### Final Comparison

- Summarize the performance of all three models.
- Discuss trade-offs between interpretability (Decision Tree), simplicity (Logistic Regression), and ensemble power (Random Forest).
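
A single table can make the summary easier to write. The sketch below collects accuracy and ROC AUC for the three model types into a pandas DataFrame. So that it runs on its own, it trains on synthetic data from `make_classification`, which is an assumption for illustration; the names `demo_models` and `summary` are hypothetical, and the scores it prints say nothing about the breast cancer data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch is self-contained (not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

demo_models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000, solver='liblinear'),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and record one row of metrics per model
rows = []
for model_name, model in demo_models.items():
    model.fit(Xd_train, yd_train)
    rows.append({
        'Model': model_name,
        'Accuracy': accuracy_score(yd_test, model.predict(Xd_test)),
        'ROC AUC': roc_auc_score(yd_test, model.predict_proba(Xd_test)[:, 1]),
    })

summary = pd.DataFrame(rows).set_index('Model').sort_values('ROC AUC', ascending=False)
print(summary.round(3))
```

In the notebook itself, the already fitted `dt`, `lr`, and `rf` can take the place of the freshly constructed models, with `X_test` and `y_test` as the evaluation data.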

---

**Extension:**
- Try tuning hyperparameters for the Decision Tree and Random Forest.
- Apply the same workflow to another dataset (e.g., `heart-disease.data` or `adult.data`).
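
As a starting point for the tuning extension, the sketch below runs a small cross-validated grid search over two Decision Tree hyperparameters with `GridSearchCV`. The grid values are arbitrary choices for illustration, and the data is again synthetic (an assumption so the example runs without the data file), so the selected parameters should not be carried over to the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the real dataset)
X_tune, y_tune = make_classification(n_samples=300, n_features=9, random_state=42)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_tune, y_tune, test_size=0.3, random_state=42, stratify=y_tune
)

# Illustrative grid: tree depth and minimum leaf size both limit overfitting
param_grid = {
    'max_depth': [2, 4, 6, None],
    'min_samples_leaf': [1, 5, 10],
}

# 5-fold cross-validation on the training set only, scored by ROC AUC
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring='roc_auc',
    cv=5,
)
search.fit(Xt_train, yt_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated ROC AUC:", search.best_score_)

# The held-out test set is scored once, after the search has finished
test_auc = search.score(Xt_test, yt_test)
print("Test ROC AUC:", test_auc)
```

The same pattern applies to the Random Forest by swapping in `RandomForestClassifier` and adding parameters such as `n_estimators` to the grid. Keeping the test set out of the search avoids tuning against the data used for the final comparison.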
