# Notebook 3 — Decision Tree Classification

**Dataset:** Breast Cancer Wisconsin (Diagnostic)

**Purpose:** Train and evaluate a Decision Tree classifier; show confusion matrix, accuracy, feature importance, and pruning techniques.

## Setup & Load dataset
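Either download the CSV from Kaggle or use sklearn's built-in loader. The cell below tries the local CSV first and falls back to sklearn if the file is missing. Later cells expect a `target` column, so a CSV that labels the diagnosis differently needs that column added before continuing.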

In [None]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Attempt to load local CSV first; otherwise load from sklearn
CSV_PATH = 'breast_cancer_data.csv'
try:
    df = pd.read_csv(CSV_PATH)
    print('Loaded local CSV:', df.shape)
except FileNotFoundError:
    # Fall back to the copy bundled with scikit-learn
    print('Local CSV not found; loading from sklearn.datasets')
    data = load_breast_cancer()
    df = pd.DataFrame(data.data, columns=data.feature_names)
    df['target'] = data.target  # sklearn encoding: 0 = malignant, 1 = benign
    print('Loaded sklearn breast cancer dataset:', df.shape)

# Preview the data whichever source was used
display(df.head())
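
The local-CSV path only works with the rest of the notebook if the frame ends up with a `target` column. As a rough sketch, the hypothetical helper `normalize_kaggle_frame` below reshapes a Kaggle-style frame to match the sklearn layout. It assumes, without confirmation from this notebook, that the CSV carries an `id` column, a `diagnosis` column with `M`/`B` values, and possibly an empty `Unnamed: 32` column; adjust the names to whatever the actual file contains. The two-row frame is synthetic and only illustrates the transformation.

```python
import pandas as pd

def normalize_kaggle_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: make a Kaggle-style frame match the sklearn layout.

    Assumed (not confirmed by this notebook): an 'id' column, a 'diagnosis'
    column holding 'M'/'B', and possibly an empty 'Unnamed: 32' column.
    """
    out = raw.copy()
    # Drop identifier and empty columns if they are present
    out = out.drop(columns=[c for c in ('id', 'Unnamed: 32') if c in out.columns])
    if 'target' not in out.columns and 'diagnosis' in out.columns:
        # Match sklearn's encoding: 0 = malignant, 1 = benign
        out['target'] = out['diagnosis'].map({'M': 0, 'B': 1})
        out = out.drop(columns=['diagnosis'])
    return out

# Tiny synthetic frame for illustration only (not real measurements)
demo = pd.DataFrame({
    'id': [1, 2],
    'diagnosis': ['M', 'B'],
    'radius_mean': [17.9, 12.4],
    'Unnamed: 32': [float('nan'), float('nan')],
})
normalized = normalize_kaggle_frame(demo)
print(normalized)
```

In the notebook this would be applied as `df = normalize_kaggle_frame(df)` right after `pd.read_csv`, before the features and target are separated.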


## Preprocessing, Train/Test Split & Model Training

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

X = df.drop(columns=['target'])
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('\nClassification report:\n', classification_report(y_test, y_pred))

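# Plot the top of the tree (max_depth=3 limits the drawing, not the fitted model)
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))
# class_names must follow ascending label order; sklearn encodes 0 = malignant, 1 = benign
plot_tree(clf, max_depth=3, feature_names=list(X.columns), class_names=['malignant', 'benign'], filled=True)
plt.show()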
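
The cell above imports `confusion_matrix` but never calls it. A minimal, self-contained sketch of that step follows; it reloads the sklearn copy of the dataset and uses its own variable names (`tree`, `X_tr`, `cm`, and so on, all chosen here for illustration) so it does not overwrite the notebook's `clf` or splits.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# labels=[0, 1] fixes the row/column order: 0 = malignant, 1 = benign
cm = confusion_matrix(y_te, tree.predict(X_te), labels=[0, 1])

# Rows are true classes, columns are predicted classes
cm_df = pd.DataFrame(
    cm,
    index=[f'true {name}' for name in data.target_names],
    columns=[f'pred {name}' for name in data.target_names],
)
print(cm_df)
```

The diagonal holds the correctly classified samples, so the diagonal sum divided by the total equals the accuracy. The off-diagonal entry in the `true malignant` row counts malignant cases predicted as benign, which is usually the costlier error for this dataset.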


## Feature importance

In [None]:
import pandas as pd

# Impurity-based importances from the fitted tree, largest first
feat_imp = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
display(feat_imp.head(15))
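
The purpose statement lists pruning, which the cells above do not cover. One option, sketched here rather than taken from the original notebook, is minimal cost-complexity pruning through scikit-learn's `ccp_alpha` parameter. The block is self-contained and reloads the sklearn copy of the data; the variable names are illustrative. Test accuracy is printed only to show the trend, and a real choice of alpha should come from cross-validation or a separate validation split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

# Candidate alphas come from the pruning path of an unpruned tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)

results = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative float error
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_tr, y_tr)
    results.append((alpha, pruned.tree_.node_count, pruned.score(X_te, y_te)))

# Larger alpha means heavier pruning and a smaller tree
for alpha, n_nodes, acc in results:
    print(f'ccp_alpha={alpha:.5f}  nodes={n_nodes:3d}  test accuracy={acc:.3f}')
```

Pre-pruning is the simpler alternative: passing limits such as `max_depth` or `min_samples_leaf` to `DecisionTreeClassifier` stops the tree from growing in the first place instead of cutting it back afterwards.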
