# Brain Tumor Classification - Exploratory Data Analysis (EDA)
**Project Goal**: Analyze an MRI scan dataset and evaluate the performance of a ViT model for tumor classification  
**Key Findings**:
- Class distribution shows [balanced/imbalanced] data
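- Model reaches 97.1% validation accuracy after epoch 3 (peak: 98.64% at epoch 2); most remaining errors are glioma scans predicted as meningioma
- Pituitary is the only class predicted perfectly in every epoch (20/20); "no tumor" is also predicted reliably after epoch 1 (at most 4 errors out of 186)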

## 1. Dataset Overview

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Load dataset metadata
data_dir = Path("data/brain_tumor_mri")
classes = ["glioma", "meningioma", "pituitary", "notumor"]
counts = []

for class_name in classes:
    class_dir = data_dir / "Training" / class_name
    # Count the JPEG files in this class folder
    counts.append(len(list(class_dir.glob("*.jpg"))))

df_dist = pd.DataFrame({"Class": classes, "Count": counts})

### Class Distribution

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(data=df_dist, x="Class", y="Count", palette="Blues_d")
plt.title("Training Set Class Distribution")
plt.savefig("class_distribution.png", bbox_inches="tight")
plt.show()

**Observation**: 
- The dataset appears [balanced/imbalanced] with [describe distribution]
- [Add any sampling bias notes]
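
To make the balance call above less subjective, the counts can be reduced to per-class shares and a max/min ratio. The following is a minimal sketch: the helper name `summarize_balance` is hypothetical, and the example counts are the per-class validation supports (row sums of the confusion matrices in Section 3), used here only as a stand-in because the training counts are not recorded in this notebook.

```python
def summarize_balance(class_counts):
    """Return per-class share (%) and the largest/smallest class ratio."""
    total = sum(class_counts.values())
    shares = {name: 100 * n / total for name, n in class_counts.items()}
    ratio = max(class_counts.values()) / min(class_counts.values())
    return shares, ratio

# Stand-in counts (assumption): validation supports, NOT the training counts.
example_counts = {"glioma": 232, "meningioma": 149, "pituitary": 20, "notumor": 186}

shares, imbalance_ratio = summarize_balance(example_counts)
for name, share in shares.items():
    print(f"{name:<11} {share:5.1f}%")
print(f"max/min ratio: {imbalance_ratio:.1f}")
```

With the real training data, `dict(zip(df_dist["Class"], df_dist["Count"]))` could be passed in instead of `example_counts`.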

## 2. Sample Images

In [None]:
from PIL import Image
import numpy as np

plt.figure(figsize=(15, 8))
for i, class_name in enumerate(classes):
    sample_img = next((data_dir / "Training" / class_name).glob("*.jpg"))
    img = Image.open(sample_img)
    
    plt.subplot(1, 4, i+1)
    plt.imshow(img, cmap="gray")
    plt.title(f"{class_name}\n{img.size}")
    plt.axis("off")
    
plt.suptitle("Sample MRI Scans from Each Class", y=0.75)
plt.savefig("sample_images.png", bbox_inches="tight")
plt.show()

**Observation**:
- Key visual differences between classes: [describe]
- Preprocessing challenges: [note any artifacts/noise]

## 3. Model Performance Analysis

### Training Metrics

In [None]:
epochs = [1, 2, 3]
train_acc = [90.49, 98.49, 99.04]
val_acc = [96.93, 98.64, 97.10]

plt.figure(figsize=(10, 5))
plt.plot(epochs, train_acc, label="Train Accuracy", marker="o")
plt.plot(epochs, val_acc, label="Validation Accuracy", marker="o")
plt.xlabel("Epoch")
plt.ylabel("Accuracy (%)")
plt.title("Training vs Validation Accuracy")
plt.legend()
plt.grid()
plt.savefig("accuracy_curve.png", bbox_inches="tight")
plt.show()

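**Observation**:
- Model converges quickly: validation accuracy is already 96.93% after epoch 1
- Signs of mild overfitting in epoch 3: training accuracy rises to 99.04% while validation accuracy drops from 98.64% to 97.10%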
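
The overfitting signal can be read directly from the train/validation gap per epoch. A small sketch using the accuracy values from the cell above (the variable names `gaps` and `best_epoch` are introduced here for illustration):

```python
epochs = [1, 2, 3]
train_acc = [90.49, 98.49, 99.04]
val_acc = [96.93, 98.64, 97.10]

# Positive gap = training accuracy exceeds validation accuracy
gaps = [round(t - v, 2) for t, v in zip(train_acc, val_acc)]
best_epoch = epochs[val_acc.index(max(val_acc))]

for epoch, gap in zip(epochs, gaps):
    print(f"epoch {epoch}: train - val = {gap:+.2f} pp")
print(f"best validation accuracy at epoch {best_epoch}")
```

The gap only turns positive in epoch 3, which suggests the epoch 2 checkpoint may be the better candidate if checkpoints were saved per epoch.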

### Confusion Matrix Evolution

In [None]:
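# Validation confusion matrices, one per epoch.
# Rows = true class, columns = predicted class, in the order of `classes`
# (glioma, meningioma, pituitary, notumor).
confusion_matrices = [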
    # Epoch 1
    np.array([
        [228,   4,   0,   0],
        [  4, 144,   0,   1],
        [  0,   0,  20,   0],
        [  1,   8,   0, 177]
    ]),
    # Epoch 2
    np.array([
        [227,   5,   0,   0],
        [  0, 148,   0,   1],
        [  0,   0,  20,   0],
        [  0,   2,   0, 184]
    ]),
    # Epoch 3
    np.array([
        [220,  11,   0,   1],
        [  0, 148,   0,   1],
        [  0,   0,  20,   0],
        [  0,   4,   0, 182]
    ])
]
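
As a sanity check, overall accuracy can be recomputed from each matrix as the diagonal sum divided by the total count, and compared with the `val_acc` values plotted earlier. A minimal sketch (the name `recomputed_acc` is introduced here):

```python
import numpy as np

val_acc = [96.93, 98.64, 97.10]
confusion_matrices = [
    np.array([[228, 4, 0, 0], [4, 144, 0, 1], [0, 0, 20, 0], [1, 8, 0, 177]]),
    np.array([[227, 5, 0, 0], [0, 148, 0, 1], [0, 0, 20, 0], [0, 2, 0, 184]]),
    np.array([[220, 11, 0, 1], [0, 148, 0, 1], [0, 0, 20, 0], [0, 4, 0, 182]]),
]

# Accuracy = correctly classified (diagonal) / all samples
recomputed_acc = [
    round(float(100 * np.trace(cm) / cm.sum()), 2) for cm in confusion_matrices
]

for epoch, (reported, recomputed) in enumerate(zip(val_acc, recomputed_acc), start=1):
    print(f"epoch {epoch}: reported {reported:.2f}%, recomputed {recomputed:.2f}%")
```

The recomputed values agree with the reported validation accuracies, so the matrices and the accuracy curve describe the same 587-sample validation set.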

#### Most Common Misclassifications

In [None]:
misclass = pd.DataFrame({
    "Epoch": [1, 2, 3],
    "Glioma→Meningioma": [4, 5, 11],
    "Meningioma→Glioma": [4, 0, 0],
    "NoTumor→Meningioma": [8, 2, 4]
})

# Let pandas create the figure; a separate plt.figure() call would leave an empty figure behind
misclass.set_index("Epoch").plot(kind="bar", stacked=True, figsize=(10, 5))
plt.title("Evolution of Key Misclassifications")
plt.ylabel("Count")
plt.savefig("misclass_evolution.png", bbox_inches="tight")
plt.show()

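**Key Findings**:
1. **Glioma vs Meningioma** confusion more than doubles in Epoch 3 (5 → 11 glioma scans predicted as meningioma)
2. **No Tumor** class is consistently well-predicted after Epoch 1 (9 → 2 → 4 errors out of 186 scans)
3. **Pituitary** tumors are always correctly classified (perfect recall), though the validation set contains only 20 pituitary scans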
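
These findings can be backed by per-class precision and recall computed from the Epoch 3 matrix. The sketch below assumes rows are true classes and columns are predicted classes, which is consistent with the misclassification counts tabulated above.

```python
import numpy as np

classes = ["glioma", "meningioma", "pituitary", "notumor"]
cm_epoch3 = np.array([
    [220,  11,   0,   1],
    [  0, 148,   0,   1],
    [  0,   0,  20,   0],
    [  0,   4,   0, 182],
])

correct = np.diag(cm_epoch3)
recall = correct / cm_epoch3.sum(axis=1)     # per true class (row sums)
precision = correct / cm_epoch3.sum(axis=0)  # per predicted class (column sums)

for name, p, r in zip(classes, precision, recall):
    print(f"{name:<11} precision={p:.3f}  recall={r:.3f}")
```

Glioma has the lowest recall and meningioma the lowest precision, which is the same glioma-to-meningioma confusion seen from both sides.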

## 4. Error Analysis

In [None]:
error_cases = {
    "Type": ["Glioma→Meningioma", "Meningioma→NoTumor", "NoTumor→Meningioma"],
    "Count": [11, 1, 4],  # From Epoch 3
    "Possible Reasons": [
        "Similar texture in MRI slices",
        "Small tumor size resembling healthy tissue",
        "Boundary ambiguity in scans"
    ]
}

pd.DataFrame(error_cases)
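
Rather than typing the error counts by hand, the off-diagonal cells of the Epoch 3 matrix can be listed and ranked programmatically. A minimal sketch, again assuming rows are true classes and columns are predictions (the name `errors` is introduced here):

```python
import numpy as np

classes = ["glioma", "meningioma", "pituitary", "notumor"]
cm_epoch3 = np.array([
    [220,  11,   0,   1],
    [  0, 148,   0,   1],
    [  0,   0,  20,   0],
    [  0,   4,   0, 182],
])

# Collect every non-zero off-diagonal cell as (true→predicted, count)
errors = [
    (f"{classes[i]}→{classes[j]}", int(cm_epoch3[i, j]))
    for i in range(len(classes))
    for j in range(len(classes))
    if i != j and cm_epoch3[i, j] > 0
]
errors.sort(key=lambda pair: pair[1], reverse=True)

for label, count in errors:
    print(f"{label:<22} {count}")
print(f"total errors: {sum(count for _, count in errors)}")
```

This listing also surfaces the single glioma scan predicted as "no tumor", which does not appear in the hand-built table.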

## 5. Recommendations

1. **Data-Level**:
   - Augment glioma/meningioma samples with rotation/flips
   - Review misclassified scans for labeling quality
2. **Model-Level**:
   - Add attention visualization to understand model focus areas
   - Experiment with class weights for minority classes
3. **Deployment**:
   - Apply the highest confidence threshold to pituitary predictions
   - Route glioma/meningioma predictions to human review
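
The deployment recommendation could be prototyped as a simple routing rule on the model's class probabilities. This is only a sketch under stated assumptions: the function `route_prediction`, the `REVIEW_CLASSES` set, and the 0.9 default threshold are hypothetical and would need tuning on held-out data, and a single threshold is used for all classes for simplicity.

```python
classes = ["glioma", "meningioma", "pituitary", "notumor"]

# Classes the analysis found to be confused with each other (assumption:
# every prediction of these classes goes to a human reviewer)
REVIEW_CLASSES = {"glioma", "meningioma"}

def route_prediction(probs, class_names, min_confidence=0.9):
    """Return (label, confidence, needs_review) for one probability vector."""
    top = max(range(len(probs)), key=lambda i: probs[i])
    label, confidence = class_names[top], probs[top]
    needs_review = label in REVIEW_CLASSES or confidence < min_confidence
    return label, confidence, needs_review

# Hypothetical probability vectors, for illustration only
print(route_prediction([0.05, 0.02, 0.92, 0.01], classes))  # confident pituitary
print(route_prediction([0.97, 0.01, 0.01, 0.01], classes))  # glioma: always reviewed
print(route_prediction([0.10, 0.10, 0.20, 0.60], classes))  # low confidence
```

Per-class thresholds could replace the single `min_confidence` value if some classes warrant stricter handling.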

## Next Steps

- [ ] Add SHAP explainability
- [ ] Create Gradio demo interface
- [ ] Compare with ResNet baseline