# Decision Tree Model: Hyperparameter Tuning and Analysis

## Overview
This notebook performs comprehensive analysis and hyperparameter tuning on a custom Decision Tree model built from scratch.

## Objectives
1. **Hyperparameter Tuning**: Optimize `max_depth` and `min_samples_split` parameters
2. **Performance Evaluation**: Assess model using multiple metrics
3. **Model Analysis**: Understand feature importance and model behavior
4. **Overfitting Analysis**: Examine training vs validation performance

## Tasks to Complete

### C4. Hyperparameter Tuning and Analysis
- Explore `max_depth ∈ {2, 4, 6, 8, 10}` and `min_samples_split ∈ {2, 5, 10}`
- Use validation set to select best hyperparameter combination
- Retrain model on combined training+validation data with optimal parameters
- Evaluate final model on test set
- Analyze how training/validation accuracy change with `max_depth` (with fixed `min_samples_split`)
- Present results in tables and plots
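
The select, retrain, and test steps listed above could look roughly like the sketch below. It is only an illustration: scikit-learn's `DecisionTreeClassifier` stands in for the custom from-scratch model, and the split is done with `train_test_split` rather than the project's `split_data`, so the names and the `random_state` value are assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # stand-in for the custom model

X, y = load_breast_cancer(return_X_y=True)

# Assumed 70/15/15 stratified split (the notebook itself uses split_data)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

# 1. Pick the combination with the highest validation accuracy
best_params, best_val_acc = None, -1.0
for max_depth in [2, 4, 6, 8, 10]:
    for min_samples_split in [2, 5, 10]:
        model = DecisionTreeClassifier(
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            random_state=42,
        )
        model.fit(X_train, y_train)
        val_acc = accuracy_score(y_val, model.predict(X_val))
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_params = {
                "max_depth": max_depth,
                "min_samples_split": min_samples_split,
            }

# 2. Retrain on training + validation data with the selected parameters
X_trainval = np.concatenate([X_train, X_val])
y_trainval = np.concatenate([y_train, y_val])
final_model = DecisionTreeClassifier(**best_params, random_state=42)
final_model.fit(X_trainval, y_trainval)

# 3. Evaluate once on the held-out test set
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print("best params:", best_params)
print("validation accuracy:", round(best_val_acc, 4))
print("test accuracy:", round(test_acc, 4))
```

The test set is touched only once, after the hyperparameters are fixed, so the reported test accuracy is not biased by the selection step.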

### C5. Analysis and Evaluation
**Required Analysis:**
1. **Hyperparameter Tuning**: Test different `max_depth` values
2. **Performance Metrics**: Calculate Accuracy, Precision, Recall, F1-score for both classes
3. **Confusion Matrix**: Detailed analysis of classification errors
4. **Feature Importance**: Rank features by their information gain contribution  
5. **Tree Complexity**: Analyze relationship between tree depth and performance
6. **Overfitting Analysis**: Compare training vs validation performance
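
As a small illustration of the metrics listed above, the sketch below computes a confusion matrix and per-class precision, recall, and F1 with `sklearn.metrics`. The labels are made-up toy values chosen for easy checking by hand; they are not results from this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Toy labels for illustration only (not model output)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
accuracy = accuracy_score(y_true, y_pred)

# With the default average=None, each returned array has one entry per class
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1]
)

print(cm)
print("accuracy:", accuracy)
for cls in (0, 1):
    print(
        f"class {cls}: precision={precision[cls]:.3f} "
        f"recall={recall[cls]:.3f} f1={f1[cls]:.3f} support={support[cls]}"
    )
```

Reporting both classes matters here because accuracy alone can hide a model that does well on the majority class and poorly on the minority class.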

## Implementation Approach
1. **Stratified Data Split**: 70% training, 15% validation, 15% test (preserving class distribution)
2. **Grid Search**: Systematic exploration of hyperparameter combinations
3. **Visual Analysis**: Plots showing model behavior and performance trade-offs
4. **Comprehensive Metrics**: Multiple evaluation perspectives on model quality
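
One way a stratified 70/15/15 split could be implemented is sketched below with two calls to `train_test_split`. The function name `stratified_split` is hypothetical and is not the project's `split_data`, whose internals are not shown in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split


def stratified_split(X, y, val_size=0.15, test_size=0.15, random_state=42):
    """Hypothetical helper: split into train/val/test, preserving class ratios."""
    holdout = val_size + test_size
    # First carve off the combined validation + test portion
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=holdout, stratify=y, random_state=random_state
    )
    # Then divide the holdout portion between validation and test
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold,
        y_hold,
        test_size=test_size / holdout,
        stratify=y_hold,
        random_state=random_state,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, X_test, y_train, y_val, y_test = stratified_split(X, y)

# The positive-class fraction should be nearly identical in every subset
for name, labels in [("all", y), ("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name:>5}: n={len(labels):3d}  positive fraction={labels.mean():.3f}")
```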

## Expected Outcomes
- Optimal hyperparameter combination for the decision tree
- Understanding of model strengths and weaknesses
- Insights into feature importance for the classification task
- Documentation of overfitting patterns and model complexity trade-offs
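
Since feature importance is to be ranked by information gain, a minimal sketch of that quantity for a single threshold split may help. It assumes entropy as the impurity measure; the helper names are hypothetical, and the custom tree may compute or accumulate gain differently.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting on `feature <= threshold`."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that separates nothing gains nothing
    n = len(labels)
    child_entropy = (
        len(left) / n * entropy(left) + len(right) / n * entropy(right)
    )
    return entropy(labels) - child_entropy


# Toy example: a threshold of 2.5 separates the two classes perfectly
feature = np.array([1.0, 2.0, 3.0, 4.0])
labels = np.array([0, 0, 1, 1])
gain_perfect = information_gain(feature, labels, threshold=2.5)
gain_partial = information_gain(feature, labels, threshold=1.5)
print(gain_perfect, gain_partial)
```

A per-feature importance score could then be obtained by summing the gains (typically weighted by the fraction of samples reaching each node) over every split that uses that feature.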

### Fetching and Splitting Data

In [4]:
import os
import sys

# Add the project root to the Python path so that `src` can be imported
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))
if PROJECT_ROOT not in sys.path:
    sys.path.append(PROJECT_ROOT)

from sklearn import datasets
from src.utils.data_split import split_data

# Load the breast cancer dataset (binary classification)
data = datasets.load_breast_cancer()
X, y = data.data, data.target

# Split into training, validation, and test sets
X_train, X_val, X_test, y_train, y_val, y_test = split_data(X, y)

In [None]:
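from sklearn.metrics import accuracy_score

# NOTE: DecisionTreeClassifier is the custom from-scratch model. Its import is
# not shown in this notebook, so import it from its module under `src` first.

# Hyperparameter grid from task C4
max_depths = [2, 4, 6, 8, 10]
min_samples_splits = [2, 5, 10]

# Per-combination results and the best combination found so far
results = {
    'max_depth': [],
    'min_samples_split': [],
    'train_accuracy': [],
    'val_accuracy': [],
}
best_accuracy = -1.0
best_params = None

for max_depth in max_depths:
    for min_samples_split in min_samples_splits:
        # Initialize and train the model with the current parameters
        model = DecisionTreeClassifier(
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            random_state=42
        )
        model.fit(X_train, y_train)

        # Predict on the training and validation sets
        y_train_pred = model.predict(X_train)
        y_val_pred = model.predict(X_val)

        # Calculate accuracies
        train_acc = accuracy_score(y_train, y_train_pred)
        val_acc = accuracy_score(y_val, y_val_pred)

        # Store results
        results['max_depth'].append(max_depth)
        results['min_samples_split'].append(min_samples_split)
        results['train_accuracy'].append(train_acc)
        results['val_accuracy'].append(val_acc)

        # Keep the combination with the highest validation accuracy
        if val_acc > best_accuracy:
            best_accuracy = val_acc
            best_params = {'max_depth': max_depth, 'min_samples_split': min_samples_split}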
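
To examine how training and validation accuracy change with `max_depth` at a fixed `min_samples_split`, something along these lines could be used. This is a self-contained sketch: scikit-learn's `DecisionTreeClassifier` stands in for the custom model, the split uses `train_test_split` instead of `split_data`, and the fixed value `min_samples_split=2` is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # stand-in for the custom model

X, y = load_breast_cancer(return_X_y=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

FIXED_MIN_SAMPLES_SPLIT = 2  # assumed value, held constant across depths
depths = [2, 4, 6, 8, 10]
train_accs, val_accs = [], []

for depth in depths:
    model = DecisionTreeClassifier(
        max_depth=depth,
        min_samples_split=FIXED_MIN_SAMPLES_SPLIT,
        random_state=42,
    )
    model.fit(X_train, y_train)
    train_accs.append(accuracy_score(y_train, model.predict(X_train)))
    val_accs.append(accuracy_score(y_val, model.predict(X_val)))

# Table: a widening train/validation gap suggests overfitting
print(f"{'max_depth':>9} {'train_acc':>10} {'val_acc':>8} {'gap':>7}")
for depth, tr, va in zip(depths, train_accs, val_accs):
    print(f"{depth:>9} {tr:>10.4f} {va:>8.4f} {tr - va:>7.4f}")

# Plot: accuracy against tree depth for both subsets
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(depths, train_accs, marker="o", label="Training accuracy")
ax.plot(depths, val_accs, marker="s", label="Validation accuracy")
ax.set_xlabel("max_depth")
ax.set_ylabel("Accuracy")
ax.set_title(f"Accuracy vs. max_depth (min_samples_split={FIXED_MIN_SAMPLES_SPLIT})")
ax.set_xticks(depths)
ax.legend()
fig.tight_layout()
```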
