# Decision Trees

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.datasets import load_iris, load_breast_cancer, load_diabetes, make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')


In [None]:
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")



Decision trees are hierarchical models that predict by routing each sample through a sequence of feature tests. They are used for:
        • Classification: Predicting categorical outcomes
        • Regression: Predicting continuous values
        
        Structure:
        • Root Node: Starting point with entire dataset
        • Internal Nodes: Decision points with splitting criteria
        • Leaf Nodes: Final predictions/outcomes
        • Branches: Paths connecting nodes based on conditions

        Algorithm Steps:
        • Start with root node containing all data
        • Find best feature and threshold to split data
        • Create child nodes based on split
        • Repeat recursively until stopping criteria met
        • Assign predictions to leaf nodes
        
        Key Concepts:
        • Splitting Criteria: How to choose best split
        • Impurity Measures: Quantify node homogeneity
        • Pruning: Prevent overfitting
        • Tree Depth: Control complexity
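
To make the structure terms above concrete, here is a minimal sketch that fits a deliberately shallow tree and counts its nodes. The choice of the iris dataset, `max_depth=2`, and `random_state=0` are assumptions made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed structure readable (illustrative setting).
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

n_nodes = clf.tree_.node_count      # root + internal nodes + leaves
n_leaves = clf.get_n_leaves()       # nodes that hold final predictions
n_internal = n_nodes - n_leaves     # decision points, root included
depth = clf.get_depth()             # longest root-to-leaf path

# Each indented line is a branch condition; "class:" lines are leaf nodes.
print(export_text(clf, feature_names=list(iris.feature_names)))
print(f"nodes={n_nodes}, internal={n_internal}, leaves={n_leaves}, depth={depth}")
```

Because every internal node has exactly two children in scikit-learn's trees, the total node count is always `2 * n_leaves - 1`.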

Classification Impurity Measures:
        
        a) Gini Impurity:
           Gini = 1 - Σ(p_i²) where p_i is the proportion of samples in the node belonging to class i
           • Range: [0, 1-1/k] where k is number of classes
           • 0 = pure node (all samples same class)
           • Higher values = more impure
        
        b) Entropy:
           Entropy = -Σ(p_i * log2(p_i))
           • Range: [0, log2(k)]
           • 0 = pure node
           • log2(k) = maximum impurity
        
        c) Information Gain:
           IG = Parent_Impurity - Σ(|S_v|/|S| * Child_v_Impurity)
           • S is the set of samples in the parent node, S_v the subset sent to child v
           • Measures reduction in impurity after split
           • Higher IG = better split
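
The three classification formulas above can be sketched in a few lines of plain Python. The function names and the toy labels are hypothetical, chosen only to mirror the formulas; this is not how any particular library implements them.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Proportion p_i of each class present in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    # Gini = 1 - sum(p_i^2)
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)); absent classes never appear in the
    # Counter, so log2(0) is never evaluated.
    return sum(-p * math.log2(p) for p in class_probabilities(labels))

def information_gain(parent, children, impurity=entropy):
    # IG = parent impurity - size-weighted average of child impurities
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

parent = ["a"] * 5 + ["b"] * 5
left = ["a"] * 4 + ["b"]
right = ["a"] + ["b"] * 4

print(gini(parent))     # 0.5, the maximum 1 - 1/k for k = 2
print(entropy(parent))  # 1.0, the maximum log2(k) for k = 2
print(information_gain(parent, [left, right]))
```

Passing `impurity=gini` to `information_gain` gives the Gini-based impurity decrease instead of the entropy-based one.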
        
Regression Impurity Measures:
        
        a) Mean Squared Error (MSE):
           MSE = (1/n) * Σ(y_i - y_mean)²
           • Measures variance within node
        
        b) Mean Absolute Error (MAE):
           MAE = (1/n) * Σ|y_i - y_median|
           • Less sensitive to outliers than MSE
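
As a rough illustration of the two regression measures, the sketch below computes both for a node's target values and then repeats the calculation with one outlier added. The helper names and the toy values are assumptions made for this example.

```python
from statistics import mean, median

def node_mse(y):
    """MSE around the node mean, i.e. the variance of the targets in the node."""
    y_mean = mean(y)
    return sum((v - y_mean) ** 2 for v in y) / len(y)

def node_mae(y):
    """MAE around the node median."""
    y_median = median(y)
    return sum(abs(v - y_median) for v in y) / len(y)

clean = [10.0, 11.0, 12.0, 13.0, 14.0]
with_outlier = [10.0, 11.0, 12.0, 13.0, 100.0]  # last value replaced by an outlier

print(node_mse(clean), node_mae(clean))  # 2.0 1.2
print(node_mse(with_outlier), node_mae(with_outlier))
```

On this toy data the single outlier inflates the MSE far more than the MAE, which is the sense in which MAE is less sensitive to outliers.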

Best Split Selection:
        
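        1. For each feature:
           - Sort unique values
           - Consider each candidate threshold (typically the midpoint between consecutive sorted values)
           - Calculate the weighted impurity of the two resulting child nodes
        
        2. Choose the split that maximizes:
           - Information Gain, i.e. impurity decrease (classification)
           - Variance reduction, i.e. MSE decrease (regression)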
        
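        3. Stopping Criteria (stop splitting a node when any holds):
           - Maximum tree depth reached
           - Node contains fewer than the minimum number of samples
           - Best split decreases impurity by less than the minimum required
           - All samples in node belong to same class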
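
The search over thresholds can be sketched for a single numeric feature as follows. This is a simplified, unoptimized illustration rather than scikit-learn's implementation: it takes midpoints between consecutive sorted values as candidate thresholds, uses Gini impurity, and the names `best_split` and `min_samples_leaf` (as used here) are choices made for this example.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels, min_samples_leaf=1):
    """Return (threshold, impurity_decrease) for one feature, or (None, 0.0)."""
    n = len(values)
    parent_impurity = gini(labels)
    best_threshold, best_gain = None, 0.0
    unique_values = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive unique values.
    for lo, hi in zip(unique_values, unique_values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Stopping rule: skip splits that would create too-small children.
        if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent_impurity - weighted
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y = ["a", "a", "a", "b", "b", "b"]
threshold, gain = best_split(x, y)
print(threshold, gain)  # 4.5 0.5
```

A full tree builder would run this search over every feature, keep the feature with the largest gain, and recurse on the two child subsets until a stopping criterion applies. A pure node returns `(None, 0.0)` here because no split can decrease its impurity.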

Disadvantages of Decision Trees:

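        ✗ Prone to overfitting (high variance): deep trees can memorize noise in the training data
        ✗ Unstable (small changes in data → different trees)
        ✗ Can be biased toward the majority class if classes are imbalanced
        ✗ May not generalize well to unseen data unless depth is limited or the tree is pruned
        ✗ Limited to axis-parallel splits, so diagonal decision boundaries need many splits
        ✗ Can be computationally expensive for large datasets, since each split evaluates many candidate thresholds per feature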
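
The overfitting point can be checked empirically with a small experiment. The sketch below assumes the breast cancer dataset, a 70/30 stratified split, and `random_state=42`; exact scores will vary with those choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# No depth or leaf-size limits: the tree grows until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = full_tree.score(X_train, y_train)
test_acc = full_tree.score(X_test, y_test)

print(f"depth={full_tree.get_depth()}, leaves={full_tree.get_n_leaves()}")
print(f"train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```

A gap between training and test accuracy is the usual symptom of an unconstrained tree fitting noise in the training set.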

Pruning Techniques:

        Pre-pruning (Early Stopping):
        • max_depth: Maximum tree depth
        • min_samples_split: Minimum samples a node must contain to be split
        • min_samples_leaf: Minimum samples required in each leaf node
        • min_impurity_decrease: Minimum impurity reduction a split must achieve
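
The four pre-pruning parameters map directly onto `DecisionTreeClassifier` arguments. The sketch below compares an unconstrained tree with a constrained one; the specific values (`max_depth=3`, `min_samples_split=20`, and so on) are illustrative assumptions, not recommended defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

unpruned = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

pre_pruned = DecisionTreeClassifier(
    max_depth=3,                  # stop growing below this depth
    min_samples_split=20,         # a node needs at least 20 samples to be split
    min_samples_leaf=5,           # each leaf must keep at least 5 samples
    min_impurity_decrease=0.001,  # a split must reduce impurity by at least this
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("unpruned", unpruned), ("pre-pruned", pre_pruned)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"train={tree.score(X_train, y_train):.3f}, "
        f"test={tree.score(X_test, y_test):.3f}"
    )
```

In practice these values are usually tuned with cross-validation, for example with `GridSearchCV`, rather than set by hand.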
        
        Post-pruning (Cost Complexity Pruning):
        • Grow the full tree, then remove subtrees that don't improve performance
        • Balance tree complexity vs. accuracy through a complexity penalty (ccp_alpha in scikit-learn)
        • Use a validation set to determine the optimal pruning strength
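
One way to carry out cost complexity pruning in scikit-learn is to compute the candidate penalties with `cost_complexity_pruning_path` and pick the `ccp_alpha` that scores best on a validation set. The sketch below assumes the breast cancer dataset and a single 70/30 split; cross-validation would be a more robust way to choose the penalty.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Effective alphas at which successive subtrees get pruned away.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
# The last alpha prunes the tree down to the root alone, so it is skipped.
candidates = path.ccp_alphas[:-1]

best_alpha, best_score, best_tree = 0.0, -1.0, None
for alpha in candidates:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    # ">=" lets ties go to the larger alpha, i.e. the smaller tree.
    if score >= best_score:
        best_alpha, best_score, best_tree = alpha, score, tree

print(f"best ccp_alpha={best_alpha:.5f}, validation accuracy={best_score:.3f}")
print(f"pruned tree: depth={best_tree.get_depth()}, leaves={best_tree.get_n_leaves()}")
```

Larger `ccp_alpha` values penalize leaves more heavily and therefore yield smaller trees; `ccp_alpha=0.0` leaves the tree unpruned.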