# 🚀 **Guide to Decision Tree Classification & Regression with Pruning in Scikit-Learn** 🌳



---

## 🎯 **What is a Decision Tree?**  

A **Decision Tree** is a **flowchart-like structure** where:  
- Each **node** represents a **decision/question** 🧐  
- Each **branch** represents an **outcome (Yes/No, True/False, etc.)**  
- Each **leaf** represents the **final prediction/classification** 🎯  

🌳 Think of it like **playing "20 Questions"** to guess an object. You keep **asking questions** until you reach the **final answer**!
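📌 To make the flowchart idea concrete, here is a minimal hand-written sketch of a tree as nested `if` statements. The function name, the questions, and the animals are made up for illustration; a trained scikit-learn tree learns its questions from data instead.

```python
def guess_animal(has_feathers: bool, can_fly: bool, has_fur: bool) -> str:
    """Hypothetical hand-written tree: each `if` is a node, each `return` is a leaf."""
    if has_feathers:          # root node: first question
        if can_fly:           # internal node on the "yes" branch
            return "sparrow"  # leaf
        return "penguin"      # leaf
    if has_fur:               # internal node on the "no" branch
        return "dog"          # leaf
    return "lizard"           # leaf


print(guess_animal(has_feathers=True, can_fly=False, has_fur=False))  # penguin
```

Each call follows exactly one path from the root to a leaf, which is also how a fitted tree makes a prediction.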

---

## 🛠 **1️⃣ Importing Required Libraries**  
Before we start, we need to **import** some essential Python libraries.  

```python
import numpy as np  # 🔢 Helps with numerical calculations
import pandas as pd  # 📊 Helps handle datasets like spreadsheets
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_text  # 🌳 Decision Trees
from sklearn.model_selection import train_test_split  # ✂️ Splits data into training & testing
from sklearn.metrics import accuracy_score, mean_squared_error  # ✅ Model evaluation metrics
```

### 🧐 **What’s Happening Here?**
✅ `numpy` (`np`) → Helps in working with **numbers, arrays, and random numbers**  
✅ `pandas` (`pd`) → Used for **handling tabular data (like Excel files)**  
✅ `DecisionTreeClassifier` & `DecisionTreeRegressor` → Implement **Decision Trees**  
✅ `export_text` → Displays a **text-based visualization** of the Decision Tree  
✅ `train_test_split` → Splits data into **training and testing sets** (80% / 20% here, because we pass `test_size=0.2`)  
✅ `accuracy_score` → Measures **classification accuracy**  
✅ `mean_squared_error` → Measures **regression error**  

---

## 📊 **2️⃣ Creating a Sample Dataset (Features & Targets)**
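Since we don’t have a dataset, we’ll **generate random data**. The targets are drawn **independently of the features**, so the scores later in this guide only demonstrate the API and say nothing about real-world accuracy.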

```python
np.random.seed(42)  # 🎲 Ensures we get the same random values every time
X = np.random.randint(1, 100, (10, 3))  # 🔢 Generate a 10x3 matrix of integers from 1 to 99 (upper bound is exclusive)
y_classification = np.random.choice([0, 1], size=10)  # 🎯 Binary classification target (0 or 1)
y_regression = np.random.randint(50, 150, 10)  # 📈 Integer targets from 50 to 149, used as regression values
```

### 🧐 **What’s Happening Here?**
✅ `np.random.seed(42)` → Ensures the same random numbers are generated **each time we run the code**  
✅ `np.random.randint(1, 100, (10, 3))` → Creates **10 rows × 3 features** (random integers **from 1 to 99**, since the upper bound is exclusive)  
✅ `np.random.choice([0, 1], size=10)` → Randomly assigns **0 or 1** as the **classification labels**  
✅ `np.random.randint(50, 150, 10)` → Generates **random integers (50 to 149)** that we treat as **regression targets**  

📌 **Example Data Layout** (illustrative values, not the actual output for seed 42):  
```
Features (X)       Classification Target (y_class)   Regression Target (y_reg)
[34, 98, 56]  →          1                            123
[67, 23, 78]  →          0                             85
...
```
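📌 To see the values that are actually generated, one option is to put everything into a `pandas` DataFrame. This is a small sketch; the column names `Feature 1` to `Feature 3`, `y_class`, and `y_reg` are labels chosen here for readability.

```python
import numpy as np
import pandas as pd

# Recreate the data exactly as in the step above
np.random.seed(42)
X = np.random.randint(1, 100, (10, 3))
y_classification = np.random.choice([0, 1], size=10)
y_regression = np.random.randint(50, 150, 10)

# Combine features and both targets into one table for inspection
df = pd.DataFrame(X, columns=["Feature 1", "Feature 2", "Feature 3"])
df["y_class"] = y_classification
df["y_reg"] = y_regression

print(df)             # all 10 rows
print(df.describe())  # min/max confirm the value ranges
```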

---

## ✂️ **3️⃣ Splitting Data into Training & Testing Sets**
Before training a model, we **split** the dataset into **training (80%)** and **testing (20%)**.  

```python
# Split the features and both targets in one call so the train/test rows stay aligned
X_train, X_test, y_train_class, y_test_class, y_train_reg, y_test_reg = train_test_split(
    X, y_classification, y_regression, test_size=0.2, random_state=42
)
```

### 🧐 **What’s Happening Here?**
✅ **`train_test_split(X, y, test_size=0.2, random_state=42)`**  
✔ Splits **80% for training** & **20% for testing**  
✔ Ensures results are **reproducible** with `random_state=42`
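📌 A quick sanity check of the split, as a sketch on the same kind of 10-row data: it prints the resulting shapes and confirms that repeating the call with the same `random_state` selects the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = np.random.randint(1, 100, (10, 3))
y = np.random.choice([0, 1], size=10)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)

# Same random_state -> same rows end up in the test set
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))  # True
```

With only 2 test rows, any metric computed on the test set is very coarse, so a larger dataset is needed for meaningful evaluation.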

---

## ✂️ **4️⃣ Pre-Pruning: Controlling Tree Growth (Before Training)**
Before training the model, we **limit its complexity** to **prevent overfitting**.

```python
clf_pruned = DecisionTreeClassifier(
    criterion='entropy',  # 📊 Use Entropy for impurity measurement
    max_depth=3,  # ⬇️ Restrict tree depth to 3 levels
    min_samples_split=2,  # 🔢 Minimum samples needed to split a node (the default value)
    min_samples_leaf=1,  # 🍃 Each leaf must have at least 1 sample (the default value)
    min_impurity_decrease=0.01,  # 🛑 Skip a split if it lowers weighted impurity by less than 0.01
    random_state=42
)
clf_pruned.fit(X_train, y_train_class)  # 🎯 Train the pre-pruned classifier
y_pred_pruned = clf_pruned.predict(X_test)  # 🔍 Predict using the pruned model
```

### 🧐 **What’s Happening Here?**
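✅ **Pre-Pruning prevents overfitting** by **stopping tree growth early**.  
- `max_depth=3` → Allows at most **3 levels of splits** below the root  
- `min_samples_split=2` → Node must have **at least 2 samples** to split (this is the **default**, so it adds no restriction here)  
- `min_samples_leaf=1` → Each **leaf node must have at least 1 sample** (also the **default**)  
- `min_impurity_decrease=0.01` → A split is kept only if it lowers the **weighted impurity** by **at least 0.01**  

📌 In this example, only `max_depth` and `min_impurity_decrease` actually constrain the tree. Raise `min_samples_split` or `min_samples_leaf` for stronger pruning.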
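📌 To show what "impurity decrease" means with `criterion='entropy'`, here is a small from-scratch sketch. The helper names `entropy` and `weighted_impurity_decrease` are hypothetical (they are not scikit-learn functions), and the formula follows the weighted impurity decrease described in the scikit-learn documentation, assuming base-2 logarithms and unweighted samples.

```python
import numpy as np


def entropy(labels) -> float:
    """Shannon entropy (base 2) of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    h = -np.sum(p * np.log2(p))
    return float(abs(h))  # abs() avoids returning -0.0 for a pure node


def weighted_impurity_decrease(parent, left, right, n_total) -> float:
    """N_t / N * (H(parent) - N_L / N_t * H(left) - N_R / N_t * H(right))."""
    n_t, n_l, n_r = len(parent), len(left), len(right)
    gain = entropy(parent) - (n_l / n_t) * entropy(left) - (n_r / n_t) * entropy(right)
    return (n_t / n_total) * gain


parent = [0, 0, 1, 1]

# A perfect split separates the classes completely
perfect = weighted_impurity_decrease(parent, [0, 0], [1, 1], n_total=4)
print(perfect)  # 1.0

# A useless split leaves both children as mixed as the parent
useless = weighted_impurity_decrease(parent, [0, 1], [0, 1], n_total=4)
print(useless)  # 0.0
```

With `min_impurity_decrease=0.01`, the first split would be allowed (1.0 ≥ 0.01) and the second would be skipped (0.0 < 0.01).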

---

## 🪓 **5️⃣ Post-Pruning (After Training)**
Instead of limiting the tree **before training**, we **first train the full tree and prune it later**.

```python
clf_unpruned = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf_unpruned.fit(X_train, y_train_class)

# 🔍 Compute the candidate pruning alphas
path = clf_unpruned.cost_complexity_pruning_path(X_train, y_train_class)
ccp_alphas = path.ccp_alphas  # 📈 Candidate alpha values in increasing order

# 🏆 Pick the middle alpha as a simple placeholder (not tuned)
best_alpha = ccp_alphas[len(ccp_alphas) // 2]
clf_post_pruned = DecisionTreeClassifier(criterion='entropy', ccp_alpha=best_alpha, random_state=42)
clf_post_pruned.fit(X_train, y_train_class)
y_pred_post_pruned = clf_post_pruned.predict(X_test)
```

### 🧐 **What’s Happening Here?**
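✅ Trains a **full tree first**  
✅ Computes the **cost-complexity pruning path**, i.e. the candidate `ccp_alpha` values at which the tree would lose another subtree  
✅ Refits with a chosen **`ccp_alpha`**: larger values **prune more nodes** (`0` keeps the full tree, the largest value leaves only the root)  
📌 The middle value here is just a **placeholder**, not a tuned choice. In practice, pick `ccp_alpha` with **cross-validation** or a validation set.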
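📌 One common way to choose `ccp_alpha` more carefully is cross-validation. The sketch below is an assumption-laden illustration, not the method used above: it swaps in a larger synthetic dataset from `make_classification` (the 10-row data is too small for 5 folds to be meaningful), and the names `X_demo`, `y_demo`, and `cv_alpha` are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in dataset (assumption: 200 rows, 5 features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Candidate alphas from the pruning path of a fully grown tree
base = DecisionTreeClassifier(criterion="entropy", random_state=42)
path = base.cost_complexity_pruning_path(X_demo, y_demo)
# Clip guards against tiny negative values caused by floating-point error
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))

# Mean 5-fold accuracy for each candidate alpha
mean_scores = []
for alpha in alphas:
    tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=alpha, random_state=42)
    mean_scores.append(cross_val_score(tree, X_demo, y_demo, cv=5).mean())

best_idx = int(np.argmax(mean_scores))
cv_alpha = alphas[best_idx]
print(f"Selected ccp_alpha={cv_alpha:.5f} with mean CV accuracy {mean_scores[best_idx]:.3f}")
```

The selected value can then be passed as `ccp_alpha=cv_alpha` when fitting the final tree on the full training set.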

---

## 📈 **6️⃣ Decision Tree Regression with Pre-Pruning**
```python
regressor_pruned = DecisionTreeRegressor(
    max_depth=3,
    min_samples_split=2,
    min_samples_leaf=1,
    min_impurity_decrease=0.01,
    random_state=42
)
regressor_pruned.fit(X_train, y_train_reg)
y_pred_reg_pruned = regressor_pruned.predict(X_test)
```
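✅ **Same pre-pruning concepts apply to Regression Trees!** The regressor measures impurity with **squared error** by default, so `min_impurity_decrease=0.01` is on the scale of the target's variance and is very lenient for targets between 50 and 149.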
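📌 To see the effect of pre-pruning on a regression tree, the sketch below compares an unrestricted tree with a `max_depth=3` tree. It assumes a synthetic dataset from `make_regression` rather than the 10-row data above, and the variable names are chosen only for this example.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in dataset (assumption: 200 rows, 3 features, noisy target)
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

full_reg = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
pruned_reg = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_tr, y_tr)

# Compare tree size and error on both the training and the test rows
for name, model in [("full", full_reg), ("max_depth=3", pruned_reg)]:
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name}: leaves={model.get_n_leaves()}, train MSE={train_mse:.1f}, test MSE={test_mse:.1f}")
```

A large gap between training and test error for the full tree is the usual sign of overfitting that pruning is meant to reduce.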

---

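## ✅ **7️⃣ Evaluating Model Performance**
With only **10 rows**, the test set holds **2 samples**, so accuracy can only be `0.0`, `0.5`, or `1.0`. Treat these numbers as a smoke test, not a benchmark.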
```python
print("📊 Accuracy (Pre-Pruned):", accuracy_score(y_test_class, y_pred_pruned))
print("📊 Accuracy (Post-Pruned):", accuracy_score(y_test_class, y_pred_post_pruned))
print("📉 Regression MSE (Pre-Pruned):", mean_squared_error(y_test_reg, y_pred_reg_pruned))
```
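📌 To make the two metrics concrete, here is a tiny worked example on hand-picked values (the numbers are made up so the arithmetic is easy to follow). It also shows RMSE, the square root of MSE, which is in the same units as the target.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: 3 of the 4 predictions match the true labels
y_true_class = [0, 1, 1, 0]
y_pred_class = [0, 1, 0, 0]
acc = accuracy_score(y_true_class, y_pred_class)

# Regression: errors are -10 and +10, so MSE = (100 + 100) / 2
y_true_reg = [100, 120]
y_pred_reg = [110, 110]
mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mse)

print(acc, mse, rmse)  # 0.75 100.0 10.0
```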

---

## 🌳 **8️⃣ Displaying Decision Tree Structure**
```python
print("\n🌳 Pre-Pruned Tree:")
print(export_text(clf_pruned, feature_names=['Feature 1', 'Feature 2', 'Feature 3']))
```
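📌 Besides the text dump, a fitted tree exposes `get_depth()` and `get_n_leaves()`, which give a quick measure of how much pruning shrank it. The sketch below assumes a synthetic dataset from `make_classification`; the dataset and variable names are chosen only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in dataset (assumption: 200 rows, 5 features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# Depth and leaf count summarize tree complexity
for name, tree in [("full", full_tree), ("max_depth=3", shallow_tree)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")

# Print only the top two levels of the pruned tree to keep the output short
feature_names = [f"Feature {i + 1}" for i in range(X_demo.shape[1])]
print(export_text(shallow_tree, feature_names=feature_names, max_depth=2))
```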

---

# 🎯 **Key Takeaways**
✅ **Pre-Pruning** → Stops tree growth **early** (`max_depth`, `min_samples_split`, `min_samples_leaf`, `min_impurity_decrease`).  
✅ **Post-Pruning** → Grows the **full tree**, then **removes weak branches** (`ccp_alpha`, ideally chosen by cross-validation).  
