### **Decision Tree | Assignment**



---

### **Question 1: What is a Decision Tree, and how does it work in the context of classification?**

**Answer:**
A **Decision Tree** is a supervised learning algorithm used for **classification and regression tasks**. It works by splitting the data into subsets based on the value of input features, creating a tree-like model of decisions.

In **classification**, a Decision Tree divides the dataset into smaller groups by asking a series of questions (conditions) about the features. Each internal node represents a decision (test on an attribute), each branch represents the outcome of the test, and each leaf node represents a class label.

**Working process:**

1. The algorithm selects the best feature to split the data using impurity measures (like **Gini** or **Entropy**).
2. It repeats the split recursively on each subset until all records in a node belong to the same class or a stopping condition (such as a maximum depth) is met.
3. The final model can be visualized as a tree, making it easy to interpret and explain.
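
The node, branch, and leaf structure described above can be inspected directly. The following is a minimal sketch, assuming scikit-learn is installed; the Iris data and `max_depth=2` are illustrative choices, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short; max_depth=2 is an arbitrary choice.
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each indented line is a test on one feature (an internal node);
# lines containing "class:" are leaf nodes holding the predicted label.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printed rules from top to bottom follows the same series of questions the classifier asks when it predicts a new sample.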

---

### **Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

**Answer:**
**Gini Impurity:**
It measures how often a randomly chosen element from the dataset would be incorrectly labeled if it were randomly labeled according to the distribution of labels.
$$
Gini = 1 - \sum_i p_i^2
$$
where $p_i$ is the proportion of samples belonging to class $i$.

* Gini = 0 → perfect purity (only one class present).
* Higher Gini = more mixed classes.
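
As a quick check of the formula, here is a small sketch in plain Python. The helper name `gini_impurity` is hypothetical and chosen only for this illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_i^2), where p_i are the class proportions in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["a", "a", "a", "a"]))  # pure node: 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # two classes, evenly mixed: 0.5
```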

**Entropy:**
It measures the amount of uncertainty or randomness in the data.
$$
Entropy = -\sum_i p_i \log_2(p_i)
$$

* Entropy = 0 → perfectly pure node.
* Higher entropy → higher disorder.
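
The same kind of check works for entropy. This is a sketch with a hypothetical helper name, `entropy`, using only the standard library.

```python
import math
from collections import Counter

def entropy(labels):
    """Return -sum(p_i * log2(p_i)) over the class proportions in `labels`."""
    n = len(labels)
    # Only classes that actually occur are counted, so log2(0) never arises.
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())

pure = entropy(["a", "a", "a", "a"])    # one class only, no uncertainty
mixed = entropy(["a", "a", "b", "b"])   # two classes at 50/50, maximum for two classes
print(pure, mixed)
```

For two classes the value ranges from 0 (pure) to 1 bit (evenly mixed).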

**Impact on Splits:**
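Decision Trees use these measures to decide **where to split**. At each node, the algorithm evaluates candidate features and thresholds and chooses the split that **reduces impurity the most** (this reduction is called **Information Gain** when the measure is entropy). The two measures usually lead to similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.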
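
To make the split selection concrete, here is a simplified sketch for a single numeric feature. It is not scikit-learn's implementation; the function names `gini` and `best_threshold` are hypothetical, and candidate thresholds are assumed to be the midpoints between consecutive sorted values.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_threshold(values, labels):
    """Return (threshold, impurity_decrease) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini(labels)
    best_t, best_decrease = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Weight each child's impurity by its share of the samples.
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if parent - weighted > best_decrease:
            best_t, best_decrease = threshold, parent - weighted
    return best_t, best_decrease

print(best_threshold([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))  # (2.5, 0.5)
```

In this toy example the threshold 2.5 separates the two classes perfectly, so the impurity drops from 0.5 to 0.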

---

### **Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

**Answer:**

| Aspect           | Pre-Pruning                                                                                       | Post-Pruning                                                                          |
| ---------------- | ------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| **Definition**   | Stops the tree from growing once a certain condition is met (e.g., `max_depth`, `min_samples_split`). | Grows a full tree first and then removes branches that do not improve performance on held-out data (e.g., cost-complexity pruning). |
| **When Applied** | During training.                                                                                  | After training.                                                                       |
| **Advantage**    | Saves computation time and prevents overfitting early.                                            | Improves generalization by simplifying the model.                                     |

**Example Advantage:**

* **Pre-Pruning:** Reduces overfitting during training by setting `max_depth`.
* **Post-Pruning:** Results in a simpler, more interpretable model.
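
Both approaches can be tried side by side. The sketch below assumes scikit-learn, where post-pruning is available as cost-complexity pruning through the `ccp_alpha` parameter. The specific values (`max_depth=3`, `min_samples_split=4`, and the mid-range alpha) are arbitrary illustrations; in practice alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth is capped while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=4, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute the candidate pruning strengths.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Pick a mid-range alpha purely for illustration.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(
    X_train, y_train
)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned),
                    ("post-pruned", post_pruned)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```

Comparing the leaf counts shows how much each strategy shrinks the tree relative to the fully grown one.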

---

### **Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

**Answer:**
**Information Gain (IG)** measures how much “information” a feature gives us about the target variable after a split. It’s calculated as:
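$$
Information\ Gain = Entropy(parent) - \sum_i \frac{n_i}{n} \times Entropy(child_i)
$$
where $n_i$ is the number of samples in child $i$ and $n$ is the number of samples in the parent node. It represents the **reduction in impurity** achieved by partitioning the data based on a specific feature.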

**Importance:**

* The higher the IG, the better the feature separates the data.
* Decision Trees use IG to select the best attribute at each split, so that each split makes the child nodes as pure as possible.
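
A small worked example may help. In this sketch the helper names `entropy` and `information_gain` and the toy labels are hypothetical: a parent node with 5 "yes" and 5 "no" samples is split into two children of 5 samples each.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5   # entropy = 1.0
left = ["yes"] * 4 + ["no"] * 1     # mostly "yes"
right = ["yes"] * 1 + ["no"] * 4    # mostly "no"

gain = information_gain(parent, [left, right])
print(round(gain, 3))  # 0.278
```

Each child has entropy of about 0.722, so the weighted child entropy is 0.722 and the gain is 1.0 - 0.722 = 0.278. A split that produced two pure children would reach the maximum gain of 1.0 here.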

---

### **Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

**Answer:**
**Applications:**

* **Healthcare:** Disease prediction and diagnosis.
* **Finance:** Credit risk scoring and fraud detection.
* **Marketing:** Customer segmentation and churn analysis.
* **Education:** Predicting student performance.

**Advantages:**

* Easy to interpret and visualize.
* Handles both numerical and categorical data.
* Requires little data preprocessing.

**Limitations:**

* Prone to overfitting if not pruned.
* Sensitive to small data changes.
* Can create complex trees for large datasets.

---

### **Question 6: Write a Python program to:**

* Load the Iris Dataset
* Train a Decision Tree Classifier using the Gini criterion
* Print the model’s accuracy and feature importances

**Answer:**

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", clf.feature_importances_)
```

**Sample Output** (illustrative; exact values depend on the scikit-learn version):

```
Accuracy: 1.0
Feature Importances: [0.02 0.00 0.43 0.55]
```

---

### **Question 7: Write a Python program to:**

* Load the Iris Dataset
* Train a Decision Tree Classifier with `max_depth=3`
* Compare its accuracy to a fully-grown tree

**Answer:**

```python
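from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load and split the data (same split as in Question 6)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fully grown tree (no depth limit)
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_tree.predict(X_test))

# Depth-limited (pre-pruned) tree
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
pruned_acc = accuracy_score(y_test, pruned_tree.predict(X_test))

print("Full Tree Accuracy:", full_acc)
print("Pruned Tree (max_depth=3) Accuracy:", pruned_acc)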
```

**Sample Output:**

```
Full Tree Accuracy: 1.0
Pruned Tree (max_depth=3) Accuracy: 0.9667
```

---

### **Question 8: Write a Python program to:**

* Load the Boston Housing Dataset
* Train a Decision Tree Regressor
* Print the Mean Squared Error (MSE) and feature importances

**Answer:**

```python
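import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# load_boston was removed in scikit-learn 1.2, so read the raw file directly
# (the workaround given in scikit-learn's removal notice; needs network access).
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Each record spans two physical lines in the raw file
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]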

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict and evaluate
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Feature Importances:", regressor.feature_importances_)
```

**Sample Output** (illustrative; exact values depend on the scikit-learn version and data source):

```
Mean Squared Error: 16.8
Feature Importances: [0.03 0.00 0.00 0.00 0.67 0.00 0.13 0.00 0.02 0.00 0.00 0.10 0.05]
```

---

### **Question 9: Write a Python program to:**

* Load the Iris Dataset
* Tune the Decision Tree’s `max_depth` and `min_samples_split` using GridSearchCV
* Print the best parameters and model accuracy

**Answer:**

```python
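from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reload Iris: X and y were overwritten by the housing data in Question 8
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)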

# Define parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 3, 4]
}

# Grid search
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid.best_params_)

# Evaluate model
best_model = grid.best_estimator_
print("Best Model Accuracy:", accuracy_score(y_test, best_model.predict(X_test)))
```

**Sample Output:**

```
Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Best Model Accuracy: 0.9667
```

---

### **Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease...**

**Answer:**

**Step 1: Handle Missing Values**

* Use **mean/median imputation** for numerical columns.
* Use **mode imputation** or a special category (“Unknown”) for categorical variables.
* Alternatively, use `SimpleImputer` from `sklearn.impute`.

**Step 2: Encode Categorical Features**

* Convert text-based features into numeric form using **Ordinal (Label) Encoding** (`OrdinalEncoder`) for ordered categories or **One-Hot Encoding** (`pd.get_dummies()` or `OneHotEncoder`) for unordered ones.

**Step 3: Train a Decision Tree Model**

* Split data into training and testing sets.
* Train a `DecisionTreeClassifier` using appropriate parameters.

**Step 4: Tune Hyperparameters**

* Use `GridSearchCV` to tune `max_depth`, `min_samples_split`, and `criterion`.

**Step 5: Evaluate Performance**

* Use metrics like **accuracy**, **precision**, **recall**, and **F1-score**.
* Use a **confusion matrix** to check classification results.
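
Steps 1 to 5 can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption made for illustration: the column names (`age`, `blood_pressure`, `smoker`, `disease`), the synthetic values, and the rule that generates the label are hypothetical stand-ins for real patient records.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all columns and values are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["disease"] = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)
# Knock out some values to mimic missing data.
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker"]

# Steps 1 and 2: impute missing values, then one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: preprocessing and the tree live in one pipeline.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="disease"), df["disease"],
    test_size=0.2, stratify=df["disease"], random_state=42,
)

# Step 4: tune hyperparameters with cross-validation.
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_split": [2, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Keeping the imputers and encoder inside the pipeline means they are fitted only on the training folds during cross-validation, which avoids leaking information from the validation data into preprocessing.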

**Business Value:**

* The model helps doctors identify high-risk patients quickly.
* Improves early diagnosis, saves time and cost.
* Helps in resource allocation and preventive healthcare planning.

---


