# **Decision Tree | Assignment**





---



## **Question 1: What is a Decision Tree, and how does it work in the context of classification?**

### 1. Conceptual Explanation
A Decision Tree is a supervised machine learning algorithm that splits data into branches based on feature values to make predictions.  
In classification, it assigns a class label by traversing from root to leaf using decision rules.

### 2. Mathematical Explanation
At each node, we choose a feature that maximizes Information Gain:

Information Gain = Parent Impurity − Weighted Child Impurity

Using Entropy:
Entropy(S) = − Σ pᵢ log₂(pᵢ)

Split chosen:
IG(S, A) = Entropy(S) − Σ (|Sᵥ| / |S|) Entropy(Sᵥ)

### 3. Intuitive Explanation
Like a flowchart:
"Is age > 30?" → Yes/No  
"Is income high?" → Yes/No  
Finally, reach a decision like “Approve Loan” or “Reject Loan”.
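
The flowchart can be sketched as a nested dictionary that is walked from root to leaf. This is only an illustration: the thresholds, the field names (`age`, `income`), and the leaf labels are hypothetical and hand-written, whereas a trained tree would learn them from data.

```python
# Hypothetical hand-written tree mirroring the flowchart; a trained model
# would learn the features and thresholds from data instead.
TREE = {
    "question": lambda applicant: applicant["age"] > 30,
    "yes": {
        "question": lambda applicant: applicant["income"] == "high",
        "yes": "Approve Loan",
        "no": "Reject Loan",
    },
    "no": "Reject Loan",
}


def predict(node, sample):
    """Walk from the root to a leaf, following the answer at each node."""
    while isinstance(node, dict):          # internal node: ask its question
        branch = "yes" if node["question"](sample) else "no"
        node = node[branch]
    return node                            # leaf: the class label


print(predict(TREE, {"age": 45, "income": "high"}))
print(predict(TREE, {"age": 25, "income": "high"}))
```

The first applicant follows the "yes" branch twice and is approved; the second fails the root question and is rejected immediately.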

### 4. Real-World Example
In healthcare:
Root: Is blood pressure high?
Branch: Is sugar level high?
Leaf: Patient has heart disease (Yes/No)

### 5. Assumptions
- Data can be split using clear rules
- Features are informative
- Training data represents real-world patterns

### 6. Advantages
- Easy to interpret
- Handles both numeric & categorical data
- No feature scaling needed

### 7. Disadvantages
- Overfitting
- Sensitive to small data changes
- Greedy splitting

### 8. Use Cases
- Medical diagnosis
- Credit approval
- Customer churn prediction


## **Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

### 1. Definition (Conceptual)

In a Decision Tree, impurity measures tell us how “mixed” the classes are at a node.  
A node is:
- **Pure** → all samples belong to one class  
- **Impure** → samples belong to multiple classes  

Two most common impurity measures are:

#### (a) Gini Impurity  
Gini Impurity measures the probability that a randomly chosen sample from a node would be **incorrectly classified** if it were randomly labeled according to the class distribution in that node.

#### (b) Entropy  
Entropy measures the **amount of uncertainty or disorder** in the node. It comes from Information Theory and quantifies how unpredictable the class labels are.

---

### 2. Mathematical Definitions

Let a node contain samples from \(k\) classes, and  
\(p_i\) = proportion of samples of class \(i\).

**Gini Impurity**

\[
Gini = 1 - \sum_{i=1}^{k} p_i^2
\]

- If all samples belong to one class → Gini = 0 (pure)
- If classes are evenly mixed → Gini is maximum, \(1 - 1/k\) (0.5 for two classes)

**Entropy**

\[
Entropy = - \sum_{i=1}^{k} p_i \log_2(p_i)
\]

- If one class has probability 1 → Entropy = 0 (no uncertainty)
- If classes are equally likely → Entropy is maximum, \(\log_2 k\) (1 bit for two classes)
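
As a quick check of both formulas, the sketch below evaluates them on three class distributions. The function names and the example distributions are illustrative assumptions, not part of any library.

```python
import math


def gini(proportions):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in proportions)


def entropy(proportions):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


for label, dist in [("pure", [1.0, 0.0]),
                    ("90/10", [0.9, 0.1]),
                    ("50/50", [0.5, 0.5])]:
    print(f"{label:>5}: gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures are 0 for the pure node and peak for the 50/50 node (Gini 0.5, Entropy 1 bit); the 90/10 node sits in between at roughly 0.18 and 0.47.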

---

### 3. Intuitive Explanation

Think of a basket of fruits:
- Basket with only apples → very pure → low Gini, low Entropy  
- Basket with 50% apples and 50% oranges → very mixed → high Gini, high Entropy  

A Decision Tree tries to ask questions (splits) so that each resulting basket contains mostly one type of fruit.

---

### 4. Impact on Splits

For every possible split, the tree computes the impurity reduction:

\[
\text{Impurity Reduction} = \text{Impurity(parent)} - \sum \left( \frac{n_{child}}{n_{parent}} \times \text{Impurity(child)} \right)
\]

The split that gives the **maximum reduction in impurity** is chosen.

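- Gini → favors the split whose child nodes have the lowest expected misclassification probability  
- Entropy → favors the split that reduces uncertainty the most (the highest Information Gain)  

In practice the two criteria usually select the same or very similar splits.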
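
A minimal sketch of this search for a single numeric feature, using Gini as the impurity measure. The data (how often the word "free" appears in an email) and the helper names are hypothetical; real implementations such as CART scan every feature in the same way.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_reduction(parent, children):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent distinct values."""
    distinct = sorted(set(values))
    best = (None, 0.0)
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        gain = impurity_reduction(labels, [left, right])
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical data: count of the word "free" per email, and its label.
free_counts = [0, 0, 1, 2, 3, 4]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
print(best_threshold(free_counts, labels))
```

Here the threshold 1.5 separates the two classes perfectly, so the reduction equals the full parent impurity of 0.5.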

---

### 5. Real-World Example

In email spam detection:
- A node with 90% spam and 10% non-spam has low impurity.
- A node with 50% spam and 50% non-spam has high impurity.

A good feature (like the word “free”) creates child nodes that are more pure, thus lowering Gini and Entropy.

---

### 6. Comparison

| Aspect | Gini Impurity | Entropy |
|--------|---------------|---------|
| Formula | \(1 - \sum p_i^2\) | \(-\sum p_i \log_2 p_i\) |
| Speed | Faster | Slightly slower |
| Interpretation | Misclassification probability | Uncertainty / information |
| Used In | CART | ID3, C4.5 |

---

### 7. Advantages
- Clearly measures node purity
- Helps in selecting the best feature for splitting
- Works for multi-class classification

### 8. Disadvantages
- Greedy (local optimum)
- Sensitive to noise
- Biased towards features with many categories

### 9. Use Cases
- Decision Tree classifiers
- Random Forests
- Medical diagnosis systems
- Credit risk assessment


## **Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

### 1. Conceptual Definition

In Decision Trees, **pruning** is used to control the size of the tree and avoid overfitting.

There are two main types:

#### (a) Pre-Pruning (Early Stopping)
Pre-pruning stops the tree from growing further **during training** by setting constraints such as:
- Maximum depth of the tree
- Minimum number of samples in a node
- Minimum information gain for a split

The tree is prevented from becoming too complex.

#### (b) Post-Pruning (Backward Pruning)
Post-pruning allows the tree to grow **fully first**, and then removes (cuts) branches that do not improve performance on validation data.

---

### 2. Mathematical View

Pre-pruning conditions:
\[
\text{Stop splitting if: } depth \ge d_{max} \quad \text{or} \quad samples < n_{min}
\]

Post-pruning uses cost-complexity:
\[
Cost(T) = Error(T) + \alpha \times \text{Number of Leaves}
\]

where:
- \(Error(T)\) = misclassification error
- \(\alpha\) = complexity penalty

---

### 3. Intuitive Explanation

Think of growing a plant:

- **Pre-pruning**: You stop the plant from growing too tall early.
- **Post-pruning**: You let it grow fully, then trim unnecessary branches.

---

### 4. Real-World Example

In a **loan approval system**:

- Pre-pruning: Limit tree depth so that rules remain simple and fast.
- Post-pruning: Build a complex rule system first, then remove conditions that do not improve accuracy.

---

### 5. Practical Advantages

#### Advantage of Pre-Pruning
✅ **Faster training and simpler model**

Example:  
In real-time fraud detection, decisions must be made in milliseconds. A small pre-pruned tree gives quick predictions with acceptable accuracy.

#### Advantage of Post-Pruning
✅ **Higher predictive accuracy**

Example:  
In medical diagnosis, capturing complex interactions is important. Post-pruning keeps useful deep patterns and removes only truly irrelevant branches.

---

### 6. Comparison Table

| Aspect | Pre-Pruning | Post-Pruning |
|--------|------------|--------------|
| When applied | During training | After full training |
| Tree size | Smaller | Initially large, then reduced |
| Risk | Underfitting | Overfitting (before pruning) |
| Accuracy | Moderate | Usually higher |
| Computation | Low | Higher |

---

### 7. Advantages

**Pre-Pruning**
- Prevents overfitting early
- Fast and memory efficient
- Easy to control

**Post-Pruning**
- Better generalization
- Keeps important complex patterns
- More accurate on unseen data

---

### 8. Disadvantages

**Pre-Pruning**
- May stop too early → underfitting

**Post-Pruning**
- Computationally expensive
- Requires validation set

---

### 9. Use Cases

- Pre-Pruning: Large-scale real-time systems (banking, recommendation)
- Post-Pruning: High-accuracy systems (healthcare, fault diagnosis)


## **Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

### 1. Conceptual Definition

**Information Gain (IG)** is a measure used in Decision Trees to determine how well a feature separates the training data into target classes.  
It tells us how much **uncertainty (entropy)** is reduced after splitting the data using a particular feature.

In simple words:  
> Information Gain measures how much “information” a feature gives us about the class label.

---

### 2. Mathematical Formulation

Let:
- \(S\) = parent dataset
- \(A\) = attribute (feature) used for splitting
- \(S_v\) = subset of \(S\) where attribute \(A\) takes value \(v\)

#### Entropy of dataset \(S\):

\[
Entropy(S) = - \sum_{i=1}^{k} p_i \log_2(p_i)
\]

where \(p_i\) is the proportion of class \(i\).

#### Information Gain:

\[
IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
\]

---

### 3. Intuitive Explanation

Imagine you are playing the game **“Guess the Animal”**:

You want to ask the question that reduces your uncertainty the most.

- Bad question: “Is it an animal?” (almost no information)
- Good question: “Does it have feathers?” (big reduction in possibilities)

Information Gain chooses the feature that gives the **most clarity** in one step.

---

### 4. How It Affects Splits

For every feature:
1. Calculate entropy before split.
2. Calculate weighted entropy after split.
3. Compute Information Gain.
4. Choose the feature with **maximum IG**.

So, the best split is:

\[
\text{Best Feature} = \arg\max_A IG(S, A)
\]
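
The four steps can be traced on a tiny dataset. The sketch below is a plain-Python illustration, not the ID3 implementation itself; the four emails and the feature names `has_free` and `has_link` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """IG(S, A) = Entropy(S) - sum(|S_v| / |S| * Entropy(S_v))."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])   # S_v for each value v
    parent = [row[target] for row in rows]
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(parent) - weighted


# Hypothetical toy emails; the feature names are illustrative only.
emails = [
    {"has_free": True,  "has_link": True,  "label": "spam"},
    {"has_free": True,  "has_link": False, "label": "spam"},
    {"has_free": False, "has_link": True,  "label": "ham"},
    {"has_free": False, "has_link": False, "label": "ham"},
]
gains = {f: information_gain(emails, f, "label")
         for f in ("has_free", "has_link")}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

In this toy data `has_free` separates the classes perfectly (IG = 1 bit), while `has_link` leaves both children as mixed as the parent (IG = 0), so `has_free` is selected.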

---

### 5. Real-World Example

In **email spam detection**:

Features:
- Contains word “free”
- Contains link
- Sender is unknown

The feature “Contains word ‘free’” might reduce uncertainty the most, so it gets the highest Information Gain and becomes the root split.

---

### 6. Assumptions

- Data is representative.
- Entropy correctly measures uncertainty.
- Each feature is evaluated on its own at every split (the split criterion is univariate).

---

### 7. Advantages

- Provides a clear criterion for feature selection.
- Produces informative and interpretable trees.
- Based on solid information theory.

---

### 8. Disadvantages

- Biased toward attributes with many unique values.
- Greedy (chooses local best split, not global).
- Sensitive to noise.

---

### 9. Use Cases

- Decision Tree classifiers (ID3, C4.5)
- Feature selection
- Medical diagnosis systems
- Credit risk modeling
- Customer churn prediction


## **Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

### 1. Conceptual Explanation

A **Decision Tree** is a rule-based machine learning model that makes decisions by splitting data into branches based on feature conditions.  
Because of its tree-like and human-readable structure, it is widely used in real-world decision-making systems.

---

### 2. Common Real-World Applications

#### (a) Healthcare
- Disease diagnosis (diabetes, cancer, heart disease)
- Patient risk classification
- Treatment recommendation systems  

#### (b) Finance
- Loan approval and credit scoring
- Fraud detection
- Stock market trend classification  

#### (c) Marketing
- Customer segmentation
- Churn prediction
- Targeted advertising  

#### (d) Retail & E-commerce
- Product recommendation
- Demand forecasting
- Inventory classification  

#### (e) Manufacturing
- Fault detection
- Quality control
- Predictive maintenance  

---


### 4. Intuitive Explanation

A Decision Tree works like a **doctor's decision process**:

- If fever > 100°F → Check cough  
- If cough = yes → Check chest pain  
- Final decision → Disease present or not  

Each question reduces uncertainty and leads to a final decision.

---

### 5. Advantages

#### (a) Easy to Understand and Interpret
- Can be visualized as flowcharts
- Useful in domains requiring explainability (healthcare, law, banking)

#### (b) Handles Non-Linear Relationships
- No need for linear assumptions

#### (c) Minimal Data Preprocessing
- No scaling required
- Handles categorical and numerical features

#### (d) Feature Importance
- Automatically identifies important variables

---

### 6. Limitations

#### (a) Overfitting
- Deep trees memorize noise
- Poor generalization

#### (b) Instability
- Small data change → different tree

#### (c) Greedy Nature
- Finds local optimal splits, not global

#### (d) Bias Toward Multi-valued Attributes
- Information Gain favors high-cardinality features

---

### 7. Assumptions

- Features contain useful decision rules
- Training data represents real population
- Splits reduce class impurity

---

### 8. Practical Use Case Example

**Loan Approval System**

Rules:
- If income > 50k → Low risk  
- If credit score > 700 → Approve  
- Else → Reject  

This tree:
- Is interpretable to bank managers
- Justifies decisions legally
- Helps reduce default risk

---

### 9. Summary Table

| Aspect | Details |
|--------|--------|
| Strengths | Interpretable, non-linear, fast |
| Weaknesses | Overfitting, unstable |
| Best Used In | Healthcare, Finance, Marketing |
| Not Ideal For | Very noisy, high-dimensional data |

---



## **Question 6: Write a Python program to load the Iris dataset, train a Decision Tree Classifier using the Gini criterion, and print the model’s accuracy and feature importances.**



```python
# Load required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Classifier using Gini criterion
dt_model = DecisionTreeClassifier(criterion='gini', random_state=42)
dt_model.fit(X_train, y_train)

# Make predictions
y_pred = dt_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Get feature importances
feature_importances = pd.DataFrame({
    "Feature": iris.feature_names,
    "Importance": dt_model.feature_importances_
})

# Print accuracy and feature importances
print("Model Accuracy:", accuracy)
print("\nFeature Importances:")
print(feature_importances)
```

Output:

```
Model Accuracy: 1.0

Feature Importances:
             Feature  Importance
0  sepal length (cm)    0.000000
1   sepal width (cm)    0.019110
2  petal length (cm)    0.893264
3   petal width (cm)    0.087626
```


## **Question 7: Write a Python program to:**
- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

```python
# Load required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train fully-grown Decision Tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_tree_pred = full_tree.predict(X_test)
full_tree_accuracy = accuracy_score(y_test, full_tree_pred)

# Train Decision Tree with max_depth = 3
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
pruned_tree_pred = pruned_tree.predict(X_test)
pruned_tree_accuracy = accuracy_score(y_test, pruned_tree_pred)

# Print accuracies
print("Fully-Grown Tree Accuracy:", full_tree_accuracy)
print("Max Depth = 3 Tree Accuracy:", pruned_tree_accuracy)
```

Output:

```
Fully-Grown Tree Accuracy: 1.0
Max Depth = 3 Tree Accuracy: 1.0
```


## **Question 8: Write a Python program to:**
- Load the Boston Housing Dataset
- Train a Decision Tree Regressor
- Print the Mean Squared Error (MSE) and feature importances

```python
# Load required libraries
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load Boston Housing Dataset
# (load_boston was removed from scikit-learn in version 1.2,
# so the data is fetched from OpenML instead)
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data
y = boston.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Regressor
dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(X_train, y_train)

# Make predictions
y_pred = dt_reg.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Get feature importances
feature_importances = pd.DataFrame({
    "Feature": boston.feature_names,
    "Importance": dt_reg.feature_importances_
})

# Print results
print("Mean Squared Error (MSE):", mse)
print("\nFeature Importances:")
print(feature_importances)
```

Output:

```
Mean Squared Error (MSE): 11.588026315789474

Feature Importances:
    Feature  Importance
0      CRIM    0.058465
1        ZN    0.000989
2     INDUS    0.009872
3      CHAS    0.000297
4       NOX    0.007051
5        RM    0.575807
6       AGE    0.007170
7       DIS    0.109624
8       RAD    0.001646
9       TAX    0.002181
10  PTRATIO    0.025043
11        B    0.011873
12    LSTAT    0.189980
```


## **Question 9: Write a Python program to:**
- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
- Print the best parameters and the resulting model accuracy

```python
# Load required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# Initialize Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Apply GridSearchCV
grid_search = GridSearchCV(dt, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get best model
best_model = grid_search.best_estimator_

# Make predictions with best model
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print best parameters and accuracy
print("Best Parameters:", grid_search.best_params_)
print("Best Model Accuracy:", accuracy)
```

Output:

```
Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Best Model Accuracy: 1.0
```




## **Question 10: End-to-End Process for Disease Prediction Using a Decision Tree**

You are working as a data scientist in a healthcare company. The goal is to predict whether a patient has a certain disease using a dataset that contains:
- Numerical features (age, blood pressure, sugar level, etc.)
- Categorical features (gender, smoking status, symptom type, etc.)
- Missing values

Below is the complete step-by-step process.

---

### 1. Handling Missing Values

#### Conceptual
Medical datasets often have missing entries due to:
- Patients skipping tests
- Data entry errors
- Sensor failures

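Many Decision Tree implementations cannot handle missing values directly (scikit-learn's trees only gained native support in version 1.3), so imputing them is the usual approach.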

#### Methods
- **Numerical Features**
  - Mean Imputation: replace missing value with average
  - Median Imputation: robust to outliers
- **Categorical Features**
  - Mode Imputation (most frequent category)

#### Mathematical
Mean Imputation:
\[
x_{missing} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]

#### Intuitive
If a patient’s cholesterol is missing, replace it with the average cholesterol of the other patients (or, as a refinement, of similar patients only).
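
The three strategies can be sketched with the standard library alone. The `impute` helper and the patient values are hypothetical; in a real pipeline the fill value should be computed on the training split only and then reused on the test split.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries using a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical patient records with missing entries (None).
cholesterol = [180, 200, None, 220, 400]      # 400 is an outlier
smoking = ["no", "yes", "no", None, "no"]

print(impute(cholesterol, "mean"))
print(impute(cholesterol, "median"))
print(impute(smoking, "mode"))
```

The outlier pulls the mean fill value up to 250, while the median stays at 210, which is why the median is the more robust choice for skewed measurements.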

---

### 2. Encoding Categorical Features

#### Conceptual
Machine learning models require numbers, not text.

#### Methods
- **Label Encoding**
  - Male → 0, Female → 1
- **One-Hot Encoding**
  - Disease Type: {A, B, C} → [1,0,0], [0,1,0], [0,0,1]

#### Intuitive
Convert hospital form answers (Yes/No, Male/Female) into 0s and 1s.
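
Both encodings fit in a few lines of plain Python. The helper names below are illustrative; libraries such as scikit-learn or pandas provide equivalents with more options.

```python
def label_encode(values):
    """Map each category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Map each category to a 0/1 vector with a single 1."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values], categories


codes, mapping = label_encode(["Male", "Female", "Female", "Male"])
vectors, categories = one_hot_encode(["A", "C", "B", "A"])
print(codes, mapping)
print(vectors, categories)
```

Label encoding implies an order between categories, and a tree will split on that order with thresholds. For nominal features with more than two values, one-hot encoding avoids introducing an ordering that does not exist.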

---

### 3. Training the Decision Tree Model

#### Conceptual
A Decision Tree learns a set of rules that split patients into groups based on medical conditions.

#### Mathematical
Splits are chosen by maximizing Information Gain:

\[
IG = Entropy(parent) - \sum \frac{n_i}{n} Entropy(child_i)
\]

#### Intuitive
The model asks:
- Is age > 50?
- Is blood pressure > 140?
- Is glucose > 180?

And finally predicts: Disease = Yes / No.

---

### 4. Hyperparameter Tuning

#### Important Parameters
- **max_depth** – maximum depth of the tree (limits overfitting)
- **min_samples_split** – minimum number of patients a node must contain before it can be split
- **min_samples_leaf** – minimum number of patients required in each leaf

#### Method
Use **GridSearchCV** with cross-validation.

#### Mathematical
\[
\theta^* = \arg\min_{\theta} \frac{1}{k}\sum_{i=1}^{k} Error_i(\theta)
\]

#### Intuitive
Try many tree sizes and choose the one that performs best on unseen patients.
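
The selection rule \(\theta^*\) is just an argmin over mean fold errors. The sketch below assumes the per-fold errors are already known; the numbers are hypothetical, and in practice they come from training and scoring the tree on each fold, which is what GridSearchCV automates.

```python
from statistics import mean

# Hypothetical validation error per fold (k = 3) for each max_depth value;
# in practice these numbers come from training and scoring the model.
fold_errors = {
    2: [0.20, 0.18, 0.22],
    4: [0.10, 0.12, 0.11],
    None: [0.09, 0.19, 0.17],   # None = unlimited depth
}


def select_best(fold_errors):
    """Return the setting with the lowest mean cross-validated error."""
    mean_errors = {theta: mean(errs) for theta, errs in fold_errors.items()}
    best = min(mean_errors, key=mean_errors.get)
    return best, mean_errors


best_depth, mean_errors = select_best(fold_errors)
print(best_depth, mean_errors)
```

The unlimited tree has the single best fold (0.09) but varies widely between folds, so its mean error of 0.15 loses to the depth-4 tree at 0.11.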

---

### 5. Model Evaluation

#### Metrics
- Accuracy
- Precision
- Recall
- F1-score
- ROC-AUC

#### Example Formula (Accuracy)
\[
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
\]

#### Intuitive
Check how many patients are correctly diagnosed as sick or healthy.
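
A small worked example shows why accuracy alone can mislead on imbalanced medical data. The confusion-matrix counts are hypothetical (100 patients, 10 of them sick), and the function name is illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical screening result: 100 patients, 10 of whom are sick.
metrics = classification_metrics(tp=6, tn=85, fp=5, fn=4)
print(metrics)
```

Accuracy is 0.91, yet recall is only 0.60: four of the ten sick patients are missed. In a disease-prediction setting that gap usually matters more than the headline accuracy.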

---

### 6. Business Value in Real-World Healthcare

#### a) Early Disease Detection
High-risk patients can be identified early, improving survival rates.

#### b) Clinical Decision Support
Doctors get an interpretable rule-based system:
"If BP > 140 and Sugar > 180 → High risk"

#### c) Cost Reduction
Avoid unnecessary tests and hospitalizations.

#### d) Resource Optimization
Hospitals can prioritize critical patients.

#### e) Regulatory Compliance
Decision Trees are explainable, which is essential in medical audits.

---

### 7. Summary

| Step | Purpose |
|------|--------|
| Missing Value Handling | Avoid data loss and bias |
| Encoding | Convert text to numbers |
| Training | Learn diagnosis rules |
| Hyperparameter Tuning | Improve generalization |
| Evaluation | Ensure reliability |
| Business Value | Better care, lower cost, trustable AI |

This complete pipeline transforms raw hospital data into a reliable, interpretable disease prediction system.

