**Question 1:  What is a Decision Tree, and how does it work in the context of classification?**

Ans = A Decision Tree is a supervised machine learning algorithm used to make decisions by recursively splitting data based on feature values, creating a tree-like structure that is both easy to interpret and visualize.

Structure and Components

The tree starts with a root node representing the full dataset.

Each internal node corresponds to a decision based on an input feature.

Branches denote the outcome of decisions (e.g., yes/no or specific ranges).

Leaf nodes are terminal points that assign a class label to the subset of data that reaches them.

How Decision Trees Work in Classification

Classification trees are used to predict categorical outcomes, such as “spam” or “not spam” for emails, by asking a series of questions based on feature values.

The tree evaluates features using splitting criteria like Gini impurity or entropy, dividing the dataset into groups that are as homogeneous as possible with regard to the target class.

The algorithm continues splitting at each node until a stopping criterion is met, such as maximum tree depth, a minimum number of samples, or purity of leaf nodes (all samples in a node belong to the same class).

Example in Classification

For example, if the goal is to classify whether a person is “fit” or “unfit”, the tree may first split by age, then by eating habits, and so on.

At each split, a decision node asks a question, and each branch directs to a new question or outcome based on the answer.

The process continues until a stopping criterion is met or the data cannot be split further, and the final classification is assigned at the leaf node.
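
The fit/unfit example can be sketched in code. This is a minimal illustration, assuming a made-up eight-row dataset with three hypothetical features (age, eats_junk_food, exercises); the values are invented for demonstration and are not real data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [age, eats_junk_food (1 = yes), exercises (1 = yes)]
X = [
    [25, 1, 0],
    [22, 0, 1],
    [45, 1, 0],
    [50, 0, 1],
    [35, 1, 1],
    [28, 0, 0],
    [60, 1, 0],
    [40, 0, 1],
]
y = ["unfit", "fit", "unfit", "fit", "fit", "unfit", "unfit", "fit"]

clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

# Print the learned questions as indented if/else rules
rules = export_text(clf, feature_names=["age", "eats_junk_food", "exercises"])
print(rules)

# Classify a new person by following the questions from root to leaf
prediction = clf.predict([[30, 0, 1]])[0]
print("Prediction:", prediction)
```

In this invented dataset the exercises feature separates the two classes perfectly, so the tree needs only one question before reaching pure leaves.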

**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?**


Gini Impurity

It tells us: “How often would we be wrong if we labeled a randomly chosen item from this group by guessing according to the group's class proportions?”

If all items are from the same class → Gini = 0 (pure).

If the items are mixed → Gini is higher.

The decision tree will split in a way that reduces this wrong-guess chance.

Entropy

It comes from information theory and measures the disorder or uncertainty in a group.

If a group has only one class → Entropy = 0 (no uncertainty).

If a group has many classes evenly mixed → Entropy is higher (high uncertainty).

The decision tree will split to reduce this uncertainty as much as possible.

Impact on Splits

Both Gini and Entropy help the tree decide where to split the data.

The goal is always to make the groups purer (more of one class, less mixing).

Difference:

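Gini is slightly cheaper to compute because it needs no logarithm, and it is the default criterion in scikit-learn's DecisionTreeClassifier.

Entropy requires a logarithm, so it is a bit slower, and it tends to penalize mixed nodes slightly more strongly.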

In practice, both usually give very similar decision trees.
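
Both measures can be written down directly. With p_k the fraction of items in class k, Gini = 1 − Σ p_k² and Entropy = −Σ p_k log2(p_k). The sketch below computes them with NumPy on small invented label lists; the helper names are illustrative and not part of any library.

```python
import numpy as np

def class_probabilities(labels):
    """Fraction of items in each class present in the group."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini_impurity(labels):
    """Gini = 1 - sum(p_k^2)."""
    p = class_probabilities(labels)
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Entropy = -sum(p_k * log2(p_k)), written as sum(p_k * log2(1 / p_k))."""
    p = class_probabilities(labels)
    return float(np.sum(p * np.log2(1.0 / p)))

pure = ["spam", "spam", "spam", "spam"]
mixed = ["spam", "spam", "ham", "ham"]
skewed = ["spam", "spam", "spam", "ham"]

for name, group in [("pure", pure), ("mixed", mixed), ("skewed", skewed)]:
    print(f"{name}: gini={gini_impurity(group):.3f}, entropy={entropy(group):.3f}")
```

For two classes, Gini peaks at 0.5 and entropy at 1.0 when the group is split evenly, and both fall to 0 for a pure group.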

**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.**

Ans = Difference Between Pre-Pruning and Post-Pruning
Pre-Pruning (Early Stopping):
This technique stops the growth of the decision tree early, before the tree becomes too complex. During training, splits are prevented if they do not improve some performance criterion (like information gain above a threshold, minimum required samples per leaf, or maximum tree depth). Pre-pruning keeps the tree small from the beginning and is efficient because unnecessary branches are never created.

Post-Pruning (Reduced Error or Cost-Complexity Pruning):
Here, the decision tree is first fully grown, which may result in a very complex structure. Then, the tree is simplified by removing branches that have little importance (i.e., do not improve accuracy on validation data). Post-pruning typically uses cross-validation or error-based strategies to determine which nodes can be pruned to strike a balance between model complexity and accuracy.

Practical Advantage of Each
Advantage of Pre-Pruning:
It can significantly reduce computation time and memory usage, especially on large datasets, by avoiding the creation of unnecessary branches and resulting in simpler, faster, and more interpretable trees.

Advantage of Post-Pruning:
It often produces more accurate models because it allows the algorithm to find complex patterns first, then removes only the parts proven unnecessary, which improves generalization to unseen data by combatting overfitting more systematically.

Both approaches regularize decision trees by controlling their complexity and preventing overfitting: pre-pruning is proactive, while post-pruning is reactive, and the choice between them depends on the dataset and modeling requirements.
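
As a rough sketch of the two strategies in scikit-learn: pre-pruning maps to growth limits such as max_depth and min_samples_leaf, while cost-complexity post-pruning is exposed through the ccp_alpha parameter. The specific limits and the choice of alpha below are arbitrary assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: stop growth early with depth and leaf-size limits
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune with cost-complexity pruning
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take a mid-range alpha from the pruning path.
# In practice alpha would be selected by cross-validation.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: nodes={model.tree_.node_count}, test accuracy={model.score(X_test, y_test):.3f}")
```

Larger values of ccp_alpha remove more of the tree, so the node count of the post-pruned model can only stay the same or shrink relative to the full tree.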


**Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?**

Ans = What is Information Gain?

Information Gain is a measure of how much “uncertainty” (impurity) is reduced in the data after splitting it based on a feature.

In other words, it tells us how well a particular attribute separates the data into classes.

It is calculated as the difference between the impurity of the parent node (before the split) and the weighted impurity of child nodes (after the split).

Why is it important in Decision Trees?

Decision trees work by choosing the best feature to split on at each step.

The “best split” is the one that gives the highest Information Gain.

A higher Information Gain means:

The feature reduces uncertainty more.

The child nodes are purer (contain mostly one class).

The split helps the model make more accurate predictions.
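
The calculation can be made concrete with a short sketch: Information Gain = entropy(parent) − Σ (n_child / n_parent) × entropy(child). The label lists below are invented, and the function names are illustrative rather than library APIs.

```python
import numpy as np

def entropy(labels):
    """Entropy in bits of the class distribution in a group."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted_child_entropy = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted_child_entropy

parent = ["yes"] * 4 + ["no"] * 4

# Three candidate splits of the same parent node
perfect_split = [["yes"] * 4, ["no"] * 4]
partial_split = [["yes", "yes", "yes", "no"], ["yes", "no", "no", "no"]]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

for name, split in [("perfect", perfect_split), ("partial", partial_split), ("useless", useless_split)]:
    print(f"{name}: information gain = {information_gain(parent, split):.3f}")
```

The tree would prefer the perfect split here because it removes all of the parent's uncertainty, while the useless split leaves the class mix unchanged.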

**Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?**

Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV). Note: load_boston() was removed in scikit-learn 1.2, so the regression code in this notebook uses fetch_california_housing() instead.
Ans = Applications

Healthcare – Diagnosing diseases based on symptoms and medical history.

Finance – Credit scoring and loan approval by checking income, credit history, etc.

Marketing & Sales – Customer segmentation and predicting customer churn.

Retail – Product recommendation and demand forecasting.

Manufacturing – Quality control and defect detection.

Education – Predicting student performance and dropout risks.

Advantages

Easy to understand & interpret (even without deep ML knowledge).

Handles both classification & regression tasks.

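No need for heavy data preprocessing: feature scaling is unnecessary, and the algorithm can in principle work with numerical and categorical data (scikit-learn's implementation still requires categorical features to be encoded as numbers).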

Non-linear relationships can be captured.

Fast and efficient on small-to-medium datasets.

Limitations

Prone to overfitting (creates very complex trees).

Unstable – small changes in data can change the tree structure.

Biased towards features with many distinct values or categories, since these offer more candidate splits.

Often less accurate than ensemble methods (Random Forest, XGBoost).
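
The overfitting limitation can be illustrated with a small experiment. This sketch assumes a synthetic dataset from make_classification with 20% label noise (an arbitrary choice) and compares a fully grown tree with a depth-limited one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumption: 20% of labels are randomly flipped)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for label, depth in [("fully grown", None), ("max_depth=3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for each tree
    results[label] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"{label}: train={results[label][0]:.3f}, test={results[label][1]:.3f}")
```

The fully grown tree memorizes the training set, noise included, so a large gap between its training and test accuracy is the typical sign of overfitting.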

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

# Accuracy
print("Classification Accuracy on Iris dataset:", accuracy_score(y_test, y_pred))


Classification Accuracy on Iris dataset: 1.0


In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predictions
y_pred = reg.predict(X_test)

# Evaluate with Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Regression MSE on Housing dataset:", mse)


Regression MSE on Housing dataset: 0.495235205629094


**Question 6:   Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances**

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Feature importances
print("Feature Importances:", clf.feature_importances_)

Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]


**Question 7:  Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.**

In [6]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train fully-grown decision tree (no max_depth limit)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Train decision tree with max_depth=3
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Print results
print("Accuracy of Fully-grown Decision Tree:", accuracy_full)
print("Accuracy of Decision Tree with max_depth=3:", accuracy_limited)


Accuracy of Fully-grown Decision Tree: 1.0
Accuracy of Decision Tree with max_depth=3: 1.0
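
Both trees score 1.0 here, but the test split holds only 30 samples, so the comparison is coarse. A possible refinement, sketched below under the assumption that 5-fold cross-validation on the full dataset is acceptable, averages accuracy over several splits.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

cv_scores = {}
for label, depth in [("fully grown", None), ("max_depth=3", 3)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # One accuracy value per fold
    cv_scores[label] = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{label}: mean={cv_scores[label].mean():.3f}, std={cv_scores[label].std():.3f}")
```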


**Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances**

In [7]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset (replacement for Boston Housing)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predictions
y_pred = reg.predict(X_test)

# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Feature importances
print("\nFeature Importances:")
for feature, importance in zip(housing.feature_names, reg.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


**Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy**

In [8]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Define parameter grid for tuning
param_grid = {
    "max_depth": [2, 3, 4, 5, None],   # try shallow to deep trees
    "min_samples_split": [2, 3, 4, 5, 10]  # min samples required to split
}

# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=5,            # 5-fold cross-validation
    scoring="accuracy"
)

# Fit grid search
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Best model
best_model = grid_search.best_estimator_

# Evaluate accuracy on test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with Best Parameters:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy with Best Parameters: 1.0


**Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.**

Step-by-Step Process

**1. Handle Missing Values**

Check missingness: Use .isnull().sum() to see which features have missing values.

Strategy:

For numerical features → fill with mean/median (e.g., blood pressure, age).

For categorical features → fill with mode (most frequent category, e.g., gender = Male/Female).

If a feature has too many missing values, consider dropping it if not critical.
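
The imputation strategy above might look like the following sketch. It uses a small invented DataFrame and scikit-learn's SimpleImputer; the column names are assumptions chosen to match the examples in the text.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented example data with missing entries
df = pd.DataFrame({
    "age": [25, 47, np.nan, 36],
    "blood_pressure": [120, np.nan, 130, 128],
    "gender": ["Male", "Female", np.nan, "Female"],
})

# Check missingness per column
print(df.isnull().sum())

num_cols = ["age", "blood_pressure"]
cat_cols = ["gender"]

# Numerical features: fill with the median
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Categorical features: fill with the most frequent category
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

print(df)
```

On a real project the imputers would be fitted on the training split only (for example inside a Pipeline) so that statistics from the test set do not leak into training.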

2. Encode the Categorical Features

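scikit-learn's Decision Trees require numeric input, so text categories must be encoded first.

Options:

One-Hot Encoding (good for nominal features with no inherent order, e.g. “smoker: Yes/No”).

Ordinal (Label) Encoding (if categories have a natural order, e.g., “Stage 1, Stage 2, Stage 3”). In scikit-learn, OrdinalEncoder is meant for features, while LabelEncoder is meant for the target.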

3. Train a Decision Tree Model

Split dataset into train (80%) and test (20%).

Initialize a Decision Tree Classifier:

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

4. Tune Hyperparameters

Use GridSearchCV to find the best parameters:

max_depth (controls tree depth).

min_samples_split (minimum samples to split a node).

criterion (e.g., "gini" or "entropy").

Example:

from sklearn.model_selection import GridSearchCV
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"]
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
best_model = grid.best_estimator_

5. Evaluate Performance

Predict on the test set:

from sklearn.metrics import accuracy_score, classification_report
y_pred = best_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Use metrics:

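Accuracy → overall performance, but it can be misleading when one class is much more common than the other.

Precision/Recall/F1-score → very important in healthcare (e.g., false negatives can be dangerous).

ROC-AUC → measures how well the predicted probabilities rank positive cases above negative ones.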
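
These metrics can be computed as in the sketch below. It assumes scikit-learn's built-in breast cancer dataset as a stand-in for the disease data (the real dataset is not available here) and treats class 1 as the positive class purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary dataset (assumption: class 1 is treated as the positive class)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# Confusion matrix layout for binary labels: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print("Confusion matrix:\n", cm)
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}  (false negatives: {fn})")
print(f"ROC-AUC:   {auc:.3f}")
```

If missed cases are the main concern, the same idea extends to tuning: GridSearchCV accepts scoring="recall" in place of scoring="accuracy".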

Business Value in Real-World Healthcare

Faster diagnosis: Helps doctors identify high-risk patients quickly.

Resource optimization: Prioritize tests for patients likely to have the disease.

Personalized treatment: Different features (like age, lifestyle, genetics) help tailor care plans.

Cost savings: Reduces unnecessary tests for low-risk patients.

Improved patient outcomes: Early detection leads to better treatment success.

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

# -----------------------
# 1. Create Example Dataset (simulated healthcare data)
# -----------------------
data = {
    "age": [25, 47, 52, np.nan, 36, 29, 65, 41, np.nan, 55],
    "gender": ["Male", "Female", "Female", "Male", "Male", np.nan, "Female", "Male", "Female", "Male"],
    "blood_pressure": [120, 140, 130, 135, np.nan, 128, 150, 145, 138, np.nan],
    "smoker": ["Yes", "No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No"],
    "disease": [0, 1, 1, 0, 0, 1, 1, 0, 1, 1]  # Target (0 = No Disease, 1 = Disease)
}

df = pd.DataFrame(data)

# Features and Target
X = df.drop("disease", axis=1)
y = df["disease"]

# -----------------------
# 2. Preprocessing Pipeline
# -----------------------

# Separate categorical and numerical features
categorical_features = ["gender", "smoker"]
numerical_features = ["age", "blood_pressure"]

# Transformers
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
# Note: no imputation step is defined here. DecisionTreeClassifier accepts NaN
# natively only in scikit-learn >= 1.3; on older versions add an imputer first.

# Preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("num", "passthrough", numerical_features)
    ]
)

# -----------------------
# 3. Build Pipeline with Decision Tree
# -----------------------
dt = DecisionTreeClassifier(random_state=42)

pipeline = Pipeline(steps=[("preprocessor", preprocessor),
                           ("classifier", dt)])

# ---
