**Question 1: What is a Decision Tree, and how does it work in the context of
classification?**

- A Decision Tree is a supervised learning algorithm used for classification and regression tasks. It works like a flowchart, where each internal node represents a decision based on a feature, each branch represents an outcome of that decision, and each leaf node represents a final class label or output.

**How It Works (for Classification):**

1. Start with the entire dataset.
    - The algorithm begins at the root node and considers all features.

2. Select the best feature to split the data.
    - It chooses the feature that best separates the classes using a splitting criterion, such as:

      - Gini Impurity

      - Entropy / Information Gain

3. Split the dataset into subsets.
    - The data is divided based on the values of the selected feature.

4. Repeat the process recursively.
    - For each subset, the algorithm again selects the best feature and splits further — forming a tree-like structure.

5. Stop when a stopping condition is met.

    - All data points in a node belong to the same class, or

    - No remaining features to split, or

    - The maximum tree depth is reached.

6. Assign class labels to leaf nodes.
    - Each leaf node gives the final classification output.
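
The six steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how scikit-learn implements trees: it assumes numeric features, binary splits at midpoints between observed values, Gini impurity as the criterion, and a made-up toy dataset. The helper names `gini`, `best_split`, `build_tree`, and `predict` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Step 2: find the (feature, threshold) with the lowest weighted Gini."""
    best = None  # (weighted_impurity, feature_index, threshold)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or weighted < best[0]:
                best = (weighted, f, threshold)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Steps 3-6: split recursively until a stopping condition is met."""
    split = best_split(rows, labels)
    # Stop: pure node, depth limit reached, or no usable split left.
    if len(set(labels)) == 1 or depth == max_depth or split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf = majority class
    _, f, threshold = split
    left_idx = [i for i, row in enumerate(rows) if row[f] <= threshold]
    right_idx = [i for i, row in enumerate(rows) if row[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left_idx],
                           [labels[i] for i in left_idx], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right_idx],
                            [labels[i] for i in right_idx], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the branches from the root down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

# Toy data (made up for illustration): [petal_length, petal_width]
X = [[1.4, 0.2], [1.3, 0.2], [4.7, 1.4], [4.5, 1.5], [6.0, 2.5], [5.9, 2.1]]
y = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

tree = build_tree(X, y)
print(predict(tree, [1.5, 0.3]))  # setosa
print(predict(tree, [6.1, 2.3]))  # virginica
```

Each internal node is stored as a dictionary and each leaf as a plain class label, which mirrors the flowchart description: prediction just walks from the root to a leaf.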



**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?**  

- In a Decision Tree, impurity measures are used to decide which feature to split on at each step. The goal is to choose the feature that creates the purest child nodes — meaning the data in each branch is as homogeneous (single class) as possible.

 - Two commonly used impurity measures are Gini Impurity and Entropy.

1. Gini Impurity

 - Definition:
Gini Impurity measures how often a randomly chosen sample from the dataset would be incorrectly classified if it were labeled according to the distribution of labels in that subset.

 - Formula:

$$
\text{Gini} = 1 - \sum_{i=1}^{c} p_i^2
$$

where

- $c$ = number of classes
- $p_i$ = probability (proportion) of class $i$ in the node

Interpretation:

Gini = 0 → perfectly pure node (only one class)

Gini = 0.5 → maximum impurity for a binary problem (two classes equally mixed); with $c$ classes the maximum is $1 - 1/c$

Example:
If a node has 4 samples: 3 “Yes” and 1 “No”

$$
p(\text{Yes}) = 3/4 = 0.75, \quad p(\text{No}) = 1/4 = 0.25
$$

$$
\text{Gini} = 1 - (0.75^2 + 0.25^2) = 1 - (0.5625 + 0.0625) = 0.375
$$

→ Lower Gini = better split (purer node)
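
The worked example can be checked with a few lines of Python. This is a small sketch; the function name `gini_impurity` and its count-based input are illustrative choices, not part of any library.

```python
def gini_impurity(class_counts):
    """Gini impurity from a list of per-class sample counts."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_impurity([3, 1]))  # 3 "Yes", 1 "No" -> 0.375
print(gini_impurity([4, 0]))  # pure node       -> 0.0
print(gini_impurity([2, 2]))  # 50/50 mix       -> 0.5
```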



2. Entropy (Information Gain)

Definition:
Entropy measures the amount of disorder or uncertainty in the dataset.

Formula:

$$
\text{Entropy} = -\sum_{i=1}^{c} p_i \log_2(p_i)
$$

where

- $p_i$ = probability of class $i$

Interpretation:

Entropy = 0 → pure node

Entropy = 1 → maximum impurity (50-50 split for binary classes)

Example:
Using the same data (3 “Yes”, 1 “No”):

$$
\begin{aligned}
\text{Entropy} &= -(0.75 \log_2 0.75 + 0.25 \log_2 0.25) \\
&= -(0.75 \times (-0.415) + 0.25 \times (-2)) \\
&\approx 0.811
\end{aligned}
$$
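
The same entropy calculation as a short Python sketch. The function name `entropy` and its count-based input are illustrative; classes with zero samples are skipped, following the usual convention that $0 \log_2 0$ is treated as 0.

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) from a list of per-class sample counts."""
    total = sum(class_counts)
    # Skip empty classes: 0 * log2(0) is treated as 0 by convention.
    probs = [count / total for count in class_counts if count > 0]
    return sum(-p * math.log2(p) for p in probs)

print(round(entropy([3, 1]), 3))  # 3 "Yes", 1 "No" -> 0.811
print(entropy([2, 2]))            # 50/50 mix       -> 1.0
print(entropy([4, 0]))            # pure node: zero entropy
```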



How They Impact Splits

At each node, the Decision Tree tries different splits using each feature.

For each split, it calculates the weighted average impurity of the resulting child nodes.

The feature producing the lowest impurity (or highest information gain) is chosen for splitting.

Information Gain (based on Entropy):

$$
\text{Information Gain} = \text{Entropy}_{\text{parent}} - \sum \frac{n_{\text{child}}}{n_{\text{parent}}} \times \text{Entropy}_{\text{child}}
$$

The higher the information gain → the better the split.
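
As a worked illustration, take the node from the examples above (3 "Yes", 1 "No", entropy ≈ 0.811) and assume a hypothetical split that sends 2 "Yes" samples to child A and 1 "Yes" plus 1 "No" to child B. The split itself is made up for this example.

$$
\begin{aligned}
\text{Entropy}_{A} &= 0 \ (\text{pure}), \qquad \text{Entropy}_{B} = 1 \ (\text{50-50 mix}) \\
\text{Information Gain} &= 0.811 - \left(\tfrac{2}{4} \times 0 + \tfrac{2}{4} \times 1\right) = 0.811 - 0.5 = 0.311
\end{aligned}
$$

A competing split with a gain above 0.311 would be preferred over this one.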

**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each**

- In Decision Trees, pruning is a technique used to prevent overfitting — when a model becomes too complex and starts memorizing training data instead of generalizing patterns.

There are two types of pruning methods: Pre-Pruning and Post-Pruning.
1. Pre-Pruning (Early Stopping)

Definition:
Pre-pruning means stopping the tree growth early — before it becomes too deep or complex.

How it works:
The algorithm stops splitting a node if:

The information gain (or impurity reduction) is below a threshold, or

The node size is too small, or

The tree depth reaches a maximum limit, etc.

Example Parameters in scikit-learn:

max_depth

min_samples_split

min_samples_leaf

max_leaf_nodes

Practical Advantage:
 Faster training and simpler model.
Since the tree stops early, it’s less complex and faster to build — suitable for large datasets or real-time predictions.

2. Post-Pruning (Cost Complexity Pruning)

Definition:
Post-pruning means first growing the full tree, and then removing the less important branches that do not contribute much to accuracy.

How it works:

Build the complete tree (possibly overfitted).

Evaluate subtrees using a validation set or cost-complexity measure (α).

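Prune branches whose removal costs little or no validation accuracy.

Example in scikit-learn:

Controlled by ccp_alpha (cost complexity pruning parameter); larger values prune more aggressively.

Practical Advantage:
 Better generalization.
Because the tree is grown fully before being cut back, useful splits deep in the tree are not missed by stopping too early, and only branches that fail to justify their complexity are removed.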
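
Both styles of pruning can be sketched with scikit-learn on the Iris data. This is a hedged illustration: the specific values (`max_depth=3`, `min_samples_leaf=5`) and the choice of the middle alpha from the pruning path are arbitrary assumptions for the demo, and in practice `ccp_alpha` would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Reference: an unpruned tree.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: limit growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute candidate alphas from the fully grown tree ...
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
# ... then refit with one of them (middle value chosen arbitrarily here;
# max() guards against tiny negative round-off in the computed alphas).
alpha = max(0.0, path.ccp_alphas[len(path.ccp_alphas) // 2])
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(
    X_train, y_train
)

for name, model in [("full", full), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name:12s} leaves={model.get_n_leaves():2d}  "
          f"test accuracy={model.score(X_test, y_test):.3f}")
```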

**Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?**

- Information Gain (IG) is a key concept in Decision Trees used to decide which feature to split on at each node.
It measures how much “information” or “purity” is gained after a dataset is split based on a particular feature.

In simple terms:
👉 Information Gain tells us how well a feature separates the classes.
The higher the Information Gain, the better that feature is for splitting.

📘 Definition:

Information Gain is the reduction in entropy (disorder) achieved by splitting the data based on a given feature.

$$
\text{Information Gain} = \text{Entropy}(\text{Parent}) - \sum_{i=1}^{k} \frac{n_i}{n} \times \text{Entropy}(\text{Child}_i)
$$

where:

- $\text{Entropy}(\text{Parent})$: Entropy before the split
- $\text{Entropy}(\text{Child}_i)$: Entropy of each child node after the split
- $n_i$: Number of samples in child node $i$
- $n$: Total number of samples in parent node


**Importance of Information Gain**

Feature Selection:
Helps the tree choose the best feature at each node.

Purity Improvement:
Encourages splits that create homogeneous child nodes.

Model Accuracy:
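Splits with high information gain separate the classes more cleanly, which typically means fewer classification errors; generalization still depends on limiting tree complexity (see pruning in Question 3).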

Core Idea of ID3 Algorithm:
The ID3 (Iterative Dichotomiser 3) Decision Tree algorithm uses Information Gain as its main splitting criterion.
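
To show how Information Gain drives feature selection, here is a minimal ID3-style sketch for categorical features. The toy customer data and the helper names `entropy` and `information_gain` are made up for illustration; it covers only the choice of a single split, not a full ID3 implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy(parent) minus the size-weighted entropy of the children."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted_children = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - weighted_children

# Made-up toy data: will the customer buy?
rows = [
    {"income": "high", "student": "no"},
    {"income": "high", "student": "yes"},
    {"income": "low", "student": "no"},
    {"income": "low", "student": "yes"},
]
labels = ["no", "yes", "no", "yes"]

gains = {f: information_gain(rows, labels, f) for f in ("income", "student")}
best = max(gains, key=gains.get)
print(gains)  # {'income': 0.0, 'student': 1.0}
print(best)   # student
```

Splitting on `student` produces two pure children (gain of 1 bit), while splitting on `income` leaves both children as mixed as the parent (gain of 0), so `student` is chosen.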

**Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?**

- **Common Real-World Applications of Decision Trees**

  -  **Customer Segmentation & Marketing**

      - Used to identify which type of customers are most likely to buy a product.

      - Example: Predict whether a customer will respond to a marketing campaign based on age, income, and past purchases.

  - **Credit Risk Analysis (Finance)**

      - Banks use decision trees to decide whether to approve or reject a loan.

      - Example: Based on income, job stability, previous defaults, etc.

  - **Medical Diagnosis**

      - Used to help doctors classify diseases based on symptoms and medical test results.

      - Example: Predicting whether a tumor is benign or malignant.

  - **Human Resource (HR) Analytics**

      - Predict whether an employee is likely to leave or stay in a company.

      - Example: Based on salary, satisfaction score, and work hours.

  - **Business Decision-Making**

      - Companies use decision trees to simulate possible outcomes of business strategies.

      - Example: Choosing between product launch strategies with different risks and profits.

  - **E-commerce & Recommendation Systems**

      - Used for predicting purchase behavior or recommending items.

      - Example: Classifying users into segments based on their browsing and purchase history.

---

| Advantage                             | Explanation                                                             |
| ------------------------------------- | ----------------------------------------------------------------------- |
| **Easy to interpret**                 | Can be visualized and explained like a flowchart.                       |
| **No need for data normalization**    | Works well without scaling features.                                    |
| **Handles both types of data**        | Works with **categorical and numerical** values.                        |
| **Captures non-linear relationships** | Can split data in a non-linear way.                                     |
| **Feature importance**                | Provides insight into which features are most important for prediction. |





---

| Limitation                           | Explanation                                                                    |
| ------------------------------------ | ------------------------------------------------------------------------------ |
| **Overfitting**                      | Can become too complex and fit noise in the data (especially deep trees).      |
| **Unstable**                         | Small changes in data can change the tree structure drastically.               |
| **Biased towards dominant features** | Features with more unique values tend to be selected first.                    |
| **Less accurate alone**              | Often less accurate compared to ensemble models like Random Forest or XGBoost. |


**Dataset Info:**
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV).

**Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances
(Include your Python code and output in the code box below.)**

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("🌿 Decision Tree Classifier (Gini Criterion)")
print("--------------------------------------------")
print("Accuracy of the model: {:.2f}%".format(accuracy * 100))
print("\nFeature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name:25s} : {importance:.4f}")


🌿 Decision Tree Classifier (Gini Criterion)
--------------------------------------------
Accuracy of the model: 100.00%

Feature Importances:
sepal length (cm)         : 0.0000
sepal width (cm)          : 0.0167
petal length (cm)         : 0.9061
petal width (cm)          : 0.0772


**Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree**

In [7]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fully grown tree
full_tree = DecisionTreeClassifier(criterion='gini', random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Tree with max_depth = 3
limited_tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
y_pred_limited = limited_tree.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

print("Decision Tree Accuracy Comparison")
print("------------------------------------")
print(f"Fully-grown Tree Accuracy   : {accuracy_full * 100:.2f}%")
print(f"Tree with max_depth=3       : {accuracy_limited * 100:.2f}%")


Decision Tree Accuracy Comparison
------------------------------------
Fully-grown Tree Accuracy   : 100.00%
Tree with max_depth=3       : 100.00%
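
Both trees reach the same test accuracy here, and the test split holds only 30 samples, so a single score can hide structural differences. The sketch below reuses the same split and settings as the cell above and prints each tree's depth and leaf count to show what `max_depth=3` actually changes. No expected output is listed because the exact numbers depend on the fitted trees.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

full_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
full_tree.fit(X_train, y_train)

limited_tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)

# Compare structure and training fit, not just test accuracy.
for name, tree in [("Fully-grown", full_tree), ("max_depth=3", limited_tree)]:
    print(f"{name:12s} depth={tree.get_depth()}  leaves={tree.get_n_leaves()}  "
          f"train accuracy={tree.score(X_train, y_train):.3f}")
```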


**Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances**


In [9]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

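# Note: load_boston was removed in scikit-learn 1.2, so the California Housing
# dataset is used here in place of Boston Housing.
data = fetch_california_housing()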
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Decision Tree Regressor Results")
print("----------------------------------")
print("Mean Squared Error (MSE): {:.4f}".format(mse))
print("\nFeature Importances:")
for feature_name, importance in zip(data.feature_names, regressor.feature_importances_):
    print(f"{feature_name:20s} : {importance:.4f}")


Decision Tree Regressor Results
----------------------------------
Mean Squared Error (MSE): 0.4952

Feature Importances:
MedInc               : 0.5285
HouseAge             : 0.0519
AveRooms             : 0.0530
AveBedrms            : 0.0287
Population           : 0.0305
AveOccup             : 0.1308
Latitude             : 0.0937
Longitude            : 0.0829


**Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy**

In [10]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(criterion='gini', random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Decision Tree Hyperparameter Tuning (GridSearchCV)")
print("----------------------------------------------------")
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy: {:.2f}%".format(grid_search.best_score_ * 100))
print("Test Set Accuracy: {:.2f}%".format(accuracy * 100))


Decision Tree Hyperparameter Tuning (GridSearchCV)
----------------------------------------------------
Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Best Cross-Validation Accuracy: 94.17%
Test Set Accuracy: 100.00%


**Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting**

1) Handle missing values

Understand the missingness — check patterns and type:

Missing Completely At Random (MCAR), Missing At Random (MAR), Missing Not At Random (MNAR).

Use .isnull().mean() per column, cross-tab missingness vs target, and visualize with heatmaps/missingness matrices.

Simple rules first

If a feature is > 60–80% missing, consider dropping it (unless clinically important).

If a row has very many missing values, consider dropping that row.

Imputation strategies (use pipelines to avoid leakage):

Numeric: median (robust), mean, or model-based (KNN, IterativeImputer/MICE) if relationships exist.

Categorical: new category "Missing" or most-frequent.

Indicator flags: add is_<feature>_missing boolean columns — they often help the model.

Special handling for MNAR / clinically meaningful missingness: missing might itself be predictive (e.g., a test not ordered). Treat missingness as a feature.

Always fit imputers on training data only (use Pipeline/ColumnTransformer so transforms are learned without leaking test info).
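
A minimal sketch of train-only imputation with missingness flags, using scikit-learn's `SimpleImputer` with `add_indicator=True`. The tiny arrays are made up for illustration; a real project would wrap this in a `Pipeline`/`ColumnTransformer` as described above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric data with gaps (np.nan marks a missing value).
X_train = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 40.0],
])
X_test = np.array([[np.nan, 30.0]])

# Medians are learned from the training data only; add_indicator appends
# one 0/1 column per feature that had missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)  # reuses the training medians

print(X_train_imp.shape)  # (4, 4): 2 imputed features + 2 indicator columns
print(X_test_imp)         # [[ 3. 30.  1.  0.]]
```

The test row's missing first feature is filled with the training median (3.0), and its indicator column is set to 1 so the tree can still split on "was missing".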

2) Encode categorical features

Assess cardinality:

Low cardinality (≤ ~10 unique): One-Hot Encoding (use OneHotEncoder(handle_unknown='ignore')).

High cardinality: target encoding / frequency encoding / hashing — but avoid target leakage; use methods that encode inside cross-validation folds or use TargetEncoder with CV.

Rare categories: group rare levels into "Other" to avoid overfitting.

Ordinal variables: use OrdinalEncoder only if order is meaningful.

Use ColumnTransformer to apply different encoders to different columns inside a scikit-learn pipeline.
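
Grouping rare categories can be sketched in a few lines of plain Python. The helper `group_rare`, the `min_count` threshold, and the sample values are hypothetical; in a real pipeline the category counts should be computed on the training data only and then applied to validation and test data.

```python
from collections import Counter

def group_rare(values, min_count=2, other="Other"):
    """Replace categories seen fewer than min_count times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Made-up categorical column for illustration.
blood_types = ["A", "A", "O", "O", "O", "B", "AB"]
print(group_rare(blood_types))
# ['A', 'A', 'O', 'O', 'O', 'Other', 'Other']
```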

3) Train a Decision Tree model

Baseline model: fit a vanilla DecisionTreeClassifier(criterion='gini', random_state=42) on processed data to get a baseline.

Address class imbalance (common in disease detection):

Use class_weight='balanced' in the tree, or

Resampling: SMOTE / ADASYN (for training folds only), or

Use appropriate thresholds and cost matrices.

Use cross-validation (stratified) to estimate performance (StratifiedKFold).

4) Tune hyperparameters

Important hyperparameters for DecisionTreeClassifier:

max_depth, min_samples_split, min_samples_leaf, max_features, ccp_alpha (cost-complexity pruning), criterion.

Search strategy:

Use RandomizedSearchCV for large search spaces, GridSearchCV for smaller ones. Use cv=5 stratified folds.

Optionally use nested CV to avoid optimistic bias when reporting final performance.

Example grid:

max_depth: [3, 5, 8, 12, None]

min_samples_split: [2, 5, 10, 20]

min_samples_leaf: [1, 2, 4, 8]

max_features: [None, 'sqrt', 'log2']

ccp_alpha: [0.0, 0.001, 0.01, 0.1]

Metric to optimize: choose based on business need, e.g., recall (sensitivity) if missing a disease is costly, or a balanced metric like F1 if precision and recall matter equally.

5) Evaluate performance (beyond accuracy)

Primary metrics (healthcare focus):

Recall (Sensitivity): fraction of actual positives detected — often prioritized.

Precision: proportion of predicted positives that are true — important to limit unnecessary followups.

F1-score: harmonic mean of precision & recall.

ROC-AUC for overall ranking performance.

PR-AUC (preferred in imbalanced settings).

Confusion matrix — always inspect to see tradeoffs (FP vs FN).

Calibration — check predicted probabilities (reliability). Use calibration plots and CalibratedClassifierCV if needed.

Decision thresholds: tune probability threshold to meet business constraints (e.g., achieve ≥90% sensitivity with acceptable precision).

Clinical utility / cost analysis: translate FP/FN into expected costs (e.g., cost of missed disease vs cost of extra tests) and choose operating point that maximizes net benefit (decision curve analysis).

Robustness checks:

Evaluate on hold-out test set (never used in training/tuning).

Test on subgroups (age, gender, location) to check fairness and performance drift.

Perform bootstrap or repeated CV to estimate variance.

Explainability: visualize the tree (sklearn.tree.plot_tree) and provide feature importances. For deeper insight use SHAP to explain individual predictions — crucial in healthcare.
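
The decision-threshold step above can be sketched in plain Python: pick the highest probability threshold that still meets a recall target, which keeps false positives as low as that constraint allows. The helper names, the 0.90 target, and the validation labels and probabilities are all made up for illustration; in practice the probabilities would come from `predict_proba` on a validation set.

```python
def recall_precision(y_true, y_prob, threshold):
    """Recall and precision when predicting positive for prob >= threshold."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_prob) if p < threshold and t == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

def pick_threshold(y_true, y_prob, min_recall=0.90):
    """Highest threshold whose recall meets the target (None if none does)."""
    for threshold in sorted(set(y_prob), reverse=True):
        recall, precision = recall_precision(y_true, y_prob, threshold)
        if recall >= min_recall:
            return threshold, recall, precision
    return None

# Made-up validation labels (1 = disease) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.6, 0.35, 0.4, 0.3, 0.2, 0.1, 0.05, 0.7]

threshold, recall, precision = pick_threshold(y_true, y_prob)
print(f"threshold={threshold}  recall={recall:.2f}  precision={precision:.2f}")
# threshold=0.35  recall=1.00  precision=0.83
```

Here one positive case has a low predicted probability (0.35), so reaching the recall target forces the threshold down to 0.35 and admits one false positive.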



7) Example (compact sklearn pipeline)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

num_cols = [...]        # list numeric cols
cat_cols = [...]        # list categorical cols

num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())          # optional for tree but OK
])

cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
    # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preproc = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

pipe = Pipeline([
    ('preproc', preproc),
    ('clf', DecisionTreeClassifier(random_state=42, class_weight='balanced'))
])

param_grid = {
    'clf__max_depth': [3,5,8,None],
    'clf__min_samples_leaf': [1,2,4],
    'clf__ccp_alpha': [0.0, 0.001, 0.01]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring='recall', n_jobs=-1)
grid.fit(X_train, y_train)

