### Decision Tree

1. What is a Decision Tree, and how does it work in the context of
classification?
- A decision tree is a flowchart-like model that splits data into subsets using feature-based yes/no questions and makes its prediction at a leaf node. For classification, it recursively chooses the feature and threshold that best separate the classes according to a criterion such as Gini decrease or information gain, creating internal decision nodes and branches until a stopping criterion is met (max depth, min samples, or pure leaves). At inference, a sample traverses the tree by following the rule at each node, and the leaf’s majority class (or the class probabilities derived from its class counts) becomes the prediction. Pruning and regularization (depth limits, min samples per split/leaf) help prevent overfitting, while ensembles such as Random Forests and Gradient Boosted Trees improve accuracy and robustness.
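
The traversal step can be sketched in plain Python. This is only an illustration: the nested-dict layout, the `predict` helper, the thresholds, and the class counts are all hypothetical and are not how scikit-learn stores a fitted tree.

```python
# Hypothetical hand-built tree over two features: [petal_length, petal_width].
# Internal nodes ask "feature <= threshold?"; leaves hold training class counts.
tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"counts": {"setosa": 40, "versicolor": 0, "virginica": 0}},
    "right": {
        "feature": 1, "threshold": 1.7,
        "left": {"counts": {"setosa": 0, "versicolor": 38, "virginica": 3}},
        "right": {"counts": {"setosa": 0, "versicolor": 1, "virginica": 38}},
    },
}

def predict(node, sample):
    """Follow the split rules down to a leaf; return (majority class, class probabilities)."""
    while "counts" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    counts = node["counts"]
    total = sum(counts.values())
    probs = {label: n / total for label, n in counts.items()}
    return max(counts, key=counts.get), probs

label, probs = predict(tree, [4.8, 1.5])
print(label, probs)
```

The sample goes right at the root (4.8 > 2.5) and left at the second node (1.5 <= 1.7), so the prediction comes from that leaf's class counts.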

2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
- Gini impurity and entropy both quantify how mixed the class labels are within a node. Gini impurity is the probability of misclassifying a randomly drawn sample if it were labeled at random according to the node’s class frequencies: Gini = 1 − Σ p(k)^2. It is 0 for a pure node and largest when the classes are evenly mixed, and it is slightly cheaper to compute than entropy because it needs no logarithm.

- Entropy from information theory measures uncertainty: Entropy = −Σ p(k) log2 p(k), and information gain is the reduction in entropy after a split.

- In decision trees, the algorithm evaluates candidate splits and selects the one that yields the largest weighted impurity reduction (highest Gini decrease or highest information gain), producing purer child nodes. The two criteria usually select the same or very similar splits; Gini is computationally simpler and is the default criterion in scikit-learn’s DecisionTreeClassifier.
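
To make the two formulas concrete, here is a minimal sketch that evaluates both on a few two-class nodes. The helper names `gini` and `entropy` and the example counts are illustrative, not part of any library.

```python
from math import log2

def gini(counts):
    """Gini = 1 - sum(p_k^2), computed from per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy = -sum(p_k * log2(p_k)); empty classes contribute nothing."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

# From a pure node to a perfectly balanced one.
for counts in ([10, 0], [9, 1], [7, 3], [5, 5]):
    print(counts, f"gini={gini(counts):.3f}", f"entropy={entropy(counts):.3f}")
```

Both measures are 0 for the pure node and peak at the balanced node, where Gini is 0.5 and entropy is 1 bit.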

3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
- Pre-pruning halts tree growth during training using constraints such as max depth, min samples per split/leaf, or min impurity decrease, so overly specific splits never form. Its practical advantage is faster training and inference, with a lower risk of overfitting on small datasets.

- Post-pruning first grows a large tree and then trims back branches based on validation performance or a complexity penalty (e.g., reduced error pruning, or cost-complexity pruning with parameter α), collapsing subtrees that don’t improve generalization. Its practical advantage is typically a better accuracy/complexity trade-off, because pruning decisions are made after seeing the whole tree rather than by local stopping rules.
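
A minimal sketch of both approaches with scikit-learn on the Iris data. The specific constraint values and the choice of the middle alpha on the pruning path are illustrative assumptions; in practice alpha would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Pre-pruning: growth stops as soon as a constraint would be violated.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, compute its cost-complexity path,
# then refit with a chosen alpha so weak subtrees are collapsed.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
# Illustrative choice only: the middle alpha on the path (clipped at 0).
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("pre-pruned", pre), ("full", full), ("post-pruned", post)]:
    print(f"{name}: nodes={model.tree_.node_count}, "
          f"test acc={model.score(X_test, y_test):.4f}")
```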

4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
- Information Gain is the reduction in impurity (uncertainty) achieved by splitting a node, typically measured using entropy: IG(parent, split) = Entropy(parent) − weighted sum of Entropy(children), where each child is weighted by its share of the parent’s samples. It quantifies how much a feature/threshold improves class purity, so the tree chooses the split with the highest information gain to separate the classes most effectively. Greedily choosing the highest-gain split tends to produce purer child nodes and shorter trees, though it does not by itself guarantee good generalization, and information gain is biased toward features with many distinct values.
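
A short worked example of the formula, using made-up class counts for a parent node with 5 samples of each class. The helper names and the three candidate splits are illustrative.

```python
from math import log2

def entropy(counts):
    """Entropy in bits from per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy(parent) minus the sample-weighted entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [5, 5]
split_a = [[4, 1], [1, 4]]  # mostly separates the classes
split_b = [[5, 0], [0, 5]]  # separates them perfectly
split_c = [[3, 3], [2, 2]]  # children are as mixed as the parent

for name, split in [("a", split_a), ("b", split_b), ("c", split_c)]:
    print(f"split {name}: IG = {information_gain(parent, split):.3f}")
```

Split b recovers the full 1 bit of parent entropy, split a gains about 0.278 bits, and split c gains nothing, so a tree builder would pick b.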

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
- Decision trees are widely used in real-world applications such as medical diagnosis (classifying diseases), credit scoring, customer churn prediction, fraud detection, and decision support systems. Their main advantages are interpretability (the logic reads as a set of if/then rules), the ability to work with both numerical and categorical data (though some implementations, including scikit-learn’s, require categoricals to be encoded first), and minimal preprocessing, since no feature scaling is needed. However, they overfit easily, especially on noisy data, and are sensitive to small changes in the training data, which can produce very different trees. Because splits are axis-aligned and predictions are piecewise constant, a single tree also approximates smooth or linear relationships only coarsely and typically underperforms ensembles such as Random Forests or Gradient Boosted Trees.

In [2]:
""" 6. Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances """

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = DecisionTreeClassifier(
    criterion="gini",
    random_state=42
)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)

importances = clf.feature_importances_

print(f"Accuracy: {acc:.4f}")
for name, imp in sorted(zip(feature_names, importances), key=lambda x: x[1], reverse=True):
    print(f"{name}: {imp:.4f}")

Accuracy: 0.9333
petal length (cm): 0.5586
petal width (cm): 0.4060
sepal width (cm): 0.0292
sepal length (cm): 0.0062


In [4]:
""" 7. Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree. """

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf_depth3 = DecisionTreeClassifier(
    criterion="gini",
    max_depth=3,
    random_state=42
).fit(X_train, y_train)

clf_full = DecisionTreeClassifier(
    criterion="gini",
    random_state=42
).fit(X_train, y_train)

acc_depth3 = accuracy_score(y_test, clf_depth3.predict(X_test))
acc_full = accuracy_score(y_test, clf_full.predict(X_test))

print(f"Accuracy (max_depth=3): {acc_depth3:.4f}")
print(f"Accuracy (fully-grown): {acc_full:.4f}")

train_acc_depth3 = accuracy_score(y_train, clf_depth3.predict(X_train))
train_acc_full = accuracy_score(y_train, clf_full.predict(X_train))
print(f"Train Accuracy (max_depth=3): {train_acc_depth3:.4f}")
print(f"Train Accuracy (fully-grown): {train_acc_full:.4f}")

Accuracy (max_depth=3): 0.9667
Accuracy (fully-grown): 0.9333
Train Accuracy (max_depth=3): 0.9833
Train Accuracy (fully-grown): 1.0000


In [5]:
""" 8. Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances """

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target
feature_names = X.columns.tolist()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg = DecisionTreeRegressor(
    random_state=42
)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

importances = reg.feature_importances_
sorted_importances = sorted(zip(feature_names, importances), key=lambda x: x[1], reverse=True)

print(f"MSE: {mse:.4f}")
print("Feature importances:")
for name, imp in sorted_importances:
    print(f"{name}: {imp:.4f}")

MSE: 0.4952
Feature importances:
MedInc: 0.5285
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829
AveRooms: 0.0530
HouseAge: 0.0519
Population: 0.0305
AveBedrms: 0.0287


In [6]:
""" 9. Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy """

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = DecisionTreeClassifier(random_state=42)
param_grid = {
    "max_depth": [None, 2, 3, 4, 5, 6, 8, 10],
    "min_samples_split": [2, 3, 4, 5, 8, 10]
}

grid = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
    refit=True,
)
grid.fit(X_train, y_train)

best_clf = grid.best_estimator_
y_pred = best_clf.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)

print("Best parameters:", grid.best_params_)
print(f"CV Best Score (mean accuracy): {grid.best_score_:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")

Best parameters: {'max_depth': None, 'min_samples_split': 2}
CV Best Score (mean accuracy): 0.9417
Test Accuracy: 0.9333


10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

- Handle missing values: audit missingness; impute numerics with the median and add “was_missing” flags; impute categoricals with the most frequent value or a “Missing” category; keep imputation inside a pipeline fitted on training data only to avoid leakage.
- Encode categoricals: one-hot encode low-cardinality features (handle_unknown="ignore"); for high-cardinality features consider target or ordinal encoding with leakage-safe (cross-validated) schemes.
- Train model: build a Pipeline with a ColumnTransformer (numeric imputer + categorical encoder) feeding a DecisionTreeClassifier (use class_weight="balanced" if the classes are imbalanced); use a stratified train/validation/test split.
- Tune hyperparameters: cross-validate over max_depth, min_samples_split, min_samples_leaf, max_features, and min_impurity_decrease using randomized or grid search, optimizing a business-relevant metric (e.g., recall or F2).
- Evaluate: report the confusion matrix, precision/recall/F1, ROC-AUC and PR-AUC; calibrate probabilities if risk scores will be thresholded; check subgroup fairness and stability.
- Business value: earlier risk flagging for targeted diagnostics, better resource prioritization, interpretable decision support for clinicians, and population-level monitoring that can reduce costs and improve outcomes.
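
The steps above can be sketched as a single scikit-learn pipeline. This is a minimal sketch under stated assumptions, not a production workflow: the data is synthetic, the column names (`age`, `bmi`, `smoker`) and the label rule are hypothetical, and the small grid optimizes recall only as an example of a business-relevant metric.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; every column name is hypothetical.
rng = np.random.default_rng(0)
n = 400
age = rng.normal(55, 12, n)
bmi = rng.normal(27, 4, n)
smoker = rng.choice(["yes", "no"], n)
y = ((age > 60) | ((bmi > 30) & (smoker == "yes"))).astype(int)

df = pd.DataFrame({"age": age, "bmi": bmi, "smoker": smoker})
df["smoker"] = df["smoker"].astype(object)
df.loc[rng.random(n) < 0.1, "bmi"] = np.nan     # inject missing numerics
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan  # inject missing categoricals

# Imputation and encoding live inside the pipeline, so each CV fold
# fits them on its own training portion only (no leakage).
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median", add_indicator=True), ["age", "bmi"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)
search = GridSearchCV(
    model,
    param_grid={"tree__max_depth": [2, 3, 5, None],
                "tree__min_samples_leaf": [1, 5, 20]},
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

Keeping preprocessing and the classifier in one `Pipeline` means the held-out test split is transformed with statistics learned from the training split only.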