
## Question 1: What is a Decision Tree, and how does it work in the context of classification?

**Answer:**

A Decision Tree is a supervised machine learning model that maps observations about an item to conclusions about its target value. In classification tasks, the target is a discrete label (e.g., species of iris, disease vs no-disease). The structure of a decision tree resembles a flowchart: internal nodes represent tests on features (e.g., "sepal length > 5.5?"), branches represent outcomes of the test, and leaf nodes represent class labels or class distributions.

**How it works (step-by-step):**
1. **Root node creation:** The algorithm examines all features and possible splits on those features to find the split that best separates the classes according to a chosen impurity measure (like Gini or Entropy).
2. **Recursive splitting:** Once the best split is chosen, the dataset is partitioned into subsets — one per branch — and the process repeats on each subset. This recursion continues until a stopping criterion is met (pure nodes, minimal samples, or maximum depth).
3. **Leaf prediction:** When splitting stops, each leaf node holds the class most frequent among the training samples that reached that leaf. For probabilistic predictions, the leaf can provide class probabilities based on the class distribution at that leaf.
4. **Prediction for new samples:** To classify a new instance, traverse the tree from the root, following the branch whose condition matches the instance, until a leaf is reached. The leaf’s majority class (or its class probabilities) becomes the model’s prediction.
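
The traversal in step 4 can be sketched in a few lines of plain Python. The tree below is hand-written for illustration: the feature names, thresholds, and leaf counts are assumptions, not the output of a fitted model.

```python
# Hypothetical hand-built tree: internal nodes test one feature against a
# threshold; leaves store illustrative per-class training counts.
tree = {
    "feature": "petal_length", "threshold": 2.45,
    "left": {"counts": {"setosa": 50}},
    "right": {
        "feature": "petal_width", "threshold": 1.75,
        "left": {"counts": {"versicolor": 49, "virginica": 5}},
        "right": {"counts": {"versicolor": 1, "virginica": 45}},
    },
}

def find_leaf(node, sample):
    """Walk from the root to a leaf, going left when feature <= threshold."""
    while "counts" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node

def predict(node, sample):
    """Return the majority class of the leaf the sample reaches."""
    counts = find_leaf(node, sample)["counts"]
    return max(counts, key=counts.get)

def predict_proba(node, sample):
    """Return class probabilities from the leaf's class distribution."""
    counts = find_leaf(node, sample)["counts"]
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

sample = {"petal_length": 4.8, "petal_width": 1.5}
print(predict(tree, sample))        # versicolor
print(predict_proba(tree, sample))  # versicolor: 49/54, virginica: 5/54
```

Prediction cost is proportional to the depth of the tree, since each sample follows exactly one root-to-leaf path.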

**Key properties and intuition:**
- Decision trees are **non-parametric**: they make no assumption about the data distribution.
- They are **interpretable**: the sequence of decisions (feature thresholds) leading to a prediction can be easily visualized and explained.
- Trees can **handle both numerical and categorical** features (categorical handling may require encoding depending on implementation).
- They are **invariant to strictly monotonic transformations** of individual features (e.g., a log transform of a positive feature) because a split depends only on the ordering of the feature values, not on their magnitudes.
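
A quick way to check the invariance property is to fit the same tree on raw and log-transformed features. This is a minimal sketch using scikit-learn and the bundled Iris data; `max_depth=2` and `random_state=0` are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # all Iris features are strictly positive
X_log = np.log(X)                  # strictly increasing transform of every feature

tree_raw = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
tree_log = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_log, y)

# The thresholds differ numerically, but the training samples are
# partitioned the same way, so the training predictions agree.
same = np.array_equal(tree_raw.predict(X), tree_log.predict(X_log))
print("Identical training predictions:", same)
```

The check is made on the training samples on purpose. Thresholds are placed between observed values, so an unseen point that falls in the gap between two training values could, in principle, land on different sides in the two trees.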

**When they shine:** When interpretability matters, when relationships between features and outcomes are hierarchical or piecewise-constant, and when mixed-type features are present.


---


## Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

**Answer:**

Decision trees use impurity measures to evaluate how well a split separates the classes. Two common measures are **Gini Impurity** and **Entropy (Information Entropy)**.

**Gini Impurity**
- For a node with class probabilities \(p_1, p_2, ..., p_k\), Gini impurity is defined as:  
  $$ G = 1 - \sum_{i=1}^{k} p_i^2 $$
- Interpretation: it is the probability that a randomly chosen sample from the node would be misclassified if it were labeled randomly according to the class distribution in that node.
- Range: \(0\) (pure node) to \((1 - 1/k)\) for k classes (max impurity when classes are uniformly distributed).
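
As a small worked example of the formula above, Gini impurity can be computed directly from a list of labels. The helper name `gini` is chosen for this sketch.

```python
from collections import Counter

def gini(labels):
    """Gini impurity G = 1 - sum_i p_i^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["a"] * 10))              # pure node: 0.0
print(gini(["a"] * 5 + ["b"] * 5))   # two balanced classes: 0.5 = 1 - 1/2
print(gini(["a", "b", "c"] * 4))     # three balanced classes: about 0.667 = 1 - 1/3
```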

**Entropy**
- Entropy is defined as:  
  $$ H = -\sum_{i=1}^{k} p_i \log_2 p_i $$
- Interpretation: it quantifies the uncertainty or disorder in the class distribution. A pure node has entropy 0; a maximally mixed node has higher entropy.
- Entropy is rooted in information theory and measures the expected number of bits needed to encode the class label.
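
The entropy formula can be checked the same way. This sketch uses the equivalent form \(H = \sum_i p_i \log_2 (1/p_i)\); the helper name `entropy` is an assumption of the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits, written as sum_i p_i * log2(1 / p_i)."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count) for count in Counter(labels).values())

print(entropy(["a"] * 4))                # pure node: 0.0 bits
print(entropy(["a", "b"] * 3))           # two balanced classes: 1.0 bit
print(entropy(["a", "b", "c", "d"] * 2)) # four balanced classes: 2.0 bits
```

For k equally likely classes the entropy is \(\log_2 k\), which matches the "bits needed to encode the label" reading.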

**Impact on splits**
- At each candidate split, the algorithm computes the **weighted impurity** of the child nodes and chooses the split that **minimizes** the weighted impurity (equivalently, maximizes impurity reduction).
- **Information Gain** is often used with entropy; it measures the decrease in entropy after the split. With Gini, the analogous concept is Gini gain (impurity reduction).
- In practice, Gini and Entropy often lead to similar trees. Gini is slightly faster to compute and tends to isolate the most frequent class in a node, while entropy can be marginally more sensitive to changes in class probabilities.
- Choice of impurity can slightly affect which feature is chosen at nodes, but the overall predictive performance is typically comparable. For most tasks, using the library default (Gini in many implementations) is acceptable unless fine-grained control or interpretability by information-theoretic reasoning is desired.
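
To make the split-selection rule concrete, here is a minimal sketch of an exhaustive threshold search on one numeric feature using weighted Gini impurity. The function names and the toy data are assumptions for illustration; production implementations are considerably more optimized.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values.

    Returns (threshold, weighted impurity); threshold is None if no
    candidate improves on the parent impurity.
    """
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no threshold fits between equal values
        t = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (t, score)
    return best

values = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
labels = ["a", "a", "a", "b", "b", "a"]
print(best_threshold(values, labels))  # threshold 2.5, weighted Gini about 0.222
```

The parent impurity here is about 0.444, so the chosen split at 2.5 halves it. A full tree builder would repeat this search over every feature and recurse into the two children.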

**Practical note:** Impurity reduction is measured on the training data and only for the split at hand, so a split that looks best locally is not guaranteed to generalize. Treat the impurity criterion as a hyperparameter and compare the alternatives with a validation set or cross-validation.

---


## Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

**Answer:**

Decision trees can easily overfit the training data by creating very deep trees that capture noise. Pruning strategies control complexity and improve generalization.

**Pre-Pruning (Early Stopping)**
- **What it is:** Pre-pruning stops the tree-growing process early by imposing constraints during training. Examples: setting `max_depth`, `min_samples_split`, `min_samples_leaf`, or `max_leaf_nodes`.
- **Mechanism:** During recursive splitting, the algorithm checks stopping criteria; if they are met, it refrains from splitting further even if impurity can still be reduced.
- **Practical advantage:** **Computational efficiency and simplicity.** Pre-pruning reduces model training time and memory because the tree never grows beyond specified limits. It is straightforward to implement and tune when you have limited computational resources or when you want an immediately interpretable shallow tree.
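
A minimal sketch of pre-pruning with scikit-learn, assuming the bundled breast-cancer dataset as a stand-in. The specific limits (`max_depth=3`, `min_samples_leaf=5`) are illustrative, not recommended defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Unrestricted tree versus a tree whose growth is stopped early.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre_pruned)]:
    print(
        f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
        f"test accuracy={model.score(X_test, y_test):.4f}"
    )
```

Comparing depth and leaf count alongside accuracy shows how much structure the constraints remove, which helps when deciding whether they are too strict.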

**Post-Pruning (Prune After Full Growth)**
- **What it is:** Post-pruning first grows a full (or very large) tree, then prunes back branches that do not provide sufficient predictive power on validation data. Methods include reduced-error pruning and cost-complexity pruning (a.k.a. weakest link pruning or CCP).
- **Mechanism:** Evaluate subtrees using validation set or by optimizing a complexity-penalized objective (e.g., minimize training loss + alpha * number_of_leaves). Remove branches that increase generalization error or do not justify their complexity.
- **Practical advantage:** **Often yields better generalization.** Post-pruning examines the actual contribution of branches and can remove splits that only fit noise. It tends to produce a simpler tree without prematurely discarding potentially useful structure.
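
Cost-complexity pruning is available in scikit-learn through `cost_complexity_pruning_path` and the `ccp_alpha` parameter. The sketch below, again assuming the bundled breast-cancer data, picks the alpha with the best accuracy on a held-out validation split; the split sizes and seeds are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Candidate alphas at which the fully grown tree loses a subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

results = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    results.append((tree.score(X_val, y_val), alpha, tree.get_n_leaves()))

best_score, best_alpha, best_leaves = max(results, key=lambda r: r[0])
print(f"best alpha={best_alpha:.5f}, leaves={best_leaves}, validation accuracy={best_score:.4f}")
```

Because the validation split is used to choose alpha, a separate test set (or nested cross-validation) is needed for an unbiased final estimate.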

**Summary comparison**
- Pre-pruning is faster and simpler but risks underfitting if constraints are too strict.
- Post-pruning is more thorough and often achieves better bias-variance trade-offs but requires extra computation (validation set or cross-validation) and is more complex to implement.

---


## Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

**Answer:**

**Definition:** Information Gain (IG) is a metric that quantifies the reduction in impurity (uncertainty) achieved by partitioning a dataset based on a particular feature. In the entropy framework:

$$
IG(parent, split) = H(parent) - \sum_{children} \frac{N_{child}}{N_{parent}} H(child)
$$

where \(H\) is entropy and \(N\) the number of samples.
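
A small worked example of the formula, with helper names chosen for this sketch: a split that separates the classes perfectly recovers all of the parent's entropy, while a split whose children have the same class mix as the parent gains nothing.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """H(parent) minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["yes"] * 5 + ["no"] * 5                 # H(parent) = 1 bit
perfect = [["yes"] * 5, ["no"] * 5]               # both children pure
useless = [["yes"] * 2 + ["no"] * 2,              # children as mixed as the parent
           ["yes"] * 3 + ["no"] * 3]

print(information_gain(parent, perfect))  # 1.0
print(information_gain(parent, useless))  # 0.0
```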

**Why it matters:**
1. **Quantifies usefulness of a split:** IG measures how well a given feature separates the classes. The higher the IG, the more informative the split.
2. **Guides greedy selection:** Decision tree algorithms use IG (or analogous impurity reductions like Gini gain) to greedily pick the best split at each node. This local decision-making is what builds the hierarchical structure of the tree.
3. **Balances purity and partition size:** Because IG uses weighted child entropies, it accounts for both the purity of child nodes and the number of samples in each child. A split that isolates a tiny pure subset but leaves a large impure remainder may not have as high IG as a balanced split.
4. **Interpretability and feature importance:** Summed or averaged IG across splits involving a feature can provide an estimate of that feature’s importance in the learned tree.

**Limitations of Information Gain:**
- IG can be biased toward features with many possible values (e.g., ID-like features). To counteract this, alternatives like **Information Gain Ratio** (used in C4.5) adjust IG by split information.
- IG is a *greedy* local measure and does not consider future splits; hence, the globally optimal tree may not be achieved.
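
The bias toward many-valued features can be seen on a toy example. The sketch below compares an ID-like feature (one sample per branch) with an imperfect binary feature; the data and function names are assumptions, and `gain_ratio` follows the usual definition of IG divided by split information.

```python
import math
from collections import Counter

def entropy(items):
    n = len(items)
    return sum((c / n) * math.log2(n / c) for c in Counter(items).values())

def information_gain(parent, children):
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

def gain_ratio(parent, children):
    """IG divided by split information (entropy of the branch sizes)."""
    branch_of_each_sample = [i for i, ch in enumerate(children) for _ in ch]
    split_info = entropy(branch_of_each_sample)
    return information_gain(parent, children) / split_info if split_info > 0 else 0.0

parent = ["yes"] * 8 + ["no"] * 8
id_split = [[label] for label in parent]                        # 16 branches of size 1
binary_split = [["yes"] * 7 + ["no"], ["no"] * 7 + ["yes"]]     # 2 branches, slightly impure

for name, split in [("ID-like", id_split), ("binary", binary_split)]:
    print(f"{name}: IG={information_gain(parent, split):.3f}, "
          f"gain ratio={gain_ratio(parent, split):.3f}")
```

Plain IG prefers the ID-like feature (1.000 versus about 0.456) even though it cannot generalize, while the gain ratio prefers the binary feature (about 0.456 versus 0.250).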

In practice, IG is a principled and widely-used criterion that transforms the abstract goal of improved predictability into a quantifiable objective during tree construction.

---


## Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

**Answer:**

**Applications**
- **Healthcare & Medical Diagnosis:** Predict disease presence from symptoms, lab tests; decision trees map well to clinical decision rules.
- **Finance & Credit Scoring:** Determine whether to approve loans using applicant features (income, employment history).
- **Marketing & Customer Segmentation:** Classify customers likely to churn or respond to campaigns.
- **Fraud Detection:** Flag transactions that are likely fraudulent based on transaction features.
- **Manufacturing & Quality Control:** Classify defective vs non-defective products using sensor measurements.
- **Rule Extraction & Compliance:** Derive explicit decision rules for regulatory audits.

**Advantages**
1. **Interpretability:** Trees provide if-then rules that stakeholders (including non-technical) can understand.
2. **No need for heavy preprocessing:** They handle mixed feature types, do not require feature scaling, and can work with missing values (in some implementations).
3. **Fast prediction:** Once trained, tree traversal is efficient and predictable in latency.
4. **Implicit feature selection:** Important features appear near the root; uninformative features are rarely used.

**Limitations**
1. **Overfitting risk:** Trees can grow deep and model noise. Pruning and parameter tuning are needed to avoid overfitting.
2. **Instability:** Small changes in data can yield very different trees (high variance).
3. **Bias toward dominant classes or features:** Without balancing or careful splitting criteria, trees can favor majority classes.
4. **Greedy learning:** The top-down greedy split selection may miss globally optimal trees.
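5. **Poor extrapolation for regression:** A regression tree predicts a constant value per leaf, so its output is piecewise-constant, cannot fall outside the range of target values seen in training, and is less smooth than that of parametric models.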

**Mitigations**
- Use ensembles (Random Forests, Gradient Boosting) to reduce variance and improve accuracy. An ensemble is no longer a single readable tree, but feature importances and partial dependence plots recover part of the interpretability.
- Apply pruning, cross-validation, and domain-informed feature engineering to build robust trees.


---

## Question 6
Load the Iris dataset, train a Decision Tree Classifier using the Gini criterion, and print the model's accuracy and feature importances.

We'll load Iris, split into train/test, train a DecisionTreeClassifier(criterion='gini'), and report accuracy and feature importances.

In [1]:
# Question 6 - Code
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_iris()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train Decision Tree with Gini
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Evaluate
pred = model.predict(X_test)
acc = accuracy_score(y_test, pred)

print(f"Test accuracy: {acc:.4f}")
print("Feature importances:")
for name, importance in zip(data.feature_names, model.feature_importances_):
    print(f"  {name}: {importance:.4f}")

Test accuracy: 0.9333
Feature importances:
  sepal length (cm): 0.0000
  sepal width (cm): 0.0286
  petal length (cm): 0.5412
  petal width (cm): 0.4303


## Question 7
Train a Decision Tree Classifier with `max_depth=3` and compare its accuracy to a fully-grown tree.

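A shallow tree (`max_depth=3`) can generalize better than a fully grown one; comparing test accuracies shows whether the unrestricted tree is overfitting. The test set here has only 45 samples, so a small accuracy gap corresponds to one or two samples and should be read as indicative rather than conclusive.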

In [2]:
# Question 7 - Code
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Full (unrestricted) tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_tree.predict(X_test))

# Restricted tree
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
limited_acc = accuracy_score(y_test, limited_tree.predict(X_test))

print(f"Full tree test accuracy: {full_acc:.4f}")
print(f"Max-depth=3 tree test accuracy: {limited_acc:.4f}")

Full tree test accuracy: 0.9333
Max-depth=3 tree test accuracy: 0.9778


## Question 8
Load the California Housing dataset, train a Decision Tree Regressor, and print the Mean Squared Error (MSE) and feature importances.

`load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so we use `fetch_california_housing()` instead.

In [3]:
# Question 8 - Code
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

california = fetch_california_housing()
Xb, yb = california.data, california.target

# Split data (re-using train_test_split imported earlier)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.3, random_state=42)

reg = DecisionTreeRegressor(random_state=42)
reg.fit(Xb_train, yb_train)

pred_b = reg.predict(Xb_test)
mse = mean_squared_error(yb_test, pred_b)

print(f"Mean Squared Error (MSE): {mse:.4f}")
print("Feature importances:")
for name, importance in zip(california.feature_names, reg.feature_importances_):
    print(f"  {name}: {importance:.4f}")

Mean Squared Error (MSE): 0.5280
Feature importances:
  MedInc: 0.5235
  HouseAge: 0.0521
  AveRooms: 0.0494
  AveBedrms: 0.0250
  Population: 0.0322
  AveOccup: 0.1390
  Latitude: 0.0900
  Longitude: 0.0888


## Question 9
Tune the Decision Tree's `max_depth` and `min_samples_split` using GridSearchCV and print the best parameters and resulting model accuracy.

We'll perform a small grid search with 5-fold cross-validation on the full Iris dataset and report the best parameters together with their mean cross-validation accuracy.

In [4]:
# Question 9 - Code
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

params = {
    'max_depth': [2,3,4,5,None],
    'min_samples_split': [2,3,4,5]
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print(f"Best cross-validation accuracy: {grid.best_score_:.4f}")

Best parameters: {'max_depth': 3, 'min_samples_split': 2}
Best cross-validation accuracy: 0.9733



## Question 10: Practical workflow for a healthcare classification problem (predicting disease)  
**Provide detailed step-by-step process to handle missing values, encode categorical features, train a Decision Tree model, tune hyperparameters, evaluate performance, and describe business value. (10 marks)**

**1. Problem understanding & data audit**  
- Identify target variable (disease: yes/no) and features (demographics, vitals, labs, imaging metadata).  
- Check class balance, typical ranges, missingness patterns, and data types. Visualize distributions and correlations.

**2. Handle missing values**  
- **Quantify missingness:** compute missingness per column and by patient.  
- **Missing completely at random (MCAR) vs MAR vs MNAR:** attempt to diagnose the mechanism. If missingness correlates with target or other features, be careful.  
- **Imputation strategies:**  
  - Numerical: use domain-aware imputation, such as the median (robust to outliers) or KNN/model-based imputation that draws on correlated features.  
  - Categorical: impute with a separate category ("Missing") or mode; consider learning-based imputation for complex patterns.  
  - If a feature has extremely high missing rate (e.g., >60–80%) consider dropping it unless clinically important.  
  - Preserve missingness indicators (binary flag) for features where missingness itself is informative.
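
A minimal sketch of these imputation strategies with scikit-learn's `SimpleImputer`. The tiny arrays are made-up values for illustration; in a real workflow the imputer is fitted on the training fold only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric features: median imputation plus binary missingness flags.
X_num = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan],
                  [100.0, 30.0]])
num_imputer = SimpleImputer(strategy="median", add_indicator=True)
X_num_imp = num_imputer.fit_transform(X_num)
print(X_num_imp)  # columns: feature 0, feature 1, missing flag 0, missing flag 1

# Categorical feature: treat missing as its own category.
X_cat = np.array([["A"], [np.nan], ["B"]], dtype=object)
cat_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
X_cat_imp = cat_imputer.fit_transform(X_cat)
print(X_cat_imp.ravel())
```

The median of the first column is 3.0 despite the outlier 100.0, which is why the median is often preferred over the mean for skewed clinical measurements.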

**3. Encode categorical features**  
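- For Decision Trees, **ordinal (label) encoding** works well for genuinely ordered categories, because threshold splits respect the order. For nominal categories it imposes an arbitrary order: implementations such as scikit-learn treat the encoded integers as ordinary numbers and split on thresholds, not on equality checks, so isolating one category may take several splits and the result depends on how the codes were assigned.  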
- **One-Hot Encoding** is safe and interpretable, but increases dimensionality; use it for low-cardinality features.  
- For high-cardinality categorical features, consider target encoding or frequency encoding with appropriate cross-validation to avoid leakage.
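
The two basic encodings can be sketched with scikit-learn as follows. The feature names and category values are hypothetical examples.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

blood_type = np.array([["A"], ["B"], ["O"], ["A"]])                  # nominal
severity = np.array([["mild"], ["severe"], ["moderate"], ["mild"]])  # ordered

# One-hot: one binary column per category; unseen categories become all zeros.
onehot = OneHotEncoder(handle_unknown="ignore")
blood_encoded = onehot.fit_transform(blood_type).toarray()
print(onehot.categories_[0], blood_encoded[0])
print(onehot.transform([["AB"]]).toarray())  # category not seen during fit

# Ordinal: pass the order explicitly so the integer codes are meaningful.
ordinal = OrdinalEncoder(categories=[["mild", "moderate", "severe"]])
severity_encoded = ordinal.fit_transform(severity)
print(severity_encoded.ravel())
```

Passing `categories` explicitly matters for ordinal features: without it the encoder sorts categories alphabetically, which would not reflect clinical severity.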

**4. Feature engineering & scaling**  
- Trees don't require feature scaling; normalization is not necessary.  
- Create clinically-relevant features (ratios, flags, binned ages), interaction terms if meaningful, and temporal aggregations for time-series data.

**5. Train/Test split and cross-validation**  
- Use stratified split to maintain class balance in train/test.  
- Use K-fold or stratified K-fold cross-validation during model selection to estimate generalization performance reliably.

**6. Train Decision Tree model**  
- Start with a baseline DecisionTreeClassifier (no or mild constraints). Use `class_weight='balanced'` if classes are imbalanced.  
- Monitor training vs validation performance to detect overfitting.

**7. Hyperparameter tuning**  
- Tune `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features`, and `ccp_alpha` (cost-complexity pruning).  
- Use `GridSearchCV` or `RandomizedSearchCV` with stratified folds and scoring metrics aligned to business needs (e.g., `f1`, `recall` for disease detection).  
- Use pipeline objects (`sklearn.pipeline.Pipeline`) to chain preprocessing (imputation, encoding) and model training to avoid leakage.
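
Steps 2 to 7 can be combined into a single leakage-safe pipeline. The sketch below uses synthetic data: the column names (`age`, `glucose`, `smoker`), the rule that generates the labels, and the parameter grid are all assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data (hypothetical columns and label rule).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "glucose": rng.normal(110, 25, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
})
y = ((df["glucose"] > 120) | ((df["age"] > 65) & (df["smoker"] == "yes"))).astype(int)

# Inject missing values after the labels are defined.
df.loc[rng.random(n) < 0.10, "glucose"] = np.nan
df.loc[rng.random(n) < 0.05, "smoker"] = np.nan

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median", add_indicator=True), ["age", "glucose"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=0)),
])

param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
    "tree__ccp_alpha": [0.0, 0.01],
}

# Imputation and encoding are re-fitted inside every fold, so no statistics
# leak from the validation part of a fold into training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="recall")
search.fit(df, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated recall: {search.best_score_:.4f}")
```

Scoring on `recall` reflects the assumption that missed disease cases are the costlier error; another metric can be substituted if the clinical priorities differ.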

**8. Evaluation metrics**  
- For healthcare, **sensitivity/recall** (catching true positives) and **specificity** are crucial. Also report **precision**, **F1-score**, **ROC AUC**, and confusion matrix.  
- Use calibration plots (probability calibration) because predicted probabilities may be used for risk stratification.  
- If costs differ (false negative more costly than false positive), incorporate them into thresholding or use cost-sensitive learning.
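
The effect of the decision threshold on these metrics can be illustrated with a handful of made-up labels and predicted probabilities (not model output).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Illustrative ground truth (1 = disease) and predicted probabilities.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.05])

def report(threshold):
    """Sensitivity, specificity, and precision at a given probability cut-off."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

print("threshold 0.5:", report(0.5))  # sensitivity 0.50, specificity about 0.83
print("threshold 0.3:", report(0.3))  # sensitivity 1.00, specificity about 0.67
print("ROC AUC:", roc_auc_score(y_true, y_prob))  # 0.875, threshold-independent
```

Lowering the threshold from 0.5 to 0.3 catches every positive case at the price of more false alarms, which is the usual trade-off when false negatives are the costlier error.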

**9. Model validation and robustness checks**  
- Validate on an external hold-out or temporal split to assess performance over time.  
- Perform subgroup analysis (different age groups, genders, clinical sites) to ensure fairness.  
- Conduct permutation importance and SHAP analysis for interpretability and to detect spurious associations.

**10. Deployment and monitoring**  
- Export model along with preprocessing pipeline. Validate on live data before full deployment.  
- Monitor model drift and recalibrate or retrain periodically. Track performance metrics and data distribution shifts.

**11. Ethical, legal and privacy considerations**  
- Ensure compliance with applicable healthcare and data-protection regulations (e.g., HIPAA, GDPR, or local equivalents).  
- Document model limitations, perform bias audits, and keep clinicians in the loop for acceptance.

**Business value:**  
A reliable disease-prediction model can enable early detection, prioritize high-risk patients for timely intervention, reduce unnecessary tests, optimize resource allocation, and improve patient outcomes. It also supports clinicians with decision support, potentially lowering costs and improving throughput when integrated into clinical workflows.
