# Decision tree


1.  What is a Decision Tree, and how does it work in the context of
classification?
  
      -> A Decision Tree is a supervised machine learning algorithm used for both classification and regression. It works like a flowchart: decisions are made step by step by splitting the dataset into smaller subsets based on feature values.

   i. Root Node Selection

The tree starts with all the data at the root.
The algorithm chooses the feature (and, for numeric features, the threshold) that best separates the classes.
“Best” is decided using measures like:
Gini Index
Entropy / Information Gain
Chi-square

   ii. Splitting

The dataset is divided into subsets based on the chosen feature’s values.
Example: If the feature is “Weather” → branches could be “Sunny,” “Rainy,” “Cloudy.”

   iii. Recursive Partitioning

For each subset, the process is repeated:
choose the best feature → split further → create new nodes.

   iv. Stopping Criteria
  Splitting continues until:

  All samples in a node belong to the same class, or

  No further improvement can be made, or
The tree reaches a maximum depth or minimum samples per node.

   v. Prediction

To classify a new observation, it is passed through the tree:

Start at root → follow branches according to feature values → reach a leaf node → predict the class.
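
The prediction step can be sketched in a few lines of plain Python. The tree below is hand-built purely for illustration (it is not learned from data), and the feature names, branch values, and class labels are assumptions that extend the "Weather" example above.

```python
# A hand-built toy tree (hypothetical, not learned from data).
# Internal nodes test one feature; leaves carry the predicted class.
toy_tree = {
    "feature": "weather",
    "branches": {
        "Sunny": {
            "feature": "humidity",
            "branches": {
                "High": {"leaf": "Stay in"},
                "Normal": {"leaf": "Play"},
            },
        },
        "Cloudy": {"leaf": "Play"},
        "Rainy": {"leaf": "Stay in"},
    },
}


def predict(node, sample):
    """Follow branches from the root until a leaf is reached."""
    while "leaf" not in node:
        value = sample[node["feature"]]
        node = node["branches"][value]
    return node["leaf"]


print(predict(toy_tree, {"weather": "Sunny", "humidity": "Normal"}))  # Play
print(predict(toy_tree, {"weather": "Rainy", "humidity": "High"}))    # Stay in
```

Note that scikit-learn builds binary trees (each node tests one feature against a threshold), so a learned tree would look different from this multi-way sketch, but the traversal idea is the same.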


2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

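   -> Gini Impurity

 The probability of misclassifying a randomly chosen sample if it were labeled at random according to the class distribution in the node. For class proportions p_k, Gini = 1 - Σ p_k²; it is 0 for a pure node and largest when the classes are evenly mixed.

   Entropy

  Measures the disorder (uncertainty) in a node's class labels.
  Comes from information theory: Entropy = -Σ p_k log2(p_k), which is also 0 for a pure node.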

When building a decision tree, the algorithm must decide which feature to split on at each step.

A “good” split is one that creates pure subsets (nodes where most samples belong to a single class).

To measure this purity/impurity, metrics like Gini Impurity and Entropy are used.

When building the tree:

For each candidate feature → compute the Gini/Entropy of each subset after the split, weighted by subset size.

Choose the split that reduces impurity the most.
This ensures splits make nodes purer, moving toward classification certainty.
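
As a rough illustration of how the two measures rank candidate splits, here is a minimal sketch in plain Python. The function names and the toy labels are made up for this example; libraries such as scikit-learn compute the same quantities internally.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the classes in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def weighted_impurity(children, measure):
    """Impurity after a split: child impurities weighted by child size."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * measure(child) for child in children)


parent = ["yes", "yes", "yes", "no", "no", "no"]
split_a = [["yes", "yes", "yes"], ["no", "no", "no"]]   # perfectly pure children
split_b = [["yes", "no", "yes"], ["no", "yes", "no"]]   # children still mixed

print(gini(parent), entropy(parent))        # 0.5 1.0
print(weighted_impurity(split_a, gini))     # 0.0
print(weighted_impurity(split_b, gini))     # higher than split_a, so split_a wins
```

Both measures agree here that split_a is better. In practice they usually pick similar splits; Gini avoids the logarithm and is slightly cheaper to compute.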


3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

   -> Pruning in Decision Trees

Decision trees tend to grow deep and complex, perfectly fitting training data but generalizing poorly (overfitting).

Pruning reduces tree complexity by stopping splits or removing unnecessary branches.

There are two types: Pre-pruning and Post-pruning.

   i. Pre-Pruning (Early Stopping)

Stop the tree from growing too deep during training.

Use conditions like:

   Maximum depth of the tree (max_depth)

   Minimum samples required to split (min_samples_split)

   Minimum samples at a leaf (min_samples_leaf)

   Minimum impurity decrease (min_impurity_decrease)

   Practical Advantage:
   Faster training & smaller trees → useful in real-time applications where speed and memory efficiency matter (e.g., quick classification in mobile apps).

   ii. Post-Pruning (Cost-Complexity Pruning / Reduced Error Pruning)

   First, grow the full tree (possibly overfitted).

   Then prune back branches that do not improve accuracy on a validation set.

  Approaches:

   Reduced Error Pruning: Remove a branch if accuracy on the validation set does not drop.

  Cost-Complexity Pruning (CCP): Balance accuracy vs tree size with a penalty on the number of leaves (ccp_alpha in scikit-learn).

Practical Advantage:
   Better generalization → since the tree first explores all splits, pruning ensures only meaningful branches remain (e.g., credit scoring models where interpretability + accuracy is important).
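
The two pruning styles can be compared with a short scikit-learn sketch. The dataset (Iris), the split, and the specific limits (`max_depth=3`, `min_samples_leaf=5`) are assumptions chosen only for illustration. Picking `ccp_alpha` on a validation set, as done here, is one simple approach; cross-validation is a common alternative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Pre-pruning: constrain growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree, then get candidate alphas for
# cost-complexity pruning and refit with each one.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)   # guard against tiny negative round-off
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:          # ties go to the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned),
                   ("post-pruned", post_pruned)]:
    print(f"{name:12s} leaves={tree.get_n_leaves():2d} "
          f"val accuracy={tree.score(X_val, y_val):.3f}")
```

Because the validation set is used to choose alpha, the accuracy printed for the post-pruned tree is optimistic; a separate test set would be needed for an unbiased estimate.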


4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

   -> Information Gain (IG) is a metric used to decide which feature to split on at each step in a decision tree.

     It measures the reduction in uncertainty (entropy) about the class labels after splitting the dataset using a feature: IG = Entropy(parent) - Σ (n_child / n_parent) × Entropy(child), summed over the child nodes.

    Interpretation

     If a feature perfectly separates the classes → Entropy after split = 0 → IG is maximum (equal to the parent's entropy).

     If a feature does not help separate classes → IG = 0 (no reduction in uncertainty).
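
The two extreme cases described above can be checked numerically. This is a minimal sketch in plain Python; the helper names and the four-row toy dataset are invented for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Parent entropy minus the size-weighted entropy of the children."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    n = len(labels)
    children = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - children


labels = ["play", "play", "stay", "stay"]
weather = ["Sunny", "Sunny", "Rainy", "Rainy"]   # separates the classes perfectly
windy = ["yes", "no", "yes", "no"]               # carries no information

print(information_gain(weather, labels))   # 1.0
print(information_gain(windy, labels))     # 0.0
```

With two balanced classes the parent entropy is 1 bit, so the perfect split recovers all of it and the uninformative split recovers none.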

     i. Guides splitting:

     The decision tree chooses the feature with the highest IG because it creates the purest child nodes.

     ii. Prevents random splits:

     Without IG (or a similar metric such as Gini), the tree would have no principled way to avoid splitting on irrelevant features.

     iii. Improves accuracy:

     Choosing the most informative split at each node generally separates the classes in fewer splits, although this greedy choice is not guaranteed to be globally optimal.

     iv. Aids interpretability:

     Features chosen near the root (with high IG) are the most informative for the data at that node, which makes the model easier to explain.


5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

     -> Common Real-World Applications of Decision Trees

     Healthcare & Medical Diagnosis

     Predicting whether a patient has a certain disease based on symptoms, age, lifestyle factors.

     Example: Classify patients as “high risk” vs. “low risk” for heart disease.

     Finance & Banking

    Credit risk assessment: Approve or reject loan applications.

   Fraud detection: Classify transactions as fraudulent or legitimate.

   Marketing & Customer Analytics

  Customer segmentation (e.g., “likely to buy” vs. “not likely”).

   Churn prediction (whether a customer will leave a service).

   Operations & Supply Chain

Demand forecasting (predicting product demand).

Inventory decision-making.

Human Resources

Employee attrition prediction (will an employee stay or quit?).

Resume screening and candidate selection.

Retail & E-commerce

Recommendation systems (products based on user behavior).

Pricing strategy optimization.

Manufacturing & Engineering

Quality control (defective vs. non-defective products).

Fault detection in machines.

  Main Advantages of Decision Trees

Easy to Understand & Interpret

Looks like a flowchart → managers and non-technical users can interpret it.

Handles Both Numeric & Categorical Data

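Works well with mixed types of features, although some implementations (including scikit-learn) require categorical features to be numerically encoded first.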

Non-linear Relationships

Can model complex decision boundaries without requiring linear assumptions.

Requires Little Data Preparation

No need for feature scaling or normalization.

Fast for Prediction

Once built, classifying a new case is quick (just follow the path).

 Main Limitations of Decision Trees

Overfitting

Trees can grow too deep and memorize training data.

Controlled with pruning or ensemble methods (Random Forests, Gradient Boosted Trees).

Instability

Small changes in the training data → a very different tree.

Bias Toward Features with Many Levels

Features with more categories may appear artificially “better” (higher information gain).

Not Always the Most Accurate

Alone, they’re often outperformed by ensemble methods (Random Forests, XGBoost).

Greedy Nature

Splits are chosen locally at each node; may not yield the globally optimal tree.
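
The bias toward features with many levels listed above is easy to reproduce. In this sketch (plain Python, invented data), a unique row identifier gets the maximum possible information gain even though it cannot generalize to new rows. The column names `smoker` and `patient_id` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())


labels = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]
# A weakly useful two-level feature (hypothetical values).
smoker = ["y", "y", "y", "n", "n", "n", "y", "n"]
# A unique ID per row: useless for new patients, but every child node is pure.
patient_id = list(range(len(labels)))

print(f"IG(smoker)     = {information_gain(smoker, labels):.3f}")
print(f"IG(patient_id) = {information_gain(patient_id, labels):.3f}")  # 1.000
```

This is one reason identifier-like columns are normally dropped before training, and why high-cardinality features deserve extra scrutiny when they rank highly.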


6. Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances.

In [1]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train a Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# 4. Make predictions
y_pred = clf.predict(X_test)

# 5. Print the model’s accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# 6. Print feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Model Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


7. Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

In [2]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train a Decision Tree with max_depth=3
shallow_tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
shallow_tree.fit(X_train, y_train)

# 4. Train a fully-grown Decision Tree (no depth limit)
full_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
full_tree.fit(X_train, y_train)

# 5. Predictions
y_pred_shallow = shallow_tree.predict(X_test)
y_pred_full = full_tree.predict(X_test)

# 6. Accuracy comparison
acc_shallow = accuracy_score(y_test, y_pred_shallow)
acc_full = accuracy_score(y_test, y_pred_full)

print("Decision Tree (max_depth=3) Accuracy:", acc_shallow)
print("Decision Tree (Fully-grown) Accuracy:", acc_full)


Decision Tree (max_depth=3) Accuracy: 1.0
Decision Tree (Fully-grown) Accuracy: 1.0


8. Write a Python program to:

● Load the California Housing dataset from sklearn

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances.

In [3]:
# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# 4. Predictions
y_pred = regressor.predict(X_test)

# 5. Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# 6. Print feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Mean Squared Error (MSE): 0.5280096503174904

Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.0250
Population: 0.0322
AveOccup: 0.1390
Latitude: 0.0900
Longitude: 0.0888


9. Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV

● Print the best parameters and the resulting model accuracy

In [4]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Define the Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# 4. Define the parameter grid for tuning
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 10]
}

# 5. Apply GridSearchCV
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid,
                           cv=5, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train, y_train)

# 6. Best parameters and accuracy
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate on test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print("Test Set Accuracy:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Test Set Accuracy: 1.0


10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world
setting.

  
   -> First things to do (quick data triage)

   i. Inspect dataset size, column types, and target balance (class frequencies).

   ii. Check the missingness pattern (MCAR / MAR / MNAR) and whether missingness correlates with the target.

   iii. Check whether the data are time-ordered (EHR/time series); if so, use time-aware splitting.

   iv. Decide the primary business metric early (recall/sensitivity vs. precision vs. ROC-AUC), since this drives imputation, thresholding, and tuning.

   1) Handling missing values

  Explore missingness

  Per-column % missing, pairwise missingness patterns, and whether missingness predicts the label.

  Decide strategy by cause & column type

  Drop rows only if missingness is rare; drop a column only if it is mostly missing or carries no useful signal.

  Numeric features

  Simple: median (robust to outliers) or mean (if symmetric).

  Advanced: KNNImputer or IterativeImputer (MICE) when values are correlated.

  Categorical features

  Replace with a new category "MISSING" or the mode.

For high-cardinality, consider grouping rare levels to "Other" before imputation.

Missingness as signal

If missingness itself is informative (e.g., a test only ordered for sicker patients), create a binary missing indicator column.

Avoid leakage

Always fit imputers only on training data inside a pipeline (use SimpleImputer/IterativeImputer in an sklearn Pipeline or ColumnTransformer) so validation/test data are not used to compute imputation statistics.

Temporal data

For repeated measures, use forward/backfill or model-based imputation that respects time order.
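
Median imputation, the missing indicator, and the "fit on training data only" rule from this section can be combined in one small sketch. The arrays are toy values invented for the example; `add_indicator=True` is scikit-learn's built-in way to append missingness flags.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up): column 0 has gaps, column 1 is complete.
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, 30.0]])
X_test = np.array([[np.nan, 40.0]])

# add_indicator=True appends a 0/1 column for each feature that had
# missing values when the imputer was fitted.
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train)                 # statistics come from training rows only

X_test_imputed = imputer.transform(X_test)
# The gap is filled with the training median (2.0) and flagged with a 1.
print(X_test_imputed)
```

The test row is filled with the median learned from the training rows, so no information from the test set leaks into the preprocessing.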

2) Encoding categorical features

Decision trees don’t require scaled inputs, but they do require numeric encoding of categories for sklearn.

Strategies:

One-Hot Encoding (safe, interpretable): good for low-to-moderate cardinality. Use OneHotEncoder(handle_unknown='ignore') in a pipeline.

Ordinal / Label Encoding: okay only for truly ordinal features. For nominal features, label encoding can accidentally inject order; avoid unless tree library supports categorical split natively.

Target / Leave-One-Out Encoding: powerful for high-cardinality features (e.g., ICD codes). Important: to avoid leakage, compute encodings inside cross-validation folds, or use smoothing and out-of-fold target statistics (use the category_encoders package or implement fold-based encoding).

Frequency / Count encoding: simple and often effective for high-cardinality.

Implement using ColumnTransformer so numeric and categorical pipelines are separate.
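
Of the strategies listed, frequency encoding is simple enough to sketch in plain Python. The helper names and the diagnosis codes below are hypothetical; the key point is that the mapping is learned from training data and unseen categories get a fallback value.

```python
from collections import Counter


def fit_frequency_encoder(train_values):
    """Learn each category's relative frequency from training data only."""
    counts = Counter(train_values)
    total = len(train_values)
    return {category: count / total for category, count in counts.items()}


def transform(values, mapping, unseen=0.0):
    """Replace categories by learned frequencies; unseen ones get `unseen`."""
    return [mapping.get(value, unseen) for value in values]


# Hypothetical diagnosis codes.
train_codes = ["I10", "E11", "I10", "J45", "I10", "E11", "I10", "I10"]
test_codes = ["I10", "J45", "Z99"]   # "Z99" never appeared in training

mapping = fit_frequency_encoder(train_codes)
print(transform(test_codes, mapping))   # [0.625, 0.125, 0.0]
```

A single numeric column replaces what would otherwise be one one-hot column per code, which keeps the tree's input small for high-cardinality features.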

3) Train a Decision Tree (practical recipe)

Basic pipeline (sketch): keeps preprocessing and model together, which avoids leakage:

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Define your numeric and categorical column names here
numeric_cols = ['placeholder_numeric_col_1', 'placeholder_numeric_col_2'] # Replace with your actual numeric column names
categorical_cols = ['placeholder_categorical_col_1', 'placeholder_categorical_col_2'] # Replace with your actual categorical column names

# Define your transformers for numeric and categorical features
# Example: Impute missing numeric values with the median
numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median'))])

# Example: Impute missing categorical values with a constant and then one-hot encode
cat_transformer = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='MISSING')),
                            ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', cat_transformer, categorical_cols),
])

clf_pipeline = Pipeline([
    ('pre', preprocessor),
    ('clf', DecisionTreeClassifier(random_state=42, class_weight='balanced'))
])

# You would then fit this pipeline to your data:
# clf_pipeline.fit(X_train, y_train)

4) Hyperparameter tuning

Use GridSearchCV or RandomizedSearchCV wrapped around the full pipeline (preprocessing + model) to avoid leakage.

Example parameter grid (tweak ranges as needed):

In [7]:
param_grid = {
    'clf__max_depth': [3, 5, 7, 10, None],
    'clf__min_samples_split': [2, 5, 10, 20],
    'clf__min_samples_leaf': [1, 2, 5, 10],
    'clf__criterion': ['gini','entropy'],
    'clf__ccp_alpha': [0.0, 0.001, 0.01]  # pruning
}
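
To show how a grid like this is wrapped around the whole pipeline, here is a self-contained sketch on synthetic data. The column names (`age`, `bmi`, `smoker`), the label rule, the reduced grid, and the choice of `scoring="recall"` are all assumptions made for illustration; real data and a metric agreed with clinicians would replace them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: column names and values are invented.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "bmi": rng.normal(27, 5, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)
df.loc[rng.choice(n, 20, replace=False), "bmi"] = np.nan      # inject gaps
df.loc[rng.choice(n, 20, replace=False), "smoker"] = np.nan

numeric_cols, categorical_cols = ["age", "bmi"], ["smoker"]
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="MISSING")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
pipeline = Pipeline([
    ("pre", preprocessor),
    ("clf", DecisionTreeClassifier(random_state=42, class_weight="balanced")),
])

# A reduced grid to keep the example fast; the "clf__" prefix routes
# each parameter to the pipeline step named "clf".
param_grid = {
    "clf__max_depth": [3, 5, None],
    "clf__min_samples_leaf": [1, 5, 10],
    "clf__ccp_alpha": [0.0, 0.01],
}
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="recall",
)
search.fit(df, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated recall: {search.best_score_:.3f}")
```

Because the imputers and encoder live inside the pipeline, they are refitted on the training part of every fold, so the cross-validated score is not contaminated by the held-out rows.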


5) Evaluation & interpretability

Metrics

Confusion matrix (TP, FP, FN, TN).

Recall (sensitivity) — critical if missing a disease is costly.

Precision — important if follow-up tests are expensive/harmful.

F1 — balanced precision/recall.

ROC AUC and PR AUC (PR AUC is usually more informative for imbalanced problems).

Calibration (do predicted probabilities match true probabilities?) — use calibration plots and CalibratedClassifierCV if needed.
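
The relationship between the confusion matrix and precision, recall, and F1 can be made concrete with a small plain-Python sketch. The labels below are made up; in practice `sklearn.metrics` provides equivalent functions.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 = disease present."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Derive the three metrics from the counts, guarding against 0/0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)              # 3 1 1 5
print(precision, recall, f1)       # 0.75 0.75 0.75
```

Accuracy here would be 0.8, which looks better than the 0.75 recall; that gap is why recall is tracked separately when missed cases are costly.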

Threshold tuning

Tune decision threshold using validation data to optimize expected business cost (e.g., tradeoff FN vs FP).
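
One way to tune the threshold is to scan candidate values and pick the one with the lowest total error cost on validation data. The sketch below is plain Python; the scores, the labels, and the 10:1 cost ratio are assumptions for illustration and would come from the model and from clinical or business stakeholders in practice.

```python
def expected_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Total cost of errors when flagging patients with score >= threshold."""
    cost = 0.0
    for truth, score in zip(y_true, scores):
        flagged = score >= threshold
        if truth == 1 and not flagged:
            cost += cost_fn        # missed disease
        elif truth == 0 and flagged:
            cost += cost_fp        # unnecessary follow-up
    return cost


# Made-up validation labels and predicted probabilities.
y_val = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

# Assumed costs: a missed case is ten times as costly as a false alarm.
COST_FN, COST_FP = 10.0, 1.0

thresholds = [i / 20 for i in range(1, 20)]
best = min(thresholds,
           key=lambda t: expected_cost(y_val, scores, t, COST_FN, COST_FP))

print("Chosen threshold:", best)
print("Cost at 0.5:   ", expected_cost(y_val, scores, 0.5, COST_FN, COST_FP))
print("Cost at chosen:", expected_cost(y_val, scores, best, COST_FN, COST_FP))
```

With false negatives weighted heavily, the chosen threshold falls below the default 0.5: the model accepts more false alarms in exchange for catching the low-scoring positive case.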

Interpretability

feature_importances_ (impurity-based; quick, but can be misleading, e.g. it tends to favor high-cardinality features).

Permutation importance (more reliable).

SHAP or LIME for per-prediction explanations — very useful in clinical settings to show why a patient was flagged.
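
The difference between impurity-based and permutation importance can be seen side by side with scikit-learn's `permutation_importance`. Iris is used here only as a convenient stand-in dataset; `n_repeats=20` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and record the accuracy drop.
result = permutation_importance(
    clf, X_test, y_test, n_repeats=20, random_state=42
)

for name, impurity_imp, perm_imp in zip(
    iris.feature_names, clf.feature_importances_, result.importances_mean
):
    print(f"{name:18s} impurity={impurity_imp:.3f} permutation={perm_imp:.3f}")
```

Permutation importance is measured on held-out data, so it reflects what the model actually relies on for generalization rather than how often a feature was used during training.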

Robustness checks

Subgroup performance (age, sex, ethnicity) — check fairness and disparate impact.

Sensitivity to missingness and imputation choices.

Test on an external holdout or prospective cohort if possible.

Statistical confidence

Report confidence intervals for metrics (bootstrapping or CV-based intervals).
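
A percentile bootstrap is one simple way to attach an interval to a test-set metric. The sketch below uses only the standard library; the function name, the number of resamples, and the made-up predictions are assumptions for illustration.

```python
import random


def bootstrap_ci(y_true, y_pred, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric computed on paired labels."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample patients with replacement, keeping label/prediction pairs.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up test labels and predictions (40 patients, 34 predicted correctly).
y_true = [1] * 15 + [0] * 25
y_pred = [1] * 12 + [0] * 3 + [0] * 22 + [1] * 3

low, high = bootstrap_ci(y_true, y_pred, accuracy)
print(f"accuracy = {accuracy(y_true, y_pred):.3f}, "
      f"95% CI ~ [{low:.3f}, {high:.3f}]")
```

With only 40 patients the interval is wide, which is useful context when comparing models whose point estimates differ by a few percentage points.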

6) Deployment, monitoring & governance

Human-in-the-loop: deliver predictions as decision support, not autonomous decisions — clinicians should review flagged cases.

Pilot / prospective validation: test model in real workflow before full rollout (A/B test or prospective cohort) to measure clinical impact and false alarm rates.

Monitoring: track model performance over time (drift detection), track input feature distributions and label distribution shifts, trigger retraining when performance drops.

Logging & auditability: store predictions, inputs, clinician overrides, and outcomes for post-hoc analysis and regulatory compliance.

Privacy & compliance: ensure HIPAA-compliant storage and processing of EHR data; follow local regulations.

Explainability & acceptance: produce short human-friendly explanations with each prediction; involve clinicians in threshold setting.

7) Business value (what this model can deliver)

Early detection / triage: prioritize patients for further testing or early intervention → could reduce morbidity and downstream costs.

Resource optimization: route scarce diagnostic tests and specialist time to the most likely-positive patients.

Cost savings: a properly tuned model with a low false-positive rate can reduce unnecessary tests or admissions.

Operational efficiency: faster screening, reduced clinician workload on low-risk patients.

Measurable outcomes: improved time-to-diagnosis, reduced hospitalization rates, improved patient outcomes — measurable via an impact study or pilot.

Risks to manage: false negatives (missed disease) can be harmful; false positives cause anxiety and extra costs — choose thresholds and workflows to balance risk according to business/clinical priorities.

Quick practical checklist

EDA: missingness, distributions, class balance.

Build preprocessing ColumnTransformer (impute numeric, impute & encode categorical).

Pipeline → DecisionTreeClassifier(class_weight='balanced', random_state=42).

Tune parameters with GridSearchCV/RandomizedSearchCV and stratified CV; pick metric that matches clinical objective.

Evaluate with confusion matrix, ROC/PR AUC, calibration, and subgroup checks.

Explain results (feature importances, SHAP), pilot in clinic, then deploy with monitoring and retraining plan.