Question 1: What is a Decision Tree, and how does it work in the context of classification?

Answer:
A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks.
In classification, it predicts the class label of an instance by learning simple decision rules from data features.

- The dataset is split into smaller subsets based on feature values.

- At each internal node, a feature is chosen that best divides the data into classes.

- Each branch represents a possible outcome, and each leaf node represents a final class label (decision).

*How it works (Classification Example):*

1. Start with the root node containing all training data.

2. For each feature (and each candidate split on it), calculate the impurity of the resulting child nodes (using a measure like Gini or Entropy).

3. Select the feature and split that give the best result (lowest impurity or highest information gain).

4. Repeat the process for each child node until:

    - All samples in a node belong to the same class, or

    - No more features are left, or

    - A stopping rule (such as a maximum depth or a minimum number of samples per node) is reached.

This forms a tree-like structure of decisions.
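The split-selection step can be sketched in plain Python. This is a minimal illustration, not how any particular library implements it: it searches a single numeric feature for the threshold with the lowest weighted Gini impurity. The toy data and the helper names `gini` and `best_threshold` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one numeric feature."""
    best = (None, gini(labels))  # baseline: impurity without any split
    distinct = sorted(set(values))
    # Candidate thresholds are the midpoints between consecutive distinct values
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical toy data: petal lengths and species labels
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

threshold, impurity = best_threshold(petal_length, species)
print(threshold, impurity)  # 3.0 0.0
```

A full tree builder would run this search over every feature, keep the best one, and recurse into the two child subsets until a stopping condition is met.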

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

Answer:
Both Gini Impurity and Entropy measure how mixed (impure) the classes are in a node.

🧮 Gini Impurity:

$$\text{Gini} = 1 - \sum_{i} p_i^2$$

Where:
$p_i$ = probability of class $i$ in a node.

- Gini = 0 → Node is pure (only one class).

- Gini is higher when classes are mixed; for two classes it peaks at 0.5, when both are equally likely.

- Used in CART (Classification and Regression Trees).

🧮 Entropy:

$$\text{Entropy} = -\sum_{i} p_i \log_2(p_i)$$

- Entropy = 0 → Node is pure.

- Entropy increases as the mixture of classes increases; for two classes it peaks at 1, when both are equally likely.

- Used in ID3 and C4.5 algorithms.

> Impact on Splits:

- Both measures help decide which feature to split on.

- The algorithm chooses the feature that reduces impurity the most (maximizes Information Gain).

- Lower impurity → better split → purer leaves, which usually means more accurate classification.
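To make the two measures concrete, here is a small sketch that evaluates both on class counts for a pure node, a mostly pure node, and an evenly mixed node. The function names and the example counts are illustrative assumptions.

```python
import math

def gini(counts):
    """Gini impurity from a list of per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy (in bits) from a list of per-class sample counts."""
    total = sum(counts)
    # Empty classes are skipped: p * log2(p) tends to 0 as p tends to 0
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

for counts in ([10, 0], [8, 2], [5, 5]):
    print(counts, "gini =", round(gini(counts), 3), "entropy =", round(entropy(counts), 3))
```

For the evenly mixed node `[5, 5]` this gives Gini 0.5 and entropy 1.0; for `[8, 2]` it gives Gini 0.32 and entropy of about 0.722. Both measures rank the three nodes in the same order, which is why they often lead to similar trees.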

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Answer:
| Type                             | Description                                                                                     | Practical Advantage                                                                                |
| -------------------------------- | ----------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| **Pre-Pruning (Early Stopping)** | Stops growing the tree early based on conditions (e.g., max depth, min samples per leaf).       | **Faster training**: the tree never grows large, which also limits overfitting from the start.     |
| **Post-Pruning**                 | First grows a full tree, then removes branches that don't improve accuracy on validation data.  | **Better generalization**: the model is simplified after training to avoid overfitting.            |
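Both strategies can be tried in scikit-learn. The sketch below is one possible setup, not the only one: pre-pruning uses `max_depth` and `min_samples_leaf`, while post-pruning uses scikit-learn's cost-complexity pruning (`ccp_alpha`), which differs from the validation-based branch removal described in the table, although a validation split is still used here to pick the pruning strength. The specific parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Pre-pruning: growth stops early because limits are set before training
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune it back
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42).fit(X_train, y_train)

print("Full tree nodes:  ", full_tree.tree_.node_count)
print("Pre-pruned nodes: ", pre_pruned.tree_.node_count)
print("Post-pruned nodes:", post_pruned.tree_.node_count)
```

Because the validation split is used to choose `ccp_alpha`, a separate test set would be needed to report an unbiased final score.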



Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Answer:
Information Gain (IG) measures how much uncertainty (impurity) is reduced after splitting a dataset based on a feature.

$$IG = \text{Entropy}(\text{Parent}) - \sum_{i} \frac{N_i}{N} \times \text{Entropy}(\text{Child}_i)$$

Where:

$N_i$: number of samples in child node $i$

$N$: total samples in parent node

Importance:

It helps select the best feature for splitting the data.

A feature with high Information Gain creates more homogeneous (pure) groups, which leads to better classification accuracy.
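A short worked example may help here. The sketch below computes Information Gain from class counts for two hypothetical splits of the same parent node (5 samples of each class): one partial split and one perfect split. The function names and counts are illustrative.

```python
import math

def entropy(counts):
    """Entropy (in bits) from a list of per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the sample-weighted entropy of the child nodes."""
    n = sum(parent_counts)
    weighted_children = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted_children

parent = [5, 5]
ig_partial = information_gain(parent, [[4, 1], [1, 4]])  # children are still mixed
ig_perfect = information_gain(parent, [[5, 0], [0, 5]])  # children are pure

print(round(ig_partial, 3))  # 0.278
print(round(ig_perfect, 3))  # 1.0
```

The perfect split removes all of the parent's uncertainty (1 bit), so a tree builder comparing these two candidate features would choose it.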

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Answer:
Real-World Applications:

1. Healthcare:

- Used to predict diseases (e.g., diabetes, heart disease) based on patient data.

- Example: Classifying patients as High Risk or Low Risk using medical test results.

2. Finance:

- Used for credit risk analysis: deciding whether to approve a loan or not.

- Helps detect fraudulent transactions.

3. Marketing:

- Customer segmentation and predicting purchase likelihood.

- Example: Predicting if a customer will respond to a campaign.

4. Education:

- Predicting student performance or dropout risk based on attendance, grades, etc.

5. Manufacturing / Quality Control:

- Used to identify causes of product defects by analyzing production parameters.

6. Agriculture / Environment:

- Classifying plant species using features (like in the Iris dataset).

- Predicting pollution levels, rainfall, or crop yield.

7. Real Estate:

- Predicting house prices using factors like number of rooms, area, and location (like Boston Housing dataset).


> Advantages of Decision Trees:

1. Easy to Understand and Visualize: mimics human decision-making.

2. No Feature Scaling Needed: works without normalization or standardization.

3. Handles Both Numerical & Categorical Data (in scikit-learn, categorical features must still be encoded as numbers first).

4. Can Capture Non-linear Relationships.

5. Fast Prediction once the tree is built.



> Limitations of Decision Trees:

1. Overfitting:

- Trees can grow too deep and memorize training data. (Handled by pruning or limiting depth.)

2. Unstable:

- Small data changes can cause a completely different tree.

3. Biased Towards Dominant Classes:

- If dataset is imbalanced, tree may favor majority class.

4. Not Great for Continuous Predictions Alone:

- Regression trees give piecewise-constant outputs, so they approximate smooth relationships only coarsely.

- Ensemble methods (like Random Forest or Gradient Boosting) usually perform better.
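The piecewise-constant behaviour is easy to see on synthetic data. In this sketch (the sine-wave data and the `max_depth=2` setting are assumptions chosen for illustration), a shallow regression tree is fit to a smooth target and can only output as many distinct values as it has leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a smooth sine curve sampled at 200 random points
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=200)).reshape(-1, 1)
y = np.sin(X).ravel()

# A depth-2 tree has at most 4 leaves, so at most 4 distinct predictions
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
predictions = tree.predict(X)
n_distinct = len(np.unique(predictions))

print("Leaves:", tree.get_n_leaves())
print("Distinct predicted values:", n_distinct)
```

Each leaf predicts the mean target of its training samples, so the fitted curve is a staircase rather than a smooth line. Averaging many such trees, as ensembles do, smooths the steps out.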

Example Datasets:

1. Iris Dataset (Classification):

- Task: Predict flower species (Setosa, Versicolor, Virginica) using features like sepal and petal length/width.

- from sklearn.datasets import load_iris

2. Boston Housing Dataset (Regression):

- Task: Predict median house prices based on features like number of rooms, age, and location.

- from sklearn.datasets import load_boston (removed in scikit-learn 1.2; use fetch_openml or another housing dataset such as fetch_california_housing instead)

Question 6: Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model's accuracy and feature importances (Include your Python code and output in the code box below.)

Answer:

In [1]:
# Question 6 Answer

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train Decision Tree Classifier using Gini criterion
clf_gini = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_gini.fit(X_train, y_train)

# Make predictions
y_pred = clf_gini.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Decision Tree Classifier (Gini) Accuracy:", round(accuracy * 100, 2), "%")
print("\nFeature Importances:")
for name, importance in zip(iris.feature_names, clf_gini.feature_importances_):
    print(f"{name}: {importance:.4f}")


Decision Tree Classifier (Gini) Accuracy: 100.0 %

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


Question 7:

Write a Python program to:

- Load the Iris Dataset

- Train a Decision Tree Classifier with max_depth=3

- Compare its accuracy to a fully-grown tree

Answer:

In [2]:
# Question 7 Answer

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train fully-grown Decision Tree
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Train Decision Tree with max_depth=3
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Print results
print("Fully-Grown Tree Accuracy:", round(accuracy_full * 100, 2), "%")
print("Tree with max_depth=3 Accuracy:", round(accuracy_limited * 100, 2), "%")

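# Compare test-set accuracy of the two trees
if accuracy_limited < accuracy_full:
    print("\nThe limited-depth tree is less accurate on this test set.")
elif accuracy_limited > accuracy_full:
    print("\nThe limited-depth tree is more accurate on this test set, suggesting the full tree overfits.")
else:
    print("\nBoth trees have similar accuracy; limiting depth helps prevent overfitting.")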


Fully-Grown Tree Accuracy: 100.0 %
Tree with max_depth=3 Accuracy: 100.0 %

Both trees have similar accuracy; limiting depth helps prevent overfitting.


Question 8:

Write a Python program to:

- Load the Boston Housing Dataset

- Train a Decision Tree Regressor

- Print the Mean Squared Error (MSE) and feature importances

Answer:

In [3]:
# Question 8 Answer

from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the Boston Housing dataset from OpenML (load_boston was removed in scikit-learn 1.2)
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data
y = boston.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Make predictions
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Print results
print("Decision Tree Regressor - Mean Squared Error (MSE):", round(mse, 2))
print("\nFeature Importances:")
for name, importance in zip(X.columns, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")


Decision Tree Regressor - Mean Squared Error (MSE): 11.59

Feature Importances:
CRIM: 0.0585
ZN: 0.0010
INDUS: 0.0099
CHAS: 0.0003
NOX: 0.0071
RM: 0.5758
AGE: 0.0072
DIS: 0.1096
RAD: 0.0016
TAX: 0.0022
PTRATIO: 0.0250
B: 0.0119
LSTAT: 0.1900


Question 9:

Write a Python program to:

- Load the Iris Dataset

- Tune the Decision Tree's max_depth and min_samples_split using GridSearchCV

- Print the best parameters and model accuracy

Answer:

In [4]:
# Question 9 Answer

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define model
dt = DecisionTreeClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# Perform Grid Search with cross-validation
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get best parameters
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate accuracy on test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters from GridSearchCV:", best_params)
print("Model Accuracy with Best Parameters:", round(accuracy * 100, 2), "%")


Best Parameters from GridSearchCV: {'max_depth': 4, 'min_samples_split': 10}
Model Accuracy with Best Parameters: 100.0 %


Question 10: Imagine you're working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

Answer:

1) Handle missing values

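- Diagnose the type of missingness: MCAR (missing completely at random), MAR (missing at random: missingness depends on other observed features), or MNAR (missing not at random: missingness depends on the unobserved value itself). The type determines which strategy is appropriate.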

> Simple rules:

- If a column has a very high missing rate (e.g., >50–70%), consider dropping it unless it is clinically important.

> For numerical features:

- If MCAR/MAR: impute with median (robust) or use model-based imputation (KNN, IterativeImputer) if relationships exist.

- If missingness itself is informative (e.g., a test was not ordered because the doctor judged the patient healthy), add a binary indicator column such as feature_X_was_missing.

> For categorical features:

- Impute missing as a separate category "<missing>" or the most frequent category.

> For all features:

- Avoid data leakage: always fit imputers on the training data only (use pipelines).

- Advanced: If there are many missing patterns and correlated missingness, consider IterativeImputer or models that can handle missingness natively.
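As a minimal sketch of median imputation combined with missing-value indicators, the example below uses scikit-learn's `SimpleImputer` with `add_indicator=True`. The tiny arrays (standing in for age and blood pressure columns) are made-up values for illustration. The imputer is fit on the training rows only and then applied to the test row.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric features: [age, blood_pressure]
X_train = np.array([
    [50.0, 120.0],
    [np.nan, 140.0],
    [70.0, np.nan],
    [60.0, 130.0],
])
X_test = np.array([[np.nan, 125.0]])

# Medians are learned from the training data only, which avoids leakage
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Columns: imputed age, imputed blood_pressure, age_was_missing, bp_was_missing
print(X_test_imputed)  # [[ 60. 125.   1.   0.]]
```

The two extra columns let the tree split on "was this value missing?", which matters when missingness itself carries information.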


2) Encode categorical features

- Low-cardinality categorical (few unique values): OneHotEncoder (sparse output saves memory; dropping one column to avoid collinearity matters for linear models, not for trees).

- High-cardinality categorical: target encoding or frequency encoding (take care to prevent leakage: use CV folds for target encoding).

- Ordinal features: use OrdinalEncoder if order matters.

- In pipelines: use ColumnTransformer to apply different transforms to numeric vs categorical features.

- Remember: Decision Trees don't require scaling, but proper encoding still matters.
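A small sketch of mixing encoders with `ColumnTransformer` follows. The column `pain_level` and all the values are hypothetical; `sparse_threshold=0` is set only so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical patient data: one nominal and one ordinal feature
df = pd.DataFrame({
    "smoking_status": ["never", "current", "former", "never"],
    "pain_level": ["low", "high", "medium", "low"],
})

encoder = ColumnTransformer(
    [
        # Nominal: no natural order, so one column per category
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["smoking_status"]),
        # Ordinal: the order is stated explicitly so low < medium < high
        ("ordinal", OrdinalEncoder(categories=[["low", "medium", "high"]]), ["pain_level"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = encoder.fit_transform(df)
# Columns: smoking=current, smoking=former, smoking=never, pain_level
print(encoded)
```

Passing the category order explicitly matters: without it, `OrdinalEncoder` sorts categories alphabetically, which would place "high" before "low".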


3) Train a Decision Tree model

- Use scikit-learn DecisionTreeClassifier inside a pipeline:

- Keeps imputation + encoding + model in one reproducible unit.

- Set random_state for reproducibility.

- Consider class_weight='balanced' (or custom weights) if dataset is imbalanced.

- Example pipeline (compact; assumes X and y are already loaded as pandas objects):

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# X, y already loaded (pandas)
num_cols = ['age','blood_pressure','cholesterol']    # example
cat_cols = ['gender','smoking_status']

num_transform = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    # no scaler step: tree splits are insensitive to feature scaling
])

cat_transform = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', num_transform, num_cols),
    ('cat', cat_transform, cat_cols)
])

pipe = Pipeline([
    ('preproc', preprocessor),
    ('clf', DecisionTreeClassifier(random_state=42, class_weight='balanced'))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)


4) Tune hyperparameters

- Hyperparameters that matter for Decision Trees:

    - max_depth: controls complexity.

    - min_samples_split, min_samples_leaf: prevent tiny leaves.

    - max_features: how many features to consider at each split.

    - criterion: 'gini' or 'entropy'.

- Use GridSearchCV or RandomizedSearchCV with cv=5 (stratified if classification).

- Use scoring aligned with business need: e.g., scoring='recall' (if missing a disease is costly) or scoring='roc_auc'.

- Example grid:

        param_grid = {
            'clf__max_depth': [3, 5, 8, None],
            'clf__min_samples_split': [2, 5, 10],
            'clf__min_samples_leaf': [1, 2, 5]
        }

        grid = GridSearchCV(pipe, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
        grid.fit(X_train, y_train)
        best_model = grid.best_estimator_
        print(grid.best_params_)

- If classes are very imbalanced, also try:

    - class_weight='balanced' or custom weights,

    - resampling: SMOTE (only in training folds) or undersampling majority class.

5) Evaluate performance

- Prefer multiple metrics:

    - Recall (Sensitivity): proportion of actual patients with the disease that the model finds. Crucial if missing a disease is costly.

    - Precision: how many predicted positives are true positives (important when false positives are costly).

    - F1-score: balance between precision and recall.

    - ROC AUC: ranking ability across thresholds.

    - PR AUC: more informative on imbalanced data.

    - Confusion Matrix: absolute counts of TP/FP/FN/TN for operational decisions.

    - Calibration: does the predicted probability match the true probability? Use calibration plots or CalibratedClassifierCV.

- Decision thresholds: the threshold does not have to be 0.5. For health applications, you might use a lower threshold to increase recall and accept more false positives.

- Cross-validation: Report CV mean ± std for metrics. Use nested CV for honest hyperparameter selection if reporting final performance.

- Statistical and clinical validation: evaluate on holdout test set and, ideally, on an external dataset from different hospital/population.
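The metrics and the threshold trade-off above can be sketched as follows. Since no patient dataset is given here, the example uses scikit-learn's built-in breast cancer dataset as a stand-in, relabelled so that 1 means malignant; the `max_depth=4` setting and the 0.3 threshold are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X = data.data
y = (data.target == 0).astype(int)  # relabel: 1 = malignant (positive class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

auc = roc_auc_score(y_test, proba)
print("ROC AUC:", round(auc, 3))

# Compare the default threshold with a lower, recall-oriented one
recalls = {}
for threshold in (0.5, 0.3):
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    recalls[threshold] = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    print(f"threshold={threshold}: recall={recalls[threshold]:.3f}, "
          f"precision={precision:.3f}, TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```

Lowering the threshold can only keep or increase recall, because every case flagged at 0.5 is still flagged at 0.3; the cost shows up as extra false positives.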

6) Interpretability & fairness

- Feature importances from the tree give a quick sense of which features matter.

- For more robust explanations: use SHAP or LIME to show per-patient explanations.

- Check for bias across subgroups (age, gender, ethnicity) to ensure the model doesn't systematically underperform on any group.

- Keep clinical experts involved: they can validate whether learned rules make clinical sense.
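One simple way to put a tree's learned rules in front of domain experts is scikit-learn's `export_text`, which prints the splits as nested if/else rules. The sketch below uses the Iris data as a stand-in for patient data, with `max_depth=2` chosen only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Render the fitted tree as human-readable decision rules
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each printed line is a threshold test on a named feature, ending in a predicted class, so a reviewer can check every path without reading code.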

7) Deployment, monitoring & governance

  - Clinical integration: use the model as decision support, not as a fully autonomous system; flag high-risk patients for clinician review.

  - Monitor model drift (data distribution changes), performance decay, and changes in prevalence.

  - Logging: save model inputs/outputs for auditing.

  - Regulatory & privacy: ensure HIPAA/GDPR compliance, secure data handling, and maintain documentation for explainability and approvals.