Question 1: What is a Decision Tree, and how does it work in the context of
classification?

ans)A Decision Tree is a supervised machine learning algorithm used for both classification and regression, but it’s most commonly applied to classification problems.


It works like a flowchart structure, where:

Each internal node represents a feature/attribute test (e.g., “Is age > 18?”).

Each branch represents the outcome of the test (e.g., Yes/No).

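Each leaf node represents the final class label (e.g., “Approved” or “Denied”).

To classify a new sample, start at the root, apply the test at each node, and follow the matching branch until a leaf is reached; the leaf's label is the prediction. The tree is built from training data by repeatedly choosing the split that best separates the classes.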
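The flowchart idea can be sketched in a few lines of Python. This is only an illustration: the tree below is written by hand rather than learned from data, and the `income` feature and its threshold are hypothetical additions to the “Is age > 18?” example.

```python
# Hypothetical hand-built tree, for illustration only (not learned from data).
# Internal nodes hold a feature test; leaf nodes hold a class label.
toy_tree = {
    "feature": "age", "threshold": 18,
    "no": {"label": "Denied"},
    "yes": {
        "feature": "income", "threshold": 30000,  # assumed second test
        "no": {"label": "Denied"},
        "yes": {"label": "Approved"},
    },
}

def predict(node, sample):
    """Follow branches from the root until a leaf node is reached."""
    while "label" not in node:
        passed = sample[node["feature"]] > node["threshold"]
        node = node["yes"] if passed else node["no"]
    return node["label"]

print(predict(toy_tree, {"age": 25, "income": 45000}))  # Approved
print(predict(toy_tree, {"age": 16, "income": 45000}))  # Denied
```

A library such as scikit-learn learns the features and thresholds automatically; only the traversal logic is the same.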

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

ans)When building a decision tree, the algorithm needs to decide which feature to split on at each step.
To do this, it measures how “pure” or “impure” a node is.

Pure node → contains data points from only one class.

Impure node → contains a mixture of classes.

Two common impurity measures are:

1)Gini Impurity

2)Entropy (used to compute Information Gain)

1. Gini Impurity

Definition: The probability of misclassifying a randomly chosen sample if it were labeled at random according to the class distribution in the node.

$Gini = 1 - \sum_{i} p_i^2$

where $p_i$ = probability of class $i$ in the node.

Example:
Suppose a node has 10 samples → 6 “Yes”, 4 “No”

$p_{Yes} = 0.6$, $p_{No} = 0.4$

$Gini = 1 - (0.6^2 + 0.4^2)$
$= 1 - (0.36 + 0.16)$
$= 0.48$
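As a quick check of the arithmetic, here is a minimal sketch that computes Gini impurity from a list of labels. The helper name `gini_impurity` is illustrative and not part of any library.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini = 1 - sum of squared class proportions in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# Same node as the example: 6 "Yes" and 4 "No"
node = ["Yes"] * 6 + ["No"] * 4
print(round(gini_impurity(node), 2))  # 0.48
```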

2. Entropy (used to compute Information Gain)

Definition: Measures the uncertainty (disorder) in a node.

$Entropy = -\sum_{i} p_i \log_2 p_i$

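Example (same node as above):
$Entropy = -(0.6 \log_2 0.6 + 0.4 \log_2 0.4)$
$= -(0.6 \times (-0.737) + 0.4 \times (-1.322))$
$= 0.442 + 0.529$
$= 0.971$

Impact on splits: both measures are 0 for a pure node and largest when the classes are evenly mixed. At each node, the tree evaluates the candidate splits and picks the one that most reduces the weighted impurity of the child nodes. Gini and entropy usually select similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.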
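As a quick check of the entropy arithmetic for the 6 “Yes” / 4 “No” node, here is a minimal sketch using base-2 logarithms. The helper name `entropy` is illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in the node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Same node as the example: 6 "Yes" and 4 "No"
node = ["Yes"] * 6 + ["No"] * 4
print(round(entropy(node), 3))  # 0.971
```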

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

ans)Pruning in Decision Trees

Decision Trees are powerful, but if we let them grow without restriction, they become very deep, memorizing training data → overfitting.

Pruning = reducing the size of the tree by removing unnecessary branches/nodes.

Two main strategies: Pre-Pruning and Post-Pruning.

1. Pre-Pruning (Early Stopping)

The tree stops growing early, before it becomes too complex.

This is done by setting conditions like:

max_depth (maximum levels in the tree)

min_samples_split (minimum samples needed to split a node)

min_samples_leaf (minimum samples in a leaf)

max_leaf_nodes

Practical Advantage:

Faster training & simpler trees (good for real-time predictions).
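A minimal sketch of pre-pruning with scikit-learn, comparing the size of an unrestricted tree with one constrained by the parameters listed above. The specific limit values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No limits: the tree grows until every leaf is pure
unrestricted = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned: growth stops as soon as any limit is hit (values are illustrative)
pre_pruned = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    max_leaf_nodes=6,
    random_state=42,
).fit(X, y)

print("Unrestricted:", unrestricted.get_depth(), "levels,",
      unrestricted.get_n_leaves(), "leaves")
print("Pre-pruned:  ", pre_pruned.get_depth(), "levels,",
      pre_pruned.get_n_leaves(), "leaves")
```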

2. Post-Pruning (Cost Complexity Pruning / Reduced Error Pruning)

The tree is allowed to grow fully (very deep, possibly overfitted).

Then, unnecessary branches are removed afterward, guided by performance on a held-out validation set (not the final test set) or by a cost-complexity penalty on tree size.

Practical Advantage:

Better generalization because pruning is based on actual validation performance, not just predefined limits.
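A minimal sketch of post-pruning using scikit-learn's cost-complexity pruning (`ccp_alpha`); scikit-learn does not provide reduced error pruning. Choosing alpha on a single validation split is an assumption made here for brevity; cross-validation would be a more robust choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fully grown tree, plus the candidate alphas along its pruning path
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    score = tree.fit(X_train, y_train).score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
pruned.fit(X_train, y_train)

print("Chosen ccp_alpha:", best_alpha)
print("Leaves before/after pruning:", full.get_n_leaves(), "/", pruned.get_n_leaves())
print("Validation accuracy:", best_score)
```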

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
ans) Information Gain (IG): Definition

Information Gain measures how much uncertainty (impurity) is reduced after splitting a dataset based on a feature.

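It is based on Entropy: the entropy of the parent node minus the weighted average entropy of the child nodes.

$IG = Entropy(parent) - \sum_{k} \frac{n_k}{n} Entropy(child_k)$

where $n_k$ is the number of samples in child $k$ and $n$ is the number of samples in the parent.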

Why Important?

Information Gain tells us which feature provides the most “clarity” about class labels.

The higher the IG, the better that feature is for splitting.

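Choosing the highest-IG split at each step makes the child nodes as pure as possible, which generally improves classification accuracy. The choice is greedy (best at the current node only), so the resulting tree is not guaranteed to be globally optimal.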
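A minimal sketch of the Information Gain calculation: parent entropy minus the size-weighted entropy of the children. The split below is a hypothetical example, and the helper names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical split of a node with 6 "Yes" and 4 "No"
parent = ["Yes"] * 6 + ["No"] * 4
left = ["Yes"] * 5 + ["No"] * 1
right = ["Yes"] * 1 + ["No"] * 3

ig = information_gain(parent, [left, right])
print("Information Gain:", round(ig, 3))
```

A split that separates the classes perfectly would have an Information Gain equal to the parent's entropy, since both children would have entropy 0.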

Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

ans)Real-World Applications of Decision Trees

Decision Trees are widely used because they are simple, interpretable, and powerful. Some common applications:

1)Healthcare (Diagnosis & Risk Prediction)

Predicting whether a patient has a disease (e.g., diabetes, cancer) based on symptoms/test results.

Example: “If age > 50 and cholesterol high → risk of heart disease.”

2)Finance (Credit Risk & Fraud Detection)

Banks use decision trees to decide loan approvals (credit score, income, debt).

Fraud detection in credit card transactions.

3)Business & Marketing

Customer segmentation: “Will this customer buy the product?”

Churn prediction: Which customers are likely to leave a service.

4)Education

Predicting student performance (pass/fail) based on study habits, attendance, etc.

Advantages of Decision Trees:-

Easy to understand & interpret.

Handles both categorical & numerical data.

No need for scaling/normalization.

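Limitations of Decision Trees:-

Overfitting: an unrestricted tree can grow deep enough to memorize the training data.

Instability: small changes in the training data can produce a very different tree.

Bias toward features with many levels: impurity-based splitting tends to favor features with many distinct values.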

Question 6: Write a Python program to:
● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances

In [1]:
# Step 1: Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 2: Load the Iris dataset
iris = load_iris()
X = iris.data   # features
y = iris.target # labels

# Step 3: Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: Train Decision Tree Classifier (using Gini criterion)
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Step 5: Make predictions on test set
y_pred = clf.predict(X_test)

# Step 6: Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Step 7: Print feature importances
print("Feature Importances:", clf.feature_importances_)


Model Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]


Question 7: Write a Python program to:
● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

In [2]:
# Step 1: Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 2: Load the Iris dataset
iris = load_iris()
X = iris.data   # features
y = iris.target # labels

# Step 3: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: Train a Decision Tree with max_depth=3
shallow_tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
shallow_tree.fit(X_train, y_train)

# Step 5: Train a fully-grown Decision Tree (no depth limit)
full_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
full_tree.fit(X_train, y_train)

# Step 6: Evaluate both models
shallow_acc = accuracy_score(y_test, shallow_tree.predict(X_test))
full_acc = accuracy_score(y_test, full_tree.predict(X_test))

print("Accuracy with max_depth=3:", shallow_acc)
print("Accuracy with fully-grown tree:", full_acc)


Accuracy with max_depth=3: 1.0
Accuracy with fully-grown tree: 1.0


Question 8: Write a Python program to:
● Load the California Housing dataset from sklearn

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances

In [3]:
# Step 1: Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Step 2: Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data   # features
y = housing.target # target (median house value)

# Step 3: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Step 5: Make predictions
y_pred = regressor.predict(X_test)

# Step 6: Evaluate using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Step 7: Print feature importances
print("Feature Importances:", regressor.feature_importances_)

# Optional: print feature names with importance
for name, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")


Mean Squared Error (MSE): 0.495235205629094
Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
 0.09371656 0.08290203]
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


Question 9: Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV

● Print the best parameters and the resulting model accuracy

In [4]:
# Step 1: Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 2: Load the Iris dataset
iris = load_iris()
X = iris.data   # features
y = iris.target # labels

# Step 3: Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],          # try different tree depths
    'min_samples_split': [2, 3, 4, 5, 6, 10]  # min samples to split a node
}

# Step 5: Setup GridSearchCV
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(criterion="gini", random_state=42),
    param_grid=param_grid,
    cv=5,               # 5-fold cross-validation
    scoring='accuracy', # optimize for accuracy
    n_jobs=-1           # use all CPU cores
)

# Step 6: Fit the model
grid_search.fit(X_train, y_train)

# Step 7: Print best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)

# Step 8: Evaluate on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Set Accuracy:", test_accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Best Cross-Validation Accuracy: 0.9416666666666668
Test Set Accuracy: 1.0


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.

Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

answer)Step 1: Handle Missing Values

Numeric features: replace missing values with median (robust against outliers).

Categorical features: replace missing values with the most frequent category.

If missingness is itself meaningful, add an indicator column.

Step 2: Encode Categorical Features:-

Use One-Hot Encoding for low/medium-cardinality features.

For high-cardinality features, consider target or frequency encoding (with care to avoid leakage).

Step 3: Train a Decision Tree Model:-

Split data into train/test sets with stratification.

Train a Decision Tree with class_weight="balanced" to handle class imbalance.

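Build the model within a pipeline, so that imputation and encoding are fitted on the training data only and then applied unchanged to validation and test data (this avoids data leakage).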


Step 4: Tune Hyperparameters

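Use GridSearchCV or RandomizedSearchCV with stratified cross-validation to tune parameters such as max_depth, min_samples_split, min_samples_leaf, and ccp_alpha.

Choose a scoring metric that fits the problem (e.g., recall, F1, or ROC AUC) rather than plain accuracy, since missing a sick patient is usually costlier than a false alarm.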

Step 5: Evaluate Performance

On the test set, report:

Confusion Matrix (true/false positives/negatives)

Classification report (precision, recall, F1-score)

ROC AUC and PR AUC (PR AUC is the more informative of the two when the disease class is rare)
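Steps 1 to 5 can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the patient data is synthetic, the column names (`age`, `cholesterol`, `smoker`, `region`) and the rule that generates the label are invented for illustration, and the parameter grid is deliberately small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic patient data (invented columns, random values)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "cholesterol": rng.normal(200, 40, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east", "west"], n),
})
y = ((df["age"] > 60) & (df["cholesterol"] > 190)).astype(int)  # invented rule
df.loc[rng.choice(n, 40, replace=False), "cholesterol"] = np.nan  # inject gaps
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric, categorical = ["age", "cholesterol"], ["smoker", "region"]

# Steps 1-2: median imputation (+ missing indicator) and one-hot encoding
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: preprocessing and model in one pipeline, stratified split
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune with stratified cross-validation
search = GridSearchCV(
    model,
    {"tree__max_depth": [3, 5, None], "tree__min_samples_leaf": [1, 5, 20]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Step 5: evaluate once on the held-out test set
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_prob)
pr_auc = average_precision_score(y_test, y_prob)
print("Best parameters:", search.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc, "PR AUC:", pr_auc)
```

Because the imputers and encoder live inside the pipeline, GridSearchCV refits them on each training fold, so no statistics from the validation folds or the test set leak into training.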


Step 6: Business Value in Real-World

Early disease detection: helps identify patients at risk sooner.

Resource efficiency: directs costly tests toward high-risk patients.

Cost savings: reduces late-stage treatments by catching disease early.

Clinical decision support: provides interpretable insights to doctors.

Population health: enables targeted preventive programs.