Question 1: What is a Decision Tree, and how does it work in the context of
classification?

Answer:
A Decision Tree is a supervised learning algorithm used for both classification and regression. For classification, it recursively splits the dataset into subsets based on feature values, forming a tree in which each internal node tests a feature (for example, whether a measurement is below a threshold) and each leaf node assigns a class label. At each node the algorithm greedily picks the split that most reduces impurity; this choice is locally best, not a guaranteed globally optimal sequence of questions. To predict, a sample follows the branches from the root down to a leaf, and that leaf's class is the prediction.
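As a rough illustration of the node and leaf structure described above, the following sketch fits a shallow tree on Iris and prints its rules as text. The `max_depth=2` and `random_state=0` settings are assumptions chosen only to keep the output short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short; random_state fixes tie-breaking.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each printed line is either a feature test (internal node) or a class (leaf).
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Prediction follows a single root-to-leaf path for each sample.
sample = iris.data[:1]
predicted_class = iris.target_names[tree.predict(sample)[0]]
print("Predicted class for first sample:", predicted_class)
```

Reading the printed rules from top to bottom mirrors the path a sample takes through the tree.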

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

Answer:

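Gini Impurity and Entropy measure how mixed the class labels are within a node; a split is scored by the impurity of the child nodes it produces.

Gini Impurity is the probability of misclassifying a randomly chosen sample if it were labeled at random according to the node's class distribution:

$$G = 1 - \sum_{i=1}^{C} p_i^2$$

Entropy measures the level of disorder (uncertainty) in the node's labels:

$$H = -\sum_{i=1}^{C} p_i \log_2 p_i$$

Here $p_i$ is the proportion of samples in the node that belong to class $i$, and $C$ is the number of classes. Both measures are 0 for a pure node and largest when the classes are evenly mixed.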

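A split that creates subsets with lower size-weighted impurity (fewer mixed labels) is preferred, so the tree chooses, at each node, the feature and threshold that reduce impurity the most. In practice the two measures usually select similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.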
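The two measures can be computed directly from label counts. The sketch below is a minimal NumPy version; the helper names `gini` and `entropy` are illustrative and not part of any library.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits, written as sum(p_i * log2(1 / p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * np.log2(1.0 / p))

pure = [0, 0, 0, 0]    # a single class: no impurity
mixed = [0, 0, 1, 1]   # two classes, evenly mixed

print(gini(pure), entropy(pure))    # 0.0 0.0
print(gini(mixed), entropy(mixed))  # 0.5 1.0
```

For two classes, Gini peaks at 0.5 and entropy at 1 bit, both at a 50/50 mix.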

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

Answer:

Pre-Pruning stops the tree from growing beyond a certain point during training (e.g., a maximum depth, or a minimum number of samples required to split a node or form a leaf), limiting overfitting early. Advantage: Faster training and simpler, more interpretable trees.

Post-Pruning allows the tree to grow fully and then removes branches that do not improve performance on validation data (cost-complexity pruning is one common technique). Advantage: It balances accuracy and complexity more carefully, because pruning decisions are made with the whole tree in view, and often results in better generalization.
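The contrast can be sketched with scikit-learn, using `max_depth` and `min_samples_leaf` for pre-pruning and cost-complexity pruning (`ccp_alpha`) as one form of post-pruning. The specific parameter values and the mid-range alpha below are assumptions for illustration; in practice alpha would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Pre-pruning: growth is capped while the tree is being built.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute its cost-complexity pruning path.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Pick a mid-range alpha from the path (an arbitrary choice for this sketch).
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```

Comparing the leaf counts shows how much each strategy simplifies the tree relative to the fully grown one.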

Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

Answer:

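Information Gain is the reduction in impurity (typically entropy) achieved by splitting a node on a specific feature: the parent's entropy minus the size-weighted average entropy of the child nodes. It is important because, at each step, the tree greedily chooses the split with the highest information gain, so the resulting children are as pure as possible. This makes each split locally best, although it does not guarantee a globally optimal tree.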
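A small worked example may make the calculation concrete. The helpers `entropy` and `information_gain` below are illustrative names, and the six-sample label array is made up for the demonstration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * np.log2(1.0 / p))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1])

# Split A separates the classes perfectly; split B leaves both children mixed.
gain_a = information_gain(parent, parent[:3], parent[3:])
gain_b = information_gain(parent, parent[[0, 1, 3]], parent[[2, 4, 5]])

print("Gain A:", gain_a)
print("Gain B:", gain_b)
```

Split A yields the larger gain, so a tree using this criterion would prefer it over split B.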

Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Answer:

Applications: Medical diagnosis, credit risk assessment, marketing, fraud detection, customer segmentation, and more.
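Advantages: Easy to interpret and visualize, can work with both numerical and categorical features (scikit-learn's implementation requires categorical features to be encoded numerically), and need little preprocessing such as feature scaling.
Limitations: Overfit easily without pruning, are unstable (small changes in the data can produce a very different tree), give less reliable feature importances when features are highly correlated, and approximate smooth or diagonal decision boundaries poorly because each split is axis-aligned.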




In [1]:
#Question 6: Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier using the Gini criterion
#● Print the model’s accuracy and feature importances
#(Include your Python code and output in the code box below.)

#Answer:

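from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the Iris features (X) and class labels (y)
iris = load_iris()
X, y = iris.data, iris.target

# Train a Decision Tree using the Gini criterion (also the scikit-learn default)
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X, y)

# Note: this is accuracy on the training data, so a fully grown tree scores 1.0;
# it is not an estimate of performance on unseen samples
print("Accuracy:", clf.score(X, y))
print("Feature Importances:", clf.feature_importances_)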




Accuracy: 1.0
Feature Importances: [0.         0.01333333 0.06405596 0.92261071]


In [2]:
#Question 7: Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
#a fully-grown tree.
#(Include your Python code and output in the code box below.)

#Answer:
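from sklearn.model_selection import train_test_split

# Reuses X, y and DecisionTreeClassifier from the previous cell.
# Hold out 30% of the samples so accuracy is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fully grown tree (no depth limit)
clf_full = DecisionTreeClassifier()
clf_full.fit(X_train, y_train)
print("Full Tree Accuracy:", clf_full.score(X_test, y_test))

# Tree limited to depth 3 (a form of pre-pruning)
clf_depth3 = DecisionTreeClassifier(max_depth=3)
clf_depth3.fit(X_train, y_train)
print("Max Depth=3 Accuracy:", clf_depth3.score(X_test, y_test))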




Full Tree Accuracy: 1.0
Max Depth=3 Accuracy: 1.0


In [3]:
#Question 8: Write a Python program to:
#● Load the California Housing dataset from sklearn
#● Train a Decision Tree Regressor
#● Print the Mean Squared Error (MSE) and feature importances
#(Include your Python code and output in the code box below.)

#Answer:

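from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing features (X) and median house values (y)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train an unrestricted Decision Tree Regressor
reg = DecisionTreeRegressor()
reg.fit(X, y)

# Note: predictions are made on the training data, so the MSE is near zero
# because the fully grown tree memorizes the training targets
y_pred = reg.predict(X)
print("Mean Squared Error:", mean_squared_error(y, y_pred))
print("Feature Importances:", reg.feature_importances_)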




Mean Squared Error: 1.0070971343301193e-31
Feature Importances: [0.52439443 0.05105658 0.05227414 0.02781658 0.03276769 0.13235599
 0.0944909  0.08484369]
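The near-zero MSE above is a training error. A rough sketch of how training and held-out error can differ for an unrestricted regression tree follows; it uses synthetic data from `make_regression` as a stand-in (an assumption made so the example runs without downloading California Housing), so the numbers are not comparable to the output above.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the California Housing set).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# An unrestricted tree can fit every training target exactly.
reg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, reg.predict(X_train))
test_mse = mean_squared_error(y_test, reg.predict(X_test))
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
```

The gap between the two values is the usual sign that a fully grown tree has overfit, which is why held-out data is the better basis for reporting MSE.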


In [5]:
#Question 9: Write a Python program to:
#● Load the Iris Dataset
#● Tune the Decision Tree’s max_depth and min_samples_split using
#GridSearchCV
#● Print the best parameters and the resulting model accuracy
#(Include your Python code and output in the code box below.)

#Answer:
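from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Load the Iris features (X) and class labels (y)
iris = load_iris()
X, y = iris.data, iris.target

# Candidate values for the two hyperparameters (None means no depth limit)
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# Evaluate every combination with 5-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X, y)

# best_score_ is the mean cross-validated accuracy of the best combination
print("Best Parameters:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)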

Best Parameters: {'max_depth': 4, 'min_samples_split': 5}
Best Accuracy: 0.9666666666666668


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to

● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance

And describe what business value this model could provide in the real-world
setting.

Answer:

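Handle Missing Values: Split the data into training and test sets first, then impute numerical features with the mean or median and categorical features with the mode or a dedicated "missing" category. Fit the imputers on the training set only to avoid leaking information from the test set.

Encode Categorical Features: Use one-hot encoding for nominal features and ordinal (label) encoding for features with a natural order.

Train Decision Tree: Fit the model on the processed training set.

Tune Hyperparameters: Use GridSearchCV to search over parameters such as max_depth and min_samples_split with cross-validation.

Evaluate Performance: Evaluate on the held-out test set using accuracy, precision, recall, and the confusion matrix. For disease prediction, recall deserves particular attention because a missed case (false negative) is usually more costly than a false alarm.

Business Value: Enables automated, interpretable predictions that improve early disease detection, help target interventions, and optimize resource use, ultimately supporting better patient outcomes and operational efficiency in healthcare.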
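The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is hypothetical: the column names, the synthetic values, and the rule that generates the label are invented for illustration, and `scoring="recall"` is one possible choice rather than a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical patient data; columns and values are invented for this sketch.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
    "region": pd.Series(rng.choice(["north", "south", "east"], n), dtype=object),
})
# Made-up label rule so the example has something learnable.
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Inject missing values into one numeric and one categorical column.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "region"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "region"]

# Steps 1 and 2: impute, then one-hot encode the categorical columns.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: the tree is the final pipeline step, so preprocessing is refit per CV fold.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with cross-validation on the training set only.
param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_split": [2, 10],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="recall")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = grid.predict(X_test)
print("Best Parameters:", grid.best_params_)
print(classification_report(y_test, y_pred))
```

Keeping imputation and encoding inside the pipeline means they are fitted only on the training folds during the search, which avoids leaking test information into preprocessing.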