**Theoretical Questions**


Question 1: What is a Decision Tree, and how does it work?
Answer 1: A Decision Tree is a supervised learning model that splits data into branches to predict outcomes by learning decision rules from features.

Question 2: What are impurity measures in Decision Trees?
Answer 2: Impurity measures quantify how mixed the classes are in a node. Common measures: Gini Impurity and Entropy.

Question 3: What is the mathematical formula for Gini Impurity?
Answer 3: Gini = 1 - ∑(pᵢ²), where pᵢ is the proportion of samples in the node that belong to class i.
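
As a minimal sketch of the formula above, the function below computes Gini Impurity from a list of class labels. The function name `gini_impurity` and the sample nodes are illustrative assumptions, not part of any library.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumed convention: an empty node counts as pure
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure_node = ["a", "a", "a", "a"]    # one class only
mixed_node = ["a", "a", "b", "b"]   # two classes, evenly mixed

print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
```

A pure node scores 0, and an even two-class mix scores 0.5, the maximum for two classes.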

Question 4: What is the mathematical formula for Entropy?
Answer 4: Entropy = - ∑(pᵢ * log₂(pᵢ)), where pᵢ is the proportion of samples in the node that belong to class i.
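
The entropy formula can be sketched the same way. The helper name `entropy` and the example labels are assumptions made for illustration; only classes that actually occur in the node are summed, so log₂(0) never arises.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumed convention: an empty node counts as pure
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

evenly_mixed = ["a", "a", "b", "b"]
three_classes = ["a", "b", "c"]

print(entropy(evenly_mixed))   # 1.0 (one full bit of uncertainty)
print(entropy(three_classes))  # log2(3), roughly 1.585
```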

Question 5: What is Information Gain, and how is it used in Decision Trees?
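Answer 5: Information Gain is the reduction in impurity (typically entropy) achieved by a split: the impurity of the parent node minus the weighted average impurity of its child nodes. At each node, the tree chooses the split with the highest Information Gain.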
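
A small worked example may help here. The sketch below assumes entropy as the impurity measure and uses hypothetical helper names (`entropy`, `information_gain`) with made-up labels to compare a perfect split against a useless one.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

The perfect split removes all uncertainty, so its gain equals the parent's entropy. The useless split leaves each child as mixed as the parent, so its gain is zero.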

Question 6: What is the difference between Gini Impurity and Entropy?
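Answer 6: Both measure node impurity and usually lead to similar splits. Gini avoids the logarithm, so it is slightly cheaper to compute. For k classes, Gini ranges from 0 to 1 - 1/k, while Entropy ranges from 0 to log₂(k).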

Question 7: What is the mathematical explanation behind Decision Trees?
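Answer 7: Decision Trees are built by recursive binary splitting. At each node, the algorithm evaluates candidate splits (a feature and a threshold), picks the one that maximizes Information Gain (equivalently, minimizes the weighted impurity of the child nodes, measured with Gini or Entropy), and repeats the process on each child until a stopping condition is met.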
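
The sketch below illustrates one step of that procedure for a single numeric feature, using Gini as the impurity measure. The helper names (`gini`, `best_split`), the midpoint rule for candidate thresholds, and the toy data are assumptions for illustration. A full tree would run the same search over every feature and then recurse into each child.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    n = len(values)
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]

threshold, score = best_split(values, labels)
print(threshold, score)  # 6.5 0.0
```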

Question 8: What is Pre-Pruning in Decision Trees?
Answer 8: Pre-pruning stops tree growth early using conditions such as a maximum depth, a minimum number of samples required to split a node, or a minimum number of samples per leaf.

Question 9: What is Post-Pruning in Decision Trees?
Answer 9: Post-pruning removes branches from a fully grown tree to reduce overfitting.

Question 10: What is the difference between Pre-Pruning and Post-Pruning?
Answer 10: Pre-pruning prevents overfitting during training; post-pruning simplifies the tree after full growth.

Question 11: What is a Decision Tree Regressor?
Answer 11: It is a Decision Tree model used for predicting continuous values instead of classes.

Question 12: What are the advantages and disadvantages of Decision Trees?
Answer 12:
Advantages: Easy to interpret, no need for feature scaling.
Disadvantages: Prone to overfitting, unstable to small data changes.

Question 13: How does a Decision Tree handle missing values?
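Answer 13: It depends on the implementation. Classic CART can use surrogate splits, which route a sample with a missing value using another, correlated feature. Another common approach is to impute missing entries (for example, with the most frequent value) before training.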
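
As one possible illustration of the imputation approach, the sketch below fills missing entries with each column's most frequent value using scikit-learn's `SimpleImputer` and then trains a tree in a pipeline. The tiny dataset is made up for the example, and surrogate splits are not shown.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: two features, with np.nan marking missing entries.
X = np.array([
    [1.0, 0.0],
    [1.0, np.nan],
    [2.0, 1.0],
    [np.nan, 1.0],
    [2.0, 1.0],
    [1.0, 0.0],
])
y = np.array([0, 0, 1, 1, 1, 0])

# Replace each missing entry with the most frequent value in its column.
X_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)
print(X_filled)

# Chaining imputer and tree applies the same fill rule at prediction time.
pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    DecisionTreeClassifier(random_state=0),
)
pipeline.fit(X, y)
print(pipeline.predict(np.array([[np.nan, 0.0]])))
```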

Question 14: How does a Decision Tree handle categorical features?
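Answer 14: A tree can split categories into groups chosen by impurity reduction. Implementations that accept only numeric input, such as scikit-learn, need the categories encoded first, for example with one-hot encoding.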
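
A minimal sketch of the one-hot route with scikit-learn follows. The `colors` feature and its labels are hypothetical, chosen so that only "red" belongs to class 1.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical single categorical feature and target.
colors = np.array([["red"], ["green"], ["blue"], ["green"], ["red"], ["blue"]])
labels = np.array([1, 0, 0, 0, 1, 0])

# One binary column per category; unseen categories become all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(colors)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_encoded, labels)

print(encoder.categories_[0])  # categories in sorted order
print(clf.predict(encoder.transform([["red"], ["blue"]])))  # [1 0]
```

Each category becomes its own binary column, so the tree can isolate "red" with a single split on that column.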

Question 15: What are some real-world applications of Decision Trees?
Answer 15: Fraud detection, loan approval, medical diagnosis, customer segmentation.

**Practical Questions**

Question 16: Write a Python program to train a Decision Tree Classifier on the Iris dataset and print the model accuracy

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))


Question 17: Train a Decision Tree Classifier using Gini Impurity and print the feature importances

In [None]:
model = DecisionTreeClassifier(criterion='gini')
model.fit(X_train, y_train)
print("Feature importances:", model.feature_importances_)


Question 18: Train a Decision Tree Classifier using Entropy and print the model accuracy

In [None]:
model = DecisionTreeClassifier(criterion='entropy')
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))


Question 19: Train a Decision Tree Regressor on a housing dataset and evaluate using MSE

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Use housing-specific names so the Iris split from Question 16 is not overwritten
X_h, y_h = fetch_california_housing(return_X_y=True)
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(X_h, y_h, random_state=0)
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train_h, y_train_h)
print("MSE:", mean_squared_error(y_test_h, regressor.predict(X_test_h)))


Question 20: Visualize the tree using graphviz

In [None]:
from sklearn.tree import export_graphviz
import graphviz

dot_data = export_graphviz(model, out_file=None, feature_names=load_iris().feature_names, class_names=load_iris().target_names)
graph = graphviz.Source(dot_data)
graph.render("iris_tree", view=True)  # Will open a PDF


Question 21: Train a Decision Tree Classifier with max_depth=3 and compare with fully grown tree

In [None]:
model_limited = DecisionTreeClassifier(max_depth=3)
model_limited.fit(X_train, y_train)
acc_limited = accuracy_score(y_test, model_limited.predict(X_test))

model_full = DecisionTreeClassifier()
model_full.fit(X_train, y_train)
acc_full = accuracy_score(y_test, model_full.predict(X_test))

print("Limited Depth Accuracy:", acc_limited)
print("Full Tree Accuracy:", acc_full)


Question 22: Train using min_samples_split=5 and compare with default

In [1]:
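from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Re-import and re-split so the cell also runs in a fresh kernel
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model_default = DecisionTreeClassifier(random_state=0)
model_split = DecisionTreeClassifier(min_samples_split=5, random_state=0)
model_default.fit(X_train, y_train)
model_split.fit(X_train, y_train)

print("Default Accuracy:", accuracy_score(y_test, model_default.predict(X_test)))
print("min_samples_split=5 Accuracy:", accuracy_score(y_test, model_split.predict(X_test)))
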

Question 23: Apply feature scaling before training and compare accuracy

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, to avoid leaking test-set statistics
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Same random_state for both trees so that scaling is the only difference
model_unscaled = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
model_scaled = DecisionTreeClassifier(random_state=0).fit(X_train_s, y_train)

print("Unscaled Accuracy:", accuracy_score(y_test, model_unscaled.predict(X_test)))
print("Scaled Accuracy:", accuracy_score(y_test, model_scaled.predict(X_test_s)))


Question 24: Train using One-vs-Rest (OvR) strategy

In [2]:
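from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
ovr_model = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
ovr_model.fit(X_train, y_train)
print("OvR Accuracy:", accuracy_score(y_test, ovr_model.predict(X_test)))
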

Question 25: Display feature importance scores

In [3]:
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # Iris split from the earlier cells
print("Feature importances:", model.feature_importances_)


Question 26: Train Regressor with max_depth=5 and compare with unrestricted

In [None]:
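from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load the housing data under its own names so the Iris variables stay intact
X_h, y_h = fetch_california_housing(return_X_y=True)
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(X_h, y_h, random_state=0)

reg_limited = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train_h, y_train_h)
reg_full = DecisionTreeRegressor(random_state=0).fit(X_train_h, y_train_h)

print("MSE limited:", mean_squared_error(y_test_h, reg_limited.predict(X_test_h)))
print("MSE full:", mean_squared_error(y_test_h, reg_full.predict(X_test_h)))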


Question 27: Apply Cost Complexity Pruning (CCP) and visualize effect

In [4]:
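import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Re-load the Iris split so the cell also runs in a fresh kernel
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alphas come from the pruning path of a tree grown on the training data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Train one tree per alpha and record its test accuracy
accuracies = []
for ccp in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=ccp, random_state=0)
    clf.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, clf.predict(X_test)))

plt.plot(ccp_alphas, accuracies, marker="o")
plt.xlabel("ccp_alpha")
plt.ylabel("Accuracy")
plt.title("CCP Pruning Effect")
plt.show()
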

Question 28: Evaluate using Precision, Recall, and F1-Score

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = model.predict(X_test)
print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))
print("F1 Score:", f1_score(y_test, y_pred, average='macro'))


Question 29: Visualize confusion matrix using seaborn

In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(y_test, model.predict(X_test))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()


Question 30: Use GridSearchCV to find optimal max_depth and min_samples_split

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [2, 3, 4, 5], 'min_samples_split': [2, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best CV Accuracy:", grid.best_score_)
print("Test Accuracy:", accuracy_score(y_test, grid.best_estimator_.predict(X_test)))
