# Decision Tree Assignment

**Q: What is a Decision Tree, and how does it work?**

A: A Decision Tree is a supervised learning model used for classification and regression tasks. It splits data into branches based on feature values, creating a tree-like model of decisions. At each internal node, it selects the feature and split point that best separate the data according to a criterion such as Gini Impurity or Entropy for classification, or variance reduction for regression. Each leaf holds the final prediction.

**Q: What are impurity measures in Decision Trees?**

A: Impurity measures quantify how mixed the class labels are at a node, and the tree uses them to evaluate the quality of a candidate split. Common impurity measures include Gini Impurity and Entropy. A split that produces child nodes with lower impurity gives better class separation.

**Q: What is the mathematical formula for Gini Impurity?**

A: Gini Impurity = 1 - Σ (p_i)^2, where p_i is the proportion of samples at the node that belong to class i. It is 0 for a pure node and grows as the classes become more evenly mixed.
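As a quick check of the formula, here is a minimal sketch that computes Gini Impurity from a list of class labels. The function name `gini_impurity` and the sample labels are illustrative choices, not part of any library.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption for this sketch: an empty node counts as pure
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure = gini_impurity(["a", "a", "a", "a"])      # 0.0: only one class present
balanced = gini_impurity(["a", "a", "b", "b"])  # 0.5: two classes, evenly split
print(pure, balanced)
```

For a two-class node, 0.5 is the largest value the measure can take.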

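**Q: What is the mathematical formula for Entropy?**

A: Entropy = -Σ p_i * log2(p_i), where p_i is the proportion of samples at the node that belong to class i, and terms with p_i = 0 are taken to be 0. It is 0 for a pure node and reaches log2(k) when k classes are equally likely.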
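The entropy formula can be checked the same way. This is a small sketch using only the standard library; the function name `entropy` and the sample labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Return -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption for this sketch: an empty node has zero entropy
    # Only classes that actually occur are counted, so p_i is never 0 here.
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


balanced = entropy(["a", "a", "b", "b"])  # 1.0: one full bit of uncertainty
skewed = entropy(["a", "a", "a", "b"])    # lower than 1.0, since one class dominates
print(balanced, skewed)
```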

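**Q: What is Information Gain, and how is it used in Decision Trees?**

A: Information Gain is the reduction in entropy achieved by splitting a dataset on a feature: the entropy of the parent node minus the weighted average entropy of the child nodes, with each child weighted by its share of the samples. The Decision Tree algorithm evaluates candidate splits and chooses the one with the highest gain.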
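To make the calculation concrete, the following sketch computes Information Gain as parent entropy minus the size-weighted entropy of the children. The helper names and the toy "yes"/"no" labels are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    n = len(parent)
    weighted_child_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_child_entropy


parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes completely removes all uncertainty.
perfect = information_gain(parent, [["yes"] * 4, ["no"] * 4])

# A split whose children look just like the parent gains nothing.
useless = information_gain(parent, [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]])

print(perfect, useless)
```

A tree built with the entropy criterion would prefer the first split, since it has the larger gain.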

**Q: What is the difference between Gini Impurity and Entropy?**

A: Both measure impurity, but Gini is simpler and faster to compute, while Entropy uses logarithms and comes from information theory. They also differ in scale: for a two-class node with an even split, Gini peaks at 0.5 and Entropy at 1.0. In practice they often select similar splits.

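**Q: What is the mathematical explanation behind Decision Trees?**

A: Decision Trees use recursive binary splitting to partition the feature space. At each node, a feature and a threshold are chosen to maximize Information Gain or, equivalently, to minimize the weighted impurity (such as Gini) of the resulting child nodes. The same procedure is then applied to each child, and it continues until a stopping condition is met, for example when a node is pure, the maximum depth is reached, or a node has too few samples to split.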
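A single step of this search can be sketched for one numeric feature. The sketch below tries the midpoint between each pair of consecutive distinct values and keeps the threshold with the lowest weighted Gini. The use of midpoints, the function names, and the toy data are assumptions made for illustration, not a description of any particular library's internals.

```python
from collections import Counter


def gini(labels):
    """Gini Impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best binary split of one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate identical feature values
        threshold = (low + high) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
threshold, score = best_threshold(values, labels)
print(threshold, score)  # 6.5 0.0
```

A full tree would repeat this search over every feature at every node and then recurse into the two resulting subsets.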

**Q: What is Pre-Pruning in Decision Trees?**

A: Pre-pruning stops the tree from growing once a condition is met, such as reaching a maximum depth or falling below a minimum number of samples per split, which helps to prevent overfitting.

**Q: What is Post-Pruning in Decision Trees?**

A: Post-pruning builds a full tree first and then removes branches that do not improve performance on validation data, improving generalization.

**Q: What is the difference between Pre-Pruning and Post-Pruning?**

A: Pre-pruning halts tree growth early, while post-pruning trims a fully grown tree. Pre-pruning avoids overfitting upfront, whereas post-pruning corrects it after full growth.

**Q: What is a Decision Tree Regressor?**

A: A Decision Tree Regressor predicts continuous values by partitioning the feature space into regions and predicting a constant in each region, typically the mean of the training targets that fall into it.
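The idea of predicting a regional mean can be shown with a one-split tree (a "stump"). This is a minimal sketch with a hand-picked threshold; the function names and toy data are hypothetical, and a real regressor would also search for the threshold.

```python
from statistics import mean


def fit_stump(xs, ys, threshold):
    """Return the mean target on each side of a fixed threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return mean(left), mean(right)


def predict_stump(x, threshold, left_mean, right_mean):
    """Predict the mean of whichever region `x` falls into."""
    return left_mean if x <= threshold else right_mean


xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 2.0, 3.0, 20.0, 22.0, 24.0]

left_mean, right_mean = fit_stump(xs, ys, threshold=6.5)
print(left_mean, right_mean)                            # 2.0 22.0
print(predict_stump(2.5, 6.5, left_mean, right_mean))   # 2.0
print(predict_stump(50, 6.5, left_mean, right_mean))    # 22.0
```

Note that the prediction for `x = 50` is still 22.0: a tree regressor cannot extrapolate beyond the target values it saw in training.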

**Q: What are the advantages and disadvantages of Decision Trees?**

A: Advantages: easy to interpret, handle both numerical and categorical data, require little preprocessing (for example, no feature scaling). Disadvantages: prone to overfitting, unstable because small variations in the data can produce a very different tree, and can be biased toward the majority class when classes are imbalanced.

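**Q: How does a Decision Tree handle missing values?**

A: It depends on the implementation. CART-style implementations can use surrogate splits, which fall back to a correlated feature when the primary one is missing, and C4.5 distributes an instance with a missing value across the branches in proportion to the observed distribution. Other implementations require preprocessing to fill (impute) or drop missing values before training.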
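When preprocessing is required, one common option is to impute before fitting. The sketch below chains scikit-learn's `SimpleImputer` with a tree in a pipeline. The toy array, the median strategy, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with two missing entries (np.nan)
X_missing = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [np.nan, 8.0],
    [9.0, 9.0],
    [8.0, 7.0],
])
y_toy = np.array([0, 0, 1, 1, 1])

# The imputer learns per-column medians on fit and fills NaNs before the tree sees the data
model = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(random_state=42),
)
model.fit(X_missing, y_toy)

# Inspect what the tree actually received
X_filled = model.named_steps["simpleimputer"].transform(X_missing)
print(X_filled)
print(model.predict(X_missing))
```

Keeping the imputer inside the pipeline means the medians are learned from the training data only and reused unchanged at prediction time.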

**Q: How does a Decision Tree handle categorical features?**

A: In principle, a tree can split a node directly on category values, and some implementations do so natively. Others, including scikit-learn's tree estimators, expect numeric input, so categorical features must be encoded first, for example with one-hot or ordinal encoding.
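For libraries that need numeric input, a one-hot encoding step is a typical workaround. The following sketch encodes a single categorical column with scikit-learn's `OneHotEncoder` before fitting a tree. The color data and variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature, binary target
colors = np.array([["red"], ["green"], ["blue"], ["red"], ["green"], ["blue"]])
labels = np.array([1, 0, 0, 1, 0, 0])

# One column per category; unseen categories at predict time become all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(colors).toarray()

clf_cat = DecisionTreeClassifier(random_state=42)
clf_cat.fit(X_encoded, labels)

# New samples must go through the same fitted encoder
new_samples = encoder.transform(np.array([["red"], ["blue"]])).toarray()
pred = clf_cat.predict(new_samples)
print(encoder.categories_)
print(pred)
```

One-hot encoding suits features with few categories. With many categories it produces a wide, sparse matrix, and ordinal encoding may be a more practical choice.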

**Q: What are some real-world applications of Decision Trees?**

A: Applications include medical diagnosis, credit scoring, fraud detection, customer churn prediction, and recommendation systems.

## Practical Questions

**Q: Write a Python program to train a Decision Tree Classifier on the Iris dataset and print the model accuracy**

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

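# Load the Iris dataset and hold out 30% of it for testing
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Fit a tree with a fixed seed so the result is reproducible
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))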

**Q: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances**

In [None]:
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X_train, y_train)
print("Feature Importances:", clf.feature_importances_)

**Q: Write a Python program to train a Decision Tree Classifier using Entropy as the splitting criterion and print the model accuracy**

In [None]:
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

**Q: Write a Python program to train a Decision Tree Regressor on a housing dataset and evaluate using Mean Squared Error (MSE)**

In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.3, random_state=42)
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

Mean Squared Error: 0.5306169838965117


**Q: Write a Python program to train a Decision Tree Classifier and visualize the tree using graphviz**

In [None]:
from sklearn.tree import export_graphviz
import graphviz

dot_data = export_graphviz(clf, out_file=None, feature_names=iris.feature_names,
                           class_names=iris.target_names, filled=True, rounded=True,
                           special_characters=True)
graph = graphviz.Source(dot_data)
graph.render("decision_tree")

**Q: Write a Python program to train a Decision Tree Classifier with a maximum depth of 3 and compare its accuracy with a fully grown tree**

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset and split it into training and testing sets
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train with max_depth=3
clf_depth_3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth_3.fit(X_train, y_train)
y_pred_depth_3 = clf_depth_3.predict(X_test)
accuracy_depth_3 = accuracy_score(y_test, y_pred_depth_3)

# Train with no depth limit (fully grown tree)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy with max_depth=3: {accuracy_depth_3:.4f}")
print(f"Accuracy with fully grown tree: {accuracy_full:.4f}")

Accuracy with max_depth=3: 1.0000
Accuracy with fully grown tree: 1.0000


**Q: Write a Python program to train a Decision Tree Classifier using min_samples_split=5 and compare its accuracy with a default tree**

In [None]:
clf_split_5 = DecisionTreeClassifier(min_samples_split=5, random_state=42)
clf_split_5.fit(X_train, y_train)
y_pred_split_5 = clf_split_5.predict(X_test)
accuracy_split_5 = accuracy_score(y_test, y_pred_split_5)

# Train with default tree
clf_default = DecisionTreeClassifier(random_state=42)
clf_default.fit(X_train, y_train)
y_pred_default = clf_default.predict(X_test)
accuracy_default = accuracy_score(y_test, y_pred_default)

print(f"Accuracy with min_samples_split=5: {accuracy_split_5:.4f}")
print(f"Accuracy with default tree: {accuracy_default:.4f}")

**Q: Write a Python program to apply feature scaling before training a Decision Tree Classifier and compare its accuracy with unscaled data**

In [None]:
from sklearn.preprocessing import StandardScaler

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train on scaled data
clf_scaled = DecisionTreeClassifier(random_state=42)
clf_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = clf_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# Train on unscaled data
clf_unscaled = DecisionTreeClassifier(random_state=42)
clf_unscaled.fit(X_train, y_train)
y_pred_unscaled = clf_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)

print(f"Accuracy with scaled data: {accuracy_scaled:.4f}")
print(f"Accuracy with unscaled data: {accuracy_unscaled:.4f}")

**Q: Write a Python program to train a Decision Tree Classifier using One-vs-Rest (OvR) strategy for multiclass classification**

In [None]:
from sklearn.multiclass import OneVsRestClassifier

# Train using One-vs-Rest strategy
clf_ovr = OneVsRestClassifier(DecisionTreeClassifier(random_state=42))
clf_ovr.fit(X_train, y_train)
y_pred_ovr = clf_ovr.predict(X_test)
accuracy_ovr = accuracy_score(y_test, y_pred_ovr)

print(f"Accuracy with One-vs-Rest strategy: {accuracy_ovr:.4f}")

**Q: Write a Python program to train a Decision Tree Classifier and display the feature importance scores**

In [None]:
# Train a Decision Tree Classifier
clf_importance = DecisionTreeClassifier(random_state=42)
clf_importance.fit(X_train, y_train)

# Get feature importance
importance = clf_importance.feature_importances_

print(f"Feature importance scores: {importance}")

**Q: Write a Python program to train a Decision Tree Regressor with max_depth=5 and compare its performance with an unrestricted tree**

In [6]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

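# Reuse the Iris features X and integer class labels y from the earlier cell as a stand-in regression problem.
# Note: the target here is a class index (0, 1, 2), not a truly continuous value.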
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X, y, test_size=0.3, random_state=42)

# Train with max_depth=5
regressor_depth_5 = DecisionTreeRegressor(max_depth=5, random_state=42)
regressor_depth_5.fit(X_train_reg, y_train_reg)
y_pred_depth_5 = regressor_depth_5.predict(X_test_reg)
mse_depth_5 = mean_squared_error(y_test_reg, y_pred_depth_5)

# Train with unrestricted tree
regressor_full = DecisionTreeRegressor(random_state=42)
regressor_full.fit(X_train_reg, y_train_reg)
y_pred_full_reg = regressor_full.predict(X_test_reg)
mse_full = mean_squared_error(y_test_reg, y_pred_full_reg)

print(f"MSE with max_depth=5: {mse_depth_5:.4f}")
print(f"MSE with unrestricted tree: {mse_full:.4f}")

MSE with max_depth=5: 0.0000
MSE with unrestricted tree: 0.0000


**Q: Write a Python program to train a Decision Tree Classifier, apply Cost Complexity Pruning (CCP), and visualize its effect on accuracy**

In [None]:
import matplotlib.pyplot as plt

# Train a Decision Tree Classifier
clf_ccp = DecisionTreeClassifier(random_state=42)
clf_ccp.fit(X_train, y_train)

# Get the effective alpha values for pruning
path = clf_ccp.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Evaluate each pruned tree on the held-out test set
test_scores = []
for ccp_alpha in ccp_alphas:
    pruned_clf = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    pruned_clf.fit(X_train, y_train)
    test_scores.append(accuracy_score(y_test, pruned_clf.predict(X_test)))

# Plot the pruning curve
plt.plot(ccp_alphas, test_scores, marker='o')
plt.xlabel('Alpha (Pruning Parameter)')
plt.ylabel('Test Accuracy')
plt.title('Effect of Cost Complexity Pruning')
plt.show()

**Q: Write a Python program to train a Decision Tree Classifier and evaluate its performance using Precision, Recall, and F1-Score**

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Train a Decision Tree Classifier
clf_pr = DecisionTreeClassifier(random_state=42)
clf_pr.fit(X_train, y_train)
y_pred_pr = clf_pr.predict(X_test)

# Calculate Precision, Recall, and F1-Score
precision = precision_score(y_test, y_pred_pr, average='weighted')
recall = recall_score(y_test, y_pred_pr, average='weighted')
f1 = f1_score(y_test, y_pred_pr, average='weighted')

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

**Q: Write a Python program to train a Decision Tree Classifier and visualize the confusion matrix using seaborn**

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Train a Decision Tree Classifier
clf_cm = DecisionTreeClassifier(random_state=42)
clf_cm.fit(X_train, y_train)
y_pred_cm = clf_cm.predict(X_test)

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred_cm)

# Visualize the confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=data.target_names, yticklabels=data.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

**Q: Write a Python program to train a Decision Tree Classifier and use GridSearchCV to find the optimal values for max_depth and min_samples_split**

In [None]:
from sklearn.model_selection import GridSearchCV

# Set up the parameter grid
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}

# Initialize the Decision Tree Classifier
clf_grid = DecisionTreeClassifier(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=clf_grid, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters and accuracy
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")