Decision Tree | Assignment

Assignment Code: DA-AG-012


---



Question 1: What is a Decision Tree, and how does it work in the context of
classification?

Answer: A Decision Tree is a supervised learning algorithm that works like a flowchart to classify data. It repeatedly splits the data at each node and keeps splitting until all points in a node belong to the same class or a stopping rule is reached.

How it Works
- The Split: The tree starts at a "Root Node" (the top) and looks for the feature that best separates the data into distinct groups.
- The Test: It uses mathematical metrics like Gini Impurity or Entropy to find the "purest" split, meaning the split whose resulting groups are as close to single-class as possible.
- The Path: Data points travel down branches based on their characteristics (e.g., Is age > 30?).
- The Result: The process repeats until it reaches a Leaf Node, which provides the final category or class.
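
The flowchart structure described above can be made visible by printing the rules of a small fitted tree. This is a minimal sketch, assuming scikit-learn is installed; the choice of the Iris dataset and `max_depth=2` is only to keep the printed tree short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a deliberately shallow tree so the printed rules stay readable
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a test at a node; "class:" lines are leaf nodes
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom follows one path from the root node to a leaf.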

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?


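Answer: Entropy measures "disorder" or uncertainty in a node; the tree calculates the entropy before and after a split to find the Information Gain and chooses the split that reduces uncertainty the most. Gini Impurity measures the probability that a randomly chosen item would be misclassified if it were labeled at random according to the node's class distribution. Gini is the default criterion in scikit-learn's DecisionTreeClassifier and is slightly cheaper to compute because it avoids the logarithms that Entropy requires. Both measures are 0 for a pure node and largest when the classes are evenly mixed. They impact the tree by acting as the scoring function: the tree evaluates candidate features and "cut" points and keeps the split that creates the most uniform, homogeneous groups. In practice the two criteria usually select similar splits.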
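
To make the two measures concrete, here is a small sketch that computes both for a pure node and an evenly mixed node. The helper names `gini` and `entropy` are illustrative, not library functions.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

pure = ["A", "A", "A", "A"]    # one class only
mixed = ["A", "A", "B", "B"]   # two classes, evenly split

print("pure :", gini(pure), entropy(pure))    # both 0.0
print("mixed:", gini(mixed), entropy(mixed))  # 0.5 and 1.0
```

For two classes, Gini peaks at 0.5 and Entropy at 1.0, so the raw values are not comparable across criteria; only the ranking of candidate splits matters.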


Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.


Answer: Pruning is the process of trimming a Decision Tree to prevent it from "overfitting," which occurs when the model becomes too complex and memorizes noise instead of learning patterns. Pre-Pruning stops the tree's growth during training by setting specific constraints, such as a maximum depth or a minimum number of samples per leaf. Its primary practical advantage is computational efficiency, as it saves time and memory by never building an unnecessarily large tree. Post-Pruning allows the tree to grow to its full depth first and then trims away non-significant branches from the bottom up based on their performance. Its main practical advantage is that it often generalizes better, because it evaluates the "big picture" and avoids prematurely cutting off splits that seemed weak initially but enabled useful splits deeper in the tree.
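
The contrast can be sketched in scikit-learn, where pre-pruning corresponds to growth constraints such as `max_depth` and `min_samples_leaf`, and one available post-pruning technique is minimal cost-complexity pruning via `ccp_alpha`. The constraint values and the choice of a mid-range alpha below are arbitrary assumptions for illustration; in practice alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: constrain growth while the tree is being built
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune with cost-complexity pruning
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary mid-range alpha
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"test accuracy={model.score(X_test, y_test):.2f}")
```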


Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

Answer: Information Gain is the measurement of how much "uncertainty" or "chaos" is reduced in a dataset after a specific split is made. It is calculated by taking the Entropy (disorder) of the parent node and subtracting the weighted average Entropy of the resulting child nodes, where each child is weighted by its share of the parent's samples. At each node the tree compares the candidate splits and picks the one with the highest Information Gain. The ultimate goal of a classification tree is to reach "leaf nodes" where every item belongs to the same class, and Information Gain is the mathematical tool that drives the data toward that state of 100% purity.
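
The calculation above can be worked through on a toy node of ten labels. This is a small sketch; `entropy` and `information_gain` are illustrative helper names, and the labels are made up.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split whose children have the same 50/50 mix as the parent
useless_split = [["yes", "yes", "no", "no"],
                 ["yes", "yes", "yes", "no", "no", "no"]]

ig_perfect = information_gain(parent, perfect_split)
ig_useless = information_gain(parent, useless_split)
print(f"perfect split gain: {ig_perfect:.2f}")  # 1.00
print(f"useless split gain: {ig_useless:.2f}")  # 0.00
```

The tree would choose the first split, since it removes all of the parent's uncertainty while the second removes none.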

Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Answer: Decision Trees are popular because they mirror human "If-Then" logic, but they come with specific trade-offs.

Real-World Applications:
- Finance: Credit scoring and fraud detection.
- Healthcare: Patient risk stratification and symptom-based diagnosis.
- Retail: Customer segmentation and "churn" prediction (predicting who will quit a service).

Advantages:
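- High Interpretability: The learned rules can be read, drawn, and explained to non-technical stakeholders.
- Versatile Data Handling: Trees work with both numerical and categorical features and with classification or regression targets.
- No Scaling Required: Splits are threshold comparisons on one feature at a time, so normalization or standardization does not change the result.
- Automatic Feature Selection: Uninformative features are rarely chosen for splits and end up with low importance.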

Limitations:
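- Overfitting: An unconstrained tree keeps splitting until it memorizes noise in the training data.
- Instability (High Variance): Small changes in the training data can produce a very different tree.
- Bias toward High-Cardinality Features: Impurity-based splitting tends to favor features with many distinct values.
- Greedy Nature: Each split is chosen as the best one locally, with no guarantee that the overall tree is optimal.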

Question 6: Write a Python program to:

- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model’s accuracy and feature importances


In [1]:
# Answer of 6th Question

#importing useful libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

#loading dataset and creating DataFrame
data = load_iris()
df = pd.DataFrame(data.data,columns=data.feature_names)

# Dividing data to independent and target variables
X = df
y = data.target

# splitting in train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=1)

# model building
classifier = DecisionTreeClassifier(criterion='gini')
classifier.fit(x_train,y_train)
y_pred = classifier.predict(x_test)

# model's accuracy
print(f'Accuracy = {accuracy_score(y_test,y_pred)}')

# Feature importance
print("Feature Importances:")
for feature_name, importance in zip(data.feature_names, classifier.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")




Accuracy = 0.9555555555555556
Feature Importances:
sepal length (cm): 0.0215
sepal width (cm): 0.0215
petal length (cm): 0.0632
petal width (cm): 0.8939


Question 7: Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

In [2]:
# Answer to 7th Question

# Split into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a shallow Decision Tree (max_depth=3)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
accuracy_clf = accuracy_score(Y_test, Y_pred)

# Train a fully-grown Decision Tree (no max_depth limit)
clf_full = DecisionTreeClassifier(criterion="gini", random_state=42)
clf_full.fit(X_train, Y_train)
Y_pred_full = clf_full.predict(X_test)
# Score against Y_test from this cell's split, not y_test from the Question 6 split
accuracy_full = accuracy_score(Y_test, Y_pred_full)

# comparison
print(f"Tree (max_depth=3) Accuracy: {accuracy_clf:.2f}")
print(f"Fully-grown Tree Accuracy: {accuracy_full:.2f}")

Tree (max_depth=3) Accuracy: 1.00
Fully-grown Tree Accuracy: 1.00


Question 8: Write a Python program to:
- Load the Boston Housing Dataset
- Train a Decision Tree Regressor
- Print the Mean Squared Error (MSE) and feature importances


In [3]:
#  Answer to 8th Question

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the Boston Housing dataset from OpenML
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data
y = boston.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")

# Print feature importances
print("Feature Importances:")
for feature_name, importance in zip(X.columns, regressor.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")



Mean Squared Error (MSE): 11.59
Feature Importances:
CRIM: 0.0585
ZN: 0.0010
INDUS: 0.0099
CHAS: 0.0003
NOX: 0.0071
RM: 0.5758
AGE: 0.0072
DIS: 0.1096
RAD: 0.0016
TAX: 0.0022
PTRATIO: 0.0250
B: 0.0119
LSTAT: 0.1900


Question 9: Write a Python program to:
- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
- Print the best parameters and the resulting model accuracy


In [4]:
# Answer to 9th Question

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define the Decision Tree model
dt = DecisionTreeClassifier(random_state=42)

# Define the parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# Perform GridSearchCV
grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

grid_search.fit(X_train, y_train)

# Get the best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy on Test Set:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Model Accuracy on Test Set: 1.0


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

Answer:
- Handle Missing Values: Impute numerical features with the mean or median; fill categorical features with the mode or an "Unknown" category; add missing-value indicators if missingness is clinically relevant.
- Encode Categorical Features: Use one-hot encoding for nominal variables; use ordinal (integer) encoding that preserves the category order for ordinal variables.
- Train Decision Tree Model: Split the data into train and test sets; fit a DecisionTreeClassifier on the training set only.
- Tune Hyperparameters: Use GridSearchCV or RandomizedSearchCV to optimize max_depth, min_samples_split, and min_samples_leaf.
- Evaluate Performance: Check accuracy, precision, recall, F1-score, and ROC-AUC; prioritize recall (sensitivity) in healthcare to minimize missed disease cases.
- Business Value: Early detection of disease risk; better resource allocation for hospitals; improved patient outcomes through timely intervention; cost savings by reducing unnecessary tests; trustworthy decision support thanks to interpretable tree rules.
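
The steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not a production pipeline: the patient table is synthetic, the column names (`age`, `blood_pressure`, `smoker`, `region`) and the rule generating the target are invented for illustration, and the parameter grid is deliberately small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a patient table (all columns are hypothetical)
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
    "region": pd.Series(rng.choice(["north", "south", "east"], n), dtype=object),
})
risk = (0.04 * (df["age"] - 50) + 0.03 * (df["blood_pressure"] - 120)
        + (df["smoker"] == "yes") * 1.0)
y = (risk + rng.normal(0, 1, n) > 0.5).astype(int)  # made-up disease label

# Inject missing values so the imputers have something to do
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "region"]

# Steps 1-2: impute and encode inside a ColumnTransformer to avoid leakage
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: preprocessing and the tree live in one pipeline
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune the tree, scoring on recall to penalize missed cases
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__min_samples_leaf": [1, 5],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate the best model on the held-out test set
y_pred = search.predict(X_test)
print("Best Parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
```

Keeping the imputers and encoder inside the pipeline means they are refit on each cross-validation fold, so statistics from validation rows never leak into training.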
