**Question 1:** **What is a Decision Tree, and how does it work in the context of classification?**

Ans: A Decision Tree is a supervised machine learning algorithm used for both
classification and regression tasks. In classification, it predicts the category
or class of a data point based on its input features.

The structure of a decision tree is similar to a flowchart. It starts at the top with a root
node, which represents the entire dataset. From there, the data is split into branches
using decision rules based on feature values. Each split leads either to a new internal
node, which applies another rule, or to a leaf node, which holds the final prediction.

How it works:
1. At each node, the algorithm evaluates candidate features and thresholds to find the
split that best separates the classes.
2. It scores candidate splits with metrics such as Gini Impurity or Information Gain.
3. This process continues recursively, creating new branches until a stopping criterion
is met (such as a maximum depth or pure leaves).

Example: Suppose we want to classify whether a person will buy a product or not
based on their age and income. The tree might first split by age (>30 or <=30), then
by income level.
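
The age and income example can be sketched in code. The snippet below is a minimal illustration, not part of the original answer: the ten records are invented toy data, and the variable names are hypothetical. It fits scikit-learn's `DecisionTreeClassifier` and prints the learned rules with `export_text`, so the flowchart-like structure is visible.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [age, income in thousands]; label 1 = buys, 0 = does not buy.
X = [
    [22, 25], [25, 90], [28, 60], [29, 100], [35, 30],
    [38, 70], [40, 80], [45, 90], [50, 35], [55, 100],
]
y = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

# Fit a classification tree using Gini impurity to score splits.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

# Print the learned decision rules as indented text (root at the top, leaves at the bottom).
rules = export_text(clf, feature_names=["age", "income"])
print(rules)

# Classify two new (hypothetical) customers.
older_high_income = clf.predict([[42, 85]])[0]
younger_high_income = clf.predict([[25, 95]])[0]
print("age=42, income=85 ->", older_high_income)
print("age=25, income=95 ->", younger_high_income)
```

Which feature the tree splits on first depends entirely on the data; with different records the root could just as well be an income rule.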

Conclusion: Decision Trees are easy to understand and interpret. They mimic human decision-making, making them popular in business and educational settings.


**Question 2:** **Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

Ans: In a decision tree, Gini Impurity and Entropy measure how
mixed the classes are at a node. The algorithm uses them to decide
where to split the data for the best classification.

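1. Gini Impurity:

● Measures the probability of wrongly classifying a randomly chosen
element if it were labeled at random according to the class distribution in the node.

● Formula: \( Gini = 1 - \sum_{i} p_i^2 \), where \( p_i \) is the proportion of
samples in the node that belong to class \( i \).

● A Gini value of 0 means the node is pure (all samples belong to one class).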

2. Entropy:

● Comes from information theory. Measures disorder or uncertainty.

● Formula: \( Entropy = -\sum_{i} p_i \log_2(p_i) \)

● Entropy is highest when classes are equally mixed.

Impact on Splits:

● The decision tree selects the feature and threshold that results in
the greatest reduction in impurity (either Gini or Entropy).

● This helps create pure child nodes where samples mostly belong
to one class.

Example: If a node has 10 class A and 10 class B samples, impurity is
high. A good split will create child nodes like one with 9A, 1B and
another with 1A, 9B.
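
The numbers in this example can be checked by hand. The sketch below is an illustration only; the helper names `gini`, `entropy`, and `weighted_impurity` are hypothetical and not from any library. It computes both impurity measures for the parent node (10 A, 10 B) and for the split into (9 A, 1 B) and (1 A, 9 B).

```python
import math


def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def weighted_impurity(children, measure):
    """Average child impurity, weighted by the number of samples in each child."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


parent = [10, 10]            # 10 class A, 10 class B
children = [[9, 1], [1, 9]]  # the split described in the example

parent_gini = gini(parent)
split_gini = weighted_impurity(children, gini)
parent_entropy = entropy(parent)
split_entropy = weighted_impurity(children, entropy)

print(f"Gini:    parent={parent_gini:.3f}, after split={split_gini:.3f}")        # 0.500 -> 0.180
print(f"Entropy: parent={parent_entropy:.3f}, after split={split_entropy:.3f}")  # 1.000 -> 0.469
```

Both measures drop sharply after the split, which is why a tree using either criterion would prefer it over a split that leaves the children evenly mixed.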

**Question 3:** **What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

Ans: Pre-Pruning (Early Stopping):

● Stops the tree from growing too large during training.

● It uses rules like max_depth, min_samples_split, or min_samples_leaf to
limit growth.

● Prevents overfitting by simplifying the tree early.

Advantage:

● Faster training time since it avoids building a large tree unnecessarily.

Post-Pruning:

● First builds a full tree, then removes branches that contribute little to predictive accuracy.

● Cost-complexity pruning is a common post-pruning method (available in scikit-learn
through the ccp_alpha parameter).

Advantage:

● Often leads to a more accurate and better-generalizing model, since pruning
decisions are made after the full tree has been grown on the training data.

Conclusion: Both methods help avoid overfitting. Pre-pruning saves training time, while
post-pruning can yield a better-performing model at the cost of growing the full tree first.
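
The two approaches can be compared side by side in scikit-learn. The following is a minimal sketch under stated assumptions: it uses the Iris dataset, the specific limits (`max_depth=3`, `min_samples_leaf=5`) are arbitrary, and the pruning strength `alpha` is picked from the middle of the pruning path purely for illustration. In practice it would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth is limited while the tree is being built.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first...
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)

# ...then compute the cost-complexity pruning path and refit with a chosen alpha.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # illustrative choice only
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(
        f"{name:12s} nodes={model.tree_.node_count:3d} "
        f"depth={model.get_depth()} test accuracy={model.score(X_test, y_test):.3f}"
    )
```

The node counts show the effect of each strategy on tree size; the test accuracies on a dataset this small should be read as indicative only.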


**Question 4:** **What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

Ans: Information Gain is a metric used to choose the feature that best splits the dataset in a Decision Tree.
It measures the reduction in entropy after a dataset is split based on a feature. The idea is that a good split gives us more “pure” groups.

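Formula: \( IG = Entropy(parent) - \sum_{k} \frac{n_k}{n} \, Entropy(child_k) \), where \( n_k \) is the number of samples in child \( k \) and \( n \) is the number of samples in the parent.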

Why it’s important:

● A higher information gain means better separation between classes.

● The tree selects the feature with the highest information gain at each step.

Example:
If we split data by the feature “Age > 30”, and this split results in two child nodes
where each node has mostly one class, the weighted entropy of the children is much lower
than the entropy of the parent, so the information gain is high.

Conclusion: Information Gain helps build trees that make better decisions by
focusing on the most informative features.
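
The "Age > 30" example can be worked through numerically. This is a small sketch with invented records; the helper functions `entropy` and `information_gain` are hypothetical names written for this illustration, not library functions.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted_child_entropy = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted_child_entropy


# Hypothetical (age, buys?) records.
records = [
    (22, "no"), (25, "no"), (28, "no"), (29, "yes"),
    (35, "yes"), (41, "yes"), (52, "yes"), (60, "no"),
]

labels = [label for _, label in records]
older = [label for age, label in records if age > 30]     # 3 yes, 1 no
younger = [label for age, label in records if age <= 30]  # 1 yes, 3 no

ig = information_gain(labels, [older, younger])
print(f"Parent entropy:   {entropy(labels):.3f}")  # 1.000
print(f"Information gain: {ig:.3f}")               # 0.189
```

A split that left both children with a 50/50 class mix would have an information gain of 0, so the tree would prefer the age split shown here.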


**Question 5:** **What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

**Applications:**

1. Healthcare: Diagnosing diseases based on symptoms.

2. Finance: Approving loans based on credit score and income.

3. Marketing: Predicting customer churn or product purchase.

4. Education: Predicting student performance.

**Advantages:**

● Easy to understand and visualize

● Can handle both numerical and categorical data (scikit-learn's implementation requires categorical features to be encoded as numbers first)

● Requires little data preprocessing (no need for normalization)

**Limitations:**

● Prone to overfitting on noisy data

● Small changes in data can change the structure drastically

● Greedy approach may not lead to the optimal tree


In [10]:
#Question 6: Python Program – Load Iris Dataset and Train Decision Tree with Gini

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=1)
# Train Decision Tree with Gini criterion
model = DecisionTreeClassifier(criterion="gini", random_state=42)
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Print results
print("Accuracy:", accuracy)
print("Feature Importances:", model.feature_importances_)

Accuracy: 0.9555555555555556
Feature Importances: [0.02146947 0.02146947 0.57196476 0.38509631]


In [11]:
# Question 7: Python Program – Compare Depth-Limited vs Fully Grown Tree
# Model with max_depth=3
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Classifier with max_depth=3
tree_depth_3 = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_depth_3.fit(X_train, y_train)

# Predict and calculate accuracy for max_depth=3
y_pred_depth_3 = tree_depth_3.predict(X_test)
accuracy_depth_3 = accuracy_score(y_test, y_pred_depth_3)

# Train a fully-grown Decision Tree Classifier
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)

# Predict and calculate accuracy for the fully-grown tree
y_pred_full_tree = full_tree.predict(X_test)
accuracy_full_tree = accuracy_score(y_test, y_pred_full_tree)

# Print the results
print(f"Accuracy of Decision Tree with max_depth=3: {accuracy_depth_3:.2f}")
print(f"Accuracy of Fully-Grown Decision Tree: {accuracy_full_tree:.2f}")

Accuracy of Decision Tree with max_depth=3: 1.00
Accuracy of Fully-Grown Decision Tree: 1.00


In [12]:
#Question 8: Python Program – Train Decision Tree on Boston Housing Dataset

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
# Load dataset (load_boston was removed in scikit-learn 1.2, so California Housing is used instead)
data = fetch_california_housing()
X = data.data
y = data.target
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
# Train model
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)
# Predict and evaluate
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
print("Feature Importances:", model.feature_importances_)

MSE: 0.5280096503174904
Feature Importances: [0.52345628 0.05213495 0.04941775 0.02497426 0.03220553 0.13901245
 0.08999238 0.08880639]


In [13]:
#Question 9: Python Program – Hyperparameter Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor # Import DecisionTreeRegressor

# Define parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 4, 6]
}

# Initialize model - Use DecisionTreeRegressor for regression task
dt = DecisionTreeRegressor(random_state=42)

# Grid search (with no `scoring` argument, GridSearchCV uses the regressor's default score, R^2)
grid = GridSearchCV(dt, param_grid, cv=3)
grid.fit(X_train, y_train)

# Results
print("Best Parameters:", grid.best_params_)
print("Best CV R^2:", grid.best_score_) # best_score_ is the mean cross-validated R^2, not an MSE

Best Parameters: {'max_depth': 5, 'min_samples_split': 4}
Best CV R^2: 0.6091798720196308
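
A note on the score reported by the grid search: when no `scoring` argument is given, `GridSearchCV` ranks a regressor by its default score, which is R^2 rather than MSE. To select and report on MSE, one option is to pass `scoring="neg_mean_squared_error"` and negate `best_score_`. The sketch below illustrates this on synthetic data from `make_regression` (an assumption made so the example runs without downloading a dataset); the parameter grid mirrors the one above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (stand-in for a real housing dataset).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

param_grid = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_split": [2, 4, 6],
}

# scikit-learn maximizes scores, so MSE is exposed as its negative.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(X_train, y_train)

best_mse = -grid.best_score_  # flip the sign back to get a positive MSE
print("Best Parameters:", grid.best_params_)
print("Best CV MSE:", best_mse)
```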


**Question 10:** Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

**Step-by-Step Process:**

**1. Handling Missing Values:**

o Use an imputation method such as SimpleImputer to fill missing values.

o Use the mean (or median) for numerical features and the most frequent value (mode) for categorical ones.

**2. Encoding Categorical Features:**

o Use OneHotEncoder for nominal features and OrdinalEncoder for ordinal ones
(LabelEncoder is intended for encoding the target labels, not input features).

**3. Training Decision Tree Model:**

o Use DecisionTreeClassifier() from scikit-learn.

o Train the model on the cleaned training split, keeping a held-out test set for evaluation.

**4. Hyperparameter Tuning:**

o Use GridSearchCV to find best max_depth, min_samples_split, etc.

o Helps avoid overfitting and underfitting.

**5. Model Evaluation:**

o Use accuracy, the confusion matrix, precision and recall, and ROC AUC to evaluate.

o Perform cross-validation for a more reliable performance estimate.
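
The five steps above can be combined into a single scikit-learn pipeline. The sketch below is one possible arrangement, not a definitive implementation: the patient data is synthetic, and the column names (`age`, `blood_pressure`, `smoker`, `sex`, `disease`) and the parameter grid are hypothetical. Putting imputation and encoding inside the pipeline means they are fitted only on the training folds during cross-validation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical patient data with mixed types.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "sex": rng.choice(["F", "M"], n),
})
risk = (0.03 * (df["age"] - 50) + 0.02 * (df["blood_pressure"] - 120)
        + (df["smoker"] == "yes") * 1.0)
df["disease"] = (risk + rng.normal(0, 0.5, n) > 0.5).astype(int)

# Inject some missing values to imitate a real dataset.
df.loc[rng.choice(n, 40, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

X = df.drop(columns="disease")
y = df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Steps 1 and 2: impute, then encode the categorical columns.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker", "sex"]),
])

# Step 3: the decision tree sits at the end of the pipeline.
pipeline = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 4: tune hyperparameters with cross-validated grid search.
param_grid = {"tree__max_depth": [2, 3, 4, 5], "tree__min_samples_split": [2, 10, 20]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("Best parameters:", search.best_params_)
print("Test ROC AUC:", round(auc, 3))
print(classification_report(y_test, search.predict(X_test)))
```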

**Business Value:**

Helps doctors prioritize patients at risk.

Enables preventive treatment by predicting disease early.

Saves cost for hospitals and improves patient care.

Interpretable models build trust with medical professionals.


**Conclusion:**
A well-tuned decision tree in healthcare can greatly improve diagnosis efficiency and
drive data-based decision-making.