**Question 1: What is a Decision Tree, and how does it work in the context of
classification?**

* Answer: A Decision Tree is a supervised learning algorithm used for classification and regression tasks. It works by recursively splitting the dataset into subsets based on feature values to create a tree-like model of decisions.

A Decision Tree is a flowchart-like structure where:

Internal nodes represent tests on features.

Branches represent outcomes of those tests.

Leaf nodes represent class labels (in classification).

Here's how it works:

The algorithm starts at the root node and selects the feature that best separates the data using an impurity measure (like Gini or Entropy).

It splits the dataset based on this feature into subsets.

This process repeats recursively for each subset until a stopping criterion is met (e.g., max depth, minimum samples per leaf, or pure nodes).

The final tree can then be used to classify new data by traversing from root to leaf based on feature values.
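
The steps above can be seen on a small fitted tree. The following is a minimal sketch, assuming scikit-learn and its bundled Iris data; `max_depth=2` is chosen here only to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each "<=" / ">" line is a test at an internal node; "class:" lines are leaves.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Classifying a sample follows one root-to-leaf path through those tests.
sample = iris.data[:1]
predicted_class = iris.target_names[clf.predict(sample)[0]]
print("Predicted class:", predicted_class)
```

Reading the printed rules from top to bottom mirrors the traversal from root to leaf described above.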

Decision Trees are intuitive and interpretable, making them popular for tasks like customer segmentation, medical diagnosis, and fraud detection.


**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**
* Answer: Gini Impurity and Entropy are metrics used to measure the impurity of a node in a Decision Tree. They guide the algorithm in selecting the best feature and threshold for splitting the data at each step.

Gini Impurity vs Entropy: Impurity Measures
Both Gini Impurity and Entropy quantify how mixed the classes are in a node:

* Gini Impurity

Measures the probability of misclassifying a randomly chosen element if it were labeled at random according to the class distribution in the node.

Lower Gini means purer nodes.

Often preferred for its computational efficiency.

* Entropy

Measures the amount of disorder or uncertainty.

Used with Information Gain to decide splits.

Higher entropy means more mixed classes.
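
Both measures can be computed directly from the class counts in a node. The sketch below uses the standard definitions, Gini = 1 - Σ p² and Entropy = Σ p · log2(1/p), where p is the fraction of samples in each class; the function names and example counts are illustrative only.

```python
import math

def gini(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy (in bits) of a node, given the number of samples in each class."""
    total = sum(counts)
    # Empty classes contribute nothing, so they are skipped to avoid log(0).
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

mixed = [5, 5]   # evenly mixed node
pure = [10, 0]   # pure node

print("mixed:", gini(mixed), entropy(mixed))
print("pure: ", gini(pure), entropy(pure))
```

For two classes, both measures are zero for a pure node and reach their maximum (0.5 for Gini, 1 bit for Entropy) when the classes are evenly mixed.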

***********

Impact on Splits
At each node, the algorithm evaluates all possible splits and selects the one that maximizes purity (i.e., minimizes impurity).

Gini and Entropy often lead to similar splits, but Gini tends to be slightly faster to compute because it does not require logarithms.

The chosen impurity measure affects the shape and depth of the tree, and potentially its performance.
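
To make the split search concrete, here is a simplified sketch of how the best threshold on a single numeric feature could be found by minimizing the weighted Gini impurity of the two child nodes. The helper `best_threshold` and the toy data are hypothetical; real libraries use more efficient implementations and search over every feature.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate two equal values
        # Candidate thresholds are midpoints between consecutive sorted values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: one feature, two classes.
lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = ["a", "a", "a", "b", "b", "b"]

threshold, score = best_threshold(lengths, species)
print("Best threshold:", threshold, "weighted Gini:", score)
```

Swapping `gini` for an entropy function would turn the same loop into an Information Gain search.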


**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

Answer: Pre-Pruning (Stop Early)
What it means: We stop growing the tree before it gets too big.

How it works: We set rules like:

“Don’t go deeper than 5 levels.”

“Only split if there are at least 10 data points.”

Why it’s useful: ✅ It makes the tree faster and easier to understand.

Post-Pruning (Trim Later)
What it means: We let the tree grow fully, then cut off parts that don’t help.

How it works: We test the tree on held-out validation data and remove branches that don’t improve accuracy.

Why it’s useful: ✅ It helps the tree make better predictions on new data.

Quick Example
Imagine you’re teaching a kid to recognize fruits:

Pre-Pruning: You stop teaching after a few examples to keep it simple.

Post-Pruning: You teach everything, then remove confusing parts later.
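
Both strategies can be tried in scikit-learn. The sketch below is one possible setup, not the only one: pre-pruning uses the example limits from above (`max_depth=5`, `min_samples_split=10`), and post-pruning uses scikit-learn's cost-complexity pruning (`ccp_alpha`), which is a different technique from removing branches one by one but serves the same purpose. The validation split and the tie-breaking rule are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Pre-pruning: growth limits are fixed before training starts.
pre_pruned = DecisionTreeClassifier(
    max_depth=5, min_samples_split=10, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then choose a pruning strength.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # ties go to the larger alpha, i.e. the smaller tree
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
post_pruned.fit(X_train, y_train)

print("Leaves (full, pre-pruned, post-pruned):",
      full_tree.get_n_leaves(), pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())
```

In practice the pruning strength would usually be chosen with cross-validation rather than a single validation split.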


**Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

Answer: Information Gain tells us how useful a feature is for splitting the data in a Decision Tree.

Imagine you’re trying to sort a mixed basket of fruits (apples and oranges). You want to ask a question that helps you separate them better.

If you ask “Is it round?” and it doesn’t help much, that’s low Information Gain.

If you ask “Is it orange in color?” and it perfectly splits apples and oranges, that’s high Information Gain.

Why is it Important?
It helps the tree choose the best question to ask at each step.

The goal is to make groups that are as pure as possible (mostly one type of thing).

More Information Gain = better split = smarter tree!

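Simple Example
Let’s say you’re sorting fruits:

Before splitting: 5 apples, 5 oranges (mixed group).

After splitting by color:

Group 1: 5 apples

Group 2: 5 oranges

Result: both groups are pure, so entropy falls from 1 bit to 0 and the Information Gain is 1 bit, the highest possible for two evenly mixed classes.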
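
The fruit example can be checked numerically. This is a small sketch using the entropy-based definition of Information Gain (parent entropy minus the size-weighted entropy of the child groups). The counts for the "Is it round?" split are hypothetical, chosen so that the split leaves both groups as mixed as before.

```python
import math

def entropy(counts):
    """Entropy (in bits) from per-class sample counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child groups."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Counts are [apples, oranges].
parent = [5, 5]

# "Is it orange in color?" separates the fruits perfectly.
gain_color = information_gain(parent, [[5, 0], [0, 5]])

# "Is it round?" (hypothetical counts): both groups stay evenly mixed.
gain_round = information_gain(parent, [[3, 3], [2, 2]])

print("Gain from color:", gain_color)
print("Gain from roundness:", gain_round)
```

The color split recovers the full 1 bit of uncertainty, while the roundness split gains nothing, which is why the tree would pick color.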


**Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

Answer: Real-Life Uses of Decision Trees
Decision Trees are like flowcharts that help make decisions. Here’s where they’re used:

Hospitals: To help doctors figure out what illness a patient might have.

Banks: To decide if someone should get a loan or not.

Online Shopping: To suggest products based on what you like.

Email Apps: To sort emails into spam or not spam.

Telecom Companies: To guess which customers might leave soon.

Advantages (Why People Like Them)
Easy to Understand: You can follow the steps like a game of 20 questions.

Works with All Kinds of Data: Numbers and categories both work, although some libraries (such as scikit-learn) need categories encoded as numbers first.

No Fancy Math Needed: You don’t need to scale or change your data much.

Fast to Use: They don’t take too long to train.

Limitations (The Not-So-Great Parts)
Can Overthink: Sometimes they make the tree too big and memorize the training data (this is called overfitting).

Easily Confused: A small change in data can make a very different tree.

Not Always the Best Alone: They might not be as accurate as other models unless combined (like in Random Forests).
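
One way to check the last limitation on a given dataset is to compare a single tree against a Random Forest under cross-validation. The sketch below assumes scikit-learn and uses Iris only because it is bundled; on such a small, easy dataset the gap may be negligible, so the numbers should not be read as a general result.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation gives one accuracy score per fold for each model.
tree_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5)

print(f"Single tree:   mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
print(f"Random forest: mean={forest_scores.mean():.3f}, std={forest_scores.std():.3f}")
```

The standard deviation across folds also gives a rough sense of how sensitive each model is to changes in the training data.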


Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV).

**Question 6: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances**



In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Model Accuracy:", accuracy)
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


**Question 7: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.**


In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree with max_depth=3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Train fully-grown Decision Tree (no max_depth)
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print the comparison
print("Accuracy with max_depth=3:", accuracy_limited)
print("Accuracy with fully-grown tree:", accuracy_full)


Accuracy with max_depth=3: 1.0
Accuracy with fully-grown tree: 1.0


**Question 8: Write a Python program to:

● Load the Boston Housing Dataset

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances**


In [4]:
# Note: load_boston was removed in scikit-learn 1.2 because of ethical problems
# with the dataset, so importing it raises an ImportError. This cell substitutes
# the California housing dataset, one of the alternatives recommended by the
# scikit-learn maintainers.
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California housing dataset (downloaded on first use)
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict and calculate Mean Squared Error
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Print results
print("Mean Squared Error (MSE):", mse)
print("Feature Importances:")
for feature, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")

**Question 9: Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV

● Print the best parameters and the resulting model accuracy**

In [5]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# Create a Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Use GridSearchCV to find the best parameters
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best model and evaluate it
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Model Accuracy:", accuracy)


Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Model Accuracy: 1.0


**Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.**

* Answer: Handle Missing Values

Numerical Features: Fill missing values using the mean or median.

In [8]:
# X_num and X_cat are placeholders for the numerical and categorical columns
# of the patient dataset; they must be defined before these cells will run.
from sklearn.impute import SimpleImputer
num_imputer = SimpleImputer(strategy='mean')
X_num = num_imputer.fit_transform(X_num)
# Categorical Features: Fill missing values using the most frequent category.
cat_imputer = SimpleImputer(strategy='most_frequent')
X_cat = cat_imputer.fit_transform(X_cat)

In [9]:
# Encode Categorical Features
# Use One-Hot Encoding for nominal categories (no order).
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
X_cat_encoded = encoder.fit_transform(X_cat)

In [None]:
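# Train a Decision Tree Model
# Combine numerical and encoded categorical features, then hold out a test set.
# y is a placeholder for the disease labels and must be defined beforehand.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = np.hstack([X_num, X_cat_encoded.toarray()])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Train the model:
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)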


In [None]:
#Tune Hyperparameters
#Use GridSearchCV to find the best values for max_depth, min_samples_split, etc.
from sklearn.model_selection import GridSearchCV
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_


In [None]:
#Evaluate Performance
# Use metrics like accuracy, precision, recall, and F1-score.
from sklearn.metrics import classification_report
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
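
The steps above can also be combined into a single scikit-learn pipeline, so that imputation and encoding are fitted only on the training folds. The following is a sketch under stated assumptions: the patient data is synthetic, every column name (`age`, `blood_pressure`, `smoker`, `sex`) and the rule generating the label are made up for illustration, and F1 is just one reasonable scoring choice for a disease classifier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical patient data; all columns and values are synthetic.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "sex": rng.choice(["F", "M"], n),
})
# Made-up label rule, used only so the model has something to learn.
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Remove some values to simulate missing data.
df.loc[rng.choice(n, 30, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "sex"]

# Impute numeric columns; impute then one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42)

# Tune the tree's hyperparameters with 5-fold cross-validation.
param_grid = {
    "tree__max_depth": [3, 5, 10, None],
    "tree__min_samples_split": [2, 5, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Evaluate the best model on the held-out test set.
y_pred = search.predict(X_test)
print("Best Parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
```

Keeping preprocessing inside the pipeline means the test set never influences the imputation values or the encoder categories, which gives a more honest performance estimate.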


Business Value in Healthcare
Early Detection: Helps doctors identify high-risk patients quickly.

Personalized Treatment: Supports tailored care plans based on patient data.

Resource Optimization: Prioritizes patients who need urgent attention.

Decision Support: Assists clinicians with data-driven insights.

Compliance & Auditing: Provides transparent, interpretable decisions for regulatory review.