# Question 1: What is a Decision Tree, and how does it work in the context of classification?

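**Ans:-** A Decision Tree is a supervised learning algorithm used for both classification and regression. For classification, it recursively splits the data into subsets based on feature values, forming a tree of nodes and branches. Each internal node represents a test on a feature, each branch an outcome of that test, and each leaf node a class label. The tree is built greedily: at every node it selects the split that best separates the classes according to an impurity measure such as Gini or entropy, and it stops when a stopping criterion is met (for example, the node is pure or a maximum depth is reached). A new sample is classified by following the tests from the root down to a leaf.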
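The node, branch, and leaf structure described above can be inspected directly. The following is a minimal sketch, assuming scikit-learn and the Iris data; the depth limit of 2 and `random_state=0` are choices made here only to keep the printed rules short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (depth 2 is an illustrative choice)
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each indented line is either a feature test (internal node) or a class (leaf)
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom follows the same root-to-leaf path that the classifier takes for a new sample.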

# Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

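**Ans:-** Gini Impurity measures the probability of misclassifying a randomly chosen element if it were labeled at random according to the class distribution in the node: $G = 1 - \sum_i p_i^2$, where $p_i$ is the fraction of samples in class $i$. It is 0 for a pure node, so lower Gini means purer splits.

Entropy quantifies the disorder or uncertainty of the class distribution: $H = -\sum_i p_i \log_2 p_i$. It is also 0 for a pure node and reaches its maximum when all classes are equally likely.

Both are used to evaluate splits: for each candidate split, the algorithm computes the weighted impurity of the child nodes and chooses the split that reduces impurity the most, resulting in more homogeneous branches. In practice the two measures usually select similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.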
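To make the two measures concrete, here is a small sketch in plain Python. The helper names `gini` and `entropy` and the toy label lists are illustrative, not part of any library.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Fraction of samples belonging to each class in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))


pure = ["a", "a", "a", "a"]    # one class only
mixed = ["a", "a", "b", "b"]   # two classes, evenly mixed

print(gini(pure), entropy(pure))    # 0.0 0.0
print(gini(mixed), entropy(mixed))  # 0.5 1.0
```

Both measures are zero for the pure node and largest for the evenly mixed one, which is why a split that produces purer children scores better under either criterion.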

# Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

**Ans:-** Pre-Pruning: Limits the tree's growth while it is being built (e.g., by setting a maximum depth or a minimum number of samples per leaf). Advantage: Training is faster and cheaper because the tree never grows large, and stopping early reduces overfitting.

Post-Pruning: Trims branches after a large tree has been built, removing subtrees that contribute little predictive power. Advantage: The tree is first allowed to find useful deep splits and is then simplified based on validation performance, so a helpful split is less likely to be cut off prematurely.
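Both strategies can be sketched with scikit-learn. Pre-pruning uses constructor limits, and post-pruning is shown here through cost-complexity pruning (`ccp_alpha`), which is one post-pruning technique among several. The split sizes, the limits, and the rule of picking the alpha with the best validation accuracy are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Pre-pruning: growth is capped before the tree is built
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: compute the pruning path of a fully grown tree,
# then refit one tree per candidate alpha
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
alphas = np.clip(path.ccp_alphas, 0.0, None)  # guard against tiny negative values
candidates = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=float(a)).fit(X_train, y_train)
    for a in alphas
]

# Keep the tree with the best validation accuracy; reversed() prefers the
# more heavily pruned tree when several candidates tie
post = max(reversed(candidates), key=lambda tree: tree.score(X_val, y_val))

print("Pre-pruned leaves:", pre.get_n_leaves())
print("Fully grown leaves:", candidates[0].get_n_leaves())
print("Post-pruned leaves:", post.get_n_leaves())
```

Because the validation split is used to choose the alpha, a separate test set would be needed for an unbiased final accuracy estimate.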

# Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

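**Ans:-** Information Gain quantifies the reduction in impurity (classically, entropy) achieved by splitting a node on a certain feature. It is the parent node's entropy minus the weighted average entropy of the child nodes: $IG = H(\text{parent}) - \sum_k \frac{n_k}{n} H(\text{child}_k)$, where $n_k$ is the number of samples in child $k$ and $n$ the number in the parent. At each node the tree evaluates the candidate splits and picks the one with the highest gain, which gives the purest children and therefore the clearest separation between classes.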
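The calculation can be worked through on a toy node. This is a sketch in plain Python; the function names and the example labels are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely
perfect_split = [["yes"] * 5, ["no"] * 5]

# A split whose children are as mixed as the parent
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "yes", "no", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0, all uncertainty removed
print(information_gain(parent, useless_split))  # approximately 0, nothing learned
```

A tree using the entropy criterion would prefer the first split because it has the larger gain.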

# Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

**Ans:-** Applications: Medicine (diagnosis), finance (credit scoring), retail (customer segmentation), marketing, web services, etc.

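Advantages: Interpretable (the learned rules can be read and explained), can work with both categorical and numerical data, and needs no feature scaling because splits depend only on the ordering of values.

Limitations: Prone to overfitting when grown without limits, unstable (small changes in the data can produce a very different tree), and often less accurate than ensemble methods such as random forests, since a single tree approximates smooth relationships with step-like, axis-aligned splits.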






In [4]:
# Question 6:
'''
 Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances
'''

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data
data = load_iris()
X, y = data.data, data.target

# Train model
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X, y)

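# Predictions on the same data the model was trained on, so the value below is
# training accuracy, not an estimate of performance on unseen data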
y_pred = clf.predict(X)
accuracy = accuracy_score(y, y_pred)
print("Accuracy:", accuracy)
print("Feature Importances:", clf.feature_importances_)

Accuracy: 1.0
Feature Importances: [0.         0.01333333 0.56405596 0.42261071]
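The accuracy of 1.0 above is measured on the training data, which a fully grown tree can usually memorize. A hedged variant that holds out a test set is sketched below; the 70/30 split and `random_state=42` are arbitrary choices, so the printed numbers will differ from those shown above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()

# Hold out 30% of the samples; stratify keeps the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42, stratify=data.target
)

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Accuracy on samples the tree has never seen
test_accuracy = accuracy_score(y_test, clf.predict(X_test))
print("Test accuracy:", test_accuracy)

# Pair each importance with its feature name for readability
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```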


In [5]:
#  Question 7:
'''
 Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
'''

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Fully grown tree (no depth limit)
clf_full = DecisionTreeClassifier()
clf_full.fit(X, y)
# Scored on the training data, so this is training accuracy
acc_full = accuracy_score(y, clf_full.predict(X))

# Depth-limited tree (max_depth=3 is a form of pre-pruning)
clf_pruned = DecisionTreeClassifier(max_depth=3)
clf_pruned.fit(X, y)
acc_pruned = accuracy_score(y, clf_pruned.predict(X))

print("Full Tree Accuracy:", acc_full)
print("Pruned Tree Accuracy:", acc_pruned)


Full Tree Accuracy: 1.0
Pruned Tree Accuracy: 0.9733333333333334
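Because both scores above come from the training data, the full tree wins almost by construction. A fairer comparison uses cross-validation, sketched below; the 5 folds and `random_state=0` are assumptions made for this example, and no particular result is implied.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Full tree": DecisionTreeClassifier(random_state=0),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
}

cv_means = {}
for name, model in models.items():
    # Each fold is scored on samples that were not used to fit that fold's tree
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {scores.mean():.3f} (std {scores.std():.3f})")
```

If the two means are close, the shallower tree is usually the better choice because it is simpler to read and less likely to overfit.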


In [8]:
#Question 8:
'''
Write a Python program to:
● Load the California Housing Dataset (as Boston Housing is deprecated)
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
'''

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train model
reg = DecisionTreeRegressor()
reg.fit(X, y)

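# Predictions on the training data: an unrestricted tree can fit it almost
# exactly, so the near-zero MSE below is training error, not test error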
y_pred = reg.predict(X)
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error:", mse)
print("Feature Importances:", reg.feature_importances_)

Mean Squared Error: 1.0070971343301193e-31
Feature Importances: [0.52421567 0.05084302 0.05330042 0.02782791 0.03172282 0.13181295
 0.09460417 0.08567304]
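The gap between training and test error for an unrestricted regressor can be shown with a held-out split. The sketch below uses synthetic data from `make_regression` as a stand-in so that it runs without downloading anything; the same pattern applies to `fetch_california_housing`. The sample count, noise level, and split are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a housing table: 8 numeric features, noisy target
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

reg = DecisionTreeRegressor(random_state=0)
reg.fit(X_train, y_train)

# Training error is close to zero because each training point gets its own leaf
train_mse = mean_squared_error(y_train, reg.predict(X_train))
# Test error shows how well the tree generalizes
test_mse = mean_squared_error(y_test, reg.predict(X_test))

print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
```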


In [9]:
# Question 9:
'''
Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy
'''

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
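from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Candidate values for the two hyperparameters being tuned
param_grid = {'max_depth': [2, 3, 4, 5], 'min_samples_split': [2, 4, 6]}
# 5-fold cross-validation over all 12 combinations
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X, y)

print("Best Parameters:", grid.best_params_)
# best_score_ is the mean cross-validated accuracy of the best combination
print("Best Accuracy:", grid.best_score_)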

Best Parameters: {'max_depth': 3, 'min_samples_split': 6}
Best Accuracy: 0.9733333333333334
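The cross-validated score above was also used to pick the parameters, so it can be slightly optimistic. One common remedy, sketched here under the assumption of a 75/25 split and fixed seeds, is to run the search on a training portion only and score the refitted best model on a test portion it never saw.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_split": [2, 4, 6]}

# The search only ever sees the training portion
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)

# GridSearchCV refits the best model on all of X_train by default (refit=True),
# so score() here evaluates that refitted model on the held-out test portion
test_accuracy = grid.score(X_test, y_test)

print("Best parameters:", grid.best_params_)
print("Mean CV accuracy:", grid.best_score_)
print("Held-out test accuracy:", test_accuracy)
```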


# Question 10:
Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world setting.


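**Ans:-** The workflow below covers missing values, encoding, training, tuning, and evaluation, followed by the business value.

Step 1: Handle missing values – Impute numeric features with the mean or median and categorical features with the most frequent value (mode). Fit the imputer on the training split only, so that no information from the test data leaks into the model.

Step 2: Encode categorical features – Apply one-hot encoding to nominal features and label (ordinal) encoding to features that have a natural order.

Step 3: Train Decision Tree model – Choose a criterion (e.g., Gini) and fit the tree on the training split.

Step 4: Tune hyperparameters – Run a cross-validated grid search over max_depth and min_samples_split.

Step 5: Evaluate performance – Report accuracy, precision, recall, and ROC-AUC on a held-out test set. For disease prediction, recall deserves particular attention because a missed case (false negative) is usually costlier than a false alarm.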
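The five steps above can be combined into a single scikit-learn pipeline. The following is a sketch under stated assumptions: the patient table, its column names (`age`, `blood_pressure`, `smoker`, `region`), and the label rule are all invented for illustration, and median imputation, one-hot encoding, and ROC-AUC as the tuning metric are example choices rather than requirements.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical patient data (synthetic, for illustration only)
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
# Invented label rule so the example has something to learn
y = ((df["age"] > 60) | (df["smoker"] == "yes")).astype(int)

# Introduce missing values in one numeric and one categorical column
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "region"]

# Steps 1 and 2: impute, then encode, separately per column type
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: the tree is the final stage, so preprocessing is fit on training folds only
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(criterion="gini", random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0, stratify=y
)

# Step 4: cross-validated grid search over the tree's hyperparameters
param_grid = {"tree__max_depth": [2, 3, 4, 5], "tree__min_samples_split": [2, 4, 6]}
grid = GridSearchCV(model, param_grid, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = grid.predict(X_test)
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", auc)
```

Keeping the imputers and the encoder inside the pipeline means they are refit within every cross-validation fold, which avoids leaking statistics from validation or test rows into training.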

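Business Value: The model can help doctors identify the disease earlier, prioritize patient care, and allocate medical resources more effectively, and it can help insurers manage risk. Because a decision tree's rules are readable, clinicians can also inspect why a given patient was flagged.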