1. What is a Decision Tree, and how does it work in the context of classification?
   - A Decision Tree is a type of machine learning model used to make decisions or predictions based on data. In classification, it works by splitting the data into smaller groups based on rules or conditions on the features of the data. The tree starts at a root node, which tests one feature, and branches into different paths depending on the outcome of that test. Each internal node represents a decision based on a feature, and each leaf node represents a final class or outcome. The goal of the Decision Tree is to divide the data in a way that separates the different classes as clearly as possible, helping to make accurate predictions for new data.
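
To make the root / internal node / leaf vocabulary concrete, here is a minimal hand-written sketch. The feature names (`petal_length`, `petal_width`), the thresholds, and the `predict` helper are illustrative assumptions, not values learned from data or part of any library.

```python
# A tiny hand-built tree stored as nested dicts.
# Thresholds are made up for illustration, not learned from data.
tree = {
    "feature": "petal_length", "threshold": 2.5,   # root node
    "left": {"label": "setosa"},                   # leaf node
    "right": {                                     # internal node
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},           # leaf node
        "right": {"label": "virginica"},           # leaf node
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))
```

A trained classifier does the same walk; the difference is that it chooses the features and thresholds itself from the training data.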

2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
   - Gini Impurity and Entropy are measures used in Decision Trees to quantify how mixed or impure the data at a node is. Gini Impurity shows how often a randomly chosen item would be incorrectly classified if it were labeled according to the class distribution in the group, while Entropy measures the amount of uncertainty or disorder in the data. Both describe how pure a node is: lower values mean the data mostly belongs to one class, and a value of 0 means the node contains a single class. When building a Decision Tree, the algorithm evaluates candidate splits with one of these measures and picks the split whose child groups are as pure as possible. In short, Gini Impurity and Entropy help the tree choose the most effective splits to improve classification accuracy.
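
As a small sketch of the two measures, the standard definitions are Gini = 1 - sum(p_i^2) and Entropy = -sum(p_i * log2(p_i)), where p_i is the fraction of samples in class i. The helper names below (`class_probabilities`, `gini`, `entropy`) are my own and the label lists are toy data.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Fraction of samples belonging to each class present in `labels`."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))

pure = ["a", "a", "a", "a"]    # one class only
mixed = ["a", "a", "b", "b"]   # evenly mixed, two classes

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

For two classes, Gini peaks at 0.5 and Entropy at 1 bit, both at an even 50/50 mix; both are 0 for a pure node.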


3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
   - Pre-pruning and post-pruning are techniques used to prevent a Decision Tree from becoming too complex and overfitting the data. Pre-pruning stops the tree from growing once it reaches a certain condition, such as a maximum depth or a minimum number of samples in a node. This means the tree is simplified during its construction. A practical advantage of pre-pruning is that it saves time and computational resources since the tree stops growing early. Post-pruning, on the other hand, allows the tree to grow fully and then removes or trims branches that do not improve accuracy on a validation dataset. A practical advantage of post-pruning is that it often results in better model performance because it considers the full tree before deciding which parts to remove.
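
The sketch below contrasts the two approaches in scikit-learn. Note one assumption: scikit-learn does not implement the validation-set pruning described above; its built-in post-pruning is cost-complexity pruning via `ccp_alpha`, which is used here as a stand-in. The alpha is picked from the middle of the pruning path purely for illustration; in practice it would be chosen with a validation set or cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: limit growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=2, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first, then cut it back.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # illustrative choice only
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre_pruned),
                    ("post-pruned", post_pruned)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```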

4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
   - Information Gain is a measure used in Decision Trees to decide which feature to split on at each step. It tells us how much 'information' or certainty we gain about the target class after splitting the data based on a particular feature. In simple terms, it compares the impurity of the parent node with the impurity of the child nodes created after the split. A higher Information Gain means that the split makes the data more organized and helps the tree classify examples more accurately. It is important because it guides the tree to choose the feature that best separates the classes, leading to more accurate and efficient decision making.
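
A worked sketch of the comparison between parent and child impurity, using entropy as the impurity measure: Information Gain = Entropy(parent) - weighted average of Entropy(children), weighted by child size. The function names and the toy "yes"/"no" labels are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split whose children are just as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"],
                 ["yes", "yes", "yes", "no", "no", "no"]]

print("perfect split gain:", information_gain(parent, perfect_split))
print("useless split gain:", information_gain(parent, useless_split))
```

The tree-building algorithm would prefer the first split, since it removes all of the parent's uncertainty, while the second removes none.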

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
   - Decision Trees are widely used in many real-world applications because they are easy to understand and interpret. Some common uses include medical diagnosis, where they help doctors estimate the likelihood of a disease; finance, for assessing credit risk or detecting fraud; marketing, for predicting customer behavior; and manufacturing, for quality control and process optimization. The main advantages of Decision Trees are that they are simple to visualize, handle both numerical and categorical data, and don’t require much data preparation. However, their main limitations are that they can easily overfit the data, especially if the tree is too deep, and they can be sensitive to small changes in the data, which might lead to very different trees. Despite these drawbacks, Decision Trees remain popular due to their clarity and effectiveness in many practical problems.


In [4]:
# Question 6: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = load_iris()
X = df.data
y = df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Decision Tree Classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# feature importances
print("Feature Importances:")
for name, importance in zip(df.feature_names, clf.feature_importances_):
    print(f"{name}: {round(importance,2)}")


Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0
sepal width (cm): 0.02
petal length (cm): 0.89
petal width (cm): 0.09


In [6]:
# Question 7: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
# a fully-grown tree.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = load_iris()
X = df.data
y = df.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Classifier with max_depth=3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)

# Train a fully-grown Decision Tree Classifier
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)

# Make predictions
y_pred_limited = tree_limited.predict(X_test)
y_pred_full = tree_full.predict(X_test)

# Calculate accuracies
accuracy_limited = accuracy_score(y_test, y_pred_limited)
accuracy_full = accuracy_score(y_test, y_pred_full)

print("Decision Tree (max_depth=3) Accuracy:", accuracy_limited)
print("Fully-Grown Decision Tree Accuracy:", accuracy_full)



Decision Tree (max_depth=3) Accuracy: 1.0
Fully-Grown Decision Tree Accuracy: 1.0
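
Both trees score 1.0 on this particular 45-sample test split, so the split alone cannot tell them apart. As a hedged follow-up sketch (not part of the original question), 5-fold cross-validation on the full dataset gives a less split-dependent comparison, and `get_depth()` / `get_n_leaves()` show how much the trees differ in size. The `results` dictionary is a name introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

results = {}
for label, depth in [("max_depth=3", 3), ("fully grown", None)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Mean accuracy over 5 folds instead of a single train/test split.
    scores = cross_val_score(model, X, y, cv=5)
    # Fit on all data only to inspect the resulting tree size.
    model.fit(X, y)
    results[label] = {
        "cv_accuracy": float(scores.mean()),
        "depth": model.get_depth(),
        "leaves": model.get_n_leaves(),
    }
    print(label, results[label])
```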


In [8]:
# Question 8: Write a Python program to:
# ● Load the Boston Housing Dataset
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances


from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

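# Load the dataset.
# Note: the Boston Housing loader (load_boston) was removed in scikit-learn 1.2,
# so the California Housing dataset is used here as a substitute.
df = fetch_california_housing()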
X = df.data
y = df.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Make predictions
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("\nFeature Importances:")
for name, importance in zip(df.feature_names, regressor.feature_importances_):
    print(f"{name}: {round(importance,2)}")


Mean Squared Error (MSE): 0.5280096503174904

Feature Importances:
MedInc: 0.52
HouseAge: 0.05
AveRooms: 0.05
AveBedrms: 0.02
Population: 0.03
AveOccup: 0.14
Latitude: 0.09
Longitude: 0.09


In [10]:
# Question 9: Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using
# GridSearchCV
# ● Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
df = load_iris()
X = df.data
y = df.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Define the hyperparameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10, 15]
}

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy with Best Parameters:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Model Accuracy with Best Parameters: 1.0


10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.
    - To build a Decision Tree model for predicting whether a patient has a disease, the first step is to identify the missing values and impute them appropriately: numerical features can be filled with the mean or median, while categorical features can use the mode or a new “Unknown” category. Next, categorical features must be encoded into numeric form, using one-hot encoding for nominal variables and ordinal encoding for features with a natural order. The dataset is split into training and testing sets, with the imputation and encoding statistics learned from the training data only so that no information leaks from the test set, and a Decision Tree Classifier is trained on the training data, which also lets us examine feature importances. To improve performance and prevent overfitting, hyperparameters such as max_depth, min_samples_split, and min_samples_leaf are tuned using GridSearchCV or RandomizedSearchCV with cross-validation. The model is then evaluated on the test set using accuracy, precision, recall, F1-score, and ROC-AUC; precision, recall, and F1-score are especially important if the disease is rare, because accuracy alone can look high for a model that misses most positive cases. In a real-world healthcare setting, this model can provide business value by identifying high-risk patients early, guiding preventive care, reducing healthcare costs, and offering interpretable insights into which patient factors contribute most to disease risk, ultimately supporting better clinical decisions and resource allocation.
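
The steps above can be sketched end to end as follows. Everything about the data is an assumption: the table is synthetic, the column names (`age`, `blood_pressure`, `smoker`, `region`) and the rule that generates the label are invented for illustration, and the hyperparameter grid and the choice of recall as the tuning metric are examples rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and label rule).
rng = np.random.default_rng(42)
n = 400
data = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
y = ((data["age"] > 60) & (data["smoker"] == "yes")).astype(int)
# Knock out some values to simulate missing data.
data.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
data.loc[rng.choice(n, 40, replace=False), "region"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "region"]

# Steps 1-2: impute and encode inside a pipeline, so the statistics
# are learned from the training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

# Step 3: stratified split keeps the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    data, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with cross-validation.
param_grid = {"tree__max_depth": [3, 5, None],
              "tree__min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
print("ROC-AUC:", auc)
```

Wrapping the preprocessing in the pipeline means `GridSearchCV` refits the imputers and encoder on each training fold, which avoids leaking information from the validation folds.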