# Decision Tree | Assignment

1. What is a Decision Tree, and how does it work in the context of
classification?
- A Decision Tree is a supervised machine learning algorithm used for classification and regression problems. It works by splitting the dataset into smaller subsets based on feature values.

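In classification, the tree starts with a root node that represents the entire dataset. It then selects the best feature (and split point) to divide the data, based on a criterion like Gini Impurity or Entropy. Each split creates branches, and the process repeats on each branch until a stopping condition is met, such as a maximum depth or a node that contains only one class.
The final nodes are called leaf nodes, and each leaf node represents a class label. To classify a new sample, the tree routes it from the root down to a leaf by testing one feature at each node, then returns that leaf's label.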

Decision Trees are easy to understand and visualize, making them very useful for explaining model decisions.
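
As a rough illustration of the splitting process described above, the sketch below fits a shallow tree on the Iris dataset and prints its learned rules as text. The `max_depth=2` setting and the variable names are assumptions chosen to keep the output short, not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read.
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# export_text lists each split (internal node) and the class predicted at each leaf.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the printout is one test on a feature, and each `class:` line is a leaf.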

2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

- Gini Impurity and Entropy are measures used to check how “pure” or “impure” a node is.

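- Gini Impurity measures the probability that a randomly chosen sample from the node would be misclassified if it were labeled at random according to the node's class distribution. It is computed as 1 - Σ p_k², where p_k is the fraction of samples in class k.

- A lower Gini value means a purer node; a value of 0 means every sample belongs to one class.

- Entropy measures the level of disorder or uncertainty in the node. It is computed as -Σ p_k log2(p_k).

- Lower entropy means less uncertainty; a value of 0 means the node is pure.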

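When building a Decision Tree, the algorithm evaluates the candidate splits and chooses the one that reduces impurity the most.
Better splits result in child nodes where most samples belong to one class, improving classification accuracy. In practice the two criteria usually lead to similar trees; Gini is slightly cheaper to compute because it avoids the logarithm.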
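
The two measures can be computed by hand from the class counts in a node. The sketch below is a minimal illustration; the helper names `gini` and `entropy` and the example nodes are hypothetical, chosen only to show how a mixed node compares with a pure one.

```python
import math

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits; classes with zero samples are skipped."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical nodes, given as per-class sample counts.
mixed_node = [5, 5]   # evenly mixed: Gini 0.5, entropy 1.0 bit
pure_node = [10, 0]   # single class: both measures are zero

print("mixed:", gini(mixed_node), entropy(mixed_node))
print("pure: ", gini(pure_node), entropy(pure_node))
```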



3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

- Pre-Pruning stops the tree from growing too deep by setting limits during training, such as maximum depth or minimum samples per split.

- Advantage: Reduces overfitting and makes the model faster to train, because the tree is never grown to its full size.

- Post-Pruning allows the tree to grow fully and then removes branches that contribute little to predictive performance.

- Advantage: Often gives better performance because pruning decisions are made after seeing the full tree, so a useful split that sits below a weak-looking one is not cut off early.
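
A minimal sketch of both approaches in scikit-learn follows. Pre-pruning uses `max_depth` and `min_samples_split`; for post-pruning it uses minimal cost-complexity pruning (`ccp_alpha`), which is one post-pruning technique among several. Picking the middle value of the pruning path is an arbitrary assumption made for illustration; in practice the value would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: limit growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune it back.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take a mid-range alpha from the pruning path.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, tree in [
    ("full", full_tree),
    ("pre-pruned", pre_pruned),
    ("post-pruned", post_pruned),
]:
    print(name, "nodes:", tree.tree_.node_count,
          "test accuracy:", tree.score(X_test, y_test))
```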

4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

- Information Gain measures how much uncertainty is reduced after splitting the data on a particular feature.
It is calculated as the difference between the entropy before the split and the weighted entropy after the split.

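The feature with the highest Information Gain is chosen for splitting because it produces the purest child nodes at that step. The choice is greedy: it is the best split at the current node, not necessarily the one that leads to the best overall tree.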
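
The calculation described above can be worked through on a small example. In the sketch below, the helper names and the sample counts are hypothetical; the split sends 10 samples (5 per class) into two children of 5 samples each.

```python
import math

def entropy(counts):
    """Shannon entropy in bits; classes with zero samples are skipped."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical split, given as per-class sample counts.
parent = [5, 5]              # entropy = 1.0 bit
children = [[4, 1], [1, 4]]  # each child has entropy of about 0.722 bits

gain = information_gain(parent, children)
print(round(gain, 3))  # 0.278
```

A split that separated the classes perfectly, `[[5, 0], [0, 5]]`, would give a gain of 1.0, the full entropy of the parent.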

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

- Applications:
  - Medical diagnosis
  - Credit risk assessment
  - Spam email detection
  - Customer churn prediction
  - Fraud detection

- Advantages:
  - Easy to understand and interpret
  - Works with both numerical and categorical data
  - Requires little data preprocessing

- Limitations:
  - Can easily overfit the data
  - Sensitive to small changes in data
  - Often less accurate than ensemble methods like Random Forests

In [3]:
# 6. Write a Python program to:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = DecisionTreeClassifier(criterion="gini")
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Feature importances
print("Feature Importances:", model.feature_importances_)

Accuracy: 1.0
Feature Importances: [0.01667014 0.         0.90614339 0.07718647]


In [4]:
# 7 Write a Python program to:
# Load the Iris Dataset
# Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
# a fully-grown tree.
# (Reuses the Iris train/test split and the imports from the previous cell.)

# Fully grown tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_tree.predict(X_test))

# Pruned tree
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
pruned_acc = accuracy_score(y_test, pruned_tree.predict(X_test))

print("Fully-grown tree accuracy:", full_acc)
print("Max depth = 3 accuracy:", pruned_acc)


Fully-grown tree accuracy: 1.0
Max depth = 3 accuracy: 1.0


In [8]:
# 8 Write a Python program to:
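# Load the Boston Housing Dataset
# (load_boston was removed in scikit-learn 1.2, so the California Housing
#  dataset is used here instead)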
# Train a Decision Tree Regressor
# Print the Mean Squared Error (MSE) and feature importances

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predictions
y_pred = reg.predict(X_test)

# MSE
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Feature importances
print("Feature Importances:", reg.feature_importances_)

Mean Squared Error: 0.495235205629094
Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
 0.09371656 0.08290203]


In [10]:
# 9 Write a Python program to:
# Load the Iris Dataset
# Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Load Iris dataset explicitly for this task
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# Train-test split for Iris dataset
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

param_grid = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_split": [2, 5, 10]
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5
)

grid.fit(X_train_iris, y_train_iris)

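# best_score_ is the mean cross-validated accuracy on the training data,
# not the accuracy on the held-out test set.
print("Best Parameters:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)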

Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Best Accuracy: 0.9416666666666668


10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:

- Handling Missing Values:
  - Numerical features can be filled using the mean or median (the median is more robust to outliers).
  - Categorical features can be filled using the most frequent value.

- Encoding Categorical Features:
  - Use Label (ordinal) Encoding for ordinal data, so that the category order is preserved.
  - Use One-Hot Encoding for nominal data.

- Training the Decision Tree:
  - Split data into training and testing sets.
  - Train a Decision Tree Classifier using a suitable impurity criterion (Gini or Entropy).

- Hyperparameter Tuning:
  - Tune parameters like max_depth and min_samples_split using GridSearchCV.

- Model Evaluation:
  - Use accuracy, precision, recall, F1-score, and the confusion matrix.
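
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the column names (`age`, `blood_pressure`, `smoker`), the synthetic labels, and the injected missing values are hypothetical stand-ins for real patient records. Scoring the grid search by F1 is also an illustrative choice. Placing imputation and encoding inside the pipeline means they are fitted on the training folds only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data (all columns are hypothetical).
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
})
y = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to mimic missing data.
df.loc[rng.choice(n, 30, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Steps 1 and 2: impute missing values, then one-hot encode the nominal column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: the decision tree is the final stage of the pipeline.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 4: tune max_depth and min_samples_split with cross-validation.
grid = GridSearchCV(
    model,
    {"tree__max_depth": [2, 3, 4, 5], "tree__min_samples_split": [2, 5, 10]},
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = grid.predict(X_test)
print("Best parameters:", grid.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```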