# Decision Tree

q.no.1: What is a Decision Tree, and how does it work in the context of
classification?

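-> A Decision Tree is a non-parametric supervised learning method used for both classification and regression tasks. Its goal is to build a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
In the context of classification, a Decision Tree works by recursively splitting the dataset into more homogeneous subsets, one feature test at a time, until a stopping criterion is met. This process is often called Decision Tree Induction:
1. Start with the Root Node: the root holds the entire training set.
2. Determine the Best Split: evaluate candidate feature tests and pick the one that reduces impurity the most (for example, measured by Gini Impurity or Entropy).
3. Create Child Nodes and Recurse: partition the data by the chosen test and repeat the procedure on each child.
4. Stop Splitting (Stopping Criteria): stop when a node is pure, a depth or minimum-sample limit is reached, or no split reduces impurity.
5. Classification: each leaf predicts the majority class of its training samples; a new sample follows the tests from the root down to a leaf.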
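
To make the "simple decision rules" concrete, here is a minimal sketch that fits a shallow tree on Iris and prints the rules it learned. The choices `max_depth=2` and `random_state=0` are assumptions made only to keep the printed rule set short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth=2 is an arbitrary choice)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned splits as nested threshold tests
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Classification: a sample is routed from the root down to a leaf
sample = iris.data[:1]
predicted_class = iris.target_names[tree.predict(sample)[0]]
print("Predicted class for the first sample:", predicted_class)
```

Each indented line in the printed text is one split, and each `class:` line is a leaf.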




q.no.2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

-> Gini Impurity and Entropy are the primary metrics used by Decision Tree algorithms to determine the best way to split a node (data subset) into sub-nodes. Both measure impurity: they are 0 for a node that contains a single class and largest when the classes are evenly mixed.

Gini Impurity: The Gini Impurity measures how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset. For class proportions p_i it equals 1 - sum(p_i^2).

Impact on Splits: When the Decision Tree algorithm evaluates a potential split, it calculates the weighted Gini Impurity of the resulting child nodes, weighting each child by its share of the samples. It selects the split that results in the lowest weighted Gini Impurity among the children.

Entropy: Entropy is a concept borrowed from information theory; it measures the randomness or unpredictability of the class labels in a dataset. For class proportions p_i it equals -sum(p_i * log2(p_i)).

Impact on Splits: The Decision Tree selects the split that provides the highest Information Gain (IG). Because the parent's entropy is fixed, this is the same as choosing the split with the lowest weighted average entropy across the children.
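
A small worked example may help here. The sketch below computes both measures for a node and compares two candidate splits by weighted Gini Impurity. The labels and the two splits are made-up toy data, and the helper names are illustrative only.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class present."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini Impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def weighted_impurity(children, impurity):
    """Average the children's impurity, weighted by child size."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * impurity(child) for child in children)


# Toy node with an even class mix (hypothetical data)
parent = ["yes"] * 5 + ["no"] * 5

# Two hypothetical candidate splits of the same node
split_a = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]          # fairly clean
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]  # still mixed

print(f"Parent Gini:    {gini(parent):.3f}")
print(f"Parent Entropy: {entropy(parent):.3f}")
print(f"Weighted Gini, split A: {weighted_impurity(split_a, gini):.3f}")
print(f"Weighted Gini, split B: {weighted_impurity(split_b, gini):.3f}")
```

Split A leaves purer children, so it has the lower weighted Gini Impurity and would be preferred.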

q.no. 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

-> Pre-Pruning: Pre-Pruning halts the growth of the decision tree before it is fully grown. You set constraints before or during the tree-building process (for example a maximum depth, a minimum number of samples per split or leaf, or a minimum impurity decrease) to stop splits that are not deemed beneficial.

The primary practical advantage of pre-pruning is computational efficiency: the tree stays small, so training and prediction are faster.

Post-Pruning: Post-Pruning grows the decision tree fully (allowing it to overfit the training data) and then systematically trims back non-significant branches or subtrees by replacing them with leaf nodes.

The primary practical advantage of post-pruning is that it often generalizes better: every branch is judged after the full tree has been grown, so a useful split deeper in the tree is not lost to an early stop.
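
Both styles can be sketched with scikit-learn. Pre-pruning corresponds to growth limits such as `max_depth` and `min_samples_leaf`. Post-pruning is available as minimal cost-complexity pruning through `ccp_alpha`. The specific limits and the mid-range alpha below are assumptions for illustration; in practice alpha would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: limits are enforced while the tree is being grown
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune it back
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Assumption: a mid-range alpha, picked only for illustration;
# max() guards against tiny negative values from rounding
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, model in [
    ("full tree", full_tree),
    ("pre-pruned", pre_pruned),
    ("post-pruned", post_pruned),
]:
    print(
        f"{name:<12} nodes={model.tree_.node_count:<3} "
        f"test accuracy={model.score(X_test, y_test):.4f}"
    )
```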


q.no.4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?


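-> Information Gain is defined as the difference between the Entropy of the parent node (before the split) and the weighted average Entropy of the child nodes (after the split), where each child is weighted by its share of the parent's samples.

Importance for choosing the best split: the goal of a Decision Tree algorithm is to grow the tree so that it reaches the purest possible leaf nodes with as few splits as possible. Information Gain guides this greedy search at every node:
1. Selection Criterion: every candidate split is scored by its Information Gain.
2. Maximization: the split with the highest Information Gain is chosen.
3. Impurity Reduction: a higher gain means the children are purer than the parent, so each chosen split removes as much uncertainty as possible.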
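
As a worked example of the definition, the sketch below scores two candidate splits of the same parent node. The patient labels and the feature names `fever` and `age` are hypothetical toy data.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the class labels in one node."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average child entropy."""
    total = len(parent)
    weighted_child_entropy = sum(
        len(child) / total * entropy(child) for child in children
    )
    return entropy(parent) - weighted_child_entropy


# Hypothetical parent node: 6 sick and 6 healthy patients
parent = ["sick"] * 6 + ["healthy"] * 6

# Hypothetical candidate splits of that node
split_on_fever = [["sick"] * 5 + ["healthy"], ["sick"] + ["healthy"] * 5]
split_on_age = [["sick"] * 3 + ["healthy"] * 3, ["sick"] * 3 + ["healthy"] * 3]

ig_fever = information_gain(parent, split_on_fever)
ig_age = information_gain(parent, split_on_age)

print(f"Information Gain (fever): {ig_fever:.3f}")
print(f"Information Gain (age):   {ig_age:.3f}")
```

The split on `fever` separates the classes much better, so it has the higher gain and would be chosen. The split on `age` leaves both children as mixed as the parent, so its gain is zero.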

q.no.5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?


-> Common real-world applications:
1. Finance / Banking - Credit Risk Assessment
2. Healthcare / Medicine - Disease Diagnosis
3. E-commerce / Marketing - Customer Churn Prediction
4. Fraud Detection - Transaction Screening
5. Manufacturing - Quality Control

q.no.6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

q.no.7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.


In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset: 150 samples, 4 numeric features, 3 classes
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

print("--- Iris Dataset Loaded ---")
print(f"Features: {feature_names}")

# Hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Gini-criterion tree; max_depth=3 is the depth limit from q.no.7
dt_classifier = DecisionTreeClassifier(
    criterion='gini',
    random_state=42,
    max_depth=3
)

print("\nTraining Decision Tree Classifier (Gini Criterion)...")
dt_classifier.fit(X_train, y_train)
print("Training complete.")

# Evaluate on the held-out test set
y_pred = dt_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("\n--- Model Evaluation ---")
print(f"Model Accuracy on Test Set: **{accuracy:.4f}**")

# Impurity-based feature importances (they sum to 1.0)
importances = dt_classifier.feature_importances_

print("\n--- Feature Importances ---")

# Sort features from most to least important
feature_importances = sorted(
    zip(feature_names, importances),
    key=lambda x: x[1],
    reverse=True
)

for name, score in feature_importances:
    print(f"{name:<20}: {score:.4f}")



--- Iris Dataset Loaded ---
Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Training Decision Tree Classifier (Gini Criterion)...
Training complete.

--- Model Evaluation ---
Model Accuracy on Test Set: **1.0000**

--- Feature Importances ---
petal length (cm)   : 0.9346
petal width (cm)    : 0.0654
sepal length (cm)   : 0.0000
sepal width (cm)    : 0.0000
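
The cell above trains only the `max_depth=3` tree. For the comparison asked for in q.no.7, a minimal sketch could look like the following. It assumes the same 80/20 split and `random_state=42`, and uses `max_depth=None` for the fully-grown tree.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Depth-limited tree versus a tree grown until no further split is possible
models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "fully grown": DecisionTreeClassifier(max_depth=None, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "depth": model.get_depth(),
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, model.predict(X_test)),
    }

for name, stats in results.items():
    print(
        f"{name:<12} depth={stats['depth']} "
        f"train={stats['train_accuracy']:.4f} "
        f"test={stats['test_accuracy']:.4f}"
    )
```

Printing training accuracy next to test accuracy makes any overfitting in the deeper tree visible. On a dataset as small and clean as Iris, the test-set difference may be small.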


q.no.8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances


In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# load_boston was removed in scikit-learn 1.2, so the California Housing
# dataset is used in its place (downloaded and cached on first use)
housing = fetch_california_housing()
X = housing.data
y = housing.target
feature_names = housing.feature_names

print("--- California Housing Dataset Loaded (Replacement for Boston Housing) ---")
print(f"Features: {feature_names}")

# Hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_depth=8 limits tree growth to reduce overfitting
dt_regressor = DecisionTreeRegressor(
    random_state=42,
    max_depth=8
)

print("\nTraining Decision Tree Regressor...")
dt_regressor.fit(X_train, y_train)
print("Training complete.")

# Evaluate on the held-out test set
y_pred = dt_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\n--- Model Evaluation Metrics ---")
print(f"Mean Squared Error (MSE): **{mse:.4f}** (Lower is better)")
print(f"R-squared Score (R²): **{r2:.4f}** (Closer to 1.0 is better)")

# Impurity-based feature importances (they sum to 1.0)
importances = dt_regressor.feature_importances_

print("\n--- Feature Importances (Regressor) ---")

# Sort features from most to least important
feature_importances = sorted(
    zip(feature_names, importances),
    key=lambda x: x[1],
    reverse=True
)

for name, score in feature_importances:
    print(f"{name:<20}: {score:.4f}")




--- California Housing Dataset Loaded (Replacement for Boston Housing) ---
Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

Training Decision Tree Regressor...
Training complete.

--- Model Evaluation Metrics ---
Mean Squared Error (MSE): **0.4220** (Lower is better)
R-squared Score (R²): **0.6779** (Closer to 1.0 is better)

--- Feature Importances (Regressor) ---
MedInc              : 0.6629
AveOccup            : 0.1321
Latitude            : 0.0613
Longitude           : 0.0502
HouseAge            : 0.0422
AveRooms            : 0.0341
AveBedrms           : 0.0089
Population          : 0.0082


q.no.9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

print("--- Iris Dataset Loaded ---")
print(f"Features: {feature_names}")

# Hold out 20% of the samples; the grid search only sees the training part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

dt_classifier = DecisionTreeClassifier(random_state=42)

# 6 depths x 5 split sizes = 30 candidate combinations
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, 7],
    'min_samples_split': [2, 5, 10, 15, 20]
}

print("\nStarting GridSearchCV to find optimal hyperparameters...")

# 5-fold cross-validation on the training set, scored by accuracy
grid_search = GridSearchCV(
    estimator=dt_classifier,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print("\n--- GridSearchCV Results ---")
print(f"Best Parameters Found: **{best_params}**")

# Mean cross-validated accuracy of the best parameter combination
best_cv_score = grid_search.best_score_
print(f"Best Cross-Validation Accuracy (on training data): **{best_cv_score:.4f}**")

# best_estimator_ is refit on the whole training set; score it on the test set
best_dt_model = grid_search.best_estimator_
y_pred_test = best_dt_model.predict(X_test)
final_test_accuracy = accuracy_score(y_test, y_pred_test)

print(f"Final Model Accuracy on Test Set: **{final_test_accuracy:.4f}**")


--- Iris Dataset Loaded ---
Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Starting GridSearchCV to find optimal hyperparameters...
Fitting 5 folds for each of 30 candidates, totalling 150 fits

--- GridSearchCV Results ---
Best Parameters Found: **{'max_depth': 4, 'min_samples_split': 2}**
Best Cross-Validation Accuracy (on training data): **0.9417**
Final Model Accuracy on Test Set: **1.0000**


q.no. 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.
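
The steps listed in the question can be sketched end to end with a scikit-learn pipeline. This is only an illustration under stated assumptions. The patient table is synthetic, and the column names (`age`, `blood_pressure`, `smoker`, `sex`), the label rule, the parameter grid and the F1 scoring choice are all hypothetical stand-ins for the real dataset and requirements.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (all columns are hypothetical)
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
    "sex": pd.Series(rng.choice(["F", "M"], n), dtype=object),
})
# Hypothetical label rule, defined before any values are removed
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Knock out roughly 10% of two columns to simulate missing values
df.loc[rng.random(n) < 0.1, "age"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "sex"]

# Steps 1 and 2: impute missing values, then one-hot encode the categoricals
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: the tree is the last pipeline stage, so preprocessing is
# refit inside every cross-validation fold and cannot leak test data
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=0
)

# Step 4: tune tree size with 5-fold cross-validation
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(model, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

# Step 5: evaluate the refit best model on the held-out test set
print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

For a screening use case, per-class precision and recall are usually more informative than plain accuracy, because missing a sick patient and flagging a healthy one carry different costs. The learned rules can also be printed and reviewed by clinicians, which is one source of the business value the question asks about.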
