#DECISION TREE

1: What is a Decision Tree, and how does it work in the context of
classification?

ans-

    A Decision Tree is a supervised machine-learning model that repeatedly splits the data on feature values, forming a tree-like structure of decisions.
    For classification, it works as follows:

    The tree starts at the root node.

    At each node it chooses the feature and threshold that best separate the classes (the split with the largest impurity reduction).

    It keeps creating decision nodes until every node is pure or a stopping criterion is met.

    A leaf node assigns the final class label.
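
The root-to-leaf walk described above can be sketched in plain Python. This is only an illustration: the nested-dict layout, the `predict` helper, and the thresholds are made up for the example, not learned from data and not how scikit-learn stores a tree.

```python
# Hypothetical hand-built tree: internal nodes test one feature against a
# threshold, leaves carry a class label. Thresholds are illustrative only.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, going left when feature <= threshold."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```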


2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

ans-

    Gini Impurity
  
    Formula:
         
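         $G = 1 - \sum_{i=1}^{k} p_i^2$

    where $p_i$ is the proportion of samples in the node that belong to class $i$.
    Measures how often a randomly chosen element would be incorrectly labelled if it were labelled at random according to the node's class distribution.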

    Entropy

    Formula:

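          $H = -\sum_{i=1}^{k} p_i \log_2(p_i)$

    Measures the uncertainty of the node's class distribution in bits, a notion from information theory. Both measures are 0 for a pure node and largest when the classes are evenly mixed.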

    Impact on Splits

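    Both criteria select the split that leaves the child nodes as pure as possible.

    Gini is slightly cheaper to compute because it needs no logarithm, and it tends to isolate the most frequent class in its own branch.

    Entropy penalizes mixed nodes somewhat more strongly, so it can favour more balanced splits.

    In practice the two usually produce very similar trees; both reduce impurity at every split, which generally improves classification accuracy.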
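
A minimal sketch of the two formulas, to make the comparison concrete. The helper names (`class_probabilities`, `gini`, `entropy`) and the toy label lists are assumptions for illustration.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Proportion of each class among the labels in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """G = 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))

def entropy(labels):
    """H = -sum(p_i * log2(p_i)); classes with zero count never appear."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

balanced = ["a", "a", "b", "b"]   # evenly mixed node
skewed = ["a", "a", "a", "b"]     # mostly one class

print(gini(balanced), entropy(balanced))  # 0.5 1.0
print(gini(skewed), round(entropy(skewed), 3))  # 0.375 0.811
```

For two classes, Gini peaks at 0.5 and entropy at 1.0, so the raw values are not comparable across criteria; only the ranking of candidate splits within one criterion matters.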

3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each

ans-

    Pre-Pruning (Early Stopping)

    Stops tree growth early using limits set before training (e.g., max_depth, min_samples_split).
    Advantage: Limits overfitting while the tree is being built, and training is faster because fewer nodes are grown.

    Post-Pruning

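    Grows the full tree first, then removes branches that add little predictive value, typically judged on validation data.
    Advantage: Pruning decisions are based on the complete tree, so it often yields a simpler model that generalizes better.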
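
A hedged sketch of both approaches in scikit-learn. Post-pruning here uses cost-complexity pruning (`ccp_alpha`), which is scikit-learn's built-in mechanism rather than the validation-set pruning described above. The mid-path choice of `alpha` is arbitrary, for illustration only; in practice it would be selected by cross-validation or on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Pre-pruning: growth stops as soon as one of the limits is reached.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=4, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, compute its pruning path, then refit
# with a chosen alpha (larger alpha removes more branches).
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary, illustration only
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "nodes:", model.tree_.node_count,
          "test accuracy:", model.score(X_test, y_test))
```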

4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

ans-

    Information Gain = Reduction in impurity after a split.

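     $IG = \text{Impurity}(\text{parent}) - \sum_{j} \frac{N_j}{N}\,\text{Impurity}(\text{child}_j)$

     where $N_j$ is the number of samples in child $j$ and $N$ is the number of samples in the parent, so each child's impurity is weighted by its share of the data.

     It is important for choosing the best split because it: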

    Helps select the best feature and threshold to split on.

    More Information Gain = better separation between classes.

    Drives the tree structure: the split with the highest gain is chosen greedily at each node.
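
A small worked example of the weighted formula, using entropy as the impurity. The `information_gain` helper and the two candidate splits are hypothetical, chosen to contrast a perfect split with a nearly useless one.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5
split_a = [["yes"] * 5, ["no"] * 5]                              # separates perfectly
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]   # barely helps

print(information_gain(parent, split_a))            # 1.0
print(round(information_gain(parent, split_b), 3))  # 0.029
```

A tree builder would evaluate every candidate split this way and keep the one with the highest gain, here `split_a`.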

5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

ans-

  Some common real-world applications of decision trees are-
    
    -Medical diagnosis

    -Fraud detection

    -Credit risk assessment

    -Customer segmentation

    -Manufacturing defect prediction

  Their main advantages are-
    
    -Easy to interpret

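    -Works with both numerical & categorical data (scikit-learn's implementation needs categorical features encoded as numbers first)

    -Requires little data preprocessing (no feature scaling or normalization)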

  The limitations are-

    -Prone to overfitting

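    -Unstable: small changes in the training data can produce a very different tree

    -Axis-aligned splits can only approximate smooth or diagonal decision boundaries with step-like ones
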
6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train using Gini
clf = DecisionTreeClassifier(criterion="gini")
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Feature Importances:", clf.feature_importances_)


Accuracy: 1.0
Feature Importances: [0.01911002 0.         0.89326355 0.08762643]
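
The importances are printed in the column order of `iris.data` (sepal length, sepal width, petal length, petal width). A short sketch that pairs each value with its name and sorts them. Setting `random_state=42` on the classifier is an addition here to make tie-breaking repeatable, so the exact numbers may differ slightly from the output above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# random_state on the classifier is an assumption added for repeatability
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Pair each importance with its feature name, largest first
ranked = sorted(zip(iris.feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:20s} {importance:.3f}")
```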


7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree

In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Full Tree
full_tree = DecisionTreeClassifier()
full_tree.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_tree.predict(X_test))

# Depth 3 Tree
small_tree = DecisionTreeClassifier(max_depth=3)
small_tree.fit(X_train, y_train)
small_acc = accuracy_score(y_test, small_tree.predict(X_test))

print("Full Tree Accuracy:", full_acc)
print("Max Depth=3 Accuracy:", small_acc)


Full Tree Accuracy: 1.0
Max Depth=3 Accuracy: 1.0
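
Both trees score 1.0 on this single 45-sample test split, so the split alone cannot tell them apart. One way to get a more informative comparison is cross-validation. This is a sketch, with `cv=5` and `random_state=42` chosen as assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Accuracy on each of 5 folds for the fully grown and the depth-limited tree
results = {}
for name, depth in [("Full Tree", None), ("Max Depth=3", 3)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    results[name] = cross_val_score(model, X, y, cv=5)

for name, scores in results.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```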


8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

In [6]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)

# Predictions
y_pred = reg.predict(X_test)

# MSE
mse = mean_squared_error(y_test, y_pred)

print("MSE:", mse)
print("Feature Importances:", reg.feature_importances_)


MSE: 0.5341680816352551
Feature Importances: [0.52439682 0.05208106 0.04843452 0.02755707 0.03172974 0.13854318
 0.08903023 0.08822739]


# Note: the Boston Housing dataset (load_boston) was removed from scikit-learn in version 1.2, so the California Housing dataset is used instead.


9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Accuracy: 1.0


10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.


ans-

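1. Handling Missing Values

    -Impute numerical features with the mean or median (the median is more robust to outliers).

    -Impute categorical columns with the most frequent category.

     Both strategies are available in sklearn's SimpleImputer; fit it on the training data only to avoid leaking test information.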

2. Encoding Categorical Features

Options:

    One-Hot Encoding (for nominal categories with no natural order)

    Ordinal (label) Encoding (when categories have a natural order, e.g. mild < moderate < severe)

3. Training the Decision Tree

  Steps:

    Split the dataset into training and test sets.

    Build a DecisionTreeClassifier.

    Fit the model on the training data only.

4. Hyperparameter Tuning

    Use GridSearchCV (cross-validated on the training set) to tune:

    max_depth

    min_samples_split

    min_samples_leaf

    criterion (gini/entropy)

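5. Model Evaluation

    Metrics (computed on the held-out test set):

    Accuracy (can be misleading when the disease is rare)

    Precision, Recall, F1-score (recall matters most when missing a sick patient is costly)

    Confusion Matrix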

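6. Business Value

    Early identification of disease risks

    Reduced manual workload for doctors

    Faster diagnosis

    Personalized patient monitoring

    Cost reduction by optimizing interventions

    Used as a decision-support tool, such a model can help healthcare providers flag at-risk patients earlier, which can improve patient outcomes. Because a tree's rules are readable, clinicians can also check why a patient was flagged.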
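
The steps above can be sketched end to end as a single scikit-learn pipeline. Everything about the data is hypothetical: the column names (`age`, `blood_pressure`, `smoker`, `region`, `disease`) and the synthetic values are invented for illustration. Scoring the grid search on recall is likewise an assumption about what matters to the business.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic patient data with ~10% missing values in two columns
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(130, 20, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
risk = 0.03 * (df["age"] - 55) + 0.02 * (df["blood_pressure"] - 130)
df["disease"] = (risk + rng.normal(0, 0.5, n) > 0).astype(int)
df.loc[rng.random(n) < 0.1, "age"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "region"]

# Steps 1 and 2: impute, then one-hot encode the categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: preprocessing and the tree live in one pipeline, so imputation
# statistics are learned from training folds only
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X = df.drop(columns="disease")
y = df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Step 4: tune the tree's hyperparameters with cross-validation
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__min_samples_leaf": [1, 5],
    "tree__criterion": ["gini", "entropy"],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="recall")
grid.fit(X_train, y_train)

# Step 5: evaluate once on the held-out test set
y_pred = grid.predict(X_test)
print("Best parameters:", grid.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Keeping the imputers and encoder inside the pipeline means `GridSearchCV` refits them on each training fold, which avoids leaking information from the validation folds into preprocessing.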