In [None]:
Decision Tree
________________________________
Question 1: What is a Decision Tree, and how does it work in the context of classification?
ANS:- A Decision Tree is a machine learning algorithm used for classification (and regression).

It works like a flowchart:

-At the root node, the data is split based on the feature that gives the best separation.
-Each internal node represents a question/condition on a feature.
-Each branch represents the outcome of that question.
-Finally, the leaf nodes give the class label (prediction).
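
The flowchart idea can be written out by hand as nested conditions. The sketch below is only an illustration: the function name `classify_iris` and both thresholds are assumptions chosen to resemble a small Iris tree, not values learned by any model in this notebook.

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Hand-written stand-in for a tiny decision tree (thresholds are illustrative)."""
    # Root node: the first question asked of every sample
    if petal_length_cm <= 2.45:
        return "setosa"        # leaf node: class label
    # Internal node: a second question, here on a different feature
    if petal_width_cm <= 1.75:
        return "versicolor"    # leaf node
    return "virginica"         # leaf node


# Each call follows one root-to-leaf path through the "flowchart"
print(classify_iris(1.4, 0.2))
print(classify_iris(4.5, 1.4))
print(classify_iris(6.0, 2.3))
```

A trained tree has the same shape; the difference is that the algorithm picks the features and thresholds from the training data instead of a person writing them.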

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
ANS:- 1. Gini Impurity

Formula: 
$G = 1 - \sum_i p_i^2$, where $p_i$ is the fraction of samples in class $i$.
It measures how often a randomly chosen sample would be misclassified if labels were assigned randomly according to the class distribution.
Value range: 0 (pure, only one class) → max 0.5 for two classes split 50–50 (in general, 1 − 1/k for k equally likely classes).

2. Entropy

Formula: 
$H = -\sum_i p_i \log_2(p_i)$, where $p_i$ is the fraction of samples in class $i$.
It measures the uncertainty (disorder) in the dataset.
Value range: 0 (pure) → max 1 for two classes split 50–50 (in general, log2(k) for k equally likely classes).

Impact on Splits:-
Both Gini and Entropy are used to decide which feature to split on.
The algorithm checks every possible split and calculates impurity.
The split that reduces impurity the most (purer child nodes) is chosen.
In practice: Gini is slightly cheaper to compute because it avoids the logarithm, Entropy comes from information theory, and the two usually produce very similar trees.

In Short:-
Gini = measures misclassification.
Entropy = measures disorder/uncertainty.
Both guide the tree to make splits that produce purer child nodes. 
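
To make the two formulas concrete, here is a minimal sketch that computes both measures from a list of class labels. The helper names (`class_probabilities`, `gini`, `entropy`) are my own and are not part of scikit-learn.

```python
from collections import Counter
from math import log2


def class_probabilities(labels):
    """Fraction of samples belonging to each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    return sum(-p * log2(p) for p in class_probabilities(labels))


pure = ["A", "A", "A", "A"]    # only one class
mixed = ["A", "A", "B", "B"]   # two classes, 50-50

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

The pure node scores 0 on both measures, while the 50-50 node reaches the two-class maximum of each (0.5 for Gini, 1 bit for Entropy).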

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
ANS:- Pre-Pruning
Definition: Stop growing the tree early (before it becomes too deep).
Example rule: “Stop if a node has fewer than 5 samples” or “Stop if a split no longer reduces impurity by a meaningful amount.”
Advantage: Saves training time and prevents the tree from becoming too complex.

Post-Pruning
Definition: First grow the full tree, then cut back/remove the branches that don’t improve performance.
Example: Remove small branches that only classify a few samples.
Advantage: Makes the model simpler and reduces overfitting while keeping good accuracy. 

In short:

Pre-Pruning = stop early (fast & simple).
Post-Pruning = grow fully, then trim (better accuracy, less overfitting).
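
Both styles of pruning are available in scikit-learn, and the sketch below compares them on Iris. It treats `max_depth` and `min_samples_leaf` as the pre-pruning limits and uses cost-complexity pruning (`ccp_alpha`) as the post-pruning method. The specific limits and the choice of a mid-range alpha are arbitrary assumptions for illustration; in practice alpha would normally be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: fully grown tree with no pruning
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: limits are enforced while the tree is being built
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: list the candidate alphas for the fully grown tree,
# then refit with one of them so weak branches are trimmed after growing
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary mid-range value
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, model in [
    ("full", full_tree),
    ("pre-pruned", pre_pruned),
    ("post-pruned", post_pruned),
]:
    print(
        f"{name}: {model.get_n_leaves()} leaves, "
        f"test accuracy {model.score(X_test, y_test):.3f}"
    )
```

Comparing the leaf counts shows how much each approach simplifies the tree; on a dataset as small and clean as Iris the accuracy differences are likely to be minor.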

Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
ANS:- Information Gain
It tells us how useful a feature is for separating the data into classes.
It compares impurity before splitting vs after splitting.
If impurity decreases a lot, that feature has high information gain.

Why important?
Because the feature with the highest information gain produces the purest child nodes,
the decision tree uses it to choose the best split at each step.
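
Written as a formula, information gain is the parent's entropy minus the size-weighted average entropy of the children: IG = H(parent) − Σ (n_k / n) · H(child_k). The sketch below works through two candidate splits of the same eight samples; the function names and the toy labels are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted_child_entropy = sum(
        len(child) / total * entropy(child) for child in children
    )
    return entropy(parent) - weighted_child_entropy


parent = ["yes"] * 4 + ["no"] * 4

# Candidate split A: children are still mixed (3 vs 1 in each)
weak_split = [["yes", "yes", "yes", "no"], ["yes", "no", "no", "no"]]
# Candidate split B: children are perfectly pure
perfect_split = [["yes"] * 4, ["no"] * 4]

print("weak split   :", round(information_gain(parent, weak_split), 4))
print("perfect split:", round(information_gain(parent, perfect_split), 4))
```

The tree would choose split B, since it removes all of the parent's uncertainty, while split A removes only a small part of it.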

Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
ANS:- Real-World Applications of Decision Trees

-Banking/Finance → to decide loan approval or credit risk.
-Healthcare → to diagnose diseases based on symptoms.
-Marketing → to predict if a customer will buy a product.
-Fraud Detection → to check if a transaction is suspicious.
-Manufacturing → to detect defects in products.

Advantages:-

-Easy to understand & explain (like a flowchart).
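-Handles both categorical & numerical data (in scikit-learn, categorical features must first be encoded as numbers).
-No need for heavy data preprocessing (for example, no feature scaling).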
-Fast to train compared to many other models.

Limitations:-

-Can overfit (become too complex).
-Unstable → small changes in data may create a different tree.
-Not always the most accurate compared to advanced models.

In short: Decision Trees are simple, useful in many areas, but can overfit and be unstable.

Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances
ANS:-

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train a Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# 4. Make predictions
y_pred = clf.predict(X_test)

# 5. Print accuracy
print("Model Accuracy:", accuracy_score(y_test, y_pred))

# 6. Print feature importances
print("Feature Importances:", clf.feature_importances_)


Model Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]


In [None]:
Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
ANS:-

In [3]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train Decision Tree with max_depth=3
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)

# 4. Train a fully-grown Decision Tree (no depth limit)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)

# 5. Print accuracies
print("Accuracy with max_depth=3:", accuracy_score(y_test, y_pred_limited))
print("Accuracy with full tree :", accuracy_score(y_test, y_pred_full))


Accuracy with max_depth=3: 1.0
Accuracy with full tree : 1.0


In [None]:
Question 8: Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
ANS:-

In [9]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd  # needed for the isinstance(X, pd.DataFrame) check below

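# Try to fetch the dataset; fall back to synthetic data if the download fails
try:
    housing = fetch_california_housing(as_frame=True)
    df = housing.frame
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print("Could not download dataset due to SSL error. Using a small sample dataset instead.")
    # Fallback: synthetic regression data (not housing data), so the MSE is not in housing units
    from sklearn.datasets import make_regression
    X, y = make_regression(n_samples=1000, n_features=8, noise=0.1, random_state=42)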
else:
    X = df.drop("MedHouseVal", axis=1)
    y = df["MedHouseVal"]

# Split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict & evaluate
y_pred = regressor.predict(X_test)
print("Mean Squared Error (MSE):", mean_squared_error(y_test, y_pred))

# Feature importances
if isinstance(X, pd.DataFrame):
    features = X.columns
else:
    features = [f"Feature_{i}" for i in range(X.shape[1])]
print("Feature Importances:")
for feature, importance in zip(features, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Could not download dataset due to SSL error. Using a small sample dataset instead.
Mean Squared Error (MSE): 13499.787909490851
Feature Importances:
Feature_0: 0.2441
Feature_1: 0.0813
Feature_2: 0.2040
Feature_3: 0.0209
Feature_4: 0.1156
Feature_5: 0.0555
Feature_6: 0.2670
Feature_7: 0.0117


In [None]:
Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy
ANS:-

In [11]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1. Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Set parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# 4. Initialize Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# 5. Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# 6. Print best parameters
print("Best Parameters:", grid_search.best_params_)

# 7. Evaluate the best model on test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Accuracy on Test Data:", accuracy_score(y_test, y_pred))


Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Accuracy on Test Data: 1.0


In [None]:
Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.
ANS:- 

In [13]:
# Step 1: Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Step 2: Generate Simulated Healthcare Dataset
np.random.seed(42)
n_samples = 200

age = np.random.randint(20, 70, n_samples)
blood_pressure = np.random.randint(80, 180, n_samples)
cholesterol = np.random.randint(150, 300, n_samples)
gender = np.random.choice(['Male','Female'], n_samples)
smoker = np.random.choice(['Yes','No'], n_samples)
has_disease = np.random.choice([0,1], n_samples, p=[0.6,0.4])  # 0 = No disease, 1 = Has disease

df = pd.DataFrame({
    'Age': age,
    'BloodPressure': blood_pressure,
    'Cholesterol': cholesterol,
    'Gender': gender,
    'Smoker': smoker,
    'HasDisease': has_disease
})

# Step 3: Preprocessing (Encode Categorical Features)
df = pd.get_dummies(df, columns=['Gender','Smoker'])

# Step 4: Split Data into Train and Test
X = df.drop('HasDisease', axis=1)
y = df['HasDisease']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 5: Train Decision Tree with GridSearchCV
param_grid = {
    'max_depth': [2, 3, 4, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

dt = DecisionTreeClassifier(random_state=42)
grid = GridSearchCV(dt, param_grid, cv=5)
grid.fit(X_train, y_train)

# Step 6: Evaluate the Best Model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("=== Best Hyperparameters ===")
print(grid.best_params_)
print("\n=== Test Accuracy ===")
print(accuracy_score(y_test, y_pred))
print("\n=== Classification Report ===")
print(classification_report(y_test, y_pred))

# Step 7: Feature Importances
print("\n=== Feature Importances ===")
for feature, importance in zip(X.columns, best_model.feature_importances_):
    print(f"{feature}: {importance:.4f}")


=== Best Hyperparameters ===
{'criterion': 'gini', 'max_depth': 3, 'min_samples_split': 2}

=== Test Accuracy ===
0.6666666666666666

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.69      0.87      0.77        39
           1       0.55      0.29      0.38        21

    accuracy                           0.67        60
   macro avg       0.62      0.58      0.57        60
weighted avg       0.64      0.67      0.63        60


=== Feature Importances ===
Age: 0.4823
BloodPressure: 0.0000
Cholesterol: 0.5177
Gender_Female: 0.0000
Gender_Male: 0.0000
Smoker_No: 0.0000
Smoker_Yes: 0.0000


In [None]:
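Data Generation: I created simulated patient data because no real dataset was available. The HasDisease label is drawn at random, independently of the features, so the scores above reflect chance patterns rather than real medical signal.

Preprocessing: Categorical features like Gender and Smoker were converted into numbers using one-hot encoding (pd.get_dummies). The simulated data contains no missing values, so no imputation step was run here.
Model Training: We trained a Decision Tree and tuned its parameters (max_depth, min_samples_split, criterion) using GridSearchCV with 5-fold cross-validation.
Evaluation: The model got about 67% accuracy on the test set, barely above the 65% share of class 0 in that set (39 of 60 patients). Recall for the disease class was only 0.29, so most actual cases were missed.
Feature Importance: Cholesterol (0.5177) and Age (0.4823) were the only features the tree used; BloodPressure, Gender and Smoker all had zero importance.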
Business Value:

Detect high-risk patients early
Help doctors make better decisions
Save resources and reduce costs
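
Question 10 also asks about missing values, and the simulated data above contains none. One way to cover that step is to put imputation and encoding inside a scikit-learn Pipeline, as in the sketch below. The column names, the number of blanked-out values, and the imputation strategies (median for numeric columns, most frequent value for categorical ones) are assumptions chosen for illustration, and the data is again simulated.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Simulated patient data (random labels, so no real signal is expected)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(20, 70, n).astype(float),
    "Cholesterol": rng.integers(150, 300, n).astype(float),
    "Smoker": rng.choice(["Yes", "No"], n),
    "HasDisease": rng.choice([0, 1], n, p=[0.6, 0.4]),
})

# Blank out some entries to simulate missing values
df.loc[rng.choice(n, 20, replace=False), "Age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "Smoker"] = np.nan

numeric_cols = ["Age", "Cholesterol"]
categorical_cols = ["Smoker"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the column median
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Categorical columns: fill gaps with the most frequent value, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
])

X = df.drop(columns="HasDisease")
y = df["HasDisease"]
model.fit(X, y)
print(model.predict(X.head()))
```

Keeping the imputers inside the pipeline means that, if this model is passed to GridSearchCV, the fill values are learned from each training fold only, which avoids leaking information from the validation fold.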