Question 1: What is a Decision Tree, and how does it work in the context of classification?

ans- A Decision Tree is a powerful and intuitive supervised learning algorithm used for both classification and regression tasks. Here's how it works in the context of classification:

🌳 What Is a Decision Tree?
A Decision Tree is a flowchart-like structure where:
- Internal nodes represent tests on features (e.g., "Is age > 30?")
- Branches represent outcomes of those tests
- Leaf nodes represent class labels (e.g., "Approved", "Denied")
It mimics human decision-making by breaking down a complex decision into a series of simpler decisions.

⚙️ How It Works for Classification
- Feature Selection:
  - The algorithm chooses the best feature to split the data based on a criterion like:
    - Gini Impurity
    - Entropy (Information Gain)
    - Chi-square
- Recursive Splitting:
  - The dataset is split into subsets based on the selected feature.
  - This process continues recursively, creating branches until:
    - All samples in a node belong to the same class
    - A stopping condition is met (e.g., max depth, min samples per leaf)
- Prediction:
  - For a new input, the tree is traversed from root to leaf by evaluating feature conditions.
  - The leaf node reached gives the predicted class.
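
The prediction step can be sketched in plain Python. This is only an illustration: the tree below is hand-built, and the feature names (`age`, `income`), thresholds, and labels are hypothetical rather than learned from any dataset.

```python
# Hypothetical hand-built tree: internal nodes test one feature against a
# threshold, leaf nodes carry a class label. Values are illustrative only.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"label": "Denied"},              # age <= 30
    "right": {                                # age > 30
        "feature": "income", "threshold": 50000,
        "left": {"label": "Denied"},          # income <= 50000
        "right": {"label": "Approved"},       # income > 50000
    },
}


def predict(node, sample):
    """Walk from the root to a leaf and return the leaf's class label."""
    while "label" not in node:
        goes_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["label"]


print(predict(tree, {"age": 45, "income": 72000}))  # Approved
print(predict(tree, {"age": 25, "income": 90000}))  # Denied
```

A trained scikit-learn tree does the same kind of root-to-leaf walk internally; the nested dictionaries here just make the structure visible.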


Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

ans- Gini Impurity and Entropy are two fundamental metrics used to measure the impurity or disorder of a dataset in Decision Trees. They guide the tree in selecting the best feature to split on at each node.

🔍 Gini Impurity
Definition:
Gini Impurity measures the probability of incorrectly classifying a randomly chosen element if it was randomly labeled according to the distribution of labels in the node.
Formula:
For a node with classes C_1, C_2, ..., C_k,
Gini = 1 - \sum_{i=1}^{k} p_i^2
where p_i is the probability of class C_i.
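Interpretation:
- Gini = 0 means the node is pure (every sample belongs to one class).
- Gini is highest when the classes are evenly mixed: 1 - 1/k for k classes (0.5 for two classes).

🔍 Entropy
Definition:
Entropy measures the uncertainty (disorder) in the class distribution of a node.
Formula:
Entropy = - \sum_{i=1}^{k} p_i \log_2 p_i
Interpretation:
- Entropy = 0 means the node is pure.
- Entropy is highest when the classes are evenly mixed: \log_2 k for k classes (1 for two classes).

Impact on Splits:
At each node the tree evaluates candidate splits and picks the one with the largest reduction in impurity, i.e., the lowest weighted impurity of the child nodes. Gini and Entropy usually select similar splits; Gini is slightly cheaper to compute because it needs no logarithm.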
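
As a rough illustration of the two impurity measures, the sketch below computes Gini Impurity (1 - \sum p_i^2) and Entropy (-\sum p_i \log_2 p_i) for a list of class labels. The helper names and the toy label lists are made up for this example.

```python
from collections import Counter
from math import log2


def class_probabilities(labels):
    """Return the fraction of samples belonging to each class present."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits: sum(p_i * log2(1 / p_i))."""
    return sum(p * log2(1 / p) for p in class_probabilities(labels))


pure = ["yes", "yes", "yes", "yes"]   # one class only
mixed = ["yes", "yes", "no", "no"]    # evenly mixed, two classes

print(gini(pure), entropy(pure))      # 0.0 0.0
print(gini(mixed), entropy(mixed))    # 0.5 1.0
```

Both measures are 0 for a pure node and reach their maximum for an evenly mixed one, which is why either can be used to rank candidate splits.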

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision trees? Give one practical advantage of using each.

ans- Pre-Pruning (Early Stopping)
Definition: Pre-pruning halts the tree growth early—before it becomes overly complex. It uses criteria like maximum depth, minimum number of samples per node, or minimum information gain to decide when to stop splitting.
Advantage:
✅ Efficiency — It reduces training time and memory usage by preventing the tree from growing unnecessarily deep, which is especially useful for large datasets or real-time applications.

✂️ Post-Pruning (e.g., Reduced Error Pruning)
Definition: Post-pruning allows the tree to grow fully and then prunes back branches that do not improve performance on a validation set. It simplifies the tree after it's built.
Advantage:
✅ Better Generalization — It helps reduce overfitting by removing branches that capture noise, often resulting in improved accuracy on unseen data.

If you're implementing decision trees in Python (e.g., with scikit-learn), pre-pruning is typically done via parameters like max_depth, min_samples_split, and min_samples_leaf, while post-pruning is available as minimal cost-complexity pruning through the ccp_alpha parameter of DecisionTreeClassifier. Reduced error pruning is not built in and would require custom logic.
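
A minimal sketch of both approaches in scikit-learn, assuming the Iris data and the same 70/30 split used elsewhere in this notebook. The specific values (max_depth=3, min_samples_leaf=5, and the mid-range alpha) are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: constrain growth up front.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, compute candidate alphas, refit with one.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary pick for illustration
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: nodes={model.tree_.node_count}, test accuracy={model.score(X_test, y_test):.2f}")
```

In practice the alpha would be chosen by cross-validation over `path.ccp_alphas` rather than picked from the middle of the list.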

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

ans- Information Gain (IG) measures how much “information” a feature gives us about the target variable. In decision trees, it quantifies the reduction in entropy (or impurity) after a dataset is split on a feature.
Mathematically:
\text{Information Gain} = \text{Entropy}(\text{Parent}) - \sum_i \frac{n_i}{n} \, \text{Entropy}(\text{Child}_i)
Where:
- Entropy measures disorder or uncertainty.
- n is the total number of samples in the parent node.
- n_i is the number of samples in child node i.
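
The formula can be checked on a toy example. The sketch below is plain Python with made-up labels; `information_gain` is a hypothetical helper that takes the parent's labels and a list of label lists, one per child node.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes perfectly.
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split whose children are as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "yes", "no", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

The perfect split removes all uncertainty (gain equal to the parent's entropy), while the useless split leaves it unchanged, so a tree builder would prefer the first.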

🌟 Why is it Important for Splitting?
When building a decision tree, we want to split the data in a way that maximally reduces uncertainty about the target. Information Gain helps us:
- Rank features based on how well they separate the data.
- Choose the best split at each node to make the tree more accurate and efficient.

✅ Practical Impact
- High IG → Feature creates purer child nodes → Strong candidate for the split.
- Low IG → Feature barely reduces uncertainty → Unlikely to be chosen for the split.


Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

ans-  Real-World Applications of Decision Trees

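1. Healthcare: supporting diagnosis and patient risk stratification
2. Manufacturing: quality control and fault detection
3. Education: identifying students at risk of dropping out
4. Finance: credit scoring and fraud detection
5. Marketing: customer segmentation and churn prediction
6. Retail: demand forecasting and product recommendation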

✅ Advantages of Decision Trees
- Interpretability: Easy to visualize and explain to non-technical stakeholders.
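- No Need for Feature Scaling: Splits compare a single feature against a threshold, so standardizing or normalizing features does not change the tree.
- Mixed Data Types: Can work with both numerical and categorical features (scikit-learn's implementation requires categorical features to be encoded as numbers first).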
- Handles Non-linear Relationships: Captures complex patterns without requiring transformations.
- Fast Inference: Once trained, predictions are quick and efficient.
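
To illustrate the interpretability point, a fitted scikit-learn tree can be printed as plain if/else rules with `export_text`. This is a small sketch on the Iris data; the depth limit of 2 is chosen only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set small enough to read at a glance.
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Render the learned splits as indented text rules.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one test on a feature, and each `class:` line is a leaf, so the printout can be read directly as a decision procedure.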

⚠️ Limitations of Decision Trees
- Overfitting: Especially with deep trees, they can memorize training data.
- Instability: Small changes in data can lead to very different trees.
- Bias Toward Features with More Levels: Can favor splits on categorical features with many unique values.
- Lower Predictive Power Alone: Often outperformed by ensemble methods like Random Forests or Gradient Boosted Trees.



In [1]:
# Dataset Info:
# Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
# Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV).
# Note: load_boston() was removed in scikit-learn 1.2; Question 8 uses the California Housing dataset instead.
#
# Question 6: Write a Python program to:
# - Load the Iris Dataset
# - Train a Decision Tree Classifier using the Gini criterion
# - Print the model's accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print feature importances
print("\nFeature Importances:")
for name, importance in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")

Model Accuracy: 1.00

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


In [2]:
# Question 7: Write a Python program to:
# - Load the Iris Dataset
# - Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree with max_depth=3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Train fully-grown Decision Tree
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Compare accuracies
print(f"Accuracy with max_depth=3: {accuracy_limited:.2f}")
print(f"Accuracy with fully-grown tree: {accuracy_full:.2f}")

Accuracy with max_depth=3: 1.00
Accuracy with fully-grown tree: 1.00


In [3]:
# Question 8: Write a Python program to:
# - Load the California Housing dataset from sklearn
# - Train a Decision Tree Regressor
# - Print the Mean Squared Error (MSE) and feature importances


from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target
feature_names = housing.feature_names

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict and evaluate MSE
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Print feature importances
print("\nFeature Importances:")
for name, importance in zip(feature_names, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")

Mean Squared Error: 0.53

Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.0250
Population: 0.0322
AveOccup: 0.1390
Latitude: 0.0900
Longitude: 0.0888


In [4]:
# Question 9: Write a Python program to:
# - Load the Iris Dataset
# - Tune the Decision Tree's max_depth and min_samples_split using GridSearchCV
# - Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# Initialize Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get best parameters and evaluate on test set
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Output results
print(f"Best Parameters: {best_params}")
print(f"Model Accuracy on Test Set: {accuracy:.2f}")

Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Model Accuracy on Test Set: 1.00


Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
Also describe what business value this model could provide in a real-world setting.

ans-  Step-by-Step Workflow for Disease Prediction
1️⃣ Handle Missing Values
- Numerical Features:
  - Use mean or median imputation (SimpleImputer from sklearn) depending on distribution.
  - For time-sensitive or correlated data, consider regression imputation or KNN imputation.
- Categorical Features:
  - Use mode imputation or fill with "Unknown" if missingness is informative.
  - Optionally, add a missingness indicator column to capture patterns.
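
A minimal sketch of these imputation options with scikit-learn's SimpleImputer. The arrays are made-up stand-ins for patient data (age and blood pressure, plus one categorical column); real column names and values would come from the actual dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns: age, blood pressure (np.nan marks missing values).
X_num = np.array([
    [50.0, 120.0],
    [np.nan, 135.0],
    [64.0, np.nan],
    [41.0, 118.0],
])

# Median imputation; add_indicator appends one 0/1 "was missing" column
# for each feature that had missing values during fit.
num_imputer = SimpleImputer(strategy="median", add_indicator=True)
X_num_imputed = num_imputer.fit_transform(X_num)
print(X_num_imputed)

# Hypothetical categorical column: fill missing entries with an explicit "Unknown".
X_cat = np.array([["A"], [np.nan], ["B"], ["A"]], dtype=object)
cat_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
X_cat_imputed = cat_imputer.fit_transform(X_cat)
print(X_cat_imputed.ravel())
```

To avoid leaking test information, the imputers would normally be fitted on the training split only and then applied to the test split.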
2️⃣ Encode Categorical Features
- Low-cardinality features:
  - Use One-Hot Encoding (pd.get_dummies() or OneHotEncoder) to avoid ordinal assumptions.
- High-cardinality features:
  - Use Target Encoding or Frequency Encoding to reduce dimensionality.
  - Integer encoding is also usable with Decision Trees, since splits depend only on thresholds, although the arbitrary category order can require extra splits. Use OrdinalEncoder for features; LabelEncoder is intended for the target column.
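
One way to combine these encodings is a ColumnTransformer. The sketch below uses a made-up DataFrame; the column names (`sex`, `smoker`, `hospital_id`, `age`) are hypothetical, and treating `hospital_id` as the high-cardinality column is an assumption for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical patient records with categorical and numeric columns.
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "smoker": ["no", "yes", "no", "no"],
    "hospital_id": ["H1", "H7", "H3", "H7"],
    "age": [50, 37, 64, 41],
})

encoder = ColumnTransformer(
    transformers=[
        # One-hot encode low-cardinality columns; ignore unseen categories later.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker"]),
        # Integer-encode the high-cardinality column; unseen categories map to -1.
        ("ordinal",
         OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         ["hospital_id"]),
    ],
    remainder="passthrough",   # keep numeric columns such as age unchanged
    sparse_threshold=0,        # always return a dense array
)

X_encoded = encoder.fit_transform(df)
print(X_encoded.shape)  # (4, 6)
```

The six output columns are two for `sex`, two for `smoker`, one integer code for `hospital_id`, and the untouched `age`.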
3️⃣ Train a Decision Tree Model
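- Use train_test_split() to separate training and testing data before fitting.
- Feature scaling is not needed for trees; normalize only if you also use models that are sensitive to scale.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# X, y: the imputed, encoded features and disease labels from steps 1-2 (assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)
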
4️⃣ Tune Hyperparameters
Use GridSearchCV or RandomizedSearchCV to optimize:
- max_depth: Controls tree complexity.
- min_samples_split: Minimum samples to split a node.
- min_samples_leaf: Minimum samples at a leaf node.
- max_features: Number of features to consider at each split.
from sklearn.model_selection import GridSearchCV
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)


5️⃣ Evaluate Performance
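- Accuracy: Overall correctness; can be misleading when the disease is rare and the classes are imbalanced.
- Precision & Recall: Recall is especially important in healthcare, because a false negative means a missed diagnosis; precision limits unnecessary follow-up tests.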
- F1 Score: Balances precision and recall.
- ROC-AUC: Measures model’s ability to distinguish between classes.
from sklearn.metrics import classification_report, roc_auc_score
y_pred = grid.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, grid.best_estimator_.predict_proba(X_test)[:,1]))
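
To make the metric definitions concrete, the sketch below computes precision, recall, and F1 by hand on a made-up set of ten predictions (1 = has the disease). The labels are illustrative only and do not come from any model in this notebook.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical labels: 4 sick patients, 6 healthy ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.70 precision=0.67 recall=0.50 f1=0.57
```

Here accuracy looks acceptable at 0.70, yet recall shows that half of the sick patients were missed, which is the kind of gap that matters most in a screening setting.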



💼 Business Value in Healthcare
- Early Detection: Helps identify high-risk patients before symptoms escalate.
- Resource Optimization: Prioritizes diagnostic testing for likely cases, reducing costs.
- Personalized Care: Enables tailored treatment plans based on predicted risk.
- Compliance & Auditability: Decision Trees are interpretable, aiding regulatory compliance.
- Patient Engagement: Transparent models build trust with patients and clinicians.

