In [None]:
                                                       DECISION TREE

In [None]:
Question 1: What is a Decision Tree, and how does it work in the context of classification?

In [None]:
A decision tree is a **supervised learning algorithm** that predicts an outcome by splitting data into branches based on feature values.

In classification, it works by:

* Starting at a root node with the whole dataset.
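* Splitting the data on the feature and threshold that most reduce an impurity measure (such as *Gini impurity* or *entropy*), so that the resulting subsets are more homogeneous.
* Repeating the process on each subset until a node is pure or a stopping criterion (such as a maximum depth) is met; each leaf then predicts the majority class of its samples.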
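
The steps above can be seen on a small example. The sketch below, assuming scikit-learn is installed, fits a shallow tree on Iris and prints its learned rules with `export_text`. The `max_depth=2` setting is an arbitrary choice to keep the printout short, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (max_depth=2 is an illustrative choice) keeps the rules readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Each indented line is a split rule; lines containing "class:" are leaf nodes
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom follows one root-to-leaf path per sample, which is exactly how the tree classifies a new observation.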


In [None]:
Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

In [None]:
**Gini Impurity** and **Entropy** are metrics used to measure how “pure” a node is in a decision tree.

* **Gini Impurity**

  * Formula: $Gini = 1 - \sum p_i^2$
  * Measures the probability that a randomly chosen sample would be misclassified if randomly labeled according to the class distribution in the node.
  * Lower Gini → more pure node.

* **Entropy**

  * Formula: $Entropy = -\sum p_i \log_2(p_i)$
  * Measures the amount of uncertainty or disorder in a node.
  * Lower entropy → more pure node.

**Impact on Splits**:

* At each node, a decision tree chooses the split that **minimizes the weighted impurity** (Gini or Entropy) of the resulting child nodes.
* Both usually produce similar trees: Gini is slightly cheaper to compute because it needs no logarithm, while entropy comes from information theory and is the basis of information gain.
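
To make the two formulas concrete, here is a minimal sketch that evaluates them for a few class distributions. The helper names `gini` and `entropy` are illustrative and not taken from any library.

```python
import math

def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy in bits: -sum(p_i * log2(p_i)); terms with p = 0 contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# From a pure node to a perfectly mixed two-class node
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, "gini =", round(gini(probs), 4), "entropy =", round(entropy(probs), 4))
```

A pure node scores 0 on both measures, while a 50/50 two-class node reaches the two-class maximum of each: Gini 0.5 and entropy 1 bit.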

In [None]:
Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

In [None]:
**Difference between Pre-Pruning and Post-Pruning in Decision Trees**

| Aspect           | Pre-Pruning (Early Stopping)                                                                 | Post-Pruning (Prune After Growth)                                                            |
| ---------------- | -------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
| **When Applied** | Stops tree growth early, before it becomes too complex.                                      | Grows the full tree first, then removes unnecessary branches.                                |
| **Method**       | Uses constraints like maximum depth, minimum samples per leaf, or minimum impurity decrease. | Uses validation data or cost-complexity pruning to cut branches that don’t improve accuracy. |
| **Goal**         | Prevent overfitting from the start.                                                          | Simplify an already overfit tree to improve generalization.                                  |

**Practical Advantage**

* **Pre-Pruning Advantage:** Saves computational time and memory since the tree is never allowed to grow too large.
* **Post-Pruning Advantage:** Often yields better accuracy than pre-pruning because it evaluates fully grown subtrees before removing them, so a useful split that sits below a weak one is not missed.
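
A rough sketch of both approaches in scikit-learn follows. The pre-pruning limits (`max_depth=3`, `min_samples_leaf=5`) and the choice of the middle alpha on the pruning path are illustrative assumptions; in practice these values would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: constrain growth up front (illustrative values)
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute its cost-complexity pruning path
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Take the middle alpha purely for illustration; clamp at 0 to guard against rounding
alpha = max(0.0, path.ccp_alphas[len(path.ccp_alphas) // 2])
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("pre-pruned", pre), ("full", full), ("post-pruned", post)]:
    print(f"{name}: nodes={model.tree_.node_count}, test accuracy={model.score(X_test, y_test):.4f}")
```

Larger `ccp_alpha` values remove more branches, so scanning the alphas in `path.ccp_alphas` traces the tree from fully grown down to a single node.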



In [None]:
Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

In [None]:
**Information Gain (IG) in Decision Trees**

* **Definition:**
  Information Gain measures the reduction in impurity (or uncertainty) about the target variable after splitting the data on a specific feature.
  It’s calculated as:

$$
IG = \text{Impurity before split} - \text{Weighted impurity after split}
$$

Here impurity is usually measured with **Entropy**; when **Gini impurity** is used instead, the same difference is often called Gini gain.

---

**Why It’s Important for Choosing the Best Split:**

* Decision Trees work by dividing data into smaller, purer subsets.
* IG quantifies **how much a split improves the “purity” of the target classes**.
* The higher the IG, the better the split, because it means the feature is more informative about the target.
* Without IG (or a similar measure), the tree might choose irrelevant features, leading to poor predictions and overfitting.

---

**Example:**
If splitting by “Age < 30” reduces the class mixing more than splitting by “Income > 50K,” then **Age** would be chosen as the splitting feature at that node because it has higher IG.
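
A small numeric sketch of this comparison is given below. The class counts are hypothetical, chosen only to mimic one informative split and one weak split, and the helper names `entropy` and `information_gain` are illustrative.

```python
import math

def entropy(counts):
    """Entropy in bits of a node, given its per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 5 positive and 5 negative samples
parent = [5, 5]
split_a = [[4, 1], [1, 4]]  # stands in for "Age < 30": children are mostly one class
split_b = [[3, 2], [2, 3]]  # stands in for "Income > 50K": children stay mixed

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print("IG split A:", round(ig_a, 4))
print("IG split B:", round(ig_b, 4))
```

Split A leaves purer children, so its information gain is larger and the tree would select it at this node.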



In [None]:
Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

In [None]:
**Common Real-World Applications of Decision Trees**

1. **Customer Segmentation & Marketing**

   * Predicting which customers are likely to buy a product.
   * Example: Targeting high-value customers for promotions.

2. **Fraud Detection**

   * Classifying transactions as legitimate or fraudulent.
   * Example: Credit card companies flagging suspicious purchases.

3. **Medical Diagnosis**

   * Assisting doctors in predicting diseases based on symptoms and test results.

4. **Loan Approval & Credit Scoring**

   * Deciding whether to approve a loan based on an applicant’s income, credit history, etc.

5. **Manufacturing & Quality Control**

   * Identifying defects or causes of machine failures.

---

**Main Advantages**

* **Easy to Interpret:** Visual and understandable even for non-technical stakeholders.
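* **Handles Both Numerical & Categorical Data:** Flexible with different data types (although scikit-learn's implementation requires categorical features to be numerically encoded first).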
* **No Feature Scaling Needed:** Works without normalization or standardization.
* **Captures Non-Linear Relationships:** Can model complex decision boundaries.

---

**Main Limitations**

* **Overfitting:** Trees can grow too complex without pruning.
* **Instability:** Small changes in data can lead to different splits.
* **Biased Towards Features with More Levels:** Categorical variables with many categories may be chosen more often.
* **Lower Predictive Accuracy:** Often less accurate than ensemble methods like Random Forests or Gradient Boosted Trees.


In [None]:
Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV).
Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances
(Include your Python code and output in the code box below.)

In [1]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data       # Features
y = iris.target     # Target labels

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Model Accuracy:", accuracy)
print("Feature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [None]:
Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
(Include your Python code and output in the code box below.)

In [2]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model 1: Decision Tree with max_depth=3 (Pre-Pruned)
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)
y_pred_depth3 = clf_depth3.predict(X_test)
accuracy_depth3 = accuracy_score(y_test, y_pred_depth3)

# Model 2: Fully-grown Decision Tree
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print comparison
print(f"Accuracy (max_depth=3): {accuracy_depth3:.4f}")
print(f"Accuracy (Full Tree):   {accuracy_full:.4f}")


Accuracy (max_depth=3): 1.0000
Accuracy (Full Tree):   1.0000
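
Both trees score 1.0 on this 30-sample test split, so the split alone cannot tell them apart. One option, sketched here as an assumption rather than as part of the original answer, is to compare them with 5-fold cross-validation on the whole dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "full tree": DecisionTreeClassifier(random_state=42),
}

# Mean and spread of accuracy over 5 folds for each model
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

Averaging over several folds gives a steadier estimate than a single small test set, which matters most when the two scores are as close as they are here.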


In [None]:
Question 8: Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
(Include your Python code and output in the code box below.)

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

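# Load data (downloaded and cached on first use)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Evaluate on the held-out test set
mse = mean_squared_error(y_test, reg.predict(X_test))
print("MSE:", mse)
print("Feature Importances:")
for feature_name, importance in zip(housing.feature_names, reg.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")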


In [None]:
Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy
(Include your Python code and output in the code box below.)

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Define the parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# Perform Grid Search with cross-validation
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)
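
Beyond the single best setting, the fitted search also keeps the cross-validated score of every parameter combination. The sketch below, which repeats the same grid and split so that it runs on its own, lists the top five combinations; the cut-off of five is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {"max_depth": [2, 3, 4, 5, None], "min_samples_split": [2, 3, 4, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

# best_score_ is the mean cross-validated accuracy of the best combination
print("Best CV accuracy:", round(search.best_score_, 4))

# Rank all combinations by mean cross-validated accuracy, highest first
results = search.cv_results_
ranked = sorted(
    zip(results["mean_test_score"], results["params"]), key=lambda pair: -pair[0]
)
for score, params in ranked[:5]:
    print(f"{score:.4f}  {params}")
```

When several combinations tie at the top, the simpler tree (smaller `max_depth`, larger `min_samples_split`) is usually the safer pick.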
