**Decision Tree | Assignment**

**1: What is a Decision Tree, and how does it work in the context of classification?**

---
Ans: A **Decision Tree** is a supervised machine learning model that makes predictions by applying a sequence of simple tests to the input features.  
In classification, it assigns each data point to one of a set of predefined classes.  

It works like a flowchart:
- The tree starts at the root node, which is the first question about the data.  
- Based on the answer, the data follows a branch to the next question.  
- This continues until the data reaches a leaf node, which gives the final class or result.  

For example, to decide if a fruit is an apple or an orange, the tree might first ask about the color.  
If it is red, the tree predicts "apple." If it is orange in color, the tree predicts "orange."  
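
The fruit example can be sketched as a tiny hand-built tree. This is only an illustration: the dictionary layout, the `predict` helper, and the feature name `color` are assumptions made for this sketch, not how any particular library stores a tree.

```python
# A hand-built tree for the fruit example (illustrative layout, not a library format).
# Internal nodes hold a feature to test and one branch per answer; leaves hold a label.
tree = {
    "feature": "color",                  # root node: the first question
    "branches": {
        "red": {"label": "apple"},       # leaf node
        "orange": {"label": "orange"},   # leaf node
    },
}


def predict(node, sample):
    """Follow branches from the root until a leaf node is reached."""
    while "label" not in node:
        answer = sample[node["feature"]]
        node = node["branches"][answer]  # raises KeyError for an unseen answer
    return node["label"]


print(predict(tree, {"color": "red"}))
print(predict(tree, {"color": "orange"}))
```

A learned tree has the same shape; the difference is that the training algorithm chooses the questions from the data instead of a person writing them by hand.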




**2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

---



Ans: When building a Decision Tree, the algorithm needs a way to decide which feature
and value should be used to split the data. For this, it uses impurity measures.
Two common measures are Gini Impurity and Entropy.

**1. Gini Impurity**  
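- Gini measures how often a randomly chosen element from a node would be labeled incorrectly if it were labeled at random according to the class distribution in that node.  
- It is computed as `1 - sum(p_i^2)`, where `p_i` is the proportion of class `i` in the node.  
- A Gini value of 0 means the node is pure (all data belongs to one class).  
- Higher values mean the data is more mixed between classes; for two classes the maximum is 0.5, reached at a 50/50 split.  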

**2. Entropy**  
- Entropy comes from information theory. It measures the "disorder" or "uncertainty" in the data and is computed as `-sum(p_i * log2(p_i))`.  
- Entropy is 0 when all examples belong to the same class.  
- Entropy is higher when the classes are more evenly mixed; for two classes the maximum is 1 bit, reached at a 50/50 split.  

**Impact on Splits**  
Both Gini and Entropy are used to find the "best split" in the data:  
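- The Decision Tree looks for the feature and threshold that give the greatest reduction in impurity.  
- The impurity after a split is the weighted average of the child nodes' impurities (weighted by the number of samples in each child); the lower it is, the better the split.  
- In practice, both Gini and Entropy often lead to very similar trees, but Gini is a bit faster to compute because it does not need a logarithm.  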
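
Both measures can be computed directly from the class labels in a node. The following is a minimal sketch in plain Python; the helper names `class_proportions`, `gini`, and `entropy` are chosen for this illustration and are not taken from any library.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes present."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


pure_node = ["apple"] * 4
mixed_node = ["apple", "apple", "orange", "orange"]

print(gini(pure_node), entropy(pure_node))    # both are zero for a pure node
print(gini(mixed_node), entropy(mixed_node))  # both peak for a 50/50 node
```

For two classes, Gini ranges from 0 to 0.5 and entropy from 0 to 1, which is why the two criteria usually rank candidate splits in a similar order.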




**3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

---
Ans:
**Pre-Pruning**  
Pre-pruning stops the growth of a Decision Tree early by setting limits before training, such as a maximum depth or a minimum number of samples required to split a node.  
- *Advantage:* It reduces training time and memory, because branches beyond the limits are never built, and it keeps the tree from becoming too large and complex.  

**Post-Pruning**  
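Post-pruning allows the tree to grow fully and then removes the branches that contribute little to predictive performance (for example, with cost-complexity pruning). This makes the tree simpler and reduces overfitting.  
- *Advantage:* Because pruning decisions are made after the whole tree has been seen, it does not discard a useful split just because the split before it looked weak, which often gives better accuracy on new data.  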
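
As a rough sketch of how the two approaches look in scikit-learn: pre-pruning is expressed through constructor limits such as `max_depth`, while post-pruning is available as cost-complexity pruning through `ccp_alpha`. The specific limits and the mid-path choice of `alpha` below are arbitrary assumptions for illustration; in practice `alpha` would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth stops as soon as a limit is hit.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it with cost-complexity pruning.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary choice for illustration: an alpha from the middle of the pruning path.
# max(0.0, ...) guards against tiny negative values caused by rounding.
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: nodes={model.tree_.node_count}, "
          f"test accuracy={model.score(X_test, y_test):.2f}")
```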



**4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

---
Ans: **Information Gain** measures how much a split on a feature reduces uncertainty or disorder (entropy) in the data.  
- When a Decision Tree considers a split, it computes the entropy of the parent node minus the weighted average entropy of the child nodes, where each child is weighted by its share of the parent's samples.  
- The split that gives the highest information gain is chosen because it best separates the classes.  

**Why it is important:**  
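Information Gain gives the tree a consistent way to rank candidate splits. Choosing the most informative split at each node tends to produce purer leaves with fewer levels, which supports accurate predictions and a simpler tree structure.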
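
The calculation can be illustrated with a small worked example in plain Python. The `entropy` and `information_gain` helpers and the toy `yes`/`no` labels below are assumptions made for this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child) for child in children
    )
    return entropy(parent) - weighted_child_entropy


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes perfectly.
perfect_split = [["yes"] * 5, ["no"] * 5]

# A split whose children are as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"],
                 ["yes", "yes", "yes", "no", "no", "no"]]

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

The perfect split removes all of the parent's uncertainty, while the second split removes none of it, so a tree using this criterion would pick the first.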



**5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

---
Ans: **Applications of Decision Trees:**  
- **Medical diagnosis:** Predicting diseases based on symptoms and test results.  
- **Finance:** Credit scoring and loan approval decisions.  
- **Marketing:** Customer segmentation and predicting purchase behavior.  
- **Engineering:** Fault detection and quality control.  

**Advantages:**  
- Easy to understand and interpret.  
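- Can handle both numerical and categorical data (some implementations, including scikit-learn, still require categorical features to be encoded as numbers).  
- Requires little data preparation; for example, features do not need to be scaled or normalized.  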

**Limitations:**  
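- Can easily overfit the data if the tree grows too deep.  
- Sensitive to small changes in the training data: a few different samples can produce a very different tree (high variance).  
- Often less accurate than ensemble models such as Random Forests or Gradient Boosting, which combine many trees.  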
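
The interpretability advantage listed above can be seen by printing the rules of a fitted tree. The sketch below uses scikit-learn's `export_text`; the choice of `max_depth=2` is an arbitrary assumption to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set short enough to read at a glance.
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# export_text renders the learned splits as nested if/else style rules.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each printed line is a threshold test on one feature, so the path to any prediction can be read from top to bottom.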



In [1]:
# Dataset Info:
# ● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
# ● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston(), removed in scikit-learn 1.2, or provided CSV).
# Question 6: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

# Ans:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Decision Tree Classifier: {accuracy:.2f}")

for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"Feature: {feature}, Importance: {importance:.2f}")


Accuracy of the Decision Tree Classifier: 1.00
Feature: sepal length (cm), Importance: 0.00
Feature: sepal width (cm), Importance: 0.02
Feature: petal length (cm), Importance: 0.91
Feature: petal width (cm), Importance: 0.08


In [2]:
# 7: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

# Ans:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
y_pred_limited = limited_tree.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

print(f"Accuracy of fully-grown tree: {accuracy_full:.2f}")
print(f"Accuracy of tree with max_depth=3: {accuracy_limited:.2f}")


Accuracy of fully-grown tree: 1.00
Accuracy of tree with max_depth=3: 1.00


In [3]:
# 8: Write a Python program to:
# ● Load the California Housing dataset from sklearn
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances

# Ans:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

for feature, importance in zip(data.feature_names, regressor.feature_importances_):
    print(f"Feature: {feature}, Importance: {importance:.4f}")


Mean Squared Error: 0.50
Feature: MedInc, Importance: 0.5285
Feature: HouseAge, Importance: 0.0519
Feature: AveRooms, Importance: 0.0530
Feature: AveBedrms, Importance: 0.0287
Feature: Population, Importance: 0.0305
Feature: AveOccup, Importance: 0.1308
Feature: Latitude, Importance: 0.0937
Feature: Longitude, Importance: 0.0829


In [4]:
# 9: Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# ● Print the best parameters and the resulting model accuracy

# Ans:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {'max_depth': [2, 3, 4, 5, None], 'min_samples_split': [2, 5, 10]}

dt = DecisionTreeClassifier(random_state=42)

grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Best Parameters: {best_params}")
print(f"Accuracy of the tuned Decision Tree: {accuracy:.2f}")


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy of the tuned Decision Tree: 1.00


**10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:**

● **Handle the missing values**

● **Encode the categorical features**

● **Train a Decision Tree model**

● **Tune its hyperparameters**

● **Evaluate its performance**

**And describe what business value this model could provide in the real-world setting.**

---
Ans: **1. Handle Missing Values:**  
- Identify which columns have missing data.  
- For numerical features, replace missing values with the mean or median, computed from the training data only so that no information leaks from the test set.  
- For categorical features, replace missing values with the mode or create a special category like "Unknown".  

**2. Encode Categorical Features:**  
- Convert categorical features into numeric form so the Decision Tree can use them.  
- Use **one-hot encoding** for nominal variables (no natural order) and **ordinal encoding** for ordinal variables, mapping categories to integers that preserve their order.  

**3. Train a Decision Tree Model:**  
- Split the dataset into **training and testing sets**, stratifying on the target so both sets keep a similar proportion of positive cases.  
- Train a Decision Tree classifier on the training data.  
- Use default parameters first to get a baseline performance.  

**4. Tune Hyperparameters:**  
- Adjust parameters like max_depth, min_samples_split, or max_features to prevent overfitting and improve accuracy.  
- Use GridSearchCV or RandomizedSearchCV to find the best combination of parameters.  

**5. Evaluate Performance:**  
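- Measure the model’s accuracy, precision, recall, and F1-score on the test set. If the disease is rare, accuracy alone can be misleading, and recall matters most because a missed case (false negative) is usually the costliest error.  
- Use a confusion matrix to see how well the model predicts positive and negative cases.  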
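
Steps 1 to 5 can be sketched end to end with a scikit-learn pipeline. Everything about the data here is an assumption: the column names (`age`, `blood_pressure`, `smoker`, `disease`), the synthetic values, and the parameter grid are made up for illustration and stand in for a real patient dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and values).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["smoker"] = df["smoker"].astype(object)  # plain object dtype; NaN marks missing
risk = 0.04 * (df["age"] - 50) + 1.0 * (df["smoker"] == "yes") + rng.normal(0, 1, n)
df["disease"] = (risk > 0.5).astype(int)

# Knock out roughly 10% of two columns to simulate missing values.
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

X, y = df.drop(columns="disease"), df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Steps 1 and 2: impute and encode inside the pipeline, so the statistics
# are learned from the training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: the model itself.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 4: tune with cross-validation, optimizing recall on the positive class.
param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_split": [2, 10, 20],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Keeping the imputer and encoder inside the pipeline means `GridSearchCV` refits them on each training fold, which avoids leaking information from the validation folds into the preprocessing.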

**Business Value:**  
- The model can help doctors identify patients at risk of the disease early.  
- It supports better resource allocation by focusing attention on high-risk patients.  
- It can reduce healthcare costs by enabling preventive measures.  
- Overall, it can improve patient outcomes and operational efficiency when used to support clinicians' decisions.


