## Q1: What is a Decision Tree, and how does it work in classification?

A Decision Tree is a supervised learning algorithm used for classification and regression tasks. In classification:

- Each internal node represents a decision based on a feature.
- Each branch represents the outcome of the decision.
- Each leaf node represents a class label.

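The tree recursively partitions the training data on feature values so that each resulting subset is as pure as possible, meaning it is dominated by a single class. At each node it chooses the split that most reduces an impurity measure such as **Gini Impurity** or **Entropy**.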
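To make the node/branch/leaf structure concrete, here is a minimal sketch that fits a shallow tree on the Iris dataset and prints its learned rules as text. The `max_depth=2` and `random_state=0` settings are assumptions chosen only to keep the printout short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented "feature <= threshold" line is an internal node test,
# and each "class: k" line is a leaf holding the predicted label
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output top to bottom follows one root-to-leaf path per branch, which is exactly how a prediction is made for a new sample.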

---

## Q2: Explain Gini Impurity and Entropy. How do they impact splits?

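- **Gini Impurity** measures the probability of misclassifying a randomly chosen sample if it were labeled at random according to the node's class distribution:

  $$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$

- **Entropy** measures the disorder or uncertainty in the node's class distribution:

  $$H = -\sum_{i=1}^{C} p_i \log_2 p_i$$

Here $p_i$ is the proportion of samples in the node that belong to class $i$, and $C$ is the number of classes. Both measures are 0 for a pure node and largest when the classes are evenly mixed.

Both are used to evaluate how well a feature splits the data: the lower the impurity or entropy of the resulting child nodes, the better the split.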
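As a quick illustration, the two measures can be computed by hand for a list of class labels. This is a minimal sketch in plain Python; the function names and the toy label lists are made up for the example.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the proportion of each class present in `labels`."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over present classes."""
    return sum(-p * log2(p) for p in class_proportions(labels))


mixed = ["a"] * 5 + ["b"] * 5   # evenly mixed node
pure = ["a"] * 10               # pure node

print("mixed:", gini(mixed), entropy(mixed))
print("pure: ", gini(pure), entropy(pure))
```

The evenly mixed two-class node gives the maximum values for two classes (Gini 0.5, entropy 1 bit), while the pure node gives 0 for both.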

---

## Q3: Difference between Pre-Pruning and Post-Pruning

| Type         | Description | Pros | Cons |
|--------------|-------------|------|------|
| Pre-Pruning  | Stops tree growth early using constraints such as `max_depth` and `min_samples_split` | Faster to train; limits overfitting | May underfit by stopping before a useful split is found |
| Post-Pruning | Grows the full tree first, then removes branches that do not help on validation data | Often generalizes better, since branches are judged after the full tree is known | Computationally more expensive |
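The two strategies can be compared in scikit-learn roughly as follows. This is a sketch under a few assumptions: it uses Iris as a stand-in dataset, a single train/validation split, and cost-complexity pruning (`ccp_alpha`) as the post-pruning method, which is one option among several.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Pre-pruning: constrain growth up front
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=5, random_state=0
).fit(X_train, y_train)

# Post-pruning: compute the candidate alphas of the fully grown tree,
# refit one tree per alpha, and keep the one that scores best on validation
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
candidates = [
    # max(..., 0.0) guards against tiny negative alphas from rounding
    DecisionTreeClassifier(random_state=0, ccp_alpha=max(alpha, 0.0)).fit(
        X_train, y_train
    )
    for alpha in path.ccp_alphas
]
# Iterating in reverse means ties go to the more heavily pruned tree
post_pruned = max(reversed(candidates), key=lambda m: m.score(X_val, y_val))

print("Pre-pruned  leaves:", pre_pruned.get_n_leaves(),
      "val accuracy:", pre_pruned.score(X_val, y_val))
print("Post-pruned leaves:", post_pruned.get_n_leaves(),
      "val accuracy:", post_pruned.score(X_val, y_val))
```

Refitting one tree per alpha is what makes post-pruning more expensive than setting a depth limit once.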

---

## Q4: What is Information Gain and why is it important?

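**Information Gain (IG)** measures the reduction in entropy achieved by a split:

$$IG(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$

where $H$ is entropy, $S$ is the set of samples at the node, and $S_v$ is the subset of $S$ for which feature $A$ takes value $v$.

It helps select the best feature for splitting by quantifying how much uncertainty is reduced: the candidate split with the highest information gain is chosen.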
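A small worked example may help. The sketch below computes information gain for three hypothetical splits of the same ten-sample parent node; the helper names and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return sum(
        -(count / total) * log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

perfect = [["yes"] * 5, ["no"] * 5]                       # children are pure
partial = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]    # children mostly pure
useless = [["yes", "no"] * 2, ["yes", "no"] * 3]          # children as mixed as parent

for name, split in [("perfect", perfect), ("partial", partial), ("useless", useless)]:
    print(name, round(information_gain(parent, split), 4))
```

The perfect split recovers the full 1 bit of parent entropy, the useless split recovers none, and the partial split falls in between, so a tree builder comparing these three would pick the perfect one.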

---

## Q5: Real-world applications of Decision Trees

- **Applications**: Medical diagnosis, loan approval, fraud detection, customer segmentation.
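- **Advantages**:
  - Easy to interpret, since each prediction follows a readable chain of rules
  - Can handle both numerical and categorical data (scikit-learn's implementation expects categorical features to be encoded numerically first)
  - Requires minimal preprocessing; feature scaling is not needed
- **Limitations**:
  - Prone to overfitting when grown without constraints
  - Sensitive to small data changes: a slightly different training set can produce a very different tree
  - Often less accurate than ensemble methods such as random forests or gradient boosting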

---

## Q10: Healthcare Case Study – Step-by-Step Process

1. **Handle Missing Values**: Use imputation (`SimpleImputer`) for numerical and categorical features.
2. **Encode Categorical Features**: Use `OneHotEncoder` or `OrdinalEncoder`.
3. **Train Model**: Fit `DecisionTreeClassifier` on cleaned data.
4. **Tune Hyperparameters**: Use `GridSearchCV` to optimize `max_depth`, `min_samples_split`.
5. **Evaluate Performance**: On a held-out test set, use metrics like accuracy, precision, recall, and the confusion matrix.

**Business Value**: Supports early disease detection, personalized treatment, and resource optimization, which can lead to better patient outcomes and reduced costs.
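The five steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is hypothetical: the column names (`age`, `blood_pressure`, `smoker`), the synthetic label rule, and the missing-value rates are invented so the example runs on its own, and a real study would substitute its own dataset and metric choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic patient data (not a real dataset)
rng = np.random.default_rng(0)
n = 300
age = rng.integers(20, 90, size=n).astype(float)
blood_pressure = rng.normal(130, 20, size=n)
is_smoker = rng.random(n) < 0.3
smoker = np.where(is_smoker, "yes", "no").astype(object)

# Synthetic label: risk rises with age, blood pressure, and smoking
risk = (0.03 * (age - 55) + 0.02 * (blood_pressure - 130)
        + is_smoker + rng.normal(0, 0.5, size=n))
y = (risk > 0.5).astype(int)

# Knock out about 10% of two columns to simulate missing entries
age[rng.random(n) < 0.1] = np.nan
smoker[rng.random(n) < 0.1] = np.nan
X = pd.DataFrame({"age": age, "blood_pressure": blood_pressure, "smoker": smoker})

# Steps 1-2: impute missing values, then one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: the model sits at the end of the pipeline
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 4: tune hyperparameters with 5-fold cross-validation on the training set
search = GridSearchCV(
    model,
    {"tree__max_depth": [2, 3, 4, 5], "tree__min_samples_split": [2, 5, 10]},
    cv=5,
)
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print("Best parameters:", search.best_params_)
print(cm)
print(classification_report(y_test, y_pred))
```

Keeping imputation and encoding inside the pipeline means they are refit on each cross-validation fold, so no information from the validation folds leaks into preprocessing.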

---

In [1]:
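# Q6: Train a Gini-based decision tree on Iris and inspect feature importances
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

# Unconstrained tree using Gini impurity as the split criterion
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X, y)

# Note: accuracy is measured on the training data itself, so a score of 1.0
# shows the tree memorized the data, not that it generalizes.
print("Accuracy:", accuracy_score(y, clf.predict(X)))
print("Feature Importances:", clf.feature_importances_)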

Accuracy: 1.0
Feature Importances: [0.01333333 0.01333333 0.55072262 0.42261071]


In [2]:
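# Q7: Compare a fully grown tree with one limited to max_depth=3
# (reuses the Iris X, y loaded in the Q6 cell)
clf_full = DecisionTreeClassifier()
clf_limited = DecisionTreeClassifier(max_depth=3)

clf_full.fit(X, y)
clf_limited.fit(X, y)

# Both scores are training accuracy; the lower score of the limited tree
# reflects reduced capacity, not necessarily worse generalization.
print("Full Tree Accuracy:", accuracy_score(y, clf_full.predict(X)))
print("Limited Tree Accuracy:", accuracy_score(y, clf_limited.predict(X)))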

Full Tree Accuracy: 1.0
Limited Tree Accuracy: 0.9733333333333334
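The accuracies above are computed on the same data the trees were trained on, so they cannot show whether limiting depth helps or hurts generalization. A minimal sketch of a held-out comparison follows; the 70/30 split and `random_state=0` are assumptions chosen for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Full tree": DecisionTreeClassifier(random_state=0),
    "Limited tree (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    results[name] = (train_acc, test_acc)
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Comparing the train and test columns side by side shows how much of each model's training accuracy carries over to unseen samples.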


In [4]:
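# Q8: Fit a decision tree regressor on the California Housing dataset
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X, y = housing.data, housing.target

# Unconstrained regressor: grows until every training sample is fit
reg = DecisionTreeRegressor()
reg.fit(X, y)

# Note: predictions are made on the training data, so the near-zero MSE
# below reflects memorization rather than predictive quality.
preds = reg.predict(X)
print("MSE:", mean_squared_error(y, preds))
print("Feature Importances:", reg.feature_importances_)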

MSE: 9.570289276518477e-32
Feature Importances: [0.52509405 0.05087686 0.05481703 0.02650712 0.03115477 0.13090022
 0.09486864 0.08578129]


In [6]:
# Q9: Tune max_depth and min_samples_split with 5-fold cross-validation
# (reuses the California Housing X, y loaded in the Q8 cell)
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

params = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 5, 10]
}

# A regressor is used because the target is continuous
grid = GridSearchCV(DecisionTreeRegressor(), params, cv=5)
grid.fit(X, y)

print("Best Parameters:", grid.best_params_)
# For a regressor, best_score_ is the mean cross-validated R^2, not accuracy
print("Best CV R^2:", grid.best_score_)

Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Best CV R^2: 0.47357814961836614
