# *DECISION TREE ASSIGNMENT*

## 1. What is a Decision Tree, and how does it work in the context of classification?

A **Decision Tree** is a type of supervised learning algorithm that is used for both **classification** and **regression** problems.  
In simple words, it works like a flowchart — we start from the top (root node), ask questions at each step (internal nodes), and based on the answers (yes/no or true/false), we move down different branches until we reach a final decision (leaf node).

For **classification**, the goal is to split the data into groups that are as pure as possible — meaning each group mostly belongs to one class.

Here’s how it works step-by-step:
1. The algorithm picks the **best feature** to split the data using measures like **Gini Impurity** or **Information Gain (Entropy)**.
2. It divides the dataset based on that feature.
3. This process keeps repeating for each branch until:
   - All the data in a node belongs to one class, or  
   - No further splits can improve the classification.
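
To see the flowchart idea concretely, here is a minimal sketch that fits a small tree on the Iris dataset and prints its rules with scikit-learn's `export_text`. The `max_depth=2` limit is only an assumption to keep the printout short; it is not required by the algorithm.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed flowchart short enough to read.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Indented lines with a threshold are questions at the root/internal nodes;
# lines starting with "class:" are the leaf nodes (final decisions).
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom follows one path from the root node to a leaf.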


## 2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

When a Decision Tree splits data, it tries to make each group (node) as "pure" as possible — meaning most samples in that node belong to the same class.  
To measure how pure or impure a node is, we use **impurity measures** like **Gini Impurity** and **Entropy**.
####  Gini Impurity:
Gini measures how often a randomly chosen element would be **incorrectly classified** if we randomly label it based on class distribution.
####  Entropy:
Entropy measures the **uncertainty** or **randomness** in the data.
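
Both measures can be computed directly from the class proportions $p_k$ in a node, using the standard definitions $\text{Gini} = 1 - \sum_k p_k^2$ and $\text{Entropy} = -\sum_k p_k \log_2 p_k$. The sketch below is a plain-Python illustration; the function names and the tiny label lists are made up for this example.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

pure = ["yes", "yes", "yes", "yes"]    # only one class
mixed = ["yes", "yes", "no", "no"]     # 50/50 mix of two classes

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

A pure node scores 0 on both measures, while an evenly mixed two-class node gets the maximum value (0.5 for Gini, 1 bit for Entropy).
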
#### Impact on Splits:
Both Gini and Entropy help the Decision Tree choose **where to split**:
- The algorithm tests different features and finds the one that gives the **lowest impurity after the split**.
- The **cleaner** (purer) the resulting groups, the **better** the split.
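
As a rough sketch of how two candidate splits could be compared, the code below scores each split by the size-weighted Gini impurity of its two child nodes and keeps the lower one. The helper names and the six-sample splits are hypothetical and only meant to illustrate the idea.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """Size-weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Two hypothetical ways of splitting the same 6 samples
split_a = (["yes", "yes", "yes"], ["no", "no", "no"])   # separates the classes
split_b = (["yes", "no", "yes"], ["no", "yes", "no"])   # leaves both children mixed

scores = {"a": weighted_gini(*split_a), "b": weighted_gini(*split_b)}
best = min(scores, key=scores.get)   # lowest impurity after the split wins
print(scores, "-> best split:", best)
```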

So basically:
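- **Gini** is slightly cheaper to compute because it needs no logarithm, and it is the default criterion in scikit-learn's `DecisionTreeClassifier`.
- **Entropy** is the information-theoretic measure (it uses logarithms), but in practice it usually produces splits very similar to Gini.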


## 3.  What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

When a Decision Tree grows, it can easily become too complex — learning even the noise in the data (overfitting).  
**Pruning** is the technique used to simplify the tree and improve its ability to generalize.

---

####  Pre-Pruning (Early Stopping):
In **pre-pruning**, we **stop the tree from growing too deep** during the building process itself.

This is done by setting some limits like:
- Maximum depth of the tree (`max_depth`)
- Minimum samples required to split a node (`min_samples_split`)
- Minimum samples required at a leaf node (`min_samples_leaf`)

**Advantage:**  
- Prevents the tree from becoming too large and overfitting early on, saving both **time** and **computational cost**.
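
A minimal sketch of pre-pruning with the three limits listed above, compared against an unrestricted tree on Iris. The specific values (3, 10, 5) are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No limits: the tree keeps splitting until the leaves are pure.
unrestricted = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned: growth stops as soon as any of the limits is reached.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, min_samples_leaf=5, random_state=42
).fit(X, y)

for name, tree in [("unrestricted", unrestricted), ("pre-pruned", pre_pruned)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```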

####  Post-Pruning (Reduced Error Pruning):
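In **post-pruning**, the tree is **fully grown first**, and then we **trim unnecessary branches** that don’t improve performance on a validation set. Reduced error pruning is one such method; scikit-learn offers cost-complexity pruning instead, through the `ccp_alpha` parameter.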

Steps:
1. Grow the full tree.
2. Evaluate performance on validation data.
3. Remove branches that don’t add much predictive power.

**Advantage:**  
- Helps improve **model accuracy and generalization** by removing weak or noisy splits after seeing the whole picture.
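
The sketch below follows the three steps using scikit-learn. Note the swap: scikit-learn does not ship reduced error pruning, so this uses its cost-complexity pruning (`cost_complexity_pruning_path` and `ccp_alpha`) and picks the pruning strength on a validation split. The 70/30 split and the tie-breaking rule are my own assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 1: grow the full tree and get the candidate pruning strengths (alphas).
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Steps 2-3: refit once per alpha and keep the best validation accuracy.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha).fit(X_train, y_train)
print("Leaves before pruning:", full_tree.get_n_leaves())
print("Leaves after pruning: ", pruned_tree.get_n_leaves())
print("Validation accuracy:  ", round(best_score, 3))
```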


## 4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?

**Information Gain (IG)** is a measure used in Decision Trees to decide **which feature to split on** at each step.  
It tells us how much **“information” or “purity” improvement** we get when we split the dataset using a particular feature.

####  Why It’s Important:
Information Gain helps the Decision Tree **choose the best attribute** to split on at each level.  
The algorithm always picks the feature with the **highest Information Gain**, because it leads to the **most significant reduction in uncertainty**.


**Example:**  
If we’re classifying whether someone will buy a phone:
- Splitting by “Income” might give higher IG than “Age”  
- So the tree will choose **Income** as the first split feature.
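
To make the phone example concrete, here is a small sketch that computes Information Gain as the parent's entropy minus the size-weighted entropy of the child groups. The six rows of data are invented purely for illustration, and the helper names are my own.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Parent entropy minus the size-weighted entropy of the groups
    formed by each distinct value of the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    children = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - children

# Hypothetical toy data (not real customers)
income = ["high", "high", "high", "low", "low", "low"]
age = ["young", "old", "young", "old", "young", "old"]
buys = ["yes", "yes", "yes", "no", "no", "no"]

ig = {"Income": information_gain(income, buys), "Age": information_gain(age, buys)}
first_split = max(ig, key=ig.get)   # feature with the highest Information Gain
print(ig, "-> split on", first_split)
```

In this toy data, Income separates buyers from non-buyers perfectly, so it gets the higher gain and would be chosen first.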


## 5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

####  Real-World Applications:
1. **Finance:**  
   Used for credit scoring, loan approval, and detecting fraud.  
   Example: Predicting whether a person is likely to repay a loan or not.

2. **Healthcare:**  
   Helps in diagnosing diseases based on symptoms and test results.  
   Example: Predicting if a tumor is malignant or benign.

3. **Marketing:**  
   Used for customer segmentation and predicting buying behavior.  
   Example: Will a customer buy a product after seeing an ad?

4. **HR and Hiring:**  
   Helps filter job applicants based on their qualifications and experience.

5. **Manufacturing:**  
   Used for quality control and predicting machine failures.


####  Advantages:
1. **Easy to Understand and Interpret:**  
   The tree structure looks like a simple flowchart — great for explaining decisions to non-technical people.

2. **Handles Both Numerical and Categorical Data:**  
   Works well even when data types are mixed.

3. **No Need for Feature Scaling:**  
   Unlike algorithms like SVM or KNN, decision trees don’t need normalization or standardization.

4. **Good for Small to Medium Datasets:**  
   They can perform well without needing huge amounts of data.


#### Limitations:
1. **Overfitting:**  
   If not pruned properly, trees can become too deep and learn noise instead of actual patterns.

2. **Unstable:**  
   Small changes in data can create a completely different tree.

3. **Biased Toward Features with More Levels:**  
   Features with many categories might dominate splits.

4. **Not Great for Continuous Predictions Alone:**  
   Decision Trees can give step-like predictions, which may not be smooth for regression tasks.
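
The step-like behaviour is easy to check with a small sketch: a regression tree predicts one constant per leaf, so a depth-2 tree can output at most four distinct values, however smooth the target is. The sine-curve data below is synthetic and assumed only for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a smooth curve sampled at 200 points
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel()

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
pred = reg.predict(X)

# Each leaf predicts a single constant, so the prediction curve is a staircase.
n_steps = np.unique(pred).size
print("Distinct target values:   ", np.unique(y).size)
print("Distinct predicted values:", n_steps)
```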


## 6. Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model’s accuracy and feature importances

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data        # features
y = iris.target      # labels

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree using the Gini criterion
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", round(accuracy * 100, 2), "%")

# Show how much each feature contributed to the splits
print("\nFeature Importances:")
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {round(importance, 3)}")
```


```
Model Accuracy: 100.0 %

Feature Importances:
sepal length (cm): 0.0
sepal width (cm): 0.017
petal length (cm): 0.906
petal width (cm): 0.077
```


## 7. Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tree limited to a depth of 3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)

# Fully-grown tree (no depth limit)
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)

# Compare both trees on the same test set
y_pred_limited = tree_limited.predict(X_test)
y_pred_full = tree_full.predict(X_test)

accuracy_limited = accuracy_score(y_test, y_pred_limited)
accuracy_full = accuracy_score(y_test, y_pred_full)

print("Accuracy (max_depth=3):", round(accuracy_limited * 100, 2), "%")
print("Accuracy (fully-grown tree):", round(accuracy_full * 100, 2), "%")
```


```
Accuracy (max_depth=3): 100.0 %
Accuracy (fully-grown tree): 100.0 %
```
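
Both trees reach 100% here, but the test set holds only 30 samples, so a single split may hide small differences. As a possible follow-up (my own addition, not part of the question), this sketch compares the same two settings with 5-fold cross-validation via `cross_val_score`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

results = {}
for name, depth in [("max_depth=3", 3), ("fully-grown", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Mean accuracy over 5 folds, so every sample is used for testing once
    results[name] = cross_val_score(tree, X, y, cv=5).mean()

for name, score in results.items():
    print(f"Mean CV accuracy ({name}): {round(score * 100, 2)} %")
```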


## 8. Write a Python program to:
- Load the Boston Housing Dataset
- Train a Decision Tree Regressor
- Print the Mean Squared Error (MSE) and feature importances

Since `load_boston` was removed from scikit-learn (in version 1.2) for ethical reasons, it can’t be used in recent versions, so I am using `fetch_california_housing` instead.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset (replacement for Boston Housing)
data = fetch_california_housing()
X = data.data
y = data.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", round(mse, 3))

# Show how much each feature contributed to the splits
print("\nFeature Importances:")
for name, importance in zip(data.feature_names, regressor.feature_importances_):
    print(f"{name}: {round(importance, 3)}")
```


```
Mean Squared Error (MSE): 0.495

Feature Importances:
MedInc: 0.529
HouseAge: 0.052
AveRooms: 0.053
AveBedrms: 0.029
Population: 0.031
AveOccup: 0.131
Latitude: 0.094
Longitude: 0.083
```


## 9. Write a Python program to:
- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
- Print the best parameters and the resulting model accuracy

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate values for the two hyperparameters
dtree = DecisionTreeClassifier(random_state=42)
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10, 15]
}

# 5-fold cross-validated grid search on the training data only
grid_search = GridSearchCV(estimator=dtree, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Evaluate the best model on the held-out test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with Best Parameters:", round(accuracy * 100, 2), "%")
```


```
Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy with Best Parameters: 100.0 %
```


## 10. Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.

Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance

And describe what business value this model could provide in the real-world setting.

If I were working on a healthcare dataset with mixed data types and missing values, here’s how I’d handle it:

#### Handle Missing Values:
- **Identify missing data** in both numeric and categorical features.  
- **Numeric columns:** Impute missing values with **mean** or **median**. Median is better if there are outliers.  
- **Categorical columns:** Impute missing values with **mode** (most frequent category).  
- Use libraries like `pandas` or `sklearn.impute.SimpleImputer` for this.

```python
from sklearn.impute import SimpleImputer
numeric_imputer = SimpleImputer(strategy='median')
categorical_imputer = SimpleImputer(strategy='most_frequent')
```

#### Encode Categorical Features:

- Convert categorical variables to numbers, because scikit-learn's decision trees need numeric input.
- Ordinal categories can be mapped to ordered integers, but for non-ordinal (nominal) categories, use One-Hot Encoding.
- Use `pd.get_dummies()` or `sklearn.preprocessing.OneHotEncoder`.
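
One way to combine the imputation and encoding steps is a `ColumnTransformer`, sketched below. The patient table, its column names, and its values are all hypothetical; a real dataset would need its own column lists.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical patient records (invented for illustration)
df = pd.DataFrame({
    "age": [34, np.nan, 58, 45],
    "blood_pressure": [120, 135, np.nan, 128],
    "smoker": ["yes", "no", np.nan, "no"],
    "blood_type": ["A", "B", "O", np.nan],
})
numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "blood_type"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Categorical columns: fill gaps with the mode, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X_encoded = preprocess.fit_transform(df)
print("Encoded shape:", X_encoded.shape)
```

Fitting the transformer on the training split only, and then applying it to the test split, avoids leaking test-set statistics into the imputation.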

#### Train a Decision Tree Model:

- Split the dataset into training and testing sets.
- Initialize the Decision Tree (DecisionTreeClassifier).
- Fit the model on the training data.

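```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# X, y: the imputed, encoded features and the disease labels from the steps above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
```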

#### Tune Hyperparameters:

- Use GridSearchCV or RandomizedSearchCV to find the best parameters like:
    - `max_depth`
    - `min_samples_split`
    - `min_samples_leaf`
    - `criterion` (Gini or Entropy)

- Tuning these with cross-validation helps control overfitting and improves generalization.

```python
from sklearn.model_selection import GridSearchCV
param_grid = {
    'max_depth': [3,5,7,None],
    'min_samples_split': [2,5,10],
    'criterion': ['gini', 'entropy']
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
```

#### Evaluate Performance:

- Check model accuracy on test data.
- Also use metrics like:
    - Precision & Recall (recall matters most in healthcare, because a false negative means a missed diagnosis)
    - F1-score
    - Confusion matrix for detailed insight.
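
To show why accuracy alone can be misleading, here is a small sketch on made-up labels (1 = has the disease, 0 = healthy). The ten labels and predictions are hypothetical and exist only to show how the metrics relate to the confusion matrix.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions for 10 patients
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
acc = accuracy_score(y_true, y_hat)
precision = precision_score(y_true, y_hat)   # tp / (tp + fp)
recall = recall_score(y_true, y_hat)         # tp / (tp + fn)

print("False negatives (missed patients):", fn)
print("Accuracy:", acc, "Precision:", precision, "Recall:", recall)
```

Here the accuracy is 80%, yet one of the four sick patients is missed, which the recall of 0.75 makes visible.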

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
y_pred = best_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```


