# Decision Trees in Machine Learning

## 1. What is a Decision Tree, and how does it work in the context of classification?

A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. In classification, it assigns a class label by recursively dividing the dataset into subsets using a decision rule at each node. At each node, the algorithm selects the feature and split point that best separate the classes, using a criterion such as Gini Impurity or Information Gain. The process continues until a stopping condition is met (for example, a maximum depth or a minimum number of samples per node) or until every leaf is pure, meaning all of its samples belong to one class. A new sample is classified by following the rules from the root down to a leaf and taking the majority class of that leaf.
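
As an illustration of the recursive splitting described above, the following sketch fits a shallow tree on scikit-learn's bundled Iris data and prints the learned decision rules as text. The depth limit of 2 and `random_state=0` are arbitrary choices made here for readability, not values taken from the answer above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the Iris dataset that ships with scikit-learn
iris = load_iris()

# Assumption: a shallow tree (max_depth=2) keeps the printed rules short
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is one decision rule; leaves show the predicted class
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printed rules from top to bottom mirrors how a sample travels from the root to a leaf.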


## 2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

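- **Gini Impurity** measures how often a randomly chosen sample from a node would be mislabeled if it were labeled at random according to the node's class distribution. It is 0 for a perfectly pure node (all samples belong to the same class) and reaches its maximum of 1 - 1/k when k classes are equally represented. That maximum is 0.5 for two classes, and it approaches but never reaches 1 as the number of classes grows.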

- **Entropy** is another measure of impurity, taken from information theory. It quantifies the disorder or randomness in a node's class distribution. It is 0 for a perfectly pure node and reaches its maximum of log2(k) when k classes are equally represented, so it ranges from 0 to 1 only in the two-class case.

Both measures influence the splits in the tree: lower values indicate purer nodes. At each node, the algorithm chooses the feature and threshold whose split gives the lowest weighted average impurity across the resulting child nodes.
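
The two measures can be computed directly from a list of class labels. The sketch below uses only the standard library; the function names `gini` and `entropy` and the toy label lists are illustrative assumptions, not part of scikit-learn.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())

# Hypothetical toy nodes used only for illustration
pure = ["a", "a", "a", "a"]
two_balanced = ["a", "a", "b", "b"]
three_balanced = ["a", "b", "c"]

for name, node in [("pure", pure),
                   ("two balanced", two_balanced),
                   ("three balanced", three_balanced)]:
    print(f"{name}: gini={gini(node):.3f}, entropy={entropy(node):.3f}")
# pure: gini=0.000, entropy=0.000
# two balanced: gini=0.500, entropy=1.000
# three balanced: gini=0.667, entropy=1.585
```

The three-class case shows that entropy can exceed 1 and that Gini stays below 1.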


## 3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

- **Pre-Pruning** stops the tree from growing too deep during training by imposing limits such as `max_depth` or `min_samples_split`. This limits overfitting by keeping the tree from modeling noise in the training data. A practical advantage of pre-pruning is that it controls the complexity of the tree from the start, so the tree is smaller and faster to train.

- **Post-Pruning** trims branches after the tree has been fully grown. The idea is to remove parts of the tree that have little predictive power. A practical advantage of post-pruning is that the tree can first capture complex patterns, and only the branches that do not help generalization are removed afterwards, so useful deeper splits are not cut off prematurely.
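
The two approaches can be compared side by side in scikit-learn, which exposes post-pruning through minimal cost-complexity pruning (the `ccp_alpha` parameter). This is a minimal sketch: picking the middle value of the pruning path as `alpha` is an arbitrary assumption for illustration, and in practice the value would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: cap the depth while the tree is being grown
pre_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute its pruning path
full = DecisionTreeClassifier(random_state=42)
full.fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Assumption: take the middle alpha of the path purely for illustration;
# clip at 0.0 because ccp_alpha must be non-negative
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full),
                    ("pre-pruned", pre_pruned),
                    ("post-pruned", post_pruned)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"test accuracy={model.score(X_test, y_test):.2f}")
```

Comparing the leaf counts shows how much each strategy shrinks the tree relative to the fully grown one.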


## 4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Information Gain measures how much uncertainty (entropy) is reduced when the data is split on a particular feature. It is the entropy of the parent node minus the weighted average entropy of the child nodes produced by the split, where each child is weighted by its share of the parent's samples. The feature with the highest Information Gain is selected for the split because it reduces uncertainty the most. This steers the tree toward splits that produce purer child nodes.
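
The calculation can be sketched in a few lines of standard-library Python. The helper names `entropy` and `information_gain` and the toy "yes"/"no" labels are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child) for child in children
    )
    return entropy(parent) - weighted_child_entropy

# Hypothetical parent node with 4 "yes" and 4 "no" samples
parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes perfectly
perfect_split = [["yes"] * 4, ["no"] * 4]
# A split that leaves both children as mixed as the parent
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
print(f"perfect split gain: {gain_perfect:.2f}")  # 1.00
print(f"useless split gain: {gain_useless:.2f}")  # 0.00
```

A tree builder using this criterion would prefer the first split, since it removes all of the parent's uncertainty.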


## 5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

- **Real-World Applications**:
  - **Healthcare**: Decision Trees are used for diagnosing diseases based on patient data.
  - **Finance**: They help in credit scoring and risk assessment.
  - **Marketing**: Decision Trees can predict customer behavior and preferences.

- **Advantages**:
  - Easy to interpret and visualize.
  - Handle both categorical and numerical data.
  - Can be used for both classification and regression.

- **Limitations**:
  - Prone to overfitting, especially when the tree is grown deep without pruning.
  - Can be unstable (small changes in the data can lead to different tree structures).


## 6. Write a Python program to:
 - Load the Iris Dataset
 - Train a Decision Tree Classifier using the Gini criterion
 - Print the model’s accuracy and feature importances

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Evaluate the model
accuracy = clf.score(X_test, y_test)
print(f"Model Accuracy: {accuracy:.2f}")

# Print feature importances
print("Feature Importances:", clf.feature_importances_)


Model Accuracy: 1.00
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]


## 7. Write a Python program to:
 - Load the Iris Dataset
 - Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree

In [2]:
# Train a Decision Tree with max_depth=3
clf_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)

# Train a fully-grown Decision Tree
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)

# Compare accuracies
accuracy_pruned = clf_pruned.score(X_test, y_test)
accuracy_full = clf_full.score(X_test, y_test)

print(f"Accuracy with max_depth=3: {accuracy_pruned:.2f}")
print(f"Accuracy with fully-grown tree: {accuracy_full:.2f}")


Accuracy with max_depth=3: 1.00
Accuracy with fully-grown tree: 1.00


## 8. Write a Python program to:
 - Load the Boston Housing Dataset
 - Train a Decision Tree Regressor
 - Print the Mean Squared Error (MSE) and feature importances

In [4]:
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the Boston Housing dataset
boston = load_boston()
X_boston = boston.data
y_boston = boston.target

# Split the dataset into training and testing sets
X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_test_split(X_boston, y_boston, test_size=0.3, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train_boston, y_train_boston)

# Make predictions
y_pred = regressor.predict(X_test_boston)

# Calculate MSE
mse = mean_squared_error(y_test_boston, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Print feature importances
print("Feature Importances:", regressor.feature_importances_)


ImportError: 
`load_boston` has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

    import pandas as pd
    import numpy as np

    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()

for the California housing dataset and::

    from sklearn.datasets import fetch_openml
    housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle.
"Racist data destruction?"
<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>

[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>


### Solution with the California Housing dataset

In [5]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the California Housing dataset
california = fetch_california_housing()
X_california = california.data
y_california = california.target

# Split the dataset into training and testing sets
X_train_california, X_test_california, y_train_california, y_test_california = train_test_split(X_california, y_california, test_size=0.3, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train_california, y_train_california)

# Make predictions
y_pred = regressor.predict(X_test_california)

# Calculate MSE
mse = mean_squared_error(y_test_california, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Print feature importances
print("Feature Importances:", regressor.feature_importances_)


Mean Squared Error: 0.53
Feature Importances: [0.52345628 0.05213495 0.04941775 0.02497426 0.03220553 0.13901245
 0.08999238 0.08880639]


## 9. Write a Python program to:
 - Load the Iris Dataset
 - Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
 - Print the best parameters and the resulting model accuracy

In [6]:
from sklearn.model_selection import GridSearchCV

# Parameters for GridSearchCV
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}

# Initialize the Decision Tree Classifier
clf_grid = DecisionTreeClassifier(random_state=42)

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(clf_grid, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters from the grid search
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")

# Best model and its accuracy
best_model = grid_search.best_estimator_
accuracy = best_model.score(X_test, y_test)
print(f"Accuracy with best parameters: {accuracy:.2f}")


Best Parameters: {'max_depth': 5, 'min_samples_split': 10}
Accuracy with best parameters: 1.00
