# Decision trees

This is an exploratory exercise to allow you to learn more about decision trees and how they might be used in scikit-learn.

## Instructions:

* Go through the notebook and complete the tasks. 
* Make sure you understand the examples given. If you need help, refer to the documentation links provided or go to the discussion forum. 
* When a question allows a free-form answer (e.g. what do you observe?), create a new markdown cell below and answer the question in the notebook. 
* Save your notebooks when you are done.

Before you do the tasks below, go through the scikit-learn decision tree tutorial <a href="https://scikit-learn.org/stable/modules/tree.html">here</a>, with the classifier described <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">here</a>. The tutorial contains instructions on how to use decision trees for both classification and regression in Python. 

**Task 1:**
Using what you learnt in the scikit-learn decision tree tutorial, use decision trees for classification on the iris dataset, and for regression on the diabetes dataset (both included in `sklearn.datasets`). Your code should print the accuracy and the confusion matrix for the classification problem, and the mean squared error for the regression problem. Try comparing the results for different maximum tree depths.

Note: Split your data into 80% for training and 20% for testing.



In [5]:
# Import necessary libraries
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the decision tree classifier with different maximum depths
for depth in [2, 3, 4, 5, None]:  # Testing different max depths
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)

    # Predict on the test set
    y_pred = clf.predict(X_test)

    # Print the accuracy and confusion matrix
    print(f"Max Depth: {depth}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}\n")



Max Depth: 2
Accuracy: 0.9666666666666667
Confusion Matrix:
[[10  0  0]
 [ 0  8  1]
 [ 0  0 11]]

Max Depth: 3
Accuracy: 1.0
Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Max Depth: 4
Accuracy: 1.0
Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Max Depth: 5
Accuracy: 1.0
Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Max Depth: None
Accuracy: 1.0
Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
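
With only 30 test samples, every depth from 3 upwards reaches the same accuracy, so a single split cannot tell these settings apart. One possible way to compare depths more robustly is k-fold cross-validation over the whole dataset. The sketch below is an illustration rather than part of the task; the choice of 5 folds and the `cv_accuracy` dictionary are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset as plain arrays
X, y = datasets.load_iris(return_X_y=True)

# Mean cross-validated accuracy per maximum depth (hypothetical helper dict)
cv_accuracy = {}
for depth in [2, 3, 4, 5, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # With an integer cv and a classifier, scikit-learn uses stratified folds
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    cv_accuracy[depth] = scores.mean()
    print(f"Max Depth: {depth}  mean CV accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Each sample is used for testing exactly once, so the averaged score depends less on which 30 flowers happened to land in the test set.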



**Task 2:**
How would you avoid overfitting in decision trees? Read the decision tree classifier described <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">here</a> for help.




In [6]:
# Import necessary libraries for regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the decision tree regressor with different maximum depths
for depth in [2, 3, 4, 5, None]:  # Testing different max depths
    reg = DecisionTreeRegressor(max_depth=depth, random_state=42)
    reg.fit(X_train, y_train)

    # Predict on the test set
    y_pred = reg.predict(X_test)

    # Print the mean squared error
    print(f"Max Depth: {depth}")
    print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}\n")


Max Depth: 2
Mean Squared Error: 3866.038156768628

Max Depth: 3
Mean Squared Error: 3656.186930948001

Max Depth: 4
Mean Squared Error: 3594.089844855363

Max Depth: 5
Mean Squared Error: 3545.4104698436895

Max Depth: None
Mean Squared Error: 4986.943820224719
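
The test error falls up to depth 5 and then rises sharply for the unrestricted tree. A simple way to check whether this is overfitting is to print the training error next to the test error: a tree that memorises the training set should show a very low training error alongside a much higher test error. The sketch below assumes the same split as above; the `train_test_mse` dictionary is a name introduced only for this illustration.

```python
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same data and split as in the regression cell above
X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Maps max_depth -> (training MSE, test MSE); hypothetical helper dict
train_test_mse = {}
for depth in [2, 5, None]:
    reg = DecisionTreeRegressor(max_depth=depth, random_state=42)
    reg.fit(X_train, y_train)

    # Error on the data the tree was fitted to, and on held-out data
    train_mse = mean_squared_error(y_train, reg.predict(X_train))
    test_mse = mean_squared_error(y_test, reg.predict(X_test))
    train_test_mse[depth] = (train_mse, test_mse)

    print(f"Max Depth: {depth}  train MSE: {train_mse:.1f}  test MSE: {test_mse:.1f}")
```

A widening gap between the two columns as depth increases is the usual sign that the extra splits are fitting noise rather than structure.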



**Task 2: Avoiding overfitting in decision trees**

Overfitting occurs when the decision tree becomes too complex and captures noise in the training data, which reduces its ability to generalize to unseen data. In the regression results above, the unrestricted tree (`max_depth=None`) has a test MSE of about 4987, against about 3545 at `max_depth=5`, which is consistent with overfitting. Some strategies to avoid it:

* Limit the maximum depth: a smaller `max_depth` stops the tree from growing too complex.
* Pruning: after building the tree, remove branches that contribute little to the overall prediction. In scikit-learn this is done through minimal cost-complexity pruning, controlled by `ccp_alpha`.
* Minimum samples per split: increase the minimum number of samples required to split a node (`min_samples_split`), which prevents the tree from creating too many small nodes.
* Minimum samples per leaf: increase the minimum number of samples required at a leaf node (`min_samples_leaf`), which helps prevent overly specific decisions.
* Use cross-validation: evaluating the model on multiple splits of the data gives a more reliable basis for selecting hyperparameters such as `max_depth`.
* Use random forests: instead of a single decision tree, a random forest combines the predictions of many trees, which reduces the likelihood of overfitting.
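
Several of the strategies above can be combined in one experiment. The following is a minimal sketch, not a tuned solution: it uses `GridSearchCV` to pick `max_depth`, `min_samples_split`, `min_samples_leaf` and `ccp_alpha` by 5-fold cross-validation on the diabetes training set, then compares the tuned tree with a default `RandomForestRegressor`. The grid values, the number of folds and the number of trees are assumptions chosen for illustration.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative grid; these values are assumptions, not recommendations
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
    "ccp_alpha": [0.0, 10.0, 50.0],
}

# Cross-validate every combination on the training data only
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# By default GridSearchCV refits the best combination on the full training set
best_tree = search.best_estimator_
tuned_test_mse = mean_squared_error(y_test, best_tree.predict(X_test))

# Random forest with 100 trees for comparison
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
forest_test_mse = mean_squared_error(y_test, forest.predict(X_test))

print(f"Best parameters: {search.best_params_}")
print(f"Cross-validated MSE of best tree: {-search.best_score_:.1f}")
print(f"Test MSE of tuned tree: {tuned_test_mse:.1f}")
print(f"Test MSE of random forest: {forest_test_mse:.1f}")
```

The test set is touched only once, after the hyperparameters have been chosen, so the reported test MSE is not biased by the search itself.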