# Decision Trees Tutorial - Student Template

## Introduction

Decision Trees are a popular machine learning algorithm that can be used for both classification and regression tasks. They work by splitting the data into subsets based on feature values, creating a tree-like structure of decisions.
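
To make the splitting idea concrete, here is a minimal pure-Python sketch of how a regression tree might choose a threshold for a single feature: it tries the midpoint between each pair of adjacent sorted values and keeps the one that gives the lowest weighted variance in the two resulting subsets. The helper names (`variance`, `best_split`) and the toy data are illustrative assumptions; scikit-learn's actual implementation is far more optimized.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted_variance) of the best split on one feature."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        # Weight each side's variance by the fraction of samples it holds
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy data: the target jumps once the feature exceeds 3
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]
threshold, score = best_split(feature, target)
print(threshold, round(score, 4))
```

A full tree repeats this search over every feature, applies the best split, and then recurses into each subset until a stopping rule (such as a maximum depth) is reached.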

### Key Concepts:

1. **Root Node**: The topmost node that represents the entire dataset
2. **Internal Nodes**: Nodes that represent decisions based on feature values
3. **Leaf Nodes**: Terminal nodes that represent the final output/prediction
4. **Branches**: Paths from one node to another
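
The four terms above can be sketched as a small hand-built tree. This is only an illustration: the `Node` class, the `predict` helper, and every threshold and leaf value are made up, and scikit-learn stores its trees in a different (array-based) form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a hand-built regression tree (illustrative only)."""
    feature: Optional[str] = None      # feature tested at an internal node
    threshold: Optional[float] = None  # go left if value <= threshold
    left: Optional["Node"] = None      # branch taken when the test is true
    right: Optional["Node"] = None     # branch taken when the test is false
    value: Optional[float] = None      # prediction stored at a leaf node

    def is_leaf(self):
        return self.value is not None


def predict(node, sample):
    """Follow branches from the root until a leaf is reached."""
    while not node.is_leaf():
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.value


# The root tests MedInc, one internal node tests HouseAge, and three leaves hold predictions
root = Node(
    feature="MedInc", threshold=5.0,
    left=Node(feature="HouseAge", threshold=20.0,
              left=Node(value=1.2), right=Node(value=1.8)),
    right=Node(value=3.5),
)

print(predict(root, {"MedInc": 3.0, "HouseAge": 35.0}))  # 1.8
print(predict(root, {"MedInc": 8.0, "HouseAge": 10.0}))  # 3.5
```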

### Advantages of Decision Trees:
- Easy to understand and interpret
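- Can handle both numerical and categorical data in principle (scikit-learn's implementation expects categorical features to be numerically encoded first)
- Require little data preprocessing (no feature scaling or normalization is needed)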
- Non-parametric method (no assumptions about data distribution)

### Disadvantages of Decision Trees:
- Prone to overfitting, especially with deep trees
- Can be unstable (small changes in data can lead to different trees)
- Impurity-based splitting can be biased toward categorical features with many levels

In this tutorial, we'll explore how to implement decision trees using scikit-learn with a real-world dataset.

In [None]:
# Import necessary libraries
# TODO: Import numpy, pandas, matplotlib, seaborn, and sklearn modules
# Hint: You'll need datasets, train_test_split, DecisionTreeRegressor, DecisionTreeClassifier,
#       mean_squared_error, r2_score, accuracy_score, classification_report, and plot_tree

# Your code here:

## Dataset: California Housing

We'll use the California Housing dataset, which is derived from the 1990 U.S. census and describes housing in California block groups (referred to as districts below). It has 20,640 samples and 8 numeric features, which makes it large enough to show overfitting clearly while still training in seconds.

### Features:
1. **MedInc**: Median income in block group
2. **HouseAge**: Median house age in block group
3. **AveRooms**: Average number of rooms per household
4. **AveBedrms**: Average number of bedrooms per household
5. **Population**: Block group population
6. **AveOccup**: Average number of household members
7. **Latitude**: Block group latitude
8. **Longitude**: Block group longitude

### Target:
**MedHouseVal**: Median house value for California districts, expressed in units of $100,000 (a value of 2.5 means $250,000)

Let's load the dataset and explore its structure.

In [None]:
# Load the California Housing dataset
# TODO: Use fetch_california_housing() to load the dataset
# TODO: Create a DataFrame with the feature data and add the target variable
# Hint: Use pd.DataFrame() and add the target as a new column

# Your code here:

In [None]:
# Display basic statistics of the dataset
# TODO: Use the describe() method to show statistics

# Your code here:

## Data Visualization

Let's visualize the distribution of our target variable and some key features to better understand the data.

In [None]:
# Plot the distribution of the target variable (Median House Value)
# TODO: Create a histogram of the MedHouseVal column
# Hint: Use plt.hist() and add appropriate labels and title

# Your code here:

In [None]:
# Plot correlation matrix to understand relationships between features
# TODO: Create a heatmap of the correlation matrix
# Hint: Use sns.heatmap() with the correlation matrix from df.corr()

# Your code here:

## Preparing Data for Decision Tree

Before training our decision tree, we need to split our data into training and testing sets. This allows us to evaluate how well our model generalizes to unseen data.
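
As a minimal sketch of what such a split looks like, the snippet below applies `train_test_split` to synthetic stand-in data rather than the housing data. The array names (`X_demo`, `y_demo`, `X_tr`, and so on) and the data-generating formula are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 3 features (not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

# Hold out 20% of the rows; random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (80, 3) (20, 3)
```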

In [None]:
# Separate features (X) and target variable (y)
# TODO: Create X (features) by dropping the target column
# TODO: Create y (target) with just the target column
# TODO: Split the data using train_test_split with test_size=0.2 and random_state=42

# Your code here:

## Training a Decision Tree Regressor

Now we'll train a Decision Tree Regressor to predict house values. We'll use scikit-learn's `DecisionTreeRegressor` class.
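
One property worth seeing before you train on the housing data is that a regression tree predicts a constant value per leaf. The sketch below fits a shallow tree to a synthetic noisy sine curve (an assumed toy dataset, not the housing data) and counts the distinct values it can output.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D data: a noisy sine curve (not the housing data)
rng = np.random.default_rng(0)
X_demo = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=200)

# A depth-2 tree has at most 4 leaves, so it can output at most 4 distinct values
tree_demo = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_demo.fit(X_demo, y_demo)
preds = tree_demo.predict(X_demo)

print("leaves:", tree_demo.get_n_leaves())
print("distinct predictions:", len(np.unique(preds)))
```

Increasing `max_depth` adds more leaves and therefore a finer step function, which is where the risk of overfitting comes from.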

In [None]:
# Create a Decision Tree Regressor model
# TODO: Create a DecisionTreeRegressor with max_depth=10 and random_state=42
# TODO: Fit the model on the training data

# Your code here:

## Making Predictions

Let's use our trained model to make predictions on the test set and evaluate its performance.
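
For reference, here is a small sketch of how the usual regression metrics (MSE, RMSE, and R²) are defined, computed by hand with NumPy and checked against scikit-learn's `mean_squared_error` and `r2_score`. The numbers in `y_true` and `y_hat` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only
y_true = np.array([3.0, 2.0, 4.0, 5.0])
y_hat = np.array([2.5, 2.0, 4.5, 4.0])

mse = np.mean((y_true - y_hat) ** 2)             # average squared error
rmse = np.sqrt(mse)                              # same units as the target
ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
# The hand-computed values should agree with scikit-learn's functions
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```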

In [None]:
# Make predictions on the test set
# TODO: Use the predict method to make predictions on X_test
# TODO: Calculate MSE, RMSE, and R² score using the appropriate sklearn functions

# Your code here:

## Visualizing Predictions vs Actual Values

Let's visualize how well our model's predictions match the actual values.

In [None]:
# Create a scatter plot of actual vs predicted values
# TODO: Create a scatter plot with y_test on x-axis and y_pred on y-axis
# TODO: Add a diagonal reference line
# TODO: Add appropriate labels and title

# Your code here:

In [None]:
# Plot residuals
# TODO: Calculate residuals (y_test - y_pred)
# TODO: Create a scatter plot of predicted values vs residuals
# TODO: Add a horizontal line at y=0
# TODO: Add appropriate labels and title

# Your code here:

## Feature Importance

One of the advantages of decision trees is that they provide insight into which features are most important for making predictions.
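
The sketch below shows what the `feature_importances_` attribute looks like on synthetic data where the answer is known in advance: the target depends strongly on the first feature, weakly on the second, and not at all on the third. The feature names `f0`, `f1`, `f2` and the data-generating formula are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends strongly on feature 0, weakly on feature 1,
# and not at all on feature 2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

tree_demo = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_demo, y_demo)

# Importances are non-negative and sum to 1
importances = tree_demo.feature_importances_
for name, imp in sorted(zip(["f0", "f1", "f2"], importances), key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")
```

These are impurity-based importances: each value is the share of the total impurity reduction contributed by splits on that feature, so they describe the fitted tree rather than any causal effect.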

In [None]:
# Get feature importances from the decision tree
# TODO: Extract feature importances from the trained model
# TODO: Create a DataFrame with features and their importances
# TODO: Sort by importance (descending)
# TODO: Print the feature importances
# TODO: Create a bar plot of feature importances

# Your code here:

## Visualizing the Decision Tree

Let's visualize a simplified version of our decision tree to understand how it makes decisions.
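
Besides a plotted diagram, scikit-learn can also print a tree's rules as plain text with `export_text`, which is handy when a figure is too wide to read. The sketch below uses synthetic data with two assumed feature names, `a` and `b`, where the target depends only on `a`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (not the housing data): the target depends only on "a"
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.where(X_demo[:, 0] > 5, 3.0, 1.0)

tree_demo = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_demo, y_demo)

# export_text returns the learned rules as an indented string
rules = export_text(tree_demo, feature_names=["a", "b"])
print(rules)
```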

In [None]:
# Create a simpler decision tree for visualization (limiting depth to 3)
# TODO: Create a DecisionTreeRegressor with max_depth=3 and random_state=42
# TODO: Fit the model on training data
# TODO: Use plot_tree to visualize the tree
# Hint: Use feature_names parameter to label the features

# Your code here:

## Hyperparameter Tuning

Decision trees have several hyperparameters that can be tuned to improve performance. Let's experiment with different values for `max_depth`.
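
To preview the pattern you should expect, the following sketch sweeps a few `max_depth` values on a synthetic noisy sine curve (an assumed toy dataset, not the housing data) and records train and test R² for each.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D data with noticeable noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(400, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

results = []
for depth in (1, 3, 5, 10, 20):
    model = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    # .score() returns R² for regressors
    results.append((depth, model.score(X_tr, y_tr), model.score(X_te, y_te)))

for depth, train_r2, test_r2 in results:
    print(f"max_depth={depth:>2}  train R²={train_r2:.3f}  test R²={test_r2:.3f}")
```

Training R² keeps rising with depth, while test R² typically peaks and then declines as the tree starts fitting noise. Note that choosing the depth by test score uses the test set for model selection; cross-validation on the training data is a cleaner alternative.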

In [None]:
# Test different max_depth values
# TODO: Create a range of max_depth values from 1 to 20
# TODO: For each depth, train a model and calculate train/test R² scores
# TODO: Store the scores in lists
# TODO: Plot the results
# TODO: Find and print the best depth and score

# Your code here:

## Comparison with a Baseline Model

Let's compare our decision tree with a simple baseline model (mean predictor) to see how much improvement we get.
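
As a small worked sketch of the idea, the snippet below builds a mean predictor on made-up numbers and computes an improvement percentage against a hypothetical model MSE of 0.5. None of these values come from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up target values for illustration only
y_train_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_test_demo = np.array([2.0, 3.0, 5.0])

# The baseline ignores the features and always predicts the training mean
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())

baseline_mse = mean_squared_error(y_test_demo, baseline_pred)
baseline_r2 = r2_score(y_test_demo, baseline_pred)
print(baseline_mse, baseline_r2)

# Relative improvement of a hypothetical model with MSE 0.5 over the baseline
model_mse = 0.5
improvement_pct = (baseline_mse - model_mse) / baseline_mse * 100
print(f"{improvement_pct:.1f}% lower MSE than the baseline")
```

The baseline's test R² is zero only when the training mean equals the test mean; otherwise it is negative, as it is here.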

In [None]:
# Baseline model: always predict the mean
# TODO: Create predictions that are all equal to the mean of y_train
# TODO: Calculate baseline MSE, RMSE, and R²
# TODO: Compare with your decision tree performance
# TODO: Calculate the improvement percentage

# Your code here:

## Classification Example

Decision trees can also be used for classification tasks. Let's create a binary classification problem by labeling each district as "expensive" or "affordable" according to whether its median house value is above the dataset-wide median.
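
Here is a minimal sketch of median thresholding on a handful of made-up values (in units of $100,000). The array `values` and the variable names are assumptions for illustration.

```python
import numpy as np

# Made-up house values in units of $100,000
values = np.array([0.8, 1.5, 2.0, 2.0, 3.2, 4.5])

median_value = np.median(values)
# 1 = expensive (strictly above the median), 0 = affordable (at or below it)
is_expensive = (values > median_value).astype(int)

print(median_value)         # 2.0
print(is_expensive)         # [0 0 0 0 1 1]
print(is_expensive.mean())  # fraction labeled expensive
```

Because values equal to the median fall into the "affordable" class, ties can leave the two classes slightly unbalanced, so it is worth checking the class counts after creating the target.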

In [None]:
# Create a binary classification target
# TODO: Calculate the median of MedHouseVal
# TODO: Create a binary target where 1 = expensive (above median) and 0 = affordable (below or equal to median)
# TODO: Add the new column to the DataFrame
# TODO: Print statistics about the classification

# Your code here:

In [None]:
# Split data for classification
# TODO: Create X_class (features) and y_class (binary target)
# TODO: Split the data using train_test_split
# TODO: Create and train a DecisionTreeClassifier
# TODO: Make predictions
# TODO: Calculate and print accuracy
# TODO: Print classification report

# Your code here:

## Conclusion

In this tutorial, we've explored how to use decision trees for both regression and classification tasks:

### Key Takeaways:

1. **Decision trees are versatile**: They can be used for both regression (predicting continuous values) and classification (predicting categories)

2. **A held-out test set is essential**: We split our data into training and testing sets so that performance is measured on data the model has not seen

3. **Hyperparameter tuning matters**: We experimented with different `max_depth` values to find a balance between underfitting (tree too shallow) and overfitting (tree too deep)

4. **Model evaluation is crucial**: We used multiple metrics (MSE, RMSE, R² for regression; accuracy for classification) to assess our models

5. **Interpretability is a strength**: Decision trees provide feature importances and can be visualized to understand how decisions are made

### Next Steps:

1. Try other hyperparameters like `min_samples_split`, `min_samples_leaf`, and `max_features`
2. Experiment with ensemble methods like Random Forest or Gradient Boosting
3. Apply decision trees to other datasets
4. Handle overfitting with techniques like pruning or setting constraints
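
As a starting point for the pruning suggestion, the sketch below compares an unconstrained tree with one that uses scikit-learn's cost-complexity pruning parameter `ccp_alpha`. The data are synthetic and the value `0.01` is an arbitrary choice for illustration, not a recommended setting.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine curve (not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=300)

# An unconstrained tree keeps splitting until its leaves are pure
full_tree = DecisionTreeRegressor(random_state=42).fit(X_demo, y_demo)

# ccp_alpha > 0 prunes subtrees whose impurity reduction does not justify their size
pruned_tree = DecisionTreeRegressor(random_state=42, ccp_alpha=0.01).fit(X_demo, y_demo)

print("leaves without pruning:", full_tree.get_n_leaves())
print("leaves with ccp_alpha=0.01:", pruned_tree.get_n_leaves())
```

In practice a suitable `ccp_alpha` would be chosen by validation, in the same way as `max_depth`.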

Decision trees are a powerful and interpretable machine learning method that serves as a foundation for more complex algorithms. Understanding them is crucial for any machine learning practitioner!