# Chapter 5: Decision Trees

**Student Exercise Version**  
Complete the exercises below. The solution notebook will be provided after one week.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import imblearn

import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from imblearn.metrics import specificity_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

## 5.1 Introduction & Motivation

Welcome to an exciting hands-on exploration of decision trees! In this exercise, we'll apply the theoretical concepts you've learned to solve a real-world classification problem with practical implications.

We have already explored our first classification model (logistic regression), but the world of machine learning offers many more powerful algorithms to choose from. One of the most intuitive and interpretable of these is the **Decision Tree**.

**How Decision Trees Work:**
Decision trees mirror human decision-making by creating a hierarchical structure of questions. Starting at the **root node**, the algorithm systematically creates **internal nodes** (decision points) and **leaf nodes** (final predictions) by asking binary questions about the features in our data.

**Real-World Example:**
When predicting an animal's family, a decision tree might ask:
- Root question: "Does the animal have fur?"
  - If YES → "Does it have four legs?"
    - If YES → "Does it bark?" → Dog/Cat classification
    - If NO → "Does it swing from trees?" → Primate classification
  - If NO → "Does it have feathers?" → Bird/Reptile classification

Each **node** represents a decision point based on feature values, and each **branch** represents the outcome of that decision. This creates an easily interpretable set of rules that even non-technical stakeholders can understand and validate.
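
To make the node-and-branch picture concrete, here is a minimal sketch that fits a tree on a tiny, made-up animal table and prints the learned rules. The data, the column names (`has_fur`, `has_feathers`, `barks`), and the variable names are illustrative assumptions and are not part of the wine dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data (1 = yes, 0 = no); invented purely for illustration
animals = pd.DataFrame({
    "has_fur":      [1, 1, 1, 0, 0, 1],
    "has_feathers": [0, 0, 0, 1, 0, 0],
    "barks":        [1, 0, 1, 0, 0, 0],
})
labels = ["dog", "cat", "dog", "bird", "reptile", "cat"]

# Fit an unrestricted tree; random_state only fixes tie-breaking between splits
toy_tree = DecisionTreeClassifier(random_state=0)
toy_tree.fit(animals, labels)

# export_text renders every root-to-leaf path as indented if/else rules
rules = export_text(toy_tree, feature_names=list(animals.columns))
print(rules)

# Classify an unseen animal that has fur and barks
new_animal = pd.DataFrame([[1, 0, 1]], columns=animals.columns)
prediction = toy_tree.predict(new_animal)[0]
print(prediction)
```

Each indented line in the printed rules corresponds to one branch, and each `class:` line is a leaf node.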

**Key Advantages:**
- **Transparency**: Every prediction can be traced through a clear path of decisions
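- **Minimal preprocessing**: Splits depend only on the ordering of feature values, so no scaling or normalization is needed (scikit-learn's trees still require numeric inputs, so categorical features must be encoded first)
- **Built-in feature ranking**: Splits are made on the most informative features, and the fitted tree reports an importance score for each one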
- **Handles non-linear relationships**: Can capture complex patterns that linear models miss
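
Two of these advantages can be checked empirically. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption made for illustration, deliberately not the wine data). It fits one tree on raw features and one on standardized features, then compares their test predictions and prints the importance scores. Because standardization preserves the ordering of values, the two trees should agree on essentially every prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; sizes and seeds are arbitrary choices
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the scaler on the training split only, then train one tree per version
scaler = StandardScaler().fit(X_tr)
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_tr), y_tr)

# Share of test samples on which both trees predict the same class
agreement = np.mean(
    tree_raw.predict(X_te) == tree_scaled.predict(scaler.transform(X_te))
)
print(f"Share of identical test predictions: {agreement:.3f}")

# Impurity-based importances, listed from most to least important
importances = tree_raw.feature_importances_
for i in np.argsort(importances)[::-1]:
    print(f"feature_{i}: {importances[i]:.3f}")
```

The importance scores sum to 1, and features that are never used in a split receive a score of 0.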

## 5.2 Problem Setting: Wine Quality Assessment Challenge

The world of wine is fascinating and complex. Wine quality assessment typically requires years of training and expertise, with sommeliers spending decades honing their ability to detect subtle differences that distinguish exceptional wines from ordinary ones.

**Our Challenge:**
Today, we'll attempt to automate this expert knowledge using decision trees! Our goal is to predict wine quality based solely on its measurable chemical and physical properties. This represents a classic **multi-class classification problem** where we need to predict discrete quality ratings.

**Dataset Overview:**
- **Target Variable**: Wine quality scores ranging from 0 (lowest quality) to 10 (highest quality)
- **Features**: Chemical and physical properties measurable through laboratory analysis
- **Real-World Impact**: Such models could help wineries:
  - Standardize quality control processes
  - Identify key factors that influence wine quality
  - Make data-driven decisions about production methods
  - Provide objective quality assessments

**Why This Problem Suits Decision Trees:**
Wine quality assessment involves complex, non-linear relationships between chemical compounds. Decision trees excel at capturing these intricate patterns through their hierarchical question-answering approach, potentially mimicking how human experts mentally process multiple factors to reach quality judgments.

## 5.3 Data Exploration and Model Development

Before building our decision tree, we need to thoroughly understand our dataset. Proper data exploration is crucial for making informed decisions about feature selection, preprocessing, and model configuration.

**Step 1: Initial Data Examination**

Let's start by examining the structure and content of our wine quality dataset. Understanding the features and their potential relationships with quality will guide our modeling approach.

In [None]:
df = pd.read_csv("Wine.csv")
df.head()

##### Question 1: Examine the dataset above and identify any columns that will not contribute meaningful information to our wine quality prediction model. Provide your reasoning and implement the necessary data cleaning steps.

**Your Answer:**

In [None]:
# Your code here

##### Question 2: Create a comprehensive correlation analysis using a heatmap visualization. Based on the correlation patterns, hypothesize which variables will have the strongest positive and negative impacts on wine quality. Justify your predictions using domain knowledge about wine chemistry.

In [None]:
# Your code here

**Your Analysis:**

##### Question 3: Implement the standard machine learning workflow by creating appropriate train-test splits and building your decision tree model. Use a 70/30 split for training and testing data, and ensure you're using all relevant features identified in your data cleaning process.

In [None]:
# Your code here

##### Question 4: Test your trained model's practical applicability by predicting the quality of a specific wine sample. Consider the wine with the characteristics shown below and interpret what quality rating your model assigns to it.

**Wine Sample Analysis:**
This wine sample represents a real-world scenario where you'd use your model to assess an unknown wine. Examine how the individual feature values might contribute to the final quality prediction.

| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | Id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.9954 | 3.57 | 0.71 | 10.2 | 178 |

In [None]:
# Your code here

**Your Interpretation:**

## 5.4 Comprehensive Model Evaluation

Model performance evaluation is critical for understanding how well our decision tree generalizes to unseen data. We'll examine multiple metrics to gain a complete picture of our model's strengths and weaknesses.

Predictions alone don't tell us much about our model's reliability. We need quantitative metrics to assess performance objectively and identify areas for improvement.
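
In a multi-class setting, precision is computed per class and then combined, so scikit-learn's `precision_score` needs an explicit `average` argument. The sketch below uses a handful of made-up labels (an assumption for illustration, not model output) to show how the choice of averaging changes the reported number.

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical true and predicted quality labels, invented for illustration
y_true = [5, 5, 6, 6, 6, 7]
y_pred = [5, 6, 6, 6, 5, 7]

# Accuracy: share of predictions that match the true label
acc = accuracy_score(y_true, y_pred)

# Macro average: unweighted mean of the per-class precisions
prec_macro = precision_score(y_true, y_pred, average="macro", zero_division=0)

# Weighted average: per-class precisions weighted by each class's support
prec_weighted = precision_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy:           {acc:.3f}")
print(f"precision macro:    {prec_macro:.3f}")
print(f"precision weighted: {prec_weighted:.3f}")
```

Macro averaging treats every class equally, while weighted averaging lets frequent classes dominate. The difference matters when some quality ratings are much rarer than others.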

##### Question 5: Calculate and interpret both accuracy and precision for your model. Because precision is defined per class, state which averaging strategy you use for this multi-class problem. Evaluate whether these performance scores indicate a reliable model for wine quality prediction. Support your analysis by explaining what these metrics mean in the context of our specific problem.

In [None]:
# Your code here

**Your Analysis:**

##### Question 6: Create and analyze a confusion matrix visualization for your model. Explain how the visual patterns in this matrix confirm or contradict your accuracy and precision findings from the previous question.

In [None]:
# Your code here

**Your Visual Analysis:**

##### Question 7: Investigate an important aspect of confusion matrix interpretation in multi-class problems. Analyze the relationship between the matrix indices and actual quality ratings, and explain why understanding label encoding is crucial for proper interpretation. 

**Investigation Hint:** Use pandas methods to examine the distribution of actual and predicted values, then consult the sklearn documentation to understand how confusion matrices handle class labels.

In [None]:
# Your investigation code here

**Your Findings:**

In [None]:
# Create a properly labeled confusion matrix here

## 5.5 Advanced Topics and Model Optimization

Now that we've established baseline performance, let's explore advanced concepts to improve our decision tree model and understand important algorithmic considerations.

##### Question 8: Investigate the complexity of your decision tree by determining how many hierarchical levels (depth) it has created. Research the appropriate method to extract this information and discuss why tree depth is an important consideration in decision tree modeling.

In [None]:
# Your code here

**Your Analysis:**

##### Question 9: Optimize your model's performance by systematically testing different maximum depth values. Create a comprehensive analysis by training models with depths ranging from 1 to 20, plot the accuracy scores, and identify the optimal depth that balances model complexity with performance. Retrain your final model using this optimal depth and report the improved metrics.

In [None]:
# Your code here

**Your Optimization Results:**

##### Question 10: Implement a comprehensive overfitting detection analysis. Calculate both training and testing accuracy for all depth values tested in Question 9, then create a visualization showing both curves. Analyze the results to identify clear signs of overfitting and determine if your chosen optimal depth from Question 9 still represents the best choice when considering overfitting patterns.

In [None]:
# Your code here

**Your Overfitting Analysis:**

##### Question 11: Explore ensemble learning by implementing Random Forest, an advanced variant of decision trees. Based on the [scikit-learn Random Forest documentation](https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html), create a Random Forest classifier and evaluate its performance. Compare the accuracy and precision with your optimized single decision tree and explain why ensemble methods often outperform individual models.

In [None]:
# Your code here

**Your Ensemble Analysis:**

##### Question 12: Conduct a theoretical analysis of overfitting susceptibility by comparing Random Forest to single decision trees. Research the underlying mechanisms of Random Forest algorithms and provide a comprehensive explanation of why ensemble methods are generally more robust against overfitting. Support your analysis with specific algorithmic features that contribute to this robustness.

**Your Theoretical Analysis:**